Diffusing Blame: Task-Dependent Credit Assignment
in Biologically Plausible Dual-Stream Networks

Yutaro Yamada, Luca Grillotti, Rujikorn Charakorn, Sebastian Risi, David Ha, Robert Tjarko Lange

Sakana AI
Corresponding authors: yutaro.yamada.y@gmail.com, robert@sakana.ai

Abstract

Biological neural circuits obey Dale’s principle: each neuron’s synapses are uniformly excitatory or inhibitory. Artificial networks that respect this constraint must coordinate separate excitatory and inhibitory populations, fundamentally changing how credit is assigned during learning. Several biologically plausible learning rules avoid backpropagation’s weight transport requirement, but it has been difficult to achieve strong performance under Dale’s principle beyond MNIST. Error Diffusion (ED) was originally proposed in a dual-stream excitatory/inhibitory architecture, where learning is driven by routing global error signals to all layers without transporting transposed forward weights or relying on random feedback matrices. Whether such a rule can scale under Dale’s principle across both supervised classification and reinforcement learning remains unknown. Here, we introduce modulo error routing to extend Error Diffusion beyond binary classification, and show that a dual-stream excitatory/inhibitory architecture trained with this method achieves $96.7\%$ on MNIST and establishes a $61.7\%$ baseline on CIFAR-10, demonstrating that representation learning is possible even when strictly enforcing Dale’s principle. For the classification setting, we introduce three domain-specific innovations: layer-specific sigmoid widths, batch-centered class error signals, and asymmetric initialization, and ablation analysis reveals that their relative importance reverses between MNIST and CIFAR-10, exposing task-dependent credit-assignment bottlenecks invisible to single-benchmark evaluation. In reinforcement learning, we integrate ED with Proximal Policy Optimization (PPO) and evaluate it on continuous-control tasks in Google Brax and on Craftax, an open-ended exploration task. We show that ED-PPO achieves competitive performance relative to Direct Feedback Alignment, a backpropagation-free baseline.

Introduction

In biological neural circuits, the separation of excitatory and inhibitory neuronal populations is a fundamental organizational principle, formalized as Dale’s principle: each neuron releases the same neurotransmitter at all of its synapses (Dale, 1935; Eccles, 1976; Strata and Harvey, 1999). This constraint shapes cortical computation through balanced excitation and inhibition (Markram et al., 2004). Artificial neural networks, by contrast, freely assign positive and negative weights to any connection, a simplification that enables efficient credit assignment via backpropagation but bears little resemblance to biological learning.

Training deep networks with backpropagation requires the backward pass to use exact transposes of the forward weight matrices, a requirement known as the weight transport problem, which lacks biological support (Crick, 1989; Bengio et al., 2015). Several alternatives address this concern. Feedback Alignment (FA) replaces backward weights with fixed random matrices (Lillicrap et al., 2016), Direct Feedback Alignment (DFA) routes output errors directly to each hidden layer through fixed random matrices (Nøkland, 2016), and predictive coding frameworks approximate backpropagation through local Hebbian updates (Whittington and Bogacz, 2017; Sacramento et al., 2018). While these methods avoid weight transport, none enforces Dale’s principle: all allow weights of arbitrary sign. Error Diffusion (ED), originally proposed by Kaneko (2000), is a local learning rule in which weight updates depend on presynaptic activity, a postsynaptic activation derivative, and a single global error sign (Fujita, 2026). This locality makes ED naturally compatible with biological constraints, but prior work demonstrated its effectiveness only on binary classification and on simple classification tasks such as MNIST (Fujita, 2026).

The central question we address is whether a biologically plausible architecture that enforces Dale’s principle can achieve competitive performance across both supervised classification and reinforcement learning, and what the task-dependent importance of its components reveals about credit assignment under biological constraints. We make three contributions:

1. A dual-stream excitatory/inhibitory ED architecture that maintains non-negative weights across classification (MNIST, CIFAR-10) and reinforcement learning (Brax locomotion tasks, Craftax) by extending the original ED via modulo error routing.
2. For the classification setting, three domain-specific innovations: layer-specific sigmoid widths, batch-centered class error, and asymmetric initialization, whose cross-task ablation reveals that credit-assignment bottlenecks shift qualitatively with task difficulty.
3. A cross-domain evaluation showing that ED-PPO achieves returns comparable to backpropagation-based and DFA-based PPO on continuous control tasks such as Ant, Humanoid, and HalfCheetah. On the more complex Craftax task, ED-PPO achieves performance comparable to DFA.

Figure 1: Overview of the dual-stream Error Diffusion framework. Left: The excitatory/inhibitory architecture maintains separate streams ($\mathbf{p}$, $\mathbf{n}$) with four non-negative weight matrices per layer, enforcing Dale’s principle structurally. Center: The Error Diffusion update broadcasts output error directly to hidden layers without weight transport or random feedback matrices. Right: The shared architecture is applied to classification (with domain-specific sigmoid widths, batch-centered class error, and asymmetric initialization) and reinforcement learning (with PPO integration).

Related Work

Biologically plausible learning rules.

The weight transport problem has motivated a family of alternatives to backpropagation (Crick, 1989; Bengio et al., 2015). Feedback Alignment uses fixed random backward weights (Lillicrap et al., 2016), and Liao et al. (2016) showed that approximate weight symmetry can support learning in deeper networks. DFA projects output errors directly to hidden layers (Nøkland, 2016), and Launay et al. (2020) demonstrated that DFA scales to modern architectures, including transformers and convnets. Difference target propagation computes layer-wise learning targets instead of gradients (Lee et al., 2015). Predictive coding provides a complementary framework through local Hebbian updates (Whittington and Bogacz, 2017), and dendritic cortical microcircuit models approximate backpropagation with segregated dendrites (Sacramento et al., 2018). Bartunov et al. (2018) found that biologically motivated algorithms generally struggle to scale beyond simple benchmarks, and Richards et al. (2019) argued that biological plausibility and computational performance need not be in tension if the right architectural principles are identified. Filipovich et al. (2022) demonstrated DFA on photonic hardware, showing that bioplausible rules can enable novel computing substrates.

Dale’s principle in neural networks.

Song et al. (2016) introduced a framework for training excitatory-inhibitory recurrent networks with Dale’s principle, separating synaptic magnitudes from signs by parameterizing weights as $W=W^{+}\odot M$, where $W^{+}\geq 0$ contains trainable connection magnitudes and $M$ is a fixed sign mask encoding each unit’s excitatory or inhibitory identity. Cornford et al. (2021) proposed Dale’s ANNs, feedforward architectures with separate excitatory and inhibitory populations that can learn comparably to standard ANNs by treating inhibition as a normalization mechanism. Grossberg (1987) explored competitive learning in networks with separate excitatory and inhibitory interactions. Error Diffusion (Fujita, 2026; Kaneko, 2000) is also naturally Dalean, representing positive and negative components through separate non-negative populations while avoiding both weight transport and random feedback. Building on this framework, our work extends Error Diffusion to multi-class classification and reinforcement learning.

Reinforcement learning without backpropagation.

Evolution Strategies (ES) provide a gradient-free alternative to policy gradient methods (Salimans et al., 2017), but scale poorly with parameter count. Proximal Policy Optimization (PPO) is a widely used policy gradient method (Schulman et al., 2017) that typically relies on backpropagation for gradient computation. Biologically plausible alternatives for RL credit assignment have been proposed: reward-modulated Hebbian rules can solve decision-making tasks by gating synaptic updates with a global reward signal (Pfeiffer et al., 2010), and Izhikevich (2007) showed that linking spike-timing-dependent plasticity with dopamine signaling addresses the distal reward problem. However, these methods have not been demonstrated with deep networks on continuous-control benchmarks. We replace backpropagation in PPO’s gradient computation with Error Diffusion, evaluating on Brax physics tasks (Freeman et al., 2021) and Craftax (Matthews et al., 2024).

Method

Dual-Stream Architecture

We adopt the original Error Diffusion formulation proposed by Kaneko (2000). To enforce Dale’s principle, we split each layer into an excitatory ($\mathbf{p}$) and an inhibitory ($\mathbf{n}$) stream. The forward pass for layer $i$ computes:

$$\mathbf{p}_{i}=\phi_{i}\!\left(+\mathbf{p}_{i-1}W_{pp}-\mathbf{n}_{i-1}W_{np}+\mathbf{b}_{p}\right)$$
$$\mathbf{n}_{i}=\phi_{i}\!\left(+\mathbf{n}_{i-1}W_{nn}-\mathbf{p}_{i-1}W_{pn}+\mathbf{b}_{n}\right)$$

where all weight matrices $W_{pp},W_{np},W_{nn},W_{pn}\geq 0$ element-wise, the bias parameters $\mathbf{b}_{p}$ and $\mathbf{b}_{n}$ are not necessarily non-negative, and $\phi_{i}$ is a layer-specific activation function. The negation signs in front of $W_{np}$ and $W_{pn}$ are structural (hardcoded), ensuring that cross-stream connections are inhibitory while all learnable weights remain non-negative. This dual-stream design requires four weight sub-matrices per layer, resulting in ${\sim}4\times$ more parameters than an unconstrained single-stream network of the same width (e.g., ${\sim}$32M vs ${\sim}$8M for DFA on the same architecture).
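As a minimal sketch of the forward pass above, the following NumPy snippet implements one dual-stream layer. The width-scaled sigmoid is borrowed from the classification setting as a stand-in for $\phi_{i}$, feeding the same input to both streams follows the RL description, and the names `dual_stream_layer`, `B`, `K`, and `H` are hypothetical choices made for illustration only.

```python
import numpy as np

def sigmoid(z, alpha=1.0):
    """Width-scaled sigmoid 1 / (1 + exp(-2 z / alpha))."""
    return 1.0 / (1.0 + np.exp(-2.0 * z / alpha))

def dual_stream_layer(p_prev, n_prev, W_pp, W_np, W_nn, W_pn, b_p, b_n, alpha=1.0):
    """One dual-stream layer; all four weight matrices must be non-negative."""
    for W in (W_pp, W_np, W_nn, W_pn):
        assert np.all(W >= 0.0), "Dale's principle: weights must be non-negative"
    # Same-stream input enters with a plus sign, cross-stream input with a
    # hardcoded minus sign, so inhibition never needs a negative weight.
    z_p = p_prev @ W_pp - n_prev @ W_np + b_p
    z_n = n_prev @ W_nn - p_prev @ W_pn + b_n
    return sigmoid(z_p, alpha), sigmoid(z_n, alpha)

rng = np.random.default_rng(0)
B, K, H = 4, 8, 6  # hypothetical batch size, fan-in, and layer width
x = rng.random((B, K))
weights = [rng.random((K, H)) / np.sqrt(K) for _ in range(4)]
# Both streams start from the same input (an assumption for this toy example).
p, n = dual_stream_layer(x, x, *weights, b_p=np.zeros(H), b_n=np.zeros(H))
print(p.shape, n.shape)
```

Because the minus signs live in the layer definition rather than in the parameters, any optimizer step only has to keep the four matrices non-negative for the layer to remain Dale-compliant.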

Error Diffusion Update Rule

The ED update replaces backpropagated layerwise errors with an output-space error signal that is routed directly to hidden units. While the original formulation was developed for binary classification (Kaneko, 2000; Fujita, 2026), we extend ED to multi-output prediction by assigning each hidden unit a fixed output channel. For a hidden unit $i$, we define $r(i)=i\bmod C$, where $C$ is the output dimension, and use the routed error component $s_{r(i)}$ as its learning signal. (See Figure 4 for alternative error-routing designs and the rationale for this choice.)

For a mini-batch, let $S\in\mathbb{R}^{B\times C}$ denote the output error signal, and let $M\in\{0,1\}^{H\times C}$ be the fixed routing matrix with $M_{ic}=1$ iff $r(i)=c$ . The routed hidden error is

$$R=SM^{\top},\tag{1}$$
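The routing in Eq. 1 can be sketched in a few lines of NumPy. The example below builds the fixed matrix $M$ from $r(i)=i\bmod C$ and checks that each hidden unit receives a copy of its assigned output error; the toy sizes and the helper name `routing_matrix` are hypothetical.

```python
import numpy as np

def routing_matrix(H, C):
    """M[i, c] = 1 iff hidden unit i is assigned to output channel c = i mod C."""
    M = np.zeros((H, C))
    M[np.arange(H), np.arange(H) % C] = 1.0
    return M

B, H, C = 2, 7, 3  # toy sizes chosen for illustration
S = np.arange(B * C, dtype=float).reshape(B, C)  # toy output error signal
M = routing_matrix(H, C)
R = S @ M.T  # routed hidden error, shape (B, H)

# Column i of R is a copy of column (i mod C) of S, so the matrix product
# can equivalently be written as a fancy-indexing gather.
assert np.array_equal(R, S[:, np.arange(H) % C])
print(R)
```

Since $M$ has exactly one nonzero entry per row, the product never mixes output channels; it only replicates them across hidden units.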

and the corresponding matrix of local postsynaptic drives is

$$U_{p}=\phi^{\prime}(Z_{p})\odot R,\tag{2}$$

where $Z_{p}\in\mathbb{R}^{B\times H}$ is the positive-stream preactivation matrix.

Let $A_{p}\in\mathbb{R}^{B\times K}$ denote the positive-stream presynaptic activations feeding the layer. In fully connected notation, the positive-to-positive ED update is

$$\Delta W_{pp}\propto A_{p}^{\top}U_{p}.\tag{3}$$

In the supervised image classifiers, $C=10$ and $S$ is the batch-centered one-vs-all classification error. In Craftax policy networks, $C$ is the number of discrete action logits. In MuJoCo policy networks, $C$ is the policy distribution parameter dimension for the TanhNormal policy.

Each local weight update is proportional to the presynaptic activity and to this routed postsynaptic drive, with stream-specific signs for the dual positive/negative pathways. This deterministic modulo routing provides coarse output-associated credit assignment without transporting forward weights or using random feedback matrices. Unlike DFA, whose fixed feedback matrices are random, ED uses a structured correspondence between hidden units and output dimensions.
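Putting Eqs. 1 to 3 together, one possible sketch of a single update for the positive-to-positive weights is shown below. Only the proportionality in Eq. 3 comes from the text; the learning rate `lr`, the batch averaging, the additive sign convention (error defined as target minus output), and the use of a $10^{-4}$ floor here are assumptions made for illustration, and the updates for the other three matrices are omitted because their stream-specific signs are not spelled out.

```python
import numpy as np

def sigmoid_prime(z, alpha=1.0):
    """Derivative of the width-scaled sigmoid 1 / (1 + exp(-2 z / alpha))."""
    s = 1.0 / (1.0 + np.exp(-2.0 * z / alpha))
    return (2.0 / alpha) * s * (1.0 - s)

def ed_update_w_pp(W_pp, A_p, Z_p, S, lr=0.1, alpha=1.0, floor=1e-4):
    """One ED step for the positive-to-positive weights of a hidden layer."""
    B, H = Z_p.shape
    C = S.shape[1]
    R = S[:, np.arange(H) % C]           # Eq. 1: same result as S @ M.T
    U_p = sigmoid_prime(Z_p, alpha) * R  # Eq. 2: derivative-gated routed error
    delta = A_p.T @ U_p / B              # Eq. 3, batch-averaged (assumption)
    # Assumed step rule: additive update, then clamp to keep weights non-negative.
    return np.maximum(W_pp + lr * delta, floor)

rng = np.random.default_rng(0)
B, K, H, C = 5, 4, 6, 3              # hypothetical toy sizes
W_pp = rng.random((K, H))
A_p = rng.random((B, K))             # non-negative presynaptic activations
Z_p = rng.normal(size=(B, H))        # positive-stream preactivations
S = rng.normal(size=(B, C))          # toy output error signal
W_new = ed_update_w_pp(W_pp, A_p, Z_p, S)
print(W_new.shape, float(W_new.min()))
```

Note that nothing in the update reads the forward weights of later layers or any random feedback matrix: the only inputs are local activity, the local derivative, and the routed output error.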

Classification-Specific Innovations

For the classification setting (MNIST, CIFAR-10), we introduce three domain-specific innovations that address failure modes specific to multi-class classification under Dale’s principle. These three innovations are specific to the classification setting and are not used in the RL extension.

Layer-specific sigmoid widths.

In the classification architecture, $\phi_{i}(z)=1/(1+e^{-2z/\alpha_{i}})$ with layer-type-specific sigmoid width $\alpha_{i}$. The vanishing gradient problem in sigmoid networks is well documented (Roodschild et al., 2020; Ding et al., 2018), and various parameterized activations have been proposed (Szandała, 2020; Hammad, 2024). Since the sigmoid derivative directly gates the error signal in ED (Eqs. 2 and 3), gradient attenuation is especially severe: post-hoc analysis reveals a $25\times$ decay from output to first hidden layer. Wider sigmoids (larger $\alpha_{i}$) keep the derivative from collapsing over a broader range of preactivations, preventing premature saturation. We set $\alpha=3.0$ for CIFAR-10 convolutional layers and $\alpha=6.0$ for fully connected layers, including the output layer; MNIST uses only the $\alpha=6.0$ fully connected setting.
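To make the effect of the width concrete, the sketch below evaluates the derivative $\phi^{\prime}(z)=\frac{2}{\alpha}\phi(z)\,(1-\phi(z))$, which follows directly from the definition of $\phi_{i}$. Its peak at $z=0$ is $1/(2\alpha)$, so a wider sigmoid has a lower peak but a much heavier tail. The probe values of $z$ are arbitrary choices for illustration.

```python
import math

def sigmoid(z, alpha):
    """Width-scaled sigmoid 1 / (1 + exp(-2 z / alpha))."""
    return 1.0 / (1.0 + math.exp(-2.0 * z / alpha))

def sigmoid_prime(z, alpha):
    """Derivative (2 / alpha) * phi * (1 - phi); peak value is 1 / (2 alpha)."""
    s = sigmoid(z, alpha)
    return (2.0 / alpha) * s * (1.0 - s)

# Compare the gate value at a few preactivations for the widths used in the text.
for alpha in (1.0, 3.0, 6.0):
    row = ", ".join(f"z={z}: {sigmoid_prime(z, alpha):.2e}" for z in (0.0, 2.0, 5.0))
    print(f"alpha={alpha}: {row}")
```

At moderately large preactivations the $\alpha=6.0$ gate is orders of magnitude larger than the $\alpha=1.0$ gate, which is the regime where an error signal multiplied by $\phi^{\prime}$ would otherwise be silenced.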

Batch-centered class error signals.

For the 10-way classification tasks, the ED update uses independent sigmoid output activations rather than a softmax. For a mini-batch of size $B$ , we form a one-vs-all output error $E_{b,c}=\mathbf{1}[y_{b}=c]-\hat{y}_{b,c}$ , where $\hat{y}_{b,c}$ is the positive-stream output activation for class $c$ . Since each example contributes one target class and nine non-target classes, the raw one-vs-all error can contain a strong class-wise offset, especially early in training. We therefore subtract the mini-batch mean separately for each class: $\tilde{E}_{b,c}=E_{b,c}-\frac{1}{B}\sum_{b^{\prime}=1}^{B}E_{b^{\prime},c}$ . This makes the error signal zero-mean across the batch for every class, reducing persistent suppression or excitation of class channels caused by the one-vs-all target imbalance. The centered signal is then applied directly to the output layer, with the negative stream receiving the opposite sign; hidden fully connected units and convolutional channels receive class-routed copies of the same centered error.
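The centering step can be sketched as follows. The function forms the one-vs-all error $E$ and subtracts the per-class mini-batch mean to obtain $\tilde{E}$; the toy labels, the constant predictions, and the name `batch_centered_error` are hypothetical.

```python
import numpy as np

def batch_centered_error(y, y_hat):
    """One-vs-all error with the mini-batch mean removed separately per class."""
    B, C = y_hat.shape
    one_hot = np.eye(C)[y]                        # targets 1[y_b = c]
    E = one_hot - y_hat                           # raw one-vs-all error
    return E - E.mean(axis=0, keepdims=True)      # zero mean over the batch

y = np.array([0, 2, 1, 2])                        # toy labels for a 3-class problem
y_hat = np.full((4, 3), 0.5)                      # toy sigmoid outputs
E_tilde = batch_centered_error(y, y_hat)

# Every class column now sums to zero across the batch.
assert np.allclose(E_tilde.mean(axis=0), 0.0)
print(E_tilde)
```

Without the subtraction, most entries of each class column are non-target errors of the same sign; after it, target and non-target examples push each class channel in opposite directions with zero net offset.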

Asymmetric E/I initialization.

For convolutional and fully connected layers, weight parameters are initialized from a non-negative uniform distribution in $[0,1)$ and scaled by the inverse square root of the fan-in. Hidden excitatory weights ( $W_{pp},W_{nn}$ ) are then scaled by $1.5\times$ , while inhibitory weights ( $W_{np},W_{pn}$ ) are scaled by $0.5\times$ , giving an expected E/I scale ratio of $3:1$ . The final fully connected output layer uses symmetric initialization ( $1.0\times$ for both excitatory and inhibitory weights) to avoid output saturation.
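A minimal sketch of this initialization for a fully connected layer is given below. The scale factors come from the text; the layer sizes, the dictionary layout, and the helper name `init_dual_stream` are assumptions for illustration.

```python
import numpy as np

def init_dual_stream(fan_in, fan_out, rng, exc_scale=1.5, inh_scale=0.5):
    """Non-negative uniform init scaled by 1/sqrt(fan_in), then E/I scaling."""
    def base():
        # Uniform samples in [0, 1), so every weight starts non-negative.
        return rng.random((fan_in, fan_out)) / np.sqrt(fan_in)
    return {
        "W_pp": exc_scale * base(),
        "W_nn": exc_scale * base(),
        "W_np": inh_scale * base(),
        "W_pn": inh_scale * base(),
    }

rng = np.random.default_rng(0)
hidden = init_dual_stream(256, 128, rng)                               # asymmetric
output = init_dual_stream(128, 10, rng, exc_scale=1.0, inh_scale=1.0)  # symmetric

exc = np.mean([hidden["W_pp"].mean(), hidden["W_nn"].mean()])
inh = np.mean([hidden["W_np"].mean(), hidden["W_pn"].mean()])
print(f"hidden E/I scale ratio: {exc / inh:.2f}")
```

The measured ratio fluctuates slightly around $3$ because the four matrices are sampled independently; only the expected ratio is exactly $1.5/0.5$.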

RL Extension: ED-PPO

For reinforcement learning, we integrate a Dale-constrained dual-stream Error Diffusion (ED) architecture into PPO. The policy and value networks are both dual-stream MLPs with separate positive and negative pathways and non-negative synaptic weights. Each layer computes excitatory-minus-inhibitory preactivations for the positive and negative streams, applies a width-scaled sigmoid nonlinearity, and combines the final streams as $\hat{y}=y^{+}-y^{-}$ . Both Dale implementations initialize the two streams symmetrically from the same observation, $\mathbf{p}_{0}=\mathbf{n}_{0}=x$ , after optional stream normalization, rather than splitting inputs with ReLU.

ED replaces backpropagation through hidden layers once the PPO objective supplies an output-level error signal. For vector-valued policy outputs, the error is routed to hidden units by output class/channel assignment; for the scalar value network, the error is broadcast to all hidden units. All weight matrices are initialized non-negative using absolute Gaussian samples, with separate scaling for excitatory and inhibitory pathways, and weights are clamped non-negative after optimizer updates.
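The three ingredients named here (absolute-Gaussian initialization, post-update clamping, and routing versus broadcasting) can be sketched as below. The scale values, matrix sizes, and the random perturbation standing in for an optimizer step are all hypothetical. One observation worth making explicit: broadcasting a scalar error to every hidden unit coincides with modulo routing when $C=1$, since $i\bmod 1=0$ for every unit.

```python
import numpy as np

def init_abs_gaussian(shape, scale, rng):
    """Non-negative init from absolute Gaussian samples."""
    return np.abs(rng.normal(0.0, scale, size=shape))

def clamp_nonneg(W):
    """Project weights back to the non-negative orthant after an optimizer step."""
    return np.maximum(W, 0.0)

def route_error(S, H):
    """Modulo routing; with a scalar output (C = 1) every unit gets the same error."""
    C = S.shape[1]
    return S[:, np.arange(H) % C]

rng = np.random.default_rng(0)
W_exc = init_abs_gaussian((8, 16), scale=0.1, rng=rng)    # hypothetical E scale
W_inh = init_abs_gaussian((8, 16), scale=0.05, rng=rng)   # hypothetical I scale

# Stand-in for an optimizer step that may push some weights below zero.
W_exc = clamp_nonneg(W_exc - 0.5 * rng.random((8, 16)))

value_err = rng.normal(size=(4, 1))     # scalar value-network error per sample
R_value = route_error(value_err, H=16)  # identical copy for all 16 hidden units
print(R_value.shape)
```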

Figure 1 summarizes the error diffusion training procedure and methodological contributions.

Experimental Protocol

Classification.

We evaluate six variants on MNIST and CIFAR-10: (1) the proposed ED with all three innovations, (2) a DFA baseline using fixed random feedback matrices without Dale’s constraints, (3) the seed ED (the dual-stream architecture without any of the three innovations, i.e., using uniform sigmoid width $\alpha=1.0$ , raw error signals, and symmetric initialization), and three ablations removing one innovation each: (4) no batch-centered class error, (5) symmetric init, and (6) uniform sigmoid width. Each variant is trained for 250 epochs with five random seeds per task (60 runs total). For classification, weights are clamped to a floor of $10^{-4}$ after each update to enforce Dale’s principle.

Reinforcement learning.

We compare ED-PPO against BP-PPO (standard backpropagation), DFA-PPO (Direct Feedback Alignment) (Nøkland, 2016), and ES (Salimans et al., 2017) on three Brax (Freeman et al., 2021) locomotion tasks (Ant, HalfCheetah, Humanoid). On Craftax (Matthews et al., 2024), we compare against BP-PPO and DFA-PPO. Additionally, we evaluate a non-Dalean variant of ED-PPO, in which weights are not constrained to be non-negative, and denote it as “ED-PPO (non-Dalean)”. Each algorithm is evaluated with five random seeds per environment. We report the final episode reward (the reward at the last evaluation checkpoint) averaged over seeds; statistical comparisons use Welch’s $t$-test with $\alpha=0.05$. The RL experiments use a separate training and evaluation pipeline from the classification experiments, with both pipelines sharing the same dual-stream ED architecture but differing in activation functions, normalization, and optimization procedure as described above.
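For readers who want to reproduce the flavor of these comparisons from summary statistics, the sketch below computes Welch’s $t$ statistic and the Welch-Satterthwaite degrees of freedom. It uses the HalfCheetah numbers reported in the Results for ED-PPO and BP-PPO; treating the reported $\pm$ values as sample standard deviations over five seeds is an assumption, and the $p$-value lookup (which needs a $t$-distribution CDF) is left out.

```python
import math

def welch_t(mean_a, sd_a, n_a, mean_b, sd_b, n_b):
    """Welch's t statistic and Welch-Satterthwaite degrees of freedom."""
    va = sd_a**2 / n_a  # squared standard error of group a
    vb = sd_b**2 / n_b  # squared standard error of group b
    t = (mean_a - mean_b) / math.sqrt(va + vb)
    df = (va + vb) ** 2 / (va**2 / (n_a - 1) + vb**2 / (n_b - 1))
    return t, df

# ED-PPO vs BP-PPO on HalfCheetah; the +/- values are assumed to be
# sample standard deviations over five seeds.
t, df = welch_t(5494, 691, 5, 3520, 485, 5)
print(f"t = {t:.2f}, df = {df:.2f}")
```

Unlike Student’s $t$-test, this form does not pool the two variances, which matters here because the methods differ visibly in their spread across seeds.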

Results

Classification: Main Comparison

Figure 2 summarizes accuracy across all six classification variants. The proposed ED achieves $96.7\pm 0.1\%$ on MNIST and $61.7\pm 0.7\%$ on CIFAR-10, a substantial improvement over seed ED ($50.4\pm 9.8\%$ and $11.6\pm 2.2\%$). DFA achieves higher accuracy on both tasks ($97.6\%$ and $69.1\%$) but violates Dale’s principle, requiring ${\sim}2.84$M negative weights. The accuracy gap between ED and DFA widens from $0.9$ pp on MNIST to $7.4$ pp on CIFAR-10, suggesting that the cost of maintaining non-negative weights grows with task difficulty. We note that ${\sim}62\%$ test accuracy on CIFAR-10 is still far from competitive with traditional gradient-based methods. That being said, this is the first time ED has been successfully applied to convolutional neural networks. Previously, Fujita (2026) obtained ${\sim}55.2\%$ test accuracy on CIFAR-10 using a purely feedforward MLP with flattening.

Classification: Cross-Task Ablation Reversal

In the classification setting, ablation analysis reveals that the importance hierarchy of the three innovations reverses between tasks (Figure 2, Table 1).

MNIST.

Removing layer-specific sigmoid widths is catastrophic ($96.7\%\to 25.3\%$, $\Delta=-71.4$ pp), collapsing accuracy toward chance level. Removing batch-centered class error causes only $\Delta=-0.3$ pp, and symmetric initialization has no measurable effect ($\Delta=+0.0$ pp). Gradient flow regulation is the sole bottleneck on this easier task.

CIFAR-10.

The hierarchy reverses. Removing batch-centered error is now the most destructive ablation ( $61.7\%\to 13.8\%$ , $\Delta=-47.9$ pp), causing four of five seeds to collapse. Uniform sigmoid width causes $\Delta=-15.1$ pp, and symmetric initialization causes $\Delta=-5.5$ pp. All three innovations contribute, but their relative ordering changes fundamentally.

Table 1: Ablation effect sizes ($\Delta$ = ablation accuracy minus proposed ED accuracy, in percentage points). Negative values indicate accuracy loss.

Removed Component	MNIST $\Delta$	CIFAR-10 $\Delta$
Batch-Centered Error	$-0.3$	$-47.9$
Layer-Specific Widths	$-71.4$	$-15.1$
Asymmetric Init	$+0.0$	$-5.5$
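The reversal can be read off mechanically from Table 1. The short sketch below ranks the three components by how much accuracy is lost when each is removed; the dictionary layout and the helper name `rank_by_damage` are illustrative choices, while the numbers are copied from the table.

```python
# Effect sizes copied from Table 1 (percentage points).
deltas = {
    "Batch-Centered Error": {"MNIST": -0.3, "CIFAR-10": -47.9},
    "Layer-Specific Widths": {"MNIST": -71.4, "CIFAR-10": -15.1},
    "Asymmetric Init": {"MNIST": 0.0, "CIFAR-10": -5.5},
}

def rank_by_damage(task):
    """Components ordered from most to least harmful when removed."""
    return sorted(deltas, key=lambda name: deltas[name][task])

mnist_rank = rank_by_damage("MNIST")
cifar_rank = rank_by_damage("CIFAR-10")
print("MNIST:   ", mnist_rank)
print("CIFAR-10:", cifar_rank)
```

The top two components swap places between the two tasks, while asymmetric initialization stays last in both rankings.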

Interpretation.

This reversal reflects qualitatively different credit-assignment bottlenecks. MNIST’s well-separated features allow learning even without error centering; but without wide sigmoids, the derivatives vanish entirely. On CIFAR-10, higher inter-class similarity makes the 9:1 error imbalance overwhelming: batch-centering becomes essential to prevent uniform output suppression. The reversal demonstrates that evaluating biologically plausible methods on a single benchmark may obscure critical design trade-offs.

Reinforcement Learning: Cross-Domain Generalization

Figures 3 and 5 report performance across four RL tasks. ED-PPO is strongest on HalfCheetah, where it outperforms BP-PPO ( $5494\pm 691$ vs $3520\pm 485$ ; $p<0.001$ ), ED-PPO (non-Dalean) ( $p=0.028$ ), and ES ( $p=0.003$ ), while matching DFA-PPO ( $5581\pm 359$ ). On Ant it is on par with both PPO variants ( $6891\pm 835$ ), significantly exceeding only ES ( $p=0.004$ ). On Humanoid ( $6670\pm 2592$ ) and Craftax ( $20.9\pm 2.9$ ) it trails BP-PPO and ED-PPO (non-Dalean), though the Craftax gap to the latter is not significant.

ED-PPO (non-Dalean) also performs competitively across the same four environments, on par with BP-PPO on Humanoid ( $9804\pm 1144$ vs $8478\pm 3252$ ). On Ant, ED-PPO (non-Dalean) achieves a higher mean ( $7616\pm 2031$ vs $6740\pm 1781$ ) but the difference is not significant due to high variance. On HalfCheetah, ED-PPO (non-Dalean) achieves a similar score ( $3498\pm 1703$ vs $3520\pm 485$ ). Notably, DFA-PPO on Craftax achieves only $19.8\pm 1.5$ , significantly below ED-PPO (non-Dalean) ( $p=0.010$ ), demonstrating that random feedback pathways that suffice for simpler tasks fail on complex, open-ended environments.

ED-PPO (non-Dalean) exhibits higher seed-to-seed variability than BP-PPO on Ant and HalfCheetah, suggesting that the ED learning signal can introduce additional stochasticity, although on Humanoid its variability is lower ($1144$ vs $3252$). The Craftax shortfall relative to BP-PPO ($-4.0$, $p=0.002$) and the higher variability on some environments suggest that the coarse modulo error routing of ED may be less reliable for tasks requiring fine-grained temporal credit assignment or stable convergence.

DFA Across Domains

DFA provides an informative comparison point across both classification and RL. In classification, DFA achieves the highest accuracy (97.6% MNIST, 69.1% CIFAR-10) but violates Dale’s principle. On Craftax, DFA-PPO is the weakest method ( $19.8$ vs BP-PPO $27.0$ , ED-PPO $23.0$ ), demonstrating that random feedback pathways that work for supervised learning can fail for complex RL. This cross-domain pattern—DFA strong on classification, weak on complex RL—parallels the finding that method components assume different importance under different task demands.

Post-hoc Analysis

Surrogate gradient attenuation.

Post-hoc analysis of training dynamics on CIFAR-10 (Figure 6a) reveals local surrogate gradient magnitudes dropping from $6.6\times 10^{-3}$ at the output layer to $2.7\times 10^{-4}$ at the first hidden layer, a roughly $25\times$ attenuation. This pattern is visible from epoch 3, indicating a structural property of sigmoid-gated error diffusion that motivates layer-specific widths. Per-layer learning rate multipliers ($3.0\times$ early, $0.5\times$ output) partially compensate but cannot substitute for wider sigmoid derivatives. The near-zero generalization gap ($0.98\%$) despite ${\sim}$32M nominal parameters is consistent with underfitting rather than overfitting, but the cause is ambiguous: the 37.3% of weights sitting at the floor ($10^{-4}$) reduce effective capacity well below the nominal parameter count, so the limitation may reflect constrained capacity rather than (or in addition to) credit assignment quality.

Emergent E/I balance.

Weight-level E/I ratios (Figure 6b) reveal that training drives the 3:1 asymmetric initialization toward near-balanced ratios (${\sim}1.0$) in hidden layers. A depth-dependent gradient emerges: the first layer reaches near-perfect balance (E/I = 1.03), the second becomes slightly inhibitory-dominant (0.90), and the third develops the strongest inhibitory bias (0.81). This increasing inhibition with depth is loosely consistent with biological observations that inhibitory circuitry and E/I balance vary systematically across cortical hierarchies (Markram et al., 2004), though weight-level E/I ratios are an indirect proxy for biological E/I balance, which involves cell counts, firing rates, and synaptic strengths jointly. The convergence toward balance explains the task-dependent importance of asymmetric initialization: on MNIST, balance is reached quickly regardless of initial conditions ($\Delta=+0.0$ pp); on CIFAR-10, the asymmetric head start prevents early instability ($\Delta=-5.5$ pp).

Implicit sparsity.

The non-negative weight floor induces substantial implicit sparsity: 37.3% of weights reach the floor ( $10^{-4}$ ) after training. Inhibitory (cross-stream) FC connections are pruned most aggressively (up to 68.8% at the floor), compared to 26–49% for excitatory connections. Convolutional layers are much less affected ( $<$ 1% to 18%). This asymmetric pruning suggests that the non-negative Dale-style parameterization introduces a structured capacity bottleneck that preferentially suppresses inhibitory connections in deeper layers.
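Both post-hoc statistics discussed in this section are simple to compute from a trained layer. The sketch below shows one way to measure the fraction of weights at the floor and a weight-level E/I ratio. The exact E/I definition used for Figure 6b is not stated here, so the mean-over-mean ratio is an assumption, as are the helper names and the synthetic weights standing in for a trained model.

```python
import numpy as np

FLOOR = 1e-4  # weight floor used in the classification experiments

def floor_fraction(W, floor=FLOOR, tol=1e-12):
    """Fraction of entries sitting at the non-negative floor."""
    return float(np.mean(W <= floor + tol))

def ei_ratio(W_pp, W_nn, W_np, W_pn):
    """Assumed definition: mean excitatory weight over mean inhibitory weight."""
    exc = np.mean(np.concatenate([W_pp.ravel(), W_nn.ravel()]))
    inh = np.mean(np.concatenate([W_np.ravel(), W_pn.ravel()]))
    return float(exc / inh)

rng = np.random.default_rng(0)
# Synthetic stand-in for trained weights: Gaussian values clamped at the floor.
W = np.maximum(rng.normal(0.0, 1.0, size=(100, 100)), FLOOR)
print(f"fraction at floor: {floor_fraction(W):.3f}")
```

Applying `floor_fraction` separately to the same-stream and cross-stream matrices of each layer would give the kind of per-pathway breakdown described above.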

Discussion

Summary.

The central finding of this work is that the dual-stream ED architecture is viable across both classification and reinforcement learning, but the auxiliary mechanisms required for competitive performance are domain-dependent. In classification, batch-centered class error and layer-specific sigmoid widths are critical stabilizers whose relative importance reverses between MNIST and CIFAR-10. In RL, the same dual-stream architecture with ReLU activations and RMS normalization, without sigmoid widths or batch-centering, achieves competitive performance on locomotion tasks, showing comparable but variable rewards on Ant, Humanoid, and HalfCheetah.

Implications for biological plausibility research.

Single-benchmark evaluations can produce misleading conclusions about which architectural features matter. The ablation reversal between MNIST and CIFAR-10 demonstrates that a component critical for one task can be negligible for another, and the DFA cross-domain pattern (strong on classification, weak on complex RL) reinforces this point. The emergent E/I balance convergence, from 3:1 toward ${\sim}$1:1 with a depth-dependent inhibitory gradient, provides a connection to biological self-organization. Cortical circuits are known to develop balanced E/I ratios during maturation through intrinsic and synaptic homeostatic mechanisms (Markram et al., 2004; Turrigiano, 2011), and our results suggest that a similar homeostatic process can emerge from gradient-driven learning under Dale’s principle constraints, without explicit balance-enforcing mechanisms. The success of ED-PPO on locomotion tasks despite using ReLU rather than sigmoid activations raises an interesting question about the relationship between activation function choice and credit assignment quality. In classification, the sigmoid derivative explicitly gates the error signal, making the width parameter critical. In RL, ReLU’s unbounded positive derivative may provide a natural gradient pathway that avoids the attenuation problem without requiring layer-specific tuning. This activation-dependent difference in error flow may partially explain why the classification-specific innovations are unnecessary in the RL setting.

Limitations.

The classification innovations (sigmoid widths, batch-centered class error, asymmetric init) are specific to the classification setting; we do not claim they transfer to RL. ED-PPO shows higher variability than BP-PPO in some environments (e.g., HalfCheetah), though not uniformly, suggesting that the coarse error routing can introduce stochasticity that may limit reliability. The Craftax shortfall ($-4.0$ reward vs BP-PPO) indicates that ED’s modulo-routed credit assignment may be insufficient for tasks requiring fine-grained temporal reasoning. The accuracy gap between ED and DFA on classification ($-0.9$ to $-7.4$ pp) quantifies the current cost of enforcing Dale’s principle. On Craftax, DFA underperforms ED-PPO, suggesting that the cost of Dale’s principle is task-dependent. Finally, the dual-stream architecture’s ${\sim}4\times$ parameter overhead may itself limit effective capacity, particularly on classification where 37.3% of weights are pruned to the floor. From the perspective of adaptive computation, the emergent behaviors we observe (E/I balance self-organization, asymmetric pathway pruning, and task-dependent component criticality) suggest that biologically constrained architectures may develop internal regulatory mechanisms analogous to those found in living systems.

Future Work.

Several directions follow naturally from the results presented here. First, Dale’s principle may offer hardware advantages by constraining synaptic weights themselves to be non-negative. In our Error Diffusion formulation, negative contributions are still required, but their sign is determined by fixed excitatory/inhibitory pathway structure rather than by arbitrary signed weights. Thus, the benefit is not the elimination of subtraction or differential circuitry altogether, but the replacement of unconstrained signed synapses with non-negative synaptic magnitudes and fixed-sign routing. This may be especially relevant for analog, photonic, or synapse-device-based neuromorphic substrates, where physical synaptic elements often naturally encode non-negative quantities, while sign can be implemented at the level of population identity, optical/electrical phase, or excitatory/inhibitory summation. This complements prior demonstrations of bioplausible rules on photonic hardware (Filipovich et al., 2022). Second, the implicit sparsity we observe suggests that Dale-compliant training may yield model compression “for free,” since weights bounded below at zero cannot recover once suppressed, turning the floor into a natural pruning mechanism that could be exploited with structured sparsity kernels at inference time. Third, we conjecture that the dual-stream architecture may be unusually well-suited to continual and open-ended learning settings, where catastrophic forgetting is a central obstacle: the dedicated inhibitory stream provides a structural mechanism for dampening large gradient excursions, and the sign constraint prevents the kind of unconstrained weight sign flips that are thought to contribute to representational overwriting. Evaluating ED on sequential task streams and lifelong RL benchmarks is a natural next step.
Fourth, the segregation of computation into excitatory “amplifiers” and inhibitory “suppressors” offers a potential interpretability handle that standard networks lack: when a model makes an error, one can in principle trace whether the mistake originated from a feature being falsely amplified by the excitatory stream or insufficiently suppressed by the inhibitory stream, providing a more mechanistic form of attribution than gradient-based saliency on sign-unconstrained networks. Finally, closing the accuracy gap to DFA and BP-PPO on the hardest tasks, whether through richer error-routing schemes, learned (rather than fixed modular) output-to-hidden projections, or hybrid activation functions that preserve sigmoid locality while mitigating attenuation, remains the most direct path toward making Dale-compliant learning competitive with unconstrained backpropagation.

Conclusion

We demonstrated that a dual-stream excitatory/inhibitory architecture trained with Error Diffusion achieves competitive performance across both supervised classification and reinforcement learning while maintaining non-negative weights consistent with Dale’s principle. In the classification setting, three domain-specific innovations (layer-specific sigmoid widths, batch-centered class error signals, and asymmetric initialization) enable scaling from binary to 10-class problems, and the reversal of their relative importance between MNIST and CIFAR-10 in ablations reveals that credit-assignment bottlenecks shift qualitatively with task difficulty. In reinforcement learning, the same core architecture integrated with PPO achieves rewards comparable to backpropagation-based PPO on locomotion tasks, without requiring the classification-specific stabilizers, though with substantially higher variance. The cross-domain pattern (competitive performance with domain-dependent auxiliary mechanisms) suggests that biologically plausible learning rules need not be monolithic: a shared architectural core can support diverse task demands when augmented with appropriate domain-specific components. The accuracy gap relative to BP-PPO on the hardest tasks quantifies the current cost of Dale’s principle compliance and provides concrete benchmarks for future work on narrowing this gap while preserving biological fidelity.

Acknowledgement

Generative AI tools were used in the preparation of the manuscript, including text editing, code generation, image generation, and data analysis assistance. Additionally, we leveraged AI to generate ideas, run autonomous experimentation, and refine hypotheses. All claims, code implementation, manuscript writing, and figures were either reviewed by or created by the authors.

References

Bartunov et al., (2018) Bartunov, S., Santoro, A., Richards, B., Marris, L., Hinton, G. E., and Lillicrap, T. (2018). Assessing the scalability of biologically-motivated deep learning algorithms and architectures. Advances in neural information processing systems, 31.
Bengio et al., (2015) Bengio, Y., Lee, D.-H., Bornschein, J., Mesnard, T., and Lin, Z. (2015). Towards biologically plausible deep learning. arXiv preprint arXiv:1502.04156.
Cornford et al., (2021) Cornford, J., Kalajdzievski, D., Leite, M., Lamarquette, A., Kullmann, D. M., and Richards, B. A. (2021). Learning to live with Dale’s principle: ANNs with separate excitatory and inhibitory units. In International Conference on Learning Representations.
Crick, (1989) Crick, F. (1989). The recent excitement about neural networks. Nature, 337(6203):129–132.
Dale, (1935) Dale, H. (1935). Pharmacology and nerve-endings.
Ding et al., (2018) Ding, B., Qian, H., and Zhou, J. (2018). Activation functions and their characteristics in deep neural networks. In 2018 Chinese Control And Decision Conference (CCDC), pages 1836–1841.
Eccles, (1976) Eccles, J. C. (1976). From electrical to chemical transmission in the central nervous system: the closing address of the Sir Henry Dale Centennial Symposium Cambridge, 19 September 1975. Notes and records of the Royal Society of London, 30(2):219–230.
Filipovich et al., (2022) Filipovich, M. J., Guo, Z., Al-Qadasi, M., Marquez, B. A., Morison, H. D., Sorger, V. J., Prucnal, P. R., Shekhar, S., and Shastri, B. J. (2022). Silicon photonic architecture for training deep neural networks with direct feedback alignment. Optica, 9(12):1323–1332.
Freeman et al., (2021) Freeman, C. D., Frey, E., Raichuk, A., Girgin, S., Mordatch, I., and Bachem, O. (2021). Brax–a differentiable physics engine for large scale rigid body simulation. arXiv preprint arXiv:2106.13281.
Fujita, (2026) Fujita, K. (2026). A diagnostic evaluation of neural networks trained with the error diffusion learning algorithm. Discover Artificial Intelligence.
Grossberg, (1987) Grossberg, S. (1987). Competitive learning: From interactive activation to adaptive resonance. Cognitive science, 11(1):23–63.
Hammad, (2024) Hammad, M. (2024). Deep learning activation functions: Fixed-shape, parametric, adaptive, stochastic, miscellaneous, non-standard, ensemble. arXiv preprint arXiv:2407.11090.
Izhikevich, (2007) Izhikevich, E. M. (2007). Solving the distal reward problem through linkage of STDP and dopamine signaling. Cerebral cortex, 17(10):2443–2452.
Kaneko, (2000) Kaneko, I. (2000). Sample programs of error diffusion learning algorithm. https://web.archive.org/web/20000306212433/http://village.infoweb.ne.jp/~fwhz9346/ed.htm. [Online; accessed 13-April-2026].
Launay et al., (2020) Launay, J., Poli, I., Boniface, F., and Krzakala, F. (2020). Direct feedback alignment scales to modern deep learning tasks and architectures. In Proceedings of the 34th International Conference on Neural Information Processing Systems, NIPS ’20, Red Hook, NY, USA. Curran Associates Inc.
Launay et al., (2019) Launay, J., Poli, I., and Krzakala, F. (2019). Principled training of neural networks with direct feedback alignment.
Lee et al., (2015) Lee, D.-H., Zhang, S., Fischer, A., and Bengio, Y. (2015). Difference target propagation. In Appice, A., Rodrigues, P. P., Santos Costa, V., Soares, C., Gama, J., and Jorge, A., editors, Machine Learning and Knowledge Discovery in Databases, pages 498–515, Cham. Springer International Publishing.
Liao et al., (2016) Liao, Q., Leibo, J., and Poggio, T. (2016). How important is weight symmetry in backpropagation? In Proceedings of the AAAI Conference on Artificial Intelligence, volume 30.
Lillicrap et al., (2016) Lillicrap, T. P., Cownden, D., Tweed, D. B., and Akerman, C. J. (2016). Random synaptic feedback weights support error backpropagation for deep learning. Nature communications, 7(1):13276.
Markram et al., (2004) Markram, H., Toledo-Rodriguez, M., Wang, Y., Gupta, A., Silberberg, G., and Wu, C. (2004). Interneurons of the neocortical inhibitory system. Nature reviews neuroscience, 5(10):793–807.
Matthews et al., (2024) Matthews, M., Beukman, M., Ellis, B., Samvelyan, M., Jackson, M., Coward, S., and Foerster, J. (2024). Craftax: A lightning-fast benchmark for open-ended reinforcement learning. arXiv preprint arXiv:2402.16801.
Nøkland, (2016) Nøkland, A. (2016). Direct feedback alignment provides learning in deep neural networks. Advances in neural information processing systems, 29.
Pfeiffer et al., (2010) Pfeiffer, M., Nessler, B., Douglas, R. J., and Maass, W. (2010). Reward-modulated Hebbian learning of decision making. Neural computation, 22(6):1399–1444.
Richards et al., (2019) Richards, B. A., Lillicrap, T. P., Beaudoin, P., Bengio, Y., Bogacz, R., Christensen, A., Clopath, C., Costa, R. P., de Berker, A., Ganguli, S., et al. (2019). A deep learning framework for neuroscience. Nature neuroscience, 22(11):1761–1770.
Roodschild et al., (2020) Roodschild, M., Gotay Sardiñas, J., and Will, A. (2020). A new approach for the vanishing gradient problem on sigmoid activation. Progress in Artificial Intelligence, 9(4):351–360.
Sacramento et al., (2018) Sacramento, J., Ponte Costa, R., Bengio, Y., and Senn, W. (2018). Dendritic cortical microcircuits approximate the backpropagation algorithm. Advances in neural information processing systems, 31.
Salimans et al., (2017) Salimans, T., Ho, J., Chen, X., Sidor, S., and Sutskever, I. (2017). Evolution strategies as a scalable alternative to reinforcement learning. arXiv preprint arXiv:1703.03864.
Schulman et al., (2017) Schulman, J., Wolski, F., Dhariwal, P., Radford, A., and Klimov, O. (2017). Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347.
Song et al., (2016) Song, H. F., Yang, G. R., and Wang, X.-J. (2016). Training excitatory-inhibitory recurrent neural networks for cognitive tasks: a simple and flexible framework. PLoS computational biology, 12(2):e1004792.
Strata and Harvey, (1999) Strata, P. and Harvey, R. (1999). Dale’s principle. Brain research bulletin, 50(5-6):349–350.
Szandała, (2020) Szandała, T. (2020). Review and comparison of commonly used activation functions for deep neural networks. In Bio-inspired neurocomputing, pages 203–224. Springer.
Turrigiano, (2011) Turrigiano, G. (2011). Too many cooks? intrinsic and synaptic homeostatic mechanisms in cortical circuit refinement. Annual review of neuroscience, 34(1):89–103.
Whittington and Bogacz, (2017) Whittington, J. C. and Bogacz, R. (2017). An approximation of the error backpropagation algorithm in a predictive coding network with local Hebbian synaptic plasticity. Neural computation, 29(5):1229–1262.
