Srijan Tiwari (srijan_t@ee.iitr.ac.in), Aditya Chauhan† (aditya_c@mfs.iitr.ac.in), and Manjot Singh† (manjot_s@cs.iitr.ac.in)
Indian Institute of Technology Roorkee
† Equal contribution
Radial Suppression Accelerates Algorithmic Generalization:
A Geometric Analysis of Delayed Generalization
Abstract
Why do neural networks memorize algorithmic training data long before they generalize? We present a geometric case study demonstrating that, on tasks where generalization requires discovering structured low-dimensional circuits, the memorization–generalization delay is driven by radial inflation of hidden representations under cross-entropy optimization. We formalize a radial–angular decomposition of activation-space dynamics and derive three testable propositions: (i) that penalizing radial inflation induces anisotropic, data-dependent weight regularization; (ii) that it suppresses radial gradient energy below the isotropic random baseline, forcing predominantly angular updates; and (iii) that it biases convergence toward flatter minima. To validate these propositions empirically, we study a single-hyperparameter norm penalty that softly constrains activations to a fixed-radius hypersphere. On modular arithmetic, this penalty accelerates grokking across MLPs and Transformers (for the MLP, onset at 2,460 epochs versus 15,540 under strong weight decay), and it halves training steps for a 10M-parameter nanoGPT on 3-digit addition.
1 Introduction
Neural networks trained on algorithmic tasks exhibit a distinct form of delayed generalization in which models achieve near-perfect training accuracy early in optimization, yet test accuracy remains low for orders of magnitude more training before undergoing a sudden phase transition (power2022grokking; barak2023hiddenprogressdeeplearning). Mechanistic studies of modular arithmetic reveal that generalization requires discovering structured low-dimensional circuits (nanda2023progress; zhong2023clock), while the memorizing solution is characterized by unstructured, high-norm representations. We demonstrate that one cause of this delay is radial inflation: standard cross-entropy incentivizes outward growth of hidden activations to push logits into the saturating regime of softmax (prieto2025stablemax). This inflates the dominant singular value while collapsing effective rank (sun2026dimensional).
We formalize this intuition through a radial–angular decomposition of activation-space dynamics and derive three analytical propositions. To test these propositions, we study an activation norm penalty that softly constrains hidden representations to a fixed-radius hypersphere.
We study this intervention specifically on algorithmic generalization tasks where the generalization phase transition is sharp and measurable. Note that we do not claim universality. Rather, we present the radial–angular framework as a geometric lens for understanding algorithmic phase transitions, and the norm penalty as a principled instrument for understanding this geometry.
Contributions
1. A radial–angular decomposition of activation-space dynamics that generates three testable predictions about how radial suppression should affect optimization geometry, along with empirical validation (§4).
2. A single-hyperparameter norm penalty that accelerates grokking across MLPs, Transformers, and a 10M-parameter nanoGPT (Table 3).
2 Background and Related Work
2.1 Grokking Phenomenon
The “Goldilocks zone” framework (liu2022towards) models generalization as occurring within a narrow band of weight norms while memorization corresponds to high-norm solutions outside this zone. nanda2023progress reverse-engineered the Fourier multiplication algorithm learned by grokked Transformers, identifying three phases: memorization, circuit formation, and cleanup. merrill2023tale modeled grokking as competition between dense memorizing and sparse generalizing subnetworks. GrokFast (grokfast2024) amplifies slow-varying gradient components. xu2025let proposed GrokTransfer, which transplants embeddings from a pre-trained proxy model.
2.2 Geometric Perspectives on Regularization
Existing regularization approaches generally constrain either weight-space or activation-space representations.
Standard weight decay applies an isotropic penalty on the squared norm of the weights. Spectral normalization (miyato2018spectral) bounds the Lipschitz constant by rescaling each weight matrix by its largest singular value. Weight normalization (salimans2016weight) explicitly decouples magnitude from direction. Activation Regularization in AWD-LSTM (merity2018regularizing) directly penalizes the norm of the hidden activations. Minimizing Activation Norms (MAN) (man2024) minimizes the activation norm as a Hessian-flatness proxy.
On the other hand, LayerNorm (ba2016layer) and RMSNorm impose hard per-token constraints on activation statistics. Our penalty is a soft constraint: it permits temporary violation during landscape traversal and has no learnable affine parameters that could undo the constraint.
prieto2025stablemax identify softmax-driven logit inflation as a grokking bottleneck; their ⊥Grad method projects gradients away from magnitude-scaling directions in parameter space. yildirim2026geometric enforce hard projection onto a bounded spherical topology. Our approach provides a soft, loss-based variant operating in activation space.
3 Method: The Activation Norm Penalty
3.1 Formulation
Given a network with hidden representation at a target layer, we augment the cross-entropy objective with a single-hyperparameter penalty:
| (1) |
| (2) |
The target ensures that the average squared activation per feature is constant as width increases, matching standard initialization variance and preventing both variance collapse and unbounded inflation. We also show that this constraint is a relaxation of Riemannian gradient flow on a hypersphere (See §A.2). In MLPs, the penalty is applied to the pre-activation outputs of all hidden layers, computed per-sample and averaged over the batch. In Transformer architectures with Pre-LayerNorm, the penalty is applied to each sub-layer output before addition to the residual stream:
| (3) |
This constrains the increment to the residual stream at each layer rather than the cumulative stream, avoiding spurious norm growth in deeper networks.
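As a minimal sketch of a penalty consistent with this description, assume the per-sample form $(\|h\|_2 - \sqrt{d})^2$, averaged over the batch and added to the cross-entropy loss with weight `lam`. The $\sqrt{d}$ radius, the squared-difference form, the summation over layers, and the names `norm_penalty` and `total_loss` are assumptions made for illustration; the default `lam=0.05` is the best value in the penalty-strength sweep of Appendix E.

```python
import math

def norm_penalty(batch, radius=None):
    """Mean over the batch of (||h||_2 - radius)^2 (assumed penalty form)."""
    if radius is None:
        # Assumed target radius: sqrt(d), i.e. unit average squared activation.
        radius = math.sqrt(len(batch[0]))
    total = 0.0
    for h in batch:
        norm = math.sqrt(sum(v * v for v in h))
        total += (norm - radius) ** 2
    return total / len(batch)

def total_loss(ce_loss, layer_batches, lam=0.05):
    """Cross-entropy plus lam times the penalty summed over penalized layers."""
    return ce_loss + lam * sum(norm_penalty(b) for b in layer_batches)

# A vector already on the sqrt(d) sphere pays nothing; an inflated one does.
on_sphere = [1.0, 1.0, 1.0, 1.0]   # norm 2 = sqrt(4)
inflated = [4.0, 0.0, 0.0, 0.0]    # norm 4
print(norm_penalty([on_sphere]), norm_penalty([inflated]))
```

In a Transformer, `layer_batches` would hold each sub-layer output before it is added to the residual stream, matching the placement described above.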
4 Analytical Framework and Empirical Validation
We analyze the norm penalty from three complementary perspectives. Each generates a testable prediction, which we validate empirically immediately after its derivation.
4.1 Radial–Angular Gradient Decomposition
Proposition 4.1.
Let be the radial projection matrix. The total gradient decomposes into a radial component and a tangential component . To quantify this geometry without artifacts from high-dimensional orthogonality, we define the Normalized Fractional Radial Energy:
| (4) |
Under an isotropic random walk in , the expected fractional energy in the one-dimensional radial subspace is , giving a null hypothesis of . The gradient of the norm penalty term, , is purely radial and opposes the radial component of the task gradient.
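The decomposition can be sketched numerically as follows. The sketch assumes the radial projector is $\hat{h}\hat{h}^\top$ with $\hat{h} = h/\|h\|_2$, and that the normalized fractional radial energy is $d\,\|g_r\|^2 / \|g\|^2$, so that an isotropic gradient scores 1 in expectation. The penalty form $(\|h\|_2 - r)^2$ and all function names are likewise assumptions for illustration.

```python
import math

def radial_tangential(h, g):
    """Split g into its component along h and the component orthogonal to h."""
    hh = sum(v * v for v in h)
    coef = sum(a * b for a, b in zip(h, g)) / hh
    g_rad = [coef * v for v in h]
    g_tan = [a - b for a, b in zip(g, g_rad)]
    return g_rad, g_tan

def normalized_radial_energy(h, g):
    """d * ||g_rad||^2 / ||g||^2 (assumed normalization; isotropic null is 1)."""
    g_rad, _ = radial_tangential(h, g)
    num = sum(v * v for v in g_rad)
    den = sum(v * v for v in g)
    return len(h) * num / den

def penalty_grad(h, radius):
    """Gradient of the assumed penalty (||h|| - radius)^2 with respect to h."""
    norm = math.sqrt(sum(v * v for v in h))
    return [2.0 * (norm - radius) * v / norm for v in h]

h = [3.0, 4.0]
# The penalty gradient is a multiple of h, so all of its energy is radial.
print(normalized_radial_energy(h, penalty_grad(h, math.sqrt(2.0))))
```

Because the penalty gradient is parallel to $h$, it can only cancel the radial part of the task gradient; the tangential part returned by `radial_tangential` is left untouched.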
Prediction.
The penalty should suppress well below throughout training, and this angular redirection should be accompanied by accelerated assembly of features, such as the periodic Fourier features that underlie generalization on modular arithmetic.
Result.
During early memorization phase (epochs –), the baseline exhibits severe radial inflation. The penalized model suppresses to approximately from initialization onward, an order of magnitude below the null.
Table 1 reports the downstream effect on Fourier circuit assembly. Fourier coherence ( on the basis) is reached at epoch under the penalty versus for the baseline and the dominant Fourier magnitude increases fourfold, indicating that angular optimization produces sharper, more structured representations prior to the phase transition.
4.2 Implicit Anisotropic Weight Regularization
Proposition 4.2.
Under the local linear approximation $h \approx Wx$ and in the high-norm regime, where $\|h\|_2$ far exceeds the target radius (which holds during the early memorization phase when radial inflation is maximal), the centered penalty approximates $\|Wx\|_2^2$, and the expected penalty becomes:
$$\mathbb{E}_x\big[\|Wx\|_2^2\big] = \operatorname{tr}\!\big(W \Sigma W^\top\big) \tag{5}$$
where $\Sigma = \mathbb{E}_x[x x^\top]$ is the input second-moment matrix. Unlike isotropic weight decay ($\|W\|_F^2$), this penalizes weight directions proportionally to the variance of their input features: directions aligned with high-variance principal components of $\Sigma$ receive stronger suppression.
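The anisotropy can be made explicit with a short worked derivation. It assumes the local linear model $h \approx Wx$ and writes the input second-moment matrix as $\Sigma = \mathbb{E}_x[xx^\top]$ with eigenpairs $(\mu_i, u_i)$; these symbols are chosen here for illustration.

```latex
\begin{align*}
\mathbb{E}_x\,\|Wx\|_2^2
  &= \mathbb{E}_x\,\operatorname{tr}\!\left(W x x^\top W^\top\right)
   = \operatorname{tr}\!\left(W \Sigma W^\top\right), \\
\Sigma = \sum_i \mu_i\, u_i u_i^\top
  \;&\Longrightarrow\;
  \operatorname{tr}\!\left(W \Sigma W^\top\right)
   = \sum_i \mu_i\, \|W u_i\|_2^2, \\
\Sigma = I
  \;&\Longrightarrow\;
  \operatorname{tr}\!\left(W W^\top\right) = \|W\|_F^2 .
\end{align*}
```

Each input direction $u_i$ is penalized with weight $\mu_i$, so high-variance directions are suppressed most, and isotropic weight decay is recovered only in the special case of whitened inputs.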
Prediction.
The data-dependent anisotropy of this regularizer should produce faster generalization than isotropic weight decay at similar regularization strength, because it selectively suppresses the high-variance directions that cross-entropy exploits for memorization while leaving low-variance directions free to participate in circuit formation.
Result.
Table 2 reports grokking onset against competing regularizers. Our penalty reaches generalization approximately 6.3× faster than strong isotropic weight decay (2,460 vs. 15,540 epochs) and 6.1× faster than MAN (15,000 epochs), which minimizes the activation norm without a centered target. The higher effective rank under our penalty (443 vs. 402 for strong WD) is consistent with the anisotropy prediction. Table 3 shows that the penalty consistently accelerates the memorization–generalization phase transition across multiple architectures. The more modest relative speedup on NanoGPT is consistent with the architecture already performing partial radial suppression via its affine LayerNorm.
| Method | Grok Onset (epochs) | Final Test Acc | Eff. Rank | Hessian Trace |
|---|---|---|---|---|
| Baseline (WD) | DNG | 1.7% | 135 | 42.5 |
| Strong WD () | 15,540 ± 1,480 | 100.0% | 402 | 3.2 |
| Dropout () | 22,000 ± 2,500 | 98.5% | 378 | 5.8 |
| MAN () | 15,000 ± 2,000 | 100.0% | 400 | 2.5 |
| Norm Penalty (Ours) | 2,460 ± 136 | 100.0% | 443 | 1.4 |
| Setting | Baseline | Norm Penalty | Speedup (vs. Strong WD) |
|---|---|---|---|
| MLP (epochs) | 15,540 ± 1,480 | 2,460 ± 136 | ≈6.3× |
| Transformer (epochs) | 8,000 ± 450 | 5,200 ± 400 | ≈1.5× |
| NanoGPT (steps) | 22,500 ± 1,200 | 9,800 ± 600 | ≈2.3× |
4.3 Curvature Reduction via Norm Bounding
Proposition 4.3.
Using the empirical Fisher as a curvature proxy, the Hessian trace approximates as , where . Layer-wise activation norm bounding restricts (the input to the next layer) and reduces pre-activation saturation (thereby bounding ).
Prediction.
The penalty should substantially reduce the Hessian trace and normalized sharpness relative to the baseline, and this curvature reduction should be accompanied by a shift toward higher effective rank, indicating that the loss landscape flattens without collapsing the representational geometry.
Result.
Table 1 reports spectral and curvature diagnostics. The penalty achieves a reduction in raw Hessian trace and a reduction in normalized sharpness, confirming the curvature prediction. Spectral compression is equally striking: drops from to , while effective rank rises from to out of dimensions. This combination—flatter landscape and higher rank—distinguishes the norm penalty from isotropic regularizers, which reduce sharpness by collapsing the spectrum rather than by redistributing it.
5 Conclusion
We have presented a geometric case study of the memorization–generalization phase transition on algorithmic tasks. Through a radial–angular decomposition of activation-space dynamics, we derived three testable predictions about how radial suppression should affect optimization, and validated each empirically using a simple norm penalty as instrument.
The penalty accelerates grokking on modular arithmetic (2,460 vs. 15,540 epochs for the MLP) and roughly halves training steps on 3-digit addition (9,800 vs. 22,500), while dramatically compressing the spectral geometry and flattening the loss landscape. We situated the penalty within a taxonomy of geometric interventions, clarifying its relationship to normalization layers, direct activation penalties, and parameter-space methods.
Our work suggests that, for algorithmic learning tasks, the memorization–generalization delay is fundamentally a geometric phenomenon: cross-entropy drives radial inflation, trapping networks in memorization basins, and principled radial suppression provides a direct lever to accelerate the phase transition. Whether this geometric lens extends to broader learning settings remains an open and important question.
References
Appendix A Theoretical Framework and Extended Geometric Analysis
In this section, we formalize the geometric mechanisms through which the activation norm penalty alters high-dimensional learning dynamics. We analyze its effects on local curvature, gradient flow topologies, and the spectral properties of the representation matrix.
A.1 Preempting the Edge of Stability via Radial Bounding
Proposition 1. Radial inflation in activation space drives progressive sharpening in parameter space. Constraining activation norms bounds the parameter spectral norm, formally preempting the Edge of Stability (EoS) instability threshold.
Derivation. Let the network’s local linear approximation be , where is the weight matrix and is the input. The standard cross-entropy loss operates on . Let denote the loss mapping from pre-activations to the scalar loss.
By the chain rule, the gradient with respect to the weights is . The Hessian with respect to the weights—omitting the second-derivative tensor terms of the network architecture for the local linear approximation—is dominated by:
| (6) |
Unconstrained cross-entropy optimization naturally drives the magnitude of features outward to maximize softmax margins, so the activation norm keeps growing. Under the local linear mapping, this radial inflation requires unconstrained growth in the spectral norm of the weight matrix, specifically along the principal components of the input covariance. Consequently, the maximal eigenvalue of the Hessian grows proportionally (cohen2021gradient).
The Edge of Stability dictates that gradient descent becomes unstable when the maximal Hessian eigenvalue exceeds $2/\eta$, where $\eta$ is the learning rate. Our norm penalty introduces an opposing restoring force. By holding the activation norm near the target radius, we bound the spectral norm of the weight matrix. Because the maximal Hessian eigenvalue is a function of this bounded weight norm, the local sharpness is held below the threshold, explaining the dramatic reduction in Hessian trace observed in our experiments.
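The stability threshold itself is easy to see on a one-dimensional quadratic, where the curvature plays the role of the maximal Hessian eigenvalue. The following is a toy sketch under that assumption, not a model of the networks studied here; `gd_on_quadratic` is a hypothetical helper.

```python
def gd_on_quadratic(curvature, lr, steps=50, x0=1.0):
    """Run gradient descent on f(x) = 0.5 * curvature * x**2; return final |x|."""
    x = x0
    for _ in range(steps):
        x -= lr * curvature * x  # each step multiplies x by (1 - lr * curvature)
    return abs(x)

curvature = 10.0                         # threshold learning rate is 2 / 10 = 0.2
print(gd_on_quadratic(curvature, 0.19))  # just below threshold: iterates shrink
print(gd_on_quadratic(curvature, 0.21))  # just above threshold: iterates blow up
```

Each step scales the iterate by $1 - \eta \cdot \text{curvature}$, so the iterates contract only while the curvature stays below $2/\eta$; bounding sharpness keeps training on the contracting side of that boundary.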
A.2 The Norm Penalty as a Lagrangian Relaxation of Riemannian Flow
Proposition 2. The activation norm penalty acts as a continuous Lagrangian relaxation of a Riemannian gradient flow on the hypersphere , where the penalty multiplier dictates the stiffness of the manifold retraction.
Derivation. Consider the continuous-time gradient flow of the activations , where the total objective is . The gradient evaluates to:
| (7) |
Let be the radial projection matrix. We decompose the continuous flow into radial () and tangential () components:
| (8) | ||||
| (9) |
In the asymptotic limit as , the penalty strictly dominates the radial dynamics, correcting any deviation from the target radius infinitely fast. Thus, and . In this limit, the optimization trajectory reduces exactly to:
| (10) |
This is exactly Riemannian gradient descent restricted to the hypersphere (absil2008optimization). Because we operate at a finite penalty strength, our method avoids the brittleness of hard retraction mappings, instead optimizing within a “thickened sphere” while maintaining the favorable angular dynamics characteristic of Riemannian optimization.
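The steps of this derivation can be written out under one explicit assumption about the penalty, namely $\mathcal{L}_{\text{norm}} = (\|h\|_2 - r)^2$ with target radius $r$ and multiplier $\lambda$. The symbols $r$, $\hat{h}$, and $P_r$ below are chosen for illustration and may differ from the notation used elsewhere in the paper.

```latex
\begin{align*}
\nabla_h \mathcal{L}_{\text{total}}
  &= \nabla_h \mathcal{L}_{\text{CE}} + 2\lambda\,(\|h\|_2 - r)\,\hat{h},
  \qquad \hat{h} = h/\|h\|_2,\quad P_r = \hat{h}\hat{h}^\top, \\
P_r\,\dot{h}
  &= -P_r \nabla_h \mathcal{L}_{\text{CE}} - 2\lambda\,(\|h\|_2 - r)\,\hat{h}
  && \text{(radial)}, \\
(I - P_r)\,\dot{h}
  &= -(I - P_r)\,\nabla_h \mathcal{L}_{\text{CE}}
  && \text{(tangential)}, \\
\lambda \to \infty:\quad
\dot{h} &= -(I - \hat{h}\hat{h}^\top)\,\nabla_h \mathcal{L}_{\text{CE}},
  \qquad \|h\|_2 = r .
\end{align*}
```

The penalty term appears only in the radial equation because its gradient is parallel to $\hat{h}$ and is annihilated by $I - P_r$; this is why the tangential dynamics are unchanged at any finite $\lambda$.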
A.3 Spectral Collapse and Antagonistic Gradients
Proposition 3. Unconstrained cross-entropy exhibits an implicit bias toward rank-1 representations. The activation penalty induces an “antagonistic gradient” that counteracts this bias, preserving the stable rank and allowing the spectral edge of generalizing circuits to assemble.
Derivation. The effective capacity of the network representations over a batch is bounded by the stable rank:
$$\operatorname{srank}(H) = \frac{\|H\|_F^2}{\|H\|_2^2} = \frac{\sum_i \sigma_i^2}{\sigma_1^2} \tag{11}$$
where $H$ is the matrix of batch representations and $\sigma_i$ are its singular values, with $\sigma_1$ the largest. Gradient flow on separable data without explicit regularization converges to max-margin solutions, functioning as an implicit bias toward low-rank factorizations (li2020towards). Specifically, cross-entropy inflates the dominant singular value exponentially faster than the tail, driving the stable rank toward 1.
Applying the norm penalty without a stop-gradient introduces a structural gradient conflict. The primary task attempts to maximize to increase logit margins. Simultaneously, the radial gradient of the norm penalty, , aggressively pulls back the representation vector.
Because this opposing gradient is strictly radial, it exerts its maximal suppressive force exactly along the direction of . By actively bounding via this continuous “antagonistic regularization,” the optimization energy is distributed across the trailing singular values (). This formally explains the empirical preservation of stable rank, ensuring the latent space maintains sufficient effective dimensionality for complex Fourier features to emerge prior to the phase transition.
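For concreteness, the stable rank of a batch of representations can be computed as below. This is a dependency-free sketch that assumes the standard definition $\|H\|_F^2 / \sigma_1^2$ and finds $\sigma_1^2$ by power iteration on $H^\top H$; the function name, iteration count, and seeded random start are illustrative choices.

```python
import math
import random

def stable_rank(H, iters=200, seed=0):
    """||H||_F^2 / sigma_1^2, with sigma_1^2 from power iteration on H^T H."""
    n_cols = len(H[0])
    frob_sq = sum(v * v for row in H for v in row)
    # Gram matrix G = H^T H, whose largest eigenvalue is sigma_1^2.
    G = [[sum(row[i] * row[j] for row in H) for j in range(n_cols)]
         for i in range(n_cols)]
    rng = random.Random(seed)
    v = [rng.gauss(0.0, 1.0) for _ in range(n_cols)]  # generic start vector
    scale = math.sqrt(sum(x * x for x in v))
    v = [x / scale for x in v]
    top = 0.0
    for _ in range(iters):
        w = [sum(G[i][j] * v[j] for j in range(n_cols)) for i in range(n_cols)]
        top = math.sqrt(sum(x * x for x in w))  # approaches sigma_1^2
        if top == 0.0:
            return 0.0
        v = [x / top for x in w]
    return frob_sq / top

collapsed = [[1.0, 2.0], [2.0, 4.0], [3.0, 6.0]]   # rank-1 batch
spread = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
print(stable_rank(collapsed), stable_rank(spread))
```

A batch dominated by one singular direction scores close to 1, while energy spread evenly over $k$ orthogonal directions scores $k$, which is the contrast the proposition draws between collapsed and preserved spectra.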
Appendix B Detailed Discussion
B.1 What the Geometric Lens Reveals
Our evidence is consistent with the following account of algorithmic phase transitions:
1. Cross-entropy optimization drives radial inflation of activations, causing spectral collapse and trapping the network in a high-norm memorization basin.
2. Constraining activations to a fixed-radius hypersphere suppresses radial gradient components, redirecting optimization to angular (tangential) updates.
3. Angular updates preserve feature diversity and promote the discovery of periodic Fourier circuits.
4. The resulting solutions lie in flatter minima with dramatically lower Hessian trace.
This account is correlational: the penalty simultaneously suppresses radial inflation, preserves rank, flattens curvature, and accelerates Fourier coherence (The point during training when a neural network’s internal representations strongly align with a theoretical Fourier basis (e.g., ), marking the successful assembly of a generalizable, periodic algorithm). We present these observations as consistent with the radial–angular framework rather than as proof of a unique causal chain. Disentangling these effects—e.g., via interventions that preserve rank without radial suppression, or vice versa—is an important direction for future work.
B.2 The LayerNorm Relationship
LayerNorm enforces a hard per-token constraint (zero mean, unit variance) and restores expressivity via learned affine parameters . Our penalty is a soft constraint with no affine restoration, which has two consequences: (i) the network can temporarily violate the hypersphere during landscape traversal, providing a smoother optimization path; (ii) the absence of affine parameters prevents the network from undoing the constraint via learned rescaling. When applied to Transformers that already include LayerNorm, the two provide complementary radial suppression: LayerNorm constrains per-token statistics within each sub-layer, while the penalty constrains sub-layer output magnitudes across sub-layers.
LayerNorm (without affine parameters) achieves 80% of the grokking acceleration of our penalty, raising the question of whether the additional hyperparameter is worthwhile. We argue that it is, for three reasons: (i) the penalty achieves lower Hessian trace than LayerNorm, suggesting a qualitatively different solution geometry; (ii) the penalty is a loss-based intervention that does not modify the forward pass and is therefore trivially combinable with any architecture; (iii) combining both (Table 9) yields the fastest grokking, indicating complementary mechanisms—LayerNorm normalizes per-token statistics within sub-layers, while the penalty constrains sub-layer output magnitudes across the network.
Metrics.
Grokking onset: first epoch (or step) at which test accuracy consistently exceeds . Effective rank: stable rank over a batch of 256. Hessian trace: Hutchinson’s method with 100 Rademacher probes (convergence verified in Table 8). Normalized sharpness: . “DNG” indicates failure to reach test accuracy within the training budget.
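Hutchinson's estimator, used above for the Hessian trace, can be sketched in a few lines. The example applies it to small explicit matrices through a matrix-vector product callback; in practice that callback would be a Hessian-vector product from automatic differentiation, which is outside this sketch. The name `hutchinson_trace` and the fixed seed are illustrative.

```python
import random

def hutchinson_trace(matvec, dim, n_probes=100, seed=0):
    """Estimate tr(A) as the mean of z^T A z over Rademacher probes z."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n_probes):
        z = [rng.choice((-1.0, 1.0)) for _ in range(dim)]
        Az = matvec(z)
        total += sum(a * b for a, b in zip(z, Az))
    return total / n_probes

def make_matvec(A):
    """Wrap an explicit matrix as a matrix-vector product callback."""
    return lambda z: [sum(a * b for a, b in zip(row, z)) for row in A]

A = [[2.0, 1.0], [1.0, 3.0]]  # true trace is 5
print(hutchinson_trace(make_matvec(A), dim=2, n_probes=2000))
```

With Rademacher probes the diagonal contribution is exact on every draw, so all estimator variance comes from off-diagonal entries; this is why a modest number of probes can suffice when the matrix is close to diagonal.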
| Variant | Grok Onset (ep.) | Hessian Trace | Eff. Rank |
|---|---|---|---|
| Baseline (WD) | DNG (100k) | 42.5 ± 2.1 | 135 ± 4 |
| LayerNorm (No Affine) | 8,000 ± 450 | 57.9 ± 3.2 | 464 ± 5 |
| RMSNorm | 8,500 ± 520 | 35.5 ± 2.8 | 498 ± 2 |
| Norm Penalty (Ours) | 6,167 ± 624 | 1.4 ± 0.1 | 443 ± 3 |
Appendix C Mechanistic Analysis
Beyond aggregate metrics, we probe the internal structure of the learned representations to connect our geometric framework to circuit-level mechanisms. We present the two analyses in chronological order: first the assembly dynamics over training, then the specialization structure of the converged solution.
C.1 Circuit Assembly Timeline
We track the assembly of Fourier circuits over training by projecting the hidden activations onto the complete 4-dimensional Fourier basis for each frequency and computing the fit at each epoch (MLP, ).
Figure 4 reveals qualitatively different assembly dynamics across conditions. The baseline accumulates weak frequency traces that never reach the coherence threshold of reported in Table 1. Strong weight decay forces a sudden crystallization but concentrates capacity into a sparse subset of 2–3 dominant frequencies, consistent with the low effective rank (402) and high reported in Table 2. The norm penalty, by contrast, produces a distributed assembly: frequencies emerge sequentially beginning from low- modes, spreading representational energy across many orthogonal directions. This distributed encoding is mechanistically consistent with the penalty’s reduction in Hessian trace (Table 1): when information is spread evenly across many frequency channels rather than concentrated in a few, the loss landscape exhibits lower curvature along every direction.
C.2 Per-Neuron Fourier Selectivity
To characterize the converged circuit structure, we measure the frequency selectivity of individual neurons. For each of the hidden neurons and each Fourier frequency , we compute the maximum absolute correlation between neuron ’s activation profile across all inputs and the two-dimensional Fourier basis , taking the larger of the two as the selectivity score. The resulting heatmap reveals the degree to which each neuron commits to a single frequency.
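The selectivity score can be sketched as follows. For simplicity the sketch indexes a neuron's activation profile by a single residue $a \in \{0, \dots, p-1\}$ rather than by input pairs, and uses the Pearson correlation against $\cos(2\pi k a / p)$ and $\sin(2\pi k a / p)$; both simplifications, and all function names, are assumptions made for illustration.

```python
import math

def pearson(x, y):
    """Pearson correlation; returns 0.0 if either input is constant."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    if vx == 0.0 or vy == 0.0:
        return 0.0
    return cov / math.sqrt(vx * vy)

def fourier_selectivity(profile, k):
    """Larger of |corr(profile, cos)| and |corr(profile, sin)| at frequency k."""
    p = len(profile)
    cos_k = [math.cos(2 * math.pi * k * a / p) for a in range(p)]
    sin_k = [math.sin(2 * math.pi * k * a / p) for a in range(p)]
    return max(abs(pearson(profile, cos_k)), abs(pearson(profile, sin_k)))

def selectivity_map(profiles, freqs):
    """Rows are neurons, columns are frequencies: the heatmap's raw values."""
    return [[fourier_selectivity(prof, k) for k in freqs] for prof in profiles]

p = 11
neuron = [math.cos(2 * math.pi * 3 * a / p) for a in range(p)]  # tuned to k = 3
print(selectivity_map([neuron], freqs=[1, 2, 3, 4, 5]))
```

Taking the larger of the two correlations, as the text describes, makes the score depend on phase: a neuron tuned to frequency $k$ with a 45-degree phase offset scores about 0.71 rather than 1.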
As shown in Figure 5, the penalty produces a clean block-diagonal structure: coherent clusters of approximately 10 neurons (consistent with ) specialize to each Fourier mode, with near-zero selectivity outside their assigned frequency. This neuron-cluster-per-frequency organization is the circuit motif identified by nanda2023progress as the hallmark of a well-formed modular-arithmetic Fourier circuit. The unpenalized baseline, whose radial inflation suppresses from the first epoch (Figure 1), shows diffuse correlations with no frequency preference—consistent with the low dominant Fourier magnitude () in Table 1. Strong weight decay forces partial cluster formation but concentrates heavily on a few low- modes and leaves higher frequencies underrepresented, matching the lower effective rank (402 vs. 443) reported in Table 2. Together, Figures 4 and 5 connect the aggregate spectral diagnostics in §4 to a concrete circuit-level picture: radial suppression enables the progressive, distributed assembly of a modular Fourier circuit that would otherwise be blocked by norm-driven spectral collapse.
C.3 Limitations
Approximation regimes.
The anisotropic regularization analysis (Analysis 4.2) operates in the high-norm regime , which holds during early memorization but breaks down as the penalty takes effect and . This is not a practical concern—by the time the approximation fails, the penalty has already redirected optimization away from radial inflation, and the centered form maintains a restoring force thereafter—but the analysis should not be read as a uniform characterization of training dynamics. Similarly, the curvature analysis (Analysis 4.3) directly bounds only one factor of the Hessian trace product (); the other () is constrained indirectly through reduced pre-activation saturation rather than by the penalty itself. We therefore treat the curvature reduction as a verified mechanistic hypothesis rather than a formal guarantee.
Correlational evidence.
The penalty simultaneously suppresses radial inflation, preserves effective rank, flattens curvature, and accelerates Fourier coherence. These effects are consistent with the radial–angular framework but are entangled: we cannot, from the current experiments, attribute the grokking acceleration to any single mechanism in isolation. Ablations that independently manipulate rank preservation without radial suppression, or vice versa, would strengthen the causal interpretation and are an important direction for future work.
Task scope.
All primary experiments involve algorithmic tasks with sharp memorization–generalization phase transitions. The Tiny Shakespeare sanity check (§E.3) confirms the penalty is benign on a standard character-level language modeling task (perplexity degradation stays within 2%, from 4.9 to 5.0, and effective rank increases), but performance on large-scale language modeling or vision benchmarks remains untested. Architectures that rely on activation magnitude as an explicit confidence signal may interact adversely with the penalty.
Fixed target radius.
The choice of as the target radius is principled—it matches the per-feature variance of standard initialization schemes—but it may not be optimal across all architectures or layer types. The ablation in Appendix E.1 shows that is optimal among , and is close, suggesting robustness in the neighborhood of . Tunable or learnable per-layer radii are a natural extension.
Baseline context.
The MLP and Transformer baselines use weak weight decay, a regime where grokking is slow or absent (liu2022towards). Speedups reported in Table 3 are relative to the strong WD baseline (which groks at 15,540 epochs on the MLP) to provide a fair comparison; relative to the weak baseline the raw numbers are larger but less meaningful as a measure of the penalty’s contribution over aggressive norm control in general.
Appendix D Comparisons
D.1 Taxonomy of Geometric Interventions
| Method | Space | Constraint Type | Radial Suppression | Rank Effect |
|---|---|---|---|---|
| Weight Decay | Weight | Isotropic | Indirect | Collapse |
| Spectral Norm miyato2018spectral | Weight | bound | Indirect | Preserved |
| LayerNorm | Activation | Hard (per-token) | Direct | Preserved |
| MAN () man2024 | Activation | Soft (toward zero) | Direct | Preserved |
| ⊥Grad prieto2025stablemax | Parameter | Gradient projection | Direct | Preserved |
| Spherical Projection yildirim2026geometric | Activation | Hard (global) | Complete | Collapse |
| Ours | Activation | Soft ( target) | Direct | Preserved |
Appendix E Ablations
E.1 Robustness Sweeps
| Train fraction | Method | | | |
|---|---|---|---|---|
| 0.6 | Strong WD | 9,000 ± 420 | 7,580 ± 223 | 6,720 ± 331 |
| 0.6 | Norm Penalty | 1,320 ± 75 | 1,260 ± 49 | 1,200 ± 63 |
| 0.5 | Strong WD | 15,540 ± 1,480 | 12,740 ± 723 | 10,600 ± 856 |
| 0.5 | Norm Penalty | 2,460 ± 136 | 2,200 ± 0 | 1,960 ± 162 |
| 0.4 | Strong WD | 33,320 ± 1,942 | 24,520 ± 2,114 | 20,860 ± 1,839 |
| 0.4 | Norm Penalty | 6,680 ± 349 | 5,200 ± 253 | 4,380 ± 133 |
| 0.3 | Strong WD | DNG | DNG | 49,840 ± 320 |
| 0.3 | Norm Penalty | 26,980 ± 1,503 | 17,480 ± 757 | 12,260 ± 554 |
The penalty consistently and substantially accelerates grokking across all moduli and data fractions tested. Speedups relative to Strong WD range from at to at . Notably, at the penalty continues to induce grokking (26,980–12,260 epochs depending on ) while Strong WD fails entirely for and exhausts the training budget for , demonstrating that radial suppression provides a decisive advantage precisely in the low-data regime where isotropic regularization breaks down.
| Penalty strength | Grok Onset (epoch) | Final Test Acc |
|---|---|---|
| 0.001 | 8,200 ± 650 | 100.0% |
| 0.01 | 4,100 ± 320 | 100.0% |
| 0.05 | 2,460 ± 136 | 100.0% |
| 0.1 | 3,800 ± 290 | 100.0% |
| 0.5 | 6,400 ± 510 | 100.0% |
| 1.0 | 9,900 ± 840 | 100.0% |
| Probes | Trace Estimate | Relative Error vs. 500 |
|---|---|---|
| 50 | 1.38 ± 0.12 | 4.2% |
| 100 | 1.41 ± 0.09 | 2.1% |
| 200 | 1.40 ± 0.07 | 1.4% |
| 500 | 1.39 ± 0.05 | — |
E.2 Other Ablations
Penalty strength. We sweep the penalty strength over {0.001, 0.01, 0.05, 0.1, 0.5, 1.0} on the MLP (5 seeds). All values induce grokking (unlike the baseline), with 0.05 optimal at 2,460 epochs. Very low strength (0.001) delays onset to 8,200 epochs; very high strength (1.0) over-constrains angular updates, slowing onset to 9,900 epochs. The method is robust across an order of magnitude (0.01 to 0.1), where onset stays at or below 4,100 epochs.
Target radius. Testing for : is optimal (2,460 ep.); is close (8,200 ep.); degrades to 12,300 epochs. The optimality is consistent with standard initialization schemes that set per-feature variance to .
Application site (MLP). Pre-activation (default) is optimal (2,460 ep.); post-ReLU is markedly slower (9,800 ep.). Constraining pre-activation norms preserves information about negative components that ReLU would zero out, maintaining a richer representational geometry.
LayerNorm interaction (Transformer).
| LayerNorm | Penalty | Grok Onset | Eff. Rank |
|---|---|---|---|
| Off | Off | DNG (100k) | — |
| Off | On | 14,500 ± 1,200 | 402 |
| On (no affine) | Off | 8,000 ± 450 | 464 |
| On (no affine) | On | 5,200 ± 400 | 475 |
| On (with affine) | Off | 7,500 ± 500 | 450 |
| On (with affine) | On | 4,200 ± 300 | 460 |
The penalty and LayerNorm compound: their combination consistently outperforms either alone, confirming complementary mechanisms.
Optimizer sensitivity. The penalty induces grokking under Adam (no WD): 9,200 epochs (vs. 78,000 baseline). Under SGD: 32,000 epochs (vs. DNG baseline). AdamW + penalty is optimal. The penalty is effective across optimizers but benefits from adaptive learning rates.
E.3 Sanity Check: Non-Algorithmic Task
To verify the penalty does not pathologically degrade standard feature learning, we applied it () to a 500K-parameter character-level Transformer (4 layers, 4 heads, ) on Tiny Shakespeare.
| Variant | Val. Loss | Perplexity | Eff. Rank |
|---|---|---|---|
| Baseline | 1.585 ± 0.011 | 4.9 | 94 |
| Norm Penalty | 1.608 ± 0.004 | 5.0 | 108 |
The penalty does not accelerate language modeling, as expected, since character-level LM lacks a sharp memorization–generalization phase transition. Crucially, it does not collapse representations (rank increases from 94 to 108) and perplexity degradation is within 2%, confirming that the penalty is benign outside its target domain.
Appendix F Experimental Setup
MLP on Modular Addition. 2-layer MLP, hidden dimension , ReLU activations. Data: ; training fraction (4,656 of 9,409 pairs). Optimizer: AdamW, , weight decay , batch size 256 (full-batch). Penalty: , applied to pre-ReLU activations of both hidden layers. Training: 100,000 epochs max; 5 independent seeds.
Small Transformer on Modular Addition. 2-layer, 4-head Transformer, , pre-LayerNorm (no affine by default). Penalty applied to each sub-layer output per Eq. 3. Same data, optimizer, and seeds as MLP.
NanoGPT on 3-Digit Addition. 6 layers, 6 heads, (10M parameters), pre-LayerNorm with affine. Reverse-format 3-digit addition (lee2023teaching); 80/20 train/test split. AdamW, , cosine decay to over 30,000 steps, weight decay , batch size 128. Penalty: ; 3 seeds. Wall-clock overhead: from 0.32 to 0.35 sec/step (+9.4%) on A6000; +1.5% peak GPU memory.