Review Residuals

An update-conditioned residual gate whose advantage emerges at scale.

Review Residuals scale each transformer sublayer's proposed update by a small learned gate conditioned on both the current state and the proposed update — an in-network analogue of the human-performance practice of independent verification: check a proposed action before committing to it.

TL;DR

Trained from scratch (60M–1B parameters, multiple seeds, TinyStories), with both baselines parameter-matched up to Review so any win is not a capacity advantage:

Model size	Review (ours)	Highway	Standard residual	Review wins?
60M	1.6891	1.6923	1.6805	no — standard is slightly better
150M	1.5576	1.5625	1.5548	tie
320M	1.5001	1.5042	1.5037	tie (barely ahead)
590M	1.4795	1.4889	1.4901	yes, significant (p<0.05)
1B	1.4876	1.5031	1.5040	yes, by more (strong trend)

(Validation loss; lower is better.) The advantage grows with scale rather than fading: absent at small scale, statistically significant over both a Highway gate and the plain standard residual at 590M, and larger at 1B. We also report a depth-stability finding: a convex (Highway-style) form of the gate reintroduces vanishing gradients and fails to train past ~20 layers, while the additive, identity-preserving form trains stably at all depths tested.

Method

h_l = h_{l-1} + r_l ⊙ u_l ,    r_l = σ( W · [ RMSNorm(h_{l-1}), RMSNorm(u_l) ] )

where u_l is the sublayer's proposed update. The gate is added (identity path preserved), and — uniquely — is conditioned on the update itself. See paper/mechanism_additive.png.

Repository layout

paper/   the technical paper + plain-English companion (PDF + LaTeX source + figures)
src/     training/eval code, the cloud-GPU notebook, and analysis/figure scripts
data/    scaling_v8.csv — the per-run validation losses behind every number in the paper
Old/     superseded drafts, intermediate runs, and earlier (convex-form) figures

Reproduce

Install: pip install -r requirements.txt (training needs a CUDA GPU; 590M/1B need ~80GB).

Verify the analysis (no GPU): python src/analyze_results.py — reproduces the results table and Welch t-tests from data/scaling_v8.csv.
Regenerate the figure: python src/make_emergence_figure.py — rebuilds paper/emergence_curve.png from the data.
Re-run training (GPU): python src/train.py — trains the full sweep (Review vs parameter-matched Highway and standard, 60M–1B, multi-seed). Resumable; writes scaling_v8.csv. src/run_on_runpod.ipynb wraps this for a rented cloud GPU.

Limitations

Stated plainly (see the paper's Limitations section): in absolute terms the effect is small at the ≤1B scales we could afford (~1% of loss), and the 1B result is a strong trend rather than significant at p<0.05; the study is single-dataset (TinyStories), undertrained under a fixed token budget, and below frontier scale. But the magnitude is not static — it grows across the tested range (negative at 60M to +0.016 nats at 1B). The caveat is the absolute size at ≤1B, not the direction, which points upward.

Papers

paper/review_residuals.pdf — the technical paper.
paper/review_residuals_plain_english.pdf — an illustrated, jargon-free companion.

Citation

@misc{kramer2026reviewresiduals,
  title         = {Review Residuals: Update-Conditioned Residual Gating for Transformers},
  author        = {Kramer, Kyle},
  year          = {2026},
  eprint        = {2606.31859},
  archivePrefix = {arXiv},
  primaryClass  = {cs.LG},
  doi           = {10.5281/zenodo.21053343},
  note          = {NeraTech LLC}
}

License

Licensed under the Apache License 2.0 — see LICENSE and NOTICE. Permissive reuse with attribution, plus an explicit patent grant. If you use this work, please cite it (above) and preserve the NOTICE.

Links

Paper (Zenodo, DOI): https://doi.org/10.5281/zenodo.21053343
Code (GitHub): https://github.com/SixSigmaEngineer/review-residuals
Paper (arXiv): https://arxiv.org/abs/2606.31859 (arXiv:2606.31859)

Kyle Kramer · NeraTech LLC · 2026

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

Review Residuals

TL;DR

Method

Repository layout

Reproduce

Limitations

Papers

Citation

License

Links

About

Uh oh!

Releases

Packages

Uh oh!

Contributors

Uh oh!

Languages

Name		Name	Last commit message	Last commit date
Latest commit History 4 Commits
Zenodo		Zenodo
data		data
paper		paper
src		src
.gitattributes		.gitattributes
.gitignore		.gitignore
CITATION.cff		CITATION.cff
LICENSE		LICENSE
NOTICE		NOTICE
README.md		README.md
requirements.txt		requirements.txt

Folders and files

Latest commit

History

Repository files navigation

Review Residuals

TL;DR

Method

Repository layout

Reproduce

Limitations

Papers

Citation

License

Links

About

Resources

License

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Contributors

Uh oh!

Languages

Packages