LLMs don’t give the same answer to the same input. Until now.

Like a calculator that sometimes says 2+2 is five, depending on when you hit the keys, LLMs have long behaved in ways we came to accept as unpredictable. Now, Mira Murati’s team at Thinking Machines Lab has fixed it.

The issue was not “randomness” or “creativity,” but noise in the infrastructure. Tiny nondeterministic effects inside GPU kernels ripple forward and change completions even when the input is the same. Their fix, naturally explained in jargon, was to design batch-invariant kernels for matmul, attention, and RMSNorm. The bottom line is clear: same input equals same output, every time.

Why it matters:
- Reliability in high-stakes fields: health and finance cannot accept answers that drift with server load.
- Operational savings: deterministic outputs mean caching works, cutting GPU burn.
- Open-source transparency: the fix is published for anyone to use and build on.

This does not solve the correctness of answers, since wrong outputs can still be wrong consistently. But it clears away one of the biggest barriers to making LLMs reliable at scale. Just days ago this was another open mystery in how LLMs work. Now it is progress, another win for science and engineering.

Read more here: https://lnkd.in/d7iNCjzj
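For anyone who wants to see the mechanism rather than take it on faith: the core issue is that floating-point addition is not associative, so the order in which a kernel reduces values (which can shift with how many requests are batched together) nudges the result. Here is a toy NumPy sketch of that effect, my own illustration rather than the team's actual kernel code:

```python
# Toy illustration (not the Thinking Machines kernels): float32 addition is not
# associative, so reducing the same numbers in a different grouping -- the kind
# of change that happens when batch size varies -- gives a slightly different sum.
import numpy as np

rng = np.random.default_rng(0)
x = rng.standard_normal(4096).astype(np.float32)

# One long serial reduction over the whole vector.
sum_whole = np.float32(0)
for v in x:
    sum_whole += v

# Same data, but reduced chunk by chunk, as if split across a different batch layout.
sum_chunked = np.float32(0)
for chunk in x.reshape(8, 512):
    partial = np.float32(0)
    for v in chunk:
        partial += v
    sum_chunked += partial

print(sum_whole, sum_chunked, sum_whole == sum_chunked)
# Typically the two sums differ in the last bits. Propagate such tiny drifts
# through many layers and a sampled token can flip -- which is why making the
# reduction order batch-invariant makes the output repeatable.
```

The fix is the opposite discipline: keep the reduction order fixed regardless of batch composition, so identical inputs take an identical numerical path.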
Reminds me of the old computer adage "garbage in, garbage out". A little worrying if AI needs to be treated with kid gloves; I mean, it is aimed at the masses.
Determinism would make testing easier
So now LLMs can be reliably wrong... actually, that is no surprise - so can people; and we believe that "uncertainty is built into the fabric of the universe."
Determinism across LLM versions?
Very interesting. The divergence between the creative and the mathematical models has meant they haven't been at the forefront of this wave. They seemed more like creative models that speak and understand maths. This could be a massive improvement.
Super interesting and completely logical!
Interesting 🤔
I’m curious whether we will see this implemented at scale, or whether frontier model providers will hold back, concerned about impacting the creativity or randomness that some even call a feature, not a bug, of LLMs.
My biggest issue with this article is the gap between the stated goal of determinism and what the paper is actually describing. To understand it, you need to get your math hat on (and maybe a few beers) and dig into it. What the paper describes is how to minimize the variance of mathematical calculations by controlling input batches. This isn't making the models deterministic; it is eliminating variability through input processing control, which takes out some of the mathematical processing variance.

First, I don't want my GenAI engine to be deterministic. That is what supercomputers do today: straight-line and deterministic. In some cases, I want the outlier probability to show me things I never considered.

Second, the AI engines are acting just like people: if you ask one question in a focused discussion, you get a focused answer. If you ask the same question mixed with 15 other questions in a noisy bar, you get a less focused answer.

Also, if there is a goal of determinism in LLMs, the answer is eventually going to be restricting responses to only the highest-probability response (or something limited by the developers), which is the scariest outcome.
This sounds very interesting! Also, from my experience in tech, a model that is ‘reliably wrong’ is much better to work with than one that is intermittently wrong!