🚨 Why Mira Murati’s Breakthrough Matters — and What It Reveals About LLMs

Today’s LLMs aren’t just sometimes wrong — they’re also not reproducible. Ask the same question 5 times and get 5 different answers — even with “deterministic” settings. Murati’s new lab, Thinking Machines, just showed why: the GPU kernels used in inference aren’t batch-invariant.

🧠 In plain terms: if multiple users send prompts at once, the server batches them for efficiency. Tiny differences in operation order cause floating-point shifts — which cascade into different neurons and answers. If your request runs alone, you might avoid this. But in production, it almost never does.

Their fix? New kernels (matmul, attention, RMSNorm) that guarantee: same input → same output, every time.

Why this matters:
• Compliance: Reproducibility is mandatory in regulated sectors.
• Cost: Stable outputs enable caching, cutting GPU burn.
• Productization: Expect enterprise “deterministic modes.”

⚠️ But: determinism ≠ correctness. You may now get the same wrong answer every time. Reliability needs both consistency and correctness.

https://lnkd.in/gYPaE2S6
Andre Leibovici’s Post
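To see the floating-point effect described above for yourself, here is a minimal NumPy sketch (illustrative only, not Thinking Machines' code): the same float32 values summed in three different orders give totals that typically disagree in the last bits, and inside an LLM such tiny shifts can flip a next-token choice.

```python
# Minimal sketch: floating-point addition is not associative, so reduction order matters.
import numpy as np

rng = np.random.default_rng(0)
x = rng.standard_normal(10_000).astype(np.float32)

# Same values, three different reduction orders.
sequential = np.float32(0.0)
for v in x:                      # strict left-to-right accumulation
    sequential += v

pairwise = x.sum()               # NumPy's pairwise summation
chunked = x.reshape(100, 100).sum(axis=1).sum()  # per-chunk sums, then a final combine

print(sequential, pairwise, chunked)
# The three totals usually differ in the low-order bits. A GPU kernel that changes its
# reduction strategy with batch size introduces exactly this kind of drift into logits.
```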
More Relevant Posts
-
LLMs don’t give the same answer to the same input. Until now.

Like a calculator that sometimes says 2+2 is five, depending on when you hit the keys, LLMs have long behaved in ways we came to accept as unpredictable. Now, Mira Murati’s team at Thinking Machines Lab has fixed it.

The issue was not “randomness” or “creativity,” but noise in the infrastructure. Tiny nondeterministic effects inside GPU kernels ripple forward and change completions even when the input is the same. Their fix, naturally explained in jargon, was to design batch-invariant kernels for matmul, attention, and RMSNorm, and the bottom line is clear: same input equals same output, every time.

Why it matters:
- Reliability in high-stakes fields: health and finance cannot accept answers that drift with server load.
- Operational savings: deterministic outputs mean caching works, cutting GPU burn.
- Open source transparency: the fix is published for anyone to use and build on.

This does not solve the correctness of answers, since wrong outputs can still be wrong consistently. But it clears away one of the biggest barriers to making LLMs reliable at scale.

Just days ago this was another open mystery in how LLMs work. Now it is progress — another win for science and engineering.

Read more here: https://lnkd.in/d7iNCjzj
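On the caching point above: once the same input reliably yields the same output, completions can be reused. A minimal sketch, assuming a placeholder generate callable standing in for whatever inference client you use:

```python
# Sketch of a completion cache that is only safe if inference is deterministic.
import hashlib
import json

_cache: dict[str, str] = {}

def cached_generate(generate, model: str, prompt: str, **params) -> str:
    """generate is any callable (model, prompt, **params) -> str; a placeholder, not a real API."""
    # Key on everything that influences the output.
    key = hashlib.sha256(
        json.dumps({"model": model, "prompt": prompt, "params": params}, sort_keys=True).encode()
    ).hexdigest()
    if key not in _cache:
        _cache[key] = generate(model, prompt, **params)
    return _cache[key]
```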
-
🚨𝐁𝐢𝐠 𝐧𝐞𝐰𝐬 𝐢𝐧 𝐀𝐈 𝐢𝐧𝐟𝐫𝐚𝐬𝐭𝐫𝐮𝐜𝐭𝐮𝐫𝐞: 𝐌𝐢𝐫𝐚 𝐌𝐮𝐫𝐚𝐭𝐢’𝐬 𝐧𝐞𝐰 𝐜𝐨𝐦𝐩𝐚𝐧𝐲 '𝐓𝐡𝐢𝐧𝐤𝐢𝐧𝐠 𝐌𝐚𝐜𝐡𝐢𝐧𝐞𝐬' 𝐦𝐚𝐲 𝐡𝐚𝐯𝐞 𝐜𝐫𝐚𝐜𝐤𝐞𝐝 𝐭𝐡𝐞 𝐜𝐨𝐝𝐞 𝐨𝐧 𝐋𝐋𝐌 𝐧𝐨𝐧𝐝𝐞𝐭𝐞𝐫𝐦𝐢𝐧𝐢𝐬𝐦!

𝑾𝒆’𝒗𝒆 𝒂𝒍𝒍 𝒔𝒆𝒆𝒏 𝒊𝒕:
👉 Run the same prompt at temperature 0, and you still get different outputs!
👉 Traditionally, this was blamed on floating-point quirks and GPU concurrency.

According to Thinking Machines (founded by former OpenAI CTO Mira Murati), the 𝘳𝘦𝘢𝘭 𝘤𝘶𝘭𝘱𝘳𝘪𝘵 is something deeper: 𝐥𝐚𝐜𝐤 𝐨𝐟 𝐛𝐚𝐭𝐜𝐡 𝐢𝐧𝐯𝐚𝐫𝐢𝐚𝐧𝐜𝐞 𝐢𝐧 𝐢𝐧𝐟𝐞𝐫𝐞𝐧𝐜𝐞 𝐤𝐞𝐫𝐧𝐞𝐥𝐬.

Here’s what that means 👇
🔹 𝐁𝐚𝐭𝐜𝐡 𝐢𝐧𝐯𝐚𝐫𝐢𝐚𝐧𝐜𝐞 = A model should output the 𝘴𝘢𝘮𝘦 𝘳𝘦𝘴𝘶𝘭𝘵 for a prompt, no matter how requests are batched.
🔹 In reality, operations like matmul, attention, and normalization vary their internal strategies depending on batch size.
🔹 This changes reduction orders → tiny numerical differences → huge divergences in long generations.

The fix?
✅ Thinking Machines built 𝐛𝐚𝐭𝐜𝐡-𝐢𝐧𝐯𝐚𝐫𝐢𝐚𝐧𝐭 𝐤𝐞𝐫𝐧𝐞𝐥𝐬 for key ops (RMSNorm, matmul, attention).
✅ Testing with Qwen-3-8B:
- Default kernels → 1,000 runs = 80 unique completions.
- Modified kernels → 1,000 runs = 𝘢𝘭𝘭 𝘪𝘥𝘦𝘯𝘵𝘪𝘤𝘢𝘭 𝘰𝘶𝘵𝘱𝘶𝘵𝘴.

The catch: It runs slower. But for research, safety, and debugging, determinism may matter more than raw speed. As they put it:
👉 “Reproducibility is a bedrock of scientific progress.”

This shift in framing—from floating-point quirks to 𝐛𝐚𝐭𝐜𝐡 𝐢𝐧𝐯𝐚𝐫𝐢𝐚𝐧𝐜𝐞—could influence how future inference engines are designed.

⚡ Question for you: Do you think 𝘥𝘦𝘵𝘦𝘳𝘮𝘪𝘯𝘪𝘴𝘮 should become a standard requirement in LLM deployment, or will speed always win?

𝐑𝐞𝐚𝐝 𝐦𝐨𝐫𝐞 𝐚𝐭: https://lnkd.in/d_uC3JgK
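If you have a GPU available, a quick PyTorch probe of batch (non-)invariance looks roughly like this. It is illustrative only (not the Qwen-3-8B setup above), and whether the two results match bit-for-bit depends on your hardware and which kernels get selected:

```python
# Compare the same row computed alone vs. inside a larger batch (requires a CUDA device).
import torch

torch.manual_seed(0)
layer = torch.nn.Linear(4096, 4096).half().cuda()
x = torch.randn(32, 4096, dtype=torch.float16, device="cuda")

alone = layer(x[:1])        # batch of 1
in_batch = layer(x)[:1]     # the same row, computed as part of a batch of 32

print(torch.equal(alone, in_batch))           # often False: different kernels, different reduction order
print((alone - in_batch).abs().max().item())  # tiny, but enough to flip an argmax over logits downstream
```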
-
Quantum Annealing with Machine Learning Tunes Penalty Parameters for Minimum Bisection Problem Solutions Researchers developed a machine learning-guided method to dynamically adjust parameters in quantum annealing solvers, substantially improving their ability to efficiently solve complex network partitioning problems and surpassing the performance of conventional algorithms. #quantum #quantumcomputing #technology https://lnkd.in/eANBGAEz
-
🧪🧠 Why “temperature = 0” still isn’t deterministic (for LLMs)—and how to fix it

Mira Murati’s (former CTO of OpenAI) new lab, Thinking Machines Lab, just launched a research blog called Connectionism (a nod to the 1980s movement linking neural nets and the brain). The first post tackles a deceptively simple question: If randomness is set to zero, why do LLMs still answer differently?

The common story (and why it’s incomplete)
We usually blame parallel math on GPUs: floating-point adds happen in different orders, tiny round-off differences cascade, and voilà—different outputs. True… but the post shows a deeper culprit: dynamic batching in production. When user traffic spikes, the serving stack constantly reshapes batches (8 sequences now, 32 a second later, 4 after that) to keep the GPU busy. Those shifting batch shapes change kernel tiling, fusion, and reduction order—nudging intermediate values just enough that the next token decision can flip, even with temperature = 0.

The fix they demonstrate
Build batch-invariant kernels and inference routines that keep per-sequence math and reduction order stable regardless of batch size. In short: make the GPU do “the same work the same way” for a given prompt, no matter what else is in the batch. Determinism stops depending on traffic.

Why this matters outside the lab
• Safety & alignment evals: if outputs drift run-to-run, you can’t trust red-team results or regression tests.
• Incident triage: reproducible failures get fixed faster; non-determinism turns bugs into ghosts.
• Compliance & audits: deterministic settings create traceable, defensible evidence for high-stakes use cases.

Practical takeaways
• For day-to-day prod, perfect determinism isn’t required (and can cost throughput).
• For testing and safety gates, lock the stack: deterministic kernels, fixed/random-seeded inference, pinned compiler/runtime versions, and batch-invariant execution.
• Treat “deterministic mode” as a toggleable profile—on for evals and incident reproduction, off for maximum QPS.

Smart, surgical engineering—exactly the kind of groundwork that makes agentic systems safer and easier to govern. More of this, please.

#LLM #Determinism #Inference #MLOps #AIResearch #AgenticBusinessSolutions
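For the “toggleable deterministic profile” idea on a self-hosted PyTorch stack, the standard knobs look roughly like the sketch below. They pin seeds and kernel selection locally; they do not, on their own, make a dynamically batched serving stack batch-invariant:

```python
# A minimal "deterministic profile" toggle for local PyTorch inference.
import os
import torch

def set_deterministic_profile(enabled: bool, seed: int = 0) -> None:
    if enabled:
        os.environ["CUBLAS_WORKSPACE_CONFIG"] = ":4096:8"  # required by some deterministic CUDA ops
        torch.manual_seed(seed)
        torch.use_deterministic_algorithms(True)
        torch.backends.cudnn.benchmark = False   # stop cuDNN from auto-tuning kernel choice per shape
    else:
        torch.use_deterministic_algorithms(False)
        torch.backends.cudnn.benchmark = True    # favour throughput

# On for evals and incident reproduction, off for maximum QPS.
set_deterministic_profile(enabled=True)
```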
-
🤠 It's the Wild West out there folks. 🌵

> In other words, the primary reason nearly all LLM inference endpoints are nondeterministic is that the load (and thus batch-size) nondeterministically varies!

This research is jaw dropping. I had been led to believe that running inference with `temperature = 0` would return the same results, every time, with the same inputs, on the same model. Determinism. BUT IT IS NOT SO.

So that means that those AI evals you're running? They really don't prove much. If you run them this morning, and they all pass, they might just fail this afternoon, for no apparent reason. All that work you're doing to nail the prompts? They won't always perform the way you expect they will in production, regardless of how meticulously you tweak them.

The only way to run deterministic inference right now, as far as I can tell:
1. Run an open source model on your own hardware. Like, go buy a pile of NVIDIA GPUs and stick them in a rack at your neighborhood data center.
2. Run an open source model on dedicated VMs with GPUs. On AWS, this will set you back ~$30K/month if you run it 24/7.

And whether you choose 1 or 2, make sure you also use the batch-invariant kernels library (https://lnkd.in/gdn5qYY4) to prevent batched operations from fiddling with the floating point matrix math.

Notice that these solutions do not use a proprietary model. You have to run the model yourself. No APIs from OpenAI, Anthropic, or Google. No APIs, period. Because they haven't addressed this flaw yet. And since running your own models on hardware you control isn't realistic for most of us, we're stuck (for now) with running inference that, under unpredictable load, could return materially different results.

And I haven't even brought up performance hits yet. I'd be really happy to be wrong about this. Anyone see this differently?

https://lnkd.in/gk9hveFh
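One cheap sanity check for your own evals: replay a single prompt many times at temperature 0 and count how many distinct completions come back. The complete argument below is a placeholder for whatever client you call; the helper is just bookkeeping:

```python
# Count distinct completions for one prompt replayed n times.
from collections import Counter

def count_unique_completions(complete, prompt: str, n: int = 100) -> Counter:
    """complete: placeholder callable prompt -> completion text (your own API client)."""
    return Counter(complete(prompt) for _ in range(n))

# counts = count_unique_completions(my_client_call, "Summarize this incident report.")
# len(counts) == 1 means reproducible for this prompt under today's load; > 1 means your evals can flake.
```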
-
Very impressive blog: the non-determinism seen in LLM online inference is not a flaw in the inference engine, but stems from how GPU kernels are implemented for performance optimization. It's impressive that such simple experiments resulted in such significant findings. I've noticed that these blog posts by thinkingmachines.ai are of very high quality. https://lnkd.in/gdqe-3NB
-
Language Models are surprisingly precise guesstimators! We believe this is a key step in the ‘AI Scientist’ vision, serving as a cheap, uncertainty-aware feedback signal that can consume any data (unstructured notes, code, even multimodal!).

Regression Language Models (RLMs) can consume code, server logs, and even graphs to predict outcomes. Think predicting the accuracy of models with 200+ nodes before the model is even trained, the latency of Triton kernels on GPUs, or server-scale hardware utilization with nearly no feature engineering.

Paper: https://lnkd.in/eP4v73kG
w/ Xingyou Song, Arissa Wongpanich, Bryan Lewandowski, Mohamed Abdelfattah
Code: https://lnkd.in/e9ZHSuA4
Datasets: https://lnkd.in/ensFnEiq
-
The "non-determinism" of LLMs has been an issue (LLMs are notorious to produce variants of the answers to the same question) and it poses GenAI to be unsuitable in many mission-critical applications. Imagine a model providing different risk assessments for the same market conditions or varying diagnostic suggestions or treatment plans for the same patient data could lead to severe or even life-threatening consequences. Ex-CTO of OpenAI Mira Murati and her team at Thinking Machines Lab may have paved the way to this exact problem by thoroughly investigating and pinpointing the root causes of LLM inference nondeterminism. We have all been taught that the common "concurrency + floating point" hypothesis (e.g., atomic adds) is the primary reason for nondeterminism in the LLMs which may not be the case anymore. Murati's team discovered that the lack of "batch invariance" in kernels is the true culprit for this thorny issue in LLMs. Batch size refers to the number of individual user requests that are processed simultaneously by the GPUs. When the batch size changes (which happens due to varying server load from concurrent requests), each element in the batch can produce different results. By meticulously designing and implementing batch-invariant kernels for key operations (like RMSNorm, Matrix Multiplication and Attention) the team has demonstrated truly reproducible inference. This work is a significant step forward in making LLMs truly reliable and auditable for critical applications! Here's the blog post to digest at your own time: https://lnkd.in/eYHG7aUx
-
Handling nondeterministic LLM outputs is a real challenge—even at temperature 0.

I came across 𝗗𝗲𝗳𝗲𝗮𝘁𝗶𝗻𝗴 𝗡𝗼𝗻𝗱𝗲𝘁𝗲𝗿𝗺𝗶𝗻𝗶𝘀𝗺 𝗶𝗻 𝗟𝗟𝗠 𝗜𝗻𝗳𝗲𝗿𝗲𝗻𝗰𝗲 by Horace He and team at Thinking Machines (founded by ex-OpenAI CTO Mira Murati): https://lnkd.in/gP7bdb4h

The paper starts by demonstrating floating-point non-associativity ((a+b)+c ≠ a+(b+c)) and then explains atomic adds. Atomic adds split tensor reductions across GPU cores and combine results into a single sum; since cores finish in nondeterministic order, the final accumulation is also nondeterministic. The paper claims that the LLM forward pass is deterministic for a single query with a fixed batch size, but real deployments batch concurrent requests, and varying batch sizes change reduction orders, introducing nondeterminism. The paper proposes batch-invariant strategies for RMSNorm, Matmul, and Attention to address this.

🔹 RMSNorm - The proposal is to process each element in the batch on a single core.
🔹 Matrix multiplication - Efficient matmul kernels rely on instructions that operate on whole “tiles,” and the internal reduction order can vary depending on which instruction is chosen. Different instructions may be used based on batch size—for instance, small batches waste compute on large tiles, and at batch size 1, tensor cores are often skipped entirely. To enforce batch invariance, one can fix a single kernel configuration for all shapes, sacrificing some performance but avoiding nondeterminism, which is generally acceptable since LLMs usually have a large model dimension.
🔹 Attention - Attention is more complex because it operates not only over feature dimensions but also over the sequence dimension, which encodes token positions and guides the model on where to “pay attention”. A key element here is the use of key–value (KV) caches during inference. For background, see the seminal paper “Attention Is All You Need” by Vaswani et al. The details of ensuring batch invariance in attention would exceed the length of this post, so I encourage you to explore the paper directly.

In the authors’ tests with 1000 completions (temp=0, 1000 tokens each), standard kernels produced 80 unique outputs, while batch-invariant kernels made all 1000 identical. Runtime rose from 26s to 42s, but the authors see room for further kernel optimizations.

I would love to see further benchmarks on longer outputs, larger KV caches, and concurrent loads. Taming nondeterminism will be an absolute game-changer for the quality and real-world adoption of GenAI.
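For intuition on the RMSNorm item above, here is the math in plain PyTorch with the reduction taken independently per row. Real batch invariance is enforced at the kernel level (e.g., one core per batch element, fixed reduction order); this sketch only shows that each element's statistics never involve the rest of the batch:

```python
# RMSNorm reference: each row is normalised by its own root-mean-square over the hidden dim.
import torch

def rms_norm(x: torch.Tensor, weight: torch.Tensor, eps: float = 1e-6) -> torch.Tensor:
    # x: (batch, hidden); the mean is per-row, so other batch elements play no part.
    rms = torch.sqrt(x.pow(2).mean(dim=-1, keepdim=True) + eps)
    return (x / rms) * weight

out = rms_norm(torch.randn(8, 4096), torch.ones(4096))
```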
-
We solved LLM Non-Determinism pre-generation, training-free, so now you don't have to pay to replace every GPU kernel for "deterministic" ML inference.

The problem: LLMs at temperature=0 aren't deterministic. When heavy machinery, medical devices, or financial systems depend on reproducible outputs, "usually consistent" isn't good enough.

Thinkylabs solution: Rewrite CUDA from scratch. Wait for vendors to adopt. Hope it doesn't break with the next PyTorch update. Oh, and it still doesn't handle prompt reordering.

Our solution: Accept hardware noise exists. Detect where it matters. Stabilize only those points. Zero overhead when tokens are already robust.

Key innovations:
- Provably deterministic when μ ≥ 2r under bounded noise
- Statistical certificates (95%+ confidence) elsewhere
- Selective intervention: Only pay costs at true inflection points
- Observable uncertainty: Full audit trail with p_flip per token

Works today: Compatible with OpenAI, Anthropic, any provider with logprobs (but built with OpenAI for now)

Result: Deterministic guarantees where possible, statistical confidence everywhere else, with 1.0-1.2× overhead (vs 10-30× for naive approaches).

We didn't need to rebuild CUDA. We didn't need millions in funding. We needed mathematical rigor to identify exactly where determinism is achievable and statistical methods where it isn't. Code is open source.

Sometimes the best engineering isn't replacing everything - it's proving what's already stable and fixing only what isn't.

code: https://lnkd.in/ep9_zYSJ
readme: https://lnkd.in/eytPwmch

#MachineLearning #LLM #Engineering #Reproducibility #OpenSource #SafetyCritical
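A rough reading of the margin idea above (a conceptual sketch, not the project's actual code): if every logit can be perturbed by at most r under bounded hardware noise, a decoding step whose top-1 vs top-2 gap μ is at least 2r cannot have its argmax flipped, so only smaller-margin tokens need intervention or a statistical certificate.

```python
# Conceptual margin test per decoding step, using top logprobs returned by a provider.
def token_is_noise_robust(top_logprobs: dict[str, float], r: float) -> bool:
    """top_logprobs: token -> logprob for the top candidates; r: assumed per-logit noise bound."""
    ranked = sorted(top_logprobs.values(), reverse=True)
    if len(ranked) < 2:
        return True  # no visible competitor among the returned candidates
    mu = ranked[0] - ranked[1]          # gap between best and runner-up
    return mu >= 2 * r                  # noise of size r cannot close a gap of at least 2r

# token_is_noise_robust({" the": -0.01, " a": -4.7}, r=1e-3)  -> True: argmax cannot flip
```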
-