Engineering a performant AI system is all about tradeoffs. As one example, when creating a vector store over which to perform retrieval-augmented generation, what size of embedding should you choose? Researchers at DeepMind sought to characterize the limitations of embedding-based retrieval systems as a function of embedding size, providing both theoretical analyses and a new benchmark called LIMIT. In this talk we'll discuss the paper, the broader context, and applications.

On the Theoretical Limitations of Embedding-Based Retrieval
Paper: https://lnkd.in/g7ur79Ze
Repo: https://lnkd.in/gkQdkjen
Factual Knowledge Acquisition in Pretraining: https://lnkd.in/gvanC-HE
Hybrid Search: https://lnkd.in/gQPsZSCq
Matryoshka Models: https://lnkd.in/gP-ZduUi
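One practical way to explore that size question: models trained with a Matryoshka objective (see the Matryoshka Models link above) pack the most information into the leading dimensions, so you can truncate stored embeddings and measure how retrieval quality degrades. A minimal sketch, assuming a sentence-transformers model trained with a Matryoshka objective (the model name below is a hypothetical placeholder):

```python
# Sketch: measure hit@1 as embeddings are truncated to smaller dimensions.
# MODEL_NAME is a placeholder; substitute any Matryoshka-trained embedding model.
import numpy as np
from sentence_transformers import SentenceTransformer

MODEL_NAME = "your-matryoshka-embedding-model"  # hypothetical placeholder
model = SentenceTransformer(MODEL_NAME)

docs = ["the cat sat on the mat", "global stock markets fell today",
        "how to bake sourdough bread", "an introduction to graph algorithms"]
queries = ["feline resting on a rug", "market downturn news",
           "bread baking recipe", "shortest path basics"]
relevant = [0, 1, 2, 3]  # index of the correct doc for each query

doc_emb = model.encode(docs, normalize_embeddings=True)
q_emb = model.encode(queries, normalize_embeddings=True)

for d in (32, 64, 128, 256, doc_emb.shape[1]):
    # Matryoshka training makes the leading dims the most informative,
    # so truncation plus re-normalization yields a valid smaller embedding.
    D = doc_emb[:, :d] / np.linalg.norm(doc_emb[:, :d], axis=1, keepdims=True)
    Q = q_emb[:, :d] / np.linalg.norm(q_emb[:, :d], axis=1, keepdims=True)
    hits = ((Q @ D.T).argmax(axis=1) == np.array(relevant)).mean()
    print(f"dim={d:4d}  hit@1={hits:.2f}")
```

Run over your own corpus, the resulting curve shows the smallest dimension that still meets your quality bar, which is exactly the tradeoff the talk is about.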
-
🔥 Dijkstra Defeated? A New Shortest Path Algorithm Takes the Lead! Hey everyone! Just came across an exciting breakthrough in graph algorithms that I had to share. For decades, Dijkstra’s algorithm was the gold standard for finding shortest paths in graphs, but now researchers have developed a faster, deterministic algorithm that beats it—at least for sparse graphs! Instead of sorting everything like Dijkstra does, this new method cleverly breaks the graph down into smaller parts and uses smart pivot selection to skip a lot of unnecessary work. Imagine slicing through complexity with divide-and-conquer magic, combining the best of Dijkstra and Bellman-Ford. The result? Way faster shortest path calculations with a cool new time complexity of O(m log^(2/3) n)! This isn’t just theory—this could supercharge AI, network routing, and real-time systems where every millisecond counts. As a student fascinated by algorithms, I’m pumped to see how this changes the game! Here is a great interactive demo for shortest path algorithms, including Dijkstra’s algorithm: https://lnkd.in/d-BVFwGB #Algorithms #GraphTheory #DijkstraDefeated #AI #MachineLearning #TechInnovation
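For context, here's the classical baseline the new result is measured against. Binary-heap Dijkstra runs in O((m + n) log n) because every settled vertex effectively passes through a priority queue; the new algorithm's whole point is sidestepping that sorting bottleneck. A minimal sketch of the baseline in Python (the new algorithm itself is far more involved):

```python
import heapq

def dijkstra(graph, source):
    """Classical Dijkstra with a binary heap: O((m + n) log n).

    graph: {u: [(v, weight), ...]} adjacency list with non-negative weights.
    Returns a dict of shortest distances from source.
    """
    dist = {source: 0}
    heap = [(0, source)]                       # (distance, vertex)
    while heap:
        d, u = heapq.heappop(heap)
        if d > dist.get(u, float("inf")):      # stale entry, skip
            continue
        for v, w in graph.get(u, []):
            nd = d + w
            if nd < dist.get(v, float("inf")):
                dist[v] = nd                   # relax edge u -> v
                heapq.heappush(heap, (nd, v))
    return dist

g = {"a": [("b", 2), ("c", 5)], "b": [("c", 1)], "c": []}
print(dijkstra(g, "a"))  # {'a': 0, 'b': 2, 'c': 3}
```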
-
It’s tempting to throw AI at every problem, but sometimes the old-school way is still the best. You have to analyze the input space: AI makes sense when the problem is truly nondeterministic or the space is too large to handle exhaustively, so a probabilistic approach adds value. But if the problem is deterministic, the smartest move is to pick the optimal data structure and algorithm. It’ll save you both money and latency.
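A concrete version of that decision rule, as a rough sketch: serve exact, known inputs from a plain dictionary in O(1), and only pay for a model call when the input falls outside the deterministic space (llm_answer below is a hypothetical stand-in for a real model client):

```python
# Sketch: deterministic fast path first, probabilistic fallback second.
KNOWN_ANSWERS = {
    "store hours": "Open 9am-6pm, Monday through Saturday.",
    "return policy": "Returns accepted within 30 days with receipt.",
}

def llm_answer(question: str) -> str:
    # Hypothetical stand-in for a real model call (slow, costly, probabilistic).
    return f"[LLM response for: {question}]"

def answer(question: str) -> str:
    key = question.strip().lower()
    if key in KNOWN_ANSWERS:           # deterministic: O(1), free, exact
        return KNOWN_ANSWERS[key]
    return llm_answer(question)        # nondeterministic input: pay for the model

print(answer("Store hours"))  # dict hit, no model call
print(answer("Can I return a gift without a receipt?"))  # falls through to the LLM
```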
-
A New Breakthrough in AI Compression: Chain-of-Thought-Based Pruning

I spent this weekend going through one of the most fascinating research papers I've read in a while, and trust me, this one is worth your time.

Paper: Reasoning Models Can Be Accurately Pruned via Chain-of-Thought Reconstruction
By: MIT
Full paper (10 pages): https://lnkd.in/dMXv7PiG

Here's the quick story. Most large reasoning models (like DeepSeek-R1) are powerful but painfully expensive to run. The more they think, the longer their chain-of-thought grows, and the more compute they need. But here's the twist: when you prune (compress) these models using traditional methods, they not only become less accurate, they often get slower too, because they start generating longer, messier reasoning steps.

This new research changes the game. It introduces RAC, Reasoning-Aware Compression. Instead of pruning based only on input data, RAC uses the model's own reasoning traces during calibration. Basically, it "learns how it thinks" and keeps the parts that actually matter for reasoning.

The result?
- Up to 50% smaller models without losing accuracy
- Faster inference (sometimes 4x faster!)
- Reasoning quality that's almost identical to the original

Why this matters: this is not just a research trick, it's a deployment-level breakthrough. It means we can run powerful reasoning models on smaller hardware, at lower cost, and at real-world scale. And it's another reminder of something I deeply believe: the future of AI isn't just about making models smarter. It's also about making them efficient, accessible, and deployable without losing what makes them powerful.

My takeaway: if you're working on reasoning agents, AI infrastructure, or cost optimization, this paper is a must-read. It's 10 pages, but it might just reshape how you think about model compression.

Comment your thoughts! Repost to help your network :)

#AIResearch #ReasoningModels #ChainOfThought #ModelCompression
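The calibration idea at the heart of the paper is easy to sketch even without the full RAC pipeline: instead of computing pruning statistics on generic text, first let the model generate its own chain-of-thought traces, then collect activation statistics while replaying those traces. A rough sketch with Hugging Face transformers; the pruning score here is a simple weight-times-activation heuristic (Wanda-style), not the paper's actual method, and the model name is just a small stand-in:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

name = "Qwen/Qwen2.5-0.5B-Instruct"  # any small causal LM works for the sketch
tok = AutoTokenizer.from_pretrained(name)
model = AutoModelForCausalLM.from_pretrained(name)

# Step 1: calibration data = the model's OWN reasoning traces, not generic text.
prompts = ["Q: A train travels 60 km in 45 minutes. What is its speed in km/h? "
           "Think step by step.\nA:"]
traces = []
for p in prompts:
    out = model.generate(**tok(p, return_tensors="pt"), max_new_tokens=128)
    traces.append(tok.decode(out[0], skip_special_tokens=True))

# Step 2: record input-activation magnitudes on a target layer while replaying.
layer = model.model.layers[0].mlp.down_proj   # one linear layer, for illustration
act_norm = torch.zeros(layer.in_features)
def hook(mod, inp, outp):
    global act_norm
    act_norm += inp[0].detach().float().abs().mean(dim=(0, 1))
h = layer.register_forward_hook(hook)
with torch.no_grad():
    for t in traces:
        model(**tok(t, return_tensors="pt"))
h.remove()

# Step 3: zero the weights with the lowest |W| * activation score.
score = layer.weight.detach().abs() * act_norm.unsqueeze(0)
threshold = score.flatten().kthvalue(int(0.5 * score.numel())).values
layer.weight.data[score <= threshold] = 0.0   # 50% sparsity on this layer
```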
-
A recent paper by Weller, Boratko, Naim, and Lee (2025), On the Theoretical Limitations of Embedding-Based Retrieval, formalizes something many of us in IR and RAG have suspected: embedding-based retrieval has inherent ceilings that cannot be overcome simply by scaling model size or training data.

🔑 Key Contributions:
- Derives theoretical bounds showing that the number of distinct top-k retrieval outcomes is limited by the embedding dimension, placing a hard cap on expressiveness.
- Uses a “free embedding optimization” setup to demonstrate that even with unconstrained embeddings optimized directly, these bounds manifest in practice.
- Introduces LIMIT, a benchmark designed to expose these constraints, on which even state-of-the-art embedding models fail despite the dataset’s simplicity.

⚙️ Implications:
- Single-vector embeddings cannot, in principle, represent all retrieval functions we may desire.
- Improvements in embedding models (scale, pretraining, fine-tuning) will eventually plateau due to these theoretical expressiveness limits.
- Future retrieval architectures will likely require hybrid approaches: multi-vector embeddings, symbolic reasoning, structured indexing, or query-dependent representations.

This work is a reminder that some bottlenecks in AI systems are not just engineering problems but theoretical ones, and overcoming them will require rethinking retrieval beyond the current embedding paradigm.

#InformationRetrieval #Embeddings #RAG #AIResearch #MachineLearning #AI
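The free-embedding setup is simple enough to replicate as a toy: skip the language model entirely, treat query and document vectors as free parameters, and optimize them directly to realize every possible top-k answer set. If even this best case fails at dimension d, no d-dimensional encoder can succeed. A minimal PyTorch sketch under those assumptions (sizes and hyperparameters are illustrative, not the paper's):

```python
import itertools, torch

n_docs, k, d = 12, 2, 4          # try to realize ALL top-2 subsets of 12 docs in 4 dims
targets = list(itertools.combinations(range(n_docs), k))  # one "query" per subset

Q = torch.randn(len(targets), d, requires_grad=True)   # free query embeddings
D = torch.randn(n_docs, d, requires_grad=True)         # free doc embeddings
opt = torch.optim.Adam([Q, D], lr=0.05)

for step in range(2000):
    scores = Q @ D.T                                   # (num_queries, n_docs)
    loss = 0.0
    for i, rel in enumerate(targets):
        rel = list(rel)
        irrel = [j for j in range(n_docs) if j not in rel]
        # Margin loss: every relevant doc should outscore every irrelevant one.
        margin = scores[i, irrel].max() - scores[i, rel].min() + 1.0
        loss = loss + torch.relu(margin)
    opt.zero_grad(); loss.backward(); opt.step()

with torch.no_grad():
    topk = (Q @ D.T).topk(k, dim=1).indices
    solved = sum(set(row.tolist()) == set(t) for row, t in zip(topk, targets))
print(f"realized {solved}/{len(targets)} top-{k} sets at d={d}")
# Increasing d makes this solvable; below a critical d it is provably impossible.
```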
-
Why We Can’t Rely on Single-Vector Embeddings for Complex Retrieval

A breakthrough paper from DeepMind sheds light on a fundamental problem with embedding-based retrieval systems: the single-vector paradigm has hard mathematical limits.

Key Insights:
- Retrieval tasks often demand combining documents in various nuanced ways, but a single fixed-dimensional embedding can’t represent all possible top-k combinations, regardless of how much data or training you throw at it.
- The authors introduce the LIMIT dataset, crafted to stress-test these limits. And guess what? Even state-of-the-art models fall short, often failing to retrieve simple, obvious matches.
- Workaround comparison: traditional sparse retrievers like BM25 and multi-vector architectures yield substantially better performance, highlighting a need to rethink our reliance on single-vector approaches.

Why This Matters:
This isn’t just academic: it impacts real-world systems that rely on embeddings for search, recommendation, and knowledge retrieval. As we ramp up toward more complex, reasoning-based AI systems, these limitations become bottlenecks we can’t ignore.

#AI #MachineLearning #InformationRetrieval #Embeddings #DeepLearning #ResearchLimitations
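The BM25 comparison is easy to try at home: sparse lexical scoring has no fixed embedding dimension to saturate, which is part of why it holds up on LIMIT-style tasks. A small sketch of a hybrid scorer, assuming the rank_bm25 package; dense_encode is a hypothetical stand-in for a real embedding model:

```python
import numpy as np
from rank_bm25 import BM25Okapi   # pip install rank-bm25

corpus = [
    "quincy likes apples and long walks",
    "martha enjoys tennis and apples",
    "the limit benchmark stresses embedding models",
]
bm25 = BM25Okapi([doc.split() for doc in corpus])

def dense_encode(texts):
    # Hypothetical stand-in for a real embedding model; returns unit vectors.
    rng = np.random.default_rng(0)
    vecs = rng.normal(size=(len(texts), 64))
    return vecs / np.linalg.norm(vecs, axis=1, keepdims=True)

doc_vecs = dense_encode(corpus)

def hybrid_scores(query, alpha=0.5):
    sparse = np.array(bm25.get_scores(query.split()))
    dense = doc_vecs @ dense_encode([query])[0]
    # Normalize each signal to [0, 1] before mixing so neither dominates.
    sparse = sparse / (sparse.max() + 1e-9)
    dense = (dense - dense.min()) / (dense.max() - dense.min() + 1e-9)
    return alpha * sparse + (1 - alpha) * dense

query = "who likes apples"
print(sorted(zip(hybrid_scores(query), corpus), reverse=True)[0][1])
```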
-
From Pattern Matching to Reasoning Search.

LightOn's late-interaction open-source stack is moving semantic search beyond theory, turning cutting-edge research into real-world AI retrieval systems.
🔹 ModernBERT: re-imagining the encoder
🔹 PyLate: effortless training of multi-vector models in hours, not weeks
🔹 FastPlaid: scaling performance for enterprise
🔹 PyLate-rs: bringing SOTA retrieval to the browser
👉🏻 Explore the journey: https://lnkd.in/eQKi7enR
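For anyone new to late interaction: instead of collapsing a document into one vector, ColBERT-style models (the kind PyLate trains) keep one vector per token and score with MaxSim, where each query token takes its best match among the document's token vectors. A minimal sketch of the scoring itself, with random unit vectors standing in for real token embeddings:

```python
import numpy as np

def maxsim_score(query_vecs, doc_vecs):
    """Late-interaction (MaxSim) relevance score.

    query_vecs: (num_query_tokens, dim), doc_vecs: (num_doc_tokens, dim),
    both L2-normalized per token.
    """
    sim = query_vecs @ doc_vecs.T    # token-to-token cosine similarities
    return sim.max(axis=1).sum()     # best doc token per query token, summed

rng = np.random.default_rng(0)
def toy_tokens(n, dim=128):          # stand-in for a real token encoder
    v = rng.normal(size=(n, dim))
    return v / np.linalg.norm(v, axis=1, keepdims=True)

query = toy_tokens(6)
doc_a, doc_b = toy_tokens(40), toy_tokens(40)
print(maxsim_score(query, doc_a), maxsim_score(query, doc_b))
```

Because each query token can match a different part of the document, multi-vector scoring sidesteps some of the single-vector expressiveness limits discussed in the posts above.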
-
At LightOn we are literally building reasoning at Enterprise scale. Information retrieval techniques have long powered numerous search engines. We are now building the stack that connects Enterprise-scale document collections to Generative AI. We're beyond traditional RAG.
-
Memory Management in AI Agents & RAG

One of the biggest challenges in building intelligent AI systems is memory. An AI agent doesn’t just need to process the current prompt: it needs to remember past conversations, store knowledge, and retrieve the right information at the right time.

Here’s how memory works in an AI agent:
🔹 Short-Term Memory → keeps the recent conversation or context (like the LLM’s token window).
🔹 Long-Term Memory → stores knowledge, user preferences, and historical data in vector databases, knowledge graphs, or summaries.
🔹 Episodic Memory → captures events & experiences to personalize interactions.
🔹 Procedural Memory → remembers workflows, how-tos, and processes.

Now combine this with Retrieval-Augmented Generation (RAG), and the agent gets access to external knowledge bases. Instead of memorizing everything, it:
1. Encodes queries into embeddings
2. Retrieves the most relevant documents from a vector DB
3. Feeds them back into the LLM for a smarter response

With proper memory management, AI agents become:
✅ More scalable
✅ Context-aware
✅ Personal and consistent
✅ Capable of handling dynamic knowledge

I created a visual architecture diagram to show how short-term, long-term, and RAG-based memory work together to power intelligent agents.

What do you think is the most important type of memory for future AI agents: episodic personalization or knowledge retrieval?

#AI #ArtificialIntelligence #GenerativeAI #RAG #VectorDatabases #MachineLearning #AITools
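A toy sketch of how those pieces fit together: a short-term buffer for recent turns, a long-term vector store for retrieval, and a prompt assembled from both. The embed function is a deterministic toy stand-in; a real system would use an embedding model and a vector database:

```python
from collections import deque
import numpy as np

rng = np.random.default_rng(0)
_cache = {}
def embed(text):
    # Hypothetical stand-in for a real embedding model (deterministic toy).
    if text not in _cache:
        _cache[text] = rng.normal(size=64)
    v = _cache[text]
    return v / np.linalg.norm(v)

class AgentMemory:
    def __init__(self, short_term_turns=5):
        self.short_term = deque(maxlen=short_term_turns)  # recent conversation
        self.long_term = []                               # (embedding, text) pairs

    def remember(self, text):
        self.long_term.append((embed(text), text))

    def retrieve(self, query, k=2):
        # RAG step: encode the query, rank stored knowledge by cosine similarity.
        q = embed(query)
        scored = sorted(self.long_term, key=lambda p: -float(p[0] @ q))
        return [text for _, text in scored[:k]]

    def build_prompt(self, user_msg):
        context = "\n".join(self.retrieve(user_msg))
        history = "\n".join(self.short_term)
        self.short_term.append(f"user: {user_msg}")
        return f"Context:\n{context}\n\nHistory:\n{history}\n\nUser: {user_msg}"

mem = AgentMemory()
mem.remember("The user's favorite database is Postgres.")
mem.remember("The user is preparing a talk on RAG.")
print(mem.build_prompt("What should I include in my talk?"))
# A real agent would send this prompt to an LLM and append its reply to short_term.
```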
-
Prompt optimization is becoming a powerful technique for improving AI that can even beat SFT! Here are some of our research results with GEPA at Databricks, in complex Agent Bricks info-extraction tasks. We can match the best models at 90x lower cost, or improve them by ~6%. Details in our research blog: https://lnkd.in/gQvNJptH

Most interestingly perhaps, reflective prompt optimization can beat SFT on the same data, or can stack with it as observed in Better Together (https://arxiv.org/abs/2407.10930). In practice it also requires fewer labels and can take in richer user feedback (ALHF: https://lnkd.in/g5yJUkKA)
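The shape of reflective prompt optimization is worth seeing in miniature: score the current prompt on labeled examples, have an LLM reflect on the failures in natural language, and let it propose a revised prompt. The sketch below is schematic, not Databricks' GEPA implementation, and llm is a hypothetical stand-in you'd wire to a real model client:

```python
def llm(prompt: str) -> str:
    # Hypothetical stand-in for a real chat-model call.
    raise NotImplementedError("wire up your model client here")

def score(prompt, examples):
    """Fraction of examples where the model's output matches the label."""
    hits = 0
    for inp, label in examples:
        if label.lower() in llm(f"{prompt}\n\nInput: {inp}").lower():
            hits += 1
    return hits / len(examples)

def reflect_and_revise(prompt, examples, rounds=5):
    best_prompt, best = prompt, score(prompt, examples)
    for _ in range(rounds):
        failures = [(i, l) for i, l in examples
                    if l.lower() not in llm(f"{best_prompt}\n\nInput: {i}").lower()]
        if not failures:
            break
        # Reflection step: the model critiques its own failures in natural
        # language and proposes an improved instruction -- no gradient updates.
        revised = llm(
            "You are improving a task prompt. Current prompt:\n"
            f"{best_prompt}\n\nIt failed on these (input, expected) pairs:\n"
            f"{failures}\n\nWrite an improved prompt."
        )
        s = score(revised, examples)
        if s > best:
            best_prompt, best = revised, s
    return best_prompt
```

Unlike SFT, the "update" is a better instruction rather than new weights, which is why this style of optimization needs fewer labels and can absorb free-form feedback.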
-
Databricks makes it easy to tune your Gen AI for cost or quality in your specific domain. Check out the research from Matei and team to see how!