I started thinking about #LLM #memory a long time ago, but probably more structurally after reading this early 2024 paper from Princeton: Cognitive Architectures for Language Agents (https://lnkd.in/gaM8mtKa). It outlines different types of memory systems: procedural, semantic, episodic, etc., laying a foundation for so much that has come since. We now see solutions like #Mem0 and #Zep, which have brought structured, graph-based memory to LLMs. This approach is a clear win: it makes temporal tracking possible and mimics how our own brains work. Of course, graphs are an enhancement on top of what we commonly use. The field is evolving at an incredible pace, with recent papers like #MemOS, #A-Mem, and #MemTree pushing the boundaries even further.
This makes me wonder: what's next for memory systems? One promising path is creating specialized memory for different types of knowledge, which mirrors human cognition. We all want memory that is universally applicable, personalized, and that prioritizes the information we access most frequently. But what does this really mean underneath?
The more I think about it, the more this future looks like a new kind of "#GoogleSearch": one that aggregates both unstructured and structured data. The key difference, though, isn't just reading; it is continuous writing, reconciliation, and consolidation. What if this engine could observe, learn, and write to our own private data sources? We all perceive and listen far more than we express ourselves. Imagine an LLM that doesn't just browse the web as a read-only tool but actively learns and updates its own knowledge base. Building a "Google Search" is extremely hard, but imagine a new type of "knowledge engine" that puts equal weight on "#search" and "#assimilation". That's a powerful next step.
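To ground what "equal weight on search and assimilation" could mean in code, here is a deliberately tiny sketch of such a memory layer. Everything in it (the class names, the keyword-overlap scoring, the access-count heuristic) is an assumption for illustration, not how Mem0, Zep, or any of the papers above actually work.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone

@dataclass
class MemoryItem:
    text: str
    kind: str                       # "episodic", "semantic", "procedural", ...
    created: datetime = field(default_factory=lambda: datetime.now(timezone.utc))
    access_count: int = 0           # used to prioritize frequently accessed memories

class KnowledgeEngine:
    """Toy memory layer where writing is as important as reading."""

    def __init__(self) -> None:
        self.items: list[MemoryItem] = []

    def search(self, query: str, top_k: int = 3) -> list[MemoryItem]:
        """Read path: rank memories by crude keyword overlap, boosted by usage."""
        q = set(query.lower().split())
        ranked = sorted(
            self.items,
            key=lambda m: len(q & set(m.text.lower().split())) + 0.1 * m.access_count,
            reverse=True,
        )[:top_k]
        for m in ranked:
            m.access_count += 1
        return ranked

    def assimilate(self, observation: str, kind: str = "episodic") -> None:
        """Write path: reconcile a new observation against what is already stored."""
        for m in self.items:
            if m.text == observation:   # trivial reconciliation: exact duplicates merge
                m.access_count += 1
                return
        self.items.append(MemoryItem(observation, kind))

    def consolidate(self, max_items: int = 1000) -> None:
        """Periodic consolidation: keep the most-used memories, drop the long tail."""
        self.items.sort(key=lambda m: m.access_count, reverse=True)
        del self.items[max_items:]
```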
More Relevant Posts
How to choose the best LLM to route traffic to? Google Research proposes a new method called Speculative Cascades, combining Cascades and Speculative Decoding. They propose a system where a small and large model work together, chunk by chunk. The small model drafts a short piece of the answer. In parallel, the large model checks this draft. A smart deferral rule, based on a probabilistic match, decides if the draft is good. If the match succeeds, the chunk is accepted, and the process repeats on the next chunk. If it fails, the large model takes over for just that one piece before letting the small model continue. Tackling the critical challenge of inference efficiency, a hot area of research, they achieve major performance gains. Their method provides much better results for the same computational cost across a range of demanding language tasks, including coding, reasoning, and summarization. It delivers higher speeds and better quality scores than older approaches, marking a significant step forward in optimizing the trade-off between performance and cost. Fascinating results paving the way towards higher quality inference with better speed, cost and effectiveness tradeoffs.
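To make the chunk-by-chunk flow concrete, here is a toy Python sketch of the control loop; it is a simplification, not Google's method. Both models are stand-ins that map a token prefix to a next-token probability distribution, and the deferral rule is reduced to a simple probability threshold on the drafted token.

```python
# Toy sketch of a speculative-cascade-style loop: the small model drafts a
# short chunk, the large model checks each drafted token, and a deferral rule
# (here just a probability threshold) decides whether to accept it or let the
# large model take over for that position.

def greedy(dist: dict) -> str:
    """Pick the most probable token from a {token: prob} distribution."""
    return max(dist, key=dist.get)

def speculative_cascade(prompt, small_model, large_model,
                        chunk_len=4, max_new_tokens=64, accept_threshold=0.7):
    out = list(prompt)
    while len(out) - len(prompt) < max_new_tokens:
        # 1. The small model drafts a short chunk of tokens.
        draft, prefix = [], list(out)
        for _ in range(chunk_len):
            tok = greedy(small_model(prefix))
            draft.append(tok)
            prefix.append(tok)

        # 2. The large model checks the draft (in the real system, in parallel).
        #    Accept a drafted token if the large model also gives it enough
        #    probability; otherwise defer to the large model for that position
        #    and let the small model resume drafting on the next chunk.
        prefix = list(out)
        for tok in draft:
            dist = large_model(prefix)
            if dist.get(tok, 0.0) >= accept_threshold:
                prefix.append(tok)              # draft token accepted
            else:
                prefix.append(greedy(dist))     # deferral: large model takes over
                break
        out = prefix
    return out
```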
#CaraOne = meaningful results. Smart design can only get you so far; once you scale, the same similarly weighted results keep coming back. CaraOne exemplifies meaningful research into ensuring that, as you scale, you still get balanced semantic search results.
The scary truth: conventional semantic search doesn't work. RAGs collapse when you scale. The demo looks fine. The PoC feels solid. Then you index millions of assets and it breaks. The same mediocre results keep showing up. The valuable ones disappear in the noise. Quality collapses right when you need it most. The reason is simple but unavoidable: mathematics in vector databases. Here's why it happens - and why CaraOne doesn't degrade, even at the largest scale: https://lnkd.in/djVtE-7y
I just published Vol. 122 of "Top Information Retrieval Papers of the Week" on Substack. My Substack newsletter features the 7-10 most notable research papers on information retrieval (including recommender systems, search & ranking, etc.) from each week, with a brief summary and links to the paper/codebase. This week's newsletter highlights the following research work:
📚 Scalable Cross-Entropy Loss with Negative Sampling for Industrial Recommendation Systems, from Zhelnin et al.
📚 Unified LLM Architecture for Large-Scale Job Search Query Understanding, from LinkedIn
📚 What News Recommendation Research Doesn't Teach About Building Real Systems, from Higley et al.
📚 A Systematic Evaluation of Large Language Models for Cross-Lingual Information Retrieval, from LMU Munich
📚 Interactive Two-Tower Architecture for Real-Time Candidate Filtering in Recommender Systems, from Ant Group
📚 Efficient Inference for Generative LLM Recommenders via Hidden State Matching, from Wang et al.
📚 A Framework for Training Embedding Models from Scratch, from Tencent
📚 A Comprehensive Review of Large Language Models in Document Intelligence, from Ke et al.
📚 A Modular Analysis of LLM-Based Feature Extraction for Sequential Recommendation, from Shi et al.
📚 Systematic Data Augmentation for Enhanced Generative Recommendation, from Lee et al.
#InformationRetrieval #ResearchPapers #CuratedContent #Newsletter #substack
Recent advances in Retrieval-Augmented Generation (RAG) systems have tackled a major bottleneck by identifying that retrieved document passages often lack meaningful interrelations, leading to wasted computation when standard models make all passages “attend” to each other. By switching to compressed chunk embeddings instead of full token-level processing, researchers achieved up to 30x faster response times without losing accuracy, making previously impractical real-time applications like customer service and interactive research feasible for complex RAG deployments. Meta Paper: https://lnkd.in/eXQEvZ6G
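As a rough illustration of why compressing passages helps (not the paper's actual architecture), the sketch below mean-pools each retrieved passage into a single chunk embedding and compares the attention cost over chunks versus over all passage tokens. The shapes and the pooling choice are assumptions for illustration only.

```python
import numpy as np

# Instead of letting every token of every retrieved passage attend to every
# other token, each passage is compressed into a single chunk embedding
# (here by simple mean pooling) that the generator conditions on.

def compress_passages(passage_token_embs: list[np.ndarray]) -> np.ndarray:
    """passage_token_embs: one (chunk_len, d) array per retrieved passage."""
    return np.stack([toks.mean(axis=0) for toks in passage_token_embs])

# With k passages of length L and hidden size d:
#   cross-attention over all passage tokens scales with (k * L)^2,
#   while attention over chunk embeddings scales with k^2 (an L^2 reduction in that term).
k, L, d = 8, 256, 768
passages = [np.random.randn(L, d).astype(np.float32) for _ in range(k)]
chunk_embs = compress_passages(passages)          # shape (8, 768) instead of (2048, 768)
print(chunk_embs.shape, (k * L) ** 2 / k ** 2)    # -> (8, 768) 65536.0
```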
If you'd like to better understand inference-time scaling (the part of a language model's workflow most responsible for approximating reasoning), you could do a lot worse than this paper from March of this year. The researchers assess both traditional LLMs and purpose-built LRMs, which enjoy the benefit of extensive RL training to elicit tokenwise representations of reasoning trajectories, against a broader-than-standard variety of reasoning tasks. Their results are important not just for assessing what inference-time scaling techniques add to model performance, and how top model performance varies across a wider field of reasoning tasks, but also for addressing operating cost predictability on a model-by-model basis. Have a look! https://lnkd.in/e2qg2b3K
𝗥𝗼𝘂𝘁𝗶𝗻𝗴 𝘁𝗵𝗲 𝗥𝗶𝗴𝗵𝘁 𝗟𝗟𝗠 𝗳𝗼𝗿 𝗘𝘃𝗲𝗿𝘆 𝗤𝘂𝗲𝗿𝘆
Most teams pick one LLM and hope it fits every task. Dynaroute does the opposite: it chooses the best model per request, so you get lower cost, faster latency, and higher quality.
𝗘𝘅𝗮𝗺𝗽𝗹𝗲 𝗽𝗮𝘁𝗵𝘄𝗮𝘆𝘀:
• Simple Q&A → Small LLM (e.g., Gemini 2.5 Flash Lite / GPT-5 Nano)
• Complex reasoning → Large LLM (e.g., Gemini 2.5 Flash Thinking / GPT-5 / GPT-5 Mini)
• Math / Code → Code-focused LLM (e.g., Claude Sonnet / GPT-5 Codex)
𝗪𝗵𝘆 𝗶𝘁 𝗺𝗮𝘁𝘁𝗲𝗿𝘀
• Pay only for the capacity you need
• Hit stricter latency and reliability targets with automatic fallbacks
• Keep flexibility as better models arrive, with no rewrites
Launch video: https://lnkd.in/d8jWu3Fe
Try Dynaroute: https://lnkd.in/dga_Egsz
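For intuition, here is a minimal sketch of what per-query routing can look like. It is not Dynaroute's API: the model names are placeholders, and the heuristic classifier stands in for whatever learned router a real system would use.

```python
import re

# Illustrative per-query routing sketch. ROUTES holds placeholder model names;
# classify() is a crude heuristic standing in for a learned router.

ROUTES = {
    "small": "small-model-placeholder",
    "large": "large-model-placeholder",
    "code":  "code-model-placeholder",
}

def classify(query: str) -> str:
    """Pick a tier: code-ish queries, long or analytic queries, everything else."""
    if re.search(r"\b(def|class|SELECT|bug|stack trace|unit test|refactor)\b", query, re.I):
        return "code"
    if len(query.split()) > 60 or re.search(r"\b(prove|derive|step by step|analyze)\b", query, re.I):
        return "large"
    return "small"

def route(query: str, call_model) -> str:
    """call_model(model_name, prompt) is whatever LLM client you already use."""
    tier = classify(query)
    try:
        return call_model(ROUTES[tier], query)
    except Exception:
        # Automatic fallback: if the chosen tier errors out, retry on the large model.
        return call_model(ROUTES["large"], query)
```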
𝐁𝐢𝐠 𝐧𝐞𝐰𝐬 𝐟𝐫𝐨𝐦 𝐆𝐨𝐨𝐠𝐥𝐞 𝐃𝐞𝐞𝐩𝐌𝐢𝐧𝐝
They've uncovered a fundamental flaw in RAG systems: fixed-size embeddings simply can't scale forever.
🔹 𝟱𝟭𝟮 𝗱𝗶𝗺𝗲𝗻𝘀𝗶𝗼𝗻𝘀 → ~𝟱𝟬𝟬𝗞 𝗱𝗼𝗰𝘀 𝗺𝗮𝘅
🔹 𝟭𝟬𝟮𝟰 𝗱𝗶𝗺𝗲𝗻𝘀𝗶𝗼𝗻𝘀 → ~𝟰𝗠 𝗱𝗼𝗰𝘀
🔹 𝟰𝟬𝟵𝟲 𝗱𝗶𝗺𝗲𝗻𝘀𝗶𝗼𝗻𝘀 → ~𝟮𝟱𝟬𝗠 𝗱𝗼𝗰𝘀
Beyond these limits, retrieval breaks down, no matter how big or well-trained the model is. DeepMind's new LIMIT benchmark shows that even with just 46 documents, no embedder hits full recall. Meanwhile, classical sparse methods like BM25 don't face this ceiling.
👉 𝙏𝙝𝙚 𝙩𝙖𝙠𝙚𝙖𝙬𝙖𝙮
Scaling RAG isn't just about bigger embeddings. We need new architectures: cross-encoders, multi-vector models, or hybrid sparse-dense approaches.
This research is a reminder: sometimes, progress means rethinking assumptions, not just scaling them.
📄 𝙁𝙪𝙡𝙡 𝙥𝙖𝙥𝙚𝙧: https://lnkd.in/gzf8XeFb
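One cheap way to get the hybrid behavior the takeaway points at is rank fusion: run a dense ranking and a sparse (BM25-style) ranking separately and merge them. Below is a minimal reciprocal rank fusion sketch; the constant k=60 is the commonly used default, and the doc IDs are made up.

```python
# Reciprocal rank fusion (RRF): merge a dense ranking and a sparse ranking so
# that a document missed by the fixed-size embedder can still surface via the
# sparse channel. Both inputs are lists of doc IDs sorted best-first.

def reciprocal_rank_fusion(dense_ranked: list[str],
                           sparse_ranked: list[str],
                           k: int = 60,
                           top_n: int = 10) -> list[str]:
    scores: dict[str, float] = {}
    for ranking in (dense_ranked, sparse_ranked):
        for rank, doc_id in enumerate(ranking):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank + 1)
    return sorted(scores, key=scores.get, reverse=True)[:top_n]

# Example: a doc ranked poorly by the embedder but highly by BM25 still makes the cut.
print(reciprocal_rank_fusion(["d3", "d1", "d7"], ["d9", "d3", "d1"], top_n=3))
```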
It's been almost four months since our paper Improving LLM-Based Recommender Systems with User-Controllable Profiles was published at The Web Conference (WWW). The serious part: We explored how Large Language Models can make recommender systems more user-centered by letting people actively control their profiles, instead of being defined only by past clicks. Our results show that these user-controllable profiles can improve recommendation quality by up to 50% compared to historical baselines. Link to the paper in comments. The not-so-serious part: WWW required us to submit a short "promo video" for the paper. Naturally, we thought: why not go full 90s infomercial? Cue over-the-top acting, questionable camera work, and some intentional cringe — starring me and our first author Stanisław Woźniak. 🎬 Now that some time has passed, I'm still wondering: 👉 Does adding humor make scientific work more approachable and memorable? 👉 Or does it risk making the research look less serious? What do you think? Curious to hear your perspective! #LLM #RecommenderSystems #Research #TheWebConf
Great article by Google on the limitations of dense vectors in RAG-based applications. https://lnkd.in/ecXS6FWz That said... this is one of those "water is wet" moments. Pryon has long since recognized the limitations of dense vector embeddings in RAG-based deployments. Since 2017 (yup, we've been doing this for some time) we've perfected enterprise applications of RAG, leveraging more sophisticated techniques like hybrid retrieval, metadata filters, rerankers, etc., and we continue to provide one of the best implementations of AI at enterprise scale. Needless to say, you kind of have to get creative once you get into the hundreds of thousands of documents in a single system! Good read nonetheless for the more technical crowd.
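As a sketch of the kind of pipeline described here (metadata filters before retrieval, a reranker after), and not Pryon's implementation: `retrieve` and `rerank_score` below are placeholders for whichever first-stage retriever and cross-encoder you already use.

```python
# Hypothetical filter-then-retrieve-then-rerank pipeline. `retrieve` returns a
# candidate list of docs for a query; `rerank_score` scores a (query, text) pair.

def filtered_rerank_search(query: str, docs: list[dict],
                           retrieve, rerank_score,
                           filters: dict | None = None, top_k: int = 5) -> list[dict]:
    # 1. Metadata filter: shrink the candidate pool before any vector math.
    if filters:
        docs = [d for d in docs
                if all(d.get("meta", {}).get(k) == v for k, v in filters.items())]
    # 2. First-stage retrieval over the filtered pool (dense, sparse, or hybrid).
    candidates = retrieve(query, docs, top_k=top_k * 10)
    # 3. Cross-encoder reranking of the short list, keeping only the best top_k.
    return sorted(candidates, key=lambda d: rerank_score(query, d["text"]),
                  reverse=True)[:top_k]
```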
Deep Research Agents: A Systematic Examination And Roadmap Ever feel like research is a black hole of information? 🤯 Large Language Models are changing that. The latest line of research, Deep Research Agents, is poised to revolutionize how we tackle complex, multi-turn informational tasks. 🚀 This paper dives deep into these agents, exploring everything from API- vs. browser-based information retrieval to modular tool use and Model Context Protocols. They're not just about generating text; they're building structured analytical reports through dynamic reasoning and adaptive planning. But it's not all smooth sailing. The authors highlight crucial limitations: restricted knowledge access, sequential execution issues, and a disconnect between evaluation metrics and real-world needs. 🧐 What's the *one* key challenge you see in scaling Deep Research Agents for practical applications? 👇 #DeepResearchAgents #LLM #AIResearch #ArtificialIntelligence #Innovation Original article: https://lnkd.in/d5g6fp7B Automatically posted. Contact me if you want to know how it works :-)