A curated collection of battle-tested tools, frameworks, and best practices for building, scaling, and monitoring production-grade Retrieval-Augmented Generation (RAG) systems.
Retrieval-Augmented Generation (RAG) is revolutionizing how LLMs access and utilize external knowledge. This repository bridges the gap between prototype RAG tutorials and production-grade systems at scale. Whether you're building semantic search, question-answering systems, or AI-powered assistants, you'll find proven frameworks, vector databases, evaluation tools, and observability solutions for production RAG deployments. The focus is on the engineering side of AI: from data ingestion and retrieval optimization to monitoring, security, and deployment strategies for real-world LLM applications.
Contribution Guide · Explore Categories · Report Bug
- Frameworks & Orchestration
- Data Ingestion & Parsing
- Vector Databases
- Retrieval & Reranking
- Agentic RAG
- Evaluation & Benchmarking
- Observability & Tracing
- Deployment & Serving
- Datasets
- RAG Pitfalls & Anti-patterns
- Recommended Resources (Books & Blogs)
Not sure where to start? Use this high-level decision tree to pick the right tools for your scale and use case.
graph TD
Start([Start Project]) --> UseCase{What is your primary goal?}
%% Framework Selection
UseCase -->|Complex Agents & Control| LangGraph[LangGraph]
UseCase -->|Data Processing & Indexing| LlamaIndex[LlamaIndex]
UseCase -->|Auditable Pipelines| Haystack[Haystack]
%% Vector DB Selection
UseCase --> DB{Which Vector DB?}
DB -->|Serverless / Zero Ops| Pinecone[Pinecone]
DB -->|Massive Scale >100M| Milvus[Milvus]
DB -->|Running Locally| Chroma[Chroma]
DB -->|Postgres Ecosystem| PGVector[pgvector]
%% Styling
classDef framework fill:#e1f5fe,stroke:#01579b,stroke-width:2px;
classDef db fill:#f3e5f5,stroke:#4a148c,stroke-width:2px;
classDef start fill:#e8f5e9,stroke:#1b5e20,stroke-width:2px;
class LangGraph,LlamaIndex,Haystack framework;
class Pinecone,Milvus,Chroma,PGVector db;
class Start start;
Stop guessing. Here are three battle-tested stacks for different stages of maturity.
Goal: Rapid prototyping, zero cost, no API keys.
Stack:
- LLM: Ollama (LLaMA 3 / Mistral)
- Vector DB: Chroma (Embedded)
- Eval: Ragas (Basic checks)
Why: Runs entirely on your laptop. Perfect for "Hello World" and checking feasibility.
Risks: High latency; performance depends on your hardware; no horizontal scaling.
Observability Checklist: print() statements and basic logging.
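The Tier-1 loop can be sketched with no services at all. The snippet below is a toy illustration, not the Chroma or Ollama API: `embed` is a bag-of-letters stand-in for a real embedding model, and retrieval is brute-force cosine similarity over an in-memory list.

```python
import math

def embed(text):
    # Toy bag-of-letters "embedding" standing in for a real embedding model
    # (e.g. one served locally by Ollama); for illustration only.
    vec = [0.0] * 26
    for ch in text.lower():
        if ch.isascii() and ch.isalpha():
            vec[ord(ch) - ord("a")] += 1.0
    return vec

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb) if na and nb else 0.0

def retrieve(query, docs, k=2):
    # Brute-force scan; a real vector DB replaces this with ANN search.
    q = embed(query)
    return sorted(docs, key=lambda d: cosine(q, embed(d)), reverse=True)[:k]

docs = [
    "Chroma is an embedded vector database.",
    "Ollama runs LLaMA 3 locally.",
    "Bananas are rich in potassium.",
]
```

Swapping `embed` for a real model call and `retrieve` for a Chroma collection query upgrades this to a working Tier-1 pipeline without changing the shape of the loop.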
Goal: High precision, developer velocity, minimal infra management.
Stack:
- Vector DB: Qdrant or Weaviate (Cloud/Managed)
- Reranker: Cohere Rerank (API)
- Tracing: Langfuse or Arize Phoenix
Why: Offloads complexity to managed services. "It just works" with great documentation.
Risks: Costs scale linearly with usage; dependency on external APIs (Vendor lock-in).
Observability Checklist: Latency tracking, Token usage costs, Trace visualization.
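The Tier-2 checklist items (latency and token-cost tracking) can be prototyped with a decorator before adopting Langfuse or Phoenix. A minimal sketch, assuming a hypothetical `fake_llm_call` and an illustrative per-token price:

```python
import time
from functools import wraps

METRICS = {"calls": 0, "total_latency_s": 0.0, "total_tokens": 0}

# Hypothetical price per 1K tokens; substitute your provider's actual rate.
PRICE_PER_1K_TOKENS = 0.002

def traced(fn):
    """Record call count, latency, and token usage for each LLM call."""
    @wraps(fn)
    def wrapper(*args, **kwargs):
        start = time.perf_counter()
        result = fn(*args, **kwargs)
        METRICS["calls"] += 1
        METRICS["total_latency_s"] += time.perf_counter() - start
        METRICS["total_tokens"] += result.get("tokens", 0)
        return result
    return wrapper

@traced
def fake_llm_call(prompt):
    # Stand-in for a real completion call; many APIs report token usage
    # in the response, which is what we accumulate here.
    return {"text": "answer", "tokens": len(prompt.split()) + 5}

fake_llm_call("What is hybrid search?")
cost = METRICS["total_tokens"] / 1000 * PRICE_PER_1K_TOKENS
```

A tracing platform does the same bookkeeping per-span and adds visualization, but the accumulated counters are the core of the checklist.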
Goal: Throughput maximization, data sovereignty, full control.
Stack:
- Vector DB: Milvus (Distributed)
- Serving: vLLM (Self-hosted)
- Eval (CI/CD): DeepEval
- Monitoring: OpenLIT (OpenTelemetry)
Why: You own the data and the compute. Scales to billions of vectors.
Risks: Significant operational complexity (Kubernetes); requires a dedicated Platform Engineering team.
Observability Checklist: Distributed tracing, Embedding drift detection, Custom SLA alerts, GPU utilization metrics.
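Embedding drift detection from the Tier-3 checklist can be approximated by comparing batch centroids. This is a simplified sketch (production systems often use per-dimension statistical tests instead); `DRIFT_THRESHOLD` is an assumed value you would tune per model:

```python
import math

def centroid(vectors):
    dims = len(vectors[0])
    return [sum(v[i] for v in vectors) / len(vectors) for i in range(dims)]

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb) if na and nb else 0.0

def drift_score(reference_batch, current_batch):
    """1 - cosine similarity of batch centroids; higher means more drift."""
    return 1.0 - cosine(centroid(reference_batch), centroid(current_batch))

# Illustrative 2-D embeddings: a reference batch captured at index time
# versus a well-behaved and a shifted batch of recent query embeddings.
reference = [[1.0, 0.0], [0.9, 0.1]]
recent_ok = [[1.0, 0.05]]
recent_shifted = [[0.0, 1.0], [0.1, 0.9]]

DRIFT_THRESHOLD = 0.2  # assumed alert threshold
```

Firing a custom SLA alert when `drift_score` crosses the threshold catches silent failures such as an embedding model upgrade that invalidates the index.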
Learn from production RAG implementations at scale. These companies have battle-tested their systems with millions of users.
-
- Use Case: Conversational job search and professional recommendations
- Tech Stack: In-house vector DB + BERT embeddings + LLM fine-tuning
- Key Insight: Member-specific personalization through context injection
-
- Use Case: E-commerce chatbot for merchant support
- Tech Stack: LangChain + Chroma + GPT-3.5-Turbo
- Key Insight: Domain-specific fine-tuning reduced hallucination rate from 18% to 4%
-
- Use Case: Message search across 19 billion messages
- Tech Stack: ScaNN (Google) + Custom Rust infrastructure
- Key Insight: Approximate nearest neighbor search with 99.9% recall at 10ms latency
Common Patterns:
- ✅ Hybrid search (dense + sparse) is standard at scale
- ✅ Custom embedding models outperform off-the-shelf for domain-specific tasks
- ✅ Reranking is critical for precision (top-100 → top-5)
- ✅ Extensive A/B testing on retrieval quality before LLM integration
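The top-100 → top-5 funnel pattern can be sketched generically. Both scorers below are toy stand-ins: a real first stage is ANN search over embeddings, and a real second stage is a cross-encoder such as BGE-Reranker or Cohere Rerank.

```python
def cheap_score(query, doc):
    # First-stage proxy: query-term overlap (stands in for ANN vector search).
    q, d = set(query.lower().split()), set(doc.lower().split())
    return len(q & d) / len(q) if q else 0.0

def expensive_score(query, doc):
    # Stand-in for a cross-encoder: overlap, mildly penalized by length.
    return cheap_score(query, doc) / (1 + 0.01 * len(doc.split()))

def funnel(query, corpus, recall_k=100, precision_k=5):
    """Wide-recall first stage, then rerank only the survivors.
    The expensive scorer runs on at most recall_k documents."""
    candidates = sorted(corpus, key=lambda d: cheap_score(query, d),
                        reverse=True)[:recall_k]
    return sorted(candidates, key=lambda d: expensive_score(query, d),
                  reverse=True)[:precision_k]

corpus = [
    "hybrid search combines dense and sparse retrieval",
    "dense retrieval uses embeddings",
    "cats sleep a lot",
]
```

The design point is cost asymmetry: the first stage must be fast enough to scan the whole corpus, while the reranker only ever sees `recall_k` candidates.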
Choose the right framework for your use case with this production-focused comparison:
| Framework | Best For | Async Support | Production Readiness | Community Support | Orchestration Style | Observability | Learning Curve | Deployment Complexity |
|---|---|---|---|---|---|---|---|---|
| LlamaIndex | Data Processing & Indexing | ✅ Full | ★★★★ | 39k+ ⭐ | Data-Flow Pipelines | Built-in + 3rd Party | Low-Medium | Low |
| LangChain | Rapid Prototyping | ✅ Full | ★★★ | 100k+ ⭐ | Sequential Chains | Excellent (LangSmith) | Medium | Medium |
| LangGraph | Complex Agents & Control | ✅ Full | ★★★★ | 7k+ ⭐ | Cyclic Graphs | Excellent (LangSmith) | High | Medium-High |
| Haystack | Enterprise Pipelines | ✅ Full | ★★★★★ | 18k+ ⭐ | DAG-based Pipelines | Built-in Tracing | Medium-High | Low-Medium |
Key Considerations:
- LlamaIndex: Choose if you need advanced indexing strategies (hierarchical, knowledge graphs) and your focus is on data ingestion.
- LangChain: Best for quick experiments and maximum ecosystem compatibility. Watch out for abstraction overhead.
- LangGraph: Pick this when building agentic systems with human-in-the-loop, state persistence, or cyclic workflows.
- Haystack: The enterprise choice for auditable, type-safe pipelines with strict reproducibility requirements.
- Agentset
- Open-source production-ready RAG infrastructure with built-in agentic reasoning, hybrid search, and multimodal support. Designed for scalable deployments with automatic citations and enterprise-grade reliability.
- Cognita
- A modular RAG framework by TrueFoundry designed for scalability. It decouples the RAG components (Indexer, Retriever, Parser), allowing for independent scaling and easier AB testing of different RAG strategies.
- Haystack
- A modular framework focused on production readiness. It emphasizes auditable pipelines, strict type-checking, and reproducibility, making it ideal for enterprise-grade RAG where reliability is paramount.
- LangGraph
- A library for building stateful, multi-actor applications with LLMs. Unlike simple chains, it enables cyclic graphs for complex, agentic workflows with human-in-the-loop control and persistence.
- LlamaIndex
- The premier data framework for LLMs. It excels at connecting custom data sources to LLMs, offering advanced indexing strategies (like recursive retrieval) and optimized query engines for deep insight extraction.
- Pathway
- A high-performance data processing framework for live data. It enables "Always-Live" RAG by syncing vector indices in real-time as the underlying data source changes, without full re-indexing.
- RAGFlow
- An end-to-end RAG engine designed for deep document understanding. It handles complex layouts (PDFs, tables, images) natively and includes a built-in knowledge base management system.
- Verba
- Weaviate's "Golden RAGtriever". A polished, open-source RAG web application that comes pre-configured with best practices for chunking, embedding, and retrieval out of the box.
- Firecrawl
- Effortlessly turn websites into clean, LLM-ready markdown.
- LlamaParse
- Specialized parsing for complex PDFs with table extraction capabilities.
- Marker
- High-efficiency PDF and EPUB to Markdown converter using vision models.
- OmniParse
- Universal parser for ingesting any data type (documents, multimedia, web) into RAG-ready formats.
- Unstructured
- Open-source pipelines for preprocessing complex, unstructured data.
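Whatever parser you use, the extracted text still needs chunking before embedding. A minimal fixed-size sketch with overlap (word-based for simplicity; production pipelines typically chunk by tokens or sentence boundaries):

```python
def chunk_text(text, chunk_size=200, overlap=50):
    """Split text into word-based chunks with overlap, so content that
    straddles a boundary appears in two adjacent chunks."""
    assert 0 <= overlap < chunk_size
    words = text.split()
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        if start + chunk_size >= len(words):
            break  # final chunk already covers the tail
    return chunks
```

Overlap trades a little index bloat for robustness: a fact split across two chunks is still retrievable from at least one of them intact.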
| Tool | Best For | Key Strength |
|---|---|---|
| Chroma | Local/Dev & Mid-scale | Developer-friendly, open-source embedding database. |
| Milvus | Billions of vectors | Most popular OSS for massive scale. |
| pgvector | PostgreSQL Ecosystem | Vector search capability directly within PostgreSQL. |
| Pinecone | 10M-100M+ vectors | Zero-ops, serverless architecture. |
| Qdrant | <50M vectors | Best filtering support and free tier. |
| Weaviate | Hybrid Search | Native integration of vector and keyword search. |
Hybrid Search: A retrieval strategy that combines Dense Vector Search (semantic understanding) with Sparse Keyword Search (BM25 for exact term matching), typically via a weighted combination of scores. This covers dense retrieval's blind spots on exact identifiers, acronyms, and rare terms, and significantly improves zero-shot retrieval performance.
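Besides weighted score combination, a popular fusion method is Reciprocal Rank Fusion (RRF), which needs only the two ranked lists, not comparable scores. A minimal sketch with illustrative document IDs:

```python
def reciprocal_rank_fusion(rankings, k=60):
    """Merge several ranked lists (e.g. BM25 results and dense results).
    Each document scores sum(1 / (k + rank)) across the lists it appears
    in; k=60 is the constant from the original RRF paper and dampens the
    influence of any single list's top ranks."""
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

# Hypothetical result lists: a doc ranked well by BOTH retrievers wins.
bm25 = ["doc_exact_match", "doc_b", "doc_c"]
dense = ["doc_semantic", "doc_exact_match", "doc_d"]
fused = reciprocal_rank_fusion([bm25, dense])
```

RRF's appeal in production is that BM25 scores and cosine similarities live on incomparable scales; fusing by rank sidesteps score normalization entirely.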
- BGE-Reranker
- One of the best open-source rerankers available. It is a cross-encoder model trained to output a relevance score for query-document pairs, offering commercial-grade performance for self-hosted pipelines.
- Cohere Rerank
- A powerful API-based reranking model. By re-scoring the initial top-K documents from a cheaper/faster retriever, it drastically improves precision (often boosting MRR by 10-20%) with minimal code changes.
- FlashRank
- A lightweight, serverless-friendly reranking library. It runs quantized cross-encoder models directly on the CPU (no Torch/GPU required), making it ideal for edge deployments or cost-sensitive architectures.
- RAGatouille
- A library that makes ColBERT (Contextualized Late Interaction over BERT) easy to use. ColBERT offers fine-grained token-level matching, providing superior retrieval quality compared to standard single-vector dense retrieval.
GraphRAG: An advanced retrieval method that constructs a knowledge graph from documents. It traverses relationships between entities to answer "global" queries (e.g., "What are the main themes?") that standard vector search struggles to address.
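The core of the GraphRAG idea can be illustrated with a plain adjacency map: extract entity-relation triples at index time, then collect the neighborhood of a seed entity as context. The graph below is hypothetical and hand-built; real systems extract it with an LLM over the corpus.

```python
from collections import deque

# Hypothetical entity graph: entity -> list of (relation, target) edges.
GRAPH = {
    "Acme Corp": [("acquired", "WidgetCo"), ("headquartered_in", "Berlin")],
    "WidgetCo": [("produces", "widgets")],
    "Berlin": [],
    "widgets": [],
}

def traverse(start, max_hops=2):
    """Breadth-first collection of relation triples within max_hops of a
    seed entity; the triples become LLM context for 'global' queries."""
    triples, seen, frontier = [], {start}, deque([(start, 0)])
    while frontier:
        node, depth = frontier.popleft()
        if depth == max_hops:
            continue  # do not expand beyond the hop budget
        for relation, target in GRAPH.get(node, []):
            triples.append((node, relation, target))
            if target not in seen:
                seen.add(target)
                frontier.append((target, depth + 1))
    return triples
```

Multi-hop traversal is what lets graph-based retrieval answer questions like "what does the company Acme acquired actually make?", where no single chunk contains the full path.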
Agentic RAG represents the evolution of traditional RAG systems into autonomous, decision-making entities. Instead of a simple "retrieve-then-generate" pipeline, agentic systems can plan multi-step workflows, use tools, and dynamically adjust their retrieval strategy based on intermediate results.
Core Capabilities:
- Multi-step Reasoning: Break complex queries into sub-tasks
- Tool Use: Integrate external APIs, databases, and services
- Self-Correction: Validate retrieved context and retry if needed
- Planning: Determine optimal retrieval strategy dynamically
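These capabilities compose into a plan-retrieve-validate-retry loop. The sketch below uses stub functions throughout; in a real agent, `decompose` and `is_grounded` would be LLM calls and `retrieve` would query a vector store, widening the search on each retry.

```python
def decompose(query):
    # Hypothetical planner: split a comparative query into sub-questions.
    if " and " in query:
        a, b = query.split(" and ", 1)
        return [a.strip(), b.strip()]
    return [query]

def retrieve(sub_question, attempt):
    # Stub retriever; `attempt` would widen top-k or rewrite the query.
    corpus = {"strengths of X": "X is fast", "strengths of Y": "Y is cheap"}
    return corpus.get(sub_question, "")

def is_grounded(context):
    # Stub validator; a real agent would ask an LLM judge.
    return bool(context)

def agentic_answer(query, max_retries=2):
    contexts = []
    for sub in decompose(query):
        for attempt in range(max_retries + 1):
            ctx = retrieve(sub, attempt)
            if is_grounded(ctx):  # self-correction: retry on bad context
                contexts.append(ctx)
                break
    return " | ".join(contexts)
```

Note that this loop makes several model calls per user query, which is exactly where the latency and cost trade-offs listed below come from.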
- AutoGen
- Microsoft's framework for building multi-agent conversational systems. Agents can collaborate, debate, and refine answers through back-and-forth dialogue, improving output quality through consensus.
- CrewAI
- A lightweight framework for orchestrating role-playing autonomous AI agents. Define specialized "crew members" (Researcher, Writer, Critic) that work together on complex RAG tasks.
- LangGraph
- Build stateful, multi-actor applications with cyclic graphs. Enables human-in-the-loop approval, memory persistence across conversations, and complex agentic workflows beyond linear chains.
- OpenAI Assistants API
- A managed service for building agent-like experiences. It provides built-in retrieval capabilities, code interpreter, and function calling with minimal infrastructure overhead.
- RAGFlow Agentic Mode
- Extends RAGFlow with agentic capabilities, allowing dynamic document re-ranking, query decomposition, and adaptive retrieval strategies based on query complexity.
When to Use Agentic RAG:
- ✅ Complex, multi-hop questions requiring planning ("Compare X and Y across these 5 documents")
- ✅ Integration with external tools (SQL databases, APIs, calculators)
- ✅ Tasks requiring validation (fact-checking, citation verification)
Trade-offs:
- ❌ Higher latency (multiple LLM calls)
- ❌ Increased cost (agent reasoning + retrieval)
- ❌ Debugging complexity (non-deterministic behavior)
Reliable RAG requires measuring the RAG Triad: Context Relevance, Groundedness, and Answer Relevance.
- Ares
- An automated evaluation system that helps you evaluate RAG systems with fewer human labels. It uses prediction-powered inference to provide statistical confidence intervals for your system's performance.
- Braintrust
- An enterprise-grade platform for evaluating and logging LLM outputs. It excels at "Online Evaluation," allowing you to score real-world user interactions and feed that data back into your development set.
- DeepEval
- The "Pytest for LLMs". It offers a unit-testing framework for RAG, integrating seamlessly into CI/CD pipelines to catch regression in retrieval quality or hallucination rates before deployment.
- Ragas
- A framework that uses an "LLM-as-a-Judge" to evaluate your pipeline. It calculates metrics like Faithfulness (did the answer come from the context?) and Answer Relevancy without needing human-labeled ground truth.
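The Faithfulness idea can be approximated without any LLM judge at all, using word overlap as a crude proxy. This is not the Ragas metric, just an illustration of its decomposition: split the answer into statements, then check each statement against the retrieved context.

```python
def sentences(text):
    return [s.strip() for s in text.split(".") if s.strip()]

def supported(sentence, context, threshold=0.5):
    """A sentence counts as grounded if enough of its content words
    (longer than 3 chars) appear in the context; crude proxy for a judge."""
    words = {w.lower() for w in sentence.split() if len(w) > 3}
    if not words:
        return True
    hits = sum(1 for w in words if w in context.lower())
    return hits / len(words) >= threshold

def faithfulness(answer, context):
    """Fraction of answer statements supported by the context."""
    sents = sentences(answer)
    if not sents:
        return 1.0
    return sum(supported(s, context) for s in sents) / len(sents)

context = "the eiffel tower is in paris and was finished in 1889"
answer = "The Eiffel Tower is in Paris. It was painted green by aliens."
```

The overlap heuristic is noisy, but the statement-level decomposition is the part worth keeping: it localizes *which* sentence hallucinated, not just that the answer did.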
Using one LLM to evaluate the outputs of another has become a standard practice in production RAG systems. This approach scales better than human evaluation and provides consistent, automated quality assessment.
Core Frameworks:
- Prometheus
- An open-source LLM specifically trained for evaluation tasks. Unlike using GPT-4 as a judge, Prometheus is optimized for scoring consistency and can run locally for cost-sensitive deployments.
- G-Eval
- A framework that uses GPT-4 with chain-of-thought reasoning to evaluate text generation quality. It achieves human-level correlation on summarization and dialogue tasks.
-
- A tool for quickly and easily evaluating AI model outputs using best practices, including LLM-as-a-judge and heuristic methods.
- ARES (Automated RAG Evaluation System)
- Stanford's research project that fine-tunes small LLMs as judges specifically for RAG evaluation, achieving GPT-4-level accuracy at 1/10th the cost.
- LangChain Evaluators
- Built-in evaluation chains for criteria-based scoring, pairwise comparison, and embedding distance. Seamlessly integrates with LangSmith for production monitoring.
Key Metrics:
| Metric | What It Measures | Judge LLM Prompt Example |
|---|---|---|
| Faithfulness | Does the answer come from the retrieved context? | "Does the answer contain information not in the context? Yes/No" |
| Answer Relevance | Does the answer address the question? | "Rate how well this answer addresses the question (1-5)" |
| Context Precision | Are the top-ranked chunks actually relevant? | "Is this passage relevant to answering the question? Yes/No" |
| Context Recall | Did we retrieve all necessary information? | "Is there missing information needed to answer this question?" |
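Wiring up a judge from the table is mostly prompt templating plus strict output parsing. The sketch below uses a mocked judge function in place of a real LLM call; the prompt wording is illustrative.

```python
FAITHFULNESS_PROMPT = (
    "Context:\n{context}\n\nAnswer:\n{answer}\n\n"
    "Does the answer contain information not in the context? "
    "Reply with exactly Yes or No."
)

def parse_verdict(raw):
    """Normalize a judge reply to a boolean; raise on garbage rather than
    silently guessing (fail closed)."""
    token = raw.strip().split()[0].rstrip(".,").lower() if raw.strip() else ""
    if token == "no":
        return True   # no unsupported information -> faithful
    if token == "yes":
        return False
    raise ValueError(f"Unparseable judge reply: {raw!r}")

def mock_judge(prompt):
    # Stand-in for a real LLM completion call.
    return "No."

def judge_faithfulness(context, answer, judge=mock_judge):
    prompt = FAITHFULNESS_PROMPT.format(context=context, answer=answer)
    return parse_verdict(judge(prompt))
```

Raising on unparseable output matters in practice: judges occasionally produce free-form replies, and counting those as a pass quietly inflates your metrics.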
Best Practices:
- ✅ Use GPT-4 or Claude for critical evaluations (highest agreement with humans)
- ✅ Fine-tune smaller models (Llama 3 8B) as judges for cost/latency optimization
- ✅ Chain-of-Thought prompting improves judge consistency by 15-20%
- ✅ Always validate judge performance against human labels on a sample (100-200 examples)
- ⚠️ Be aware of position bias (LLMs favor earlier options in pairwise comparisons)
- ⚠️ LLM judges can inherit biases from their training data
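Position bias can be partially controlled by running each pairwise comparison in both orders and only accepting agreement. A sketch with two toy judge functions (both hypothetical; a real judge is an LLM call):

```python
def debiased_compare(question, answer_a, answer_b, judge):
    """Run a pairwise judge in both presentation orders; declare a winner
    only when both orders agree, otherwise call it a tie."""
    first = judge(question, answer_a, answer_b)   # returns "first"/"second"
    second = judge(question, answer_b, answer_a)
    if first == "first" and second == "second":
        return "A"
    if first == "second" and second == "first":
        return "B"
    return "tie"

def biased_judge(question, first_answer, second_answer):
    # Pathological judge that always prefers whatever it sees first.
    return "first"

def keyword_judge(question, first_answer, second_answer):
    # Toy consistent judge: prefers the answer containing "grounded".
    return "first" if "grounded" in first_answer else "second"
```

Against a purely position-biased judge, this scheme reports "tie" instead of a spurious winner, which is the safer failure mode for offline evaluation.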
- Arize Phoenix
- A tool specifically designed for troubleshooting retrieval issues. It visualizes your embedding clusters and retrieved document rankings, helping you understand why the model retrieved irrelevant context.
- Langfuse
- An open-source engineering platform for LLM observability. It captures full execution traces (latency, token usage, cost) and allows for "Prompt Management," letting you version-control prompts decoupled from your code.
- LangSmith
- Built by the LangChain team, this is the gold standard for debugging complex chains. It provides a "Playground" to rerun specific traces with modified prompts to iterate on edge cases instantly.
- OpenLIT
- An OpenTelemetry-native monitoring solution. If you already use Prometheus/Grafana or Datadog, OpenLIT drops into your existing stack to provide standardized LLM metrics (GPU usage, token throughput).
- BentoML
- A framework for packaging models into standardized APIs (Bentos). It handles the complexity of adaptive batching and multi-model serving, allowing you to deploy any model to any cloud (AWS Lambda, EC2, Kubernetes) with one command.
- Ollama
- The easiest way to run LLMs locally. While primarily for dev/local use, it bridges the gap between local testing and deployment by providing a standard API for models like LLaMA 3, Mistral, and Gemma.
- Ray Serve
- The industry standard for scaling Python ML workloads. It allows you to compose complex pipelines (e.g., Retriever + Reranker + LLM) where each component scales independently across a cluster of machines.
- vLLM
- A high-performance inference engine known for PagedAttention. It maximizes GPU memory utilization, allowing you to serve larger models or handle higher concurrency with lower latency than standard Hugging Face Transformers.
- Lakera Guard
- A low-latency security API that protects applications against prompt injections, data leakage, and toxic content in real-time. It acts as an "Application Firewall" for your LLM.
- LLM Guard
- A comprehensive toolkit for sanitizing inputs and outputs. It detects invisible text, prompt injections, and anonymizes sensitive data, ensuring full compliance with data privacy standards.
- NeMo Guardrails
- The standard for adding programmable guardrails to LLM-based conversational systems. It prevents "Jailbreaking" and ensures models stay on topic, critical for enterprise chatbots.
- Presidio
- Microsoft's SDK for PII (Personally Identifiable Information) detection and redaction. It ensures sensitive user data (credit cards, emails) is scrubbed before it hits the embedding model or vector DB.
- PrivateGPT
- A production-ready project that allows you to run RAG pipelines completely offline. It ensures 100% data privacy by keeping all ingestion and inference local, perfect for highly regulated industries.
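The scrub-before-embedding step that Presidio performs can be approximated with a few regexes for illustration. These patterns are deliberately minimal and will miss many formats; use a dedicated library in production.

```python
import re

# Minimal illustrative patterns; Presidio covers far more entity types
# and uses NER in addition to regexes.
PII_PATTERNS = {
    "EMAIL": re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+"),
    "CREDIT_CARD": re.compile(r"\b\d(?:[ -]?\d){12,15}\b"),  # 13-16 digits
    "PHONE": re.compile(r"\b\d{3}[-.\s]\d{3}[-.\s]\d{4}\b"),
}

def scrub(text):
    """Replace detected PII with typed placeholders before the text
    reaches the embedding model or vector DB."""
    for label, pattern in PII_PATTERNS.items():
        text = pattern.sub(f"<{label}>", text)
    return text
```

Typed placeholders (rather than deletion) keep the sentence structure intact, so embeddings of scrubbed text stay semantically useful.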
Deepen your knowledge with curated lists of books and blogs from industry experts.
Books
A curated list of Essential Books covering RAG, Deep Learning, and AI Engineering.
- Featuring: "Designing Machine Learning Systems" by Chip Huyen, "Deep Learning" by Goodfellow et al.
Blogs & News
Stay updated with the Best Engineering Blogs.
- Featuring: OpenAI Research, Google DeepMind, and NVIDIA AI.
To keep this list high-quality, we only include resources that are:
1. Production-Ready: Battle-tested in real-world environments.
2. Actively Maintained: Regular updates within the last 3-6 months.
3. Documented: Strong API references and clear use cases.
Contributions are welcome! Please read the CONTRIBUTING.md file for guidelines on how to submit a new resource.
This repository is licensed under CC0 1.0 Universal.