RAG Scale is a modular, privacy-focused framework designed for building local Retrieval-Augmented Generation (RAG) applications. It leverages Ollama for local Large Language Model (LLM) inference and embeddings, combined with FAISS for efficient high-dimensional vector search.
The system is architected to support isolated research "sessions," allowing users to index and query distinct text corpora or Wikipedia topics without context contamination.
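Conceptually, the retrieval step at the heart of the pipeline looks like the following brute-force cosine-similarity sketch in plain Python; FAISS replaces this linear scan with optimized index structures for high-dimensional search (the toy vectors and function names here are illustrative, not part of the codebase):

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def top_k(query, vectors, k=2):
    """Return the indices of the k vectors most similar to the query."""
    scores = [(cosine_similarity(query, v), i) for i, v in enumerate(vectors)]
    scores.sort(reverse=True)
    return [i for _, i in scores[:k]]

# Toy 3-dimensional "embeddings" of three chunks.
chunks = [[1.0, 0.0, 0.0], [0.9, 0.1, 0.0], [0.0, 0.0, 1.0]]
print(top_k([1.0, 0.05, 0.0], chunks, k=2))  # → [0, 1]
```

In the real system, the embeddings come from Ollama's embedding model and FAISS performs this search over thousands of chunks efficiently.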
The codebase follows a domain-driven design pattern to ensure scalability and maintainability:
- src/core: Core infrastructure managing session lifecycles (SessionManager) and vector store operations (VectorManager).
- src/data: Robust data ingestion pipeline, including:
  - Loaders: Utilities for fetching external content (e.g., WikiLoader).
  - Processors: Text normalization and chunking logic (DocumentProcessor) based on best practices.
- src/models: Abstraction layer for LLM and embedding model initialization, currently optimized for Ollama.
- src/rag: The inference engine containing the prompt construction and generation logic.
- src/config: Centralized configuration management for model parameters, directory paths, and runtime settings.
- tests/health: Comprehensive system health check suite to validate infrastructure components.
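The chunking step handled by DocumentProcessor can be sketched as a sliding window with overlap; the function name, window size, and overlap below are illustrative assumptions, not the project's actual defaults:

```python
def chunk_text(text, chunk_size=200, overlap=50):
    """Split text into overlapping chunks of roughly chunk_size characters.

    Overlap preserves context that would otherwise be cut at chunk boundaries.
    """
    if chunk_size <= overlap:
        raise ValueError("chunk_size must exceed overlap")
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        start += chunk_size - overlap  # advance by the stride
    return chunks

sample = "word " * 100  # 500 characters
print(len(chunk_text(sample)))  # 500 chars at a 150-char stride → 4 chunks
```

Each chunk is then embedded and stored in the session's FAISS index.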
To execute this framework, ensure the following are installed and configured:
- Python 3.10+
- Ollama: Required for local model inference.
  - Pull the default embedding model:
    ollama pull nomic-embed-text
  - Pull the default chat model:
    ollama pull mistral
- Clone the repository:
  git clone <repository-url>
  cd rag-scale
- Install the required Python dependencies:
  pip install -r requirements.txt
Before running the main application, you can execute the included health check suite to verify that your environment (Ollama, Network, Python dependencies) is correctly configured.
python test_system.py

This ensures all subsystems are operational before you begin.
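The dependency portion of such a health check can be sketched with the standard library alone; the function names and the example dependency list are assumptions, not the contents of test_system.py:

```python
import importlib.util

def dependency_available(module_name):
    """Return True if the given Python module can be imported."""
    return importlib.util.find_spec(module_name) is not None

def run_health_checks(required=("faiss", "requests")):
    """Map each required dependency to its availability."""
    return {name: dependency_available(name) for name in required}

print(dependency_available("json"))  # stdlib module → True
```

A full check would additionally verify that the Ollama server is reachable and that the required models have been pulled.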
The framework is driven by a Command Line Interface (CLI) that orchestrates the RAG pipeline.
- Initialize the Application:
  python main.py
- Select a Corpus: When prompted, enter a topic name.
  - Local Processing: The system first checks data/raw/ for a matching .txt file.
  - External Fetch: If no local file is found, it attempts to retrieve and sanitize the corresponding article from Wikipedia.
- Interaction Phase: Once the index is built or loaded, the interactive session begins. The system will retrieve relevant context for each query and generate a citation-backed response.
- Session Management: All artifacts (raw data, serialized chunks, and FAISS indexes) are persisted in the sessions/ directory, allowing instant resumption of previous topics.
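The citation-backed generation step can be sketched as a prompt assembled from numbered sources; the template and helper name are assumptions, the actual logic lives in src/rag:

```python
def build_prompt(question, retrieved_chunks):
    """Assemble a grounded prompt with numbered sources for citation."""
    context = "\n".join(
        f"[{i + 1}] {chunk}" for i, chunk in enumerate(retrieved_chunks)
    )
    return (
        "Answer the question using only the sources below. "
        "Cite sources as [n].\n\n"
        f"Sources:\n{context}\n\nQuestion: {question}\nAnswer:"
    )

prompt = build_prompt(
    "What is the topic about?",
    ["Example source text retrieved from the index."],
)
print("[1]" in prompt)  # → True
```

The assembled prompt is then sent to the local chat model via Ollama, which grounds its answer in the numbered sources.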
- Privacy-First Design: Operates entirely locally with no data transmitted to external APIs.
- Isolated Sessions: Automatically manages separate environment states for different datasets.
- Health Monitoring: Integrated tools to validate system integrity and model availability.
- Performance Metrics: Detailed logging of query latency (response_time_s) and retrieval statistics (retrieved_docs_count) in CSV format for analysis.
- Extensible Design: Modular architecture allows for straightforward integration of alternative vector stores or LLM providers.
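Metric logging of this kind can be sketched with the standard csv module; the column names follow the ones listed above, while the helper name and number formatting are assumptions:

```python
import csv
import io

def log_query_metrics(writer, response_time_s, retrieved_docs_count):
    """Append one query's metrics as a CSV row."""
    writer.writerow([f"{response_time_s:.3f}", retrieved_docs_count])

# Write to an in-memory buffer here; the framework would append to a file on disk.
buffer = io.StringIO()
writer = csv.writer(buffer)
writer.writerow(["response_time_s", "retrieved_docs_count"])  # header row
log_query_metrics(writer, 0.8421, 4)
print(buffer.getvalue().splitlines()[1])  # → 0.842,4
```

Keeping the metrics in plain CSV makes them easy to load into pandas or a spreadsheet for latency analysis.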