Are you frustrated with vector database retrieval accuracy for long professional documents? Traditional vector-based RAG relies on semantic similarity rather than true relevance. But similarity ≠ relevance: what we truly need in retrieval is relevance, and that requires reasoning. When working with professional documents that demand domain expertise and multi-step reasoning, similarity search often falls short.
Reasoning-based RAG offers a better alternative: enabling LLMs to think and reason their way to the most relevant document sections. Inspired by AlphaGo, we use tree search to perform structured document retrieval.
PageIndex is a document indexing system that builds search tree structures from long documents, making them ready for reasoning-based RAG. It has been used to develop a RAG system that achieved 98.7% accuracy on FinanceBench, demonstrating state-of-the-art performance in document analysis.
Self-host it with this open-source repo, or try our Cloud service, no setup required.
This repo generates PageIndex tree structures from text input, but many real-world use cases involve PDFs that must first be converted to Markdown with OCR. Extracting high-quality text from PDF documents remains a non-trivial challenge: most OCR tools only extract page-level content, losing the broader document context and hierarchy.
To address this, we introduced PageIndex OCR, the first OCR system designed to preserve the global structure of documents. PageIndex OCR significantly outperforms other leading OCR tools, such as those from Mistral and Contextual AI, in recognizing true hierarchy and semantic relationships across document pages.
- Experience next-level OCR quality with PageIndex OCR at our Dashboard.
- Integrate PageIndex OCR seamlessly into your stack via our API.
PageIndex can transform lengthy PDF documents into a semantic tree structure, similar to a "table of contents" but optimized for use with Large Language Models (LLMs). It's ideal for: financial reports, regulatory filings, academic textbooks, legal or technical manuals, and any document that exceeds LLM context limits.
- Hierarchical Tree Structure: Enables LLMs to traverse documents logically, like an intelligent, LLM-optimized table of contents.
- Chunk-Free Segmentation: No arbitrary chunking. Nodes follow the natural structure of the document.
- Precise Page Referencing: Every node contains its summary and start/end page physical index, allowing pinpoint retrieval.
- Scales to Massive Documents: Designed to handle hundreds or even thousands of pages with ease.
Here is an example output. See more example documents and generated trees.
...
{
  "title": "Financial Stability",
  "node_id": "0006",
  "start_index": 21,
  "end_index": 22,
  "summary": "The Federal Reserve ...",
  "nodes": [
    {
      "title": "Monitoring Financial Vulnerabilities",
      "node_id": "0007",
      "start_index": 22,
      "end_index": 28,
      "summary": "The Federal Reserve's monitoring ..."
    },
    {
      "title": "Domestic and International Cooperation and Coordination",
      "node_id": "0008",
      "start_index": 28,
      "end_index": 31,
      "summary": "In 2023, the Federal Reserve collaborated ..."
    }
  ]
}
...
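To make the node fields concrete, here is a minimal sketch of walking such a tree in Python. The helper names (`iter_nodes`, `nodes_covering_page`, `outline`) are hypothetical and not part of this repo, and treating `start_index`/`end_index` as an inclusive page range is an assumption.

```python
from typing import Iterator, List, Tuple

# Trimmed copy of the example tree above (summaries omitted).
tree = {
    "title": "Financial Stability",
    "node_id": "0006",
    "start_index": 21,
    "end_index": 22,
    "nodes": [
        {"title": "Monitoring Financial Vulnerabilities",
         "node_id": "0007", "start_index": 22, "end_index": 28},
        {"title": "Domestic and International Cooperation and Coordination",
         "node_id": "0008", "start_index": 28, "end_index": 31},
    ],
}


def iter_nodes(node: dict, depth: int = 0) -> Iterator[Tuple[int, dict]]:
    """Yield (depth, node) pairs in document order (pre-order traversal)."""
    yield depth, node
    for child in node.get("nodes", []):
        yield from iter_nodes(child, depth + 1)


def nodes_covering_page(root: dict, page: int) -> List[str]:
    """Return IDs of nodes whose page range contains `page`.

    Assumption: start_index and end_index are both inclusive.
    """
    return [n["node_id"] for _, n in iter_nodes(root)
            if n["start_index"] <= page <= n["end_index"]]


def outline(root: dict) -> str:
    """Render the tree as an indented, table-of-contents-style outline."""
    return "\n".join(
        f"{'  ' * depth}[{n['node_id']}] {n['title']} "
        f"(pp. {n['start_index']}-{n['end_index']})"
        for depth, n in iter_nodes(root)
    )


print(outline(tree))
print(nodes_covering_page(tree, 28))
```

A compact outline like this is one way to hand the structure to an LLM without sending the full summaries.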
Follow these steps to generate a PageIndex tree from a PDF document.
Install the dependencies:
pip3 install -r requirements.txt
Create a .env file in the root directory and add your API key:
CHATGPT_API_KEY=your_openai_key_here
Run PageIndex on your PDF:
python3 run_pageindex.py --pdf_path /path/to/your/document.pdf
You can customize the processing with additional optional arguments:
--model                   OpenAI model to use (default: gpt-4o-2024-11-20)
--toc-check-pages         Pages to check for table of contents (default: 20)
--max-pages-per-node      Max pages per node (default: 10)
--max-tokens-per-node     Max tokens per node (default: 20000)
--if-add-node-id          Add node ID (yes/no, default: yes)
--if-add-node-summary     Add node summary (yes/no, default: no)
--if-add-doc-description  Add doc description (yes/no, default: yes)
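If you process many PDFs, it can help to assemble the command line in a script. The sketch below only builds the argument list from the flags listed above. `build_pageindex_command` is a hypothetical helper, not something this repo provides.

```python
# Flags copied from the option list above; the helper itself is hypothetical.
KNOWN_OPTIONS = {
    "--model",
    "--toc-check-pages",
    "--max-pages-per-node",
    "--max-tokens-per-node",
    "--if-add-node-id",
    "--if-add-node-summary",
    "--if-add-doc-description",
}


def build_pageindex_command(pdf_path, **options):
    """Build the argument list for run_pageindex.py.

    Keyword names use underscores and are converted to hyphenated flags,
    e.g. max_pages_per_node=5 becomes "--max-pages-per-node 5".
    """
    cmd = ["python3", "run_pageindex.py", "--pdf_path", pdf_path]
    for name, value in options.items():
        flag = "--" + name.replace("_", "-")
        if flag not in KNOWN_OPTIONS:
            raise ValueError(f"unknown option: {flag}")
        cmd += [flag, str(value)]
    return cmd


cmd = build_pageindex_command(
    "/path/to/your/document.pdf",
    max_pages_per_node=5,
    if_add_node_summary="yes",
)
print(" ".join(cmd))
```

The resulting list can be passed to `subprocess.run(cmd, check=True)` from the repo root.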
Don't want to host it yourself? Try our hosted API for PageIndex. The hosted service leverages our custom OCR model for more accurate PDF recognition, delivering better tree structures for complex documents. Ideal for rapid prototyping, production environments, and documents requiring advanced OCR.
You can also upload PDFs from your browser and explore results visually with our Dashboard, no coding needed.
Leave your email in this form to receive 1,000 pages for free.
Mafin 2.5 is a state-of-the-art reasoning-based RAG model designed specifically for financial document analysis. Powered by PageIndex, it achieved a market-leading 98.7% accuracy on the FinanceBench benchmark, significantly outperforming traditional vector-based RAG systems.
PageIndex's hierarchical indexing enabled precise navigation and extraction of relevant content from complex financial reports, such as SEC filings and earnings disclosures.
See the full benchmark results and our blog post for detailed comparisons and performance metrics.
Use PageIndex to build reasoning-based retrieval systems without relying on semantic similarity. Great for domain-specific tasks where nuance matters (more examples).
- Process documents using PageIndex to generate tree structures.
- Store the tree structures and their corresponding document IDs in a database table.
- Store the contents of each node in a separate table, indexed by node ID and tree ID.
- Query Preprocessing:
  - Analyze the query to identify the required knowledge
- Document Selection:
  - Search for relevant documents and their IDs
  - Fetch the corresponding tree structures from the database
- Node Selection:
  - Search through tree structures to identify relevant nodes
- LLM Generation:
  - Fetch the corresponding contents of the selected nodes from the database
  - Format and extract the relevant information
  - Send the assembled context along with the original query to the LLM
  - Generate contextually informed responses
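The storage and fetch steps above can be sketched with Python's built-in `sqlite3`. Everything here is illustrative: the table layout, column names, and helper functions are assumptions rather than part of PageIndex, and the LLM reply is assumed to be a JSON object with a `node_list` field.

```python
import json
import sqlite3


def init_db(conn):
    """Create one table for tree structures and one for node contents."""
    conn.executescript("""
        CREATE TABLE IF NOT EXISTS trees (
            tree_id   INTEGER PRIMARY KEY,
            doc_id    TEXT NOT NULL,
            structure TEXT NOT NULL
        );
        CREATE TABLE IF NOT EXISTS node_contents (
            tree_id INTEGER NOT NULL REFERENCES trees(tree_id),
            node_id TEXT NOT NULL,
            content TEXT NOT NULL,
            PRIMARY KEY (tree_id, node_id)
        );
    """)


def store_tree(conn, doc_id, structure, contents):
    """Store a tree (as JSON text) plus the text of each node; return tree_id."""
    cur = conn.execute(
        "INSERT INTO trees (doc_id, structure) VALUES (?, ?)",
        (doc_id, json.dumps(structure)),
    )
    tree_id = cur.lastrowid
    conn.executemany(
        "INSERT INTO node_contents (tree_id, node_id, content) VALUES (?, ?, ?)",
        [(tree_id, node_id, text) for node_id, text in contents.items()],
    )
    conn.commit()
    return tree_id


def fetch_structure(conn, doc_id):
    """Return (tree_id, structure) for a document, or None if not stored."""
    row = conn.execute(
        "SELECT tree_id, structure FROM trees WHERE doc_id = ?", (doc_id,)
    ).fetchone()
    return None if row is None else (row[0], json.loads(row[1]))


def build_context(conn, tree_id, llm_reply):
    """Parse the LLM's node selection and join the selected node contents.

    Node IDs that are not in the database are skipped silently.
    """
    parts = []
    for node_id in json.loads(llm_reply)["node_list"]:
        row = conn.execute(
            "SELECT content FROM node_contents WHERE tree_id = ? AND node_id = ?",
            (tree_id, node_id),
        ).fetchone()
        if row is not None:
            parts.append(f"[{node_id}] {row[0]}")
    return "\n\n".join(parts)


# Usage with an in-memory database and placeholder node text.
conn = sqlite3.connect(":memory:")
init_db(conn)
tree_id = store_tree(
    conn,
    "doc-1",
    {"title": "Financial Stability", "node_id": "0006", "nodes": []},
    {"0007": "placeholder text A", "0008": "placeholder text B"},
)
reply = '{"thinking": "look in 0007", "node_list": ["0007", "0009"]}'
print(build_context(conn, tree_id, reply))
```

Keeping node contents in a separate table means the tree sent to the LLM stays small, and full text is loaded only for the nodes it selects.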
prompt = f"""
You are given a question and a tree structure of a document.
You need to find all nodes that are likely to contain the answer.
Question: {question}
Document tree structure: {structure}
Reply in the following JSON format:
{{
  "thinking": <reasoning about where to look>,
  "node_list": [node_id1, node_id2, ...]
}}
"""
Need customized support for your documents or reasoning-based RAG system?
Join our Discord
Leave us a message
Built by Vectify AI.