AyushKatre05/DocQueryAI

Intelligent Document Query System

A FastAPI-powered document query system that downloads PDFs, embeds their text with sentence-transformers, retrieves the most relevant passages with FAISS, and generates answers with Google Gemini.

Features

  • PDF Processing: Download and extract text from PDF URLs using PyMuPDF
  • Vector Embeddings: Create semantic embeddings using sentence-transformers (all-MiniLM-L6-v2)
  • Vector Search: Efficient similarity search using FAISS
  • AI-Powered Answers: Generate contextual answers using Google Gemini 2.5 Flash (with a mock fallback when the API is unavailable)
  • Production Ready: Optimized for deployment on Render with proper error handling

Tech Stack

  • FastAPI: Modern Python web framework
  • sentence-transformers: Semantic text embeddings (all-MiniLM-L6-v2)
  • FAISS: Efficient vector similarity search
  • PyMuPDF: PDF text extraction
  • Google Gemini: AI-powered answer generation
  • httpx: Async HTTP client for PDF downloads

API Endpoints

POST /api/v1/hackrx/run

Process a PDF document and answer questions using AI-powered semantic search.

Request Body:

{
  "documents": "https://example.com/document.pdf",
  "questions": [
    "What is the grace period for premium payment?",
    "Does this policy cover maternity?"
  ]
}

Response Body:

{
  "answers": [
    {
      "question": "What is the grace period for premium payment?",
      "answer": "A grace period of thirty days is provided for premium payment delays.",
      "source_clause": "Clause 3.2: A grace period of thirty days is provided...",
      "explanation": "Answer derived based on Clause 3.2 in the document."
    },
    {
      "question": "Does this policy cover maternity?",
      "answer": "Yes, maternity coverage is included under the medical benefits section.",
      "source_clause": "Section 5.1: Medical benefits include maternity coverage...",
      "explanation": "Information found in the medical benefits section of the policy."
    }
  ]
}
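
The response shape above can be consumed from Python with a small parser. This is a minimal sketch using only the standard library; the `Answer` dataclass and `parse_answers` helper are hypothetical names for illustration and are not part of this repository. It also assumes `source_clause` and `explanation` may be absent, which the README does not state.

```python
import json
from dataclasses import dataclass


@dataclass
class Answer:
    """One entry of the "answers" array (hypothetical client-side type)."""
    question: str
    answer: str
    source_clause: str
    explanation: str


def parse_answers(body: str) -> list[Answer]:
    """Parse a JSON response body into a list of Answer records."""
    data = json.loads(body)
    return [
        Answer(
            question=item["question"],
            answer=item["answer"],
            # Assumption: the supporting fields are treated as optional.
            source_clause=item.get("source_clause", ""),
            explanation=item.get("explanation", ""),
        )
        for item in data["answers"]
    ]
```

Typed records like this make it easier to display each answer next to the clause it was drawn from.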

GET /health

Health check endpoint that returns the status of all system components.

Response:

{
  "status": "healthy",
  "components": {
    "pdf_processor": true,
    "embedding_manager": true,
    "ai_generator": true
  }
}
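
A monitoring script might interpret this payload as follows. This is a sketch only: `failing_components` and `is_healthy` are hypothetical helper names, and the rule that every component must be `true` for the service to count as healthy is an assumption, not documented behavior.

```python
def failing_components(health: dict) -> list[str]:
    """Return the names of components reported as false, sorted by name."""
    components = health.get("components", {})
    return sorted(name for name, ok in components.items() if not ok)


def is_healthy(health: dict) -> bool:
    """Assumption: healthy means status == "healthy" and no component is false."""
    return health.get("status") == "healthy" and not failing_components(health)
```

Reporting the failing component names, rather than a single boolean, shows which part of the pipeline needs attention.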

Setup & Installation

Local Development

  1. Clone the repository

    git clone <repository-url>
    cd intelligent-document-query

  2. Install dependencies

    pip install -r requirements.txt

  3. Set up environment variables

    cp .env.example .env
    # Edit .env and add your Gemini API key

  4. Run the application

    uvicorn main:app --host 0.0.0.0 --port 8000 --reload

  5. Access the API

    • API Documentation: http://localhost:8000/docs
    • Health Check: http://localhost:8000/health
    • Main Endpoint: POST http://localhost:8000/api/v1/hackrx/run

Environment Variables

  • GEMINI_API_KEY: Your Google Gemini API key (needed for AI-generated answers; without it the system falls back to mock responses)
  • DEBUG: Set to true for development mode (default: false)
  • CORS_ORIGINS: Allowed CORS origins (default: *)
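
One way these variables could be read at startup is sketched below. The `Settings` dataclass and `load_settings` function are hypothetical and not taken from this repository; the comma-separated format for `CORS_ORIGINS` is also an assumption.

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional


@dataclass(frozen=True)
class Settings:
    """Hypothetical container for the documented environment variables."""
    gemini_api_key: Optional[str]
    debug: bool
    cors_origins: list[str]


def load_settings(env: Mapping[str, str] = os.environ) -> Settings:
    """Read settings from a mapping, applying the documented defaults."""
    # An empty or missing key is treated as "no key" (mock responses).
    key = env.get("GEMINI_API_KEY") or None
    debug = env.get("DEBUG", "false").strip().lower() == "true"
    # Assumption: multiple origins are separated by commas.
    raw_origins = env.get("CORS_ORIGINS", "*")
    origins = [o.strip() for o in raw_origins.split(",") if o.strip()]
    return Settings(gemini_api_key=key, debug=debug, cors_origins=origins)
```

Accepting the environment as a parameter keeps the function easy to test without modifying the real process environment.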

Getting a Gemini API Key

  1. Go to Google AI Studio
  2. Sign in with your Google account
  3. Click "Get API key"
  4. Create a new project or select an existing one
  5. Generate and copy your API key
  6. Add it to your .env file as GEMINI_API_KEY=your_key_here

Deployment on Render

This application is ready for deployment on Render with the included render.yaml configuration.

  1. Fork/clone this repository
  2. Connect to Render
  3. Configure Environment Variables
    • Add GEMINI_API_KEY in the Render dashboard
    • Other variables are configured in render.yaml
  4. Deploy
    • Render will automatically build and deploy using the configuration

Architecture

The system uses a modular architecture:

  • FastAPI: Async web framework with automatic API documentation
  • PDF Processor: Downloads and extracts text from PDF URLs using PyMuPDF
  • Embedding Manager: Creates vector embeddings using sentence-transformers or TF-IDF fallback
  • FAISS: Efficient vector similarity search for finding relevant document chunks
  • AI Generator: Google Gemini 2.5 Flash for answer generation, with a mock fallback
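
The retrieval step between the Embedding Manager and the AI Generator can be illustrated with a dependency-free stand-in. The sketch below is not the project's implementation: it swaps sentence-transformers embeddings and the FAISS index for plain term-count vectors and a brute-force cosine similarity scan, and `tokenize`, `cosine`, and `top_chunks` are hypothetical names. It only shows the general idea of ranking document chunks against a question.

```python
import math
import re
from collections import Counter


def tokenize(text: str) -> list[str]:
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def top_chunks(question: str, chunks: list[str], k: int = 3) -> list[str]:
    """Return up to k chunks most similar to the question, best first."""
    query = Counter(tokenize(question))
    scored = [(cosine(query, Counter(tokenize(c))), c) for c in chunks]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    # Chunks with no overlap at all are dropped rather than returned.
    return [chunk for score, chunk in scored[:k] if score > 0]
```

In the real system the selected chunks would be passed to the answer generator as context; a dense embedding model can also match paraphrases that share no words with the question, which this stand-in cannot.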

Additional Features

  • Robust PDF Processing: Handles various PDF formats and sizes
  • Smart Text Chunking: Intelligent text segmentation with overlap for context preservation
  • Fallback Systems: TF-IDF vectorizer when advanced embeddings aren't available
  • Mock Responses: Template answers when AI API is unavailable
  • Error Handling: Comprehensive error handling with meaningful messages
  • Production Ready: Optimized for deployment with health checks and monitoring
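
Chunking with overlap can be sketched as a sliding window over words. This is an illustrative version under stated assumptions: the README does not say whether the project splits on words, characters, or sentences, and the `chunk_text` name and the default sizes here are hypothetical.

```python
def chunk_text(text: str, chunk_size: int = 500, overlap: int = 50) -> list[str]:
    """Split text into word windows where neighbours share `overlap` words."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must satisfy 0 <= overlap < chunk_size")

    words = text.split()
    step = chunk_size - overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        # Stop once the window has reached the end of the text.
        if start + chunk_size >= len(words):
            break
    return chunks
```

The overlap means a sentence that straddles a chunk boundary still appears whole in at least one chunk, at the cost of indexing some words twice.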

Example Usage

curl -X POST "http://localhost:8000/api/v1/hackrx/run" \
  -H "Content-Type: application/json" \
  -d '{
    "documents": "https://example.com/policy.pdf",
    "questions": [
      "What is the coverage limit?",
      "Are pre-existing conditions covered?"
    ]
  }'
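
The same request can be sent from Python using only the standard library. This is a sketch: `build_request` and `run_query` are hypothetical helper names, and the 60-second timeout is an arbitrary assumption rather than a documented value.

```python
import json
import urllib.request


def build_request(base_url: str, document_url: str,
                  questions: list[str]) -> urllib.request.Request:
    """Build the POST request for /api/v1/hackrx/run without sending it."""
    payload = {"documents": document_url, "questions": questions}
    return urllib.request.Request(
        url=f"{base_url.rstrip('/')}/api/v1/hackrx/run",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def run_query(base_url: str, document_url: str, questions: list[str],
              timeout: float = 60.0) -> dict:
    """Send the request and return the decoded JSON response."""
    request = build_request(base_url, document_url, questions)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.load(response)
```

For example, `run_query("http://localhost:8000", "https://example.com/policy.pdf", ["What is the coverage limit?"])` would mirror the curl call above, assuming the server is running locally.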

Requirements

  • Python 3.11+
  • FastAPI
  • Google Gemini API key (optional, uses mock responses without it)
  • Internet access for PDF downloads

License

This project is open source and available under the MIT License.
