A full-stack web application that allows users to upload academic PDFs and ask natural language questions about the content using AI-powered retrieval augmented generation (RAG).
```
AI-Powered Research Assistant/
├── frontend/           # HTML, CSS, JavaScript interface
├── backend/            # Flask API server
├── ml/                 # Machine learning modules
├── data/               # PDF storage and FAISS index
├── docs/               # Documentation
└── requirements.txt    # Python dependencies
```
- PDF Upload & Processing: Parse academic PDFs using PyMuPDF
- Semantic Search: Generate embeddings using OpenAI's text-embedding-ada-002
- RAG Implementation: Retrieve relevant chunks and generate answers with GPT-4
- Citation Generation: Automatic APA/MLA citation formatting
- Vector Storage: FAISS index for efficient similarity search
- Related Papers: Find similar papers using cosine similarity
- Azure OpenAI Support: Use Azure OpenAI endpoints and keys
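The related-papers feature ranks stored papers by cosine similarity between their embeddings. A minimal NumPy sketch of that ranking step (function names are illustrative, not the project's actual API):

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity between two embedding vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def top_related(query_vec: np.ndarray, paper_vecs: list, k: int = 3) -> list:
    """Return indices of the k most similar papers, best first."""
    scores = [cosine_similarity(query_vec, v) for v in paper_vecs]
    return sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:k]
```

In the app the vectors would come from the embedding model rather than being built by hand.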
- Python 3.8+
- OpenAI API key OR Azure OpenAI configuration
- pip (Python package manager)
1. Clone the repository

   ```bash
   git clone <repository-url>
   cd ai-research-assistant
   ```
2. Install dependencies

   ```bash
   pip install -r requirements.txt
   ```
3. Configure API keys

   Option A: OpenAI Direct API

   ```bash
   export OPENAI_API_KEY="your-openai-api-key-here"
   ```

   Option B: Azure OpenAI

   ```bash
   export AZURE_OPENAI_ENDPOINT="https://your-resource.openai.azure.com/"
   export AZURE_OPENAI_API_KEY="your-azure-api-key-here"
   # Optional: specify deployment names
   export AZURE_OPENAI_DEPLOYMENT_NAME="your-gpt-4-deployment"
   export AZURE_OPENAI_EMBEDDING_DEPLOYMENT_NAME="your-embedding-deployment"
   ```
4. Run the application

   ```bash
   python backend/app.py
   ```
5. Access the application

   Open your browser and navigate to `http://localhost:5000`
- Get your API key from the OpenAI Platform
- Set the environment variable: `OPENAI_API_KEY`
- Create an Azure OpenAI resource in Azure Portal
- Get your endpoint URL and API key
- Set environment variables:
  - `AZURE_OPENAI_ENDPOINT`
  - `AZURE_OPENAI_API_KEY`
  - `AZURE_OPENAI_DEPLOYMENT_NAME` (optional)
  - `AZURE_OPENAI_EMBEDDING_DEPLOYMENT_NAME` (optional)
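For illustration, a minimal sketch of how a config module might choose between the two providers based on these variables (the dict layout and default deployment names are assumptions, not the project's actual `config.py`):

```python
import os

def load_llm_config() -> dict:
    """Select Azure OpenAI when its endpoint is set; otherwise fall back
    to the direct OpenAI API. Deployment names default when unset."""
    if os.environ.get("AZURE_OPENAI_ENDPOINT"):
        return {
            "provider": "azure",
            "endpoint": os.environ["AZURE_OPENAI_ENDPOINT"],
            "api_key": os.environ.get("AZURE_OPENAI_API_KEY", ""),
            "chat_deployment": os.environ.get(
                "AZURE_OPENAI_DEPLOYMENT_NAME", "gpt-4"),
            "embedding_deployment": os.environ.get(
                "AZURE_OPENAI_EMBEDDING_DEPLOYMENT_NAME",
                "text-embedding-ada-002"),
        }
    return {"provider": "openai",
            "api_key": os.environ.get("OPENAI_API_KEY", "")}
```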
- `app.py`: Main Flask application
- `routes.py`: API endpoints for upload, question-answering, and citations
- `config.py`: Configuration settings
- `pdf_parser.py`: PDF parsing and chunking
- `embedding_generator.py`: OpenAI/Azure embedding generation
- `vector_search.py`: FAISS index management and similarity search
- `rag_engine.py`: Retrieval Augmented Generation implementation
- `citation_generator.py`: APA/MLA citation formatting
- `index.html`: Main application interface
- `style.css`: Styling and responsive design
- `script.js`: Frontend JavaScript functionality
- `uploads/`: Stored PDF files
- `embeddings/`: FAISS index files
- `metadata/`: Document metadata storage
- `POST /upload`: Upload and process PDF files
- `POST /ask`: Ask questions about uploaded documents
- `GET /documents`: List uploaded documents
- `GET /related/<doc_id>`: Find related papers
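A small client sketch for these endpoints using the `requests` library (the JSON field names are assumptions; check `routes.py` for the actual request schema):

```python
import requests

BASE_URL = "http://localhost:5000"

def ask(question: str) -> dict:
    """POST a natural-language question to /ask and return the JSON answer."""
    resp = requests.post(f"{BASE_URL}/ask", json={"question": question})
    resp.raise_for_status()
    return resp.json()

def related_url(doc_id: str) -> str:
    """Build the URL for GET /related/<doc_id>."""
    return f"{BASE_URL}/related/{doc_id}"

if __name__ == "__main__":
    # Upload a PDF, then query it (requires the server to be running).
    with open("paper.pdf", "rb") as f:
        requests.post(f"{BASE_URL}/upload", files={"file": f})
    print(ask("What is the main contribution of this paper?"))
```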
- Upload PDFs: Drag and drop or select academic PDF files
- Ask Questions: Type natural language questions about the content
- Get Answers: Receive AI-generated answers with citations
- Explore Related Papers: Discover similar academic papers
- Store your API keys securely
- Implement proper file validation for PDF uploads
- Consider rate limiting for API endpoints
- Add authentication for production use
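The file-validation point above can be sketched as a pure helper that checks the extension, the 50MB size limit, and the PDF magic bytes, so a renamed executable is rejected even with a `.pdf` extension (names and wiring are illustrative, not the project's actual code):

```python
from pathlib import Path

MAX_SIZE = 50 * 1024 * 1024  # 50 MB upload limit
PDF_MAGIC = b"%PDF-"         # every valid PDF starts with this header

def is_valid_pdf(filename: str, data: bytes) -> bool:
    """Accept only .pdf files within the size limit whose content
    actually begins with the PDF magic number."""
    return (
        Path(filename).suffix.lower() == ".pdf"
        and len(data) <= MAX_SIZE
        and data[:5] == PDF_MAGIC
    )
```

In a Flask route this would run on the uploaded bytes before anything is written to `data/uploads/`.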
- Supports PDFs up to 50MB
- Processes documents in chunks of 1000 tokens
- Retrieves top-5 most relevant chunks for each question
- FAISS index enables fast similarity search
- GPT-4: ~$0.03 per 1K input tokens, ~$0.06 per 1K output tokens
- text-embedding-ada-002: ~$0.0001 per 1K tokens
- Pricing varies by region and model deployment
- Check Azure pricing calculator for your specific setup
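Using the direct-API rates above, a rough per-question cost can be computed; the token counts in the example below are assumptions for illustration:

```python
GPT4_IN, GPT4_OUT = 0.03, 0.06  # USD per 1K tokens (rates listed above)
EMBED = 0.0001                  # text-embedding-ada-002, per 1K tokens

def question_cost(context_tokens: int, question_tokens: int,
                  answer_tokens: int) -> float:
    """Rough USD cost of one RAG query: embed the question, then prompt
    GPT-4 with the retrieved context and generate an answer."""
    embed = question_tokens / 1000 * EMBED
    prompt = (context_tokens + question_tokens) / 1000 * GPT4_IN
    answer = answer_tokens / 1000 * GPT4_OUT
    return embed + prompt + answer

# Five 1000-token chunks + a 50-token question + a 300-token answer:
# question_cost(5000, 50, 300) ≈ $0.17
```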
- Fork the repository
- Create a feature branch
- Make your changes
- Submit a pull request
MIT License - see LICENSE file for details