Compare AI-generated legal document summaries across multiple LLM models. Evaluate factual accuracy, citation precision, and legal utility.
- Multi-Model Comparison: Test 5 different LLM models simultaneously
  - Google Gemini 2.5 Flash
  - Google Gemini 2.5 Flash Lite
  - Google Gemini 3 Flash
  - OpenAI GPT-4.1 Mini
  - OpenAI GPT-4.1 Nano
- Quality Analysis: GPT-5.2 judges each summary on:
  - Factual Accuracy (25% weight)
  - Page/Line Citation Accuracy (20% weight)
  - Relevance (20% weight)
  - Comprehensiveness (15% weight)
  - Legal Utility (20% weight)
- Two Summary Types:
  - Deposition Analysis
  - Medical Record Analysis
- Cost Analysis: Track tokens, cost, and value per model
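The Quality Analysis weights listed above sum to 100%, so an overall score can be computed as a weighted average of the five criterion scores. The following is a minimal sketch of that arithmetic, not the app's actual implementation: the `overallScore` helper, the field names, and the 0-100 score scale are assumptions made for illustration.

```typescript
// Weights copied from the Quality Analysis list; they sum to 100.
const WEIGHTS = {
  factualAccuracy: 25,
  citationAccuracy: 20,
  relevance: 20,
  comprehensiveness: 15,
  legalUtility: 20,
} as const;

type Criterion = keyof typeof WEIGHTS;
type CriterionScores = Record<Criterion, number>;

// Hypothetical helper: combines per-criterion scores (assumed to be on a
// 0-100 scale) into a single weighted overall score on the same scale.
function overallScore(scores: CriterionScores): number {
  let total = 0;
  for (const key of Object.keys(WEIGHTS) as Criterion[]) {
    total += WEIGHTS[key] * scores[key];
  }
  return total / 100;
}

// Illustrative scores only, not real judge output.
const example: CriterionScores = {
  factualAccuracy: 90,
  citationAccuracy: 80,
  relevance: 85,
  comprehensiveness: 70,
  legalUtility: 75,
};

console.log(overallScore(example)); // 81
```

Keeping the weights as integers that sum to 100 and dividing once at the end avoids accumulating floating-point error from fractional weights such as 0.15.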
# Install dependencies
cd apps/quality-checker && bun install
# Set up environment (copy the example and add your API key)
# Create .env.local with:
# CASE_API_KEY=sk_case_your_api_key_here
# Run development server
bun dev # Runs on http://localhost:3050
# Run summary analyzer
bun dev:summary-analyzer
Create a .env.local file with your API keys:
# Case.dev API (for vault, LLM, and quality analysis)
CASE_API_KEY=sk_case_your_api_key
CASE_API_URL=https://api.case.dev
# CaseMark API (for summary generation - required!)
CASEMARK_API_KEY=cm_test_your_casemark_api_key
CASEMARK_API_URL=https://api-staging.casemarkai.com
Get your Case.dev API key from console.case.dev. Get your CaseMark API key from api-staging.casemarkai.com.
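A missing key is easier to diagnose at startup than in the middle of a comparison run. The sketch below shows one way to validate the four variables from the `.env.local` example. The `loadEnv` helper is hypothetical, and treating all four variables as required is an assumption (the README only marks the CaseMark key as required explicitly).

```typescript
// Variable names taken from the .env.local example above.
const REQUIRED_KEYS = [
  "CASE_API_KEY",
  "CASE_API_URL",
  "CASEMARK_API_KEY",
  "CASEMARK_API_URL",
] as const;

type EnvKey = (typeof REQUIRED_KEYS)[number];
type AppEnv = Record<EnvKey, string>;

// Hypothetical helper: returns the validated variables, or throws an error
// that names every variable that is unset or blank.
function loadEnv(source: Record<string, string | undefined>): AppEnv {
  const missing = REQUIRED_KEYS.filter((key) => !source[key]?.trim());
  if (missing.length > 0) {
    throw new Error(`Missing environment variables: ${missing.join(", ")}`);
  }
  const env = {} as AppEnv;
  for (const key of REQUIRED_KEYS) {
    env[key] = source[key] as string;
  }
  return env;
}

// Placeholder values only; real keys belong in .env.local.
const env = loadEnv({
  CASE_API_KEY: "sk_case_your_api_key",
  CASE_API_URL: "https://api.case.dev",
  CASEMARK_API_KEY: "cm_test_your_casemark_api_key",
  CASEMARK_API_URL: "https://api-staging.casemarkai.com",
});
console.log(Object.keys(env).length); // 4
```

In server-side code the helper would typically be called as `loadEnv(process.env)`. Taking the source object as a parameter keeps the function easy to test.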
- Create a Comparison (4-step wizard):
  - Step 1: Matter name, summary type, subject name
  - Step 2: Upload source documents (transcript, records)
  - Step 3: Control Summary (production baseline)
    - Upload an existing production summary, OR
    - Generate via Production API (if configured)
  - Step 4: Review and start
- Processing: The app will:
  - Create a vault and upload documents
  - Generate summaries with all test models (on staging)
  - Run quality analysis comparing each to the control
  - Calculate rankings and cost analysis
- View Results:
  - Control Tab: View the production baseline summary
  - Rankings by overall quality score
  - Cost analysis with value metrics
  - Detailed scores with strengths/weaknesses
  - Download individual summaries or compare side-by-side
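The ranking and value steps above can be sketched as a sort plus a derived metric. This is an illustration under stated assumptions, not the app's code: the `rankResults` helper and its field names are hypothetical, "value" is assumed to mean quality score per dollar, and ties are assumed to be broken in favor of the cheaper model.

```typescript
interface ModelResult {
  model: string;
  score: number;   // overall quality score (assumed 0-100)
  costUsd: number; // total cost of generating the summary
}

interface RankedResult extends ModelResult {
  rank: number;
  scorePerDollar: number; // assumed definition of the "value" metric
}

// Hypothetical helper: ranks by score (highest first), breaking ties by
// lower cost, and attaches a score-per-dollar value metric.
function rankResults(results: ModelResult[]): RankedResult[] {
  return [...results] // copy so the caller's array is not mutated
    .sort((a, b) => b.score - a.score || a.costUsd - b.costUsd)
    .map((result, index) => ({
      ...result,
      rank: index + 1,
      scorePerDollar: result.costUsd > 0 ? result.score / result.costUsd : Infinity,
    }));
}

// Made-up inputs with placeholder model names, for illustration only.
const ranked = rankResults([
  { model: "model-a", score: 82, costUsd: 0.05 },
  { model: "model-b", score: 90, costUsd: 0.2 },
  { model: "model-c", score: 82, costUsd: 0.02 },
]);

console.log(ranked.map((r) => `${r.rank}. ${r.model}`).join(", "));
// 1. model-b, 2. model-c, 3. model-a
```

Ranking by quality and reporting value separately keeps the two questions distinct: the top-ranked model is not necessarily the best value, as `model-c` shows here.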
- Next.js 16 (App Router)
- Tailwind CSS 4
- Radix UI Components
- case.dev API
- Local Storage for persistence
| Model | Input (USD per 1M tokens) | Output (USD per 1M tokens) |
|---|---|---|
| Gemini 2.5 Flash | $0.15 | $0.60 |
| Gemini 2.5 Flash Lite | $0.075 | $0.30 |
| Gemini 3 Flash | $0.10 | $0.40 |
| GPT-4.1 Mini | $0.40 | $1.60 |
| GPT-4.1 Nano | $0.10 | $0.40 |
| GPT-5.2 (Judge) | $3.00 | $12.00 |
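Given the per-million-token prices in the table, the cost of a single run is input tokens times the input price plus output tokens times the output price, divided by one million. A minimal sketch follows. The `estimateCostUsd` helper and the token counts in the example are hypothetical; the prices are copied from the table.

```typescript
interface Pricing {
  inputPerMillion: number;  // USD per 1M input tokens
  outputPerMillion: number; // USD per 1M output tokens
}

// Prices copied from the table above.
const PRICING: Record<string, Pricing> = {
  "Gemini 2.5 Flash": { inputPerMillion: 0.15, outputPerMillion: 0.6 },
  "Gemini 2.5 Flash Lite": { inputPerMillion: 0.075, outputPerMillion: 0.3 },
  "Gemini 3 Flash": { inputPerMillion: 0.1, outputPerMillion: 0.4 },
  "GPT-4.1 Mini": { inputPerMillion: 0.4, outputPerMillion: 1.6 },
  "GPT-4.1 Nano": { inputPerMillion: 0.1, outputPerMillion: 0.4 },
  "GPT-5.2 (Judge)": { inputPerMillion: 3.0, outputPerMillion: 12.0 },
};

// Hypothetical helper: estimates the USD cost of one call from token counts.
function estimateCostUsd(model: string, inputTokens: number, outputTokens: number): number {
  const pricing = PRICING[model];
  if (!pricing) {
    throw new Error(`Unknown model: ${model}`);
  }
  return (
    (inputTokens * pricing.inputPerMillion + outputTokens * pricing.outputPerMillion) /
    1_000_000
  );
}

// Example with assumed token counts: 200,000 input and 5,000 output tokens.
// (200,000 * 0.40 + 5,000 * 1.60) / 1,000,000 is about $0.088.
const miniCost = estimateCostUsd("GPT-4.1 Mini", 200_000, 5_000);
console.log(miniCost.toFixed(3)); // "0.088"
```

Because the judge model is priced well above the test models, the cost of the quality analysis itself is worth tracking separately from the cost of generating each summary.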
Case.dev API:
- POST /vault - Create vault for document storage
- POST /vault/{id}/upload - Upload file to vault
- POST /vault/{id}/ingest/{objectId} - Process document (OCR/text extraction)
- GET /vault/{id}/objects/{objectId} - Check processing status
- GET /vault/{id}/objects/{objectId}/text - Get extracted text
- GET /vault/{id}/objects/{objectId}/download - Get presigned download URL
- POST /llm/v1/chat/completions - Quality analysis (raw LLM)

CaseMark API:
- POST /api/v1/workflows - Create summary workflow (with model parameter)
- GET /api/v1/workflows/{id} - Check workflow status
- POST /api/v1/workflows/{id}/download-result - Download completed summary
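Both APIs expose a status endpoint (`GET /vault/{id}/objects/{objectId}` and `GET /api/v1/workflows/{id}`), which suggests a create-then-poll pattern. The sketch below shows a generic polling loop under that assumption. The `pollUntil` helper, its default interval, and its attempt limit are hypothetical, and because the response shapes are not documented here, mapping a response to "done" is left to the caller-supplied `check` function.

```typescript
// Result of one status check: either finished with a value, or still pending.
type PollResult<T> = { done: true; value: T } | { done: false };

// Hypothetical helper: calls `check` repeatedly until it reports completion,
// waiting `intervalMs` between attempts and giving up after `maxAttempts`.
async function pollUntil<T>(
  check: () => Promise<PollResult<T>>,
  intervalMs = 2000,
  maxAttempts = 30,
): Promise<T> {
  for (let attempt = 1; attempt <= maxAttempts; attempt++) {
    const result = await check();
    if (result.done) {
      return result.value;
    }
    if (attempt < maxAttempts) {
      // Wait before the next status check.
      await new Promise<void>((resolve) => setTimeout(resolve, intervalMs));
    }
  }
  throw new Error(`Gave up after ${maxAttempts} attempts`);
}

// Stand-in for a real status request: reports completion on the third call.
let calls = 0;
const fakeStatusCheck = async (): Promise<PollResult<string>> => {
  calls += 1;
  return calls >= 3 ? { done: true, value: "completed" } : { done: false };
};

pollUntil(fakeStatusCheck, 10).then((status) => console.log(status, calls));
```

In real use, `check` would issue the GET request for the vault object or workflow and translate whatever status field the API returns into `done`. A bounded attempt count prevents a stuck workflow from blocking the comparison indefinitely.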
- Document Upload: PDF uploaded to Case.dev Vault
- Text Extraction: Vault processes document (OCR/text extraction)
- Summary Generation: CaseMark API generates summaries with different models
- Quality Analysis: GPT-5.2 evaluates each summary against the source document
- Results: Rankings, cost analysis, and detailed quality scores