Ruthwik-Data/README.md

Ruthwik Arepelly

Open to AIPM and product-adjacent roles at early-stage AI startups (pre-seed to Series D, teams under 30) building LLMs, RAG, or eval tooling. LinkedIn · Email

I build evaluation-first AI systems — and I can tell you exactly why each one works, where it breaks, and what the numbers say.

7+ years building 0→1 products. Co-founded Photon (EdTech fintech, 75+ schools, $100K ARR). Now building at the intersection of LLMs, RAG, and AI evaluation.

Start here → Mechanic Trust case study — the clearest example of how I design evaluation-first AI products.


What Each Project Proves

| Project | Problem Solved | What It Demonstrates | Live |
| --- | --- | --- | --- |
| Self-Improving Prompt Agent | How do you improve a prompt without guessing? | Built an eval loop that ran 10 rounds; score went 0.10 → 0.80. Key insight: better prompts come from better evals, not more attempts | |
| finrag-eval | Financial RAG hallucinates confidently, and you can't tell | Found 2/3 hallucinations were honest refusals, 1/3 were confidently wrong. Filed a metric-level bug in DeepEval that the team is now fixing | |
| GitScope | Evaluating a GitHub repo takes hours of manual reading | Built an MCP-powered agent that gives PMs structured repo analysis in seconds: PM-first output, not raw code | |
| Mechanic Trust | Auto repair shops exploit trust gaps with opaque pricing | Case study: designed the trust, explainability, and pricing transparency layer for a high-friction AI product | |
| ReceiptIQ | Accountants manually copy-paste receipt data for hours | GPT-4o Vision pipeline with confidence scoring that forces the AI to be honest about what it's uncertain about | Demo |
| Warmlist | PMs lose track of warm contacts who could open doors | GPT-4o-mini CRM that surfaces who to reach out to and why; using LLMs for PM work, not just AI products | |
| SugarShield | AI classifiers over-warn or miss hidden sugar, and you can't tell which failure mode you're in | Built eval infrastructure into the product: 0 false negatives by design, conservative bias as explicit product decision, 87% trigger match rate. Strict vs. Lenient mode comparison built-in | Demo · Eval |

Case Studies

How I think through AI product decisions — not just what I built, but why, what failed, and what the system gets wrong:

Published

  • Mechanic Trust — Trust-critical design in consumer AI: explainability, pricing transparency, failure mode planning

Case Study Pipeline — detailed write-ups in progress, expected June 2026:

  • finrag-eval — Evaluation infrastructure for financial RAG: where metrics lie, where hallucinations hide
  • Self-Improving Prompt Agent — Recursive eval loops: what happens when the optimizer is only as good as its evaluator
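
To make the "optimizer is only as good as its evaluator" point concrete, here is a minimal sketch of an iterative mutate-and-judge loop. It is not the repository's code: `optimize_prompt`, `toy_mutate`, `toy_judge`, and `REQUIREMENTS` are hypothetical names, and the judge is a keyword-matching stand-in for a real LLM-as-judge call.

```python
import random
from typing import Callable, List, Tuple

# Hypothetical rubric: phrases the toy judge looks for in a prompt.
REQUIREMENTS = ["cite the source", "answer in one sentence", "say 'unknown' if unsure"]


def toy_judge(prompt: str) -> float:
    """Stand-in for an LLM-as-judge: fraction of rubric phrases present."""
    hits = sum(1 for phrase in REQUIREMENTS if phrase in prompt)
    return hits / len(REQUIREMENTS)


def toy_mutate(prompt: str, rng: random.Random) -> str:
    """Stand-in mutation: append one randomly chosen instruction."""
    return f"{prompt} Please {rng.choice(REQUIREMENTS)}."


def optimize_prompt(
    seed_prompt: str,
    mutate: Callable[[str, random.Random], str],
    judge: Callable[[str], float],
    rounds: int,
    rng: random.Random,
) -> Tuple[str, float, List[Tuple[str, float]]]:
    """Greedy hill climbing: keep a mutation only if the judge scores it higher."""
    best_prompt, best_score = seed_prompt, judge(seed_prompt)
    history = [(best_prompt, best_score)]
    for _ in range(rounds):
        candidate = mutate(best_prompt, rng)
        score = judge(candidate)
        if score > best_score:
            best_prompt, best_score = candidate, score
        history.append((best_prompt, best_score))
    return best_prompt, best_score, history
```

Because the loop only ever accepts what the judge rewards, any blind spot in the judge becomes a ceiling on the prompt. In this toy version the score can never reflect anything outside the three rubric phrases.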

Open Source Signal

I don't just use evaluation and AI tooling. I find where they break, why, and what to ship next.

Existing contributions:

  • confident-ai/deepeval — Filed root-cause bug on ContextualPrecisionMetric over-penalizing overlapping chunks in financial RAG. Drove technical consensus on the group_by API fix — the Confident AI team is shipping it in the next release. This is evaluation obsession in practice.

  • confident-ai/deepeval — PR to improve ContextualPrecisionMetric with retrieved-context source grouping and fixed weighted precision. Came directly from hands-on financial RAG evaluation work.

  • mem0ai/memory-benchmarks — Added failure-mode regression scenarios for memory systems — because benchmarks that don't surface failure modes aren't useful for real-world agents.

  • weaviate/weaviate — Opened research-driven issue on hybrid search alpha auto-tuning for domain-specific corpora. Surfaced retrieval behavior patterns from financial-document work that the team is now investigating.
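
The overlapping-chunk penalty mentioned in the ContextualPrecisionMetric items above can be illustrated with a small sketch. It assumes a rank-weighted precision (the average of precision@k over the relevant positions) and a hypothetical `group_by_source` helper that collapses chunks sharing a source into one ranked item. This is an illustration of the idea only, not DeepEval's implementation or its `group_by` API, and the source labels are made up.

```python
from typing import Dict, List, Tuple


def weighted_precision(relevance: List[bool]) -> float:
    """Average of precision@k taken at each relevant position k."""
    total_relevant = sum(relevance)
    if total_relevant == 0:
        return 0.0
    hits = 0
    score = 0.0
    for k, is_relevant in enumerate(relevance, start=1):
        if is_relevant:
            hits += 1
            score += hits / k
    return score / total_relevant


def group_by_source(chunks: List[Tuple[str, bool]]) -> List[bool]:
    """Collapse chunks by source, keeping first-seen rank order.

    A source counts as relevant if any of its chunks was judged relevant.
    """
    grouped: Dict[str, bool] = {}
    for source, is_relevant in chunks:
        grouped[source] = grouped.get(source, False) or is_relevant
    return list(grouped.values())


# Hypothetical retrieval result: the second chunk overlaps the first
# (same section) but was judged irrelevant on its own.
chunks = [("item-7", True), ("item-7", False), ("item-1a", True)]

ungrouped = weighted_precision([rel for _, rel in chunks])
grouped = weighted_precision(group_by_source(chunks))
```

Under these assumptions the ungrouped ranking is penalized for the redundant chunk, while the grouped ranking treats the two sections as two correctly ordered hits.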

New issues filed (June 2026):

  • AgentOps-AI/agentops #1383 — Feature proposal + active discussion on GTM/product team dashboard for non-engineer view of agent session health. Contributed narrative translation layer design and MVP scoping for operator-intent dashboards.

  • mastra-ai/mastra #18086 — Feature request for evaluation metrics in multi-step RAG agent workflows. Proposed evaluators config on Mastra workflows for per-hop retrieval confidence and tool selection accuracy — sourced from finrag-eval production patterns.

  • circlemind-ai/fast-graphrag #113 — Feature request for graph-aware eval metrics (graph edge accuracy, node coverage, hierarchy depth accuracy) for knowledge graph RAG. Standard text-similarity metrics miss graph traversal correctness entirely.

  • confident-ai/deepeval #2775 — Feature request for per-document-type eval thresholds in heterogeneous corpora. Structured docs (balance sheets) need binary thresholds; narrative docs need gradient thresholds — a single value fails both.

  • run-llama/llama_index #22032 — Feature request for metadata-aware routing in VectorStoreIndex for heterogeneous financial document RAG. Today's RouterQueryEngine breaks cross-document retrieval; native routing would solve it at the index level.

  • Arize-ai/phoenix #13809 — Feature request for span-level context confidence scores in multi-hop RAG tracing. Phoenix traces execution but not retrieval quality per hop — adding context_confidence and confidence_delta closes the eval loop.

  • firecrawl/firecrawl #3817 — Feature request for extraction quality metadata in Firecrawl responses. Table F1, footnote accuracy, structure preservation — extraction is the silent bottleneck in financial RAG pipelines.

  • mem0ai/mem0 #5614 — Feature request for memory quality eval metrics at retrieval time: staleness risk, conflict detection, importance-weighted recall. Memory quality failures are silent — this closes the observability gap.

  • wandb/weave #7280 — Feature request for per-retrieval-hop quality scores and chain degradation attribution in Weave traces. Execution traces exist; quality waterfall alongside them doesn't.
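
As a rough illustration of the per-document-type threshold idea in the deepeval #2775 item above, the sketch below applies a strict pass bar to structured documents and a softer one to narrative documents. The names (`ThresholdPolicy`, `POLICIES`, `passes`) and every numeric value are hypothetical; none of this is an existing DeepEval feature.

```python
from dataclasses import dataclass
from typing import Dict


@dataclass(frozen=True)
class ThresholdPolicy:
    """Minimum metric score a document of this type must reach to pass."""
    min_score: float


# Hypothetical policies: structured docs must be exactly right,
# narrative docs may pass with partial credit.
POLICIES: Dict[str, ThresholdPolicy] = {
    "balance_sheet": ThresholdPolicy(min_score=1.0),
    "narrative": ThresholdPolicy(min_score=0.7),
}

# A single global threshold, shown for comparison.
GLOBAL_THRESHOLD = 0.8


def passes(doc_type: str, score: float) -> bool:
    """Apply the threshold registered for this document type."""
    return score >= POLICIES[doc_type].min_score


def passes_global(score: float) -> bool:
    """Apply one threshold to every document type."""
    return score >= GLOBAL_THRESHOLD
```

With these illustrative values, a balance-sheet answer scoring 0.9 passes the global threshold even though a figure is off, and a narrative answer scoring 0.75 fails it despite being acceptable. The per-type policies reverse both outcomes.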


Stack I Work In

  • Evaluation: DeepEval, Claude as evaluator, LLM-as-judge patterns, custom eval harnesses, ground-truth scoring
  • RAG: pgvector, Supabase, LangChain, OpenAI embeddings, section-aware chunking
  • Agents: MCP, Claude agents, tool-use patterns, agentic loops, prompt optimization
  • Shipping: Python, TypeScript, Next.js, Vercel, SQL, Docker
  • Models: GPT-4o Vision, Claude Opus, Claude Sonnet, DeepEval for benchmarking


Writing

I write about product thinking, AI systems, and what I learn from building:

View all on Medium →


Background

  • Photon (Co-founder): Built B2B SaaS payments platform for schools — 75+ schools in India, $100K ARR, 8-person team
  • Digital Connect: AI product — built and shipped features for university admin workflows
  • BS Computer Science · MSc Business Analytics, Trine University

What Sets Me Apart

Most AI PMs talk about outputs. I focus on whether the system is trustworthy.

That means evaluating the evaluator (DeepEval Issue #2594), designing products around failure modes before launch (SugarShield: 0 false negatives by design), and measuring improvement through behavior change, not vanity metrics (Self-Improving Prompt Agent: 0.10 → 0.80).

I don't just use AI tools. I find where they break, why they break, and what to ship next because of it.


LinkedIn · Email · Medium

Pinned

  1. finrag-eval (Public)

    RAG eval pipeline on Apple's FY 2024 10-K — found confident hallucinations, filed a metric-level bug in DeepEval, and built section-aware chunking.

    Python

  2. gitscope (Public)

    MCP-powered AI agent that analyzes GitHub repos and surfaces structured insights for product managers and founders.

    Python

  3. self-improving-prompt-agent (Public)

    Prompt optimization loop that improves prompts through iterative mutation and LLM-as-judge evaluation. Score went 0.10 → 0.80 in 10 rounds.

    Python

  4. mechanictrust (Public)

    AI product case study for trust, pricing transparency, and explainable diagnosis in auto repair.

  5. receiptiq (Public)

    AI-powered receipt extraction and finance dashboard using GPT-4o Vision.

    TypeScript