
Ruthwik Arepelly

Open to AI PM and product-adjacent roles at early-stage AI startups (pre-seed to Series D, teams under 30) building LLMs, RAG, or eval tooling. LinkedIn · Email

I build evaluation-first AI systems — and I can tell you exactly why each one works, where it breaks, and what the numbers say.

7+ years building 0→1 products. Co-founded Photon (EdTech fintech, 75+ schools, $100K ARR). Now building at the intersection of LLMs, RAG, and AI evaluation.

Start here → Mechanic Trust case study — the clearest example of how I design evaluation-first AI products.

What Each Project Proves

Project	Problem Solved	What It Demonstrates	Live
Self-Improving Prompt Agent	How do you improve a prompt without guessing?	Built an eval loop that ran 10 rounds — score went 0.10 → 0.80. Key insight: better prompts come from better evals, not more attempts	—
finrag-eval	Financial RAG hallucinates confidently — and you can't tell	Found 2/3 hallucinations were honest refusals, 1/3 were confidently wrong. Filed a metric-level bug in DeepEval that the team is now fixing	—
GitScope	Evaluating a GitHub repo takes hours of manual reading	Built an MCP-powered agent that gives PMs structured repo analysis in seconds — PM-first output, not raw code	—
Mechanic Trust	Auto repair shops exploit trust gaps with opaque pricing	Case study: designed the trust, explainability, and pricing transparency layer for a high-friction AI product	—
ReceiptIQ	Accountants manually copy-paste receipt data for hours	GPT-4o Vision pipeline with confidence scoring — forces the AI to be honest about what it's uncertain about	Demo
Warmlist	PMs lose track of warm contacts who could open doors	GPT-4o-mini CRM that surfaces who to reach out to and why — using LLMs for PM work, not just AI products	—
SugarShield	AI classifiers over-warn or miss hidden sugar — you can't tell which failure mode you're in	Built eval infrastructure into the product: 0 false negatives by design, conservative bias as explicit product decision, 87% trigger match rate. Strict vs. Lenient mode comparison built-in	Demo · Eval

Case Studies

How I think through AI product decisions — not just what I built, but why, what failed, and what the system gets wrong:

Published

Mechanic Trust — Trust-critical design in consumer AI: explainability, pricing transparency, failure mode planning

Case Study Pipeline — detailed write-ups in progress, expected June 2026:

finrag-eval — Evaluation infrastructure for financial RAG: where metrics lie, where hallucinations hide
Self-Improving Prompt Agent — Recursive eval loops: what happens when the optimizer is only as good as its evaluator

Open Source Signal

I don't just use evaluation and AI tooling. I find where they break, why, and what to ship next.

Existing contributions:

confident-ai/deepeval — Filed root-cause bug on ContextualPrecisionMetric over-penalizing overlapping chunks in financial RAG. Drove technical consensus on the group_by API fix — the Confident AI team is shipping it in the next release. This is evaluation obsession in practice.
confident-ai/deepeval — PR to improve ContextualPrecisionMetric with retrieved-context source grouping and fixed weighted precision. Came directly from hands-on financial RAG evaluation work.
mem0ai/memory-benchmarks — Added failure-mode regression scenarios for memory systems — because benchmarks that don't surface failure modes aren't useful for real-world agents.
weaviate/weaviate — Opened research-driven issue on hybrid search alpha auto-tuning for domain-specific corpora. Surfaced retrieval behavior patterns from financial-document work that the team is now investigating.
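
To make the ContextualPrecisionMetric point concrete, here is a minimal sketch in plain Python of one way overlapping chunks could depress a rank-weighted precision score. It does not call DeepEval. It assumes the score is the mean of precision@k over the ranks that hold a relevant chunk, and it uses a hypothetical `group_by_source` helper that collapses chunks from the same source into one ranked item. The function names, the source IDs, and the grouping rule are illustrative assumptions, not the DeepEval API or the fix described above.

```python
from typing import Dict, List, Tuple

Chunk = Tuple[str, bool]  # (source id, judged relevant?)


def weighted_precision(relevance: List[bool]) -> float:
    """Mean of precision@k, taken at each rank k that holds a relevant item."""
    hits = 0
    total = 0.0
    for k, is_relevant in enumerate(relevance, start=1):
        if is_relevant:
            hits += 1
            total += hits / k
    return total / hits if hits else 0.0


def group_by_source(chunks: List[Chunk]) -> List[bool]:
    """Hypothetical grouping: one ranked item per source, relevant if any chunk is."""
    order: List[str] = []
    relevant: Dict[str, bool] = {}
    for source, is_relevant in chunks:
        if source not in relevant:
            order.append(source)
            relevant[source] = False
        relevant[source] = relevant[source] or is_relevant
    return [relevant[source] for source in order]


# Two overlapping chunks from the same section; only the second is judged relevant.
ranked = [("10-K:item7", False), ("10-K:item7", True), ("10-Q:notes", True)]

ungrouped = weighted_precision([rel for _, rel in ranked])
grouped = weighted_precision(group_by_source(ranked))
print(round(ungrouped, 3), grouped)
```

In this toy ranking the ungrouped score is about 0.583 and the grouped score is 1.0: the near-duplicate chunk at rank 1 pushes both relevant chunks down a rank even though retrieval found the right sources in the right order.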

New issues filed (June 2026):

AgentOps-AI/agentops #1383 — Feature proposal + active discussion on GTM/product team dashboard for non-engineer view of agent session health. Contributed narrative translation layer design and MVP scoping for operator-intent dashboards.
mastra-ai/mastra #18086 — Feature request for evaluation metrics in multi-step RAG agent workflows. Proposed evaluators config on Mastra workflows for per-hop retrieval confidence and tool selection accuracy — sourced from finrag-eval production patterns.
circlemind-ai/fast-graphrag #113 — Feature request for graph-aware eval metrics (graph edge accuracy, node coverage, hierarchy depth accuracy) for knowledge graph RAG. Standard text-similarity metrics miss graph traversal correctness entirely.
confident-ai/deepeval #2775 — Feature request for per-document-type eval thresholds in heterogeneous corpora. Structured docs (balance sheets) need binary thresholds; narrative docs need gradient thresholds — a single value fails both.
run-llama/llama_index #22032 — Feature request for metadata-aware routing in VectorStoreIndex for heterogeneous financial document RAG. Today's RouterQueryEngine breaks cross-document retrieval; native routing would solve it at the index level.
Arize-ai/phoenix #13809 — Feature request for span-level context confidence scores in multi-hop RAG tracing. Phoenix traces execution but not retrieval quality per hop — adding context_confidence and confidence_delta closes the eval loop.
firecrawl/firecrawl #3817 — Feature request for extraction quality metadata in Firecrawl responses. Table F1, footnote accuracy, structure preservation — extraction is the silent bottleneck in financial RAG pipelines.
mem0ai/mem0 #5614 — Feature request for memory quality eval metrics at retrieval time: staleness risk, conflict detection, importance-weighted recall. Memory quality failures are silent — this closes the observability gap.
wandb/weave #7280 — Feature request for per-retrieval-hop quality scores and chain degradation attribution in Weave traces. Execution traces exist; a quality waterfall alongside them does not.
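
The per-document-type threshold idea from the DeepEval #2775 request can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the document-type labels, the threshold values, and the `passes_by_type` helper are all hypothetical and are not part of DeepEval.

```python
from typing import Dict

# Illustrative values only; real thresholds would come from labeled eval data.
THRESHOLDS: Dict[str, float] = {
    "balance_sheet": 1.0,  # structured: any wrong figure should fail (binary)
    "narrative": 0.7,      # narrative: partial credit is acceptable (gradient)
}
SINGLE_THRESHOLD = 0.8     # the one-size-fits-all value being compared against


def passes_single(score: float) -> bool:
    """Pass/fail against one global threshold."""
    return score >= SINGLE_THRESHOLD


def passes_by_type(doc_type: str, score: float) -> bool:
    """Pass/fail against the threshold for this document type.

    Unknown document types fall back to the single global value.
    """
    return score >= THRESHOLDS.get(doc_type, SINGLE_THRESHOLD)


cases = [("balance_sheet", 0.9), ("narrative", 0.75)]
for doc_type, score in cases:
    print(doc_type, score, passes_single(score), passes_by_type(doc_type, score))
```

With these made-up numbers the single threshold passes a balance-sheet answer that scored 0.9 (it contains an error) and fails a narrative answer that scored 0.75 (it is acceptable), while the per-type thresholds reverse both verdicts.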

Stack I Work In

Evaluation: DeepEval, Claude as evaluator, LLM-as-judge patterns, custom eval harnesses, ground-truth scoring
RAG: pgvector, Supabase, LangChain, OpenAI embeddings, section-aware chunking
Agents: MCP, Claude agents, tool-use patterns, agentic loops, prompt optimization
Shipping: Python, TypeScript, Next.js, Vercel, SQL, Docker
Models: GPT-4o Vision, Claude Opus, Claude Sonnet, DeepEval for benchmarking

Writing

I write about product thinking, AI systems, and what I learn from building:

Product Learning: How Gifting Became a Growth Engine, Not a Feature — Feature → growth lever
How I Turn User Complaints Into Feature Ideas (Simple 7-Step Method) — Product thinking framework
From Venue to Platform: The Bernabéu as a Product — How physical spaces evolve into platforms
How I Built SugarShield: From a Grocery Aisle Problem to a Working AI Product — Full build case study
Tap & Pray Is Not a Payment Strategy — Fintech product lessons
Product Experiment: IntentTabs — Adding Friction to Fight Impulse — Behavioral design in product

View all on Medium →

Background

Photon (Co-founder): Built B2B SaaS payments platform for schools — 75+ schools in India, $100K ARR, 8-person team
Digital Connect: AI product — built and shipped features for university admin workflows
BS Computer Science · MSc Business Analytics, Trine University

What Sets Me Apart

Most AI PMs talk about outputs. I focus on whether the system is trustworthy.

That means evaluating the evaluator (DeepEval Issue #2594), designing products around failure modes before launch (SugarShield: 0 false negatives by design), and measuring improvement through behavior change, not vanity metrics (Self-Improving Prompt Agent: 0.10 → 0.80).

I don't just use AI tools. I find where they break, why they break, and what to ship next because of it.

LinkedIn · Email · Medium
