Add evaluation harness for IMO-AnswerBench by yurekami · Pull Request #11 · google-deepmind/superhuman

yurekami · 2026-02-14T11:49:10Z

Summary

Adds a Python evaluation harness for scoring model outputs against IMO-AnswerBench ground truth answers
Supports multiple answer-checking strategies: exact match, numeric comparison, SymPy-based mathematical equivalence, multi-answer set matching, and normalized string comparison
Provides a CLI for easy evaluation from the command line, with both text and JSON output formats

Motivation

The benchmark datasets are available but there is no standardized tooling for researchers to evaluate their model outputs against the ground truth. This makes it difficult to reproduce and compare results. This harness fills that gap, covering ~91% of answerbench answers with programmatic checking (227 numeric + 103 LaTeX + 33 multi-answer out of 400 total).

What's Included

File	Purpose
`imobench/eval/answer_checker.py`	Core math equivalence checking via SymPy
`imobench/eval/evaluate.py`	Benchmark evaluation runner (CSV/JSONL input)
`imobench/eval/metrics.py`	Accuracy metrics by category/subcategory/source
`imobench/eval/cli.py`	Command-line interface with error handling
`imobench/eval/README.md`	Usage documentation with examples
`imobench/eval/tests/`	63 unit tests covering all modules
`imobench/eval/requirements.txt`	Dependencies (sympy, antlr4-python3-runtime)
`imobench/eval/requirements-dev.txt`	Dev dependencies (includes pytest)

Answer Checking Strategies

The checker tries strategies in order, returning the first match:

#	Strategy	Example
1	Exact match (after LaTeX normalization)	$\frac{1}{2}$ == `\frac{1}{2}`
2	Numeric comparison (finite floats only)	`3.0` == `3`
3	Multi-answer set matching (order-independent)	`3, 1, 2` == `1, 2, 3`
4	Normalized string (case/whitespace insensitive)	`Algebra` == `algebra`
5	SymPy equivalence (math expressions only)	`\frac{1}{2}` == `0.5`

Security & Robustness

Thread-based 5-second timeout on all SymPy operations (cross-platform, no signal.SIGALRM)
NaN/inf rejection in numeric parsing (math.isfinite guard)
Depth-clamped brace tracking in multi-answer splitter
Math heuristic guard prevents SymPy from parsing plain text as symbols
LaTeX delimiter normalization covers $, $$, $, $, \[, \]
Input validation: non-empty problem IDs, missing file handling, empty prediction sets
ANTLR4 pinned to 4.11.1 for SymPy LaTeX parser compatibility

Usage

# Install dependencies
pip install -r imobench/eval/requirements.txt

# Evaluate predictions (CSV or JSONL)
python -m imobench.eval.cli predictions.csv

# JSON output
python -m imobench.eval.cli predictions.csv --format json

# Save detailed results
python -m imobench.eval.cli predictions.csv --output results.json

from imobench.eval import check_answer, evaluate_predictions, compute_metrics

# Single answer check
result = check_answer(r"\frac{1}{2}", "0.5")
# {'correct': True, 'method': 'sympy', 'details': ''}

Commits

Commit	Description
`f507261`	feat: Add evaluation harness for IMO-AnswerBench
`5787d2c`	fix: Harden answer checker (timeouts, NaN/inf, depth clamping)
`e501388`	fix: Address MEDIUM findings (requirements-dev, validation, ANTLR pin)
`51617a4`	fix: Address LOW findings (broader heuristics, delimiters, CLI/helper tests)

Test Plan

63 unit tests passing across 4 test files
test_answer_checker.py — 46 tests: normalization, splitting, all 5 checking strategies, helper functions, edge cases
test_evaluate.py — 6 tests: CSV/JSONL loading, evaluation runner, missing predictions, result structure
test_metrics.py — 6 tests: accuracy computation, category breakdowns, report formatting
test_cli.py — 5 tests: text/JSON output, file output, missing file handling, empty predictions
Self-reviewed: 0 CRITICAL, 3 HIGH, 8 MEDIUM, 7 LOW findings — all addressed
Review by maintainers for compatibility with existing/planned tooling

Add a Python evaluation tool for scoring model outputs against IMO-AnswerBench ground truth answers. Supports multiple checking strategies: exact match, numeric comparison, SymPy-based mathematical equivalence, multi-answer set matching, and normalized string comparison. Includes: - answer_checker.py: Core math equivalence checking via SymPy - evaluate.py: Benchmark evaluation runner (CSV/JSONL input) - metrics.py: Accuracy metrics by category/subcategory/source - cli.py: Command-line interface - 34 unit tests covering all modules - README with usage examples

Address code review findings: - HIGH: Add 5s timeout on SymPy parse/simplify to prevent DoS on pathological expressions (thread-based for cross-platform support) - HIGH: Reject NaN/inf in numeric comparison (only finite values) - HIGH: Clamp bracket depth to 0 in multi-answer splitter - MEDIUM: Add recursion depth limit (max 2) in multi-answer matching - MEDIUM: Remove unused `import os` from evaluate.py - MEDIUM: Replace deprecated Optional with modern union syntax - LOW: Fix README strategy table order to match code execution order - Add tests for NaN, inf, and unbalanced bracket edge cases

yurekami

Self-Review: IMO-AnswerBench Evaluation Harness

What This PR Does

Adds a Python evaluation harness for scoring model outputs against IMO-AnswerBench ground truth using mathematical equivalence checking (SymPy), with CLI and Python API.

Code Review Summary (37 tests passing)

Severity	Count	Status
CRITICAL	0	--
HIGH	3	All fixed in follow-up commit
MEDIUM	8	4 fixed, 4 documented below
LOW	7	1 fixed, rest are minor

Issues Fixed (commit `5787d2c`)

HIGH-1: Added 5s thread-based timeout on SymPy parse_latex/simplify to prevent DoS on pathological expressions
HIGH-2: Reject NaN/inf in numeric comparison -- only finite values accepted
HIGH-3: Clamp bracket depth to max(0, depth-1) in multi-answer splitter for malformed input
MEDIUM-1: Removed unused import os
MEDIUM-2: Replaced deprecated Optional[float] with float | None
MEDIUM-3: Added recursion depth limit (_depth param, max 2) in multi-answer matching
LOW-5: Fixed README strategy table order to match actual code execution order
Added 3 new edge-case tests (NaN, inf, unbalanced brackets)

Known Limitations (not blocking)

MEDIUM-4: Multi-answer matching is greedy (O(n²)), not optimal bipartite matching. Acceptable since IMO answer sets are typically 2-5 elements.
MEDIUM-6: pytest not in requirements.txt (mentioned in README install instructions instead)
MEDIUM-8: antlr4-python3-runtime==4.11.1 pin is fragile but required for SymPy LaTeX parser compatibility
LOW-1: _looks_like_math heuristic doesn't cover all math notation (e.g., sqrt(2))
LOW-2: normalize_latex doesn't handle $ / $ delimiters
LOW-3: No CLI-specific tests
LOW-7: Raw FileNotFoundError on missing files (clear traceback, just not user-friendly)

Coverage

37 unit tests across 3 test modules
Covers: normalization, splitting, numeric/exact/sympy/multi-answer/string matching, evaluation pipeline, metrics computation, report formatting, NaN/inf/bracket edge cases

- MEDIUM-6: Add requirements-dev.txt with pytest dependency - MEDIUM-7: Validate non-empty problem IDs in load_predictions, raise ValueError with file/line info on missing IDs - MEDIUM-8: Add explanatory comment for antlr4 version pin - Update README install instructions for dev deps - Add test for empty problem ID validation

- Broaden _looks_like_math heuristic (math functions, digit operators) - Add  and \[ \] delimiter handling in normalize_latex - Add CLI tests (5 tests covering text/json output, file output, errors) - Add helper function tests (TestTryParseNumber, TestTryParseSympy, TestExpressionsEquivalent, TestLooksLikeMath, TestNormalizeLatexDelimiters) - Add FileNotFoundError/ValueError handling in CLI

yurekami · 2026-02-15T09:18:05Z

Updated Self-Review Status

All findings from the initial self-review have now been addressed across 3 follow-up commits:

Severity	Count	Status
CRITICAL	0	--
HIGH	3	All fixed (`5787d2c`)
MEDIUM	8	All fixed (`5787d2c`, `e501388`)
LOW	7	All fixed (`51617a4`)

What changed since the initial review:

MEDIUM-6: Added requirements-dev.txt with pytest
MEDIUM-7: Validate non-empty problem IDs in load_predictions
MEDIUM-8: Added explanatory comment for ANTLR4 pin
LOW-1: Broadened _looks_like_math heuristic (math functions like sqrt, log, sin; digit operators like 2+3)
LOW-2: Added $ $ and \[ \] delimiter handling in normalize_latex
LOW-3: Added 5 CLI tests (test_cli.py)
LOW-6: Added 17 helper function tests (parse number, parse sympy, expressions equivalent, looks like math, normalize delimiters)
LOW-7: Added FileNotFoundError/ValueError handling in CLI with user-friendly messages

Test suite: 63 tests passing (up from 37 at initial review).

yurekami added 2 commits February 14, 2026 20:48

yurekami commented Feb 14, 2026

View reviewed changes

yurekami added 2 commits February 14, 2026 22:03

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Add evaluation harness for IMO-AnswerBench#11

Add evaluation harness for IMO-AnswerBench#11
yurekami wants to merge 4 commits intogoogle-deepmind:mainfrom
yurekami:feature/eval-harness

yurekami commented Feb 14, 2026 •

edited

Loading

yurekami left a comment

yurekami commented Feb 15, 2026

Labels

1 participant

Conversation

yurekami commented Feb 14, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!