Skip to content

Add evaluation harness for IMO-AnswerBench#11

Open
yurekami wants to merge 4 commits intogoogle-deepmind:mainfrom
yurekami:feature/eval-harness
Open

Add evaluation harness for IMO-AnswerBench#11
yurekami wants to merge 4 commits intogoogle-deepmind:mainfrom
yurekami:feature/eval-harness

Conversation

@yurekami
Copy link

@yurekami yurekami commented Feb 14, 2026

Summary

  • Adds a Python evaluation harness for scoring model outputs against IMO-AnswerBench ground truth answers
  • Supports multiple answer-checking strategies: exact match, numeric comparison, SymPy-based mathematical equivalence, multi-answer set matching, and normalized string comparison
  • Provides a CLI for easy evaluation from the command line, with both text and JSON output formats

Motivation

The benchmark datasets are available but there is no standardized tooling for researchers to evaluate their model outputs against the ground truth. This makes it difficult to reproduce and compare results. This harness fills that gap, covering ~91% of answerbench answers with programmatic checking (227 numeric + 103 LaTeX + 33 multi-answer out of 400 total).

What's Included

File Purpose
imobench/eval/answer_checker.py Core math equivalence checking via SymPy
imobench/eval/evaluate.py Benchmark evaluation runner (CSV/JSONL input)
imobench/eval/metrics.py Accuracy metrics by category/subcategory/source
imobench/eval/cli.py Command-line interface with error handling
imobench/eval/README.md Usage documentation with examples
imobench/eval/tests/ 63 unit tests covering all modules
imobench/eval/requirements.txt Dependencies (sympy, antlr4-python3-runtime)
imobench/eval/requirements-dev.txt Dev dependencies (includes pytest)

Answer Checking Strategies

The checker tries strategies in order, returning the first match:

# Strategy Example
1 Exact match (after LaTeX normalization) $\frac{1}{2}$ == \frac{1}{2}
2 Numeric comparison (finite floats only) 3.0 == 3
3 Multi-answer set matching (order-independent) 3, 1, 2 == 1, 2, 3
4 Normalized string (case/whitespace insensitive) Algebra == algebra
5 SymPy equivalence (math expressions only) \frac{1}{2} == 0.5

Security & Robustness

  • Thread-based 5-second timeout on all SymPy operations (cross-platform, no signal.SIGALRM)
  • NaN/inf rejection in numeric parsing (math.isfinite guard)
  • Depth-clamped brace tracking in multi-answer splitter
  • Math heuristic guard prevents SymPy from parsing plain text as symbols
  • LaTeX delimiter normalization covers $, $$, \(, \), \[, \]
  • Input validation: non-empty problem IDs, missing file handling, empty prediction sets
  • ANTLR4 pinned to 4.11.1 for SymPy LaTeX parser compatibility

Usage

# Install dependencies
pip install -r imobench/eval/requirements.txt

# Evaluate predictions (CSV or JSONL)
python -m imobench.eval.cli predictions.csv

# JSON output
python -m imobench.eval.cli predictions.csv --format json

# Save detailed results
python -m imobench.eval.cli predictions.csv --output results.json
from imobench.eval import check_answer, evaluate_predictions, compute_metrics

# Single answer check
result = check_answer(r"\frac{1}{2}", "0.5")
# {'correct': True, 'method': 'sympy', 'details': ''}

Commits

Commit Description
f507261 feat: Add evaluation harness for IMO-AnswerBench
5787d2c fix: Harden answer checker (timeouts, NaN/inf, depth clamping)
e501388 fix: Address MEDIUM findings (requirements-dev, validation, ANTLR pin)
51617a4 fix: Address LOW findings (broader heuristics, delimiters, CLI/helper tests)

Test Plan

  • 63 unit tests passing across 4 test files
  • test_answer_checker.py — 46 tests: normalization, splitting, all 5 checking strategies, helper functions, edge cases
  • test_evaluate.py — 6 tests: CSV/JSONL loading, evaluation runner, missing predictions, result structure
  • test_metrics.py — 6 tests: accuracy computation, category breakdowns, report formatting
  • test_cli.py — 5 tests: text/JSON output, file output, missing file handling, empty predictions
  • Self-reviewed: 0 CRITICAL, 3 HIGH, 8 MEDIUM, 7 LOW findings — all addressed
  • Review by maintainers for compatibility with existing/planned tooling
Add a Python evaluation tool for scoring model outputs against
IMO-AnswerBench ground truth answers. Supports multiple checking
strategies: exact match, numeric comparison, SymPy-based mathematical
equivalence, multi-answer set matching, and normalized string comparison.

Includes:
- answer_checker.py: Core math equivalence checking via SymPy
- evaluate.py: Benchmark evaluation runner (CSV/JSONL input)
- metrics.py: Accuracy metrics by category/subcategory/source
- cli.py: Command-line interface
- 34 unit tests covering all modules
- README with usage examples
Address code review findings:
- HIGH: Add 5s timeout on SymPy parse/simplify to prevent DoS on
  pathological expressions (thread-based for cross-platform support)
- HIGH: Reject NaN/inf in numeric comparison (only finite values)
- HIGH: Clamp bracket depth to 0 in multi-answer splitter
- MEDIUM: Add recursion depth limit (max 2) in multi-answer matching
- MEDIUM: Remove unused `import os` from evaluate.py
- MEDIUM: Replace deprecated Optional with modern union syntax
- LOW: Fix README strategy table order to match code execution order
- Add tests for NaN, inf, and unbalanced bracket edge cases
Copy link
Author

@yurekami yurekami left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Self-Review: IMO-AnswerBench Evaluation Harness

What This PR Does

Adds a Python evaluation harness for scoring model outputs against IMO-AnswerBench ground truth using mathematical equivalence checking (SymPy), with CLI and Python API.

Code Review Summary (37 tests passing)

Severity Count Status
CRITICAL 0 --
HIGH 3 All fixed in follow-up commit
MEDIUM 8 4 fixed, 4 documented below
LOW 7 1 fixed, rest are minor

Issues Fixed (commit 5787d2c)

  • HIGH-1: Added 5s thread-based timeout on SymPy parse_latex/simplify to prevent DoS on pathological expressions
  • HIGH-2: Reject NaN/inf in numeric comparison -- only finite values accepted
  • HIGH-3: Clamp bracket depth to max(0, depth-1) in multi-answer splitter for malformed input
  • MEDIUM-1: Removed unused import os
  • MEDIUM-2: Replaced deprecated Optional[float] with float | None
  • MEDIUM-3: Added recursion depth limit (_depth param, max 2) in multi-answer matching
  • LOW-5: Fixed README strategy table order to match actual code execution order
  • Added 3 new edge-case tests (NaN, inf, unbalanced brackets)

Known Limitations (not blocking)

  • MEDIUM-4: Multi-answer matching is greedy (O(n²)), not optimal bipartite matching. Acceptable since IMO answer sets are typically 2-5 elements.
  • MEDIUM-6: pytest not in requirements.txt (mentioned in README install instructions instead)
  • MEDIUM-8: antlr4-python3-runtime==4.11.1 pin is fragile but required for SymPy LaTeX parser compatibility
  • LOW-1: _looks_like_math heuristic doesn't cover all math notation (e.g., sqrt(2))
  • LOW-2: normalize_latex doesn't handle \( / \) delimiters
  • LOW-3: No CLI-specific tests
  • LOW-7: Raw FileNotFoundError on missing files (clear traceback, just not user-friendly)

Coverage

  • 37 unit tests across 3 test modules
  • Covers: normalization, splitting, numeric/exact/sympy/multi-answer/string matching, evaluation pipeline, metrics computation, report formatting, NaN/inf/bracket edge cases
- MEDIUM-6: Add requirements-dev.txt with pytest dependency
- MEDIUM-7: Validate non-empty problem IDs in load_predictions,
  raise ValueError with file/line info on missing IDs
- MEDIUM-8: Add explanatory comment for antlr4 version pin
- Update README install instructions for dev deps
- Add test for empty problem ID validation
- Broaden _looks_like_math heuristic (math functions, digit operators)
- Add \( \) and \[ \] delimiter handling in normalize_latex
- Add CLI tests (5 tests covering text/json output, file output, errors)
- Add helper function tests (TestTryParseNumber, TestTryParseSympy,
  TestExpressionsEquivalent, TestLooksLikeMath, TestNormalizeLatexDelimiters)
- Add FileNotFoundError/ValueError handling in CLI
@yurekami
Copy link
Author

Updated Self-Review Status

All findings from the initial self-review have now been addressed across 3 follow-up commits:

Severity Count Status
CRITICAL 0 --
HIGH 3 All fixed (5787d2c)
MEDIUM 8 All fixed (5787d2c, e501388)
LOW 7 All fixed (51617a4)

What changed since the initial review:

  • MEDIUM-6: Added requirements-dev.txt with pytest
  • MEDIUM-7: Validate non-empty problem IDs in load_predictions
  • MEDIUM-8: Added explanatory comment for ANTLR4 pin
  • LOW-1: Broadened _looks_like_math heuristic (math functions like sqrt, log, sin; digit operators like 2+3)
  • LOW-2: Added \( \) and \[ \] delimiter handling in normalize_latex
  • LOW-3: Added 5 CLI tests (test_cli.py)
  • LOW-6: Added 17 helper function tests (parse number, parse sympy, expressions equivalent, looks like math, normalize delimiters)
  • LOW-7: Added FileNotFoundError/ValueError handling in CLI with user-friendly messages

Test suite: 63 tests passing (up from 37 at initial review).

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

1 participant