Add evaluation harness for IMO-AnswerBench#11
Open
yurekami wants to merge 4 commits into google-deepmind:main from
Conversation
Add a Python evaluation tool for scoring model outputs against IMO-AnswerBench ground truth answers. Supports multiple checking strategies: exact match, numeric comparison, SymPy-based mathematical equivalence, multi-answer set matching, and normalized string comparison.

Includes:
- `answer_checker.py`: Core math equivalence checking via SymPy
- `evaluate.py`: Benchmark evaluation runner (CSV/JSONL input)
- `metrics.py`: Accuracy metrics by category/subcategory/source
- `cli.py`: Command-line interface
- 34 unit tests covering all modules
- README with usage examples
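The first-match strategy chain described above can be sketched roughly as follows. This is a minimal illustration, not the harness's actual API: `check_answer` and the two toy strategies are hypothetical names, and the real tool also runs numeric, SymPy, and multi-answer checkers.

```python
from typing import Callable

# Illustrative strategies only; the real harness also has numeric,
# SymPy-equivalence, and multi-answer checkers in its chain.
def exact_match(pred: str, truth: str) -> bool:
    return pred.strip() == truth.strip()

def normalized_match(pred: str, truth: str) -> bool:
    return pred.strip().lower() == truth.strip().lower()

STRATEGIES: list[Callable[[str, str], bool]] = [exact_match, normalized_match]

def check_answer(pred: str, truth: str) -> bool:
    # Try each strategy in order; the first one that matches wins.
    return any(strategy(pred, truth) for strategy in STRATEGIES)
```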
Address code review findings:
- HIGH: Add 5s timeout on SymPy parse/simplify to prevent DoS on pathological expressions (thread-based for cross-platform support)
- HIGH: Reject NaN/inf in numeric comparison (only finite values)
- HIGH: Clamp bracket depth to 0 in multi-answer splitter
- MEDIUM: Add recursion depth limit (max 2) in multi-answer matching
- MEDIUM: Remove unused `import os` from evaluate.py
- MEDIUM: Replace deprecated Optional with modern union syntax
- LOW: Fix README strategy table order to match code execution order
- Add tests for NaN, inf, and unbalanced bracket edge cases
yurekami (Author) commented on Feb 14, 2026
Self-Review: IMO-AnswerBench Evaluation Harness
What This PR Does
Adds a Python evaluation harness for scoring model outputs against IMO-AnswerBench ground truth using mathematical equivalence checking (SymPy), with CLI and Python API.
Code Review Summary (37 tests passing)
| Severity | Count | Status |
|---|---|---|
| CRITICAL | 0 | -- |
| HIGH | 3 | All fixed in follow-up commit |
| MEDIUM | 8 | 4 fixed, 4 documented below |
| LOW | 7 | 1 fixed, rest are minor |
Issues Fixed (commit 5787d2c)
- HIGH-1: Added 5s thread-based timeout on SymPy `parse_latex`/`simplify` to prevent DoS on pathological expressions
- HIGH-2: Reject NaN/inf in numeric comparison -- only finite values accepted
- HIGH-3: Clamp bracket depth to `max(0, depth - 1)` in multi-answer splitter for malformed input
- MEDIUM-1: Removed unused `import os`
- MEDIUM-2: Replaced deprecated `Optional[float]` with `float | None`
- MEDIUM-3: Added recursion depth limit (`_depth` param, max 2) in multi-answer matching
- LOW-5: Fixed README strategy table order to match actual code execution order
- Added 3 new edge-case tests (NaN, inf, unbalanced brackets)
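The thread-based timeout from HIGH-1 can be sketched like this. `run_with_timeout` is a hypothetical helper name and the real implementation may differ; the key point is that a daemon thread works on Windows, where `signal.SIGALRM` does not.

```python
import threading

def run_with_timeout(fn, *args, timeout: float = 5.0):
    """Run fn(*args) in a daemon thread; return None on timeout or error.

    Threads cannot be killed, so a timed-out computation is abandoned
    (it keeps burning CPU until it finishes) -- the caller just stops waiting.
    """
    result = []

    def worker():
        try:
            result.append(fn(*args))
        except Exception:
            pass  # treat parse/simplify errors as "no result"

    t = threading.Thread(target=worker, daemon=True)
    t.start()
    t.join(timeout)
    if t.is_alive() or not result:
        return None
    return result[0]
```

The tradeoff versus `signal.SIGALRM` is that an abandoned SymPy call still runs to completion in the background; the 5s bound only caps how long the evaluator waits.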
Known Limitations (not blocking)
- MEDIUM-4: Multi-answer matching is greedy (O(n²)), not optimal bipartite matching. Acceptable since IMO answer sets are typically 2-5 elements.
- MEDIUM-6: `pytest` not in `requirements.txt` (mentioned in README install instructions instead)
- MEDIUM-8: `antlr4-python3-runtime==4.11.1` pin is fragile but required for SymPy LaTeX parser compatibility
- LOW-1: `_looks_like_math` heuristic doesn't cover all math notation (e.g., `sqrt(2)`)
- LOW-2: `normalize_latex` doesn't handle `\(`/`\)` delimiters
- LOW-3: No CLI-specific tests
- LOW-7: Raw `FileNotFoundError` on missing files (clear traceback, just not user-friendly)
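The greedy matching from MEDIUM-4 can be sketched as below. The function name is illustrative, and the case-insensitive comparison is a stand-in for the full element-level checker.

```python
def greedy_multi_match(pred: list[str], truth: list[str]) -> bool:
    """Pair each predicted element with the first unused truth element it matches."""
    if len(pred) != len(truth):
        return False
    unused = list(truth)
    for p in pred:
        for t in unused:
            if p.strip().lower() == t.strip().lower():  # stand-in for the full checker
                unused.remove(t)  # each truth element may be consumed once
                break
        else:
            return False  # p matched no remaining truth element
    return True
```

Greedy pairing can reject inputs that an optimal bipartite matching would accept when one prediction matches several truth elements; with 2-5 element answer sets that corner case is rare in practice.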
Coverage
- 37 unit tests across 3 test modules
- Covers: normalization, splitting, numeric/exact/sympy/multi-answer/string matching, evaluation pipeline, metrics computation, report formatting, NaN/inf/bracket edge cases
- MEDIUM-6: Add requirements-dev.txt with pytest dependency
- MEDIUM-7: Validate non-empty problem IDs in load_predictions; raise ValueError with file/line info on missing IDs
- MEDIUM-8: Add explanatory comment for antlr4 version pin
- Update README install instructions for dev deps
- Add test for empty problem ID validation
- Broaden `_looks_like_math` heuristic (math functions, digit operators)
- Add `\( \)` and `\[ \]` delimiter handling in `normalize_latex`
- Add CLI tests (5 tests covering text/json output, file output, errors)
- Add helper function tests (TestTryParseNumber, TestTryParseSympy, TestExpressionsEquivalent, TestLooksLikeMath, TestNormalizeLatexDelimiters)
- Add `FileNotFoundError`/`ValueError` handling in CLI
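The added delimiter handling might look roughly like this sketch. `strip_math_delimiters` is a hypothetical name; the real `normalize_latex` does considerably more than unwrap delimiters.

```python
def strip_math_delimiters(s: str) -> str:
    r"""Strip one layer of paired $...$, $$...$$, \(...\), or \[...\] wrappers."""
    s = s.strip()
    # Check the two-character delimiters before the bare "$" so "$$x$$"
    # is not mistaken for "$"-wrapped input.
    for open_d, close_d in (("$$", "$$"), ("$", "$"), (r"\(", r"\)"), (r"\[", r"\]")):
        if (s.startswith(open_d) and s.endswith(close_d)
                and len(s) >= len(open_d) + len(close_d)):
            return s[len(open_d):-len(close_d)].strip()
    return s
```

The length guard prevents a lone `$` from matching itself as both opener and closer.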
yurekami (Author) commented:
Updated Self-Review Status

All findings from the initial self-review have now been addressed across 3 follow-up commits.
What changed since the initial review:
Test suite: 63 tests passing (up from 37 at the initial review).
Summary
Motivation
The benchmark datasets are available, but there is no standardized tooling for researchers to evaluate their model outputs against the ground truth, which makes results hard to reproduce and compare. This harness fills that gap, covering ~91% of IMO-AnswerBench answers with programmatic checking (227 numeric + 103 LaTeX + 33 multi-answer = 363 of 400 total).
What's Included
- `imobench/eval/answer_checker.py`
- `imobench/eval/evaluate.py`
- `imobench/eval/metrics.py`
- `imobench/eval/cli.py`
- `imobench/eval/README.md`
- `imobench/eval/tests/`
- `imobench/eval/requirements.txt`
- `imobench/eval/requirements-dev.txt`

Answer Checking Strategies
The checker tries strategies in order, returning the first match:
1. Exact match: `$\frac{1}{2}$` == `\frac{1}{2}`
2. Numeric comparison: `3.0` == `3`
3. Multi-answer set matching: `3, 1, 2` == `1, 2, 3`
4. Normalized string comparison: `Algebra` == `algebra`
5. SymPy equivalence: `\frac{1}{2}` == `0.5`

Security & Robustness
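A minimal sketch of the finite-only numeric comparison this section describes; `numeric_equal` and the tolerance value are assumptions, not the harness's confirmed names or settings.

```python
import math

def numeric_equal(pred: str, truth: str, tol: float = 1e-9) -> bool:
    try:
        a, b = float(pred), float(truth)
    except ValueError:
        return False  # not parseable as numbers
    if not (math.isfinite(a) and math.isfinite(b)):
        return False  # NaN and inf never compare equal
    return abs(a - b) <= tol
```

Note that `float("nan") == float("nan")` is already `False` in Python, but `float("inf") == float("inf")` is `True`, so the explicit `isfinite` guard is what keeps `inf` from matching.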
- 5s thread-based timeout on SymPy calls (works cross-platform, unlike `signal.SIGALRM`)
- NaN/inf rejected in numeric comparison (`math.isfinite` guard)
- LaTeX delimiters stripped: `$`, `$$`, `\(`, `\)`, `\[`, `\]`

Usage
Commits
f507261, 5787d2c, e501388, 51617a4

Test Plan
- `test_answer_checker.py` (46 tests): normalization, splitting, all 5 checking strategies, helper functions, edge cases
- `test_evaluate.py` (6 tests): CSV/JSONL loading, evaluation runner, missing predictions, result structure
- `test_metrics.py` (6 tests): accuracy computation, category breakdowns, report formatting
- `test_cli.py` (5 tests): text/JSON output, file output, missing file handling, empty predictions
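The splitting and unbalanced-bracket edge cases in the plan above are the kind shown in this self-contained sketch; `split_multi_answer` is a toy stand-in for the harness's splitter, illustrating the `max(0, depth - 1)` clamp from the review fixes.

```python
def split_multi_answer(s: str) -> list[str]:
    """Toy splitter: split on commas not nested inside brackets."""
    parts, depth, cur = [], 0, []
    for ch in s:
        if ch in "([{":
            depth += 1
        elif ch in ")]}":
            depth = max(0, depth - 1)  # clamp so malformed input can't go negative
        if ch == "," and depth == 0:
            parts.append("".join(cur).strip())
            cur = []
        else:
            cur.append(ch)
    parts.append("".join(cur).strip())
    return parts

def test_nested_commas_not_split():
    assert split_multi_answer("(1, 2), 3") == ["(1, 2)", "3"]

def test_unbalanced_bracket_does_not_break_split():
    # Without the clamp, the stray ")" would drive depth negative and
    # suppress the top-level comma split.
    assert split_multi_answer("1), 2") == ["1)", "2"]
```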