FinWorkBench (Finch): Benchmarking Finance & Accounting Across Spreadsheet-Centric Enterprise Workflows
End-to-end evaluation pipeline for FinWorkBench (Finch): organize task files, preprocess multimodal artifacts, build judge-ready prompts, and score model outputs.
The evaluation pipeline follows four stages:
JSONL task set → Organize files → Preprocess & build prompts → GPT Judge scoring
The JSONL task set is included in the Finch HuggingFace dataset. You can also create your own JSONL task set following the same conventions. The JSONL file should be placed in the dataset root directory.
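As a rough illustration of the JSONL convention (one JSON object per line), the sketch below locates a single `.jsonl` file in the dataset root and loads it. The helper names `find_task_file` and `load_tasks` are hypothetical, no record fields are assumed, and the pipeline's own auto-detection logic may differ.

```python
import json
from pathlib import Path


def find_task_file(dataset_dir):
    """Return the only .jsonl file in the dataset root (hypothetical helper)."""
    candidates = sorted(Path(dataset_dir).glob("*.jsonl"))
    if len(candidates) != 1:
        raise FileNotFoundError(
            f"expected exactly one .jsonl file in {dataset_dir}, "
            f"found {len(candidates)}"
        )
    return candidates[0]


def load_tasks(jsonl_path):
    """Parse one JSON object per non-empty line."""
    tasks = []
    with open(jsonl_path, encoding="utf-8") as handle:
        for line_number, line in enumerate(handle, start=1):
            line = line.strip()
            if not line:
                continue  # tolerate blank lines between records
            try:
                tasks.append(json.loads(line))
            except json.JSONDecodeError as exc:
                raise ValueError(f"{jsonl_path}:{line_number}: invalid JSON") from exc
    return tasks
```

Calling `load_tasks(find_task_file("Finch"))` on a custom task set is a quick way to confirm that every line parses before running the full pipeline.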
- 2026-4-6: FinWorkBench has been accepted to ACL 2026 Findings.
src/
├── organize_files.py # Organizes source/reference/model-output files by task ID
├── preprocessor/ # Preprocesses PDF, Word, Excel, Markdown, and image files
│ └── preprocessor_main.py # Preprocessing entry point
├── build_prompt/ # Builds judge-ready prompts (content_parts.jsonl)
│ └── content_builder/
│ ├── content_builder.py # Prompt builder entry point
│ └── config.py # Size limits, extension sets, cache/output settings
├── prompt_build_pipeline.py # End-to-end pipeline: organize -> preprocess -> build prompts
└── call_gpt_judge.py # Calls Azure OpenAI judge model and writes scored results
This guide walks you through: download dataset → preprocess → build prompts → run GPT Judge → get results.xlsx.
Shortcut: If you want to reuse our preprocessed results for GPT-5.1 Pro, Claude Sonnet 4.5, and Claude Opus 4.5, download them from Google Drive and skip directly to (5) Run GPT Judge.
- Python 3.9+ recommended
- Install dependencies:
pip install -r requirements.txt
If you do not have a requirements file, install at minimum:
pip install pandas openpyxl pymupdf python-docx xlwings openai
| Dependency | Required for |
|---|---|
| xlwings + Microsoft Excel | Excel preprocessing (Windows recommended for stability) |
| PyMuPDF (fitz) | PDF preprocessing |
| python-docx | Word document preprocessing |
git clone https://huggingface.co/datasets/FinWorkBench/Finch
This creates a local Finch/ folder containing the JSONL task set and source/reference files.
Create a directory containing your model outputs, with one subfolder per model:
YOUR_MODEL_OUTPUT/
├── opus_4.5_output/
│ ├── 0.xlsx
│ └── ...
└── gpt5.1_output/
├── 0.xlsx
└── ...
Naming conventions:
| Scenario | Convention | Example |
|---|---|---|
| Text output for a task | <task_id>.txt | 1.txt |
| Single file per task | <task_id>.<ext> | 5.xlsx |
| Multiple files of the same type | <task_id>_<n>.<ext> or <task_id>-<n>.<ext> | 3_1.png, 3_2.png |
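The naming conventions above can be checked mechanically before running the pipeline. The following is a minimal sketch, not the parser used by organize_files.py: `parse_output_name` and `OutputName` are hypothetical names, and the pattern assumes numeric task IDs, as in the examples.

```python
import re
from typing import NamedTuple, Optional

# Assumption: task IDs are numeric, as in the examples (1.txt, 5.xlsx, 3_1.png).
_OUTPUT_NAME = re.compile(
    r"^(?P<task_id>\d+)(?:[_-](?P<index>\d+))?\.(?P<ext>[A-Za-z0-9]+)$"
)


class OutputName(NamedTuple):
    task_id: str
    index: Optional[int]  # None for the single-file form <task_id>.<ext>
    ext: str


def parse_output_name(filename: str) -> Optional[OutputName]:
    """Split a model-output filename into task ID, optional index, and extension."""
    match = _OUTPUT_NAME.match(filename)
    if match is None:
        return None  # name does not follow the convention
    index = match.group("index")
    return OutputName(
        task_id=match.group("task_id"),
        index=int(index) if index is not None else None,
        ext=match.group("ext").lower(),
    )
```

Files for which `parse_output_name` returns `None` are worth renaming or inspecting by hand before organizing.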
python src/prompt_build_pipeline.py \
--dataset-dir Finch \
--output-dir "YOUR_MODEL_OUTPUT" \
--target-dir eval_set
This command:
- Reads the Finch JSONL tasks.
- Organizes files into eval_set/<model>/<task_id>/.
- Runs preprocessors (PDF, Word, Excel, Markdown, image) and writes metadata.json.
- Builds prompts and generates eval_set/<model>/content_parts.jsonl.
After completion, the output directory looks like:
eval_set/
├── opus_4.5_output/
│ ├── content_parts.jsonl
│ └── 0/
│ ├── metadata.json
│ ├── preprocessed/
│ └── _cache/
└── gpt5.1_output/
├── content_parts.jsonl
└── ...
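Before moving on to scoring, it can help to confirm that every model directory in the layout above actually received a content_parts.jsonl. The sketch below is an illustrative check only; `find_incomplete_models` is a hypothetical helper and is not part of the repository.

```python
from pathlib import Path


def find_incomplete_models(target_dir, prompt_file="content_parts.jsonl"):
    """Return names of model subdirectories that lack the prompt file."""
    missing = []
    for model_dir in sorted(Path(target_dir).iterdir()):
        # Only directories count as models; stray files in the root are ignored.
        if model_dir.is_dir() and not (model_dir / prompt_file).is_file():
            missing.append(model_dir.name)
    return missing
```

An empty list from `find_incomplete_models("eval_set")` suggests that prompt building finished for every model.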
Here eval_set is the --target-dir from Step 4 (or the directory you downloaded from Google Drive). It should contain one subdirectory per model, each with a content_parts.jsonl inside.
python src/call_gpt_judge.py eval_set -o results.xlsx \
--api-key "<YOUR_KEY>" \
--azure-endpoint "<YOUR_ENDPOINT>" \
--api-version "<YYYY-MM-DD>" \
--model "<DEPLOYMENT_NAME>"
This reads each content_parts.jsonl, calls the configured Azure OpenAI judge model, and writes a scored results.xlsx into each model subdirectory:
eval_set/
├── opus_4.5_output/
│ └── results.xlsx
└── gpt5.1_output/
└── results.xlsx
Common options:
# Evaluate specific models only
python src/call_gpt_judge.py eval_set --models opus_4.5_output,gpt5.1_output -o results.xlsx
# Re-run without skipping previously processed tasks
python src/call_gpt_judge.py eval_set -o results.xlsx --no-skip-processed
organize_files.py reads the JSONL task set and creates task directories at target_dir/<model>/<task_id>/, copying source files, reference files, and model output files. It also generates metadata.json for each task.
python src/organize_files.py \
--dataset-dir Finch \
--output-dir YOUR_MODEL_OUTPUT \
--target-dir eval_set
| Argument | Description |
|---|---|
| --dataset-dir | (Required) Dataset directory containing the JSONL file (auto-detected) |
| --output-dir | (Required) Model output directory (one subdirectory per model) |
| --target-dir | (Required) Organized output directory used by downstream pipeline steps |
| --log-level | Logging verbosity: DEBUG, INFO, WARNING, or ERROR |
preprocessor_main.py processes PDF, Word, Excel, Markdown, and image files. Results are written into metadata.json under the preprocess_info field.
python src/preprocessor/preprocessor_main.py --root-dir eval_set
| Argument | Description |
|---|---|
| --root-dir | The target-dir produced by organize_files.py |
| --models | Optional; specify which model directories to process (space-separated) |
Configuration: Text descriptions are defined in preprocessor/preprocessor_base.py via PreprocessorConfig. Special-case logs are written to preprocessing_special_cases.log.
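Because preprocessing results are recorded under the preprocess_info field of each metadata.json, a small scan can list tasks where that field is absent or empty. This is a sketch under two assumptions: metadata.json sits at `<root>/<model>/<task_id>/metadata.json`, and preprocess_info is a top-level key. `tasks_missing_preprocess_info` is a hypothetical helper.

```python
import json
from pathlib import Path


def tasks_missing_preprocess_info(root_dir):
    """Return "<model>/<task_id>" for tasks whose metadata lacks preprocess_info."""
    root = Path(root_dir)
    missing = []
    # Assumed layout: <root>/<model>/<task_id>/metadata.json
    for metadata_path in sorted(root.glob("*/*/metadata.json")):
        with open(metadata_path, encoding="utf-8") as handle:
            metadata = json.load(handle)
        # Treat both an absent key and an empty value as "missing".
        if not metadata.get("preprocess_info"):
            missing.append(metadata_path.parent.relative_to(root).as_posix())
    return missing
```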
content_builder.py generates content_parts.jsonl for each model directory, based on metadata.json and preprocessing outputs.
set PYTHONPATH=src
python -m src.build_prompt.content_builder.content_builder eval_set
Configuration (build_prompt/content_builder/config.py):
| Setting | Description |
|---|---|
| MAX_IMAGES | Maximum number of images per prompt |
| MAX_TEXT_CHARS | Maximum text character count per prompt |
| EXCEL_EXTENSIONS, IMAGE_EXTENSIONS, TEXT_EXTENSIONS | Recognized file extension sets |
| CACHE_DIR_NAME | Name of the cache directory (stores diffs, snapshots, screenshots) |
| OUTPUT_JSONL_NAME | Output filename (default: content_parts.jsonl) |
| Captions.* | Caption templates for generated content |
Prompt length management: Defined in src/build_prompt/content_builder/token_counter.py. Image count and text character count are tracked separately. If the image count exceeds MAX_IMAGES, images are dropped from the end. If the text character count exceeds MAX_TEXT_CHARS, text is truncated from the end. Both limits are configured in config.py.
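The two limits can be pictured with a short sketch. It is not the code in token_counter.py: the part shape (dicts with a `type` of `"text"` or `"image"`) and the name `enforce_limits` are assumptions made for illustration. Images beyond the limit are dropped from the end, and text is cut once the cumulative character budget is spent.

```python
def enforce_limits(parts, max_images, max_text_chars):
    """Keep at most max_images images and max_text_chars characters of text.

    `parts` is an assumed prompt representation: a list of dicts such as
    {"type": "text", "text": "..."} or {"type": "image", ...}.
    """
    kept = []
    image_count = 0
    text_chars = 0
    for part in parts:
        if part["type"] == "image":
            if image_count < max_images:
                kept.append(part)
                image_count += 1
            # Images past the limit are dropped (i.e. from the end).
        elif part["type"] == "text":
            remaining = max_text_chars - text_chars
            if remaining <= 0:
                continue  # text budget already used up
            text = part["text"][:remaining]  # truncate from the end
            kept.append({**part, "text": text})
            text_chars += len(text)
        else:
            kept.append(part)  # unknown part types pass through unchanged
    return kept
```

Tracking the two counters separately mirrors the description above: hitting the image limit does not affect how much text is kept, and vice versa.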
call_gpt_judge.py supports two input modes:
- Single JSONL file: scores one model and writes a single results.xlsx.
- Root directory: scores all models and writes one results.xlsx per model subdirectory.
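The two input modes amount to a dispatch on whether the argument is a file or a directory. A minimal sketch of that dispatch follows; `resolve_prompt_files` is a hypothetical name, and the script's actual argument handling may differ.

```python
from pathlib import Path


def resolve_prompt_files(input_path, prompt_file="content_parts.jsonl"):
    """Return the prompt files to score for either input mode."""
    path = Path(input_path)
    if path.is_file():
        # Single JSONL file: score just this one model.
        return [path]
    if path.is_dir():
        # Root directory: one prompt file per model subdirectory.
        return sorted(p for p in path.glob(f"*/{prompt_file}") if p.is_file())
    raise FileNotFoundError(f"no such file or directory: {input_path}")
```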
# Single model
python src/call_gpt_judge.py eval_set/opus_4.5_output/content_parts.jsonl \
-o results.xlsx --api-key "<YOUR_KEY>" --azure-endpoint "<YOUR_ENDPOINT>"
# All models under a root directory
python src/call_gpt_judge.py eval_set -o results.xlsx \
--api-key "<YOUR_KEY>" --azure-endpoint "<YOUR_ENDPOINT>"
API configuration (set in APIConfig inside the script, overridable via CLI):
| Setting | CLI Override |
|---|---|
| AZURE_ENDPOINT | --azure-endpoint |
| API_KEY | --api-key |
| API_VERSION | --api-version |
| MODEL | --model |
| MAX_TOKENS, MAX_COMPLETION_TOKENS, TEMPERATURE | (none) |
| MAX_RETRIES, RATE_LIMIT_DELAY | (none) |
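Settings such as MAX_RETRIES and RATE_LIMIT_DELAY typically drive a retry loop around each judge request. The sketch below shows one plausible reading, not the script's actual behavior: it assumes one initial attempt plus up to `max_retries` retries with a fixed delay between them. `call_with_retries` and its parameters are hypothetical, and real code would catch only the API's rate-limit and transient errors rather than every exception.

```python
import time


def call_with_retries(request, max_retries=3, rate_limit_delay=1.0, sleep=time.sleep):
    """Call request(); on failure, wait and retry up to max_retries times."""
    last_error = None
    for attempt in range(max_retries + 1):  # initial attempt + retries
        try:
            return request()
        except Exception as exc:  # assumption: narrow this to API errors in practice
            last_error = exc
            if attempt < max_retries:
                sleep(rate_limit_delay)  # fixed pause before the next attempt
    raise last_error
```

Passing `sleep` as a parameter keeps the helper easy to exercise in tests without real waiting.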
- Excel-related functionality depends on xlwings and a local Microsoft Excel installation. Windows is recommended for stability.
- In our previous evaluation, we used GPT-5-mini. We have since found that using frontier models or voting across multiple runs improves evaluation reliability.
- If preprocess_info is missing from metadata.json, check that the required dependencies are installed and that the file type is included in the preprocessing chain.
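Voting across multiple judge runs can be as simple as taking the most common verdict per task. The following sketch assumes each run yields one hashable verdict label per task (for example "pass" or "fail"; these labels are illustrative, since the judge's output format is not specified here). `majority_verdict` is a hypothetical helper.

```python
from collections import Counter


def majority_verdict(verdicts):
    """Return the most common verdict, or None for an empty list or a tie."""
    counts = Counter(verdicts).most_common()
    if not counts:
        return None
    if len(counts) > 1 and counts[0][1] == counts[1][1]:
        return None  # top two verdicts are tied; leave the task undecided
    return counts[0][0]
```

Using an odd number of runs avoids ties for two-way verdicts; tasks that still come back as `None` are natural candidates for manual review.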
The code used in the original paper is maintained in the previous branch. This branch contains the newer and unified implementation for Modification, Generation, and QA.
@article{dong2025finch,
title={Finch: Benchmarking Finance \& Accounting across Spreadsheet-Centric Enterprise Workflows},
author={Dong, Haoyu and Zhang, Pengkun and Gao, Yan and Dong, Xuanyu and Cheng, Yilin and Lu, Mingzhe and Yakefu, Adina and Zheng, Shuxin},
journal={arXiv preprint arXiv:2512.13168},
year={2025}
}