FinWorkBench (Finch): Benchmarking Finance & Accounting Across Spreadsheet-Centric Enterprise Workflows

End-to-end evaluation pipeline for FinWorkBench (Finch): organize task files, preprocess multimodal artifacts, build judge-ready prompts, and score model outputs.

The evaluation pipeline follows four stages:

JSONL task set → Organize files → Preprocess & build prompts → GPT Judge scoring

The JSONL task set is included in the Finch HuggingFace dataset. You can also create your own JSONL task set following the same conventions. The JSONL file should be placed in the dataset root directory.
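
If you build your own task set, it can help to check that the file parses cleanly before running the pipeline. The sketch below makes two assumptions that this README does not confirm: that the dataset root holds exactly one `.jsonl` file, and that each non-empty line is a single JSON object. `find_task_file` and `iter_tasks` are hypothetical helper names, not functions from this repository.

```python
import json
from pathlib import Path
from typing import Iterator


def find_task_file(dataset_dir: str) -> Path:
    """Locate the task file, assuming exactly one *.jsonl sits in the dataset root."""
    candidates = sorted(Path(dataset_dir).glob("*.jsonl"))
    if len(candidates) != 1:
        raise FileNotFoundError(
            f"expected exactly one .jsonl file in {dataset_dir}, found {len(candidates)}"
        )
    return candidates[0]


def iter_tasks(jsonl_path: Path) -> Iterator[dict]:
    """Yield one dict per non-empty line and report the line number of any bad JSON."""
    with open(jsonl_path, encoding="utf-8") as handle:
        for line_no, line in enumerate(handle, start=1):
            line = line.strip()
            if not line:
                continue  # tolerate blank lines between records
            try:
                yield json.loads(line)
            except json.JSONDecodeError as exc:
                raise ValueError(
                    f"{jsonl_path}:{line_no}: invalid JSON ({exc.msg})"
                ) from exc
```

For example, `sum(1 for _ in iter_tasks(find_task_file("Finch")))` counts the tasks and fails on the first malformed line.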

🍻 Updates

2026-04-06: FinWorkBench has been accepted to ACL 2026 Findings.

📂 Repository Structure

src/
├── organize_files.py           # Organizes source/reference/model-output files by task ID
├── preprocessor/               # Preprocesses PDF, Word, Excel, Markdown, and image files
│   └── preprocessor_main.py    # Preprocessing entry point
├── build_prompt/               # Builds judge-ready prompts (content_parts.jsonl)
│   └── content_builder/
│       ├── content_builder.py  # Prompt builder entry point
│       └── config.py           # Size limits, extension sets, cache/output settings
├── prompt_build_pipeline.py    # End-to-end pipeline: organize -> preprocess -> build prompts
└── call_gpt_judge.py           # Calls Azure OpenAI judge model and writes scored results

🚀 Quick Start

This guide walks you through: download dataset → preprocess → build prompts → run GPT Judge → get results.xlsx.

Shortcut: If you want to reuse our preprocessed results for GPT-5.1 Pro, Claude Sonnet 4.5, and Claude Opus 4.5, download them from Google Drive and skip directly to (5) Run GPT Judge.

1) ⚙️ Prerequisites

Python 3.9+ recommended
Install dependencies:

pip install -r requirements.txt

If you do not have a requirements file, install at minimum:

pip install pandas openpyxl pymupdf python-docx xlwings openai

Dependency	Required for
`xlwings` + Microsoft Excel	Excel preprocessing (Windows recommended for stability)
`PyMuPDF` (`fitz`)	PDF preprocessing
`python-docx`	Word document preprocessing

2) 📥 Download the Finch dataset

git clone https://huggingface.co/datasets/FinWorkBench/Finch

This creates a local Finch/ folder containing the JSONL task set and source/reference files.

3) 💼 Prepare your model outputs

Create a directory containing your model outputs, with one subfolder per model:

YOUR_MODEL_OUTPUT/
├── opus_4.5_output/
│   ├── 0.xlsx
│   └── ...
└── gpt5.1_output/
    ├── 0.xlsx
    └── ...

Naming conventions:

Scenario	Convention	Example
Text output for a task	`<task_id>.txt`	`1.txt`
Single file per task	`<task_id>.<ext>`	`5.xlsx`
Multiple files of the same type	`<task_id>_<n>.<ext>` or `<task_id>-<n>.<ext>`	`3_1.png`, `3_2.png`
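
The naming conventions above can be checked with a small parser before you run the pipeline. This is a minimal sketch, not the matching logic used by `organize_files.py`: it assumes task IDs are purely numeric (as in the examples), and `parse_output_name` and `OutputName` are hypothetical names.

```python
import re
from typing import NamedTuple, Optional

# <task_id>.<ext>, <task_id>_<n>.<ext>, or <task_id>-<n>.<ext>
_PATTERN = re.compile(
    r"^(?P<task_id>\d+)(?:[_-](?P<index>\d+))?\.(?P<ext>[A-Za-z0-9]+)$"
)


class OutputName(NamedTuple):
    task_id: str
    index: Optional[int]  # None when the task has a single output file
    ext: str


def parse_output_name(filename: str) -> Optional[OutputName]:
    """Split an output filename into task ID, optional file index, and extension."""
    match = _PATTERN.match(filename)
    if match is None:
        return None  # does not follow the naming conventions
    index = match.group("index")
    return OutputName(
        task_id=match.group("task_id"),
        index=int(index) if index is not None else None,
        ext=match.group("ext").lower(),
    )
```

Running `parse_output_name` over every file in a model subfolder and listing the names that return `None` is a quick way to spot outputs that would not be matched to a task.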

4) 🛠️ Run the 3-step pipeline (organize, preprocess, build prompts)

python src/prompt_build_pipeline.py \
    --dataset-dir Finch \
    --output-dir "YOUR_MODEL_OUTPUT" \
    --target-dir eval_set

This command:

Reads the Finch JSONL tasks.
Organizes files into eval_set/<model>/<task_id>/.
Runs preprocessors (PDF, Word, Excel, Markdown, image) and writes metadata.json.
Builds prompts and generates eval_set/<model>/content_parts.jsonl.

After completion, the output directory looks like:

eval_set/
├── opus_4.5_output/
│   ├── content_parts.jsonl
│   └── 0/
│       ├── metadata.json
│       ├── preprocessed/
│       └── _cache/
└── gpt5.1_output/
    ├── content_parts.jsonl
    └── ...
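
Before moving on to scoring, you may want to confirm that every model directory actually received a `content_parts.jsonl`. The helper below is a hypothetical convenience script based only on the layout shown above; it is not part of the repository.

```python
from pathlib import Path


def missing_prompt_files(
    eval_root: str, jsonl_name: str = "content_parts.jsonl"
) -> list[str]:
    """Return the names of model subdirectories that lack the prompt file."""
    root = Path(eval_root)
    return sorted(
        model_dir.name
        for model_dir in root.iterdir()
        if model_dir.is_dir() and not (model_dir / jsonl_name).is_file()
    )
```

An empty list from `missing_prompt_files("eval_set")` means every model directory is ready for the judge step.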

5) 📊 Run GPT Judge

Here eval_set is the --target-dir from Step 4 (or the directory you downloaded from Google Drive). It should contain one subdirectory per model, each with a content_parts.jsonl inside.

python src/call_gpt_judge.py eval_set -o results.xlsx \
    --api-key "<YOUR_KEY>" \
    --azure-endpoint "<YOUR_ENDPOINT>" \
    --api-version "<YYYY-MM-DD>" \
    --model "<DEPLOYMENT_NAME>"

This reads each content_parts.jsonl, calls the configured Azure OpenAI judge model, and writes a scored results.xlsx into each model subdirectory:

eval_set/
├── opus_4.5_output/
│   └── results.xlsx
└── gpt5.1_output/
    └── results.xlsx

Common options:

# Evaluate specific models only
python src/call_gpt_judge.py eval_set --models opus_4.5_output,gpt5.1_output -o results.xlsx

# Re-run without skipping previously processed tasks
python src/call_gpt_judge.py eval_set -o results.xlsx --no-skip-processed

📚 Detailed Script Reference

`organize_files.py` -- Organize Files

Reads the JSONL task set and creates task directories at target_dir/<model>/<task_id>/, copying source files, reference files, and model output files. Generates metadata.json for each task.

python src/organize_files.py \
    --dataset-dir Finch \
    --output-dir YOUR_MODEL_OUTPUT \
    --target-dir eval_set

Argument	Description
`--dataset-dir`	(Required) Dataset directory containing the JSONL task file; the file itself is auto-detected
`--output-dir`	(Required) Model output directory (one subdirectory per model)
`--target-dir`	(Required) Organized output directory used by downstream pipeline steps
`--log-level`	Logging verbosity: `DEBUG`, `INFO`, `WARNING`, or `ERROR`

`preprocessor/` -- Preprocess Files

Processes PDF, Word, Excel, Markdown, and image files. Results are written into metadata.json under the preprocess_info field.

python src/preprocessor/preprocessor_main.py --root-dir eval_set

Argument	Description
`--root-dir`	The `target-dir` produced by `organize_files.py`
`--models`	Optional; specify which model directories to process (space-separated)

Configuration: Text descriptions are defined in preprocessor/preprocessor_base.py via PreprocessorConfig. Special-case logs are written to preprocessing_special_cases.log.
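
To see which tasks were not preprocessed, you can scan each task's `metadata.json` for the `preprocess_info` field. The sketch below assumes that `metadata.json` is a JSON object with `preprocess_info` as a top-level key and that task directories sit directly under the model directory; `tasks_missing_preprocess_info` is a hypothetical helper name.

```python
import json
from pathlib import Path


def tasks_missing_preprocess_info(model_dir: str) -> list[str]:
    """Return task directory names whose metadata.json has no (or empty) preprocess_info."""
    missing = []
    for meta_path in sorted(Path(model_dir).glob("*/metadata.json")):
        with open(meta_path, encoding="utf-8") as handle:
            metadata = json.load(handle)
        # Assumption: preprocess_info is a top-level key of the metadata object.
        if not metadata.get("preprocess_info"):
            missing.append(meta_path.parent.name)
    return missing
```

For example, `tasks_missing_preprocess_info("eval_set/opus_4.5_output")` lists the task IDs worth re-running after installing a missing dependency.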

`build_prompt/` -- Build Prompts

Generates content_parts.jsonl for each model directory, based on metadata.json and preprocessing outputs.

# Windows (cmd); on macOS/Linux use `export PYTHONPATH=src` instead
set PYTHONPATH=src
python -m src.build_prompt.content_builder.content_builder eval_set

Configuration (build_prompt/content_builder/config.py):

Setting	Description
`MAX_IMAGES`	Maximum number of images per prompt
`MAX_TEXT_CHARS`	Maximum text character count per prompt
`EXCEL_EXTENSIONS`, `IMAGE_EXTENSIONS`, `TEXT_EXTENSIONS`	Recognized file extension sets
`CACHE_DIR_NAME`	Name of the cache directory (stores diffs, snapshots, screenshots)
`OUTPUT_JSONL_NAME`	Output filename (default: `content_parts.jsonl`)
`Captions.*`	Caption templates for generated content

Prompt length management: Defined in src/build_prompt/content_builder/token_counter.py. Image count and text character count are tracked separately. If the image count exceeds MAX_IMAGES, images are dropped from the end. If the text character count exceeds MAX_TEXT_CHARS, text is truncated from the end. Both limits are configured in config.py.
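
The two limits described above can be illustrated as follows. This is a minimal sketch, not the code in `token_counter.py`: it assumes a prompt is a list of dicts with a `"type"` of `"text"` or `"image"` and that the text budget is shared across all text parts in order. The part structure and the name `enforce_limits` are assumptions made for illustration.

```python
from typing import Any


def enforce_limits(
    parts: list[dict[str, Any]], max_images: int, max_text_chars: int
) -> list[dict[str, Any]]:
    """Keep the first max_images images and the first max_text_chars characters of text."""
    kept: list[dict[str, Any]] = []
    images_seen = 0
    chars_left = max_text_chars
    for part in parts:
        if part["type"] == "image":
            images_seen += 1
            if images_seen <= max_images:  # images past the limit are dropped
                kept.append(part)
        elif part["type"] == "text":
            if chars_left <= 0:
                continue  # text budget already spent
            text = part["text"][:chars_left]  # truncate from the end
            chars_left -= len(text)
            kept.append({**part, "text": text})
        else:
            kept.append(part)  # unknown part types pass through unchanged
    return kept
```

Because the two counters are independent, a prompt can lose trailing images while keeping all of its text, or the other way around.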

`call_gpt_judge.py` -- GPT Judge Scoring

Supports two input modes:

Single JSONL file -- scores one model and writes a single results.xlsx.
Root directory -- scores all models and writes one results.xlsx per model subdirectory.

# Single model
python src/call_gpt_judge.py eval_set/opus_4.5_output/content_parts.jsonl \
    -o results.xlsx --api-key "<YOUR_KEY>" --azure-endpoint "<YOUR_ENDPOINT>"

# All models under a root directory
python src/call_gpt_judge.py eval_set -o results.xlsx \
    --api-key "<YOUR_KEY>" --azure-endpoint "<YOUR_ENDPOINT>"
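
One way to picture the two input modes is as a function that turns the positional argument into a list of JSONL files to score. The sketch below is an assumption about the behavior described above, not the logic in `call_gpt_judge.py`; `resolve_inputs` is a hypothetical name.

```python
from pathlib import Path


def resolve_inputs(path: str, jsonl_name: str = "content_parts.jsonl") -> list[Path]:
    """Return the JSONL files to score for a single-file or root-directory input."""
    target = Path(path)
    if target.is_file():
        return [target]  # single JSONL file: one model
    if target.is_dir():
        # Root directory: one JSONL per model subdirectory that has one.
        return sorted(
            child / jsonl_name
            for child in target.iterdir()
            if child.is_dir() and (child / jsonl_name).is_file()
        )
    raise FileNotFoundError(f"no such file or directory: {path}")
```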

API configuration (set in APIConfig inside the script, overridable via CLI):

Setting	CLI Override
`AZURE_ENDPOINT`	`--azure-endpoint`
`API_KEY`	`--api-key`
`API_VERSION`	`--api-version`
`MODEL`	`--model`
`MAX_TOKENS`, `MAX_COMPLETION_TOKENS`, `TEMPERATURE`	--
`MAX_RETRIES`, `RATE_LIMIT_DELAY`	--
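
The "script defaults, overridable via CLI" pattern in the table can be sketched with `argparse` and a dataclass. This is an illustration under assumptions: the real `APIConfig` may be structured differently, the empty-string defaults are placeholders, and `resolve_config` is a hypothetical name. Only the four flags listed in the table are used.

```python
import argparse
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class APIConfig:
    # Placeholder defaults; the real values live in call_gpt_judge.py.
    AZURE_ENDPOINT: str = ""
    API_KEY: str = ""
    API_VERSION: str = ""
    MODEL: str = ""


def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(description="Sketch of CLI overrides for APIConfig")
    # default=None lets us tell "flag omitted" apart from "flag given"
    parser.add_argument("--azure-endpoint")
    parser.add_argument("--api-key")
    parser.add_argument("--api-version")
    parser.add_argument("--model")
    return parser


def resolve_config(argv: list[str], base: APIConfig = APIConfig()) -> APIConfig:
    """Apply only the flags that were passed on top of the base configuration."""
    args = vars(build_parser().parse_args(argv))
    overrides = {name.upper(): value for name, value in args.items() if value is not None}
    return replace(base, **overrides)
```

Settings without a CLI flag, such as `MAX_RETRIES`, would stay at whatever the script defines.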

📝 Notes

Excel-related functionality depends on xlwings and a local Microsoft Excel installation. Windows is recommended for stability.
In our previous evaluation, we used GPT-5-mini as the judge. We have since found that using a frontier model as the judge, or voting across multiple judge runs, improves evaluation reliability.
If preprocess_info is missing from metadata.json, check that the required dependencies are installed and that the file type is included in the preprocessing chain.
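
Voting across multiple runs, as mentioned in the notes above, can be as simple as a strict majority over per-task verdicts. The sketch below is one possible aggregation rule, not the procedure used in the paper. The verdict labels are hypothetical, since this README does not specify the judge's output format.

```python
from collections import Counter
from typing import Hashable, Optional, Sequence


def majority_vote(verdicts: Sequence[Hashable]) -> Optional[Hashable]:
    """Return the verdict given by more than half of the runs, or None if there is none."""
    if not verdicts:
        return None
    verdict, count = Counter(verdicts).most_common(1)[0]
    return verdict if count * 2 > len(verdicts) else None
```

Using an odd number of runs avoids ties for binary verdicts; tasks that return `None` can be flagged for manual review.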

📁 Legacy Code

The code used in the original paper is maintained in the previous branch. This branch contains the newer and unified implementation for Modification, Generation, and QA.

📜 Citation

@article{dong2025finch,
  title={Finch: Benchmarking Finance \& Accounting across Spreadsheet-Centric Enterprise Workflows},
  author={Dong, Haoyu and Zhang, Pengkun and Gao, Yan and Dong, Xuanyu and Cheng, Yilin and Lu, Mingzhe and Yakefu, Adina and Zheng, Shuxin},
  journal={arXiv preprint arXiv:2512.13168},
  year={2025}
}
