HuggingFace | Arxiv | Citation
BenchMAX is a comprehensive, high-quality, and multi-way parallel multilingual benchmark comprising 10 tasks designed to assess crucial capabilities across 17 diverse languages.
🔥 [Apr 22, 2025] Updated the results of the DeepSeek-R1-Distill models.
[Feb 12, 2025] Released the multilingual benchmark.
We evaluate multiple crucial capabilities of large language models (LLMs) in multilingual scenarios. The dataset links are as follows:
| Dataset | Evaluated Capability | HuggingFace Dataset Path |
|---|---|---|
| BenchMAX_Rule-based | Instruction following | 🤗 BenchMAX_Rule-based |
| BenchMAX_Model-based | Instruction following | 🤗 BenchMAX_Model-based |
| BenchMAX_Function_Completion | Code generation | 🤗 BenchMAX_Function_Completion |
| BenchMAX_Problem_Solving | Code generation | 🤗 BenchMAX_Problem_Solving |
| BenchMAX_Math | Reasoning | 🤗 BenchMAX_Math |
| BenchMAX_Science | Reasoning | 🤗 BenchMAX_Science |
| BenchMAX_Question_Answering | Long context modelling | 🤗 BenchMAX_Question_Answering |
| BenchMAX_Multiple_Functions | Tool use | 🤗 BenchMAX_Multiple_Functions |
| BenchMAX_General_Translation | Translation | 🤗 BenchMAX_General_Translation |
| BenchMAX_Domain_Translation | Translation | 🤗 BenchMAX_Domain_Translation |
We evaluate common multilingual large language models, as shown in the following table. The results are averaged across 17 languages. Note: Although DeepSeek-V3 supports a 128K context length, its API only supports a 64K context length, and we cannot deploy it on our server. Therefore, we have not evaluated its long-context capabilities yet.
The detailed results for each language are illustrated in the figure below.
Arabic, Bengali, Chinese, Czech, English, French, German, Hungarian, Japanese, Korean, Serbian, Spanish, Swahili, Telugu, Thai, Russian, Vietnamese
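The evaluation scripts refer to these languages by their ISO 639-1 codes; the language array used throughout the commands below is:
# Language codes used by the evaluation commands in this README
languages=(en ar bn cs de es fr hu ja ko ru sr sw te th vi zh)
# en=English   ar=Arabic     bn=Bengali  cs=Czech   de=German   es=Spanish
# fr=French    hu=Hungarian  ja=Japanese ko=Korean  ru=Russian  sr=Serbian
# sw=Swahili   te=Telugu     th=Thai     vi=Vietnamese          zh=Chinese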
You can clone this repository and install dependencies using the following commands:
git clone --recurse-submodules https://github.com/CONE-MT/BenchMAX.git
cd BenchMAX
pip install -r requirements.txt
You can simply use the script run.sh to evaluate one model on one task.
./run.sh <model> <task> <languages> [additional arguments]
Examples:
./run.sh meta-llama/Llama-3.1-8B-Instruct rule_based en
./run.sh /path/to/your/local/model xgpqa all # "all" means all 17 languages
./run.sh meta-llama/Llama-3.1-8B-Instruct problem-solving all /path/to/your/local/model # For the problem-solving task, the fourth argument is the local model path
The translation tasks are not supported by run.sh. Please see the commands in the Translation Tasks section.
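If you want to evaluate one model on several tasks back to back, run.sh can be wrapped in a simple loop; a minimal sketch, reusing the task identifiers from the examples above:
# Evaluate one model on several tasks across all 17 languages
model=meta-llama/Llama-3.1-8B-Instruct
for task in rule_based xgpqa; do
    ./run.sh $model $task all
done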
For more details on how to run each task and customize the arguments, see the following sections:
- Rule-based Instruction Following Task
- Model-based Instruction Following Task
- Function Completion Task
- Programming Problem Solving Task
- Math Reasoning Task
- Science Reasoning Task
- Long Context Task
- Tool Use Task
- Translation Tasks
We employ lm-evaluation-harness to run the rule-based instruction following task. First, clone its repository and install the lm-eval package:
git clone --depth 1 https://github.com/EleutherAI/lm-evaluation-harness
cd lm-evaluation-harness
pip install -e .
Run the lm-eval command to evaluate models. We recommend using vLLM for faster inference.
For more command options, please refer to lm-evaluation-harness.
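The commands below evaluate either all 17 languages (xifeval_multi) or a single language (e.g. xifeval_zh). If your model does not fit on one GPU, additional vLLM options can be forwarded through --model_args; a minimal sketch, assuming four GPUs are available (these options come from lm-evaluation-harness's vLLM backend, not from BenchMAX):
# Hypothetical multi-GPU variant: tensor_parallel_size and gpu_memory_utilization
# are forwarded to vLLM by lm-evaluation-harness
lm-eval -m vllm --model_args pretrained=${model},tensor_parallel_size=4,gpu_memory_utilization=0.8 --tasks xifeval_multi --batch_size auto --apply_chat_template --include_path tasks/ifeval --log_samples -o results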
cd BenchMAX
# Evaluate on all 17 languages
lm-eval -m vllm --model_args pretrained=${model} --tasks xifeval_multi --batch_size auto --apply_chat_template --include_path tasks/ifeval --log_samples -o results
# Evaluate on one specific language
lm-eval -m vllm --model_args pretrained=${model} --tasks xifeval_zh --batch_size auto --apply_chat_template --include_path tasks/ifeval --log_samples -o results
We use the official repository of Arena-Hard to run the model-based instruction following task. Please run the following script to prepare the code and data.
cd BenchMAX/tasks/arenahard
bash prepare.sh
Then modify the model configs in arena-hard-auto/config.
Please add your model config to api_config.yaml and add your model name to the model lists in the other configs, such as gen_answer_config_*.yaml and judge_config_*.yaml.
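As a rough illustration, an entry for a locally served model might look like the sketch below; the exact field names are defined by arena-hard-auto, so mirror an existing entry in api_config.yaml rather than copying this verbatim:
# Hypothetical example entry -- check the existing entries in api_config.yaml
# for the exact schema before adding your model
cat >> arena-hard-auto/config/api_config.yaml <<'EOF'
llama-3.1-8b-instruct:
    model_name: meta-llama/Llama-3.1-8B-Instruct
    endpoints:
        - api_base: http://localhost:8000/v1
          api_key: empty
    api_type: openai
    parallel: 8
EOF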
Finally, deploy your model and run the evaluation: your model first generates responses to the prompts, and GPT-4o-mini judges them against GPT-4o responses, as we do in the paper.
# serve your model by vllm
vllm serve meta-llama/Llama-3.1-8B-Instruct
# generate responses
cd arena-hard-auto
languages=(en ar bn cs de es fr hu ja ko ru sr sw te th vi zh)
for lang in "${languages[@]}"; do
python gen_answer.py --setting-file config/gen_answer_config_${lang}.yaml
done
# run LLM-as-a-judge
export OPENAI_API_KEY=...
for lang in "${languages[@]}"; do
python gen_judgment.py --setting-file config/judge_config_${lang}.yaml
done
We use the evalplus package to evaluate models on the function completion task.
cd BenchMAX/tasks/evalplus
languages=(en ar bn cs de es fr hu ja ko ru sr sw te th vi zh)
for lang in "${languages[@]}"; do
python -m evalplus.evaluate --model meta-llama/Llama-3.1-8B-Instruct --dataset humaneval --backend vllm --greedy --lang ${lang}
done
We use the LiveCodeBench package to run the programming problem solving task.
cd BenchMAX/tasks/LiveCodeBench
languages=(en ar bn cs de es fr hu ja ko ru sr sw te th vi zh)
for lang in "${languages[@]}"; do
python -m lcb_runner.runner.main --model meta-llama/Llama-3.1-8B-Instruct --local_model_path $local_model_path --release_version release_v4 --dataset $lang --evaluate --num_process_evaluate 16
done
We employ lm-evaluation-harness to run the math reasoning task. The installation process is the same as above.
cd BenchMAX
lm-eval -m vllm --model_args pretrained=${model} --tasks xmgsm_native_cot_multi --batch_size auto --apply_chat_template --include_path tasks/mgsm --log_samples -o results
We also employ lm-evaluation-harness to run the science reasoning task.
cd BenchMAX
lm-eval -m vllm --model_args pretrained=${model} --tasks xgpqa_main_native_cot_zeroshot_multi --batch_size auto --apply_chat_template --include_path tasks/gpqa --log_samples -o results
For the long context task, we adopt the RULER repository and add a QA-in-a-Haystack task.
First, download the data and models from the web.
cd BenchMAX/tasks/RULER/scripts
cd data/synthetic/json
bash download_haystack.sh
bash download_qa_dataset.sh
Then, configure your model information in config_models.sh and run.sh, referring to RULER's guide.
You can change the context length in config_models.sh.
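As a rough sketch of what that configuration looks like, RULER's config_models.sh selects per-model settings with a case statement along these lines; the variable names and values below are placeholders, so mirror an existing entry in the file and RULER's guide:
# Illustrative only -- copy an existing entry in config_models.sh and adjust it
MODEL_NAME="llama-3.1-8b-instruct"      # hypothetical entry name
case $MODEL_NAME in
    llama-3.1-8b-instruct)
        MODEL_PATH="meta-llama/Llama-3.1-8B-Instruct"
        MODEL_TEMPLATE_TYPE="meta-chat"  # placeholder chat-template identifier
        MODEL_FRAMEWORK="vllm"
        ;;
esac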
Finally, run the evaluation pipeline.
cd BenchMAX/tasks/RULER/scripts
languages=(en ar bn cs de es fr hu ja ko ru sr sw te th vi zh)
for lang in "${languages[@]}"; do
bash run.sh YOUR_MODEL_NAME synthetic $lang
done
For the tool use task, we modify the code from the NexusRaven repository. Simply run the following commands.
cd BenchMAX/tasks/nexus
languages=(en ar bn cs de es fr hu ja ko ru sr sw te th vi zh)
for lang in "${languages[@]}"; do
python evaluator.py -m ${model} --infer-backend vllm -t ${lang} --output-parser-name generic
done
Note that some models that support tool calling need specific output parsers.
For example, for llama3 models, --output-parser-name should be set to llama3.
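Concretely, the loop above becomes the following for a Llama-3 model:
# Same evaluation loop as above, using the llama3 output parser
cd BenchMAX/tasks/nexus
languages=(en ar bn cs de es fr hu ja ko ru sr sw te th vi zh)
for lang in "${languages[@]}"; do
    python evaluator.py -m meta-llama/Llama-3.1-8B-Instruct --infer-backend vllm -t ${lang} --output-parser-name llama3
done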
For the translation tasks, run the following commands to generate and evaluate translations. The supported metrics include spBLEU, BLEU, TER, and xCOMET.
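The commands below use spBLEU for scoring; to use one of the other metrics, change the --metrics argument. A minimal sketch, assuming the metric identifiers match the names listed above (note that xCOMET is a model-based metric, so its checkpoint must be available):
# Hypothetical variant: score the FLORES en->X translations with xCOMET instead of spBLEU,
# after running the corresponding generate_translation.py command below
cd BenchMAX/tasks/translation
python evaluate_translation.py -s en -t zh,es,fr,de,ru,ja,th,sw,bn,te,ar,ko,vi,cs,hu,sr --task-name flores --model-name $model --metrics xCOMET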
cd BenchMAX/tasks/translation
# generate general translations
# -s denotes source languages, -t denotes target languages
python generate_translation.py -s en -t zh,es,fr,de,ru,ja,th,sw,bn,te,ar,ko,vi,cs,hu,sr --task-name flores --model-name $model --infer-backend vllm --max-tokens 512
python generate_translation.py -s zh,es,fr,de,ru,ja,th,sw,bn,te,ar,ko,vi,cs,hu,sr -t en --task-name flores --model-name $model --infer-backend vllm --max-tokens 512
python generate_translation.py -s en -t zh,es,fr,de,ru,ja,th,bn,ar,ko,vi,cs,hu,sr --task-name ted --model-name $model --infer-backend vllm --max-tokens 512
python generate_translation.py -s zh,es,fr,de,ru,ja,th,bn,ar,ko,vi,cs,hu,sr -t en --task-name ted --model-name $model --infer-backend vllm --max-tokens 512
python generate_translation.py -s en -t cs,de,es,ja,ru,zh --task-name wmt24 --model-name $model --infer-backend vllm --max-tokens 1024
# evaluate general translations
python evaluate_translation.py -s en -t zh,es,fr,de,ru,ja,th,sw,bn,te,ar,ko,vi,cs,hu,sr --task-name flores --model-name $model --metrics spBLEU
python evaluate_translation.py -s zh,es,fr,de,ru,ja,th,sw,bn,te,ar,ko,vi,cs,hu,sr -t en --task-name flores --model-name $model --metrics spBLEU
python evaluate_translation.py -s en -t zh,es,fr,de,ru,ja,th,bn,ar,ko,vi,cs,hu,sr --task-name ted --model-name $model --metrics spBLEU
python evaluate_translation.py -s zh,es,fr,de,ru,ja,th,bn,ar,ko,vi,cs,hu,sr -t en --task-name ted --model-name $model --metrics spBLEU
python evaluate_translation.py -s en -t cs,de,es,ja,ru,zh --task-name wmt24 --model-name $model --metrics spBLEU
# generate and evaluate domain translations
tasks=("ifeval" "gpqa" "lcb_v4" "mgsm" "humaneval" "nexus" "arenahard")
max_tokens_list=(512 3072 2048 1024 1024 512 3072)
for i in "${!tasks[@]}"; do
task=${tasks[$i]}
max_tokens=${max_tokens_list[$i]}
python generate_translation.py -s en -t zh,es,fr,de,ru,ja,th,sw,bn,te,ar,ko,vi,cs,hu,sr --task-name $task --model-name $model --infer-backend vllm --max-tokens $max_tokens
python generate_translation.py -s zh,es,fr,de,ru,ja,th,sw,bn,te,ar,ko,vi,cs,hu,sr -t en --task-name $task --model-name $model --infer-backend vllm --max-tokens $max_tokens
python evaluate_translation.py -s en -t zh,es,fr,de,ru,ja,th,sw,bn,te,ar,ko,vi,cs,hu,sr --task-name $task --model-name $model --metrics spBLEU
python evaluate_translation.py -s zh,es,fr,de,ru,ja,th,sw,bn,te,ar,ko,vi,cs,hu,sr -t en --task-name $task --model-name $model --metrics spBLEU
done
If our dataset helps your work, please cite this paper:
@inproceedings{huang-etal-2025-benchmax,
title = "{B}ench{MAX}: A Comprehensive Multilingual Evaluation Suite for Large Language Models",
author = "Huang, Xu and
Zhu, Wenhao and
Hu, Hanxu and
He, Conghui and
Li, Lei and
Huang, Shujian and
Yuan, Fei",
editor = "Christodoulopoulos, Christos and
Chakraborty, Tanmoy and
Rose, Carolyn and
Peng, Violet",
booktitle = "Findings of the Association for Computational Linguistics: EMNLP 2025",
month = nov,
year = "2025",
address = "Suzhou, China",
publisher = "Association for Computational Linguistics",
url = "https://aclanthology.org/2025.findings-emnlp.909/",
doi = "10.18653/v1/2025.findings-emnlp.909",
pages = "16751--16774",
ISBN = "979-8-89176-335-7",
}

