Recently, a large amount of work has focused on improving large language models' (LLMs') performance on reasoning benchmarks such as math and logic. However, past work has largely assumed that tasks are well-defined. In the real world, queries to LLMs are often underspecified and only solvable by acquiring missing information. We formalize this as a constraint satisfaction problem (CSP) with missing variable assignments. Using a special case of this formalism, in which exactly one necessary variable assignment is missing, we can rigorously evaluate an LLM's ability to identify the minimal necessary question to ask and quantify axes of difficulty for each problem. We present QuestBench, a set of underspecified reasoning tasks solvable by asking at most one question, which includes: (1) Logic-Q: logical reasoning tasks with one missing proposition, (2) Planning-Q: PDDL planning problems with partially observed initial states, (3) GSM-Q: human-annotated grade school math problems with one missing variable assignment, and (4) GSME-Q: a version of GSM-Q where word problems are translated into equations by human annotators. The LLM is tasked with selecting the correct clarification question(s) from a list of options. While state-of-the-art models excel at GSM-Q and GSME-Q, their accuracy is only 40-50% on Logic-Q and Planning-Q. Analysis demonstrates that the ability to solve well-specified reasoning problems may not be sufficient for success on our benchmark: models have difficulty identifying the right question to ask, even when they can solve the fully specified version of the problem. Furthermore, in the Planning-Q domain, LLMs tend not to hedge, even when explicitly presented with the option to answer "not sure." This highlights the need for deeper investigation into models' information acquisition capabilities.
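For example (a hypothetical item in the spirit of GSM-Q, not drawn from the benchmark): given "Alice has 3 apples and some oranges. How many pieces of fruit does she have?", the task is 1-sufficient because exactly one clarification question, "How many oranges does Alice have?", renders it solvable.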
This repository contains code for generating QuestBench data and evaluating LLMs on it.
- Begin by creating a conda environment to contain the packages needed for QuestBench. You can install Anaconda here: https://docs.anaconda.com/miniconda/install/#quick-command-line-install
```
conda create -n questbench python=3.11
conda activate questbench
```
- Install PyTorch following the instructions here: https://pytorch.org/get-started/locally/
- Install the remaining requirements:
```
pip install -r requirements.txt
```
After downloading, expand the compressed file:
```
tar -xzvf questbench_data.tar.gz
```
Set your API key to be able to use Gemini models:
```
export GOOGLE_API_KEY=<gemini_api_key>
```
Log in to Hugging Face to be able to use Gemma models, and start a vLLM server with the desired model:
```
huggingface-cli login
vllm serve "google/gemma-2-2b-it" --port <port>
```
- Substitute the model name with `google/gemma-2-9b-it` or `google/gemma-2-27b-it` as necessary.
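For example, to serve the 27B variant on port 8000 (the port is an arbitrary choice; pass the same value to the eval script via `--vllm_port` later):
```
vllm serve "google/gemma-2-27b-it" --port 8000
```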
Set your OpenAI key to be able to use GPT models:
```
export OPENAI_API_KEY=<openai_api_key>
export OPENAI_ORGANIZATION=<openai_organization_key>
export OPENAI_PROJECT=<openai_project_key>
```
Next, run the eval:
```
python mc_eval.py \
    --model_name <model_name> \
    --domain_name [GSM_csp|Planning|SL|GSM_verbal] \
    --eval_mode [mc|isambig|fullinfo] \
    --data_dir <data_dir> \
    --data_file <data_fp> \
    --prompt_mode [|cot|fs4] \
    --results_dir <results_dir> \
    --batch_size 1 \
    (--model_role_name assistant) \
    (--vllm_port <port>)
```
- We currently support the following `--model_name` values: `gemini-1.5-pro`, `gemini-1.5-flash`, `gemini-2.0-flash-thinking-exp`, `gpt-4o`, `o1-preview`, `claude-3-5-sonnet-20241022`, `gemma_2_27b`, `gemma_2_9b`, `gemma_2_2b`.
- Other Gemini models can be found here. Other OpenAI models can be used by adding their names to `GPT_COSTS` in `model_utils.py`. Other Anthropic models can be used by adding their names to `CLAUDE_MODELS` in `model_utils.py`.
- If OpenAI or Anthropic models are used, add the `--model_role_name assistant` option. Otherwise, do not add it.
- Set `batch_size` to be lower than your RPS rate limit.
- If a Gemma-2 model is used, specify the vLLM port with `--vllm_port`.
- `--data_dir` should be set to the directory containing all the data files. By default, `--data_dir` is set to `questbench_data/`.
- `--data_file` should be set to the appropriate file for the domain. If you downloaded the datasets from the public website, the data files should be set to:
```
questbench_data/Logic-Q/simplelogic_heldout_1k.csv
questbench_data/Planning-Q/planning_heldout_7500.csv
questbench_data/GSM-Q/gsm_CSP_heldout_pilot.csv
questbench_data/GSM-Q/gsm_verbal_heldout_pilot.csv
```
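For example, a hypothetical invocation evaluating `gpt-4o` on Logic-Q with chain-of-thought prompting (the results directory is illustrative, and we assume `SL` is the domain name corresponding to the Logic-Q / SimpleLogic data file; adjust paths and settings to your setup):
```
export PYTHONPATH=.
python mc_eval.py \
    --model_name gpt-4o \
    --domain_name SL \
    --eval_mode mc \
    --data_dir questbench_data/ \
    --data_file questbench_data/Logic-Q/simplelogic_heldout_1k.csv \
    --prompt_mode cot \
    --results_dir results/ \
    --batch_size 1 \
    --model_role_name assistant
```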
Before running any code, be sure to run:
```
export PYTHONPATH=.
```
Generate 1-sufficient rulesets:
```
python SimpleLogic/generate_ruleset.py \
    --sl_dir <sl_rules_dir> \
    --start_idx <start_idx> \
    --end_idx <end_idx>
```
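For instance, a hypothetical run over an index range of 0 to 100, writing to a local `sl_rules/` directory (both the directory name and the index range are illustrative):
```
python SimpleLogic/generate_ruleset.py \
    --sl_dir sl_rules/ \
    --start_idx 0 \
    --end_idx 100
```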
Make Logic-Q data from 1-sufficient rulesets:
```
python SimpleLogic/make_data.py \
    --sl_dir <sl_rules_dir> \
    --max_problems_to_sample_per_ruleset <max_problems_to_sample_per_ruleset>
```
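Continuing the illustrative example above, sampling at most 10 problems per ruleset (an arbitrary cap) from the same `sl_rules/` directory:
```
python SimpleLogic/make_data.py \
    --sl_dir sl_rules/ \
    --max_problems_to_sample_per_ruleset 10
```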
Generate 1-sufficient CSPs:
```
python Planning/make_planning_data.py \
    --pddl_dir <pddl_dir> \
    --output_dir <output_dir>
```
Then make Planning-Q data from the 1-sufficient CSPs:
```
python Planning/make_data.py \
    --input_dir <input_dir> \
    --output_dir <output_dir>
```
where `input_dir` is the `output_dir` from the previous command.
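A hypothetical end-to-end sketch, assuming your PDDL files live in `pddl_files/` and using `planning_csps/` and `planning_q_data/` as output directories (all three names are illustrative):
```
python Planning/make_planning_data.py \
    --pddl_dir pddl_files/ \
    --output_dir planning_csps/

python Planning/make_data.py \
    --input_dir planning_csps/ \
    --output_dir planning_q_data/
```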
GSM-Q was created through human annotation.
Please see the technical report for more details.
```
@misc{li2025questbenchllmsaskright,
  title={QuestBench: Can LLMs ask the right question to acquire information in reasoning tasks?},
  author={Belinda Z. Li and Been Kim and Zi Wang},
  year={2025},
  eprint={2503.22674},
  archivePrefix={arXiv},
  primaryClass={cs.AI},
  url={https://arxiv.org/abs/2503.22674},
}
```
Copyright 2025 Google LLC
All software is licensed under the Apache License, Version 2.0 (Apache 2.0); you may not use this file except in compliance with the Apache 2.0 license. You may obtain a copy of the Apache 2.0 license at: https://www.apache.org/licenses/LICENSE-2.0
All other materials are licensed under the Creative Commons Attribution 4.0 International License (CC-BY). You may obtain a copy of the CC-BY license at: https://creativecommons.org/licenses/by/4.0/legalcode
Unless required by applicable law or agreed to in writing, all software and materials distributed here under the Apache 2.0 or CC-BY licenses are distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the licenses for the specific language governing permissions and limitations under those licenses.
This is not an official Google product.