google-deepmind/questbench

Recently, a large amount of work has focused on improving large language models' (LLMs') performance on reasoning benchmarks such as math and logic. However, past work has largely assumed that tasks are well-defined. In the real world, queries to LLMs are often underspecified, only solvable through acquiring missing information. We formalize this as a constraint satisfaction problem (CSP) with missing variable assignments. Using a special case of this formalism where only one necessary variable assignment is missing, we can rigorously evaluate an LLM's ability to identify the minimal necessary question to ask and quantify axes of difficulty levels for each problem. We present QuestBench, a set of underspecified reasoning tasks solvable by asking at most one question, which includes: (1) Logic-Q: Logical reasoning tasks with one missing proposition, (2) Planning-Q: PDDL planning problems with initial states that are partially-observed, (3) GSM-Q: Human-annotated grade school math problems with one missing variable assignment, and (4) GSME-Q: a version of GSM-Q where word problems are translated into equations by human annotators. The LLM is tasked with selecting the correct clarification question(s) from a list of options. While state-of-the-art models excel at GSM-Q and GSME-Q, their accuracy is only 40-50% on Logic-Q and Planning-Q. Analysis demonstrates that the ability to solve well-specified reasoning problems may not be sufficient for success on our benchmark: models have difficulty identifying the right question to ask, even when they can solve the fully specified version of the problem. Furthermore, in the Planning-Q domain, LLMs tend not to hedge, even when explicitly presented with the option to predict "not sure." This highlights the need for deeper investigation into models' information acquisition capabilities.
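For intuition, here is a purely illustrative example in the GSM-Q style (not drawn from the dataset): "A store sells pencils for $2 each. How much does Alice spend?" The problem is unsolvable as stated; among candidate clarification questions such as "How many pencils does Alice buy?", "What color are the pencils?", and "What is the store's name?", only the first supplies the single missing variable assignment needed to compute the answer, so it is the correct choice.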

Paper link | Download dataset

This repository contains code for generating QuestBench data and evaluating LLMs on it.

Installation

  1. Begin by creating a conda environment to contain the packages needed for QuestBench. You can install Miniconda here: https://docs.anaconda.com/miniconda/install/#quick-command-line-install
conda create -n questbench python=3.11
conda activate questbench
  2. Install PyTorch following the instructions here: https://pytorch.org/get-started/locally/

  3. Install the remaining requirements

pip install -r requirements.txt
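
Optionally, you can sanity-check the environment (assuming the installs above completed without errors) with:

python -c "import torch; print(torch.__version__)"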

Download datasets

  1. Click here to download the datasets.

  2. After downloading, expand the compressed file.

tar -xzvf questbench_data.tar.gz
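
After extraction you should have a questbench_data/ directory; the evaluation commands below reference the following files inside it (the archive may contain additional material):

questbench_data/
  Logic-Q/simplelogic_heldout_1k.csv
  Planning-Q/planning_heldout_7500.csv
  GSM-Q/gsm_CSP_heldout_pilot.csv
  GSM-Q/gsm_verbal_heldout_pilot.csv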

Run evaluations

Set your API key to be able to use Gemini models

export GOOGLE_API_KEY=<gemini_api_key>

Log in to Hugging Face to be able to use Gemma models, and start a vLLM server with the desired model

huggingface-cli login
vllm serve "google/gemma-2-2b-it" --port <port>
  • Substitute the model name with google/gemma-2-9b-it or google/gemma-2-27b-it as necessary.
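
To confirm the server is reachable before running evaluations, you can query vLLM's OpenAI-compatible model-listing endpoint (assuming the default localhost host and the port chosen above):

curl http://localhost:<port>/v1/models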

Set your OpenAI keys to be able to use GPT models

export OPENAI_API_KEY=<openai_api_key>
export OPENAI_ORGANIZATION=<openai_organization_key>
export OPENAI_PROJECT=<openai_project_key>

Next, run the evaluation. The options are described in the list below, and a concrete example invocation is shown after it.

python mc_eval.py \
--model_name <model_name> \
--domain_name [GSM_csp|Planning|SL|GSM_verbal] \
--eval_mode [mc|isambig|fullinfo] \
--data_dir <data_dir> \
--data_file <data_fp> \
--prompt_mode [|cot|fs4] \
--results_dir <results_dir> \
--batch_size 1 \
(--model_role_name assistant)
(--vllm_port <port>)
  • We currently support the following --model_name:
    • gemini-1.5-pro
    • gemini-1.5-flash
    • gemini-2.0-flash-thinking-exp
    • gpt-4o
    • o1-preview
    • claude-3-5-sonnet-20241022
    • gemma_2_27b
    • gemma_2_9b
    • gemma_2_2b
  • Other Gemini models can be found here. Other OpenAI models can be used by adding their names to GPT_COSTS in model_utils.py. Other Anthropic models can be used by adding their names to CLAUDE_MODELS in model_utils.py.
  • If OpenAI or Anthropic models are used, add the --model_role_name assistant option. Otherwise do not add it.
  • Set --batch_size to be lower than your requests-per-second (RPS) rate limit.
  • If a gemma-2 model is used, specify the vLLM port used when starting the server.
  • --data_dir should be set to the directory containing all the data files. By default, --data_dir is set to questbench_data/.
  • --data_file should be set to the appropriate file for the domain. If you downloaded the datasets from the public website, the data files should be set to
questbench_data/Logic-Q/simplelogic_heldout_1k.csv
questbench_data/Planning-Q/planning_heldout_7500.csv
questbench_data/GSM-Q/gsm_CSP_heldout_pilot.csv
questbench_data/GSM-Q/gsm_verbal_heldout_pilot.csv
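
For example, a single Logic-Q evaluation run with gemini-1.5-flash and chain-of-thought prompting might look like the following (SL is the SimpleLogic-based Logic-Q domain; results/ is an arbitrary output directory, so adjust paths to your setup):

python mc_eval.py \
--model_name gemini-1.5-flash \
--domain_name SL \
--eval_mode mc \
--data_dir questbench_data/ \
--data_file questbench_data/Logic-Q/simplelogic_heldout_1k.csv \
--prompt_mode cot \
--results_dir results/ \
--batch_size 1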

Generate datasets

Before running any code, be sure to run

export PYTHONPATH=.

Logic-Q

Generate 1-sufficient rulesets

python SimpleLogic/generate_ruleset.py \
      --sl_dir <sl_rules_dir> \
      --start_idx <start_idx> \
      --end_idx <end_idx>

Make Logic-Q data from 1-sufficient rulesets

python SimpleLogic/make_data.py \
      --sl_dir <sl_rules_dir> \
      --max_problems_to_sample_per_ruleset <max_problems_to_sample_per_ruleset>
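
An illustrative end-to-end run (the directory name, index range, and sampling limit below are placeholder values, not prescribed settings) might be:

python SimpleLogic/generate_ruleset.py \
      --sl_dir sl_rules/ \
      --start_idx 0 \
      --end_idx 100

python SimpleLogic/make_data.py \
      --sl_dir sl_rules/ \
      --max_problems_to_sample_per_ruleset 10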

Planning-Q

Generate 1-sufficient CSPs

python Planning/make_planning_data.py \
      --pddl_dir <pddl_dir> \
      --output_dir <output_dir>

Make Planning-Q data from 1-sufficient CSPs

python Planning/make_data.py \
      --input_dir <input_dir> \
      --output_dir <output_dir>

where input_dir is the output_dir from the previous command.
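
An illustrative chained run (the directory names here are placeholders) might be:

python Planning/make_planning_data.py \
      --pddl_dir pddl_problems/ \
      --output_dir planning_csps/

python Planning/make_data.py \
      --input_dir planning_csps/ \
      --output_dir planning_q_data/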

GSM-Q

GSM-Q was created through human annotation.

Please see the technical report for more details.

Citing this work

@misc{li2025questbenchllmsaskright,
      title={QuestBench: Can LLMs ask the right question to acquire information in reasoning tasks?}, 
      author={Belinda Z. Li and Been Kim and Zi Wang},
      year={2025},
      eprint={2503.22674},
      archivePrefix={arXiv},
      primaryClass={cs.AI},
      url={https://arxiv.org/abs/2503.22674}, 
}

License and disclaimer

Copyright 2025 Google LLC

All software is licensed under the Apache License, Version 2.0 (Apache 2.0); you may not use this file except in compliance with the Apache 2.0 license. You may obtain a copy of the Apache 2.0 license at: https://www.apache.org/licenses/LICENSE-2.0

All other materials are licensed under the Creative Commons Attribution 4.0 International License (CC-BY). You may obtain a copy of the CC-BY license at: https://creativecommons.org/licenses/by/4.0/legalcode

Unless required by applicable law or agreed to in writing, all software and materials distributed here under the Apache 2.0 or CC-BY licenses are distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the licenses for the specific language governing permissions and limitations under those licenses.

This is not an official Google product.
