Thinking with Imagination: Agentic Visual Spatial Reasoning with World Simulators

This is the official project repository for Thinking with Imagination: Agentic Visual Spatial Reasoning with World Simulators.

🧠 Introduction

TL;DR: We study visual spatial reasoning as active visual evidence acquisition. Astra lets a VLM decide when to query an action-conditioned world simulator, inspect the imagined view, and ground the final answer in both observed and simulated visual evidence.

Spatial reasoning from limited egocentric observations often requires evidence that is not directly visible. Conventional text-oriented chain-of-thought over fixed images provides limited gains in such settings. Astra reframes this problem as thinking with imagination: a policy can request a missing viewpoint from a learned world simulator and use the returned observation as spatial evidence.

The framework contains two main components:

Astra-VL: an agentic VLM policy and reasoner that decides when to imagine, plans camera-motion queries, and grounds the returned visual evidence before answering.
Astra-WM: an action-conditioned world simulator that synthesizes in-context novel observations from context images and natural-language camera-motion instructions.

🧩 Astra Method Overview

Astra couples an agentic VLM policy (Astra-VL) with an action-conditioned world simulator (Astra-WM), then trains the policy to acquire tool use and to use imagined observations selectively.
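The interaction between the two components can be pictured as a short loop. The sketch below is only an illustration under stated assumptions, not the released Astra code: `Step`, `answer_with_imagination`, `toy_policy`, `toy_simulator`, the `must_answer` flag, and the `max_calls` budget are all hypothetical names introduced here, and images are stood in for by plain strings.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Tuple

Image = str  # hypothetical stand-in for an image handle, e.g. a file path


@dataclass
class Step:
    """One policy decision: either answer now or request an imagined view."""
    answer: Optional[str] = None        # set when the policy is ready to answer
    motion_query: Optional[str] = None  # natural-language camera-motion instruction


# Hypothetical interfaces (assumptions, not the Astra-VL / Astra-WM API).
Policy = Callable[[str, List[Image], bool], Step]  # (question, evidence, must_answer)
Simulator = Callable[[List[Image], str], Image]    # (context images, motion) -> view


def answer_with_imagination(
    question: str,
    context: List[Image],
    policy: Policy,
    simulator: Simulator,
    max_calls: int = 2,
) -> Tuple[str, List[Image]]:
    """Let the policy decide when to imagine, then answer from all evidence."""
    evidence = list(context)
    for _ in range(max_calls):
        step = policy(question, evidence, False)
        if step.answer is not None:
            return step.answer, evidence
        if step.motion_query is None:
            raise ValueError("policy returned neither an answer nor a motion query")
        # The simulator is conditioned on the observed context images only.
        evidence.append(simulator(context, step.motion_query))
    # Simulator budget exhausted: require a final answer.
    final = policy(question, evidence, True)
    if final.answer is None:
        raise ValueError("policy did not answer when required to")
    return final.answer, evidence


def toy_policy(question: str, evidence: List[Image], must_answer: bool) -> Step:
    # Toy rule: imagine once if only the two observed views are available.
    if must_answer or len(evidence) > 2:
        return Step(answer="B")
    return Step(motion_query="turn left 90 degrees")


def toy_simulator(context: List[Image], motion: str) -> Image:
    return f"imagined[{motion}]"


answer, evidence = answer_with_imagination(
    "Which object is behind the camera?",
    ["view_0.png", "view_1.png"],
    toy_policy,
    toy_simulator,
)
print(answer)    # B
print(evidence)  # ['view_0.png', 'view_1.png', 'imagined[turn left 90 degrees]']
```

In this sketch the policy sees the growing evidence list on every turn, so the final answer can be grounded in both observed and simulated views.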

🔍 Findings

Spatial consistency matters. Plausible generation is not enough: only an action-faithful world simulator turns imagined views into reliable spatial evidence and reasoning gains.
Simulator use must be learned. Strong proprietary VLMs can benefit from Astra-WM directly, while open-source VLMs need agentic training to decide when and how to imagine.
Selective imagination beats tool overuse. Rewarding tool calls alone leads to excessive simulator use, while the two-phase curriculum learns when imagined evidence is actually helpful.
Imagination helps when evidence is viewpoint-dependent. Agentic tool use keeps camera-centric gains while avoiding generated views when the original context is already sufficient.

🏆 Experimental Results on Spatial Reasoning Benchmarks

We compare Direct Answer, Forced Tool-Use, and Agentic Tool-Use settings. Values in parentheses denote absolute changes over the corresponding Direct Answer result of the same model.

Type	Model	MMSI-Bench PR.	MMSI-Bench Attr.	MMSI-Bench Mot.	MMSI-Bench MSR	MMSI-Bench All	MindCube-Tiny Rot.	MindCube-Tiny Ard.	MindCube-Tiny Amg.	MindCube-Tiny All
Direct Answer
Open-source	Qwen3-VL-8B-Instruct	30.8	30.1	27.7	28.1	29.8	53.6	38.0	31.1	36.8
	Qwen3-VL-30B-Instruct	31.2	35.8	25.9	29.1	30.6	39.9	47.5	38.5	41.8
	Bagel-7B-MoT	33.5	27.7	25.3	30.8	31.0	34.5	31.4	42.8	34.7
Proprietary	GLM-4.5V	35.6	36.9	29.3	30.3	33.8	60.0	25.5	42.2	39.6
	GPT-4o	28.0	32.3	36.0	30.8	30.3	33.5	35.0	37.2	35.8
	Gemini-2.5-Pro	39.0	36.2	33.3	34.3	36.9	89.5	54.5	48.8	57.5
	Gemini-3-Flash	45.6	45.4	44.0	46.0	45.4	93.0	72.0	61.7	70.5
Forced Tool-Use (zero-shot)
Open-source	Qwen3-VL-8B-Instruct	30.4 (-0.4)	29.5 (-0.6)	19.6 (-8.1)	30.8 (+2.7)	28.6 (-1.2)	31.1 (-22.5)	23.7 (-14.3)	26.8 (-4.3)	27.6 (-9.2)
	Qwen3-VL-30B-Instruct	31.5 (+0.3)	28.7 (-7.1)	21.6 (-4.3)	28.1 (-1.0)	28.9 (-1.7)	34.7 (-5.2)	32.7 (-14.8)	38.1 (-0.4)	35.7 (-6.1)
	Bagel-7B-MoT	31.3 (-2.2)	25.6 (-2.1)	24.7 (-0.6)	28.7 (-2.1)	29.7 (-1.3)	33.9 (-0.6)	26.8 (-4.6)	31.8 (-11.0)	29.2 (-5.5)
Proprietary	Gemini-3-Flash	50.4 (+4.8)	51.5 (+6.1)	43.4 (-0.6)	50.3 (+4.3)	49.5 (+4.1)	93.0 (+0.0)	70.3 (-1.7)	65.0 (+3.3)	72.7 (+2.2)
Agentic Tool-Use
Open-source	Astra (Qwen3-VL-8B-Instruct)	42.3 (+11.5)	41.0 (+10.9)	32.1 (+4.4)	33.6 (+5.5)	38.8 (+9.0)	60.1 (+6.5)	43.5 (+5.5)	36.8 (+5.7)	42.7 (+5.9)
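As a quick check of how the parenthesized values are derived, the snippet below recomputes two cells of the Astra row from the Direct Answer row of the same base model (Qwen3-VL-8B-Instruct). The scores are copied from the table; the helper names `absolute_change` and `format_cell` are hypothetical and used only for illustration.

```python
# "All" columns, copied from the table above.
direct = {"MMSI-Bench All": 29.8, "MindCube-Tiny All": 36.8}  # Qwen3-VL-8B-Instruct, Direct Answer
astra = {"MMSI-Bench All": 38.8, "MindCube-Tiny All": 42.7}   # Astra (Qwen3-VL-8B-Instruct), Agentic Tool-Use


def absolute_change(result: dict, baseline: dict) -> dict:
    """Absolute difference per column, rounded to one decimal."""
    return {name: round(result[name] - baseline[name], 1) for name in result}


def format_cell(score: float, delta: float) -> str:
    """Render a cell in the table's 'score (+delta)' style."""
    return f"{score:.1f} ({delta:+.1f})"


deltas = absolute_change(astra, direct)
cells = {name: format_cell(astra[name], deltas[name]) for name in astra}
print(cells)
# {'MMSI-Bench All': '38.8 (+9.0)', 'MindCube-Tiny All': '42.7 (+5.9)'}
```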

📊 Workflow Mode Ablation

The same trained Astra policy is evaluated under no-tool/direct-answer, forced-tool, and agentic-tool modes. Agentic tool use preserves the gains on camera-centric relations while avoiding unnecessary simulator calls when the original context is more reliable.
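The difference between the three modes comes down to who decides whether the simulator is called. The following is a minimal sketch of that distinction, assuming one simulator decision per question; `Mode`, `uses_simulator`, and the `requests` list are hypothetical and are not taken from the evaluation scripts.

```python
from enum import Enum


class Mode(Enum):
    DIRECT = "direct"    # no-tool / direct-answer
    FORCED = "forced"    # simulator is always called
    AGENTIC = "agentic"  # the policy decides


def uses_simulator(mode: Mode, policy_requests_view: bool) -> bool:
    """Return whether the simulator is queried for one question."""
    if mode is Mode.DIRECT:
        return False
    if mode is Mode.FORCED:
        return True
    return policy_requests_view


# Hypothetical per-question decisions made by the same policy.
requests = [True, False, False, True]
calls = {mode.value: sum(uses_simulator(mode, r) for r in requests) for mode in Mode}
print(calls)  # {'direct': 0, 'forced': 4, 'agentic': 2}
```

Only the agentic mode lets the policy skip the simulator on questions where the original context already suffices.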

🚀 Release Progress

Astra-VL and Astra-WM evaluation scripts
Astra-WM checkpoints
Astra-VL checkpoints
Astra training code

📜 Citation

If you find this project useful, please cite:

@misc{zhu2026thinkingimaginationagenticvisual,
      title={Thinking with Imagination: Agentic Visual Spatial Reasoning with World Simulators}, 
      author={Chenming Zhu and Jingli Lin and Yilin Long and Peizhou Cao and Tai Wang and Jiangmiao Pang and Xihui Liu},
      year={2026},
      eprint={2606.06476},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2606.06476}, 
}

🤝 Contact

If you have any questions, please contact chaimzhu@connect.hku.hk.

💡 Acknowledgement

We sincerely thank the following projects for their valuable codebases and benchmarks: Verl, vllm-omni, SenseNova-MARS, MMSI-Bench, MindCube.
