This is the official project repository for Thinking with Imagination: Agentic Visual Spatial Reasoning with World Simulators.
TL;DR: We study visual spatial reasoning as active visual evidence acquisition. Astra lets a VLM decide when to query an action-conditioned world simulator, inspect the imagined view, and ground the final answer in both observed and simulated visual evidence.
Spatial reasoning from limited egocentric observations often requires evidence that is not directly visible. Conventional text-oriented chain-of-thought over fixed images provides limited gains in such settings. Astra reframes this problem as thinking with imagination: a policy can request a missing viewpoint from a learned world simulator and use the returned observation as spatial evidence.
The framework contains two main components:
- Astra-VL: an agentic VLM policy and reasoner that decides when to imagine, plans camera-motion queries, and grounds the returned visual evidence before answering.
- Astra-WM: an action-conditioned world simulator that synthesizes in-context novel observations from context images and natural-language camera-motion instructions.
Astra couples an agentic VLM policy with an action-conditioned world simulator, then trains the policy to acquire tool use and to use imagined observations selectively.
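
The interaction between the two components can be pictured as a short loop: the policy either asks for an imagined view or commits to an answer. The snippet below is an illustrative sketch only, not the released Astra code. `Step`, `answer_with_imagination`, `ToyPolicy`, `ToySimulator`, the `final_answer` method, and the `max_calls` budget are all hypothetical names, and the two toy classes exist only so the example runs without any model.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class Step:
    """One policy decision: a camera-motion query or a final answer."""
    query: Optional[str] = None   # natural-language camera motion (assumed format)
    answer: Optional[str] = None  # final answer, if the policy is ready


def answer_with_imagination(policy, simulator, question: str,
                            context: List[str], max_calls: int = 3) -> Tuple[str, int]:
    """Let the policy request imagined views until it answers or the budget runs out."""
    views = list(context)  # observed views first, imagined views appended later
    calls = 0
    while calls < max_calls:
        step = policy.step(question, views)
        if step.answer is not None:
            return step.answer, calls
        # The policy asked for a missing viewpoint: query the simulator.
        views.append(simulator.imagine(views, step.query))
        calls += 1
    # Budget exhausted: require an answer from the evidence gathered so far.
    return policy.final_answer(question, views), calls


class ToyPolicy:
    """Stand-in for Astra-VL (hypothetical): wants three views before answering."""

    def step(self, question, views):
        if len(views) < 3:
            return Step(query="turn left 90 degrees")
        return Step(answer="A")

    def final_answer(self, question, views):
        return "A"


class ToySimulator:
    """Stand-in for Astra-WM (hypothetical): returns a text placeholder, not an image."""

    def imagine(self, views, motion):
        return f"imagined view after: {motion}"


answer, calls = answer_with_imagination(
    ToyPolicy(), ToySimulator(),
    "Which object is behind the camera?",
    ["view_0.png", "view_1.png"],
)
print(answer, calls)
```

With two context views, the toy policy issues one simulator query and then answers, so the loop ends after a single call. A real policy would decide from the image content rather than from the number of views.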
- Spatial consistency matters. Plausible generation is not enough: only an action-faithful world simulator turns imagined views into reliable spatial evidence and reasoning gains.
- Simulator use must be learned. Strong proprietary VLMs can benefit from Astra-WM directly, while open-source VLMs need agentic training to decide when and how to imagine.
- Selective imagination beats tool overuse. Rewarding tool calls alone leads to excessive simulator use, while the two-phase curriculum learns when imagined evidence is actually helpful.
- Imagination helps when evidence is viewpoint-dependent. Agentic tool use keeps camera-centric gains while avoiding generated views when the original context is already sufficient.
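
To see why rewarding tool calls alone can encourage overuse, consider the toy reward below. It is a hypothetical illustration, not the reward used to train Astra: the function name `toy_reward`, the split into two phases with these particular terms, and the `bonus` and `cost` values are all assumptions made for the example.

```python
def toy_reward(correct: bool, num_tool_calls: int, phase: int,
               bonus: float = 0.2, cost: float = 0.05) -> float:
    """Hypothetical shaped reward; the terms and constants are assumptions."""
    base = 1.0 if correct else 0.0
    if phase == 1:
        # Assumed phase 1: pay a bonus for calling the simulator at all.
        return base + (bonus if num_tool_calls > 0 else 0.0)
    # Assumed phase 2: only correctness pays, and each call has a small cost.
    return base - cost * num_tool_calls


# Under the phase-1 reward, calling the tool pays off even when the answer is wrong,
# so a policy trained on it alone has no reason to ever skip the simulator.
print(toy_reward(False, 1, phase=1) > toy_reward(False, 0, phase=1))

# Under the phase-2 reward, a correct answer without calls beats one with calls,
# so calls are worthwhile only when they change the answer from wrong to right.
print(toy_reward(True, 0, phase=2) > toy_reward(True, 2, phase=2))
print(toy_reward(True, 2, phase=2) > toy_reward(False, 0, phase=2))
```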
We compare Direct Answer, Forced Tool-Use, and Agentic Tool-Use settings. Values in parentheses denote absolute changes over the corresponding Direct Answer result of the same model.
| Type | Model | MMSI-Bench PR. | MMSI-Bench Attr. | MMSI-Bench Mot. | MMSI-Bench MSR | MMSI-Bench All | MindCube-Tiny Rot. | MindCube-Tiny Ard. | MindCube-Tiny Amg. | MindCube-Tiny All |
|---|---|---|---|---|---|---|---|---|---|---|
| Direct Answer | | | | | | | | | | |
| Open-source | Qwen3-VL-8B-Instruct | 30.8 | 30.1 | 27.7 | 28.1 | 29.8 | 53.6 | 38.0 | 31.1 | 36.8 |
| | Qwen3-VL-30B-Instruct | 31.2 | 35.8 | 25.9 | 29.1 | 30.6 | 39.9 | 47.5 | 38.5 | 41.8 |
| | Bagel-7B-MoT | 33.5 | 27.7 | 25.3 | 30.8 | 31.0 | 34.5 | 31.4 | 42.8 | 34.7 |
| Proprietary | GLM-4.5V | 35.6 | 36.9 | 29.3 | 30.3 | 33.8 | 60.0 | 25.5 | 42.2 | 39.6 |
| | GPT-4o | 28.0 | 32.3 | 36.0 | 30.8 | 30.3 | 33.5 | 35.0 | 37.2 | 35.8 |
| | Gemini-2.5-Pro | 39.0 | 36.2 | 33.3 | 34.3 | 36.9 | 89.5 | 54.5 | 48.8 | 57.5 |
| | Gemini-3-Flash | 45.6 | 45.4 | 44.0 | 46.0 | 45.4 | 93.0 | 72.0 | 61.7 | 70.5 |
| Forced Tool-Use (zero-shot) | | | | | | | | | | |
| Open-source | Qwen3-VL-8B-Instruct | 30.4 (-0.4) | 29.5 (-0.6) | 19.6 (-8.1) | 30.8 (+2.7) | 28.6 (-1.2) | 31.1 (-22.5) | 23.7 (-14.3) | 26.8 (-4.3) | 27.6 (-9.2) |
| | Qwen3-VL-30B-Instruct | 31.5 (+0.3) | 28.7 (-7.1) | 21.6 (-4.3) | 28.1 (-1.0) | 28.9 (-1.7) | 34.7 (-5.2) | 32.7 (-14.8) | 38.1 (-0.4) | 35.7 (-6.1) |
| | Bagel-7B-MoT | 31.3 (-2.2) | 25.6 (-2.1) | 24.7 (-0.6) | 28.7 (-2.1) | 29.7 (-1.3) | 33.9 (-0.6) | 26.8 (-4.6) | 31.8 (-11.0) | 29.2 (-5.5) |
| Proprietary | Gemini-3-Flash | 50.4 (+4.8) | 51.5 (+6.1) | 43.4 (-0.6) | 50.3 (+4.3) | 49.5 (+4.1) | 93.0 (+0.0) | 70.3 (-1.7) | 65.0 (+3.3) | 72.7 (+2.2) |
| Agentic Tool-Use | | | | | | | | | | |
| Open-source | Astra (Qwen3-VL-8B-Instruct) | 42.3 (+11.5) | 41.0 (+10.9) | 32.1 (+4.4) | 33.6 (+5.5) | 38.8 (+9.0) | 60.1 (+6.5) | 43.5 (+5.5) | 36.8 (+5.7) | 42.7 (+5.9) |
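
As a worked example of the parenthesized values, the snippet below recomputes the overall ("All") changes for Qwen3-VL-8B-Instruct from the numbers in the table. The variable and function names are chosen for illustration only.

```python
# Overall ("All") accuracies for Qwen3-VL-8B-Instruct, copied from the table.
direct = {"MMSI-Bench": 29.8, "MindCube-Tiny": 36.8}
forced = {"MMSI-Bench": 28.6, "MindCube-Tiny": 27.6}
agentic = {"MMSI-Bench": 38.8, "MindCube-Tiny": 42.7}  # Astra, same base model


def delta(result, baseline):
    """Absolute change in accuracy points, rounded to one decimal place."""
    return {name: round(result[name] - baseline[name], 1) for name in result}


forced_delta = delta(forced, direct)
agentic_delta = delta(agentic, direct)
print(forced_delta)   # change of the forced tool-use row over Direct Answer
print(agentic_delta)  # change of the agentic tool-use row over Direct Answer
```

The results match the table: forced tool use lowers the base model by 1.2 and 9.2 points, while the trained agentic policy raises it by 9.0 and 5.9 points.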
The same trained Astra policy is evaluated under no-tool/direct-answer, forced-tool, and agentic-tool modes. Agentic tool use preserves the gains on camera-centric relations while avoiding unnecessary simulator calls when the original context is more reliable.
- Astra-VL and Astra-WM evaluation scripts
- Astra-WM checkpoints
- Astra-VL checkpoints
- Astra training code
If you find this project useful, please cite:
@misc{zhu2026thinkingimaginationagenticvisual,
title={Thinking with Imagination: Agentic Visual Spatial Reasoning with World Simulators},
author={Chenming Zhu and Jingli Lin and Yilin Long and Peizhou Cao and Tai Wang and Jiangmiao Pang and Xihui Liu},
year={2026},
eprint={2606.06476},
archivePrefix={arXiv},
primaryClass={cs.CV},
url={https://arxiv.org/abs/2606.06476},
}

If you have any questions, please contact chaimzhu@connect.hku.hk.
We sincerely thank the following projects for their valuable codebases and benchmarks: Verl, vllm-omni, SenseNova-MARS, MMSI-Bench, MindCube.

