ZCMax/Thinking-With-Imagination

Thinking with Imagination: Agentic Visual Spatial Reasoning with World Simulators

Project Page arXiv PDF Astra-WM

This is the official project repository for Thinking with Imagination: Agentic Visual Spatial Reasoning with World Simulators.

[Figure: Astra overview]

🧠 Introduction

TL;DR: We study visual spatial reasoning as active visual evidence acquisition. Astra lets a VLM decide when to query an action-conditioned world simulator, inspect the imagined view, and ground the final answer in both observed and simulated visual evidence.

Spatial reasoning from limited egocentric observations often requires evidence that is not directly visible. Conventional text-oriented chain-of-thought over fixed images provides limited gains in such settings. Astra reframes this problem as thinking with imagination: a policy can request a missing viewpoint from a learned world simulator and use the returned observation as spatial evidence.

The framework contains two main components:

  • Astra-VL: an agentic VLM policy and reasoner that decides when to imagine, plans camera-motion queries, and grounds the returned visual evidence before answering.
  • Astra-WM: an action-conditioned world simulator that synthesizes in-context novel observations from context images and natural-language camera-motion instructions.
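The interaction between the two components can be sketched as a simple loop. This is a minimal illustration under stated assumptions, not the released Astra code: `Step`, `answer_with_imagination`, `toy_policy`, `toy_simulator`, and the `max_queries` budget are hypothetical names introduced here, and images are represented by plain strings.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Tuple

# Every name in this sketch is hypothetical; none of it is the Astra API.
Image = str  # stand-in for an image handle, e.g. a file path


@dataclass
class Step:
    """One policy decision: a camera-motion query or a final answer."""
    camera_motion: Optional[str] = None  # natural-language instruction
    answer: Optional[str] = None


Policy = Callable[[str, List[Image]], Step]      # (question, views) -> Step
Simulator = Callable[[List[Image], str], Image]  # (context, instruction) -> view


def answer_with_imagination(
    question: str,
    observed: List[Image],
    policy: Policy,
    simulator: Simulator,
    max_queries: int = 3,  # assumed budget, not taken from the paper
) -> Tuple[Optional[str], List[Image]]:
    """Let the policy query the simulator until it chooses to answer."""
    views = list(observed)
    imagined: List[Image] = []
    for _ in range(max_queries):
        step = policy(question, views)
        if step.camera_motion is None:
            # The policy judged the available evidence sufficient.
            return step.answer, imagined
        new_view = simulator(views, step.camera_motion)
        views.append(new_view)
        imagined.append(new_view)
    # Budget exhausted: ask once more and return whatever answer is given
    # (this may be None if the policy still wants another view).
    return policy(question, views).answer, imagined


def toy_policy(question: str, views: List[Image]) -> Step:
    # Toy rule: imagine exactly one extra view, then answer.
    if len(views) < 2:
        return Step(camera_motion="turn left 90 degrees")
    return Step(answer="B")


def toy_simulator(context: List[Image], instruction: str) -> Image:
    return f"imagined({instruction})"


answer, imagined = answer_with_imagination(
    "Which object is behind the camera?", ["view0.png"], toy_policy, toy_simulator
)
```

With the toy stubs, the loop makes one simulator call and then answers; a real policy would be the VLM and a real simulator would return synthesized images.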

🧩 Astra Method Overview

Astra couples an agentic VLM policy with an action-conditioned world simulator, then trains the policy to call the simulator as a tool and to use imagined observations selectively.

[Figure: Astra method architecture]

πŸ” Findings

  1. Spatial consistency matters. Plausible generation is not enough: only an action-faithful world simulator turns imagined views into reliable spatial evidence and reasoning gains.
  2. Simulator use must be learned. Strong proprietary VLMs can benefit from Astra-WM directly, while open-source VLMs need agentic training to decide when and how to imagine.
  3. Selective imagination beats tool overuse. Rewarding tool calls alone leads to excessive simulator use, while the two-phase curriculum learns when imagined evidence is actually helpful.
  4. Imagination helps when evidence is viewpoint-dependent. Agentic tool use keeps camera-centric gains while avoiding generated views when the original context is already sufficient.

πŸ† Experimental Results on Spatial Reasoning Benchmarks

We compare Direct Answer, Forced Tool-Use, and Agentic Tool-Use settings. Values in parentheses denote absolute changes over the corresponding Direct Answer result of the same model.

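| Type | Model | MMSI-Bench PR. | MMSI-Bench Attr. | MMSI-Bench Mot. | MMSI-Bench MSR | MMSI-Bench All | MindCube-Tiny Rot. | MindCube-Tiny Ard. | MindCube-Tiny Amg. | MindCube-Tiny All |
|---|---|---|---|---|---|---|---|---|---|---|
| Direct Answer | | | | | | | | | | |
| Open-source | Qwen3-VL-8B-Instruct | 30.8 | 30.1 | 27.7 | 28.1 | 29.8 | 53.6 | 38.0 | 31.1 | 36.8 |
| Open-source | Qwen3-VL-30B-Instruct | 31.2 | 35.8 | 25.9 | 29.1 | 30.6 | 39.9 | 47.5 | 38.5 | 41.8 |
| Open-source | Bagel-7B-MoT | 33.5 | 27.7 | 25.3 | 30.8 | 31.0 | 34.5 | 31.4 | 42.8 | 34.7 |
| Proprietary | GLM-4.5V | 35.6 | 36.9 | 29.3 | 30.3 | 33.8 | 60.0 | 25.5 | 42.2 | 39.6 |
| Proprietary | GPT-4o | 28.0 | 32.3 | 36.0 | 30.8 | 30.3 | 33.5 | 35.0 | 37.2 | 35.8 |
| Proprietary | Gemini-2.5-Pro | 39.0 | 36.2 | 33.3 | 34.3 | 36.9 | 89.5 | 54.5 | 48.8 | 57.5 |
| Proprietary | Gemini-3-Flash | 45.6 | 45.4 | 44.0 | 46.0 | 45.4 | 93.0 | 72.0 | 61.7 | 70.5 |
| Forced Tool-Use (zero-shot) | | | | | | | | | | |
| Open-source | Qwen3-VL-8B-Instruct | 30.4 (-0.4) | 29.5 (-0.6) | 19.6 (-8.1) | 30.8 (+2.7) | 28.6 (-1.2) | 31.1 (-22.5) | 23.7 (-14.3) | 26.8 (-4.3) | 27.6 (-9.2) |
| Open-source | Qwen3-VL-30B-Instruct | 31.5 (+0.3) | 28.7 (-7.1) | 21.6 (-4.3) | 28.1 (-1.0) | 28.9 (-1.7) | 34.7 (-5.2) | 32.7 (-14.8) | 38.1 (-0.4) | 35.7 (-6.1) |
| Open-source | Bagel-7B-MoT | 31.3 (-2.2) | 25.6 (-2.1) | 24.7 (-0.6) | 28.7 (-2.1) | 29.7 (-1.3) | 33.9 (-0.6) | 26.8 (-4.6) | 31.8 (-11.0) | 29.2 (-5.5) |
| Proprietary | Gemini-3-Flash | 50.4 (+4.8) | 51.5 (+6.1) | 43.4 (-0.6) | 50.3 (+4.3) | 49.5 (+4.1) | 93.0 (+0.0) | 70.3 (-1.7) | 65.0 (+3.3) | 72.7 (+2.2) |
| Agentic Tool-Use | | | | | | | | | | |
| Open-source | Astra (Qwen3-VL-8B-Instruct) | 42.3 (+11.5) | 41.0 (+10.9) | 32.1 (+4.4) | 33.6 (+5.5) | 38.8 (+9.0) | 60.1 (+6.5) | 43.5 (+5.5) | 36.8 (+5.7) | 42.7 (+5.9) |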
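The parenthesized values in the results are absolute differences from the same model's Direct Answer score. As a small illustration of that arithmetic, the helper below reproduces two "All" entries from the results above; `format_with_delta` is a hypothetical name used only for this example.

```python
def format_with_delta(score: float, direct: float) -> str:
    """Format a score with its absolute change over the Direct Answer score."""
    # Round to one decimal; adding 0.0 turns a possible -0.0 into 0.0.
    delta = round(score - direct, 1) + 0.0
    return f"{score:.1f} ({delta:+.1f})"


# Astra (Qwen3-VL-8B-Instruct), MMSI-Bench All: 38.8 vs. 29.8 Direct Answer.
astra_mmsi_all = format_with_delta(38.8, 29.8)

# Qwen3-VL-8B-Instruct, forced tool use, MindCube-Tiny All: 27.6 vs. 36.8.
forced_mindcube_all = format_with_delta(27.6, 36.8)
```

Here `astra_mmsi_all` evaluates to `38.8 (+9.0)` and `forced_mindcube_all` to `27.6 (-9.2)`, matching the reported entries.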

📊 Workflow Mode Ablation

The same trained Astra policy is evaluated under no-tool/direct-answer, forced-tool, and agentic-tool modes. Agentic tool use preserves the gains on camera-centric relations while avoiding unnecessary simulator calls when the original context is more reliable.

[Figure: Workflow mode ablation]
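One way to read the three evaluation modes is as a gate on the policy's own tool request. The sketch below is an assumption about how such a switch could look, not the repository's evaluation code; `Mode` and `use_simulator` are hypothetical names.

```python
from enum import Enum


class Mode(Enum):
    # Hypothetical labels for the three evaluation modes.
    DIRECT = "direct-answer"
    FORCED = "forced-tool"
    AGENTIC = "agentic-tool"


def use_simulator(mode: Mode, policy_requests_tool: bool) -> bool:
    """Decide whether the world simulator is queried for one question."""
    if mode is Mode.DIRECT:
        return False  # never imagine, answer from the observed views only
    if mode is Mode.FORCED:
        return True   # always imagine, regardless of the policy's request
    return policy_requests_tool  # agentic: defer to the policy's decision
```

Only the agentic mode leaves the decision to the policy, which is what allows it to skip generated views when the original context already suffices.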

🚀 Release Progress

  • Astra-VL and Astra-WM evaluation scripts
  • Astra-WM checkpoints
  • Astra-VL checkpoints
  • Astra training code

📜 Citation

If you find this project useful, please cite:

@misc{zhu2026thinkingimaginationagenticvisual,
      title={Thinking with Imagination: Agentic Visual Spatial Reasoning with World Simulators}, 
      author={Chenming Zhu and Jingli Lin and Yilin Long and Peizhou Cao and Tai Wang and Jiangmiao Pang and Xihui Liu},
      year={2026},
      eprint={2606.06476},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2606.06476}, 
}

🤝 Contact

If you have any questions, please contact chaimzhu@connect.hku.hk.

💡 Acknowledgement

We sincerely thank the following projects for their valuable codebases and benchmarks: Verl, vllm-omni, SenseNova-MARS, MMSI-Bench, MindCube.
