Yume 1.5: Text-Controlled Interactive World Generation

Abstract

Recent approaches have demonstrated the promise of using diffusion models to generate interactive and explorable worlds. However, most of these methods face critical challenges: excessively large parameter counts, reliance on many inference steps, and rapidly growing historical context, all of which severely limit real-time performance. Most also lack text-controlled generation capabilities.

To address these challenges, we propose Yume 1.5, a novel framework that generates realistic, interactive, and continuous worlds from a single image or text prompt and supports keyboard-based exploration of the generated worlds. The framework comprises three core components: (1) a long-video generation framework integrating unified context compression with linear attention; (2) a real-time streaming acceleration strategy powered by bidirectional attention distillation and an enhanced text embedding scheme; (3) a text-controlled method for generating world events.

Overview

Yume 1.5 generates interactive, long video worlds from single images or text prompts in an autoregressive manner. Through systematic optimization, it achieves intuitive camera control via keyboard inputs (WASD) while significantly enhancing visual quality and continuity.
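The keyboard-driven, autoregressive loop described above can be pictured with a small sketch. Everything here is an illustrative assumption rather than the released implementation: the `KEY_TO_MOTION` table, the `explore` function, and the `stub_generate_chunk` placeholder (which stands in for the video model) are hypothetical names, and the initial image or text prompt is omitted for brevity.

```python
from typing import Callable, List

# Hypothetical key-to-motion table; the real action vocabulary may differ.
KEY_TO_MOTION = {
    "W": "camera moves forward",
    "A": "camera moves left",
    "S": "camera moves backward",
    "D": "camera moves right",
}

Chunk = List[str]  # stand-in for a chunk of video frames


def stub_generate_chunk(history: List[Chunk], motion: str) -> Chunk:
    """Placeholder for the video model: returns two labelled fake frames."""
    index = len(history)
    return [f"chunk{index}-frame{i} ({motion})" for i in range(2)]


def explore(keys: str, generate: Callable[[List[Chunk], str], Chunk]) -> List[Chunk]:
    """Generate one chunk per key press, each conditioned on all earlier chunks."""
    history: List[Chunk] = []
    for key in keys.upper():
        motion = KEY_TO_MOTION.get(key)
        if motion is None:
            raise ValueError(f"unsupported key: {key!r}")
        history.append(generate(history, motion))
    return history


world = explore("wwd", stub_generate_chunk)
print(len(world))    # 3
print(world[2][0])   # chunk2-frame0 (camera moves right)
```

The point of the sketch is only the control flow: each new chunk sees the history produced so far plus the latest key press, which is why the size of that history matters for long sessions.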

Long Context

Joint Temporal-Spatial-Channel Modeling (TSCM) enables long-video generation. It uses unified context compression with linear attention to maintain quality without exploding memory costs as the video grows.
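To illustrate why linear attention keeps memory bounded as the video grows, here is a minimal pure-Python sketch of generic kernelized linear attention with the common elu(x) + 1 feature map. It is not the Yume 1.5 code; the class name `LinearAttentionState` and the choice of feature map are assumptions made for illustration.

```python
import math
from typing import List

Vector = List[float]


def feature_map(x: Vector) -> Vector:
    """elu(x) + 1, a positive feature map often used for linear attention."""
    return [v + 1.0 if v > 0 else math.exp(v) for v in x]


class LinearAttentionState:
    """Running sums whose size depends on feature dims, not on sequence length."""

    def __init__(self, dim_k: int, dim_v: int) -> None:
        self.kv = [[0.0] * dim_v for _ in range(dim_k)]  # sum of phi(k) v^T
        self.k_sum = [0.0] * dim_k                        # sum of phi(k)

    def update(self, key: Vector, value: Vector) -> None:
        """Fold one (key, value) token into the running sums."""
        phi_k = feature_map(key)
        for i, k_i in enumerate(phi_k):
            self.k_sum[i] += k_i
            for j, v_j in enumerate(value):
                self.kv[i][j] += k_i * v_j

    def attend(self, query: Vector) -> Vector:
        """Return the normalized, attention-weighted average of all values seen."""
        phi_q = feature_map(query)
        norm = sum(q_i * s_i for q_i, s_i in zip(phi_q, self.k_sum))
        if norm == 0.0:
            raise ValueError("attend() called before any update()")
        dim_v = len(self.kv[0])
        return [
            sum(phi_q[i] * self.kv[i][j] for i in range(len(phi_q))) / norm
            for j in range(dim_v)
        ]


state = LinearAttentionState(dim_k=2, dim_v=2)
for t in range(1000):
    state.update([math.sin(t), math.cos(t)], [float(t), 1.0])
out = state.attend([0.5, -0.5])
print(len(state.kv), len(state.kv[0]))  # 2 2
```

After 1000 tokens the state is still a 2x2 matrix plus a length-2 vector, whereas softmax attention would need to keep all 1000 keys and values.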

Real-time Acceleration

A streaming acceleration strategy powered by bidirectional attention distillation and Self-Forcing allows for fast inference, reducing error accumulation in long sequences.

Text & Event Control

Unlike prior models, Yume 1.5 supports text-controlled event generation. It decomposes captions into Event and Action descriptions to allow precise control over dynamic world events.
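One way to picture the Event/Action split is as a caption with two independent fields, so the action can change with each key press while the event text stays fixed. The sketch below is an assumption for illustration only: the `Caption` class, its field names, and the `Event: ... Action: ...` prompt layout are hypothetical and are not taken from the Yume 1.5 code.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class Caption:
    event: str   # what happens in the world
    action: str  # how the camera or agent moves

    def to_prompt(self) -> str:
        """Join both parts into one prompt string (hypothetical layout)."""
        return f"Event: {self.event} Action: {self.action}"


base = Caption(event="A sudden rain shower starts.", action="The camera moves forward.")
# Swap only the action; the event description is reused unchanged.
turned = replace(base, action="The camera turns left.")
print(turned.to_prompt())
# Event: A sudden rain shower starts. Action: The camera turns left.
```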

Methodology

Figure 3. Yume 1.5 architecture. Core components include the DiT block with linear attention, adaptive history token downsampling, and chunk-based autoregressive inference.
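As a rough illustration of history token downsampling, the sketch below pools older chunks more aggressively than recent ones, so the context grows much more slowly than the raw video. The schedule (stride doubling with age, capped at `max_stride`) and the function names are assumptions chosen for clarity; the actual compression rule used by Yume 1.5 may differ.

```python
from typing import List


def average_pool(tokens: List[float], stride: int) -> List[float]:
    """Average non-overlapping windows of `stride` tokens (last window may be short)."""
    pooled = []
    for start in range(0, len(tokens), stride):
        window = tokens[start:start + stride]
        pooled.append(sum(window) / len(window))
    return pooled


def compress_history(chunks: List[List[float]], max_stride: int = 8) -> List[float]:
    """Pool older chunks harder; keep the newest chunk at full resolution."""
    context: List[float] = []
    newest = len(chunks) - 1
    for index, chunk in enumerate(chunks):
        age = newest - index
        stride = min(2 ** age, max_stride)  # hypothetical schedule: 1, 2, 4, 8, 8, ...
        context.extend(average_pool(chunk, stride))
    return context


# Five chunks of 16 scalar "tokens" each (80 tokens before compression).
chunks = [[float(i)] * 16 for i in range(5)]
context = compress_history(chunks)
print(len(context))  # 32
```

With this schedule every chunk older than three steps contributes only two tokens, so the context still grows with video length, but far more slowly than the uncompressed history.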

BibTeX

@article{mao2025yume,
  title={Yume: An Interactive World Generation Model},
  author={Mao, Xiaofeng and Lin, Shaoheng and Li, Zhen and Li, Chuanhao and Peng, Wenshuo and He, Tong and Pang, Jiangmiao and Chi, Mingmin and Qiao, Yu and Zhang, Kaipeng},
  journal={arXiv preprint arXiv:2507.17744},
  year={2025}
}

@misc{mao2025yume15textcontrolledinteractiveworld,
  title={Yume-1.5: A Text-Controlled Interactive World Generation Model}, 
  author={Xiaofeng Mao and Zhen Li and Chuanhao Li and Xiaojie Xu and Kaining Ying and Tong He and Jiangmiao Pang and Yu Qiao and Kaipeng Zhang},
  year={2025},
  eprint={2512.22096},
  archivePrefix={arXiv},
  primaryClass={cs.CV},
  url={https://arxiv.org/abs/2512.22096}
}
