This is the official repository for villa-X: Enhancing Latent Action Modeling in Vision-Language-Action Models.
- We improve latent action learning by introducing an extra proprio FDM, which aligns latent tokens with the underlying robot states and actions and grounds them in physical dynamics (an illustrative sketch of this idea follows the list).
- We propose to jointly learn a latent action expert and a robot action expert through joint diffusion in the policy model, conditioning robot action prediction on latent actions to fully exploit their potential.
- Our method demonstrates superior performance in simulated environments as well as on real-world robotic tasks. The latent action expert can effectively plan into the future with both visual and proprio state planning.
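The proprio FDM is described only at a high level above; the snippet below is a minimal, hypothetical sketch of the idea, assuming a flat proprioceptive state vector and a small chunk of latent action tokens. All module names, dimensions, and the residual MSE training signal here are illustrative assumptions, not the released implementation.

```python
# Hypothetical sketch of a proprioceptive forward dynamics model (proprio FDM):
# given the current proprio state and a chunk of latent action tokens, predict the
# next proprio state, so the latent tokens must carry information about the
# underlying robot dynamics rather than only pixel changes.
import torch
import torch.nn as nn

class ProprioFDM(nn.Module):
    def __init__(self, proprio_dim: int = 8, latent_dim: int = 128, n_tokens: int = 4, hidden: int = 256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(proprio_dim + n_tokens * latent_dim, hidden),
            nn.GELU(),
            nn.Linear(hidden, proprio_dim),
        )

    def forward(self, proprio_t: torch.Tensor, latent_tokens: torch.Tensor) -> torch.Tensor:
        # proprio_t: (B, proprio_dim); latent_tokens: (B, n_tokens, latent_dim)
        x = torch.cat([proprio_t, latent_tokens.flatten(1)], dim=-1)
        return proprio_t + self.net(x)  # predict the residual to the next proprio state

fdm = ProprioFDM()
proprio_t = torch.randn(2, 8)
latent_tokens = torch.randn(2, 4, 128)
pred_next = fdm(proprio_t, latent_tokens)
loss = nn.functional.mse_loss(pred_next, torch.randn(2, 8))  # placeholder regression target
```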
- 2025/08/01: Initial release of the paper, project website, pre-trained Latent Action Model (LAM), and LAM inference code.
- Latent Action Model
- Actor Model
- Clone the repository.

```bash
git clone https://github.com/microsoft/villa-x.git
cd villa-x
```

- Install the required packages.

```bash
sudo apt-get install -y build-essential zlib1g-dev libffi-dev libssl-dev libbz2-dev libreadline-dev libsqlite3-dev liblzma-dev libncurses-dev tk-dev python3-dev ffmpeg
curl -LsSf https://astral.sh/uv/install.sh | sh  # Skip this step if you already have uv installed
uv sync
```
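The LAM examples below move tensors to the GPU with `.cuda()`, so it is worth confirming that the synced environment can see one. This is a plain PyTorch sanity check, nothing villa-X specific; run it inside the environment, e.g. with `uv run python`:

```python
import torch

# Confirm the environment's PyTorch build and GPU visibility before running the examples.
print("torch:", torch.__version__)
print("CUDA available:", torch.cuda.is_available())
if torch.cuda.is_available():
    print("device:", torch.cuda.get_device_name(0))
```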
- Download the pre-trained models from Hugging Face.
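One possible way to fetch the weights is with `huggingface_hub`. This is only a sketch, assuming the LAM is hosted under the `microsoft/villa-x` repository listed in the model table below; adjust `repo_id` and the local path to the actual layout on the Hub.

```python
from huggingface_hub import snapshot_download

# Download the LAM weights into a local directory; the examples below refer to this
# path as LOCAL_MODEL_DIRECTORY. The repo id is taken from the model table and may
# need adjusting if the weights live in a subfolder or a differently named repo.
local_dir = snapshot_download(repo_id="microsoft/villa-x", local_dir="./checkpoints/villa-x")
print("Model downloaded to:", local_dir)
```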
- Load the latent action model.

```python
from lam import IgorModel

lam = IgorModel.from_pretrained("LOCAL_MODEL_DIRECTORY").cuda()
```
- Extract the latent actions from a video.

```python
def read_video(fp: str):
    from torchvision.io import read_video
    video, *_ = read_video(fp, pts_unit="sec")
    return video

video = read_video("path/to/video.mp4").cuda()  # Load your video here
latent_action = lam.idm(video)
```
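The exact structure of the returned latent actions is easiest to check interactively. Continuing from the snippet above, a purely exploratory inspection could look like this (no assumptions about what the shapes mean):

```python
# Purely exploratory: the indexing mirrors how latent_action is used in the
# reconstruction step below (latent_action[0][i]); exact shapes depend on the model.
print(type(latent_action))
first_clip = latent_action[0]
print("latent-action steps for the first clip:", len(first_clip))
print("first latent action:", getattr(first_clip[0], "shape", type(first_clip[0])))
```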
- Use the image FDM to reconstruct future frames from the latent actions.

```python
frames = []
for i in range(len(latent_action[0])):
    pred = lam.apply_latent_action(video[i], latent_action[0][i])
    frames.append(pred)
```

We also provide a Jupyter notebook with a step-by-step guide on how to use the pre-trained latent action model.
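Outside the notebook, if you want to view the reconstruction, the predicted frames can be stitched back into a clip. This is a sketch that assumes each `pred` is an H×W×3 image tensor; convert the dtype and layout as needed for your setup.

```python
import torch
from torchvision.io import write_video

# Stack the predictions into a (T, H, W, C) clip; adjust the dtype/range conversion
# if apply_latent_action returns float images or a channel-first layout.
clip = torch.stack([f.detach().cpu() for f in frames])
if clip.dtype != torch.uint8:
    clip = clip.clamp(0, 255).to(torch.uint8)
write_video("reconstruction.mp4", clip, fps=10)
```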
| Model ID | Description | Params | Link |
|---|---|---|---|
| `microsoft/villa-x/lam` | Latent action model | 955M | 🤗 Link |
If you find villa-X useful, please consider citing:

```bibtex
@article{chen2025villa0x0,
  title   = {villa-X: Enhancing Latent Action Modeling in Vision-Language-Action Models},
  author  = {Xiaoyu Chen and Hangxing Wei and Pushi Zhang and Chuheng Zhang and Kaixin Wang and Yanjiang Guo and Rushuai Yang and Yucen Wang and Xinquan Xiao and Li Zhao and Jianyu Chen and Jiang Bian},
  year    = {2025},
  journal = {arXiv preprint arXiv:2507.23682}
}
```

We are grateful for open-source projects such as Open Sora, taming-transformers, open-pi-zero, MAE, and timm. Their contributions have been invaluable in the development of villa-X.
This project welcomes contributions and suggestions. Most contributions require you to agree to a Contributor License Agreement (CLA) declaring that you have the right to, and actually do, grant us the rights to use your contribution. For details, visit https://cla.microsoft.com.
When you submit a pull request, a CLA-bot will automatically determine whether you need to provide a CLA and decorate the PR appropriately (e.g., label, comment). Simply follow the instructions provided by the bot. You will only need to do this once across all repositories using our CLA.
This project has adopted the Microsoft Open Source Code of Conduct. For more information see the Code of Conduct FAQ or contact opencode@microsoft.com with any additional questions or comments.
This project may contain trademarks or logos for projects, products, or services. Authorized use of Microsoft trademarks or logos is subject to and must follow Microsoft's Trademark & Brand Guidelines. Use of Microsoft trademarks or logos in modified versions of this project must not cause confusion or imply Microsoft sponsorship. Any use of third-party trademarks or logos is subject to those third parties' policies.
This code does not collect, store, or transmit any user data. For details on how Microsoft handles privacy more broadly, please refer to Microsoft Privacy Statement.

