SfM and VLM performance on CameraBench: Generative VLMs (evaluated with VQAScore) trail classical SfM/SLAM methods in pure geometry, yet they outperform discriminative VLMs that rely on CLIPScore/ITMScore and, better still, capture scene-aware semantic cues that SfM misses. After simple supervised fine-tuning (SFT) on ≈1,400 extra annotated clips, our 7B Qwen2.5-VL doubles its AP and outperforms the current best method, MegaSAM!
- [2025/09/18]🔥 CameraBench has been accepted as a Spotlight @ NeurIPS 2025.
- [2025/09/15]🔥 We released the codebase for both training and evaluation.
- [2025/05/20]🔥 We open-sourced our fine-tuned 32B and 72B models.
- [2025/04/28]🔥 CameraBench received over 150 likes on Hugging Face and ranked 1st among both the daily and weekly papers.
- [2025/04/26]🔥 We open-sourced our fine-tuned 7B model and the public test set (1,000+ videos with expert labels & captions). Stay tuned for stronger models in the future!
- 🤗CameraBench Testset: Download the testset.
- 🚀Fine-tuned Models (7B param, 32B param, 72B param): Access model checkpoints on HuggingFace!
- 🏠Home Page: Demos & docs.
- 📖Paper: Detailed information about CameraBench.
- 📈Leaderboard: Explore the full leaderboard.
- Evaluation Code:
  Use our official codebase for camera motion classification, VQA, and captioning tasks:
  🔗CameraBench Evaluation Code
- Training Dataset Access:
  To request access to the training dataset, please complete this form with all relevant details. Providing thorough information will help us process your request more efficiently and reduce unnecessary back-and-forth by email:
  👉 Dataset Request Form
python download_test_videos.py --save_dir ./your_target_folder
python download_test_file.py --save_dir ./your_target_folder

We have released a preview version of our fine-tuned Qwen2.5-VL-7B model (which achieves SOTA performance on CameraBench!) on HuggingFace (7B param, 32B param, 72B param). The model is specialized for camera motion primitive classification and video-text retrieval with camera-motion captions. Its usage is identical to that of a standard Qwen2.5-VL model. A quick demo is shown below:
Generative Scoring (for classification and retrieval):
We provide two ways of using our model for this application. The first, and recommended, is the t2v_metrics approach; the second is a backup that directly uses Qwen2.5-VL's inference code.
t2v_metrics Approach
# Install the package using: pip install git+https://github.com/chancharikmitra/t2v_metrics.git
import t2v_metrics
### For a single (video, text) pair:
qwen_score = t2v_metrics.VQAScore(model='qwen2.5-vl-7b', checkpoint='chancharikm/qwen2.5-vl-7b-cam-motion')
video = "videos/baby.mp4" # a video path in string format
text = "a baby crying"
# Calculate probability of "Yes" response
score = qwen_score(images=[video], texts=[text])

For more details, please refer to the t2v_metrics fork.
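For retrieval, the same score object can rank several candidate captions for one video. The sketch below is a usage example rather than part of the official demo; it assumes (as with other t2v_metrics scores) that passing M videos and N texts returns an M×N score tensor, and the candidate captions are only illustrative.
# Minimal retrieval sketch (assumption: 1 video x N texts -> 1 x N score tensor)
candidate_texts = [
    "the camera pans to the left",
    "the camera tilts upward",
    "the camera zooms in",
    "the camera is static",
]
scores = qwen_score(images=[video], texts=candidate_texts)
best_idx = scores[0].argmax().item()
print(f"Best-matching caption: '{candidate_texts[best_idx]}'")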
- Qwen2.5-VL Inference Code Approach
# Import necessary libraries
from transformers import Qwen2_5_VLForConditionalGeneration, AutoProcessor
from qwen_vl_utils import process_vision_info
import torch
# Load the model
model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
"chancharikm/qwen2.5-vl-7b-cam-motion", torch_dtype="auto", device_map="auto"
)
processor = AutoProcessor.from_pretrained("Qwen/Qwen2.5-VL-7B-Instruct")
# Prepare input data
video_path = "file:///path/to/video1.mp4"
text_description = "the camera tilting upward"
question = f"Does this video show \"{text_description}\"?"
# Format the input for the model
messages = [
{
"role": "user",
"content": [
{
"type": "video",
"video": video_path,
"fps": 8.0, # Recommended FPS for optimal inference
},
{"type": "text", "text": question},
],
}
]
text = processor.apply_chat_template(
messages, tokenize=False, add_generation_prompt=True
)
image_inputs, video_inputs, video_kwargs = process_vision_info(messages, return_video_kwargs=True)
inputs = processor(
text=[text],
images=image_inputs,
videos=video_inputs,
padding=True,
return_tensors="pt",
**video_kwargs
)
inputs = inputs.to("cuda")
# Generate with score output
with torch.inference_mode():
outputs = model.generate(
**inputs,
max_new_tokens=1,
do_sample=False, # Use greedy decoding to get reliable logprobs
output_scores=True,
return_dict_in_generate=True
)
# Calculate probability of "Yes" response
scores = outputs.scores[0]
probs = torch.nn.functional.softmax(scores, dim=-1)
yes_token_id = processor.tokenizer.encode("Yes")[0]
score = probs[0, yes_token_id].item()
print(f"Video: {video_path}")
print(f"Description: '{text_description}'")
print(f"Score: {score:.4f}")
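If you need a hard classification over a set of motion primitives rather than a single score, one option is to score each candidate description and keep the highest. A minimal sketch, assuming the scoring steps above are wrapped in a hypothetical helper score_video(video_path, text):
# Minimal sketch (hypothetical helper score_video wraps the scoring code above)
candidate_motions = [
    "the camera tilts upward",
    "the camera pans to the right",
    "the camera moves forward",
    "the camera is static",
]
motion_scores = {text: score_video(video_path, text) for text in candidate_motions}
best = max(motion_scores, key=motion_scores.get)
print(f"Best match: '{best}' (score {motion_scores[best]:.4f})")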
Natural Language Generation
We provide two ways of using our model for this application. The first, and recommended, is the t2v_metrics approach; the second is a backup that directly uses Qwen2.5-VL's inference code.
t2v_metrics Approach
# Install the package using: pip install git+https://github.com/chancharikmitra/t2v_metrics.git
import t2v_metrics
### For a single (video, text) pair:
qwen_score = t2v_metrics.VQAScore(model='qwen2.5-vl-7b', checkpoint='chancharikm/qwen2.5-vl-7b-cam-motion')
video = "videos/baby.mp4" # a video path in string format
text = "Please describe this image: "
# Generate a natural-language description of the video
generated_text = qwen_score.model.generate(images=[video], texts=[text])

For more details, please refer to the t2v_metrics fork.
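To caption several clips, the same call can simply be repeated per video. A minimal sketch with hypothetical file paths:
# Minimal sketch (hypothetical video paths): caption clips one at a time
videos = ["videos/clip1.mp4", "videos/clip2.mp4"]
captions = {v: qwen_score.model.generate(images=[v], texts=[text]) for v in videos}
for v, caption in captions.items():
    print(v, "->", caption)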
- Qwen2.5-VL Inference Code Approach
# The model is trained on 8.0 FPS which we recommend for optimal inference
from transformers import Qwen2_5_VLForConditionalGeneration, AutoProcessor
from qwen_vl_utils import process_vision_info
# default: Load the model on the available device(s)
model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
"chancharikm/qwen2.5-vl-7b-cam-motion", torch_dtype="auto", device_map="auto"
)
# We recommend enabling flash_attention_2 for better acceleration and memory saving, especially in multi-image and video scenarios.
# model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
# "chancharikm/qwen2.5-vl-7b-cam-motion",
# torch_dtype=torch.bfloat16,
# attn_implementation="flash_attention_2",
# device_map="auto",
# )
# default processor
processor = AutoProcessor.from_pretrained("Qwen/Qwen2.5-VL-7B-Instruct")
messages = [
{
"role": "user",
"content": [
{
"type": "video",
"video": "file:///path/to/video1.mp4",
"fps": 8.0,
},
{"type": "text", "text": "Describe the camera motion in this video."},
],
}
]
text = processor.apply_chat_template(
messages, tokenize=False, add_generation_prompt=True
)
image_inputs, video_inputs, video_kwargs = process_vision_info(messages, return_video_kwargs=True)
inputs = processor(
text=[text],
images=image_inputs,
videos=video_inputs,
padding=True,
return_tensors="pt",
**video_kwargs,
)
inputs = inputs.to("cuda")
# Inference
generated_ids = model.generate(**inputs, max_new_tokens=128)
generated_ids_trimmed = [
out_ids[len(in_ids) :] for in_ids, out_ids in zip(inputs.input_ids, generated_ids)
]
output_text = processor.batch_decode(
generated_ids_trimmed, skip_special_tokens=True, clean_up_tokenization_spaces=False
)
print(output_text)

If you find this repository useful for your research, please use the following citation:
@article{lin2025camerabench,
title={Towards Understanding Camera Motions in Any Video},
author={Lin, Zhiqiu and Cen, Siyuan and Jiang, Daniel and Karhade, Jay and Wang, Hewei and Mitra, Chancharik and Ling, Tiffany and Huang, Yuhan and Liu, Sifan and Chen, Mingyu and Zawar, Rushikesh and Bai, Xue and Du, Yilun and Gan, Chuang and Ramanan, Deva},
journal={arXiv preprint arXiv:2504.15376},
year={2025},
}