VLM-Powered Agent for Intelligent Tool Orchestration | Open-source Framework for Multi-modal AI
Dataset • Webpage • Paper
- Precision Tools - 10+ integrated tools (SAM 2.0, Web Search, OCR, etc.)
- Multi-modal Memory - Context-aware task execution
- Self-Correction - Auto-verification mechanism for reliable outputs
- Real-World Ready - Google Search API integration, etc.
| Model | GAIA-L1 | GAIA-L2 | GAIA-L3 | GTA |
|---|---|---|---|---|
| Baseline | 13.21% | 5.81% | 0.00% | 33.97% |
| MAT | 26.42% | 11.63% | 3.84% | 52.56% |
conda create -n tongagent python=3.10
conda activate tongagent
pip install -r requirements.txt

If you want to generate data yourself, install the additional requirements:
pip install -r requirements_generation.txt

You can use git lfs or huggingface-cli to download the dataset we used in the paper from the HF dataset. The images used for training are zipped in a file called files.zip.
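If you prefer a scripted download, here is a minimal sketch using huggingface_hub; the dataset repo id and target folder are placeholders, so substitute the HF dataset linked above.

```python
# Hypothetical sketch of downloading the training data with huggingface_hub
# instead of git lfs. DATASET_REPO is a placeholder for the HF dataset linked above.
from huggingface_hub import snapshot_download

DATASET_REPO = "<org>/<dataset-name>"  # placeholder, not the real repo id

local_path = snapshot_download(
    repo_id=DATASET_REPO,
    repo_type="dataset",
    local_dir="data/mat_dataset",  # assumed target folder, adjust as needed
)
print("Dataset downloaded to", local_path)
# Remember that the training images ship inside files.zip and must be unzipped afterwards.
```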
The image captions and caption embeddings can be downloaded via the following link: Google Drive.
Please follow ShareGPT4V to organize the image source as follows:
├── ...
├── image_source
│   ├── llava
│   │   ├── llava_pretrain
│   │   │   ├── images
│   ├── coco
│   │   ├── train2017
│   ├── sam
│   │   ├── images
│   ├── web-celebrity
│   │   ├── images
│   ├── web-landmark
│   │   ├── images
│   ├── wikiart
│   ├── share_textvqa
│   │   ├── images
│   ├── chatqa
│   │   ├── train
│   │   │   ├── png
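As an optional sanity check, here is a small sketch (paths taken directly from the tree above, with an assumed image_source root) that reports any missing folders:

```python
# Sanity check: verify the image_source layout described above.
# Paths come from the directory tree; adjust ROOT if you store the data elsewhere.
from pathlib import Path

ROOT = Path("image_source")  # assumed root folder
EXPECTED = [
    "llava/llava_pretrain/images",
    "coco/train2017",
    "sam/images",
    "web-celebrity/images",
    "web-landmark/images",
    "wikiart",
    "share_textvqa/images",
    "chatqa/train/png",
]

for rel in EXPECTED:
    path = ROOT / rel
    status = "ok" if path.is_dir() else "MISSING"
    print(f"{status:8s} {path}")
```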
You only need to download SAM 2 manually; for the other models, the transformers library will download them for you.
Put the model_checkpoints folder in the root of the repo so that you have something like:
main.py
model_checkpoints/sam2_checkpoints
model_checkpoints/sam2_configs
You can download the model checkpoints and configs using the scripts from the official repo.
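For reference, the checkpoint/config pair is consumed roughly as in the official SAM 2 README; the file and config names below are assumptions based on the official download script, so adjust them to whatever lands in model_checkpoints/.

```python
# Hedged sketch of loading SAM 2 from the downloaded files, following the official
# SAM 2 README. File and config names are assumptions; match them to your downloads,
# and note the config may need to point into model_checkpoints/sam2_configs instead.
from sam2.build_sam import build_sam2
from sam2.sam2_image_predictor import SAM2ImagePredictor

checkpoint = "model_checkpoints/sam2_checkpoints/sam2_hiera_large.pt"  # assumed file name
model_cfg = "sam2_hiera_l.yaml"  # assumed config name, resolved by the sam2 package

predictor = SAM2ImagePredictor(build_sam2(model_cfg, checkpoint))
```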
This project uses Google Custom Search to search the web. You need to set the cx and key in configs/agent_config.yaml, under the search_engine section.
search_engine:
  -
    cx: # enter your cx here
    key: # enter your key here

To obtain the key, check the official API documentation here. The free tier is rate-limited to 100 queries per day; paid users get 10,000 queries per day.
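For reference, the cx/key pair can be sanity-checked against the Custom Search JSON API directly; the sketch below is only an illustration using the requests package, not the agent's own search tool.

```python
# Minimal standalone query against the Custom Search JSON API using the same cx/key
# values you put in configs/agent_config.yaml. Illustration only.
import requests

CX = "your-cx"    # search engine id from the config
KEY = "your-key"  # API key from the config

resp = requests.get(
    "https://www.googleapis.com/customsearch/v1",
    params={"key": KEY, "cx": CX, "q": "multi-modal agent tuning"},
    timeout=10,
)
resp.raise_for_status()
for item in resp.json().get("items", []):
    print(item["title"], "-", item["link"])
```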
First, you need to set the API key and endpoint in configs/agent_config.yaml. The config file looks like this:
tonggpt:
  model_name: gpt-4o-2024-08-06
  region: eastus
  api_key: # enter your api key here
  open_ai_client_type: openai # or azure
  endpoint: # only for azure, you need to specify the endpoint you are using

agent_controller:
  engine_type: tonggpt # use minicpm, qwen if you want to use other models

We use GPT on Azure and provide a simple alternative for using the original OpenAI client.
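For reference, open_ai_client_type roughly maps onto the two clients in the openai Python SDK as sketched below; this is an illustration rather than the repo's exact wiring, and the endpoint and API version shown are examples only.

```python
# Rough illustration of how open_ai_client_type could select between the two clients.
# The repo's own wiring may differ; endpoint and api_version below are examples only.
from openai import OpenAI, AzureOpenAI

def build_client(client_type: str, api_key: str, endpoint: str | None = None):
    if client_type == "openai":
        return OpenAI(api_key=api_key)
    if client_type == "azure":
        return AzureOpenAI(
            api_key=api_key,
            azure_endpoint=endpoint,            # e.g. https://<resource>.openai.azure.com
            api_version="2024-08-01-preview",   # example version, check your deployment
        )
    raise ValueError(f"unknown client type: {client_type}")

client = build_client("openai", api_key="sk-...")
reply = client.chat.completions.create(
    model="gpt-4o-2024-08-06",
    messages=[{"role": "user", "content": "Hello"}],
)
print(reply.choices[0].message.content)
```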
You can download the GTA dataset from GTA Link, and revise the dataset path data/gta_dataset/dataset.json in examples/gta/main.py if you put it somewhere else.
You can download the GAIA dataset from GAIA Link. Alternatively, running the evaluation script will automatically download the dataset from HF.
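If you want to pull GAIA manually, a sketch with the datasets library is below; the repo id and config name are assumptions inferred from the --data-name/--split flags used later in this README, and the gated dataset requires logging in first.

```python
# Hedged sketch: manual GAIA download via the datasets library. The evaluation
# script already does this for you. Repo id and config name are assumptions.
from datasets import load_dataset

# GAIA is gated on the Hub, so authenticate with `huggingface-cli login` first.
# Newer datasets versions may also require trust_remote_code=True.
gaia_l1 = load_dataset("gaia-benchmark/GAIA", "2023_level1", split="validation")
print(len(gaia_l1), "level-1 validation examples")
print(gaia_l1.column_names)
```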
Run from the command line with an arbitrary prompt:
python main.py --prompt 'Can you edit the image to turn him into cyborg? Image path: tests/data/draw.jpg.'

To run on the GAIA set:
python examples/gaia/main.py

To run on the GTA set:
python examples/gta/main.py

Refer to the official OpenBMB/MiniCPM-V repo for environment setup. Since Qwen-VL may require different dependency versions than MiniCPM-V, consider using a separate conda environment.
To train the model, enter the directory and run the script:
cd experiments/CPM-FT
# for training a model for GAIA dataset
bash slurm_jobs/job_lora_5_gaia_1206.sh
# for training a model for GTA dataset
bash slurm_jobs/job_lora_5_gta_with_verifier.sh

Check these scripts to set the data path. Training takes about 4 hours per epoch on 8x A100 GPUs for a 50K-sample dataset.
Here is the simplified version.
Choose one of the following methods to train the model.
Follow the Qwen-VL official guide to set up the environment. Then, convert the dataset to Qwen-VL format:
cd experiments/Qwen-VL
python scripts/convert_dataset_v2.py
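For orientation, the converter targets the Qwen-VL conversation schema, which looks roughly like the sketch below; field names follow the Qwen-VL fine-tuning docs, while the example contents are invented.

```python
# Rough sketch of the Qwen-VL conversation format that convert_dataset_v2.py targets.
# Field names follow the Qwen-VL fine-tuning docs; the example contents are invented.
import json

sample = [
    {
        "id": "identity_0",
        "conversations": [
            {
                "from": "user",
                "value": "Picture 1: <img>tests/data/draw.jpg</img>\nCan you edit the image to turn him into a cyborg?",
            },
            {
                "from": "assistant",
                "value": "Thought: I should call the image editing tool.\nCode: ...",
            },
        ],
    }
]

with open("qwen_vl_sample.json", "w", encoding="utf-8") as f:
    json.dump(sample, f, ensure_ascii=False, indent=2)
```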
Run the provided scripts to train on specific datasets:
# For GAIA
bash slurm_jobs/train_gaia.sh
# For GTA
bash slurm_jobs/train_gta.sh
Alternatively, you can use LLaMA Factory for training.
Please refer to this LLaMA Factory Guide for installation and usage details.
To evaluate the model, first modify configs/agent_config.yaml to set the model path, then run:
export RUN_MODE=eval
# for GAIA dataset
python examples/gaia/main.py --engine minicpm --lora-path experiments/CPM-FT/output/cpm_v2_6_7904295_2024_12_10_23_05/ --data-name 2023_level1 --split validation
python examples/gaia/main.py --engine minicpm --lora-path experiments/CPM-FT/output/cpm_v2_6_7904295_2024_12_10_23_05/ --data-name 2023_level2 --split validation
python examples/gaia/main.py --engine minicpm --lora-path experiments/CPM-FT/output/cpm_v2_6_7904295_2024_12_10_23_05/ --data-name 2023_level3 --split validation
# for GTA dataset
python examples/gta/main.py --engine minicpm --lora-path experiments/CPM-FT/output/cpm_v2_6_7904295_2024_12_10_23_05/

cpm_v2_6_7904295_2024_12_10_23_05 is the model path. The training script automatically saves the model to that path; we use SLURM on our cluster, so the path consists of the job ID and the time of the job. Check the training script for the exact path.
Both benchmarks write their results to the .cache folder. Use eval.py to compute the metrics reported in the paper.
python examples/gaia/eval.py --data-path .cache/qa_cache/validation/minicpm/experiments/CPM-FT/output/cpm_v2_6_7904295_2024_12_10_23_05/2023_level1.db
python examples/gta/eval.py --folder .cache/gta/cpm_v2_6_7904295_2024_12_10_23_05/

To generate data, run from the command line:
bash data_generation.sh

Here are the codebases we built upon. Thanks for their brilliant contributions to the community!
Our agent is based on the wonderful Huggingface Agent framework.
Our agent design is inspired by the following works:
Model training and inference code:
If you find our work helpful, please consider citing our paper and starring us!
@inproceedings{gao2025multi,
title={Multi-modal Agent Tuning: Building a VLM-Driven Agent for Efficient Tool Usage},
author={Gao, Zhi and Zhang, Bofei and Li, Pengxiang and Ma, Xiaojian and Yuan, Tao and Fan, Yue and Wu, Yuwei and Jia, Yunde and Zhu, Song-Chun and Li, Qing},
booktitle={The Thirteenth International Conference on Learning Representations (ICLR)},
year=2025
}