This is an official implementation of Visual Query Tuning: Towards Effective Usage of Intermediate Representations for Parameter and Memory Efficient Transfer Learning.
- python3.7
- torch==1.7.1
- torchvision==0.8.2
- tensorflow==2.9.1
- tensorflow_datasets==4.4.0+nightly
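A minimal environment setup might look like the following. The package versions match the list above, but the environment name and the choice of conda + pip are our own assumptions, not requirements of this repo (note that the nightly build of tensorflow_datasets is distributed on PyPI as `tfds-nightly`):

```bash
# Hypothetical setup sketch: create a Python 3.7 environment and install
# the dependencies listed above. Adjust the CUDA variant of torch if needed.
conda create -n vqt python=3.7 -y
conda activate vqt
pip install torch==1.7.1 torchvision==0.8.2
pip install tensorflow==2.9.1
pip install tfds-nightly  # tensorflow_datasets 4.4.0+nightly
```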
We present instructions on training VQT with an ImageNet-1k pre-trained ViT-B/16.
Please set up the VTAB-1k benchmark following the instructions here. By default, our scripts will try to access the VTAB-1k datasets from the vtab_data/ folder. If you download the datasets to another place, you can modify the DATA_PATH variable in our scripts, which are placed under the scripts/ folder.
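For instance, repointing the scripts at a custom dataset location is a one-line change. The variable name `DATA_PATH` and the default `vtab_data/` come from the description above; the replacement path below is a placeholder:

```bash
# In a script under scripts/ (e.g., scripts/VQT/run_vqt_vtab.sh),
# change the default DATA_PATH=vtab_data/ to your own location:
DATA_PATH=/path/to/your/vtab-1k
```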
For pre-trained ViT-B/16 models, you can download the weights of various pre-training setups as follows:
Please place the downloaded checkpoints under the pre-trained_weights/ folder. Note that you need to rename the ImageNet-21k supervised checkpoint from ViT-B_16.npz to imagenet21k_ViT-B_16.npz.
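Concretely, the placement and renaming step could look like the sketch below (the download itself is omitted, since the URL depends on which pre-training setup you pick above):

```bash
# Create the folder the scripts expect and move the checkpoint into it,
# renaming the ImageNet-21k supervised checkpoint as required.
mkdir -p pre-trained_weights
mv ViT-B_16.npz pre-trained_weights/imagenet21k_ViT-B_16.npz
```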
Use the following command to train a VQT model on a dataset in VTAB-1k.
```
$ bash scripts/VQT/run_vqt_vtab.sh ${GPUIDX} ${DATA_NAME} ${NUM_CLASSES} ${Q_LEN} ${OPTIMIZER} ${FEATURE}
```

We describe the meaning of these arguments as follows:
- `${GPUIDX}`: The GPU used for training. For example, it can be set to `0`.
- `${DATA_NAME}`: The dataset name in VTAB-1k for training and evaluation. For example, it can be set to `vtab-caltech101`. Please see `run_demo_exp.sh` for more details about the 19 datasets in VTAB-1k.
- `${NUM_CLASSES}`: The number of classes in the dataset. For example, for `vtab-caltech101`, this should be set to `102`.
- `${Q_LEN}`: The length of the query tokens. This can simply be set to `1`.
- `${OPTIMIZER}`: The optimizer used for training. In our experiments, we set this to `adam`.
- `${FEATURE}`: The name of the pre-trained features. For example, it can be set to `sup_vitb16_imagenet1k` to indicate the ImageNet-1k supervised pre-trained model.
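Putting the example values above together, the following command trains VQT on Caltech101 with the ImageNet-1k supervised backbone:

```bash
# Train VQT on vtab-caltech101 (102 classes) on GPU 0, with query length 1,
# the adam optimizer, and ImageNet-1k supervised pre-trained features.
bash scripts/VQT/run_vqt_vtab.sh 0 vtab-caltech101 102 1 adam sup_vitb16_imagenet1k
```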
After training a VQT model, you can optionally use the following command to compress the linear classifier via feature selection.
```
$ bash scripts/VQT/run_vqt_vtab_sparsity.sh ${GPUIDX} ${DATA_NAME} ${NUM_CLASSES} ${Q_LEN} ${OPTIMIZER} ${FEATURE} ${FRACTION}
```

The first six arguments, `${GPUIDX}`, `${DATA_NAME}`, `${NUM_CLASSES}`, `${Q_LEN}`, `${OPTIMIZER}`, and `${FEATURE}`, are the same as in the training command above and should be set to match the trained VQT model you want to compress. The last argument, `${FRACTION}`, specifies the proportion of the pre-classifier features (penultimate-layer features) to keep after compression. For example, setting it to `0.7` keeps 70% of the features fed into the final linear classifier.
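Continuing the Caltech101 example, compressing its trained classifier down to 70% of the features would be:

```bash
# Keep 70% of the pre-classifier features of the VQT model trained above.
bash scripts/VQT/run_vqt_vtab_sparsity.sh 0 vtab-caltech101 102 1 adam sup_vitb16_imagenet1k 0.7
```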
For convenience, you can use the following command to run through all 19 datasets in VTAB-1k.
```
$ bash run_demo_exp.sh ${GPUIDX}
```

The `${GPUIDX}` argument specifies the GPU used for training (e.g., `0`).
After training VQT models for all 19 datasets, you can use the following command to collect the results.
```
$ python collect_demo_exp_results.py
```

This repo is modified from Visual Prompt Tuning (VPT).
If you have any questions, please contact Cheng-Hao Tu (tu.343@osu.edu).