DeRIS: Decoupling Perception and Cognition for Enhanced Referring Image Segmentation through Loopback Synergy
- 2025.07.03: The code and models are released.
- 2025.06.26: Our work has been accepted by ICCV 2025.
Referring Image Segmentation (RIS) is a challenging task that aims to segment objects in an image based on natural language expressions. While prior works have predominantly concentrated on improving vision-language interactions and achieving fine-grained localization, a comprehensive examination of the inherent limitations in existing RIS frameworks remains underexplored. To bridge this gap, we propose DeRIS, a novel framework that decomposes RIS into two key components: perception and cognition. This modular decomposition facilitates a systematic analysis of the primary bottlenecks impeding RIS performance. Our findings reveal that the predominant limitation lies not in perceptual deficiencies, but in the insufficient multi-modal cognitive capacity of current models. To mitigate this, we propose a Loopback Interaction mechanism, which enhances the synergy between the perception and cognition modules, thereby enabling precise segmentation while simultaneously improving robust image-text comprehension. Additionally, we analyze and introduce a non-referent sample conversion data augmentation strategy to address the long-tail distribution issue related to non-referent samples in general scenarios. Notably, DeRIS demonstrates inherent adaptability to both non-referent and multi-referent scenarios without requiring specialized architectural modifications, enhancing its general applicability. Extensive experiments conducted on the RefCOCO/+/g (RIS) and gRefCOCO (GRES) datasets demonstrate the superior performance of our method.
CUDA=11.8, torch=2.0.0, torchvision=0.15.1

```shell
pip install -r requirements.txt
```

Then optionally install the DeRIS package in editable mode:

```shell
pip install -e .
```

Then compile the pixel decoder CUDA operators:

```shell
cd deris/models/branchs/perception_branch/Mask2Former_Simplify/modeling/pixel_decoder/ops
sh make.sh
```

Follow the data structure of C3VG. Prepare the mscoco dataset, then download the mixed annotations here and the grefs annotations here.
The data structure should look like the following:

```
data
└── seqtr_type
    ├── annotations
    │   ├── mixed-seg
    │   │   └── instances_nogoogle_withid.json
    │   └── grefs
    │       └── instance.json
    └── images
        └── mscoco
            └── train2014
```
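As a quick sanity check, the expected layout can be verified with a short script. The helper below is a hypothetical utility, not part of the repo; it simply tests that the entries in the tree above exist under a given data root.

```python
from pathlib import Path

# Relative paths that the DeRIS data layout above expects to exist.
EXPECTED = [
    "seqtr_type/annotations/mixed-seg/instances_nogoogle_withid.json",
    "seqtr_type/annotations/grefs/instance.json",
    "seqtr_type/images/mscoco/train2014",
]

def missing_entries(root):
    """Return the expected entries that are absent under `root`."""
    root = Path(root)
    return [p for p in EXPECTED if not (root / p).exists()]

if __name__ == "__main__":
    gaps = missing_entries("data")
    if gaps:
        print("missing:", *gaps, sep="\n  ")
    else:
        print("data layout looks complete")
```

Run it from the repository root; any missing annotation file or image directory is listed.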
DeRIS utilizes the BEiT-3 model as both the backbone and the multi-modality fusion module. The pre-trained weights can be downloaded from this link. Additionally, you will need to download the tokenizer for BEiT-3.
First, create a directory for the pre-trained weights:

```shell
mkdir pretrain_weights
```

Place the BEiT-3 checkpoints and tokenizer within this directory.
The weights for Swin-S can be downloaded from this link, and the weights for Swin-B can be downloaded from this link. Please place the downloaded weights into the following directory.
```
pretrain_weights
├── beit3_base_patch16_224.zip
├── beit3_large_patch16_224.zip
├── beit3.spm
├── model_final_f6e0f6.pkl
└── model_final_1e7f22.pkl
```
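A quick way to confirm everything is in place is to report any expected file that is still missing. This small check is a sketch (not a repo script) that uses the filenames listed above:

```shell
# Report any expected pretrain weight/tokenizer file that is still missing.
for f in beit3_base_patch16_224.zip beit3_large_patch16_224.zip beit3.spm \
         model_final_f6e0f6.pkl model_final_1e7f22.pkl; do
  [ -f "pretrain_weights/$f" ] || echo "missing: pretrain_weights/$f"
done
```

No output means all five files are present.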
The final directory structure of DeRIS should resemble the following:
```
DeRIS
├── asserts
├── configs
├── data
├── pretrain_weights
├── deris
└── tools
```
A demo for DeRIS is provided here.
The following script can be used to test on the GRES task:

```shell
python tools/demo.py --img "asserts/imgs/Figure_1.jpg" --expression "a pair of people and a frisbee" --config configs/gres/DERIS-L-grefcoco.py --checkpoint /PATH/TO/DeRIS-L-grefcoco.pth
```

The following script can be used to test on the RIS task:

```shell
python tools/demo.py --img "asserts/imgs/Figure_2.jpg" --expression "a black and brown dog walking through the ocean water" --config configs/refcoco/DERIS-L-refcoco.py --checkpoint /PATH/TO/DeRIS-L-refcoco.pth
```

To load alternative pretrained weights or adjust threshold settings, please consult tools/demo.py.
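To run the demo over several image/expression pairs, the calls can be scripted. The loop below is a sketch that only prints the commands it would run (remove the `echo` to execute them); the image paths and expressions are placeholders:

```shell
CONFIG=configs/gres/DERIS-L-grefcoco.py
CHECKPOINT=/PATH/TO/DeRIS-L-grefcoco.pth

# Each line: <image path>|<expression>. Entries here are placeholders.
while IFS='|' read -r img expr; do
  echo python tools/demo.py --img "$img" --expression "$expr" \
       --config "$CONFIG" --checkpoint "$CHECKPOINT"
done <<'EOF'
asserts/imgs/Figure_1.jpg|a pair of people and a frisbee
asserts/imgs/Figure_2.jpg|a black and brown dog walking through the ocean water
EOF
```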
If you want to retrain the model, run:

```shell
bash tools/dist_train.sh [PATH_TO_CONFIG] [GPU_NUMBER]
```

If you want to reproduce a result, download the corresponding checkpoint and run:

```shell
bash tools/dist_test.sh [PATH_TO_CONFIG] [GPU_NUMBER] --load-from [PATH_TO_CHECKPOINT_FILE]
```

The results of DeRIS-B on the RefCOCO/+/g mixed dataset are as follows:
| Split | mIoU | oIoU | MaskACC @ {0.5, 0.6, 0.7, 0.8, 0.9} |
|---|---|---|---|
| val_refcoco_unc | 81.99 | 80.80 | 91.24, 89.72, 87.03, 80.05, 49.88 |
| testA_refcoco_unc | 82.97 | 82.68 | 92.88, 91.61, 89.11, 81.53, 49.49 |
| testB_refcoco_unc | 80.14 | 78.47 | 87.93, 85.95, 83.40, 77.10, 53.18 |
| val_refcocoplus_unc | 75.62 | 72.21 | 83.62, 81.57, 79.18, 73.26, 46.36 |
| testA_refcocoplus_unc | 79.16 | 77.26 | 88.29, 86.85, 84.36, 77.57, 47.00 |
| testB_refcocoplus_unc | 71.63 | 66.11 | 77.53, 75.20, 72.92, 67.48, 46.88 |
| val_refcocog_umd | 76.30 | 73.89 | 84.15, 81.68, 77.98, 70.81, 48.53 |
| test_refcocog_umd | 77.15 | 75.88 | 85.53, 82.64, 79.06, 72.19, 48.12 |
Download the weights and reproduce the results with:

```shell
bash tools/dist_test.sh configs/refcoco/DERIS-B-refcoco.py 1 --load-from /PATH/TO/DeRIS-B-refcoco.pth
```

The results of DeRIS-L on the RefCOCO/+/g mixed dataset are as follows:
| Split | mIoU | oIoU | MaskACC @ {0.5, 0.6, 0.7, 0.8, 0.9} |
|---|---|---|---|
| val_refcoco_unc | 85.72 | 85.41 | 93.53, 92.40, 90.32, 84.60, 63.35 |
| testA_refcoco_unc | 86.64 | 86.49 | 95.14, 94.14, 91.88, 86.12, 62.64 |
| testB_refcoco_unc | 84.52 | 82.87 | 90.97, 89.38, 87.13, 83.12, 66.70 |
| val_refcocoplus_unc | 81.28 | 79.01 | 88.45, 86.98, 85.12, 79.94, 60.23 |
| testA_refcocoplus_unc | 83.74 | 82.34 | 91.66, 90.50, 88.51, 82.91, 60.88 |
| testB_refcocoplus_unc | 78.59 | 74.41 | 84.11, 82.31, 80.39, 76.78, 62.13 |
| val_refcocog_umd | 80.01 | 77.65 | 86.97, 84.70, 81.15, 74.88, 57.88 |
| test_refcocog_umd | 81.32 | 80.12 | 88.86, 86.68, 83.19, 76.46, 59.42 |
Download the weights and reproduce the results with:

```shell
bash tools/dist_test.sh configs/refcoco/DERIS-L-refcoco.py 1 --load-from /PATH/TO/DeRIS-L-refcoco.pth
```

The results of DeRIS-B on the gRefCOCO dataset are as follows:
| Split | N-acc. | gIoU | cIoU |
|---|---|---|---|
| val | 77.03 | 74.10 | 68.06 |
| testA | 75.98 | 73.72 | 71.99 |
| testB | 63.44 | 65.63 | 64.65 |
Download the weights and reproduce the results with:

```shell
bash tools/dist_test.sh configs/gres/DERIS-B-grefcoco.py 1 --load-from /PATH/TO/DeRIS-B-grefcoco.pth --score-threshold 0.7
```

The results of DeRIS-L on the gRefCOCO dataset are as follows:
| Split | N-acc. | gIoU | cIoU |
|---|---|---|---|
| val | 82.22 | 77.67 | 72.00 |
| testA | 78.30 | 75.30 | 73.73 |
| testB | 66.81 | 67.99 | 67.38 |
Download the weights and reproduce the results with:

```shell
bash tools/dist_test.sh configs/gres/DERIS-L-grefcoco.py 1 --load-from /PATH/TO/DeRIS-L-grefcoco.pth --score-threshold 0.9
```

This repository partially builds upon the codebases of SimVG, BEiT-3, Mask2Former, and SeqTR.
```bibtex
@misc{DeRIS,
      title={DeRIS: Decoupling Perception and Cognition for Enhanced Referring Image Segmentation through Loopback Synergy},
      author={Ming Dai and Wenxuan Cheng and Jiang-jiang Liu and Sen Yang and Wenxiao Cai and Yanpeng Sun and Wankou Yang},
      year={2025},
      eprint={2507.01738},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2507.01738},
}
```
