
🔥 SearchDet: Training-Free Long Tail Object Detection via Web-Image Retrieval (CVPR 2025!) 🔥

By Mankeerat Sidhu, Hetarth Chopra, Ansel Blume, Jeonghwan Kim, Revanth Gangi Reddy and Heng Ji

The arXiv preprint can be found here: SearchDet

This repository contains the official code for SearchDet, a training-free framework for long-tail, open-vocabulary object detection. SearchDet leverages web-retrieved positive and negative support images to dynamically generate query embeddings for precise object localization—all without additional training.


Architecture diagram of our pipeline. The adjusted embeddings of the positive and negative support images, produced by DINOv2, are compared against the candidate masks extracted by SAM to give an initial estimate of the segmentation bounding box. DINOv2 is also used to generate pixel-precise heatmaps, which provide a second estimate of the segmentation. The two estimates are combined via a binarized overlap to obtain the final segmentation mask.
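
As a rough illustration of the final combination step, the sketch below binarizes the similarity heatmap and intersects it with a SAM mask. This is our own simplification under assumed inputs (a boolean SAM mask and a heatmap normalized to [0, 1]); combine_estimates and heatmap_threshold are hypothetical names, not functions from this repository.

```python
import numpy as np

def combine_estimates(sam_mask: np.ndarray, heatmap: np.ndarray,
                      heatmap_threshold: float = 0.5) -> np.ndarray:
    """Combine a SAM-proposed mask with a DINOv2 similarity heatmap
    by binarizing the heatmap and keeping only the overlapping pixels."""
    # Binarize the similarity heatmap (assumed normalized to [0, 1]).
    heatmap_binary = heatmap >= heatmap_threshold
    # The final mask is the pixel-wise overlap of the two estimates.
    return np.logical_and(sam_mask.astype(bool), heatmap_binary)
```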

SearchDet is designed to:

  • ✅ Enhance Open-Vocabulary Detection: Improve detection performance on long-tail classes by retrieving and leveraging web images.
  • ✅ Operate Training-Free: Eliminate the need for costly fine-tuning and continual pre-training by computing query embeddings at inference time.
  • ✅ Utilize State-of-the-Art Models: Integrate off-the-shelf models like DINOv2 for robust image embeddings and SAM for generating region proposals.

Our method demonstrates substantial mAP improvements over existing approaches on challenging datasets—all while keeping the inference pipeline lightweight and training-free.


Key Features

  • Web-Based Exemplars: Retrieve positive and negative support images from the web to create dynamic, context-sensitive query embeddings.
  • Attention-Based Query Generation: Enhance detection by weighting support images according to their cosine similarity with the input query (see the sketch after this list).
  • Robust Region Proposals: Use SAM to generate high-quality segmentation proposals that are refined via similarity heatmaps.
  • Adaptive Thresholding: Apply frequency-based thresholding to automatically select the most relevant region proposals.
  • Scalable Inference: Achieve strong performance with just a few support images—ideal for long-tailed object detection scenarios.
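
The attention-based query generation can be pictured as a similarity-weighted average of the support embeddings. The sketch below is a minimal illustration under our own assumptions (the softmax aggregation and the function name attention_weighted_query are ours, not necessarily the exact formulation used in the paper):

```python
import torch
import torch.nn.functional as F

def attention_weighted_query(query_emb: torch.Tensor,
                             support_embs: torch.Tensor) -> torch.Tensor:
    """Weight support-image embeddings by their cosine similarity to the
    query embedding and return the aggregated query vector.

    query_emb:    (D,)   embedding of the input query
    support_embs: (N, D) embeddings of the N web-retrieved support images
    """
    # Cosine similarity between the query and each support embedding.
    sims = F.cosine_similarity(query_emb.unsqueeze(0), support_embs, dim=-1)  # (N,)
    # Softmax turns the similarities into attention weights over support images.
    weights = torch.softmax(sims, dim=0)                                      # (N,)
    # Weighted average of the support embeddings forms the final query.
    return (weights.unsqueeze(-1) * support_embs).sum(dim=0)                  # (D,)
```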

Why We Use Positive and Negative Exemplars

(a) Without negative support images (b) After including negative support images

Figure 3. Including negative support images yields more precise masks: the negative query (e.g., “waves”) steers the method away from irrelevant regions and toward the intended concept (e.g., “surfboard”).
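
One simple way to realize this adjustment is to push the aggregated positive query away from the mean direction of the negative-exemplar embeddings. The sketch below is illustrative only; the subtraction weight alpha and the function name adjust_with_negatives are our assumptions, not the exact adjust_embedding logic in mask_with_search.py.

```python
import torch
import torch.nn.functional as F

def adjust_with_negatives(positive_query: torch.Tensor,
                          negative_embs: torch.Tensor,
                          alpha: float = 1.0) -> torch.Tensor:
    """Push the query embedding away from the negative-exemplar direction.

    positive_query: (D,)   aggregated embedding of the positive support images
    negative_embs:  (M, D) embeddings of the negative support images (e.g., "waves")
    alpha:          strength of the negative correction
    """
    negative_direction = negative_embs.mean(dim=0)
    adjusted = positive_query - alpha * negative_direction
    # Re-normalize so downstream cosine similarities remain comparable.
    return F.normalize(adjusted, dim=-1)
```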


Results

We compare not only the accuracy of our method but also OWOD models' accuracy versus inference time on LVIS. With caching, SearchDet runs at a speed comparable to GroundingDINO and faster than T-Rex, two state-of-the-art methods.

(Figures: benchmark results and the speed/accuracy trade-off plot on LVIS.)

Below are some qualitative examples of SearchDet's predictions on the benchmarks: phone, Mountain Dew, window, vase, dog, and C-Class.


Installation

Run pip install -r requirements.txt inside your virtual environment. If you plan to run the code on a GPU, first install PyTorch for your CUDA version, e.g. pip install torch torchvision --extra-index-url https://download.pytorch.org/whl/cu118, comment out torch and torchvision in requirements.txt, and then run pip install -r requirements.txt.


Usage

SearchDet is designed so that any developer can swap out components of the system to fit their needs.

  • If higher mask precision is needed, use a larger SAM variant (e.g., SemanticSAM); if faster inference is needed, use a faster SAM implementation (e.g., FastSAM or the PyTorch implementation of SAM).
  • If higher retrieval quality is needed, the embedding model can be replaced with an alternative suited to your use case, such as CLIP.
  • We encourage experimenting with whether your use case benefits from negative exemplar images by modifying adjust_embedding (line 167) in mask_with_search.py; test with and without negative images and keep whichever works best.
  • The web crawler we provide is a naive Selenium implementation without parallelization; spinning up multiple threads is encouraged (see the sketch after this list).
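
As a rough starting point for parallelizing retrieval, the sketch below downloads support images concurrently with a thread pool. It uses plain HTTP via requests as a stand-in for the Selenium crawler, and download_image / download_images_parallel are hypothetical helper names.

```python
import concurrent.futures

import requests  # stand-in for the Selenium-based crawler used in this repo

def download_image(url: str) -> bytes:
    """Fetch a single support image over HTTP."""
    return requests.get(url, timeout=10).content

def download_images_parallel(urls: list[str], max_workers: int = 8) -> list[bytes]:
    """Download support images concurrently instead of one at a time."""
    with concurrent.futures.ThreadPoolExecutor(max_workers=max_workers) as pool:
        # map() preserves the input order of the URLs.
        return list(pool.map(download_image, urls))
```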
