Course: HKU DASC-7606 (2024-2025) Assignment: 1 - Object Detection with Advanced Models
Important: This codebase is exclusively for HKU DASC 7606 (2024-2025). Please do not upload your solutions or this codebase to any public platforms. All rights reserved.
Imagine you're building a system that can identify and localize objects in images or videos, such as cars, pedestrians, animals, or everyday objects. Object Detection is a critical task in computer vision that involves both classifying objects in an image and drawing bounding boxes around them.
Why is Object Detection Important?
Object detection is fundamental to numerous applications, including:
- Autonomous Vehicles: Identifying pedestrians, vehicles, and traffic signs.
- Surveillance Systems: Detecting intruders or monitoring activities in a scene.
- Healthcare: Locating tumors or anomalies in medical images.
- Retail: Enhancing inventory tracking by detecting objects on shelves.
- AR/VR: Understanding scenes to enable seamless virtual integration.
Figure 1. An example of 2D Object Detection from the given dataset.
Traditional object detection systems relied on techniques like sliding windows and region proposals, coupled with convolutional neural networks (CNNs). Recent advances, including transformer-based models such as DETR (DEtection TRansformer) and DINO (DETR with Improved deNoising anchOr boxes), alongside the YOLO (You Only Look Once) family, have revolutionized the field, offering simplicity, speed, and high performance.
What Makes Modern Models Stand Out?
- YOLO: An efficient single-stage model that predicts bounding boxes and class probabilities directly, excelling in real-time applications.
- DETR: A transformer-based model that handles object detection as a set prediction problem, leveraging global context for accurate detection.
- DINO: A refined version of DETR that improves training efficiency and scalability, making it suitable for large-scale and dense object detection tasks.
- Flexibility: Advanced models like Deformable DETR, YOLOS, and DINO provide enhancements for specific use cases, including dense and multi-scale object detection.
- Speed and Simplicity: YOLO models, such as YOLOv10 and YOLOv11, prioritize inference speed without compromising accuracy, making them ideal for resource-constrained applications.
In this assignment, you will explore object detection using state-of-the-art models. While transformer-based models like DETR and DINO are encouraged, you are free to use any model, including YOLO or other architectures, that meets the requirements of the transformers.Trainer API. You will:
- Understand the Basics: Learn how object detection models operate in the context of this assignment.
- Get Hands-On: Preprocess image data, including annotation handling and image augmentation.
- Build and Train: Fine-tune a model (e.g., DINO, DETR, YOLO, or others) on the basic CPPE-5 dataset.
- Evaluate and Analyze: Assess your model’s performance using standard metrics like mAP (mean Average Precision) and analyze its strengths and weaknesses.
- Become an Object Detection Expert: Gain a deeper understanding of the challenges in object detection and strategies for improvement.
The HKU GPU Farm provides the computational power needed for training advanced object detection models. Follow the provided quickstart guide to set up your environment.
If you have a powerful local machine with a suitable GPU and experience setting up deep learning environments, you can work locally. Ensure you have the necessary software (CUDA, cuDNN) installed and configured correctly.
Python: This code is tested with Python 3.11.10.
Virtual Environment (Recommended): Use Anaconda to manage your project's dependencies and avoid conflicts:
conda create -n vision_env python=3.11.10
conda activate vision_env
Install Packages:
pip install -r requirements.txt
You will work with the CPPE-5 dataset, which contains annotated images for personal protective equipment (PPE) detection tasks. This dataset provides bounding box annotations for various PPE elements such as coveralls, face shields, gloves, goggles, and masks. It is widely used in safety and compliance monitoring, enabling models to understand and detect PPE usage in various environments. Note that training is not strictly limited to CPPE-5: you may continue pre-training your model on other datasets as well.
Steps:
- Load the Dataset: Load the CPPE-5 dataset into a suitable data structure.
- Image Augmentation: To prevent overfitting and improve model generalization, we will apply various augmentations to the training images. This includes geometric transformations (e.g., flipping, rotation) and color adjustments.
- Data Format and Annotation Handling: Convert bounding box annotations to the format expected by the chosen model. For DETR series models, the annotations must be in the COCO format, which DETR expects. Each image will have bounding boxes and associated class labels that need to be formatted accordingly.
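As a concrete sketch of the annotation-handling step, the helper below (hypothetical name; your `dataset.py` may structure this differently) converts parallel lists of labels, areas, and boxes into the COCO-style target dict that DETR-family image processors expect:

```python
def format_annotations_as_coco(image_id, categories, areas, bboxes):
    """Build the COCO-style target dict expected by DETR-family processors.

    `bboxes` follow the COCO convention: [x_min, y_min, width, height].
    """
    return {
        "image_id": image_id,
        "annotations": [
            {
                "image_id": image_id,
                "category_id": category,
                "area": area,
                "bbox": list(bbox),
                "iscrowd": 0,
            }
            for category, area, bbox in zip(categories, areas, bboxes)
        ],
    }
```

A batch transform would call this once per image and then hand the images plus these targets to the model's image processor.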
Your mission is divided into three key tasks:
- Objective: Correctly load, preprocess, and augment the CPPE-5 dataset.
- File: `dataset.py`
- Instructions: Implement the sections marked with "Write Your Code Here".
- Hints (using the COCO format as an example):
  - Use the Albumentations library to apply data augmentation to the images. These augmentations help the model generalize better by applying random transformations while keeping the bounding boxes intact. Apply several augmentations to the training set; leave the validation set unchanged (a no-op).
  - Reformat the dataset's annotations into the COCO format that some models (e.g., DETR) expect. This involves converting the bounding boxes into the appropriate format and ensuring each annotation has the correct attributes, such as `image_id`, `category_id`, and `area`.
  - Combine the augmentations and annotation formatting into a function that applies these transformations to each image in the batch with `ImageProcessor`.
- Objective: Define a model for object detection that meets the requirements of the `transformers.Trainer` API.
- File: `model.py`
- Instructions: Implement the sections marked with "Write Your Code Here".
- Hints:
  - Use any pre-trained model (e.g., DINO, YOLO, DETR) compatible with the `transformers.Trainer` API.
  - Adapt the model to the number of object categories in your dataset.
  - Ensure that the model's outputs include both bounding box coordinates and class probabilities.
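For example, a DETR checkpoint can be adapted to the five CPPE-5 classes like this (a sketch; the checkpoint and label names are one possible choice, not a requirement):

```python
from transformers import AutoModelForObjectDetection

# The five CPPE-5 categories; adjust if you train on a different dataset.
id2label = {0: "Coverall", 1: "Face_Shield", 2: "Gloves", 3: "Goggles", 4: "Mask"}
label2id = {name: idx for idx, name in id2label.items()}

model = AutoModelForObjectDetection.from_pretrained(
    "facebook/detr-resnet-50",     # any Trainer-compatible checkpoint works
    id2label=id2label,
    label2id=label2id,
    ignore_mismatched_sizes=True,  # re-initialize the classification head
)
```

The model's forward pass returns class logits and predicted boxes, which is exactly what the evaluation code needs to compute mAP.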
- Objective: Tune the training arguments to achieve better performance.
- File: `trainer.py`
- Instructions: Modify the `training_args` in the `trainer.py` file to optimize the training process.
- Hints:
  - Experiment with different values for `num_train_epochs`, `learning_rate`, `per_device_train_batch_size`, and other parameters.
  - Monitor the training and evaluation metrics to identify the best set of arguments.
  - Use tools like TensorBoard to visualize the training process and metrics (optional).
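A reasonable starting configuration could look like the sketch below. Every value here is a starting point to tune, not a recommendation; `fp16=True` assumes a CUDA GPU (e.g., on the HKU GPU Farm), and `metric_for_best_model="eval_map"` assumes your `compute_metrics` reports a `map` key.

```python
from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="detr-cppe5-finetune",
    num_train_epochs=30,
    learning_rate=5e-5,
    per_device_train_batch_size=8,
    weight_decay=1e-4,
    max_grad_norm=0.1,                 # DETR-style gradient clipping
    fp16=True,                         # mixed precision helps meet the 2-hour budget
    eval_strategy="epoch",
    save_strategy="epoch",
    load_best_model_at_end=True,
    metric_for_best_model="eval_map",
    remove_unused_columns=False,       # keep image/annotation columns for collation
    report_to="tensorboard",           # optional: visualize curves with TensorBoard
)
```

Depending on your transformers version, `eval_strategy` may instead be spelled `evaluation_strategy`.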
If your student ID is 30300xxxxx, organize your submission as follows:
30300xxxxx.zip
|-- constants.py # Constants and configurations
|-- dataset.py # Data loading, preprocessing, and augmentation
|-- model.py # Object detection model
|-- trainer.py # Training loop and logic
|-- evaluate.py # Evaluation metrics (mAP, precision, recall)
|-- requirements.txt # Python dependencies
|-- utils.py # Utility functions
|-- main.py # Main script to run training and evaluation
We will run your main.py script to evaluate your model's performance. Please add any additional libraries you use to the requirements.txt file.
Important Considerations:
- Error-Free Execution: Your code must run without errors.
- mAP Score: Your model is graded by the mAP it achieves (see the table below).
- Training Time: The training process should complete within 2 hours with the HKU GPU Farm environment.
Grading Breakdown (based on mAP score):
| mAP Score | Training Time ≤ 2 Hours | Training Time > 2 Hours |
|---|---|---|
| >45% | 100% | 90% |
| 37%-45% | 90% | 80% |
| 34%-37% | 80% | 70% |
| 30%-34% | 70% | 60% |
| 22%-30% | 60% | 50% |
| 10%-22% | 50% | 40% |
| others | 0% | 0% |
- Experiment with different models (e.g., YOLOS, Deformable DETR, EfficientDet, DINO).
- Hyperparameter tuning using grid search or Bayesian optimization.
- Implement advanced augmentation techniques (e.g., MixUp, CutMix).
- Apply the pipeline to a different dataset (e.g., Pascal VOC).
- Explore integrating external knowledge for improved performance.
