LEG-SLAM is an open-vocabulary 3D SLAM system that integrates 3D Gaussian Splatting, DINOv2 feature extraction, and Talk2DINO language grounding to enable real-time semantic 3D scene understanding. Unlike existing methods, LEG-SLAM allows text-driven interactive exploration of reconstructed environments without predefined object categories.
- Real-time 3D Reconstruction: High-fidelity scene reconstruction with Gaussian Splatting.
- Open-Vocabulary Understanding: Uses DINOv2 features and Talk2DINO to match text queries to visual features.
- Efficient Feature Compression: PCA-based embedding compression enables low-latency inference.
- Interactive Scene Queries: Retrieve semantic masks in real time by specifying objects via text.
- High-Speed Performance: Achieves 10 FPS on Replica and 18 FPS on ScanNet, significantly faster than prior methods.
- DINOv2 Feature Extraction: Extracts rich, self-supervised embeddings from RGB frames.
- Talk2DINO Language Alignment: Maps text queries into a DINOv2-compatible feature space.
- PCA Compression: Reduces embeddings from 768D → 64D, enabling real-time processing.
- 3D Gaussian Splatting: Constructs a continuous, high-resolution 3D scene representation.
- Semantic Querying: Computes cosine similarity between scene embeddings and textual queries, generating semantic heatmaps.
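
The compression and querying steps above can be illustrated with a minimal NumPy sketch. This is not the LEG-SLAM implementation (the code is not yet released): the helper names `fit_pca`, `compress`, and `similarity_heatmap` are hypothetical, the random features stand in for real DINOv2 embeddings, and the "text" vector is a placeholder for a Talk2DINO embedding. It also assumes the text embedding is projected with the same PCA basis as the scene features, which the description above does not confirm.

```python
import numpy as np


def fit_pca(features, out_dim):
    """Fit PCA on (N, D) features; return the mean and an (out_dim, D) basis."""
    mean = features.mean(axis=0)
    centered = features - mean
    # Rows of vt are principal directions, sorted by decreasing variance.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return mean, vt[:out_dim]


def compress(features, mean, components):
    """Project (N, D) features onto the PCA basis, giving (N, out_dim)."""
    return (features - mean) @ components.T


def similarity_heatmap(scene_feats, text_feat, eps=1e-8):
    """Cosine similarity between per-pixel features (H, W, C) and a query (C,)."""
    scene = scene_feats / (np.linalg.norm(scene_feats, axis=-1, keepdims=True) + eps)
    query = text_feat / (np.linalg.norm(text_feat) + eps)
    return scene @ query  # shape (H, W)


# Toy data: random 768-D vectors standing in for DINOv2 features of a 16x16 map.
rng = np.random.default_rng(0)
h, w, d = 16, 16, 768
feats = rng.normal(size=(h * w, d)).astype(np.float32)

# Compress 768-D features to 64-D, as in the PCA step described above.
mean, comps = fit_pca(feats, 64)
scene = compress(feats, mean, comps).reshape(h, w, 64)

# Placeholder query: reuse one pixel's feature instead of a real text embedding
# (assumption for illustration only).
text = compress(feats[37][None], mean, comps)[0]

heat = similarity_heatmap(scene, text)  # semantic heatmap
mask = heat > 0.5                       # hypothetical threshold for a binary mask
```

In this toy setup the heatmap peaks at the pixel whose feature was reused as the query. The 0.5 threshold is arbitrary and would need tuning on real data.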
If you find this work useful, please cite:
@article{LEG-SLAM,
title={LEG-SLAM: Open-Vocabulary 3D Gaussian Splatting for SLAM},
author={Anonymous Authors},
journal={Under Review at ICCV 2025},
year={2025}
}

Code will be released upon paper acceptance. Stay tuned for updates!
