In the pursuit of bringing AI closer to real-world applications, Meta AI Research introduces Locate 3D—a groundbreaking model that enables machines to understand and localize objects in complex 3D environments using natural language expressions. Imagine describing an object by saying, “the small coffee table between the sofa and the lamp,” and having an AI system pinpoint that object accurately in a real-world scene. That’s the power of Locate 3D.
What is Locate 3D?
Locate 3D is a novel model for 3D referential grounding, allowing machines to locate objects in 3D scenes based on referring expressions. It processes posed RGB-D sensor streams, making it ideal for integration into robotics and augmented reality (AR) systems. Unlike previous methods, Locate 3D does not rely on pre-defined object classes or static scenes—it learns from data to generalize across diverse real-world conditions.
It achieves state-of-the-art performance on standard referential grounding benchmarks and is designed to robustly generalize across varied environments.
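Because the model consumes posed RGB-D sensor streams, the first step in any pipeline of this kind is lifting each depth frame into a world-frame point cloud. Below is a minimal NumPy sketch of that standard back-projection step; the intrinsics and pose values are made up for illustration and are not tied to any particular sensor or to the released code.

```python
import numpy as np

def backproject_rgbd(depth, rgb, K, cam_to_world):
    """Lift a posed RGB-D frame into a colored point cloud in world coordinates.

    depth:        (H, W) depth map in meters
    rgb:          (H, W, 3) color image
    K:            (3, 3) pinhole camera intrinsics
    cam_to_world: (4, 4) camera-to-world pose
    """
    H, W = depth.shape
    u, v = np.meshgrid(np.arange(W), np.arange(H))
    z = depth.reshape(-1)
    valid = z > 0                                    # drop pixels with no depth reading
    x = (u.reshape(-1) - K[0, 2]) * z / K[0, 0]      # unproject with the pinhole model
    y = (v.reshape(-1) - K[1, 2]) * z / K[1, 1]
    pts_cam = np.stack([x, y, z, np.ones_like(z)], axis=-1)[valid]
    pts_world = (cam_to_world @ pts_cam.T).T[:, :3]  # move points into the world frame
    colors = rgb.reshape(-1, 3)[valid]
    return pts_world, colors

# Toy example with synthetic inputs (real frames would come from the sensor stream).
K = np.array([[525.0, 0, 320.0], [0, 525.0, 240.0], [0, 0, 1.0]])
pose = np.eye(4)
depth = np.full((480, 640), 2.0)
rgb = np.zeros((480, 640, 3), dtype=np.uint8)
points, colors = backproject_rgbd(depth, rgb, K, pose)
print(points.shape)
```

Point clouds lifted this way are then enriched with 2D foundation-model features before encoding, as described next.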
The Core: 3D-JEPA
At the heart of Locate 3D lies 3D-JEPA, a self-supervised learning (SSL) algorithm crafted specifically for 3D sensor data. 3D-JEPA works on point clouds enriched by 2D vision foundation models like CLIP and DINO. By using a masked prediction task in the latent space, 3D-JEPA learns deep, contextual features without the need for manual labels.
Once trained, the encoder from 3D-JEPA is fine-tuned alongside a language-conditioned decoder. This powerful combination enables the system to output accurate 3D masks and bounding boxes directly from natural language prompts.
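To make the idea of masked prediction in latent space concrete, here is a PyTorch-style sketch of a 3D-JEPA-like objective on a featurized point cloud. The module structure, feature dimensions, and masking scheme are illustrative assumptions, not the paper's exact architecture.

```python
import torch
import torch.nn as nn

# Schematic JEPA-style objective: predict the latent features of masked points
# from the visible context, rather than reconstructing raw inputs.
# All module choices below (simple MLPs, feature size 768) are assumptions.

class PointEncoder(nn.Module):
    def __init__(self, feat_dim=768, latent_dim=512):
        super().__init__()
        # In the real system, per-point features come from 2D foundation
        # models (e.g., CLIP, DINO) lifted onto the point cloud.
        self.net = nn.Sequential(nn.Linear(3 + feat_dim, latent_dim),
                                 nn.ReLU(),
                                 nn.Linear(latent_dim, latent_dim))

    def forward(self, xyz, feats):
        return self.net(torch.cat([xyz, feats], dim=-1))

def jepa_loss(context_encoder, target_encoder, predictor, xyz, feats, mask):
    """Masked prediction in latent space.

    mask: boolean tensor over points; True marks points whose latents
    must be inferred from the unmasked context.
    """
    with torch.no_grad():
        targets = target_encoder(xyz, feats)                        # target latents (no gradient)
    context = context_encoder(xyz, feats * (~mask).unsqueeze(-1).float())
    preds = predictor(context)                                      # predict latents for every point
    return ((preds[mask] - targets[mask]) ** 2).mean()

# Toy usage with random data.
B, N, D = 2, 1024, 768
xyz, feats = torch.randn(B, N, 3), torch.randn(B, N, D)
mask = torch.rand(B, N) < 0.5
ctx_enc, tgt_enc = PointEncoder(), PointEncoder()
predictor = nn.Linear(512, 512)
loss = jepa_loss(ctx_enc, tgt_enc, predictor, xyz, feats, mask)
loss.backward()
print(float(loss))
```

In practice the target encoder would be a slowly updated copy of the context encoder, but the sketch keeps only the core idea: the learning signal lives entirely in latent space, so no manual labels are needed.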
The Locate 3D Dataset
To train and evaluate Locate 3D effectively, Meta AI also introduces the Locate 3D Dataset—a large-scale dataset for 3D referential grounding. With over 130,000 annotations across diverse capture setups, this dataset supports extensive analysis of model generalization and performance. It’s a valuable resource for both academic research and practical applications.
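For a sense of what working with referential-grounding annotations looks like, here is a hedged loading sketch. The file layout and field names (`scene_id`, `expression`, `target_bbox`) are assumptions for illustration, not the dataset's actual schema.

```python
import json
from pathlib import Path

# Illustrative loader for referential-grounding annotations.
# The directory layout and field names below are assumptions,
# not the official Locate 3D Dataset format.
def load_annotations(annotation_dir: str):
    for path in sorted(Path(annotation_dir).glob("*.json")):
        with open(path) as f:
            for record in json.load(f):
                yield {
                    "scene_id": record["scene_id"],        # which 3D capture the query refers to
                    "expression": record["expression"],    # natural-language referring expression
                    "target_bbox": record["target_bbox"],  # ground-truth 3D box for the referent
                }

if __name__ == "__main__":
    for ann in load_annotations("locate3d_data/annotations"):
        print(ann["expression"])
        break
```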
Model Zoo
Meta AI offers a suite of pretrained models to support experimentation and development:
| Model | Parameters |
|---|---|
| Locate 3D | 600M |
| Locate 3D+ | 600M |
| 3D-JEPA | 300M |
Each model targets a different point on the performance-versus-resource trade-off, giving researchers and practitioners flexibility for both experimentation and deployment.
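As a rough way to reason about the resource side of that trade-off, the parameter counts above translate directly into weights-only memory footprints. The sketch below estimates them at a few common precisions; the parameter counts come from the table, everything else is generic arithmetic rather than measured numbers.

```python
# Rough weights-only memory estimate for each released model.
# Parameter counts come from the table above; byte sizes are standard for
# the listed numeric precisions. Activations and buffers are not included.
PARAM_COUNTS = {"Locate 3D": 600e6, "Locate 3D+": 600e6, "3D-JEPA": 300e6}
BYTES_PER_PARAM = {"fp32": 4, "fp16/bf16": 2, "int8": 1}

for model, n_params in PARAM_COUNTS.items():
    sizes = ", ".join(
        f"{precision}: {n_params * nbytes / 1e9:.1f} GB"
        for precision, nbytes in BYTES_PER_PARAM.items()
    )
    print(f"{model:>10s} -> {sizes}")
```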
Code Structure
The open-source codebase is structured for usability and modularity:
```
.
├── examples              # Example notebooks
├── models
│   ├── encoder           # 3D-JEPA encoder models
│   └── locate-3d         # Locate 3D model logic
├── locate3d_data
│   ├── datasets          # Data loaders and preprocessing tools
```
Licensing and Usage
- Code: Primarily released under CC-BY-NC, with some components (e.g., Pointcept) under the MIT license.
- Data: Licensed under CC-BY-NC 4.0. Some portions originate from Llama 3.2 and are bound by its licensing terms. Any derivative AI models trained on this data must include “Llama” in their name if redistributed.
Conclusion
Locate 3D and 3D-JEPA represent a major leap forward in bridging natural language understanding with spatial perception in 3D environments. With strong generalization, rich datasets, and robust performance, this system paves the way for next-generation applications in robotics, AR, and interactive AI systems.
To explore further, check out the open-source codebase, the pretrained models in the model zoo, and the Locate 3D Dataset.
Stay tuned as Meta AI continues to push the boundaries of intelligent, embodied AI.