Meta’s Locate 3D AI Locates Objects in 3D Scenes Using Text

In the pursuit of bringing AI closer to real-world applications, Meta AI Research introduces Locate 3D—a groundbreaking model that enables machines to understand and localize objects in complex 3D environments using natural language expressions. Imagine describing an object by saying, “the small coffee table between the sofa and the lamp,” and having an AI system pinpoint that object accurately in a real-world scene. That’s the power of Locate 3D.

What is Locate 3D?

Locate 3D is a novel model for 3D referential grounding, allowing machines to locate objects in 3D scenes based on referring expressions. It processes posed RGB-D sensor streams, making it ideal for integration into robotics and augmented reality (AR) systems. Unlike previous methods, Locate 3D does not rely on pre-defined object classes or static scenes—it learns from data to generalize across diverse real-world conditions.

It achieves state-of-the-art performance on standard referential grounding benchmarks and is designed to robustly generalize across varied environments.
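To make the input/output contract concrete, here is a minimal sketch of what a grounding call looks like: a featurized point cloud lifted from posed RGB-D frames plus a referring expression go in, and a per-point mask with a 3D bounding box come out. The function name, tensor shapes, and feature dimension below are illustrative assumptions, not the repository's actual API.

import torch

# Illustrative sketch only: locate_object and the tensor shapes are hypothetical,
# not the actual locate-3d API; they show the expected inputs and outputs.
def locate_object(points: torch.Tensor, features: torch.Tensor, query: str) -> dict:
    """Hypothetical grounding call: points are N x 3 positions fused from posed
    RGB-D frames, features are N x D per-point features lifted from 2D foundation
    models, query is a natural-language referring expression."""
    n = points.shape[0]
    mask = torch.zeros(n, dtype=torch.bool)   # per-point instance mask (placeholder)
    mask[:16] = True                          # pretend the first 16 points were selected
    selected = points[mask]
    bbox = torch.stack([selected.min(dim=0).values, selected.max(dim=0).values])
    return {"mask": mask, "bbox": bbox}       # axis-aligned box: [min_xyz, max_xyz]

points = torch.rand(4096, 3)                  # fused point cloud from an RGB-D stream
features = torch.rand(4096, 512)              # e.g. CLIP/DINO features projected onto the points
result = locate_object(points, features, "the small coffee table between the sofa and the lamp")
print(result["bbox"].shape)                   # torch.Size([2, 3])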


The Core: 3D-JEPA

At the heart of Locate 3D lies 3D-JEPA, a self-supervised learning (SSL) algorithm crafted specifically for 3D sensor data. 3D-JEPA works on point clouds enriched by 2D vision foundation models like CLIP and DINO. By using a masked prediction task in the latent space, 3D-JEPA learns deep, contextual features without the need for manual labels.
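The following is a minimal, self-contained PyTorch sketch of that idea: mask part of a featurized point cloud, encode the visible context, and predict the latent features of the masked region, with the loss computed entirely in latent space. The tiny MLP encoders, dimensions, and pooled-context predictor are simplifying assumptions for illustration, not the 3D-JEPA architecture.

import torch
import torch.nn as nn

# Simplified JEPA-style masked latent prediction on a featurized point cloud.
N, D_FEAT, D_LAT = 4096, 512, 256             # points, lifted 2D-feature dim, latent dim
D_IN = 3 + D_FEAT                             # xyz position + per-point 2D features

context_encoder = nn.Sequential(nn.Linear(D_IN, D_LAT), nn.GELU(), nn.Linear(D_LAT, D_LAT))
target_encoder  = nn.Sequential(nn.Linear(D_IN, D_LAT), nn.GELU(), nn.Linear(D_LAT, D_LAT))
predictor       = nn.Sequential(nn.Linear(D_LAT + 3, D_LAT), nn.GELU(), nn.Linear(D_LAT, D_LAT))
target_encoder.load_state_dict(context_encoder.state_dict())   # kept as an EMA copy in practice

points = torch.rand(N, 3)                     # 3D positions
feats  = torch.rand(N, D_FEAT)                # features lifted from 2D foundation models
x = torch.cat([points, feats], dim=-1)

mask = torch.rand(N) < 0.5                    # choose the points to mask out
m = int(mask.sum())
with torch.no_grad():
    targets = target_encoder(x[mask])         # latent targets for the masked region

context = context_encoder(x[~mask]).mean(dim=0, keepdim=True)   # pooled visible context
queries = torch.cat([context.expand(m, -1), points[mask]], dim=-1)
pred = predictor(queries)                     # predict masked latents at their 3D locations

loss = nn.functional.mse_loss(pred, targets)  # loss lives in latent space; no manual labels
loss.backward()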

Once trained, the encoder from 3D-JEPA is fine-tuned alongside a language-conditioned decoder. This powerful combination enables the system to output accurate 3D masks and bounding boxes directly from natural language prompts.
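As a rough illustration of what "language-conditioned decoding" means here, the sketch below fuses encoder point features with a text embedding and produces per-point mask logits, from which an axis-aligned box is derived. The concatenation-based fusion layer, the dimensions, and the box heuristic are assumptions made for clarity, not the published decoder.

import torch
import torch.nn as nn

# Sketch of a language-conditioned grounding head: fused point latents and a query
# embedding yield per-point mask logits; the 3D box follows from the predicted mask.
N, D_LAT, D_TXT = 4096, 256, 384

class GroundingHead(nn.Module):
    def __init__(self):
        super().__init__()
        self.fuse = nn.Sequential(nn.Linear(D_LAT + D_TXT, D_LAT), nn.ReLU())
        self.mask_head = nn.Linear(D_LAT, 1)          # one logit per point

    def forward(self, point_latents, text_emb):
        text = text_emb.expand(point_latents.shape[0], -1)
        fused = self.fuse(torch.cat([point_latents, text], dim=-1))
        return self.mask_head(fused).squeeze(-1)

head = GroundingHead()
point_latents = torch.rand(N, D_LAT)                  # output of the fine-tuned 3D-JEPA encoder
points = torch.rand(N, 3)                             # matching 3D positions
text_emb = torch.rand(1, D_TXT)                       # embedding of the referring expression

logits = head(point_latents, text_emb)
mask = torch.sigmoid(logits) > 0.5
if mask.any():                                        # derive an axis-aligned box from the mask
    box = torch.stack([points[mask].min(dim=0).values, points[mask].max(dim=0).values])
    print(box)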


The Locate 3D Dataset

To train and evaluate Locate 3D effectively, Meta AI also introduces the Locate 3D Dataset—a large-scale dataset for 3D referential grounding. With over 130,000 annotations across diverse capture setups, this dataset supports extensive analysis of model generalization and performance. It’s a valuable resource for both academic research and practical applications.
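As a rough mental model of what a single referential-grounding annotation ties together (a scene, a referring expression, and the target object), here is a hypothetical record; the field names are illustrative and do not reflect the dataset's actual schema.

# Hypothetical annotation record for 3D referential grounding; field names are
# illustrative and do not reflect the Locate 3D Dataset's actual schema.
annotation = {
    "scene_id": "scene_0001",
    "referring_expression": "the small coffee table between the sofa and the lamp",
    "target_bbox": [[0.4, 1.2, 0.0], [1.1, 1.8, 0.5]],   # axis-aligned [min_xyz, max_xyz]
}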


Model Zoo

Meta AI offers a suite of pretrained models to support experimentation and development:

Model          Parameters
Locate 3D      600M
Locate 3D+     600M
3D-JEPA        300M

Each model targets a different trade-off between performance and resource requirements, giving flexibility for both research and deployment.


Code Structure

The open-source codebase is structured for usability and modularity:

.
├── examples                  # Example notebooks
├── models
│   ├── encoder               # 3D-JEPA encoder models
│   └── locate-3d             # Locate 3D model logic
├── locate3d_data
│   ├── datasets              # Data loaders and preprocessing tools

Licensing and Usage

  • Code: Mainly under CC-BY-NC, with components (e.g., Pointcept) under MIT.
  • Data: Licensed under CC-BY-NC 4.0. Some portions originate from Llama 3.2 and are bound by its licensing terms. Any derivative AI models trained on this data must include “Llama” in their name if redistributed.

Conclusion

Locate 3D and 3D-JEPA represent a major leap forward in bridging natural language understanding with spatial perception in 3D environments. With strong generalization, rich datasets, and robust performance, this system paves the way for next-generation applications in robotics, AR, and interactive AI systems.

To explore further:

[GitHub Repository]

Stay tuned as Meta AI continues to push the boundaries of intelligent, embodied AI.

