OpenSpatial: A Principled Data Engine for Empowering Spatial Intelligence
Abstract
OpenSpatial is an open-source data engine that uses 3D bounding boxes as its core primitive to build a large-scale spatial reasoning dataset; models trained on it achieve state-of-the-art performance on spatial perception benchmarks.
Spatial understanding is a fundamental cornerstone of human-level intelligence. Nonetheless, current research predominantly focuses on domain-specific data production, leaving a critical void: the absence of a principled, open-source engine capable of fully unleashing the potential of high-quality spatial data. To bridge this gap, we elucidate the design principles of a robust data generation system and introduce OpenSpatial -- an open-source data engine engineered for high quality, extensive scalability, broad task diversity, and optimized efficiency. OpenSpatial adopts 3D bounding boxes as the fundamental primitive to construct a comprehensive data hierarchy across five foundational tasks: Spatial Measurement (SM), Spatial Relationship (SR), Camera Perception (CP), Multi-view Consistency (MC), and Scene-Aware Reasoning (SAR). Leveraging this scalable infrastructure, we curate OpenSpatial-3M, a large-scale dataset comprising 3 million high-fidelity samples. Extensive evaluations demonstrate that models trained on our dataset achieve state-of-the-art performance across a wide spectrum of spatial reasoning benchmarks. Notably, the best-performing model achieves a substantial average relative improvement of 19%. Furthermore, we provide a systematic analysis of how data attributes influence spatial perception. By open-sourcing both the engine and the 3M-scale dataset, we provide a robust foundation to accelerate future research in spatial intelligence.
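To make the idea of 3D bounding boxes as a data primitive concrete, here is a minimal sketch of how two boxes could seed a Spatial Measurement (SM) question-answer pair. All class, field, and function names below are illustrative assumptions, not the engine's actual API.

```python
# Illustrative sketch: deriving a Spatial Measurement QA pair from two 3D boxes.
# Names and fields are hypothetical; the OpenSpatial engine's real schema may differ.
from dataclasses import dataclass
import numpy as np


@dataclass
class OBB:
    label: str          # object category, e.g. "chair"
    center: np.ndarray  # (3,) box center in metres
    size: np.ndarray    # (3,) extents along the box axes
    rotation: np.ndarray  # (3, 3) rotation matrix of the box axes


def sm_distance_sample(a: OBB, b: OBB) -> dict:
    """Build one distance-measurement QA pair from two boxes."""
    dist = float(np.linalg.norm(a.center - b.center))
    return {
        "task": "spatial_measurement",
        "question": f"How far apart are the {a.label} and the {b.label}?",
        "answer": f"{dist:.2f} m",
        "metadata": {"distance_m": dist},
    }


chair = OBB("chair", np.array([1.0, 0.0, 2.5]), np.array([0.5, 0.9, 0.5]), np.eye(3))
table = OBB("table", np.array([2.2, 0.0, 3.1]), np.array([1.2, 0.7, 0.8]), np.eye(3))
print(sm_distance_sample(chair, table))
```

Other task families (e.g., Spatial Relationship or Camera Perception) could be generated analogously by reading different geometric quantities off the same box primitives.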
Community
Hi HF Community! 👋
We are excited to share OpenSpatial, a principled data engine designed to empower the spatial intelligence of Large Multimodal Models.
Key Highlights:
- 📊 OpenSpatial-3M Dataset: We are open-sourcing a large-scale, high-fidelity dataset with 3 million samples across 100k+ diverse 3D scenes.
- 🛠️ Open-Source Data Engine: We release our full data production and 3D lifting framework, enabling the community to generate high-quality spatial data from 3D primitives (OBB) at scale (see the lifting sketch after this list).
- 📈 Significant Performance Gains: Our engine consistently boosts the spatial reasoning capabilities of state-of-the-art LMMs (e.g., Qwen2-VL, InternVL2) by a large margin across 5 foundational tasks.
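As a rough illustration of what "3D lifting" can look like, here is a minimal sketch that back-projects a 2D detection into a crude 3D box, assuming a metric depth map and known camera intrinsics. Every name below is hypothetical; the actual OpenSpatial pipeline may use a different lifting procedure.

```python
# Illustrative sketch of lifting a 2D box to a 3D box via depth back-projection.
# Assumes a metric depth map and pinhole intrinsics K; not the engine's real API.
import numpy as np


def backproject(depth: np.ndarray, K: np.ndarray) -> np.ndarray:
    """Turn an (H, W) metric depth map into an (H*W, 3) point cloud in camera frame."""
    h, w = depth.shape
    u, v = np.meshgrid(np.arange(w), np.arange(h))
    z = depth.reshape(-1)
    x = (u.reshape(-1) - K[0, 2]) * z / K[0, 0]
    y = (v.reshape(-1) - K[1, 2]) * z / K[1, 1]
    return np.stack([x, y, z], axis=1)


def lift_box2d_to_3d(box2d, depth, K):
    """Crude lift: back-project the full depth map, keep points inside the 2D box,
    and take the axis-aligned 3D extents of those points."""
    x0, y0, x1, y1 = box2d
    h, w = depth.shape
    pts = backproject(depth, K).reshape(h, w, 3)
    crop = pts[y0:y1, x0:x1].reshape(-1, 3)
    crop = crop[crop[:, 2] > 0]  # drop invalid (zero-depth) pixels
    lo, hi = crop.min(axis=0), crop.max(axis=0)
    return {"center": (lo + hi) / 2, "size": hi - lo}


K = np.array([[500.0, 0, 320], [0, 500.0, 240], [0, 0, 1]])
depth = np.full((480, 640), 3.0)  # toy depth map: flat surface 3 m away
print(lift_box2d_to_3d((100, 120, 200, 260), depth, K))
```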
Resources:
Check out our repo and feel free to join the discussion! 🚀
Librarian Bot (automated message): the following similar papers were recommended by the Semantic Scholar API.
- Holi-Spatial: Evolving Video Streams into Holistic 3D Spatial Intelligence (2026)
- SSR: Pushing the Limit of Spatial Intelligence with Structured Scene Reasoning (2026)
- HiSpatial: Taming Hierarchical 3D Spatial Understanding in Vision-Language Models (2026)
- VIEW2SPACE: Studying Multi-View Visual Reasoning from Sparse Observations (2026)
- Stepping VLMs onto the Court: Benchmarking Spatial Intelligence in Sports (2026)
- Enhancing MLLM Spatial Understanding via Active 3D Scene Exploration for Multi-Perspective Reasoning (2026)
- UniGround: Universal 3D Visual Grounding via Training-Free Scene Parsing (2026)