arxiv:2509.02359

Why Do MLLMs Struggle with Spatial Understanding? A Systematic Analysis from Data to Architecture

Published on Sep 2, 2025

Authors:

Abstract

AI-generated summary

A systematic analysis of spatial understanding in Multimodal Large Language Models across single-view, multi-view, and video scenarios reveals data and architectural limitations, suggesting improvements through better positional encoding and reasoning injection.

Spatial understanding is essential for Multimodal Large Language Models (MLLMs) to support perception, reasoning, and planning in embodied environments. Despite recent progress, existing studies reveal that MLLMs still struggle with spatial understanding. However, existing research lacks a comprehensive and systematic evaluation of these limitations, often restricted to isolated scenarios, such as single-view or video. In this work, we present a systematic analysis of spatial understanding from both data and architectural perspectives across three representative scenarios: single-view, multi-view, and video. We propose a benchmark named MulSeT (Multi-view Spatial Understanding Tasks), and design a series of experiments to analyze the spatial reasoning capabilities of MLLMs. From the data perspective, the performance of spatial understanding converges quickly as the training data increases, and the upper bound is relatively low, especially for tasks that require spatial imagination. This indicates that merely expanding training data is insufficient to achieve satisfactory performance. From the architectural perspective, we find that spatial understanding relies more heavily on the positional encoding within the visual encoder than within the language model, in both cascaded and native MLLMs. Moreover, we explore reasoning injection and envision future improvements through architectural design to optimize spatial understanding. These insights shed light on the limitations of current MLLMs and suggest new directions for improving spatial reasoning capabilities through data scaling and architectural tuning.
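
One way to see why positional encoding in the visual encoder could matter for spatial tasks: self-attention without positional information treats patch tokens as an unordered set, so rearranging the patches leaves a pooled feature unchanged. The toy sketch below illustrates only that general property. It is not the paper's experimental setup, and every name in it (`self_attention`, `pooled_feature`, the random weights and the random positional table) is a hypothetical stand-in chosen for illustration.

```python
import numpy as np


def self_attention(x, wq, wk, wv):
    """Single-head self-attention over a (tokens, dim) array."""
    q, k, v = x @ wq, x @ wk, x @ wv
    scores = q @ k.T / np.sqrt(k.shape[-1])
    # Subtract the row-wise max for a numerically stable softmax.
    scores = scores - scores.max(axis=-1, keepdims=True)
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ v


def pooled_feature(patches, wq, wk, wv, pos=None):
    """Mean-pooled attention output, optionally adding positional encodings."""
    x = patches if pos is None else patches + pos
    return self_attention(x, wq, wk, wv).mean(axis=0)


rng = np.random.default_rng(0)
num_patches, dim = 16, 8

# Hypothetical patch embeddings, positional table, and projection weights.
patches = rng.normal(size=(num_patches, dim))
pos = rng.normal(size=(num_patches, dim))
wq, wk, wv = (rng.normal(size=(dim, dim)) for _ in range(3))

# Rearrange the patches: same content, different spatial layout.
perm = rng.permutation(num_patches)
shuffled = patches[perm]

# Without positional encoding, the two layouts are indistinguishable.
no_pos_gap = np.abs(
    pooled_feature(patches, wq, wk, wv) - pooled_feature(shuffled, wq, wk, wv)
).max()

# With positional encoding, the layout changes the pooled feature.
with_pos_gap = np.abs(
    pooled_feature(patches, wq, wk, wv, pos)
    - pooled_feature(shuffled, wq, wk, wv, pos)
).max()

print(f"gap without positional encoding: {no_pos_gap:.2e}")
print(f"gap with positional encoding:    {with_pos_gap:.2e}")
```

The first gap is zero up to floating-point rounding, while the second is clearly nonzero: only the variant with positional encodings can tell the two layouts apart.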

