SpaceNum: Revisiting Spatial Numerical Understanding in VLMs

Vision-Language Models are moving beyond static 2D image understanding toward dynamic 3D spatial exploration. As VLM-based spatial agents claim to perceive, navigate, and reason in space, they inevitably produce numbers with spatial meaning: coordinates, distances, rotations, and action magnitudes. SpaceNum asks a sharper question: do these numbers truly map to space, or are they only fluent tokens attached to shallow visual cues?

Static Layout Grounding: when VLMs generate coordinate-based cognitive maps for spatial understanding, do the numbers actually correspond to object layouts in space?
Dynamic Transition Grounding: during spatial exploration, can models connect action numbers such as meters or degrees to the visual change they cause?
Mapping Direction Bias: is Space2Num easier than Num2Space, or do VLMs show different preferences across layouts and transitions?
Failure Anatomy: we probe errors, reasoning traces, visual reliance, symmetry, abstraction, and tuning to identify where spatial numerical grounding breaks.

Vision-Language Models (VLMs) are increasingly deployed in embodied environments, where they must produce numerical outputs such as action magnitudes and spatial coordinates. Although these numbers appear meaningful, it remains unclear whether they are genuinely grounded in spatial perception. SpaceNum revisits spatial numerical understanding through a unified framework covering two complementary settings: numbers as dynamic transitions during spatial exploration and numbers as static layouts in spatial reasoning. We formulate two bidirectional tasks, Num2Space and Space2Num, to evaluate how well VLMs map between visual spatial structure and language-side numerical representations. Across 18 VLMs from 6 model families, current models largely fail to ground numbers in spatial meaning: they often perform close to random guessing and rely on shallow spatial cues rather than stable coordinate-aware reasoning.

Figure 1: SpaceNum studies spatial numerical understanding under two settings: numbers as dynamic transitions in spatial exploration and numbers as static layouts in spatial understanding. The benchmark evaluates two directions: Num2Space, mapping numbers to visual outcomes, and Space2Num, mapping visual inputs to numbers.


SpaceNum Benchmark Design

SpaceNum uses simulator-based pipelines to create controlled spatial-number grounding problems. Dynamic transition data is generated in AI2-THOR, where embodied agents execute parameterized movement and rotation actions in indoor scenes. Static layout data is generated in NVIDIA Isaac Sim with BlenderKit assets, enabling controlled scene layouts and ground-truth coordinate annotations.

3,800 benchmark samples
77,412 automatically generated training samples
18 VLMs evaluated across 6 families

Numbers as Dynamic Transitions

Dynamic transition samples capture numerical values as action magnitudes. The benchmark covers forward/backward and left/right movement, plus up/down and left/right rotation. Data collection controls action coverage, transition continuity, visual anchoring, and validity: transitions must preserve enough visual overlap, contain sufficient object anchors, and correspond to valid simulator states.

Action Type | Range | Step
Move Forward / Backward | 0.2-2.4 m | 0.2 m
Move Left / Right | 0.2-1.2 m | 0.2 m
Rotate Up / Down | 10-70 deg | 10 deg
Rotate Left / Right | 10-70 deg | 10 deg
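
The ranges and steps in the table define a small discrete grid of magnitudes per action type. The sketch below enumerates that grid; the dictionary layout and key names are illustrative assumptions rather than the benchmark's actual schema.

```python
from fractions import Fraction

# Ranges and steps copied from the action table. The key names and the
# (low, high, step) tuple layout are assumptions made for illustration.
ACTION_GRID = {
    "move_forward_backward": ("0.2", "2.4", "0.2"),  # metres
    "move_left_right": ("0.2", "1.2", "0.2"),        # metres
    "rotate_up_down": ("10", "70", "10"),            # degrees
    "rotate_left_right": ("10", "70", "10"),         # degrees
}


def magnitudes(low, high, step):
    """Return every magnitude from low to high (inclusive), spaced by step."""
    # Fractions avoid floating-point drift when counting decimal steps.
    low, high, step = Fraction(low), Fraction(high), Fraction(step)
    count = int((high - low) / step) + 1
    return [float(low + i * step) for i in range(count)]


grid = {name: magnitudes(*spec) for name, spec in ACTION_GRID.items()}
for name, values in grid.items():
    print(name, len(values), values)
```

Under these ranges, forward/backward movement has 12 possible magnitudes, left/right movement has 6, and each rotation type has 7.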

Numbers as Static Layouts

Static layout samples treat numbers as relative spatial layouts. Each scene defines a coordinate system using two anchor objects: one sets the origin, while the second fixes direction and scale. A target object is then placed with controlled variations in position, size, or both. SpaceNum covers desktop-scale and room-scale scenes, and represents the same layout through 1D, 2D, and 3D coordinate descriptions.
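
To make the anchor-defined coordinate system concrete, here is a minimal 2D sketch. It assumes the first anchor sits at the origin and the second anchor sits at (1, 0), so that the vector between them fixes both the x-axis direction and the unit length. The exact convention used by SpaceNum is not stated here, and the function name is hypothetical.

```python
def to_anchor_frame(origin_anchor, direction_anchor, target):
    """Express a 2D world-space target in the frame defined by two anchors.

    Assumed convention (not confirmed by the source): origin_anchor maps to
    (0, 0) and direction_anchor maps to (1, 0). The y-axis is the x-axis
    rotated 90 degrees counter-clockwise.
    """
    ox, oy = origin_anchor
    dx, dy = direction_anchor[0] - ox, direction_anchor[1] - oy
    scale_sq = dx * dx + dy * dy
    if scale_sq == 0:
        raise ValueError("the two anchors must be at distinct positions")
    tx, ty = target[0] - ox, target[1] - oy
    # Project onto the anchor axis and its perpendicular, then divide by the
    # squared anchor distance so that one unit equals the anchor spacing.
    x = (tx * dx + ty * dy) / scale_sq
    y = (-tx * dy + ty * dx) / scale_sq
    return x, y


# Anchors 2 world units apart along x; target 1 unit right and 2 units up.
print(to_anchor_frame((1.0, 1.0), (3.0, 1.0), (2.0, 3.0)))  # (0.5, 1.0)
```

Because the result is normalized by the anchor spacing, the same relative layout yields the same coordinates whether the scene is desktop-scale or room-scale.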

SpaceNum dataset statistics
Figure 2: Dataset statistics for SpaceNum. The benchmark contains 3,800 evaluation samples, and the same automatic pipeline produces 77,412 additional training samples for tuning studies.

Bidirectional Spatial-Number Grounding Tasks

Num2Space

The model receives a numerical representation and must select the visual outcome consistent with it. In dynamic transitions, this means predicting the resulting observation from an initial view, action type, and numerical magnitude. In static layouts, this means selecting the observation matching a cognitive map.

Space2Num

The model receives visual spatial evidence and must infer the corresponding number. In dynamic transitions, it estimates the action magnitude from two observations and an action type. In static layouts, it recovers target-object coordinates under the defined reference frame.

This bidirectional setup is central to SpaceNum: it separates whether models can project known numbers into space from whether they can recover structured numbers from visual spatial observations.
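
The two directions can be pictured as two sample types with opposite inputs and outputs. The sketch below is an assumption about how such samples might be represented and scored; the class names, field names, and tolerance-based matching are hypothetical and not taken from the SpaceNum release.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class Num2SpaceSample:
    """Numbers in, image choice out (hypothetical schema)."""
    number_text: str                # an action with magnitude, or a cognitive map
    candidate_views: Sequence[str]  # placeholder image paths
    answer_index: int               # index of the consistent observation


@dataclass(frozen=True)
class Space2NumSample:
    """Images in, numbers out (hypothetical schema)."""
    views: Sequence[str]            # one or two observations
    answer: tuple[float, ...]       # a magnitude, or 1D/2D/3D coordinates


def score_num2space(sample, predicted_index):
    """Correct when the model picks the consistent observation."""
    return predicted_index == sample.answer_index


def score_space2num(sample, predicted, tol=1e-6):
    """Exact match on every component, up to a small numeric tolerance."""
    if len(predicted) != len(sample.answer):
        return False
    return all(abs(p - a) <= tol for p, a in zip(predicted, sample.answer))
```

Keeping the two sample types separate makes it straightforward to report accuracy per direction, which is what the asymmetry analysis relies on.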

Main Results

SpaceNum evaluates 18 VLMs from six families: Qwen2.5-VL, Qwen3-VL, InternVL3.5, Ovis2.5, Cosmos-Reason2, and Gemma-3. The core result is consistent: current VLMs struggle to ground numerical values in spatial meaning. Random guessing yields 30.0% macro-average accuracy, while the best model reaches only 39.8%.

Table 1: Main SpaceNum benchmark results across dynamic transition and static layout settings.
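
For readers unfamiliar with the term, macro-average accuracy gives every sub-task equal weight regardless of how many samples it contains. The sketch below shows the computation, together with the chance level of a multiple-choice question. The task names and accuracy values are placeholders for illustration only and are not SpaceNum results.

```python
def macro_average(per_task_accuracy):
    """Unweighted mean of per-task accuracies (each task counts once)."""
    values = list(per_task_accuracy.values())
    if not values:
        raise ValueError("need at least one task")
    return sum(values) / len(values)


def random_guess_accuracy(num_options):
    """Expected accuracy of uniform guessing among num_options choices."""
    return 1.0 / num_options


# Placeholder numbers, NOT benchmark results: two tasks with 4 and 2 options.
chance = {
    "task_a": random_guess_accuracy(4),
    "task_b": random_guess_accuracy(2),
}
print(macro_average(chance))  # 0.375
```

A macro-averaged chance level depends on how many options each sub-task offers, which is why the baseline need not equal a single 1/k value.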

Current VLMs do not robustly ground numbers in space.

Several models perform at or below the random-guess baseline, indicating that plausible numerical outputs do not imply spatial numerical understanding.

Dynamic transitions are broadly difficult.

Performance stays low across movement and rotation actions, showing a broad failure to model transition magnitudes rather than an isolated action-specific weakness.

Static layout difficulty grows with spatial complexity.

Models perform better on simpler 1D and desk-scale layouts but degrade sharply in higher-dimensional and room-scale settings.

Mapping direction is asymmetric.

Dynamic transitions favor Space2Num over Num2Space, while static layouts favor Num2Space over Space2Num, revealing different visual and language-side bottlenecks.

Diagnostic Analysis

Structured Output Errors

SpaceNum goes beyond exact-match accuracy by analyzing how wrong answers relate to the correct spatial number. In dynamic transitions, larger models often make numerically closer mistakes even when they fail exact matching. In static layouts, errors are usually coupled across position and size rather than isolated to a single attribute.

Figure 3: Larger models tend to make less severe dynamic-transition errors.
Figure 4: Static layout failures are dominated by coupled position-and-size errors.
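
One simple way to quantify how "numerically close" a wrong dynamic-transition answer is would be to count how many grid steps separate the prediction from the ground truth. The sketch below does this; it is an assumed proxy, since the exact proximity measure behind Figure 3 is not described here, and the function names are hypothetical.

```python
def step_error(predicted, truth, step):
    """Distance between prediction and ground truth, in grid steps."""
    return round(abs(predicted - truth) / step)


def mean_step_error_when_wrong(pairs, step):
    """Average step distance over incorrect predictions only.

    pairs is an iterable of (predicted, truth) magnitudes that share
    the same step size. Returns 0.0 when every prediction is correct.
    """
    errors = [step_error(p, t, step) for p, t in pairs]
    wrong = [e for e in errors if e != 0]
    return sum(wrong) / len(wrong) if wrong else 0.0


# Illustrative values only: one near miss, one far miss, one exact match.
pairs = [(0.4, 0.2), (1.0, 0.2), (0.6, 0.6)]
print(mean_step_error_when_wrong(pairs, step=0.2))  # 2.5
```

Under this proxy, two models with identical exact-match accuracy can still differ: the one with the lower mean step error is making less severe mistakes.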

Why Reasoning Traces Do Not Solve the Problem

Enabling explicit reasoning produces only marginal changes, typically within 1%. Trace analysis shows three recurring failure modes: models stop at coarse spatial cues without fine-grained comparison, fail to reason counterfactually about motion magnitude, and reason in image-space coordinates rather than the task-defined coordinate system.

Modality Asymmetry and Geometric Consistency

Blind testing shows that dynamic transitions depend more strongly on visual grounding, while static layouts can sometimes be partially solved through language-side priors. Per-action analysis further shows that Space2Num usually outperforms Num2Space for dynamic actions, and rotational symmetry tests reveal weak geometric invariance in the mapping from vision to numbers.

Figure 5: Masking visual inputs hurts dynamic transitions more than static layouts.
Figure 6: Per-action comparison between Num2Space and Space2Num.
Figure 7: Rotational symmetry analysis. Equivalent transformations should lead to consistent numerical predictions, but current models show substantial drops under symmetric variants.
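
The consistency requirement can be checked mechanically: group the predictions made for geometrically equivalent variants of the same sample, then count the groups in which all predictions agree. The sketch below assumes predictions are already grouped this way; the grouping itself and the function name are hypothetical.

```python
def consistency_rate(groups):
    """Fraction of groups whose predictions are all identical.

    Each group holds the model's numerical predictions for variants of
    one sample that should, by symmetry, receive the same answer.
    """
    if not groups:
        raise ValueError("need at least one group")
    consistent = sum(1 for predictions in groups if len(set(predictions)) == 1)
    return consistent / len(groups)


# Illustrative predictions in degrees: the first group is consistent,
# the second is not.
print(consistency_rate([[30, 30, 30], [10, 20, 10]]))  # 0.5
```

Note that this measures agreement only: a model can be perfectly consistent and still wrong, so it complements accuracy rather than replacing it.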

Disentangling Visual and Numerical Factors

Simple visual interventions, such as adding anchors for dynamic transitions or reducing irrelevant layout objects, produce only minor and inconsistent changes. Numerical representation changes also have limited effect. The strongest diagnostic signal appears when layout images are replaced with structured visual abstractions such as points, 2D boxes, and 3D boxes, indicating that the main bottleneck is vision-to-structure abstraction.

Figure 8: Visual-side intervention examples for dynamic transitions and static layouts.
Figure 9: Structured visual abstractions substantially improve Space2Num layout reasoning.

Tuning Spatial Numerical Understanding

SpaceNum also studies whether spatial numerical understanding can be improved through tuning. LoRA tuning on Qwen3-VL-4B and Qwen3-VL-8B shows that lower-dimensional spatial reasoning can partially transfer to higher-dimensional settings, but transfer remains limited. The best data recipe uses a layout-heavy mixture, with roughly 25% transition data and 75% layout data.
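
As an illustration of the mixture described above, the sketch below draws a training subset with a fixed transition fraction. Only the 25%/75% split comes from the text; the function name, the sampling-without-replacement strategy, and the seed handling are assumptions.

```python
import random


def mix_training_data(transition, layout, total, transition_fraction=0.25, seed=0):
    """Sample a shuffled training set with a fixed transition/layout ratio."""
    rng = random.Random(seed)
    n_transition = round(total * transition_fraction)
    n_layout = total - n_transition
    if n_transition > len(transition) or n_layout > len(layout):
        raise ValueError("not enough samples to satisfy the requested mixture")
    # Sample without replacement from each pool, then interleave by shuffling.
    mixed = rng.sample(transition, n_transition) + rng.sample(layout, n_layout)
    rng.shuffle(mixed)
    return mixed


# Placeholder sample identifiers, not real data.
transition_pool = [f"transition_{i}" for i in range(10)]
layout_pool = [f"layout_{i}" for i in range(10)]
subset = mix_training_data(transition_pool, layout_pool, total=8)
print(sum(s.startswith("transition") for s in subset), "of", len(subset))  # 2 of 8
```

Fixing the seed keeps the mixture reproducible across tuning runs, which matters when comparing recipes that differ only in their ratio.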

Figure 10: Cross-dimension transfer patterns after tuning.
Figure 11: Data mixture and scale both affect spatial numerical understanding.

External Benchmark | Qwen3-VL-4B Delta | Qwen3-VL-8B Delta | What Improves
OmniSpatial Motion | +5.5 | +4.5 | Camera movement understanding
SAT Action Consequence | +8.1 | +18.9 | Action outcome reasoning
SAT Object Movement | +34.8 | +43.5 | Object dynamics reasoning

Conclusion

SpaceNum shows that current VLMs largely fail to ground numerical values in spatial meaning across both dynamic transitions and static layouts. The failures arise from weak spatial abstraction, asymmetric vision-number mappings, and an inability to build stable coordinate-aware representations. Tuning partially improves performance and transfers to external spatial reasoning benchmarks, but substantial gaps remain.

Acknowledgement

We would like to thank the Cambrian authors for providing this webpage template.