Jianshu Zhang¹* Yijiang Li²* Huifeixin Chen³ Haoran Lu¹ Letian Xue¹ Bingyang Wang⁴ Han Liu¹

¹Northwestern University ²UC San Diego ³University of Southern California ⁴Georgia Tech

*Equal Contribution

Vision-Language Models (VLMs) are increasingly deployed in embodied environments, where they must produce numerical outputs such as action magnitudes and spatial coordinates. Although these numbers appear meaningful, it remains unclear whether they are genuinely grounded in spatial perception. SpaceNum revisits spatial numerical understanding through a unified framework covering two complementary settings: numbers as dynamic transitions during spatial exploration and numbers as static layouts in spatial reasoning. We formulate two bidirectional tasks, Num2Space and Space2Num, to evaluate how well VLMs map between visual spatial structure and language-side numerical representations. Across 18 VLMs from 6 model families, current models largely fail to ground numbers in spatial meaning: they often perform close to random guessing and rely on shallow spatial cues rather than stable coordinate-aware reasoning.

SpaceNum overview — **Figure 1:** SpaceNum studies spatial numerical understanding under two settings: *numbers as dynamic transitions* in spatial exploration and *numbers as static layouts* in spatial understanding. The benchmark evaluates two directions: Num2Space, mapping numbers to visual outcomes, and Space2Num, mapping visual inputs to numbers.

SpaceNum Benchmark Design

SpaceNum uses simulator-based pipelines to create controlled spatial-number grounding problems. Dynamic transition data is generated in AI2-THOR, where embodied agents execute parameterized movement and rotation actions in indoor scenes. Static layout data is generated in NVIDIA Isaac Sim with BlenderKit assets, enabling controlled scene layouts and ground-truth coordinate annotations.

3,800 benchmark samples

77,412 automatically generated training samples

18 VLMs evaluated across 6 families

Numbers as Dynamic Transitions

Dynamic transition samples capture numerical values as action magnitudes. The benchmark covers forward/backward and left/right movement, plus up/down and left/right rotation. Data collection controls action coverage, transition continuity, visual anchoring, and validity: transitions must preserve enough visual overlap, contain sufficient object anchors, and correspond to valid simulator states.

Action Type	Range	Step
Move Forward / Backward	0.2-2.4 m	0.2 m
Move Left / Right	0.2-1.2 m	0.2 m
Rotate Up / Down	10-70 deg	10 deg
Rotate Left / Right	10-70 deg	10 deg
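The grid in the table can be enumerated programmatically. The following is a minimal sketch, assuming both endpoints of each range are inclusive; the dictionary keys and the `magnitudes` helper are illustrative names, not part of the SpaceNum pipeline.

```python
from fractions import Fraction

# (low, high, step, unit) copied from the table above.
# The key names are illustrative, not identifiers from the benchmark.
ACTION_GRID = {
    "move_forward_backward": ("0.2", "2.4", "0.2", "m"),
    "move_left_right": ("0.2", "1.2", "0.2", "m"),
    "rotate_up_down": ("10", "70", "10", "deg"),
    "rotate_left_right": ("10", "70", "10", "deg"),
}


def magnitudes(low, high, step):
    """Enumerate low, low + step, ..., high (inclusive, by assumption).

    Exact fractions avoid floating-point drift when counting steps.
    """
    low, high, step = Fraction(low), Fraction(high), Fraction(step)
    count = int((high - low) / step) + 1
    return [float(low + i * step) for i in range(count)]


for name, (low, high, step, unit) in ACTION_GRID.items():
    values = magnitudes(low, high, step)
    print(f"{name}: {len(values)} magnitudes from {values[0]} to {values[-1]} {unit}")
```

Under the inclusive-endpoint assumption, this gives 12 forward/backward magnitudes, 6 left/right magnitudes, and 7 magnitudes for each rotation type.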

Numbers as Static Layouts

Static layout samples treat numbers as relative spatial layouts. Each scene defines a coordinate system using two anchor objects: one sets the origin, while the second fixes direction and scale. A target object is then placed with controlled variations in position, size, or both. SpaceNum covers desktop-scale and room-scale scenes, and represents the same layout through 1D, 2D, and 3D coordinate descriptions.
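To make the anchor-defined reference frame concrete, here is a minimal 2D sketch. It assumes the first anchor maps to (0, 0), the second anchor maps to (1, 0), and the y-axis is the x-axis rotated 90 degrees counter-clockwise. These conventions and the function name `to_anchor_frame` are assumptions for illustration; the benchmark's exact axis conventions may differ.

```python
import math


def to_anchor_frame(origin_anchor, scale_anchor, target):
    """Express a 2D world point in the frame defined by two anchors.

    Assumed convention: origin_anchor -> (0, 0), scale_anchor -> (1, 0).
    """
    ox, oy = origin_anchor
    ax, ay = scale_anchor[0] - ox, scale_anchor[1] - oy
    unit_len = math.hypot(ax, ay)  # anchor distance defines one unit
    if unit_len == 0:
        raise ValueError("anchors must be distinct")
    ux, uy = ax / unit_len, ay / unit_len  # unit x-direction of the frame
    tx, ty = target[0] - ox, target[1] - oy
    x = (tx * ux + ty * uy) / unit_len   # projection on the x-axis
    y = (-tx * uy + ty * ux) / unit_len  # projection on the rotated y-axis
    return (x, y)


# Anchors at (1, 1) and (3, 1): one frame unit equals 2 world units.
print(to_anchor_frame((1, 1), (3, 1), (2, 3)))  # (0.5, 1.0)
```

Because coordinates are normalized by the anchor distance, the same target yields the same numbers regardless of the absolute world scale, which is what lets a single description apply to both desktop-scale and room-scale scenes.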

SpaceNum dataset statistics — **Figure 2:** Dataset statistics for SpaceNum. The benchmark contains 3,800 evaluation samples, and the same automatic pipeline produces 77,412 additional training samples for tuning studies.

Bidirectional Spatial-Number Grounding Tasks

Num2Space

The model receives a numerical representation and must select the visual outcome consistent with it. In dynamic transitions, this means predicting the resulting observation from an initial view, action type, and numerical magnitude. In static layouts, this means selecting the observation matching a cognitive map.

Space2Num

The model receives visual spatial evidence and must infer the corresponding number. In dynamic transitions, it estimates the action magnitude from two observations and an action type. In static layouts, it recovers target-object coordinates under the defined reference frame.

This bidirectional setup is central to SpaceNum: it separates whether models can project known numbers into space from whether they can recover structured numbers from visual spatial observations.

Main Results

SpaceNum evaluates 18 VLMs from Qwen2.5-VL, Qwen3-VL, InternVL3.5, Ovis2.5, Cosmos-Reason2, and Gemma-3. The core result is consistent: current VLMs struggle to ground numerical values in spatial meaning. Random guessing yields 30.0% macro-average accuracy, while the best model reaches only 39.8%.
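A macro-average weights every task category equally, regardless of how many samples each contains. The sketch below shows one plausible way to compute it, together with a chance baseline for multiple-choice tasks. The task names and option counts are hypothetical and are not the benchmark's actual configuration.

```python
def macro_average(per_task_accuracy):
    """Unweighted mean of per-task accuracies."""
    return sum(per_task_accuracy.values()) / len(per_task_accuracy)


def chance_accuracy(num_options):
    """Expected accuracy of uniform random guessing for each task."""
    return {task: 1.0 / k for task, k in num_options.items()}


# Hypothetical option counts, chosen only to illustrate the arithmetic.
options = {"task_a": 4, "task_b": 2}
print(macro_average(chance_accuracy(options)))  # 0.375
```

When tasks differ in their number of answer options, the macro-averaged chance level does not correspond to any single task's chance rate, so model scores should be compared against the averaged baseline.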

**Table 1:** Main SpaceNum benchmark results across dynamic transition and static layout settings.

Current VLMs do not robustly ground numbers in space.

Several models perform near or below random guess, indicating that plausible numerical outputs do not imply spatial numerical understanding.

Dynamic transitions are broadly difficult.

Performance stays low across movement and rotation actions, showing a broad failure to model transition magnitudes rather than an isolated action-specific weakness.

Static layout difficulty grows with spatial complexity.

Models perform better on simpler 1D and desktop-scale layouts but degrade sharply in higher-dimensional and room-scale settings.

Mapping direction is asymmetric.

Dynamic transitions favor Space2Num over Num2Space, while static layouts favor Num2Space over Space2Num, revealing different visual and language-side bottlenecks.

Diagnostic Analysis

Structured Output Errors

SpaceNum goes beyond exact-match accuracy by analyzing how wrong answers relate to the correct spatial number. In dynamic transitions, larger models often make numerically closer mistakes even when they fail exact matching. In static layouts, errors are usually coupled across position and size rather than isolated to a single attribute.

Dynamic transition error proximity — **Figure 3:** Larger models tend to make less severe dynamic-transition errors.

Static layout error decomposition — **Figure 4:** Static layout failures are dominated by coupled position-and-size errors.
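The two error analyses above can be sketched as follows. This is an illustrative reconstruction, not the authors' evaluation code: it assumes transition answers are scalar magnitudes and layout answers are records with `position` and `size` fields, both of which are hypothetical formats.

```python
from collections import Counter


def transition_error(pred, truth):
    """Absolute distance between predicted and true action magnitude."""
    return abs(pred - truth)


def layout_error_type(pred, truth):
    """Classify a layout answer by which attributes are wrong."""
    pos_ok = pred["position"] == truth["position"]
    size_ok = pred["size"] == truth["size"]
    if pos_ok and size_ok:
        return "correct"
    if pos_ok:
        return "size_only"
    if size_ok:
        return "position_only"
    return "coupled"


# Hypothetical predictions, for illustration only.
truth = {"position": (1, 2), "size": "large"}
preds = [
    {"position": (1, 2), "size": "large"},
    {"position": (0, 2), "size": "small"},
    {"position": (1, 2), "size": "small"},
]
print(Counter(layout_error_type(p, truth) for p in preds))
```

Counting error types this way separates answers that miss a single attribute from those that miss both at once.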

Why Reasoning Traces Do Not Solve the Problem

Enabling explicit reasoning produces only marginal changes, typically within 1%. Trace analysis shows three recurring failure modes: models stop at coarse spatial cues without fine-grained comparison, fail to reason counterfactually about motion magnitude, and reason in image-space coordinates rather than the task-defined coordinate system.

Modality Asymmetry and Geometric Consistency

Blind testing shows that dynamic transitions depend more strongly on visual grounding, while static layouts can sometimes be partially solved through language-side priors. Per-action analysis further shows that Space2Num usually outperforms Num2Space for dynamic actions, and rotational symmetry tests reveal weak geometric invariance in the mapping from vision to numbers.

Action-level mapping asymmetry — **Figure 6:** Per-action comparison between Num2Space and Space2Num.

Rotation symmetry analysis — **Figure 7:** Rotational symmetry analysis. Equivalent transformations should lead to consistent numerical predictions, but current models show substantial drops under symmetric variants.
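One way to quantify such a drop is to pair each original sample with its symmetric variant and compare accuracies. The sketch below assumes each pair carries a prediction and ground truth for both versions; this pairing format and the function name `symmetry_report` are assumptions, not the paper's protocol.

```python
def symmetry_report(pairs):
    """Summarize accuracy on originals versus symmetric variants.

    pairs: list of (orig_pred, orig_truth, variant_pred, variant_truth).
    """
    n = len(pairs)
    orig_acc = sum(p == t for p, t, _, _ in pairs) / n
    variant_acc = sum(vp == vt for _, _, vp, vt in pairs) / n
    both_acc = sum(p == t and vp == vt for p, t, vp, vt in pairs) / n
    return {
        "original": orig_acc,
        "variant": variant_acc,
        "both_correct": both_acc,
        "drop": orig_acc - variant_acc,
    }


# Hypothetical rotation predictions in degrees, for illustration only.
pairs = [(30, 30, 30, 30), (20, 20, 40, 20), (10, 50, 10, 50), (60, 60, 60, 60)]
print(symmetry_report(pairs))
```

The `both_correct` rate is a stricter measure than either accuracy alone, since a geometrically consistent model should answer both members of a pair correctly.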

Disentangling Visual and Numerical Factors

Simple visual interventions, such as adding anchors for dynamic transitions or reducing irrelevant layout objects, produce only minor and inconsistent changes. Numerical representation changes also have limited effect. The strongest diagnostic signal appears when layout images are replaced with structured visual abstractions such as points, 2D boxes, and 3D boxes, indicating that the main bottleneck is vision-to-structure abstraction.

Visual-side interventions — **Figure 8:** Visual-side intervention examples for dynamic transitions and static layouts.

Layout abstraction — **Figure 9:** Structured visual abstractions substantially improve Space2Num layout reasoning.

Tuning Spatial Numerical Understanding

SpaceNum also studies whether spatial numerical understanding can be improved through tuning. LoRA tuning on Qwen3-VL-4B and Qwen3-VL-8B shows that lower-dimensional spatial reasoning can partially transfer to higher-dimensional settings, but transfer remains limited. The best data recipe uses a layout-heavy mixture, with roughly 25% transition data and 75% layout data.

Cross-dimension tuning transfer — **Figure 10:** Cross-dimension transfer patterns after tuning.

Training data mixture and scaling — **Figure 11:** Data mixture and scale both affect spatial numerical understanding.
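A layout-heavy mixture like the one described can be sampled as in the sketch below. The pool contents, the budget, and the function name `sample_mixture` are hypothetical; only the roughly 25% transition to 75% layout ratio comes from the text above.

```python
import random


def sample_mixture(transition_pool, layout_pool, total, transition_fraction=0.25, seed=0):
    """Draw a shuffled training set with a fixed transition/layout ratio."""
    rng = random.Random(seed)  # seeded for reproducibility
    n_transition = round(total * transition_fraction)
    n_layout = total - n_transition
    mixed = rng.sample(transition_pool, n_transition) + rng.sample(layout_pool, n_layout)
    rng.shuffle(mixed)
    return mixed


# Hypothetical pools of (kind, index) records, for illustration only.
transition_pool = [("transition", i) for i in range(100)]
layout_pool = [("layout", i) for i in range(100)]
mixed = sample_mixture(transition_pool, layout_pool, total=8)
print(sum(kind == "transition" for kind, _ in mixed), "transition samples of", len(mixed))
```

Sampling without replacement requires each pool to hold at least as many items as requested from it; otherwise `random.sample` raises a `ValueError`.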

External Benchmark	Qwen3-VL-4B Delta	Qwen3-VL-8B Delta	What Improves
OmniSpatial Motion	+5.5	+4.5	Camera movement understanding
SAT Action Consequence	+8.1	+18.9	Action outcome reasoning
SAT Object Movement	+34.8	+43.5	Object dynamics reasoning

Conclusion

SpaceNum shows that current VLMs largely fail to ground numerical values in spatial meaning across both dynamic transitions and static layouts. The failures arise from weak spatial abstraction, asymmetric vision-number mappings, and an inability to build stable coordinate-aware representations. Tuning partially improves performance and transfers to external spatial reasoning benchmarks, but substantial gaps remain.

Acknowledgement

We would like to thank the Cambrian authors for providing this webpage template.

SpaceNum: Revisiting Spatial Numerical Understanding in VLMs