Vision-Language Models are moving beyond static 2D image understanding toward dynamic 3D spatial exploration. As VLM-based spatial agents claim to perceive, navigate, and reason in space, they inevitably produce numbers with spatial meaning: coordinates, distances, rotations, and action magnitudes. SpaceNum asks a sharper question: do these numbers truly map to space, or are they only fluent tokens attached to shallow visual cues?
Vision-Language Models (VLMs) are increasingly deployed in embodied environments, where they must produce numerical outputs such as action magnitudes and spatial coordinates. Although these numbers appear meaningful, it remains unclear whether they are genuinely grounded in spatial perception. SpaceNum revisits spatial numerical understanding through a unified framework covering two complementary settings: numbers as dynamic transitions during spatial exploration and numbers as static layouts in spatial reasoning. We formulate two bidirectional tasks, Num2Space and Space2Num, to evaluate how well VLMs map between visual spatial structure and language-side numerical representations. Across 18 VLMs from 6 model families, current models largely fail to ground numbers in spatial meaning, often performing close to random guessing and relying on shallow spatial cues rather than stable coordinate-aware reasoning.
SpaceNum uses simulator-based pipelines to create controlled spatial-number grounding problems. Dynamic transition data is generated in AI2-THOR, where embodied agents execute parameterized movement and rotation actions in indoor scenes. Static layout data is generated in NVIDIA Isaac Sim with BlenderKit assets, enabling controlled scene layouts and ground-truth coordinate annotations.
Dynamic transition samples capture numerical values as action magnitudes. The benchmark covers forward/backward and left/right movement, plus up/down and left/right rotation. Data collection controls action coverage, transition continuity, visual anchoring, and validity: transitions must preserve enough visual overlap, contain sufficient object anchors, and correspond to valid simulator states.
| Action Type | Range | Step |
|---|---|---|
| Move Forward / Backward | 0.2-2.4 m | 0.2 m |
| Move Left / Right | 0.2-1.2 m | 0.2 m |
| Rotate Up / Down | 10-70 deg | 10 deg |
| Rotate Left / Right | 10-70 deg | 10 deg |
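The ranges and steps in the table imply a small discrete set of magnitudes per action type. A minimal sketch of enumerating that set, assuming every multiple of the step inside the range is a candidate value; the dictionary keys and the `magnitude_grid` helper are hypothetical names, not part of the SpaceNum code:

```python
# Hypothetical keys; (start, stop, step) values are copied from the table above.
ACTION_RANGES = {
    "move_forward_backward": (0.2, 2.4, 0.2),  # metres
    "move_left_right": (0.2, 1.2, 0.2),        # metres
    "rotate_up_down": (10, 70, 10),            # degrees
    "rotate_left_right": (10, 70, 10),         # degrees
}


def magnitude_grid(start, stop, step):
    """Return every magnitude from start to stop (inclusive) spaced by step."""
    count = round((stop - start) / step) + 1
    # Index-based construction avoids accumulating floating-point drift;
    # rounding strips representation noise such as 2.4000000000000004.
    return [round(start + i * step, 10) for i in range(count)]


grids = {name: magnitude_grid(*spec) for name, spec in ACTION_RANGES.items()}

for name, values in grids.items():
    print(f"{name}: {len(values)} values, {values[0]} to {values[-1]}")
```

Under this assumption the forward/backward grid has twice as many candidate values as the left/right grid, which is worth keeping in mind when comparing per-action accuracy.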
Static layout samples treat numbers as relative spatial layouts. Each scene defines a coordinate system using two anchor objects: one sets the origin, while the second fixes direction and scale. A target object is then placed with controlled variations in position, size, or both. SpaceNum covers desktop-scale and room-scale scenes, and represents the same layout through 1D, 2D, and 3D coordinate descriptions.
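To make the two-anchor reference frame concrete, here is a minimal 2D sketch. It assumes that the first anchor sits at the origin, that the second anchor sits at (1, 0) so that the distance between the anchors is the unit length, and that the y-axis is the x-axis rotated 90 degrees counter-clockwise. The source only states that the second anchor fixes direction and scale, so these conventions and the `to_anchor_frame` name are illustrative assumptions:

```python
import math


def to_anchor_frame(origin, reference, target):
    """Express a 2D world-space point in the frame defined by two anchors.

    Assumed convention (not confirmed by the source): `origin` maps to (0, 0)
    and `reference` maps to (1, 0).
    """
    ax, ay = reference[0] - origin[0], reference[1] - origin[1]
    scale = math.hypot(ax, ay)  # world distance that equals one frame unit
    if scale == 0:
        raise ValueError("anchors must be at different positions")
    ux, uy = ax / scale, ay / scale  # unit vector along the frame's x-axis
    dx, dy = target[0] - origin[0], target[1] - origin[1]
    x = (dx * ux + dy * uy) / scale   # projection onto the x-axis
    y = (-dx * uy + dy * ux) / scale  # projection onto the perpendicular axis
    return x, y


# Anchors 2 world units apart along world x; target is 1 right and 2 up.
print(to_anchor_frame(origin=(1, 1), reference=(3, 1), target=(2, 3)))
```

Because the coordinates are normalized by the anchor distance, the same numbers describe a desktop-scale and a room-scale layout with the same relative arrangement.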
In Num2Space, the model receives a numerical representation and must select the visual outcome consistent with it. In dynamic transitions, this means predicting the resulting observation from an initial view, action type, and numerical magnitude. In static layouts, this means selecting the observation matching a cognitive map.
In Space2Num, the model receives visual spatial evidence and must infer the corresponding number. In dynamic transitions, it estimates the action magnitude from two observations and an action type. In static layouts, it recovers target-object coordinates under the defined reference frame.
This bidirectional setup is central to SpaceNum: it separates whether models can project known numbers into space from whether they can recover structured numbers from visual spatial observations.
SpaceNum evaluates 18 VLMs from six model families: Qwen2.5-VL, Qwen3-VL, InternVL3.5, Ovis2.5, Cosmos-Reason2, and Gemma-3. The core result is consistent: current VLMs struggle to ground numerical values in spatial meaning. Random guessing yields 30.0% macro-average accuracy, while the best model reaches only 39.8%.
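For readers unfamiliar with the metric, a macro-average is the unweighted mean of per-group accuracies, and the chance level of a multiple-choice question with k options is 1/k. The sketch below illustrates both. The group names and option counts are hypothetical placeholders, not the benchmark's actual task breakdown, and they are not chosen to reproduce the 30.0% figure:

```python
def chance_accuracy(num_options: int) -> float:
    """Expected accuracy of uniform random guessing among num_options choices."""
    if num_options < 2:
        raise ValueError("a multiple-choice question needs at least 2 options")
    return 1.0 / num_options


def macro_average(per_group_accuracy: dict[str, float]) -> float:
    """Unweighted mean over groups, so each group counts equally regardless of size."""
    if not per_group_accuracy:
        raise ValueError("at least one group is required")
    return sum(per_group_accuracy.values()) / len(per_group_accuracy)


# Hypothetical groups and option counts, for illustration only.
option_counts = {"group_a": 4, "group_b": 2}
baseline = macro_average(
    {name: chance_accuracy(k) for name, k in option_counts.items()}
)
print(f"macro-average chance level: {baseline:.3f}")
```

A macro-averaged baseline depends on the mix of option counts across groups, which is why it need not equal 1/k for any single k.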
Several models perform near or below random guess, indicating that plausible numerical outputs do not imply spatial numerical understanding.
Performance stays low across movement and rotation actions, showing a broad failure to model transition magnitudes rather than an isolated action-specific weakness.
Models perform better on simpler 1D and desk-scale layouts but degrade sharply in higher-dimensional and room-scale settings.
Dynamic transitions favor Space2Num over Num2Space, while static layouts favor Num2Space over Space2Num, revealing different visual and language-side bottlenecks.
SpaceNum goes beyond exact-match accuracy by analyzing how wrong answers relate to the correct spatial number. In dynamic transitions, larger models often make numerically closer mistakes even when they fail exact matching. In static layouts, errors are usually coupled across position and size rather than isolated to a single attribute.
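One simple way to separate "wrong but close" from "wrong and far" is to report the mean absolute error over incorrect answers alongside exact-match accuracy. The sketch below is an illustrative metric under that assumption, not necessarily the analysis used in SpaceNum; the `error_profile` name and the sample values are hypothetical:

```python
def error_profile(predictions, targets, tol=1e-9):
    """Return (exact-match accuracy, mean absolute error over wrong answers)."""
    if len(predictions) != len(targets) or not predictions:
        raise ValueError("predictions and targets must be non-empty and aligned")
    errors = [abs(p - t) for p, t in zip(predictions, targets)]
    wrong = [e for e in errors if e > tol]  # tolerance absorbs float noise
    accuracy = 1 - len(wrong) / len(errors)
    mean_error_when_wrong = sum(wrong) / len(wrong) if wrong else 0.0
    return accuracy, mean_error_when_wrong


# Hypothetical movement magnitudes in metres.
targets = [0.4, 1.0, 1.2, 1.2]
predictions = [0.4, 0.8, 1.2, 2.0]
accuracy, near_miss = error_profile(predictions, targets)
print(f"exact match: {accuracy:.2f}, mean error when wrong: {near_miss:.2f} m")
```

Two models with the same exact-match accuracy can differ sharply on the second number, which is the distinction the error analysis above draws.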
Enabling explicit reasoning produces only marginal changes, typically within 1%. Trace analysis shows three recurring failure modes: models stop at coarse spatial cues without fine-grained comparison, fail to reason counterfactually about motion magnitude, and reason in image-space coordinates rather than the task-defined coordinate system.
Blind testing shows that dynamic transitions depend more strongly on visual grounding, while static layouts can sometimes be partially solved through language-side priors. Per-action analysis further shows that Space2Num usually outperforms Num2Space for dynamic actions, and rotational symmetry tests reveal weak geometric invariance in the mapping from vision to numbers.
Simple visual interventions, such as adding anchors for dynamic transitions or reducing irrelevant layout objects, produce only minor and inconsistent changes. Numerical representation changes also have limited effect. The strongest diagnostic signal appears when layout images are replaced with structured visual abstractions such as points, 2D boxes, and 3D boxes, indicating that the main bottleneck is vision-to-structure abstraction.
SpaceNum also studies whether spatial numerical understanding can be improved through tuning. LoRA tuning on Qwen3-VL-4B and Qwen3-VL-8B shows that lower-dimensional spatial reasoning can partially transfer to higher-dimensional settings, but transfer remains limited. The best data recipe uses a layout-heavy mixture, with roughly 25% transition data and 75% layout data.
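A data recipe like this amounts to sampling a fixed fraction of training examples from each source. A minimal sketch of building such a mixture, assuming simple sampling without replacement from two in-memory pools; the `mix_datasets` helper and the placeholder samples are hypothetical, and the actual SpaceNum tuning pipeline may differ:

```python
import random


def mix_datasets(transition, layout, total, transition_frac=0.25, seed=0):
    """Sample `total` examples with roughly `transition_frac` from transitions."""
    n_transition = round(total * transition_frac)
    n_layout = total - n_transition
    if n_transition > len(transition) or n_layout > len(layout):
        raise ValueError("not enough examples in a pool for the requested mix")
    rng = random.Random(seed)  # seeded for a reproducible mixture
    mixed = rng.sample(transition, n_transition) + rng.sample(layout, n_layout)
    rng.shuffle(mixed)  # interleave the two sources
    return mixed


# Placeholder samples standing in for real training examples.
transition_pool = [("transition", i) for i in range(100)]
layout_pool = [("layout", i) for i in range(300)]
train_set = mix_datasets(transition_pool, layout_pool, total=200)

n_transition = sum(1 for kind, _ in train_set if kind == "transition")
print(f"{n_transition} transition, {len(train_set) - n_transition} layout")
```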
| External Benchmark | Qwen3-VL-4B Delta | Qwen3-VL-8B Delta | What Improves |
|---|---|---|---|
| OmniSpatial Motion | +5.5 | +4.5 | Camera movement understanding |
| SAT Action Consequence | +8.1 | +18.9 | Action outcome reasoning |
| SAT Object Movement | +34.8 | +43.5 | Object dynamics reasoning |
SpaceNum shows that current VLMs largely fail to ground numerical values in spatial meaning across both dynamic transitions and static layouts. The failures arise from weak spatial abstraction, asymmetric vision-number mappings, and an inability to build stable coordinate-aware representations. Tuning partially improves performance and transfers to external spatial reasoning benchmarks, but substantial gaps remain.
We would like to thank the Cambrian authors for providing this webpage template.