
arxiv: 2605.22283 · v1 · pith:XZWNON3Onew · submitted 2026-05-21 · 💻 cs.RO

Spatial Memory for Out-of-Vision Manipulation in Vision-Language-Action

Pith reviewed 2026-05-22 05:54 UTC · model grok-4.3

classification 💻 cs.RO
keywords spatial memory · out-of-vision manipulation · vision-language-action · VLA models · partial observability · robot manipulation · multi-view aggregation

The pith

SOMA equips VLA models with persistent spatial memory to manipulate objects initially outside the camera view.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

This paper proposes SOMA to give vision-language-action models the ability to handle cases where important objects are not currently visible. It builds a lasting spatial memory by scanning with a movable head camera and keeps that memory consistent while pulling out relevant parts for the current task. Without this, models tend to fail or waste time searching when targets disappear from view. If the claim holds, robots could perform more complex tasks in real settings with less reliance on perfect visibility at every moment.

Core claim

The central discovery is that a spatial memory constructed from multi-view observations via a movable head camera, refined dynamically for consistency, and retrieved contextually during action, allows VLAs to reason and act effectively on targets that are out of the current visual field, resulting in higher success rates and more efficient behaviors such as reduced search and one-shot grasping.

What carries the argument

The three-part spatial memory framework: construction by aggregating angular observations into a unified representation, dynamic refinement to maintain consistency, and contextual retrieval of instruction-relevant cues.
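
To make the three parts concrete, here is a minimal sketch of a memory with construct, refine, and retrieve operations. It is an illustration under stated assumptions, not the authors' implementation: the names `MemoryEntry`, `SpatialMemory`, `cosine`, and the `alpha` blending weight are hypothetical, entries are keyed by label (so two instances of the same class would be merged), and a plain exponential moving average stands in for whatever refinement rule SOMA actually uses.

```python
import math
from dataclasses import dataclass, field


@dataclass
class MemoryEntry:
    label: str        # semantic label from a detector (hypothetical field)
    position: tuple   # (x, y, z) in a shared reference frame (assumed)
    feature: list     # semantic embedding; a toy vector in this sketch


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


@dataclass
class SpatialMemory:
    entries: dict = field(default_factory=dict)

    def construct(self, scans):
        """Aggregate per-angle observations into one memory keyed by label."""
        for observations in scans:      # one list of entries per head angle
            for obs in observations:
                self.refine(obs)

    def refine(self, obs, alpha=0.5):
        """Blend a new observation into an existing entry (EMA stand-in)."""
        old = self.entries.get(obs.label)
        if old is None:
            self.entries[obs.label] = obs
            return
        pos = tuple((1 - alpha) * o + alpha * n
                    for o, n in zip(old.position, obs.position))
        feat = [(1 - alpha) * o + alpha * n
                for o, n in zip(old.feature, obs.feature)]
        self.entries[obs.label] = MemoryEntry(obs.label, pos, feat)

    def retrieve(self, query_feature, top_k=1):
        """Return the entries most similar to an instruction embedding."""
        ranked = sorted(self.entries.values(),
                        key=lambda e: cosine(e.feature, query_feature),
                        reverse=True)
        return ranked[:top_k]
```

In this toy version the retrieved entry's stored position is what would let a policy act on an object that is no longer in view; how SOMA feeds retrieved cues into the VLA is not specified on this page.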

If this is right

  • Task success rates rise on five real-world out-of-vision manipulation tasks including multi-step and dual-arm scenarios.
  • Behaviors shift to faster target localization with less viewpoint searching.
  • Near one-shot grasping occurs under partial observability.
  • The memory approach also improves or maintains performance in fully observable simulation settings like RoboCasa and SimplerEnv.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • This memory mechanism could be adapted to other robot perception systems facing similar visibility constraints.
  • Explicit spatial representations might become a standard addition to VLA architectures to handle real-world uncertainty.
  • Reducing dependence on continuous visual search could lower energy use and wear on robot actuators.

Load-bearing premise

The approach assumes that angular-wise observations from a movable head camera can be reliably aggregated into a unified spatial-semantic representation that remains globally consistent over time without significant drift or semantic errors in real-world conditions.

What would settle it

Running the system in an environment where camera scans accumulate positional drift or misidentify objects would show whether the memory actually supports the claimed improvements or collapses to baseline performance.
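
A drift check of this kind could be as simple as the sketch below, which compares the memory's position estimate after each scan against a fixed reference position. The function names and the idea of an externally measured reference (for example, from a motion-capture rig) are assumptions for illustration; the paper as summarized here does not report such a metric.

```python
import math


def positional_drift(estimates, reference):
    """Euclidean distance from each scan's estimate to a fixed reference."""
    return [math.dist(p, reference) for p in estimates]


def drift_summary(estimates, reference):
    """Mean and final drift over a sequence of scans."""
    drift = positional_drift(estimates, reference)
    return {"mean": sum(drift) / len(drift), "final": drift[-1]}
```

A final drift that keeps growing with the number of scans would point to accumulating error, while a flat curve would support the consistency premise.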

Figures

Figures reproduced from arXiv: 2605.22283 by He Zhang, Hui Xiong, Pengteng Li, Tiefu Cai, Weiyu Guo, Xiao He, Yandong Guo.

Figure 1. Illustration of the Out-of-Vision (OOV) limitation in existing VLA models. Most VLAs rely on purely reactive perception: actions are driven only by what is visible in the current view. When the target moves outside the field of view, perception can no longer support manipulation, leading to inevitable task failure.
1. Introduction The development of VLAs has become a central direction in robotic action m…
Figure 2. Illustration of the proposed SOMA framework. SOMA enhances OOV manipulation via spatial memory. (A) Spatial Memory Construction: Before manipulation, if the specified objects are not in the current observation, the robot actively scans the scene to construct a unified spatial–semantic memory $M_0$ by integrating object semantics (YOLO (Cheng et al., 2024), DINOv3 (Siméoni et al., 2025)) with geometric cues…
Figure 3. Illustration of our real-world benchmark settings. We design five challenging out-of-vision pick-and-place (PnP) tasks to evaluate the robot’s OOV manipulation capabilities. Tasks (1–3) require the robot to use its left arm to pick up the object located outside the current field of view and place it into the basket. Task (4) involves sequential pick-and-place of the object followed by the other object into…
Figure 4. Performance comparison across five real-world out-of-vision tasks. “Fixed Head Camera” denotes SOMA trained under a fixed head camera setting. StarVLA (Ye et al., 2026; Community, 2026), SpatialVLA (Qu et al., 2025), GR00T N1.5 (Bjorck et al., 2025), and our proposed SOMA are trained with an active head camera for a fair comparison. We evaluate each model over 20 episodes and report the average success rate (SR) across multiple stages. Be…
Figure 5. Illustration of task execution examples for our five challenging out-of-vision tasks in the real world using our proposed SOMA.
experiments, the Spatial Memory Construction phase is triggered by a perception-based criterion: a lightweight object detector checks whether the target specified in the instruction is visible in the current view. If the target cannot be localized, SOMA initiates an active head-scan…
Figure 7. Illustration of the designed VR teleoperation system using Meta Oculus Quest 3.
A.2. Real World Data Quality Review To ensure the reliability of teleoperated demonstrations collected in real-world settings, we conduct a frame-by-frame, trajectory-level data quality review. This manual inspection is designed to filter out sequences affected by latency, sensor artifacts, operator mistakes, or camera capture…
Figure 6. Illustration of our self-designed robot construction.
Figure 8. Analysis of time and GPU cost per demo for different component choices in the memory preprocessing stage.
raw 3D bounding box $b_k \in \mathbb{R}^{8 \times 3}$ into a stable spatial descriptor $p_k = \Phi_{\mathrm{pos}}(b_k)$. It is used in Spatial Memory Construction to encode viewpoint-invariant geometry, ensuring that all instances are anchored in a consistent spatial reference frame. Spatial Positional Refinement. A second-stage MLP refines $p_k$…
Figure 9. Illustration of more execution examples including (1) Sequential Dual-Object and (2) Dual-Arm Coordination using our SOMA.
Figure 10. Qualitative results of the proposed SOMA framework on the Robocasa Tabletop GR1 benchmark (Bjorck et al., 2025; Nasiriany et al., 2024). The environment’s high-DOF control (arms, hands, waist) generates dynamic and highly variable head camera viewpoints, demonstrating SOMA’s robust capability in utilizing its designed spatial memory amidst complex visual instability.
Figure 11. Qualitative results on the SimplerEnv Fractal suite (Li et al., 2024; Zitkovich et al., 2023) using our proposed SOMA.
Original abstract

We introduce SOMA, the Spatial Memory framework for Out-of-Vision Manipulation in Vision-Language-Action (VLA) models. Most existing VLAs implicitly assume that task-relevant objects are always visible, leading to brittle and reactive behaviors when targets fall outside the camera's field of view. SOMA addresses this limitation by equipping VLAs with a persistent spatial memory constructed from multi-view observations acquired via a movable head camera, enabling reasoning beyond the current visual frustum. The framework consists of three components: Spatial Memory Construction, which aggregates angular-wise observations into a unified spatial-semantic representation through scanning; Dynamic Memory Refinement, which maintains global consistency over time; and Contextual Memory Retrieval, which activates instruction-relevant spatial cues during manipulation. We evaluate SOMA on five challenging real-world out-of-vision manipulation tasks, including multi-step and dual-arm scenarios where target objects are initially invisible. Experimental results show that SOMA not only improves task success rates, but also induces qualitatively different manipulation behaviors, with faster target localization, reduced viewpoint search, and near one-shot grasping under partial observability. Additional experiments on RoboCasa GR1 and SimplerEnv further validate the effectiveness of SOMA's memory design under conventional fully observable settings. Code will be released soon.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper introduces SOMA, a spatial memory framework for Vision-Language-Action models to handle out-of-vision manipulation. It builds a persistent spatial-semantic representation by aggregating multi-view observations from a movable head camera via Spatial Memory Construction, maintains consistency through Dynamic Memory Refinement, and retrieves relevant cues with Contextual Memory Retrieval. Experiments on five real-world out-of-vision tasks (including multi-step and dual-arm) plus RoboCasa GR1 and SimplerEnv simulations report improved success rates and qualitatively different behaviors such as faster target localization, reduced viewpoint search, and near one-shot grasping under partial observability.

Significance. If the empirical claims hold with proper validation, this work addresses a key limitation in current VLAs (implicit full observability) and could meaningfully advance robust robotic manipulation in unstructured, partially observable environments. The focus on inducing qualitatively different behaviors, rather than incremental gains, is a notable strength, as is the promised release of code for reproducibility (the abstract states that code will be released soon).

major comments (2)
  1. [§4.3 and §3.1] §4.3 (Dynamic Memory Refinement) and §3.1 (Spatial Memory Construction): the central claim that multi-view angular observations yield a globally consistent representation enabling 'near one-shot grasping' and 'reduced viewpoint search' requires quantitative evidence of cumulative localization error, semantic label stability, and failure modes under realistic head-camera motion, calibration drift, or occlusions; without metrics (e.g., positional drift after N scans), the attribution to memory rather than base VLA policy remains unverified.
  2. [§5] §5 (Experimental Results) and associated tables: the reported improvements in task success rates on the five real-world tasks lack explicit numerical values, strong baselines, ablations isolating each component, or statistical tests, which are load-bearing for supporting both the quantitative gains and the qualitative behavior claims.
minor comments (2)
  1. [Abstract] Abstract: adding concrete success-rate deltas or key quantitative highlights would improve readability without altering the narrative.
  2. [Figures] Figure 3 or equivalent (memory visualization): ensure scale bars, coordinate frames, and error annotations are present to allow readers to assess spatial consistency directly.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on our work. The comments highlight important opportunities to strengthen the empirical validation of SOMA's spatial memory components. We address each major comment below and describe the revisions we will make to the manuscript.

  1. Referee: [§4.3 and §3.1] §4.3 (Dynamic Memory Refinement) and §3.1 (Spatial Memory Construction): the central claim that multi-view angular observations yield a globally consistent representation enabling 'near one-shot grasping' and 'reduced viewpoint search' requires quantitative evidence of cumulative localization error, semantic label stability, and failure modes under realistic head-camera motion, calibration drift, or occlusions; without metrics (e.g., positional drift after N scans), the attribution to memory rather than base VLA policy remains unverified.

    Authors: We agree that quantitative metrics on memory consistency would provide stronger support for attributing performance gains to the proposed components. The current manuscript focuses on end-task success and observed behavioral changes, but does not report explicit measures such as positional drift, label stability, or detailed failure analysis under motion and occlusion. In the revised manuscript we will add a dedicated subsection with these metrics from our real-world setups, including average positional drift after successive scans, semantic consistency scores, and documented failure cases. This addition will help clarify the contribution of the memory framework relative to the base VLA policy. revision: yes

  2. Referee: [§5] §5 (Experimental Results) and associated tables: the reported improvements in task success rates on the five real-world tasks lack explicit numerical values, strong baselines, ablations isolating each component, or statistical tests, which are load-bearing for supporting both the quantitative gains and the qualitative behavior claims.

    Authors: We acknowledge that the experimental section would benefit from more explicit numerical reporting, additional baselines, component ablations, and statistical analysis. While the manuscript presents success rates across the five tasks and notes qualitative differences, the tables and text do not include all requested details. In the revision we will expand the results section and tables to report precise success percentages, introduce stronger baselines (including memory-ablated variants), provide ablations for each of the three modules, and include statistical significance tests over repeated trials. These changes will better substantiate both the quantitative improvements and the claims of qualitatively different behaviors. revision: yes
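
For a sense of what the requested statistical reporting involves at this sample size, the sketch below computes a Wilson score interval for a success rate. The choice of the Wilson interval and the function name are assumptions made here for illustration; neither the referee report nor the rebuttal names a specific test, and the figure of 20 episodes comes from the Figure 4 caption.

```python
import math


def wilson_interval(successes, trials, z=1.96):
    """Wilson score interval for a binomial proportion (z=1.96 is ~95%)."""
    if trials <= 0:
        raise ValueError("trials must be positive")
    p = successes / trials
    denom = 1 + z * z / trials
    centre = (p + z * z / (2 * trials)) / denom
    half = z * math.sqrt(p * (1 - p) / trials
                         + z * z / (4 * trials * trials)) / denom
    # Clamp to [0, 1] to absorb floating-point overshoot at the extremes.
    return max(0.0, centre - half), min(1.0, centre + half)
```

As a hypothetical example, 15 successes in 20 episodes gives an interval of roughly 0.53 to 0.89. With intervals that wide, moderate gaps between methods can overlap, which is why the referee's request for repeated trials and significance tests matters.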

Circularity Check

0 steps flagged

No significant circularity; empirical framework validated by task performance

The paper proposes SOMA, a spatial memory framework with three components (Spatial Memory Construction via scanning, Dynamic Memory Refinement, and Contextual Memory Retrieval) to enable out-of-vision manipulation in VLAs. Claims of improved success rates, faster localization, and near one-shot grasping rest entirely on experimental results from five real-world tasks plus RoboCasa and SimplerEnv evaluations. No equations, mathematical derivations, fitted parameters renamed as predictions, or self-citation chains appear in the provided text. The central contribution is an empirical system whose effectiveness is measured directly against external benchmarks rather than reducing to its own inputs by construction.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The framework rests on standard robotics assumptions about sensor fusion and memory persistence rather than new postulated entities or heavily fitted parameters.

axioms (2)
  • domain assumption: Multi-view observations acquired via a movable head camera can be aggregated into a unified spatial-semantic representation through scanning.
    This premise underpins the Spatial Memory Construction component and is required for the memory to be usable beyond the current visual frustum.
  • domain assumption: Dynamic Memory Refinement can maintain global consistency over time without significant drift.
    Invoked to support long-horizon and multi-step tasks under partial observability.

pith-pipeline@v0.9.0 · 5766 in / 1385 out tokens · 72668 ms · 2026-05-22T05:54:49.800821+00:00 · methodology

Reference graph

Works this paper leans on

20 extracted references · 20 canonical work pages

  1. [1]

    Excessive Actuation Latency. Both gripper and joint commands are monitored for temporal alignment. Any sequence exhibiting ≥300 ms delay between user input and recorded robot actuation is marked invalid, as such latency typically arises from communication bottlenecks or onboard processing delays and leads to distorted motion supervision

  2. [2]

    Unintentional Gripper Oscillation. Gripper trajectories containing rapid, continuous open–close toggling are flagged as operator mis-triggers and discarded, as these patterns do not reflect meaningful manipulation intent

  3. [3]

    Gripper Value Spikes. The gripper sensor normally outputs values within the 0–1000 range. We label as corrupted any trajectory containing systematic spiking (e.g., 1001–1010) caused by low-level hardware protocol noise or overflow. Such discontinuities compromise the smoothness required for imitation learning

  4. [4]

    Camera View Contamination. Demonstrations are rejected if the robot cameras capture large human-body intrusions, missing robot arms, or severe motion blur from aggressive operator movement. These issues hinder visual–kinematic alignment and degrade learning

  5. [5]

    Invalid Reach Strategy. For humanoid-arm grasping, we enforce a consistent approach strategy. Specifically, trajectories employing top-down, table-type grasping are filtered out, as such motions contradict the intended horizontal, in-hand-level grasping typical for humanoid robot...

  6. [6]

    Video–Joint Stream Misalignment (Frame Dropping). Raw RGB-D recordings are capped at 25 FPS. Any sequence with ≥2.5 dropped frames is removed to avoid misalignment between visual frames and joint trajectories, which is critical for paired multimodal supervision. Across all criteria, the review ensures that only temporally aligned, semantically valid, and ...

  7. [7]

    Invisible-to-Invisible PnP requires the robot to grasp a target object that is outside the current field of view and place it at a target location that is also outside the field of view, testing the ability to operate entirely beyond direct visual observation

  8. [8]

    Visible-to-Invisible PnP involves grasping a target object that is initially visible and placing it at a target location outside the current view, evaluating whether spatial information about unseen goal locations can be recalled and utilized

  9. [9]

    Invisible-to-Visible PnP requires the robot to grasp an object that is initially outside the field of view and place it at a visible target location, testing spatial recall of unseen objects followed by precise execution

  10. [10]

    Sequential Dual-Object PnP extends the setting to multi-stage manipulation: the robot first performs pick-and-place for a visible object, and subsequently grasps and places a second object located outside the field of view, stressing memory persistence across sequential task stages

  11. [11]

    Dual-Arm Coordination PnP represents the most challenging scenario, where both the target object and target placement location are outside the field of view, and successful execution requires coordinated dual-arm handover followed by accurate placement based on a globally consistent spatial representation. Each task involves three distinct objects and was collected with 400 expert demonstrations

  12. [12]

    Container Interaction Tasks (e.g., placing items into cabinets, drawers, or microwaves and subsequently closing them), testing multi-stage pick–place–close behaviors with articulated receptacles

  13. [13]

    Cooking Preparation Tasks (e.g., transferring objects from a cutting board into pans, pots, or baskets), simulating preparatory cooking steps that require stable grasping and reliable object placement

  14. [14]

    Tabletop Serving Tasks (e.g., moving items from a placemat to bowls or plates), capturing serving-style behaviors that depend on precise spatial reasoning

  15. [15]

    Dish Transfer Tasks (e.g., moving objects between dishware such as plate→bowl), evaluating fine-grained coordination during mid-meal or cleanup scenarios

  16. [16]

    Tray Organization Tasks (e.g., transporting items from a tray to multi-level shelves or containers), assessing multi-target placement and organized spatial planning. [Table header: Task Category; Diffusion Policy (Chi et al., 2025); GR00T N1.5 (Bjorck et al., 2025); SOMA (Ours); 30, 100, 300, Full, 30, 100, 30...]

  17. [17]

    “SimEMA” denotes the normal EMA update (Chen et al., 2020), $s^t_{kj}$ denotes semantic similarity, $g^t_{kj}$ denotes dynamic fusion scores,

    First-Fixation Time. First-Fixation Time measures the temporal latency between the moment the target object first enters the head camera’s field of view and the initiation of the robot’s grasping motion. Specifically, it is defined as the [Table header: Task Category; Update Strategy; Retrieval Modu...]

  18. [18]

    Head Search Path Length. Head Search Path Length measures the total amount of head motion required to bring the target object into view. Let $t \in \{0, 1, \ldots, T-1\}$ index environment steps in an episode, and let $(\psi_t, \phi_t)$ denote the head pan (yaw) and tilt (pitch) angles at step $t$, measured in degrees. Let $t_{\mathrm{vis}}$ denote the step at which the target object first...

  19. [19]

    Grasp Attempt Count. Grasp Attempt Count measures the number of grasp executions issued by the controller until a successful grasp is achieved. Each grasp attempt is defined as a completed gripper closing action. Lower values indicate more reliable target localization and manipulation, with values close to one corresponding to near one-shot grasping behavior

  20. [20]

    Time-to-Grasp. Time-to-Grasp measures the total number of environment steps from episode start until a successful grasp is completed. This metric aggregates both perception and manipulation efficiency, reflecting the overall speed of task execution under partial observability. D.3. Quantitative Results As shown in Table 9, SOMA achieves SOTA performance on...
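
The two fully stated metrics in the extracts above, Grasp Attempt Count and Time-to-Grasp, can be sketched from a per-step episode log as follows. The `Step` record and its two boolean fields are hypothetical log fields, and counting the success step itself as elapsed (`t + 1`) is an assumption; the extracted definitions do not fix either detail.

```python
from dataclasses import dataclass


@dataclass
class Step:
    gripper_closed: bool  # a gripper closing action completed (assumed field)
    grasp_success: bool   # that closing action secured the target (assumed)


def grasp_metrics(steps):
    """Count grasp attempts and steps elapsed until the first success."""
    attempts = 0
    for t, step in enumerate(steps):
        if step.gripper_closed:
            attempts += 1
            if step.grasp_success:
                # Steps are 0-indexed, so t + 1 steps have elapsed.
                return {"grasp_attempts": attempts, "time_to_grasp": t + 1}
    # No successful grasp in this episode.
    return {"grasp_attempts": attempts, "time_to_grasp": None}
```

An episode with a single attempt corresponds to the near one-shot grasping behavior described in the Grasp Attempt Count definition.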