arxiv: 2604.22238 · v2 · pith:OUK2IEVMnew · submitted 2026-04-24 · 💻 cs.RO

CodeGraphVLP: Code-as-Planner Meets Semantic-Graph State for Non-Markovian Vision-Language-Action Models

Pith reviewed 2026-05-08 11:37 UTC · model grok-4.3

classification 💻 cs.RO
keywords non-Markovian tasks · semantic-graph state · code-based planner · vision-language-action models · long-horizon manipulation · partial observability · progress checks · robot task planning

The pith

CodeGraphVLP pairs a persistent semantic-graph state with a code planner to succeed more often on non-Markovian robot tasks than standard VLA models.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Vision-language-action models typically treat the latest camera image as enough for the next action, but this breaks when evidence is hidden, appears only earlier, or is buried in clutter. CodeGraphVLP keeps a running semantic graph of objects and relations that survives partial views, then runs an executable code planner over the graph to check progress and emit a focused subtask plus the key objects involved. The planner output is used to build a cleaned observation that directs the VLA executor toward what actually matters. The result on real long-horizon tasks is higher completion rates than both plain VLA baselines and history-augmented variants, plus much lower planning latency than methods that loop back to a vision-language model at every step.

Core claim

The framework maintains task-relevant entities and relations in a semantic-graph state under partial observability. An executable code planner runs progress checks on this graph and produces subtask instructions together with relevant objects; these outputs drive construction of clutter-suppressed observations that focus the VLA executor on critical evidence.
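
To make the graph-plus-planner loop concrete, here is a minimal sketch. It is not the authors' implementation: the `SemanticGraph` class, its `update` rule, the `plan` function, and the apple/bowl/lid task are all hypothetical, and the paper's actual graph representation and synthesized planner code are not given on this page. The sketch only illustrates the idea that relations of occluded objects persist while a plain code function checks progress and returns a subtask plus its relevant objects.

```python
from dataclasses import dataclass, field


@dataclass
class SemanticGraph:
    """Hypothetical persistent state: entity names plus (subject, relation, object) triples."""
    entities: set = field(default_factory=set)
    relations: set = field(default_factory=set)

    def update(self, visible_entities, observed_relations):
        """Merge a partial observation (assumed update rule, not from the paper)."""
        visible = set(visible_entities)
        self.entities |= visible
        # Relations whose subject is visible right now may have changed, so drop them;
        # relations whose subject is occluded are kept as the last known belief.
        self.relations = {r for r in self.relations if r[0] not in visible}
        self.relations |= set(observed_relations)

    def holds(self, subj, rel, obj):
        return (subj, rel, obj) in self.relations


def plan(graph):
    """Hypothetical planner for: 'put the apple in the bowl, then cover the bowl with the lid'.

    Each branch is a progress check on the graph; the return value is a
    subtask instruction together with the objects relevant to that subtask.
    """
    if not graph.holds("apple", "in", "bowl"):
        return "put the apple in the bowl", ["apple", "bowl"]
    if not graph.holds("lid", "on", "bowl"):
        return "put the lid on the bowl", ["lid", "bowl"]
    return "done", []


g = SemanticGraph()
g.update(["apple", "bowl", "lid"], [("apple", "on", "table"), ("lid", "on", "table")])
print(plan(g))  # ('put the apple in the bowl', ['apple', 'bowl'])

g.update(["apple", "bowl"], [("apple", "in", "bowl")])  # lid is out of view
print(plan(g))  # ('put the lid on the bowl', ['lid', 'bowl'])

g.update(["lid", "bowl"], [("lid", "on", "bowl")])  # apple is now hidden under the lid
print(plan(g))  # ('done', [])
```

In the last step the apple is no longer visible, yet the planner still reports completion because the earlier `("apple", "in", "bowl")` relation survives in the graph. A policy that sees only the latest frame has no such evidence.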

What carries the argument

Persistent semantic-graph state combined with an executable code-based planner that performs progress checks and generates focused subtask instructions.

If this is right

  • On real-world non-Markovian tasks the method improves task completion over strong VLA baselines and history-enabled variants.
  • It substantially lowers planning latency compared with VLM-in-the-loop planning.
  • Ablation studies confirm that the graph state, code planner, and progress-guided prompting each contribute to the gains.
  • The hierarchical loop lets the VLA executor operate on observations that suppress irrelevant clutter.
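
One simple way to picture a clutter-suppressed observation is to blank every pixel that lies outside the regions of the subtask-relevant objects. The sketch below does this with axis-aligned bounding boxes on a nested-list image. Both the box-based masking and the `suppress_clutter` helper are assumptions made for illustration; the page does not say how the paper actually builds its clutter-free visual cues.

```python
def suppress_clutter(image, boxes, relevant, fill=0):
    """Return a copy of `image` that keeps only pixels inside boxes of relevant objects.

    image:    list of rows, each a list of pixel values
    boxes:    dict mapping object name -> (x0, y0, x1, y1), with x1 and y1 exclusive
    relevant: object names produced by the planner for the current subtask
    fill:     value written over every suppressed pixel
    """
    height, width = len(image), len(image[0])
    keep = [[False] * width for _ in range(height)]
    for name in relevant:
        if name not in boxes:
            continue  # object is tracked in the state but has no box in this view
        x0, y0, x1, y1 = boxes[name]
        for y in range(max(0, y0), min(height, y1)):
            for x in range(max(0, x0), min(width, x1)):
                keep[y][x] = True
    return [
        [image[y][x] if keep[y][x] else fill for x in range(width)]
        for y in range(height)
    ]


# Toy 4x4 "image": a cup in the top-left corner, a distractor toy in the bottom-right.
frame = [[1, 2, 3, 4], [5, 6, 7, 8], [9, 10, 11, 12], [13, 14, 15, 16]]
boxes = {"cup": (0, 0, 2, 2), "toy": (2, 2, 4, 4)}
for row in suppress_clutter(frame, boxes, relevant=["cup"]):
    print(row)
```

Only the cup region survives; the distractor and the background are overwritten with the fill value, so the executor receives an observation containing nothing but the objects named by the planner.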

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Explicit graph-based memory may let other short-horizon learned policies operate in partially observable environments without retraining.
  • Replacing repeated large-model queries with code execution over a compact state could reduce both latency and compute cost in deployed systems.
  • The same graph-plus-code structure might extend to tasks such as sequential assembly or multi-room navigation where order and persistence matter.

Load-bearing premise

The semantic-graph state can reliably track task-relevant entities and relations even when parts of the scene are occluded or cluttered.

What would settle it

An experiment in which the semantic graph is replaced by a simple buffer of recent observations while keeping the rest of the pipeline identical, and which then shows no gain in task completion, would indicate that the graph representation itself is not carrying the reported benefit.
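
The proposed ablation amounts to swapping one memory component for another behind the same interface. The sketch below shows that swap in miniature; `RecentBuffer`, `PersistentState`, and the toy episode are hypothetical stand-ins, not components from the paper. The persistent state here never retracts a fact, a simplification a real graph would not make.

```python
from collections import deque


class RecentBuffer:
    """Ablation stand-in: remembers only the facts seen in the last `size` frames."""

    def __init__(self, size):
        self.frames = deque(maxlen=size)

    def observe(self, facts):
        self.frames.append(frozenset(facts))

    def knows(self, fact):
        return any(fact in frame for frame in self.frames)


class PersistentState:
    """Simplified stand-in for a persistent state: accumulates every fact ever seen."""

    def __init__(self):
        self.facts = set()

    def observe(self, facts):
        self.facts |= set(facts)

    def knows(self, fact):
        return fact in self.facts


# Toy non-Markovian episode: the key fact is visible only in the first frame.
key_fact = ("coin", "under", "left_cup")
episode = [[key_fact]] + [[] for _ in range(5)]

for memory in (RecentBuffer(size=3), PersistentState()):
    for facts in episode:
        memory.observe(facts)
    print(type(memory).__name__, memory.knows(key_fact))
```

With a three-frame buffer the early evidence has been evicted by the end of the episode, while the persistent state still holds it. If the real ablation showed no gap on tasks of this kind, the graph representation would not be what carries the benefit.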

Figures

Figures reproduced from arXiv: 2604.22238 by Anh Nguyen, Anthony Gunderman, Chase Rainwater, Duy Nguyen, Khoa Vo, Minh Vu, Ngan Le, Nghi D. Q. Bui, Sieu Tran, Taisei Hanyu, Yuki Ikebe.

Figure 1: Architectures for non-Markovian long-horizon manipulation. (a) Memory-augmented VLA equips a short-horizon policy with memory context, offering moderately efficient progress checks and limited robustness for action reasoning in clutter. (b) Hierarchical VLM–VLA uses a VLM planner to reason about subtasks and guide a VLA policy with subtask-level cues, improving clutter robustness but incurring high latency …
Figure 2: Overview of CodeGraphVLP
Figure 3: Qualitative rollouts of CodeGraphVLP on our three real-world tasks (Pick-and-Place Twice, Place-and-Stack, and Swap Cups). For each task, we show multi-view RGB inputs with the overall instruction, the semantic-graph state $G_t$, and the progress-guided prompts used by the VLA policy: clutter-free visual cues that retain only subtask-relevant objects and the planner-produced subtask language cues. inspired by…
Figure 4: Robot experimental setup on a UR10e manipulator with a parallel …
Original abstract

Vision-Language-Action (VLA) models promise generalist robot manipulation, but are typically trained and deployed as short-horizon policies that assume the latest observation is sufficient for action reasoning. This assumption breaks in non-Markovian long-horizon tasks, where task-relevant evidence can be occluded or appear only earlier in the trajectory, and where clutter and distractors make fine-grained visual grounding brittle. We present CodeGraphVLP, a hierarchical framework that enables reliable long-horizon manipulation by combining a persistent semantic-graph state with an executable code-based planner and progress-guided visual-language prompting. The semantic-graph maintains task-relevant entities and relations under partial observability. The synthesized planner executes over this semantic-graph to perform efficient progress checks and outputs a subtask instruction together with subtask-relevant objects. We use these outputs to construct clutter-suppressed observations that focus the VLA executor on critical evidence. On real-world non-Markovian tasks, CodeGraphVLP improves task completion over strong VLA baselines and history-enabled variants while substantially lowering planning latency compared to VLM-in-the-loop planning. We also conduct extensive ablation studies to confirm the contributions of each component.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper introduces CodeGraphVLP, a hierarchical framework that augments vision-language-action (VLA) models with a persistent semantic-graph state representation, an executable code-based planner for progress checks and subtask decomposition, and progress-guided visual-language prompting to focus the VLA executor on relevant evidence. It claims that this integration enables reliable performance on non-Markovian long-horizon manipulation tasks under partial observability by maintaining task-relevant entities and relations, yielding higher task completion rates than strong VLA baselines and history-enabled variants while reducing planning latency relative to VLM-in-the-loop approaches, with supporting ablation studies.

Significance. If the performance claims are substantiated, the work would be significant for bridging symbolic planning with neural VLA policies in robotics, offering a practical way to handle non-Markovian dependencies and clutter that currently limit short-horizon VLA deployment. The extensive ablation studies are a clear strength, as they directly test component contributions rather than relying on end-to-end black-box gains.

major comments (2)
  1. [Results section] Results section (and abstract performance claims): the manuscript reports qualitative improvements in task completion and latency but provides no exact quantitative metrics (e.g., success rates, latency values in seconds), baseline implementations with citations or code, error bars, or statistical significance tests. This absence is load-bearing because the central claim of superiority over VLA baselines and history variants cannot be verified or sized without these details.
  2. [Method / Semantic-graph construction] Semantic-graph state description (likely §3.2 or Method): no error rates, drift measurements, update rules, or failure-mode analysis are supplied for how the graph is constructed and maintained from VLM observations under partial observability, occlusions, and distractors. This directly undermines the weakest assumption that the graph reliably supports accurate progress checks by the code planner.
minor comments (2)
  1. [Method] Notation for the semantic-graph (entities, relations, update function) could be formalized with a small table or pseudocode to improve clarity for readers unfamiliar with the exact representation.
  2. [Ablation studies] The ablation studies are mentioned but would benefit from a dedicated table summarizing the contribution of each component (graph, code planner, prompting) with the same metrics used in the main results.
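
The first major comment asks for success rates with uncertainty and significance tests. As a sketch of what such a report could contain, the snippet below computes a Wilson score interval for a success rate and a two-sided Fisher exact test between two methods, using only the standard library. The trial counts are placeholders invented for illustration; they are not results from the paper, which reports no exact numbers on this page.

```python
from math import comb, sqrt


def wilson_interval(successes, trials, z=1.96):
    """Wilson score interval for a binomial proportion (z=1.96 gives roughly 95%)."""
    p = successes / trials
    denom = 1 + z * z / trials
    center = (p + z * z / (2 * trials)) / denom
    half = z * sqrt(p * (1 - p) / trials + z * z / (4 * trials * trials)) / denom
    return center - half, center + half


def fisher_exact_two_sided(s1, n1, s2, n2):
    """Two-sided Fisher exact p-value for s1/n1 successes versus s2/n2 successes.

    Sums the hypergeometric probabilities of all tables (with the same margins)
    that are no more likely than the observed one.
    """
    total_s, total_n = s1 + s2, n1 + n2
    denom = comb(total_n, total_s)

    def prob(k):
        return comb(n1, k) * comb(n2, total_s - k) / denom

    lo, hi = max(0, total_s - n2), min(n1, total_s)
    p_obs = prob(s1)
    tail = sum(prob(k) for k in range(lo, hi + 1) if prob(k) <= p_obs * (1 + 1e-9))
    return min(1.0, tail)


# Placeholder counts (NOT from the paper): 20 trials per method.
ours_success, baseline_success, trials = 17, 9, 20

print("ours     95% CI:", wilson_interval(ours_success, trials))
print("baseline 95% CI:", wilson_interval(baseline_success, trials))
print("Fisher p-value :", fisher_exact_two_sided(ours_success, trials, baseline_success, trials))
```

With the small trial counts typical of real-robot evaluations, exact tests and score intervals tend to behave better than normal approximations, which is why they are used in this sketch.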

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback on our manuscript. We have reviewed the major comments carefully and provide point-by-point responses below. Where the comments identify gaps in quantitative detail or analysis, we agree and commit to revisions that will strengthen the paper without altering its core contributions.

point-by-point responses
  1. Referee: [Results section] Results section (and abstract performance claims): the manuscript reports qualitative improvements in task completion and latency but provides no exact quantitative metrics (e.g., success rates, latency values in seconds), baseline implementations with citations or code, error bars, or statistical significance tests. This absence is load-bearing because the central claim of superiority over VLA baselines and history variants cannot be verified or sized without these details.

    Authors: We acknowledge that the current manuscript presents performance improvements in summarized form in the abstract and results section without a dedicated table of exact metrics, error bars, or statistical tests, which limits independent verification. We will revise the results section to include a table with precise quantitative values from our real-world experiments (task success rates as percentages with standard deviations across repeated trials, planning latency in seconds with comparisons), full citations to the baseline VLA models and history variants used, descriptions of their implementations, and statistical significance tests (e.g., p-values from appropriate tests). This will allow the claims of superiority to be fully sized and verified. revision: yes

  2. Referee: [Method / Semantic-graph construction] Semantic-graph state description (likely §3.2 or Method): no error rates, drift measurements, update rules, or failure-mode analysis are supplied for how the graph is constructed and maintained from VLM observations under partial observability, occlusions, and distractors. This directly undermines the weakest assumption that the graph reliably supports accurate progress checks by the code planner.

    Authors: We agree that the method section describes semantic-graph construction from VLM observations but does not supply quantitative error rates, drift analysis, explicit update rules, or failure-mode discussion under partial observability and distractors. We will revise §3.2 (and add an appendix if needed) to include the update rules for graph maintenance (e.g., how new observations are merged or used to correct entities/relations), empirical error rates measured during our experiments (such as entity detection precision/recall in occluded or cluttered scenes), and a failure-mode analysis with concrete examples. This will directly support the reliability of the graph for the code planner's progress checks. revision: yes
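
The error rates promised in the second response could be measured by comparing the relations held in the graph at a given step against hand-annotated ground truth. The sketch below computes precision and recall over relation triples; the `precision_recall` helper and the example triples are hypothetical, and the rebuttal does not specify the authors' evaluation protocol.

```python
def precision_recall(predicted, annotated):
    """Precision and recall of predicted relation triples against annotated ones.

    Returns None for a metric whose denominator is zero.
    """
    predicted, annotated = set(predicted), set(annotated)
    true_positives = len(predicted & annotated)
    precision = true_positives / len(predicted) if predicted else None
    recall = true_positives / len(annotated) if annotated else None
    return precision, recall


# Hypothetical snapshot of one cluttered scene (illustrative data only).
graph_relations = {
    ("red_cup", "on", "table"),
    ("block", "in", "red_cup"),
    ("blue_cup", "on", "block"),  # wrong: a drifted belief
}
ground_truth = {
    ("red_cup", "on", "table"),
    ("block", "in", "red_cup"),
    ("blue_cup", "on", "table"),
    ("lid", "on", "blue_cup"),  # missed: the lid was occluded
}

precision, recall = precision_recall(graph_relations, ground_truth)
print(f"precision={precision:.2f} recall={recall:.2f}")  # precision=0.67 recall=0.50
```

Tracking these two numbers per timestep, separately for occluded and fully visible scenes, would show how quickly the graph drifts and how often the planner's progress checks rest on a wrong belief.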

Circularity Check

0 steps flagged

No circularity: empirical framework integration with ablations

full rationale

The paper introduces CodeGraphVLP as a hierarchical integration of a persistent semantic-graph state, executable code planner, and progress-guided VLA prompting for non-Markovian tasks. All claims rest on real-world task completion rates, latency measurements, and ablation studies that isolate component contributions. No mathematical derivations, fitted parameters renamed as predictions, or self-referential definitions appear; the semantic-graph maintenance and planner outputs are presented as engineered components whose reliability is evaluated externally rather than assumed by construction.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 2 invented entities

The central claim rests on the domain assumption that semantic graphs can be maintained accurately despite occlusion and that code execution can efficiently monitor progress; no free parameters or invented entities with independent evidence are detailed.

axioms (1)
  • domain assumption: Semantic-graph state maintains task-relevant entities and relations under partial observability
    Invoked as the foundation for handling non-Markovian tasks and enabling progress checks in the hierarchical framework.
invented entities (2)
  • Semantic-graph state (no independent evidence)
    purpose: Persistent tracking of entities and relations for subtask planning and observation focusing
    Introduced as a core new component to overcome limitations of Markovian VLA assumptions.
  • Code-based planner (no independent evidence)
    purpose: Synthesizes executable plans over the graph to output subtasks and relevant objects
    Presented as the mechanism for efficient progress monitoring and instruction generation.

pith-pipeline@v0.9.0 · 5545 in / 1269 out tokens · 66580 ms · 2026-05-08T11:37:49.770256+00:00 · methodology

Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. $\mu$VLA: On Recurrent Memory for Partially Observable Manipulation in VLA Models

    cs.LG · 2026-06 · unverdicted · novelty 6.0

    Adding recurrent memory tokens to VLA models raises success rates on partially observable manipulation tasks from 0.42 to 0.84 on training and 0.07 to 0.23 on held-out tasks while preserving performance under full obs...