CodeGraphVLP: Code-as-Planner Meets Semantic-Graph State for Non-Markovian Vision-Language-Action Models
Pith reviewed 2026-05-08 11:37 UTC · model grok-4.3
The pith
CodeGraphVLP pairs a persistent semantic-graph state with a code planner to succeed more often on non-Markovian robot tasks than standard VLA models.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The framework maintains task-relevant entities and relations in a semantic-graph state under partial observability. An executable code planner runs progress checks on this graph and produces subtask instructions together with relevant objects; these outputs drive construction of clutter-suppressed observations that focus the VLA executor on critical evidence.
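The loop described above can be made concrete with a toy example. The following is a minimal sketch, not the paper's implementation: the `SemanticGraph` class, the `plan` function, the drawer-and-cube task, and every field name are hypothetical and chosen only for illustration. It shows a planner written as ordinary code that runs progress checks against a persistent graph and returns a subtask instruction together with the relevant objects.

```python
from dataclasses import dataclass, field


@dataclass
class SemanticGraph:
    """Hypothetical persistent state (an assumption, not the paper's schema)."""
    states: dict = field(default_factory=dict)    # entity -> attribute, e.g. {"drawer": "open"}
    relations: set = field(default_factory=set)   # (subject, relation, object) triples


def plan(graph):
    """Hypothetical code planner for the toy task 'store the cube, then close the drawer'.

    Returns (subtask instruction, subtask-relevant objects), or (None, []) when done.
    """
    # Progress checks read the graph, not the current camera image.
    cube_stored = ("cube", "in", "drawer") in graph.relations
    drawer_open = graph.states.get("drawer") == "open"
    if not cube_stored and not drawer_open:
        return "open the drawer", ["drawer"]
    if not cube_stored:
        return "place the cube in the drawer", ["cube", "drawer"]
    if drawer_open:
        return "close the drawer", ["drawer"]
    return None, []


# Once the drawer is closed the cube is no longer visible,
# but the stored relation still tells the planner that the task is complete.
done = SemanticGraph(states={"drawer": "closed"}, relations={("cube", "in", "drawer")})
start = SemanticGraph(states={"drawer": "closed"}, relations={("cube", "on", "table")})
```

In this sketch `plan(start)` asks for the drawer to be opened, while `plan(done)` reports completion even though both scenes show a closed drawer. That difference is the non-Markovian dependency the graph is meant to resolve.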
What carries the argument
Persistent semantic-graph state combined with an executable code-based planner that performs progress checks and generates focused subtask instructions.
If this is right
- On real-world non-Markovian tasks the method improves task completion over strong VLA baselines and history-enabled variants.
- It substantially lowers planning latency compared with VLM-in-the-loop planning.
- Ablation studies confirm that the graph state, code planner, and progress-guided prompting each contribute to the gains.
- The hierarchical loop lets the VLA executor operate on observations that suppress irrelevant clutter.
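This page does not say how the clutter-suppressed observations are built. One plausible reading, sketched here purely as an assumption, is to blank every pixel that falls outside the bounding boxes of the planner's subtask-relevant objects. The function name `suppress_clutter`, the box format, and the fill value are all hypothetical.

```python
def suppress_clutter(image, boxes, relevant, fill=0):
    """Keep pixels inside the boxes of relevant objects; overwrite the rest with `fill`.

    image:    list of rows of pixel values
    boxes:    object name -> (x0, y0, x1, y1), end-exclusive (assumed format)
    relevant: object names returned by the planner for the current subtask
    """
    keep = [boxes[name] for name in relevant if name in boxes]
    out = []
    for y, row in enumerate(image):
        out.append([
            px if any(x0 <= x < x1 and y0 <= y < y1 for x0, y0, x1, y1 in keep) else fill
            for x, px in enumerate(row)
        ])
    return out


# Toy 4x4 image with two detected objects; only the cube matters for this subtask.
toy_image = [[1] * 4 for _ in range(4)]
toy_boxes = {"cube": (0, 0, 2, 2), "mug": (2, 2, 4, 4)}
focused = suppress_clutter(toy_image, toy_boxes, relevant=["cube"])
```

Hard masking is only one option. Blurring or dimming the background would preserve scene context for the executor, and nothing on this page indicates which variant the authors use.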
Where Pith is reading between the lines
- Explicit graph-based memory may let other short-horizon learned policies operate in partially observable environments without retraining.
- Replacing repeated large-model queries with code execution over a compact state could reduce both latency and compute cost in deployed systems.
- The same graph-plus-code structure might extend to tasks such as sequential assembly or multi-room navigation where order and persistence matter.
Load-bearing premise
The semantic-graph state can reliably track task-relevant entities and relations even when parts of the scene are occluded or cluttered.
What would settle it
Replace the semantic graph with a simple buffer of recent observations while keeping the rest of the pipeline identical. If task completion is unchanged, the graph representation itself is not carrying the reported benefit.
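The proposed control can be pictured as swapping one memory module for another behind a shared interface. The sketch below is an assumption about how such a comparison might be wired up; `GraphMemory`, `BufferMemory`, `run`, and the toy trajectory are hypothetical and do not come from the paper.

```python
from collections import deque


class GraphMemory:
    """Persistent store: a fact stays until a later observation overwrites it."""

    def __init__(self):
        self.facts = {}

    def update(self, observation):
        self.facts.update(observation)

    def query(self, key):
        return self.facts.get(key)


class BufferMemory:
    """Control condition: only the last `size` raw observations are kept."""

    def __init__(self, size):
        self.recent = deque(maxlen=size)

    def update(self, observation):
        self.recent.append(observation)

    def query(self, key):
        # Search from the newest retained observation backwards.
        for observation in reversed(self.recent):
            if key in observation:
                return observation[key]
        return None


def run(memory, observations, key):
    """Feed a trajectory into a memory module, then ask it one question."""
    for observation in observations:
        memory.update(observation)
    return memory.query(key)


# Toy trajectory: the cube's location is visible only at the first step.
trajectory = [{"cube_location": "left_cup"}, {}, {}, {}]
graph_answer = run(GraphMemory(), trajectory, "cube_location")
buffer_answer = run(BufferMemory(size=2), trajectory, "cube_location")
```

In this toy case the buffer loses the evidence once it scrolls out of the window, while a buffer at least as long as the trajectory would retain it. A real version of the experiment would therefore need to report the buffer length alongside the results.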
Original abstract
Vision-Language-Action (VLA) models promise generalist robot manipulation, but are typically trained and deployed as short-horizon policies that assume the latest observation is sufficient for action reasoning. This assumption breaks in non-Markovian long-horizon tasks, where task-relevant evidence can be occluded or appear only earlier in the trajectory, and where clutter and distractors make fine-grained visual grounding brittle. We present CodeGraphVLP, a hierarchical framework that enables reliable long-horizon manipulation by combining a persistent semantic-graph state with an executable code-based planner and progress-guided visual-language prompting. The semantic-graph maintains task-relevant entities and relations under partial observability. The synthesized planner executes over this semantic-graph to perform efficient progress checks and outputs a subtask instruction together with subtask-relevant objects. We use these outputs to construct clutter-suppressed observations that focus the VLA executor on critical evidence. On real-world non-Markovian tasks, CodeGraphVLP improves task completion over strong VLA baselines and history-enabled variants while substantially lowering planning latency compared to VLM-in-the-loop planning. We also conduct extensive ablation studies to confirm the contributions of each component.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces CodeGraphVLP, a hierarchical framework that augments vision-language-action (VLA) models with a persistent semantic-graph state representation, an executable code-based planner for progress checks and subtask decomposition, and progress-guided visual-language prompting to focus the VLA executor on relevant evidence. It claims that this integration enables reliable performance on non-Markovian long-horizon manipulation tasks under partial observability by maintaining task-relevant entities and relations, yielding higher task completion rates than strong VLA baselines and history-enabled variants while reducing planning latency relative to VLM-in-the-loop approaches, with supporting ablation studies.
Significance. If the performance claims are substantiated, the work would be significant for bridging symbolic planning with neural VLA policies in robotics, offering a practical way to handle non-Markovian dependencies and clutter that currently limit short-horizon VLA deployment. The extensive ablation studies are a clear strength, as they directly test component contributions rather than relying on end-to-end black-box gains.
major comments (2)
- [Results section] Results section (and abstract performance claims): the manuscript reports qualitative improvements in task completion and latency but provides no exact quantitative metrics (e.g., success rates, latency values in seconds), baseline implementations with citations or code, error bars, or statistical significance tests. This absence is load-bearing because the central claim of superiority over VLA baselines and history variants cannot be verified or sized without these details.
- [Method / Semantic-graph construction] Semantic-graph state description (likely §3.2 or Method): no error rates, drift measurements, update rules, or failure-mode analysis are supplied for how the graph is constructed and maintained from VLM observations under partial observability, occlusions, and distractors. This directly undermines the weakest assumption that the graph reliably supports accurate progress checks by the code planner.
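The first major comment asks for success rates with uncertainty and significance tests. As a hedged illustration of what that reporting could involve, the sketch below computes a Wilson score interval for a success rate and a two-sided Fisher exact test for two methods. The choice of these two statistics is an assumption, since the page does not say which tests the authors will use, and any counts passed to the functions are placeholders rather than results from the paper.

```python
from math import comb, sqrt


def wilson_interval(successes, trials, z=1.96):
    """Wilson score interval for a success rate (z=1.96 gives roughly 95% coverage)."""
    p = successes / trials
    denom = 1 + z * z / trials
    centre = (p + z * z / (2 * trials)) / denom
    half = z * sqrt(p * (1 - p) / trials + z * z / (4 * trials * trials)) / denom
    return centre - half, centre + half


def fisher_exact_two_sided(a_success, a_trials, b_success, b_trials):
    """Two-sided Fisher exact p-value for the success counts of methods A and B."""
    total_success = a_success + b_success
    total = a_trials + b_trials

    def prob(k):
        # Hypergeometric probability that method A has exactly k successes,
        # given the fixed row and column totals.
        return comb(a_trials, k) * comb(b_trials, total_success - k) / comb(total, total_success)

    lo = max(0, total_success - b_trials)
    hi = min(a_trials, total_success)
    observed = prob(a_success)
    # Sum every table that is no more likely than the observed one.
    return sum(prob(k) for k in range(lo, hi + 1) if prob(k) <= observed * (1 + 1e-9))
```

An exact test is a reasonable default for real-robot evaluations, where each method is often run for only a few dozen trials and normal approximations are unreliable.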
minor comments (2)
- [Method] Notation for the semantic-graph (entities, relations, update function) could be formalized with a small table or pseudocode to improve clarity for readers unfamiliar with the exact representation.
- [Ablation studies] The ablation studies are mentioned but would benefit from a dedicated table summarizing the contribution of each component (graph, code planner, prompting) with the same metrics used in the main results.
Simulated Author's Rebuttal
We thank the referee for the constructive and detailed feedback on our manuscript. We have reviewed the major comments carefully and provide point-by-point responses below. Where the comments identify gaps in quantitative detail or analysis, we agree and commit to revisions that will strengthen the paper without altering its core contributions.
Point-by-point responses
- Referee: [Results section] Results section (and abstract performance claims): the manuscript reports qualitative improvements in task completion and latency but provides no exact quantitative metrics (e.g., success rates, latency values in seconds), baseline implementations with citations or code, error bars, or statistical significance tests. This absence is load-bearing because the central claim of superiority over VLA baselines and history variants cannot be verified or sized without these details.
  Authors: We acknowledge that the current manuscript presents performance improvements in summarized form in the abstract and results section without a dedicated table of exact metrics, error bars, or statistical tests, which limits independent verification. We will revise the results section to include a table with precise quantitative values from our real-world experiments (task success rates as percentages with standard deviations across repeated trials, planning latency in seconds with comparisons), full citations to the baseline VLA models and history variants used, descriptions of their implementations, and statistical significance tests (e.g., p-values from appropriate tests). This will allow the claims of superiority to be fully sized and verified.
  Revision: yes
- Referee: [Method / Semantic-graph construction] Semantic-graph state description (likely §3.2 or Method): no error rates, drift measurements, update rules, or failure-mode analysis are supplied for how the graph is constructed and maintained from VLM observations under partial observability, occlusions, and distractors. This directly undermines the weakest assumption that the graph reliably supports accurate progress checks by the code planner.
  Authors: We agree that the method section describes semantic-graph construction from VLM observations but does not supply quantitative error rates, drift analysis, explicit update rules, or failure-mode discussion under partial observability and distractors. We will revise §3.2 (and add an appendix if needed) to include the update rules for graph maintenance (e.g., how new observations are merged or used to correct entities/relations), empirical error rates measured during our experiments (such as entity detection precision/recall in occluded or cluttered scenes), and a failure-mode analysis with concrete examples. This will directly support the reliability of the graph for the code planner's progress checks.
  Revision: yes
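The update rules and error rates promised in the second response are not specified on this page. The sketch below shows one form they could take, strictly as an assumption: an update rule that keeps beliefs about entities whose last known location is out of view, and an entity-level precision/recall measure against annotated ground truth. The names `update_graph` and `precision_recall` and the flat entity-to-location representation are hypothetical.

```python
def update_graph(graph, detections, visible_region):
    """Hypothetical update rule for merging a new observation into the graph.

    graph:          entity -> last known location
    detections:     entity -> location observed in the current frame
    visible_region: set of locations the camera can currently see
    """
    updated = dict(graph)
    for entity, location in graph.items():
        # If the stored location is in view but the entity was not detected there,
        # the stored belief is stale, so drop it. Occluded entities are retained.
        if location in visible_region and entity not in detections:
            del updated[entity]
    updated.update(detections)  # fresh observations override stored beliefs
    return updated


def precision_recall(predicted, ground_truth):
    """Precision and recall over (entity, location) pairs."""
    predicted, ground_truth = set(predicted.items()), set(ground_truth.items())
    true_positives = len(predicted & ground_truth)
    precision = true_positives / len(predicted) if predicted else 1.0
    recall = true_positives / len(ground_truth) if ground_truth else 1.0
    return precision, recall


# The cube sits in a closed drawer (not visible); the mug has moved from table to shelf.
belief = {"cube": "drawer", "mug": "table"}
new_belief = update_graph(belief, {"mug": "shelf"}, visible_region={"table", "shelf"})
```

Retaining occluded entities indefinitely is exactly where drift can accumulate, so a fuller treatment would probably attach a timestamp or confidence to each entry. Per-step precision and recall measured this way would be one direct way to quantify the drift the referee asks about.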
Circularity Check
No circularity: empirical framework integration with ablations
The paper introduces CodeGraphVLP as a hierarchical integration of a persistent semantic-graph state, executable code planner, and progress-guided VLA prompting for non-Markovian tasks. All claims rest on real-world task completion rates, latency measurements, and ablation studies that isolate component contributions. No mathematical derivations, fitted parameters renamed as predictions, or self-referential definitions appear; the semantic-graph maintenance and planner outputs are presented as engineered components whose reliability is evaluated externally rather than assumed by construction.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption: The semantic-graph state maintains task-relevant entities and relations under partial observability
invented entities (2)
- Semantic-graph state: no independent evidence
- Code-based planner: no independent evidence
Forward citations
Cited by 1 Pith paper
- $\mu$VLA: On Recurrent Memory for Partially Observable Manipulation in VLA Models
  Adding recurrent memory tokens to VLA models raises success rates on partially observable manipulation tasks from 0.42 to 0.84 on training and 0.07 to 0.23 on held-out tasks while preserving performance under full obs...