
arxiv: 2510.00695 · v3 · submitted 2025-10-01 · 💻 cs.RO · cs.CV

HAMLET: Switch your Vision-Language-Action Model into a History-Aware Policy

Pith reviewed 2026-05-18 10:54 UTC · model grok-4.3

classification 💻 cs.RO cs.CV
keywords vision-language-action models · history-aware policy · moment tokens · memory module · robotic manipulation · long-horizon tasks · time-contrastive learning

The pith

Moment tokens plus a lightweight memory module turn vision-language-action models into policies that use past observations for better robotic decisions.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Most current vision-language-action models decide robot movements from only the present image and language instruction, yet many real manipulation jobs require remembering earlier events to succeed. HAMLET creates compact moment tokens for each past time step and trains them with time-contrastive learning so they highlight what is distinctive about that moment. A small memory module then folds those tokens together into features that guide the next action. When this addition is made to a strong existing model, success rates rise sharply on tasks whose correct action depends on prior context, while smaller gains appear on ordinary benchmarks that do not stress memory.

Core claim

HAMLET is a scalable adaptation layer that equips any vision-language-action model with historical context by generating moment tokens whose representations are initialized via time-contrastive learning to capture temporally distinctive perceptual information at each timestep; a lightweight memory module then aggregates tokens from past steps into memory features that are fed forward for action prediction, thereby converting the original model into a history-aware policy.
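
The page does not reproduce the paper's loss, so the following is only a minimal sketch of what a time-contrastive objective generally looks like, assuming an InfoNCE-style form in which a temporally nearby frame is the positive and temporally distant frames are negatives. The function names, the cosine similarity, the temperature of 0.1, and the toy 2-D embeddings are all illustrative assumptions, not details taken from HAMLET.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def time_contrastive_loss(anchor, positive, negatives, temperature=0.1):
    """InfoNCE-style loss (assumed form, not the paper's exact objective).

    anchor:    embedding of the frame at time t
    positive:  embedding of a temporally nearby frame
    negatives: embeddings of temporally distant frames
    """
    pos_logit = cosine(anchor, positive) / temperature
    logits = [pos_logit] + [cosine(anchor, n) / temperature for n in negatives]
    # Numerically stable log-sum-exp over all candidates.
    peak = max(logits)
    log_denominator = peak + math.log(sum(math.exp(x - peak) for x in logits))
    return log_denominator - pos_logit

# Toy embeddings (hypothetical values chosen for illustration).
anchor = [1.0, 0.0]
nearby = [0.9, 0.1]
distant = [[0.0, 1.0], [-1.0, 0.0]]

# Low loss when the nearby frame is treated as the positive.
loss_near = time_contrastive_loss(anchor, nearby, distant)
# High loss when a distant frame is (wrongly) treated as the positive.
loss_far = time_contrastive_loss(anchor, distant[0], [nearby, distant[1]])
```

Minimizing a loss of this shape pulls embeddings of adjacent timesteps together and pushes distant ones apart, which is one plausible reading of "temporally distinctive" in the claim above.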

What carries the argument

Moment tokens initialized by time-contrastive learning to encode distinct perceptual information per timestep, aggregated into usable memory features by a lightweight memory module.
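
As a rough illustration of the aggregation step, the sketch below runs single-head causal self-attention over a short history of moment tokens and takes the last output as the memory feature. The history length of 4 and the use of causal self-attention follow the paper passage quoted in the Lean section of this page; everything else is an assumption made for brevity (no learned query, key, or value projections, a single head, toy 2-D tokens, and the name `causal_self_attention`).

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def causal_self_attention(tokens):
    """Single-head scaled dot-product attention with a causal mask.

    tokens: list of T vectors, oldest first. Queries, keys, and values are
    the tokens themselves (no learned projections; a simplification).
    Returns T vectors; output t mixes only tokens 0..t.
    """
    dim = len(tokens[0])
    scale = 1.0 / math.sqrt(dim)
    outputs = []
    for t, query in enumerate(tokens):
        visible = tokens[: t + 1]  # causal mask: no access to future steps
        scores = [scale * sum(q * k for q, k in zip(query, key)) for key in visible]
        weights = softmax(scores)
        mixed = [sum(w * v[i] for w, v in zip(weights, visible)) for i in range(dim)]
        outputs.append(mixed)
    return outputs

# Four hypothetical moment tokens, one per past timestep.
history = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.5, -0.5]]
memory = causal_self_attention(history)
memory_feature = memory[-1]  # feature that would condition the next action
```

Because of the causal mask, changing the newest token leaves the outputs for earlier timesteps untouched, which is the property that lets such a module be run incrementally during a rollout.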

If this is right

  • Standard VLAs can be upgraded to handle long sequences without retraining the entire network from scratch.
  • Robots achieve markedly higher completion rates on tasks whose steps depend on earlier observations.
  • The same lightweight addition yields measurable lifts even on shorter, generic manipulation benchmarks.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The token-plus-memory pattern could be ported to other sequential control systems that currently discard prior context.
  • Varying the number or resolution of moment tokens offers a direct way to test how much history is useful before performance plateaus.
  • Pairing the memory features with explicit uncertainty estimates might let the robot decide when to trust or ignore older observations.

Load-bearing premise

Compact tokens derived from past observations can be combined to supply the exact historical details that improve the model's choice of the next action.

What would settle it

Apply the same moment-token and memory-module additions to a baseline VLA and measure success rates on a suite of history-dependent real-robot tasks; if the rates remain at or below the original model's performance, the central claim is false.
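
One way to make "at or below the original model's performance" operational is a one-sided two-proportion z-test on raw trial counts. The sketch below is an assumption about how such a check could be run, not a procedure described in the paper; the trial counts are hypothetical placeholders and `two_proportion_z` is an illustrative name.

```python
import math

def two_proportion_z(success_a, n_a, success_b, n_b):
    """One-sided two-proportion z-test for H1: rate_a > rate_b.

    Returns (z, p_value), where p_value = P(Z >= z) under the pooled null.
    """
    rate_a = success_a / n_a
    rate_b = success_b / n_b
    pooled = (success_a + success_b) / (n_a + n_b)
    std_err = math.sqrt(pooled * (1.0 - pooled) * (1.0 / n_a + 1.0 / n_b))
    if std_err == 0.0:
        raise ValueError("pooled rate is 0 or 1; the z-test is undefined")
    z = (rate_a - rate_b) / std_err
    p_value = 0.5 * math.erfc(z / math.sqrt(2.0))  # upper tail of N(0, 1)
    return z, p_value

# Hypothetical counts, NOT taken from the paper:
# history-aware policy 23/30 successes, baseline 9/30 successes.
z_stat, p_val = two_proportion_z(23, 30, 9, 30)
```

A small p-value would count against the "no improvement" outcome; a large one would leave the central claim unsupported on that task suite. With few trials per task, an exact test may be a safer choice than this normal approximation.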

original abstract

Inherently, robotic manipulation tasks are history-dependent: leveraging past context could be beneficial. However, most existing Vision-Language-Action models (VLAs) have been designed without considering this aspect, i.e., they rely solely on the current observation, ignoring preceding context. In this paper, we propose HAMLET, a scalable framework to adapt VLAs to attend to the historical context during action prediction. Specifically, we introduce moment tokens that compactly encode perceptual information at each timestep. Their representations are initialized with time-contrastive learning, allowing them to better capture temporally distinctive aspects. Next, we employ a lightweight memory module that integrates the moment tokens across past timesteps into memory features, which are then leveraged for action prediction. Through empirical evaluation, we show that HAMLET successfully transforms a state-of-the-art VLA into a history-aware policy, especially demonstrating significant improvements on long-horizon tasks that require historical context. In particular, on top of GR00T N1.5, HAMLET achieves an average success rate of 76.4% on history-dependent real-world tasks, surpassing the baseline performance by 47.2%. Furthermore, HAMLET pushes prior art performance from 64.1% to 66.4% on RoboCasa Kitchen (100-demo setup) and from 95.6% to 97.7% on LIBERO, highlighting its effectiveness even under generic robot-manipulation benchmarks.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper proposes HAMLET, a framework to adapt existing Vision-Language-Action (VLA) models into history-aware policies for robotic manipulation tasks. It introduces moment tokens that encode perceptual information at each timestep, initialized via time-contrastive learning to capture temporally distinctive features, and a lightweight memory module that aggregates these tokens across past timesteps into memory features for action prediction. Empirical results claim that this transforms a state-of-the-art VLA (GR00T N1.5) into a history-aware policy, yielding a 76.4% average success rate on history-dependent real-world tasks (47.2% above baseline), plus gains from 64.1% to 66.4% on RoboCasa Kitchen (100-demo) and 95.6% to 97.7% on LIBERO.

Significance. If the reported gains can be isolated to the history-aware components, the work would offer a scalable, practical method for improving long-horizon robotic manipulation without overhauling core VLA architectures. The emphasis on real-world history-dependent tasks addresses a recognized limitation in current VLAs. The approach appears lightweight and compatible with existing models, which could facilitate adoption if the experimental controls are strengthened.

major comments (2)
  1. [Section 4] Section 4 (Experimental Evaluation): The comparison reports a 47.2% lift to 76.4% success on history-dependent real-world tasks over the unmodified GR00T N1.5 baseline, yet provides no details on whether the baseline received fine-tuning epochs, data passes, or parameter updates equivalent to those the model received once moment tokens and the memory module were added. This is load-bearing for the central claim, as extra optimization could explain the delta rather than the proposed history mechanism.
  2. [Section 3] Section 3 (Method): The claim that moment tokens initialized via time-contrastive learning compactly encode temporally distinctive perceptual information (and thereby improve action prediction when aggregated) lacks supporting ablations, such as a direct comparison of time-contrastive initialization versus random or standard contrastive initialization in the reported tables. Without this, the contribution of the initialization step to the observed gains on long-horizon tasks remains unisolated.
minor comments (2)
  1. [Abstract] Abstract: The reported 'average success rate of 76.4%' and '47.2%' improvement should specify the number of tasks, runs per task, and whether error bars or statistical significance tests were used, to allow readers to assess result reliability.
  2. [Figures/Tables] Figure captions and tables: Ensure all figures showing success rates include standard deviation or confidence intervals across multiple seeds or trials for transparency.
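
To illustrate the kind of interval the minor comments ask for, here is a minimal sketch of a Wilson score interval for a single success rate. The choice of the Wilson interval, the 95% level (z = 1.96), and the example counts are assumptions for illustration; the page does not say how many trials the paper ran per task.

```python
import math

def wilson_interval(successes, trials, z=1.96):
    """Wilson score interval for a binomial proportion.

    z=1.96 corresponds to roughly 95% coverage. Returns (low, high).
    """
    if trials <= 0:
        raise ValueError("trials must be positive")
    rate = successes / trials
    denom = 1.0 + z * z / trials
    center = (rate + z * z / (2.0 * trials)) / denom
    half_width = (
        z * math.sqrt(rate * (1.0 - rate) / trials + z * z / (4.0 * trials * trials))
        / denom
    )
    return max(0.0, center - half_width), min(1.0, center + half_width)

# Hypothetical example, NOT from the paper: 19 successes in 25 trials.
low, high = wilson_interval(19, 25)
```

With trial counts this small the interval spans tens of percentage points, which is why the per-task number of runs matters when reading a headline average.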

Simulated Authors' Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback on our manuscript. We have carefully reviewed the major comments and provide point-by-point responses below. Revisions have been made to strengthen the experimental details and add supporting ablations as requested.

point-by-point responses
  1. Referee: [Section 4] Section 4 (Experimental Evaluation): The comparison reports a 47.2% lift to 76.4% success on history-dependent real-world tasks over the unmodified GR00T N1.5 baseline, yet provides no details on whether the baseline received fine-tuning epochs, data passes, or parameter updates equivalent to those the model received once moment tokens and the memory module were added. This is load-bearing for the central claim, as extra optimization could explain the delta rather than the proposed history mechanism.

    Authors: We agree that explicit details on the baseline training protocol are essential to isolate the contribution of the history-aware components. The unmodified GR00T N1.5 baseline was fine-tuned using exactly the same number of epochs, data passes, batch size, and optimization hyperparameters as the HAMLET model; the sole difference is the insertion of moment tokens and the memory module. We have revised Section 4 to include a dedicated paragraph describing the matched training setup for both models, including epoch count, learning rate schedule, and data processing steps. This clarification confirms that the reported gains arise from the proposed history mechanism rather than unequal optimization.
    revision: yes

  2. Referee: [Section 3] Section 3 (Method): The claim that moment tokens initialized via time-contrastive learning compactly encode temporally distinctive perceptual information (and thereby improve action prediction when aggregated) lacks supporting ablations, such as a direct comparison of time-contrastive initialization versus random or standard contrastive initialization in the reported tables. Without this, the contribution of the initialization step to the observed gains on long-horizon tasks remains unisolated.

    Authors: We concur that an ablation isolating the time-contrastive initialization is necessary to substantiate its specific benefit. In the revised manuscript we have added a new ablation study (now reported in Section 4 and an accompanying table) that directly compares moment tokens initialized via time-contrastive learning against both random initialization and a standard contrastive-learning baseline. The results show that time-contrastive initialization yields measurably higher success rates on the history-dependent tasks, supporting the claim that it better captures temporally distinctive perceptual features. These additional experiments are now included in the paper.
    revision: yes

Circularity Check

0 steps flagged

No circularity; empirical results rest on external benchmarks

full rationale

The paper introduces moment tokens and a memory module as architectural additions to existing VLAs, then reports success rates on real-world and simulation tasks against unmodified baselines (GR00T N1.5, prior RoboCasa and LIBERO numbers). No equations, fitted parameters, or self-citations are presented as deriving the performance lift; the 47.2% delta is framed as an experimental outcome rather than a quantity forced by internal definitions or prior author work. The derivation chain is therefore self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 2 invented entities

The central claim rests on the domain assumption that history is useful for manipulation, plus two newly introduced architectural components whose effectiveness is shown only empirically.

free parameters (1)
  • memory module capacity and training hyperparameters
    The lightweight memory module contains trainable parameters whose specific values are fitted during the adaptation process.
axioms (1)
  • domain assumption: Robotic manipulation tasks are inherently history-dependent and benefit from preceding context
    Opening sentence of the abstract.
invented entities (2)
  • moment tokens (no independent evidence)
    purpose: Compactly encode perceptual information at each timestep
    New component introduced to capture temporally distinctive aspects via time-contrastive initialization.
  • memory features (no independent evidence)
    purpose: Integrate moment tokens across past timesteps for action prediction
    Output of the lightweight memory module.

pith-pipeline@v0.9.0 · 5813 in / 1470 out tokens · 52617 ms · 2026-05-18T10:54:19.610752+00:00 · methodology


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

  • IndisputableMonolith/Foundation/Breath1024.lean period8 echoes

    ECHOES: this paper passage has the same mathematical shape or conceptual pattern as the Recognition theorem, but is not a direct formal dependency.

    history length of 4... Transformer-based memory... causal self-attention

What do these tags mean?

  • matches: The paper's claim is directly supported by a theorem in the formal canon.
  • supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
  • extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
  • uses: The paper appears to rely on the theorem as machinery.
  • contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
  • unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Forward citations

Cited by 8 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. EvoScene-VLA: Evolving Scene Beliefs Inside the Action Decoder for Chunked Robot Control

    cs.RO 2026-05 conditional novelty 7.0

    EvoScene-VLA maintains an action-updated scene prior across control chunks in VLA policies, raising success rates on RoboTwin tasks from 87.2% to 89.1% fixed and 86.1% to 88.5% randomized while outperforming baselines...

  2. DSSP: Diffusion State Space Policy with Full-History Encoding

    cs.RO 2026-05 conditional novelty 7.0

    DSSP is a history-conditioned diffusion state space policy that uses SSMs to encode full observation streams with an auxiliary dynamics objective and hierarchical fusion, achieving SOTA results with reduced model size...

  3. Towards Generalizable Robotic Manipulation in Dynamic Environments

    cs.CV 2026-03 unverdicted novelty 7.0

    DOMINO dataset and PUMA architecture enable better dynamic robotic manipulation by incorporating motion history, delivering 6.3% higher success rates than prior VLA models.

  4. WarmPrior: Straightening Flow-Matching Policies with Temporal Priors

    cs.LG 2026-05 unverdicted novelty 6.0

    Replacing Gaussian noise with a temporally grounded prior from recent actions straightens flow-matching paths and improves success rates in robotic manipulation and prior-space RL.

  5. RoboMemArena: A Comprehensive and Challenging Robotic Memory Benchmark

    cs.RO 2026-05 unverdicted novelty 6.0

    RoboMemArena is a new large-scale robotic memory benchmark with real-world tasks, and PrediMem is a dual VLA system that outperforms baselines by managing memory buffers with predictive coding.

  6. Adaptive Action Chunking at Inference-time for Vision-Language-Action Models

    cs.RO 2026-04 unverdicted novelty 6.0

    Adaptive Action Chunking uses action entropy to dynamically adjust chunk sizes in VLA models, improving performance on simulated and real robotic manipulation tasks.

  7. RoboMME: Benchmarking and Understanding Memory for Robotic Generalist Policies

    cs.RO 2026-03 unverdicted novelty 6.0

    RoboMME is a new benchmark with 16 tasks and 14 memory-augmented VLA variants that shows memory effectiveness is highly task-dependent.

  8. Cosmos Policy: Fine-Tuning Video Models for Visuomotor Control and Planning

    cs.AI 2026-01 conditional novelty 6.0

    Single-stage fine-tuning of a video model to generate actions as latent frames plus future states and values yields state-of-the-art robot policy performance on LIBERO, RoboCasa, and bimanual tasks.