Notes-to-Self: Scratchpad Augmented VLAs for Memory Dependent Manipulation Tasks

Apratim Bhattacharyya; Daniel Dijkman; Roland Memisevic; Sanjay Haresh

arxiv: 2602.21013 · v2 · pith:6YKIA2WYnew · submitted 2026-02-24 · 💻 cs.RO

Sanjay Haresh, Daniel Dijkman, Apratim Bhattacharyya, Roland Memisevic

classification 💻 cs.RO

keywords tasks · scratchpad · incorporating · language · manipulation · memory · memory-dependent · plan

Abstract

Many dexterous manipulation tasks are non-Markovian in nature, yet little attention has been paid to this fact in the recent upsurge of the vision-language-action (VLA) paradigm. Although existing VLAs succeed in bringing internet-scale semantic understanding to robotics, they are primarily "stateless" and struggle with memory-dependent, long-horizon tasks. In this work, we explore a way to impart both spatial and temporal memory to a VLA by incorporating a language scratchpad. The scratchpad makes it possible to memorize task-specific information, such as object positions, and it allows the model to keep track of a plan and of progress towards subgoals within that plan. We evaluate this approach on a split of memory-dependent tasks from the ClevrSkills environment, on MemoryBench, and on a challenging real-world pick-and-place task. We show that incorporating a language scratchpad significantly improves generalization on these tasks for both non-recurrent and recurrent models.

Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

$\mu$VLA: On Recurrent Memory for Partially Observable Manipulation in VLA Models
cs.LG · 2026-06 · unverdicted · novelty 6.0

Adding recurrent memory tokens to VLA models raises success rates on partially observable manipulation tasks from 0.42 to 0.84 on training tasks and from 0.07 to 0.23 on held-out tasks while preserving performance under full obs...