pith. sign in

WALL-WM: Carving World Action Modeling at the Event Joints

3 Pith papers cite this work. Polarity classification is still indexing.

3 Pith papers citing it
abstract

WALL-WM is a World Action Model that shifts video-action learning from chunk-centric optimization to event-grounded Vision-Language-Action pretraining, using semantically coherent action events as the atomic unit of learning. Existing WAMs commonly initialize from multimodal or video foundation models and then optimize fixed-length action chunks conditioned directly on the current observation and instruction. Although convenient, this chunk-centric formulation creates a fundamental granularity mismatch. Language describes semantic goals and events, vision evolves through continuous scene dynamics, and actions operate at control-level timescales; forcing all three into the same fixed-length prediction window turns VLA training into short-horizon correlation fitting. WALL-WM addresses this mismatch by organizing both supervision and data around semantic events. Specifically, it pairs event-grounded VLA pretraining with a data ecosystem built from event-level captions and cluster-balanced sampling, enabling scalable learning over diverse behaviors, scenes, and task structures. From the same event-pretrained backbone, WALL-WM supports two complementary inference modes. The event mode consumes next-event descriptions and enables variable-length execution chunks, while the unified mode uses a VLM with Staircase Decoding to condition conventional fixed-length chunk inference while preserving a gradient-continuous VLA path. Together with Muon-optimizer-based large-scale pretraining infrastructure, WALL-WM provides a practical scale-up recipe for general-purpose WAMs. Experiments show that WALL-WM generalizes broadly across language, scenes, and tasks, achieving state-of-the-art performance in large-scale real-world generalization evaluation.

citation-role summary

background 1

citation-polarity summary

fields

cs.RO 2 cs.CV 1

years

2026 3

verdicts

UNVERDICTED 3

roles

background 1

polarities

background 1

clear filters

representative citing papers

World Action Models: A Survey

cs.RO · 2026-06-18 · unverdicted · novelty 3.0

A survey that clarifies boundaries and organizes World Action Models by generation requirements and predictive substrates, identifying a trend toward generating less of the future.

citing papers explorer

Showing 3 of 3 citing papers after filters.

  • X-Tokenizer: A Multimodal Action Tokenizer for Vision-Language-Action Pretraining cs.CV · 2026-06-07 · unverdicted · none · ref 11 · internal anchor

    X-Tokenizer creates semantic action tokens via asymmetric residual quantization and contrastive pretraining on large trajectory data, outperforming prior methods like FAST on robotic tasks.

  • MemoryWAM: Efficient World Action Modeling with Persistent Memory cs.RO · 2026-06-18 · unverdicted · none · ref 40 · internal anchor

    MemoryWAM is a world action model with a hybrid memory design using recent frames, anchor frames, and gist tokens for efficient long-horizon robotic manipulation.

  • World Action Models: A Survey cs.RO · 2026-06-18 · unverdicted · none · ref 92 · internal anchor

    A survey that clarifies boundaries and organizes World Action Models by generation requirements and predictive substrates, identifying a trend toward generating less of the future.