Dera: Decoupled representation alignment for video tokenization. arXiv preprint arXiv:2512.04483, 2025
2 Pith papers cite this work. Polarity classification is still indexing.
fields: cs.CV (2)
years: 2026 (2)
verdicts: UNVERDICTED (2)
representative citing papers
ARM is a 7B autoregressive multimodal model with a unified discrete visual tokenizer and RL that performs image understanding, generation, and editing while showing cross-task synergy from preference optimization.
citing papers explorer
- RepWAM: World Action Modeling with Representation Visual-Action Tokenizers
RepWAM introduces representation visual-action tokenizers to pretrain world action models that jointly model future visual states and latent actions under instructions for improved robot manipulation.
- ARM: An AutoRegressive Large Multimodal Model with Unified Discrete Representations
ARM is a 7B autoregressive multimodal model with a unified discrete visual tokenizer and RL that performs image understanding, generation, and editing while showing cross-task synergy from preference optimization.