FlexLAM: Resolving the Bottleneck Trade-off in Latent Action Learning

Naruya Kondo; Takanori Yoshimoto; Tatsuya Matsushima; Yang Hu

arxiv: 2606.19408 · v1 · pith:PASKFWX6new · submitted 2026-06-17 · 💻 cs.LG · cs.RO


Pith reviewed 2026-06-26 21:07 UTC · model grok-4.3


keywords latent action models · nested dropout · variable-length codes · action alignment · video pretraining · bottleneck trade-off · world models


The pith

FlexLAM replaces fixed-capacity bottlenecks in latent action models with variable-length codes trained by nested dropout; a single model matches or exceeds separately trained fixed-capacity models at every evaluated token budget.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes that fixed-capacity latent action models face a trade-off where tight codes lose transition cues needed for alignment and loose codes retain extra variation that hurts performance when labels are scarce. FlexLAM addresses this by training with nested dropout to produce prefix-valid codes that encode compact structure first and add detail only when required. A single such model matches or surpasses multiple separately trained fixed-capacity models across token budgets in standard scarce-label supervision and in a low-return single-task alignment test. The same model supports inference-time budget adjustment without retraining and improves Ego4D transition reconstruction.

Core claim

FlexLAM replaces the fixed-capacity bottleneck in latent action models with variable-length latent actions trained via nested dropout, yielding prefix-valid codes that capture compact transition structure first and add detail only when needed, without requiring new architectures or losses; a single FlexLAM matches or surpasses separately trained fixed-capacity LAMs at every evaluated token budget under scarce-label supervision and low-return alignment stress tests.

What carries the argument

Nested dropout on latent action encoders to generate prefix-valid variable-length codes that prioritize compact transition structure.
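
As a rough illustration of the mechanism named above, the following NumPy sketch samples a retained prefix length per example and replaces the suffix slots with a shared null latent, as the Figure 2 caption describes. The function names, the uniform sampling of the prefix length, and the all-zero null latent are assumptions made for illustration; the source does not specify them.

```python
import numpy as np

def sample_prefix_lengths(rng, batch_size, max_len):
    """Draw one retained prefix length per example, uniformly from 1..max_len (assumed)."""
    return rng.integers(1, max_len + 1, size=batch_size)

def mask_suffix(z, k, null_latent):
    """Keep the first k[i] slots of z[i]; replace the rest with the shared null latent.

    z: (batch, max_len, dim), k: (batch,), null_latent: (dim,)
    """
    slots = np.arange(z.shape[1])[None, :]   # (1, max_len) slot indices
    keep = slots < k[:, None]                # (batch, max_len) boolean prefix mask
    return np.where(keep[:, :, None], z, null_latent[None, None, :])

rng = np.random.default_rng(0)
z = rng.normal(size=(4, 8, 16))   # stand-in for encoder outputs
null_latent = np.zeros(16)        # placeholder value; how the null latent is set is not stated
k = sample_prefix_lengths(rng, batch_size=4, max_len=8)
z_masked = mask_suffix(z, k, null_latent)
```

Because every training step sees a different prefix length, the decoder has to produce a usable reconstruction from any prefix, which is what makes the resulting codes prefix-valid.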

If this is right

One trained model serves all token budgets instead of requiring separate fixed-capacity models for each budget.
Inference-time token-budget adjustment becomes possible without retraining.
Alignment performance at matched token budgets improves over fixed-capacity baselines under scarce-label supervision and in the single-task alignment stress test.
Transition reconstruction accuracy increases on datasets such as Ego4D.
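
A toy analogy may help show what a prefix-valid code buys at inference time. The sketch below uses an SVD-ordered linear code, which is not FlexLAM's learned encoder, and sweeps the retained prefix length k on one fixed code without refitting anything. The data, the function name, and the use of zero as the dropped-slot value are all illustrative assumptions.

```python
import numpy as np

def prefix_reconstruction_errors(x, max_k):
    """Mean squared reconstruction error when only the first k code slots are kept."""
    x = x - x.mean(axis=0, keepdims=True)
    # SVD yields an ordered linear code: earlier slots carry more variance.
    _, _, vt = np.linalg.svd(x, full_matrices=False)
    codes = x @ vt.T                      # (n, d) full-length ordered code
    errors = []
    for k in range(max_k + 1):
        truncated = codes.copy()
        truncated[:, k:] = 0.0            # drop the suffix; zero acts as the null value here
        recon = truncated @ vt
        errors.append(float(np.mean((x - recon) ** 2)))
    return errors

rng = np.random.default_rng(0)
scales = np.array([5.0, 3.0, 2.0, 1.0, 0.5, 0.1])
data = rng.normal(size=(256, 6)) * scales
errs = prefix_reconstruction_errors(data, max_k=6)
```

In this toy the reconstruction error can only fall or stay flat as k grows, which is the behaviour a prefix-valid code is meant to have. Whether a trained FlexLAM shows the same trend for alignment loss is an empirical question; the paper's prefix sweep (Figure 7) is the relevant evidence.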

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

The approach may apply to other information-bottleneck problems in world models or sequence prediction where capacity must be chosen in advance.
Consolidating multiple fixed models into one variable-length model could reduce total training compute for applications that need several operating points.
If the prefix-valid property generalizes, deployment in environments with varying compute or bandwidth constraints becomes simpler.

Load-bearing premise

The extra tokens added by nested dropout supply transition detail that is genuinely useful for downstream alignment rather than residual variation that any comparable fixed model would ignore.

What would settle it

Train a fixed-capacity LAM at the average token length used by FlexLAM and compare it head-to-head on the same scarce-label alignment tasks; equal or superior performance would undermine the claim that the variable-length approach learns a better interface.
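
To run such a control one first needs the capacity to give the fixed model. A minimal sketch, assuming "average token length" means the expected retained prefix length under the training-time sampling distribution (the source does not define it, and both distributions below are hypothetical):

```python
import numpy as np

def expected_prefix_length(weights):
    """Expected prefix length when length k (1-based) is drawn with the given weights."""
    probs = np.asarray(weights, dtype=float)
    probs = probs / probs.sum()                 # normalise to a probability vector
    lengths = np.arange(1, len(probs) + 1)      # candidate prefix lengths 1..K
    return float(lengths @ probs)

max_len = 8                                      # hypothetical maximum token budget
uniform_mean = expected_prefix_length(np.ones(max_len))
# Truncated geometric weights favour short prefixes (hypothetical alternative).
geometric_mean = expected_prefix_length(0.5 ** np.arange(max_len))
```

Under the uniform assumption with eight slots the mean is 4.5, so the control would sit between Fixed-4 and Fixed-5; a distribution that favours short prefixes would move the matched capacity lower.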

Figures

Figures reproduced from arXiv: 2606.19408 by Naruya Kondo, Takanori Yoshimoto, Tatsuya Matsushima, Yang Hu.

**Figure 1.** The fixed-capacity bottleneck trade-off. Left: One fixed transition-code budget must serve transitions of varying complexity, creating tight- and loose-capacity failure modes for action alignment under limited labels. Right: The main DMLab result: one FlexLAM model beats separately trained Fixed-K baselines at every evaluated token budget.

**Figure 2.** FlexLAM overview. (a) During LAM pretraining, FlexLAM samples a retained prefix length k and replaces suffix slots with a shared null latent before decoder training. (b) The same prefix representation is used for latent-to-action alignment with a small labeled set. (c) A fixed latent-token evaluator predicts latent-action tokens for downstream evaluation using the same translator interface.

**Figure 3.** Scarce-label alignment and matched-budget return. Left: translator test loss versus labeled dataset size. Right: downstream normalized return under 0.025% labels at matched token budgets. Fixed-k models are trained separately; FlexLAM@k evaluates one FlexLAM model at prefix length k. FlexLAM outperforms Fixed-K at every evaluated budget.

**Figure 4.** Action alignment from a narrow single-task source. The translator is trained using labels from a single low-return source task (Lasertag One Opponent Large; 0.04% of the full dataset), then evaluated on the normalized multi-task suite excluding that source task. Left panel shows translator test loss in this narrow-source setting, compared with a control using the same label budget sampled uniformly across …

**Figure 5.** Real-world transition reconstruction. We decode latent transition tokens on Ego4D and robot-video reconstruction examples. Compared with the released villa-X-LAM reference, FlexLAM produces more stable one-step reconstructions under camera and background changes. Varying the retained prefix length k within the same model progressively adds visual detail. These examples evaluate transition reconstruction un…

**Figure 6.** Latent actions transfer across embodiments. Each row runs a round trip across two scenes by encoding the source transition, $z = \mathrm{Enc}(o_t, o_{t+1})$ (left); decoding it onto the target frame, $\hat{o}_{t+1} = \mathrm{Dec}(o^{\mathrm{tgt}}_t, z)$ (middle); re-encoding, $z' = \mathrm{Enc}(o^{\mathrm{tgt}}_t, \hat{o}_{t+1})$; and decoding back to the source, $\mathrm{Dec}(o^{\mathrm{src}}_t, z') \approx o_{t+1}$ (right). Green frames denote the source scene, red the new scene, and dashed frames are m…
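
The round trip in the Figure 6 caption can be sketched with a deliberately simple stand-in, where the latent action is the frame difference and decoding adds it back. `enc` and `dec` here are hypothetical toy functions, not the paper's learned encoder and decoder; the point is only the order of the four calls.

```python
import numpy as np

def enc(o_t, o_next):
    """Toy latent action: the change between two frames (assumption, not the paper's Enc)."""
    return o_next - o_t

def dec(o_t, z):
    """Toy decoder: apply the latent action to a frame (assumption, not the paper's Dec)."""
    return o_t + z

rng = np.random.default_rng(0)
src_t, src_next = rng.normal(size=(2, 4, 4))   # source-scene frames o_t and o_{t+1}
tgt_t = rng.normal(size=(4, 4))                # target-scene frame

z = enc(src_t, src_next)             # 1. encode the source transition
tgt_next_hat = dec(tgt_t, z)         # 2. decode it onto the target frame
z_prime = enc(tgt_t, tgt_next_hat)   # 3. re-encode the transferred transition
src_next_hat = dec(src_t, z_prime)   # 4. decode back onto the source frame

round_trip_error = float(np.max(np.abs(src_next_hat - src_next)))
```

With this additive toy the round trip is exact up to floating-point error; for a learned model the size of the residual is what the figure is probing.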

**Figure 7.** Prefix-length scaling within one FlexLAM model. Translator test loss as a function of retained prefix length k. This plot varies only the retained prefix used by the same trained FlexLAM model. Lower loss indicates better latent-to-action alignment.

**Section 6.3 excerpt (Joint LAM-Translator Fine-Tuning).** The main scarce-label experiments use the frozen-alignment setting, which isolates the quality of the latent-action interface by updating only the translator. This setting is intentionally controlled, but it is not necessarily the strongest way to use LAMs when more action labels are available. Because IDM directly observes the input frames and has no latent bottleneck,…

**Figure 8.** Joint LAM-translator fine-tuning. Using 0.5% action-labeled data, we compare translator validation loss for IDM (no bottleneck), a fixed-capacity LAM, and FlexLAM, with and without joint alignment. In the frozen setting, IDM can be stronger because it directly observes the input frames. Joint alignment allows action loss to update the LAM bottleneck, improving bottlenecked LAMs and reversing the IDM-vs-LAM…

**Figure 9.** DMLab latent-token prediction visualization. We decode latent tokens generated by the downstream sequence model to visualize predicted one-step transitions and illustrate the behavior of the evaluation pipeline.

**Figure 10.** Translator conditioning ablation. Translator test loss for three input choices, namely z only, z + previous action, and z + previous action + observation. Conditioning on $a_{t-1}$ improves prediction in egocentric settings. Directly feeding $o_t$ can make the translator more sensitive to appearance variation under limited supervision. Routing through the latent transition representation reduces direct access to …

**Figure 11.** DMLab prefix-length reconstruction. Reconstruction results for the same DMLab transition while varying retained prefix length k. Increasing k progressively recovers finer details, while small prefixes capture coarse transition structure.

**Figure 12.** Additional real-world prefix sweeps. Reconstructions from the same FlexLAM model while varying retained prefix length k across Ego4D and robot-video examples. Larger prefixes recover additional visual detail, while shorter prefixes preserve coarse transition structure.

Original abstract

Latent actions provide a compact interface between action-free video and downstream decision-making, yet existing Latent Action Models (LAMs) force every transition through a fixed-capacity bottleneck. We identify a bottleneck trade-off: overly tight codes can discard transition cues needed for action alignment, while overly loose codes preserve additional transition variation that must be resolved when alignment labels are scarce or narrowly distributed. FlexLAM replaces this fixed capacity with variable-length latent actions trained by nested dropout, yielding prefix-valid codes that capture compact transition structure first and add detail only when needed, without new architectures or losses. A single FlexLAM matches or surpasses separately trained fixed-capacity LAMs at every evaluated token budget under standard scarce-label supervision and under a low-return single-task alignment stress test, indicating that FlexLAM is not merely adjustable at inference time but learns a better latent-action interface at the same token budgets. The same model supports inference-time token-budget adjustment without retraining, and FlexLAM improves Ego4D transition reconstruction. These results suggest that variable-length latent actions are an architecture-free, drop-in upgrade to the fixed-capacity bottleneck in latent action models, latent-action world models, and video-pretrained action interfaces.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note: private letter to a colleague

FlexLAM applies nested dropout to get variable-length prefix-valid latent actions from one model, but the abstract leaves the key claim unverified.


The core idea is straightforward: train a LAM with nested dropout so the latent codes are prefix-valid, letting you dial the token budget at inference without retraining or new losses. The paper claims this single model matches or beats separately trained fixed-capacity LAMs at every budget, both in standard scarce-label settings and in a low-return alignment test, and that it improves Ego4D reconstruction.

What stands out is the simplicity. No architecture changes, no extra objectives—just nested dropout on an existing setup to turn the fixed bottleneck into an adjustable one. That is a practical move if the empirical result holds.

The main uncertainty is whether the outperformance actually shows a better interface or just different training dynamics. The abstract does not show ablations confirming that later tokens add alignment-useful structure rather than residual variation a fixed model would ignore. Without the full experiments, baselines, or checks on prefix validity, it is impossible to rule out that the gains come from how capacity is allocated during training rather than from genuinely better codes.

This is aimed at groups already using LAMs in video-pretrained RL or robotics pipelines who want to avoid choosing a single token budget. It is worth sending to peer review so the experiments can be inspected; the idea is narrow enough that a referee could quickly test whether the central claim survives scrutiny.

Referee Report

2 major / 1 minor

Summary. The paper introduces FlexLAM, which applies nested dropout during training of latent action models (LAMs) to produce variable-length, prefix-valid latent action codes. These codes capture compact transition structure in early tokens and add detail only as needed. The central claim is that a single FlexLAM matches or surpasses separately trained fixed-capacity LAMs at every evaluated token budget under scarce-label supervision and a low-return single-task alignment stress test, improves Ego4D transition reconstruction, and supports inference-time token-budget adjustment without retraining, positioning variable-length codes as an architecture-free upgrade to the fixed bottleneck in LAMs and related video-pretrained interfaces.

Significance. If the empirical results and the assumption that nested dropout yields genuinely useful incremental structure hold, this would constitute a practical advance for latent action learning by eliminating the need to train separate models for different capacities while improving performance at matched budgets. The inference-time flexibility and lack of new losses or architectures make it a potentially drop-in improvement for downstream decision-making from action-free video.

major comments (2)

[Abstract] The claim that FlexLAM 'learns a better latent-action interface at the same token budgets' (rather than merely providing an adjustable code) is load-bearing for the contribution. This rests on consistent outperformance versus separately trained fixed-capacity LAMs, yet the abstract supplies no details on how token budgets are matched across models, the training protocol for the fixed baselines, or any ablation isolating whether later tokens supply alignment-useful structure beyond residual variation that a fixed model of matched average capacity would ignore.
[Abstract] The method description states that nested dropout yields 'prefix-valid codes that capture compact transition structure first,' but provides no formulation, loss term, or verification (e.g., no equation or procedure showing that essential transition cues are forced into early tokens while later tokens add only non-redundant detail). Without this, it remains possible that the reported gains arise from training dynamics rather than an improved interface, as noted in the stress-test concern.

minor comments (1)

[Abstract] The abstract references 'standard scarce-label supervision' and 'Ego4D transition reconstruction' but does not name the primary datasets or tasks used for the main alignment experiments; adding these would improve clarity.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback. We respond point by point to the major comments below.


Referee: [Abstract] The claim that FlexLAM 'learns a better latent-action interface at the same token budgets' (rather than merely providing an adjustable code) is load-bearing for the contribution. This rests on consistent outperformance versus separately trained fixed-capacity LAMs, yet the abstract supplies no details on how token budgets are matched across models, the training protocol for the fixed baselines, or any ablation isolating whether later tokens supply alignment-useful structure beyond residual variation that a fixed model of matched average capacity would ignore.

Authors: Token budgets are matched by evaluating the single FlexLAM model on its first k tokens while training separate fixed-capacity LAM baselines with a bottleneck of exactly size k, using identical data, scarce-label supervision, and optimization protocol. These details appear in Sections 3.2 and 4.1. The low-return single-task alignment stress test (Section 4.3) functions as the requested ablation, showing that later tokens supply alignment-useful structure because FlexLAM still outperforms the matched-capacity fixed model. We will revise the abstract to briefly note the matching procedure and baseline protocol. revision: yes

Referee: [Abstract] The method description states that nested dropout yields 'prefix-valid codes that capture compact transition structure first,' but provides no formulation, loss term, or verification (e.g., no equation or procedure showing that essential transition cues are forced into early tokens while later tokens add only non-redundant detail). Without this, it remains possible that the reported gains arise from training dynamics rather than an improved interface, as noted in the stress-test concern.

Authors: Nested dropout is applied by randomly sampling a prefix length at each training step and replacing all subsequent latent slots with a shared null latent (Figure 2a), using only the standard reconstruction objective with no new loss term. This is the standard formulation (referenced in the paper) that forces essential transition cues into early tokens. Verification is provided by the stress-test results (Section 4.3), where FlexLAM outperforms fixed-capacity models trained under identical dynamics, indicating the gains stem from the variable-length interface rather than dynamics alone. We will add a short parenthetical reference to the prefix-masking procedure in the abstract if length permits. revision: partial
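
One way to write down the procedure described in this response, borrowing the Enc/Dec notation of the Figure 6 caption. The sampling distribution $p(k)$, the squared-error form of the reconstruction term, and the symbols $K$, $d$, and $z_{\varnothing}$ are assumptions introduced here; $z_{\varnothing}$ stands for whatever value fills the dropped slots (the Figure 2 caption describes a shared null latent).

```latex
\begin{aligned}
z &= \mathrm{Enc}(o_t, o_{t+1}) \in \mathbb{R}^{K \times d},
  \qquad k \sim p(k), \quad k \in \{1, \dots, K\}, \\
\tilde{z}^{(k)}_i &=
  \begin{cases}
    z_i, & i \le k, \\
    z_{\varnothing}, & i > k,
  \end{cases} \\
\mathcal{L} &= \mathbb{E}_{(o_t,\, o_{t+1})}\, \mathbb{E}_{k \sim p}
  \left\lVert o_{t+1} - \mathrm{Dec}\!\left(o_t, \tilde{z}^{(k)}\right) \right\rVert_2^2 .
\end{aligned}
```

Read this way, the only difference from a fixed-capacity LAM is the expectation over $k$: fixing $k = K$ recovers the usual single-budget reconstruction objective.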

Circularity Check

0 steps flagged

No circularity: empirical method with no derivations or load-bearing self-citations


The paper's abstract and description present FlexLAM as a method using nested dropout on standard architectures to produce variable-length codes, with central claims resting entirely on empirical performance comparisons against fixed-capacity baselines at matched token budgets. No equations, mathematical derivations, fitted parameters renamed as predictions, or self-citation chains are described or invoked to justify the interface quality. The results are externally falsifiable via the reported experiments under scarce-label and stress-test conditions, satisfying the criteria for independent content rather than reduction by construction. This matches the reader's assessment of no equations or derivations that reduce to fitted quantities.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract supplies no information on free parameters, background axioms, or new postulated entities; the method is described as using existing nested dropout on standard architectures.



Reference graph

Works this paper leans on

75 extracted references · 8 canonical work pages

[1]

Matryoshka Representation Learning , url =

Kusupati, Aditya and Bhatt, Gantavya and Rege, Aniket and Wallingford, Matthew and Sinha, Aditya and Ramanujan, Vivek and Howard-Snyder, William and Chen, Kaifeng and Kakade, Sham and Jain, Prateek and Farhadi, Ali , booktitle =. Matryoshka Representation Learning , url =
[2]

Proceedings of Robotics: Science and Systems , YEAR =

Chuan Wen AND Xingyu Lin AND John Ian Reyes So AND Kai Chen AND Qi Dou AND Yang Gao AND Pieter Abbeel , TITLE =. Proceedings of Robotics: Science and Systems , YEAR =
[3]

Track2Act: Predicting Point Tracks from Internet Videos Enables Generalizable Robot Manipulation , ISBN=

Bharadhwaj, Homanga and Mottaghi, Roozbeh and Gupta, Abhinav and Tulsiani, Shubham , year=. Track2Act: Predicting Point Tracks from Internet Videos Enables Generalizable Robot Manipulation , ISBN=. doi:10.1007/978-3-031-73116-7_18 , booktitle=

work page doi:10.1007/978-3-031-73116-7_18
[4]

2025 , eprint=

AMPLIFY: Actionless Motion Priors for Robot Learning from Videos , author=. 2025 , eprint=

2025
[5]

2025 , eprint=

EgoVLA: Learning Vision-Language-Action Models from Egocentric Human Videos , author=. 2025 , eprint=

2025
[6]

2025 , eprint=

Emergence of Human to Robot Transfer in Vision-Language-Action Models , author=. 2025 , eprint=

2025
[7]

Learning Universal Policies via Text-Guided Video Generation , url =

Du, Yilun and Yang, Sherry and Dai, Bo and Dai, Hanjun and Nachum, Ofir and Tenenbaum, Josh and Schuurmans, Dale and Abbeel, Pieter , booktitle =. Learning Universal Policies via Text-Guided Video Generation , url =
[8]

The Twelfth International Conference on Learning Representations , year=

Zero-Shot Robotic Manipulation with Pre-Trained Image-Editing Diffusion Models , author=. The Twelfth International Conference on Learning Representations , year=
[9]

2025 , eprint=

Large Video Planner Enables Generalizable Robot Control , author=. 2025 , eprint=

2025
[10]

2025 , eprint=

Latent Diffusion Planning for Imitation Learning , author=. 2025 , eprint=

2025
[11]

The Fourteenth International Conference on Learning Representations , year=

Cosmos Policy: Fine-Tuning Video Models for Visuomotor Control and Planning , author=. The Fourteenth International Conference on Learning Representations , year=
[12]

The Thirteenth International Conference on Learning Representations , year=

Latent Action Pretraining from Videos , author=. The Thirteenth International Conference on Learning Representations , year=
[13]

9th Annual Conference on Robot Learning , year=

UniSkill: Imitating Human Videos via Cross-Embodiment Skill Representations , author=. 9th Annual Conference on Robot Learning , year=
[14]

Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV) , month =

Chen, Yi and Ge, Yuying and Tang, Weiliang and Li, Yizhuo and Ge, Yixiao and Ding, Mingyu and Shan, Ying and Liu, Xihui , title =. Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV) , month =. 2025 , pages =

2025
[15]

Proceedings of Robotics: Science and Systems , YEAR =

Qingwen Bu AND Yanting Yang AND Jisong Cai AND Shenyuan Gao AND Guanghui Ren AND Maoqing Yao AND Ping Luo AND Hongyang Li , TITLE =. Proceedings of Robotics: Science and Systems , YEAR =
[16]

DynaMo: In-Domain Dynamics Pretraining for Visuo-Motor Control , url =

Cui, Zichen Jeff and Pan, Hengkai and Iyer, Aadhithya and Haldar, Siddhant and Pinto, Lerrel , booktitle =. DynaMo: In-Domain Dynamics Pretraining for Visuo-Motor Control , url =. doi:10.52202/079017-1069 , editor =

work page doi:10.52202/079017-1069
[17]

The Fourteenth International Conference on Learning Representations , year=

villa-X: Enhancing Latent Action Modeling in Vision-Language-Action Models , author=. The Fourteenth International Conference on Learning Representations , year=
[18]

2025 , eprint=

iFlyBot-VLA Technical Report , author=. 2025 , eprint=

2025
[19]

Proceedings of the 36th International Conference on Machine Learning , pages =

Imitating Latent Policies from Observation , author =. Proceedings of the 36th International Conference on Machine Learning , pages =. 2019 , editor =

2019
[20]

International Conference on Learning Representations , year=

Learning what you can do before doing anything , author=. International Conference on Learning Representations , year=
[21]

Learning to Act without Actions , url =

Schmidt, Dominik and Jiang, Minqi , booktitle =. Learning to Act without Actions , url =
[22]

The Thirty-ninth Annual Conference on Neural Information Processing Systems , year=

What Do Latent Action Models Actually Learn? , author=. The Thirty-ninth Annual Conference on Neural Information Processing Systems , year=
[23]

Forty-second International Conference on Machine Learning , year=

Latent Action Learning Requires Supervision in the Presence of Distractors , author=. Forty-second International Conference on Machine Learning , year=
[24]

2025 , eprint=

CLAM: Continuous Latent Action Models for Robot Learning from Unlabeled Demonstrations , author=. 2025 , eprint=

2025
[25]

2014 , eprint=

Learning Ordered Representations with Nested Dropout , author=. 2014 , eprint=

2014
[26]

Stochastic Bottleneck: Rateless Auto-Encoder for Flexible Dimensionality Reduction , url=

Koike-Akino, Toshiaki and Wang, Ye , year=. Stochastic Bottleneck: Rateless Auto-Encoder for Flexible Dimensionality Reduction , url=. doi:10.1109/isit44484.2020.9174523 , booktitle=

work page doi:10.1109/isit44484.2020.9174523 2020
[27]

2025 , eprint=

LoHoVLA: A Unified Vision-Language-Action Model for Long-Horizon Embodied Tasks , author=. 2025 , eprint=

2025
[28]

Kevin Black and Noah Brown and James Darpinian and Karan Dhabalia and Danny Driess and Adnan Esmail and Michael Robert Equi and Chelsea Finn and Niccolo Fusai and Manuel Y. Galliker and Dibya Ghosh and Lachy Groom and Karol Hausman and brian ichter and Szymon Jakubczak and Tim Jones and Liyiming Ke and Devin LeBlanc and Sergey Levine and Adrian Li-Bell an...

2025
[29]

Moo Jin Kim and Karl Pertsch and Siddharth Karamcheti and Ted Xiao and Ashwin Balakrishna and Suraj Nair and Rafael Rafailov and Ethan P Foster and Pannag R Sanketi and Quan Vuong and Thomas Kollar and Benjamin Burchfiel and Russ Tedrake and Dorsa Sadigh and Sergey Levine and Percy Liang and Chelsea Finn , booktitle=. Open. 2024 , url=

2024
[30]

Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV) , month =

Peebles, William and Xie, Saining , title =. Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV) , month =. 2023 , pages =

2023
[31]

2025 , eprint=

Rethinking the shape convention of an MLP , author=. 2025 , eprint=

2025
[32]

Proceedings of the 41st International Conference on Machine Learning , pages =

Genie: Generative Interactive Environments , author =. Proceedings of the 41st International Conference on Machine Learning , pages =. 2024 , editor =

2024
[33]

Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) , month =

Menapace, Willi and Lathuiliere, Stephane and Tulyakov, Sergey and Siarohin, Aliaksandr and Ricci, Elisa , title =. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) , month =. 2021 , pages =

2021
[34]

NeurIPS 2023 Foundation Models for Decision Making Workshop , year=

A Universal World Model Learned from Large Scale and Diverse Videos , author=. NeurIPS 2023 Foundation Models for Decision Making Workshop , year=

2023
[35]

2025 , editor =

Gao, Shenyuan and Zhou, Siyuan and Du, Yilun and Zhang, Jun and Gan, Chuang , booktitle =. 2025 , editor =

2025
[36]

Tokenization Workshop , year=

One-D-Piece: Image Tokenizer Meets Quality-Controllable Compression , author=. Tokenization Workshop , year=
[37]

Proceedings of the 42nd International Conference on Machine Learning , pages =

Bachmann, Roman and Allardice, Jesse and Mizrahi, David and Fini, Enrico and Kar, O. Proceedings of the 42nd International Conference on Machine Learning , pages =. 2025 , editor =

2025
[38]

Deep learning and the information bottleneck principle

Tishby, Naftali and Zaslavsky, Noga , year=. Deep learning and the information bottleneck principle , url=. doi:10.1109/itw.2015.7133169 , booktitle=

work page doi:10.1109/itw.2015.7133169 2015
[39]

Rissanen, J. , year=. Modeling by shortest data description , volume=. Automatica , publisher=. doi:10.1016/0005-1098(78)90005-5 , number=

work page doi:10.1016/0005-1098(78)90005-5
[40]

2024 , eprint=

Mastering Diverse Domains through World Models , author=. 2024 , eprint=

2024
[41]

2026 , eprint=

Learning Latent Action World Models In The Wild , author=. 2026 , eprint=

2026
[42]

The Twelfth International Conference on Learning Representations , year=

Unleashing Large-Scale Video Generative Pre-training for Visual Robot Manipulation , author=. The Twelfth International Conference on Learning Representations , year=
[43]

2024 , eprint=

GR-2: A Generative Video-Language-Action Model with Web-Scale Knowledge for Robot Manipulation , author=. 2024 , eprint=

2024
[44]

2025 , eprint=

GR00T N1: An Open Foundation Model for Generalist Humanoid Robots , author=. 2025 , eprint=

2025
[45]

2024 , eprint=

IGOR: Image-GOal Representations are the Atomic Control Units for Foundation Models in Embodied AI , author=. 2024 , eprint=

2024
[46]

2018 , eprint=

DeepMind Control Suite , author=. 2018 , eprint=

2018
[47]

The Eleventh International Conference on Learning Representations , year=

Become a Proficient Player with Limited Data through Watching Pure Videos , author=. The Eleventh International Conference on Learning Representations , year=
[48]

Ego4D: Around the World in 3,000 Hours of Egocentric Video , booktitle =

Grauman, Kristen and Westbury, Andrew and Byrne, Eugene and Chavis, Zachary and Furnari, Antonino and Girdhar, Rohit and Hamburger, Jackson and Jiang, Hao and Liu, Miao and Liu, Xingyu and Martin, Miguel and Nagarajan, Tushar and Radosavovic, Ilija and Ramakrishnan, Santhosh Kumar and Ryan, Fiona and Sharma, Jayant and Wray, Michael and Xu, Mengmeng and X...

2022
[49]

Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) , month =

Liu, Kun and Liu, Qi and Liu, Xinchen and Li, Jie and Zhang, Yongdong and Luo, Jiebo and He, Xiaodong and Liu, Wu , title =. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) , month =. 2025 , pages =

2025
[50]

2024 , eprint=

VidGen-1M: A Large-Scale Dataset for Text-to-video Generation , author=. 2024 , eprint=

2024
[51]

The Thirteenth International Conference on Learning Representations , year=

OpenVid-1M: A Large-Scale High-Quality Dataset for Text-to-video Generation , author=. The Thirteenth International Conference on Learning Representations , year=
[52]

Finite Scalar Quantization:

Fabian Mentzer and David Minnen and Eirikur Agustsson and Michael Tschannen , booktitle=. Finite Scalar Quantization:. 2024 , url=

2024
[53]

2026 , eprint=

Co-Evolving Latent Action World Models , author=. 2026 , eprint=

2026
[54]

2022 , eprint=

Classifier-Free Diffusion Guidance , author=. 2022 , eprint=

2022
[55]

, title =

Brooks, Tim and Holynski, Aleksander and Efros, Alexei A. , title =. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) , month =. 2023 , pages =

2023
[56]

Proceedings of the 41st International Conference on Machine Learning , pages =

Scaling Rectified Flow Transformers for High-Resolution Image Synthesis , author =. Proceedings of the 41st International Conference on Machine Learning , pages =. 2024 , editor =

2024
[57]

O’Neill, Abby and Rehman, Abdul and Maddukuri, Abhiram and Gupta, Abhishek and Padalkar, Abhishek and Lee, Abraham and Pooley, Acorn and Gupta, Agrim and Mandlekar, Ajay and Jain, Ajinkya and Tung, Albert and Bewley, Alex and Herzog, Alex and Irpan, Alex and Khazatsky, Alexander and Rai, Anant and Gupta, Anchit and Wang, Andrew and Singh, Anikait and Garg...

work page doi:10.1109/icra57147.2024.10611477 2024
[58]

Alexander Khazatsky AND Karl Pertsch AND Suraj Nair AND Ashwin Balakrishna AND Sudeep Dasari AND Siddharth Karamcheti AND Soroush Nasiriany AND Mohan Kumar Srirama AND Lawrence Yunliang Chen AND Kirsty Ellis AND Peter David Fagan AND Joey Hejna AND Masha Itkina AND Marion Lepert AND Yecheng Jason Ma AND Patrick Tree Miller AND Jimmy Wu AND Suneel Belkhale...
[59]

Goyal, Raghav and Ebrahimi Kahou, Samira and Michalski, Vincent and Materzynska, Joanna and Westphal, Susanne and Kim, Heuna and Haenel, Valentin and Fruend, Ingo and Yianilos, Peter and Mueller-Freitag, Moritz and Hoppe, Florian and Thurau, Christian and Bax, Ingo and Memisevic, Roland. Proceedings of the IEEE International Conference on Computer Vision (ICCV).
[60]

Kevin Black and Noah Brown and Danny Driess and Adnan Esmail and Michael Robert Equi and Chelsea Finn and Niccolo Fusai and Lachy Groom and Karol Hausman and Brian Ichter and Szymon Jakubczak and Tim Jones and Liyiming Ke and Sergey Levine and Adrian Li-Bell and Mohith Mothukuri and Suraj Nair and Karl Pertsch and Lucy Xiaoyang Shi and Laura Smith and Jam... Proceedings of Robotics: Science and Systems.
[61]

Wang, Limin and Huang, Bingkun and Zhao, Zhiyu and Tong, Zhan and He, Yinan and Wang, Yi and Wang, Yali and Qiao, Yu. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2023.
[62]

Zhang, Yiwei and Lu, Guo and Chen, Yunuo and Wang, Shen and Shi, Yibo and Wang, Jing and Song, Li. Neural Rate Control for Learned Video Compression.
[63]

Fathima, Noor and Petersen, Jens and Sautière, Guillaume and Wiggers, Auke and Pourreza, Reza. Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision (WACV), 2023.
[64]

AgiBot World Colosseo: A Large-scale Manipulation Platform for Scalable and Intelligent Embodied Systems. 2025.
[65]

The Distracting Control Suite -- A Challenging Benchmark for Reinforcement Learning from Pixels. 2021.
[66]

Hung-Ju Lee and Tihao Chiang and Ya-Qin Zhang. Scalable rate control for MPEG-4 video. IEEE Transactions on Circuits and Systems for Video Technology. doi:10.1109/76.867926.
[67]

DeepMind Lab. 2016.
[68]

Flow Straight and Fast: Learning to Generate and Transfer Data with Rectified Flow. The Eleventh International Conference on Learning Representations.
[69]

OAT: Ordered Action Tokenization. 2026.
[70]

Junhong Shen and Kushal Tirumala and Michihiro Yasunaga and Ishan Misra and Luke Zettlemoyer and Lili Yu and Chunting Zhou. 2026.
[71]

Flow Matching for Generative Modeling. The Eleventh International Conference on Learning Representations.
[72]

Zhang, Richard and Isola, Phillip and Efros, Alexei A. and Shechtman, Eli and Wang, Oliver. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
[73]

Wang, Zhouxia and Yuan, Ziyang and Wang, Xintao and Li, Yaowei and Chen, Tianshui and Xia, Menghan and Luo, Ping and Shan, Ying. MotionCtrl: A Unified and Flexible Motion Controller for Video Generation. doi:10.1145/3641519.3657518.
[74]

DragNUWA: Fine-grained Control in Video Generation by Integrating Text, Image, and Trajectory. 2023.
[75]

Burgert, Ryan and Xu, Yuancheng and Xian, Wenqi and Pilarski, Oliver and Clausen, Pascal and He, Mingming and Ma, Li and Deng, Yitong and Li, Lingxiao and Mousavi, Mohsen and Ryoo, Michael and Debevec, Paul and Yu, Ning. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2025.
