FOCA: Future-Oriented Conditioning for Data-Efficient Vision-Language-Action Adaptation

An Thai Le; Artur Habuda; Bao Thach; Binh Gia Nguyen; Daniel Sonntag; Doanh Le; Duc Minh Nguyen; Duy M. H. Nguyen; Hung Ngo; Khoa D. Doan

arxiv: 2606.20867 · v1 · pith:NNEKIH2Hnew · submitted 2026-06-18 · 💻 cs.CV · cs.AI

FOCA: Future-Oriented Conditioning for Data-Efficient Vision-Language-Action Adaptation

Duc Minh Nguyen , Nghiem Tuong Diep , Binh Gia Nguyen , Trong-Bao Ho , Doanh Le , Tan Q. Nguyen , Thien-Loc Ha , Nhiem Tran

show 15 more authors

Bao Thach Nhat X. Tran Tuan A. Tran Artur Habuda Philip Lund M{\o}ller Tran Nguyen Le Daniel Sonntag Matthias Niepert Khoa D. Doan Vu Duong Hung Ngo Minh N. Vu Duy M. H. Nguyen An Thai Le Ngo Anh Vien

This is my paper

Pith reviewed 2026-06-26 17:59 UTC · model grok-4.3

classification 💻 cs.CV cs.AI

keywords vision-language-actionfew-shot imitation learningfuture conditioningrobotic controldata-efficient adaptationlatent space reasoningsynthetic video co-training

0 comments

The pith

FOCA conditions VLA models on predicted future interaction embeddings to reach 95.7 percent success with only 20 demonstrations.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper shows that standard VLA adaptation degrades quickly when demonstrations drop below hundreds. FOCA counters this by adding explicit prediction of future interaction embeddings plus implicit alignment to future goal observations. The method operates entirely in latent space, supports co-training on action-free synthetic videos, and interprets the result as a future-conditioned value-like representation. On LIBERO it hits 95.7 percent success at 20 shots; on RoboCasa it gains 7-12 points and on real robots up to 26 points absolute.

Core claim

FOCA is a future-oriented conditioning framework that combines explicit prediction of task-grounded future interaction embeddings with implicit alignment to future goal observations. This formulation enables long-horizon reasoning in latent space without pixel-level prediction and without requiring action labels in the co-training data, naturally supporting action-free co-training with synthetic videos from video world models.

What carries the argument

Future-oriented conditioning via explicit future interaction embedding prediction plus implicit goal-observation alignment, functioning as a future-conditioned value-like representation.

If this is right

Few-shot success on LIBERO reaches 95.7 percent at 20 demonstrations.
Gains of 7-12 percent appear on RoboCasa and up to 26 percent absolute on real robots.
Action-free co-training with synthetic video becomes directly usable.
The approach yields a new state of the art for few-shot VLA adaptation.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The latent future-conditioning approach may transfer to other sequential control domains that lack dense labels.
Pairing with improving video world models could further reduce reliance on real robot data.
If the value-like interpretation holds, similar conditioning could be tested in model-based planning pipelines.

Load-bearing premise

Predicting future interaction embeddings and aligning them to goal observations is enough to produce effective long-horizon reasoning without pixel-level outputs or action labels.

What would settle it

Run FOCA on a suite of tasks where accurate future embeddings cannot be formed from 20 demonstrations and measure whether success collapses below the reported baselines.

Figures

Figures reproduced from arXiv: 2606.20867 by An Thai Le, Artur Habuda, Bao Thach, Binh Gia Nguyen, Daniel Sonntag, Doanh Le, Duc Minh Nguyen, Duy M. H. Nguyen, Hung Ngo, Khoa D. Doan, Matthias Niepert, Minh N. Vu, Nghiem Tuong Diep, Ngo Anh Vien, Nhat X. Tran, Nhiem Tran, Philip Lund M{\o}ller, Tan Q. Nguyen, Thien-Loc Ha, Tran Nguyen Le, Trong-Bao Ho, Tuan A. Tran, Vu Duong.

**Figure 1.** Figure 1: FOCA augments VLA adaptation by injecting future-oriented conditioning into a vision–language model Fθ(·) and a diffusionbased action policy Aϕ(·). It jointly learns explicit latent prediction of task-grounded future interaction (bounding boxes B i t+ ) embeddings via learnable r exp t tokens and implicit alignment to future goal observations via r imp t tokens, enabling data-efficient, action-free adapta… view at source ↗

**Figure 2.** Figure 2: FOCA vs SoTA VLA models on full data scale. 10, Goal, Object, and Spatial denote the Libero-10, Libero-Goal, Libero-Object, and Libero-Spatial benchmark suites, respectively. Discounted goal occupancy induced by geometric sampling. We sample a future offset k ∼ Geom(1 − γ) on {1, 2, . . .} and treat (xt, gt+k) as a positive pair. This induces the goal-conditioned discounted occupancy under the demonstrat… view at source ↗

**Figure 3.** Figure 3: (a) FOCA’s performance on four real-robot tasks using a humanoid (VinR-H3, a simulated task) (GR00T N1.5 VLA, evaluation on 50 trials), and three other with bi-arm Aloha robot using π0 VLA; (b) Ablation study on explicit future prediction ; (c) FOCA’s generalization performance across π0 and GR00T N1.5 on Robocasa, with 30 demos (top) and 100 demos (bottom) on most 5 challenge tasks; (d) π0 vs FOCA, FOCA w… view at source ↗

**Figure 4.** Figure 4: Example rollouts from LIBERO simulation. C.2. Failure correction via future conditioning To better understand the behavioral differences induced by future-oriented conditioning, we visualize continuous frame sequences for representative failure cases ( [PITH_FULL_IMAGE:figures/full_fig_p016_4.png] view at source ↗

**Figure 5.** Figure 5: Example rollouts from ROBOCASA simulation. C.3. DreamGen hallucination analysis [PITH_FULL_IMAGE:figures/full_fig_p017_5.png] view at source ↗

**Figure 6.** Figure 6: Example rollouts from ALOHA real-world robot. 18 [PITH_FULL_IMAGE:figures/full_fig_p018_6.png] view at source ↗

**Figure 7.** Figure 7: Example episodes from Place Part performed by VinRobotics’ simulated VR-H3 humanoid on Mujoco. Grasp misalignment!!! + FOCA 1s 9s 17s 21s 22s 23s 1s 9s 17s 25s 32s 40s+ Hallucinate! turn on the stove and put the moka pot on it [PITH_FULL_IMAGE:figures/full_fig_p019_7.png] view at source ↗

**Figure 8.** Figure 8: Example rollouts where the baseline π0 policy fails (top) and FOCA succeeds (bottom) [PITH_FULL_IMAGE:figures/full_fig_p019_8.png] view at source ↗

**Figure 9.** Figure 9: Synthetic rollouts generated by DreamGen at different data scales. (a) Hallucinated cases with 40% training data; (b) successful cases with 100% training data [PITH_FULL_IMAGE:figures/full_fig_p020_9.png] view at source ↗

**Figure 10.** Figure 10: FOCA’s generalizations performance across Pi-Zero and Groot N1.5 on Robocasa, with 30 demos (top) and 100 demos (bottom) on most 5 challenge tasks caption, the model produces a structured analysis divided into distinct categories: Setting, Robotic Agent, Task Objectives, Action Sequence, and Environmental Interaction. This structured output acts as a high-fidelity semantic bridge, explicitly isolating the… view at source ↗

**Figure 11.** Figure 11: Architecture overview of the explicit alignment setting [PITH_FULL_IMAGE:figures/full_fig_p022_11.png] view at source ↗

**Figure 12.** Figure 12: Region of interest in RoboCasa [PITH_FULL_IMAGE:figures/full_fig_p023_12.png] view at source ↗

**Figure 13.** Figure 13: Region of interest in real-world settings For each episode, we reset the enviroment to its saved initial state and replay each action from the demotration labels. During re-rendering, we record observations that now include segmentation masks, which are not provided in the original datasets. Moreover, to ensure the constructed dataset maintains high quality, we perform additional filtering to remove no-op… view at source ↗

**Figure 14.** Figure 14: Standardized prompt template for Stage 1: The VLM is instructed to decompose the video into a 7-point structured attribute analysis. 24 [PITH_FULL_IMAGE:figures/full_fig_p024_14.png] view at source ↗

**Figure 15.** Figure 15: Camera views in LIBERO and RoboCasa E.3. ALOHA Tasks We evaluate our method on three real-robot manipulation tasks using the Mobile ALOHA platform. All tasks require bimanual manipulation and multi-view visual observations. The experiments are executed on a single workstation equipped with an NVIDIA RTX 5090 GPU. Task definitions, initialization, and the number of demonstrations are described below. Open … view at source ↗

**Figure 16.** Figure 16: The three handcrafted tabletop tasks with UR5 [PITH_FULL_IMAGE:figures/full_fig_p026_16.png] view at source ↗

**Figure 18.** Figure 18: Real-world results of the three manipulation tasks on UR5 Left Camera Right CameraFront Camera (a) Camera views Start Align Place (b) Task execution stages Bottom-up Right viewBottom-up Left View (c) Virtual views [PITH_FULL_IMAGE:figures/full_fig_p027_18.png] view at source ↗

**Figure 19.** Figure 19: Place Part in MuJoCo: camera views, key execution stages, and virtual views. (a) Left, front, and right camera views. (b) Task stages: Start, Align, and Place. (c) Bottom-up virtual views. Green annotations illustrate the strict spatial alignment constraints of the task, requiring the two holes on either side of the part to be precisely aligned with and seated onto the two small pins on either side of the… view at source ↗

read the original abstract

Vision-Language-Action (VLA) models enable general-purpose robotic control via large-scale multimodal pretraining, yet their effectiveness under few-shot imitation learning remains limited. We conduct a systematic stress test of state-of-the-art VLA models and show that performance degrades sharply as demonstrations are reduced, revealing a key weakness of existing adaptation strategies. To address this, we introduce FOCA, a future-oriented conditioning framework for data-efficient VLA adaptation. FOCA combines explicit prediction of task-grounded future interaction embeddings with implicit alignment to future goal observations, enabling long-horizon reasoning in latent space without pixel-level prediction. This formulation naturally supports action-free co-training with synthetic videos from video world models and can be interpreted as learning a future-conditioned value-like representation. Extensive experiments demonstrate FOCA achieves 95.7% success with 20 demonstrations on LIBERO, improves 7-12% on RoboCasa, and delivers up to 26% absolute gains on real robots, establishing a new state of the art in few-shot VLA adaptation.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

FOCA reports big few-shot gains on VLA benchmarks but the abstract leaves the core mechanisms and controls too opaque to assess.

read the letter

The paper's central claim is that adding explicit prediction of task-grounded future interaction embeddings plus implicit goal alignment lets VLA models handle long-horizon tasks with far fewer demonstrations, and it backs this with 95.7% success on LIBERO using only 20 demos plus solid lifts on RoboCasa and real robots. That addresses a genuine practical bottleneck.

What stands out is the framing that supports action-free co-training on synthetic videos from world models. Treating the setup as learning a future-conditioned value-like representation is a reasonable way to connect the pieces without pixel prediction. The combination looks fresher than the individual pieces that already exist in video and RL work.

The soft spots are mostly around verification. The abstract gives no baseline details, no ablation numbers, no mention of statistical tests, and no description of how the future embeddings are actually extracted or kept task-grounded when action labels are absent. The stress-test note is right that we cannot check whether the implicit alignment term really carries long-horizon credit or whether the co-training data was matched across methods. If the full paper has those controls and reproducible code, the numbers become credible; right now they sit on an uninspectable implementation.

This is for people already working on VLA adaptation and imitation learning who need data-efficient methods for new tasks. It is worth a serious referee if the methods section supplies the missing architecture, loss terms, and training protocols. Otherwise it risks being another set of headline numbers that do not hold up under closer inspection.

Referee Report

3 major / 2 minor

Summary. The paper introduces FOCA, a future-oriented conditioning framework for data-efficient Vision-Language-Action (VLA) adaptation. It combines explicit prediction of task-grounded future interaction embeddings with implicit alignment to future goal observations to enable long-horizon reasoning in latent space without pixel-level prediction or action labels. This supports action-free co-training with synthetic videos from video world models. Experiments report 95.7% success with 20 demonstrations on LIBERO, 7-12% gains on RoboCasa, and up to 26% absolute gains on real robots, claiming a new state of the art in few-shot VLA adaptation.

Significance. If the empirical claims hold after verification, the work would offer a meaningful contribution to few-shot robotic imitation learning by improving data efficiency in VLA models through latent-space future conditioning and enabling co-training without action labels.

major comments (3)

Abstract: the reported performance numbers (95.7% on LIBERO-20, 7-12% RoboCasa gains, 26% real-robot gains) are presented without any description of baselines, statistical significance tests, ablation controls, or exact training protocols, preventing verification of whether the central claim of improved few-shot adaptation is supported by the data.
Methods section (architecture and losses): the concrete formulation of the explicit future-interaction-embedding prediction, the implicit alignment term, the embedding extraction procedure, and the co-training protocol with synthetic videos are not supplied, so it is impossible to assess whether the approach actually enables effective long-horizon reasoning in latent space or whether baselines received identical co-training data.
Experiments: no details are given on how 'task-grounded' embeddings are obtained without action labels or whether the implicit alignment term propagates long-horizon credit, leaving the weakest assumption of the paper untestable from the provided text.

minor comments (2)

Add error bars, multiple random seeds, and statistical tests to all reported success rates and improvement percentages.
Clarify the precise definition of 'future interaction embeddings' and how they differ from standard goal-conditioned representations.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the detailed and constructive feedback. We address each major comment point-by-point below, providing clarifications from the full manuscript and indicating where revisions will improve verifiability and clarity.

read point-by-point responses

Referee: Abstract: the reported performance numbers (95.7% on LIBERO-20, 7-12% RoboCasa gains, 26% real-robot gains) are presented without any description of baselines, statistical significance tests, ablation controls, or exact training protocols, preventing verification of whether the central claim of improved few-shot adaptation is supported by the data.

Authors: The abstract is intentionally concise due to length limits, but the full manuscript (Section 4 and Appendix) details the baselines (standard VLA fine-tuning, RT-1/RT-2 adaptation, and recent few-shot methods), reports results as means over 5 random seeds with standard deviations and t-test significance, includes ablations, and specifies training protocols (e.g., 20 demos, batch size, learning rate). We will revise the abstract to add a one-sentence summary of the evaluation setup and direct readers to the experiments section. revision: yes
Referee: Methods section (architecture and losses): the concrete formulation of the explicit future-interaction-embedding prediction, the implicit alignment term, the embedding extraction procedure, and the co-training protocol with synthetic videos are not supplied, so it is impossible to assess whether the approach actually enables effective long-horizon reasoning in latent space or whether baselines received identical co-training data.

Authors: Section 3 of the manuscript supplies these: explicit prediction uses an L2 regression loss on future interaction embeddings extracted from a goal-conditioned encoder; implicit alignment employs a contrastive InfoNCE loss between current and future goal embeddings; embeddings are obtained via a frozen pre-trained VLM applied to visual observations conditioned only on language task descriptions (no action labels required); co-training uses synthetic videos from a video world model processed identically for FOCA and baselines. To enhance accessibility, we will insert explicit equations, a pseudocode algorithm, and a table confirming identical co-training data for all methods. revision: yes
Referee: Experiments: no details are given on how 'task-grounded' embeddings are obtained without action labels or whether the implicit alignment term propagates long-horizon credit, leaving the weakest assumption of the paper untestable from the provided text.

Authors: Task-grounded embeddings are produced by feeding observations and language task prompts into a frozen VLM encoder (e.g., CLIP or SigLIP variant) that aligns visual features to language without any action supervision; the implicit alignment term is a multi-horizon contrastive loss that directly compares embeddings at future timesteps, thereby propagating credit across the horizon in latent space. Section 3.2 and the experiments ablations demonstrate this via controlled variants. We will expand the text with a dedicated paragraph and additional figures showing credit propagation to make this fully testable. revision: yes

Circularity Check

0 steps flagged

No derivation chain or equations present; empirical results only

full rationale

The abstract and available text introduce FOCA as a framework that 'combines explicit prediction of task-grounded future interaction embeddings with implicit alignment to future goal observations' but supply no equations, loss formulations, embedding extraction procedures, or derivation steps. No self-citations, fitted parameters renamed as predictions, or uniqueness theorems appear. Reported numbers (95.7% on LIBERO-20, gains on RoboCasa and real robots) are presented as experimental outcomes, not outputs of a closed mathematical chain. This matches the default expectation of no significant circularity when no load-bearing derivation exists to inspect.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review supplies no equations, so free parameters, axioms, and invented entities cannot be enumerated; the method implicitly relies on the existence of useful future interaction embeddings and the validity of synthetic video data from world models.

pith-pipeline@v0.9.1-grok · 5813 in / 1112 out tokens · 14500 ms · 2026-06-26T17:59:03.269695+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

73 extracted references · 25 linked inside Pith

[1]

Advances in Neural Information Processing Systems , volume=

Contrastive learning as goal-conditioned reinforcement learning , author=. Advances in Neural Information Processing Systems , volume=
[2]

arXiv preprint arXiv:2306.03346 , year=

Stabilizing contrastive rl: Techniques for robotic goal reaching from offline data , author=. arXiv preprint arXiv:2306.03346 , year=

arXiv
[3]

Proceedings of the thirteenth international conference on artificial intelligence and statistics , pages=

Noise-contrastive estimation: A new estimation principle for unnormalized statistical models , author=. Proceedings of the thirteenth international conference on artificial intelligence and statistics , pages=. 2010 , organization=

2010
[4]

The journal of machine learning research , volume=

Noise-contrastive estimation of unnormalized statistical models, with applications to natural image statistics , author=. The journal of machine learning research , volume=. 2012 , publisher=

2012
[5]

arXiv preprint arXiv:1807.03748 , year=

Representation Learning with Contrastive Predictive Coding , author=. arXiv preprint arXiv:1807.03748 , year=

Pith/arXiv arXiv
[6]

arXiv preprint arXiv:2510.00695 , year=

Hamlet: Switch your vision-language-action model into a history-aware policy , author=. arXiv preprint arXiv:2510.00695 , year=

Pith/arXiv arXiv
[7]

arXiv preprint arXiv:2501.09747 , year=

Fast: Efficient action tokenization for vision-language-action models , author=. arXiv preprint arXiv:2501.09747 , year=

Pith/arXiv arXiv
[8]

Advances in Neural Information Processing Systems , year =

ThinkAct: Vision-Language-Action Reasoning via Reinforced Visual Latent Planning , author =. Advances in Neural Information Processing Systems , year =
[9]

Advances in Neural Information Processing Systems , year =

DreamVLA: A Vision-Language-Action Model Dreamed with Comprehensive World Knowledge , author =. Advances in Neural Information Processing Systems , year =
[10]

arXiv preprint arXiv:2506.01844 , year=

Smolvla: A vision-language-action model for affordable and efficient robotics , author=. arXiv preprint arXiv:2506.01844 , year=

Pith/arXiv arXiv
[11]

International conference on machine learning , pages=

Universal value function approximators , author=. International conference on machine learning , pages=. 2015 , organization=

2015
[12]

Neural computation , volume=

Improving generalization for temporal difference learning: The successor representation , author=. Neural computation , volume=. 1993 , publisher=

1993
[13]

Machine learning , volume=

Q-learning , author=. Machine learning , volume=. 1992 , publisher=

1992
[14]

arXiv preprint arXiv:1805.00909 , year=

Reinforcement learning and control as probabilistic inference: Tutorial and review , author=. arXiv preprint arXiv:1805.00909 , year=

Pith/arXiv arXiv
[15]

International conference on machine learning , pages=

Reinforcement learning with deep energy-based policies , author=. International conference on machine learning , pages=. 2017 , organization=

2017
[16]

, author=

Maximum entropy inverse reinforcement learning. , author=. Aaai , volume=. 2008 , organization=

2008
[17]

Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition , pages=

Self-supervised learning from images with a joint-embedding predictive architecture , author=. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition , pages=
[18]

arXiv preprint arXiv:2505.12705 , year=

DreamGen: Unlocking Generalization in Robot Learning through Video World Models , author=. arXiv preprint arXiv:2505.12705 , year=

Pith/arXiv arXiv
[19]

arXiv preprint arXiv:2406.09246 , year=

Openvla: An open-source vision-language-action model , author=. arXiv preprint arXiv:2406.09246 , year=

Pith/arXiv arXiv
[20]

Proceedings of the Thirteenth International Conference on Learning Representations (ICLR) , year =

Latent Action Pretraining from Videos , author =. Proceedings of the Thirteenth International Conference on Learning Representations (ICLR) , year =
[21]

arXiv preprint arXiv:2501.15830 , year=

Spatialvla: Exploring spatial representations for visual-language-action model , author=. arXiv preprint arXiv:2501.15830 , year=

Pith/arXiv arXiv
[22]

International Conference on Machine Learning , pages=

Multi-view masked world models for visual robotic manipulation , author=. International Conference on Machine Learning , pages=. 2023 , organization=

2023
[23]

arXiv preprint arXiv:2511.22697 , year=

Mechanistic Finetuning of Vision-Language-Action Models via Few-Shot Demonstrations , author=. arXiv preprint arXiv:2511.22697 , year=

arXiv
[24]

Towards generalist robot policies: What matters in building vision-language-action models , author=
[25]

Conference on Robot Learning , pages=

Rt-2: Vision-language-action models transfer web knowledge to robotic control , author=. Conference on Robot Learning , pages=. 2023 , organization=

2023
[26]

Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition , pages=

TRM-VLA: Temporal-Aware Chain-of-Thought Reasoning and Memorization for Vision-Language-Action Models , author=. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition , pages=
[27]

arXiv preprint arXiv:2604.17800 , year=

ReFineVLA: Multimodal Reasoning-Aware Generalist Robotic Policies via Teacher-Guided Fine-Tuning , author=. arXiv preprint arXiv:2604.17800 , year=

Pith/arXiv arXiv
[28]

arXiv preprint arXiv:2512.22519 , year=

Clutter-Resistant Vision-Language-Action Models through Object-Centric and Geometry Grounding , author=. arXiv preprint arXiv:2512.22519 , year=

Pith/arXiv arXiv
[29]

arXiv preprint arXiv:2603.03596 , year=

Mem: Multi-scale embodied memory for vision language action models , author=. arXiv preprint arXiv:2603.03596 , year=

arXiv
[30]

Proceedings of the AAAI Conference on Artificial Intelligence , volume=

Rethinking progression of memory state in robotic manipulation: An object-centric perspective , author=. Proceedings of the AAAI Conference on Artificial Intelligence , volume=
[31]

International Conference on Robotics and Automation (ICRA) , year=

Slotvla: Towards modeling of object-relation representations in robotic manipulation , author=. International Conference on Robotics and Automation (ICRA) , year=
[32]

Proceedings of the Computer Vision and Pattern Recognition Conference , pages=

Cot-vla: Visual chain-of-thought reasoning for vision-language-action models , author=. Proceedings of the Computer Vision and Pattern Recognition Conference , pages=
[33]

arXiv preprint arXiv:2503.20020 , year=

Gemini robotics: Bringing ai into the physical world , author=. arXiv preprint arXiv:2503.20020 , year=

Pith/arXiv arXiv
[34]

Advances in neural information processing systems , volume=

One-shot imitation learning , author=. Advances in neural information processing systems , volume=
[35]

arXiv preprint arXiv:2510.03342 , year=

Gemini robotics 1.5: Pushing the frontier of generalist robots with advanced embodied reasoning, thinking, and motion transfer , author=. arXiv preprint arXiv:2510.03342 , year=

Pith/arXiv arXiv
[36]

arXiv preprint arXiv:2410.24164 , year=

_0 : A Vision-Language-Action Flow Model for General Robot Control , author=. arXiv preprint arXiv:2410.24164 , year=

Pith/arXiv arXiv
[37]

Intelligence, Physical and Black, Kevin and Brown, Noah and Darpinian, James and Dhabalia, Karan and Driess, Danny and Esmail, Adnan and Equi, Michael and Finn, Chelsea and Fusai, Niccolo and others , journal=. _
[38]

arXiv preprint arXiv:2405.12213 , year=

Octo: An open-source generalist robot policy , author=. arXiv preprint arXiv:2405.12213 , year=

Pith/arXiv arXiv
[39]

arXiv preprint arXiv:2503.14734 , year=

Gr00t n1: An open foundation model for generalist humanoid robots , author=. arXiv preprint arXiv:2503.14734 , year=

Pith/arXiv arXiv
[40]

Advances in Neural Information Processing Systems , volume=

Video pretraining (vpt): Learning to act by watching unlabeled online videos , author=. Advances in Neural Information Processing Systems , volume=
[41]

Proceedings of the IEEE/CVF international conference on computer vision , pages=

Sigmoid loss for language image pre-training , author=. Proceedings of the IEEE/CVF international conference on computer vision , pages=
[42]

arXiv preprint arXiv:2410.11758 , year=

Latent action pretraining from videos , author=. arXiv preprint arXiv:2410.11758 , year=

Pith/arXiv arXiv
[43]

Proceedings of the International Conference on Learning Representations (ICLR) , year =

C-Learning: Learning to Achieve Goals via Recursive Classification , author =. Proceedings of the International Conference on Learning Representations (ICLR) , year =
[44]

arXiv preprint arXiv:2212.06817 , year=

Rt-1: Robotics transformer for real-world control at scale , author=. arXiv preprint arXiv:2212.06817 , year=

Pith/arXiv arXiv
[45]

European conference on computer vision , pages=

Grounding dino: Marrying dino with grounded pre-training for open-set object detection , author=. European conference on computer vision , pages=. 2024 , organization=

2024
[46]

Proceedings of the IEEE International Conference on Robotics and Automation (ICRA) , year =

Contrastive Language, Action, and State Pre-training for Robot Learning , author =. Proceedings of the IEEE International Conference on Robotics and Automation (ICRA) , year =
[47]

Proceedings of the Conference on Robot Learning (CoRL) , year =

Contrastive Imitation Learning for Language-Guided Multi-Task Robotic Manipulation , author =. Proceedings of the Conference on Robot Learning (CoRL) , year =
[48]

Proceedings of the Forty-Second International Conference on Machine Learning (ICML) , year =

Video Prediction Policy: A Generalist Robot Policy with Predictive Visual Representations , author =. Proceedings of the Forty-Second International Conference on Machine Learning (ICML) , year =
[49]

arXiv preprint arXiv:2505.15659 , year=

FLARE: Robot learning with implicit world modeling , author=. arXiv preprint arXiv:2505.15659 , year=

Pith/arXiv arXiv
[50]

Conference on Robot Learning , pages=

Class: Contrastive learning via action sequence supervision for robot manipulation , author=. Conference on Robot Learning , pages=. 2025 , organization=

2025
[51]

2018 IEEE international conference on robotics and automation (ICRA) , pages=

Time-contrastive networks: Self-supervised learning from video , author=. 2018 IEEE international conference on robotics and automation (ICRA) , pages=. 2018 , organization=

2018
[52]

Proceedings of the Thirteenth International Conference on Learning Representations (ICLR) , year =

TraceVLA: Visual Trace Prompting Enhances Spatial--Temporal Awareness for Generalist Robotic Policies , author =. Proceedings of the Thirteenth International Conference on Learning Representations (ICLR) , year =
[53]

arXiv preprint arXiv:2508.21112 , year=

Eo-1: Interleaved vision-text-action pretraining for general robot control , author=. arXiv preprint arXiv:2508.21112 , year=

arXiv
[54]

arXiv preprint arXiv:2204.01691 , year =

Do As I Can, Not As I Say: Grounding Language in Robotic Affordances , author =. arXiv preprint arXiv:2204.01691 , year =

Pith/arXiv arXiv
[55]

arXiv preprint arXiv:2303.03378 , year =

PaLM-E: An Embodied Multimodal Language Model , author =. arXiv preprint arXiv:2303.03378 , year =

Pith/arXiv arXiv
[56]

Robotics: Science and Systems (RSS) , year =

Learning Fine-Grained Bimanual Manipulation with Low-Cost Hardware , author =. Robotics: Science and Systems (RSS) , year =
[57]

arXiv preprint arXiv:2303.04137 , year =

Diffusion Policy: Visuomotor Policy Learning via Action Diffusion , author =. arXiv preprint arXiv:2303.04137 , year =

Pith/arXiv arXiv
[58]

arXiv preprint arXiv:2106.09685 , year =

LoRA: Low-Rank Adaptation of Large Language Models , author =. arXiv preprint arXiv:2106.09685 , year =

Pith/arXiv arXiv
[59]

International Conference on Machine Learning (ICML) , year =

DoRA: Weight-Decomposed Low-Rank Adaptation , author =. International Conference on Machine Learning (ICML) , year =
[60]

arXiv preprint arXiv:2506.16211 , year =

ControlVLA: Few-shot Object-centric Adaptation for Pre-trained Vision-Language-Action Models , author =. arXiv preprint arXiv:2506.16211 , year =

arXiv
[61]

arXiv preprint arXiv:2508.02062 , year =

RICL: Adding In-Context Adaptability to Pre-Trained Vision-Language-Action Models , author =. arXiv preprint arXiv:2508.02062 , year =

arXiv
[62]

arXiv preprint arXiv:2302.05543 , year =

Adding Conditional Control to Text-to-Image Diffusion Models , author =. arXiv preprint arXiv:2302.05543 , year =

Pith/arXiv arXiv
[63]

Conference on Robot Learning (CoRL) , year =

RoVi-Aug: Robot and Viewpoint Augmentation for Cross-Embodiment Robot Learning , author =. Conference on Robot Learning (CoRL) , year =
[64]

International Conference on Machine Learning (ICML) , year =

Multi-View Masked World Models for Visual Robotic Manipulation , author =. International Conference on Machine Learning (ICML) , year =
[65]

arXiv preprint arXiv:2501.03575 , year =

Cosmos World Foundation Model Platform for Physical AI , author =. arXiv preprint arXiv:2501.03575 , year =

Pith/arXiv arXiv
[66]

arXiv preprint arXiv:2511.00062 , year =

World Simulation with Video Foundation Models for Physical AI , author =. arXiv preprint arXiv:2511.00062 , year =

Pith/arXiv arXiv
[67]

2012 IEEE/RSJ international conference on intelligent robots and systems , pages=

Mujoco: A physics engine for model-based control , author=. 2012 IEEE/RSJ international conference on intelligent robots and systems , pages=. 2012 , organization=

2012
[68]

arXiv preprint arXiv:2406.08545 , year=

Rvt-2: Learning precise manipulation from few demonstrations , author=. arXiv preprint arXiv:2406.08545 , year=

arXiv
[69]

Advances in Neural Information Processing Systems , volume=

Libero: Benchmarking knowledge transfer for lifelong robot learning , author=. Advances in Neural Information Processing Systems , volume=
[70]

arXiv preprint arXiv:2405.02292 , year=

Aloha 2: An enhanced low-cost hardware for bimanual teleoperation , author=. arXiv preprint arXiv:2405.02292 , year=

arXiv
[71]

Proceedings of Robotics: Science and Systems , year =

RoboCasa: Large-Scale Simulation of Everyday Tasks for Generalist Robots , author =. Proceedings of Robotics: Science and Systems , year =
[72]

ArXiv , year=

Qwen2.5 Technical Report , author=. ArXiv , year=
[73]

ArXiv , year=

Cosmos World Foundation Model Platform for Physical AI , author=. ArXiv , year=

[1] [1]

Advances in Neural Information Processing Systems , volume=

Contrastive learning as goal-conditioned reinforcement learning , author=. Advances in Neural Information Processing Systems , volume=

[2] [2]

arXiv preprint arXiv:2306.03346 , year=

Stabilizing contrastive rl: Techniques for robotic goal reaching from offline data , author=. arXiv preprint arXiv:2306.03346 , year=

arXiv

[3] [3]

Proceedings of the thirteenth international conference on artificial intelligence and statistics , pages=

Noise-contrastive estimation: A new estimation principle for unnormalized statistical models , author=. Proceedings of the thirteenth international conference on artificial intelligence and statistics , pages=. 2010 , organization=

2010

[4] [4]

The journal of machine learning research , volume=

Noise-contrastive estimation of unnormalized statistical models, with applications to natural image statistics , author=. The journal of machine learning research , volume=. 2012 , publisher=

2012

[5] [5]

arXiv preprint arXiv:1807.03748 , year=

Representation Learning with Contrastive Predictive Coding , author=. arXiv preprint arXiv:1807.03748 , year=

Pith/arXiv arXiv

[6] [6]

arXiv preprint arXiv:2510.00695 , year=

Hamlet: Switch your vision-language-action model into a history-aware policy , author=. arXiv preprint arXiv:2510.00695 , year=

Pith/arXiv arXiv

[7] [7]

arXiv preprint arXiv:2501.09747 , year=

Fast: Efficient action tokenization for vision-language-action models , author=. arXiv preprint arXiv:2501.09747 , year=

Pith/arXiv arXiv

[8] [8]

Advances in Neural Information Processing Systems , year =

ThinkAct: Vision-Language-Action Reasoning via Reinforced Visual Latent Planning , author =. Advances in Neural Information Processing Systems , year =

[9] [9]

Advances in Neural Information Processing Systems , year =

DreamVLA: A Vision-Language-Action Model Dreamed with Comprehensive World Knowledge , author =. Advances in Neural Information Processing Systems , year =

[10] [10]

arXiv preprint arXiv:2506.01844 , year=

Smolvla: A vision-language-action model for affordable and efficient robotics , author=. arXiv preprint arXiv:2506.01844 , year=

Pith/arXiv arXiv

[11] [11]

International conference on machine learning , pages=

Universal value function approximators , author=. International conference on machine learning , pages=. 2015 , organization=

2015

[12] [12]

Neural computation , volume=

Improving generalization for temporal difference learning: The successor representation , author=. Neural computation , volume=. 1993 , publisher=

1993

[13] [13]

Machine learning , volume=

Q-learning , author=. Machine learning , volume=. 1992 , publisher=

1992

[14] [14]

arXiv preprint arXiv:1805.00909 , year=

Reinforcement learning and control as probabilistic inference: Tutorial and review , author=. arXiv preprint arXiv:1805.00909 , year=

Pith/arXiv arXiv

[15] [15]

International conference on machine learning , pages=

Reinforcement learning with deep energy-based policies , author=. International conference on machine learning , pages=. 2017 , organization=

2017

[16] [16]

, author=

Maximum entropy inverse reinforcement learning. , author=. Aaai , volume=. 2008 , organization=

2008

[17] [17]

Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition , pages=

Self-supervised learning from images with a joint-embedding predictive architecture , author=. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition , pages=

[18] [18]

arXiv preprint arXiv:2505.12705 , year=

DreamGen: Unlocking Generalization in Robot Learning through Video World Models , author=. arXiv preprint arXiv:2505.12705 , year=

Pith/arXiv arXiv

[19] [19]

arXiv preprint arXiv:2406.09246 , year=

Openvla: An open-source vision-language-action model , author=. arXiv preprint arXiv:2406.09246 , year=

Pith/arXiv arXiv

[20] [20]

Proceedings of the Thirteenth International Conference on Learning Representations (ICLR) , year =

Latent Action Pretraining from Videos , author =. Proceedings of the Thirteenth International Conference on Learning Representations (ICLR) , year =

[21] [21]

arXiv preprint arXiv:2501.15830 , year=

Spatialvla: Exploring spatial representations for visual-language-action model , author=. arXiv preprint arXiv:2501.15830 , year=

Pith/arXiv arXiv

[22] [22]

International Conference on Machine Learning , pages=

Multi-view masked world models for visual robotic manipulation , author=. International Conference on Machine Learning , pages=. 2023 , organization=

2023

[23] [23]

arXiv preprint arXiv:2511.22697 , year=

Mechanistic Finetuning of Vision-Language-Action Models via Few-Shot Demonstrations , author=. arXiv preprint arXiv:2511.22697 , year=

arXiv

[24] [24]

Towards generalist robot policies: What matters in building vision-language-action models , author=

[25] [25]

Conference on Robot Learning , pages=

Rt-2: Vision-language-action models transfer web knowledge to robotic control , author=. Conference on Robot Learning , pages=. 2023 , organization=

2023

[26] [26]

Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition , pages=

TRM-VLA: Temporal-Aware Chain-of-Thought Reasoning and Memorization for Vision-Language-Action Models , author=. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition , pages=

[27] [27]

arXiv preprint arXiv:2604.17800 , year=

ReFineVLA: Multimodal Reasoning-Aware Generalist Robotic Policies via Teacher-Guided Fine-Tuning , author=. arXiv preprint arXiv:2604.17800 , year=

Pith/arXiv arXiv

[28] [28]

arXiv preprint arXiv:2512.22519 , year=

Clutter-Resistant Vision-Language-Action Models through Object-Centric and Geometry Grounding , author=. arXiv preprint arXiv:2512.22519 , year=

Pith/arXiv arXiv

[29] [29]

arXiv preprint arXiv:2603.03596 , year=

Mem: Multi-scale embodied memory for vision language action models , author=. arXiv preprint arXiv:2603.03596 , year=

arXiv

[30] [30]

Proceedings of the AAAI Conference on Artificial Intelligence , volume=

Rethinking progression of memory state in robotic manipulation: An object-centric perspective , author=. Proceedings of the AAAI Conference on Artificial Intelligence , volume=

[31] [31]

International Conference on Robotics and Automation (ICRA) , year=

Slotvla: Towards modeling of object-relation representations in robotic manipulation , author=. International Conference on Robotics and Automation (ICRA) , year=

[32] [32]

Proceedings of the Computer Vision and Pattern Recognition Conference , pages=

Cot-vla: Visual chain-of-thought reasoning for vision-language-action models , author=. Proceedings of the Computer Vision and Pattern Recognition Conference , pages=

[33] [33]

arXiv preprint arXiv:2503.20020 , year=

Gemini robotics: Bringing ai into the physical world , author=. arXiv preprint arXiv:2503.20020 , year=

Pith/arXiv arXiv

[34] [34]

Advances in neural information processing systems , volume=

One-shot imitation learning , author=. Advances in neural information processing systems , volume=

[35] [35]

arXiv preprint arXiv:2510.03342 , year=

Gemini robotics 1.5: Pushing the frontier of generalist robots with advanced embodied reasoning, thinking, and motion transfer , author=. arXiv preprint arXiv:2510.03342 , year=

Pith/arXiv arXiv

[36] [36]

arXiv preprint arXiv:2410.24164 , year=

_0 : A Vision-Language-Action Flow Model for General Robot Control , author=. arXiv preprint arXiv:2410.24164 , year=

Pith/arXiv arXiv

[37] [37]

Intelligence, Physical and Black, Kevin and Brown, Noah and Darpinian, James and Dhabalia, Karan and Driess, Danny and Esmail, Adnan and Equi, Michael and Finn, Chelsea and Fusai, Niccolo and others , journal=. _

[38] [38]

arXiv preprint arXiv:2405.12213 , year=

Octo: An open-source generalist robot policy , author=. arXiv preprint arXiv:2405.12213 , year=

Pith/arXiv arXiv

[39] [39]

arXiv preprint arXiv:2503.14734 , year=

Gr00t n1: An open foundation model for generalist humanoid robots , author=. arXiv preprint arXiv:2503.14734 , year=

Pith/arXiv arXiv

[40] [40]

Advances in Neural Information Processing Systems , volume=

Video pretraining (vpt): Learning to act by watching unlabeled online videos , author=. Advances in Neural Information Processing Systems , volume=

[41] [41]

Proceedings of the IEEE/CVF international conference on computer vision , pages=

Sigmoid loss for language image pre-training , author=. Proceedings of the IEEE/CVF international conference on computer vision , pages=

[42] [42]

arXiv preprint arXiv:2410.11758 , year=

Latent action pretraining from videos , author=. arXiv preprint arXiv:2410.11758 , year=

Pith/arXiv arXiv

[43] [43]

Proceedings of the International Conference on Learning Representations (ICLR) , year =

C-Learning: Learning to Achieve Goals via Recursive Classification , author =. Proceedings of the International Conference on Learning Representations (ICLR) , year =

[44] [44]

arXiv preprint arXiv:2212.06817 , year=

Rt-1: Robotics transformer for real-world control at scale , author=. arXiv preprint arXiv:2212.06817 , year=

Pith/arXiv arXiv

[45] [45]

European conference on computer vision , pages=

Grounding dino: Marrying dino with grounded pre-training for open-set object detection , author=. European conference on computer vision , pages=. 2024 , organization=

2024

[46] [46]

Proceedings of the IEEE International Conference on Robotics and Automation (ICRA) , year =

Contrastive Language, Action, and State Pre-training for Robot Learning , author =. Proceedings of the IEEE International Conference on Robotics and Automation (ICRA) , year =

[47] [47]

Proceedings of the Conference on Robot Learning (CoRL) , year =

Contrastive Imitation Learning for Language-Guided Multi-Task Robotic Manipulation , author =. Proceedings of the Conference on Robot Learning (CoRL) , year =

[48] [48]

Proceedings of the Forty-Second International Conference on Machine Learning (ICML) , year =

Video Prediction Policy: A Generalist Robot Policy with Predictive Visual Representations , author =. Proceedings of the Forty-Second International Conference on Machine Learning (ICML) , year =

[49] [49]

arXiv preprint arXiv:2505.15659 , year=

FLARE: Robot learning with implicit world modeling , author=. arXiv preprint arXiv:2505.15659 , year=

Pith/arXiv arXiv

[50] [50]

Conference on Robot Learning , pages=

Class: Contrastive learning via action sequence supervision for robot manipulation , author=. Conference on Robot Learning , pages=. 2025 , organization=

2025

[51] [51]

2018 IEEE international conference on robotics and automation (ICRA) , pages=

Time-contrastive networks: Self-supervised learning from video , author=. 2018 IEEE international conference on robotics and automation (ICRA) , pages=. 2018 , organization=

2018

[52] [52]

Proceedings of the Thirteenth International Conference on Learning Representations (ICLR) , year =

TraceVLA: Visual Trace Prompting Enhances Spatial--Temporal Awareness for Generalist Robotic Policies , author =. Proceedings of the Thirteenth International Conference on Learning Representations (ICLR) , year =

[53] [53]

arXiv preprint arXiv:2508.21112 , year=

Eo-1: Interleaved vision-text-action pretraining for general robot control , author=. arXiv preprint arXiv:2508.21112 , year=

arXiv

[54] [54]

arXiv preprint arXiv:2204.01691 , year =

Do As I Can, Not As I Say: Grounding Language in Robotic Affordances , author =. arXiv preprint arXiv:2204.01691 , year =

Pith/arXiv arXiv

[55] [55]

arXiv preprint arXiv:2303.03378 , year =

PaLM-E: An Embodied Multimodal Language Model , author =. arXiv preprint arXiv:2303.03378 , year =

Pith/arXiv arXiv

[56] [56]

Robotics: Science and Systems (RSS) , year =

Learning Fine-Grained Bimanual Manipulation with Low-Cost Hardware , author =. Robotics: Science and Systems (RSS) , year =

[57] [57]

arXiv preprint arXiv:2303.04137 , year =

Diffusion Policy: Visuomotor Policy Learning via Action Diffusion , author =. arXiv preprint arXiv:2303.04137 , year =

Pith/arXiv arXiv

[58] [58]

arXiv preprint arXiv:2106.09685 , year =

LoRA: Low-Rank Adaptation of Large Language Models , author =. arXiv preprint arXiv:2106.09685 , year =

Pith/arXiv arXiv

[59] [59]

International Conference on Machine Learning (ICML) , year =

DoRA: Weight-Decomposed Low-Rank Adaptation , author =. International Conference on Machine Learning (ICML) , year =

[60] [60]

arXiv preprint arXiv:2506.16211 , year =

ControlVLA: Few-shot Object-centric Adaptation for Pre-trained Vision-Language-Action Models , author =. arXiv preprint arXiv:2506.16211 , year =

arXiv

[61] [61]

arXiv preprint arXiv:2508.02062 , year =

RICL: Adding In-Context Adaptability to Pre-Trained Vision-Language-Action Models , author =. arXiv preprint arXiv:2508.02062 , year =

arXiv

[62] [62]

arXiv preprint arXiv:2302.05543 , year =

Adding Conditional Control to Text-to-Image Diffusion Models , author =. arXiv preprint arXiv:2302.05543 , year =

Pith/arXiv arXiv

[63] [63]

Conference on Robot Learning (CoRL) , year =

RoVi-Aug: Robot and Viewpoint Augmentation for Cross-Embodiment Robot Learning , author =. Conference on Robot Learning (CoRL) , year =

[64] [64]

International Conference on Machine Learning (ICML) , year =

Multi-View Masked World Models for Visual Robotic Manipulation , author =. International Conference on Machine Learning (ICML) , year =

[65] [65]

arXiv preprint arXiv:2501.03575 , year =

Cosmos World Foundation Model Platform for Physical AI , author =. arXiv preprint arXiv:2501.03575 , year =

Pith/arXiv arXiv

[66] [66]

arXiv preprint arXiv:2511.00062 , year =

World Simulation with Video Foundation Models for Physical AI , author =. arXiv preprint arXiv:2511.00062 , year =

Pith/arXiv arXiv

[67] [67]

2012 IEEE/RSJ international conference on intelligent robots and systems , pages=

Mujoco: A physics engine for model-based control , author=. 2012 IEEE/RSJ international conference on intelligent robots and systems , pages=. 2012 , organization=

2012

[68] [68]

arXiv preprint arXiv:2406.08545 , year=

Rvt-2: Learning precise manipulation from few demonstrations , author=. arXiv preprint arXiv:2406.08545 , year=

arXiv

[69] [69]

Advances in Neural Information Processing Systems , volume=

Libero: Benchmarking knowledge transfer for lifelong robot learning , author=. Advances in Neural Information Processing Systems , volume=

[70] [70]

arXiv preprint arXiv:2405.02292 , year=

Aloha 2: An enhanced low-cost hardware for bimanual teleoperation , author=. arXiv preprint arXiv:2405.02292 , year=

arXiv

[71] [71]

Proceedings of Robotics: Science and Systems , year =

RoboCasa: Large-Scale Simulation of Everyday Tasks for Generalist Robots , author =. Proceedings of Robotics: Science and Systems , year =

[72] [72]

ArXiv , year=

Qwen2.5 Technical Report , author=. ArXiv , year=

[73] [73]

ArXiv , year=

Cosmos World Foundation Model Platform for Physical AI , author=. ArXiv , year=