pith. machine review for the scientific record.

arxiv: 2412.10345 · v3 · submitted 2024-12-13 · 💻 cs.RO · cs.AI

Recognition: 2 theorem links

· Lean Theorem

TraceVLA: Visual Trace Prompting Enhances Spatial-Temporal Awareness for Generalist Robotic Policies

Authors on Pith: no claims yet

Pith reviewed 2026-05-15 18:22 UTC · model grok-4.3

classification 💻 cs.RO cs.AI
keywords visual trace prompting · vision-language-action models · robotic manipulation · spatial-temporal awareness · generalist policies · TraceVLA · OpenVLA · embodiment generalization

The pith

Visual trace prompting encodes state-action trajectories to improve spatial-temporal awareness in vision-language-action robotic policies.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Current large vision-language-action models often fail to handle the spatial and temporal aspects of dynamic robot tasks such as manipulation. The paper proposes visual trace prompting as a way to make trajectories explicit by overlaying visual encodings directly on input images. Finetuning OpenVLA on a new dataset of 150,000 trajectories collected with this prompting produces TraceVLA, which reaches state-of-the-art results. The gains hold across hundreds of simulation configurations and several real-robot tasks, and the same traces let a smaller, faster model match larger baselines.

Core claim

By encoding state-action trajectories visually and using them as prompts, TraceVLA enables generalist VLA models to capture spatial-temporal dynamics more effectively during action prediction. The approach is validated by finetuning OpenVLA on 150K manipulation trajectories, yielding 10% higher success on SimplerEnv across 137 configurations and 3.5 times higher success on four physical WidowX tasks, with strong generalization to varied embodiments.

What carries the argument

visual trace prompting, which overlays visual encodings of state-action trajectories on images to guide spatial-temporal reasoning in action prediction
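For intuition, here is a minimal sketch of the overlay the technique describes, assuming the trace is a short history of tracked 2D image points (for instance, recent end-effector positions from a point tracker). The drawing choices and names below are illustrative, not the paper's exact rendering.

    import cv2
    import numpy as np

    def overlay_visual_trace(image, trace_points):
        """Draw a state-action trace as a fading polyline on an RGB observation.

        image: HxWx3 uint8 array.
        trace_points: sequence of (x, y) pixel coordinates, oldest -> newest,
                      e.g. recent tracked gripper positions.
        Returns an annotated copy; the original frame is left untouched.
        """
        annotated = image.copy()
        pts = np.asarray(trace_points, dtype=int)
        for i in range(1, len(pts)):
            a = i / max(len(pts) - 1, 1)                    # 0 = oldest, 1 = newest
            color = (int(255 * a), 0, int(255 * (1 - a)))   # blue fades to red (RGB)
            p0 = (int(pts[i - 1][0]), int(pts[i - 1][1]))
            p1 = (int(pts[i][0]), int(pts[i][1]))
            cv2.line(annotated, p0, p1, color, thickness=2)
        return annotated

    # Hypothetical usage: give the policy both the raw and the annotated frame.
    # prompt_images = [obs, overlay_visual_trace(obs, recent_trace)]

Because the trace lives entirely in the input image, the same encoding could in principle be prepended to any VLA that consumes RGB observations, which is what makes the architecture-agnostic reading plausible.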

If this is right

  • TraceVLA outperforms the OpenVLA baseline by 10% on SimplerEnv across 137 configurations
  • TraceVLA achieves 3.5 times higher success rates on four real-robot WidowX manipulation tasks
  • The method supports robust generalization across diverse robot embodiments and scenarios
  • A compact 4B-parameter VLA model trained with the same traces rivals the 7B OpenVLA baseline while improving inference speed

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • The prompting technique could be applied to other VLA architectures beyond OpenVLA to achieve similar gains without new data collection
  • Smaller models made viable by trace prompting may lower the compute barrier for deploying generalist policies on edge hardware
  • Visual traces might extend naturally to longer-horizon or multi-step tasks where temporal structure is even more critical
  • Improved spatial awareness from traces could reduce collision rates in cluttered or human-shared workspaces

Load-bearing premise

The 150K trajectories collected with visual traces are diverse and representative enough that the observed gains are not tied to the specific data-collection procedure or robot embodiments.

What would settle it

A controlled test on a new robot platform or task distribution whose spatial-temporal requirements fall outside the 150K training trajectories, showing no performance advantage for TraceVLA over the OpenVLA baseline.
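One way to make that test concrete, as a sketch under stated assumptions: run_episode stands in for whatever evaluation stack (SimplerEnv or a real-robot harness) returns a binary success signal, and the held-out tasks come from a platform outside the 150K training distribution.

    def success_rate(policy, tasks, run_episode, episodes_per_task=50):
        """Fraction of successful episodes across a task set.

        run_episode(policy, task) -> bool is supplied by the evaluation
        stack; it is a placeholder here, not an API from the paper.
        """
        outcomes = [
            run_episode(policy, task)
            for task in tasks
            for _ in range(episodes_per_task)
        ]
        return sum(outcomes) / len(outcomes)

    # The settling condition: on out-of-distribution tasks,
    # success_rate(tracevla, held_out, run_episode) showing no advantage over
    # success_rate(openvla, held_out, run_episode) would indicate the gains
    # are tied to the training distribution rather than to trace prompting.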

read the original abstract

Although large vision-language-action (VLA) models pretrained on extensive robot datasets offer promising generalist policies for robotic learning, they still struggle with spatial-temporal dynamics in interactive robotics, making them less effective in handling complex tasks, such as manipulation. In this work, we introduce visual trace prompting, a simple yet effective approach to facilitate VLA models' spatial-temporal awareness for action prediction by encoding state-action trajectories visually. We develop a new TraceVLA model by finetuning OpenVLA on our own collected dataset of 150K robot manipulation trajectories using visual trace prompting. Evaluations of TraceVLA across 137 configurations in SimplerEnv and 4 tasks on a physical WidowX robot demonstrate state-of-the-art performance, outperforming OpenVLA by 10% on SimplerEnv and 3.5x on real-robot tasks and exhibiting robust generalization across diverse embodiments and scenarios. To further validate the effectiveness and generality of our method, we present a compact VLA model based on 4B Phi-3-Vision, pretrained on the Open-X-Embodiment and finetuned on our dataset, that rivals the 7B OpenVLA baseline while significantly improving inference efficiency.

Editorial analysis

A structured set of objections, weighed in public.

Referee report, simulated author's rebuttal, circularity audit, and axiom ledger. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 1 minor

Summary. The manuscript introduces visual trace prompting as a method to enhance spatial-temporal awareness in vision-language-action (VLA) models by visually encoding state-action trajectories. TraceVLA is obtained by fine-tuning the OpenVLA checkpoint on a newly collected dataset of 150K robot manipulation trajectories rendered with these traces. The authors report state-of-the-art results, with TraceVLA outperforming OpenVLA by 10% on SimplerEnv across 137 configurations and by a factor of 3.5 on four real-robot WidowX tasks, plus a compact 4B-parameter variant based on Phi-3-Vision that rivals the 7B OpenVLA baseline while improving inference speed.

Significance. If the performance deltas can be causally attributed to visual trace prompting, the work would provide a simple, architecture-agnostic technique for improving generalist robotic policies on complex manipulation. The scale of the SimplerEnv evaluation and the inclusion of physical-robot results are positive features. The absence of a control experiment that fine-tunes the identical backbone on the same 150K trajectories without traces, however, prevents the gains from being isolated from the effects of additional embodiment-specific data, limiting the strength of the central claim.

major comments (1)
  1. [Section 4] Section 4 (Experiments) and the associated ablation tables: no control arm is reported in which OpenVLA is fine-tuned on the identical 150K trajectories rendered without visual traces. Without this baseline, the 10% SimplerEnv and 3.5x real-robot improvements cannot be attributed specifically to visual trace prompting rather than to the additional fine-tuning data or collection procedure.
minor comments (1)
  1. [Abstract and Section 3.2] The abstract and Section 3.2 state that the 150K trajectories are 'our own collected' but supply no quantitative description of task diversity, embodiment distribution, or collection protocol; adding these details would strengthen the claim of robust generalization.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the constructive feedback and for identifying a key point that strengthens the causal claim of our work. We address the major comment below and commit to incorporating the requested control experiment in the revised manuscript.

read point-by-point responses
  1. Referee: [Section 4] Section 4 (Experiments) and the associated ablation tables: no control arm is reported in which OpenVLA is fine-tuned on the identical 150K trajectories rendered without visual traces. Without this baseline, the 10% SimplerEnv and 3.5x real-robot improvements cannot be attributed specifically to visual trace prompting rather than to the additional fine-tuning data or collection procedure.

    Authors: We agree that the absence of this control arm limits the strength of the attribution in the current version. Our dataset of 150K trajectories was collected and rendered with visual traces as an integral part of the prompting method, so a direct comparison requires re-rendering the identical trajectories without traces and performing an additional fine-tuning run. We will execute this control experiment and report the results (including SimplerEnv and real-robot metrics) in the revised manuscript. In the interim, the ablations in Section 4 already vary trace presence, density, and rendering style while holding the underlying trajectories fixed; performance consistently degrades when traces are removed or ablated, which provides supporting (though not fully isolating) evidence that the gains are not solely due to extra data volume. We will also add a discussion clarifying the distinction between our fine-tuning data and the original OpenVLA pre-training corpus. revision: yes
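For concreteness, the requested control arm amounts to two fine-tuning runs that differ only in trace rendering. The configuration below is a hypothetical sketch; field names such as render_visual_trace are invented for illustration, not taken from the authors' codebase.

    # Hypothetical configs: everything held fixed except trace rendering.
    base = dict(
        backbone="openvla-7b",              # same pretrained checkpoint in both arms
        dataset="traces-150k",              # identical 150K trajectories
        steps=..., lr=..., batch_size=...,  # shared hyperparameters, left unspecified
    )

    trace_arm = {**base, "render_visual_trace": True}     # TraceVLA condition
    control_arm = {**base, "render_visual_trace": False}  # same data, no overlay

    # Attribution to trace prompting means comparing these two arms on
    # SimplerEnv and the WidowX tasks; the published comparison is against
    # the original OpenVLA checkpoint, which leaves the extra fine-tuning
    # data as a confound.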

Circularity Check

0 steps flagged

No significant circularity in derivation chain

full rationale

The paper's claims rest on empirical performance measurements of TraceVLA (fine-tuned OpenVLA on 150K trajectories rendered with visual traces) against the OpenVLA baseline on independent external benchmarks: 137 SimplerEnv configurations and 4 WidowX real-robot tasks. No equations, fitted parameters, or self-citations are shown that reduce the reported 10% or 3.5x gains to quantities defined by the inputs themselves. The visual trace prompting is an external encoding step whose effectiveness is validated against separate simulation and physical-robot evaluations, leaving the central result self-contained rather than tautological.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 1 invented entity

The work rests on standard supervised fine-tuning assumptions for vision-language models and on the premise that visual traces faithfully represent state-action history without introducing new biases.

axioms (1)
  • domain assumption: Standard assumptions in supervised fine-tuning of large vision-language models transfer to robotic action prediction.
    Invoked implicitly when claiming generalization from the 150K dataset to new embodiments.
invented entities (1)
  • visual trace prompting: no independent evidence
    purpose: Encode state-action trajectories as visual overlays to improve spatial-temporal awareness.
    New technique introduced by the paper; no independent evidence outside the reported experiments is provided.

pith-pipeline@v0.9.0 · 5542 in / 1261 out tokens · 35506 ms · 2026-05-15T18:22:47.948121+00:00 · methodology

discussion (0)


Forward citations

Cited by 24 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Overcoming Dynamics-Blindness: Training-Free Pace-and-Path Correction for VLA Models

    cs.RO 2026-05 unverdicted novelty 7.0

    Pace-and-Path Correction decomposes a quadratic cost minimization into orthogonal pace and path channels to correct chunked actions in VLA models, raising success rates by up to 28.8% in dynamic settings.

  2. MolmoAct2: Action Reasoning Models for Real-world Deployment

    cs.RO 2026-05 unverdicted novelty 7.0

    MolmoAct2 delivers an open VLA model with new specialized components, datasets, and techniques that outperforms baselines on benchmarks while releasing all weights, code, and data for real-world robot use.

  3. ${\pi}_{0.7}$: a Steerable Generalist Robotic Foundation Model with Emergent Capabilities

    cs.LG 2026-04 unverdicted novelty 7.0

    π₀.₇ is a steerable generalist robotic model that uses rich multimodal prompts including language, subgoal images, and performance metadata to achieve out-of-the-box generalization across tasks and robot bodies.

  4. JailWAM: Jailbreaking World Action Models in Robot Control

    cs.RO 2026-04 unverdicted novelty 7.0

    JailWAM is the first dedicated jailbreak framework for World Action Models, achieving 84.2% attack success rate on LingBot-VA in RoboTwin simulation and enabling safety evaluation of robotic AI.

  5. VP-VLA: Visual Prompting as an Interface for Vision-Language-Action Models

    cs.RO 2026-03 unverdicted novelty 7.0

    VP-VLA decouples high-level reasoning from low-level control in VLA models by rendering spatial anchors as visual prompts directly in the RGB observation space, outperforming end-to-end baselines.

  6. Towards Generalizable Robotic Manipulation in Dynamic Environments

    cs.CV 2026-03 unverdicted novelty 7.0

    DOMINO dataset and PUMA architecture enable better dynamic robotic manipulation by incorporating motion history, delivering 6.3% higher success rates than prior VLA models.

  7. Learning Physics from Pretrained Video Models: A Multimodal Continuous and Sequential World Interaction Models for Robotic Manipulation

    cs.RO 2026-02 unverdicted novelty 7.0

    PhysGen uses video models to learn physics for robots, outperforming baselines by up to 13.8% on Libero and matching specialized models in real-world tasks.

  8. GuidedVLA: Specifying Task-Relevant Factors via Plug-and-Play Action Attention Specialization

    cs.RO 2026-05 unverdicted novelty 6.0

    GuidedVLA improves VLA success rates by manually supervising separate attention heads in the action decoder with auxiliary signals for task-relevant factors.

  9. Overcoming Dynamics-Blindness: Training-Free Pace-and-Path Correction for VLA Models

    cs.RO 2026-05 unverdicted novelty 6.0

    Pace-and-Path Correction is a closed-form inference-time operator that decomposes a quadratic cost minimization into orthogonal pace compression and path offset channels to correct dynamics-blindness in chunked-action...

  10. TriRelVLA: Triadic Relational Structure for Generalizable Embodied Manipulation

    cs.CV 2026-05 unverdicted novelty 6.0

    TriRelVLA introduces triadic object-hand-task relational representations and a task-grounded graph transformer with a relational bottleneck to improve generalization in robotic manipulation across scenes, objects, and tasks.

  11. ConsisVLA-4D: Advancing Spatiotemporal Consistency in Efficient 3D-Perception and 4D-Reasoning for Robotic Manipulation

    cs.RO 2026-05 unverdicted novelty 6.0

    ConsisVLA-4D adds cross-view semantic alignment, cross-object geometric fusion, and cross-scene dynamic reasoning to VLA models, delivering 21.6% and 41.5% gains plus 2.3x and 2.4x speedups on LIBERO and real-world tasks.

  12. MolmoAct2: Action Reasoning Models for Real-world Deployment

    cs.RO 2026-05 unverdicted novelty 6.0

    MolmoAct2 is an open VLA model that outperforms baselines like Pi-05 on 7 benchmarks and whose backbone surpasses GPT-5 on 13 embodied-reasoning tasks through new datasets, specialized training, and architecture chang...

  13. ST-$\pi$: Structured SpatioTemporal VLA for Robotic Manipulation

    cs.RO 2026-04 unverdicted novelty 6.0

    ST-π structures VLA models by having a spatiotemporal VLM produce causally ordered chunk-level prompts that guide a dual-generator action expert to jointly handle spatial and temporal control in robotic manipulation.

  14. Robotic Manipulation is Vision-to-Geometry Mapping ($f(v) \rightarrow G$): Vision-Geometry Backbones over Language and Video Models

    cs.RO 2026-04 unverdicted novelty 6.0

    Vision-geometry backbones using pretrained 3D world models outperform vision-language and video models for robotic manipulation by enabling direct mapping from visual input to geometric actions.

  15. A1: A Fully Transparent Open-Source, Adaptive and Efficient Truncated Vision-Language-Action Model

    cs.RO 2026-04 unverdicted novelty 6.0

    A1 is a transparent VLA framework achieving state-of-the-art robot manipulation success with up to 72% lower latency via adaptive layer truncation and inter-layer flow matching.

  16. Steerable Vision-Language-Action Policies for Embodied Reasoning and Hierarchical Control

    cs.RO 2026-02 unverdicted novelty 6.0

    Steerable VLAs trained on rich synthetic commands at subtask, motion, and pixel levels enable VLMs to steer robot behavior more effectively, outperforming prior hierarchical baselines on real-world manipulation and ge...

  17. MemoryVLA: Perceptual-Cognitive Memory in Vision-Language-Action Models for Robotic Manipulation

    cs.RO 2025-08 conditional novelty 6.0

    MemoryVLA introduces a perceptual-cognitive memory bank and working-memory retrieval mechanism into VLA models, raising success rates on long-horizon robotic tasks by up to 26 points over prior baselines.

  18. villa-X: Enhancing Latent Action Modeling in Vision-Language-Action Models

    cs.RO 2025-07 unverdicted novelty 6.0

    villa-X enhances latent action modeling in VLA models to support zero-shot action planning for unseen robot embodiments and open-vocabulary instructions, yielding better manipulation results in simulation and real-wor...

  19. Real-Time Execution of Action Chunking Flow Policies

    cs.RO 2025-06 unverdicted novelty 6.0

    Real-time chunking (RTC) allows diffusion- and flow-based action chunking policies to execute smoothly and asynchronously, maintaining high success rates on dynamic tasks even with significant inference latency.

  20. FAST: Efficient Action Tokenization for Vision-Language-Action Models

    cs.RO 2025-01 unverdicted novelty 6.0

    FAST applies discrete cosine transform to robot action sequences for efficient tokenization, enabling autoregressive VLAs to succeed on high-frequency dexterous tasks and scale to 10k hours of data while matching diff...

  21. X-Imitator: Spatial-Aware Imitation Learning via Bidirectional Action-Pose Interaction

    cs.RO 2026-05 unverdicted novelty 5.0

    X-Imitator is a bidirectional action-pose interaction framework for spatial-aware imitation learning that outperforms vanilla policies and explicit pose guidance on 24 simulated and 3 real-world robotic tasks.

  22. Gated Memory Policy

    cs.RO 2026-04 unverdicted novelty 5.0

    GMP selectively activates and represents memory via a gate and lightweight cross-attention, yielding 30.1% higher success on non-Markovian robotic tasks while staying competitive on Markovian ones.

  23. Causal World Modeling for Robot Control

    cs.CV 2026-01 unverdicted novelty 5.0

    LingBot-VA combines video world modeling with policy learning via Mixture-of-Transformers, closed-loop rollouts, and asynchronous inference to improve robot manipulation in simulation and real settings.

  24. SpatialVLA: Exploring Spatial Representations for Visual-Language-Action Model

    cs.RO 2025-01 unverdicted novelty 5.0

    SpatialVLA adds 3D-aware position encoding and adaptive discretized action grids to visual-language-action models, enabling strong zero-shot performance and fine-tuning on new robot setups after pre-training on 1.1 mi...

Reference graph

Works this paper leans on

76 extracted references · 76 canonical work pages · cited by 22 Pith papers · 15 internal anchors
