Colosseum V2: Benchmarking Generalization for Vision Language Action Models

Alina Du; Ashvin Arora; Gaurav Sukhatme; Hyeonho Oh; Ishika Singh; Jeremy Morgan; Jesse Thomason; Jincen Song; Prajwal Vijay

arxiv: 2605.27759 · v1 · pith:MKXB7KNHnew · submitted 2026-05-26 · 💻 cs.RO

Pith reviewed 2026-06-29 16:28 UTC · model grok-4.3

classification 💻 cs.RO

keywords: vision-language-action models · generalization · robotics benchmark · manipulation tasks · simulation evaluation · distribution shifts · ecological validity · robot policies

The pith

Colosseum V2 benchmark demonstrates that current vision-language-action models have significant generalization limitations in robotic manipulation tasks.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

This paper establishes a new benchmark called Colosseum V2 to study how vision-language-action models perform when robot tasks change from their training conditions. Current models show limited success rates even in standard settings and degrade further when faced with variations such as different objects or environments. The benchmark uses simulation to run many tests quickly, and the paper reports that its results align well with real robot experiments. A reader would care because these models promise flexible robot control from language instructions, but the gaps revealed here limit their reliability in new situations. The standardized setup allows consistent progress tracking across research efforts.

Core claim

The central claim is that Colosseum V2, comprising 28 tasks in 13 categories across two robot morphologies, exposes limitations in both the base performance and the generalization capabilities of state-of-the-art methods such as ACT and Pi0.5. Built on the ManiSkill simulator for GPU-parallelized evaluation, the benchmark supports large-scale in-domain and out-of-domain testing. The paper further reports strong correlations between simulation metrics and real-world performance, which it offers as support for the benchmark's ecological validity in assessing generalization in robotic manipulation.

What carries the argument

Colosseum V2, a simulation-based benchmark with standardized tasks and metrics for evaluating VLA generalization under distribution shifts.

If this is right

State-of-the-art VLA methods exhibit degraded performance under distribution shifts, pointing to the need for improved robustness in translating perception to action.
Strong correlations between simulation and real-world results validate using the benchmark to predict real robot behavior.
Unified tasks, metrics, and protocols enable reproducible comparisons and reduce evaluation costs for developing general robot policies.
Accelerated progress toward general-purpose policies becomes possible through systematic benchmarking of generalization.
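
The sim-to-real point above comes down to a correlation between paired success rates. As a minimal sketch, assuming each evaluation condition yields one simulated and one hardware success rate, the following computes Pearson and Spearman coefficients in plain Python. The function names and the sample numbers are hypothetical placeholders, not code or values from the paper, and the paper's own choice of correlation statistic is not stated here.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    if vx == 0 or vy == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / sqrt(vx * vy)


def ranks(values):
    """1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    out = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            out[order[k]] = avg
        i = j + 1
    return out


def spearman(xs, ys):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    return pearson(ranks(xs), ranks(ys))


if __name__ == "__main__":
    # Hypothetical placeholder success rates, one pair per condition.
    sim_success = [0.80, 0.55, 0.40, 0.65, 0.30, 0.70]
    real_success = [0.75, 0.50, 0.45, 0.60, 0.20, 0.72]
    print("pearson:", pearson(sim_success, real_success))
    print("spearman:", spearman(sim_success, real_success))
```

Spearman is the more natural fit when the question is whether the ordering of conditions is preserved between simulation and hardware, while Pearson also rewards agreement in the size of the gaps.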

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Researchers could use the benchmark to test whether additional training on diverse simulated variations closes the observed generalization gaps.
Connections to other robotics benchmarks might reveal if the identified limitations are specific to VLA architectures or common across approaches.
Extending the benchmark to include more complex long-horizon tasks could uncover additional failure modes not captured in the current 28 tasks.
If the sim-real correlation holds broadly, it would support greater reliance on simulation for initial model development in robotics.

Load-bearing premise

The 28 tasks and selected distribution shifts capture the key variations relevant to real-world generalization of vision-language-action models.

What would settle it

Finding a vision-language-action model that performs well on Colosseum V2 but shows poor generalization in real-world tests with analogous shifts would indicate that the benchmark does not accurately reflect practical challenges.

Figures

Figures reproduced from arXiv: 2605.27759 by Alina Du, Ashvin Arora, Gaurav Sukhatme, Hyeonho Oh, Ishika Singh, Jeremy Morgan, Jesse Thomason, Jincen Song, Prajwal Vijay.

**Figure 2.** Overview of COLOSSEUM V2. Left: the full set of tasks across two robot morphologies (Single-Arm and Bimanual), spanning diverse manipulation primitives and long-horizon behaviors. Right: the perturbations used to evaluate visual, language, and action generalization. In total, the benchmark comprises 28 tasks across 13 task categories with 16 controlled perturbation factors.

**Figure 3.** Comparison of existing robot learning benchmarks and simulation platforms.

**Figure 4.** Simulator comparison. Frames per second (FPS) is computed as the…

**Figure 5.** Average change in success rate for each perturbation. The top row illustrates select perturbations for the D…

**Figure 6.** The per-task success rate of all models with no perturbations.

**Figure 8.** Hardware setup for ROTATEARROW. From left to right, the perturbations are None, Background-Color, Distractor-Objects, Light-Color, MO-Size, and MO-Color. Additional hardware tasks are shown in the Appendix, available on the project website: https://sites.google.com/usc.edu/colosseum-v2/ … across the three tasks is 0.916, demonstrating that the ordering of success rates on hardware is largely preserved betw…

Original abstract

Vision-Language-Action (VLA) models demonstrate promising generalization in robotic manipulation, driven by advances in large-scale vision and language pre-training. This progress can be misleading. Despite the zero-shot perception and language capabilities of VLAs, their overall task performance often degrades under distribution shifts, revealing gaps in how these systems translate high-level understanding into robust behavior. To systematically study this gap, we introduce Colosseum V2, a large-scale simulation benchmark for evaluating VLA generalization in robot learning across diverse conditions. The benchmark comprises 28 tasks spanning 13 task categories and two robot morphologies, covering a wide range of manipulation primitives and long-horizon behaviors. Built on the ManiSkill simulator, Colosseum V2 enables fast, GPU-parallelized evaluation and supports both in-domain and out-of-domain testing at scale. We evaluate state-of-the-art methods, including Action Chunking Transformers (ACT) and Pi0.5, and reveal limitations in both base performance and generalization. We demonstrate strong correlations between simulation and real-world metrics that support the ecological validity of the benchmark. By standardizing tasks, metrics, and evaluation protocols within a unified benchmark, Colosseum V2 enables reproducible and fair comparisons, reduced evaluation overhead, and accelerated progress toward general-purpose robot policies.
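
The in-domain versus out-of-domain comparison described in the abstract can be summarized as an average change in success rate per perturbation, which is the quantity the Figure 5 caption names. The following is a minimal sketch of that aggregation, assuming episode results are available as (task, perturbation, success) records and that unperturbed episodes are labeled "None" as in the Figure 8 caption. The record layout, the function names, and the second task name are hypothetical, and this is not the benchmark's actual evaluation code.

```python
from collections import defaultdict


def success_rate(outcomes):
    """Fraction of successful episodes in a non-empty list of 0/1 outcomes."""
    return sum(outcomes) / len(outcomes)


def perturbation_deltas(episodes, baseline="None"):
    """Average change in success rate for each perturbation.

    episodes: iterable of (task, perturbation, success) tuples.
    Returns {perturbation: mean over tasks of (perturbed - baseline rate)}.
    Tasks that have no baseline episodes are skipped.
    """
    by_cell = defaultdict(list)
    for task, perturbation, success in episodes:
        by_cell[(task, perturbation)].append(1 if success else 0)

    per_perturbation = defaultdict(list)
    for (task, perturbation), outcomes in by_cell.items():
        if perturbation == baseline:
            continue
        base = by_cell.get((task, baseline))
        if not base:
            continue  # cannot compute a change without a baseline
        per_perturbation[perturbation].append(
            success_rate(outcomes) - success_rate(base)
        )
    return {p: sum(d) / len(d) for p, d in per_perturbation.items()}


if __name__ == "__main__":
    # Hypothetical placeholder records, not results from the paper.
    demo = [
        ("RotateArrow", "None", True),
        ("RotateArrow", "None", True),
        ("RotateArrow", "Light-Color", True),
        ("RotateArrow", "Light-Color", False),
        ("HypotheticalTask", "None", True),
        ("HypotheticalTask", "None", False),
        ("HypotheticalTask", "Light-Color", False),
        ("HypotheticalTask", "Light-Color", False),
    ]
    print(perturbation_deltas(demo))
```

Averaging per-task differences, rather than pooling all episodes, keeps tasks with more episodes from dominating the result; whether the paper aggregates this way is an assumption.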

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

Colosseum V2 adds a standardized ManiSkill benchmark with 28 tasks for VLA generalization testing, but the abstract supplies no numbers to back the performance and correlation claims.

Colosseum V2 is a new benchmark with 28 tasks in 13 categories across two robot morphologies, all in ManiSkill, with built-in in-domain and out-of-domain splits for VLA models. The paper evaluates ACT and Pi0.5 and flags drops in performance plus some sim-to-real correlations.

It does a useful job laying out a single protocol with GPU-parallel runs so that different groups can run the same tests without reinventing the setup each time. That kind of standardization is practical for a field where everyone currently picks their own tasks.

The soft spots are clear from the abstract. There are no quantitative results, no task definitions, and no error breakdowns, so the size of the claimed gaps and the strength of the correlations cannot be judged. The representativeness of the 28 tasks and chosen shifts for actual real-world variation is also not independently checked, which leaves the ecological-validity argument resting on details that are not shown here.

This is for researchers who work on VLA policies and need a shared testbed to measure generalization. A reader focused on robustness would get value from the protocol if the full paper supplies the missing data. It deserves a serious referee to check the task coverage and the actual numbers.

Referee Report

2 major / 2 minor

Summary. The manuscript introduces Colosseum V2, a large-scale ManiSkill-based simulation benchmark comprising 28 tasks across 13 categories and two robot morphologies. It supports GPU-parallelized in-domain and out-of-domain evaluation of Vision-Language-Action models and reports evaluations of ACT and Pi0.5 that reveal limitations in base performance and generalization under distribution shifts, together with strong simulation-to-real correlations supporting ecological validity. The work positions the benchmark as a standardized platform to enable reproducible comparisons and accelerate progress on general-purpose robot policies.

Significance. If the central claims hold, the benchmark offers a scalable, standardized evaluation platform that could meaningfully advance VLA research by exposing generalization gaps in current methods and providing quantitative evidence for sim-to-real transfer. The GPU-parallelized execution and multi-morphology support are concrete strengths that address practical evaluation overhead.

major comments (2)

[Abstract and §4 (Evaluation)] The central claim that Colosseum V2 'reveals limitations in both base performance and generalization' of ACT and Pi0.5, and demonstrates 'strong correlations' supporting ecological validity, is presented without quantitative results, error bars, task-level success rates, or correlation coefficients in the abstract; if the corresponding tables or figures in the evaluation section lack these or an accompanying statistical analysis, the empirical grounding for the strongest claims is insufficient.
[§3 (Task Construction)] The selection of the 28 tasks, 13 categories, and chosen distribution shifts is load-bearing for interpreting observed performance gaps and sim-real correlations as field-general rather than benchmark-specific, yet no coverage analysis, ablation on omitted factors (contact-rich dynamics, novel geometries, sensor noise), or independent validation that these shifts span the relevant real-world variation space is provided.

minor comments (2)

[Abstract] Include at least one key quantitative result (e.g., average success rate or correlation coefficient) to make the summary self-contained.
[§3] Ensure all task categories and shift types are explicitly enumerated in a table for reproducibility.
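
The referee's request for error bars on success rates can be met with a standard binomial confidence interval. As a minimal sketch, assuming each reported success rate is a count of successes over a known number of independent episodes, the following computes a Wilson score interval. The function name and the example counts are hypothetical, and nothing here reflects how the paper actually computes its uncertainty estimates.

```python
from math import sqrt


def wilson_interval(successes, trials, z=1.96):
    """Wilson score interval for a binomial proportion.

    z=1.96 corresponds to an approximate 95% confidence level.
    Returns (lower, upper), clipped to [0, 1].
    """
    if trials <= 0:
        raise ValueError("trials must be positive")
    if not 0 <= successes <= trials:
        raise ValueError("successes must lie between 0 and trials")
    p = successes / trials
    denom = 1 + z * z / trials
    center = (p + z * z / (2 * trials)) / denom
    half = z * sqrt(p * (1 - p) / trials + z * z / (4 * trials * trials)) / denom
    return max(0.0, center - half), min(1.0, center + half)


if __name__ == "__main__":
    # Hypothetical example: 5 successes in 10 episodes.
    low, high = wilson_interval(5, 10)
    print(f"success rate 0.50, interval [{low:.3f}, {high:.3f}]")
```

The Wilson interval is chosen here over the simpler normal approximation because it stays well-behaved for small episode counts and for rates near 0 or 1, both of which are common in perturbation studies.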

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive comments on strengthening the empirical presentation and task justification. We address each major point below.

Referee: [Abstract and §4 (Evaluation)] The central claim that Colosseum V2 'reveals limitations in both base performance and generalization' of ACT and Pi0.5, and demonstrates 'strong correlations' supporting ecological validity, is presented without quantitative results, error bars, task-level success rates, or correlation coefficients in the abstract; if the corresponding tables or figures in the evaluation section lack these or an accompanying statistical analysis, the empirical grounding for the strongest claims is insufficient.

Authors: We agree the abstract would benefit from quantitative highlights. §4 already contains tables reporting task-level success rates for ACT and Pi0.5 on all 28 tasks under in-domain and out-of-domain conditions, plus figures showing performance under shifts. To address the concern directly, we will revise the abstract to include representative success rates and correlation strengths, add error bars to relevant figures, and include explicit correlation coefficients with basic statistical analysis in §4.
Revision: yes

Referee: [§3 (Task Construction)] The selection of the 28 tasks, 13 categories, and chosen distribution shifts is load-bearing for interpreting observed performance gaps and sim-real correlations as field-general rather than benchmark-specific, yet no coverage analysis, ablation on omitted factors (contact-rich dynamics, novel geometries, sensor noise), or independent validation that these shifts span the relevant real-world variation space is provided.

Authors: §3 motivates the 28 tasks and 13 categories by spanning diverse manipulation primitives (including contact-rich and long-horizon behaviors) and two morphologies, with perturbations targeting visual, language, and action generalization. We acknowledge the absence of a formal coverage analysis or ablations on every omitted factor. We will expand §3 with additional rationale for the selected shifts and their alignment with the robotics literature. Comprehensive ablations on all factors (e.g., sensor noise) exceed the current scope; the reported sim-to-real correlations provide empirical support for relevance.
Revision: partial

Circularity Check

0 steps flagged

No circularity: benchmark construction and empirical evaluation are independent of fitted self-referential quantities.

The paper introduces Colosseum V2 as a new simulation benchmark with 28 tasks, evaluates existing VLA methods (ACT, Pi0.5) on in/out-of-domain shifts, and reports observed performance gaps plus sim-real correlations. No equations, parameter fits, or derivations are present that reduce a claimed result to its own inputs by construction. The central claims rest on direct empirical measurement within the defined benchmark rather than any self-definitional loop, fitted-input prediction, or load-bearing self-citation chain. Representativeness of the task set is an external validity question, not a circularity issue.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The paper's contribution is the benchmark construction itself; it relies on the domain assumption that the ManiSkill simulator provides sufficient fidelity for generalization studies.

axioms (1)

Domain assumption: Simulation environments can approximate real-world robot dynamics sufficiently for generalization testing.
The benchmark's validity claim rests on this assumption, referenced via the reported sim-real correlations.

pith-pipeline@v0.9.1-grok · 5788 in / 1190 out tokens · 52374 ms · 2026-06-29T16:28:36.041346+00:00 · methodology


[1] [1]

ChatGPT: Optimizing language models for dialogue,

OpenAI, “ChatGPT: Optimizing language models for dialogue,” https: //openai.com/blog/chatgpt, 2022, accessed: 2024-08-17

2022

[2] [2]

SAM 2: Segment Anything in Images and Videos

N. Ravi, V . Gabeur, Y .-T. Hu, R. Hu, C. Ryali, T. Ma, H. Khedr, R. R ¨adle, C. Rolland, L. Gustafson, E. Mintun, J. Pan, K. V . Alwala, N. Carion, C.-Y . Wu, R. Girshick, P. Doll ´ar, and C. Feichtenhofer, “SAM 2: Segment anything in images and videos,”arXiv preprint arXiv:2408.00714, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024

[3] [3]

Rlbench: The robot learning benchmark and learning environment,

S. James, A. J. Davison, and E. Johns, “Rlbench: The robot learning benchmark and learning environment,” inIEEE Robotics and Automa- tion Letters, 2019

2019

[4] [4]

Pyrep: Bringing v- rep to deep robot learning,

S. James, Z. Ma, D. R. Arrojo, and A. J. Davison, “Pyrep: Bringing v- rep to deep robot learning,” inConference on Robot Learning (CoRL), 2019

2019

[5] [5]

Coppeliasim robot simulator,

Coppelia Robotics, “Coppeliasim robot simulator,” 2022, https://www.coppeliarobotics.com

2022

[6] [6]

The colosseum: A benchmark for evaluating generalization for robotic manipulation,

W. Pumacay, I. Singh, J. Duan, R. Krishna, J. Thomason, and D. Fox, “The colosseum: A benchmark for evaluating generalization for robotic manipulation,” inProceedings of Robotics: Science and Systems, 2024

2024

[7] [7]

Libero: Benchmarking knowledge transfer for lifelong robot learning,

B. Liuet al., “Libero: Benchmarking knowledge transfer for lifelong robot learning,” inConference on Robot Learning (CoRL), 2023

2023

[8] [8]

Libero-para: A diagnostic benchmark and metrics for paraphrase robustness in vla models,

C. Kim, M. Kim, M. Kang, H. Kim, and D. Jung, “Libero-para: A diagnostic benchmark and metrics for paraphrase robustness in vla models,” 2026. [Online]. Available: https://arxiv.org/abs/2603.28301

work page arXiv 2026

[9] [9]

Roboverse: Towards a unified platform for robotic manipulation,

A. Muraliet al., “Roboverse: Towards a unified platform for robotic manipulation,” inConference on Robot Learning Workshop, 2020

2020

[10] [10]

Roboarena: Distributed real-world evaluation of generalist robot policies,

R. Team, “Roboarena: Distributed real-world evaluation of generalist robot policies,” 2024

2024

[11] [11]

Robotwin: A platform for scalable robot learning,

——, “Robotwin: A platform for scalable robot learning,” 2024, https://robotwin-platform.github.io

2024

[12] [12]

Bimanual manipulation benchmark,

B. B. Team, “Bimanual manipulation benchmark,” 2024, https://bimanual.github.io

2024

[13] [13]

Calvin: A benchmark for language-conditioned policy learning for long-horizon robot manipulation tasks,

O. Mees, L. Hermann, E. Rosete-Beas, and W. Burgard, “Calvin: A benchmark for language-conditioned policy learning for long-horizon robot manipulation tasks,”IEEE Robotics and Automation Letters (RA- L), vol. 7, no. 3, pp. 7327–7334, 2022

2022

[14] [14]

Vlabench: A large-scale benchmark for language-conditioned robotics manipulation with long-horizon reasoning tasks,

S. Zhang, Z. Xu, P. Liu, X. Yu, Y . Li, Q. Gao, Z. Fei, Z. Yin, Z. Wu, Y .-G. Jiang, and X. Qiu, “Vlabench: A large-scale benchmark for language-conditioned robotics manipulation with long-horizon reasoning tasks,” 2024. [Online]. Available: https://arxiv.org/abs/2412.18194

work page arXiv 2024

[15] [15]

Vlmbench: A compositional benchmark for vision-and-language manipulation,

K. Zheng, X. Chen, O. C. Jenkins, and X. E. Wang, “Vlmbench: A compositional benchmark for vision-and-language manipulation,” 2022. [Online]. Available: https://arxiv.org/abs/2206.08522

work page arXiv 2022

[16] [16]

Manipbench: Benchmarking vision-language models for low-level robot manipulation,

E. Zhao, V . Raval, H. Zhang, J. Mao, Z. Shangguan, S. Nikolaidis, Y . Wang, and D. Seita, “Manipbench: Benchmarking vision-language models for low-level robot manipulation,” 2025. [Online]. Available: https://arxiv.org/abs/2505.09698

work page arXiv 2025

[17] [17]

R3m: A universal visual representation for robot manipulation,

S. Nair et al., “R3m: A universal visual representation for robot manipulation,” in Conference on Robot Learning, 2022

2022

[18] [18]

Mvp: Multi-view pretraining for vision-language robotics,

T. Xiao et al., “Mvp: Multi-view pretraining for vision-language robotics,” in Conference on Robot Learning, 2022

2022

[19] [19]

VIP: Towards Universal Visual Reward and Representation via Value-Implicit Pre-Training

Y. J. Ma, S. Sodhani, D. Jayaraman, O. Bastani, V. Kumar, and A. Zhang, “Vip: Towards universal visual reward and representation via value-implicit pre-training,” arXiv preprint arXiv:2210.00030, 2022

2022

[20] [20]

Cliport: What and where pathways for robotic manipulation,

M. Shridhar et al., “Cliport: What and where pathways for robotic manipulation,” in Conference on Robot Learning, 2022

2022

[21] [21]

Voxposer: Composable 3d value maps for robotic manipulation with language models,

W. Huang et al., “Voxposer: Composable 3d value maps for robotic manipulation with language models,” in Conference on Robot Learning, 2023

2023

[22] [22]

C2farm: Coarse-to-fine imitation learning for manipulation,

S. James et al., “C2farm: Coarse-to-fine imitation learning for manipulation,” in Conference on Robot Learning, 2022

2022

[23] [23]

Kite: Keyframe imitation for task execution,

P. Sundaresan et al., “Kite: Keyframe imitation for task execution,” in Conference on Robot Learning, 2023

2023

[24] [24]

Learning fine-grained bimanual manipulation with act,

T. Zhao et al., “Learning fine-grained bimanual manipulation with act,” arXiv preprint, 2023

2023

[25] [25]

Peract: Perceiver-actor for 6-dof manipulation,

M. Shridhar et al., “Peract: Perceiver-actor for 6-dof manipulation,” in Robotics: Science and Systems, 2022

2022

[26] [26]

Rvt: Robotic vision transformer for manipulation,

A. Goyal et al., “Rvt: Robotic vision transformer for manipulation,” in Conference on Robot Learning, 2023

2023

[27] [27]

Rvt-2: Scaling vision transformers for robot manipulation,

——, “Rvt-2: Scaling vision transformers for robot manipulation,” arXiv preprint, 2024

2024

[28] [28]

Act3d: 3d feature fields for manipulation policies,

T. Gervet et al., “Act3d: 3d feature fields for manipulation policies,” in Conference on Robot Learning, 2023

2023

[29] [29]

PaLM-E: An Embodied Multimodal Language Model

D. Driess et al., “Palm-e: An embodied multimodal language model,” arXiv preprint arXiv:2303.03378, 2023

2023

[30] [30]

RT-2: Vision-Language-Action Models Transfer Web Knowledge to Robotic Control

A. Brohan et al., “Rt-2: Vision-language-action models transfer web knowledge to robotic control,” arXiv preprint arXiv:2307.15818, 2023

2023

[31] [31]

OpenVLA: An Open-Source Vision-Language-Action Model

M. J. Kim et al., “Openvla: Vision-language-action models for robotics,” arXiv preprint arXiv:2406.09246, 2024

2024

[32] [32]

π0: A vision-language-action model for general robot control,

K. Black et al., “π0: A vision-language-action model for general robot control,” arXiv preprint arXiv:2405.03854, 2024

2024

[33] [33]

π0-fast: Fast vision-language-action models for robotics,

K. Pertsch et al., “π0-fast: Fast vision-language-action models for robotics,” arXiv preprint arXiv:2501.00000, 2025

2025

[34] [34]

π0.5: Vision-language-action models for open-world robotics,

P. I. Team, “π0.5: Vision-language-action models for open-world robotics,” arXiv preprint, 2025

2025

[35] [35]

Open x-embodiment: Robotic learning datasets and rt-x models,

A. Padalkar et al., “Open x-embodiment: Robotic learning datasets and rt-x models,” 2023

2023

[36] [36]

Lerobot: State-of-the-art machine learning for real-world robotics in pytorch,

R. Cadene, S. Alibert, A. Soare, Q. Gallouedec, A. Zouitine, S. Palma, P. Kooijmans, M. Aractingi, M. Shukor, D. Aubakirova, M. Russi, F. Capuano, C. Pascal, J. Choghari, J. Moss, and T. Wolf, “Lerobot: State-of-the-art machine learning for real-world robotics in pytorch,” https://github.com/huggingface/lerobot, 2024

2024

[37] [37]

Maniskill3: Gpu parallelized robotics simulation and rendering for generalizable embodied ai,

S. Tao, F. Xiang, A. Shukla, Y. Qin, X. Hinrichsen, X. Yuan, C. Bao, X. Lin, Y. Liu, T. kai Chan, Y. Gao, X. Li, T. Mu, N. Xiao, A. Gurha, V. N. Rajesh, Y. W. Choi, Y.-R. Chen, Z. Huang, R. Calandra, R. Chen, S. Luo, and H. Su, “Maniskill3: Gpu parallelized robotics simulation and rendering for generalizable embodied ai,” Robotics: Science and Systems, 2025

2025

[38] [38]

Learning Transferable Visual Models From Natural Language Supervision

A. Radford, J. W. Kim, C. Hallacy, A. Ramesh, G. Goh, S. Agarwal, G. Sastry, A. Askell, P. Mishkin, J. Clark, G. Krueger, and I. Sutskever, “Learning transferable visual models from natural language supervision,” 2021. [Online]. Available: https://arxiv.org/abs/2103.00020

2021

[39] [39]

Sigmoid Loss for Language Image Pre-Training

X. Zhai, B. Mustafa, A. Kolesnikov, and L. Beyer, “Sigmoid loss for language image pre-training,” 2023. [Online]. Available: https://arxiv.org/abs/2303.15343

2023

[40] [40]

Deep residual learning for image recognition,

K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning for image recognition,” in 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016, pp. 770–778

2016

[41] [41]

MolmoAct: Action Reasoning Models that can Reason in Space

J. Lee, J. Duan, H. Fang, Y. Deng, S. Liu, B. Li, B. Fang, J. Zhang, Y. R. Wang, S. Lee, W. Han, W. Pumacay, A. Wu, R. Hendrix, K. Farley, E. VanderBilt, A. Farhadi, D. Fox, and R. Krishna, “Molmoact: Action reasoning models that can reason in space,” 2025. [Online]. Available: https://arxiv.org/abs/2508.07917

2025

[42] [42]

Vima: General robot manipulation with multimodal prompts,

Y. Jiang, A. Gupta, Z. Zhang, G. Wang, Y. Dou, Y. Chen, L. Fei-Fei, A. Anandkumar, Y. Zhu, and L. Fan, “Vima: General robot manipulation with multimodal prompts,” 2023. [Online]. Available: https://arxiv.org/abs/2210.03094

2023

[43] [43]

Learning an actionable discrete diffusion policy via large-scale actionless video pre-training,

H. He, C. Bai, L. Pan, W. Zhang, B. Zhao, and X. Li, “Learning an actionable discrete diffusion policy via large-scale actionless video pre-training,” in The Thirty-eighth Annual Conference on Neural Information Processing Systems, 2024

2024

[44] [45]

Unified Video Action Model

S. Li, Y. Gao, D. Sadigh, and S. Song, “Unified video action model,” arXiv preprint arXiv:2503.00200, 2025

2025

[45] [46]

Unified world models: Coupling video and action diffusion for pretraining on large robotic datasets,

C. Zhu, R. Yu, S. Feng, B. Burchfiel, P. Shah, and A. Gupta, “Unified world models: Coupling video and action diffusion for pretraining on large robotic datasets,” in Proceedings of Robotics: Science and Systems (RSS), 2025

2025

[46] [47]

Video Prediction Policy: A Generalist Robot Policy with Predictive Visual Representations

Y. Hu, Y. Guo, P. Wang, X. Chen, Y.-J. Wang, J. Zhang, K. Sreenath, C. Lu, and J. Chen, “Video prediction policy: A generalist robot policy with predictive visual representations,” arXiv preprint arXiv:2412.14803, 2024

2024

[47] [48]

Dit4dit: Jointly modeling video dynamics and actions for generalizable robot control,

T. Ma, J. Zheng, Z. Wang, C. Jiang, A. Cui, J. Liang, and S. Yang, “Dit4dit: Jointly modeling video dynamics and actions for generalizable robot control,” 2026. [Online]. Available: https://arxiv.org/abs/2603.10448

2026

[48] [49]

Contrast sets for evaluating language-guided robot policies,

A. Anwar, R. Gupta, and J. Thomason, “Contrast sets for evaluating language-guided robot policies,” 2024. [Online]. Available: https://arxiv.org/abs/2406.13636

2024