arxiv: 2605.27759 · v1 · pith:MKXB7KNHnew · submitted 2026-05-26 · 💻 cs.RO

Colosseum V2: Benchmarking Generalization for Vision Language Action Models

Pith reviewed 2026-06-29 16:28 UTC · model grok-4.3

classification 💻 cs.RO
keywords vision-language-action models · generalization · robotics benchmark · manipulation tasks · simulation evaluation · distribution shifts · ecological validity · robot policies

The pith

The Colosseum V2 benchmark shows that current vision-language-action models have significant generalization limitations in robotic manipulation tasks.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

This paper introduces a new benchmark, Colosseum V2, to study how vision-language-action models perform when robot tasks depart from their training conditions. Current models show limited success rates even in standard settings, and their performance degrades further under variations such as different objects or environments. The benchmark uses simulation to run many tests quickly, and its results are reported to correlate strongly with real robot experiments. A reader would care because these models promise flexible robot control from language instructions, but the gaps revealed here limit their reliability in new situations. The standardized setup allows progress to be tracked consistently across research efforts.

Core claim

The central claim is that Colosseum V2, comprising 28 tasks in 13 categories across two robot morphologies, exposes limitations in both the base performance and the generalization capabilities of state-of-the-art methods such as ACT and Pi0.5. Built on the ManiSkill simulator for efficient GPU-parallelized evaluation, the benchmark supports large-scale in-domain and out-of-domain testing. The paper further reports strong correlations between simulation metrics and real-world performance, which it takes as support for the benchmark's ecological validity.
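
To make the in-domain versus out-of-domain comparison concrete, here is a minimal sketch of how a per-perturbation change in success rate could be computed against an unperturbed baseline. The helper name `success_rate_change` and every number are hypothetical, chosen only for illustration; the perturbation names are borrowed from the Figure 8 caption. This is not the paper's evaluation code.

```python
from statistics import mean


def success_rate_change(baseline, perturbed):
    """Return the absolute change in success rate for each perturbation.

    baseline:  success rate with no perturbation, in [0, 1].
    perturbed: dict mapping perturbation name -> success rate in [0, 1].
    A negative value means the perturbation hurt performance.
    """
    return {name: rate - baseline for name, rate in perturbed.items()}


# Hypothetical numbers for one model on one task, NOT results from the paper.
baseline = 0.80
perturbed = {
    "Background-Color": 0.70,
    "Distractor-Objects": 0.50,
    "Light-Color": 0.75,
}

changes = success_rate_change(baseline, perturbed)
average_change = mean(changes.values())

for name, delta in changes.items():
    print(f"{name}: {delta:+.2f}")
print(f"average change: {average_change:+.2f}")
```

Reporting the signed change per factor, rather than a single averaged score, keeps it visible which kind of shift a policy is most sensitive to.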

What carries the argument

Colosseum V2, a simulation-based benchmark with standardized tasks and metrics for evaluating VLA generalization under distribution shifts.

If this is right

  • State-of-the-art VLA methods exhibit degraded performance under distribution shifts, pointing to the need for improved robustness in translating perception to action.
  • Strong correlations between simulation and real-world results validate using the benchmark to predict real robot behavior.
  • Unified tasks, metrics, and protocols enable reproducible comparisons and reduce evaluation costs for developing general robot policies.
  • Accelerated progress toward general-purpose policies becomes possible through systematic benchmarking of generalization.
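
As an illustration of what a simulation-to-real correlation check might look like, the sketch below computes Pearson and Spearman coefficients between paired success rates. This page does not state which coefficient the paper reports, so showing both is an assumption; the function names and all data values are hypothetical.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    if vx == 0 or vy == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / sqrt(vx * vy)


def ranks(values):
    """1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = shared_rank
        i = j + 1
    return result


def spearman(xs, ys):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    return pearson(ranks(xs), ranks(ys))


# Hypothetical paired success rates (same condition in sim and on hardware).
sim_rates = [0.80, 0.70, 0.50, 0.75, 0.60]
real_rates = [0.70, 0.65, 0.30, 0.60, 0.55]

print(f"Pearson:  {pearson(sim_rates, real_rates):.3f}")
print(f"Spearman: {spearman(sim_rates, real_rates):.3f}")
```

A rank-based coefficient asks only whether the ordering of conditions is preserved between simulation and hardware, while Pearson also rewards a linear relationship between the rates themselves.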

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Researchers could use the benchmark to test whether additional training on diverse simulated variations closes the observed generalization gaps.
  • Connections to other robotics benchmarks might reveal if the identified limitations are specific to VLA architectures or common across approaches.
  • Extending the benchmark to include more complex long-horizon tasks could uncover additional failure modes not captured in the current 28 tasks.
  • If the sim-real correlation holds broadly, it would support greater reliance on simulation for initial model development in robotics.

Load-bearing premise

The 28 tasks and selected distribution shifts capture the key variations relevant to real-world generalization of vision-language-action models.

What would settle it

Finding a vision-language-action model that performs well on Colosseum V2 but shows poor generalization in real-world tests with analogous shifts would indicate that the benchmark does not accurately reflect practical challenges.

Figures

Figures reproduced from arXiv: 2605.27759 by Alina Du, Ashvin Arora, Gaurav Sukhatme, Hyeonho Oh, Ishika Singh, Jeremy Morgan, Jesse Thomason, Jincen Song, Prajwal Vijay.

Figure 1. Overall generalization results. The left plot shows average absolute …
Figure 2. Overview of COLOSSEUM V2. Left: the full set of tasks across two robot morphologies (Single-Arm and Bimanual), spanning diverse manipulation primitives and long-horizon behaviors. Right: the perturbations used to evaluate visual, language, and action generalization. In total, the benchmark comprises 28 tasks across 13 task categories with 16 controlled perturbation factors.
Figure 3. Comparison of existing robot learning benchmarks and simulation platforms. C…
Figure 4. Simulator comparison. Frames per second (FPS) is computed as the …
Figure 5. Average change in success rate for each perturbation. The top row illustrates select perturbations for the D…
Figure 6. The per-task success rate of all models with no perturbations. The x…
Figure 8. Hardware setup for ROTATEARROW. From left to right, the perturbations are None, Background-Color, Distractor-Objects, Light-Color, MO-Size, and MO-Color. Additional hardware tasks are shown in the Appendix, available on the project's website: https://sites.google.com/usc.edu/colosseum-v2/
… across the three tasks is 0.916, demonstrating that the ordering of success rates on hardware is largely preserved betw…
original abstract

Vision-Language-Action (VLA) models demonstrate promising generalization in robotic manipulation, driven by advances in large-scale vision and language pre-training. This progress can be misleading. Despite the zero-shot perception and language capabilities of VLAs, their overall task performance often degrades under distribution shifts, revealing gaps in how these systems translate high-level understanding into robust behavior. To systematically study this gap, we introduce Colosseum V2, a large-scale simulation benchmark for evaluating VLA generalization in robot learning across diverse conditions. The benchmark comprises 28 tasks spanning 13 task categories and two robot morphologies, covering a wide range of manipulation primitives and long-horizon behaviors. Built on the ManiSkill simulator, Colosseum V2 enables fast, GPU-parallelized evaluation and supports both in-domain and out-of-domain testing at scale. We evaluate state-of-the-art methods, including Action Chunking Transformers (ACT) and Pi0.5, and reveal limitations in both base performance and generalization. We demonstrate strong correlations between simulation and real-world metrics that support the ecological validity of the benchmark. By standardizing tasks, metrics, and evaluation protocols within a unified benchmark, Colosseum V2 enables reproducible and fair comparisons, reduced evaluation overhead, and accelerated progress toward general-purpose robot policies.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript introduces Colosseum V2, a large-scale ManiSkill-based simulation benchmark comprising 28 tasks across 13 categories and two robot morphologies. It supports GPU-parallelized in-domain and out-of-domain evaluation of Vision-Language-Action models and reports evaluations of ACT and Pi0.5 that reveal limitations in base performance and generalization under distribution shifts, together with strong simulation-to-real correlations supporting ecological validity. The work positions the benchmark as a standardized platform to enable reproducible comparisons and accelerate progress on general-purpose robot policies.

Significance. If the central claims hold, the benchmark offers a scalable, standardized evaluation platform that could meaningfully advance VLA research by exposing generalization gaps in current methods and providing quantitative evidence for sim-to-real transfer. The GPU-parallelized execution and multi-morphology support are concrete strengths that address practical evaluation overhead.

major comments (2)
  1. [Abstract and §4] Abstract and §4 (Evaluation): the central claim that Colosseum V2 'reveals limitations in both base performance and generalization' of ACT and Pi0.5, and demonstrates 'strong correlations' supporting ecological validity, is presented without quantitative results, error bars, task-level success rates, or correlation coefficients in the abstract; if the corresponding tables or figures in the evaluation section lack these or an accompanying statistical analysis, the empirical grounding for the strongest claims is insufficient.
  2. [§3] §3 (Task Construction): the selection of the 28 tasks, 13 categories, and chosen distribution shifts is load-bearing for interpreting observed performance gaps and sim-real correlations as field-general rather than benchmark-specific, yet no coverage analysis, ablation on omitted factors (contact-rich dynamics, novel geometries, sensor noise), or independent validation that these shifts span the relevant real-world variation space is provided.
minor comments (2)
  1. [Abstract] Abstract: include at least one key quantitative result (e.g., average success rate or correlation coefficient) to make the summary self-contained.
  2. [§3] Ensure all task categories and shift types are explicitly enumerated in a table for reproducibility.
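
The request for error bars on success rates can be illustrated with a short sketch. The example below uses the Wilson score interval for a binomial proportion, which is one common choice and an assumption here, not something the paper or the referee prescribes. The function name `wilson_interval` and the episode counts are hypothetical.

```python
from math import sqrt


def wilson_interval(successes, trials, z=1.96):
    """Wilson score interval for a success rate.

    successes: number of successful episodes.
    trials:    total number of evaluation episodes (> 0).
    z:         normal quantile; 1.96 corresponds to roughly 95% coverage.
    Returns (lower, upper), both clipped to [0, 1].
    """
    if trials <= 0:
        raise ValueError("trials must be positive")
    if not 0 <= successes <= trials:
        raise ValueError("successes must lie between 0 and trials")
    p = successes / trials
    denom = 1 + z * z / trials
    center = (p + z * z / (2 * trials)) / denom
    half = z * sqrt(p * (1 - p) / trials + z * z / (4 * trials * trials)) / denom
    return max(0.0, center - half), min(1.0, center + half)


# Hypothetical evaluation: 40 successes out of 50 episodes on one task.
lower, upper = wilson_interval(40, 50)
print(f"success rate 0.80, interval [{lower:.3f}, {upper:.3f}]")
```

Unlike the simple normal approximation, this interval stays inside [0, 1] and remains usable when a task's success rate is close to 0 or 1, which is common for hard perturbations.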

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive comments on strengthening the empirical presentation and task justification. We address each major point below.

point-by-point responses
  1. Referee: [Abstract and §4] Abstract and §4 (Evaluation): the central claim that Colosseum V2 'reveals limitations in both base performance and generalization' of ACT and Pi0.5, and demonstrates 'strong correlations' supporting ecological validity, is presented without quantitative results, error bars, task-level success rates, or correlation coefficients in the abstract; if the corresponding tables or figures in the evaluation section lack these or an accompanying statistical analysis, the empirical grounding for the strongest claims is insufficient.

    Authors: We agree the abstract would benefit from quantitative highlights. §4 already contains tables reporting task-level success rates for ACT and Pi0.5 on all 28 tasks under in-domain and out-of-domain conditions, plus figures showing performance under shifts. To address the concern directly, we will revise the abstract to include representative success rates and correlation strengths, add error bars to the relevant figures, and include explicit correlation coefficients with basic statistical analysis in §4.
    revision: yes

  2. Referee: [§3] §3 (Task Construction): the selection of the 28 tasks, 13 categories, and chosen distribution shifts is load-bearing for interpreting observed performance gaps and sim-real correlations as field-general rather than benchmark-specific, yet no coverage analysis, ablation on omitted factors (contact-rich dynamics, novel geometries, sensor noise), or independent validation that these shifts span the relevant real-world variation space is provided.

    Authors: §3 motivates the 28 tasks and 13 categories by spanning diverse ManiSkill primitives (including contact-rich and long-horizon behaviors) and two morphologies, with shifts targeting visual, dynamic, and embodiment variations. We acknowledge the absence of a formal coverage analysis or ablations on every omitted factor. We will expand §3 with additional rationale for the selected shifts and their alignment with the robotics literature. Comprehensive ablations on all factors (e.g., sensor noise) exceed the current scope; the reported sim-to-real correlations provide empirical support for relevance.
    revision: partial

Circularity Check

0 steps flagged

No circularity: benchmark construction and empirical evaluation are independent of fitted self-referential quantities.

full rationale

The paper introduces Colosseum V2 as a new simulation benchmark with 28 tasks, evaluates existing VLA methods (ACT, Pi0.5) on in/out-of-domain shifts, and reports observed performance gaps plus sim-real correlations. No equations, parameter fits, or derivations are present that reduce a claimed result to its own inputs by construction. The central claims rest on direct empirical measurement within the defined benchmark rather than any self-definitional loop, fitted-input prediction, or load-bearing self-citation chain. Representativeness of the task set is an external validity question, not a circularity issue.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The paper's contribution is the benchmark construction itself; it relies on the domain assumption that the ManiSkill simulator provides sufficient fidelity for generalization studies.

axioms (1)
  • domain assumption: Simulation environments can approximate real-world robot dynamics sufficiently for generalization testing.
    The benchmark's validity claim rests on this assumption, referenced via the reported sim-real correlations.


