pith. machine review for the scientific record.

arxiv: 2604.18463 · v2 · submitted 2026-04-20 · 💻 cs.AI · cs.LG · cs.RO


Using large language models for embodied planning introduces systematic safety risks

Fan Shi, Jiajun Wu, Kaixian Qu, Manling Li, Marco Hutter, Tao Zhang, Zhibin Li


Pith reviewed 2026-05-10 04:42 UTC · model grok-4.3

classification 💻 cs.AI · cs.LG · cs.RO
keywords large language models · embodied AI · robotic planning · safety evaluation · benchmark · scaling laws · danger awareness · embodied planning

The pith

Large language models that plan well for robots still generate dangerous actions in over a quarter of tasks.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes that planning ability and safety awareness are distinct capacities in language models for embodied tasks. It introduces a large benchmark of over 12,000 tasks to measure both, finding that even the best planners produce unsafe plans frequently. Scaling up models boosts planning success rates dramatically but leaves safety awareness largely unchanged. This means larger models become safer overall only because they succeed at more tasks, not because they avoid dangers better. The result highlights a key limitation for using these models in real robotic systems where both success and safety matter.

Core claim

Even near-perfect planning ability does not ensure safety: the best-planning model fails to produce a valid plan on only 0.4% of tasks but produces dangerous plans on 28.3%. Among open-source models, planning ability improves with scale from 0.4% to 99.3% while safety awareness remains between 38% and 57%. Larger models complete more tasks safely primarily through improved planning rather than better danger avoidance. Proprietary reasoning models achieve higher safety awareness of 71-81%.
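The arithmetic behind this claim can be made concrete. A minimal sketch of the multiplicative relationship, assuming (hypothetically) a flat 45% safety awareness at both ends of the scaling range:

```python
# Illustrative sketch only: the multiplicative form is the paper's stated model;
# the 45% awareness figure is an assumed mid-range value, not per-model data.

def safe_completion(planning_success: float, safety_awareness: float) -> float:
    """Fraction of tasks completed safely under the multiplicative model."""
    return planning_success * safety_awareness

# Smallest open-source planner: planning is the bottleneck.
small = safe_completion(0.004, 0.45)   # roughly 0.2% of tasks done safely

# Frontier-scale open-source planner: planning near saturation, awareness flat.
large = safe_completion(0.993, 0.45)   # roughly 45% of tasks done safely

# With awareness held fixed, the entire gain comes from the planning term.
print(small, large, large / small)
```

Read this way, a model can look far safer end to end while its danger avoidance has not moved at all.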

What carries the argument

The DESPITE benchmark: 12,279 tasks spanning physical and normative dangers, with fully deterministic validation that separates measures of planning success from safety violations.
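What deterministic validation that separates the two measures could look like can be shown with a toy validator. This is a hypothetical sketch, not the DESPITE implementation (whose decision rules are not given here); it assumes plans are action lists, goals are predicates, and dangers are per-action labels:

```python
# Hypothetical sketch, not the DESPITE implementation: score a plan on two
# independent axes: does it reach the goal, and does it cross a danger label.

def evaluate_plan(plan, goal_check, danger_labels):
    """Return (valid, safe) for one plan under deterministic rules.

    plan: ordered list of action strings
    goal_check: predicate over the full action sequence
    danger_labels: set of actions annotated as physically or normatively dangerous
    """
    valid = goal_check(plan)
    safe = all(action not in danger_labels for action in plan)
    return valid, safe

# A plan that completes the task but crosses a danger label: valid yet unsafe.
plan = ["pick_up(knife)", "hand_to(child)", "place(knife, drawer)"]
valid, safe = evaluate_plan(
    plan,
    goal_check=lambda p: p[-1] == "place(knife, drawer)",
    danger_labels={"hand_to(child)"},
)
print(valid, safe)  # True, False: the two axes come apart
```

Scoring validity and safety independently per plan is what lets a 0.4% invalid-plan rate coexist with a 28.3% dangerous-plan rate.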

Load-bearing premise

The 12,279 tasks and their danger labels accurately represent the physical and normative risks that would arise in actual embodied robotic deployments.

What would settle it

Demonstrating a language model that achieves over 99% valid plans while keeping dangerous plans below 10% on the DESPITE benchmark would directly challenge the separation of planning and safety capacities.

read the original abstract

Large language models are increasingly used as planners for robotic systems, yet how safely they plan remains an open question. To evaluate safe planning systematically, we introduce DESPITE, a benchmark of 12,279 tasks spanning physical and normative dangers with fully deterministic validation. Across 23 models, even near-perfect planning ability does not ensure safety: the best-planning model fails to produce a valid plan on only 0.4% of tasks but produces dangerous plans on 28.3%. Among 18 open-source models from 3B to 671B parameters, planning ability improves substantially with scale (0.4-99.3%) while safety awareness remains relatively flat (38-57%). We identify a multiplicative relationship between these two capacities, showing that larger models complete more tasks safely primarily through improved planning, not through better danger avoidance. Three proprietary reasoning models reach notably higher safety awareness (71-81%), while non-reasoning proprietary models and open-source reasoning models remain below 57%. As planning ability approaches saturation for frontier models, improving safety awareness becomes a central challenge for deploying language-model planners in robotic systems.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper introduces DESPITE, a benchmark of 12,279 tasks spanning physical and normative dangers with deterministic validation, to assess safety in LLM-based embodied planning. Across 23 models, it reports that near-perfect planning (0.4% invalid plans for the best model) does not ensure safety (28.3% dangerous plans). Planning ability scales strongly with model size (0.4-99.3%) while safety awareness remains relatively flat (38-57%) for open-source models, with a multiplicative relationship between the two; proprietary reasoning models reach higher safety awareness (71-81%).

Significance. If the danger annotations accurately capture real embodied risks, the results establish that planning capability and safety awareness are partially decoupled in current LLMs, with scale primarily boosting the former. This provides a concrete, large-scale empirical basis for prioritizing safety-specific improvements in robotic planners and highlights gaps between open and proprietary reasoning models. The benchmark itself, with its scale and deterministic validation, is a useful resource for the community.

major comments (2)
  1. [§3] §3 (DESPITE Benchmark Construction): The manuscript states that tasks have 'fully deterministic validation' and span physical/normative dangers, but provides no explicit decision rules, classification examples, inter-annotator agreement, or grounding against actual robot failure modes. This is load-bearing for the central claim, as the 28.3% dangerous-plan rate (and the gap with 0.4% invalid plans) depends entirely on the validity of these labels; without them, sensitivity to labeling choices cannot be assessed.
  2. [§5] §5 (Scaling and Multiplicative Relationship): The claim that larger models complete more tasks safely 'primarily through improved planning, not through better danger avoidance' rests on a multiplicative relationship, yet the paper does not report the exact functional form, regression coefficients, or statistical controls used to establish it. This weakens the interpretation of the flat safety-awareness curve (38-57%) relative to the planning curve.
minor comments (2)
  1. [Table 2] Table 2 (Model Results): Include a column or footnote explicitly marking which of the 23 models are reasoning vs. non-reasoning and their exact parameter counts to support the scaling claims.
  2. [Abstract] Abstract: The safety-awareness range '38-57%' for open-source models should identify the specific models at each extreme for immediate interpretability.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their positive evaluation of the work's significance and for the constructive comments on benchmark transparency and the scaling analysis. We address each point below and have prepared revisions to strengthen the manuscript.

read point-by-point responses
  1. Referee: [§3] §3 (DESPITE Benchmark Construction): The manuscript states that tasks have 'fully deterministic validation' and span physical/normative dangers, but provides no explicit decision rules, classification examples, inter-annotator agreement, or grounding against actual robot failure modes. This is load-bearing for the central claim, as the 28.3% dangerous-plan rate (and the gap with 0.4% invalid plans) depends entirely on the validity of these labels; without them, sensitivity to labeling choices cannot be assessed.

    Authors: We agree that greater transparency on the annotation process is warranted to support the central claims. The current §3 describes the high-level categories and deterministic validation procedure, but does not include the requested explicit decision rules or examples. In the revised manuscript we will add a dedicated subsection with (i) the full decision rules used to classify physical versus normative dangers, (ii) multiple concrete classification examples per category, (iii) inter-annotator agreement statistics, and (iv) explicit grounding of the danger taxonomy against documented robot failure modes from the robotics literature. We will also include a brief sensitivity analysis demonstrating robustness of the 28.3% dangerous-plan rate to plausible labeling variations. These additions directly address the concern about label validity. revision: yes

  2. Referee: [§5] §5 (Scaling and Multiplicative Relationship): The claim that larger models complete more tasks safely 'primarily through improved planning, not through better danger avoidance' rests on a multiplicative relationship, yet the paper does not report the exact functional form, regression coefficients, or statistical controls used to establish it. This weakens the interpretation of the flat safety-awareness curve (38-57%) relative to the planning curve.

    Authors: We appreciate the request for greater precision. The manuscript states that safe completion equals planning success multiplied by conditional safety awareness, which produces the observed flat safety-awareness curve. However, the exact functional form, regression details, and controls are not reported in the main text or appendix. In the revision we will (i) state the functional form explicitly (safe_completion = planning_success × safety_awareness), (ii) report the regression coefficients and goodness-of-fit statistics, and (iii) describe the statistical controls for model family and size. These clarifications will strengthen the interpretation that scale primarily improves planning rather than danger avoidance. revision: yes
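The promised attribution can be sketched directly from that functional form. Pairing the reported extremes (0.4% planning with 38% awareness, 99.3% with 57%) is an assumption made here for illustration, not the paper's per-model data:

```python
# Under safe = planning × awareness, the log-change in safe completion between
# two models splits exactly into a planning term and an awareness term.
# The pairing of range endpoints below is an illustrative assumption.
import math

p_small, a_small = 0.004, 0.38   # low end of the reported open-source ranges
p_large, a_large = 0.993, 0.57   # high end of the reported open-source ranges

d_plan = math.log(p_large / p_small)    # contribution of better planning
d_aware = math.log(a_large / a_small)   # contribution of better awareness
share_planning = d_plan / (d_plan + d_aware)
print(f"planning accounts for {share_planning:.0%} of the log-gain")  # ~93%
```

Even granting the most favorable pairing for safety awareness, the planning term dominates the gain, which is the substance of the authors' "primarily through improved planning" claim.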

Circularity Check

0 steps flagged

No circularity: claims rest on direct empirical evaluation of models against an external benchmark

full rationale

The paper introduces the DESPITE benchmark of 12,279 tasks and reports measured planning success rates and safety awareness percentages across 23 models. Key results such as the 0.4% invalid-plan rate versus 28.3% dangerous-plan rate for the best model, the scaling trends with model size, and the multiplicative relationship between planning and safety are presented as direct observations from running the models on the benchmark with deterministic validation. No equations, derivations, fitted parameters, or predictions are claimed; there are no self-definitional loops, no inputs relabeled as outputs, and no load-bearing self-citations that reduce the central claims to prior author work. The analysis is therefore self-contained against the external benchmark and model evaluations.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 1 invented entity

The central claims depend on the assumption that the benchmark tasks and danger definitions are representative of real-world embodied risks; no free parameters or invented physical entities are introduced.

axioms (1)
  • domain assumption The defined tasks and danger categories accurately capture physical and normative risks relevant to robotic systems
    This assumption underpins the interpretation of the 28.3% dangerous-plan rate as a genuine safety problem.
invented entities (1)
  • DESPITE benchmark (no independent evidence)
    purpose: Provide a standardized, deterministic test suite for safe embodied planning
    Newly constructed for the study; no independent external validation of its coverage is provided in the abstract.

pith-pipeline@v0.9.0 · 5513 in / 1253 out tokens · 47491 ms · 2026-05-10T04:42:28.129243+00:00 · methodology

discussion (0)


Reference graph

Works this paper leans on

87 extracted references · 32 canonical work pages · 19 internal anchors

  1. [1]

    Vision-language- action models for robotics: A review towards real-world applications.IEEE Access, 2025

    Kento Kawaharazuka, Jihoon Oh, Jun Yamada, Ingmar Posner, and Yuke Zhu. Vision-language- action models for robotics: A review towards real-world applications.IEEE Access, 2025

  2. [2]

    Do as i can, not as i say: Grounding 17 language in robotic affordances

    Anthony Brohan, Yevgen Chebotar, Chelsea Finn, Karol Hausman, Alexander Herzog, Daniel Ho, Julian Ibarz, Alex Irpan, Eric Jang, Ryan Julian, et al. Do as i can, not as i say: Grounding 17 language in robotic affordances. InConference on robot learning, pages 287–318. PMLR, 2023

  3. [3]

    Code as policies: Language model programs for embodied control

    Jacky Liang, Wenlong Huang, Fei Xia, Peng Xu, Karol Hausman, Brian Ichter, Pete Florence, and Andy Zeng. Code as policies: Language model programs for embodied control. In2023 IEEE International conference on robotics and automation (ICRA), pages 9493–9500. IEEE, 2023

  4. [4]

    Generating robot constitutions & benchmarks for semantic safety.arXiv preprint arXiv:2503.08663, 2025

    Pierre Sermanet, Anirudha Majumdar, Alex Irpan, Dmitry Kalashnikov, and Vikas Sind- hwani. Generating robot constitutions & benchmarks for semantic safety.arXiv preprint arXiv:2503.08663, 2025

  5. [5]

    SafeAgentBench: A benchmark for safe task planning of embodied LLM agents

    Sheng Yin, Xianghe Pang, Yuanzhuo Ding, Menglan Chen, Yutong Bi, Yichen Xiong, Wenhao Huang, Zhen Xiang, Jing Shao, and Siheng Chen. Safeagentbench: A benchmark for safe task planning of embodied llm agents.arXiv preprint arXiv:2412.13178, 2024

  6. [6]

    A framework for benchmarking and aligning task-planning safety in llm-based embodied agents.arXiv preprintarXiv:2504.14650, 2025

    Yuting Huang, Leilei Ding, Zhipeng Tang, Tianfu Wang, Xinrui Lin, Wuyang Zhang, Mingxiao Ma, and Yanyong Zhang. A framework for benchmarking and aligning task-planning safety in llm-based embodied agents.arXiv preprint arXiv:2504.14650, 2025

  7. [7]

    Subtle risks, critical failures: A framework for diagnosing physical safety of llms for embodied decision making

    Yejin Son, Minseo Kim, Sungwoong Kim, Seungju Han, Jian Kim, Dongju Jang, Youngjae Yu, and Chan Young Park. Subtle risks, critical failures: A framework for diagnosing physical safety of llms for embodied decision making. InProceedings of the 2025 Conference on Empirical Methods in Natural Language Processing, pages 25703–25744, 2025

  8. [8]

    Is- bench: Evaluating interactive safety of vlm-driven embodied agents in daily household tasks.arXiv preprint arXiv:2506.16402, 2025

    Xiaoya Lu, Zeren Chen, Xuhao Hu, Yijin Zhou, Weichen Zhang, Dongrui Liu, Lu Sheng, and Jing Shao. Is-bench: Evaluating interactive safety of vlm-driven embodied agents in daily household tasks.arXiv preprint arXiv:2506.16402, 2025

  9. [9]

    Outrageously Large Neural Networks: The Sparsely-Gated Mixture-of-Experts Layer

    Noam Shazeer, Azalia Mirhoseini, Krzysztof Maziarz, Andy Davis, Quoc Le, Geoffrey Hinton, and Jeff Dean. Outrageously large neural networks: The sparsely-gated mixture-of-experts layer. arXiv preprint arXiv:1701.06538, 2017

  10. [10]

    routledge, 2013

    Jacob Cohen.Statistical power analysis for the behavioral sciences. routledge, 2013

  11. [11]

    Semantically safe robot manipulation: From semantic scene understanding to motion safeguards.IEEE Robotics and Automation Letters, 2025

    Lukas Brunke, Yanni Zhang, Ralf R¨ omer, Jack Naimer, Nikola Staykov, Siqi Zhou, and Angela P Schoellig. Semantically safe robot manipulation: From semantic scene understanding to motion safeguards.IEEE Robotics and Automation Letters, 2025

  12. [12]

    Constitutional AI: Harmlessness from AI Feedback

    Yuntao Bai, Saurav Kadavath, Sandipan Kundu, Amanda Askell, Jackson Kernion, Andy Jones, Anna Chen, Anna Goldie, Azalia Mirhoseini, Cameron McKinnon, et al. Constitutional ai: Harmlessness from ai feedback.arXiv preprint arXiv:2212.08073, 2022

  13. [13]

    GPT-4 Technical Report

    Josh Achiam, Steven Adler, Sandhini Agarwal, Lama Ahmad, Ilge Akkaya, Florencia Leoni Aleman, Diogo Almeida, Janko Altenschmidt, Sam Altman, Shyamal Anadkat, et al. Gpt-4 technical report.arXiv preprint arXiv:2303.08774, 2023

  14. [14]

    Gemini: A Family of Highly Capable Multimodal Models

    Gemini Team, Rohan Anil, Sebastian Borgeaud, Jean-Baptiste Alayrac, Jiahui Yu, Radu Soricut, Johan Schalkwyk, Andrew M Dai, Anja Hauth, Katie Millican, et al. Gemini: a family of highly capable multimodal models.arXiv preprint arXiv:2312.11805, 2023

  15. [15]

    Gemini 1.5: Unlocking multimodal understanding across millions of tokens of context

    Gemini Team, Petko Georgiev, Ving Ian Lei, Ryan Burnell, Libin Bai, Anmol Gulati, Garrett Tanzer, Damien Vincent, Zhufeng Pan, Shibo Wang, et al. Gemini 1.5: Unlocking multimodal understanding across millions of tokens of context.arXiv preprint arXiv:2403.05530, 2024

  16. [16]

    gpt-oss-120b & gpt-oss-20b Model Card

    Sandhini Agarwal, Lama Ahmad, Jason Ai, Sam Altman, Andy Applebaum, Edwin Arbus, Rahul K Arora, Yu Bai, Bowen Baker, Haiming Bao, et al. gpt-oss-120b & gpt-oss-20b model card.arXiv preprint arXiv:2508.10925, 2025

  17. [17]

    DeepSeek-V3.2: Pushing the Frontier of Open Large Language Models

    Aixin Liu, Aoxue Mei, Bangcai Lin, Bing Xue, Bingxuan Wang, Bingzheng Xu, Bochao Wu, Bowei Zhang, Chaofan Lin, Chen Dong, et al. Deepseek-v3. 2: Pushing the frontier of open large language models.arXiv preprint arXiv:2512.02556, 2025. 18

  18. [18]

    Binyuan Hui, Jian Yang, Zeyu Cui, Jiaxi Yang, Dayiheng Liu, Lei Zhang, Tianyu Liu, Jia- jun Zhang, Bowen Yu, Keming Lu, et al. Qwen2. 5-coder technical report.arXiv preprint arXiv:2409.12186, 2024

  19. [19]

    Qwen2.5 technical report, 2025

    Qwen, :, An Yang, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chengyuan Li, Dayiheng Liu, Fei Huang, Haoran Wei, Huan Lin, Jian Yang, Jianhong Tu, Jianwei Zhang, Jianxin Yang, Jiaxi Yang, Jingren Zhou, Junyang Lin, Kai Dang, Keming Lu, Keqin Bao, Kexin Yang, Le Yu, Mei Li, Mingfeng Xue, Pei Zhang, Qin Zhu, Rui Men, Runji Lin, Tianhao Li,...

  20. [20]

    Qwen2.5-1m technical report, 2025

    An Yang, Bowen Yu, Chengyuan Li, Dayiheng Liu, Fei Huang, Haoyan Huang, Jiandong Jiang, Jianhong Tu, Jianwei Zhang, Jingren Zhou, Junyang Lin, Kai Dang, Kexin Yang, Le Yu, Mei Li, Minmin Sun, Qin Zhu, Rui Men, Tao He, Weijia Xu, Wenbiao Yin, Wenyuan Yu, Xiafei Qiu, Xingzhang Ren, Xinlong Yang, Yong Li, Zhiying Xu, and Zipeng Zhang. Qwen2.5-1m technical re...

  21. [21]

    Qwq-32b: Embracing the power of reinforcement learning, March 2025

    Qwen Team. Qwq-32b: Embracing the power of reinforcement learning, March 2025

  22. [22]

    Qwen3 Technical Report

    An Yang, Anfeng Li, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chang Gao, Chengen Huang, Chenxu Lv, et al. Qwen3 technical report.arXiv preprint arXiv:2505.09388, 2025

  23. [23]

    The Llama 3 Herd of Models

    Aaron Grattafiori, Abhimanyu Dubey, Abhinav Jauhri, Abhinav Pandey, Abhishek Kadian, Ahmad Al-Dahle, Aiesha Letman, Akhil Mathur, Alan Schelten, Alex Vaughan, et al. The llama 3 herd of models.arXiv preprint arXiv:2407.21783, 2024

  24. [24]

    Meta AI. Llama 4. https://ai.meta.com/blog/llama-4-multimodal-intelligence/, 2025. Accessed: 2025

  25. [25]

    Elsevier, 2004

    Malik Ghallab, Dana Nau, and Paolo Traverso.Automated Planning: theory and practice. Elsevier, 2004

  26. [26]

    A modern approach.Artificial Intelligence

    Stuart Russell, Peter Norvig, and Artificial Intelligence. A modern approach.Artificial Intelligence. Prentice-Hall, Egnlewood Cliffs, 25(27):79–80, 1995

  27. [27]

    Unified planning: Modeling, manipulating and solving ai planning problems in python

    Andrea Micheli, Arthur Bit-Monnot, Gabriele R¨ oger, Enrico Scala, Alessandro Valentini, Luca Framba, Alberto Rovetta, Alessandro Trapasso, Luigi Bonassi, Alfonso Emilio Gerevini, et al. Unified planning: Modeling, manipulating and solving ai planning problems in python. SoftwareX, 29:102012, 2025

  28. [28]

    Interval-based relaxation for general numeric planning

    Enrico Scala, Patrik Haslum, Sylvie Thi´ ebaux, and Miguel Ramirez. Interval-based relaxation for general numeric planning. 2016

  29. [29]

    Temporal planning with inter- mediate conditions and effects

    Alessandro Valentini, Andrea Micheli, and Alessandro Cimatti. Temporal planning with inter- mediate conditions and effects. InProceedings of the AAAI Conference on Artificial Intelligence, volume 34, pages 9975–9982, 2020

  30. [30]

    Virtualhome: Simulating household activities via programs

    Xavier Puig, Kevin Ra, Marko Boben, Jiaman Li, Tingwu Wang, Sanja Fidler, and Antonio Torralba. Virtualhome: Simulating household activities via programs. InProceedings of the IEEE conference on computer vision and pattern recognition, pages 8494–8502, 2018

  31. [31]

    Behavior-1k: A benchmark for embodied ai with 1,000 everyday activities and realistic simulation

    Chengshu Li, Ruohan Zhang, Josiah Wong, Cem Gokmen, Sanjana Srivastava, Roberto Mart´ ın- Mart´ ın, Chen Wang, Gabrael Levine, Michael Lingelbach, Jiankai Sun, et al. Behavior-1k: A benchmark for embodied ai with 1,000 everyday activities and realistic simulation. InConference on Robot Learning, pages 80–93. PMLR, 2023

  32. [32]

    Alfred: A benchmark for interpreting grounded instructions for everyday tasks

    Mohit Shridhar, Jesse Thomason, Daniel Gordon, Yonatan Bisk, Winson Han, Roozbeh Mot- taghi, Luke Zettlemoyer, and Dieter Fox. Alfred: A benchmark for interpreting grounded instructions for everyday tasks. InProceedings of the IEEE/CVF conference on computer vision 19 and pattern recognition, pages 10740–10749, 2020

  33. [33]

    Normbank: A knowledge bank of situational social norms

    Caleb Ziems, Jane Dwivedi-Yu, Yi-Chia Wang, Alon Halevy, and Diyi Yang. Normbank: A knowledge bank of situational social norms. InProceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 7756–7776, 2023

  34. [34]

    Consumer Product Safety Commission

    U.S. Consumer Product Safety Commission. National electronic injury surveillance system (neiss) injury data. https://www.cpsc.gov/cgibin/NEISSQuery/home.aspx, 2024. Accessed: 2025-09-29

  35. [35]

    DeepSeek-V2: A Strong, Economical, and Efficient Mixture-of-Experts Language Model

    Aixin Liu, Bei Feng, Bin Wang, Bingxuan Wang, Bo Liu, Chenggang Zhao, Chengqi Dengr, Chong Ruan, Damai Dai, Daya Guo, et al. Deepseek-v2: A strong, economical, and efficient mixture-of-experts language model.arXiv preprint arXiv:2405.04434, 2024

  36. [36]

    Language models as zero- shot planners: Extracting actionable knowledge for embodied agents

    Wenlong Huang, Pieter Abbeel, Deepak Pathak, and Igor Mordatch. Language models as zero- shot planners: Extracting actionable knowledge for embodied agents. InInternational conference on machine learning, pages 9118–9147. PMLR, 2022

  37. [37]

    ProgPrompt: Generating situated robot task plans using large language models,

    Ishika Singh, Valts Blukis, Arsalan Mousavian, Ankit Goyal, Danfei Xu, Jonathan Tremblay, Dieter Fox, Jesse Thomason, and Animesh Garg. Progprompt: Generating situated robot task plans using large language models.arXiv preprint arXiv:2209.11302, 2022

  38. [38]

    PaLM-E: An Embodied Multimodal Language Model

    Danny Driess, Fei Xia, Mehdi SM Sajjadi, Corey Lynch, Aakanksha Chowdhery, Brian Ichter, Ayzaan Wahid, Jonathan Tompson, Quan Vuong, Tianhe Yu, et al. Palm-e: An embodied multimodal language model.arXiv preprint arXiv:2303.03378, 2023

  39. [39]

    Rt-2: Vision-language-action models transfer web knowledge to robotic control

    Brianna Zitkovich, Tianhe Yu, Sichun Xu, Peng Xu, Ted Xiao, Fei Xia, Jialin Wu, Paul Wohlhart, Stefan Welker, Ayzaan Wahid, et al. Rt-2: Vision-language-action models transfer web knowledge to robotic control. InConference on Robot Learning, pages 2165–2183. PMLR, 2023

  40. [40]

    Inner Monologue: Embodied Reasoning through Planning with Language Models

    Wenlong Huang, Fei Xia, Ted Xiao, Harris Chan, Jacky Liang, Pete Florence, Andy Zeng, Jonathan Tompson, Igor Mordatch, Yevgen Chebotar, et al. Inner monologue: Embodied reasoning through planning with language models.arXiv preprint arXiv:2207.05608, 2022

  41. [41]

    Belkhale, T

    Suneel Belkhale, Tianli Ding, Ted Xiao, Pierre Sermanet, Quon Vuong, Jonathan Tompson, Yev- gen Chebotar, Debidatta Dwibedi, and Dorsa Sadigh. Rt-h: Action hierarchies using language. arXiv preprint arXiv:2403.01823, 2024

  42. [42]

    FAST: Efficient Action Tokenization for Vision-Language-Action Models

    Karl Pertsch, Kyle Stachowicz, Brian Ichter, Danny Driess, Suraj Nair, Quan Vuong, Oier Mees, Chelsea Finn, and Sergey Levine. Fast: Efficient action tokenization for vision-language-action models.arXiv preprint arXiv:2501.09747, 2025

  43. [43]

    GR00T N1: An Open Foundation Model for Generalist Humanoid Robots

    Johan Bjorck, Fernando Casta˜ neda, Nikita Cherniadev, Xingye Da, Runyu Ding, Linxi Fan, Yu Fang, Dieter Fox, Fengyuan Hu, Spencer Huang, et al. Gr00t n1: An open foundation model for generalist humanoid robots.arXiv preprint arXiv:2503.14734, 2025

  44. [44]

    Open x-embodiment: Robotic learning datasets and rt-x models: Open x-embodiment collaboration 0

    Abby O’Neill, Abdul Rehman, Abhiram Maddukuri, Abhishek Gupta, Abhishek Padalkar, Abra- ham Lee, Acorn Pooley, Agrim Gupta, Ajay Mandlekar, Ajinkya Jain, et al. Open x-embodiment: Robotic learning datasets and rt-x models: Open x-embodiment collaboration 0. In2024 IEEE International Conference on Robotics and Automation (ICRA), pages 6892–6903. IEEE, 2024

  45. [45]

    Distilling on-device language models for robot planning with minimal human intervention.arXiv preprint arXiv:2506.17486, 2025

    Zachary Ravichandran, Ignacio Hounie, Fernando Cladera, Alejandro Ribeiro, George J Pappas, and Vijay Kumar. Distilling on-device language models for robot planning with minimal human intervention.arXiv preprint arXiv:2506.17486, 2025

  46. [46]

    LLM+P: Empowering Large Language Models with Optimal Planning Proficiency

    Bo Liu, Yuqian Jiang, Xiaohan Zhang, Qiang Liu, Shiqi Zhang, Joydeep Biswas, and Peter Stone. Llm+ p: Empowering large language models with optimal planning proficiency.arXiv preprint arXiv:2304.11477, 2023

  47. [47]

    Lin Guan, Karthik Valmeekam, Sarath Sreedharan, and Subbarao Kambhampati. Leveraging pre-trained large language models to construct and utilize world models for model-based task 20 planning.Advances in Neural Information Processing Systems, 36:79081–79094, 2023

  48. [48]

    Don’t let your robot be harmful: Responsible robotic manipulation via safety-as-policy

    Minheng Ni, Lei Zhang, Zihan Chen, Kaixin Bai, Zhaopeng Chen, Jianwei Zhang, and Wangmeng Zuo. Don’t let your robot be harmful: Responsible robotic manipulation via safety-as-policy. IEEE Robotics and Automation Letters, 2025

  49. [49]

    Agentsafe: Benchmarking the safety of embodied agents on hazardous instructions

    Zonghao Ying, Le Wang, Yisong Xiao, Jiakai Wang, Yuqing Ma, Jinyang Guo, Zhenfei Yin, Mingchuan Zhang, Aishan Liu, and Xianglong Liu. Agentsafe: Benchmarking the safety of embodied agents on hazardous instructions.arXiv preprint arXiv:2506.14697, 2025

  50. [50]

    don’t forget to put the milk back!

    James F Mullen, Prasoon Goyal, Robinson Piramuthu, Michael Johnston, Dinesh Manocha, and Reza Ghanadan. “don’t forget to put the milk back!” dataset for enabling embodied agents to detect anomalous situations.IEEE Robotics and Automation Letters, 9(10):9087–9094, 2024

  51. [51]

    Control barrier functions: Theory and applications

    Aaron D Ames, Samuel Coogan, Magnus Egerstedt, Gennaro Notomista, Koushil Sreenath, and Paulo Tabuada. Control barrier functions: Theory and applications. In2019 18th European control conference (ECC), pages 3420–3431. Ieee, 2019

  52. [52]

    Embodied ai with two arms: Zero-shot learning, safety and modularity

    Jake Varley, Sumeet Singh, Deepali Jain, Krzysztof Choromanski, Andy Zeng, Somnath Basu Roy Chowdhury, Avinava Dubey, and Vikas Sindhwani. Embodied ai with two arms: Zero-shot learning, safety and modularity. In2024 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pages 3651–3657. IEEE, 2024

  53. [53]

    Jailbreaking llm-controlled robots

    Alexander Robey, Zachary Ravichandran, Vijay Kumar, Hamed Hassani, and George J Pappas. Jailbreaking llm-controlled robots. In2025 IEEE International Conference on Robotics and Automation (ICRA), pages 11948–11956. IEEE, 2025

  54. [54]

    Badrobot: Jailbreaking embodied llms in the physical world.arXiv preprint arXiv:2407.20242,

    Hangtao Zhang, Chenyu Zhu, Xianlong Wang, Ziqi Zhou, Changgan Yin, Minghui Li, Lulu Xue, Yichen Wang, Shengshan Hu, Aishan Liu, et al. Badrobot: Jailbreaking embodied llms in the physical world.arXiv preprint arXiv:2407.20242, 2024

  55. [55]

    Llm-driven robots risk enacting discrimination, violence, and unlawful actions.International Journal of Social Robotics, 17(11):2663–2711, 2025

    Andrew Hundt, Rumaisa Azeem, Masoumeh Mansouri, and Martim Brand˜ ao. Llm-driven robots risk enacting discrimination, violence, and unlawful actions.International Journal of Social Robotics, 17(11):2663–2711, 2025

  56. [56]

    Zico Kolter, Hamed Hassani, and George J

    Alexander Robey, Zachary Ravichandran, Eliot Krzysztof Jones, Jared Perlo, Fazl Barez, Vijay Kumar, J. Zico Kolter, Hamed Hassani, and George J. Pappas. Beyond alignment: Why robotic foundation models need context-aware safety.Science Robotics, 11(113):eaef2191, 2026

  57. [57]

    Safety guardrails for llm-enabled robots.IEEE Robotics and Automation Letters, 2026

    Zachary Ravichandran, Alexander Robey, Vijay Kumar, George J Pappas, and Hamed Hassani. Safety guardrails for llm-enabled robots.IEEE Robotics and Automation Letters, 2026

  58. [58]

    Contextual safety reasoning and grounding for open-world robots.arXiv preprint arXiv:2602.19983, 2026

    Zachary Ravichandran, David Snyder, Alexander Robey, Hamed Hassani, Vijay Kumar, and George J Pappas. Contextual safety reasoning and grounding for open-world robots.arXiv preprint arXiv:2602.19983, 2026

  59. [59]

    Norm- sage: Multi-lingual multi-cultural norm discovery from conversations on-the-fly

    Yi Fung, Tuhin Chakrabarty, Hao Guo, Owen Rambow, Smaranda Muresan, and Heng Ji. Norm- sage: Multi-lingual multi-cultural norm discovery from conversations on-the-fly. InProceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, pages 15217–15230, 2023

  60. [60]

    Egonormia: Benchmarking physical social norm understanding.arXiv preprint arXiv:2502.20490, 2025

    MohammadHossein Rezaei, Yicheng Fu, Phil Cuvin, Caleb Ziems, Yanzhe Zhang, Hao Zhu, and Diyi Yang. Egonormia: Benchmarking physical social norm understanding.arXiv preprint arXiv:2502.20490, 2025

  61. [61]

    Strips: A new approach to the application of theorem proving to problem solving.Artificial intelligence, 2(3-4):189–208, 1971

    Richard E Fikes and Nils J Nilsson. Strips: A new approach to the application of theorem proving to problem solving.Artificial intelligence, 2(3-4):189–208, 1971

  62. [62]

    The formal semantics of processes in pddl

    Drew McDermott. The formal semantics of processes in pddl. InProc. ICAPS Workshop on PDDL, pages 101–155. sn, 2003. 21

  63. [63]

    Maria Fox and Derek Long. Pddl2. 1: An extension to pddl for expressing temporal planning domains.Journal of artificial intelligence research, 20:61–124, 2003

  64. [64]

    The fast downward planning system.Journal of Artificial Intelligence Research, 26:191–246, 2006

    Malte Helmert. The fast downward planning system.Journal of Artificial Intelligence Research, 26:191–246, 2006

  65. [65]

    Elliot Gestrin, Marco Kuhlmann, and Jendrik Seipp. Nl2plan: Robust llm-driven planning from minimal text descriptions. arXiv preprint arXiv:2405.04215, 2024

  66. [66]

    Max Zuo, Francisco Piedrahita Velez, Xiaochen Li, Michael Littman, and Stephen Bach. Planetarium: A rigorous benchmark for translating text to structured planning languages. In Proceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers), p...

  67. [67]

    Karthik Valmeekam, Matthew Marquez, Alberto Olmo, Sarath Sreedharan, and Subbarao Kambhampati. Planbench: An extensible benchmark for evaluating large language models on planning and reasoning about change. Advances in Neural Information Processing Systems, 36:38975–38987, 2023

  68. [68]

    Karthik Valmeekam, Matthew Marquez, Sarath Sreedharan, and Subbarao Kambhampati. On the planning abilities of large language models-a critical investigation. Advances in Neural Information Processing Systems, 36:75993–76005, 2023

  69. [69]

    Argaman Mordoch, Enrico Scala, Roni Stern, and Brendan Juba. Safe learning of PDDL domains with conditional effects. In Proceedings of the International Conference on Automated Planning and Scheduling, volume 34, pages 387–395, 2024

  70. [70]

    Marcus Tantakoun, Christian Muise, and Xiaodan Zhu. Llms as planning formalizers: A survey for leveraging large language models to construct automated planning models. In Findings of the Association for Computational Linguistics: ACL 2025, pages 25167–25188, 2025

  71. [71]

    Zhigen Zhao, Shuo Cheng, Yan Ding, Ziyi Zhou, Shiqi Zhang, Danfei Xu, and Ye Zhao. A survey of optimization-based task and motion planning: From classical to learning approaches. IEEE/ASME Transactions on Mechatronics, 30(4):2799–2825, 2024

  72. [72]

    Mantas Mazeika, Long Phan, Xuwang Yin, Andy Zou, Zifan Wang, Norman Mu, Elham Sakhaee, Nathaniel Li, Steven Basart, Bo Li, et al. HarmBench: A standardized evaluation framework for automated red teaming and robust refusal. arXiv preprint arXiv:2402.04249, 2024

  73. [73]

    Boxin Wang, Weixin Chen, Hengzhi Pei, Chulin Xie, Mintong Kang, Chenhui Zhang, Chejian Xu, Zidi Xiong, Ritik Dutta, Rylan Schaeffer, et al. Decodingtrust: A comprehensive assessment of trustworthiness in GPT models. 2023

  74. [74]

    Thomas Hartvigsen, Saadia Gabriel, Hamid Palangi, Maarten Sap, Dipankar Ray, and Ece Kamar. Toxigen: A large-scale machine-generated dataset for adversarial and implicit hate speech detection. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 3309–3326, 2022

  75. [75]

    Stephanie Lin, Jacob Hilton, and Owain Evans. Truthfulqa: Measuring how models mimic human falsehoods. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 3214–3252, 2022

  76. [76]

    Lianmin Zheng, Wei-Lin Chiang, Ying Sheng, Siyuan Zhuang, Zhanghao Wu, Yonghao Zhuang, Zi Lin, Zhuohan Li, Dacheng Li, Eric Xing, et al. Judging llm-as-a-judge with mt-bench and chatbot arena. Advances in Neural Information Processing Systems, 36:46595–46623, 2023

  77. [77]

    Sanjana Srivastava, Chengshu Li, Michael Lingelbach, Roberto Martín-Martín, Fei Xia, Kent Elliott Vainio, Zheng Lian, Cem Gokmen, Shyamal Buch, Karen Liu, et al. Behavior: Benchmark for everyday household activities in virtual, interactive, and ecological environments. In Conference on Robot Learning, pages 477–490. PMLR, 2022

  78. [78]

    Aishwarya Padmakumar, Jesse Thomason, Ayush Shrivastava, Patrick Lange, Anjali Narayan-Chen, Spandana Gella, Robinson Piramuthu, Gokhan Tur, and Dilek Hakkani-Tur. Teach: Task-driven embodied agents that chat. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 36, pages 2017–2025, 2022

  79. [79]

    Yanming Wan, Jiayuan Mao, and Josh Tenenbaum. Handmethat: Human-robot communication in physical and social environments. Advances in Neural Information Processing Systems, 35:12014–12026, 2022

  80. [80]

    Manling Li, Shiyu Zhao, Qineng Wang, Kangrui Wang, Yu Zhou, Sanjana Srivastava, Cem Gokmen, Tony Lee, Erran Li Li, Ruohan Zhang, et al. Embodied agent interface: Benchmarking llms for embodied decision making. Advances in Neural Information Processing Systems, 37:100428–100534, 2024
