pith. machine review for the scientific record.

arxiv: 2604.18463 · v2 · submitted 2026-04-20 · 💻 cs.AI · cs.LG · cs.RO


Using large language models for embodied planning introduces systematic safety risks

Fan Shi, Jiajun Wu, Kaixian Qu, Manling Li, Marco Hutter, Tao Zhang, Zhibin Li


Pith reviewed 2026-05-10 04:42 UTC · model grok-4.3

classification 💻 cs.AI · cs.LG · cs.RO
keywords large language models · embodied AI · robotic planning · safety evaluation · benchmark · scaling laws · danger awareness · embodied planning

The pith

Large language models that plan well for robots still generate dangerous actions in over a quarter of tasks.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes that planning ability and safety awareness are distinct capacities in language models for embodied tasks. It introduces a large benchmark of over 12,000 tasks to measure both, finding that even the best planners produce unsafe plans frequently. Scaling up models boosts planning success rates dramatically but leaves safety awareness largely unchanged. This means larger models become safer overall only because they succeed at more tasks, not because they avoid dangers better. The result highlights a key limitation for using these models in real robotic systems where both success and safety matter.

Core claim

Even near-perfect planning ability does not ensure safety: the best-planning model fails to produce a valid plan on only 0.4% of tasks but produces dangerous plans on 28.3%. Among open-source models, planning ability improves with scale from 0.4% to 99.3% while safety awareness remains between 38% and 57%. Larger models complete more tasks safely primarily through improved planning rather than better danger avoidance. Proprietary reasoning models achieve higher safety awareness of 71-81%.
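The arithmetic behind this claim can be made concrete. A minimal sketch of the multiplicative relationship, assuming (hypothetically) a flat 45% safety awareness at both ends of the scaling range:

```python
# Illustrative sketch only: the multiplicative form is the paper's stated model;
# the 45% awareness figure is an assumed mid-range value, not per-model data.

def safe_completion(planning_success: float, safety_awareness: float) -> float:
    """Fraction of tasks completed safely under the multiplicative model."""
    return planning_success * safety_awareness

# Smallest open-source planner: planning is the bottleneck.
small = safe_completion(0.004, 0.45)   # roughly 0.2% of tasks done safely

# Frontier-scale open-source planner: planning near saturation, awareness flat.
large = safe_completion(0.993, 0.45)   # roughly 45% of tasks done safely

# With awareness held fixed, the entire gain comes from the planning term.
print(small, large, large / small)
```

Read this way, a model can look far safer end to end while its danger avoidance has not moved at all.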

What carries the argument

The DESPITE benchmark: 12,279 tasks spanning physical and normative dangers, with fully deterministic validation that separates measures of planning success from safety violations.
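What deterministic validation that separates the two measures could look like can be shown with a toy validator. This is a hypothetical sketch, not the DESPITE implementation (whose decision rules are not given here); it assumes plans are action lists, goals are predicates, and dangers are per-action labels:

```python
# Hypothetical sketch, not the DESPITE implementation: score a plan on two
# independent axes: does it reach the goal, and does it cross a danger label.

def evaluate_plan(plan, goal_check, danger_labels):
    """Return (valid, safe) for one plan under deterministic rules.

    plan: ordered list of action strings
    goal_check: predicate over the full action sequence
    danger_labels: set of actions annotated as physically or normatively dangerous
    """
    valid = goal_check(plan)
    safe = all(action not in danger_labels for action in plan)
    return valid, safe

# A plan that completes the task but crosses a danger label: valid yet unsafe.
plan = ["pick_up(knife)", "hand_to(child)", "place(knife, drawer)"]
valid, safe = evaluate_plan(
    plan,
    goal_check=lambda p: p[-1] == "place(knife, drawer)",
    danger_labels={"hand_to(child)"},
)
print(valid, safe)  # True, False: the two axes come apart
```

Scoring validity and safety independently per plan is what lets a 0.4% invalid-plan rate coexist with a 28.3% dangerous-plan rate.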

Load-bearing premise

The 12,279 tasks and their danger labels accurately represent the physical and normative risks that would arise in actual embodied robotic deployments.

What would settle it

Demonstrating a language model that achieves over 99% valid plans while keeping dangerous plans below 10% on the DESPITE benchmark would directly challenge the separation of planning and safety capacities.

read the original abstract

Large language models are increasingly used as planners for robotic systems, yet how safely they plan remains an open question. To evaluate safe planning systematically, we introduce DESPITE, a benchmark of 12,279 tasks spanning physical and normative dangers with fully deterministic validation. Across 23 models, even near-perfect planning ability does not ensure safety: the best-planning model fails to produce a valid plan on only 0.4% of tasks but produces dangerous plans on 28.3%. Among 18 open-source models from 3B to 671B parameters, planning ability improves substantially with scale (0.4-99.3%) while safety awareness remains relatively flat (38-57%). We identify a multiplicative relationship between these two capacities, showing that larger models complete more tasks safely primarily through improved planning, not through better danger avoidance. Three proprietary reasoning models reach notably higher safety awareness (71-81%), while non-reasoning proprietary models and open-source reasoning models remain below 57%. As planning ability approaches saturation for frontier models, improving safety awareness becomes a central challenge for deploying language-model planners in robotic systems.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper introduces DESPITE, a benchmark of 12,279 tasks spanning physical and normative dangers with deterministic validation, to assess safety in LLM-based embodied planning. Across 23 models, it reports that near-perfect planning (0.4% invalid plans for the best model) does not ensure safety (28.3% dangerous plans). Planning ability scales strongly with model size (0.4-99.3%) while safety awareness remains relatively flat (38-57%) for open-source models, with a multiplicative relationship between the two; proprietary reasoning models reach higher safety awareness (71-81%).

Significance. If the danger annotations accurately capture real embodied risks, the results establish that planning capability and safety awareness are partially decoupled in current LLMs, with scale primarily boosting the former. This provides a concrete, large-scale empirical basis for prioritizing safety-specific improvements in robotic planners and highlights gaps between open and proprietary reasoning models. The benchmark itself, with its scale and deterministic validation, is a useful resource for the community.

major comments (2)
  1. [§3] §3 (DESPITE Benchmark Construction): The manuscript states that tasks have 'fully deterministic validation' and span physical/normative dangers, but provides no explicit decision rules, classification examples, inter-annotator agreement, or grounding against actual robot failure modes. This is load-bearing for the central claim, as the 28.3% dangerous-plan rate (and the gap with 0.4% invalid plans) depends entirely on the validity of these labels; without them, sensitivity to labeling choices cannot be assessed.
  2. [§5] §5 (Scaling and Multiplicative Relationship): The claim that larger models complete more tasks safely 'primarily through improved planning, not through better danger avoidance' rests on a multiplicative relationship, yet the paper does not report the exact functional form, regression coefficients, or statistical controls used to establish it. This weakens the interpretation of the flat safety-awareness curve (38-57%) relative to the planning curve.
minor comments (2)
  1. [Table 2] Table 2 (Model Results): Include a column or footnote explicitly marking which of the 23 models are reasoning vs. non-reasoning and their exact parameter counts to support the scaling claims.
  2. [Abstract] Abstract: The safety-awareness range '38-57%' for open-source models should identify the specific models at each extreme for immediate interpretability.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their positive evaluation of the work's significance and for the constructive comments on benchmark transparency and the scaling analysis. We address each point below and have prepared revisions to strengthen the manuscript.

read point-by-point responses
  1. Referee: [§3] §3 (DESPITE Benchmark Construction): The manuscript states that tasks have 'fully deterministic validation' and span physical/normative dangers, but provides no explicit decision rules, classification examples, inter-annotator agreement, or grounding against actual robot failure modes. This is load-bearing for the central claim, as the 28.3% dangerous-plan rate (and the gap with 0.4% invalid plans) depends entirely on the validity of these labels; without them, sensitivity to labeling choices cannot be assessed.

    Authors: We agree that greater transparency on the annotation process is warranted to support the central claims. The current §3 describes the high-level categories and deterministic validation procedure, but does not include the requested explicit decision rules or examples. In the revised manuscript we will add a dedicated subsection with (i) the full decision rules used to classify physical versus normative dangers, (ii) multiple concrete classification examples per category, (iii) inter-annotator agreement statistics, and (iv) explicit grounding of the danger taxonomy against documented robot failure modes from the robotics literature. We will also include a brief sensitivity analysis demonstrating robustness of the 28.3% dangerous-plan rate to plausible labeling variations. These additions directly address the concern about label validity. revision: yes

  2. Referee: [§5] §5 (Scaling and Multiplicative Relationship): The claim that larger models complete more tasks safely 'primarily through improved planning, not through better danger avoidance' rests on a multiplicative relationship, yet the paper does not report the exact functional form, regression coefficients, or statistical controls used to establish it. This weakens the interpretation of the flat safety-awareness curve (38-57%) relative to the planning curve.

    Authors: We appreciate the request for greater precision. The manuscript states that safe completion equals planning success multiplied by conditional safety awareness, which produces the observed flat safety-awareness curve. However, the exact functional form, regression details, and controls are not reported in the main text or appendix. In the revision we will (i) state the functional form explicitly (safe_completion = planning_success × safety_awareness), (ii) report the regression coefficients and goodness-of-fit statistics, and (iii) describe the statistical controls for model family and size. These clarifications will strengthen the interpretation that scale primarily improves planning rather than danger avoidance. revision: yes
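The promised attribution can be sketched directly from that functional form. Pairing the reported extremes (0.4% planning with 38% awareness, 99.3% with 57%) is an assumption made here for illustration, not the paper's per-model data:

```python
# Under safe = planning × awareness, the log-change in safe completion between
# two models splits exactly into a planning term and an awareness term.
# The pairing of range endpoints below is an illustrative assumption.
import math

p_small, a_small = 0.004, 0.38   # low end of the reported open-source ranges
p_large, a_large = 0.993, 0.57   # high end of the reported open-source ranges

d_plan = math.log(p_large / p_small)    # contribution of better planning
d_aware = math.log(a_large / a_small)   # contribution of better awareness
share_planning = d_plan / (d_plan + d_aware)
print(f"planning accounts for {share_planning:.0%} of the log-gain")  # ~93%
```

Even granting the most favorable pairing for safety awareness, the planning term dominates the gain, which is the substance of the authors' "primarily through improved planning" claim.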

Circularity Check

0 steps flagged

No circularity: claims rest on direct empirical evaluation of models against an external benchmark

full rationale

The paper introduces the DESPITE benchmark of 12,279 tasks and reports measured planning success rates and safety awareness percentages across 23 models. Key results such as the 0.4% invalid-plan rate versus 28.3% dangerous-plan rate for the best model, the scaling trends with model size, and the multiplicative relationship between planning and safety are presented as direct observations from running the models on the benchmark with deterministic validation. No equations, derivations, fitted parameters, or predictions are claimed; there are no self-definitional loops, no inputs relabeled as outputs, and no load-bearing self-citations that reduce the central claims to prior author work. The analysis is therefore self-contained against the external benchmark and model evaluations.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 1 invented entity

The central claims depend on the assumption that the benchmark tasks and danger definitions are representative of real-world embodied risks; no free parameters or invented physical entities are introduced.

axioms (1)
  • domain assumption The defined tasks and danger categories accurately capture physical and normative risks relevant to robotic systems
    This assumption underpins the interpretation of the 28.3% dangerous-plan rate as a genuine safety problem.
invented entities (1)
  • DESPITE benchmark (no independent evidence)
    purpose: Provide a standardized, deterministic test suite for safe embodied planning
    Newly constructed for the study; no independent external validation of its coverage is provided in the abstract.

pith-pipeline@v0.9.0 · 5513 in / 1253 out tokens · 47491 ms · 2026-05-10T04:42:28.129243+00:00 · methodology

discussion (0)


Reference graph

Works this paper leans on

87 extracted references · 32 canonical work pages · 19 internal anchors

  1. [1]

    Vision-language- action models for robotics: A review towards real-world applications.IEEE Access, 2025

    Kento Kawaharazuka, Jihoon Oh, Jun Yamada, Ingmar Posner, and Yuke Zhu. Vision-language- action models for robotics: A review towards real-world applications.IEEE Access, 2025

  2. [2]

    Do as i can, not as i say: Grounding 17 language in robotic affordances

    Anthony Brohan, Yevgen Chebotar, Chelsea Finn, Karol Hausman, Alexander Herzog, Daniel Ho, Julian Ibarz, Alex Irpan, Eric Jang, Ryan Julian, et al. Do as i can, not as i say: Grounding 17 language in robotic affordances. InConference on robot learning, pages 287–318. PMLR, 2023

  3. [3]

    Code as policies: Language model programs for embodied control

    Jacky Liang, Wenlong Huang, Fei Xia, Peng Xu, Karol Hausman, Brian Ichter, Pete Florence, and Andy Zeng. Code as policies: Language model programs for embodied control. In2023 IEEE International conference on robotics and automation (ICRA), pages 9493–9500. IEEE, 2023

  4. [4]

    Generating robot constitutions & benchmarks for semantic safety.arXiv preprint arXiv:2503.08663, 2025

    Pierre Sermanet, Anirudha Majumdar, Alex Irpan, Dmitry Kalashnikov, and Vikas Sind- hwani. Generating robot constitutions & benchmarks for semantic safety.arXiv preprint arXiv:2503.08663, 2025

  5. [5]

    SafeAgentBench: A benchmark for safe task planning of embodied LLM agents

    Sheng Yin, Xianghe Pang, Yuanzhuo Ding, Menglan Chen, Yutong Bi, Yichen Xiong, Wenhao Huang, Zhen Xiang, Jing Shao, and Siheng Chen. Safeagentbench: A benchmark for safe task planning of embodied llm agents.arXiv preprint arXiv:2412.13178, 2024

  6. [6]

    A framework for benchmarking and aligning task-planning safety in llm-based embodied agents.arXiv preprintarXiv:2504.14650, 2025

    Yuting Huang, Leilei Ding, Zhipeng Tang, Tianfu Wang, Xinrui Lin, Wuyang Zhang, Mingxiao Ma, and Yanyong Zhang. A framework for benchmarking and aligning task-planning safety in llm-based embodied agents.arXiv preprint arXiv:2504.14650, 2025

  7. [7]

    Subtle risks, critical failures: A framework for diagnosing physical safety of llms for embodied decision making

    Yejin Son, Minseo Kim, Sungwoong Kim, Seungju Han, Jian Kim, Dongju Jang, Youngjae Yu, and Chan Young Park. Subtle risks, critical failures: A framework for diagnosing physical safety of llms for embodied decision making. InProceedings of the 2025 Conference on Empirical Methods in Natural Language Processing, pages 25703–25744, 2025

  8. [8]

    Is- bench: Evaluating interactive safety of vlm-driven embodied agents in daily household tasks.arXiv preprint arXiv:2506.16402, 2025

    Xiaoya Lu, Zeren Chen, Xuhao Hu, Yijin Zhou, Weichen Zhang, Dongrui Liu, Lu Sheng, and Jing Shao. Is-bench: Evaluating interactive safety of vlm-driven embodied agents in daily household tasks.arXiv preprint arXiv:2506.16402, 2025

  9. [9]

    Outrageously Large Neural Networks: The Sparsely-Gated Mixture-of-Experts Layer

    Noam Shazeer, Azalia Mirhoseini, Krzysztof Maziarz, Andy Davis, Quoc Le, Geoffrey Hinton, and Jeff Dean. Outrageously large neural networks: The sparsely-gated mixture-of-experts layer. arXiv preprint arXiv:1701.06538, 2017

  10. [10]

    routledge, 2013

    Jacob Cohen.Statistical power analysis for the behavioral sciences. routledge, 2013

  11. [11]

    Semantically safe robot manipulation: From semantic scene understanding to motion safeguards.IEEE Robotics and Automation Letters, 2025

    Lukas Brunke, Yanni Zhang, Ralf R¨ omer, Jack Naimer, Nikola Staykov, Siqi Zhou, and Angela P Schoellig. Semantically safe robot manipulation: From semantic scene understanding to motion safeguards.IEEE Robotics and Automation Letters, 2025

  12. [12]

    Constitutional AI: Harmlessness from AI Feedback

    Yuntao Bai, Saurav Kadavath, Sandipan Kundu, Amanda Askell, Jackson Kernion, Andy Jones, Anna Chen, Anna Goldie, Azalia Mirhoseini, Cameron McKinnon, et al. Constitutional ai: Harmlessness from ai feedback.arXiv preprint arXiv:2212.08073, 2022

  13. [13]

    GPT-4 Technical Report

    Josh Achiam, Steven Adler, Sandhini Agarwal, Lama Ahmad, Ilge Akkaya, Florencia Leoni Aleman, Diogo Almeida, Janko Altenschmidt, Sam Altman, Shyamal Anadkat, et al. Gpt-4 technical report.arXiv preprint arXiv:2303.08774, 2023

  14. [14]

    Gemini: A Family of Highly Capable Multimodal Models

    Gemini Team, Rohan Anil, Sebastian Borgeaud, Jean-Baptiste Alayrac, Jiahui Yu, Radu Soricut, Johan Schalkwyk, Andrew M Dai, Anja Hauth, Katie Millican, et al. Gemini: a family of highly capable multimodal models.arXiv preprint arXiv:2312.11805, 2023

  15. [15]

    Gemini 1.5: Unlocking multimodal understanding across millions of tokens of context

    Gemini Team, Petko Georgiev, Ving Ian Lei, Ryan Burnell, Libin Bai, Anmol Gulati, Garrett Tanzer, Damien Vincent, Zhufeng Pan, Shibo Wang, et al. Gemini 1.5: Unlocking multimodal understanding across millions of tokens of context.arXiv preprint arXiv:2403.05530, 2024

  16. [16]

    gpt-oss-120b & gpt-oss-20b Model Card

    Sandhini Agarwal, Lama Ahmad, Jason Ai, Sam Altman, Andy Applebaum, Edwin Arbus, Rahul K Arora, Yu Bai, Bowen Baker, Haiming Bao, et al. gpt-oss-120b & gpt-oss-20b model card.arXiv preprint arXiv:2508.10925, 2025

  17. [17]

    DeepSeek-V3.2: Pushing the Frontier of Open Large Language Models

    Aixin Liu, Aoxue Mei, Bangcai Lin, Bing Xue, Bingxuan Wang, Bingzheng Xu, Bochao Wu, Bowei Zhang, Chaofan Lin, Chen Dong, et al. Deepseek-v3. 2: Pushing the frontier of open large language models.arXiv preprint arXiv:2512.02556, 2025. 18

  18. [18]

    Binyuan Hui, Jian Yang, Zeyu Cui, Jiaxi Yang, Dayiheng Liu, Lei Zhang, Tianyu Liu, Jia- jun Zhang, Bowen Yu, Keming Lu, et al. Qwen2. 5-coder technical report.arXiv preprint arXiv:2409.12186, 2024

  19. [19]

    Qwen2.5 technical report, 2025

    Qwen, :, An Yang, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chengyuan Li, Dayiheng Liu, Fei Huang, Haoran Wei, Huan Lin, Jian Yang, Jianhong Tu, Jianwei Zhang, Jianxin Yang, Jiaxi Yang, Jingren Zhou, Junyang Lin, Kai Dang, Keming Lu, Keqin Bao, Kexin Yang, Le Yu, Mei Li, Mingfeng Xue, Pei Zhang, Qin Zhu, Rui Men, Runji Lin, Tianhao Li,...

  20. [20]

    Qwen2.5-1m technical report, 2025

    An Yang, Bowen Yu, Chengyuan Li, Dayiheng Liu, Fei Huang, Haoyan Huang, Jiandong Jiang, Jianhong Tu, Jianwei Zhang, Jingren Zhou, Junyang Lin, Kai Dang, Kexin Yang, Le Yu, Mei Li, Minmin Sun, Qin Zhu, Rui Men, Tao He, Weijia Xu, Wenbiao Yin, Wenyuan Yu, Xiafei Qiu, Xingzhang Ren, Xinlong Yang, Yong Li, Zhiying Xu, and Zipeng Zhang. Qwen2.5-1m technical re...

  21. [21]

    Qwq-32b: Embracing the power of reinforcement learning, March 2025

    Qwen Team. Qwq-32b: Embracing the power of reinforcement learning, March 2025

  22. [22]

    Qwen3 Technical Report

    An Yang, Anfeng Li, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chang Gao, Chengen Huang, Chenxu Lv, et al. Qwen3 technical report.arXiv preprint arXiv:2505.09388, 2025

  23. [23]

    The Llama 3 Herd of Models

    Aaron Grattafiori, Abhimanyu Dubey, Abhinav Jauhri, Abhinav Pandey, Abhishek Kadian, Ahmad Al-Dahle, Aiesha Letman, Akhil Mathur, Alan Schelten, Alex Vaughan, et al. The llama 3 herd of models.arXiv preprint arXiv:2407.21783, 2024

  24. [24]

    Meta AI. Llama 4. https://ai.meta.com/blog/llama-4-multimodal-intelligence/, 2025. Accessed: 2025

  25. [25]

    Elsevier, 2004

    Malik Ghallab, Dana Nau, and Paolo Traverso.Automated Planning: theory and practice. Elsevier, 2004

  26. [26]

    A modern approach.Artificial Intelligence

    Stuart Russell, Peter Norvig, and Artificial Intelligence. A modern approach.Artificial Intelligence. Prentice-Hall, Egnlewood Cliffs, 25(27):79–80, 1995

  27. [27]

    Unified planning: Modeling, manipulating and solving ai planning problems in python

    Andrea Micheli, Arthur Bit-Monnot, Gabriele R¨ oger, Enrico Scala, Alessandro Valentini, Luca Framba, Alberto Rovetta, Alessandro Trapasso, Luigi Bonassi, Alfonso Emilio Gerevini, et al. Unified planning: Modeling, manipulating and solving ai planning problems in python. SoftwareX, 29:102012, 2025

  28. [28]

    Interval-based relaxation for general numeric planning

    Enrico Scala, Patrik Haslum, Sylvie Thi´ ebaux, and Miguel Ramirez. Interval-based relaxation for general numeric planning. 2016

  29. [29]

    Temporal planning with inter- mediate conditions and effects

    Alessandro Valentini, Andrea Micheli, and Alessandro Cimatti. Temporal planning with inter- mediate conditions and effects. InProceedings of the AAAI Conference on Artificial Intelligence, volume 34, pages 9975–9982, 2020

  30. [30]

    Virtualhome: Simulating household activities via programs

    Xavier Puig, Kevin Ra, Marko Boben, Jiaman Li, Tingwu Wang, Sanja Fidler, and Antonio Torralba. Virtualhome: Simulating household activities via programs. InProceedings of the IEEE conference on computer vision and pattern recognition, pages 8494–8502, 2018

  31. [31]

    Behavior-1k: A benchmark for embodied ai with 1,000 everyday activities and realistic simulation

    Chengshu Li, Ruohan Zhang, Josiah Wong, Cem Gokmen, Sanjana Srivastava, Roberto Mart´ ın- Mart´ ın, Chen Wang, Gabrael Levine, Michael Lingelbach, Jiankai Sun, et al. Behavior-1k: A benchmark for embodied ai with 1,000 everyday activities and realistic simulation. InConference on Robot Learning, pages 80–93. PMLR, 2023

  32. [32]

    Alfred: A benchmark for interpreting grounded instructions for everyday tasks

    Mohit Shridhar, Jesse Thomason, Daniel Gordon, Yonatan Bisk, Winson Han, Roozbeh Mot- taghi, Luke Zettlemoyer, and Dieter Fox. Alfred: A benchmark for interpreting grounded instructions for everyday tasks. InProceedings of the IEEE/CVF conference on computer vision 19 and pattern recognition, pages 10740–10749, 2020

  33. [33]

    Normbank: A knowledge bank of situational social norms

    Caleb Ziems, Jane Dwivedi-Yu, Yi-Chia Wang, Alon Halevy, and Diyi Yang. Normbank: A knowledge bank of situational social norms. InProceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 7756–7776, 2023

  34. [34]

    Consumer Product Safety Commission

    U.S. Consumer Product Safety Commission. National electronic injury surveillance system (neiss) injury data. https://www.cpsc.gov/cgibin/NEISSQuery/home.aspx, 2024. Accessed: 2025-09-29

  35. [35]

    DeepSeek-V2: A Strong, Economical, and Efficient Mixture-of-Experts Language Model

    Aixin Liu, Bei Feng, Bin Wang, Bingxuan Wang, Bo Liu, Chenggang Zhao, Chengqi Dengr, Chong Ruan, Damai Dai, Daya Guo, et al. Deepseek-v2: A strong, economical, and efficient mixture-of-experts language model.arXiv preprint arXiv:2405.04434, 2024

  36. [36]

    Language models as zero- shot planners: Extracting actionable knowledge for embodied agents

    Wenlong Huang, Pieter Abbeel, Deepak Pathak, and Igor Mordatch. Language models as zero- shot planners: Extracting actionable knowledge for embodied agents. InInternational conference on machine learning, pages 9118–9147. PMLR, 2022

  37. [37]

    ProgPrompt: Generating situated robot task plans using large language models,

    Ishika Singh, Valts Blukis, Arsalan Mousavian, Ankit Goyal, Danfei Xu, Jonathan Tremblay, Dieter Fox, Jesse Thomason, and Animesh Garg. Progprompt: Generating situated robot task plans using large language models.arXiv preprint arXiv:2209.11302, 2022

  38. [38]

    PaLM-E: An Embodied Multimodal Language Model

    Danny Driess, Fei Xia, Mehdi SM Sajjadi, Corey Lynch, Aakanksha Chowdhery, Brian Ichter, Ayzaan Wahid, Jonathan Tompson, Quan Vuong, Tianhe Yu, et al. Palm-e: An embodied multimodal language model.arXiv preprint arXiv:2303.03378, 2023

  39. [39]

    Rt-2: Vision-language-action models transfer web knowledge to robotic control

    Brianna Zitkovich, Tianhe Yu, Sichun Xu, Peng Xu, Ted Xiao, Fei Xia, Jialin Wu, Paul Wohlhart, Stefan Welker, Ayzaan Wahid, et al. Rt-2: Vision-language-action models transfer web knowledge to robotic control. InConference on Robot Learning, pages 2165–2183. PMLR, 2023

  40. [40]

    Inner Monologue: Embodied Reasoning through Planning with Language Models

    Wenlong Huang, Fei Xia, Ted Xiao, Harris Chan, Jacky Liang, Pete Florence, Andy Zeng, Jonathan Tompson, Igor Mordatch, Yevgen Chebotar, et al. Inner monologue: Embodied reasoning through planning with language models.arXiv preprint arXiv:2207.05608, 2022

  41. [41]

    Belkhale, T

    Suneel Belkhale, Tianli Ding, Ted Xiao, Pierre Sermanet, Quon Vuong, Jonathan Tompson, Yev- gen Chebotar, Debidatta Dwibedi, and Dorsa Sadigh. Rt-h: Action hierarchies using language. arXiv preprint arXiv:2403.01823, 2024

  42. [42]

    FAST: Efficient Action Tokenization for Vision-Language-Action Models

    Karl Pertsch, Kyle Stachowicz, Brian Ichter, Danny Driess, Suraj Nair, Quan Vuong, Oier Mees, Chelsea Finn, and Sergey Levine. Fast: Efficient action tokenization for vision-language-action models.arXiv preprint arXiv:2501.09747, 2025

  43. [43]

    GR00T N1: An Open Foundation Model for Generalist Humanoid Robots

    Johan Bjorck, Fernando Casta˜ neda, Nikita Cherniadev, Xingye Da, Runyu Ding, Linxi Fan, Yu Fang, Dieter Fox, Fengyuan Hu, Spencer Huang, et al. Gr00t n1: An open foundation model for generalist humanoid robots.arXiv preprint arXiv:2503.14734, 2025

  44. [44]

    Open x-embodiment: Robotic learning datasets and rt-x models: Open x-embodiment collaboration 0

    Abby O’Neill, Abdul Rehman, Abhiram Maddukuri, Abhishek Gupta, Abhishek Padalkar, Abra- ham Lee, Acorn Pooley, Agrim Gupta, Ajay Mandlekar, Ajinkya Jain, et al. Open x-embodiment: Robotic learning datasets and rt-x models: Open x-embodiment collaboration 0. In2024 IEEE International Conference on Robotics and Automation (ICRA), pages 6892–6903. IEEE, 2024

  45. [45]

    Distilling on-device language models for robot planning with minimal human intervention.arXiv preprint arXiv:2506.17486, 2025

    Zachary Ravichandran, Ignacio Hounie, Fernando Cladera, Alejandro Ribeiro, George J Pappas, and Vijay Kumar. Distilling on-device language models for robot planning with minimal human intervention.arXiv preprint arXiv:2506.17486, 2025

  46. [46]

    LLM+P: Empowering Large Language Models with Optimal Planning Proficiency

    Bo Liu, Yuqian Jiang, Xiaohan Zhang, Qiang Liu, Shiqi Zhang, Joydeep Biswas, and Peter Stone. Llm+ p: Empowering large language models with optimal planning proficiency.arXiv preprint arXiv:2304.11477, 2023

  47. [47]

    Lin Guan, Karthik Valmeekam, Sarath Sreedharan, and Subbarao Kambhampati. Leveraging pre-trained large language models to construct and utilize world models for model-based task 20 planning.Advances in Neural Information Processing Systems, 36:79081–79094, 2023

  48. [48]

    Don’t let your robot be harmful: Responsible robotic manipulation via safety-as-policy

    Minheng Ni, Lei Zhang, Zihan Chen, Kaixin Bai, Zhaopeng Chen, Jianwei Zhang, and Wangmeng Zuo. Don’t let your robot be harmful: Responsible robotic manipulation via safety-as-policy. IEEE Robotics and Automation Letters, 2025

  49. [49]

    Agentsafe: Benchmarking the safety of embodied agents on hazardous instructions

    Zonghao Ying, Le Wang, Yisong Xiao, Jiakai Wang, Yuqing Ma, Jinyang Guo, Zhenfei Yin, Mingchuan Zhang, Aishan Liu, and Xianglong Liu. Agentsafe: Benchmarking the safety of embodied agents on hazardous instructions.arXiv preprint arXiv:2506.14697, 2025

  50. [50]

    don’t forget to put the milk back!

    James F Mullen, Prasoon Goyal, Robinson Piramuthu, Michael Johnston, Dinesh Manocha, and Reza Ghanadan. “don’t forget to put the milk back!” dataset for enabling embodied agents to detect anomalous situations.IEEE Robotics and Automation Letters, 9(10):9087–9094, 2024

  51. [51]

    Control barrier functions: Theory and applications

    Aaron D Ames, Samuel Coogan, Magnus Egerstedt, Gennaro Notomista, Koushil Sreenath, and Paulo Tabuada. Control barrier functions: Theory and applications. In2019 18th European control conference (ECC), pages 3420–3431. Ieee, 2019

  52. [52]

    Embodied ai with two arms: Zero-shot learning, safety and modularity

    Jake Varley, Sumeet Singh, Deepali Jain, Krzysztof Choromanski, Andy Zeng, Somnath Basu Roy Chowdhury, Avinava Dubey, and Vikas Sindhwani. Embodied ai with two arms: Zero-shot learning, safety and modularity. In2024 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pages 3651–3657. IEEE, 2024

  53. [53]

    Jailbreaking llm-controlled robots

    Alexander Robey, Zachary Ravichandran, Vijay Kumar, Hamed Hassani, and George J Pappas. Jailbreaking llm-controlled robots. In2025 IEEE International Conference on Robotics and Automation (ICRA), pages 11948–11956. IEEE, 2025

  54. [54]

    Badrobot: Jailbreaking embodied llms in the physical world.arXiv preprint arXiv:2407.20242,

    Hangtao Zhang, Chenyu Zhu, Xianlong Wang, Ziqi Zhou, Changgan Yin, Minghui Li, Lulu Xue, Yichen Wang, Shengshan Hu, Aishan Liu, et al. Badrobot: Jailbreaking embodied llms in the physical world.arXiv preprint arXiv:2407.20242, 2024

  55. [55]

    Llm-driven robots risk enacting discrimination, violence, and unlawful actions.International Journal of Social Robotics, 17(11):2663–2711, 2025

    Andrew Hundt, Rumaisa Azeem, Masoumeh Mansouri, and Martim Brand˜ ao. Llm-driven robots risk enacting discrimination, violence, and unlawful actions.International Journal of Social Robotics, 17(11):2663–2711, 2025

  56. [56]

    Zico Kolter, Hamed Hassani, and George J

    Alexander Robey, Zachary Ravichandran, Eliot Krzysztof Jones, Jared Perlo, Fazl Barez, Vijay Kumar, J. Zico Kolter, Hamed Hassani, and George J. Pappas. Beyond alignment: Why robotic foundation models need context-aware safety.Science Robotics, 11(113):eaef2191, 2026

  57. [57]

    Safety guardrails for llm-enabled robots.IEEE Robotics and Automation Letters, 2026

    Zachary Ravichandran, Alexander Robey, Vijay Kumar, George J Pappas, and Hamed Hassani. Safety guardrails for llm-enabled robots.IEEE Robotics and Automation Letters, 2026

  58. [58]

    Contextual safety reasoning and grounding for open-world robots.arXiv preprint arXiv:2602.19983, 2026

    Zachary Ravichandran, David Snyder, Alexander Robey, Hamed Hassani, Vijay Kumar, and George J Pappas. Contextual safety reasoning and grounding for open-world robots.arXiv preprint arXiv:2602.19983, 2026

  59. [59]

    Norm- sage: Multi-lingual multi-cultural norm discovery from conversations on-the-fly

    Yi Fung, Tuhin Chakrabarty, Hao Guo, Owen Rambow, Smaranda Muresan, and Heng Ji. Norm- sage: Multi-lingual multi-cultural norm discovery from conversations on-the-fly. InProceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, pages 15217–15230, 2023

  60. [60]

    Egonormia: Benchmarking physical social norm understanding.arXiv preprint arXiv:2502.20490, 2025

    MohammadHossein Rezaei, Yicheng Fu, Phil Cuvin, Caleb Ziems, Yanzhe Zhang, Hao Zhu, and Diyi Yang. Egonormia: Benchmarking physical social norm understanding.arXiv preprint arXiv:2502.20490, 2025

  61. [61]

    Strips: A new approach to the application of theorem proving to problem solving.Artificial intelligence, 2(3-4):189–208, 1971

    Richard E Fikes and Nils J Nilsson. Strips: A new approach to the application of theorem proving to problem solving.Artificial intelligence, 2(3-4):189–208, 1971

  62. [62]

    The formal semantics of processes in pddl

    Drew McDermott. The formal semantics of processes in pddl. InProc. ICAPS Workshop on PDDL, pages 101–155. sn, 2003. 21

  63. [63]

    Maria Fox and Derek Long. Pddl2. 1: An extension to pddl for expressing temporal planning domains.Journal of artificial intelligence research, 20:61–124, 2003

  64. [64]

    The fast downward planning system.Journal of Artificial Intelligence Research, 26:191–246, 2006

    Malte Helmert. The fast downward planning system.Journal of Artificial Intelligence Research, 26:191–246, 2006

  65. [65]

    Elliot Gestrin, Marco Kuhlmann, and Jendrik Seipp. Nl2plan: Robust llm-driven planning from minimal text descriptions. arXiv preprint arXiv:2405.04215, 2024

  66. [66]

    Max Zuo, Francisco Piedrahita Velez, Xiaochen Li, Michael Littman, and Stephen Bach. Planetarium: A rigorous benchmark for translating text to structured planning languages. In Proceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers), p...

  67. [67]

    Karthik Valmeekam, Matthew Marquez, Alberto Olmo, Sarath Sreedharan, and Subbarao Kambhampati. Planbench: An extensible benchmark for evaluating large language models on planning and reasoning about change. Advances in Neural Information Processing Systems, 36:38975–38987, 2023

  68. [68]

    Karthik Valmeekam, Matthew Marquez, Sarath Sreedharan, and Subbarao Kambhampati. On the planning abilities of large language models-a critical investigation. Advances in Neural Information Processing Systems, 36:75993–76005, 2023

  69. [69]

    Argaman Mordoch, Enrico Scala, Roni Stern, and Brendan Juba. Safe learning of PDDL domains with conditional effects. In Proceedings of the International Conference on Automated Planning and Scheduling, volume 34, pages 387–395, 2024

  70. [70]

    Marcus Tantakoun, Christian Muise, and Xiaodan Zhu. Llms as planning formalizers: A survey for leveraging large language models to construct automated planning models. In Findings of the Association for Computational Linguistics: ACL 2025, pages 25167–25188, 2025

  71. [71]

    Zhigen Zhao, Shuo Cheng, Yan Ding, Ziyi Zhou, Shiqi Zhang, Danfei Xu, and Ye Zhao. A survey of optimization-based task and motion planning: From classical to learning approaches. IEEE/ASME Transactions on Mechatronics, 30(4):2799–2825, 2024

  72. [72]

    Mantas Mazeika, Long Phan, Xuwang Yin, Andy Zou, Zifan Wang, Norman Mu, Elham Sakhaee, Nathaniel Li, Steven Basart, Bo Li, et al. HarmBench: A standardized evaluation framework for automated red teaming and robust refusal. arXiv preprint arXiv:2402.04249, 2024

  73. [73]

    Boxin Wang, Weixin Chen, Hengzhi Pei, Chulin Xie, Mintong Kang, Chenhui Zhang, Chejian Xu, Zidi Xiong, Ritik Dutta, Rylan Schaeffer, et al. Decodingtrust: A comprehensive assessment of trustworthiness in GPT models. 2023

  74. [74]

    Thomas Hartvigsen, Saadia Gabriel, Hamid Palangi, Maarten Sap, Dipankar Ray, and Ece Kamar. Toxigen: A large-scale machine-generated dataset for adversarial and implicit hate speech detection. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 3309–3326, 2022

  75. [75]

    Stephanie Lin, Jacob Hilton, and Owain Evans. Truthfulqa: Measuring how models mimic human falsehoods. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 3214–3252, 2022

  76. [76]

    Lianmin Zheng, Wei-Lin Chiang, Ying Sheng, Siyuan Zhuang, Zhanghao Wu, Yonghao Zhuang, Zi Lin, Zhuohan Li, Dacheng Li, Eric Xing, et al. Judging llm-as-a-judge with mt-bench and chatbot arena. Advances in Neural Information Processing Systems, 36:46595–46623, 2023

  77. [77]

    Sanjana Srivastava, Chengshu Li, Michael Lingelbach, Roberto Martín-Martín, Fei Xia, Kent Elliott Vainio, Zheng Lian, Cem Gokmen, Shyamal Buch, Karen Liu, et al. Behavior: Benchmark for everyday household activities in virtual, interactive, and ecological environments. In Conference on Robot Learning, pages 477–490. PMLR, 2022

  78. [78]

    Aishwarya Padmakumar, Jesse Thomason, Ayush Shrivastava, Patrick Lange, Anjali Narayan-Chen, Spandana Gella, Robinson Piramuthu, Gokhan Tur, and Dilek Hakkani-Tur. Teach: Task-driven embodied agents that chat. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 36, pages 2017–2025, 2022

  79. [79]

    Yanming Wan, Jiayuan Mao, and Josh Tenenbaum. Handmethat: Human-robot communication in physical and social environments. Advances in Neural Information Processing Systems, 35:12014–12026, 2022

  80. [80]

    Manling Li, Shiyu Zhao, Qineng Wang, Kangrui Wang, Yu Zhou, Sanjana Srivastava, Cem Gokmen, Tony Lee, Erran Li Li, Ruohan Zhang, et al. Embodied agent interface: Benchmarking llms for embodied decision making. Advances in Neural Information Processing Systems, 37:100428–100534, 2024
