Using large language models for embodied planning introduces systematic safety risks
Pith reviewed 2026-05-10 04:42 UTC · model grok-4.3
The pith
Large language models that plan well for robots still generate dangerous actions in over a quarter of tasks.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Even near-perfect planning ability does not ensure safety: the best-planning model fails to produce a valid plan on only 0.4% of tasks but produces dangerous plans on 28.3%. Among open-source models, planning ability improves with scale from 0.4% to 99.3% while safety awareness remains between 38% and 57%. Larger models complete more tasks safely primarily through improved planning rather than better danger avoidance. Proprietary reasoning models achieve higher safety awareness of 71-81%.
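To make the arithmetic behind these figures concrete, here is a minimal worked sketch of the decomposition, assuming dangerous plans are a subset of valid plans and that safety awareness is measured conditional on producing a valid plan; the paper's exact definitions may differ.

```python
# Hedged sketch: the decomposition implied by the abstract's numbers for the
# best-planning model. "safety_awareness" here means P(plan safe | plan valid);
# this conditioning is an assumption, not a definition taken from the paper.

invalid_rate = 0.004    # 0.4% of tasks with no valid plan
dangerous_rate = 0.283  # dangerous plans on 28.3% of tasks

planning_success = 1 - invalid_rate                        # 0.996
# If every dangerous plan is also a valid plan:
safety_awareness = 1 - dangerous_rate / planning_success   # ~0.716
safe_completion = planning_success * safety_awareness      # ~0.713

print(f"planning success: {planning_success:.1%}")
print(f"safety awareness: {safety_awareness:.1%}")
print(f"safe completion:  {safe_completion:.1%}")
```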
What carries the argument
The DESPITE benchmark of 12,279 tasks spanning physical and normative dangers with fully deterministic validation, which separates measures of planning success from measures of safety violation.
Load-bearing premise
The 12,279 tasks and their danger labels accurately represent the physical and normative risks that would arise in actual embodied robotic deployments.
What would settle it
Demonstrating a language model that achieves over 99% valid plans while keeping dangerous plans below 10% on the DESPITE benchmark would directly challenge the separation of planning and safety capacities.
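As an illustration of how sharp that criterion is, a check of this kind could encode it directly; the function name and result format below are hypothetical and restate the sentence above, not anything from the benchmark's code.

```python
# Hypothetical sketch of the falsification criterion stated above.

def challenges_separation(valid_plan_rate: float, dangerous_plan_rate: float) -> bool:
    """True if a model's DESPITE results would challenge the claimed
    decoupling of planning ability and safety awareness."""
    return valid_plan_rate > 0.99 and dangerous_plan_rate < 0.10

# The best-planning model reported in the abstract does not meet the bar:
print(challenges_separation(0.996, 0.283))  # False
```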
Original abstract
Large language models are increasingly used as planners for robotic systems, yet how safely they plan remains an open question. To evaluate safe planning systematically, we introduce DESPITE, a benchmark of 12,279 tasks spanning physical and normative dangers with fully deterministic validation. Across 23 models, even near-perfect planning ability does not ensure safety: the best-planning model fails to produce a valid plan on only 0.4% of tasks but produces dangerous plans on 28.3%. Among 18 open-source models from 3B to 671B parameters, planning ability improves substantially with scale (0.4-99.3%) while safety awareness remains relatively flat (38-57%). We identify a multiplicative relationship between these two capacities, showing that larger models complete more tasks safely primarily through improved planning, not through better danger avoidance. Three proprietary reasoning models reach notably higher safety awareness (71-81%), while non-reasoning proprietary models and open-source reasoning models remain below 57%. As planning ability approaches saturation for frontier models, improving safety awareness becomes a central challenge for deploying language-model planners in robotic systems.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces DESPITE, a benchmark of 12,279 tasks spanning physical and normative dangers with deterministic validation, to assess safety in LLM-based embodied planning. Across 23 models, it reports that near-perfect planning (0.4% invalid plans for the best model) does not ensure safety (28.3% dangerous plans). Planning ability scales strongly with model size (0.4-99.3%) while safety awareness remains relatively flat (38-57%) for open-source models, with a multiplicative relationship between the two; proprietary reasoning models reach higher safety awareness (71-81%).
Significance. If the danger annotations accurately capture real embodied risks, the results establish that planning capability and safety awareness are partially decoupled in current LLMs, with scale primarily boosting the former. This provides a concrete, large-scale empirical basis for prioritizing safety-specific improvements in robotic planners and highlights gaps between open and proprietary reasoning models. The benchmark itself, with its scale and deterministic validation, is a useful resource for the community.
Major comments (2)
- [§3] §3 (DESPITE Benchmark Construction): The manuscript states that tasks have 'fully deterministic validation' and span physical/normative dangers, but provides no explicit decision rules, classification examples, inter-annotator agreement, or grounding against actual robot failure modes. This is load-bearing for the central claim, as the 28.3% dangerous-plan rate (and the gap with 0.4% invalid plans) depends entirely on the validity of these labels; without them, sensitivity to labeling choices cannot be assessed.
- [§5] §5 (Scaling and Multiplicative Relationship): The claim that larger models complete more tasks safely 'primarily through improved planning, not through better danger avoidance' rests on a multiplicative relationship, yet the paper does not report the exact functional form, regression coefficients, or statistical controls used to establish it. This weakens the interpretation of the flat safety-awareness curve (38-57%) relative to the planning curve.
Minor comments (2)
- [Table 2] Table 2 (Model Results): Include a column or footnote explicitly marking which of the 23 models are reasoning vs. non-reasoning and their exact parameter counts to support the scaling claims.
- [Abstract] Abstract: The safety-awareness range '38-57%' for open-source models should identify the specific models at each extreme for immediate interpretability.
Simulated Author's Rebuttal
We thank the referee for their positive evaluation of the work's significance and for the constructive comments on benchmark transparency and the scaling analysis. We address each point below and have prepared revisions to strengthen the manuscript.
Point-by-point responses
- Referee: [§3] §3 (DESPITE Benchmark Construction): The manuscript states that tasks have 'fully deterministic validation' and span physical/normative dangers, but provides no explicit decision rules, classification examples, inter-annotator agreement, or grounding against actual robot failure modes. This is load-bearing for the central claim, as the 28.3% dangerous-plan rate (and the gap with 0.4% invalid plans) depends entirely on the validity of these labels; without them, sensitivity to labeling choices cannot be assessed.
Authors: We agree that greater transparency on the annotation process is warranted to support the central claims. The current §3 describes the high-level categories and deterministic validation procedure, but does not include the requested explicit decision rules or examples. In the revised manuscript we will add a dedicated subsection with (i) the full decision rules used to classify physical versus normative dangers, (ii) multiple concrete classification examples per category, (iii) inter-annotator agreement statistics, and (iv) explicit grounding of the danger taxonomy against documented robot failure modes from the robotics literature. We will also include a brief sensitivity analysis demonstrating robustness of the 28.3% dangerous-plan rate to plausible labeling variations. These additions directly address the concern about label validity. revision: yes
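A hedged sketch of what the promised sensitivity analysis could look like: flip a small fraction of danger labels at random and observe how the dangerous-plan rate moves. The labels below are synthetic, since DESPITE's actual annotations are not reproduced in this review.

```python
# Synthetic label-perturbation sketch; not the authors' actual analysis.
import random

random.seed(0)
n_tasks = 12279
labels = [random.random() < 0.283 for _ in range(n_tasks)]  # synthetic danger labels

def perturbed_rate(labels, flip_prob):
    """Flip each label with probability flip_prob and return the new rate."""
    flipped = [(not y) if random.random() < flip_prob else y for y in labels]
    return sum(flipped) / len(flipped)

for p in (0.01, 0.05, 0.10):
    print(f"flip {p:.0%} of labels -> dangerous rate {perturbed_rate(labels, p):.1%}")
```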
- Referee: [§5] §5 (Scaling and Multiplicative Relationship): The claim that larger models complete more tasks safely 'primarily through improved planning, not through better danger avoidance' rests on a multiplicative relationship, yet the paper does not report the exact functional form, regression coefficients, or statistical controls used to establish it. This weakens the interpretation of the flat safety-awareness curve (38-57%) relative to the planning curve.
Authors: We appreciate the request for greater precision. The manuscript states that safe completion equals planning success multiplied by conditional safety awareness, which produces the observed flat safety-awareness curve. However, the exact functional form, regression details, and controls are not reported in the main text or appendix. In the revision we will (i) state the functional form explicitly (safe_completion = planning_success × safety_awareness), (ii) report the regression coefficients and goodness-of-fit statistics, and (iii) describe the statistical controls for model family and size. These clarifications will strengthen the interpretation that scale primarily improves planning rather than danger avoidance. revision: yes
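A hedged sketch of how the stated functional form could be checked: on a log scale the multiplicative relationship becomes additive, so regressing log safe-completion on log planning success and log safety awareness should recover exponents near one. All per-model numbers below are synthetic placeholders, not the paper's data.

```python
# Synthetic check of the multiplicative form safe = planning * awareness.
import numpy as np

rng = np.random.default_rng(0)
planning = rng.uniform(0.40, 0.99, size=18)   # placeholder planning success rates
awareness = rng.uniform(0.38, 0.57, size=18)  # placeholder safety awareness rates
safe = planning * awareness * np.exp(rng.normal(0, 0.02, size=18))  # small noise

# log(safe) = a*log(planning) + b*log(awareness); multiplicativity => a ≈ b ≈ 1.
X = np.column_stack([np.log(planning), np.log(awareness)])
coef, *_ = np.linalg.lstsq(X, np.log(safe), rcond=None)
print("fitted exponents:", coef.round(2))
```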
Circularity Check
No circularity: claims rest on direct empirical evaluation of models against an external benchmark
Full rationale
The paper introduces the DESPITE benchmark of 12,279 tasks and reports measured planning success rates and safety awareness percentages across 23 models. Key results such as the 0.4% invalid-plan rate versus 28.3% dangerous-plan rate for the best model, the scaling trends with model size, and the multiplicative relationship between planning and safety are presented as direct observations from running the models on the benchmark with deterministic validation. No equations, derivations, fitted parameters, or predictions are claimed; there are no self-definitional loops, no inputs relabeled as outputs, and no load-bearing self-citations that reduce the central claims to prior author work. The analysis is therefore self-contained against the external benchmark and model evaluations.
Axiom & Free-Parameter Ledger
Axioms (1)
- Domain assumption: The defined tasks and danger categories accurately capture the physical and normative risks relevant to robotic systems.
Invented entities (1)
- DESPITE benchmark: no independent evidence.
Reference graph
Works this paper leans on
- [1] Kento Kawaharazuka, Jihoon Oh, Jun Yamada, Ingmar Posner, and Yuke Zhu. Vision-language-action models for robotics: A review towards real-world applications. IEEE Access, 2025.
- [2] Anthony Brohan, Yevgen Chebotar, Chelsea Finn, Karol Hausman, Alexander Herzog, Daniel Ho, Julian Ibarz, Alex Irpan, Eric Jang, Ryan Julian, et al. Do as I can, not as I say: Grounding language in robotic affordances. In Conference on Robot Learning, pages 287–318. PMLR, 2023.
- [3] Jacky Liang, Wenlong Huang, Fei Xia, Peng Xu, Karol Hausman, Brian Ichter, Pete Florence, and Andy Zeng. Code as policies: Language model programs for embodied control. In 2023 IEEE International Conference on Robotics and Automation (ICRA), pages 9493–9500. IEEE, 2023.
- [4] Pierre Sermanet, Anirudha Majumdar, Alex Irpan, Dmitry Kalashnikov, and Vikas Sindhwani. Generating robot constitutions & benchmarks for semantic safety. arXiv preprint arXiv:2503.08663, 2025.
- [5] Sheng Yin, Xianghe Pang, Yuanzhuo Ding, Menglan Chen, Yutong Bi, Yichen Xiong, Wenhao Huang, Zhen Xiang, Jing Shao, and Siheng Chen. SafeAgentBench: A benchmark for safe task planning of embodied LLM agents. arXiv preprint arXiv:2412.13178, 2024.
- [6] Yuting Huang, Leilei Ding, Zhipeng Tang, Tianfu Wang, Xinrui Lin, Wuyang Zhang, Mingxiao Ma, and Yanyong Zhang. A framework for benchmarking and aligning task-planning safety in LLM-based embodied agents. arXiv preprint arXiv:2504.14650, 2025.
- [7] Yejin Son, Minseo Kim, Sungwoong Kim, Seungju Han, Jian Kim, Dongju Jang, Youngjae Yu, and Chan Young Park. Subtle risks, critical failures: A framework for diagnosing physical safety of LLMs for embodied decision making. In Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing, pages 25703–25744, 2025.
- [8] Xiaoya Lu, Zeren Chen, Xuhao Hu, Yijin Zhou, Weichen Zhang, Dongrui Liu, Lu Sheng, and Jing Shao. IS-Bench: Evaluating interactive safety of VLM-driven embodied agents in daily household tasks. arXiv preprint arXiv:2506.16402, 2025.
- [9] Noam Shazeer, Azalia Mirhoseini, Krzysztof Maziarz, Andy Davis, Quoc Le, Geoffrey Hinton, and Jeff Dean. Outrageously large neural networks: The sparsely-gated mixture-of-experts layer. arXiv preprint arXiv:1701.06538, 2017.
- [10] Jacob Cohen. Statistical Power Analysis for the Behavioral Sciences. Routledge, 2013.
- [11] Lukas Brunke, Yanni Zhang, Ralf Römer, Jack Naimer, Nikola Staykov, Siqi Zhou, and Angela P Schoellig. Semantically safe robot manipulation: From semantic scene understanding to motion safeguards. IEEE Robotics and Automation Letters, 2025.
- [12] Yuntao Bai, Saurav Kadavath, Sandipan Kundu, Amanda Askell, Jackson Kernion, Andy Jones, Anna Chen, Anna Goldie, Azalia Mirhoseini, Cameron McKinnon, et al. Constitutional AI: Harmlessness from AI feedback. arXiv preprint arXiv:2212.08073, 2022.
- [13] Josh Achiam, Steven Adler, Sandhini Agarwal, Lama Ahmad, Ilge Akkaya, Florencia Leoni Aleman, Diogo Almeida, Janko Altenschmidt, Sam Altman, Shyamal Anadkat, et al. GPT-4 technical report. arXiv preprint arXiv:2303.08774, 2023.
- [14] Gemini Team, Rohan Anil, Sebastian Borgeaud, Jean-Baptiste Alayrac, Jiahui Yu, Radu Soricut, Johan Schalkwyk, Andrew M Dai, Anja Hauth, Katie Millican, et al. Gemini: A family of highly capable multimodal models. arXiv preprint arXiv:2312.11805, 2023.
- [15] Gemini Team, Petko Georgiev, Ving Ian Lei, Ryan Burnell, Libin Bai, Anmol Gulati, Garrett Tanzer, Damien Vincent, Zhufeng Pan, Shibo Wang, et al. Gemini 1.5: Unlocking multimodal understanding across millions of tokens of context. arXiv preprint arXiv:2403.05530, 2024.
- [16] Sandhini Agarwal, Lama Ahmad, Jason Ai, Sam Altman, Andy Applebaum, Edwin Arbus, Rahul K Arora, Yu Bai, Bowen Baker, Haiming Bao, et al. gpt-oss-120b & gpt-oss-20b model card. arXiv preprint arXiv:2508.10925, 2025.
- [17] Aixin Liu, Aoxue Mei, Bangcai Lin, Bing Xue, Bingxuan Wang, Bingzheng Xu, Bochao Wu, Bowei Zhang, Chaofan Lin, Chen Dong, et al. DeepSeek-V3.2: Pushing the frontier of open large language models. arXiv preprint arXiv:2512.02556, 2025.
- [18] Binyuan Hui, Jian Yang, Zeyu Cui, Jiaxi Yang, Dayiheng Liu, Lei Zhang, Tianyu Liu, Jiajun Zhang, Bowen Yu, Keming Lu, et al. Qwen2.5-Coder technical report. arXiv preprint arXiv:2409.12186, 2024.
- [19] An Yang, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chengyuan Li, Dayiheng Liu, Fei Huang, Haoran Wei, et al. Qwen2.5 technical report, 2025.
- [20] An Yang, Bowen Yu, Chengyuan Li, Dayiheng Liu, Fei Huang, Haoyan Huang, Jiandong Jiang, Jianhong Tu, Jianwei Zhang, Jingren Zhou, et al. Qwen2.5-1M technical report, 2025.
- [21] Qwen Team. QwQ-32B: Embracing the power of reinforcement learning, March 2025.
- [22] An Yang, Anfeng Li, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chang Gao, Chengen Huang, Chenxu Lv, et al. Qwen3 technical report. arXiv preprint arXiv:2505.09388, 2025.
- [23] Aaron Grattafiori, Abhimanyu Dubey, Abhinav Jauhri, Abhinav Pandey, Abhishek Kadian, Ahmad Al-Dahle, Aiesha Letman, Akhil Mathur, Alan Schelten, Alex Vaughan, et al. The Llama 3 herd of models. arXiv preprint arXiv:2407.21783, 2024.
- [24] Meta AI. Llama 4. https://ai.meta.com/blog/llama-4-multimodal-intelligence/, 2025. Accessed: 2025.
- [25] Malik Ghallab, Dana Nau, and Paolo Traverso. Automated Planning: Theory and Practice. Elsevier, 2004.
- [26] Stuart Russell and Peter Norvig. Artificial Intelligence: A Modern Approach. Prentice-Hall, Englewood Cliffs, 1995.
- [27] Andrea Micheli, Arthur Bit-Monnot, Gabriele Röger, Enrico Scala, Alessandro Valentini, Luca Framba, Alberto Rovetta, Alessandro Trapasso, Luigi Bonassi, Alfonso Emilio Gerevini, et al. Unified planning: Modeling, manipulating and solving AI planning problems in Python. SoftwareX, 29:102012, 2025.
- [28] Enrico Scala, Patrik Haslum, Sylvie Thiébaux, and Miguel Ramirez. Interval-based relaxation for general numeric planning. 2016.
- [29] Alessandro Valentini, Andrea Micheli, and Alessandro Cimatti. Temporal planning with intermediate conditions and effects. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 34, pages 9975–9982, 2020.
- [30] Xavier Puig, Kevin Ra, Marko Boben, Jiaman Li, Tingwu Wang, Sanja Fidler, and Antonio Torralba. VirtualHome: Simulating household activities via programs. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 8494–8502, 2018.
- [31] Chengshu Li, Ruohan Zhang, Josiah Wong, Cem Gokmen, Sanjana Srivastava, Roberto Martín-Martín, Chen Wang, Gabrael Levine, Michael Lingelbach, Jiankai Sun, et al. BEHAVIOR-1K: A benchmark for embodied AI with 1,000 everyday activities and realistic simulation. In Conference on Robot Learning, pages 80–93. PMLR, 2023.
- [32] Mohit Shridhar, Jesse Thomason, Daniel Gordon, Yonatan Bisk, Winson Han, Roozbeh Mottaghi, Luke Zettlemoyer, and Dieter Fox. ALFRED: A benchmark for interpreting grounded instructions for everyday tasks. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 10740–10749, 2020.
- [33] Caleb Ziems, Jane Dwivedi-Yu, Yi-Chia Wang, Alon Halevy, and Diyi Yang. NormBank: A knowledge bank of situational social norms. In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 7756–7776, 2023.
- [34] U.S. Consumer Product Safety Commission. National Electronic Injury Surveillance System (NEISS) injury data. https://www.cpsc.gov/cgibin/NEISSQuery/home.aspx, 2024. Accessed: 2025-09-29.
- [35] Aixin Liu, Bei Feng, Bin Wang, Bingxuan Wang, Bo Liu, Chenggang Zhao, Chengqi Deng, Chong Ruan, Damai Dai, Daya Guo, et al. DeepSeek-V2: A strong, economical, and efficient mixture-of-experts language model. arXiv preprint arXiv:2405.04434, 2024.
- [36] Wenlong Huang, Pieter Abbeel, Deepak Pathak, and Igor Mordatch. Language models as zero-shot planners: Extracting actionable knowledge for embodied agents. In International Conference on Machine Learning, pages 9118–9147. PMLR, 2022.
- [37] Ishika Singh, Valts Blukis, Arsalan Mousavian, Ankit Goyal, Danfei Xu, Jonathan Tremblay, Dieter Fox, Jesse Thomason, and Animesh Garg. ProgPrompt: Generating situated robot task plans using large language models. arXiv preprint arXiv:2209.11302, 2022.
- [38] Danny Driess, Fei Xia, Mehdi SM Sajjadi, Corey Lynch, Aakanksha Chowdhery, Brian Ichter, Ayzaan Wahid, Jonathan Tompson, Quan Vuong, Tianhe Yu, et al. PaLM-E: An embodied multimodal language model. arXiv preprint arXiv:2303.03378, 2023.
- [39] Brianna Zitkovich, Tianhe Yu, Sichun Xu, Peng Xu, Ted Xiao, Fei Xia, Jialin Wu, Paul Wohlhart, Stefan Welker, Ayzaan Wahid, et al. RT-2: Vision-language-action models transfer web knowledge to robotic control. In Conference on Robot Learning, pages 2165–2183. PMLR, 2023.
- [40] Wenlong Huang, Fei Xia, Ted Xiao, Harris Chan, Jacky Liang, Pete Florence, Andy Zeng, Jonathan Tompson, Igor Mordatch, Yevgen Chebotar, et al. Inner Monologue: Embodied reasoning through planning with language models. arXiv preprint arXiv:2207.05608, 2022.
- [41] Suneel Belkhale, Tianli Ding, Ted Xiao, Pierre Sermanet, Quan Vuong, Jonathan Tompson, Yevgen Chebotar, Debidatta Dwibedi, and Dorsa Sadigh. RT-H: Action hierarchies using language. arXiv preprint arXiv:2403.01823, 2024.
- [42] Karl Pertsch, Kyle Stachowicz, Brian Ichter, Danny Driess, Suraj Nair, Quan Vuong, Oier Mees, Chelsea Finn, and Sergey Levine. FAST: Efficient action tokenization for vision-language-action models. arXiv preprint arXiv:2501.09747, 2025.
- [43] Johan Bjorck, Fernando Castañeda, Nikita Cherniadev, Xingye Da, Runyu Ding, Linxi Fan, Yu Fang, Dieter Fox, Fengyuan Hu, Spencer Huang, et al. GR00T N1: An open foundation model for generalist humanoid robots. arXiv preprint arXiv:2503.14734, 2025.
- [44] Abby O'Neill, Abdul Rehman, Abhiram Maddukuri, Abhishek Gupta, Abhishek Padalkar, Abraham Lee, Acorn Pooley, Agrim Gupta, Ajay Mandlekar, Ajinkya Jain, et al. Open X-Embodiment: Robotic learning datasets and RT-X models. In 2024 IEEE International Conference on Robotics and Automation (ICRA), pages 6892–6903. IEEE, 2024.
- [45] Zachary Ravichandran, Ignacio Hounie, Fernando Cladera, Alejandro Ribeiro, George J Pappas, and Vijay Kumar. Distilling on-device language models for robot planning with minimal human intervention. arXiv preprint arXiv:2506.17486, 2025.
- [46] Bo Liu, Yuqian Jiang, Xiaohan Zhang, Qiang Liu, Shiqi Zhang, Joydeep Biswas, and Peter Stone. LLM+P: Empowering large language models with optimal planning proficiency. arXiv preprint arXiv:2304.11477, 2023.
- [47] Lin Guan, Karthik Valmeekam, Sarath Sreedharan, and Subbarao Kambhampati. Leveraging pre-trained large language models to construct and utilize world models for model-based task planning. Advances in Neural Information Processing Systems, 36:79081–79094, 2023.
- [48] Minheng Ni, Lei Zhang, Zihan Chen, Kaixin Bai, Zhaopeng Chen, Jianwei Zhang, and Wangmeng Zuo. Don't let your robot be harmful: Responsible robotic manipulation via safety-as-policy. IEEE Robotics and Automation Letters, 2025.
- [49] Zonghao Ying, Le Wang, Yisong Xiao, Jiakai Wang, Yuqing Ma, Jinyang Guo, Zhenfei Yin, Mingchuan Zhang, Aishan Liu, and Xianglong Liu. AgentSafe: Benchmarking the safety of embodied agents on hazardous instructions. arXiv preprint arXiv:2506.14697, 2025.
- [50] James F Mullen, Prasoon Goyal, Robinson Piramuthu, Michael Johnston, Dinesh Manocha, and Reza Ghanadan. "Don't forget to put the milk back!" Dataset for enabling embodied agents to detect anomalous situations. IEEE Robotics and Automation Letters, 9(10):9087–9094, 2024.
- [51] Aaron D Ames, Samuel Coogan, Magnus Egerstedt, Gennaro Notomista, Koushil Sreenath, and Paulo Tabuada. Control barrier functions: Theory and applications. In 2019 18th European Control Conference (ECC), pages 3420–3431. IEEE, 2019.
- [52] Jake Varley, Sumeet Singh, Deepali Jain, Krzysztof Choromanski, Andy Zeng, Somnath Basu Roy Chowdhury, Avinava Dubey, and Vikas Sindhwani. Embodied AI with two arms: Zero-shot learning, safety and modularity. In 2024 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pages 3651–3657. IEEE, 2024.
- [53] Alexander Robey, Zachary Ravichandran, Vijay Kumar, Hamed Hassani, and George J Pappas. Jailbreaking LLM-controlled robots. In 2025 IEEE International Conference on Robotics and Automation (ICRA), pages 11948–11956. IEEE, 2025.
- [54] Hangtao Zhang, Chenyu Zhu, Xianlong Wang, Ziqi Zhou, Changgan Yin, Minghui Li, Lulu Xue, Yichen Wang, Shengshan Hu, Aishan Liu, et al. BadRobot: Jailbreaking embodied LLMs in the physical world. arXiv preprint arXiv:2407.20242, 2024.
- [55] Andrew Hundt, Rumaisa Azeem, Masoumeh Mansouri, and Martim Brandão. LLM-driven robots risk enacting discrimination, violence, and unlawful actions. International Journal of Social Robotics, 17(11):2663–2711, 2025.
- [56] Alexander Robey, Zachary Ravichandran, Eliot Krzysztof Jones, Jared Perlo, Fazl Barez, Vijay Kumar, J. Zico Kolter, Hamed Hassani, and George J. Pappas. Beyond alignment: Why robotic foundation models need context-aware safety. Science Robotics, 11(113):eaef2191, 2026.
- [57] Zachary Ravichandran, Alexander Robey, Vijay Kumar, George J Pappas, and Hamed Hassani. Safety guardrails for LLM-enabled robots. IEEE Robotics and Automation Letters, 2026.
- [58] Zachary Ravichandran, David Snyder, Alexander Robey, Hamed Hassani, Vijay Kumar, and George J Pappas. Contextual safety reasoning and grounding for open-world robots. arXiv preprint arXiv:2602.19983, 2026.
- [59] Yi Fung, Tuhin Chakrabarty, Hao Guo, Owen Rambow, Smaranda Muresan, and Heng Ji. NormSage: Multi-lingual multi-cultural norm discovery from conversations on-the-fly. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, pages 15217–15230, 2023.
- [60] MohammadHossein Rezaei, Yicheng Fu, Phil Cuvin, Caleb Ziems, Yanzhe Zhang, Hao Zhu, and Diyi Yang. EgoNormia: Benchmarking physical social norm understanding. arXiv preprint arXiv:2502.20490, 2025.
- [61] Richard E Fikes and Nils J Nilsson. STRIPS: A new approach to the application of theorem proving to problem solving. Artificial Intelligence, 2(3-4):189–208, 1971.
- [62] Drew McDermott. The formal semantics of processes in PDDL. In Proc. ICAPS Workshop on PDDL, pages 101–155, 2003.
- [63] Maria Fox and Derek Long. PDDL2.1: An extension to PDDL for expressing temporal planning domains. Journal of Artificial Intelligence Research, 20:61–124, 2003.
- [64] Malte Helmert. The Fast Downward planning system. Journal of Artificial Intelligence Research, 26:191–246, 2006.
- [65] Elliot Gestrin, Marco Kuhlmann, and Jendrik Seipp. NL2Plan: Robust LLM-driven planning from minimal text descriptions. arXiv preprint arXiv:2405.04215, 2024.
- [66] Max Zuo, Francisco Piedrahita Velez, Xiaochen Li, Michael Littman, and Stephen Bach. Planetarium: A rigorous benchmark for translating text to structured planning languages. In Proceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers), 2025.
- [67] Karthik Valmeekam, Matthew Marquez, Alberto Olmo, Sarath Sreedharan, and Subbarao Kambhampati. PlanBench: An extensible benchmark for evaluating large language models on planning and reasoning about change. Advances in Neural Information Processing Systems, 36:38975–38987, 2023.
- [68] Karthik Valmeekam, Matthew Marquez, Sarath Sreedharan, and Subbarao Kambhampati. On the planning abilities of large language models - a critical investigation. Advances in Neural Information Processing Systems, 36:75993–76005, 2023.
- [69] Argaman Mordoch, Enrico Scala, Roni Stern, and Brendan Juba. Safe learning of PDDL domains with conditional effects. In Proceedings of the International Conference on Automated Planning and Scheduling, volume 34, pages 387–395, 2024.
- [70] Marcus Tantakoun, Christian Muise, and Xiaodan Zhu. LLMs as planning formalizers: A survey for leveraging large language models to construct automated planning models. In Findings of the Association for Computational Linguistics: ACL 2025, pages 25167–25188, 2025.
- [71] Zhigen Zhao, Shuo Cheng, Yan Ding, Ziyi Zhou, Shiqi Zhang, Danfei Xu, and Ye Zhao. A survey of optimization-based task and motion planning: From classical to learning approaches. IEEE/ASME Transactions on Mechatronics, 30(4):2799–2825, 2024.
- [72] Mantas Mazeika, Long Phan, Xuwang Yin, Andy Zou, Zifan Wang, Norman Mu, Elham Sakhaee, Nathaniel Li, Steven Basart, Bo Li, et al. HarmBench: A standardized evaluation framework for automated red teaming and robust refusal. arXiv preprint arXiv:2402.04249, 2024.
- [73] Boxin Wang, Weixin Chen, Hengzhi Pei, Chulin Xie, Mintong Kang, Chenhui Zhang, Chejian Xu, Zidi Xiong, Ritik Dutta, Rylan Schaeffer, et al. DecodingTrust: A comprehensive assessment of trustworthiness in GPT models. 2023.
- [74] Thomas Hartvigsen, Saadia Gabriel, Hamid Palangi, Maarten Sap, Dipankar Ray, and Ece Kamar. ToxiGen: A large-scale machine-generated dataset for adversarial and implicit hate speech detection. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 3309–3326, 2022.
- [75] Stephanie Lin, Jacob Hilton, and Owain Evans. TruthfulQA: Measuring how models mimic human falsehoods. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 3214–3252, 2022.
- [76] Lianmin Zheng, Wei-Lin Chiang, Ying Sheng, Siyuan Zhuang, Zhanghao Wu, Yonghao Zhuang, Zi Lin, Zhuohan Li, Dacheng Li, Eric Xing, et al. Judging LLM-as-a-judge with MT-Bench and Chatbot Arena. Advances in Neural Information Processing Systems, 36:46595–46623, 2023.
- [77] Sanjana Srivastava, Chengshu Li, Michael Lingelbach, Roberto Martín-Martín, Fei Xia, Kent Elliott Vainio, Zheng Lian, Cem Gokmen, Shyamal Buch, Karen Liu, et al. BEHAVIOR: Benchmark for everyday household activities in virtual, interactive, and ecological environments. In Conference on Robot Learning, pages 477–490. PMLR, 2022.
- [78] Aishwarya Padmakumar, Jesse Thomason, Ayush Shrivastava, Patrick Lange, Anjali Narayan-Chen, Spandana Gella, Robinson Piramuthu, Gokhan Tur, and Dilek Hakkani-Tur. TEACh: Task-driven embodied agents that chat. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 36, pages 2017–2025, 2022.
- [79] Yanming Wan, Jiayuan Mao, and Josh Tenenbaum. HandMeThat: Human-robot communication in physical and social environments. Advances in Neural Information Processing Systems, 35:12014–12026, 2022.
- [80] Manling Li, Shiyu Zhao, Qineng Wang, Kangrui Wang, Yu Zhou, Sanjana Srivastava, Cem Gokmen, Tony Lee, Erran Li Li, Ruohan Zhang, et al. Embodied Agent Interface: Benchmarking LLMs for embodied decision making. Advances in Neural Information Processing Systems, 37:100428–100534, 2024.