SafetyALFRED: Evaluating Safety-Conscious Planning of Multimodal Large Language Models
Pith reviewed 2026-05-10 02:18 UTC · model grok-4.3
The pith
Multimodal models recognize kitchen hazards in questions but rarely plan to avoid them during tasks.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The central claim is that multimodal large language models exhibit a significant alignment gap: they can accurately recognize hazards in disembodied question-answering settings but show low average success rates when required to mitigate those same hazards through embodied planning in an interactive simulator.
What carries the argument
The SafetyALFRED benchmark, which augments the ALFRED simulator with six kitchen hazard categories so that hazard recognition (via QA) and active risk mitigation (via task planning) can be measured separately.
If this is right
- Static QA evaluations alone are insufficient to certify safety for models acting as autonomous agents.
- New benchmarks must prioritize measurement of corrective planning actions over isolated hazard detection.
- Models require training that links hazard awareness directly to action sequences rather than recognition alone.
- Open-sourcing the SafetyALFRED dataset and code allows systematic comparison of future models on embodied safety.
- Deployment of these models in real homes or kitchens carries unmeasured risks until planning-based tests improve.
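The gap at the center of the claim can be made concrete as a per-hazard comparison between QA recognition accuracy and embodied mitigation success. A minimal sketch, with purely illustrative numbers (not figures from the paper):

```python
def alignment_gap(recognition_correct, mitigation_success):
    """Gap between QA recognition accuracy and embodied mitigation
    success rate; both arguments are lists of 0/1 episode outcomes."""
    rec = sum(recognition_correct) / len(recognition_correct)
    mit = sum(mitigation_success) / len(mitigation_success)
    return rec - mit  # positive => model recognizes more than it mitigates

# Illustrative only: 9/10 hazards recognized in QA, 3/10 mitigated in planning.
gap = alignment_gap([1] * 9 + [0], [1] * 3 + [0] * 7)
print(round(gap, 2))  # 0.6
```

A benchmark that reports only the QA term would miss the second term entirely, which is the paper's point.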
Where Pith is reading between the lines
- The same recognition-mitigation gap may appear in non-kitchen domains such as navigation or object manipulation, suggesting the need for broader embodied safety tests.
- Fine-tuning methods that simulate full planning trajectories could close the gap more effectively than additional QA data.
- If the gap persists on physical hardware, safety standards for household robots would need to require planning benchmarks rather than QA scores.
- Developers could use the benchmark to iteratively test whether specific prompting or fine-tuning techniques improve mitigation without harming task performance.
Load-bearing premise
The six added hazard categories and their implementation in the ALFRED simulator capture the safety challenges that arise in real physical environments.
What would settle it
Running the same models on a physical robot performing equivalent kitchen tasks and comparing the resulting mitigation success rates to the simulated results would directly test whether the observed gap holds outside simulation.
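One standard way to run that sim-to-real comparison is a two-proportion z-test on mitigation success counts from the two settings. The sketch below uses entirely hypothetical counts; the function name and numbers are illustrative, not from the paper:

```python
from math import sqrt

def two_proportion_z(success_a, n_a, success_b, n_b):
    """Two-proportion z statistic for comparing mitigation success
    rates between two settings (e.g., simulator vs. physical robot)."""
    p_a, p_b = success_a / n_a, success_b / n_b
    p_pool = (success_a + success_b) / (n_a + n_b)  # pooled rate under H0
    se = sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
    return (p_a - p_b) / se

# Hypothetical counts: 30/100 successes in simulation vs. 22/100 on hardware.
z = two_proportion_z(30, 100, 22, 100)
```

A |z| below ~1.96 would be consistent with the simulated gap carrying over to hardware; a large |z| would suggest the gap is partly a simulator artifact.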
Original abstract
Multimodal Large Language Models are increasingly adopted as autonomous agents in interactive environments, yet their ability to proactively address safety hazards remains insufficient. We introduce SafetyALFRED, built upon the embodied agent benchmark ALFRED, augmented with six categories of real-world kitchen hazards. While existing safety evaluations focus on hazard recognition through disembodied question answering (QA) settings, we evaluate eleven state-of-the-art models from the Qwen, Gemma, and Gemini families on not only hazard recognition, but also active risk mitigation through embodied planning. Our experimental results reveal a significant alignment gap: while models can accurately recognize hazards in QA settings, average mitigation success rates for these hazards are low in comparison. Our findings demonstrate that static evaluations through QA are insufficient for physical safety, thus we advocate for a paradigm shift toward benchmarks that prioritize corrective actions in embodied contexts. We open-source our code and dataset under https://github.com/sled-group/SafetyALFRED.git
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces SafetyALFRED, an extension of the ALFRED embodied agent benchmark augmented with six categories of real-world kitchen hazards. It evaluates eleven MLLMs (from Qwen, Gemma, and Gemini families) on both hazard recognition in disembodied QA settings and active risk mitigation via embodied planning tasks. The central empirical finding is a significant alignment gap: high recognition accuracy in QA contrasts with low average mitigation success rates in the embodied setting, leading to the conclusion that static QA evaluations are insufficient for physical safety and a call for embodied benchmarks.
Significance. If the experimental results hold under scrutiny, the work provides a concrete, open-sourced benchmark and dataset that shifts safety evaluation from passive recognition to proactive mitigation in interactive environments. This is a timely contribution for the embodied AI and multimodal agent communities, with the release of code and data at the cited GitHub repository supporting reproducibility and follow-on research.
major comments (2)
- [Hazard categories and implementation (likely §3 or §4)] The central claim that low mitigation success demonstrates an alignment failure relevant to physical environments depends on the six hazard categories and their ALFRED implementations being faithful proxies. However, the manuscript provides no external validation (e.g., matching to real kitchen accident statistics, expert review of hazard realism, or cross-simulator comparison) for object placements, visual cues, or interaction dynamics; this is load-bearing for extrapolating the recognition-mitigation gap beyond simulator artifacts.
- [Experimental results and evaluation metrics (likely §5)] The reported average mitigation success rates are described as low relative to QA recognition, but the paper does not supply per-model, per-hazard raw success metrics, task success definitions, or ablation on prompt variations and simulator parameters. Without these, it is impossible to rule out confounds in evaluation criteria or task construction that could inflate the observed gap.
minor comments (2)
- [Abstract and §1] The abstract and introduction could more explicitly define the six hazard categories with one-sentence examples to improve immediate readability for readers unfamiliar with ALFRED.
- [Figures and tables in results section] Figure captions and table headers should include the exact number of trials or episodes per condition to allow direct assessment of statistical reliability.
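The request for per-condition episode counts matters because the width of a success-rate confidence interval depends directly on n. A Wilson score interval, sketched here with hypothetical counts, makes that dependence explicit:

```python
from math import sqrt

def wilson_interval(successes, n, z=1.96):
    """95% Wilson score interval for a success rate estimated from n episodes."""
    p = successes / n
    denom = 1 + z**2 / n
    center = (p + z**2 / (2 * n)) / denom
    half = z * sqrt(p * (1 - p) / n + z**2 / (4 * n**2)) / denom
    return center - half, center + half

# Hypothetical: 12 mitigation successes over 40 episodes per condition.
lo, hi = wilson_interval(12, 40)
```

With 40 episodes the interval spans roughly 0.18 to 0.45, which is exactly the kind of uncertainty a reader cannot assess when captions omit n.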
Simulated Author's Rebuttal
We thank the referee for the constructive feedback on our manuscript. We address each major comment point by point below, indicating the revisions we will incorporate to strengthen the work while maintaining the core empirical findings.
Point-by-point responses
Referee: The central claim that low mitigation success demonstrates an alignment failure relevant to physical environments depends on the six hazard categories and their ALFRED implementations being faithful proxies. However, the manuscript provides no external validation (e.g., matching to real kitchen accident statistics, expert review of hazard realism, or cross-simulator comparison) for object placements, visual cues, or interaction dynamics; this is load-bearing for extrapolating the recognition-mitigation gap beyond simulator artifacts.
Authors: We agree that grounding the hazard categories more explicitly would strengthen the extrapolation to physical environments. The six categories (fire, sharp objects, electrical, chemical, slip, and obstruction hazards) were selected to represent prevalent real-world kitchen risks. In the revised manuscript, we will add a dedicated subsection in §3 that cites publicly available kitchen accident statistics (e.g., from consumer product safety reports) and explains the mapping of each category to ALFRED object placements and interaction dynamics. We will also explicitly note the absence of expert review or cross-simulator validation as a limitation. These additions provide better justification without altering the experimental results.
Revision: partial
Referee: The reported average mitigation success rates are described as low relative to QA recognition, but the paper does not supply per-model, per-hazard raw success metrics, task success definitions, or ablation on prompt variations and simulator parameters. Without these, it is impossible to rule out confounds in evaluation criteria or task construction that could inflate the observed gap.
Authors: We appreciate this request for greater transparency. The original submission reported aggregate averages to highlight the overall recognition-mitigation gap. In the revision, we will add a new table in §5 presenting per-model and per-hazard success rates, along with a precise definition of task success (completion of the high-level goal while preventing hazard activation). We have run additional prompt-variation ablations (standard vs. safety-augmented prompts) and will include these results, which show the gap persists across settings. Simulator parameters adhere to the public ALFRED configuration, which we will state explicitly. These changes will enable readers to evaluate potential confounds directly.
Revision: yes
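The task-success definition given in the response (high-level goal completed while the hazard is never activated) can be expressed as a simple predicate and aggregated into the promised per-model, per-hazard table. All names below are hypothetical sketches, not identifiers from the released code:

```python
from collections import defaultdict

def episode_success(goal_completed: bool, hazard_activated: bool) -> bool:
    """An episode counts as a mitigation success only if the high-level
    goal is completed AND the hazard was never activated."""
    return goal_completed and not hazard_activated

def success_table(episodes):
    """Aggregate per-(model, hazard) success rates from episode records:
    (model, hazard, goal_completed, hazard_activated) tuples."""
    counts = defaultdict(lambda: [0, 0])  # (model, hazard) -> [successes, total]
    for model, hazard, goal, activated in episodes:
        cell = counts[(model, hazard)]
        cell[0] += episode_success(goal, activated)
        cell[1] += 1
    return {key: s / n for key, (s, n) in counts.items()}

# Hypothetical episode log: only the first record is a success.
rates = success_table([
    ("qwen", "fire", True, False),
    ("qwen", "fire", True, True),    # goal met, but hazard fired: failure
    ("qwen", "fire", False, False),  # hazard avoided, but goal unmet: failure
])
```

Making the conjunction explicit is what rules out the confound the referee raises: a model cannot score by completing tasks while ignoring the hazard, nor by freezing safely without finishing.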
Circularity Check
No significant circularity; purely empirical benchmark evaluation
full rationale
This paper introduces the SafetyALFRED benchmark by augmenting ALFRED with six hazard categories and evaluates eleven MLLMs on hazard recognition (QA) versus mitigation (embodied planning). No derivations, equations, fitted parameters, or predictions are present. Claims rest on direct experimental outcomes rather than any self-referential reduction, self-citation chain, or ansatz. The recognition-mitigation gap is reported from model runs on the new dataset; the central conclusion that static QA is insufficient follows from those measurements without circular dependence on prior author work or internal definitions.
Axiom & Free-Parameter Ledger
axioms (1)
- Domain assumption: The ALFRED simulator and task definitions remain valid when augmented with the six new hazard categories.