pith. sign in

arxiv: 2606.07017 · v1 · pith:YM6GGTLPnew · submitted 2026-06-05 · 💻 cs.AI · cs.CL· cs.ET

The Sim-to-Real Gap of Foundation Model Agents: A Unified MDP Perspective

Pith reviewed 2026-06-27 22:25 UTC · model grok-4.3

classification 💻 cs.AI cs.CLcs.ET
keywords sim-to-real gapfoundation model agentsMarkov decision processdomain randomizationagent robustnesstool callingobservation gap
0
0 comments X

The pith

The sim-to-real gap for foundation model agents can be formalized using the four elements of a Markov Decision Process.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper argues that the robustness problems foundation model agents face when moving from simulated training to real deployment should be analyzed by mapping every discrepancy onto the four standard MDP components. Breaking the gap into observation differences, action differences, transition differences, and reward differences lets researchers import established techniques such as domain randomization instead of treating agent failures as a brand-new phenomenon. The authors illustrate the approach with a multilingual tool-calling case where an observation gap produces actions that carry correct semantic intent yet remain operationally invalid in the real environment. This framing supplies a shared vocabulary and a route to standardized benchmarks that robotics and control communities already use.

Core claim

Foundation model agent evaluation and training gaps are structured entirely around the four MDP elements of Observation, Action, Transition, and Reward. This structure translates classical sim-to-real discrepancies into the foundation model domain and supports solutions such as domain randomization, as illustrated by multilingual tool-calling examples where observation mismatches produce invalid actions.

What carries the argument

Mapping every discrepancy in foundation model agents onto the classical MDP tuple consisting of observation space, action space, transition function, and reward function.

If this is right

  • Domain randomization can be applied directly to any of the four MDP components to reduce the sim-to-real gap.
  • A unified vocabulary for agent robustness emerges that connects foundation-model work with classical control methods.
  • Standardized stress-test benchmarks can be built by isolating mismatches in observation, action, transition, or reward.
  • Real-world deployment reliability improves by reusing transfer techniques already validated in robotics.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The MDP framing could classify prompt-engineering failures as observation-space or action-space mismatches for systematic diagnosis.
  • Experiments that vary only one MDP component at a time would isolate which failure modes dominate in large language model agents.
  • Hybrid training pipelines could deliberately augment real interaction data with simulation data matched to specific MDP gaps.

Load-bearing premise

Discrepancies unique to foundation models, such as semantic intent versus operational validity in language outputs, map cleanly onto the classical MDP components without requiring new formal elements.

What would settle it

A controlled experiment that applies domain randomization to close an identified MDP-component gap in a foundation model agent on a tool-calling task and then measures whether real-world success rate improves over the unrandomized baseline.

Figures

Figures reproduced from arXiv: 2606.07017 by Hua Wei, Tiejin Chen, Weibo Li, Xiaoou Liu, Xiyang Hu.

Figure 1
Figure 1. Figure 1: Landscape view of sim-to-real transfer for foundation model agent systems under the MDP decomposition. A policy [PITH_FULL_IMAGE:figures/full_fig_p002_1.png] view at source ↗
Figure 2
Figure 2. Figure 2: Overview of research directions organized around [PITH_FULL_IMAGE:figures/full_fig_p005_2.png] view at source ↗
read the original abstract

Foundation model agents are increasingly deployed for real-world decision-making, but suffer from the sim-to-real gap. While robotics and classical control have mature frameworks to address this gap, the foundation model community is treating agent robustness as an entirely novel phenomenon. Our paper proposes formalizing the foundation model agent evaluation and training gap as a classical sim-to-real problem structured entirely around the four elements of a Markov Decision Process, including Observation, Action, Transition, and Reward. In this paper, we set a comprehensive research agenda that translates classical discrepancies into the foundation model domain and advocates for adopting established solutions like domain randomization. We provide concrete examples, such as a multilingual tool calling to demonstrate how severe observation space gaps lead to operationally invalid actions despite correct semantic intent. Ultimately, this agenda aims to drive a paradigm shift, yielding a unified vocabulary and standardized stress test benchmarks to foster a new generation of highly trustworthy agents for reliable real-world applications.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 2 minor

Summary. The paper claims that the sim-to-real gap for foundation model agents should be formalized strictly as discrepancies in the four classical MDP components (Observation, Action, Transition, Reward), enabling direct transfer of techniques such as domain randomization from robotics and control. It sets a research agenda for this unification, provides a multilingual tool-calling example mapping semantic-versus-operational discrepancies to an observation-space gap, and advocates for standardized benchmarks and a shared vocabulary to produce more trustworthy agents.

Significance. If the proposed MDP mapping holds as a complete and non-reductive framework, the work could supply a unifying vocabulary that lets the foundation-model community adopt mature sim-to-real methods, potentially accelerating development of reliable real-world agents through shared stress-test benchmarks.

major comments (1)
  1. [Abstract / multilingual tool-calling example] The central proposal that all foundation-model-specific discrepancies (including semantic intent versus operational validity) map cleanly onto the four MDP elements without remainder or new primitives is asserted in the abstract and illustrated only by the multilingual tool-calling example; no formal argument, counter-example analysis, or completeness proof is supplied to establish that the mapping is exhaustive.
minor comments (2)
  1. The manuscript would benefit from explicit discussion of how reward-function discrepancies (e.g., misalignment between human preference and proxy reward in language agents) would be operationalized under the proposed framework.
  2. Several classical MDP references are cited but the paper does not contrast its agenda with existing sim-to-real surveys in robotics that already treat language-conditioned policies; adding such positioning would clarify novelty.

Simulated Author's Rebuttal

1 responses · 0 unresolved

We thank the referee for the detailed and constructive review. The feedback highlights an important point about the strength of evidence for our central claim. We address it below and will revise the manuscript accordingly.

read point-by-point responses
  1. Referee: [Abstract / multilingual tool-calling example] The central proposal that all foundation-model-specific discrepancies (including semantic intent versus operational validity) map cleanly onto the four MDP elements without remainder or new primitives is asserted in the abstract and illustrated only by the multilingual tool-calling example; no formal argument, counter-example analysis, or completeness proof is supplied to establish that the mapping is exhaustive.

    Authors: We agree that the manuscript asserts the mapping as exhaustive in the abstract without supplying a formal completeness argument or systematic counter-example analysis. The multilingual tool-calling example is intended only as an illustration of how semantic-versus-operational discrepancies can be captured inside the observation component. Because the paper's primary contribution is a research agenda rather than a theorem, we did not include a proof. To strengthen the presentation we will (1) revise the abstract and introduction to describe the four-component mapping as a proposed unifying lens rather than a proven exhaustive reduction, (2) add a short discussion subsection that enumerates additional concrete examples (visual grounding, reward hacking via language, action-space tokenization mismatches) and explicitly flags the possibility that future work may identify edge cases requiring new primitives, and (3) include a brief paragraph on scope and limitations. These changes will make the evidential status of the claim transparent while preserving the agenda-setting purpose of the work. revision: yes

Circularity Check

0 steps flagged

No significant circularity

full rationale

The paper presents a research agenda that reframes the sim-to-real gap for foundation model agents as discrepancies in the four classical MDP components (Observation, Action, Transition, Reward) and advocates translating techniques such as domain randomization. No derivations, equations, fitted parameters, or predictions appear that reduce by construction to the paper's own inputs. The work cites external classical MDP literature as independent support rather than relying on self-citation chains or uniqueness theorems from the authors. The mapping of foundation-model-specific issues (e.g., semantic vs. operational validity) onto MDP elements is offered as a working hypothesis for future work, not as a proven equivalence derived from the paper itself. The manuscript is therefore self-contained as a forward-looking proposal without load-bearing circular steps.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The proposal rests on the assumption that the classical MDP decomposition applies without modification to language-model agents; no free parameters, new axioms, or invented entities are introduced beyond standard MDP components.

axioms (1)
  • domain assumption The four MDP elements (observation, action, transition, reward) are sufficient to capture all relevant sim-to-real discrepancies in foundation model agents.
    Invoked in the abstract when stating the gap is 'structured entirely around' these four elements.

pith-pipeline@v0.9.1-grok · 5701 in / 1269 out tokens · 19844 ms · 2026-06-27T22:25:13.504124+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Reference graph

Works this paper leans on

64 extracted references · 9 linked inside Pith

  1. [1]

    Ammar N Abbas, Shakra Mehak, Georgios C Chasparis, John D Kelleher, Michael Guilfoyle, Maria Chiara Leva, and Aswin K Ramasubramanian. 2024. Safety- driven deep reinforcement learning framework for cobots: A sim2real approach. In2024 10th International Conference on Control, Decision and Information Tech- nologies (CoDIT). IEEE, 2917–2923

  2. [2]

    Mohammed Alshiekh, Roderick Bloem, Rüdiger Ehlers, Bettina Könighofer, Scott Niekum, and Ufuk Topcu. 2018. Safe reinforcement learning via shielding. In Proceedings of the AAAI conference on artificial intelligence, Vol. 32

  3. [3]

    Rika Antonova, Silvia Cruciani, Christian Smith, and Danica Kragic. 2017. Rein- forcement learning for pivoting task.arXiv preprint arXiv:1703.00472(2017)

  4. [4]

    Babak Badnava, Mona Esmaeili, Nasser Mozayani, and Payman Zarkesh-Ha

  5. [5]

    In 2023 IEEE 13th Annual Computing and Communication Workshop and Conference (CCWC)

    A new potential-based reward shaping for reinforcement learning agent. In 2023 IEEE 13th Annual Computing and Communication Workshop and Conference (CCWC). IEEE, 01–06

  6. [6]

    Marc G Bellemare, Georg Ostrovski, Arthur Guez, Philip Thomas, and Rémi Munos. 2016. Increasing the action gap: New operators for reinforcement learning. InProceedings of the AAAI conference on artificial intelligence, Vol. 30

  7. [7]

    Steven Bohez, Tim Verbelen, Elias De Coninck, Bert Vankeirsbilck, Pieter Simoens, and Bart Dhoedt. 2017. Sensor fusion for robot control through deep reinforce- ment learning. In2017 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS). Ieee, 2365–2370

  8. [8]

    Konstantinos Bousmalis, Nathan Silberman, David Dohan, Dumitru Erhan, and Dilip Krishnan. 2017. Unsupervised pixel-level domain adaptation with generative adversarial networks. InProceedings of the IEEE conference on computer vision and pattern recognition. 3722–3731

  9. [9]

    Tiejin Chen, Pingzhi Li, Kaixiong Zhou, Tianlong Chen, and Hua Wei. 2025. Vision Language Model Helps Private Information De-Identification in Vision Data. InFindings of the Association for Computational Linguistics: ACL 2025. 4558– 4572

  10. [10]

    Tiejin Chen, Xiaoou Liu, Vishnu Nandam, Kuan-Ru Liou, and Hua Wei. 2026. Conformal Feedback Alignment: Quantifying Answer-Level Reliability for Robust LLM Alignment.arXiv preprint arXiv:2601.17329(2026)

  11. [11]

    Zehui Chen, Weihua Du, Wenwei Zhang, Kuikun Liu, Jiangning Liu, Miao Zheng, Jingming Zhuo, Songyang Zhang, Dahua Lin, Kai Chen, et al . 2024. T-eval: Evaluating the tool utilization capability of large language models step by step. InProceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). 9510–9529

  12. [12]

    Zhaorun Chen, Mintong Kang, and Bo Li. 2025. Shieldagent: Shielding agents via verifiable safety policy reasoning.arXiv preprint arXiv:2503.22738(2025)

  13. [13]

    Longchao Da, Tiejin Chen, Lu Cheng, and Hua Wei. 2024. Llm uncertainty quantification through directional entailment graph and claim level response augmentation.arXiv preprint arXiv:2407.00994(2024)

  14. [14]

    Longchao Da, Minquan Gao, Hao Mei, and Hua Wei. 2024. Prompt to transfer: Sim-to-real transfer for traffic signal control with prompt learning. InProceedings of the AAAI Conference on Artificial Intelligence, Vol. 38. 82–90

  15. [15]

    Longchao Da, Hao Mei, Romir Sharma, and Hua Wei. 2023. Uncertainty-aware grounded action transformation towards sim-to-real transfer for traffic signal control. In2023 62nd IEEE Conference on Decision and Control (CDC). IEEE, 1124– 1129

  16. [16]

    Longchao Da, Justin Turnau, Thirulogasankar Pranav Kutralingam, Alvaro Ve- lasquez, Paulo Shakarian, and Hua Wei. 2025. A survey of sim-to-real methods in rl: Progress, prospects and challenges with foundation models.arXiv preprint arXiv:2502.13187(2025)

  17. [17]

    Siddharth Desai, Ishan Durugkar, Haresh Karnan, Garrett Warnell, Josiah Hanna, and Peter Stone. 2020. An imitation from observation approach to transfer learning with dynamics mismatch.Advances in Neural Information Processing Systems33 (2020), 3917–3929

  18. [18]

    Siddharth Desai, Haresh Karnan, Josiah P Hanna, Garrett Warnell, and Peter Stone. 2020. Stochastic grounded action transformation for robot learning in simulation. In2020 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS). IEEE, 6106–6111

  19. [19]

    2012.Handbook of Markov decision processes: methods and applications

    Eugene A Feinberg and Adam Shwartz. 2012.Handbook of Markov decision processes: methods and applications. Vol. 40. Springer Science & Business Media

  20. [20]

    Vlad Firoiu, Tina Ju, and Josh Tenenbaum. 2018. At human speed: Deep rein- forcement learning with action delay.arXiv preprint arXiv:1810.07286(2018)

  21. [21]

    Aaron Grattafiori, Abhimanyu Dubey, Abhinav Jauhri, Abhinav Pandey, Abhishek Kadian, Ahmad Al-Dahle, Aiesha Letman, Akhil Mathur, Alan Schelten, Alex Vaughan, et al. 2024. The llama 3 herd of models.arXiv preprint arXiv:2407.21783 (2024)

  22. [22]

    Yihong Guo, Yixuan Wang, Yuanyuan Shi, Pan Xu, and Anqi Liu. 2024. Off- dynamics reinforcement learning via domain adaptation and reward augmented imitation.Advances in Neural Information Processing Systems37 (2024), 136326– 136360

  23. [23]

    Josiah Hanna and Peter Stone. 2017. Grounded action transformation for ro- bot learning in simulation. InProceedings of the AAAI Conference on Artificial Intelligence, Vol. 31

  24. [24]

    Omer Hofman, Jonathan Brokman, Oren Rachmil, Shamik Bose, Vikas Pahuja, Toshiya Shimizu, Trisha Starostina, Kelly Marchisio, Seraphina Goldfarb-Tarrant, and Roman Vainshtein. 2025. MAPS: A Multilingual Benchmark for Global Agent Performance and Security.arXiv preprint arXiv:2505.15935(2025)

  25. [25]

    Xu Huang, Weiwen Liu, Xiaolong Chen, Xingmei Wang, Hao Wang, Defu Lian, Yasheng Wang, Ruiming Tang, and Enhong Chen. 2024. Understanding the planning of llm agents: A survey.arXiv preprint arXiv:2402.02716(2024)

  26. [26]

    Dongfu Jiang, Yi Lu, Zhuofeng Li, Zhiheng Lyu, Ping Nie, Haozhe Wang, Alex Su, Hui Chen, Kai Zou, Chao Du, et al. 2025. Verltool: Towards holistic agentic reinforcement learning with tool use.arXiv preprint arXiv:2509.01055(2025)

  27. [27]

    Haresh Karnan, Siddharth Desai, Josiah P Hanna, Garrett Warnell, and Peter Stone. 2020. Reinforced grounded action transformation for sim-to-real transfer. In2020 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS). IEEE, 4397–4402

  28. [28]

    Doyoung Kim, Zhiwei Ren, Jie Hao, Zhongkai Sun, Lichao Wang, Xiyao Ma, Zack Ye, Xu Han, Jun Yin, Heng Ji, et al. 2026. Beyond Perfect APIs: A Comprehensive Evaluation of LLM Agents Under Real-World API Complexity.arXiv preprint arXiv:2601.00268(2026)

  29. [29]

    Jiin Kim, Byeongjun Shin, Jinha Chung, and Minsoo Rhu. 2026. The cost of dynamic reasoning: Demystifying ai agents and test-time scaling from an ai infrastructure perspective. In2026 IEEE International Symposium on High Perfor- mance Computer Architecture (HPCA). IEEE, 1–16

  30. [30]

    Yeseung Kim, Dohyun Kim, Jieun Choi, Jisang Park, Nayoung Oh, and Daehyung Park. 2024. A survey on integration of large language models with intelligent robots.Intelligent Service Robotics17, 5 (2024), 1091–1107

  31. [31]

    Guohao Li, Hasan Hammoud, Hani Itani, Dmitrii Khizbullin, and Bernard Ghanem. 2023. Camel: Communicative agents for" mind" exploration of large language model society.Advances in neural information processing systems36 (2023), 51991–52008

  32. [32]

    Jianxiong Li, Xiao Hu, Haoran Xu, Jingjing Liu, Xianyuan Zhan, Qing-Shan Jia, and Ya-Qin Zhang. 2023. Mind the gap: Offline policy optimization for imperfect rewards.arXiv preprint arXiv:2302.01667(2023)

  33. [33]

    Zhilin Lin and Shiliang Sun. 2025. Revealing the Challenges of Sim-to-Real Transfer in Model-Based Reinforcement Learning via Latent Space Modeling. arXiv preprint arXiv:2506.12735(2025)

  34. [34]

    Aixin Liu, Aoxue Mei, Bangcai Lin, Bing Xue, Bingxuan Wang, Bingzheng Xu, Bochao Wu, Bowei Zhang, Chaofan Lin, Chen Dong, et al . 2025. Deepseek- v3. 2: Pushing the frontier of open large language models.arXiv preprint arXiv:2512.02556(2025)

  35. [35]

    Qianmei Liu, Yufei Kuang, and Jie Wang. 2024. Robust deep reinforcement learn- ing with adaptive adversarial perturbations in action space. In2024 International Joint Conference on Neural Networks (IJCNN). IEEE, 1–8

  36. [36]

    Xiaoou Liu, Tiejin Chen, Longchao Da, Chacha Chen, Zhen Lin, and Hua Wei

  37. [37]

    InProceedings of the 31st ACM SIGKDD Conference on Knowledge Discovery and Data Mining V

    Uncertainty quantification and confidence calibration in large language models: A survey. InProceedings of the 31st ACM SIGKDD Conference on Knowledge Discovery and Data Mining V. 2. 6107–6117

  38. [38]

    Yihong Liu, Raoyuan Zhao, Lena Altinger, Hinrich Schütze, and Michael A Hed- derich. 2025. Evaluating Robustness of Large Language Models Against Multilin- gual Typographical Errors.arXiv preprint arXiv:2510.09536(2025)

  39. [39]

    Zheng Luo, T Pranav Kutralingam, Ogochukwu N Okoani, Wanpeng Xu, Hua Wei, and Xiyang Hu. 2026. Lost in Execution: On the Multilingual Robustness of Tool Calling in Large Language Models.arXiv preprint arXiv:2601.05366(2026)

  40. [40]

    Bhairav Mehta, Manfred Diaz, Florian Golemo, Christopher J Pal, and Liam Paull

  41. [41]

    InConference on Robot Learning

    Active domain randomization. InConference on Robot Learning. PMLR, 1162–1176

  42. [42]

    Mayank Mishra, Matt Stallone, Gaoyuan Zhang, Yikang Shen, Aditya Prasad, Adriana Meza Soria, Michele Merler, Parameswaran Selvam, Saptha Surendran, Shivdeep Singh, et al. 2024. Granite code models: A family of open foundation models for code intelligence.arXiv preprint arXiv:2405.04324(2024)

  43. [43]

    Youngbin Park, Sang Hyoung Lee, and Il Hong Suh. 2021. Sim-to-real visual grasping via state representation learning based on combining pixel-level and feature-level domain adaptation. In2021 IEEE International Conference on Robotics and Automation (ICRA). IEEE, 6300–6307

  44. [44]

    Shishir G Patil, Huanzhi Mao, Fanjia Yan, Charlie Cheng-Jie Ji, Vishnu Suresh, Ion Stoica, and Joseph E Gonzalez. 2025. The berkeley function calling leaderboard (bfcl): From tool use to agentic evaluation of large language models. InForty- second International Conference on Machine Learning

  45. [45]

    Xue Bin Peng, Marcin Andrychowicz, Wojciech Zaremba, and Pieter Abbeel. 2018. Sim-to-real transfer of robotic control with dynamics randomization. In2018 IEEE international conference on robotics and automation (ICRA). IEEE, 3803–3810

  46. [46]

    2025.Qwen3-Next: Towards Ultimate Training & Inference Effi- ciency

    QwenTeam. 2025.Qwen3-Next: Towards Ultimate Training & Inference Effi- ciency. https://qwen.ai/blog?id=4074cca80393150c248e508aa62983f9cb7d27cd& from=research.latest-advancements-list

  47. [47]

    Ella Rabinovich and Ateret Anaby Tavor. 2025. On the robustness of agentic function calling. InProceedings of the 5th Workshop on Trustworthy NLP (TrustNLP 2025). 298–304

  48. [48]

    The Sim-to-Real Gap of Foundation Model Agents: A Unified MDP Perspective KDD ’26, August 3–7, 2026, Washington, DC, USA

    Aaditya Singh, Adam Fry, Adam Perelman, Adam Tart, Adi Ganesh, Ahmed El-Kishky, Aidan McLaughlin, Aiden Low, AJ Ostrow, Akhila Ananthram, et al. The Sim-to-Real Gap of Foundation Model Agents: A Unified MDP Perspective KDD ’26, August 3–7, 2026, Washington, DC, USA

  49. [49]

    Openai gpt-5 system card.arXiv preprint arXiv:2601.03267(2025)

  50. [50]

    Elena Smirnova, Elvis Dohmatob, and Jérémie Mary. 2019. Distributionally robust reinforcement learning.arXiv preprint arXiv:1902.08708(2019)

  51. [51]

    Kai Liang Tan, Yasaman Esfandiari, Xian Yeow Lee, Soumik Sarkar, et al. 2020. Robustifying reinforcement learning agents via action space adversarial training. In2020 American control conference (ACC). IEEE, 3959–3964

  52. [52]

    Josh Tobin, Rachel Fong, Alex Ray, Jonas Schneider, Wojciech Zaremba, and Pieter Abbeel. 2017. Domain randomization for transferring deep neural networks from simulation to the real world. In2017 IEEE/RSJ international conference on intelligent robots and systems (IROS). IEEE, 23–30

  53. [53]

    Eugene Valassakis, Zihan Ding, and Edward Johns. 2020. Crossing the gap: A deep dive into zero-shot sim-to-real transfer for dynamics. In2020 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS). IEEE, 5372– 5379

  54. [54]

    Alvaro Velasquez, Brett Bissey, Lior Barak, Andre Beckus, Ismail Alkhouri, Daniel Melcer, and George Atia. 2021. Dynamic automaton-guided reward shaping for monte carlo tree search. InProceedings of the AAAI Conference on Artificial Intelligence, Vol. 35. 12015–12023

  55. [55]

    Sri Vatsa Vuddanti, Aarav Shah, Satwik Kumar Chittiprolu, Tony Song, Sunishchal Dev, Kevin Zhu, and Maheep Chaudhary. 2025. PALADIN: Self-Correcting Lan- guage Model Agents to Cure Tool-Failure Cases.arXiv preprint arXiv:2509.25238 (2025)

  56. [56]

    Ruipeng Wang, Yuxin Chen, Yukai Wang, Chang Wu, Junfeng Fang, Xiaodong Cai, Qi Gu, Hui Su, An Zhang, Xiang Wang, et al . 2026. AgentNoiseBench: Benchmarking Robustness of Tool-Using LLM Agents Under Noisy Condition. arXiv preprint arXiv:2602.11348(2026)

  57. [57]

    Harley Wiltzer, Marc Bellemare, David Meger, Patrick Shafto, and Yash Jhaveri

  58. [58]

    Action gaps and advantages in continuous-time distributional reinforce- ment learning.Advances in Neural Information Processing Systems37 (2024), 47815–47848

  59. [59]

    Huaiyuan Yao, Wanpeng Xu, Justin Turnau, Nadia Kellam, and Hua Wei. 2025. Instructional agents: Llm agents on automated course material generation for teaching faculties.arXiv preprint arXiv:2508.19611(2025)

  60. [60]

    Albert Yu, Adeline Foote, Raymond Mooney, and Roberto Martín-Martín. 2024. Natural language can help bridge the sim2real gap.arXiv preprint arXiv:2405.10020 (2024)

  61. [61]

    Wenshuai Zhao, Jorge Peña Queralta, and Tomi Westerlund. 2020. Sim-to-real transfer in deep reinforcement learning for robotics: a survey. In2020 IEEE symposium series on computational intelligence (SSCI). IEEE, 737–744

  62. [62]

    Jingwen Zhou, Jieshan Chen, Qinghua Lu, Dehai Zhao, and Liming Zhu. 2025. Shielda: Structured handling of exceptions in llm-driven agentic workflows.arXiv preprint arXiv:2508.07935(2025)

  63. [63]

    Xiaolin Zhou, Aojie Yuan, Zheng Luo, Zipeng Ling, Xixiao Pan, Yicheng Gao, Haiyue Zhang, Jiate Li, Shuli Jiang, Prince Zizhuang Wang, et al. 2026. When Simulation Lies: A Sim-to-Real Benchmark and Domain-Randomized RL Recipe for Tool-Use Agents.arXiv preprint arXiv:2605.11928(2026)

  64. [64]

    Kunlun Zhu, Zijia Liu, Bingxuan Li, Muxin Tian, Yingxuan Yang, Jiaxun Zhang, Pengrui Han, Qipeng Xie, Fuyang Cui, Weijia Zhang, et al . 2025. Where llm agents fail and how they can learn from failures.arXiv preprint arXiv:2509.25370 (2025)