pith. sign in

arxiv: 2605.12655 · v3 · pith:SR4SNW3Anew · submitted 2026-05-12 · 💻 cs.AI · cs.MA

Robust Instruction Compliance in Cooperative Multi-Agent Reinforcement Learning

Pith reviewed 2026-06-30 22:09 UTC · model grok-4.3

classification 💻 cs.AI cs.MA
keywords multi-agent reinforcement learninginstruction complianceBellman updatesvalue correctionactor-criticcooperative MARLnatural language instructionsmacro-actions
0
0 comments X

The pith

MAVIC corrects the bootstrapping target at instruction boundaries to enable consistent value estimation under stochastic instruction switching in a unified multi-agent policy.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

In cooperative multi-agent reinforcement learning, agents must follow external natural language instructions that can interrupt ongoing macro-actions and conflict with long-horizon goals. Standard Bellman updates create a failure mode by coupling value estimates across different instruction contexts, producing inconsistent values when instructions switch. MAVIC addresses this by correcting Bellman backups specifically at instruction boundaries: it adjusts the incoming instruction objective and restores the continuation value under the current objective. This change to the target itself, rather than reward shaping, supports consistent estimation inside one unified policy. The paper supplies theoretical analysis plus an actor-critic implementation and demonstrates high instruction compliance together with preserved base-task performance in progressively harder cooperative settings.

Core claim

MAVIC corrects Bellman backups at instruction boundaries by correcting the incoming instruction objective and restoring the continuation value under the current objective. Unlike reward shaping, MAVIC modifies the bootstrapping target itself, enabling consistent value estimation under stochastic instruction switching within a unified policy. The paper provides theoretical analysis and an actor-critic implementation, and shows that MAVIC achieves high instruction compliance while preserving base task performance in increasingly complex cooperative multi-agent environments.

What carries the argument

MAVIC's boundary correction that adjusts the incoming objective and restores the continuation value to decouple value estimates across instruction contexts.

If this is right

  • A unified policy can maintain consistent values despite stochastic instruction changes.
  • High instruction compliance is achieved without sacrificing base task performance.
  • The method scales to increasingly complex cooperative multi-agent environments.
  • Theoretical analysis supports the decoupling of value estimates across contexts.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Similar boundary corrections could apply to single-agent RL with external interruptions.
  • The approach may reduce the need for separate hierarchical policies when handling context switches.
  • It suggests target modification can outperform reward shaping for consistency under changing objectives.

Load-bearing premise

Correcting the incoming instruction objective and restoring the continuation value at instruction boundaries is sufficient to decouple value estimates across contexts without introducing new inconsistencies.

What would settle it

An experiment in which value estimates remain inconsistent or instruction compliance drops in a stochastic switching environment after MAVIC is applied.

Figures

Figures reproduced from arXiv: 2605.12655 by Enrico Marchesini, Ethan Rathbun, Wo Wei Lin, Xiang Zhi Tan.

Figure 1
Figure 1. Figure 1: Demonstration of reward cross-contamination in the Box Pushing environment. With [PITH_FULL_IMAGE:figures/full_fig_p003_1.png] view at source ↗
Figure 2
Figure 2. Figure 2: Illustration of value cross-contamination and MAVIC correction. Top (red): standard [PITH_FULL_IMAGE:figures/full_fig_p004_2.png] view at source ↗
Figure 3
Figure 3. Figure 3: illustrates the MAVIC architecture. Each agent maintains an ac￾tor network Ψθi (where θi parametrizes agent’s i policy) that selects macro￾actions conditioned on its macro-observation history and the current instruction. Instruction Text (e.g., “Don’t use left cutting board”) Tokenizer Frozen Language Pipeline Agen t s Arc hitec ture Environment Observation [PITH_FULL_IMAGE:figures/full_fig_p005_3.png] view at source ↗
Figure 4
Figure 4. Figure 4: Overview of the macro-action tasks. BP is Boxpushing, WTD is Warehouse, and OC is [PITH_FULL_IMAGE:figures/full_fig_p007_4.png] view at source ↗
Figure 5
Figure 5. Figure 5: Action distribution frequency for successful delivery is shown by baseline no instruction [PITH_FULL_IMAGE:figures/full_fig_p008_5.png] view at source ↗
read the original abstract

Multi-agent reinforcement learning (MARL) in real-world use cases may need to adapt to external natural language instructions that interrupt ongoing behavior and conflict with long-horizon objectives. However, conditioning rewards on instructions introduces a fundamental failure mode as Bellman updates couple value estimates across instruction contexts, leading to inconsistent values when instructions interrupt macro-actions. We propose Macro-Action Value Correction for Instruction Compliance (MAVIC), which corrects Bellman backups at instruction boundaries by correcting the incoming instruction objective and restoring the continuation value under the current objective. Unlike reward shaping, MAVIC modifies the bootstrapping target itself, enabling consistent value estimation under stochastic instruction switching within a unified policy. We provide theoretical analysis and an actor-critic implementation, and show that MAVIC achieves high instruction compliance while preserving base task performance in increasingly complex cooperative multi-agent environments.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper claims that in cooperative MARL, conditioning rewards on interrupting natural-language instructions couples value estimates across contexts via standard Bellman updates, producing inconsistent values for macro-actions. MAVIC corrects the bootstrapping target at detected instruction boundaries by adjusting the incoming objective and restoring the continuation value under the current objective, yielding consistent value estimates for a single unified policy under stochastic switching. The method is supported by theoretical analysis of the modified Bellman operator, an actor-critic implementation, and experiments showing high instruction compliance without degrading base-task performance in increasingly complex cooperative environments.

Significance. If the theoretical claim holds, MAVIC would address a load-bearing inconsistency in value-based MARL under external interruptions, enabling reliable instruction compliance in unified policies without separate context-specific critics or reward shaping. This is relevant for real-world cooperative agents that must respond to natural-language directives while preserving long-horizon objectives.

major comments (2)
  1. [Theoretical Analysis] Theoretical Analysis section: the central claim that boundary corrections produce a unique consistent fixed point under stochastic instruction switching requires an explicit derivation of the modified Bellman operator and a uniqueness argument; the provided analysis does not demonstrate that restored continuation values remain decoupled when future switches are stochastic or when multi-agent joint actions propagate cross-context terms through the shared critic.
  2. [§3] §3 (Method) and actor-critic implementation: the correction is applied only at detected boundaries, yet the manuscript does not show that this suffices to eliminate residual coupling in the value function when instruction arrivals remain stochastic; an explicit fixed-point equation or contraction-mapping argument would be needed to support the consistency guarantee.
minor comments (2)
  1. [Abstract] The abstract and introduction could more clearly distinguish MAVIC from reward-shaping baselines by including a short side-by-side comparison of the respective Bellman targets.
  2. [Experiments] Experimental figures would benefit from error bars or statistical significance tests across the reported environments to strengthen the claim of preserved base-task performance.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive comments on the theoretical foundations. We address each point below and will revise the manuscript to provide more explicit derivations while preserving the core claims.

read point-by-point responses
  1. Referee: [Theoretical Analysis] Theoretical Analysis section: the central claim that boundary corrections produce a unique consistent fixed point under stochastic instruction switching requires an explicit derivation of the modified Bellman operator and a uniqueness argument; the provided analysis does not demonstrate that restored continuation values remain decoupled when future switches are stochastic or when multi-agent joint actions propagate cross-context terms through the shared critic.

    Authors: The Theoretical Analysis section defines the modified Bellman operator by inserting the boundary correction that adjusts the incoming objective and restores the continuation value under the active instruction. This construction ensures that value estimates for a given macro-action segment are independent of prior or interrupting contexts. For stochastic future switches, the operator is applied at each boundary encountered during rollouts, so the fixed point remains consistent by induction over segments. Multi-agent joint actions enter through the shared critic, but the target correction is applied to the scalar backup independently of the joint-action structure. We agree the uniqueness argument would benefit from an expanded contraction-mapping derivation and will add this explicitly in the revision. revision: yes

  2. Referee: [§3] §3 (Method) and actor-critic implementation: the correction is applied only at detected boundaries, yet the manuscript does not show that this suffices to eliminate residual coupling in the value function when instruction arrivals remain stochastic; an explicit fixed-point equation or contraction-mapping argument would be needed to support the consistency guarantee.

    Authors: The correction is triggered precisely at each detected instruction boundary, which occurs whenever a stochastic switch arrives. Because the restored continuation value is taken under the new objective, no cross-context terms enter the backup. The resulting operator therefore satisfies a fixed-point equation in which each context's value depends only on its own rewards and transitions until the next boundary. We will include the explicit fixed-point equation and a contraction-mapping argument in the revised §3 to make this guarantee fully rigorous. revision: yes

Circularity Check

0 steps flagged

No significant circularity detected

full rationale

The provided abstract and description present MAVIC as an explicit modification to the Bellman bootstrapping target at detected instruction boundaries, with the correction defined by restoring the continuation value under the current objective. No equations, fitted parameters, or self-citations are shown that reduce the claimed consistency result to the input data or prior outputs by construction. The method is positioned as distinct from reward shaping and supported by separate theoretical analysis plus an actor-critic implementation. This leaves the central derivation self-contained against external benchmarks rather than tautological.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review; no free parameters, axioms, or invented entities are described.

pith-pipeline@v0.9.1-grok · 5668 in / 1019 out tokens · 23771 ms · 2026-06-30T22:09:44.790925+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Reference graph

Works this paper leans on

186 extracted references · 145 canonical work pages · 31 internal anchors

  1. [1]

    Machine Learning , author =

    Learning to predict by the methods of temporal differences , volume =. Machine Learning , author =. 1988 , keywords =. doi:10.1007/BF00115009 , abstract =

  2. [2]

    IEEE Robotics and Automation Letters , author =

    Human-. IEEE Robotics and Automation Letters , author =. 2018 , note =. doi:10.1109/LRA.2018.2812906 , abstract =

  3. [3]

    Hierarchical

    Yang, Jiachen and Borovikov, Igor and Zha, Hongyuan , month = may, year =. Hierarchical. doi:10.48550/arXiv.1912.03558 , abstract =

  4. [4]

    Yu, Chao and Velu, Akash and Vinitsky, Eugene and Gao, Jiaxuan and Wang, Yu and Bayen, Alexandre and Wu, Yi , month = nov, year =. The

  5. [5]

    Silva, Franceli L

    Zhou, Jiawei and Zhang, Yixuan and Luo, Qianni and Parker, Andrea G and De Choudhury, Munmun , month = apr, year =. Synthetic. Proceedings of the 2023. doi:10.1145/3544548.3581318 , abstract =

  6. [6]

    Intent-aware Multi-agent Reinforcement Learning

    Qi, Siyuan and Zhu, Song-Chun , month = mar, year =. Intent-aware. doi:10.48550/arXiv.1803.02018 , abstract =

  7. [7]

    Applied Intelligence , author =

    A review of cooperative multi-agent deep reinforcement learning , volume =. Applied Intelligence , author =. 2023 , keywords =. doi:10.1007/s10489-022-04105-y , abstract =

  8. [8]

    and Amato, Christopher , year =

    Oliehoek, Frans A. and Amato, Christopher , year =. A. doi:10.1007/978-3-319-28929-8 , language =

  9. [9]

    Vera Liao, Mary Lou Maher, Charles Patrick Martin, and Greg Walsh

    Muller, Michael and Chilton, Lydia B and Kantosalo, Anna and Liao, Q. Vera and Maher, Mary Lou and Martin, Charles Patrick and Walsh, Greg , month = apr, year =. Extended. doi:10.1145/3544549.3573794 , abstract =

  10. [10]

    Littman and Anthony R

    Planning and acting in partially observable stochastic domains , volume =. Artificial Intelligence , author =. 1998 , pages =. doi:10.1016/S0004-3702(98)00023-X , abstract =

  11. [11]

    Neurocomput

    A review of research on reinforcement learning algorithms for multi-agents , volume =. Neurocomput. , author =. 2024 , keywords =. doi:10.1016/j.neucom.2024.128068 , number =

  12. [12]

    Multiagent

    Han, Dongge , year =. Multiagent

  13. [13]

    Diversity is All You Need: Learning Skills without a Reward Function

    Eysenbach, Benjamin and Gupta, Abhishek and Ibarz, Julian and Levine, Sergey , month = oct, year =. Diversity is. doi:10.48550/arXiv.1802.06070 , abstract =

  14. [14]

    BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding

    Devlin, Jacob and Chang, Ming-Wei and Lee, Kenton and Toutanova, Kristina , month = may, year =. doi:10.48550/arXiv.1810.04805 , abstract =

  15. [15]

    Chakravorty, Jhelum and Ward, Nadeem and Roy, Julien and Chevalier-Boisvert, Maxime and Basu, Sumana and Lupu, Andrei and Precup, Doina , month = mar, year =. Option-. doi:10.48550/arXiv.1911.12825 , abstract =

  16. [16]

    Bacon, Pierre-Luc and Harb, Jean and Precup, Doina , month = dec, year =. The. doi:10.48550/arXiv.1609.05140 , abstract =

  17. [17]

    Amato, Christopher , month = may, year =. An. doi:10.48550/arXiv.2405.06161 , abstract =

  18. [18]

    Increasing

    Van Waveren, Sanne and Rudling, Rasmus and Leite, Iolanda and Jensfelt, Patric and Pek, Christian , month = mar, year =. Increasing. Proceedings of the 2023. doi:10.1145/3568162.3576966 , abstract =

  19. [19]

    van Waveren, Sanne and Pek, Christian and Tumova, Jana and Leite, Iolanda , month = mar, year =. Correct. Proceedings of the 2022

  20. [20]

    Wang, Weizheng and Obi, Ike and Min, Byung-Cheol , month = mar, year =. Multi-. doi:10.48550/arXiv.2503.09758 , abstract =

  21. [21]

    Unbiased

    Baisero, Andrea and Amato, Christopher , month = jan, year =. Unbiased. doi:10.5555/3535850.3535857 , abstract =

  22. [22]

    doi:10.48550/arXiv.2002.07418 , abstract =

    Zhang, Peng and Hao, Jianye and Wang, Weixun and Tang, Hongyao and Ma, Yi and Duan, Yihai and Zheng, Yan , month = may, year =. doi:10.48550/arXiv.2002.07418 , abstract =

  23. [23]

    , month = nov, year =

    Goyal, Prasoon and Niekum, Scott and Mooney, Raymond J. , month = nov, year =. doi:10.48550/arXiv.2007.15543 , abstract =

  24. [24]

    Stone, Austin and Xiao, Ted and Lu, Yao and Gopalakrishnan, Keerthana and Lee, Kuang-Huei and Vuong, Quan and Wohlhart, Paul and Kirmani, Sean and Zitkovich, Brianna and Xia, Fei and Finn, Chelsea and Hausman, Karol , month = oct, year =. Open-. doi:10.48550/arXiv.2303.00905 , abstract =

  25. [25]

    Interactive

    Liu, Huihan and Chen, Alice and Zhu, Yuke and Swaminathan, Adith and Kolobov, Andrey and Cheng, Ching-An , month = oct, year =. Interactive. doi:10.48550/arXiv.2310.17555 , abstract =

  26. [26]

    arXiv preprint arXiv:2201.07207 , doi =

    Huang, Wenlong and Abbeel, Pieter and Pathak, Deepak and Mordatch, Igor , month = mar, year =. Language. doi:10.48550/arXiv.2201.07207 , abstract =

  27. [27]

    BC-Z: Zero-shot task generalization with robotic imitation learning.arXiv preprint arXiv:2202.02005, 2022

    Jang, Eric and Irpan, Alex and Khansari, Mohi and Kappler, Daniel and Ebert, Frederik and Lynch, Corey and Levine, Sergey and Finn, Chelsea , month = feb, year =. doi:10.48550/arXiv.2202.02005 , abstract =

  28. [28]

    OpenVLA: An Open-Source Vision-Language-Action Model

    Kim, Moo Jin and Pertsch, Karl and Karamcheti, Siddharth and Xiao, Ted and Balakrishna, Ashwin and Nair, Suraj and Rafailov, Rafael and Foster, Ethan and Lam, Grace and Sanketi, Pannag and Vuong, Quan and Kollar, Thomas and Burchfiel, Benjamin and Tedrake, Russ and Sadigh, Dorsa and Levine, Sergey and Liang, Percy and Finn, Chelsea , month = sep, year =. ...

  29. [29]

    Shi, Lucy Xiaoyang and Ichter, Brian and Equi, Michael Robert and Ke, Liyiming and Pertsch, Karl and Vuong, Quan and Tanner, James and Walling, Anna and Wang, Haohuan and Fusai, Niccolo and Li-Bell, Adrian and Driess, Danny and Groom, Lachy and Levine, Sergey and Finn, Chelsea , month = jun, year =. Hi

  30. [30]

    Proceedings of the 2024

    Holk, Simon and Marta, Daniel and Leite, Iolanda , month = mar, year =. Proceedings of the 2024. doi:10.1145/3610977.3634970 , abstract =

  31. [31]

    Interactive

    Brawer, Jake and Ghose, Debasmita and Candon, Kate and Qin, Meiying and Roncone, Alessandro and Vázquez, Marynel and Scassellati, Brian , month = mar, year =. Interactive. Proceedings of the 2023. doi:10.1145/3568162.3576983 , abstract =

  32. [32]

    arXiv.org , author =

    Correcting. arXiv.org , author =

  33. [33]

    No, to the

    Cui, Yuchen and Karamcheti, Siddharth and Palleti, Raj and Shivakumar, Nidhya and Liang, Percy and Sadigh, Dorsa , month = mar, year =. No, to the. Proceedings of the 2023. doi:10.1145/3568162.3578623 , abstract =

  34. [34]

    Clarifying

    Kuehn, Hannah and Santos, Leonardo and Leite, Iolanda , month = mar, year =. Clarifying. Proceedings of the 21st. doi:10.1145/3757279.3785583 , language =

  35. [35]

    Guo, Daya and Yang, Dejian and Zhang, Haowei and Song, Junxiao and Wang, Peiyi and Zhu, Qihao and Xu, Runxin and Zhang, Ruoyu and Ma, Shirong and Bi, Xiao and Zhang, Xiaokang and Yu, Xingkai and Wu, Yu and Wu, Z. F. and Gou, Zhibin and Shao, Zhihong and Li, Zhuoshu and Gao, Ziyi and Liu, Aixin and Xue, Bing and Wang, Bingxuan and Wu, Bochao and Feng, Bei ...

  36. [36]

    Algorithmic Prompt Generation for Diverse Human-like Teaming and Communication with Large Language Models

    Srikanth, Siddharth and Bhatt, Varun and Zhang, Boshen and Hager, Werner and Lewis, Charles Michael and Sycara, Katia P. and Tabrez, Aaquib and Nikolaidis, Stefanos , month = apr, year =. Algorithmic. doi:10.48550/arXiv.2504.03991 , abstract =

  37. [37]

    Rusu, Joel Veness, Marc G

    Mnih, Volodymyr and Kavukcuoglu, Koray and Silver, David and Rusu, Andrei A. and Veness, Joel and Bellemare, Marc G. and Graves, Alex and Riedmiller, Martin and Fidjeland, Andreas K. and Ostrovski, Georg and Petersen, Stig and Beattie, Charles and Sadik, Amir and Antonoglou, Ioannis and King, Helen and Kumaran, Dharshan and Wierstra, Daan and Legg, Shane ...

  38. [38]

    Discounted. Markov. 1994 , note =. doi:10.1002/9780470316887.ch6 , abstract =

  39. [39]

    The Complexity of Decentralized Control of Markov Decision Processes

    Bernstein, Daniel S. and Zilberstein, Shlomo and Immerman, Neil , month = jan, year =. The. doi:10.48550/arXiv.1301.3836 , abstract =

  40. [40]

    Assigning

    Kapoor, Aditya and Freed, Benjamin and Choset, Howie and Schneider, Jeff , month = feb, year =. Assigning. doi:10.48550/arXiv.2408.04295 , abstract =

  41. [41]

    Ryu, Heechang and Shin, Hayong and Park, Jinkyoo , month = nov, year =. Multi-. doi:10.48550/arXiv.1909.12557 , abstract =

  42. [42]

    doi:10.48550/arXiv.2503.02077 , abstract =

    Wang, Ziyan and Zhang, Zhicheng and Fang, Fei and Du, Yali , month = jun, year =. doi:10.48550/arXiv.2503.02077 , abstract =

  43. [43]

    Continuously

    Zhou, Zihan and Fu, Wei and Zhang, Bingliang and Wu, Yi , month = may, year =. Continuously. doi:10.48550/arXiv.2204.02246 , abstract =

  44. [44]

    Maven: Multi-agent variational exploration

    Mahajan, Anuj and Rashid, Tabish and Samvelyan, Mikayel and Whiteson, Shimon , month = jan, year =. doi:10.48550/arXiv.1910.07483 , abstract =

  45. [45]

    Celebrating

    Li, Chenghao and Wang, Tonghan and Wu, Chengjie and Zhao, Qianchuan and Yang, Jun and Zhang, Chongjie , month = nov, year =. Celebrating. doi:10.48550/arXiv.2106.02195 , abstract =

  46. [46]

    Sun, Haochen and Zhang, Shuwen and Niu, Lujie and Ren, Lei and Xu, Hao and Fu, Hao and Zhao, Fangkun and Yuan, Caixia and Wang, Xiaojie , month = sep, year =. Collab-. doi:10.48550/arXiv.2502.20073 , abstract =

  47. [47]

    Kannan, Shyam Sundar and Venkatesh, Vishnunandan L. N. and Min, Byung-Cheol , month = oct, year =. 2024. doi:10.1109/IROS58592.2024.10802322 , abstract =

  48. [48]

    doi:10.48550/arXiv.2405.11106 , abstract =

    Sun, Chuanneng and Huang, Songjun and Pompili, Dario , month = may, year =. doi:10.48550/arXiv.2405.11106 , abstract =

  49. [49]

    Emergence of Grounded Compositional Language in Multi-Agent Populations

    Mordatch, Igor and Abbeel, Pieter , month = jul, year =. Emergence of. doi:10.48550/arXiv.1703.04908 , abstract =

  50. [50]

    Learning Attentional Communication for Multi-Agent Cooperation

    Jiang, Jiechuan and Lu, Zongqing , month = nov, year =. Learning. doi:10.48550/arXiv.1805.07733 , abstract =

  51. [51]

    Interruption

    Cao, Shiye and Moon, Jiwon and Mahmood, Amama and Antony, Victor Nikhil and Xiao, Ziang and Liu, Anqi and Huang, Chien-Ming , month = apr, year =. Interruption. doi:10.48550/arXiv.2501.01568 , abstract =

  52. [52]

    In: 2024 IEEE International Conference on Robotics and Automation (ICRA)

    Mitra, Mukund and Kumar, Gyanig and Chakrabarti, Partha Pratim and Biswas, Pradipta , month = may, year =. Enhanced. 2024. doi:10.1109/ICRA57147.2024.10610595 , abstract =

  53. [53]

    and Sharma, Archit and Pertsch, Karl and Luo, Jianlan and Levine, Sergey and Finn, Chelsea , month = mar, year =

    Shi, Lucy Xiaoyang and Hu, Zheyuan and Zhao, Tony Z. and Sharma, Archit and Pertsch, Karl and Luo, Jianlan and Levine, Sergey and Finn, Chelsea , month = mar, year =. Yell. doi:10.48550/arXiv.2403.12910 , abstract =

  54. [54]

    Peng, Shaoting and Chen, Haonan and Driggs-Campbell, Katherine , month = mar, year =. Towards. doi:10.48550/arXiv.2503.19317 , abstract =

  55. [55]

    Learning to Communicate with Deep Multi-Agent Reinforcement Learning

    Foerster, Jakob N. and Assael, Yannis M. and Freitas, Nando de and Whiteson, Shimon , month = may, year =. Learning to. doi:10.48550/arXiv.1605.06676 , abstract =

  56. [56]

    FollowNet: Robot Navigation by Following Natural Language Directions with Deep Reinforcement Learning

    Shah, Pararth and Fiser, Marek and Faust, Aleksandra and Kew, J. Chase and Hakkani-Tur, Dilek , month = may, year =. doi:10.48550/arXiv.1805.06150 , abstract =

  57. [57]

    Informing

    Spiegel, Benjamin Adin and Yang, Ziyi and Jurayj, William and Bachmann, Ben and Tellex, Stefanie and Konidaris, George , month = aug, year =. Informing

  58. [58]

    and Shah, Ankit and Tellex, Stefanie , month = nov, year =

    Yang, Ziyi and Raman, Shreyas S. and Shah, Ankit and Tellex, Stefanie , month = nov, year =. Plug in the. doi:10.48550/arXiv.2309.09919 , abstract =

  59. [59]

    Grounding

    Liu, Jason Xinyu and Yang, Ziyi and Idrees, Ifrah and Liang, Sam and Schornstein, Benjamin and Tellex, Stefanie and Shah, Ankit , month = oct, year =. Grounding. doi:10.48550/arXiv.2302.11649 , abstract =

  60. [60]

    Learning

    Jia, Mingxi and Huang, Haojie and Zhang, Zhewen and Wang, Chenghao and Zhao, Linfeng and Wang, Dian and Liu, Jason Xinyu and Walters, Robin and Platt, Robert and Tellex, Stefanie , month = jun, year =. Learning. doi:10.48550/arXiv.2406.15677 , abstract =

  61. [61]

    Interpreting human-robot instructions , url =

    Tellex, Stefanie and Arumugam, Dilip and Karamcheti, Siddharth and Gopalan, Nakul and Wong, Lawson LS , month = aug, year =. Interpreting human-robot instructions , url =

  62. [62]

    Cohen, Vanya and Liu, Jason Xinyu and Mooney, Raymond and Tellex, Stefanie and Watkins, David , month = jun, year =. A. doi:10.48550/arXiv.2405.13245 , abstract =

  63. [63]

    Optimistic

    Zhao, Wenshuai and Zhao, Yi and Li, Zhiyuan and Kannala, Juho and Pajarinen, Joni , month = may, year =. Optimistic. doi:10.48550/arXiv.2311.01953 , abstract =

  64. [64]

    Learning

    Wu, Xuefei and Yin, Xiao and Zhu, Yuanyang and Chen, Chunlin , month = jul, year =. Learning. doi:10.48550/arXiv.2507.18867 , abstract =

  65. [65]

    Revisiting

    Fu, Wei and Yu, Chao and Xu, Zelai and Yang, Jiaqi and Wu, Yi , month = aug, year =. Revisiting. doi:10.48550/arXiv.2206.07505 , abstract =

  66. [66]

    2023 , keywords =

    IEEE Transactions on Pattern Analysis and Machine Intelligence , author =. 2023 , keywords =. doi:10.1109/TPAMI.2023.3283537 , abstract =

  67. [67]

    Sutton, Doina Precup, and Satinder Singh

    Between. Artificial Intelligence , author =. 1999 , pages =. doi:10.1016/S0004-3702(99)00052-1 , abstract =

  68. [68]

    Asynchronous

    Yu, Chao and Yang, Xinyi and Gao, Jiaxuan and Chen, Jiayu and Li, Yunfei and Liu, Jijia and Xiang, Yunfei and Huang, Ruixin and Yang, Huazhong and Wu, Yi and Wang, Yu , month = apr, year =. Asynchronous. doi:10.48550/arXiv.2301.03398 , abstract =

  69. [69]

    arXiv preprint arXiv:2003.0670919(2020)

    Peng, Bei and Rashid, Tabish and Witt, Christian A. Schroeder de and Kamienny, Pierre-Alexandre and Torr, Philip H. S. and Böhmer, Wendelin and Whiteson, Shimon , month = may, year =. doi:10.48550/arXiv.2003.06709 , abstract =

  70. [70]

    Flexible

    Klissarov, Martin and Precup, Doina , month = dec, year =. Flexible

  71. [71]

    Attention

    Chunduru, Raviteja and Precup, Doina , month = jan, year =. Attention. doi:10.48550/arXiv.2201.02628 , abstract =

  72. [72]

    doi:10.48550/arXiv.2006.14363 , abstract =

    Li, Chenghao and Ma, Xiaoteng and Zhang, Chongjie and Yang, Jun and Xia, Li and Zhao, Qianchuan , month = jun, year =. doi:10.48550/arXiv.2006.14363 , abstract =

  73. [73]

    https://arxiv.org/pdf/1712.00004 , url =

  74. [74]

    Learnings Options End-to-End for Continuous Action Tasks

    Klissarov, Martin and Bacon, Pierre-Luc and Harb, Jean and Precup, Doina , month = nov, year =. Learnings. doi:10.48550/arXiv.1712.00004 , abstract =

  75. [75]

    and Vian, John , month = may, year =

    Omidshafiei, Shayegan and Amato, Christopher and Liu, Miao and Everett, Michael and How, Jonathan P. and Vian, John , month = may, year =. Scalable accelerated decentralized multi-robot policy search in continuous observation spaces , url =. 2017. doi:10.1109/ICRA.2017.7989106 , abstract =

  76. [76]

    DeepSeek-AI and Guo, Daya and Yang, Dejian and Zhang, Haowei and Song, Junxiao and Zhang, Ruoyu and Xu, Runxin and Zhu, Qihao and Ma, Shirong and Wang, Peiyi and Bi, Xiao and Zhang, Xiaokang and Yu, Xingkai and Wu, Yu and Wu, Z. F. and Gou, Zhibin and Shao, Zhihong and Li, Zhuoshu and Gao, Ziyi and Liu, Aixin and Xue, Bing and Wang, Bingxuan and Wu, Bocha...

  77. [77]

    Shao, Zhihong and Wang, Peiyi and Zhu, Qihao and Xu, Runxin and Song, Junxiao and Bi, Xiao and Zhang, Haowei and Zhang, Mingchuan and Li, Y. K. and Wu, Y. and Guo, Daya , month = apr, year =. doi:10.48550/arXiv.2402.03300 , abstract =

  78. [78]

    Journal of Artificial Intelligence Research , author =

    Optimally. Journal of Artificial Intelligence Research , author =. 2016 , pages =. doi:10.1613/jair.4623 , abstract =

  79. [79]

    IEEE Robotics and Automation Letters , author =

    Heterogeneous. IEEE Robotics and Automation Letters , author =. 2024 , note =. doi:10.1109/LRA.2023.3328448 , abstract =

  80. [80]

    Kannan, Shyam Sundar and Venkatesh, Vishnunandan L. N. and Min, Byung-Cheol , month = mar, year =. doi:10.48550/arXiv.2309.10062 , abstract =

Showing first 80 references.