pith. machine review for the scientific record.

arxiv: 2604.09408 · v4 · submitted 2026-04-10 · 💻 cs.AI

Recognition: 3 theorem links

· Lean Theorem

HiL-Bench (Human-in-Loop Benchmark): Do Agents Know When to Ask for Help?

Alessa Castillo, Bing Liu, Charles Wang, Ernesto Hernandez, Fernando Carabedo, Guangze Luo, Kelvin Luu, Mohamed Elfeki, Nandan Marwaha, Nathan Hunt, Tu Trinh, Yannis Yiming He

Pith reviewed 2026-05-10 18:15 UTC · model grok-4.3

classification 💻 cs.AI
keywords AI agents · help-seeking · benchmark · reinforcement learning · uncertainty detection · incomplete specifications · SWE agents · text-to-SQL

The pith

Frontier AI agents exhibit a large universal judgment gap in deciding when to ask for help on incomplete tasks, but this skill is trainable via RL on a shaped Ask-F1 metric.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents HiL-Bench to test agents' selective escalation ability when task specifications contain hidden blockers such as missing information or ambiguity. These blockers are designed to surface only through step-by-step work rather than initial inspection, and the benchmark uses human validation to confirm them. Current models recover only a small portion of their full-information performance because they either guess incorrectly or escalate in ways that do not resolve the issues. The core metric Ask-F1 balances the precision of questions asked against recall of actual blockers, making it resistant to simple spamming. Training a 32B model with reinforcement learning on this reward improves both help-seeking quality and overall task success, and the gains transfer between software engineering and text-to-SQL domains without the model acquiring narrow domain rules.

Core claim

Frontier coding agents solve complex tasks when given complete context but collapse when specifications are incomplete or ambiguous. The bottleneck is not raw capability but judgment: knowing when to act autonomously and when to ask for help. HiL-Bench supplies tasks with human-validated blockers that appear only through progressive exploration. No frontier model recovers more than a fraction of its full-information performance on these tasks. Consistent failure patterns include overconfident wrong beliefs without gap detection, high uncertainty without self-correction, and broad imprecise escalation. RL training on shaped Ask-F1 reward enables a 32B model to detect unresolvable uncertainty, improving both help-seeking quality and task pass rate, with gains that transfer across domains.

What carries the argument

HiL-Bench benchmark with human-validated blockers that surface only through progressive exploration, evaluated by the Ask-F1 metric defined as the harmonic mean of question precision and blocker recall.
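The harmonic-mean structure can be made concrete with a minimal sketch. The `EpisodeTrace` record, its field names, and the convention that a silent run scores 0 are illustrative assumptions; the paper's own matching procedure for deciding whether a question addresses a blocker is not reproduced here.

```python
from dataclasses import dataclass


@dataclass
class EpisodeTrace:
    """Hypothetical record of one agent run on one task (not the paper's schema)."""
    questions_asked: int      # total questions the agent sent to the human
    questions_on_target: int  # questions judged to address a real blocker
    blockers_total: int       # human-validated blockers in the task
    blockers_resolved: int    # distinct blockers covered by at least one question


def ask_f1(trace: EpisodeTrace) -> float:
    """Harmonic mean of question precision and blocker recall."""
    if trace.questions_asked == 0 or trace.blockers_total == 0:
        return 0.0  # assumption: silent runs and blocker-free tasks score 0
    precision = trace.questions_on_target / trace.questions_asked
    recall = trace.blockers_resolved / trace.blockers_total
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Three stylised agents on a task with three blockers.
silent = EpisodeTrace(0, 0, 3, 0)     # guesses, never asks
targeted = EpisodeTrace(3, 3, 3, 3)   # one precise question per blocker
spammer = EpisodeTrace(30, 3, 3, 3)   # full recall, but 27 wasted questions

scores = {
    name: round(ask_f1(trace), 3)
    for name, trace in [("silent", silent), ("targeted", targeted), ("spammer", spammer)]
}
print(scores)  # {'silent': 0.0, 'targeted': 1.0, 'spammer': 0.182}
```

Under these assumptions the spamming agent reaches perfect recall yet scores far below the targeted one, because precision enters the harmonic mean with equal weight.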

If this is right

  • Benchmarks that supply complete and unambiguous instructions systematically miss a central limitation of current agents in realistic settings.
  • Poor help-seeking is a model-level property rather than a task-specific artifact, appearing consistently in both software engineering and text-to-SQL evaluations.
  • Reinforcement learning on Ask-F1 can raise both the quality of escalation decisions and downstream task performance without requiring domain-specific fine-tuning.
  • The learned behavior transfers across domains, indicating the model acquires a general capacity to detect unresolvable uncertainty rather than rote patterns for when to ask.
  • Task success improves when judgment is trained directly, showing that escalation skill and execution capability can be advanced together.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Similar training regimes could reduce silent errors in deployed agents that currently guess on unclear user requests.
  • The approach might extend to other agent domains such as scientific literature synthesis or medical reasoning where specifications are routinely incomplete.
  • General uncertainty detection learned this way could also lower rates of overconfident outputs in non-agent language models.
  • If the metric proves robust, it offers a pathway to evaluate and improve multi-turn human-AI collaboration beyond single-shot execution.

Load-bearing premise

The human-validated blockers that emerge only during progressive exploration accurately capture the incomplete specifications agents encounter in practice and cannot be gamed by models lacking genuine uncertainty detection.

What would settle it

A test set of tasks using blocker types or exploration depths absent from the RL training data, where the trained model shows no gains in Ask-F1 or task pass rate compared with the base model.
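One way to build such a test set is a leave-one-blocker-type-out split. The sketch below assumes a hypothetical task schema in which each task lists the blocker categories it contains; the three category names come from the abstract, while `Task` and `leave_one_type_out` are illustrative names, not part of any released HiL-Bench tooling.

```python
from typing import Iterator, List, Set, Tuple, TypedDict

# Blocker categories named in the abstract.
BLOCKER_TYPES = ("missing", "ambiguous", "contradictory")


class Task(TypedDict):
    """Hypothetical task record; the real benchmark schema may differ."""
    id: str
    blocker_types: Set[str]


def leave_one_type_out(tasks: List[Task]) -> Iterator[Tuple[str, List[Task], List[Task]]]:
    """Yield (held_out_type, train_tasks, test_tasks) for each blocker type.

    Any task containing the held-out type goes to the test split, so the
    training split never exposes that type to the policy.
    """
    for held_out in BLOCKER_TYPES:
        train = [t for t in tasks if held_out not in t["blocker_types"]]
        test = [t for t in tasks if held_out in t["blocker_types"]]
        yield held_out, train, test


tasks: List[Task] = [
    {"id": "t1", "blocker_types": {"missing"}},
    {"id": "t2", "blocker_types": {"ambiguous", "missing"}},
    {"id": "t3", "blocker_types": {"contradictory"}},
    {"id": "t4", "blocker_types": {"ambiguous"}},
]

splits = {
    held_out: ([t["id"] for t in train], [t["id"] for t in test])
    for held_out, train, test in leave_one_type_out(tasks)
}
print(splits["missing"])  # (['t3', 't4'], ['t1', 't2'])
```

Comparing Ask-F1 and pass rate of the base and RL-trained models on each `test` split would then show whether gains survive on a blocker type the policy never saw.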

Figures

Figures reproduced from arXiv: 2604.09408 by Alessa Castillo, Bing Liu, Charles Wang, Ernesto Hernandez, Fernando Carabedo, Guangze Luo, Kelvin Luu, Mohamed Elfeki, Nandan Marwaha, Nathan Hunt, Tu Trinh, Yannis Yiming He.

Figure 1: (A) Models achieve 75–89% pass@3 with complete information but only 4–24% when they must …
Figure 2: Example agent evaluation workflow. Each HiL-Bench task contains multiple blockers that surface as the agent explores the task environment. At each point, the agent must judge whether to ask or proceed. Ask-F1 scores both detection (recall) and targeting (precision). … the gap becomes apparent. Contradictory-information blockers show the smallest drop, consistent with conflicting statements often appearing in…
Figure 3: Failure fingerprints reveal distinct judgment signatures across model families in SQL. (A) Baseline failure modes (within-dimension percentages). GPT models are accuracy-dominant. Claude is completion- and self-assessment-dominant. Gemini shows high logic self-assessment failures. (B) Failure distribution shifts with ask human(). Gemini inverts dramatically in tool-use (−38pp completion, +44pp accuracy). C…
Figure 4: RLVR closes the judgment gap and transfers across domains. (A) Precision–recall space. Arrows show RLVR shifts base models toward calibrated help-seeking (solid: in-domain; dashed: cross-domain). In-domain training improves both precision and recall; transfer yields smaller but consistent gains, confirming the skill is domain-general. (B) Ask-F1 and Pass@3 improve in lockstep. Dumbbells indicate base-to-RL…
Figure 5: The Judgment Matrix. Most agents are in the bottom left box. Every deployed agent occupies one cell in a simple matrix (…
Figure 6: Domain shifts in failure attribution. Each cell shows the difference in percentage points between …
Figure 7: Failure modes in SWE. Compared to SQL, distributions congregate towards the same failure …
Figure 8: Average tokens used per task, per model. Model behaviors described above are also reflected here; …
Original abstract

Frontier coding agents solve complex tasks when given complete context but collapse when specifications are incomplete or ambiguous. The bottleneck is not raw capability, but judgment: knowing when to act autonomously and when to ask for help. Current benchmarks are blind to this failure mode. They supply unambiguous detailed instructions and solely reward execution correctness, so an agent that makes a lucky guess for a missing requirement will score identically to one that would have asked to be certain. We present HiL-Bench (Human-in-the-Loop Benchmark) to measure this selective escalation skill. Each task contains human-validated blockers (missing information, ambiguous requests, contradictory information) that surface only through progressive exploration, not upfront inspection. Our core metric, Ask-F1, the harmonic mean of question precision and blocker recall, captures the tension between over-asking and silent guessing; its structure architecturally prevents gaming through question spam. Evaluation across SWE and text-to-SQL domains reveals a large universal judgment gap: no frontier model recovers more than a fraction of its full-information performance when deciding whether to ask. Failure analysis identifies three key help-seeking patterns: overconfident wrong beliefs with no gap detection; high uncertainty detection yet persistent errors; broad, imprecise escalation without self-correction. These consistent patterns confirm poor help-seeking is a model-level flaw, not task-specific. RL training on shaped Ask-F1 reward shows judgment is trainable: a 32B model improves both help-seeking quality and task pass rate, with gains that transfer across domains. The model does not learn domain-specific heuristics for when to ask; it learns to detect unresolvable uncertainty and act on it.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript introduces HiL-Bench, a benchmark for measuring agents' selective escalation skill when task specifications contain human-validated blockers (missing, ambiguous, or contradictory information) that surface only via progressive exploration. It defines Ask-F1 as the harmonic mean of question precision and blocker recall, evaluates frontier models across SWE and text-to-SQL domains, reports a large universal judgment gap relative to full-information baselines, catalogs three consistent failure patterns, and shows that RL training with shaped Ask-F1 reward improves both help-seeking quality and task pass rate with cross-domain transfer on a 32B model.

Significance. If the human-validated blockers are representative and Ask-F1 robustly isolates genuine uncertainty detection, the work identifies a critical, previously unmeasured limitation in frontier agents and provides positive evidence that judgment is trainable rather than purely architectural. The cross-domain transfer result and explicit focus on human-in-the-loop evaluation are clear strengths that could inform safer agent design.

major comments (2)
  1. [Abstract] Abstract: the central claim that blockers 'surface only through progressive exploration, not upfront inspection' and cannot be gamed by surface cues is load-bearing for both the judgment-gap result and the RL-transfer conclusion, yet the manuscript provides no quantitative validation (e.g., inter-annotator agreement, adversarial model probes, or alternative blocker distributions) to rule out pattern-matching strategies that satisfy recall without true uncertainty detection.
  2. [Evaluation] Evaluation section: the assertion that 'no frontier model recovers more than a fraction of its full-information performance' and that RL yields transferable gains rests on the specific blocker distribution; without ablation on held-out blocker sets or comparison to non-RL baselines that also optimize for Ask-F1, the claim that the 32B model 'learns to detect unresolvable uncertainty' rather than domain heuristics remains untested.
minor comments (2)
  1. [Abstract] Abstract: Ask-F1 is introduced as 'the harmonic mean of question precision and blocker recall' without an explicit formula or weighting; adding the equation would clarify how the metric enforces the precision-recall trade-off.
  2. [Failure analysis] Failure analysis: the three listed help-seeking patterns are described qualitatively; reporting their relative frequencies or providing one concrete trace per pattern would strengthen the 'consistent patterns' and 'model-level flaw' claims.
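
The formula requested in the first minor comment can be sketched as follows. The set notation is an assumption introduced for illustration: $Q$ is the set of questions the agent asks, $Q^{+} \subseteq Q$ the questions that address some validated blocker, $B$ the set of validated blockers in the task, and $B^{+} \subseteq B$ the blockers addressed by at least one question. Only the harmonic-mean form is taken from the abstract.

```latex
P = \frac{|Q^{+}|}{|Q|}, \qquad
R = \frac{|B^{+}|}{|B|}, \qquad
\text{Ask-F1} = \frac{2PR}{P + R}.

% Question spam: n questions, k of them on target, all blockers covered (R = 1).
P = \frac{k}{n}
\;\Longrightarrow\;
\text{Ask-F1} = \frac{2P}{1 + P} = \frac{2k}{n + k}
\;\xrightarrow[\;n \to \infty\;]{}\; 0
\quad \text{for fixed } k.
```

Under these definitions the metric is unweighted (precision and recall count equally), and adding off-target questions can only lower the score once recall is saturated.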

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback, which highlights important aspects of benchmark validation and evaluation robustness. We address each major comment below and have revised the manuscript to incorporate additional quantitative details and analyses.

Point-by-point responses
  1. Referee: [Abstract] Abstract: the central claim that blockers 'surface only through progressive exploration, not upfront inspection' and cannot be gamed by surface cues is load-bearing for both the judgment-gap result and the RL-transfer conclusion, yet the manuscript provides no quantitative validation (e.g., inter-annotator agreement, adversarial model probes, or alternative blocker distributions) to rule out pattern-matching strategies that satisfy recall without true uncertainty detection.

    Authors: The blockers were constructed via a multi-stage human annotation process in which domain experts (practitioners in software engineering and SQL) followed explicit guidelines requiring that each blocker be unresolvable from the initial specification alone and only discoverable through progressive tool use or clarification. The annotation protocol and representative examples are provided in the appendix. We agree that formal quantitative validation was not reported in the original submission. In the revised manuscript we add inter-annotator agreement statistics computed on a held-out subset of tasks and an adversarial probe in which models receive only surface-level task text with no exploration capability; these models achieve near-zero blocker recall, supporting that the judgment gap is not explained by pattern matching on surface cues. revision: yes

  2. Referee: [Evaluation] Evaluation section: the assertion that 'no frontier model recovers more than a fraction of its full-information performance' and that RL yields transferable gains rests on the specific blocker distribution; without ablation on held-out blocker sets or comparison to non-RL baselines that also optimize for Ask-F1, the claim that the 32B model 'learns to detect unresolvable uncertainty' rather than domain heuristics remains untested.

    Authors: The reported results use a diverse blocker distribution spanning missing, ambiguous, and contradictory information across two domains, with consistent failure patterns observed in every frontier model. The cross-domain transfer of RL gains already supplies evidence that the learned behavior is not limited to domain-specific heuristics. We nevertheless agree that explicit ablations would strengthen the claim. The revised manuscript adds (1) an ablation training the RL policy on all but one blocker type and evaluating on the held-out type, and (2) a direct comparison against a supervised fine-tuning baseline that optimizes Ask-F1 labels without RL; the supervised baseline shows weaker transfer and lower final Ask-F1, consistent with the interpretation that RL enables detection of unresolvable uncertainty. revision: yes
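
The first response promises inter-annotator agreement statistics without naming one. As an illustration only, the sketch below computes Cohen's kappa for two annotators who each label candidate blockers as valid or invalid; the choice of kappa, the label names, and the toy data are assumptions, not details from the manuscript.

```python
from collections import Counter
from typing import Hashable, Sequence


def cohens_kappa(a: Sequence[Hashable], b: Sequence[Hashable]) -> float:
    """Chance-corrected agreement between two annotators over the same items."""
    if len(a) != len(b) or len(a) == 0:
        raise ValueError("label sequences must be non-empty and of equal length")
    n = len(a)
    # Fraction of items on which the two annotators agree.
    observed = sum(x == y for x, y in zip(a, b)) / n
    # Agreement expected by chance from each annotator's label frequencies.
    freq_a, freq_b = Counter(a), Counter(b)
    expected = sum(freq_a[label] * freq_b[label] for label in freq_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both annotators used one identical label throughout
    return (observed - expected) / (1 - expected)


# Toy labels for six candidate blockers (illustrative data only).
annotator_1 = ["valid", "valid", "invalid", "valid", "invalid", "valid"]
annotator_2 = ["valid", "invalid", "invalid", "valid", "invalid", "valid"]

kappa = cohens_kappa(annotator_1, annotator_2)
print(round(kappa, 3))  # 0.667
```

With more than two annotators a multi-rater statistic would be needed instead; the two-rater case is shown only because it is the simplest to verify by hand.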

Circularity Check

0 steps flagged

No significant circularity: empirical benchmark construction and RL experiment

Full rationale

The paper presents HiL-Bench as a human-validated benchmark for measuring agent help-seeking judgment via the Ask-F1 metric (harmonic mean of precision and recall), evaluates frontier models across SWE and text-to-SQL domains, and demonstrates RL improvements on shaped Ask-F1 reward. No equations, derivations, or first-principles results are present that reduce to inputs by construction. Claims rest on external human validation of blockers and cross-domain empirical transfer, not self-citations, fitted parameters renamed as predictions, or definitional loops. The metric design and RL setup are standard and externally falsifiable. This is self-contained empirical work with no load-bearing circular steps.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The central claims rest on the assumption that human-validated blockers accurately simulate real incomplete specifications and that Ask-F1 is the right way to score selective escalation. No free parameters or invented entities are introduced in the abstract.

axioms (1)
  • domain assumption: Human validators can reliably identify blockers that only surface through progressive exploration and not upfront inspection.
    Stated in the abstract as the basis for task construction.

pith-pipeline@v0.9.0 · 5637 in / 1224 out tokens · 29208 ms · 2026-05-10T18:15:30.939683+00:00 · methodology

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

Forward citations

Cited by 2 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Ask Early, Ask Late, Ask Right: When Does Clarification Timing Matter for Long-Horizon Agents?

    cs.CL 2026-05 unverdicted novelty 7.0

    Goal clarifications lose nearly all value after 10% of execution while input clarifications retain value until roughly 50%, and asking any type past mid-trajectory hurts performance more than never asking.

  2. Don't Start What You Can't Finish: A Counterfactual Audit of Support-State Triage in LLM Agents

    cs.AI 2026-04 unverdicted novelty 7.0

    LLM agents overcommit on non-complete tasks at 41.7% unless given explicit support-state categories, which raise typed deferral accuracy to 91.7%.
