arxiv: 2603.26233 · v2 · pith:NV5FI6HMnew · submitted 2026-03-27 · 💻 cs.CL

Ask or Assume? Uncertainty-Aware Clarification-Seeking in Coding Agents

classification 💻 cs.CL
keywords: agents · underspecified · clarification-seeking · current · execution · information · instructions · multi-agent

As Large Language Model (LLM) agents are increasingly deployed in open-ended domains like software engineering, they frequently encounter underspecified instructions that lack crucial context. While human developers naturally resolve underspecification by asking clarifying questions, current agents are largely optimized for autonomous execution. In this work, we systematically evaluate the clarification-seeking abilities of LLM agents on an underspecified variant of SWE-bench Verified. We propose an uncertainty-aware multi-agent scaffold that decouples underspecification detection from code execution. Across both proprietary and open-weight frontier LLMs, our scaffold achieves a 69.40% task resolve rate, significantly outperforming a standard single-agent setup and closing the performance gap with agents operating on fully specified instructions. Furthermore, we find that the multi-agent system exhibits well-calibrated information-seeking behavior, conserving queries on simple tasks while proactively seeking information on more complex issues. These findings indicate that current models can be turned into proactive collaborators, where agents independently recognize when to ask questions to elicit missing information in real-world, underspecified tasks.

Forward citations

Cited by 2 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Don't Start What You Can't Finish: A Counterfactual Audit of Support-State Triage in LLM Agents

    cs.AI · 2026-04 · unverdicted · novelty 7.0

    LLM agents overcommit on non-complete tasks at 41.7% unless given explicit support-state categories, which raise typed deferral accuracy to 91.7%.

  2. AgentAtlas: Beyond Outcome Leaderboards for LLM Agents

    cs.AI · 2026-05 · unverdicted · novelty 6.0

    AgentAtlas defines a six-state control taxonomy and nine-category failure taxonomy, then shows that removing explicit label menus from prompts drops trajectory accuracy 14-40 points to a 0.54-0.62 floor across eight models.