One Interaction Is Worth a Thousand Guesses: Benchmarking the Interactive Capabilities of Deep Research Agents
Deep research agents powered by Large Language Models (LLMs) can perform multi-step reasoning, web exploration, and long-form report generation. However, existing systems remain largely autonomous, assuming fully specified user intent and evaluating only final outputs. In practice, research goals are often underspecified and evolve during exploration, yet current benchmarks neither model dynamic user feedback nor measure interaction costs. To address this gap, we introduce IDRBench, the first Interactive Deep Research Benchmark for systematically evaluating the interactive capabilities of deep research agents. IDRBench formulates deep research as an interactive process where agents may solicit clarification to better align with user intent. It integrates a modular interactive framework, a scalable reference-grounded user simulator, and an interaction-aware evaluation suite that jointly measures alignment gains and interaction overhead. Experiments on seven representative proprietary and open-weight LLMs show that interaction consistently improves research quality and robustness, while revealing substantial differences in interaction efficiency across models. These findings establish interactive capability as a distinct evaluation dimension and position IDRBench as a reusable benchmark for future user-aligned deep research agents.
Forward citations
Cited by 2 Pith papers
- Co-Evolving Skill Generation and Policy Optimization
The framework estimates the context-dependent marginal utility of candidate skills from reward gaps between matched base and skill-augmented rollouts, then uses those estimates to filter skills and co-train the policy as a skill generator.
- AI for Auto-Research: Roadmap & User Guide
The paper delivers a stage-by-stage roadmap for AI in research, showing reliable assistance in retrieval and tool tasks but fragility in novelty and judgment, advocating human-governed collaboration.