DRACULA: Hunting for the Actions Users Want Deep Research Agents to Execute
Pith reviewed 2026-05-08 06:05 UTC · model grok-4.3
The pith
User action history outperforms stated goals for predicting what deep research agents should do next
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
DRACULA supplies the first collection of user feedback on concrete actions such as adding a datasets section or adjusting report structure. LLM simulations of action selection perform best when they receive the full sequence of a user's prior choices. This pattern supports an intervention that generates new candidate actions from interaction history, which users then prefer over other options. The central result is that action-level preference data exposes unstated goals as a bottleneck and shows how history-based prediction can mitigate it.
What carries the argument
LLM simulation of user action selection that conditions on full selection history drawn from the DRACULA dataset of 8,103 preferences and 5,230 execution judgments
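The history-conditioned simulation can be sketched as below. This is a minimal illustration, not the paper's implementation: the prompt wording, the `Interaction` fields, and the example queries are all invented for the sketch.

```python
from dataclasses import dataclass

@dataclass
class Interaction:
    """One past query with the actions proposed and those the user selected.
    (Field names are illustrative, not DRACULA's actual schema.)"""
    query: str
    candidates: list
    selected: list

def build_simulation_prompt(history: list, query: str, candidates: list) -> str:
    """Assemble an LLM prompt that conditions action prediction on the
    user's full selection history, the signal the paper found most useful."""
    lines = ["You simulate a researcher choosing actions for a Deep Research agent."]
    for i, h in enumerate(history, 1):
        lines.append(f"Past query {i}: {h.query}")
        lines.append(f"  Proposed: {'; '.join(h.candidates)}")
        lines.append(f"  Selected: {'; '.join(h.selected)}")
    lines.append(f"New query: {query}")
    lines.append("Candidate actions:")
    lines.extend(f"  [{j}] {a}" for j, a in enumerate(candidates))
    lines.append("Answer with the indices of the actions the user would select.")
    return "\n".join(lines)

# Invented example interaction, for illustration only.
history = [Interaction("Survey RLHF methods",
                       ["Add a section on datasets", "Shorten the introduction"],
                       ["Add a section on datasets"])]
prompt = build_simulation_prompt(history, "Survey agent benchmarks",
                                 ["Add a section on datasets", "Add related work"])
```

The design choice worth noting: the full per-query selection record is serialized verbatim rather than summarized, which is what distinguishes the history signal from the "self-reported or extrapolated" context signals the paper compares against.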
Load-bearing premise
That preferences observed from nineteen computer science researchers will generalize to other users, and that language-model simulations of choice will match real user behavior without extra handling for unstated goals.
What would settle it
A follow-up user study in which the history-based intervention generates actions that users select at the same or lower rate than a baseline that ignores history.
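Such a follow-up study would reduce to comparing two selection rates, which can be checked with a two-proportion z-test. The counts below are invented purely to make the sketch runnable; they are not from the paper.

```python
from math import sqrt, erf

def two_proportion_z(k1: int, n1: int, k2: int, n2: int):
    """z statistic and two-sided p-value for H0: p1 == p2, using the
    pooled-variance normal approximation."""
    p1, p2 = k1 / n1, k2 / n2
    pooled = (k1 + k2) / (n1 + n2)
    se = sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    z = (p1 - p2) / se
    # Two-sided p-value from the standard normal CDF (via erf).
    p_value = 2 * (1 - 0.5 * (1 + erf(abs(z) / sqrt(2))))
    return z, p_value

# Invented counts: history-based actions selected 140/200 times,
# history-blind baseline actions selected 100/200 times.
z, p = two_proportion_z(140, 200, 100, 200)
```

If the baseline's rate matched or exceeded the intervention's (z near zero or negative), the paper's efficacy claim would be undercut; that is the sense in which this comparison "settles it".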
Original abstract
Scientific Deep Research (DR) agents answer user queries by synthesizing research papers into multi-section reports. User feedback can improve their utility, but existing protocols only score the final report, making it hard to study and learn which intermediate actions DR agents should take to improve reports. We collect DRACULA, the first dataset with user feedback on intermediate actions for DR. Over five weeks, nineteen expert CS researchers ask queries to a DR system that proposes actions (e.g., "Add a section on datasets"). Our users select actions they prefer, then judge whether an output report applied their selections successfully, yielding 8,103 action preferences and 5,230 execution judgments. After confirming a DR agent can execute DRACULA's actions, we study the predictability of user-preferred actions via simulation (how well LLMs predict the actions users select), a step toward learning to generate useful actions. We discover: (1) LLM judges initially struggle to predict action selections, but improve most when using a user's full selection history, rather than self-reported or extrapolated user context signals; (2) Users' selections for the same query differ based on unstated goals, bottlenecking simulation and motivating affordances that let users steer reports; and (3) Our simulation results inform an online intervention that generates new actions based on the user's past interactions, which users pick most often in follow-up studies. Overall, while work extensively studies execution, DRACULA reveals a key challenge is deciding which actions to execute in the first place. We open-source DRACULA's study design, user feedback, and simulation tasks to spur future work on action feedback for long-horizon agents.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces DRACULA, the first dataset of user feedback on intermediate actions for Deep Research (DR) agents. Over five weeks, 19 expert CS researchers interacted with a DR system that proposed actions (e.g., 'Add a section on datasets'); users selected preferred actions and later judged execution success, yielding 8,103 action preferences and 5,230 execution judgments. The authors confirm executability, then simulate predictability with LLMs, finding that models initially struggle but improve most when given a user's full selection history (rather than self-reported or extrapolated context). They identify unstated goals as a bottleneck and use the simulation results to design an online intervention that generates new actions from past interactions; follow-up studies show users select these most often. The work concludes that action selection (not execution) is the key unsolved challenge for DR agents and open-sources the study design, feedback, and simulation tasks.
Significance. If the core empirical patterns hold, the work usefully shifts attention from execution capability (already studied) to the harder problem of deciding which actions to execute, backed by a new open dataset and reproducible simulation tasks. The finding that full history outperforms other signals for LLM prediction, and the demonstration of a deployable intervention, provide concrete starting points for user-aligned long-horizon agents. Open-sourcing the tasks and feedback is a clear strength that enables follow-on work.
major comments (3)
- [Data collection (§3)] All claims about LLM predictability, the bottleneck of unstated goals, and the success of the history-based intervention rest on preferences collected from only 19 CS researchers. This small, homogeneous expert sample likely shares correlated query patterns and unstated goals that the history signal exploits; the paper provides no cross-population validation or discussion of how this affects generalizability to broader users.
- [Simulation results (§4)] The claim that full selection history yields the largest improvement lacks any reported statistical tests, effect sizes, confidence intervals, or controls for query difficulty or user variance. Without these, it is impossible to verify that history is meaningfully superior to the other signals tested.
- [Online intervention and follow-up evaluation] The statement that the history-based intervention 'generates new actions ... which users pick most often in follow-up studies' is presented without details on the number of participants, whether they were new or returning users, the study protocol, or statistical comparison to baselines. This makes the efficacy claim difficult to assess.
minor comments (2)
- [Abstract and §4] The abstract and §4 use the phrase 'improve most' without defining the comparison set or metric; a short table or explicit baseline list would clarify the result.
- [Throughout (e.g., §3.2 and §4)] Notation for action types and judgment scales is introduced without a consolidated table; readers must hunt across sections to map terms to the 8,103 preferences.
Simulated Author's Rebuttal
We thank the referee for their constructive and detailed review. We address each major comment point by point below, indicating planned revisions to enhance the manuscript's rigor and transparency.
Point-by-point responses
-
Referee: All claims about LLM predictability, the bottleneck of unstated goals, and the success of the history-based intervention rest on preferences collected from only 19 CS researchers. This small, homogeneous expert sample likely shares correlated query patterns and unstated goals that the history signal exploits; the paper provides no cross-population validation or discussion of how this affects generalizability to broader users.
Authors: We agree that the participant sample of 19 expert CS researchers is small and homogeneous, which may limit generalizability and could contribute to the observed strength of the history signal through shared unstated goals. This sample was deliberately chosen to obtain high-quality feedback on technically demanding research tasks. In the revised manuscript, we will expand the Limitations section with a dedicated discussion of these issues, including potential biases from correlated patterns and the implications for broader user populations. We will also stress that the open-sourced dataset and study protocol are intended to support future cross-population validation. New data collection for broader validation is outside the scope of the current revision. revision: partial
-
Referee: The claim that full selection history yields the largest improvement lacks any reported statistical tests, effect sizes, confidence intervals, or controls for query difficulty or user variance. Without these, it is impossible to verify that history is meaningfully superior to the other signals tested.
Authors: We thank the referee for highlighting this gap in statistical reporting. In the revised results section, we will add appropriate statistical tests (such as paired comparisons with p-values), effect sizes, and confidence intervals for the differences across context signals. We will further incorporate controls for query difficulty via stratification and for user variance via mixed-effects modeling. These additions will provide quantitative support for the superiority of full selection history. revision: yes
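One concrete form the promised analysis could take is a paired bootstrap over items, which holds per-query difficulty constant within each replicate by resampling items in matched pairs. The per-item correctness vectors below are invented toy data, not the paper's results.

```python
import random

def paired_bootstrap_ci(correct_a, correct_b, n_boot=10_000, alpha=0.05, seed=0):
    """Percentile bootstrap CI for mean(correct_a) - mean(correct_b).
    Items are resampled in pairs, so each replicate compares the two
    predictors on the same queries (controlling per-item difficulty)."""
    rng = random.Random(seed)
    n = len(correct_a)
    diffs = []
    for _ in range(n_boot):
        idx = [rng.randrange(n) for _ in range(n)]
        diffs.append(sum(correct_a[i] - correct_b[i] for i in idx) / n)
    diffs.sort()
    lo = diffs[int(alpha / 2 * n_boot)]
    hi = diffs[int((1 - alpha / 2) * n_boot) - 1]
    return lo, hi

# Toy per-item correctness: history-conditioned predictor (a) vs. baseline (b).
a = [1] * 70 + [0] * 30
b = [1] * 50 + [0] * 50
lo, hi = paired_bootstrap_ci(a, b)
```

A confidence interval excluding zero, reported alongside the raw accuracy gap, would supply exactly the effect-size evidence the referee asks for; mixed-effects modeling over users would be a complementary, stronger control.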
-
Referee: The statement that the history-based intervention 'generates new actions ... which users pick most often in follow-up studies' is presented without details on the number of participants, whether they were new or returning users, study protocol, or statistical comparison to baselines. This makes the efficacy claim difficult to assess.
Authors: We apologize for the insufficient detail in the original submission. We will expand the relevant section to fully describe the follow-up studies, including the number of participants, their status as new or returning users, the complete study protocol, and statistical comparisons (with significance tests and effect sizes) against baselines. This will enable readers to properly evaluate the intervention's efficacy. revision: yes
Circularity Check
No circularity: claims rest on new empirical data collection and user-validated interventions
full rationale
The paper's core contributions derive from a fresh five-week study collecting 8,103 action preferences and 5,230 execution judgments from 19 CS researchers, followed by LLM simulations tested directly against that dataset and follow-up user studies validating an intervention. No equations, fitted parameters, or self-citations are invoked to derive the key findings on history-based predictability or unstated goals. The simulation and intervention steps are externally validated by real user selections rather than reducing to prior outputs or definitions. This is a standard empirical pipeline with open-sourced artifacts, whose derivation chain is grounded in external validation rather than self-reference.
Axiom & Free-Parameter Ledger
axioms (1)
- [domain assumption] Users provide honest and consistent feedback on preferred actions and execution success