pith. machine review for the scientific record. sign in

arxiv: 2604.23815 · v1 · submitted 2026-04-26 · 💻 cs.CL

Recognition: unknown

DRACULA: Hunting for the Actions Users Want Deep Research Agents to Execute

Aakanksha Naik, Amanpreet Singh, Doug Downey, Eunsol Choi, Jordan Lee Boyd-Graber, Joseph Chee Chang, Malachi Hamada, Nishant Balepur, Pao Siangliulue, Rachel Rudinger, Sergey Feldman, Varsha Kishore

Authors on Pith no claims yet

Pith reviewed 2026-05-08 06:05 UTC · model grok-4.3

classification 💻 cs.CL
keywords deep research agentsaction preferencesuser feedbackLLM simulationintermediate actionsselection historyunstated goals
0
0 comments X

The pith

User action history outperforms stated goals for predicting what deep research agents should do next

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces DRACULA, a dataset of user preferences for specific intermediate actions that deep research agents should take when building multi-section reports from papers. It shows that language models initially predict user choices poorly but improve most when conditioned on a user's complete record of prior selections instead of self-reported goals or other signals. Users frequently select different actions for the same query because of unstated personal objectives, which limits simulation accuracy. These findings lead to an online system that proposes fresh actions drawn from past interactions, and users select those proposals at higher rates in follow-up tests. The work argues that the main remaining obstacle for such agents is choosing which actions to perform rather than executing them.

Core claim

DRACULA supplies the first collection of user feedback on concrete actions such as adding a datasets section or adjusting report structure. LLM simulations of action selection perform best when they receive the full sequence of a user's prior choices. This pattern supports an intervention that generates new candidate actions from interaction history, which users then prefer over other options. The central result is that action-level preference data exposes unstated goals as a bottleneck and shows how history-based prediction can mitigate it.

What carries the argument

LLM simulation of user action selection that conditions on full selection history drawn from the DRACULA dataset of 8103 preferences and 5230 execution judgments

Load-bearing premise

Preferences observed from nineteen computer science researchers will generalize to other users and that language-model simulations of choice will match real user behavior without extra handling for unstated goals.

What would settle it

A follow-up user study in which the history-based intervention generates actions that users select at the same or lower rate than a baseline that ignores history.

Figures

Figures reproduced from arXiv: 2604.23815 by Aakanksha Naik, Amanpreet Singh, Doug Downey, Eunsol Choi, Jordan Lee Boyd-Graber, Joseph Chee Chang, Malachi Hamada, Nishant Balepur, Pao Siangliulue, Rachel Rudinger, Sergey Feldman, Varsha Kishore.

Figure 1
Figure 1. Figure 1: Overview of DRACULA’s feedback. Rather than only judging reports, we reveal actions users want DR to take via two steps: (1) LLMs propose actions that modify reports for users to pick from; and (2) the MyScholarQA DR agent writes a report following selected actions that users judge for execution quality. Over 450 hours, we curate 13,333 query–action-judgment pairs from 19 researchers. cuted effectively ( view at source ↗
Figure 2
Figure 2. Figure 2: Action generation and execution success over action types. Models cannot infer all the DR decisions users find useful. We also note despite prior work showing the ben￾efits of using papers as user context (Mysore et al., 2023), paper actions are selected less than generic. §5 tests this further by studying more user contexts. As initial analyses, we group action selection rates by user intent, action type,… view at source ↗
Figure 3
Figure 3. Figure 3: Breakdown of action selection rates across query intent, model, and qualitative type. Actions on content and style are selected most often, and LLMs struggle to produce actions for writing intents. 3.2 Qualitative Analysis: An Initial Taste of Action Generation To study and better illustrate the data in DRACULA, we run qualitative analyses under two research questions: RQ1—Which actions do users select? an… view at source ↗
Figure 4
Figure 4. Figure 4: Macro F1 score of five few-shot LLM judges on action prediction and a random binary clas￾sifier (50/50). F1 does not largely exceed random on generic actions, showing room for improvement. describes the task and requests a JSON with a 0/1 prediction and explanation (Prompts A.10), and uses 1000 few-shot (q, a, lselect) training examples randomly sampled across all users. F1 scores do not largely exceed a r… view at source ↗
Figure 5
Figure 5. Figure 5: The Macro-F1 gap between user stability and GPT-5 predictions points to two user types: users with stable preferences GPT cannot reliably capture (e.g., user 3, user 8) and users with less stable preferences where affordances like actions are vital (e.g., user 5). Given our improved action prediction scores in §4.3, we now test whether our LLMs are at the limit of what is inherently predictable in DRAC￾ULA… view at source ↗
Figure 6
Figure 6. Figure 6: Interface overview for action generation in DRACULA. After annotators issue a query, they see a set of actions—high-level decisions on how the DR system could construct the report. Actions are grouped by how they will impact the report qualitatively. Our annotators select the actions they personally want the system to take and provide a rationale, forming action selection feedback. 22 view at source ↗
Figure 7
Figure 7. Figure 7: Interface overview for action execution in DRACULA. Given the query and actions selected by the annotator, we query ScholarQA to generate a report. Annotators read the report and judge whether the system executed each action well (upvote) or poorly (downvote or neutral vote) based on their own personal satisfaction and a rationale, forming action execution feedback. To facilitate this annotation, we modify… view at source ↗
read the original abstract

Scientific Deep Research (DR) agents answer user queries by synthesizing research papers into multi-section reports. User feedback can improve their utility, but existing protocols only score the final report, making it hard to study and learn which intermediate actions DR agents should take to improve reports. We collect DRACULA, the first dataset with user feedback on intermediate actions for DR. Over five weeks, nineteen expert CS researchers ask queries to a DR system that proposes actions (e.g., "Add a section on datasets"). Our users select actions they prefer, then judge whether an output report applied their selections successfully, yielding 8,103 action preferences and 5,230 execution judgments. After confirming a DR agent can execute DRACULA's actions, we study the predictability of user-preferred actions via simulation-how well LLMs predict the actions users select-a step toward learning to generate useful actions. We discover: (1) LLM judges initially struggle to predict action selections, but improve most when using a user's full selection history, rather than self-reported or extrapolated user context signals; (2) Users' selections for the same query differ based on unstated goals, bottlenecking simulation and motivating affordances that let users steer reports; and (3) Our simulation results inform an online intervention that generates new actions based on the user's past interactions, which users pick most often in follow-up studies. Overall, while work extensively studies execution, DRACULA reveals a key challenge is deciding which actions to execute in the first place. We open-source DRACULA's study design, user feedback, and simulation tasks to spur future work on action feedback for long-horizon agents.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The paper introduces DRACULA, the first dataset of user feedback on intermediate actions for Deep Research (DR) agents. Over five weeks, 19 expert CS researchers interacted with a DR system that proposed actions (e.g., 'Add a section on datasets'); users selected preferred actions and later judged execution success, yielding 8,103 action preferences and 5,230 execution judgments. The authors confirm executability, then simulate predictability with LLMs, finding that models initially struggle but improve most when given a user's full selection history (rather than self-reported or extrapolated context). They identify unstated goals as a bottleneck and use the simulation results to design an online intervention that generates new actions from past interactions; follow-up studies show users select these most often. The work concludes that action selection (not execution) is the key unsolved challenge for DR agents and open-sources the study design, feedback, and simulation tasks.

Significance. If the core empirical patterns hold, the work usefully shifts attention from execution capability (already studied) to the harder problem of deciding which actions to execute, backed by a new open dataset and reproducible simulation tasks. The finding that full history outperforms other signals for LLM prediction, and the demonstration of a deployable intervention, provide concrete starting points for user-aligned long-horizon agents. Open-sourcing the tasks and feedback is a clear strength that enables follow-on work.

major comments (3)
  1. [Data collection (§3)] Data collection (§3): All claims about LLM predictability, the bottleneck of unstated goals, and the success of the history-based intervention rest on preferences collected from only 19 CS researchers. This small, homogeneous expert sample likely shares correlated query patterns and unstated goals that the history signal exploits; the paper provides no cross-population validation or discussion of how this affects generalizability to broader users.
  2. [Simulation results (results section describing LLM judges)] Simulation results (results section describing LLM judges): The claim that full selection history yields the largest improvement lacks any reported statistical tests, effect sizes, confidence intervals, or controls for query difficulty or user variance. Without these, it is impossible to verify that history is meaningfully superior to the other signals tested.
  3. [Online intervention and follow-up evaluation] Online intervention and follow-up evaluation: The statement that the history-based intervention 'generates new actions ... which users pick most often in follow-up studies' is presented without details on the number of participants, whether they were new or returning users, study protocol, or statistical comparison to baselines. This makes the efficacy claim difficult to assess.
minor comments (2)
  1. [Abstract and §4] The abstract and §4 use the phrase 'improve most' without defining the comparison set or metric; a short table or explicit baseline list would clarify the result.
  2. [Throughout (e.g., §3.2 and §4)] Notation for action types and judgment scales is introduced without a consolidated table; readers must hunt across sections to map terms to the 8,103 preferences.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for their constructive and detailed review. We address each major comment point by point below, indicating planned revisions to enhance the manuscript's rigor and transparency.

read point-by-point responses
  1. Referee: All claims about LLM predictability, the bottleneck of unstated goals, and the success of the history-based intervention rest on preferences collected from only 19 CS researchers. This small, homogeneous expert sample likely shares correlated query patterns and unstated goals that the history signal exploits; the paper provides no cross-population validation or discussion of how this affects generalizability to broader users.

    Authors: We agree that the participant sample of 19 expert CS researchers is small and homogeneous, which may limit generalizability and could contribute to the observed strength of the history signal through shared unstated goals. This sample was deliberately chosen to obtain high-quality feedback on technically demanding research tasks. In the revised manuscript, we will expand the Limitations section with a dedicated discussion of these issues, including potential biases from correlated patterns and the implications for broader user populations. We will also stress that the open-sourced dataset and study protocol are intended to support future cross-population validation. New data collection for broader validation is outside the scope of the current revision. revision: partial

  2. Referee: The claim that full selection history yields the largest improvement lacks any reported statistical tests, effect sizes, confidence intervals, or controls for query difficulty or user variance. Without these, it is impossible to verify that history is meaningfully superior to the other signals tested.

    Authors: We thank the referee for highlighting this gap in statistical reporting. In the revised results section, we will add appropriate statistical tests (such as paired comparisons with p-values), effect sizes, and confidence intervals for the differences across context signals. We will further incorporate controls for query difficulty via stratification and for user variance via mixed-effects modeling. These additions will provide quantitative support for the superiority of full selection history. revision: yes

  3. Referee: The statement that the history-based intervention 'generates new actions ... which users pick most often in follow-up studies' is presented without details on the number of participants, whether they were new or returning users, study protocol, or statistical comparison to baselines. This makes the efficacy claim difficult to assess.

    Authors: We apologize for the insufficient detail in the original submission. We will expand the relevant section to fully describe the follow-up studies, including the number of participants, their status as new or returning users, the complete study protocol, and statistical comparisons (with significance tests and effect sizes) against baselines. This will enable readers to properly evaluate the intervention's efficacy. revision: yes

Circularity Check

0 steps flagged

No circularity: claims rest on new empirical data collection and user-validated interventions

full rationale

The paper's core contributions derive from a fresh five-week study collecting 8,103 action preferences and 5,230 execution judgments from 19 CS researchers, followed by LLM simulations tested directly against that dataset and follow-up user studies validating an intervention. No equations, fitted parameters, or self-citations are invoked to derive the key findings on history-based predictability or unstated goals. The simulation and intervention steps are externally validated by real user selections rather than reducing to prior outputs or definitions. This is a standard empirical pipeline with open-sourced artifacts, making the derivation chain self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

Based solely on the abstract, the work relies on standard user study assumptions with no mentioned free parameters or invented entities.

axioms (1)
  • domain assumption Users provide honest and consistent feedback on preferred actions and execution success
    The data collection and simulation depend on participants accurately selecting and judging actions in the study.

pith-pipeline@v0.9.0 · 5652 in / 1296 out tokens · 100100 ms · 2026-05-08T06:05:30.207160+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Reference graph

Works this paper leans on

16 extracted references · 3 canonical work pages

  1. [1]

    URL https://aclanthology.org/2021

    Association for Computational Linguistics. URL https://aclanthology.org/2021. humeval-1.0/. Andrea J Bingham. From data management to actionable findings: A five-phase process of qualitative data analysis.International journal of qualitative methods, 22:16094069231183620, 2023. Marcel Binz, Elif Akata, Matthias Bethge, Franziska Brandle, Frederick Callawa...

  2. [3]

    13 Preprint

    URLhttps://openreview.net/forum?id=VKGTGGcwl6. 13 Preprint. Under review. Jakub Lála, Odhran O’Donoghue, Aleksandar Shtedritski, Sam Cox, Samuel G Rodriques, and Andrew D White. Paperqa: Retrieval-augmented generative agent for scientific research.arXiv preprint arXiv:2312.07559, 2023. Michelle S Lam, Janice Teoh, James A Landay, Jeffrey Heer, and Michael...

  3. [4]

    Metacognitive Prompting Improves Understanding in Large Language Models

    URLhttps://aclanthology.org/2025.acl-long.397/. Paul A Samuelson. A note on the pure theory of consumer’s behaviour: an addendum. Economica, 5(19):353, 1938. Preethi Seshadri, Samuel Cahyawijaya, Ayomide Odumakinde, Sameer Singh, and Seraphina Goldfarb-Tarrant. Lost in simulation: LLM-simulated users are unreliable prox- ies for human users in agentic eva...

  4. [5]

    retrieval: searches for research papers

  5. [6]

    organization: outlines sections for the final response to include

  6. [7]

    Each personalization strategy should specify two requirements:

    generation: produces text for each of these sections To help PersonalizedQA personalize responses based on the user’s information, come up with a list of personalization strategies that the system should follow. Each personalization strategy should specify two requirements:

  7. [8]

    What kind of response the user will experience (Qualitative Personalization)

  8. [9]

    strategy

    How the system should behave at each step (Implementation Personalization) The qualitative personalization label is based on how the response will be personalized to the user at a qualitative level: <qualitative personalization strategies> [description of Content, Style, ...] </qualitative personalization strategies> The implementation personalization lab...

  9. [10]

    Assign it based on the **primary report-modification strategy**

  10. [11]

    Prefer an existing cluster over creating a new one

  11. [12]

    engineering

    Create a new cluster only if the action represents a genuinely new type of strategic modification. ### List of actions (id in brackets, one per line) [insert actions] 35 Preprint. Under review. Prompt A.7: Rationale Cluster Generation Prompt ## Task: Cluster User Rationales for REJECTED Actions (action_score=0) You are analyzing a deep research system whe...

  12. [13]

    **Group** user clusters that represent the **same high-level report-modification strategy** into one merged cluster

  13. [14]

    **Assign** each user cluster (by its id) to exactly one merged cluster

  14. [15]

    **Name** each merged cluster in camel_case, with a clear global ‘cluster_name‘ and ‘cluster_description‘ that captures the unified strategy

  15. [16]

    merged_clusters

    Every user cluster id must appear in exactly one merged cluster’s ‘cited_user_cluster_ids‘ list. ### Input: Per-User Clusters (id→name, description) Each line describes one cluster from one user. Format: ‘[id] cluster_name — cluster_description‘ [insert clusters] ### Output Format Respond with a JSON object with a single key ‘"merged_clusters"‘ whose valu...

  16. [17]

    explanation

    "explanation" - an explanation for your decision. The format is as follows: [insert JSON] Do not generate anything else </format> Prompt A.13: Action Execution Prompt <task> You are an expert at evaluating which actions a user would want a Deep Research system to take. <context> A user asked a Deep Research system called MyPaperQA a question related to re...