pith. machine review for the scientific record.

arxiv: 2603.16120 · v2 · submitted 2026-03-17 · 💻 cs.CL

Recognition: no theorem link

Language Models Don't Know What You Want: Evaluating Personalization in Deep Research Needs Real Users

Authors on Pith · no claims yet

Pith reviewed 2026-05-15 10:30 UTC · model grok-4.3

classification 💻 cs.CL
keywords personalization · deep research · LLM judges · real-user studies · research agents · synthetic benchmarks · user evaluation · NLP

The pith

Personalization in deep research systems requires real-user evaluation because LLM judges miss key errors

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Deep Research tools synthesize papers to answer queries but typically lack user-specific understanding. MySQA infers a research profile from user input, proposes tailored actions, and generates multi-section reports aligned with approved actions. Synthetic benchmarks with LLM judges show MySQA outperforming baselines on citation quality and action adherence. Real-user interviews, however, expose nine nuanced personalization failures that these judges cannot detect, including mismatches in interest alignment and report structure. The paper concludes that reliable progress on personalization depends on direct input from actual users rather than automated proxies alone.

Core claim

MySQA beats baselines when evaluated with synthetic users and LLM judges on metrics such as citations and personalized action following, yet the same system reveals nine distinct personalization errors during live user interviews that automated judges overlook, establishing that LLM-based evaluation fails to capture the aspects of personalization that real researchers value.

What carries the argument

MyScholarQA (MySQA), the agent that infers a user's research interests to propose actions and write customized reports, which demonstrates the gap between synthetic and real-user outcomes
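
A minimal sketch of how the three-step flow described above could be wired together. Every name here (call_llm, Action, the prompt wording) is a hypothetical stand-in; only the ordering of profile inference, action proposal, and report generation follows the paper's description of MySQA.

    # Hypothetical sketch of a MySQA-style pipeline; not the authors' implementation.
    from dataclasses import dataclass

    @dataclass
    class Action:
        description: str        # one proposed way to tailor the report
        approved: bool = False  # users can toggle actions before generation

    def call_llm(prompt: str) -> str:
        """Placeholder for whatever LLM backend the system uses."""
        raise NotImplementedError

    def infer_profile(uploaded_papers: list[str]) -> str:
        # Step 1: infer an editable research profile from the user's selected papers.
        return call_llm("Infer this researcher's interests from:\n" + "\n".join(uploaded_papers))

    def propose_actions(profile: str, query: str) -> list[Action]:
        # Step 2: propose personalization actions tailored to the profile and query.
        raw = call_llm(f"Profile:\n{profile}\nQuery: {query}\nList actions that should shape the report.")
        return [Action(line.strip()) for line in raw.splitlines() if line.strip()]

    def write_report(profile: str, query: str, actions: list[Action]) -> str:
        # Step 3: generate a multi-section report that executes only user-approved actions.
        approved = [a.description for a in actions if a.approved]
        return call_llm(f"Answer '{query}' for this researcher:\n{profile}\nFollow these actions: {approved}")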

If this is right

  • Synthetic benchmarks with LLM judges produce inflated estimates of personalization quality compared with live user experience.
  • Nine specific errors, including profile misalignment and report customization failures, remain invisible to current automated judges.
  • Qualitative feedback from users yields concrete design lessons for building future deep research agents.
  • Progress toward useful personalization cannot be measured reliably through synthetic protocols alone.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same limitation likely applies to other LLM personalization domains such as recommendation or writing assistants, where proxy judges may hide real-user mismatches.
  • Systems could embed continuous user feedback loops rather than one-time interviews to close the evaluation gap.
  • Hybrid protocols that combine automated metrics with periodic real-user checks may prove more practical than pure reliance on either approach.

Load-bearing premise

The nine nuanced errors found in the real-user interviews represent the main personalization failures across deep research systems and cannot be captured even by improved LLM judges.

What would settle it

A new real-user study in which improved LLM judges detect every one of the nine error types without omission would falsify the central claim.
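
Concretely, that test amounts to checking, for each of the nine error types, whether an improved judge catches every instance the real users flagged. The sketch below is one hypothetical way to run that check; the error-type labels and data layout are invented for illustration and are not the paper's exact taxonomy.

    # Hypothetical falsification check: the central claim fails only if the
    # improved judge misses no user-flagged instance of any of the nine error types.
    from collections import defaultdict

    ERROR_TYPES = ["domain", "overclaim", "convention", "contrast", "narrow",
                   "offtopic", "uninformative", "structure", "interest"]  # illustrative labels only

    def count_misses(user_flags: list[tuple[str, str]],
                     judge_flags: set[tuple[str, str]]) -> dict[str, int]:
        """Per error type, count user-flagged (transcript_id, error_type) pairs the judge missed."""
        misses = defaultdict(int)
        for transcript_id, error_type in user_flags:
            if (transcript_id, error_type) not in judge_flags:
                misses[error_type] += 1
        return misses

    def claim_falsified(misses: dict[str, int]) -> bool:
        # Falsified only when every error type has zero undetected instances.
        return all(misses.get(t, 0) == 0 for t in ERROR_TYPES)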

Figures

Figures reproduced from arXiv: 2603.16120 by Aakanksha Naik, Amanpreet Singh, Eunsol Choi, Jordan Lee Boyd-Graber, Joseph Chee Chang, Malachi Hamada, Nishant Balepur, Pao Siangliulue, Sergey Feldman, Varsha Kishore.

Figure 1. Overview of MYSQA: our three-step personalized Deep Research system. (1) Researchers upload papers from Semantic Scholar, from which an LLM infers a profile that captures their interests. (2) When the researcher asks a query, MYSQA proposes a list of actions that could alter the report, tailored to the researcher's profile. (3) The system generates a report that answers the query and executes these actions…
Figure 2. After a user selects research papers, MYSQA infers an editable profile with inferences about the user, capturing their interests for personalizing reports.
Figure 3. Before answering a user's query, MYSQA proposes actions that change how the report could be created, which users can adjust to customize their report.
Figure 4. Reports in MYSQA highlight personalized content, helping users navigate each action they select.
Figure 5. User satisfaction on MYSQA profiles, actions, and reports. Users are satisfied with 73% of them.
Figure 6. LLM judge accuracy for predicting users' satisfaction of personalized profiles, actions, and reports. No model ever beats a majority class baseline (Dror et al., 2018; α = 0.05 Binomial test with Bonferroni correction).
Figure 7. Accuracy of LLM judges for predicting user satisfaction in personalization aspects over profiles, actions, and reports with one few-shot example. LLM judges are still worse than the majority class baseline.
Figure 8. Accuracy of LLM judges for predicting user satisfaction in personalization aspects over profiles, actions, and reports with no examples. LLM judges are still worse than the majority class baseline.
Figure 9. Accuracy of LLM judges for predicting user satisfaction in personalization aspects over profiles, actions, and reports with no metric definitions. LLM judges are still worse than the majority class baseline.
Figure 10. List of questions we ask participants in the formative. Sections marked with "[walkthrough]" indicates…
Figure 11. Full example of one of the author's profiles with two inferences per section…
Figure 12. Full example of actions for a query in MYSQA. Users can view, toggle, and edit actions in the list. Clicking on the dropdown arrow gives a more elaborate description of each action.
Figure 13. Full example of a report for a query and actions…
Figure 14. We provide activities to entertain users while they wait for their profiles to be generated. Users can…
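
The statistical claim in Figures 6 through 9, that no judge beats a majority-class baseline under an α = 0.05 binomial test with Bonferroni correction, can be outlined as follows. The counts below are invented for illustration; only the test recipe comes from the captions.

    # Illustrative only: does a judge's accuracy exceed the majority-class rate?
    from scipy.stats import binomtest

    def beats_majority(judge_correct: int, n: int, majority_rate: float,
                       alpha: float = 0.05, n_comparisons: int = 1) -> bool:
        """One-sided binomial test of judge accuracy against the majority-class baseline."""
        p = binomtest(judge_correct, n, p=majority_rate, alternative="greater").pvalue
        return p * n_comparisons < alpha  # Bonferroni: scale each p-value by the number of tests

    # Made-up numbers: 468 profile judgments, a 73% majority class,
    # a judge correct on 350 of them, and 12 judge/aspect comparisons overall.
    print(beats_majority(judge_correct=350, n=468, majority_rate=0.73, n_comparisons=12))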
read the original abstract

Deep Research (DR) systems help researchers cope with ballooning publishing counts. Such tools synthesize scientific papers to answer research queries, but lack understanding of their users. We address this with MyScholarQA (MySQA), a personalized DR agent that: 1) infers a profile with a user's research interests; 2) proposes personalized actions for a user's input query; and 3) writes a multi-section report for the query that follows user-approved actions. We first test MySQA with NLP's standard protocol: we build a benchmark with synthetic users and LLM judges, where MySQA beats baselines in citation metrics and personalized action-following. However, we suspect this process does not cover all aspects of personalized DR users value, so we interview users in an online version of MySQA to unmask them. We reveal nine nuanced errors of personalized DR undetectable by our LLM judges, and we study qualitative feedback to form lessons for future DR design. In all, we argue for a pillar of personalization that easy-to-use LLM judges can lead NLP to overlook: real progress in personalization is only possible with real users.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper introduces MyScholarQA (MySQA), a personalized deep research agent that infers user research profiles, proposes personalized actions for queries, and generates multi-section reports following user-approved actions. It first evaluates MySQA via a synthetic-user benchmark with LLM judges, where it outperforms baselines on citation metrics and personalized action-following. Real-user interviews then uncover nine nuanced personalization errors missed by the LLM judges, leading to the claim that real progress in personalization requires real users rather than easy-to-use LLM judges.

Significance. If the nine errors are shown to be representative and resistant to improved LLM evaluation protocols, the work would meaningfully advance evaluation practices in personalized NLP by demonstrating concrete limitations of synthetic benchmarks and highlighting the value of qualitative user studies for deep research systems. The identification of specific errors provides actionable lessons for DR design.

major comments (2)
  1. [Abstract and real-user study section] Abstract and real-user study: the central claim that 'real progress in personalization is only possible with real users' because the nine errors are 'undetectable by our LLM judges' rests on the specific judges and protocol in the synthetic benchmark. The manuscript does not report an ablation where the derived error taxonomy is incorporated into LLM judges (via rubrics, chain-of-thought, or stronger models) and re-tested on the real-user transcripts, leaving open whether the errors are inherently beyond LLM capture.
  2. [Real-user study section] Real-user study section: the qualitative evidence identifies nine nuanced errors but does not report error frequencies across users, inter-rater reliability for error identification, or the exact number of participants and interview protocol details. This weakens the generalizability argument that these errors are representative of personalization failures across deep research systems.
minor comments (2)
  1. [Abstract] The abstract would benefit from briefly noting the scale of the user study (number of participants) to contextualize the nine errors.
  2. [Method section] Notation for MySQA components (profile inference, action proposal, report generation) could be more explicitly defined in the method section for clarity when comparing to baselines.

Simulated Author's Rebuttal

2 responses · 0 unresolved

Thank you for the detailed and constructive feedback. We address each of the major comments below, proposing revisions to enhance the clarity and robustness of our claims regarding the necessity of real-user evaluations in personalized deep research systems.

read point-by-point responses
  1. Referee: [Abstract and real-user study section] Abstract and real-user study: the central claim that 'real progress in personalization is only possible with real users' because the nine errors are 'undetectable by our LLM judges' rests on the specific judges and protocol in the synthetic benchmark. The manuscript does not report an ablation where the derived error taxonomy is incorporated into LLM judges (via rubrics, chain-of-thought, or stronger models) and re-tested on the real-user transcripts, leaving open whether the errors are inherently beyond LLM capture.

    Authors: We appreciate this observation. Our initial evaluation followed the standard LLM judge protocol common in NLP research. To strengthen our central claim, we will perform the suggested ablation in the revision: we will integrate the nine-error taxonomy into the LLM judges via enhanced rubrics and chain-of-thought prompting, then re-evaluate the real-user transcripts. We will report the results to further substantiate whether these nuanced errors are detectable by improved LLM protocols. revision: yes

  2. Referee: [Real-user study section] Real-user study section: the qualitative evidence identifies nine nuanced errors but does not report error frequencies across users, inter-rater reliability for error identification, or the exact number of participants and interview protocol details. This weakens the generalizability argument that these errors are representative of personalization failures across deep research systems.

    Authors: We agree that these methodological details are important for assessing the generalizability of our findings. In the revised manuscript, we will expand the real-user study section to include error frequencies for each identified error, inter-rater reliability scores for the error coding process, the precise number of participants involved, and a comprehensive description of the interview protocol and error identification procedure. These additions will provide a more solid foundation for our arguments about the representativeness of the observed personalization errors. revision: yes
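
The ablation promised in response 1, folding the nine-error taxonomy into the judges via rubrics and chain-of-thought prompting, could look roughly like the sketch below. The rubric wording, prompt text, and call_llm hook are hypothetical stand-ins; only the idea of a taxonomy-aware judge re-run on the real-user transcripts comes from the exchange above.

    # Hypothetical rubric-augmented judge; prompt wording and error descriptions are placeholders.
    ERROR_RUBRIC = {
        "domain": "Uses terms, definitions, or details that do not capture the user's research domain.",
        "overclaim": "Claims something applies to the user more broadly than their papers support.",
        # ... the remaining seven error types from the paper's taxonomy would be listed here
    }

    def build_judge_prompt(profile: str, report: str) -> str:
        rubric = "\n".join(f"- {name}: {desc}" for name, desc in ERROR_RUBRIC.items())
        return ("You are judging a personalized research report.\n"
                f"User profile:\n{profile}\n\nReport:\n{report}\n\n"
                "Check the report against each error type below, reasoning step by step, "
                "then answer SATISFIED or DISSATISFIED.\n" + rubric)

    def rubric_judge(profile: str, report: str, call_llm) -> bool:
        # Re-run on the real-user transcripts and compare against the users' own satisfaction labels.
        verdict = call_llm(build_judge_prompt(profile, report)).upper()
        return "DISSATISFIED" not in verdict and "SATISFIED" in verdict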

Circularity Check

0 steps flagged

No circularity; central claim rests on independent user interviews

full rationale

The paper's argument derives from two distinct empirical stages: (1) a synthetic benchmark where MySQA outperforms baselines on citation and action-following metrics judged by LLMs, and (2) separate real-user interviews that surface nine specific errors missed by the same LLM judges. The conclusion that 'real progress in personalization is only possible with real users' follows directly from the interview findings rather than reducing to any fitted parameter, self-definition, or self-citation chain. No equations, ansatzes, or uniqueness theorems are invoked; the evidence is externally grounded in transcribed user sessions and qualitative feedback.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests on the domain assumption that real-user interviews provide ground-truth signals of personalization quality that LLM judges cannot match, with no free parameters or invented entities introduced.

axioms (1)
  • domain assumption: Real user interviews reveal personalization errors that are undetectable by LLM judges
    Invoked in the transition from synthetic benchmark results to the online user study findings

pith-pipeline@v0.9.0 · 5544 in / 1208 out tokens · 41431 ms · 2026-05-15T10:30:47.818340+00:00 · methodology

discussion (0)

