Can Deep Research Agents Retrieve and Organize? Evaluating the Synthesis Gap with Expert Taxonomies
Pith reviewed 2026-05-21 15:10 UTC · model grok-4.3
The pith
Deep research agents retrieve only 20.92 percent of expert-cited papers and produce taxonomies with high sibling overlap and structural imbalance.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Evaluating 7 deep research agents and 12 frontier LLMs on TaxoBench reveals a dual bottleneck in research synthesis. On the capability side, retrieval recall tops out at 20.92 percent of expert-cited papers, and 1,000 model taxonomies exhibit 75.9 percent sibling overlap, 51.2 percent MECE violations, and 83.4 percent structural imbalance, all detectable without references. On the alignment side, all 12 LLMs converge to 28-29 percent Semantic Path Similarity against expert trees, below the 47-58 percent achieved by three human-annotator groups on the same paper sets.
What carries the argument
TaxoBench benchmark using expert taxonomy trees for 72 LLM surveys, with leaf-level paper-to-category assignment and hierarchy-level evaluation via Unordered Semantic Tree Edit Distance and Semantic Path Similarity metrics.
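The summary here does not give the definition of Semantic Path Similarity, so the following is only a toy illustration of what a path-level comparison between an expert tree and a model tree could look like. Everything in it is an assumption: the nested-dict tree shape (`name`, `subtopics`, `papers`), the helper names, and above all the use of token-level Jaccard overlap as a stand-in for a real semantic similarity between category names.

```python
def paper_paths(node, prefix=()):
    """Map each paper to its root-to-leaf path of category names."""
    path = prefix + (node["name"],)
    paths = {paper: path for paper in node.get("papers", [])}
    for child in node.get("subtopics", []):
        paths.update(paper_paths(child, path))
    return paths


def token_jaccard(path_a, path_b):
    """Jaccard overlap of the lowercase word sets of two paths.

    Hypothetical stand-in for a semantic similarity; NOT the paper's metric.
    """
    a = {tok for name in path_a for tok in name.lower().split()}
    b = {tok for name in path_b for tok in name.lower().split()}
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def path_similarity(expert_tree, model_tree):
    """Average per-paper path similarity over papers present in both trees."""
    expert, model = paper_paths(expert_tree), paper_paths(model_tree)
    shared = expert.keys() & model.keys()
    if not shared:
        return 0.0
    return sum(token_jaccard(expert[p], model[p]) for p in shared) / len(shared)


expert_tree = {
    "name": "LLM agents",
    "subtopics": [
        {"name": "planning", "papers": ["p1"]},
        {"name": "tool use", "papers": ["p2"]},
    ],
}
model_tree = {
    "name": "LLM agents",
    "subtopics": [{"name": "planning", "papers": ["p1", "p2"]}],
}
score = path_similarity(expert_tree, model_tree)
```

In this toy case the model places `p2` under the wrong leaf, so its path only partly matches the expert path and the average drops below 1. A real implementation would presumably compare category names with embeddings rather than shared words.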
If this is right
- The strongest agent still retrieves only 20.92 percent of papers cited by experts in the surveys.
- Model taxonomies show 75.9 percent sibling overlap, 51.2 percent MECE violations, and 83.4 percent structural imbalance on reference-free checks.
- All 12 tested LLMs converge to 28-29 percent Semantic Path Similarity with expert trees.
- Three independent human-annotator groups achieve 47-58 percent Semantic Path Similarity on identical paper sets.
- Partitioning results into capability-based and alignment-based groups separates genuine failure from valid disagreement with one expert's choices.
Where Pith is reading between the lines
- Separate advances in retrieval systems and in hierarchical reasoning may be needed, since the two shortfalls appear independent in the results.
- Reference-free metrics could serve as diagnostic tools to improve model taxonomies before any expert comparison is introduced.
- The consistent low alignment across all LLMs points to a shared limitation in organizing knowledge into balanced, non-overlapping hierarchies.
- Extending the benchmark to non-LLM scientific domains would test whether the retrieval and organization gaps generalize beyond this field.
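As a rough sketch of how reference-free diagnostics might be run on a single model taxonomy, the code below checks two structural properties without consulting any expert tree. The tree shape (`name`, `subtopics`, `papers`), the function names, and both proxies are assumptions made for illustration; the paper's actual definitions of sibling overlap, MECE violations, and structural imbalance are not given in this summary.

```python
from collections import Counter


def iter_leaves(node, path=()):
    """Yield (path, papers) for every leaf of a nested-dict taxonomy."""
    path = path + (node["name"],)
    children = node.get("subtopics", [])
    if not children:
        yield path, node.get("papers", [])
    for child in children:
        yield from iter_leaves(child, path)


def duplicated_papers(tree):
    """Papers assigned to more than one leaf (a mutual-exclusivity proxy)."""
    counts = Counter(p for _, papers in iter_leaves(tree) for p in papers)
    return sorted(p for p, n in counts.items() if n > 1)


def depth_spread(tree):
    """Difference between deepest and shallowest leaf (an imbalance proxy)."""
    depths = [len(path) for path, _ in iter_leaves(tree)]
    return max(depths) - min(depths)


toy_tree = {
    "name": "root",
    "subtopics": [
        {
            "name": "A",
            "subtopics": [
                {"name": "A1", "papers": ["p1", "p2"]},
                {"name": "A2", "papers": ["p2", "p3"]},
            ],
        },
        {"name": "B", "papers": ["p4"]},
    ],
}
dupes = duplicated_papers(toy_tree)
spread = depth_spread(toy_tree)
```

Checks of this kind need only the model output, which is what makes them usable as a pre-filter before any expert comparison.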
Load-bearing premise
Expert-authored taxonomy trees provide a stable reference standard for judging how well models organize the same papers into hierarchies.
What would settle it
A deep research agent or LLM that retrieves over 40 percent of expert-cited papers and reaches Semantic Path Similarity above 40 percent on the same paper sets while matching human annotator variability ranges would falsify the dual-bottleneck finding.
Original abstract
Deep Research Agents increasingly automate survey generation, yet whether they match human experts at retrieving essential papers and organizing them into expert-like taxonomies remains unclear. Existing benchmarks emphasize writing quality or citation correctness, while standard clustering metrics ignore hierarchical structure. We introduce TaxoBench, a benchmark of 72 highly cited LLM surveys with expert-authored taxonomy trees and 3,815 papers mapped to paper categories. TaxoBench evaluates (1) retrieval via Recall/Precision/F1, and (2) organization at a leaf level (paper-to-category assignment) and a hierarchy level via two new metrics: Unordered Semantic Tree Edit Distance (US-TED/US-NTED) and Semantic Path Similarity (Sem-Path). Two modes are supported: Deep Research (topic-only, end-to-end) and Bottom-Up (expert paper set provided, organization-only). To distinguish disagreement with a single expert reference from genuine model failure, we explicitly partition findings into capability-based (reference-free) and alignment-based (reference-dependent) groups. Evaluating 7 Deep Research Agents and 12 frontier LLMs reveals a dual bottleneck. On the capability side, the best agent retrieves only 20.92% of expert-cited papers, and 1,000 model taxonomies show 75.9% sibling overlap, 51.2% MECE violations, and 83.4% structural imbalance, all detectable without any reference. On the alignment side, all 12 LLMs converge to Sem-Path 28-29%, well below 47-58% achieved by three independent human-annotator groups on the same paper sets. Our benchmark is publicly available at https://github.com/KongLongGeFDU/TaxoBench.
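The retrieval side of the evaluation uses standard Recall/Precision/F1. A minimal sketch of the set-based computation follows, assuming papers can be matched by a normalized identifier; how TaxoBench actually matches retrieved papers to expert-cited ones is not stated here, and the function name and sample identifiers are hypothetical.

```python
def retrieval_scores(retrieved, expert_cited):
    """Return (precision, recall, f1) over two collections of paper IDs."""
    retrieved, expert_cited = set(retrieved), set(expert_cited)
    hits = len(retrieved & expert_cited)
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(expert_cited) if expert_cited else 0.0
    # F1 is the harmonic mean; define it as 0 when there are no hits.
    f1 = 2 * precision * recall / (precision + recall) if hits else 0.0
    return precision, recall, f1


# Hypothetical example: 5 expert-cited papers, 4 retrieved, 1 in common.
expert_cited = {"p1", "p2", "p3", "p4", "p5"}
retrieved = {"p1", "x1", "x2", "x3"}
precision, recall, f1 = retrieval_scores(retrieved, expert_cited)
```

Recall is the quantity behind the 20.92% headline figure: the share of expert-cited papers that the agent's output also contains.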
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces TaxoBench, a benchmark of 72 expert-authored taxonomy trees for highly cited LLM surveys with 3,815 mapped papers. It evaluates 7 deep research agents and 12 frontier LLMs on retrieval (Recall/Precision/F1) and organization using new metrics Unordered Semantic Tree Edit Distance (US-TED/US-NTED) and Semantic Path Similarity (Sem-Path). Results are partitioned into reference-free capability metrics (e.g., 75.9% sibling overlap, 51.2% MECE violations, 83.4% structural imbalance in 1,000 model taxonomies; best retrieval 20.92%) and reference-dependent alignment metrics (LLMs at 28-29% Sem-Path vs. 47-58% for human annotators). Two modes are supported: end-to-end Deep Research and Bottom-Up organization-only.
Significance. If validated, the work is significant for exposing concrete bottlenecks in automated research synthesis, separating retrieval failures from organizational ones and capability from alignment issues. Strengths include the public benchmark release, the explicit partitioning of reference-free vs. reference-dependent results, and the use of expert taxonomies to ground evaluation beyond standard clustering metrics.
major comments (2)
- [Results on reference-free metrics and human baselines] The capability-side claim of a synthesis bottleneck rests on the reference-free metrics (75.9% sibling overlap, 51.2% MECE violations, 83.4% structural imbalance) computed over 1,000 model taxonomies. These are presented as evidence of model shortcomings without any reference, yet the same three metrics are not reported for the three independent human-annotator groups whose taxonomies achieve 47-58% Sem-Path. This leaves the rates uncalibrated and weakens the interpretation that they indicate genuine failure rather than typical human variation in taxonomy construction. See the results section on reference-free capability metrics and the human comparison setup.
- [Metric definitions (US-TED, Sem-Path)] The definitions of the new metrics US-TED/US-NTED and Sem-Path are central to both the hierarchy-level evaluation and the alignment claims. The manuscript should include explicit formulas or pseudocode for how unordered semantic tree edits are computed and how semantic path similarity aggregates over the expert trees, including any handling of partial matches or depth weighting, to allow reproduction and to confirm they do not inadvertently favor certain structures.
minor comments (2)
- [Abstract and evaluation setup] Clarify in the abstract and methods whether the 1,000 model taxonomies are generated from the same 72 topics or a sampled subset, and report variance or confidence intervals for the reported percentages.
- [Evaluation modes] The Bottom-Up mode is useful for isolating organization, but the manuscript could add a short discussion of how results differ between the two modes to strengthen the dual-bottleneck narrative.
Simulated Author's Rebuttal
We thank the referee for the constructive and detailed review. The comments highlight important opportunities to strengthen the calibration of our reference-free metrics and the reproducibility of our proposed evaluation measures. We address each major comment below and will revise the manuscript to incorporate the suggested changes.
point-by-point responses
- Referee: [Results on reference-free metrics and human baselines] The capability-side claim of a synthesis bottleneck rests on the reference-free metrics (75.9% sibling overlap, 51.2% MECE violations, 83.4% structural imbalance) computed over 1,000 model taxonomies. These are presented as evidence of model shortcomings without any reference, yet the same three metrics are not reported for the three independent human-annotator groups whose taxonomies achieve 47-58% Sem-Path. This leaves the rates uncalibrated and weakens the interpretation that they indicate genuine failure rather than typical human variation in taxonomy construction. See the results section on reference-free capability metrics and the human comparison setup.
  Authors: The reference-free metrics are intentionally defined without reference to any expert taxonomy to surface intrinsic structural deficiencies (e.g., excessive sibling overlap or MECE violations) that can be observed directly in model outputs. We maintain that these quantities still provide useful evidence of capability limitations. Nevertheless, we agree that reporting the identical metrics on the human-annotator taxonomies would improve calibration and allow readers to assess whether the observed rates exceed typical human variation. In the revised manuscript we will compute and present sibling overlap, MECE violation rates, and structural imbalance for the three human groups alongside the model results.
  revision: yes
- Referee: [Metric definitions (US-TED, Sem-Path)] The definitions of the new metrics US-TED/US-NTED and Sem-Path are central to both the hierarchy-level evaluation and the alignment claims. The manuscript should include explicit formulas or pseudocode for how unordered semantic tree edits are computed and how semantic path similarity aggregates over the expert trees, including any handling of partial matches or depth weighting, to allow reproduction and to confirm they do not inadvertently favor certain structures.
  Authors: We concur that explicit, reproducible definitions are required. The current manuscript introduces US-TED/US-NTED and Sem-Path at a high level but does not supply the full algorithmic details. In the revision we will add a dedicated subsection containing the mathematical formulations, pseudocode for the unordered semantic tree-edit procedure, the aggregation rule for semantic path similarity, and explicit statements on the treatment of partial matches and depth weighting.
  revision: yes
Circularity Check
No significant circularity in derivation chain
full rationale
The paper introduces TaxoBench with expert-authored taxonomy trees as external reference and defines new metrics (US-TED/US-NTED, Sem-Path) plus reference-free ones (sibling overlap, MECE violations, structural imbalance) computed directly from model outputs. Results are partitioned into capability (reference-free) and alignment (reference-dependent) groups, with explicit human-annotator comparisons on Sem-Path. No equations or steps reduce by construction to fitted parameters, self-definitions, or self-citation chains; the evaluation relies on independent external benchmarks and human baselines rather than tautological renaming or imported uniqueness from prior author work.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption: Expert-authored taxonomy trees from LLM surveys represent high-quality hierarchical organization of the literature.
invented entities (2)
- Unordered Semantic Tree Edit Distance (US-TED / US-NTED): no independent evidence
- Semantic Path Similarity (Sem-Path): no independent evidence
Forward citations
Cited by 3 Pith papers
- LLM-Oriented Information Retrieval: A Denoising-First Perspective
  Denoising to maximize usable evidence density and verifiability is becoming the primary bottleneck in LLM-oriented information retrieval, conceptualized via a four-stage framework and addressed through a pipeline taxo...
- LLM-Oriented Information Retrieval: A Denoising-First Perspective
  Argues for a denoising-first paradigm in LLM-oriented information retrieval, framing challenges via a four-stage progression and providing a taxonomy of signal-to-noise optimization techniques across the pipeline.
- WisPaper: Your AI Scholar Search Engine
  WisPaper integrates semantic search with agent-based validation, library organization, and personalized AI feeds into a closed-loop system that improves academic paper discovery and long-term awareness.