
arxiv: 2508.15658 · v5 · submitted 2025-08-21 · 💻 cs.CL · cs.AI · cs.IR

SurGE: A Benchmark and Evaluation Framework for Scientific Survey Generation

Pith reviewed 2026-05-18 21:36 UTC · model grok-4.3

classification 💻 cs.CL · cs.AI · cs.IR
keywords scientific survey generation · benchmark · evaluation framework · large language models · citation accuracy · comprehensiveness · academic literature

The pith

The SurGE benchmark reveals a significant performance gap in how well large language models and agentic systems generate scientific surveys.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Academic literature is growing so fast that writing surveys by hand has become impractical. Large language models appear able to automate the work, yet the lack of shared tests has made it hard to measure real progress. This paper supplies SurGE, a collection of computer science topics each paired with an expert survey and its references, together with a corpus of over one million papers. It also supplies an automated scorer that checks generated surveys on comprehensiveness, citation accuracy, structural organization, and content quality. When a range of current methods is run on the benchmark, even the strongest agent-based approaches show clear shortfalls against the expert references.

Core claim

SurGE consists of test instances, each containing a topic description, an expert-written survey, and that survey's full set of cited references, plus a large-scale academic corpus of over one million papers. Paired with these resources is an automated evaluation framework that scores generated surveys across four dimensions. Evaluating diverse LLM-based methods on the benchmark exposes a substantial performance gap.

What carries the argument

The SurGE benchmark, together with its four automated evaluation dimensions, which compare generated surveys against expert-written references using the supplied paper corpus.
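
To make "comparing against expert-written references" concrete, the sketch below shows one plausible way to measure how much of an expert reference list a generated survey covers, and how many of its citations fall inside that list. This is an illustrative assumption, not SurGE's actual scoring code: the function name `citation_overlap` and the paper identifiers are hypothetical.

```python
def citation_overlap(generated_ids, reference_ids):
    """Return (recall, precision) of generated citations against an expert reference set.

    Hypothetical helper: both arguments are iterables of paper identifiers.
    """
    generated = set(generated_ids)
    reference = set(reference_ids)
    hits = generated & reference
    # Recall: share of the expert's references that the generated survey also cites.
    recall = len(hits) / len(reference) if reference else 0.0
    # Precision: share of the generated citations that appear in the expert's list.
    precision = len(hits) / len(generated) if generated else 0.0
    return recall, precision


# Toy identifiers, made up for illustration only.
recall, precision = citation_overlap(["p1", "p2", "p9"], ["p1", "p2", "p3", "p4"])
print(recall, precision)
```

Set overlap of this kind only captures whether the right papers are cited; whether each citation supports the sentence it is attached to would need a separate check.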

If this is right

  • Standardized testing of this kind can now direct development of improved survey generation systems.
  • Future models will need targeted gains in literature coverage and citation precision to close the gap.
  • Agentic frameworks require additional capabilities to reach expert-level survey organization and content quality.
  • Releasing the benchmark data and code allows the community to track and compare future advances.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Literature synthesis tools may remain most useful when combined with human oversight for the foreseeable future.
  • The same structure of expert references plus automated scoring could be replicated for survey generation in biology or physics.
  • Repeated evaluation on SurGE over successive model releases would provide a concrete record of capability growth in automated research assistance.

Load-bearing premise

The four automated dimensions and the collected expert surveys together give a reliable enough picture of survey quality that the observed gaps will hold for new topics and new methods.

What would settle it

An experiment in which independent experts rate the best current LLM-generated surveys as equal or superior to the provided expert surveys on overall usefulness would directly challenge the reported performance gap.

Original abstract

The rapid growth of academic literature makes the manual creation of scientific surveys increasingly infeasible. While large language models show promise for automating this process, progress in this area is hindered by the absence of standardized benchmarks and evaluation protocols. To bridge this critical gap, we introduce SurGE (Survey Generation Evaluation), a new benchmark for scientific survey generation in computer science. SurGE consists of (1) a collection of test instances, each including a topic description, an expert-written survey, and its full set of cited references, and (2) a large-scale academic corpus of over one million papers. In addition, we propose an automated evaluation framework that measures the quality of generated surveys across four dimensions: comprehensiveness, citation accuracy, structural organization, and content quality. Our evaluation of diverse LLM-based methods demonstrates a significant performance gap, revealing that even advanced agentic frameworks struggle with the complexities of survey generation and highlighting the need for future research in this area. We have open-sourced all the code, data, and models at: https://github.com/oneal2000/SurGE

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper introduces SurGE, a benchmark for scientific survey generation in computer science consisting of test instances (each with a topic description, expert-written survey, and cited references) plus a corpus of over one million papers. It proposes an automated evaluation framework across four dimensions—comprehensiveness, citation accuracy, structural organization, and content quality—and reports evaluations of diverse LLM-based methods, including advanced agentic frameworks, that reveal significant performance gaps.

Significance. If the automated metrics reliably proxy human-assessed survey quality, SurGE could provide a much-needed standardized resource for measuring progress on automated scientific survey generation, an increasingly important task given literature growth. The open-sourcing of code, data, and models is a clear strength that enables reproducibility and follow-on work. The reported gaps usefully highlight task complexity, but their interpretation depends on metric validity.

major comments (2)
  1. [Evaluation Framework] Evaluation Framework section: the four automated dimensions are introduced without any reported quantitative validation (e.g., Pearson/Spearman correlation) against independent human ratings on a held-out set of generated surveys, nor inter-annotator agreement statistics for the expert reference surveys. This directly affects the central claim of a significant performance gap, because the gap could partly reflect choices in metric operationalization rather than intrinsic difficulty of survey synthesis.
  2. [Benchmark Construction] Benchmark Construction (likely §3): no details are given on the sampling procedure used to select the test instances from the million-paper corpus. Without this, it is difficult to assess whether the observed gaps generalize or are influenced by selection bias in the collected instances.
minor comments (2)
  1. [Abstract] The abstract states that 'diverse LLM-based methods' were evaluated but does not name the specific systems or agentic frameworks; adding this list would improve readability.
  2. [Experiments] Figure and table captions should explicitly state the number of test instances and the exact LLM configurations used so that readers can interpret the numerical results without cross-referencing the text.
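
The validation requested in major comment 1 can be sketched as follows: given automated scores and human ratings for the same generated surveys, compute Pearson and Spearman correlations. This is a minimal standard-library sketch under stated assumptions; the score lists are toy placeholders rather than results from the paper, and the helper names `pearson`, `average_ranks`, and `spearman` are hypothetical.

```python
import math


def pearson(x, y):
    """Pearson correlation between two equal-length numeric sequences."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("need two equal-length sequences with at least 2 items")
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    if sx == 0 or sy == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / (sx * sy)


def average_ranks(values):
    """1-based ranks, with tied values sharing the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j to the end of the current run of tied values.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def spearman(x, y):
    """Spearman correlation: Pearson correlation of the rank-transformed data."""
    return pearson(average_ranks(x), average_ranks(y))


# Toy placeholder values: one automated score and one human rating per survey.
automated = [0.42, 0.55, 0.61, 0.30, 0.75]
human = [2.0, 3.5, 3.0, 1.5, 4.5]
print(pearson(automated, human), spearman(automated, human))
```

Reporting both coefficients is useful because Spearman only asks whether the metric ranks surveys in the same order as the human raters, while Pearson also rewards a linear relationship between the two scales.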

Simulated Authors' Rebuttal

2 responses · 0 unresolved

We thank the referee for the detailed and constructive report. We address each major comment below and outline the revisions we will make to strengthen the manuscript's clarity and rigor.

Point-by-point responses
  1. Referee: [Evaluation Framework] Evaluation Framework section: the four automated dimensions are introduced without any reported quantitative validation (e.g., Pearson/Spearman correlation) against independent human ratings on a held-out set of generated surveys, nor inter-annotator agreement statistics for the expert reference surveys. This directly affects the central claim of a significant performance gap, because the gap could partly reflect choices in metric operationalization rather than intrinsic difficulty of survey synthesis.

    Authors: We agree that explicit validation of the automated metrics against human judgments would strengthen the central claims. In the revised manuscript we will add a dedicated subsection reporting a human evaluation study: expert annotators rate a held-out sample of generated surveys on the same four dimensions, and we compute Pearson and Spearman correlations with the automated scores. We will also report inter-annotator agreement statistics for the expert reference surveys. These additions will directly address concerns about metric validity and support the interpretation of the observed performance gaps.

    Revision: yes

  2. Referee: [Benchmark Construction] Benchmark Construction (likely §3): no details are given on the sampling procedure used to select the test instances from the million-paper corpus. Without this, it is difficult to assess whether the observed gaps generalize or are influenced by selection bias in the collected instances.

    Authors: We acknowledge the omission of sampling details. In the revised version we will expand the Benchmark Construction section to describe the full procedure: topics were drawn from a stratified sample of recent (2020–2024) high-impact computer-science papers, balanced across major sub-areas (e.g., NLP, vision, systems), with explicit criteria for topic scope and citation count. We will also document how the corresponding expert surveys were sourced and any filtering steps applied to the million-paper corpus. This added transparency will allow readers to evaluate potential selection effects.

    Revision: yes
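
The stratified, sub-area-balanced selection described in the second response could look roughly like the sketch below. It is a generic illustration of stratified sampling, not the authors' procedure, which the paper does not document: the function `stratified_sample`, the `area` field, and the toy topic list are all hypothetical.

```python
import random
from collections import defaultdict


def stratified_sample(items, key, per_stratum, seed=0):
    """Draw up to `per_stratum` items from each group defined by `key`."""
    rng = random.Random(seed)  # fixed seed so the selection is reproducible
    groups = defaultdict(list)
    for item in items:
        groups[key(item)].append(item)
    sample = []
    for name in sorted(groups):  # sorted for a deterministic stratum order
        members = groups[name]
        k = min(per_stratum, len(members))  # small strata contribute all they have
        sample.extend(rng.sample(members, k))
    return sample


# Toy candidate topics, made up for illustration only.
areas = ["NLP"] * 5 + ["vision"] * 3 + ["systems"] * 1
topics = [{"id": i, "area": area} for i, area in enumerate(areas)]
picked = stratified_sample(topics, key=lambda t: t["area"], per_stratum=2)
print([(t["id"], t["area"]) for t in picked])
```

Capping each stratum at the same size keeps a large sub-area from dominating the test set, at the cost of under-representing it relative to its share of the corpus; which trade-off the benchmark makes is exactly what the referee asks the authors to state.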

Circularity Check

0 steps flagged

Empirical benchmark release with no derivational circularity

Full rationale

The paper introduces SurGE as a benchmark consisting of topic descriptions, expert-written reference surveys, cited papers, and a large external corpus, then applies an automated evaluation framework across four explicitly defined dimensions. All reported results are direct measurements against these externally sourced references and corpus data; no equations, fitted parameters, or predictions are defined in terms of the paper's own outputs, and no load-bearing claim reduces to a self-citation chain or self-definitional construction. The work is therefore self-contained as an empirical contribution rather than a closed derivation.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central contribution rests on the domain assumption that expert surveys constitute reliable ground truth and that the chosen automated metrics align with human notions of survey quality; no free parameters or invented entities are introduced.

axioms (1)
  • Domain assumption: Expert-written surveys constitute reliable ground truth for measuring generated survey quality.
    The benchmark design treats these surveys as the reference standard against which LLM outputs are scored.

pith-pipeline@v0.9.0 · 5745 in / 1230 out tokens · 41381 ms · 2026-05-18T21:36:24.958926+00:00 · methodology


Forward citations

Cited by 3 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Skill Retrieval Augmentation for Agentic AI

    cs.CL 2026-04 unverdicted novelty 7.0

    Agents improve when they retrieve skills on demand from large corpora, yet current models cannot selectively decide when to load or ignore a retrieved skill.

  2. Skill Retrieval Augmentation for Agentic AI

    cs.CL 2026-04 unverdicted novelty 7.0

    Introduces the SRA paradigm and SRA-Bench benchmark showing retrieval-based skill augmentation improves agent performance but skill incorporation remains a bottleneck regardless of retrieval quality.

  3. Enhancing Judgment Document Generation via Agentic Legal Information Collection and Rubric-Guided Optimization

    cs.CL 2026-05 unverdicted novelty 6.0

    Judge-R1 improves LLM judgment document generation by combining agentic legal information retrieval with GRPO-based rubric-guided optimization, outperforming baselines on the JuDGE benchmark.