SurGE: A Benchmark and Evaluation Framework for Scientific Survey Generation

Anzhe Xie; Jianming Long; Jiaxin Mao; Qingyao Ai; Weihang Su; Xuanyi Chen; Yiqun Liu; Ziyi Ye

arXiv: 2508.15658 · v5 · submitted 2025-08-21 · cs.CL · cs.AI · cs.IR

Weihang Su, Anzhe Xie, Qingyao Ai, Jianming Long, Xuanyi Chen, Jiaxin Mao, Ziyi Ye, Yiqun Liu

Pith reviewed 2026-05-18 21:36 UTC · model grok-4.3

classification: cs.CL, cs.AI, cs.IR

keywords: scientific survey generation, benchmark, evaluation framework, large language models, citation accuracy, comprehensiveness, academic literature

The pith

The SurGE benchmark shows that large language models and agentic systems still fall well short of expert-written references when generating scientific surveys.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Academic literature is growing so fast that writing surveys by hand is becoming increasingly impractical. Large language models appear able to automate the work, yet the lack of shared benchmarks and evaluation protocols has made it hard to measure real progress. This paper supplies SurGE, a collection of computer science topics each paired with an expert survey and its references, together with a corpus of over one million papers. It also supplies an automated scorer that checks generated surveys on comprehensiveness, citation accuracy, structural organization, and content quality. When a range of current methods is run on the benchmark, even the strongest agent-based approaches show clear shortfalls against the expert references.

Core claim

SurGE consists of test instances that each contain a topic description, an expert-written survey, and the full set of cited references, plus a large-scale academic corpus of over one million papers; paired with this is an automated evaluation framework that scores generated surveys across four dimensions, and evaluations of diverse LLM-based methods on these resources expose a substantial performance gap.
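The shape of a test instance described above can be sketched as a small data structure. This is a minimal illustration, not the authors' schema: the class name `SurveyInstance`, its field names, and the `reference_recall` helper are all hypothetical. The recall function is one plausible way to check literature coverage against the expert reference set, and it is not necessarily how SurGE defines comprehensiveness.

```python
from dataclasses import dataclass
from typing import FrozenSet, Set


@dataclass(frozen=True)
class SurveyInstance:
    """One benchmark test instance (all field names are hypothetical)."""
    topic: str                      # topic description
    expert_survey: str              # text of the expert-written survey
    reference_ids: FrozenSet[str]   # IDs of the papers the expert survey cites


def reference_recall(instance: SurveyInstance, cited_ids: Set[str]) -> float:
    """Fraction of the expert survey's references that a generated survey also cites.

    An assumed coverage measure for illustration, not the paper's metric.
    """
    if not instance.reference_ids:
        return 0.0
    overlap = instance.reference_ids & cited_ids
    return len(overlap) / len(instance.reference_ids)


# Toy data for illustration only; the paper IDs are made up.
instance = SurveyInstance(
    topic="Retrieval-augmented generation",
    expert_survey="...",
    reference_ids=frozenset({"p1", "p2", "p3", "p4"}),
)
score = reference_recall(instance, {"p2", "p4", "p9"})
print(score)
```

Keeping the reference set as a frozen set makes the overlap a single set intersection and prevents accidental mutation of the ground truth during evaluation.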

What carries the argument

The SurGE benchmark together with its four automated evaluation dimensions that compare generated surveys against expert-written references using the supplied paper corpus.

If this is right

Standardized testing of this kind can now direct development of improved survey generation systems.
Future models will need targeted gains in literature coverage and citation precision to close the gap.
Agentic frameworks require additional capabilities to reach expert-level survey organization and content quality.
Releasing the benchmark data and code allows the community to track and compare future advances.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Literature synthesis tools may remain most useful when combined with human oversight for the foreseeable future.
The same structure of expert references plus automated scoring could be replicated for survey generation in biology or physics.
Repeated evaluation on SurGE over successive model releases would provide a concrete record of capability growth in automated research assistance.

Load-bearing premise

The four automated dimensions and the collected expert surveys together give a reliable enough picture of survey quality that the observed gaps will hold for new topics and new methods.

What would settle it

An experiment in which independent experts rate the best current LLM-generated surveys as equal or superior to the provided expert surveys on overall usefulness would directly challenge the reported performance gap.

Original abstract

The rapid growth of academic literature makes the manual creation of scientific surveys increasingly infeasible. While large language models show promise for automating this process, progress in this area is hindered by the absence of standardized benchmarks and evaluation protocols. To bridge this critical gap, we introduce SurGE (Survey Generation Evaluation), a new benchmark for scientific survey generation in computer science. SurGE consists of (1) a collection of test instances, each including a topic description, an expert-written survey, and its full set of cited references, and (2) a large-scale academic corpus of over one million papers. In addition, we propose an automated evaluation framework that measures the quality of generated surveys across four dimensions: comprehensiveness, citation accuracy, structural organization, and content quality. Our evaluation of diverse LLM-based methods demonstrates a significant performance gap, revealing that even advanced agentic frameworks struggle with the complexities of survey generation and highlighting the need for future research in this area. We have open-sourced all the code, data, and models at: https://github.com/oneal2000/SurGE

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

SurGE releases a practical benchmark and dataset for scientific survey generation, but the automated metrics need human validation to back the reported performance gaps.

Hey, the main thing to know is that this paper puts out SurGE, a benchmark with expert-written surveys, their full reference lists, a million-paper corpus, and a four-way automated scorer for LLM survey outputs in computer science. They run some current methods and show a clear gap even for agentic setups.

The construction is the real addition here. Prior summarization or citation work does not directly give you expert targets plus the full cited papers for each topic, so this setup lets you check citation accuracy and coverage in a more grounded way than generic metrics. Releasing the code, data, and models openly is straightforward and useful for anyone who wants to test their own generation pipeline against it.

The stress-test note is on point about the metrics. The abstract and available details do not include any correlation numbers between the automated scores on comprehensiveness, citation accuracy, structure, and content quality and independent human ratings of generated surveys. Without that, the size of the gap they highlight could partly trace back to how the dimensions were defined rather than the intrinsic hardness of the task. Sampling details for the test instances are also thin in what is shown.

This is aimed at researchers working on long-form scientific generation, retrieval-augmented agents, or literature synthesis tools. A reader who needs a ready testbed to measure progress on survey-style output would get immediate use from the released collection.

It should go to peer review. The dataset and protocol address a timely practical need, and the authors can add the missing validation steps if referees ask for them.

Referee Report

2 major / 2 minor

Summary. The paper introduces SurGE, a benchmark for scientific survey generation in computer science consisting of test instances (each with a topic description, expert-written survey, and cited references) plus a corpus of over one million papers. It proposes an automated evaluation framework across four dimensions—comprehensiveness, citation accuracy, structural organization, and content quality—and reports evaluations of diverse LLM-based methods, including advanced agentic frameworks, that reveal significant performance gaps.

Significance. If the automated metrics reliably proxy human-assessed survey quality, SurGE could provide a much-needed standardized resource for measuring progress on automated scientific survey generation, an increasingly important task given literature growth. The open-sourcing of code, data, and models is a clear strength that enables reproducibility and follow-on work. The reported gaps usefully highlight task complexity, but their interpretation depends on metric validity.

major comments (2)

[Evaluation Framework] Evaluation Framework section: the four automated dimensions are introduced without any reported quantitative validation (e.g., Pearson/Spearman correlation) against independent human ratings on a held-out set of generated surveys, nor inter-annotator agreement statistics for the expert reference surveys. This directly affects the central claim of a significant performance gap, because the gap could partly reflect choices in metric operationalization rather than intrinsic difficulty of survey synthesis.
[Benchmark Construction] Benchmark Construction (likely §3): no details are given on the sampling procedure used to select the test instances from the million-paper corpus. Without this, it is difficult to assess whether the observed gaps generalize or are influenced by selection bias in the collected instances.
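The validation requested in the first major comment can be sketched as follows. This is a minimal, standard-library illustration of Pearson and Spearman correlation between automated scores and human ratings. The function names and the numbers are made up for the example and are not results from the paper.

```python
from math import sqrt
from typing import List, Sequence


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("need two equal-length sequences with at least 2 items")
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / sqrt(var_x * var_y)


def average_ranks(values: Sequence[float]) -> List[float]:
    """1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def spearman(x: Sequence[float], y: Sequence[float]) -> float:
    """Spearman rank correlation: Pearson correlation of the average ranks."""
    return pearson(average_ranks(x), average_ranks(y))


# Made-up scores for five generated surveys; illustration only.
automated = [0.2, 0.4, 0.5, 0.7, 0.9]   # automated metric, one dimension
human = [1.0, 2.0, 2.5, 4.0, 4.5]       # mean human rating, same surveys
r = pearson(automated, human)
rho = spearman(automated, human)
print(f"Pearson r = {r:.3f}, Spearman rho = {rho:.3f}")
```

Reporting both coefficients is useful because Spearman only requires the automated metric to rank surveys in the same order as the annotators, while Pearson also requires a roughly linear relationship between the two scales.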

minor comments (2)

[Abstract] The abstract states that 'diverse LLM-based methods' were evaluated but does not name the specific systems or agentic frameworks; adding this list would improve readability.
[Experiments] Figure and table captions should explicitly state the number of test instances and the exact LLM configurations used so that readers can interpret the numerical results without cross-referencing the text.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the detailed and constructive report. We address each major comment below and outline the revisions we will make to strengthen the manuscript's clarity and rigor.

Referee: [Evaluation Framework] Evaluation Framework section: the four automated dimensions are introduced without any reported quantitative validation (e.g., Pearson/Spearman correlation) against independent human ratings on a held-out set of generated surveys, nor inter-annotator agreement statistics for the expert reference surveys. This directly affects the central claim of a significant performance gap, because the gap could partly reflect choices in metric operationalization rather than intrinsic difficulty of survey synthesis.

Authors: We agree that explicit validation of the automated metrics against human judgments would strengthen the central claims. In the revised manuscript we will add a dedicated subsection reporting a human evaluation study: expert annotators rate a held-out sample of generated surveys on the same four dimensions, and we compute Pearson and Spearman correlations with the automated scores. We will also report inter-annotator agreement statistics for the expert reference surveys. These additions will directly address concerns about metric validity and support the interpretation of the observed performance gaps.
Revision: yes

Referee: [Benchmark Construction] Benchmark Construction (likely §3): no details are given on the sampling procedure used to select the test instances from the million-paper corpus. Without this, it is difficult to assess whether the observed gaps generalize or are influenced by selection bias in the collected instances.

Authors: We acknowledge the omission of sampling details. In the revised version we will expand the Benchmark Construction section to describe the full procedure: topics were drawn from a stratified sample of recent (2020–2024) high-impact computer-science papers, balanced across major sub-areas (e.g., NLP, vision, systems), with explicit criteria for topic scope and citation count. We will also document how the corresponding expert surveys were sourced and any filtering steps applied to the million-paper corpus. This added transparency will allow readers to evaluate potential selection effects.
Revision: yes
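The stratified, sub-area-balanced sampling that the simulated response describes could look roughly like the sketch below. It is an assumption-laden illustration, not the paper's procedure: the function `stratified_sample`, the candidate topics, and the fixed per-stratum quota are hypothetical, and the scope and citation-count criteria mentioned in the response are not modelled.

```python
import random
from collections import defaultdict
from typing import Dict, List, Tuple

Candidate = Tuple[str, str]  # (topic, sub_area); hypothetical record layout


def stratified_sample(
    candidates: List[Candidate], per_stratum: int, seed: int = 0
) -> List[Candidate]:
    """Draw up to `per_stratum` candidates from each sub-area, reproducibly."""
    rng = random.Random(seed)  # fixed seed so the selection can be re-derived
    by_area: Dict[str, List[Candidate]] = defaultdict(list)
    for topic, area in candidates:
        by_area[area].append((topic, area))

    selected: List[Candidate] = []
    for area in sorted(by_area):  # sorted so the draw order is deterministic
        pool = by_area[area]
        k = min(per_stratum, len(pool))  # small strata contribute all they have
        selected.extend(rng.sample(pool, k))
    return selected


# Made-up candidate topics for illustration only.
candidates = [
    ("dense retrieval", "NLP"),
    ("instruction tuning", "NLP"),
    ("machine translation", "NLP"),
    ("object detection", "vision"),
    ("image segmentation", "vision"),
    ("serverless scheduling", "systems"),
]
sample = stratified_sample(candidates, per_stratum=2, seed=0)
print(sample)
```

Fixing the seed and iterating over sub-areas in sorted order makes the selected set reproducible, which is what lets readers audit the selection for bias.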

Circularity Check

0 steps flagged

Empirical benchmark release with no derivational circularity

The paper introduces SurGE as a benchmark consisting of topic descriptions, expert-written reference surveys, cited papers, and a large external corpus, then applies an automated evaluation framework across four explicitly defined dimensions. All reported results are direct measurements against these externally sourced references and corpus data; no equations, fitted parameters, or predictions are defined in terms of the paper's own outputs, and no load-bearing claim reduces to a self-citation chain or self-definitional construction. The work is therefore self-contained as an empirical contribution rather than a closed derivation.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central contribution rests on the domain assumption that expert surveys constitute reliable ground truth and that the chosen automated metrics align with human notions of survey quality; no free parameters or invented entities are introduced.

axioms (1)

Domain assumption: Expert-written surveys constitute reliable ground truth for measuring generated survey quality.
The benchmark design treats these surveys as the reference standard against which LLM outputs are scored.

pith-pipeline@v0.9.0 · 5745 in / 1230 out tokens · 41381 ms · 2026-05-18T21:36:24.958926+00:00 · methodology

Forward citations

Cited by 3 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Skill Retrieval Augmentation for Agentic AI
cs.CL · 2026-04 · unverdicted · novelty 7.0

Agents improve when they retrieve skills on demand from large corpora, yet current models cannot selectively decide when to load or ignore a retrieved skill.

Skill Retrieval Augmentation for Agentic AI
cs.CL · 2026-04 · unverdicted · novelty 7.0

Introduces the SRA paradigm and SRA-Bench benchmark showing retrieval-based skill augmentation improves agent performance but skill incorporation remains a bottleneck regardless of retrieval quality.

Enhancing Judgment Document Generation via Agentic Legal Information Collection and Rubric-Guided Optimization
cs.CL · 2026-05 · unverdicted · novelty 6.0
Judge-R1 improves LLM judgment document generation by combining agentic legal information retrieval with GRPO-based rubric-guided optimization, outperforming baselines on the JuDGE benchmark.