pith. machine review for the scientific record.

arxiv: 2604.17680 · v1 · submitted 2026-04-20 · 💻 cs.IR

MasterSet: A Large-Scale Benchmark for Must-Cite Citation Recommendation in the AI/ML Literature

Pith reviewed 2026-05-10 04:35 UTC · model grok-4.3

classification 💻 cs.IR
keywords must-cite citation recommendation · AI/ML literature · retrieval benchmark · LLM annotation · recall at K evaluation · citation graphs · scientific information retrieval

The pith

MasterSet is a 150,000-paper benchmark showing that current retrievers cannot reliably surface must-cite papers from title and abstract alone.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper creates MasterSet to fill a gap in citation tools, which usually surface broadly related work but overlook the smaller set of must-cite papers that serve as direct baselines, foundational references, or core dependencies. Without them, a new paper's novelty claim can be overstated and its experiments become harder to reproduce. The benchmark gathers papers from fifteen major AI and ML conferences, labels every citation with a three-tier scheme, and turns the problem into a retrieval task scored by Recall@K. Standard sparse, dense, and graph-based retrievers all perform poorly, so the authors conclude that must-cite identification remains an open problem.

Core claim

MasterSet supplies a candidate pool of more than 150,000 papers drawn from official proceedings of fifteen leading venues. Citations inside each paper receive three-tier labels: whether the cited work is an experimental baseline, a core-relevance score from one to five, and the frequency of intra-paper mentions. An LLM judge produces the labels at scale after human validation on a stratified sample. The defined task is to retrieve the must-cite subset given only a query paper's title and abstract, and the evaluation shows that existing retrieval methods achieve low Recall@K, establishing must-cite recommendation as an open problem.
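
To make the metric concrete, here is a minimal sketch of Recall@K as the task defines it: the fraction of a query paper's must-cite set that appears among the top K retrieved candidates. The function and identifiers are illustrative, not taken from the benchmark's code.

```python
def recall_at_k(ranked_ids, must_cite_ids, k):
    """Fraction of a query's must-cite set found in the top-k retrieved papers."""
    if not must_cite_ids:
        return 0.0  # convention only; queries with empty gold sets are usually skipped
    return len(set(ranked_ids[:k]) & set(must_cite_ids)) / len(must_cite_ids)

# Hypothetical example: 2 of the 3 must-cite papers appear in the top 10.
ranked = ["p17", "p02", "p55", "p09", "p31", "p44", "p08", "p23", "p61", "p90"]
gold = {"p02", "p31", "p99"}
print(recall_at_k(ranked, gold, k=10))  # 0.666...
```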

What carries the argument

The MasterSet benchmark itself, together with its three-tier citation labeling scheme and the LLM judge that scales annotation to the full 150,000-paper collection.
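
A minimal sketch of what one labeled citation could look like under this scheme, with a hypothetical rule for collapsing the tiers into a must-cite decision. The field names and the cutoff are assumptions; the paper's released schema and exact criterion may differ.

```python
from dataclasses import dataclass

@dataclass
class CitationLabel:
    """One labeled citation edge under the three-tier scheme
    (field names are illustrative, not the paper's schema)."""
    citing_id: str        # the query paper
    cited_id: str         # a paper in the 150k candidate pool
    is_baseline: bool     # Tier I: cited work used as an experimental baseline
    core_relevance: int   # Tier II: 1 (peripheral) to 5 (core dependency)
    mention_count: int    # Tier III: intra-paper mention frequency

def is_must_cite(label: CitationLabel, cutoff: int = 4) -> bool:
    # Hypothetical decision rule; the abstract does not state the exact
    # criterion that collapses the three tiers into a must-cite set.
    return label.is_baseline or label.core_relevance >= cutoff
```

Keeping the tiers separate is what lets baseline detection and core-relevance ranking be studied independently, as the list below notes.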

If this is right

  • Future citation recommenders must be measured against this fixed candidate pool and Recall@K metric rather than relevance-only scores.
  • Systems that succeed on MasterSet would improve reproducibility by ensuring key experimental baselines are cited.
  • The three-tier labels allow separate study of baseline detection versus core-relevance ranking.
  • Methods limited to title and abstract are unlikely to reach high recall on must-cite papers.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • If the benchmark stands, it implies that full-text signals or citation-graph features beyond title and abstract will be needed to reach acceptable recall.
  • The dataset could be reused to test whether domain-specific fine-tuning of embeddings improves must-cite detection over general scientific embeddings.
  • Extending the same annotation pipeline to other scientific fields would test whether the observed difficulty is specific to AI/ML or is broader.
  • The current human validation covers only a sample, so systematic error patterns in the LLM labels could still affect the ranking of retrieval methods.

Load-bearing premise

The LLM judge, once checked by humans on a stratified sample, produces must-cite labels that would match expert judgment if applied to every paper in the collection.

What would settle it

A fresh round of human annotation on a large random subset of the 150,000 papers. Must-cite decisions that diverge substantially from the LLM labels would refute the premise; close agreement would support it.
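
One shape such a re-annotation round could take: a minimal sketch of drawing a stratified sample, assuming (venue, year) bins like those the simulated rebuttal proposes further down. The helper and record fields are illustrative, not from the paper.

```python
import random
from collections import defaultdict

def stratified_sample(papers, stratum_key, per_stratum, seed=0):
    """Draw a fixed number of papers from each stratum for human re-annotation.

    `papers` is any list of records; `stratum_key` maps a record to its
    stratum, e.g. a (venue, year) pair.
    """
    rng = random.Random(seed)
    strata = defaultdict(list)
    for paper in papers:
        strata[stratum_key(paper)].append(paper)
    sample = []
    for bucket in strata.values():
        sample.extend(rng.sample(bucket, min(per_stratum, len(bucket))))
    return sample

# Hypothetical usage over (venue, year) strata:
# subset = stratified_sample(all_papers, lambda p: (p["venue"], p["year"]), 50)
```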

Figures

Figures reproduced from arXiv: 2604.17680 by Kaiqun Fu, Lei Zhang, Md Toyaha Rahman Ratul, Taoran Ji, Zhiqian Chen.

Figure 1. Label distributions across the three must-cite tiers. (a) Type I is a binary baseline …
Figure 2. Empirical distribution (left) and cumulative distribution (right) of intra-paper …
Original abstract

The explosive growth of AI and machine learning literature, with venues like NeurIPS and ICLR now accepting thousands of papers annually, has made comprehensive citation coverage increasingly difficult for researchers. While citation recommendation has been studied for over a decade, existing systems primarily focus on broad relevance rather than identifying the critical set of "must-cite" papers: direct experimental baselines, foundational methods, and core dependencies whose omission would misrepresent a contribution's novelty or undermine reproducibility. We introduce MasterSet, a large-scale benchmark specifically designed to evaluate must-cite recommendation in the AI/ML domain. MasterSet incorporates over 150,000 papers collected from official conference proceedings/websites of 15 leading venues, serving as a comprehensive candidate pool for retrieval. We annotate citations with a three-tier labeling scheme: (I) experimental baseline status, (II) core relevance (1-5 scale), and (III) intra-paper mention frequency. Our annotation pipeline leverages an LLM-based judge, validated by human experts on a stratified sample. The benchmark task requires retrieving must-cite papers from the candidate pool given only a query paper's title and abstract, evaluated by Recall@K. We establish baselines using sparse retrieval, dense scientific embeddings, and graph-based methods, demonstrating that must-cite retrieval remains a challenging open problem.
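
To ground the task setup, here is a minimal sketch of the sparse-retrieval style of baseline the abstract mentions, built on the rank_bm25 package over title-plus-abstract text. The pool contents and function are illustrative, not the paper's implementation, and a faithful run would also restrict candidates to papers published before the query (see minor comment 2 below).

```python
from rank_bm25 import BM25Okapi  # pip install rank-bm25

# Hypothetical candidate pool: paper id -> concatenated "title. abstract" text.
pool = {
    "p1": "okapi at trec-3. probabilistic ranking with bm25 term weighting ...",
    "p2": "specter. document-level representation learning using citation-informed transformers ...",
    "p3": "neural citation network for context-aware citation recommendation ...",
}
ids = list(pool)
bm25 = BM25Okapi([pool[i].split() for i in ids])

def retrieve(query_text: str, k: int = 10) -> list[str]:
    """Rank the candidate pool by BM25 score for a query paper's title + abstract."""
    scores = bm25.get_scores(query_text.lower().split())
    order = sorted(range(len(ids)), key=lambda j: scores[j], reverse=True)
    return [ids[j] for j in order[:k]]

print(retrieve("context-aware citation recommendation with transformers"))
```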

Editorial analysis

A structured set of objections, weighed in public.

Referee report, simulated author's rebuttal, circularity audit, and axiom and free-parameter ledger. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The paper introduces MasterSet, a benchmark of over 150,000 papers drawn from official proceedings of 15 AI/ML venues. It defines must-cite papers via a three-tier annotation scheme (experimental baseline status, core relevance on a 1-5 scale, and intra-paper mention frequency) produced by an LLM judge that was validated by humans on a stratified sample. The task is to retrieve must-cite papers from the candidate pool given only a query paper's title and abstract, evaluated by Recall@K. Baselines using sparse retrieval, dense scientific embeddings, and graph methods are reported, leading to the claim that must-cite retrieval remains a challenging open problem.

Significance. If the must-cite labels are shown to be reliable, the benchmark would be a useful addition to citation recommendation research by shifting focus from broad relevance to the narrower, higher-stakes set of papers whose omission would affect novelty claims or reproducibility. The scale, use of official venue data, and three-tier labeling scheme are strengths; the absence of circularity or fitted parameters in the construction is also positive.

major comments (3)
  1. [Annotation Pipeline] The annotation pipeline states that the LLM judge was validated by human experts on a stratified sample but reports no quantitative metrics (accuracy, Cohen's kappa, or inter-annotator agreement) for that validation. This is load-bearing for the central claim, because all Recall@K numbers and the conclusion that the task is challenging rest on the assumption that the labels are accurate across the full 150k-paper pool.
  2. [Evaluation and Baselines] No sensitivity or error-propagation analysis is provided to show how label noise outside the human-validated strata would affect the observed gaps between sparse, dense, and graph baselines. If the LLM systematically under-labels experimental baselines or core dependencies in certain strata, the low recall figures could be an artifact rather than evidence of intrinsic task hardness.
  3. [§3 (MasterSet Construction)] The description of the stratified sampling for human validation does not specify stratum definitions, sample sizes per stratum, or coverage of the three annotation tiers. Without these details it is impossible to assess whether the validation generalizes to the full candidate pool, particularly for papers whose must-cite status depends on experimental baseline status.
minor comments (2)
  1. [Abstract] The abstract summarizes the human validation only as 'validated by human experts on a stratified sample' without even a sample size or agreement figure; a single sentence with these numbers would improve clarity.
  2. [Results] Table or figure captions for the baseline results should explicitly state the value of K used for Recall@K and whether the candidate pool is restricted to papers published before the query paper.
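
As a reference point for major comment 1: Cohen's kappa measures agreement between two annotators beyond what chance alone would produce. A minimal sketch of the binary case with scikit-learn, using made-up labels; it illustrates the requested statistic, not the paper's validation data.

```python
from sklearn.metrics import cohen_kappa_score

# Hypothetical binary must-cite decisions from the LLM judge and one human
# expert on the same validation sample (1 = must-cite).
llm    = [1, 0, 1, 1, 0, 0, 1, 0, 1, 1]
expert = [1, 0, 1, 0, 0, 0, 1, 0, 1, 0]

print(cohen_kappa_score(llm, expert))  # 1.0 = perfect, 0 = chance-level agreement

# For the ordinal 1-5 core-relevance tier, a weighted kappa penalizes
# near-misses less than distant disagreements:
# cohen_kappa_score(llm_scores, expert_scores, weights="quadratic")
```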

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback on MasterSet. The comments correctly identify areas where additional transparency and analysis will strengthen the manuscript. We address each major comment below and will incorporate the suggested revisions in the next version.

Point-by-point responses
  1. Referee: [Annotation Pipeline] The annotation pipeline states that the LLM judge was validated by human experts on a stratified sample but reports no quantitative metrics (accuracy, Cohen's kappa, or inter-annotator agreement) for that validation. This is load-bearing for the central claim, because all Recall@K numbers and the conclusion that the task is challenging rest on the assumption that the labels are accurate across the full 150k-paper pool.

    Authors: We agree that quantitative metrics are essential to substantiate the reliability of the LLM annotations. The original manuscript described the human validation on a stratified sample but omitted the specific agreement statistics. In the revised version we will add a new subsection in §3 reporting accuracy, Cohen's kappa, and inter-annotator agreement between the LLM judge and human experts on the validated sample. These numbers will directly support the claim that the labels are sufficiently reliable for the reported Recall@K results. revision: yes

  2. Referee: [Evaluation and Baselines] No sensitivity or error-propagation analysis is provided to show how label noise outside the human-validated strata would affect the observed gaps between sparse, dense, and graph baselines. If the LLM systematically under-labels experimental baselines or core dependencies in certain strata, the low recall figures could be an artifact rather than evidence of intrinsic task hardness.

    Authors: We acknowledge that an explicit sensitivity analysis would further demonstrate robustness. The initial submission relied on the human-validated sample to support label quality but did not include error-propagation experiments. In the revision we will add a short sensitivity subsection that (i) reports baseline Recall@K restricted to the human-validated subset and (ii) discusses the potential impact of plausible label noise on the observed performance gaps. This will clarify that the conclusion of task hardness is not an artifact of unexamined noise (one shape such a check could take is sketched after this list). revision: yes

  3. Referee: [§3 (MasterSet Construction)] The description of the stratified sampling for human validation does not specify stratum definitions, sample sizes per stratum, or coverage of the three annotation tiers. Without these details it is impossible to assess whether the validation generalizes to the full candidate pool, particularly for papers whose must-cite status depends on experimental baseline status.

    Authors: We agree that the current description of the stratified sampling is insufficiently detailed. The manuscript noted the use of stratification but did not enumerate the strata, per-stratum sizes, or explicit coverage of the three tiers. We will expand §3 with (a) precise stratum definitions (venue, year, and citation-count bins), (b) the number of papers sampled per stratum, and (c) confirmation that the sample includes instances from all three annotation tiers, including experimental-baseline cases. These additions will allow readers to evaluate generalizability. revision: yes
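
The second response promises an error-propagation analysis. A minimal sketch of one crude shape such a check could take: flip gold labels at a given rate and re-measure Recall@K. The symmetric noise model and all quantities are assumptions, not the paper's; it mainly shows why even modest label noise over a large pool can depress recall.

```python
import random

def recall_at_k(ranked_ids, gold, k):
    return len(set(ranked_ids[:k]) & gold) / max(len(gold), 1)

def noisy_gold(gold, pool, flip_rate, rng):
    """Flip each candidate's must-cite status independently with prob flip_rate.
    A deliberately crude noise model; real LLM-judge errors may be systematic."""
    return {p for p in pool if (p in gold) != (rng.random() < flip_rate)}

rng = random.Random(0)
pool = [f"p{i}" for i in range(1000)]   # hypothetical candidate pool
gold = set(pool[:5])                    # hypothetical must-cite set
ranked = pool[:50]                      # hypothetical retriever output

for rate in (0.0, 0.01, 0.05):
    trials = [recall_at_k(ranked, noisy_gold(gold, pool, rate, rng), k=20)
              for _ in range(200)]
    print(rate, round(sum(trials) / len(trials), 3))
```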

Circularity Check

0 steps flagged

No significant circularity; benchmark is externally constructed

full rationale

The paper constructs MasterSet from external conference proceedings of 15 venues (150k papers) and applies an LLM annotation pipeline validated by human experts on a stratified sample. Baselines are standard sparse/dense/graph retrieval methods evaluated via Recall@K on the new labels. No equations, fitted parameters, self-citations, or derivations appear in the provided text. The claim that must-cite retrieval is challenging follows directly from empirical gaps on this independent benchmark rather than reducing to any input by construction.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central contribution is the creation and annotation of a new dataset rather than any derivation; the main unverified premise is the accuracy of the LLM judge.

axioms (1)
  • domain assumption: LLM-based annotation with human validation on a sample produces labels that generalize to the full corpus and match expert must-cite judgments
    The annotation pipeline is described as leveraging an LLM judge validated by humans on a stratified sample.

pith-pipeline@v0.9.0 · 5544 in / 1172 out tokens · 35933 ms · 2026-05-10T04:35:49.858703+00:00 · methodology
