
arxiv: 2606.24894 · v1 · pith:I6VWY74Znew · submitted 2026-05-30 · 💻 cs.DL · cs.AI

RWGBench: Evaluating Scholarly Positioning in Related Work Generation

Pith reviewed 2026-06-28 17:39 UTC · model grok-4.3

classification 💻 cs.DL cs.AI
keywords: related work generation, citation evaluation, scholarly positioning, benchmark, RWG, LLM evaluation, human evaluation, scientific writing

The pith

RWGBench evaluates related work generation from citation decision-making rather than text similarity.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Existing metrics for related work generation (RWG) measure how similar the generated text is to published sections, but this approach fails to detect problems such as wrong citation choices or poor positioning of prior work. The paper builds RWGBench from a large collection of computer science papers to test models specifically on selecting appropriate citations, placing them in context, organizing the section, and structuring the discourse. Tests on current systems uncover limitations hidden by standard metrics. Human evaluations indicate that these citation-focused measures match expert opinions on quality much better than surface similarity scores.

Core claim

The central claim is that related work generation is a citation-level scholarly positioning task. RWGBench provides a benchmark and a multi-dimensional framework that evaluate it on citation selection, contextual appropriateness, organization, and discourse structure; the evaluation reveals systematic issues in current models and aligns better with human judgment than text similarity metrics do.

What carries the argument

RWGBench benchmark with its multi-dimensional evaluation framework for citation-centric assessment of related work sections.

If this is right

  • Current RWG systems show failures in citation selection and organization not caught by similarity-based metrics.
  • Oracle studies separate retrieval bottlenecks from generation issues in RWG.
  • Citation-centric metrics provide a closer match to expert assessments of scholarly quality.
  • Models can be developed to better align with actual scholarly writing practices using this testbed.
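The Oracle-study bullet above can be illustrated with a small attribution sketch. The abstract does not describe the Oracle protocol, so this is a hypothetical reading rather than the paper's method: every gold citation that a system failed to cite is attributed either to retrieval (it never appeared among the candidates) or to generation (it was available but not cited). The names `attribute_misses`, `gold`, `retrieved`, and `cited`, and the paper IDs, are illustrative assumptions.

```python
def attribute_misses(gold, retrieved, cited):
    """Split missed gold citations into retrieval-level and generation-level misses.

    gold, retrieved, cited: iterables of paper identifiers (hypothetical IDs).
    """
    gold, retrieved, cited = set(gold), set(retrieved), set(cited)
    missed = gold - cited
    return {
        # Gold citations the system did cite.
        "hit": sorted(gold & cited),
        # Gold citations that never reached the generator.
        "retrieval_miss": sorted(missed - retrieved),
        # Gold citations that were retrieved but left uncited.
        "generation_miss": sorted(missed & retrieved),
    }


# Toy example with made-up identifiers.
report = attribute_misses(
    gold={"p1", "p2", "p3", "p4"},
    retrieved={"p1", "p2", "p3", "p9"},
    cited={"p1", "p9"},
)
print(report)
```

Under this reading, an Oracle run that hands the generator the full gold set as candidates would empty the retrieval bucket, so any remaining misses are attributable to generation.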

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Future RWG models might need explicit training on citation graphs or positioning logic rather than just text generation.
  • The approach could extend to evaluating other parts of scientific papers like introductions or discussions.
  • Adoption might shift industry benchmarks away from BLEU or embedding similarity toward citation accuracy.

Load-bearing premise

The curated test set of 100 papers sufficiently represents broader patterns in scholarly writing and the four evaluation dimensions fully measure positioning quality.

What would settle it

Running the evaluation on a new, larger set of papers where citation-centric metrics show no stronger correlation with expert judgments than text similarity metrics would falsify the benchmark's superiority.
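The comparison described above could be run as a rank-correlation check. The sketch below assumes Spearman correlation as the alignment measure, which the abstract does not confirm, and uses made-up scores for five generated sections. The function names and all numbers are illustrative only.

```python
def average_ranks(values):
    """Return 1-based ranks, giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j across the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def spearman(x, y):
    """Spearman correlation, computed as Pearson correlation of the ranks.

    Assumes neither input is constant (otherwise the denominator is zero).
    """
    rx, ry = average_ranks(x), average_ranks(y)
    n = len(x)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    return cov / (var_x * var_y) ** 0.5


# Made-up scores for five generated sections (illustration only).
expert = [1, 2, 3, 4, 5]
citation_metric = [0.10, 0.30, 0.20, 0.70, 0.90]
similarity_metric = [0.60, 0.50, 0.55, 0.40, 0.65]

rho_citation = spearman(expert, citation_metric)
rho_similarity = spearman(expert, similarity_metric)
print(rho_citation, rho_similarity)
```

On these toy numbers the citation-centric score correlates with the expert ratings at about 0.9 and the similarity score at about 0.1. The falsifying outcome would be a larger sample where `rho_citation` is no higher than `rho_similarity`.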

Original abstract

Large language models have shown strong fluency in scientific writing, yet the evaluation of related work generation (RWG) remains limited. Existing RWG evaluations largely inherit summarization-oriented metrics, using lexical or semantic similarity to reference sections as proxies for quality. However, related work writing is fundamentally a citation-level scholarly positioning task: it requires selecting, organizing, and framing prior work to clarify how a target paper relates to, differs from, and contributes beyond existing research. As a result, models may generate coherent and semantically-relevant text while exhibiting academically critical failures, such as inappropriate citation selection or misplaced references, that conventional metrics do not capture. To this end, we introduce RWGBench, a benchmark that evaluates RWG from the perspective of citation decision-making rather than text similarity. RWGBench is constructed from a large-scale collection of 40,108 computer science papers and a retrieval corpus of 1.09 million documents, with a carefully curated test set comprising 100 papers and their corresponding published related work sections. We propose a multi-dimensional evaluation framework that assesses citation selection, contextual appropriateness, organization, and discourse structure. Experiments reveal systematic limitations in current systems that are obscured by standard evaluations, while Oracle studies further disentangle retrieval-level and generation-level bottlenecks. Human evaluation further shows that our citation-centric metrics align substantially better with expert judgment than surface-level text metrics. RWGBench offers a citation-centric testbed for developing and evaluating related work generation systems that are better aligned with scholarly writing practices.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript introduces RWGBench, a benchmark for related work generation (RWG) that evaluates systems from the perspective of citation decision-making rather than text similarity. Constructed from a corpus of 40,108 computer science papers and a 1.09 million document retrieval corpus, it includes a curated test set of 100 papers with their published related work sections. The work proposes a multi-dimensional framework assessing citation selection, contextual appropriateness, organization, and discourse structure; reports experiments revealing limitations in current RWG systems (including via Oracle studies separating retrieval and generation); and presents human evaluation results indicating that the citation-centric metrics align substantially better with expert judgment than conventional surface-level text metrics.

Significance. If the central results hold, RWGBench would offer a more academically grounded testbed for RWG that better captures scholarly positioning practices, addressing a gap where fluency-focused models can still fail on citation appropriateness. Strengths include the scale of the underlying corpus, the Oracle disentanglement of bottlenecks, and the human evaluation component demonstrating improved metric alignment.

major comments (2)
  1. [Abstract] Abstract and construction description: the evaluation framework treats the published related work sections in the 100-paper test set as the sole gold standard for citation selection and contextual appropriateness. The manuscript provides no discussion of the possibility that multiple non-identical but academically sound citation sets and framings may exist for the same target paper; if this multiplicity holds, automatic metrics that penalize deviation from the published version will not reliably measure scholarly positioning quality.
  2. [Abstract] Test set curation (described in Abstract): the claim that the 100-paper test set reveals systematic limitations in current RWG systems rests on its representativeness, yet the manuscript supplies insufficient detail on selection criteria, exclusion rules, potential topical or venue biases, and verification that the chosen references are canonical rather than one of several defensible options.
minor comments (2)
  1. Exact definitions of the proposed citation-centric metrics (selection, contextual appropriateness, etc.) and the data splits used for the 100-paper test set are not fully specified in the abstract-level description, hindering independent verification of the reported human alignment results.
  2. The retrieval corpus size (1.09 million documents) is stated without clarifying overlap or deduplication procedures relative to the 40,108-paper collection, which could affect Oracle study interpretations.
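Major comment 1 can be made concrete with a toy example. The sketch assumes citation selection is scored as set-overlap F1, which is an assumption because the exact metric definitions are not given in the abstract-level description. A single-reference scorer penalizes a defensible alternative citation set, whereas taking the best match over several acceptable sets does not. The `alternative` set is hypothetical; RWGBench as described provides only the published section.

```python
def f1(predicted, reference):
    """Set-overlap F1 between predicted and reference citation sets."""
    predicted, reference = set(predicted), set(reference)
    overlap = len(predicted & reference)
    if overlap == 0:
        return 0.0
    precision = overlap / len(predicted)
    recall = overlap / len(reference)
    return 2 * precision * recall / (precision + recall)


def best_f1(predicted, references):
    """Score against several acceptable reference sets and keep the best match."""
    return max(f1(predicted, ref) for ref in references)


published = {"a", "b", "c", "d"}     # citations in the published section
alternative = {"a", "b", "e", "f"}   # hypothetical second expert's choices
generated = {"a", "b", "e", "f"}     # system output matching the alternative

single = f1(generated, published)
multi = best_f1(generated, [published, alternative])
print(single, multi)
```

Here the generated set scores 0.5 against the published section alone but 1.0 once the alternative is admitted, which is the gap the referee asks the authors to discuss.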

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback. We address each major comment below and will make revisions to improve clarity and transparency in the manuscript.

  1. Referee: [Abstract] Abstract and construction description: the evaluation framework treats the published related work sections in the 100-paper test set as the sole gold standard for citation selection and contextual appropriateness. The manuscript provides no discussion of the possibility that multiple non-identical but academically sound citation sets and framings may exist for the same target paper; if this multiplicity holds, automatic metrics that penalize deviation from the published version will not reliably measure scholarly positioning quality.

    Authors: We appreciate this observation. Our benchmark uses published related work sections as concrete references representing actual expert scholarly positioning decisions. While alternative valid citation sets may exist, the metrics evaluate alignment with observed author choices rather than claiming uniqueness. We will revise the manuscript to explicitly discuss the potential for multiple sound framings and clarify that the evaluation measures fidelity to published practices. revision: yes

  2. Referee: [Abstract] Test set curation (described in Abstract): the claim that the 100-paper test set reveals systematic limitations in current RWG systems rests on its representativeness, yet the manuscript supplies insufficient detail on selection criteria, exclusion rules, potential topical or venue biases, and verification that the chosen references are canonical rather than one of several defensible options.

    Authors: We agree that additional details are needed to substantiate the test set's representativeness. The current description labels the set as 'carefully curated' without sufficient specifics. In the revised manuscript, we will expand the relevant sections to detail the selection criteria, exclusion rules, steps taken to address potential topical or venue biases, and any verification processes applied to the references. revision: yes

Circularity Check

0 steps flagged

No circularity; benchmark constructed from external corpus


The paper constructs RWGBench from an external collection of 40,108 CS papers plus a 1.09M-document retrieval corpus, then evaluates generated related work against the published sections of a 100-paper test set. No equations, fitted parameters, or self-citations appear in the provided text that reduce any claimed result to the paper's own inputs by construction. The multi-dimensional framework (citation selection, contextual appropriateness, organization, discourse structure) is defined directly from the task description rather than derived from prior self-referential results. Human evaluation is presented as external validation, not as a fitted input renamed as prediction. This matches the default case of a self-contained benchmark paper with no load-bearing circular steps.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review; no explicit free parameters, axioms, or invented entities are described. The benchmark construction implicitly assumes that published related work sections represent ground-truth scholarly positioning.

pith-pipeline@v0.9.1-grok · 5815 in / 1143 out tokens · 20070 ms · 2026-06-28T17:39:14.798335+00:00 · methodology

