RWGBench: Evaluating Scholarly Positioning in Related Work Generation
Pith reviewed 2026-06-28 17:39 UTC · model grok-4.3
The pith
RWGBench evaluates related work generation from the perspective of citation decision-making rather than text similarity.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The central claim is that related work generation is a citation-level scholarly positioning task. RWGBench provides a benchmark and a multi-dimensional framework that evaluate it on citation selection, contextual appropriateness, organization, and discourse structure. According to the paper, this evaluation reveals systematic issues in current models and aligns better with human judgment than text similarity metrics do.
What carries the argument
RWGBench benchmark with its multi-dimensional evaluation framework for citation-centric assessment of related work sections.
If this is right
- Current RWG systems show failures in citation selection and organization not caught by similarity-based metrics.
- Oracle studies separate retrieval bottlenecks from generation issues in RWG.
- Citation-centric metrics provide a closer match to expert assessments of scholarly quality.
- Models can be developed to better align with actual scholarly writing practices using this testbed.
Where Pith is reading between the lines
- Future RWG models might need explicit training on citation graphs or positioning logic rather than just text generation.
- The approach could extend to evaluating other parts of scientific papers like introductions or discussions.
- Adoption might shift industry benchmarks away from BLEU or embedding similarity toward citation accuracy.
Load-bearing premise
The curated test set of 100 papers sufficiently represents broader patterns in scholarly writing and the four evaluation dimensions fully measure positioning quality.
What would settle it
Running the evaluation on a new, larger set of papers where citation-centric metrics show no stronger correlation with expert judgments than text similarity metrics would falsify the benchmark's superiority.
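A minimal sketch of that check, assuming Spearman rank correlation as the agreement measure (the abstract does not say which statistic the authors use). The function names and the score lists are hypothetical and only illustrate the comparison; they are not results from the paper.

```python
from statistics import mean


def average_ranks(values):
    """Return 1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with values[order[i]].
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = shared_rank
        i = j + 1
    return ranks


def spearman(xs, ys):
    """Spearman correlation, computed as Pearson correlation of the ranks."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equally long sequences with at least 2 items")
    rx, ry = average_ranks(xs), average_ranks(ys)
    mx, my = mean(rx), mean(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / (var_x * var_y) ** 0.5


# Hypothetical per-paper scores, invented purely to show the comparison.
expert = [4, 2, 5, 1, 3]
citation_metric = [0.70, 0.35, 0.90, 0.20, 0.40]
text_similarity = [0.52, 0.55, 0.50, 0.49, 0.58]

gap = spearman(citation_metric, expert) - spearman(text_similarity, expert)
print(f"correlation gap (citation-centric minus text similarity): {gap:.3f}")
```

If the gap stays near zero or turns negative on a new, larger sample, the claimed advantage of the citation-centric metrics would not be supported.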
Original abstract
Large language models have shown strong fluency in scientific writing, yet the evaluation of related work generation (RWG) remains limited. Existing RWG evaluations largely inherit summarization-oriented metrics, using lexical or semantic similarity to reference sections as proxies for quality. However, related work writing is fundamentally a citation-level scholarly positioning task: it requires selecting, organizing, and framing prior work to clarify how a target paper relates to, differs from, and contributes beyond existing research. As a result, models may generate coherent and semantically-relevant text while exhibiting academically critical failures, such as inappropriate citation selection or misplaced references, that conventional metrics do not capture. To this end, we introduce RWGBench, a benchmark that evaluates RWG from the perspective of citation decision-making rather than text similarity. RWGBench is constructed from a large-scale collection of 40,108 computer science papers and a retrieval corpus of 1.09 million documents, with a carefully curated test set comprising 100 papers and their corresponding published related work sections. We propose a multi-dimensional evaluation framework that assesses citation selection, contextual appropriateness, organization, and discourse structure. Experiments reveal systematic limitations in current systems that are obscured by standard evaluations, while Oracle studies further disentangle retrieval-level and generation-level bottlenecks. Human evaluation further shows that our citation-centric metrics align substantially better with expert judgment than surface-level text metrics. RWGBench offers a citation-centric testbed for developing and evaluating related work generation systems that are better aligned with scholarly writing practices.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript introduces RWGBench, a benchmark for related work generation (RWG) that evaluates systems from the perspective of citation decision-making rather than text similarity. Constructed from a collection of 40,108 computer science papers and a retrieval corpus of 1.09 million documents, it includes a curated test set of 100 papers with their published related work sections. The work proposes a multi-dimensional framework assessing citation selection, contextual appropriateness, organization, and discourse structure; reports experiments revealing limitations in current RWG systems (including Oracle studies that separate retrieval-level from generation-level bottlenecks); and presents human evaluation results indicating that the citation-centric metrics align substantially better with expert judgment than conventional surface-level text metrics.
Significance. If the central results hold, RWGBench would offer a more academically grounded testbed for RWG that better captures scholarly positioning practices, addressing a gap where fluency-focused models can still fail on citation appropriateness. Strengths include the scale of the underlying corpus, the Oracle disentanglement of bottlenecks, and the human evaluation component demonstrating improved metric alignment.
major comments (2)
- [Abstract] Abstract and construction description: the evaluation framework treats the published related work sections in the 100-paper test set as the sole gold standard for citation selection and contextual appropriateness. The manuscript provides no discussion of the possibility that multiple non-identical but academically sound citation sets and framings may exist for the same target paper; if this multiplicity holds, automatic metrics that penalize deviation from the published version will not reliably measure scholarly positioning quality.
- [Abstract] Test set curation (described in Abstract): the claim that the 100-paper test set reveals systematic limitations in current RWG systems rests on its representativeness, yet the manuscript supplies insufficient detail on selection criteria, exclusion rules, potential topical or venue biases, and verification that the chosen references are canonical rather than one of several defensible options.
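The multiplicity concern in the first comment can be made concrete with a small sketch. It assumes citation selection is scored as set-level precision, recall, and F1 against the published citation set, which is a common choice but not necessarily the paper's definition. The names `citation_prf` and `best_f1_over_references` and the paper IDs are hypothetical.

```python
def citation_prf(predicted, gold):
    """Set-level precision, recall, and F1 of predicted citations vs. one gold set."""
    predicted, gold = set(predicted), set(gold)
    hits = len(predicted & gold)
    precision = hits / len(predicted) if predicted else 0.0
    recall = hits / len(gold) if gold else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


def best_f1_over_references(predicted, gold_sets):
    """Score against several acceptable citation sets and keep the best F1."""
    return max(citation_prf(predicted, gold)[2] for gold in gold_sets)


# Hypothetical IDs: a generated section cites a, b, c.
generated = {"a", "b", "c"}
published = {"a", "b", "d", "e"}    # the single published reference set
alternative = {"a", "b", "c"}       # another defensible set (assumed to exist)

single_reference_f1 = citation_prf(generated, published)[2]
multi_reference_f1 = best_f1_over_references(generated, [published, alternative])
print(f"single-reference F1: {single_reference_f1:.3f}")
print(f"multi-reference F1:  {multi_reference_f1:.3f}")
```

When only the published set counts as gold, a defensible alternative selection is penalized; scoring against several accepted sets removes that penalty but requires collecting those sets from experts.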
minor comments (2)
- Exact definitions of the proposed citation-centric metrics (selection, contextual appropriateness, etc.) and the data splits used for the 100-paper test set are not fully specified in the abstract-level description, hindering independent verification of the reported human alignment results.
- The retrieval corpus size (1.09 million documents) is stated without clarifying overlap or deduplication procedures relative to the 40,108-paper collection, which could affect Oracle study interpretations.
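One way the overlap question in the second minor comment could be checked is sketched below. It assumes documents can be matched by a normalized title, which is a simplification (identifiers such as DOIs or arXiv IDs would be more reliable where available). The helper names and the sample titles are hypothetical and do not describe the authors' pipeline.

```python
import re
import unicodedata


def normalize_title(title):
    """Lowercase, strip accents, and collapse punctuation and whitespace."""
    text = unicodedata.normalize("NFKD", title).lower()
    text = "".join(ch for ch in text if not unicodedata.combining(ch))
    return re.sub(r"[^a-z0-9]+", " ", text).strip()


def overlap_report(collection_titles, corpus_titles):
    """Count duplicate corpus entries and entries shared with the collection."""
    collection_keys = {normalize_title(t) for t in collection_titles}
    corpus_keys = [normalize_title(t) for t in corpus_titles]
    unique_corpus_keys = set(corpus_keys)
    return {
        "corpus_duplicates": len(corpus_keys) - len(unique_corpus_keys),
        "shared_with_collection": len(collection_keys & unique_corpus_keys),
    }


# Hypothetical titles, used only to exercise the functions.
collection = ["SciBERT: A Pretrained Language Model", "Dense Retrieval"]
corpus = [
    "scibert - a pretrained language model",
    "Scaling Laws",
    "Scaling  laws.",
]
print(overlap_report(collection, corpus))
```

Reporting both counts would let readers judge whether target papers or near-duplicates of them can surface as retrieval candidates in the Oracle studies.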
Simulated Author's Rebuttal
We thank the referee for the constructive feedback. We address each major comment below and will make revisions to improve clarity and transparency in the manuscript.
- Referee: [Abstract] Abstract and construction description: the evaluation framework treats the published related work sections in the 100-paper test set as the sole gold standard for citation selection and contextual appropriateness. The manuscript provides no discussion of the possibility that multiple non-identical but academically sound citation sets and framings may exist for the same target paper; if this multiplicity holds, automatic metrics that penalize deviation from the published version will not reliably measure scholarly positioning quality.
  Authors: We appreciate this observation. Our benchmark uses published related work sections as concrete references representing actual expert scholarly positioning decisions. While alternative valid citation sets may exist, the metrics evaluate alignment with observed author choices rather than claiming uniqueness. We will revise the manuscript to explicitly discuss the potential for multiple sound framings and clarify that the evaluation measures fidelity to published practices.
  Revision: yes
- Referee: [Abstract] Test set curation (described in Abstract): the claim that the 100-paper test set reveals systematic limitations in current RWG systems rests on its representativeness, yet the manuscript supplies insufficient detail on selection criteria, exclusion rules, potential topical or venue biases, and verification that the chosen references are canonical rather than one of several defensible options.
  Authors: We agree that additional details are needed to substantiate the test set's representativeness. The current description labels the set as 'carefully curated' without sufficient specifics. In the revised manuscript, we will expand the relevant sections to detail the selection criteria, exclusion rules, steps taken to address potential topical or venue biases, and any verification processes applied to the references.
  Revision: yes
Circularity Check
No circularity; benchmark constructed from external corpus
The paper constructs RWGBench from an external collection of 40,108 CS papers plus a 1.09M-document retrieval corpus, then evaluates generated related work against the published sections of a 100-paper test set. No equations, fitted parameters, or self-citations appear in the provided text that reduce any claimed result to the paper's own inputs by construction. The multi-dimensional framework (citation selection, contextual appropriateness, organization, discourse structure) is defined directly from the task description rather than derived from prior self-referential results. Human evaluation is presented as external validation, not as a fitted input renamed as prediction. This matches the default case of a self-contained benchmark paper with no load-bearing circular steps.