pith. machine review for the scientific record.

arxiv: 2604.19185 · v1 · submitted 2026-04-21 · 💻 cs.CL · cs.AI

Recognition: unknown

SCURank: Ranking Multiple Candidate Summaries with Summary Content Units for Enhanced Summarization

Bo-Jyun Wang, Hung-Yu Kao, Ying-Jia Lin

Authors on Pith: no claims yet

Pith reviewed 2026-05-10 02:23 UTC · model grok-4.3

classification 💻 cs.CL cs.AI
keywords summarization · summary content units · model distillation · ranking · large language models · small language models · SCURank

The pith

SCURank ranks multiple candidate summaries by their Summary Content Units to outperform LLM comparisons and ROUGE for distillation.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces SCURank, a framework that ranks candidate summaries according to the richness and semantic importance of their Summary Content Units instead of unstable LLM judgments or surface-level overlap metrics. This ranking is applied when distilling small language models from summaries generated by multiple diverse large language models. Experiments show the approach beats traditional metrics and LLM-based ranking on standard measures across datasets while also producing more abstractive and higher-performing distilled models. A reader would care because current methods for selecting among summary candidates limit how reliably knowledge transfers from large models to smaller ones.
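
To make the distillation pipeline concrete, here is a minimal sketch of the data-selection loop as read from the abstract: collect one candidate summary per LLM for each document, score the candidates, and keep the top-ranked one as the training target for the small model. The interface (`candidate_generators`, `scurank_score`) and data shapes are illustrative assumptions, not the authors' released implementation.

```python
# Hedged sketch of the multi-LLM distillation data-selection loop described above.
# `candidate_generators` and `scurank_score` are illustrative placeholders,
# not the interface of the authors' released code.

from typing import Callable, Dict, List


def build_distillation_set(
    documents: List[str],
    candidate_generators: Dict[str, Callable[[str], str]],  # one summarizer per source LLM
    scurank_score: Callable[[str, str], float],             # (document, summary) -> score
) -> List[Dict[str, str]]:
    """For each document, gather one candidate summary per LLM, rank the
    candidates with the SCU-based scorer, and keep the top-ranked one as
    the training target for the small model (e.g., BART)."""
    training_pairs = []
    for doc in documents:
        candidates = {name: generate(doc) for name, generate in candidate_generators.items()}
        best_name, best_summary = max(
            candidates.items(), key=lambda kv: scurank_score(doc, kv[1])
        )
        training_pairs.append(
            {"document": doc, "target": best_summary, "source_llm": best_name}
        )
    return training_pairs
```

The distilled SLM would then be fine-tuned on these (document, target) pairs with a standard sequence-to-sequence objective; how the paper weights or filters candidates beyond the top-1 choice is not specified in the abstract.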

Core claim

SCURank evaluates summaries by the richness and semantic importance of their information content, measured through SCUs, and thereby provides a more stable and effective ranking than LLM comparisons or ROUGE; experimental results show it outperforms those methods across evaluation measures and datasets, while summaries from diverse LLMs further enhance distilled-model abstractiveness and performance.

What carries the argument

Summary Content Units (SCUs) as atomic information elements used to quantify semantic richness and importance when ranking candidate summaries.
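
As a rough illustration of that idea, the sketch below scores a candidate summary by the importance-weighted fraction of SCUs it appears to cover, then ranks candidates by that score. The `SCU` structure, the weights, and the token-overlap matching heuristic are stand-ins chosen for illustration; the paper's actual SCU extraction and matching machinery (which may rely on entailment or embedding models) is not reproduced here.

```python
# A minimal, assumption-laden sketch of SCU-weighted summary scoring.
# SCUs and weights are supplied externally; matching is a naive token-overlap
# heuristic standing in for whatever semantic matcher the paper actually uses.

from dataclasses import dataclass
from typing import List


@dataclass
class SCU:
    text: str      # one atomic content unit, e.g. "the model is distilled from BART"
    weight: float  # semantic importance; higher means more important


def covers(summary: str, scu: SCU, threshold: float = 0.6) -> bool:
    """Crude proxy for 'the summary expresses this SCU': fraction of the
    SCU's longer tokens that also occur in the summary."""
    summary_tokens = set(summary.lower().split())
    scu_tokens = [t for t in scu.text.lower().split() if len(t) > 3]
    if not scu_tokens:
        return False
    overlap = sum(t in summary_tokens for t in scu_tokens) / len(scu_tokens)
    return overlap >= threshold


def scu_score(summary: str, scus: List[SCU]) -> float:
    """Importance-weighted SCU coverage; higher means richer content."""
    total = sum(s.weight for s in scus)
    if total == 0:
        return 0.0
    covered = sum(s.weight for s in scus if covers(summary, s))
    return covered / total


def rank_candidates(candidates: List[str], scus: List[SCU]) -> List[str]:
    """Order candidate summaries from most to least SCU-rich."""
    return sorted(candidates, key=lambda c: scu_score(c, scus), reverse=True)
```

For instance, a candidate expressing three of four equally weighted SCUs would score 0.75 and outrank one that merely restates a single unit verbatim, which is roughly the content-centric behavior the paper argues surface-level overlap metrics do not capture.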

If this is right

  • SCURank delivers more stable rankings than direct LLM comparisons for summary selection.
  • Incorporating summaries from diverse LLMs increases abstractiveness in the resulting distilled model.
  • Distilled models achieve higher overall performance on standard summarization measures.
  • The gains hold across multiple datasets and evaluation protocols.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The content-unit focus could extend to ranking outputs in other generation tasks such as dialogue or question answering.
  • Prioritizing SCU richness might reduce certain LLM-specific biases in summary selection.
  • Better automated SCU extraction would make the ranking scalable to larger candidate pools.

Load-bearing premise

That SCUs can be identified reliably enough to measure semantic importance without introducing new instabilities or biases that undermine the ranking.

What would settle it

A controlled experiment on held-out data in which human raters, or downstream task performance, favor summaries selected by LLM-based ranking over those selected by SCURank.

Figures

Figures reproduced from arXiv: 2604.19185 by Bo-Jyun Wang, Hung-Yu Kao, Ying-Jia Lin.

Figure 1: Overview of our training framework. The Data Generation part generates candidate summaries from …
Figure 2: Human evaluation results comparing distilled …
Figure 3: Human preference evaluation results between …
Figure 4: Win-tie-loss rates between distilled models trained with SCURank and GPTRank, evaluated by three …
read the original abstract

Small language models (SLMs), such as BART, can achieve summarization performance comparable to large language models (LLMs) via distillation. However, existing LLM-based ranking strategies for summary candidates suffer from instability, while classical metrics (e.g., ROUGE) are insufficient to rank high-quality summaries. To address these issues, we introduce SCURank, a framework that enhances summarization by leveraging Summary Content Units (SCUs). Instead of relying on unstable comparisons or surface-level overlap, SCURank evaluates summaries based on the richness and semantic importance of information content. We investigate the effectiveness of SCURank in distilling summaries from multiple diverse LLMs. Experimental results demonstrate that SCURank outperforms traditional metrics and LLM-based ranking methods across evaluation measures and datasets. Furthermore, our findings show that incorporating diverse LLM summaries enhances model abstractiveness and overall distilled model performance, validating the benefits of information-centric ranking in multi-LLM distillation. The code for SCURank is available at https://github.com/IKMLab/SCURank.

Editorial analysis

A structured set of objections, weighed in public.

Referee report, simulated author's rebuttal, circularity audit, and an axiom & free-parameter ledger. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 3 minor

Summary. The paper proposes SCURank, a framework for ranking multiple candidate summaries by leveraging Summary Content Units (SCUs) to assess richness and semantic importance of information content. It targets instability in LLM-based ranking methods and the limitations of surface-level metrics like ROUGE, particularly in the setting of distilling small language models (e.g., BART) from diverse LLMs. The work claims that SCURank outperforms both traditional metrics and LLM-based ranking across evaluation measures and datasets, and that incorporating diverse LLM-generated summaries improves abstractiveness and overall distilled model performance. Code is released at https://github.com/IKMLab/SCURank.

Significance. If the central claims hold, SCURank could provide a more stable, information-centric alternative to direct LLM comparisons for summary selection, with direct benefits for multi-LLM distillation pipelines. The released code and the focus on abstractiveness gains are additional strengths that would bolster the contribution if the stability advantage is rigorously shown.

major comments (3)
  1. [§3] §3 (SCURank method): The automatic SCU identification procedure is load-bearing for the stability claim, yet the manuscript does not report variance across multiple runs, prompt variations, or different underlying models for SCU extraction; without such measurements, it is unclear whether SCURank actually reduces the instabilities that affect direct LLM ranking.
  2. [§4] §4 (Experiments): The headline claim of outperformance requires concrete evidence; the paper must include tables reporting exact scores (e.g., ROUGE, BERTScore, human judgments) with statistical significance tests against all baselines on each dataset, rather than qualitative assertions.
  3. [§4.2] §4.2 (Ablation on diverse LLMs): The assertion that diverse LLM summaries enhance abstractiveness needs a controlled comparison showing that the gain is attributable to SCURank ranking rather than simply to the union of candidates; the current setup risks confounding the ranking method with the candidate pool size.
minor comments (3)
  1. [§3] Notation for SCU scoring function should be formalized with an equation rather than prose description to aid reproducibility.
  2. [§1] The abstract states results on 'multiple datasets' but does not name them; the introduction or experimental section should list the exact corpora (e.g., CNN/DM, XSum) upfront.
  3. [§4] Figure captions should explicitly state the number of runs or seeds used for any variance bars.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback. The comments highlight important aspects of rigor around stability, empirical evidence, and experimental controls. We address each major comment below and commit to revisions that strengthen the manuscript without altering its core claims.

read point-by-point responses
  1. Referee: [§3] §3 (SCURank method): The automatic SCU identification procedure is load-bearing for the stability claim, yet the manuscript does not report variance across multiple runs, prompt variations, or different underlying models for SCU extraction; without such measurements, it is unclear whether SCURank actually reduces the instabilities that affect direct LLM ranking.

    Authors: We agree that explicit variance measurements for SCU extraction are necessary to substantiate the stability advantage. The original manuscript emphasized end-to-end performance but did not include these diagnostics. In the revision we will add a dedicated analysis subsection under §3 that reports (i) variance in extracted SCUs across five independent runs with fixed prompts, (ii) sensitivity to prompt paraphrases, and (iii) results when swapping the underlying LLM used for SCU generation. These new results will be accompanied by a brief discussion of why the SCU-based intermediate representation inherently dampens the ranking instability observed in direct LLM comparisons (a hedged sketch of one such stability check follows this point-by-point list). revision: yes

  2. Referee: [§4] §4 (Experiments): The headline claim of outperformance requires concrete evidence; the paper must include tables reporting exact scores (e.g., ROUGE, BERTScore, human judgments) with statistical significance tests against all baselines on each dataset, rather than qualitative assertions.

    Authors: We acknowledge that the current presentation relies partly on figures and summary statements. To meet the requested standard we will replace the qualitative claims in §4 with comprehensive tables that list exact ROUGE-1/2/L, BERTScore, and human judgment scores for SCURank and every baseline on all datasets. We will also add paired statistical significance tests (Wilcoxon signed-rank with Bonferroni correction) and report p-values in the tables. The revised text will explicitly reference these tables when stating outperformance. revision: yes

  3. Referee: [§4.2] §4.2 (Ablation on diverse LLMs): The assertion that diverse LLM summaries enhance abstractiveness needs a controlled comparison showing that the gain is attributable to SCURank ranking rather than simply to the union of candidates; the current setup risks confounding the ranking method with the candidate pool size.

    Authors: We recognize the potential confounding between ranking method and candidate-pool size. In the revision we will insert a new controlled ablation in §4.2 that fixes the total number of candidate summaries while varying their source diversity and the ranking method. The three conditions are: (1) SCURank ranking on the original diverse-LLM pool, (2) a strong baseline ranker (e.g., LLM-as-judge) on the same diverse pool, and (3) SCURank ranking on a single-LLM pool of matched size. Abstractiveness metrics and downstream distilled-model performance will be reported for all three, allowing readers to isolate the contribution of SCURank itself. revision: yes
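
Responses 1 and 2 both promise diagnostics that can be specified precisely, so a hedged sketch may help fix ideas: mean pairwise Kendall's tau between the candidate scores produced by repeated SCU-extraction runs (stability), and a paired Wilcoxon signed-rank test on per-document metric scores for two ranking methods (significance). The function names and data layout below are assumptions for illustration; the authors' actual analysis may use different estimators or corrections.

```python
# Hedged sketch of the diagnostics promised in responses 1 and 2:
# (i) ranking stability across repeated SCU-extraction runs, and
# (ii) a paired significance test between two ranking methods.
# Names and data layout are illustrative, not the authors' analysis code.

from itertools import combinations
from typing import List, Sequence

from scipy.stats import kendalltau, wilcoxon


def ranking_stability(score_runs: List[Sequence[float]]) -> float:
    """Mean pairwise Kendall's tau between the candidate scores produced by
    independent SCU-extraction runs on the same candidate pool. Values near
    1.0 mean the induced ranking barely changes from run to run."""
    taus = []
    for run_a, run_b in combinations(score_runs, 2):
        tau, _ = kendalltau(run_a, run_b)
        taus.append(tau)
    return sum(taus) / len(taus) if taus else float("nan")


def paired_significance(scores_scurank: Sequence[float],
                        scores_baseline: Sequence[float]) -> float:
    """Wilcoxon signed-rank test on per-document metric scores (e.g. ROUGE-L)
    for models distilled with SCURank vs. a baseline ranker. Returns the
    p-value; any Bonferroni correction across metrics and datasets is
    applied on top of this by the caller."""
    _, p_value = wilcoxon(scores_scurank, scores_baseline)
    return float(p_value)
```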

Circularity Check

0 steps flagged

No significant circularity; claims rest on empirical evaluation of a new ranking framework

full rationale

The paper defines SCURank as a novel method that ranks candidate summaries by counting and weighting Summary Content Units (SCUs) extracted from reference or source text. All central claims are supported by direct experimental comparisons against ROUGE and LLM-based baselines on standard datasets, with no equations or derivations that reduce a 'prediction' to a fitted parameter by construction. SCU identification is presented as an external, previously established technique rather than a self-defined construct whose output is then used to validate itself. No self-citation chain is invoked to establish uniqueness or to forbid alternatives. The derivation chain is therefore self-contained and externally falsifiable via the released code and reported metrics.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

Based on the abstract, the main assumption concerns the effectiveness of SCUs; no free parameters or new entities are explicitly mentioned.

axioms (1)
  • domain assumption Summary Content Units can be reliably extracted and used to measure semantic importance and richness.
    Central to the SCURank framework as described in the abstract.

pith-pipeline@v0.9.0 · 5492 in / 1278 out tokens · 52900 ms · 2026-05-10T02:23:25.575819+00:00 · methodology

