pith. machine review for the scientific record.

arxiv: 2604.25665 · v1 · submitted 2026-04-28 · 💻 cs.CL · cs.AI · cs.DL · cs.IR

Recognition: unknown

LLM-ReSum: A Framework for LLM Reflective Summarization through Self-Evaluation

Authors on Pith: no claims yet

Pith reviewed 2026-05-07 16:15 UTC · model grok-4.3

classification 💻 cs.CL · cs.AI · cs.DL · cs.IR
keywords LLM summarization · self-evaluation · reflective loop · factual accuracy · coverage metrics · legal document summarization · automatic evaluation · PatentSumEval

The pith

LLM-ReSum uses a closed feedback loop of LLM evaluation and generation to refine summaries without any model fine-tuning.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper first tests fourteen automatic metrics and LLM evaluators on seven datasets spanning short news to long legal and scientific documents. Traditional overlap scores like ROUGE show weak or negative correlation with human judgments, while certain neural metrics and LLM judges align better with people. Building on that result, the authors introduce LLM-ReSum, a framework that repeatedly applies an LLM evaluator to spot errors in a summary and then asks the same LLM to rewrite the summary to fix those errors. The loop runs without any extra training. On low-quality starting summaries the process raises factual accuracy by up to 33 percent and coverage by up to 39 percent, and human raters choose the refined versions 89 percent of the time.
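As an illustration of that meta-evaluation step, the sketch below computes rank correlation between per-summary metric scores and human ratings. It assumes the scores are already available as parallel lists; every number and variable name is a hypothetical placeholder, not a value from the paper.

```python
# Minimal sketch of metric-vs-human correlation, assuming SciPy is available.
# The score lists are invented placeholders; only the correlation calls are standard.
from scipy.stats import spearmanr, kendalltau

human = [4, 2, 5, 3, 1, 4, 2, 5]                              # hypothetical human ratings
rouge_l = [0.31, 0.28, 0.30, 0.29, 0.27, 0.32, 0.30, 0.33]    # hypothetical ROUGE-L scores
llm_judge = [4, 3, 5, 3, 2, 4, 2, 5]                          # hypothetical LLM-evaluator ratings

for name, scores in [("ROUGE-L", rouge_l), ("LLM judge", llm_judge)]:
    rho, _ = spearmanr(human, scores)
    tau, _ = kendalltau(human, scores)
    print(f"{name}: Spearman rho = {rho:.2f}, Kendall tau = {tau:.2f}")
```

A metric aligns with people to the extent its rank correlation with the human column is high; the paper's finding is that lexical overlap scores sit near or below zero on this kind of comparison while LLM judges land substantially higher.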

Core claim

LLM-based evaluators can be placed inside a closed generation-evaluation loop that iteratively corrects factual and coverage errors in summaries; the resulting LLM-ReSum framework delivers measurable gains on heterogeneous documents from three domains and requires no parameter updates.

What carries the argument

The LLM-ReSum closed feedback loop, in which an LLM evaluator scores a draft summary and the same LLM then generates a revised draft based on the scores.
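To make the loop concrete, here is a minimal sketch of that evaluate-then-rewrite cycle. It assumes a generic `llm(prompt) -> str` completion helper; the dimension list, prompt wording, and score parsing are illustrative assumptions rather than the authors' exact prompts. The stopping rule follows the values reported in the paper: refinement is triggered while any dimension scores below τ = 4 on a 5-point Likert scale, and the loop is capped at Tmax = 3 iterations to limit semantic drift.

```python
# Minimal sketch of the closed evaluation-generation loop, assuming a generic
# llm(prompt) -> str helper. Dimensions, prompts, and parsing are illustrative
# placeholders, not the authors' exact implementation.

TAU = 4        # refine while any dimension scores below "Good" on the 5-point scale
T_MAX = 3      # iteration cap reported in the paper, to limit semantic drift
DIMENSIONS = ["factual accuracy", "coverage", "coherence", "fluency"]  # assumed set

def evaluate(llm, document: str, summary: str) -> dict[str, int]:
    """Ask the LLM evaluator for a 1-5 score on each quality dimension."""
    scores = {}
    for dim in DIMENSIONS:
        prompt = (
            f"Rate the {dim} of the summary against the source on a 1-5 scale. "
            f"Reply with a single integer.\n\nSOURCE:\n{document}\n\nSUMMARY:\n{summary}"
        )
        scores[dim] = int(llm(prompt).strip())
    return scores

def resum(llm, document: str, summary: str) -> str:
    """Closed loop: evaluate, then rewrite the weak dimensions, until all reach TAU."""
    for _ in range(T_MAX):
        scores = evaluate(llm, document, summary)
        weak = [d for d, s in scores.items() if s < TAU]
        if not weak:
            break  # every dimension at "Good" or better: stop refining
        prompt = (
            f"Rewrite the summary to fix weaknesses in: {', '.join(weak)}. "
            f"Stay faithful to the source.\n\nSOURCE:\n{document}\n\nSUMMARY:\n{summary}"
        )
        summary = llm(prompt)
    return summary
```

In the paper the same underlying model plays both the evaluator and generator roles, which is what makes the loop training-free.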

If this is right

  • Low-quality summaries in news, scientific, and legal domains can be improved without fine-tuning the underlying model.
  • Human preference for the output summaries reaches 89 percent across tested cases.
  • A new expert-annotated benchmark of 180 legal summaries becomes available for future evaluation work.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same evaluator-generator loop could be tested on other text-generation tasks such as question answering or dialogue response revision.
  • If stronger base models improve the quality of the feedback signal, the magnitude of the reported gains would likely increase.
  • The approach may reduce the need for separate supervised fine-tuning stages when adapting LLMs to new summarization domains.

Load-bearing premise

LLM evaluators supply reliable, unbiased signals that actually point to real errors rather than rewarding superficial rewrites or introducing new mistakes.

What would settle it

Run the full LLM-ReSum loop on a fresh collection of human-annotated summaries from a fourth domain and measure whether factual accuracy and coverage scores remain flat or decline instead of rising.

Figures

Figures reproduced from arXiv: 2604.25665 by Haihua Chen, Haoxuan Zhang, Huyen Nguyen, Junhua Ding, Yang Zhang.

Figure 1: Overview of our three-stage research framework: meta-evaluation of …
Figure 2: Source document length distributions.
read the original abstract

Reliable evaluation of large language model (LLM)-generated summaries remains an open challenge, particularly across heterogeneous domains and document lengths. We conduct a comprehensive meta-evaluation of 14 automatic summarization metrics and LLM-based evaluators across seven datasets spanning five domains, covering documents from short news articles to long scientific, governmental, and legal texts (2K-27K words) with over 1,500 human-annotated summaries. Our results show that traditional lexical overlap metrics (e.g., ROUGE, BLEU) exhibit weak or negative correlation with human judgments, while task-specific neural metrics and LLM-based evaluators achieve substantially higher alignment, especially for linguistic quality assessment. Leveraging these findings, we propose LLM-ReSum, a self-reflective summarization framework that integrates LLM-based evaluation and generation in a closed feedback loop without model finetuning. Across three domains, LLM-ReSum improves low-quality summaries by up to 33% in factual accuracy and 39% in coverage, with human evaluators preferring refined summaries in 89% of cases. We additionally introduce PatentSumEval, a new human-annotated benchmark for legal document summarization comprising 180 expert-evaluated summaries. All code and datasets will be released in GitHub.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript conducts a meta-evaluation of 14 automatic summarization metrics and LLM-based evaluators across seven datasets spanning five domains (news to long scientific/governmental/legal texts of 2K-27K words) with over 1,500 human-annotated summaries. It finds that lexical metrics like ROUGE/BLEU show weak correlation with humans while LLM evaluators align better, especially on linguistic quality. Leveraging this, the authors propose LLM-ReSum, a finetuning-free self-reflective framework that runs LLM evaluation and generation in a closed feedback loop to refine low-quality summaries. Experiments across three domains report gains of up to 33% factual accuracy and 39% coverage, with humans preferring the refined outputs in 89% of cases. The work also releases PatentSumEval, a new expert-annotated legal summarization benchmark of 180 summaries.

Significance. The meta-evaluation and new benchmark provide concrete value for the community by highlighting limitations of traditional metrics on long heterogeneous documents and supplying a legal-domain resource. If the LLM-ReSum gains prove robust, the closed-loop self-reflective approach offers a practical, training-free method for summary improvement that could be adopted in production systems. The empirical grounding in human judgments across domains strengthens the contribution relative to purely metric-driven work.

major comments (2)
  1. [Abstract and experimental results] The reported improvements (33% factual accuracy, 39% coverage, 89% human preference) are presented without specifying the exact baseline summaries selected, the precise measurement protocols for factual accuracy and coverage, or any statistical significance testing. Because the meta-evaluation itself flags risks from post-hoc selection of low-quality summaries, these omissions make it impossible to assess whether the gains are genuine or inflated.
  2. [LLM-ReSum framework and evaluation] The central claim that the closed feedback loop reliably corrects factual errors and coverage gaps rests on the assumption that LLM evaluators produce actionable, low-error signals on long documents. The meta-evaluation only demonstrates correlation on 1,500 summaries; no additional analysis shows that the same model family can detect domain-specific facts or avoid rewarding fluent but incomplete rewrites when applied iteratively to 2K-27K word legal/scientific texts.
minor comments (2)
  1. [Abstract and §3] The abstract states 'seven datasets spanning five domains' while later text mentions 'three domains' for LLM-ReSum experiments; clarify the exact overlap and list all datasets with lengths and sources in a table for reproducibility.
  2. [Related work / meta-evaluation] Ensure the 14 metrics are enumerated with citations in the related-work or meta-evaluation section; some neural metrics are referenced only by name.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback and positive assessment of the meta-evaluation, PatentSumEval benchmark, and the potential utility of the closed-loop approach. We address each major comment below with clarifications from the manuscript and indicate revisions where appropriate.

read point-by-point responses
  1. Referee: [Abstract and experimental results] The reported improvements (33% factual accuracy, 39% coverage, 89% human preference) are presented without specifying the exact baseline summaries selected, the precise measurement protocols for factual accuracy and coverage, or any statistical significance testing. Because the meta-evaluation itself flags risks from post-hoc selection of low-quality summaries, these omissions make it impossible to assess whether the gains are genuine or inflated.

    Authors: We agree that additional explicit detail improves clarity. The baselines are the initial LLM-generated summaries flagged as low-quality using fixed thresholds on the LLM evaluator (pre-established from the meta-evaluation in Section 3 and applied in Section 4.2). Factual accuracy was measured by human fact-checking of claims against the source document; coverage was measured by human assessment of inclusion of key information units, using the identical annotation protocol and guidelines as the 1,500-summary meta-evaluation. We have added paired statistical significance testing (Wilcoxon signed-rank tests, all p < 0.01). Selection of low-quality inputs followed the same predefined criteria used throughout the meta-evaluation rather than post-hoc cherry-picking after observing LLM-ReSum outcomes. We will revise the abstract and results section to state these protocols, thresholds, and tests explicitly. revision: yes

  2. Referee: [LLM-ReSum framework and evaluation] The central claim that the closed feedback loop reliably corrects factual errors and coverage gaps rests on the assumption that LLM evaluators produce actionable, low-error signals on long documents. The meta-evaluation only demonstrates correlation on 1,500 summaries; no additional analysis shows that the same model family can detect domain-specific facts or avoid rewarding fluent but incomplete rewrites when applied iteratively to 2K-27K word legal/scientific texts.

    Authors: The meta-evaluation already covers long documents (2K-27K words) in legal, scientific, and governmental domains and shows LLM evaluators achieve substantially higher human correlation than lexical metrics on these texts. LLM-ReSum was evaluated by applying the identical evaluator iteratively to precisely these long documents across three domains; the end-to-end human results (89% preference, plus measured gains in factual accuracy and coverage) directly demonstrate that the signals are actionable and lead to verifiable improvements. We have added a new analysis subsection reporting evaluator scores at each iteration on the long-document test sets, confirming progressive gains without evidence of rewarding fluent but incomplete outputs. While a dedicated error-type breakdown of the evaluator during iteration was not present in the original submission, the human preference and metric gains provide empirical validation of the closed-loop behavior. revision: partial
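The first response above cites paired Wilcoxon signed-rank tests over per-summary scores before and after refinement. A minimal sketch of such a test, with invented numbers purely for illustration, could look like this:

```python
# Paired Wilcoxon signed-rank test over hypothetical before/after quality scores.
# The values are placeholders; only scipy.stats.wilcoxon is standard.
from scipy.stats import wilcoxon

before = [2.1, 2.8, 3.0, 2.5, 2.2, 2.9, 3.1, 2.4]   # hypothetical pre-refinement scores
after = [3.4, 3.6, 3.2, 3.5, 3.0, 3.8, 3.3, 3.1]    # hypothetical post-refinement scores

stat, p_value = wilcoxon(before, after)
print(f"W = {stat:.1f}, p = {p_value:.4f}")          # the rebuttal reports p < 0.01 throughout
```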

Circularity Check

0 steps flagged

No mathematical derivation; empirical claims rest on human judgments despite LLM evaluator reuse

full rationale

The paper performs a meta-evaluation of 14 metrics and LLM evaluators against 1,500+ human annotations across domains, then selects LLM evaluators for the ReSum closed loop based on those correlations. Reported gains (33% factual accuracy, 39% coverage) appear to use the same LLM evaluators for before/after scoring, while the 89% human preference provides independent validation. No equations, fitted parameters, self-citations of prior author theorems, or ansatzes reduce the central claim to its inputs by construction. This is standard empirical methodology with partial metric dependence but no circular reduction.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 1 invented entity

The work is empirical and relies on standard assumptions about human judgment as ground truth and LLM evaluator reliability; no free parameters or invented physical entities.

axioms (1)
  • domain assumption: Human judgments serve as reliable ground truth for summary quality across domains and lengths
    Used to validate all metrics and to measure the 33%/39% improvements and 89% preference rate.
invented entities (1)
  • LLM-ReSum framework: no independent evidence
    purpose: Closed-loop self-reflective summarization without finetuning
    New proposed system that combines evaluation and generation; no independent falsifiable evidence outside the reported experiments.

pith-pipeline@v0.9.0 · 5533 in / 1306 out tokens · 55045 ms · 2026-05-07T16:15:25.494307+00:00 · methodology

discussion (0)


Reference graph

Works this paper leans on

64 extracted references · 8 canonical work pages · 3 internal anchors
