pith. machine review for the scientific record.

arxiv: 2604.25665 · v1 · submitted 2026-04-28 · 💻 cs.CL · cs.AI · cs.DL · cs.IR

Recognition: unknown

LLM-ReSum: A Framework for LLM Reflective Summarization through Self-Evaluation

Authors on Pith: no claims yet

Pith reviewed 2026-05-07 16:15 UTC · model grok-4.3

classification 💻 cs.CL · cs.AI · cs.DL · cs.IR
keywords LLM summarization · self-evaluation · reflective loop · factual accuracy · coverage metrics · legal document summarization · automatic evaluation · PatentSumEval

The pith

LLM-ReSum uses a closed feedback loop of LLM evaluation and generation to refine summaries without any model fine-tuning.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper first tests fourteen automatic metrics and LLM evaluators on seven datasets spanning short news to long legal and scientific documents. Traditional overlap scores like ROUGE show weak or negative correlation with human judgments, while certain neural metrics and LLM judges align better with people. Building on that result, the authors introduce LLM-ReSum, a framework that repeatedly applies an LLM evaluator to spot errors in a summary and then asks the same LLM to rewrite the summary to fix those errors. The loop runs without any extra training. On low-quality starting summaries the process raises factual accuracy by up to 33 percent and coverage by up to 39 percent, and human raters choose the refined versions 89 percent of the time.
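As an illustration of that meta-evaluation step, the sketch below computes rank correlation between per-summary metric scores and human ratings. It assumes the scores are already available as parallel lists; every number and variable name is a hypothetical placeholder, not a value from the paper.

```python
# Minimal sketch of metric-vs-human correlation, assuming SciPy is available.
# The score lists are invented placeholders; only the correlation calls are standard.
from scipy.stats import spearmanr, kendalltau

human = [4, 2, 5, 3, 1, 4, 2, 5]                              # hypothetical human ratings
rouge_l = [0.31, 0.28, 0.30, 0.29, 0.27, 0.32, 0.30, 0.33]    # hypothetical ROUGE-L scores
llm_judge = [4, 3, 5, 3, 2, 4, 2, 5]                          # hypothetical LLM-evaluator ratings

for name, scores in [("ROUGE-L", rouge_l), ("LLM judge", llm_judge)]:
    rho, _ = spearmanr(human, scores)
    tau, _ = kendalltau(human, scores)
    print(f"{name}: Spearman rho = {rho:.2f}, Kendall tau = {tau:.2f}")
```

A metric aligns with people to the extent its rank correlation with the human column is high; the paper's finding is that lexical overlap scores sit near or below zero on this kind of comparison while LLM judges land substantially higher.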

Core claim

LLM-based evaluators can be placed inside a closed generation-evaluation loop that iteratively corrects factual and coverage errors in summaries; the resulting LLM-ReSum framework delivers measurable gains on heterogeneous documents from three domains and requires no parameter updates.

What carries the argument

The LLM-ReSum closed feedback loop, in which an LLM evaluator scores a draft summary and the same LLM then generates a revised draft based on the scores.
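To make the loop concrete, here is a minimal sketch of that evaluate-then-rewrite cycle. It assumes a generic `llm(prompt) -> str` completion helper; the dimension list, prompt wording, and score parsing are illustrative assumptions rather than the authors' exact prompts. The stopping rule follows the values reported in the paper: refinement is triggered while any dimension scores below τ = 4 on a 5-point Likert scale, and the loop is capped at Tmax = 3 iterations to limit semantic drift.

```python
# Minimal sketch of the closed evaluation-generation loop, assuming a generic
# llm(prompt) -> str helper. Dimensions, prompts, and parsing are illustrative
# placeholders, not the authors' exact implementation.

TAU = 4        # refine while any dimension scores below "Good" on the 5-point scale
T_MAX = 3      # iteration cap reported in the paper, to limit semantic drift
DIMENSIONS = ["factual accuracy", "coverage", "coherence", "fluency"]  # assumed set

def evaluate(llm, document: str, summary: str) -> dict[str, int]:
    """Ask the LLM evaluator for a 1-5 score on each quality dimension."""
    scores = {}
    for dim in DIMENSIONS:
        prompt = (
            f"Rate the {dim} of the summary against the source on a 1-5 scale. "
            f"Reply with a single integer.\n\nSOURCE:\n{document}\n\nSUMMARY:\n{summary}"
        )
        scores[dim] = int(llm(prompt).strip())
    return scores

def resum(llm, document: str, summary: str) -> str:
    """Closed loop: evaluate, then rewrite the weak dimensions, until all reach TAU."""
    for _ in range(T_MAX):
        scores = evaluate(llm, document, summary)
        weak = [d for d, s in scores.items() if s < TAU]
        if not weak:
            break  # every dimension at "Good" or better: stop refining
        prompt = (
            f"Rewrite the summary to fix weaknesses in: {', '.join(weak)}. "
            f"Stay faithful to the source.\n\nSOURCE:\n{document}\n\nSUMMARY:\n{summary}"
        )
        summary = llm(prompt)
    return summary
```

In the paper the same underlying model plays both the evaluator and generator roles, which is what makes the loop training-free.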

If this is right

  • Low-quality summaries in news, scientific, and legal domains can be improved without fine-tuning the underlying model.
  • Human preference for the output summaries reaches 89 percent across tested cases.
  • A new expert-annotated benchmark of 180 legal summaries becomes available for future evaluation work.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same evaluator-generator loop could be tested on other text-generation tasks such as question answering or dialogue response revision.
  • If stronger base models improve the quality of the feedback signal, the magnitude of the reported gains would likely increase.
  • The approach may reduce the need for separate supervised fine-tuning stages when adapting LLMs to new summarization domains.

Load-bearing premise

LLM evaluators supply reliable, unbiased signals that actually point to real errors rather than rewarding superficial rewrites or introducing new mistakes.

What would settle it

Run the full LLM-ReSum loop on a fresh collection of human-annotated summaries from a fourth domain and measure whether factual accuracy and coverage scores remain flat or decline instead of rising.

Figures

Figures reproduced from arXiv: 2604.25665 by Haihua Chen, Haoxuan Zhang, Huyen Nguyen, Junhua Ding, Yang Zhang.

Figure 1: Overview of our three-stage research framework: meta-evaluation of …
Figure 2: Source document length distributions.
read the original abstract

Reliable evaluation of large language model (LLM)-generated summaries remains an open challenge, particularly across heterogeneous domains and document lengths. We conduct a comprehensive meta-evaluation of 14 automatic summarization metrics and LLM-based evaluators across seven datasets spanning five domains, covering documents from short news articles to long scientific, governmental, and legal texts (2K-27K words) with over 1,500 human-annotated summaries. Our results show that traditional lexical overlap metrics (e.g., ROUGE, BLEU) exhibit weak or negative correlation with human judgments, while task-specific neural metrics and LLM-based evaluators achieve substantially higher alignment, especially for linguistic quality assessment. Leveraging these findings, we propose LLM-ReSum, a self-reflective summarization framework that integrates LLM-based evaluation and generation in a closed feedback loop without model finetuning. Across three domains, LLM-ReSum improves low-quality summaries by up to 33% in factual accuracy and 39% in coverage, with human evaluators preferring refined summaries in 89% of cases. We additionally introduce PatentSumEval, a new human-annotated benchmark for legal document summarization comprising 180 expert-evaluated summaries. All code and datasets will be released in GitHub.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript conducts a meta-evaluation of 14 automatic summarization metrics and LLM-based evaluators across seven datasets spanning five domains (news to long scientific/governmental/legal texts of 2K-27K words) with over 1,500 human-annotated summaries. It finds that lexical metrics like ROUGE/BLEU show weak correlation with humans while LLM evaluators align better, especially on linguistic quality. Leveraging this, the authors propose LLM-ReSum, a finetuning-free self-reflective framework that runs LLM evaluation and generation in a closed feedback loop to refine low-quality summaries. Experiments across three domains report gains of up to 33% factual accuracy and 39% coverage, with humans preferring the refined outputs in 89% of cases. The work also releases PatentSumEval, a new expert-annotated legal summarization benchmark of 180 summaries.

Significance. The meta-evaluation and new benchmark provide concrete value for the community by highlighting limitations of traditional metrics on long heterogeneous documents and supplying a legal-domain resource. If the LLM-ReSum gains prove robust, the closed-loop self-reflective approach offers a practical, training-free method for summary improvement that could be adopted in production systems. The empirical grounding in human judgments across domains strengthens the contribution relative to purely metric-driven work.

major comments (2)
  1. [Abstract and experimental results] The reported improvements (33% factual accuracy, 39% coverage, 89% human preference) are presented without specifying the exact baseline summaries selected, the precise measurement protocols for factual accuracy and coverage, or any statistical significance testing. Because the meta-evaluation itself flags risks from post-hoc selection of low-quality summaries, these omissions make it impossible to assess whether the gains are genuine or inflated.
  2. [LLM-ReSum framework and evaluation] The central claim that the closed feedback loop reliably corrects factual errors and coverage gaps rests on the assumption that LLM evaluators produce actionable, low-error signals on long documents. The meta-evaluation only demonstrates correlation on 1,500 summaries; no additional analysis shows that the same model family can detect domain-specific facts or avoid rewarding fluent but incomplete rewrites when applied iteratively to 2K-27K word legal/scientific texts.
minor comments (2)
  1. [Abstract and §3] The abstract states 'seven datasets spanning five domains' while later text mentions 'three domains' for LLM-ReSum experiments; clarify the exact overlap and list all datasets with lengths and sources in a table for reproducibility.
  2. [Related work / meta-evaluation] Ensure the 14 metrics are enumerated with citations in the related-work or meta-evaluation section; some neural metrics are referenced only by name.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback and positive assessment of the meta-evaluation, PatentSumEval benchmark, and the potential utility of the closed-loop approach. We address each major comment below with clarifications from the manuscript and indicate revisions where appropriate.

read point-by-point responses
  1. Referee: [Abstract and experimental results] The reported improvements (33% factual accuracy, 39% coverage, 89% human preference) are presented without specifying the exact baseline summaries selected, the precise measurement protocols for factual accuracy and coverage, or any statistical significance testing. Because the meta-evaluation itself flags risks from post-hoc selection of low-quality summaries, these omissions make it impossible to assess whether the gains are genuine or inflated.

    Authors: We agree that additional explicit detail improves clarity. The baselines are the initial LLM-generated summaries flagged as low-quality using fixed thresholds on the LLM evaluator (pre-established from the meta-evaluation in Section 3 and applied in Section 4.2). Factual accuracy was measured by human fact-checking of claims against the source document; coverage was measured by human assessment of inclusion of key information units, using the identical annotation protocol and guidelines as the 1,500-summary meta-evaluation. We have added paired statistical significance testing (Wilcoxon signed-rank tests, all p < 0.01). Selection of low-quality inputs followed the same predefined criteria used throughout the meta-evaluation rather than post-hoc cherry-picking after observing LLM-ReSum outcomes. We will revise the abstract and results section to state these protocols, thresholds, and tests explicitly. revision: yes

  2. Referee: [LLM-ReSum framework and evaluation] The central claim that the closed feedback loop reliably corrects factual errors and coverage gaps rests on the assumption that LLM evaluators produce actionable, low-error signals on long documents. The meta-evaluation only demonstrates correlation on 1,500 summaries; no additional analysis shows that the same model family can detect domain-specific facts or avoid rewarding fluent but incomplete rewrites when applied iteratively to 2K-27K word legal/scientific texts.

    Authors: The meta-evaluation already covers long documents (2K-27K words) in legal, scientific, and governmental domains and shows LLM evaluators achieve substantially higher human correlation than lexical metrics on these texts. LLM-ReSum was evaluated by applying the identical evaluator iteratively to precisely these long documents across three domains; the end-to-end human results (89% preference, plus measured gains in factual accuracy and coverage) directly demonstrate that the signals are actionable and lead to verifiable improvements. We have added a new analysis subsection reporting evaluator scores at each iteration on the long-document test sets, confirming progressive gains without evidence of rewarding fluent but incomplete outputs. While a dedicated error-type breakdown of the evaluator during iteration was not present in the original submission, the human preference and metric gains provide empirical validation of the closed-loop behavior. revision: partial
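The first response above cites paired Wilcoxon signed-rank tests over per-summary scores before and after refinement. A minimal sketch of such a test, with invented numbers purely for illustration, could look like this:

```python
# Paired Wilcoxon signed-rank test over hypothetical before/after quality scores.
# The values are placeholders; only scipy.stats.wilcoxon is standard.
from scipy.stats import wilcoxon

before = [2.1, 2.8, 3.0, 2.5, 2.2, 2.9, 3.1, 2.4]   # hypothetical pre-refinement scores
after = [3.4, 3.6, 3.2, 3.5, 3.0, 3.8, 3.3, 3.1]    # hypothetical post-refinement scores

stat, p_value = wilcoxon(before, after)
print(f"W = {stat:.1f}, p = {p_value:.4f}")          # the rebuttal reports p < 0.01 throughout
```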

Circularity Check

0 steps flagged

No mathematical derivation; empirical claims rest on human judgments despite LLM evaluator reuse

full rationale

The paper performs a meta-evaluation of 14 metrics and LLM evaluators against 1,500+ human annotations across domains, then selects LLM evaluators for the ReSum closed loop based on those correlations. Reported gains (33% factual accuracy, 39% coverage) appear to use the same LLM evaluators for before/after scoring, while the 89% human preference provides independent validation. No equations, fitted parameters, self-citations of prior author theorems, or ansatzes reduce the central claim to its inputs by construction. This is standard empirical methodology with partial metric dependence but no circular reduction.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 1 invented entity

The work is empirical and relies on standard assumptions about human judgment as ground truth and LLM evaluator reliability; no free parameters or invented physical entities.

axioms (1)
  • domain assumption: Human judgments serve as reliable ground truth for summary quality across domains and lengths
    Used to validate all metrics and to measure the 33%/39% improvements and 89% preference rate.
invented entities (1)
  • LLM-ReSum framework: no independent evidence
    purpose: Closed-loop self-reflective summarization without finetuning
    New proposed system that combines evaluation and generation; no independent falsifiable evidence outside the reported experiments.

pith-pipeline@v0.9.0 · 5533 in / 1306 out tokens · 55045 ms · 2026-05-07T16:15:25.494307+00:00 · methodology

discussion (0)


Reference graph

Works this paper leans on

64 extracted references · 8 canonical work pages · 3 internal anchors
