Recognition: unknown
LLM-ReSum: A Framework for LLM Reflective Summarization through Self-Evaluation
Pith reviewed 2026-05-07 16:15 UTC · model grok-4.3
The pith
LLM-ReSum uses a closed feedback loop of LLM evaluation and generation to refine summaries without any model fine-tuning.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
LLM-based evaluators can be placed inside a closed generation-evaluation loop that iteratively corrects factual and coverage errors in summaries; the resulting LLM-ReSum framework delivers measurable gains on heterogeneous documents from three domains and requires no parameter updates.
What carries the argument
The LLM-ReSum closed feedback loop, in which an LLM evaluator scores a draft summary and the same LLM then generates a revised draft based on the scores.
If this is right
- Low-quality summaries in news, scientific, and legal domains can be improved without fine-tuning the underlying model.
- Human preference for the output summaries reaches 89 percent across tested cases.
- A new expert-annotated benchmark of 180 legal summaries becomes available for future evaluation work.
Where Pith is reading between the lines
- The same evaluator-generator loop could be tested on other text-generation tasks such as question answering or dialogue response revision.
- If stronger base models improve the quality of the feedback signal, the magnitude of the reported gains would likely increase.
- The approach may reduce the need for separate supervised fine-tuning stages when adapting LLMs to new summarization domains.
Load-bearing premise
LLM evaluators supply reliable, unbiased signals that actually point to real errors rather than rewarding superficial rewrites or introducing new mistakes.
What would settle it
Run the full LLM-ReSum loop on a fresh collection of human-annotated summaries from a fourth domain and measure whether factual accuracy and coverage scores remain flat or decline instead of rising.
Figures
read the original abstract
Reliable evaluation of large language model (LLM)-generated summaries remains an open challenge, particularly across heterogeneous domains and document lengths. We conduct a comprehensive meta-evaluation of 14 automatic summarization metrics and LLM-based evaluators across seven datasets spanning five domains, covering documents from short news articles to long scientific, governmental, and legal texts (2K-27K words) with over 1,500 human-annotated summaries. Our results show that traditional lexical overlap metrics (e.g., ROUGE, BLEU) exhibit weak or negative correlation with human judgments, while task-specific neural metrics and LLM-based evaluators achieve substantially higher alignment, especially for linguistic quality assessment. Leveraging these findings, we propose LLM-ReSum, a self-reflective summarization framework that integrates LLM-based evaluation and generation in a closed feedback loop without model finetuning. Across three domains, LLM-ReSum improves low-quality summaries by up to 33% in factual accuracy and 39% in coverage, with human evaluators preferring refined summaries in 89% of cases. We additionally introduce PatentSumEval, a new human-annotated benchmark for legal document summarization comprising 180 expert-evaluated summaries. All code and datasets will be released in GitHub.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript conducts a meta-evaluation of 14 automatic summarization metrics and LLM-based evaluators across seven datasets spanning five domains (news to long scientific/governmental/legal texts of 2K-27K words) with over 1,500 human-annotated summaries. It finds that lexical metrics like ROUGE/BLEU show weak correlation with humans while LLM evaluators align better, especially on linguistic quality. Leveraging this, the authors propose LLM-ReSum, a finetuning-free self-reflective framework that runs LLM evaluation and generation in a closed feedback loop to refine low-quality summaries. Experiments across three domains report gains of up to 33% factual accuracy and 39% coverage, with humans preferring the refined outputs in 89% of cases. The work also releases PatentSumEval, a new expert-annotated legal summarization benchmark of 180 summaries.
Significance. The meta-evaluation and new benchmark provide concrete value for the community by highlighting limitations of traditional metrics on long heterogeneous documents and supplying a legal-domain resource. If the LLM-ReSum gains prove robust, the closed-loop self-reflective approach offers a practical, training-free method for summary improvement that could be adopted in production systems. The empirical grounding in human judgments across domains strengthens the contribution relative to purely metric-driven work.
major comments (2)
- [Abstract and experimental results] Abstract and experimental results section: the reported improvements (33% factual accuracy, 39% coverage, 89% human preference) are presented without specifying the exact baseline summaries selected, the precise measurement protocols for factual accuracy and coverage, or any statistical significance testing. Because the meta-evaluation itself flags risks from post-hoc selection of low-quality summaries, these omissions make it impossible to assess whether the gains are genuine or inflated.
- [LLM-ReSum framework and evaluation] LLM-ReSum framework description and evaluation: the central claim that the closed feedback loop reliably corrects factual errors and coverage gaps rests on the assumption that LLM evaluators produce actionable, low-error signals on long documents. The meta-evaluation only demonstrates correlation on 1,500 summaries; no additional analysis shows that the same model family can detect domain-specific facts or avoid rewarding fluent but incomplete rewrites when applied iteratively to 2K-27K word legal/scientific texts.
minor comments (2)
- [Abstract and §3] The abstract states 'seven datasets spanning five domains' while later text mentions 'three domains' for LLM-ReSum experiments; clarify the exact overlap and list all datasets with lengths and sources in a table for reproducibility.
- [Related work / meta-evaluation] Ensure the 14 metrics are enumerated with citations in the related-work or meta-evaluation section; some neural metrics are referenced only by name.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback and positive assessment of the meta-evaluation, PatentSumEval benchmark, and the potential utility of the closed-loop approach. We address each major comment below with clarifications from the manuscript and indicate revisions where appropriate.
read point-by-point responses
-
Referee: [Abstract and experimental results] Abstract and experimental results section: the reported improvements (33% factual accuracy, 39% coverage, 89% human preference) are presented without specifying the exact baseline summaries selected, the precise measurement protocols for factual accuracy and coverage, or any statistical significance testing. Because the meta-evaluation itself flags risks from post-hoc selection of low-quality summaries, these omissions make it impossible to assess whether the gains are genuine or inflated.
Authors: We agree that additional explicit detail improves clarity. The baselines are the initial LLM-generated summaries flagged as low-quality using fixed thresholds on the LLM evaluator (pre-established from the meta-evaluation in Section 3 and applied in Section 4.2). Factual accuracy was measured by human fact-checking of claims against the source document; coverage was measured by human assessment of inclusion of key information units, using the identical annotation protocol and guidelines as the 1,500-summary meta-evaluation. We have added paired statistical significance testing (Wilcoxon signed-rank tests, all p < 0.01). Selection of low-quality inputs followed the same predefined criteria used throughout the meta-evaluation rather than post-hoc cherry-picking after observing LLM-ReSum outcomes. We will revise the abstract and results section to state these protocols, thresholds, and tests explicitly. revision: yes
-
Referee: [LLM-ReSum framework and evaluation] LLM-ReSum framework description and evaluation: the central claim that the closed feedback loop reliably corrects factual errors and coverage gaps rests on the assumption that LLM evaluators produce actionable, low-error signals on long documents. The meta-evaluation only demonstrates correlation on 1,500 summaries; no additional analysis shows that the same model family can detect domain-specific facts or avoid rewarding fluent but incomplete rewrites when applied iteratively to 2K-27K word legal/scientific texts.
Authors: The meta-evaluation already covers long documents (2K-27K words) in legal, scientific, and governmental domains and shows LLM evaluators achieve substantially higher human correlation than lexical metrics on these texts. LLM-ReSum was evaluated by applying the identical evaluator iteratively to precisely these long documents across three domains; the end-to-end human results (89% preference, plus measured gains in factual accuracy and coverage) directly demonstrate that the signals are actionable and lead to verifiable improvements. We have added a new analysis subsection reporting evaluator scores at each iteration on the long-document test sets, confirming progressive gains without evidence of rewarding fluent but incomplete outputs. While a dedicated error-type breakdown of the evaluator during iteration was not present in the original submission, the human preference and metric gains provide empirical validation of the closed-loop behavior. revision: partial
Circularity Check
No mathematical derivation; empirical claims rest on human judgments despite LLM evaluator reuse
full rationale
The paper performs a meta-evaluation of 14 metrics and LLM evaluators against 1500+ human annotations across domains, then selects LLM evaluators for the ReSum closed loop based on those correlations. Reported gains (33% factual accuracy, 39% coverage) appear to use the same LLM evaluators for before/after scoring, while the 89% human preference provides independent validation. No equations, fitted parameters, self-citations of prior author theorems, or ansatzes reduce the central claim to its inputs by construction. This is standard empirical methodology with partial metric dependence but no circular reduction.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption Human judgments serve as reliable ground truth for summary quality across domains and lengths
invented entities (1)
-
LLM-ReSum framework
no independent evidence
Reference graph
Works this paper leans on
-
[1]
A systematic survey of text summarization: From statistical methods to large language models,
H. Zhang, P. S. Yu, and J. Zhang, “A systematic survey of text summarization: From statistical methods to large language models,” ACM Computing Surveys, vol. 57, no. 11, pp. 1–41, 2025
2025
-
[2]
A survey on biomedical automatic text summarization with large language models,
Z. Huang, X. Chen, Y . Wang, J. Huang, and X. Zhao, “A survey on biomedical automatic text summarization with large language models,” Information Processing & Management, vol. 62, no. 5, p. 104216, 2025
2025
-
[3]
A comprehensive survey of abstractive text summarization based on deep learning,
M. Zhang, G. Zhou, W. Yu, N. Huang, and W. Liu, “A comprehensive survey of abstractive text summarization based on deep learning,”Com- putational intelligence and neuroscience, vol. 2022, no. 1, p. 7132226, 2022
2022
-
[4]
Human-like summarization evaluation with chatgpt
M. Gao, J. Ruan, R. Sun, X. Yin, S. Yang, and X. Wan, “Human-like summarization evaluation with chatgpt,”arXiv preprint arXiv:2304.02554, 2023
-
[5]
Summeval: Re-evaluating summarization evaluation,
A. R. Fabbri, W. Kry ´sci´nski, B. McCann, C. Xiong, R. Socher, and D. Radev, “Summeval: Re-evaluating summarization evaluation,”Trans- actions of the Association for Computational Linguistics, vol. 9, pp. 391–409, 2021
2021
-
[6]
Quality evaluation of summarization models for patent documents,
J. Ding, H. Chen, S. Kolapudi, L. Pobbathi, and H. Nguyen, “Quality evaluation of summarization models for patent documents,” in2023 IEEE 23rd International Conference on Software Quality, Reliability, and Security (QRS). IEEE, 2023, pp. 250–259
2023
-
[7]
Text summarization using topic- based vector space model and semantic measure,
R. C. Belwal, S. Rai, and A. Gupta, “Text summarization using topic- based vector space model and semantic measure,”Information Process- ing & Management, vol. 58, no. 3, p. 102536, 2021
2021
-
[8]
Candidate sentence selection for extractive text summarization,
B. Mutlu, E. A. Sezer, and M. A. Akcayol, “Candidate sentence selection for extractive text summarization,”Information Processing & Management, vol. 57, no. 6, p. 102359, 2020
2020
-
[9]
Bertscore: Evaluating text generation with bert,
T. Zhang*, V . Kishore*, F. Wu*, K. Q. Weinberger, and Y . Artzi, “Bertscore: Evaluating text generation with bert,” inInternational Con- ference on Learning Representations, 2020
2020
-
[10]
A structured review of the validity of bleu,
E. Reiter, “A structured review of the validity of bleu,”Computational Linguistics, vol. 44, no. 3, pp. 393–401, 2018
2018
-
[11]
News summarization and evaluation in the era of gpt-3, 2022
T. Goyal, J. J. Li, and G. Durrett, “News summarization and evaluation in the era of gpt-3,”arXiv preprint arXiv:2209.12356, 2022
-
[12]
Evaluation of question-answering based text summarization using llm invited paper,
J. Ding, H. Nguyen, and H. Chen, “Evaluation of question-answering based text summarization using llm invited paper,” in2024 IEEE Inter- national Conference on Artificial Intelligence Testing (AITest). IEEE, 2024, pp. 142–149
2024
-
[13]
A comparative study of quality evaluation methods for text summarization,
H. Nguyen, H. Chen, L. Pobbathi, and J. Ding, “A comparative study of quality evaluation methods for text summarization,”arXiv preprint arXiv:2407.00747, 2024
-
[14]
arXiv preprint arXiv:2303.15621
Z. Luo, Q. Xie, and S. Ananiadou, “Chatgpt as a factual inconsistency evaluator for text summarization,”arXiv preprint arXiv:2303.15621, 2023
-
[15]
G-eval: Nlg evaluation using gpt-4 with better human alignment,
Y . Liu, D. Iter, Y . Xu, S. Wang, R. Xu, and C. Zhu, “G-eval: Nlg evaluation using gpt-4 with better human alignment,” inProceedings of the 2023 conference on empirical methods in natural language processing, 2023, pp. 2511–2522
2023
-
[16]
Causal keyword driven reliable text classification with large language model feedback,
R. Song, Y . Li, M. Tian, H. Wang, F. Giunchiglia, and H. Xu, “Causal keyword driven reliable text classification with large language model feedback,”Information Processing & Management, vol. 62, no. 2, p. 103964, 2025
2025
-
[17]
A comprehen- sive survey on legal summarization: Challenges and future directions,
M. Akter, E. C ¸ ano, E. Weber, D. Dobler, and I. Habernal, “A comprehen- sive survey on legal summarization: Challenges and future directions,” ACM Computing Surveys, vol. 58, no. 7, pp. 1–32, 2025
2025
-
[19]
Rouge: A package for automatic evaluation of summaries,
C.-Y . Lin, “Rouge: A package for automatic evaluation of summaries,” inText summarization branches out, 2004, pp. 74–81
2004
-
[20]
Bleu: a method for automatic evaluation of machine translation,
K. Papineni, S. Roukos, T. Ward, and W.-J. Zhu, “Bleu: a method for automatic evaluation of machine translation,” inProceedings of the 40th annual meeting of the Association for Computational Linguistics, 2002, pp. 311–318
2002
-
[21]
Moverscore: Text generation evaluating with contextualized embed- dings and earth mover distance,
W. Zhao, M. Peyrard, F. Liu, Y . Gao, C. M. Meyer, and S. Eger, “Moverscore: Text generation evaluating with contextualized embed- dings and earth mover distance,” inProceedings of the 2019 conference on empirical methods in natural language processing and the 9th international joint conference on natural language processing (EMNLP- IJCNLP), 2019, pp. 563–578
2019
-
[22]
Summac: Re- visiting nli-based models for inconsistency detection in summarization,
P. Laban, T. Schnabel, P. N. Bennett, and M. A. Hearst, “Summac: Re- visiting nli-based models for inconsistency detection in summarization,” Transactions of the Association for Computational Linguistics, vol. 10, pp. 163–177, 2022
2022
-
[23]
Questeval: Summarization asks for fact-based evalu- ation,
T. Scialom, P.-A. Dray, S. Lamprier, B. Piwowarski, J. Staiano, A. Wang, and P. Gallinari, “Questeval: Summarization asks for fact-based evalu- ation,” inProceedings of the 2021 conference on empirical methods in natural language processing, 2021, pp. 6594–6604
2021
-
[24]
Bertscore is unfair: On social bias in language model-based metrics for text generation,
T. Sun, J. He, X. Qiu, and X.-J. Huang, “Bertscore is unfair: On social bias in language model-based metrics for text generation,” in Proceedings of the 2022 conference on empirical methods in natural language processing, 2022, pp. 3726–3739
2022
-
[25]
Re- evaluating evaluation in text summarization,
M. Bhandari, P. N. Gour, A. Ashfaq, P. Liu, and G. Neubig, “Re- evaluating evaluation in text summarization,” inProceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), 2020, pp. 9347–9359
2020
-
[26]
How far are we from robust long abstractive summarization?
H. Y . Koh, J. Ju, H. Zhang, M. Liu, and S. Pan, “How far are we from robust long abstractive summarization?” inProceedings of the 2022 Conference on Empirical Methods in Natural Language Processing, 2022, pp. 2682–2698
2022
-
[27]
Exploring prompting large language models as explain- able metrics,
G. Mahmoudi, “Exploring prompting large language models as explain- able metrics,” inProceedings of the 4th Workshop on Evaluation and Comparison of NLP Systems, 2023, pp. 219–227
2023
-
[28]
A deep reinforced model for abstractive summarization,
R. Paulus, C. Xiong, and R. Socher, “A deep reinforced model for abstractive summarization,” inInternational Conference on Learning Representations, 2018
2018
-
[29]
A reinforced topic-aware convolutional sequence-to-sequence model for abstractive text summarization,
L. Wang, J. Yao, Y . Tao, L. Zhong, W. Liu, and Q. Du, “A reinforced topic-aware convolutional sequence-to-sequence model for abstractive text summarization,” inProceedings of the 27th International Joint Conference on Artificial Intelligence, 2018, pp. 4453–4460
2018
-
[30]
Societal Alignment Frameworks Can Improve
K. Sta ´nczak, N. Meade, M. Bhatia, H. Zhou, K. B ¨ottinger, J. Barnes, J. Stanley, J. Montgomery, R. Zemel, N. Papernotet al., “Societal alignment frameworks can improve llm alignment,”arXiv preprint arXiv:2503.00069, 2025
-
[31]
Learning to summarize from llm-generated feedback,
H. Song, T. Yun, Y . Lee, J. Oh, G. Lee, J. Cai, and H. Su, “Learning to summarize from llm-generated feedback,” inProceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers), 2025, pp. 835–857
2025
-
[32]
Language mod- 14 els are few-shot learners,
T. Brown, B. Mann, N. Ryder, M. Subbiah, J. D. Kaplan, P. Dhariwal, A. Neelakantan, P. Shyam, G. Sastry, A. Askellet al., “Language mod- 14 els are few-shot learners,”Advances in neural information processing systems, vol. 33, pp. 1877–1901, 2020
1901
-
[33]
Training language models to follow instructions with human feedback,
L. Ouyang, J. Wu, X. Jiang, D. Almeida, C. Wainwright, P. Mishkin, C. Zhang, S. Agarwal, K. Slama, A. Rayet al., “Training language models to follow instructions with human feedback,”Advances in neural information processing systems, vol. 35, pp. 27 730–27 744, 2022
2022
-
[34]
Large language models are state-of-the- art evaluators of translation quality,
T. Kocmi and C. Federmann, “Large language models are state-of-the- art evaluators of translation quality,” inProceedings of the 24th Annual Conference of the European Association for Machine Translation, 2023, pp. 193–203
2023
-
[35]
The pyramid method: Incorporating human content selection variation in summarization evalu- ation,
A. Nenkova, R. Passonneau, and K. McKeown, “The pyramid method: Incorporating human content selection variation in summarization evalu- ation,”ACM Transactions on Speech and Language Processing (TSLP), vol. 4, no. 2, pp. 4–es, 2007
2007
-
[36]
Large language models are not fair evaluators,
P. Wang, L. Li, L. Chen, Z. Cai, D. Zhu, B. Lin, Y . Cao, L. Kong, Q. Liu, T. Liuet al., “Large language models are not fair evaluators,” inProceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), 2024, pp. 9440– 9450
2024
-
[37]
Improving factuality and reasoning in language models through multiagent debate,
Y . Du, S. Li, A. Torralba, J. B. Tenenbaum, and I. Mordatch, “Improving factuality and reasoning in language models through multiagent debate,” inForty-first International Conference on Machine Learning, 2023
2023
-
[38]
Generating sequences by learning to self-correct,
S. Welleck, X. Lu, P. West, F. Brahman, T. Shen, D. Khashabi, and Y . Choi, “Generating sequences by learning to self-correct,” inThe Eleventh International Conference on Learning Representations, 2023
2023
-
[39]
Constitutional AI: Harmlessness from AI Feedback
Y . Bai, S. Kadavath, S. Kundu, A. Askell, J. Kernion, A. Jones, A. Chen, A. Goldie, A. Mirhoseini, C. McKinnonet al., “Constitutional ai: Harmlessness from ai feedback,”arXiv preprint arXiv:2212.08073, 2022
work page internal anchor Pith review arXiv 2022
-
[40]
Self-refine: Iter- ative refinement with self-feedback,
A. Madaan, N. Tandon, P. Gupta, S. Hallinan, L. Gao, S. Wiegreffe, U. Alon, N. Dziri, S. Prabhumoye, Y . Yanget al., “Self-refine: Iter- ative refinement with self-feedback,”Advances in neural information processing systems, vol. 36, pp. 46 534–46 594, 2023
2023
-
[41]
On faithfulness and factuality in abstractive summarization,
J. Maynez, S. Narayan, B. Bohnet, and R. McDonald, “On faithfulness and factuality in abstractive summarization,” inProceedings of the 58th annual meeting of the association for computational linguistics, 2020, pp. 1906–1919
2020
-
[42]
Learning to summarize with human feedback,
N. Stiennon, L. Ouyang, J. Wu, D. Ziegler, R. Lowe, C. V oss, A. Rad- ford, D. Amodei, and P. F. Christiano, “Learning to summarize with human feedback,”Advances in neural information processing systems, vol. 33, pp. 3008–3021, 2020
2020
-
[43]
Asking and answering questions to evaluate the factual consistency of summaries,
A. Wang, K. Cho, and M. Lewis, “Asking and answering questions to evaluate the factual consistency of summaries,” inProceedings of the 58th annual meeting of the association for computational linguistics, 2020, pp. 5008–5020
2020
-
[44]
Long document summarization with top-down and bottom-up infer- ence,
B. Pang, E. Nijkamp, W. Kry ´sci´nski, S. Savarese, Y . Zhou, and C. Xiong, “Long document summarization with top-down and bottom-up infer- ence,” inFindings of the Association for Computational Linguistics: EACL 2023, 2023, pp. 1267–1284
2023
-
[45]
Dyle: Dynamic latent extraction for ab- stractive long-input summarization,
Z. Mao, C. H. Wu, A. Ni, Y . Zhang, R. Zhang, T. Yu, B. Deb, C. Zhu, A. Awadallah, and D. Radev, “Dyle: Dynamic latent extraction for ab- stractive long-input summarization,” inProceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), 2022, pp. 1687–1698
2022
-
[46]
Bigpatent: A large-scale dataset for abstractive and coherent summarization,
E. Sharma, C. Li, and L. Wang, “Bigpatent: A large-scale dataset for abstractive and coherent summarization,” inProceedings of the 57th annual meeting of the association for computational linguistics, 2019, pp. 2204–2213
2019
-
[47]
Meteor: An automatic metric for mt evalua- tion with improved correlation with human judgments,
S. Banerjee and A. Lavie, “Meteor: An automatic metric for mt evalua- tion with improved correlation with human judgments,” inProceedings of the acl workshop on intrinsic and extrinsic evaluation measures for machine translation and/or summarization, 2005, pp. 65–72
2005
-
[48]
chrf: character n-gram f-score for automatic mt evalu- ation,
M. Popovi ´c, “chrf: character n-gram f-score for automatic mt evalu- ation,” inProceedings of the tenth workshop on statistical machine translation, 2015, pp. 392–395
2015
-
[49]
Bartscore: Evaluating generated text as text generation,
W. Yuan, G. Neubig, and P. Liu, “Bartscore: Evaluating generated text as text generation,”Advances in neural information processing systems, vol. 34, pp. 27 263–27 277, 2021
2021
-
[50]
Answers unite! unsupervised metrics for reinforced summarization models,
T. Scialom, S. Lamprier, B. Piwowarski, and J. Staiano, “Answers unite! unsupervised metrics for reinforced summarization models,” in Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), 2019, pp. 3246–3256
2019
-
[51]
Towards question- answering as an automatic metric for evaluating the content quality of a summary,
D. Deutsch, T. Bedrax-Weiss, and D. Roth, “Towards question- answering as an automatic metric for evaluating the content quality of a summary,”Transactions of the Association for Computational Linguistics, vol. 9, pp. 774–789, 2021
2021
-
[52]
Fill in the blanc: Human-free quality estimation of document summaries,
O. Vasilyev, V . Dharnidharka, and J. Bohannon, “Fill in the blanc: Human-free quality estimation of document summaries,” inProceedings of the First Workshop on Evaluation and Comparison of NLP Systems, 2020, pp. 11–20
2020
-
[53]
The llama 3 herd of models,
A. Grattafiori, A. Dubey, A. Jauhri, A. Pandey, A. Kadian, A. Al-Dahle, A. Letman, A. Mathur, A. Schelten, A. Vaughanet al., “The llama 3 herd of models,” inNeural Information Processing Systems. Curran Associates, 2024
2024
-
[54]
A. Yang, B. Yang, B. Hui, B. Zheng, B. Yu, C. Zhou, C. Li, C. Li, D. Liu, F. Huanget al., “Qwen2 technical report,”eprint arXiv: 2407.10671, 2024
work page internal anchor Pith review arXiv 2024
-
[55]
M. Abdin, J. Aneja, H. Behl, S. Bubeck, R. Eldan, S. Gunasekar, M. Harrison, R. J. Hewett, M. Javaheripi, P. Kauffmannet al., “Phi- 4 technical report,”arXiv preprint arXiv:2412.08905, 2024
work page internal anchor Pith review arXiv 2024
-
[56]
Overview of duc 2005,
H. T. Dang, “Overview of duc 2005,” inProceedings of the document understanding conference, vol. 2005, 2005, pp. 1–12
2005
-
[57]
To point or not to point: Understanding how abstractive summarizers paraphrase text,
M. Wilber, W. Timkey, and M. Van Schijndel, “To point or not to point: Understanding how abstractive summarizers paraphrase text,” in Findings of the Association for Computational Linguistics: ACL-IJCNLP 2021, 2021, pp. 3362–3376
2021
-
[58]
Billsum: A corpus for automatic summarization of us legislation,
A. Kornilova and V . Eidelman, “Billsum: A corpus for automatic summarization of us legislation,” inProceedings of the 2nd Workshop on New Frontiers in Summarization, 2019, pp. 48–56
2019
-
[59]
Chateval: Towards better LLM-based evaluators through multi- agent debate,
C.-M. Chan, W. Chen, Y . Su, J. Yu, W. Xue, S. Zhang, J. Fu, and Z. Liu, “Chateval: Towards better LLM-based evaluators through multi- agent debate,” inThe Twelfth International Conference on Learning Representations, 2024
2024
-
[60]
On the emergence of position bias in transformers,
X. Wu, Y . Wang, S. Jegelka, and A. Jadbabaie, “On the emergence of position bias in transformers,” inForty-second International Conference on Machine Learning, 2025
2025
-
[61]
Towards mitigating llm hallucination via self reflection,
Z. Ji, T. Yu, Y . Xu, N. Lee, E. Ishii, and P. Fung, “Towards mitigating llm hallucination via self reflection,” inFindings of the Association for Computational Linguistics: EMNLP 2023, 2023, pp. 1827–1843
2023
-
[62]
Holistic evaluation of language models,
P. Liang, R. Bommasani, T. Lee, D. Tsipras, D. Soylu, M. Yasunaga, Y . Zhang, D. Narayanan, Y . Wu, A. Kumaret al., “Holistic evaluation of language models,”Transactions on Machine Learning Research, 2023
2023
-
[63]
The financial narrative summarisation shared task (fns 2020),
M. El-Haj, M. Litvak, N. Pittaras, G. Giannakopouloset al., “The financial narrative summarisation shared task (fns 2020),” inProceedings of the 1st Joint Workshop on Financial Narrative Processing and MultiLing Financial Summarisation, 2020, pp. 1–12
2020
-
[64]
Automatic summarization of scientific articles: A survey,
N. I. Altmami and M. E. B. Menai, “Automatic summarization of scientific articles: A survey,”Journal of King Saud University-Computer and Information Sciences, vol. 34, no. 4, pp. 1011–1028, 2022
2022
-
[65]
Prometheus: Inducing fine-grained evaluation capability in language models,
S. Kim, J. Shin, Y . Cho, J. Jang, S. Longpre, H. Lee, S. Yun, S. Shin, S. Kim, J. Thorne, and M. Seo, “Prometheus: Inducing fine-grained evaluation capability in language models,” inThe Twelfth International Conference on Learning Representations, 2024. 15
2024
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.