pith. sign in

arxiv: 2606.08000 · v1 · pith:AKQALPDYnew · submitted 2026-06-06 · 💻 cs.CL · cs.AI

Summarization is Not Dead Yet

Pith reviewed 2026-06-27 19:58 UTC · model grok-4.3

classification 💻 cs.CL cs.AI
keywords summarizationlarge language modelshuman evaluationfactualityLLM-as-a-judgelinguistic analysisreference summaries
0
0 comments X

The pith

Human reference summaries still outperform LLMs in informativeness, faithfulness and factual reliability on reasoning claims.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper tests claims that LLMs have solved summarization by running a multi-track evaluation on five datasets with five current models. Human references score higher on informativeness and faithfulness while LLM outputs win mainly on surface fluency and coherence. External factuality checks confirm humans are more reliable, especially when summaries require synthesis or reasoning steps. Linguistic analysis also shows LLMs produce stylistically similar outputs across different models. These patterns indicate that LLMs have lifted baseline quality but have not reached the upper performance limit set by human writers.

Core claim

Human reference summaries continue to demonstrate advantages in informativeness and faithfulness, whereas LLM outputs are preferred mainly for surface-level coherence and fluency. Factuality verification indicates that human references remain more reliable, particularly for claims involving reasoning or synthesis, and linguistic analysis uncovers a pattern of stylistic homogeneity across different models. These observations suggest that current LLMs have raised the floor of summarization quality, but the ceiling of their performance remains below human capabilities.

What carries the argument

Multi-track evaluation that combines controlled human assessment, bias-mitigated LLM-as-Judge scoring, external factuality verification against knowledge sources, and corpus-level linguistic analysis.

If this is right

  • Summarization remains an open problem that requires continued research on depth and factual synthesis.
  • LLM training can exploit existing strengths in fluency while targeting gaps in reasoning and faithfulness.
  • Human references continue to set a higher standard that current models do not meet.
  • Stylistic homogeneity in LLM outputs may limit output variety across different domains.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Hybrid human-LLM pipelines could combine model fluency with human-level reliability for high-stakes use cases.
  • The observed stylistic uniformity suggests that diversity objectives may need explicit training signals beyond standard objectives.
  • Scaling alone may not close the gaps unless new methods specifically address synthesis and reasoning claims.

Load-bearing premise

The combination of human ratings, LLM judges, factuality checks, and linguistic metrics on five datasets gives an unbiased picture of quality differences.

What would settle it

A follow-up study on held-out datasets in which independent human raters consistently score LLM summaries higher than human references on both informativeness and faithfulness.

Figures

Figures reproduced from arXiv: 2606.08000 by Chenxi Whitehouse, Dongqi Liu, Jian Li, Yabiao Wang, Zheng Zhao, Zhuchen Cao.

Figure 1
Figure 1. Figure 1: Number of summarization papers at major NLP venues (2020–2025). Papers are identified by the presence of summarization or summarisation in the title or abstract, counted once per paper. references and may achieve comparable or superior factual consistency (Liu et al., 2024c, 2023b; Goyal et al., 2023; Pu et al., 2023). These statements have raised the question of whether continued research in summarization… view at source ↗
Figure 2
Figure 2. Figure 2: Pairwise win-rate matrices from human evaluation across datasets. Each cell reports the proportion of [PITH_FULL_IMAGE:figures/full_fig_p004_2.png] view at source ↗
Figure 3
Figure 3. Figure 3: Score gap matrices from human evaluation across four dimensions. Each cell in the lower triangle reports [PITH_FULL_IMAGE:figures/full_fig_p004_3.png] view at source ↗
Figure 4
Figure 4. Figure 4: Pairwise win-rate matrices from LLM-as-Judge rankings across datasets. Each cell reports the proportion [PITH_FULL_IMAGE:figures/full_fig_p005_4.png] view at source ↗
Figure 5
Figure 5. Figure 5: Score gap matrices from LLM-as-Judge. Each cell in the lower triangle reports the mean Likert score [PITH_FULL_IMAGE:figures/full_fig_p005_5.png] view at source ↗
Figure 6
Figure 6. Figure 6: Factuality score distributions. Each half-violin shows the score distribution, and diamond markers indicate [PITH_FULL_IMAGE:figures/full_fig_p006_6.png] view at source ↗
Figure 7
Figure 7. Figure 7: Divergence between human and model summaries, reported as [PITH_FULL_IMAGE:figures/full_fig_p006_7.png] view at source ↗
Figure 8
Figure 8. Figure 8: Factuality scores by source and dataset. Each group of bars corresponds to one dataset; each bar represents [PITH_FULL_IMAGE:figures/full_fig_p022_8.png] view at source ↗
Figure 9
Figure 9. Figure 9: Factuality scores by language for EurLexSum, averaged across the five models. Rows correspond to the [PITH_FULL_IMAGE:figures/full_fig_p026_9.png] view at source ↗
Figure 10
Figure 10. Figure 10: Linguistic divergence between human and GPT summaries across seven metrics and five datasets. [PITH_FULL_IMAGE:figures/full_fig_p027_10.png] view at source ↗
Figure 11
Figure 11. Figure 11: Linguistic divergence between human and Claude summaries across seven metrics and five datasets. [PITH_FULL_IMAGE:figures/full_fig_p027_11.png] view at source ↗
Figure 12
Figure 12. Figure 12: Linguistic divergence between human and Gemini summaries across seven metrics and five datasets. [PITH_FULL_IMAGE:figures/full_fig_p027_12.png] view at source ↗
Figure 13
Figure 13. Figure 13: Linguistic divergence between human and Qwen summaries across seven metrics and five datasets. [PITH_FULL_IMAGE:figures/full_fig_p028_13.png] view at source ↗
Figure 14
Figure 14. Figure 14: Linguistic divergence between human and Kimi summaries across seven metrics and five datasets. [PITH_FULL_IMAGE:figures/full_fig_p028_14.png] view at source ↗
Figure 15
Figure 15. Figure 15: Linguistic divergence by language for EurLexSum, averaged across the five models. Rows correspond to [PITH_FULL_IMAGE:figures/full_fig_p029_15.png] view at source ↗
Figure 16
Figure 16. Figure 16: Prompt template used for the case study in [PITH_FULL_IMAGE:figures/full_fig_p030_16.png] view at source ↗
Figure 17
Figure 17. Figure 17: Human reference summary (the abstract as authored by the paper’s writers). [PITH_FULL_IMAGE:figures/full_fig_p030_17.png] view at source ↗
Figure 18
Figure 18. Figure 18: GPT-generated summary. Errors (a) and (b) are marked in red with underline. 30 [PITH_FULL_IMAGE:figures/full_fig_p030_18.png] view at source ↗
Figure 19
Figure 19. Figure 19: Claude-generated summary. Errors (c) and (d) are marked in red with underline. Gemini-Generated Summary The paper presents a multi-dimensional evaluation of LLM-generated summaries against human references across five datasets covering Chinese fiction, lay science communication, news, sci￾entific video, and EU legal text. Four evaluation protocols are applied, and factuality is verified using four complem… view at source ↗
Figure 20
Figure 20. Figure 20: Gemini-generated summary. Errors (e) and (f) are marked in red with underline. Qwen-Generated Summary This paper revisits the question of whether LLMs have effectively solved summarization through a four-track evaluation spanning five datasets and five frontier models. The evaluation framework covers human annotation, LLM judging, factuality verification, and linguistic analysis. Human summaries consisten… view at source ↗
Figure 21
Figure 21. Figure 21: Qwen-generated summary. Error (g) is marked in red with underline. 31 [PITH_FULL_IMAGE:figures/full_fig_p031_21.png] view at source ↗
Figure 22
Figure 22. Figure 22: Kimi-generated summary. Error (h) is marked in red with underline. CNNSum Summarization Prompt (Chinese) 你是一位专业的中文长篇小说摘要撰写者,具备跨章节叙事追踪与多线情节梳理能力。 你的任务是根据提供的完整小说文本,撰写一段结构清晰、内容完整、客观中立的故事 梗概,使读者无需阅读原文即可全面把握全篇内容。 输入: 一部长篇中文网络小说的完整正文,篇幅可能超过十万字。 要求: 1. 核心要素覆盖。 梗概须涵盖以下要素:(a) 主角的身份、出身背景与核心目标或动 机;(b) 与情节直接相关的世界观设定(如特殊力量体系、社会结构、地理环境);(c) 贯穿全篇的主线情节及其阶段性发展脉络;(d) 不少于两个关键转折点或高潮事件,并简 要说明其对人物处境与故事走向的影响;(e) 主要配角的身份、立场及其与主角的… view at source ↗
Figure 23
Figure 23. Figure 23: Prompt template for CNNSum (Chinese novel summarization). [PITH_FULL_IMAGE:figures/full_fig_p032_23.png] view at source ↗
Figure 24
Figure 24. Figure 24: Prompt template for SciNews (lay science summarization). [PITH_FULL_IMAGE:figures/full_fig_p033_24.png] view at source ↗
Figure 25
Figure 25. Figure 25: Prompt template for DiverseSumm (multi-document news summarization). [PITH_FULL_IMAGE:figures/full_fig_p034_25.png] view at source ↗
Figure 26
Figure 26. Figure 26: Prompt template for VISTA (video-to-text scientific summarization). [PITH_FULL_IMAGE:figures/full_fig_p035_26.png] view at source ↗
Figure 27
Figure 27. Figure 27: Prompt template for EurLexSum (multilingual legal summarization). The [PITH_FULL_IMAGE:figures/full_fig_p036_27.png] view at source ↗
Figure 28
Figure 28. Figure 28: Annotation guidelines provided to human evaluators. Each annotator receives these instructions together [PITH_FULL_IMAGE:figures/full_fig_p037_28.png] view at source ↗
Figure 29
Figure 29. Figure 29: Prompt template for LLM-as-Judge evaluation. The six summaries are presented in randomized order [PITH_FULL_IMAGE:figures/full_fig_p038_29.png] view at source ↗
read the original abstract

The progress of large language models (LLMs) has fueled claims that model-generated summaries rival or even surpass human-written references, raising questions about whether summarization remains an open research problem. We re-examine this narrative through a multi-track evaluation covering five diverse datasets and five state-of-the-art LLMs, combining controlled human assessment, bias-mitigated LLM-as-Judge protocols, factuality verification against external knowledge, and corpus-level linguistic analysis. Our findings reveal a more nuanced landscape in which human reference summaries continue to demonstrate advantages in informativeness and faithfulness, whereas LLM outputs are preferred mainly for surface-level coherence and fluency. Factuality verification indicates that human references remain more reliable, particularly for claims involving reasoning or synthesis, and linguistic analysis uncovers a pattern of stylistic homogeneity across different models. These observations suggest that current LLMs have raised the floor of summarization quality, but the ceiling of their performance remains below human capabilities.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

0 major / 2 minor

Summary. The paper re-examines whether LLM-generated summaries have surpassed human references by performing a multi-track evaluation across five diverse datasets and five state-of-the-art LLMs. The protocol combines controlled human assessment, bias-mitigated LLM-as-Judge methods, factuality verification against external knowledge sources, and corpus-level linguistic analysis. Key findings are that human references retain advantages in informativeness and faithfulness, LLMs are preferred for surface coherence and fluency, human summaries are more reliable on reasoning/synthesis claims, and LLM outputs exhibit stylistic homogeneity.

Significance. If the multi-track results hold after full methodological scrutiny, the work supplies a timely, evidence-based counterpoint to narratives that summarization is solved, while crediting LLMs with raising baseline quality. The combination of human evaluation, external factuality checks, and linguistic analysis offers a template that could reduce reliance on single-metric or LLM-only judgments in future summarization research.

minor comments (2)
  1. [Abstract] The abstract states that the evaluation covers 'five diverse datasets' and 'five state-of-the-art LLMs' but does not name them; adding the specific dataset and model identifiers in the abstract or §1 would improve immediate readability.
  2. [Abstract] The phrase 'bias-mitigated LLM-as-Judge protocols' is used without a brief parenthetical example of the mitigation technique (e.g., prompt randomization or majority voting); a short clarification would help readers assess the claim without consulting later sections.

Simulated Author's Rebuttal

0 responses · 0 unresolved

We thank the referee for their positive evaluation of the manuscript, accurate summary of our multi-track protocol, and recommendation of minor revision. The referee correctly identifies the core contribution as a nuanced counterpoint to claims that summarization is solved. No major comments were provided in the report, so we have no specific points requiring rebuttal or revision.

Circularity Check

0 steps flagged

No significant circularity

full rationale

The paper is a purely empirical study that compares LLM summaries to human references via controlled human assessments, LLM-as-judge protocols, external factuality checks, and linguistic analysis across five datasets. No equations, fitted parameters, derivations, or self-citations appear as load-bearing elements in the abstract or described methods. All claims reduce directly to observable data from external sources rather than to any internal construction or prior author work.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

Empirical evaluation study with no mathematical model, free parameters, or new postulated entities; relies on standard NLP evaluation assumptions.

axioms (1)
  • domain assumption Controlled human assessment and external factuality verification provide reliable signals of summary quality
    Central to the multi-track evaluation described in the abstract.

pith-pipeline@v0.9.1-grok · 5692 in / 1075 out tokens · 25568 ms · 2026-06-27T19:58:24.520329+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Reference graph

Works this paper leans on

27 extracted references · 5 canonical work pages · 3 internal anchors

  1. [1]

    InProceedings of the 2022 Conference on Empirical Methods in Natural Language Processing, pages 7626–7639, Abu Dhabi, United Arab Emirates

    EUR-lex-sum: A multi- and cross-lingual dataset for long-form summarization in the legal do- main. InProceedings of the 2022 Conference on Empirical Methods in Natural Language Processing, pages 7626–7639, Abu Dhabi, United Arab Emirates. Association for Computational Linguistics. Nishant Balepur, Alexa Siu, Nedim Lipka, Franck Der- noncourt, Tong Sun, Jo...

  2. [2]

    InFindings of the Association for Computational Linguistics: NAACL 2025, pages 5291–5324, Albuquerque, New Mexico

    From single to multi: How LLMs hallucinate in multi-document summarization. InFindings of the Association for Computational Linguistics: NAACL 2025, pages 5291–5324, Albuquerque, New Mexico. Association for Computational Linguistics. Yoav Benjamini and Yosef Hochberg. 1995. Control- ling the false discovery rate: A practical and pow- erful approach to mul...

  3. [3]

    In The Twelfth International Conference on Learning Representations

    Booookscore: A systematic exploration of book-length summarization in the era of LLMs. In The Twelfth International Conference on Learning Representations. George Chrysostomou, Zhixue Zhao, Miles Williams, and Nikolaos Aletras. 2024. Investigating hallucina- tions in pruned large language models for abstractive summarization.Transactions of the Associatio...

  4. [4]

    From Local to Global: A Graph RAG Approach to Query-Focused Summarization

    Evaluation metrics on text summarization: comprehensive survey.Knowledge and Information Systems, 66(12):7717–7738. Yue Dong, John Wieting, and Pat Verga. 2022. Faithful to the document or to the world? mitigating hal- lucinations via entity-linked knowledge in abstrac- tive summarization. InFindings of the Association for Computational Linguistics: EMNLP...

  5. [5]

    InThe Thirty-eighth Annual Conference on Neural Information Processing Systems

    LLM evaluators recognize and favor their own generations. InThe Thirty-eighth Annual Conference on Neural Information Processing Systems. Pinelopi Papalampidi and Mirella Lapata. 2023. Hier- archical3D adapters for long video-to-text summa- rization. InFindings of the Association for Compu- tational Linguistics: EACL 2023, pages 1297–1320, Dubrovnik, Croa...

  6. [6]

    InFindings of the Association for Computational Linguistics: EMNLP 2023, pages 10776–10787, Sin- gapore

    NLP evaluation in trouble: On the need to mea- sure LLM data contamination for each benchmark. InFindings of the Association for Computational Linguistics: EMNLP 2023, pages 10776–10787, Sin- gapore. Association for Computational Linguistics. Alessandro Scirè, Karim Ghonim, and Roberto Navigli

  7. [7]

    InFindings of the Association for Compu- tational Linguistics: ACL 2024, pages 14148–14161, Bangkok, Thailand

    FENICE: Factuality evaluation of summariza- tion based on natural language inference and claim extraction. InFindings of the Association for Compu- tational Linguistics: ACL 2024, pages 14148–14161, Bangkok, Thailand. Association for Computational Linguistics. Hwanjun Song, Hang Su, Igor Shalyminov, Jason Cai, and Saab Mansour. 2024a. FineSurE: Fine-grain...

  8. [8]

    ArchRAG: Attributed Community-based Hierarchical Retrieval-Augmented Generation

    Benchmarking LLM faithfulness in RAG with evolving leaderboards. InProceedings of the 2025 Conference on Empirical Methods in Natural Lan- guage Processing: Industry Track, pages 799–811, Suzhou (China). Association for Computational Lin- guistics. Liyan Tang, Igor Shalyminov, Amy Wong, Jon Burnsky, Jake Vincent, Yu’an Yang, Siffi Singh, Song Feng, Hwanju...

  9. [9]

    InThe Thirty-eighth Annual Conference on Neural Information Processing Systems

    Long-form factuality in large language models. InThe Thirty-eighth Annual Conference on Neural Information Processing Systems. Lingxiao Wei, He Yan, Lu Xiangju, Junmin Zhu, Jun Wang, and Wei Zhang. 2025. CNNSum: Explor- ing long-context summarization with large language models in Chinese novels. InFindings of the Asso- ciation for Computational Linguistic...

  10. [10]

    InProceedings of The 5th New Frontiers in Summarization Workshop, pages 48–58, Hybrid

    Beyond paraphrasing: Analyzing summariza- tion abstractiveness and reasoning. InProceedings of The 5th New Frontiers in Summarization Workshop, pages 48–58, Hybrid. Association for Computational Linguistics. Yuheng Zha, Yichi Yang, Ruichen Li, and Zhiting Hu

  11. [11]

    InProceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 11328–11348, Toronto, Canada

    AlignScore: Evaluating factual consistency with a unified alignment function. InProceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 11328–11348, Toronto, Canada. Association for Computational Linguistics. Haopeng Zhang, Philip S. Yu, and Jiawei Zhang. 2025a. A systematic survey of text sum...

  12. [12]

    Qwen3 Embedding: Advancing Text Embedding and Reranking Through Foundation Models

    Benchmarking large language models for news summarization.Transactions of the Association for Computational Linguistics, 12:39–57. Yanzhao Zhang, Mingxin Li, Dingkun Long, Xin Zhang, Huan Lin, Baosong Yang, Pengjun Xie, An Yang, Dayiheng Liu, Junyang Lin, Fei Huang, and Jingren Zhou. 2025c. Qwen3 embedding: Advancing text embedding and reranking through f...

  13. [13]

    InProceed- ings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Pa- pers), pages 5172–5189, Vienna, Austria

    MoC: Mixtures of text chunking learners for retrieval-augmented generation system. InProceed- ings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Pa- pers), pages 5172–5189, Vienna, Austria. Associa- tion for Computational Linguistics. Lianmin Zheng, Wei-Lin Chiang, Ying Sheng, Siyuan Zhuang, Zhanghao Wu, Yongh...

  14. [14]

    Excl.” (self-exclusion, main protocol) and “Incl

    Fairness or fluency? an investigation into language bias of pairwise llm-as-a-judge.Preprint, arXiv:2601.13649. A Models, Datasets, and Sample Counts Table 1 lists the full model identifiers and API ac- cess dates for all five evaluated LLMs. Throughout the paper, we use abbreviated names for simplicity. Abbreviation Full Model Identifier GPTGPT-5.4-2026-...

  15. [15]

    Claude shows the smallest gap . . . followed by GPT

    at a nominal level ofα= 0.05 . Scores within each track are pooled across the five datasets before testing. Table 11 summarizes the outcomes, with ∗∗ marking p <0.01 after correction, ∗ marking 0.01≤p <0.05 after correction, and n.s. marking p≥0.05 . The sign of each gap is not encoded in the table and should be read from the main text. H Case Study We su...

  16. [16]

    Summarize the paper’s motivation, methodology, key findings, and conclusions

  17. [17]

    Preserve the technical terminology and quantitative details reported in the paper

  18. [18]

    Do not paraphrase quantities in a way that changes their precision or meaning

  19. [19]

    Do not introduce interpretations or claims that go beyond what the paper explicitly states

  20. [20]

    Paper:{document} Figure 16: Prompt template used for the case study in Appendix H

    Write 100 to 200 words in clear, grammatically correct academic English. Paper:{document} Figure 16: Prompt template used for the case study in Appendix H. Human Reference The progress of large language models (LLMs) has fueled claims that model-generated summaries rival or even surpass human-written references, raising questions about whether summarizati...

  21. [21]

    核心要素覆盖。梗概须涵盖以下要素: (a) 主角的身份、出身背景与核心目标或动 机;(b) 与情节直接相关的世界观设定(如特殊力量体系、社会结构、地理环境); (c) 贯穿全篇的主线情节及其阶段性发展脉络; (d) 不少于两个关键转折点或高潮事件,并简 要说明其对人物处境与故事走向的影响; (e) 主要配角的身份、立场及其与主角的关系演 变;(f)故事的最终结局或走向(若原文已完结)。

  22. [22]

    叙事顺序与结构。按故事内时间顺序叙述,确保因果链条清晰;多线叙事时以主线为 骨架串联,支线信息在与主线交汇处简要带过,避免线索断裂。

  23. [23]

    客观性约束。严格不得加入原文未出现的情节、人物、地点、时间或因果关系,不得 添加任何主观评价、文学赏析、情感感叹或对结局的推测,不得使用任何评论性词汇。

  24. [24]

    the authors suggest

    语言规范。使用规范的中文书面语,避免口语化、网络流行语与作者点评式语气。人 名、地名、特殊术语及专有名词须与原文用字完全一致,不得替换为常见同义词。 小说正文:{document} Figure 23: Prompt template for CNNSum (Chinese novel summarization). 32 SciNews Summarization Prompt (English) You are an expert science communicator with experience translating peer-reviewed research for general audiences. Your task is to write a lay summary of the...

  25. [25]

    A recorded scientific conference presentation video, including the speaker’s slides and any on-screen demonstrations

  26. [26]

    A verbatim or near-verbatim transcript of the same presentation. Instructions: 1.Abstract structure.Organize the summary around five elements in order: (a) the scientific motivation, framed as an existing limitation or open question; (b) the proposed approach, model, or experimental design, including any architectural or methodological innovations; (c) th...

  27. [27]

    Each rank must be used exactly once; ties are not permitted

    to best (rank 6). Each rank must be used exactly once; ties are not permitted. 4.Bias mitigation.Score each summary on its own content, not its position in the list (A through F) or its surface style. Do not let stylistic familiarity favor one summary over another. Do not let your impression of an earlier summary anchor your scores for later ones. 5.Dimen...