arxiv: 2606.12807 · v1 · pith:OYBCACCVnew · submitted 2026-06-11 · 💻 cs.CL

Detect, Remask, Repair: Diffusion Editing for Faithful Summarization of Evolving Contexts

Pith reviewed 2026-06-27 07:05 UTC · model grok-4.3

classification 💻 cs.CL
keywords summarization · diffusion models · localized editing · faithfulness · evolving contexts · StreamSum benchmark · masked language models

The pith

Diffusion editing repairs only outdated spans in evolving summaries while preserving supported content.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Summaries of ongoing events often become partially outdated as new facts arrive. Full regeneration discards prior work and hides what changed, yet only a few claims may need fixing. The paper proposes localized repair that detects unsupported regions, remasks them, and regenerates just those parts using masked diffusion language models. This yields controllable tradeoffs between faithfulness, speed, and retention of the original draft. Tests on dialog data and a new synthetic timeline benchmark show the approach can also correct outputs from standard autoregressive systems after the fact.

Core claim

The Detect-Remask-Repair framework identifies unsupported or outdated spans in an existing summary, remasks those regions, and repairs them with masked diffusion language models, supplying a controllable alternative to full rewriting that improves faithfulness while preserving supported content and enabling fast one-step repairs.

What carries the argument

Detect-Remask-Repair: a three-stage process that uses masked diffusion language models to locate, remask, and regenerate only the outdated spans in a summary.
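
The three stages can be sketched as plain functions over token lists. This is a minimal illustration, not the paper's implementation: the names `detect`, `remask`, `repair`, and `toy_infill` are hypothetical, the detector is a simple vocabulary-overlap heuristic standing in for a learned scorer, and the infiller is a stub standing in for a masked diffusion language model.

```python
from typing import Callable, List, Sequence

MASK = "[MASK]"


def detect(draft: Sequence[str], context: Sequence[str]) -> List[float]:
    """Toy staleness score: 1.0 if a draft token never appears in the context.

    Assumption: a real detector would be a trained model, not word overlap.
    """
    seen = {tok.lower() for tok in context}
    return [0.0 if tok.lower() in seen else 1.0 for tok in draft]


def remask(draft: Sequence[str], scores: Sequence[float],
           threshold: float = 0.5) -> List[str]:
    """Replace every position whose score reaches the threshold with MASK."""
    return [MASK if s >= threshold else tok for tok, s in zip(draft, scores)]


def repair(masked: Sequence[str],
           infill: Callable[[Sequence[str], int], str]) -> List[str]:
    """Fill only masked positions; all other tokens are copied unchanged."""
    return [infill(masked, i) if tok == MASK else tok
            for i, tok in enumerate(masked)]


def toy_infill(masked: Sequence[str], position: int) -> str:
    """Hypothetical stand-in for a masked diffusion model's prediction."""
    return "friday"


# Updated context versus a draft written before the update arrived.
context = "the launch was moved to friday".split()
draft = "the launch was moved to tuesday".split()

scores = detect(draft, context)
masked = remask(draft, scores)
repaired = repair(masked, toy_infill)
print(" ".join(repaired))
```

Because `repair` copies every unmasked token through untouched, whatever the detector does not flag is preserved by construction. That is the property the localized-repair framing depends on, and it also means detector misses go uncorrected.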

If this is right

  • Faithfulness-steered repair improves the quality of early summary drafts.
  • One-step repair reduces repair cost to under half a second.
  • The framework enables explicit tradeoffs among faithfulness, speed, and content preservation across datasets.
  • The same process supplies a post-hoc correction step that raises faithfulness scores for autoregressive summarizers.
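
The speed side of that tradeoff can be illustrated with a generic confidence-ordered unmasking loop, in which `steps=1` corresponds to one-step repair. This is a sketch under stated assumptions: the source does not describe the paper's sampler, so the schedule below is the common "commit the most confident predictions first" pattern, and `infill`, `toy_predict`, and the `Predictor` type are hypothetical names.

```python
from typing import Callable, Dict, List, Tuple

MASK = "[MASK]"
# A predictor maps the current tokens to {masked position: (token, confidence)}.
Predictor = Callable[[List[str]], Dict[int, Tuple[str, float]]]


def infill(tokens: List[str], predict: Predictor, steps: int) -> List[str]:
    """Fill masked positions over at most `steps` rounds, most confident first."""
    if steps < 1:
        raise ValueError("steps must be >= 1")
    tokens = list(tokens)  # do not mutate the caller's list
    for step in range(steps):
        masked = [i for i, tok in enumerate(tokens) if tok == MASK]
        if not masked:
            break
        proposals = predict(tokens)
        # Spread the remaining masks evenly over the remaining rounds
        # (ceiling division), so the final round always fills everything left.
        quota = -(-len(masked) // (steps - step))
        chosen = sorted(masked, key=lambda i: proposals[i][1], reverse=True)[:quota]
        for i in chosen:
            tokens[i] = proposals[i][0]
    return tokens


call_log: List[int] = []  # records the number of masks seen at each model call


def toy_predict(tokens: List[str]) -> Dict[int, Tuple[str, float]]:
    """Hypothetical stand-in for a masked diffusion model's forward pass."""
    answers = {2: ("friday", 0.9), 4: ("noon", 0.6)}
    masked = [i for i, tok in enumerate(tokens) if tok == MASK]
    call_log.append(len(masked))
    return {i: answers[i] for i in masked}


draft = ["launch", "on", MASK, "at", MASK]
one_step = infill(draft, toy_predict, steps=1)
calls_one = len(call_log)
two_step = infill(draft, toy_predict, steps=2)
calls_two = len(call_log) - calls_one
print(one_step, calls_one, calls_two)
```

In this toy case both schedules produce the same tokens, and the only difference is the number of predictor calls. With a real model, later rounds can condition on tokens committed earlier, which is where a quality difference between one-step and multi-step repair would come from.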

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same localized detection-and-repair pattern could extend to other incremental text generation settings such as live reports or dialogue responses.
  • Testing the approach on actual news timelines instead of the synthetic StreamSum benchmark would clarify how well detection generalizes beyond controlled event sequences.
  • Because the method separates detection from repair, it could support interactive tools where users flag specific spans for update.

Load-bearing premise

The diffusion model can accurately detect and repair only the unsupported or outdated spans without introducing new errors or altering supported content.

What would settle it

An evaluation on StreamSum or DialogSum that measures whether repaired summaries contain new factual errors absent from the context or original draft, or fail to update all outdated claims.
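
One part of such an evaluation, whether repair stays inside the flagged spans, can be checked mechanically. The sketch below is illustrative only: `edit_locality` is a hypothetical helper, it assumes the draft and the repaired summary are token-aligned and of equal length, and it measures edit locality rather than factual correctness, which would still need a faithfulness metric such as AlignScore.

```python
from typing import Dict, Sequence


def edit_locality(draft: Sequence[str], repaired: Sequence[str],
                  flagged: Sequence[bool]) -> Dict[str, float]:
    """Report how well edits stayed inside the flagged positions.

    preservation: share of unflagged tokens left unchanged.
    update_rate:  share of flagged tokens that actually changed.
    """
    if not (len(draft) == len(repaired) == len(flagged)):
        raise ValueError("draft, repaired, and flagged must have equal length")
    n_flagged = sum(1 for f in flagged if f)
    n_unflagged = len(flagged) - n_flagged
    kept = sum(1 for d, r, f in zip(draft, repaired, flagged) if not f and d == r)
    changed = sum(1 for d, r, f in zip(draft, repaired, flagged) if f and d != r)
    return {
        "preservation": kept / n_unflagged if n_unflagged else 1.0,
        "update_rate": changed / n_flagged if n_flagged else 1.0,
    }


draft = "the launch is on tuesday".split()
flagged = [False, False, False, False, True]

# A clean repair changes only the flagged token.
clean = edit_locality(draft, "the launch is on friday".split(), flagged)

# A drifting repair alters supported content and leaves the stale token alone.
drift = edit_locality(draft, "a launch is on tuesday".split(), flagged)
print(clean, drift)
```

A changed flagged token is not necessarily a correct one, so `update_rate` is only an upper bound on successful repairs.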

Figures

Figures reproduced from arXiv: 2606.12807 by Chandhru Karthick, Hao Zou, Kathleen McKeown, Zachary Horvitz, Zhou Yu.

Figure 1: Overview of DETECT–REMASK–REPAIR (DRR) on a StreamSum example. In an evolving context, an autoregressive draft generated from earlier updates preserves outdated claims. DRR detects unsupported spans, selectively re-masks them, and utilizes a text diffusion model to infill the masked spans based on later updates, yielding a faithful summary.
Figure 2: Performance of correction strategies for early …
Figure 3: Overview of DETECT–REMASK–REPAIR. During training (left), we mask the source and reference summary, then use a masked diffusion model to refill selected summary positions, producing corrupted summaries that supervise both [MASK]-DISC and the one-step repair model. At inference (right), [MASK]-DISC scores draft tokens, high-staleness positions are re-masked, and a masked diffusion model infills them using t…
Figure 4: Sample-level [MASK]-DISC risk versus AlignScore for early-context AR drafts. Higher risk generally corresponds to lower faithfulness, supporting budgeted repair routing.
  Setting            Dataset    Pearson r  Spearman ρ  p
  Early draft        DialogSum  -0.263     -0.239      < 0.001
  Early draft        StreamSum  -0.285     -0.299      < 0.001
  Full-gen post-hoc  DialogSum  -0.081     -0.081      < 0.01
  Full-gen post-hoc  StreamSum  -0.050     -0.050      0.27
Figure 5: Sample-level [MASK]-DISC risk versus AlignScore for full-generation summaries. The relationship is weaker in this setting because full-generation outputs are already conditioned on the complete context.
  Setting      Method          DialogSum                      StreamSum
                               Top-25  Top-50  Top-75  All    Top-25  Top-50  Top-75  All
  Early draft  LLaDA steering  0.894   1.820   2.714   3.544  1.892   3.803   5.747   7.687
               LLaDA 1-step    0.044   0.089   0.133   0.175  0.077   0.154   …
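
The budgeted repair routing mentioned in the Figure 4 caption can be sketched as a top-fraction selection over sample-level risk scores. This is a minimal illustration under assumptions: `route_for_repair` is a hypothetical helper, the risk values are made up for the example, and the 25/50/75/100 percent budgets simply mirror the Top-25, Top-50, Top-75, and All columns shown with Figure 5.

```python
import math
from typing import List, Sequence


def route_for_repair(risks: Sequence[float], budget: float) -> List[int]:
    """Return indices of the highest-risk samples, up to `budget` of the set.

    `budget` is a fraction in [0, 1]; the count is rounded up so that a
    nonzero budget on a nonempty set selects at least one sample.
    """
    if not 0.0 <= budget <= 1.0:
        raise ValueError("budget must be between 0 and 1")
    k = math.ceil(budget * len(risks))
    by_risk = sorted(range(len(risks)), key=lambda i: risks[i], reverse=True)
    return sorted(by_risk[:k])


# Illustrative per-summary risk scores (not taken from the paper).
risks = [0.1, 0.8, 0.3, 0.9]

top_25 = route_for_repair(risks, 0.25)
top_50 = route_for_repair(risks, 0.50)
top_75 = route_for_repair(risks, 0.75)
everything = route_for_repair(risks, 1.00)
print(top_25, top_50, top_75, everything)
```

Routing of this kind only helps when risk tracks faithfulness. The correlations reported with Figure 4 are moderate for early drafts and weak for full-generation outputs, which suggests the budget matters most in the early-draft setting.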
Original abstract

Summaries of real-world events can become outdated as contexts evolve and new information arrives. A common response is to generate a new summary from the updated context, but full regeneration discards the previous draft, can obscure what changed, and may be unnecessary when only a few claims are unsupported. We study localized faithfulness repair: updating outdated spans in an existing summary while preserving supported content. We propose DETECT-REMASK-REPAIR, a diffusion-based framework that identifies, remasks, and repairs outdated regions with masked diffusion language models. To evaluate evolving-context summarization, we introduce StreamSum, a benchmark of synthetic event timelines. Experiments on DialogSum and StreamSum show that localized diffusion repair provides a controllable alternative to full rewriting: faithfulness-steered repair improves early drafts, one-step repair reduces repair cost to under half a second, with the framework enabling faithfulness-speed-preservation tradeoffs across datasets. We also find that the framework can provide a post-hoc correction step that improves faithfulness for autoregressive systems.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 0 minor

Summary. The paper claims that a diffusion-based DETECT-REMASK-REPAIR framework using masked diffusion language models can perform localized faithfulness repair on evolving-context summaries by detecting, remasking, and repairing only unsupported/outdated spans while preserving supported content. It introduces the synthetic StreamSum benchmark of event timelines and reports that the approach yields faithfulness improvements over early drafts, enables one-step repair under half a second, supports faithfulness-speed-preservation tradeoffs on DialogSum and StreamSum, and serves as an effective post-hoc correction step for autoregressive summarizers.

Significance. If the empirical claims hold, the work offers a practical, controllable alternative to full regeneration for maintaining summary faithfulness under context evolution, with explicit speed and edit-locality benefits that could reduce unnecessary rewriting in dynamic domains such as news or dialogue. The post-hoc correction result and the explicit tradeoff knobs are potentially useful contributions.

major comments (1)
  1. [StreamSum benchmark construction] StreamSum benchmark construction (the section introducing the synthetic event timelines): the benchmark relies on explicit outdated claims inserted into timelines, which simplifies span detection relative to the implicit contradictions, gradual fact shifts, or subtle unsupported content that arise in natural evolving contexts. Because the central claim is that the diffusion model accurately detects and repairs only unsupported spans without introducing new errors or harming supported content, this construction choice is load-bearing; performance on StreamSum may not generalize, weakening the assertion that the framework provides a reliable controllable alternative to full rewriting.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the constructive feedback. We address the single major comment below.

Point-by-point responses
  1. Referee: [StreamSum benchmark construction] StreamSum benchmark construction (the section introducing the synthetic event timelines): the benchmark relies on explicit outdated claims inserted into timelines, which simplifies span detection relative to the implicit contradictions, gradual fact shifts, or subtle unsupported content that arise in natural evolving contexts. Because the central claim is that the diffusion model accurately detects and repairs only unsupported spans without introducing new errors or harming supported content, this construction choice is load-bearing; performance on StreamSum may not generalize, weakening the assertion that the framework provides a reliable controllable alternative to full rewriting.

    Authors: We agree that StreamSum relies on explicit insertion of outdated claims, which provides a controlled setting for measuring precise span detection and localized repair. This design enables ground-truth evaluation of whether the model identifies only unsupported regions and avoids introducing errors into supported content, which is essential for validating the DETECT-REMASK-REPAIR mechanism before moving to noisier data. We acknowledge that this choice does not capture all forms of natural evolution such as implicit contradictions or gradual shifts. In revision we will expand the benchmark section to explicitly discuss this limitation, clarify the rationale for the synthetic construction, and outline plans for future naturalistic extensions.

    Revision: partial

Circularity Check

0 steps flagged

No significant circularity; empirical claims only

Full rationale

The paper introduces a diffusion-based editing framework and a synthetic benchmark (StreamSum) for evaluating localized repair of outdated summary spans. All central claims are empirical performance statements (e.g., faithfulness improvements, runtime reductions, post-hoc correction benefits) measured on DialogSum and StreamSum. No equations, derivations, fitted parameters renamed as predictions, or self-citation chains appear in the abstract or described method. The evaluation is externally falsifiable through standard metrics on held-out data, making the work self-contained against external benchmarks with no load-bearing reductions to its own inputs.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only; no free parameters, axioms, or invented entities are described or identifiable.

pith-pipeline@v0.9.1-grok · 5714 in / 1081 out tokens · 27934 ms · 2026-06-27T07:05:33.166842+00:00 · methodology
