
arxiv: 2606.07936 · v2 · pith:OWOLVEFRnew · submitted 2026-06-06 · 💻 cs.CL · cs.AI

Illusions of the Gold Standard: A Large-scale Analysis of Human Evaluation Protocols for Long-form Text Generation

Pith reviewed 2026-06-27 20:18 UTC · model grok-4.3

classification 💻 cs.CL cs.AI
keywords human evaluation · reproducibility · long-form text generation · NLP conferences · reporting practices · evaluation protocols · under-reporting · study design

The pith

Human evaluation protocols for long-form text generation in recent NLP conference papers are often incompletely reported, creating ambiguity about what was measured and by whom.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper examines reporting practices for human evaluations in papers on long-form text generation from *CL conferences between 2023 and 2025. It applies a defined set of 20 criteria for reproducibility to a manual review of 284 papers plus LLM-assisted analysis of more than 1800 additional papers. The analysis shows frequent omission of details on study design, participant information, judgment processes, and interpretation guidelines. This pattern leaves readers uncertain about the reliability of the evaluations and how to compare results across papers. The authors propose concrete steps to improve documentation in future work.

Core claim

A systematic review of human evaluation protocols in *CL publications reveals widespread under-reporting of important aspects of study design, who contributed judgments, and how judgments should be interpreted, which produces ambiguity about what was actually measured and how the results should be understood.

What carries the argument

A set of 20 reportable criteria related to reproducibility of human evaluation studies, applied to check what details papers include about design, participants, and judgment processes.
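
As a rough illustration of how a checklist of this kind turns into per-criterion reporting proportions, the sketch below tallies a toy annotation table. The criterion names, paper IDs, and data are hypothetical placeholders, not the paper's actual 20 criteria or its annotated dataset.

```python
from collections import Counter

# Hypothetical criteria; the paper's actual list of 20 is not reproduced here.
CRITERIA = [
    "num_annotators",
    "annotator_background",
    "agreement_measure",
    "guidelines_shared",
]

# Hypothetical annotations: paper id -> set of criteria that the paper reports.
annotations = {
    "paper_a": {"num_annotators", "agreement_measure"},
    "paper_b": {"num_annotators"},
    "paper_c": set(CRITERIA),
    "paper_d": set(),
}


def reporting_rates(annotations, criteria):
    """Fraction of papers that report each criterion."""
    n = len(annotations)
    return {
        c: sum(c in reported for reported in annotations.values()) / n
        for c in criteria
    }


def criteria_count_distribution(annotations, criteria):
    """Number of papers reporting exactly k of the criteria, for each observed k."""
    valid = set(criteria)
    return Counter(len(reported & valid) for reported in annotations.values())


rates = reporting_rates(annotations, CRITERIA)
dist = criteria_count_distribution(annotations, CRITERIA)
```

Here `rates` corresponds to the kind of per-criterion proportion summarized in Figure 1, and `dist` to the distribution of total criteria reported in Figure 2, under the toy data above.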

If this is right

  • Adopting the 20 criteria would make it easier to interpret and compare human evaluation results across different papers.
  • Papers would need to document participant recruitment, training, and agreement measures more consistently.
  • Ambiguity in current evaluations would decrease if journals and conferences required explicit reporting on these points.
  • Future comparisons of generation systems could rest on clearer evidence of evaluation quality.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • The same under-reporting pattern likely appears in evaluations of short-form or other generation tasks outside the long-form focus.
  • Conferences could reduce ambiguity by adding checklist items based on the 20 criteria during submission.
  • Greater transparency might shift research incentives toward more careful study design rather than just reporting results.

Load-bearing premise

The authors' chosen set of 20 criteria is enough to determine whether a human evaluation study is reproducible and interpretable.

What would settle it

A re-analysis of the same papers that applies a different or expanded list of criteria and finds high rates of complete reporting on the missing items.
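
One way such a re-analysis could be set up is sketched below: the same annotated papers are scored against two different criteria lists, and the share of papers that report every item on each list is compared. All criterion names and annotations here are hypothetical and serve only to illustrate the comparison.

```python
def complete_reporting_rate(annotations, criteria):
    """Fraction of papers that report every criterion in `criteria`."""
    if not annotations:
        raise ValueError("no papers to analyze")
    required = set(criteria)
    complete = sum(required <= reported for reported in annotations.values())
    return complete / len(annotations)


# Hypothetical annotations: paper id -> set of details the paper reports.
annotations = {
    "p1": {"num_annotators", "agreement_measure", "rating_scale"},
    "p2": {"num_annotators", "rating_scale", "sample_size"},
    "p3": {"rating_scale", "sample_size"},
    "p4": {"num_annotators", "agreement_measure", "rating_scale", "sample_size"},
}

# Two hypothetical criteria lists standing in for "original" and "alternative".
original_criteria = ["num_annotators", "agreement_measure"]
alternative_criteria = ["rating_scale", "sample_size"]

original_rate = complete_reporting_rate(annotations, original_criteria)
alternative_rate = complete_reporting_rate(annotations, alternative_criteria)
```

A large gap between the two rates on real data would suggest that the under-reporting finding is sensitive to which criteria are chosen.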

Figures

Figures reproduced from arXiv: 2606.07936 by Bingbing Wen, Chenjun Xu, Katelyn Xiaoying Mei, Lucy Lu Wang, Minjoon Choi, Su Lin Blodgett, Yi-Li Hsu, Zongwan Cao.

Figure 1: Average proportion of *CL papers reporting each of 20 core criteria related to the reproducibility of …
Figure 2: Distribution of total criteria reported; over …
Figure 3: Distributions of annotator and sample counts
Figure 4: Temporal trends for 2023–2025 *CL conferences. While papers studying long-form generation have increased in the last year, the proportional use of human evaluation for these tasks has decreased. Annotation quality control is rarely employed. We track whether researchers adopt any data filtering steps (i.e., attention checks, manipulation checks), techniques to remove low-quality crowdsourced data or any …
Figure 5: Prompt used for LLM-based filtering to identify papers studying long-form generation tasks and which …
Figure 6: Partial screenshot of our annotation interface in Google Sheets showing questions pertaining to documen…
Figure 7: Prompt for LLM-assisted annotation: input prompt structure for each LLM call.
Figure 8: Prompt for LLM-assisted annotation: chunk structure for codebook questions.
Figure 9: Prompt for LLM-assisted annotation: full question schema used for LLM annotation.
Figure 10: Distribution of stemmed evaluation dimensions across all papers (Overall) in the manually annotated set
Figure 11: Temporal trends in reporting: across all *CL papers (2023-2025) with human evaluation and long-form …
Figure 12: Frequency of Reporting Criteria for Common NLP Tasks
Figure 13: Frequency of disagreement resolution method reported in manually-annotated sample: Most papers tend not to report how they address disagreement among annotators (n=202). Among the ones that report this criteria, majority vote (n=31) is the most common approach for addressing disagreement among annotators, followed by averaging (n=20), consensus process (n=14), other (n=13), or picking one annotation (n=4…
Figure 14: Distribution of IAA strength reported in …
Original abstract

Human evaluation plays a critical role in assessing the quality of generated text. However, the reliability and reproducibility of these evaluations depend on transparent and well-documented protocols, details that are frequently missing in current practice. In this work, we conduct a large-scale analysis of human evaluation protocols for evaluating long-form generation tasks in *CL conference publications from 2023–2025, including a full manual review of 284 papers and LLM-assisted analysis for another 1.8k+ papers. We define a set of 20 reportable criteria related to reproducibility of human evaluation studies, and apply these criteria to systematically examine reporting norms and practices within the community. We find widespread under-reporting of important aspects of human evaluation study design, leading to ambiguity about what was measured and how, who contributed judgments, and how judgments should be interpreted. Based on these findings, we outline actionable recommendations to support more transparent and reproducible reporting in future research. Our analysis code and annotated dataset can be found at: https://github.com/larchlab/Illusions-of-the-Gold-Standard

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper conducts a large-scale observational analysis of human evaluation protocols for long-form text generation in *CL conference papers (2023–2025). It performs a full manual review of 284 papers plus LLM-assisted analysis of an additional 1.8k+ papers against a fixed set of 20 reportable criteria for reproducibility. The central claim is that widespread under-reporting of study-design details creates ambiguity about what was measured, who provided judgments, and how results should be interpreted; the authors provide recommendations and release code plus an annotated dataset.

Significance. If the central observational claim holds, the work is significant because human evaluation remains a primary method for assessing long-form generation quality, and documented under-reporting directly affects reproducibility and interpretability in the field. The explicit release of analysis code and the annotated dataset is a clear strength that supports verification and follow-up studies. The findings could usefully inform community guidelines, provided the 20 criteria are shown to be well-aligned with existing reporting standards.

major comments (2)
  1. [Section defining the 20 criteria] The manuscript presents these criteria as capturing 'important aspects' necessary for reproducibility and interpretability, yet provides no external validation (e.g., expert survey, comparison against ACL or prior meta-study reporting guidelines, or inter-rater agreement on criterion importance). Because the prevalence statistics and the downstream claim of 'widespread under-reporting of important aspects' rest directly on this author-defined set, the absence of such validation makes the quantitative conclusions sensitive to the particular framing chosen.
  2. [LLM-assisted analysis section, 1.8k+ papers] The extension from the 284 manually reviewed papers to the larger corpus is load-bearing for the 'widespread' claim, but the manuscript does not report prompt details, few-shot examples, or measured agreement/error rates between the LLM outputs and the manual annotations. Without these, systematic biases in the automated labeling could materially affect the reported under-reporting rates.
minor comments (2)
  1. [Table 1, or equivalent summary table of criteria] The mapping from each criterion to the specific ambiguity it addresses (measurement, contributors, or interpretation) could be made more explicit to help readers trace how missing items produce the claimed ambiguities.
  2. [Data and code release] The GitHub link is provided, but the README should include a clear description of how the 284 manual annotations were performed (annotator background, resolution process) to strengthen reproducibility of the core dataset.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback. We address each major comment below and outline planned revisions to improve the manuscript.

Point-by-point responses
  1. Referee: [Section defining the 20 criteria] The manuscript presents these criteria as capturing 'important aspects' necessary for reproducibility and interpretability, yet provides no external validation (e.g., expert survey, comparison against ACL or prior meta-study reporting guidelines, or inter-rater agreement on criterion importance). Because the prevalence statistics and the downstream claim of 'widespread under-reporting of important aspects' rest directly on this author-defined set, the absence of such validation makes the quantitative conclusions sensitive to the particular framing chosen.

    Authors: We agree that additional justification for the criteria would strengthen the work. The 20 criteria were synthesized from recurring elements in prior NLP literature on human evaluation reproducibility and reporting standards. In revision we will add a new subsection explicitly mapping each criterion to relevant ACL guidelines and earlier meta-studies, together with a brief rationale for inclusion. While we did not conduct a new expert survey, this explicit alignment will reduce sensitivity to the chosen framing and make the prevalence claims more robust.
    Revision: partial

  2. Referee: [LLM-assisted analysis section, 1.8k+ papers] The extension from the 284 manually reviewed papers to the larger corpus is load-bearing for the 'widespread' claim, but the manuscript does not report prompt details, few-shot examples, or measured agreement/error rates between the LLM outputs and the manual annotations. Without these, systematic biases in the automated labeling could materially affect the reported under-reporting rates.

    Authors: We concur that full transparency on the LLM-assisted labeling is required. The original submission omitted these details for brevity. The revised manuscript will include the complete prompts, few-shot examples, and a dedicated error-analysis subsection reporting agreement rates (and disagreement categories) between the LLM and the manual annotations on a held-out validation set. This addition will allow readers to evaluate potential biases directly.
    Revision: yes
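
The agreement check discussed in this exchange could take several forms; neither the referee nor the rebuttal names a specific statistic. As one minimal sketch, assuming binary "reported / not reported" labels per paper and criterion, Cohen's kappa corrects raw agreement for chance. The label vectors below are invented for illustration and are not drawn from the paper's data.

```python
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two raters labeling the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("label lists must be non-empty and of equal length")
    n = len(labels_a)
    # Observed agreement: share of items on which both raters gave the same label.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Expected chance agreement from each rater's marginal label frequencies.
    counts_a = Counter(labels_a)
    counts_b = Counter(labels_b)
    expected = sum(counts_a[k] * counts_b[k] for k in counts_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used one identical label throughout
    return (observed - expected) / (1.0 - expected)


# Hypothetical labels (1 = criterion reported, 0 = not reported).
manual_labels = [1, 1, 0, 0, 1, 0, 1, 0]
llm_labels = [1, 1, 0, 1, 1, 0, 0, 0]

kappa = cohens_kappa(manual_labels, llm_labels)
```

In this toy case the two label sets agree on 6 of 8 items (0.75), chance agreement is 0.5, and kappa is 0.5. Reporting such a value per criterion, alongside raw agreement, would let readers see where automated labeling is least reliable.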

Circularity Check

0 steps flagged

No significant circularity in observational literature analysis

full rationale

The paper performs a manual and LLM-assisted review of reporting practices in other papers by first defining an explicit set of 20 criteria and then counting their presence or absence. No equations, fitted parameters, predictions, or self-citation chains exist that reduce any central claim to its own inputs by construction. The analysis is self-contained empirical observation against transparently stated criteria; the absence of any enumerated circularity pattern (self-definitional, fitted-input prediction, load-bearing self-citation, etc.) yields a score of 0.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The analysis depends on the domain assumption that the 20 criteria are adequate for assessing reproducibility; no free parameters or invented entities are introduced.

axioms (1)
  • domain assumption: The defined set of 20 reportable criteria adequately captures key aspects of human evaluation reproducibility.
    Criteria are introduced as the basis for the systematic examination of papers.

pith-pipeline@v0.9.1-grok · 5747 in / 1100 out tokens · 21236 ms · 2026-06-27T20:18:32.801057+00:00 · methodology


Reference graph

Works this paper leans on

140 extracted references · 39 canonical work pages

  1. [1]

    Scientific reports , volume=

    Expert evaluation of large language models for clinical dialogue summarization , author=. Scientific reports , volume=. 2025 , publisher=

  2. [2]

    A Critical Evaluation of Evaluations for Long-form Question Answering

    Xu, Fangyuan and Song, Yixiao and Iyyer, Mohit and Choi, Eunsol. A Critical Evaluation of Evaluations for Long-form Question Answering. Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). 2023. doi:10.18653/v1/2023.acl-long.181

  3. [3]

    Responsible AI Considerations in Text Summarization Research: A Review of Current Practices

    Liu, Yu Lu and Cao, Meng and Blodgett, Su Lin and Cheung, Jackie Chi Kit and Olteanu, Alexandra and Trischler, Adam. Responsible AI Considerations in Text Summarization Research: A Review of Current Practices. Findings of the Association for Computational Linguistics: EMNLP 2023. 2023. doi:10.18653/v1/2023.findings-emnlp.413

  4. [4]

    Automated Metrics for Medical Multi-Document Summarization Disagree with Human Evaluations

    Wang, Lucy Lu and Otmakhova, Yulia and DeYoung, Jay and Truong, Thinh Hung and Kuehl, Bailey and Bransom, Erin and Wallace, Byron C. Automated Metrics for Medical Multi-Document Summarization Disagree with Human Evaluations. Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). 2023. doi:10.18653/...

  5. [5]

    O pen R eviewer: A Specialized Large Language Model for Generating Critical Scientific Paper Reviews

    Idahl, Maximilian and Ahmadi, Zahra. O pen R eviewer: A Specialized Large Language Model for Generating Critical Scientific Paper Reviews. Proceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Language Technologies (System Demonstrations). 2025. doi:10.18653/v1/2025.naacl-demo.44

  6. [6]

    First Conference on Language Modeling , year=

    Fine-grained hallucination detection and editing for language models , author=. First Conference on Language Modeling , year=

  7. [7]

    Proceedings of the Fourth Workshop on Generation, Evaluation and Metrics (GEM ^2 ) , pages=

    Heds 3.0: The human evaluation data sheet version 3.0 , author=. Proceedings of the Fourth Workshop on Generation, Evaluation and Metrics (GEM ^2 ) , pages=

  8. [8]

    NPJ digital medicine , volume=

    A framework for human evaluation of large language models in healthcare derived from literature review , author=. NPJ digital medicine , volume=. 2024 , publisher=

  9. [9]

    Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing: Tutorial Abstracts , pages=

    Human-centered evaluation of language technologies , author=. Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing: Tutorial Abstracts , pages=

  10. [10]

    Proceedings of the 2nd Workshop on Human Evaluation of NLP Systems (HumEval) , pages=

    The human evaluation datasheet: A template for recording details of human evaluation experiments in NLP , author=. Proceedings of the 2nd Workshop on Human Evaluation of NLP Systems (HumEval) , pages=

  11. [11]

    Proceedings of the Fourth Workshop on Insights from Negative Results in NLP , pages=

    Missing information, unresponsive authors, experimental flaws: The impossibility of assessing the reproducibility of previous human evaluations in NLP , author=. Proceedings of the Fourth Workshop on Insights from Negative Results in NLP , pages=

  12. [12]

    Computational Linguistics , volume=

    Common flaws in running human evaluation experiments in NLP , author=. Computational Linguistics , volume=. 2024 , publisher=

  13. [13]

    University of Chicago Coase-Sandor Institute for Law & Economics Research Paper , number=

    Judge AI: Assessing large language models in judicial decision-making , author=. University of Chicago Coase-Sandor Institute for Law & Economics Research Paper , number=

  14. [14]

    Proceedings of the Fourth Workshop on Generation, Evaluation and Metrics (GEM ^2 ) , pages=

    Judging the judges: Evaluating alignment and vulnerabilities in llms-as-judges , author=. Proceedings of the Fourth Workshop on Generation, Evaluation and Metrics (GEM ^2 ) , pages=

  15. [15]

    Proceedings of the 63rd

    Bavaresco, Anna and Bernardi, Raffaella and Bertolazzi, Leonardo and Elliott, Desmond and Fern \'a ndez, Raquel and Gatt, Albert and Ghaleb, Esam and Giulianelli, Mario and Hanna, Michael and Koller, Alexander and Martins, Andre and Mondorf, Philipp and Neplenbroek, Vera and Pezzelle, Sandro and Plank, Barbara and Schlangen, David and Suglia, Alessandro a...

  16. [16]

    Advances in Neural Information Processing Systems , volume=

    Self-refine: Iterative refinement with self-feedback , author=. Advances in Neural Information Processing Systems , volume=

  17. [17]

    npj Health Systems , volume=

    Human evaluation of large language models in healthcare: gaps, challenges, and the need for standardization , author=. npj Health Systems , volume=. 2025 , publisher=

  18. [18]

    Proceedings of the 2024 ACM Conference on Fairness, Accountability, and Transparency , pages=

    Escalation risks from language models in military and diplomatic decision-making , author=. Proceedings of the 2024 ACM Conference on Fairness, Accountability, and Transparency , pages=

  19. [19]

    Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers) , pages=

    Can language model moderators improve the health of online discourse? , author=. Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers) , pages=

  20. [20]

    Proceedings of the 31st International Conference on Computational Linguistics , pages=

    Internlm-law: An open-sourced chinese legal large language model , author=. Proceedings of the 31st International Conference on Computational Linguistics , pages=

  21. [21]

    arXiv preprint arXiv:2405.05860 , year=

    The perspectivist paradigm shift: Assumptions and challenges of capturing human labels , author=. arXiv preprint arXiv:2405.05860 , year=

  22. [22]

    Nature human behaviour , volume=

    A manifesto for reproducible science , author=. Nature human behaviour , volume=. 2017 , publisher=

  23. [23]

    2022 , url =

    Shaurya Rohatgi , title =. 2022 , url =

  24. [24]

    On Context Utilization in Summarization with Large Language Models

    Ravaut, Mathieu and Sun, Aixin and Chen, Nancy and Joty, Shafiq. On Context Utilization in Summarization with Large Language Models. Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). 2024. doi:10.18653/v1/2024.acl-long.153

  25. [25]

    ArXiv , year=

    How Reliable Are Automatic Evaluation Methods for Instruction-Tuned LLMs? , author=. ArXiv , year=

  26. [26]

    arXiv preprint arXiv:2312.07559 , year=

    Paperqa: Retrieval-augmented generative agent for scientific research , author=. arXiv preprint arXiv:2312.07559 , year=

  27. [27]

    arXiv e-prints , pages=

    A foundation model for human-AI collaboration in medical literature mining , author=. arXiv e-prints , pages=

  28. [28]

    arXiv preprint arXiv:2411.14199 , year=

    Openscholar: Synthesizing scientific literature with retrieval-augmented lms , author=. arXiv preprint arXiv:2411.14199 , year=

  29. [29]

    Text summarization branches out , pages=

    Rouge: A package for automatic evaluation of summaries , author=. Text summarization branches out , pages=

  30. [30]

    arXiv preprint arXiv:2310.06825 , year=

    Mistral 7B , author=. arXiv preprint arXiv:2310.06825 , year=

  31. [31]

    arXiv preprint arXiv:2303.08774 , year=

    Gpt-4 technical report , author=. arXiv preprint arXiv:2303.08774 , year=

  32. [32]

    Clinical Natural Language Processing Workshop , year=

    Generating medically-accurate summaries of patient-provider dialogue: A multi-stage approach using large language models , author=. Clinical Natural Language Processing Workshop , year=

  33. [33]

    Conference on Empirical Methods in Natural Language Processing , year=

    Hierarchical Catalogue Generation for Literature Review: A Benchmark , author=. Conference on Empirical Methods in Natural Language Processing , year=

  34. [34]

    Annual Meeting of the Association for Computational Linguistics , year=

    Multi-News: A Large-Scale Multi-Document Summarization Dataset and Abstractive Hierarchical Model , author=. Annual Meeting of the Association for Computational Linguistics , year=

  35. [35]

    Annual Meeting of the Association for Computational Linguistics , year=

    Hierarchical Transformers for Multi-Document Summarization , author=. Annual Meeting of the Association for Computational Linguistics , year=

  36. [36]

    Annual Meeting of the Association for Computational Linguistics , year=

    Summ ^N : A Multi-Stage Summarization Framework for Long Input Dialogues and Documents , author=. Annual Meeting of the Association for Computational Linguistics , year=

  37. [37]

    ArXiv , year=

    Leveraging Long-Context Large Language Models for Multi-Document Understanding and Summarization in Enterprise Applications , author=. ArXiv , year=

  38. [38]

    , author=

    A Hierarchical Decoder with Three-level Hierarchical Attention to Generate Abstractive Summaries of Interleaved Texts. , author=. arXiv: Computation and Language , year=

  39. [39]

    ArXiv , year=

    uMedSum: A Unified Framework for Advancing Medical Abstractive Summarization , author=. ArXiv , year=

  40. [40]

    , author=

    Roles of Document Structure, Cognitive Strategy, and Awareness in Searching for Information. , author=. Reading Research Quarterly , year=

  41. [41]

    , author=

    The Effects of Text Structure Instruction on Middle-Grade Students' Comprehension and Production of Expository Text. , author=. Reading Research Quarterly , year=

  42. [42]

    Yu, Qiang Yang, and Xing Xie

    Chang, Yupeng and Wang, Xu and Wang, Jindong and Wu, Yuan and Yang, Linyi and Zhu, Kaijie and Chen, Hao and Yi, Xiaoyuan and Wang, Cunxiang and Wang, Yidong and Ye, Wei and Zhang, Yue and Chang, Yi and Yu, Philip S. and Yang, Qiang and Xie, Xing , title =. ACM Trans. Intell. Syst. Technol. , month = mar, articleno =. 2024 , issue_date =. doi:10.1145/36412...

  43. [43]

    Improving Factuality in Clinical Abstractive Multi-Document Summarization by Guided Continued Pre-training

    Elhady, Ahmed and Elsayed, Khaled and Agirre, Eneko and Artetxe, Mikel. Improving Factuality in Clinical Abstractive Multi-Document Summarization by Guided Continued Pre-training. Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 2: Short Papers). 2024. do...

  44. [44]

    arXiv preprint arXiv:2305.14251 , year=

    Factscore: Fine-grained atomic evaluation of factual precision in long form text generation , author=. arXiv preprint arXiv:2305.14251 , year=

  45. [45]

    arXiv preprint arXiv:2501.03545 , year=

    Beyond Factual Accuracy: Evaluating Coverage of Diverse Factual Information in Long-form Text Generation , author=. arXiv preprint arXiv:2501.03545 , year=

  46. [46]

    Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers) , pages=

    Embrace Divergence for Richer Insights: A Multi-document Summarization Benchmark and a Case Study on Summarizing Diverse Information from News Articles , author=. Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers) , pages=

  47. [47]

    medRxiv , pages=

    Synthetic Data Distillation Enables the Extraction of Clinical Information at Scale , author=. medRxiv , pages=. 2024 , publisher=

  48. [48]

    and Villarroel, Mauricio and Clifford, Gari D

    Lee, Joon and Scott, Daniel J. and Villarroel, Mauricio and Clifford, Gari D. and Saeed, Mohammed and Mark, Roger G. , booktitle=. Open-access MIMIC-II database for intensive care research , year=

  49. [49]

    ArXiv , year=

    ClinicalBERT: Modeling Clinical Notes and Predicting Hospital Readmission , author=. ArXiv , year=

  50. [50]

    Bioinformatics , volume =

    Lee, Jinhyuk and Yoon, Wonjin and Kim, Sungdong and Kim, Donghyeon and Kim, Sunkyu and So, Chan Ho and Kang, Jaewoo , title =. Bioinformatics , volume =. 2019 , month =. doi:10.1093/bioinformatics/btz682 , url =

  51. [51]

    TOPICAL : TOPIC Pages A utomagica L ly

    Giorgi, John and Singh, Amanpreet and Downey, Doug and Feldman, Sergey and Wang, Lucy. TOPICAL : TOPIC Pages A utomagica L ly. Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 3: System Demonstrations). 2024. doi:10.18653/v1/2024.naacl-demo.1

  52. [52]

    What`s in a Summary? Laying the Groundwork for Advances in Hospital-Course Summarization

    Adams, Griffin and Alsentzer, Emily and Ketenci, Mert and Zucker, Jason and Elhadad, No \'e mie. What`s in a Summary? Laying the Groundwork for Advances in Hospital-Course Summarization. Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies. 2021. doi:10.18653/v1/2021...

  53. [53]

    2015 26th international workshop on database and expert systems applications (dexa) , pages=

    Clinical decision support systems: a survey of NLP-based approaches from unstructured data , author=. 2015 26th international workshop on database and expert systems applications (dexa) , pages=. 2015 , organization=

  54. [54]

    Journal of Intelligent Connectivity and Emerging Technologies , volume=

    Natural language processing for clinical decision support systems: A review of recent advances in healthcare , author=. Journal of Intelligent Connectivity and Emerging Technologies , volume=

  55. [55]

    Journal of biomedical informatics , volume=

    What can natural language processing do for clinical decision support? , author=. Journal of biomedical informatics , volume=. 2009 , publisher=

  56. [56]

    A Novel System for Extractive Clinical Note Summarization using EHR Data

    Liang, Jennifer and Tsou, Ching-Huei and Poddar, Ananya. A Novel System for Extractive Clinical Note Summarization using EHR Data. Proceedings of the 2nd Clinical Natural Language Processing Workshop. 2019. doi:10.18653/v1/W19-1906

  57. [57]

    Generating SOAP Notes from Doctor-Patient Conversations Using Modular Summarization Techniques

    Krishna, Kundan and Khosla, Sopan and Bigham, Jeffrey and Lipton, Zachary C. Generating SOAP Notes from Doctor-Patient Conversations Using Modular Summarization Techniques. Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Paper...

  58. [58]

    Towards Automating Medical Scribing : Clinic Visit D ialogue2 N ote Sentence Alignment and Snippet Summarization

    Yim, Wen-wai and Yetisgen, Meliha. Towards Automating Medical Scribing : Clinic Visit D ialogue2 N ote Sentence Alignment and Snippet Summarization. Proceedings of the Second Workshop on Natural Language Processing for Medical Conversations. 2021. doi:10.18653/v1/2021.nlpmc-1.2

  59. [59]

    DERA : Enhancing Large Language Model Completions with Dialog-Enabled Resolving Agents

    Nair, Varun and Schumacher, Elliot and Tso, Geoffrey and Kannan, Anitha. DERA : Enhancing Large Language Model Completions with Dialog-Enabled Resolving Agents. Proceedings of the 6th Clinical Natural Language Processing Workshop. 2024. doi:10.18653/v1/2024.clinicalnlp-1.12

  60. [60]

    JMIR medical education , volume=

    How does ChatGPT perform on the United States Medical Licensing Examination (USMLE)? The implications of large language models for medical education and knowledge assessment , author=. JMIR medical education , volume=. 2023 , publisher=

  61. [61]

    Cureus , volume=

    Overview of early ChatGPT’s presence in medical literature: insights from a hybrid literature review by ChatGPT and human experts , author=. Cureus , volume=. 2023 , publisher=

  62. [62]

    ArXiv , year=

    Bio-SIEVE: Exploring Instruction Tuning Large Language Models for Systematic Review Automation , author=. ArXiv , year=

  63. [63]

    CoRR , volume =

    Athanasios Lagopoulos and Grigorios Tsoumakas , title =. CoRR , volume =. 2020 , url =. 2011.09752 , timestamp =

  64. [64]

    Benchmarking Large Language Models for News Summarization

    Zhang, Tianyi and Ladhak, Faisal and Durmus, Esin and Liang, Percy and McKeown, Kathleen and Hashimoto, Tatsunori B. Benchmarking Large Language Models for News Summarization. Transactions of the Association for Computational Linguistics. 2024. doi:10.1162/tacl_a_00632

  65. [65]

    On Learning to Summarize with Large Language Models as References

    Liu, Yixin and Shi, Kejian and He, Katherine and Ye, Longtian and Fabbri, Alexander and Liu, Pengfei and Radev, Dragomir and Cohan, Arman. On Learning to Summarize with Large Language Models as References. Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume ...

  66. [66]

    Summarizing, Simplifying, and Synthesizing Medical Evidence using GPT-3 (with Varying Success)

    Shaib, Chantal and Li, Millicent and Joseph, Sebastian and Marshall, Iain and Li, Junyi Jessy and Wallace, Byron. Summarizing, Simplifying, and Synthesizing Medical Evidence using GPT-3 (with Varying Success). Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers). 2023. doi:10.18653/v1/2023.acl-short.119

  67. [67]

    ACM Comput. Surv.

    Koh, Huan Yee and Ju, Jiaxin and Liu, Ming and Pan, Shirui. ACM Comput. Surv. December 2022. doi:10.1145/3545176

  68. [68]

    DocAsRef: An Empirical Study on Repurposing Reference-based Summary Quality Metrics as Reference-free Metrics

    DocAsRef: An Empirical Study on Repurposing Reference-based Summary Quality Metrics as Reference-free Metrics. Conference on Empirical Methods in Natural Language Processing.

  69. [69]

    ACM Trans. Speech Lang. Process.

    Nenkova, Ani and Passonneau, Rebecca and McKeown, Kathleen. ACM Trans. Speech Lang. Process. May 2007. doi:10.1145/1233912.1233913

  70. [70]

    ACM Comput. Surv.

    Jangra, Anubhav and Mukherjee, Sourajit and Jatowt, Adam and Saha, Sriparna and Hasanuzzaman, Mohammad. ACM Comput. Surv. July 2023. doi:10.1145/3584700

  71. [71]

    A Simple Theoretical Model of Importance for Summarization

    A Simple Theoretical Model of Importance for Summarization. Annual Meeting of the Association for Computational Linguistics.

  72. [72]

    Earlier Isn’t Always Better: Sub-aspect Analysis on Corpus and System Biases in Summarization

    Earlier Isn’t Always Better: Sub-aspect Analysis on Corpus and System Biases in Summarization. ArXiv.

  73. [73]

    Potential Roles of Large Language Models in the Production of Systematic Reviews and Meta-Analyses

    Potential Roles of Large Language Models in the Production of Systematic Reviews and Meta-Analyses. Journal of Medical Internet Research.

  74. [74]

    Analysis of the time and workers needed to conduct systematic reviews of medical interventions using data from the PROSPERO registry

    Analysis of the time and workers needed to conduct systematic reviews of medical interventions using data from the PROSPERO registry. BMJ Open.

  75. [75]

    The Resilience of Critical Infrastructure Systems: A Systematic Literature Review

    The Resilience of Critical Infrastructure Systems: A Systematic Literature Review. Energies.

  76. [76]

    Hierarchical summarization: Scaling up multi-document summarization

    Hierarchical summarization: Scaling up multi-document summarization. Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers).

  77. [77]

    Generating (Factual?) Narrative Summaries of RCTs: Experiments with Neural Multi-Document Summarization

    Generating (Factual?) Narrative Summaries of RCTs: Experiments with Neural Multi-Document Summarization. AMIA ... Annual Symposium proceedings. AMIA Symposium.

  78. [78]

    What’s in a Summary? Laying the Groundwork for Advances in Hospital-Course Summarization

    What’s in a Summary? Laying the Groundwork for Advances in Hospital-Course Summarization. Proceedings of the conference. Association for Computational Linguistics. North American Chapter. Meeting.

  79. [79]

    Hierarchical process memory: memory as an integral component of information processing

    Hierarchical process memory: memory as an integral component of information processing. Trends in Cognitive Sciences. 2015.

  80. [80]

    Memory: Levels of Processing

    F.I.M. Craik. Memory: Levels of Processing. International Encyclopedia of the Social & Behavioral Sciences. 2001. doi:10.1016/B0-08-043076-7/01508-4

Showing first 80 references.