Comparing LLM and Fine-Tuned Model Performance on NVDRS Circumstance Extraction with Varying Prompt Complexity

Geoffrey Martin; Xuan Zhong Feng; Yifan Peng

arxiv: 2605.21845 · v1 · pith:XOTZLSHHnew · submitted 2026-05-21 · 💻 cs.CL · cs.AI

Pith reviewed 2026-05-22 07:17 UTC · model grok-4.3

keywords LLM · fine-tuned models · NVDRS · suicide circumstances · prompt complexity · circumstance extraction · hybrid architecture · death investigation narratives

The pith

Large language models outperform fine-tuned models on low-prevalence complex circumstances from suicide death reports.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper develops a Complexity Score algorithm that examines coding manual structure to decide whether detailed prompts with full guidelines will beat simple name-only prompts for extracting circumstances. It tests this on 25 inferentially complex circumstances drawn from the National Violent Death Reporting System and compares large language models against fine-tuned RoBERTa. LLMs show clear gains precisely where training data is scarce, and the framework produces similar patterns across several frontier models. The results point to a practical division of labor: language models for rare and inferentially demanding circumstances, fine-tuned models for frequent ones. Better extraction of these preceding circumstances can support more targeted suicide prevention work.

Core claim

Large language models substantially outperform fine-tuned RoBERTa on low-prevalence, inferentially complex circumstances from NVDRS narratives. A Complexity Score that analyzes coding manual structure predicts when detailed prompts improve over name-only prompts, and the same performance pattern holds across GPT-5.2, Gemini 2.5 Pro and Llama-3 70B, supporting a hybrid architecture that assigns LLMs to rare cases and fine-tuned models to common ones.

What carries the argument

The Complexity Score algorithm, which analyzes coding manual structure to predict when detailed prompts with full coding guidelines improve over name-only prompts for inferentially complex circumstances.
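
To make the two prompt strategies concrete, here is a minimal sketch of how a name-only prompt and a detailed prompt might be built from one coding manual entry. The `ManualEntry` schema, the field names, and the prompt wording are all assumptions made for illustration; the source does not give the prompts the authors used.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class ManualEntry:
    """One circumstance as it might appear in a coding manual (hypothetical schema)."""
    name: str
    definition: str
    guidelines: Tuple[str, ...] = ()


# Hypothetical question wording shared by both prompt variants.
QUESTION = "Does the narrative indicate this circumstance? Answer yes or no."


def name_only_prompt(entry: ManualEntry, narrative: str) -> str:
    """Prompt that gives the model nothing but the circumstance name."""
    return (
        f"Circumstance: {entry.name}\n"
        f"Narrative: {narrative}\n"
        f"{QUESTION}"
    )


def detailed_prompt(entry: ManualEntry, narrative: str) -> str:
    """Prompt that also includes the definition and the coding guidelines."""
    rules = "\n".join(f"- {g}" for g in entry.guidelines)
    return (
        f"Circumstance: {entry.name}\n"
        f"Definition: {entry.definition}\n"
        f"Coding guidelines:\n{rules}\n"
        f"Narrative: {narrative}\n"
        f"{QUESTION}"
    )
```

The only difference between the two functions is the extra definition and guideline text, which is the variable the Complexity Score is said to reason about.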

If this is right

LLMs should be used for rare circumstances that lack sufficient training examples.
Fine-tuned models should continue to handle common circumstances where labeled data is abundant.
Overall extraction accuracy rises when prompt strategy is chosen per circumstance rather than applied uniformly.
The hybrid pattern appears consistently across multiple frontier large language models.
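
The division of labor listed above can be sketched as a small routing rule. This is an illustration only: the two cut-off values, the assumption that the Complexity Score lies in [0, 1], and the `Circumstance` and `route` names are hypothetical, since the source reports no thresholds or score range.

```python
from dataclasses import dataclass
from typing import Optional, Tuple

# Hypothetical cut-offs; the source does not report any thresholds.
PREVALENCE_CUTOFF = 0.05
COMPLEXITY_CUTOFF = 0.5


@dataclass(frozen=True)
class Circumstance:
    name: str
    prevalence: float  # fraction of training narratives labeled positive
    complexity: float  # Complexity Score, assumed here to lie in [0, 1]


def route(c: Circumstance) -> Tuple[str, Optional[str]]:
    """Return (model_family, prompt_strategy) for one circumstance.

    Common circumstances go to the fine-tuned model, which takes no prompt.
    Rare circumstances go to an LLM, with the prompt depth chosen by score.
    """
    if c.prevalence >= PREVALENCE_CUTOFF:
        return ("fine_tuned", None)
    prompt = "detailed" if c.complexity >= COMPLEXITY_CUTOFF else "name_only"
    return ("llm", prompt)
```

In practice the cut-offs would have to be chosen on validation data rather than fixed by hand as they are here.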

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

Health agencies could adopt similar hybrid systems to improve structured data from narrative reports in other public-health domains.
The same division of labor might reduce annotation costs when building extraction tools for imbalanced medical or social datasets.
Testing the Complexity Score on new coding manuals or different narrative sources would show how far the prediction rule travels.

Load-bearing premise

The Complexity Score algorithm accurately predicts which circumstances will benefit from detailed prompts rather than name-only prompts.

What would settle it

Measuring performance on the 25 circumstances and finding no correlation between the Complexity Score values and the actual gain from using full guidelines over name-only prompts.
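
One way to run such a check is a rank correlation between the per-circumstance Complexity Score and the observed F1 gain from detailed prompts. The sketch below implements Spearman's coefficient in plain Python; the choice of Spearman over other correlation measures is an assumption of this example, not something the source specifies.

```python
from math import sqrt


def average_ranks(values):
    """Return 1-based ranks, giving tied values the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j across the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def spearman(xs, ys):
    """Spearman rank correlation: Pearson correlation of the rank vectors."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equal-length sequences with at least 2 items")
    rx, ry = average_ranks(xs), average_ranks(ys)
    mx, my = sum(rx) / len(rx), sum(ry) / len(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined when one input is constant")
    return cov / sqrt(var_x * var_y)
```

Here `xs` would hold the 25 Complexity Score values and `ys` the matching differences between detailed-prompt F1 and name-only F1. A coefficient near zero would count against the load-bearing premise.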

Original abstract

Suicide is a leading cause of death in the United States, and understanding the circumstances that precede it requires extracting structured information from death investigation narratives. Many of these circumstances require semantic inference beyond simple keyword matching. We develop a "Complexity Score" algorithm that analyzes coding manual structure to predict when detailed prompts with full coding guidelines improve over name-only prompts. We then construct a hybrid approach that selects prompt strategy per circumstance. We evaluate large language models (LLMs) against fine-tuned RoBERTa on 25 inferentially complex circumstances from the National Violent Death Reporting System (NVDRS). We found that LLMs substantially outperform on low-prevalence circumstances where training data is insufficient. We further demonstrate that our framework generalizes across frontier LLMs, with GPT-5.2, Gemini 2.5 Pro and Llama-3 70B showing consistent performance patterns. These findings support a hybrid architecture where LLMs handle rare, inferentially complex circumstances while fine-tuned models handle common ones.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript develops a Complexity Score algorithm that analyzes the structure of the NVDRS coding manual to select between name-only and detailed prompts (including full coding guidelines) for extracting 25 inferentially complex circumstances from death investigation narratives. It then compares frontier LLMs (GPT-5.2, Gemini 2.5 Pro, Llama-3 70B) against a fine-tuned RoBERTa baseline on held-out NVDRS data, claiming that LLMs substantially outperform on low-prevalence circumstances where training data is insufficient, that performance patterns are consistent across models, and that the results support a hybrid architecture in which LLMs handle rare complex cases while fine-tuned models handle common ones.

Significance. If the reported per-circumstance gains hold with proper controls, the work offers a practical, non-circular method for deciding when to deploy LLMs versus fine-tuned models in low-resource information extraction settings. The use of an externally derived Complexity Score (rather than one fitted to the evaluation results) is a methodological strength that supports the generalization claim across LLMs.

major comments (2)

[Abstract and Methods] The abstract and methods description provide no quantitative metrics (F1, precision, recall, or error bars), no dataset sizes or prevalence statistics for the 25 circumstances, and no explicit validation that the Complexity Score actually predicts prompt-depth gains on held-out data; these omissions make it impossible to assess whether the central empirical claim (LLM outperformance on low-prevalence items) is load-bearing or merely suggestive.
[Methods] The selection criteria for the 25 inferentially complex circumstances and the exact construction of the Complexity Score (including how coding-manual structure is quantified) are not detailed enough to allow replication or to rule out selection bias; this directly affects the claim that the framework generalizes.
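
The request for F1 with error bars can be illustrated with a percentile bootstrap over narratives. This is a sketch under stated assumptions: labels are binary 0/1 per circumstance, and the bootstrap is one of several possible interval methods; the source does not say which method the authors use. The function names and defaults are hypothetical.

```python
import random


def f1_score(y_true, y_pred):
    """Binary F1 for 0/1 labels; returns 0.0 when there are no true positives."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def bootstrap_f1_interval(y_true, y_pred, n_resamples=1000, alpha=0.05, seed=0):
    """Percentile bootstrap interval for F1, resampling narratives with replacement."""
    rng = random.Random(seed)
    n = len(y_true)
    scores = []
    for _ in range(n_resamples):
        idx = [rng.randrange(n) for _ in range(n)]
        scores.append(f1_score([y_true[i] for i in idx], [y_pred[i] for i in idx]))
    scores.sort()
    # Read the lower and upper percentiles off the sorted resampled scores.
    lo = scores[int((alpha / 2) * n_resamples)]
    hi = scores[min(n_resamples - 1, int((1 - alpha / 2) * n_resamples))]
    return lo, hi
```

For low-prevalence circumstances many resamples may contain few or no positives, so the interval can be wide; that width is itself useful evidence about how strongly the rare-case comparison is supported.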

minor comments (2)

[Results] Add a table or figure showing per-circumstance prevalence, baseline RoBERTa performance, and LLM performance with both prompt strategies to make the hybrid-architecture recommendation concrete.
[Experimental Setup] Clarify whether the held-out NVDRS split was stratified by circumstance prevalence or by narrative length.

Simulated Authors' Rebuttal

2 responses · 0 unresolved

We thank the referee for their thoughtful and constructive comments on our manuscript. We have carefully considered each point and provide point-by-point responses below. Where appropriate, we have revised the manuscript to address the concerns raised.

Referee: [Abstract and Methods] The abstract and methods description provide no quantitative metrics (F1, precision, recall, or error bars), no dataset sizes or prevalence statistics for the 25 circumstances, and no explicit validation that the Complexity Score actually predicts prompt-depth gains on held-out data; these omissions make it impossible to assess whether the central empirical claim (LLM outperformance on low-prevalence items) is load-bearing or merely suggestive.

Authors: We agree that the abstract and methods would benefit from explicit quantitative support. In the revised manuscript we will add the main F1, precision, and recall results (with error bars) for both LLM and RoBERTa conditions, report the number of narratives and prevalence for each of the 25 circumstances, and include a dedicated validation subsection that tests whether the Complexity Score correlates with observed prompt-depth gains on held-out data. These additions will make the central empirical claim directly verifiable.

revision: yes

Referee: [Methods] The selection criteria for the 25 inferentially complex circumstances and the exact construction of the Complexity Score (including how coding-manual structure is quantified) are not detailed enough to allow replication or to rule out selection bias; this directly affects the claim that the framework generalizes.

Authors: We accept this criticism and will substantially expand the Methods section. The revision will specify the exact selection criteria used to identify the 25 circumstances (including the inferential-complexity thresholds applied to the NVDRS coding manual), provide the full algorithmic definition of the Complexity Score, and detail the quantitative features extracted from manual structure. These changes will enable independent replication and allow readers to evaluate selection bias directly.

revision: yes

Circularity Check

0 steps flagged

No significant circularity identified

The paper is an empirical comparison of LLMs and RoBERTa on held-out NVDRS circumstance extraction tasks. The Complexity Score is constructed by analyzing the external coding manual structure rather than being fitted to evaluation results or derived from model outputs. No equations, self-citations, or ansatzes reduce the central claims (LLM outperformance on low-prevalence cases, hybrid architecture) to inputs by construction. The derivation chain consists of independent prompt strategies, model evaluations, and per-circumstance reporting that remain falsifiable against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Based solely on the abstract, the work introduces a Complexity Score algorithm and a hybrid selection rule but does not specify any fitted numerical parameters, new axioms, or invented entities; all components appear derived from existing coding manuals and standard model training.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · unclear

Paper passage: "We develop a 'Complexity Score' algorithm that analyzes the structure of coding manual examples to predict when detailed prompts are needed versus when simpler name-only prompts suffice."

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

