Comparing LLM and Fine-Tuned Model Performance on NVDRS Circumstance Extraction with Varying Prompt Complexity
Pith reviewed 2026-05-22 07:17 UTC · model grok-4.3
The pith
Large language models outperform fine-tuned models on low-prevalence complex circumstances from suicide death reports.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Large language models substantially outperform fine-tuned RoBERTa on low-prevalence, inferentially complex circumstances in NVDRS narratives. A Complexity Score derived from coding-manual structure predicts when detailed prompts improve over name-only prompts. The same pattern holds across GPT-5.2, Gemini 2.5 Pro, and Llama-3 70B, supporting a hybrid architecture that assigns LLMs to rare circumstances and fine-tuned models to common ones.
What carries the argument
The Complexity Score algorithm, which analyzes coding manual structure to predict when detailed prompts with full coding guidelines improve over name-only prompts for inferentially complex circumstances.
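As an illustration of per-circumstance prompt selection, here is a minimal sketch. This summary does not describe how the score is computed, so the precomputed `complexity_score` field, the 0.5 threshold, and the prompt wording are all hypothetical placeholders rather than the paper's method.

```python
from dataclasses import dataclass


@dataclass
class Circumstance:
    name: str                # circumstance label from the coding manual
    guideline: str           # full coding-manual guidance (placeholder text in this sketch)
    complexity_score: float  # hypothetical precomputed score, not the paper's formula


def build_prompt(circ: Circumstance, narrative: str, threshold: float = 0.5) -> str:
    """Return a detailed prompt when the score meets the threshold, else a name-only prompt."""
    prompt = f"Does the narrative indicate the circumstance '{circ.name}'? Answer yes or no."
    if circ.complexity_score >= threshold:
        # Detailed strategy: append the full coding guideline.
        prompt += f"\n\nCoding guideline:\n{circ.guideline}"
    return f"{prompt}\n\nNarrative:\n{narrative}"
```

The only design point here is that the choice is made per circumstance, so a low-scoring circumstance never pays the token cost of the full guideline.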
If this is right
- LLMs should be used for rare circumstances that lack sufficient training examples.
- Fine-tuned models should continue to handle common circumstances where labeled data is abundant.
- Overall extraction accuracy rises when prompt strategy is chosen per circumstance rather than applied uniformly.
- The hybrid pattern appears consistently across multiple frontier large language models.
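The division of labor in the list above could be sketched as a simple routing rule. The cutoffs below (`MIN_PREVALENCE`, `MIN_POSITIVES`) and the function names are assumptions chosen for illustration; the summary does not report the thresholds, if any, that the paper uses.

```python
from typing import Dict, Tuple

# Hypothetical cutoffs for illustration only; not values from the paper.
MIN_PREVALENCE = 0.05
MIN_POSITIVES = 200


def route_model(prevalence: float, n_train_positives: int) -> str:
    """Send rare circumstances to an LLM and common ones to a fine-tuned model."""
    if prevalence < MIN_PREVALENCE or n_train_positives < MIN_POSITIVES:
        return "llm"
    return "fine_tuned"


def plan_routes(stats: Dict[str, Tuple[float, int]]) -> Dict[str, str]:
    """Map each circumstance name to a model family given (prevalence, positives)."""
    return {name: route_model(p, n) for name, (p, n) in stats.items()}
```

In practice the cutoffs would need to be tuned on validation data, since the point at which a fine-tuned model has "enough" labeled examples is an empirical question.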
Where Pith is reading between the lines
- Health agencies could adopt similar hybrid systems to improve structured data from narrative reports in other public-health domains.
- The same division of labor might reduce annotation costs when building extraction tools for imbalanced medical or social datasets.
- Testing the Complexity Score on new coding manuals or different narrative sources would show how far the prediction rule travels.
Load-bearing premise
The Complexity Score algorithm accurately predicts which circumstances will benefit from detailed prompts rather than name-only prompts.
What would settle it
The claim would fail if, across the 25 circumstances, Complexity Score values showed no correlation with the measured gain from full-guideline prompts over name-only prompts.
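One way to run that check is a rank correlation between scores and observed gains. The sketch below implements Spearman's coefficient with the standard library; the choice of Spearman over another statistic is an assumption, and the inputs would be the 25 per-circumstance score and gain values, which this summary does not provide.

```python
from statistics import mean


def average_ranks(values):
    """Assign 1-based ranks, giving tied values the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # average rank for the tie group order[i..j]
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation; raises if either input has zero variance."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    if vx == 0 or vy == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / (vx * vy) ** 0.5


def spearman(scores, gains):
    """Spearman rank correlation between complexity scores and prompt-depth gains."""
    if len(scores) != len(gains) or len(scores) < 2:
        raise ValueError("need two equal-length sequences with at least two items")
    return pearson(average_ranks(scores), average_ranks(gains))
```

A coefficient near zero on held-out circumstances would count against the premise; a significance test or confidence interval would still be needed before drawing conclusions from only 25 points.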
Original abstract
Suicide is a leading cause of death in the United States, and understanding the circumstances that precede it requires extracting structured information from death investigation narratives. Many of these circumstances require semantic inference beyond simple keyword matching. We develop a "Complexity Score" algorithm that analyzes coding manual structure to predict when detailed prompts with full coding guidelines improve over name-only prompts. We then construct a hybrid approach that selects prompt strategy per circumstance. We evaluate large language models (LLMs) against fine-tuned RoBERTa on 25 inferentially complex circumstances from the National Violent Death Reporting System (NVDRS). We found that LLMs substantially outperform on low-prevalence circumstances where training data is insufficient. We further demonstrate that our framework generalizes across frontier LLMs, with GPT-5.2, Gemini 2.5 Pro and Llama-3 70B showing consistent performance patterns. These findings support a hybrid architecture where LLMs handle rare, inferentially complex circumstances while fine-tuned models handle common ones.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript develops a Complexity Score algorithm that analyzes the structure of the NVDRS coding manual to select between name-only and detailed prompts (including full coding guidelines) for extracting 25 inferentially complex circumstances from death investigation narratives. It then compares frontier LLMs (GPT-5.2, Gemini 2.5 Pro, Llama-3 70B) against a fine-tuned RoBERTa baseline on held-out NVDRS data, claiming that LLMs substantially outperform on low-prevalence circumstances where training data is insufficient, that performance patterns are consistent across models, and that the results support a hybrid architecture in which LLMs handle rare complex cases while fine-tuned models handle common ones.
Significance. If the reported per-circumstance gains hold with proper controls, the work offers a practical, non-circular method for deciding when to deploy LLMs versus fine-tuned models in low-resource information extraction settings. The use of an externally derived Complexity Score (rather than one fitted to the evaluation results) is a methodological strength that supports the generalization claim across LLMs.
major comments (2)
- [Abstract and Methods] The abstract and methods description provide no quantitative metrics (F1, precision, recall, or error bars), no dataset sizes or prevalence statistics for the 25 circumstances, and no explicit validation that the Complexity Score actually predicts prompt-depth gains on held-out data; these omissions make it impossible to assess whether the central empirical claim (LLM outperformance on low-prevalence items) is load-bearing or merely suggestive.
- [Methods] The selection criteria for the 25 inferentially complex circumstances and the exact construction of the Complexity Score (including how coding-manual structure is quantified) are not detailed enough to allow replication or to rule out selection bias; this directly affects the claim that the framework generalizes.
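The kind of reporting the first major comment asks for, an F1 score with error bars per circumstance, could be produced along the following lines. This is a sketch under stated assumptions: it uses a percentile bootstrap, which is one of several possible interval methods and is not necessarily what the authors will adopt, and the function names and defaults are illustrative.

```python
import random


def f1_score(y_true, y_pred):
    """Binary F1 for 0/1 labels; returns 0.0 when there are no positives at all."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def bootstrap_f1_ci(y_true, y_pred, n_boot=1000, alpha=0.05, seed=0):
    """Return (point estimate, lower, upper) using a percentile bootstrap over narratives."""
    n = len(y_true)
    if n == 0 or n != len(y_pred):
        raise ValueError("need two non-empty sequences of equal length")
    rng = random.Random(seed)  # fixed seed so the interval is reproducible
    scores = []
    for _ in range(n_boot):
        idx = [rng.randrange(n) for _ in range(n)]  # resample with replacement
        scores.append(f1_score([y_true[i] for i in idx], [y_pred[i] for i in idx]))
    scores.sort()
    lower = scores[int((alpha / 2) * n_boot)]
    upper = scores[min(n_boot - 1, int((1 - alpha / 2) * n_boot))]
    return f1_score(y_true, y_pred), lower, upper
```

For low-prevalence circumstances many resamples will contain few or no positives, so the resulting intervals can be wide; that width is itself useful evidence about how much weight the per-circumstance comparison can bear.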
minor comments (2)
- [Results] Add a table or figure showing per-circumstance prevalence, baseline RoBERTa performance, and LLM performance with both prompt strategies to make the hybrid-architecture recommendation concrete.
- [Experimental Setup] Clarify whether the held-out NVDRS split was stratified by circumstance prevalence or by narrative length.
Simulated Author's Rebuttal
We thank the referee for their thoughtful and constructive comments on our manuscript. We have carefully considered each point and provide point-by-point responses below. Where appropriate, we have revised the manuscript to address the concerns raised.
- Referee: [Abstract and Methods] The abstract and methods description provide no quantitative metrics (F1, precision, recall, or error bars), no dataset sizes or prevalence statistics for the 25 circumstances, and no explicit validation that the Complexity Score actually predicts prompt-depth gains on held-out data; these omissions make it impossible to assess whether the central empirical claim (LLM outperformance on low-prevalence items) is load-bearing or merely suggestive.
  Authors: We agree that the abstract and methods would benefit from explicit quantitative support. In the revised manuscript we will add the main F1, precision, and recall results (with error bars) for both LLM and RoBERTa conditions, report the number of narratives and prevalence for each of the 25 circumstances, and include a dedicated validation subsection that tests whether the Complexity Score correlates with observed prompt-depth gains on held-out data. These additions will make the central empirical claim directly verifiable.
  Revision: yes
- Referee: [Methods] The selection criteria for the 25 inferentially complex circumstances and the exact construction of the Complexity Score (including how coding-manual structure is quantified) are not detailed enough to allow replication or to rule out selection bias; this directly affects the claim that the framework generalizes.
  Authors: We accept this criticism and will substantially expand the Methods section. The revision will specify the exact selection criteria used to identify the 25 circumstances (including the inferential-complexity thresholds applied to the NVDRS coding manual), provide the full algorithmic definition of the Complexity Score, and detail the quantitative features extracted from manual structure. These changes will enable independent replication and allow readers to evaluate selection bias directly.
  Revision: yes
Circularity Check
No significant circularity identified
The paper is an empirical comparison of LLMs and RoBERTa on held-out NVDRS circumstance extraction tasks. The Complexity Score is constructed by analyzing the external coding manual structure rather than being fitted to evaluation results or derived from model outputs. No equations, self-citations, or ansatzes reduce the central claims (LLM outperformance on low-prevalence cases, hybrid architecture) to inputs by construction. The derivation chain consists of independent prompt strategies, model evaluations, and per-circumstance reporting that remain falsifiable against external benchmarks.
Lean theorems connected to this paper
- IndisputableMonolith/Cost/FunctionalEquation.lean, washburn_uniqueness_aczel (tag: unclear)
  Relation between the paper passage and the cited Recognition theorem: unclear.
  Paper passage: "We develop a 'Complexity Score' algorithm that analyzes the structure of coding manual examples to predict when detailed prompts are needed versus when simpler name-only prompts suffice"
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.