An Agentic Workflow for Detecting Personally Identifiable Information in Crash Narratives
Pith reviewed 2026-05-10 13:36 UTC · model grok-4.3
The pith
A locally deployable workflow using fine-tuned language models detects personally identifiable information in crash narratives with 0.94 recall.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The central claim is that the agentic workflow delivers precision 0.82, recall 0.94, F1 0.87, and accuracy 0.96 on a real-world crash dataset, outperforming multiple baseline methods. The workflow consists of a Hybrid Extractor that routes structured PII to rule-based processing and context-dependent PII to a domain-adapted, fine-tuned LLM, together with ensemble extraction and an agentic Verifier for evidence-based filtering of false positives. Ablations confirm the value of the ensemble and verifier steps, especially for home addresses and alphanumeric identifiers.
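As a quick arithmetic check (not from the paper), the reported F1 is consistent with the harmonic mean of the reported precision and recall, up to rounding of the inputs:

```python
# Consistency check on the headline metrics: F1 is the harmonic mean
# of precision and recall. Using the reported (rounded) values:
precision, recall = 0.82, 0.94
f1 = 2 * precision * recall / (precision + recall)
print(f"F1 from reported P and R: {f1:.3f}")  # ~0.876, matching the reported 0.87 up to input rounding
```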
What carries the argument
The Hybrid Extractor combined with an agentic Verifier, where the extractor routes between rule-based and LLM-based detection and the verifier applies reasoning to filter ambiguous cases.
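A minimal sketch of the routing idea, with hypothetical names throughout: the paper uses Presidio for the rule-based path and a fine-tuned LLM for the contextual path, neither of which is reproduced here, so a regex stand-in and a stubbed LLM callable illustrate the split:

```python
import re
from typing import Callable

# Structured PII has stable surface patterns, so a rule-based pass suffices.
# These regexes are illustrative, not the paper's actual Presidio recognizers.
STRUCTURED_PATTERNS = {
    "PHONE": re.compile(r"\b\d{3}[-.\s]\d{3}[-.\s]\d{4}\b"),
    "EMAIL": re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.]+\b"),
}

def rule_based_extract(text: str) -> list[tuple[str, str]]:
    """Stand-in for the Presidio path: regex matches for structured PII."""
    hits = []
    for label, pattern in STRUCTURED_PATTERNS.items():
        hits += [(label, m.group()) for m in pattern.finditer(text)]
    return hits

def hybrid_extract(text: str,
                   llm_extract: Callable[[str], list[tuple[str, str]]]) -> list[tuple[str, str]]:
    """Route: structured PII -> rules; context-dependent PII -> LLM.

    `llm_extract` is a placeholder for the paper's fine-tuned model,
    assumed to return (label, span) pairs for names, addresses, etc.
    """
    return rule_based_extract(text) + llm_extract(text)

# Toy usage with a stubbed LLM:
narrative = "Driver John Doe (555-123-4567) fled the scene."
stub_llm = lambda t: [("NAME", "John Doe")] if "John Doe" in t else []
print(hybrid_extract(narrative, stub_llm))
```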
If this is right
- Crash data can be processed at scale for traffic safety research while keeping personal details protected.
- The system operates without external APIs, fitting environments with strict privacy or data residency requirements.
- Detection of challenging categories such as home addresses and alphanumeric identifiers improves through the use of ensemble LLM extraction and verification.
- Broader adoption would support privacy-preserving handling of narrative-based safety records.
Where Pith is reading between the lines
- Similar agentic designs could apply to PII removal in other narrative domains like medical records or insurance claims.
- The workflow's reliance on fine-tuning suggests that periodic retraining on new data would be needed to maintain performance as language use evolves.
- Integration with larger data pipelines might allow automated redaction before storage or sharing of crash databases.
Load-bearing premise
The real-world crash dataset used for testing captures the range of language, formats, and PII patterns that appear in crash narratives from other places and times, and the fine-tuned model continues to perform well on new narratives without additional training.
What would settle it
An evaluation of the workflow on crash narratives collected from a different geographic region or time period: a substantial decline in recall or precision relative to the reported results would undermine the generalizability premise, while comparable performance would support it.
Figures
Original abstract
Crash narratives in crash reports provide crucial contextual information for traffic safety analysis. Yet, their broader use is hindered by the presence of personally identifiable information (PII), including names, home addresses, and license plate numbers. Because PII appears sparsely and inconsistently in crash narratives, manual detection is not scalable, and existing rule-based approaches often fail to capture context-dependent PII. This study develops and evaluates a locally deployable, agentic workflow for PII detection in crash narratives by leveraging large language models (LLMs). The workflow contains a Hybrid Extractor and a Verifier. The Hybrid Extractor routes structured PII (e.g., phone numbers and email addresses) to a rule-based model (i.e., Presidio) and context-dependent PII (e.g., names, home addresses, and alphanumeric identifiers) to a domain-adapted, fine-tuned LLM. To address ambiguity in challenging categories, the workflow incorporates ensemble LLM extraction and an agentic verification step that filters false detections through evidence-based reasoning. Evaluated on a real-world crash dataset, the agentic workflow achieves strong performance with a precision of 0.82, a recall of 0.94, an F1 of 0.87, and an accuracy of 0.96, outperforming multiple baseline methods. Moreover, the ablation results suggest that ensemble LLM extraction and Verifier offer improved detection for home addresses and alphanumeric identifiers. The workflow runs locally, supporting privacy-sensitive operational settings where external APIs are restricted. This work offers a practical and robust path for scalable, privacy-preserving crash data processing, enabling broader research and safety interventions while safeguarding individual privacy.
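The ensemble-plus-verifier step described in the abstract can be sketched as majority voting over repeated LLM extraction runs, followed by an evidence check. All names here are illustrative, and the substring test is a crude stand-in: the paper's actual Verifier applies evidence-based LLM reasoning, not string matching:

```python
from collections import Counter

def ensemble_extract(text, extractors):
    """Keep candidate (label, span) pairs proposed by a majority of runs.

    `extractors` is a list of callables standing in for repeated runs of
    the fine-tuned LLM (hypothetical interface, not the paper's code).
    """
    votes = Counter()
    for extract in extractors:
        votes.update(set(extract(text)))
    threshold = len(extractors) / 2
    return [cand for cand, n in votes.items() if n > threshold]

def verify(text, candidates):
    """Crude stand-in for the agentic Verifier: keep a candidate only if
    its surface form is actually evidenced in the narrative."""
    return [(label, span) for label, span in candidates if span in text]

# Toy usage: three stubbed extraction runs with one spurious detection.
narrative = "Vehicle registered to J. Smith of 12 Oak St was towed."
runs = [
    lambda t: [("NAME", "J. Smith"), ("ADDRESS", "12 Oak St")],
    lambda t: [("NAME", "J. Smith")],
    lambda t: [("NAME", "J. Smith"), ("ADDRESS", "12 Oak St"), ("NAME", "Oak")],
]
kept = verify(narrative, ensemble_extract(narrative, runs))
print(kept)  # "J. Smith" survives 3/3 votes, "12 Oak St" 2/3; "Oak" is dropped at 1/3
```

The design point the ablations speak to is that voting suppresses run-to-run noise while the verifier removes systematic false positives, which is why the two components help most on ambiguous categories like addresses and alphanumeric identifiers.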
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper proposes an agentic workflow for detecting PII in crash narratives, using a Hybrid Extractor that routes structured PII (e.g., phone numbers) to Presidio and context-dependent PII (e.g., names, addresses, alphanumeric IDs) to a fine-tuned LLM, augmented by ensemble extraction and an agentic verifier step for false-positive filtering. It reports strong empirical results on a real-world crash dataset (precision 0.82, recall 0.94, F1 0.87, accuracy 0.96) with outperformance over baselines, notes ablation gains for challenging categories, and emphasizes local deployability for privacy-sensitive settings.
Significance. If the evaluation details and generalizability hold, the work provides a practical, locally executable solution for scalable PII detection in traffic safety data, addressing a real barrier to research use of crash narratives. The hybrid design and agentic verification are sensible responses to the sparsity and context-dependence of PII; the emphasis on local execution is a clear strength for operational deployment where API access is restricted.
major comments (3)
- [Abstract] The headline performance claims (precision 0.82, recall 0.94, F1 0.87, accuracy 0.96, outperforming baselines) are presented without any reported dataset cardinality, train/test split details, annotation protocol, inter-annotator agreement, or description of the baseline methods. These omissions are load-bearing because the central claim is an empirical performance number whose validity cannot be assessed without them.
- [Abstract] The ablation statement that ensemble LLM extraction and the Verifier improve detection for home addresses and alphanumeric identifiers is given without per-category metrics, error analysis, or statistical significance tests on the held-out data. This prevents determination of whether these components are genuinely load-bearing or dataset-specific.
- [Abstract] The LLM is fine-tuned on context-dependent PII drawn from the same real-world crash corpus used for evaluation, yet no information is supplied on temporal or jurisdictional stratification, explicit leakage controls, or how the fine-tuning/test partitions were constructed. This directly threatens the external-validity assumption underlying the reported metrics.
minor comments (2)
- The abstract refers to 'multiple baseline methods' without naming them; adding the names and a one-sentence characterization would improve immediate readability.
- A dedicated limitations or future-work paragraph discussing generalizability across jurisdictions and time periods would help readers contextualize the single-dataset evaluation.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback on the abstract. We agree that additional details are needed to support the empirical claims and will revise the abstract accordingly while ensuring the full manuscript provides supporting information in the Methods and Results sections. We address each major comment point by point below.
Point-by-point responses
Referee: [Abstract] The headline performance claims (precision 0.82, recall 0.94, F1 0.87, accuracy 0.96, outperforming baselines) are presented without any reported dataset cardinality, train/test split details, annotation protocol, inter-annotator agreement, or description of the baseline methods. These omissions are load-bearing because the central claim is an empirical performance number whose validity cannot be assessed without them.
Authors: We agree that the abstract should be more self-contained for the headline metrics. The full manuscript provides these details in Sections 3 (Data Collection and Annotation) and 4 (Experimental Setup), including dataset cardinality, the train/test split, the annotation protocol, inter-annotator agreement, and baseline method descriptions. In the revision, we will add condensed summaries of dataset cardinality, split details, annotation protocol, IAA, and baselines directly to the abstract. revision: yes
Referee: [Abstract] The ablation statement that ensemble LLM extraction and the Verifier improve detection for home addresses and alphanumeric identifiers is given without per-category metrics, error analysis, or statistical significance tests on the held-out data. This prevents determination of whether these components are genuinely load-bearing or dataset-specific.
Authors: We acknowledge that per-category metrics, error analysis, and significance tests would better substantiate the ablation claims. The manuscript reports overall ablation gains in Section 5, but we will revise the abstract to reference per-category improvements for home addresses and alphanumeric identifiers, and expand the manuscript with a dedicated error analysis and statistical significance tests on the held-out data to clarify the contributions of the ensemble and verifier. revision: yes
Referee: [Abstract] The LLM is fine-tuned on context-dependent PII drawn from the same real-world crash corpus used for evaluation, yet no information is supplied on temporal or jurisdictional stratification, explicit leakage controls, or how the fine-tuning/test partitions were constructed. This directly threatens the external-validity assumption underlying the reported metrics.
Authors: This is a fair point regarding external validity. The manuscript describes the held-out evaluation set in Section 3, but to directly address leakage concerns we will revise the abstract and Methods section to explicitly detail the fine-tuning/test partition construction, including any temporal or jurisdictional stratification and leakage controls used. revision: yes
Circularity Check
No circularity: empirical metrics on held-out data are independent of method definition
full rationale
The paper describes an agentic workflow (Hybrid Extractor routing to Presidio or fine-tuned LLM, plus ensemble and Verifier steps) and reports standard empirical performance numbers (P=0.82, R=0.94, F1=0.87, Acc=0.96) on a real-world crash dataset, along with ablation observations. No derivation chain, equation, or first-principles result is claimed; the central claim is an evaluation outcome on held-out data rather than a quantity defined in terms of parameters fitted to the same data. No self-citation load-bearing steps, ansatzes, or renamings of known results appear in the provided text. The evaluation is therefore self-contained against external benchmarks.
Axiom & Free-Parameter Ledger
free parameters (1)
- LLM fine-tuning hyperparameters and training data composition
axioms (2)
- domain assumption: A fine-tuned LLM can reliably extract context-dependent PII when given crash-narrative text
- domain assumption: The agentic verifier can distinguish true from false detections using evidence-based reasoning without access to external knowledge