pith. machine review for the scientific record.

arxiv: 2604.15369 · v1 · submitted 2026-04-15 · 💻 cs.CR

Recognition: unknown

An Agentic Workflow for Detecting Personally Identifiable Information in Crash Narratives

Authors on Pith: no claims yet

Pith reviewed 2026-05-10 13:36 UTC · model grok-4.3

classification 💻 cs.CR
keywords personally identifiable information · crash narratives · agentic workflow · large language models · privacy protection · traffic safety analysis · information extraction · PII detection

The pith

A locally deployable workflow using fine-tuned language models detects personally identifiable information in crash narratives with 0.94 recall.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Crash reports contain valuable details for improving road safety, yet personal names, addresses, and other identifiers scattered through the narratives prevent their wider use. The paper builds an agentic workflow that sends obvious structured items like phone numbers to a rule-based extractor and context-heavy items like names and addresses to an ensemble of fine-tuned large language models, then applies a verification step to confirm detections through reasoning. This local system reaches a recall of 0.94, precision of 0.82, and accuracy of 0.96 on real crash data while beating several baseline approaches. A sympathetic reader would care because the method opens the door to large-scale analysis of crash narratives for safety research without compromising individual privacy or requiring external services.
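The routing idea described above can be sketched in a few lines. This is a minimal illustration, not the paper's implementation: the regexes below are hypothetical stand-ins for Presidio's rule-based recognizers, and the contextual extractor is a placeholder for the fine-tuned LLM ensemble.

```python
import re

# Hypothetical regexes standing in for Presidio's rule-based recognizers.
STRUCTURED_PATTERNS = {
    "PHONE": re.compile(r"\b\d{3}[-.\s]?\d{3}[-.\s]?\d{4}\b"),
    "EMAIL": re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.]+\b"),
}

def extract_structured(narrative: str) -> list:
    """Rule-based pass: high-precision patterns for structured PII."""
    hits = []
    for label, pattern in STRUCTURED_PATTERNS.items():
        hits += [(label, m.group()) for m in pattern.finditer(narrative)]
    return hits

def extract_contextual(narrative: str) -> list:
    """Stand-in for the fine-tuned LLM ensemble that tags names, home
    addresses, and alphanumeric identifiers from semantic context."""
    return []  # an LLM call would populate this

def hybrid_extract(narrative: str) -> list:
    # Structured PII goes to rules; context-dependent PII goes to the LLM.
    return extract_structured(narrative) + extract_contextual(narrative)

candidates = hybrid_extract("Driver called from 608-555-0199 after the crash.")
print(candidates)  # [('PHONE', '608-555-0199')]
```

The division of labor mirrors the paper's design principle: patterns that are consistent across narratives are cheap and reliable to match, while names and addresses need semantic cues that only the LLM pass provides.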

Core claim

The central claim is that the agentic workflow delivers precision 0.82, recall 0.94, F1 0.87, and accuracy 0.96 on a real-world crash dataset and outperforms multiple baseline methods. The workflow consists of a Hybrid Extractor, which routes structured PII to rule-based processing and context-dependent PII to a domain-adapted fine-tuned LLM, together with ensemble extraction and an agentic Verifier that filters false positives through evidence-based reasoning. Ablations confirm the value of the ensemble and verifier steps, especially for home addresses and alphanumeric identifiers.
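As a quick consistency check, the F1 implied by the rounded headline precision and recall lands near the reported 0.87 (exact agreement is not expected, since the paper's F1 would be computed from unrounded values):

```python
# Harmonic mean of the reported (rounded) precision and recall.
precision, recall = 0.82, 0.94
f1 = 2 * precision * recall / (precision + recall)
print(f"{f1:.3f}")  # 0.876
```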

What carries the argument

The Hybrid Extractor combined with an agentic Verifier, where the extractor routes between rule-based and LLM-based detection and the verifier applies reasoning to filter ambiguous cases.

If this is right

  • Crash data can be processed at scale for traffic safety research while keeping personal details protected.
  • The system operates without external APIs, fitting environments with strict privacy or data residency requirements.
  • Detection of challenging categories such as home addresses and alphanumeric identifiers improves through the use of ensemble LLM extraction and verification.
  • Broader adoption would support privacy-preserving handling of narrative-based safety records.
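The ensemble step behind that third point can be sketched as simple span voting. The merging rule below (keep spans proposed by at least two members) is hypothetical; the paper does not specify its exact aggregation scheme.

```python
from collections import Counter

def ensemble_merge(runs, min_votes=2):
    """Union the spans proposed by several LLM extraction runs and keep
    those proposed by at least `min_votes` members (hypothetical rule)."""
    votes = Counter(span for run in runs for span in set(run))
    return sorted(span for span, n in votes.items() if n >= min_votes)

# Three hypothetical ensemble members disagree on a borderline address.
runs = [
    ["JOHN DOE", "4647 HIGHWAY 47"],
    ["JOHN DOE"],
    ["JOHN DOE", "4647 HIGHWAY 47"],
]
print(ensemble_merge(runs))  # ['4647 HIGHWAY 47', 'JOHN DOE']
```

Voting raises recall on ambiguous categories by keeping any span with sufficient support; the verifier then bears the burden of rejecting the survivors that are not actually PII.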

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Similar agentic designs could apply to PII removal in other narrative domains like medical records or insurance claims.
  • The workflow's reliance on fine-tuning suggests that periodic retraining on new data would be needed to maintain performance as language use evolves.
  • Integration with larger data pipelines might allow automated redaction before storage or sharing of crash databases.

Load-bearing premise

The real-world crash dataset used for testing captures the range of language, formats, and PII patterns that appear in crash narratives from other places and times, and the fine-tuned model continues to perform well on new narratives without additional training.

What would settle it

An evaluation of the workflow on crash narratives collected from a different geographic region or time period: a substantial decline in recall or precision relative to the reported results would undercut the generalizability premise, while comparable performance would support it.

Figures

Figures reproduced from arXiv: 2604.15369 by Bin Ran, Junyi Ma, Kai Cheng, Pei Li, Rui Gan, Steven T. Parker.

Figure 1. PII detection with the proposed agentic workflow. The caption enumerates the PII categories: Name (full names or initials that can identify individuals involved in the crash), Phone number (personal or work-related phone numbers included in the narrative), Email address (any standard email format identifying a specific individual or organization), and Home address (residential addresses, including street names, house numbers, …).
Figure 2. Prompt used for fine-tuning the LLM. The prompt instructs the model to identify candidate PII spans and annotate them with category-specific delimiter markers before downstream verification.
Figure 3. Training curve of the fine-tuned model. The Llama 3.1-8B model was fine-tuned for 1 epoch on 2,000 manually annotated crash narratives randomly sampled from the dataset, which provide the domain-relevant supervision needed for effective fine-tuning (Gan et al., 2026).
Figure 4. Structured verifier system prompt used for candidate reviewing. This stage creates an auditable trail (decision and evidence) that supports error analysis and governance.
Figure 5. Example of verifier-generated review for a candidate home address. The verifier rejects the string "4647 HIGHWAY 47" because the narrative evidence indicates that it refers to the crash location rather than a person's true residential or mailing address.
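The auditable decision-and-evidence trail described in Figure 4 can be represented as a plain decision record. The field names below are hypothetical; the illustrative values follow the Figure 5 example.

```python
from dataclasses import dataclass

@dataclass
class VerifierDecision:
    candidate: str   # span proposed by the extractor
    category: str    # claimed PII category
    decision: str    # "accept" or "reject"
    evidence: str    # narrative evidence cited by the verifier

# Mirrors the Figure 5 review: a road reference mistaken for a home address.
review = VerifierDecision(
    candidate="4647 HIGHWAY 47",
    category="HOME_ADDRESS",
    decision="reject",
    evidence="Narrative indicates this is the crash location, "
             "not a residential or mailing address.",
)
print(review.decision)  # reject
```

Keeping the evidence alongside the decision is what makes downstream error analysis and governance possible: rejected candidates can be re-audited without rerunning the pipeline.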
read the original abstract

Crash narratives in crash reports provide crucial contextual information for traffic safety analysis. Yet, their broader use is hindered by the presence of personally identifiable information (PII), including names, home addresses, and license plate numbers. Because PII appears sparsely and inconsistently in crash narratives, manual detection is not scalable, and existing rule-based approaches often fail to capture context-dependent PII. This study develops and evaluates a locally deployable, agentic workflow for PII detection in crash narratives by leveraging large language models (LLMs). The workflow contains a Hybrid Extractor and a Verifier. The Hybrid Extractor routes structured PII (e.g., phone numbers and email addresses) to a rule-based model (i.e., Presidio) and context-dependent PII (e.g., names, home addresses, and alphanumeric identifiers) to a domain-adapted, fine-tuned LLM. To address ambiguity in challenging categories, the workflow incorporates ensemble LLM extraction and an agentic verification step that filters false detections through evidence-based reasoning. Evaluated on a real-world crash dataset, the agentic workflow achieves strong performance with a precision of 0.82, a recall of 0.94, an F1 of 0.87, and an accuracy of 0.96, outperforming multiple baseline methods. Moreover, the ablation results suggest that ensemble LLM extraction and Verifier offer improved detection for home addresses and alphanumeric identifiers. The workflow runs locally, supporting privacy-sensitive operational settings where external APIs are restricted. This work offers a practical and robust path for scalable, privacy-preserving crash data processing, enabling broader research and safety interventions while safeguarding individual privacy.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The paper proposes an agentic workflow for detecting PII in crash narratives, using a Hybrid Extractor that routes structured PII (e.g., phone numbers) to Presidio and context-dependent PII (e.g., names, addresses, alphanumeric IDs) to a fine-tuned LLM, augmented by ensemble extraction and an agentic verifier step for false-positive filtering. It reports strong empirical results on a real-world crash dataset (precision 0.82, recall 0.94, F1 0.87, accuracy 0.96) with outperformance over baselines, notes ablation gains for challenging categories, and emphasizes local deployability for privacy-sensitive settings.

Significance. If the evaluation details and generalizability hold, the work provides a practical, locally executable solution for scalable PII detection in traffic safety data, addressing a real barrier to research use of crash narratives. The hybrid design and agentic verification are sensible responses to the sparsity and context-dependence of PII; the emphasis on local execution is a clear strength for operational deployment where API access is restricted.

major comments (3)
  1. [Abstract] Abstract: The headline performance claims (precision 0.82, recall 0.94, F1 0.87, accuracy 0.96, outperforming baselines) are presented without any reported dataset cardinality, train/test split details, annotation protocol, inter-annotator agreement, or description of the baseline methods. These omissions are load-bearing because the central claim is an empirical performance number whose validity cannot be assessed without them.
  2. [Abstract] Abstract: The ablation statement that ensemble LLM extraction and the Verifier improve detection for home addresses and alphanumeric identifiers is given without per-category metrics, error analysis, or statistical significance tests on the held-out data. This prevents determination of whether these components are genuinely load-bearing or dataset-specific.
  3. [Abstract] Abstract: The LLM is fine-tuned on context-dependent PII drawn from the same real-world crash corpus used for evaluation, yet no information is supplied on temporal or jurisdictional stratification, explicit leakage controls, or how the fine-tuning/test partitions were constructed. This directly threatens the external-validity assumption underlying the reported metrics.
minor comments (2)
  1. The abstract refers to 'multiple baseline methods' without naming them; adding the names and a one-sentence characterization would improve immediate readability.
  2. A dedicated limitations or future-work paragraph discussing generalizability across jurisdictions and time periods would help readers contextualize the single-dataset evaluation.
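The leakage concern in major comment 3 amounts to asking for a grouped split: if fine-tuning and test narratives are partitioned by some unit such as jurisdiction or year, no unit contributes to both sides. A minimal pure-Python sketch of that control, with a hypothetical group key:

```python
import random

def grouped_split(records, key, test_frac=0.2, seed=0):
    """Split records so no group (e.g., a jurisdiction or year) appears
    in both the fine-tuning and test partitions, a simple leakage control."""
    groups = sorted({key(r) for r in records})
    rng = random.Random(seed)
    rng.shuffle(groups)
    n_test = max(1, int(len(groups) * test_frac))
    held_out = set(groups[:n_test])
    train = [r for r in records if key(r) not in held_out]
    test = [r for r in records if key(r) in held_out]
    return train, test

# Hypothetical narratives keyed by county.
records = [{"county": c} for c in ["Dane", "Dane", "Brown", "Rock", "Rock"]]
train, test = grouped_split(records, key=lambda r: r["county"])
# No county contributes narratives to both partitions.
```

Reporting such a partition scheme (or the absence of one) would let readers judge how much of the 0.94 recall reflects genuine generalization rather than memorized local patterns.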

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive feedback on the abstract. We agree that additional details are needed to support the empirical claims and will revise the abstract accordingly while ensuring the full manuscript provides supporting information in the Methods and Results sections. We address each major comment point by point below.

read point-by-point responses
  1. Referee: [Abstract] Abstract: The headline performance claims (precision 0.82, recall 0.94, F1 0.87, accuracy 0.96, outperforming baselines) are presented without any reported dataset cardinality, train/test split details, annotation protocol, inter-annotator agreement, or description of the baseline methods. These omissions are load-bearing because the central claim is an empirical performance number whose validity cannot be assessed without them.

    Authors: We agree that the abstract should be more self-contained for the headline metrics. The full manuscript provides these details in Sections 3 (Data Collection and Annotation) and 4 (Experimental Setup). In the revision, we will add condensed summaries of the dataset cardinality, train/test split, annotation protocol, inter-annotator agreement, and baseline methods directly to the abstract. revision: yes

  2. Referee: [Abstract] Abstract: The ablation statement that ensemble LLM extraction and the Verifier improve detection for home addresses and alphanumeric identifiers is given without per-category metrics, error analysis, or statistical significance tests on the held-out data. This prevents determination of whether these components are genuinely load-bearing or dataset-specific.

    Authors: We acknowledge that per-category metrics, error analysis, and significance tests would better substantiate the ablation claims. The manuscript reports overall ablation gains in Section 5, but we will revise the abstract to reference per-category improvements for home addresses and alphanumeric identifiers and expand the manuscript with a dedicated error analysis and statistical significance tests (e.g., on held-out data) to clarify the contributions of the ensemble and verifier. revision: yes

  3. Referee: [Abstract] Abstract: The LLM is fine-tuned on context-dependent PII drawn from the same real-world crash corpus used for evaluation, yet no information is supplied on temporal or jurisdictional stratification, explicit leakage controls, or how the fine-tuning/test partitions were constructed. This directly threatens the external-validity assumption underlying the reported metrics.

    Authors: This is a fair point regarding external validity. The manuscript describes the held-out evaluation set in Section 3, but to directly address leakage concerns we will revise the abstract and Methods section to explicitly detail the fine-tuning/test partition construction, including any temporal or jurisdictional stratification and leakage controls used. revision: yes

Circularity Check

0 steps flagged

No circularity: empirical metrics on held-out data are independent of method definition

full rationale

The paper describes an agentic workflow (Hybrid Extractor routing to Presidio or fine-tuned LLM, plus ensemble and Verifier steps) and reports standard empirical performance numbers (P=0.82, R=0.94, F1=0.87, Acc=0.96) on a real-world crash dataset, along with ablation observations. No derivation chain, equation, or first-principles result is claimed; the central claim is an evaluation outcome on held-out data rather than a quantity defined in terms of parameters fitted to the same data. No self-citation load-bearing steps, ansatzes, or renamings of known results appear in the provided text. The evaluation is therefore self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

1 free parameters · 2 axioms · 0 invented entities

The central claim rests on the empirical effectiveness of a hybrid LLM-plus-rules pipeline on one dataset; no new physical entities or mathematical axioms are introduced. The main unstated premises are that fine-tuning on crash narratives produces reliable context-dependent extraction and that the verification step reliably removes false positives without discarding true ones.

free parameters (1)
  • LLM fine-tuning hyperparameters and training data composition
    The domain-adapted LLM requires fine-tuning whose specific learning-rate, epochs, and data-selection choices are not reported in the abstract yet directly affect the reported precision and recall.
axioms (2)
  • domain assumption A fine-tuned LLM can reliably extract context-dependent PII when given crash-narrative text
    Invoked by the Hybrid Extractor design and the claim that it outperforms rule-based baselines on names, addresses, and alphanumeric identifiers.
  • domain assumption The agentic verifier can distinguish true from false detections using evidence-based reasoning without access to external knowledge
    Required for the verification step to improve precision without lowering recall.

pith-pipeline@v0.9.0 · 5607 in / 1601 out tokens · 68457 ms · 2026-05-10T13:36:21.365255+00:00 · methodology

discussion (0)


Reference graph

Works this paper leans on

20 extracted references · 15 canonical work pages · 4 internal anchors

  1. Advanced crash causation analysis for freeway safety: A large language model approach to identifying key contributing factors. arXiv preprint arXiv:2505.09949.
  2. Bosch, N., Crues, R., Shaik, N., Paquette, L., 2020. "hello,[redacted]": Protecting student privacy in analyses of online discussion forums. Grantee Submission.
  3. Buchh, I.A., 2024. Enhancing PII detection in student essays: A Longformer-based approach with synthetic data augmentation. In: 2024 IEEE Asia Pacific Conference on Wireless and Mobile (APWiMob), Bali, Indonesia, pp. 143–149. doi:10.1109/APWiMob64015.2024.10792959.
  4. Fagbohun, O., Harrison, R.M., Dereventsov, A., An empirical categorization of prompting techniques for large language models: A practitioner's guide. arXiv preprint arXiv:2402.14837.
  5. Fan, Z., Wang, P., Zhao, Y., Zhao, Y., Ivanovic, B., Wang, Z., Pavone, M., Yang, H.F., Learning traffic crashes as language: Datasets, benchmarks, and what-if causal analyses. arXiv preprint arXiv:2406.10789.
  6. Federal Highway Administration. Highway Safety Manual (HSM). https://highways.dot.gov/safety/data-analysis-tools/highway-safety-manual. Accessed 2025-07-27.
  7. Gan, R., Ma, J., Li, P., Yang, X., Chen, K., Chen, S., Ran, B., CrashSight: A phase-aware, infrastructure-centric video benchmark for traffic crash scene understanding and reasoning. arXiv:2604.08457.
  8. Grattafiori, A., Dubey, A., Jauhri, A., Pandey, A., Kadian, A., Al-Dahle, A., Letman, A., Mathur, A., Schelten, A., Vaughan, A., et al., The Llama 3 herd of models. arXiv preprint arXiv:2407.21783.
  9. Hu, E.J., Shen, Y., Wallis, P., Allen-Zhu, Z., Li, Y., Wang, S., Wang, L., Chen, W., et al., 2022. LoRA: Low-rank adaptation of large language models. ICLR.
  10. Advancing investigation of automated vehicle crashes using text analytics of crash narratives and Bayesian analysis. Accident Analysis & Prevention 181, 106932.
  11. Li, P., Chen, S., Yue, L., Xu, Y., Noyce, D.A., 2024. Analyzing relationships between latent topics in autonomous vehicle crash narratives and crash severity using natural language processing techniques and explainable …
  12. Automated PII extraction from social media for raising privacy awareness: A deep transfer learning approach. In: 2021 IEEE International Conference on Intelligence and Security Informatics (ISI), San Antonio, TX, USA, pp. 1–6. doi:10.1109/ISI53945.2021.9624678.
  13. Presidio: Data Protection and Anonymization SDK. https://github.com/microsoft/presidio. Accessed 2025-07-27.
  14. Mumtarin, M., Chowdhury, M.S., Wood, J., Large language models in analyzing crash narratives – A comparative study of ChatGPT, BARD and GPT-4. arXiv:2308.13563. doi:10.48550/arXiv.2308.13563.
  15. Murugadoss, K., Rajasekharan, A., Malin, B., Agarwal, V., Bade, S., Anderson, J.R., Ross, J.L., Faubion, W.A., Halamka, J.D., Soundararajan, V., Ardhanari, S., Building a best-in-class automated de-identification tool for electronic health records through ensemble learning. Patterns 2, 100255. doi:10.1016/j.patter.2021.100255.
  16. National Highway Traffic Safety Administration. Standing general order on crash reporting for automated driving systems. https://www.nhtsa.gov/laws-regulations/standing-general-order-crash-reporting. Accessed 2025-07-27.
  17. Pilán, I., Lison, P., Øvrelid, L., Papadopoulou, A., Sánchez, D., Batet, M., …
  18. Introduction to the CoNLL-2003 shared task: Language-independent named entity recognition. arXiv preprint cs/0306050.
  19. Shen, Y., Ji, Z., Lin, J., Koedinger, K.R., Enhancing the de-identification of personally identifiable information in educational data. arXiv:2501.09765. doi:10.48550/arXiv.2501.09765.
  20. Shinn, N., Cassano, F., Gopinath, A., Narasimhan, K., Yao, S., …
  21. De-identifying student personally identifying information with GPT-4. International Educational Data Mining Society. doi:10.5281/zenodo.12729884.
  22. Stubbs, A., Kotfila, C., Uzuner, Ö., Automated systems for the de-identification of longitudinal clinical narratives: Overview of 2014 i2b2/UTHealth shared task track.
  23. Gemma: Open models based on Gemini research and technology. arXiv preprint arXiv:2403.08295.
  24. Tian, J., Wang, L., Fard, P., Junior, V.M., Blacker, D., Haas, J.S., Patel, C., Murphy, S.N., Moura, L.M., Estiri, H., 2025. An agentic AI workflow for detecting cognitive concerns in real-world data. arXiv preprint arXiv:2502.01789.
  25. Uzuner, Ö., Sibanda, T.C., Luo, Y., Szolovits, P., …
  26. GPT-NER: Named entity recognition via large language models. arXiv preprint arXiv:2304.10428.
  27. Wisconsin Legislature. Wisconsin Statutes § 19.62(5) – Definitions; personal information. https://docs.legis.wisconsin.gov/statutes/statutes/19/iv/62/5. Accessed 2025-07-27.
  28. Yao, S., Zhao, J., Yu, D., Du, N., Shafran, I., Narasimhan, K.R., Cao, Y., …