GLiNER2-PII: A Multilingual Model for Personally Identifiable Information Extraction

2026 · cs.CL · arXiv 2605.09973

1 Pith paper cites this work. Polarity classification is still indexing.



abstract

Reliable detection of personally identifiable information (PII) is increasingly important across modern data-processing systems, yet the task remains difficult: PII spans are heterogeneous, locale-dependent, context-sensitive, and often embedded in noisy or semi-structured documents. We present GLiNER2-PII, a small 0.3B-parameter model adapted from GLiNER2 and designed to recognize a broad taxonomy of 42 PII entity types at character-span resolution. Training such systems, however, is constrained by the scarcity of shareable annotated data and the privacy risks associated with collecting real PII at scale. To address this challenge, we construct a multilingual synthetic corpus of 4,910 annotated texts using a constraint-driven generation pipeline that produces diverse, realistic examples across languages, domains, formats, and entity distributions. On the challenging SPY benchmark, GLiNER2-PII achieves the highest span-level F1 among five compared systems, including OpenAI Privacy Filter and three GLiNER-based detectors. We publicly release the model on Hugging Face to support further research and practical deployment of open PII detection systems.
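The abstract reports span-level F1 over character spans. As a minimal sketch of how such a score can be computed, the snippet below assumes exact matching on (start, end, label) triples. The abstract does not state the matching criterion used for the SPY benchmark, so this is an illustration only; the function name `span_f1`, the label names, and the example text are hypothetical.

```python
from typing import Dict, Iterable, Tuple

# A span is (start_char, end_char_exclusive, label). Labels here are illustrative.
Span = Tuple[int, int, str]


def span_f1(gold: Iterable[Span], pred: Iterable[Span]) -> Dict[str, float]:
    """Exact-match span-level precision, recall, and F1.

    Assumption: a prediction counts as correct only if its start offset,
    end offset, and label all match a gold span exactly.
    """
    gold_set, pred_set = set(gold), set(pred)
    true_positives = len(gold_set & pred_set)
    precision = true_positives / len(pred_set) if pred_set else 0.0
    recall = true_positives / len(gold_set) if gold_set else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return {"precision": precision, "recall": recall, "f1": f1}


text = "Contact Jane Doe at jane@example.com"
gold = [(8, 16, "person"), (20, 36, "email")]
# The second prediction drops the last character, so it is a boundary error.
pred = [(8, 16, "person"), (20, 35, "email")]

scores = span_f1(gold, pred)
print(scores)
```

Under exact matching, the truncated email span counts as both a false positive and a false negative, which is why character-level boundary accuracy matters for this kind of metric.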

representative citing papers

REDACT: A Systematically Controlled Multilingual Benchmark for Personal Information Detection

cs.CL · 2026-06-18 · unverdicted · novelty 7.0

REDACT is a new systematically controlled multilingual PII detection benchmark with 51 entity types, sensitivity-tier metadata, and stratified evaluation revealing that rule-based detectors fail on high-stakes data while LLM detectors are more robust.

