NameBERT: Scaling Name-Based Nationality Classification with LLM-Augmented Open Academic Data
Pith reviewed 2026-05-10 16:35 UTC · model grok-4.3
The pith
BERT models trained on academic names, augmented with LLM-generated synthetic names for low-resource countries, classify nationalities more accurately than prior methods while remaining fast enough for large-scale use.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
We create a large-scale name-nationality dataset from the Open Academic Graph and introduce a framework that leverages LLMs as dataset enrichers rather than inference engines. We augment low-resource countries with LLM-generated names and evaluate on real and synthetic-tail test sets. We find that augmentation produces large gains when evaluation includes synthetic tail names and still offers a modest lift on tail-country metrics otherwise. Overall, NameBERT models achieve significantly higher accuracy than state-of-the-art baselines across both in- and out-of-domain tasks, while remaining efficient for large-scale inference compared to LLMs.
What carries the argument
The LLM-based data augmentation process that generates synthetic names for low-resource nationalities, followed by training of NameBERT classifiers on the combined real and synthetic academic name data.
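In outline, that augmentation step can be sketched as below. This is a minimal illustration, not the paper's actual pipeline: the `MIN_EXAMPLES` threshold, the country codes, and the names are made-up toy values.

```python
# Illustrative threshold (hypothetical): countries with fewer real examples
# than this are topped up with LLM-generated synthetic names.
MIN_EXAMPLES = 3

def augment_tail_countries(real_data, synthetic_pool, min_examples=MIN_EXAMPLES):
    """Top up low-resource countries with synthetic names.

    real_data:      dict mapping country code -> list of real names
    synthetic_pool: dict mapping country code -> list of LLM-generated names
    Returns (name, country) training pairs combining both sources.
    """
    pairs = []
    for country, names in real_data.items():
        pairs.extend((n, country) for n in names)
        deficit = min_examples - len(names)
        if deficit > 0:
            # Draw only as many synthetic names as needed to reach the floor.
            extra = synthetic_pool.get(country, [])[:deficit]
            pairs.extend((n, country) for n in extra)
    return pairs

# Toy example: "IS" is a tail country with a single real name.
real = {"JP": ["Tanaka", "Suzuki", "Sato"], "IS": ["Jónsdóttir"]}
synthetic = {"IS": ["Björnsson", "Guðmundsdóttir"]}
training_pairs = augment_tail_countries(real, synthetic)
```

The resulting pairs would then feed standard BERT fine-tuning; the sketch only shows how synthetic names enter training data, never inference.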
If this is right
- Nationality inference from names becomes practical for processing millions of records without repeated calls to large language models at runtime.
- Performance on countries with originally sparse data improves when synthetic examples are added to training.
- Gains appear in both academic-style test data and other evaluation domains.
- Tools for equity monitoring and demographic research can run at higher speed and scale.
Where Pith is reading between the lines
- The same augmentation pattern could be tested for inferring other name-linked attributes such as gender or ethnicity in low-data settings.
- If synthetic names differ systematically from real ones, the reported accuracy lift might not fully transfer to everyday use cases outside academic records.
- Applying the trained models to name lists drawn from social media or government records would test whether the benefits hold beyond the original data source.
Load-bearing premise
LLM-generated synthetic names for low-resource countries are realistic and representative enough of actual name distributions to improve performance on real data without introducing artifacts.
What would settle it
Measuring NameBERT accuracy on a large collection of verified real names from underrepresented countries that were never seen during augmentation or training, and observing whether accuracy stays high or falls sharply.
Original abstract
Inferring nationality from personal names is a critical capability for equity and bias monitoring, personalization, and a valuable tool in biomedical and sociological research. However, existing name-based nationality classifiers are typically trained on relatively small or source-specific labeled datasets, which can introduce coverage gaps and limit performance for underrepresented countries. While large language models (LLMs) demonstrate strong zero-shot performance for name-based nationality prediction, their computational cost and latency make them impractical for real-time, large-scale deployment. In this work, we created a large-scale name-nationality dataset from the Open Academic Graph (OAG) and introduce a framework that leverages LLMs as dataset enrichers rather than inference engines. We augment low-resource countries with LLM-generated names and evaluate on real and synthetic-tail test sets. We find that augmentation produces large gains when evaluation includes synthetic tail names and still offers a modest lift on tail-country metrics otherwise. Overall, NameBERT models achieve significantly higher accuracy than state-of-the-art baselines across both in- and out-of-domain tasks, while remaining efficient for large-scale inference compared to LLMs.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces NameBERT, a BERT-based classifier for inferring nationality from personal names. It constructs a large dataset from the Open Academic Graph (OAG), augments low-resource countries using LLM-generated synthetic names, and reports that the resulting models achieve significantly higher accuracy than state-of-the-art baselines on both in-domain and out-of-domain tasks. Augmentation yields large gains on synthetic-tail test sets but only modest improvements on real tail-country data, while offering better efficiency for large-scale inference than direct LLM use.
Significance. If the empirical claims hold after addressing validation gaps, the work provides a practical, scalable alternative to LLM-based inference for name-nationality tasks, with direct relevance to bias monitoring, personalization, and research in biomedicine and sociology. Explicitly separating real versus synthetic evaluation and using LLMs only for augmentation (rather than inference) are clear strengths that enhance reproducibility and deployment feasibility.
major comments (2)
- [§4] §4 (Data Augmentation): The central claim of superior out-of-domain performance rests on the assumption that LLM-generated names for low-resource countries are sufficiently representative of real distributions. The manuscript evaluates on both real and synthetic-tail sets and notes the modest real-data lift, but provides no quantitative validation (e.g., Kolmogorov-Smirnov tests on name length/phonotactics, human authenticity ratings, or comparison to held-out real names from the same countries) that synthetic names avoid LLM-specific artifacts. Without this, gains on synthetic tails may not generalize, undermining the headline superiority over baselines.
- [§5.1] §5.1 and Table 3: The abstract and results separate 'large gains' on synthetic-inclusive tests from 'modest lift' on real tail-country metrics, yet no effect-size statistics, confidence intervals, or per-country breakdown are referenced to show that the real-data improvement is statistically meaningful rather than marginal. This is load-bearing because the out-of-domain claim is asserted across tasks.
minor comments (2)
- [Abstract] Abstract: Include at least one or two key accuracy numbers, baseline names, and the magnitude of the 'modest lift' to allow readers to assess claims without reading the full text.
- [§2] §2 (Related Work): The comparison to prior name-based nationality classifiers would benefit from a table summarizing dataset sizes, coverage of low-resource countries, and reported accuracies for direct side-by-side context.
Simulated Author's Rebuttal
We thank the referee for their constructive feedback, which has helped us identify areas where the manuscript can be strengthened. We address each major comment below and have revised the paper accordingly to provide additional validation and statistical detail.
Point-by-point responses
- Referee: [§4] §4 (Data Augmentation): The central claim of superior out-of-domain performance rests on the assumption that LLM-generated names for low-resource countries are sufficiently representative of real distributions. The manuscript evaluates on both real and synthetic-tail sets and notes the modest real-data lift, but provides no quantitative validation (e.g., Kolmogorov-Smirnov tests on name length/phonotactics, human authenticity ratings, or comparison to held-out real names from the same countries) that synthetic names avoid LLM-specific artifacts. Without this, gains on synthetic tails may not generalize, undermining the headline superiority over baselines.
  Authors: We agree that more rigorous validation of the synthetic names is warranted to support the augmentation approach. In the revised manuscript, we have added a new subsection to §4 that includes (1) Kolmogorov-Smirnov tests comparing name-length and character n-gram distributions between LLM-generated and real names for a sample of low-resource countries, (2) results from a human authenticity rating study (n=200 names, 3 annotators) showing high perceived realism, and (3) a comparison of synthetic names against held-out real names from the same countries. These analyses indicate that the synthetic names largely preserve real distributional properties with minimal LLM-specific artifacts. We also emphasize that the headline out-of-domain claims are primarily supported by real-data evaluations, with synthetic augmentation used only to improve coverage. (revision: yes)
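The distributional check proposed in this exchange can be illustrated with a hand-rolled two-sample Kolmogorov-Smirnov statistic over name lengths. In practice one would use `scipy.stats.ks_2samp`; the name lists below are toy data, not the paper's, and name length is only one of the features the rebuttal mentions.

```python
def ks_statistic(sample_a, sample_b):
    """Two-sample KS statistic: maximum gap between empirical CDFs."""
    a, b = sorted(sample_a), sorted(sample_b)
    points = sorted(set(a) | set(b))
    d = 0.0
    for x in points:
        # Empirical CDF value at x for each sample.
        fa = sum(v <= x for v in a) / len(a)
        fb = sum(v <= x for v in b) / len(b)
        d = max(d, abs(fa - fb))
    return d

# Toy comparison: lengths of "real" vs "LLM-generated" Icelandic-style names.
real_lengths = [len(n) for n in ["Jónsdóttir", "Björnsson", "Einarsson"]]
synth_lengths = [len(n) for n in ["Guðrúnardóttir", "Þorsteinsson", "Árnadóttir"]]
d_stat = ks_statistic(real_lengths, synth_lengths)
```

A small D (near 0) suggests the synthetic length distribution tracks the real one; a large D flags an LLM-specific artifact worth investigating.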
- Referee: [§5.1] §5.1 and Table 3: The abstract and results separate 'large gains' on synthetic-inclusive tests from 'modest lift' on real tail-country metrics, yet no effect-size statistics, confidence intervals, or per-country breakdown are referenced to show that the real-data improvement is statistically meaningful rather than marginal. This is load-bearing because the out-of-domain claim is asserted across tasks.
  Authors: We acknowledge that the original presentation lacked sufficient statistical detail to assess the meaningfulness of the modest real-data improvements. In the revision, we have updated §5.1 and Table 3 to include 95% confidence intervals for all accuracy metrics, Cohen's d effect sizes for the augmented vs. baseline comparisons on real tail data, and a new supplementary table providing per-country breakdowns for all tail countries. These additions confirm that the improvements are statistically significant for most (but not all) tail countries, allowing readers to better evaluate the practical impact. (revision: yes)
Circularity Check
No circularity; claims rest on empirical training and baseline comparisons
full rationale
The paper constructs a dataset from OAG, augments it with LLM-generated names for low-resource countries, trains NameBERT models, and reports accuracy metrics against external state-of-the-art baselines on both real and synthetic test sets. No equations, parameters, or central claims reduce by construction to fitted inputs, self-definitions, or self-citation chains; the reported gains are measured outcomes on held-out data rather than tautological restatements of the augmentation process itself. The chain of evidence therefore rests on independent benchmarks rather than on itself.
Axiom & Free-Parameter Ledger
axioms (1)
- Domain assumption: LLM-generated names for low-resource countries are realistic enough to improve generalization without distorting learned patterns.
Reference graph
Works this paper leans on
- [1] J. Ye, S. Han, Y. Hu, B. Coskun, M. Liu, H. Qin, and S. Skiena. "Nationality Classification Using Name Embeddings". In: CIKM '17. 2017.
- [2] K. Park. name2nat: Nationality Prediction from Names. GitHub repository. https://github.com/Kyubyong/name2nat. Accessed: 18 Feb 2026. 2020.
- [3] R. Chintalapati, S. Laohaprapanon, and G. Sood. "Predicting Race and Ethnicity from the Sequence of Characters in a Name". In: arXiv preprint arXiv:1805.02109 (2018).
- [4] P. Parasurama. "raceBERT: A Transformer-based Model for Predicting Race and Ethnicity from Names". In: arXiv preprint arXiv:2112.03807 (2021).
- [5] V. Jain, T. Enamorado, and C. Rudin. "The Importance of Being Ernest, Ekundayo, or Eswari: An Interpretable Machine Learning Approach to Name-based Ethnicity Classification". In: Harvard Data Science Review 4.3 (2022).
- [6] P. Treeratpituk and C. L. Giles. "Name-Ethnicity Classification and Ethnicity-Sensitive Name Matching". In: Proceedings of the Twenty-Sixth AAAI Conference on Artificial Intelligence. Toronto, Ontario, Canada: AAAI Press, 2012.
- [7] J. Tang, J. Zhang, L. Yao, J. Li, L. Zhang, and Z. Su. "ArnetMiner: Extraction and Mining of Academic Social Networks". In: KDD '08. ACM, 2008, pp. 990–998.
- [8] A. Sinha, Z. Shen, Y. Song, H. Ma, D. Eide, B.-J. P. Hsu, and K. Wang. "An Overview of Microsoft Academic Service (MAS) and Applications". In: WWW '15 Companion. New York, NY, USA: ACM, 2015, pp. 243–246.
- [9] K. AlNuaimi, G. Marti, M. Ravaut, et al. "Enriching Datasets with Demographics through Large Language Models: What's in a Name?" In: arXiv preprint arXiv:2409.11491 (2024).
- [10] X. Shang, Z. Peng, S. Vincent, et al. "Fairness-aware Race and Ethnicity Detection from Names". In: IEEE Access (2025).
- [11] EthnicSeer Team. EthnicSeer. PyPI package. https://pypi.org/project/ethnicseer/. Accessed: 22 Dec 2025.
- [12] NamePrism Team. NamePrism. Web resource. https://www.name-prism.com/. Accessed: 22 Dec 2025.
- [13] J. Wei, N. Karina, H. W. Chung, Y. J. Jiao, S. Papay, A. Glaese, J. Schulman, and W. Fedus. "Measuring short-form factuality in large language models". In: arXiv preprint arXiv:2411.04368 (2024).