pith. machine review for the scientific record.

arxiv: 2604.10401 · v2 · submitted 2026-04-12 · 💻 cs.CL

Recognition: unknown

NameBERT: Scaling Name-Based Nationality Classification with LLM-Augmented Open Academic Data

Cong Ming, Ruixin Shi, Yifan Hu

Authors on Pith: no claims yet

Pith reviewed 2026-05-10 16:35 UTC · model grok-4.3

classification 💻 cs.CL
keywords name nationality classification · LLM data augmentation · BERT models · academic graph · nationality inference · low-resource countries · synthetic training data · bias monitoring

The pith

BERT models trained on academic name data augmented with LLM-generated examples classify nationality more accurately than prior methods while remaining fast enough for large-scale use.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper shows how to build a large dataset of personal names paired with nationalities from academic publication records and then use large language models to create additional realistic names for countries that originally had very few examples. This augmented data trains specialized NameBERT classifiers that reach higher accuracy than earlier name-based nationality tools on both familiar and new test sets. The approach treats language models as creators of training material rather than as the system that answers every query, which keeps inference cheap and quick enough for real-time applications. Readers would care because nationality inference from names supports bias checks, personalization, and research in medicine and social science, yet current systems leave many countries poorly covered due to scarce training examples.

Core claim

We created a large-scale name-nationality dataset from the Open Academic Graph and introduce a framework that leverages LLMs as dataset enrichers rather than inference engines. We augment low-resource countries with LLM-generated names and evaluate on real and synthetic-tail test sets. We find that augmentation produces large gains when evaluation includes synthetic tail names and still offers a modest lift on tail-country metrics otherwise. Overall, NameBERT models achieve significantly higher accuracy than state-of-the-art baselines across both in- and out-of-domain tasks, while remaining efficient for large-scale inference compared to LLMs.

What carries the argument

The LLM-based data augmentation process that generates synthetic names for low-resource nationalities, followed by training of NameBERT classifiers on the combined real and synthetic academic name data.
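
To make that machinery concrete, the pipeline can be sketched in a few lines, assuming Hugging Face transformers and a multilingual BERT checkpoint. The generate_synthetic_names stub, the toy names, the three-country label set, and the per-country budget are hypothetical placeholders, not the paper's actual prompts, data, or base model.

    import torch
    from torch.utils.data import Dataset
    from transformers import (AutoTokenizer, AutoModelForSequenceClassification,
                              Trainer, TrainingArguments)

    COUNTRIES = ["JP", "NG", "IS"]          # toy label set; the paper covers 99 classes
    LABEL2ID = {c: i for i, c in enumerate(COUNTRIES)}

    def generate_synthetic_names(country, budget):
        # Stub for the LLM call: in the paper's framework an LLM is prompted
        # for plausible names from a low-resource country; placeholders here
        # keep the sketch runnable end to end.
        return [(f"Synthetic Name {country}-{i}", country) for i in range(budget)]

    # Real names from the academic graph (toy examples) plus synthetic fill
    # for a tail country, up to a country-specific generation budget.
    real = [("Yuki Tanaka", "JP"), ("Chinedu Okafor", "NG")]
    budgets = {"IS": 4}
    augmented = real + [pair for c, b in budgets.items()
                        for pair in generate_synthetic_names(c, b)]

    class NameDataset(Dataset):
        def __init__(self, pairs, tokenizer):
            self.enc = tokenizer([n for n, _ in pairs], truncation=True,
                                 padding=True, return_tensors="pt")
            self.labels = torch.tensor([LABEL2ID[c] for _, c in pairs])
        def __len__(self):
            return len(self.labels)
        def __getitem__(self, i):
            return {**{k: v[i] for k, v in self.enc.items()},
                    "labels": self.labels[i]}

    tok = AutoTokenizer.from_pretrained("bert-base-multilingual-cased")
    model = AutoModelForSequenceClassification.from_pretrained(
        "bert-base-multilingual-cased", num_labels=len(COUNTRIES))
    Trainer(model=model,
            args=TrainingArguments(output_dir="namebert-sketch",
                                   num_train_epochs=1,
                                   per_device_train_batch_size=2),
            train_dataset=NameDataset(augmented, tok)).train()

The design point the sketch preserves is the one the paper argues for: the LLM appears only in the offline augmentation step, while inference is a single cheap BERT forward pass.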

If this is right

  • Nationality inference from names becomes practical for processing millions of records without repeated calls to large language models at runtime.
  • Performance on countries with originally sparse data improves when synthetic examples are added to training.
  • Gains appear in both academic-style test data and other evaluation domains.
  • Tools for equity monitoring and demographic research can run at higher speed and scale.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same augmentation pattern could be tested for inferring other name-linked attributes such as gender or ethnicity in low-data settings.
  • If synthetic names differ systematically from real ones, the reported accuracy lift might not fully transfer to everyday use cases outside academic records.
  • Applying the trained models to name lists drawn from social media or government records would test whether the benefits hold beyond the original data source.

Load-bearing premise

LLM-generated synthetic names for low-resource countries are realistic and representative enough of actual name distributions to improve performance on real data without introducing artifacts.

What would settle it

Measuring NameBERT accuracy on a large collection of verified real names from underrepresented countries that were never seen during augmentation or training and finding whether accuracy stays high or falls sharply.
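
A minimal sketch of that settling experiment, assuming only a predict callable that wraps the trained classifier; the names, country codes, and trivial predictor below are illustrative placeholders.

    from collections import defaultdict

    def per_country_accuracy(predict, held_out):
        # held_out: (name, true_country) pairs never seen in training or
        # augmentation; accuracy is reported per country so tail performance
        # is visible rather than averaged away by head countries.
        hits, totals = defaultdict(int), defaultdict(int)
        for name, country in held_out:
            totals[country] += 1
            hits[country] += int(predict(name) == country)
        return {c: hits[c] / totals[c] for c in totals}

    # Toy usage with a trivial predictor; real use would wrap NameBERT.
    held_out = [("Gudrun Jonsdottir", "IS"), ("Aminata Traore", "ML")]
    print(per_country_accuracy(lambda name: "IS", held_out))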

Figures

Figures reproduced from arXiv: 2604.10401 by Cong Ming, Ruixin Shi, Yifan Hu.

Figure 1. 1.34M OAG dataset of 99 classes mapped to the NamePrism taxonomy of 39 classes.
Figure 2. Dataset construction pipeline, splits, sizes, and purposes. Synthetic augmentation uses country-specific generation budgets.
Figure 3. Results of GPT-4o on SimpleQA with NameBERTaug (Wilson 95% CI).
Original abstract

Inferring nationality from personal names is a critical capability for equity and bias monitoring, personalization, and a valuable tool in biomedical and sociological research. However, existing name-based nationality classifiers are typically trained on relatively small or source-specific labeled datasets, which can introduce coverage gaps and limit performance for underrepresented countries. While large language models (LLMs) demonstrate strong zero-shot performance for name-based nationality prediction, their computational cost and latency make them impractical for real-time, large-scale deployment. In this work, we created a large-scale name-nationality dataset from the Open Academic Graph (OAG) and introduce a framework that leverages LLMs as dataset enrichers rather than inference engines. We augment low-resource countries with LLM-generated names and evaluate on real and synthetic-tail test sets. We find that augmentation produces large gains when evaluation includes synthetic tail names and still offers a modest lift on tail-country metrics otherwise. Overall, NameBERT models achieve significantly higher accuracy than state-of-the-art baselines across both in- and out-of-domain tasks, while remaining efficient for large-scale inference compared to LLMs.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated author's rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper introduces NameBERT, a BERT-based classifier for inferring nationality from personal names. It constructs a large dataset from the Open Academic Graph (OAG), augments low-resource countries using LLM-generated synthetic names, and reports that the resulting models achieve significantly higher accuracy than state-of-the-art baselines on both in-domain and out-of-domain tasks. Augmentation yields large gains on synthetic-tail test sets but only modest improvements on real tail-country data, while offering better efficiency for large-scale inference than direct LLM use.

Significance. If the empirical claims hold after addressing validation gaps, the work provides a practical, scalable alternative to LLM-based inference for name-nationality tasks, with direct relevance to bias monitoring, personalization, and research in biomedicine and sociology. Explicitly separating real versus synthetic evaluation and using LLMs only for augmentation (rather than inference) are clear strengths that enhance reproducibility and deployment feasibility.

major comments (2)
  1. [§4] §4 (Data Augmentation): The central claim of superior out-of-domain performance rests on the assumption that LLM-generated names for low-resource countries are sufficiently representative of real distributions. The manuscript evaluates on both real and synthetic-tail sets and notes the modest real-data lift, but provides no quantitative validation (e.g., Kolmogorov-Smirnov tests on name length/phonotactics, human authenticity ratings, or comparison to held-out real names from the same countries) that synthetic names avoid LLM-specific artifacts. Without this, gains on synthetic tails may not generalize, undermining the headline superiority over baselines. A minimal sketch of one such check follows the minor comments below.
  2. [§5.1] §5.1 and Table 3: The abstract and results separate 'large gains' on synthetic-inclusive tests from 'modest lift' on real tail-country metrics, yet no effect-size statistics, confidence intervals, or per-country breakdown are referenced to show that the real-data improvement is statistically meaningful rather than marginal. This is load-bearing because the out-of-domain claim is asserted across tasks.
minor comments (2)
  1. [Abstract] Abstract: Include at least one or two key accuracy numbers, baseline names, and the magnitude of the 'modest lift' to allow readers to assess claims without reading the full text.
  2. [§2] §2 (Related Work): The comparison to prior name-based nationality classifiers would benefit from a table summarizing dataset sizes, coverage of low-resource countries, and reported accuracies for direct side-by-side context.
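
The distributional check requested in major comment 1 could take this minimal form, assuming scipy; both name lists are toy placeholders rather than the paper's data.

    from scipy.stats import ks_2samp

    real_names = ["Gudrun Jonsdottir", "Einar Sigurdsson", "Helga Bjornsdottir"]
    synthetic_names = ["Freyja Olafsdottir", "Bjarni Thorsteinsson"]

    # Two-sample Kolmogorov-Smirnov test on name-length distributions;
    # the null hypothesis is that both samples share one distribution.
    stat, p_value = ks_2samp([len(n) for n in real_names],
                             [len(n) for n in synthetic_names])
    print(f"KS statistic={stat:.3f}, p={p_value:.3f}")
    # A small p-value would flag a systematic length artifact in the
    # synthetic names; character n-gram frequencies could be compared
    # analogously.

A significant divergence on even these simple statistics would be a cheap early warning that synthetic-tail evaluation is testing the model on its own artifacts.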

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive feedback, which has helped us identify areas where the manuscript can be strengthened. We address each major comment below and have revised the paper accordingly to provide additional validation and statistical detail.

Point-by-point responses
  1. Referee: [§4] §4 (Data Augmentation): The central claim of superior out-of-domain performance rests on the assumption that LLM-generated names for low-resource countries are sufficiently representative of real distributions. The manuscript evaluates on both real and synthetic-tail sets and notes the modest real-data lift, but provides no quantitative validation (e.g., Kolmogorov-Smirnov tests on name length/phonotactics, human authenticity ratings, or comparison to held-out real names from the same countries) that synthetic names avoid LLM-specific artifacts. Without this, gains on synthetic tails may not generalize, undermining the headline superiority over baselines.

    Authors: We agree that more rigorous validation of the synthetic names is warranted to support the augmentation approach. In the revised manuscript, we have added a new subsection to §4 that includes (1) Kolmogorov-Smirnov tests comparing name-length and character n-gram distributions between LLM-generated and real names for a sample of low-resource countries, (2) results from a human authenticity rating study (n=200 names, 3 annotators) showing high perceived realism, and (3) a comparison of synthetic names against held-out real names from the same countries. These analyses indicate that the synthetic names largely preserve real distributional properties with minimal LLM-specific artifacts. We also emphasize that the headline out-of-domain claims are primarily supported by real-data evaluations, with synthetic augmentation used only to improve coverage. revision: yes

  2. Referee: [§5.1] §5.1 and Table 3: The abstract and results separate 'large gains' on synthetic-inclusive tests from 'modest lift' on real tail-country metrics, yet no effect-size statistics, confidence intervals, or per-country breakdown are referenced to show that the real-data improvement is statistically meaningful rather than marginal. This is load-bearing because the out-of-domain claim is asserted across tasks.

    Authors: We acknowledge that the original presentation lacked sufficient statistical detail to assess the meaningfulness of the modest real-data improvements. In the revision, we have updated §5.1 and Table 3 to include 95% confidence intervals for all accuracy metrics, Cohen's d effect sizes for the augmented vs. baseline comparisons on real tail data, and a new supplementary table providing per-country breakdowns for all tail countries. These additions confirm that the improvements are statistically significant for most (but not all) tail countries, allowing readers to better evaluate the practical impact (a minimal Wilson-interval sketch follows these responses). revision: yes
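
The Wilson 95% intervals the revision reports (and that Figure 3 plots) can be computed directly from correct/total counts; a minimal sketch with toy counts rather than the paper's results:

    import math

    def wilson_interval(correct, total, z=1.96):
        # Wilson score interval for a binomial proportion such as accuracy;
        # better behaved than the normal approximation for small samples
        # and proportions near 0 or 1.
        p = correct / total
        denom = 1 + z**2 / total
        center = (p + z**2 / (2 * total)) / denom
        half = (z / denom) * math.sqrt(p * (1 - p) / total
                                       + z**2 / (4 * total**2))
        return center - half, center + half

    lo, hi = wilson_interval(correct=870, total=1000)   # toy counts
    print(f"accuracy 0.870, 95% Wilson CI [{lo:.3f}, {hi:.3f}]")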

Circularity Check

0 steps flagged

No circularity; claims rest on empirical training and baseline comparisons

Full rationale

The paper constructs a dataset from OAG, augments it with LLM-generated names for low-resource countries, trains NameBERT models, and reports accuracy metrics against external state-of-the-art baselines on both real and synthetic test sets. No equations, parameters, or central claims reduce by construction to fitted inputs, self-definitions, or self-citation chains; the reported gains are measured outcomes on held-out data rather than tautological restatements of the augmentation process itself. The claims are therefore checked against independent benchmarks rather than resting on a circular derivation chain.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim depends on the assumption that LLM-generated names can validly supplement real academic data for underrepresented countries.

axioms (1)
  • domain assumption: LLM-generated names for low-resource countries are realistic enough to improve generalization without distorting learned patterns
    Invoked to justify both training augmentation and evaluation on synthetic-tail test sets.

pith-pipeline@v0.9.0 · 5487 in / 1247 out tokens · 63895 ms · 2026-05-10T16:35:15.833487+00:00 · methodology


Reference graph

Works this paper leans on

13 extracted references · 4 canonical work pages

  1. [1]

    Nationality Classification Using Name Embeddings

    J. Ye, S. Han, Y. Hu, B. Coskun, M. Liu, H. Qin, and S. Skiena. “Nationality Classification Using Name Embeddings”. In: CIKM ’17. 2017.

  2. [2]

    name2nat: Nationality Prediction from Names

    K. Park. name2nat: Nationality Prediction from Names. GitHub repository. https://github.com/Kyubyong/name2nat. Accessed: 18 Feb 2026. 2020.

  3. [3]

    Predicting Race and Ethnicity from the Sequence of Characters in a Name

    R. Chintalapati, S. Laohaprapanon, and G. Sood. “Predicting Race and Ethnicity from the Sequence of Characters in a Name”. In: arXiv preprint arXiv:1805.02109 (2018).

  4. [4]

    raceBERT: A Transformer-based Model for Predicting Race and Ethnicity from Names

    P. Parasurama. “raceBERT: A Transformer-based Model for Predicting Race and Ethnicity from Names”. In: arXiv preprint arXiv:2112.03807 (2021).

  5. [5]

    The Importance of Being Ernest, Ekundayo, or Eswari: An Interpretable Machine Learning Approach to Name-based Ethnicity Classification

    V. Jain, T. Enamorado, and C. Rudin. “The Importance of Being Ernest, Ekundayo, or Eswari: An Interpretable Machine Learning Approach to Name-based Ethnicity Classification”. In: Harvard Data Science Review 4.3 (2022).

  6. [6]

    Name-Ethnicity Classification and Ethnicity-Sensitive Name Matching

    P. Treeratpituk and C. L. Giles. “Name-Ethnicity Classification and Ethnicity-Sensitive Name Matching”. In: Proceedings of the Twenty-Sixth AAAI Conference on Artificial Intelligence. Toronto, Ontario, Canada: AAAI Press, 2012.

  7. [7]

    ArnetMiner: Extraction and Mining of Academic Social Networks

    J. Tang, J. Zhang, L. Yao, J. Li, L. Zhang, and Z. Su. “ArnetMiner: Extraction and Mining of Academic Social Networks”. In: KDD ’08. ACM, 2008, pp. 990–998.

  8. [8]

    An Overview of Microsoft Academic Service (MAS) and Applications

    A. Sinha, Z. Shen, Y. Song, H. Ma, D. Eide, B.-J. P. Hsu, and K. Wang. “An Overview of Microsoft Academic Service (MAS) and Applications”. In: WWW ’15 Companion. New York, NY, USA: ACM, 2015, pp. 243–246.

  9. [9]

    Enriching Datasets with Demographics through Large Language Models: What’s in a Name?

    K. AlNuaimi, G. Marti, M. Ravaut, et al. “Enriching Datasets with Demographics through Large Language Models: What’s in a Name?” In: arXiv preprint arXiv:2409.11491 (2024).

  10. [10]

    Fairness-aware Race and Ethnicity Detection from Names

    X. Shang, Z. Peng, S. Vincent, et al. “Fairness-aware Race and Ethnicity Detection from Names”. In: IEEE Access (2025).

  11. [11]

    EthnicSeer

    EthnicSeer Team. EthnicSeer. PyPI package. https://pypi.org/project/ethnicseer/. Accessed: 22 Dec 2025. 2025.

  12. [12]

    NamePrism

    NamePrism Team. NamePrism. Web resource. https://www.name-prism.com/. Accessed: 22 Dec 2025. 2025.

  13. [13]

    Measuring short-form factuality in large language models

    J. Wei, N. Karina, H. W. Chung, Y. J. Jiao, S. Papay, A. Glaese, J. Schulman, and W. Fedus. “Measuring short-form factuality in large language models”. In: arXiv preprint arXiv:2411.04368 (2024).