How Human-Like Are Large Language Models? A Register-Aware Linguistic Evaluation Framework


arxiv: 2605.23651 · v2 · pith:444QW6XYnew · submitted 2026-05-22 · 💻 cs.CL


Björn Nieth, Marianna Gracheva, Michaela Mahlberg, Bjoern Eskofier (1, 3, 5, 6), Emmanuelle Salin (1)

(1) Department Artificial Intelligence in Biomedical Engineering (AIBE), FAU Erlangen-Nürnberg, Germany
(2) Department of Digital Humanities and Social Studies (DHSS), FAU Erlangen-Nürnberg, Germany
(3) University of Birmingham, United Kingdom
(4) Chair of AI-supported Therapy Decisions, LMU München, Munich, Germany
(5) Munich Center for Machine Learning (MCML), Munich, Germany
(6) Institute of AI for Health, Helmholtz Zentrum München, Neuherberg, Germany


Pith reviewed 2026-05-25 04:18 UTC · model grok-4.3

classification 💻 cs.CL

keywords: large language models · linguistic evaluation · register variation · Biber features · maximum mean discrepancy · human-likeness · corpus linguistics · text generation


The pith

Large language models deviate from human linguistic patterns in every tested setup, but which model comes closest depends on the register rather than on model size.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces a context-aware framework that measures human-likeness of LLM texts by comparing their distributions of linguistic features against human reference corpora for specific registers. It applies a two-sample test using 67 lexico-grammatical features to seven instruction-tuned models across five English datasets. All LLMs show measurable differences from the human baseline in every register examined. Rankings of which model comes closest shift depending on the register and are not explained by differences in model size. This matters because texts can be factually accurate yet still feel unnatural if they violate the expected frequencies and patterns for a given communicative context.

Core claim

LLMs deviate from the human baseline in every tested setup when their texts are compared on lexico-grammatical feature distributions. The model that produces the distribution closest to human writing changes with the register, and this ordering is not dictated by model size.

What carries the argument

A two-sample Maximum Mean Discrepancy comparison between human and LLM corpora, performed separately for each register using the 67 Biber lexico-grammatical features.
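For readers who want the statistic spelled out, the standard textbook definition of squared MMD and its unbiased sample estimate are sketched below. The symbols are notation introduced here for illustration: $P_r$ and $Q_r$ stand for the human and model feature distributions in register $r$, $x_i$ and $y_j$ for feature vectors of individual texts, $m$ and $n$ for the two sample sizes, and $k$ for a positive-definite kernel. The source does not say which kernel or estimator the paper uses, so this should be read as the generic form rather than the authors' exact setup.

```latex
\mathrm{MMD}^2(P_r, Q_r)
  = \mathbb{E}_{x, x' \sim P_r}\bigl[k(x, x')\bigr]
  + \mathbb{E}_{y, y' \sim Q_r}\bigl[k(y, y')\bigr]
  - 2\,\mathbb{E}_{x \sim P_r,\; y \sim Q_r}\bigl[k(x, y)\bigr]

\widehat{\mathrm{MMD}}^2_u
  = \frac{1}{m(m-1)} \sum_{i \neq j} k(x_i, x_j)
  + \frac{1}{n(n-1)} \sum_{i \neq j} k(y_i, y_j)
  - \frac{2}{mn} \sum_{i=1}^{m} \sum_{j=1}^{n} k(x_i, y_j)
```

With a characteristic kernel such as the Gaussian kernel, the population quantity is zero exactly when the two distributions coincide, which is what makes it usable as a two-sample distance.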

If this is right

Evaluation of LLM output must be performed register by register rather than with a single aggregate score.
Larger models are not guaranteed to produce more human-like language distributions than smaller ones.
Different communicative contexts expose different strengths among current open-source models.
The framework supplies a quantitative basis for selecting models according to the intended register of use.
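The first and last points above can be illustrated with a small sketch. The distances below are made-up placeholder numbers, not results from the paper, and the register and model names are hypothetical; the only purpose is to show how a single averaged score can pick a different model than a per-register ranking.

```python
# Hypothetical placeholder distances (NOT values reported in the paper).
# Outer keys are registers, inner keys are hypothetical model names.
distances = {
    "spoken":   {"model_small": 0.12, "model_medium": 0.20, "model_large": 0.18},
    "academic": {"model_small": 0.30, "model_medium": 0.15, "model_large": 0.22},
}


def rank_models(per_model):
    """Return model names sorted from closest to farthest from human text."""
    return sorted(per_model, key=per_model.get)


# Per-register view: one ranking for each register.
rankings = {register: rank_models(scores) for register, scores in distances.items()}
closest = {register: order[0] for register, order in rankings.items()}

# Aggregate view: average each model's distance over all registers.
models = sorted(next(iter(distances.values())))
mean_distance = {
    m: sum(distances[r][m] for r in distances) / len(distances) for m in models
}
aggregate_best = min(mean_distance, key=mean_distance.get)

print("closest per register:", closest)
print("best by aggregate score:", aggregate_best)
```

In this toy setup the aggregate score favors a model that is not the closest one for the spoken register, which is the kind of mismatch a register-by-register evaluation is meant to expose.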

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

Fine-tuning on register-specific human data may close the observed gaps more effectively than further scaling.
The same method could be applied to measure how well models handle register shifts within a single conversation.
Training data that under-represents certain registers likely contributes to the systematic deviations found here.

Load-bearing premise

The 67 Biber features together with the MMD statistic capture the aspects of language production that determine whether a text feels human-like in a given register.

What would settle it

Finding that larger models rank closer to the human baseline than smaller ones in every register would show that closeness is dictated by size after all.
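One way such a check could be run is a rank correlation between model size and distance to the human corpus, computed separately per register. The sketch below implements Spearman's rho with the standard library; the parameter counts and distances are invented placeholders, not figures from the paper, and the function names are hypothetical.

```python
import math


def average_ranks(values):
    """Rank values from 1..n, giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2.0 + 1.0
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def spearman(x, y):
    """Spearman's rho: Pearson correlation computed on the ranks."""
    rx, ry = average_ranks(x), average_ranks(y)
    n = len(x)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sx = math.sqrt(sum((a - mx) ** 2 for a in rx))
    sy = math.sqrt(sum((b - my) ** 2 for b in ry))
    if sx == 0.0 or sy == 0.0:
        raise ValueError("rho is undefined when one input is constant")
    return cov / (sx * sy)


# Hypothetical placeholders: parameter counts in billions and distances.
sizes_b = [1, 4, 8, 12]
distance_by_register = {
    "register_a": [0.4, 0.3, 0.2, 0.1],  # larger model, smaller distance
    "register_b": [0.2, 0.1, 0.4, 0.3],  # no such pattern
}
rho = {r: spearman(sizes_b, d) for r, d in distance_by_register.items()}
print(rho)
```

A rho close to -1 in every register would support the size-driven reading; mixed signs or weak values across registers would be in line with the paper's claim. With only a handful of models per register such correlations are very noisy, so they are indicative at best.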

Figures

Figures reproduced from arXiv: 2605.23651 by Björn Nieth, Marianna Gracheva, Michaela Mahlberg, Bjoern Eskofier, and Emmanuelle Salin.

**Figure 2.** MMD² with a resampled confidence interval for different sample sizes on the XSum dataset.

**Figure 3.** MMD² for all datasets and models to the respective human corpus, where the points indicate the observed MMD² and the whiskers show the 95% CI resampled on coupled samples from the human and model corpus. The orange line in each plot gives the respective Human-Human MMD² for the respective datasets with the resampled CI. The models on the y-axis are sorted by their observed MMD² distance. Because the distan…

**Figure 4.** Overview of the proposed evaluation frame…

**Figure 5.** MMD² for the prompt stability experiments to the human reference sample of the BNC2014Spoken. Dots indicate the mean value over all prompts, while the band shows the minimum and maximum observed distance for the respective model under all prompt variations.

**Figure 6.** Violin plot of Biber dimension 1 on BNC2014Spoken for human and models in the Zero-Shot setting.

**Figure 7.** MMD² with bootstrapped confidence interval for different sample sizes on all datasets. For BNC2014Spoken the error increases because the dataset has only 1200 samples, so a sample size larger than 600 yields one smaller and one larger subset.

**Figure 8.** Correlation heatmap of the human-AI MMD² on BNC2014Spoken across different prompt variants in the Zero-Shot setting.

**Figure 9.** Human and model distributions for Biber dimensions in the Zero-Shot setting (BNC2014Spoken).

**Figure 10.** Human and model distributions for Biber dimensions in the Zero-Shot setting (S2ORC_ACL).

**Figure 11.** Human and model distributions for Biber dimensions in the Zero-Shot setting (wikiHow).

**Figure 12.** Human and model distributions for Biber dimensions in the Zero-Shot setting (WritingPrompts).

**Figure 13.** Human and model distributions for Biber dimensions in the Zero-Shot setting (XSum).

**Figure 14.** Mean of the normalized linguistic features without standardization to the full human dataset, with …

**Figure 15.** Mean of the normalized linguistic features without standardization to the full human dataset, with the …

**Figure 16.** Mean of the normalized linguistic features without standardization to the full human dataset, with the …

**Figure 17.** Mean of the normalized linguistic features without standardization to the full human dataset, with the …

**Figure 18.** Mean of the normalized linguistic features without standardization to the full human dataset, with the …

**Figure 19.** Wasserstein distance for marginal feature distributions between model and human for BNC2014Spoken …

**Figure 20.** Wasserstein distance for marginal feature distributions between model and human for S2ORC_ACL in …

**Figure 21.** Wasserstein distance for marginal feature distributions between model and human for wikiHow in the …

**Figure 22.** Wasserstein distance for marginal feature distributions between model and human for WritingPrompts in …

**Figure 23.** Wasserstein distance for marginal feature distributions between model and human for XSum in the …

**Figure 24.** Observed MMD distance between different models for BNC2014Spoken in the Zero-Shot setting. The …

**Figure 25.** Observed MMD distance between different models for S2ORC_ACL in the Zero-Shot setting. The …

**Figure 26.** Observed MMD distance between different models for wikiHow in the Zero-Shot setting. The MMD …

**Figure 27.** Observed MMD distance between different models for WritingPrompts in the Zero-Shot setting. The …

**Figure 28.** Observed MMD distance between different models for XSum in the Zero-Shot setting. The MMD …

**Figure 29.** Sum of the variances of the 67 linguistic features after normalization on the corresponding full human …
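Several captions above refer to resampled or bootstrapped confidence intervals around MMD². The sketch below shows one generic way such an interval could be obtained: a percentile bootstrap over both corpora with a Gaussian kernel. This is an assumption for illustration, not the paper's resampling scheme (the captions mention "coupled samples", whose exact construction is not given here). The bandwidth, sample sizes, function names, and synthetic data are all hypothetical.

```python
import math
import random


def rbf(x, y, bandwidth=1.0):
    """Gaussian (RBF) kernel between two feature vectors."""
    sq = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-sq / (2.0 * bandwidth ** 2))


def mmd2_biased(X, Y, bandwidth=1.0):
    """Biased (V-statistic) estimate of squared MMD.

    Non-negative up to floating-point error, and well defined even when
    a resample contains the same text more than once.
    """
    k_xx = sum(rbf(a, b, bandwidth) for a in X for b in X) / len(X) ** 2
    k_yy = sum(rbf(a, b, bandwidth) for a in Y for b in Y) / len(Y) ** 2
    k_xy = sum(rbf(a, b, bandwidth) for a in X for b in Y) / (len(X) * len(Y))
    return k_xx + k_yy - 2.0 * k_xy


def bootstrap_ci(X, Y, n_resamples=100, alpha=0.05, seed=0):
    """Percentile bootstrap interval: resample both corpora with replacement."""
    rng = random.Random(seed)
    stats = []
    for _ in range(n_resamples):
        xb = [rng.choice(X) for _ in X]
        yb = [rng.choice(Y) for _ in Y]
        stats.append(mmd2_biased(xb, yb))
    stats.sort()
    lo_idx = int((alpha / 2.0) * n_resamples)
    hi_idx = min(n_resamples - 1, int((1.0 - alpha / 2.0) * n_resamples))
    return stats[lo_idx], stats[hi_idx]


# Synthetic stand-ins for per-text feature vectors (not real corpus data).
rng = random.Random(1)
human = [[rng.gauss(0.0, 1.0) for _ in range(2)] for _ in range(25)]
model = [[rng.gauss(2.0, 1.0) for _ in range(2)] for _ in range(25)]

observed = mmd2_biased(human, model)
ci_low, ci_high = bootstrap_ci(human, model)
print(observed, (ci_low, ci_high))
```

The biased estimator is used here because resampling with replacement duplicates texts, which the unbiased estimator handles less gracefully. Bootstrap intervals for this statistic tend to sit somewhat high, so the sketch is suited to illustrating interval width rather than to formal inference.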

Original abstract

While factual correctness and task-performance have been in focus of Large Language Model (LLM) research for a long time, the fundamental question of how human-like generated texts are on a linguistic level has been underexplored. From a corpus-linguistic perspective, language production is inherently context-dependent, with distinct communicative contexts giving rise to differences in frequencies and co-occurrence patterns of linguistic features. A text failing to adhere to these patterns can be content-wise correct, but still be unfavorable to human readers. In this work, we propose a context-aware evaluation framework in which human-likeness is assessed using a two-sample problem between the linguistic feature distribution of a human reference corpus for a given register and a corresponding LLM-generated corpus. We implement this framework using the Maximum Mean Discrepancy (MMD) and the 67 lexico-grammatical features introduced by Biber, which are commonly applied in corpus linguistics. In our experiments, we compare seven instruction-tuned, open-source models across five English-language datasets spanning distinct registers against a human baseline. While across all tested setups, LLMs deviate from the human baseline, which models are closest to human language depends on the register and is not dictated by model size.
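To make the two-sample setup in the abstract concrete, here is a minimal sketch of an unbiased squared-MMD estimate between two sets of feature vectors. It assumes a Gaussian kernel with a fixed bandwidth and uses three synthetic features instead of the 67 Biber features; the kernel choice, the bandwidth, the function names, and the data are assumptions for illustration, since the abstract does not specify them.

```python
import math
import random


def rbf_kernel(x, y, bandwidth):
    """Gaussian (RBF) kernel between two feature vectors."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-sq_dist / (2.0 * bandwidth ** 2))


def mmd2_unbiased(X, Y, bandwidth=1.0):
    """Unbiased estimate of squared MMD between samples X and Y.

    Within-sample sums skip the diagonal (i == j), so the estimate can
    come out slightly negative when the two samples are very similar.
    """
    m, n = len(X), len(Y)
    k_xx = sum(
        rbf_kernel(X[i], X[j], bandwidth)
        for i in range(m) for j in range(m) if i != j
    ) / (m * (m - 1))
    k_yy = sum(
        rbf_kernel(Y[i], Y[j], bandwidth)
        for i in range(n) for j in range(n) if i != j
    ) / (n * (n - 1))
    k_xy = sum(rbf_kernel(x, y, bandwidth) for x in X for y in Y) / (m * n)
    return k_xx + k_yy - 2.0 * k_xy


# Synthetic stand-ins for standardized per-text feature vectors.
rng = random.Random(0)
human_a = [[rng.gauss(0.0, 1.0) for _ in range(3)] for _ in range(40)]
human_b = [[rng.gauss(0.0, 1.0) for _ in range(3)] for _ in range(40)]
model = [[rng.gauss(1.5, 1.0) for _ in range(3)] for _ in range(40)]

d_same = mmd2_unbiased(human_a, human_b)   # same distribution: near zero
d_shift = mmd2_unbiased(human_a, model)    # shifted distribution: larger
print(d_same, d_shift)
```

Comparing two human samples against each other, as in the first call, gives a reference level for how much distance arises from sampling alone, which is the role the Human-Human MMD² line plays in the paper's figures.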

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note: private letter to a colleague

The paper applies Biber features plus MMD to show that LLMs deviate from human registers, but that the closest model varies by register, not by size.


The core result here is straightforward: across the tested registers, all seven LLMs differ from the human corpora on the 67 Biber features, yet which model sits closest shifts with the register and does not track model size. That register dependence is the main claim worth noting.

The work is new in taking the established Biber feature set and two-sample MMD setup and turning it into a register-aware evaluation for generated text. It does this cleanly by pulling human reference corpora for five English registers and generating matching LLM output for direct comparison. The approach is transparent and builds directly on corpus-linguistic tools that have been used for decades, which gives it a solid empirical footing without introducing new parameters or circular definitions.

The limitation is that the 67 features are counts of specific lexico-grammatical classes. They may not pick up discourse-level or pragmatic signals that matter for human judgments of naturalness in a given register. If the MMD ordering does not match what readers actually rate as human-like, the register-specific conclusions rest on a narrower base than the abstract suggests. The stress-test concern lands because nothing in the reported setup tests alignment with human naturalness ratings or an expanded feature set.

This is useful for readers who already work with register variation or who need a practical way to compare models on linguistic distributions rather than task accuracy. It is worth sending to peer review because the method is reproducible and the question is well-posed, even if the feature set needs more validation against human perception.
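To give a sense of what "counts of specific lexico-grammatical classes" means in practice, the toy sketch below computes two such rates, normalized per 1,000 tokens as is conventional in this line of corpus work. It is an illustration only: real Biber-style tagging relies on part-of-speech analysis and covers 67 features, whereas the word list, the regex tokenizer, and the function names here are simplifying assumptions.

```python
import re

# Small illustrative word list, not a complete inventory.
FIRST_PERSON = {"i", "me", "my", "mine", "we", "us", "our", "ours"}


def tokenize(text):
    """Lowercase word tokens; an apostrophe inside a word stays attached."""
    return re.findall(r"[a-z]+(?:'[a-z]+)?", text.lower())


def rate_per_1000(count, n_tokens):
    """Normalize a raw count to a rate per 1,000 tokens."""
    return 1000.0 * count / n_tokens if n_tokens else 0.0


def toy_features(text):
    """Return two toy lexico-grammatical rates for one text."""
    tokens = tokenize(text)
    n = len(tokens)
    first_person = sum(t in FIRST_PERSON for t in tokens)
    # Crude proxy: any token with an apostrophe, so possessives count too.
    contractions = sum("'" in t for t in tokens)
    return {
        "first_person_pronouns": rate_per_1000(first_person, n),
        "contractions": rate_per_1000(contractions, n),
    }


features = toy_features("I don't think we can't go")
print(features)
```

Each text becomes a vector of such rates, and a corpus becomes a cloud of vectors. The letter's concern is that surface counts of this kind say nothing about coherence or pragmatic fit.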

Referee Report

1 major / 1 minor

Summary. The manuscript proposes a context-aware evaluation framework for assessing the human-likeness of LLM-generated texts using Maximum Mean Discrepancy (MMD) to compare distributions of 67 Biber lexico-grammatical features between human reference corpora and LLM outputs across five distinct English registers. Experiments with seven instruction-tuned open-source LLMs reveal that all models deviate from human baselines, but the model closest to the human distribution varies depending on the register and is not solely determined by model size.

Significance. If the framework's assumptions hold, this work offers a valuable corpus-linguistic approach to LLM evaluation that accounts for register-specific linguistic patterns, moving beyond task performance metrics. The reliance on established Biber features and MMD contributes to the method's transparency and potential for replication in the field.

major comments (1)

[Abstract] The central claim that 'which models are closest to human language depends on the register and is not dictated by model size' is load-bearing on the 67 Biber features plus two-sample MMD being a sufficient statistic for human-likeness (Abstract). The manuscript provides no evidence that these distances align with human judgments of naturalness or discourse-level properties in the tested registers, nor any ablation against expanded feature sets; if the ordering differs from such external validation, the register-dependence conclusion does not follow from the reported MMD values.

minor comments (1)

[Abstract] The abstract states the main finding but does not name the five registers or seven models; adding these would improve immediate readability.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for their constructive comments. We respond to the single major comment below.


Referee: [Abstract] The central claim that 'which models are closest to human language depends on the register and is not dictated by model size' is load-bearing on the 67 Biber features plus two-sample MMD being a sufficient statistic for human-likeness (Abstract). The manuscript provides no evidence that these distances align with human judgments of naturalness or discourse-level properties in the tested registers, nor any ablation against expanded feature sets; if the ordering differs from such external validation, the register-dependence conclusion does not follow from the reported MMD values.

Authors: We acknowledge the referee's point that the manuscript does not provide direct evidence linking MMD distances on the Biber feature set to human judgments of naturalness. The 67 features are selected because they are a well-established, replicable set in corpus linguistics for modeling register variation (Biber 1988 and subsequent validation studies). MMD serves as a distribution-level comparator rather than a claim of sufficiency for all aspects of human-likeness. The reported finding is therefore scoped to relative distances within this operationalization: across the five registers, the model minimizing MMD changes and is not monotonically related to parameter count. We agree that external validation would strengthen interpretation. In revision we will (1) temper the abstract wording to emphasize that conclusions concern this specific feature set and metric, (2) add citations to existing literature on the predictive validity of Biber features for perceived register appropriateness, and (3) expand the limitations section to note the absence of human judgment correlation or feature-set ablations as directions for future work. No new experiments are added at this stage.

Revision: partial

Circularity Check

0 steps flagged

No circularity; direct empirical comparison to external human corpora


The paper defines human-likeness via two-sample MMD distances on the fixed, externally established set of 67 Biber lexico-grammatical features between LLM-generated texts and independent human reference corpora for each register. No equations, fitted parameters, self-referential definitions, or load-bearing self-citations appear; the reported register-dependent ordering of models follows immediately from these distance computations without any reduction of outputs to inputs by construction. The approach is therefore self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

Abstract-only review yields limited visibility into parameters or assumptions; the framework rests on the domain assumption that Biber features are sufficient proxies for register-specific human language production.

axioms (1)

Domain assumption: Biber's 67 lexico-grammatical features capture the relevant frequency and co-occurrence patterns that distinguish registers in human language production.
The entire evaluation framework is built on this standard corpus-linguistic premise as stated in the abstract.

