Machine individuality: Separating genuine idiosyncrasy from response bias in large language models
Pith reviewed 2026-05-10 07:54 UTC · model grok-4.3
The pith
Large language models exhibit stimulus-specific individuality accounting for 16.9 percent of response variance on average.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Crossed random-effects modeling applied to 74.9 million ratings from 10 open-weight LLMs, covering over 100,000 words across 14 psycholinguistic norms, attributes an average of 16.9 percent of variance to stimulus-specific individuality; this component robustly exceeds a statistical null model and forms coherent, model-unique fingerprints that support cross-norm prediction.
What carries the argument
Crossed random-effects models that decompose variance into stimulus, model, stimulus-by-model interaction, and residual components; the interaction term captures genuine idiosyncrasy, isolated from global response biases and stochastic noise.
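A minimal sketch of this decomposition on simulated, balanced data (illustrative: the toy scale, names, and the expected-mean-squares estimator are assumptions; the paper fits crossed LMMs in the lme4 tradition rather than closed-form ANOVA):

```python
import numpy as np

rng = np.random.default_rng(0)

# Balanced crossed design: a words x b models x n stochastic repetitions.
a, b, n = 200, 10, 5
# Ground-truth standard deviations for the four components (toy values).
sd_word, sd_model, sd_inter, sd_eps = 1.0, 0.5, 0.6, 0.8

word = rng.normal(0, sd_word, (a, 1, 1))    # stimulus main effect
model = rng.normal(0, sd_model, (1, b, 1))  # model main effect (global bias)
inter = rng.normal(0, sd_inter, (a, b, 1))  # stimulus-by-model "individuality"
y = word + model + inter + rng.normal(0, sd_eps, (a, b, n))

# Expected-mean-squares (method-of-moments) estimates for the balanced
# two-way random-effects model with replication.
grand = y.mean()
m_word, m_model, m_cell = y.mean((1, 2)), y.mean((0, 2)), y.mean(2)

ms_word = b * n * ((m_word - grand) ** 2).sum() / (a - 1)
ms_model = a * n * ((m_model - grand) ** 2).sum() / (b - 1)
ms_inter = (n * ((m_cell - m_word[:, None] - m_model[None, :] + grand) ** 2).sum()
            / ((a - 1) * (b - 1)))
ms_eps = ((y - m_cell[:, :, None]) ** 2).sum() / (a * b * (n - 1))

var_eps = ms_eps                             # residual noise
var_inter = (ms_inter - ms_eps) / n          # stimulus-by-model interaction
var_word = (ms_word - ms_inter) / (b * n)    # stimulus main effect
var_model = (ms_model - ms_inter) / (a * n)  # model main effect

total = var_word + var_model + var_inter + var_eps
print(f"stimulus-specific individuality: {var_inter / total:.1%} of variance")
```

On balanced data these closed-form estimates agree with an LMM fit; the paper's 74.9 million ratings are not perfectly balanced, which is why crossed mixed-effects fitting (lme4-style, see reference [12]) is used instead.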
If this is right
- Behavioral differences among LLMs cannot be attributed solely to response biases or stochastic noise.
- Each LLM possesses a coherent fingerprint that persists across different psycholinguistic norms.
- Model individuality can be quantified and used for cross-norm prediction.
- Standard psychometric profiling of LLMs must be supplemented by stimulus-specific variance partitioning.
Where Pith is reading between the lines
- If the individuality is stable, targeted fine-tuning on specific stimulus classes might selectively alter or preserve these fingerprints.
- Audits of LLM outputs in high-stakes domains could focus on stimulus-specific response patterns rather than aggregate bias measures.
- The same variance-partitioning approach could be extended to multimodal models or agentic systems to detect emergent individual traits.
Load-bearing premise
The crossed random-effects model fully isolates stimulus-specific effects from global response biases and stochastic noise without residual confounding from prompt formatting or model architecture.
What would settle it
Repeating the variance decomposition on the same models but with varied prompt formats or entirely new stimulus sets and finding that stimulus-specific variance no longer exceeds the null model would falsify the central claim.
Original abstract
As large language models (LLMs) are increasingly integrated into daily life, in roles ranging from high-stakes decision support to companionship, understanding their behavioral dispositions becomes critical. A growing literature uses psychometric inventories and cognitive paradigms to profile LLM dispositions. However, these approaches cannot determine whether behavioral differences reflect stable, stimulus-specific individuality or global response biases and stochastic noise. Here, we apply crossed random-effects models -- widely used in psychometrics to separate systematic effects -- to 74.9 million ratings provided by 10 open-weight LLMs for over 100,000 words across 14 psycholinguistic norms. On average, 16.9% of variance is attributable to stimulus-specific individuality, robustly exceeding a statistical null model. Cross-norm prediction analyses reveal this individuality as a coherent fingerprint, unique to each model. These results identify individual differences among LLMs that cannot be attributed to response biases or stochastic noise. We term these differences machine individuality.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper applies crossed random-effects models to 74.9 million ratings from 10 open-weight LLMs on over 100,000 words across 14 psycholinguistic norms. It reports that 16.9% of variance on average is attributable to stimulus-specific individuality (exceeding a statistical null), with cross-norm prediction analyses showing this as a coherent, model-unique fingerprint that cannot be reduced to global response biases or stochastic noise. The authors term this 'machine individuality'.
Significance. If the variance decomposition validly isolates stimulus-specific effects, the work would be significant for LLM behavioral profiling: it scales psychometric variance-component methods to massive data and demonstrates that idiosyncrasies form stable, cross-norm fingerprints. The dataset size and use of standard mixed-effects procedures are clear strengths; the cross-norm coherence test adds falsifiability. However, the low verification confidence due to missing specification details limits immediate impact.
major comments (2)
- [Abstract and Methods] The crossed random-effects model is invoked to isolate the stimulus-by-model interaction (yielding the 16.9% figure), but no equation is given for the random-effects structure, nor is it stated whether fixed effects for the 14 norms or other covariates are included. Without this, it is impossible to confirm that the reported component excludes systematic confounds from model-specific tokenization, embedding geometry, or prompt phrasing, as the central claim requires.
- [Results] No convergence checks, sensitivity analyses for prompt reformulation, or robustness tests against architecture-specific parsing differences are reported. These are load-bearing: the skeptic's concern (that architecture or prompt effects may be misattributed to stimulus individuality) cannot be evaluated without them, and the 16.9% figure and cross-norm fingerprint claim rest on the assumption that the model fully separates these factors.
minor comments (2)
- [Results] The exact form of the statistical null model (which the 16.9% figure is claimed to exceed) should be stated explicitly, including how it was simulated or fitted, to permit direct replication.
- A table or figure summarizing per-norm or per-model variance components would improve clarity and allow readers to assess heterogeneity in the 16.9% average.
Simulated Author's Rebuttal
We thank the referee for their constructive comments, which highlight important aspects of methodological transparency. We address each major comment below and will incorporate clarifications and additional analyses in the revised manuscript.
Point-by-point responses
Referee: [Abstract and Methods] The crossed random-effects model is invoked to isolate the stimulus-by-model interaction (yielding the 16.9% figure), but no equation is given for the random-effects structure, nor is it stated whether fixed effects for the 14 norms or other covariates are included. Without this, it is impossible to confirm that the reported component excludes systematic confounds from model-specific tokenization, embedding geometry, or prompt phrasing, as the central claim requires.
Authors: We agree that the absence of the explicit model equation limits verifiability. In the revised Methods section we will provide the full crossed random-effects specification, including the equation with random intercepts for stimulus, model, and their interaction, together with fixed effects for the 14 norms (to absorb norm-specific baselines) and any other covariates. This structure ensures the stimulus-by-model variance component isolates stimulus-specific individuality net of the listed confounds. revision: yes
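For concreteness, a sketch of one specification consistent with this response (an assumption for illustration; the exact fixed-effects structure is the authors' to confirm), with y_ijk the k-th rating of word i by model j within a given norm:

```latex
% Illustrative crossed random-effects specification (assumed form, not the
% paper's confirmed model): word effect tau_i, model effect beta_j (a global
% response bias), and stimulus-by-model interaction iota_ij, whose variance
% share yields the 16.9% figure.
\begin{align*}
  y_{ijk} &= \mu + \tau_i + \beta_j + \iota_{ij} + \varepsilon_{ijk},\\
  \tau_i &\sim \mathcal{N}(0,\sigma^2_{\tau}), \quad
  \beta_j \sim \mathcal{N}(0,\sigma^2_{\beta}), \quad
  \iota_{ij} \sim \mathcal{N}(0,\sigma^2_{\iota}), \quad
  \varepsilon_{ijk} \sim \mathcal{N}(0,\sigma^2_{\varepsilon}).
\end{align*}
```

Under this form the reported share is σ²_ι / (σ²_τ + σ²_β + σ²_ι + σ²_ε), with norm baselines absorbed either by per-norm fitting (as in the SI excerpts below) or by fixed effects for the 14 norms.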
Referee: [Results] No convergence checks, sensitivity analyses for prompt reformulation, or robustness tests against architecture-specific parsing differences are reported. These are load-bearing: the skeptic's concern (that architecture or prompt effects may be misattributed to stimulus individuality) cannot be evaluated without them, and the 16.9% figure and cross-norm fingerprint claim rest on the assumption that the model fully separates these factors.
Authors: We acknowledge that explicit convergence diagnostics and robustness checks would strengthen the central claims. In the revision we will add model convergence reports (singularity checks, optimizer status) for the primary analyses and include new sensitivity results for prompt reformulations and architecture-specific parsing variations, demonstrating that the 16.9% estimate and cross-norm fingerprints remain stable. revision: yes
Circularity Check
No circularity: standard variance decomposition on observed ratings
Full rationale
The paper applies crossed random-effects models to decompose 74.9 million observed ratings into variance components, reporting that 16.9% is attributable to stimulus-specific individuality after accounting for model main effects and residuals. This percentage is a direct output of fitting the statistical model to the data and comparing it to a null model; it does not reduce to any input by construction, nor is it defined in terms of itself. Cross-norm prediction analyses use the estimated random effects to assess coherence but do not create a self-referential loop. No load-bearing self-citations, uniqueness theorems, or smuggled ansatzes are invoked to justify the central claim. The derivation relies on standard psychometric mixed-effects procedures applied to external rating data and is therefore self-contained.
Axiom & Free-Parameter Ledger
axioms (1)
- Domain assumption: LLM ratings on psycholinguistic norms reflect meaningful responses that can be decomposed into stimulus and rater effects.
Reference graph
Works this paper leans on
- [1] Gregory Serapio-García, Mustafa Safdari, Clément Crepy, Luning Sun, Stephen Fitz, Peter Romero, Marwa Abdulhai, Aleksandra Faust, and Maja Matarić. A psychometric framework for evaluating and shaping personality traits in large language models. Nature Machine Intelligence, 7:1954–1968, 2025. doi: 10.1038/s42256-025-01115-6
- [2] Max Pellert, Clemens M. Lechner, Claudia Wagner, Beatrice Rammstedt, and Markus Strohmaier. AI psychometrics: Assessing the psychological profiles of large language models through psychometric inventories. Perspect. Psychol. Sci., 19(5):808–826, 2024. doi: 10.1177/17456916231214460
- [3] Marcel Binz and Eric Schulz. Using cognitive psychology to understand GPT-3. Proc. Natl. Acad. Sci. U.S.A., 120(6):e2218523120, 2023. doi: 10.1073/pnas.2218523120
- [4] Thilo Hagendorff, Ishita Dasgupta, Marcel Binz, Stephanie C. Y. Chan, Andrew Lampinen, Jane X. Wang, Zeynep Akata, and Eric Schulz. Machine psychology: Investigating emergent capabilities and behavior in large language models using psychological methods. arXiv [Preprint], 2023. https://arxiv.org/abs/2303.13988 (accessed 2 April 2026)
- [5] Mingjie Sun, Yida Yin, Zhiqiu Xu, J. Zico Kolter, and Zhuang Liu. Idiosyncrasies in large language models. arXiv [Preprint], 2025. https://arxiv.org/abs/2502.12150 (accessed 2 April 2026)
- [6] Melanie Sclar, Yejin Choi, Yulia Tsvetkov, and Alane Suhr. Quantifying language models' sensitivity to spurious features in prompt design or: How I learned to start worrying about prompt formatting. In The Twelfth International Conference on Learning Representations, 2024. URL https://openreview.net/forum?id=RIu5lyNXjT
- [8] Akshat Gupta, Xiaoyang Song, and Gopala Anumanchipalli. Self-assessment tests are unreliable measures of LLM personality. In Proceedings of the 7th BlackboxNLP Workshop: Analyzing and Interpreting Neural Networks for NLP, pages 301–314, Miami, Florida, US, 2024. Association for Computational Linguistics
- [9] Xiaoyu Li, Haoran Shi, Zengyi Yu, Yukun Tu, and Chanjin Zheng. Decoding LLM personality measurement: Forced-choice vs. Likert. In Findings of the Association for Computational Linguistics: ACL 2025, pages 9234–9247, Vienna, Austria, 2025. Association for Computational Linguistics
- [10] Delroy L. Paulhus. Measurement and control of response bias. In John P. Robinson, Phillip R. Shaver, and Lawrence S. Wrightsman, editors, Measures of Personality and Social Psychological Attitudes, pages 17–59. Academic Press, San Diego, CA, 1991. doi: 10.1016/B978-0-12-590241-0.50006-X
- [11] Eunike Wetzel, Jan R. Böhnke, and Anna Brown. Response biases. In Frederick T. L. Leong, Dave Bartram, Fanny M. Cheung, Kurt F. Geisinger, and Dragos Iliescu, editors, The ITC International Handbook of Testing and Assessment, pages 349–363. Oxford University Press, New York, 2016. doi: 10.1093/med:psych/9780199356942.003.0024
- [12] Douglas Bates, Martin Mächler, Ben Bolker, and Steve Walker. Fitting linear mixed-effects models using lme4. Journal of Statistical Software, 67(1):1–48, 2015. doi: 10.18637/jss.v067.i01
- [13] Tom Sühr, Florian E. Dorner, Olawale Salaudeen, Augustin Kelava, and Samira Samadi. Stop evaluating AI with human tests, develop principled, AI-specific tests instead. arXiv [Preprint], 2025. https://arxiv.org/abs/2507.23009 (accessed 2 April 2026)
- [14] Liwei Jiang, Yuanjun Chai, Margaret Li, Mickel Liu, Raymond Fok, Nouha Dziri, Yulia Tsvetkov, Maarten Sap, Alon Albalak, and Yejin Choi. Artificial hivemind: The open-ended homogeneity of language models (and beyond). arXiv [Preprint], 2025. https://arxiv.org/abs/2510.22954 (accessed 2 April 2026)
- [15] Sagar Kumar, Ariel Flint, Luca Maria Aiello, and Andrea Baronchelli. Failure of contextual invariance in gender inference with large language models. arXiv [Preprint], 2026. https://arxiv.org/abs/2603.23485 (accessed 2 April 2026)
- [16] Zak Hussain, Rui Mata, Ben R. Newell, and Dirk U. Wulff. Probing the contents of semantic representations from text, behavior, and brain data using the psychNorms metabase. arXiv [Preprint], 2024. https://arxiv.org/abs/2412.04936 (accessed 2 April 2026)
- [17] Amy Beth Warriner, Victor Kuperman, and Marc Brysbaert. Norms of valence, arousal, and dominance for 13,915 English lemmas. Behavior Research Methods, 45(4):1191–1207, 2013. doi: 10.3758/s13428-012-0314-x
- [18] Marc Brysbaert, Amy Beth Warriner, and Victor Kuperman. Concreteness ratings for 40 thousand generally known English word lemmas. Behavior Research Methods, 46(3):904–911, 2014. doi: 10.3758/s13428-013-0403-5
- [20] Saif M. Mohammad. Obtaining reliable human ratings of valence, arousal, and dominance for 20,000 English words. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 174–184. Association for Computational Linguistics, 2018. doi: 10.18653/v1/P18-1017
- [21] Dermot Lynott, Louise Connell, Marc Brysbaert, James Brand, and James Carney. The Lancaster Sensorimotor Norms: multidimensional measures of perceptual and action strength for 40,000 English words. Behavior Research Methods, 52(3):1271–1291, 2020. doi: 10.3758/s13428-019-01316-z
- [22] Victor Kuperman, Hans Stadthagen-Gonzalez, and Marc Brysbaert. Age-of-acquisition ratings for 30,000 English words. Behavior Research Methods, 44(4):978–990, 2012. doi: 10.3758/s13428-012-0210-4
- [23] Marc Brysbaert and Andrew Biemiller. Test-based age-of-acquisition norms for 44 thousand English word meanings. Behavior Research Methods, 49(4):1520–1523, 2017. doi: 10.3758/s13428-016-0811-4
- [24] Edgar Dale and Joseph O'Rourke. The Living Word Vocabulary: A National Vocabulary Inventory. World Book, Chicago, 1981
- [25] Joshua Troche, Sebastian J. Crutch, and Jamie Reilly. Defining a conceptual topography of word concreteness: clustering properties of emotion, sensation, and magnitude among 750 English words. Frontiers in Psychology, 8:1787, 2017. doi: 10.3389/fpsyg.2017.01787
- [26] Graham G. Scott, Anne Keitel, Marc Becirspahic, Bo Yao, and Sara C. Sereno. The Glasgow Norms: ratings of 5,500 words on nine scales. Behavior Research Methods, 51(3):1258–1270, 2019. doi: 10.3758/s13428-018-1099-3
- [28] Tomas Engelthaler and Thomas T. Hills. Humor norms for 4,997 English words. Behavior Research Methods, 50(3):1116–1124, 2018. doi: 10.3758/s13428-017-0930-6
- [29] Veronica Diveica, Penny M. Pexman, and Richard J. Binney. Quantifying social semantics: An inclusive definition of socialness and ratings for 8,388 English words. Behavior Research Methods, 55(2):461–473, 2023. doi: 10.3758/s13428-022-01810-x
SI Appendix excerpts
Data validation pipeline
1. Rating extraction. Free-form text responses were parsed to extract the leading numeric value. Responses that could not be parsed to a valid number were marked as invalid.
2. Scale validation. Extracted ratings were validated against each norm's defined scale. Out-of-range values were flagged as outliers.
3. Bipolar valence combination. For the valence norm, the positive rating (0–3) minus the negative rating (0–3) was computed per (model, word, repetition) pair, yielding a single bipolar score in [-3, +3].
4. Repetition cap. Stochastic repetitions were capped at 5 per (model, norm, word) tuple.
5. Effective validity. An observation was marked as effectively valid if it was parseable, within scale, and not flagged by any prior exclusion step. After exclusion, the overall invalid/outlier rate was low (<3% per norm, mean <0.5%), and approximately 5.4 million valid observations per norm entered the LMM analysis (~74.9 million total across 14 norms).
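A minimal sketch of steps 1–2 (illustrative; the regex and function names are assumptions, not the authors' parsing code):

```python
import re

# Leading numeric value at the start of a free-form response,
# e.g. "3", "4.5 because ...", "-2 (slightly negative)".
LEADING_NUMBER = re.compile(r"^\s*(-?\d+(?:\.\d+)?)")

def extract_rating(text: str) -> float | None:
    """Step 1: parse the leading numeric value; None marks an invalid response."""
    m = LEADING_NUMBER.match(text)
    return float(m.group(1)) if m else None

def validate_rating(value: float | None, lo: float, hi: float) -> str:
    """Step 2: check the extracted value against the norm's defined scale."""
    if value is None:
        return "invalid"
    return "valid" if lo <= value <= hi else "outlier"

# Usage on a 1-7 scale:
for resp in ["5 - highly arousing", "six", "12"]:
    print(resp, "->", validate_rating(extract_rating(resp), 1, 7))
```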
Null-model simulation
To test whether the observed interaction variance exceeds chance, each null iteration:
1. Drew fresh random effects τ̃_i ∼ N(0, σ̂²_τ) and β̃_j ∼ N(0, σ̂²_β) from the distributions estimated by fitting the empirical model.
2. Generated synthetic data without an interaction term: ỹ_ijk = μ̂ + τ̃_i + β̃_j + ε_ijk, where ε_ijk ∼ N(0, σ̂²_ε).
3. Re-fit the full model (including the interaction term ι_ij) to the simulated data.
4. Recorded the spurious interaction variance σ²_ι,null.
The p-value for each norm is the proportion of null iterations where σ²_ι,null ≥ σ²_ι,observed. Each simulation iteration was run on the full dataset (~5.4M observations per norm across 10 models), preserving the complete crossed design without subsampling. Simulations were parallelized across norms (14 con…).
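The bootstrap logic, sketched on a balanced toy design (an assumption: the paper re-fits the full LMM at each iteration; here a closed-form moment estimate of σ²_ι stands in for that re-fit):

```python
import numpy as np

rng = np.random.default_rng(1)
a, b, n = 200, 10, 5  # words x models x repetitions (toy scale)
sd_tau, sd_beta, sd_eps = 1.0, 0.5, 0.8  # stand-ins for the estimated sigma-hats

def interaction_variance(y):
    """Moment estimate of sigma^2_iota on a balanced design
    (stand-in for re-fitting the full crossed LMM, step 3)."""
    m_cell = y.mean(axis=2)
    resid = (m_cell - m_cell.mean(1, keepdims=True)
             - m_cell.mean(0, keepdims=True) + y.mean())
    ms_inter = n * (resid ** 2).sum() / ((a - 1) * (b - 1))
    ms_eps = ((y - m_cell[..., None]) ** 2).sum() / (a * b * (n - 1))
    return max((ms_inter - ms_eps) / n, 0.0)

# "Observed" data that contains a genuine interaction component.
y_obs = (rng.normal(0, sd_tau, (a, 1, 1)) + rng.normal(0, sd_beta, (1, b, 1))
         + rng.normal(0, 0.6, (a, b, 1)) + rng.normal(0, sd_eps, (a, b, n)))
v_obs = interaction_variance(y_obs)

# Null iterations: fresh word and model effects, NO interaction term (steps 1-2).
null_draws = [
    interaction_variance(rng.normal(0, sd_tau, (a, 1, 1))
                         + rng.normal(0, sd_beta, (1, b, 1))
                         + rng.normal(0, sd_eps, (a, b, n)))
    for _ in range(200)
]

# p-value: proportion of null iterations at or above the observed variance (step 4).
p = np.mean(np.array(null_draws) >= v_obs)
print(f"observed sigma2_iota = {v_obs:.3f}, p = {p:.3f}")
```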
Cross-norm fingerprint analysis
For each model j and each held-out norm:
1. The best linear unbiased predictions (BLUPs) ι̂_ij on the held-out norm served as the target vector (one value per word in the shared vocabulary).
2. The same model's BLUPs on the remaining 13 norms served as the predictor matrix.
3. A within-model R² was computed via 5-fold cross-validated Ridge regression (regularization selected by internal CV).
4. Cross-model R² values were computed by predicting model j's BLUPs on the held-out norm from each other model j′'s BLUPs on the same 13 predictor norms.
The Specificity Ratio = within-model R² / mean cross-model R². A Specificity Ratio > 1 indicates that a model's deviations on one dimension are better predicted by its own deviations on other dimensions than by those of other models. The same analysis was applied to raw ratings (without removing trait and bias components) to test whether the specificity sign…
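A sketch of steps 3–4 on synthetic BLUP matrices (shapes, names, and the use of scikit-learn's RidgeCV are assumptions; the paper's own pipeline may differ):

```python
import numpy as np
from sklearn.linear_model import RidgeCV
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(2)
n_words, n_norms = 1000, 14

# Toy BLUP matrices (words x norms): model j and another model j' share a
# weak common signal, but model j's own fingerprint is coherent across norms.
latent = rng.normal(size=(n_words, n_norms))
blups_j = latent + 0.3 * rng.normal(size=(n_words, n_norms))       # model j
blups_jprime = 0.4 * latent + rng.normal(size=(n_words, n_norms))  # model j'

held_out = 0                                        # index of the held-out norm
y = blups_j[:, held_out]                            # target vector (step 1)
X_within = np.delete(blups_j, held_out, axis=1)     # model j's 13 other norms (step 2)
X_cross = np.delete(blups_jprime, held_out, axis=1) # model j''s 13 norms

ridge = RidgeCV(alphas=np.logspace(-3, 3, 13))      # regularization via internal CV

# 5-fold cross-validated R^2, within-model (step 3) and cross-model (step 4).
r2_within = cross_val_score(ridge, X_within, y, cv=5, scoring="r2").mean()
r2_cross = cross_val_score(ridge, X_cross, y, cv=5, scoring="r2").mean()

print(f"Specificity Ratio = {r2_within / r2_cross:.2f}")  # > 1: self-predicts best
```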