Machine individuality: Separating genuine idiosyncrasy from response bias in large language models
Pith reviewed 2026-05-10 07:54 UTC · model grok-4.3
The pith
Large language models exhibit stimulus-specific individuality accounting for 16.9 percent of response variance on average.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Crossed random-effects modeling applied to 74.9 million ratings from 10 open-weight LLMs, covering over 100,000 words across 14 psycholinguistic norms, attributes an average of 16.9 percent of variance to stimulus-specific individuality; this component robustly exceeds a statistical null model and forms coherent, model-unique fingerprints that support cross-norm prediction.
What carries the argument
Crossed random-effects models that decompose variance into stimulus, model, stimulus-by-model interaction, and residual components; the interaction term captures genuine idiosyncrasy, isolated from global response biases and stochastic noise.
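A minimal sketch of this decomposition on simulated, balanced data (illustrative: the toy scale, names, and the expected-mean-squares estimator are assumptions; the paper fits crossed LMMs in the lme4 tradition rather than closed-form ANOVA):

```python
import numpy as np

rng = np.random.default_rng(0)

# Balanced crossed design: a words x b models x n stochastic repetitions.
a, b, n = 200, 10, 5
# Ground-truth standard deviations for the four components (toy values).
sd_word, sd_model, sd_inter, sd_eps = 1.0, 0.5, 0.6, 0.8

word = rng.normal(0, sd_word, (a, 1, 1))    # stimulus main effect
model = rng.normal(0, sd_model, (1, b, 1))  # model main effect (global bias)
inter = rng.normal(0, sd_inter, (a, b, 1))  # stimulus-by-model "individuality"
y = word + model + inter + rng.normal(0, sd_eps, (a, b, n))

# Expected-mean-squares (method-of-moments) estimates for the balanced
# two-way random-effects model with replication.
grand = y.mean()
m_word, m_model, m_cell = y.mean((1, 2)), y.mean((0, 2)), y.mean(2)

ms_word = b * n * ((m_word - grand) ** 2).sum() / (a - 1)
ms_model = a * n * ((m_model - grand) ** 2).sum() / (b - 1)
ms_inter = (n * ((m_cell - m_word[:, None] - m_model[None, :] + grand) ** 2).sum()
            / ((a - 1) * (b - 1)))
ms_eps = ((y - m_cell[:, :, None]) ** 2).sum() / (a * b * (n - 1))

var_eps = ms_eps                             # residual noise
var_inter = (ms_inter - ms_eps) / n          # stimulus-by-model interaction
var_word = (ms_word - ms_inter) / (b * n)    # stimulus main effect
var_model = (ms_model - ms_inter) / (a * n)  # model main effect

total = var_word + var_model + var_inter + var_eps
print(f"stimulus-specific individuality: {var_inter / total:.1%} of variance")
```

On balanced data these closed-form estimates agree with an LMM fit; the paper's 74.9 million ratings are not perfectly balanced, which is why crossed mixed-effects fitting (lme4-style, see reference [12]) is used instead.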
If this is right
- Behavioral differences among LLMs cannot be attributed solely to response biases or stochastic noise.
- Each LLM possesses a coherent fingerprint that persists across different psycholinguistic norms.
- Model individuality can be quantified and used for cross-norm prediction.
- Standard psychometric profiling of LLMs must be supplemented by stimulus-specific variance partitioning.
Where Pith is reading between the lines
- If the individuality is stable, targeted fine-tuning on specific stimulus classes might selectively alter or preserve these fingerprints.
- Audits of LLM outputs in high-stakes domains could focus on stimulus-specific response patterns rather than aggregate bias measures.
- The same variance-partitioning approach could be extended to multimodal models or agentic systems to detect emergent individual traits.
Load-bearing premise
The crossed random-effects model fully isolates stimulus-specific effects from global response biases and stochastic noise without residual confounding from prompt formatting or model architecture.
What would settle it
Repeating the variance decomposition on the same models but with varied prompt formats or entirely new stimulus sets and finding that stimulus-specific variance no longer exceeds the null model would falsify the central claim.
Original abstract
As large language models (LLMs) are increasingly integrated into daily life, in roles ranging from high-stakes decision support to companionship, understanding their behavioral dispositions becomes critical. A growing literature uses psychometric inventories and cognitive paradigms to profile LLM dispositions. However, these approaches cannot determine whether behavioral differences reflect stable, stimulus-specific individuality or global response biases and stochastic noise. Here, we apply crossed random-effects models -- widely used in psychometrics to separate systematic effects -- to 74.9 million ratings provided by 10 open-weight LLMs for over 100,000 words across 14 psycholinguistic norms. On average, 16.9% of variance is attributable to stimulus-specific individuality, robustly exceeding a statistical null model. Cross-norm prediction analyses reveal this individuality as a coherent fingerprint, unique to each model. These results identify individual differences among LLMs that cannot be attributed to response biases or stochastic noise. We term these differences machine individuality.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper applies crossed random-effects models to 74.9 million ratings from 10 open-weight LLMs on over 100,000 words across 14 psycholinguistic norms. It reports that 16.9% of variance on average is attributable to stimulus-specific individuality (exceeding a statistical null), with cross-norm prediction analyses showing this as a coherent, model-unique fingerprint that cannot be reduced to global response biases or stochastic noise. The authors term this 'machine individuality'.
Significance. If the variance decomposition validly isolates stimulus-specific effects, the work would be significant for LLM behavioral profiling: it scales psychometric variance-component methods to massive data and demonstrates that idiosyncrasies form stable, cross-norm fingerprints. The dataset size and use of standard mixed-effects procedures are clear strengths; the cross-norm coherence test adds falsifiability. However, the low verification confidence due to missing specification details limits immediate impact.
major comments (2)
- [Abstract and Methods] The crossed random-effects model is invoked to isolate the stimulus-by-model interaction (yielding the 16.9% figure), but no equation is given for the random-effects structure, nor is it stated whether fixed effects for the 14 norms or other covariates are included. Without this, it is impossible to confirm that the reported component excludes systematic confounds from model-specific tokenization, embedding geometry, or prompt phrasing, as the central claim requires.
- [Results] No convergence checks, sensitivity analyses for prompt reformulation, or robustness tests against architecture-specific parsing differences are reported. These are load-bearing: the skeptic's concern (that architecture or prompt effects may be misattributed to stimulus individuality) cannot be evaluated without them, and the 16.9% figure and cross-norm fingerprint claim rest on the assumption that the model fully separates these factors.
minor comments (2)
- [Results] The exact form of the statistical null model (which the 16.9% figure is claimed to exceed) should be stated explicitly, including how it was simulated or fitted, to permit direct replication.
- A table or figure summarizing per-norm or per-model variance components would improve clarity and allow readers to assess heterogeneity in the 16.9% average.
Simulated Author's Rebuttal
We thank the referee for their constructive comments, which highlight important aspects of methodological transparency. We address each major comment below and will incorporate clarifications and additional analyses in the revised manuscript.
Point-by-point responses
Referee: [Abstract and Methods] The crossed random-effects model is invoked to isolate the stimulus-by-model interaction (yielding the 16.9% figure), but no equation is given for the random-effects structure, nor is it stated whether fixed effects for the 14 norms or other covariates are included. Without this, it is impossible to confirm that the reported component excludes systematic confounds from model-specific tokenization, embedding geometry, or prompt phrasing, as the central claim requires.
Authors: We agree that the absence of the explicit model equation limits verifiability. In the revised Methods section we will provide the full crossed random-effects specification, including the equation with random intercepts for stimulus, model, and their interaction, together with fixed effects for the 14 norms (to absorb norm-specific baselines) and any other covariates. This structure ensures the stimulus-by-model variance component isolates stimulus-specific individuality net of the listed confounds. revision: yes
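For concreteness, a sketch of one specification consistent with this response (an assumption for illustration; the exact fixed-effects structure is the authors' to confirm), with y_ijk the k-th rating of word i by model j within a given norm:

```latex
% Illustrative crossed random-effects specification (assumed form, not the
% paper's confirmed model): word effect tau_i, model effect beta_j (a global
% response bias), and stimulus-by-model interaction iota_ij, whose variance
% share yields the 16.9% figure.
\begin{align*}
  y_{ijk} &= \mu + \tau_i + \beta_j + \iota_{ij} + \varepsilon_{ijk},\\
  \tau_i &\sim \mathcal{N}(0,\sigma^2_{\tau}), \quad
  \beta_j \sim \mathcal{N}(0,\sigma^2_{\beta}), \quad
  \iota_{ij} \sim \mathcal{N}(0,\sigma^2_{\iota}), \quad
  \varepsilon_{ijk} \sim \mathcal{N}(0,\sigma^2_{\varepsilon}).
\end{align*}
```

Under this form the reported share is σ²_ι / (σ²_τ + σ²_β + σ²_ι + σ²_ε), with norm baselines absorbed either by per-norm fitting (as in the SI excerpts below) or by fixed effects for the 14 norms.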
Referee: [Results] No convergence checks, sensitivity analyses for prompt reformulation, or robustness tests against architecture-specific parsing differences are reported. These are load-bearing: the skeptic's concern (that architecture or prompt effects may be misattributed to stimulus individuality) cannot be evaluated without them, and the 16.9% figure and cross-norm fingerprint claim rest on the assumption that the model fully separates these factors.
Authors: We acknowledge that explicit convergence diagnostics and robustness checks would strengthen the central claims. In the revision we will add model convergence reports (singularity checks, optimizer status) for the primary analyses and include new sensitivity results for prompt reformulations and architecture-specific parsing variations, demonstrating that the 16.9% estimate and cross-norm fingerprints remain stable. revision: yes
Circularity Check
No circularity: standard variance decomposition on observed ratings
Full rationale
The paper applies crossed random-effects models to decompose 74.9 million observed ratings into variance components, reporting that 16.9% is attributable to stimulus-specific individuality after accounting for model main effects and residuals. This percentage is a direct output of fitting the statistical model to the data and comparing it to a null model; it does not reduce to any input by construction, nor is it defined in terms of itself. Cross-norm prediction analyses use the estimated random effects to assess coherence but do not create a self-referential loop. No load-bearing self-citations, uniqueness theorems, or smuggled ansatzes are invoked to justify the central claim. The derivation relies on standard psychometric mixed-effects procedures applied to external rating data and is therefore self-contained.
Axiom & Free-Parameter Ledger
axioms (1)
- Domain assumption: LLM ratings on psycholinguistic norms reflect meaningful responses that can be decomposed into stimulus and rater effects.
Reference graph
Works this paper leans on
- [1] Gregory Serapio-García, Mustafa Safdari, Clément Crepy, Luning Sun, Stephen Fitz, Peter Romero, Marwa Abdulhai, Aleksandra Faust, and Maja Matarić. A psychometric framework for evaluating and shaping personality traits in large language models. Nature Machine Intelligence, 7:1954–1968, 2025. doi: 10.1038/s42256-025-01115-6
- [2] Max Pellert, Clemens M. Lechner, Claudia Wagner, Beatrice Rammstedt, and Markus Strohmaier. AI psychometrics: Assessing the psychological profiles of large language models through psychometric inventories. Perspect. Psychol. Sci., 19(5):808–826, 2024. doi: 10.1177/17456916231214460
- [3] Marcel Binz and Eric Schulz. Using cognitive psychology to understand GPT-3. Proc. Natl. Acad. Sci. U.S.A., 120(6):e2218523120, 2023. doi: 10.1073/pnas.2218523120
- [4] Thilo Hagendorff, Ishita Dasgupta, Marcel Binz, Stephanie C. Y. Chan, Andrew Lampinen, Jane X. Wang, Zeynep Akata, and Eric Schulz. Machine psychology: Investigating emergent capabilities and behavior in large language models using psychological methods. arXiv [Preprint], 2023. https://arxiv.org/abs/2303.13988 (accessed 2 April 2026)
- [5] Mingjie Sun, Yida Yin, Zhiqiu Xu, J. Zico Kolter, and Zhuang Liu. Idiosyncrasies in large language models. arXiv [Preprint], 2025. https://arxiv.org/abs/2502.12150 (accessed 2 April 2026)
- [6] Melanie Sclar, Yejin Choi, Yulia Tsvetkov, and Alane Suhr. Quantifying language models' sensitivity to spurious features in prompt design or: How I learned to start worrying about prompt formatting. In The Twelfth International Conference on Learning Representations, 2024. URL https://openreview.net/forum?id=RIu5lyNXjT
- [8] Akshat Gupta, Xiaoyang Song, and Gopala Anumanchipalli. Self-assessment tests are unreliable measures of LLM personality. In Proceedings of the 7th BlackboxNLP Workshop: Analyzing and Interpreting Neural Networks for NLP, pages 301–314, Miami, Florida, US, 2024. Association for Computational Linguistics
- [9] Xiaoyu Li, Haoran Shi, Zengyi Yu, Yukun Tu, and Chanjin Zheng. Decoding LLM personality measurement: Forced-choice vs. Likert. In Findings of the Association for Computational Linguistics: ACL 2025, pages 9234–9247, Vienna, Austria, 2025. Association for Computational Linguistics
- [10] Delroy L. Paulhus. Measurement and control of response bias. In John P. Robinson, Phillip R. Shaver, and Lawrence S. Wrightsman, editors, Measures of Personality and Social Psychological Attitudes, pages 17–59. Academic Press, San Diego, CA, 1991. doi: 10.1016/B978-0-12-590241-0.50006-X
- [11] Eunike Wetzel, Jan R. Böhnke, and Anna Brown. Response biases. In Frederick T. L. Leong, Dave Bartram, Fanny M. Cheung, Kurt F. Geisinger, and Dragos Iliescu, editors, The ITC International Handbook of Testing and Assessment, pages 349–363. Oxford University Press, New York, 2016. doi: 10.1093/med:psych/9780199356942.003.0024
- [12] Douglas Bates, Martin Mächler, Ben Bolker, and Steve Walker. Fitting linear mixed-effects models using lme4. Journal of Statistical Software, 67(1):1–48, 2015. doi: 10.18637/jss.v067.i01
- [13] Tom Sühr, Florian E. Dorner, Olawale Salaudeen, Augustin Kelava, and Samira Samadi. Stop evaluating AI with human tests, develop principled, AI-specific tests instead. arXiv [Preprint], 2025. https://arxiv.org/abs/2507.23009 (accessed 2 April 2026)
- [14] Liwei Jiang, Yuanjun Chai, Margaret Li, Mickel Liu, Raymond Fok, Nouha Dziri, Yulia Tsvetkov, Maarten Sap, Alon Albalak, and Yejin Choi. Artificial hivemind: The open-ended homogeneity of language models (and beyond). arXiv [Preprint], 2025. https://arxiv.org/abs/2510.22954 (accessed 2 April 2026)
- [15] Sagar Kumar, Ariel Flint, Luca Maria Aiello, and Andrea Baronchelli. Failure of contextual invariance in gender inference with large language models. arXiv [Preprint], 2026. https://arxiv.org/abs/2603.23485 (accessed 2 April 2026)
- [16] Zak Hussain, Rui Mata, Ben R. Newell, and Dirk U. Wulff. Probing the contents of semantic representations from text, behavior, and brain data using the psychNorms metabase. arXiv [Preprint], 2024. https://arxiv.org/abs/2412.04936 (accessed 2 April 2026)
- [17] Amy Beth Warriner, Victor Kuperman, and Marc Brysbaert. Norms of valence, arousal, and dominance for 13,915 English lemmas. Behavior Research Methods, 45(4):1191–1207, 2013. doi: 10.3758/s13428-012-0314-x
- [18] Marc Brysbaert, Amy Beth Warriner, and Victor Kuperman. Concreteness ratings for 40 thousand generally known English word lemmas. Behavior Research Methods, 46(3):904–911, 2014. doi: 10.3758/s13428-013-0403-5
- [20] Saif M. Mohammad. Obtaining reliable human ratings of valence, arousal, and dominance for 20,000 English words. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 174–184. Association for Computational Linguistics, 2018. doi: 10.18653/v1/P18-1017
- [21] Dermot Lynott, Louise Connell, Marc Brysbaert, James Brand, and James Carney. The Lancaster Sensorimotor Norms: multidimensional measures of perceptual and action strength for 40,000 English words. Behavior Research Methods, 52(3):1271–1291, 2020. doi: 10.3758/s13428-019-01316-z
- [22] Victor Kuperman, Hans Stadthagen-Gonzalez, and Marc Brysbaert. Age-of-acquisition ratings for 30,000 English words. Behavior Research Methods, 44(4):978–990, 2012. doi: 10.3758/s13428-012-0210-4
- [23] Marc Brysbaert and Andrew Biemiller. Test-based age-of-acquisition norms for 44 thousand English word meanings. Behavior Research Methods, 49(4):1520–1523, 2017. doi: 10.3758/s13428-016-0811-4
- [24] Edgar Dale and Joseph O'Rourke. The Living Word Vocabulary: A National Vocabulary Inventory. World Book, Chicago, 1981
- [25] Joshua Troche, Sebastian J. Crutch, and Jamie Reilly. Defining a conceptual topography of word concreteness: clustering properties of emotion, sensation, and magnitude among 750 English words. Frontiers in Psychology, 8:1787, 2017. doi: 10.3389/fpsyg.2017.01787
- [26] Graham G. Scott, Anne Keitel, Marc Becirspahic, Bo Yao, and Sara C. Sereno. The Glasgow Norms: ratings of 5,500 words on nine scales. Behavior Research Methods, 51(3):1258–1270, 2019. doi: 10.3758/s13428-018-1099-3
- [28] Tomas Engelthaler and Thomas T. Hills. Humor norms for 4,997 English words. Behavior Research Methods, 50(3):1116–1124, 2018. doi: 10.3758/s13428-017-0930-6
- [29] Veronica Diveica, Penny M. Pexman, and Richard J. Binney. Quantifying social semantics: An inclusive definition of socialness and ratings for 8,388 English words. Behavior Research Methods, 55(2):461–473, 2023. doi: 10.3758/s13428-022-01810-x
SI Appendix excerpts
Data validation pipeline
1. Rating extraction. Free-form text responses were parsed to extract the leading numeric value. Responses that could not be parsed to a valid number were marked as invalid.
2. Scale validation. Extracted ratings were validated against each norm's defined scale. Out-of-range values were flagged as outliers.
3. Bipolar valence combination. For the valence norm, the positive rating (0–3) minus the negative rating (0–3) was computed per (model, word, repetition) pair, yielding a single bipolar score in [-3, +3].
4. Repetition cap. Stochastic repetitions were capped at 5 per (model, norm, word) tuple.
5. Effective validity. An observation was marked as effectively valid if it was parseable, within scale, and not flagged by any prior exclusion step. After exclusion, the overall invalid/outlier rate was low (<3% per norm, mean <0.5%), and approximately 5.4 million valid observations per norm entered the LMM analysis (~74.9 million total across 14 norms).
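A minimal sketch of steps 1–2 (illustrative; the regex and function names are assumptions, not the authors' parsing code):

```python
import re

# Leading numeric value at the start of a free-form response,
# e.g. "3", "4.5 because ...", "-2 (slightly negative)".
LEADING_NUMBER = re.compile(r"^\s*(-?\d+(?:\.\d+)?)")

def extract_rating(text: str) -> float | None:
    """Step 1: parse the leading numeric value; None marks an invalid response."""
    m = LEADING_NUMBER.match(text)
    return float(m.group(1)) if m else None

def validate_rating(value: float | None, lo: float, hi: float) -> str:
    """Step 2: check the extracted value against the norm's defined scale."""
    if value is None:
        return "invalid"
    return "valid" if lo <= value <= hi else "outlier"

# Usage on a 1-7 scale:
for resp in ["5 - highly arousing", "six", "12"]:
    print(resp, "->", validate_rating(extract_rating(resp), 1, 7))
```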
Null-model simulation
To test whether the observed interaction variance exceeds chance, each null iteration:
1. Drew fresh random effects τ̃_i ∼ N(0, σ̂²_τ) and β̃_j ∼ N(0, σ̂²_β) from the distributions estimated by fitting the empirical model.
2. Generated synthetic data without an interaction term: ỹ_ijk = μ̂ + τ̃_i + β̃_j + ε_ijk, where ε_ijk ∼ N(0, σ̂²_ε).
3. Re-fit the full model (including the interaction term ι_ij) to the simulated data.
4. Recorded the spurious interaction variance σ²_ι,null.
The p-value for each norm is the proportion of null iterations where σ²_ι,null ≥ σ²_ι,observed. Each simulation iteration was run on the full dataset (~5.4M observations per norm across 10 models), preserving the complete crossed design without subsampling. Simulations were parallelized across norms (14 con…).
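The bootstrap logic, sketched on a balanced toy design (an assumption: the paper re-fits the full LMM at each iteration; here a closed-form moment estimate of σ²_ι stands in for that re-fit):

```python
import numpy as np

rng = np.random.default_rng(1)
a, b, n = 200, 10, 5  # words x models x repetitions (toy scale)
sd_tau, sd_beta, sd_eps = 1.0, 0.5, 0.8  # stand-ins for the estimated sigma-hats

def interaction_variance(y):
    """Moment estimate of sigma^2_iota on a balanced design
    (stand-in for re-fitting the full crossed LMM, step 3)."""
    m_cell = y.mean(axis=2)
    resid = (m_cell - m_cell.mean(1, keepdims=True)
             - m_cell.mean(0, keepdims=True) + y.mean())
    ms_inter = n * (resid ** 2).sum() / ((a - 1) * (b - 1))
    ms_eps = ((y - m_cell[..., None]) ** 2).sum() / (a * b * (n - 1))
    return max((ms_inter - ms_eps) / n, 0.0)

# "Observed" data that contains a genuine interaction component.
y_obs = (rng.normal(0, sd_tau, (a, 1, 1)) + rng.normal(0, sd_beta, (1, b, 1))
         + rng.normal(0, 0.6, (a, b, 1)) + rng.normal(0, sd_eps, (a, b, n)))
v_obs = interaction_variance(y_obs)

# Null iterations: fresh word and model effects, NO interaction term (steps 1-2).
null_draws = [
    interaction_variance(rng.normal(0, sd_tau, (a, 1, 1))
                         + rng.normal(0, sd_beta, (1, b, 1))
                         + rng.normal(0, sd_eps, (a, b, n)))
    for _ in range(200)
]

# p-value: proportion of null iterations at or above the observed variance (step 4).
p = np.mean(np.array(null_draws) >= v_obs)
print(f"observed sigma2_iota = {v_obs:.3f}, p = {p:.3f}")
```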
Cross-norm fingerprint analysis
For each model j and each held-out norm:
1. The best linear unbiased predictions (BLUPs) ι̂_ij on the held-out norm served as the target vector (one value per word in the shared vocabulary).
2. The same model's BLUPs on the remaining 13 norms served as the predictor matrix.
3. A within-model R² was computed via 5-fold cross-validated Ridge regression (regularization selected by internal CV).
4. Cross-model R² values were computed by predicting model j's BLUPs on the held-out norm from each other model j′'s BLUPs on the same 13 predictor norms.
The Specificity Ratio = within-model R² / mean cross-model R². A Specificity Ratio > 1 indicates that a model's deviations on one dimension are better predicted by its own deviations on other dimensions than by those of other models. The same analysis was applied to raw ratings (without removing trait and bias components) to test whether the specificity sign…
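A sketch of steps 3–4 on synthetic BLUP matrices (shapes, names, and the use of scikit-learn's RidgeCV are assumptions; the paper's own pipeline may differ):

```python
import numpy as np
from sklearn.linear_model import RidgeCV
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(2)
n_words, n_norms = 1000, 14

# Toy BLUP matrices (words x norms): model j and another model j' share a
# weak common signal, but model j's own fingerprint is coherent across norms.
latent = rng.normal(size=(n_words, n_norms))
blups_j = latent + 0.3 * rng.normal(size=(n_words, n_norms))       # model j
blups_jprime = 0.4 * latent + rng.normal(size=(n_words, n_norms))  # model j'

held_out = 0                                        # index of the held-out norm
y = blups_j[:, held_out]                            # target vector (step 1)
X_within = np.delete(blups_j, held_out, axis=1)     # model j's 13 other norms (step 2)
X_cross = np.delete(blups_jprime, held_out, axis=1) # model j''s 13 norms

ridge = RidgeCV(alphas=np.logspace(-3, 3, 13))      # regularization via internal CV

# 5-fold cross-validated R^2, within-model (step 3) and cross-model (step 4).
r2_within = cross_val_score(ridge, X_within, y, cv=5, scoring="r2").mean()
r2_cross = cross_val_score(ridge, X_cross, y, cv=5, scoring="r2").mean()

print(f"Specificity Ratio = {r2_within / r2_cross:.2f}")  # > 1: self-predicts best
```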