pith. machine review for the scientific record.

arxiv: 2604.14463 · v1 · submitted 2026-04-15 · 💻 cs.CL


Psychological Steering of Large Language Models


Pith reviewed 2026-05-10 12:41 UTC · model grok-4.3

classification 💻 cs.CL
keywords: large language models, personality steering, OCEAN traits, mean-difference injections, activation interventions, representation engineering, IPIP-NEO-120

The pith

Mean-difference injections outperform prompting for steering OCEAN personality traits in most LLMs.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

This paper develops a framework for adjusting the personality traits of large language models by injecting calibrated vectors into their internal activation streams. The injections are derived from psychological questionnaires and expressed in semantically calibrated units, which allows unbounded strength sweeps while keeping outputs fluent. The main result is that mean-difference injections, computed from the activation difference between high- and low-trait responses, beat traditional personality prompting in 11 of 14 models tested, and a combination of both approaches performs best in nearly all cases. If this holds, it suggests that internal representation changes can offer more precise control over LLM behavior than input instructions alone.

Core claim

The paper establishes that mean-difference (MD) injections derived from the IPIP-NEO-120 personality questionnaire outperform Personality Prompting (P²) in open-ended generation tasks for OCEAN traits in 11 of 14 LLMs, achieving gains between 3.6% and 16.4%. A hybrid method combining P² and MD injections surpasses both individual methods in 13 of 14 LLMs. MD injections are shown to align with the Linear Representation Hypothesis, acting as approximately linear control mechanisms, while also producing trait covariance patterns that differ from the human Big Two model.

What carries the argument

Mean-difference (MD) injections: additive vectors in the residual stream, computed as the difference in mean activations between high- and low-scoring responses on OCEAN questionnaire items, expressed in semantically calibrated units and swept under a fluency constraint.
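As a concrete illustration, here is a minimal sketch of the mean-difference construction, with NumPy arrays standing in for real residual-stream activations; the unit normalization and injection site are assumptions for illustration, not the paper's exact recipe:

```python
import numpy as np

def mean_difference_vector(high_acts: np.ndarray, low_acts: np.ndarray) -> np.ndarray:
    """MD steering vector: mean residual-stream activation over high-trait
    responses minus the mean over low-trait responses, unit-normalized so
    the injection coefficient alpha sweeps in comparable units."""
    v = high_acts.mean(axis=0) - low_acts.mean(axis=0)
    return v / np.linalg.norm(v)

def inject(hidden: np.ndarray, v: np.ndarray, alpha: float) -> np.ndarray:
    """Additive injection into a (seq_len, hidden_dim) residual stream."""
    return hidden + alpha * v

# Toy demonstration with synthetic activations (hidden_dim = 64).
rng = np.random.default_rng(0)
high = rng.normal(0.5, 1.0, size=(200, 64))   # stand-in high-trait activations
low = rng.normal(-0.5, 1.0, size=(200, 64))   # stand-in low-trait activations
v = mean_difference_vector(high, low)
steered = inject(np.zeros((4, 64)), v, alpha=3.0)
```

In a real model, the high- and low-trait activations would be captured at a chosen layer via forward hooks while the model answers questionnaire items at each trait pole.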

If this is right

  • MD injections deliver superior trait steering compared to prompting in the majority of evaluated models.
  • Hybrid steering combining injections and prompts achieves the highest performance gains.
  • These injections provide roughly linear control over individual traits as predicted by representation hypotheses.
  • Steered models exhibit covariance among OCEAN traits that does not match standard human psychology models.
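The "roughly linear control" point is testable with a simple dose-response fit: sweep the injection strength α, score each run, and check how much variance a straight line explains. A synthetic sketch (the scores below are simulated, not the paper's data):

```python
import numpy as np

# Simulated dose-response: trait score vs. injection strength alpha.
rng = np.random.default_rng(1)
alphas = np.linspace(0.0, 8.0, 10)                       # 0 = no steering
scores = 2.5 + 0.35 * alphas + rng.normal(0, 0.05, alphas.size)

# Linear fit and coefficient of determination (R^2).
slope, intercept = np.polyfit(alphas, scores, 1)
pred = slope * alphas + intercept
r_squared = 1 - np.sum((scores - pred) ** 2) / np.sum((scores - scores.mean()) ** 2)
```

An R² near 1 over the swept range would support a linear control knob; a real evaluation would substitute the measured SJT trait scores for the simulated ones.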

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Similar injection techniques might extend to steering other behavioral attributes like decision-making styles or emotional responses.
  • Developers could use this to create LLMs with more consistent and targeted personalities for specific applications such as customer service or education.
  • The observed differences in trait covariances could point to unique ways LLMs organize psychological concepts compared to humans.
  • Further experiments might explore whether these methods reduce unwanted side effects like reduced fluency or coherence in long generations.

Load-bearing premise

That the OCEAN trait scores derived from analyzing LLM responses to the IPIP-NEO-120 questionnaire validly reflect the actual personality traits being steered, without distortions caused by the steering process itself.
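For context on what "validly scored" entails: Big Five inventories such as the IPIP-NEO-120 aggregate Likert ratings per trait, flipping reverse-keyed items before averaging. A generic scoring sketch (the item IDs and keys are hypothetical, not the actual instrument):

```python
def score_trait(responses: dict, keys: dict, scale_max: int = 5) -> float:
    """Mean Likert score for one trait. responses maps item id -> rating in
    1..scale_max; keys maps item id -> +1 (positively keyed) or -1
    (reverse-keyed, so the rating is flipped before averaging)."""
    total = 0.0
    for item, rating in responses.items():
        total += rating if keys[item] == 1 else (scale_max + 1 - rating)
    return total / len(responses)

# Hypothetical openness items: two positively keyed, one reverse-keyed.
responses = {"O1": 5, "O2": 4, "O3": 2}
keys = {"O1": 1, "O2": 1, "O3": -1}
openness = score_trait(responses, keys)  # (5 + 4 + (6 - 2)) / 3
```

The premise above is that ratings extracted from steered open-ended text feed this kind of aggregation without the steering process itself biasing the extraction.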

What would settle it

A study that measures the steered traits using an independent method, such as direct behavioral tasks or a different questionnaire not used in the calibration, and finds no advantage for MD injections over prompting would disprove the main performance claims.

Figures

Figures reproduced from arXiv: 2604.14463 by Emilio Ferrara, Leonardo Blas, Robin Jia.

Figure 1. Llama-3.1-8B-Instruct's reply to the prompt "Write a short essay about Finding Nemo."
Figure 2. Ranking of steering methods on Qwen3-8B by OCEAN trait and steering direction, based on each method's best SJT (open-ended test) scores; asterisks denote ties. Overall, PM, a hybrid of P² and MDS injections, performs best in 13 of 14 LLMs.
Figure 3. PCA projection of our and Perez et al. [49]'s 1,000 statement embeddings (C↑ denotes conscientiousness and C↓ its antithesis). By centroid cosine similarity, our C↑ and C↓ clusters are 92.79% and 94.00% similar, respectively.
Figure 4. Layerwise openness extreme steering scores on the SJTs task after applying openness MDS injections with injection stride s = 1 on Olmo-3-7B-Instruct. Stars mark the strongest steering effect (ϕ1,O,d). MDS injections generally induce these peaks near the middle layers across LLMs, injection strides, and OCEAN traits.
Figure 5. Overall MDS injection steering performance on Llama-3.2-3B-Instruct on the SJTs task by injection stride s and model layer ℓ. Under s = 1, the same vector with the same α coefficient was injected on every completion activation; with s = 2, on every other activation; with s = 3, on every third activation; and with s = 4, on every fourth activation. In general, injecting into more completion activations …
Figure 6. Mean OCEAN SJT scores for gemma-3-12b-it, under an openness MDS injection with s = 1. The selected layer ℓ = 26 is the best for increasing openness, and the selected α values are 10 equidistant points ranging from 0 (no steering) to the best α = 8.
Figure 7. PCA projection of 500 conscientious (red) and 500 non-conscientious (blue) Qwen3-1.7B …
Figure 8. Ranking of steering methods on Llama-3.2-1B-Instruct by OCEAN trait and direction, and …
Figure 9. Overall MDS injections steering performance on Llama-3.2-1B-Instruct by injection stride …
Figure 10. Layerwise extreme OCEAN steering scores on the SJTs task by direction …
Figure 11. OCEAN scores for Llama-3.2-1B-Instruct on SJTs, under MDS injections with …
Figure 12. Ranking of steering methods on Llama-3.2-3B-Instruct by OCEAN trait and direction, …
Figure 13. Overall MDS injections steering performance on Llama-3.2-3B-Instruct by injection stride …
Figure 14. Layerwise extreme OCEAN steering scores on the SJTs task by direction …
Figure 15. OCEAN scores for Llama-3.2-3B-Instruct on SJTs, under MDS injections with …
Figure 16. Ranking of steering methods on Llama-3.1-8B-Instruct by OCEAN trait and direction, …
Figure 17. Overall MDS injections steering performance on Llama-3.1-8B-Instruct by injection stride …
Figure 18. Layerwise extreme OCEAN steering scores on the SJTs task by direction …
Figure 19. OCEAN scores for Llama-3.1-8B-Instruct on SJTs, under MDS injections with …
Figure 20. Ranking of steering methods on Qwen3-1.7B by OCEAN trait and direction, and task.
Figure 21. Overall MDS injections steering performance on Qwen3-1.7B by injection stride …
Figure 22. Layerwise extreme OCEAN steering scores on the SJTs task by direction …
Figure 23. OCEAN scores for Qwen3-1.7B on SJTs, under MDS injections with …
Figure 24. Ranking of steering methods on Qwen3-4B by OCEAN trait and direction, and task.
Figure 25. Overall MDS injections steering performance on Qwen3-4B by injection stride …
Figure 26. Layerwise extreme OCEAN steering scores on the SJTs task by direction …
Figure 27. OCEAN scores for Qwen3-4B on SJTs, under MDS injections with …
Figure 28. Ranking of steering methods on Qwen3-8B by OCEAN trait and direction, and task.
Figure 29. Overall MDS injections steering performance on Qwen3-8B by injection stride …
Figure 30. Layerwise extreme OCEAN steering scores on the SJTs task by direction …
Figure 31. OCEAN scores for Qwen3-8B on SJTs, under MDS injections with …
Figure 32. Ranking of steering methods on Qwen3-14B by OCEAN trait and direction, and task.
Figure 33. Overall MDS injections steering performance on Qwen3-14B by injection stride …
Figure 34. Layerwise extreme OCEAN steering scores on the SJTs task by direction …
Figure 35. OCEAN scores for Qwen3-14B on SJTs, under MDS injections with …
Figure 36. Ranking of steering methods on Qwen3-32B by OCEAN trait and direction, and task.
Figure 37. Overall MDS injections steering performance on Qwen3-32B by injection stride …
Figure 38. Layerwise extreme OCEAN steering scores on the SJTs task by direction …
Figure 39. OCEAN scores for Qwen3-32B on SJTs, under MDS injections with …
Figure 40. Ranking of steering methods on gemma-3-1b-it by OCEAN trait and direction, and task.
Figure 41. Ranking of steering methods on gemma-3-4b-it by OCEAN trait and direction, and task.
Figure 42. Overall MDS injections steering performance on gemma-3-4b-it by injection stride …
Figure 43. Layerwise extreme OCEAN steering scores on the SJTs task by direction …
Figure 44. OCEAN scores for gemma-3-4b-it on SJTs, under MDS injections with …
Figure 45. Ranking of steering methods on gemma-3-12b-it by OCEAN trait and direction, and task.
Figure 46. Overall MDS injections steering performance on gemma-3-12b-it by injection stride …
Figure 47. Layerwise extreme OCEAN steering scores on the SJTs task by direction …
Figure 48. OCEAN scores for gemma-3-12b-it on SJTs, under MDS injections with …
Figure 49. Ranking of steering methods on gemma-3-27b-it by OCEAN trait and direction, and task.
Figure 50. Overall MDS injections steering performance on gemma-3-27b-it by injection stride …
Figure 51. Layerwise extreme OCEAN steering scores on the SJTs task by direction …
Figure 52. OCEAN scores for gemma-3-27b-it on SJTs, under MDS injections with …
Figure 53. Ranking of steering methods on Olmo-3-7B-Instruct by OCEAN trait and direction, and …
Figure 54. Overall MDS injections steering performance on Olmo-3-7B-Instruct by injection stride …
Figure 55. Layerwise extreme OCEAN steering scores on the SJTs task by direction …
Figure 56. OCEAN scores for Olmo-3-7B-Instruct on SJTs, under MDS injections with …
Figure 57. Ranking of steering methods on Olmo-3.1-32B-Instruct by OCEAN trait and direction, …
Figure 58. Overall MDS injections steering performance on Olmo-3.1-32B-Instruct by injection …
Figure 59. Layerwise extreme OCEAN steering scores on the SJTs task by direction …
Figure 60. OCEAN scores for Olmo-3.1-32B-Instruct on SJTs, under MDS injections with …
Original abstract

Large language models (LLMs) emulate a consistent human-like behavior that can be shaped through activation-level interventions. This paradigm is converging on additive residual-stream injections, which rely on injection-strength sweeps to approximate optimal intervention settings. However, existing methods restrict the search space and sweep in uncalibrated activation-space units, potentially missing optimal intervention conditions. Thus, we introduce a psychological steering framework that performs unbounded, fluency-constrained sweeps in semantically calibrated units. Our method derives and calibrates residual-stream injections using psychological artifacts, and we use the IPIP-NEO-120, which measures the OCEAN personality model, to compare six injection methods. We find that mean-difference (MD) injections outperform Personality Prompting (P²), an established baseline for OCEAN steering, in open-ended generation in 11 of 14 LLMs, with gains of 3.6% to 16.4%, overturning prior reports favoring prompting and positioning representation engineering as a new frontier in open-ended psychological steering. Further, we find that a hybrid of P² and MD injections outperforms both methods in 13 of 14 LLMs, with gains over P² ranging from 5.6% to 21.9% and from 3.3% to 26.7% over MD injections. Finally, we show that MD injections align with the Linear Representation Hypothesis and provide reliable, approximately linear control knobs for psychological steering. Nevertheless, they also induce OCEAN trait covariance patterns that depart from the Big Two model, suggesting a gap between learned representations and human psychology.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

3 major / 2 minor

Summary. The manuscript introduces a psychological steering framework for LLMs that performs unbounded, fluency-constrained sweeps of additive residual-stream injections in semantically calibrated units derived from psychological artifacts. Using scores from the IPIP-NEO-120 questionnaire extracted from open-ended generations, it compares six injection methods across 14 LLMs and reports that mean-difference (MD) injections outperform Personality Prompting (P²) in 11 of 14 models (gains 3.6%–16.4%), while a hybrid of P² and MD outperforms both in 13 of 14 models (gains 5.6%–21.9% over P² and 3.3%–26.7% over MD). The work further claims that MD injections align with the Linear Representation Hypothesis, provide approximately linear control, and induce OCEAN trait covariance patterns that depart from the Big Two model.

Significance. If the empirical comparisons hold after addressing measurement concerns, the paper would be significant for establishing representation engineering as a competitive or superior alternative to prompting for open-ended psychological steering in LLMs. The large-scale evaluation across 14 models, the hybrid method's consistent gains, and the explicit linkage to the Linear Representation Hypothesis provide concrete evidence that could shift research focus toward activation-level interventions. The observation of non-human-like covariance patterns also offers a useful cautionary note on the limits of mapping LLM representations to human psychological models.

major comments (3)
  1. [Results section (performance tables and extraction procedure)] The headline results (MD outperforming P² in 11/14 models and hybrid in 13/14) depend entirely on IPIP-NEO-120 trait scores extracted from steered open-ended generations. The manuscript provides no consistency checks, cross-method validation, or external corroboration (e.g., human ratings) to demonstrate that the extraction procedure yields unbiased scores rather than method-specific artifacts that could systematically favor MD injections over P². This is load-bearing for the central claim.
  2. [Method section (steering framework and calibration)] The method section describes unbounded sweeps in 'semantically calibrated units' derived from psychological artifacts, but does not specify the exact calibration procedure, the fluency constraint implementation, or how the search avoids post-hoc selection effects. Without these details, it is unclear whether the reported superiority of MD and hybrid methods is robust or sensitive to implementation choices.
  3. [Discussion section (trait covariance analysis)] The discussion notes that MD injections produce OCEAN trait covariance patterns departing from the Big Two model, yet provides no quantitative measure of this departure (e.g., correlation matrices or statistical tests) or analysis of whether this undermines the use of OCEAN as the evaluation target.
minor comments (2)
  1. [Abstract] The abstract refers to 'six injection methods' without enumerating them; a brief list would improve readability.
  2. [Results tables] Performance tables reporting percentage gains would benefit from accompanying statistical significance tests or confidence intervals to support the cross-model claims.
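The quantitative departure requested in major comment 3 admits a simple audit: compare the correlation matrix of steered OCEAN scores against a reference matrix. A sketch with synthetic scores and an identity placeholder for the human-norm matrix (both are illustrative assumptions, not the paper's data):

```python
import numpy as np

rng = np.random.default_rng(2)
steered_scores = rng.normal(size=(500, 5))   # 500 generations x 5 OCEAN traits
human_norm = np.eye(5)                       # placeholder reference matrix

model_corr = np.corrcoef(steered_scores, rowvar=False)          # 5x5 trait correlations
departure = np.linalg.norm(model_corr - human_norm, ord="fro")  # Frobenius distance
```

A permutation-based procedure such as the Mantel test could attach a significance level to this distance.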

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for their constructive review and recommendation for major revision. We address each of the three major comments below, providing clarifications where possible and committing to specific revisions that will strengthen the empirical rigor and transparency of the manuscript without altering its core claims.

Point-by-point responses
  1. Referee: [Results section (performance tables and extraction procedure)] The headline results (MD outperforming P² in 11/14 models and hybrid in 13/14) depend entirely on IPIP-NEO-120 trait scores extracted from steered open-ended generations. The manuscript provides no consistency checks, cross-method validation, or external corroboration (e.g., human ratings) to demonstrate that the extraction procedure yields unbiased scores rather than method-specific artifacts that could systematically favor MD injections over P². This is load-bearing for the central claim.

    Authors: We agree that the validity of the IPIP-NEO-120 extraction from open-ended text is critical to the headline comparisons. The same extraction pipeline is applied uniformly to all methods, which preserves the validity of relative performance differences, but we acknowledge that method-specific artifacts cannot be ruled out without further checks. In the revised manuscript we will add: (1) consistency checks via repeated sampling with different seeds and reporting of score variance; (2) cross-validation against an alternative inventory (e.g., a short-form Big Five measure) on a subset of generations; and (3) a discussion of why semantic calibration of the injection vectors makes systematic bias toward MD unlikely. We will also note the absence of human ratings as a limitation and outline how future work could obtain them. revision: yes

  2. Referee: [Method section (steering framework and calibration)] The method section describes unbounded sweeps in 'semantically calibrated units' derived from psychological artifacts, but does not specify the exact calibration procedure, the fluency constraint implementation, or how the search avoids post-hoc selection effects. Without these details, it is unclear whether the reported superiority of MD and hybrid methods is robust or sensitive to implementation choices.

    Authors: We accept that the method section was insufficiently detailed. The calibration derives injection vectors as the mean-difference of residual-stream activations on high- versus low-trait psychological artifact prompts, normalized to unit length; the fluency constraint thresholds generations by a perplexity-based fluency score computed on a held-out set of neutral prompts; and the search reports the strength that maximizes the target trait score subject to the fluency threshold, with the threshold itself fixed before seeing test results. In the revision we will insert a dedicated subsection with the exact procedure, pseudocode, and all hyperparameters (including the fluency threshold value and validation-set size) so that the experiments are fully reproducible. revision: yes

  3. Referee: [Discussion section (trait covariance analysis)] The discussion notes that MD injections produce OCEAN trait covariance patterns departing from the Big Two model, yet provides no quantitative measure of this departure (e.g., correlation matrices or statistical tests) or analysis of whether this undermines the use of OCEAN as the evaluation target.

    Authors: We will expand the discussion with the requested quantitative analysis. The revised version will include: (i) the full 5×5 OCEAN correlation matrix for MD-steered generations across the 14 models; (ii) the corresponding human-norm matrix from the IPIP-NEO-120 validation data; and (iii) a statistical comparison (e.g., a test of matrix equality or Mantel test) quantifying the departure. We will also discuss the implications for OCEAN as an evaluation target, noting that while the departure indicates LLM representations do not perfectly mirror human trait structure, OCEAN remains a useful, standardized metric for measuring steering efficacy. revision: yes
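The search the authors describe, choosing the strength that maximizes the trait score subject to a fluency threshold fixed in advance, can be sketched as follows; `trait` and `fluency` are toy stand-ins, not the paper's actual scorers:

```python
import numpy as np

def fluency_constrained_sweep(alphas, trait_score, fluency_score, threshold):
    """Return (best_alpha, best_trait) among strengths whose fluency stays
    at or above a pre-fixed threshold; (None, -inf) if none qualify."""
    best_alpha, best_trait = None, -np.inf
    for a in alphas:
        if fluency_score(a) < threshold:
            continue  # generation too degraded at this strength
        s = trait_score(a)
        if s > best_trait:
            best_alpha, best_trait = a, s
    return best_alpha, best_trait

# Toy stand-ins: trait score rises with alpha while fluency decays.
trait = lambda a: 3.0 + 0.4 * a
fluency = lambda a: 1.0 - 0.1 * a
alpha_star, trait_star = fluency_constrained_sweep(
    np.linspace(0.0, 10.0, 21), trait, fluency, threshold=0.38)
```

Because the sweep is unbounded in principle, the fluency constraint is what keeps the search from drifting into degenerate high-α generations.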

Circularity Check

0 steps flagged

No circularity: empirical comparisons rest on external instrument

Full rationale

The paper's load-bearing results are direct empirical comparisons (MD outperforming P² in 11/14 LLMs, hybrid in 13/14) obtained by applying the independent IPIP-NEO-120 questionnaire to open-ended generations. No equations, fitted parameters, or self-citations are shown that would make the reported gains reduce to quantities defined in terms of the steering methods themselves. The calibration procedure and linear-representation finding are presented as methodological choices and observations, not as premises that tautologically produce the performance ordering. The derivation chain is therefore self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The evaluation rests on the applicability of the OCEAN model and IPIP-NEO-120 to LLM text; the linearity claim rests on the Linear Representation Hypothesis.

axioms (2)
  • domain assumption The IPIP-NEO-120 can be validly scored on LLM-generated text to measure OCEAN traits
    Used as the primary evaluation metric for all injection methods
  • domain assumption Residual-stream injections can be meaningfully calibrated in semantically meaningful units
    Foundation of the proposed psychological steering framework

pith-pipeline@v0.9.0 · 5582 in / 1379 out tokens · 32498 ms · 2026-05-10T12:41:28.625514+00:00 · methodology

discussion (0)


Reference graph

Works this paper leans on

79 extracted references · 30 canonical work pages · 12 internal anchors

  1. [1]

    Wiles, and Vlad Grankovsky

    Rumi Allbert, James K. Wiles, and Vlad Grankovsky. Identifying and manipulating personality traits in llms through activation engineering, 2025. URL https://arxiv.org/abs/2412. 10427

  2. [2]

    Empirical, theoretical, and practical advantages of the hexaco model of personality structure.Personality and social psychology review, 11(2):150–166, 2007

    Michael C Ashton and Kibeom Lee. Empirical, theoretical, and practical advantages of the hexaco model of personality structure.Personality and social psychology review, 11(2):150–166, 2007

  3. [3]

    The hexaco–60: A short measure of the major dimensions of personality.Journal of personality assessment, 91(4):340–345, 2009

    Michael C Ashton and Kibeom Lee. The hexaco–60: A short measure of the major dimensions of personality.Journal of personality assessment, 91(4):340–345, 2009

  4. [4]

    Tak, Fatemeh Bahrani, Anahita Bolourani, Leonardo Blas, Emilio Ferrara, Jonathan Gratch, and Sai Praneeth Karimireddy

    Amin Banayeeanzade, Ala N. Tak, Fatemeh Bahrani, Anahita Bolourani, Leonardo Blas, Emilio Ferrara, Jonathan Gratch, and Sai Praneeth Karimireddy. Psychological steering in llms: An evaluation of effectiveness and trustworthiness, 2025. URL https://arxiv.org/abs/2510. 04484

  5. [5]

    Activation-space personality steering: Hybrid layer selection for stable trait control in llms.arXiv preprint arXiv:2511.03738, 2025

    Pranav Bhandari, Nicolas Fay, Sanjeevan Selvaganapathy, Amitava Datta, Usman Naseem, and Mehwish Nasim. Activation-space personality steering: Hybrid layer selection for stable trait control in llms.arXiv preprint arXiv:2511.03738, 2025

  6. [6]

    Behavioral confirmation of everyday sadism.Psychological science, 24(11):2201–2209, 2013

    Erin E Buckels, Daniel N Jones, and Delroy L Paulhus. Behavioral confirmation of everyday sadism.Psychological science, 24(11):2201–2209, 2013

  7. [7]

    Persona Vectors: Monitoring and Controlling Character Traits in Language Models

    Runjin Chen, Andy Arditi, Henry Sleight, Owain Evans, and Jack Lindsey. Persona vectors: Monitoring and controlling character traits in language models, 2025. URL https://arxiv. org/abs/2507.21509

  8. [8]

    Moral foun- dations vignettes: A standardized stimulus database of scenarios based on moral foundations theory.Behavior research methods, 47(4):1178–1198, 2015

    Scott Clifford, Vijeth Iyengar, Roberto Cabeza, and Walter Sinnott-Armstrong. Moral foun- dations vignettes: A standardized stimulus database of scenarios based on moral foundations theory.Behavior research methods, 47(4):1178–1198, 2015

  9. [9]

    P-react: Synthesizing topic- adaptive reactions of personality traits via mixture of specialized LoRA experts

    Yuhao Dan, Jie Zhou, Qin Chen, Junfeng Tian, and Liang He. P-react: Synthesizing topic- adaptive reactions of personality traits via mixture of specialized LoRA experts. In Wanxiang Che, Joyce Nabende, Ekaterina Shutova, and Mohammad Taher Pilehvar, editors,Findings of the Association for Computational Linguistics: ACL 2025, pages 6342–6362, Vienna, Austria, July

  10. [11]

    Neuron based personality trait induction in large language models

    Jia Deng, Tianyi Tang, Yanbin Yin, Wenhao Yang, Xin Zhao, and Ji-Rong Wen. Neuron based personality trait induction in large language models. InThe Thirteenth International Conference on Learning Representations, ICLR 2025, Singapore, April 24-28, 2025. OpenReview.net, 2025. URLhttps://openreview.net/forum?id=LYHEY783Np

  11. [12]

    Higher-order factors of the big five predict conformity: Are there neuroses of health?Personality and Individual differences, 33(4):533–552, 2002

    Colin G DeYoung, Jordan B Peterson, and Daniel M Higgins. Higher-order factors of the big five predict conformity: Are there neuroses of health?Personality and Individual differences, 33(4):533–552, 2002

  12. [13]

    Higher-order factors of the big five.Journal of personality and social psychology, 73(6):1246, 1997

    John M Digman. Higher-order factors of the big five.Journal of personality and social psychology, 73(6):1246, 1997

  13. [14]

    Toy Models of Superposition

    Nelson Elhage, Tristan Hume, Catherine Olsson, Nicholas Schiefer, Tom Henighan, Shauna Kravec, Zac Hatfield-Dodds, Robert Lasenby, Dawn Drain, Carol Chen, et al. Toy models of superposition.arXiv preprint arXiv:2209.10652, 2022

  14. [15]

    From Pretraining Data to Language Models to Downstream Tasks: Tracking the Trails of Political Biases Leading to Unfair NLP Models

    Shangbin Feng, Chan Young Park, Yuhan Liu, and Yulia Tsvetkov. From pretraining data to language models to downstream tasks: Tracking the trails of political biases leading to unfair NLP models. In Anna Rogers, Jordan Boyd-Graber, and Naoaki Okazaki, editors,Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1:...

  15. [16]

    PERSONA: Dynamic and compositional inference-time personality control via activation vector algebra

    Xiachong Feng, Liang Zhao, Weihong Zhong, Yichong Huang, Yuxuan Gu, Lingpeng Kong, Xiaocheng Feng, and Bing Qin. PERSONA: Dynamic and compositional inference-time personality control via activation vector algebra. InThe Fourteenth International Conference on Learning Representations, 2026. URL https://openreview.net/forum?id=QZvGqaNBlU

  16. [17]

    Francis Galton. Measurement of character. Fortnightly review, May 1865-June 1934, 36(212): 179–185, 08 1884

  17. [18]

    Lewis R Goldberg. The structure of phenotypic personality traits. American psychologist, 48(1):26, 1993

  18. [19]

    Lewis R Goldberg et al. A broad-bandwidth, public domain, personality inventory measuring the lower-level facets of several five-factor models. Personality psychology in Europe, 7(1):7–28, 1999

  19. [20]

    Jesse Graham, Brian A Nosek, Jonathan Haidt, Ravi Iyer, Spassena Koleva, and Peter H Ditto. Mapping the moral domain. Journal of personality and social psychology, 101(2):366, 2011

  20. [21]

    Jesse Graham, Jonathan Haidt, Sena Koleva, Matt Motyl, Ravi Iyer, Sean P Wojcik, and Peter H Ditto. Moral foundations theory: The pragmatic validity of moral pluralism. In Advances in experimental social psychology, volume 47, pages 55–130. Elsevier, 2013

  21. [22]

    Aaron Grattafiori, Abhimanyu Dubey, Abhinav Jauhri, Abhinav Pandey, Abhishek Kadian, Ahmad Al-Dahle, Aiesha Letman, Akhil Mathur, Alan Schelten, Alex Vaughan, Amy Yang, Angela Fan, Anirudh Goyal, Anthony Hartshorn, Aobo Yang, Archi Mitra, Archie Sravankumar, Artem Korenev, Arthur Hinsvark, Arun Rao, Aston Zhang, Aurelien Rodriguez, Austen Gregerson, Ava...

  22. [23]

    Skyler Hallinan, Faeze Brahman, Ximing Lu, Jaehun Jung, Sean Welleck, and Yejin Choi. STEER: Unified style transfer with expert reinforcement. In Houda Bouamor, Juan Pino, and Kalika Bali, editors, Findings of the Association for Computational Linguistics: EMNLP 2023, pages 7546–7562, Singapore, December 2023. Association for Computational Linguistics...

  23. [24]

    Shawn Im, Changdae Oh, Zhen Fang, and Sharon Li. How do transformers learn to associate tokens: Gradient leading terms bring mechanistic interpretability. In The Fourteenth International Conference on Learning Representations, 2026. URL https://openreview.net/forum?id=A4Us8jxVGq

  24. [25]

    Guangyuan Jiang, Manjie Xu, Song-Chun Zhu, Wenjuan Han, Chi Zhang, and Yixin Zhu. Evaluating and inducing personality in pre-trained language models. Advances in Neural Information Processing Systems, 36:10622–10643, 2023

  25. [26]

    Hang Jiang, Xiajie Zhang, Xubo Cao, Cynthia Breazeal, Deb Roy, and Jad Kabbara. PersonaLLM: Investigating the ability of large language models to express personality traits. In Kevin Duh, Helena Gomez, and Steven Bethard, editors, Findings of the Association for Computational Linguistics: NAACL 2024, pages 3605–3627, Mexico City, Mexico, June 2024. Association for Computational Linguistics...

  26. [27]

    Liming Jiang, Fang Luo, and Xuetao Tian. AI as a partner in assessment: generating situational judgment tests with large language models. BMC psychology, 13(1):1315, 2025

  27. [28]

    John A Johnson. Measuring thirty facets of the five factor model with a 120-item public domain inventory: Development of the IPIP-NEO-120. Journal of research in personality, 51:78–89, 2014

  28. [29]

    Daniel N Jones and Delroy L Paulhus. Introducing the short dark triad (SD3): a brief measure of dark personality traits. Assessment, 21(1):28–41, 2014

  29. [30]

    Saketh Reddy Karra, Son The Nguyen, and Theja Tulabandhula. Estimating the personality of white-box language models, 2023. URL https://arxiv.org/abs/2204.12000

  30. [31]

    Been Kim, Martin Wattenberg, Justin Gilmer, Carrie Cai, James Wexler, Fernanda Viegas, et al. Interpretability beyond feature attribution: Quantitative testing with concept activation vectors (TCAV). In International conference on machine learning, pages 2668–2677. PMLR, 2018

  31. [32]

    Seungone Kim, Juyoung Suk, Shayne Longpre, Bill Yuchen Lin, Jamin Shin, Sean Welleck, Graham Neubig, Moontae Lee, Kyungjae Lee, and Minjoon Seo. Prometheus 2: An open source language model specialized in evaluating other language models. In Yaser Al-Onaizan, Mohit Bansal, and Yun-Nung Chen, editors, Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing. Association for Computational Linguistics. doi: 10.18653/v1/2024.emnlp-main.248. URL https://aclanthology.org/2024.emnlp-main.248/

  33. [34]

    Kalpesh Krishna, John Wieting, and Mohit Iyyer. Reformulating unsupervised style transfer as paraphrase generation. In Empirical Methods in Natural Language Processing, 2020

  34. [35]

    Seungbeen Lee, Seungwon Lim, Seungju Han, Giyeong Oh, Hyungjoo Chae, Jiwan Chung, Minju Kim, Beong-woo Kwak, Yeonsoo Lee, Dongha Lee, Jinyoung Yeo, and Youngjae Yu. Do LLMs have distinct and consistent personality? TRAIT: Personality testset designed for LLMs with psychometrics. In Luis Chiruzzo, Alan Ritter, and Lu Wang, editors, Findings of the Association...

  35. [36]

    Ronald F Levant, Ryon McDermott, Mike C Parent, Nuha Alshabani, James R Mahalik, and Joseph H Hammer. Development and evaluation of a new short form of the conformity to masculine norms inventory (CMNI-30). Journal of Counseling Psychology, 67(5):622, 2020

  36. [37]

    Kenneth Li, Oam Patel, Fernanda Viégas, Hanspeter Pfister, and Martin Wattenberg. Inference-time intervention: Eliciting truthful answers from a language model. In Thirty-seventh Conference on Neural Information Processing Systems, 2023. URL https://openreview.net/forum?id=aLLuYpn83y

  37. [38]

    Wenkai Li, Jiarui Liu, Andy Liu, Xuhui Zhou, Mona T. Diab, and Maarten Sap. BIG5-CHAT: Shaping LLM personalities through training on human-grounded data. In Wanxiang Che, Joyce Nabende, Ekaterina Shutova, and Mohammad Taher Pilehvar, editors, Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pa...

  38. [39]

    Shuai Liu and Jonathan May. Style transfer with multi-iteration preference optimization. In Luis Chiruzzo, Alan Ritter, and Lu Wang, editors, Proceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers), pages 2663–2681, Albuquerque, New Mexico, April 2025. Association for Computational Linguistics. ISBN 979-8-89176-189-6. doi: 10.18653/v1/2025.naacl-long.135. URL https://aclanthology.org/2025.naacl-long.135/

  40. [41]

    Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer Levy, Mike Lewis, Luke Zettlemoyer, and Veselin Stoyanov. RoBERTa: A robustly optimized BERT pretraining approach, 2019. URL https://arxiv.org/abs/1907.11692

  41. [42]

    James R Mahalik, Benjamin D Locke, Larry H Ludlow, Matthew A Diemer, Ryan PJ Scott, Michael Gottfried, and Gary Freitas. Development of the conformity to masculine norms inventory. Psychology of men & masculinity, 4(1):3, 2003

  42. [43]

    James R Mahalik, Elisabeth B Morray, Aimée Coonerty-Femiano, Larry H Ludlow, Suzanne M Slattery, and Andrew Smiler. Development of the conformity to feminine norms inventory. Sex Roles, 52(7):417–435, 2005

  43. [44]

    Robert R McCrae and Oliver P John. An introduction to the five-factor model and its applications. Journal of personality, 60(2):175–215, 1992

  44. [45]

    Tomas Mikolov, Wen-tau Yih, and Geoffrey Zweig. Linguistic regularities in continuous space word representations. In Lucy Vanderwende, Hal Daumé III, and Katrin Kirchhoff, editors, Proceedings of the 2013 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 746–751, Atlanta, Georgia,...

  45. [46]

    Team Olmo, Allyson Ettinger, Amanda Bertsch, Bailey Kuehl, David Graham, David Heineman, Dirk Groeneveld, Faeze Brahman, Finbarr Timbers, Hamish Ivison, Jacob Morrison, Jake Poznanski, Kyle Lo, Luca Soldaini, Matt Jordan, Mayee Chen, Michael Noukhovitch, Nathan Lambert, Pete Walsh, Pradeep Dasigi, Robert Berry, Saumya Malik, Saurabh Shah, Scott Geng,...

  46. [47]

    Janneke K Oostrom, Reinout E de Vries, and Mariska De Wit. Development and validation of a HEXACO situational judgment test. Human Performance, 32(1):1–29, 2019

  47. [48]

    Mike C Parent and Bonnie Moradi. An abbreviated tool for assessing feminine norm conformity: Psychometric properties of the conformity to feminine norms inventory–45. Psychological assessment, 23(4):958, 2011

  48. [49]

    Kiho Park, Yo Joong Choe, and Victor Veitch. The linear representation hypothesis and the geometry of large language models. In International Conference on Machine Learning, pages 39643–39666. PMLR, 2024

  49. [50]

    Delroy L Paulhus and Kevin M Williams. The dark triad of personality: Narcissism, machiavellianism, and psychopathy. Journal of Research in Personality, 36(6):556–563, 2002. ISSN 0092-6566. doi: https://doi.org/10.1016/S0092-6566(02)00505-6. URL https://www.sciencedirect.com/science/article/pii/S0092656602005056

  51. [52]

    James W Pennebaker and Laura A King. Linguistic styles: language use as an individual difference. Journal of personality and social psychology, 77(6):1296, 1999

  52. [53]

    In: Findings of the Association for Computational Linguistics: ACL 2023, pp

    Ethan Perez, Sam Ringer, Kamile Lukosiute, Karina Nguyen, Edwin Chen, Scott Heiner, Craig Pettit, Catherine Olsson, Sandipan Kundu, Saurav Kadavath, Andy Jones, Anna Chen, Benjamin Mann, Brian Israel, Bryan Seethor, Cameron McKinnon, Christopher Olah, Da Yan, Daniela Amodei, Dario Amodei, Dawn Drain, Dustin Li, Eli Tran-Johnson, Guro Khundadze, Jackson Ke...

  53. [54]

    Nina Rimsky, Nick Gabrieli, Julian Schulz, Meg Tong, Evan Hubinger, and Alexander Turner. Steering Llama 2 via contrastive activation addition. In Lun-Wei Ku, Andre Martins, and Vivek Srikumar, editors, Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 15504–15522, Bangkok, Thailand, August...

  54. [55]

    Gregory Serapio-García, Mustafa Safdari, Clément Crepy, Luning Sun, Stephen Fitz, Peter Romero, Marwa Abdulhai, Aleksandra Faust, and Maja Matarić. A psychometric framework for evaluating and shaping personality traits in large language models. Nature Machine Intelligence, pages 1–15, 2025

  55. [56]

    OpenAI GPT-5 System Card

    Aaditya Singh, Adam Fry, Adam Perelman, Adam Tart, Adi Ganesh, Ahmed El-Kishky, Aidan McLaughlin, Aiden Low, AJ Ostrow, Akhila Ananthram, Akshay Nathan, Alan Luo, Alec Helyar, Aleksander Madry, Aleksandr Efremov, Aleksandra Spyra, Alex Baker-Whitcomb, Alex Beutel, Alex Karpenko, Alex Makelov, Alex Neitz, Alex Wei, Alexandra Barr, Alexandre Kirchmeyer, Ale...

  56. [57]

    Seungjong Sun, Seo Yeon Baek, and Jang Hyun Kim. Personality vector: Modulating personality of large language models by model merging. In Christos Christodoulopoulos, Tanmoy Chakraborty, Carolyn Rose, and Violet Peng, editors, Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing, pages 24656–24677, Suzhou, China, November 2025. doi: 10.18653/v1/2025.emnlp-main.1253. URL https://aclanthology.org/2025.emnlp-main.1253/

  58. [59]

    Ala N. Tak, Amin Banayeeanzade, Anahita Bolourani, Mina Kian, Robin Jia, and Jonathan Gratch. Mechanistic interpretability of emotion inference in large language models. In Wanxiang Che, Joyce Nabende, Ekaterina Shutova, and Mohammad Taher Pilehvar, editors, Findings of the Association for Computational Linguistics: ACL 2025, pages 13090–13120, Vienna, Austria...

  59. [60]

    Gemma Team, Aishwarya Kamath, Johan Ferret, Shreya Pathak, Nino Vieillard, Ramona Merhej, Sarah Perrin, Tatiana Matejovicova, Alexandre Ramé, Morgane Rivière, Louis Rouillard, Thomas Mesnard, Geoffrey Cideron, Jean-Bastien Grill, Sabela Ramos, Edouard Yvinec, Michelle Casbon, Etienne Pot, Ivo Penchev, Gaël Liu, Francesco Visin, Kathleen Kenealy, Lucas Bey...

  60. [61]

    Alexander Matt Turner, Lisa Thiergart, Gavin Leech, David Udell, Juan J. Vazquez, Ulisse Mini, and Monte MacDiarmid. Steering language models with activation engineering, 2024. URL https://arxiv.org/abs/2308.10248

  61. [62]

    Dimitri Van der Linden, Jan Te Nijenhuis, and Arnold B Bakker. The general factor of personality: A meta-analysis of big five intercorrelations and a criterion-related validity study. Journal of research in personality, 44(3):315–327, 2010

  62. [63]

    Linda L Viney. The assessment of psychological states through content analysis of verbal communications. Psychological Bulletin, 94(3):542, 1983

  63. [64]

    Hieu M. Vu and Tan M. Nguyen. Angular steering: Behavior control via rotation in activation space, 2025. URL https://arxiv.org/abs/2510.26243

  64. [65]

    Huy Vu, Huy Anh Nguyen, Adithya V Ganesan, Swanie Juhng, Oscar NE Kjell, Joao Sedoc, Margaret L Kern, Ryan L Boyd, Lyle Ungar, H Andrew Schwartz, et al. Psychadapter: adapting LLMs to reflect traits, personality, and mental health. npj Artificial Intelligence, 2(1):26, 2026

  65. [66]

    Alex Warstadt, Amanpreet Singh, and Samuel R. Bowman. Neural network acceptability judgments. Transactions of the Association for Computational Linguistics, 7:625–641, 2019. doi: 10.1162/tacl_a_00290. URL https://aclanthology.org/Q19-1040/

  66. [67]

    Peter West, Chandra Bhagavatula, Jack Hessel, Jena D. Hwang, Liwei Jiang, Ronan Le Bras, Ximing Lu, Sean Welleck, and Yejin Choi. Symbolic knowledge distillation: from general language models to commonsense models. In Marine Carpuat, Marie-Catherine de Marneffe, and Iván Vladimir Meza Ruíz, editors, Proceedings of the 2022 Conference of the North American...

  67. [68]

    Zhengxuan Wu, Aryaman Arora, Atticus Geiger, Zheng Wang, Jing Huang, Dan Jurafsky, Christopher D Manning, and Christopher Potts. AxBench: Steering LLMs? even simple baselines outperform sparse autoencoders. In Forty-second International Conference on Machine Learning, 2025. URL https://openreview.net/forum?id=K2CckZjNy0

  68. [69]

    Qwen3 Technical Report

    An Yang, Anfeng Li, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chang Gao, Chengen Huang, Chenxu Lv, Chujie Zheng, Dayiheng Liu, Fan Zhou, Fei Huang, Feng Hu, Hao Ge, Haoran Wei, Huan Lin, Jialong Tang, Jian Yang, Jianhong Tu, Jianwei Zhang, Jianxin Yang, Jiaxi Yang, Jing Zhou, Jingren Zhou, Junyang Lin, Kai Dang, Keqin Bao, Kexin Yang, ...

  69. [70]

    Shu Yang, Shenzhe Zhu, Liang Liu, Lijie Hu, Mengdi Li, and Di Wang. Exploring the personality traits of LLMs through latent features steering, 2025. URL https://arxiv.org/abs/2410.10863

  70. [71]

    Seonghyeon Ye, Doyoung Kim, Sungdong Kim, Hyeonbin Hwang, Seungone Kim, Yongrae Jo, James Thorne, Juho Kim, and Minjoon Seo. FLASK: fine-grained language model evaluation based on alignment skill sets. In The Twelfth International Conference on Learning Representations, ICLR 2024, Vienna, Austria, May 7-11, 2024. OpenReview.net, 2024. URL https://openrev...

  71. [72]

    Chenxiao Yu, Bowen Yi, Farzan Karimi-Malekabadi, Suhaib Abdurahman, Jinyi Ye, Shrikanth Narayanan, Yue Zhao, and Morteza Dehghani. Tracing moral foundations in large language models, 2026. URL https://arxiv.org/abs/2601.05437

  72. [73]

    Yanzhao Zhang, Mingxin Li, Dingkun Long, Xin Zhang, Huan Lin, Baosong Yang, Pengjun Xie, An Yang, Dayiheng Liu, Junyang Lin, Fei Huang, and Jingren Zhou. Qwen3 Embedding: Advancing text embedding and reranking through foundation models, 2025. URL https://arxiv.org/abs/2506.05176

  73. [74]

    Zhaohui Zhang, Zhuoran Tu, Yulei Chen, Xiyao Xiao, Yi Feng, and Wen Zhang. Automated item generation for personality assessment: development and validation of large-language-model-derived HEXACO situational judgment tests. Journal of Research in Personality, 120: 104680, 2026. ISSN 0092-6566. doi: https://doi.org/10.1016/j.jrp.2025.104680. URL https://www...

  74. [75]

    Minjun Zhu, Yixuan Weng, Linyi Yang, and Yue Zhang. Personality alignment of large language models. In The Thirteenth International Conference on Learning Representations, ICLR 2025, Singapore, April 24-28, 2025. OpenReview.net, 2025. URL https://openreview.net/forum?id=0DZEs8NpUH

  75. [76]

    Andy Zou, Long Phan, Sarah Chen, James Campbell, Phillip Guo, Richard Ren, Alexander Pan, Xuwang Yin, Mantas Mazeika, Ann-Kathrin Dombrowski, Shashwat Goel, Nathaniel Li, Michael J. Byun, Zifan Wang, Alex Mallen, Steven Basart, Sanmi Koyejo, Dawn Song, Matt Fredrikson, J. Zico Kolter, and Dan Hendrycks. Representation engineering: A top-down approach to AI transparency...

    You {item}

    Write a detailed feedback that assesses the quality of the response strictly based on the given score rubric, not evaluating in general. After writing a feedback, write a score that is an integer between 1 and 5. You should refer to the score rubric. The output format should look as follows: "(write a feedback for criteria) [RESULT] (an integer number between 1 and 5)". Please do not generate any other opening, closing, and explanations. ###The instruction to evaluate: {instruction} ###Response to evaluate: {response} ###Score Rubrics: {rubric} ###Feedback:

    Lastly, the rubric, centered on coherence and fluency, was formulated by Ye et al. [66] and can be found at https://github.com/kaistAI/FLASK: Is the response structured...
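The judge output format above ("feedback, then `[RESULT]` and an integer 1–5") can be parsed mechanically when aggregating scores. A minimal sketch; the function name and regex are our own illustration, not from the paper:

```python
import re


def parse_judge_output(text: str):
    """Split a Prometheus-style judge completion into (feedback, score).

    Expects the format described above:
    "(write a feedback for criteria) [RESULT] (an integer between 1 and 5)"
    Returns None when no valid [RESULT] marker is present.
    """
    match = re.search(r"\[RESULT\]\s*([1-5])\b", text)
    if match is None:
        return None
    # Everything before the marker is the free-text feedback.
    feedback = text[: match.start()].strip()
    return feedback, int(match.group(1))


example = "The response stays on topic and reads fluently. [RESULT] 4"
print(parse_judge_output(example))
# → ('The response stays on topic and reads fluently.', 4)
```

Anchoring the parse on the literal `[RESULT]` marker rather than on any digit keeps scores inside the feedback text (e.g. "pages 1–5") from being misread.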