The Granularity Axis: A Micro-to-Macro Latent Direction for Social Roles in Language Models
Pith reviewed 2026-05-08 10:08 UTC · model grok-4.3
The pith
Language models encode social role granularity as a dominant latent direction from individual to institutional scale.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
We show that LLMs encode the granularity of social roles, from micro-level individual experience to macro-level organizational, institutional, or national reasoning, as a structured, ordered, and causally manipulable latent direction. We define a contrast-based Granularity Axis as the difference between mean macro- and micro-role hidden states. In Qwen3-8B this axis aligns with PC1 of the role representation space at cosine 0.972 and accounts for 52.6 percent of its variance. We construct 75 social roles across five granularity levels, collect 91,200 role-conditioned responses, extract role-level hidden states, and find that projections increase monotonically across levels, remain stable across layers, prompt variants, endpoint definitions, held-out splits, and score-filtered subsets, and transfer to Llama-3.1-8B-Instruct. Activation steering along the axis shifts response granularity in the predicted direction, confirming that the direction is causally relevant.
What carries the argument
The Granularity Axis, defined as the vector difference between the average hidden states of macro-scale roles and micro-scale roles, which functions as the primary geometric direction organizing the space of role representations.
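A minimal sketch of that construction, assuming role-level hidden states have already been extracted and averaged per role; the names, grouping, and shapes here are illustrative, not the authors' code:

```python
import numpy as np

# Contrast-based axis, assuming `hidden` maps each role name to its
# mean hidden-state vector. Role groupings are illustrative.
def granularity_axis(hidden, macro_roles, micro_roles):
    macro_mean = np.mean([hidden[r] for r in macro_roles], axis=0)
    micro_mean = np.mean([hidden[r] for r in micro_roles], axis=0)
    axis = macro_mean - micro_mean
    return axis / np.linalg.norm(axis)  # unit direction, macro-positive

def project_roles(hidden, axis):
    # Scalar projection of every role onto the axis; the paper's claim is
    # that these scores rise monotonically with assigned granularity level.
    return {role: float(vec @ axis) for role, vec in hidden.items()}
```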
If this is right
- Role hidden states project monotonically onto the axis across all five granularity levels and across prompt variants.
- The axis remains stable across model layers, endpoint choices, held-out data splits, and transfers to a second model.
- Positive or negative activation steering along the axis shifts generated response granularity in the predicted direction (a generic steering sketch follows this list).
- The two tested models show different degrees of steering controllability depending on their default behavior.
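For the steering bullet above, a generic activation-addition sketch in the style of contrastive activation addition; the layer index, coefficient, and Hugging Face-style module path are assumptions, not the paper's reported settings:

```python
import torch

def add_steering_hook(model, layer_idx, axis, alpha):
    # Add a scaled copy of the unit granularity axis to one decoder
    # layer's output during generation; positive alpha should push
    # responses toward the macro end of the scale.
    vec = torch.as_tensor(axis, dtype=model.dtype, device=model.device)

    def hook(module, inputs, output):
        hidden = output[0] if isinstance(output, tuple) else output
        steered = hidden + alpha * vec
        return (steered,) + output[1:] if isinstance(output, tuple) else steered

    # Assumes an HF-style decoder stack; adjust the path per model.
    return model.model.layers[layer_idx].register_forward_hook(hook)

# handle = add_steering_hook(model, layer_idx=20, axis=axis, alpha=8.0)
# ...generate, score granularity, then handle.remove()...
```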
Where Pith is reading between the lines
- This geometric structure could let practitioners read out or adjust the perspective scale of role-play outputs without rewriting prompts.
- The same contrast method might reveal other ordered dimensions such as time horizon or emotional intensity in role representations.
- If the axis generalizes, it offers a way to test whether models internally rank social contexts by scope rather than treating them as flat labels.
Load-bearing premise
The author-chosen 75 roles and five granularity levels cleanly separate micro-to-macro distinctions without other factors such as topic or response style driving the hidden-state patterns.
What would settle it
If role projections onto the axis fail to increase monotonically with assigned granularity level, or if activation steering along the axis produces no reliable shift in measured response granularity on held-out prompts, the central claim would not hold.
Original abstract
Large language models (LLMs) are routinely prompted to take on social roles ranging from individuals to institutions, yet it remains unclear whether their internal representations encode the granularity of such roles, from micro-level individual experience to macro-level organizational, institutional, or national reasoning. We show that they do. We define a contrast-based Granularity Axis as the difference between mean macro- and micro-role hidden states. In Qwen3-8B, this axis aligns with the principal axis (PC1) of the role representation space at cosine 0.972 and accounts for 52.6% of its variance, indicating that granularity is the dominant geometric axis organizing prompted social roles. We construct 75 social roles across five granularity levels and collect 91,200 role-conditioned responses over shared questions and prompt variants, then extract role-level hidden states and project them onto the axis. Role projections increase monotonically across all five levels, remain stable across layers, prompt variants, endpoint definitions, held-out splits, and score-filtered subsets, and transfer to Llama-3.1-8B-Instruct. The axis is also causally relevant: activation steering along it shifts response granularity in the predicted direction, with Llama moving from 2.00 to 3.17 on a five-point macro scale under positive steering on prompts that admit local responses. The two models differ in controllability, suggesting that steering depends on each model's default operating regime. Overall, our findings suggest that social role granularity is not merely a stylistic surface feature, but a structured, ordered, and causally manipulable latent direction in role-conditioned language model behavior.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper claims that LLMs internally represent the granularity of prompted social roles (micro-level individuals to macro-level institutions) along a dominant latent direction. It constructs 75 roles across five granularity levels, collects 91,200 role-conditioned responses, defines a Granularity Axis as the mean difference between macro- and micro-role hidden states, shows this axis aligns with PC1 (cosine 0.972, 52.6% variance) in Qwen3-8B, produces monotonic projections across levels that are stable across layers/prompts/held-out splits/score-filtered subsets and transfer to Llama-3.1-8B-Instruct, and demonstrates causal relevance via activation steering that shifts response granularity scores in the predicted direction.
Significance. If the central geometric and causal claims hold after controls for confounds, the result would be significant: it identifies granularity as a structured, ordered, and manipulable latent direction organizing social roles in LLMs rather than a surface stylistic feature. Strengths include the scale of the dataset (91,200 responses), multiple stability checks (held-out splits, score-filtered subsets, endpoint variants), cross-model transfer, and the causal steering experiment that produces measurable shifts (e.g., Llama from 2.00 to 3.17). These elements provide reproducible empirical grounding for a falsifiable geometric hypothesis.
major comments (3)
- [§3] §3 (Role Construction and Data Collection): The 75 author-defined roles and five granularity levels are presented without reported balancing, matching, or regression controls for topic, average response length, or lexical/stylistic differences across levels. Because the Granularity Axis is defined directly from the mean hidden states of these roles and then shown to align with PC1, any correlation between these covariates and the level labels would make the observed alignment (cosine 0.972) and monotonic projections consistent with a composite direction rather than a pure granularity axis. This is load-bearing for the claim that granularity is the dominant organizing axis.
- [§4.2] §4.2 (PCA Alignment and Variance): The reported 52.6% variance explained by PC1 and its alignment with the author-defined axis would be more convincing with an explicit baseline comparison (e.g., variance explained by axes derived from random role groupings or shuffled granularity labels; a permutation sketch follows this list). Without this, it remains possible that the high alignment is partly an artifact of how the endpoint sets were chosen rather than evidence that granularity is uniquely dominant in the 75-role representation space.
- [§5] §5 (Activation Steering): The steering results (shift from 2.00 to 3.17 on the five-point scale) are promising but depend on the claim that the prompts 'admit local responses'; the paper should report how the five-point macro scale was applied by raters and whether inter-rater reliability or prompt selection criteria were pre-registered to ensure the granularity shift is not driven by changes in topic or length induced by the steering vector.
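One way to run the null baseline the second comment asks for; `H` holds the 75 role vectors row-wise, and all names are assumed rather than taken from the paper:

```python
import numpy as np

def pc1(H):
    # First principal component direction of the role space (unit norm).
    Hc = H - H.mean(axis=0)
    return np.linalg.svd(Hc, full_matrices=False)[2][0]

def null_alignments(H, n_macro, n_micro, n_perm=1000, seed=0):
    # Distribution of |cosine(axis, PC1)| when macro/micro endpoint sets
    # are drawn at random, against which the observed 0.972 can be judged.
    rng = np.random.default_rng(seed)
    pc = pc1(H)
    cosines = []
    for _ in range(n_perm):
        idx = rng.permutation(len(H))
        axis = H[idx[:n_macro]].mean(0) - H[idx[n_macro:n_macro + n_micro]].mean(0)
        cosines.append(abs(axis @ pc) / np.linalg.norm(axis))
    return np.array(cosines)
```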
minor comments (2)
- [§3] The definition of the Granularity Axis would benefit from an explicit equation (e.g., A = mean(H_macro) - mean(H_micro)) in the main text rather than only in prose; one possible rendering follows this list.
- [Figures 2-4] Figure captions for the projection plots should include the number of responses per level and any error bars or confidence intervals to allow visual assessment of the monotonicity strength.
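For the first minor comment, one way the requested equation could read, using only quantities the paper already defines (the macro/micro endpoint role sets and the role-mean hidden states):

```latex
\[
  \mathbf{A}
  = \frac{1}{\lvert \mathcal{R}_{\mathrm{macro}} \rvert}
      \sum_{r \in \mathcal{R}_{\mathrm{macro}}} \bar{\mathbf{h}}_r
  - \frac{1}{\lvert \mathcal{R}_{\mathrm{micro}} \rvert}
      \sum_{r \in \mathcal{R}_{\mathrm{micro}}} \bar{\mathbf{h}}_r ,
  \qquad
  p_r = \frac{\langle \bar{\mathbf{h}}_r, \mathbf{A} \rangle}{\lVert \mathbf{A} \rVert}.
\]
```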
Simulated Author's Rebuttal
We thank the referee for their constructive comments on our manuscript. We address each of the major concerns point by point below, providing clarifications and committing to revisions where appropriate to strengthen the paper.
Point-by-point responses
- Referee: [§3] §3 (Role Construction and Data Collection): The 75 author-defined roles and five granularity levels are presented without reported balancing, matching, or regression controls for topic, average response length, or lexical/stylistic differences across levels. Because the Granularity Axis is defined directly from the mean hidden states of these roles and then shown to align with PC1, any correlation between these covariates and the level labels would make the observed alignment (cosine 0.972) and monotonic projections consistent with a composite direction rather than a pure granularity axis. This is load-bearing for the claim that granularity is the dominant organizing axis.
Authors: We appreciate this observation, as controlling for potential confounds is crucial for isolating the granularity effect. The roles were designed with shared questions across all granularity levels to control for topic, and prompt variants were used to increase robustness. However, we acknowledge that explicit balancing for response length and lexical features was not reported. In the revised manuscript, we will include: (i) statistics on average response lengths and token counts per granularity level, (ii) correlations between lexical metrics (e.g., type-token ratio) and level, and (iii) a regression analysis where we residualize the hidden states for these covariates before computing the axis and projections. This will demonstrate whether the axis remains aligned with PC1 after controls. We believe this addresses the concern without altering the core findings. revision: yes
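A compact version of the residualization the authors commit to above, with `H` as the (roles x dims) hidden-state matrix and `C` the nuisance covariates (length, lexical metrics); the names are assumed:

```python
import numpy as np

def residualize(H, C):
    # OLS of every hidden dimension on the covariates (plus intercept);
    # the residuals can be fed back into the axis computation to test
    # whether the PC1 alignment survives the controls.
    X = np.column_stack([np.ones(len(C)), C])
    beta, *_ = np.linalg.lstsq(X, H, rcond=None)
    return H - X @ beta
```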
- Referee: [§4.2] §4.2 (PCA Alignment and Variance): The reported 52.6% variance explained by PC1 and its alignment with the author-defined axis would be more convincing with an explicit baseline comparison (e.g., variance explained by axes derived from random role groupings or shuffled granularity labels). Without this, it remains possible that the high alignment is partly an artifact of how the endpoint sets were chosen rather than evidence that granularity is uniquely dominant in the 75-role representation space.
Authors: We agree that baseline comparisons are important to establish the uniqueness of the granularity axis. We will add to the revised §4.2 results from 1000 random role groupings (randomly assigning roles to micro/macro endpoints) and shuffled label permutations, computing the distribution of cosine similarities and variance explained. Our preliminary checks suggest the observed 0.972 cosine and 52.6% variance are outliers compared to these null distributions, supporting that granularity is the dominant direction. This addition will make the claim more robust. revision: yes
- Referee: [§5] §5 (Activation Steering): The steering results (shift from 2.00 to 3.17 on the five-point scale) are promising but depend on the claim that the prompts 'admit local responses'; the paper should report how the five-point macro scale was applied by raters and whether inter-rater reliability or prompt selection criteria were pre-registered to ensure the granularity shift is not driven by changes in topic or length induced by the steering vector.
Authors: We thank the referee for highlighting the need for more details on the evaluation. The five-point scale was applied by two independent raters who scored responses on granularity from 1 (micro/individual) to 5 (macro/institutional), with prompts selected as those allowing both local and global responses (e.g., questions about decision-making that can be answered personally or organizationally). In the revision, we will provide the full rating instructions, report inter-rater agreement (e.g., Pearson correlation or kappa), and confirm that steered responses did not significantly differ in length or topic from baseline (via manual inspection and metrics). While the experiment was not pre-registered, the selection criteria were based on pilot testing to ensure validity. To further address confounds, we will add analysis showing the steering primarily affects granularity-related content. revision: partial
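The agreement statistics mentioned in this response are standard; a sketch assuming two aligned lists of 1-5 scores (the library calls are real, the data names are placeholders):

```python
from scipy.stats import pearsonr
from sklearn.metrics import cohen_kappa_score

def rater_agreement(scores_a, scores_b):
    # Quadratic-weighted kappa penalizes large disagreements on the
    # ordinal 1-5 macro scale more heavily than adjacent ones.
    kappa = cohen_kappa_score(scores_a, scores_b, weights="quadratic")
    r, _ = pearsonr(scores_a, scores_b)
    return {"weighted_kappa": kappa, "pearson_r": r}
```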
Circularity Check
No significant circularity detected
Full rationale
The paper defines the Granularity Axis explicitly as the difference between mean macro-role and mean micro-role hidden states, then reports its empirical alignment with PC1 of the 75-role space (cosine 0.972, 52.6% variance) and monotonic projections of the intermediate levels as observed results. These are independent checks against the model's actual hidden-state geometry rather than facts that hold by construction. Steering experiments, cross-model transfer, and stability across splits provide further external validation. No self-citation chain, ansatz smuggling, or renaming of known results occurs; the analysis remains grounded in the collected role-conditioned activations.
Axiom & Free-Parameter Ledger
free parameters (2)
- Granularity level definitions and role assignments
- Micro and macro endpoint role sets
axioms (2)
- domain assumption: Hidden states extracted from role-conditioned prompts reflect the semantic granularity of the assigned role
- domain assumption: The contrast between mean macro and micro hidden states isolates granularity rather than correlated factors such as response style or topic
invented entities (1)
- Granularity Axis (no independent evidence)