Towards Effective Theory of LLMs: A Representation Learning Approach
Recognition: 2 theorem links · Lean Theorem
Pith reviewed 2026-05-12 03:37 UTC · model grok-4.3
The pith
LLM hidden-state trajectories can be coarse-grained into macrostates that support reasoning interpretation, behavior prediction, and generation steering.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
RET learns macrovariables from LLM hidden-state trajectories using a self-supervised objective in the style of BYOL and JEPA. These macrovariables coarse-grain the activations into higher-level variables that preserve structure relevant for prediction and interpretation. The resulting states are temporally consistent, capture high-level semantics, support early prediction of behavioral outcomes such as sycophancy, and provide causal handles for steering generations toward interpretable computational phases.
What carries the argument
Representational Effective Theory (RET), a framework that learns macrostates from hidden-state trajectories via self-supervised coarse-graining to create dynamically meaningful variables.
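The page never shows the objective itself, so the following is a minimal PyTorch sketch of one plausible BYOL/JEPA-style instantiation on adjacent hidden-state pairs; the Encoder architecture, dimensions, EMA rate, and the jepa_step helper are illustrative assumptions, not the authors' code.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

HIDDEN_DIM, LATENT_DIM = 4096, 64  # illustrative sizes, not the paper's

class Encoder(nn.Module):
    """Coarse-grains a microscopic hidden state h_t into a macrostate z_t."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(HIDDEN_DIM, 512), nn.GELU(),
                                 nn.Linear(512, LATENT_DIM))
    def forward(self, h):
        return self.net(h)

online, target = Encoder(), Encoder()
target.load_state_dict(online.state_dict())
for p in target.parameters():
    p.requires_grad_(False)
predictor = nn.Linear(LATENT_DIM, LATENT_DIM)   # models z_{t+1} ~ T(z_t)
opt = torch.optim.AdamW(list(online.parameters()) +
                        list(predictor.parameters()), lr=1e-4)

def jepa_step(h_t, h_next, tau=0.996):
    """One update: predict the target-encoder macrostate at the next token
    position from the online macrostate at the current one (BYOL-style)."""
    z_pred = predictor(online(h_t))
    with torch.no_grad():
        z_tgt = target(h_next)
    loss = 2 - 2 * F.cosine_similarity(z_pred, z_tgt, dim=-1).mean()
    opt.zero_grad(); loss.backward(); opt.step()
    with torch.no_grad():  # EMA target update guards against collapse
        for po, pt in zip(online.parameters(), target.parameters()):
            pt.mul_(tau).add_(po, alpha=1 - tau)
    return float(loss)

# Usage on stand-in activations: a batch of adjacent hidden-state pairs.
h = torch.randn(32, HIDDEN_DIM)
print(jepa_step(h, h + 0.1 * torch.randn_like(h)))
```

The EMA target network is the standard BYOL device for avoiding representational collapse; whether RET uses it, a stop-gradient, or a JEPA-style masked-span predictor is not stated on this page.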
If this is right
- The macrostates reveal mental-state trajectories of reasoning in LLMs.
- They capture high-level semantic structure in the computation.
- They enable early prediction of behavioral outcomes like sycophancy.
- They provide causal handles for intervening on and steering generations.
Where Pith is reading between the lines
- If RET macrostates prove robust across models, they could inform the design of architectures that build in interpretability from the start.
- The same coarse-graining approach might extend to vision or multimodal models to extract effective descriptions of their computations.
- Comparing RET macrostates before and after fine-tuning could test how training alters the high-level dynamics of the model.
Load-bearing premise
The self-supervised objective applied to hidden-state trajectories preserves the higher-level semantic and causal structure needed for downstream prediction and intervention.
What would settle it
Observing that interventions based on the RET macrostates do not alter the model's generation behavior in a controlled manner, while direct interventions on raw hidden states do, would indicate that the macrovariables lack the claimed causal utility.
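Concretely, the settling experiment reduces to comparing effect sizes across intervention conditions. A minimal sketch of the statistical side of that comparison, with simulated outcome arrays standing in for real patched-generation results (flips_macro and flips_raw are hypothetical stand-ins):

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulated stand-ins: 1 = the patched generation changed behavior in the
# predicted direction, 0 = no change. In the real test these arrays would
# come from macrostate-level vs. raw-activation patches on the same prompts.
flips_macro = rng.binomial(1, 0.55, size=200)  # RET-macrostate steering
flips_raw = rng.binomial(1, 0.60, size=200)    # direct hidden-state patch

def perm_test(a, b, n_perm=10_000):
    """Two-sided permutation test on the difference in flip rates."""
    observed = a.mean() - b.mean()
    pooled = np.concatenate([a, b])
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(pooled)
        diff = pooled[:len(a)].mean() - pooled[len(a):].mean()
        hits += abs(diff) >= abs(observed)
    return observed, hits / n_perm

diff, p = perm_test(flips_macro, flips_raw)
print(f"flip-rate gap (macro - raw) = {diff:+.3f}, p = {p:.3f}")
# The macrovariables lack the claimed causal utility if the macro condition
# is indistinguishable from a no-op control while the raw condition is not.
```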
Original abstract
We propose Representational Effective Theory (RET), a framework for describing large language model computation in terms of learned macrostates rather than microscopic details. RET learns these macrostates from hidden-state trajectories using a BYOL/JEPA-style self-supervised objective, coarse-graining activations into macrovariables that preserve higher-level structure relevant for prediction and interpretation. We evaluate whether these macrovariables are practically relevant for interpretability: RET yields temporally consistent states that reveal "mental-state" trajectories of reasoning, capture high-level semantic structure, support early prediction of behavioral outcomes such as sycophancy, and provide causal handles for steering generations toward interpretable computational phases. Together, these results suggest that LLM computation admits useful effective descriptions via RET: high-level, dynamically meaningful variables that support interpretation, prediction, and intervention.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript proposes Representational Effective Theory (RET), a framework that learns macrostates from LLM hidden-state trajectories via a BYOL/JEPA-style self-supervised objective. These macrostates are claimed to yield temporally consistent representations that capture high-level semantic structure, enable early prediction of behavioral outcomes such as sycophancy, and provide causal handles for steering model generations toward interpretable computational phases.
Significance. If the empirical claims hold with rigorous controls, RET would supply a concrete representation-learning route to effective theories of LLM computation, moving beyond post-hoc mechanistic interpretability toward dynamically meaningful, intervenable macrovariables. The approach is novel in its direct application of view-invariance objectives to activation trajectories for downstream prediction and steering tasks.
major comments (3)
- [Abstract] The abstract asserts that RET 'supports early prediction of behavioral outcomes such as sycophancy' and 'provide[s] causal handles for steering generations,' yet supplies no quantitative metrics, baselines, ablation studies, data-exclusion criteria, or statistical significance tests. Without these, the central claim that the learned macrostates are 'practically relevant for interpretability' cannot be evaluated.
- [Abstract / proposed method] The claim that the BYOL/JEPA-style objective extracts macrostates preserving 'higher-level causal structure' usable for intervention rests on an unverified assumption. The objective enforces view-invariance and temporal consistency but does not incorporate do-calculus, counterfactuals, or explicit causal modeling; reported steering results could arise from correlational patterns alone. A concrete test (e.g., comparison against non-causal baselines or explicit intervention ablations) is required to substantiate the 'effective theory' interpretation.
- [Abstract] The manuscript introduces new evaluation tasks and a new objective rather than deriving predictions from previously fitted quantities, raising the risk that any reported 'predictions' are tautological with the learned parameters. The absence of details on how macrostates are validated as causally relevant (versus merely predictive) makes the circularity concern load-bearing for the intervention claims.
minor comments (1)
- [Abstract] The phrase 'mental-state trajectories of reasoning' is used without a precise operational definition or link to the learned macrovariables; this should be clarified with an explicit mapping or example (one possible operationalization is sketched below).
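For illustration only, one way such an operational definition could look, assuming macrostates are discretized into nominal phases (the clustering step and the phase count are assumptions, not the paper's method):

```python
import itertools
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(1)

# Stand-in macrostates for one generation: (tokens, latent_dim).
Z = rng.normal(size=(256, 64))

# Discretize the continuous macrostates into K nominal "phases".
phases = KMeans(n_clusters=8, n_init=10, random_state=0).fit_predict(Z)

# Read a "mental-state trajectory" as the token-aligned phase sequence;
# runs of a repeated phase mark temporally consistent computational stages.
runs = [(int(p), sum(1 for _ in g)) for p, g in itertools.groupby(phases)]
print(runs[:5])  # e.g. [(3, 17), (0, 4), ...] as (phase id, run length)
```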
Simulated Author's Rebuttal
We thank the referee for their thorough and constructive review. We address each major comment point by point below, providing clarifications based on the manuscript content and indicating where revisions will strengthen the presentation.
Point-by-point responses
- Referee: [Abstract] The abstract asserts that RET 'supports early prediction of behavioral outcomes such as sycophancy' and 'provide[s] causal handles for steering generations,' yet supplies no quantitative metrics, baselines, ablation studies, data-exclusion criteria, or statistical significance tests. Without these, the central claim that the learned macrostates are 'practically relevant for interpretability' cannot be evaluated.
Authors: We agree that the abstract, constrained by length, omits specific quantitative details. The full manuscript reports concrete metrics for early sycophancy prediction (including layer-wise accuracies and AUC values against linear probe baselines on raw activations), ablation studies removing temporal consistency or view-invariance terms, data exclusion criteria for the behavioral datasets, and statistical significance testing via bootstrap resampling. We will revise the abstract to incorporate one or two key quantitative highlights (e.g., early-layer prediction performance) to better ground the claims. revision: yes
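For readers wanting the shape of that claimed protocol, here is a minimal sketch of a linear-probe baseline with a bootstrapped AUC confidence interval; the arrays are synthetic stand-ins for real activations, macrostates, and sycophancy labels, and the split sizes are arbitrary.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(2)

# Synthetic stand-ins: X_raw = early-layer hidden states, X_macro = RET
# macrostates at the same positions, y = eventual sycophancy label.
n, n_train = 500, 400
X_raw = rng.normal(size=(n, 4096))
X_macro = rng.normal(size=(n, 64))
y = rng.integers(0, 2, size=n)

def test_scores(X):
    """Linear probe: fit on the train split, score the held-out split."""
    clf = LogisticRegression(max_iter=1000).fit(X[:n_train], y[:n_train])
    return clf.predict_proba(X[n_train:])[:, 1]

def bootstrap_auc_ci(scores, labels, n_boot=1000, alpha=0.05):
    """Percentile bootstrap over test examples (resample with replacement)."""
    aucs = []
    while len(aucs) < n_boot:
        idx = rng.integers(0, len(labels), size=len(labels))
        if labels[idx].min() == labels[idx].max():
            continue  # an AUC needs both classes in the resample
        aucs.append(roc_auc_score(labels[idx], scores[idx]))
    return np.quantile(aucs, [alpha / 2, 1 - alpha / 2])

for name, X in [("raw-activation probe", X_raw), ("macrostate probe", X_macro)]:
    s = test_scores(X)
    lo, hi = bootstrap_auc_ci(s, y[n_train:])
    print(f"{name}: AUC={roc_auc_score(y[n_train:], s):.3f} "
          f"95% CI=({lo:.3f}, {hi:.3f})")
```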
- Referee: [Abstract / proposed method] The claim that the BYOL/JEPA-style objective extracts macrostates preserving 'higher-level causal structure' usable for intervention rests on an unverified assumption. The objective enforces view-invariance and temporal consistency but does not incorporate do-calculus, counterfactuals, or explicit causal modeling; reported steering results could arise from correlational patterns alone. A concrete test (e.g., comparison against non-causal baselines or explicit intervention ablations) is required to substantiate the 'effective theory' interpretation.
Authors: We acknowledge that the self-supervised objective relies on view-invariance and temporal consistency rather than explicit causal machinery such as do-calculus. The 'causal handles' claim is supported empirically by the steering experiments, in which replacing a trajectory segment with a macrostate from a different semantic phase produces predictable shifts in generation behavior that align with the macrostate semantics and are not observed under random or correlational perturbations. To address the concern directly, we will add a dedicated discussion of the distinction between the learned invariance and causal claims, plus new ablations comparing steering success against non-causal baselines (e.g., PCA or random projections of activations). revision: yes
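The promised ablation is easy to specify even without the model in hand. A sketch of how the non-causal baseline handles could be constructed; the actual steering pass, which requires hooking generation, is elided, and patch is a hypothetical helper mirroring macrostate substitution.

```python
import numpy as np

rng = np.random.default_rng(3)

# H: (tokens, hidden_dim) activations from the same layer RET encodes;
# the baselines compress the same data with no learned dynamics at all.
H = rng.normal(size=(2000, 512))
k = 32  # match the macrostate dimensionality for a fair comparison

# Baseline 1: top-k PCA directions (purely correlational structure).
Hc = H - H.mean(axis=0)
_, _, Vt = np.linalg.svd(Hc, full_matrices=False)
pca_basis = Vt[:k]                      # (k, hidden_dim)

# Baseline 2: a random orthonormal projection (no structure at all).
Q, _ = np.linalg.qr(rng.normal(size=(512, k)))
rand_basis = Q.T                        # (k, hidden_dim)

def patch(h, z_target, basis):
    """Swap h's component in the baseline subspace for a target code,
    mirroring the macrostate-substitution intervention."""
    return h - basis.T @ (basis @ h) + basis.T @ z_target

# Steering counts as causally meaningful only if the RET handle moves
# behavior where these matched-dimension baselines do not.
h_patched = patch(H[0], rng.normal(size=k), pca_basis)
```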
- Referee: [Abstract] The manuscript introduces new evaluation tasks and a new objective rather than deriving predictions from previously fitted quantities, raising the risk that any reported 'predictions' are tautological with the learned parameters. The absence of details on how macrostates are validated as causally relevant (versus merely predictive) makes the circularity concern load-bearing for the intervention claims.
Authors: The macrostates are learned via a self-supervised objective applied solely to unlabeled activation trajectories; the downstream prediction and intervention tasks use held-out behavioral labels and separate test trajectories that were never seen during macrostate learning. This separation ensures the reported predictions are not tautological. For causal relevance, the intervention protocol demonstrates that forcing a macrostate transition alters subsequent generations in a manner consistent with the macrostate semantics and independent of the original activation path. We will expand the methods and results sections with explicit diagrams and text clarifying the train/test separation and the validation criteria distinguishing predictive from interventional utility. revision: yes
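A compact sketch of the claimed separation, under the assumption that splits are made at the trajectory level (train_ret_encoder is a hypothetical call standing in for the self-supervised stage):

```python
import numpy as np

rng = np.random.default_rng(4)

# Stand-ins: 1,000 trajectories of hidden states plus behavioral labels.
traj = rng.normal(size=(1000, 128, 64))   # (trajectory, time, hidden)
labels = rng.integers(0, 2, size=1000)

# Split at the TRAJECTORY level so no evaluation sequence (or label)
# leaks into self-supervised pretraining -- the separation claimed above.
idx = rng.permutation(len(traj))
pretrain_idx, eval_idx = idx[:800], idx[800:]

unlabeled_pretrain = traj[pretrain_idx]             # encoder sees only this
eval_traj, eval_labels = traj[eval_idx], labels[eval_idx]

# encoder = train_ret_encoder(unlabeled_pretrain)   # hypothetical call
# Probes and interventions then touch eval_traj / eval_labels exclusively.
```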
Circularity Check
No significant circularity; new framework with empirical evaluations
Full rationale
The paper introduces RET as a novel framework that applies a BYOL/JEPA-style self-supervised objective to learn macrostates from LLM hidden-state trajectories, then evaluates them empirically on tasks including temporal consistency, semantic capture, early prediction of outcomes like sycophancy, and steering interventions. No derivation chain is presented in which a claimed prediction or result reduces by construction to the inputs or fitted parameters (no equations equate outputs to definitions). No self-citations are invoked as load-bearing for uniqueness or ansatzes, and no known results are merely renamed. The central claims rest on the new objective and downstream evaluations rather than tautological reductions, consistent with an assessment of low circularity.
Axiom & Free-Parameter Ledger
axioms (1)
- Domain assumption: the self-supervised objective on hidden-state trajectories preserves higher-level semantic and causal structure relevant for prediction and intervention.
invented entities (2)
- Representational Effective Theory (RET): no independent evidence
- macrostates / macrovariables: no independent evidence
Lean theorems connected to this paper
- IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel [echoes]
  ECHOES: this paper passage has the same mathematical shape or conceptual pattern as the Recognition theorem, but is not a direct formal dependency.
  Passage: "RET trains a lightweight encoder to map hidden-state trajectories ... using a BYOL/JEPA-style self-supervised objective ... z_{t+1} = T(z_t) + η_t"
- IndisputableMonolith/Foundation/RealityFromDistinction.lean · reality_from_one_distinction [unclear]
  UNCLEAR: the relation between this paper passage and the cited Recognition theorem is ambiguous.
  Passage: "three desiderata: abstraction from microscopic detail, approximately closed dynamics, practical relevance"
What do these tags mean?
- matches: the paper's claim is directly supported by a theorem in the formal canon.
- supports: the theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: the paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: the paper appears to rely on the theorem as machinery.
- contradicts: the paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Reference graph
Works this paper leans on
- [1] Guillaume Alain and Yoshua Bengio. Understanding intermediate layers using linear classifier probes. arXiv preprint arXiv:1610.01644, 2016.
- [2] P. W. Anderson. More is different. Science, 177(4047):393–396, 1972. doi: 10.1126/science.177.4047.393.
- [3] Andy Arditi. SAEs for GPT-OSS-20B. https://huggingface.co/andyrdt/saes-gpt-oss-20b, 2026. Hugging Face model repository. Accessed 2026-05-02.
- [4] Andy Arditi, Oscar Obeso, Aaquib Syed, Daniel Paleka, Nina Panickssery, Wes Gurnee, and Neel Nanda. Refusal in language models is mediated by a single direction. Advances in Neural Information Processing Systems, 37:136037–136083, 2024.
- [5] Dhananjay Ashok and Jonathan May. Language models can predict their own behavior. In The Thirty-ninth Annual Conference on Neural Information Processing Systems, 2026. URL https://openreview.net/forum?id=i8IqEzpHaJ.
- [6] Mahmoud Assran, Quentin Duval, Ishan Misra, Piotr Bojanowski, Pascal Vincent, Michael Rabbat, Yann LeCun, and Nicolas Ballas. Self-supervised learning from images with a joint-embedding predictive architecture. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 15619–15629, 2023.
- [7] Mido Assran, Adrien Bardes, David Fan, Quentin Garrido, Russell Howes, Matthew Muckley, Ammar Rizvi, Claire Roberts, Koustuv Sinha, Artem Zholus, et al. V-JEPA 2: Self-supervised video models enable understanding, prediction and planning. arXiv preprint arXiv:2506.09985, 2025.
- [8] Marco Bagatella, Matteo Pirotta, Ahmed Touati, Alessandro Lazaric, and Andrea Tirinzoni. TD-JEPA: Latent-predictive representations for zero-shot reinforcement learning. arXiv preprint arXiv:2510.00739, 2025.
- [9] Adrien Bardes, Quentin Garrido, Jean Ponce, Xinlei Chen, Michael Rabbat, Yann LeCun, Mido Assran, and Nicolas Ballas. V-JEPA: Latent video prediction for visual representation learning. OpenReview, 2023.
- [10] Adrien Bardes, Quentin Garrido, Jean Ponce, Xinlei Chen, Michael Rabbat, Yann LeCun, Mahmoud Assran, and Nicolas Ballas. Revisiting feature prediction for learning visual representations from video. arXiv preprint arXiv:2404.08471, 2024.
- [11] Robert W. Batterman and Collin C. Rice. Minimal model explanations. Philosophy of Science, 81(3):349–376, 2014. doi: 10.1086/676677.
- [12] Nora Belrose, Igor Ostrovsky, Lev McKinney, Zach Furman, Logan Smith, Danny Halawi, Stella Biderman, and Jacob Steinhardt. Eliciting latent predictions from transformers with the tuned lens. arXiv preprint arXiv:2303.08112, 2023.
- [13] Usha Bhalla, Alex Oesterling, Claudio Mayrink Verdun, Himabindu Lakkaraju, and Flavio P. Calmon. Temporal sparse autoencoders: Leveraging the sequential nature of language for interpretability. arXiv preprint arXiv:2511.05541, 2025.
- [14] Usha Bhalla, Alex Oesterling, Claudio Mayrink Verdun, Himabindu Lakkaraju, and Flavio P. Calmon. Temporal sparse autoencoders: Leveraging the sequential nature of language for interpretability. In The Fourteenth International Conference on Learning Representations, 2026. URL https://openreview.net/forum?id=bojVI4l9Kn.
- [15] Stella Biderman, Hailey Schoelkopf, Quentin Anthony, Herbie Bradley, Kyle O'Brien, Eric Hallahan, Mohammad Aflah Khan, Shivanshu Purohit, USVSN Sai Prashanth, Edward Raff, et al. Pythia: A suite for analyzing large language models across training and scaling. In Proceedings of the 40th International Conference on Machine Learning, pages 2397–2430, 2023.
- [16] Trenton Bricken, Adly Templeton, Joshua Batson, Brian Chen, Adam Jermyn, Tom Conerly, Nick Turner, Cem Anil, Carson Denison, Amanda Askell, et al. Towards monosemanticity: Decomposing language models with dictionary learning. Transformer Circuits Thread, 2023.
- [17–18] Collin Burns, Haotian Ye, Dan Klein, and Jacob Steinhardt. Discovering latent knowledge in language models without supervision. In International Conference on Learning Representations, 2023. URL https://openreview.net/forum?id=ETKGuby0hcs.
- [19] Yik Siu Chan, Zheng-Xin Yong, and Stephen H. Bach. Can we predict alignment before models finish thinking? Towards monitoring misaligned reasoning models. arXiv preprint arXiv:2507.12428, 2025.
- [20] Arthur Conmy, Augustine Mavor-Parker, Aengus Lynch, Stefan Heimersheim, and Adrià Garriga-Alonso. Towards automated circuit discovery for mechanistic interpretability. Advances in Neural Information Processing Systems, 36:16318–16352, 2023.
- [21] James P. Crutchfield. The origins of computational mechanics: A brief intellectual history and several clarifications. arXiv preprint arXiv:1710.06832, 2017.
- [22] James P. Crutchfield and Karl Young. Inferring statistical complexity. Physical Review Letters, 63:105–108, 1989. doi: 10.1103/PhysRevLett.63.105.
- [23] Hoagy Cunningham, Aidan Ewart, Logan Riggs, Robert Huben, and Lee Sharkey. Sparse autoencoders find highly interpretable features in language models. arXiv preprint arXiv:2309.08600, 2023.
- [24] EleutherAI. EleutherAI/sae-pythia-160m-32k. Hugging Face model card, 2026. Accessed 2026-04-19.
- [25–26] Nelson Elhage, Neel Nanda, Catherine Olsson, Tom Henighan, Nicholas Joseph, Ben Mann, Amanda Askell, Yuntao Bai, Anna Chen, Tom Conerly, et al. A mathematical framework for transformer circuits. Transformer Circuits Thread, 2021. URL https://transformer-circuits.pub/2021/framework/index.html.
- [27] Christopher J. Ellison, John R. Mahoney, and James P. Crutchfield. Prediction, retrodiction, and the amount of information stored in the present. Journal of Statistical Physics, 136(6):1005–1034, 2009.
- [28] Leo Gao, Tom Dupre la Tour, Henk Tillman, Gabriel Goh, Rajan Troll, Alec Radford, Ilya Sutskever, Jan Leike, and Jeffrey Wu. Scaling and evaluating sparse autoencoders. In The Thirteenth International Conference on Learning Representations, 2024.
- [29] Mor Geva, Roei Schuster, Jonathan Berant, and Omer Levy. Transformer feed-forward layers are key-value memories. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, pages 5484–5495, 2021.
- [30] Mor Geva, Jasmijn Bastings, Katja Filippova, and Amir Globerson. Dissecting recall of factual associations in auto-regressive language models. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, pages 12216–12235, 2023.
- [31] Jean-Bastien Grill, Florian Strub, Florent Altché, Corentin Tallec, Pierre H. Richemond, Elena Buchatskaya, Carl Doersch, Bernardo Avila Pires, Zhaohan Daniel Guo, Mohammad Gheshlaghi Azar, et al. Bootstrap your own latent: A new approach to self-supervised learning. In Neural Information Processing Systems, 2020.
- [32] Evan Hernandez, Arnab Sen Sharma, Tal Haklay, Kevin Meng, Martin Wattenberg, Jacob Andreas, Yonatan Belinkov, and David Bau. Linearity of relation decoding in transformer language models. arXiv preprint arXiv:2308.09124, 2023.
- [33] Jiseung Hong, Grace Byun, Seungone Kim, and Kai Shu. Measuring sycophancy of language models in multi-turn dialogues. In Findings of the Association for Computational Linguistics: EMNLP 2025, pages 2239–2259, 2025. doi: 10.18653/v1/2025.findings-emnlp.121.
- [34] Hai Huang, Yann LeCun, and Randall Balestriero. LLM-JEPA: Large language models meet joint embedding predictive architectures. In NeurIPS 2025 Fourth Workshop on Deep Learning for Code, 2025.
- [35] Henrik Jeldtoft Jensen. Complexity Science: The Study of Emergence. Cambridge University Press, 2022.
- [36] Ceyhun Efe Kayan and Li Zhang. Prototype-based dynamic steering for large language models. arXiv preprint arXiv:2510.05498, 2025.
- [37] Yann LeCun. A path towards autonomous machine intelligence (version 0.9.2, 2022-06-27). OpenReview, 62(1):1–62, 2022.
- [38] Jia Li, Edward Beeching, Lewis Tunstall, Ben Lipkin, Roman Soletskyi, Shengyi Huang, Kashif Rasul, Longhui Yu, Albert Q. Jiang, Ziju Shen, et al. NuminaMath: The largest public dataset in AI4Maths with 860k pairs of competition math problems and solutions. Hugging Face repository, 13(9):9, 2024.
- [39] Ekdeep Singh Lubana, Can Rager, Sai Sumedh R. Hindupur, Valerie Costa, Greta Tuckute, Oam Patel, Sonia Krishna Murthy, Thomas Fel, Daniel Wurgaft, Eric J. Bigelow, et al. Priors in time: Missing inductive biases for language model interpretability. arXiv preprint arXiv:2511.01836, 2025.
- [40] Leland McInnes, John Healy, Nathaniel Saul, and Lukas Großberger. UMAP: Uniform manifold approximation and projection. Journal of Open Source Software, 3(29), 2018.
- [41] Kevin Meng, David Bau, Alex Andonian, and Yonatan Belinkov. Locating and editing factual associations in GPT. In Proceedings of the 36th International Conference on Neural Information Processing Systems, pages 17359–17372, 2022.
- [42] Chris Olah, Nick Cammarata, Ludwig Schubert, Gabriel Goh, Michael Petrov, and Shan Carter. Zoom in: An introduction to circuits. Distill, 2020. doi: 10.23915/distill.00024.001. URL https://distill.pub/2020/circuits/zoom-in.
- [43] Catherine Olsson, Nelson Elhage, Neel Nanda, Nicholas Joseph, Nova DasSarma, Tom Henighan, Ben Mann, Amanda Askell, Yuntao Bai, Anna Chen, et al. In-context learning and induction heads. Transformer Circuits Thread, 2022.
- [44] Nina Panickssery, Nick Gabrieli, Julian Schulz, Meg Tong, Evan Hubinger, and Alexander Matt Turner. Steering Llama 2 via contrastive activation addition. arXiv preprint arXiv:2312.06681, 2023.
- [45] Qwen. Qwen3.5-omni technical report, 2026. URL https://arxiv.org/abs/2604.15804.
- [46] Qwen Team: An Yang, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, et al. Qwen2.5 technical report. arXiv preprint arXiv:2412.15115, 2025.
- [47] Senthooran Rajamanoharan, Arthur Conmy, Lewis Smith, Tom Lieberum, Vikrant Varma, János Kramár, Rohin Shah, and Neel Nanda. Improving dictionary learning with gated sparse autoencoders. arXiv preprint arXiv:2404.16014, 2024. URL https://arxiv.org/abs/2404.16014.
- [48] Fernando E. Rosas, Bernhard C. Geiger, Andrea I. Luppi, Anil K. Seth, Daniel Polani, Michael Gastpar, and Pedro A. M. Mediano. Software in the natural world: A computational approach to hierarchical emergence. arXiv preprint arXiv:2402.09090, 2024.
- [49] Adam Rupe and James P. Crutchfield. On principles of emergent organization. Physics Reports, 1071:1–47, 2024. doi: 10.1016/j.physrep.2024.04.001.
- [50] Yu San Wu, Lucas J. van Vliet, Henderik W. Frijlink, and Kees van der Voort Maarschalk. The determination of relative path length as a measure for tortuosity in compacts using image analysis. European Journal of Pharmaceutical Sciences, 28(5):433–440, 2006.
- [51] Cosma Rohilla Shalizi and James P. Crutchfield. Computational mechanics: Pattern and prediction, structure and simplicity. Journal of Statistical Physics, 104(3):817–879, 2001.
- [52] Cosma Rohilla Shalizi and Cristopher Moore. What is a macrostate? Subjective observations and objective dynamics. Foundations of Physics, 55(1), 2024. doi: 10.1007/s10701-024-00814-1.
- [53] Mrinank Sharma, Meg Tong, Tomasz Korbak, David Duvenaud, Amanda Askell, Samuel R. Bowman, Esin Durmus, Zac Hatfield-Dodds, Scott R. Johnston, Shauna M. Kravec, et al. Towards understanding sycophancy in language models. In The Twelfth International Conference on Learning Representations, 2023.
- [54] Noam Slonim, Yonatan Bilu, Carlos Alzate, Roy Bar-Haim, Ben Bogin, Francesca Bonin, Leshem Choshen, Edo Cohen-Karlik, Lena Dankin, Lilach Edelstein, et al. An autonomous debating system. Nature, 591(7850):379–384, 2021.
- [55] Jianlin Su, Murtadha Ahmed, Yu Lu, Shengfeng Pan, Wen Bo, and Yunfeng Liu. RoFormer: Enhanced transformer with rotary position embedding. Neurocomputing, 568:127063, 2024.
- [56] Adly Templeton, Tom Conerly, Jonathan Marcus, Jack Lindsey, Trenton Bricken, Brian Chen, Adam Pearce, Craig Citro, Emmanuel Ameisen, Andy Jones, et al. Scaling monosemanticity: Extracting interpretable features from Claude 3 Sonnet. Transformer Circuits Thread, 2024.
- [57] Alexander Matt Turner, Lisa Thiergart, Gavin Leech, David Udell, Juan J. Vazquez, Ulisse Mini, and Monte MacDiarmid. Steering language models with activation engineering. arXiv preprint arXiv:2308.10248, 2023.
- [58] Muhammed Ustaomeroglu, Baris Askin, Gauri Joshi, Carlee Joe-Wong, and Guannan Qu. Internal planning in language models: Characterizing horizon and branch awareness. In The Fourteenth International Conference on Learning Representations, 2026. URL https://openreview.net/forum?id=dqGWQdFdTC.
- [59] Kevin Ro Wang, Alexandre Variengien, Arthur Conmy, Buck Shlegeris, and Jacob Steinhardt. Interpretability in the wild: A circuit for indirect object identification in GPT-2 small. In The Eleventh International Conference on Learning Representations, 2023. URL https://openreview.net/forum?id=NpsVSN6o4ul.
- [60] Keyu Wang, Jin Li, Shu Yang, Zhuoran Zhang, and Di Wang. When truth is overridden: Uncovering the internal origins of sycophancy in large language models. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 40, pages 33566–33574, 2026.
- [61] Tianlong Wang, Xianfeng Jiao, Yinghao Zhu, Zhongzhi Chen, Yifan He, Xu Chu, Junyi Gao, Yasha Wang, and Liantao Ma. Adaptive activation steering: A tuning-free LLM truthfulness improvement method for diverse hallucinations categories. In Proceedings of the ACM on Web Conference 2025, pages 2562–2578, 2025.
- [62] James D. Wells. Effective Theories in Physics: From Planetary Orbits to Elementary Particle Masses. SpringerBriefs in Physics. Springer, 2012. ISBN 978-3-642-34891-4. doi: 10.1007/978-3-642-34892-1.
- [63] Xinyan Yu, Sewon Min, Luke Zettlemoyer, and Hannaneh Hajishirzi. CREPE: Open-domain question answering with false presuppositions. In ACL, 2023.
- [64] Andy Zou, Long Phan, Sarah Chen, James Campbell, Phillip Guo, Richard Ren, Alexander Pan, Xuwang Yin, Mantas Mazeika, Ann-Kathrin Dombrowski, et al. Representation engineering: A top-down approach to AI transparency. arXiv preprint arXiv:2310.01405, 2023.