Steer-to-Detect: Probing Hidden Representations for Detection of LLM-Generated Texts

Luxu Liang; Xiang Li

arXiv: 2605.12890 · v1 · pith:GURII4AMnew · submitted 2026-05-13 · stat.AP · cs.LG

Pith reviewed 2026-06-30 21:51 UTC · model grok-4.3

keywords: LLM-generated text detection, steering vector, hidden representations, hypothesis testing, class separability, finite-sample guarantees, out-of-distribution detection, adversarial robustness

The pith

A learned steering vector injected into a frozen LLM's hidden states produces representations with better separation between human and machine text.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces a two-stage approach where a vector is first learned to shift the internal states of an unchanged language model toward greater distinction between classes. These adjusted states then feed into a statistical test whose error rates receive explicit finite-sample bounds. The goal is to overcome the overlap that raw hidden features typically show, making detection more reliable even when inputs come from new distributions or face modifications.

Core claim

Steer-to-Detect learns a steering vector from labeled examples and adds it to the hidden states of a frozen observer LLM, yielding representations in which human-written and LLM-generated texts exhibit improved class separability. A hypothesis test is then performed on the steered representations, and the procedure is accompanied by finite-sample high-probability guarantees on both Type I and Type II error rates.
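A minimal sketch of why an additive shift can change separability even though the model itself stays frozen. Everything in it is an illustrative assumption rather than the paper's construction: the two-dimensional "hidden states", the stand-in `frozen_layer` (a ReLU followed by a sum), the grid search used in place of whatever objective S2D actually optimizes, and the `separation` measure. The only point is that a vector added before a frozen nonlinearity can move the two classes into a regime where the downstream output tells them apart.

```python
import random
import statistics


def frozen_layer(h):
    # Hypothetical stand-in for the frozen downstream layers: ReLU, then sum.
    return sum(max(0.0, x) for x in h)


def steer(h, v):
    # Additive steering: the model weights are untouched, only h is shifted.
    return [x + s for x, s in zip(h, v)]


def separation(a, b):
    # Gap between class means, in units of the pooled standard deviation.
    pooled = (statistics.pvariance(a) + statistics.pvariance(b)) ** 0.5
    return abs(statistics.fmean(a) - statistics.fmean(b)) / (pooled + 1e-9)


def sample(rng, mean, n, sd=0.1):
    return [[rng.gauss(m, sd) for m in mean] for _ in range(n)]


rng = random.Random(0)
# Synthetic hidden states: both classes sit where the ReLU outputs zero.
human = sample(rng, [-2.0, -2.0], 200)
machine = sample(rng, [-1.0, -2.0], 200)


def steered_separation(v):
    return separation(
        [frozen_layer(steer(h, v)) for h in human],
        [frozen_layer(steer(h, v)) for h in machine],
    )


# Assumed learning rule for this toy: pick the best shift on a small grid.
grid = [0.5 * k for k in range(7)]
candidates = [[a, b] for a in grid for b in grid]
best_v = max(candidates, key=steered_separation)

base = steered_separation([0.0, 0.0])  # unsteered: both classes collapse
best = steered_separation(best_v)      # steered: classes pulled apart
print(base, best, best_v)
```

In this toy the unsteered outputs are identical for both classes, so no test on them could work; the shift is what exposes the difference that was already present in the hidden states.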

What carries the argument

The steering vector, learned from data and added to the hidden states of an unchanged observer LLM to increase separability before hypothesis testing.

If this is right

The hypothesis test on steered representations admits finite-sample high-probability bounds on Type I and Type II errors.
Detection performance remains strong when test data are drawn from distributions different from the training data.
The method continues to work under adversarial perturbations applied to the input text.
The two-stage separation of learning the vector and performing the test allows the observer model to stay frozen throughout.
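To make the first consequence concrete, here is a hedged sketch of one standard way to calibrate a rejection threshold on held-out human-written scores. It uses the split-conformal quantile rule, which is an assumption on our part and not necessarily the calibration used in the paper; the names `calibrate_threshold` and `flag_as_machine` are hypothetical. If the calibration scores and the score of a new human-written text are exchangeable, this rule keeps the false-positive probability at or below `alpha` on average over calibration draws, which is a marginal statement and weaker than the high-probability bounds the paper claims.

```python
import math


def calibrate_threshold(human_scores, alpha):
    """Return the k-th smallest score with k = ceil((n + 1) * (1 - alpha))."""
    ranked = sorted(human_scores)
    k = math.ceil((len(ranked) + 1) * (1 - alpha))
    # With too few calibration scores for this alpha, never reject.
    return ranked[k - 1] if k <= len(ranked) else math.inf


def flag_as_machine(score, threshold):
    # Reject the null "human-written" only when the score exceeds the threshold.
    return score > threshold


# Ten made-up calibration scores from human-written texts.
calibration = [float(s) for s in range(1, 11)]
threshold = calibrate_threshold(calibration, alpha=0.25)
print(threshold, flag_as_machine(9.5, threshold))
```

The infinite threshold for very small calibration sets is a deliberate choice: it refuses to reject rather than promise an error rate the data cannot support.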

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same steering construction could be tried on tasks that distinguish outputs from different model families rather than human versus machine.
If a low-dimensional direction consistently separates the two classes across many models, detectors might be updated by recomputing only the vector instead of retraining everything.
One could measure whether the learned vector remains effective when the observer LLM is replaced by a model from a different scale or architecture family.

Load-bearing premise

That a steering vector learned from data meaningfully increases separability in the hidden representations of an unchanged observer LLM, and that this increase carries over to the hypothesis test without introducing new biases.

What would settle it

A held-out test set on which the steered hidden states show no measurable reduction in class overlap relative to the unsteered states, or on which the observed error rates exceed the stated finite-sample bounds, would count against the claim.
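One way to put a number on "reduction in class overlap" is to compare AUROC before and after steering on the same held-out texts. The sketch below is an assumption about how such a check could be run, not a procedure taken from the paper; `auroc` and `overlap_reduction` are hypothetical helper names and the scores are made up.

```python
def auroc(human_scores, machine_scores):
    """Probability that a random machine score exceeds a random human score."""
    wins = 0.0
    for m in machine_scores:
        for h in human_scores:
            if m > h:
                wins += 1.0
            elif m == h:
                wins += 0.5  # ties count as half a win
    return wins / (len(human_scores) * len(machine_scores))


def overlap_reduction(unsteered, steered):
    """Each argument is a (human_scores, machine_scores) pair."""
    return auroc(*steered) - auroc(*unsteered)


# Made-up held-out scores: interleaved without steering, separated with it.
unsteered = ([0.1, 0.4, 0.6], [0.3, 0.5, 0.7])
steered = ([0.1, 0.2, 0.3], [0.6, 0.7, 0.8])
gain = overlap_reduction(unsteered, steered)
print(gain)
```

A gain near zero on held-out data would be the failure signal described above; a positive gain on training data alone would not be informative.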

Figures

Figures reproduced from arXiv: 2605.12890 by Luxu Liang, Xiang Li.

**Figure 1.** Overview of Steer-to-Detect (S2D). Phase I (top row) applies a steering vector to reshape the observer LLM’s hidden representations, enhancing the separation between human-written and LLM-generated text. Phase II (bottom row) scores unseen texts and rejects the null hypothesis when the score exceeds a calibrated threshold.

**Figure 2.** Analysis of S2D performance. (a) Comparison of detection stability across varying input lengths. (b) Detection performance across steering layers, showing that intermediate layers consistently achieve the best performance. (c) Performance heatmap as a function of last-token selection ratio and the number of aggregated layers.

**Figure 4.** Detection analysis. Left: Steering leads to better separability. Right: Detection performance across training sizes. Full results are in Figures 7 and 8 in the Appendix.

Table fragment extracted alongside this caption (truncated in the source):

| Observer Model | AUROC | TPR@1% | TPR@.01% |
|---|---|---|---|
| Llama-3.1-8B | 99.62 ± 0.46 | 99.05 ± 0.60 | 97.98 ± 1.37 |
| Mistral-7B-v0.3 | 68.52 ± 0.94 | 30.53 ± 5.31 | 23.28 ± 6.45 |
| GPT-Neo-2.7B | 82.38 ± 2.94 | 32.65 ± 8.91 | 11.33 ± 5.67 |
| OPT-2.7B | 99.86 ± 0.13 | 98.30 ± 1.79 | 84.50 ± 8.24… |

**Figure 5.** Empirical distributions of L2 norms of representations obtained from the last 8 layers and the final 20% of tokens, across different models and domains. Columns correspond to EleutherAI/GPT-J6B, Qwen/Qwen2.5-7B, and meta-llama/Llama-3.1-8B, while rows correspond to the Arxiv, XSum, Yelp, and Writing datasets. The solid and dashed red lines denote the mean and the ±2σ intervals, respectively, with exact va…
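The captions for Figures 2 and 5 mention aggregating over the last few layers and the final fraction of tokens. A hedged sketch of what such a selection could look like follows. The captions do not say how the selected states are combined, so the plain mean used here is an assumption, as are the function names and the nested-list layout `hidden[layer][token]`.

```python
import math


def pool_hidden_states(hidden, last_layers=8, token_ratio=0.2):
    """Average the hidden vectors from the last layers and trailing tokens.

    hidden[layer][token] is a list of floats (one hidden vector).
    """
    layers = hidden[-last_layers:]
    n_tokens = len(layers[0])
    keep = max(1, math.ceil(n_tokens * token_ratio))  # trailing tokens kept
    selected = [vec for layer in layers for vec in layer[-keep:]]
    dim = len(selected[0])
    return [sum(vec[j] for vec in selected) / len(selected) for j in range(dim)]


def l2_norm(vec):
    return math.sqrt(sum(x * x for x in vec))


# Tiny synthetic example: 3 layers, 4 tokens, 2-dimensional states.
hidden = [[[float(layer), float(tok)] for tok in range(4)] for layer in range(3)]
pooled = pool_hidden_states(hidden, last_layers=2, token_ratio=0.5)
print(pooled, l2_norm(pooled))
```

With a real observer model the nested lists would be replaced by the per-layer hidden-state tensors the model returns; that wiring is left out because the page does not describe it.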

**Figure 6.** Empirical distributions of projected representations

**Figure 7.** Score distributions of different detection methods involving hidden representations across …

**Figure 8.** Detection performance across different training set sizes.

Original abstract

The rapid advancement of large language models (LLMs) has made machine-generated text increasingly difficult to distinguish from human-written text. While recent studies explore leveraging internal representations of language models to uncover deeper detection signals, these raw features often exhibit substantial overlap between classes, limiting their discriminative power. To address this challenge, we propose Steer-to-Detect (S2D), a two-stage framework for detecting LLM-generated text. In the first stage, S2D learns a steering vector that is injected into the hidden states of a frozen observer LLM, producing representations with improved class separability. In the second stage, detection is performed via a hypothesis testing procedure based on the steered representations. We establish finite-sample, high-probability guarantees for Type I and Type II errors, providing a theoretical characterization of the procedure. Empirically, S2D achieves strong and consistent performance across a range of settings, including out-of-distribution scenarios and adversarial perturbations.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

S2D learns a steering vector on frozen LLM states, then runs a hypothesis test with claimed finite-sample bounds, but the bounds probably ignore the dependence introduced by learning the vector from data.

The main takeaway is that this paper puts forward a two-stage detector: first learn a steering vector to push hidden states of a frozen observer LLM toward better class separation, then apply a hypothesis test on the steered representations with finite-sample Type I and Type II error guarantees. The steering-plus-testing pipeline is presented as new relative to earlier representation-based detectors.

What stands out as useful is the attempt to keep the base model frozen while still improving separability through an additive vector, plus the explicit claim of non-asymptotic guarantees. That combination could matter for practical detection work where retraining the whole model is costly. The abstract also flags out-of-distribution and adversarial settings, which are the right stress tests for this problem.

The soft spot is exactly the one the stress-test note flags. The steering vector is learned from finite data in stage one, yet the guarantees appear to treat the resulting representations as if the vector were fixed in advance. Without a union bound, separate concentration result, or explicit train-test split analysis, the dependence between the learned vector and the test statistic is unaccounted for. That can inflate the actual error rates beyond what the bounds promise. The abstract gives no sign that this is handled, so the theoretical characterization is the part that needs the most checking.
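A small simulation can show how large this effect is in a bad case. The sketch below is a deliberately adversarial toy of our own and not the paper's setup: both classes are drawn from the same Gaussian, so any apparent separation is overfitting, and the "learned vector" is just a difference of class means. It compares calibrating the threshold on the same human samples used to learn the direction against calibrating on a fresh split. All names and sizes are hypothetical.

```python
import math
import random


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def mean_vec(rows):
    return [sum(col) / len(rows) for col in zip(*rows)]


def conformal_threshold(null_scores, alpha):
    ranked = sorted(null_scores)
    k = math.ceil((len(ranked) + 1) * (1 - alpha))
    return ranked[k - 1] if k <= len(ranked) else math.inf


def draw(rng, n, d):
    return [[rng.gauss(0.0, 1.0) for _ in range(d)] for _ in range(n)]


def run(trials=100, n=20, d=40, alpha=0.1, n_fresh=50, seed=0):
    rng = random.Random(seed)
    reuse_hits = split_hits = 0
    for _ in range(trials):
        human_fit, machine_fit = draw(rng, n, d), draw(rng, n, d)
        human_cal = draw(rng, n, d)          # never seen while fitting
        fresh_human = draw(rng, n_fresh, d)  # texts to be judged
        # Direction learned from finite data (difference of class means).
        w = [m - h for h, m in zip(mean_vec(human_fit), mean_vec(machine_fit))]
        # Threshold calibrated on the fitting data versus on the held-out split.
        t_reuse = conformal_threshold([dot(w, h) for h in human_fit], alpha)
        t_split = conformal_threshold([dot(w, h) for h in human_cal], alpha)
        for x in fresh_human:
            s = dot(w, x)
            reuse_hits += s > t_reuse
            split_hits += s > t_split
    total = trials * n_fresh
    return reuse_hits / total, split_hits / total


reuse_fpr, split_fpr = run()
print(reuse_fpr, split_fpr)
```

In this toy the reused-data false-positive rate lands well above the nominal level while the split version stays near it. How much of this carries over to the real method depends on the dimension, the sample sizes, and how the vector is actually learned.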

Empirically the paper claims consistent performance, but again the abstract supplies no numbers or ablation details, so it is impossible to judge effect sizes or whether the gains survive the same distribution shifts that would break the assumptions.

This is the kind of work that belongs in a reading group focused on statistical detection methods or AI safety tooling. A serious editor should send it to review because the problem is timely, the pipeline is concrete, and the authors are trying to supply guarantees rather than just empirical curves. The referees will need to see the full proofs and the handling of the estimation step before the claims can be trusted.

Referee Report

1 major / 1 minor

Summary. The paper proposes Steer-to-Detect (S2D), a two-stage framework for detecting LLM-generated text. Stage 1 learns a steering vector that is injected into the hidden states of a frozen observer LLM to produce representations with improved class separability. Stage 2 performs detection via hypothesis testing on the steered representations and establishes finite-sample, high-probability guarantees on Type I and Type II errors. The work also reports strong empirical performance across in-distribution, out-of-distribution, and adversarial settings.

Significance. If the finite-sample guarantees can be shown to hold after properly accounting for the data-dependent steering vector, the work would supply a theoretically grounded method for enhancing separability in internal LLM representations without retraining the observer model. The combination of a learnable steering mechanism with explicit error bounds is a potentially useful contribution to the detection literature.

major comments (1)

[Abstract] (and any theoretical section deriving the bounds) The finite-sample high-probability guarantees on Type I/II errors are stated for the steered representations, yet the steering vector is learned from finite data in stage 1. The provided description gives no indication that the analysis incorporates a union bound, concentration inequality, or strict train/test separation to control the dependence between the learned vector and the subsequent test statistic. If the proof treats the vector as fixed (oracle), the claimed guarantees do not automatically extend to the two-stage procedure.

minor comments (1)

The abstract refers to 'out-of-distribution scenarios and adversarial perturbations' without naming the concrete datasets, perturbation methods, or evaluation metrics; these details are needed to assess the strength of the empirical claims.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the careful reading and for identifying this subtlety in the theoretical analysis of the two-stage procedure. We address the point directly below.

Referee: [Abstract] (and any theoretical section deriving the bounds) The finite-sample high-probability guarantees on Type I/II errors are stated for the steered representations, yet the steering vector is learned from finite data in stage 1. The provided description gives no indication that the analysis incorporates a union bound, concentration inequality, or strict train/test separation to control the dependence between the learned vector and the subsequent test statistic. If the proof treats the vector as fixed (oracle), the claimed guarantees do not automatically extend to the two-stage procedure.

Authors: We agree that the stated finite-sample bounds are derived under the assumption that the steering vector is fixed once learned. The current theoretical section does not apply a union bound over the learning stage or provide an explicit concentration argument that would make the guarantees unconditional on the data used to obtain the vector. The manuscript therefore presents conditional guarantees given the learned vector rather than fully accounting for the dependence introduced by Stage 1. We will revise the abstract and the theoretical section to make this conditioning explicit, to clarify the role of the held-out test set used for detection, and to add a short discussion of the additional technical steps (e.g., a union bound or data-splitting argument) that would be required to obtain unconditional high-probability statements. These clarifications will be incorporated in the next version.

Revision: yes
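For reference, the data-splitting route mentioned in the response can be written out in a few lines. This is the standard conditioning argument and not a theorem quoted from the paper. The notation is assumed for illustration: $D_1$ is the data used to learn the vector $\hat v$, $D_2$ is the calibration data, $X$ is a new human-written text, $T_v$ is the test statistic under steering vector $v$, and $\tau_v(D_2)$ is the calibrated threshold.

```latex
% Assumption: D_1 is independent of (D_2, X), and for every FIXED vector v
%   \Pr_{D_2, X}\bigl( T_v(X) > \tau_v(D_2) \bigr) \le \alpha .
% Conditional on D_1 the learned vector \hat v = \hat v(D_1) is fixed, and by
% independence the law of (D_2, X) is unchanged, so the fixed-v bound applies:
\begin{align*}
\Pr\bigl( T_{\hat v}(X) > \tau_{\hat v}(D_2) \bigr)
  &= \mathbb{E}_{D_1}\Bigl[ \Pr\bigl( T_{\hat v}(X) > \tau_{\hat v}(D_2) \,\big|\, D_1 \bigr) \Bigr] \\
  &\le \mathbb{E}_{D_1}[\alpha] = \alpha .
\end{align*}
% The same conditioning step carries over to a high-probability statement:
% if, for every fixed v, the event
%   \{ \Pr_X( T_v(X) > \tau_v(D_2) \mid D_2 ) \le \alpha \}
% has probability at least 1 - \delta over D_2, then it also has probability
% at least 1 - \delta with v replaced by \hat v.
```

The argument fails as soon as $D_1$ and $D_2$ share samples, which is the case a union bound or a uniform concentration result would have to cover instead.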

Circularity Check

0 steps flagged

No circularity; two-stage procedure and guarantees are presented as independent

The abstract describes a two-stage framework in which a steering vector is learned from data and injected into a frozen observer LLM, followed by a separate hypothesis-testing stage on the resulting representations. Finite-sample high-probability guarantees for Type I and Type II errors are stated as a theoretical characterization of the procedure. No equations, definitions, or self-citations are quoted that would reduce the guarantees to the learned vector by construction, treat a fitted quantity as a prediction, or rely on load-bearing self-citation chains. The derivation chain therefore remains self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 0 invented entities

Review performed on abstract alone; ledger entries are inferred at the level of stated assumptions.

free parameters (1)

steering vector
Learned in stage one from data; its values are fitted rather than derived from first principles.

axioms (1)

domain assumption: Hidden representations of an observer LLM contain class-separable signals for generated versus human text that can be enhanced by a linear steering vector.
Central premise invoked by the two-stage framework description.


work page arXiv 2025
[59]

Detecting LLM-Generated Text with Performance Guarantees.arXiv preprint arXiv:2601.06586, 2026

Hongyi Zhou, Jin Zhu, Ying Yang, and Chengchun Shi. Detecting LLM-Generated Text with Performance Guarantees.arXiv preprint arXiv:2601.06586, 2026

work page arXiv 2026
[60]

MGT-Prism: Enhancing Domain Generalization for Machine-Generated Text Detection via Spectral Alignment.arXiv preprint arXiv:2508.13768, 2025

Shengchao Liu, Xiaoming Liu, Chengzhengxu Li, Zhaohan Zhang, Guoxin Ma, Yu Lan, and Shuai Xiao. MGT-Prism: Enhancing Domain Generalization for Machine-Generated Text Detection via Spectral Alignment.arXiv preprint arXiv:2508.13768, 2025

work page arXiv 2025
[61]

DetectRL: Benchmarking LLM-generated text detection in real-world scenarios.Advances in Neural Information Processing Systems, 37:100369–100401, 2024

Junchao Wu, Runzhe Zhan, Derek Wong, Shu Yang, Xinyi Yang, Yulin Yuan, and Lidia Chao. DetectRL: Benchmarking LLM-generated text detection in real-world scenarios.Advances in Neural Information Processing Systems, 37:100369–100401, 2024

2024
[62]

Don’t give me the details, just the summary! topic-aware convolutional neural networks for extreme summarization

Shashi Narayan, Shay B Cohen, and Mirella Lapata. Don’t give me the details, just the summary! topic-aware convolutional neural networks for extreme summarization. InProceedings of the 2018 conference on empirical methods in natural language processing, pages 1797–1807, 2018

2018
[63]

Hierarchical neural story generation

Angela Fan, Mike Lewis, and Yann Dauphin. Hierarchical neural story generation. InProceed- ings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 889–898, 2018

2018
[64]

Character-level convolutional networks for text classification.Advances in Neural Information Processing Systems, 28, 2015

Xiang Zhang, Junbo Zhao, and Yann LeCun. Character-level convolutional networks for text classification.Advances in Neural Information Processing Systems, 28, 2015

2015
[65]

Introducing ChatGPT

OpenAI. Introducing ChatGPT. https://openai.com/index/chatgpt/, 2023. OpenAI Blog

2023
[66]

Releasing claude instant 1.2, 2023

Anthropic. Releasing claude instant 1.2, 2023. URL https://www.anthropic.com/news/ releasingclaude-instant-1-2. Anthropic Blog. 13

2023
[67]

PaLM 2 Technical Report

Rohan Anil, Andrew M Dai, Orhan Firat, Melvin Johnson, Dmitry Lepikhin, Alexandre Passos, Siamak Shakeri, Emanuel Taropa, Paige Bailey, Zhifeng Chen, et al. Palm 2 technical report. arXiv preprint arXiv:2305.10403, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023
[68]

Llama 2: Open Foundation and Fine-Tuned Chat Models

Hugo Touvron, Louis Martin, Kevin Stone, Peter Albert, Amjad Almahairi, Yasmine Babaei, Nikolay Bashlykov, Soumya Batra, Prajjwal Bhargava, Shruti Bhosale, et al. Llama 2: Open Foundation and Fine-Tuned Chat Models.arXiv preprint arXiv:2307.09288, 2023. URL https://arxiv.org/abs/2307.09288

work page internal anchor Pith review Pith/arXiv arXiv 2023
[69]

GLTR: Statistical detection and visualization of generated text

Sebastian Gehrmann, Hendrik Strobelt, and Alexander M Rush. GLTR: Statistical detection and visualization of generated text. InProceedings of the 57th annual meeting of the association for computational linguistics: system demonstrations, pages 111–116, 2019

2019
[70]

Imitate before detect: Aligning machine stylistic preference for machine-revised text detection

Jiaqi Chen, Xiaoye Zhu, Tianyang Liu, Ying Chen, Chen Xinhui, Yiwen Yuan, Chak Tou Leong, Zuchao Li, Long Tang, Lei Zhang, et al. Imitate before detect: Aligning machine stylistic preference for machine-revised text detection. InProceedings of the AAAI Conference on Artificial Intelligence, volume 39, pages 23559–23567, 2025

2025
[71]

Watermarking of large language models

Scott Aaronson and H Kirchner. Watermarking of large language models. InLarge language models and transformers workshop at Simons Institute for the Theory of Computing, volume 2023, 2023

2023
[72]

Scalable watermarking for identifying large language model outputs.Nature, 634(8035):818–823, 2024

Sumanth Dathathri, Abigail See, Sumedh Ghaisas, Po-Sen Huang, Rob McAdam, Johannes Welbl, Vandana Bachani, Alex Kaskasoli, Robert Stanforth, Tatiana Matejovicova, et al. Scalable watermarking for identifying large language model outputs.Nature, 634(8035):818–823, 2024

2024
[73]

Index for rating diagnostic tests.Cancer, 3(1):32–35, 1950

William J Youden. Index for rating diagnostic tests.Cancer, 3(1):32–35, 1950

1950
[74]

Decoupled Weight Decay Regularization

Ilya Loshchilov and Frank Hutter. Decoupled weight decay regularization.arXiv preprint arXiv:1711.05101, 2017

work page internal anchor Pith review Pith/arXiv arXiv 2017
[75]

The tight constant in the Dvoretzky-Kiefer-Wolfowitz inequality.The Annals of Probability, pages 1269–1283, 1990

Pascal Massart. The tight constant in the Dvoretzky-Kiefer-Wolfowitz inequality.The Annals of Probability, pages 1269–1283, 1990

1990
[76]

Measuring mass concentrations and estimating density contour clusters-an excess mass approach.The annals of Statistics, pages 855–881, 1995

Wolfgang Polonik. Measuring mass concentrations and estimating density contour clusters-an excess mass approach.The annals of Statistics, pages 855–881, 1995

1995
[77]

Smooth discrimination analysis.The Annals of Statistics, 27(6):1808–1829, 1999

Enno Mammen and Alexandre B Tsybakov. Smooth discrimination analysis.The Annals of Statistics, 27(6):1808–1829, 1999

1999
[78]

A plug-in approach to neyman-pearson classification.Journal of Machine Learning Research, 14(1):3011–3040, 2013

Xin Tong. A plug-in approach to neyman-pearson classification.Journal of Machine Learning Research, 14(1):3011–3040, 2013

2013
A Algorithm

Algorithm 1: Overall training pipeline for S2D

Input: Frozen observer model $f_\theta$, training set $S_{\mathrm{train}}$, null calibration set $S_{\mathrm{cal}}$ (human-written text only); steering layer $\ell_s$; vMF concentration parameter $\kappa$; EMA coefficient $\rho$; learning rate $\eta$; ...

By Bayes' theorem,

$$
p\bigl(y_i \mid f_{\theta,v}(x_i)\bigr) = \frac{p\bigl(f_{\theta,v}(x_i) \mid y_i\bigr)\, p(y_i)}{\sum_{c \in \{0,1\}} p\bigl(f_{\theta,v}(x_i) \mid y_i = c\bigr)\, p(y_i = c)} .
$$

Substituting the vMF likelihood and the uniform prior gives

$$
p\bigl(y_i \mid f_{\theta,v}(x_i)\bigr) = \frac{C_d(\kappa) \exp\bigl(\kappa\, \mu_{y_i}^{\top} f_{\theta,v}(x_i)\bigr) \cdot \tfrac{1}{2}}{\sum_{c \in \{0,1\}} C_d(\kappa) \exp\bigl(\kappa\, \mu_c^{\top} f_{\theta,v}(x_i)\bigr) \cdot \tfrac{1}{2}} .
$$

Since $C_d(\kappa)$ and the prior $\tfrac{1}{2}$ are identical across classes, they cancel out, yielding the softmax form ...
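The cancellation argument can be checked numerically. The sketch below is an illustration only, not the paper's code: it computes the two-class posterior once through Bayes' rule with a shared normalizer and a uniform prior, and once as a softmax over the logits $\kappa\, \mu_c^{\top} z$, and compares the two. The function names, the toy 3-dimensional vectors, and the value passed as `log_norm_const` (a stand-in for $\log C_d(\kappa)$, which is not evaluated here) are all assumptions made for the example.

```python
import math


def dot(a, b):
    """Inner product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def unit(v):
    """Project a vector onto the unit sphere, as vMF models assume."""
    norm = math.sqrt(dot(v, v))
    return [x / norm for x in v]


def bayes_posterior(z, prototypes, kappa, log_norm_const, prior=0.5):
    """Posterior over classes from Bayes' rule with a shared normalizer.

    log_norm_const is a hypothetical placeholder for log C_d(kappa); any
    finite value works because it is identical across classes.
    """
    joint = [
        math.exp(log_norm_const + kappa * dot(mu, z)) * prior
        for mu in prototypes
    ]
    total = sum(joint)
    return [j / total for j in joint]


def softmax_posterior(z, prototypes, kappa):
    """Softmax over the logits kappa * mu_c^T z (constants already cancelled)."""
    logits = [kappa * dot(mu, z) for mu in prototypes]
    shift = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(l - shift) for l in logits]
    total = sum(exps)
    return [e / total for e in exps]


# Toy inputs (assumed values, chosen only for illustration).
z = unit([0.6, 0.8, 0.0])                       # steered representation
prototypes = [unit([1.0, 0.0, 0.0]),            # class-0 mean direction
              unit([0.0, 1.0, 0.0])]            # class-1 mean direction
kappa = 10.0

bayes = bayes_posterior(z, prototypes, kappa, log_norm_const=-3.7)
soft = softmax_posterior(z, prototypes, kappa)
max_gap = max(abs(a - b) for a, b in zip(bayes, soft))
```

Changing `log_norm_const` to any other finite value leaves `bayes` unchanged up to floating-point error, which is the practical reason the normalizer never has to be computed when only class posteriors are needed.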

[45] [45]

Steering llama 2 via contrastive activation addition

Nina Rimsky, Nick Gabrieli, Julian Schulz, Meg Tong, Evan Hubinger, and Alexander Turner. Steering llama 2 via contrastive activation addition. InProceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 15504–15522, 2024

2024

[46] [46]

SHARP: Steering hallucination in LVLMs via representation engineering

Junfei Wu, Yue Ding, Guofan Liu, Tianze Xia, Ziyue Huang, Dianbo Sui, Qiang Liu, Shu Wu, Liang Wang, and Tieniu Tan. SHARP: Steering hallucination in LVLMs via representation engineering. InProceedings of the 2025 Conference on Empirical Methods in Natural Language Processing, pages 14357–14372, 2025

2025

[47] [47]

Alphasteer: Learning refusal steering with principled null-space constraint.arXiv preprint arXiv:2506.07022, 2025

Leheng Sheng, Changshuo Shen, Weixiang Zhao, Junfeng Fang, Xiaohao Liu, Zhenkai Liang, Xiang Wang, An Zhang, and Tat-Seng Chua. Alphasteer: Learning refusal steering with principled null-space constraint.arXiv preprint arXiv:2506.07022, 2025. 12

work page arXiv 2025

[48] [48]

Hallucination reduction with casal: Contrastive activation steering for amortized learning.arXiv preprint arXiv:2510.02324, 2025

Xinchi Qiu, Lei Yu, Yuchen Zhang, Aobo Yang, Narine Kokhlikyan, Nicola Cancedda, Diego Garcia-Olano, et al. Hallucination reduction with casal: Contrastive activation steering for amortized learning.arXiv preprint arXiv:2510.02324, 2025

work page arXiv 2025

[49] [49]

Steering evaluation-aware language models to act like they are deployed

Tim Tian Hua, Andrew Qin, Samuel Marks, and Neel Nanda. Steering evaluation-aware language models to act like they are deployed.arXiv preprint arXiv:2510.20487, 2025

work page arXiv 2025

[50] [50]

Efficient Streaming Language Models with Attention Sinks

Guangxuan Xiao, Yuandong Tian, Beidi Chen, Song Han, and Mike Lewis. Efficient streaming language models with attention sinks.arXiv preprint arXiv:2309.17453, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023

[51] [51]

The Llama 3 Herd of Models

Aaron Grattafiori, Abhimanyu Dubey, Abhinav Jauhri, Abhinav Pandey, Abhishek Kadian, Ahmad Al-Dahle, Aiesha Letman, Akhil Mathur, Alan Schelten, Alex Vaughan, et al. The Llama 3 herd of models.arXiv preprint arXiv:2407.21783, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024

[52] [52]

Qwen3 Technical Report

An Yang, Anfeng Li, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chang Gao, Chengen Huang, Chenxu Lv, et al. Qwen3 technical report.arXiv preprint arXiv:2505.09388, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025

[53] [53]

Pico: Contrastive label disambiguation for partial label learning

Haobo Wang, Ruixuan Xiao, Yixuan Li, Lei Feng, Gang Niu, Gang Chen, and Junbo Zhao. Pico: Contrastive label disambiguation for partial label learning. InInternational conference on learning representations, 2021

2021

[54] [54]

Springer, 2005

Erich Leo Lehmann and Joseph P Romano.Testing statistical hypotheses. Springer, 2005

2005

[55] [55]

Youden index and optimal cut-point estimated from observations affected by a lower limit of detection

Marcus D Ruopp, Neil J Perkins, Brian W Whitcomb, and Enrique F Schisterman. Youden index and optimal cut-point estimated from observations affected by a lower limit of detection. Biometrical Journal: Journal of Mathematical Methods in Biosciences, 50(3):419–430, 2008

2008

[56] [56]

Joint confidence region estimation for area under roc curve and youden index.Statistics in medicine, 33(6):985–1000, 2014

Jingjing Yin and Lili Tian. Joint confidence region estimation for area under roc curve and youden index.Statistics in medicine, 33(6):985–1000, 2014

2014

[57] [57]

Classification accuracy and cut point selection.Statistics in medicine, 31(23): 2676–2686, 2012

Xinhua Liu. Classification accuracy and cut point selection.Statistics in medicine, 31(23): 2676–2686, 2012

2012

[58] [58]

A General Method for Detecting Information Generated by Large Language Models.arXiv preprint arXiv:2506.21589, 2025

Minjia Mao, Dongjun Wei, Xiao Fang, and Michael Chau. A General Method for Detecting Information Generated by Large Language Models.arXiv preprint arXiv:2506.21589, 2025

work page arXiv 2025

[59] [59]

Detecting LLM-Generated Text with Performance Guarantees.arXiv preprint arXiv:2601.06586, 2026

Hongyi Zhou, Jin Zhu, Ying Yang, and Chengchun Shi. Detecting LLM-Generated Text with Performance Guarantees.arXiv preprint arXiv:2601.06586, 2026

work page arXiv 2026

[60] [60]

MGT-Prism: Enhancing Domain Generalization for Machine-Generated Text Detection via Spectral Alignment.arXiv preprint arXiv:2508.13768, 2025

Shengchao Liu, Xiaoming Liu, Chengzhengxu Li, Zhaohan Zhang, Guoxin Ma, Yu Lan, and Shuai Xiao. MGT-Prism: Enhancing Domain Generalization for Machine-Generated Text Detection via Spectral Alignment.arXiv preprint arXiv:2508.13768, 2025

work page arXiv 2025

[61] [61]

DetectRL: Benchmarking LLM-generated text detection in real-world scenarios.Advances in Neural Information Processing Systems, 37:100369–100401, 2024

Junchao Wu, Runzhe Zhan, Derek Wong, Shu Yang, Xinyi Yang, Yulin Yuan, and Lidia Chao. DetectRL: Benchmarking LLM-generated text detection in real-world scenarios.Advances in Neural Information Processing Systems, 37:100369–100401, 2024

2024

[62] [62]

Don’t give me the details, just the summary! topic-aware convolutional neural networks for extreme summarization

Shashi Narayan, Shay B Cohen, and Mirella Lapata. Don’t give me the details, just the summary! topic-aware convolutional neural networks for extreme summarization. InProceedings of the 2018 conference on empirical methods in natural language processing, pages 1797–1807, 2018

2018

[63] [63]

Hierarchical neural story generation

Angela Fan, Mike Lewis, and Yann Dauphin. Hierarchical neural story generation. InProceed- ings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 889–898, 2018

2018

[64] [64]

Character-level convolutional networks for text classification.Advances in Neural Information Processing Systems, 28, 2015

Xiang Zhang, Junbo Zhao, and Yann LeCun. Character-level convolutional networks for text classification.Advances in Neural Information Processing Systems, 28, 2015

2015

[65] [65]

Introducing ChatGPT

OpenAI. Introducing ChatGPT. https://openai.com/index/chatgpt/, 2023. OpenAI Blog

2023

[66] [66]

Releasing claude instant 1.2, 2023

Anthropic. Releasing claude instant 1.2, 2023. URL https://www.anthropic.com/news/ releasingclaude-instant-1-2. Anthropic Blog. 13

2023

[67] [67]

PaLM 2 Technical Report

Rohan Anil, Andrew M Dai, Orhan Firat, Melvin Johnson, Dmitry Lepikhin, Alexandre Passos, Siamak Shakeri, Emanuel Taropa, Paige Bailey, Zhifeng Chen, et al. Palm 2 technical report. arXiv preprint arXiv:2305.10403, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023

[68] [68]

Llama 2: Open Foundation and Fine-Tuned Chat Models

Hugo Touvron, Louis Martin, Kevin Stone, Peter Albert, Amjad Almahairi, Yasmine Babaei, Nikolay Bashlykov, Soumya Batra, Prajjwal Bhargava, Shruti Bhosale, et al. Llama 2: Open Foundation and Fine-Tuned Chat Models.arXiv preprint arXiv:2307.09288, 2023. URL https://arxiv.org/abs/2307.09288

work page internal anchor Pith review Pith/arXiv arXiv 2023

[69] Sebastian Gehrmann, Hendrik Strobelt, and Alexander M Rush. GLTR: Statistical detection and visualization of generated text. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics: System Demonstrations, pages 111–116, 2019.

[70] Jiaqi Chen, Xiaoye Zhu, Tianyang Liu, Ying Chen, Chen Xinhui, Yiwen Yuan, Chak Tou Leong, Zuchao Li, Long Tang, Lei Zhang, et al. Imitate before detect: Aligning machine stylistic preference for machine-revised text detection. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 39, pages 23559–23567, 2025.

[71] Scott Aaronson and H Kirchner. Watermarking of large language models. In Large Language Models and Transformers Workshop at Simons Institute for the Theory of Computing, volume 2023, 2023.

[72] Sumanth Dathathri, Abigail See, Sumedh Ghaisas, Po-Sen Huang, Rob McAdam, Johannes Welbl, Vandana Bachani, Alex Kaskasoli, Robert Stanforth, Tatiana Matejovicova, et al. Scalable watermarking for identifying large language model outputs. Nature, 634(8035):818–823, 2024.

[73] William J Youden. Index for rating diagnostic tests. Cancer, 3(1):32–35, 1950.

[74] Ilya Loshchilov and Frank Hutter. Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101, 2017.

[75] Pascal Massart. The tight constant in the Dvoretzky-Kiefer-Wolfowitz inequality. The Annals of Probability, pages 1269–1283, 1990.

[76] Wolfgang Polonik. Measuring mass concentrations and estimating density contour clusters-an excess mass approach. The Annals of Statistics, pages 855–881, 1995.

[77] Enno Mammen and Alexandre B Tsybakov. Smooth discrimination analysis. The Annals of Statistics, 27(6):1808–1829, 1999.

[78] Xin Tong. A plug-in approach to Neyman-Pearson classification. Journal of Machine Learning Research, 14(1):3011–3040, 2013.

[79] David A Freedman. On tail probabilities for martingales. The Annals of Probability, pages 100–118, 1975.

A Algorithm

Algorithm 1: Overall training pipeline for S2D

Input: Frozen observer model $f_\theta$, training set $S_{\text{train}}$, null calibration set $S_{\text{cal}}$ (human-written text only); steering layer $\ell_s$; vMF concentration parameter $\kappa$; EMA coefficient $\rho$; learning rate $\eta$; ...

By Bayes’ theorem,

$$p\bigl(y_i \mid f_{\theta,v}(x_i)\bigr) = \frac{p\bigl(f_{\theta,v}(x_i) \mid y_i\bigr)\, p(y_i)}{\sum_{c \in \{0,1\}} p\bigl(f_{\theta,v}(x_i) \mid y_i = c\bigr)\, p(y_i = c)}.$$

Substituting the vMF likelihood and the uniform prior gives

$$p\bigl(y_i \mid f_{\theta,v}(x_i)\bigr) = \frac{C_d(\kappa) \exp\bigl(\kappa \mu_{y_i}^\top f_{\theta,v}(x_i)\bigr) \cdot \frac{1}{2}}{\sum_{c \in \{0,1\}} C_d(\kappa) \exp\bigl(\kappa \mu_c^\top f_{\theta,v}(x_i)\bigr) \cdot \frac{1}{2}}.$$

Since $C_d(\kappa)$ and the prior $\frac{1}{2}$ are identical across classes, they cancel out, yielding the softmax form ...
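The cancellation step can be checked numerically. The following is a minimal sketch, not the authors' implementation: it compares the full Bayes posterior (with a shared normalizing constant and a uniform prior of 1/2) against a softmax over the logits $\kappa \mu_c^\top z$, where $z$ stands in for the steered representation $f_{\theta,v}(x_i)$. The function names, the toy vectors, and the use of an arbitrary `log_c` in place of the true $\log C_d(\kappa)$ are assumptions made for illustration; because the constant is shared across classes, its actual value does not affect the result.

```python
import math


def normalize(v):
    """Project a vector onto the unit sphere (vMF is defined on unit vectors)."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def posterior_bayes(z, mus, kappa, log_c):
    """Posterior via Bayes' rule with likelihood exp(log_c + kappa * mu_c^T z).

    log_c is a hypothetical stand-in for log C_d(kappa); it is shared by
    both classes, as is the uniform prior 1/2.
    """
    joint = [math.exp(log_c + kappa * dot(mu, z)) * 0.5 for mu in mus]
    total = sum(joint)
    return [j / total for j in joint]


def posterior_softmax(z, mus, kappa):
    """Softmax over the logits kappa * mu_c^T z (max-shifted for stability)."""
    logits = [kappa * dot(mu, z) for mu in mus]
    shift = max(logits)
    exps = [math.exp(l - shift) for l in logits]
    total = sum(exps)
    return [e / total for e in exps]


if __name__ == "__main__":
    z = normalize([1.0, 2.0, 2.0])            # toy representation
    mus = [normalize([1.0, 0.0, 0.0]),        # toy class-0 mean direction
           normalize([0.0, 1.0, 1.0])]        # toy class-1 mean direction
    print(posterior_bayes(z, mus, kappa=5.0, log_c=-3.0))
    print(posterior_softmax(z, mus, kappa=5.0))
```

The two printed posteriors should agree up to floating-point error for any choice of `log_c`, which is the content of the cancellation argument.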