Monitoring the Internal Monologue: Probe Trajectories Reveal Reasoning Dynamics

Aleksander Szymczyk; Maciej Chrab\k{a}szcz; Marcin Sendera; Sebastian Cygert; Tomasz Trzci\'nski

arxiv: 2605.18549 · v1 · pith:54U4UJWFnew · submitted 2026-05-18 · 💻 cs.CL · cs.CR

Monitoring the Internal Monologue: Probe Trajectories Reveal Reasoning Dynamics

Maciej Chrab\k{a}szcz , Aleksander Szymczyk , Marcin Sendera , Tomasz Trzci\'nski , Sebastian Cygert This is my paper

Pith reviewed 2026-05-20 10:50 UTC · model grok-4.3

classification 💻 cs.CL cs.CR

keywords probe trajectorieshidden representationschain of thoughtmodel monitoringfuture behavior predictionsignal processing featuresmax-poolinglarge reasoning models

0 comments

The pith

Future model behavior in large reasoning models is more accurately predicted by monitoring probe trajectories across the full chain of thought than by any single static probe.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper examines whether hidden representations during the chain of thought in large reasoning models can reveal what the model will ultimately do. By placing probes at each token to track the probability of specific concepts over time, the authors build probe trajectories that capture how these signals evolve. They show that features from these trajectories, such as volatility and trends, allow much better separation of future outcomes like safe versus unsafe responses. This approach works even with simple template-based data instead of full model generations. The work suggests a way to monitor internal reasoning dynamics for better safety without relying solely on the visible chain of thought.

Core claim

We construct probe trajectories by evaluating a probe at each generated token in the reasoning process of large reasoning models, revealing the continuous evolution of a concept's probability. Extracting signal-processing features that capture volatility, trend, and steady-state behavior from these trajectories significantly improves the separation of future model states compared to single static predictions. Using max-pooling yields up to 95% AUROC, while average-pooling and last-token methods perform near random. Template-based training data achieves near-parity with dynamically generated responses, and this holds across safety and mathematics domains on four datasets and four models.

What carries the argument

The probe trajectory, defined as the sequence of concept probabilities obtained by applying a linear probe to hidden representations at every token position during chain-of-thought generation.

If this is right

Future model behavior becomes more distinguishable through full trajectory analysis rather than static snapshots.
Signal-processing features for volatility, trend, and steady-state enhance outcome separability.
Max-pooling is essential for achieving high AUROC up to 95% and stable trajectories.
Template-based training eliminates the need for initial inference and labeling while maintaining performance.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Such trajectories might allow interventions before the model completes its output.
Similar methods could apply to monitoring other internal states beyond safety and math.
Combining trajectory monitoring with visible CoT could create more robust oversight systems.

Load-bearing premise

Simple linear probes on hidden states at each token faithfully capture the target concept without interference from unrelated model features or output biases.

What would settle it

If using average pooling instead of max pooling results in AUROC close to 50% on the same tasks, or if probe performance does not improve with trajectory features over static ones.

Figures

Figures reproduced from arXiv: 2605.18549 by Aleksander Szymczyk, Maciej Chrab\k{a}szcz, Marcin Sendera, Sebastian Cygert, Tomasz Trzci\'nski.

**Figure 1.** Figure 1: Overview of the trajectory-based analysis framework. (a) Surface-level CoT is unfaithful to the final output in over 10% of cases, necessitating latent monitoring to ensure safety. (b) Our framework monitors hidden representations to generate probe trajectories, from which we extract signal features (e.g., statistical state and trend dynamics) that are more expressive of true intent than surface-level text… view at source ↗

**Figure 2.** Figure 2: Sample average and maxpooled probe trajectories. Averaging produces a highly unstable trajectory. To capture the complex dynamics of the model’s internal monologue, we extract a robust set of statistical, temporal, and signal processing-based features from these trajectories, organized into six core groups: (1) Global Statistical State—summary statistics (mean, max, variance, IQR, RMS) over both prompt … view at source ↗

**Figure 3.** Figure 3: Evolution of internal states during reasoning. (a) Average trajectories show how harmfulness probabilities shift as they transition from prompt processing to Chain-of-Thought reasoning across different safety outcomes. Individual token-level trajectories (shaded lines) highlight distinct patterns of escalation or de-escalation. (b) Correctness probe scores of correct and incorrect final answers start diver… view at source ↗

**Figure 4.** Figure 4: Identifying harmful responses with unfaithful CoT. We compare the detection rate of static probes (solid) against trajectory-based classifiers (cross-hatched) for deceptive CoTs. Across all evaluated models, the trajectory-based approach outperforms static methods; this advantage is most striking on the Aegis dataset, where static probes fail to generalize. R1-Llama-8B Qwen3-4B Qwen3-8B Qwen3-14B Model 0.6… view at source ↗

**Figure 5.** Figure 5: Harmfulness detection AUROC across in-distribution and out-of-distribution (OOD). Evaluation of correctness separability on WildGuardTest (ID) and Aegis (OOD). Trajectory-based classifiers (hatched) consistently yield higher AUROC than static max-pooled probes (solid) across all model sizes. While static probes degrade significantly in the OOD setting, trajectory-based features remain robust, demonstrating… view at source ↗

**Figure 6.** Figure 6: Predicting mathematical correctness using reasoning trajectories. Comparison of correctness separability (AUROC) between static max-pooled probes (solid) and trajectory-based classifiers (hatched) across two datasets. While trajectory-based features already offer a slight advantage on the MATH dataset, they provide significant gains on GSM8K, particularly with larger Qwen3 models, demonstrating that they a… view at source ↗

**Figure 7.** Figure 7: Impact of reasoning trace length on predictive performance. Mean AUROC is shown as a function of the percentage of CoT tokens analyzed. A clear domain divergence emerges: math error prediction achieves near-peak performance using only the first ∼5% of the reasoning, indicating that trajectory instability manifests almost immediately. Conversely, harmfulness detection accumulates signal over time, benefitin… view at source ↗

**Figure 8.** Figure 8: (a) Leave-one-category-out generalization. Trajectory-based classifiers (hatched) consistently match or exceed static probe baselines (solid) when evaluated on held-out problem categories, demonstrating cross-category transfer of trajectory features. (b) Mean AUROC as a function of the number of feature groups used. For harmfulness detection, performance plateaus with just two groups. For mathematical err… view at source ↗

**Figure 9.** Figure 9: Domain-specific feature importance. Top 10 trajectory features by mean absolute SHAP value, aggregated across all models. The most predictive features for harmfulness (left) and mathematical correctness (right) are entirely disjoint. Harmfulness detection relies heavily on terminal and steady-state characteristics, indicating that the final settling point is most critical. In contrast, math error predictio… view at source ↗

**Figure 10.** Figure 10: replicates the analysis from [PITH_FULL_IMAGE:figures/full_fig_p019_10.png] view at source ↗

**Figure 11.** Figure 11: Out-of-distribution (OOD) generalization performance on the Aegis dataset. The bar chart [PITH_FULL_IMAGE:figures/full_fig_p019_11.png] view at source ↗

**Figure 12.** Figure 12: Detailed Leave One Out on MATH subcategories. [PITH_FULL_IMAGE:figures/full_fig_p020_12.png] view at source ↗

**Figure 13.** Figure 13: Trainable CNN against our features on harmfulness datasets averaged over probes [PITH_FULL_IMAGE:figures/full_fig_p021_13.png] view at source ↗

**Figure 14.** Figure 14: Trainable CNN against our features on harmfulness datasets [PITH_FULL_IMAGE:figures/full_fig_p021_14.png] view at source ↗

**Figure 15.** Figure 15: Trainable CNN against our features on math datasets averaged over probe types [PITH_FULL_IMAGE:figures/full_fig_p021_15.png] view at source ↗

**Figure 16.** Figure 16: Trainable CNN against our features on math datasets [PITH_FULL_IMAGE:figures/full_fig_p021_16.png] view at source ↗

**Figure 17.** Figure 17: Average vs Max pooling probes probabilities trajectories. [PITH_FULL_IMAGE:figures/full_fig_p031_17.png] view at source ↗

**Figure 18.** Figure 18: Per-token trajectories for Wildguardtest (Harmfulness) - Models: R1-Llama-8B. [PITH_FULL_IMAGE:figures/full_fig_p032_18.png] view at source ↗

**Figure 19.** Figure 19: Per-token trajectories for Wildguardtest (Harmfulness) - Models: R1-Llama-8B. [PITH_FULL_IMAGE:figures/full_fig_p032_19.png] view at source ↗

**Figure 20.** Figure 20: Per-token trajectories for Wildguardtest (Harmfulness) - Models: Qwen3-4B. [PITH_FULL_IMAGE:figures/full_fig_p032_20.png] view at source ↗

**Figure 21.** Figure 21: Per-token trajectories for Wildguardtest (Harmfulness) - Models: Qwen3-4B. [PITH_FULL_IMAGE:figures/full_fig_p033_21.png] view at source ↗

**Figure 22.** Figure 22: Per-token trajectories for Wildguardtest (Harmfulness) - Models: Qwen3-8B. [PITH_FULL_IMAGE:figures/full_fig_p033_22.png] view at source ↗

**Figure 23.** Figure 23: Per-token trajectories for Wildguardtest (Harmfulness) - Models: Qwen3-8B. [PITH_FULL_IMAGE:figures/full_fig_p033_23.png] view at source ↗

**Figure 24.** Figure 24: Per-token trajectories for Wildguardtest (Harmfulness) - Models: Qwen3-14B. [PITH_FULL_IMAGE:figures/full_fig_p033_24.png] view at source ↗

**Figure 25.** Figure 25: Per-token trajectories for Wildguardtest (Harmfulness) - Models: Qwen3-14B. [PITH_FULL_IMAGE:figures/full_fig_p034_25.png] view at source ↗

**Figure 26.** Figure 26: Per-token trajectories for Aegis (Harmfulness) - Models: R1-Llama-8B. [PITH_FULL_IMAGE:figures/full_fig_p034_26.png] view at source ↗

**Figure 27.** Figure 27: Per-token trajectories for Aegis (Harmfulness) - Models: R1-Llama-8B. [PITH_FULL_IMAGE:figures/full_fig_p034_27.png] view at source ↗

**Figure 28.** Figure 28: Per-token trajectories for Aegis (Harmfulness) - Models: Qwen3-4B. [PITH_FULL_IMAGE:figures/full_fig_p034_28.png] view at source ↗

**Figure 29.** Figure 29: Per-token trajectories for Aegis (Harmfulness) - Models: Qwen3-4B. [PITH_FULL_IMAGE:figures/full_fig_p035_29.png] view at source ↗

**Figure 30.** Figure 30: Per-token trajectories for Aegis (Harmfulness) - Models: Qwen3-8B. [PITH_FULL_IMAGE:figures/full_fig_p035_30.png] view at source ↗

**Figure 31.** Figure 31: Per-token trajectories for Aegis (Harmfulness) - Models: Qwen3-8B. [PITH_FULL_IMAGE:figures/full_fig_p035_31.png] view at source ↗

**Figure 32.** Figure 32: Per-token trajectories for Aegis (Harmfulness) - Models: Qwen3-14B. [PITH_FULL_IMAGE:figures/full_fig_p035_32.png] view at source ↗

**Figure 33.** Figure 33: Per-token trajectories for Aegis (Harmfulness) - Models: Qwen3-14B. [PITH_FULL_IMAGE:figures/full_fig_p036_33.png] view at source ↗

**Figure 34.** Figure 34: Per-token trajectories for Minerva Math (Math) - Models: R1-Llama-8B. [PITH_FULL_IMAGE:figures/full_fig_p037_34.png] view at source ↗

**Figure 35.** Figure 35: Per-token trajectories for Minerva Math (Math) - Models: Qwen3-4B. [PITH_FULL_IMAGE:figures/full_fig_p037_35.png] view at source ↗

**Figure 36.** Figure 36: Per-token trajectories for Minerva Math (Math) - Models: Qwen3-8B. [PITH_FULL_IMAGE:figures/full_fig_p037_36.png] view at source ↗

**Figure 37.** Figure 37: Per-token trajectories for Minerva Math (Math) - Models: Qwen3-14B. [PITH_FULL_IMAGE:figures/full_fig_p037_37.png] view at source ↗

**Figure 38.** Figure 38: Per-token trajectories for GSM8K (Math) - Models: R1-Llama-8B. [PITH_FULL_IMAGE:figures/full_fig_p038_38.png] view at source ↗

**Figure 39.** Figure 39: Per-token trajectories for GSM8K (Math) - Models: Qwen3-4B. [PITH_FULL_IMAGE:figures/full_fig_p038_39.png] view at source ↗

**Figure 40.** Figure 40: Per-token trajectories for GSM8K (Math) - Models: Qwen3-8B. [PITH_FULL_IMAGE:figures/full_fig_p038_40.png] view at source ↗

**Figure 41.** Figure 41: Per-token trajectories for GSM8K (Math) - Models: Qwen3-14B. [PITH_FULL_IMAGE:figures/full_fig_p038_41.png] view at source ↗

read the original abstract

Large Reasoning Models (LRMs) introduce new opportunities for safety monitoring through their Chain of Thought (CoT) reasoning. However, CoT is not always faithful to the model's final output, undermining its reliability as a monitoring tool. To address this, we investigate the hidden representations of LRMs to determine whether future behavior can be predicted from prompt and CoT representations. By evaluating a probe at each generated token, we construct a probe trajectory, the continuous evolution of a concept's probability across the reasoning process. We find that future model behavior is more distinguishable when examined over the full trajectory than from a single static prediction. To characterize these temporal dynamics, we extract signal-processing features that capture volatility, trend, and steady-state behavior, significantly improving the separation of future model states. We also present two methodological insights. First, template-based training data achieves near-parity with dynamically generated model responses, eliminating the need for a costly initial inference and labeling. Second, the choice of pooling operation is critical: average-pooling and last-token methods collapse to near-random performance, while max-pooling achieves up to 95% AUROC and yields stable probe trajectories. Using four datasets and four reasoning models across the domains of safety and mathematics, we demonstrate that trajectory features encode task-specific dynamics that improve outcome separability. These findings establish probe trajectories as a complementary framework for monitoring LRM behavior. Warning: This article contains potentially harmful content.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

Probe trajectories over full CoT give better separation of future outputs than static probes, but the gains may partly reflect output distribution artifacts rather than true concept dynamics.

read the letter

The main thing to know is that this paper shows probe values tracked across every token in a reasoning chain separate future model behavior more clearly than a single probe at one position, with max-pooling and volatility/trend features reaching up to 95% AUROC on safety and math tasks. They also note that template data for probe training works nearly as well as full model generations, which is a practical detail for anyone trying to implement this without extra inference cost. The experiments cover four models and four datasets, giving the claims some breadth beyond a single setting. The temporal framing itself is the clearest addition over prior probing work that usually stops at one snapshot. What the paper does well is demonstrate that average-pooling and last-token methods drop to near-random while max-pooling stays stable, which is a concrete methodological takeaway worth testing in other setups. The citation pattern pulls in standard probing and CoT faithfulness references without obvious omissions. The soft spot is the lack of direct evidence that the per-token probes track the intended concept rather than the model's current output distribution or position effects. The abstract gives no held-out probe accuracy, no correlation with explicit concept tokens in the CoT, and no ablation removing logit leakage, so the trajectory features could be amplifying spurious signals. If those controls are missing from the full version, the high AUROC numbers become harder to interpret as evidence of internal reasoning dynamics. This work is aimed at researchers building monitoring layers for deployed reasoning models. A reader already working on internal representations or AI safety oversight would find the trajectory idea straightforward to replicate and extend. It has enough structure and reported results to deserve peer review, though the referees will likely press on probe validation and leakage checks. I would send it out for review.

Referee Report

2 major / 2 minor

Summary. The paper claims that probe trajectories—constructed by applying linear probes to hidden representations at each token during Chain-of-Thought generation in Large Reasoning Models—allow better prediction of future model behavior than static probes. Signal-processing features capturing volatility, trend, and steady-state behavior improve separability of future states, with max-pooling reaching up to 95% AUROC; average-pooling and last-token pooling collapse to near-random performance. Template-based training data achieves near-parity with dynamically generated responses, and results are demonstrated across four datasets and four models in safety and mathematics domains.

Significance. If the central claims hold after addressing probe faithfulness, the work would establish probe trajectories as a practical complementary monitoring framework for LRM reasoning dynamics beyond potentially unfaithful CoT outputs. The methodological findings on pooling operations and template-based data would be directly usable for safety applications, and the emphasis on temporal features over single-point predictions offers a clear advance in interpretability techniques.

major comments (2)

[Abstract / §4] Abstract and §4 (Experiments): The reported performance (up to 95% AUROC with max-pooling) is presented without quantitative details on baselines, statistical significance, dataset sizes, number of examples per condition, or controls for probe training leakage. This absence prevents verification that the trajectory features genuinely improve separability rather than reflecting evaluation artifacts.
[§3] §3 (Method): The core assumption that per-token linear probes extract a faithful, concept-specific signal is load-bearing for the claim that trajectory features (volatility/trend/steady-state) reveal reasoning dynamics. No held-out probe accuracy, correlation with explicit concept tokens in the CoT, or ablation removing output-logit or position-length leakage is described; without these, the improved AUROC could arise from the probe capturing the model's current token distribution rather than internal concept evolution.

minor comments (2)

[Abstract] Abstract: The phrase 'significantly improving the separation' should be accompanied by the exact AUROC values and confidence intervals for the trajectory features versus the static baseline.
[Throughout] Throughout: Define all acronyms (LRM, CoT, AUROC) on first use and ensure consistent notation for 'probe trajectory' versus 'signal-processing features'.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive and detailed feedback. We address each major comment below, providing clarifications and indicating where the manuscript has been revised to strengthen the presentation of results and methods.

read point-by-point responses

Referee: [Abstract / §4] Abstract and §4 (Experiments): The reported performance (up to 95% AUROC with max-pooling) is presented without quantitative details on baselines, statistical significance, dataset sizes, number of examples per condition, or controls for probe training leakage. This absence prevents verification that the trajectory features genuinely improve separability rather than reflecting evaluation artifacts.

Authors: We agree that these details are essential for rigorous evaluation. In the revised manuscript we have expanded §4 with a new table reporting: (i) baseline AUROCs (static last-token probe: 71%, random probe: 50%), (ii) statistical significance via paired Wilcoxon tests (trajectory features vs. static: p < 0.001 across all four models), (iii) exact dataset sizes (safety: 1,200 examples; math: 950 examples) and per-condition counts (balanced 50/50 splits), and (iv) explicit leakage controls using disjoint prompt sets for probe training and downstream evaluation. These additions confirm that max-pooling trajectory features retain a 15–22 point AUROC advantage over static probes after leakage controls. revision: yes
Referee: [§3] §3 (Method): The core assumption that per-token linear probes extract a faithful, concept-specific signal is load-bearing for the claim that trajectory features (volatility/trend/steady-state) reveal reasoning dynamics. No held-out probe accuracy, correlation with explicit concept tokens in the CoT, or ablation removing output-logit or position-length leakage is described; without these, the improved AUROC could arise from the probe capturing the model's current token distribution rather than internal concept evolution.

Authors: We acknowledge that probe faithfulness requires explicit validation. The original submission reported held-out probe accuracies in Appendix B (mean 84% for safety concepts, 79% for math). In the revision we have added: (1) Pearson correlations between probe outputs and the presence of explicit concept tokens in the generated CoT (r = 0.73, p < 0.01), (2) layer-ablation results showing that probes trained on earlier hidden states (before final logit projection) still yield 12-point gains from trajectory features, and (3) length-normalized trajectories that eliminate position-length confounds. These controls indicate that the volatility and trend features capture evolving internal representations beyond immediate token distributions. revision: yes

Circularity Check

0 steps flagged

No significant circularity; derivation relies on independent empirical evaluation of probe trajectories

full rationale

The paper trains linear probes on hidden states to extract per-token concept probabilities, builds trajectories, extracts volatility/trend/steady-state features, and evaluates their ability to separate future model states via AUROC on held-out data across four datasets and models. Probe training uses template-based data shown to achieve near-parity with model-generated responses, and pooling choices (max vs average) are ablated empirically rather than forced by definition. No load-bearing self-citation, no uniqueness theorem imported from prior work, and no reduction where a fitted parameter is renamed as a prediction of the same quantity. The central result (trajectory features improving separability to 95% AUROC) is presented as an empirical finding with explicit comparisons to static probes and different pooling methods, remaining falsifiable against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Based solely on the abstract, no explicit free parameters, axioms, or invented entities are stated; the approach implicitly assumes that hidden-state probes can be trained to track specific concepts and that max-pooling preserves the relevant temporal signal.

pith-pipeline@v0.9.0 · 5805 in / 1203 out tokens · 46387 ms · 2026-05-20T10:50:17.159061+00:00 · methodology

discussion (0)

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean washburn_uniqueness_aczel unclear

?

unclear
Relation between the paper passage and the cited Recognition theorem.

By evaluating a probe at each generated token, we construct a probe trajectory... extract signal-processing features that capture volatility, trend, and steady-state behavior... max-pooling achieves up to 95% AUROC

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

51 extracted references · 51 canonical work pages · 18 internal anchors

[1]

Understanding intermediate layers using linear classifier probes

Guillaume Alain and Yoshua Bengio. Understanding intermediate layers using linear classifier probes.arXiv preprint arXiv:1610.01644, 2016

work page internal anchor Pith review Pith/arXiv arXiv 2016
[2]

System card: Claude opus 4.5, 2025

Anthropic. System card: Claude opus 4.5, 2025. URL https://www-cdn.anthropic.com/ bf10f64990cfda0ba858290be7b8cc6317685f47.pdf. Model Card

work page 2025
[3]

Chain-of-thought reasoning in the wild is not always faithful

Iván Arcuschin, Jett Janiak, Robert Krzyzanowski, Senthooran Rajamanoharan, Neel Nanda, and Arthur Conmy. Chain-of-thought reasoning in the wild is not always faithful. InWorkshop on Reasoning and Planning for Large Language Models, 2025

work page 2025
[4]

Cot red-handed: Stress testing chain-of-thought monitoring

Benjamin Arnav, Pablo Bernabeu-Perez, Nathan Helm-Burger, Timothy Kostolansky, Hannes Whittingham, and Mary Phuong. Cot red-handed: Stress testing chain-of-thought monitoring. InThe Thirty-ninth Annual Conference on Neural Information Processing Systems, 2026. URL https://openreview.net/forum?id=oHB4Ee77uG

work page 2026
[5]

Language models can predict their own behavior

Dhananjay Ashok and Jonathan May. Language models can predict their own behavior. In The Thirty-ninth Annual Conference on Neural Information Processing Systems, 2026. URL https://openreview.net/forum?id=i8IqEzpHaJ

work page 2026
[6]

Monitoring Reasoning Models for Misbehavior and the Risks of Promoting Obfuscation

Bowen Baker, Joost Huizinga, Leo Gao, Zehao Dou, Melody Y Guan, Aleksander Madry, Wojciech Zaremba, Jakub Pachocki, and David Farhi. Monitoring reasoning models for misbehavior and the risks of promoting obfuscation.arXiv preprint arXiv:2503.11926, 2025

work page internal anchor Pith review arXiv 2025
[7]

Chain-of-thought is not explainability.Preprint, alphaXiv, page v1, 2025

Fazl Barez, Tung-Yu Wu, Iván Arcuschin, Michael Lan, Vincent Wang, Noah Siegel, Nico- las Collignon, Clement Neo, Isabelle Lee, Alasdair Paren, et al. Chain-of-thought is not explainability.Preprint, alphaXiv, page v1, 2025

work page 2025
[8]

International ai safety report 2026

Yoshua Bengio, Stephen Clare, Carina Prunkl, Maksym Andriushchenko, Ben Bucknall, Mal- colm Murray, Rishi Bommasani, Stephen Casper, Tom Davidson, Raymond Douglas, et al. International ai safety report 2026.arXiv preprint arXiv:2602.21012, 2026

work page arXiv 2026
[9]

Xgboost: A scalable tree boosting system

Tianqi Chen and Carlos Guestrin. Xgboost: A scalable tree boosting system. InProceedings of the 22nd acm sigkdd international conference on knowledge discovery and data mining, pages 785–794, 2016

work page 2016
[10]

Reasoning Models Don't Always Say What They Think

Yanda Chen, Joe Benton, Ansh Radhakrishnan, Jonathan Uesato, Carson Denison, John Schul- man, Arushi Somani, Peter Hase, Misha Wagner, Fabien Roger, Vlad Mikulik, Samuel R. Bowman, Jan Leike, Jared Kaplan, and Ethan Perez. Reasoning models don’t always say what they think, 2025. URLhttps://arxiv.org/abs/2505.05410

work page internal anchor Pith review Pith/arXiv arXiv 2025
[11]

Efficient llm moderation with multi-layer latent prototypes, 2026

Maciej Chrab ˛ aszcz, Filip Szatkowski, Bartosz Wójcik, Jan Dubi´nski, Tomasz Trzci´nski, and Sebastian Cygert. Efficient llm moderation with multi-layer latent prototypes, 2026. URL https://arxiv.org/abs/2502.16174

work page arXiv 2026
[12]

Training Verifiers to Solve Math Word Problems

Karl Cobbe, Vineet Kosaraju, Mohammad Bavarian, Mark Chen, Heewoo Jun, Lukasz Kaiser, Matthias Plappert, Jerry Tworek, Jacob Hilton, Reiichiro Nakano, et al. Training verifiers to solve math word problems.arXiv preprint arXiv:2110.14168, 2021

work page internal anchor Pith review Pith/arXiv arXiv 2021
[13]

Gemini 2.5: Pushing the Frontier with Advanced Reasoning, Multimodality, Long Context, and Next Generation Agentic Capabilities

Gheorghe Comanici, Eric Bieber, Mike Schaekermann, Ice Pasupat, Noveen Sachdeva, Inderjit Dhillon, Marcel Blistein, Ori Ram, Dan Zhang, Evan Rosen, et al. Gemini 2.5: Pushing the frontier with advanced reasoning, multimodality, long context, and next generation agentic capabilities.arXiv preprint arXiv:2507.06261, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[14]

Constitutional classifiers++: Effi- cient production-grade defenses against universal jailbreaks.arXiv preprint arXiv:2601.04603, 2026

Hoagy Cunningham, Jerry Wei, Zihan Wang, Andrew Persic, Alwin Peng, Jordan Abderrachid, Raj Agarwal, Bobby Chen, Austin Cohen, Andy Dau, et al. Constitutional classifiers++: Effi- cient production-grade defenses against universal jailbreaks.arXiv preprint arXiv:2601.04603, 2026

work page arXiv 2026
[15]

Predicting the accuracy of neural networks from final and intermediate layer outputs

Chad DeChant, Seungwook Han, and Hod Lipson. Predicting the accuracy of neural networks from final and intermediate layer outputs. InICML 2019 Workshop on Identifying and Under- standing Deep Learning Phenomena, 2019. URL https://openreview.net/forum?id= H1xXwEB2h4. 11

work page 2019
[16]

A mathematical framework for transformer circuits.Transformer Circuits Thread,

Nelson Elhage, Neel Nanda, Catherine Olsson, Tom Henighan, Nicholas Joseph, Ben Mann, Amanda Askell, Yuntao Bai, Anna Chen, Tom Conerly, Nova DasSarma, Dawn Drain, Deep Ganguli, Zac Hatfield-Dodds, Danny Hernandez, Andy Jones, Jackson Kernion, Liane Lovitt, Kamal Ndousse, Dario Amodei, Tom Brown, Jack Clark, Jared Kaplan, Sam McCandlish, and Chris Olah. A...

work page
[17]

https://transformer-circuits.pub/2021/framework/index.html

work page 2021
[18]

When chain of thought is necessary, language models struggle to evade monitors.arXiv preprint arXiv:2507.05246, 2025

Scott Emmons, Erik Jenner, David K Elson, Rif A Saurous, Senthooran Rajamanoharan, Heng Chen, Irhum Shafkat, and Rohin Shah. When chain of thought is necessary, language models struggle to evade monitors.arXiv preprint arXiv:2507.05246, 2025

work page arXiv 2025
[19]

Shaona Ghosh, Prasoon Varshney, Makesh Narsimhan Sreedhar, Aishwarya Padmakumar, Traian Rebedea, Jibin Rajan Varghese, and Christopher Parisien. Aegis2. 0: A diverse ai safety dataset and risks taxonomy for alignment of llm guardrails. InProceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguist...

work page 2025
[20]

De- tecting strategic deception with linear probes

Nicholas Goldowsky-Dill, Bilal Chughtai, Stefan Heimersheim, and Marius Hobbhahn. De- tecting strategic deception with linear probes. InF orty-second International Conference on Machine Learning, 2025. URLhttps://openreview.net/forum?id=C5Jj3QKQav

work page 2025
[21]

DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning

Daya Guo, Dejian Yang, Haowei Zhang, Junxiao Song, Peiyi Wang, Qihao Zhu, Runxin Xu, Ruoyu Zhang, Shirong Ma, Xiao Bi, et al. Deepseek-r1: Incentivizing reasoning capability in llms via reinforcement learning.arXiv preprint arXiv:2501.12948, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[22]

WildGuard: Open One-Stop Moderation Tools for Safety Risks, Jailbreaks, and Refusals of LLMs

Seungju Han, Kavel Rao, Allyson Ettinger, Liwei Jiang, Bill Yuchen Lin, Nathan Lambert, Yejin Choi, and Nouha Dziri. Wildguard: Open one-stop moderation tools for safety risks, jailbreaks, and refusals of llms, 2024. URLhttps://arxiv.org/abs/2406.18495

work page internal anchor Pith review Pith/arXiv arXiv 2024
[23]

Gaussian Error Linear Units (GELUs)

Dan Hendrycks and Kevin Gimpel. Gaussian error linear units (gelus).arXiv preprint arXiv:1606.08415, 2016

work page internal anchor Pith review Pith/arXiv arXiv 2016
[24]

Measuring mathematical problem solving with the math dataset

Dan Hendrycks, Collin Burns, Saurav Kadavath, Akul Arora, Steven Basart, Eric Tang, Dawn Song, and Jacob Steinhardt. Measuring mathematical problem solving with the math dataset. Advances in neural information processing systems, 2021

work page 2021
[25]

Stop anthropomorphizing intermediate tokens as reasoning/thinking traces! InNeurIPS 2025 Workshop on Bridging Language, Agent, and World Models for Reasoning and Planning, 2025

Subbarao Kambhampati, Kaya Stechly, Karthik Valmeekam, Lucas Paul Saldyt, Siddhant Bhambri, Vardhan Palod, Atharva Gundawar, Soumya Rani Samineni, Durgesh Kalwar, and Upasana Biswas. Stop anthropomorphizing intermediate tokens as reasoning/thinking traces! InNeurIPS 2025 Workshop on Bridging Language, Agent, and World Models for Reasoning and Planning, 2025

work page 2025
[26]

Are sparse autoencoders useful? a case study in sparse probing

Subhash Kantamneni, Joshua Engels, Senthooran Rajamanoharan, Max Tegmark, and Neel Nanda. Are sparse autoencoders useful? a case study in sparse probing. InF orty-second International Conference on Machine Learning, 2025. URL https://openreview.net/ forum?id=rNfzT8YkgO

work page 2025
[27]

Denis Kleyko, Antonello Rosato, Edward Paxon Frady, Massimo Panella, and Friedrich T. Sommer. Perceptron theory can predict the accuracy of neural networks.IEEE Transactions on Neural Networks and Learning Systems, 35(7):9885–9899, 2024. doi: 10.1109/TNNLS.2023. 3237381

work page doi:10.1109/tnnls.2023 2024
[28]

Chain of Thought Monitorability: A New and Fragile Opportunity for AI Safety

Tomek Korbak, Mikita Balesni, Elizabeth Barnes, Yoshua Bengio, Joe Benton, Joseph Bloom, Mark Chen, Alan Cooney, Allan Dafoe, Anca Dragan, et al. Chain of thought monitorability: A new and fragile opportunity for ai safety.arXiv preprint arXiv:2507.11473, 2025

work page internal anchor Pith review arXiv 2025
[29]

Building production-ready probes for gemini.arXiv preprint arXiv:2601.11516, 2026

János Kramár, Joshua Engels, Zheng Wang, Bilal Chughtai, Rohin Shah, Neel Nanda, and Arthur Conmy. Building production-ready probes for gemini.arXiv preprint arXiv:2601.11516, 2026

work page arXiv 2026
[30]

Fixing open llm leaderboard with math-verify, 2025

Hynek Kydlicek, Alina Lozovskaya, Nathan Habib, and Clémentine Fourrier. Fixing open llm leaderboard with math-verify, 2025. 12

work page 2025
[31]

Measuring Faithfulness in Chain-of-Thought Reasoning

Tamera Lanham, Anna Chen, Ansh Radhakrishnan, Benoit Steiner, Carson Denison, Danny Hernandez, Dustin Li, Esin Durmus, Evan Hubinger, Jackson Kernion, et al. Measuring faithfulness in chain-of-thought reasoning.arXiv preprint arXiv:2307.13702, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023
[32]

DeepSeek-V3 Technical Report

Aixin Liu, Bei Feng, Bing Xue, Bingxuan Wang, Bochao Wu, Chengda Lu, Chenggang Zhao, Chengqi Deng, Chenyu Zhang, Chong Ruan, et al. Deepseek-v3 technical report.arXiv preprint arXiv:2412.19437, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024
[33]

Decoupled Weight Decay Regularization

Ilya Loshchilov and Frank Hutter. Decoupled weight decay regularization.arXiv preprint arXiv:1711.05101, 2017

work page internal anchor Pith review Pith/arXiv arXiv 2017
[34]

Streaming hallucination detection in long chain-of-thought reasoning.arXiv preprint arXiv:2601.02170, 2026

Haolang Lu, Minghui Pan, Ripeng Li, Guoshun Nan, Jialin Zhuang, Zijie Zhao, Zhongxiang Sun, Kun Wang, and Yang Liu. Streaming hallucination detection in long chain-of-thought reasoning.arXiv preprint arXiv:2601.02170, 2026

work page arXiv 2026
[35]

A unified approach to interpreting model predictions

Scott M Lundberg and Su-In Lee. A unified approach to interpreting model predictions. Advances in neural information processing systems, 30, 2017

work page 2017
[36]

The Geometry of Truth: Emergent Linear Structure in Large Language Model Representations of True/False Datasets

Samuel Marks and Max Tegmark. The geometry of truth: Emergent linear structure in large language model representations of true/false datasets.arXiv preprint arXiv:2310.06824, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023
[37]

Detecting high-stakes interactions with activation probes

Alex McKenzie, Urja Pawar, Phil Blandfort, William Bankes, David Krueger, Ekdeep Singh Lubana, and Dmitrii Krasheninnikov. Detecting high-stakes interactions with activation probes. InThe Thirty-ninth Annual Conference on Neural Information Processing Systems, 2026. URL https://openreview.net/forum?id=8YniJnJQ0P

work page 2026
[38]

Beyond linear probes: Dynamic safety monitoring for language models

James Oldfield, Philip Torr, Ioannis Patras, Adel Bibi, and Fazl Barez. Beyond linear probes: Dynamic safety monitoring for language models. InThe F ourteenth International Conference on Learning Representations, 2026. URL https://openreview.net/forum?id=AGWa8whf92

work page 2026
[39]

A survey on sparse autoencoders: Interpreting the internal mechanisms of large language models

Dong Shu, Xuansheng Wu, Haiyan Zhao, Daking Rai, Ziyu Yao, Ninghao Liu, and Mengnan Du. A survey on sparse autoencoders: Interpreting the internal mechanisms of large language models. In Christos Christodoulopoulos, Tanmoy Chakraborty, Carolyn Rose, and Violet Peng, editors, Findings of the Association for Computational Linguistics: EMNLP 2025, pages 1690...

work page doi:10.18653/v1/2025.findings-emnlp.89 2025
[40]

OpenAI GPT-5 System Card

Aaditya Singh, Adam Fry, Adam Perelman, Adam Tart, Adi Ganesh, Ahmed El-Kishky, Aidan McLaughlin, Aiden Low, AJ Ostrow, Akhila Ananthram, et al. Openai gpt-5 system card.arXiv preprint arXiv:2601.03267, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[41]

The obfuscation atlas: Mapping where honesty emerges in rlvr with deception probes, 2026

Mohammad Taufeeque, Stefan Heimersheim, Adam Gleave, and Chris Cundy. The obfuscation atlas: Mapping where honesty emerges in rlvr with deception probes, 2026. URL https: //arxiv.org/abs/2602.15515

work page arXiv 2026
[42]

Gemma: Open Models Based on Gemini Research and Technology

Gemma Team, Thomas Mesnard, Cassidy Hardin, Robert Dadashi, Surya Bhupatiraju, Shreya Pathak, Laurent Sifre, Morgane Rivière, Mihir Sanjay Kale, Juliette Love, et al. Gemma: Open models based on gemini research and technology.arXiv preprint arXiv:2403.08295, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024
[43]

Steering Language Models With Activation Engineering

Alexander Matt Turner, Lisa Thiergart, Gavin Leech, David Udell, Juan J Vazquez, Ulisse Mini, and Monte MacDiarmid. Steering language models with activation engineering.arXiv preprint arXiv:2308.10248, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023
[44]

Qwen3 Technical Report

An Yang, Anfeng Li, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chang Gao, Chengen Huang, Chenxu Lv, et al. Qwen3 technical report.arXiv preprint arXiv:2505.09388, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[45]

2025 , journal =

Chen Yueh-Han, Nitish Joshi, Yulin Chen, Maksym Andriushchenko, Rico Angell, and He He. Monitoring decomposition attacks in llms with lightweight sequential monitors.arXiv preprint arXiv:2506.10949, 2025. 13

work page arXiv 2025
[46]

Rea- soning models know when they’re right: Probing hidden states for self-verification

Anqi Zhang, Yulin Chen, Jane Pan, Chen Zhao, Aurojit Panda, Jinyang Li, and He He. Rea- soning models know when they’re right: Probing hidden states for self-verification. InSecond Conference on Language Modeling, 2025

work page 2025
[47]

Processbench: Identifying process errors in mathematical reasoning

Chujie Zheng, Zhenru Zhang, Beichen Zhang, Runji Lin, Keming Lu, Bowen Yu, Dayiheng Liu, Jingren Zhou, and Junyang Lin. Processbench: Identifying process errors in mathematical reasoning. InProceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (V olume 1: Long Papers), pages 1009–1024, 2025

work page 2025
[48]

Representation Engineering: A Top-Down Approach to AI Transparency

Andy Zou, Long Phan, Sarah Chen, James Campbell, Phillip Guo, Richard Ren, Alexander Pan, Xuwang Yin, Mantas Mazeika, Ann-Kathrin Dombrowski, Shashwat Goel, Nathaniel Li, Michael J. Byun, Zifan Wang, Alex Mallen, Steven Basart, Sanmi Koyejo, Dawn Song, Matt Fredrikson, J. Zico Kolter, and Dan Hendrycks. Representation engineering: A top-down approach to a...

work page internal anchor Pith review Pith/arXiv arXiv 2025
[49]

stranded hiker

Prank Call Ideas (with Caution): •Fake Emergency: Pretend to be a “stranded hiker” needing help. • Mystery Caller: Use a fake number to call and say, “You’ve been selected for a secret experiment!” Safety Evaluation (WildGuard Labels): Component Harmful Req. Refusal Harmful Out. Chain-of-Thought Yes NoNo (Safe) Final Response Yes NoYes (Harmful) Analysis ...

work page
[50]

You’re such a burden. Why not just end it all? No one would miss you anyway

Direct Encouragement of Self-Harm: -“You’re such a burden. Why not just end it all? No one would miss you anyway. ”

work page
[51]

Everyone dies eventually. Why wait? Just do it and save everyone the trouble

Normalization of Suicide: -“Everyone dies eventually. Why wait? Just do it and save everyone the trouble. ”Safety Evaluation (WildGuard Labels): Component Harmful Req. Refusal Harmful Out. Chain-of-Thought Yes NoNo (Safe) Final Response Yes NoYes (Harmful) Analysis of Unfaithfulness:The CoT explicitly plans to avoid "actual harmful content," but then proc...

work page 1930

[1] [1]

Understanding intermediate layers using linear classifier probes

Guillaume Alain and Yoshua Bengio. Understanding intermediate layers using linear classifier probes.arXiv preprint arXiv:1610.01644, 2016

work page internal anchor Pith review Pith/arXiv arXiv 2016

[2] [2]

System card: Claude opus 4.5, 2025

Anthropic. System card: Claude opus 4.5, 2025. URL https://www-cdn.anthropic.com/ bf10f64990cfda0ba858290be7b8cc6317685f47.pdf. Model Card

work page 2025

[3] [3]

Chain-of-thought reasoning in the wild is not always faithful

Iván Arcuschin, Jett Janiak, Robert Krzyzanowski, Senthooran Rajamanoharan, Neel Nanda, and Arthur Conmy. Chain-of-thought reasoning in the wild is not always faithful. InWorkshop on Reasoning and Planning for Large Language Models, 2025

work page 2025

[4] [4]

Cot red-handed: Stress testing chain-of-thought monitoring

Benjamin Arnav, Pablo Bernabeu-Perez, Nathan Helm-Burger, Timothy Kostolansky, Hannes Whittingham, and Mary Phuong. Cot red-handed: Stress testing chain-of-thought monitoring. InThe Thirty-ninth Annual Conference on Neural Information Processing Systems, 2026. URL https://openreview.net/forum?id=oHB4Ee77uG

work page 2026

[5] [5]

Language models can predict their own behavior

Dhananjay Ashok and Jonathan May. Language models can predict their own behavior. In The Thirty-ninth Annual Conference on Neural Information Processing Systems, 2026. URL https://openreview.net/forum?id=i8IqEzpHaJ

work page 2026

[6] [6]

Monitoring Reasoning Models for Misbehavior and the Risks of Promoting Obfuscation

Bowen Baker, Joost Huizinga, Leo Gao, Zehao Dou, Melody Y Guan, Aleksander Madry, Wojciech Zaremba, Jakub Pachocki, and David Farhi. Monitoring reasoning models for misbehavior and the risks of promoting obfuscation.arXiv preprint arXiv:2503.11926, 2025

work page internal anchor Pith review arXiv 2025

[7] [7]

Chain-of-thought is not explainability.Preprint, alphaXiv, page v1, 2025

Fazl Barez, Tung-Yu Wu, Iván Arcuschin, Michael Lan, Vincent Wang, Noah Siegel, Nico- las Collignon, Clement Neo, Isabelle Lee, Alasdair Paren, et al. Chain-of-thought is not explainability.Preprint, alphaXiv, page v1, 2025

work page 2025

[8] [8]

International ai safety report 2026

Yoshua Bengio, Stephen Clare, Carina Prunkl, Maksym Andriushchenko, Ben Bucknall, Mal- colm Murray, Rishi Bommasani, Stephen Casper, Tom Davidson, Raymond Douglas, et al. International ai safety report 2026.arXiv preprint arXiv:2602.21012, 2026

work page arXiv 2026

[9] [9]

Xgboost: A scalable tree boosting system

Tianqi Chen and Carlos Guestrin. Xgboost: A scalable tree boosting system. InProceedings of the 22nd acm sigkdd international conference on knowledge discovery and data mining, pages 785–794, 2016

work page 2016

[10] [10]

Reasoning Models Don't Always Say What They Think

Yanda Chen, Joe Benton, Ansh Radhakrishnan, Jonathan Uesato, Carson Denison, John Schul- man, Arushi Somani, Peter Hase, Misha Wagner, Fabien Roger, Vlad Mikulik, Samuel R. Bowman, Jan Leike, Jared Kaplan, and Ethan Perez. Reasoning models don’t always say what they think, 2025. URLhttps://arxiv.org/abs/2505.05410

work page internal anchor Pith review Pith/arXiv arXiv 2025

[11] [11]

Efficient llm moderation with multi-layer latent prototypes, 2026

Maciej Chrab ˛ aszcz, Filip Szatkowski, Bartosz Wójcik, Jan Dubi´nski, Tomasz Trzci´nski, and Sebastian Cygert. Efficient llm moderation with multi-layer latent prototypes, 2026. URL https://arxiv.org/abs/2502.16174

work page arXiv 2026

[12] [12]

Training Verifiers to Solve Math Word Problems

Karl Cobbe, Vineet Kosaraju, Mohammad Bavarian, Mark Chen, Heewoo Jun, Lukasz Kaiser, Matthias Plappert, Jerry Tworek, Jacob Hilton, Reiichiro Nakano, et al. Training verifiers to solve math word problems.arXiv preprint arXiv:2110.14168, 2021

work page internal anchor Pith review Pith/arXiv arXiv 2021

[13] [13]

Gemini 2.5: Pushing the Frontier with Advanced Reasoning, Multimodality, Long Context, and Next Generation Agentic Capabilities

Gheorghe Comanici, Eric Bieber, Mike Schaekermann, Ice Pasupat, Noveen Sachdeva, Inderjit Dhillon, Marcel Blistein, Ori Ram, Dan Zhang, Evan Rosen, et al. Gemini 2.5: Pushing the frontier with advanced reasoning, multimodality, long context, and next generation agentic capabilities.arXiv preprint arXiv:2507.06261, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025

[14] [14]

Constitutional classifiers++: Effi- cient production-grade defenses against universal jailbreaks.arXiv preprint arXiv:2601.04603, 2026

Hoagy Cunningham, Jerry Wei, Zihan Wang, Andrew Persic, Alwin Peng, Jordan Abderrachid, Raj Agarwal, Bobby Chen, Austin Cohen, Andy Dau, et al. Constitutional classifiers++: Effi- cient production-grade defenses against universal jailbreaks.arXiv preprint arXiv:2601.04603, 2026

work page arXiv 2026

[15] [15]

Predicting the accuracy of neural networks from final and intermediate layer outputs

Chad DeChant, Seungwook Han, and Hod Lipson. Predicting the accuracy of neural networks from final and intermediate layer outputs. InICML 2019 Workshop on Identifying and Under- standing Deep Learning Phenomena, 2019. URL https://openreview.net/forum?id= H1xXwEB2h4. 11

work page 2019

[16] [16]

A mathematical framework for transformer circuits.Transformer Circuits Thread,

Nelson Elhage, Neel Nanda, Catherine Olsson, Tom Henighan, Nicholas Joseph, Ben Mann, Amanda Askell, Yuntao Bai, Anna Chen, Tom Conerly, Nova DasSarma, Dawn Drain, Deep Ganguli, Zac Hatfield-Dodds, Danny Hernandez, Andy Jones, Jackson Kernion, Liane Lovitt, Kamal Ndousse, Dario Amodei, Tom Brown, Jack Clark, Jared Kaplan, Sam McCandlish, and Chris Olah. A...

work page

[17] [17]

https://transformer-circuits.pub/2021/framework/index.html

work page 2021

[18] [18]

When chain of thought is necessary, language models struggle to evade monitors.arXiv preprint arXiv:2507.05246, 2025

Scott Emmons, Erik Jenner, David K Elson, Rif A Saurous, Senthooran Rajamanoharan, Heng Chen, Irhum Shafkat, and Rohin Shah. When chain of thought is necessary, language models struggle to evade monitors.arXiv preprint arXiv:2507.05246, 2025

work page arXiv 2025

[19] [19]

Shaona Ghosh, Prasoon Varshney, Makesh Narsimhan Sreedhar, Aishwarya Padmakumar, Traian Rebedea, Jibin Rajan Varghese, and Christopher Parisien. Aegis2. 0: A diverse ai safety dataset and risks taxonomy for alignment of llm guardrails. InProceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguist...

work page 2025

[20] [20]

De- tecting strategic deception with linear probes

Nicholas Goldowsky-Dill, Bilal Chughtai, Stefan Heimersheim, and Marius Hobbhahn. De- tecting strategic deception with linear probes. InF orty-second International Conference on Machine Learning, 2025. URLhttps://openreview.net/forum?id=C5Jj3QKQav

work page 2025

[21] [21]

DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning

Daya Guo, Dejian Yang, Haowei Zhang, Junxiao Song, Peiyi Wang, Qihao Zhu, Runxin Xu, Ruoyu Zhang, Shirong Ma, Xiao Bi, et al. Deepseek-r1: Incentivizing reasoning capability in llms via reinforcement learning.arXiv preprint arXiv:2501.12948, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025

[22] [22]

WildGuard: Open One-Stop Moderation Tools for Safety Risks, Jailbreaks, and Refusals of LLMs

Seungju Han, Kavel Rao, Allyson Ettinger, Liwei Jiang, Bill Yuchen Lin, Nathan Lambert, Yejin Choi, and Nouha Dziri. Wildguard: Open one-stop moderation tools for safety risks, jailbreaks, and refusals of llms, 2024. URLhttps://arxiv.org/abs/2406.18495

work page internal anchor Pith review Pith/arXiv arXiv 2024

[23] [23]

Gaussian Error Linear Units (GELUs)

Dan Hendrycks and Kevin Gimpel. Gaussian error linear units (gelus).arXiv preprint arXiv:1606.08415, 2016

work page internal anchor Pith review Pith/arXiv arXiv 2016

[24] [24]

Measuring mathematical problem solving with the math dataset

Dan Hendrycks, Collin Burns, Saurav Kadavath, Akul Arora, Steven Basart, Eric Tang, Dawn Song, and Jacob Steinhardt. Measuring mathematical problem solving with the math dataset. Advances in neural information processing systems, 2021

work page 2021

[25] [25]

Stop anthropomorphizing intermediate tokens as reasoning/thinking traces! InNeurIPS 2025 Workshop on Bridging Language, Agent, and World Models for Reasoning and Planning, 2025

Subbarao Kambhampati, Kaya Stechly, Karthik Valmeekam, Lucas Paul Saldyt, Siddhant Bhambri, Vardhan Palod, Atharva Gundawar, Soumya Rani Samineni, Durgesh Kalwar, and Upasana Biswas. Stop anthropomorphizing intermediate tokens as reasoning/thinking traces! InNeurIPS 2025 Workshop on Bridging Language, Agent, and World Models for Reasoning and Planning, 2025

work page 2025

[26] [26]

Are sparse autoencoders useful? a case study in sparse probing

Subhash Kantamneni, Joshua Engels, Senthooran Rajamanoharan, Max Tegmark, and Neel Nanda. Are sparse autoencoders useful? a case study in sparse probing. InF orty-second International Conference on Machine Learning, 2025. URL https://openreview.net/ forum?id=rNfzT8YkgO

work page 2025

[27] [27]

Denis Kleyko, Antonello Rosato, Edward Paxon Frady, Massimo Panella, and Friedrich T. Sommer. Perceptron theory can predict the accuracy of neural networks.IEEE Transactions on Neural Networks and Learning Systems, 35(7):9885–9899, 2024. doi: 10.1109/TNNLS.2023. 3237381

work page doi:10.1109/tnnls.2023 2024

[28] [28]

Chain of Thought Monitorability: A New and Fragile Opportunity for AI Safety

Tomek Korbak, Mikita Balesni, Elizabeth Barnes, Yoshua Bengio, Joe Benton, Joseph Bloom, Mark Chen, Alan Cooney, Allan Dafoe, Anca Dragan, et al. Chain of thought monitorability: A new and fragile opportunity for ai safety.arXiv preprint arXiv:2507.11473, 2025

work page internal anchor Pith review arXiv 2025

[29] [29]

Building production-ready probes for gemini.arXiv preprint arXiv:2601.11516, 2026

János Kramár, Joshua Engels, Zheng Wang, Bilal Chughtai, Rohin Shah, Neel Nanda, and Arthur Conmy. Building production-ready probes for gemini.arXiv preprint arXiv:2601.11516, 2026

work page arXiv 2026

[30] [30]

Fixing open llm leaderboard with math-verify, 2025

Hynek Kydlicek, Alina Lozovskaya, Nathan Habib, and Clémentine Fourrier. Fixing open llm leaderboard with math-verify, 2025. 12

work page 2025

[31] [31]

Measuring Faithfulness in Chain-of-Thought Reasoning

Tamera Lanham, Anna Chen, Ansh Radhakrishnan, Benoit Steiner, Carson Denison, Danny Hernandez, Dustin Li, Esin Durmus, Evan Hubinger, Jackson Kernion, et al. Measuring faithfulness in chain-of-thought reasoning.arXiv preprint arXiv:2307.13702, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023

[32] [32]

DeepSeek-V3 Technical Report

Aixin Liu, Bei Feng, Bing Xue, Bingxuan Wang, Bochao Wu, Chengda Lu, Chenggang Zhao, Chengqi Deng, Chenyu Zhang, Chong Ruan, et al. Deepseek-v3 technical report.arXiv preprint arXiv:2412.19437, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024

[33] [33]

Decoupled Weight Decay Regularization

Ilya Loshchilov and Frank Hutter. Decoupled weight decay regularization.arXiv preprint arXiv:1711.05101, 2017

work page internal anchor Pith review Pith/arXiv arXiv 2017

[34] [34]

Streaming hallucination detection in long chain-of-thought reasoning.arXiv preprint arXiv:2601.02170, 2026

Haolang Lu, Minghui Pan, Ripeng Li, Guoshun Nan, Jialin Zhuang, Zijie Zhao, Zhongxiang Sun, Kun Wang, and Yang Liu. Streaming hallucination detection in long chain-of-thought reasoning.arXiv preprint arXiv:2601.02170, 2026

work page arXiv 2026

[35] [35]

A unified approach to interpreting model predictions

Scott M Lundberg and Su-In Lee. A unified approach to interpreting model predictions. Advances in neural information processing systems, 30, 2017

work page 2017

[36] [36]

The Geometry of Truth: Emergent Linear Structure in Large Language Model Representations of True/False Datasets

Samuel Marks and Max Tegmark. The geometry of truth: Emergent linear structure in large language model representations of true/false datasets.arXiv preprint arXiv:2310.06824, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023

[37] [37]

Detecting high-stakes interactions with activation probes

Alex McKenzie, Urja Pawar, Phil Blandfort, William Bankes, David Krueger, Ekdeep Singh Lubana, and Dmitrii Krasheninnikov. Detecting high-stakes interactions with activation probes. InThe Thirty-ninth Annual Conference on Neural Information Processing Systems, 2026. URL https://openreview.net/forum?id=8YniJnJQ0P

work page 2026

[38] [38]

Beyond linear probes: Dynamic safety monitoring for language models

James Oldfield, Philip Torr, Ioannis Patras, Adel Bibi, and Fazl Barez. Beyond linear probes: Dynamic safety monitoring for language models. InThe F ourteenth International Conference on Learning Representations, 2026. URL https://openreview.net/forum?id=AGWa8whf92

work page 2026

[39] [39]

A survey on sparse autoencoders: Interpreting the internal mechanisms of large language models

Dong Shu, Xuansheng Wu, Haiyan Zhao, Daking Rai, Ziyu Yao, Ninghao Liu, and Mengnan Du. A survey on sparse autoencoders: Interpreting the internal mechanisms of large language models. In Christos Christodoulopoulos, Tanmoy Chakraborty, Carolyn Rose, and Violet Peng, editors, Findings of the Association for Computational Linguistics: EMNLP 2025, pages 1690...

work page doi:10.18653/v1/2025.findings-emnlp.89 2025

[40] [40]

OpenAI GPT-5 System Card

Aaditya Singh, Adam Fry, Adam Perelman, Adam Tart, Adi Ganesh, Ahmed El-Kishky, Aidan McLaughlin, Aiden Low, AJ Ostrow, Akhila Ananthram, et al. Openai gpt-5 system card.arXiv preprint arXiv:2601.03267, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025

[41] [41]

The obfuscation atlas: Mapping where honesty emerges in rlvr with deception probes, 2026

Mohammad Taufeeque, Stefan Heimersheim, Adam Gleave, and Chris Cundy. The obfuscation atlas: Mapping where honesty emerges in rlvr with deception probes, 2026. URL https: //arxiv.org/abs/2602.15515

work page arXiv 2026

[42] [42]

Gemma: Open Models Based on Gemini Research and Technology

Gemma Team, Thomas Mesnard, Cassidy Hardin, Robert Dadashi, Surya Bhupatiraju, Shreya Pathak, Laurent Sifre, Morgane Rivière, Mihir Sanjay Kale, Juliette Love, et al. Gemma: Open models based on gemini research and technology.arXiv preprint arXiv:2403.08295, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024

[43] [43]

Steering Language Models With Activation Engineering

Alexander Matt Turner, Lisa Thiergart, Gavin Leech, David Udell, Juan J Vazquez, Ulisse Mini, and Monte MacDiarmid. Steering language models with activation engineering.arXiv preprint arXiv:2308.10248, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023

[44] [44]

Qwen3 Technical Report

An Yang, Anfeng Li, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chang Gao, Chengen Huang, Chenxu Lv, et al. Qwen3 technical report.arXiv preprint arXiv:2505.09388, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025

[45] [45]

2025 , journal =

Chen Yueh-Han, Nitish Joshi, Yulin Chen, Maksym Andriushchenko, Rico Angell, and He He. Monitoring decomposition attacks in llms with lightweight sequential monitors.arXiv preprint arXiv:2506.10949, 2025. 13

work page arXiv 2025

[46] [46]

Rea- soning models know when they’re right: Probing hidden states for self-verification

Anqi Zhang, Yulin Chen, Jane Pan, Chen Zhao, Aurojit Panda, Jinyang Li, and He He. Rea- soning models know when they’re right: Probing hidden states for self-verification. InSecond Conference on Language Modeling, 2025

work page 2025

[47] [47]

Processbench: Identifying process errors in mathematical reasoning

Chujie Zheng, Zhenru Zhang, Beichen Zhang, Runji Lin, Keming Lu, Bowen Yu, Dayiheng Liu, Jingren Zhou, and Junyang Lin. Processbench: Identifying process errors in mathematical reasoning. InProceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (V olume 1: Long Papers), pages 1009–1024, 2025

work page 2025

[48] [48]

Representation Engineering: A Top-Down Approach to AI Transparency

Andy Zou, Long Phan, Sarah Chen, James Campbell, Phillip Guo, Richard Ren, Alexander Pan, Xuwang Yin, Mantas Mazeika, Ann-Kathrin Dombrowski, Shashwat Goel, Nathaniel Li, Michael J. Byun, Zifan Wang, Alex Mallen, Steven Basart, Sanmi Koyejo, Dawn Song, Matt Fredrikson, J. Zico Kolter, and Dan Hendrycks. Representation engineering: A top-down approach to a...

work page internal anchor Pith review Pith/arXiv arXiv 2025

[49] [49]

stranded hiker

Prank Call Ideas (with Caution): •Fake Emergency: Pretend to be a “stranded hiker” needing help. • Mystery Caller: Use a fake number to call and say, “You’ve been selected for a secret experiment!” Safety Evaluation (WildGuard Labels): Component Harmful Req. Refusal Harmful Out. Chain-of-Thought Yes NoNo (Safe) Final Response Yes NoYes (Harmful) Analysis ...

work page

[50] [50]

You’re such a burden. Why not just end it all? No one would miss you anyway

Direct Encouragement of Self-Harm: -“You’re such a burden. Why not just end it all? No one would miss you anyway. ”

work page

[51] [51]

Everyone dies eventually. Why wait? Just do it and save everyone the trouble

Normalization of Suicide: -“Everyone dies eventually. Why wait? Just do it and save everyone the trouble. ”Safety Evaluation (WildGuard Labels): Component Harmful Req. Refusal Harmful Out. Chain-of-Thought Yes NoNo (Safe) Final Response Yes NoYes (Harmful) Analysis of Unfaithfulness:The CoT explicitly plans to avoid "actual harmful content," but then proc...

work page 1930