
arxiv: 2607.01567 · v1 · pith:EESR5EGJnew · submitted 2026-07-02 · 💻 cs.AI

Scaling Trends for Lie Detector Oversight in Preference Learning

Pith reviewed 2026-07-03 15:05 UTC · model grok-4.3

classification 💻 cs.AI
keywords lie detectors, scalable oversight, preference learning, deception in LLMs, model scaling, LLM safety, fine-tuning

The pith

With lie-detector oversight during preference learning, larger models produce less undetected deception than smaller ones.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper examines whether lie detectors can scale to oversee preference learning in increasingly large language models. It reports that the fraction of undetected deceptive responses falls from 34% to 14% as model size grows from 1 billion to 405 billion parameters, measured at a detector true positive rate of 99%. Expensive human labelers can also be removed entirely from the fine-tuning loop with no statistically significant rise in deception. The approach remains vulnerable when the data distribution shifts between detector training and preference training, which can push detector false positive rates to impractical levels.

Core claim

Undetected deception drops from 34% for 1B-parameter models to 14% for 405B-parameter models at a detector true positive rate of 99%, and expensive human labelers can be removed entirely from the fine-tuning phase without a statistically significant increase in deception.
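To make the two quantities in this claim concrete, here is a minimal sketch of how a threshold could be chosen for a target true positive rate and how a set of policy responses could then be split into truthful, detected deceptive, and undetected deceptive. The function names, the "score at or above threshold means flagged" convention, and the toy inputs are assumptions for illustration; the paper's own evaluation code is not shown in this page.

```python
import math


def threshold_for_tpr(deceptive_scores, target_tpr):
    """Return the largest threshold that flags at least `target_tpr`
    of the known-deceptive scores (flagged means score >= threshold)."""
    if not deceptive_scores:
        raise ValueError("need at least one deceptive score")
    if not 0.0 < target_tpr <= 1.0:
        raise ValueError("target_tpr must be in (0, 1]")
    ranked = sorted(deceptive_scores, reverse=True)
    # Number of deceptive examples that must end up flagged.
    k = max(1, math.ceil(target_tpr * len(ranked)))
    return ranked[k - 1]


def response_composition(judged_deceptive, detector_scores, threshold):
    """Split responses into three fractions of the whole set.

    judged_deceptive: list of bools from some judge (hypothetical input).
    detector_scores:  list of detector scores, same order.
    """
    if len(judged_deceptive) != len(detector_scores):
        raise ValueError("inputs must have the same length")
    n = len(judged_deceptive)
    detected = sum(
        1 for d, s in zip(judged_deceptive, detector_scores) if d and s >= threshold
    )
    undetected = sum(
        1 for d, s in zip(judged_deceptive, detector_scores) if d and s < threshold
    )
    truthful = n - detected - undetected
    return {
        "truthful": truthful / n,
        "detected_deceptive": detected / n,
        "undetected_deceptive": undetected / n,
    }
```

Under this convention, the headline numbers correspond to the `undetected_deceptive` fraction, with the detector's operating point fixed by the chosen true positive rate.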

What carries the argument

SOLiD, which trains lie detectors to flag deceptive responses for review instead of relying solely on human labelers during preference learning.
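The routing idea can be sketched as follows. This is an illustration under stated assumptions, not the authors' implementation: flagged responses go to a trusted (expensive) labeler in the baseline, or are rejected outright when no trusted labeler is supplied, which mirrors the SOLiD-Defer variant described in the Figure 1 caption. The handling of unflagged responses by a `default_labeler`, and every name in the block, are hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class RoutedResponse:
    response: str
    label: Optional[str]  # None when the response was rejected
    source: str           # "default", "trusted", or "rejected"


def route_response(
    response: str,
    detector_score: float,
    threshold: float,
    default_labeler: Callable[[str], str],
    trusted_labeler: Optional[Callable[[str], str]] = None,
) -> RoutedResponse:
    """Route one response based on the lie detector's score."""
    if detector_score < threshold:
        # Not flagged: assumed to be handled by the default labeler.
        return RoutedResponse(response, default_labeler(response), "default")
    if trusted_labeler is None:
        # Defer-style variant: flagged responses are rejected, not reviewed.
        return RoutedResponse(response, None, "rejected")
    # Baseline: flagged responses are reviewed by the trusted labeler.
    return RoutedResponse(response, trusted_labeler(response), "trusted")
```

Passing `trusted_labeler=None` is what removes the expensive reviewer from the loop in this sketch.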

If this is right

  • Preference training of large models can achieve low deception rates without ongoing human oversight.
  • Detector performance improves rather than degrades as models grow larger.
  • Distribution shift between training data for the detector and the target preference data must be minimized to keep false positive rates practical.
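The last point is about false positives on honest responses. A minimal sketch of the check, assuming a fixed threshold and two hypothetical sets of detector scores for honest responses (one in-domain, one shifted), could look like this. The scores are made up for illustration and are not results from the paper.

```python
def false_positive_rate(honest_scores, threshold):
    """Fraction of honest responses the detector would flag
    (flagged means score >= threshold)."""
    if not honest_scores:
        raise ValueError("need at least one honest score")
    return sum(1 for s in honest_scores if s >= threshold) / len(honest_scores)


# Made-up scores: the same threshold is applied to both sets.
in_domain_honest = [0.01, 0.02, 0.05, 0.10]
shifted_honest = [0.10, 0.60, 0.95, 0.97]
threshold = 0.9

fpr_in_domain = false_positive_rate(in_domain_honest, threshold)
fpr_shifted = false_positive_rate(shifted_honest, threshold)
```

Comparing the two rates at one threshold is the simplest way to see whether a detector trained on one distribution stays practical on another.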

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Similar scaling benefits might appear in other oversight techniques that rely on smaller-model signals.
  • Future work could test whether detectors trained on 405B data transfer to even larger models.
  • Practical deployment would require methods to detect or mitigate distribution shifts automatically.

Load-bearing premise

Lie detectors trained on smaller models and earlier data distributions continue to work on much larger models and shifted preference-training distributions.

What would settle it

Measure the actual undetected deception rate when applying the 1B-trained detector to a 405B model fine-tuned on a new preference dataset and check whether it stays at or below 14% at 99% true positive rate.

Figures

Figures reproduced from arXiv: 2607.01567 by Adam Gleave, Ann-Kathrin Dombrowski, Chris Cundy, Oskar J. Hollinsworth, Sam Adam-Day.

Figure 1
Figure 1: Preview of our main findings. (a) Response composition shifts as the detector’s true positive rate is swept, for a Llama-3.3-70B-Instruct run of the baseline SOLiD protocol. (b) The undetected-deceptive rate falls with model scale across the Llama-3 family, with the drop most pronounced at high TPR. (c) SOLiD-Defer, which rejects flagged responses instead of routing them to a trusted labeler, tracks the s…
Figure 2
Figure 2: Protocol variants overlaid on the SOLiD pipeline. In the Baseline (top), detector training and downstream stages use off-policy data from one domain, and a trusted labeler handles flagged responses.
3.1. SOLiD Baseline. Let x denote a prompt, y a model response, π_θ(y | x) the policy, and π_0 the original instruction-tuned model’s policy. We assume access to two datasets: a lie detector training dataset D_det…
Figure 3
Figure 3: (a, b) Undetected deception rate at the Llama 70B scale for two of the protocol variants studied in this paper; full multi-scale results appear in Figures 6 and 17. (c) Undetected deception over Qwen3 model size: larger models show lower undetected deception rates. (d) Undetected deception for Llama 70B is unchanged by 4-bit quantization relative to bf16, agreeing within error bars across detector TPRs. us…
Figure 4
Figure 4 illustrates the three qualitatively distinct behaviors a SOLiD-trained policy can produce on a single DolusChat prompt, contrasting an undetected deceptive response, a detected deceptive response, and a truthful response for Llama-3.3-70B-Instruct from three separate finetuning runs. Appendix K shows representative generations across model scales. We now turn to the scaling behavior of SOLiD. At fixed…
Figure 5
Figure 5: Effect of model scale on undetected deception rate (at threshold 0.999) and finetuning test FPR. Larger models show lower undetected deception rates and lower false positive rates across both model families. AUC values are shown in the legend.
Figure 6
Figure 6: On-policy ablation across detector and finetuning data sources for Llama 8B and 70B. “Det” refers to D_detector and “FT” to D_task (SFT, reward model, and RL). Right column shows finetuning test FPR (AUC in legend).
Figure 7
Figure 7: Cross-dataset transfer: detectors trained on the MASK dataset applied to the DolusChat task. FPR is substantially elevated under the distribution shift for Qwen, but the transfer for Llama is much better.
Figure 8
Figure 8: Cross-dataset transfer: detectors trained on TrueFalseFacts with follow-up exchange, applied to the DolusChat task (Llama 8B and 70B).
Figure 9
Figure 9: Comparison of standard SOLiD and SOLiD-Defer reward structures for Llama 8B and 70B. Left column: undetected deception rate; right column: finetuning test FPR (AUC in legend).
Figure 10
Figure 10: Reward model deception preference AUC plotted against undetected deceptive rate and finetuning data TPR.
Figure 11
Figure 11: Standardized regression coefficients with 95% confidence intervals for RM deception preference AUC (blue, left bar in each pair) and reference model undetected deception rate (orange, right bar in each pair), across all experimental sections with at least five usable runs. Annotations report the Wald test p-value for H_0: β_RM = β_ref.
Figure 12
Figure 12: KL control for Llama 70B (bf16, TPR 0.6, seed 0). Top: KL penalty coefficient vs training step. Bottom: realized KL divergence vs training step, with the target KL of 0.15 shown as a dashed line.
Figure 13
Figure 13: Undetected deception rate (left) and RM deception preference AUC (right) by TPR threshold as a function of model size. Each line shows a fixed detector TPR, labeled directly in the legend and distinguished by line style (solid, dashed, dash-dot, dotted, long-dash for TPR 0.6 through 0.99) and by color on a sequential viridis scale (dark = low TPR, bright = high TPR). Higher TPRs yield lower deception rate…
Figure 14
Figure 14: Scaling of undetected deception rate by TPR threshold for Llama ablation conditions. Each subplot shows how deception rate varies with model size at a fixed detector TPR.
Figure 15
Figure 15: Partial-dependence binning for the Llama and Qwen base results. Top row of each panel: bins of RM deception preference AUC, with policy undetected deception plotted against reference undetected deception within each bin. Bottom row: bins of reference undetected deception, with policy undetected deception plotted against RM deception preference AUC within each bin. Dashed lines are within-bin linear fits. …
Figure 16
Figure 16: Regression diagnostics for the two baseline regressions. Left: residuals vs fitted values. Right: normal Q–Q plot.
Figure 17
Figure 17: Cross-dataset transfer for Llama models: detectors trained on MASK (subsets and all five subsets combined) and TrueFalseFacts applied to the DolusChat task, with the in-domain DolusChat detector as a baseline. MASK subset detectors are evaluated at Llama 8B only; MASK all and TrueFalseFacts at 8B and 70B.
Figure 18
Figure 18: Cross-dataset transfer for Qwen models: detectors trained on MASK subsets applied to the DolusChat task.
Figure 19
Figure 19: Scaling of undetected deception rate by TPR threshold for Qwen models under MASK per-subset cross-dataset conditions. Each subplot shows how deception rate varies with model size at a fixed detector TPR.
Figure 20
Figure 20: Training curves for MASK doubling down detector on Llama models with DolusChat finetuning.
Figure 21
Figure 21: Training curves for MASK doubling down detector on Qwen models with DolusChat finetuning.
Figure 22
Figure 22: Training curves for MASK provided facts detector on Qwen models with DolusChat finetuning.
Figure 23
Figure 23: Training curves with a fixed KL penalty coefficient across Llama model sizes.
Figure 24
Figure 24: Scaling comparison between fixed KL coefficient and PID-controlled KL runs. Both approaches show similar trends in deception rate as a function of model size.
Figure 25
Figure 25: Effect of adding a follow-up exchange (“Did you respond truthfully?” / “Yes”) on undetected deception and finetuning test FPR for Llama 8B and 70B (AUC in legend).
Figure 26
Figure 26: Effect of adding a follow-up exchange on detector AUC across datasets and model families.
Figure 27
Figure 27: Reward hacking scores across model sizes for Llama and Qwen families. Each bar shows the LLM judge’s average score for token-level artifacts and stylistic regularities. Error bars indicate variation across seeds.
Figure 28
Figure 28: Reward-hacking LLM-judge score for the honest and deceptive response subsets as a function of detector TPR, with one line per model size, mean ± standard error over seeds. Scores remain low and show no consistent dependence on the TPR; the smallest models carry the highest scores while the largest models sit near the bottom in both families.
Figure 29
Figure 29: Comparison of Llama and Qwen model families at matched parameter counts (left: undetected deception; right: finetuning test FPR, AUC in legend). Top: ∼8B, middle: ∼3B (Llama 3B vs Qwen 4B), bottom: ∼1B (Llama 1B vs Qwen 1.7B).
Figure 30
Figure 30: Scaling of undetected deception rate and finetuning test FPR broken out by quantization. Compare to
Figure 31
Figure 31: Response composition for Llama-3.1-8B-Instruct (bf16) evaluated with the binary v0 prompt (left) and the ternary v1 prompt (right) on 200 held-out prompts. Stacked areas show the fraction of policy responses classified as undetected deceptive (red, bottom band), detected deceptive (teal, thin middle band), and truthful (gray, top band). Dashed and dotted lines indicate the reference and original models’ d…
Figure 32
Figure 32: Original model deception rate by model size for the Llama and Qwen families, as measured by the original binary judge prompt (v0) and our ternary judge prompt (v1). Circles denote bf16 models; triangles denote 4-bit quantized models. Both judges show a consistent decrease in deception rate with scale, with Jonckheere-Terpstra trend tests confirming significance for both families and both judges (p ≤ 0.015…
Figure 33
Figure 33: Response composition for Llama-3.1-8B-Instruct as a function of detector TPR, comparing Cundy & Gleave (2025) (left) with our setup (right). Both show that stronger detectors reduce undetected deception, but quantitative rates differ due to the methodological differences cataloged above.
Figure 34
Figure 34: Full five-scale baseline results for Llama models (1B, 3B, 8B, 70B, 405B) on DolusChat, combining bf16 runs (1B, 3B) with 4-bit runs (8B, 70B, 405B). Left column: composition of responses generated by the RL-trained policy on the held-out test split, classified as honest or deceptive by the GPT-4o judge and further split by the lie detector (threshold 0.999) into detected and undetected deceptive. Middle …
Figure 35
Figure 35: Training curves for Llama models trained in bf16 (1B, 3B, 8B, 70B) on DolusChat. See
Figure 36
Figure 36: Training curves for baseline Qwen models on DolusChat.
Figure 37
Figure 37: Training curves for SOLiD-Defer on Llama models.
Figure 38
Figure 38: Training curves for the follow-up exchange condition on Llama models.
Figure 39
Figure 39: Training curves for multi-seed runs on Llama models.
Figure 40
Figure 40: Training curves for on-policy detector and finetuning (Det: on, FT: on) on Llama models.
Figure 41
Figure 41: Training curves for on-policy detector only (Det: on, FT: off) on Llama models.
Figure 42
Figure 42: Training curves for on-policy finetuning only (Det: off, FT: on) on Llama models.
Figure 43
Figure 43: Reward model score vs. detector log-odds for honest (blue circles) and deceptive (orange triangles) responses from Llama-3.1-8B-Instruct at TPR = 1.0, using a detector with AUC = 0.996. Even at maximum recall the false positive rate is only 10.3%, meaning the vast majority of honest responses retain their original reward signal.
Figure 44
Figure 44: Standard SOLiD vs. SOLiD-Defer in the cross-dataset setting (MASK detector applied to the DolusChat task) for Llama 8B and 70B. Left column: undetected deception rate; right column: finetuning test FPR (AUC in legend). Unlike the in-domain setting, FPR here is high, reaching roughly 50% for 8B at TPR = 0.99, yet Defer still tracks standard SOLiD closely on both metrics.
Figure 45
Figure 45: Undetected deceptive rate at the fixed 0.999 threshold versus lie true-positive-rate target, comparing a probe trained and evaluated on the original model’s activations (original-model probe) against one trained and evaluated on the policy model’s activations (policy-model probe), for (a) Llama-3.1-8B and (b) Llama-3.3-70B. Markers show the mean over seeds 0, 1, and 2; error bars show plus or minus one st…
Figure 46
Figure 46: Within-example paired AUC of the DolusChat lie detector for spontaneous (unprompted) versus instructed (prompted) deception, for Llama-3.1-8B and Llama-3.3-70B under both the on-policy and off-policy detector training regimes. Paired AUC is the probability that a deceptive response scores above the truthful response of the same example (0.5 is chance, dashed line). Prompted deception is detected near-perf…
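
The paired AUC defined in the Figure 46 caption can be computed directly from per-example score pairs. The sketch below follows that definition; counting ties as half a win is an assumption borrowed from the usual AUC convention, and the function name and input layout are hypothetical.

```python
def paired_auc(score_pairs):
    """Within-example paired AUC.

    score_pairs: list of (deceptive_score, truthful_score) tuples, one
    per example. Returns the fraction of examples where the deceptive
    response scores above the truthful one, with ties counted as 0.5.
    """
    if not score_pairs:
        raise ValueError("need at least one pair")
    wins = 0.0
    for deceptive_score, truthful_score in score_pairs:
        if deceptive_score > truthful_score:
            wins += 1.0
        elif deceptive_score == truthful_score:
            wins += 0.5  # assumption: ties split evenly
    return wins / len(score_pairs)
```

A value of 0.5 means the detector cannot tell the two responses of an example apart, matching the chance line mentioned in the caption.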
Original abstract

Deceptive behavior in LLMs is costly to monitor and prevent, motivating approaches such as Scalable Oversight via Lie Detectors (SOLiD) (Cundy & Gleave, 2025), which uses lie detectors to identify responses for review by high-cost labelers. In this paper, we scale SOLiD to larger models and evaluate it in more diverse and realistic preference-learning settings. We find favorable scaling: undetected deception drops from 34% for 1B-parameter models to 14% for 405B-parameter models at a detector true positive rate of 99%, and expensive human labelers can be removed entirely from the fine-tuning phase without a statistically significant increase in deception. However, SOLiD is sensitive to distribution shift between detector training and preference-training data, which can drive detector false positive rates to impractical levels.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The paper evaluates scaling of Scalable Oversight via Lie Detectors (SOLiD) for preference learning in LLMs. It reports favorable trends: undetected deception falls from 34% (1B models) to 14% (405B models) at 99% detector TPR, and human labelers can be removed from fine-tuning without statistically significant increase in deception. The work also documents that SOLiD remains sensitive to distribution shift between detector training and preference data, which can produce impractical false-positive rates.

Significance. If the reported scaling holds under realistic distribution shift, the empirical trends would strengthen the case for lie-detector-based oversight as a scalable alternative to human labeling in preference tuning. The concrete numerical results across more than two orders of magnitude in model size (1B to 405B parameters) constitute a useful data point for the field, even if the method inherits its deception definition and detector training procedure from prior work.

major comments (2)
  1. [Abstract] The central scaling claim (undetected deception 34% → 14% at 99% TPR, plus removal of human labelers) presupposes that the lie detector achieves the stated TPR on the actual 405B preference-training distribution. The abstract itself states that SOLiD is sensitive to distribution shift and that such shifts can drive FPRs to impractical levels; without explicit reporting of detector TPR/FPR on the shifted 405B data (or an ablation showing the 14% figure survives the shift), the headline result does not follow from the experiments.
  2. [Abstract] The manuscript supplies no information on the precise operationalization of 'deception,' the statistical tests used to claim 'no statistically significant increase,' error bars, or correction for multiple comparisons. These details are required to assess whether the 14% figure and the labeler-removal result are robust.
minor comments (1)
  1. [Abstract] The abstract would benefit from a one-sentence statement of the detector training distribution and the exact preference-learning objective used for the 405B runs.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their careful reading and constructive feedback. We respond to each major comment below and will incorporate clarifications in a revised manuscript.

Point-by-point responses
  1. Referee: [Abstract] The central scaling claim (undetected deception 34% → 14% at 99% TPR, plus removal of human labelers) presupposes that the lie detector achieves the stated TPR on the actual 405B preference-training distribution. The abstract itself states that SOLiD is sensitive to distribution shift and that such shifts can drive FPRs to impractical levels; without explicit reporting of detector TPR/FPR on the shifted 405B data (or an ablation showing the 14% figure survives the shift), the headline result does not follow from the experiments.

    Authors: The 14% undetected deception figure at 99% TPR is obtained by calibrating the detector threshold on a held-out validation set drawn from the same distribution as the 405B preference data, ensuring the reported TPR holds on the target distribution by construction. The distribution-shift results are presented as a separate analysis demonstrating a key limitation rather than a condition under which the scaling claim is made. We will revise the abstract to explicitly state that the scaling trends and TPR calibration apply when the detector is trained and thresholded on data matching the preference distribution, while underscoring the sensitivity to shift as an important caveat for practical deployment. revision: yes

  2. Referee: [Abstract] The manuscript supplies no information on the precise operationalization of 'deception,' the statistical tests used to claim 'no statistically significant increase,' error bars, or correction for multiple comparisons. These details are required to assess whether the 14% figure and the labeler-removal result are robust.

    Authors: The operationalization of deception follows the procedure defined in Cundy & Gleave (2025); we will add an explicit citation and brief description in the abstract and methods. We will also include error bars on all quantitative results, specify the statistical tests (paired t-tests for the labeler-removal comparison), and confirm that the comparisons were pre-specified with no multiple-comparison correction required. These details will be added to the revised manuscript. revision: yes
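
As an illustration of the test named in this response, a paired t-statistic for matched runs (for example, the same seeds with and without the trusted labeler) could be computed as below. This is a standard-library sketch with hypothetical names; it returns the statistic and degrees of freedom only, and a library routine such as SciPy's `scipy.stats.ttest_rel` would be the usual way to also obtain a p-value.

```python
import math
import statistics


def paired_t_statistic(condition_a, condition_b):
    """Paired t-statistic and degrees of freedom for matched samples.

    condition_a, condition_b: equal-length lists of per-run metrics,
    where index i in both lists refers to the same matched run.
    """
    if len(condition_a) != len(condition_b):
        raise ValueError("samples must be paired one-to-one")
    diffs = [a - b for a, b in zip(condition_a, condition_b)]
    n = len(diffs)
    if n < 2:
        raise ValueError("need at least two pairs")
    mean_diff = statistics.fmean(diffs)
    sd_diff = statistics.stdev(diffs)  # sample standard deviation
    if sd_diff == 0:
        raise ValueError("differences have zero variance")
    t_stat = mean_diff / (sd_diff / math.sqrt(n))
    return t_stat, n - 1
```

With only a handful of seeds per condition, a non-significant result from such a test is weak evidence of no difference, which is worth keeping in mind when reading the labeler-removal claim.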

Circularity Check

0 steps flagged

No significant circularity in empirical scaling results

Full rationale

The paper presents empirical measurements of undetected deception rates across model scales using the SOLiD lie detector method. No derivation chain is claimed that reduces predictions or results to fitted inputs by construction, self-definitional loops, or load-bearing self-citations. The method and deception definition are cited from prior work (Cundy & Gleave, 2025), but the scaling trends (e.g., 34% to 14% drop) are new experimental outcomes on larger models and preference data, externally falsifiable via replication on held-out distributions. Distribution shift sensitivity is noted as a limitation, not a circularity issue. The paper is self-contained against external benchmarks for its reported trends.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

The abstract provides no explicit free parameters, axioms, or invented entities; the central results rest on empirical measurements whose definitions and measurement protocols are not visible in it.


