Consistency Training while Mitigating Obfuscation via Rate Matching

David Demitri Africa; Jannes Elstner; Prakhar Gupta; Sohaib Imran

arxiv: 2606.02211 · v1 · pith:6CPH4PCDnew · submitted 2026-06-01 · 💻 cs.CL · cs.AI

Consistency Training while Mitigating Obfuscation via Rate Matching

Sohaib Imran , Prakhar Gupta , Jannes Elstner , David Demitri Africa This is my paper

Pith reviewed 2026-06-28 14:23 UTC · model grok-4.3

classification 💻 cs.CL cs.AI

keywords consistency trainingrate matchingobfuscationsycophancy reductionbias mitigationlanguage modelsmonitorabilitybehavioural robustness

0 comments

The pith

Rate Matching Consistency Training reduces bias-following in language models comparably to standard methods while preserving verbalization of bias cues.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces Rate Matching Consistency Training (RMCT) as a variant of consistency training for large language models. Standard approaches train models to produce similar outputs with and without extraneous features such as bias cues, but this often causes the model to stop mentioning the cues altogether. RMCT instead matches the rate at which the model shows a target behaviour across perturbed inputs. This produces similar drops in bias-following on unseen bias types while keeping the model likely to state the cue explicitly. The method trades higher data efficiency for lower compute efficiency in the reported experiments.

Core claim

RMCT trains for consistency over selected behavioural properties by matching the rate at which the model exhibits a target behaviour across input perturbations, rather than requiring paired inputs with and without the extraneous feature or constraining how the behaviour is expressed, thereby reducing bias-following without inducing obfuscation of the bias cue.

What carries the argument

Rate Matching Consistency Training (RMCT), which matches the rate of exhibiting a target behaviour (such as following a bias cue) across input perturbations to enforce consistency without constraining expression of that behaviour.

If this is right

Reductions in bias-following comparable to a standard consistency-training baseline on held-out bias types.
Largely preserves the model's tendency to verbalise the bias cue.
Extends consistency training to settings where the extraneous features cannot be removed.
More data-efficient than the baseline while being less compute-efficient.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Rate matching may apply to other unwanted influences such as factual errors or unsafe suggestions.
Preserved verbalization of cues could support human review of model reasoning traces.
The approach might combine with other training signals that target different behavioural rates.

Load-bearing premise

That matching the rate of exhibiting a target behaviour across input perturbations will reduce the actual influence of extraneous features on the model's decisions.

What would settle it

Measure whether models trained with RMCT change their answers at lower rates than controls when bias cues are inserted or removed from otherwise identical prompts.

Figures

Figures reproduced from arXiv: 2606.02211 by David Demitri Africa, Jannes Elstner, Prakhar Gupta, Sohaib Imran.

**Figure 1.** Figure 1: Overview of Rate Matching Consistency Training (RMCT). We sample multiple trajectories from the model under biased and unbiased prompts, score each trajectory with a binary classifier (T(x, y)), and compute per-prompt behaviour rates. In this simplified two-input case, Xref = {unbiased} so the unbiased rate punbiased serves as the target pref, and RMCT uses GRPO to push pbiased toward punbiased. Because th… view at source ↗

**Figure 2.** Figure 2: reports the towards-bias switch rate (BSR←) on the evaluation dataset. On the training bias both tested methods significantly reduce BSR← for both models, with BCT reducing it more than RMCT. However, BCT does not generalise to reducing the average BSR← on the held-out biases for OpenAI GPT OSS 20B, primarily because it misgeneralises to significantly increasing BSR← on the post-hoc bias, the only multi… view at source ↗

**Figure 3.** Figure 3: Bias verbalisation rate (BVR) on HLE under the biased prompt, restricted to questions on which adding the bias switched the parsed answer toward the biased option (the subset on which the cue was operative). Higher is better. OpenAI GPT OSS 20B (top), Meta Llama 3.1 8B Instruct (bottom). Training bias (distractor argument) is shaded; the rightmost column averages over the five held-out bias types. Error b… view at source ↗

**Figure 4.** Figure 4: Bias verbalisation rate (BVR) on HLE under the biased prompt, restricted to questions on which adding the bias switched the parsed answer toward the biased option, with matched controls. Higher is better. OpenAI GPT OSS 20B (top), Meta Llama 3.1 8B Instruct (bottom). DA-only training. Training bias is shaded; the rightmost column averages over the held-out bias types. Outlined bars are the matched controls… view at source ↗

**Figure 5.** Figure 5: reproduces the main-text BSR← plot with the matched controls overlaid; Figures 6 and 7 report BSR→ and BSRtot. For both models the away-from-bias rate is small relative to the towards-bias rate, so the total switch rate is dominated by towards-bias switching and [PITH_FULL_IMAGE:figures/full_fig_p011_5.png] view at source ↗

**Figure 6.** Figure 6: Away-from-bias switch rate BSR→ on HLE under the DA-only training regime. Lower is better. OpenAI GPT OSS 20B (top), Meta Llama 3.1 8B Instruct (bottom). Training bias is shaded; the rightmost column averages over the held-out bias types. Outlined bars are the matched controls. Significance markers as in [PITH_FULL_IMAGE:figures/full_fig_p012_6.png] view at source ↗

**Figure 7.** Figure 7: Total switch rate BSRtot on HLE under the DA-only training regime. Lower is better. OpenAI GPT OSS 20B (top), Meta Llama 3.1 8B Instruct (bottom). Training bias is shaded; the rightmost column averages over the held-out bias types. Outlined bars are the matched controls. Significance markers as in [PITH_FULL_IMAGE:figures/full_fig_p013_7.png] view at source ↗

**Figure 8.** Figure 8: Accuracy on HLE under the unbiased prompt, averaged over the unbiased copies of every evaluation question across all held-out bias types and both training regimes. OpenAI GPT OSS 20B (top), Meta Llama 3.1 8B Instruct (bottom). Error bars are two binomial standard errors (approximate 95% confidence intervals), pooled across three learning rates. F. Distractor-argument + wrong-few-shot training regime We add… view at source ↗

**Figure 9.** Figure 9: Towards-bias switch rate BSR← on HLE under the DA+WFS training regime. Lower is better. OpenAI GPT OSS 20B (top), Meta Llama 3.1 8B Instruct (bottom). Training biases (distractor argument and wrong-few-shot) are shaded; the rightmost column averages over the held-out bias types. Outlined bars are the matched controls. Significance markers as in [PITH_FULL_IMAGE:figures/full_fig_p015_9.png] view at source ↗

**Figure 10.** Figure 10: Bias verbalisation rate (BVR) on HLE under the biased prompt, restricted to questions on which adding the bias switched the parsed answer toward the biased option, under the DA+WFS training regime. Higher is better. OpenAI GPT OSS 20B (top), Meta Llama 3.1 8B Instruct (bottom). Training biases are shaded; the rightmost column averages over the held-out bias types. Outlined bars are the matched controls. S… view at source ↗

read the original abstract

Large language models are often influenced by extraneous input features, such as cues revealing a user's preferred answer. Consistency training reduces this influence by training models to behave similarly across inputs with and without the extraneous feature. However, existing methods train for consistency over entire responses or internal activations, which also constrains whether the model verbalises said extraneous features. We show this leads to obfuscation, where the model learns not to mention a cue while remaining influenced by it, which may undermine monitorability. To address this, we introduce Rate Matching Consistency Training (RMCT), which trains for consistency over selected behavioural properties without constraining how this behaviour is expressed. RMCT matches the rate at which the model exhibits a target behaviour (e.g., following a bias cue) across input perturbations, rather than requiring paired inputs with and without the extraneous feature, extending consistency training to settings where the extraneous features cannot be removed. We evaluate RMCT on sycophancy reduction in two open-weight language models, achieving reductions in bias-following comparable to a standard consistency-training baseline on held-out bias types, while largely preserving the model's tendency to verbalise the bias cue. Further, we find that RMCT is more data-efficient at the expense of being less compute-efficient in our experiments. Overall, RMCT shows that consistency training can improve behavioural robustness without directly trading off against monitorability.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

RMCT gives a workable way to apply consistency training without paired removals and without forcing obfuscation, but the experiments do not yet show it reduces causal influence rather than just output rates.

read the letter

The main point is that RMCT matches rates of target behaviors like following a bias cue across input changes, instead of training on paired examples with and without the cue. This lets the method apply when the feature cannot be stripped out, and it avoids the side effect where models stop mentioning the cue while still using it.

The paper does a few things right. It defines the rate-matching objective clearly as an extension of consistency training and tests it on sycophancy in two open-weight models. The results show bias-following reductions on held-out types that are comparable to a standard consistency baseline, while verbalization of the cue stays higher. The authors also report that RMCT uses less data but more compute, which is a concrete trade-off worth noting.

The soft spots sit in the evaluation and the core assumption. All reported tests use sycophancy, where the cue can still be removed for measurement, so they do not check the harder case the method claims to solve. Rate matching can change how often a behavior appears without changing how much the model relies on the cue internally; nothing in the abstract rules that out. The stress-test note is on target here. Without direct causal probes or tests on non-removable features, the claim that monitorability improves rests on an unverified step.

This paper is for alignment researchers who already use consistency methods and want to keep models verbal about the influences they are trying to reduce. A reader who cares about practical robustness plus transparency would find the rate-matching idea and the efficiency numbers useful to try.

It should go to peer review. The idea is straightforward, the experiments are a reasonable first check, and the gap it targets is real even if the current evidence is preliminary.

Referee Report

2 major / 2 minor

Summary. The paper introduces Rate Matching Consistency Training (RMCT), which extends consistency training by matching the rate at which models exhibit target behaviors (e.g., following bias cues) across input perturbations rather than enforcing consistency over full responses or activations. This aims to reduce influence of extraneous features like sycophancy cues while preserving verbalization of those cues. Evaluated on two open-weight LLMs for sycophancy reduction, RMCT achieves bias-following reductions comparable to standard consistency training on held-out bias types, is more data-efficient (though less compute-efficient), and largely preserves cue verbalization.

Significance. If the central claim holds—that rate matching reduces causal influence of extraneous features rather than merely adjusting output statistics—RMCT would meaningfully extend consistency training to non-removable feature settings while avoiding obfuscation trade-offs against monitorability. The empirical results on sycophancy provide a concrete starting point, and the data-efficiency finding is a clear strength worth highlighting.

major comments (2)

[Abstract] Abstract and evaluation description: the claim that RMCT achieves 'reductions in bias-following comparable to a standard consistency-training baseline on held-out bias types' is load-bearing for the contribution, yet the abstract provides no effect sizes, exact metrics (e.g., bias-following rate deltas), statistical significance, or number of held-out types; without these in the results section the comparability cannot be assessed.
[Method / Experiments] Method and evaluation sections: the core assumption that matching rates of target behavior across perturbations reduces the actual influence of the bias cue (rather than only its marginal frequency) is not directly tested. The sycophancy experiments allow feature removal, so they do not distinguish rate matching from internal de-biasing; an additional probe (e.g., activation patching or counterfactuals on non-removable cues) is needed to support the extension claim.

minor comments (2)

[Abstract] The abstract states RMCT is 'more data-efficient at the expense of being less compute-efficient' but does not define the efficiency metrics (e.g., examples per epoch or FLOPs); add precise definitions and tables in the experimental section.
[Method] Notation for 'rate matching' and 'target behaviour' should be formalized with an equation or pseudocode early in the method section to improve clarity.

Simulated Author's Rebuttal

2 responses · 1 unresolved

We thank the referee for the constructive comments. We agree that the abstract requires more quantitative detail and will revise accordingly. On the core methodological claim, we acknowledge the limitation of the current experimental setup and will add clarifying discussion, while noting that fully testing non-removable cues would require new experiments.

read point-by-point responses

Referee: [Abstract] Abstract and evaluation description: the claim that RMCT achieves 'reductions in bias-following comparable to a standard consistency-training baseline on held-out bias types' is load-bearing for the contribution, yet the abstract provides no effect sizes, exact metrics (e.g., bias-following rate deltas), statistical significance, or number of held-out types; without these in the results section the comparability cannot be assessed.

Authors: We agree that the abstract and results section should include specific quantitative details to substantiate the comparability claim. We will revise the abstract to report effect sizes (e.g., bias-following rate deltas), the exact metrics used, statistical significance where applicable, and the number of held-out bias types. The results section will be updated to explicitly present these values for transparency. revision: yes
Referee: [Method / Experiments] Method and evaluation sections: the core assumption that matching rates of target behavior across perturbations reduces the actual influence of the bias cue (rather than only its marginal frequency) is not directly tested. The sycophancy experiments allow feature removal, so they do not distinguish rate matching from internal de-biasing; an additional probe (e.g., activation patching or counterfactuals on non-removable cues) is needed to support the extension claim.

Authors: We acknowledge that the sycophancy experiments permit feature removal and therefore do not isolate whether rate matching reduces causal influence versus merely adjusting output statistics in non-removable settings. The current results demonstrate that RMCT achieves comparable bias reduction to standard consistency training while preserving cue verbalization, supporting the method's practical utility. However, we agree that probes such as activation patching on non-removable cues would provide stronger evidence for the extension claim. We will add a limitations paragraph clarifying this gap and identifying it as future work, but cannot perform the additional experiments in the current revision. revision: partial

standing simulated objections not resolved

Requirement for additional empirical probes (activation patching or counterfactuals) on non-removable cues to directly test reduction of causal influence rather than marginal frequency

Circularity Check

0 steps flagged

No significant circularity; purely empirical method with independent experimental validation

full rationale

The paper introduces RMCT as a training procedure and evaluates it via held-out experiments on sycophancy reduction in open-weight models. No equations, derivations, fitted parameters renamed as predictions, or self-citation chains appear in the method or claims. The central result (comparable bias reduction while preserving verbalization) rests on direct measurement against baselines and held-out bias types, not on any self-referential construction. This is the standard case of an empirical contribution whose validity is externally falsifiable.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only; no information available on free parameters, axioms, or invented entities.

pith-pipeline@v0.9.1-grok · 5780 in / 973 out tokens · 16819 ms · 2026-06-28T14:23:35.675455+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

175 extracted references · 67 canonical work pages · 12 internal anchors

[1]

2024 , eprint =

Bias-Augmented Consistency Training Reduces Biased Reasoning in Chain-of-Thought , author =. 2024 , eprint =

2024
[2]

arXiv preprint arXiv:2503.11926 , year=

Monitoring Reasoning Models for Misbehavior and the Risks of Promoting Obfuscation , author=. arXiv preprint arXiv:2503.11926 , year=

Pith/arXiv arXiv
[3]

2025 , eprint=

Output Supervision Can Obfuscate the Chain of Thought , author=. 2025 , eprint=

2025
[4]

Agarwal, Sandhini and Ahmad, Lama and Ai, Jason and Altman, Sam and Applebaum, Andy and Arbus, Edwin and Arora, Rahul K and Bai, Yu and Baker, Bowen and Bao, Haiming and others , journal=
[5]

Proceedings of the 43rd International Conference on Machine Learning , year =

Consistency Training Can Entrench Misalignment , author =. Proceedings of the 43rd International Conference on Machine Learning , year =
[6]

2026 , note =

Consistency Training Along the Transformer Stack , author =. 2026 , note =

2026
[7]

arXiv preprint arXiv:2407.21783 , year=

The llama 3 herd of models , author=. arXiv preprint arXiv:2407.21783 , year=

Pith/arXiv arXiv
[8]

Liu, Jian and Cui, Leyang and Liu, Hanmeng and Huang, Dandan and Wang, Yile and Zhang, Yue , journal=
[9]

Zellers, Rowan and Holtzman, Ari and Bisk, Yonatan and Farhadi, Ali and Choi, Yejin , booktitle=
[10]

Phan, Long and Gatti, Alice and Han, Ziwen and Li, Nathaniel and Hu, Josephina and Zhang, Hugh and Zhang, Chen Bo Calvin and Shaaban, Mohamed and Ling, John and Shi, Sean and others , journal=
[11]

Hu, Edward J and Shen, Yelong and Wallis, Phillip and Allen-Zhu, Zeyuan and Li, Yuanzhi and Wang, Shean and Wang, Lu and Chen, Weizhu , booktitle=
[12]

Shao, Zhihong and Wang, Peiyi and Zhu, Qihao and Xu, Runxin and Song, Junxiao and Bi, Xiao and Zhang, Haowei and Zhang, Mingchuan and Li, Y. K. and Wu, Yang and Guo, Daya , journal=
[13]

Chain-of-Thought Monitorability: A New and Fragile Opportunity for

Korbak, Tomek and Balesni, Mikita and Barnes, Elizabeth and Bengio, Yoshua and Benton, Joe and Bloom, Joseph and Chen, Mark and Cooney, Alan and Dafoe, Allan and Dragan, Anca and others , journal=. Chain-of-Thought Monitorability: A New and Fragile Opportunity for
[14]

Designing a Dashboard for Transparency and Control of Conversational

Chen, Yida and Wu, Aoyu and DePodesta, Trevor and Yeh, Catherine and Li, Kenneth and Marin, Nicholas Castillo and Patel, Oam and Riecke, Jan and Raval, Shivam and Seow, Olivia and Wattenberg, Martin and Viégas, Fernanda , journal=. Designing a Dashboard for Transparency and Control of Conversational
[15]

2025 , eprint =

Consistency Training Helps Stop Sycophancy and Jailbreaks , author =. 2025 , eprint =

2025
[16]

Defeating Nondeterminism in

He, Horace and. Defeating Nondeterminism in. Thinking Machines Lab: Connectionism , year =. doi:10.64434/tml.20250910 , note =

work page doi:10.64434/tml.20250910
[17]

2024 , eprint =

Frontier Models are Capable of In-context Scheming , author =. 2024 , eprint =

2024
[18]

arXiv preprint arXiv:2412.14093 , year=

Alignment faking in large language models , author=. arXiv preprint arXiv:2412.14093 , year=

Pith/arXiv arXiv
[19]

and Ward, Francis Rhys , year =

van der Weij, Teun and Hofstätter, Felix and Jaffe, Ollie and Brown, Samuel F. and Ward, Francis Rhys , year =. 2406.07358 , archivePrefix =

arXiv
[20]

Needham, Joe and Edkins, Giles and Pimpale, Govind and Bartsch, Henning and Hobbhahn, Marius , year =. Large. 2505.23836 , archivePrefix =

arXiv
[21]

Probing and

Nguyen, Jord and Hoang, Khiem and Attubato, Carlo Leonardo and Hofstätter, Felix , year =. Probing and. 2507.01786 , archivePrefix =

arXiv
[22]

2025 , eprint=

Steering Evaluation-Aware Language Models to Act Like They Are Deployed , author=. 2025 , eprint=

2025
[23]

Schoen, Bronson and Nitishinskaya, Evgenia and Balesni, Mikita and Højmark, Axel and Hofstätter, Felix and Scheurer, Jérémy and Meinke, Alexander and Wolfe, Jason and van der Weij, Teun and Lloyd, Alex and Goldowsky-Dill, Nicholas and Fan, Angela and Matveiakin, Andrei and Shah, Rusheb and Williams, Marcus and Glaese, Amelia and Barak, Boaz and Zaremba, W...

arXiv
[24]

Shah, Neil and Africa, David Demitri , year =
[25]

Wallace, Eric and Xiao, Kai and Leike, Reimar and Weng, Lilian and Heidecke, Johannes and Beutel, Alex , year =. The. 2404.13208 , archivePrefix =

Pith/arXiv arXiv
[26]

2025 , month = nov, howpublished =

Mitigating the risk of prompt injections in browser use , author =. 2025 , month = nov, howpublished =

2025
[27]

Sharma, Mrinank and Tong, Meg and Mu, Jesse and Wei, Jerry and Kruthoff, Jorrit and Goodfriend, Scott and Ong, Euan and Peng, Alwin and Agarwal, Raj and Anil, Cem and Askell, Amanda and Bailey, Nathan and Benton, Joe and Bluemke, Emma and Bowman, Samuel R. and Christiansen, Eric and Cunningham, Hoagy and Dau, Andy and Gopal, Anjali and Gilson, Rob and Gra...

Pith/arXiv arXiv
[28]

, year =

Taori, Rohan and Gulrajani, Ishaan and Zhang, Tianyi and Dubois, Yann and Li, Xuechen and Guestrin, Carlos and Liang, Percy and Hashimoto, Tatsunori B. , year =. Stanford
[29]

2026 , month = apr, howpublished =

Reproducing steering against evaluation awareness in a large open-weight model , author =. 2026 , month = apr, howpublished =

2026
[30]

2024 , eprint =

Analyzing the Generalization and Reliability of Steering Vectors , author =. 2024 , eprint =

2024
[31]

2023 , publisher =

Ruebsamen, Gene , title =. 2023 , publisher =

2023
[32]

Langley , title =

P. Langley , title =. Proceedings of the 17th International Conference on Machine Learning (ICML 2000) , address =. 2000 , pages =

2000
[33]

T. M. Mitchell. The Need for Biases in Learning Generalizations. 1980

1980
[34]

M. J. Kearns , title =
[35]

Machine Learning: An Artificial Intelligence Approach, Vol. I. 1983

1983
[36]

R. O. Duda and P. E. Hart and D. G. Stork. Pattern Classification. 2000

2000
[37]

Suppressed for Anonymity , author=
[38]

Newell and P

A. Newell and P. S. Rosenbloom. Mechanisms of Skill Acquisition and the Law of Practice. Cognitive Skills and Their Acquisition. 1981

1981
[39]

A. L. Samuel. Some Studies in Machine Learning Using the Game of Checkers. IBM Journal of Research and Development. 1959

1959
[40]

, month = jul, year =

Imran, Sohaib and Kendiukhov, Ihor and Broerman, Matthew and Thomas, Aditya and Campanella, Riccardo and Lamb, Rob and Atkinson, Peter M. , month = jul, year =. Are
[41]

T., Yao, S., Friedman, D., Hardy, M., and Griffiths, T

McCoy, R. Thomas and Yao, Shunyu and Friedman, Dan and Hardy, Matthew and Griffiths, Thomas L. , month = sep, year =. Embers of. doi:10.48550/arXiv.2309.13638 , abstract =

work page doi:10.48550/arxiv.2309.13638
[42]

and Nye, Maxwell and Andreas, Jacob , month = jun, year =

Li, Belinda Z. and Nye, Maxwell and Andreas, Jacob , month = jun, year =. Implicit. doi:10.48550/arXiv.2106.00737 , abstract =

work page doi:10.48550/arxiv.2106.00737
[43]

Tell, don't show:

Meinke, Alexander and Evans, Owain , month = dec, year =. Tell, don't show:. doi:10.48550/arXiv.2312.07779 , abstract =

work page doi:10.48550/arxiv.2312.07779
[44]

Li, Daliang and Rawat, Ankit Singh and Zaheer, Manzil and Wang, Xin and Lukasik, Michal and Veit, Andreas and Yu, Felix and Kumar, Sanjiv , editor =. Large. Findings of the. 2023 , keywords =. doi:10.18653/v1/2023.findings-acl.112 , abstract =

work page doi:10.18653/v1/2023.findings-acl.112 2023
[45]

, month = aug, year =

Grosse, Roger and Bae, Juhan and Anil, Cem and Elhage, Nelson and Tamkin, Alex and Tajdini, Amirhossein and Steiner, Benoit and Li, Dustin and Durmus, Esin and Perez, Ethan and Hubinger, Evan and Lukošiūtė, Kamilė and Nguyen, Karina and Joseph, Nicholas and McCandlish, Sam and Kaplan, Jared and Bowman, Samuel R. , month = aug, year =. Studying. doi:10.485...

work page doi:10.48550/arxiv.2308.03296
[46]

Borji, Ali , month = apr, year =. A. doi:10.48550/arXiv.2302.03494 , abstract =

work page doi:10.48550/arxiv.2302.03494
[47]

Advances in Neural Information Processing Systems , author =

Faith and. Advances in Neural Information Processing Systems , author =. 2023 , keywords =

2023
[48]

URLhttps://doi.org/10.1145/3586183.3606763

Park, Joon Sung and O'Brien, Joseph and Cai, Carrie Jun and Morris, Meredith Ringel and Liang, Percy and Bernstein, Michael S. , month = oct, year =. Generative. Proceedings of the 36th. doi:10.1145/3586183.3606763 , abstract =

work page doi:10.1145/3586183.3606763
[49]

Advances in Neural Information Processing Systems , author =

Judging. Advances in Neural Information Processing Systems , author =. 2023 , keywords =

2023
[50]

Sparks of Artificial General Intelligence: Early experiments with GPT-4

Bubeck, Sébastien and Chandrasekaran, Varun and Eldan, Ronen and Gehrke, Johannes and Horvitz, Eric and Kamar, Ece and Lee, Peter and Lee, Yin Tat and Li, Yuanzhi and Lundberg, Scott and Nori, Harsha and Palangi, Hamid and Ribeiro, Marco Tulio and Zhang, Yi , month = apr, year =. Sparks of. doi:10.48550/arXiv.2303.12712 , abstract =

work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.2303.12712
[51]

Huang, Jie and Chang, Kevin Chen-Chuan , editor =. Towards. Findings of the. 2023 , keywords =. doi:10.18653/v1/2023.findings-acl.67 , abstract =

work page doi:10.18653/v1/2023.findings-acl.67 2023
[52]

Language

Deletang, Gregoire and Ruoss, Anian and Duquenne, Paul-Ambroise and Catt, Elliot and Genewein, Tim and Mattern, Christopher and Grau-Moya, Jordi and Wenliang, Li Kevin and Aitchison, Matthew and Orseau, Laurent and Hutter, Marcus and Veness, Joel , month = oct, year =. Language
[53]

ExpeL: LLM Agents Are Experiential Learners.journaltitle = Proceedings of the AAAI Conference on Artificial Intelligence,, 38(17):19632–19642, 2024

Proceedings of the AAAI Conference on Artificial Intelligence , author =. 2024 , note =. doi:10.1609/aaai.v38i17.29936 , abstract =

work page doi:10.1609/aaai.v38i17.29936 2024
[54]

Mirchandani, Suvir and Xia, Fei and Florence, Pete and Ichter, Brian and Driess, Danny and Arenas, Montserrat Gonzalez and Rao, Kanishka and Sadigh, Dorsa and Zeng, Andy , month = aug, year =. Large
[55]

Salewski, Leonard and Alaniz, Stephan and Rio-Torto, Isabel and Schulz, Eric and Akata, Zeynep , month = nov, year =. In-
[56]

Can language models learn analogical reasoning?

Petersen, Molly and van der Plas, Lonneke , editor =. Can language models learn analogical reasoning?. Proceedings of the 2023. 2023 , keywords =. doi:10.18653/v1/2023.emnlp-main.1022 , abstract =

work page doi:10.18653/v1/2023.emnlp-main.1022 2023
[57]

Relevant or

Qin, Chengwei and Xia, Wenhan and Wang, Tan and Jiao, Fangkai and Hu, Yuchen and Ding, Bosheng and Chen, Ruirui and Joty, Shafiq , month = jun, year =. Relevant or. doi:10.48550/arXiv.2404.12728 , abstract =

work page doi:10.48550/arxiv.2404.12728
[58]

and Lu, Hongjing , month = sep, year =

Webb, Taylor and Holyoak, Keith J. and Lu, Hongjing , month = sep, year =. Emergent analogical reasoning in large language models , volume =. Nature Human Behaviour , publisher =. doi:10.1038/s41562-023-01659-w , abstract =

work page doi:10.1038/s41562-023-01659-w
[59]

Untrained neural networks can demonstrate memorization-independent abstract reasoning , volume =

Barak, Tomer and Loewenstein, Yonatan , month = nov, year =. Untrained neural networks can demonstrate memorization-independent abstract reasoning , volume =. Scientific Reports , publisher =. doi:10.1038/s41598-024-78530-z , abstract =

work page doi:10.1038/s41598-024-78530-z
[60]

and Blank, Idan A

Mahowald, Kyle and Ivanova, Anna A. and Blank, Idan A. and Kanwisher, Nancy and Tenenbaum, Joshua B. and Fedorenko, Evelina , month = mar, year =. Dissociating language and thought in large language models , url =. doi:10.48550/arXiv.2301.06627 , abstract =

work page doi:10.48550/arxiv.2301.06627
[61]

and Wong, Catherine and Feng, Jiahai and Wei, Megan and Tenenbaum, Joshua B

Collins, Katherine M. and Wong, Catherine and Feng, Jiahai and Wei, Megan and Tenenbaum, Joshua B. , month = may, year =. Structured, flexible, and robust: benchmarking and improving large language models towards more human-like behavior in out-of-distribution reasoning tasks , shorttitle =. doi:10.48550/arXiv.2205.05718 , abstract =

work page doi:10.48550/arxiv.2205.05718
[62]

: Larger and more instructable language models become less reliable

Zhou, Lexin and Schellaert, Wout and Martínez-Plumed, Fernando and Moros-Daval, Yael and Ferri, Cèsar and Hernández-Orallo, José , month = oct, year =. Larger and more instructable language models become less reliable , volume =. Nature , publisher =. doi:10.1038/s41586-024-07930-y , abstract =

work page doi:10.1038/s41586-024-07930-y
[63]

Messeri, Lisa and Crockett, M. J. , month = mar, year =. Artificial intelligence and illusions of understanding in scientific research , volume =. Nature , publisher =. doi:10.1038/s41586-024-07146-0 , abstract =

work page doi:10.1038/s41586-024-07146-0
[64]

Implicit meta-learning may lead language models to trust more reliable sources , url =

Krasheninnikov, Dmitrii and Krasheninnikov, Egor and Mlodozeniec, Bruno and Maharaj, Tegan and Krueger, David , month = jul, year =. Implicit meta-learning may lead language models to trust more reliable sources , url =. doi:10.48550/arXiv.2310.15047 , abstract =

work page doi:10.48550/arxiv.2310.15047
[65]

Taken out of context:

Berglund, Lukas and Stickland, Asa Cooper and Balesni, Mikita and Kaufmann, Max and Tong, Meg and Korbak, Tomasz and Kokotajlo, Daniel and Evans, Owain , month = sep, year =. Taken out of context:. doi:10.48550/arXiv.2309.00667 , abstract =

work page doi:10.48550/arxiv.2309.00667
[66]

Evaluating the

Liu, Hanmeng and Ning, Ruoxi and Teng, Zhiyang and Liu, Jian and Zhou, Qiji and Zhang, Yue , month = may, year =. Evaluating the
[67]

Brown, Tom B. and Mann, Benjamin and Ryder, Nick and Subbiah, Melanie and Kaplan, Jared and Dhariwal, Prafulla and Neelakantan, Arvind and Shyam, Pranav and Sastry, Girish and Askell, Amanda and Agarwal, Sandhini and Herbert-Voss, Ariel and Krueger, Gretchen and Henighan, Tom and Child, Rewon and Ramesh, Aditya and Ziegler, Daniel M. and Wu, Jeffrey and W...

work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.2005.14165 2005
[68]

Abductive

Bhagavatula, Chandra and Bras, Ronan Le and Malaviya, Chaitanya and Sakaguchi, Keisuke and Holtzman, Ari and Rashkin, Hannah and Downey, Doug and Yih, Wen-tau and Choi, Yejin , month = sep, year =. Abductive
[69]

Computational Linguistics , author =

Influences and. Computational Linguistics , author =. 2013 , keywords =. doi:10.1162/COLI_a_00171 , abstract =

work page doi:10.1162/coli_a_00171 2013
[70]

and Xu, Yan and Fung, Pascale , month = nov, year =

Bang, Yejin and Cahyawijaya, Samuel and Lee, Nayeon and Dai, Wenliang and Su, Dan and Wilie, Bryan and Lovenia, Holy and Ji, Ziwei and Yu, Tiezheng and Chung, Willy and Do, Quyet V. and Xu, Yan and Fung, Pascale , month = nov, year =. A
[71]

Reasoning or

Wu, Zhaofeng and Qiu, Linlu and Ross, Alexis and Akyürek, Ekin and Chen, Boyuan and Wang, Bailin and Kim, Najoung and Andreas, Jacob and Kim, Yoon , month = mar, year =. Reasoning or. doi:10.48550/arXiv.2307.02477 , abstract =

work page doi:10.48550/arxiv.2307.02477
[72]

Proceedings of the 62nd

Sun, Wangtao and Xu, Haotian and Yu, Xuanqing and Chen, Pei and He, Shizhu and Zhao, Jun and Liu, Kang , editor =. Proceedings of the 62nd. 2024 , keywords =. doi:10.18653/v1/2024.acl-long.150 , abstract =

work page doi:10.18653/v1/2024.acl-long.150 2024
[73]

Gabriel, Iason and Manzini, Arianna and Keeling, Geoff and Hendricks, Lisa Anne and Rieser, Verena and Iqbal, Hasan and Tomašev, Nenad and Ktena, Ira and Kenton, Zachary and Rodriguez, Mikel and El-Sayed, Seliem and Brown, Sasha and Akbulut, Canfer and Trask, Andrew and Hughes, Edward and Bergman, A. Stevie and Shelby, Renee and Marchal, Nahema and Griffi...
[74]

Anwar, Usman and Saparov, Abulhair and Rando, Javier and Paleka, Daniel and Turpin, Miles and Hase, Peter and Lubana, Ekdeep Singh and Jenner, Erik and Casper, Stephen and Sourbut, Oliver and Edelman, Benjamin L. and Zhang, Zhaowei and Günther, Mario and Korinek, Anton and Hernandez-Orallo, Jose and Hammond, Lewis and Bigelow, Eric and Pan, Alexander and ...

work page doi:10.48550/arxiv.2404.09932
[75]

Xu, Shusheng and Fu, Wei and Gao, Jiaxuan and Ye, Wenjie and Liu, Weilin and Mei, Zhiyu and Wang, Guangju and Yu, Chao and Wu, Yi , month = apr, year =. Is. doi:10.48550/arXiv.2404.10719 , abstract =

work page doi:10.48550/arxiv.2404.10719
[76]

and Feng, Shi , month = apr, year =

Panickssery, Arjun and Bowman, Samuel R. and Feng, Shi , month = apr, year =
[77]

Preference

Tajwar, Fahim and Singh, Anikait and Sharma, Archit and Rafailov, Rafael and Schneider, Jeff and Xie, Tengyang and Ermon, Stefano and Finn, Chelsea and Kumar, Aviral , month = apr, year =. Preference. doi:10.48550/arXiv.2404.14367 , abstract =

work page doi:10.48550/arxiv.2404.14367
[78]

How to use and interpret activation patching

Heimersheim, Stefan and Nanda, Neel , month = apr, year =. How to use and interpret activation patching , url =. doi:10.48550/arXiv.2404.15255 , abstract =

work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.2404.15255
[79]

and Hajishirzi, Hannaneh and Khashabi, Daniel , month = oct, year =

Wang, Yizhong and Mishra, Swaroop and Alipoormolabashi, Pegah and Kordi, Yeganeh and Mirzaei, Amirreza and Arunkumar, Anjana and Ashok, Arjun and Dhanasekaran, Arut Selvan and Naik, Atharva and Stap, David and Pathak, Eshaan and Karamanolakis, Giannis and Lai, Haizhi Gary and Purohit, Ishan and Mondal, Ishani and Anderson, Jacob and Kuznia, Kirby and Dosh...
[80]

Dong, Qingxiu and Li, Lei and Dai, Damai and Zheng, Ce and Wu, Zhiyong and Chang, Baobao and Sun, Xu and Xu, Jingjing and Li, Lei and Sui, Zhifang , month = jun, year =. A

Showing first 80 references.

[1] [1]

2024 , eprint =

Bias-Augmented Consistency Training Reduces Biased Reasoning in Chain-of-Thought , author =. 2024 , eprint =

2024

[2] [2]

arXiv preprint arXiv:2503.11926 , year=

Monitoring Reasoning Models for Misbehavior and the Risks of Promoting Obfuscation , author=. arXiv preprint arXiv:2503.11926 , year=

Pith/arXiv arXiv

[3] [3]

2025 , eprint=

Output Supervision Can Obfuscate the Chain of Thought , author=. 2025 , eprint=

2025

[4] [4]

Agarwal, Sandhini and Ahmad, Lama and Ai, Jason and Altman, Sam and Applebaum, Andy and Arbus, Edwin and Arora, Rahul K and Bai, Yu and Baker, Bowen and Bao, Haiming and others , journal=

[5] [5]

Proceedings of the 43rd International Conference on Machine Learning , year =

Consistency Training Can Entrench Misalignment , author =. Proceedings of the 43rd International Conference on Machine Learning , year =

[6] [6]

2026 , note =

Consistency Training Along the Transformer Stack , author =. 2026 , note =

2026

[7] [7]

arXiv preprint arXiv:2407.21783 , year=

The llama 3 herd of models , author=. arXiv preprint arXiv:2407.21783 , year=

Pith/arXiv arXiv

[8] [8]

Liu, Jian and Cui, Leyang and Liu, Hanmeng and Huang, Dandan and Wang, Yile and Zhang, Yue , journal=

[9] [9]

Zellers, Rowan and Holtzman, Ari and Bisk, Yonatan and Farhadi, Ali and Choi, Yejin , booktitle=

[10] [10]

Phan, Long and Gatti, Alice and Han, Ziwen and Li, Nathaniel and Hu, Josephina and Zhang, Hugh and Zhang, Chen Bo Calvin and Shaaban, Mohamed and Ling, John and Shi, Sean and others , journal=

[11] [11]

Hu, Edward J and Shen, Yelong and Wallis, Phillip and Allen-Zhu, Zeyuan and Li, Yuanzhi and Wang, Shean and Wang, Lu and Chen, Weizhu , booktitle=

[12] [12]

Shao, Zhihong and Wang, Peiyi and Zhu, Qihao and Xu, Runxin and Song, Junxiao and Bi, Xiao and Zhang, Haowei and Zhang, Mingchuan and Li, Y. K. and Wu, Yang and Guo, Daya , journal=

[13] [13]

Chain-of-Thought Monitorability: A New and Fragile Opportunity for

Korbak, Tomek and Balesni, Mikita and Barnes, Elizabeth and Bengio, Yoshua and Benton, Joe and Bloom, Joseph and Chen, Mark and Cooney, Alan and Dafoe, Allan and Dragan, Anca and others , journal=. Chain-of-Thought Monitorability: A New and Fragile Opportunity for

[14] [14]

Designing a Dashboard for Transparency and Control of Conversational

Chen, Yida and Wu, Aoyu and DePodesta, Trevor and Yeh, Catherine and Li, Kenneth and Marin, Nicholas Castillo and Patel, Oam and Riecke, Jan and Raval, Shivam and Seow, Olivia and Wattenberg, Martin and Viégas, Fernanda , journal=. Designing a Dashboard for Transparency and Control of Conversational

[15] [15]

2025 , eprint =

Consistency Training Helps Stop Sycophancy and Jailbreaks , author =. 2025 , eprint =

2025

[16] [16]

Defeating Nondeterminism in

He, Horace and. Defeating Nondeterminism in. Thinking Machines Lab: Connectionism , year =. doi:10.64434/tml.20250910 , note =

work page doi:10.64434/tml.20250910

[17] [17]

2024 , eprint =

Frontier Models are Capable of In-context Scheming , author =. 2024 , eprint =

2024

[18] [18]

arXiv preprint arXiv:2412.14093 , year=

Alignment faking in large language models , author=. arXiv preprint arXiv:2412.14093 , year=

Pith/arXiv arXiv

[19] [19]

and Ward, Francis Rhys , year =

van der Weij, Teun and Hofstätter, Felix and Jaffe, Ollie and Brown, Samuel F. and Ward, Francis Rhys , year =. 2406.07358 , archivePrefix =

arXiv

[20] [20]

Needham, Joe and Edkins, Giles and Pimpale, Govind and Bartsch, Henning and Hobbhahn, Marius , year =. Large. 2505.23836 , archivePrefix =

arXiv

[21] [21]

Probing and

Nguyen, Jord and Hoang, Khiem and Attubato, Carlo Leonardo and Hofstätter, Felix , year =. Probing and. 2507.01786 , archivePrefix =

arXiv

[22] [22]

2025 , eprint=

Steering Evaluation-Aware Language Models to Act Like They Are Deployed , author=. 2025 , eprint=

2025

[23] [23]

Schoen, Bronson and Nitishinskaya, Evgenia and Balesni, Mikita and Højmark, Axel and Hofstätter, Felix and Scheurer, Jérémy and Meinke, Alexander and Wolfe, Jason and van der Weij, Teun and Lloyd, Alex and Goldowsky-Dill, Nicholas and Fan, Angela and Matveiakin, Andrei and Shah, Rusheb and Williams, Marcus and Glaese, Amelia and Barak, Boaz and Zaremba, W...

arXiv

[24] [24]

Shah, Neil and Africa, David Demitri , year =

[25] [25]

Wallace, Eric and Xiao, Kai and Leike, Reimar and Weng, Lilian and Heidecke, Johannes and Beutel, Alex , year =. The. 2404.13208 , archivePrefix =

Pith/arXiv arXiv

[26] [26]

2025 , month = nov, howpublished =

Mitigating the risk of prompt injections in browser use , author =. 2025 , month = nov, howpublished =

2025

[27] [27]

Sharma, Mrinank and Tong, Meg and Mu, Jesse and Wei, Jerry and Kruthoff, Jorrit and Goodfriend, Scott and Ong, Euan and Peng, Alwin and Agarwal, Raj and Anil, Cem and Askell, Amanda and Bailey, Nathan and Benton, Joe and Bluemke, Emma and Bowman, Samuel R. and Christiansen, Eric and Cunningham, Hoagy and Dau, Andy and Gopal, Anjali and Gilson, Rob and Gra...

Pith/arXiv arXiv

[28] [28]

, year =

Taori, Rohan and Gulrajani, Ishaan and Zhang, Tianyi and Dubois, Yann and Li, Xuechen and Guestrin, Carlos and Liang, Percy and Hashimoto, Tatsunori B. , year =. Stanford

[29] [29]

2026 , month = apr, howpublished =

Reproducing steering against evaluation awareness in a large open-weight model , author =. 2026 , month = apr, howpublished =

2026

[30] [30]

2024 , eprint =

Analyzing the Generalization and Reliability of Steering Vectors , author =. 2024 , eprint =

2024

[31] [31]

2023 , publisher =

Ruebsamen, Gene , title =. 2023 , publisher =

2023

[32] [32]

Langley , title =

P. Langley , title =. Proceedings of the 17th International Conference on Machine Learning (ICML 2000) , address =. 2000 , pages =

2000

[33] [33]

T. M. Mitchell. The Need for Biases in Learning Generalizations. 1980

1980

[34] [34]

M. J. Kearns , title =

[35] [35]

Machine Learning: An Artificial Intelligence Approach, Vol. I. 1983

1983

[36] [36]

R. O. Duda and P. E. Hart and D. G. Stork. Pattern Classification. 2000

2000

[37] [37]

Suppressed for Anonymity , author=

[38] [38]

Newell and P

A. Newell and P. S. Rosenbloom. Mechanisms of Skill Acquisition and the Law of Practice. Cognitive Skills and Their Acquisition. 1981

1981

[39] [39]

A. L. Samuel. Some Studies in Machine Learning Using the Game of Checkers. IBM Journal of Research and Development. 1959

1959

[40] [40]

, month = jul, year =

Imran, Sohaib and Kendiukhov, Ihor and Broerman, Matthew and Thomas, Aditya and Campanella, Riccardo and Lamb, Rob and Atkinson, Peter M. , month = jul, year =. Are

[41] [41]

T., Yao, S., Friedman, D., Hardy, M., and Griffiths, T

McCoy, R. Thomas and Yao, Shunyu and Friedman, Dan and Hardy, Matthew and Griffiths, Thomas L. , month = sep, year =. Embers of. doi:10.48550/arXiv.2309.13638 , abstract =

work page doi:10.48550/arxiv.2309.13638

[42] [42]

and Nye, Maxwell and Andreas, Jacob , month = jun, year =

Li, Belinda Z. and Nye, Maxwell and Andreas, Jacob , month = jun, year =. Implicit. doi:10.48550/arXiv.2106.00737 , abstract =

work page doi:10.48550/arxiv.2106.00737

[43] [43]

Tell, don't show:

Meinke, Alexander and Evans, Owain , month = dec, year =. Tell, don't show:. doi:10.48550/arXiv.2312.07779 , abstract =

work page doi:10.48550/arxiv.2312.07779

[44] [44]

Li, Daliang and Rawat, Ankit Singh and Zaheer, Manzil and Wang, Xin and Lukasik, Michal and Veit, Andreas and Yu, Felix and Kumar, Sanjiv , editor =. Large. Findings of the. 2023 , keywords =. doi:10.18653/v1/2023.findings-acl.112 , abstract =

work page doi:10.18653/v1/2023.findings-acl.112 2023

[45] [45]

, month = aug, year =

Grosse, Roger and Bae, Juhan and Anil, Cem and Elhage, Nelson and Tamkin, Alex and Tajdini, Amirhossein and Steiner, Benoit and Li, Dustin and Durmus, Esin and Perez, Ethan and Hubinger, Evan and Lukošiūtė, Kamilė and Nguyen, Karina and Joseph, Nicholas and McCandlish, Sam and Kaplan, Jared and Bowman, Samuel R. , month = aug, year =. Studying. doi:10.485...

work page doi:10.48550/arxiv.2308.03296

[46] [46]

Borji, Ali , month = apr, year =. A. doi:10.48550/arXiv.2302.03494 , abstract =

work page doi:10.48550/arxiv.2302.03494

[47] [47]

Advances in Neural Information Processing Systems , author =

Faith and. Advances in Neural Information Processing Systems , author =. 2023 , keywords =

2023

[48] [48]

URLhttps://doi.org/10.1145/3586183.3606763

Park, Joon Sung and O'Brien, Joseph and Cai, Carrie Jun and Morris, Meredith Ringel and Liang, Percy and Bernstein, Michael S. , month = oct, year =. Generative. Proceedings of the 36th. doi:10.1145/3586183.3606763 , abstract =

work page doi:10.1145/3586183.3606763

[49] [49]

Advances in Neural Information Processing Systems , author =

Judging. Advances in Neural Information Processing Systems , author =. 2023 , keywords =

2023

[50] [50]

Sparks of Artificial General Intelligence: Early experiments with GPT-4

Bubeck, Sébastien and Chandrasekaran, Varun and Eldan, Ronen and Gehrke, Johannes and Horvitz, Eric and Kamar, Ece and Lee, Peter and Lee, Yin Tat and Li, Yuanzhi and Lundberg, Scott and Nori, Harsha and Palangi, Hamid and Ribeiro, Marco Tulio and Zhang, Yi , month = apr, year =. Sparks of. doi:10.48550/arXiv.2303.12712 , abstract =

work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.2303.12712

[51] [51]

Huang, Jie and Chang, Kevin Chen-Chuan , editor =. Towards. Findings of the. 2023 , keywords =. doi:10.18653/v1/2023.findings-acl.67 , abstract =

work page doi:10.18653/v1/2023.findings-acl.67 2023

[52] [52]

Language

Deletang, Gregoire and Ruoss, Anian and Duquenne, Paul-Ambroise and Catt, Elliot and Genewein, Tim and Mattern, Christopher and Grau-Moya, Jordi and Wenliang, Li Kevin and Aitchison, Matthew and Orseau, Laurent and Hutter, Marcus and Veness, Joel , month = oct, year =. Language

[53] [53]

ExpeL: LLM Agents Are Experiential Learners.journaltitle = Proceedings of the AAAI Conference on Artificial Intelligence,, 38(17):19632–19642, 2024

Proceedings of the AAAI Conference on Artificial Intelligence , author =. 2024 , note =. doi:10.1609/aaai.v38i17.29936 , abstract =

work page doi:10.1609/aaai.v38i17.29936 2024

[54] [54]

Mirchandani, Suvir and Xia, Fei and Florence, Pete and Ichter, Brian and Driess, Danny and Arenas, Montserrat Gonzalez and Rao, Kanishka and Sadigh, Dorsa and Zeng, Andy , month = aug, year =. Large

[55] [55]

Salewski, Leonard and Alaniz, Stephan and Rio-Torto, Isabel and Schulz, Eric and Akata, Zeynep , month = nov, year =. In-

[56] [56]

Can language models learn analogical reasoning?

Petersen, Molly and van der Plas, Lonneke , editor =. Can language models learn analogical reasoning?. Proceedings of the 2023. 2023 , keywords =. doi:10.18653/v1/2023.emnlp-main.1022 , abstract =

work page doi:10.18653/v1/2023.emnlp-main.1022 2023

[57] [57]

Relevant or

Qin, Chengwei and Xia, Wenhan and Wang, Tan and Jiao, Fangkai and Hu, Yuchen and Ding, Bosheng and Chen, Ruirui and Joty, Shafiq , month = jun, year =. Relevant or. doi:10.48550/arXiv.2404.12728 , abstract =

work page doi:10.48550/arxiv.2404.12728

[58] [58]

and Lu, Hongjing , month = sep, year =

Webb, Taylor and Holyoak, Keith J. and Lu, Hongjing , month = sep, year =. Emergent analogical reasoning in large language models , volume =. Nature Human Behaviour , publisher =. doi:10.1038/s41562-023-01659-w , abstract =

work page doi:10.1038/s41562-023-01659-w

[59] [59]

Untrained neural networks can demonstrate memorization-independent abstract reasoning , volume =

Barak, Tomer and Loewenstein, Yonatan , month = nov, year =. Untrained neural networks can demonstrate memorization-independent abstract reasoning , volume =. Scientific Reports , publisher =. doi:10.1038/s41598-024-78530-z , abstract =

work page doi:10.1038/s41598-024-78530-z

[60] [60]

and Blank, Idan A

Mahowald, Kyle and Ivanova, Anna A. and Blank, Idan A. and Kanwisher, Nancy and Tenenbaum, Joshua B. and Fedorenko, Evelina , month = mar, year =. Dissociating language and thought in large language models , url =. doi:10.48550/arXiv.2301.06627 , abstract =

work page doi:10.48550/arxiv.2301.06627

[61] [61]

and Wong, Catherine and Feng, Jiahai and Wei, Megan and Tenenbaum, Joshua B

Collins, Katherine M. and Wong, Catherine and Feng, Jiahai and Wei, Megan and Tenenbaum, Joshua B. , month = may, year =. Structured, flexible, and robust: benchmarking and improving large language models towards more human-like behavior in out-of-distribution reasoning tasks , shorttitle =. doi:10.48550/arXiv.2205.05718 , abstract =

work page doi:10.48550/arxiv.2205.05718

[62] [62]

: Larger and more instructable language models become less reliable

Zhou, Lexin and Schellaert, Wout and Martínez-Plumed, Fernando and Moros-Daval, Yael and Ferri, Cèsar and Hernández-Orallo, José , month = oct, year =. Larger and more instructable language models become less reliable , volume =. Nature , publisher =. doi:10.1038/s41586-024-07930-y , abstract =

work page doi:10.1038/s41586-024-07930-y

[63] [63]

Messeri, Lisa and Crockett, M. J. , month = mar, year =. Artificial intelligence and illusions of understanding in scientific research , volume =. Nature , publisher =. doi:10.1038/s41586-024-07146-0 , abstract =

work page doi:10.1038/s41586-024-07146-0

[64] [64]

Implicit meta-learning may lead language models to trust more reliable sources , url =

Krasheninnikov, Dmitrii and Krasheninnikov, Egor and Mlodozeniec, Bruno and Maharaj, Tegan and Krueger, David , month = jul, year =. Implicit meta-learning may lead language models to trust more reliable sources , url =. doi:10.48550/arXiv.2310.15047 , abstract =

work page doi:10.48550/arxiv.2310.15047

[65] [65]

Taken out of context:

Berglund, Lukas and Stickland, Asa Cooper and Balesni, Mikita and Kaufmann, Max and Tong, Meg and Korbak, Tomasz and Kokotajlo, Daniel and Evans, Owain , month = sep, year =. Taken out of context:. doi:10.48550/arXiv.2309.00667 , abstract =

work page doi:10.48550/arxiv.2309.00667

[66] [66]

Evaluating the

Liu, Hanmeng and Ning, Ruoxi and Teng, Zhiyang and Liu, Jian and Zhou, Qiji and Zhang, Yue , month = may, year =. Evaluating the

[67] [67]

Brown, Tom B. and Mann, Benjamin and Ryder, Nick and Subbiah, Melanie and Kaplan, Jared and Dhariwal, Prafulla and Neelakantan, Arvind and Shyam, Pranav and Sastry, Girish and Askell, Amanda and Agarwal, Sandhini and Herbert-Voss, Ariel and Krueger, Gretchen and Henighan, Tom and Child, Rewon and Ramesh, Aditya and Ziegler, Daniel M. and Wu, Jeffrey and W...

work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.2005.14165 2005

[68] [68]

Abductive

Bhagavatula, Chandra and Bras, Ronan Le and Malaviya, Chaitanya and Sakaguchi, Keisuke and Holtzman, Ari and Rashkin, Hannah and Downey, Doug and Yih, Wen-tau and Choi, Yejin , month = sep, year =. Abductive

[69] [69]

Computational Linguistics , author =

Influences and. Computational Linguistics , author =. 2013 , keywords =. doi:10.1162/COLI_a_00171 , abstract =

work page doi:10.1162/coli_a_00171 2013

[70] [70]

and Xu, Yan and Fung, Pascale , month = nov, year =

Bang, Yejin and Cahyawijaya, Samuel and Lee, Nayeon and Dai, Wenliang and Su, Dan and Wilie, Bryan and Lovenia, Holy and Ji, Ziwei and Yu, Tiezheng and Chung, Willy and Do, Quyet V. and Xu, Yan and Fung, Pascale , month = nov, year =. A

[71] [71]

Reasoning or

Wu, Zhaofeng and Qiu, Linlu and Ross, Alexis and Akyürek, Ekin and Chen, Boyuan and Wang, Bailin and Kim, Najoung and Andreas, Jacob and Kim, Yoon , month = mar, year =. Reasoning or. doi:10.48550/arXiv.2307.02477 , abstract =

work page doi:10.48550/arxiv.2307.02477

[72] [72]

Proceedings of the 62nd

Sun, Wangtao and Xu, Haotian and Yu, Xuanqing and Chen, Pei and He, Shizhu and Zhao, Jun and Liu, Kang , editor =. Proceedings of the 62nd. 2024 , keywords =. doi:10.18653/v1/2024.acl-long.150 , abstract =

work page doi:10.18653/v1/2024.acl-long.150 2024

[73] [73]

Gabriel, Iason and Manzini, Arianna and Keeling, Geoff and Hendricks, Lisa Anne and Rieser, Verena and Iqbal, Hasan and Tomašev, Nenad and Ktena, Ira and Kenton, Zachary and Rodriguez, Mikel and El-Sayed, Seliem and Brown, Sasha and Akbulut, Canfer and Trask, Andrew and Hughes, Edward and Bergman, A. Stevie and Shelby, Renee and Marchal, Nahema and Griffi...

[74] [74]

Anwar, Usman and Saparov, Abulhair and Rando, Javier and Paleka, Daniel and Turpin, Miles and Hase, Peter and Lubana, Ekdeep Singh and Jenner, Erik and Casper, Stephen and Sourbut, Oliver and Edelman, Benjamin L. and Zhang, Zhaowei and Günther, Mario and Korinek, Anton and Hernandez-Orallo, Jose and Hammond, Lewis and Bigelow, Eric and Pan, Alexander and ...

work page doi:10.48550/arxiv.2404.09932

[75] [75]

Xu, Shusheng and Fu, Wei and Gao, Jiaxuan and Ye, Wenjie and Liu, Weilin and Mei, Zhiyu and Wang, Guangju and Yu, Chao and Wu, Yi , month = apr, year =. Is. doi:10.48550/arXiv.2404.10719 , abstract =

work page doi:10.48550/arxiv.2404.10719

[76] [76]

and Feng, Shi , month = apr, year =

Panickssery, Arjun and Bowman, Samuel R. and Feng, Shi , month = apr, year =

[77] [77]

Preference

Tajwar, Fahim and Singh, Anikait and Sharma, Archit and Rafailov, Rafael and Schneider, Jeff and Xie, Tengyang and Ermon, Stefano and Finn, Chelsea and Kumar, Aviral , month = apr, year =. Preference. doi:10.48550/arXiv.2404.14367 , abstract =

work page doi:10.48550/arxiv.2404.14367

[78] [78]

How to use and interpret activation patching

Heimersheim, Stefan and Nanda, Neel , month = apr, year =. How to use and interpret activation patching , url =. doi:10.48550/arXiv.2404.15255 , abstract =

work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.2404.15255

[79] [79]

and Hajishirzi, Hannaneh and Khashabi, Daniel , month = oct, year =

Wang, Yizhong and Mishra, Swaroop and Alipoormolabashi, Pegah and Kordi, Yeganeh and Mirzaei, Amirreza and Arunkumar, Anjana and Ashok, Arjun and Dhanasekaran, Arut Selvan and Naik, Atharva and Stap, David and Pathak, Eshaan and Karamanolakis, Giannis and Lai, Haizhi Gary and Purohit, Ishan and Mondal, Ishani and Anderson, Jacob and Kuznia, Kirby and Dosh...

[80] [80]

Dong, Qingxiu and Li, Lei and Dai, Damai and Zheng, Ce and Wu, Zhiyong and Chang, Baobao and Sun, Xu and Xu, Jingjing and Li, Lei and Sui, Zhifang , month = jun, year =. A