pith. machine review for the scientific record.

arxiv: 2605.01048 · v1 · submitted 2026-05-01 · 💻 cs.CL · cs.LG

Recognition: unknown

Compared to What? Baselines and Metrics for Counterfactual Prompting

Byron C. Wallace, Mosh Levy, Yoav Goldberg, Zihao Yang


Pith reviewed 2026-05-09 18:58 UTC · model grok-4.3

classification 💻 cs.CL cs.LG
keywords counterfactual prompting · LLM evaluation · bias measurement · paraphrasing baselines · prediction flips · statistical testing · MedQA · MedPerturb

The pith

Counterfactual prompting studies must compare targeted edits to paraphrasing baselines before attributing effects to specific factors.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper shows that any single-factor edit in prompting also changes surface form, so observed output shifts could stem from general model sensitivity rather than the intended variable. On MedQA, changing patient gender flips predictions at rates statistically identical to those from simple paraphrasing, undermining claims of gender sensitivity. The authors therefore propose testing whether a target intervention produces reliably larger changes than meaning-preserving paraphrases. When this test is applied to prior MedPerturb results, nearly all reported demographic and style effects disappear. The same method still detects clear directional gender bias in occupational biography classification, and per-sample metrics prove far more sensitive than aggregate or regression approaches.

Core claim

Every counterfactual edit is a compound treatment that bundles the variable of interest with incidental surface-form variation, violating treatment variation irrelevance. Flip rates from surgically changing patient gender (14.9 percent) are indistinguishable from those produced by paraphrasing the same inputs (14.1 percent). A statistical framework that compares target-intervention differences against paraphrasing baselines shows that most previously reported sensitivities in MedPerturb are no longer significant, while directional gender bias remains detectable in biography classification tasks. Per-sample distributional metrics detect effects more powerfully than aggregate or regression metrics.
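The paper's exact test statistic is not reproduced on this page, but the core comparison can be sketched as a permutation test on binary flip indicators. The 14.9%/14.1% rates come from the paper; the sample size and the choice of a permutation test are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

def permutation_test(target_flips, para_flips, n_perm=10_000):
    """Two-sided permutation test on the gap between the targeted-edit
    flip rate and the paraphrasing-baseline flip rate."""
    observed = target_flips.mean() - para_flips.mean()
    pooled = np.concatenate([target_flips, para_flips])
    n = len(target_flips)
    extreme = 0
    for _ in range(n_perm):
        perm = rng.permutation(pooled)
        gap = perm[:n].mean() - perm[n:].mean()
        if abs(gap) >= abs(observed):
            extreme += 1
    return observed, (extreme + 1) / (n_perm + 1)

# Binary flip indicators drawn at the paper's reported rates: 14.9% for
# surgical gender edits vs 14.1% for paraphrases (n=1000 is an assumption).
target = rng.binomial(1, 0.149, size=1000)
para = rng.binomial(1, 0.141, size=1000)
gap, p = permutation_test(target, para)
# A large p-value means the targeted edit is indistinguishable from
# paraphrase noise, which is the paper's MedQA finding.
```

A large p-value here is a null result: it says nothing about the model being insensitive to gender, only that the targeted edit adds no detectable effect beyond general surface-form sensitivity.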

What carries the argument

The statistical comparison framework that measures whether changes under a target intervention exceed those induced by paraphrasing the same inputs.

If this is right

  • Most reported sensitivities to patient demographics in prior MedPerturb analyses are no longer statistically supported once general sensitivity is accounted for.
  • Only five of 120 tests reach significance after the paraphrasing baseline is applied.
  • Per-sample metrics detect effects far more reliably than aggregate flip rates or regression models.
  • The framework can still identify real directional bias, as shown by significant gender effects in occupational biography classification.
  • Regression-based metrics uniquely characterize both the direction and magnitude of effects when they exist.
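The power gap between per-sample and aggregate metrics claimed above can be illustrated with synthetic answer distributions. This is a minimal sketch assuming each input yields a probability distribution over answer options; the dataset size, option count, and Dirichlet parameters are invented for illustration.

```python
import numpy as np

def js_divergence(p, q, eps=1e-12):
    """Jensen-Shannon divergence (bits) between two discrete distributions."""
    p, q = np.asarray(p) + eps, np.asarray(q) + eps
    p, q = p / p.sum(), q / q.sum()
    m = 0.5 * (p + q)
    kl = lambda a, b: np.sum(a * np.log2(a / b))
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

def per_sample_jsd(p_orig, p_pert):
    """Mean per-sample divergence between each input's answer
    distributions before and after perturbation."""
    return float(np.mean([js_divergence(p, q) for p, q in zip(p_orig, p_pert)]))

def aggregate_flip_rate(p_orig, p_pert):
    """Fraction of inputs whose top-ranked answer changes."""
    return float(np.mean(np.argmax(p_orig, axis=1) != np.argmax(p_pert, axis=1)))

rng = np.random.default_rng(1)
# 200 hypothetical inputs with 4 answer options; the perturbation shifts
# probability mass slightly, usually without changing the argmax.
p_orig = rng.dirichlet([8.0, 1.0, 1.0, 1.0], size=200)
p_pert = 0.9 * p_orig + 0.1 * rng.dirichlet([1.0, 1.0, 1.0, 1.0], size=200)

jsd = per_sample_jsd(p_orig, p_pert)         # registers the distributional shift
flips = aggregate_flip_rate(p_orig, p_pert)  # mostly blind to sub-argmax shifts
```

The design point: flip rate only moves when probability mass crosses the argmax boundary, while a per-sample divergence accumulates every shift, which is why the paper finds it detects effects at much smaller perturbation strengths.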

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same baseline comparison could be extended to faithfulness checks in chain-of-thought prompting to test whether reasoning steps are truly causal.
  • Models may treat many surface variations as equivalent noise, suggesting broader re-examination of perturbation-based evaluation methods.
  • Developing paraphrases that better isolate surface form from semantic drift would strengthen the control condition.

Load-bearing premise

Paraphrasing generates valid meaning-preserving controls that introduce only incidental surface-form variation without other uncontrolled factors or model-specific artifacts.

What would settle it

A new experiment that applies multiple independent paraphrases and targeted gender edits to the same MedQA cases and finds the gender edits produce significantly higher flip rates under the paper's statistical test.
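The analysis for that settling experiment can be sketched as a paired per-case test. The case count, paraphrase count, and the sign-flip permutation test are illustrative choices, not the paper's specification.

```python
import numpy as np

rng = np.random.default_rng(2)
n_cases, k = 150, 8   # hypothetical: 150 MedQA cases, 8 paraphrases per case

# Per case: did the targeted gender edit flip the prediction (0/1), and
# what fraction of the k independent paraphrases flipped it?
gender_flip = rng.binomial(1, 0.149, size=n_cases).astype(float)
para_flip_rate = rng.binomial(1, 0.141, size=(n_cases, k)).mean(axis=1)

# Paired sign-flip permutation test: under the null that gender edits
# behave like paraphrases, the sign of each per-case difference is
# arbitrary, so random sign flips simulate the null distribution.
diff = gender_flip - para_flip_rate
observed = diff.mean()
signs = rng.choice([-1.0, 1.0], size=(10_000, n_cases))
null_means = (signs * diff).mean(axis=1)
p_one_sided = ((null_means >= observed).sum() + 1) / (10_000 + 1)
# One-sided significance here would support a genuine gender effect;
# under the paper's account, the test should fail to reject on MedQA.
```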

Figures

Figures reproduced from arXiv: 2605.01048 by Byron C. Wallace, Mosh Levy, Yoav Goldberg, Zihao Yang.

Figure 1: An example illustrating the core problem.
Figure 2: Flip rate increases with token change percentage for MANAGE (6% → 14%) and VISIT (5% → 13%); MI shows the corresponding decline (Appendix C). This means a paraphrase that changes 5% of tokens produces qualitatively different baseline noise than one changing 40%. If a targeted perturbation changes 3% of tokens but is compared against a baseline that changes 20%, the baseline will appear noisi…
Figure 3: Mutual information between original and paraphrased responses decreases with …
Figure 4: Power curves at σ = 0.5 for two 8B conditions. Left (VISIT): per-sample metrics (JSD, KL) reach near-perfect detection while per-population metrics (MI, ϕ, flip rate) remain near α = 0.05. Right (RESOURCE): under extreme class imbalance (99/100 positive), MI and ϕ are completely degenerate, while JSD and KL retain full sensitivity. Per-condition results: Figures 5–10 show the full power curves for all six …
Figure 5: Power curves for MANAGE — 8B.
Figure 6: Power curves for MANAGE — 70B.
Figure 7: Power curves for RESOURCE — 8B. MI and ϕ show zero detection across all σ_pert values due to extreme class imbalance (99/100 positive cases), which makes contingency-table-based metrics degenerate.
Figure 8: Power curves for RESOURCE — 70B. Same MI/…
Figure 9: Power curves for VISIT — 8B.
Figure 10: Power curves for VISIT — 70B.
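The degeneracy that Figures 4 and 7 attribute to contingency-table metrics under imbalance can be reproduced with a few lines of synthetic labels. The 99/100 split mirrors the RESOURCE condition; everything else here is illustrative.

```python
import numpy as np

def mutual_information(table):
    """Mutual information (bits) between the row and column labelings
    of a 2x2 contingency table."""
    p = table / table.sum()
    px = p.sum(axis=1, keepdims=True)   # marginal over original labels
    py = p.sum(axis=0, keepdims=True)   # marginal over perturbed labels
    with np.errstate(divide="ignore", invalid="ignore"):
        terms = p * np.log2(p / (px * py))
    return float(np.nansum(terms))      # zero-probability cells contribute 0

# Extreme imbalance like the RESOURCE condition (99/100 positive): the
# lone negative case flips under perturbation, so the perturbed column
# becomes constant and MI collapses to exactly zero (ϕ is undefined).
y_orig = np.array([1] * 99 + [0])
y_pert = np.array([1] * 100)
table = np.zeros((2, 2))
for a, b in zip(y_orig, y_pert):
    table[a, b] += 1
mi = mutual_information(table)
```

A per-sample divergence between the two response distributions would still register this flip, which is the JSD/KL advantage the figure describes.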
Original abstract

Counterfactual prompting (i.e., perturbing a single factor and measuring output change) is widely used to evaluate things like LLM bias and CoT faithfulness. But in this work we argue that observed effects cannot be attributed to the targeted factor without accounting for baseline "meaning-preserving" modifications to text that establish general model sensitivity. This is because every counterfactual edit is a compound treatment that bundles the variable of interest with incidental surface-form variation; this violates treatment variation irrelevance. We observe prediction flip rates on MedQA of 14.9% when we surgically change patient gender. However, this is statistically indistinguishable from the flip rates induced by simply paraphrasing inputs (14.1%). In this case, it would therefore be unwarranted to conclude that the LLM is especially sensitive to patient gender. To account for this and robustly measure the effects of targeted interventions, we propose a framework in which we compare (via statistical testing) differences observed under target interventions to those induced by paraphrasing inputs. We then use this framework to revisit an analysis done on the MedPerturb dataset, which reported evidence of model sensitivity to patient demographics and stylistic cues. We find that these effects largely dissipate when we account for general model sensitivity, with only 5 of 120 tests reaching statistical significance. Applying the same framework to occupational biography classification, we detect clearly significant directional gender bias, showing that the framework identifies real directional effects even when they are small. We evaluate a range of metrics -- aggregate, per-sample distributional, and regression -- and find that per-sample metrics are dramatically more powerful than aggregate metrics, and that regression powerfully and uniquely characterizes effect direction and magnitude.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 3 minor

Summary. The paper argues that counterfactual prompting effects (e.g., LLM sensitivity to patient gender or demographics) cannot be attributed to the targeted factor without baselines for meaning-preserving modifications, as every edit bundles the variable of interest with incidental surface-form changes that violate treatment variation irrelevance. On MedQA, gender perturbation yields a 14.9% flip rate, statistically indistinguishable from 14.1% under paraphrasing; applying the framework to MedPerturb shows that most of the 120 tests lose significance (only 5 remain), while occupational biography classification detects robust directional gender bias. The work evaluates aggregate, per-sample distributional, and regression metrics, finding per-sample metrics far more powerful and regression uniquely effective for direction and magnitude.

Significance. If the framework holds, it supplies a statistically grounded method for isolating targeted effects in LLM counterfactual studies, strengthening validity of bias and faithfulness evaluations on public datasets. Credit is due for grounding claims in external statistical tests rather than internal parameters, demonstrating both null results (MedPerturb dissipation) and positive directional detection (occupational bias), and comparing multiple metric classes with concrete power differences.

major comments (3)
  1. [Methods] Methods (paraphrase generation procedure): The exact generation process, model, prompt template, and validation steps for meaning preservation are not fully specified. This is load-bearing because if paraphrases systematically differ in semantic shift distribution or introduce model artifacts unmatched to the minimal lexical gender/demographic edits, the 14.9% vs 14.1% flip-rate comparison on MedQA does not establish that the targeted factor adds no extra sensitivity.
  2. [§4] §4 (MedPerturb results): The 120 tests are not enumerated by intervention type, and no multiple-comparison correction (e.g., Bonferroni or FDR) is reported for the claim that only 5 reach significance. Without this, the conclusion that effects 'largely dissipate' cannot be assessed for robustness against inflated Type I error.
  3. [§5] §5 (metric comparison): The claim that per-sample metrics are 'dramatically more powerful' than aggregate metrics lacks reported effect sizes, power curves, or sample-size calculations; the regression metric's unique characterization of direction/magnitude is asserted but not contrasted against alternatives via explicit coefficient tables or simulation.
minor comments (3)
  1. [Abstract, §3] Abstract and §3: 'treatment variation irrelevance' is introduced without a formal definition or citation to the causal inference literature; a one-sentence gloss would improve accessibility.
  2. [Figures] Figure captions (if present): Ensure all axes, error bars, and statistical annotations are labeled with exact test names and p-value thresholds used for 'indistinguishable' claims.
  3. [References] References: Add citations for standard multiple-testing procedures and prior work on paraphrase-based controls in NLP evaluation.

Simulated Authors' Rebuttal

3 responses · 0 unresolved

We thank the referee for their insightful comments, which have pointed out opportunities to improve the transparency and statistical robustness of our work. We address each of the major comments in turn below.

Point-by-point responses
  1. Referee: [Methods] Methods (paraphrase generation procedure): The exact generation process, model, prompt template, and validation steps for meaning preservation are not fully specified. This is load-bearing because if paraphrases systematically differ in semantic shift distribution or introduce model artifacts unmatched to the minimal lexical gender/demographic edits, the 14.9% vs 14.1% flip-rate comparison on MedQA does not establish that the targeted factor adds no extra sensitivity.

    Authors: We agree that additional details on the paraphrase generation are necessary to fully substantiate the baseline comparison. In the revised manuscript, we will specify the exact model and version used for paraphrase generation, provide the complete prompt template, and describe the validation steps, including any quantitative measures of semantic similarity and qualitative checks for meaning preservation. This will enable readers to evaluate whether the paraphrases serve as an appropriate control for incidental surface-form changes. revision: yes

  2. Referee: [§4] §4 (MedPerturb results): The 120 tests are not enumerated by intervention type, and no multiple-comparison correction (e.g., Bonferroni or FDR) is reported for the claim that only 5 reach significance. Without this, the conclusion that effects 'largely dissipate' cannot be assessed for robustness against inflated Type I error.

    Authors: The referee is correct that a full enumeration and correction for multiple testing would strengthen the results section. We will revise §4 to include a breakdown of the 120 tests by intervention category (e.g., demographic, stylistic), list the specific tests that remain significant, and apply an FDR correction, reporting both raw and adjusted p-values. This will allow a more rigorous assessment of whether the effects largely dissipate. revision: yes

  3. Referee: [§5] §5 (metric comparison): The claim that per-sample metrics are 'dramatically more powerful' than aggregate metrics lacks reported effect sizes, power curves, or sample-size calculations; the regression metric's unique characterization of direction/magnitude is asserted but not contrasted against alternatives via explicit coefficient tables or simulation.

    Authors: We acknowledge that the metric comparison section would benefit from more quantitative support. In the revision, we will add effect size calculations for the differences in power between per-sample and aggregate metrics, include power analysis or curves based on our sample sizes, and provide a table of regression coefficients along with a brief simulation study to illustrate the advantages of the regression approach in capturing direction and magnitude compared to other metrics. revision: yes
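The FDR correction promised in response 2 is a standard procedure; here is a minimal Benjamini-Hochberg sketch. The 120 p-values are synthetic, arranged so that five genuine effects sit among noise, loosely mirroring the paper's "5 of 120 significant" result.

```python
import numpy as np

def benjamini_hochberg(pvals, alpha=0.05):
    """Benjamini-Hochberg step-up procedure: boolean mask of hypotheses
    rejected at false-discovery rate alpha."""
    p = np.asarray(pvals, dtype=float)
    m = len(p)
    order = np.argsort(p)
    thresholds = alpha * np.arange(1, m + 1) / m   # alpha * k / m at rank k
    passes = p[order] <= thresholds
    reject = np.zeros(m, dtype=bool)
    if passes.any():
        k = np.nonzero(passes)[0].max()            # largest rank that passes
        reject[order[: k + 1]] = True              # reject everything up to it
    return reject

# Hypothetical battery of 120 tests: five small p-values among noise.
rng = np.random.default_rng(3)
pvals = np.concatenate([rng.uniform(0.0, 0.001, 5), rng.uniform(0.06, 1.0, 115)])
significant = benjamini_hochberg(pvals, alpha=0.05)
```

Unlike Bonferroni, which controls family-wise error at alpha/m per test, BH controls the expected fraction of false discoveries, so it retains more power over a battery this large.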

Circularity Check

0 steps flagged

No significant circularity

full rationale

The paper's central argument and framework rely on empirical statistical comparisons of flip rates and effect sizes between targeted counterfactual edits and paraphrasing baselines, using public datasets such as MedQA and MedPerturb. These comparisons are presented as external evidence rather than quantities defined by construction from fitted parameters or internal definitions within the paper. The conceptual claim that every counterfactual edit bundles incidental surface-form variation (violating treatment variation irrelevance) draws from standard causal inference principles and is tested against independent data, without reducing to self-referential steps, self-citation chains, or renamed known results. No load-bearing derivations equate predictions to their own inputs.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests primarily on the domain assumption that paraphrases isolate general sensitivity to surface variation, plus standard statistical testing assumptions.

axioms (1)
  • domain assumption: Paraphrasing inputs creates meaning-preserving modifications that isolate general model sensitivity to surface-form variation.
    This is the core baseline used to establish that targeted counterfactuals are compound treatments violating treatment variation irrelevance.

pith-pipeline@v0.9.0 · 5602 in / 1359 out tokens · 68559 ms · 2026-05-09T18:58:38.243131+00:00 · methodology

