pith. machine review for the scientific record.

arxiv: 2605.01048 · v1 · submitted 2026-05-01 · 💻 cs.CL · cs.LG

Recognition: unknown

Compared to What? Baselines and Metrics for Counterfactual Prompting

Byron C. Wallace, Mosh Levy, Yoav Goldberg, Zihao Yang


Pith reviewed 2026-05-09 18:58 UTC · model grok-4.3

classification 💻 cs.CL cs.LG
keywords counterfactual prompting · LLM evaluation · bias measurement · paraphrasing baselines · prediction flips · statistical testing · MedQA · MedPerturb

The pith

Counterfactual prompting studies must compare targeted edits to paraphrasing baselines before attributing effects to specific factors.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper shows that any single-factor edit in prompting also changes surface form, so observed output shifts could stem from general model sensitivity rather than the intended variable. On MedQA, changing patient gender flips predictions at rates statistically identical to those from simple paraphrasing, undermining claims of gender sensitivity. The authors therefore propose testing whether a target intervention produces reliably larger changes than meaning-preserving paraphrases. When this test is applied to prior MedPerturb results, nearly all reported demographic and style effects disappear. The same method still detects clear directional gender bias in occupational biography classification, and per-sample metrics prove far more sensitive than aggregate or regression approaches.

Core claim

Every counterfactual edit is a compound treatment that bundles the variable of interest with incidental surface-form variation, violating treatment variation irrelevance. Flip rates from surgically changing patient gender (14.9 percent) are indistinguishable from those produced by paraphrasing the same inputs (14.1 percent). A statistical framework that compares target-intervention differences against paraphrasing baselines shows that most previously reported sensitivities in MedPerturb are no longer significant, while directional gender bias remains detectable in biography classification tasks. Per-sample distributional metrics detect effects more powerfully than aggregate or regression metrics.
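The paper's exact test statistic is not reproduced on this page, but the core comparison can be sketched as a permutation test on binary flip indicators. The 14.9%/14.1% rates come from the paper; the sample size and the choice of a permutation test are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

def permutation_test(target_flips, para_flips, n_perm=10_000):
    """Two-sided permutation test on the gap between the targeted-edit
    flip rate and the paraphrasing-baseline flip rate."""
    observed = target_flips.mean() - para_flips.mean()
    pooled = np.concatenate([target_flips, para_flips])
    n = len(target_flips)
    extreme = 0
    for _ in range(n_perm):
        perm = rng.permutation(pooled)
        gap = perm[:n].mean() - perm[n:].mean()
        if abs(gap) >= abs(observed):
            extreme += 1
    return observed, (extreme + 1) / (n_perm + 1)

# Binary flip indicators drawn at the paper's reported rates: 14.9% for
# surgical gender edits vs 14.1% for paraphrases (n=1000 is an assumption).
target = rng.binomial(1, 0.149, size=1000)
para = rng.binomial(1, 0.141, size=1000)
gap, p = permutation_test(target, para)
# A large p-value means the targeted edit is indistinguishable from
# paraphrase noise, which is the paper's MedQA finding.
```

A large p-value here is a null result: it says nothing about the model being insensitive to gender, only that the targeted edit adds no detectable effect beyond general surface-form sensitivity.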

What carries the argument

The statistical comparison framework that measures whether changes under a target intervention exceed those induced by paraphrasing the same inputs.

If this is right

  • Most reported sensitivities to patient demographics in prior MedPerturb analyses are no longer statistically supported once general sensitivity is accounted for.
  • Only five of 120 tests reach significance after the paraphrasing baseline is applied.
  • Per-sample metrics detect effects far more reliably than aggregate flip rates or regression models.
  • The framework can still identify real directional bias, as shown by significant gender effects in occupational biography classification.
  • Regression-based metrics uniquely characterize both the direction and magnitude of effects when they exist.
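The power gap between per-sample and aggregate metrics claimed above can be illustrated with synthetic answer distributions. This is a minimal sketch assuming each input yields a probability distribution over answer options; the dataset size, option count, and Dirichlet parameters are invented for illustration.

```python
import numpy as np

def js_divergence(p, q, eps=1e-12):
    """Jensen-Shannon divergence (bits) between two discrete distributions."""
    p, q = np.asarray(p) + eps, np.asarray(q) + eps
    p, q = p / p.sum(), q / q.sum()
    m = 0.5 * (p + q)
    kl = lambda a, b: np.sum(a * np.log2(a / b))
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

def per_sample_jsd(p_orig, p_pert):
    """Mean per-sample divergence between each input's answer
    distributions before and after perturbation."""
    return float(np.mean([js_divergence(p, q) for p, q in zip(p_orig, p_pert)]))

def aggregate_flip_rate(p_orig, p_pert):
    """Fraction of inputs whose top-ranked answer changes."""
    return float(np.mean(np.argmax(p_orig, axis=1) != np.argmax(p_pert, axis=1)))

rng = np.random.default_rng(1)
# 200 hypothetical inputs with 4 answer options; the perturbation shifts
# probability mass slightly, usually without changing the argmax.
p_orig = rng.dirichlet([8.0, 1.0, 1.0, 1.0], size=200)
p_pert = 0.9 * p_orig + 0.1 * rng.dirichlet([1.0, 1.0, 1.0, 1.0], size=200)

jsd = per_sample_jsd(p_orig, p_pert)         # registers the distributional shift
flips = aggregate_flip_rate(p_orig, p_pert)  # mostly blind to sub-argmax shifts
```

The design point: flip rate only moves when probability mass crosses the argmax boundary, while a per-sample divergence accumulates every shift, which is why the paper finds it detects effects at much smaller perturbation strengths.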

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same baseline comparison could be extended to faithfulness checks in chain-of-thought prompting to test whether reasoning steps are truly causal.
  • Models may treat many surface variations as equivalent noise, suggesting broader re-examination of perturbation-based evaluation methods.
  • Developing paraphrases that better isolate surface form from semantic drift would strengthen the control condition.

Load-bearing premise

Paraphrasing generates valid meaning-preserving controls that introduce only incidental surface-form variation without other uncontrolled factors or model-specific artifacts.

What would settle it

A new experiment that applies multiple independent paraphrases and targeted gender edits to the same MedQA cases and finds the gender edits produce significantly higher flip rates under the paper's statistical test.
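The analysis for that settling experiment can be sketched as a paired per-case test. The case count, paraphrase count, and the sign-flip permutation test are illustrative choices, not the paper's specification.

```python
import numpy as np

rng = np.random.default_rng(2)
n_cases, k = 150, 8   # hypothetical: 150 MedQA cases, 8 paraphrases per case

# Per case: did the targeted gender edit flip the prediction (0/1), and
# what fraction of the k independent paraphrases flipped it?
gender_flip = rng.binomial(1, 0.149, size=n_cases).astype(float)
para_flip_rate = rng.binomial(1, 0.141, size=(n_cases, k)).mean(axis=1)

# Paired sign-flip permutation test: under the null that gender edits
# behave like paraphrases, the sign of each per-case difference is
# arbitrary, so random sign flips simulate the null distribution.
diff = gender_flip - para_flip_rate
observed = diff.mean()
signs = rng.choice([-1.0, 1.0], size=(10_000, n_cases))
null_means = (signs * diff).mean(axis=1)
p_one_sided = ((null_means >= observed).sum() + 1) / (10_000 + 1)
# One-sided significance here would support a genuine gender effect;
# under the paper's account, the test should fail to reject on MedQA.
```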

Figures

Figures reproduced from arXiv: 2605.01048 by Byron C. Wallace, Mosh Levy, Yoav Goldberg, Zihao Yang.

Figure 1: An example illustrating the core problem.
Figure 2: Flip rate increases with token change percentage for MANAGE (6% → 14%) and VISIT (5% → 13%); MI shows the corresponding decline (Appendix C). This means a paraphrase that changes 5% of tokens produces qualitatively different baseline noise than one changing 40%. If a targeted perturbation changes 3% of tokens but is compared against a baseline that changes 20%, the baseline will appear noisi…
Figure 3: Mutual information between original and paraphrased responses decreases with …
Figure 4: Power curves at σ = 0.5 for two 8B conditions. Left (VISIT): per-sample metrics (JSD, KL) reach near-perfect detection while per-population metrics (MI, ϕ, flip rate) remain near α = 0.05. Right (RESOURCE): under extreme class imbalance (99/100 positive), MI and ϕ are completely degenerate, while JSD and KL retain full sensitivity. Per-condition results: Figures 5–10 show the full power curves for all six …
Figure 5: Power curves for MANAGE — 8B.
Figure 6: Power curves for MANAGE — 70B.
Figure 7: Power curves for RESOURCE — 8B. MI and ϕ show zero detection across all σ_pert values due to extreme class imbalance (99/100 positive cases), which makes contingency-table-based metrics degenerate.
Figure 8: Power curves for RESOURCE — 70B. Same MI/…
Figure 9: Power curves for VISIT — 8B.
Figure 10: Power curves for VISIT — 70B.
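The degeneracy that Figures 4 and 7 attribute to contingency-table metrics under imbalance can be reproduced with a few lines of synthetic labels. The 99/100 split mirrors the RESOURCE condition; everything else here is illustrative.

```python
import numpy as np

def mutual_information(table):
    """Mutual information (bits) between the row and column labelings
    of a 2x2 contingency table."""
    p = table / table.sum()
    px = p.sum(axis=1, keepdims=True)   # marginal over original labels
    py = p.sum(axis=0, keepdims=True)   # marginal over perturbed labels
    with np.errstate(divide="ignore", invalid="ignore"):
        terms = p * np.log2(p / (px * py))
    return float(np.nansum(terms))      # zero-probability cells contribute 0

# Extreme imbalance like the RESOURCE condition (99/100 positive): the
# lone negative case flips under perturbation, so the perturbed column
# becomes constant and MI collapses to exactly zero (ϕ is undefined).
y_orig = np.array([1] * 99 + [0])
y_pert = np.array([1] * 100)
table = np.zeros((2, 2))
for a, b in zip(y_orig, y_pert):
    table[a, b] += 1
mi = mutual_information(table)
```

A per-sample divergence between the two response distributions would still register this flip, which is the JSD/KL advantage the figure describes.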
Original abstract

Counterfactual prompting (i.e., perturbing a single factor and measuring output change) is widely used to evaluate things like LLM bias and CoT faithfulness. But in this work we argue that observed effects cannot be attributed to the targeted factor without accounting for baseline "meaning-preserving" modifications to text that establish general model sensitivity. This is because every counterfactual edit is a compound treatment that bundles the variable of interest with incidental surface-form variation; this violates treatment variation irrelevance. We observe prediction flip rates on MedQA of 14.9% when we surgically change patient gender. However, this is statistically indistinguishable from the flip rates induced by simply paraphrasing inputs (14.1%). In this case, it would therefore be unwarranted to conclude that the LLM is especially sensitive to patient gender. To account for this and robustly measure the effects of targeted interventions, we propose a framework in which we compare (via statistical testing) differences observed under target interventions to those induced by paraphrasing inputs. We then use this framework to revisit an analysis done on the MedPerturb dataset, which reported evidence of model sensitivity to patient demographics and stylistic cues. We find that these effects largely dissipate when we account for general model sensitivity, with only 5 of 120 tests reaching statistical significance. Applying the same framework to occupational biography classification, we detect clearly significant directional gender bias, showing that the framework identifies real directional effects even when they are small. We evaluate a range of metrics -- aggregate, per-sample distributional, and regression -- and find that per-sample metrics are dramatically more powerful than aggregate metrics, and that regression powerfully and uniquely characterizes effect direction and magnitude.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 3 minor

Summary. The paper argues that counterfactual prompting effects (e.g., LLM sensitivity to patient gender or demographics) cannot be attributed to the targeted factor without baselines for meaning-preserving modifications, as every edit bundles the variable of interest with incidental surface-form changes that violate treatment variation irrelevance. On MedQA, gender perturbation yields a 14.9% flip rate, statistically indistinguishable from 14.1% under paraphrasing; applying the framework to MedPerturb shows that most of the 120 tests lose significance (only 5 remain), while occupational biography classification detects robust directional gender bias. The work evaluates aggregate, per-sample distributional, and regression metrics, finding per-sample metrics far more powerful and regression uniquely effective for direction and magnitude.

Significance. If the framework holds, it supplies a statistically grounded method for isolating targeted effects in LLM counterfactual studies, strengthening validity of bias and faithfulness evaluations on public datasets. Credit is due for grounding claims in external statistical tests rather than internal parameters, demonstrating both null results (MedPerturb dissipation) and positive directional detection (occupational bias), and comparing multiple metric classes with concrete power differences.

major comments (3)
  1. [Methods] Methods (paraphrase generation procedure): The exact generation process, model, prompt template, and validation steps for meaning preservation are not fully specified. This is load-bearing because if paraphrases systematically differ in semantic shift distribution or introduce model artifacts unmatched to the minimal lexical gender/demographic edits, the 14.9% vs 14.1% flip-rate comparison on MedQA does not establish that the targeted factor adds no extra sensitivity.
  2. [§4] §4 (MedPerturb results): The 120 tests are not enumerated by intervention type, and no multiple-comparison correction (e.g., Bonferroni or FDR) is reported for the claim that only 5 reach significance. Without this, the conclusion that effects 'largely dissipate' cannot be assessed for robustness against inflated Type I error.
  3. [§5] §5 (metric comparison): The claim that per-sample metrics are 'dramatically more powerful' than aggregate metrics lacks reported effect sizes, power curves, or sample-size calculations; the regression metric's unique characterization of direction/magnitude is asserted but not contrasted against alternatives via explicit coefficient tables or simulation.
minor comments (3)
  1. [Abstract, §3] Abstract and §3: 'treatment variation irrelevance' is introduced without a formal definition or citation to the causal inference literature; a one-sentence gloss would improve accessibility.
  2. [Figures] Figure captions (if present): Ensure all axes, error bars, and statistical annotations are labeled with exact test names and p-value thresholds used for 'indistinguishable' claims.
  3. [References] References: Add citations for standard multiple-testing procedures and prior work on paraphrase-based controls in NLP evaluation.

Simulated Authors' Rebuttal

3 responses · 0 unresolved

We thank the referee for their insightful comments, which have pointed out opportunities to improve the transparency and statistical robustness of our work. We address each of the major comments in turn below.

Point-by-point responses
  1. Referee: [Methods] Methods (paraphrase generation procedure): The exact generation process, model, prompt template, and validation steps for meaning preservation are not fully specified. This is load-bearing because if paraphrases systematically differ in semantic shift distribution or introduce model artifacts unmatched to the minimal lexical gender/demographic edits, the 14.9% vs 14.1% flip-rate comparison on MedQA does not establish that the targeted factor adds no extra sensitivity.

    Authors: We agree that additional details on the paraphrase generation are necessary to fully substantiate the baseline comparison. In the revised manuscript, we will specify the exact model and version used for paraphrase generation, provide the complete prompt template, and describe the validation steps, including any quantitative measures of semantic similarity and qualitative checks for meaning preservation. This will enable readers to evaluate whether the paraphrases serve as an appropriate control for incidental surface-form changes. revision: yes

  2. Referee: [§4] §4 (MedPerturb results): The 120 tests are not enumerated by intervention type, and no multiple-comparison correction (e.g., Bonferroni or FDR) is reported for the claim that only 5 reach significance. Without this, the conclusion that effects 'largely dissipate' cannot be assessed for robustness against inflated Type I error.

    Authors: The referee is correct that a full enumeration and correction for multiple testing would strengthen the results section. We will revise §4 to include a breakdown of the 120 tests by intervention category (e.g., demographic, stylistic), list the specific tests that remain significant, and apply an FDR correction, reporting both raw and adjusted p-values. This will allow a more rigorous assessment of whether the effects largely dissipate. revision: yes

  3. Referee: [§5] §5 (metric comparison): The claim that per-sample metrics are 'dramatically more powerful' than aggregate metrics lacks reported effect sizes, power curves, or sample-size calculations; the regression metric's unique characterization of direction/magnitude is asserted but not contrasted against alternatives via explicit coefficient tables or simulation.

    Authors: We acknowledge that the metric comparison section would benefit from more quantitative support. In the revision, we will add effect size calculations for the differences in power between per-sample and aggregate metrics, include power analysis or curves based on our sample sizes, and provide a table of regression coefficients along with a brief simulation study to illustrate the advantages of the regression approach in capturing direction and magnitude compared to other metrics. revision: yes
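The FDR correction promised in response 2 is a standard procedure; here is a minimal Benjamini-Hochberg sketch. The 120 p-values are synthetic, arranged so that five genuine effects sit among noise, loosely mirroring the paper's "5 of 120 significant" result.

```python
import numpy as np

def benjamini_hochberg(pvals, alpha=0.05):
    """Benjamini-Hochberg step-up procedure: boolean mask of hypotheses
    rejected at false-discovery rate alpha."""
    p = np.asarray(pvals, dtype=float)
    m = len(p)
    order = np.argsort(p)
    thresholds = alpha * np.arange(1, m + 1) / m   # alpha * k / m at rank k
    passes = p[order] <= thresholds
    reject = np.zeros(m, dtype=bool)
    if passes.any():
        k = np.nonzero(passes)[0].max()            # largest rank that passes
        reject[order[: k + 1]] = True              # reject everything up to it
    return reject

# Hypothetical battery of 120 tests: five small p-values among noise.
rng = np.random.default_rng(3)
pvals = np.concatenate([rng.uniform(0.0, 0.001, 5), rng.uniform(0.06, 1.0, 115)])
significant = benjamini_hochberg(pvals, alpha=0.05)
```

Unlike Bonferroni, which controls family-wise error at alpha/m per test, BH controls the expected fraction of false discoveries, so it retains more power over a battery this large.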

Circularity Check

0 steps flagged

No significant circularity

full rationale

The paper's central argument and framework rely on empirical statistical comparisons of flip rates and effect sizes between targeted counterfactual edits and paraphrasing baselines, using public datasets such as MedQA and MedPerturb. These comparisons are presented as external evidence rather than quantities defined by construction from fitted parameters or internal definitions within the paper. The conceptual claim that every counterfactual edit bundles incidental surface-form variation (violating treatment variation irrelevance) draws from standard causal inference principles and is tested against independent data, without reducing to self-referential steps, self-citation chains, or renamed known results. No load-bearing derivations equate predictions to their own inputs.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests primarily on the domain assumption that paraphrases isolate general sensitivity to surface variation, plus standard statistical testing assumptions.

axioms (1)
  • domain assumption: Paraphrasing inputs creates meaning-preserving modifications that isolate general model sensitivity to surface-form variation.
    This is the core baseline used to establish that targeted counterfactuals are compound treatments violating treatment variation irrelevance.

pith-pipeline@v0.9.0 · 5602 in / 1359 out tokens · 68559 ms · 2026-05-09T18:58:38.243131+00:00 · methodology

