Compared to What? Baselines and Metrics for Counterfactual Prompting
Pith reviewed 2026-05-09 18:58 UTC · model grok-4.3
The pith
Counterfactual prompting studies must compare targeted edits to paraphrasing baselines before attributing effects to specific factors.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Every counterfactual edit is a compound treatment that bundles the variable of interest with incidental surface-form variation, violating treatment variation irrelevance. Flip rates from surgically changing patient gender (14.9 percent) are indistinguishable from those produced by paraphrasing the same inputs (14.1 percent). A statistical framework that compares target-intervention differences against paraphrasing baselines shows that most previously reported sensitivities in MedPerturb are no longer significant, while directional gender bias remains detectable in biography classification tasks. Per-sample distributional metrics detect effects more powerfully than aggregate or regression metrics.
What carries the argument
The statistical comparison framework that measures whether changes under a target intervention exceed those induced by paraphrasing the same inputs.
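The comparison can be sketched as a paired permutation test: under the null hypothesis that the targeted edit behaves like just another surface-form change, the "target" and "paraphrase" labels are exchangeable within each case. A minimal stdlib sketch (toy data and the function name are illustrative, not the paper's implementation):

```python
import random

def flip_rate_gap_test(target_flips, para_flips, n_perm=10_000, seed=0):
    """One-sided paired permutation test: does the targeted edit
    (e.g., a gender swap) flip predictions more often than a
    paraphrase of the same case?  Inputs are 0/1 flip indicators
    aligned by case index."""
    assert len(target_flips) == len(para_flips)
    n = len(target_flips)
    rng = random.Random(seed)
    observed = (sum(target_flips) - sum(para_flips)) / n
    at_least_as_extreme = 0
    for _ in range(n_perm):
        gap = 0
        for t, p in zip(target_flips, para_flips):
            if rng.random() < 0.5:  # swap the two labels within the case
                t, p = p, t
            gap += t - p
        if gap / n >= observed:
            at_least_as_extreme += 1
    return observed, at_least_as_extreme / n_perm

# Toy data mirroring the headline numbers: 15/100 flips under the
# gender edit vs 14/100 under paraphrasing of the same cases.
target = [1] * 15 + [0] * 85
para = [1] * 14 + [0] * 86
gap, p_value = flip_rate_gap_test(target, para)
```

On this toy data the p-value lands far above 0.05, which is the paper's point: a 14.9 percent flip rate is not evidence of gender sensitivity when paraphrasing alone produces 14.1 percent.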
If this is right
- Most reported sensitivities to patient demographics in prior MedPerturb analyses are no longer statistically supported once general sensitivity is accounted for.
- Only five of 120 tests reach significance after the paraphrasing baseline is applied.
- Per-sample metrics detect effects far more reliably than aggregate flip rates or regression models.
- The framework can still identify real directional bias, as shown by significant gender effects in occupational biography classification.
- Regression-based metrics uniquely characterize both the direction and magnitude of effects when they exist.
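Why per-sample metrics gain power can be illustrated with a paired test: cases that behave the same under both conditions (flipped under both, or neither) drop out, concentrating the comparison on the informative discordant pairs. A minimal exact McNemar sketch, offered as an illustration rather than the paper's exact metric:

```python
from math import comb

def mcnemar_exact_p(b, c):
    """Exact two-sided McNemar test on discordant pair counts.
    b: cases flipped only under the targeted edit,
    c: cases flipped only under paraphrasing.
    Concordant cases carry no information about the difference and
    drop out, which is where pairing gains power over comparing
    aggregate flip rates."""
    n = b + c
    if n == 0:
        return 1.0
    # Two-sided binomial tail probability at p = 1/2, capped at 1.
    tail = sum(comb(n, k) for k in range(min(b, c) + 1)) / 2 ** n
    return min(1.0, 2 * tail)
```

With 15 vs 5 discordant cases the test is significant at 0.05 despite a modest aggregate rate gap; with 10 vs 10 it correctly reports no directional evidence.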
Where Pith is reading between the lines
- The same baseline comparison could be extended to faithfulness checks in chain-of-thought prompting to test whether reasoning steps are truly causal.
- Models may treat many surface variations as equivalent noise, suggesting broader re-examination of perturbation-based evaluation methods.
- Developing paraphrases that better isolate surface form from semantic drift would strengthen the control condition.
Load-bearing premise
Paraphrasing generates valid meaning-preserving controls that introduce only incidental surface-form variation without other uncontrolled factors or model-specific artifacts.
What would settle it
A new experiment that applies multiple independent paraphrases and targeted gender edits to the same MedQA cases and finds the gender edits produce significantly higher flip rates under the paper's statistical test.
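That experiment could be scored with a simple one-sided sign test: each case's gender-edit flip indicator is compared against that same case's paraphrase flip fraction. A stdlib sketch under stated assumptions (flip indicators already collected from k independent paraphrases per case; the function name is hypothetical):

```python
from math import comb

def sign_test_excess(gender_flips, para_flip_fracs):
    """One-sided sign test: among cases where the gender edit and the
    paraphrase baseline disagree, does the gender edit flip more often
    than its own per-case baseline would predict?
    gender_flips[i] is 0/1; para_flip_fracs[i] in [0, 1] is the
    fraction of the k paraphrases that flipped case i."""
    wins = sum(1 for g, f in zip(gender_flips, para_flip_fracs) if g > f)
    losses = sum(1 for g, f in zip(gender_flips, para_flip_fracs) if g < f)
    n = wins + losses
    if n == 0:
        return 1.0
    # P(X >= wins) for X ~ Binomial(n, 1/2)
    return sum(comb(n, k) for k in range(wins, n + 1)) / 2 ** n
```

A significantly small p-value here would mean the gender edits flip predictions beyond what paraphrasing the same cases explains, which is exactly the finding that would settle the question.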
Original abstract
Counterfactual prompting (i.e., perturbing a single factor and measuring output change) is widely used to evaluate things like LLM bias and CoT faithfulness. But in this work we argue that observed effects cannot be attributed to the targeted factor without accounting for baseline "meaning-preserving" modifications to text that establish general model sensitivity. This is because every counterfactual edit is a compound treatment that bundles the variable of interest with incidental surface-form variation; this violates treatment variation irrelevance. We observe prediction flip rates on MedQA of 14.9% when we surgically change patient gender. However, this is statistically indistinguishable from the flip rates induced by simply paraphrasing inputs (14.1%). In this case, it would therefore be unwarranted to conclude that the LLM is especially sensitive to patient gender. To account for this and robustly measure the effects of targeted interventions, we propose a framework in which we compare (via statistical testing) differences observed under target interventions to those induced by paraphrasing inputs. We then use this framework to revisit an analysis done on the MedPerturb dataset, which reported evidence of model sensitivity to patient demographics and stylistic cues. We find that these effects largely dissipate when we account for general model sensitivity, with only 5 of 120 tests reaching statistical significance. Applying the same framework to occupational biography classification, we detect clearly significant directional gender bias, showing that the framework identifies real directional effects even when they are small. We evaluate a range of metrics -- aggregate, per-sample distributional, and regression -- and find that per-sample metrics are dramatically more powerful than aggregate metrics and regression powerfully and uniquely characterizes effect direction and magnitude.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper argues that counterfactual prompting effects (e.g., LLM sensitivity to patient gender or demographics) cannot be attributed to the targeted factor without baselines for meaning-preserving modifications, as every edit bundles the variable of interest with incidental surface-form changes that violate treatment variation irrelevance. On MedQA, gender perturbation yields a 14.9% flip rate, statistically indistinguishable from the 14.1% under paraphrasing; applying the framework to MedPerturb shows that most of the 120 tests lose significance (only 5 remain), while occupational biography classification detects robust directional gender bias. The work evaluates aggregate, per-sample distributional, and regression metrics, finding per-sample metrics far more powerful and regression uniquely effective for direction and magnitude.
Significance. If the framework holds, it supplies a statistically grounded method for isolating targeted effects in LLM counterfactual studies, strengthening validity of bias and faithfulness evaluations on public datasets. Credit is due for grounding claims in external statistical tests rather than internal parameters, demonstrating both null results (MedPerturb dissipation) and positive directional detection (occupational bias), and comparing multiple metric classes with concrete power differences.
Major comments (3)
- [Methods] Methods (paraphrase generation procedure): The exact generation process, model, prompt template, and validation steps for meaning preservation are not fully specified. This is load-bearing because if paraphrases systematically differ in semantic shift distribution or introduce model artifacts unmatched to the minimal lexical gender/demographic edits, the 14.9% vs 14.1% flip-rate comparison on MedQA does not establish that the targeted factor adds no extra sensitivity.
- [§4] §4 (MedPerturb results): The 120 tests are not enumerated by intervention type, and no multiple-comparison correction (e.g., Bonferroni or FDR) is reported for the claim that only 5 reach significance. Without this, the conclusion that effects 'largely dissipate' cannot be assessed for robustness against inflated Type I error.
- [§5] §5 (metric comparison): The claim that per-sample metrics are 'dramatically more powerful' than aggregate metrics lacks reported effect sizes, power curves, or sample-size calculations; the regression metric's unique characterization of direction/magnitude is asserted but not contrasted against alternatives via explicit coefficient tables or simulation.
Minor comments (3)
- [Abstract, §3] Abstract and §3: 'treatment variation irrelevance' is introduced without a formal definition or citation to the causal inference literature; a one-sentence gloss would improve accessibility.
- [Figures] Figure captions (if present): Ensure all axes, error bars, and statistical annotations are labeled with exact test names and p-value thresholds used for 'indistinguishable' claims.
- [References] References: Add citations for standard multiple-testing procedures and prior work on paraphrase-based controls in NLP evaluation.
Simulated Author's Rebuttal
We thank the referee for their insightful comments, which have pointed out opportunities to improve the transparency and statistical robustness of our work. We address each of the major comments in turn below.
Point-by-point responses
Referee: [Methods] Methods (paraphrase generation procedure): The exact generation process, model, prompt template, and validation steps for meaning preservation are not fully specified. This is load-bearing because if paraphrases systematically differ in semantic shift distribution or introduce model artifacts unmatched to the minimal lexical gender/demographic edits, the 14.9% vs 14.1% flip-rate comparison on MedQA does not establish that the targeted factor adds no extra sensitivity.
Authors: We agree that additional details on the paraphrase generation are necessary to fully substantiate the baseline comparison. In the revised manuscript, we will specify the exact model and version used for paraphrase generation, provide the complete prompt template, and describe the validation steps, including any quantitative measures of semantic similarity and qualitative checks for meaning preservation. This will enable readers to evaluate whether the paraphrases serve as an appropriate control for incidental surface-form changes. revision: yes
Referee: [§4] §4 (MedPerturb results): The 120 tests are not enumerated by intervention type, and no multiple-comparison correction (e.g., Bonferroni or FDR) is reported for the claim that only 5 reach significance. Without this, the conclusion that effects 'largely dissipate' cannot be assessed for robustness against inflated Type I error.
Authors: The referee is correct that a full enumeration and correction for multiple testing would strengthen the results section. We will revise §4 to include a breakdown of the 120 tests by intervention category (e.g., demographic, stylistic), list the specific tests that remain significant, and apply an FDR correction, reporting both raw and adjusted p-values. This will allow a more rigorous assessment of whether the effects largely dissipate. revision: yes
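The correction the authors commit to can be sketched with the Benjamini-Hochberg step-up procedure in a few lines of stdlib Python (the p-values below are hypothetical placeholders, not the paper's 120 tests):

```python
def benjamini_hochberg(pvals, alpha=0.05):
    """Benjamini-Hochberg step-up: returns one reject/keep flag per
    p-value while controlling the false discovery rate at alpha."""
    m = len(pvals)
    order = sorted(range(m), key=lambda i: pvals[i])
    k_max = 0
    for rank, i in enumerate(order, start=1):
        if pvals[i] <= rank * alpha / m:
            k_max = rank  # largest rank whose p-value passes its threshold
    reject = [False] * m
    for rank, i in enumerate(order, start=1):
        if rank <= k_max:
            reject[i] = True
    return reject

# Hypothetical p-values standing in for a battery of perturbation tests.
flags = benjamini_hochberg([0.003, 0.012, 0.021, 0.19, 0.44, 0.71])
# -> the first three survive correction, the rest do not
```

Reporting both raw and BH-adjusted significance, as promised, would let readers check whether the "5 of 120" figure is robust to inflated Type I error.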
Referee: [§5] §5 (metric comparison): The claim that per-sample metrics are 'dramatically more powerful' than aggregate metrics lacks reported effect sizes, power curves, or sample-size calculations; the regression metric's unique characterization of direction/magnitude is asserted but not contrasted against alternatives via explicit coefficient tables or simulation.
Authors: We acknowledge that the metric comparison section would benefit from more quantitative support. In the revision, we will add effect size calculations for the differences in power between per-sample and aggregate metrics, include power analysis or curves based on our sample sizes, and provide a table of regression coefficients along with a brief simulation study to illustrate the advantages of the regression approach in capturing direction and magnitude compared to other metrics. revision: yes
Circularity Check
No significant circularity
Full rationale
The paper's central argument and framework rely on empirical statistical comparisons of flip rates and effect sizes between targeted counterfactual edits and paraphrasing baselines, using public datasets such as MedQA and MedPerturb. These comparisons are presented as external evidence rather than quantities defined by construction from fitted parameters or internal definitions within the paper. The conceptual claim that every counterfactual edit bundles incidental surface-form variation (violating treatment variation irrelevance) draws from standard causal inference principles and is tested against independent data, without reducing to self-referential steps, self-citation chains, or renamed known results. No load-bearing derivations equate predictions to their own inputs.
Axiom & Free-Parameter Ledger
Axioms (1)
- Domain assumption: Paraphrasing inputs creates meaning-preserving modifications that isolate general model sensitivity to surface-form variation.