arxiv: 2310.01798 · v2 · submitted 2023-10-03 · 💻 cs.CL · cs.AI

Recognition: 2 theorem links

· Lean Theorem

Large Language Models Cannot Self-Correct Reasoning Yet

Adams Wei Yu, Denny Zhou, Huaixiu Steven Zheng, Jie Huang, Swaroop Mishra, Xinying Song, Xinyun Chen

Authors on Pith no claims yet

Pith reviewed 2026-05-12 05:43 UTC · model grok-4.3

classification 💻 cs.CL cs.AI

keywords large language modelsself-correctionreasoningintrinsic self-correctionLLM evaluationfeedback in AImodel limitations

0 comments

The pith

Large language models struggle to self-correct their reasoning without external feedback.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper investigates whether large language models can improve their reasoning by correcting their own initial answers using only their built-in knowledge. It concludes that this intrinsic self-correction does not reliably work and can even lower performance on reasoning tasks. This matters because many proposed ways to make LLMs more accurate assume they can spot and fix their own mistakes. If true, it means developers cannot count on the model to refine its reasoning autonomously in complex problems. Instead, external checks or different approaches may be needed for dependable results.

Core claim

Central to our investigation is the notion of intrinsic self-correction, whereby an LLM attempts to correct its initial responses based solely on its inherent capabilities, without the crutch of external feedback. In the context of reasoning, our research indicates that LLMs struggle to self-correct their responses without external feedback, and at times, their performance even degrades after self-correction.

What carries the argument

Intrinsic self-correction, the process where an LLM revises its answers using only its own internal capabilities without any external input or feedback.

If this is right

LLMs' performance on reasoning tasks does not improve and may decline when relying on self-correction alone.
Methods for self-correction in LLMs require external feedback to be effective.
Practical applications should incorporate external verification rather than depending on the model's self-revision.
Future studies on LLM reasoning should account for the limitations of intrinsic correction mechanisms.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Hybrid AI systems that combine LLMs with separate verification tools could address the identified shortcomings.
The findings might apply to other tasks beyond reasoning if similar patterns hold in different domains.
As models grow larger, it remains to be seen if intrinsic self-correction becomes viable without changes to training or architecture.

Load-bearing premise

The definition of intrinsic self-correction used in the experiments reflects how self-correction would work in actual use cases without outside help, and the selected benchmarks capture the full range of reasoning abilities in LLMs.

What would settle it

A demonstration that LLMs achieve higher accuracy on standard reasoning benchmarks after being prompted to self-correct their initial responses, with no additional information provided from outside the model.

read the original abstract

Large Language Models (LLMs) have emerged as a groundbreaking technology with their unparalleled text generation capabilities across various applications. Nevertheless, concerns persist regarding the accuracy and appropriateness of their generated content. A contemporary methodology, self-correction, has been proposed as a remedy to these issues. Building upon this premise, this paper critically examines the role and efficacy of self-correction within LLMs, shedding light on its true potential and limitations. Central to our investigation is the notion of intrinsic self-correction, whereby an LLM attempts to correct its initial responses based solely on its inherent capabilities, without the crutch of external feedback. In the context of reasoning, our research indicates that LLMs struggle to self-correct their responses without external feedback, and at times, their performance even degrades after self-correction. Drawing from these insights, we offer suggestions for future research and practical applications in this field.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

LLMs fail at intrinsic self-correction on reasoning tasks and sometimes get worse, based on direct before-and-after comparisons.

read the letter

LLMs struggle to self-correct reasoning without external feedback, and their performance can drop after trying. That's the main result from this paper. They run experiments on reasoning tasks, starting with initial answers from models like GPT-3.5 and then prompting for self-correction. The accuracy either stays the same or falls compared to the first try. This is tested on benchmarks like GSM8K and others. The work does a good job of isolating the intrinsic case—no tools, no human input, just the model talking to itself. It directly counters some recent claims that self-correction is a reliable way to improve outputs. The degradation finding is particularly useful because it shows the risk of assuming the model can fix its errors on its own. The soft spot is the reliance on specific correction prompts without much variation or ablation. If a different way of asking for corrections worked better, the 'cannot' claim would be weaker. They also stick to a handful of models and tasks, so generalization needs checking. But the basic pattern seems solid from the results they report. This paper is for researchers and practitioners who want to understand the limits of current LLMs on reasoning. It gives a clear signal that external verification or feedback is still necessary. Readers interested in AI reliability will get value from the empirical data. I would send it for peer review. The finding is important enough to get referee input, even if some details on prompting could be expanded.

Referee Report

2 major / 2 minor

Summary. The manuscript empirically examines whether large language models can perform intrinsic self-correction on reasoning tasks without external feedback. The central claim is that LLMs struggle to improve (and sometimes degrade) their initial outputs when prompted to self-review and correct, based on experiments using self-correction prompts on reasoning benchmarks such as GSM8K.

Significance. If the results hold under more varied conditions, the work provides useful evidence against the assumption that current LLMs can reliably self-improve reasoning via internal reflection alone. This has practical implications for LLM deployment in reasoning-heavy applications and points to the need for external feedback or alternative mechanisms. The direct before/after performance comparisons constitute a clear empirical contribution.

major comments (2)

[Methods / Experimental Procedure] The experimental definition of intrinsic self-correction rests on a fixed set of correction prompts (detailed in the methods section for generating revised answers). No ablation is reported on prompt phrasing, addition of explicit error-checking steps, or multi-turn verification. If alternative intrinsic prompts yield gains on the same benchmarks, the observed lack of improvement or degradation does not establish that self-correction is impossible without external feedback. This assumption is load-bearing for the title and abstract claim.
[Abstract and Experiments] The abstract and experimental sections lack sufficient detail on exact benchmark subsets, sample sizes, full prompt templates, controls for prompt sensitivity, and statistical significance testing of performance changes (including reported degradations). Without these, the support for the claim that self-correction leads to degradation is plausible but not fully verifiable from the provided description.

minor comments (2)

[Abstract] The abstract could more explicitly name the primary benchmarks (e.g., GSM8K) and quantify the observed changes rather than using only qualitative language.
[Introduction] A few sentences in the introduction repeat the general limitations of LLMs; tightening would improve readability.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on our manuscript. The comments highlight important aspects of experimental rigor and claim precision that we will address in revision. We respond to each major comment below.

read point-by-point responses

Referee: [Methods / Experimental Procedure] The experimental definition of intrinsic self-correction rests on a fixed set of correction prompts (detailed in the methods section for generating revised answers). No ablation is reported on prompt phrasing, addition of explicit error-checking steps, or multi-turn verification. If alternative intrinsic prompts yield gains on the same benchmarks, the observed lack of improvement or degradation does not establish that self-correction is impossible without external feedback. This assumption is load-bearing for the title and abstract claim.

Authors: We agree that our primary experiments rely on a representative set of self-correction prompts drawn from prior literature rather than exhaustively testing all possible intrinsic formulations. This does not rule out the existence of some specialized prompt that could enable gains. To address this directly, we will add an ablation section comparing the original prompts against variants that incorporate explicit error-checking instructions and multi-turn verification loops. We will also revise the title to 'Large Language Models Struggle to Self-Correct Reasoning Without External Feedback' and update the abstract to emphasize that the observed lack of improvement holds for standard intrinsic self-correction prompts, thereby removing the stronger absolute claim while preserving the empirical contribution. revision: yes
Referee: [Abstract and Experiments] The abstract and experimental sections lack sufficient detail on exact benchmark subsets, sample sizes, full prompt templates, controls for prompt sensitivity, and statistical significance testing of performance changes (including reported degradations). Without these, the support for the claim that self-correction leads to degradation is plausible but not fully verifiable from the provided description.

Authors: We acknowledge that additional methodological details are needed for full reproducibility and verifiability. In the revised manuscript we will: (1) specify the exact benchmark subsets and sample sizes (e.g., full GSM8K test set with N=1319); (2) include all prompt templates in a new appendix; (3) report controls such as temperature=0, multiple random seeds, and prompt-sensitivity checks; and (4) add statistical significance testing (paired t-tests and bootstrap confidence intervals) for all before/after performance differences, including the reported degradations. These additions will be reflected in both the abstract and the experimental sections. revision: yes

Circularity Check

0 steps flagged

Empirical evaluation with no derivation chain or self-referential reductions

full rationale

This is a purely empirical study that generates initial LLM responses on reasoning benchmarks (e.g., GSM8K), applies fixed self-correction prompts, and measures accuracy against external ground truth. No equations, fitted parameters, predictions derived from inputs, or mathematical derivations are present. All results are direct comparisons of model outputs to labeled data and are externally falsifiable. No steps reduce by construction to the paper's own definitions or self-citations; the central claim rests on observable performance deltas rather than any self-definitional or fitted-input structure.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The central claim rests on the assumption that self-correction without external feedback can be isolated in experiments and that performance changes reflect true capability rather than prompt artifacts.

axioms (1)

domain assumption Intrinsic self-correction is defined as the LLM revising its output based solely on its own previous generation without any external input or tools.
This definition is used to distinguish the tested method from other correction approaches and is central to the negative finding.

pith-pipeline@v0.9.0 · 5462 in / 1150 out tokens · 37335 ms · 2026-05-12T05:43:38.508359+00:00 · methodology

discussion (0)

Forward citations

Cited by 29 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

AgentForesight: Online Auditing for Early Failure Prediction in Multi-Agent Systems
cs.CL 2026-05 unverdicted novelty 7.0

AgentForesight trains a 7B model to perform online auditing of multi-agent LLM trajectories, detecting early decisive errors and outperforming larger models on custom and external benchmarks.
AgentPSO: Evolving Agent Reasoning Skill via Multi-agent Particle Swarm Optimization
cs.AI 2026-05 unverdicted novelty 7.0

AgentPSO evolves reusable multi-agent reasoning skills via PSO-inspired natural-language updates, outperforming static agents and test-time multi-agent baselines on math and general reasoning tasks with cross-benchmar...
When Does Critique Improve AI-Assisted Theoretical Physics? SCALAR: Structured Critic--Actor Loop for Agentic Reasoning
cs.AI 2026-05 unverdicted novelty 7.0

Structured critic-actor loops improve AI performance on theoretical physics reasoning tasks, with benefits strongest in asymmetric model pairings using constructive feedback.
Perturbation Dose Responses in Recursive LLM Loops: Raw Switching, Stochastic Floors, and Persistent Escape under Append, Replace, and Dialog Updates
cs.AI 2026-05 unverdicted novelty 7.0

In 30-step recursive LLM loops, append-mode persistent escape from source basins reaches 50% near 400 tokens under full history but plateaus below 50% under tail-clip memory policy, while replace-mode switching largel...
The Cost of Consensus: Isolated Self-Correction Prevails Over Unguided Homogeneous Multi-Agent Debate
cs.MA 2026-04 unverdicted novelty 7.0

Homogeneous multi-agent debate introduces sycophantic conformity, contextual fragility, and consensus collapse, leading to equal or lower accuracy than isolated self-correction at 2.1-3.4x higher token cost on GSM-Har...
OPT-BENCH: Evaluating the Iterative Self-Optimization of LLM Agents in Large-Scale Search Spaces
cs.AI 2026-05 unverdicted novelty 6.0

OPT-BENCH and OPT-Agent evaluate LLM self-optimization in large search spaces, showing stronger models improve via feedback but stay constrained by base capacity and below human performance.
ReFlect: An Effective Harness System for Complex Long-Horizon LLM Reasoning
cs.AI 2026-05 unverdicted novelty 6.0

ReFlect is a harness that wraps LLMs to detect and recover from reasoning errors, achieving 7-29 pp gains over direct CoT on long-horizon tasks and improving code patch quality to 82-87%.
Temporal Reasoning Is Not the Bottleneck: A Probabilistic Inconsistency Framework for Neuro-Symbolic QA
cs.AI 2026-05 unverdicted novelty 6.0

Temporal reasoning is not the core bottleneck for LLMs on time-based QA; the real issue is unstructured text-to-event mapping, addressed by a neuro-symbolic system with PIS that reaches 100% accuracy on benchmarks whe...
Process Supervision via Verbal Critique Improves Reasoning in Large Language Models
cs.CL 2026-04 unverdicted novelty 6.0

Verbal Process Supervision uses structured critiques from stronger models in an iterative loop to improve LLM reasoning, reaching 94.9% on GPQA Diamond and large gains on AIME 2025.
SOCIA-EVO: Automated Simulator Construction via Dual-Anchored Bi-Level Optimization
cs.AI 2026-04 unverdicted novelty 6.0

SOCIA-EVO generates statistically consistent simulators by separating structural refinement from parameter calibration via bi-level optimization and falsifying strategies through execution feedback in a Bayesian-weigh...
LACE: Lattice Attention for Cross-thread Exploration
cs.AI 2026-04 unverdicted novelty 6.0

LACE enables parallel reasoning paths in LLMs to communicate via lattice attention and error-correct using synthetic training data, improving accuracy by over 7 points over standard parallel search.
When Verification Fails: How Compositionally Infeasible Claims Escape Rejection
cs.CL 2026-04 unverdicted novelty 6.0

AI claim verification models rely on salient-constraint shortcuts instead of full compositional reasoning under the closed-world assumption, as revealed by their over-acceptance of claims with supported salient constr...
Verify Before You Fix: Agentic Execution Grounding for Trustworthy Cross-Language Code Analysis
cs.SE 2026-04 unverdicted novelty 6.0

A framework combining universal AST normalization, hybrid graph-LLM embeddings, and strict execution-grounded validation achieves 89-92% intra-language accuracy and 74-80% cross-language F1 while resolving 70% of vuln...
FACT-E: Causality-Inspired Evaluation for Trustworthy Chain-of-Thought Reasoning
cs.AI 2026-04 unverdicted novelty 6.0

FACT-E uses controlled perturbations as an instrumental signal to measure intra-chain faithfulness in CoT reasoning and combines it with answer consistency to select trustworthy trajectories.
From Local to Global: A Graph RAG Approach to Query-Focused Summarization
cs.CL 2024-04 unverdicted novelty 6.0

GraphRAG improves comprehensiveness and diversity of answers to global questions over million-token document sets by constructing entity graphs and hierarchical community summaries before combining partial responses.
LiveCodeBench: Holistic and Contamination Free Evaluation of Large Language Models for Code
cs.SE 2024-03 unverdicted novelty 6.0

LiveCodeBench collects 400 recent contest problems to create a contamination-free benchmark evaluating LLMs on code generation and related capabilities like self-repair and execution.
Math-Shepherd: Verify and Reinforce LLMs Step-by-step without Human Annotations
cs.AI 2023-12 conditional novelty 6.0

Math-Shepherd is an automatically trained process reward model that scores solution steps to verify and reinforce LLMs, lifting Mistral-7B from 77.9% to 89.1% on GSM8K and 28.6% to 43.5% on MATH.
Decodable but Not Corrected by Fixed Residual-Stream Linear Steering: Evidence from Medical LLM Failure Regimes
cs.AI 2026-05 unverdicted novelty 5.0

Overthinking in medical QA is linearly decodable at 71.6% accuracy yet fixed residual-stream steering yields no correction across 29 configurations, while enabling selective abstention with AUROC 0.610.
Multi-Agent Reasoning Improves Compute Efficiency: Pareto-Optimal Test-Time Scaling
cs.AI 2026-05 unverdicted novelty 5.0

Multi-agent debate and mixture-of-agents outperform self-consistency by 1.3 and 2.7 percentage points respectively at equal compute budgets on MMLU-Pro and BBH, with advantages that continue at higher scales while sel...
ReMedi: Reasoner for Medical Clinical Prediction
cs.CL 2026-05 unverdicted novelty 5.0

ReMedi boosts LLM performance on EHR clinical predictions by up to 19.9% F1 through ground-truth-guided rationale regeneration and fine-tuning.
State Representation and Termination for Recursive Reasoning Systems
cs.AI 2026-05 unverdicted novelty 5.0

Recursive reasoning systems can represent their state via an epistemic state graph and terminate when the linearized order-gap is non-degenerate near the fixed point, providing a local condition for when the stopping ...
Measure Twice, Click Once: Co-evolving Proposer and Visual Critic via Reinforcement Learning for GUI Grounding
cs.LG 2026-04 unverdicted novelty 5.0

A co-evolving proposer-critic RL framework improves GUI grounding accuracy by letting the model critique its own proposals rendered on screenshots.
LACE: Lattice Attention for Cross-thread Exploration
cs.AI 2026-04 unverdicted novelty 5.0

LACE enables concurrent reasoning paths in LLMs to interact via lattice attention and a synthetic training pipeline, raising accuracy more than 7 points over independent parallel search.
LACE: Lattice Attention for Cross-thread Exploration
cs.AI 2026-04 unverdicted novelty 5.0

LACE adds lattice attention to let parallel LLM reasoning threads interact and correct errors, raising accuracy over 7 points versus standard independent sampling.
From Hallucination to Structure Snowballing: The Alignment Tax of Constrained Decoding in LLM Reflection
cs.CL 2026-04 unverdicted novelty 5.0

Enforcing structured reflection via Outlines-based constrained decoding on an 8B LLM triggers structure snowballing instead of better self-correction, producing near-perfect syntax but persistent semantic errors and r...
Understanding Performance Gap Between Parallel and Sequential Sampling in Large Reasoning Models
cs.CL 2026-04 unverdicted novelty 5.0

Lack of exploration from conditioning on prior answers is the primary reason parallel sampling outperforms sequential sampling in large reasoning models.
How Many Tries Does It Take? Iterative Self-Repair in LLM Code Generation Across Model Scales and Benchmarks
cs.SE 2026-04 unverdicted novelty 4.0

Iterative self-repair improves LLM code pass rates by 4.9-17.1 pp on HumanEval and 16-30 pp on MBPP across seven models, with gains concentrated early and syntax errors easier to fix than logical ones.
IACDM: Interactive Adversarial Convergence Development Methodology -- A Structured Framework for AI-Assisted Software Development
cs.SE 2026-03 unverdicted novelty 4.0

IACDM is an 8-phase methodology using external verification agents and three pillars to close the verification gap in stochastic LLM-based software development.
LLMs-as-Judges: A Comprehensive Survey on LLM-based Evaluation Methods
cs.CL 2024-12 accept novelty 3.0

A survey that organizes LLMs-as-judges research into functionality, methodology, applications, meta-evaluation, and limitations.

Reference graph

Works this paper leans on

25 extracted references · 25 canonical work pages · cited by 27 Pith papers · 9 internal anchors

[1]

PaLM 2 Technical Report

Rohan Anil, Andrew M Dai, Orhan Firat, Melvin Johnson, Dmitry Lepikhin, Alexandre Passos, Siamak Shakeri, Emanuel Taropa, Paige Bailey, Zhifeng Chen, et al. Palm 2 technical report. arXiv preprint arXiv:2305.10403,

work page internal anchor Pith review arXiv
[2]

Constitutional AI: Harmlessness from AI Feedback

9 Published as a conference paper at ICLR 2024 Yuntao Bai, Saurav Kadavath, Sandipan Kundu, Amanda Askell, Jackson Kernion, Andy Jones, Anna Chen, Anna Goldie, Azalia Mirhoseini, Cameron McKinnon, et al. Constitutional ai: Harm- lessness from ai feedback. arXiv preprint arXiv:2212.08073,

work page internal anchor Pith review Pith/arXiv arXiv 2024
[3]

Reconcile: Round-table conference improves reasoning via consensus among diverse llms

Justin Chih-Yao Chen, Swarnadeep Saha, and Mohit Bansal. Reconcile: Round-table conference improves reasoning via consensus among diverse llms. arXiv preprint arXiv:2309.13007, 2023a. Xinyun Chen, Maxwell Lin, Nathanael Sch ¨arli, and Denny Zhou. Teaching large language models to self-debug. arXiv preprint arXiv:2304.05128, 2023b. Aakanksha Chowdhery, Sha...

work page arXiv
[4]

Training Verifiers to Solve Math Word Problems

URL http: //jmlr.org/papers/v24/22-1144.html. Karl Cobbe, Vineet Kosaraju, Mohammad Bavarian, Mark Chen, Heewoo Jun, Lukasz Kaiser, Matthias Plappert, Jerry Tworek, Jacob Hilton, Reiichiro Nakano, et al. Training verifiers to solve math word problems. arXiv preprint arXiv:2110.14168,

work page internal anchor Pith review Pith/arXiv arXiv
[5]

Improving Factuality and Reasoning in Language Models through Multiagent Debate

Yilun Du, Shuang Li, Antonio Torralba, Joshua B Tenenbaum, and Igor Mordatch. Improv- ing factuality and reasoning in language models through multiagent debate. arXiv preprint arXiv:2305.14325,

work page internal anchor Pith review Pith/arXiv arXiv
[6]

The capacity for moral self-correction in large language models

Deep Ganguli, Amanda Askell, Nicholas Schiefer, Thomas Liao, Kamil ˙e Lukoˇsi¯ut˙e, Anna Chen, Anna Goldie, Azalia Mirhoseini, Catherine Olsson, Danny Hernandez, et al. The capacity for moral self-correction in large language models. arXiv preprint arXiv:2302.07459,

work page arXiv
[7]

CRITIC: Large Language Models Can Self-Correct with Tool-Interactive Critiquing

Zhibin Gou, Zhihong Shao, Yeyun Gong, Yelong Shen, Yujiu Yang, Nan Duan, and Weizhu Chen. Critic: Large language models can self-correct with tool-interactive critiquing. arXiv preprint arXiv:2305.11738,

work page internal anchor Pith review arXiv
[8]

Towards reasoning in large language models: A sur- vey

Jie Huang and Kevin Chen-Chuan Chang. Towards reasoning in large language models: A sur- vey. In Findings of the Association for Computational Linguistics: ACL 2023 . Association for Computational Linguistics,

work page 2023
[9]

Are large pre-trained language models leaking your personal information? In Findings of the Association for Computational Linguis- tics: EMNLP 2022 , pp

10 Published as a conference paper at ICLR 2024 Jie Huang, Hanyin Shao, and Kevin Chen-Chuan Chang. Are large pre-trained language models leaking your personal information? In Findings of the Association for Computational Linguis- tics: EMNLP 2022 , pp. 2038–2047, Abu Dhabi, United Arab Emirates,

work page 2024
[10]

Multi- step jailbreaking privacy attacks on chatgpt

Haoran Li, Dadi Guo, Wei Fan, Mingshi Xu, Jie Huang, Fanpu Meng, and Yangqiu Song. Multi- step jailbreaking privacy attacks on chatgpt. In Findings of the Association for Computational Linguistics: EMNLP 2023, pp. 4138–4153,

work page 2023
[11]

Encouraging Divergent Thinking in Large Language Models through Multi-Agent Debate

Tian Liang, Zhiwei He, Wenxiang Jiao, Xing Wang, Yan Wang, Rui Wang, Yujiu Yang, Zhaopeng Tu, and Shuming Shi. Encouraging divergent thinking in large language models through multi- agent debate. arXiv preprint arXiv:2305.19118,

work page internal anchor Pith review arXiv
[12]

Let's Verify Step by Step

Hunter Lightman, Vineet Kosaraju, Yura Burda, Harri Edwards, Bowen Baker, Teddy Lee, Jan Leike, John Schulman, Ilya Sutskever, and Karl Cobbe. Let’s verify step by step. arXiv preprint arXiv:2305.20050,

work page internal anchor Pith review Pith/arXiv arXiv
[13]

Demystifying gpt self-repair for code generation

Theo X Olausson, Jeevana Priya Inala, Chenglong Wang, Jianfeng Gao, and Armando Solar- Lezama. Demystifying gpt self-repair for code generation. arXiv preprint arXiv:2306.09896 ,

work page arXiv
[14]

Automatically correcting large language models: Surveying the landscape of diverse self-correction strategies,

Liangming Pan, Michael Saxon, Wenda Xu, Deepak Nathani, Xinyi Wang, and William Yang Wang. Automatically correcting large language models: Surveying the landscape of diverse self- correction strategies. arXiv preprint arXiv:2308.03188,

work page arXiv
[15]

Refiner: Reasoning feedback on intermediate representations

Debjit Paul, Mete Ismayilzada, Maxime Peyrard, Beatriz Borges, Antoine Bosselut, Robert West, and Boi Faltings. Refiner: Reasoning feedback on intermediate representations. arXiv preprint arXiv:2304.01904,

work page arXiv
[16]

Quantifying association capabilities of large language models and its implications on privacy leakage

Hanyin Shao, Jie Huang, Shen Zheng, and Kevin Chen-Chuan Chang. Quantifying association capabilities of large language models and its implications on privacy leakage. arXiv preprint arXiv:2305.12707,

work page arXiv
[17]

Commonsenseqa: A question answering challenge targeting commonsense knowledge

Alon Talmor, Jonathan Herzig, Nicholas Lourie, and Jonathan Berant. Commonsenseqa: A question answering challenge targeting commonsense knowledge. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Lan- guage Technologies, Volume 1 (Long and Short Papers), pp. 4149–4158,

work page 2019
[18]

Llama 2: Open Foundation and Fine-Tuned Chat Models

Hugo Touvron, Louis Martin, Kevin Stone, Peter Albert, Amjad Almahairi, Yasmine Babaei, Niko- lay Bashlykov, Soumya Batra, Prajjwal Bhargava, Shruti Bhosale, et al. Llama 2: Open founda- tion and fine-tuned chat models. arXiv preprint arXiv:2307.09288,

work page internal anchor Pith review Pith/arXiv arXiv
[19]

Can chatgpt defend its belief in truth? evaluating llm reasoning via debate

11 Published as a conference paper at ICLR 2024 Boshi Wang, Xiang Yue, and Huan Sun. Can chatgpt defend its belief in truth? evaluating llm reasoning via debate. In Findings of the Association for Computational Linguistics: EMNLP 2023, pp. 11865–11881, 2023a. Tianlu Wang, Ping Yu, Xiaoqing Ellen Tan, Sean O’Brien, Ramakanth Pasunuru, Jane Dwivedi-Yu, Olga...

work page arXiv 2024
[20]

Jailbroken: How Does LLM Safety Training Fail?

Alexander Wei, Nika Haghtalab, and Jacob Steinhardt. Jailbroken: How does llm safety training fail? arXiv preprint arXiv:2307.02483,

work page internal anchor Pith review arXiv
[21]

Hotpotqa: A dataset for diverse, explainable multi-hop question answering

Zhilin Yang, Peng Qi, Saizheng Zhang, Yoshua Bengio, William Cohen, Ruslan Salakhutdinov, and Christopher D Manning. Hotpotqa: A dataset for diverse, explainable multi-hop question answering. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing. Association for Computational Linguistics,

work page 2018
[22]

Why does chatgpt fall short in providing truthful answers? ArXiv preprint, abs/2304.10513,

Shen Zheng, Jie Huang, and Kevin Chen-Chuan Chang. Why does chatgpt fall short in providing truthful answers? ArXiv preprint, abs/2304.10513,

work page arXiv
[23]

Solving challenging math word problems using gpt-4 code interpreter with code-based self-verification

Aojun Zhou, Ke Wang, Zimu Lu, Weikang Shi, Sichun Luo, Zipeng Qin, Shaoqing Lu, Anya Jia, Linqi Song, Mingjie Zhan, et al. Solving challenging math word problems using gpt-4 code interpreter with code-based self-verification. arXiv preprint arXiv:2308.07921, 2023a. Denny Zhou, Nathanael Sch ¨arli, Le Hou, Jason Wei, Nathan Scales, Xuezhi Wang, Dale Schuur...

work page arXiv
[24]

She invited 16 friends

12 Published as a conference paper at ICLR 2024 A P ROMPTS AND EXAMPLE OUTPUTS Can you solve the following math problem? Christina is planning a birthday party and needs .75 gift bags per invited guest, because 1/4 of attendees don't show up. She invited 16 friends. Gift bags are $2 each. How much will she spend? Explain your reasoning. Your final answer ...

work page 2024
[25]

what concepts from the concept list are missing from the sentence?

Since Christina needs 0.75 gift bags per invited guest, we can multiply this quantity by the number of invited guests to find the total number of gift bags needed, which is 0.75 * 12 = 9 gift bags. If each gift bag is $2, then Christina will spend a total of 2 * 9 = $<<2*9=18>>18 on gift bags. Therefore, Christina will spend $18 on gift bags. Answer: \box...

work page 2024