Automatic Chain of Thought Prompting in Large Language Models
Recognition: 2 theorem links · Lean Theorem
Pith reviewed 2026-05-16 10:36 UTC · model grok-4.3
The pith
Auto-CoT lets large language models build their own chain-of-thought demonstrations by sampling diverse questions.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Auto-CoT automatically constructs demonstrations by sampling questions with diversity and generating reasoning chains one by one using the 'Let's think step by step' prompt. On ten public benchmark reasoning tasks with GPT-3, Auto-CoT consistently matches or exceeds the performance of the CoT paradigm that requires manual designs of demonstrations.
What carries the argument
Auto-CoT, which samples questions for diversity then uses the model itself to generate reasoning chains that form the prompt demonstrations.
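As a concrete illustration, the pipeline can be sketched in a few lines of Python. This is a minimal sketch, not the authors' released code: the embedding model name is an illustrative choice, and complete(prompt) stands in for an unspecified LLM API call.

    # Auto-CoT sketch: cluster questions for diversity, then let the model
    # write its own rationale for one representative question per cluster.
    # Assumes sentence-transformers and scikit-learn are installed;
    # complete(prompt) is a hypothetical LLM-call wrapper.
    import numpy as np
    from sentence_transformers import SentenceTransformer
    from sklearn.cluster import KMeans

    def build_demonstrations(questions, k=8):
        encoder = SentenceTransformer("all-MiniLM-L6-v2")  # illustrative model
        embeddings = encoder.encode(questions)             # one vector per question
        km = KMeans(n_clusters=k, n_init=10).fit(embeddings)
        demos = []
        for c in range(k):
            members = np.where(km.labels_ == c)[0]
            center = km.cluster_centers_[c]
            # representative question: the one closest to the cluster centroid
            best = min(members, key=lambda i: np.linalg.norm(embeddings[i] - center))
            q = questions[best]
            rationale = complete(f"Q: {q}\nA: Let's think step by step.")
            demos.append(f"Q: {q}\nA: Let's think step by step. {rationale}")
        return "\n\n".join(demos)

At inference time, the returned demonstrations are simply prepended to the test question, followed by the same step-by-step cue.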
If this is right
- Task-specific manual demonstration design becomes unnecessary for chain-of-thought prompting.
- Reasoning performance on new tasks can be obtained with only a simple prompt and access to the model.
- Diversity sampling compensates for imperfect reasoning chains in the constructed examples.
- The same automatic construction process can be applied across multiple reasoning benchmarks without per-task tuning.
Where Pith is reading between the lines
- The method may extend to models other than GPT-3 if they respond reliably to the 'Let's think step by step' prompt.
- Fully automatic demonstration construction could enable rapid adaptation of prompting techniques to new domains.
- Further improvements might come from better diversity measures or iterative refinement of the generated chains.
- Real-world systems could use this to deploy step-by-step reasoning without expert prompt engineers.
Load-bearing premise
Selecting questions for diversity is enough to keep the overall demonstrations effective even when some generated reasoning chains contain mistakes.
What would settle it
Running Auto-CoT on the same ten benchmarks but with random instead of diverse question sampling, and finding that performance drops below manual CoT on most tasks.
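Under the same assumptions as the sketch above (build_demonstrations, the hypothetical complete() wrapper, plus a hypothetical evaluate(demos, benchmark) returning accuracy), the proposed experiment is short to state:

    import random

    def sampling_ablation(questions, benchmark, k=8, seed=0):
        # Diversity condition: Auto-CoT's cluster-based selection.
        diverse_acc = evaluate(build_demonstrations(questions, k=k), benchmark)
        # Control condition: identical chain generation, but over k
        # uniformly random questions instead of cluster representatives.
        rng = random.Random(seed)
        demos = []
        for q in rng.sample(questions, k):
            rationale = complete(f"Q: {q}\nA: Let's think step by step.")
            demos.append(f"Q: {q}\nA: Let's think step by step. {rationale}")
        random_acc = evaluate("\n\n".join(demos), benchmark)
        # The load-bearing premise fails if random_acc drops below the
        # manual-CoT baseline on most tasks while diverse_acc does not.
        return {"diverse": diverse_acc, "random": random_acc}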
Original abstract
Large language models (LLMs) can perform complex reasoning by generating intermediate reasoning steps. Providing these steps for prompting demonstrations is called chain-of-thought (CoT) prompting. CoT prompting has two major paradigms. One leverages a simple prompt like "Let's think step by step" to facilitate step-by-step thinking before answering a question. The other uses a few manual demonstrations one by one, each composed of a question and a reasoning chain that leads to an answer. The superior performance of the second paradigm hinges on the hand-crafting of task-specific demonstrations one by one. We show that such manual efforts may be eliminated by leveraging LLMs with the "Let's think step by step" prompt to generate reasoning chains for demonstrations one by one, i.e., let's think not just step by step, but also one by one. However, these generated chains often come with mistakes. To mitigate the effect of such mistakes, we find that diversity matters for automatically constructing demonstrations. We propose an automatic CoT prompting method: Auto-CoT. It samples questions with diversity and generates reasoning chains to construct demonstrations. On ten public benchmark reasoning tasks with GPT-3, Auto-CoT consistently matches or exceeds the performance of the CoT paradigm that requires manual designs of demonstrations. Code is available at https://github.com/amazon-research/auto-cot
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper proposes Auto-CoT, an automatic method for chain-of-thought prompting that samples questions for diversity and uses an LLM with the 'Let's think step by step' prompt to generate reasoning chains for demonstrations, thereby eliminating manual hand-crafting of task-specific examples. The central claim is that on ten public benchmark reasoning tasks with GPT-3, Auto-CoT consistently matches or exceeds the performance of the manual CoT paradigm.
Significance. If the result holds, the work would be significant for automating a labor-intensive component of effective CoT prompting and scaling reasoning capabilities in LLMs. The evaluation spans ten diverse benchmarks and the public code release supports reproducibility.
major comments (3)
- [Section 3] Section 3 (Auto-CoT method): the diversity sampling procedure is described at a high level but the exact threshold, selection algorithm, and handling of the free parameter 'diversity sampling threshold' are not specified in sufficient detail for reproduction.
- [Section 4] Section 4 (Experiments): no quantitative error rates or per-task breakdown of correctness in the automatically generated reasoning chains are reported, leaving the claim that diversity sampling sufficiently mitigates occasional mistakes without direct supporting measurements.
- [Section 4] Section 4: the manuscript provides no ablation that isolates the downstream effect of erroneous steps in the generated chains on final accuracy, which is required to test the central assumption that diversity offsets noise across all ten tasks.
minor comments (2)
- [Tables 1-2] Table 1 and Table 2: column headers and footnotes could more explicitly distinguish between manual CoT baselines and Auto-CoT variants for quick comparison.
- [Abstract and Section 1] The abstract and introduction repeat the performance claim without noting the absence of statistical significance tests or run-to-run variance.
Simulated Author's Rebuttal
We thank the referee for the constructive comments, which highlight important aspects for improving reproducibility and empirical support. We will revise the manuscript to provide additional details and analyses as outlined below. All changes will be incorporated in the next version.
Point-by-point responses
- Referee: [Section 3] Section 3 (Auto-CoT method): the diversity sampling procedure is described at a high level but the exact threshold, selection algorithm, and handling of the free parameter 'diversity sampling threshold' are not specified in sufficient detail for reproduction.
  Authors: We agree that the description in Section 3 is insufficiently detailed for full reproducibility. In the revised manuscript, we will explicitly state the diversity sampling threshold value used in our experiments, describe the exact selection algorithm (including any clustering or similarity-based selection steps), and clarify how the free parameter is set or tuned. We will also add pseudocode for the sampling procedure; a sketch of that selection step appears after these responses. revision: yes
- Referee: [Section 4] Section 4 (Experiments): no quantitative error rates or per-task breakdown of correctness in the automatically generated reasoning chains are reported, leaving the claim that diversity sampling sufficiently mitigates occasional mistakes without direct supporting measurements.
  Authors: We acknowledge that direct quantitative measurements of error rates in the generated chains would provide stronger support for the claim. In the revision, we will add a new analysis in Section 4 reporting the error rates of the automatically generated reasoning chains, including a per-task breakdown of correctness across the ten benchmarks. This will directly quantify how diversity sampling helps mitigate mistakes. revision: yes
- Referee: [Section 4] Section 4: the manuscript provides no ablation that isolates the downstream effect of erroneous steps in the generated chains on final accuracy, which is required to test the central assumption that diversity offsets noise across all ten tasks.
  Authors: The referee is correct that an explicit ablation isolating the impact of erroneous steps is missing. We will add such an ablation study to the revised Section 4. This will include controlled experiments comparing performance with varying levels of injected errors in the chains, with and without diversity sampling, to demonstrate that diversity offsets noise on the ten tasks. revision: yes
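On the first point, the selection step the paper describes can already be sketched in code. The 60-token and 5-step thresholds below are the simple heuristics the paper reports; complete() is again a hypothetical LLM wrapper, and the token count is approximated by whitespace splitting.

    def select_from_cluster(cluster_questions, distance_to_centroid):
        # Walk the cluster in order of ascending distance to its centroid and
        # return the first question that passes the heuristic filters.
        for q in sorted(cluster_questions, key=distance_to_centroid):
            if len(q.split()) > 60:       # question too long: skip
                continue
            rationale = complete(f"Q: {q}\nA: Let's think step by step.")
            steps = [s for s in rationale.split("\n") if s.strip()]
            if len(steps) <= 5:           # rationale short enough: accept
                return q, rationale
        return None  # no candidate in this cluster passed the filters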
Circularity Check
No significant circularity: empirical method validated on external benchmarks
Full rationale
The paper proposes Auto-CoT as an empirical procedure that samples diverse questions, generates reasoning chains via the zero-shot 'Let's think step by step' prompt, and assembles demonstrations for few-shot use. Performance claims rest on direct comparisons against manual CoT baselines across ten public benchmarks with GPT-3; no equations, fitted parameters, or self-referential derivations are present. The central result (matching or exceeding manual CoT) is therefore an observed experimental outcome rather than a quantity forced by construction from the method's own inputs or prior self-citations. The method is validated against external benchmark data and does not reduce to any of the enumerated circularity patterns.
Axiom & Free-Parameter Ledger
free parameters (2)
- number of demonstrations
- diversity sampling threshold
axioms (1)
- domain assumption: LLMs can produce usable intermediate reasoning steps when prompted with 'Let's think step by step'
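That assumption can be made concrete with the two-stage prompt format of Zero-Shot-CoT, which Auto-CoT reuses to generate rationales: first elicit the reasoning, then append an extraction cue for the final answer. A minimal sketch, again with the hypothetical complete() wrapper:

    def zero_shot_cot(question):
        # Stage 1: elicit intermediate reasoning steps.
        prompt = f"Q: {question}\nA: Let's think step by step."
        rationale = complete(prompt)
        # Stage 2: append an answer-extraction cue to the elicited reasoning.
        answer = complete(f"{prompt} {rationale}\nTherefore, the answer is")
        return rationale, answer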
Lean theorems connected to this paper
- IndisputableMonolith/Foundation/AlexanderDuality.lean (D=3 forcing via linking) · theorem: alexander_duality_circle_linking · tag: unclear
  Relation between the paper passage and the cited Recognition theorem is unclear.
  Paper passage: "we use k-means to partition all the 600 test questions into k = 8 clusters"
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Forward citations
Cited by 24 Pith papers
- DSPy: Compiling Declarative Language Model Calls into Self-Improving Pipelines
  DSPy compiles short declarative programs into LM pipelines that self-optimize and outperform both standard few-shot prompting and expert-written chains on math, retrieval, and QA tasks.
- PEEM: Prompt Engineering Evaluation Metrics for Interpretable Joint Evaluation of Prompts and Responses
  PEEM is a multi-criteria LLM-based evaluator for prompts and responses that aligns with standard accuracy while enabling zero-shot prompt optimization via feedback.
- Visual ChatGPT: Talking, Drawing and Editing with Visual Foundation Models
  Visual ChatGPT integrates visual foundation models with ChatGPT via prompts to enable multi-step image understanding, generation, and editing in conversational interactions.
- MAP: A Map-then-Act Paradigm for Long-Horizon Interactive Agent Reasoning
  MAP improves LLM agent reasoning by constructing a structured cognitive map of the environment before task execution, yielding performance gains on benchmarks like ARC-AGI-3 and superior training data via the new MAP-...
- APCD: Adaptive Path-Contrastive Decoding for Reliable Large Language Model Generation
  APCD reduces LLM hallucinations by expanding decoding paths adaptively when entropy signals uncertainty and by contrasting divergent paths to control their interaction.
- Assistance Without Interruption: A Benchmark and LLM-based Framework for Non-Intrusive Human-Robot Assistance
  The work creates NIABench and an LLM-plus-scoring-model framework that enables robots to deliver proactive assistance during human multi-step activities while avoiding interruptions and reducing human effort.
- ExecTune: Effective Steering of Black-Box LLMs with Guide Models
  ExecTune trains guide models via acceptance sampling, supervised fine-tuning, and structure-aware RL to boost executability of strategies for black-box LLMs, yielding up to 9.2% higher accuracy and 22.4% lower cost on...
- Thinking Diffusion: Penalize and Guide Visual-Grounded Reasoning in Diffusion Multimodal Language Models
  Position and step penalty plus visual reasoning guidance fix premature answering and weak visual grounding in diffusion MLLMs, delivering up to 7.5% accuracy gains and over 3x speedup.
- World model inspired sarcasm reasoning with large language model agents
  WM-SAR decomposes sarcasm into LLM-agent components, quantifies literal-normative inconsistency deterministically, and integrates it with intention via logistic regression to outperform prior sarcasm detectors on benchmarks.
- Adaptive Chain-of-Focus Reasoning via Dynamic Visual Search and Zooming for Efficient VLMs
  Chain-of-Focus enables VLMs to adaptively search and zoom on important image areas via a two-stage SFT and RL pipeline on a custom 3K-sample dataset, yielding 5% gains on the V* benchmark across resolutions from 224 to 4K.
- Mixture-of-Agents Enhances Large Language Model Capabilities
  A layered Mixture-of-Agents system combining multiple LLMs achieves state-of-the-art results on AlpacaEval 2.0 (65.1%), MT-Bench, and FLASK, outperforming GPT-4 Omni.
- AGIEval: A Human-Centric Benchmark for Evaluating Foundation Models
  AGIEval shows GPT-4 exceeding average human scores on SAT Math at 95% and Chinese college entrance English at 92.5%, while revealing weaker results on complex reasoning tasks.
- ART: Automatic multi-step reasoning and tool-use for large language models
  ART automatically generates multi-step reasoning programs with tool integration for LLMs, yielding substantial gains over few-shot and auto-CoT prompting on BigBench and MMLU while matching hand-crafted CoT on most tasks.
- Operating Within the Operational Design Domain: Zero-Shot Perception with Vision-Language Models
  Vision-language models achieve usable zero-shot ODD perception in driving scenes when guided by definition-anchored chain-of-thought prompting with persona decomposition.
- Can Textual Reasoning Improve the Performance of MLLMs on Fine-grained Visual Classification?
  Longer textual reasoning chains degrade MLLM accuracy on fine-grained visual tasks; a new normalization and constrained-reward training framework mitigates the effect and sets new SOTA numbers.
- Encouraging Divergent Thinking in Large Language Models through Multi-Agent Debate
  Multi-agent debate with tit-for-tat arguments and a judge LLM improves reasoning by preventing LLMs from locking into incorrect initial solutions.
- Operating Within the Operational Design Domain: Zero-Shot Perception with Vision-Language Models
  Vision-language models can serve as zero-shot ODD sensors for autonomous driving when using definition-anchored chain-of-thought prompting with persona decomposition.
- Analyzing Chain of Thought (CoT) Approaches in Control Flow Code Deobfuscation Tasks
  CoT prompting improves LLM performance on control-flow deobfuscation of C benchmarks, yielding ~16% better CFG reconstruction and ~20.5% better semantic preservation for GPT5 versus zero-shot prompting.
- Prompt-Driven Code Summarization: A Systematic Literature Review
  A systematic review that categorizes prompting strategies for LLM-based code summarization, assesses their effectiveness, and identifies gaps in research and evaluation practices.
- Combining Static Code Analysis and Large Language Models Improves Correctness and Performance of Algorithm Recognition
  Hybrid LLM plus static analysis for algorithm recognition in code cuts required model calls by 72-97% and lifts F1-scores by as much as 12 points.
- Towards Large Reasoning Models: A Survey of Reinforced Reasoning with Large Language Models
  The paper surveys reinforced reasoning techniques for LLMs, covering automated data construction, learning-to-reason methods, and test-time scaling as steps toward Large Reasoning Models.
- A Systematic Survey of Prompt Engineering in Large Language Models: Techniques and Applications
  A systematic survey categorizes prompt engineering methods for LLMs and VLMs by application area, summarizing methodologies, applications, models, datasets, strengths, and limitations for each technique along with a t...
- A Survey on Multimodal Large Language Models
  This survey organizes the architectures, training strategies, data, evaluation methods, extensions, and challenges of Multimodal Large Language Models.
- Multimodal Chain-of-Thought Reasoning: A Comprehensive Survey
  The paper provides the first comprehensive survey of multimodal chain-of-thought reasoning, including foundational concepts, a taxonomy of methodologies, application analyses, challenges, and future directions.