pith. machine review for the scientific record.

arxiv: 2210.03493 · v1 · submitted 2022-10-07 · 💻 cs.CL · cs.AI

Recognition: 2 theorem links · Lean Theorem

Automatic Chain of Thought Prompting in Large Language Models

Authors on Pith: no claims yet

Pith reviewed 2026-05-16 10:36 UTC · model grok-4.3

classification 💻 cs.CL cs.AI
keywords chain-of-thought prompting · automatic demonstration construction · large language models · reasoning benchmarks · diversity sampling · GPT-3 · zero-shot prompting

The pith

Auto-CoT lets large language models build their own chain-of-thought demonstrations by sampling diverse questions.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper demonstrates that hand-crafting reasoning examples is not required for effective chain-of-thought prompting. Instead, an LLM prompted with 'Let's think step by step' can generate reasoning chains for a set of sampled questions, and these chains serve as demonstrations for new queries. Diversity in the sampled questions helps limit the damage from occasional errors in the generated chains. Tested on ten public reasoning benchmarks with GPT-3, the resulting Auto-CoT method matches or exceeds the accuracy of hand-designed demonstrations. This removes a key practical barrier that has limited the use of advanced prompting techniques.

Core claim

Auto-CoT automatically constructs demonstrations by sampling questions with diversity and generating reasoning chains one by one using the 'Let's think step by step' prompt. On ten public benchmark reasoning tasks with GPT-3, Auto-CoT consistently matches or exceeds the performance of the CoT paradigm that requires manual designs of demonstrations.

What carries the argument

Auto-CoT, which samples questions for diversity then uses the model itself to generate reasoning chains that form the prompt demonstrations.
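Read literally, that pipeline has three pieces: pick a diverse set of questions, elicit a rationale for each with the zero-shot trigger, and concatenate the results in front of the test question. The sketch below is a minimal illustration of that loop, assuming a generic complete(prompt) -> str text-completion call in place of the GPT-3 API and leaving the diversity selector abstract; the names and prompt formatting are illustrative, not the paper's released code.

    # Minimal Auto-CoT-style construction loop (illustrative sketch, not the
    # official implementation). `complete` is any text-completion call, e.g.
    # a thin wrapper around the GPT-3 API.
    from typing import Callable, Iterable

    TRIGGER = "Let's think step by step."

    def zero_shot_chain(question: str, complete: Callable[[str], str]) -> str:
        """Elicit a reasoning chain for one question with the zero-shot trigger."""
        return complete(f"Q: {question}\nA: {TRIGGER}").strip()

    def build_demonstrations(questions: Iterable[str],
                             complete: Callable[[str], str]) -> str:
        """Turn each selected question into a <question, rationale> demonstration."""
        return "\n\n".join(
            f"Q: {q}\nA: {TRIGGER} {zero_shot_chain(q, complete)}" for q in questions
        )

    def answer_with_auto_cot(test_question: str, demos: str,
                             complete: Callable[[str], str]) -> str:
        """Prepend the auto-built demonstrations to a new question and answer it."""
        return complete(f"{demos}\n\nQ: {test_question}\nA: {TRIGGER}")

A full implementation would also need answer extraction and basic filtering of malformed chains; those details are omitted here.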

If this is right

  • Task-specific manual demonstration design becomes unnecessary for chain-of-thought prompting.
  • Reasoning performance on new tasks can be obtained with only a simple prompt and access to the model.
  • Diversity sampling compensates for imperfect reasoning chains in the constructed examples.
  • The same automatic construction process can be applied across multiple reasoning benchmarks without per-task tuning.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • The method may extend to models other than GPT-3 if they respond reliably to the 'Let's think step by step' prompt.
  • Fully automatic demonstration construction could enable rapid adaptation of prompting techniques to new domains.
  • Further improvements might come from better diversity measures or iterative refinement of the generated chains.
  • Real-world systems could use this to deploy step-by-step reasoning without expert prompt engineers.

Load-bearing premise

Selecting questions for diversity is enough to keep the overall demonstrations effective even when some generated reasoning chains contain mistakes.

What would settle it

Running Auto-CoT on the same ten benchmarks but with random instead of diverse question sampling, and finding that performance drops below manual CoT on most tasks.
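That experiment is cheap to express in code. The harness below is a hedged sketch of the comparison for a single task, reusing the illustrative helpers above and assuming a labeled test set, a score(predicted_text, gold) -> bool checker, and a select_diverse(pool, k) selector; none of these names come from the paper.

    # Illustrative ablation harness: diversity-based vs. uniformly random
    # question sampling for one benchmark. Assumes the sketch functions
    # defined earlier (build_demonstrations, answer_with_auto_cot).
    import random
    from typing import Callable, List, Sequence, Tuple

    def sampling_ablation(pool: List[str],
                          test_set: Sequence[Tuple[str, str]],
                          complete: Callable[[str], str],
                          score: Callable[[str, str], bool],
                          select_diverse: Callable[[List[str], int], List[str]],
                          k: int = 8,
                          seed: int = 0) -> dict:
        """Return accuracy under diverse vs. random demonstration sampling."""
        rng = random.Random(seed)
        conditions = {
            "diverse": select_diverse(pool, k),
            "random": rng.sample(pool, k),
        }
        results = {}
        for name, questions in conditions.items():
            demos = build_demonstrations(questions, complete)
            correct = sum(
                score(answer_with_auto_cot(q, demos, complete), gold)
                for q, gold in test_set
            )
            results[name] = correct / len(test_set)
        return results

Running this over the ten benchmarks and averaging across seeds would directly test whether diversity, rather than automation alone, carries the result.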

original abstract

Large language models (LLMs) can perform complex reasoning by generating intermediate reasoning steps. Providing these steps for prompting demonstrations is called chain-of-thought (CoT) prompting. CoT prompting has two major paradigms. One leverages a simple prompt like "Let's think step by step" to facilitate step-by-step thinking before answering a question. The other uses a few manual demonstrations one by one, each composed of a question and a reasoning chain that leads to an answer. The superior performance of the second paradigm hinges on the hand-crafting of task-specific demonstrations one by one. We show that such manual efforts may be eliminated by leveraging LLMs with the "Let's think step by step" prompt to generate reasoning chains for demonstrations one by one, i.e., let's think not just step by step, but also one by one. However, these generated chains often come with mistakes. To mitigate the effect of such mistakes, we find that diversity matters for automatically constructing demonstrations. We propose an automatic CoT prompting method: Auto-CoT. It samples questions with diversity and generates reasoning chains to construct demonstrations. On ten public benchmark reasoning tasks with GPT-3, Auto-CoT consistently matches or exceeds the performance of the CoT paradigm that requires manual designs of demonstrations. Code is available at https://github.com/amazon-research/auto-cot

Editorial analysis

A structured set of objections, weighed in public.

Referee report, simulated authors' rebuttal, circularity check, and an axiom and free-parameter ledger. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

3 major / 2 minor

Summary. The paper proposes Auto-CoT, an automatic method for chain-of-thought prompting that samples questions for diversity and uses an LLM with the 'Let's think step by step' prompt to generate reasoning chains for demonstrations, thereby eliminating the hand-crafting of task-specific examples. The central claim is that on ten public benchmark reasoning tasks with GPT-3, Auto-CoT consistently matches or exceeds the performance of the manual CoT paradigm.

Significance. If the result holds, the work would be significant for automating a labor-intensive component of effective CoT prompting and scaling reasoning capabilities in LLMs. The evaluation spans ten diverse benchmarks and the public code release supports reproducibility.

major comments (3)
  1. [Section 3] Section 3 (Auto-CoT method): the diversity sampling procedure is described at a high level but the exact threshold, selection algorithm, and handling of the free parameter 'diversity sampling threshold' are not specified in sufficient detail for reproduction.
  2. [Section 4] Section 4 (Experiments): no quantitative error rates or per-task breakdown of correctness in the automatically generated reasoning chains are reported, leaving the claim that diversity sampling sufficiently mitigates occasional mistakes without direct supporting measurements.
  3. [Section 4] Section 4: the manuscript provides no ablation that isolates the downstream effect of erroneous steps in the generated chains on final accuracy, which is required to test the central assumption that diversity offsets noise across all ten tasks.
minor comments (2)
  1. [Tables 1-2] Table 1 and Table 2: column headers and footnotes could more explicitly distinguish between manual CoT baselines and Auto-CoT variants for quick comparison.
  2. [Abstract and Section 1] The abstract and introduction repeat the performance claim without noting the absence of statistical significance tests or run-to-run variance.

Simulated Authors' Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive comments, which highlight important aspects for improving reproducibility and empirical support. We will revise the manuscript to provide additional details and analyses as outlined below. All changes will be incorporated in the next version.

point-by-point responses
  1. Referee: [Section 3] Section 3 (Auto-CoT method): the diversity sampling procedure is described at a high level but the exact threshold, selection algorithm, and handling of the free parameter 'diversity sampling threshold' are not specified in sufficient detail for reproduction.

    Authors: We agree that the description in Section 3 is insufficiently detailed for full reproducibility. In the revised manuscript, we will explicitly state the diversity sampling threshold value used in our experiments, describe the exact selection algorithm (including any clustering or similarity-based selection steps), and clarify how the free parameter is set or tuned. We will also add pseudocode for the sampling procedure. revision: yes

  2. Referee: [Section 4] Section 4 (Experiments): no quantitative error rates or per-task breakdown of correctness in the automatically generated reasoning chains are reported, leaving the claim that diversity sampling sufficiently mitigates occasional mistakes without direct supporting measurements.

    Authors: We acknowledge that direct quantitative measurements of error rates in the generated chains would provide stronger support for the claim. In the revision, we will add a new analysis in Section 4 reporting the error rates of the automatically generated reasoning chains, including a per-task breakdown of correctness across the ten benchmarks. This will directly quantify how diversity sampling helps mitigate mistakes. revision: yes

  3. Referee: [Section 4] Section 4: the manuscript provides no ablation that isolates the downstream effect of erroneous steps in the generated chains on final accuracy, which is required to test the central assumption that diversity offsets noise across all ten tasks.

    Authors: The referee is correct that an explicit ablation isolating the impact of erroneous steps is missing. We will add such an ablation study to the revised Section 4. This will include controlled experiments comparing performance with varying levels of injected errors in the chains, with and without diversity sampling, to demonstrate that diversity offsets noise on the ten tasks. revision: yes
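The first response promises pseudocode for the sampling procedure. One plausible reading of "clustering or similarity-based selection" is to embed the question pool, cluster it into k groups, and take the question nearest each centroid; the sketch below illustrates that reading using Sentence-BERT embeddings and k-means. The model name, the k-means settings, and the absence of any length or step-count heuristics are assumptions for illustration, not the paper's verified procedure.

    # Sketch of one clustering-based diversity selector (an assumption about
    # what "clustering or similarity-based selection" could look like, not the
    # paper's exact method).
    import numpy as np
    from sentence_transformers import SentenceTransformer
    from sklearn.cluster import KMeans

    def select_diverse(pool, k, model_name="all-MiniLM-L6-v2", seed=0):
        """Embed the pool, cluster into k groups, and return the question
        closest to each cluster centroid as the demonstration candidate."""
        pool = list(pool)
        encoder = SentenceTransformer(model_name)  # model choice is illustrative
        embeddings = encoder.encode(pool, normalize_embeddings=True)
        km = KMeans(n_clusters=k, random_state=seed, n_init=10).fit(embeddings)
        selected = []
        for c in range(k):
            members = np.where(km.labels_ == c)[0]
            dists = np.linalg.norm(
                embeddings[members] - km.cluster_centers_[c], axis=1
            )
            selected.append(pool[members[int(np.argmin(dists))]])
        return selected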

Circularity Check

0 steps flagged

No significant circularity: empirical method validated on external benchmarks

full rationale

The paper proposes Auto-CoT as an empirical procedure that samples diverse questions, generates reasoning chains via the zero-shot 'Let's think step by step' prompt, and assembles demonstrations for few-shot use. Performance claims rest on direct comparisons against manual CoT baselines across ten public benchmarks with GPT-3; no equations, fitted parameters, or self-referential derivations are present. The central result (matching or exceeding manual CoT) is therefore an observed experimental outcome rather than a quantity forced by construction from the method's own inputs or prior self-citations. The method is validated against external benchmark data rather than its own outputs and does not reduce to any of the enumerated circularity patterns.

Axiom & Free-Parameter Ledger

2 free parameters · 1 axiom · 0 invented entities

The central claim rests on the empirical observation that diversity sampling reduces the harm from generation errors; no new physical or mathematical entities are postulated, and the only free parameters are standard prompting choices such as number of demonstrations and sampling temperature.

free parameters (2)
  • number of demonstrations
    Standard hyperparameter in few-shot prompting; value not specified in abstract but chosen to match prior CoT setups.
  • diversity sampling threshold
    Method-specific choice for selecting varied questions; exact criterion not detailed in abstract.
axioms (1)
  • domain assumption: LLMs can produce usable intermediate reasoning steps when prompted with 'Let's think step by step'
    Invoked to justify generating chains automatically; standard assumption in CoT literature.
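For readers who want the ledger's parameters in one place, a small config object makes them explicit. The defaults below are placeholders for illustration; the abstract does not fix either value, and the threshold's exact criterion is unspecified.

    # The two free parameters from the ledger, plus the fixed trigger prompt.
    # Default values are illustrative placeholders, not values reported in the paper.
    from dataclasses import dataclass

    @dataclass
    class AutoCoTConfig:
        num_demonstrations: int = 8        # how many <question, rationale> demos per prompt
        diversity_threshold: float = 0.5   # similarity cutoff for "diverse enough"; criterion unspecified
        trigger: str = "Let's think step by step."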

pith-pipeline@v0.9.0 · 5537 in / 1278 out tokens · 42611 ms · 2026-05-16T10:36:28.283141+00:00 · methodology

discussion (0)


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
  • matches: The paper's claim is directly supported by a theorem in the formal canon.
  • supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
  • extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
  • uses: The paper appears to rely on the theorem as machinery.
  • contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
  • unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Forward citations

Cited by 24 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. DSPy: Compiling Declarative Language Model Calls into Self-Improving Pipelines

    cs.CL 2023-10 conditional novelty 8.0

    DSPy compiles short declarative programs into LM pipelines that self-optimize and outperform both standard few-shot prompting and expert-written chains on math, retrieval, and QA tasks.

  2. PEEM: Prompt Engineering Evaluation Metrics for Interpretable Joint Evaluation of Prompts and Responses

    cs.CL 2026-03 unverdicted novelty 7.0

    PEEM is a multi-criteria LLM-based evaluator for prompts and responses that aligns with standard accuracy while enabling zero-shot prompt optimization via feedback.

  3. Visual ChatGPT: Talking, Drawing and Editing with Visual Foundation Models

    cs.CV 2023-03 accept novelty 7.0

    Visual ChatGPT integrates visual foundation models with ChatGPT via prompts to enable multi-step image understanding, generation, and editing in conversational interactions.

  4. MAP: A Map-then-Act Paradigm for Long-Horizon Interactive Agent Reasoning

    cs.AI 2026-05 unverdicted novelty 6.0

    MAP improves LLM agent reasoning by constructing a structured cognitive map of the environment before task execution, yielding performance gains on benchmarks like ARC-AGI-3 and superior training data via the new MAP-...

  5. APCD: Adaptive Path-Contrastive Decoding for Reliable Large Language Model Generation

    cs.CL 2026-05 unverdicted novelty 6.0

    APCD reduces LLM hallucinations by expanding decoding paths adaptively when entropy signals uncertainty and by contrasting divergent paths to control their interaction.

  6. Assistance Without Interruption: A Benchmark and LLM-based Framework for Non-Intrusive Human-Robot Assistance

    cs.RO 2026-05 unverdicted novelty 6.0

    The work creates NIABench and an LLM-plus-scoring-model framework that enables robots to deliver proactive assistance during human multi-step activities while avoiding interruptions and reducing human effort.

  7. ExecTune: Effective Steering of Black-Box LLMs with Guide Models

    cs.LG 2026-04 unverdicted novelty 6.0

    ExecTune trains guide models via acceptance sampling, supervised fine-tuning, and structure-aware RL to boost executability of strategies for black-box LLMs, yielding up to 9.2% higher accuracy and 22.4% lower cost on...

  8. Thinking Diffusion: Penalize and Guide Visual-Grounded Reasoning in Diffusion Multimodal Language Models

    cs.AI 2026-04 unverdicted novelty 6.0

    Position and step penalty plus visual reasoning guidance fix premature answering and weak visual grounding in diffusion MLLMs, delivering up to 7.5% accuracy gains and over 3x speedup.

  9. World model inspired sarcasm reasoning with large language model agents

    cs.CL 2025-12 unverdicted novelty 6.0

    WM-SAR decomposes sarcasm into LLM-agent components, quantifies literal-normative inconsistency deterministically, and integrates it with intention via logistic regression to outperform prior sarcasm detectors on benchmarks.

  10. Adaptive Chain-of-Focus Reasoning via Dynamic Visual Search and Zooming for Efficient VLMs

    cs.CV 2025-05 unverdicted novelty 6.0

    Chain-of-Focus enables VLMs to adaptively search and zoom on important image areas via a two-stage SFT and RL pipeline on a custom 3K-sample dataset, yielding 5% gains on the V* benchmark across resolutions from 224 to 4K.

  11. Mixture-of-Agents Enhances Large Language Model Capabilities

    cs.CL 2024-06 unverdicted novelty 6.0

    A layered Mixture-of-Agents system combining multiple LLMs achieves state-of-the-art results on AlpacaEval 2.0 (65.1%), MT-Bench, and FLASK, outperforming GPT-4 Omni.

  12. AGIEval: A Human-Centric Benchmark for Evaluating Foundation Models

    cs.CL 2023-04 accept novelty 6.0

    AGIEval shows GPT-4 exceeding average human scores on SAT Math at 95% and Chinese college entrance English at 92.5%, while revealing weaker results on complex reasoning tasks.

  13. ART: Automatic multi-step reasoning and tool-use for large language models

    cs.CL 2023-03 unverdicted novelty 6.0

    ART automatically generates multi-step reasoning programs with tool integration for LLMs, yielding substantial gains over few-shot and auto-CoT prompting on BigBench and MMLU while matching hand-crafted CoT on most tasks.

  14. Operating Within the Operational Design Domain: Zero-Shot Perception with Vision-Language Models

    cs.CV 2026-05 unverdicted novelty 5.0

    Vision-language models achieve usable zero-shot ODD perception in driving scenes when guided by definition-anchored chain-of-thought prompting with persona decomposition.

  15. Can Textual Reasoning Improve the Performance of MLLMs on Fine-grained Visual Classification?

    cs.CV 2026-01 unverdicted novelty 5.0

    Longer textual reasoning chains degrade MLLM accuracy on fine-grained visual tasks; a new normalization and constrained-reward training framework mitigates the effect and sets new SOTA numbers.

  16. Encouraging Divergent Thinking in Large Language Models through Multi-Agent Debate

    cs.CL 2023-05 conditional novelty 5.0

    Multi-agent debate with tit-for-tat arguments and a judge LLM improves reasoning by preventing LLMs from locking into incorrect initial solutions.

  17. Operating Within the Operational Design Domain: Zero-Shot Perception with Vision-Language Models

    cs.CV 2026-05 unverdicted novelty 4.0

    Vision-language models can serve as zero-shot ODD sensors for autonomous driving when using definition-anchored chain-of-thought prompting with persona decomposition.

  18. Analyzing Chain of Thought (CoT) Approaches in Control Flow Code Deobfuscation Tasks

    cs.SE 2026-04 unverdicted novelty 4.0

    CoT prompting improves LLM performance on control-flow deobfuscation of C benchmarks, yielding ~16% better CFG reconstruction and ~20.5% better semantic preservation for GPT5 versus zero-shot prompting.

  19. Prompt-Driven Code Summarization: A Systematic Literature Review

    cs.SE 2026-04 unverdicted novelty 4.0

    A systematic review that categorizes prompting strategies for LLM-based code summarization, assesses their effectiveness, and identifies gaps in research and evaluation practices.

  20. Combining Static Code Analysis and Large Language Models Improves Correctness and Performance of Algorithm Recognition

    cs.SE 2026-04 conditional novelty 4.0

    Hybrid LLM plus static analysis for algorithm recognition in code cuts required model calls by 72-97% and lifts F1-scores by as much as 12 points.

  21. Towards Large Reasoning Models: A Survey of Reinforced Reasoning with Large Language Models

    cs.AI 2025-01 unverdicted novelty 3.0

    The paper surveys reinforced reasoning techniques for LLMs, covering automated data construction, learning-to-reason methods, and test-time scaling as steps toward Large Reasoning Models.

  22. A Systematic Survey of Prompt Engineering in Large Language Models: Techniques and Applications

    cs.AI 2024-02 unverdicted novelty 3.0

    A systematic survey categorizes prompt engineering methods for LLMs and VLMs by application area, summarizing methodologies, applications, models, datasets, strengths, and limitations for each technique along with a t...

  23. A Survey on Multimodal Large Language Models

    cs.CV 2023-06 accept novelty 3.0

    This survey organizes the architectures, training strategies, data, evaluation methods, extensions, and challenges of Multimodal Large Language Models.

  24. Multimodal Chain-of-Thought Reasoning: A Comprehensive Survey

    cs.CV 2025-03 unverdicted novelty 2.0

    The paper provides the first comprehensive survey of multimodal chain-of-thought reasoning, including foundational concepts, a taxonomy of methodologies, application analyses, challenges, and future directions.

Reference graph

Works this paper leans on

32 extracted references · 32 canonical work pages · cited by 23 Pith papers · 9 internal anchors
