Chain-of-thought prompting, by including intermediate reasoning steps in few-shot examples, elicits strong reasoning abilities in large language models on arithmetic, commonsense, and symbolic tasks.
Did Aristotle Use a Laptop? A Question Answering Benchmark with Implicit Reasoning Strategies.Transactions of the Association for Computational Linguistics2021,9, 346–361
11 Pith papers cite this work. Polarity classification is still indexing.
representative citing papers
HTEB introduces dynamic, multi-axis evaluation of text embedding robustness using LLM transformations, finding decoupled profiles across models and that scaling does not close all robustness gaps.
CommonWhy is a new dataset of 15,000 why-questions for evaluating LLMs on entity-based causal commonsense reasoning grounded in Wikidata.
PTR framework profiles a workflow upfront then executes it deterministically with bounded verification and repair, limiting LM calls to 2-3 while outperforming ReAct in 16 of 24 tested configurations.
GAIA benchmark shows humans at 92% accuracy on simple real-world questions far outperform current AI systems at 15%, proposing this gap as a key milestone for general AI.
CRAFT is a Pareto-front prompt optimizer that allocates scarce LLM validation calls to candidates near the current front using accuracy- and cost-oriented generators plus NSGA-II retention.
Auto-CoT automatically builds chain-of-thought demonstrations by sampling diverse questions and letting the LLM generate reasoning chains, matching manual CoT performance on ten reasoning tasks with GPT-3.
PaLM 540B demonstrates continued scaling benefits by setting new few-shot SOTA results on hundreds of benchmarks and outperforming humans on BIG-bench.
A factorized study finds raw hidden states and attention features hard to beat in-domain for LLM uncertainty probes, but structured compressed features are more robust under distribution shift, with pretrained probes transferring to open-ended generation.
A multi-dimensional framework with six dimensions (Correctness, Consistency, Robustness, Logical Coherence, Efficiency, Stability) is applied to seven LLMs on 975 items, revealing orthogonality between logical coherence and correctness plus ranking inversions invisible to accuracy metrics.
PaLM 2 reports state-of-the-art results on language, reasoning, and multilingual tasks with improved efficiency over PaLM.
citing papers explorer
-
Automatic Chain of Thought Prompting in Large Language Models
Auto-CoT automatically builds chain-of-thought demonstrations by sampling diverse questions and letting the LLM generate reasoning chains, matching manual CoT performance on ten reasoning tasks with GPT-3.
-
From Signals to Transfer: A Factorised Study of Probe-Based Uncertainty Estimation in Large Language Models
A factorized study finds raw hidden states and attention features hard to beat in-domain for LLM uncertainty probes, but structured compressed features are more robust under distribution shift, with pretrained probes transferring to open-ended generation.