Chain-of-thought prompting, by including intermediate reasoning steps in few-shot examples, elicits strong reasoning abilities in large language models on arithmetic, commonsense, and symbolic tasks.
Did Aristotle Use a Laptop? A Question Answering Benchmark with Implicit Reasoning Strategies
12 Pith papers cite this work. Polarity classification is still indexing.
representative citing papers
HTEB introduces dynamic, multi-axis evaluation of text embedding robustness using LLM transformations, finding decoupled profiles across models and that scaling does not close all robustness gaps.
CommonWhy is a new dataset of 15,000 why-questions for evaluating LLMs on entity-based causal commonsense reasoning grounded in Wikidata.
PTR framework profiles a workflow upfront then executes it deterministically with bounded verification and repair, limiting LM calls to 2-3 while outperforming ReAct in 16 of 24 tested configurations.
GAIA benchmark shows humans at 92% accuracy on simple real-world questions far outperform current AI systems at 15%, proposing this gap as a key milestone for general AI.
CRAFT is a Pareto-front prompt optimizer that allocates scarce LLM validation calls to candidates near the current front using accuracy- and cost-oriented generators plus NSGA-II retention.
Auto-CoT automatically builds chain-of-thought demonstrations by sampling diverse questions and letting the LLM generate reasoning chains, matching manual CoT performance on ten reasoning tasks with GPT-3.
PaLM 540B demonstrates continued scaling benefits by setting new few-shot SOTA results on hundreds of benchmarks and outperforming humans on BIG-bench.
C3RL is a new RL algorithm combining correctness, calibration, and reference accuracy rewards to improve LLM confidence calibration, enabling CAS to outperform majority voting with up to 12.33x lower inference cost.
A factorized study finds raw hidden states and attention features hard to beat in-domain for LLM uncertainty probes, but structured compressed features are more robust under distribution shift, with pretrained probes transferring to open-ended generation.
Proposes a multi-dimensional behavioral framework with six dimensions (Correctness, Consistency, Robustness, Local Logical Coherence, Efficiency, Stability) plus deployment-aware aggregation to diagnose LLM reasoning beyond accuracy-based benchmarks.
PaLM 2 reports state-of-the-art results on language, reasoning, and multilingual tasks with improved efficiency over PaLM.
citing papers explorer
No citing papers match the current filters.