Chain-of-thought prompting, by including intermediate reasoning steps in few-shot examples, elicits strong reasoning abilities in large language models on arithmetic, commonsense, and symbolic tasks.
A Diverse Corpus for Evaluating and Developing
10 Pith papers cite this work. Polarity classification is still indexing.
citation-role summary
citation-polarity summary
roles
dataset 1polarities
use dataset 1representative citing papers
PAL improves few-shot reasoning accuracy by having LLMs generate executable programs rather than text-based chains of thought, outperforming much larger models on math and logic benchmarks.
CrowdMath is a new dataset of annotated collaborative math proof discussions where frontier LLMs achieve 83-88% on next-post prediction but only 0.42 macro-F1 on identifying contribution roles.
A fitted iso-depth scaling law measures that one recurrence in looped transformers is worth r^0.46 unique blocks in validation loss.
GAIA benchmark shows humans at 92% accuracy on simple real-world questions far outperform current AI systems at 15%, proposing this gap as a key milestone for general AI.
ToRA trains language models on interactive tool-use trajectories with imitation learning and output shaping to integrate reasoning and external tools, yielding 13-19% gains on math datasets and new highs like 44.6% on MATH for a 7B model.
SLMs achieve only a 4.4% accuracy gain from self-generated hints on reasoning benchmarks, fail to semantically distinguish useful feedback, and perform worse with longer hints.
PaLM 2 reports state-of-the-art results on language, reasoning, and multilingual tasks with improved efficiency over PaLM.
Step-Video-T2V describes a 30B-parameter text-to-video model with custom Video-VAE, 3D DiT, flow matching, and Video-DPO that claims state-of-the-art results on a new internal benchmark.
A literature survey synthesizing benchmarks, architectures, training strategies, and evaluation methods for mathematical reasoning in LLMs, based on roughly 120 papers.
citing papers explorer
-
Chain-of-Thought Prompting Elicits Reasoning in Large Language Models
Chain-of-thought prompting, by including intermediate reasoning steps in few-shot examples, elicits strong reasoning abilities in large language models on arithmetic, commonsense, and symbolic tasks.
-
PAL: Program-aided Language Models
PAL improves few-shot reasoning accuracy by having LLMs generate executable programs rather than text-based chains of thought, outperforming much larger models on math and logic benchmarks.
-
CrowdMath: A Dataset of Crowdsourced Mathematical Research Discussions
CrowdMath is a new dataset of annotated collaborative math proof discussions where frontier LLMs achieve 83-88% on next-post prediction but only 0.42 macro-F1 on identifying contribution roles.
-
How Much Is One Recurrence Worth? Iso-Depth Scaling Laws for Looped Language Models
A fitted iso-depth scaling law measures that one recurrence in looped transformers is worth r^0.46 unique blocks in validation loss.
-
GAIA: a benchmark for General AI Assistants
GAIA benchmark shows humans at 92% accuracy on simple real-world questions far outperform current AI systems at 15%, proposing this gap as a key milestone for general AI.
-
ToRA: A Tool-Integrated Reasoning Agent for Mathematical Problem Solving
ToRA trains language models on interactive tool-use trajectories with imitation learning and output shaping to integrate reasoning and external tools, yielding 13-19% gains on math datasets and new highs like 44.6% on MATH for a 7B model.
-
More Yap Less Meaning: Uncovering Self-Improvement Behavior in SLMs
SLMs achieve only a 4.4% accuracy gain from self-generated hints on reasoning benchmarks, fail to semantically distinguish useful feedback, and perform worse with longer hints.
-
PaLM 2 Technical Report
PaLM 2 reports state-of-the-art results on language, reasoning, and multilingual tasks with improved efficiency over PaLM.
-
Step-Video-T2V Technical Report: The Practice, Challenges, and Future of Video Foundation Model
Step-Video-T2V describes a 30B-parameter text-to-video model with custom Video-VAE, 3D DiT, flow matching, and Video-DPO that claims state-of-the-art results on a new internal benchmark.
-
Mathematical Reasoning in Large Language Models: Benchmarks, Architectures, Evaluation, and Open Challenges
A literature survey synthesizing benchmarks, architectures, training strategies, and evaluation methods for mathematical reasoning in LLMs, based on roughly 120 papers.