Chain-of-thought prompting, by including intermediate reasoning steps in few-shot examples, elicits strong reasoning abilities in large language models on arithmetic, commonsense, and symbolic tasks.
hub
Program Induction by Rationale Generation: Learning to Solve and Explain Algebraic Word Problems
19 Pith papers cite this work. Polarity classification is still indexing.
hub tools
citation-role summary
citation-polarity summary
representative citing papers
CrowdMath is a new dataset of annotated collaborative math proof discussions where frontier LLMs achieve 83-88% on next-post prediction but only 0.42 macro-F1 on identifying contribution roles.
PluRule is a new multimodal multilingual benchmark showing that state-of-the-art vision-language models perform only marginally better than a trivial baseline at detecting specific rule violations in pluralistic online communities.
Frontier LLMs achieve 95-100% accuracy on AMC/AIME problems but recover far fewer distinct valid strategies than human references, while collectively generating 50 novel strategies.
Medusa augments LLMs with multiple decoding heads and tree-based attention to predict and verify several tokens in parallel, yielding 2.2-3.6x inference speedup via two fine-tuning regimes.
GAIA benchmark shows humans at 92% accuracy on simple real-world questions far outperform current AI systems at 15%, proposing this gap as a key milestone for general AI.
Chain-of-Thought reasoning in LLMs is often unfaithful, with models relying on it variably by task and less so as models scale larger.
VCM is a training-free decoding intervention that applies PMI-driven token elevation and variance-adaptive penalization to reduce repetitive degeneration in LLM open-ended generation.
Nexa learns a response-conditioned policy that starts with parallel agent execution and adds at most one round of sequential message passing via a predicted sparse DAG, strictly subsuming pure parallel mode.
DCLM-Baseline dataset lets a 7B model reach 64% 5-shot MMLU accuracy after 2.6T tokens, beating prior open-data models by 6.6 points on MMLU with 40% less compute.
MAmmoTH models trained via hybrid CoT-PoT instruction tuning on MathInstruct outperform prior open-source LLMs by 16-32% average accuracy on nine math datasets, reaching 33% and 44% on MATH for 7B and 34B scales.
UltraChat supplies 1.5 million high-quality multi-turn dialogues that, when used to fine-tune LLaMA, produce UltraLLaMA, which outperforms prior open-source chat models including Vicuna.
A 540B-parameter LLM improves reasoning performance on GSM8K, DROP, OpenBookQA, and ANLI-A3 by fine-tuning on self-generated high-confidence CoT solutions from unlabeled data.
Auto-CoT automatically builds chain-of-thought demonstrations by sampling diverse questions and letting the LLM generate reasoning chains, matching manual CoT performance on ten reasoning tasks with GPT-3.
PaLM 540B demonstrates continued scaling benefits by setting new few-shot SOTA results on hundreds of benchmarks and outperforming humans on BIG-bench.
SLMs achieve only a 4.4% accuracy gain from self-generated hints on reasoning benchmarks, fail to semantically distinguish useful feedback, and perform worse with longer hints.
PRISM is a gauge-invariant DP mechanism for LoRA that avoids bilinear noise amplification via tangent-space sampling, supplies a closed-form noise characterization on Z, and includes a DP-aware adaptive update rule.
Proposes temporal and structural credit assignment plus a discrete verbalized block coordinate descent algorithm to optimize prompts in LLM multi-agent systems, claiming reduced query complexity and better performance on reasoning benchmarks.
Multilingual pooling for quality classifiers outperforms monolingual baselines in rank stability and accuracy for LLM pretraining data selection across high- and low-resource languages.
citing papers explorer
-
Chain-of-Thought Prompting Elicits Reasoning in Large Language Models
Chain-of-thought prompting, by including intermediate reasoning steps in few-shot examples, elicits strong reasoning abilities in large language models on arithmetic, commonsense, and symbolic tasks.
-
CrowdMath: A Dataset of Crowdsourced Mathematical Research Discussions
CrowdMath is a new dataset of annotated collaborative math proof discussions where frontier LLMs achieve 83-88% on next-post prediction but only 0.42 macro-F1 on identifying contribution roles.
-
PluRule: A Benchmark for Moderating Pluralistic Communities on Social Media
PluRule is a new multimodal multilingual benchmark showing that state-of-the-art vision-language models perform only marginally better than a trivial baseline at detecting specific rule violations in pluralistic online communities.
-
Beyond Accuracy: Evaluating Strategy Diversity in LLM Mathematical Reasoning
Frontier LLMs achieve 95-100% accuracy on AMC/AIME problems but recover far fewer distinct valid strategies than human references, while collectively generating 50 novel strategies.
-
Medusa: Simple LLM Inference Acceleration Framework with Multiple Decoding Heads
Medusa augments LLMs with multiple decoding heads and tree-based attention to predict and verify several tokens in parallel, yielding 2.2-3.6x inference speedup via two fine-tuning regimes.
-
GAIA: a benchmark for General AI Assistants
GAIA benchmark shows humans at 92% accuracy on simple real-world questions far outperform current AI systems at 15%, proposing this gap as a key milestone for general AI.
-
Measuring Faithfulness in Chain-of-Thought Reasoning
Chain-of-Thought reasoning in LLMs is often unfaithful, with models relying on it variably by task and less so as models scale larger.
-
Breaking the Likelihood Trap: Variance-Calibrated Modulation for Large Language Model Decoding
VCM is a training-free decoding intervention that applies PMI-driven token elevation and variance-adaptive penalization to reduce repetitive degeneration in LLM open-ended generation.
-
Response-Conditioned Parallel-to-Sequential Orchestration for Multi-Agent Systems
Nexa learns a response-conditioned policy that starts with parallel agent execution and adds at most one round of sequential message passing via a predicted sparse DAG, strictly subsuming pure parallel mode.
-
DataComp-LM: In search of the next generation of training sets for language models
DCLM-Baseline dataset lets a 7B model reach 64% 5-shot MMLU accuracy after 2.6T tokens, beating prior open-data models by 6.6 points on MMLU with 40% less compute.
-
MAmmoTH: Building Math Generalist Models through Hybrid Instruction Tuning
MAmmoTH models trained via hybrid CoT-PoT instruction tuning on MathInstruct outperform prior open-source LLMs by 16-32% average accuracy on nine math datasets, reaching 33% and 44% on MATH for 7B and 34B scales.
-
Enhancing Chat Language Models by Scaling High-quality Instructional Conversations
UltraChat supplies 1.5 million high-quality multi-turn dialogues that, when used to fine-tune LLaMA, produce UltraLLaMA, which outperforms prior open-source chat models including Vicuna.
-
Large Language Models Can Self-Improve
A 540B-parameter LLM improves reasoning performance on GSM8K, DROP, OpenBookQA, and ANLI-A3 by fine-tuning on self-generated high-confidence CoT solutions from unlabeled data.
-
Automatic Chain of Thought Prompting in Large Language Models
Auto-CoT automatically builds chain-of-thought demonstrations by sampling diverse questions and letting the LLM generate reasoning chains, matching manual CoT performance on ten reasoning tasks with GPT-3.
-
PaLM: Scaling Language Modeling with Pathways
PaLM 540B demonstrates continued scaling benefits by setting new few-shot SOTA results on hundreds of benchmarks and outperforming humans on BIG-bench.
-
More Yap Less Meaning: Uncovering Self-Improvement Behavior in SLMs
SLMs achieve only a 4.4% accuracy gain from self-generated hints on reasoning benchmarks, fail to semantically distinguish useful feedback, and perform worse with longer hints.
-
PRISM: Gauge-Invariant Tangent-Space Differentially Private LoRA
PRISM is a gauge-invariant DP mechanism for LoRA that avoids bilinear noise amplification via tangent-space sampling, supplies a closed-form noise characterization on Z, and includes a DP-aware adaptive update rule.
-
Unifying Temporal and Structural Credit Assignment in LLM-Based Multi-Agent Prompt Optimization
Proposes temporal and structural credit assignment plus a discrete verbalized block coordinate descent algorithm to optimize prompts in LLM multi-agent systems, claiming reduced query complexity and better performance on reasoning benchmarks.
-
Toward Cross-Lingual Quality Classifiers for Multilingual Pretraining Data Selection
Multilingual pooling for quality classifiers outperforms monolingual baselines in rank stability and accuracy for LLM pretraining data selection across high- and low-resource languages.