arxiv: 2210.09261 · v1 · submitted 2022-10-17 · 💻 cs.CL · cs.AI


Challenging BIG-Bench Tasks and Whether Chain-of-Thought Can Solve Them

Aakanksha Chowdhery, Denny Zhou, Ed H. Chi, Hyung Won Chung, Jason Wei, Mirac Suzgun, Nathanael Schärli, Nathan Scales, Quoc V. Le, Sebastian Gehrmann, Yi Tay

Pith reviewed 2026-05-11 07:09 UTC · model grok-4.3


keywords BIG-Bench · chain-of-thought · prompting · language models · multi-step reasoning · emergent abilities · benchmarking


The pith

Chain-of-thought prompting lets Codex surpass the average human-rater score on 17 of 23 BIG-Bench Hard tasks, and PaLM on 10.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper isolates 23 tasks from BIG-Bench where earlier language-model evaluations fell below average human-rater scores. It shows that switching from standard few-shot prompting to chain-of-thought prompting, which asks the model to produce explicit intermediate reasoning steps, raises performance enough for PaLM to beat the human baseline on 10 tasks and for Codex to beat it on 17 tasks. The authors argue that many of these tasks demand multi-step reasoning that ordinary few-shot examples do not elicit, so prior results understated what the models could already do. They further observe that chain-of-thought prompting turns flat scaling curves into sudden gains on several tasks, indicating an interaction between the prompting method and model size.

Core claim

Applying chain-of-thought prompting to the 23 BIG-Bench Hard tasks enables PaLM to exceed average human-rater performance on 10 tasks and Codex to exceed it on 17 tasks; without chain-of-thought, few-shot prompting alone substantially underestimates model capability on these multi-step reasoning problems.

What carries the argument

Chain-of-thought (CoT) prompting, which instructs the model to generate a sequence of intermediate reasoning steps before the final answer.
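
As a rough illustration of the difference between the two prompting regimes, the sketch below assembles a standard few-shot prompt and a CoT prompt from the same exemplar. The exemplar question, the rationale text, the `Exemplar` class, and the helper `build_prompt` are all hypothetical and are not taken from the paper's released prompts.

```python
from dataclasses import dataclass


@dataclass
class Exemplar:
    question: str
    rationale: str  # intermediate reasoning steps (used only for CoT)
    answer: str


def build_prompt(exemplars, query, use_cot):
    """Assemble a few-shot prompt; with use_cot, each exemplar shows its reasoning."""
    blocks = []
    for ex in exemplars:
        if use_cot:
            target = f"{ex.rationale} So the answer is {ex.answer}."
        else:
            target = ex.answer
        blocks.append(f"Q: {ex.question}\nA: {target}")
    # The query is left open so the model continues after "A:".
    blocks.append(f"Q: {query}\nA:")
    return "\n\n".join(blocks)


# Hypothetical exemplar, written for illustration only.
demo = [Exemplar(
    question="Is (True and False) or True equal to True?",
    rationale="True and False is False. False or True is True.",
    answer="Yes",
)]

query = "Is not (True or False) equal to True?"
standard_prompt = build_prompt(demo, query, use_cot=False)
cot_prompt = build_prompt(demo, query, use_cot=True)
print(cot_prompt)
```

The only difference between the two prompts is the target text shown for each exemplar, so any change in model accuracy can be attributed to the presence of the worked reasoning rather than to different questions.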

If this is right

Few-shot prompting without explicit reasoning steps underestimates the capabilities of current models on tasks that require multi-step inference.
CoT prompting reveals emergent task performance on several BBH tasks that show flat scaling curves under standard prompting.
A large fraction of the 23 hard tasks are solvable by models that can be guided to reason step by step rather than being inherently beyond current language-model reach.
Benchmark results that rely solely on few-shot prompting will systematically lag behind the best achievable performance when CoT is used instead.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

Evaluation suites for language models will need to standardize or report results across multiple prompting regimes, including CoT, to avoid understating model capabilities.
The performance jump with CoT suggests that future scaling laws for reasoning tasks should be measured under step-by-step prompting rather than few-shot alone.
If the human baselines remain fixed while models continue to improve with better prompting, the remaining tasks where models still lag may become the new focus for architectural or training advances.

Load-bearing premise

The average human-rater scores collected in the original BIG-Bench study form a stable benchmark that can be compared directly to model outputs produced under different prompting conditions.

What would settle it

Re-evaluate the same 23 tasks with new human raters who receive the identical chain-of-thought instructions given to the models; if the human scores rise enough to match or exceed the CoT model scores, the claim that models have surpassed average human performance would be overturned.

Original abstract

BIG-Bench (Srivastava et al., 2022) is a diverse evaluation suite that focuses on tasks believed to be beyond the capabilities of current language models. Language models have already made good progress on this benchmark, with the best model in the BIG-Bench paper outperforming average reported human-rater results on 65% of the BIG-Bench tasks via few-shot prompting. But on what tasks do language models fall short of average human-rater performance, and are those tasks actually unsolvable by current language models? In this work, we focus on a suite of 23 challenging BIG-Bench tasks which we call BIG-Bench Hard (BBH). These are the tasks for which prior language model evaluations did not outperform the average human-rater. We find that applying chain-of-thought (CoT) prompting to BBH tasks enables PaLM to surpass the average human-rater performance on 10 of the 23 tasks, and Codex (code-davinci-002) to surpass the average human-rater performance on 17 of the 23 tasks. Since many tasks in BBH require multi-step reasoning, few-shot prompting without CoT, as done in the BIG-Bench evaluations (Srivastava et al., 2022), substantially underestimates the best performance and capabilities of language models, which is better captured via CoT prompting. As further analysis, we explore the interaction between CoT and model scale on BBH, finding that CoT enables emergent task performance on several BBH tasks with otherwise flat scaling curves.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

CoT prompting gets Codex past the reported human average on 17 of 23 BBH tasks and PaLM on 10, but the comparison rests on unverified prior human scores.


The main point is that chain-of-thought prompting lifts Codex above the average human-rater performance on 17 of the 23 BIG-Bench Hard tasks and PaLM on 10. This reframes those tasks as reachable with current methods rather than fundamentally beyond them, at least under the reported conditions.

The paper isolates the BBH subset cleanly from the larger BIG-Bench suite and runs the direct comparison. It also shows how CoT changes the scaling behavior on several tasks that looked flat without it. Those two pieces, the head-to-head counts and the scale interaction, are the concrete additions. The work stays tied to the public task definitions, which keeps the numbers checkable.

The weakest part is the human baseline. The paper takes the averages straight from the original BIG-Bench report without new controls, without matching the exact prompt format or few-shot structure used for the models, and without reporting variability or agreement stats for the raters. If those human scores shift under a protocol closer to the model setup, the number of tasks surpassed could change. That assumption is load-bearing for the “surpass human” claim but is not stress-tested here.

This paper is for people who track prompting techniques and benchmark interpretation in language models. Readers who want a clearer picture of where multi-step reasoning currently stands will get direct value from the empirical splits. It is worth sending for peer review. The model-side results are straightforward enough to document, and the baseline question is addressable in revision.

Referee Report

1 major / 2 minor

Summary. The paper defines BIG-Bench Hard (BBH) as the 23 tasks from BIG-Bench where prior models did not exceed average human-rater performance. It reports that chain-of-thought (CoT) prompting enables PaLM to surpass those human averages on 10 tasks and Codex (code-davinci-002) on 17 tasks, while few-shot prompting without CoT underestimates capabilities. The work also shows CoT produces emergent performance on several tasks that exhibit flat scaling curves without it.

Significance. If the empirical counts hold, the result is significant because it shows that a large fraction of tasks previously viewed as beyond current language models are solvable via CoT prompting rather than requiring architectural advances. The direct use of a publicly defined task set and comparison against an external human benchmark provides concrete, falsifiable measurements of prompting gains and scaling behavior.

major comments (1)

Abstract and §3 (results): The headline claim that PaLM exceeds average human-rater performance on 10/23 tasks and Codex on 17/23 tasks rests on treating the human scores reported in Srivastava et al. (2022) as stable external targets. No standard errors, inter-rater agreement statistics, or sensitivity checks to prompt wording or few-shot format are provided for those averages, so the exact task counts could shift under modest changes to the human evaluation protocol.
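
One way to make this concern concrete is a simple sensitivity check: shift every human average by a fixed offset and recount the tasks on which the model still comes out ahead. The sketch below does this with a hypothetical helper, `tasks_surpassed`; the task names and scores are placeholders invented for illustration, not values from the paper or from Srivastava et al. (2022).

```python
def tasks_surpassed(model_scores, human_scores, shift=0.0):
    """Return tasks where the model beats the human average after adding `shift` to it."""
    return sorted(
        task for task, score in model_scores.items()
        if score > human_scores[task] + shift
    )


# Placeholder numbers for illustration only (NOT results from the paper).
model = {"task_a": 80.0, "task_b": 62.0, "task_c": 45.0}
human = {"task_a": 70.0, "task_b": 60.0, "task_c": 55.0}

for shift in (0.0, 5.0, 15.0):
    wins = tasks_surpassed(model, human, shift)
    print(f"human baseline +{shift:>4.1f}: model ahead on {len(wins)} of {len(model)} tasks")
```

Tasks with narrow margins flip first under such a shift, which is why per-task margins are more informative than the headline count alone.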

minor comments (2)

§4 (experimental details): The prompt templates, number of few-shot examples, and exact decoding parameters (temperature, top-p, etc.) used for both standard and CoT conditions should be stated explicitly or linked to a public repository to support replication.
Figure 2 and scaling analysis: Clarify whether the reported accuracy curves reflect single runs or aggregated statistics, and whether any post-hoc selection of CoT formats occurred after observing results.

Simulated Authors' Rebuttal

1 response · 0 unresolved

We thank the referee for their careful reading and constructive comment. We address the major point below and will make a partial revision to improve clarity around the human baselines.


Referee: [—] Abstract and §3 (results): The headline claim that PaLM exceeds average human-rater performance on 10/23 tasks and Codex on 17/23 tasks rests on treating the human scores reported in Srivastava et al. (2022) as stable external targets. No standard errors, inter-rater agreement statistics, or sensitivity checks to prompt wording or few-shot format are provided for those averages, so the exact task counts could shift under modest changes to the human evaluation protocol.

Authors: We agree that the human-rater averages reported in Srivastava et al. (2022) are point estimates without accompanying standard errors, inter-rater agreement, or sensitivity analyses. Our comparisons use these published averages exactly as provided by the original BIG-Bench evaluation, which is the standard practice when reporting results on this benchmark. We do not have access to the raw human annotations and therefore cannot compute those statistics ourselves. In the revised manuscript we will add an explicit caveat in the abstract and §3 stating that the human scores are point estimates and that the precise number of tasks surpassed could vary under alternative human-evaluation protocols. At the same time, the performance deltas from chain-of-thought prompting are frequently large (often 10–30 points), so the qualitative conclusion that CoT enables models to exceed the reported human averages on a substantial fraction of BBH tasks is unlikely to be overturned by modest changes to the baseline.
Revision: partial

Circularity Check

0 steps flagged

No circularity: direct empirical measurements against external benchmark


The paper's central claims consist of experimental measurements: applying chain-of-thought prompting to 23 BBH tasks selected from the prior BIG-Bench suite and reporting that PaLM exceeds the cited average human-rater performance on 10 tasks while Codex exceeds it on 17. These are direct accuracy comparisons to human scores taken from Srivastava et al. (2022), an independent external benchmark. No equations, parameter fits, self-definitions, or load-bearing self-citations reduce the reported performance numbers to quantities derived from the present experiments. The work is self-contained against the external human benchmark and contains no derivation chain that collapses by construction.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests on the external validity of the BIG-Bench human-rater averages and on the assumption that the 23 tasks were correctly identified as those where prior few-shot evaluations fell short.

axioms (1)

Domain assumption: Average human-rater performance on the 23 tasks provides a stable external benchmark.
The paper uses these averages as the threshold that models must surpass.

pith-pipeline@v0.9.0 · 5625 in / 1279 out tokens · 56059 ms · 2026-05-11T07:09:17.108348+00:00 · methodology


Forward citations

Cited by 53 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Large Language Diffusion Models
cs.CL 2025-02 unverdicted novelty 8.0

LLaDA is a scalable diffusion-based language model that matches autoregressive LLMs like LLaMA3 8B on tasks and surpasses GPT-4o on reversal poem completion.
Human-Grounded Multimodal Benchmark with 900K-Scale Aggregated Student Response Distributions from Japan's National Assessment of Academic Ability
cs.CL 2026-05 unverdicted novelty 7.0

A new benchmark dataset drawn from Japan's National Assessment of Academic Ability supplies real exam layouts, diagrams, Japanese text, and nationwide student response distributions for evaluating multimodal LLMs.
AgentPSO: Evolving Agent Reasoning Skill via Multi-agent Particle Swarm Optimization
cs.AI 2026-05 unverdicted novelty 7.0

AgentPSO evolves reusable multi-agent reasoning skills via PSO-inspired natural-language updates, outperforming static agents and test-time multi-agent baselines on math and general reasoning tasks with cross-benchmar...
Post Reasoning: Improving the Performance of Non-Thinking Models at No Cost
cs.AI 2026-05 conditional novelty 7.0

Post-Reasoning boosts LLM accuracy by reversing the usual answer-after-reasoning order, delivering mean relative gains of 17.37% across 117 model-benchmark pairs with zero extra cost.
Logic-Regularized Verifier Elicits Reasoning from LLMs
cs.CL 2026-05 unverdicted novelty 7.0

LOVER creates an unsupervised logic-regularized verifier that reaches 95% of supervised verifier performance on reasoning tasks across 10 datasets.
CXR-ContraBench: Benchmarking Negated-Option Attraction in Medical VLMs
cs.CV 2026-05 conditional novelty 7.0

Medical VLMs frequently select negated options that contradict visible chest X-ray findings, achieving only ~30% accuracy on direct presence probes, but a post-hoc consistency verifier raises accuracy above 95%.
Large Language Models Exhibit Normative Conformity
cs.AI 2026-04 unverdicted novelty 7.0

Large language models exhibit normative conformity in addition to informational conformity, and subtle social context can direct which group they conform to.
Weak-Link Optimization for Multi-Agent Reasoning and Collaboration
cs.AI 2026-04 unverdicted novelty 7.0

WORC improves multi-agent LLM reasoning to 82.2% average accuracy by predicting and compensating for the weakest agent via targeted extra sampling rather than uniform reinforcement.
MARS: Enabling Autoregressive Models Multi-Token Generation
cs.CL 2026-04 unverdicted novelty 7.0

MARS fine-tunes autoregressive models to predict multiple tokens per step via continued training on instruction data, achieving 1.5-1.7x throughput while matching baseline accuracy and supporting real-time speed adjustment.
DeepSeek-V2: A Strong, Economical, and Efficient Mixture-of-Experts Language Model
cs.CL 2024-05 unverdicted novelty 7.0

DeepSeek-V2 delivers top-tier open-source LLM performance using only 21B active parameters by compressing the KV cache 93.3% and cutting training costs 42.5% via MLA and DeepSeekMoE.
Medusa: Simple LLM Inference Acceleration Framework with Multiple Decoding Heads
cs.LG 2024-01 conditional novelty 7.0

Medusa augments LLMs with multiple decoding heads and tree-based attention to predict and verify several tokens in parallel, yielding 2.2-3.6x inference speedup via two fine-tuning regimes.
Program of Thoughts Prompting: Disentangling Computation from Reasoning for Numerical Reasoning Tasks
cs.CL 2022-11 unverdicted novelty 7.0

PoT prompting improves numerical reasoning by having language models write programs executed by a computer instead of performing calculations in natural language chains of thought, with an average 12% gain over CoT.
DECO: Sparse Mixture-of-Experts with Dense-Comparable Performance on End-Side Devices
cs.LG 2026-05 conditional novelty 6.0

DECO matches dense model performance at 20% expert activation via ReLU-based routing with learnable scaling and the NormSiLU activation, plus a 3x real-hardware speedup.
DECO: Sparse Mixture-of-Experts with Dense-Comparable Performance on End-Side Devices
cs.LG 2026-05 unverdicted novelty 6.0

DECO sparse MoE matches dense Transformer performance at 20% expert activation with a 3x hardware inference speedup.
Valid Best-Model Identification for LLM Evaluation via Low-Rank Factorization
cs.LG 2026-05 unverdicted novelty 6.0

Doubly robust estimators that incorporate low-rank predictions enable valid finite-sample confidence intervals for best-model identification under adaptive sampling and without-replacement example selection in LLM evaluation.
OPT-BENCH: Evaluating the Iterative Self-Optimization of LLM Agents in Large-Scale Search Spaces
cs.AI 2026-05 unverdicted novelty 6.0

OPT-BENCH and OPT-Agent evaluate LLM self-optimization in large search spaces, showing stronger models improve via feedback but stay constrained by base capacity and below human performance.
SlimQwen: Exploring the Pruning and Distillation in Large MoE Model Pre-training
cs.LG 2026-05 unverdicted novelty 6.0

Pruning pretrained MoE models outperforms training from scratch, different compression methods converge after continued pretraining, and combining KD with language modeling loss plus progressive schedules yields a com...
Sanity Checks for Long-Form Hallucination Detection
cs.CL 2026-05 unverdicted novelty 6.0

Hallucination detectors on LLM reasoning traces often rely on final-answer artifacts rather than reasoning validity; once controlled, lightweight lexical trajectory features suffice for robust detection.
From 0-Order Selection to 2-Order Judgment: Combinatorial Hardening Exposes Compositional Failures in Frontier LLMs
cs.CL 2026-05 unverdicted novelty 6.0

LogiHard hardens reasoning benchmarks by transforming 0-order selection into 2-order judgment, causing 31-56% accuracy drops in 12 frontier LLMs and a 47% drop on zero-shot MMLU, revealing a combinatorial reasoning ga...
GSM-SEM: Benchmark and Framework for Generating Semantically Variant Augmentations
cs.CL 2026-05 unverdicted novelty 6.0

GSM-SEM generates reusable, stochastic semantic variants of math reasoning benchmarks that alter underlying facts but preserve answers, producing larger LLM performance drops than prior surface-level variants.
Controllable and Verifiable Process Data Synthesis for Process Reward Models
cs.AI 2026-05 unverdicted novelty 6.0

A controllable synthesis method creates prefix-invalid yet trajectory-consistent process supervision data for training and evaluating process reward models by injecting verifiable errors into symbolic reasoning chains.
Rethinking LLM Ensembling from the Perspective of Mixture Models
cs.LG 2026-05 unverdicted novelty 6.0

ME reinterprets LLM ensembling as a mixture model by sampling a single model stochastically at each token step, matching the ensemble distribution while invoking only one model per step for substantial speed gains.
Train Separately, Merge Together: Modular Post-Training with Mixture-of-Experts
cs.LG 2026-04 unverdicted novelty 6.0

BAR trains independent domain experts via separate mid-training, SFT, and RL pipelines then composes them with a MoE router to match monolithic retraining performance at lower cost and without catastrophic forgetting.
ContraPrompt: Contrastive Prompt Optimization via Dyadic Reasoning Trace Analysis
cs.AI 2026-04 unverdicted novelty 6.0

ContraPrompt extracts optimization rules from dyadic differences in reasoning traces on identical inputs and organizes them into input-aware decision trees, outperforming GEPA on four benchmarks with gains up to 8.29 pp.
Agentic Frameworks for Reasoning Tasks: An Empirical Study
cs.AI 2026-04 unverdicted novelty 6.0

An empirical evaluation of 22 agentic frameworks on BBH, GSM8K, and ARC benchmarks shows stable performance in 12 frameworks but highlights orchestration failures and weaker mathematical reasoning.
Nexus: Same Pretraining Loss, Better Downstream Generalization via Common Minima
cs.LG 2026-04 unverdicted novelty 6.0

Nexus optimizer improves LLM downstream performance by converging to common minima across data sources despite identical pretraining loss.
Beyond End-to-End: Dynamic Chain Optimization for Private LLM Adaptation on the Edge
cs.DC 2026-04 unverdicted novelty 6.0

ChainFed achieves memory-efficient private LLM fine-tuning on edge devices through sequential layer-by-layer adapter training with dynamic co-tuning, perceptive optimization, and adaptive starting point selection, imp...
Kimi Linear: An Expressive, Efficient Attention Architecture
cs.CL 2025-10 unverdicted novelty 6.0

Kimi Linear hybridizes linear attention with a new KDA module to beat full attention on tasks while slashing KV cache by 75% and speeding decoding up to 6x.
InternVL3.5: Advancing Open-Source Multimodal Models in Versatility, Reasoning, and Efficiency
cs.CV 2025-08 unverdicted novelty 6.0

InternVL3.5 advances open-source multimodal models with Cascade RL for +16% reasoning gains and ViR for 4x inference speedup, with the 241B model reaching SOTA among open-source MLLMs on multimodal, reasoning, and age...
Dream 7B: Diffusion Large Language Models
cs.CL 2025-08 unverdicted novelty 6.0

Dream 7B is a 7B diffusion LLM that refines sequences in parallel via denoising and outperforms prior diffusion models on general, mathematical, and coding benchmarks with added flexibility in generation order and qua...
InternVL3: Exploring Advanced Training and Test-Time Recipes for Open-Source Multimodal Models
cs.CV 2025-04 conditional novelty 6.0

InternVL3-78B sets a new open-source SOTA of 72.2 on MMMU via native joint multimodal pre-training, V2PE, MPO, and test-time scaling while remaining competitive with proprietary models.
Process Reinforcement through Implicit Rewards
cs.LG 2025-02 conditional novelty 6.0

PRIME enables online process reward model updates in LLM RL using implicit rewards from rollouts and outcome labels, yielding 15.1% average gains on reasoning benchmarks and surpassing a stronger instruct model with 1...
Expanding Performance Boundaries of Open-Source Multimodal Models with Model, Data, and Test-Time Scaling
cs.CV 2024-12 unverdicted novelty 6.0

InternVL 2.5 is the first open-source MLLM to surpass 70% on the MMMU benchmark via model, data, and test-time scaling, with a 3.7-point gain from chain-of-thought reasoning.
MMLU-Pro: A More Robust and Challenging Multi-Task Language Understanding Benchmark
cs.CL 2024-06 conditional novelty 6.0

MMLU-Pro is a revised benchmark that makes language model evaluation harder and more stable by using ten options per question and emphasizing reasoning over simple knowledge recall.
LiveCodeBench: Holistic and Contamination Free Evaluation of Large Language Models for Code
cs.SE 2024-03 unverdicted novelty 6.0

LiveCodeBench collects 400 recent contest problems to create a contamination-free benchmark evaluating LLMs on code generation and related capabilities like self-repair and execution.
Teaching Large Language Models to Self-Debug
cs.CL 2023-04 unverdicted novelty 6.0

Self-Debugging teaches LLMs to identify and fix their own code errors through rubber-duck-style natural language explanations and execution feedback, delivering 2-12% gains over baselines on Spider, TransCoder, and MBPP.
BloombergGPT: A Large Language Model for Finance
cs.LG 2023-03 conditional novelty 6.0

BloombergGPT is a 50B parameter LLM trained on a 708B token mixed financial and general dataset that outperforms prior models on financial benchmarks while preserving general LLM performance.
Emergent Abilities of Large Language Models
cs.CL 2022-06 unverdicted novelty 6.0

Emergent abilities are capabilities present in large language models but absent in smaller ones and cannot be predicted by extrapolating smaller model performance.
The Efficiency Gap in Byte Modeling
cs.LG 2026-05 unverdicted novelty 5.0

Byte modeling incurs greater scaling overhead for masked diffusion than autoregressive models because the diffusion objective destroys local byte contiguity needed to resolve semantics.
ReAD: Reinforcement-Guided Capability Distillation for Large Language Models
cs.CL 2026-05 unverdicted novelty 5.0

ReAD applies a contextual bandit to allocate fixed-token distillation budget across interdependent LLM capabilities, yielding higher task utility and fewer negative spillovers than standard methods.
Consistency as a Testable Property: Statistical Methods to Evaluate AI Agent Reliability
cs.AI 2026-05 unverdicted novelty 5.0

A framework with U-statistics and kernel-based metrics quantifies AI agent consistency and robustness, showing trajectory metrics outperform pass@1 rates in diagnosing failures.
Multi-Agent Reasoning Improves Compute Efficiency: Pareto-Optimal Test-Time Scaling
cs.AI 2026-05 unverdicted novelty 5.0

Multi-agent debate and mixture-of-agents outperform self-consistency by 1.3 and 2.7 percentage points respectively at equal compute budgets on MMLU-Pro and BBH, with advantages that continue at higher scales while sel...
SpikingBrain2.0: Brain-Inspired Foundation Models for Efficient Long-Context and Cross-Platform Inference
cs.LG 2026-04 unverdicted novelty 5.0

SpikingBrain2.0 is a 5B hybrid spiking-Transformer that recovers most base model performance while delivering 10x TTFT speedup at 4M context and supporting over 10M tokens on limited GPUs via dual sparse attention and...
Rethinking Data Curation in LLM Training: Online Reweighting Offers Better Generalization than Offline Methods
cs.LG 2026-04 unverdicted novelty 5.0

ADAPT is an online reweighting framework for LLM training that outperforms offline data selection and mixing methods in cross-benchmark generalization under equal compute.
Guardian-as-an-Advisor: Advancing Next-Generation Guardian Models for Trustworthy LLMs
cs.LG 2026-04 unverdicted novelty 5.0

Guardian-as-an-Advisor prepends risk labels and explanations from a guardian model to queries, improving LLM safety compliance and reducing over-refusal while adding minimal compute overhead.
Kimi K2: Open Agentic Intelligence
cs.LG 2025-07 unverdicted novelty 5.0

Kimi K2 is a 1-trillion-parameter MoE model that leads open-source non-thinking models on agentic benchmarks including 65.8 on SWE-Bench Verified and 66.1 on Tau2-Bench.
SmolLM2: When Smol Goes Big -- Data-Centric Training of a Small Language Model
cs.CL 2025-02 unverdicted novelty 5.0

SmolLM2 is a 1.7B-parameter language model that outperforms Qwen2.5-1.5B and Llama3.2-1B after overtraining on 11 trillion tokens using custom FineMath, Stack-Edu, and SmolTalk datasets in a multi-stage pipeline.
DeepSeek-Coder: When the Large Language Model Meets Programming -- The Rise of Code Intelligence
cs.SE 2024-01 unverdicted novelty 5.0

DeepSeek-Coder open-source models trained on 2T code tokens with fill-in-the-blank pretraining achieve SOTA results among open models and surpass closed-source Codex and GPT-3.5 on code benchmarks.
DeepSeekMoE: Towards Ultimate Expert Specialization in Mixture-of-Experts Language Models
cs.CL 2024-01 unverdicted novelty 5.0

DeepSeekMoE 2B matches GShard 2.9B performance and approaches a dense 2B model; the 16B version matches LLaMA2-7B at 40% compute by using fine-grained expert segmentation plus shared experts.
PaLM 2 Technical Report
cs.CL 2023-05 unverdicted novelty 5.0

PaLM 2 reports state-of-the-art results on language, reasoning, and multilingual tasks with improved efficiency over PaLM.
Galactica: A Large Language Model for Science
cs.CL 2022-11 unverdicted novelty 5.0

Galactica, a science-specialized LLM, reports higher scores than GPT-3, Chinchilla, and PaLM on LaTeX knowledge, mathematical reasoning, and medical QA benchmarks while outperforming general models on BIG-bench.
Yi: Open Foundation Models by 01.AI
cs.CL 2024-03 unverdicted novelty 4.0

Yi models are 6B and 34B open foundation models pretrained on 3.1T curated tokens that achieve strong benchmark results through data quality and targeted extensions like long context and vision alignment.
DeepSeek LLM: Scaling Open-Source Language Models with Longtermism
cs.CL 2024-01 unverdicted novelty 4.0

DeepSeek LLM 67B exceeds LLaMA-2 70B on code, mathematics and reasoning benchmarks after pre-training on 2 trillion tokens and alignment via SFT and DPO.

Reference graph

Works this paper leans on

42 extracted references · 42 canonical work pages · cited by 52 Pith papers · 12 internal anchors

[1]

Program Synthesis with Large Language Models

Jacob Austin, Augustus Odena, Maxwell Nye, Maarten Bosma, Henryk Michalewski, David Dohan, Ellen Jiang, Carrie Cai, Michael Terry, Quoc Le, et al. Program synthesis with large language models. arXiv preprint arXiv:2108.07732,

[2]

Language models are few-shot learners

Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, et al. Language models are few-shot learners. Advances in neural information processing systems, 33:1877–1901,

[3]

Evaluating Large Language Models Trained on Code

Mark Chen, Jerry Tworek, Heewoo Jun, Qiming Yuan, Henrique Ponde de Oliveira Pinto, Jared Kaplan, Harri Edwards, Yuri Burda, Nicholas Joseph, Greg Brockman, et al. Evaluating large language models trained on code. arXiv preprint arXiv:2107.03374, 2021a. Mark Chen, Jerry Tworek, Heewoo Jun, Qiming Yuan, Henrique Ponde de Oliveira Pinto, Jared Kaplan, Harri...

[4]

Binding language models in symbolic languages

Zhoujun Cheng, Tianbao Xie, Peng Shi, Chengzu Li, Rahul Nadkarni, Yushi Hu, Caiming Xiong, Dragomir Radev, Mari Ostendorf, Luke Zettlemoyer, et al. Binding language models in symbolic languages. arXiv preprint arXiv:2210.02875,

[5]

PaLM: Scaling Language Modeling with Pathways

Aakanksha Chowdhery, Sharan Narang, Jacob Devlin, Maarten Bosma, Gaurav Mishra, Adam Roberts, Paul Barham, Hyung Won Chung, Charles Sutton, Sebastian Gehrmann, et al. Palm: Scaling language modeling with pathways. arXiv preprint arXiv:2204.02311,

[6]

Selection-inference: Exploiting large language models for interpretable logical reasoning

Antonia Creswell, Murray Shanahan, and Irina Higgins. Selection-inference: Exploiting large language models for interpretable logical reasoning. arXiv preprint arXiv:2205.09712,

[7]

BERT: Pre-training of deep bidirectional transformers for language understanding

Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 4171–4186, Minneapolis... Association for Computational Linguistics. doi: 10.18653/v1/N19-1423. URL https://aclanthology.org/N19-1423.

[8]

Compositional semantic parsing with large language models

Andrew Drozdov, Nathanael Schärli, Ekin Akyürek, Nathan Scales, Xinying Song, Xinyun Chen, Olivier Bousquet, and Denny Zhou. Compositional semantic parsing with large language models. arXiv preprint arXiv:2209.15003,

[9]

Predictability and surprise in large generative models

Deep Ganguli, Danny Hernandez, Liane Lovitt, Amanda Askell, Yuntao Bai, Anna Chen, Tom Conerly, Nova Dassarma, Dawn Drain, Nelson Elhage, et al. Predictability and surprise in large generative models. In 2022 ACM Conference on Fairness, Accountability, and Transparency, pages 1747–1764,

[10]

Scaling laws for transfer

Association for Computational Linguistics. doi: 10.18653/v1/2022.lnls-1.4. URL https://aclanthology.org/2022.lnls-1.4. Danny Hernandez, Jared Kaplan, Tom Henighan, and Sam McCandlish. Scaling laws for transfer. arXiv preprint arXiv:2102.01293,

[11]

Training Compute-Optimal Large Language Models

Jordan Hoffmann, Sebastian Borgeaud, Arthur Mensch, Elena Buchatskaya, Trevor Cai, Eliza Rutherford, Diego de Las Casas, Lisa Anne Hendricks, Johannes Welbl, Aidan Clark, et al. Training compute-optimal large language models. arXiv preprint arXiv:2203.15556,

[12]

Language Models (Mostly) Know What They Know

Saurav Kadavath, Tom Conerly, Amanda Askell, Tom Henighan, Dawn Drain, Ethan Perez, Nicholas Schiefer, Zac Hatfield Dodds, Nova DasSarma, Eli Tran-Johnson, et al. Language models (mostly) know what they know. arXiv preprint arXiv:2207.05221,

[13]

Large Language Models are Zero-Shot Reasoners

Takeshi Kojima, Shixiang Shane Gu, Machel Reid, Yutaka Matsuo, and Yusuke Iwasawa. Large language models are zero-shot reasoners. arXiv preprint arXiv:2205.11916,

[14]

Can language models learn from explanations in context?

Andrew K Lampinen, Ishita Dasgupta, Stephanie CY Chan, Kory Matthewson, Michael Henry Tessler, Antonia Creswell, James L McClelland, Jane X Wang, and Felix Hill. Can language models learn from explanations in context? arXiv preprint arXiv:2204.02329,

[15]

The power of scale for parameter-efficient prompt tuning

Brian Lester, Rami Al-Rfou, and Noah Constant. The power of scale for parameter-efficient prompt tuning. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, pages 3045–3059,

[16]

On the advance of making language models better reasoners

Yifei Li, Zeqi Lin, Shizhuo Zhang, Qiang Fu, Bei Chen, Jian-Guang Lou, and Weizhu Chen. On the advance of making language models better reasoners. arXiv preprint arXiv:2206.02336, 2022a. Yujia Li, David Choi, Junyoung Chung, Nate Kushman, Julian Schrittwieser, Rémi Leblond, Tom Eccles, James Keeling, Felix Gimeno, Agustin Dal Lago, et al. Competition-leve...

[17]

Rethinking the role of demonstrations: What makes in-context learning work?

Sewon Min, Mike Lewis, Luke Zettlemoyer, and Hannaneh Hajishirzi. MetaICL: Learning to learn in context. In NAACL-HLT, 2022a. Sewon Min, Xinxi Lyu, Ari Holtzman, Mikel Artetxe, Mike Lewis, Hannaneh Hajishirzi, and Luke Zettlemoyer. Rethinking the role of demonstrations: What makes in-context learning work? arXiv preprint arXiv:2202.12837, 2022b. Swaroop...

[18]

Ambipun: Generating humorous puns with ambiguous context

Anirudh Mittal, Yufei Tian, and Nanyun Peng. Ambipun: Generating humorous puns with ambiguous context. arXiv preprint arXiv:2205.01825,

[19]

Show Your Work: Scratchpads for Intermediate Computation with Language Models

Maxwell Nye, Anders Johan Andreassen, Guy Gur-Ari, Henryk Michalewski, Jacob Austin, David Bieber, David Dohan, Aitor Lewkowycz, Maarten Bosma, David Luan, et al. Show your work: Scratchpads for intermediate computation with language models. arXiv preprint arXiv:2112.00114,

[20]

Training language models to follow instructions with human feedback

Long Ouyang, Jeff Wu, Xu Jiang, Diogo Almeida, Carroll L Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, et al. Training language models to follow instructions with human feedback. arXiv preprint arXiv:2203.02155,

[21]

Scaling Language Models: Methods, Analysis & Insights from Training Gopher

Jack W Rae, Sebastian Borgeaud, Trevor Cai, Katie Millican, Jordan Hoffmann, Francis Song, John Aslanides, Sarah Henderson, Roman Ring, Susannah Young, et al. Scaling language models: Methods, analysis & insights from training gopher. arXiv preprint arXiv:2112.11446,

[22]

Language models are multilingual chain-of-thought reasoners

Freda Shi, Mirac Suzgun, Markus Freitag, Xuezhi Wang, Suraj Srivats, Soroush Vosoughi, Hyung Won Chung, Yi Tay, Sebastian Ruder, Denny Zhou, Dipanjan Das, and Jason Wei. Language models are multilingual chain-of-thought reasoners. ArXiv, abs/2210.03057,

[23]

Beyond the Imitation Game: Quantifying and extrapolating the capabilities of language models

Aarohi Srivastava, Abhinav Rastogi, Abhishek Rao, Abu Awal Md Shoeb, Abubakar Abid, Adam Fisch, Adam R Brown, Adam Santoro, Aditya Gupta, Adrià Garriga-Alonso, et al. Beyond the imitation game: Quantifying and extrapolating the capabilities of language models. arXiv preprint arXiv:2206.04615,

[24]

Natural language inference with a human touch: Using human explanations to guide model attention

Joe Stacey, Yonatan Belinkov, and Marek Rei. Natural language inference with a human touch: Using human explanations to guide model attention. arXiv preprint arXiv:2104.08142,

[25]

Prompt-and-rerank: A method for zero-shot and few-shot arbitrary textual style transfer with small language models

Mirac Suzgun, Luke Melas-Kyriazi, and Dan Jurafsky. Prompt-and-rerank: A method for zero-shot and few-shot arbitrary textual style transfer with small language models. arXiv preprint arXiv:2205.11503,

[26]

On the machine learning of ethical judgments from natural language

Zeerak Talat, Hagen Blix, Josef Valvoda, Maya Indira Ganesh, Ryan Cotterell, and Adina Williams. On the machine learning of ethical judgments from natural language. In Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 769–779,

[27]

Scaling laws vs model architectures: How does inductive bias influence scaling?

Yi Tay, Mostafa Dehghani, Samira Abnar, Hyung Won Chung, William Fedus, Jinfeng Rao, Sharan Narang, Vinh Q Tran, Dani Yogatama, and Donald Metzler. Scaling laws vs model architectures: How does inductive bias influence scaling? arXiv preprint arXiv:2207.10551, 2022a.

[28]

Self-Consistency Improves Chain of Thought Reasoning in Language Models

Xuezhi Wang, Jason Wei, Dale Schuurmans, Quoc Le, Ed Chi, and Denny Zhou. Self-consistency improves chain of thought reasoning in language models. arXiv preprint arXiv:2203.11171,

[29]

Do prompt-based models really understand the meaning of their prompts?

Albert Webson and Ellie Pavlick. Do prompt-based models really understand the meaning of their prompts? arXiv preprint arXiv:2109.01247,

[30]

Finetuned language models are zero-shot learners

Jason Wei, Maarten Bosma, Vincent Y Zhao, Kelvin Guu, Adams Wei Yu, Brian Lester, Nan Du, Andrew M Dai, and Quoc V Le. Finetuned language models are zero-shot learners. ICLR 2022,

[31]

Emergent abilities of large language models

Jason Wei, Yi Tay, Rishi Bommasani, Colin Raffel, Barret Zoph, Sebastian Borgeaud, Dani Yogatama, Maarten Bosma, Denny Zhou, Donald Metzler, et al. Emergent abilities of large language models. Transactions on Machine Learning Research (TMLR), 2022a. Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Brian Ichter, Fei Xia, Ed Chi, Quoc Le, and Denny Z...

[32]

Language models are few-shot multilingual learners

Association for Computational Linguistics. doi: 10.18653/v1/2022.naacl-main.47. URL https://aclanthology.org/2022.naacl-main.47. Genta Indra Winata, Andrea Madotto, Zhaojiang Lin, Rosanne Liu, Jason Yosinski, and Pascale Fung. Language models are few-shot multilingual learners. In Proceedings of the 1st Workshop on Multilingual Representation Learning, p...

[33]

An explanation of in-context learning as implicit bayesian inference

Sang Michael Xie, Aditi Raghunathan, Percy Liang, and Tengyu Ma. An explanation of in-context learning as implicit bayesian inference. arXiv preprint arXiv:2111.02080, 2021.

[34]

Least-to-Most Prompting Enables Complex Reasoning in Large Language Models

Denny Zhou, Nathanael Schärli, Le Hou, Jason Wei, Nathan Scales, Xuezhi Wang, Dale Schuurmans, Olivier Bousquet, Quoc Le, and Ed Chi. Least-to-most prompting enables complex reasoning in large language models. arXiv preprint arXiv:2205.10625,

[35]

The concert was scheduled to be on 06/01/1943, but was delayed by one day to today. What is the date yesterday in MM/DD/YYYY?

A BIG-Bench Hard Task Descriptions

Boolean Expressions. Evaluate the truth value of a random Boolean expression consisting of Boolean constants (True, False) and basic Boolean operators (and, or, and not).

Causal Judgment. Given a short story (involving moral, intentional, or counterfactual analysis), determine how a typical person would answer a causa...

[36]


What is the date 10 days ago in MM/DD/YYYY? Options: (A) 12/14/2026 (B) 12/14/1950 (C) 12/14/2007 (D) 12/14/1937 (E) 07/14/1938 (F) 12/14/1988 A: Let's think step by step. If today is Christmas Eve of 1937, then today's date is December 24,

[37]


10 days before today is December 14, 1937, that is 12/14/1937. So the answer is (D). Q: Tomorrow is 11/12/2019. What is the date one year ago from today in MM/DD/YYYY? Options: (A) 09/04/2018 (B) 11/11/2018 (C) 08/25/2018 (D) 11/02/2018 (E) 11/04/2018 A: Let's think step by step. If tomorrow is 11/12/2019, then today is 11/11/2019. The date one year ago f...
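The date arithmetic in the two worked answers above can be checked mechanically. The following is a minimal sketch using Python's standard `datetime` module; the helper names `fmt` and `shift_years` are hypothetical and are not part of the paper or the benchmark.

```python
from datetime import date, timedelta


def fmt(d: date) -> str:
    """Format a date as MM/DD/YYYY, the answer format used in the prompts."""
    return d.strftime("%m/%d/%Y")


def shift_years(d: date, years: int) -> date:
    """Move a date by whole calendar years (assumption: Feb 29 falls back to Feb 28)."""
    try:
        return d.replace(year=d.year + years)
    except ValueError:
        return d.replace(year=d.year + years, day=28)


# "Today is Christmas Eve of 1937. What is the date 10 days ago?"
christmas_eve = date(1937, 12, 24)
ten_days_ago = fmt(christmas_eve - timedelta(days=10))

# "Tomorrow is 11/12/2019. What is the date one year ago from today?"
today = date(2019, 11, 12) - timedelta(days=1)
one_year_ago = fmt(shift_years(today, -1))

print(ten_days_ago)   # 12/14/1937, option (D) in the first question
print(one_year_ago)   # 11/11/2018, option (B) in the second question
```

Each step in the code mirrors one sentence of the chain of thought: first fix today's date, then apply the offset, then format the result.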

[38]


It is their 5-year anniversary today. What is the date tomorrow in MM/DD/YYYY? Options: (A) 01/11/1961 (B) 01/03/1963 (C) 01/18/1961 (D) 10/14/1960 (E) 01/03/1982 (F) 12/03/1960 A: Let's think step by step. If Jane and John married on Jan 2, 1958, then and if it is their 5-year anniversary today, then today's date is Jan 2,

[39]

[ { [". We will need to pop out

The date tomorrow is Jan 3, 1963, that is 01/03/1963. So the answer is (B).

C.4 CoT Prompt for Disambiguation QA

C.5 CoT Prompt for Dyck Languages

Correctly close a Dyck-n word.

Q: Complete the rest of the sequence, making sure that the parentheses are closed properly. Input: [ { [
A: Let's think step by step. We sho...
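The Dyck Languages task above amounts to tracking unclosed brackets on a stack and emitting their closers in reverse order. The sketch below illustrates that idea; the function name `close_dyck` and the bracket set in `PAIRS` are assumptions made for this example, not the benchmark's own code.

```python
# Assumed bracket alphabet; the task input is taken to be space-separated tokens.
PAIRS = {"(": ")", "[": "]", "{": "}", "<": ">"}


def close_dyck(prefix: str) -> str:
    """Return the closing brackets needed to complete a space-separated Dyck prefix."""
    stack = []
    for token in prefix.split():
        if token in PAIRS:
            # Opening bracket: remember it until its closer appears.
            stack.append(token)
        elif stack and PAIRS[stack[-1]] == token:
            # Closing bracket that matches the most recent opener.
            stack.pop()
        else:
            raise ValueError(f"unexpected token {token!r} in {prefix!r}")
    # Close whatever is still open, innermost first.
    return " ".join(PAIRS[token] for token in reversed(stack))


print(close_dyck("[ { ["))  # ] } ]
```

For the prompt input `[ { [`, the stack ends as `[`, `{`, `[`, so the brackets are closed in the order `]`, `}`, `]`.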

[40]


Amongst all the options, the only movie similar to these ones seems to be The Princess Bride (1987). So the answer is (C). Q: Find a movie similar to Twister, The Silence of the Lambs, Independence Day, Braveheart: Options: (A) They Shoot Horses (B) Don't They (C) Forrest Gump (D) The Salton Sea (E) Extreme Days A: Let's think step by step. - Twister (act...

[41]


These are all famous Hollywood movies produced around the 1990s. Amongst all the options, the only movie similar to these ones seems to be Forrest Gump (comedy, drama, romance; 1994). So the answer is (C). Q: Find a movie similar to Minority Report, Total Recall, Inside Out, Forrest Gump: Options: (A) Phenomena (B) Lilting (C) Catwoman (D) Edge of Tomorro...

[42]


These are all famous movies produced in the past few decades. Amongst all the options, the only movie similar to these ones seems to be Edge of Tomorrow (action, adventure, crime, mystery; 2014), as it is also a science-fiction movie and features Tom Cruise. So the answer is (D).

C.11 CoT Prompt for Multi-Step Arithmetic

Solve multi...
