pith. machine review for the scientific record.

arxiv: 2204.02311 · v5 · submitted 2022-04-05 · 💻 cs.CL

Recognition: 2 theorem links · Lean Theorem

PaLM: Scaling Language Modeling with Pathways

Aakanksha Chowdhery, Abhishek Rao, Adam Roberts, Aitor Lewkowycz, Alexander Spiridonov, Andrew M. Dai, Anselm Levskaya, Barret Zoph, Ben Hutchinson, Brennan Saeta, Charles Sutton, Daphne Ippolito, David Dohan, David Luan, Denny Zhou, Douglas Eck, Emily Reif, Erica Moreira, Gaurav Mishra, Guy Gur-Ari, Henryk Michalewski, Hyeontaek Lim, Hyung Won Chung, Jacob Austin, Jacob Devlin, James Bradbury, Jason Wei, Jeff Dean, Joshua Maynez, Katherine Lee, Kathy Meier-Hellstern, Kensen Shi, Kevin Robinson, Liam Fedus, Maarten Bosma, Marie Pellat, Mark Diaz, Mark Omernick, Michael Isard, Michele Catasta, Nan Du, Noah Fiedel, Noam Shazeer, Oleksandr Polozov, Orhan Firat, Parker Barnes, Parker Schuh, Paul Barham, Pengcheng Yin, Reiner Pope, Rewon Child, Ryan Sepassi, Sanjay Ghemawat, Sasha Tsvyashchenko, Sebastian Gehrmann, Sharan Narang, Shivani Agrawal, Slav Petrov, Sunipa Dev, Thanumalayan Sankaranarayana Pillai, Toju Duke, Vedant Misra, Vinodkumar Prabhakaran, Xavier Garcia, Xuezhi Wang, Yi Tay, Zongwei Zhou

Pith reviewed 2026-05-10 23:39 UTC · model grok-4.3

classification 💻 cs.CL
keywords large language models · few-shot learning · scaling · Transformer · BIG-bench · reasoning tasks · multilingual · code generation

The pith

Scaling a language model to 540 billion parameters produces state-of-the-art few-shot results on hundreds of benchmarks.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper trains PaLM, a 540-billion parameter Transformer language model, on 6144 TPU v4 chips using the Pathways system for efficient distributed training. It establishes that increasing model scale continues to improve few-shot learning across language understanding and generation tasks. The largest model shows particular gains on multi-step reasoning and reaches average human performance on the BIG-bench suite, with some tasks exhibiting sharp jumps only at this scale. These results indicate that larger models can adapt to new tasks with fewer examples than smaller ones.
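
As a concrete illustration of the few-shot setup this summary refers to, the sketch below assembles a k-shot prompt from labeled exemplars. The template, task, and exemplars are invented for illustration and are not the paper's actual evaluation prompts.

    # Illustrative k-shot prompt construction; template and exemplars are
    # invented, not the paper's actual evaluation prompts.
    def build_few_shot_prompt(exemplars, query, k=5):
        """Concatenate k labeled exemplars followed by the unlabeled query."""
        lines = [f"Q: {q}\nA: {a}" for q, a in exemplars[:k]]
        lines.append(f"Q: {query}\nA:")
        return "\n\n".join(lines)

    exemplars = [
        ("What is 2 + 3?", "5"),
        ("What is 7 - 4?", "3"),
    ]
    print(build_few_shot_prompt(exemplars, "What is 6 + 8?", k=2))
    # The model's continuation after the final "A:" is scored against the
    # reference answer; no parameters are updated.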

Core claim

By training a 540-billion parameter densely activated Transformer language model using the Pathways system across multiple TPU pods, the authors demonstrate continued scaling benefits through state-of-the-art few-shot performance on hundreds of benchmarks. The model outperforms the finetuned state of the art on multi-step reasoning tasks and exceeds average human performance on BIG-bench, where a significant number of tasks show discontinuous improvements only at the largest size. PaLM also exhibits strong multilingual and code generation capabilities.

What carries the argument

PaLM, the 540-billion parameter Pathways Language Model, a densely activated Transformer trained efficiently via the Pathways ML system on 6144 TPU v4 chips.
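
To give a feel for where a 540-billion parameter count comes from, the sketch below applies the common 12·L·d² rule of thumb for a dense decoder-only Transformer plus an embedding table. PaLM's actual architecture (SwiGLU feed-forward layers, multi-query attention, shared embeddings) shifts these constants, and the layer, width, and vocabulary values used here are illustrative choices, not hyperparameters quoted from the paper.

    # Back-of-envelope parameter count for a dense decoder-only Transformer:
    # 4*d^2 for the attention projections, 8*d^2 for a 4x-expanded MLP,
    # plus a token embedding table. Values below are illustrative only.
    def dense_transformer_params(n_layers, d_model, vocab_size):
        attention = 4 * d_model * d_model   # Q, K, V, and output projections
        mlp = 8 * d_model * d_model         # two weight matrices with 4x expansion
        embeddings = vocab_size * d_model   # token embedding table
        return n_layers * (attention + mlp) + embeddings

    total = dense_transformer_params(n_layers=118, d_model=18_432, vocab_size=256_000)
    print(f"~{total / 1e9:.0f}B parameters")  # prints ~486B, i.e. the 540B regime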

If this is right

  • Few-shot prompts alone suffice to exceed finetuned systems on multi-step reasoning tasks.
  • Average human performance is reached on a broad suite of language tasks without task-specific training.
  • Multilingual tasks and source code generation improve alongside English benchmarks as scale increases.
  • Some tasks exhibit sharp performance increases only once model size reaches hundreds of billions of parameters (a toy version of this check is sketched after this list).
  • Analysis of bias, toxicity, and memorization as a function of model scale becomes feasible.
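
One way to make the "sharp performance increases" bullet operational is to fit a log-linear scaling trend on the smaller models and ask how far the largest model lands above the extrapolation. All numbers in this sketch are invented; they are not results from the paper.

    import numpy as np

    # Toy check for a discontinuous scaling gain: fit accuracy against
    # log10(parameter count) on the smaller models, then compare the largest
    # model's score with the extrapolated value. All numbers are invented.
    params = np.array([8e9, 62e9, 540e9])    # model sizes (illustrative)
    accuracy = np.array([0.22, 0.31, 0.78])  # task scores (illustrative)

    slope, intercept = np.polyfit(np.log10(params[:-1]), accuracy[:-1], deg=1)
    predicted = slope * np.log10(params[-1]) + intercept
    print(f"extrapolated {predicted:.2f} vs observed {accuracy[-1]:.2f}")
    # A large positive gap is what the review calls a discontinuous improvement.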

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Similar scaling combined with efficient training systems could reduce the data needed for new applications in other modalities.
  • Models of this size may enable practical systems that handle varied real-world queries with minimal adaptation.
  • The pattern of discontinuous gains suggests that certain capabilities emerge only after crossing specific size thresholds.
  • Ongoing scaling will require new methods to manage memorization of training data and unintended biases.

Load-bearing premise

That the observed performance gains from scaling to 540 billion parameters will continue to appear on tasks and data outside the specific benchmarks and training distribution used.

What would settle it

A follow-up experiment that trains a model at or above 540 billion parameters and finds no further gains or discontinuous jumps on BIG-bench tasks, or that matches the reported results without scaling.

read the original abstract

Large language models have been shown to achieve remarkable performance across a variety of natural language tasks using few-shot learning, which drastically reduces the number of task-specific training examples needed to adapt the model to a particular application. To further our understanding of the impact of scale on few-shot learning, we trained a 540-billion parameter, densely activated, Transformer language model, which we call Pathways Language Model (PaLM). We trained PaLM on 6144 TPU v4 chips using Pathways, a new ML system which enables highly efficient training across multiple TPU Pods. We demonstrate continued benefits of scaling by achieving state-of-the-art few-shot learning results on hundreds of language understanding and generation benchmarks. On a number of these tasks, PaLM 540B achieves breakthrough performance, outperforming the finetuned state-of-the-art on a suite of multi-step reasoning tasks, and outperforming average human performance on the recently released BIG-bench benchmark. A significant number of BIG-bench tasks showed discontinuous improvements from model scale, meaning that performance steeply increased as we scaled to our largest model. PaLM also has strong capabilities in multilingual tasks and source code generation, which we demonstrate on a wide array of benchmarks. We additionally provide a comprehensive analysis on bias and toxicity, and study the extent of training data memorization with respect to model scale. Finally, we discuss the ethical considerations related to large language models and discuss potential mitigation strategies.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 3 minor

Summary. The paper presents PaLM, a 540-billion parameter densely activated Transformer language model trained on 6144 TPU v4 chips using the Pathways system. It claims continued scaling benefits via state-of-the-art few-shot results across hundreds of language understanding and generation benchmarks, including breakthrough performance that outperforms finetuned SOTA on multi-step reasoning tasks and exceeds average human performance on BIG-bench (with discontinuous jumps on a significant number of tasks). Additional results cover multilingual tasks, code generation, bias/toxicity analysis, and memorization studies as a function of scale.

Significance. If the empirical results hold, the work provides substantial evidence for scaling benefits at the 540B parameter regime, particularly for few-shot reasoning and multilingual capabilities. The inclusion of bias, toxicity, and memorization analyses is a strength that aids responsible assessment of large models. The demonstration of efficient large-scale training via Pathways is a notable engineering contribution.

major comments (2)
  1. [Benchmark results section] The claims of SOTA few-shot performance, breakthrough reasoning results, and outperforming human performance on BIG-bench are presented without reported statistical error bars, multiple evaluation runs, or precise protocol details (e.g., prompt formatting, decoding parameters), which are load-bearing for substantiating the scaling and discontinuous improvement assertions (a toy per-task protocol record is sketched at the end of this report).
  2. [Training data and setup] The description of the 780B token training corpus and data filtering/mixture is high-level; this directly impacts reproducibility of the reported scaling observations and assessment of potential contamination effects on the few-shot and BIG-bench results.
minor comments (3)
  1. [Abstract] The abstract states results on 'hundreds of benchmarks' but does not enumerate the exact count or breakdown by category, reducing clarity.
  2. [Figures] Figure captions and scaling plots would benefit from explicit axis labels for model size and data volume to facilitate direct comparison with prior scaling studies.
  3. [Memorization analysis] The memorization analysis section could include a direct comparison table against smaller models (e.g., 8B or 62B variants) for quantitative context.
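
To make the protocol concern in major comment 1 concrete, the sketch below shows one hypothetical shape for a per-task protocol record (prompt template, shot count, decoding parameters) that could be published alongside each score. Every field value is a placeholder, not a setting reported in the paper.

    from dataclasses import dataclass, asdict
    import json

    # Minimal sketch of a per-task evaluation protocol record; all values
    # are placeholders, not settings reported in the paper.
    @dataclass
    class EvalProtocol:
        task: str
        num_shots: int
        prompt_template: str
        temperature: float
        top_p: float
        max_decode_tokens: int

    protocol = EvalProtocol(
        task="multi-step arithmetic reasoning",
        num_shots=8,
        prompt_template="Q: {question}\nA: {answer}",
        temperature=0.0,        # greedy decoding chosen as a placeholder
        top_p=1.0,
        max_decode_tokens=256,
    )
    print(json.dumps(asdict(protocol), indent=2))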

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their positive assessment of the work and recommendation for minor revision. We appreciate the constructive feedback on improving the substantiation of our claims and the reproducibility of our experimental setup. We address each major comment below and outline the changes we will make to the manuscript.

read point-by-point responses
  1. Referee: [Benchmark results section] The claims of SOTA few-shot performance, breakthrough reasoning results, and outperforming human performance on BIG-bench are presented without reported statistical error bars, multiple evaluation runs, or precise protocol details (e.g., prompt formatting, decoding parameters), which are load-bearing for substantiating the scaling and discontinuous improvement assertions.

    Authors: We agree that additional protocol details are necessary to fully substantiate the reported results. In the revised manuscript, we will expand the evaluation sections to provide precise information on prompt formatting (including exact templates and number of shots), decoding parameters (e.g., temperature, top-p, and beam size where applicable), and the standardized evaluation harness used across benchmarks. For BIG-bench, we followed the official few-shot protocol defined by the benchmark. Regarding statistical error bars and multiple runs, the computational cost of full evaluations on the 540B model across hundreds of tasks is extremely high, rendering repeated runs infeasible within our resource constraints. We prioritized comprehensive coverage of tasks over variance estimation. However, we will add notes on prompt sensitivity for key reasoning tasks where we observed consistent gains, and we maintain that the magnitude of the observed improvements (including discontinuous jumps) aligns with prior scaling studies even in the absence of error bars. revision: partial

  2. Referee: [Training data and setup] The description of the 780B token training corpus and data filtering/mixture is high-level; this directly impacts reproducibility of the reported scaling observations and assessment of potential contamination effects on the few-shot and BIG-bench results.

    Authors: We acknowledge that a more detailed description would aid reproducibility and contamination analysis. We will revise the 'Training Data' section (and associated appendix) to include expanded details on the data mixture ratios, specific sources within each category (web, books, code, multilingual), the quality filtering and deduplication methods applied, and the resulting token counts per category that total 780B tokens. We will also add a subsection discussing our contamination mitigation steps, including n-gram overlap checks against major benchmarks. While the full corpus cannot be released due to its scale and proprietary elements, these additions will provide sufficient information to interpret the scaling results and assess potential data leakage effects. revision: yes
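
The n-gram overlap check the authors commit to can be sketched in a few lines; the n-gram length and overlap threshold here are arbitrary illustrative choices, not the paper's reported procedure.

    # Minimal sketch of an n-gram overlap contamination check between one
    # training document and one evaluation item; window size and threshold
    # are arbitrary illustrative choices.
    def ngrams(text, n=8):
        tokens = text.lower().split()
        return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

    def is_contaminated(train_doc, eval_example, n=8, threshold=0.5):
        eval_grams = ngrams(eval_example, n)
        if not eval_grams:
            return False
        overlap = len(eval_grams & ngrams(train_doc, n)) / len(eval_grams)
        return overlap >= threshold

    train_doc = "the quick brown fox jumps over the lazy dog near the river bank"
    eval_item = "the quick brown fox jumps over the lazy dog"
    print(is_contaminated(train_doc, eval_item, n=5))  # True: the item appears verbatim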

Circularity Check

0 steps flagged

No significant circularity; direct empirical scaling results

full rationale

This is a large-scale empirical study reporting training of a 540B-parameter Transformer on 6144 TPU v4 chips and its few-shot evaluation across hundreds of benchmarks, including BIG-bench and reasoning tasks. No derivations, equations, or first-principles predictions appear; all performance claims rest on the reported experimental measurements rather than any fitted parameter being renamed as a prediction or any self-citation chain substituting for independent evidence. Bias, toxicity, and memorization analyses are likewise direct empirical checks. The central claims therefore remain self-contained experimental observations without reduction to inputs by construction.

Axiom & Free-Parameter Ledger

2 free parameters · 2 axioms · 0 invented entities

The work rests on the standard Transformer architecture and the empirical hypothesis that few-shot performance improves with scale; no new physical entities or ad-hoc constants are introduced.

free parameters (2)
  • model parameter count
    Chosen as the target scale to test continued scaling benefits.
  • training data mixture and volume
    Determined by available corpora and hardware constraints.
axioms (2)
  • domain assumption: The Transformer architecture remains effective at 540B scale
    Invoked implicitly by using the same architecture as prior models.
  • domain assumption: Few-shot evaluation on standard benchmarks measures meaningful capability gains
    Central to interpreting all reported results.

pith-pipeline@v0.9.0 · 5832 in / 1283 out tokens · 65370 ms · 2026-05-10T23:39:59.875499+00:00 · methodology

discussion (0)


Forward citations

Cited by 60 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. DSPy: Compiling Declarative Language Model Calls into Self-Improving Pipelines

    cs.CL 2023-10 conditional novelty 8.0

    DSPy compiles short declarative programs into LM pipelines that self-optimize and outperform both standard few-shot prompting and expert-written chains on math, retrieval, and QA tasks.

  2. AgentBench: Evaluating LLMs as Agents

    cs.AI 2023-08 unverdicted novelty 8.0

    AgentBench is a new multi-environment benchmark showing commercial LLMs outperform open-source models up to 70B parameters in agent tasks mainly due to better long-term reasoning and instruction following.

  3. Tree of Thoughts: Deliberate Problem Solving with Large Language Models

    cs.CL 2023-05 accept novelty 8.0

    Tree of Thoughts enables language models to solve complex planning tasks by generating, evaluating, and searching over coherent intermediate thoughts in a tree, raising Game of 24 success from 4% to 74% with GPT-4.

  4. All Circuits Lead to Rome: Rethinking Functional Anisotropy in Circuit and Sheaf Discovery for LLMs

    cs.CL 2026-05 unverdicted novelty 7.0

    LLM tasks are supported by multiple distinct circuits rather than unique mechanisms, demonstrated via Overlap-Aware Sheaf Repulsion and the Distributive Dense Circuit Hypothesis.

  5. VORT: Adaptive Power-Law Memory for NLP Transformers

    cs.LG 2026-05 unverdicted novelty 7.0

    VORT assigns learnable fractional orders to tokens and approximates their power-law retention kernels via sum-of-exponentials for efficient long-range dependency modeling in transformers.

  6. Parallel-SFT: Improving Zero-Shot Cross-Programming-Language Transfer for Code RL

    cs.CL 2026-04 unverdicted novelty 7.0

    Parallel-SFT mixes parallel programs across languages during SFT to produce more transferable RL initializations, yielding better zero-shot generalization to unseen programming languages.

  7. Rates of forgetting for the sequentially Markov coalescent

    math.PR 2026-04 unverdicted novelty 7.0

    SMC forgets its initial condition geometrically in the jump chain and as 1/ℓ in continuous genetic distance, justifying independent-locus approximations.

  8. A Full-Stack Performance Evaluation Infrastructure for 3D-DRAM-based LLM Accelerators

    cs.AR 2026-04 conditional novelty 7.0

    ATLAS is the first silicon-validated simulation framework for 3D-DRAM LLM accelerators, achieving under 8.57% error and over 97% correlation with real hardware while supporting design exploration.

  9. Scaling up Test-Time Compute with Latent Reasoning: A Recurrent Depth Approach

    cs.LG 2025-02 unverdicted novelty 7.0

    A recurrent-depth architecture enables language models to improve reasoning performance by iterating computation in latent space, achieving gains equivalent to much larger models on benchmarks.

  10. Moshi: a speech-text foundation model for real-time dialogue

    eess.AS 2024-09 accept novelty 7.0

    Moshi is the first real-time full-duplex spoken large language model that casts dialogue as speech-to-speech generation using parallel audio streams and an inner monologue of time-aligned text tokens.

  11. Medusa: Simple LLM Inference Acceleration Framework with Multiple Decoding Heads

    cs.LG 2024-01 conditional novelty 7.0

    Medusa augments LLMs with multiple decoding heads and tree-based attention to predict and verify several tokens in parallel, yielding 2.2-3.6x inference speedup via two fine-tuning regimes.

  12. GAIA: a benchmark for General AI Assistants

    cs.CL 2023-11 unverdicted novelty 7.0

    GAIA benchmark shows humans at 92% accuracy on simple real-world questions far outperform current AI systems at 15%, proposing this gap as a key milestone for general AI.

  13. Set-of-Mark Prompting Unleashes Extraordinary Visual Grounding in GPT-4V

    cs.CV 2023-10 accept novelty 7.0

    Set-of-Mark prompting marks segmented image regions with alphanumerics and masks to let GPT-4V achieve state-of-the-art zero-shot results on referring expression comprehension and segmentation benchmarks like RefCOCOg.

  14. Language Model Beats Diffusion -- Tokenizer is Key to Visual Generation

    cs.CV 2023-10 unverdicted novelty 7.0

    A new shared video-image tokenizer enables large language models to surpass diffusion models on standard visual generation benchmarks.

  15. C-Pack: Packed Resources For General Chinese Embeddings

    cs.CL 2023-09 accept novelty 7.0

    C-Pack releases a new Chinese embedding benchmark, large training dataset, and optimized models that outperform priors by up to 10% on C-MTEB while also delivering English SOTA results.

  16. Efficient Memory Management for Large Language Model Serving with PagedAttention

    cs.LG 2023-09 conditional novelty 7.0

    PagedAttention achieves near-zero waste in LLM key-value cache memory and enables 2-4x higher serving throughput than prior systems.

  17. VoxPoser: Composable 3D Value Maps for Robotic Manipulation with Language Models

    cs.RO 2023-07 unverdicted novelty 7.0

    VoxPoser uses LLMs to compose 3D value maps via VLM interaction for model-based synthesis of robust robot trajectories on open-set language-specified manipulation tasks.

  18. Direct Preference Optimization: Your Language Model is Secretly a Reward Model

    cs.LG 2023-05 accept novelty 7.0

    DPO derives the optimal policy directly from human preferences via a reparameterized reward model, solving the RLHF objective with only a binary classification loss and no sampling or separate reward model.

  19. Voyager: An Open-Ended Embodied Agent with Large Language Models

    cs.AI 2023-05 unverdicted novelty 7.0

    Voyager achieves superior lifelong learning in Minecraft by combining an automatic exploration curriculum, a library of executable skills, and iterative LLM prompting with environment feedback, yielding 3.3x more uniq...

  20. RWKV: Reinventing RNNs for the Transformer Era

    cs.CL 2023-05 unverdicted novelty 7.0

    RWKV uses a linear attention mechanism to deliver Transformer-level performance with RNN-style inference efficiency, demonstrated at up to 14 billion parameters.

  21. LLM+P: Empowering Large Language Models with Optimal Planning Proficiency

    cs.AI 2023-04 accept novelty 7.0

    LLM+P lets LLMs solve planning problems optimally by converting them to PDDL for classical planners and back to natural language.

  22. Visual Instruction Tuning

    cs.CV 2023-04 unverdicted novelty 7.0

    LLaVA is trained on GPT-4 generated visual instruction data to achieve 85.1% relative performance to GPT-4 on synthetic multimodal tasks and 92.53% accuracy on Science QA.

  23. Segment Anything

    cs.CV 2023-04 unverdicted novelty 7.0

    A promptable model trained on 1B masks achieves competitive zero-shot segmentation performance across tasks and is released publicly with its dataset.

  24. Neural Codec Language Models are Zero-Shot Text to Speech Synthesizers

    cs.CL 2023-01 unverdicted novelty 7.0

    VALL-E is a neural codec language model trained on 60K hours of speech that performs zero-shot TTS, synthesizing natural speech that matches an unseen speaker's voice, emotion, and environment from a 3-second prompt.

  25. Program of Thoughts Prompting: Disentangling Computation from Reasoning for Numerical Reasoning Tasks

    cs.CL 2022-11 unverdicted novelty 7.0

    PoT prompting improves numerical reasoning by having language models write programs executed by a computer instead of performing calculations in natural language chains of thought, with an average 12% gain over CoT.

  26. A Generalist Agent

    cs.AI 2022-05 accept novelty 7.0

    Gato is a multi-modal, multi-task, multi-embodiment generalist policy using one transformer network to handle text, vision, games, and robotics tasks.

  27. OPT: Open Pre-trained Transformer Language Models

    cs.CL 2022-05 unverdicted novelty 7.0

    OPT releases open decoder-only transformers up to 175B parameters that match GPT-3 performance at one-seventh the carbon cost, along with code and training logs.

  28. Flamingo: a Visual Language Model for Few-Shot Learning

    cs.CV 2022-04 unverdicted novelty 7.0

    Flamingo models reach new state-of-the-art few-shot results on image and video tasks by bridging frozen vision and language models with cross-attention layers trained on interleaved web-scale data.

  29. MetaColloc: Optimization-Free PDE Solving via Meta-Learned Basis Functions

    cs.LG 2026-05 unverdicted novelty 6.0

    MetaColloc meta-learns a universal set of neural basis functions offline so that new PDEs can be solved at test time with a single linear solve instead of per-equation neural-network optimization.

  30. Response Time Enhances Alignment with Heterogeneous Preferences

    cs.LG 2026-05 unverdicted novelty 6.0

    Response times modeled as drift-diffusion processes enable consistent estimation of population-average preferences from heterogeneous anonymous binary choices.

  31. Piper: Efficient Large-Scale MoE Training via Resource Modeling and Pipelined Hybrid Parallelism

    cs.DC 2026-05 unverdicted novelty 6.0

    Piper introduces resource modeling and pipelined hybrid parallelism for MoE training, delivering 2-3.5X higher MFU than prior frameworks and 1.2-9X better all-to-all bandwidth.

  32. A Meta Reinforcement Learning Approach to Goals-Based Wealth Management

    cs.LG 2026-05 unverdicted novelty 6.0

    MetaRL pre-trained on GBWM problems delivers near-optimal dynamic strategies in 0.01s achieving 97.8% of DP optimal utility and handles larger problems where DP fails.

  33. Repetition over Diversity: High-Signal Data Filtering for Sample-Efficient German Language Modeling

    cs.CL 2026-04 unverdicted novelty 6.0

    Repeating high-quality filtered German web data over multiple epochs produces better language models than single-pass training on larger, more diverse but lower-quality sets, even after seven epochs.

  34. Chain of Uncertain Rewards with Large Language Models for Reinforcement Learning

    cs.LG 2026-04 unverdicted novelty 6.0

    CoUR uses LLMs for efficient RL reward design through uncertainty quantification and similarity selection, achieving better performance and lower evaluation costs on IsaacGym and Bidexterous Manipulation benchmarks.

  35. Parcae: Scaling Laws For Stable Looped Language Models

    cs.LG 2026-04 unverdicted novelty 6.0

    Parcae stabilizes looped LLMs via spectral norm constraints on injection parameters, enabling power-law scaling for training FLOPs and saturating exponential scaling at test time that improves quality over fixed-depth...

  36. SeLaR: Selective Latent Reasoning in Large Language Models

    cs.CL 2026-04 unverdicted novelty 6.0

    SeLaR selectively applies latent soft reasoning in LLMs via entropy gating and contrastive regularization, outperforming standard CoT on five benchmarks without training.

  37. In-Place Test-Time Training

    cs.LG 2026-04 conditional novelty 6.0

    In-Place TTT adapts LLM MLP projection matrices at test time with a next-token-aligned objective and chunk-wise updates, enabling better long-context performance as a drop-in enhancement.

  38. Unmasking Hallucinations: A Causal Graph-Attention Perspective on Factual Reliability in Large Language Models

    cs.CL 2026-04 unverdicted novelty 6.0

    GCAN cuts LLM hallucination rates by 27.8% and raises factual accuracy by 16.4% on TruthfulQA and HotpotQA by using causal token graphs and a new Causal Contribution Score.

  39. Measuring Representation Robustness in Large Language Models for Geometry

    cs.CL 2026-04 unverdicted novelty 6.0

    LLMs display accuracy gaps of up to 14 percentage points on the same geometry problems solely due to representation choice, with vector forms consistently weakest and a convert-then-solve prompt helping only high-capa...

  40. Chameleon: Mixed-Modal Early-Fusion Foundation Models

    cs.CL 2024-05 unverdicted novelty 6.0

    Chameleon is an early-fusion token model that handles mixed image-text sequences for understanding and generation, achieving competitive or superior performance to larger models like Llama-2, Mixtral, and Gemini-Pro o...

  41. Are We on the Right Way for Evaluating Large Vision-Language Models?

    cs.CV 2024-03 conditional novelty 6.0

    Current LVLM benchmarks overestimate capabilities because many questions can be answered without images due to design flaws or data leakage; MMStar is a human-curated set of 1,500 vision-indispensable samples across 6...

  42. LiveCodeBench: Holistic and Contamination Free Evaluation of Large Language Models for Code

    cs.SE 2024-03 unverdicted novelty 6.0

    LiveCodeBench collects 400 recent contest problems to create a contamination-free benchmark evaluating LLMs on code generation and related capabilities like self-repair and execution.

  43. Grounded SAM: Assembling Open-World Models for Diverse Visual Tasks

    cs.CV 2024-01 unverdicted novelty 6.0

    Grounded SAM integrates Grounding DINO and SAM to support text-prompted open-world detection and segmentation, achieving 48.7 mean AP on SegInW zero-shot with the base detector and huge segmenter.

  44. SGLang: Efficient Execution of Structured Language Model Programs

    cs.AI 2023-12 conditional novelty 6.0

    SGLang is a new system that speeds up structured LLM programs by up to 6.4x using RadixAttention for KV cache reuse and compressed finite state machines for output decoding.

  45. ShareGPT4V: Improving Large Multi-Modal Models with Better Captions

    cs.CV 2023-11 conditional novelty 6.0

    A new 1.2M-caption dataset generated via GPT-4V improves LMMs on MME and MMBench by 222.8/22.0/22.3 and 2.7/1.3/1.5 points respectively when used for supervised fine-tuning.

  46. Qwen-Audio: Advancing Universal Audio Understanding via Unified Large-Scale Audio-Language Models

    eess.AS 2023-11 unverdicted novelty 6.0

    Qwen-Audio trains a unified model on diverse audio and tasks with hierarchical tags to enable strong zero-shot performance on audio understanding benchmarks and multi-turn audio chat.

  47. MetaMath: Bootstrap Your Own Mathematical Questions for Large Language Models

    cs.CL 2023-09 conditional novelty 6.0

    Bootstrapping math questions via rewriting creates MetaMathQA; fine-tuning LLaMA-2 on it yields 66.4% on GSM8K for 7B and 82.3% for 70B, beating prior same-size models by large margins.

  48. Textbooks Are All You Need II: phi-1.5 technical report

    cs.CL 2023-09 unverdicted novelty 6.0

    phi-1.5 is a 1.3B parameter model trained on synthetic textbook data that matches the reasoning performance of models five times larger on natural language, math, and basic coding tasks.

  49. YaRN: Efficient Context Window Extension of Large Language Models

    cs.CL 2023-08 unverdicted novelty 6.0

    YaRN extends the context window of RoPE-based LLMs like LLaMA more efficiently than prior methods, using 10x fewer tokens and 2.5x fewer steps while surpassing state-of-the-art performance and enabling extrapolation b...

  50. Textbooks Are All You Need

    cs.CL 2023-06 unverdicted novelty 6.0

    A 1.3B-parameter code model trained on 7B tokens of curated textbook and synthetic data achieves 50.6% on HumanEval, indicating data quality can enable strong performance at small scale.

  51. MiniLLM: On-Policy Distillation of Large Language Models

    cs.CL 2023-06 conditional novelty 6.0

    MiniLLM distills large language models into smaller ones via reverse KL divergence and on-policy optimization, yielding higher-quality responses with lower exposure bias than standard KD baselines.

  52. The RefinedWeb Dataset for Falcon LLM: Outperforming Curated Corpora with Web Data, and Web Data Only

    cs.CL 2023-06 unverdicted novelty 6.0

    Properly filtered web data from CommonCrawl alone trains LLMs that significantly outperform models trained on The Pile, with 600 billion tokens and 1.3B/7.5B parameter models released.

  53. Gorilla: Large Language Model Connected with Massive APIs

    cs.CL 2023-05 conditional novelty 6.0

    Gorilla is a fine-tuned LLM that surpasses GPT-4 in accurate API call generation and uses retrieval to handle documentation updates.

  54. GQA: Training Generalized Multi-Query Transformer Models from Multi-Head Checkpoints

    cs.CL 2023-05 unverdicted novelty 6.0

    Uptraining multi-head transformer checkpoints to grouped-query attention models achieves near multi-head quality at multi-query inference speeds using 5% additional compute.

  55. MiniGPT-4: Enhancing Vision-Language Understanding with Advanced Large Language Models

    cs.CV 2023-04 conditional novelty 6.0

    MiniGPT-4 shows that aligning a frozen vision encoder to Vicuna via one projection layer plus a second-stage detailed-description fine-tune produces GPT-4-like vision-language abilities including detailed captions, cr...

  56. Teaching Large Language Models to Self-Debug

    cs.CL 2023-04 unverdicted novelty 6.0

    Self-Debugging teaches LLMs to identify and fix their own code errors through rubber-duck-style natural language explanations and execution feedback, delivering 2-12% gains over baselines on Spider, TransCoder, and MBPP.

  57. CAMEL: Communicative Agents for "Mind" Exploration of Large Language Model Society

    cs.AI 2023-03 conditional novelty 6.0

    CAMEL proposes a role-playing framework with inception prompting that enables autonomous multi-agent cooperation among LLMs and generates conversational data for studying their behaviors.

  58. HuggingGPT: Solving AI Tasks with ChatGPT and its Friends in Hugging Face

    cs.CL 2023-03 unverdicted novelty 6.0

    HuggingGPT is an agent system where ChatGPT plans and orchestrates calls to Hugging Face models to solve complex multi-modal AI tasks.

  59. BloombergGPT: A Large Language Model for Finance

    cs.LG 2023-03 conditional novelty 6.0

    BloombergGPT is a 50B parameter LLM trained on a 708B token mixed financial and general dataset that outperforms prior models on financial benchmarks while preserving general LLM performance.

  60. MM-REACT: Prompting ChatGPT for Multimodal Reasoning and Action

    cs.CV 2023-03 unverdicted novelty 6.0

    MM-REACT uses textual prompts to let ChatGPT collaborate with external vision experts for zero-shot multimodal reasoning and action on advanced visual tasks.

Reference graph

Works this paper leans on

175 extracted references · 175 canonical work pages · cited by 85 Pith papers · 29 internal anchors

  1. [1]

    URL https://github.com/google-research/t5x

    T5x, 2021. URL https://github.com/google-research/t5x

  2. [2]

    Persistent anti-muslim bias in large language models

    Abid, A., Farooqi, M., and Zou, J. Persistent anti-muslim bias in large language models. CoRR, abs/2101.05783, 2021. URL https://arxiv.org/abs/2101.05783

  3. [3]

    Towards a human-like open-domain chatbot

    Adiwardana, D., Luong, M.-T., So, D. R., Hall, J., Fiedel, N., Thoppilan, R., Yang, Z., Kulshreshtha, A., Nemade, G., Lu, Y., et al. Towards a human-like open-domain chatbot. arXiv preprint arXiv:2001.09977, 2020

  4. [4]

    The adverse effects of code duplication in machine learning models of code

    Allamanis, M. The adverse effects of code duplication in machine learning models of code. In SPLASH Onward! , 2019

  5. [5]

    A survey of machine learning for big code and naturalness

    Allamanis, M., Barr, E. T., Devanbu, P., and Sutton, C. A survey of machine learning for big code and naturalness. ACM Comput. Surv., 51 0 (4), jul 2018. ISSN 0360-0300. doi:10.1145/3212695. URL https://doi.org/10.1145/3212695

  6. [6]

    MathQA: Towards interpretable math word problem solving with operation-based formalisms

    Amini, A., Gabriel, S., Lin, S., Koncel - Kedziorski, R., Choi, Y., and Hajishirzi, H. Mathqa: Towards interpretable math word problem solving with operation-based formalisms. CoRR, abs/1905.13319, 2019. URL http://arxiv.org/abs/1905.13319

  7. [7]

    Program Synthesis with Large Language Models

    Austin, J., Odena, A., Nye, M., Bosma, M., Michalewski, H., Dohan, D., Jiang, E., Cai, C., Terry, M., Le, Q., and Sutton, C. Program synthesis with large language models. arXiv preprint arXiv:2108.07732, 2021. URL https://arxiv.org/abs/2108.07732

  8. [8]

    Structure-to-text generation with self-training, acceptability classifiers and context-conditioning for the GEM shared task

    Bakshi, S., Batra, S., Heidari, P., Arun, A., Jain, S., and White, M. Structure-to-text generation with self-training, acceptability classifiers and context-conditioning for the GEM shared task. In Proceedings of the 1st Workshop on Natural Language Generation, Evaluation, and Metrics (GEM 2021), pp.\ 136--147, Online, August 2021. Association for Computa...

  9. [9]

    Pathways: Asynchronous distributed dataflow for ML

    Barham, P., Chowdhery, A., Dean, J., Ghemawat, S., Hand, S., Hurt, D., Isard, M., Lim, H., Pang, R., Roy, S., Saeta, B., Schuh, P., Sepassi, R., Shafey, L. E., Thekkath, C. A., and Wu, Y. Pathways: Asynchronous distributed dataflow for ML . To appear in MLSys 2022, 2022. URL https://arxiv.org/abs/2203.12533

  10. [10]

    Designing Disaggregated Evaluations of AI Systems: Choices, Considerations, and Tradeoffs

    Barocas, S., Guo, A., Kamar, E., Krones, J., Morris, M. R., Vaughan, J. W., Wadsworth, W. D., and Wallach, H. Designing Disaggregated Evaluations of AI Systems: Choices, Considerations, and Tradeoffs, pp.\ 368–378. Association for Computing Machinery, New York, NY, USA, 2021. ISBN 9781450384735. URL https://doi.org/10.1145/3461702.3462610

  11. [11]

    On the dangers of stochastic parrots: Can language models be too big?

    Bender, E. M., Gebru, T., McMillan-Major, A., and Shmitchell, S. On the dangers of stochastic parrots: Can language models be too big? In Proceedings of the 2021 ACM Conference on Fairness, Accountability, and Transparency, FAccT '21, pp.\ 610–623. Association for Computing Machinery, 2021. URL https://doi.org/10.1145/3442188.3445922

  12. [12]

    Semantic parsing on freebase from question-answer pairs

    Berant, J., Chou, A., Frostig, R., and Liang, P. Semantic parsing on freebase from question-answer pairs. In Proceedings of the 2013 conference on empirical methods in natural language processing, pp.\ 1533--1544, 2013

  13. [13]

    Beyond the imitation game: Measuring and extrapolating the capabilities of language models

    BIG-bench collaboration . Beyond the imitation game: Measuring and extrapolating the capabilities of language models. In preparation, 2021. URL https://github.com/google/BIG-bench/

  14. [14]

    NLTK: The natural language toolkit

    Bird, S. and Loper, E. NLTK : The natural language toolkit. In Proceedings of the ACL Interactive Poster and Demonstration Sessions , pp.\ 214--217, Barcelona, Spain, July 2004. Association for Computational Linguistics. URL https://aclanthology.org/P04-3031

  15. [15]

    Piqa: Reasoning about physical commonsense in natural language

    Bisk, Y., Zellers, R., Bras, R. L., Gao, J., and Choi, Y. PIQA: reasoning about physical commonsense in natural language. CoRR, abs/1911.11641, 2019. URL http://arxiv.org/abs/1911.11641

  16. [16]

    Language (technology) is power: A critical survey of "bias" in NLP

    Blodgett, S. L., Barocas, S., Daum \'e III, H., and Wallach, H. Language (technology) is power: A critical survey of `` bias '' in NLP . In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pp.\ 5454--5476, Online, July 2020. Association for Computational Linguistics. doi:10.18653/v1/2020.acl-main.485. URL https://ac...

  17. [17]

    Stereotyping Norwegian salmon: An inventory of pitfalls in fairness benchmark datasets

    Blodgett, S. L., Lopez, G., Olteanu, A., Sim, R., and Wallach, H. Stereotyping N orwegian salmon: An inventory of pitfalls in fairness benchmark datasets. In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), pp.\ 1004--...

  18. [18]

    Bommasani, R. and et. al., D. A. H. On the opportunities and risks of foundation models. arXiv preprint arXiv:2108.07258, 2021. URL https://arxiv.org/abs/2108.07258

  19. [19]

    Borgeaud, S., Mensch, A., Hoffmann, J., Cai, T., Rutherford, E., Millican, K., Driessche, G. v. d., Lespiau, J.-B., Damoc, B., Clark, A., et al. Improving language models by retrieving from trillions of tokens. arXiv preprint arXiv:2112.04426, 2021

  20. [20]

    JAX: Composable transformations of Python+NumPy programs

    Bradbury, J., Frostig, R., Hawkins, P., Johnson, M. J., Leary, C., Maclaurin, D., Necula, G., Paszke, A., Vander P las, J., Wanderman- M ilne, S., and Zhang, Q. JAX : Composable transformations of P ython+ N um P y programs, 2018. URL http://github.com/google/jax

  21. [21]

    Brown, T., Mann, B., Ryder, N., Subbiah, M., Kaplan, J. D., Dhariwal, P., Neelakantan, A., Shyam, P., Sastry, G., Askell, A., Agarwal, S., Herbert-Voss, A., Krueger, G., Henighan, T., Child, R., Ramesh, A., Ziegler, D., Wu, J., Winter, C., Hesse, C., Chen, M., Sigler, E., Litwin, M., Gray, S., Chess, B., Clark, J., Berner, C., McCandlish, S., Radford, A.,...

  22. [22]

    Cao, Y. T. and Daum \'e III, H. Toward gender-inclusive coreference resolution. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pp.\ 4568--4595, Online, July 2020. Association for Computational Linguistics. doi:10.18653/v1/2020.acl-main.418. URL https://aclanthology.org/2020.acl-main.418

  23. [23]

    Quantifying Memorization Across Neural Language Models

    Carlini, N., Ippolito, D., Jagielski, M., Lee, K., Tramer, F., and Zhang, C. Quantifying memorization across neural language models. arXiv preprint arXiv:2202.07646, 2022

  24. [24]

    The 2020 bilingual, bi-directional webnlg+ shared task overview and evaluation results (webnlg+ 2020)

    Castro Ferreira, T., Gardent, C., Ilinykh, N., van der Lee, C., Mille, S., Moussallem, D., and Shimorina, A. The 2020 bilingual, bi-directional webnlg+ shared task overview and evaluation results (webnlg+ 2020). In Proceedings of the 3rd WebNLG Workshop on Natural Language Generation from the Semantic Web (WebNLG+ 2020), pp.\ 55--76, Dublin, Ireland (Virt...

  25. [25]

    Tagged back-translation

    Caswell, I., Chelba, C., and Grangier, D. Tagged back-translation. In Proceedings of the Fourth Conference on Machine Translation (Volume 1: Research Papers), pp.\ 53--63, Florence, Italy, August 2019. Association for Computational Linguistics. doi:10.18653/v1/W19-5206. URL https://aclanthology.org/W19-5206

  26. [26]

    Evaluating Large Language Models Trained on Code

    Chen, M., Tworek, J., Jun, H., Yuan, Q., Ponde, H., Kaplan, J., Edwards, H., Burda, Y., Joseph, N., Brockman, G., et al. Evaluating large language models trained on code. arXiv preprint arXiv:2107.03374, 2021. URL https://arxiv.org/abs/2107.03374

  27. [27]

    Generating Long Sequences with Sparse Transformers

    Child, R., Gray, S., Radford, A., and Sutskever, I. Generating long sequences with sparse transformers. arXiv preprint arXiv:1904.10509, 2019

  28. [28]

    QuAC: Question answering in context

    Choi, E., He, H., Iyyer, M., Yatskar, M., Yih, W., Choi, Y., Liang, P., and Zettlemoyer, L. Qu AC : Question answering in context. CoRR, abs/1808.07036, 2018. URL http://arxiv.org/abs/1808.07036

  29. [29]

    Rethinking Attention with Performers

    Choromanski, K., Likhosherstov, V., Dohan, D., Song, X., Gane, A., Sarlos, T., Hawkins, P., Davis, J., Mohiuddin, A., Kaiser, L., et al. Rethinking attention with performers. arXiv preprint arXiv:2009.14794, 2020

  30. [30]

    TyDi QA: A benchmark for information-seeking question answering in typologically diverse languages

    Clark, J., Choi, E., Collins, M., Garrette, D., Kwiatkowski, T., Nikolaev, V., and Palomaki, J. Tydi QA : A benchmark for information-seeking question answering in typologically diverse languages. Transactions of the Association for Computational Linguistics, 2020. URL https://storage.googleapis.com/tydiqa/tydiqa.pdf

  31. [31]

    Think you have Solved Question Answering? Try ARC, the AI2 Reasoning Challenge

    Clark, P., Cowhey, I., Etzioni, O., Khot, T., Sabharwal, A., Schoenick, C., and Tafjord, O. Think you have solved question answering? try arc, the ai2 reasoning challenge. ArXiv, abs/1803.05457, 2018

  32. [32]

    Training Verifiers to Solve Math Word Problems

    Cobbe, K., Kosaraju, V., Bavarian, M., Hilton, J., Nakano, R., Hesse, C., and Schulman, J. Training verifiers to solve math word problems. arXiv preprint arXiv:2110.14168, 2021. URL https://arxiv.org/abs/2110.14168

  33. [33]

    On measuring and mitigating biased inferences of word embeddings

    Dev, S., Li, T., Phillips, J. M., and Srikumar, V. On measuring and mitigating biased inferences of word embeddings. CoRR, abs/1908.09369, 2019. URL http://arxiv.org/abs/1908.09369

  34. [34]

    Harms of gender exclusivity and challenges in non-binary representation in language technologies

    Dev, S., Monajatipoor, M., Ovalle, A., Subramonian, A., Phillips, J., and Chang, K.-W. Harms of gender exclusivity and challenges in non-binary representation in language technologies. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, pp.\ 1968--1994, Online and Punta Cana, Dominican Republic, November 2021 a . Ass...

  35. [35]

    On measures of biases and harms in NLP

    Dev, S., Sheng, E., Zhao, J., Sun, J., Hou, Y., Sanseverino, M., Kim, J., Peng, N., and Chang, K. What do bias measures measure? CoRR, abs/2108.03362, 2021 b . URL https://arxiv.org/abs/2108.03362

  36. [36]

    BERT: Pre-training of deep bidirectional transformers for language understanding

    Devlin, J., Chang, M.-W., Lee, K., and Toutanova, K. BERT : Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North A merican Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers) , pp.\ 4171--4186, Minneapolis, Minnesot...

  37. [37]

    Measuring and mitigating unintended bias in text classification

    Dixon, L., Li, J., Sorensen, J., Thain, N., and Vasserman, L. Measuring and mitigating unintended bias in text classification. In Proceedings of the 2018 AAAI/ACM Conference on AI, Ethics, and Society, AIES '18, pp.\ 67–73, New York, NY, USA, 2018. Association for Computing Machinery. ISBN 9781450360128. doi:10.1145/3278721.3278729. URL https://doi.org/10...

  38. [38]

    Documenting large webtext corpora: A case study on the colossal clean crawled corpus

    Dodge, J., Sap, M., Marasovi \'c , A., Agnew, W., Ilharco, G., Groeneveld, D., Mitchell, M., and Gardner, M. Documenting large webtext corpora: A case study on the colossal clean crawled corpus. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, pp.\ 1286--1305, 2021

  39. [39]

    GLaM: Efficient scaling of language models with mixture-of-experts

    Du, N., Huang, Y., Dai, A. M., Tong, S., Lepikhin, D., Xu, Y., Krikun, M., Zhou, Y., Yu, A. W., Firat, O., et al. GLaM : Efficient scaling of language models with mixture-of-experts. arXiv preprint arXiv:2112.06905, 2021. URL https://arxiv.org/pdf/2112.06905

  40. [40]

    DROP: A reading comprehension benchmark requiring discrete reasoning over paragraphs

    Dua, D., Wang, Y., Dasigi, P., Stanovsky, G., Singh, S., and Gardner, M. DROP : A reading comprehension benchmark requiring discrete reasoning over paragraphs. In Proceedings of the 2019 Conference of the North A merican Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers) , pp.\ 2368--237...

  41. [41]

    Neural generation for Czech: Data and baselines

    Dušek, O. and Jurčíček, F. Neural generation for Czech: Data and baselines. 2019

  42. [42]

    Neural Generation for Czech: Data and Baselines

    Dušek, O. and Jurčíček, F. Neural Generation for Czech : Data and Baselines . In Proceedings of the 12th International Conference on Natural Language Generation ( INLG 2019) , pp.\ 563--574, Tokyo, Japan, October 2019. URL https://www.aclweb.org/anthology/W19-8670/

  43. [43]

    Semantic Noise Matters for Neural Natural Language Generation

    Dušek, O., Howcroft, D. M., and Rieser, V. Semantic Noise Matters for Neural Natural Language Generation . In Proceedings of the 12th International Conference on Natural Language Generation ( INLG 2019) , pp.\ 421--426, Tokyo, Japan, 2019. URL https://www.aclweb.org/anthology/W19-8652/

  44. [44]

    Understanding back-translation at scale

    Edunov, S., Ott, M., Auli, M., and Grangier, D. Understanding back-translation at scale. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pp.\ 489--500, 2018. URL https://aclanthology.org/D18-1045

  45. [45]

    Beyond english-centric multilingual machine translation

    Fan, A., Bhosale, S., Schwenk, H., Ma, Z., El - Kishky, A., Goyal, S., Baines, M., Celebi, O., Wenzek, G., Chaudhary, V., Goyal, N., Birch, T., Liptchinsky, V., Edunov, S., Grave, E., Auli, M., and Joulin, A. Beyond english-centric multilingual machine translation. CoRR, abs/2010.11125, 2020. URL https://arxiv.org/abs/2010.11125

  46. [46]

    Switch Transformers: Scaling to Trillion Parameter Models with Simple and Efficient Sparsity

    Fedus, W., Zoph, B., and Shazeer, N. Switch transformers: Scaling to trillion parameter models with simple and efficient sparsity. arXiv preprint arXiv:2101.03961, 2021. URL https://arxiv.org/abs/2101.03961

  47. [47]

    Complete multilingual neural machine translation

    Freitag, M. and Firat, O. Complete multilingual neural machine translation. CoRR, abs/2010.10239, 2020. URL https://arxiv.org/abs/2010.10239

  48. [48]

    The state of sparsity in deep neural networks

    Gale, T., Elsen, E., and Hooker, S. The state of sparsity in deep neural networks. arXiv preprint arXiv:1902.09574, 2019

  49. [49]

    Creating training corpora for NLG micro-planners

    Gardent, C., Shimorina, A., Narayan, S., and Perez-Beltrachini, L. Creating training corpora for nlg micro-planners. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pp.\ 179--188. Association for Computational Linguistics, 2017. doi:10.18653/v1/P17-1017. URL http://www.aclweb.org/antholog...

  50. [50]

    Datasheets for datasets

    Gebru, T., Morgenstern, J., Vecchione, B., Vaughan, J. W., Wallach, H., III, H. D., and Crawford, K. Datasheets for datasets. Commun. ACM, 64 0 (12): 0 86–92, nov 2021. ISSN 0001-0782. doi:10.1145/3458723. URL https://doi.org/10.1145/3458723

  51. [51]

    Gehman, S., Gururangan, S., Sap, M., Choi, Y., and Smith, N. A. Realtoxicityprompts: Evaluating neural toxic degeneration in language models, 2020

  52. [52]

    The GEM benchmark: Natural language generation, its evaluation and metrics

    Gehrmann, S., Adewumi, T., Aggarwal, K., Ammanamanchi, P. S., Aremu, A., Bosselut, A., Chandu, K. R., Clinciu, M.-A., Das, D., Dhole, K., Du, W., Durmus, E., Du s ek, O., Emezue, C. C., Gangal, V., Garbacea, C., Hashimoto, T., Hou, Y., Jernite, Y., Jhamtani, H., Ji, Y., Jolly, S., Kale, M., Kumar, D., Ladhak, F., Madaan, A., Maddela, M., Mahajan, K., Maha...

  53. [53]

    Repairing the cracked foundation: A survey of obstacles in evaluation practices for generated text, 2022

    Gehrmann, S., Clark, E., and Sellam, T. Repairing the cracked foundation: A survey of obstacles in evaluation practices for generated text. CoRR, abs/2202.06935, 2022. URL https://arxiv.org/abs/2202.06935

  54. [54]

    Geirhos, R., Jacobsen, J.-H., Michaelis, C., Zemel, R., Brendel, W., Bethge, M., and Wichmann, F. A. Shortcut learning in deep neural networks. Nature Machine Intelligence, 2: 0 665--673, Nov 2020. doi:https://doi.org/10.1038/s42256-020-00257-z. URL https://www.nature.com/articles/s42256-020-00257-z

  55. [55]

    Did Aristotle Use a Laptop? A Question Answering Benchmark with Implicit Reasoning Strategies

    Geva, M., Khashabi, D., Segal, E., Khot, T., Roth, D., and Berant, J. Did aristotle use a laptop? a question answering benchmark with implicit reasoning strategies. Transactions of the Association for Computational Linguistics, 9: 0 346--361, 2021. doi:10.1162/tacl_a_00370. URL https://aclanthology.org/2021.tacl-1.21

  56. [56]

    Google cloud classifying content, a

    Google Cloud NLP . Google cloud classifying content, a . URL https://cloud.google.com/natural-language/docs/classifying-text

  57. [57]

    Google cloud infotype detector, b

    Google Cloud NLP . Google cloud infotype detector, b . URL https://cloud.google.com/dlp/docs/infotypes-reference

  58. [58]

    Gupta, R., Pal, S., Kanade, A., and Shevade, S. K. Deepfix: Fixing common C language errors by deep learning. In Singh, S. P. and Markovitch, S. (eds.), Proceedings of the Thirty-First AAAI Conference on Artificial Intelligence, February 4-9, 2017, San Francisco, California, USA , pp.\ 1345--1351. AAAI Press, 2017. URL http://aaai.org/ocs/index.php/AAAI/A...

  59. [59]

    Retrieval augmented language model pre-training

    Guu, K., Lee, K., Tung, Z., Pasupat, P., and Chang, M. Retrieval augmented language model pre-training. In III, H. D. and Singh, A. (eds.), Proceedings of the 37th International Conference on Machine Learning, volume 119 of Proceedings of Machine Learning Research, pp.\ 3929--3938. PMLR, 13--18 Jul 2020. URL https://proceedings.mlr.press/v119/guu20a.html

  60. [60]

    Measuring massive multitask language understanding

    Hendrycks, D., Burns, C., Basart, S., Zou, A., Mazeika, M., Song, D., and Steinhardt, J. Measuring massive multitask language understanding. Proceedings of the International Conference on Learning Representations (ICLR), 2021

  61. [61]

    Hoffmann, J., Borgeaud, S., Mensch, A., Buchatskaya, E., Cai, T., Rutherford, E., Casas, D. d. L., Hendricks, L. A., Welbl, J., Clark, A., Hennigan, T., Noland, E., Millican, K., Driessche, G. v. d., Damoc, B., Guy, A., Osindero, S., Simonyan, K., Elsen, E., Rae, J. W., Vinyals, O., and Sifre, L. Training compute-optimal large language models. arXiv prepr...

  62. [62]

    GPipe: Efficient training of giant neural networks using pipeline parallelism

    Huang, Y., Cheng, Y., Bapna, A., Firat, O., Chen, D., Chen, M., Lee, H., Ngiam, J., Le, Q. V., Wu, Y., et al. GPipe : Efficient training of giant neural networks using pipeline parallelism. In Advances in neural information processing systems, pp.\ 103--112, 2019

  63. [63]

    Social biases in nlp models as barriers for persons with disabilities

    Hutchinson, B., Prabhakaran, V., Denton, E., Webster, K., Zhong, Y., and Denuyl, S. Social biases in nlp models as barriers for persons with disabilities. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pp.\ 5491--5501, 2020

  64. [64]

    Jacobs, A. Z. and Wallach, H. Measurement and fairness. In Proceedings of the 2021 ACM Conference on Fairness, Accountability, and Transparency, FAccT '21, pp.\ 375–385, New York, NY, USA, 2021. Association for Computing Machinery. ISBN 9781450383097. doi:10.1145/3442188.3445901. URL https://doi.org/10.1145/3442188.3445901

  65. [65]

    TriviaQA: A large scale distantly supervised challenge dataset for reading comprehension

    Joshi, M., Choi, E., Weld, D., and Zettlemoyer, L. T rivia QA : A large scale distantly supervised challenge dataset for reading comprehension. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pp.\ 1601--1611, 2017. URL https://aclanthology.org/P17-1147

  66. [66]

    A domain-specific supercomputer for training deep neural networks

    Jouppi, N. P., Yoon, D. H., Kurian, G., Li, S., Patil, N., Laudon, J., Young, C., and Patterson, D. A domain-specific supercomputer for training deep neural networks. Communications of the ACM, 63 0 (7): 0 67--78, 2020

  67. [67]

    Deduplicating training data mitigates privacy risks in language models

    Kandpal, N., Wallace, E., and Raffel, C. Deduplicating training data mitigates privacy risks in language models. 2022. URL https://arxiv.org/abs/2202.06539

  68. [68]

    Scaling Laws for Neural Language Models

    Kaplan, J., McCandlish, S., Henighan, T., Brown, T. B., Chess, B., Child, R., Gray, S., Radford, A., Wu, J., and Amodei, D. Scaling laws for neural language models. arXiv preprint arXiv:2001.08361, 2020

  69. [69]

    Kingma, D. P. and Ba, J. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014

  70. [70]

    Reformer: The Efficient Transformer

    Kitaev, N., Kaiser, Ł., and Levskaya, A. Reformer: The efficient transformer. arXiv preprint arXiv:2001.04451, 2020

  71. [71]

    URL https://knowyourdata.withgoogle.com/

    Know Your Data . URL https://knowyourdata.withgoogle.com/

  72. [72]

    MAWPS: A math word problem repository

    Koncel-Kedziorski, R., Roy, S., Amini, A., Kushman, N., and Hajishirzi, H. MAWPS : A math word problem repository. In Proceedings of the 2016 Conference of the North A merican Chapter of the Association for Computational Linguistics: Human Language Technologies , pp.\ 1152--1157, San Diego, California, June 2016. Association for Computational Linguistics....

  73. [74]

    SentencePiece: A simple and language independent subword tokenizer and detokenizer for neural text processing

    Kudo, T. and Richardson, J. Sentencepiece: A simple and language independent subword tokenizer and detokenizer for neural text processing. In Blanco, E. and Lu, W. (eds.), Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, EMNLP 2018: System Demonstrations, Brussels, Belgium, October 31 - November 4, 2018 , pp.\ 66--71...

  74. [75]

    SPoC: Search-based pseudocode to code

    Kulal, S., Pasupat, P., Chandra, K., Lee, M., Padon, O., Aiken, A., and Liang, P. SPoC : Search-based pseudocode to code. In Advances in Neural Information Processing Systems, June 2019

  75. [76]

    Quantifying social biases in contextual word representations

    Kurita, K., Vyas, N., Pareek, A., Black, A. W., and Tsvetkov, Y. Quantifying social biases in contextual word representations. 1st ACL Workshop on Gender Bias for Natural Language Processing, 2019. URL https://par.nsf.gov/biblio/10098355

  76. [77]

    Natural Questions: A benchmark for question answering research

    Kwiatkowski, T., Palomaki, J., Redfield, O., Collins, M., Parikh, A., Alberti, C., Epstein, D., Polosukhin, I., Devlin, J., Lee, K., Toutanova, K., Jones, L., Kelcey, M., Chang, M.-W., Dai, A. M., Uszkoreit, J., Le, Q., and Petrov, S. Natural Q uestions: A benchmark for question answering research. Transactions of the Association for Computational Linguis...

  77. [78]

    Unsupervised translation of programming languages

    Lachaux, M., Rozière, B., Chanussot, L., and Lample, G. Unsupervised translation of programming languages. CoRR, abs/2006.03511, 2020. URL https://arxiv.org/abs/2006.03511

  78. [79]

    WikiLingua: A new benchmark dataset for cross-lingual abstractive summarization

    Ladhak, F., Durmus, E., Cardie, C., and McKeown, K. W iki L ingua: A new benchmark dataset for cross-lingual abstractive summarization. In Findings of the Association for Computational Linguistics: EMNLP 2020, pp.\ 4034--4048, Online, November 2020. Association for Computational Linguistics. doi:10.18653/v1/2020.findings-emnlp.360. URL https://www.aclweb....

  79. [80]

    RACE: Large-scale ReAding comprehension dataset from examinations

    Lai, G., Xie, Q., Liu, H., Yang, Y., and Hovy, E. RACE : Large-scale R e A ding comprehension dataset from examinations. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, pp.\ 785--794, Copenhagen, Denmark, September 2017. Association for Computational Linguistics. doi:10.18653/v1/D17-1082. URL https://aclanthology...

  80. [81]

    MWPToolkit: An open-source framework for deep learning-based math word problem solvers

    Lan, Y., Wang, L., Zhang, Q., Lan, Y., Dai, B. T., Wang, Y., Zhang, D., and Lim, E.-P. Mwptoolkit: An open-source framework for deep learning-based math word problem solvers. arXiv preprint arXiv:2109.00799, 2021

Showing first 80 references.