pith. machine review for the scientific record.

arxiv: 2204.02311 · v5 · submitted 2022-04-05 · 💻 cs.CL

Recognition: 2 theorem links · Lean Theorem

PaLM: Scaling Language Modeling with Pathways

Aakanksha Chowdhery, Abhishek Rao, Adam Roberts, Aitor Lewkowycz, Alexander Spiridonov, Andrew M. Dai, Anselm Levskaya, Barret Zoph, Ben Hutchinson, Brennan Saeta, Charles Sutton, Daphne Ippolito, David Dohan, David Luan, Denny Zhou, Douglas Eck, Emily Reif, Erica Moreira, Gaurav Mishra, Guy Gur-Ari, Henryk Michalewski, Hyeontaek Lim, Hyung Won Chung, Jacob Austin, Jacob Devlin, James Bradbury, Jason Wei, Jeff Dean, Joshua Maynez, Katherine Lee, Kathy Meier-Hellstern, Kensen Shi, Kevin Robinson, Liam Fedus, Maarten Bosma, Marie Pellat, Mark Diaz, Mark Omernick, Michael Isard, Michele Catasta, Nan Du, Noah Fiedel, Noam Shazeer, Oleksandr Polozov, Orhan Firat, Parker Barnes, Parker Schuh, Paul Barham, Pengcheng Yin, Reiner Pope, Rewon Child, Ryan Sepassi, Sanjay Ghemawat, Sasha Tsvyashchenko, Sebastian Gehrmann, Sharan Narang, Shivani Agrawal, Slav Petrov, Sunipa Dev, Thanumalayan Sankaranarayana Pillai, Toju Duke, Vedant Misra, Vinodkumar Prabhakaran, Xavier Garcia, Xuezhi Wang, Yi Tay, Zongwei Zhou

Pith reviewed 2026-05-10 23:39 UTC · model grok-4.3

classification 💻 cs.CL
keywords large language models · few-shot learning · scaling · Transformer · BIG-bench · reasoning tasks · multilingual · code generation

The pith

Scaling a language model to 540 billion parameters produces state-of-the-art few-shot results on hundreds of benchmarks.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper trains PaLM, a 540-billion parameter Transformer language model, on 6144 TPU v4 chips using the Pathways system for efficient distributed training. It establishes that increasing model scale continues to improve few-shot learning across language understanding and generation tasks. The largest model shows particular gains on multi-step reasoning and reaches average human performance on the BIG-bench suite, with some tasks exhibiting sharp jumps only at this scale. These results indicate that larger models can adapt to new tasks with fewer examples than smaller ones.
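
As a concrete illustration of the few-shot setup this summary refers to, the sketch below assembles a k-shot prompt from labeled exemplars. The template, task, and exemplars are invented for illustration and are not the paper's actual evaluation prompts.

    # Illustrative k-shot prompt construction; template and exemplars are
    # invented, not the paper's actual evaluation prompts.
    def build_few_shot_prompt(exemplars, query, k=5):
        """Concatenate k labeled exemplars followed by the unlabeled query."""
        lines = [f"Q: {q}\nA: {a}" for q, a in exemplars[:k]]
        lines.append(f"Q: {query}\nA:")
        return "\n\n".join(lines)

    exemplars = [
        ("What is 2 + 3?", "5"),
        ("What is 7 - 4?", "3"),
    ]
    print(build_few_shot_prompt(exemplars, "What is 6 + 8?", k=2))
    # The model's continuation after the final "A:" is scored against the
    # reference answer; no parameters are updated.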

Core claim

By training a 540-billion parameter densely activated Transformer language model using the Pathways system across multiple TPU pods, the authors demonstrate continued scaling benefits through state-of-the-art few-shot performance on hundreds of benchmarks. The model outperforms the finetuned state of the art on multi-step reasoning tasks and exceeds average human performance on BIG-bench, where a significant number of tasks show discontinuous improvements only at the largest size. PaLM also exhibits strong multilingual and code generation capabilities.

What carries the argument

PaLM, the 540-billion parameter Pathways Language Model, a densely activated Transformer trained efficiently via the Pathways ML system on 6144 TPU v4 chips.
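
To give a feel for where a 540-billion parameter count comes from, the sketch below applies the common 12·L·d² rule of thumb for a dense decoder-only Transformer plus an embedding table. PaLM's actual architecture (SwiGLU feed-forward layers, multi-query attention, shared embeddings) shifts these constants, and the layer, width, and vocabulary values used here are illustrative choices, not hyperparameters quoted from the paper.

    # Back-of-envelope parameter count for a dense decoder-only Transformer:
    # 4*d^2 for the attention projections, 8*d^2 for a 4x-expanded MLP,
    # plus a token embedding table. Values below are illustrative only.
    def dense_transformer_params(n_layers, d_model, vocab_size):
        attention = 4 * d_model * d_model   # Q, K, V, and output projections
        mlp = 8 * d_model * d_model         # two weight matrices with 4x expansion
        embeddings = vocab_size * d_model   # token embedding table
        return n_layers * (attention + mlp) + embeddings

    total = dense_transformer_params(n_layers=118, d_model=18_432, vocab_size=256_000)
    print(f"~{total / 1e9:.0f}B parameters")  # prints ~486B, i.e. the 540B regime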

If this is right

  • Few-shot prompts alone suffice to exceed finetuned systems on multi-step reasoning tasks.
  • Average human performance is reached on a broad suite of language tasks without task-specific training.
  • Multilingual tasks and source code generation improve alongside English benchmarks as scale increases.
  • Some tasks exhibit sharp performance increases only once model size reaches hundreds of billions of parameters (a toy version of this check is sketched after this list).
  • Analysis of bias, toxicity, and memorization as a function of model scale becomes feasible.
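
One way to make the "sharp performance increases" bullet operational is to fit a log-linear scaling trend on the smaller models and ask how far the largest model lands above the extrapolation. All numbers in this sketch are invented; they are not results from the paper.

    import numpy as np

    # Toy check for a discontinuous scaling gain: fit accuracy against
    # log10(parameter count) on the smaller models, then compare the largest
    # model's score with the extrapolated value. All numbers are invented.
    params = np.array([8e9, 62e9, 540e9])    # model sizes (illustrative)
    accuracy = np.array([0.22, 0.31, 0.78])  # task scores (illustrative)

    slope, intercept = np.polyfit(np.log10(params[:-1]), accuracy[:-1], deg=1)
    predicted = slope * np.log10(params[-1]) + intercept
    print(f"extrapolated {predicted:.2f} vs observed {accuracy[-1]:.2f}")
    # A large positive gap is what the review calls a discontinuous improvement.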

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Similar scaling combined with efficient training systems could reduce the data needed for new applications in other modalities.
  • Models of this size may enable practical systems that handle varied real-world queries with minimal adaptation.
  • The pattern of discontinuous gains suggests that certain capabilities emerge only after crossing specific size thresholds.
  • Ongoing scaling will require new methods to manage memorization of training data and unintended biases.

Load-bearing premise

That the observed performance gains from scaling to 540 billion parameters will continue to appear on tasks and data outside the specific benchmarks and training distribution used.

What would settle it

A follow-up experiment that trains a model at or above 540 billion parameters and finds no further gains or discontinuous jumps on BIG-bench tasks, or that matches the reported results without scaling.

read the original abstract

Large language models have been shown to achieve remarkable performance across a variety of natural language tasks using few-shot learning, which drastically reduces the number of task-specific training examples needed to adapt the model to a particular application. To further our understanding of the impact of scale on few-shot learning, we trained a 540-billion parameter, densely activated, Transformer language model, which we call Pathways Language Model (PaLM). We trained PaLM on 6144 TPU v4 chips using Pathways, a new ML system which enables highly efficient training across multiple TPU Pods. We demonstrate continued benefits of scaling by achieving state-of-the-art few-shot learning results on hundreds of language understanding and generation benchmarks. On a number of these tasks, PaLM 540B achieves breakthrough performance, outperforming the finetuned state-of-the-art on a suite of multi-step reasoning tasks, and outperforming average human performance on the recently released BIG-bench benchmark. A significant number of BIG-bench tasks showed discontinuous improvements from model scale, meaning that performance steeply increased as we scaled to our largest model. PaLM also has strong capabilities in multilingual tasks and source code generation, which we demonstrate on a wide array of benchmarks. We additionally provide a comprehensive analysis on bias and toxicity, and study the extent of training data memorization with respect to model scale. Finally, we discuss the ethical considerations related to large language models and discuss potential mitigation strategies.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 3 minor

Summary. The paper presents PaLM, a 540-billion parameter densely activated Transformer language model trained on 6144 TPU v4 chips using the Pathways system. It claims continued scaling benefits via state-of-the-art few-shot results across hundreds of language understanding and generation benchmarks, including breakthrough performance that outperforms finetuned SOTA on multi-step reasoning tasks and exceeds average human performance on BIG-bench (with discontinuous jumps on a significant number of tasks). Additional results cover multilingual tasks, code generation, bias/toxicity analysis, and memorization studies as a function of scale.

Significance. If the empirical results hold, the work provides substantial evidence for scaling benefits at the 540B parameter regime, particularly for few-shot reasoning and multilingual capabilities. The inclusion of bias, toxicity, and memorization analyses is a strength that aids responsible assessment of large models. The demonstration of efficient large-scale training via Pathways is a notable engineering contribution.

major comments (2)
  1. [Benchmark results section] The claims of SOTA few-shot performance, breakthrough reasoning results, and outperforming human performance on BIG-bench are presented without reported statistical error bars, multiple evaluation runs, or precise protocol details (e.g., prompt formatting, decoding parameters), which are load-bearing for substantiating the scaling and discontinuous improvement assertions (a toy per-task protocol record is sketched at the end of this report).
  2. [Training data and setup] The description of the 780B token training corpus and data filtering/mixture is high-level; this directly impacts reproducibility of the reported scaling observations and assessment of potential contamination effects on the few-shot and BIG-bench results.
minor comments (3)
  1. [Abstract] The abstract states results on 'hundreds of benchmarks' but does not enumerate the exact count or breakdown by category, reducing clarity.
  2. [Figures] Figure captions and scaling plots would benefit from explicit axis labels for model size and data volume to facilitate direct comparison with prior scaling studies.
  3. [Memorization analysis] The memorization analysis section could include a direct comparison table against smaller models (e.g., 8B or 62B variants) for quantitative context.
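
To make the protocol concern in major comment 1 concrete, the sketch below shows one hypothetical shape for a per-task protocol record (prompt template, shot count, decoding parameters) that could be published alongside each score. Every field value is a placeholder, not a setting reported in the paper.

    from dataclasses import dataclass, asdict
    import json

    # Minimal sketch of a per-task evaluation protocol record; all values
    # are placeholders, not settings reported in the paper.
    @dataclass
    class EvalProtocol:
        task: str
        num_shots: int
        prompt_template: str
        temperature: float
        top_p: float
        max_decode_tokens: int

    protocol = EvalProtocol(
        task="multi-step arithmetic reasoning",
        num_shots=8,
        prompt_template="Q: {question}\nA: {answer}",
        temperature=0.0,        # greedy decoding chosen as a placeholder
        top_p=1.0,
        max_decode_tokens=256,
    )
    print(json.dumps(asdict(protocol), indent=2))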

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their positive assessment of the work and recommendation for minor revision. We appreciate the constructive feedback on improving the substantiation of our claims and the reproducibility of our experimental setup. We address each major comment below and outline the changes we will make to the manuscript.

read point-by-point responses
  1. Referee: [Benchmark results section] The claims of SOTA few-shot performance, breakthrough reasoning results, and outperforming human performance on BIG-bench are presented without reported statistical error bars, multiple evaluation runs, or precise protocol details (e.g., prompt formatting, decoding parameters), which are load-bearing for substantiating the scaling and discontinuous improvement assertions.

    Authors: We agree that additional protocol details are necessary to fully substantiate the reported results. In the revised manuscript, we will expand the evaluation sections to provide precise information on prompt formatting (including exact templates and number of shots), decoding parameters (e.g., temperature, top-p, and beam size where applicable), and the standardized evaluation harness used across benchmarks. For BIG-bench, we followed the official few-shot protocol defined by the benchmark. Regarding statistical error bars and multiple runs, the computational cost of full evaluations on the 540B model across hundreds of tasks is extremely high, rendering repeated runs infeasible within our resource constraints. We prioritized comprehensive coverage of tasks over variance estimation. However, we will add notes on prompt sensitivity for key reasoning tasks where we observed consistent gains, and we maintain that the magnitude of the observed improvements (including discontinuous jumps) aligns with prior scaling studies even in the absence of error bars. revision: partial

  2. Referee: [Training data and setup] The description of the 780B token training corpus and data filtering/mixture is high-level; this directly impacts reproducibility of the reported scaling observations and assessment of potential contamination effects on the few-shot and BIG-bench results.

    Authors: We acknowledge that a more detailed description would aid reproducibility and contamination analysis. We will revise the 'Training Data' section (and associated appendix) to include expanded details on the data mixture ratios, specific sources within each category (web, books, code, multilingual), the quality filtering and deduplication methods applied, and the resulting token counts per category that total 780B tokens. We will also add a subsection discussing our contamination mitigation steps, including n-gram overlap checks against major benchmarks. While the full corpus cannot be released due to its scale and proprietary elements, these additions will provide sufficient information to interpret the scaling results and assess potential data leakage effects. revision: yes
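
The n-gram overlap check the authors commit to can be sketched in a few lines; the n-gram length and overlap threshold here are arbitrary illustrative choices, not the paper's reported procedure.

    # Minimal sketch of an n-gram overlap contamination check between one
    # training document and one evaluation item; window size and threshold
    # are arbitrary illustrative choices.
    def ngrams(text, n=8):
        tokens = text.lower().split()
        return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

    def is_contaminated(train_doc, eval_example, n=8, threshold=0.5):
        eval_grams = ngrams(eval_example, n)
        if not eval_grams:
            return False
        overlap = len(eval_grams & ngrams(train_doc, n)) / len(eval_grams)
        return overlap >= threshold

    train_doc = "the quick brown fox jumps over the lazy dog near the river bank"
    eval_item = "the quick brown fox jumps over the lazy dog"
    print(is_contaminated(train_doc, eval_item, n=5))  # True: the item appears verbatim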

Circularity Check

0 steps flagged

No significant circularity; direct empirical scaling results

full rationale

This is a large-scale empirical study reporting training of a 540B-parameter Transformer on 6144 TPU v4 chips and its few-shot evaluation across hundreds of benchmarks, including BIG-bench and reasoning tasks. No derivations, equations, or first-principles predictions appear; all performance claims rest on the reported experimental measurements rather than any fitted parameter being renamed as a prediction or any self-citation chain substituting for independent evidence. Bias, toxicity, and memorization analyses are likewise direct empirical checks. The central claims therefore remain self-contained experimental observations without reduction to inputs by construction.

Axiom & Free-Parameter Ledger

2 free parameters · 2 axioms · 0 invented entities

The work rests on the standard Transformer architecture and the empirical hypothesis that few-shot performance improves with scale; no new physical entities or ad-hoc constants are introduced.

free parameters (2)
  • model parameter count
    Chosen as the target scale to test continued scaling benefits.
  • training data mixture and volume
    Determined by available corpora and hardware constraints.
axioms (2)
  • domain assumption: The Transformer architecture remains effective at 540B scale
    Invoked implicitly by using the same architecture as prior models.
  • domain assumption: Few-shot evaluation on standard benchmarks measures meaningful capability gains
    Central to interpreting all reported results.

pith-pipeline@v0.9.0 · 5832 in / 1283 out tokens · 65370 ms · 2026-05-10T23:39:59.875499+00:00 · methodology

discussion (0)


Forward citations

Cited by 60 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. DSPy: Compiling Declarative Language Model Calls into Self-Improving Pipelines

    cs.CL 2023-10 conditional novelty 8.0

    DSPy compiles short declarative programs into LM pipelines that self-optimize and outperform both standard few-shot prompting and expert-written chains on math, retrieval, and QA tasks.

  2. AgentBench: Evaluating LLMs as Agents

    cs.AI 2023-08 unverdicted novelty 8.0

    AgentBench is a new multi-environment benchmark showing commercial LLMs outperform open-source models up to 70B parameters in agent tasks mainly due to better long-term reasoning and instruction following.

  3. Tree of Thoughts: Deliberate Problem Solving with Large Language Models

    cs.CL 2023-05 accept novelty 8.0

    Tree of Thoughts enables language models to solve complex planning tasks by generating, evaluating, and searching over coherent intermediate thoughts in a tree, raising Game of 24 success from 4% to 74% with GPT-4.

  4. All Circuits Lead to Rome: Rethinking Functional Anisotropy in Circuit and Sheaf Discovery for LLMs

    cs.CL 2026-05 unverdicted novelty 7.0

    LLM tasks are supported by multiple distinct circuits rather than unique mechanisms, demonstrated via Overlap-Aware Sheaf Repulsion and the Distributive Dense Circuit Hypothesis.

  5. VORT: Adaptive Power-Law Memory for NLP Transformers

    cs.LG 2026-05 unverdicted novelty 7.0

    VORT assigns learnable fractional orders to tokens and approximates their power-law retention kernels via sum-of-exponentials for efficient long-range dependency modeling in transformers.

  6. Parallel-SFT: Improving Zero-Shot Cross-Programming-Language Transfer for Code RL

    cs.CL 2026-04 unverdicted novelty 7.0

    Parallel-SFT mixes parallel programs across languages during SFT to produce more transferable RL initializations, yielding better zero-shot generalization to unseen programming languages.

  7. Rates of forgetting for the sequentially Markov coalescent

    math.PR 2026-04 unverdicted novelty 7.0

    SMC forgets its initial condition geometrically in the jump chain and as 1/ℓ in continuous genetic distance, justifying independent-locus approximations.

  8. A Full-Stack Performance Evaluation Infrastructure for 3D-DRAM-based LLM Accelerators

    cs.AR 2026-04 conditional novelty 7.0

    ATLAS is the first silicon-validated simulation framework for 3D-DRAM LLM accelerators, achieving under 8.57% error and over 97% correlation with real hardware while supporting design exploration.

  9. Scaling up Test-Time Compute with Latent Reasoning: A Recurrent Depth Approach

    cs.LG 2025-02 unverdicted novelty 7.0

    A recurrent-depth architecture enables language models to improve reasoning performance by iterating computation in latent space, achieving gains equivalent to much larger models on benchmarks.

  10. Moshi: a speech-text foundation model for real-time dialogue

    eess.AS 2024-09 accept novelty 7.0

    Moshi is the first real-time full-duplex spoken large language model that casts dialogue as speech-to-speech generation using parallel audio streams and an inner monologue of time-aligned text tokens.

  11. Medusa: Simple LLM Inference Acceleration Framework with Multiple Decoding Heads

    cs.LG 2024-01 conditional novelty 7.0

    Medusa augments LLMs with multiple decoding heads and tree-based attention to predict and verify several tokens in parallel, yielding 2.2-3.6x inference speedup via two fine-tuning regimes.

  12. GAIA: a benchmark for General AI Assistants

    cs.CL 2023-11 unverdicted novelty 7.0

    GAIA benchmark shows humans at 92% accuracy on simple real-world questions far outperform current AI systems at 15%, proposing this gap as a key milestone for general AI.

  13. Set-of-Mark Prompting Unleashes Extraordinary Visual Grounding in GPT-4V

    cs.CV 2023-10 accept novelty 7.0

    Set-of-Mark prompting marks segmented image regions with alphanumerics and masks to let GPT-4V achieve state-of-the-art zero-shot results on referring expression comprehension and segmentation benchmarks like RefCOCOg.

  14. Language Model Beats Diffusion -- Tokenizer is Key to Visual Generation

    cs.CV 2023-10 unverdicted novelty 7.0

    A new shared video-image tokenizer enables large language models to surpass diffusion models on standard visual generation benchmarks.

  15. C-Pack: Packed Resources For General Chinese Embeddings

    cs.CL 2023-09 accept novelty 7.0

    C-Pack releases a new Chinese embedding benchmark, large training dataset, and optimized models that outperform priors by up to 10% on C-MTEB while also delivering English SOTA results.

  16. Efficient Memory Management for Large Language Model Serving with PagedAttention

    cs.LG 2023-09 conditional novelty 7.0

    PagedAttention achieves near-zero waste in LLM key-value cache memory and enables 2-4x higher serving throughput than prior systems.

  17. VoxPoser: Composable 3D Value Maps for Robotic Manipulation with Language Models

    cs.RO 2023-07 unverdicted novelty 7.0

    VoxPoser uses LLMs to compose 3D value maps via VLM interaction for model-based synthesis of robust robot trajectories on open-set language-specified manipulation tasks.

  18. Direct Preference Optimization: Your Language Model is Secretly a Reward Model

    cs.LG 2023-05 accept novelty 7.0

    DPO derives the optimal policy directly from human preferences via a reparameterized reward model, solving the RLHF objective with only a binary classification loss and no sampling or separate reward model.

  19. Voyager: An Open-Ended Embodied Agent with Large Language Models

    cs.AI 2023-05 unverdicted novelty 7.0

    Voyager achieves superior lifelong learning in Minecraft by combining an automatic exploration curriculum, a library of executable skills, and iterative LLM prompting with environment feedback, yielding 3.3x more uniq...

  20. RWKV: Reinventing RNNs for the Transformer Era

    cs.CL 2023-05 unverdicted novelty 7.0

    RWKV uses a linear attention mechanism to deliver Transformer-level performance with RNN-style inference efficiency, demonstrated at up to 14 billion parameters.

  21. LLM+P: Empowering Large Language Models with Optimal Planning Proficiency

    cs.AI 2023-04 accept novelty 7.0

    LLM+P lets LLMs solve planning problems optimally by converting them to PDDL for classical planners and back to natural language.

  22. Visual Instruction Tuning

    cs.CV 2023-04 unverdicted novelty 7.0

    LLaVA is trained on GPT-4 generated visual instruction data to achieve 85.1% relative performance to GPT-4 on synthetic multimodal tasks and 92.53% accuracy on Science QA.

  23. Segment Anything

    cs.CV 2023-04 unverdicted novelty 7.0

    A promptable model trained on 1B masks achieves competitive zero-shot segmentation performance across tasks and is released publicly with its dataset.

  24. Neural Codec Language Models are Zero-Shot Text to Speech Synthesizers

    cs.CL 2023-01 unverdicted novelty 7.0

    VALL-E is a neural codec language model trained on 60K hours of speech that performs zero-shot TTS, synthesizing natural speech that matches an unseen speaker's voice, emotion, and environment from a 3-second prompt.

  25. Program of Thoughts Prompting: Disentangling Computation from Reasoning for Numerical Reasoning Tasks

    cs.CL 2022-11 unverdicted novelty 7.0

    PoT prompting improves numerical reasoning by having language models write programs executed by a computer instead of performing calculations in natural language chains of thought, with an average 12% gain over CoT.

  26. A Generalist Agent

    cs.AI 2022-05 accept novelty 7.0

    Gato is a multi-modal, multi-task, multi-embodiment generalist policy using one transformer network to handle text, vision, games, and robotics tasks.

  27. OPT: Open Pre-trained Transformer Language Models

    cs.CL 2022-05 unverdicted novelty 7.0

    OPT releases open decoder-only transformers up to 175B parameters that match GPT-3 performance at one-seventh the carbon cost, along with code and training logs.

  28. Flamingo: a Visual Language Model for Few-Shot Learning

    cs.CV 2022-04 unverdicted novelty 7.0

    Flamingo models reach new state-of-the-art few-shot results on image and video tasks by bridging frozen vision and language models with cross-attention layers trained on interleaved web-scale data.

  29. MetaColloc: Optimization-Free PDE Solving via Meta-Learned Basis Functions

    cs.LG 2026-05 unverdicted novelty 6.0

    MetaColloc meta-learns a universal set of neural basis functions offline so that new PDEs can be solved at test time with a single linear solve instead of per-equation neural-network optimization.

  30. Response Time Enhances Alignment with Heterogeneous Preferences

    cs.LG 2026-05 unverdicted novelty 6.0

    Response times modeled as drift-diffusion processes enable consistent estimation of population-average preferences from heterogeneous anonymous binary choices.

  31. Piper: Efficient Large-Scale MoE Training via Resource Modeling and Pipelined Hybrid Parallelism

    cs.DC 2026-05 unverdicted novelty 6.0

    Piper introduces resource modeling and pipelined hybrid parallelism for MoE training, delivering 2-3.5X higher MFU than prior frameworks and 1.2-9X better all-to-all bandwidth.

  32. A Meta Reinforcement Learning Approach to Goals-Based Wealth Management

    cs.LG 2026-05 unverdicted novelty 6.0

    MetaRL pre-trained on GBWM problems delivers near-optimal dynamic strategies in 0.01s achieving 97.8% of DP optimal utility and handles larger problems where DP fails.

  33. Repetition over Diversity: High-Signal Data Filtering for Sample-Efficient German Language Modeling

    cs.CL 2026-04 unverdicted novelty 6.0

    Repeating high-quality filtered German web data over multiple epochs produces better language models than single-pass training on larger, more diverse but lower-quality sets, even after seven epochs.

  34. Chain of Uncertain Rewards with Large Language Models for Reinforcement Learning

    cs.LG 2026-04 unverdicted novelty 6.0

    CoUR uses LLMs for efficient RL reward design through uncertainty quantification and similarity selection, achieving better performance and lower evaluation costs on IsaacGym and Bidexterous Manipulation benchmarks.

  35. Parcae: Scaling Laws For Stable Looped Language Models

    cs.LG 2026-04 unverdicted novelty 6.0

    Parcae stabilizes looped LLMs via spectral norm constraints on injection parameters, enabling power-law scaling for training FLOPs and saturating exponential scaling at test time that improves quality over fixed-depth...

  36. SeLaR: Selective Latent Reasoning in Large Language Models

    cs.CL 2026-04 unverdicted novelty 6.0

    SeLaR selectively applies latent soft reasoning in LLMs via entropy gating and contrastive regularization, outperforming standard CoT on five benchmarks without training.

  37. In-Place Test-Time Training

    cs.LG 2026-04 conditional novelty 6.0

    In-Place TTT adapts LLM MLP projection matrices at test time with a next-token-aligned objective and chunk-wise updates, enabling better long-context performance as a drop-in enhancement.

  38. Unmasking Hallucinations: A Causal Graph-Attention Perspective on Factual Reliability in Large Language Models

    cs.CL 2026-04 unverdicted novelty 6.0

    GCAN cuts LLM hallucination rates by 27.8% and raises factual accuracy by 16.4% on TruthfulQA and HotpotQA by using causal token graphs and a new Causal Contribution Score.

  39. Measuring Representation Robustness in Large Language Models for Geometry

    cs.CL 2026-04 unverdicted novelty 6.0

    LLMs display accuracy gaps of up to 14 percentage points on the same geometry problems solely due to representation choice, with vector forms consistently weakest and a convert-then-solve prompt helping only high-capa...

  40. Chameleon: Mixed-Modal Early-Fusion Foundation Models

    cs.CL 2024-05 unverdicted novelty 6.0

    Chameleon is an early-fusion token model that handles mixed image-text sequences for understanding and generation, achieving competitive or superior performance to larger models like Llama-2, Mixtral, and Gemini-Pro o...

  41. Are We on the Right Way for Evaluating Large Vision-Language Models?

    cs.CV 2024-03 conditional novelty 6.0

    Current LVLM benchmarks overestimate capabilities because many questions can be answered without images due to design flaws or data leakage; MMStar is a human-curated set of 1,500 vision-indispensable samples across 6...

  42. LiveCodeBench: Holistic and Contamination Free Evaluation of Large Language Models for Code

    cs.SE 2024-03 unverdicted novelty 6.0

    LiveCodeBench collects 400 recent contest problems to create a contamination-free benchmark evaluating LLMs on code generation and related capabilities like self-repair and execution.

  43. Grounded SAM: Assembling Open-World Models for Diverse Visual Tasks

    cs.CV 2024-01 unverdicted novelty 6.0

    Grounded SAM integrates Grounding DINO and SAM to support text-prompted open-world detection and segmentation, achieving 48.7 mean AP on SegInW zero-shot with the base detector and huge segmenter.

  44. SGLang: Efficient Execution of Structured Language Model Programs

    cs.AI 2023-12 conditional novelty 6.0

    SGLang is a new system that speeds up structured LLM programs by up to 6.4x using RadixAttention for KV cache reuse and compressed finite state machines for output decoding.

  45. ShareGPT4V: Improving Large Multi-Modal Models with Better Captions

    cs.CV 2023-11 conditional novelty 6.0

    A new 1.2M-caption dataset generated via GPT-4V improves LMMs on MME and MMBench by 222.8/22.0/22.3 and 2.7/1.3/1.5 points respectively when used for supervised fine-tuning.

  46. Qwen-Audio: Advancing Universal Audio Understanding via Unified Large-Scale Audio-Language Models

    eess.AS 2023-11 unverdicted novelty 6.0

    Qwen-Audio trains a unified model on diverse audio and tasks with hierarchical tags to enable strong zero-shot performance on audio understanding benchmarks and multi-turn audio chat.

  47. MetaMath: Bootstrap Your Own Mathematical Questions for Large Language Models

    cs.CL 2023-09 conditional novelty 6.0

    Bootstrapping math questions via rewriting creates MetaMathQA; fine-tuning LLaMA-2 on it yields 66.4% on GSM8K for 7B and 82.3% for 70B, beating prior same-size models by large margins.

  48. Textbooks Are All You Need II: phi-1.5 technical report

    cs.CL 2023-09 unverdicted novelty 6.0

    phi-1.5 is a 1.3B parameter model trained on synthetic textbook data that matches the reasoning performance of models five times larger on natural language, math, and basic coding tasks.

  49. YaRN: Efficient Context Window Extension of Large Language Models

    cs.CL 2023-08 unverdicted novelty 6.0

    YaRN extends the context window of RoPE-based LLMs like LLaMA more efficiently than prior methods, using 10x fewer tokens and 2.5x fewer steps while surpassing state-of-the-art performance and enabling extrapolation b...

  50. Textbooks Are All You Need

    cs.CL 2023-06 unverdicted novelty 6.0

    A 1.3B-parameter code model trained on 7B tokens of curated textbook and synthetic data achieves 50.6% on HumanEval, indicating data quality can enable strong performance at small scale.

  51. MiniLLM: On-Policy Distillation of Large Language Models

    cs.CL 2023-06 conditional novelty 6.0

    MiniLLM distills large language models into smaller ones via reverse KL divergence and on-policy optimization, yielding higher-quality responses with lower exposure bias than standard KD baselines.

  52. The RefinedWeb Dataset for Falcon LLM: Outperforming Curated Corpora with Web Data, and Web Data Only

    cs.CL 2023-06 unverdicted novelty 6.0

    Properly filtered web data from CommonCrawl alone trains LLMs that significantly outperform models trained on The Pile, with 600 billion tokens and 1.3B/7.5B parameter models released.

  53. Gorilla: Large Language Model Connected with Massive APIs

    cs.CL 2023-05 conditional novelty 6.0

    Gorilla is a fine-tuned LLM that surpasses GPT-4 in accurate API call generation and uses retrieval to handle documentation updates.

  54. GQA: Training Generalized Multi-Query Transformer Models from Multi-Head Checkpoints

    cs.CL 2023-05 unverdicted novelty 6.0

    Uptraining multi-head transformer checkpoints to grouped-query attention models achieves near multi-head quality at multi-query inference speeds using 5% additional compute.

  55. MiniGPT-4: Enhancing Vision-Language Understanding with Advanced Large Language Models

    cs.CV 2023-04 conditional novelty 6.0

    MiniGPT-4 shows that aligning a frozen vision encoder to Vicuna via one projection layer plus a second-stage detailed-description fine-tune produces GPT-4-like vision-language abilities including detailed captions, cr...

  56. Teaching Large Language Models to Self-Debug

    cs.CL 2023-04 unverdicted novelty 6.0

    Self-Debugging teaches LLMs to identify and fix their own code errors through rubber-duck-style natural language explanations and execution feedback, delivering 2-12% gains over baselines on Spider, TransCoder, and MBPP.

  57. CAMEL: Communicative Agents for "Mind" Exploration of Large Language Model Society

    cs.AI 2023-03 conditional novelty 6.0

    CAMEL proposes a role-playing framework with inception prompting that enables autonomous multi-agent cooperation among LLMs and generates conversational data for studying their behaviors.

  58. HuggingGPT: Solving AI Tasks with ChatGPT and its Friends in Hugging Face

    cs.CL 2023-03 unverdicted novelty 6.0

    HuggingGPT is an agent system where ChatGPT plans and orchestrates calls to Hugging Face models to solve complex multi-modal AI tasks.

  59. BloombergGPT: A Large Language Model for Finance

    cs.LG 2023-03 conditional novelty 6.0

    BloombergGPT is a 50B parameter LLM trained on a 708B token mixed financial and general dataset that outperforms prior models on financial benchmarks while preserving general LLM performance.

  60. MM-REACT: Prompting ChatGPT for Multimodal Reasoning and Action

    cs.CV 2023-03 unverdicted novelty 6.0

    MM-REACT uses textual prompts to let ChatGPT collaborate with external vision experts for zero-shot multimodal reasoning and action on advanced visual tasks.

Reference graph

Works this paper leans on

175 extracted references · 175 canonical work pages · cited by 85 Pith papers · 29 internal anchors

  1. [1]

    URL https://github.com/google-research/t5x

    T5x, 2021. URL https://github.com/google-research/t5x

  2. [2]

    Persistent anti-muslim bias in large language models

    Abid, A., Farooqi, M., and Zou, J. Persistent anti-muslim bias in large language models. CoRR, abs/2101.05783, 2021. URL https://arxiv.org/abs/2101.05783

  3. [3]

    Towards a human-like open-domain chatbot

    Adiwardana, D., Luong, M.-T., So, D. R., Hall, J., Fiedel, N., Thoppilan, R., Yang, Z., Kulshreshtha, A., Nemade, G., Lu, Y., et al. Towards a human-like open-domain chatbot. arXiv preprint arXiv:2001.09977, 2020

  4. [4]

    The adverse effects of code duplication in machine learning models of code

    Allamanis, M. The adverse effects of code duplication in machine learning models of code. In SPLASH Onward! , 2019

  5. [5]

    A survey of machine learning for big code and naturalness

    Allamanis, M., Barr, E. T., Devanbu, P., and Sutton, C. A survey of machine learning for big code and naturalness. ACM Comput. Surv., 51 0 (4), jul 2018. ISSN 0360-0300. doi:10.1145/3212695. URL https://doi.org/10.1145/3212695

  6. [6]

    MathQA: Towards interpretable math word problem solving with operation-based formalisms

    Amini, A., Gabriel, S., Lin, S., Koncel - Kedziorski, R., Choi, Y., and Hajishirzi, H. Mathqa: Towards interpretable math word problem solving with operation-based formalisms. CoRR, abs/1905.13319, 2019. URL http://arxiv.org/abs/1905.13319

  7. [7]

    Program Synthesis with Large Language Models

    Austin, J., Odena, A., Nye, M., Bosma, M., Michalewski, H., Dohan, D., Jiang, E., Cai, C., Terry, M., Le, Q., and Sutton, C. Program synthesis with large language models. arXiv preprint arXiv:2108.07732, 2021. URL https://arxiv.org/abs/2108.07732

  8. [8]

    Structure-to-text generation with self-training, acceptability classifiers and context-conditioning for the GEM shared task

    Bakshi, S., Batra, S., Heidari, P., Arun, A., Jain, S., and White, M. Structure-to-text generation with self-training, acceptability classifiers and context-conditioning for the GEM shared task. In Proceedings of the 1st Workshop on Natural Language Generation, Evaluation, and Metrics (GEM 2021), pp.\ 136--147, Online, August 2021. Association for Computa...

  9. [9]

    Pathways: Asynchronous distributed dataflow for ML

    Barham, P., Chowdhery, A., Dean, J., Ghemawat, S., Hand, S., Hurt, D., Isard, M., Lim, H., Pang, R., Roy, S., Saeta, B., Schuh, P., Sepassi, R., Shafey, L. E., Thekkath, C. A., and Wu, Y. Pathways: Asynchronous distributed dataflow for ML . To appear in MLSys 2022, 2022. URL https://arxiv.org/abs/2203.12533

  10. [10]

    Designing Disaggregated Evaluations of AI Systems: Choices, Considerations, and Tradeoffs

    Barocas, S., Guo, A., Kamar, E., Krones, J., Morris, M. R., Vaughan, J. W., Wadsworth, W. D., and Wallach, H. Designing Disaggregated Evaluations of AI Systems: Choices, Considerations, and Tradeoffs, pp.\ 368–378. Association for Computing Machinery, New York, NY, USA, 2021. ISBN 9781450384735. URL https://doi.org/10.1145/3461702.3462610

  11. [11]

    On the dangers of stochastic parrots: Can language models be too big?

    Bender, E. M., Gebru, T., McMillan-Major, A., and Shmitchell, S. On the dangers of stochastic parrots: Can language models be too big? In Proceedings of the 2021 ACM Conference on Fairness, Accountability, and Transparency, FAccT '21, pp.\ 610–623. Association for Computing Machinery, 2021. URL https://doi.org/10.1145/3442188.3445922

  12. [12]

    Semantic parsing on freebase from question-answer pairs

    Berant, J., Chou, A., Frostig, R., and Liang, P. Semantic parsing on freebase from question-answer pairs. In Proceedings of the 2013 conference on empirical methods in natural language processing, pp.\ 1533--1544, 2013

  13. [13]

    Beyond the imitation game: Measuring and extrapolating the capabilities of language models

    BIG-bench collaboration . Beyond the imitation game: Measuring and extrapolating the capabilities of language models. In preparation, 2021. URL https://github.com/google/BIG-bench/

  14. [14]

    NLTK: The natural language toolkit

    Bird, S. and Loper, E. NLTK : The natural language toolkit. In Proceedings of the ACL Interactive Poster and Demonstration Sessions , pp.\ 214--217, Barcelona, Spain, July 2004. Association for Computational Linguistics. URL https://aclanthology.org/P04-3031

  15. [15]

    Piqa: Reasoning about physical commonsense in natural language

    Bisk, Y., Zellers, R., Bras, R. L., Gao, J., and Choi, Y. PIQA: reasoning about physical commonsense in natural language. CoRR, abs/1911.11641, 2019. URL http://arxiv.org/abs/1911.11641

  16. [16]

    Language (technology) is power: A critical survey of "bias" in NLP

    Blodgett, S. L., Barocas, S., Daum \'e III, H., and Wallach, H. Language (technology) is power: A critical survey of `` bias '' in NLP . In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pp.\ 5454--5476, Online, July 2020. Association for Computational Linguistics. doi:10.18653/v1/2020.acl-main.485. URL https://ac...

  17. [17]

    Stereotyping Norwegian salmon: An inventory of pitfalls in fairness benchmark datasets

    Blodgett, S. L., Lopez, G., Olteanu, A., Sim, R., and Wallach, H. Stereotyping N orwegian salmon: An inventory of pitfalls in fairness benchmark datasets. In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), pp.\ 1004--...

  18. [18]

    Bommasani, R. and et. al., D. A. H. On the opportunities and risks of foundation models. arXiv preprint arXiv:2108.07258, 2021. URL https://arxiv.org/abs/2108.07258

  19. [19]

    Borgeaud, S., Mensch, A., Hoffmann, J., Cai, T., Rutherford, E., Millican, K., Driessche, G. v. d., Lespiau, J.-B., Damoc, B., Clark, A., et al. Improving language models by retrieving from trillions of tokens. arXiv preprint arXiv:2112.04426, 2021

  20. [20]

    JAX: Composable transformations of Python+NumPy programs

    Bradbury, J., Frostig, R., Hawkins, P., Johnson, M. J., Leary, C., Maclaurin, D., Necula, G., Paszke, A., Vander P las, J., Wanderman- M ilne, S., and Zhang, Q. JAX : Composable transformations of P ython+ N um P y programs, 2018. URL http://github.com/google/jax

  21. [21]

    Brown, T., Mann, B., Ryder, N., Subbiah, M., Kaplan, J. D., Dhariwal, P., Neelakantan, A., Shyam, P., Sastry, G., Askell, A., Agarwal, S., Herbert-Voss, A., Krueger, G., Henighan, T., Child, R., Ramesh, A., Ziegler, D., Wu, J., Winter, C., Hesse, C., Chen, M., Sigler, E., Litwin, M., Gray, S., Chess, B., Clark, J., Berner, C., McCandlish, S., Radford, A.,...

  22. [22]

    Cao, Y. T. and Daum \'e III, H. Toward gender-inclusive coreference resolution. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pp.\ 4568--4595, Online, July 2020. Association for Computational Linguistics. doi:10.18653/v1/2020.acl-main.418. URL https://aclanthology.org/2020.acl-main.418

  23. [23]

    Quantifying Memorization Across Neural Language Models

    Carlini, N., Ippolito, D., Jagielski, M., Lee, K., Tramer, F., and Zhang, C. Quantifying memorization across neural language models. arXiv preprint arXiv:2202.07646, 2022

  24. [24]

    The 2020 bilingual, bi-directional webnlg+ shared task overview and evaluation results (webnlg+ 2020)

    Castro Ferreira, T., Gardent, C., Ilinykh, N., van der Lee, C., Mille, S., Moussallem, D., and Shimorina, A. The 2020 bilingual, bi-directional webnlg+ shared task overview and evaluation results (webnlg+ 2020). In Proceedings of the 3rd WebNLG Workshop on Natural Language Generation from the Semantic Web (WebNLG+ 2020), pp.\ 55--76, Dublin, Ireland (Virt...

  25. [25]

    Tagged back-translation

    Caswell, I., Chelba, C., and Grangier, D. Tagged back-translation. In Proceedings of the Fourth Conference on Machine Translation (Volume 1: Research Papers), pp.\ 53--63, Florence, Italy, August 2019. Association for Computational Linguistics. doi:10.18653/v1/W19-5206. URL https://aclanthology.org/W19-5206

  26. [26]

    Evaluating Large Language Models Trained on Code

    Chen, M., Tworek, J., Jun, H., Yuan, Q., Ponde, H., Kaplan, J., Edwards, H., Burda, Y., Joseph, N., Brockman, G., et al. Evaluating large language models trained on code. arXiv preprint arXiv:2107.03374, 2021. URL https://arxiv.org/abs/2107.03374

  27. [27]

    Generating Long Sequences with Sparse Transformers

    Child, R., Gray, S., Radford, A., and Sutskever, I. Generating long sequences with sparse transformers. arXiv preprint arXiv:1904.10509, 2019

  28. [28]

    QuAC: Question answering in context

    Choi, E., He, H., Iyyer, M., Yatskar, M., Yih, W., Choi, Y., Liang, P., and Zettlemoyer, L. Qu AC : Question answering in context. CoRR, abs/1808.07036, 2018. URL http://arxiv.org/abs/1808.07036

  29. [29]

    Rethinking Attention with Performers

    Choromanski, K., Likhosherstov, V., Dohan, D., Song, X., Gane, A., Sarlos, T., Hawkins, P., Davis, J., Mohiuddin, A., Kaiser, L., et al. Rethinking attention with performers. arXiv preprint arXiv:2009.14794, 2020

  30. [30]

    TyDi QA: A benchmark for information-seeking question answering in typologically diverse languages

    Clark, J., Choi, E., Collins, M., Garrette, D., Kwiatkowski, T., Nikolaev, V., and Palomaki, J. Tydi QA : A benchmark for information-seeking question answering in typologically diverse languages. Transactions of the Association for Computational Linguistics, 2020. URL https://storage.googleapis.com/tydiqa/tydiqa.pdf

  31. [31]

    Think you have Solved Question Answering? Try ARC, the AI2 Reasoning Challenge

    Clark, P., Cowhey, I., Etzioni, O., Khot, T., Sabharwal, A., Schoenick, C., and Tafjord, O. Think you have solved question answering? try arc, the ai2 reasoning challenge. ArXiv, abs/1803.05457, 2018

  32. [32]

    Training Verifiers to Solve Math Word Problems

    Cobbe, K., Kosaraju, V., Bavarian, M., Hilton, J., Nakano, R., Hesse, C., and Schulman, J. Training verifiers to solve math word problems. arXiv preprint arXiv:2110.14168, 2021. URL https://arxiv.org/abs/2110.14168

  33. [33]

    On measuring and mitigating biased inferences of word embeddings

    Dev, S., Li, T., Phillips, J. M., and Srikumar, V. On measuring and mitigating biased inferences of word embeddings. CoRR, abs/1908.09369, 2019. URL http://arxiv.org/abs/1908.09369

  34. [34]

    Harms of gender exclusivity and challenges in non-binary representation in language technologies

    Dev, S., Monajatipoor, M., Ovalle, A., Subramonian, A., Phillips, J., and Chang, K.-W. Harms of gender exclusivity and challenges in non-binary representation in language technologies. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, pp.\ 1968--1994, Online and Punta Cana, Dominican Republic, November 2021 a . Ass...

  35. [35]

    On measures of biases and harms in NLP

    Dev, S., Sheng, E., Zhao, J., Sun, J., Hou, Y., Sanseverino, M., Kim, J., Peng, N., and Chang, K. What do bias measures measure? CoRR, abs/2108.03362, 2021 b . URL https://arxiv.org/abs/2108.03362

  36. [36]

    BERT: Pre-training of deep bidirectional transformers for language understanding

    Devlin, J., Chang, M.-W., Lee, K., and Toutanova, K. BERT : Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North A merican Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers) , pp.\ 4171--4186, Minneapolis, Minnesot...

  37. [37]

    Measuring and mitigating unintended bias in text classification

    Dixon, L., Li, J., Sorensen, J., Thain, N., and Vasserman, L. Measuring and mitigating unintended bias in text classification. In Proceedings of the 2018 AAAI/ACM Conference on AI, Ethics, and Society, AIES '18, pp.\ 67–73, New York, NY, USA, 2018. Association for Computing Machinery. ISBN 9781450360128. doi:10.1145/3278721.3278729. URL https://doi.org/10...

  38. [38]

    Documenting large webtext corpora: A case study on the colossal clean crawled corpus

    Dodge, J., Sap, M., Marasovi \'c , A., Agnew, W., Ilharco, G., Groeneveld, D., Mitchell, M., and Gardner, M. Documenting large webtext corpora: A case study on the colossal clean crawled corpus. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, pp.\ 1286--1305, 2021

  39. [39]

    GLaM: Efficient scaling of language models with mixture-of-experts

    Du, N., Huang, Y., Dai, A. M., Tong, S., Lepikhin, D., Xu, Y., Krikun, M., Zhou, Y., Yu, A. W., Firat, O., et al. GLaM : Efficient scaling of language models with mixture-of-experts. arXiv preprint arXiv:2112.06905, 2021. URL https://arxiv.org/pdf/2112.06905

  40. [40]

    DROP: A reading comprehension benchmark requiring discrete reasoning over paragraphs

    Dua, D., Wang, Y., Dasigi, P., Stanovsky, G., Singh, S., and Gardner, M. DROP : A reading comprehension benchmark requiring discrete reasoning over paragraphs. In Proceedings of the 2019 Conference of the North A merican Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers) , pp.\ 2368--237...

  41. [41]

    Neural generation for Czech: Data and baselines

    Dušek, O. and Jurčíček, F. Neural generation for Czech: Data and baselines. 2019

  42. [42]

    Neural Generation for Czech: Data and Baselines

    Dušek, O. and Jurčíček, F. Neural Generation for Czech : Data and Baselines . In Proceedings of the 12th International Conference on Natural Language Generation ( INLG 2019) , pp.\ 563--574, Tokyo, Japan, October 2019. URL https://www.aclweb.org/anthology/W19-8670/

  43. [43]

    Semantic Noise Matters for Neural Natural Language Generation

    Dušek, O., Howcroft, D. M., and Rieser, V. Semantic Noise Matters for Neural Natural Language Generation . In Proceedings of the 12th International Conference on Natural Language Generation ( INLG 2019) , pp.\ 421--426, Tokyo, Japan, 2019. URL https://www.aclweb.org/anthology/W19-8652/

  44. [44]

    Understanding back-translation at scale

    Edunov, S., Ott, M., Auli, M., and Grangier, D. Understanding back-translation at scale. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pp.\ 489--500, 2018. URL https://aclanthology.org/D18-1045

  45. [45]

    Beyond english-centric multilingual machine translation

    Fan, A., Bhosale, S., Schwenk, H., Ma, Z., El - Kishky, A., Goyal, S., Baines, M., Celebi, O., Wenzek, G., Chaudhary, V., Goyal, N., Birch, T., Liptchinsky, V., Edunov, S., Grave, E., Auli, M., and Joulin, A. Beyond english-centric multilingual machine translation. CoRR, abs/2010.11125, 2020. URL https://arxiv.org/abs/2010.11125

  46. [46]

    Switch Transformers: Scaling to Trillion Parameter Models with Simple and Efficient Sparsity

    Fedus, W., Zoph, B., and Shazeer, N. Switch transformers: Scaling to trillion parameter models with simple and efficient sparsity. arXiv preprint arXiv:2101.03961, 2021. URL https://arxiv.org/abs/2101.03961

  47. [47]

    Complete multilingual neural machine translation

    Freitag, M. and Firat, O. Complete multilingual neural machine translation. CoRR, abs/2010.10239, 2020. URL https://arxiv.org/abs/2010.10239

  48. [48]

    The state of sparsity in deep neural networks

    Gale, T., Elsen, E., and Hooker, S. The state of sparsity in deep neural networks. arXiv preprint arXiv:1902.09574, 2019

  49. [49]

    Creating training corpora for NLG micro-planners

    Gardent, C., Shimorina, A., Narayan, S., and Perez-Beltrachini, L. Creating training corpora for nlg micro-planners. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pp.\ 179--188. Association for Computational Linguistics, 2017. doi:10.18653/v1/P17-1017. URL http://www.aclweb.org/antholog...

  50. [50]

    Datasheets for datasets

    Gebru, T., Morgenstern, J., Vecchione, B., Vaughan, J. W., Wallach, H., III, H. D., and Crawford, K. Datasheets for datasets. Commun. ACM, 64 0 (12): 0 86–92, nov 2021. ISSN 0001-0782. doi:10.1145/3458723. URL https://doi.org/10.1145/3458723

  51. [51]

    Gehman, S., Gururangan, S., Sap, M., Choi, Y., and Smith, N. A. Realtoxicityprompts: Evaluating neural toxic degeneration in language models, 2020

  52. [52]

    The GEM benchmark: Natural language generation, its evaluation and metrics

    Gehrmann, S., Adewumi, T., Aggarwal, K., Ammanamanchi, P. S., Aremu, A., Bosselut, A., Chandu, K. R., Clinciu, M.-A., Das, D., Dhole, K., Du, W., Durmus, E., Du s ek, O., Emezue, C. C., Gangal, V., Garbacea, C., Hashimoto, T., Hou, Y., Jernite, Y., Jhamtani, H., Ji, Y., Jolly, S., Kale, M., Kumar, D., Ladhak, F., Madaan, A., Maddela, M., Mahajan, K., Maha...

  53. [53]

    Repairing the cracked foundation: A survey of obstacles in evaluation practices for generated text, 2022

    Gehrmann, S., Clark, E., and Sellam, T. Repairing the cracked foundation: A survey of obstacles in evaluation practices for generated text. CoRR, abs/2202.06935, 2022. URL https://arxiv.org/abs/2202.06935

  54. [54]

    Geirhos, R., Jacobsen, J.-H., Michaelis, C., Zemel, R., Brendel, W., Bethge, M., and Wichmann, F. A. Shortcut learning in deep neural networks. Nature Machine Intelligence, 2: 0 665--673, Nov 2020. doi:https://doi.org/10.1038/s42256-020-00257-z. URL https://www.nature.com/articles/s42256-020-00257-z

  55. [55]

    Did Aristotle Use a Laptop? A Question Answering Benchmark with Implicit Reasoning Strategies

    Geva, M., Khashabi, D., Segal, E., Khot, T., Roth, D., and Berant, J. Did aristotle use a laptop? a question answering benchmark with implicit reasoning strategies. Transactions of the Association for Computational Linguistics, 9: 0 346--361, 2021. doi:10.1162/tacl_a_00370. URL https://aclanthology.org/2021.tacl-1.21

  56. [56]

    Google cloud classifying content, a

    Google Cloud NLP . Google cloud classifying content, a . URL https://cloud.google.com/natural-language/docs/classifying-text

  57. [57]

    Google cloud infotype detector, b

    Google Cloud NLP . Google cloud infotype detector, b . URL https://cloud.google.com/dlp/docs/infotypes-reference

  58. [58]

    Gupta, R., Pal, S., Kanade, A., and Shevade, S. K. Deepfix: Fixing common C language errors by deep learning. In Singh, S. P. and Markovitch, S. (eds.), Proceedings of the Thirty-First AAAI Conference on Artificial Intelligence, February 4-9, 2017, San Francisco, California, USA , pp.\ 1345--1351. AAAI Press, 2017. URL http://aaai.org/ocs/index.php/AAAI/A...

  59. [59]

    Retrieval augmented language model pre-training

    Guu, K., Lee, K., Tung, Z., Pasupat, P., and Chang, M. Retrieval augmented language model pre-training. In III, H. D. and Singh, A. (eds.), Proceedings of the 37th International Conference on Machine Learning, volume 119 of Proceedings of Machine Learning Research, pp.\ 3929--3938. PMLR, 13--18 Jul 2020. URL https://proceedings.mlr.press/v119/guu20a.html

  60. [60]

    Measuring massive multitask language understanding

    Hendrycks, D., Burns, C., Basart, S., Zou, A., Mazeika, M., Song, D., and Steinhardt, J. Measuring massive multitask language understanding. Proceedings of the International Conference on Learning Representations (ICLR), 2021

  61. [61]

    Hoffmann, J., Borgeaud, S., Mensch, A., Buchatskaya, E., Cai, T., Rutherford, E., Casas, D. d. L., Hendricks, L. A., Welbl, J., Clark, A., Hennigan, T., Noland, E., Millican, K., Driessche, G. v. d., Damoc, B., Guy, A., Osindero, S., Simonyan, K., Elsen, E., Rae, J. W., Vinyals, O., and Sifre, L. Training compute-optimal large language models. arXiv prepr...

  62. [62]

    GPipe: Efficient training of giant neural networks using pipeline parallelism

    Huang, Y., Cheng, Y., Bapna, A., Firat, O., Chen, D., Chen, M., Lee, H., Ngiam, J., Le, Q. V., Wu, Y., et al. GPipe : Efficient training of giant neural networks using pipeline parallelism. In Advances in neural information processing systems, pp.\ 103--112, 2019

  63. [63]

    Social biases in nlp models as barriers for persons with disabilities

    Hutchinson, B., Prabhakaran, V., Denton, E., Webster, K., Zhong, Y., and Denuyl, S. Social biases in nlp models as barriers for persons with disabilities. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pp.\ 5491--5501, 2020

  64. [64]

    Jacobs, A. Z. and Wallach, H. Measurement and fairness. In Proceedings of the 2021 ACM Conference on Fairness, Accountability, and Transparency, FAccT '21, pp.\ 375–385, New York, NY, USA, 2021. Association for Computing Machinery. ISBN 9781450383097. doi:10.1145/3442188.3445901. URL https://doi.org/10.1145/3442188.3445901

  65. [65]

    TriviaQA: A large scale distantly supervised challenge dataset for reading comprehension

    Joshi, M., Choi, E., Weld, D., and Zettlemoyer, L. T rivia QA : A large scale distantly supervised challenge dataset for reading comprehension. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pp.\ 1601--1611, 2017. URL https://aclanthology.org/P17-1147

  66. [66]

    A domain-specific supercomputer for training deep neural networks

    Jouppi, N. P., Yoon, D. H., Kurian, G., Li, S., Patil, N., Laudon, J., Young, C., and Patterson, D. A domain-specific supercomputer for training deep neural networks. Communications of the ACM, 63 0 (7): 0 67--78, 2020

  67. [67]

    Deduplicating training data mitigates privacy risks in language models

    Kandpal, N., Wallace, E., and Raffel, C. Deduplicating training data mitigates privacy risks in language models. 2022. URL https://arxiv.org/abs/2202.06539

  68. [68]

    Scaling Laws for Neural Language Models

    Kaplan, J., McCandlish, S., Henighan, T., Brown, T. B., Chess, B., Child, R., Gray, S., Radford, A., Wu, J., and Amodei, D. Scaling laws for neural language models. arXiv preprint arXiv:2001.08361, 2020

  69. [69]

    Kingma, D. P. and Ba, J. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014

  70. [70]

    Reformer: The Efficient Transformer

    Kitaev, N., Kaiser, Ł., and Levskaya, A. Reformer: The efficient transformer. arXiv preprint arXiv:2001.04451, 2020

  71. [71]

    URL https://knowyourdata.withgoogle.com/

    Know Your Data . URL https://knowyourdata.withgoogle.com/

  72. [72]

    MAWPS: A math word problem repository

    Koncel-Kedziorski, R., Roy, S., Amini, A., Kushman, N., and Hajishirzi, H. MAWPS : A math word problem repository. In Proceedings of the 2016 Conference of the North A merican Chapter of the Association for Computational Linguistics: Human Language Technologies , pp.\ 1152--1157, San Diego, California, June 2016. Association for Computational Linguistics....

  73. [74]

    SentencePiece: A simple and language independent subword tokenizer and detokenizer for neural text processing

    Kudo, T. and Richardson, J. Sentencepiece: A simple and language independent subword tokenizer and detokenizer for neural text processing. In Blanco, E. and Lu, W. (eds.), Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, EMNLP 2018: System Demonstrations, Brussels, Belgium, October 31 - November 4, 2018 , pp.\ 66--71...

  74. [75]

    SPoC: Search-based pseudocode to code

    Kulal, S., Pasupat, P., Chandra, K., Lee, M., Padon, O., Aiken, A., and Liang, P. SPoC : Search-based pseudocode to code. In Advances in Neural Information Processing Systems, June 2019

  75. [76]

    Quantifying social biases in contextual word representations

    Kurita, K., Vyas, N., Pareek, A., Black, A. W., and Tsvetkov, Y. Quantifying social biases in contextual word representations. 1st ACL Workshop on Gender Bias for Natural Language Processing, 2019. URL https://par.nsf.gov/biblio/10098355

  76. [77]

    Natural Questions: A benchmark for question answering research

    Kwiatkowski, T., Palomaki, J., Redfield, O., Collins, M., Parikh, A., Alberti, C., Epstein, D., Polosukhin, I., Devlin, J., Lee, K., Toutanova, K., Jones, L., Kelcey, M., Chang, M.-W., Dai, A. M., Uszkoreit, J., Le, Q., and Petrov, S. Natural Q uestions: A benchmark for question answering research. Transactions of the Association for Computational Linguis...

  77. [78]

    Unsupervised translation of programming languages

    Lachaux, M., Rozière, B., Chanussot, L., and Lample, G. Unsupervised translation of programming languages. CoRR, abs/2006.03511, 2020. URL https://arxiv.org/abs/2006.03511

  78. [79]

    WikiLingua: A new benchmark dataset for cross-lingual abstractive summarization

    Ladhak, F., Durmus, E., Cardie, C., and McKeown, K. W iki L ingua: A new benchmark dataset for cross-lingual abstractive summarization. In Findings of the Association for Computational Linguistics: EMNLP 2020, pp.\ 4034--4048, Online, November 2020. Association for Computational Linguistics. doi:10.18653/v1/2020.findings-emnlp.360. URL https://www.aclweb....

  79. [80]

    RACE: Large-scale ReAding comprehension dataset from examinations

    Lai, G., Xie, Q., Liu, H., Yang, Y., and Hovy, E. RACE : Large-scale R e A ding comprehension dataset from examinations. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, pp.\ 785--794, Copenhagen, Denmark, September 2017. Association for Computational Linguistics. doi:10.18653/v1/D17-1082. URL https://aclanthology...

  80. [81]

    MWPToolkit: An open-source framework for deep learning-based math word problem solvers

    Lan, Y., Wang, L., Zhang, Q., Lan, Y., Dai, B. T., Wang, Y., Zhang, D., and Lim, E.-P. Mwptoolkit: An open-source framework for deep learning-based math word problem solvers. arXiv preprint arXiv:2109.00799, 2021

Showing first 80 references.