Recognition: no theorem link
Language Models are Few-Shot Learners
Pith reviewed 2026-05-10 12:00 UTC · model grok-4.3
The pith
Scaling language models to 175 billion parameters enables strong few-shot performance on NLP tasks without any fine-tuning.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
We show that scaling up language models greatly improves task-agnostic, few-shot performance, sometimes even reaching competitiveness with prior state-of-the-art fine-tuning approaches. Specifically, we train GPT-3, an autoregressive language model with 175 billion parameters, 10x more than any previous non-sparse language model, and test its performance in the few-shot setting. For all tasks, GPT-3 is applied without any gradient updates or fine-tuning, with tasks and few-shot demonstrations specified purely via text interaction with the model.
What carries the argument
The 175-billion-parameter autoregressive language model GPT-3, which performs tasks entirely through in-context examples supplied in natural-language prompts.
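To make the prompting setup concrete, here is a minimal sketch of how such a k-shot prompt could be assembled; the Q/A template, task description, and example contents are illustrative assumptions, not the paper's exact per-task formats.

# Minimal sketch of assembling a k-shot prompt; the "Q:/A:" template and task
# description below are assumed for illustration only.
def build_few_shot_prompt(demonstrations, query, task_description=""):
    # demonstrations: list of (input_text, target_text) pairs; query: the new input
    blocks = [task_description] if task_description else []
    for x, y in demonstrations:
        blocks.append(f"Q: {x}\nA: {y}")
    blocks.append(f"Q: {query}\nA:")  # the frozen model completes the final answer
    return "\n\n".join(blocks)

# Example: a 2-shot arithmetic prompt in the spirit of the paper's on-the-fly tasks
prompt = build_few_shot_prompt(
    demonstrations=[("What is 48 plus 76?", "124"), ("What is 97 minus 39?", "58")],
    query="What is 215 plus 648?",
    task_description="Answer the arithmetic question.",
)
# The prompt is passed to the model as plain text; no gradient updates occur.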
If this is right
- GPT-3 achieves strong results on translation, question-answering, cloze tasks, and on-the-fly reasoning problems such as arithmetic and novel-word usage.
- The model generates news articles that human evaluators have difficulty distinguishing from articles written by humans.
- Performance on few-shot tasks improves with model scale, allowing the 175B model to outperform smaller predecessors.
- GPT-3's few-shot learning still struggles on certain datasets, and some evaluations face contamination risks from the model's web-scale training data.
Where Pith is reading between the lines
- The prompting approach could lower the barrier to applying language models on new tasks by removing the need to collect large fine-tuning datasets.
- Further increases in model size might extend few-shot competence to additional domains that currently require specialized training.
- Widespread use of such models would intensify the need for reliable methods to detect machine-generated text in news and other content.
Load-bearing premise
The few-shot examples placed in the prompt allow genuine generalization rather than the model simply recalling near-duplicates from its web-scale training corpus.
What would settle it
A demonstration on a benchmark whose few-shot examples and task are constructed to have no overlap with common web text: if GPT-3's accuracy collapses toward zero, the few-shot gains reflect memorization rather than generalization; if accuracy holds up, the premise of genuine generalization stands.
read the original abstract
Recent work has demonstrated substantial gains on many NLP tasks and benchmarks by pre-training on a large corpus of text followed by fine-tuning on a specific task. While typically task-agnostic in architecture, this method still requires task-specific fine-tuning datasets of thousands or tens of thousands of examples. By contrast, humans can generally perform a new language task from only a few examples or from simple instructions - something which current NLP systems still largely struggle to do. Here we show that scaling up language models greatly improves task-agnostic, few-shot performance, sometimes even reaching competitiveness with prior state-of-the-art fine-tuning approaches. Specifically, we train GPT-3, an autoregressive language model with 175 billion parameters, 10x more than any previous non-sparse language model, and test its performance in the few-shot setting. For all tasks, GPT-3 is applied without any gradient updates or fine-tuning, with tasks and few-shot demonstrations specified purely via text interaction with the model. GPT-3 achieves strong performance on many NLP datasets, including translation, question-answering, and cloze tasks, as well as several tasks that require on-the-fly reasoning or domain adaptation, such as unscrambling words, using a novel word in a sentence, or performing 3-digit arithmetic. At the same time, we also identify some datasets where GPT-3's few-shot learning still struggles, as well as some datasets where GPT-3 faces methodological issues related to training on large web corpora. Finally, we find that GPT-3 can generate samples of news articles which human evaluators have difficulty distinguishing from articles written by humans. We discuss broader societal impacts of this finding and of GPT-3 in general.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript claims that scaling autoregressive language models to 175 billion parameters (GPT-3) substantially improves task-agnostic few-shot performance across NLP benchmarks including translation, question answering, cloze tasks, and arithmetic reasoning. Tasks and demonstrations are specified purely via text prompts with no gradient updates or fine-tuning, and performance is shown to scale with model size, sometimes approaching prior fine-tuned SOTA results.
Significance. If the central scaling results hold after addressing contamination concerns, this work provides high-significance empirical evidence for emergent in-context learning abilities driven by parameter count. The extensive multi-task evaluation (20+ benchmarks), scaling curves, and explicit discussion of societal impacts are strengths that advance understanding of scaling laws beyond prior smaller models.
major comments (1)
- [§4.2] The n-gram overlap decontamination (13-gram checks) is applied only to a subset of the 20+ benchmarks, with explicit contamination flags for LAMBADA, SQuAD, TriviaQA, and arithmetic tasks. The paper does not quantify how removing contaminated examples alters the few-shot scaling curves in Figures 2-4 or the 175B vs. smaller-model gaps; this directly bears on whether the reported gains reflect generalization or scale-dependent memorization of web-sourced test data.
minor comments (3)
- [Abstract] The claim of being '10x more than any previous non-sparse language model' would be clearer with an explicit citation to the prior largest model size.
- [§3] Prompt formats and example selection criteria for k-shot settings are described at a high level but lack exhaustive per-task templates or variance analysis across different example choices.
- [Figure 1, §4] Some scaling plots would benefit from error bars or multiple runs to indicate result stability, especially for tasks with high variance (a sketch of such a variance estimate follows this list).
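A small sketch of the kind of variance estimate these comments ask for is given below, assuming a hypothetical black-box evaluate(model, demonstrations, test_set) function that returns an accuracy; the function and its arguments are placeholders, not part of the paper.

import random
import statistics

# Sketch: spread of few-shot accuracy over random draws of the k in-context
# demonstrations; evaluate() is a hypothetical black-box scorer.
def few_shot_accuracy_spread(model, demo_pool, test_set, evaluate, k=8, trials=10, seed=0):
    rng = random.Random(seed)
    scores = [evaluate(model, rng.sample(demo_pool, k), test_set) for _ in range(trials)]
    return statistics.mean(scores), statistics.stdev(scores)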
Simulated Author's Rebuttal
We thank the referee for their positive recommendation of minor revision and for the constructive feedback on data contamination. We address the single major comment below.
read point-by-point responses
-
Referee: [§4.2] The n-gram overlap decontamination (13-gram checks) is applied only to a subset of the 20+ benchmarks, with explicit contamination flags for LAMBADA, SQuAD, TriviaQA, and arithmetic tasks. The paper does not quantify how removing contaminated examples alters the few-shot scaling curves in Figures 2-4 or the 175B vs. smaller-model gaps; this directly bears on whether the reported gains reflect generalization or scale-dependent memorization of web-sourced test data.
Authors: We agree that explicitly quantifying the effect of decontamination on the scaling curves would strengthen the presentation. In the manuscript we applied 13-gram overlap decontamination and reported explicit flags only for the tasks where overlap with the training corpus was detected (LAMBADA, SQuAD, TriviaQA, and the arithmetic tasks); for the remaining benchmarks no significant contamination was identified. We did not, however, include a side-by-side comparison of performance before and after decontamination in Figures 2–4. In the revised manuscript we will add a supplementary analysis (new table or appendix figure) that reports the few-shot accuracies for the affected tasks both with and without the decontaminated examples, together with a brief discussion of any resulting changes to the observed scaling trends and the 175B versus smaller-model gaps.
revision: yes
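As a rough illustration of the overlap criterion discussed in this exchange, the sketch below flags a benchmark example whenever it shares any 13-gram with the training text; whitespace tokenization and exact matching are simplifying assumptions, not the paper's actual filtering pipeline.

# Simplified 13-gram contamination check (whitespace tokens, exact match).
def ngrams(text, n=13):
    tokens = text.lower().split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def flag_contaminated(benchmark_examples, training_documents, n=13):
    # Indices of benchmark examples sharing any n-gram with the training corpus.
    train_grams = set()
    for doc in training_documents:
        train_grams |= ngrams(doc, n)
    return [i for i, ex in enumerate(benchmark_examples) if ngrams(ex, n) & train_grams]

At web scale the n-grams would be hashed or streamed rather than held in memory; the point here is only the matching rule whose downstream effect on the scaling curves the referee asks to quantify.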
Circularity Check
No significant circularity in empirical scaling and few-shot results
full rationale
The paper's central claims rest on training a new 175B-parameter autoregressive model (GPT-3) and directly measuring its task-agnostic few-shot performance across benchmarks. No mathematical derivations, predictions, or first-principles results are presented that reduce by construction to fitted parameters, self-definitions, or self-citation chains. Prior scaling observations (e.g., from Kaplan et al. 2020) are referenced for context but are not load-bearing; the new 175B results are independent empirical measurements on held-out tasks. Contamination checks in §4.2 are acknowledged as limited but do not create circularity in the reported performance numbers. The chain of evidence is anchored in external benchmarks rather than in the paper's own constructs.
Axiom & Free-Parameter Ledger
free parameters (2)
- model parameter count = 175e9
- number of in-context examples = 0 to ~32
axioms (2)
- standard math: Decoder-only transformer with autoregressive next-token prediction objective
- domain assumption: Internet text corpora contain sufficient distributional information for task generalization via prompting
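For the first axiom, a minimal sketch of the autoregressive next-token prediction objective is shown below, assuming a causal decoder-only model that maps token ids of shape (batch, seq_len) to logits of shape (batch, seq_len, vocab); this illustrates the objective only and is not GPT-3's training code.

import torch
import torch.nn.functional as F

def next_token_loss(model, token_ids):
    # token_ids: LongTensor (batch, seq_len); model returns logits (batch, seq_len, vocab)
    logits = model(token_ids)              # causal masking assumed inside the model
    pred = logits[:, :-1, :]               # prediction for each position's next token
    target = token_ids[:, 1:]              # the tokens that actually follow
    return F.cross_entropy(pred.reshape(-1, pred.size(-1)), target.reshape(-1))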
Forward citations
Cited by 60 Pith papers
-
Large Language Diffusion Models
LLaDA is a scalable diffusion-based language model that matches autoregressive LLMs like LLaMA3 8B on tasks and surpasses GPT-4o on reversal poem completion.
-
Generative Agents: Interactive Simulacra of Human Behavior
Generative agents with memory streams, reflection, and planning using LLMs exhibit believable individual and emergent social behaviors in a simulated town.
-
Editing Models with Task Arithmetic
Task vectors from weight differences allow arithmetic operations to edit pre-trained models, improving multiple tasks simultaneously and enabling analogical inference on unseen tasks.
-
Show Your Work: Scratchpads for Intermediate Computation with Language Models
Training language models to generate intermediate computation steps on a scratchpad enables them to perform multi-step tasks such as long addition and arbitrary program execution that they otherwise fail at.
-
The Pile: An 800GB Dataset of Diverse Text for Language Modeling
The Pile is a newly constructed 825 GiB dataset from 22 diverse sources that enables language models to achieve better performance on academic, professional, and cross-domain tasks than models trained on Common Crawl ...
-
Measuring Massive Multitask Language Understanding
Introduces the MMLU benchmark of 57 tasks and shows that current models, including GPT-3, achieve low accuracy far below expert level across academic and professional domains.
-
All Circuits Lead to Rome: Rethinking Functional Anisotropy in Circuit and Sheaf Discovery for LLMs
LLM tasks are supported by multiple distinct circuits rather than unique mechanisms, demonstrated via Overlap-Aware Sheaf Repulsion and the Distributive Dense Circuit Hypothesis.
-
The Global Empirical NTK: Self-Referential Bias and Dimensionality of Gradient Descent Learning
The global empirical NTK for finite-width networks has a universal Kronecker-core form that makes it structurally low-rank and biases gradient descent toward dominant modes of joint input-hidden activity.
-
Scalable Distributed Stochastic Optimization via Bidirectional Compression: Beyond Pessimistic Limits
Inkheart SGD and M4 use bidirectional compression to achieve time complexities in distributed SGD that improve with worker count n and surpass prior lower bounds under a necessary structural assumption.
-
Topology-Enhanced Alignment for Large Language Models: Trajectory Topology Loss and Topological Preference Optimization
Topology-enhanced alignment via persistent homology on trajectories outperforms standard SFT and DPO baselines on preference metrics for LLMs.
-
Nsanku: Evaluating Zero-Shot Translation Performance of LLMs for Ghanaian Languages
Nsanku benchmark shows current LLMs achieve only modest zero-shot translation scores on 43 Ghanaian languages, with no model reaching both high average performance and high cross-language consistency.
-
Reconstructing conformal field theoretical compositions with Transformers
Transformers reconstruct the constituent RCFTs in tensor-product theories from low-energy spectra, reaching 98% accuracy on WZW models and generalizing to larger central charges with few out-of-domain examples.
-
E-MIA: Exam-Style Black-Box Membership Inference Attacks against RAG Systems
E-MIA converts document details into four types of exam questions and aggregates the RAG's answers into a membership score that separates member and non-member documents better than prior similarity-based or probe-bas...
-
From Static Analysis to Audience Dissemination: A Training-Free Multimodal Controversy Detection Multi-Agent Framework
AuDisAgent reformulates multimodal controversy detection as a dynamic audience dissemination process using screening, panel discussion, and arbitration agents, plus comment bootstrapping, and reports outperforming pri...
-
Agentic Witnessing: Pragmatic and Scalable TEE-Enabled Privacy-Preserving Auditing
Agentic Witnessing enables privacy-preserving auditing of semantic properties in private data by running an LLM auditor in a TEE that answers binary queries and produces cryptographic transcripts of its reasoning.
-
A Systematic Survey of Security Threats and Defenses in LLM-Based AI Agents: A Layered Attack Surface Framework
A new 7x4 taxonomy organizes agentic AI security threats by architectural layer and persistence timescale, revealing under-explored upper layers and missing defenses after surveying 116 papers.
-
Evaluating Temporal Consistency in Multi-Turn Language Models
Language models frequently violate temporal scope stability in multi-turn dialogues by drifting toward present-day assumptions even when they possess the correct facts.
-
Transition-Matrix Regularization for Next Dialogue Act Prediction in Counselling Conversations
KL regularization aligning model predictions with empirical transition patterns improves macro-F1 by 9-42% in next dialogue act prediction on German counselling data and transfers to other datasets.
-
On the Emergence of Syntax by Means of Local Interaction
A 2D neural cellular automaton spontaneously self-organizes into a Proto-CKY representation that exhibits syntactic processing capabilities for context-free grammars when trained on membership problems.
-
Privatar: Scalable Privacy-preserving Multi-user VR via Secure Offloading
Privatar uses horizontal frequency partitioning and distribution-aware minimal perturbation to enable private offloading of VR avatar reconstruction, supporting 2.37x more users with modest overhead.
-
ProtoCycle: Reflective Tool-Augmented Planning for Text-Guided Protein Design
ProtoCycle improves text-guided protein design by coupling an LLM planner with tool feedback and reflection to achieve better language alignment and foldability than direct generation.
-
RAGognizer: Hallucination-Aware Fine-Tuning via Detection Head Integration
RAGognizer adds a detection head to LLMs for joint training on generation and token-level hallucination detection, yielding SOTA detection and fewer hallucinations in RAG while preserving output quality.
-
Human-Centric Topic Modeling with Goal-Prompted Contrastive Learning and Optimal Transport
GCTM-OT extracts goal candidates with an LLM, then uses goal-prompted contrastive learning and optimal transport to discover topics that are more coherent, diverse, and aligned with human intent than prior methods on ...
-
LiveGesture Streamable Co-Speech Gesture Generation Model
LiveGesture introduces the first fully streamable zero-lookahead co-speech full-body gesture generation model using a causal vector-quantized tokenizer and hierarchical autoregressive transformers that matches offline...
-
Dual Triangle Attention: Effective Bidirectional Attention Without Positional Embeddings
Dual Triangle Attention achieves effective bidirectional attention with built-in positional inductive bias via dual triangular masks, outperforming standard bidirectional attention on position-sensitive tasks and show...
-
Sell More, Play Less: Benchmarking LLM Realistic Selling Skill
SalesLLM provides an automatic evaluation framework for LLM sales dialogues that correlates 0.98 with human experts and shows top models approaching human performance while weaker ones lag.
-
Analyzing Reverse Address Translation Overheads in Multi-GPU Scale-Up Pods
Simulation study shows cold TLB misses in reverse address translation dominate latency for small collectives in multi-GPU pods, causing up to 1.4x degradation, while larger ones see diminishing returns.
-
Do NOT Think That Much for 2+3=? On the Overthinking of o1-Like LLMs
o1-like models overthink easy tasks; self-training reduces compute use without accuracy loss on GSM8K, MATH500, GPQA, and AIME.
-
Measuring Faithfulness in Chain-of-Thought Reasoning
Chain-of-Thought reasoning in LLMs is often unfaithful, with models relying on it variably by task and less so as models scale larger.
-
InstructBLIP: Towards General-purpose Vision-Language Models with Instruction Tuning
Instruction tuning of BLIP-2 with an instruction-aware Query Transformer delivers state-of-the-art zero-shot performance on held-out vision-language datasets and strong finetuned results on downstream tasks.
-
Argoverse 2: Next Generation Datasets for Self-Driving Perception and Forecasting
Argoverse 2 introduces three new datasets with annotated sensor data, massive lidar collections, and challenging motion forecasting scenarios for autonomous driving research.
-
LAION-5B: An open large-scale dataset for training next generation image-text models
LAION-5B is an openly released dataset of 5.85 billion CLIP-filtered image-text pairs that enables replication of foundational vision-language models.
-
In-context Learning and Induction Heads
Induction heads, which implement pattern completion in attention, develop at the same training stage as a sudden rise in in-context learning, providing evidence they are the primary mechanism for in-context learning i...
-
LLM.int8(): 8-bit Matrix Multiplication for Transformers at Scale
LLM.int8() performs 8-bit inference for transformers up to 175B parameters with no accuracy loss by combining vector-wise quantization for most features with 16-bit mixed-precision handling of systematic outlier dimensions.
-
Hierarchical Text-Conditional Image Generation with CLIP Latents
A hierarchical prior-decoder model using CLIP latents generates more diverse text-conditional images than direct methods while preserving photorealism and caption fidelity.
-
LAION-400M: Open Dataset of CLIP-Filtered 400 Million Image-Text Pairs
LAION-400M is a publicly released open dataset of 400 million CLIP-filtered image-text pairs with embeddings and kNN indices for efficient search.
-
LoRA: Low-Rank Adaptation of Large Language Models
Adapting large language models by training only a low-rank decomposition BA added to frozen weight matrices matches full fine-tuning while cutting trainable parameters by orders of magnitude and adding no inference latency.
-
Diffusion Models Beat GANs on Image Synthesis
Diffusion models with architecture improvements and classifier guidance achieve superior FID scores to GANs on unconditional and conditional ImageNet image synthesis.
-
Switch Transformers: Scaling to Trillion Parameter Models with Simple and Efficient Sparsity
Switch Transformers use top-1 expert routing in a Mixture of Experts setup to scale to trillion-parameter language models with constant compute and up to 4x speedup over T5-XXL.
-
Prefix-Tuning: Optimizing Continuous Prompts for Generation
Prefix-tuning matches or exceeds fine-tuning on NLG tasks by optimizing a continuous prefix using 0.1% of parameters while keeping the LM frozen.
-
Scaling Laws for Autoregressive Generative Modeling
Autoregressive transformers follow power-law scaling laws for cross-entropy loss with nearly universal exponents relating optimal model size to compute budget across four domains.
-
DeBERTa: Decoding-enhanced BERT with Disentangled Attention
DeBERTa improves BERT-style models by separating content and relative position in attention and adding absolute positions to the decoder, yielding consistent gains on NLU and NLG tasks and the first single-model super...
-
Correcting Influence: Unboxing LLM Outputs with Orthogonal Latent Spaces
A latent mediation framework with sparse autoencoders enables non-additive token-level influence attribution in LLMs by learning orthogonal features and back-propagating attributions.
-
State-Space NTK Collapse Near Bifurcations
Bifurcations cause sNTK to reduce to a dominant rank-one channel matching normal forms, collapsing effective rank and funneling gradient descent into critical dynamical directions.
-
Learning, Fast and Slow: Towards LLMs That Adapt Continually
Fast-Slow Training combines slow parameter updates with fast context optimization to achieve up to 3x better sample efficiency, higher performance, less forgetting, and preserved plasticity in continual LLM learning.
-
MLCommons Chakra: Advancing Performance Benchmarking and Co-design using Standardized Execution Traces
Chakra introduces a portable, interoperable graph-based execution trace format for distributed ML workloads along with supporting tools to standardize performance benchmarking and software-hardware co-design.
-
Spectral Transformer Neural Processes
STNPs extend TNPs with a spectral aggregator that estimates context spectra, forms spectral mixtures, and injects task-adaptive frequency features to better handle periodicity.
-
LLM-Agnostic Semantic Representation Attack
SRA achieves 99.71% average attack success across 26 LLMs by optimizing for coherent malicious semantics via the SRHS algorithm, with claimed theoretical guarantees on convergence and transfer.
-
XPERT: Expert Knowledge Transfer for Effective Training of Language Models
XPERT extracts and reuses cross-domain expert knowledge from pre-trained MoE LLMs via inference analysis and tensor decomposition to improve performance and convergence in downstream language model training.
-
Ensemble Distributionally Robust Bayesian Optimisation
A tractable ensemble distributionally robust Bayesian optimization method achieves improved sublinear regret bounds under context uncertainty.
-
Coupling Models for One-Step Discrete Generation
Coupling Models enable single-step discrete sequence generation via learned couplings to Gaussian latents and outperform prior one-step baselines on text perplexity, biological FBD, and image FID metrics.
-
Query-efficient model evaluation using cached responses
DKPS-based methods leverage cached model responses to achieve equivalent benchmark prediction accuracy with substantially fewer queries than standard evaluation.
-
Federation of Experts: Communication Efficient Distributed Inference for Large Language Models
FoE restructures MoE blocks into per-KV-head clusters with sum-based synchronization, removing all-to-all communication in single-node settings and limiting it to intra-node in multi-node settings for up to 5.2x faste...
-
ZAYA1-8B Technical Report
ZAYA1-8B is a reasoning MoE model with 700M active parameters that matches larger models on math and coding benchmarks and reaches 91.9% on AIME'25 via Markovian RSA test-time compute.
-
GEM: Graph-Enhanced Mixture-of-Experts with ReAct Agents for Dialogue State Tracking
GEM achieves 65.19% joint goal accuracy on MultiWOZ 2.2 by routing between a graph neural network expert for dialogue structure and a T5 expert for sequences, plus ReAct agents for value generation, outperforming prio...
-
A Meta Reinforcement Learning Approach to Goals-Based Wealth Management
MetaRL pre-trained on GBWM problems delivers near-optimal dynamic strategies in 0.01s achieving 97.8% of DP optimal utility and handles larger problems where DP fails.
-
Diversity in Large Language Models under Supervised Fine-Tuning
TOFU loss mitigates the narrowing of generative diversity in LLMs after supervised fine-tuning by addressing neglect of low-frequency patterns and forgetting of prior knowledge.
-
RCW-CIM: A Digital CIM-based LLM Accelerator with Read-Compute/Write
RCW-CIM reduces Llama2-7B decoding latency by 21.59% and prefill latency by 49.76% via minimized weight updates and DRAM accesses, delivering 3.28 TOPS and 42.3 TOPS/W on a fabricated 22 nm chip.
-
Information bottleneck for learning the phase space of dynamics from high-dimensional experimental data
DySIB recovers a two-dimensional representation matching the phase space of a physical pendulum from high-dimensional video data by maximizing predictive mutual information in latent space.
-
JAX-BEM: Gradient-Based Acoustic Shape Optimisation via a Differentiable Boundary Element Method
A JAX-based differentiable BEM solver matches traditional BEM accuracy on benchmarks and supports gradient-driven acoustic geometry optimization.