hub Mixed citations

SuperGLUE: A Stickier Benchmark for General-Purpose Language Understanding Systems

Alex Wang, Yada Pruksachatkun, Nikita Nangia, Amanpreet Singh, Julian Michael, Felix Hill · 2019 · cs.CL · arXiv 1905.00537

Mixed citation behavior. Most common role is background (60%).

33 Pith papers citing it

Background 60% of classified citations

open full Pith review browse 33 citing papers arXiv PDF

abstract

In the last year, new models and methods for pretraining and transfer learning have driven striking performance improvements across a range of language understanding tasks. The GLUE benchmark, introduced a little over one year ago, offers a single-number metric that summarizes progress on a diverse set of such tasks, but performance on the benchmark has recently surpassed the level of non-expert humans, suggesting limited headroom for further research. In this paper we present SuperGLUE, a new benchmark styled after GLUE with a new set of more difficult language understanding tasks, a software toolkit, and a public leaderboard. SuperGLUE is available at super.gluebenchmark.com.

hub tools

JSON dossier citing papers JSON arXiv source

citation-role summary

background 4 method 1

citation-polarity summary

background 3 unclear 1 use method 1

representative citing papers

Measuring Massive Multitask Language Understanding

cs.CY · 2020-09-07 · accept · novelty 8.0

Introduces the MMLU benchmark of 57 tasks and shows that current models, including GPT-3, achieve low accuracy far below expert level across academic and professional domains.

Scaling Laws for Neural Language Models

cs.LG · 2020-01-23 · unverdicted · novelty 8.0

Empirical power-law scaling governs language model loss versus model size, data size, and compute, enabling optimal allocation of training compute.

Revisiting Zeroth-Order Hessian Approximation: A Single-Step Policy Optimization Lens

cs.LG · 2026-05-29 · unverdicted · novelty 7.0

A policy optimization perspective unifies zeroth-order Hessian estimators as baseline selections and yields ZoVH, a suite of low-variance estimators for the Hessian, its regularized inverse, and bias-corrected products, with theoretical guarantees.

Queryable LoRA: Instruction-Regularized Routing Over Shared Low-Rank Update Atoms

cs.LG · 2026-05-08 · unverdicted · novelty 7.0

Queryable LoRA adds dynamic routing over shared low-rank atoms with attention and language-instruction regularization to make parameter-efficient fine-tuning more adaptive across inputs and layers.

Language Is Not All You Need: Aligning Perception with Language Models

cs.CL · 2023-02-27 · conditional · novelty 7.0

Kosmos-1 shows strong zero-shot and few-shot results on language tasks, image captioning, visual QA, OCR-free document understanding, and image recognition guided by text instructions.

Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks

cs.CL · 2020-05-22 · accept · novelty 7.0

RAG models set new state-of-the-art results on open-domain QA by retrieving Wikipedia passages and conditioning a generative model on them, while also producing more factual text than parametric baselines.

Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer

cs.LG · 2019-10-23 · unverdicted · novelty 7.0

T5 casts all NLP tasks as text-to-text generation, systematically explores pre-training choices, and reaches strong performance on summarization, QA, classification and other tasks via large-scale training on the Colossal Clean Crawled Corpus.

Evaluating Protein Transfer Learning with TAPE

cs.LG · 2019-06-19 · accept · novelty 7.0

TAPE benchmark of five protein tasks shows self-supervised pretraining improves performance but often lags non-neural baselines, with code and data released publicly.

PEML: Parameter-efficient Multi-Task Learning with Optimized Continuous Prompts

cs.CL · 2026-05-13 · unverdicted · novelty 6.0

PEML co-optimizes continuous prompts and low-rank adaptations to deliver up to 6.67% average accuracy gains over existing multi-task PEFT methods on GLUE, SuperGLUE, and other benchmarks.

SparseForge: Efficient Semi-Structured LLM Sparsification via Annealing of Hessian-Guided Soft-Mask

cs.LG · 2026-05-07 · unverdicted · novelty 6.0

SparseForge achieves 57.27% zero-shot accuracy on LLaMA-2-7B at 2:4 sparsity using only 5B retraining tokens, beating the dense baseline and nearly matching a 40B-token SOTA method.

Ultra-Low-Dimensional Prompt Tuning via Random Projection

cs.CL · 2025-02-06 · unverdicted · novelty 6.0

ULPT optimizes prompts in ultra-low dimensions with frozen random up-projection to cut training parameters by 98% while matching vanilla prompt tuning performance on NLP tasks.

Defending Against Indirect Prompt Injection Attacks With Spotlighting

cs.CR · 2024-03-20 · unverdicted · novelty 6.0

Spotlighting prompt transformations cut indirect prompt injection success rates from >50% to <2% on GPT models while preserving task performance.

DeepSpeed Ulysses: System Optimizations for Enabling Training of Extreme Long Sequence Transformer Models

cs.LG · 2023-09-25 · accept · novelty 6.0

DeepSpeed-Ulysses keeps communication volume constant for sequence-parallel attention when sequence length and device count scale together, delivering 2.5x faster training on 4x longer sequences than prior SOTA.

Simple synthetic data reduces sycophancy in large language models

cs.CL · 2023-08-07 · unverdicted · novelty 6.0

Scaling and instruction tuning increase sycophancy in LLMs on opinion and fact tasks, but a synthetic data fine-tuning intervention reduces it on held-out prompts.

Retentive Network: A Successor to Transformer for Large Language Models

cs.CL · 2023-07-17 · unverdicted · novelty 6.0

RetNet is a new sequence modeling architecture that delivers parallel training, constant-time inference, and competitive language modeling performance as a potential replacement for Transformers.

Kosmos-2: Grounding Multimodal Large Language Models to the World

cs.CL · 2023-06-26 · unverdicted · novelty 6.0

Kosmos-2 grounds text to image regions by encoding refer expressions as Markdown links to sequences of location tokens and trains on a new GrIT dataset of grounded image-text pairs.

Scaling Data-Constrained Language Models

cs.CL · 2023-05-25 · conditional · novelty 6.0

Repeating training data up to 4 epochs yields negligible loss increase versus unique data for fixed compute, and a new scaling law accounts for the decaying value of repeated tokens and excess parameters.

Large Language Models Can Self-Improve

cs.CL · 2022-10-20 · unverdicted · novelty 6.0

A 540B-parameter LLM improves reasoning performance on GSM8K, DROP, OpenBookQA, and ANLI-A3 by fine-tuning on self-generated high-confidence CoT solutions from unlabeled data.

Language Models (Mostly) Know What They Know

cs.CL · 2022-07-11 · unverdicted · novelty 6.0

Language models show good calibration when asked to estimate the probability that their own answers are correct, with performance improving as models get larger.

Ethical and social risks of harm from Language Models

cs.CL · 2021-12-08 · accept · novelty 6.0

The authors provide a detailed taxonomy of 21 risks associated with language models, covering discrimination, information leaks, misinformation, malicious applications, interaction harms, and societal impacts like job loss and environmental costs.

A General Language Assistant as a Laboratory for Alignment

cs.CL · 2021-12-01 · conditional · novelty 6.0

Ranked preference modeling outperforms imitation learning for language model alignment and scales more favorably with model size.

Training Verifiers to Solve Math Word Problems

cs.LG · 2021-10-27 · conditional · novelty 6.0

Introduces GSM8K dataset and demonstrates that verifier-based selection of solutions from multiple candidates outperforms fine-tuning baselines on math word problems.

Scaling Laws for Transfer

cs.LG · 2021-02-02 · unverdicted · novelty 6.0

Effective data transferred from pre-training to fine-tuning is described by a power law in model parameter count and fine-tuning dataset size, acting like a multiplier on the fine-tuning data.

HuggingFace's Transformers: State-of-the-art Natural Language Processing

cs.CL · 2019-10-09 · accept · novelty 6.0

Hugging Face releases an open-source Python library that supplies a unified API and pretrained weights for major Transformer architectures used in natural language processing.

citing papers explorer

Showing 33 of 33 citing papers.

Measuring Massive Multitask Language Understanding cs.CY · 2020-09-07 · accept · none · ref 272 · internal anchor
Introduces the MMLU benchmark of 57 tasks and shows that current models, including GPT-3, achieve low accuracy far below expert level across academic and professional domains.
Scaling Laws for Neural Language Models cs.LG · 2020-01-23 · unverdicted · none · ref 14 · internal anchor
Empirical power-law scaling governs language model loss versus model size, data size, and compute, enabling optimal allocation of training compute.
Revisiting Zeroth-Order Hessian Approximation: A Single-Step Policy Optimization Lens cs.LG · 2026-05-29 · unverdicted · none · ref 3 · internal anchor
A policy optimization perspective unifies zeroth-order Hessian estimators as baseline selections and yields ZoVH, a suite of low-variance estimators for the Hessian, its regularized inverse, and bias-corrected products, with theoretical guarantees.
Queryable LoRA: Instruction-Regularized Routing Over Shared Low-Rank Update Atoms cs.LG · 2026-05-08 · unverdicted · none · ref 17 · internal anchor
Queryable LoRA adds dynamic routing over shared low-rank atoms with attention and language-instruction regularization to make parameter-efficient fine-tuning more adaptive across inputs and layers.
Language Is Not All You Need: Aligning Perception with Language Models cs.CL · 2023-02-27 · conditional · none · ref 29 · internal anchor
Kosmos-1 shows strong zero-shot and few-shot results on language tasks, image captioning, visual QA, OCR-free document understanding, and image recognition guided by text instructions.
Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks cs.CL · 2020-05-22 · accept · none · ref 65 · internal anchor
RAG models set new state-of-the-art results on open-domain QA by retrieving Wikipedia passages and conditioning a generative model on them, while also producing more factual text than parametric baselines.
Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer cs.LG · 2019-10-23 · unverdicted · none · ref 76 · internal anchor
T5 casts all NLP tasks as text-to-text generation, systematically explores pre-training choices, and reaches strong performance on summarization, QA, classification and other tasks via large-scale training on the Colossal Clean Crawled Corpus.
Evaluating Protein Transfer Learning with TAPE cs.LG · 2019-06-19 · accept · none · ref 34 · internal anchor
TAPE benchmark of five protein tasks shows self-supervised pretraining improves performance but often lags non-neural baselines, with code and data released publicly.
PEML: Parameter-efficient Multi-Task Learning with Optimized Continuous Prompts cs.CL · 2026-05-13 · unverdicted · none · ref 16 · internal anchor
PEML co-optimizes continuous prompts and low-rank adaptations to deliver up to 6.67% average accuracy gains over existing multi-task PEFT methods on GLUE, SuperGLUE, and other benchmarks.
SparseForge: Efficient Semi-Structured LLM Sparsification via Annealing of Hessian-Guided Soft-Mask cs.LG · 2026-05-07 · unverdicted · none · ref 34 · internal anchor
SparseForge achieves 57.27% zero-shot accuracy on LLaMA-2-7B at 2:4 sparsity using only 5B retraining tokens, beating the dense baseline and nearly matching a 40B-token SOTA method.
Ultra-Low-Dimensional Prompt Tuning via Random Projection cs.CL · 2025-02-06 · unverdicted · none · ref 55 · internal anchor
ULPT optimizes prompts in ultra-low dimensions with frozen random up-projection to cut training parameters by 98% while matching vanilla prompt tuning performance on NLP tasks.
Defending Against Indirect Prompt Injection Attacks With Spotlighting cs.CR · 2024-03-20 · unverdicted · none · ref 3 · internal anchor
Spotlighting prompt transformations cut indirect prompt injection success rates from >50% to <2% on GPT models while preserving task performance.
DeepSpeed Ulysses: System Optimizations for Enabling Training of Extreme Long Sequence Transformer Models cs.LG · 2023-09-25 · accept · none · ref 147 · internal anchor
DeepSpeed-Ulysses keeps communication volume constant for sequence-parallel attention when sequence length and device count scale together, delivering 2.5x faster training on 4x longer sequences than prior SOTA.
Simple synthetic data reduces sycophancy in large language models cs.CL · 2023-08-07 · unverdicted · none · ref 45 · internal anchor
Scaling and instruction tuning increase sycophancy in LLMs on opinion and fact tasks, but a synthetic data fine-tuning intervention reduces it on held-out prompts.
Retentive Network: A Successor to Transformer for Large Language Models cs.CL · 2023-07-17 · unverdicted · none · ref 25 · internal anchor
RetNet is a new sequence modeling architecture that delivers parallel training, constant-time inference, and competitive language modeling performance as a potential replacement for Transformers.
Kosmos-2: Grounding Multimodal Large Language Models to the World cs.CL · 2023-06-26 · unverdicted · none · ref 19 · internal anchor
Kosmos-2 grounds text to image regions by encoding refer expressions as Markdown links to sequences of location tokens and trains on a new GrIT dataset of grounded image-text pairs.
Scaling Data-Constrained Language Models cs.CL · 2023-05-25 · conditional · none · ref 122 · internal anchor
Repeating training data up to 4 epochs yields negligible loss increase versus unique data for fixed compute, and a new scaling law accounts for the decaying value of repeated tokens and excess parameters.
Large Language Models Can Self-Improve cs.CL · 2022-10-20 · unverdicted · none · ref 13 · internal anchor
A 540B-parameter LLM improves reasoning performance on GSM8K, DROP, OpenBookQA, and ANLI-A3 by fine-tuning on self-generated high-confidence CoT solutions from unlabeled data.
Language Models (Mostly) Know What They Know cs.CL · 2022-07-11 · unverdicted · none · ref 144 · internal anchor
Language models show good calibration when asked to estimate the probability that their own answers are correct, with performance improving as models get larger.
Ethical and social risks of harm from Language Models cs.CL · 2021-12-08 · accept · none · ref 285 · internal anchor
The authors provide a detailed taxonomy of 21 risks associated with language models, covering discrimination, information leaks, misinformation, malicious applications, interaction harms, and societal impacts like job loss and environmental costs.
A General Language Assistant as a Laboratory for Alignment cs.CL · 2021-12-01 · conditional · none · ref 86 · internal anchor
Ranked preference modeling outperforms imitation learning for language model alignment and scales more favorably with model size.
Training Verifiers to Solve Math Word Problems cs.LG · 2021-10-27 · conditional · none · ref 15 · internal anchor
Introduces GSM8K dataset and demonstrates that verifier-based selection of solutions from multiple candidates outperforms fine-tuning baselines on math word problems.
Scaling Laws for Transfer cs.LG · 2021-02-02 · unverdicted · none · ref 198 · internal anchor
Effective data transferred from pre-training to fine-tuning is described by a power law in model parameter count and fine-tuning dataset size, acting like a multiplier on the fine-tuning data.
HuggingFace's Transformers: State-of-the-art Natural Language Processing cs.CL · 2019-10-09 · accept · none · ref 186 · internal anchor
Hugging Face releases an open-source Python library that supplies a unified API and pretrained weights for major Transformer architectures used in natural language processing.
Complexity Horizons of Compressed Models in Analog Circuit Analysis cs.AI · 2026-05-04 · unverdicted · none · ref 8 · internal anchor
Prerequisite graphs map compressed LLM performance boundaries in analog circuit analysis to allow selecting the smallest viable model for a given task complexity.
Uncertainty-Aware Transformers: Conformal Prediction for Language Models cs.LG · 2026-04-10 · unverdicted · none · ref 26 · internal anchor
CONFIDE applies conformal prediction to transformer embeddings for valid prediction sets, improving accuracy up to 4.09% and efficiency over baselines on models like BERT-tiny.
SSA: Improving Performance With a Better Scoring Function cs.CL · 2025-08-20 · unverdicted · none · ref 15 · internal anchor
Replacing Softmax with Scaled Signed Averaging in transformer attention improves generalization under distribution shifts for in-context learning and boosts results on NLP benchmarks.
Humanity's Last Exam cs.LG · 2025-01-24 · unverdicted · none · ref 57 · internal anchor
Humanity's Last Exam is a new 2,500-question benchmark at the frontier of human knowledge where state-of-the-art LLMs show low accuracy.
Detecting Language Model Attacks with Perplexity cs.CL · 2023-08-27 · unverdicted · none · ref 85 · internal anchor
Jailbreak prompts with adversarial suffixes have high GPT-2 perplexity, and a LightGBM model on perplexity and length detects most attacks.
PaLM 2 Technical Report cs.CL · 2023-05-17 · unverdicted · none · ref 150 · internal anchor
PaLM 2 reports state-of-the-art results on language, reasoning, and multilingual tasks with improved efficiency over PaLM.
RoBERTa: A Robustly Optimized BERT Pretraining Approach cs.CL · 2019-07-26 · accept · none · ref 44 · internal anchor
With better hyperparameters, more data, and longer training, an unchanged BERT-Large architecture matches or exceeds XLNet and other successors on GLUE, SQuAD, and RACE.
GLU Variants Improve Transformer cs.LG · 2020-02-12 · unverdicted · none · ref 11 · internal anchor
Some GLU variants using non-sigmoid nonlinearities improve Transformer quality over ReLU and GELU in feed-forward sublayers.
LIAAD at SemDeep-5 Challenge: Word-in-Context (WiC) cs.CL · 2019-06-24 · unverdicted · none · ref 14 · internal anchor
An adapted WSD system with contextual and sense embeddings places second in the WiC challenge while avoiding task-specific training data.

SuperGLUE: A Stickier Benchmark for General-Purpose Language Understanding Systems

hub tools

citation-role summary

citation-polarity summary

fields

years

verdicts

roles

polarities

representative citing papers

citing papers explorer