pith. sign in

hub Mixed citations

SuperGLUE: A Stickier Benchmark for General-Purpose Language Understanding Systems

Mixed citation behavior. Most common role is background (60%).

33 Pith papers citing it
Background 60% of classified citations
abstract

In the last year, new models and methods for pretraining and transfer learning have driven striking performance improvements across a range of language understanding tasks. The GLUE benchmark, introduced a little over one year ago, offers a single-number metric that summarizes progress on a diverse set of such tasks, but performance on the benchmark has recently surpassed the level of non-expert humans, suggesting limited headroom for further research. In this paper we present SuperGLUE, a new benchmark styled after GLUE with a new set of more difficult language understanding tasks, a software toolkit, and a public leaderboard. SuperGLUE is available at super.gluebenchmark.com.

hub tools

citation-role summary

background 4 method 1

citation-polarity summary

clear filters

representative citing papers

Measuring Massive Multitask Language Understanding

cs.CY · 2020-09-07 · accept · novelty 8.0

Introduces the MMLU benchmark of 57 tasks and shows that current models, including GPT-3, achieve low accuracy far below expert level across academic and professional domains.

Scaling Laws for Neural Language Models

cs.LG · 2020-01-23 · unverdicted · novelty 8.0

Empirical power-law scaling governs language model loss versus model size, data size, and compute, enabling optimal allocation of training compute.

Evaluating Protein Transfer Learning with TAPE

cs.LG · 2019-06-19 · accept · novelty 7.0

TAPE benchmark of five protein tasks shows self-supervised pretraining improves performance but often lags non-neural baselines, with code and data released publicly.

Ultra-Low-Dimensional Prompt Tuning via Random Projection

cs.CL · 2025-02-06 · unverdicted · novelty 6.0

ULPT optimizes prompts in ultra-low dimensions with frozen random up-projection to cut training parameters by 98% while matching vanilla prompt tuning performance on NLP tasks.

Scaling Data-Constrained Language Models

cs.CL · 2023-05-25 · conditional · novelty 6.0

Repeating training data up to 4 epochs yields negligible loss increase versus unique data for fixed compute, and a new scaling law accounts for the decaying value of repeated tokens and excess parameters.

Large Language Models Can Self-Improve

cs.CL · 2022-10-20 · unverdicted · novelty 6.0

A 540B-parameter LLM improves reasoning performance on GSM8K, DROP, OpenBookQA, and ANLI-A3 by fine-tuning on self-generated high-confidence CoT solutions from unlabeled data.

Language Models (Mostly) Know What They Know

cs.CL · 2022-07-11 · unverdicted · novelty 6.0

Language models show good calibration when asked to estimate the probability that their own answers are correct, with performance improving as models get larger.

Ethical and social risks of harm from Language Models

cs.CL · 2021-12-08 · accept · novelty 6.0

The authors provide a detailed taxonomy of 21 risks associated with language models, covering discrimination, information leaks, misinformation, malicious applications, interaction harms, and societal impacts like job loss and environmental costs.

Training Verifiers to Solve Math Word Problems

cs.LG · 2021-10-27 · conditional · novelty 6.0

Introduces GSM8K dataset and demonstrates that verifier-based selection of solutions from multiple candidates outperforms fine-tuning baselines on math word problems.

Scaling Laws for Transfer

cs.LG · 2021-02-02 · unverdicted · novelty 6.0

Effective data transferred from pre-training to fine-tuning is described by a power law in model parameter count and fine-tuning dataset size, acting like a multiplier on the fine-tuning data.

citing papers explorer

Showing 8 of 8 citing papers after filters.

  • Language Is Not All You Need: Aligning Perception with Language Models cs.CL · 2023-02-27 · conditional · none · ref 29 · internal anchor

    Kosmos-1 shows strong zero-shot and few-shot results on language tasks, image captioning, visual QA, OCR-free document understanding, and image recognition guided by text instructions.

  • DeepSpeed Ulysses: System Optimizations for Enabling Training of Extreme Long Sequence Transformer Models cs.LG · 2023-09-25 · accept · none · ref 147 · internal anchor

    DeepSpeed-Ulysses keeps communication volume constant for sequence-parallel attention when sequence length and device count scale together, delivering 2.5x faster training on 4x longer sequences than prior SOTA.

  • Simple synthetic data reduces sycophancy in large language models cs.CL · 2023-08-07 · unverdicted · none · ref 45 · internal anchor

    Scaling and instruction tuning increase sycophancy in LLMs on opinion and fact tasks, but a synthetic data fine-tuning intervention reduces it on held-out prompts.

  • Retentive Network: A Successor to Transformer for Large Language Models cs.CL · 2023-07-17 · unverdicted · none · ref 25 · internal anchor

    RetNet is a new sequence modeling architecture that delivers parallel training, constant-time inference, and competitive language modeling performance as a potential replacement for Transformers.

  • Kosmos-2: Grounding Multimodal Large Language Models to the World cs.CL · 2023-06-26 · unverdicted · none · ref 19 · internal anchor

    Kosmos-2 grounds text to image regions by encoding refer expressions as Markdown links to sequences of location tokens and trains on a new GrIT dataset of grounded image-text pairs.

  • Scaling Data-Constrained Language Models cs.CL · 2023-05-25 · conditional · none · ref 122 · internal anchor

    Repeating training data up to 4 epochs yields negligible loss increase versus unique data for fixed compute, and a new scaling law accounts for the decaying value of repeated tokens and excess parameters.

  • Detecting Language Model Attacks with Perplexity cs.CL · 2023-08-27 · unverdicted · none · ref 85 · internal anchor

    Jailbreak prompts with adversarial suffixes have high GPT-2 perplexity, and a LightGBM model on perplexity and length detects most attacks.

  • PaLM 2 Technical Report cs.CL · 2023-05-17 · unverdicted · none · ref 150 · internal anchor

    PaLM 2 reports state-of-the-art results on language, reasoning, and multilingual tasks with improved efficiency over PaLM.