pith. sign in

hub Mixed citations

SuperGLUE: A Stickier Benchmark for General-Purpose Language Understanding Systems

Mixed citation behavior. Most common role is background (60%).

39 Pith papers citing it
Background 60% of classified citations
abstract

In the last year, new models and methods for pretraining and transfer learning have driven striking performance improvements across a range of language understanding tasks. The GLUE benchmark, introduced a little over one year ago, offers a single-number metric that summarizes progress on a diverse set of such tasks, but performance on the benchmark has recently surpassed the level of non-expert humans, suggesting limited headroom for further research. In this paper we present SuperGLUE, a new benchmark styled after GLUE with a new set of more difficult language understanding tasks, a software toolkit, and a public leaderboard. SuperGLUE is available at super.gluebenchmark.com.

hub tools

citation-role summary

background 4 method 1

citation-polarity summary

clear filters

representative citing papers

Measuring Massive Multitask Language Understanding

cs.CY · 2020-09-07 · accept · novelty 8.0

Introduces the MMLU benchmark of 57 tasks and shows that current models, including GPT-3, achieve low accuracy far below expert level across academic and professional domains.

Scaling Laws for Neural Language Models

cs.LG · 2020-01-23 · unverdicted · novelty 8.0

Empirical power-law scaling governs language model loss versus model size, data size, and compute, enabling optimal allocation of training compute.

Meta-Benchmarks for Financial-Services LLM Evaluation

cs.AI · 2026-07-02 · unverdicted · novelty 7.0

A meta-benchmarking framework organizes 452 LLM benchmarks into 41 O*NET Generalized Work Activities and 38 BIAN domains, using discrimination-coverage-recency weights to scale K-factors in an Elo tournament for comparable financial-services scores.

Evaluating Protein Transfer Learning with TAPE

cs.LG · 2019-06-19 · accept · novelty 7.0

TAPE benchmark of five protein tasks shows self-supervised pretraining improves performance but often lags non-neural baselines, with code and data released publicly.

Ultra-Low-Dimensional Prompt Tuning via Random Projection

cs.CL · 2025-02-06 · unverdicted · novelty 6.0

ULPT optimizes prompts in ultra-low dimensions with frozen random up-projection to cut training parameters by 98% while matching vanilla prompt tuning performance on NLP tasks.

Scaling Data-Constrained Language Models

cs.CL · 2023-05-25 · conditional · novelty 6.0

Repeating training data up to 4 epochs yields negligible loss increase versus unique data for fixed compute, and a new scaling law accounts for the decaying value of repeated tokens and excess parameters.

Large Language Models Can Self-Improve

cs.CL · 2022-10-20 · unverdicted · novelty 6.0

A 540B-parameter LLM improves reasoning performance on GSM8K, DROP, OpenBookQA, and ANLI-A3 by fine-tuning on self-generated high-confidence CoT solutions from unlabeled data.

Language Models (Mostly) Know What They Know

cs.CL · 2022-07-11 · unverdicted · novelty 6.0

Language models show good calibration when asked to estimate the probability that their own answers are correct, with performance improving as models get larger.

citing papers explorer

Showing 7 of 7 citing papers after filters.

  • Language Is Not All You Need: Aligning Perception with Language Models cs.CL · 2023-02-27 · conditional · none · ref 29 · internal anchor

    Kosmos-1 shows strong zero-shot and few-shot results on language tasks, image captioning, visual QA, OCR-free document understanding, and image recognition guided by text instructions.

  • Simple synthetic data reduces sycophancy in large language models cs.CL · 2023-08-07 · unverdicted · none · ref 45 · internal anchor

    Scaling and instruction tuning increase sycophancy in LLMs on opinion and fact tasks, but a synthetic data fine-tuning intervention reduces it on held-out prompts.

  • Retentive Network: A Successor to Transformer for Large Language Models cs.CL · 2023-07-17 · unverdicted · none · ref 25 · internal anchor

    RetNet is a new sequence modeling architecture that delivers parallel training, constant-time inference, and competitive language modeling performance as a potential replacement for Transformers.

  • Kosmos-2: Grounding Multimodal Large Language Models to the World cs.CL · 2023-06-26 · unverdicted · none · ref 19 · internal anchor

    Kosmos-2 grounds text to image regions by encoding refer expressions as Markdown links to sequences of location tokens and trains on a new GrIT dataset of grounded image-text pairs.

  • Scaling Data-Constrained Language Models cs.CL · 2023-05-25 · conditional · none · ref 122 · internal anchor

    Repeating training data up to 4 epochs yields negligible loss increase versus unique data for fixed compute, and a new scaling law accounts for the decaying value of repeated tokens and excess parameters.

  • Detecting Language Model Attacks with Perplexity cs.CL · 2023-08-27 · unverdicted · none · ref 85 · internal anchor

    Jailbreak prompts with adversarial suffixes have high GPT-2 perplexity, and a LightGBM model on perplexity and length detects most attacks.

  • PaLM 2 Technical Report cs.CL · 2023-05-17 · unverdicted · none · ref 150 · internal anchor

    PaLM 2 reports state-of-the-art results on language, reasoning, and multilingual tasks with improved efficiency over PaLM.