arXiv preprint arXiv:2203.04592

Mapping global dynamics of benchmark creation and saturation in artificial intelligence · 2022 · arXiv 2203.04592

3 Pith papers cite this work. Polarity classification is still indexing.

representative citing papers

SEAL: Can Saturated Benchmarks Be Revived by LLM-as-a-Meta-Judge?

cs.CL · 2026-05-28 · unverdicted · novelty 6.0

SEAL revives saturated benchmarks via adaptive LLM meta-judging in elimination matches, matching full pairwise accuracy with roughly half the calls across code, math, QA, and agent tasks.

The Red Queen Gödel Machine: Co-Evolving Agents and Their Evaluators

cs.LG · 2026-06-24 · unverdicted · novelty 5.0

RQGM enables co-evolution of agents and evaluators across epochs with non-stationary utilities, reporting gains in coding pass rates, paper acceptance, and proof grading over prior self-improving agents.

Ten Headache Specialists versus Artificial Intelligence for Clinical Literature Summarization: A Critical Evaluation and Comparison

cs.AI · 2026-06-03 · unverdicted · novelty 4.0

Headache specialists preferred their own literature summaries over those from Sonnet, GPT-4o, and Llama 3.1 in a blinded evaluation, though AI summaries were sometimes indistinguishable.

citing papers explorer

SEAL: Can Saturated Benchmarks Be Revived by LLM-as-a-Meta-Judge? · cs.CL · 2026-05-28 · unverdicted · none · ref 2
The Red Queen Gödel Machine: Co-Evolving Agents and Their Evaluators · cs.LG · 2026-06-24 · unverdicted · none · ref 14
Ten Headache Specialists versus Artificial Intelligence for Clinical Literature Summarization: A Critical Evaluation and Comparison · cs.AI · 2026-06-03 · unverdicted · none · ref 85