WildBench: Benchmarking LLMs with Challenging Tasks from Real Users in the Wild
10 Pith papers cite this work.
Citing papers
- SARL: Label-Free Reinforcement Learning by Rewarding Reasoning Topology
  SARL rewards reasoning topology to improve label-free RL, outperforming baselines with gains up to 44.7% on math and 34.6% on open-ended tasks while maintaining more stable training.
- ProactBench: Beyond What the User Asked For
  ProactBench measures LLM conversational proactivity in three phases using 198 multi-agent dialogues and finds recovery behavior hard to predict from existing benchmarks.
- Self-Preference Bias in Rubric-Based Evaluation of Large Language Models
  Rubric-based LLM judges show self-preference bias, incorrectly marking their own failed outputs as satisfying the rubric up to 50% more often on verifiable benchmarks and skewing scores by 10 points on subjective ones.
- Deployment-Relevant Alignment Cannot Be Inferred from Model-Level Evaluation Alone
  Deployment-relevant AI alignment cannot be inferred from model-level evaluations alone, as benchmark audits show missing interaction support and cross-model tests reveal model-dependent scaffold effects.
- Prompt Optimization Is a Coin Flip: Diagnosing When It Helps in Compound AI Systems
  Prompt optimization in compound AI systems is statistically indistinguishable from random chance except when tasks have exploitable output structure; a two-stage diagnostic predicts success.
- Quantifying the Utility of User Simulators for Building Collaborative LLM Assistants
  Fine-tuned simulators grounded in real human data produce LLM assistants that win more often against real users than those trained against role-playing simulators.
- Submodular Benchmark Selection
  Submodular maximization under a Gaussian model selects small benchmark subsets that outperform random selection for imputing leaderboard scores, with mutual information beating entropy at small subset sizes (see the sketch after this list).
- SPARD: Self-Paced Curriculum for RL Alignment via Integrating Reward Dynamics and Data Utility
  SPARD dynamically tunes multi-objective reward weights and data importance in LLM reinforcement learning alignment using a self-paced curriculum driven by reward dynamics and data utility.
- Ministral 3
  Ministral 3 is a release of parameter-efficient 3B/8B/14B language models with base, instruction, and reasoning variants, derived via iterative pruning and distillation and including image-understanding capabilities.
- LLMs-as-Judges: A Comprehensive Survey on LLM-based Evaluation Methods
  A survey that organizes LLMs-as-judges research into functionality, methodology, applications, meta-evaluation, and limitations.
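To make the Submodular Benchmark Selection entry concrete, here is a minimal sketch of the general technique it names: greedily maximizing the mutual information between a selected benchmark subset and the held-out benchmarks, under a jointly Gaussian model of scores. The paper's actual objective, covariance estimator, and data are not reproduced here; the function names, the regularization, and the toy leaderboard matrix are illustrative assumptions.

```python
import numpy as np

def gaussian_mi(cov, subset):
    """Mutual information I(X_S; X_rest) for a jointly Gaussian vector
    with covariance `cov`, where S = `subset` and rest = all other indices."""
    rest = [i for i in range(cov.shape[0]) if i not in subset]
    if not subset or not rest:
        return 0.0
    _, ld_full = np.linalg.slogdet(cov)
    _, ld_s = np.linalg.slogdet(cov[np.ix_(subset, subset)])
    _, ld_r = np.linalg.slogdet(cov[np.ix_(rest, rest)])
    # For Gaussians: I(X_S; X_R) = 0.5 * (log|Sigma_SS| + log|Sigma_RR| - log|Sigma|)
    return 0.5 * (ld_s + ld_r - ld_full)

def greedy_select(cov, k):
    """Greedily grow a size-k subset, each step adding the benchmark whose
    inclusion most increases MI with the held-out benchmarks."""
    chosen = []
    for _ in range(k):
        remaining = [j for j in range(cov.shape[0]) if j not in chosen]
        chosen.append(max(remaining, key=lambda j: gaussian_mi(cov, chosen + [j])))
    return chosen

# Toy leaderboard: 50 hypothetical models scored on 12 benchmarks.
rng = np.random.default_rng(0)
scores = rng.normal(size=(50, 12))
cov = np.cov(scores, rowvar=False) + 1e-6 * np.eye(12)  # regularize for stable logdets
print(greedy_select(cov, k=4))  # indices of the 4 selected benchmarks
```

Swapping gaussian_mi for the subset entropy 0.5 * log|Sigma_SS| (plus constants) recovers the entropy criterion the entry compares against; under the Gaussian model, scores on unselected benchmarks can then be imputed from the selected subset via the conditional mean.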