pith. machine review for the scientific record. sign in

hub

Wildbench: Benchmarking llms with challenging tasks from real users in the wild

10 Pith papers cite this work. Polarity classification is still indexing.

10 Pith papers citing it

hub tools

citation-role summary

dataset 2

citation-polarity summary

years

2026 9 2024 1

roles

dataset 2

polarities

use dataset 2

representative citing papers

ProactBench: Beyond What The User Asked For

cs.LG · 2026-05-09 · unverdicted · novelty 7.0

ProactBench measures LLM conversational proactivity in three phases using 198 multi-agent dialogues and finds recovery behavior hard to predict from existing benchmarks.

Submodular Benchmark Selection

cs.AI · 2026-05-04 · unverdicted · novelty 4.0

Submodular maximization under a Gaussian model selects small benchmark subsets that outperform random selection for imputing leaderboard scores, with mutual information better than entropy at small sizes.

Ministral 3

cs.CL · 2026-01-13 · unverdicted · novelty 4.0

Ministral 3 releases 3B/8B/14B parameter-efficient language models with base, instruction, and reasoning variants derived via iterative pruning and distillation, including image understanding capabilities.

citing papers explorer

Showing 10 of 10 citing papers.