Rethinking LLM Evaluation: Can We Evaluate LLMs with 200x Less Data? arXiv preprint arXiv:2510.10457

3 Pith papers cite this work. Polarity classification is still indexing.

representative citing papers

Predicting Performance of Symbolic and Prompt Programs with Examples

cs.LG · 2026-05-15 · unverdicted · novelty 5.0

Proposes RAP, a retrieval-based approximate prior method, to predict performance of symbolic programs and LLM prompts on new tasks using a Bernoulli model and corpus-derived performance distributions.

Learning Multi-Indicator Weights for Data Selection: A Joint Task-Model Adaptation Framework with Efficient Proxies

cs.LG · 2026-05-10 · unverdicted · novelty 5.0

A joint task-model adaptation method learns optimal weights for data selection indicators via ICL proxies on small validation sets, matching or exceeding full-dataset fine-tuning performance with only 30% of samples on GSM8K.

DualEval: Joint Model-Item Calibration for Unified LLM Evaluation

cs.LG · 2026-06-24 · unverdicted · novelty 4.0

DualEval jointly calibrates LLM abilities and item difficulties/sharpness in a shared latent space using static labels and reward-model scores to unify benchmark and arena-style evaluation.

citing papers explorer

Predicting Performance of Symbolic and Prompt Programs with Examples · cs.LG · 2026-05-15 · unverdicted · none · ref 7
Learning Multi-Indicator Weights for Data Selection: A Joint Task-Model Adaptation Framework with Efficient Proxies · cs.LG · 2026-05-10 · unverdicted · none · ref 26
DualEval: Joint Model-Item Calibration for Unified LLM Evaluation · cs.LG · 2026-06-24 · unverdicted · none · ref 32
