Rethinking LLM evaluation: Can we evaluate LLMs with 200x less data? arXiv preprint arXiv:2510.10457
3 Pith papers cite this work. Polarity classification is still indexing.
fields: cs.LG (3)
years: 2026 (3)
verdicts: UNVERDICTED (3)
representative citing papers
A joint task-model adaptation method learns optimal weights for data selection indicators via ICL proxies on small validation sets, matching or exceeding full-dataset fine-tuning performance with only 30% of samples on GSM8K.
DualEval jointly calibrates LLM abilities and item difficulties/sharpness in a shared latent space using static labels and reward-model scores to unify benchmark and arena-style evaluation.
citing papers explorer
- Predicting Performance of Symbolic and Prompt Programs with Examples
  Proposes RAP, a retrieval-based approximate prior method, to predict performance of symbolic programs and LLM prompts on new tasks using a Bernoulli model and corpus-derived performance distributions.
- Learning Multi-Indicator Weights for Data Selection: A Joint Task-Model Adaptation Framework with Efficient Proxies
  A joint task-model adaptation method learns optimal weights for data selection indicators via ICL proxies on small validation sets, matching or exceeding full-dataset fine-tuning performance with only 30% of samples on GSM8K.
- DualEval: Joint Model-Item Calibration for Unified LLM Evaluation
  DualEval jointly calibrates LLM abilities and item difficulties/sharpness in a shared latent space using static labels and reward-model scores to unify benchmark and arena-style evaluation.