Large Language Model Selection with Limited Annotations

Andreas Kirsch; Nezihe Merve G\"urel; Patrik Okanovic; Torsten Hoefler; Yavuz Durmazkeser

arxiv: 2605.24981 · v1 · pith:5VFDH7O3new · submitted 2026-05-24 · 💻 cs.CL · cs.LG

Large Language Model Selection with Limited Annotations

Yavuz Durmazkeser , Patrik Okanovic , Andreas Kirsch , Torsten Hoefler , Nezihe Merve G\"urel This is my paper

Pith reviewed 2026-06-30 12:11 UTC · model grok-4.3

classification 💻 cs.CL cs.LG

keywords LLM selectionactive model selectionannotation efficiencyinformation gainpairwise similaritiesblack-box evaluationquery selection

0 comments

The pith

SELECT-LLM selects a minimal set of queries to annotate so the best LLM for a task can be identified with far lower labeling cost than full evaluation.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces SELECT-LLM as the first active selection framework that chooses which queries to label when many candidate LLMs must be compared for a given task. It computes expected information gain from pairwise similarities among the models' generated outputs to pick the most informative queries. The method requires no model weights or architecture details and therefore applies to both open-weight and black-box LLMs. A reader would care because exhaustive annotation over large fixed sets quickly becomes prohibitive when dozens of strong models are available. Experiments across 23 datasets and 156 models show consistent gains and annotation reductions reaching 81.8 percent for exact best-model identification.

Core claim

SELECT-LLM finds a small set of queries whose annotations are most informative for identifying the best LLM by using a query selection rule based on expected information gain computed from pairwise similarities between candidate model outputs, and this rule works without any assumptions on model architecture or access to weights.

What carries the argument

The query selection rule based on expected information gain computed from pairwise similarities between candidate model outputs

If this is right

Best-model identification requires up to 81.8 percent fewer annotations than exhaustive evaluation.
Near-best model identification requires up to 84.78 percent fewer annotations.
The same selection procedure improves over the strongest baseline in every tested setting.
The procedure applies equally to open-weight and black-box LLMs because it uses only generated responses.
The approach scales to diverse task families and multiple text evaluation metrics.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The similarity-based selection rule might also reduce labeling cost when the goal is to rank models rather than pick only the single best one.
If model outputs become more correlated, the information-gain calculation could select fewer queries before the ranking stabilizes.
Combining the method with cheap proxy metrics computed before any human annotation could further lower the total cost.
The framework could be tested on tasks where the evaluation metric itself changes with the chosen model.

Load-bearing premise

The query selection rule based on expected information gain computed from pairwise similarities between candidate model outputs is effective for identifying the best LLM without assumptions about their architecture or access to model weights.

What would settle it

A controlled run on a fresh collection of models and tasks where the queries chosen by SELECT-LLM produce a final model ranking that differs from the ranking obtained after full annotation would falsify the central claim.

Figures

Figures reproduced from arXiv: 2605.24981 by Andreas Kirsch, Nezihe Merve G\"urel, Patrik Okanovic, Torsten Hoefler, Yavuz Durmazkeser.

**Figure 1.** Figure 1: An overview of SELECT-LLM. For an arbitrary pool of n queries and a set of candidate language models, SELECT-LLM adaptively annotates most informative b ≪ n queries for identifying the best language model for the pool. to reliably identify the best LLM for a target task and data distribution under limited annotation resources remains an open question. In this work, we study active model selection for LLMs … view at source ↗

**Figure 2.** Figure 2: Comparison of SELECT-LLM and the baselines in terms of best model identification probability across 23 datasets. In each plot, the horizontal arrow and percentage indicate SELECT-LLM’s labeling efficiency relative to the strongest baseline, and the dashed vertical lines mark the corresponding budgets. Each plot is shown until the strongest baseline reaches 100% identification probability. 8 [PITH_FULL_IM… view at source ↗

**Figure 3.** Figure 3: Query-rank comparison between the selection rule and exact mutual information in the synthetic [PITH_FULL_IMAGE:figures/full_fig_p026_3.png] view at source ↗

**Figure 4.** Figure 4: Distribution of average model scores for the [PITH_FULL_IMAGE:figures/full_fig_p032_4.png] view at source ↗

**Figure 5.** Figure 5: Sensitivity of SELECT-LLM to the temperature parameter τ across the 23 datasets. 33 [PITH_FULL_IMAGE:figures/full_fig_p033_5.png] view at source ↗

read the original abstract

Choosing a Large Language Model (LLM) for a given task requires comparing many strong candidates, yet standard evaluation relies on costly annotations over fixed evaluation sets. To address this challenge, we develop SELECT-LLM, the first framework for active model selection of LLMs. SELECT-LLM aims to find a small set of queries whose annotations are most informative for identifying the best LLM for a given task. To this end, we introduce a query selection rule based on expected information gain, computed from pairwise similarities between candidate model outputs. Because this rule only uses generated model responses, SELECT-LLM can be applied across candidate models without assumptions about their architecture or access to model weights. This makes it suitable for both open-weight and black-box LLMs. We evaluate SELECT-LLM across 23 datasets, 156 evaluated models, diverse task families, and multiple text evaluation metrics. Across all experiments, SELECT-LLM improves over the strongest baseline in every setting, with annotation cost reductions up to 81.8% for best model selection and up to 84.78% for near-best model selection.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

SELECT-LLM picks informative queries for LLM model selection via expected information gain on pairwise output similarities and reports large annotation savings on a broad set of experiments.

read the letter

The main point is that this paper gives a concrete way to cut the number of annotations needed to pick the best LLM for a task. It does this by choosing queries whose labels are expected to give the most information about which model is strongest, where the information gain comes from how similar the models' outputs are on those queries.

What is new is the application of this expected information gain rule specifically to LLM selection across both open and black-box models, using only generated responses. The evaluation covers 23 datasets and 156 models with multiple metrics, and the abstract states that the method beats the strongest baseline in every setting, with cost reductions reaching 81.8% for exact best-model identification and 84.78% for near-best.

The scale of the experiments is a clear strength. Running that many models and datasets gives a reasonable picture of whether the gains are consistent rather than tied to one narrow case. The lack of any need for model weights or architecture details also makes the approach directly usable in practice.

The soft spot is that the abstract does not spell out the exact similarity measure, how the information gain is computed in closed form, or the precise definition of the baselines. Without those details it is difficult to judge whether the reported improvements could be sensitive to implementation choices or to particular similarity functions. The central claim itself does not appear circular or self-referential from what is shown.

This paper is for people who evaluate or deploy multiple LLMs and need to control annotation budgets. It is worth sending to peer review because the empirical scope is large and the practical goal is clear; a referee can check the missing implementation pieces and the statistical reporting.

Referee Report

2 major / 2 minor

Summary. The paper introduces SELECT-LLM, the first active model selection framework for LLMs. It selects a small set of queries for annotation using a rule based on expected information gain computed from pairwise similarities between candidate model outputs. The method requires only generated responses and applies to both open-weight and black-box models without architectural assumptions. Experiments across 23 datasets and 156 models show consistent outperformance over the strongest baseline in every setting, with annotation cost reductions up to 81.8% for best-model selection and 84.78% for near-best selection.

Significance. If the results hold, the work is significant for practical LLM evaluation: it directly tackles the high cost of comparing many strong models by minimizing required annotations while remaining applicable to black-box APIs. The scale of the evaluation (23 datasets, 156 models, multiple task families and metrics) and the explicit use of only generated outputs are strengths that support generalizability. The approach is internally consistent with its stated goal and avoids self-referential parameter fitting.

major comments (2)

[§3] §3 (method): the exact formula for expected information gain and the definition of pairwise similarity (e.g., exact match vs. embedding-based) must be stated with equations; without them the central claim that the selection rule is effective cannot be verified or reproduced from the generated responses alone.
[Results section] Results section / Table 2 or equivalent: the claim of improvement 'in every setting' requires reporting the annotation budget (number of queries) per experiment and either standard deviations across runs or statistical tests; the current high-level summary leaves the magnitude and reliability of the 81.8% / 84.78% reductions unquantified.

minor comments (2)

Add a short related-work paragraph contrasting SELECT-LLM with prior active-learning or query-selection methods for model comparison to substantiate the 'first framework' claim.
Ensure all figures reporting cost reductions label the exact baseline, the metric used for 'near-best', and the number of models/datasets per panel for immediate readability.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the positive evaluation and helpful suggestions. We address the two major comments below and will revise the manuscript to incorporate the requested details.

read point-by-point responses

Referee: [§3] §3 (method): the exact formula for expected information gain and the definition of pairwise similarity (e.g., exact match vs. embedding-based) must be stated with equations; without them the central claim that the selection rule is effective cannot be verified or reproduced from the generated responses alone.

Authors: We agree that the exact formulas are required for full reproducibility. In the revised version we will insert the precise mathematical definitions of expected information gain and the pairwise similarity function (including whether exact match or embedding-based) as equations in §3. revision: yes
Referee: [Results section] Results section / Table 2 or equivalent: the claim of improvement 'in every setting' requires reporting the annotation budget (number of queries) per experiment and either standard deviations across runs or statistical tests; the current high-level summary leaves the magnitude and reliability of the 81.8% / 84.78% reductions unquantified.

Authors: We accept this point. The revised results section will report the exact annotation budget (number of queries) for each experiment together with standard deviations across runs or the results of statistical significance tests, thereby quantifying the reported cost reductions. revision: yes

Circularity Check

0 steps flagged

No significant circularity

full rationale

The paper introduces SELECT-LLM as an empirical active selection procedure that computes expected information gain directly from pairwise similarities of generated model outputs on candidate queries. This computation operates on external data (model responses) without reducing any claimed prediction or selection rule to a fitted parameter or self-referential definition. No load-bearing self-citation, ansatz smuggling, or renaming of known results appears in the provided description; the central claim of annotation-cost reduction is presented as an empirical outcome across datasets and models rather than a mathematical identity derived from the method's own inputs.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The central claim rests on the domain assumption that pairwise output similarities suffice to compute useful expected information gain for model selection; no free parameters or invented entities are described in the abstract.

axioms (1)

domain assumption Pairwise similarities between candidate model outputs can be used to compute expected information gain that identifies the most informative queries for best-model selection
This assumption underpins the query selection rule stated in the abstract.

pith-pipeline@v0.9.1-grok · 5734 in / 1286 out tokens · 35838 ms · 2026-06-30T12:11:07.854905+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

130 extracted references · 30 canonical work pages · 24 internal anchors

[1]

Phi-3 Technical Report: A Highly Capable Language Model Locally on Your Phone

Marah Abdin, Sam Ade Jacobs, Ammar Ahmad Awan, Jyoti Aneja, Ahmed Awadallah, Hany Awadalla, Nguyen Bach, Amit Bahree, Arash Bakhtiari, Harkirat Behl, Alon Benhaim, Misha Bilenko, Johan Bjorck, S´ebastien Bubeck, Martin Cai, Caio C´esar Teodoro Mendes, Weizhu Chen, Vishrav Chaudhary , Parul Chopra, Allie Del Giorno, Gustavo de Rosa, Matthew Dixon, Ronen El...

work page internal anchor Pith review Pith/arXiv arXiv 2024
[2]

Jurassic-1: Technical details and evaluation.https://www.ai21.com/blog/ announcing-ai21-studio-and-jurassic-1, 2021

AI21 Labs. Jurassic-1: Technical details and evaluation.https://www.ai21.com/blog/ announcing-ai21-studio-and-jurassic-1, 2021

2021
[3]

Luminous.https://docs.aleph-alpha.com/docs/introduction/luminous/

Aleph Alpha. Luminous.https://docs.aleph-alpha.com/docs/introduction/luminous/
[4]

The Falcon Series of Open Language Models

Ebtesam Almazrouei, Hamza Alobeidli, Abdulaziz Alshamsi, Alessandro Cappelli, Ruxandra Cojocaru, M´erouane Debbah, ´Etienne Goffinet, Daniel Hesslow, Julien Launay , Quentin Malartic, et al. The falcon series of open language models.arXiv preprint arXiv:2311.16867, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023
[5]

PaLM 2 Technical Report

Rohan Anil, Sebastian Borgeaud, Yonghui Wu, Jean-Baptiste Alayrac, Jiahui Yu, Radu Soricut, Johan Schalkwyk, Andrew M. Dai, Anja Hauth, Katie Millican, et al. Palm 2 technical report.arXiv preprint arXiv:2305.10403, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023
[6]

Anthropic claude api.https://console.anthropic.com

Anthropic. Anthropic claude api.https://console.anthropic.com
[7]

Model card and evaluations for claude models, 2023

Anthropic. Model card and evaluations for claude models, 2023

2023
[8]

Introducing the next generation of claude, 2024

Anthropic. Introducing the next generation of claude, 2024

2024
[9]

Introducing Claude Opus 4.7.https://www.anthropic.com/news/claude-opus-4-7,

Anthropic. Introducing Claude Opus 4.7.https://www.anthropic.com/news/claude-opus-4-7,
[10]

Accessed: 2026-05-02

2026
[11]

Qwen Technical Report

Jinze Bai, Shuai Bai, Yunfei Chu, Zeyu Cui, Kai Dang, Xiaodong Deng, Yang Fan, Wenbin Ge, Yu Han, Fei Huang, Binyuan Hui, Luo Ji, Mei Li, Junyang Lin, Runji Lin, Dayiheng Liu, Gao Liu, Chengqiang Lu, Keming Lu, Jianxin Ma, Rui Men, Xingzhang Ren, Xuancheng Ren, Chuanqi Tan, Sinan Tan, Jianhong Tu, Peng Wang, Shijie Wang, Wei Wang, Shengguang Wu, Benfeng X...

work page internal anchor Pith review Pith/arXiv arXiv 2023
[12]

Qwen-VL: A Versatile Vision-Language Model for Understanding, Localization, Text Reading, and Beyond

Jinze Bai, Shuai Bai, Shusheng Yang, Shijie Wang, Sinan Tan, Peng Wang, Junyang Lin, Chang Zhou, and Jingren Zhou. Qwen-vl: A versatile vision-language model for understanding, localization, text reading, and beyond.arXiv preprint arXiv:2308.12966, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023
[13]

Scaling up active testing to large language models.arXiv preprint arXiv:2508.09093, 2025

Gabrielle Berrada, Jannik Kossen, Muhammed Razzak, Freddie Bickford Smith, Yarin Gal, and Tom Rainforth. Scaling up active testing to large language models.arXiv preprint arXiv:2508.09093, 2025. 11

work page arXiv 2025
[14]

PaliGemma: A versatile 3B VLM for transfer

Lucas Beyer, Andreas Steiner, Andre Susano Pinto, Alexander Kolesnikov, Xiao Wang, Daniel Salz, Maxim Neumann, Ibrahim Alabdulmohsin, Michael Tschannen, Emanuele Bugliarello, et al. Paligemma: A versatile 3b vlm for transfer.arXiv preprint arXiv:2407.07726, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024
[15]

Pythia: A Suite for Analyzing Large Language Models Across Training and Scaling

Stella Biderman, Hailey Schoelkopf, Quentin Gregory Anthony , Herbie Bradley , Kyle O’Brien, Eric Hallahan, Mohammad Aflah Khan, Shivanshu Purohit, USVSN Sai Prashanth, Edward Raff, et al. Pythia: A suite for analyzing large language models across training and scaling.arXiv preprint arXiv:2304.01373, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023
[16]

GPT-NeoX-20B: An Open-Source Autoregressive Language Model

Sid Black, Stella Biderman, Eric Hallahan, Quentin Anthony , Leo Gao, Laurence Golding, Horace He, Connor Leahy , Kyle McDonell, Jason Phang, et al. Gpt-neox-20b: An open-source autoregressive language model.arXiv preprint arXiv:2204.06745, 2022

work page internal anchor Pith review Pith/arXiv arXiv 2022
[17]

Findings of the 2014 workshop on statistical machine translation

Ond ˇrej Bojar, Christian Buck, Christian Federmann, Barry Haddow, Philipp Koehn, Johannes Leveling, Christof Monz, Pavel Pecina, Matt Post, Herve Saint-Amand, Radu Soricut, Lucia Specia, and Ale ˇs Tamchyna. Findings of the 2014 workshop on statistical machine translation. InProceedings of the Ninth Workshop on Statistical Machine Translation, pages 12–5...

2014
[18]

Language models are few-shot learners

Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry , Amanda Askell, Sandhini Agarwal, Ariel Herbert-Voss, Gretchen Krueger, Tom Henighan, Rewon Child, Aditya Ramesh, Daniel Ziegler, Jeffrey Wu, Clemens Winter, Chris Hesse, Mark Chen, Eric Sigler, Mateusz Litwin, Scott G...

1901
[19]

Yu, Qiang Yang, and Xing Xie

Yupeng Chang, Xu Wang, Jindong Wang, Yuan Wu, Linyi Yang, Kaijie Zhu, Hao Chen, Xiaoyuan Yi, Cunxiang Wang, Yidong Wang, Wei Ye, Yue Zhang, Yi Chang, Philip S. Yu, Qiang Yang, and Xing Xie. A survey on evaluation of large language models.ACM Trans. Intell. Syst. Technol., 15(3), March 2024

2024
[20]

Humans or LLMs as the judge? a study on judgement bias

Guiming Hardy Chen, Shunian Chen, Ziche Liu, Feng Jiang, and Benyou Wang. Humans or LLMs as the judge? a study on judgement bias. InProceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, pages 8301–8327, Miami, Florida, USA, November 2024. Association for Computational Linguistics

2024
[21]

Hamed Hassani, Amin Karbasi, and Andreas Krause

Yuxin Chen, S. Hamed Hassani, Amin Karbasi, and Andreas Krause. Sequential information maxi- mization: When is greedy near-optimal? In Peter Gr ¨unwald, Elad Hazan, and Satyen Kale, editors, Proceedings of The 28th Conference on Learning Theory, volume 40 ofProceedings of Machine Learning Research, pages 338–363, Paris, France, 03–06 Jul 2015. PMLR

2015
[22]

Gonzalez, Ion Stoica, and Eric P

Wei-Lin Chiang, Zhuohan Li, Zi Lin, Ying Sheng, Zhanghao Wu, Hao Zhang, Lianmin Zheng, Siyuan Zhuang, Yonghao Zhuang, Joseph E. Gonzalez, Ion Stoica, and Eric P. Xing. Vicuna: An open-source chatbot impressing gpt-4 with 90%* chatgpt quality , March 2023

2023
[23]

Training Verifiers to Solve Math Word Problems

Karl Cobbe, Vineet Kosaraju, Mohammad Bavarian, Mark Chen, Heewoo Jun, Lukasz Kaiser, Matthias Plappert, Jerry Tworek, Jacob Hilton, Reiichiro Nakano, Christopher Hesse, and John Schulman. Training verifiers to solve math word problems.arXiv preprint arXiv:2110.14168, 2021

work page internal anchor Pith review Pith/arXiv arXiv 2021
[24]

Cohere api.https://docs.cohere.com

Cohere. Cohere api.https://docs.cohere.com
[25]

Command r: Retrieval-augmented generation at production scale, 2024

Cohere. Command r: Retrieval-augmented generation at production scale, 2024

2024
[26]

Introducing command r+: A scalable llm built for business, 2024

Cohere. Introducing command r+: A scalable llm built for business, 2024. 12

2024
[27]

Committee-based sampling for training probabilistic classifiers

Ido Dagan and Sean P Engelson. Committee-based sampling for training probabilistic classifiers. In Machine Learning Proceedings 1995, pages 150–157. Elsevier, 1995

1995
[28]

Introducing dbrx: A new state-of-the-art open llm.https://www.databricks.com/blog/ introducing-dbrx-new-state-art-open-llm, March 2024

Databricks. Introducing dbrx: A new state-of-the-art open llm.https://www.databricks.com/blog/ introducing-dbrx-new-state-art-open-llm, March 2024. Accessed: 2025-08-31

2024
[29]

GLM: general language model pretraining with autoregressive blank infilling

Zhengxiao Du, Yujie Qian, Xiao Liu, Ming Ding, Jiezhong Qiu, Zhilin Yang, and Jie Tang. GLM: general language model pretraining with autoregressive blank infilling. InProceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), ACL 2022, Dublin, Ireland, May 22-27, 2022, pages 320–335. Association for Co...

2022
[30]

Sample-efficient human evaluation of large language models via maximum discrepancy competition

Kehua Feng, Keyan Ding, Tan Hongzhi, Kede Ma, Zhihua Wang, Shuangquan Guo, Cheng Yuzhou, Ge Sun, Guozhou Zheng, Qiang Zhang, and Huajun Chen. Sample-efficient human evaluation of large language models via maximum discrepancy competition. InProceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), page...

2025
[31]

Open llm leaderboard v2.https://huggingface.co/spaces/open-llm-leaderboard/open_llm_ leaderboard, 2024

Cl ´ementine Fourrier, Nathan Habib, Alina Lozovskaya, Konrad Szafer, and Thomas Wolf. Open llm leaderboard v2.https://huggingface.co/spaces/open-llm-leaderboard/open_llm_ leaderboard, 2024

2024
[32]

Selective sampling using the query by committee algorithm.Machine learning, 28:133–168, 1997

Yoav Freund, H Sebastian Seung, Eli Shamir, and Naftali Tishby . Selective sampling using the query by committee algorithm.Machine learning, 28:133–168, 1997

1997
[33]

Bayesian active model selection with an application to automated audiometry

Jacob Gardner, Gustavo Malkomes, Roman Garnett, Kilian Q Weinberger, Dennis Barbour, and John P Cunningham. Bayesian active model selection with an application to automated audiometry . In C. Cortes, N. Lawrence, D. Lee, M. Sugiyama, and R. Garnett, editors,Advances in Neural Information Processing Systems, volume 28. Curran Associates, Inc., 2015

2015
[34]

Gemini: A Family of Highly Capable Multimodal Models

Google. Gemini: A family of highly capable multimodal models.arXiv preprint arXiv:2312.11805, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023
[35]

Gemma open models, 2024

Google. Gemma open models, 2024

2024
[36]

Gemini api.https://ai.google.dev

Google DeepMind. Gemini api.https://ai.google.dev
[37]

Gemini 3.1 Pro Model Card.https://deepmind.google/models/model-cards/ gemini-3-1-pro/, 2026

Google DeepMind. Gemini 3.1 Pro Model Card.https://deepmind.google/models/model-cards/ gemini-3-1-pro/, 2026. Accessed: 2026-05-02

2026
[38]

The Flores-101 evaluation benchmark for low-resource and multilingual machine translation.Transactions of the Association for Computational Linguistics, 10:522–538, 2022

Naman Goyal, Cynthia Gao, Vishrav Chaudhary , Peng-Jen Chen, Guillaume Wenzek, Da Ju, San- jana Krishnan, Marc’Aurelio Ranzato, Francisco Guzm´an, and Angela Fan. The Flores-101 evaluation benchmark for low-resource and multilingual machine translation.Transactions of the Association for Computational Linguistics, 10:522–538, 2022

2022
[39]

Making the v in vqa matter: Elevating the role of image understanding in visual question answering

Yash Goyal, Tejas Khot, Douglas Summers-Stay , Dhruv Batra, and Devi Parikh. Making the v in vqa matter: Elevating the role of image understanding in visual question answering. InProceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 6904–6913, July 2017

2017
[40]

Rating roulette: Self-inconsistency in LLM-as-a-judge frame- works

Rajarshi Haldar and Julia Hockenmaier. Rating roulette: Self-inconsistency in LLM-as-a-judge frame- works. InFindings of the Association for Computational Linguistics: EMNLP 2025, pages 24986–25004, Suzhou, China, November 2025. Association for Computational Linguistics

2025
[41]

DeBERTa: Decoding-enhanced BERT with disentangled attention

Pengcheng He, Xiaodong Liu, Jianfeng Gao, and Weizhu Chen. DeBERTa: Decoding-enhanced BERT with disentangled attention. InInternational Conference on Learning Representations, 2021

2021
[42]

Measuring massive multitask language understanding

Dan Hendrycks, Collin Burns, Steven Basart, Andy Zou, Mantas Mazeika, Dawn Song, and Jacob Steinhardt. Measuring massive multitask language understanding. InInternational Conference on Learning Representations, 2021. 13

2021
[43]

Measuring mathematical problem solving with the math dataset

Dan Hendrycks, Collin Burns, Saurav Kadavath, Akul Arora, Steven Basart, Eric Tang, Dawn Song, and Jacob Steinhardt. Measuring mathematical problem solving with the math dataset. InAdvances in Neural Information Processing Systems, volume 34, 2021

2021
[44]

Teaching machines to read and comprehend

Karl Moritz Hermann, Tomas Kocisky , Edward Grefenstette, Lasse Espeholt, Will Kay , Mustafa Su- leyman, and Phil Blunsom. Teaching machines to read and comprehend. InAdvances in Neural Information Processing Systems, volume 28, 2015

2015
[45]

Actracer: Active testing of large language model via multi-stage sampling.ACM Transactions on Software Engineering and Methodology, 2025

Yuheng Huang, Jiayang Song, Qiang Hu, Felix Juefei-Xu, and Lei Ma. Actracer: Active testing of large language model via multi-stage sampling.ACM Transactions on Software Engineering and Methodology, 2025

2025
[46]

Hugging face hub.https://huggingface.co

Hugging Face. Hugging face hub.https://huggingface.co
[47]

Introducing idefics: An open reproduction of flamingo.https://huggingface.co/ blog/idefics, 2023

Hugging Face. Introducing idefics: An open reproduction of flamingo.https://huggingface.co/ blog/idefics, 2023

2023
[48]

Internlm: A multilingual language model with progressively enhanced capabilities.https: //github.com/InternLM/InternLM-techreport, 2023

InternLM. Internlm: A multilingual language model with progressively enhanced capabilities.https: //github.com/InternLM/InternLM-techreport, 2023

2023
[49]

Smith, Iz Beltagy , and Hannaneh Hajishirzi

Hamish Ivison, Yizhong Wang, Valentina Pyatkin, Nathan Lambert, Matthew Peters, Pradeep Dasigi, Joel Jang, David Wadden, Noah A. Smith, Iz Beltagy , and Hannaneh Hajishirzi. Camels in a changing climate: Enhancing lm adaptation with tulu 2, 2023

2023
[50]

Albert Q. Jiang, Alexandre Sablayrolles, Arthur Mensch, Chris Bamford, Devendra Singh Chaplot, Diego de las Casas, Florian Bressand, Gianna Lengyel, Guillaume Lample, Lucile Saulnier, L ´elio Re- nard Lavaud, Marie-Anne Lachaux, Pierre Stock, Teven Le Scao, Thibaut Lavril, Thomas Wang, Tim- oth´ee Lacroix, and William El Sayed. Mistral 7b.arXiv preprint a...

work page internal anchor Pith review Pith/arXiv arXiv 2023
[51]

Albert Q. Jiang, Alexandre Sablayrolles, Antoine Roux, Arthur Mensch, Blanche Savary , Chris Bam- ford, Devendra Singh Chaplot, Diego de las Casas, Emma Bou Hanna, Florian Bressand, Gianna Lengyel, Guillaume Bour, Guillaume Lample, L ´elio Renard Lavaud, Lucile Saulnier, Marie-Anne Lachaux, Pierre Stock, Sandeep Subramanian, Sophia Yang, Szymon Antoniak, ...

work page internal anchor Pith review Pith/arXiv arXiv 2024
[52]

Prompt packer: Deceiving llms through compositional instruction with hidden attacks.arXiv preprint arXiv:2310.10077, 2023

Shuyu Jiang, Xingshu Chen, and Rui Tang. Prompt packer: Deceiving llms through compositional instruction with hidden attacks.arXiv preprint arXiv:2310.10077, 2023

work page arXiv 2023
[53]

Online active model selection for pre-trained classifiers

Mohammad Reza Karimi, Nezihe Merve G ¨urel, Bojan Karlaˇs, Johannes Rausch, Ce Zhang, and An- dreas Krause. Online active model selection for pre-trained classifiers. InInternational Conference on Artificial Intelligence and Statistics (AISTATS), pages 307–315. PMLR, April 2021

2021
[54]

Anytime model selection in linear bandits

Parnian Kassraie, Nicolas Emmenegger, Andreas Krause, and Aldo Pacchiano. Anytime model selection in linear bandits. InProc. Neural Information Processing Systems (NeurIPS), December 2023

2023
[55]

Consensus-driven active model selection

Justin Kay , Grant Van Horn, Subhransu Maji, Daniel Sheldon, and Sara Beery . Consensus-driven active model selection. InProceedings of the IEEE/CVF International Conference on Computer Vision (ICCV),
[56]

The NarrativeQA reading comprehension challenge.Transactions of the Association for Computational Linguistics, 6:317–328, 2018

Tom´aˇs Koˇcisk´y, Jonathan Schwarz, Phil Blunsom, Chris Dyer, Karl Moritz Hermann, G ´abor Melis, and Edward Grefenstette. The NarrativeQA reading comprehension challenge.Transactions of the Association for Computational Linguistics, 6:317–328, 2018

2018
[57]

Large language models are zero-shot reasoners

Takeshi Kojima, Shixiang (Shane) Gu, Machel Reid, Yutaka Matsuo, and Yusuke Iwasawa. Large language models are zero-shot reasoners. InAdvances in Neural Information Processing Systems, vol- ume 35, pages 22199–22213, 2022. 14

2022
[58]

Active testing: Sample-efficient model evaluation

Jannik Kossen, Sebastian Farquhar, Yarin Gal, and Tom Rainforth. Active testing: Sample-efficient model evaluation. In Marina Meila and Tong Zhang, editors,Proceedings of the 38th International Conference on Machine Learning, volume 139 ofProceedings of Machine Learning Research, pages 5753–
[59]

PMLR, 18–24 Jul 2021

2021
[60]

Toutanova, Llion Jones, Ming-Wei Chang, Andrew Dai, Jakob Uszkoreit, Quoc Le, and Slav Petrov

Tom Kwiatkowski, Jennimaria Palomaki, Olivia Redfield, Michael Collins, Ankur Parikh, Chris Alberti, Danielle Epstein, Illia Polosukhin, Matthew Kelcey , Jacob Devlin, Kenton Lee, Kristina N. Toutanova, Llion Jones, Ming-Wei Chang, Andrew Dai, Jakob Uszkoreit, Quoc Le, and Slav Petrov. Natural ques- tions: A benchmark for question answering research.Trans...

2019
[61]

Vhelm: A holistic evaluation of vision language models

Tony Lee, Haoqin Tu, Chi H Wong, Wenhao Zheng, Yiyang Zhou, Yifan Mai, Josselin S Roberts, Michi- hiro Yasunaga, Huaxiu Yao, Cihang Xie, et al. Vhelm: A holistic evaluation of vision language models. Advances in Neural Information Processing Systems, 37:140632–140666, 2024

2024
[62]

Towards optimal evaluation efficiency for large language models

Guohong Li and Deyi Xiong. Towards optimal evaluation efficiency for large language models. In Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing, pages 14176– 14183, Suzhou, China, 2025. Association for Computational Linguistics

2025
[63]

On the necessity of collaboration in online model selection with decentralized data.arXiv preprint arXiv:2404.09494, 2024

Junfan Li, Zenglin Xu, Zheshun Wu, and Irwin King. On the necessity of collaboration in online model selection with decentralized data.arXiv preprint arXiv:2404.09494, 2024

work page arXiv 2024
[64]

Online foun- dation model selection in robotics.arXiv preprint arXiv:2402.08570, 2024

Po-han Li, Oyku Selin Toprak, Aditya Narayanan, Ufuk Topcu, and Sandeep Chinchali. Online foun- dation model selection in robotics.arXiv preprint arXiv:2402.08570, 2024

work page arXiv 2024
[65]

Gonzalez, and Ion Stoica

Tianle Li, Wei-Lin Chiang, Evan Frick, Lisa Dunlap, Banghua Zhu, Joseph E. Gonzalez, and Ion Stoica. From live data to high-quality benchmarks: The arena-hard pipeline, April 2024

2024
[66]

Hashimoto

Xuechen Li, Tianyi Zhang, Yann Dubois, Rohan Taori, Ishaan Gulrajani, Carlos Guestrin, Percy Liang, and Tatsunori B. Hashimoto. Alpacaeval: An automatic evaluator of instruction-following models. https://github.com/tatsu-lab/alpaca_eval, 5 2023

2023
[67]

Active evaluation acqui- sition for efficient LLM benchmarking

Yang Li, Jie Ma, Miguel Ballesteros, Yassine Benajiba, and Graham Horwood. Active evaluation acqui- sition for efficient LLM benchmarking. InProceedings of the 42nd International Conference on Machine Learning, volume 267 ofProceedings of Machine Learning Research, pages 35581–35602, 2025

2025
[68]

Percy Liang, Rishi Bommasani, Tony Lee, Dimitris Tsipras, Dilara Soylu, Michihiro Yasunaga, Yian Zhang, Deepak Narayanan, Yuhuai Wu, Ananya Kumar, Benjamin Newman, Binhang Yuan, Bobby Yan, Ce Zhang, Christian Alexander Cosgrove, Christopher D Manning, Christopher Re, Diana Acosta- Navas, Drew Arad Hudson, Eric Zelikman, Esin Durmus, Faisal Ladhak, Frieda ...

2023
[69]

Helm lite: Lightweight and broad capabilities evaluation

Percy Liang, Yifan Mai, Josselin Somerville, Farzaan Kaiyom, Tony Lee, and Rishi Bommasani. Helm lite: Lightweight and broad capabilities evaluation. Stanford CRFM blog, 2023

2023
[70]

Active model selection for positive unlabeled time series classification

Shen Liang, Yanchun Zhang, and Jiangang Ma. Active model selection for positive unlabeled time series classification. In2020 IEEE 36th International Conference on Data Engineering (ICDE), pages 361–372, 2020

2020
[71]

ROUGE: A package for automatic evaluation of summaries

Chin-Yew Lin. ROUGE: A package for automatic evaluation of summaries. InText Summarization Branches Out, pages 74–81, Barcelona, Spain, July 2004. Association for Computational Linguistics. 15

2004
[72]

Contextual active online model selection with expert advice

Xuefeng Liu, Fangfang Xia, Rick L Stevens, and Yuxin Chen. Contextual active online model selection with expert advice. InICML2022 Workshop on Adaptive Experimental Design and Active Learning in the Real World. ICML, 2022

2022
[73]

Maas, Raymond E

Andrew L. Maas, Raymond E. Daly , Peter T. Pham, Dan Huang, Andrew Y. Ng, and Christopher Potts. Learning word vectors for sentiment analysis. InProceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies, pages 142–150, Portland, Oregon, USA, 2011. Association for Computational Linguistics

2011
[74]

Lizotte, and Russell Greiner

Omid Madani, Daniel J. Lizotte, and Russell Greiner. Active model selection, 2012

2012
[75]

tinyBenchmarks: evaluating LLMs with fewer examples

Felipe Maia Polo, Lucas Weber, Leshem Choshen, Yuekai Sun, Gongjun Xu, and Mikhail Yurochkin. tinyBenchmarks: evaluating LLMs with fewer examples. In Ruslan Salakhutdinov, Zico Kolter, Kather- ine Heller, Adrian Weller, Nuria Oliver, Jonathan Scarlett, and Felix Berkenkamp, editors,Proceedings of the 41st International Conference on Machine Learning, volu...

2024
[76]

Active model selection: A variance minimization approach

Mitsuru Matsuura and Satoshi Hara. Active model selection: A variance minimization approach. In NeurIPS 2023 Workshop on Adaptive Experimental Design and Active Learning in the Real World, 2023

2023
[77]

Introducing meta llama 3: The most capable openly available llm to date, 2024

Meta AI. Introducing meta llama 3: The most capable openly available llm to date, 2024

2024
[78]

Pixtral 12b.https://mistral.ai/news/pixtral-12b, 2024

Mistral AI. Pixtral 12b.https://mistral.ai/news/pixtral-12b, 2024

2024
[79]

Introducing mpt-30b: Raising the bar for open-source foundation models.https://www

MosaicML. Introducing mpt-30b: Raising the bar for open-source foundation models.https://www. mosaicml.com/blog/mpt-30b, 2023

2023
[80]

Cohen, and Mirella Lapata

Shashi Narayan, Shay B. Cohen, and Mirella Lapata. Don’t give me the details, just the summary! topic-aware convolutional neural networks for extreme summarization. In Ellen Riloff, David Chiang, Julia Hockenmaier, and Jun’ichi Tsujii, editors,Proceedings of the 2018 Conference on Empirical Meth- ods in Natural Language Processing, pages 1797–1807, Brusse...

2018

Showing first 80 references.

[1] [1]

Phi-3 Technical Report: A Highly Capable Language Model Locally on Your Phone

Marah Abdin, Sam Ade Jacobs, Ammar Ahmad Awan, Jyoti Aneja, Ahmed Awadallah, Hany Awadalla, Nguyen Bach, Amit Bahree, Arash Bakhtiari, Harkirat Behl, Alon Benhaim, Misha Bilenko, Johan Bjorck, S´ebastien Bubeck, Martin Cai, Caio C´esar Teodoro Mendes, Weizhu Chen, Vishrav Chaudhary , Parul Chopra, Allie Del Giorno, Gustavo de Rosa, Matthew Dixon, Ronen El...

work page internal anchor Pith review Pith/arXiv arXiv 2024

[2] [2]

Jurassic-1: Technical details and evaluation.https://www.ai21.com/blog/ announcing-ai21-studio-and-jurassic-1, 2021

AI21 Labs. Jurassic-1: Technical details and evaluation.https://www.ai21.com/blog/ announcing-ai21-studio-and-jurassic-1, 2021

2021

[3] [3]

Luminous.https://docs.aleph-alpha.com/docs/introduction/luminous/

Aleph Alpha. Luminous.https://docs.aleph-alpha.com/docs/introduction/luminous/

[4] [4]

The Falcon Series of Open Language Models

Ebtesam Almazrouei, Hamza Alobeidli, Abdulaziz Alshamsi, Alessandro Cappelli, Ruxandra Cojocaru, M´erouane Debbah, ´Etienne Goffinet, Daniel Hesslow, Julien Launay , Quentin Malartic, et al. The falcon series of open language models.arXiv preprint arXiv:2311.16867, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023

[5] [5]

PaLM 2 Technical Report

Rohan Anil, Sebastian Borgeaud, Yonghui Wu, Jean-Baptiste Alayrac, Jiahui Yu, Radu Soricut, Johan Schalkwyk, Andrew M. Dai, Anja Hauth, Katie Millican, et al. Palm 2 technical report.arXiv preprint arXiv:2305.10403, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023

[6] [6]

Anthropic claude api.https://console.anthropic.com

Anthropic. Anthropic claude api.https://console.anthropic.com

[7] [7]

Model card and evaluations for claude models, 2023

Anthropic. Model card and evaluations for claude models, 2023

2023

[8] [8]

Introducing the next generation of claude, 2024

Anthropic. Introducing the next generation of claude, 2024

2024

[9] [9]

Introducing Claude Opus 4.7.https://www.anthropic.com/news/claude-opus-4-7,

Anthropic. Introducing Claude Opus 4.7.https://www.anthropic.com/news/claude-opus-4-7,

[10] [10]

Accessed: 2026-05-02

2026

[11] [11]

Qwen Technical Report

Jinze Bai, Shuai Bai, Yunfei Chu, Zeyu Cui, Kai Dang, Xiaodong Deng, Yang Fan, Wenbin Ge, Yu Han, Fei Huang, Binyuan Hui, Luo Ji, Mei Li, Junyang Lin, Runji Lin, Dayiheng Liu, Gao Liu, Chengqiang Lu, Keming Lu, Jianxin Ma, Rui Men, Xingzhang Ren, Xuancheng Ren, Chuanqi Tan, Sinan Tan, Jianhong Tu, Peng Wang, Shijie Wang, Wei Wang, Shengguang Wu, Benfeng X...

work page internal anchor Pith review Pith/arXiv arXiv 2023

[12] [12]

Qwen-VL: A Versatile Vision-Language Model for Understanding, Localization, Text Reading, and Beyond

Jinze Bai, Shuai Bai, Shusheng Yang, Shijie Wang, Sinan Tan, Peng Wang, Junyang Lin, Chang Zhou, and Jingren Zhou. Qwen-vl: A versatile vision-language model for understanding, localization, text reading, and beyond.arXiv preprint arXiv:2308.12966, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023

[13] [13]

Scaling up active testing to large language models.arXiv preprint arXiv:2508.09093, 2025

Gabrielle Berrada, Jannik Kossen, Muhammed Razzak, Freddie Bickford Smith, Yarin Gal, and Tom Rainforth. Scaling up active testing to large language models.arXiv preprint arXiv:2508.09093, 2025. 11

work page arXiv 2025

[14] [14]

PaliGemma: A versatile 3B VLM for transfer

Lucas Beyer, Andreas Steiner, Andre Susano Pinto, Alexander Kolesnikov, Xiao Wang, Daniel Salz, Maxim Neumann, Ibrahim Alabdulmohsin, Michael Tschannen, Emanuele Bugliarello, et al. Paligemma: A versatile 3b vlm for transfer.arXiv preprint arXiv:2407.07726, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024

[15] [15]

Pythia: A Suite for Analyzing Large Language Models Across Training and Scaling

Stella Biderman, Hailey Schoelkopf, Quentin Gregory Anthony , Herbie Bradley , Kyle O’Brien, Eric Hallahan, Mohammad Aflah Khan, Shivanshu Purohit, USVSN Sai Prashanth, Edward Raff, et al. Pythia: A suite for analyzing large language models across training and scaling.arXiv preprint arXiv:2304.01373, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023

[16] [16]

GPT-NeoX-20B: An Open-Source Autoregressive Language Model

Sid Black, Stella Biderman, Eric Hallahan, Quentin Anthony , Leo Gao, Laurence Golding, Horace He, Connor Leahy , Kyle McDonell, Jason Phang, et al. Gpt-neox-20b: An open-source autoregressive language model.arXiv preprint arXiv:2204.06745, 2022

work page internal anchor Pith review Pith/arXiv arXiv 2022

[17] [17]

Findings of the 2014 workshop on statistical machine translation

Ond ˇrej Bojar, Christian Buck, Christian Federmann, Barry Haddow, Philipp Koehn, Johannes Leveling, Christof Monz, Pavel Pecina, Matt Post, Herve Saint-Amand, Radu Soricut, Lucia Specia, and Ale ˇs Tamchyna. Findings of the 2014 workshop on statistical machine translation. InProceedings of the Ninth Workshop on Statistical Machine Translation, pages 12–5...

2014

[18] [18]

Language models are few-shot learners

Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry , Amanda Askell, Sandhini Agarwal, Ariel Herbert-Voss, Gretchen Krueger, Tom Henighan, Rewon Child, Aditya Ramesh, Daniel Ziegler, Jeffrey Wu, Clemens Winter, Chris Hesse, Mark Chen, Eric Sigler, Mateusz Litwin, Scott G...

1901

[19] [19]

Yu, Qiang Yang, and Xing Xie

Yupeng Chang, Xu Wang, Jindong Wang, Yuan Wu, Linyi Yang, Kaijie Zhu, Hao Chen, Xiaoyuan Yi, Cunxiang Wang, Yidong Wang, Wei Ye, Yue Zhang, Yi Chang, Philip S. Yu, Qiang Yang, and Xing Xie. A survey on evaluation of large language models.ACM Trans. Intell. Syst. Technol., 15(3), March 2024

2024

[20] [20]

Humans or LLMs as the judge? a study on judgement bias

Guiming Hardy Chen, Shunian Chen, Ziche Liu, Feng Jiang, and Benyou Wang. Humans or LLMs as the judge? a study on judgement bias. InProceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, pages 8301–8327, Miami, Florida, USA, November 2024. Association for Computational Linguistics

2024

[21] [21]

Hamed Hassani, Amin Karbasi, and Andreas Krause

Yuxin Chen, S. Hamed Hassani, Amin Karbasi, and Andreas Krause. Sequential information maxi- mization: When is greedy near-optimal? In Peter Gr ¨unwald, Elad Hazan, and Satyen Kale, editors, Proceedings of The 28th Conference on Learning Theory, volume 40 ofProceedings of Machine Learning Research, pages 338–363, Paris, France, 03–06 Jul 2015. PMLR

2015

[22] [22]

Gonzalez, Ion Stoica, and Eric P

Wei-Lin Chiang, Zhuohan Li, Zi Lin, Ying Sheng, Zhanghao Wu, Hao Zhang, Lianmin Zheng, Siyuan Zhuang, Yonghao Zhuang, Joseph E. Gonzalez, Ion Stoica, and Eric P. Xing. Vicuna: An open-source chatbot impressing gpt-4 with 90%* chatgpt quality , March 2023

2023

[23] [23]

Training Verifiers to Solve Math Word Problems

Karl Cobbe, Vineet Kosaraju, Mohammad Bavarian, Mark Chen, Heewoo Jun, Lukasz Kaiser, Matthias Plappert, Jerry Tworek, Jacob Hilton, Reiichiro Nakano, Christopher Hesse, and John Schulman. Training verifiers to solve math word problems.arXiv preprint arXiv:2110.14168, 2021

work page internal anchor Pith review Pith/arXiv arXiv 2021

[24] [24]

Cohere api.https://docs.cohere.com

Cohere. Cohere api.https://docs.cohere.com

[25] [25]

Command r: Retrieval-augmented generation at production scale, 2024

Cohere. Command r: Retrieval-augmented generation at production scale, 2024

2024

[26] [26]

Introducing command r+: A scalable llm built for business, 2024

Cohere. Introducing command r+: A scalable llm built for business, 2024. 12

2024

[27] [27]

Committee-based sampling for training probabilistic classifiers

Ido Dagan and Sean P Engelson. Committee-based sampling for training probabilistic classifiers. In Machine Learning Proceedings 1995, pages 150–157. Elsevier, 1995

1995

[28] [28]

Introducing dbrx: A new state-of-the-art open llm.https://www.databricks.com/blog/ introducing-dbrx-new-state-art-open-llm, March 2024

Databricks. Introducing dbrx: A new state-of-the-art open llm.https://www.databricks.com/blog/ introducing-dbrx-new-state-art-open-llm, March 2024. Accessed: 2025-08-31

2024

[29] [29]

GLM: general language model pretraining with autoregressive blank infilling

Zhengxiao Du, Yujie Qian, Xiao Liu, Ming Ding, Jiezhong Qiu, Zhilin Yang, and Jie Tang. GLM: general language model pretraining with autoregressive blank infilling. InProceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), ACL 2022, Dublin, Ireland, May 22-27, 2022, pages 320–335. Association for Co...

2022

[30] [30]

Sample-efficient human evaluation of large language models via maximum discrepancy competition

Kehua Feng, Keyan Ding, Tan Hongzhi, Kede Ma, Zhihua Wang, Shuangquan Guo, Cheng Yuzhou, Ge Sun, Guozhou Zheng, Qiang Zhang, and Huajun Chen. Sample-efficient human evaluation of large language models via maximum discrepancy competition. InProceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), page...

2025

[31] [31]

Open llm leaderboard v2.https://huggingface.co/spaces/open-llm-leaderboard/open_llm_ leaderboard, 2024

Cl ´ementine Fourrier, Nathan Habib, Alina Lozovskaya, Konrad Szafer, and Thomas Wolf. Open llm leaderboard v2.https://huggingface.co/spaces/open-llm-leaderboard/open_llm_ leaderboard, 2024

2024

[32] [32]

Selective sampling using the query by committee algorithm.Machine learning, 28:133–168, 1997

Yoav Freund, H Sebastian Seung, Eli Shamir, and Naftali Tishby . Selective sampling using the query by committee algorithm.Machine learning, 28:133–168, 1997

1997

[33] [33]

Bayesian active model selection with an application to automated audiometry

Jacob Gardner, Gustavo Malkomes, Roman Garnett, Kilian Q Weinberger, Dennis Barbour, and John P Cunningham. Bayesian active model selection with an application to automated audiometry . In C. Cortes, N. Lawrence, D. Lee, M. Sugiyama, and R. Garnett, editors,Advances in Neural Information Processing Systems, volume 28. Curran Associates, Inc., 2015

2015

[34] [34]

Gemini: A Family of Highly Capable Multimodal Models

Google. Gemini: A family of highly capable multimodal models.arXiv preprint arXiv:2312.11805, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023

[35] [35]

Gemma open models, 2024

Google. Gemma open models, 2024

2024

[36] [36]

Gemini api.https://ai.google.dev

Google DeepMind. Gemini api.https://ai.google.dev

[37] [37]

Gemini 3.1 Pro Model Card.https://deepmind.google/models/model-cards/ gemini-3-1-pro/, 2026

Google DeepMind. Gemini 3.1 Pro Model Card.https://deepmind.google/models/model-cards/ gemini-3-1-pro/, 2026. Accessed: 2026-05-02

2026

[38] [38]

The Flores-101 evaluation benchmark for low-resource and multilingual machine translation.Transactions of the Association for Computational Linguistics, 10:522–538, 2022

Naman Goyal, Cynthia Gao, Vishrav Chaudhary , Peng-Jen Chen, Guillaume Wenzek, Da Ju, San- jana Krishnan, Marc’Aurelio Ranzato, Francisco Guzm´an, and Angela Fan. The Flores-101 evaluation benchmark for low-resource and multilingual machine translation.Transactions of the Association for Computational Linguistics, 10:522–538, 2022

2022

[39] [39]

Making the v in vqa matter: Elevating the role of image understanding in visual question answering

Yash Goyal, Tejas Khot, Douglas Summers-Stay , Dhruv Batra, and Devi Parikh. Making the v in vqa matter: Elevating the role of image understanding in visual question answering. InProceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 6904–6913, July 2017

2017

[40] [40]

Rating roulette: Self-inconsistency in LLM-as-a-judge frame- works

Rajarshi Haldar and Julia Hockenmaier. Rating roulette: Self-inconsistency in LLM-as-a-judge frame- works. InFindings of the Association for Computational Linguistics: EMNLP 2025, pages 24986–25004, Suzhou, China, November 2025. Association for Computational Linguistics

2025

[41] [41]

DeBERTa: Decoding-enhanced BERT with disentangled attention

Pengcheng He, Xiaodong Liu, Jianfeng Gao, and Weizhu Chen. DeBERTa: Decoding-enhanced BERT with disentangled attention. InInternational Conference on Learning Representations, 2021

2021

[42] [42]

Measuring massive multitask language understanding

Dan Hendrycks, Collin Burns, Steven Basart, Andy Zou, Mantas Mazeika, Dawn Song, and Jacob Steinhardt. Measuring massive multitask language understanding. InInternational Conference on Learning Representations, 2021. 13

2021

[43] [43]

Measuring mathematical problem solving with the math dataset

Dan Hendrycks, Collin Burns, Saurav Kadavath, Akul Arora, Steven Basart, Eric Tang, Dawn Song, and Jacob Steinhardt. Measuring mathematical problem solving with the math dataset. InAdvances in Neural Information Processing Systems, volume 34, 2021

2021

[44] [44]

Teaching machines to read and comprehend

Karl Moritz Hermann, Tomas Kocisky , Edward Grefenstette, Lasse Espeholt, Will Kay , Mustafa Su- leyman, and Phil Blunsom. Teaching machines to read and comprehend. InAdvances in Neural Information Processing Systems, volume 28, 2015

2015

[45] [45]

Actracer: Active testing of large language model via multi-stage sampling.ACM Transactions on Software Engineering and Methodology, 2025

Yuheng Huang, Jiayang Song, Qiang Hu, Felix Juefei-Xu, and Lei Ma. Actracer: Active testing of large language model via multi-stage sampling.ACM Transactions on Software Engineering and Methodology, 2025

2025

[46] [46]

Hugging face hub.https://huggingface.co

Hugging Face. Hugging face hub.https://huggingface.co

[47] [47]

Introducing idefics: An open reproduction of flamingo.https://huggingface.co/ blog/idefics, 2023

Hugging Face. Introducing idefics: An open reproduction of flamingo.https://huggingface.co/ blog/idefics, 2023

2023

[48] [48]

Internlm: A multilingual language model with progressively enhanced capabilities.https: //github.com/InternLM/InternLM-techreport, 2023

InternLM. Internlm: A multilingual language model with progressively enhanced capabilities.https: //github.com/InternLM/InternLM-techreport, 2023

2023

[49] [49]

Smith, Iz Beltagy , and Hannaneh Hajishirzi

Hamish Ivison, Yizhong Wang, Valentina Pyatkin, Nathan Lambert, Matthew Peters, Pradeep Dasigi, Joel Jang, David Wadden, Noah A. Smith, Iz Beltagy , and Hannaneh Hajishirzi. Camels in a changing climate: Enhancing lm adaptation with tulu 2, 2023

2023

[50] [50]

Albert Q. Jiang, Alexandre Sablayrolles, Arthur Mensch, Chris Bamford, Devendra Singh Chaplot, Diego de las Casas, Florian Bressand, Gianna Lengyel, Guillaume Lample, Lucile Saulnier, L ´elio Re- nard Lavaud, Marie-Anne Lachaux, Pierre Stock, Teven Le Scao, Thibaut Lavril, Thomas Wang, Tim- oth´ee Lacroix, and William El Sayed. Mistral 7b.arXiv preprint a...

work page internal anchor Pith review Pith/arXiv arXiv 2023

[51] [51]

Albert Q. Jiang, Alexandre Sablayrolles, Antoine Roux, Arthur Mensch, Blanche Savary , Chris Bam- ford, Devendra Singh Chaplot, Diego de las Casas, Emma Bou Hanna, Florian Bressand, Gianna Lengyel, Guillaume Bour, Guillaume Lample, L ´elio Renard Lavaud, Lucile Saulnier, Marie-Anne Lachaux, Pierre Stock, Sandeep Subramanian, Sophia Yang, Szymon Antoniak, ...

work page internal anchor Pith review Pith/arXiv arXiv 2024

[52] [52]

Prompt packer: Deceiving llms through compositional instruction with hidden attacks.arXiv preprint arXiv:2310.10077, 2023

Shuyu Jiang, Xingshu Chen, and Rui Tang. Prompt packer: Deceiving llms through compositional instruction with hidden attacks.arXiv preprint arXiv:2310.10077, 2023

work page arXiv 2023

[53] [53]

Online active model selection for pre-trained classifiers

Mohammad Reza Karimi, Nezihe Merve G ¨urel, Bojan Karlaˇs, Johannes Rausch, Ce Zhang, and An- dreas Krause. Online active model selection for pre-trained classifiers. InInternational Conference on Artificial Intelligence and Statistics (AISTATS), pages 307–315. PMLR, April 2021

2021

[54] [54]

Anytime model selection in linear bandits

Parnian Kassraie, Nicolas Emmenegger, Andreas Krause, and Aldo Pacchiano. Anytime model selection in linear bandits. InProc. Neural Information Processing Systems (NeurIPS), December 2023

2023

[55] [55]

Consensus-driven active model selection

Justin Kay , Grant Van Horn, Subhransu Maji, Daniel Sheldon, and Sara Beery . Consensus-driven active model selection. InProceedings of the IEEE/CVF International Conference on Computer Vision (ICCV),

[56] [56]

The NarrativeQA reading comprehension challenge.Transactions of the Association for Computational Linguistics, 6:317–328, 2018

Tom´aˇs Koˇcisk´y, Jonathan Schwarz, Phil Blunsom, Chris Dyer, Karl Moritz Hermann, G ´abor Melis, and Edward Grefenstette. The NarrativeQA reading comprehension challenge.Transactions of the Association for Computational Linguistics, 6:317–328, 2018

2018

[57] [57]

Large language models are zero-shot reasoners

Takeshi Kojima, Shixiang (Shane) Gu, Machel Reid, Yutaka Matsuo, and Yusuke Iwasawa. Large language models are zero-shot reasoners. InAdvances in Neural Information Processing Systems, vol- ume 35, pages 22199–22213, 2022. 14

2022

[58] [58]

Active testing: Sample-efficient model evaluation

Jannik Kossen, Sebastian Farquhar, Yarin Gal, and Tom Rainforth. Active testing: Sample-efficient model evaluation. In Marina Meila and Tong Zhang, editors,Proceedings of the 38th International Conference on Machine Learning, volume 139 ofProceedings of Machine Learning Research, pages 5753–

[59] [59]

PMLR, 18–24 Jul 2021

2021

[60] [60]

Toutanova, Llion Jones, Ming-Wei Chang, Andrew Dai, Jakob Uszkoreit, Quoc Le, and Slav Petrov

Tom Kwiatkowski, Jennimaria Palomaki, Olivia Redfield, Michael Collins, Ankur Parikh, Chris Alberti, Danielle Epstein, Illia Polosukhin, Matthew Kelcey , Jacob Devlin, Kenton Lee, Kristina N. Toutanova, Llion Jones, Ming-Wei Chang, Andrew Dai, Jakob Uszkoreit, Quoc Le, and Slav Petrov. Natural ques- tions: A benchmark for question answering research.Trans...

2019

[61] [61]

Vhelm: A holistic evaluation of vision language models

Tony Lee, Haoqin Tu, Chi H Wong, Wenhao Zheng, Yiyang Zhou, Yifan Mai, Josselin S Roberts, Michi- hiro Yasunaga, Huaxiu Yao, Cihang Xie, et al. Vhelm: A holistic evaluation of vision language models. Advances in Neural Information Processing Systems, 37:140632–140666, 2024

2024

[62] [62]

Towards optimal evaluation efficiency for large language models

Guohong Li and Deyi Xiong. Towards optimal evaluation efficiency for large language models. In Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing, pages 14176– 14183, Suzhou, China, 2025. Association for Computational Linguistics

2025

[63] [63]

On the necessity of collaboration in online model selection with decentralized data.arXiv preprint arXiv:2404.09494, 2024

Junfan Li, Zenglin Xu, Zheshun Wu, and Irwin King. On the necessity of collaboration in online model selection with decentralized data.arXiv preprint arXiv:2404.09494, 2024

work page arXiv 2024

[64] [64]

Online foun- dation model selection in robotics.arXiv preprint arXiv:2402.08570, 2024

Po-han Li, Oyku Selin Toprak, Aditya Narayanan, Ufuk Topcu, and Sandeep Chinchali. Online foun- dation model selection in robotics.arXiv preprint arXiv:2402.08570, 2024

work page arXiv 2024

[65] [65]

Gonzalez, and Ion Stoica

Tianle Li, Wei-Lin Chiang, Evan Frick, Lisa Dunlap, Banghua Zhu, Joseph E. Gonzalez, and Ion Stoica. From live data to high-quality benchmarks: The arena-hard pipeline, April 2024

2024

[66] [66]

Hashimoto

Xuechen Li, Tianyi Zhang, Yann Dubois, Rohan Taori, Ishaan Gulrajani, Carlos Guestrin, Percy Liang, and Tatsunori B. Hashimoto. Alpacaeval: An automatic evaluator of instruction-following models. https://github.com/tatsu-lab/alpaca_eval, 5 2023

2023

[67] [67]

Active evaluation acqui- sition for efficient LLM benchmarking

Yang Li, Jie Ma, Miguel Ballesteros, Yassine Benajiba, and Graham Horwood. Active evaluation acqui- sition for efficient LLM benchmarking. InProceedings of the 42nd International Conference on Machine Learning, volume 267 ofProceedings of Machine Learning Research, pages 35581–35602, 2025

2025

[68] [68]

Percy Liang, Rishi Bommasani, Tony Lee, Dimitris Tsipras, Dilara Soylu, Michihiro Yasunaga, Yian Zhang, Deepak Narayanan, Yuhuai Wu, Ananya Kumar, Benjamin Newman, Binhang Yuan, Bobby Yan, Ce Zhang, Christian Alexander Cosgrove, Christopher D Manning, Christopher Re, Diana Acosta- Navas, Drew Arad Hudson, Eric Zelikman, Esin Durmus, Faisal Ladhak, Frieda ...

2023

[69] [69]

Helm lite: Lightweight and broad capabilities evaluation

Percy Liang, Yifan Mai, Josselin Somerville, Farzaan Kaiyom, Tony Lee, and Rishi Bommasani. Helm lite: Lightweight and broad capabilities evaluation. Stanford CRFM blog, 2023

2023

[70] [70]

Active model selection for positive unlabeled time series classification

Shen Liang, Yanchun Zhang, and Jiangang Ma. Active model selection for positive unlabeled time series classification. In2020 IEEE 36th International Conference on Data Engineering (ICDE), pages 361–372, 2020

2020

[71] [71]

ROUGE: A package for automatic evaluation of summaries

Chin-Yew Lin. ROUGE: A package for automatic evaluation of summaries. InText Summarization Branches Out, pages 74–81, Barcelona, Spain, July 2004. Association for Computational Linguistics. 15

2004

[72] [72]

Contextual active online model selection with expert advice

Xuefeng Liu, Fangfang Xia, Rick L Stevens, and Yuxin Chen. Contextual active online model selection with expert advice. InICML2022 Workshop on Adaptive Experimental Design and Active Learning in the Real World. ICML, 2022

2022

[73] [73]

Maas, Raymond E

Andrew L. Maas, Raymond E. Daly , Peter T. Pham, Dan Huang, Andrew Y. Ng, and Christopher Potts. Learning word vectors for sentiment analysis. InProceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies, pages 142–150, Portland, Oregon, USA, 2011. Association for Computational Linguistics

2011

[74] [74]

Lizotte, and Russell Greiner

Omid Madani, Daniel J. Lizotte, and Russell Greiner. Active model selection, 2012

2012

[75] [75]

tinyBenchmarks: evaluating LLMs with fewer examples

Felipe Maia Polo, Lucas Weber, Leshem Choshen, Yuekai Sun, Gongjun Xu, and Mikhail Yurochkin. tinyBenchmarks: evaluating LLMs with fewer examples. In Ruslan Salakhutdinov, Zico Kolter, Kather- ine Heller, Adrian Weller, Nuria Oliver, Jonathan Scarlett, and Felix Berkenkamp, editors,Proceedings of the 41st International Conference on Machine Learning, volu...

2024

[76] [76]

Active model selection: A variance minimization approach

Mitsuru Matsuura and Satoshi Hara. Active model selection: A variance minimization approach. In NeurIPS 2023 Workshop on Adaptive Experimental Design and Active Learning in the Real World, 2023

2023

[77] [77]

Introducing meta llama 3: The most capable openly available llm to date, 2024

Meta AI. Introducing meta llama 3: The most capable openly available llm to date, 2024

2024

[78] [78]

Pixtral 12b.https://mistral.ai/news/pixtral-12b, 2024

Mistral AI. Pixtral 12b.https://mistral.ai/news/pixtral-12b, 2024

2024

[79] [79]

Introducing mpt-30b: Raising the bar for open-source foundation models.https://www

MosaicML. Introducing mpt-30b: Raising the bar for open-source foundation models.https://www. mosaicml.com/blog/mpt-30b, 2023

2023

[80] [80]

Cohen, and Mirella Lapata

Shashi Narayan, Shay B. Cohen, and Mirella Lapata. Don’t give me the details, just the summary! topic-aware convolutional neural networks for extreme summarization. In Ellen Riloff, David Chiang, Julia Hockenmaier, and Jun’ichi Tsujii, editors,Proceedings of the 2018 Conference on Empirical Meth- ods in Natural Language Processing, pages 1797–1807, Brusse...

2018