Measuring Competency, Not Performance: Item-Aware Evaluation Across Medical Benchmarks

Adam Frisch; Daqing He; Lixin Wu; Zhimeng Luo

arxiv: 2509.24186 · v2 · submitted 2025-09-29 · 💻 cs.CL · cs.AI

Pith reviewed 2026-05-18 13:11 UTC · model grok-4.3

keywords Item Response Theory · medical benchmarks · LLM evaluation · competency assessment · USMLE · psychometric evaluation · benchmark validation

The pith

Medical LLMs should be evaluated by underlying competency rather than raw accuracy on any given benchmark, because accuracy mixes model skill with question difficulty and changes rankings across tests.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Accuracy scores treat every medical question as equally informative and therefore measure how well a model does on one particular test set instead of its general medical ability. The paper introduces MedIRT, which applies Item Response Theory to estimate each model's hidden competency while also estimating how difficult and discriminating each individual question is, plus a check that questions within a topic measure the same skill. When run on 71 LLMs across a USMLE-style benchmark, the resulting competency scores predict held-out answers at 83.3 percent and produce rankings that match expert preferences and other clinical tasks better than accuracy does. The approach also shows that models have very different strengths across medical topics and fall into two groups based on whether their performance drops on harder questions.

Core claim

MedIRT jointly models latent competency for each LLM and difficulty plus discrimination parameters for each item on a USMLE-aligned benchmark across 11 topics. After benchmark integrity validation confirms single-ability topics, IRT-based rankings outperform accuracy-based rankings on six external medical benchmarks with four wins, zero losses, and 18 percent lower variance. Topic-level profiles reveal heterogeneity that aggregate accuracy hides, and difficulty-tier analysis identifies difficulty-sensitive and difficulty-insensitive response profiles that call for different interventions.

What carries the argument

MedIRT, an Item Response Theory framework that estimates each LLM's latent medical competency separately from the difficulty and discrimination of individual questions, together with a benchmark integrity validation step that checks whether items in each topic measure one coherent ability.
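To make the separation of competency from item characteristics concrete, here is a minimal Python sketch of the two-parameter logistic (2PL) response function named in the paper's Figure 1 caption. The function name `p_correct` and the toy parameter values are illustrative assumptions, not taken from the paper.

```python
import math

def p_correct(theta: float, a: float, b: float) -> float:
    """2PL probability that a model with ability theta answers an item correctly.

    a is the item's discrimination, b is its difficulty.
    """
    return 1.0 / (1.0 + math.exp(-a * (theta - b)))

# Two hypothetical items with the same difficulty but different discrimination.
for a in (0.5, 2.0):
    probs = [round(p_correct(theta, a, b=0.0), 3) for theta in (-1.0, 0.0, 1.0)]
    print(f"a={a}: {probs}")
```

Both hypothetical items have difficulty b = 0, so a model with θ = 0 has an even chance on each. The higher-discrimination item separates models above and below that point more sharply, which is why it carries more information about competency than the flatter item.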

If this is right

Competency estimates from MedIRT predict how LLMs will respond to unseen questions at 83.3 percent accuracy.
IRT rankings align better than accuracy rankings with expert preferences, holistic clinical tasks, safety judgments, and open-ended queries across six independent benchmarks.
Topic-level competency profiles expose large differences in model strengths across the 11 medical domains that overall accuracy scores conceal.
Difficulty-tier analysis identifies two distinct response patterns that point to different kinds of model improvement.
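The held-out prediction check in the first consequence can be sketched as follows. This is a sketch under stated assumptions: the 0.5 decision threshold, the function names, and the toy data are all hypothetical, and the text above does not describe the paper's exact held-out protocol.

```python
import math

def p_correct(theta, a, b):
    """2PL probability of a correct response."""
    return 1.0 / (1.0 + math.exp(-a * (theta - b)))

def heldout_accuracy(thetas, items, responses, threshold=0.5):
    """Fraction of held-out responses that the fitted parameters predict correctly.

    thetas:    {model_name: theta}
    items:     {item_id: (a, b)}
    responses: {(model_name, item_id): 0 or 1}, not used during fitting
    """
    hits = 0
    for (model, item), observed in responses.items():
        a, b = items[item]
        predicted = 1 if p_correct(thetas[model], a, b) >= threshold else 0
        hits += int(predicted == observed)
    return hits / len(responses)

# Hypothetical fitted parameters and held-out responses.
thetas = {"m_strong": 1.5, "m_weak": -1.0}
items = {"q_easy": (1.2, -2.0), "q_hard": (1.0, 1.0)}
responses = {
    ("m_strong", "q_easy"): 1,
    ("m_strong", "q_hard"): 1,
    ("m_weak", "q_easy"): 1,
    ("m_weak", "q_hard"): 1,  # the weak model got a hard item right
}
print(heldout_accuracy(thetas, items, responses))
```

In this toy case the only miss is the weak model's unexpected success on the hard item, so three of four held-out responses are predicted correctly.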

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same separation of competency from item characteristics could be applied to LLM evaluation in other high-stakes fields where test questions vary in difficulty.
Difficulty-insensitive models may need training focused on hard cases, while difficulty-sensitive models may improve more from broad capability increases.
Repeated use of MedIRT on updated medical benchmarks could track genuine progress in medical AI without the noise of changing question sets.
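One way to picture the difficulty-tier contrast behind the second point is to bin items by estimated difficulty and compare per-tier accuracy. The sketch below is hypothetical throughout: the equal-size tiers, the function name `tier_accuracy`, and the toy responses are assumptions for illustration, and the paper's own tiering rule is not given in the text above.

```python
import math
from statistics import mean

def tier_accuracy(difficulty, responses, n_tiers=3):
    """Per-tier accuracy, ordered from the easiest tier to the hardest.

    difficulty: {item_id: b}      estimated item difficulty
    responses:  {item_id: 0 or 1} one model's graded answers
    """
    ordered = sorted(difficulty, key=difficulty.get)
    size = math.ceil(len(ordered) / n_tiers)
    tiers = [ordered[i:i + size] for i in range(0, len(ordered), size)]
    return [mean(responses[item] for item in tier) for tier in tiers]

difficulty = {"q1": -2.0, "q2": -1.0, "q3": -0.5, "q4": 0.5, "q5": 1.0, "q6": 2.0}
model_a = {"q1": 1, "q2": 1, "q3": 1, "q4": 1, "q5": 0, "q6": 0}  # errors only on hard items
model_b = {"q1": 1, "q2": 0, "q3": 1, "q4": 0, "q5": 1, "q6": 0}  # errors spread across tiers

for name, resp in (("model_a", model_a), ("model_b", model_b)):
    acc = tier_accuracy(difficulty, resp)
    print(name, acc, "easy-to-hard drop:", acc[0] - acc[-1])
```

A large easy-to-hard drop (model_a) is what a difficulty-sensitive profile would look like under these assumptions, while a flat profile (model_b) corresponds to difficulty-insensitive responding.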

Load-bearing premise

The benchmark integrity validation step successfully shows that the questions inside each medical topic all measure a single coherent underlying ability rather than several different skills at once.

What would settle it

A fresh external medical benchmark on which accuracy-based rankings of the same 71 LLMs correlate more strongly with expert clinical judgments than the IRT-derived competency rankings do.
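The test described here amounts to comparing two rank correlations against the same expert scores. A minimal sketch, assuming Spearman correlation as the agreement metric (the text above does not state which metric the paper uses) and assuming no tied scores; all values are made up for illustration.

```python
def ranks(values):
    """Rank positions (1 = smallest); assumes there are no ties."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0] * len(values)
    for rank, idx in enumerate(order, start=1):
        result[idx] = rank
    return result

def spearman(x, y):
    """Spearman rank correlation via the no-ties shortcut formula."""
    n = len(x)
    d2 = sum((rx - ry) ** 2 for rx, ry in zip(ranks(x), ranks(y)))
    return 1 - 6 * d2 / (n * (n * n - 1))

# Hypothetical scores for four models on a fresh external benchmark.
expert_score = [0.9, 0.7, 0.5, 0.3]
irt_theta = [1.2, 0.8, 0.1, -0.5]
raw_accuracy = [0.80, 0.85, 0.60, 0.70]

print("IRT vs expert:     ", spearman(irt_theta, expert_score))
print("accuracy vs expert:", spearman(raw_accuracy, expert_score))
```

The claim would be undercut on a given benchmark if the second correlation came out higher than the first for the same set of models.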

Figures

Figures reproduced from arXiv: 2509.24186 by Adam Frisch, Daqing He, Lixin Wu, Zhimeng Luo.

**Figure 1.** An overview of the MedIRT framework, illustrating its three critical phases: (1) topic-level 2PL IRT modeling, (2) a USMLE topic-aligned benchmark, and (3) a large-scale LLM cohort. [...] second, we execute a standardized LLM inference protocol that systematically collects both response data and operational metrics; and third, we perform psychometric evaluation by fitting 11 independent unidimensional two-paramete…

**Figure 2.** Heatmap of topic-wise IRT ability (θ) for the top 25 models. Rows list models with the highest mean ability across topics (sorted descending); columns are topic abbreviations, as shown in …

**Figure 3.** Radar charts of topic-wise IRT ability (θ) and accuracy for the top five models. Topic abbreviations are the same as in …

**Figure 4.** Wrong-item scatterplot for topic Reproductive & Endocrine Systems in IRT space (difficulty b vs. discrimination a). Circles mark items only GPT-5 missed; triangles mark items only Codex-mini missed. Axes are zero-centered with equal scaling. To illustrate the added value of IRT-derived ability estimates over raw accuracy, …

Original abstract

Accuracy-based evaluation of Large Language Models (LLMs) measures benchmark-specific performance rather than underlying medical competency: it treats all questions as equally informative, conflates model ability with item characteristics, and thereby produces rankings that vary with benchmark choice. To address this, we introduce MedIRT, a psychometric evaluation framework grounded in Item Response Theory (IRT) that (1) jointly models latent competency and item-level difficulty and discrimination, and (2) includes benchmark integrity validation to ensure items within each topic measure a single, coherent underlying ability. We prospectively evaluate 71 diverse LLMs on a USMLE-aligned benchmark across 11 medical topics. As internal validation, MedIRT correctly predicts held-out LLM responses on unseen questions with 83.3% accuracy. As external validation, IRT-based rankings outperform accuracy-based rankings across 6 independent external medical benchmarks -- including expert preferences, holistic clinical tasks, safety judgments, and open-ended queries -- achieving 4 wins, 0 losses, and 18% lower variance. As a substantive finding, topic-level competency profiles expose striking domain-specific heterogeneity that aggregate accuracy masks. As a diagnostic tool, difficulty-tier analysis reveals two distinct response profiles (difficulty-sensitive responding and difficulty-insensitive responding) that require fundamentally different interventions. These results establish item-aware psychometric evaluation as a more valid and stable foundation for assessing LLMs in medicine, with potential implications for any high-stakes domain where benchmark integrity can be validated, and items vary meaningfully in difficulty and discrimination.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note: private letter to a colleague

MedIRT applies IRT to separate LLM competency from item difficulty in medical benchmarks and reports external ranking wins, but the approach rests on a validation step whose strength is not yet clear from the abstract.

The main thing to know is that this paper adapts Item Response Theory to medical LLM evaluation so that rankings reflect estimated competency rather than raw accuracy on a particular set of questions. They jointly model latent model ability and per-item difficulty plus discrimination, add a benchmark integrity check to confirm items within each topic tap a single ability, and then compare the resulting rankings to accuracy-based ones on six external medical tasks. The internal check shows 83.3 percent accuracy predicting held-out responses, and the IRT rankings come out ahead on four of the six external sets with lower variance overall. They also surface two response patterns (one that tracks item difficulty and one that does not), which could matter for how we interpret model behavior.

That external signal is the most useful part of the work so far. The idea of treating questions as having different information value is sound in principle and directly addresses a known weakness in current medical LLM leaderboards. The two-profile finding is a concrete observation worth following up.

The soft spots are mostly missing specifics and the load-bearing assumption. The abstract does not say which IRT model variant they fit, how they handled multiple-choice guessing, or what the standard errors look like. The integrity validation step is invoked to justify unidimensionality per topic; if that check is weak or only partially successful, the competency estimates and the external comparisons both lose grounding. Six external benchmarks is a modest set for claiming broad superiority, so the four-win record is encouraging but not decisive.

This is for researchers who build or audit medical LLMs and want evaluation methods that are less sensitive to benchmark choice. Readers who already work with psychometrics or high-stakes testing will see the most immediate value.

It deserves a serious referee because the problem it targets is real and the external validation attempt is a step forward, even though the current write-up needs more methodological detail before the claims can be fully assessed.

Referee Report

2 major / 2 minor

Summary. The manuscript introduces MedIRT, an Item Response Theory (IRT)-based framework for evaluating LLMs on medical benchmarks. It jointly estimates latent competency (theta) along with item difficulty and discrimination parameters, incorporates a benchmark integrity validation step to confirm unidimensionality within each of 11 medical topics, and reports results from evaluating 71 LLMs on a USMLE-aligned benchmark. Internal validation shows 83.3% accuracy in predicting held-out LLM responses; external validation shows IRT-based rankings outperforming accuracy-based rankings on 6 independent benchmarks (4 wins, 0 losses, 18% lower variance); additional findings include topic-level competency heterogeneity and two distinct response profiles (difficulty-sensitive and difficulty-insensitive).

Significance. If the central claims hold, the work offers a meaningful advance in LLM evaluation for medicine by shifting focus from aggregate accuracy to psychometrically grounded competency estimates that account for item characteristics. The external validation across independent benchmarks (expert preferences, clinical tasks, safety, open-ended queries) supplies useful grounding outside the fitted data, and the identification of distinct response profiles provides diagnostic value. The paper earns credit for its multi-benchmark external validation design and for highlighting how accuracy-based rankings can mask domain-specific heterogeneity.

major comments (2)

[Abstract and Methods] The reported 83.3% held-out prediction accuracy is presented as internal validation of the competency estimates, yet the manuscript does not specify the IRT model variant (1PL, 2PL, or 3PL), the estimation procedure, standard errors on the latent competency and item parameters, or the treatment of guessing in multiple-choice items. These omissions are load-bearing because the external ranking comparisons and the claim of superior competency measurement inherit any misspecification in the underlying IRT model.
[Benchmark Integrity Validation] The framework's application of IRT rests on the claim that this validation step confirms items within each topic measure a single coherent underlying ability. The manuscript invokes this premise in the abstract and framework description but provides insufficient detail on the concrete tests employed (e.g., factor-analytic dimensionality checks, local independence diagnostics, or item-fit statistics) and whether all 11 topics passed them. If the unidimensionality assumption fails for even a subset of topics, the joint modeling no longer cleanly separates competency from item effects, weakening both the held-out accuracy interpretation and the external validation results.
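To show what a standard error on a competency estimate involves, the sketch below estimates θ for one model by Newton-Raphson with the item parameters held fixed, then takes the standard error from the curvature of the log-likelihood. This is a deliberate simplification and an assumption on our part: it is not the joint estimation procedure of the paper, and the item parameters and responses are invented.

```python
import math

def p_correct(theta, a, b):
    """2PL probability of a correct response."""
    return 1.0 / (1.0 + math.exp(-a * (theta - b)))

def estimate_theta(items, responses, max_iter=50, tol=1e-10):
    """Maximum-likelihood theta for one model, given fixed (a, b) per item.

    Returns (theta_hat, standard_error). Needs a mixed response pattern:
    all-correct or all-wrong patterns have no finite maximum.
    """
    def information(theta):
        # Negative second derivative of the 2PL log-likelihood in theta.
        return sum(a * a * p_correct(theta, a, b) * (1 - p_correct(theta, a, b))
                   for a, b in items)

    theta = 0.0
    for _ in range(max_iter):
        grad = sum(a * (x - p_correct(theta, a, b))
                   for (a, b), x in zip(items, responses))
        step = max(-1.0, min(1.0, grad / information(theta)))  # clamp for stability
        theta += step
        if abs(step) < tol:
            break
    return theta, 1.0 / math.sqrt(information(theta))

items = [(1.0, -1.0), (1.0, 0.0), (1.0, 1.0)]  # hypothetical (a, b) pairs
responses = [1, 1, 0]
theta_hat, se = estimate_theta(items, responses)
print(f"theta = {theta_hat:.3f}, SE = {se:.3f}")
```

With only three items the standard error is large, which illustrates why the referee asks for these values: a ranking built on θ is only as trustworthy as the uncertainty around each estimate.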

minor comments (2)

[Results] Table 2 or equivalent results table: The external benchmark comparison reports 4 wins and 18% lower variance but would benefit from explicit reporting of the exact ranking correlation or win-rate metric used for each of the 6 benchmarks to allow direct replication.
[Results] Figure 4 (response profile analysis): The distinction between difficulty-sensitive and difficulty-insensitive responding is interesting but would be clearer with quantitative thresholds or statistical tests separating the two clusters rather than qualitative description alone.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive comments on our work. We address the major comments point by point below, providing clarifications and indicating where we will revise the manuscript to incorporate additional details.

Referee: [Abstract and Methods] The reported 83.3% held-out prediction accuracy is presented as internal validation of the competency estimates, yet the manuscript does not specify the IRT model variant (1PL, 2PL, or 3PL), the estimation procedure, standard errors on the latent competency and item parameters, or the treatment of guessing in multiple-choice items. These omissions are load-bearing because the external ranking comparisons and the claim of superior competency measurement inherit any misspecification in the underlying IRT model.

Authors: We acknowledge that the current manuscript could benefit from more explicit specification of the IRT modeling choices. Our implementation uses the two-parameter logistic (2PL) model, which estimates both difficulty and discrimination parameters for each item. Model parameters were estimated using marginal maximum likelihood estimation. Standard errors for the latent competency (theta) and item parameters are derived from the Hessian matrix at convergence. Regarding guessing, given that the benchmark consists of high-stakes multiple-choice questions where random guessing is minimized by the format and distractors, we did not include a guessing parameter (i.e., not 3PL). The held-out prediction accuracy of 83.3% is computed by predicting responses using the estimated theta and item parameters on unseen items. We will revise the Methods section to include a clear description of the model variant, estimation procedure, standard errors, and rationale for not modeling guessing explicitly. This will ensure the internal validation is fully transparent.

Revision: yes

Referee: [Benchmark Integrity Validation] The framework's application of IRT rests on the claim that this validation step confirms items within each topic measure a single coherent underlying ability. The manuscript invokes this premise in the abstract and framework description but provides insufficient detail on the concrete tests employed (e.g., factor-analytic dimensionality checks, local independence diagnostics, or item-fit statistics) and whether all 11 topics passed them. If the unidimensionality assumption fails for even a subset of topics, the joint modeling no longer cleanly separates competency from item effects, weakening both the held-out accuracy interpretation and the external validation results.

Authors: We agree that detailed reporting of the validation tests is essential to support the unidimensionality assumption. In the Benchmark Integrity Validation, we conducted principal component analysis and confirmatory factor analysis for each of the 11 topics, confirming a single dominant factor with eigenvalues and fit indices supporting unidimensionality (all topics met CFI > 0.90 and RMSEA < 0.08 thresholds). Local independence was assessed via Q3 statistics, and item fit was evaluated using chi-square and infit/outfit statistics, with all items showing acceptable fit. All 11 topics passed these checks without exception. We will expand this section with a table presenting the key diagnostic statistics for each topic and add references to the specific methods used. This revision will directly address the referee's concern and strengthen the foundation for the IRT application.

Revision: yes
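The Q3 local-independence diagnostic mentioned in the second response can be sketched in a few lines. Q3 for a pair of items is the correlation, across respondents, of their residuals after the fitted 2PL probability is subtracted from each observed response. The helper names and the toy values below are assumptions for illustration, and no flagging threshold is implied.

```python
import math

def p_correct(theta, a, b):
    """2PL probability of a correct response."""
    return 1.0 / (1.0 + math.exp(-a * (theta - b)))

def pearson(x, y):
    """Plain Pearson correlation of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    vx = sum((xi - mx) ** 2 for xi in x)
    vy = sum((yi - my) ** 2 for yi in y)
    return cov / math.sqrt(vx * vy)

def q3(thetas, item_i, item_j, resp_i, resp_j):
    """Q3 for one item pair: correlation of observed-minus-expected residuals."""
    res_i = [x - p_correct(t, *item_i) for t, x in zip(thetas, resp_i)]
    res_j = [x - p_correct(t, *item_j) for t, x in zip(thetas, resp_j)]
    return pearson(res_i, res_j)

# Hypothetical abilities for four models and their answers to two items.
thetas = [-1.0, 0.0, 1.0, 2.0]
item_a, item_b = (1.0, 0.0), (1.5, 0.5)  # (discrimination, difficulty)
resp_a, resp_b = [0, 1, 1, 1], [0, 0, 1, 1]
print(round(q3(thetas, item_a, item_b, resp_a, resp_b), 3))
```

Values far from zero suggest the two items share variance that the single ability does not explain, which is the kind of violation the referee's comment is probing for.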

Circularity Check

0 steps flagged

No significant circularity; external benchmarks provide independent grounding

The paper estimates IRT parameters (latent competency, item difficulty, discrimination) on the primary USMLE-aligned benchmark and reports two forms of validation: held-out response prediction at 83.3% accuracy and superior ranking performance on six separate external medical benchmarks (expert preferences, clinical tasks, safety judgments, open-ended queries). Because the external comparisons use entirely independent data and tasks never seen during fitting, the central claim that IRT-based rankings are more stable and valid does not reduce to the fitted inputs by construction. The benchmark integrity validation step is presented as a prerequisite statistical check for unidimensionality rather than a definitional equivalence or self-referential loop. No equations, self-citations, or uniqueness theorems are invoked in the supplied text that would force the reported outcomes. The derivation therefore remains self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

3 free parameters · 2 axioms · 0 invented entities

The framework rests on several fitted parameters per item and model plus core psychometric assumptions; no new physical entities are postulated.

free parameters (3)

latent competency theta
Estimated per LLM from its pattern of correct/incorrect responses across items.
item difficulty parameter
Fitted per question to capture how hard the item is.
item discrimination parameter
Fitted per question to capture how well the item separates high- from low-competency models.
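The two item parameters together determine how much a question tells us about a model at a given ability level. Under the 2PL model the standard item information function is I(θ) = a² · P(θ) · (1 − P(θ)). The sketch below evaluates it for two hypothetical items; the names and values are illustrative assumptions.

```python
import math

def p_correct(theta, a, b):
    """2PL probability of a correct response."""
    return 1.0 / (1.0 + math.exp(-a * (theta - b)))

def item_information(theta, a, b):
    """Fisher information that a 2PL item contributes at ability theta."""
    p = p_correct(theta, a, b)
    return a * a * p * (1.0 - p)

# Same difficulty, different discrimination (hypothetical values).
for a in (0.5, 2.0):
    row = [round(item_information(theta, a, b=0.0), 3) for theta in (-2.0, 0.0, 2.0)]
    print(f"a={a}: {row}")
```

Information peaks where θ equals the item's difficulty and scales with the square of the discrimination, so an accuracy score that weights every question equally discards exactly this distinction.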

axioms (2)

domain assumption: Items within each medical topic measure a single coherent latent ability
Invoked when the benchmark integrity validation is described; required for IRT parameters to be interpretable.
domain assumption: LLM response patterns on medical questions are adequately described by the chosen IRT model family
Core modeling premise that allows joint estimation of competency and item parameters.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · unclear

For each medical topic t, the probability that model m answers item i correctly is defined under the two-parameter logistic (2PL) framework as: $\Pr(X_{imt} = 1 \mid \theta_{m,t}) = \sigma\big(a_{i,t}(\theta_{m,t} - b_{i,t})\big)$

IndisputableMonolith/Foundation/AbsoluteFloorClosure.lean · absolute_floor_iff_bare_distinguishability · unclear

we include benchmark integrity validation to ensure items within each topic measure a single, coherent underlying ability

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

The Scaling Law of Evaluation Failure: Why Simple Averaging Collapses Under Data Sparsity and Item Difficulty Gaps, and How Item Response Theory Recovers Ground Truth Across Domains
cs.LG · 2026-05 · unverdicted · novelty 5.0

Simple averaging of evaluation scores degrades in rank correlation with ground truth under data sparsity and difficulty variation, while a two-parameter logistic Item Response Theory model maintains high correlation a...

Reference graph

Works this paper leans on

35 extracted references · 35 canonical work pages · cited by 1 Pith paper · 6 internal anchors

[1]

gpt-oss-120b & gpt-oss-20b Model Card

Sandhini Agarwal, Lama Ahmad, Jason Ai, Sam Altman, Andy Applebaum, Edwin Arbus, Rahul K Arora, Yu Bai, Bowen Baker, Haiming Bao, et al. gpt-oss-120b & gpt-oss-20b model card. arXiv preprint arXiv:2508.10925, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[2]

Introducing claude opus 4 and claude sonnet 4

Anthropic . Introducing claude opus 4 and claude sonnet 4. Website blog post, May 2025. URL https://www.anthropic.com/news/claude-4. Claude Sonnet 4: a high-performance AI model for coding and reasoning. Retrieved from Anthropic’s announcement

work page 2025
[3]

Some latent trait models and their use in inferring an examinee's ability

Allan Birnbaum. Some latent trait models and their use in inferring an examinee's ability. Statistical theories of mental test scores, 1968

work page 1968
[4]

Validity: on the meaningful interpretation of assessment data

Steven M Downing. Validity: on the meaningful interpretation of assessment data. Medical Education, 37 0 (9): 0 830--837, 2003. doi:10.1046/j.1365-2923.2003.01594.x

work page doi:10.1046/j.1365-2923.2003.01594.x 2003
[5]

USMLE content outline

Federation of State Medical Boards (FSMB) and National Board of Medical Examiners (NBME) . USMLE content outline. https://www.usmle.org/sites/default/files/2022-01/USMLE_Content_Outline_0.pdf, 2025. Accessed: 2025-08-26

work page 2022
[6]

Gemini 2.x: Pushing the frontier with advanced reasoning, multimodality, long context, and next-generation agentic capabilities

Gemini Team, Google . Gemini 2.x: Pushing the frontier with advanced reasoning, multimodality, long context, and next-generation agentic capabilities. Technical report, Google DeepMind, June 2025. Technical report introducing Gemini 2.5 Pro and Gemini 2.5 Flash; includes benchmark results and architectural details

work page 2025
[7]

The Llama 3 Herd of Models

Aaron Grattafiori, Abhimanyu Dubey, Abhinav Jauhri, Abhinav Pandey, Abhishek Kadian, et al. The Llama 3 herd of models. arXiv preprint arXiv:2407.21783, 2024. URL https://arxiv.org/abs/2407.21783

work page internal anchor Pith review Pith/arXiv arXiv 2024
[8]

Holmboe, Jose Biller, Elizabeth G

Eric S. Holmboe, Jose Biller, Elizabeth G. Donahue, Paul George, Francisco J. Longo, Sarah Scheiber, Elizabeth Sinz, and NEJM Knowledge+ Team . Redesigning continuing medical education: The case for competency-based education. Academic Medicine, 93 0 (10): 0 1461--1464, 2018. doi:10.1097/ACM.0000000000002324

work page doi:10.1097/acm.0000000000002324 2018
[9]

Hugging face hub

Hugging Face, Inc. Hugging face hub. https://huggingface.co, 2025. Accessed: 2025-08-27

work page 2025
[10]

Multi-criteria decision analysis for supporting the selection of engineering materials in product design

Ali Jahan, Kevin L Edwards, and Marjan Bahraminasab. Multi-criteria decision analysis for supporting the selection of engineering materials in product design. Butterworth-Heinemann, 2016

work page 2016
[11]

What disease does this patient have? A large-scale open domain question answering dataset from medical exams.Applied Sciences, 11(14):6421, 2021

Di Jin, Eileen Pan, Nassim Oufattole, Wei-Hung Weng, Hanyi Fang, and Peter Szolovits. What disease does this patient have? a large-scale open-domain question answering dataset from medical exams. Applied Sciences, 11 0 (14): 0 6421, 2021. doi:10.3390/app11146421

work page doi:10.3390/app11146421 2021
[12]

Building an evaluation scale using item response theory

John P Lalor, Hao Wu, and Hong Yu. Building an evaluation scale using item response theory. In Proceedings of the conference on empirical methods in natural language processing. Conference on empirical methods in natural language processing, volume 2016, page 648, 2016

work page 2016
[13]

DeepSeek-V3 Technical Report

Aixin Liu, Bei Feng, Bing Xue, Bingxuan Wang, Bochao Wu, Chengda Lu, Chenggang Zhao, Chengqi Deng, Chenyu Zhang, Chong Ruan, et al. Deepseek-v3 technical report. arXiv preprint arXiv:2412.19437, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024
[14]

A closer look into mixture-of-experts in large language models

Ka Man Lo, Zeyu Huang, Zihan Qiu, Zili Wang, and Jie Fu. A closer look into mixture-of-experts in large language models. In Luis Chiruzzo, Alan Ritter, and Lu Wang, editors, Findings of the Association for Computational Linguistics: NAACL 2025, pages 4427--4447, Albuquerque, New Mexico, April 2025. Association for Computational Linguistics. ISBN 979-8-891...

work page doi:10.18653/v1/2025.findings-naacl.251 2025
[15]

Statistical theories of mental test scores

Frederic M Lord and Melvin R Novick. Statistical theories of mental test scores. Addison-Wesley, Reading, MA, 1968

work page 1968
[16]

Llama 4 maverick (17b-active, 128-expert mixture-of-experts, multimodal)

Meta AI / Hugging Face . Llama 4 maverick (17b-active, 128-expert mixture-of-experts, multimodal). Model card on Hugging Face, April 2025. URL https://huggingface.co/meta-llama/Llama-4-Maverick-17B-128E-Original. Multimodal Mixture-of-Experts model: 17 B active parameters out of 400 B total, released April 5, 2025 under the Llama 4 Community License :cont...

work page 2025
[17]

Can generalist foundation models outcompete special-purpose tuning? case study in medicine,

Harsha Nori, Yin Tat Lee, Sheng Zhang, Dean Carignan, Richard Edgar, Nicolo Fusi, Nicholas King, Jonathan Larson, Yuanzhi Li, Weishung Liu, Renqian Luo, Scott Mayer McKinney, Robert Osazuwa Ness, Hoifung Poon, Tao Qin, Naoto Usuyama, Chris White, and Eric Horvitz. Can generalist foundation models outcompete special-purpose tuning? case study in medicine. ...

work page arXiv 2023
[18]

Codex-mini (codex-mini-latest): a fine-tuned version of o4-mini for cli use

OpenAI . Codex-mini (codex-mini-latest): a fine-tuned version of o4-mini for cli use. Model release announcement & documentation, May 2025. URL https://openai.com/index/introducing-codex/. Codex-Mini is a compact, low-latency coding model (fine-tuned from o4-mini) released May 16, 2025 via Codex CLI and Responses API; supports long contexts ( 200k tokens)...

work page 2025
[19]

OpenRouter : Universal API for large language models

OpenRouter . OpenRouter : Universal API for large language models. https://openrouter.ai/, 2025. Accessed: 2025-08-27

work page 2025
[20]

MedMCQA : A large-scale multi-subject multi-choice question answering dataset for the medical domain

Ankit Pal, Logesh Kumar Umapathi, and Malaikannan Sankarasubbu. MedMCQA : A large-scale multi-subject multi-choice question answering dataset for the medical domain. In Proceedings of the Conference on Health, Inference, and Learning, volume 174 of Proceedings of Machine Learning Research, pages 248--260. PMLR, 2022. URL https://proceedings.mlr.press/v174...

work page 2022
[21]

David Rein, Betty Li Hou, Asa Cooper Stickland, Jack- son Petty, Richard Yuanzhe Pang, Julien Dirani, Ju- lian Michael, and Samuel R Bowman

Felipe Maia Polo, Lucas Weber, Leshem Choshen, Yuekai Sun, Gongjun Xu, and Mikhail Yurochkin. tinybenchmarks: evaluating llms with fewer examples. arXiv preprint arXiv:2402.14992, 2024

work page arXiv 2024
[22]

Unifactor latent trait models applied to multifactor tests: Results and implications

Mark D Reckase. Unifactor latent trait models applied to multifactor tests: Results and implications. Journal of educational statistics, 4 0 (3): 0 207--230, 1979

work page 1979
[23]

The past and future of multidimensional item response theory

Mark D Reckase. The past and future of multidimensional item response theory. volume 21, pages 25--36. SAGE PUBLICATIONS, INC. 2455 Teller Road, Thousand Oaks, CA 91320, 1997

work page 1997
[24]

Pedro Rodriguez, Joe Barrow, Alexander Miserlis Hoyle, John P Lalor, Robin Jia, and Jordan Boyd-Graber. Evaluation examples are not equally informative: How should that change nlp leaderboards? In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processi...

work page 2021
[25]

MedGemma Technical Report

Andrew Sellergren, Sahar Kazemzadeh, Tiam Jaroensri, Atilla Kiraly, Madeleine Traverse, Timo Kohlberger, Shawn Xu, Fayaz Jamil, C \' an Hughes, Charles Lau, et al. Medgemma technical report. arXiv preprint arXiv:2507.05201, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[26]

Holistic evaluation of large language models for medical applications, 2025

Nigam Shah, Mike Pfeffer, and Percy Liang. Holistic evaluation of large language models for medical applications, 2025

work page 2025
[27]

Singhal, S

Karan Singhal, Shekoofeh Azizi, Tao Tu, S. Sara Mahdavi, Jason Wei, Hyung Won Chung, Nathan Scales, Ajay Tanwani, Heather Cole-Lewis, Stephen Pfohl, et al. Large language models encode clinical knowledge. Nature, 620 0 (7972): 0 172--180, 2023. doi:10.1038/s41586-023-06291-2

work page doi:10.1038/s41586-023-06291-2 2023
[28]

Irt-router: Effective and interpretable multi-llm routing via item response theory

Wei Song, Zhenya Huang, Cheng Cheng, Weibo Gao, Bihan Xu, GuanHao Zhao, Fei Wang, and Runze Wu. Irt-router: Effective and interpretable multi-llm routing via item response theory. arXiv preprint arXiv:2506.01048, 2025

work page arXiv 2025
[29]

Hermes 3 technical report

Ryan Teknium, Jeffrey Quesnelle, and Chen Guang. Hermes 3 technical report. arXiv preprint arXiv:2408.11857, 2024

work page arXiv 2024
[30]

Medqa benchmark leaderboard (august 26, 2025)

Vals AI . Medqa benchmark leaderboard (august 26, 2025). https://www.vals.ai/benchmarks/medqa-08-26-2025, 2025. Accessed: September 4, 2025

work page 2025
[31]

Qwen3 Technical Report

An Yang, Anfeng Li, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chang Gao, Chengen Huang, Chenxu Lv, et al. Qwen3 technical report. arXiv preprint arXiv:2505.09388, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[32]

Effects of local item dependence on the fit and equating performance of the three-parameter logistic model

Wendy M Yen. Effects of local item dependence on the fit and equating performance of the three-parameter logistic model. Applied Psychological Measurement, 8 0 (2): 0 125--145, 1984

work page 1984
[33] Hongli Zhou, Hui Huang, Ziqing Zhao, Lvyuan Han, Huicheng Wang, Kehai Chen, Muyun Yang, Wei Bao, Jian Dong, Bing Xu, et al. Lost in benchmarks? Rethinking large language model benchmarking with item response theory. arXiv preprint arXiv:2505.15055, 2025.
[34] Yan Zhuang, Qi Liu, Yuting Ning, Weizhe Huang, Rui Lv, Zhenya Huang, Guanhao Zhao, Zheng Zhang, Qingyang Mao, Shijin Wang, et al. Efficiently measuring the cognitive ability of LLMs: An adaptive testing perspective. 2023.
[35] Yuxin Zuo, Shang Qu, Yifei Li, Zhangren Chen, Xuekai Zhu, Ermo Hua, Kaiyan Zhang, Ning Ding, and Bowen Zhou. MedXpertQA: Benchmarking expert-level medical reasoning and understanding. arXiv preprint arXiv:2501.18362, 2025.

[1] Sandhini Agarwal, Lama Ahmad, Jason Ai, Sam Altman, Andy Applebaum, Edwin Arbus, Rahul K Arora, Yu Bai, Bowen Baker, Haiming Bao, et al. gpt-oss-120b & gpt-oss-20b model card. arXiv preprint arXiv:2508.10925, 2025.

[2] Anthropic. Introducing Claude Opus 4 and Claude Sonnet 4. Website blog post, May 2025. URL https://www.anthropic.com/news/claude-4. Claude Sonnet 4: a high-performance AI model for coding and reasoning. Retrieved from Anthropic's announcement.

[3] Allan Birnbaum. Some latent trait models and their use in inferring an examinee's ability. Statistical theories of mental test scores, 1968.

[4] Steven M. Downing. Validity: on the meaningful interpretation of assessment data. Medical Education, 37(9):830–837, 2003. doi:10.1046/j.1365-2923.2003.01594.x

[5] Federation of State Medical Boards (FSMB) and National Board of Medical Examiners (NBME). USMLE content outline. https://www.usmle.org/sites/default/files/2022-01/USMLE_Content_Outline_0.pdf, 2025. Accessed: 2025-08-26.

[6] Gemini Team, Google. Gemini 2.x: Pushing the frontier with advanced reasoning, multimodality, long context, and next-generation agentic capabilities. Technical report, Google DeepMind, June 2025. Technical report introducing Gemini 2.5 Pro and Gemini 2.5 Flash; includes benchmark results and architectural details.

[7] Aaron Grattafiori, Abhimanyu Dubey, Abhinav Jauhri, Abhinav Pandey, Abhishek Kadian, et al. The Llama 3 herd of models. arXiv preprint arXiv:2407.21783, 2024. URL https://arxiv.org/abs/2407.21783

[8] Eric S. Holmboe, Jose Biller, Elizabeth G. Donahue, Paul George, Francisco J. Longo, Sarah Scheiber, Elizabeth Sinz, and NEJM Knowledge+ Team. Redesigning continuing medical education: The case for competency-based education. Academic Medicine, 93(10):1461–1464, 2018. doi:10.1097/ACM.0000000000002324

[9] Hugging Face, Inc. Hugging Face Hub. https://huggingface.co, 2025. Accessed: 2025-08-27.

[10] Ali Jahan, Kevin L. Edwards, and Marjan Bahraminasab. Multi-criteria decision analysis for supporting the selection of engineering materials in product design. Butterworth-Heinemann, 2016.

[11] Di Jin, Eileen Pan, Nassim Oufattole, Wei-Hung Weng, Hanyi Fang, and Peter Szolovits. What disease does this patient have? A large-scale open-domain question answering dataset from medical exams. Applied Sciences, 11(14):6421, 2021. doi:10.3390/app11146421

[12] John P. Lalor, Hao Wu, and Hong Yu. Building an evaluation scale using item response theory. In Proceedings of the Conference on Empirical Methods in Natural Language Processing, volume 2016, page 648, 2016.

[13] Aixin Liu, Bei Feng, Bing Xue, Bingxuan Wang, Bochao Wu, Chengda Lu, Chenggang Zhao, Chengqi Deng, Chenyu Zhang, Chong Ruan, et al. DeepSeek-V3 technical report. arXiv preprint arXiv:2412.19437, 2024.

[14] Ka Man Lo, Zeyu Huang, Zihan Qiu, Zili Wang, and Jie Fu. A closer look into mixture-of-experts in large language models. In Luis Chiruzzo, Alan Ritter, and Lu Wang, editors, Findings of the Association for Computational Linguistics: NAACL 2025, pages 4427–4447, Albuquerque, New Mexico, April 2025. Association for Computational Linguistics. ISBN 979-8-891... doi:10.18653/v1/2025.findings-naacl.251

[15] Frederic M. Lord and Melvin R. Novick. Statistical theories of mental test scores. Addison-Wesley, Reading, MA, 1968.

[16] Meta AI / Hugging Face. Llama 4 Maverick (17B-active, 128-expert mixture-of-experts, multimodal). Model card on Hugging Face, April 2025. URL https://huggingface.co/meta-llama/Llama-4-Maverick-17B-128E-Original. Multimodal Mixture-of-Experts model: 17 B active parameters out of 400 B total, released April 5, 2025 under the Llama 4 Community License :cont...

[17] Harsha Nori, Yin Tat Lee, Sheng Zhang, Dean Carignan, Richard Edgar, Nicolo Fusi, Nicholas King, Jonathan Larson, Yuanzhi Li, Weishung Liu, Renqian Luo, Scott Mayer McKinney, Robert Osazuwa Ness, Hoifung Poon, Tao Qin, Naoto Usuyama, Chris White, and Eric Horvitz. Can generalist foundation models outcompete special-purpose tuning? Case study in medicine. ...

[18] OpenAI. Codex-mini (codex-mini-latest): a fine-tuned version of o4-mini for CLI use. Model release announcement & documentation, May 2025. URL https://openai.com/index/introducing-codex/. Codex-Mini is a compact, low-latency coding model (fine-tuned from o4-mini) released May 16, 2025 via Codex CLI and Responses API; supports long contexts ( 200k tokens)...

[19] OpenRouter. OpenRouter: Universal API for large language models. https://openrouter.ai/, 2025. Accessed: 2025-08-27.

[20] Ankit Pal, Logesh Kumar Umapathi, and Malaikannan Sankarasubbu. MedMCQA: A large-scale multi-subject multi-choice question answering dataset for the medical domain. In Proceedings of the Conference on Health, Inference, and Learning, volume 174 of Proceedings of Machine Learning Research, pages 248–260. PMLR, 2022. URL https://proceedings.mlr.press/v174...

[21] Felipe Maia Polo, Lucas Weber, Leshem Choshen, Yuekai Sun, Gongjun Xu, and Mikhail Yurochkin. tinyBenchmarks: Evaluating LLMs with fewer examples. arXiv preprint arXiv:2402.14992, 2024.

[22] Mark D. Reckase. Unifactor latent trait models applied to multifactor tests: Results and implications. Journal of Educational Statistics, 4(3):207–230, 1979.

[23] Mark D. Reckase. The past and future of multidimensional item response theory. Volume 21, pages 25–36. SAGE Publications, Thousand Oaks, CA, 1997.

[24] Pedro Rodriguez, Joe Barrow, Alexander Miserlis Hoyle, John P. Lalor, Robin Jia, and Jordan Boyd-Graber. Evaluation examples are not equally informative: How should that change NLP leaderboards? In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processi...

[25] Andrew Sellergren, Sahar Kazemzadeh, Tiam Jaroensri, Atilla Kiraly, Madeleine Traverse, Timo Kohlberger, Shawn Xu, Fayaz Jamil, Cían Hughes, Charles Lau, et al. MedGemma technical report. arXiv preprint arXiv:2507.05201, 2025.
