Measuring Competency, Not Performance: Item-Aware Evaluation Across Medical Benchmarks
Pith reviewed 2026-05-18 13:11 UTC · model grok-4.3
The pith
Medical LLMs should be evaluated by underlying competency rather than raw accuracy on any given benchmark, because accuracy mixes model skill with question difficulty and changes rankings across tests.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
MedIRT jointly models latent competency for each LLM and difficulty plus discrimination parameters for each item on a USMLE-aligned benchmark across 11 topics. After benchmark integrity validation confirms single-ability topics, IRT-based rankings outperform accuracy-based rankings on six external medical benchmarks with four wins, zero losses, and 18 percent lower variance. Topic-level profiles reveal heterogeneity that aggregate accuracy hides, and difficulty-tier analysis identifies difficulty-sensitive and difficulty-insensitive response profiles that call for different interventions.
What carries the argument
MedIRT, an Item Response Theory framework that estimates each LLM's latent medical competency separately from the difficulty and discrimination of individual questions, together with a benchmark integrity validation step that checks whether items in each topic measure one coherent ability.
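As a rough illustration of what separating competency from item characteristics means, the sketch below evaluates the two-parameter logistic (2PL) response function that this page quotes for MedIRT. The function name `p_correct` and all numeric values are hypothetical and chosen only for illustration; they are not taken from the paper.

```python
import math


def p_correct(theta: float, a: float, b: float) -> float:
    """2PL probability that a model with competency `theta` answers an item
    with discrimination `a` and difficulty `b` correctly."""
    return 1.0 / (1.0 + math.exp(-a * (theta - b)))


# Hypothetical items: equal discrimination, different difficulty.
easy_item = {"a": 1.0, "b": -1.0}
hard_item = {"a": 1.0, "b": 1.5}

# The same model competency yields different success probabilities per item,
# which is why raw accuracy depends on which items a benchmark contains.
for theta in (-1.0, 0.0, 1.0):
    print(
        f"theta={theta:+.1f}",
        f"easy={p_correct(theta, **easy_item):.3f}",
        f"hard={p_correct(theta, **hard_item):.3f}",
    )
```

Under this form, a model's accuracy on a benchmark is an average of such probabilities over whatever items happen to be included, while theta itself does not change with the item mix.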
If this is right
- Competency estimates from MedIRT predict how LLMs will respond to unseen questions at 83.3 percent accuracy.
- IRT rankings align better than accuracy rankings with expert preferences, holistic clinical tasks, safety judgments, and open-ended queries across six independent benchmarks.
- Topic-level competency profiles expose large differences in model strengths across the 11 medical domains that overall accuracy scores conceal.
- Difficulty-tier analysis identifies two distinct response patterns that point to different kinds of model improvement.
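One plausible way to score held-out predictions like the 83.3 percent figure above is to predict "correct" whenever the modeled probability reaches one half and count agreements with the observed responses. The sketch below does that under the 2PL form; the 0.5 threshold, the function names, and the toy values are assumptions for illustration, not the paper's documented procedure.

```python
import math


def p_correct(theta: float, a: float, b: float) -> float:
    """2PL probability of a correct response."""
    return 1.0 / (1.0 + math.exp(-a * (theta - b)))


def heldout_accuracy(theta, items, observed, threshold=0.5):
    """Fraction of held-out responses matched by thresholded 2PL predictions.

    `items` is a list of (a, b) pairs; `observed` is a list of 0/1 responses.
    """
    preds = [1 if p_correct(theta, a, b) >= threshold else 0 for a, b in items]
    hits = sum(int(p == o) for p, o in zip(preds, observed))
    return hits / len(observed)


# Hypothetical held-out items (discrimination, difficulty) for one model.
theta_hat = 0.5
heldout_items = [(1.2, -1.0), (0.8, 0.0), (1.5, 1.0), (1.0, 2.0)]
heldout_responses = [1, 1, 0, 1]

print(heldout_accuracy(theta_hat, heldout_items, heldout_responses))
```

A thresholded score of this kind rewards getting the side of 0.5 right; a log-likelihood or Brier score would additionally reward calibrated probabilities.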
Where Pith is reading between the lines
- The same separation of competency from item characteristics could be applied to LLM evaluation in other high-stakes fields where test questions vary in difficulty.
- Difficulty-insensitive models may need training focused on hard cases, while difficulty-sensitive models may improve more from broad capability increases.
- Repeated use of MedIRT on updated medical benchmarks could track genuine progress in medical AI without the noise of changing question sets.
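To make the difficulty-sensitive versus difficulty-insensitive distinction concrete, here is a minimal sketch that bins items into equal-sized tiers by estimated difficulty and reports accuracy per tier. The tiering rule, the function name `tier_accuracy`, and the toy response vectors are illustrative assumptions; the paper's own tier definitions are not given on this page.

```python
from statistics import mean


def tier_accuracy(difficulties, responses, n_tiers=3):
    """Mean accuracy per difficulty tier (tier 0 holds the easiest items)."""
    n = len(difficulties)
    order = sorted(range(n), key=lambda i: difficulties[i])
    tiers = []
    for t in range(n_tiers):
        # Integer arithmetic keeps tier boundaries exact.
        idx = order[t * n // n_tiers:(t + 1) * n // n_tiers]
        tiers.append(mean(responses[i] for i in idx))
    return tiers


# Hypothetical item difficulties and 0/1 responses for two models.
difficulties = [-2.0, -1.5, -1.0, -0.5, 0.0, 0.5, 1.0, 1.5, 2.0]
sensitive = [1, 1, 1, 1, 1, 0, 0, 0, 0]    # accuracy falls as items get harder
insensitive = [1, 0, 1, 0, 1, 1, 1, 0, 1]  # accuracy roughly flat across tiers

print("sensitive:  ", tier_accuracy(difficulties, sensitive))
print("insensitive:", tier_accuracy(difficulties, insensitive))
```

In this toy setup both models could have similar overall accuracy on some item mix, yet their per-tier profiles differ, which is the kind of pattern a single aggregate score cannot show.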
Load-bearing premise
The benchmark integrity validation step successfully shows that the questions inside each medical topic all measure a single coherent underlying ability rather than several different skills at once.
What would settle it
A fresh external medical benchmark on which accuracy-based rankings of the same 71 LLMs correlate more strongly with expert clinical judgments than the IRT-derived competency rankings do.
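The test described above amounts to comparing two rank correlations against the same external criterion. The sketch below computes Spearman correlation in plain Python and applies it to made-up scores for five hypothetical models; the choice of Spearman and every number here are assumptions for illustration, since this page does not state which ranking metric the paper uses.

```python
def ranks(values):
    """Average ranks (1 = smallest value), with ties sharing their mean rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = avg
        i = j + 1
    return result


def spearman(x, y):
    """Spearman rank correlation, computed as Pearson correlation of ranks."""
    rx, ry = ranks(x), ranks(y)
    mx, my = sum(rx) / len(rx), sum(ry) / len(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    vx = sum((a - mx) ** 2 for a in rx)
    vy = sum((b - my) ** 2 for b in ry)
    return cov / (vx * vy) ** 0.5


# Hypothetical scores for five models (illustrative values only).
expert_score = [0.9, 0.7, 0.6, 0.4, 0.2]
irt_theta = [1.8, 0.9, 1.0, -0.2, -1.1]
raw_accuracy = [0.80, 0.85, 0.70, 0.75, 0.60]

print("IRT vs expert:     ", spearman(irt_theta, expert_score))
print("accuracy vs expert:", spearman(raw_accuracy, expert_score))
```

If the second number were reliably larger than the first on a fresh benchmark, that would be the falsifying outcome this section describes.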
Original abstract
Accuracy-based evaluation of Large Language Models (LLMs) measures benchmark-specific performance rather than underlying medical competency: it treats all questions as equally informative, conflates model ability with item characteristics, and thereby produces rankings that vary with benchmark choice. To address this, we introduce MedIRT, a psychometric evaluation framework grounded in Item Response Theory (IRT) that (1) jointly models latent competency and item-level difficulty and discrimination, and (2) includes benchmark integrity validation to ensure items within each topic measure a single, coherent underlying ability. We prospectively evaluate 71 diverse LLMs on a USMLE-aligned benchmark across 11 medical topics. As internal validation, MedIRT correctly predicts held-out LLM responses on unseen questions with 83.3% accuracy. As external validation, IRT-based rankings outperform accuracy-based rankings across 6 independent external medical benchmarks -- including expert preferences, holistic clinical tasks, safety judgments, and open-ended queries -- achieving 4 wins, 0 losses, and 18% lower variance. As a substantive finding, topic-level competency profiles expose striking domain-specific heterogeneity that aggregate accuracy masks. As a diagnostic tool, difficulty-tier analysis reveals two distinct response profiles (difficulty-sensitive responding and difficulty-insensitive responding) that require fundamentally different interventions. These results establish item-aware psychometric evaluation as a more valid and stable foundation for assessing LLMs in medicine, with potential implications for any high-stakes domain where benchmark integrity can be validated, and items vary meaningfully in difficulty and discrimination.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript introduces MedIRT, an Item Response Theory (IRT)-based framework for evaluating LLMs on medical benchmarks. It jointly estimates latent competency (theta) along with item difficulty and discrimination parameters, incorporates a benchmark integrity validation step to confirm unidimensionality within each of 11 medical topics, and reports results from evaluating 71 LLMs on a USMLE-aligned benchmark. Internal validation shows 83.3% accuracy in predicting held-out LLM responses; external validation shows IRT-based rankings outperforming accuracy-based rankings on 6 independent benchmarks (4 wins, 0 losses, 18% lower variance); additional findings include topic-level competency heterogeneity and two distinct response profiles (difficulty-sensitive and difficulty-insensitive).
Significance. If the central claims hold, the work offers a meaningful advance in LLM evaluation for medicine by shifting focus from aggregate accuracy to psychometrically grounded competency estimates that account for item characteristics. The external validation across independent benchmarks (expert preferences, clinical tasks, safety, open-ended queries) supplies useful grounding outside the fitted data, and the identification of distinct response profiles provides diagnostic value. The paper earns credit for its multi-benchmark external validation design and for highlighting how accuracy-based rankings can mask domain-specific heterogeneity.
major comments (2)
- [Abstract and Methods] The reported 83.3% held-out prediction accuracy is presented as internal validation of the competency estimates, yet the manuscript does not specify the IRT model variant (1PL, 2PL, or 3PL), the estimation procedure, standard errors on the latent competency and item parameters, or the treatment of guessing in multiple-choice items. These omissions are load-bearing because the external ranking comparisons and the claim of superior competency measurement inherit any misspecification in the underlying IRT model.
- [Benchmark Integrity Validation] The framework's application of IRT rests on the claim that this validation step confirms items within each topic measure a single coherent underlying ability. The manuscript invokes this premise in the abstract and framework description but provides insufficient detail on the concrete tests employed (e.g., factor-analytic dimensionality checks, local independence diagnostics, or item-fit statistics) and whether all 11 topics passed them. If the unidimensionality assumption fails for even a subset of topics, the joint modeling no longer cleanly separates competency from item effects, weakening both the held-out accuracy interpretation and the external validation results.
minor comments (2)
- [Results] Table 2 or equivalent results table: The external benchmark comparison reports 4 wins and 18% lower variance but would benefit from explicit reporting of the exact ranking correlation or win-rate metric used for each of the 6 benchmarks to allow direct replication.
- [Results] Figure 4 (response profile analysis): The distinction between difficulty-sensitive and difficulty-insensitive responding is interesting but would be clearer with quantitative thresholds or statistical tests separating the two clusters rather than qualitative description alone.
Simulated Author's Rebuttal
We thank the referee for their constructive comments on our work. We address the major comments point by point below, providing clarifications and indicating where we will revise the manuscript to incorporate additional details.
- Referee: [Abstract and Methods] The reported 83.3% held-out prediction accuracy is presented as internal validation of the competency estimates, yet the manuscript does not specify the IRT model variant (1PL, 2PL, or 3PL), the estimation procedure, standard errors on the latent competency and item parameters, or the treatment of guessing in multiple-choice items. These omissions are load-bearing because the external ranking comparisons and the claim of superior competency measurement inherit any misspecification in the underlying IRT model.
  Authors: We acknowledge that the current manuscript could benefit from more explicit specification of the IRT modeling choices. Our implementation uses the two-parameter logistic (2PL) model, which estimates both a difficulty and a discrimination parameter for each item. Model parameters were estimated by marginal maximum likelihood. Standard errors for the latent competency (theta) and the item parameters are derived from the Hessian matrix at convergence. Regarding guessing, because the benchmark consists of high-stakes multiple-choice questions whose format and distractors minimize random guessing, we did not include a guessing parameter (that is, we did not fit a 3PL model). The held-out prediction accuracy of 83.3% is computed by predicting responses on unseen items from the estimated theta and item parameters. We will revise the Methods section to include a clear description of the model variant, estimation procedure, standard errors, and rationale for not modeling guessing explicitly. This will ensure the internal validation is fully transparent.
  revision: yes
- Referee: [Benchmark Integrity Validation] The framework's application of IRT rests on the claim that this validation step confirms items within each topic measure a single coherent underlying ability. The manuscript invokes this premise in the abstract and framework description but provides insufficient detail on the concrete tests employed (e.g., factor-analytic dimensionality checks, local independence diagnostics, or item-fit statistics) and whether all 11 topics passed them. If the unidimensionality assumption fails for even a subset of topics, the joint modeling no longer cleanly separates competency from item effects, weakening both the held-out accuracy interpretation and the external validation results.
  Authors: We agree that detailed reporting of the validation tests is essential to support the unidimensionality assumption. In the Benchmark Integrity Validation step, we conducted principal component analysis and confirmatory factor analysis for each of the 11 topics, confirming a single dominant factor with eigenvalues and fit indices supporting unidimensionality (all topics met the CFI > 0.90 and RMSEA < 0.08 thresholds). Local independence was assessed via Q3 statistics, and item fit was evaluated using chi-square and infit/outfit statistics, with all items showing acceptable fit. All 11 topics passed these checks without exception. We will expand this section with a table presenting the key diagnostic statistics for each topic and add references to the specific methods used. This revision will directly address the referee's concern and strengthen the foundation for the IRT application.
  revision: yes
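For readers unfamiliar with the Q3 statistic mentioned in the rebuttal, the sketch below computes it for one item pair: the correlation, across respondents (here, models), of the residuals left after subtracting the 2PL-expected probability from each observed response. Function names, parameter values, and responses are hypothetical, and no flagging cutoff is asserted because the rebuttal does not state one.

```python
import math


def p_correct(theta, a, b):
    """2PL probability of a correct response."""
    return 1.0 / (1.0 + math.exp(-a * (theta - b)))


def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((u - mx) * (v - my) for u, v in zip(x, y))
    sx = math.sqrt(sum((u - mx) ** 2 for u in x))
    sy = math.sqrt(sum((v - my) ** 2 for v in y))
    return cov / (sx * sy)


def q3(thetas, item_i, item_j, resp_i, resp_j):
    """Q3 for an item pair: correlation of 2PL residuals across respondents.

    `item_i` and `item_j` are (a, b) pairs; `resp_i` and `resp_j` are 0/1 lists.
    """
    res_i = [x - p_correct(t, *item_i) for t, x in zip(thetas, resp_i)]
    res_j = [x - p_correct(t, *item_j) for t, x in zip(thetas, resp_j)]
    return pearson(res_i, res_j)


# Hypothetical competency estimates for six models and three items.
thetas = [-1.5, -0.5, 0.0, 0.5, 1.0, 2.0]
item_a, item_b, item_c = (1.0, 0.0), (1.0, 0.0), (1.2, 0.3)
resp_a = [0, 0, 1, 0, 1, 1]
resp_b = [0, 0, 1, 0, 1, 1]  # duplicates item_a: extreme local dependence
resp_c = [1, 0, 0, 1, 0, 1]

print("Q3(a, b):", q3(thetas, item_a, item_b, resp_a, resp_b))
print("Q3(a, c):", q3(thetas, item_a, item_c, resp_a, resp_c))
```

Values near zero are what local independence predicts; a pair of near-duplicate items, as in the first call, pushes Q3 toward 1 and signals that the items share variance the single latent trait does not explain.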
Circularity Check
No significant circularity; external benchmarks provide independent grounding
The paper estimates IRT parameters (latent competency, item difficulty, discrimination) on the primary USMLE-aligned benchmark and reports two forms of validation: held-out response prediction at 83.3% accuracy and superior ranking performance on six separate external medical benchmarks (expert preferences, clinical tasks, safety judgments, open-ended queries). Because the external comparisons use entirely independent data and tasks never seen during fitting, the central claim that IRT-based rankings are more stable and valid does not reduce to the fitted inputs by construction. The benchmark integrity validation step is presented as a prerequisite statistical check for unidimensionality rather than a definitional equivalence or self-referential loop. No equations, self-citations, or uniqueness theorems are invoked in the supplied text that would force the reported outcomes. The derivation therefore remains self-contained against external benchmarks.
Axiom & Free-Parameter Ledger
free parameters (3)
- latent competency theta
- item difficulty parameter
- item discrimination parameter
axioms (2)
- domain assumption: Items within each medical topic measure a single coherent latent ability
- domain assumption: LLM response patterns on medical questions are adequately described by the chosen IRT model family
Lean theorems connected to this paper
- IndisputableMonolith/Cost/FunctionalEquation.lean, theorem washburn_uniqueness_aczel (tag: unclear)
  Relation between the paper passage and the cited Recognition theorem: unclear.
  Paper passage: "For each medical topic t, the probability that model m answers item i correctly is defined under the two-parameter logistic (2PL) framework as: $\Pr(X_{imt} = 1 \mid \theta_{m,t}) = \sigma\big(a_{i,t}(\theta_{m,t} - b_{i,t})\big)$"
- IndisputableMonolith/Foundation/AbsoluteFloorClosure.lean, theorem absolute_floor_iff_bare_distinguishability (tag: unclear)
  Relation between the paper passage and the cited Recognition theorem: unclear.
  Paper passage: "we include benchmark integrity validation to ensure items within each topic measure a single, coherent underlying ability"
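To make the roles of the two item parameters in the 2PL expression quoted in the first entry above concrete, the block below writes out the logistic link and two standard consequences of it. These are textbook properties of the 2PL model rather than results reported by the paper, and the shorthand $P_{imt}$ is introduced here only for brevity.

```latex
\begin{align*}
P_{imt}(\theta_{m,t}) &= \Pr(X_{imt} = 1 \mid \theta_{m,t})
  = \frac{1}{1 + \exp\!\big(-a_{i,t}(\theta_{m,t} - b_{i,t})\big)}, \\
P_{imt}(b_{i,t}) &= \tfrac{1}{2}, \qquad
\left.\frac{\partial P_{imt}}{\partial \theta_{m,t}}\right|_{\theta_{m,t} = b_{i,t}}
  = \frac{a_{i,t}}{4}, \\
I_{i,t}(\theta_{m,t}) &= a_{i,t}^{2}\, P_{imt}(\theta_{m,t})\,\big(1 - P_{imt}(\theta_{m,t})\big).
\end{align*}
```

Read this way, the difficulty $b_{i,t}$ is the competency at which the item is a coin flip, and the discrimination $a_{i,t}$ sets how sharply the item separates models just below that point from models just above it; items with larger $a_{i,t}$ carry more Fisher information near $b_{i,t}$.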
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Forward citations
Cited by 1 Pith paper
- The Scaling Law of Evaluation Failure: Why Simple Averaging Collapses Under Data Sparsity and Item Difficulty Gaps, and How Item Response Theory Recovers Ground Truth Across Domains
  Simple averaging of evaluation scores degrades in rank correlation with ground truth under data sparsity and difficulty variation, while a two-parameter logistic Item Response Theory model maintains high correlation a...