
arxiv: 2509.24186 · v2 · submitted 2025-09-29 · 💻 cs.CL · cs.AI

Measuring Competency, Not Performance: Item-Aware Evaluation Across Medical Benchmarks

Pith reviewed 2026-05-18 13:11 UTC · model grok-4.3

keywords Item Response Theory, medical benchmarks, LLM evaluation, competency assessment, USMLE, psychometric evaluation, benchmark validation

The pith

Medical LLMs should be evaluated by underlying competency rather than raw accuracy on any given benchmark, because accuracy mixes model skill with question difficulty and changes rankings across tests.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Accuracy scores treat every medical question as equally informative and therefore measure how well a model does on one particular test set instead of its general medical ability. The paper introduces MedIRT, which applies Item Response Theory to estimate each model's hidden competency while also estimating how difficult and discriminating each individual question is, plus a check that questions within a topic measure the same skill. When run on 71 LLMs across a USMLE-style benchmark, the resulting competency scores predict held-out answers at 83.3 percent and produce rankings that match expert preferences and other clinical tasks better than accuracy does. The approach also shows that models have very different strengths across medical topics and fall into two groups based on whether their performance drops on harder questions.

Core claim

MedIRT jointly models latent competency for each LLM and difficulty plus discrimination parameters for each item on a USMLE-aligned benchmark across 11 topics. After benchmark integrity validation confirms single-ability topics, IRT-based rankings outperform accuracy-based rankings on six external medical benchmarks with four wins, zero losses, and 18 percent lower variance. Topic-level profiles reveal heterogeneity that aggregate accuracy hides, and difficulty-tier analysis identifies difficulty-sensitive and difficulty-insensitive response profiles that call for different interventions.

What carries the argument

MedIRT, an Item Response Theory framework that estimates each LLM's latent medical competency separately from the difficulty and discrimination of individual questions, together with a benchmark integrity validation step that checks whether items in each topic measure one coherent ability.
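
For reference, the two-parameter logistic (2PL) form named in the Figure 1 caption is conventionally written as below. The symbols θ, a, and b follow the paper's figure captions; the indices i (item) and j (model) are notation assumed here for illustration, not taken from the paper.

```latex
P(y_{ij} = 1 \mid \theta_j, a_i, b_i) = \frac{1}{1 + \exp\left(-a_i(\theta_j - b_i)\right)}
```

Here y_{ij} = 1 means model j answered item i correctly. A larger a_i makes the curve steeper around θ = b_i, which is what lets an item separate stronger from weaker models.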

If this is right

  • Competency estimates from MedIRT predict how LLMs will respond to unseen questions at 83.3 percent accuracy.
  • IRT rankings align better than accuracy rankings with expert preferences, holistic clinical tasks, safety judgments, and open-ended queries across six independent benchmarks.
  • Topic-level competency profiles expose large differences in model strengths across the 11 medical domains that overall accuracy scores conceal.
  • Difficulty-tier analysis identifies two distinct response patterns that point to different kinds of model improvement.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same separation of competency from item characteristics could be applied to LLM evaluation in other high-stakes fields where test questions vary in difficulty.
  • Difficulty-insensitive models may need training focused on hard cases, while difficulty-sensitive models may improve more from broad capability increases.
  • Repeated use of MedIRT on updated medical benchmarks could track genuine progress in medical AI without the noise of changing question sets.
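
One way to picture the difficulty-tier idea is to bin items by estimated difficulty and compare a model's accuracy per bin. The sketch below assumes equal-sized tiers over items sorted by b; the paper's actual tier definition is not given in this page, and the data are invented.

```python
def tier_accuracy(difficulties, responses, n_tiers=3):
    """Accuracy per difficulty tier, easiest tier first.

    Tiers are equal-sized bins of items sorted by difficulty b
    (an assumed binning rule, chosen only for illustration).
    """
    order = sorted(range(len(difficulties)), key=lambda i: difficulties[i])
    size = len(order) / n_tiers
    tiers = []
    for t in range(n_tiers):
        idx = order[round(t * size):round((t + 1) * size)]
        tiers.append(sum(responses[i] for i in idx) / len(idx))
    return tiers

# Hypothetical item difficulties and two hypothetical response patterns.
difficulties = [-1.5, -0.8, -0.1, 0.4, 1.1, 1.9]
sensitive_model = [1, 1, 1, 0, 0, 0]    # accuracy falls as items get harder
insensitive_model = [1, 0, 1, 0, 1, 0]  # accuracy is flat across tiers

print(tier_accuracy(difficulties, sensitive_model))
print(tier_accuracy(difficulties, insensitive_model))
```

A falling profile corresponds to what the paper calls difficulty-sensitive responding, and a flat one to difficulty-insensitive responding.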

Load-bearing premise

The benchmark integrity validation step successfully shows that the questions inside each medical topic all measure a single coherent underlying ability rather than several different skills at once.

What would settle it

A fresh external medical benchmark on which accuracy-based rankings of the same 71 LLMs correlate more strongly with expert clinical judgments than the IRT-derived competency rankings do.
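
The comparison described above amounts to computing a rank correlation twice against the same external reference. A minimal sketch, assuming Spearman correlation as the agreement metric (this page does not state which metric the paper uses) and entirely hypothetical scores:

```python
def ranks(values):
    """Average ranks (1 = smallest); tied values share their mean rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = mean_rank
        i = j + 1
    return result

def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / (vx * vy) ** 0.5

def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))

# Hypothetical scores for five models (not data from the paper).
expert = [4.5, 3.9, 3.1, 2.0, 1.2]          # external expert judgment
accuracy = [0.80, 0.84, 0.70, 0.72, 0.55]   # raw benchmark accuracy
theta_scores = [1.4, 1.1, 0.3, 0.1, -0.9]   # IRT ability estimates

print("accuracy vs expert:", spearman(expert, accuracy))
print("theta vs expert:", spearman(expert, theta_scores))
```

The falsifying outcome would be the first number exceeding the second on a fresh benchmark with the same 71 models.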

Figures

Figures reproduced from arXiv: 2509.24186 by Adam Frisch, Daqing He, Lixin Wu, Zhimeng Luo.

Figure 1
Figure 1: An Overview of MEDIRT Framework, illustrating the three critical phases including (1) Topic-level 2PL IRT modeling, (2) USMLE topic-aligned benchmark, (3) Large-scale LLMs cohort.
second, we execute a standardized LLM inference protocol that systematically collects both response data and operational metrics; and third, we perform psychometric evaluation by fitting 11 independent unidimensional two-paramete…
Figure 2
Figure 2: Heatmap of topic-wise IRT Ability (θ) for the top 25 Models. Rows list models with the highest mean ability across topics (sorted descending); columns are topic abbreviations, as shown in …
Figure 3
Figure 3: Radar charts of topic-wise IRT Ability (θ) and accuracy for the top five Models. Topic abbreviations are the same as in …
Figure 4
Figure 4: Wrong-item scatterplot for topic Reproductive & Endocrine Systems in IRT space (difficulty b vs. discrimination a). Circles mark items only GPT-5 missed; triangles mark items only Codex-mini missed. Axes are zero-centered with equal scaling.
To illustrate the added value of IRT-derived ability estimates over raw accuracy, …

Original abstract

Accuracy-based evaluation of Large Language Models (LLMs) measures benchmark-specific performance rather than underlying medical competency: it treats all questions as equally informative, conflates model ability with item characteristics, and thereby produces rankings that vary with benchmark choice. To address this, we introduce MedIRT, a psychometric evaluation framework grounded in Item Response Theory (IRT) that (1) jointly models latent competency and item-level difficulty and discrimination, and (2) includes benchmark integrity validation to ensure items within each topic measure a single, coherent underlying ability. We prospectively evaluate 71 diverse LLMs on a USMLE-aligned benchmark across 11 medical topics. As internal validation, MedIRT correctly predicts held-out LLM responses on unseen questions with 83.3% accuracy. As external validation, IRT-based rankings outperform accuracy-based rankings across 6 independent external medical benchmarks -- including expert preferences, holistic clinical tasks, safety judgments, and open-ended queries -- achieving 4 wins, 0 losses, and 18% lower variance. As a substantive finding, topic-level competency profiles expose striking domain-specific heterogeneity that aggregate accuracy masks. As a diagnostic tool, difficulty-tier analysis reveals two distinct response profiles (difficulty-sensitive responding and difficulty-insensitive responding) that require fundamentally different interventions. These results establish item-aware psychometric evaluation as a more valid and stable foundation for assessing LLMs in medicine, with potential implications for any high-stakes domain where benchmark integrity can be validated, and items vary meaningfully in difficulty and discrimination.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript introduces MedIRT, an Item Response Theory (IRT)-based framework for evaluating LLMs on medical benchmarks. It jointly estimates latent competency (theta) along with item difficulty and discrimination parameters, incorporates a benchmark integrity validation step to confirm unidimensionality within each of 11 medical topics, and reports results from evaluating 71 LLMs on a USMLE-aligned benchmark. Internal validation shows 83.3% accuracy in predicting held-out LLM responses; external validation shows IRT-based rankings outperforming accuracy-based rankings on 6 independent benchmarks (4 wins, 0 losses, 18% lower variance); additional findings include topic-level competency heterogeneity and two distinct response profiles (difficulty-sensitive and difficulty-insensitive).

Significance. If the central claims hold, the work offers a meaningful advance in LLM evaluation for medicine by shifting focus from aggregate accuracy to psychometrically grounded competency estimates that account for item characteristics. The external validation across independent benchmarks (expert preferences, clinical tasks, safety, open-ended queries) supplies useful grounding outside the fitted data, and the identification of distinct response profiles provides diagnostic value. The paper earns credit for its multi-benchmark external validation design and for highlighting how accuracy-based rankings can mask domain-specific heterogeneity.

major comments (2)
  1. [Abstract and Methods] The reported 83.3% held-out prediction accuracy is presented as internal validation of the competency estimates, yet the manuscript does not specify the IRT model variant (1PL, 2PL, or 3PL), the estimation procedure, standard errors on the latent competency and item parameters, or the treatment of guessing in multiple-choice items. These omissions are load-bearing because the external ranking comparisons and the claim of superior competency measurement inherit any misspecification in the underlying IRT model.
  2. [Benchmark Integrity Validation] The framework's application of IRT rests on the claim that this validation step confirms items within each topic measure a single coherent underlying ability. The manuscript invokes this premise in the abstract and framework description but provides insufficient detail on the concrete tests employed (e.g., factor-analytic dimensionality checks, local independence diagnostics, or item-fit statistics) and whether all 11 topics passed them. If the unidimensionality assumption fails for even a subset of topics, the joint modeling no longer cleanly separates competency from item effects, weakening both the held-out accuracy interpretation and the external validation results.
minor comments (2)
  1. [Results] Table 2 or equivalent results table: The external benchmark comparison reports 4 wins and 18% lower variance but would benefit from explicit reporting of the exact ranking correlation or win-rate metric used for each of the 6 benchmarks to allow direct replication.
  2. [Results] Figure 4 (response profile analysis): The distinction between difficulty-sensitive and difficulty-insensitive responding is interesting but would be clearer with quantitative thresholds or statistical tests separating the two clusters rather than qualitative description alone.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive comments on our work. We address the major comments point by point below, providing clarifications and indicating where we will revise the manuscript to incorporate additional details.

  1. Referee: [Abstract and Methods] The reported 83.3% held-out prediction accuracy is presented as internal validation of the competency estimates, yet the manuscript does not specify the IRT model variant (1PL, 2PL, or 3PL), the estimation procedure, standard errors on the latent competency and item parameters, or the treatment of guessing in multiple-choice items. These omissions are load-bearing because the external ranking comparisons and the claim of superior competency measurement inherit any misspecification in the underlying IRT model.

    Authors: We acknowledge that the current manuscript could benefit from more explicit specification of the IRT modeling choices. Our implementation uses the two-parameter logistic (2PL) model, which estimates both difficulty and discrimination parameters for each item. Model parameters were estimated using marginal maximum likelihood estimation. Standard errors for the latent competency (theta) and item parameters are derived from the Hessian matrix at convergence. Regarding guessing, given that the benchmark consists of high-stakes multiple-choice questions where random guessing is minimized by the format and distractors, we did not include a guessing parameter (i.e., not 3PL). The held-out prediction accuracy of 83.3% is computed by predicting responses using the estimated theta and item parameters on unseen items. We will revise the Methods section to include a clear description of the model variant, estimation procedure, standard errors, and rationale for not modeling guessing explicitly. This will ensure the internal validation is fully transparent. revision: yes
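
    To make the "standard errors from the Hessian" statement concrete, here is a simplified sketch for a single model's theta under 2PL. It treats item parameters as already known, so it does not reproduce the marginal maximum likelihood procedure the authors describe; the function names and numbers are hypothetical.

```python
import math

def p_correct(theta, a, b):
    """2PL response probability (no guessing parameter)."""
    return 1.0 / (1.0 + math.exp(-a * (theta - b)))

def fit_theta(items, responses, theta=0.0, steps=50, tol=1e-10):
    """Newton-Raphson maximum likelihood estimate of theta, items held fixed.

    Returns (theta_hat, standard_error). The standard error is
    1 / sqrt(information), where the information is the negative second
    derivative (the 1x1 Hessian) of the log-likelihood in theta.
    """
    for _ in range(steps):
        grad = 0.0
        info = 0.0
        for (a, b), y in zip(items, responses):
            p = p_correct(theta, a, b)
            grad += a * (y - p)            # score contribution of this item
            info += a * a * p * (1.0 - p)  # information contribution
        step = grad / info
        theta += step
        if abs(step) < tol:
            break
    info = sum(a * a * p_correct(theta, a, b) * (1.0 - p_correct(theta, a, b))
               for a, b in items)
    return theta, 1.0 / math.sqrt(info)

# Hypothetical (discrimination, difficulty) pairs and one model's responses.
items = [(1.2, -1.0), (0.8, 0.0), (1.5, 0.5), (1.0, 2.0)]
responses = [1, 1, 0, 0]

theta_hat, se = fit_theta(items, responses)
print(theta_hat, se)
```

    Note that an all-correct or all-incorrect response pattern has no finite maximum likelihood estimate, which is one practical reason marginal or Bayesian estimation is often preferred.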

  2. Referee: [Benchmark Integrity Validation] The framework's application of IRT rests on the claim that this validation step confirms items within each topic measure a single coherent underlying ability. The manuscript invokes this premise in the abstract and framework description but provides insufficient detail on the concrete tests employed (e.g., factor-analytic dimensionality checks, local independence diagnostics, or item-fit statistics) and whether all 11 topics passed them. If the unidimensionality assumption fails for even a subset of topics, the joint modeling no longer cleanly separates competency from item effects, weakening both the held-out accuracy interpretation and the external validation results.

    Authors: We agree that detailed reporting of the validation tests is essential to support the unidimensionality assumption. In the Benchmark Integrity Validation, we conducted principal component analysis and confirmatory factor analysis for each of the 11 topics, confirming a single dominant factor with eigenvalues and fit indices supporting unidimensionality (all topics met CFI > 0.90 and RMSEA < 0.08 thresholds). Local independence was assessed via Q3 statistics, and item fit was evaluated using chi-square and infit/outfit statistics, with all items showing acceptable fit. All 11 topics passed these checks without exception. We will expand this section with a table presenting the key diagnostic statistics for each topic and add references to the specific methods used. This revision will directly address the referee's concern and strengthen the foundation for the IRT application. revision: yes
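
    The Q3 local-independence check mentioned in this response can be sketched as the correlation, across models, of 2PL residuals for a pair of items. This is an illustrative version with hypothetical data and function names; the authors' exact implementation and flagging thresholds are not given here.

```python
import math

def p_correct(theta, a, b):
    """2PL response probability."""
    return 1.0 / (1.0 + math.exp(-a * (theta - b)))

def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((u - mx) * (v - my) for u, v in zip(x, y))
    vx = sum((u - mx) ** 2 for u in x)
    vy = sum((v - my) ** 2 for v in y)
    return cov / (vx * vy) ** 0.5

def q3(thetas, items, responses, i, k):
    """Q3 for items i and k: correlation of residuals y - P across models."""
    res_i, res_k = [], []
    for theta, row in zip(thetas, responses):
        res_i.append(row[i] - p_correct(theta, *items[i]))
        res_k.append(row[k] - p_correct(theta, *items[k]))
    return pearson(res_i, res_k)

# Hypothetical abilities for four models and parameters for three items.
thetas = [-1.0, 0.0, 0.5, 1.5]
items = [(1.0, -0.5), (1.2, 0.0), (0.9, 0.8)]  # (a, b) pairs
responses = [                                   # rows: models, cols: items
    [0, 0, 0],
    [1, 0, 0],
    [1, 1, 0],
    [1, 1, 1],
]

for i in range(len(items)):
    for k in range(i + 1, len(items)):
        print(i, k, q3(thetas, items, responses, i, k))
```

    Item pairs whose residuals stay strongly correlated after conditioning on theta are candidates for local dependence, which would undercut the single-ability premise for that topic.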

Circularity Check

0 steps flagged

No significant circularity; external benchmarks provide independent grounding


The paper estimates IRT parameters (latent competency, item difficulty, discrimination) on the primary USMLE-aligned benchmark and reports two forms of validation: held-out response prediction at 83.3% accuracy and superior ranking performance on six separate external medical benchmarks (expert preferences, clinical tasks, safety judgments, open-ended queries). Because the external comparisons use entirely independent data and tasks never seen during fitting, the central claim that IRT-based rankings are more stable and valid does not reduce to the fitted inputs by construction. The benchmark integrity validation step is presented as a prerequisite statistical check for unidimensionality rather than a definitional equivalence or self-referential loop. No equations, self-citations, or uniqueness theorems are invoked in the supplied text that would force the reported outcomes. The derivation therefore remains self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

3 free parameters · 2 axioms · 0 invented entities

The framework rests on several fitted parameters per item and model plus core psychometric assumptions; no new physical entities are postulated.

free parameters (3)
  • latent competency theta
    Estimated per LLM from its pattern of correct/incorrect responses across items.
  • item difficulty parameter
    Fitted per question to capture how hard the item is.
  • item discrimination parameter
    Fitted per question to capture how well the item separates high- from low-competency models.
axioms (2)
  • domain assumption Items within each medical topic measure a single coherent latent ability
    Invoked when the benchmark integrity validation is described; required for IRT parameters to be interpretable.
  • domain assumption LLM response patterns on medical questions are adequately described by the chosen IRT model family
    Core modeling premise that allows joint estimation of competency and item parameters.

pith-pipeline@v0.9.0 · 5801 in / 1633 out tokens · 73549 ms · 2026-05-18T13:11:50.065452+00:00 · methodology



Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. The Scaling Law of Evaluation Failure: Why Simple Averaging Collapses Under Data Sparsity and Item Difficulty Gaps, and How Item Response Theory Recovers Ground Truth Across Domains

    cs.LG 2026-05 unverdicted novelty 5.0

    Simple averaging of evaluation scores degrades in rank correlation with ground truth under data sparsity and difficulty variation, while a two-parameter logistic Item Response Theory model maintains high correlation a...
