Recognition: 2 Lean theorem links
XpertBench: Expert-Level Tasks with Rubrics-Based Evaluation
Pith reviewed 2026-05-14 23:18 UTC · model grok-4.3
The pith
Leading LLMs reach only around 66 percent success on expert-level professional tasks.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
XpertBench demonstrates that state-of-the-art large language models hit a pronounced performance ceiling, with a peak success rate of approximately 66 percent and mean scores near 55 percent, when tested on 1,346 expert-curated tasks across finance, healthcare, legal services, education, and dual-track research domains, and that their strengths in quantitative reasoning and linguistic synthesis do not overlap.
What carries the argument
The XpertBench benchmark of 1,346 tasks sourced from expert submissions, each scored against a rubric of 15-40 weighted checkpoints and assessed via the ShotJudge paradigm of expert-calibrated few-shot LLM judging.
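To make that scoring machinery concrete, here is a minimal sketch of weighted-rubric grading with few-shot judge prompting in the spirit of ShotJudge. The checkpoint structure, weights, prompt format, and example data are illustrative assumptions, not the paper's actual implementation.

```python
from dataclasses import dataclass

@dataclass
class Checkpoint:
    description: str   # what the expert rubric checks for
    weight: float      # expert-assigned importance

def weighted_rubric_score(checkpoints, passed):
    """Weighted fraction of rubric checkpoints satisfied by a model response.

    `passed[i]` is True when a judge (LLM or human) marks checkpoint i as met.
    """
    total = sum(c.weight for c in checkpoints)
    earned = sum(c.weight for c, ok in zip(checkpoints, passed) if ok)
    return earned / total if total else 0.0

def build_judge_prompt(task, response, checkpoint, exemplars):
    """Few-shot judging prompt in the spirit of ShotJudge: expert-graded
    exemplars are prepended before the checkpoint to be judged."""
    shots = "\n\n".join(
        f"Task: {e['task']}\nResponse: {e['response']}\n"
        f"Checkpoint: {e['checkpoint']}\nVerdict: {e['verdict']}"
        for e in exemplars
    )
    return (f"{shots}\n\nTask: {task}\nResponse: {response}\n"
            f"Checkpoint: {checkpoint.description}\nVerdict:")

if __name__ == "__main__":
    # Hypothetical three-checkpoint rubric for a legal task.
    rubric = [Checkpoint("Cites the correct statute", 3.0),
              Checkpoint("States the holding accurately", 2.0),
              Checkpoint("Flags the limitation period", 1.0)]
    print(weighted_rubric_score(rubric, [True, True, False]))  # -> 0.833...

    demo = [{"task": "Summarize the ruling", "response": "The court found...",
             "checkpoint": "States the holding accurately", "verdict": "pass"}]
    print(build_judge_prompt("Draft a memo on the contract dispute",
                             "The court held that ...", rubric[1], demo))
```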
If this is right
- Current models remain general assistants rather than dependable specialized collaborators in high-stakes professional settings.
- Quantitative and linguistic domains require distinct improvements because model strengths do not overlap.
- Rubric-based scoring with many checkpoints supplies finer diagnostics than single-score accuracy metrics.
- Training objectives must target the identified expert-gap to move beyond plateaued general benchmarks.
Where Pith is reading between the lines
- Sustained progress on these tasks could function as a practical yardstick for when AI systems become viable independent professional agents.
- Domain divergence suggests hybrid systems that route tasks to the strongest available model for each area could raise overall performance (see the routing sketch after this list).
- The benchmark tasks themselves could serve as training data for targeted fine-tuning once the ceiling is better understood.
- Human-AI workflows could use the checkpoints to decide precisely where expert review remains necessary.
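A minimal sketch of the routing idea in the second bullet above, assuming per-domain benchmark scores are available for each candidate model; the model names and scores are purely illustrative.

```python
# Hypothetical per-domain mean scores for two models (illustrative numbers only).
DOMAIN_SCORES = {
    "model_a": {"finance": 0.62, "healthcare": 0.58, "legal": 0.49, "humanities": 0.44},
    "model_b": {"finance": 0.51, "healthcare": 0.53, "legal": 0.57, "humanities": 0.60},
}

def route(domain: str) -> str:
    """Pick the model with the best benchmark score for this task's domain."""
    return max(DOMAIN_SCORES, key=lambda m: DOMAIN_SCORES[m].get(domain, 0.0))

def routed_mean(task_domains: list[str]) -> float:
    """Expected mean score if every task is sent to the per-domain best model."""
    return sum(max(s[d] for s in DOMAIN_SCORES.values())
               for d in task_domains) / len(task_domains)

print(route("legal"))                     # -> model_b
print(routed_mean(["finance", "legal"]))  # -> (0.62 + 0.57) / 2 = 0.595
```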
Load-bearing premise
The expert submissions and the resulting weighted rubric checkpoints accurately capture genuine expert-level cognition without selection bias or construction artifacts.
What would settle it
If panels of actual domain experts re-grade the same 1,346 tasks and find that leading models exceed 80 percent success or agreement on the rubrics, the reported performance ceiling would be falsified.
Original abstract
As Large Language Models (LLMs) exhibit plateauing performance on conventional benchmarks, a pivotal challenge persists: evaluating their proficiency in complex, open-ended tasks characterizing genuine expert-level cognition. Existing frameworks suffer from narrow domain coverage, reliance on generalist tasks, or self-evaluation biases. To bridge this gap, we present XpertBench, a high-fidelity benchmark engineered to assess LLMs across authentic professional domains. XpertBench consists of 1,346 meticulously curated tasks across 80 categories, spanning finance, healthcare, legal services, education, and dual-track research (STEM and Humanities). These tasks are derived from over 1,000 submissions by domain experts--including researchers from elite institutions and practitioners with extensive clinical or industrial experience--ensuring superior ecological validity. Each task uses detailed rubrics with mostly 15-40 weighted checkpoints to assess professional rigor. To facilitate scalable yet human-aligned assessment, we introduce ShotJudge, a novel evaluation paradigm that employs LLM judges calibrated with expert few-shot exemplars to mitigate self-rewarding biases. Our empirical evaluation of state-of-the-art LLMs reveals a pronounced performance ceiling: even leading models achieve a peak success rate of only ~66%, with a mean score around 55%. Models also exhibit domain-specific divergence, showing non-overlapping strengths in quantitative reasoning versus linguistic synthesis.. These findings underscore a significant "expert-gap" in current AI systems and establish XpertBench as a critical instrument for navigating the transition from general-purpose assistants to specialized professional collaborators.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces XpertBench, a benchmark of 1,346 expert-derived tasks across 80 categories in domains including finance, healthcare, legal, education, and research. Tasks are assessed via rubrics containing 15-40 weighted checkpoints, with evaluation performed by the introduced ShotJudge method that uses LLM judges calibrated on expert few-shot exemplars. Empirical results on state-of-the-art LLMs report a peak success rate of ~66% and mean score of ~55%, together with domain-specific performance divergences between quantitative and linguistic tasks, which the authors interpret as evidence of an expert-level gap.
Significance. If the rubric validity and ShotJudge alignment claims are substantiated, the work would provide a useful high-ecological-validity instrument for tracking progress toward professional-grade AI capabilities. The scale of expert-sourced tasks and the shift from self-evaluation to calibrated external judging are constructive steps beyond many existing LLM benchmarks.
major comments (3)
- [Abstract / Evaluation Methodology] Abstract and evaluation section: the central performance-ceiling claim (~66% peak, ~55% mean) and the domain-divergence finding rest on ShotJudge outputs, yet no inter-rater reliability statistics (Cohen’s kappa, Pearson r, or equivalent) between ShotJudge scores and independent human experts scoring the same tasks are reported.
- [Benchmark Construction] Rubric construction paragraph: the weighting scheme for the 15-40 checkpoints is described as “weighted” but no procedure for deriving or validating the weights across the 1,346 tasks is supplied, leaving open the possibility that domain-specific score differences are artifacts of rubric construction rather than genuine model behavior.
- [Empirical Evaluation] Results section: the reported success rates and non-overlapping domain strengths are presented without accompanying statistical tests, confidence intervals, or controls for judge calibration drift, so the quantitative support for the expert-gap conclusion remains incomplete.
minor comments (1)
- [Abstract] The final sentence of the abstract contains a double period (“synthesis.. These”).
Simulated Author's Rebuttal
We thank the referee for the constructive feedback, which identifies key areas where additional validation and statistical detail will strengthen the manuscript. We respond to each major comment below and will incorporate the recommended changes in the revised version.
Point-by-point responses
-
Referee: [Abstract / Evaluation Methodology] Abstract and evaluation section: the central performance-ceiling claim (~66% peak, ~55% mean) and the domain-divergence finding rest on ShotJudge outputs, yet no inter-rater reliability statistics (Cohen’s kappa, Pearson r, or equivalent) between ShotJudge scores and independent human experts scoring the same tasks are reported.
Authors: We agree that reporting inter-rater reliability would provide stronger substantiation for ShotJudge. Although the method is calibrated on expert few-shot exemplars, a dedicated comparison with independent human scorers was not included. In the revision we will add a human validation subsection based on a representative subset of tasks, reporting Cohen’s kappa and Pearson correlation between ShotJudge outputs and expert ratings. revision: yes
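For concreteness, a minimal sketch of the reliability statistics requested here, computed on hypothetical judge and expert scores; the data and the simple hand-rolled implementations are illustrative, not the authors' planned analysis.

```python
import numpy as np

def cohens_kappa(a, b):
    """Cohen's kappa for two raters' categorical labels (e.g. pass/fail per checkpoint)."""
    a, b = np.asarray(a), np.asarray(b)
    labels = np.union1d(a, b)
    p_o = np.mean(a == b)                                          # observed agreement
    p_e = sum(np.mean(a == l) * np.mean(b == l) for l in labels)   # chance agreement
    return (p_o - p_e) / (1.0 - p_e)

def pearson_r(x, y):
    """Pearson correlation between continuous scores (e.g. per-task rubric scores)."""
    x, y = np.asarray(x, float), np.asarray(y, float)
    x, y = x - x.mean(), y - y.mean()
    return float((x @ y) / np.sqrt((x @ x) * (y @ y)))

# Hypothetical per-task scores from the LLM judge and from human experts.
judge  = [0.72, 0.55, 0.61, 0.48, 0.90, 0.33]
expert = [0.70, 0.60, 0.58, 0.40, 0.85, 0.30]
print(pearson_r(judge, expert))
# Hypothetical pass/fail verdicts on the same six checkpoints.
print(cohens_kappa([1, 0, 1, 0, 1, 0], [1, 0, 1, 1, 1, 0]))  # -> 0.667
```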
-
Referee: [Benchmark Construction] Rubric construction paragraph: the weighting scheme for the 15-40 checkpoints is described as “weighted” but no procedure for deriving or validating the weights across the 1,346 tasks is supplied, leaving open the possibility that domain-specific score differences are artifacts of rubric construction rather than genuine model behavior.
Authors: The observation is correct; the weighting derivation process requires fuller description. Weights were assigned through direct consultation with the domain experts who authored each task, prioritizing checkpoints according to professional standards. We will expand the Benchmark Construction section to detail this procedure, including expert review steps used to validate the assigned weights. revision: yes
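A minimal sketch of one way expert-assigned checkpoint weights could be consolidated and sanity-checked; the rating scale, aggregation rule, and agreement threshold are assumptions for illustration, not the procedure the authors describe.

```python
import numpy as np

def consolidate_weights(expert_ratings, min_agreement=0.6):
    """Consolidate per-checkpoint importance ratings from several experts.

    `expert_ratings` is a (num_experts x num_checkpoints) array of, say, 1-5
    importance scores. Returns weights normalized to sum to 1, plus a flag per
    checkpoint indicating whether experts agreed closely enough to trust it.
    """
    r = np.asarray(expert_ratings, float)
    weights = r.mean(axis=0)
    weights = weights / weights.sum()
    # Simple validation step: relative spread across experts per checkpoint.
    spread = r.std(axis=0) / r.mean(axis=0)
    agreed = spread <= (1.0 - min_agreement)
    return weights, agreed

ratings = [[5, 3, 1],     # expert A's importance ratings for 3 checkpoints
           [4, 3, 2],     # expert B
           [5, 2, 1]]     # expert C
w, ok = consolidate_weights(ratings)
print(w.round(3), ok)     # -> [0.538 0.308 0.154] [ True  True  True]
```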
-
Referee: [Empirical Evaluation] Results section: the reported success rates and non-overlapping domain strengths are presented without accompanying statistical tests, confidence intervals, or controls for judge calibration drift, so the quantitative support for the expert-gap conclusion remains incomplete.
Authors: We accept that the current presentation lacks necessary statistical support. The revised Results section will include 95% confidence intervals for success rates and mean scores, appropriate statistical tests for domain comparisons, and explicit controls for judge calibration drift such as fixed few-shot sets and periodic consistency checks. revision: yes
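A minimal sketch of the kind of uncertainty quantification promised here: a percentile-bootstrap confidence interval for one model's mean score and a bootstrap interval for the quantitative-versus-linguistic gap, run on synthetic per-task scores rather than the paper's data.

```python
import numpy as np

rng = np.random.default_rng(0)

def bootstrap_ci(scores, n_boot=5_000, alpha=0.05):
    """Percentile bootstrap 95% CI for the mean per-task score of one model."""
    scores = np.asarray(scores, float)
    means = [rng.choice(scores, size=len(scores), replace=True).mean()
             for _ in range(n_boot)]
    lo, hi = np.quantile(means, [alpha / 2, 1 - alpha / 2])
    return scores.mean(), (lo, hi)

def domain_gap_test(quant_scores, ling_scores, n_boot=5_000):
    """Bootstrap interval for the difference between a model's quantitative-domain
    mean and its linguistic-domain mean (two independent samples of tasks)."""
    q, l = np.asarray(quant_scores, float), np.asarray(ling_scores, float)
    observed = q.mean() - l.mean()
    diffs = [rng.choice(q, len(q)).mean() - rng.choice(l, len(l)).mean()
             for _ in range(n_boot)]
    lo, hi = np.quantile(diffs, [0.025, 0.975])
    return observed, (lo, hi)

# Synthetic per-task rubric scores for one model.
quant = rng.beta(6, 4, size=200)   # stand-in for finance/STEM tasks
ling  = rng.beta(5, 5, size=200)   # stand-in for legal/humanities tasks
print(bootstrap_ci(quant))
print(domain_gap_test(quant, ling))
```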
Circularity Check
No significant circularity in derivation chain
full rationale
The paper's performance claims are produced by applying the introduced ShotJudge (a few-shot calibrated LLM judge) to 1,346 tasks and rubrics sourced from over 1,000 independent domain-expert submissions. These inputs pre-exist the evaluated model outputs and are not derived from them. No equations, fitted parameters, or self-citations appear that would reduce the reported success rates or domain divergences back to the same data by construction. The evaluation pipeline is grounded in the externally sourced expert benchmark rather than in the outputs of the models it evaluates.
Axiom & Free-Parameter Ledger
axioms (2)
- domain assumption: Expert submissions from elite institutions and practitioners ensure ecological validity and superior task quality.
- domain assumption: Rubrics with 15-40 weighted checkpoints provide a reliable, professional-grade scoring standard.
invented entities (1)
- ShotJudge: no independent evidence
Lean theorems connected to this paper
- IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel (unclear)
Unclear relation between the paper passage and the cited Recognition theorem.
Paper passage: "Each task uses detailed rubrics with mostly 15-40 weighted checkpoints... ShotJudge... LLM judges calibrated with expert few-shot exemplars"
- IndisputableMonolith/Foundation/RealityFromDistinction.lean · reality_from_one_distinction (unclear)
Unclear relation between the paper passage and the cited Recognition theorem.
Paper passage: "1,346 meticulously curated tasks across 80 categories... domain-specific divergence"
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Reference graph
Works this paper leans on
-
[1]
MMLU-Pro: A more robust and challenging multi-task language understanding benchmark
Yubo Wang, Xueguang Ma, Ge Zhang, Yuansheng Ni, Abhranil Chandra, Shiguang Guo, Weiming Ren, Aaran Arulraj, Xuan He, Ziyan Jiang, et al. MMLU-Pro: A more robust and challenging multi-task language understanding benchmark. In Advances in Neural Information Processing Systems, volume 37, 2024
work page 2024
-
[2]
David Rein, Betty Li Hou, Asa Cooper Stickland, Jackson Petty, Richard Yuanzhe Pang, Julien Dirani, Julian Michael, and Samuel R. Bowman. GPQA: A graduate-level Google-proof Q&A benchmark. In Proceedings of the First Conference on Language Modeling, 2024
work page 2024
-
[3]
Humanity’s last exam
Long Phan, Alice Gatti, Ziwen Han, Nathaniel Li, Josephina Hu, Hugh Zhang, Sean Shi, Michael Choi, Anish Agrawal, Arnav Chopra, et al. Humanity’s last exam. Nature, 2026
work page 2026
-
[4]
Elliot Glazer, Ege Erdil, Tamay Besiroglu, Diego Chicharro, Evan Chen, Alex Gunning, Caroline Falkman Olsson, Jean-Stanislas Denain, Anson Ho, Emily de Oliveira Santos, et al. FrontierMath: A benchmark for evaluating advanced mathematical reasoning in AI. arXiv preprint arXiv:2411.04872, 2024
-
[5]
GAIA: a benchmark for general AI assistants
Grégoire Mialon, Clémentine Fourrier, Craig Swift, Thomas Wolf, Yann LeCun, and Thomas Scialom. GAIA: a benchmark for general AI assistants. In The Twelfth International Conference on Learning Representations, 2024
work page 2024
-
[6]
Jason Wei, Mia Cho, Aidan Cummings, Karina Guo, Shixiang Shane Hu, Simon Kang, Heidy Khlaaf, Neal Miao, Oam Neyman, Noa Rubin, et al. BrowseComp: A simple yet challenging benchmark for browsing agents. arXiv preprint arXiv:2501.12959, 2025
-
[7]
Di Jin, Eileen Pan, Nassim Oufattole, Wei-Hung Weng, Hanyi Fang, and Peter Szolovits. What disease does this patient have? A large-scale open domain question answering dataset from medical exams. Applied Sciences, 11(14):6421, 2021
work page 2021
-
[8]
PubMedQA: A dataset for biomedical research question answering
Qiao Jin, Bhuwan Dhingra, Zhengping Liu, William Cohen, and Xinghua Lu. PubMedQA: A dataset for biomedical research question answering. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 2567–2577, 2019
work page 2019
-
[9]
SciBench: Evaluating college-level scientific problem-solving abilities of large language models
Xiaoxuan Wang, Ziniu Hu, Pan Lu, Yanqiao Zhu, Jieyu Zhang, Satyen Subramaniam, Arjun R. Loomba, Shichang Zhang, Yizhou Sun, and Wei Wang. SciBench: Evaluating college-level scientific problem-solving abilities of large language models. In Proceedings of the Forty-First International Conference on Machine Learning, 2024
work page 2024
-
[10]
Neel Guha, Julian Nyarko, Daniel E. Ho, Christopher Ré, Adam Chilton, Aditya Narayana, Alex Chohlas-Wood, Austin Peters, Brandon Waldon, Daniel N. Rockmore, et al. LegalBench: A collaboratively built benchmark for measuring legal reasoning in large language models. In Advances in Neural Information Processing Systems, volume 36, pages 59201–59242. Curran As...
work page 2023
-
[11]
FinBen: A holistic financial benchmark for large language models
Qianqian Xie, Weiguang Han, Zhengyu Chen, Ruoyu Xia, Xiao Zhang, Yueru He, Mengxi Xiao, Dong Li, Yongfu Dai, Duanyu Feng, et al. FinBen: A holistic financial benchmark for large language models. In Advances in Neural Information Processing Systems, volume 37, 2024
work page 2024
-
[12]
AgentBench: Evaluating LLMs as agents
Xiao Liu, Hao Yu, Hanchen Zhang, Yifan Xu, Xuanyu Lei, Hanyu Lai, Yu Gu, Hangliang Ding, Kaiwen Men, Kejuan Yang, et al. AgentBench: Evaluating LLMs as agents. In The Twelfth International Conference on Learning Representations, 2024
work page 2024
-
[13]
DeepResearch Bench: A comprehensive benchmark for deep research agents
Mingxuan Du, Benfeng Xu, Chiwei Zhu, Licheng Zhang, Xiaorui Wang, and Zhendong Mao. DeepResearch Bench: A comprehensive benchmark for deep research agents. In International Conference on Learning Representations, 2026
work page 2026
-
[14]
Yifan Zhang, Yifan Chen, Haoyang Liu, Zhicheng Fang, et al. DEER: A comprehensive and reliable benchmark for deep research agents on expert-level research tasks. arXiv preprint arXiv:2512.17776, 2025
-
[15]
Xuechen Li, Tianyi Zhang, Yann Dubois, et al. AlpacaEval: An Automatic Evaluator for Instruction-following Models. https://github.com/tatsu-lab/alpaca_eval, 2023
work page 2023
-
[16]
Lianmin Zheng, Wei-Lin Chiang, Ying Sheng, Siyuan Zhuang, Zhanghao Wu, Yonghao Zhuang, Zi Lin, Zhuohan Li, Dacheng Li, Eric P. Xing, Hao Zhang, Joseph E. Gonzalez, and Ion Stoica. Judging LLM-as-a-judge with MT-Bench and Chatbot Arena. In Advances in Neural Information Processing Systems, volume 36, 2023
work page 2023
-
[17]
Tianle Li, Wei-Lin Chiang, Evan Frick, et al. Arena-Hard Auto: Evaluating LLMs with Human-in-the-loop Standards. https://lmsys.org/blog/2024-04-19-arena-hard/, 2024
work page 2024
-
[18]
Bill Yuchen Lin, Yuntian Deng, Khyathi Raghavi Chandu, Faeze Brahman, Abhilasha Srivastava, Abhilasha Ravichander, Yejin Choi, and Noah A. Smith. WildBench: Benchmarking LLMs with challenging tasks from real users in the wild. In Advances in Neural Information Processing Systems, volume 37, 2024
work page 2024
-
[19]
Prometheus: Inducing fine-grained evaluation capability in language models
Seungone Bi, Christoph Koch, Guoyin Chen, Trevor Agarwal, et al. Prometheus: Inducing fine-grained evaluation capability in language models. In The Twelfth International Conference on Learning Representations, 2024
work page 2024
-
[20]
JudgeBench: A benchmark for evaluating LLM-based judges
Sijun Zhou, Nuo Huang, Ran Xu, Renren Yan, Muning Li, Yanghua Xiao, and Libby Hemphill. JudgeBench: A benchmark for evaluating LLM-based judges. In International Conference on Learning Representations, 2025
work page 2025
-
[21]
Holistic evaluation of language models
Percy Liang, Rishi Bommasani, Tony Lee, Dimitris Tsipras, Dilara Soylu, Michihiro Yasunaga, Yian Zhang, Deepak Narayanan, Yuhuai Wu, Ananya Kumar, et al. Holistic evaluation of language models. Transactions on Machine Learning Research, 2023
work page 2023
-
[22]
Wei-Lin Chiang, Lianmin Zheng, Ying Sheng, Anastasios Nikolas Angelopoulos, Tianle Li, Dacheng Li, Hao Zhang, Banghua Zhu, Michael Jordan, Joseph E. Gonzalez, and Ion Stoica. Chatbot Arena: An open platform for evaluating LLMs by human preference. In Proceedings of the Forty-First International Conference on Machine Learning, 2024
work page 2024
-
[23]
Measuring short-form factuality in large language models, 2024
Jason Wei, Karina Nguyen, Hyung Won Chung, Yunxin Joy Jiao, Spencer Papay, Amelia Glaese, John Schulman, and William Fedus. Measuring short-form factuality in large language models, 2024
work page 2024
-
[24]
RubricEval: A scalable human-LLM evaluation framework for open-ended tasks
Meera Bhat, Xi Fang, and Jacob Steinhardt. RubricEval: A scalable human-LLM evaluation framework for open-ended tasks. Stanford CS224N Final Reports, 2024
work page 2024
-
[25]
Large language models are not fair evaluators, 2023
Peiyi Wang, Lei Li, Liang Chen, Zefan Cai, Dawei Zhu, Binghuai Lin, Yunbo Cao, Qi Liu, Tianyu Liu, and Zhifang Sui. Large language models are not fair evaluators, 2023
work page 2023