pith. machine review for the scientific record.

arxiv: 2604.02368 · v4 · submitted 2026-03-27 · 💻 cs.AI · cs.CL

Recognition: 2 theorem links

· Lean Theorem

Xpertbench: Expert Level Tasks with Rubrics-Based Evaluation

Authors on Pith: no claims yet

Pith reviewed 2026-05-14 23:18 UTC · model grok-4.3

classification 💻 cs.AI cs.CL
keywords XpertBench · LLM benchmark · expert tasks · rubric evaluation · professional domains · performance ceiling · ShotJudge

The pith

Leading LLMs reach only around 66 percent success on expert-level professional tasks.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces XpertBench, a collection of 1,346 tasks drawn directly from more than 1,000 submissions by domain experts in finance, healthcare, law, education, and research. Each task comes with detailed rubrics that break performance into 15 to 40 weighted checkpoints to measure professional rigor. To scale evaluation without self-bias, the work defines ShotJudge, an approach that calibrates LLM judges using expert few-shot examples. Applied to current top models, the benchmark shows a hard ceiling of roughly 66 percent peak success and 55 percent mean scores, along with a clear split: some models are stronger at quantitative reasoning, others at linguistic synthesis, with little overlap. This gap suggests that general-purpose training has not yet produced reliable expert-level judgment.
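The abstract describes the rubrics only at this level of detail; as a hedged sketch (the dataclass, field names, and the binary pass/fail simplification are ours, not the paper's code), weighted-checkpoint scoring reduces to:

```python
from dataclasses import dataclass

@dataclass
class Checkpoint:
    description: str
    weight: float    # relative importance assigned by the task's expert author
    satisfied: bool  # judge's binary verdict for this checkpoint

def rubric_score(checkpoints: list[Checkpoint]) -> float:
    """Weighted fraction of satisfied checkpoints, in [0, 1]."""
    total = sum(c.weight for c in checkpoints)
    earned = sum(c.weight for c in checkpoints if c.satisfied)
    return earned / total if total else 0.0

# Toy 3-checkpoint rubric; real tasks use 15-40 weighted checkpoints
rubric = [
    Checkpoint("Cites the governing statute", 3.0, True),
    Checkpoint("Applies the correct legal test", 5.0, True),
    Checkpoint("Flags the jurisdictional exception", 2.0, False),
]
print(rubric_score(rubric))  # 8/10 = 0.8
```

A partial-credit scheme (per-checkpoint scores rather than pass/fail) would drop into the same shape by making `satisfied` a float.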

Core claim

XpertBench demonstrates that state-of-the-art large language models exhibit a pronounced performance ceiling of approximately 66 percent peak success rate and mean scores near 55 percent when tested on 1,346 expert-curated tasks across finance, healthcare, legal services, education, and dual-track research domains, accompanied by non-overlapping strengths between quantitative reasoning and linguistic synthesis.

What carries the argument

XpertBench benchmark of 1,346 tasks sourced from expert submissions and scored with rubrics of 15-40 weighted checkpoints, assessed via the ShotJudge paradigm of expert-calibrated few-shot LLM judging.
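The ShotJudge prompt format is not published here; one plausible shape of expert-calibrated few-shot judging, with every name and field an assumption of ours rather than the paper's API, is:

```python
def build_judge_prompt(task, rubric, answer, exemplars):
    """Assemble a judging prompt calibrated with expert-scored exemplars.

    `exemplars` are (answer, expert_verdicts) pairs; including them is the
    few-shot calibration idea -- the LLM judge sees how experts graded
    earlier answers before grading the new one.
    """
    parts = ["You are grading an expert-level task against a rubric.",
             f"Task: {task}",
             "Rubric checkpoints (weight: description):"]
    parts += [f"  {weight}: {desc}" for desc, weight in rubric]
    for ex_answer, ex_verdicts in exemplars:
        parts.append(f"Example answer: {ex_answer}")
        parts.append(f"Expert checkpoint verdicts: {ex_verdicts}")
    parts.append(f"Answer to grade: {answer}")
    parts.append("Return one pass/fail verdict per checkpoint.")
    return "\n".join(parts)
```

The key design point is that calibration lives entirely in the prompt, so the same judge model can be reused across domains by swapping exemplars.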

If this is right

  • Current models remain general assistants rather than dependable specialized collaborators in high-stakes professional settings.
  • Quantitative and linguistic domains require distinct improvements because model strengths do not overlap.
  • Rubric-based scoring with many checkpoints supplies finer diagnostics than single-score accuracy metrics.
  • Training objectives must target the identified expert-gap to move beyond plateaued general benchmarks.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Sustained progress on these tasks could function as a practical yardstick for when AI systems become viable independent professional agents.
  • Domain divergence suggests hybrid systems that route tasks to the strongest available model for each area could raise overall performance.
  • The benchmark tasks themselves could serve as training data for targeted fine-tuning once the ceiling is better understood.
  • Human-AI workflows could use the checkpoints to decide precisely where expert review remains necessary.
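The routing idea in the second bullet can be sketched in a few lines; the model names and per-domain scores below are invented for illustration, not results from the paper:

```python
# Invented per-domain mean scores; the paper reports divergence between
# quantitative and linguistic strengths, but these numbers are illustrative.
model_scores = {
    "model_quant":   {"finance": 0.68, "law": 0.51, "healthcare": 0.60},
    "model_lingual": {"finance": 0.54, "law": 0.63, "healthcare": 0.57},
}

def route(domain: str) -> str:
    """Send a task to whichever model scores best in its domain."""
    return max(model_scores, key=lambda m: model_scores[m].get(domain, 0.0))

assert route("finance") == "model_quant"    # quantitative-leaning domain
assert route("law") == "model_lingual"      # synthesis-leaning domain
```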

Load-bearing premise

The expert submissions and the resulting weighted rubric checkpoints accurately capture genuine expert-level cognition without selection bias or construction artifacts.

What would settle it

If panels of independent domain experts re-grade the same 1,346 tasks under the same rubrics and find that leading models in fact exceed 80 percent success, the reported ~66 percent performance ceiling would be falsified.

read the original abstract

As Large Language Models (LLMs) exhibit plateauing performance on conventional benchmarks, a pivotal challenge persists: evaluating their proficiency in complex, open-ended tasks characterizing genuine expert-level cognition. Existing frameworks suffer from narrow domain coverage, reliance on generalist tasks, or self-evaluation biases. To bridge this gap, we present XpertBench, a high-fidelity benchmark engineered to assess LLMs across authentic professional domains. XpertBench consists of 1,346 meticulously curated tasks across 80 categories, spanning finance, healthcare, legal services, education, and dual-track research (STEM and Humanities). These tasks are derived from over 1,000 submissions by domain experts--including researchers from elite institutions and practitioners with extensive clinical or industrial experience--ensuring superior ecological validity. Each task uses detailed rubrics with mostly 15-40 weighted checkpoints to assess professional rigor. To facilitate scalable yet human-aligned assessment, we introduce ShotJudge, a novel evaluation paradigm that employs LLM judges calibrated with expert few-shot exemplars to mitigate self-rewarding biases. Our empirical evaluation of state-of-the-art LLMs reveals a pronounced performance ceiling: even leading models achieve a peak success rate of only ~66%, with a mean score around 55%. Models also exhibit domain-specific divergence, showing non-overlapping strengths in quantitative reasoning versus linguistic synthesis.. These findings underscore a significant "expert-gap" in current AI systems and establish XpertBench as a critical instrument for navigating the transition from general-purpose assistants to specialized professional collaborators.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance; this is the friction.

Referee Report

3 major / 1 minor

Summary. The paper introduces XpertBench, a benchmark of 1,346 expert-derived tasks across 80 categories in domains including finance, healthcare, legal, education, and research. Tasks are assessed via rubrics containing 15-40 weighted checkpoints, with evaluation performed by the introduced ShotJudge method that uses LLM judges calibrated on expert few-shot exemplars. Empirical results on state-of-the-art LLMs report a peak success rate of ~66% and mean score of ~55%, together with domain-specific performance divergences between quantitative and linguistic tasks, which the authors interpret as evidence of an expert-level gap.

Significance. If the rubric validity and ShotJudge alignment claims are substantiated, the work would provide a useful high-ecological-validity instrument for tracking progress toward professional-grade AI capabilities. The scale of expert-sourced tasks and the shift from self-evaluation to calibrated external judging are constructive steps beyond many existing LLM benchmarks.

major comments (3)
  1. [Abstract / Evaluation Methodology] Abstract and evaluation section: the central performance-ceiling claim (~66% peak, ~55% mean) and the domain-divergence finding rest on ShotJudge outputs, yet no inter-rater reliability statistics (Cohen’s kappa, Pearson r, or equivalent) between ShotJudge scores and independent human experts scoring the same tasks are reported.
  2. [Benchmark Construction] Rubric construction paragraph: the weighting scheme for the 15-40 checkpoints is described as “weighted” but no procedure for deriving or validating the weights across the 1,346 tasks is supplied, leaving open the possibility that domain-specific score differences are artifacts of rubric construction rather than genuine model behavior.
  3. [Empirical Evaluation] Results section: the reported success rates and non-overlapping domain strengths are presented without accompanying statistical tests, confidence intervals, or controls for judge calibration drift, so the quantitative support for the expert-gap conclusion remains incomplete.
minor comments (1)
  1. [Abstract] The final sentence of the abstract contains a double period (“synthesis.. These”).
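The reliability statistics named in major comment 1 are standard; a dependency-free sketch, with the judge and human verdict vectors invented for illustration:

```python
def cohen_kappa(a, b):
    """Cohen's kappa for two raters' binary checkpoint verdicts."""
    n = len(a)
    po = sum(x == y for x, y in zip(a, b)) / n   # observed agreement
    pa, pb = sum(a) / n, sum(b) / n              # marginal pass rates
    pe = pa * pb + (1 - pa) * (1 - pb)           # chance agreement
    return (po - pe) / (1 - pe) if pe < 1 else 1.0

def pearson_r(x, y):
    """Pearson correlation between two score vectors."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sum((a - mx) ** 2 for a in x) ** 0.5
    sy = sum((b - my) ** 2 for b in y) ** 0.5
    return cov / (sx * sy)

# Invented verdicts: ShotJudge vs. a human expert on ten checkpoints
judge = [1, 1, 0, 1, 0, 1, 1, 0, 1, 1]
human = [1, 1, 0, 1, 1, 1, 0, 0, 1, 1]
print(round(cohen_kappa(judge, human), 3))  # 0.524
print(round(pearson_r(judge, human), 3))    # 0.524
```

For binary verdicts with matched marginals the two statistics coincide, which is why a validation study would typically report kappa on verdicts and Pearson r on the continuous rubric scores.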

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive feedback, which identifies key areas where additional validation and statistical detail will strengthen the manuscript. We respond to each major comment below and will incorporate the recommended changes in the revised version.

read point-by-point responses
  1. Referee: [Abstract / Evaluation Methodology] Abstract and evaluation section: the central performance-ceiling claim (~66% peak, ~55% mean) and the domain-divergence finding rest on ShotJudge outputs, yet no inter-rater reliability statistics (Cohen’s kappa, Pearson r, or equivalent) between ShotJudge scores and independent human experts scoring the same tasks are reported.

    Authors: We agree that reporting inter-rater reliability would provide stronger substantiation for ShotJudge. Although the method is calibrated on expert few-shot exemplars, a dedicated comparison with independent human scorers was not included. In the revision we will add a human validation subsection based on a representative subset of tasks, reporting Cohen’s kappa and Pearson correlation between ShotJudge outputs and expert ratings. revision: yes

  2. Referee: [Benchmark Construction] Rubric construction paragraph: the weighting scheme for the 15-40 checkpoints is described as “weighted” but no procedure for deriving or validating the weights across the 1,346 tasks is supplied, leaving open the possibility that domain-specific score differences are artifacts of rubric construction rather than genuine model behavior.

    Authors: The observation is correct; the weighting derivation process requires fuller description. Weights were assigned through direct consultation with the domain experts who authored each task, prioritizing checkpoints according to professional standards. We will expand the Benchmark Construction section to detail this procedure, including expert review steps used to validate the assigned weights. revision: yes

  3. Referee: [Empirical Evaluation] Results section: the reported success rates and non-overlapping domain strengths are presented without accompanying statistical tests, confidence intervals, or controls for judge calibration drift, so the quantitative support for the expert-gap conclusion remains incomplete.

    Authors: We accept that the current presentation lacks necessary statistical support. The revised Results section will include 95% confidence intervals for success rates and mean scores, appropriate statistical tests for domain comparisons, and explicit controls for judge calibration drift such as fixed few-shot sets and periodic consistency checks. revision: yes
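The 95% confidence intervals promised in this response could be obtained with a simple percentile bootstrap; a sketch over invented per-task scores (real runs would resample all 1,346 tasks):

```python
import random

def bootstrap_ci(scores, n_boot=10_000, alpha=0.05, seed=0):
    """Percentile bootstrap CI for the mean of per-task rubric scores."""
    rng = random.Random(seed)
    n = len(scores)
    means = sorted(
        sum(rng.choice(scores) for _ in range(n)) / n
        for _ in range(n_boot)
    )
    return means[int(alpha / 2 * n_boot)], means[int((1 - alpha / 2) * n_boot) - 1]

# Invented per-task scores on [0, 1] for one model
scores = [0.62, 0.48, 0.71, 0.55, 0.40, 0.66, 0.53, 0.59, 0.47, 0.50]
lo, hi = bootstrap_ci(scores)
print(f"mean={sum(scores) / len(scores):.3f}  95% CI=({lo:.3f}, {hi:.3f})")
```

Fixing the seed and the few-shot exemplar sets, as the authors propose, is what makes such intervals comparable across judge runs.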

Circularity Check

0 steps flagged

No significant circularity in derivation chain

full rationale

The paper's performance claims are produced by applying the introduced ShotJudge (few-shot calibrated LLM judge) to 1,346 tasks and rubrics sourced from over 1,000 independent domain-expert submissions. These inputs pre-exist the evaluated model outputs and are not derived from them. No equations, fitted parameters, or self-citations are shown to reduce the reported success rates or domain divergences back to the same data by construction. The evaluation pipeline is externally grounded and self-contained against the expert-sourced benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 1 invented entity

The central claim rests on the premise that expert-submitted tasks and weighted rubrics faithfully represent professional expertise and that ShotJudge produces human-aligned scores without introducing new biases.

axioms (2)
  • domain assumption Expert submissions from elite institutions and practitioners ensure ecological validity and superior task quality
    Invoked to justify the benchmark's fidelity over prior generalist tasks.
  • domain assumption Rubrics with 15-40 weighted checkpoints provide a reliable, professional-grade scoring standard
    Central to the evaluation protocol but not independently validated in the abstract.
invented entities (1)
  • ShotJudge no independent evidence
    purpose: LLM-based judge calibrated with expert few-shot exemplars to reduce self-rewarding biases
    New evaluation paradigm introduced to enable scalable assessment aligned with human experts.

pith-pipeline@v0.9.0 · 5715 in / 1424 out tokens · 49813 ms · 2026-05-14T23:18:39.401843+00:00 · methodology

discussion (0)


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
matches
The paper's claim is directly supported by a theorem in the formal canon.
supports
The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends
The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses
The paper appears to rely on the theorem as machinery.
contradicts
The paper's claim conflicts with a theorem or certificate in the canon.
unclear
Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

33 extracted references · 33 canonical work pages

  1. [1]

    MMLU-Pro: A more robust and challenging multi-task language understanding benchmark

    Yubo Wang, Xueguang Ma, Ge Zhang, Yuansheng Ni, Abhranil Chandra, Shiguang Guo, Weiming Ren, Aaran Arulraj, Xuan He, Ziyan Jiang, et al. MMLU-Pro: A more robust and challenging multi-task language understanding benchmark. In Advances in Neural Information Processing Systems, volume 37, 2024

  2. [2]

    David Rein, Betty Li Hou, Asa Cooper Stickland, Jackson Petty, Richard Yuanzhe Pang, Julien Dirani, Julian Michael, and Samuel R. Bowman. GPQA: A graduate-level Google-proof Q&A benchmark. In Proceedings of the First Conference on Language Modeling, 2024

  3. [3]

    Humanity’s last exam. Nature, 2026

    Long Phan, Alice Gatti, Ziwen Han, Nathaniel Li, Josephina Hu, Hugh Zhang, Sean Shi, Michael Choi, Anish Agrawal, Arnav Chopra, et al. Humanity’s last exam. Nature, 2026

  4. [4]

    FrontierMath: A benchmark for evaluating advanced mathematical reasoning in AI. arXiv preprint arXiv:2411.04872, 2024

    Elliot Glazer, Ege Erdil, Tamay Besiroglu, Diego Chicharro, Evan Chen, Alex Gunning, Caroline Falkman Olsson, Jean-Stanislas Denain, Anson Ho, Emily de Oliveira Santos, et al. FrontierMath: A benchmark for evaluating advanced mathematical reasoning in AI. arXiv preprint arXiv:2411.04872, 2024

  5. [5]

    GAIA: a benchmark for general AI assistants

    Grégoire Mialon, Clémentine Fourrier, Craig Swift, Thomas Wolf, Yann LeCun, and Thomas Scialom. GAIA: a benchmark for general AI assistants. In The Twelfth International Conference on Learning Representations, 2024

  6. [6]

    BrowseComp: A simple yet challenging benchmark for browsing agents. arXiv preprint arXiv:2501.12959, 2025

    Jason Wei, Mia Cho, Aidan Cummings, Karina Guo, Shixiang Shane Hu, Simon Kang, Heidy Khlaaf, Neal Miao, Oam Neyman, Noa Rubin, et al. BrowseComp: A simple yet challenging benchmark for browsing agents. arXiv preprint arXiv:2501.12959, 2025

  7. [7]

    What disease does this patient have? A large-scale open domain question answering dataset from medical exams. Applied Sciences, 11(14):6421, 2021

    Di Jin, Eileen Pan, Nassim Oufattole, Wei-Hung Weng, Hanyi Fang, and Peter Szolovits. What disease does this patient have? A large-scale open domain question answering dataset from medical exams. Applied Sciences, 11(14):6421, 2021

  8. [8]

    PubMedQA: A dataset for biomedical research question answering

    Qiao Jin, Bhuwan Dhingra, Zhengping Liu, William Cohen, and Xinghua Lu. PubMedQA: A dataset for biomedical research question answering. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 2567–2577, 2019

  9. [9]

    SciBench: Evaluating college-level scientific problem-solving abilities of large language models

    Xiaoxuan Wang, Ziniu Hu, Pan Lu, Yanqiao Zhu, Jieyu Zhang, Satyen Subramaniam, Arjun R. Loomba, Shichang Zhang, Yizhou Sun, and Wei Wang. SciBench: Evaluating college-level scientific problem-solving abilities of large language models. In Proceedings of the Forty-First International Conference on Machine Learning, 2024

  10. [10]

    LegalBench: A collaboratively built benchmark for measuring legal reasoning in large language models

    Neel Guha, Julian Nyarko, Daniel E. Ho, Christopher Ré, Adam Chilton, Aditya Narayana, Alex Chohlas-Wood, Austin Peters, Brandon Waldon, Daniel N. Rockmore, et al. LegalBench: A collaboratively built benchmark for measuring legal reasoning in large language models. In Advances in Neural Information Processing Systems, volume 36, pages 59201–59242. Curran As...

  11. [11]

    FinBen: A holistic financial benchmark for large language models

    Qianqian Xie, Weiguang Han, Zhengyu Chen, Ruoyu Xia, Xiao Zhang, Yueru He, Mengxi Xiao, Dong Li, Yongfu Dai, Duanyu Feng, et al. FinBen: A holistic financial benchmark for large language models. In Advances in Neural Information Processing Systems, volume 37, 2024

  12. [12]

    AgentBench: Evaluating LLMs as agents

    Xiao Liu, Hao Yu, Hanchen Zhang, Yifan Xu, Xuanyu Lei, Hanyu Lai, Yu Gu, Hangliang Ding, Kaiwen Men, Kejuan Yang, et al. AgentBench: Evaluating LLMs as agents. In The Twelfth International Conference on Learning Representations, 2024

  13. [13]

    DeepResearch Bench: A comprehensive benchmark for deep research agents

    Mingxuan Du, Benfeng Xu, Chiwei Zhu, Licheng Zhang, Xiaorui Wang, and Zhendong Mao. DeepResearch Bench: A comprehensive benchmark for deep research agents. In International Conference on Learning Representations, 2026

  14. [14]

    DEER: A comprehensive and reliable benchmark for deep research agents on expert-level research tasks. arXiv preprint arXiv:2512.17776, 2025

    Yifan Zhang, Yifan Chen, Haoyang Liu, Zhicheng Fang, et al. DEER: A comprehensive and reliable benchmark for deep research agents on expert-level research tasks. arXiv preprint arXiv:2512.17776, 2025

  15. [15]

    AlpacaEval: An Automatic Evaluator for Instruction-following Models. https://github.com/tatsu-lab/alpaca_eval, 2023

    Xuechen Li, Tianyi Zhang, Yann Dubois, et al. AlpacaEval: An Automatic Evaluator for Instruction-following Models. https://github.com/tatsu-lab/alpaca_eval, 2023

  16. [16]

    Judging LLM-as-a-judge with MT-Bench and chatbot arena

    Lianmin Zheng, Wei-Lin Chiang, Ying Sheng, Siyuan Zhuang, Zhanghao Wu, Yonghao Zhuang, Zi Lin, Zhuohan Li, Dacheng Li, Eric P. Xing, Hao Zhang, Joseph E. Gonzalez, and Ion Stoica. Judging LLM-as-a-judge with MT-Bench and chatbot arena. In Advances in Neural Information Processing Systems, volume 36, 2023

  17. [17]

    Arena-Hard Auto: Evaluating LLMs with Human-in-the-loop Standards. https://lmsys.org/blog/2024-04-19-arena-hard/, 2024

    Tianle Li, Wei-Lin Chiang, Evan Frick, et al. Arena-Hard Auto: Evaluating LLMs with Human-in-the-loop Standards. https://lmsys.org/blog/2024-04-19-arena-hard/, 2024

  18. [18]

    Bill Yuchen Lin, Yuntian Deng, Khyathi Raghavi Chandu, Faeze Brahman, Abhilasha Srivastava, Abhilasha Ravichander, Yejin Choi, and Noah A. Smith. WildBench: Benchmarking LLMs with challenging tasks from real users in the wild. In Advances in Neural Information Processing Systems, volume 37, 2024

  19. [19]

    Prometheus: Inducing fine-grained evaluation capability in language models

    Seungone Bi, Christoph Koch, Guoyin Chen, Trevor Agarwal, et al. Prometheus: Inducing fine-grained evaluation capability in language models. In The Twelfth International Conference on Learning Representations, 2024

  20. [20]

    JudgeBench: A benchmark for evaluating LLM-based judges

    Sijun Zhou, Nuo Huang, Ran Xu, Renren Yan, Muning Li, Yanghua Xiao, and Libby Hemphill. JudgeBench: A benchmark for evaluating LLM-based judges. In International Conference on Learning Representations, 2025

  21. [21]

    Holistic evaluation of language models. Transactions on Machine Learning Research, 2023

    Percy Liang, Rishi Bommasani, Tony Lee, Dimitris Tsipras, Dilara Soylu, Michihiro Yasunaga, Yian Zhang, Deepak Narayanan, Yuhuai Wu, Ananya Kumar, et al. Holistic evaluation of language models. Transactions on Machine Learning Research, 2023

  22. [22]

    Chatbot arena: An open platform for evaluating LLMs by human preference

    Wei-Lin Chiang, Lianmin Zheng, Ying Sheng, Anastasios Nikolas Angelopoulos, Tianle Li, Dacheng Li, Hao Zhang, Banghua Zhu, Michael Jordan, Joseph E. Gonzalez, and Ion Stoica. Chatbot arena: An open platform for evaluating LLMs by human preference. In Proceedings of the Forty-First International Conference on Machine Learning, 2024

  23. [23]

    Measuring short-form factuality in large language models, 2024

    Jason Wei, Karina Nguyen, Hyung Won Chung, Yunxin Joy Jiao, Spencer Papay, Amelia Glaese, John Schulman, and William Fedus. Measuring short-form factuality in large language models, 2024

  24. [24]

    RubricEval: A scalable human-LLM evaluation framework for open-ended tasks

    Meera Bhat, Xi Fang, and Jacob Steinhardt. RubricEval: A scalable human-LLM evaluation framework for open-ended tasks. Stanford CS224N Final Reports, 2024

  25. [25]

    Large language models are not fair evaluators, 2023

    Peiyi Wang, Lei Li, Liang Chen, Zefan Cai, Dawei Zhu, Binghuai Lin, Yunbo Cao, Qi Liu, Tianyu Liu, and Zhifang Sui. Large language models are not fair evaluators, 2023

  26. [26]

    Lockheed Martin Corporation (NYSE: LMT) – The world’s largest defense contractor, renowned for its dominant position in aviation (e.g., the F-35 fighter jet)

  27. [27]

    Book-to-Bill Ratio

    Northrop Grumman Corporation (NYSE: NOC) – A defense giant with formidable technological moats in aerospace, mission systems, and strategic weapons (e.g., the B-21 stealth bomber). Core Analysis Requirements: 1. Future Revenue Visibility Comparison: • The “Book-to-Bill Ratio” serves as the lifeline for gauging a defense company’s future revenue growth pote...

  28. [28]

    Based on legal theory, determine whether the agreement signed between Guangxi Company and China Construction Bank on June 1, 2023, constitutes a loan relationship or a factoring contract relationship?

  29. [29]

    What is the validity of the Factoring Financing Agreement signed between a financing guarantee company in Yunnan Province and Guangxi Company?

  30. [30]

    accounts receivable transfer

    If the financing guarantee company in Yunnan Province asserts rights against Guangxi Company or Yunnan Company based on the factoring contract relationship under the Factoring Financing Agreement, how should the liability be allocated between Yunnan Company and Guangxi Company? A.2.2 Scoring Rubric Table A.2 Scoring rubric for the Law example task. Criteri...

  31. [31]

    Based on the above conditions and comprehensive analysis of electrophoresis patterns in Figures 1 and 2 of the second image, what is the empty vector rate among the 20 selected single colonies?

  32. [32]

    What is the minimum average size of the foreign fragment in the 20 plasmids shown in the figure?

  33. [33]

    vector+foreign fragment,

    Why do Figures 1 and 2 in the second figure show differences when the same plasmid is digested with the same restriction enzymeMboI and its restriction enzymeSau3AI? Attachments: Figure A.1The recognition sequence and cleavage sites (indicated by triangles) of the restriction enzymesMboI and Sau3AI. Both enzymes recognize the same5′-GATC-3′ nucleotide seq...