arxiv: 2606.05104 · v2 · pith:HBCKOP2Knew · submitted 2026-06-03 · 💻 cs.AI

Knowledge Index of Noah's Ark

Pith reviewed 2026-06-28 06:31 UTC · model grok-4.3

classification 💻 cs.AI
keywords LLM benchmark · knowledge evaluation · representativeness · incentive mechanism · greedy approximation · annotation quality · model leaderboard · disciplinary coverage

The pith

KINA is an 899-item benchmark across 261 disciplines with a proxy greedy algorithm for representativeness and a bonus tournament for annotation incentives.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces KINA to fix three problems in LLM knowledge benchmarks: designs that ignore disciplinary representativeness, flat payments that allow low-effort annotations, and unstable rankings from small test budgets. It casts representativeness as a coverage problem over expert anchors and solves it with a greedy method that guarantees (1-1/e) of the proxy optimum. A second formal result shows that paying a bonus only above a performance bar weakly dominates flat payment for quality once the bonus clears a simple cost-probability threshold. Tests on 42 models place the leader at 53.17 percent, expose a tiered rather than smooth performance distribution, and quantify ranking variance through bootstrap statistics.

Core claim

KINA operationalizes disciplinary representativeness via a proxy over expert-elicited anchors, yielding a (1-1/e) greedy approximation guarantee that applies to the proxy. It further proves that a bonus-on-bar tournament weakly first-order stochastically dominates flat payment for released-review quality whenever the bonus B exceeds Delta C / Delta p_min. The resulting 899-item set across 261 disciplines produces a tiered leaderboard in which the top model scores 53.17 percent, tool use adds up to 5.17 points, and bounded-budget variance is reported explicitly.
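
The threshold condition can be illustrated with a small sketch. It assumes, since this summary does not define the symbols, that Delta C is the extra cost an annotator bears for a careful review and that Delta p_min is the smallest increase in the probability of clearing the bar that careful effort buys. The function names and the numbers are hypothetical and are not taken from the paper.

```python
def bonus_threshold(delta_c: float, delta_p_min: float) -> float:
    """Level that the bonus B must strictly exceed: Delta C / Delta p_min."""
    if delta_p_min <= 0:
        raise ValueError("delta_p_min must be positive")
    return delta_c / delta_p_min


def effort_pays(bonus: float, delta_c: float, delta_p: float) -> bool:
    """True when the expected bonus gain from extra effort exceeds its extra cost."""
    return bonus * delta_p > delta_c


# Hypothetical numbers: a careful review costs 5 units more and raises the
# chance of clearing the bar by at least 0.25.
threshold = bonus_threshold(delta_c=5.0, delta_p_min=0.25)
print(threshold)                       # 20.0
print(effort_pays(25.0, 5.0, 0.25))    # True: 25 * 0.25 = 6.25 > 5
print(effort_pays(15.0, 5.0, 0.25))    # False: 15 * 0.25 = 3.75 < 5
```

Under this reading, an annotator whose effort raises the clearing probability by more than Delta p_min also prefers effort once B exceeds the threshold, because the expected gain B * Delta p only grows with Delta p.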

What carries the argument

The proxy-based greedy coverage algorithm for disciplinary representativeness together with the bonus-on-bar tournament payment rule.
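
A minimal sketch of greedy coverage selection may help make the first mechanism concrete. It assumes the proxy can be written as "number of distinct expert anchors covered by the selected items", which is a plain maximum-coverage objective; the paper's actual proxy may differ. The item and anchor names are hypothetical.

```python
from typing import Dict, Hashable, List, Set


def greedy_cover(candidates: Dict[Hashable, Set[Hashable]], budget: int) -> List[Hashable]:
    """Pick up to `budget` items, each time taking the one that covers the most new anchors."""
    covered: Set[Hashable] = set()
    chosen: List[Hashable] = []
    for _ in range(budget):
        best, best_gain = None, 0
        for item, anchors in candidates.items():
            if item in chosen:
                continue
            gain = len(anchors - covered)  # marginal coverage of this item
            if gain > best_gain:           # ties keep the earlier item
                best, best_gain = item, gain
        if best is None:                   # no remaining item adds coverage
            break
        chosen.append(best)
        covered |= candidates[best]
    return chosen


def coverage(candidates: Dict[Hashable, Set[Hashable]], chosen: List[Hashable]) -> int:
    """Number of distinct anchors covered by the chosen items."""
    covered: Set[Hashable] = set()
    for item in chosen:
        covered |= candidates[item]
    return len(covered)


# Hypothetical toy instance: four candidate questions, six anchors.
toy = {
    "q1": {"a1", "a2", "a3"},
    "q2": {"a3", "a4"},
    "q3": {"a4", "a5", "a6"},
    "q4": {"a1", "a6"},
}
picked = greedy_cover(toy, budget=2)
print(picked, coverage(toy, picked))  # ['q1', 'q3'] 6
```

For monotone submodular objectives such as this one, the greedy choice is known to reach at least (1-1/e) of the best achievable value at the same budget. As the paper stresses, that bound says nothing about how well the anchors themselves reflect the underlying discipline.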

If this is right

  • Representativeness holds only relative to the proxy, not necessarily the true population.
  • The tournament payment is incentive-compatible above the stated threshold.
  • Model performance forms distinct tiers: a frontier tier above 48 percent, a dense tier at roughly 38-45 percent, and a low tier only modestly above the 10 percent chance baseline.
  • Tool augmentation yields gains that vary substantially across models and tasks.
  • Bootstrap statistics make explicit the ranking instability possible under limited evaluation budgets.
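
The last point can be sketched as an item-level bootstrap. This summary does not describe the paper's exact protocol, so the version below is an assumption: it takes per-item 0/1 correctness for each model, resamples items with replacement, and reports how often each model ends up ranked first. The model names and data are hypothetical.

```python
import random
from typing import Dict, List


def bootstrap_top1_rate(results: Dict[str, List[int]],
                        n_boot: int = 1000,
                        seed: int = 0) -> Dict[str, float]:
    """Fraction of bootstrap resamples in which each model has the highest score."""
    rng = random.Random(seed)
    models = list(results)
    n_items = len(results[models[0]])
    wins = {m: 0 for m in models}
    for _ in range(n_boot):
        # Resample item indices with replacement; all models share the same resample.
        idx = [rng.randrange(n_items) for _ in range(n_items)]
        scores = {m: sum(results[m][i] for i in idx) for m in models}
        top = max(models, key=lambda m: scores[m])  # ties go to the first listed model
        wins[top] += 1
    return {m: wins[m] / n_boot for m in models}


# Hypothetical per-item correctness for two models on ten items.
toy = {
    "model_a": [1, 1, 0, 1, 0, 1, 1, 0, 1, 1],
    "model_b": [1, 0, 0, 1, 1, 1, 0, 0, 1, 1],
}
rates = bootstrap_top1_rate(toy, n_boot=500, seed=0)
print(rates)
```

A top-1 rate well below 1.0 for the nominal leader is the kind of signal that would argue against reading much into adjacent ranks on a small item set.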

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same proxy-plus-greedy approach could be reused to build coverage benchmarks in domains other than LLM knowledge testing.
  • The incentive-compatibility threshold provides a concrete design rule that might transfer to other crowdsourced annotation settings.
  • Persistent gaps below 55 percent indicate that current architectures still lack reliable integration of knowledge across many disciplines.
  • Future experiments could measure whether models trained or fine-tuned explicitly on the proxy-selected items close the observed performance tiers.

Load-bearing premise

The chosen proxy for disciplinary representativeness is treated as adequate for the coverage guarantee.

What would settle it

An audit that compares actual knowledge coverage achieved by the selected 899 items against a larger expert-curated or random sample to test whether the proxy guarantee translates to population representativeness.

Figures

Figures reproduced from arXiv: 2606.05104 by Bangya Liu, Ge Zhang, Heli Qi, Jiaheng Liu, Jiarui Liu, Jie Wei, Kaijing Ma, Meishu Song, Minghao Liu, Minglai Yang, Ningshan Ma, Qingcheng Zeng, Rui Yang, Sheng Jin, Shen Yan, Sicong Jiang, Weihao Xuan, Wenhao Huang, Xiao Fang, Xuan Zhang, Yifan Yao, Yiming Liang, Yizhe Li, Yunze Xiao, Zeqi Zhou, Zihan Wang, Ziniu Li.

Figure 1: KINA data-collection pipeline. Topic pre-approval enforces representativeness via the proxy of Proposition 1; the double-blind expert-review stage instantiates the bonus-on-bar tournament of Theorem 1; LLM-as-judge consensus filters residual ambiguity; an agentic refinement loop addresses boundary defects.
(adjacent text fragment) … five flagship LLMs on the item. An item is admitted if at least two of three judges vote yes; otherwi…
Figure 2: Distribution of KINA items across the 12 top-level disciplines.
Figure 3: Parameter scaling on KINA. Left: dense models, accuracy vs. total parameters (log scale). Right: MoE models, accuracy vs. active parameters (log scale). Generation-over-generation slope increases from roughly 7.6 (Qwen3) to 14.8 (Qwen3.5) points per decade.
(adjacent text fragment) 6 Limitations. We highlight five limitations. Sample-size variance. KINA is intentionally compact. While §5.2 shows that top-10 rankings remain stable u…
Figure 4: Taxonomy of Disciplines.
Figure 5: Subject-Level Score Distribution Across Top-10 Models.
Figure 6: Top-10 Model Performance Distributions Across Three Aggregation Levels (continued).
Figure 7: Inference Cost Distribution of Qwen3 Dense Models.
Figure 8: Inference Cost Distribution of Qwen3 MoE Models.
Figure 9: Inference Cost Distribution of Qwen3.5 Dense Models.
Figure 10: Inference Cost Distribution of Qwen3.5 MoE Models.
Figure 11: UpSet plot visualizing intersections of high-performing data points (average score …

Original abstract

Knowledge benchmarks for LLMs face three issues: scaling-driven designs that do not operationalize disciplinary representativeness; flat-payment annotation that permits lazy consensus; and unaudited ranking instability under bounded test budgets. We introduce KINA, an 899-item benchmark across 261 fine-grained disciplines, with two formal results. First, we cast representativeness as a coverage-style objective over expert-elicited anchors and operationalize disciplinary representativeness through a proxy, yielding a (1-1/e) greedy approximation (Proposition 1); the guarantee applies to the proxy, not to population representativeness. Second, we prove a bonus-on-bar tournament weakly FOSD-dominates flat payment in released-review quality, with incentive-compatibility threshold B > Delta C / Delta p_min (Theorem 1). Evaluating 42 models from 13 labs, the top model, Gemini-3.1-Pro-Preview, reaches 53.17%, followed by Claude-Opus-4.6 at 49.92% and GPT-5.4 at 48.55%, leaving substantial headroom below saturation. The full leaderboard shows a tiered structure rather than a smooth total order: a small frontier tier lies above 48%, a dense strong-model tier spans roughly 38-45%, and low-performing models remain only modestly above the 10% chance baseline. Tool augmentation adds up to 5.17 points across the five tool-use evaluations, with gains varying substantially across models. We report bootstrap ranking-stability statistics to make bounded-budget variance explicit and to discourage over-interpretation of adjacent ranks.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 1 minor

Summary. The paper introduces KINA, an 899-item benchmark across 261 fine-grained disciplines for LLMs. It presents two formal results: a (1-1/e) greedy approximation for representativeness via a proxy (Proposition 1), with the guarantee applying only to the proxy, and a proof that a bonus-on-bar tournament weakly FOSD-dominates flat payment above threshold B > Delta C / Delta p_min (Theorem 1). Evaluation of 42 models from 13 labs shows Gemini-3.1-Pro-Preview at 53.17%, followed by Claude-Opus-4.6 at 49.92% and GPT-5.4 at 48.55%, with a tiered leaderboard, tool augmentation gains up to 5.17 points, and bootstrap ranking-stability statistics.

Significance. If the proxy adequately captures disciplinary coverage and the formal results hold with supporting protocols, KINA could advance benchmark design by operationalizing coverage via approximation and improving annotation incentives over flat payment. The explicit proxy limitation, tiered performance structure, and stability reporting are useful contributions. The absence of population-level validation for the proxy and missing error bars/proofs limit the strength of the representativeness and evaluation claims.

major comments (3)
  1. [Abstract] The central claim presents KINA as providing disciplinary representativeness across 261 disciplines, yet Proposition 1's (1-1/e) greedy guarantee applies exclusively to the expert-elicited proxy anchors and not to true population representativeness. No validation study, correlation analysis, or sensitivity check linking the proxy to actual disciplinary coverage is described, which is load-bearing for the representativeness claim.
  2. [Abstract / Evaluation section] Model scores (e.g., 53.17% top result) and the tiered leaderboard structure are reported without error bars, exclusion criteria, or the detailed item-selection/annotation protocol, preventing assessment of robustness and making the cross-model comparisons difficult to interpret.
  3. [Theorem 1] The weak FOSD dominance result with incentive threshold B > Delta C / Delta p_min is stated as a formal guarantee, but the abstract provides neither the proof nor the full set of assumptions and definitions, leaving the derivation unverified in the provided text.
minor comments (1)
  1. [Abstract] The numbers of disciplines (261) and items (899) are stated without reference to how the expert-elicited anchors were constructed or to any inter-annotator agreement metrics.

Simulated Author's Rebuttal

3 responses · 1 unresolved

We thank the referee for the constructive comments. We address each major point below, clarifying the explicit scope of our claims as stated in the manuscript while noting where additional reporting can be strengthened.

Point-by-point responses
  1. Referee: [Abstract] The central claim presents KINA as providing disciplinary representativeness across 261 disciplines, yet Proposition 1's (1-1/e) greedy guarantee applies exclusively to the expert-elicited proxy anchors and not to true population representativeness. No validation study, correlation analysis, or sensitivity check linking the proxy to actual disciplinary coverage is described, which is load-bearing for the representativeness claim.

    Authors: The manuscript already states explicitly in the abstract that 'the guarantee applies to the proxy, not to population representativeness.' We agree that the absence of a population-level validation study, correlation analysis, or sensitivity check is a genuine limitation for claims about true disciplinary coverage. Such a study would require large-scale sampling of the full disciplinary population and is outside the current scope. We will add a dedicated limitations paragraph in the revised manuscript to emphasize this point and outline directions for future validation. revision: partial

  2. Referee: [Abstract / Evaluation section] Model scores (e.g., 53.17% top result) and the tiered leaderboard structure are reported without error bars, exclusion criteria, or the detailed item-selection/annotation protocol, preventing assessment of robustness and making the cross-model comparisons difficult to interpret.

    Authors: The manuscript already reports bootstrap ranking-stability statistics to make bounded-budget variance explicit. We acknowledge that the abstract and high-level evaluation summary do not include per-score error bars, explicit exclusion criteria, or a full recap of the item-selection/annotation protocol. The full methods section details the annotation process and model inclusion criteria. We will add error bars to the reported scores, reference the protocols more explicitly in the abstract, and ensure the evaluation section highlights these elements for robustness assessment. revision: yes

  3. Referee: [Theorem 1] The weak FOSD dominance result with incentive threshold B > Delta C / Delta p_min is stated as a formal guarantee, but the abstract provides neither the proof nor the full set of assumptions and definitions, leaving the derivation unverified in the provided text.

    Authors: Abstracts are intended to summarize results rather than reproduce full proofs or definitions. The complete statement of Theorem 1, including all assumptions, definitions, and the proof of weak FOSD dominance, appears in the main text. We will revise the abstract to explicitly direct readers to the full formal treatment in the body of the paper. revision: partial

standing simulated objections not resolved
  • Absence of a population-level validation study, correlation analysis, or sensitivity check linking the expert-elicited proxy to actual disciplinary coverage

Circularity Check

0 steps flagged

No significant circularity; formal results independent of benchmark data

full rationale

Proposition 1 applies the known (1-1/e) greedy guarantee to a coverage objective over the paper's own expert-elicited proxy anchors, with the abstract explicitly limiting the claim to the proxy. Theorem 1 is a standalone incentive-compatibility proof for the bonus-on-bar mechanism. Neither result is obtained by fitting parameters to the 899-item scores, nor does any load-bearing step reduce to self-citation or by-construction renaming. The derivation chain is self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

1 free parameters · 2 axioms · 0 invented entities

The paper introduces a new benchmark and payment mechanism but grounds its formal claims in standard mathematical axioms without new invented entities or heavily fitted parameters from the evaluation data.

free parameters (1)
  • B incentive threshold
    Derived quantity B > Delta C / Delta p_min depends on cost and probability parameters that may require estimation from annotation data.
axioms (2)
  • [standard math] Standard submodular coverage properties for the greedy (1-1/e) approximation
    Proposition 1 invokes typical set-cover approximation assumptions.
  • [standard math] Mechanism-design assumptions for first-order stochastic dominance of tournament payments
    Theorem 1 relies on standard incentive-compatibility conditions in payment design.
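
For readers who want the first axiom spelled out, the classical statement for greedy maximization of a monotone submodular function can be written as follows. The notation is an assumption made for illustration and is not taken from the paper: V is the candidate item pool, f is the proxy coverage objective with f(∅) = 0, and k is the item budget.

```latex
% Assumed notation: V = candidate items, f = proxy coverage objective, k = item budget.
% Submodularity (diminishing marginal coverage):
f(A \cup \{v\}) - f(A) \;\ge\; f(B \cup \{v\}) - f(B)
\quad \text{for all } A \subseteq B \subseteq V,\; v \in V \setminus B.

% Greedy guarantee for monotone submodular f with f(\emptyset) = 0:
f\bigl(S_k^{\mathrm{greedy}}\bigr)
\;\ge\; \Bigl(1 - \bigl(1 - \tfrac{1}{k}\bigr)^{k}\Bigr) \max_{|S| \le k} f(S)
\;\ge\; \Bigl(1 - \tfrac{1}{e}\Bigr) \max_{|S| \le k} f(S).
```

The bound compares the greedy set only with the best set under the same objective f, which is why the guarantee transfers to the proxy and not to population representativeness.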

pith-pipeline@v0.9.1-grok · 5906 in / 1464 out tokens · 63193 ms · 2026-06-28T06:31:10.914405+00:00 · methodology

