Knowledge Index of Noah's Ark

Bangya Liu; Ge Zhang; Heli Qi; Jiaheng Liu; Jiarui Liu; Jie Wei; Kaijing Ma; Meishu Song; Minghao Liu; Minglai Yang

arxiv: 2606.05104 · v2 · pith:HBCKOP2Knew · submitted 2026-06-03 · 💻 cs.AI

Knowledge Index of Noah's Ark

Sheng Jin , Minghao Liu , Yunze Xiao , Zeqi Zhou , Heli Qi , Yifan Yao , Meishu Song , Kaijing Ma

show 19 more authors

Xuan Zhang Sicong Jiang Yizhe Li Ningshan Ma Jie Wei Ziniu Li Minglai Yang Bangya Liu Yiming Liang Xiao Fang Qingcheng Zeng Jiarui Liu Rui Yang Shen Yan Wenhao Huang Jiaheng Liu Zihan Wang Weihao Xuan Ge Zhang

This is my paper

Pith reviewed 2026-06-28 06:31 UTC · model grok-4.3

classification 💻 cs.AI

keywords LLM benchmarkknowledge evaluationrepresentativenessincentive mechanismgreedy approximationannotation qualitymodel leaderboarddisciplinary coverage

0 comments

The pith

KINA is an 899-item benchmark across 261 disciplines with a proxy greedy algorithm for representativeness and a bonus tournament for annotation incentives.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces KINA to fix three problems in LLM knowledge benchmarks: designs that ignore disciplinary representativeness, flat payments that allow low-effort annotations, and unstable rankings from small test budgets. It casts representativeness as a coverage problem over expert anchors and solves it with a greedy method that guarantees (1-1/e) of the proxy optimum. A second formal result shows that paying a bonus only above a performance bar weakly dominates flat payment for quality once the bonus clears a simple cost-probability threshold. Tests on 42 models place the leader at 53.17 percent, expose a tiered rather than smooth performance distribution, and quantify ranking variance through bootstrap statistics.

Core claim

KINA operationalizes disciplinary representativeness via a proxy over expert-elicited anchors, yielding a (1-1/e) greedy approximation guarantee that applies to the proxy. It further proves that a bonus-on-bar tournament weakly first-order stochastically dominates flat payment for released-review quality whenever the bonus B exceeds Delta C / Delta p_min. The resulting 899-item set across 261 disciplines produces a tiered leaderboard in which the top model scores 53.17 percent, tool use adds up to 5.17 points, and bounded-budget variance is reported explicitly.

What carries the argument

The proxy-based greedy coverage algorithm for disciplinary representativeness together with the bonus-on-bar tournament payment rule.

If this is right

Representativeness holds only relative to the proxy, not necessarily the true population.
The tournament payment is incentive-compatible above the stated threshold.
Model performance forms distinct tiers above 48 percent, 38-45 percent, and near the 10 percent baseline.
Tool augmentation yields gains that vary substantially across models and tasks.
Bootstrap statistics make explicit the ranking instability possible under limited evaluation budgets.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same proxy-plus-greedy approach could be reused to build coverage benchmarks in domains other than LLM knowledge testing.
The incentive-compatibility threshold provides a concrete design rule that might transfer to other crowdsourced annotation settings.
Persistent gaps below 55 percent indicate that current architectures still lack reliable integration of knowledge across many disciplines.
Future experiments could measure whether models trained or fine-tuned explicitly on the proxy-selected items close the observed performance tiers.

Load-bearing premise

The chosen proxy for disciplinary representativeness is treated as adequate for the coverage guarantee.

What would settle it

An audit that compares actual knowledge coverage achieved by the selected 899 items against a larger expert-curated or random sample to test whether the proxy guarantee translates to population representativeness.

Figures

Figures reproduced from arXiv: 2606.05104 by Bangya Liu, Ge Zhang, Heli Qi, Jiaheng Liu, Jiarui Liu, Jie Wei, Kaijing Ma, Meishu Song, Minghao Liu, Minglai Yang, Ningshan Ma, Qingcheng Zeng, Rui Yang, Sheng Jin, Shen Yan, Sicong Jiang, Weihao Xuan, Wenhao Huang, Xiao Fang, Xuan Zhang, Yifan Yao, Yiming Liang, Yizhe Li, Yunze Xiao, Zeqi Zhou, Zihan Wang, Ziniu Li.

**Figure 1.** Figure 1: KINA data-collection pipeline. Topic pre-approval enforces representativeness via the proxy of Proposition 1; the double-blind expert-review stage instantiates the bonus-on-bar tournament of Theorem 1; LLM-as-judge consensus filters residual ambiguity; an agentic refinement loop addresses boundary defects. five flagship LLMs on the item. An item is admitted if at least two of three judges vote yes; otherwi… view at source ↗

**Figure 2.** Figure 2: Distribution of KINA items across the 12 top-level disciplines [PITH_FULL_IMAGE:figures/full_fig_p006_2.png] view at source ↗

**Figure 3.** Figure 3: Parameter scaling on KINA. Left: dense models, accuracy vs. total parameters (log scale). Right: MoE models, accuracy vs. active parameters (log scale). Generation-over-generation slope increases from roughly 7.6 (Qwen3) to 14.8 (Qwen3.5) points per decade. 6 Limitations We highlight five limitations. Sample-size variance. KINA is intentionally compact. While §5.2 shows that top-10 rankings remain stable u… view at source ↗

**Figure 4.** Figure 4: Taxonomy of Disciplines 22 [PITH_FULL_IMAGE:figures/full_fig_p022_4.png] view at source ↗

**Figure 5.** Figure 5: Subject-Level Score Distribution Across Top-10 Models [PITH_FULL_IMAGE:figures/full_fig_p041_5.png] view at source ↗

**Figure 6.** Figure 6: Top-10 Model Performance Distributions Across Three Aggregation Levels (continued) [PITH_FULL_IMAGE:figures/full_fig_p042_6.png] view at source ↗

**Figure 7.** Figure 7: Inference Cost Distribution of Qwen3 Dense Models. [PITH_FULL_IMAGE:figures/full_fig_p043_7.png] view at source ↗

**Figure 8.** Figure 8: Inference Cost Distribution of Qwen3 Moe Models. [PITH_FULL_IMAGE:figures/full_fig_p043_8.png] view at source ↗

**Figure 9.** Figure 9: Inference Cost Distribution of Qwen3.5 Dense Models. [PITH_FULL_IMAGE:figures/full_fig_p043_9.png] view at source ↗

**Figure 10.** Figure 10: Inference Cost Distribution of Qwen3.5 Moe Models. [PITH_FULL_IMAGE:figures/full_fig_p044_10.png] view at source ↗

**Figure 11.** Figure 11: UpSet plot visualizing intersections of high-performing data points (average score [PITH_FULL_IMAGE:figures/full_fig_p044_11.png] view at source ↗

read the original abstract

Knowledge benchmarks for LLMs face three issues: scaling-driven designs that do not operationalize disciplinary representativeness; flat-payment annotation that permits lazy consensus; and unaudited ranking instability under bounded test budgets. We introduce KINA, an 899-item benchmark across 261 fine-grained disciplines, with two formal results. First, we cast representativeness as a coverage-style objective over expert-elicited anchors and operationalize disciplinary representativeness through a proxy, yielding a (1-1/e) greedy approximation (Proposition 1); the guarantee applies to the proxy, not to population representativeness. Second, we prove a bonus-on-bar tournament weakly FOSD-dominates flat payment in released-review quality, with incentive-compatibility threshold B > Delta C / Delta p_min (Theorem 1). Evaluating 42 models from 13 labs, the top model, Gemini-3.1-Pro-Preview, reaches 53.17%, followed by Claude-Opus-4.6 at 49.92% and GPT-5.4 at 48.55%, leaving substantial headroom below saturation. The full leaderboard shows a tiered structure rather than a smooth total order: a small frontier tier lies above 48%, a dense strong-model tier spans roughly 38-45%, and low-performing models remain only modestly above the 10% chance baseline. Tool augmentation adds up to 5.17 points across the five tool-use evaluations, with gains varying substantially across models. We report bootstrap ranking-stability statistics to make bounded-budget variance explicit and to discourage over-interpretation of adjacent ranks.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

KINA gives a large 899-item benchmark over 261 disciplines plus two formal results on proxy coverage and annotation incentives, but the representativeness claim is scoped only to the proxy with no validation shown.

read the letter

The paper's core contribution is an 899-question benchmark across 261 fine-grained disciplines, backed by a coverage proxy that receives the usual (1-1/e) greedy guarantee and a bonus-on-bar tournament that weakly dominates flat payment above a cost-probability threshold. They evaluate 42 models from 13 labs, report a tiered leaderboard with a clear frontier above 48 percent, and include bootstrap stability numbers to flag rank variance under limited budgets.

What stands out is the explicit formalization of two practical problems in benchmark work and the scale of the discipline coverage. The incentive result follows standard mechanism-design lines and the stability reporting is a useful addition for readers who care about bounded test budgets.

The soft spots are straightforward. The coverage guarantee applies only to the expert-elicited proxy anchors, not to actual population representativeness, and the abstract contains no validation, correlation check, or sensitivity analysis linking the proxy to real disciplinary coverage. Item selection and annotation protocols are not detailed, the formal statements appear without proofs, and the model scores lack error bars or explicit exclusion rules. These gaps are material because the representativeness claim is load-bearing for the benchmark's stated purpose.

This is for groups that build or use knowledge benchmarks and want to see formal handles on coverage and annotation quality. It deserves peer review because the size, the formal statements, and the stability analysis are concrete enough to check, even with the proxy limitation already noted in the text.

Referee Report

3 major / 1 minor

Summary. The paper introduces KINA, an 899-item benchmark across 261 fine-grained disciplines for LLMs. It presents two formal results: a (1-1/e) greedy approximation for representativeness via a proxy (Proposition 1), with the guarantee applying only to the proxy, and a proof that a bonus-on-bar tournament weakly FOSD-dominates flat payment above threshold B > Delta C / Delta p_min (Theorem 1). Evaluation of 42 models from 13 labs shows Gemini-3.1-Pro-Preview at 53.17%, followed by Claude-Opus-4.6 at 49.92% and GPT-5.4 at 48.55%, with a tiered leaderboard, tool augmentation gains up to 5.17 points, and bootstrap ranking-stability statistics.

Significance. If the proxy adequately captures disciplinary coverage and the formal results hold with supporting protocols, KINA could advance benchmark design by operationalizing coverage via approximation and improving annotation incentives over flat payment. The explicit proxy limitation, tiered performance structure, and stability reporting are useful contributions. The absence of population-level validation for the proxy and missing error bars/proofs limit the strength of the representativeness and evaluation claims.

major comments (3)

[Abstract] Abstract: The central claim presents KINA as providing disciplinary representativeness across 261 disciplines, yet Proposition 1's (1-1/e) greedy guarantee applies exclusively to the expert-elicited proxy anchors and not to true population representativeness. No validation study, correlation analysis, or sensitivity check linking the proxy to actual disciplinary coverage is described, which is load-bearing for the representativeness claim.
[Abstract] Abstract / Evaluation section: Model scores (e.g., 53.17% top result) and the tiered leaderboard structure are reported without error bars, exclusion criteria, or the detailed item-selection/annotation protocol, preventing assessment of robustness and making the cross-model comparisons difficult to interpret.
[Theorem 1] Theorem 1: The weak FOSD dominance result with incentive threshold B > Delta C / Delta p_min is stated as a formal guarantee, but the abstract provides neither the proof nor the full set of assumptions and definitions, leaving the derivation unverified in the provided text.

minor comments (1)

[Abstract] Abstract: The number of disciplines (261) and items (899) are stated without reference to how the expert-elicited anchors were constructed or any inter-annotator agreement metrics.

Simulated Author's Rebuttal

3 responses · 1 unresolved

We thank the referee for the constructive comments. We address each major point below, clarifying the explicit scope of our claims as stated in the manuscript while noting where additional reporting can be strengthened.

read point-by-point responses

Referee: [Abstract] Abstract: The central claim presents KINA as providing disciplinary representativeness across 261 disciplines, yet Proposition 1's (1-1/e) greedy guarantee applies exclusively to the expert-elicited proxy anchors and not to true population representativeness. No validation study, correlation analysis, or sensitivity check linking the proxy to actual disciplinary coverage is described, which is load-bearing for the representativeness claim.

Authors: The manuscript already states explicitly in the abstract that 'the guarantee applies to the proxy, not to population representativeness.' We agree that the absence of a population-level validation study, correlation analysis, or sensitivity check is a genuine limitation for claims about true disciplinary coverage. Such a study would require large-scale sampling of the full disciplinary population and is outside the current scope. We will add a dedicated limitations paragraph in the revised manuscript to emphasize this point and outline directions for future validation. revision: partial
Referee: [Abstract] Abstract / Evaluation section: Model scores (e.g., 53.17% top result) and the tiered leaderboard structure are reported without error bars, exclusion criteria, or the detailed item-selection/annotation protocol, preventing assessment of robustness and making the cross-model comparisons difficult to interpret.

Authors: The manuscript already reports bootstrap ranking-stability statistics to make bounded-budget variance explicit. We acknowledge that the abstract and high-level evaluation summary do not include per-score error bars, explicit exclusion criteria, or a full recap of the item-selection/annotation protocol. The full methods section details the annotation process and model inclusion criteria. We will add error bars to the reported scores, reference the protocols more explicitly in the abstract, and ensure the evaluation section highlights these elements for robustness assessment. revision: yes
Referee: [Theorem 1] Theorem 1: The weak FOSD dominance result with incentive threshold B > Delta C / Delta p_min is stated as a formal guarantee, but the abstract provides neither the proof nor the full set of assumptions and definitions, leaving the derivation unverified in the provided text.

Authors: Abstracts are intended to summarize results rather than reproduce full proofs or definitions. The complete statement of Theorem 1, including all assumptions, definitions, and the proof of weak FOSD dominance, appears in the main text. We will revise the abstract to explicitly direct readers to the full formal treatment in the body of the paper. revision: partial

standing simulated objections not resolved

Absence of a population-level validation study, correlation analysis, or sensitivity check linking the expert-elicited proxy to actual disciplinary coverage

Circularity Check

0 steps flagged

No significant circularity; formal results independent of benchmark data

full rationale

Proposition 1 applies the known (1-1/e) greedy guarantee to a coverage objective over the paper's own expert-elicited proxy anchors, with the abstract explicitly limiting the claim to the proxy. Theorem 1 is a standalone incentive-compatibility proof for the bonus-on-bar mechanism. Neither result is obtained by fitting parameters to the 899-item scores, nor does any load-bearing step reduce to self-citation or by-construction renaming. The derivation chain is self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

1 free parameters · 2 axioms · 0 invented entities

The paper introduces a new benchmark and payment mechanism but grounds its formal claims in standard mathematical axioms without new invented entities or heavily fitted parameters from the evaluation data.

free parameters (1)

B incentive threshold
Derived quantity B > Delta C / Delta p_min depends on cost and probability parameters that may require estimation from annotation data.

axioms (2)

standard math Standard submodular coverage properties for greedy (1-1/e) approximation
Proposition 1 invokes typical set-cover approximation assumptions.
standard math Mechanism-design assumptions for first-order stochastic dominance of tournament payments
Theorem 1 relies on standard incentive-compatibility conditions in payment design.

pith-pipeline@v0.9.1-grok · 5906 in / 1464 out tokens · 63193 ms · 2026-06-28T06:31:10.914405+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

75 extracted references · 19 canonical work pages · 13 internal anchors

[1]

Introducing Claude Opus 4.6, February 2026

Anthropic. Introducing Claude Opus 4.6, February 2026. URL https://www.anthropic.com/news/ claude-opus-4-6

2026
[2]

Introducing Claude Sonnet 4.6, February 2026

Anthropic. Introducing Claude Sonnet 4.6, February 2026. URL https://www.anthropic.com/news/ claude-sonnet-4-6

2026
[3]

The Llama 4 herd: Architecture, training, evaluation, and deployment notes, 2026

Redacted by arXiv. The Llama 4 herd: Architecture, training, evaluation, and deployment notes, 2026. URL https://arxiv.org/abs/2601.11659

work page arXiv 2026
[4]

Arc prize 2024: Technical report, 2025

Francois Chollet, Mike Knoop, Gregory Kamradt, et al. Arc prize 2024: Technical report, 2025. URL https://arxiv.org/abs/2412.04604

work page arXiv 2024
[5]

ARC-AGI-2: A New Challenge for Frontier AI Reasoning Systems

Francois Chollet, Mike Knoop, Gregory Kamradt, et al. ARC-AGI-2: A new challenge for frontier AI reasoning systems, 2026. URLhttps://arxiv.org/abs/2505.11831

work page internal anchor Pith review Pith/arXiv arXiv 2026
[6]

Gemini, February 2026

Google DeepMind. Gemini, February 2026. URLhttps://deepmind.google/models/gemini/

2026
[7]

SuperGPQA: Scaling LLM evaluation across 285 graduate disciplines,

Xeron Du, Yifan Yao, Kaijing Ma, et al. SuperGPQA: Scaling LLM evaluation across 285 graduate disciplines,
[8]

URLhttps://arxiv.org/abs/2502.14739

work page internal anchor Pith review Pith/arXiv arXiv
[9]

Datasheets for datasets.Communications of the ACM, 64(12):86–92, 2021

Timnit Gebru, Jamie Morgenstern, Briana Vecchione, et al. Datasheets for datasets.Communications of the ACM, 64(12):86–92, 2021

2021
[10]

Are we done with mmlu?, 2025

Aryo Pradipta Gema, Joshua Ong Jun Leang, Giwon Hong, et al. Are we done with mmlu?, 2025. URL https://arxiv.org/abs/2406.04127

work page arXiv 2025
[11]

The Llama 3 Herd of Models

Aaron Grattafiori, Abhimanyu Dubey, Abhinav Jauhri, et al. The Llama 3 herd of models, 2024. URL https://arxiv.org/abs/2407.21783

work page internal anchor Pith review Pith/arXiv arXiv 2024
[12]

Measuring Massive Multitask Language Understanding

Dan Hendrycks, Collin Burns, Steven Basart, et al. Measuring massive multitask language understanding, 2021. URLhttps://arxiv.org/abs/2009.03300

work page internal anchor Pith review Pith/arXiv arXiv 2021
[13]

Step 3.5 Flash: Open frontier-level intelligence with 11b active parameters, 2026

Ailin Huang, Ang Li, Aobo Kong, et al. Step 3.5 Flash: Open frontier-level intelligence with 11b active parameters, 2026. URLhttps://arxiv.org/abs/2602.10604. 11

work page arXiv 2026
[14]

Mixtral of Experts

Albert Q Jiang, Alexandre Sablayrolles, Antoine Roux, et al. Mixtral of experts, 2024. URL https://arxiv. org/abs/2401.04088

work page internal anchor Pith review Pith/arXiv arXiv 2024
[15]

Rank-order tournaments as optimum labor contracts.Journal of Political Economy, 89(5):841–864, 1981

Edward P Lazear and Sherwin Rosen. Rank-order tournaments as optimum labor contracts.Journal of Political Economy, 89(5):841–864, 1981

1981
[16]

DeepSeek-V3.2: Pushing the frontier of open large language models,

Aixin Liu, Aoxue Mei, Bangcai Lin, et al. DeepSeek-V3.2: Pushing the frontier of open large language models,
[17]

URLhttps://arxiv.org/abs/2512.02556

work page internal anchor Pith review Pith/arXiv arXiv
[18]

Learn to explain: Multimodal reasoning via thought chains for science question answering

Pan Lu, Swaroop Mishra, Tanglin Xia, et al. Learn to explain: Multimodal reasoning via thought chains for science question answering. InThe 36th Conference on Neural Information Processing Systems (NeurIPS), 2022

2022
[19]

MiniMax M2.5: Built for real-world productivity, February 2026

MiniMax. MiniMax M2.5: Built for real-world productivity, February 2026. URL https://www.minimax. io/news/minimax-m25

2026
[20]

An analysis of approximations for maximizing submodular set functions—i.Mathematical Programming, 14(1):265–294, 1978

George L Nemhauser, Laurence A Wolsey, and Marshall L Fisher. An analysis of approximations for maximizing submodular set functions—i.Mathematical Programming, 14(1):265–294, 1978

1978
[21]

Adversarial nli: A new benchmark for natural language understanding

Yixin Nie, Adina Williams, Emily Dinan, et al. Adversarial nli: A new benchmark for natural language understanding. InProceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 4885–4901, 2020

2020
[22]

Introducing GPT-5.2, December 2025

OpenAI. Introducing GPT-5.2, December 2025. URL https://openai.com/index/ introducing-gpt-5-2/

2025
[23]

Introducing GPT-5.4, March 2026

OpenAI. Introducing GPT-5.4, March 2026. URL https://openai.com/index/ introducing-gpt-5-4/

2026
[24]

A benchmark of expert-level academic questions to assess AI capabilities.Nature, 649(8099):1139–1146, 2026

Long Phan, Alice Gatti, Nathaniel Li, et al. A benchmark of expert-level academic questions to assess AI capabilities.Nature, 649(8099):1139–1146, 2026

2026
[25]

GPQA: A Graduate-Level Google-Proof Q&A Benchmark

David Rein, Betty Li Hou, Stickland, et al. Gpqa: A graduate-level google-proof q&a benchmark, 2023. URL https://arxiv.org/abs/2311.12022

work page internal anchor Pith review Pith/arXiv arXiv 2023
[26]

Seed 2.0 official launch, February 2026

Bytedance Seed. Seed 2.0 official launch, February 2026. URL https://seed.bytedance.com/en/ blog/seed-2-0-official-launch

2026
[27]

Kimi K2.5: Visual Agentic Intelligence

Kimi Team, Tongtong Bai, Yifan Bai, et al. Kimi k2.5: Visual agentic intelligence.arXiv preprint arXiv:2602.02276, 2026

work page internal anchor Pith review Pith/arXiv arXiv 2026
[28]

Qwen3-Max: Just scale it, September 2025

Qwen Team. Qwen3-Max: Just scale it, September 2025

2025
[29]

Qwen3.5: Accelerating productivity with native multimodal agents, February 2026

Qwen Team. Qwen3.5: Accelerating productivity with native multimodal agents, February 2026. URL https://qwen.ai/blog?id=qwen3.5

2026
[30]

Qwen2.5 Technical Report

Qwen Team, An Yang, Baosong Yang, et al. Qwen2.5 technical report.arXiv preprint arXiv:2412.15115, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024
[31]

Position: The Hidden Costs and Measurement Gaps of Reinforcement Learning with Verifiable Rewards

Aaron Tu, Weihao Xuan, Heli Qi, et al. Position: The hidden costs and measurement gaps of reinforcement learning with verifiable rewards, 2025. URLhttps://arxiv.org/abs/2509.21882

work page internal anchor Pith review Pith/arXiv arXiv 2025
[32]

Superglue: A stickier benchmark for general-purpose language understanding systems.Advances in neural information processing systems, 32, 2019

Alex Wang, Yada Pruksachatkun, Nikita Nangia, et al. Superglue: A stickier benchmark for general-purpose language understanding systems.Advances in neural information processing systems, 32, 2019

2019
[33]

MMLU-Pro: A more robust and challenging multi-task language understanding benchmark.Advances in Neural Information Processing Systems, 37:95266–95290, 2024

Yubo Wang, Xueguang Ma, Ge Zhang, et al. MMLU-Pro: A more robust and challenging multi-task language understanding benchmark.Advances in Neural Information Processing Systems, 37:95266–95290, 2024

2024
[34]

Grok-4-1 model card, November 2025

xAI Team. Grok-4-1 model card, November 2025. URL https://data.x.ai/ 2025-11-17-grok-4-1-model-card.pdf

2025
[35]

Qwen2 Technical Report

An Yang, Baosong Yang, Binyuan Hui, et al. Qwen2 technical report, 2024. URL https://arxiv.org/ abs/2407.10671

work page internal anchor Pith review Pith/arXiv arXiv 2024
[36]

Qwen3 Technical Report

An Yang, Anfeng Li, Baosong Yang, et al. Qwen3 technical report.arXiv preprint arXiv:2505.09388, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[37]

GLM-5: from Vibe Coding to Agentic Engineering

Aohan Zeng, Xin Lv, Zhenyu Hou, et al. Glm-5: from vibe coding to agentic engineering.arXiv preprint arXiv:2602.15763, 2026

work page internal anchor Pith review Pith/arXiv arXiv 2026
[38]

Hle-verified: A systematic verification and structured revision of humanity’s last exam, 2026

Weiqi Zhai, Zhihai Wang, Jinghang Wang, et al. Hle-verified: A systematic verification and structured revision of humanity’s last exam, 2026. URLhttps://arxiv.org/abs/2602.13964. 12 A Data Samples Pseudo-Multi-Choice Sample (Some Content Omitted ) Discipline: Engineering Question: Which of the following options are correct?

work page arXiv 2026
[39]

By defining N=D ⊤M −1D, it follows that N∈R 3Nc×3Nc

D∈R 6M×3N c and M is a diagonal matrix. By defining N=D ⊤M −1D, it follows that N∈R 3Nc×3Nc. The Schur complement is given by S=N+ ˆM−BE −1C. Since ˆM is diagonal and BE −1C is a block- diagonal matrix with3×3blocks,Spossesses the identical sparsity pattern asN
[40]

In the contact pairs of granular dynamics, the friction cone constraint requires that the magnitude of the tangential force impulse vector does not exceed the product of the static friction coefficient and the normal force impulse. When the static friction coefficient is 0.3 and the normal force impulse is 10 N·s, the maximum possible magnitude of the tan...
[41]

sublinear inN

The AMEN Cross complexity bound in this setting scales as O(r3N) (linear in the number of unknowns N). For N= 10 6 and maximum TT-rank r= 10 , the computational cost is approximately 109 operations; however, this contradicts the claim of "sublinear inN" (a linear scaling inNcannot be sublinear)
[42]

Given ∆t= 0.1s , M= 2I,v k = 3m/s,∆tf B = 4N s, and P Diγi = 6N s, we calculatev k+1 as follows: vk+1 =v k + 1 M ∆t fB + X Diγi = 3 + 1 2 (4 + 6) = 8m/s

Using the time-stepping update formula: M vk+1 −v k = ∆t fB +P i∈A(qk,δ) Diγi. Given ∆t= 0.1s , M= 2I,v k = 3m/s,∆tf B = 4N s, and P Diγi = 6N s, we calculatev k+1 as follows: vk+1 =v k + 1 M ∆t fB + X Diγi = 3 + 1 2 (4 + 6) = 8m/s
[43]

With N=D ⊤M −1D, for M= 2 rigid bodies (yielding 6M= 12 degrees of freedom, DOF) and Nc = 1 contact (yielding 3Nc = 3 multipliers), N is a 3×3 matrix, as it is defined by the product D⊤M −1D where D∈R 12×3,M −1 ∈R 12×12, and thus acts on the contact multiplier space (not generalized-velocity space)
[44]

For the position-based normal complementarity constraint: γi,n ≥0,Φ i(q)≥0,Φ i(q)γi,n = 0 . If Φi(q) = 0.5m and γi,n = 2N s , the constraint isnotsatisfied: while both quantities are nonnegative, their product is1m N s̸= 0, violating the complementarity condition (Φ i(q)γi,n = 0)
[45]

For TT matrix-vector products used in the paper’s TT-based preconditioner, the complexity isO(r2NlogN) . While this complexity scales linearly in N (up to a logarithmic factor logN ), the cost is still considered *asymp- totically sublinear in N* when normalized by N—or more precisely, *sublinear in the sense of superlinear scaling avoidance*—because logN...

work page arXiv 2019
[46]

If a discipline were to be evaluated using only 3 to 5 questions, the submitted instance must be fundamental and comprehensive enough to be one of them

Disciplinary Representativeness:Instances must act as highly representative probes. If a discipline were to be evaluated using only 3 to 5 questions, the submitted instance must be fundamental and comprehensive enough to be one of them. 2.High-Order Knowledge Application: 29 • Questions must construct a logically complete closed system where the solution ...
[47]

What We Reject:

Static Knowledge Breadth:For memory-intensive disciplines (e.g., Education), questions should maximize knowledge coverage (e.g., utilizing10+statements to encompass major pedagogical theories). What We Reject:
[48]

What is the 100th digit ofπ?

Narrow or Trivial Memorization:Questions testing pure rote memory devoid of core disciplinary literacy (e.g., "What is the 100th digit ofπ?") or focusing on hyper-niche, obscure sub-entities
[49]

Idiosyncratic or Tricky Trivia:Questions universally recognized as flawed or unreasonable even in human examinations
[50]

Weak Epistemological Consensus:Hypotheses proposed by individual scholars that lack widespread aca- demic consensus or are highly volatile (e.g., highly debated legal interpretations)
[51]

E.2.3 Standardized Annotation Workflow The question authoring process is strictly compartmentalized into six components

Non-Disciplinary Failure Modes:Instances where LLMs fail due to semantic traps, ambiguous phrasing, or floating-point calculation errors rather than a deficit in disciplinary literacy. E.2.3 Standardized Annotation Workflow The question authoring process is strictly compartmentalized into six components. The specifications for each component are detailed ...
[52]

It targets marine chemistry and assesses advanced reasoning and computational abilities, rather than only testing static knowledge
[53]

You are an expert

Even though it is a calculation question, the analysis of incorrect options is briefly summarized instead of being copied and pasted repetitively. E.3 Review Manual E.3.1 Core Philosophy and General Workflow The primary objective of the review process is to ensure that each curated instance serves as a highly representative probe for evaluating LLMs on sp...

2017
[54]

Body elongated, slightly compressed, large head, flat snout
[55]

Lateral line scales are at least weakly ctenoid, two dorsal fins, separated
[56]

Posterior margin of the caudal fin is rounded
[57]

The first dorsal fin is short, consisting of 3-4 spines
[58]

Upper jaw is slightly shorter than the lower jaw. Options A.1, 2B.1, 2, 3 (Correct Option)C.1, 2, 3, 4, 5D.1E.2F.1, 2, 4G.1, 3, 5H.3, 5J.4, 5 Explanation Option A:Statement 1 is correct (Body elongated, slightly compressed, large head, flat snout); Statement 2 is correct (Lateral line scales are at least weakly ctenoid, two dorsal fins, separated). Option...
[59]

Statements 4 and 5 merely test pure static memory of isolated trivia, lacking sufficient cognitive coverage to serve as a representative evaluation probe

Weak Disciplinary Representativeness:The question fails to engage high-order disciplinary literacy. Statements 4 and 5 merely test pure static memory of isolated trivia, lacking sufficient cognitive coverage to serve as a representative evaluation probe
[60]

Models can exploit this logical leakage (selecting A, D, or E would technically be correct)

Structural Flaw (Proper Subsets):Options A, D, and E are proper subsets of the correct Option B. Models can exploit this logical leakage (selecting A, D, or E would technically be correct). Furthermore, the option distribution is severely imbalanced, violating statistical normality constraints
[61]

Statement Count Deficit:The question stem only contains 5 statements, failing the strict pseudo-multi-choice minimum requirement (≥6statements)
[62]

Chapter 1

Context Drift:The stem explicitly references an unprovided context ("Chapter 1"), violating the requirement that all questions must be logically self-contained. 32 E.3.4 Option Rigor and Structural Integrity To prevent LLMs from exploiting logical loopholes, the 10-option structure must adhere to strict constraints: • Pseudo-Multi-Choice Constraints:For c...
[63]

**NO OVERLAP**: Two selected questions must NOT test the same knowledge point
[64]

**DISCRIMINATIVE PRIORITY**: Always prefer HIGH discriminative_power questions over LOW
[65]

**TYPE PREFERENCE**: CASE_BASED > APPLIED_REASONING > CONCEPTUAL > FACTUAL_RECALL
[66]

**TOPIC DIVERSITY**: Ensure selected questions span different key_topics
[67]

selected_questions

**ERROR PATTERN V ALUE**: Questions with clear, consistent error patterns are valuable for understanding model weaknesses # Output Return ONLY a JSON object: 36 ‘‘‘json { "selected_questions": [ { "question_number": <int>, "knowledge_point": "<the knowledge point being tested>", "selection_reason": "<why this question was selected, referencing criteria>" ...
[68]

**Original Question Data**: {{ORIGINAL_JSON}}
[69]

Source Confidence

**Audit Report**: {{AUDIT_JSON_FROM_AGENT_A}} # Refinement Strategy Matrix Apply the following strategies based on the ‘issue_type‘ identified: ## Strategy A: Fix "Source Confidence" (Premise Injection) * **Action:** Add specific boundary conditions to the Question Stem. * **Example:** Change "Does X cause Y?" to "In the context of [Specific Cell Line/Con...
[70]

Assuming no [Confounding Factor B]

Exclusion: Add "Assuming no [Confounding Factor B]..." * **Goal:** Eliminate the validity of the competing distractor found by the auditor. ## Strategy C: Fix "Ambiguity" (Disambiguation) 38 * **Action:** Rewrite the confusing phrase using standard academic terminology. Ensure the syntax (e.g., modifier attachment) is singular in meaning. ## Strategy D: F...
[71]

**Explanation Sync:** If you modify ANY Option or the Correct Answer, you **MUST** rewrite the corresponding ‘explanation‘ to align with the new logic
[72]

Is Theory X correct?

**Source Addition:** If your modification introduces new facts or constraints not supported by the original source, you **MUST** provide a ‘new_source‘ (APA format or accessible URL). * **Example:** * *Before:* "Is Theory X correct?" (Hard to prove) * *After:* "**Assuming Theory X is correct**, which of the following observations would be expected?" OR "*...
[73]

**Do NOT make the question easier.** The goal is rigor, not simplification
[74]

**Maintain the Markdown/LaTeX format.**
[75]

Assuming setting X

**Evidence update:** If you add a new premise based on the audit, ensure the ‘question_content‘ reflects this (you may add a note "Assuming setting X"). # Output Return a JSON object containing ONLY the fields that require modification. Do not return unchanged fields (e.g., if the answer is unchanged, do not include it). Example: { "revision_summary": "Ap...

2024

[1] [1]

Introducing Claude Opus 4.6, February 2026

Anthropic. Introducing Claude Opus 4.6, February 2026. URL https://www.anthropic.com/news/ claude-opus-4-6

2026

[2] [2]

Introducing Claude Sonnet 4.6, February 2026

Anthropic. Introducing Claude Sonnet 4.6, February 2026. URL https://www.anthropic.com/news/ claude-sonnet-4-6

2026

[3] [3]

The Llama 4 herd: Architecture, training, evaluation, and deployment notes, 2026

Redacted by arXiv. The Llama 4 herd: Architecture, training, evaluation, and deployment notes, 2026. URL https://arxiv.org/abs/2601.11659

work page arXiv 2026

[4] [4]

Arc prize 2024: Technical report, 2025

Francois Chollet, Mike Knoop, Gregory Kamradt, et al. Arc prize 2024: Technical report, 2025. URL https://arxiv.org/abs/2412.04604

work page arXiv 2024

[5] [5]

ARC-AGI-2: A New Challenge for Frontier AI Reasoning Systems

Francois Chollet, Mike Knoop, Gregory Kamradt, et al. ARC-AGI-2: A new challenge for frontier AI reasoning systems, 2026. URLhttps://arxiv.org/abs/2505.11831

work page internal anchor Pith review Pith/arXiv arXiv 2026

[6] [6]

Gemini, February 2026

Google DeepMind. Gemini, February 2026. URLhttps://deepmind.google/models/gemini/

2026

[7] [7]

SuperGPQA: Scaling LLM evaluation across 285 graduate disciplines,

Xeron Du, Yifan Yao, Kaijing Ma, et al. SuperGPQA: Scaling LLM evaluation across 285 graduate disciplines,

[8] [8]

URLhttps://arxiv.org/abs/2502.14739

work page internal anchor Pith review Pith/arXiv arXiv

[9] [9]

Datasheets for datasets.Communications of the ACM, 64(12):86–92, 2021

Timnit Gebru, Jamie Morgenstern, Briana Vecchione, et al. Datasheets for datasets.Communications of the ACM, 64(12):86–92, 2021

2021

[10] [10]

Are we done with mmlu?, 2025

Aryo Pradipta Gema, Joshua Ong Jun Leang, Giwon Hong, et al. Are we done with mmlu?, 2025. URL https://arxiv.org/abs/2406.04127

work page arXiv 2025

[11] [11]

The Llama 3 Herd of Models

Aaron Grattafiori, Abhimanyu Dubey, Abhinav Jauhri, et al. The Llama 3 herd of models, 2024. URL https://arxiv.org/abs/2407.21783

work page internal anchor Pith review Pith/arXiv arXiv 2024

[12] [12]

Measuring Massive Multitask Language Understanding

Dan Hendrycks, Collin Burns, Steven Basart, et al. Measuring massive multitask language understanding, 2021. URLhttps://arxiv.org/abs/2009.03300

work page internal anchor Pith review Pith/arXiv arXiv 2021

[13] [13]

Step 3.5 Flash: Open frontier-level intelligence with 11b active parameters, 2026

Ailin Huang, Ang Li, Aobo Kong, et al. Step 3.5 Flash: Open frontier-level intelligence with 11b active parameters, 2026. URLhttps://arxiv.org/abs/2602.10604. 11

work page arXiv 2026

[14] [14]

Mixtral of Experts

Albert Q Jiang, Alexandre Sablayrolles, Antoine Roux, et al. Mixtral of experts, 2024. URL https://arxiv. org/abs/2401.04088

work page internal anchor Pith review Pith/arXiv arXiv 2024

[15] [15]

Rank-order tournaments as optimum labor contracts.Journal of Political Economy, 89(5):841–864, 1981

Edward P Lazear and Sherwin Rosen. Rank-order tournaments as optimum labor contracts.Journal of Political Economy, 89(5):841–864, 1981

1981

[16] [16]

DeepSeek-V3.2: Pushing the frontier of open large language models,

Aixin Liu, Aoxue Mei, Bangcai Lin, et al. DeepSeek-V3.2: Pushing the frontier of open large language models,

[17] [17]

URLhttps://arxiv.org/abs/2512.02556

work page internal anchor Pith review Pith/arXiv arXiv

[18] [18]

Learn to explain: Multimodal reasoning via thought chains for science question answering

Pan Lu, Swaroop Mishra, Tanglin Xia, et al. Learn to explain: Multimodal reasoning via thought chains for science question answering. InThe 36th Conference on Neural Information Processing Systems (NeurIPS), 2022

2022

[19] [19]

MiniMax M2.5: Built for real-world productivity, February 2026

MiniMax. MiniMax M2.5: Built for real-world productivity, February 2026. URL https://www.minimax. io/news/minimax-m25

2026

[20] [20]

An analysis of approximations for maximizing submodular set functions—i.Mathematical Programming, 14(1):265–294, 1978

George L Nemhauser, Laurence A Wolsey, and Marshall L Fisher. An analysis of approximations for maximizing submodular set functions—i.Mathematical Programming, 14(1):265–294, 1978

1978

[21] [21]

Adversarial nli: A new benchmark for natural language understanding

Yixin Nie, Adina Williams, Emily Dinan, et al. Adversarial nli: A new benchmark for natural language understanding. InProceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 4885–4901, 2020

2020

[22] [22]

Introducing GPT-5.2, December 2025

OpenAI. Introducing GPT-5.2, December 2025. URL https://openai.com/index/ introducing-gpt-5-2/

2025

[23] [23]

Introducing GPT-5.4, March 2026

OpenAI. Introducing GPT-5.4, March 2026. URL https://openai.com/index/ introducing-gpt-5-4/

2026

[24] [24]

A benchmark of expert-level academic questions to assess AI capabilities.Nature, 649(8099):1139–1146, 2026

Long Phan, Alice Gatti, Nathaniel Li, et al. A benchmark of expert-level academic questions to assess AI capabilities.Nature, 649(8099):1139–1146, 2026

2026

[25] [25]

GPQA: A Graduate-Level Google-Proof Q&A Benchmark

David Rein, Betty Li Hou, Stickland, et al. Gpqa: A graduate-level google-proof q&a benchmark, 2023. URL https://arxiv.org/abs/2311.12022

work page internal anchor Pith review Pith/arXiv arXiv 2023

[26] [26]

Seed 2.0 official launch, February 2026

Bytedance Seed. Seed 2.0 official launch, February 2026. URL https://seed.bytedance.com/en/ blog/seed-2-0-official-launch

2026

[27] [27]

Kimi K2.5: Visual Agentic Intelligence

Kimi Team, Tongtong Bai, Yifan Bai, et al. Kimi k2.5: Visual agentic intelligence.arXiv preprint arXiv:2602.02276, 2026

work page internal anchor Pith review Pith/arXiv arXiv 2026

[28] [28]

Qwen3-Max: Just scale it, September 2025

Qwen Team. Qwen3-Max: Just scale it, September 2025

2025

[29] [29]

Qwen3.5: Accelerating productivity with native multimodal agents, February 2026

Qwen Team. Qwen3.5: Accelerating productivity with native multimodal agents, February 2026. URL https://qwen.ai/blog?id=qwen3.5

2026

[30] [30]

Qwen2.5 Technical Report

Qwen Team, An Yang, Baosong Yang, et al. Qwen2.5 technical report.arXiv preprint arXiv:2412.15115, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024

[31] [31]

Position: The Hidden Costs and Measurement Gaps of Reinforcement Learning with Verifiable Rewards

Aaron Tu, Weihao Xuan, Heli Qi, et al. Position: The hidden costs and measurement gaps of reinforcement learning with verifiable rewards, 2025. URLhttps://arxiv.org/abs/2509.21882

work page internal anchor Pith review Pith/arXiv arXiv 2025

[32] [32]

Superglue: A stickier benchmark for general-purpose language understanding systems.Advances in neural information processing systems, 32, 2019

Alex Wang, Yada Pruksachatkun, Nikita Nangia, et al. Superglue: A stickier benchmark for general-purpose language understanding systems.Advances in neural information processing systems, 32, 2019

2019

[33] [33]

MMLU-Pro: A more robust and challenging multi-task language understanding benchmark.Advances in Neural Information Processing Systems, 37:95266–95290, 2024

Yubo Wang, Xueguang Ma, Ge Zhang, et al. MMLU-Pro: A more robust and challenging multi-task language understanding benchmark.Advances in Neural Information Processing Systems, 37:95266–95290, 2024

2024

[34] [34]

Grok-4-1 model card, November 2025

xAI Team. Grok-4-1 model card, November 2025. URL https://data.x.ai/ 2025-11-17-grok-4-1-model-card.pdf

2025

[35] [35]

Qwen2 Technical Report

An Yang, Baosong Yang, Binyuan Hui, et al. Qwen2 technical report, 2024. URL https://arxiv.org/ abs/2407.10671

work page internal anchor Pith review Pith/arXiv arXiv 2024

[36] [36]

Qwen3 Technical Report

An Yang, Anfeng Li, Baosong Yang, et al. Qwen3 technical report.arXiv preprint arXiv:2505.09388, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025

[37] [37]

GLM-5: from Vibe Coding to Agentic Engineering

Aohan Zeng, Xin Lv, Zhenyu Hou, et al. Glm-5: from vibe coding to agentic engineering.arXiv preprint arXiv:2602.15763, 2026

work page internal anchor Pith review Pith/arXiv arXiv 2026

[38] [38]

Hle-verified: A systematic verification and structured revision of humanity’s last exam, 2026

Weiqi Zhai, Zhihai Wang, Jinghang Wang, et al. Hle-verified: A systematic verification and structured revision of humanity’s last exam, 2026. URLhttps://arxiv.org/abs/2602.13964. 12 A Data Samples Pseudo-Multi-Choice Sample (Some Content Omitted ) Discipline: Engineering Question: Which of the following options are correct?

work page arXiv 2026

[39] [39]

By defining N=D ⊤M −1D, it follows that N∈R 3Nc×3Nc

D∈R 6M×3N c and M is a diagonal matrix. By defining N=D ⊤M −1D, it follows that N∈R 3Nc×3Nc. The Schur complement is given by S=N+ ˆM−BE −1C. Since ˆM is diagonal and BE −1C is a block- diagonal matrix with3×3blocks,Spossesses the identical sparsity pattern asN

[40] [40]

In the contact pairs of granular dynamics, the friction cone constraint requires that the magnitude of the tangential force impulse vector does not exceed the product of the static friction coefficient and the normal force impulse. When the static friction coefficient is 0.3 and the normal force impulse is 10 N·s, the maximum possible magnitude of the tan...

[41] [41]

sublinear inN

The AMEN Cross complexity bound in this setting scales as O(r3N) (linear in the number of unknowns N). For N= 10 6 and maximum TT-rank r= 10 , the computational cost is approximately 109 operations; however, this contradicts the claim of "sublinear inN" (a linear scaling inNcannot be sublinear)

[42] [42]

Given ∆t= 0.1s , M= 2I,v k = 3m/s,∆tf B = 4N s, and P Diγi = 6N s, we calculatev k+1 as follows: vk+1 =v k + 1 M ∆t fB + X Diγi = 3 + 1 2 (4 + 6) = 8m/s

Using the time-stepping update formula: M vk+1 −v k = ∆t fB +P i∈A(qk,δ) Diγi. Given ∆t= 0.1s , M= 2I,v k = 3m/s,∆tf B = 4N s, and P Diγi = 6N s, we calculatev k+1 as follows: vk+1 =v k + 1 M ∆t fB + X Diγi = 3 + 1 2 (4 + 6) = 8m/s

[43] [43]

With N=D ⊤M −1D, for M= 2 rigid bodies (yielding 6M= 12 degrees of freedom, DOF) and Nc = 1 contact (yielding 3Nc = 3 multipliers), N is a 3×3 matrix, as it is defined by the product D⊤M −1D where D∈R 12×3,M −1 ∈R 12×12, and thus acts on the contact multiplier space (not generalized-velocity space)

[44] [44]

For the position-based normal complementarity constraint: γi,n ≥0,Φ i(q)≥0,Φ i(q)γi,n = 0 . If Φi(q) = 0.5m and γi,n = 2N s , the constraint isnotsatisfied: while both quantities are nonnegative, their product is1m N s̸= 0, violating the complementarity condition (Φ i(q)γi,n = 0)

[45] [45]

For TT matrix-vector products used in the paper’s TT-based preconditioner, the complexity isO(r2NlogN) . While this complexity scales linearly in N (up to a logarithmic factor logN ), the cost is still considered *asymp- totically sublinear in N* when normalized by N—or more precisely, *sublinear in the sense of superlinear scaling avoidance*—because logN...

work page arXiv 2019

[46] [46]

If a discipline were to be evaluated using only 3 to 5 questions, the submitted instance must be fundamental and comprehensive enough to be one of them

Disciplinary Representativeness:Instances must act as highly representative probes. If a discipline were to be evaluated using only 3 to 5 questions, the submitted instance must be fundamental and comprehensive enough to be one of them. 2.High-Order Knowledge Application: 29 • Questions must construct a logically complete closed system where the solution ...

[47] [47]

What We Reject:

Static Knowledge Breadth:For memory-intensive disciplines (e.g., Education), questions should maximize knowledge coverage (e.g., utilizing10+statements to encompass major pedagogical theories). What We Reject:

[48] [48]

What is the 100th digit ofπ?

Narrow or Trivial Memorization:Questions testing pure rote memory devoid of core disciplinary literacy (e.g., "What is the 100th digit ofπ?") or focusing on hyper-niche, obscure sub-entities

[49] [49]

Idiosyncratic or Tricky Trivia:Questions universally recognized as flawed or unreasonable even in human examinations

[50] [50]

Weak Epistemological Consensus:Hypotheses proposed by individual scholars that lack widespread aca- demic consensus or are highly volatile (e.g., highly debated legal interpretations)

[51] [51]

E.2.3 Standardized Annotation Workflow The question authoring process is strictly compartmentalized into six components

Non-Disciplinary Failure Modes:Instances where LLMs fail due to semantic traps, ambiguous phrasing, or floating-point calculation errors rather than a deficit in disciplinary literacy. E.2.3 Standardized Annotation Workflow The question authoring process is strictly compartmentalized into six components. The specifications for each component are detailed ...

[52] [52]

It targets marine chemistry and assesses advanced reasoning and computational abilities, rather than only testing static knowledge

[53] [53]

You are an expert

Even though it is a calculation question, the analysis of incorrect options is briefly summarized instead of being copied and pasted repetitively. E.3 Review Manual E.3.1 Core Philosophy and General Workflow The primary objective of the review process is to ensure that each curated instance serves as a highly representative probe for evaluating LLMs on sp...

2017

[54] [54]

Body elongated, slightly compressed, large head, flat snout

[55] [55]

Lateral line scales are at least weakly ctenoid, two dorsal fins, separated

[56] [56]

Posterior margin of the caudal fin is rounded

[57] [57]

The first dorsal fin is short, consisting of 3-4 spines

[58] [58]

Upper jaw is slightly shorter than the lower jaw. Options A.1, 2B.1, 2, 3 (Correct Option)C.1, 2, 3, 4, 5D.1E.2F.1, 2, 4G.1, 3, 5H.3, 5J.4, 5 Explanation Option A:Statement 1 is correct (Body elongated, slightly compressed, large head, flat snout); Statement 2 is correct (Lateral line scales are at least weakly ctenoid, two dorsal fins, separated). Option...

[59] [59]

Statements 4 and 5 merely test pure static memory of isolated trivia, lacking sufficient cognitive coverage to serve as a representative evaluation probe

Weak Disciplinary Representativeness:The question fails to engage high-order disciplinary literacy. Statements 4 and 5 merely test pure static memory of isolated trivia, lacking sufficient cognitive coverage to serve as a representative evaluation probe

[60] [60]

Models can exploit this logical leakage (selecting A, D, or E would technically be correct)

Structural Flaw (Proper Subsets):Options A, D, and E are proper subsets of the correct Option B. Models can exploit this logical leakage (selecting A, D, or E would technically be correct). Furthermore, the option distribution is severely imbalanced, violating statistical normality constraints

[61] [61]

Statement Count Deficit:The question stem only contains 5 statements, failing the strict pseudo-multi-choice minimum requirement (≥6statements)

[62] [62]

Chapter 1

Context Drift:The stem explicitly references an unprovided context ("Chapter 1"), violating the requirement that all questions must be logically self-contained. 32 E.3.4 Option Rigor and Structural Integrity To prevent LLMs from exploiting logical loopholes, the 10-option structure must adhere to strict constraints: • Pseudo-Multi-Choice Constraints:For c...

[63] [63]

**NO OVERLAP**: Two selected questions must NOT test the same knowledge point

[64] [64]

**DISCRIMINATIVE PRIORITY**: Always prefer HIGH discriminative_power questions over LOW

[65] [65]

**TYPE PREFERENCE**: CASE_BASED > APPLIED_REASONING > CONCEPTUAL > FACTUAL_RECALL

[66] [66]

**TOPIC DIVERSITY**: Ensure selected questions span different key_topics

[67] [67]

selected_questions

**ERROR PATTERN V ALUE**: Questions with clear, consistent error patterns are valuable for understanding model weaknesses # Output Return ONLY a JSON object: 36 ‘‘‘json { "selected_questions": [ { "question_number": <int>, "knowledge_point": "<the knowledge point being tested>", "selection_reason": "<why this question was selected, referencing criteria>" ...

[68] [68]

**Original Question Data**: {{ORIGINAL_JSON}}

[69] [69]

Source Confidence

**Audit Report**: {{AUDIT_JSON_FROM_AGENT_A}} # Refinement Strategy Matrix Apply the following strategies based on the ‘issue_type‘ identified: ## Strategy A: Fix "Source Confidence" (Premise Injection) * **Action:** Add specific boundary conditions to the Question Stem. * **Example:** Change "Does X cause Y?" to "In the context of [Specific Cell Line/Con...

[70] [70]

Assuming no [Confounding Factor B]

Exclusion: Add "Assuming no [Confounding Factor B]..." * **Goal:** Eliminate the validity of the competing distractor found by the auditor. ## Strategy C: Fix "Ambiguity" (Disambiguation) 38 * **Action:** Rewrite the confusing phrase using standard academic terminology. Ensure the syntax (e.g., modifier attachment) is singular in meaning. ## Strategy D: F...

[71] [71]

**Explanation Sync:** If you modify ANY Option or the Correct Answer, you **MUST** rewrite the corresponding ‘explanation‘ to align with the new logic

[72] [72]

Is Theory X correct?

**Source Addition:** If your modification introduces new facts or constraints not supported by the original source, you **MUST** provide a ‘new_source‘ (APA format or accessible URL). * **Example:** * *Before:* "Is Theory X correct?" (Hard to prove) * *After:* "**Assuming Theory X is correct**, which of the following observations would be expected?" OR "*...

[73] [73]

**Do NOT make the question easier.** The goal is rigor, not simplification

[74] [74]

**Maintain the Markdown/LaTeX format.**

[75] [75]

Assuming setting X

**Evidence update:** If you add a new premise based on the audit, ensure the ‘question_content‘ reflects this (you may add a note "Assuming setting X"). # Output Return a JSON object containing ONLY the fields that require modification. Do not return unchanged fields (e.g., if the answer is unchanged, do not include it). Example: { "revision_summary": "Ap...

2024