SuperGPQA: Scaling LLM Evaluation across 285 Graduate Disciplines
Pith reviewed 2026-05-16 00:41 UTC · model grok-4.3
The pith
The SuperGPQA benchmark shows that even the best state-of-the-art LLM reaches only 61.82 percent accuracy across 285 graduate disciplines.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
SuperGPQA is a benchmark spanning 285 disciplines that uses a Human-LLM collaborative filtering mechanism to produce high-quality graduate-level questions; evaluations of state-of-the-art LLMs on this set reach a maximum accuracy of 61.82 percent, indicating a substantial gap between current model capabilities and artificial general intelligence.
What carries the argument
The Human-LLM collaborative filtering mechanism, which iteratively refines candidate questions by combining LLM responses with expert feedback to eliminate trivial or ambiguous items.
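The mechanism described above can be made concrete with a toy sketch. Everything here, the 90 percent easy-item threshold, the ambiguity heuristic, and the data shapes, is an illustrative assumption, not the paper's published procedure:

```python
def difficulty_filter(questions, model_answers, easy_frac=0.9):
    """Toy version of an LLM-assisted difficulty filter.

    Drops items that a large fraction of reference models already answer
    correctly (likely trivial) and routes items on which every model is
    wrong AND all models disagree to expert review (possibly ambiguous).
    Thresholds and heuristics are illustrative, not from the paper.
    """
    kept, needs_expert_review = [], []
    for q in questions:
        answers = model_answers[q["id"]]
        frac_correct = sum(a == q["answer"] for a in answers) / len(answers)
        if frac_correct >= easy_frac:
            continue  # trivial: nearly all models solve it, so discard
        if frac_correct == 0 and len(set(answers)) == len(answers):
            needs_expert_review.append(q)  # total disagreement: flag for experts
        else:
            kept.append(q)
    return kept, needs_expert_review
```

In the paper's actual pipeline the expert-review queue would feed back into iterative refinement rather than a single pass; the point of the sketch is only that model responses and expert feedback filter complementary failure modes (triviality vs. ambiguity).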
If this is right
- LLMs still exhibit large performance gaps in specialized disciplines outside mainstream academic areas.
- The benchmark supplies a concrete metric for tracking progress toward broader expert-level capabilities.
- Insights from the eighty-expert annotation process can guide the design of future large-scale evaluation efforts.
- Discipline-by-discipline score differences can highlight which knowledge areas require targeted model improvements.
Where Pith is reading between the lines
- Training data for future models will likely need to incorporate more material from underrepresented fields such as agriculture and service disciplines to raise scores.
- Real-world deployment in specialized professional settings may remain unreliable until accuracy on this type of benchmark rises well above 80 percent.
- The filtering approach could be adapted to create similar graduate-level tests in languages other than English or for professional certification exams.
Load-bearing premise
The Human-LLM collaborative filtering process produces questions that are genuinely graduate-level, unambiguous, and representative of each discipline without introducing selection bias.
What would settle it
Independent expert review of a random sample of SuperGPQA questions finds that a large fraction can be answered correctly by undergraduates or contain hidden ambiguities that allow multiple valid answers.
Original abstract
Large language models (LLMs) have demonstrated remarkable proficiency in mainstream academic disciplines such as mathematics, physics, and computer science. However, human knowledge encompasses over 200 specialized disciplines, far exceeding the scope of existing benchmarks. The capabilities of LLMs in many of these specialized fields-particularly in light industry, agriculture, and service-oriented disciplines-remain inadequately evaluated. To address this gap, we present SuperGPQA, a comprehensive benchmark that evaluates graduate-level knowledge and reasoning capabilities across 285 disciplines. Our benchmark employs a novel Human-LLM collaborative filtering mechanism to eliminate trivial or ambiguous questions through iterative refinement based on both LLM responses and expert feedback. Our experimental results reveal significant room for improvement in the performance of current state-of-the-art LLMs across diverse knowledge domains (e.g., the reasoning-focused model DeepSeek-R1 achieved the highest accuracy of 61.82% on SuperGPQA), highlighting the considerable gap between current model capabilities and artificial general intelligence. Additionally, we present comprehensive insights from our management of a large-scale annotation process, involving over 80 expert annotators and an interactive Human-LLM collaborative system, offering valuable methodological guidance for future research initiatives of comparable scope.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces SuperGPQA, a benchmark evaluating LLM graduate-level knowledge and reasoning across 285 disciplines. It describes a Human-LLM collaborative filtering process to remove trivial or ambiguous questions via iterative refinement with LLM responses and expert feedback, reports that the best model (DeepSeek-R1) reaches only 61.82% accuracy, and provides methodological insights from managing annotation with over 80 experts.
Significance. If the filtering successfully yields unambiguous graduate-level items representative of the 285 disciplines, the benchmark would meaningfully expand evaluation beyond mainstream fields and document a substantial capability gap. The large-scale annotation management details could also inform future benchmark construction efforts.
major comments (2)
- [Abstract / Filtering mechanism] Abstract and methods description of the Human-LLM collaborative filtering: no quantitative diagnostics are supplied (rejection rates per stage or discipline, inter-expert agreement, pre/post-filtering accuracy curves, or examples of discarded vs. retained items). This is load-bearing for the central claim that retained questions are genuinely graduate-level and that the 61.82% ceiling reflects a true AGI gap rather than residual ambiguity or selection bias.
- [Results] Results section reporting model accuracies: headline figures such as DeepSeek-R1 at 61.82% are given without error bars, confidence intervals, or per-discipline variance. This prevents assessment of whether observed differences across models or fields are statistically reliable.
minor comments (2)
- [Abstract] Clarify the exact count of disciplines (abstract states 'over 200' while title and body use 285) and provide a breakdown of coverage by broad field.
- Include at least one or two sample questions per major discipline cluster to allow readers to judge graduate-level difficulty directly.
Simulated Author's Rebuttal
We thank the referee for the constructive and detailed feedback. We address each major comment point by point below and describe the revisions we will incorporate to improve the manuscript.
Point-by-point responses
-
Referee: [Abstract / Filtering mechanism] Abstract and methods description of the Human-LLM collaborative filtering: no quantitative diagnostics are supplied (rejection rates per stage or discipline, inter-expert agreement, pre/post-filtering accuracy curves, or examples of discarded vs. retained items). This is load-bearing for the central claim that retained questions are genuinely graduate-level and that the 61.82% ceiling reflects a true AGI gap rather than residual ambiguity or selection bias.
Authors: We agree that the current description would benefit from additional quantitative support. In the revised manuscript we will add rejection rates per filtering stage and per discipline, inter-expert agreement statistics (including percentage agreement and Cohen’s kappa where multiple experts reviewed the same items), pre- and post-filtering accuracy curves for the LLMs used in the process, and representative examples of discarded versus retained questions. These additions will directly substantiate that the retained items are graduate-level and reduce concerns about residual ambiguity or selection bias. revision: yes
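The inter-expert agreement statistic promised in this response can be computed as follows. This is a generic sketch of Cohen's kappa for two raters over the same items, not code from the paper:

```python
from collections import Counter

def cohens_kappa(rater_a, rater_b):
    """Cohen's kappa: chance-corrected agreement between two raters."""
    assert len(rater_a) == len(rater_b) and rater_a
    n = len(rater_a)
    # Observed agreement: fraction of items both raters labeled identically.
    p_o = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Expected agreement if each rater labeled independently at their
    # own marginal label frequencies.
    ca, cb = Counter(rater_a), Counter(rater_b)
    p_e = sum((ca[label] / n) * (cb[label] / n) for label in set(ca) | set(cb))
    return (p_o - p_e) / (1 - p_e)
```

For instance, two annotators labeling ten items "keep"/"discard" with 8/10 raw agreement and balanced marginals give kappa of about 0.58: agreement well above chance but far from perfect, which is the kind of distinction the raw percentage alone would hide.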
-
Referee: [Results] Results section reporting model accuracies: headline figures such as DeepSeek-R1 at 61.82% are given without error bars, confidence intervals, or per-discipline variance. This prevents assessment of whether observed differences across models or fields are statistically reliable.
Authors: We acknowledge that statistical reliability measures are important for interpreting the results. In the revision we will report 95% binomial confidence intervals for all headline accuracy figures and include error bars on the main result plots. We will also add per-discipline standard deviations and, for disciplines with sufficient question counts, per-discipline error bars. These changes will allow readers to assess the reliability of differences across models and fields. revision: yes
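As a rough check on what such intervals look like, a Wilson score interval for the headline accuracy can be sketched. The item count used below (25,000) is an illustrative assumption, not the paper's reported dataset size:

```python
import math

def wilson_ci(successes, n, z=1.96):
    """Wilson score interval for a binomial proportion (95% at z=1.96)."""
    p = successes / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half = (z / denom) * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n))
    return center - half, center + half

# 61.82% accuracy on a hypothetical 25,000-item benchmark
lo, hi = wilson_ci(int(0.6182 * 25000), 25000)
```

At that scale the 95 percent interval around 61.82 percent is only about plus or minus 0.6 points, so headline gaps of a point or more between models would be statistically meaningful, while per-discipline intervals (with far fewer items each) would be much wider, which is exactly why the referee's request for per-discipline error bars matters.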
Circularity Check
No circularity: empirical benchmark with no derivations or fitted predictions
Full rationale
The paper constructs SuperGPQA via Human-LLM collaborative filtering and reports direct empirical accuracies (e.g., DeepSeek-R1 at 61.82%). No equations, parameters, or predictions exist that reduce to inputs by construction. The filtering process is presented as a methodological contribution without self-referential loops, uniqueness theorems, or ansatzes. Central claims rest on the benchmark results themselves rather than any tautological reduction. This matches the expected non-circular outcome for a pure empirical dataset paper.
Axiom & Free-Parameter Ledger
axioms (1)
- Domain assumption: expert annotators can consistently identify and remove trivial or ambiguous graduate-level questions.
Forward citations
Cited by 18 Pith papers
-
Scaling Latent Reasoning via Looped Language Models
Looped language models with latent iterative computation and entropy-regularized depth allocation achieve performance matching up to 12B standard LLMs through superior knowledge manipulation.
-
SlimQwen: Exploring the Pruning and Distillation in Large MoE Model Pre-training
Pruning pretrained MoE models outperforms training from scratch, different compression methods converge after continued pretraining, and combining KD with language modeling loss plus progressive schedules yields a com...
-
Rotation-Preserving Supervised Fine-Tuning
RPSFT improves the in-domain versus out-of-domain performance trade-off during LLM supervised fine-tuning by penalizing rotations in pretrained singular subspaces as a proxy for loss-sensitive directions.
-
Implicit Compression Regularization: Concise Reasoning via Internal Shorter Distributions in RL Post-Training
ICR creates a virtual shorter distribution from shortest correct on-policy responses to regularize RL post-training toward concise yet accurate reasoning, improving the accuracy-length Pareto frontier on math and know...
-
SciResearcher: Scaling Deep Research Agents for Frontier Scientific Reasoning
SciResearcher automates creation of diverse scientific reasoning tasks from academic evidence to train an 8B model that sets new SOTA at 19.46% on HLE-Bio/Chem-Gold and gains 13-15% on SuperGPQA-Hard-Biology and TRQA-...
-
Agent-World: Scaling Real-World Environment Synthesis for Evolving General Agent Intelligence
Agent-World autonomously synthesizes verifiable real-world tasks and uses continuous self-evolution to train 8B and 14B agents that outperform proprietary models on 23 benchmarks.
-
SafeSci: Safety Evaluation of Large Language Models in Science Domains and Beyond
SafeSci creates a large objective benchmark and training resource that reveals safety weaknesses in current LLMs for science and demonstrates measurable improvement through targeted fine-tuning.
-
SenseNova-U1: Unifying Multimodal Understanding and Generation with NEO-unify Architecture
SenseNova-U1 presents native unified multimodal models that match top understanding VLMs while delivering strong performance in image generation, infographics, and interleaved tasks via the NEO-unify architecture.
-
Heterogeneous Scientific Foundation Model Collaboration
Eywa enables language-based agentic AI systems to collaborate with specialized scientific foundation models for improved performance on structured data tasks.
-
Qwen3.5-Omni Technical Report
Qwen3.5-Omni scales an omnimodal model to hundreds of billions of parameters with 256k context, introduces ARIA for stable speech synthesis, and reports SOTA performance on 215 audio-visual benchmarks while adding mul...
-
Seed1.8 Model Card: Towards Generalized Real-World Agency
Seed1.8 is a new foundation model that adds unified agentic capabilities for search, code execution, and GUI interaction to existing LLM and vision strengths.
-
MiMo-V2-Flash Technical Report
MiMo-V2-Flash is a 309B/15B MoE model trained on 27T tokens with hybrid attention and multi-teacher on-policy distillation that matches larger models like DeepSeek-V3.2 while enabling 2.6x faster decoding via repurpos...
-
Kimi K2: Open Agentic Intelligence
Kimi K2 is a 1-trillion-parameter MoE model that leads open-source non-thinking models on agentic benchmarks including 65.8 on SWE-Bench Verified and 66.1 on Tau2-Bench.
-
Qwen3 Technical Report
Pith review generated a malformed one-line summary.
-
Towards Reasoning Era: A Survey of Long Chain-of-Thought for Reasoning Large Language Models
The paper unifies perspectives on Long CoT in reasoning LLMs by introducing a taxonomy, detailing characteristics of deep reasoning and reflection, and discussing emergence phenomena and future directions.
-
Qwen-Scope: Turning Sparse Features into Development Tools for Large Language Models
Qwen-Scope provides open-source sparse autoencoders for Qwen models that function as practical interfaces for steering, evaluating, data workflows, and optimizing large language models.
-
Supplement Generation Training for Enhancing Agentic Task Performance
SGT trains a lightweight model to generate task-specific supplemental text that improves performance of a larger frozen LLM on agentic tasks without modifying the large model.
-
MedXIAOHE: A Comprehensive Recipe for Building Medical MLLMs
MedXIAOHE is a medical MLLM that claims state-of-the-art benchmark performance through specialized pretraining to cover long-tail diseases and RL-based reasoning training.