arxiv: 2601.14289 · v2 · submitted 2026-01-14 · 💻 cs.CL · cs.AI

RPC-Bench: A Fine-grained Benchmark for Research Paper Comprehension

Yelin Chen, Fanjin Zhang, Suping Sun, Yunhe Pang, Yuanchun Wang, Jian Song, Xiaoyan Li, Lei Hou, Shu Zhao, Jie Tang, Juanzi Li

Pith reviewed 2026-05-16 14:41 UTC · model grok-4.3

keywords research paper comprehension · LLM benchmark · question answering dataset · scientific text understanding · model evaluation · review rebuttal data

The pith

Even the strongest models achieve only 68.2 percent correctness-completeness on a new benchmark for understanding research papers.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces RPC-Bench, a large-scale QA benchmark built from 15K human-verified pairs drawn from review-rebuttal exchanges of high-quality computer science papers. It supplies a fine-grained taxonomy aligned with the scientific research flow to evaluate how well models handle why, what, and how questions in scholarly settings. Experiments show that leading models such as GPT-5 reach only 68.2 percent on correctness and completeness, falling to 37.46 percent once conciseness is required, pointing to clear shortfalls in precise academic paper comprehension.

Core claim

RPC-Bench supplies 15K human-verified QA pairs from real review-rebuttal exchanges together with a taxonomy that follows the scientific research flow and an LLM-as-a-Judge framework that scores models on both correctness-completeness and conciseness, demonstrating that even the strongest current models fall well short of reliable research-paper understanding.

What carries the argument

The fine-grained taxonomy aligned with the scientific research flow that organizes questions into why, what, and how categories drawn from review-rebuttal exchanges.

If this is right

Models must improve at handling specialized scientific discourse, figures, and tables.
Review-rebuttal exchanges can serve as a scalable source for high-quality, realistic QA data.
Conciseness-adjusted scoring reveals a persistent trade-off between completeness and brevity that current models do not resolve.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The benchmark could be extended beyond computer science to test whether the same performance gaps appear in physics, biology, or other domains.
Training data that includes reviewer comments might help models anticipate common points of confusion in scientific writing.
The drop after conciseness adjustment suggests future work could focus on architectures that jointly optimize for both completeness and brevity.

Load-bearing premise

That QA pairs extracted from review-rebuttal exchanges form a representative and unbiased sample of the comprehension challenges present in the broader scientific literature.

What would settle it

Re-running the same evaluation protocol on QA pairs drawn from a different source, such as direct reader questions on arXiv preprints or post-publication comments, and finding that model scores remain comparably low would support the benchmark's claim.

Figures

Figures reproduced from arXiv: 2601.14289 by Fanjin Zhang, Jian Song, Jie Tang, Juanzi Li, Lei Hou, Shu Zhao, Suping Sun, Xiaoyan Li, Yelin Chen, Yuanchun Wang, Yunhe Pang.

**Figure 2.** Domain distribution of RPC-Bench. ML: Machine Learning; CV: Computer Vision; NLP: Natural Language Processing; RL: Reinforcement Learning. It begins with what-questions, which focus on clarifying fundamental concepts and contextual background. It then advances to how-questions, which probe the mechanics of methods and experimental setups. Finally, it deepens into why-questions, which examine the underly…

**Figure 3.** Task taxonomy of QA pairs. [PITH_FULL_IMAGE:figures/full_fig_p005_3.png]

**Figure 4.** Comparison of LLMs and VLMs on open-ended question answering (F1-like score; left), and the performance of all models on claim verification tasks (ACC; right).

**Figure 10.** Fig.10 [PITH_FULL_IMAGE:figures/full_fig_p008_10.png]

**Figure 5.** Representative case studies from the RPC-Bench test set [PITH_FULL_IMAGE:figures/full_fig_p008_5.png]

**Figure 6.** Model Conciseness across Taxonomy-Defined Question Types [PITH_FULL_IMAGE:figures/full_fig_p015_6.png]

**Figure 7.** Additional Case study

**Figure 8.** Screenshot of the Annotation Interface 1 [PITH_FULL_IMAGE:figures/full_fig_p022_8.png]

**Figure 9.** Screenshot of the Annotation Interface 2 [PITH_FULL_IMAGE:figures/full_fig_p022_9.png]

**Figure 10.** Screenshot of the Annotation Interface 3 [PITH_FULL_IMAGE:figures/full_fig_p023_10.png]

**Figure 11.** Screenshot of the Review Interface 1

**Figure 12.** Screenshot of the Review Interface 2 [PITH_FULL_IMAGE:figures/full_fig_p024_12.png]

**Figure 13.** Screenshot of the Review Interface 3 [PITH_FULL_IMAGE:figures/full_fig_p024_13.png]

**Figure 1.** In order to emphasize the benefits of the proposed RSA, we employed a slightly larger model. Unfortunately, due to limited resources, we were unable to conduct further experiments using a 24-layer XL model. While acknowledging these limitations, we believe that the use of Nvidia's implementation, combined with our modifications, provides valuable insights and supports our argument. The comparison between …

Original abstract

Understanding research papers remains challenging for foundation models due to specialized scientific discourse and complex figures and tables, yet existing benchmarks offer limited fine-grained evaluation at scale. To address this gap, we introduce RPC-Bench, a large-scale question-answering benchmark built from review-rebuttal exchanges of high-quality computer science papers, containing 15K human-verified QA pairs. We design a fine-grained taxonomy aligned with the scientific research flow to assess models' ability to understand and answer why, what, and how questions in scholarly contexts. We also define an elaborate LLM-human interaction annotation framework to support large-scale labeling and quality control. Following the LLM-as-a-Judge paradigm, we develop a scalable framework that evaluates models on correctness-completeness and conciseness, with high agreement to human judgment. Experiments reveal that even the strongest models (GPT-5) achieve only 68.2% correctness-completeness, dropping to 37.46% after conciseness adjustment, highlighting substantial gaps in precise academic paper understanding. Our code and data are available at https://rpc-bench.github.io/.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The paper introduces RPC-Bench, a benchmark of 15K human-verified QA pairs extracted from review-rebuttal exchanges in high-quality CS papers. It defines a fine-grained taxonomy aligned with the scientific research flow for evaluating models on why/what/how questions, and applies an LLM-as-Judge protocol (with claimed high human agreement) to measure correctness-completeness and conciseness. Experiments show even GPT-5 reaches only 68.2% correctness-completeness, falling to 37.46% after conciseness adjustment, which the authors interpret as evidence of substantial gaps in precise scholarly paper understanding.

Significance. If the sampling frame is shown to be representative, RPC-Bench would offer a useful large-scale, fine-grained resource for diagnosing and improving foundation-model performance on authentic scientific discourse, going beyond existing benchmarks by tying questions to review-driven challenges and providing an open code/data release.

major comments (3)

[§3] §3 (Data Construction): The headline claim of substantial gaps in paper comprehension rests on the 15K QA pairs being representative of typical scholarly reading demands. Extraction from review-rebuttal threads likely over-samples clarification/critique questions while under-sampling straightforward method or result-interpretation questions; the manuscript must include a control comparison (e.g., randomly sampled questions from the same papers) to rule out sampling-frame artifacts.
[§5] §5 (Evaluation): The drop from 68.2% correctness-completeness to 37.46% after conciseness adjustment is central to the reported result, yet the precise definition, weighting, and computation of the conciseness term are not specified; without an explicit formula or pseudocode, the adjusted metric cannot be interpreted or reproduced.
[§4] §4 (Annotation Framework): The reliability of the human-verified labels and the LLM-as-Judge protocol is load-bearing for all reported scores, but inter-annotator agreement statistics (e.g., Cohen’s kappa or percentage agreement), verifier count, and any post-hoc filtering criteria are not provided; these details are required to confirm that the 15K pairs constitute a high-quality benchmark.
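
For readers unfamiliar with the agreement statistic requested in the §4 comment, the following is a minimal sketch of Cohen's kappa for two annotators. The label lists are hypothetical and purely illustrative; nothing here reflects the paper's actual annotation data or verifier setup.

```python
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators labelling the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two equal-length, non-empty label lists")
    n = len(labels_a)
    # Observed agreement: fraction of items given identical labels.
    p_o = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement, from each annotator's marginal label frequencies.
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    p_e = sum(freq_a[c] * freq_b[c] for c in freq_a) / (n * n)
    if p_e == 1.0:
        return 1.0  # both annotators used one and the same label throughout
    return (p_o - p_e) / (1.0 - p_e)


# Hypothetical keep/reject decisions on ten QA pairs (illustrative only).
annotator_1 = ["ok", "ok", "ok", "bad", "ok", "bad", "ok", "ok", "bad", "ok"]
annotator_2 = ["ok", "ok", "bad", "bad", "ok", "bad", "ok", "ok", "ok", "ok"]
print(round(cohens_kappa(annotator_1, annotator_2), 3))
```

Kappa discounts the agreement two annotators would reach by chance, which is why the referee asks for it alongside raw percentage agreement.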

minor comments (2)

[Abstract] Abstract: the phrase 'high agreement to human judgment' should be quantified (e.g., '92% agreement, κ=0.81') so readers can immediately gauge reliability.
[§2] §2 (Related Work): the contrast with prior scientific QA benchmarks could be sharpened by adding a table that directly compares scale, taxonomy granularity, and source distribution.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for their thoughtful and constructive comments on our manuscript. We address each of the major comments below and will make the necessary revisions to improve the clarity and rigor of the paper.

Point-by-point responses

Referee: [§3] §3 (Data Construction): The headline claim of substantial gaps in paper comprehension rests on the 15K QA pairs being representative of typical scholarly reading demands. Extraction from review-rebuttal threads likely over-samples clarification/critique questions while under-sampling straightforward method or result-interpretation questions; the manuscript must include a control comparison (e.g., randomly sampled questions from the same papers) to rule out sampling-frame artifacts.

Authors: We acknowledge the referee's concern regarding potential sampling bias in our data construction. Our benchmark is specifically designed to capture the types of questions that arise during the peer review process, which we believe represent key challenges in research paper comprehension. However, to directly address the issue of representativeness, we will add a control comparison using randomly sampled questions from the same set of papers in the revised manuscript. This will allow us to quantify any differences and strengthen the claim.

revision: yes

Referee: [§5] §5 (Evaluation): The drop from 68.2% correctness-completeness to 37.46% after conciseness adjustment is central to the reported result, yet the precise definition, weighting, and computation of the conciseness term are not specified; without an explicit formula or pseudocode, the adjusted metric cannot be interpreted or reproduced.

Authors: We apologize for not providing sufficient detail on the conciseness adjustment in the original submission. In the revised manuscript, we will include an explicit formula and pseudocode for computing the conciseness term and the adjusted metric. The conciseness score penalizes overly verbose responses while maintaining the correctness-completeness evaluation.

revision: yes

Referee: [§4] §4 (Annotation Framework): The reliability of the human-verified labels and the LLM-as-Judge protocol is load-bearing for all reported scores, but inter-annotator agreement statistics (e.g., Cohen’s kappa or percentage agreement), verifier count, and any post-hoc filtering criteria are not provided; these details are required to confirm that the 15K pairs constitute a high-quality benchmark.

Authors: We agree that these details are essential for establishing the quality of the benchmark. We will include inter-annotator agreement statistics (such as Cohen’s kappa and percentage agreement), the number of verifiers involved, and the post-hoc filtering criteria in the revised version of §4.

revision: yes

Circularity Check

0 steps flagged

No circularity in derivation chain

full rationale

The paper introduces RPC-Bench as a data-collection and evaluation pipeline: it extracts 15K QA pairs from review-rebuttal threads, applies a taxonomy aligned with research flow, and uses an LLM-as-a-Judge framework whose outputs are validated against human judgments. No equations, parameter fitting, or self-citation chains appear in the provided text. Performance numbers (68.2% correctness-completeness, 37.46% after conciseness) are direct measurements on the constructed benchmark rather than quantities derived from the benchmark itself by construction. The work is therefore self-contained against external human labels and contains no load-bearing steps that reduce to its own inputs.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests on the assumption that review-rebuttal exchanges capture the fine-grained comprehension difficulties that matter for scientific understanding; no free parameters or invented entities are introduced.

axioms (1)

domain assumption: Review-rebuttal exchanges contain representative why-what-how questions that reflect genuine paper comprehension challenges.
Invoked when constructing the QA pairs and taxonomy from real review data.

pith-pipeline@v0.9.0 · 5519 in / 1233 out tokens · 34066 ms · 2026-05-16T14:41:51.699348+00:00 · methodology

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Foundation/RealityFromDistinction.lean · reality_from_one_distinction · unclear

We introduce RPC-Bench, a large-scale question-answering benchmark built from review-rebuttal exchanges... 15K human-verified QA pairs... fine-grained taxonomy aligned with the scientific research flow

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · unclear

F1-like = (1 + β²) × (Correctness × Completeness) / (β² × Correctness + Completeness); Informativeness = F1-like × Conciseness
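
The quoted formula has the shape of a standard F-beta score, with Correctness in the precision position and Completeness in the recall position. The sketch below works through it numerically. It assumes all three scores are already normalized to the range 0 to 1 and that β defaults to 1; both are assumptions made here for illustration, and the function names are hypothetical rather than taken from the paper's code.

```python
def f1_like(correctness, completeness, beta=1.0):
    """F-beta-style combination of Correctness and Completeness.

    Assumes both inputs are already normalized to [0, 1].
    """
    for score in (correctness, completeness):
        if not 0.0 <= score <= 1.0:
            raise ValueError("scores must lie in [0, 1]")
    denominator = beta**2 * correctness + completeness
    if denominator == 0.0:
        return 0.0  # both scores are zero, so the combined score is zero
    return (1 + beta**2) * correctness * completeness / denominator


def informativeness(correctness, completeness, conciseness, beta=1.0):
    """F1-like score scaled by a Conciseness factor in [0, 1]."""
    if not 0.0 <= conciseness <= 1.0:
        raise ValueError("conciseness must lie in [0, 1]")
    return f1_like(correctness, completeness, beta) * conciseness


# Illustrative values only, not results from the paper.
print(round(f1_like(0.8, 0.6), 4))
print(round(informativeness(0.8, 0.6, 0.5), 4))
```

Because Conciseness enters as a plain multiplier, a verbose answer is penalized even when it is correct and complete. If the same product held at the aggregate level, the reported drop from 68.2% to 37.46% would correspond to an average conciseness factor of roughly 0.55, although averaging per-example products need not preserve that relation exactly.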

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

AutoResearchBench: Benchmarking AI Agents on Complex Scientific Literature Discovery
cs.AI 2026-04 accept novelty 8.0

AutoResearchBench is a new benchmark showing top AI agents achieve under 10% success on complex scientific literature discovery tasks that demand deep comprehension and open-ended search.
