pith. machine review for the scientific record.

arxiv: 2605.07905 · v1 · submitted 2026-05-08 · 💻 cs.CL · cs.AI

Recognition: 2 theorem links · Lean Theorem

CoCoReviewBench: A Completeness- and Correctness-Oriented Benchmark for AI Reviewers

Dehao Huang, Derek F. Wong, Hexuan Deng, Min Zhang, Ruina Hu, Xiaopeng Ke, Xuebo Liu, Yichen Li, Yue Wang

Pith reviewed 2026-05-11 03:35 UTC · model grok-4.3

classification 💻 cs.CL cs.AI
keywords AI reviewers · benchmark dataset · completeness · correctness · peer review · hallucinations · reasoning models · ICLR · NeurIPS

The pith

CoCoReviewBench evaluates AI reviewers using expert discussions to measure both completeness and correctness.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

This paper introduces a benchmark for better evaluating AI systems that review scientific papers. Current metrics compare AI reviews to human ones, but human reviews are incomplete and can contain errors, so they are unreliable gold standards. The authors address this by curating 3,900 papers from major conferences, skipping evaluation wherever the corresponding human reviews are missing so that completeness can be measured fairly, and using discussions among reviewers, authors, and meta-reviewers as trusted labels for correctness. Their analysis finds that AI reviewers often lack correctness and generate hallucinations, though reasoning models perform relatively better. This matters because it provides a clearer signal for developing more accurate AI tools to assist or replace human peer review.

Core claim

CoCoReviewBench curates 3,900 papers from ICLR and NeurIPS, builds category-specific subsets, skips evaluation wherever human reviews are missing so that completeness is measured fairly, and uses reviewer-author-meta-review discussions, filtered for reliability, as expert annotations of correctness. On this benchmark, AI reviewers prove limited in correctness and prone to hallucinations, while reasoning models are more effective.

What carries the argument

The CoCoReviewBench dataset and evaluation protocol, which strengthens correctness via expert discussion annotations and completeness via selective skipping of incomplete cases.
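
A minimal sketch of the per-category skip rule described above, assuming per-paper lists of atomic opinions tagged with a category and a specific subject. The field names (`human_opinions`, `category`, `subject`) and the coverage test are illustrative assumptions, not the released CoCoReviewBench schema.

```python
# Illustrative sketch of the skip rule; field names are assumptions,
# not the released CoCoReviewBench schema.
from collections import defaultdict

def completeness_by_category(papers, ai_reviews, categories):
    """Score AI coverage per category, but only on papers whose human
    reviews actually contain opinions in that category."""
    scores = defaultdict(list)
    for paper in papers:
        ai_subjects = {op["subject"] for op in ai_reviews[paper["id"]]}
        for cat in categories:
            gold = [op for op in paper["human_opinions"] if op["category"] == cat]
            if not gold:
                continue  # no human reference here: skip, do not penalize the AI
            covered = sum(1 for op in gold if op["subject"] in ai_subjects)
            scores[cat].append(covered / len(gold))
    return {cat: sum(v) / len(v) for cat, v in scores.items()}
```

The design choice the paper argues for is visible in the `continue`: a paper with no human opinions in a category contributes nothing to that category's score, so incompleteness of the human reference never masquerades as an AI failure.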

If this is right

  • AI reviewers exhibit significant limitations in providing correct assessments and frequently hallucinate content.
  • Reasoning models outperform other AI reviewers in this benchmark.
  • The benchmark enables more reliable and fine-grained evaluation of AI reviewer capabilities.
  • Further research is needed to improve AI reviewers based on these findings.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Adopting this benchmark could lead to better AI integration in academic publishing workflows.
  • Similar methods might apply to evaluating AI in other expert judgment tasks like legal or medical review.
  • Testing the benchmark on newer models or different datasets would validate its robustness.
  • Addressing hallucinations through reasoning enhancements could be a key direction for AI reviewer development.

Load-bearing premise

Reviewer-author-meta-review discussions serve as reliable, unbiased expert annotations of review correctness, and skipping papers that lack human reviews improves the completeness evaluation without introducing selection bias.

What would settle it

If independent human experts rate the correctness of AI-generated reviews on benchmark papers and find no correlation with the benchmark's correctness scores, or if reasoning models do not show superior performance in that verification.
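
A minimal sketch of that verification test, assuming paired scores per AI-generated review; the rank-correlation choice and the significance criterion are illustrative, not specified by the paper.

```python
# Sketch of the falsification test: correlate independent human
# correctness ratings with the benchmark's correctness scores.
from scipy.stats import spearmanr

def verification_test(benchmark_scores, human_ratings, alpha=0.05):
    """benchmark_scores, human_ratings: parallel lists, one entry per
    AI-generated review rated by both the benchmark and human experts."""
    rho, p_value = spearmanr(benchmark_scores, human_ratings)
    # No significant positive correlation would undermine the benchmark's
    # correctness axis; a clear positive correlation would support it.
    return {"rho": rho, "p": p_value, "supported": p_value < alpha and rho > 0}
```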

Figures

Figures reproduced from arXiv: 2605.07905 by Dehao Huang, Derek F. Wong, Hexuan Deng, Min Zhang, Ruina Hu, Xiaopeng Ke, Xuebo Liu, Yichen Li, Yue Wang.

Figure 1
Figure 1: Overview of CoCoReviewBench. To tackle incompleteness, we build per-category subsets which contain only papers with reviews in that category. To address incorrectness, we merge discussions about the same point across the reviewers and authors. When explicit conflicts exist, we use the meta-review as a high-level adjudication signal for the final decision. view at source ↗
Figure 2
Figure 2: Subcategory coverage distribution of human reviewers. The x-axis shows the number of subcategories covered by one / all reviewers of each paper, and the y-axis shows their probability. view at source ↗
Figure 3
Figure 3: Inconsistency cases in peer reviews. We show cases of conflict both between reviewers and between reviewers and authors, with the meta-review deciding which opinion is correct. view at source ↗
Figure 4
Figure 4: Workflow of CoCoReviewBench construction. In §4.2.1, we match each review with the corresponding author response, split it into atomic opinions, and assign labels. These results are further processed in §4.2.2: Steps 4 and 5 detect inter-review conflicts, and Step 6 detects review-author conflicts, yielding the final structured data. This structured data serves as the reference for comparison with AI revie… view at source ↗
Figure 5
Figure 5: Category-wise breakdown of the results in … view at source ↗
Figure 6
Figure 6: Subcategory distribution of reviewer incorrectness. Left: inter-reviewer conflicts. Right: reviewer–author conflicts. Colors indicate the corresponding major categories. view at source ↗
Figure 7
Figure 7: Categorical metric scores across major review categories in CoCoReviewBench. Panels (a–f) correspond to correctness, thoroughness, grounding, verifiability, clarity, and the average score. The x-axis lists the major review categories. Within each panel, categories are ordered by the AI–Human score gap. Colors indicate the corresponding major review categories, with blue corresponding to Quality, orange to … view at source ↗
Figure 8
Figure 8: Three representative cases. Left: Gemini-3-Pro criticizes the caption for mentioning a “z-axis” and questions consistency with “2D” plots, despite not seeing the figure. Middle: Qwen3-32B asserts that the qualitative analysis in “…” view at source ↗
Figure 9
Figure 9: Cases of reviewer-author conflicts. Left: Misunderstanding case. Middle: Missed details case. Right: Inaccurate prior assessment case. view at source ↗
Figure 10
Figure 10: Cases of bias in human annotation under negative meta-review. view at source ↗
read the original abstract

Despite the rapid development of AI reviewers, evaluating such systems remains challenging: metrics favor overlap with human reviews over correctness. However, since human reviews often cover only a subset of salient issues and sometimes contain mistakes, they are unreliable as gold references. To address this, we build category-specific benchmark subsets and skip evaluation when the corresponding human reviews are missing to strengthen Completeness. We also leverage reviewer–author–meta-review discussions as expert annotations and filter unreliable reviews accordingly to strengthen Correctness. Finally, we introduce CoCoReviewBench, which curates 3,900 papers from ICLR and NeurIPS to enable reliable and fine-grained evaluation of AI reviewers. Analysis shows that AI reviewers remain limited in correctness and are prone to hallucinations, and highlights reasoning models as more effective reviewers, motivating further directions for improving AI reviewers. Benchmarks and models are available at https://github.com/hexuandeng/CoCoReviewBench.
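
As a rough illustration of the construction the abstract describes (and Figure 4 depicts), the sketch below groups atomic opinions by subject and defers to the meta-review on explicit conflicts. The data structures are assumptions for illustration, not the released annotation format.

```python
# Illustrative sketch of discussion merging and meta-review adjudication;
# the Opinion structure and meta_supports mapping are assumptions.
from dataclasses import dataclass

@dataclass
class Opinion:
    reviewer: str
    subject: str      # most specific core subject of the opinion
    sentiment: int    # +1 praise, -1 criticism

def adjudicate(opinions, meta_supports):
    """Group opinions by subject; on conflicting sentiment, keep the side
    the meta-review supports (meta_supports: subject -> +1 or -1)."""
    merged = {}
    for op in opinions:
        merged.setdefault(op.subject, []).append(op)
    labels = {}
    for subject, ops in merged.items():
        sentiments = {op.sentiment for op in ops}
        if len(sentiments) == 1:          # reviewers agree
            labels[subject] = ops[0].sentiment
        elif subject in meta_supports:    # explicit conflict: defer to meta-review
            labels[subject] = meta_supports[subject]
        # unresolved conflicts are left unlabeled (filtered out)
    return labels
```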

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The paper introduces CoCoReviewBench, a benchmark curating 3,900 papers from ICLR and NeurIPS. It constructs category-specific subsets and skips evaluation when human reviews are missing to strengthen completeness, while using reviewer-author-meta-review discussions as expert annotations (with filtering of unreliable reviews) to strengthen correctness. The resulting analysis concludes that AI reviewers remain limited in correctness and prone to hallucinations, and that reasoning models are more effective reviewers.

Significance. If the curation rules, filtering criteria, and label reliability hold, the benchmark would provide a valuable alternative to overlap-based metrics for AI reviewer evaluation by emphasizing completeness and correctness. The open release of the benchmark and models supports reproducibility. The work highlights practical limitations in current AI reviewing systems and suggests directions for improvement.

major comments (3)
  1. [Benchmark construction] Benchmark construction section: Treating reviewer-author-meta-review threads as ground-truth expert annotations for correctness lacks any described independent expert re-annotation study or inter-rater reliability assessment against the extracted labels. This is load-bearing for the central claim, because if the threads embed author bias, incompleteness, or meta-review summarization artifacts, the reported limitations on AI correctness become difficult to interpret as objective.
  2. [Subset construction and filtering] Subset construction and filtering section: The paper does not provide the exact rules for building category-specific subsets, the quantitative filtering criteria for 'unreliable' reviews, or the precise definitions of the completeness and correctness metrics (a hedged formalization is sketched after this report). Without these, the support for 'reliable and fine-grained evaluation' cannot be fully verified and reproducibility is limited.
  3. [Analysis] Analysis section: The decision to skip evaluation when human reviews are missing is presented as strengthening completeness, but no analysis of potential selection bias (e.g., systematic differences in papers with vs. without reviews) is reported. This assumption is load-bearing for the completeness axis.
minor comments (2)
  1. [Abstract] The abstract states the total of 3,900 papers but does not break down the distribution across ICLR/NeurIPS or review categories; adding this would improve clarity.
  2. [Throughout] Notation for the two axes (Completeness and Correctness) should be consistently capitalized or defined on first use to avoid minor ambiguity.
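
Since major comment 2 asks for precise metric definitions, one hedged formalization consistent with the abstract's description is sketched below; the paper's actual definitions are not given in the text above and may differ.

```latex
% One hedged formalization, for illustration only.
% G_c(p): gold (discussion-adjudicated) opinions for paper p in category c;
% A(p): the AI review's atomic opinions; P_c: papers with G_c(p) nonempty.
\[
\mathrm{Completeness}_c \;=\;
  \frac{1}{|P_c|} \sum_{p \in P_c}
  \frac{\bigl|\{\, g \in G_c(p) : \exists\, a \in A(p)\ \text{covering}\ g \,\}\bigr|}
       {|G_c(p)|}
\]
\[
\mathrm{Correctness} \;=\;
  \frac{1}{|P|} \sum_{p \in P}
  \frac{\bigl|\{\, a \in A(p) : a\ \text{consistent with the adjudicated labels} \,\}\bigr|}
       {|A(p)|}
\]
% Restricting the sum to P_c encodes the skip rule: categories with no
% human reference are excluded rather than scored as misses.
```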

Simulated Author's Rebuttal

3 responses · 1 unresolved

We thank the referee for the thoughtful and constructive comments, which help us improve the clarity and rigor of the manuscript. We address each major comment below and indicate the revisions we will make.

read point-by-point responses
  1. Referee: [Benchmark construction] Benchmark construction section: Treating reviewer-author-meta-review threads as ground-truth expert annotations for correctness lacks any described independent expert re-annotation study or inter-rater reliability assessment against the extracted labels. This is load-bearing for the central claim, because if the threads embed author bias, incompleteness, or meta-review summarization artifacts, the reported limitations on AI correctness become difficult to interpret as objective.

    Authors: We agree that the threads are not independently re-annotated and that this is a substantive limitation. The threads do capture multi-expert input (reviewers, authors responding to reviews, and meta-reviewers synthesizing), and we apply explicit filtering to discard low-consistency or low-quality entries. However, we cannot rule out residual author bias or summarization artifacts. In the revision we will (1) expand the description of the filtering criteria with concrete examples, (2) add a dedicated limitations paragraph that explicitly discusses these potential biases and how they could affect the reported correctness numbers, and (3) qualify the central claims accordingly. We will not be able to conduct a new large-scale independent re-annotation study within the scope of this work. revision: partial

  2. Referee: [Subset construction and filtering] Subset construction and filtering section: The paper does not provide the exact rules for building category-specific subsets, the quantitative filtering criteria for 'unreliable' reviews, or the precise definitions of the completeness and correctness metrics. Without these, the support for 'reliable and fine-grained evaluation' cannot be fully verified and reproducibility is limited.

    Authors: We apologize for the omission. The revised manuscript will include: (a) the precise selection rules and category definitions used to build the subsets, (b) the quantitative thresholds and heuristics applied to filter unreliable reviews (including the exact consistency and quality metrics), and (c) formal definitions of the completeness and correctness metrics, including how they are computed from the annotations. These additions will be placed in the main text or a dedicated appendix to ensure full reproducibility. revision: yes

  3. Referee: [Analysis] Analysis section: The decision to skip evaluation when human reviews are missing is presented as strengthening completeness, but no analysis of potential selection bias (e.g., systematic differences in papers with vs. without reviews) is reported. This assumption is load-bearing for the completeness axis.

    Authors: We accept that an explicit bias check is needed. In the revision we will add a short analysis comparing papers that have human reviews versus those that do not, along dimensions such as acceptance rate, paper length, and topical distribution. We will report any statistically significant differences and discuss their implications for the completeness evaluation. If no material differences are observed, this will support the current design; otherwise we will qualify the completeness results. revision: yes
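
A minimal sketch of the covariate comparison promised in response 3, assuming a paper-level table with `has_review`, `accepted`, `n_pages`, and `topic` columns; these column names and the choice of tests are illustrative.

```python
# Sketch of a selection-bias check: compare papers with vs. without
# human reviews on a few observable covariates. Names are assumptions.
import pandas as pd
from scipy.stats import ttest_ind, chi2_contingency

def selection_bias_report(df: pd.DataFrame) -> dict:
    """df needs a boolean 'has_review' column plus 'accepted' (bool),
    'n_pages' (int), and 'topic' (categorical)."""
    with_r, without_r = df[df.has_review], df[~df.has_review]
    results = {}
    # continuous covariate: paper length
    results["length_ttest_p"] = ttest_ind(
        with_r.n_pages, without_r.n_pages, equal_var=False).pvalue
    # binary covariate: acceptance rate
    acc_table = pd.crosstab(df.has_review, df.accepted)
    results["acceptance_chi2_p"] = chi2_contingency(acc_table)[1]
    # categorical covariate: topical distribution
    topic_table = pd.crosstab(df.has_review, df.topic)
    results["topic_chi2_p"] = chi2_contingency(topic_table)[1]
    return results
```

Small p-values on any covariate would indicate systematic differences between the two pools, which would need to be reported alongside the completeness results.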

standing simulated objections not resolved
  • Independent expert re-annotation study or inter-rater reliability assessment for the ground-truth labels derived from reviewer-author-meta-review threads

Circularity Check

0 steps flagged

No significant circularity in benchmark curation

full rationale

The paper constructs CoCoReviewBench by curating 3,900 papers from ICLR and NeurIPS and extracting annotations from existing reviewer-author-meta-review threads, with explicit subsetting and filtering rules to address completeness and correctness. No equations, fitted parameters, predictions, or first-principles derivations appear in the provided text. The central claims rest on external public data sources rather than any self-referential reduction or self-citation chain that would make the result equivalent to its inputs by construction.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The central claim rests on two domain assumptions about data quality, plus an implicit assumption about the benefit of selective evaluation that is not counted as an axiom; no free parameters or invented entities are described in the abstract.

axioms (2)
  • domain assumption Human reviews are often incomplete or contain mistakes and are therefore unreliable as sole gold references.
    Explicitly stated as the motivation for the new benchmark design.
  • domain assumption Reviewer-author-meta-review discussions constitute reliable expert annotations for correctness after filtering.
    Used directly to strengthen the correctness axis of the benchmark.

pith-pipeline@v0.9.0 · 5482 in / 1340 out tokens · 73565 ms · 2026-05-11T03:35:25.842977+00:00 · methodology


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.
