pith. machine review for the scientific record.

arxiv: 2605.07905 · v1 · submitted 2026-05-08 · 💻 cs.CL · cs.AI

Recognition: 2 theorem links · Lean Theorem

CoCoReviewBench: A Completeness- and Correctness-Oriented Benchmark for AI Reviewers

Dehao Huang, Derek F. Wong, Hexuan Deng, Min Zhang, Ruina Hu, Xiaopeng Ke, Xuebo Liu, Yichen Li, Yue Wang

Pith reviewed 2026-05-11 03:35 UTC · model grok-4.3

classification 💻 cs.CL cs.AI
keywords AI reviewers · benchmark dataset · completeness · correctness · peer review · hallucinations · reasoning models · ICLR · NeurIPS

The pith

CoCoReviewBench evaluates AI reviewers using expert discussions to measure both completeness and correctness.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

This paper introduces a benchmark for better evaluating AI systems that review scientific papers. Current metrics compare AI reviews to human ones, but human reviews are incomplete and can contain errors, so they are unreliable gold standards. The authors address this by curating 3,900 papers from major conferences, skipping evaluation wherever the corresponding human reviews are missing so that completeness can be measured fairly, and using discussions among reviewers, authors, and meta-reviewers as trusted labels for correctness. Their analysis finds that AI reviewers often lack correctness and generate hallucinations, though reasoning models perform relatively better. This matters because it provides a clearer signal for developing more accurate AI tools to assist or replace human peer review.

Core claim

CoCoReviewBench curates 3,900 papers from ICLR and NeurIPS, builds category-specific subsets, skips evaluation wherever human reviews are missing so that completeness is measured fairly, and uses reviewer-author-meta-review discussions, filtered for reliability, as expert annotations of correctness. On this benchmark, AI reviewers prove limited in correctness and prone to hallucinations, while reasoning models are more effective.

What carries the argument

The CoCoReviewBench dataset and evaluation protocol, which strengthens correctness via expert discussion annotations and completeness via selective skipping of incomplete cases.
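
A minimal sketch of the per-category skip rule described above, assuming per-paper lists of atomic opinions tagged with a category and a specific subject. The field names (`human_opinions`, `category`, `subject`) and the coverage test are illustrative assumptions, not the released CoCoReviewBench schema.

```python
# Illustrative sketch of the skip rule; field names are assumptions,
# not the released CoCoReviewBench schema.
from collections import defaultdict

def completeness_by_category(papers, ai_reviews, categories):
    """Score AI coverage per category, but only on papers whose human
    reviews actually contain opinions in that category."""
    scores = defaultdict(list)
    for paper in papers:
        ai_subjects = {op["subject"] for op in ai_reviews[paper["id"]]}
        for cat in categories:
            gold = [op for op in paper["human_opinions"] if op["category"] == cat]
            if not gold:
                continue  # no human reference here: skip, do not penalize the AI
            covered = sum(1 for op in gold if op["subject"] in ai_subjects)
            scores[cat].append(covered / len(gold))
    return {cat: sum(v) / len(v) for cat, v in scores.items()}
```

The design choice the paper argues for is visible in the `continue`: a paper with no human opinions in a category contributes nothing to that category's score, so incompleteness of the human reference never masquerades as an AI failure.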

If this is right

  • AI reviewers exhibit significant limitations in providing correct assessments and frequently hallucinate content.
  • Reasoning models outperform other AI reviewers in this benchmark.
  • The benchmark enables more reliable and fine-grained evaluation of AI reviewer capabilities.
  • Further research is needed to improve AI reviewers based on these findings.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Adopting this benchmark could lead to better AI integration in academic publishing workflows.
  • Similar methods might apply to evaluating AI in other expert judgment tasks like legal or medical review.
  • Testing the benchmark on newer models or different datasets would validate its robustness.
  • Addressing hallucinations through reasoning enhancements could be a key direction for AI reviewer development.

Load-bearing premise

Reviewer-author-meta-review discussions serve as reliable, unbiased expert annotations of review correctness, and skipping papers that lack human reviews improves the completeness evaluation without introducing selection bias.

What would settle it

If independent human experts rate the correctness of AI-generated reviews on benchmark papers and find no correlation with the benchmark's correctness scores, or if reasoning models do not show superior performance in that verification.
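
A minimal sketch of that verification test, assuming paired scores per AI-generated review; the rank-correlation choice and the significance criterion are illustrative, not specified by the paper.

```python
# Sketch of the falsification test: correlate independent human
# correctness ratings with the benchmark's correctness scores.
from scipy.stats import spearmanr

def verification_test(benchmark_scores, human_ratings, alpha=0.05):
    """benchmark_scores, human_ratings: parallel lists, one entry per
    AI-generated review rated by both the benchmark and human experts."""
    rho, p_value = spearmanr(benchmark_scores, human_ratings)
    # No significant positive correlation would undermine the benchmark's
    # correctness axis; a clear positive correlation would support it.
    return {"rho": rho, "p": p_value, "supported": p_value < alpha and rho > 0}
```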

Figures

Figures reproduced from arXiv: 2605.07905 by Dehao Huang, Derek F. Wong, Hexuan Deng, Min Zhang, Ruina Hu, Xiaopeng Ke, Xuebo Liu, Yichen Li, Yue Wang.

Figure 1
Figure 1: Overview of CoCoReviewBench. To tackle incompleteness, we build per-category subsets which contain only papers with reviews in that category. To address incorrectness, we merge discussions about the same point across the reviewers and authors. When explicit conflicts exist, we use the meta-review as a high-level adjudication signal for the final decision. view at source ↗
Figure 2
Figure 2: Subcategory coverage distribution of human reviewers. The x-axis shows the number of subcategories covered by one / all reviewers of each paper, and the y-axis shows their probability. view at source ↗
Figure 3
Figure 3: Inconsistency cases in peer reviews. We show cases of conflict both between reviewers and between reviewers and authors, with the meta-review deciding which opinion is correct. view at source ↗
Figure 4
Figure 4: Workflow of CoCoReviewBench construction. In §4.2.1, we match each review with the corresponding author response, split it into atomic opinions, and assign labels. These results are further processed in §4.2.2: Steps 4 and 5 detect inter-review conflicts, and Step 6 detects review-author conflicts, yielding the final structured data. This structured data serves as the reference for comparison with AI revie… view at source ↗
Figure 5
Figure 5: Category-wise breakdown of the results in … view at source ↗
Figure 6
Figure 6: Subcategory distribution of reviewer incorrectness. Left: inter-reviewer conflicts. Right: reviewer–author conflicts. Colors indicate the corresponding major categories. view at source ↗
Figure 7
Figure 7: Categorical metric scores across major review categories in CoCoReviewBench. Panels (a–f) correspond to correctness, thoroughness, grounding, verifiability, clarity, and the average score. The x-axis lists the major review categories. Within each panel, categories are ordered by the AI–Human score gap. Colors indicate the corresponding major review categories, with blue corresponding to Quality, orange to … view at source ↗
Figure 8
Figure 8: Three representative cases. Left: Gemini-3-Pro criticizes the caption for mentioning a “z-axis” and questions consistency with “2D” plots, despite not seeing the figure. Middle: Qwen3-32B asserts that the qualitative analysis in “…” view at source ↗
Figure 9
Figure 9: Cases of reviewer-author conflicts. Left: Misunderstanding case. Middle: Missed details case. Right: Inaccurate prior assessment case. view at source ↗
Figure 10
Figure 10: Cases of bias in human annotation under negative meta-review. view at source ↗
read the original abstract

Despite the rapid development of AI reviewers, evaluating such systems remains challenging: metrics favor overlap with human reviews over correctness. However, since human reviews often cover only a subset of salient issues and sometimes contain mistakes, they are unreliable as gold references. To address this, we build category-specific benchmark subsets and skip evaluation when the corresponding human reviews are missing to strengthen Completeness. We also leverage reviewer–author–meta-review discussions as expert annotations and filter unreliable reviews accordingly to strengthen Correctness. Finally, we introduce CoCoReviewBench, which curates 3,900 papers from ICLR and NeurIPS to enable reliable and fine-grained evaluation of AI reviewers. Analysis shows that AI reviewers remain limited in correctness and are prone to hallucinations, and highlights reasoning models as more effective reviewers, motivating further directions for improving AI reviewers. Benchmarks and models are available at https://github.com/hexuandeng/CoCoReviewBench.
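
As a rough illustration of the construction the abstract describes (and Figure 4 depicts), the sketch below groups atomic opinions by subject and defers to the meta-review on explicit conflicts. The data structures are assumptions for illustration, not the released annotation format.

```python
# Illustrative sketch of discussion merging and meta-review adjudication;
# the Opinion structure and meta_supports mapping are assumptions.
from dataclasses import dataclass

@dataclass
class Opinion:
    reviewer: str
    subject: str      # most specific core subject of the opinion
    sentiment: int    # +1 praise, -1 criticism

def adjudicate(opinions, meta_supports):
    """Group opinions by subject; on conflicting sentiment, keep the side
    the meta-review supports (meta_supports: subject -> +1 or -1)."""
    merged = {}
    for op in opinions:
        merged.setdefault(op.subject, []).append(op)
    labels = {}
    for subject, ops in merged.items():
        sentiments = {op.sentiment for op in ops}
        if len(sentiments) == 1:          # reviewers agree
            labels[subject] = ops[0].sentiment
        elif subject in meta_supports:    # explicit conflict: defer to meta-review
            labels[subject] = meta_supports[subject]
        # unresolved conflicts are left unlabeled (filtered out)
    return labels
```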

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The paper introduces CoCoReviewBench, a benchmark curating 3,900 papers from ICLR and NeurIPS. It constructs category-specific subsets and skips evaluation when human reviews are missing to strengthen completeness, while using reviewer-author-meta-review discussions as expert annotations (with filtering of unreliable reviews) to strengthen correctness. The resulting analysis concludes that AI reviewers remain limited in correctness and prone to hallucinations, and that reasoning models are more effective reviewers.

Significance. If the curation rules, filtering criteria, and label reliability hold, the benchmark would provide a valuable alternative to overlap-based metrics for AI reviewer evaluation by emphasizing completeness and correctness. The open release of the benchmark and models supports reproducibility. The work highlights practical limitations in current AI reviewing systems and suggests directions for improvement.

major comments (3)
  1. [Benchmark construction] Benchmark construction section: Treating reviewer-author-meta-review threads as ground-truth expert annotations for correctness lacks any described independent expert re-annotation study or inter-rater reliability assessment against the extracted labels. This is load-bearing for the central claim, because if the threads embed author bias, incompleteness, or meta-review summarization artifacts, the reported limitations on AI correctness become difficult to interpret as objective.
  2. [Subset construction and filtering] Subset construction and filtering section: The paper does not provide the exact rules for building category-specific subsets, the quantitative filtering criteria for 'unreliable' reviews, or the precise definitions of the completeness and correctness metrics (a hedged formalization is sketched after this report). Without these, the support for 'reliable and fine-grained evaluation' cannot be fully verified and reproducibility is limited.
  3. [Analysis] Analysis section: The decision to skip evaluation when human reviews are missing is presented as strengthening completeness, but no analysis of potential selection bias (e.g., systematic differences in papers with vs. without reviews) is reported. This assumption is load-bearing for the completeness axis.
minor comments (2)
  1. [Abstract] The abstract states the total of 3,900 papers but does not break down the distribution across ICLR/NeurIPS or review categories; adding this would improve clarity.
  2. [Throughout] Notation for the two axes (Completeness and Correctness) should be consistently capitalized or defined on first use to avoid minor ambiguity.
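
Since major comment 2 asks for precise metric definitions, one hedged formalization consistent with the abstract's description is sketched below; the paper's actual definitions are not given in the text above and may differ.

```latex
% One hedged formalization, for illustration only.
% G_c(p): gold (discussion-adjudicated) opinions for paper p in category c;
% A(p): the AI review's atomic opinions; P_c: papers with G_c(p) nonempty.
\[
\mathrm{Completeness}_c \;=\;
  \frac{1}{|P_c|} \sum_{p \in P_c}
  \frac{\bigl|\{\, g \in G_c(p) : \exists\, a \in A(p)\ \text{covering}\ g \,\}\bigr|}
       {|G_c(p)|}
\]
\[
\mathrm{Correctness} \;=\;
  \frac{1}{|P|} \sum_{p \in P}
  \frac{\bigl|\{\, a \in A(p) : a\ \text{consistent with the adjudicated labels} \,\}\bigr|}
       {|A(p)|}
\]
% Restricting the sum to P_c encodes the skip rule: categories with no
% human reference are excluded rather than scored as misses.
```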

Simulated Author's Rebuttal

3 responses · 1 unresolved

We thank the referee for the thoughtful and constructive comments, which help us improve the clarity and rigor of the manuscript. We address each major comment below and indicate the revisions we will make.

read point-by-point responses
  1. Referee: [Benchmark construction] Benchmark construction section: Treating reviewer-author-meta-review threads as ground-truth expert annotations for correctness lacks any described independent expert re-annotation study or inter-rater reliability assessment against the extracted labels. This is load-bearing for the central claim, because if the threads embed author bias, incompleteness, or meta-review summarization artifacts, the reported limitations on AI correctness become difficult to interpret as objective.

    Authors: We agree that the threads are not independently re-annotated and that this is a substantive limitation. The threads do capture multi-expert input (reviewers, authors responding to reviews, and meta-reviewers synthesizing), and we apply explicit filtering to discard low-consistency or low-quality entries. However, we cannot rule out residual author bias or summarization artifacts. In the revision we will (1) expand the description of the filtering criteria with concrete examples, (2) add a dedicated limitations paragraph that explicitly discusses these potential biases and how they could affect the reported correctness numbers, and (3) qualify the central claims accordingly. We will not be able to conduct a new large-scale independent re-annotation study within the scope of this work. revision: partial

  2. Referee: [Subset construction and filtering] Subset construction and filtering section: The paper does not provide the exact rules for building category-specific subsets, the quantitative filtering criteria for 'unreliable' reviews, or the precise definitions of the completeness and correctness metrics. Without these, the support for 'reliable and fine-grained evaluation' cannot be fully verified and reproducibility is limited.

    Authors: We apologize for the omission. The revised manuscript will include: (a) the precise selection rules and category definitions used to build the subsets, (b) the quantitative thresholds and heuristics applied to filter unreliable reviews (including the exact consistency and quality metrics), and (c) formal definitions of the completeness and correctness metrics, including how they are computed from the annotations. These additions will be placed in the main text or a dedicated appendix to ensure full reproducibility. revision: yes

  3. Referee: [Analysis] Analysis section: The decision to skip evaluation when human reviews are missing is presented as strengthening completeness, but no analysis of potential selection bias (e.g., systematic differences in papers with vs. without reviews) is reported. This assumption is load-bearing for the completeness axis.

    Authors: We accept that an explicit bias check is needed. In the revision we will add a short analysis comparing papers that have human reviews versus those that do not, along dimensions such as acceptance rate, paper length, and topical distribution. We will report any statistically significant differences and discuss their implications for the completeness evaluation. If no material differences are observed, this will support the current design; otherwise we will qualify the completeness results. revision: yes
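
A minimal sketch of the covariate comparison promised in response 3, assuming a paper-level table with `has_review`, `accepted`, `n_pages`, and `topic` columns; these column names and the choice of tests are illustrative.

```python
# Sketch of a selection-bias check: compare papers with vs. without
# human reviews on a few observable covariates. Names are assumptions.
import pandas as pd
from scipy.stats import ttest_ind, chi2_contingency

def selection_bias_report(df: pd.DataFrame) -> dict:
    """df needs a boolean 'has_review' column plus 'accepted' (bool),
    'n_pages' (int), and 'topic' (categorical)."""
    with_r, without_r = df[df.has_review], df[~df.has_review]
    results = {}
    # continuous covariate: paper length
    results["length_ttest_p"] = ttest_ind(
        with_r.n_pages, without_r.n_pages, equal_var=False).pvalue
    # binary covariate: acceptance rate
    acc_table = pd.crosstab(df.has_review, df.accepted)
    results["acceptance_chi2_p"] = chi2_contingency(acc_table)[1]
    # categorical covariate: topical distribution
    topic_table = pd.crosstab(df.has_review, df.topic)
    results["topic_chi2_p"] = chi2_contingency(topic_table)[1]
    return results
```

Small p-values on any covariate would indicate systematic differences between the two pools, which would need to be reported alongside the completeness results.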

standing simulated objections not resolved
  • Independent expert re-annotation study or inter-rater reliability assessment for the ground-truth labels derived from reviewer-author-meta-review threads

Circularity Check

0 steps flagged

No significant circularity in benchmark curation

full rationale

The paper constructs CoCoReviewBench by curating 3,900 papers from ICLR and NeurIPS and extracting annotations from existing reviewer-author-meta-review threads, with explicit subsetting and filtering rules to address completeness and correctness. No equations, fitted parameters, predictions, or first-principles derivations appear in the provided text. The central claims rest on external public data sources rather than any self-referential reduction or self-citation chain that would make the result equivalent to its inputs by construction.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The central claim rests on two domain assumptions about data quality, plus an implicit assumption about the benefit of selective evaluation that is not counted as an axiom; no free parameters or invented entities are described in the abstract.

axioms (2)
  • domain assumption Human reviews are often incomplete or contain mistakes and are therefore unreliable as sole gold references.
    Explicitly stated as the motivation for the new benchmark design.
  • domain assumption Reviewer-author-meta-review discussions constitute reliable expert annotations for correctness after filtering.
    Used directly to strengthen the correctness axis of the benchmark.

pith-pipeline@v0.9.0 · 5482 in / 1340 out tokens · 73565 ms · 2026-05-11T03:35:25.842977+00:00 · methodology


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.
