Recognition: 2 Lean theorem links
CoCoReviewBench: A Completeness- and Correctness-Oriented Benchmark for AI Reviewers
Pith reviewed 2026-05-11 03:35 UTC · model grok-4.3
The pith
CoCoReviewBench evaluates AI reviewers using expert discussions to measure both completeness and correctness.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
CoCoReviewBench curates 3,900 papers from ICLR and NeurIPS, builds category-specific subsets, and skips evaluation where the corresponding human reviews are missing to strengthen completeness; it uses reviewer-author-meta-review discussions, filtered for reliability, as expert annotations to assess correctness. On this benchmark, AI reviewers prove limited in correctness and prone to hallucinations, and reasoning models prove more effective.
What carries the argument
The CoCoReviewBench dataset and evaluation protocol, which strengthen correctness via expert-discussion annotations and completeness via selectively skipping cases that lack human reviews.
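The quoted text does not define the metrics themselves (the referee report below flags this), so the following is only a minimal sketch of the protocol as described: skip papers whose human reviews are missing, then score an AI review against the discussion-derived expert points. The field names (`human_reviews`, `expert_points`, `category`) and the judge function are assumptions, not the authors' implementation.

```python
# A minimal sketch (not the authors' code) of the protocol described above:
# completeness -> skip papers whose human reviews are missing;
# correctness  -> score the AI review against expert points distilled from
# reviewer-author-meta-review discussions.
from statistics import mean

def evaluate_ai_reviewer(papers, generate_review, judge_alignment):
    """papers: dicts with assumed keys 'human_reviews', 'expert_points', 'category'.
    generate_review: paper -> AI-written review text.
    judge_alignment: (expert_point, ai_review) -> numeric correctness score."""
    per_category = {}
    for paper in papers:
        # Completeness axis: skip evaluation rather than penalize the AI reviewer
        # when no human review exists to serve as a reference.
        if not paper["human_reviews"]:
            continue
        ai_review = generate_review(paper)
        # Correctness axis: one score per expert-annotated discussion point.
        scores = [judge_alignment(point, ai_review) for point in paper["expert_points"]]
        if scores:
            per_category.setdefault(paper["category"], []).extend(scores)
    # Report category-level averages, mirroring the category-specific subsets.
    return {cat: mean(s) for cat, s in per_category.items()}
```

Skipping, rather than scoring, papers without human reviews is what keeps the completeness axis from penalizing AI reviewers for issues no human flagged.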
If this is right
- AI reviewers exhibit significant limitations in providing correct assessments and frequently hallucinate content.
- Reasoning models outperform other AI reviewers in this benchmark.
- The benchmark enables more reliable and fine-grained evaluation of AI reviewer capabilities.
- Further research is needed to improve AI reviewers based on these findings.
Where Pith is reading between the lines
- Adopting this benchmark could lead to better AI integration in academic publishing workflows.
- Similar methods might apply to evaluating AI in other expert judgment tasks like legal or medical review.
- Testing the benchmark on newer models or different datasets would validate its robustness.
- Addressing hallucinations through reasoning enhancements could be a key direction for AI reviewer development.
Load-bearing premise
Reviewer-author-meta-review discussions serve as reliable expert annotations for review correctness without bias, and skipping papers lacking human reviews improves completeness evaluation without introducing selection bias.
What would settle it
If independent human experts rate the correctness of AI-generated reviews on benchmark papers and find no correlation with the benchmark's correctness scores, or if reasoning models do not show superior performance in that verification.
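Concretely, that verification could be run as a rank correlation between independent expert ratings and the benchmark's per-paper correctness scores. A minimal sketch with hypothetical numbers:

```python
# Hedged sketch of the settling test: do independent expert ratings track the
# benchmark's correctness scores? All numbers below are hypothetical.
from scipy.stats import spearmanr

benchmark_correctness = [0.62, 0.71, 0.40, 0.55, 0.80]  # per-paper benchmark scores (assumed)
independent_ratings   = [3.0, 4.0, 2.0, 3.5, 4.5]       # blinded expert ratings, 1-5 (assumed)

rho, p_value = spearmanr(benchmark_correctness, independent_ratings)
print(f"Spearman rho = {rho:.2f}, p = {p_value:.3f}")
# A rho near zero (or negative) would undercut the benchmark's correctness axis;
# a clearly positive and significant rho would support it.
```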
read the original abstract
Despite the rapid development of AI reviewers, evaluating such systems remains challenging: metrics favor overlap with human reviews over correctness. However, since human reviews often cover only a subset of salient issues and sometimes contain mistakes, they are unreliable as gold references. To address this, we build category-specific benchmark subsets and skip evaluation when the corresponding human reviews are missing to strengthen Completeness. We also leverage reviewer–author–meta-review discussions as expert annotations and filter unreliable reviews accordingly to strengthen Correctness. Finally, we introduce CoCoReviewBench, which curates 3,900 papers from ICLR and NeurIPS to enable reliable and fine-grained evaluation of AI reviewers. Analysis shows that AI reviewers remain limited in correctness and are prone to hallucinations, and highlights reasoning models as more effective reviewers, motivating further directions for improving AI reviewers. Benchmarks and models are available at https://github.com/hexuandeng/CoCoReviewBench.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces CoCoReviewBench, a benchmark curating 3,900 papers from ICLR and NeurIPS. It constructs category-specific subsets and skips evaluation when human reviews are missing to strengthen completeness, while using reviewer-author-meta-review discussions as expert annotations (with filtering of unreliable reviews) to strengthen correctness. The resulting analysis concludes that AI reviewers remain limited in correctness, are prone to hallucinations, and that reasoning models are more effective reviewers.
Significance. If the curation rules, filtering criteria, and label reliability hold, the benchmark would provide a valuable alternative to overlap-based metrics for AI reviewer evaluation by emphasizing completeness and correctness. The open release of the benchmark and models supports reproducibility. The work highlights practical limitations in current AI reviewing systems and suggests directions for improvement.
major comments (3)
- [Benchmark construction] Benchmark construction section: Treating reviewer-author-meta-review threads as ground-truth expert annotations for correctness is not supported by any described independent expert re-annotation study or inter-rater reliability assessment against the extracted labels (see the agreement-check sketch after these comments). This is load-bearing for the central claim, because if the threads embed author bias, incompleteness, or meta-review summarization artifacts, the reported limitations on AI correctness become difficult to interpret as objective.
- [Subset construction and filtering] Subset construction and filtering section: The paper does not provide the exact rules for building category-specific subsets, the quantitative filtering criteria for 'unreliable' reviews, or the precise definitions of the completeness and correctness metrics. Without these, the support for 'reliable and fine-grained evaluation' cannot be fully verified and reproducibility is limited.
- [Analysis] Analysis section: The decision to skip evaluation when human reviews are missing is presented as strengthening completeness, but no analysis of potential selection bias (e.g., systematic differences in papers with vs. without reviews) is reported. This assumption is load-bearing for the completeness axis.
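The agreement check requested in the first major comment could be as simple as Cohen's kappa between the thread-derived labels and an independent expert's re-annotation of the same review points. A minimal sketch with hypothetical labels (not data from the paper):

```python
# Hedged sketch of the requested reliability check: agreement between labels
# derived from discussion threads and an independent expert's re-annotation
# of the same review points. Labels are hypothetical, not data from the paper.
from sklearn.metrics import cohen_kappa_score

thread_labels = ["correct", "incorrect", "correct", "correct", "incorrect", "correct"]
expert_labels = ["correct", "incorrect", "incorrect", "correct", "incorrect", "correct"]

kappa = cohen_kappa_score(thread_labels, expert_labels)
print(f"Cohen's kappa = {kappa:.2f}")
# Substantial agreement (roughly kappa > 0.6) would support treating the threads
# as ground truth; low agreement would support the referee's concern.
```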
minor comments (2)
- [Abstract] The abstract states the total of 3,900 papers but does not break down the distribution across ICLR/NeurIPS or review categories; adding this would improve clarity.
- [Throughout] Notation for the two axes (Completeness and Correctness) should be consistently capitalized or defined on first use to avoid minor ambiguity.
Simulated Author's Rebuttal
We thank the referee for the thoughtful and constructive comments, which help us improve the clarity and rigor of the manuscript. We address each major comment below and indicate the revisions we will make.
read point-by-point responses
- Referee: [Benchmark construction] Benchmark construction section: Treating reviewer-author-meta-review threads as ground-truth expert annotations for correctness is not supported by any described independent expert re-annotation study or inter-rater reliability assessment against the extracted labels. This is load-bearing for the central claim, because if the threads embed author bias, incompleteness, or meta-review summarization artifacts, the reported limitations on AI correctness become difficult to interpret as objective.
  Authors: We agree that the threads are not independently re-annotated and that this is a substantive limitation. The threads do capture multi-expert input (reviewers, authors responding to reviews, and meta-reviewers synthesizing), and we apply explicit filtering to discard low-consistency or low-quality entries. However, we cannot rule out residual author bias or summarization artifacts. In the revision we will (1) expand the description of the filtering criteria with concrete examples, (2) add a dedicated limitations paragraph that explicitly discusses these potential biases and how they could affect the reported correctness numbers, and (3) qualify the central claims accordingly. We will not be able to conduct a new large-scale independent re-annotation study within the scope of this work. (revision: partial)
- Referee: [Subset construction and filtering] Subset construction and filtering section: The paper does not provide the exact rules for building category-specific subsets, the quantitative filtering criteria for 'unreliable' reviews, or the precise definitions of the completeness and correctness metrics. Without these, the support for 'reliable and fine-grained evaluation' cannot be fully verified and reproducibility is limited.
  Authors: We apologize for the omission. The revised manuscript will include: (a) the precise selection rules and category definitions used to build the subsets, (b) the quantitative thresholds and heuristics applied to filter unreliable reviews (including the exact consistency and quality metrics), and (c) formal definitions of the completeness and correctness metrics, including how they are computed from the annotations. These additions will be placed in the main text or a dedicated appendix to ensure full reproducibility. (revision: yes)
- Referee: [Analysis] Analysis section: The decision to skip evaluation when human reviews are missing is presented as strengthening completeness, but no analysis of potential selection bias (e.g., systematic differences in papers with vs. without reviews) is reported. This assumption is load-bearing for the completeness axis.
  Authors: We accept that an explicit bias check is needed. In the revision we will add a short analysis comparing papers that have human reviews versus those that do not, along dimensions such as acceptance rate, paper length, and topical distribution (a sketch of such a check appears below). We will report any statistically significant differences and discuss their implications for the completeness evaluation. If no material differences are observed, this will support the current design; otherwise we will qualify the completeness results. (revision: yes)
- Remaining open item: an independent expert re-annotation study or inter-rater reliability assessment for the ground-truth labels derived from reviewer-author-meta-review threads.
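The selection-bias analysis promised in the third response could take roughly the following form, assuming hypothetical per-paper records with `accepted` and `num_pages` fields; this is a sketch, not the authors' planned analysis.

```python
# Hedged sketch of the proposed selection-bias check: compare papers that have
# human reviews with papers that do not. Records and field names are assumed.
from scipy.stats import chi2_contingency, ttest_ind

def selection_bias_check(with_reviews, without_reviews):
    """Each argument: list of dicts with assumed keys 'accepted' (bool) and 'num_pages' (int)."""
    # Acceptance rate: chi-square test on the 2x2 accepted/rejected table.
    table = [
        [sum(p["accepted"] for p in with_reviews),
         sum(not p["accepted"] for p in with_reviews)],
        [sum(p["accepted"] for p in without_reviews),
         sum(not p["accepted"] for p in without_reviews)],
    ]
    _, p_accept, _, _ = chi2_contingency(table)

    # Paper length: Welch's t-test on page counts.
    _, p_length = ttest_ind(
        [p["num_pages"] for p in with_reviews],
        [p["num_pages"] for p in without_reviews],
        equal_var=False,
    )
    # Small p-values would indicate systematic differences and hence possible
    # selection bias in the completeness evaluation.
    return {"acceptance_rate_p": p_accept, "paper_length_p": p_length}
```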
Circularity Check
No significant circularity in benchmark curation
full rationale
The paper constructs CoCoReviewBench by curating 3,900 papers from ICLR and NeurIPS and extracting annotations from existing reviewer-author-meta-review threads, with explicit subsetting and filtering rules to address completeness and correctness. No equations, fitted parameters, predictions, or first-principles derivations appear in the provided text. The central claims rest on external public data sources rather than any self-referential reduction or self-citation chain that would make the result equivalent to its inputs by construction.
Axiom & Free-Parameter Ledger
axioms (2)
- [domain assumption] Human reviews are often incomplete or contain mistakes and are therefore unreliable as sole gold references.
- [domain assumption] Reviewer-author-meta-review discussions constitute reliable expert annotations for correctness after filtering.
Lean theorems connected to this paper
- IndisputableMonolith/Foundation/RealityFromDistinction.lean · reality_from_one_distinction · relevance: unclear
  Matched text: "We also leverage reviewer–author–meta-review discussions as expert annotations and filter unreliable reviews accordingly to strengthen Correctness... construct a benchmark of 3,900 papers... category-level evaluation"
- IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · relevance: unclear
  Matched text: "We build category-specific benchmark subsets and skip evaluation when the corresponding human reviews are missing to strengthen Completeness"