RottenReviews: Benchmarking Review Quality with Human and LLM-Based Judgments
7 Pith papers cite this work. Polarity classification is still indexing.
Years: 2026 (7)
Verdicts: UNVERDICTED (7)
Roles: method (1)
Polarities: use method (1)
Representative citing papers
PeerPrism benchmark demonstrates that state-of-the-art LLM detectors conflate surface text style with intellectual contribution and fail on hybrid human-AI peer reviews.
PRISM benchmark finds LLMs match or exceed humans on isolated review dimensions like novelty verification but none achieve the balanced performance of human reviewers across depth, flaw prioritization, and constructiveness.
RankElastor mitigates embedding collapse via spectrum-robust token mixing and GLU-based P-FFNs, yielding better performance and scaling on industrial recommendation datasets.
SID-Coord coordinates semantic IDs with hashed item IDs via attention fusion, adaptive gating, and interest alignment, yielding +0.664% long-play rate and +0.369% playback duration gains in production search ranking.
Taiji presents an LLM-as-Enhancer system with reverse-engineered CoT data generation and Pareto Optimal Policy Optimization (POPO) to trade off semantic and ID rewards, deployed at Kuaishou serving 400M daily users.
Peerispect extracts claims from peer reviews, retrieves evidence from the manuscript, and verifies them via NLI in a modular pipeline with a visual interface.
citing papers explorer
- CoCoReviewBench: A Completeness- and Correctness-Oriented Benchmark for AI Reviewers
CoCoReviewBench curates 3,900 ICLR and NeurIPS papers into category-specific subsets with discussion-based annotations to evaluate AI reviewers on completeness and correctness rather than human review overlap.