pith. sign in

arxiv: 2605.27204 · v1 · pith:MHLRHKFEnew · submitted 2026-05-26 · 💻 cs.CL · cs.IR

GraphReview: Scientific Paper Evaluation via LLM-Based Graph Message Passing

Pith reviewed 2026-06-29 18:24 UTC · model grok-4.3

classification 💻 cs.CL cs.IR
keywords LLMgraph message passingpaper evaluationpeer reviewPersonalized PageRankscientific publishingquality assessment
0
0 comments X

The pith

GraphReview evaluates scientific papers by passing LLM review signals across a graph of related works.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper tries to establish that modeling paper evaluation as message passing on a semantic graph of papers leads to better performance than handling assessments in isolation. It uses LLMs to initialize quality at each paper node and to create comparison evidence on edges between papers. Personalized PageRank then spreads these signals to produce rankings, acceptance decisions, and text reviews. This matters because it provides a mechanism to relate a manuscript to both current and past work in a single framework. Results show large gains over baselines and generalization to new settings.

Core claim

GraphReview formulates paper evaluation as review-signal message passing over a semantic paper graph that jointly captures intrinsic quality, synchronic links among contemporaneous papers, and diachronic links to prior work. LLMs estimate node-level quality priors and generate edge-level comparative evidence through pairwise comparisons. Personalized PageRank then integrates these signals for quality ranking, decision prediction, and review generation. Reward-induced maximum likelihood objectives are used to train the LLM backbones for higher-quality graph evidence.

What carries the argument

The semantic paper graph where LLMs supply node quality priors and edge comparative evidence, propagated by Personalized PageRank.

If this is right

  • Outperforms the strongest baseline with average improvements of 29.7% on decision and ranking metrics.
  • Achieves specific gains of 23.7% in Accuracy and 57.6% in Spearman's ρ.
  • Generates higher-quality review texts.
  • Generalizes effectively across time periods and conference venues.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Automating relational evaluation this way might help scale peer review processes without losing context.
  • Connecting papers in graphs could reveal patterns in how quality signals spread in research fields.
  • Testing on papers from emerging fields might show if the method adapts when literature connections are sparse.

Load-bearing premise

LLM-generated node priors and pairwise comparative evidence on the edges accurately reflect true paper quality and relationships.

What would settle it

Human expert evaluations on a held-out set of papers that show no correlation or even negative correlation with the GraphReview rankings and decisions.

Figures

Figures reproduced from arXiv: 2605.27204 by Guoxiu He, Jiacheng Yao, Pujun Zheng, Star X. Zhao, Wanying Ren.

Figure 1
Figure 1. Figure 1: Previous LLM-based methods consider infor [PITH_FULL_IMAGE:figures/full_fig_p001_1.png] view at source ↗
Figure 2
Figure 2. Figure 2: The overall process of our method. It includes message (left), aggregation (center), and update (right). [PITH_FULL_IMAGE:figures/full_fig_p005_2.png] view at source ↗
Figure 4
Figure 4. Figure 4: Generalization results. The y-axis shows nor [PITH_FULL_IMAGE:figures/full_fig_p008_4.png] view at source ↗
Figure 3
Figure 3. Figure 3: Hyperparameter analysis, including the con [PITH_FULL_IMAGE:figures/full_fig_p008_3.png] view at source ↗
Figure 5
Figure 5. Figure 5: Pipeline of dataset construction (left) and training process (right). [PITH_FULL_IMAGE:figures/full_fig_p016_5.png] view at source ↗
read the original abstract

Scientific paper evaluation often involves not only assessing a manuscript itself, but also relating it to contemporaneous research and prior literature. However, existing LLM-based methods typically model these signals separately and lack a unified mechanism for propagating review evidence across papers. We propose $\textbf{GraphReview}$, a graph-based LLM framework that formulates paper evaluation as review-signal message passing over a semantic paper graph. The graph jointly captures intrinsic quality, synchronic links among contemporaneous papers, and diachronic links to prior work. LLMs are used to estimate node-level quality priors and generate edge-level comparative evidence through pairwise paper comparisons, while Personalized PageRank integrates review signals for quality ranking, decision prediction, and review generation. To produce higher-quality graph evidence, we propose reward-induced maximum likelihood objectives for training the LLM backbones. Experiments show that GraphReview consistently outperforms the strongest baseline, achieving average improvements of 29.7% on decision and ranking metrics, including gains of 23.7% in Accuracy and 57.6% in Spearman's $\rho$. It also produces higher-quality review texts and generalizes effectively across time periods and conference venues. The code is available at https://github.com/ECNU-Text-Computing/GraphReview.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The manuscript proposes GraphReview, a graph-based LLM framework that formulates scientific paper evaluation as review-signal message passing over a semantic paper graph capturing intrinsic quality (node priors), synchronic links among contemporaneous papers, and diachronic links to prior work (edge comparisons). LLMs generate node-level quality priors and pairwise comparative evidence; Personalized PageRank integrates these signals for quality ranking, decision prediction, and review generation. Reward-induced maximum likelihood objectives are introduced to train the LLM backbones. Experiments report consistent outperformance over baselines with average improvements of 29.7% on decision and ranking metrics (including 23.7% Accuracy and 57.6% Spearman's ρ), higher-quality review texts, and effective generalization across time periods and venues. Code is released at the provided GitHub link.

Significance. If the results hold after addressing verification gaps, the work provides a unified mechanism for propagating relational review evidence in LLM-based evaluation, potentially improving consistency over isolated per-paper assessments. The explicit code release is a clear strength supporting reproducibility.

major comments (3)
  1. [§4 Experiments] §4 Experiments: The headline gains (29.7% average improvement, 23.7% Accuracy, 57.6% Spearman's ρ) are reported without details on dataset construction, baseline implementations, statistical significance tests, or controls for LLM variability; these omissions are load-bearing because the central claim attributes gains to the graph message-passing mechanism rather than prompting artifacts.
  2. [§3 Method] §3 Method (LLM node priors and edge generation): No human agreement rates, ablation removing the graph propagation step, or independent validation of the LLM-generated priors/edges is provided; without this, the weakest assumption—that LLM outputs accurately reflect true quality and relationships—cannot be falsified, leaving open the possibility that PageRank merely propagates noisy or biased signals.
  3. [Abstract and §3.2] Abstract and §3.2 (reward-induced MLE objectives): The training of LLM backbones via reward-induced maximum likelihood risks circularity if rewards derive from the same evaluation signals used in downstream ranking/decision tasks, which could inflate the reported improvements without an explicit control experiment.
minor comments (2)
  1. [§3.1] Notation for synchronic vs. diachronic edges could be clarified with explicit definitions in the graph construction subsection to aid reproducibility.
  2. [§4.2] Table reporting per-metric results should include standard deviations or confidence intervals given the stochastic nature of LLM calls.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive feedback. We address each major comment below with clarifications and planned revisions to improve the manuscript.

read point-by-point responses
  1. Referee: [§4 Experiments] The headline gains (29.7% average improvement, 23.7% Accuracy, 57.6% Spearman's ρ) are reported without details on dataset construction, baseline implementations, statistical significance tests, or controls for LLM variability; these omissions are load-bearing because the central claim attributes gains to the graph message-passing mechanism rather than prompting artifacts.

    Authors: We agree the manuscript omits key experimental details. The released code contains the implementations, but to make the paper self-contained we will expand §4 with: (i) full dataset construction protocol including time/venue splits and filtering criteria, (ii) precise baseline re-implementation steps and hyper-parameters, (iii) statistical significance results (paired t-tests and Wilcoxon tests across 5 random seeds), and (iv) explicit controls for LLM variability (temperature sweeps, prompt paraphrases, and seed-averaged runs). These additions will better isolate the contribution of the graph propagation step. revision: yes

  2. Referee: [§3 Method] No human agreement rates, ablation removing the graph propagation step, or independent validation of the LLM-generated priors/edges is provided; without this, the weakest assumption—that LLM outputs accurately reflect true quality and relationships—cannot be falsified, leaving open the possibility that PageRank merely propagates noisy or biased signals.

    Authors: This point is well-taken. We will add: (i) an ablation that disables message passing and ranks solely by node priors, (ii) human agreement rates (Cohen’s κ) on a 200-instance sample of LLM-generated priors and pairwise edges annotated by two domain experts, and (iii) correlation of LLM edge labels with citation-based proxies. These results will appear in a new subsection of §3 and an expanded §4. revision: yes

  3. Referee: [Abstract and §3.2] The training of LLM backbones via reward-induced maximum likelihood risks circularity if rewards derive from the same evaluation signals used in downstream ranking/decision tasks, which could inflate the reported improvements without an explicit control experiment.

    Authors: The rewards are computed from ground-truth labels (accept/reject decisions and citation counts) that are disjoint from the test-set evaluation metrics. Nevertheless, to eliminate any perception of circularity we will add a control experiment that trains the LLM backbones with standard MLE and compares downstream ranking/decision performance against the reward-induced variant; results will be reported in the revised §3.2 and §4. revision: yes

Circularity Check

0 steps flagged

No significant circularity; derivation uses external LLM signals and standard graph propagation

full rationale

The paper formulates evaluation as message passing over a graph whose nodes receive LLM-generated quality priors and whose edges receive LLM-generated pairwise comparisons, after which Personalized PageRank produces the final rankings and decisions. No equation or training objective is shown to define the output ranking in terms of itself or to rename a fitted parameter as a prediction. The reward-induced MLE training is described only as a means to improve the quality of the LLM-generated evidence; the abstract supplies no indication that the reward signal is constructed from the downstream ranking metrics in a closed loop. Experiments report gains against external baselines and across time/venue splits, indicating evaluation on independent ground truth rather than self-referential fitting. No self-citation, uniqueness theorem, or ansatz-smuggling steps appear in the provided text.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 1 invented entities

The central claim rests on the untested premise that LLM pairwise comparisons produce reliable edge evidence and on the standard assumption that Personalized PageRank aggregates those signals usefully; no free parameters or new physical entities are declared in the abstract.

axioms (1)
  • domain assumption Personalized PageRank integrates review signals for quality ranking, decision prediction, and review generation
    Invoked in the abstract as the mechanism that turns node and edge LLM outputs into final outputs.
invented entities (1)
  • semantic paper graph with synchronic and diachronic links no independent evidence
    purpose: To jointly capture intrinsic quality, contemporaneous relations, and prior-work relations for message passing
    Newly introduced construct in the framework; no independent evidence supplied in the abstract.

pith-pipeline@v0.9.1-grok · 5757 in / 1315 out tokens · 26098 ms · 2026-06-29T18:24:42.193859+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Reference graph

Works this paper leans on

66 extracted references · 12 canonical work pages · 3 internal anchors

  1. [1]

    Yuan Chang, Ziyue Li, Hengyuan Zhang, Yuanbo Kong, Yanru Wu, Hayden Kwok-Hay So, Zhijiang Guo, Liya Zhu, and Ngai Wong

    Reviewing peer review. Yuan Chang, Ziyue Li, Hengyuan Zhang, Yuanbo Kong, Yanru Wu, Hayden Kwok-Hay So, Zhijiang Guo, Liya Zhu, and Ngai Wong. 2025. Treereview: A dynamic tree of questions framework for deep and efficient llm-based scientific peer review. InProceed- ings of the 2025 Conference on Empirical Methods in Natural Language Processing, pages 156...

  2. [2]

    Justin Gilmer, Samuel S Schoenholz, Patrick F Riley, Oriol Vinyals, and George E Dahl

    Revieweval: An evaluation framework for ai- generated reviews.arXiv preprint arXiv:2502.11736. Justin Gilmer, Samuel S Schoenholz, Patrick F Riley, Oriol Vinyals, and George E Dahl. 2017. Neural mes- sage passing for quantum chemistry. InInternational conference on machine learning, pages 1263–1272. Pmlr. Zirui Guo, Lianghao Xia, Yanhua Yu, Tian Ao, and C...

  3. [3]

    arXiv preprint arXiv:2405.02150

    The ai review lottery: Widespread ai-assisted peer reviews boost paper scores and acceptance rates. arXiv preprint arXiv:2405.02150. Chuanlei Li, Xu Hu, Minghui Xu, Kun Li, Yue Zhang, and Xiuzhen Cheng. 2025. Can large language mod- els be trusted paper reviewers? a feasibility study. arXiv preprint arXiv:2506.17311. Chris Lu, Cong Lu, Robert Tjarko Lange...

  4. [4]

    G-reasoner: Foundation Models for Unified Reasoning over Graph-structured Knowledge

    G-reasoner: Foundation models for unified reasoning over graph-structured knowledge.arXiv preprint arXiv:2509.24276. Junbo Niu, Zheng Liu, Zhuangcheng Gu, Bin Wang, Linke Ouyang, Zhiyuan Zhao, Tao Chu, Tianyao He, Fan Wu, Qintong Zhang, and 1 others. 2025. Mineru2. 5: A decoupled vision-language model for efficient high-resolution document parsing.arXiv p...

  5. [5]

    Jiabin Tang, Yuhao Yang, Wei Wei, Lei Shi, Lixin Su, Suqi Cheng, Dawei Yin, and Chao Huang

    Peer review as a multi-turn and long-context dialogue with role-based interactions.arXiv preprint arXiv:2406.05688. Jiabin Tang, Yuhao Yang, Wei Wei, Lei Shi, Lixin Su, Suqi Cheng, Dawei Yin, and Chao Huang. 2024. Graphgpt: Graph instruction tuning for large lan- guage models. InProceedings of the 47th Interna- tional ACM SIGIR Conference on Research and ...

  6. [6]

    Keith Tyser, Ben Segev, Gaston Longhitano, Xin-Yu Zhang, Zachary Meeks, Jason Lee, Uday Garg, Nicholas Belsten, Avi Shporer, Madeleine Udell, and 1 others

    Ai can learn scientific taste.arXiv preprint arXiv:2603.14473. Keith Tyser, Ben Segev, Gaston Longhitano, Xin-Yu Zhang, Zachary Meeks, Jason Lee, Uday Garg, Nicholas Belsten, Avi Shporer, Madeleine Udell, and 1 others. 2024. Ai-driven review systems: evaluating llms in scalable and bias-aware academic reviews. arXiv preprint arXiv:2408.10365. Petar Veliˇc...

  7. [7]

    Graph Attention Networks

    Graph attention networks.arXiv preprint arXiv:1710.10903. Duo Wang, Yuan Zuo, Guangyue Lu, and Junjie Wu

  8. [8]

    arXiv preprint arXiv:2510.16885

    Unigte: Unified graph-text encoding for zero- shot generalization across graph tasks and domains. arXiv preprint arXiv:2510.16885. Heng Wang, Shangbin Feng, Tianxing He, Zhaoxuan Tan, Xiaochuang Han, and Yulia Tsvetkov. 2023. Can language models solve graph problems in natural language?Advances in Neural Information Process- ing Systems, 36:30840–30861. Y...

  9. [9]

    InThe Thirteenth Inter- national Conference on Learning Representations

    Cycleresearcher: Improving automated re- search via automated review. InThe Thirteenth Inter- national Conference on Learning Representations. Lingfei Wu, Dashun Wang, and James A Evans. 2019. Large teams develop and small teams disrupt science and technology.Nature, 566(7744):378–382. Zhikai Xue, Guoxiu He, Zhuoren Jiang, Sichen Gu, Yangyang Kang, Star Z...

  10. [10]

    Is Chain-of-Thought Reasoning of LLMs a Mirage? A Data Distribution Lens

    From replication to redesign: Exploring pair- wise comparisons for LLM-based peer review. In The Thirty-ninth Annual Conference on Neural Infor- mation Processing Systems. Chengshuai Zhao, Zhen Tan, Pingchuan Ma, Dawei Li, Bohan Jiang, Yancheng Wang, Yingzhen Yang, and Huan Liu. 2025a. Is chain-of-thought reasoning of llms a mirage? a data distribution le...

  11. [11]

    arXiv preprint arXiv:2310.01089

    Graphtext: Graph reasoning in text space. arXiv preprint arXiv:2310.01089. Penghai Zhao, Jinyu Tian, Qinghua Xing, Xin Zhang, Zheng Li, Jianjun Qian, Ming-Ming Cheng, and Xi- ang Li. 2025b. Naipv2: Debiased pairwise learning for efficient paper quality estimation.arXiv preprint arXiv:2509.25179. Penghai Zhao, Qinghua Xing, Kairan Dou, Jinyu Tian, Ying Tai...

  12. [12]

    InPro- ceedings of the 63rd Annual Meeting of the Associa- tion for Computational Linguistics (Volume 1: Long Papers), pages 29330–29355

    Deepreview: Improving llm-based paper re- view with human-like deep thinking process. InPro- ceedings of the 63rd Annual Meeting of the Associa- tion for Computational Linguistics (Volume 1: Long Papers), pages 29330–29355. Zhenzhen Zhuang, Jiandong Chen, Hongfeng Xu, Yuwen Jiang, and Jialiang Lin. 2025. Large lan- guage models for automated scholarly pap...

  13. [13]

    Assuming a unique optimal 2-factor exists, the algorithm is permutation equivariant, meaning the optimal result is invariant to node relabel- ing

  14. [14]

    This follows from the fact that any complete graph KN with N≥3 admits a 2-factor; see Appendix G for a proof

    For T≥1 and N≥3 , the loop executes at least once, so the graph always contains edges. This follows from the fact that any complete graph KN with N≥3 admits a 2-factor; see Appendix G for a proof. D.2 Text Consolidation In practice, for node-level inference, an LLM’s output can be decomposed into two types of infor- mation, denoted as fLLM(xv, ps) = (ˆys,...

  15. [15]

    Metadata Acquisition Top-tier Conference Metadata

  16. [16]

    PDF Downloading PDF Cache Semantic Attention Local Vector Database (Training Set) Cosine Similarity Greedy, One-time use Score Gap Random Swap (Metigating Positional Bias) Initial Prompt

  17. [17]

    Cold-start SFT Metadata API Download API Top-tier Conference Metadata

  18. [18]

    Content Parsing Markdown Cache MinerU PDF Cache

  19. [19]

    Build Database Local Vector Database Embedding Model Markdown Cache

  20. [20]

    Prompt Optimization LLM as a Judge Generate Answers Prompt Self-evolving Best Prompt EdgeNode Node Edge Instruct LLM

  21. [21]

    RWML Training Open-source LLM Cold-started LLM Cold-started LLM GraphReview LLM LLM Judger LLM Evolver Figure 5: Pipeline of dataset construction (left) and training process (right). E Experiment Details E.1 Dataset Construction As shown in Figure 5 (left), we first collect the full text of all papers through the OpenReview API, parse the PDFs with MinerU...

  22. [22]

    The results reported in Table 9 and 10 show that graph-based fusion consistently integrates hetero- geneous review signals and outperforms each indi- vidual method on most metrics

    combined with DeepReview-14B (Zhu et al., 2025), and (2) PairReview (Zhang et al., 2025) combined with CycleReviewer-7B (Weng et al., 2025). The results reported in Table 9 and 10 show that graph-based fusion consistently integrates hetero- geneous review signals and outperforms each indi- vidual method on most metrics. For the combina- tion of CNPE-7B an...

  23. [23]

    However, our experiments show that such models perform poorly on paper reviewing and fall substantially behind the pro- posed GraphReview framework

    treat text embeddings as node features and then learn node representations through neigh- borhood aggregation. However, our experiments show that such models perform poorly on paper reviewing and fall substantially behind the pro- posed GraphReview framework. Although both paradigms leverage graph structure and involve message passing, a fundamental quest...

  24. [24]

    On Representing Convex Quadratically Constrained Quadratic Programs via Graph Neural Networks,\

    typically treat paper review as a task to be solved through decomposition, iterative reflec- tion, or retrieval-augmented analysis. Although retrieved evidence may provide useful background knowledge, it is usually incorporated only as aux- iliary context (Zhu et al., 2025) rather than as a structured signal that directly shapes evaluation. As a result, t...

  25. [25]

    **Disadvantages**:

    **Relevant Problem Selection:** The task of representing and solving convex QCQPs ... **Disadvantages**:

  26. [26]

    **Questions**:

    **Incremental and Poorly Justified Technical Contribution:** The proposed ... **Questions**:

  27. [27]

    **Suggestions**:

    The theorem states that a GNN can universally approximate the ... **Suggestions**:

  28. [28]

    Finally, we produce a complete evaluation report with an associated score for paper 68J0pJFCi3

    **Complete and Justify the Theoretical Claims:** The authors ... Finally, we produce a complete evaluation report with an associated score for paper 68J0pJFCi3. Table 11: Case study. An example illustrating the complete workflow for evaluating a paper. Criteria Optimization Prompt You are an expert prompt optimizer. Your task is to optimize the {criteria}...

  29. [29]

    Use the provided`single_paper_review`as the primary foundation and preserve its core judgment unless the comparative evidence clearly justifies adjustment.,→

  30. [30]

    For each entry in`related_pairs`, briefly extract only the most relevant information from `pair_comparison`, especially comparative strengths, weaknesses, missing validations, or clearer methodological standards that are directly useful for evaluating this paper. ,→ ,→

  31. [31]

    Citation format: e.g.`(#0, 2025)`

    Integrate these insights naturally into the`single_paper_review`, citing the relevant literature in the merged text. Citation format: e.g.`(#0, 2025)`. Use comparisons selectively and only when they strengthen or clarify the review. ,→ ,→

  32. [32]

    Make sure the ranking, decision, and all arguments are fully consistent with each other after revision

    You must output content related to`ranking`and`decision`at first, e.g.`**Ranking:** (0/500)`and`**Decision:** Accept`. Make sure the ranking, decision, and all arguments are fully consistent with each other after revision. ,→ ,→

  33. [33]

    Avoid repetition across sections

    Structure the review clearly into layered sections: first give an overall assessment, then list the most important strengths, then the most important weaknesses, and finally concrete questions/suggestions. Avoid repetition across sections. ,→ ,→

  34. [34]

    The questions and suggestions proposed must all be highly practical, specific, feasible, and directly actionable for the authors to address.,→

  35. [35]

    Avoid exaggerated claims or unsupported criticism.,→

    Keep the tone professional, evidence-based, and concise. Avoid exaggerated claims or unsupported criticism.,→

  36. [36]

    Do not include any other content

    Only output the merged text. Do not include any other content. Here is all the content related to the paper: ``` {json_str} ``` The output format you need to follow: ``` **Ranking:** **Decision:** **Summary**: **Advantages**: **Disadvantages**: **Questions**: **Suggestions**: ``` Table 16: Text consolidation prompt. Merge the texts to generate complete re...

  37. [37]

    Return a valid JSON object only, with no extra text

  38. [38]

    technical_depth

    The JSON must contain exactly these ten keys: - "technical_depth" - "technical_depth_reason" - "evidence_grounding" - "evidence_grounding_reason" - "scientific_rigor" - "scientific_rigor_reason" - "revision_utility" - "revision_utility_reason" - "overall_preference" - "overall_preference_reason"

  39. [39]

    A", "B", or

    For each label key, the value must be exactly one of: "A", "B", or "Tie"

  40. [40]

    For each reason key, the value must be one brief sentence of at most 22 words

  41. [41]

    EVALUATION DIMENSIONS:

    Do not output anything except the JSON object. EVALUATION DIMENSIONS:

  42. [42]

    technical_depth: Which review engages more deeply with the paper's technical substance, such as method details, assumptions, derivations, proofs, experiments, evaluation design, complexity, or implementation? ,→ ,→

  43. [43]

    evidence_grounding: Which review ties its judgments more directly to paper-specific evidence, claims, equations, tables, figures, baselines, metrics, or clearly missing analyses?,→

  44. [44]

    scientific_rigor: Which review more rigorously evaluates validity, claim-evidence alignment, fairness of comparisons, reproducibility, completeness of argumentation, and whether the paper's conclusions are actually supported? ,→ ,→

  45. [45]

    revision_utility: Which review gives more useful and actionable guidance for improving the paper, especially through concrete, acceptance-relevant revisions?,→

  46. [46]

    overall_preference: Overall, which review is more valuable for editorial decision-making and author revision, considering technical insight, evidence-based criticism, exposure of substantive weaknesses, and usefulness for improving the paper? ,→ ,→ CORE JUDGING PRINCIPLES:

  47. [47]

    Judge only the quality of the reviews, not the quality of the paper

  48. [48]

    Prefer reviews that identify central technical weaknesses, unsupported claims, weak evidence, missing controls, incomplete proofs, confounds, unfair baselines, or reproducibility gaps.,→

  49. [49]

    Prefer paper-specific critique over generic balance, polished wording, soft tone, or formulaic reviewing language.,→

  50. [50]

    Do not reward a review merely for sounding more diplomatic, more moderate, more balanced, or more polished.,→

  51. [51]

    Do not penalize a review merely for being critical, forceful, technically dense, or highly detailed, if its concerns are concrete and grounded in the paper.,→

  52. [52]

    Strong reviews often directly explain why the current evidence is insufficient for the paper's claims.,→

  53. [53]

    Comparative references to related work may be useful when they concretely support criticism about novelty, baselines, theory, or evaluation standards; do not dismiss them automatically unless they substantially replace paper-specific analysis. ,→ ,→

  54. [54]

    Ignore superficial differences in politeness or rhetorical style unless they materially affect scientific clarity or introduce unsupported claims.,→

  55. [55]

    If one review is sharper but better exposes acceptance-relevant weaknesses, it can be better overall even if it is less smooth stylistically.,→

  56. [56]

    In close cases, prefer the review that better identifies substantive risks to validity or acceptance.,→

  57. [57]

    Tie" rather than defaulting to

    If the two reviews are difficult to distinguish in quality, choose "Tie" rather than defaulting to "A" due to positional bias.,→

  58. [58]

    technical_depth

    If one review is empty, select the other review accordingly. Continued on next page. Table 17: Text evaluation prompt. Comparing the text quality with other approaches. Text Evaluation Prompt(Continued) Compare Review A and Review B as peer-review reports for the same paper. Return only a valid JSON object with exactly these keys: "technical_depth" "techn...

  59. [59]

    Judge only the quality of the reviews, not the paper itself

  60. [60]

    Prefer reviews that identify important technical flaws, unsupported claims, weak evidence, missing experiments, incomplete proofs, unfair comparisons, or reproducibility issues.,→

  61. [61]

    Prefer paper-specific, evidence-linked criticism over smoother wording or more diplomatically balanced tone.,→

  62. [62]

    Do not reward a review merely for sounding more polished, more measured, or more conventionally editorial.,→

  63. [63]

    A sharper or more critical review can be better if its concerns are concrete, technically meaningful, and grounded in the paper.,→

  64. [64]

    Related-work comparisons may be useful when they concretely support criticism about novelty, baselines, theory, or evaluation standards.,→

  65. [65]

    In close cases, overall_preference should favor the review that better exposes acceptance-relevant weaknesses and better helps an editor decide.,→

  66. [66]

    Tie" rather than defaulting to

    If the two reviews are difficult to distinguish in quality, output "Tie" rather than defaulting to "A" because of positional bias.,→ Review A: {review_a} Review B: {review_b} Table 17: Text evaluation prompt (Continued). Comparing the text quality with other approaches