pith. machine review for the scientific record.

arxiv: 2604.14261 · v1 · submitted 2026-04-15 · 💻 cs.CL · cs.AI

Recognition: unknown

ReviewGrounder: Improving Review Substantiveness with Rubric-Guided, Tool-Integrated Agents

Chuan Li, Dongfu Jiang, Haoxiang Zhang, Jianwen Xie, Shuiwang Ji, Yi Lu, Yu Wang, Yuyang Bai, Yu Zhang, Zhuofeng Li

Authors on Pith: no claims yet

Pith reviewed 2026-05-10 12:53 UTC · model grok-4.3

classification 💻 cs.CL cs.AI
keywords peer review · LLM agents · review generation · rubric evaluation · multi-agent systems · benchmark · tool integration · substantive feedback

The pith

Rubric-guided multi-agent systems generate more substantive peer reviews than much larger language models.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper claims that current LLM reviewers produce shallow, formulaic comments because they skip explicit rubrics and fail to ground feedback in the paper itself. To test this, the authors build REVIEWBENCH, a benchmark that scores generated reviews against paper-specific rubrics drawn from official guidelines, the paper content, and human examples. They introduce REVIEWGROUNDER, which splits the work into a drafting stage and a grounding stage that uses tools to pull in targeted evidence. Experiments show the system, running on a 14B drafter and 120B grounder, beats baselines that rely on GPT-4.1 or even larger models in both human alignment and scores across eight quality dimensions. A reader would care because the surge in AI conference submissions makes scalable, evidence-based review support increasingly urgent.

Core claim

The authors present REVIEWGROUNDER as a rubric-guided, tool-integrated multi-agent framework that decomposes peer review into a drafting stage, where an initial review is generated following explicit rubrics, and a grounding stage, where tools consolidate evidence from the paper to make comments more substantive. On the REVIEWBENCH benchmark constructed from official guidelines, paper content, and human-written reviews, this approach with a Phi-4-14B drafter and GPT-OSS-120B grounder consistently outperforms baselines using substantially larger backbones such as GPT-4.1 and DeepSeek-R1-670B, achieving better alignment with human judgments and higher rubric-based quality scores across eight dimensions.
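The two-stage decomposition can be pictured as a small pipeline. The sketch below is illustrative only: the function names, `Review` container, and lambda "tools" are hypothetical stand-ins, not the paper's actual agents or interfaces.

```python
from dataclasses import dataclass, field

@dataclass
class Review:
    text: str
    evidence: list = field(default_factory=list)

def draft_review(paper: str, rubrics: list) -> Review:
    # Drafting stage: one rubric-guided comment per dimension.
    comments = [f"[{r}] comment grounded in: {paper[:40]}" for r in rubrics]
    return Review(text="\n".join(comments))

def ground_review(draft: Review, tools: dict) -> Review:
    # Grounding stage: each tool pulls targeted evidence that is
    # attached to the draft rather than replacing it.
    draft.evidence = [f"{name}: {tool()}" for name, tool in tools.items()]
    return draft

rubrics = ["novelty", "soundness"]
tools = {
    "literature_searcher": lambda: "three related papers summarized",
    "result_analyzer": lambda: "reported tables re-checked",
}
final = ground_review(draft_review("A paper about sparse attention.", rubrics), tools)
```

The point the structure makes is the paper's central one: the drafter never needs to be large, because substantiveness is added downstream by evidence consolidation rather than by scale.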

What carries the argument

REVIEWGROUNDER, a rubric-guided tool-integrated multi-agent framework that decomposes reviewing into drafting and evidence-grounding stages to enrich shallow drafts.

If this is right

  • Review generation quality improves through explicit rubric decomposition and tool-based evidence consolidation rather than model scale alone.
  • Paper-specific rubrics derived from guidelines and human reviews offer a reproducible way to measure and compare AI review systems.
  • Multi-agent setups with smaller specialized models can surpass larger general models when structured guidance and tools are applied.
  • Targeted tool integration directly addresses the lack of contextual grounding that produces superficial LLM reviews.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The two-stage decomposition could be adapted to support human reviewers by supplying stronger initial drafts that reduce revision effort.
  • Similar rubric-and-tool patterns might transfer to other grounded writing tasks such as grant evaluation or code review.
  • Conference platforms could incorporate the approach to scale initial review generation while preserving human oversight for final decisions.
  • Future benchmarks might test whether the same gains hold when rubrics are applied dynamically to papers from fields outside computer science.

Load-bearing premise

The REVIEWBENCH benchmark, built from official guidelines, paper content, and human-written reviews, provides a valid and unbiased measure of review substantiveness, and observed gains come from the rubric-guided decomposition and tool use rather than model artifacts or evaluation choices.
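A REVIEWBENCH-style rubric evaluation can be sketched as per-dimension scoring followed by aggregation. The dimension names below echo labels that appear in the paper's prompts, but the exact set and the unweighted mean are assumptions, not the benchmark's documented protocol.

```python
# Hypothetical eight rubric dimensions; the last three names are
# placeholders, and equal weighting is an assumption.
DIMENSIONS = [
    "core_contribution_accuracy", "results_interpretation",
    "comparative_positioning", "evidence_based_critique",
    "completeness_coverage", "specificity", "actionability", "tone",
]

def overall_score(per_dimension: dict) -> float:
    # Refuse partial scorecards so every review is judged on all
    # eight dimensions before aggregation.
    missing = set(DIMENSIONS) - set(per_dimension)
    if missing:
        raise ValueError(f"unscored dimensions: {sorted(missing)}")
    return sum(per_dimension[d] for d in DIMENSIONS) / len(DIMENSIONS)

scores = {d: 7.0 for d in DIMENSIONS}
scores["evidence_based_critique"] = 9.0
```

Under this toy aggregation, `overall_score(scores)` is 7.25; the load-bearing question above is whether such scores track substantiveness rather than rubric familiarity.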

What would settle it

A controlled study in which human experts rate the actual helpfulness and substantiveness of ReviewGrounder reviews versus baseline reviews on the same set of papers, or an ablation showing that removing the grounding stage or the rubrics eliminates the reported performance advantage.
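The ablation half of that test can be sketched as a comparison harness: score the full pipeline and a variant with the grounding stage removed on the same papers. Everything here is a stand-in; the toy scorer just counts evidence-backed comments.

```python
# Hypothetical ablation harness; pipelines and scorer are illustrative.
def evaluate(pipeline, papers):
    # Toy scorer: average number of evidence-backed comments per paper.
    return sum(len(pipeline(p)) for p in papers) / len(papers)

full = lambda paper: ["draft", "literature evidence", "result check"]
without_grounding = lambda paper: ["draft"]

papers = ["paper-1", "paper-2", "paper-3"]
gap = evaluate(full, papers) - evaluate(without_grounding, papers)
```

A gap that survives this comparison would support the grounding stage as the source of the gains; a gap near zero would point to prompt or evaluation artifacts instead.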

Figures

Figures reproduced from arXiv: 2604.14261 by Chuan Li, Dongfu Jiang, Haoxiang Zhang, Jianwen Xie, Shuiwang Ji, Yi Lu, Yu Wang, Yuyang Bai, Yu Zhang, Zhuofeng Li.

Figure 1
Figure 1. Overview of the REVIEWBENCH construction pipeline. For each paper, paper-specific rubrics are instantiated from an aggregated reference review, the submission PDF, and meta-rubrics. view at source ↗
Figure 2
Figure 2. Overview of REVIEWGROUNDER, which decomposes reviewing into collaborating agents: (a) Review Drafter: generates an initial draft based on the paper. (b) Multi-dimensional Grounding Agents: the Literature Searcher retrieves and summarizes related work using external tools, the Insight Miner verifies methodology and core contributions, and the Result Analyzer checks experimental results. (c) Review Aggregator… view at source ↗
Figure 3
Figure 3. Ablation on Literature Searcher configurations under rubric-based evaluation. view at source ↗
Figure 4
Figure 4. Comparison with baselines under normal and attack scenarios via rubric-based evaluation. Overall scores (normal / attack): GPT-4o 4.48 / 4.43, GPT-4.1 7.59 / 7.46, DeepReviewer-14B 7.70 / 7.30, Ours 10.70 / 10.65. view at source ↗
Figure 6
Figure 6. Retriever & Reranking Strategy Ablation. Detailed performance comparison across retriever selections (OpenScholar vs. BAAI-BGE) and different reranking settings (N ∈ {5, 10, 20}). view at source ↗
Figure 5
Figure 5. Component & Drafter Ablation Study. Fine-grained score breakdown across 8 rubric dimensions, analyzing the impact of removing system components (Result Analyzer, Insight Miner, Literature Searcher) and scaling the Drafter backbone. view at source ↗
Figure 7
Figure 7. Review from REVIEWGROUNDER. view at source ↗
Figure 8
Figure 8. Review from DeepReviewer-14B. view at source ↗
read the original abstract

The rapid rise in AI conference submissions has driven increasing exploration of large language models (LLMs) for peer review support. However, LLM-based reviewers often generate superficial, formulaic comments lacking substantive, evidence-grounded feedback. We attribute this to the underutilization of two key components of human reviewing: explicit rubrics and contextual grounding in existing work. To address this, we introduce REVIEWBENCH, a benchmark evaluating review text according to paper-specific rubrics derived from official guidelines, the paper's content, and human-written reviews. We further propose REVIEWGROUNDER, a rubric-guided, tool-integrated multi-agent framework that decomposes reviewing into drafting and grounding stages, enriching shallow drafts via targeted evidence consolidation. Experiments on REVIEWBENCH show that REVIEWGROUNDER, using a Phi-4-14B-based drafter and a GPT-OSS-120B-based grounding stage, consistently outperforms baselines with substantially stronger/larger backbones (e.g., GPT-4.1 and DeepSeek-R1-670B) in both alignment with human judgments and rubric-based review quality across 8 dimensions. The code is available at https://github.com/EigenTom/ReviewGrounder.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper introduces REVIEWBENCH, a benchmark for AI-generated peer reviews that uses paper-specific rubrics derived from official guidelines, paper content, and human-written reviews. It proposes REVIEWGROUNDER, a multi-agent framework with a rubric-guided drafting stage (Phi-4-14B) and a tool-integrated grounding stage (GPT-OSS-120B) to produce more substantive, evidence-based reviews. The central empirical claim is that REVIEWGROUNDER outperforms stronger baselines (GPT-4.1, DeepSeek-R1-670B) in alignment with human judgments and across 8 rubric dimensions on REVIEWBENCH; code is released.

Significance. If the results hold after controlling for evaluation artifacts, the work could meaningfully improve LLM-assisted peer review by enforcing explicit rubric decomposition and evidence grounding, addressing a documented weakness in current systems. The public code release is a clear strength that supports reproducibility.

major comments (2)
  1. [Abstract and §3] Abstract and benchmark construction section: rubrics are derived in part from the same human-written reviews used for the alignment-with-human-judgments metric. A rubric-guided method can therefore score higher by construction on both the rubric dimensions and the alignment metric, which directly undermines the claim that observed gains over much larger backbones reflect genuine improvements in substantiveness rather than benchmark design.
  2. [Experiments] Experiments section: the headline result (smaller drafter + grounding stage beats GPT-4.1 and DeepSeek-R1-670B) is reported without visible ablations on prompt engineering, temperature, or the contribution of the tool-integration component versus the rubric decomposition alone. This is load-bearing because the reader cannot determine whether the gains are attributable to the proposed framework or to uncontrolled implementation choices.
minor comments (2)
  1. [Abstract] The abstract states that REVIEWBENCH is built from 'official guidelines, the paper's content, and human-written reviews' but does not report the number of papers, reviews, or inter-annotator agreement statistics; adding these numbers would improve context.
  2. [§3] The 8 rubric dimensions should be enumerated in a single table with short definitions so readers can assess coverage without searching the text.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the thoughtful and constructive comments on our manuscript. We address each major comment point by point below and describe the revisions we will make to strengthen the paper.

read point-by-point responses
  1. Referee: [Abstract and §3] Abstract and benchmark construction section: rubrics are derived in part from the same human-written reviews used for the alignment-with-human-judgments metric. A rubric-guided method can therefore score higher by construction on both the rubric dimensions and the alignment metric, which directly undermines the claim that observed gains over much larger backbones reflect genuine improvements in substantiveness rather than benchmark design.

    Authors: We acknowledge the validity of this concern about potential circularity in the benchmark. The rubrics are synthesized from three sources—official guidelines, paper content, and human-written reviews—to produce paper-specific criteria, but we agree that greater transparency is required to demonstrate that the reported gains are not artifacts of this construction. In the revised manuscript we will expand §3 with a detailed description of the rubric synthesis procedure (including explicit steps to avoid direct copying from human reviews), add quantitative analysis of rubric overlap with the human reviews used in the alignment metric, and include a controlled comparison showing that REVIEWGROUNDER still outperforms baselines when rubric dimensions are held constant. We will also note this as a limitation of the current benchmark design. revision: yes

  2. Referee: [Experiments] Experiments section: the headline result (smaller drafter + grounding stage beats GPT-4.1 and DeepSeek-R1-670B) is reported without visible ablations on prompt engineering, temperature, or the contribution of the tool-integration component versus the rubric decomposition alone. This is load-bearing because the reader cannot determine whether the gains are attributable to the proposed framework or to uncontrolled implementation choices.

    Authors: We agree that the lack of component-wise ablations and implementation controls weakens the interpretability of the headline results. The current experiments focus on end-to-end performance, but this does not isolate the individual contributions of rubric-guided drafting versus tool-integrated grounding, nor does it control for prompt or temperature choices. In the revision we will add a dedicated ablation subsection that reports: (i) performance with and without the tool-integration stage, (ii) results under fixed prompt templates and temperature settings across all models, and (iii) sensitivity analysis to prompt variations. These additions will allow readers to attribute gains more precisely to the proposed framework. revision: yes

Circularity Check

0 steps flagged

No circularity: empirical benchmark construction and agent evaluation are independent of fitted self-referential quantities

full rationale

The paper introduces REVIEWBENCH as a new benchmark whose rubrics are derived from external official guidelines, paper content, and human-written reviews, then evaluates REVIEWGROUNDER (a multi-agent framework) against external baselines on alignment with human judgments and rubric scores across 8 dimensions. No equations, parameter fits, or derivations are present that reduce claimed improvements to quantities defined by the authors' own inputs or self-citations. The setup is a standard empirical comparison on a newly constructed benchmark; the rubric-guided decomposition is a methodological choice evaluated against independent baselines rather than a tautological re-expression of the benchmark construction itself. This qualifies as self-contained empirical work with no load-bearing circular steps.

Axiom & Free-Parameter Ledger

1 free parameters · 1 axioms · 0 invented entities

The framework rests on standard assumptions about LLM tool-use and rubric-following capabilities plus the validity of the new benchmark; no new physical entities are postulated.

free parameters (1)
  • Choice of base models for drafter and grounder stages
    Phi-4-14B and GPT-OSS-120B are selected for the two stages; these specific sizes and families are chosen rather than derived from first principles.
axioms (1)
  • domain assumption: LLMs can reliably follow explicit rubrics and use external tools to consolidate evidence from a paper
    Invoked as the mechanism enabling the grounding stage to enrich shallow drafts.

pith-pipeline@v0.9.0 · 5542 in / 1213 out tokens · 46351 ms · 2026-05-10T12:53:53.496344+00:00 · methodology

discussion (0)


    Finally, the paper lacks a detailed analysis of the impact of the sparsity hyperparameterkon the performance of the model. As reviewer 3 correctly pointed out, the paper does not provide a detailed analysis of the trade-off between accuracy and efficiency when varyingk. The paper also does not provide a clear explanation of how the value ofkshould be chos...