pith. machine review for the scientific record.

arxiv: 2605.02011 · v1 · submitted 2026-05-03 · 💻 cs.CL · cs.AI · cs.IR

Recognition: unknown

Enhancing Judgment Document Generation via Agentic Legal Information Collection and Rubric-Guided Optimization

Authors on Pith: no claims yet

Pith reviewed 2026-05-09 16:58 UTC · model grok-4.3

classification 💻 cs.CL cs.AI cs.IR
keywords judgment document generation · agentic retrieval · reinforcement learning · legal reasoning · LLM optimization · GRPO · legal AI · JuDGE

The pith

Judge-R1 improves automated judgment document generation by combining agentic information collection with rubric-guided reinforcement learning.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper tries to fix the shortcomings of current AI methods for writing court judgment documents, such as missing key legal facts, inventing statute references, and weak logical arguments. It proposes Judge-R1, which uses a planning agent to dynamically gather accurate legal information from multiple sources and then applies reinforcement learning guided by judicial rubrics to optimize the document. A reader would care if this leads to more reliable legal AI that could assist courts in producing consistent and evidence-based decisions. The work shows through experiments that this combined approach yields better results than standard retrieval and fine-tuning techniques on a legal-specific benchmark. This suggests a path toward AI systems that better handle the precision demands of legal reasoning.

Core claim

Judge-R1 is a unified framework for LLM-based judgment document generation. It jointly improves legal information collection, via a dynamic planning agent that retrieves statutes and precedents, and optimizes the generation process through Rubric-Guided Optimization, a reinforcement learning phase that applies Group Relative Policy Optimization (GRPO) with a legal reward function to enforce judicial standards and logical reasoning. Extensive experiments on the JuDGE benchmark show significant improvements in legal accuracy and generation quality over state-of-the-art baselines.

What carries the argument

Agentic Legal Information Collection, which employs a dynamic planning agent for precise retrieval, paired with Rubric-Guided Optimization that uses GRPO and a comprehensive legal reward function to align outputs with judicial requirements.
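The core of the optimization half can be sketched in a few lines. The snippet below is a minimal, hypothetical illustration of GRPO's group-relative advantage computation driven by a rubric-style reward; the rubric criteria and weights are invented for illustration and are not the paper's actual legal reward function.

```python
# Sketch of GRPO's group-relative advantage computation with a
# hypothetical rubric reward. Criteria and weights are illustrative only.
from statistics import mean, stdev

RUBRIC_WEIGHTS = {"statute_accuracy": 0.4, "reasoning_soundness": 0.4, "format_compliance": 0.2}

def rubric_reward(scores: dict) -> float:
    """Weighted sum of per-criterion scores in [0, 1]."""
    return sum(RUBRIC_WEIGHTS[k] * scores[k] for k in RUBRIC_WEIGHTS)

def group_relative_advantages(rewards: list) -> list:
    """GRPO normalizes each sampled output's reward against its own group:
    A_i = (r_i - mean(r)) / std(r). No learned value critic is needed."""
    mu, sigma = mean(rewards), stdev(rewards)
    return [(r - mu) / (sigma + 1e-8) for r in rewards]

# One group of 4 sampled judgment drafts for the same case, scored by the rubric.
group_scores = [
    {"statute_accuracy": 1.0, "reasoning_soundness": 0.8, "format_compliance": 1.0},
    {"statute_accuracy": 0.5, "reasoning_soundness": 0.6, "format_compliance": 1.0},
    {"statute_accuracy": 0.0, "reasoning_soundness": 0.4, "format_compliance": 0.5},
    {"statute_accuracy": 1.0, "reasoning_soundness": 1.0, "format_compliance": 1.0},
]
rewards = [rubric_reward(s) for s in group_scores]
advantages = group_relative_advantages(rewards)
```

Drafts scoring above the group mean receive positive advantages and are reinforced; the one hallucinating statutes (third draft) is pushed down, which is the mechanism by which a rubric can penalize fabricated references.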

Load-bearing premise

The JuDGE benchmark and the legal reward function used in the optimization sufficiently capture the key aspects of real-world judicial standards and logical reasoning without significant biases or omissions.

What would settle it

A direct comparison by legal professionals of documents generated by Judge-R1 and by baseline systems on a set of new, real court cases, assessing the accuracy of cited statutes, the soundness of the reasoning, and overall usability. A result in which Judge-R1 showed no advantage would settle the claim against it.

Figures

Figures reproduced from arXiv: 2605.02011 by Qingyao Ai, Weihang Su, Xuanyi Chen, Yiqun Liu, Yueyue Wu.

Figure 1: An illustration of our proposed framework.
Original abstract

Automating the drafting of judgment documents is pivotal to judicial efficiency, yet it remains challenging due to the dual requirements of comprehensive retrieval of legal information and rigorous logical reasoning. Existing approaches, typically relying on standard Retrieval-Augmented Generation and Supervised Fine-Tuning, often suffer from insufficient evidence recall, hallucinated statutory references, and logically flawed legal reasoning. To bridge this gap, we propose Judge-R1, a unified framework designed to enhance LLM-based judgment document generation by jointly improving legal information collection and judgment document generation. First, we introduce Agentic Legal Information Collection, which employs a dynamic planning agent to retrieve precise statutes and precedents from multiple sources. Second, we implement Rubric-Guided Optimization, a reinforcement learning phase utilizing Group Relative Policy Optimization (GRPO) with a comprehensive legal reward function to enforce adherence to judicial standards and reasoning logic. Extensive experiments on the JuDGE benchmark demonstrate that Judge-R1 significantly outperforms state-of-the-art baselines in both legal accuracy and generation quality.
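The abstract's "dynamic planning agent" can be pictured as an iterative retrieve-and-check loop. The sketch below is a toy stand-in, not the paper's implementation: the corpora, keyword matching, and sufficiency test are all assumed for illustration, and the real system presumably uses learned planning and dense retrieval.

```python
# Hypothetical sketch of an agentic collection loop: a planner alternates
# between sources (statutes vs. precedents) and stops once it judges the
# evidence sufficient. Corpora and matching are toy stand-ins.

STATUTES = {"theft": "Criminal Law Art. 264: theft of property ...",
            "fraud": "Criminal Law Art. 266: fraud ..."}
PRECEDENTS = {"theft": "Case (2019) No. 123: defendant convicted of theft ..."}

def retrieve(source: dict, query: str) -> list:
    """Toy keyword retrieval over one source."""
    return [text for key, text in source.items() if key in query.lower()]

def collect_evidence(case_facts: str, max_steps: int = 4) -> list:
    """Planner loop: alternate sources, stop once both a statute and a
    precedent are in hand or the step budget is exhausted."""
    evidence, plan = [], [("statute", STATUTES), ("precedent", PRECEDENTS)]
    for step in range(max_steps):
        kind, source = plan[step % len(plan)]
        for hit in retrieve(source, case_facts):
            if (kind, hit) not in evidence:
                evidence.append((kind, hit))
        if {"statute", "precedent"} <= {k for k, _ in evidence}:  # sufficiency check
            break
    return evidence

evidence = collect_evidence("Defendant accused of theft of a bicycle")
```

The point of the loop, as opposed to one-shot RAG, is that what gets retrieved next depends on what has already been found, which is how insufficient evidence recall gets addressed.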

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

2 major / 0 minor

Summary. The manuscript introduces Judge-R1, a unified framework for LLM-based judgment document generation. It combines Agentic Legal Information Collection (a dynamic planning agent retrieving statutes and precedents from multiple sources) with Rubric-Guided Optimization (GRPO reinforcement learning driven by a comprehensive legal reward function enforcing judicial standards and logical reasoning). The central claim is that this approach significantly outperforms state-of-the-art baselines on the JuDGE benchmark in both legal accuracy and generation quality.

Significance. If the empirical results and evaluation hold, the work could meaningfully advance legal AI by mitigating hallucinated references and flawed reasoning in automated drafting, offering a practical path toward higher judicial efficiency. The combination of agentic retrieval and rubric-based RL is a timely contribution given the domain's demands for precision and traceability.

major comments (2)
  1. Abstract: The central claim that Judge-R1 'significantly outperforms' baselines in legal accuracy and generation quality is load-bearing yet unsupported by any metrics, baseline names, effect sizes, statistical tests, or error analysis in the provided text. Without these, the magnitude and reliability of the reported gains cannot be assessed.
  2. The manuscript provides no description of JuDGE benchmark construction (case selection criteria, annotation protocol, inter-annotator agreement, or coverage of jurisdiction-specific rules and multi-precedent conflicts) or the exact components and weighting of the legal reward function used in GRPO. These omissions directly affect whether the measured improvements reflect genuine advances in statutory adherence and logical reasoning or artifacts of the evaluation design.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We sincerely thank the referee for the thorough and constructive review. We have carefully addressed each major comment below, providing clarifications and committing to revisions that improve the manuscript's transparency and reproducibility without altering the core contributions.

read point-by-point responses
  1. Referee: Abstract: The central claim that Judge-R1 'significantly outperforms' baselines in legal accuracy and generation quality is load-bearing yet unsupported by any metrics, baseline names, effect sizes, statistical tests, or error analysis in the provided text. Without these, the magnitude and reliability of the reported gains cannot be assessed.

    Authors: We agree that the abstract would benefit from greater specificity to allow immediate assessment of the claims. The full manuscript reports quantitative results, baseline comparisons, and evaluation details in the Experiments section; however, we have revised the abstract to explicitly name the primary baselines, report key effect sizes and metrics from the JuDGE benchmark, and reference the statistical significance and error analysis performed. revision: yes

  2. Referee: The manuscript provides no description of JuDGE benchmark construction (case selection criteria, annotation protocol, inter-annotator agreement, or coverage of jurisdiction-specific rules and multi-precedent conflicts) or the exact components and weighting of the legal reward function used in GRPO. These omissions directly affect whether the measured improvements reflect genuine advances in statutory adherence and logical reasoning or artifacts of the evaluation design.

    Authors: We acknowledge that the original submission provided only high-level mentions of the JuDGE benchmark and reward function. In the revised manuscript we have added a dedicated subsection detailing JuDGE construction (case selection from real judicial records, expert annotation protocol, inter-annotator agreement scores, and handling of jurisdiction rules plus multi-precedent conflicts) as well as the precise components and weightings of the legal reward function used within GRPO. revision: yes

Circularity Check

0 steps flagged

No significant circularity; empirical claims rest on independent benchmark evaluation

full rationale

The paper presents Judge-R1 as a framework combining agentic retrieval and GRPO-based RL with a legal reward function, then reports empirical outperformance on the JuDGE benchmark. No equations, fitted parameters, or self-citations are shown to reduce any claimed result to the inputs by construction. The method builds on standard RAG, SFT, and RL techniques without self-definitional loops, and the performance claims are external evaluations rather than renamed fits or uniqueness theorems imported from the authors' prior work. This is a typical empirical ML paper whose central claims remain falsifiable against external benchmarks and do not collapse into self-referential definitions.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

Abstract-only review yields no explicit free parameters, invented entities, or non-standard axioms; relies on standard LLM and RL assumptions.

axioms (1)
  • domain assumption Standard LLM capabilities and RL optimization techniques transfer effectively to legal reasoning tasks
    Implicit in the use of agentic collection and GRPO without additional justification in the abstract.

pith-pipeline@v0.9.0 · 5486 in / 1069 out tokens · 28925 ms · 2026-05-09T16:58:40.961832+00:00 · methodology

discussion (0)


Reference graph

Works this paper leans on

44 extracted references · 24 canonical work pages · 5 internal anchors

  1. [1] Huajie Chen, Deng Cai, Wei Dai, Zehui Dai, and Yadong Ding. 2019. Charge-based prison term prediction with deep gating network. arXiv preprint arXiv:1908.11521.
  2. [2] Qian Dong, Qingyao Ai, Hongning Wang, Yiding Liu, Haitao Li, Weihang Su, Yiqun Liu, Tat-Seng Chua, and Shaoping Ma. 2025. Decoupling Knowledge and Context: An Efficient and Effective Retrieval Augmented Generation Framework via Cross Attention. In Proceedings of the ACM on Web Conference 2025.
  3. [3] Darren Edge, Ha Trinh, Newman Cheng, Joshua Bradley, Alex Chao, Apurva Mody, Steven Truitt, and Jonathan Larson. 2024. From local to global: A graph RAG approach to query-focused summarization. arXiv preprint arXiv:2404.16130.
  4. [4] Yan Fang, Jingtao Zhan, Qingyao Ai, Jiaxin Mao, Weihang Su, Jia Chen, and Yiqun Liu. 2024. Scaling Laws For Dense Retrieval. arXiv preprint arXiv:2403.18684.
  5. [5] Randy Goebel, Yoshinobu Kano, Mi-Young Kim, Juliano Rabelo, Ken Satoh, and Masaharu Yoshioka. 2023. Summary of the competition on legal information extraction/entailment (COLIEE) 2023. In Proceedings of the Nineteenth International Conference on Artificial Intelligence and Law. 472–480.
  6. [6] Daya Guo, Dejian Yang, Haowei Zhang, Junxiao Song, Ruoyu Zhang, Runxin Xu, Qihao Zhu, Shirong Ma, Peiyi Wang, Xiao Bi, et al. 2025. DeepSeek-R1: Incentivizing reasoning capability in LLMs via reinforcement learning. arXiv preprint arXiv:2501.12948.
  7. [7] Yiran Hu, Huanghai Liu, Chong Wang, Kunran Li, Tien-Hsuan Wu, Haitao Li, Xinran Xu, Siqing Huo, Weihang Su, Ning Zheng, et al. 2026. Evaluation of Large Language Models in Legal Applications: Challenges, Methods, and Future Directions. arXiv preprint arXiv:2601.15267.
  8. [8] Zikun Hu, Xiang Li, Cunchao Tu, Zhiyuan Liu, and Maosong Sun. 2018. Few-shot charge prediction with discriminative legal attributes. In Proceedings of the 27th International Conference on Computational Linguistics. 487–498.
  9. [9] Zhengbao Jiang, Frank F Xu, Luyu Gao, Zhiqing Sun, Qian Liu, Jane Dwivedi-Yu, Yiming Yang, Jamie Callan, and Graham Neubig. 2023. Active retrieval augmented generation. arXiv preprint arXiv:2305.06983.
  10. [10] Bowen Jin, Hansi Zeng, Zhenrui Yue, Jinsung Yoon, Sercan Arik, Dong Wang, Hamed Zamani, and Jiawei Han. 2025. Search-R1: Training LLMs to reason and leverage search engines with reinforcement learning. arXiv preprint arXiv:2503.09516.
  11. [11] Liangyi Kang, Jie Liu, Lingqiao Liu, Qinfeng Shi, and Dan Ye. 2019. Creating auxiliary representations from charge definitions for criminal charge prediction. arXiv preprint arXiv:1911.05202.
  12. [12] Mi-Young Kim, Juliano Rabelo, Randy Goebel, Masaharu Yoshioka, Yoshinobu Kano, and Ken Satoh. 2022. COLIEE 2022 summary: Methods for legal document retrieval and entailment. In JSAI International Symposium on Artificial Intelligence. Springer, 51–67.
  13. [13] Patrick Lewis, Ethan Perez, Aleksandra Piktus, Fabio Petroni, Vladimir Karpukhin, Naman Goyal, Heinrich Küttler, Mike Lewis, Wen-tau Yih, Tim Rocktäschel, et al. 2020. Retrieval-augmented generation for knowledge-intensive NLP tasks. Advances in Neural Information Processing Systems 33 (2020), 9459–9474.
  14. [14] Haitao Li, Qingyao Ai, Jia Chen, Qian Dong, Yueyue Wu, Yiqun Liu, Chong Chen, and Qi Tian. 2023. SAILER: structure-aware pre-trained language model for legal case retrieval. In Proceedings of the 46th International ACM SIGIR Conference on Research and Development in Information Retrieval. 1035–1044.
  15. [15] Shang Li, Hongli Zhang, Lin Ye, Xiaoding Guo, and Binxing Fang. 2019. MANN: A multichannel attentive neural network for legal judgment prediction. IEEE Access 7 (2019), 151144–151155.
  16. [16] Bingfeng Luo, Yansong Feng, Jianbo Xu, Xiang Zhang, and Dongyan Zhao. 2017. Learning to predict charges for criminal cases with legal basis. arXiv preprint arXiv:1707.09168.
  17. [17] Yixiao Ma, Yueyue Wu, Weihang Su, Qingyao Ai, and Yiqun Liu. 2023. CaseEncoder: A Knowledge-enhanced Pre-trained Model for Legal Case Encoding. arXiv preprint arXiv:2305.05393.
  18. [18] Shubham Kumar Nigam, Aniket Deroy, Subhankar Maity, and Arnab Bhattacharya. 2024. Rethinking Legal Judgement Prediction in a Realistic Scenario in the Era of Large Language Models. arXiv preprint arXiv:2410.10542 [cs.CL].
  19. [19] Juliano Rabelo, Randy Goebel, Mi-Young Kim, Yoshinobu Kano, Masaharu Yoshioka, and Ken Satoh. 2022. Overview and discussion of the competition on legal information extraction/entailment (COLIEE) 2021. The Review of Socionetwork Strategies 16, 1 (2022), 111–133.
  20. [20] Stephen Robertson, Hugo Zaragoza, et al. 2009. The probabilistic relevance framework: BM25 and beyond. Foundations and Trends® in Information Retrieval 3, 4 (2009), 333–389.
  21. [21] Weihang Su, Qingyao Ai, Xiangsheng Li, Jia Chen, Yiqun Liu, Xiaolong Wu, and Shengluan Hou. 2024. Wikiformer: Pre-training with structured information of Wikipedia for ad-hoc retrieval. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 38. 19026–19034.
  22. [22] Weihang Su, Qingyao Ai, Yueyue Wu, Yixiao Ma, Haitao Li, and Yiqun Liu. 2023. Caseformer: Pre-training for Legal Case Retrieval. arXiv preprint arXiv:2311.00333.
  23. [23] Weihang Su, Qingyao Ai, Yueyue Wu, Anzhe Xie, Changyue Wang, Yixiao Ma, Haitao Li, Zhijing Wu, Yiqun Liu, and Min Zhang. 2025. Pre-training for legal case retrieval based on inter-case distinctions. ACM Transactions on Information Systems 43, 5 (2025), 1–27.
  24. [24] Weihang Su, Qingyao Ai, Jingtao Zhan, Qian Dong, and Yiqun Liu. 2025. Dynamic and parametric retrieval-augmented generation. In Proceedings of the 48th International ACM SIGIR Conference on Research and Development in Information Retrieval. 4118–4121.
  25. [25] Weihang Su, Qian Dong, Qingyao Ai, and Yiqun Liu. 2025. SIGIR-AP 2025 Tutorial Proposal: Dynamic and Parametric Retrieval-Augmented Generation. In 3rd International ACM SIGIR Conference on Information Retrieval in the Asia Pacific.
  26. [26] Weihang Su, Yiran Hu, Anzhe Xie, Qingyao Ai, Quezi Bing, Ning Zheng, Yun Liu, Weixing Shen, and Yiqun Liu. 2024. STARD: A Chinese Statute Retrieval Dataset Derived from Real-life Queries by Non-professionals. In Findings of the Association for Computational Linguistics: EMNLP 2024, Yaser Al-Onaizan, Mohit Bansal, and Yun-Nung Chen (Eds.).
  27. [27] Weihang Su, Jianming Long, Qingyao Ai, Yichen Tang, Changyue Wang, Yiteng Tu, and Yiqun Liu. 2026. Skill Retrieval Augmentation for Agentic AI. arXiv preprint arXiv:2604.24594.
  28. [28] Weihang Su, Jianming Long, Changyue Wang, Shiyu Lin, Jingyan Xu, Ziyi Ye, Qingyao Ai, and Yiqun Liu. 2025. Towards Unification of Hallucination Detection and Fact Verification for Large Language Models. arXiv preprint arXiv:2512.02772.
  29. [29] Weihang Su, Yichen Tang, Qingyao Ai, Changyue Wang, Zhijing Wu, and Yiqun Liu. 2024. Mitigating entity-level hallucination in large language models. In Proceedings of the 2024 Annual International ACM SIGIR Conference on Research and Development in Information Retrieval in the Asia Pacific Region. 23–31.
  30. [30] Weihang Su, Yichen Tang, Qingyao Ai, Zhijing Wu, and Yiqun Liu. 2024. DRAGIN: Dynamic retrieval augmented generation based on the real-time information needs of large language models. arXiv preprint arXiv:2403.10081.
  31. [31] Weihang Su, Yichen Tang, Qingyao Ai, Junxi Yan, Changyue Wang, Hongning Wang, Ziyi Ye, Yujia Zhou, and Yiqun Liu. 2025. Parametric Retrieval Augmented Generation. arXiv preprint arXiv:2501.15915.
  32. [32] Weihang Su, Changyue Wang, Qingyao Ai, Yiran Hu, Zhijing Wu, Yujia Zhou, and Yiqun Liu. 2024. Unsupervised real-time hallucination detection based on the internal states of large language models. arXiv preprint arXiv:2403.06448.
  33. [33] Weihang Su, Anzhe Xie, Qingyao Ai, Jianming Long, Xuanyi Chen, Jiaxin Mao, Ziyi Ye, and Yiqun Liu. 2025. SurGE: A benchmark and evaluation framework for scientific survey generation. arXiv preprint arXiv:2508.15658.
  34. [34] Weihang Su, Baoqing Yue, Qingyao Ai, Yiran Hu, Jiaqi Li, Changyue Wang, Kaiyuan Zhang, Yueyue Wu, and Yiqun Liu. 2025. JuDGE: Benchmarking judgment document generation for Chinese legal system. In Proceedings of the 48th International ACM SIGIR Conference on Research and Development in Information Retrieval. 3573–3583.
  35. [35] Yuqiao Tan, Shizhu He, Huanxuan Liao, Jun Zhao, and Kang Liu. 2025. Dynamic parametric retrieval augmented generation for test-time knowledge enhancement. arXiv preprint arXiv:2503.23895.
  36. [36] Yiteng Tu, Shuo Miao, Weihang Su, Yiqun Liu, and Qingyao Ai. 2026. Analytical Search. arXiv preprint arXiv:2602.11581.
  37. [37] Yiteng Tu, Weihang Su, Yujia Zhou, Yiqun Liu, and Qingyao Ai. 2025. Robust Fine-tuning for Retrieval Augmented Generation against Retrieval Defects. In Proceedings of the 48th International ACM SIGIR Conference on Research and Development in Information Retrieval. 1272–1282.
  38. [38] Yiteng Tu, Weihang Su, Yujia Zhou, Yiqun Liu, Fen Lin, Qin Liu, and Qingyao Ai.
  39. [39] Generalized Pseudo-Relevance Feedback. In Proceedings of the ACM Web Conference 2026. 1876–1886.
  40. [40] Changyue Wang, Weihang Su, Qingyao Ai, and Yiqun Liu. 2026. Joint evaluation of answer and reasoning consistency for hallucination detection in large reasoning models. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 40. 33377–33385.
  41. [41] Changyue Wang, Weihang Su, Qingyao Ai, Yichen Tang, and Yiqun Liu. 2025. Knowledge editing through chain-of-thought. In Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing. 10684–10704.
  42. [42] Changyue Wang, Weihang Su, Qingyao Ai, Yujia Zhou, and Yiqun Liu. 2025. Decoupling Reasoning and Knowledge Injection for In-Context Knowledge Editing. arXiv preprint arXiv:2506.00536.
  43. [43] Changyue Wang, Weihang Su, Hu Yiran, Qingyao Ai, Yueyue Wu, Cheng Luo, Yiqun Liu, Min Zhang, and Shaoping Ma. 2024. LeKUBE: A Legal Knowledge Update BEnchmark. arXiv preprint arXiv:2407.14192.
  44. [44] Kaiyuan Zhang, Jiaqi Li, Yueyue Wu, Haitao Li, Cheng Luo, Shaokun Zou, Yujia Zhou, Weihang Su, Qingyao Ai, and Yiqun Liu. 2025. Chinese Court Simulation with LLM-Based Agent System. arXiv preprint arXiv:2508.17322.