Fine-grained Claim-level RAG Benchmark for Law
Pith reviewed 2026-05-25 05:53 UTC · model grok-4.3
The pith
ClaimRAG-LAW supplies a multilingual dataset and claim-level framework that separates retrieval and generation performance in legal RAG systems.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
We introduce ClaimRAG-LAW, a comprehensive dataset for legal RAG that supports French and English, targets both experts and non-experts, and includes diverse question types reflecting realistic scenarios. We further apply a fine-grained evaluation framework of state-of-the-art legal RAG systems, revealing limitations in retrieval, generation, and claim-level analysis in the legal domain.
What carries the argument
The ClaimRAG-LAW dataset and its claim-level annotations, together with the accompanying fine-grained evaluation framework, which measures retrieval, generation, and claim accuracy separately.
If this is right
- Developers can now diagnose whether a legal RAG failure originates in retrieval, generation, or claim extraction.
- Benchmarks can be extended to non-English languages and non-expert users without losing granularity.
- Legal RAG systems can be compared on retrieval quality alone or generation quality alone rather than on end-to-end accuracy.
- Claim-level annotations allow error analysis at the level of individual factual statements instead of whole answers.
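The diagnostic split listed above can be illustrated with a minimal Python sketch. This is not the paper's framework: the record layout (`ClaimJudgments`), the metric names, and the assumption that entailment judgments are already available as sets of claim ids are all hypothetical, chosen only to show how claim-level counts let retrieval and generation be scored separately.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ClaimJudgments:
    """Hypothetical per-question record (not the ClaimRAG-LAW schema).

    Entailment judgments are assumed to be given, e.g. by annotators
    or an NLI model; this sketch only aggregates them.
    """
    gold_claims: frozenset   # claim ids in the reference answer
    in_context: frozenset    # gold claims entailed by the retrieved passages
    in_response: frozenset   # gold claims entailed by the generated answer


def diagnose(j: ClaimJudgments) -> dict:
    """Return retrieval-side, generation-side, and end-to-end claim scores."""
    n = len(j.gold_claims)
    if n == 0:
        raise ValueError("question has no gold claims")
    retrieved = j.in_context & j.gold_claims
    answered = j.in_response & j.gold_claims
    # Retrieval side: share of gold claims the retriever surfaced.
    retrieval_recall = len(retrieved) / n
    # Generation side: of the surfaced claims, share that reached the answer.
    context_utilization = (
        len(answered & retrieved) / len(retrieved) if retrieved else 0.0
    )
    # End to end: share of gold claims present in the answer.
    answer_recall = len(answered) / n
    return {
        "retrieval_recall": retrieval_recall,
        "context_utilization": context_utilization,
        "answer_recall": answer_recall,
    }


if __name__ == "__main__":
    # Toy judgments, invented for illustration only.
    example = ClaimJudgments(
        gold_claims=frozenset({"c1", "c2", "c3", "c4"}),
        in_context=frozenset({"c1", "c2", "c3"}),
        in_response=frozenset({"c1", "c2"}),
    )
    print(diagnose(example))
```

Under these assumptions, a low `retrieval_recall` points at the retriever, while a high `retrieval_recall` paired with a low `context_utilization` points at the generator.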
Where Pith is reading between the lines
- The same claim-level separation could be applied to RAG evaluation in medicine or finance where factual precision is also critical.
- The dataset may expose that general-purpose RAG systems perform even worse on legal material than domain-specific ones.
- Future expansions could test whether the framework identifies the same failure modes when applied to newer model families.
Load-bearing premise
The dataset's question types and claim annotations accurately represent realistic legal scenarios for both experts and non-experts, and the evaluation framework separates retrieval from generation performance without adding its own biases.
What would settle it
An independent audit that shows real-world legal queries or user satisfaction scores diverge markedly from the patterns measured by ClaimRAG-LAW's claim-level metrics.
Original abstract
The rapid progress of large language models (LLMs) is shifting semantic search toward a question-answering paradigm, where users ask questions and LLMs generate responses. In high-stake domains such as law, retrieval-augmented generation (RAG) is commonly used to mitigate hallucinations in generated responses. Nonetheless, prior work shows that RAG systems, whether general-purpose or legal-specific, still hallucinate at varying rates, making fine-grained evaluation essential. Despite the need, existing evaluation frameworks for legal RAG systems lack the granularity required to provide detailed analysis of retrieval and generation performance separately. Moreover, current benchmarks are largely English-only and centered on legal expert queries, overlooking non-expert needs. We introduce ClaimRAG-LAW, a comprehensive dataset for legal RAG that supports French and English, targets both experts and non-experts, and includes diverse question types reflecting realistic scenarios. We further apply a fine-grained evaluation framework of state-of-the-art legal RAG systems, revealing limitations in retrieval, generation, and claim-level analysis in the legal domain.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces ClaimRAG-LAW, a new dataset for legal RAG that is bilingual (French/English), targets both experts and non-experts, and includes diverse question types with claim-level annotations reflecting realistic scenarios. It also describes a fine-grained evaluation framework, applied to state-of-the-art legal RAG systems, that scores retrieval and generation separately and reports limitations in retrieval, generation, and claim-level analysis.
Significance. If the dataset construction, annotation quality, and evaluation framework hold up under scrutiny, the work would address documented gaps in legal RAG benchmarks (English-only, expert-only, coarse-grained metrics) and supply a reusable resource with claim-level granularity. The explicit separation of retrieval versus generation errors is a potentially useful methodological contribution.
major comments (2)
- [Abstract] The central claims, that the dataset 'includes diverse question types reflecting realistic scenarios' and that the framework reveals 'limitations in retrieval, generation, and claim-level analysis', are presented without any accompanying dataset size, statistics, annotation guidelines, inter-annotator agreement, or quantitative results. These details are load-bearing for assessing whether the realism and bias-free separation assumptions hold.
- No section or table supplies the concrete construction process, validation steps, or example claim-level annotations needed to evaluate the weakest assumption that question types and annotations accurately capture realistic legal scenarios for both experts and non-experts.
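For readers unfamiliar with the inter-annotator agreement the referee asks for, the following is a minimal sketch of Cohen's kappa for two annotators labelling the same claims. Cohen's kappa is a standard statistic, but nothing in the reviewed material says which agreement measure the authors use; the function name and the label values are assumptions made for illustration.

```python
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators over the same items.

    kappa = (p_o - p_e) / (1 - p_e), where p_o is the observed agreement
    and p_e is the agreement expected by chance from each annotator's
    label frequencies.
    """
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two non-empty label lists of equal length")
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    freq_a = Counter(labels_a)
    freq_b = Counter(labels_b)
    expected = sum(freq_a[k] * freq_b[k] for k in freq_a) / (n * n)
    if expected == 1.0:
        # Both annotators used a single identical label; agreement is trivial.
        return 1.0
    return (observed - expected) / (1.0 - expected)


if __name__ == "__main__":
    # Toy labels, invented for illustration only.
    annotator_1 = ["supported", "supported", "unsupported", "unsupported"]
    annotator_2 = ["supported", "unsupported", "unsupported", "unsupported"]
    print(cohens_kappa(annotator_1, annotator_2))
```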
minor comments (1)
- The manuscript should include at minimum a table of dataset statistics (number of questions, claims, documents per language and user type) and at least one worked example of a question, retrieved passages, generated answer, and claim-level breakdown.
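A table of the kind requested here could be produced along the following lines. This is a sketch only: the record fields (`language`, `user_type`, `claims`) are hypothetical and are not taken from the released dataset, and the toy records are invented.

```python
from collections import defaultdict


def dataset_stats(records):
    """Count questions and claims per (language, user_type) pair.

    Each record is assumed to be a dict with hypothetical keys
    'language', 'user_type', and 'claims' (a list of claim strings).
    """
    stats = defaultdict(lambda: {"questions": 0, "claims": 0})
    for record in records:
        key = (record["language"], record["user_type"])
        stats[key]["questions"] += 1
        stats[key]["claims"] += len(record["claims"])
    return dict(stats)


if __name__ == "__main__":
    # Toy records, invented for illustration only.
    toy = [
        {"language": "fr", "user_type": "expert", "claims": ["a", "b"]},
        {"language": "fr", "user_type": "expert", "claims": ["c"]},
        {"language": "en", "user_type": "non-expert", "claims": []},
    ]
    for (language, user_type), row in sorted(dataset_stats(toy).items()):
        print(language, user_type, row["questions"], row["claims"])
```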
Simulated Author's Rebuttal
We thank the referee for highlighting the need for greater transparency in the abstract and dataset construction details to allow proper evaluation of the benchmark's realism and utility. We address each major comment below and will revise the manuscript accordingly.
- Referee: [Abstract] The central claims, that the dataset 'includes diverse question types reflecting realistic scenarios' and that the framework reveals 'limitations in retrieval, generation, and claim-level analysis', are presented without any accompanying dataset size, statistics, annotation guidelines, inter-annotator agreement, or quantitative results. These details are load-bearing for assessing whether the realism and bias-free separation assumptions hold.
Authors: We agree that the abstract, while concise, should include key supporting details to substantiate the central claims. In the revised version we will expand the abstract to report dataset size, number of claims per category, inter-annotator agreement scores, and high-level quantitative results on retrieval and generation performance. This will make the realism and separation assumptions directly evaluable from the abstract itself.
Revision: yes
- Referee: No section or table supplies the concrete construction process, validation steps, or example claim-level annotations needed to evaluate the weakest assumption that question types and annotations accurately capture realistic legal scenarios for both experts and non-experts.
Authors: We acknowledge that a dedicated, explicit description of the construction pipeline is necessary for readers to assess whether the question types and claim-level annotations reflect realistic legal scenarios. We will add a new subsection (or substantially expand the existing dataset section) that details the full construction process, validation steps, annotation guidelines, and provides concrete examples of claim-level annotations for both expert and non-expert queries. This revision will directly address the concern.
Revision: yes
Circularity Check
No significant circularity: dataset and framework introduction
The paper introduces ClaimRAG-LAW dataset and a fine-grained evaluation framework for legal RAG. No equations, derivations, fitted parameters, or load-bearing self-citations appear in the abstract or described content. The work does not reduce any claim to prior self-cited results or by-construction identities; it is self-contained as a new benchmark contribution evaluated against external SOTA systems.
Axiom & Free-Parameter Ledger
Lean theorems connected to this paper
- IndisputableMonolith/Foundation/RealityFromDistinction.lean · reality_from_one_distinction · tag: unclear
Relation between the paper passage and the cited Recognition theorem: unclear.
Passage: "We introduce ClaimRAG-LAW, a comprehensive dataset for legal RAG that supports French and English, targets both experts and non-experts, and includes diverse question types reflecting realistic scenarios."
- IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · tag: unclear
Relation between the paper passage and the cited Recognition theorem: unclear.
Passage: "We further apply a fine-grained evaluation framework of state-of-the-art legal RAG systems, revealing limitations in retrieval, generation, and claim-level analysis in the legal domain."
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.