
arxiv: 2605.21071 · v3 · pith:RCAPDCKJnew · submitted 2026-05-20 · 💻 cs.CL · cs.AI

Fine-grained Claim-level RAG Benchmark for Law

Pith reviewed 2026-05-25 05:53 UTC · model grok-4.3

classification 💻 cs.CL cs.AI
keywords legal RAG · claim-level evaluation · multilingual benchmark · retrieval-augmented generation · legal AI · evaluation framework · dataset · French English

The pith

ClaimRAG-LAW supplies a multilingual dataset and claim-level framework that separates retrieval and generation performance in legal RAG systems.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Legal applications require RAG to ground LLM answers and reduce hallucinations, yet existing benchmarks provide only coarse English-only expert queries that do not isolate retrieval errors from generation errors. The paper presents ClaimRAG-LAW, a dataset built for French and English, expert and non-expert users, and varied realistic question types, each annotated at the claim level. It then runs a fine-grained evaluation protocol on current legal RAG systems that measures retrieval, generation, and claim-level correctness independently. The evaluation surfaces concrete shortcomings in all three stages when applied to legal material. The resulting resource is intended to let developers target fixes at specific pipeline stages rather than treating RAG as a black box.

Core claim

We introduce ClaimRAG-LAW, a comprehensive dataset for legal RAG that supports French and English, targets both experts and non-experts, and includes diverse question types reflecting realistic scenarios. We further apply a fine-grained evaluation framework of state-of-the-art legal RAG systems, revealing limitations in retrieval, generation, and claim-level analysis in the legal domain.

What carries the argument

ClaimRAG-LAW dataset together with its claim-level annotations and the accompanying fine-grained evaluation framework that measures retrieval, generation, and claim accuracy separately.

If this is right

  • Developers can now diagnose whether a legal RAG failure originates in retrieval, generation, or claim extraction.
  • Benchmarks can be extended to non-English languages and non-expert users without losing granularity.
  • Legal RAG systems can be compared on retrieval quality alone or generation quality alone rather than on end-to-end accuracy.
  • Claim-level annotations allow error analysis at the level of individual factual statements instead of whole answers.
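The stage-by-stage diagnosis described in these points can be illustrated with a small set-based sketch. This is not the paper's evaluation framework: the names `ClaimEval` and `evaluate_claims` are hypothetical, claims are treated as plain strings, and exact string equality stands in for whatever claim-matching or entailment step a real system would use.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ClaimEval:
    retrieval_recall: float       # share of gold claims found in retrieved context
    generation_precision: float   # share of answer claims that are gold claims
    generation_recall: float      # share of gold claims stated in the answer
    retrieval_misses: frozenset   # gold claims the retriever never surfaced
    generation_misses: frozenset  # gold claims retrieved but left out of the answer


def _ratio(num: int, den: int) -> float:
    # Guard against empty sets instead of raising ZeroDivisionError.
    return num / den if den else 0.0


def evaluate_claims(gold, retrieved, answer) -> ClaimEval:
    """Score one question; exact string match is an assumed stand-in for claim matching."""
    gold, retrieved, answer = frozenset(gold), frozenset(retrieved), frozenset(answer)
    missing = gold - answer  # gold claims absent from the generated answer
    return ClaimEval(
        retrieval_recall=_ratio(len(gold & retrieved), len(gold)),
        generation_precision=_ratio(len(gold & answer), len(answer)),
        generation_recall=_ratio(len(gold & answer), len(gold)),
        retrieval_misses=missing - retrieved,
        generation_misses=missing & retrieved,
    )


# Made-up claim identifiers, for illustration only.
result = evaluate_claims(
    gold={"c1", "c2", "c3", "c4"},
    retrieved={"c1", "c2", "c3", "x"},
    answer={"c1", "c2", "y"},
)
print(result)
```

In this toy case the missing claim `c4` is attributed to retrieval because it never reached the context, while `c3` is attributed to generation because it was retrieved but not stated. End-to-end accuracy alone would not separate the two.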

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same claim-level separation could be applied to RAG evaluation in medicine or finance where factual precision is also critical.
  • The dataset may expose that general-purpose RAG systems perform even worse on legal material than domain-specific ones.
  • Future expansions could test whether the framework identifies the same failure modes when applied to newer model families.

Load-bearing premise

The dataset's question types and claim annotations accurately represent realistic legal scenarios for both experts and non-experts, and the evaluation framework separates retrieval from generation performance without adding its own biases.

What would settle it

An independent audit that shows real-world legal queries or user satisfaction scores diverge markedly from the patterns measured by ClaimRAG-LAW's claim-level metrics.

Figures

Figures reproduced from arXiv: 2605.21071 by Domenico Bianculli, Sallam Abualhaija, Souvick Das.

Figure 1: System Prompt used for the Conditional Generation of Single-hop QA tuples.
Figure 2: User Prompt for single-hop dataset generation.
Figure 3: System Prompt used for the Conditional Generation of Multi-hop QA tuples.
Figure 4: User Prompt for multi-hop dataset generation.
Original abstract

The rapid progress of large language models (LLMs) is shifting semantic search toward a question-answering paradigm, where users ask questions and LLMs generate responses. In high-stake domains such as law, retrieval-augmented generation (RAG) is commonly used to mitigate hallucinations in generated responses. Nonetheless, prior work shows that RAG systems, whether general-purpose or legal-specific, still hallucinate at varying rates, making fine-grained evaluation essential. Despite the need, existing evaluation frameworks for legal RAG systems lack the granularity required to provide detailed analysis of retrieval and generation performance separately. Moreover, current benchmarks are largely English-only and centered on legal expert queries, overlooking non-expert needs. We introduce ClaimRAG-LAW, a comprehensive dataset for legal RAG that supports French and English, targets both experts and non-experts, and includes diverse question types reflecting realistic scenarios. We further apply a fine-grained evaluation framework of state-of-the-art legal RAG systems, revealing limitations in retrieval, generation, and claim-level analysis in the legal domain.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The paper introduces ClaimRAG-LAW, a new dataset for legal RAG that is bilingual (French/English), targets both experts and non-experts, and includes diverse question types with claim-level annotations reflecting realistic scenarios. It also describes a fine-grained evaluation framework applied to state-of-the-art legal RAG systems that separates retrieval and generation performance and reports limitations in both plus claim-level analysis.

Significance. If the dataset construction, annotation quality, and evaluation framework hold up under scrutiny, the work would address documented gaps in legal RAG benchmarks (English-only, expert-only, coarse-grained metrics) and supply a reusable resource with claim-level granularity. The explicit separation of retrieval versus generation errors is a potentially useful methodological contribution.

major comments (2)
  1. [Abstract] The central claims that the dataset 'includes diverse question types reflecting realistic scenarios' and that the framework reveals 'limitations in retrieval, generation, and claim-level analysis' are presented without any accompanying dataset size, dataset statistics, annotation guidelines, inter-annotator agreement, or quantitative results. These details are load-bearing for assessing whether the realism and bias-free separation assumptions hold.
  2. No section or table supplies the concrete construction process, validation steps, or example claim-level annotations needed to evaluate the weakest assumption that question types and annotations accurately capture realistic legal scenarios for both experts and non-experts.
minor comments (1)
  1. The manuscript should include at minimum a table of dataset statistics (number of questions, claims, documents per language and user type) and at least one worked example of a question, retrieved passages, generated answer, and claim-level breakdown.
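One common way to report the inter-annotator agreement requested above is Cohen's kappa for two annotators. The sketch below is an illustration under stated assumptions: the summary does not say which agreement statistic the authors use, and the function name `cohens_kappa` and the labels `supported` / `unsupported` are hypothetical.

```python
from collections import Counter


def cohens_kappa(labels_a, labels_b) -> float:
    """Cohen's kappa for two annotators labelling the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two non-empty label sequences of equal length")
    n = len(labels_a)
    # Observed agreement: fraction of items given the same label by both.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement: product of each annotator's marginal label frequencies.
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    expected = sum(freq_a[k] * freq_b[k] for k in freq_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both annotators used one identical label throughout
    return (observed - expected) / (1 - expected)


# Made-up claim labels from two hypothetical annotators.
annotator_1 = ["supported", "supported", "unsupported", "supported"]
annotator_2 = ["supported", "unsupported", "unsupported", "supported"]
kappa = cohens_kappa(annotator_1, annotator_2)
print(kappa)
```

Here the annotators agree on three of four claims (observed agreement 0.75) against a chance agreement of 0.5, so kappa is 0.5. Reporting kappa alongside raw agreement shows how much of the agreement exceeds chance.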

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for highlighting the need for greater transparency in the abstract and dataset construction details to allow proper evaluation of the benchmark's realism and utility. We address each major comment below and will revise the manuscript accordingly.

Point-by-point responses
  1. Referee: [Abstract] The central claims that the dataset 'includes diverse question types reflecting realistic scenarios' and that the framework reveals 'limitations in retrieval, generation, and claim-level analysis' are presented without any accompanying dataset size, dataset statistics, annotation guidelines, inter-annotator agreement, or quantitative results. These details are load-bearing for assessing whether the realism and bias-free separation assumptions hold.

    Authors: We agree that the abstract, while concise, should include key supporting details to substantiate the central claims. In the revised version we will expand the abstract to report dataset size, number of claims per category, inter-annotator agreement scores, and high-level quantitative results on retrieval and generation performance. This will make the realism and separation assumptions directly evaluable from the abstract itself.

    Revision: yes

  2. Referee: No section or table supplies the concrete construction process, validation steps, or example claim-level annotations needed to evaluate the weakest assumption that question types and annotations accurately capture realistic legal scenarios for both experts and non-experts.

    Authors: We acknowledge that a dedicated, explicit description of the construction pipeline is necessary for readers to assess whether the question types and claim-level annotations reflect realistic legal scenarios. We will add a new subsection (or substantially expand the existing dataset section) that details the full construction process, validation steps, annotation guidelines, and provides concrete examples of claim-level annotations for both expert and non-expert queries. This revision will directly address the concern.

    Revision: yes

Circularity Check

0 steps flagged

No significant circularity: dataset and framework introduction

Full rationale

The paper introduces ClaimRAG-LAW dataset and a fine-grained evaluation framework for legal RAG. No equations, derivations, fitted parameters, or load-bearing self-citations appear in the abstract or described content. The work does not reduce any claim to prior self-cited results or by-construction identities; it is self-contained as a new benchmark contribution evaluated against external SOTA systems.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

This is an empirical benchmark-introduction paper with no mathematical derivations, fitted parameters, or postulated entities.



