Fine-grained Claim-level RAG Benchmark for Law

Domenico Bianculli; Sallam Abualhaija; Souvick Das

arxiv: 2605.21071 · v3 · pith:RCAPDCKJnew · submitted 2026-05-20 · 💻 cs.CL · cs.AI

Pith reviewed 2026-05-25 05:53 UTC · model grok-4.3

keywords: legal RAG, claim-level evaluation, multilingual benchmark, retrieval-augmented generation, legal AI, evaluation framework, dataset, French English

The pith

ClaimRAG-LAW supplies a multilingual dataset and claim-level framework that separates retrieval and generation performance in legal RAG systems.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Legal applications require RAG to ground LLM answers and reduce hallucinations, yet existing benchmarks provide only coarse English-only expert queries that do not isolate retrieval errors from generation errors. The paper presents ClaimRAG-LAW, a dataset built for French and English, expert and non-expert users, and varied realistic question types, each annotated at the claim level. It then runs a fine-grained evaluation protocol on current legal RAG systems that measures retrieval, generation, and claim-level correctness independently. The evaluation surfaces concrete shortcomings in all three stages when applied to legal material. The resulting resource is intended to let developers target fixes at specific pipeline stages rather than treating RAG as a black box.

Core claim

We introduce ClaimRAG-LAW, a comprehensive dataset for legal RAG that supports French and English, targets both experts and non-experts, and includes diverse question types reflecting realistic scenarios. We further apply a fine-grained evaluation framework of state-of-the-art legal RAG systems, revealing limitations in retrieval, generation, and claim-level analysis in the legal domain.

What carries the argument

The argument rests on the ClaimRAG-LAW dataset and its claim-level annotations, together with the accompanying fine-grained evaluation framework, which measures retrieval, generation, and claim accuracy separately.

If this is right

Developers can now diagnose whether a legal RAG failure originates in retrieval, generation, or claim extraction.
Benchmarks can be extended to non-English languages and non-expert users without losing granularity.
Legal RAG systems can be compared on retrieval quality alone or generation quality alone rather than on end-to-end accuracy.
Claim-level annotations allow error analysis at the level of individual factual statements instead of whole answers.
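
The stage-level diagnosis listed above can be illustrated with a minimal sketch. It assumes, purely for illustration, that every gold claim has already been judged on two binary questions: whether some retrieved passage supports it, and whether the generated answer states it. The `ClaimJudgement` record, the stage labels, and the function names are hypothetical; they are not taken from the paper's framework, whose actual metrics are not described on this page.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class ClaimJudgement:
    """One gold claim with two hypothetical binary judgements."""
    claim: str
    in_retrieved: bool  # some retrieved passage supports the claim
    in_answer: bool     # the generated answer states the claim


def attribute(j: ClaimJudgement) -> str:
    """Assign a gold claim to the pipeline stage that kept or lost it."""
    if j.in_answer and j.in_retrieved:
        return "ok"
    if j.in_answer:
        # Stated without retrieved support: present but ungrounded.
        return "ungrounded"
    if j.in_retrieved:
        # Evidence was available, yet the generator dropped the claim.
        return "generation_miss"
    return "retrieval_miss"


def stage_report(judgements: list[ClaimJudgement]) -> dict[str, float]:
    """Return retrieval and generation claim recall plus per-stage counts."""
    n = len(judgements)
    counts = Counter(attribute(j) for j in judgements)
    retrieved = sum(j.in_retrieved for j in judgements)
    answered = sum(j.in_retrieved and j.in_answer for j in judgements)
    return {
        # Share of gold claims that the retriever surfaced at all.
        "retrieval_claim_recall": retrieved / n if n else 0.0,
        # Share of surfaced claims that the generator actually used.
        "generation_recall_given_retrieved": answered / retrieved if retrieved else 0.0,
        **{f"n_{stage}": float(count) for stage, count in counts.items()},
    }
```

Conditioning the generation score on retrieved claims is one design choice among several; it keeps a weak retriever from dragging down the generator's score, which is the kind of separation the page attributes to the benchmark.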

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same claim-level separation could be applied to RAG evaluation in medicine or finance where factual precision is also critical.
The dataset may expose that general-purpose RAG systems perform even worse on legal material than domain-specific ones.
Future expansions could test whether the framework identifies the same failure modes when applied to newer model families.

Load-bearing premise

The dataset's question types and claim annotations accurately represent realistic legal scenarios for both experts and non-experts, and the evaluation framework separates retrieval from generation performance without adding its own biases.

What would settle it

An independent audit that shows real-world legal queries or user satisfaction scores diverge markedly from the patterns measured by ClaimRAG-LAW's claim-level metrics.

Figures

Figures reproduced from arXiv: 2605.21071 by Domenico Bianculli, Sallam Abualhaija, Souvick Das.

Figure 2: User Prompt for single-hop dataset generation.

Figure 3: System Prompt used for the Conditional Generation of Multi-hop QA tuples.

Figure 4: User Prompt for multi-hop dataset generation.

Original abstract

The rapid progress of large language models (LLMs) is shifting semantic search toward a question-answering paradigm, where users ask questions and LLMs generate responses. In high-stake domains such as law, retrieval-augmented generation (RAG) is commonly used to mitigate hallucinations in generated responses. Nonetheless, prior work shows that RAG systems, whether general-purpose or legal-specific, still hallucinate at varying rates, making fine-grained evaluation essential. Despite the need, existing evaluation frameworks for legal RAG systems lack the granularity required to provide detailed analysis of retrieval and generation performance separately. Moreover, current benchmarks are largely English-only and centered on legal expert queries, overlooking non-expert needs. We introduce ClaimRAG-LAW, a comprehensive dataset for legal RAG that supports French and English, targets both experts and non-experts, and includes diverse question types reflecting realistic scenarios. We further apply a fine-grained evaluation framework of state-of-the-art legal RAG systems, revealing limitations in retrieval, generation, and claim-level analysis in the legal domain.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The paper introduces ClaimRAG-LAW, a new dataset for legal RAG that is bilingual (French/English), targets both experts and non-experts, and includes diverse question types with claim-level annotations reflecting realistic scenarios. It also describes a fine-grained evaluation framework, applied to state-of-the-art legal RAG systems, that separates retrieval performance from generation performance and reports limitations in both, as well as in claim-level analysis.

Significance. If the dataset construction, annotation quality, and evaluation framework hold up under scrutiny, the work would address documented gaps in legal RAG benchmarks (English-only, expert-only, coarse-grained metrics) and supply a reusable resource with claim-level granularity. The explicit separation of retrieval versus generation errors is a potentially useful methodological contribution.

major comments (2)

[Abstract] The central claims, that the dataset 'includes diverse question types reflecting realistic scenarios' and that the evaluation reveals 'limitations in retrieval, generation, and claim-level analysis', are presented without accompanying dataset statistics (including size), annotation guidelines, inter-annotator agreement, or quantitative results. These details are load-bearing for assessing whether the realism and bias-free separation assumptions hold.
[—] No section or table supplies the concrete construction process, validation steps, or example claim-level annotations needed to evaluate the weakest assumption, namely that question types and annotations accurately capture realistic legal scenarios for both experts and non-experts.
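
As a concrete reading of the inter-annotator agreement the comments above ask for, the following is a minimal sketch of Cohen's kappa for two annotators labelling the same claims. The choice of kappa is an assumption made here for illustration; the page does not say which agreement statistic, if any, the authors use, and the function name and label values are hypothetical.

```python
from collections import Counter


def cohen_kappa(labels_a: list[str], labels_b: list[str]) -> float:
    """Cohen's kappa: chance-corrected agreement between two annotators."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("both annotators must label the same non-empty item list")
    n = len(labels_a)
    # Observed agreement: fraction of items given the same label.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Expected agreement if each annotator labelled independently
    # according to their own label frequencies.
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    categories = freq_a.keys() | freq_b.keys()
    expected = sum(freq_a[c] * freq_b[c] for c in categories) / (n * n)
    if expected == 1.0:
        # Both annotators used a single identical label throughout.
        return 1.0
    return (observed - expected) / (1.0 - expected)
```

Raw percentage agreement would overstate reliability when one label dominates, which is why a chance-corrected statistic is the more informative number to report alongside the annotation guidelines.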

minor comments (1)

The manuscript should include at minimum a table of dataset statistics (number of questions, claims, documents per language and user type) and at least one worked example of a question, retrieved passages, generated answer, and claim-level breakdown.
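
The statistics table requested above could be produced along these lines. This is a sketch under stated assumptions: the record fields `language`, `user_type`, and `claims` are hypothetical names chosen for illustration and may not match the schema of the released dataset.

```python
from collections import Counter
from typing import Iterable


def dataset_stats(records: Iterable[dict]) -> dict[tuple[str, str], dict[str, int]]:
    """Count questions and annotated claims per (language, user type).

    Each record is assumed to be a dict with the hypothetical keys
    "language", "user_type", and "claims" (a list of claim strings).
    """
    questions: Counter = Counter()
    claims: Counter = Counter()
    for record in records:
        key = (record["language"], record["user_type"])
        questions[key] += 1
        claims[key] += len(record["claims"])
    # Sort keys so the resulting table rows have a stable order.
    return {
        key: {"questions": questions[key], "claims": claims[key]}
        for key in sorted(questions)
    }
```

Each key of the returned mapping corresponds to one row of the table; a per-language document count would need a corpus-level field that this sketch does not assume.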

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for highlighting the need for greater transparency in the abstract and dataset construction details to allow proper evaluation of the benchmark's realism and utility. We address each major comment below and will revise the manuscript accordingly.

Referee: [Abstract] The central claims, that the dataset 'includes diverse question types reflecting realistic scenarios' and that the evaluation reveals 'limitations in retrieval, generation, and claim-level analysis', are presented without accompanying dataset statistics (including size), annotation guidelines, inter-annotator agreement, or quantitative results. These details are load-bearing for assessing whether the realism and bias-free separation assumptions hold.

Authors: We agree that the abstract, while concise, should include key supporting details to substantiate the central claims. In the revised version we will expand the abstract to report dataset size, number of claims per category, inter-annotator agreement scores, and high-level quantitative results on retrieval and generation performance. This will make the realism and separation assumptions directly evaluable from the abstract itself.

revision: yes

Referee: [—] No section or table supplies the concrete construction process, validation steps, or example claim-level annotations needed to evaluate the weakest assumption, namely that question types and annotations accurately capture realistic legal scenarios for both experts and non-experts.

Authors: We acknowledge that a dedicated, explicit description of the construction pipeline is necessary for readers to assess whether the question types and claim-level annotations reflect realistic legal scenarios. We will add a new subsection (or substantially expand the existing dataset section) that details the full construction process, validation steps, and annotation guidelines, and that provides concrete examples of claim-level annotations for both expert and non-expert queries. This revision will directly address the concern.

revision: yes

Circularity Check

0 steps flagged

No significant circularity: dataset and framework introduction

The paper introduces ClaimRAG-LAW dataset and a fine-grained evaluation framework for legal RAG. No equations, derivations, fitted parameters, or load-bearing self-citations appear in the abstract or described content. The work does not reduce any claim to prior self-cited results or by-construction identities; it is self-contained as a new benchmark contribution evaluated against external SOTA systems.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

This is an empirical benchmark-introduction paper with no mathematical derivations, fitted parameters, or postulated entities.

pith-pipeline@v0.9.0 · 5715 in / 1095 out tokens · 48621 ms · 2026-05-25T05:53:03.696501+00:00 · methodology

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Foundation/RealityFromDistinction.lean · reality_from_one_distinction
Relation between the paper passage and the cited Recognition theorem: unclear
Paper passage: We introduce ClaimRAG-LAW, a comprehensive dataset for legal RAG that supports French and English, targets both experts and non-experts, and includes diverse question types reflecting realistic scenarios.

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel
Relation between the paper passage and the cited Recognition theorem: unclear
Paper passage: We further apply a fine-grained evaluation framework of state-of-the-art legal RAG systems, revealing limitations in retrieval, generation, and claim-level analysis in the legal domain.

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

[1] [1]

O’Reilly Media, Inc

J. Alammar and M. Grootendorst.Hands-on large language models: language understanding and generation. " O’Reilly Media, Inc.", 2024

work page 2024

[2] [2]

think like a lawyer

K. Burton. "think like a lawyer" using a legal reasoning grid and criterion-referenced assessment rubric on irac (issue, rule, application, conclusion).Journal of Learning Design, 10(2):57–68,

work page

[3] [3]

URLhttps://doi.org/10.5204/JLD.V10I2.229

work page doi:10.5204/jld.v10i2.229

[4] [4]

Chalkidis, A

I. Chalkidis, A. Jana, D. Hartung, M. Bommarito, I. Androutsopoulos, D. Katz, and N. Aletras. LexGLUE: A benchmark dataset for legal language understanding in english. InProceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 4310–4330, 2022. URL https://doi.org/10.18653/v1/2022.acl-long. 297

work page doi:10.18653/v1/2022.acl-long 2022

[5] [5]

M. Dahl, V . Magesh, M. Suzgun, and D. E. Ho. Large legal fictions: Profiling legal hallucinations in large language models.Journal of Legal Analysis, 16(1):64–93, 2024. URLhttps://doi. org/10.1093/jla/laae003

work page doi:10.1093/jla/laae003 2024

[6] [6]

S. Das, S. Abualhaija, and D. Bianculli. LegalRAG QA Generator. https://doi.org/10. 5281/zenodo.20024153, 2026

work page 2026

[7] [7]

S. Das, S. Abualhaija, and D. Bianculli. ClaimRAG-LAW Dataset. https://huggingface. co/datasets/SNTSVV/ClaimRAG-LAW, 2026

work page 2026

[8] [8]

BERT : Pre-training of Deep Bidirectional Transformers for Language Understanding

J. Devlin, M.-W. Chang, K. Lee, and K. Toutanova. BERT: Pre-training of deep bidirec- tional transformers for language understanding. InProceedings of the 2019 conference of the North American chapter of the association for computational linguistics: human language technologies, volume 1 (long and short papers), pages 4171–4186, 2019. URL https://doi.org/...

work page doi:10.18653/v1/n19-1423 2019

[9] [9]

B. Edwards. Number of legal professionals using Gen AI jumps sharply over past year, study shows. number-of-legal-professionals-using-gen-ai, April 17 2025. Accessed: 2026-01-04

work page 2025

[10] [10]

S. Es, J. James, L. E. Anke, and S. Schockaert. RAGAs: Automated evaluation of retrieval augmented generation. InProceedings of the 18th Conference of the European Chapter of the Association for Computational Linguistics: System Demonstrations, pages 150–158, 2024. URLhttps://doi.org/10.18653/v1/2024.eacl-demo.16

work page doi:10.18653/v1/2024.eacl-demo.16 2024

[11] [11]

Federal court turns up the heat on attorneys using ChatGPT for research

Esquire Deposition Solutions. Federal court turns up the heat on attorneys using ChatGPT for research. federal-court-turns-up-the-heat-on-attorneys, August 13 2025. Accessed: 2026-01-04

work page 2025

[12] [12]

Ferrara, Ethan-Tonic, and O

J. Ferrara, Ethan-Tonic, and O. M. Ozturk. The RAG Triad. https://www.trulens.org/ getting_started/core_concepts/rag_triad/, 2024. Accessed: 2026-04-28

work page 2024

[13] [13]

Gokhan, K

T. Gokhan, K. Wang, I. Gurevych, and T. Briscoe. RIRAG: Regulatory information retrieval and answer generation.arXiv preprint arXiv:2409.05677, 2024. URL https://doi.org/10. 48550/arXiv.2409.05677

work page arXiv 2024

[14] [14]

National civil code, 1804

Grand-Duché de Luxembourg. National civil code, 1804. URL https://legilux.public. lu/

work page

[15] [15]

The Llama 3 Herd of Models

A. Grattafiori, A. Dubey, A. Jauhri, et al. The Llama 3 herd of models.arXiv preprint arXiv:2407.21783, 2024. URLhttps://doi.org/10.48550/arXiv.2407.21783

work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.2407.21783 2024

[16] [16]

N. Guha, J. Nyarko, D. Ho, et al. LegalBench: A collaboratively built benchmark for measuring legal reasoning in large language models. InProceedings of the 37th Conference on Neural Information Processing Systems - Datasets and Benchmarks Track, pages 44123–44279, 2023

work page 2023

[17] [17]

A. B. Hou, O. Weller, G. Qin, E. Yang, D. Lawrie, N. Holzenberger, A. Blair-Stanek, and B. Van Durme. CLERC: A dataset for us legal case retrieval and retrieval-augmented analysis generation. InFindings of the Association for Computational Linguistics: NAACL 2025, pages 7913–7928, 2025. URLhttps://doi.org/10.18653/v1/2025.findings-naacl.441

work page doi:10.18653/v1/2025.findings-naacl.441 2025

[18] [18]

X. Hu, D. Ru, L. Qiu, Q. Guo, T. Zhang, Y . Xu, Y . Luo, P. Liu, Y . Zhang, and Z. Zhang. RefChecker: Reference-based fine-grained hallucination checker and benchmark for large 10 language models.arXiv preprint arXiv:2405.14486, 2024. URL https://doi.org/10. 48550/arXiv.2405.14486

work page arXiv 2024

[19] [19]

GPT-4o System Card

A. Hurst, A. Lerer, A. P. Goucher, A. Perelman, A. Ramesh, A. Clark, A. Ostrow, A. Welihinda, A. Hayes, A. Radford, et al. GPT-4o system card.arXiv preprint arXiv:2410.21276, 2024. URLhttps://doi.org/10.48550/arXiv.2410.21276

work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.2410.21276 2024

[20] [20]

A. Q. Jiang, A. Sablayrolles, A. Roux, et al. Mixtral of experts. arXiv preprint arXiv:2401.04088, 2024. URL https://doi.org/10.48550/arXiv.2401.04088

[22] [22]

A. T. Kalai and S. S. Vempala. Calibrated language models must hallucinate. In Proceedings of the 56th Annual ACM Symposium on Theory of Computing, pages 160–171, 2024. URL https://doi.org/10.1145/3618260.3649777

[23] [23]

D. M. Katz, M. J. Bommarito, S. Gao, and P. Arredondo. GPT-4 passes the bar exam. Philosophical Transactions of the Royal Society A: Mathematical, Physical and Engineering Sciences, 382(2270), 2024. URL https://doi.org/10.1098/rsta.2023.0254

[24] [24]

F. Keisha, P. Singh, D. Fernandes, A. Manivannan, I. Wicaksono, F. Ahmad, W. B. Rim, et al. All for law and law for all: Adaptive RAG pipeline for legal research. arXiv preprint arXiv:2508.13107, 2025. URL https://doi.org/10.48550/arXiv.2508.13107

[25] [25]

J. Lee, D. Kim, S. Hwang, H. Kim, and G. Lee. KoBLEX: Open legal question answering with multi-hop reasoning. In Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing, pages 4019–4053, 2025. URL https://doi.org/10.18653/v1/2025.emnlp-main.200

[26] [26]

P. Lewis, E. Perez, A. Piktus, F. Petroni, V. Karpukhin, N. Goyal, H. Küttler, M. Lewis, W.-t. Yih, T. Rocktäschel, et al. Retrieval-augmented generation for knowledge-intensive NLP tasks. Advances in Neural Information Processing Systems, 33:9459–9474, 2020. URL https://doi.org/10.48550/arXiv.2005.11401

[27] [27]

K. Li, Y. Li, T. Zhang, H. Luo, X. Wu, J. Glass, and H. Meng. RAG-Zeval: Enhancing RAG Responses Evaluator through End-to-End Reasoning and Ranking-Based Reinforcement Learning. In Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing, pages 24936–24954, 2025. URL https://doi.org/10.18653/v1/2025.emnlp-main.1267

[28] [28]

L. Li, L. Sleem, G. Nichil, R. State, et al. Exploring the impact of temperature on large language models: Hot or cold? Procedia Computer Science, 264:242–251, 2025. URL https://doi.org/10.1016/j.procs.2025.07.135

[29] [29]

A. Louis, G. van Dijck, and G. Spanakis. Interpretable long-form legal question answering with retrieval-augmented large language models. In Proceedings of the AAAI Conference on Artificial Intelligence, pages 22266–22275, 2024. URL https://doi.org/10.1609/aaai.v38i20.30232

[30] [30]

V. Magesh, F. Surani, M. Dahl, M. Suzgun, C. D. Manning, and D. E. Ho. Hallucination-free? Assessing the reliability of leading AI legal research tools. Journal of Empirical Legal Studies, 22(2):216–242, 2025. URL https://doi.org/10.1111/jels.12413

[31] [31]

S. Mallick. Generative AI in the law. the Law (February 10, 2024), 42, 2024. URL https://doi.org/10.2139/ssrn.5040429

[32] [32]

P. Manakul, A. Liusie, and M. Gales. SelfCheckGPT: Zero-resource black-box hallucination detection for generative large language models. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, pages 9004–9017, 2023. URL https://doi.org/10.18653/v1/2023.emnlp-main.557

[33] [33]

D. Metropolitansky and J. Larson. VeriTrail: Closed-domain hallucination detection with traceability. arXiv preprint arXiv:2505.21786, 2025. URL https://doi.org/10.48550/arXiv.2505.21786

[34] [35]

S. Min, K. Krishna, X. Lyu, M. Lewis, W.-t. Yih, P. Koh, M. Iyyer, L. Zettlemoyer, and H. Hajishirzi. FActScore: Fine-grained atomic evaluation of factual precision in long form text generation. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, pages 12076–12100, 2023. URL https://doi.org/10.18653/v1/2023.emnlp-main.741

[35] [36]

J. Niklaus, V. Matoshi, P. Rani, A. Galassi, M. Stürmer, and I. Chalkidis. LEXTREME: A multi-lingual and multi-task benchmark for the legal domain. In Findings of the Association for Computational Linguistics: EMNLP 2023, pages 3016–3054, 2023. URL https://doi.org/10.18653/v1/2023.findings-emnlp.200

[36] [37]

M. Park, H. Oh, E. Choi, and W. Hwang. LRAGE: Legal retrieval augmented generation evaluation tool. arXiv preprint arXiv:2504.01840, 2025. URL https://doi.org/10.48550/arXiv.2504.01840

[37] [38]

N. Pipitone and G. H. Alami. LegalBench-RAG: A benchmark for retrieval-augmented generation in the legal domain. arXiv preprint arXiv:2408.10343, 2024. URL https://doi.org/10.48550/arXiv.2408.10343

[38] [39]

N. Reimers and I. Gurevych. The curse of dense low-dimensional information retrieval for large index sizes. In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 2: Short Papers), pages 605–611. Association for Computational Linguistics, ...

work page doi:10.18653/v1/2021.acl-short.77 2021

[39] [40]

M. Renze. The effect of sampling temperature on problem solving in large language models. In Findings of the Association for Computational Linguistics: EMNLP 2024, pages 7346–7356, 2024. URL https://doi.org/10.18653/v1/2024.findings-emnlp.432

[41] [42]

S. Robertson and H. Zaragoza. The probabilistic relevance framework: BM25 and beyond. Found. Trends Inf. Retr., 3(4):333–389, Apr. 2009. ISSN 1554-0669. URL https://doi.org/10.1561/1500000019

[42] [43]

D. Ru, L. Qiu, X. Hu, T. Zhang, P. Shi, S. Chang, C. Jiayang, C. Wang, S. Sun, H. Li, et al. RAGChecker: A fine-grained framework for diagnosing retrieval-augmented generation. Advances in Neural Information Processing Systems, 37:21999–22027, 2024. URL https://doi.org/10.52202/079017-0692

[43] [44]

N. Sannier, M. Adedjouma, M. Sabetzadeh, L. Briand, J. Dann, M. Hisette, and P. Thill. Legal markup generation in the large: An experience report. In 2017 IEEE 25th International Requirements Engineering Conference (RE), pages 302–311. IEEE, 2017. URL https://doi.org/10.1109/RE.2017.10

[44] [45]

A. Scirè, K. Ghonim, and R. Navigli. FENICE: Factuality evaluation of summarization based on natural language inference and claim extraction. In Findings of the Association for Computational Linguistics: ACL 2024, pages 14148–14161, 2024. URL https://doi.org/10.18653/v1/2024.findings-acl.841

[45] [46]

A. Singh, A. Fry, A. Perelman, A. Tart, A. Ganesh, A. El-Kishky, A. McLaughlin, A. Low, A. Ostrow, A. Ananthram, et al. OpenAI GPT-5 system card. arXiv preprint arXiv:2601.03267. URL https://doi.org/10.48550/arXiv.2601.03267

[47] [48]

The European Parliament and the Council of the European Union. Regulation (EU) 2016/679 of the European Parliament and of the Council of 27 April 2016 on the protection of natural persons with regard to the processing of personal data and on the free movement of such data, and repealing Directive 95/46/EC (General Data Protection Regulation), 05 2016. URL...

[48] [49]

L. Wang, N. Yang, X. Huang, L. Yang, R. Majumder, and F. Wei. Improving text embeddings with large language models. arXiv preprint arXiv:2401.00368, 2023. URL https://doi.org/10.48550/arXiv.2401.00368

[49] [50]

Y. Wang, M. Wang, H. Iqbal, G. N. Georgiev, J. Geng, I. Gurevych, and P. Nakov. OpenFactCheck: Building, benchmarking customized fact-checking systems and evaluating the factuality of claims and LLMs. In Proceedings of the 31st International Conference on Computational Linguistics, pages 11399–11421, 2025. URL https://aclanthology.org/2025.coling-main.755/

[50] [51]

J. Wei, C. Yang, X. Song, Y. Lu, N. Hu, J. Huang, D. Tran, D. Peng, R. Liu, D. Huang, et al. Long-form factuality in large language models. Advances in Neural Information Processing Systems, 37:80756–80827, 2024. URL https://doi.org/10.52202/079017-2567

[51] [52]

B. Weiser. ‘I apologise for the confusion earlier’: Here’s what happens when your lawyer uses ChatGPT. heres-what-happens-when-your-lawyer-uses-chatgpt, May 28 2023. Accessed: 2026-01-04

[52] [53]

N. Wiratunga, R. Abeyratne, L. Jayawardena, K. Martin, S. Massie, I. Nkisi-Orji, R. Weerasinghe, A. Liret, and B. Fleisch. CBR-RAG: Case-based reasoning for retrieval augmented generation in LLMs for legal question answering. In International Conference on Case-Based Reasoning, pages 445–460. Springer, 2024. URL https://doi.org/10.1007/978-3-031-63646-...