Citation Grounding: Detecting and Reducing LLM Citation Hallucinations via Legal Citation Graphs

Volodymyr Ovcharov

arxiv: 2606.00898 · v1 · pith:2ZMYV5K3new · submitted 2026-05-30 · 💻 cs.CL · cs.DL

Citation Grounding: Detecting and Reducing LLM Citation Hallucinations via Legal Citation Graphs

Volodymyr Ovcharov This is my paper

Pith reviewed 2026-06-28 18:35 UTC · model grok-4.3

classification 💻 cs.CL cs.DL

keywords citation groundingLLM hallucinationslegal citationscitation graphDPO fine-tuningcourt decisionspreference optimization

0 comments

The pith

A citation graph from 100.8 million Ukrainian court decisions enables CG-DPO fine-tuning to 98.5% accuracy distinguishing correct legal citations from hallucinations.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes citation grounding as a scalable metric to verify LLM legal citations against a real graph of statutes and decisions. It decomposes the metric into precision, relevance, and temporality to diagnose different hallucination types and reports 13-21% hallucinated citations across five evaluated systems. To reduce errors without human labels, it introduces CG-DPO, which builds preference pairs by applying four corruption strategies to verified citations from court records. A Qwen2.5-7B model fine-tuned with this data reaches 98.5% validation accuracy. The graph, evaluation framework, and training dataset are released publicly.

Core claim

The central claim is that a citation graph extracted from 100.8 million Ukrainian court decisions supplies both a ground-truth verifier for the citation grounding metric and the source material for algorithmically generating preference pairs, allowing CG-DPO to fine-tune models that distinguish correct from corrupted legal citations at 98.5% mean validation accuracy.

What carries the argument

The citation graph of 502 million edges linking 21,736 statute nodes, which supplies verified citations that are corrupted via four targeted strategies to create preference pairs for CG-DPO training.

If this is right

LLMs can be evaluated automatically for citation quality on any number of legal queries using the three-component CG score.
Preference data for DPO can be produced at scale from existing court records without manual annotation.
The separate precision, relevance, and temporality scores allow targeted diagnosis of specific hallucination failure modes.
Open release of the graph and CG-DPO dataset supports repeated training and benchmarking of citation-aware models.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same graph-construction and corruption approach could be replicated in other jurisdictions that publish structured court data.
The method may transfer to non-legal domains that maintain reference graphs, such as scientific citations or regulatory documents.
Pairing CG-DPO with retrieval-augmented generation could produce additive gains in citation reliability.
Cross-lingual testing on English legal queries would reveal whether the Ukrainian-trained detector generalizes.

Load-bearing premise

The four targeted corruption strategies used to generate negative preference pairs produce examples that are representative of the actual hallucination patterns exhibited by the evaluated LLMs.

What would settle it

Apply the fine-tuned model to a fresh set of citations generated by the same commercial LLMs on the 100 queries and check whether its accuracy remains near 98.5% or aligns with the CG scores computed from the graph.

Figures

Figures reproduced from arXiv: 2606.00898 by Volodymyr Ovcharov.

**Figure 2.** Figure 2: CG by legal domain and model. Constitutional law achieves perfect grounding universally; family and labor law show the highest inter-model variance. 6 Citation Grounding DPO The empirical evaluation (§5) establishes CG as a diagnostic tool. We now show that the same citation graph G can serve as an algorithmic oracle for constructing DPO training pairs – replacing human annotators with graph-based verifica… view at source ↗

**Figure 3.** Figure 3: Convergence of CG-DPO training (3 seeds). Left: DPO loss (log scale). Right: classifi [PITH_FULL_IMAGE:figures/full_fig_p010_3.png] view at source ↗

read the original abstract

Large language models systematically hallucinate legal citations -- fabricating statute references, citing repealed provisions, and confusing jurisdictions -- yet no automated method exists to measure or reduce this behavior at scale. We propose citation grounding (CG), a metric that verifies LLM-generated legal citations against a ground-truth citation graph extracted from 100.8 million Ukrainian court decisions (502 million edges, 21,736 unique statute nodes). CG decomposes into three components -- citation precision (does the cited provision exist?), citation relevance (is it contextually appropriate?), and citation temporality (was it valid at the relevant date?) -- enabling differential diagnosis of hallucination types. Empirical evaluation on 100 Ukrainian legal queries across five systems -- four commercial LLMs via AWS Bedrock (Claude Haiku 4.5, Mistral Pixtral Large, Amazon Nova Pro/Lite) and one RAG-augmented production system -- reveals CG ranging from 0.791 to 0.873, with 13-21% of citations hallucinated. To reduce hallucinations without human annotation, we introduce Citation Grounding DPO (CG-DPO): a method that constructs preference pairs algorithmically by corrupting verified citations from real court decisions via four targeted strategies. On a dataset of 2,244 court decisions, a Qwen2.5-7B-Instruct model fine-tuned with LoRA achieves 98.5% mean validation accuracy in distinguishing correct from corrupted citations (rewards margin +14.9, std < 0.3 pp across 3 seeds). The citation graph, evaluation framework, and CG-DPO dataset are released as open resources.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

The paper gives a usable citation metric on a real legal graph and an automated DPO recipe, but the 98.5% accuracy is only on the same synthetic corruptions used for training.

read the letter

The paper's core contribution is a three-part citation grounding metric (precision, relevance, temporality) checked against a graph built from 100.8 million Ukrainian court decisions, plus CG-DPO that turns verified citations into preference pairs by applying four corruption strategies. They run the metric on 100 queries across five LLMs and report 13-21% hallucinated citations, then fine-tune Qwen2.5-7B-Instruct with LoRA to reach 98.5% mean accuracy on the synthetic validation set.

The graph release and the decomposed metric are the clearest positives. The evaluation covers commercial models and a production RAG system, and the preference-pair construction avoids human labeling. That setup is straightforward and the numbers on hallucination rates are concrete.

The soft spot is the validation design. The high accuracy comes from distinguishing correct citations from the exact four corruption types used to build the training data. The paper does not show that those corruptions match the actual error patterns in the LLM outputs whose 13-21% rates are measured. If real hallucinations include jurisdiction or date issues outside the four strategies, the fine-tuned model may not transfer as cleanly.

This is for groups working on legal AI reliability or citation verification. The open graph and dataset give it practical value even if the DPO results need testing on genuine model mistakes. It deserves peer review because the graph and metric rest on external data and the method is easy to probe further.

Referee Report

1 major / 2 minor

Summary. The paper claims that citation hallucinations in LLMs for legal citations can be measured using Citation Grounding (CG), a metric based on a large citation graph from Ukrainian court decisions, revealing 13-21% hallucination rates in five evaluated systems. It further claims that CG-DPO, using synthetic corruptions of real citations with four strategies to create preference pairs, allows fine-tuning a model to distinguish correct from incorrect citations with 98.5% validation accuracy.

Significance. If the synthetic corruption strategies are representative of real LLM hallucination patterns, this work provides a scalable, annotation-free approach to both measuring and mitigating citation hallucinations in legal domains. The release of the citation graph, evaluation framework, and dataset as open resources strengthens the contribution. The concrete empirical results on hallucination rates add value, though the gap between synthetic validation and real-world detection limits immediate impact.

major comments (1)

[CG-DPO validation (abstract and methods)] The 98.5% mean validation accuracy (rewards margin +14.9, std < 0.3 pp across 3 seeds) for the Qwen2.5-7B-Instruct model is obtained by training and testing on preference pairs generated from the same four targeted corruption strategies applied to verified citations. This does not establish detection capability on the actual hallucination patterns in the five evaluated LLMs on the 100 queries, which may involve different error distributions (e.g., jurisdiction confusion or date-specific validity failures) not among the four strategies.

minor comments (2)

[Abstract] The abstract reports CG scores from 0.791 to 0.873 and 13-21% hallucinated citations but provides no details on query selection for the 100 Ukrainian legal queries or validation of the citation graph's completeness (502 million edges, 21,736 statute nodes).
[Evaluation framework] The decomposition of CG into citation precision, relevance, and temporality is described at a high level, but the manuscript does not specify the exact graph-based operationalization or thresholds used for each component in the empirical evaluation.

Simulated Author's Rebuttal

1 responses · 0 unresolved

We thank the referee for their constructive feedback, which highlights an important distinction in our evaluation of CG-DPO. We respond to the major comment below.

read point-by-point responses

Referee: The 98.5% mean validation accuracy (rewards margin +14.9, std < 0.3 pp across 3 seeds) for the Qwen2.5-7B-Instruct model is obtained by training and testing on preference pairs generated from the same four targeted corruption strategies applied to verified citations. This does not establish detection capability on the actual hallucination patterns in the five evaluated LLMs on the 100 queries, which may involve different error distributions (e.g., jurisdiction confusion or date-specific validity failures) not among the four strategies.

Authors: We agree that the reported validation accuracy reflects performance on held-out synthetic preference pairs generated from the same four corruption strategies used in training. This setup demonstrates that the model successfully learns to distinguish correct citations from the targeted corruptions in an annotation-free manner, but it does not directly measure generalization to the precise error distributions in the outputs of the five evaluated LLMs. The four strategies were designed to instantiate the three CG components (existence, relevance, temporality) based on hallucination types observed during our initial LLM evaluations; however, we acknowledge that patterns such as jurisdiction confusion may not be fully covered. We will revise the manuscript to explicitly note this limitation in the abstract and methods, clarifying the scope of the CG-DPO results while retaining the claim that the approach provides a scalable starting point for reducing hallucinations. revision: yes

Circularity Check

0 steps flagged

No significant circularity; external graph and real citations provide independent signal

full rationale

The ground-truth citation graph is extracted from 100.8 million external Ukrainian court decisions (502M edges), independent of the evaluated LLMs or the DPO training. CG-DPO constructs preference pairs by applying four corruption strategies to verified real citations from court decisions; the 98.5% validation accuracy is a standard supervised metric on held-out pairs drawn from the same synthetic distribution. No derivation step reduces by construction to its own inputs, no self-citation chain is load-bearing, and the reported hallucination rates (13-21%) on the five LLMs are computed directly against the external graph. The setup is self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The central claims rest on the assumption that the extracted citation graph constitutes reliable ground truth and that the four corruption strategies produce negative examples representative of real LLM hallucinations.

axioms (1)

domain assumption The citation graph extracted from 100.8 million Ukrainian court decisions accurately captures valid, temporally scoped statute references.
Used directly as ground truth for precision, relevance, and temporality checks.

pith-pipeline@v0.9.1-grok · 5828 in / 1216 out tokens · 20066 ms · 2026-06-28T18:35:25.786251+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

24 extracted references · 7 canonical work pages · 4 internal anchors

[1]

A mathematical approach to the study of the United States Code.Physica A, 389(19):4195–4200, 2010

Michael J Bommarito and Daniel Martin Katz. A mathematical approach to the study of the United States Code.Physica A, 389(19):4195–4200, 2010

2010
[2]

Falkor-IRAC: Graph-Constrained Generation for Verified Legal Reasoning in Indian Judicial AI

Joy Bose. Falkor-IRAC: Graph-constrained generation for verified legal reasoning in Indian judicial AI.arXiv preprint arXiv:2605.14665, 2026

work page internal anchor Pith review Pith/arXiv arXiv 2026
[3]

LexGLUE: A benchmark dataset for legal lan- guage understanding in English

Ilias Chalkidis, Abhik Jana, Dirk Hartung, Michael Bommarito, Ion Androutsopoulos, Daniel Martin Katz, and Nikolaos Aletras. LexGLUE: A benchmark dataset for legal lan- guage understanding in English. InProceedings of ACL, 2022

2022
[4]

ChatGPT goes to law school.Journal of Legal Education, 71(3), 2023

Jonathan H Choi, Kristin E Hickman, Amy Monahan, and Daniel Schwarcz. ChatGPT goes to law school.Journal of Legal Education, 71(3), 2023

2023
[5]

SaulLM-7B: A pioneering large language model for law.arXiv preprint arXiv:2403.03883, 2024

Pierre Colombo, Telmo Pires, Rui Vieira, et al. SaulLM-7B: A pioneering large language model for law.arXiv preprint arXiv:2403.03883, 2024

work page arXiv 2024
[6]

Unsupervised cross-lingual representation learning at scale

Alexis Conneau, Kartikay Khandelwal, Naman Goyal, Vishrav Chaudhary, Guillaume Wen- zek, Francisco Guzmán, Edouard Grave, Myle Ott, Luke Zettlemoyer, and Veselin Stoyanov. Unsupervised cross-lingual representation learning at scale. InProceedings of ACL, 2020

2020
[7]

Large legal fictions: Profiling legal hallucinations in large language models.Journal of Legal Analysis, 16(1):64–93, 2024

Matthew Dahl, Varun Magesh, Mirac Suzgun, and Daniel E Ho. Large legal fictions: Profiling legal hallucinations in large language models.Journal of Legal Analysis, 16(1):64–93, 2024

2024
[8]

Network analysis and the law: Measuring the legal importance of precedents at the U.S

James H Fowler, Timothy R Johnson, James F Spriggs, Sangick Jeon, and Paul J Wahlbeck. Network analysis and the law: Measuring the legal importance of precedents at the U.S. Supreme Court.Political Analysis, 15(3):324–346, 2007. 1Citation graph:https://huggingface.co/datasets/overthelex/ua-court-citation-graph. Code and data: https://huggingface.co/datase...

2007
[9]

LegalBench: A collaboratively built benchmark for measuring le- gal reasoning in large language models

Neel Guha, Julian Nyarko, Daniel E Ho, Christopher Ré, Adam Chilton, Alex Nanamori, Nils Holzenberger, et al. LegalBench: A collaboratively built benchmark for measuring le- gal reasoning in large language models. InNeurIPS Datasets and Benchmarks Track, 2023. arXiv:2308.11462

work page arXiv 2023
[10]

LoRA: Low-rank adaptation of large language models

Edward J Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shanan Wang, Lu Wang, and Weizhu Chen. LoRA: Low-rank adaptation of large language models. InICLR, 2022

2022
[11]

GPT-4 passes the bar exam.Philosophical Transactions of the Royal Society A, 382(2270), 2024

Daniel Martin Katz, Michael James Bommarito, Shang Gao, and Pablo Arredondo. GPT-4 passes the bar exam.Philosophical Transactions of the Royal Society A, 382(2270), 2024

2024
[12]

Hallucination-free? assessing the reliability of leading AI legal research tools

Varun Magesh, Faiz Surani, Matthew Dahl, Mirac Suzgun, Christopher D Manning, and Daniel E Ho. Hallucination-free? assessing the reliability of leading AI legal research tools. Journal of Empirical Legal Studies, 22:216–242, 2025. arXiv:2405.20362

work page arXiv 2025
[13]

The network of French legal codes

Pierre Mazzega, Danièle Bourcier, and Romain Boulet. The network of French legal codes. Artificial Intelligence and Law, 17(3), 2009

2009
[14]

FActScore: Fine-grained atomic evaluation of factual precision in long form text generation

Sewon Min, Kalpesh Krishna, Xinxi Lyu, Mike Lewis, Wen-tau Yih, Pang Wei Koh, Mohit Iyyer, Luke Zettlemoyer, and Hannaneh Hajishirzi. FActScore: Fine-grained atomic evaluation of factual precision in long form text generation. InProceedings of EMNLP, 2023

2023
[15]

Emergence of hierarchy in networked endorsement dynam- ics.Proceedings of the National Academy of Sciences, 118(16), 2021

Enys Mones and Adam Arvidsson. Emergence of hierarchy in networked endorsement dynam- ics.Proceedings of the National Academy of Sciences, 118(16), 2021

2021
[16]

LEX- TREME: A multi-lingual and multi-task benchmark for the legal domain

Joel Niklaus, Veton Matoshi, Pooja Rani, Andrea Gallucci, and Matthias Stuermer. LEX- TREME: A multi-lingual and multi-task benchmark for the legal domain. InFindings of EMNLP, 2023

2023
[17]

Training language models to follow instructions with human feedback

Long Ouyang, Jeffrey Wu, Xu Jiang, Diogo Almeida, Carroll L Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, et al. Training language models to follow instructions with human feedback. InNeurIPS, 2022

2022
[18]

Volodymyr Ovcharov. Automatic construction of a legal citation graph from 100 million Ukrainian court decisions: Large-scale extraction, topological analysis, and ontology-driven clustering.arXiv preprint arXiv:2605.15362, 2026

work page internal anchor Pith review Pith/arXiv arXiv 2026
[19]

PhD thesis, National Academy of Sciences of Ukraine, 2026

Volodymyr V Ovcharov.Methods for Ensuring Verifiability of Large Language Models in the Legal Domain. PhD thesis, National Academy of Sciences of Ukraine, 2026

2026
[20]

Direct preference optimization: Your language model is secretly a reward model

Rafael Rafailov, Archit Sharma, Eric Mitchell, Christopher D Manning, Stefano Ermon, and Chelsea Finn. Direct preference optimization: Your language model is secretly a reward model. InNeurIPS, 2023

2023
[21]

CiteAudit: You Cited It, But Did You Read It? A Benchmark for Verifying Scientific References in the LLM Era

Tobias Schimanski, Jingwei Ni, Mathias Kraus, and Markus Leippold. CiteAudit: You cited it, but did you read it? A benchmark for verifying scientific references in the LLM era.arXiv preprint arXiv:2602.23452, 2026

work page internal anchor Pith review Pith/arXiv arXiv 2026
[22]

Here’s what happens when your lawyer uses ChatGPT.The New York Times, May 2023

Benjamin Weiser. Here’s what happens when your lawyer uses ChatGPT.The New York Times, May 2023. 13

2023
[23]

Determining authority of dutch case law

Radboud Winkels, Jelle de Ruyter, and Henryk Kroese. Determining authority of dutch case law. InProceedings of JURIX, 2011

2011
[24]

Qwen2.5 Technical Report

An Yang, Baosong Yang, et al. Qwen2.5 technical report.arXiv preprint arXiv:2412.15115, 2024. 14

work page internal anchor Pith review Pith/arXiv arXiv 2024

[1] [1]

A mathematical approach to the study of the United States Code.Physica A, 389(19):4195–4200, 2010

Michael J Bommarito and Daniel Martin Katz. A mathematical approach to the study of the United States Code.Physica A, 389(19):4195–4200, 2010

2010

[2] [2]

Falkor-IRAC: Graph-Constrained Generation for Verified Legal Reasoning in Indian Judicial AI

Joy Bose. Falkor-IRAC: Graph-constrained generation for verified legal reasoning in Indian judicial AI.arXiv preprint arXiv:2605.14665, 2026

work page internal anchor Pith review Pith/arXiv arXiv 2026

[3] [3]

LexGLUE: A benchmark dataset for legal lan- guage understanding in English

Ilias Chalkidis, Abhik Jana, Dirk Hartung, Michael Bommarito, Ion Androutsopoulos, Daniel Martin Katz, and Nikolaos Aletras. LexGLUE: A benchmark dataset for legal lan- guage understanding in English. InProceedings of ACL, 2022

2022

[4] [4]

ChatGPT goes to law school.Journal of Legal Education, 71(3), 2023

Jonathan H Choi, Kristin E Hickman, Amy Monahan, and Daniel Schwarcz. ChatGPT goes to law school.Journal of Legal Education, 71(3), 2023

2023

[5] [5]

SaulLM-7B: A pioneering large language model for law.arXiv preprint arXiv:2403.03883, 2024

Pierre Colombo, Telmo Pires, Rui Vieira, et al. SaulLM-7B: A pioneering large language model for law.arXiv preprint arXiv:2403.03883, 2024

work page arXiv 2024

[6] [6]

Unsupervised cross-lingual representation learning at scale

Alexis Conneau, Kartikay Khandelwal, Naman Goyal, Vishrav Chaudhary, Guillaume Wen- zek, Francisco Guzmán, Edouard Grave, Myle Ott, Luke Zettlemoyer, and Veselin Stoyanov. Unsupervised cross-lingual representation learning at scale. InProceedings of ACL, 2020

2020

[7] [7]

Large legal fictions: Profiling legal hallucinations in large language models.Journal of Legal Analysis, 16(1):64–93, 2024

Matthew Dahl, Varun Magesh, Mirac Suzgun, and Daniel E Ho. Large legal fictions: Profiling legal hallucinations in large language models.Journal of Legal Analysis, 16(1):64–93, 2024

2024

[8] [8]

Network analysis and the law: Measuring the legal importance of precedents at the U.S

James H Fowler, Timothy R Johnson, James F Spriggs, Sangick Jeon, and Paul J Wahlbeck. Network analysis and the law: Measuring the legal importance of precedents at the U.S. Supreme Court.Political Analysis, 15(3):324–346, 2007. 1Citation graph:https://huggingface.co/datasets/overthelex/ua-court-citation-graph. Code and data: https://huggingface.co/datase...

2007

[9] [9]

LegalBench: A collaboratively built benchmark for measuring le- gal reasoning in large language models

Neel Guha, Julian Nyarko, Daniel E Ho, Christopher Ré, Adam Chilton, Alex Nanamori, Nils Holzenberger, et al. LegalBench: A collaboratively built benchmark for measuring le- gal reasoning in large language models. InNeurIPS Datasets and Benchmarks Track, 2023. arXiv:2308.11462

work page arXiv 2023

[10] [10]

LoRA: Low-rank adaptation of large language models

Edward J Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shanan Wang, Lu Wang, and Weizhu Chen. LoRA: Low-rank adaptation of large language models. InICLR, 2022

2022

[11] [11]

GPT-4 passes the bar exam.Philosophical Transactions of the Royal Society A, 382(2270), 2024

Daniel Martin Katz, Michael James Bommarito, Shang Gao, and Pablo Arredondo. GPT-4 passes the bar exam.Philosophical Transactions of the Royal Society A, 382(2270), 2024

2024

[12] [12]

Hallucination-free? assessing the reliability of leading AI legal research tools

Varun Magesh, Faiz Surani, Matthew Dahl, Mirac Suzgun, Christopher D Manning, and Daniel E Ho. Hallucination-free? assessing the reliability of leading AI legal research tools. Journal of Empirical Legal Studies, 22:216–242, 2025. arXiv:2405.20362

work page arXiv 2025

[13] [13]

The network of French legal codes

Pierre Mazzega, Danièle Bourcier, and Romain Boulet. The network of French legal codes. Artificial Intelligence and Law, 17(3), 2009

2009

[14] [14]

FActScore: Fine-grained atomic evaluation of factual precision in long form text generation

Sewon Min, Kalpesh Krishna, Xinxi Lyu, Mike Lewis, Wen-tau Yih, Pang Wei Koh, Mohit Iyyer, Luke Zettlemoyer, and Hannaneh Hajishirzi. FActScore: Fine-grained atomic evaluation of factual precision in long form text generation. InProceedings of EMNLP, 2023

2023

[15] [15]

Emergence of hierarchy in networked endorsement dynam- ics.Proceedings of the National Academy of Sciences, 118(16), 2021

Enys Mones and Adam Arvidsson. Emergence of hierarchy in networked endorsement dynam- ics.Proceedings of the National Academy of Sciences, 118(16), 2021

2021

[16] [16]

LEX- TREME: A multi-lingual and multi-task benchmark for the legal domain

Joel Niklaus, Veton Matoshi, Pooja Rani, Andrea Gallucci, and Matthias Stuermer. LEX- TREME: A multi-lingual and multi-task benchmark for the legal domain. InFindings of EMNLP, 2023

2023

[17] [17]

Training language models to follow instructions with human feedback

Long Ouyang, Jeffrey Wu, Xu Jiang, Diogo Almeida, Carroll L Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, et al. Training language models to follow instructions with human feedback. InNeurIPS, 2022

2022

[18] [18]

Volodymyr Ovcharov. Automatic construction of a legal citation graph from 100 million Ukrainian court decisions: Large-scale extraction, topological analysis, and ontology-driven clustering.arXiv preprint arXiv:2605.15362, 2026

work page internal anchor Pith review Pith/arXiv arXiv 2026

[19] [19]

PhD thesis, National Academy of Sciences of Ukraine, 2026

Volodymyr V Ovcharov.Methods for Ensuring Verifiability of Large Language Models in the Legal Domain. PhD thesis, National Academy of Sciences of Ukraine, 2026

2026

[20] [20]

Direct preference optimization: Your language model is secretly a reward model

Rafael Rafailov, Archit Sharma, Eric Mitchell, Christopher D Manning, Stefano Ermon, and Chelsea Finn. Direct preference optimization: Your language model is secretly a reward model. InNeurIPS, 2023

2023

[21] [21]

CiteAudit: You Cited It, But Did You Read It? A Benchmark for Verifying Scientific References in the LLM Era

Tobias Schimanski, Jingwei Ni, Mathias Kraus, and Markus Leippold. CiteAudit: You cited it, but did you read it? A benchmark for verifying scientific references in the LLM era.arXiv preprint arXiv:2602.23452, 2026

work page internal anchor Pith review Pith/arXiv arXiv 2026

[22] [22]

Here’s what happens when your lawyer uses ChatGPT.The New York Times, May 2023

Benjamin Weiser. Here’s what happens when your lawyer uses ChatGPT.The New York Times, May 2023. 13

2023

[23] [23]

Determining authority of dutch case law

Radboud Winkels, Jelle de Ruyter, and Henryk Kroese. Determining authority of dutch case law. InProceedings of JURIX, 2011

2011

[24] [24]

Qwen2.5 Technical Report

An Yang, Baosong Yang, et al. Qwen2.5 technical report.arXiv preprint arXiv:2412.15115, 2024. 14

work page internal anchor Pith review Pith/arXiv arXiv 2024