Validate Your Authority: Benchmarking LLMs on Multi-Label Precedent Treatment Classification

M. Abdullah Canbaz; M. Mikail Demir

arxiv: 2605.17691 · v1 · pith:LQK42CWTnew · submitted 2026-05-17 · 💻 cs.CL · cs.AI

Pith reviewed 2026-05-20 12:21 UTC · model grok-4.3

keywords legal precedent · treatment classification · LLM benchmarking · multi-label classification · legal NLP · evaluation metrics · citation analysis · negative treatment

The pith

Large language models classify legal precedent treatments at up to 79 percent accuracy on high-level tasks using a new expert dataset and severity metric.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper benchmarks large language models on classifying how legal precedents are treated in subsequent citations, a task where errors carry real risk for legal research. To move beyond standard accuracy, the authors build a new dataset of 239 expert-annotated real-world citations and introduce the Average Severity Error metric, which weights mistakes according to their likely practical consequences. Experiments reveal a performance split: one model leads on broad classification while another leads when finer distinctions are required. This supplies both data and an evaluation approach tailored to nuanced legal reasoning.

Core claim

The authors create an expert-annotated dataset of 239 legal citations and evaluate modern LLMs on multi-label precedent treatment classification. They introduce the Average Severity Error metric to capture the real-world impact of errors. Gemini 2.5 Flash reaches 79.1 percent accuracy on a high-level schema while GPT-5-mini reaches 67.7 percent on a more detailed schema, establishing a baseline for this legal NLP application.

What carries the argument

The Average Severity Error metric, which assigns different costs to classification mistakes based on their potential effect in legal practice rather than treating every error as equal.
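
The paper's exact formula and severity weights are not reproduced on this page, so the following is only a minimal sketch of how a severity-weighted error of this kind could be computed. The label names, the rank values in `SEVERITY_RANK`, the absolute-rank-distance cost, and the rule of scoring against the closest accepted label are all illustrative assumptions, not the authors' definitions.

```python
from typing import Dict, List, Set

# Hypothetical severity ranks for a three-way treatment schema.
# These labels and values are assumptions made for illustration only.
SEVERITY_RANK: Dict[str, int] = {"positive": 0, "neutral": 1, "negative": 2}


def severity_cost(predicted: str, accepted: Set[str]) -> int:
    """Smallest rank distance between the prediction and any accepted gold label."""
    return min(abs(SEVERITY_RANK[predicted] - SEVERITY_RANK[g]) for g in accepted)


def average_severity_error(predictions: List[str], gold: List[Set[str]]) -> float:
    """Mean severity cost over all citations (0.0 means every prediction was accepted)."""
    if len(predictions) != len(gold) or not gold:
        raise ValueError("predictions and gold must be non-empty and the same length")
    total = sum(severity_cost(p, g) for p, g in zip(predictions, gold))
    return total / len(gold)


# Toy data: the third citation accepts more than one gold label.
toy_predictions = ["positive", "negative", "neutral"]
toy_gold = [{"positive"}, {"positive"}, {"neutral", "negative"}]
toy_score = average_severity_error(toy_predictions, toy_gold)
```

Under these assumed ranks, predicting `negative` for a citation whose accepted label is `positive` costs 2, while confusing `neutral` with `negative` costs 1, so the score separates two mistakes that plain accuracy would count identically. A symmetric distance is the simplest choice; an asymmetric cost table could replace it if missing a negative treatment is judged worse than a false alarm.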

If this is right

The new dataset becomes available for training or further testing of legal analysis models.
The Average Severity Error metric offers a more realistic way to compare models in high-stakes classification settings.
Model performance baselines can inform choices when building automated tools for legal research.
The observed split between high-level and fine-grained results indicates that classification detail level affects which model performs best.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

The same dataset-plus-metric approach could be tested on other high-risk classification domains such as regulatory compliance or medical case notes.
Embedding these classifications into legal search systems might reduce the chance of retrieving misleading precedent.
Longitudinal studies could check whether models that score well on Average Severity Error actually improve outcomes in real legal workflows.

Load-bearing premise

The expert annotations on the 239 citations provide accurate ground truth that captures the true legal meanings without meaningful disagreement or bias in selection.

What would settle it

Independent legal experts re-annotating the same 239 citations and producing substantially different treatment labels would show that the benchmark results rest on unreliable ground truth.

Figures

Figures reproduced from arXiv: 2605.17691 by M. Abdullah Canbaz, M. Mikail Demir.

**Figure 2.** A snippet of the dataset provided by Hellyer (2018), with explanations of the ground-truth logic.

**Figure 3.** A snippet of the dataset provided by Hellyer (2018), with the corrected label given in brackets.

**Figure 4.** A snippet of the dataset provided by Hellyer (2018), where more than one label is accepted as ground truth.

Original abstract

Automating the classification of negative treatment in legal precedent is a critical yet nuanced NLP task where misclassification carries significant risk. To address the shortcomings of standard accuracy, this paper introduces a more robust evaluation framework. We benchmark modern Large Language Models on a new, expert-annotated dataset of 239 real-world legal citations and propose a novel Average Severity Error metric to better measure the practical impact of classification errors. Our experiments reveal a performance split. Google's Gemini 2.5 Flash achieved the highest accuracy on a high-level classification task (79.1%), while OpenAI's GPT-5-mini was the top performer on the more complex fine-grained schema (67.7%). This work establishes a crucial baseline, provides a new context-rich dataset, and introduces an evaluation metric tailored to the demands of this complex legal reasoning task.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

New dataset and severity metric for legal precedent classification, but annotation reliability details are missing.

This paper's main contribution is a new expert-annotated dataset of 239 legal citations and an Average Severity Error metric designed to reflect the real costs of misclassifying how precedents are treated in law. It does a decent job of framing the problem in legal NLP, where standard accuracy falls short because some errors are worse than others. The benchmarks on current LLMs provide a baseline, showing Gemini 2.5 Flash leading on high-level classification and GPT-5-mini on the detailed one. That's useful for anyone tracking how these models handle domain-specific reasoning.

The weaker part is the lack of information on the annotation process. There's no report on inter-annotator agreement, the number of annotators, or how they resolved differences. The way the severity weights were chosen for the new metric also isn't detailed. For a small dataset with inherently interpretive legal categories, this makes it tough to gauge how stable the findings are. Dataset splits and selection criteria for the citations would help too.

Readers in legal informatics or those building automated research tools would find this relevant. It gives them something concrete to build on or compare against, even if the current evaluation leaves some questions open.

I think it deserves a serious referee. The practical angle and the new resources make it worth the time to review and improve, rather than desk rejecting it outright.
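
The stability concern raised in the note can be made concrete with a confidence interval. The sketch below computes a 95% Wilson score interval for a reported accuracy; it assumes the 79.1% figure was measured over all 239 citations and that the items are independent, neither of which this page confirms. The function name `wilson_interval` is chosen here for illustration.

```python
import math
from typing import Tuple


def wilson_interval(accuracy: float, n: int, z: float = 1.96) -> Tuple[float, float]:
    """Wilson score interval for a proportion observed on n independent items."""
    if n <= 0 or not 0.0 <= accuracy <= 1.0:
        raise ValueError("n must be positive and accuracy must lie in [0, 1]")
    denom = 1.0 + z * z / n
    center = (accuracy + z * z / (2 * n)) / denom
    half_width = z * math.sqrt(accuracy * (1 - accuracy) / n + z * z / (4 * n * n)) / denom
    return center - half_width, center + half_width


# Assumption: the high-level accuracy was measured on all 239 citations.
low, high = wilson_interval(0.791, 239)
print(f"95% interval: {low:.3f} to {high:.3f}")
```

Under these assumptions the interval is roughly ten percentage points wide, which illustrates why a dataset of this size leaves model rankings sensitive to a handful of relabeled citations.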

Referee Report

1 major / 2 minor

Summary. The manuscript introduces a new expert-annotated dataset of 239 real-world legal citations for multi-label classification of precedent treatments. It benchmarks several LLMs on high-level and fine-grained schemas, proposes a novel Average Severity Error metric to account for the practical impact of classification errors, and reports specific results including Gemini 2.5 Flash at 79.1% accuracy on the high-level task and GPT-5-mini at 67.7% on the fine-grained task.

Significance. If the ground-truth annotations are reliable, the work provides a useful baseline for LLM performance on nuanced legal reasoning, releases a context-rich dataset, and introduces a severity-weighted metric that addresses shortcomings of standard accuracy in high-stakes domains. These contributions could support further research in legal NLP.

major comments (1)

[Abstract and Dataset section] The manuscript provides no details on the annotation process for the 239 citations, including the number of annotators, inter-annotator agreement statistics (e.g., Cohen’s kappa or Krippendorff’s alpha), disagreement-resolution procedure, or citation selection criteria. Because all reported accuracies and the Average Severity Error metric depend directly on these labels, the absence of this information prevents verification of the central empirical claims.
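
The agreement statistic requested here is straightforward to compute once two annotators' labels are available. As a minimal sketch, assuming exactly two annotators and a single label per citation (the paper's setting is multi-label, where a set-based statistic such as Krippendorff's alpha with a set distance may fit better), Cohen's kappa could be computed as follows. The function name and the example labels are hypothetical.

```python
from collections import Counter
from typing import Sequence


def cohens_kappa(labels_a: Sequence[str], labels_b: Sequence[str]) -> float:
    """Cohen's kappa: chance-corrected agreement between two annotators."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("both annotators must label the same non-empty set of items")
    n = len(labels_a)
    # Observed agreement: share of items given the same label by both annotators.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Expected agreement if each annotator labeled at random with their own label rates.
    counts_a, counts_b = Counter(labels_a), Counter(labels_b)
    all_labels = counts_a.keys() | counts_b.keys()
    expected = sum(counts_a[label] * counts_b[label] for label in all_labels) / (n * n)
    if expected == 1.0:
        return 1.0  # both annotators used one identical label throughout
    return (observed - expected) / (1.0 - expected)


# Hypothetical labels from two annotators on four citations.
annotator_1 = ["negative", "negative", "positive", "positive"]
annotator_2 = ["negative", "positive", "positive", "positive"]
kappa = cohens_kappa(annotator_1, annotator_2)
```

In this toy case the annotators agree on three of four items, but half of that agreement is expected by chance, so kappa is 0.5 rather than 0.75.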

minor comments (2)

[Results] Results tables: Ensure that the high-level and fine-grained schemas are explicitly defined with example labels so readers can interpret the performance split between models.
[Evaluation Metric] Metric definition: Provide the exact formula and severity weights for the Average Severity Error metric, including how they were derived.

Simulated Authors' Rebuttal

1 response · 0 unresolved

We thank the referee for their constructive feedback and for highlighting the need for greater transparency in our dataset construction. We address the major comment below and will revise the manuscript accordingly to strengthen the verifiability of our results.

Referee: [Abstract and Dataset section] The manuscript provides no details on the annotation process for the 239 citations, including the number of annotators, inter-annotator agreement statistics (e.g., Cohen’s kappa or Krippendorff’s alpha), disagreement-resolution procedure, or citation selection criteria. Because all reported accuracies and the Average Severity Error metric depend directly on these labels, the absence of this information prevents verification of the central empirical claims.

Authors: We agree that these details are necessary to allow readers to assess label reliability. The current manuscript omitted a full account of the annotation methodology. In the revised version, we will add a dedicated subsection in the Dataset section describing the citation selection criteria, the number and qualifications of annotators, inter-annotator agreement statistics, and the disagreement-resolution procedure. These additions will directly support verification of the reported accuracies and Average Severity Error metric.

revision: yes

Circularity Check

0 steps flagged

No significant circularity in empirical benchmarking on new dataset

The paper collects a fresh expert-annotated dataset of 239 legal citations, directly evaluates several LLMs on high-level and fine-grained multi-label classification tasks, and introduces a new Average Severity Error metric. No equations, fitted parameters, or predictions are defined in terms of the target results; performance numbers (Gemini 2.5 Flash at 79.1%, GPT-5-mini at 67.7%) are straightforward empirical measurements against the held-out annotations. The work contains no self-citations that bear the central claim and no derivations that collapse to inputs by construction, rendering the evaluation chain self-contained.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

Claims rest on the assumption that expert labels constitute valid ground truth and that the severity-weighted metric meaningfully captures practical legal impact; no free parameters or invented entities are introduced.

axioms (1)

domain assumption: Expert annotations on legal citations provide reliable ground truth for precedent treatment labels.
The evaluation framework depends on these annotations as the basis for measuring model performance and metric validity.

pith-pipeline@v0.9.0 · 5669 in / 1132 out tokens · 35911 ms · 2026-05-20T12:21:06.679629+00:00 · methodology

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Foundation/RealityFromDistinction.lean · reality_from_one_distinction · unclear
Paper passage: "We benchmark modern Large Language Models on a new, expert-annotated dataset of 239 real-world legal citations and propose a novel Average Severity Error metric"

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · unclear
Paper passage: "Gemini 2.5 Flash achieved the highest accuracy on a high-level classification task (79.1%)"

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

27 extracted references · 27 canonical work pages

[3] Schwartz, David L.; Albrecht, Kat; Pah, Adam; Cotropia, Christopher Anthony; Sanders, Amy Kristin; Sanga, Sarath; Alexander, Charlotte; Amaral, Luis A. N.; Clopton, Zachary D.; Tucker, Anne M.; Gaylord, Thomas; Daniel, Scott; Dahlberg, Nathan. The … doi:10.2139/ssrn.4948027

[4] Taylor, William L. Comparing …

[5] Guha, Neel; Nyarko, Julian; Ho, Daniel E.; Ré, Christopher; Chilton, Adam; Narayana, Aditya; Chohlas-Wood, Alex; Peters, Austin; Waldon, Brandon; Rockmore, Daniel N.; Zambrano, Diego; Talisman, Dmitry; Hoque, Enam; Surani, Faiz; Fagan, Frank; Sarfaty, Galit; Dickinson, Gregory M.; Porat, Haggai; Heglan... 2023.

[6] LexisNexis. 2014.

[7] LexisNexis. 2022.

[8] LexisNexis. 2016.

[9] Thomson Reuters. 2025.

[10] Bloomberg Law. 2025.

[11] Bloomberg Law. 2021.

[12] Hellyer, Paul. Evaluating … Law Library Journal.

[13] Zheng, Lucia; Guha, Neel; Anderson, Brandon R.; Henderson, Peter; Ho, Daniel E. When … arXiv, 2021. doi:10.48550/arXiv.2104.08671

[14] Demir, M. Mikail; Otal, Hakan T.; Canbaz, M. Abdullah. arXiv, 2025. doi:10.48550/arXiv.2501.10915

[15] Chalkidis, Ilias; Fergadiotis, Manos; Malakasiotis, Prodromos; Aletras, Nikolaos; Androutsopoulos, Ion. arXiv, 2020. doi:10.48550/arXiv.2010.02559

[16] Locke, Daniel; Zuccon, Guido. Towards … Proceedings of the 24th … 2019. doi:10.1145/3372124.3372128

[17] Taylor, William L. Comparing … Law Library Journal.

[18] Chalkidis, Ilias; Androutsopoulos, Ion; Aletras, Nikolaos. Neural … Proceedings of the 57th … July 2019. doi:10.18653/v1/P19-1424

[19] Mamakas, Dimitris; Tsotsi, Petros; Androutsopoulos, Ion; Chalkidis, Ilias. Processing … Proceedings of the … December 2022. doi:10.18653/v1/2022.nllp-1.11

[20] Chien, Colleen; Kim, Miriam. Generative … doi:10.1787/c2c1d276-en

[21] Ashley, K. D. Modelling Legal Argument: …

[22] Berman, Donald H.; Hafner, Carole D. Understanding Precedents in a Temporal Context of Evolving Legal Doctrine. doi:10.1145/222092.222116

[23] Prakken, Henry. A Logical Framework for Modelling Legal Argument. doi:10.1145/158976.158977

[24] Normative Conflicts in Legal Reasoning. June 1992. doi:10.1007/BF00114921

[25] Galgani, Filippo; Hoffmann, Achim. 2010. doi:10.1007/978-3-642-17432-2_45

[26] Kurniawan, Kemal; Mistica, Meladel; Baldwin, Timothy; Lau, Jey Han. To … arXiv:2408.02257. doi:10.48550/ARXIV.2408.02257

[27] A Survey of Hierarchical Classification across Different Application Domains. January 2011. doi:10.1007/s10618-010-0175-9

[28] Understanding Stare Decisis. 2022.

[29] Panagis, Yannis; Sadl, Urska; Tarissan, Fabien. Giving Every Case Its (Legal) Due. Frontiers in … December 2017. doi:10.3233/978-1-61499-838-9-59
