Validate Your Authority: Benchmarking LLMs on Multi-Label Precedent Treatment Classification
Pith reviewed 2026-05-20 12:21 UTC · model grok-4.3
The pith
On a new expert-annotated dataset, large language models classify legal precedent treatments with up to 79.1 percent accuracy on the high-level schema, and a new severity metric weighs how costly each error is.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The authors create an expert-annotated dataset of 239 legal citations and evaluate modern LLMs on multi-label precedent treatment classification. They introduce the Average Severity Error metric to capture the real-world impact of errors. Gemini 2.5 Flash reaches 79.1 percent accuracy on a high-level schema while GPT-5-mini reaches 67.7 percent on a more detailed schema, establishing a baseline for this legal NLP application.
What carries the argument
The Average Severity Error metric, which assigns different costs to classification mistakes based on their potential effect in legal practice rather than treating every error as equal.
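The summary here does not give the metric's formula or weights, so the following is only a minimal sketch of the general idea of severity-weighted error. It assumes one label per citation, an ordinal severity rank per label, and an error equal to the mean absolute rank gap. The label names, the ranks in `SEVERITY`, and both function names are hypothetical and are not taken from the paper.

```python
from typing import Dict, Sequence

# Hypothetical severity ranks (assumption): higher means a more damaging treatment.
SEVERITY: Dict[str, int] = {
    "followed": 0,
    "distinguished": 1,
    "criticized": 2,
    "overruled": 3,
}


def accuracy(gold: Sequence[str], pred: Sequence[str]) -> float:
    """Fraction of exact label matches; every error counts the same."""
    return sum(g == p for g, p in zip(gold, pred)) / len(gold)


def average_severity_error(
    gold: Sequence[str],
    pred: Sequence[str],
    severity: Dict[str, int] = SEVERITY,
) -> float:
    """Mean absolute gap in severity rank between gold and predicted labels."""
    if len(gold) != len(pred):
        raise ValueError("gold and pred must have the same length")
    if not gold:
        raise ValueError("at least one example is required")
    total = sum(abs(severity[g] - severity[p]) for g, p in zip(gold, pred))
    return total / len(gold)


gold = ["overruled", "followed", "distinguished", "criticized"]
pred_a = ["followed", "followed", "distinguished", "criticized"]    # one severe miss
pred_b = ["criticized", "followed", "distinguished", "criticized"]  # one mild miss

print(accuracy(gold, pred_a), average_severity_error(gold, pred_a))
print(accuracy(gold, pred_b), average_severity_error(gold, pred_b))
```

Both hypothetical predictions make exactly one mistake and so have the same accuracy, but calling an overruled case "followed" produces a larger severity error than calling it "criticized". That gap is the kind of distinction a severity-weighted metric is meant to expose.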
If this is right
- The new dataset becomes available for training or further testing of legal analysis models.
- The Average Severity Error metric offers a more realistic way to compare models in high-stakes classification settings.
- Model performance baselines can inform choices when building automated tools for legal research.
- The observed split between high-level and fine-grained results indicates that classification detail level affects which model performs best.
Where Pith is reading between the lines
- The same dataset-plus-metric approach could be tested on other high-risk classification domains such as regulatory compliance or medical case notes.
- Embedding these classifications into legal search systems might reduce the chance of retrieving misleading precedent.
- Longitudinal studies could check whether models that score well on Average Severity Error actually improve outcomes in real legal workflows.
Load-bearing premise
The expert annotations on the 239 citations provide accurate ground truth that captures the true legal meanings without meaningful disagreement or bias in selection.
What would settle it
Independent legal experts re-annotating the same 239 citations and producing substantially different treatment labels would show that the benchmark results rest on unreliable ground truth.
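One way such a re-annotation check could be quantified is with a chance-corrected agreement score. The sketch below computes Cohen's kappa between two annotators, assuming a single label per citation; the multi-label setting of the paper would need a per-label variant or a different statistic such as Krippendorff's alpha. The function name and the toy labels are illustrative assumptions, not data from the paper.

```python
from collections import Counter
from typing import Sequence


def cohens_kappa(a: Sequence[str], b: Sequence[str]) -> float:
    """Cohen's kappa for two annotators labelling the same items."""
    if len(a) != len(b) or not a:
        raise ValueError("both annotators must label the same non-empty item list")
    n = len(a)
    # Observed agreement: share of items given the same label by both annotators.
    observed = sum(x == y for x, y in zip(a, b)) / n
    # Expected agreement under independence, from each annotator's label frequencies.
    count_a, count_b = Counter(a), Counter(b)
    labels = count_a.keys() | count_b.keys()
    expected = sum(count_a[label] * count_b[label] for label in labels) / (n * n)
    if expected == 1.0:
        return 1.0  # both annotators used one identical label throughout
    return (observed - expected) / (1 - expected)


# Toy example (hypothetical labels): original annotation vs. an independent re-annotation.
original = ["negative", "negative", "positive", "positive"]
reannotated = ["negative", "positive", "positive", "positive"]
print(cohens_kappa(original, reannotated))
```

A kappa near 1 would suggest the original labels are reproducible, while a value near 0 would mean agreement is no better than chance.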
Original abstract
Automating the classification of negative treatment in legal precedent is a critical yet nuanced NLP task where misclassification carries significant risk. To address the shortcomings of standard accuracy, this paper introduces a more robust evaluation framework. We benchmark modern Large Language Models on a new, expert-annotated dataset of 239 real-world legal citations and propose a novel Average Severity Error metric to better measure the practical impact of classification errors. Our experiments reveal a performance split. Google's Gemini 2.5 Flash achieved the highest accuracy on a high-level classification task (79.1%), while OpenAI's GPT-5-mini was the top performer on the more complex fine-grained schema (67.7%). This work establishes a crucial baseline, provides a new context-rich dataset, and introduces an evaluation metric tailored to the demands of this complex legal reasoning task.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript introduces a new expert-annotated dataset of 239 real-world legal citations for multi-label classification of precedent treatments. It benchmarks several LLMs on high-level and fine-grained schemas, proposes a novel Average Severity Error metric to account for the practical impact of classification errors, and reports specific results including Gemini 2.5 Flash at 79.1% accuracy on the high-level task and GPT-5-mini at 67.7% on the fine-grained task.
Significance. If the ground-truth annotations are reliable, the work provides a useful baseline for LLM performance on nuanced legal reasoning, releases a context-rich dataset, and introduces a severity-weighted metric that addresses shortcomings of standard accuracy in high-stakes domains. These contributions could support further research in legal NLP.
major comments (1)
- [Abstract and Dataset section] The manuscript provides no details on the annotation process for the 239 citations, including the number of annotators, inter-annotator agreement statistics (e.g., Cohen’s kappa or Krippendorff’s alpha), disagreement-resolution procedure, or citation selection criteria. Because all reported accuracies and the Average Severity Error metric depend directly on these labels, the absence of this information prevents verification of the central empirical claims.
minor comments (2)
- [Results] Results tables: Ensure that the high-level and fine-grained schemas are explicitly defined with example labels so readers can interpret the performance split between models.
- [Evaluation Metric] Metric definition: Provide the exact formula and severity weights for the Average Severity Error metric, including how they were derived.
Simulated Author's Rebuttal
We thank the referee for their constructive feedback and for highlighting the need for greater transparency in our dataset construction. We address the major comment below and will revise the manuscript accordingly to strengthen the verifiability of our results.
Point-by-point responses
- Referee: [Abstract and Dataset section] The manuscript provides no details on the annotation process for the 239 citations, including the number of annotators, inter-annotator agreement statistics (e.g., Cohen’s kappa or Krippendorff’s alpha), disagreement-resolution procedure, or citation selection criteria. Because all reported accuracies and the Average Severity Error metric depend directly on these labels, the absence of this information prevents verification of the central empirical claims.
Authors: We agree that these details are necessary to allow readers to assess label reliability. The current manuscript omitted a full account of the annotation methodology. In the revised version, we will add a dedicated subsection in the Dataset section describing the citation selection criteria, the number and qualifications of annotators, inter-annotator agreement statistics, and the disagreement-resolution procedure. These additions will directly support verification of the reported accuracies and Average Severity Error metric.
Revision: yes
Circularity Check
No significant circularity in empirical benchmarking on new dataset
The paper collects a fresh expert-annotated dataset of 239 legal citations, directly evaluates several LLMs on high-level and fine-grained multi-label classification tasks, and introduces a new Average Severity Error metric. No equations, fitted parameters, or predictions are defined in terms of the target results; the performance numbers (Gemini 2.5 Flash at 79.1%, GPT-5-mini at 67.7%) are straightforward empirical measurements against the expert annotations. The work contains no self-citations that carry the central claim and no derivations that collapse to their inputs by construction, so the evaluation chain is self-contained.
Axiom & Free-Parameter Ledger
axioms (1)
- Domain assumption: Expert annotations on legal citations provide reliable ground truth for precedent treatment labels.
Lean theorems connected to this paper
- IndisputableMonolith/Foundation/RealityFromDistinction.lean: reality_from_one_distinction
  Relation between the paper passage and the cited Recognition theorem: unclear
  Paper passage: "We benchmark modern Large Language Models on a new, expert-annotated dataset of 239 real-world legal citations and propose a novel Average Severity Error metric"
- IndisputableMonolith/Cost/FunctionalEquation.lean: washburn_uniqueness_aczel
  Relation between the paper passage and the cited Recognition theorem: unclear
  Paper passage: "Gemini 2.5 Flash achieved the highest accuracy on a high-level classification task (79.1%)"
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.