LegalBench-BR: A Benchmark for Evaluating Large Language Models on Brazilian Legal Decision Classification
Pith reviewed 2026-05-10 04:24 UTC · model grok-4.3
The pith
General-purpose LLMs cannot substitute for domain-adapted models in Brazilian legal classification even on a simple 5-class task.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
On the LegalBench-BR dataset of 3,105 TJSC appellate proceedings annotated into five legal areas through LLM-assisted labeling with heuristic validation, BERTimbau-LoRA achieves 87.6 percent accuracy and 0.87 macro-F1 on a balanced test set, outperforming Claude 3.5 Haiku by 22 percentage points and GPT-4o mini by 28. The largest gap is on the administrativo class, where GPT-4o mini scores F1 of 0.00 and Claude 3.5 Haiku 0.08 while the fine-tuned model reaches 0.91. Both commercial LLMs exhibit a systematic bias toward the civel class, absorbing ambiguous cases into it rather than discriminating among them, a failure mode eliminated by domain-adapted fine-tuning.
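To make the failure mode concrete: macro-F1 averages per-class F1 without frequency weighting, so a model that never predicts one class takes a fixed hit regardless of overall accuracy. A minimal sketch with scikit-learn, using synthetic labels rather than the paper's data:

```python
# Illustrative only: synthetic labels, not the LegalBench-BR data.
from sklearn.metrics import classification_report, f1_score

classes = ["administrativo", "civel", "criminal", "trabalhista", "tributario"]

# Ground truth: 4 examples per class (balanced, as in the paper's test set).
y_true = [c for c in classes for _ in range(4)]

# A biased model that routes every administrativo case to civel
# but gets the other classes right.
y_pred = ["civel" if c == "administrativo" else c for c in y_true]

print(classification_report(y_true, y_pred, zero_division=0))
# administrativo gets precision = recall = F1 = 0.00, and macro-F1
# drops to ~0.73 even though raw accuracy is still 80%.
print("macro-F1:", f1_score(y_true, y_pred, average="macro"))
```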
What carries the argument
The LegalBench-BR benchmark dataset, together with LoRA fine-tuning of BERTimbau, which adapts the model to Brazilian legal text by updating only 0.3 percent of its parameters.
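A minimal sketch of that recipe with Hugging Face transformers and peft; the rank, scaling, and target modules below are assumptions (the review does not state the paper's hyperparameters), chosen to land near the reported 0.3 percent trainable fraction:

```python
# Sketch of LoRA adaptation of BERTimbau for 5-way legal classification.
# Hyperparameters (r, alpha, target modules) are assumed, not from the paper.
from transformers import AutoModelForSequenceClassification
from peft import LoraConfig, TaskType, get_peft_model

base = "neuralmind/bert-base-portuguese-cased"  # BERTimbau
model = AutoModelForSequenceClassification.from_pretrained(base, num_labels=5)

lora = LoraConfig(
    task_type=TaskType.SEQ_CLS,
    r=8,                        # low-rank update dimension (assumed)
    lora_alpha=16,              # scaling factor (assumed)
    lora_dropout=0.1,
    target_modules=["query", "value"],  # BERT attention projections
)
model = get_peft_model(model, lora)

# Only the low-rank adapters (plus the classification head) are trainable,
# roughly 0.3% of the ~110M base parameters with this configuration.
model.print_trainable_parameters()
```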
If this is right
- Commercial LLMs cannot be relied upon for accurate Brazilian legal text classification without adaptation.
- LoRA fine-tuning on a consumer GPU eliminates the bias and performance gap at zero added inference cost (see the adapter-merging sketch after this list).
- Domain-adapted models remain necessary even when the classification task involves only five broad legal areas.
- Releasing the dataset and pipeline enables direct comparison and extension by other researchers.
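The zero-added-inference-cost claim in the list above follows from LoRA's structure: after training, the low-rank update can be folded back into the base weights, so the served model is architecturally identical to plain BERTimbau. A sketch, continuing from the peft setup shown earlier:

```python
# After fine-tuning, the LoRA update W + BA can be folded into the base
# weight matrix, so serving the adapted model costs exactly as much as
# serving plain BERTimbau. Assumes `model` is the peft model from the
# earlier sketch; the output path is a hypothetical placeholder.
merged = model.merge_and_unload()   # returns a plain transformers model
merged.save_pretrained("bertimbau-legalbench-br")
```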
Where Pith is reading between the lines
- Similar fine-tuning may be required for legal NLP tasks in other low-resource languages or jurisdictions where general models lack sufficient domain exposure.
- The pronounced failure on administrative law suggests that general pretraining corpora contain too little Brazilian administrative legal language.
- Future work could test whether scaling model size or using retrieval-augmented generation narrows the gap without any parameter updates.
Load-bearing premise
The LLM-assisted labeling followed by heuristic validation produces accurate and unbiased ground-truth labels across the five legal areas.
What would settle it
Human legal experts re-annotating a sample of the test set and finding substantial disagreement with the existing labels would invalidate the performance comparisons.
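A concrete form that check could take: sample the test set, collect independent expert labels, and compute a chance-corrected agreement statistic against the released labels. A minimal sketch; the 0.6 cut-off is a conventional rule of thumb for "substantial" agreement, not a threshold from the paper:

```python
# Sketch: quantify expert agreement with the released labels on a sample.
# expert_labels would come from independent re-annotation; the threshold
# is a conventional rule of thumb, not taken from the paper.
from sklearn.metrics import cohen_kappa_score

def audit_labels(released_labels, expert_labels, threshold=0.6):
    kappa = cohen_kappa_score(released_labels, expert_labels)
    print(f"Cohen's kappa = {kappa:.2f}")
    if kappa < threshold:
        print("Substantial disagreement: performance comparisons are suspect.")
    return kappa
```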
Original abstract
We introduce LegalBench-BR, the first public benchmark for evaluating language models on Brazilian legal text classification. The dataset comprises 3,105 appellate proceedings from the Santa Catarina State Court (TJSC), collected via the DataJud API (CNJ) and annotated across five legal areas through LLM-assisted labeling with heuristic validation. On a class-balanced test set, BERTimbau-LoRA, updating only 0.3% of model parameters, achieves 87.6% accuracy and 0.87 macro-F1 (+22pp over Claude 3.5 Haiku, +28pp over GPT-4o mini). The gap is most striking on administrativo (administrative law): GPT-4o mini scores F1 = 0.00 and Claude 3.5 Haiku scores F1 = 0.08 on this class, while the fine-tuned model reaches F1 = 0.91. Both commercial LLMs exhibit a systematic bias toward civel (civil law), absorbing ambiguous classes rather than discriminating them, a failure mode that domain-adapted fine-tuning eliminates. These results demonstrate that general-purpose LLMs cannot substitute for domain-adapted models in Brazilian legal classification, even when the task is a simple 5-class problem, and that LoRA fine-tuning on a consumer GPU closes the gap at zero marginal inference cost. We release the full dataset, model, and pipeline to enable reproducible research in Portuguese legal NLP.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript introduces LegalBench-BR, the first public benchmark for Brazilian legal text classification, consisting of 3,105 appellate proceedings from the TJSC collected via the DataJud API and annotated into five legal areas (administrativo, civel, criminal, trabalhista, tributario) via LLM-assisted labeling followed by heuristic validation. On a class-balanced test set, a LoRA fine-tuned BERTimbau model (updating 0.3% of parameters) achieves 87.6% accuracy and 0.87 macro-F1, outperforming zero-shot GPT-4o mini (+28pp) and Claude 3.5 Haiku (+22pp). The gap is largest on the administrativo class (fine-tuned F1=0.91 vs. 0.00 and 0.08 for the commercial models), which the authors attribute to a systematic bias in general LLMs toward the civel class. The work concludes that domain-adapted fine-tuning is required even for this 5-class task and releases the full dataset, model, and pipeline.
Significance. If the ground-truth labels prove reliable, the results provide concrete evidence that general-purpose LLMs exhibit systematic failures on nuanced legal distinctions in Portuguese even for a simple multi-class problem, while efficient LoRA adaptation on consumer hardware closes the gap at zero inference cost. The public release of the dataset, model, and pipeline is a clear strength that supports reproducible research in legal NLP and enables follow-up studies on Portuguese legal text.
Major comments (2)
- [Abstract / Dataset construction] The central claim that 'general-purpose LLMs cannot substitute for domain-adapted models' rests entirely on the 3,105 labels being accurate and free of systematic bias. The manuscript describes only 'LLM-assisted labeling with heuristic validation' but reports no inter-annotator agreement, no expert review statistics, no details on the labeling LLM or the specific heuristics, and no explicit checks for labeler-model alignment on ambiguous cases (e.g., administrativo vs. civel). This leaves open the possibility that the observed performance gap (especially the near-zero F1 on administrativo for GPT-4o mini and Claude) reflects label contamination rather than inherent model limitations.
- [Results] Class-specific F1 scores: the striking gap on administrativo is presented as evidence of LLM bias, yet without independent expert validation of the labels for borderline cases or an analysis of whether the heuristics primarily corrected obvious errors rather than subtle legal distinctions, it is unclear whether the 0.91 F1 for the fine-tuned model demonstrates superior discrimination or simply better alignment with the (potentially LLM-derived) ground truth.
Minor comments (2)
- [Abstract] The abstract states 'class-balanced test set' but provides no details on the train-test split methodology, class distribution, or how balance was achieved; this information is needed for reproducibility and should be added to the dataset section (see the split sketch after this list).
- [Introduction] Consider adding a brief comparison table or discussion of prior legal NLP benchmarks in other languages (e.g., English legal classification tasks) to better contextualize the novelty and difficulty of LegalBench-BR.
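On the first minor comment above: one common way to obtain a class-balanced test set from an imbalanced pool is to reserve the same number of held-out examples per class. A sketch of that procedure, which is illustrative and not necessarily the authors' method:

```python
# Sketch of one way to produce a class-balanced test set; illustrative,
# not necessarily the procedure used for LegalBench-BR.
from collections import defaultdict
import random

def balanced_split(texts, labels, per_class_test, seed=42):
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for text, label in zip(texts, labels):
        by_class[label].append(text)
    train, test = [], []
    for label, items in by_class.items():
        rng.shuffle(items)
        # Reserve the same number of test examples from every class.
        test += [(t, label) for t in items[:per_class_test]]
        train += [(t, label) for t in items[per_class_test:]]
    rng.shuffle(train)
    rng.shuffle(test)
    return train, test
```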
Simulated Author's Rebuttal
We thank the referee for the constructive and detailed feedback. We address each major comment below, clarifying the labeling methodology and indicating revisions to improve transparency and address concerns about label reliability.
Point-by-point responses
- Referee: [Abstract / Dataset construction] The central claim that 'general-purpose LLMs cannot substitute for domain-adapted models' rests entirely on the 3,105 labels being accurate and free of systematic bias. The manuscript describes only 'LLM-assisted labeling with heuristic validation' but reports no inter-annotator agreement, no expert review statistics, no details on the labeling LLM or the specific heuristics, and no explicit checks for labeler-model alignment on ambiguous cases (e.g., administrativo vs. civel). This leaves open the possibility that the observed performance gap reflects label contamination rather than inherent model limitations.
Authors: We agree that the original manuscript lacked sufficient detail on the annotation pipeline, which weakens the central claim. In the revised version we will expand the Dataset Construction section to name the initial labeling model (GPT-4o), list the concrete heuristics (keyword matching against Brazilian legal terminology, consistency checks against TJSC metadata fields, and rejection of low-confidence LLM outputs), and report the fraction of labels altered by the heuristics (approximately 12%). Because the process was LLM-assisted with rule-based validation rather than multi-human annotation, traditional inter-annotator agreement was not computed; we will add an explicit limitations paragraph acknowledging this design choice and the residual risk of label noise on borderline administrativo/civel cases. We maintain that the heuristics were intentionally conservative and targeted obvious misclassifications, but we accept that this does not constitute independent expert adjudication. revision: yes
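For concreteness, a sketch of what a validation pass of the shape the authors describe could look like; the keyword lists, the metadata field name, and the confidence threshold are hypothetical placeholders, not the actual rules:

```python
# Sketch of the heuristic validation pass described above. Keyword lists,
# the metadata field name, and the confidence threshold are hypothetical
# placeholders, not the authors' actual rules.
AREA_KEYWORDS = {  # hypothetical terminology per class
    "administrativo": ["servidor público", "licitação", "improbidade"],
    "tributario": ["icms", "iptu", "execução fiscal"],
}

def validate_label(llm_label, llm_confidence, text, metadata,
                   min_confidence=0.8):
    # 1. Reject low-confidence LLM outputs outright.
    if llm_confidence < min_confidence:
        return None  # route to re-labeling
    # 2. Consistency check against a court metadata field (hypothetical name).
    if metadata.get("assunto_area") not in (None, llm_label):
        return None
    # 3. Keyword override: strong terminology for a different class
    #    flags the label for correction.
    lowered = text.lower()
    for area, keywords in AREA_KEYWORDS.items():
        if area != llm_label and any(k in lowered for k in keywords):
            return None
    return llm_label  # label passes all heuristics
```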
- Referee: [Results] Class-specific F1 scores: the striking gap on administrativo is presented as evidence of LLM bias, yet without independent expert validation of the labels for borderline cases or an analysis of whether the heuristics primarily corrected obvious errors rather than subtle legal distinctions, it is unclear whether the 0.91 F1 for the fine-tuned model demonstrates superior discrimination or simply better alignment with the (potentially LLM-derived) ground truth.
Authors: We acknowledge the interpretive ambiguity. The revised manuscript will include a new subsection under Results that (i) shows label distribution before and after heuristic validation, (ii) provides concrete examples of administrativo cases where the initial LLM label was overridden by heuristics, and (iii) reports error patterns of the zero-shot models on those same cases. While we cannot retroactively obtain independent expert validation, the fine-tuned BERTimbau-LoRA was trained on the final validated labels and still generalizes to a held-out test set; its advantage is therefore not merely memorization of the annotation artifacts. We will also add a qualitative discussion contrasting the legal distinctions captured by the domain-adapted model versus the civil-law default bias observed in the commercial LLMs. revision: partial
Acknowledged limitations (2)
- Independent expert review or adjudication of ambiguous labels (especially administrativo vs. civel) was never performed; the original study relied solely on LLM-assisted labeling plus heuristics.
- Inter-annotator agreement statistics are unavailable because the annotation pipeline did not involve multiple human annotators.
Circularity Check
No circularity: purely empirical benchmark with direct held-out measurements
Full rationale
The paper introduces LegalBench-BR as an empirical benchmark dataset of 3,105 annotated appellate proceedings and reports direct performance measurements (accuracy, macro-F1) of zero-shot LLMs versus a LoRA-fine-tuned BERTimbau on a class-balanced held-out test set. No mathematical derivation, fitted parameters, predictions, or first-principles claims are present; all reported numbers are computed against the released labels without reduction to self-defined inputs or self-citation chains. The labeling process (LLM-assisted with heuristic validation) is described as a data-creation step rather than a tautological fit, and the central claim follows straightforwardly from the observed gaps on held-out data.
Axiom & Free-Parameter Ledger
Axioms (1)
- Domain assumption: the five legal areas are mutually exclusive and exhaustive for the sampled appellate proceedings.
Reference graph
Works this paper leans on
- [1] Albuquerque, H.O. et al. (2022). “UlyssesNER-BR: A Corpus of Brazilian Legislative Documents for Named Entity Recognition.” PROPOR 2022.
- [2] Chen, Q. et al. (2023). “Parameter-Efficient Fine-Tuning of Large Language Models for Biomedical NLP.” ACL 2023 Findings.
- [3] Guha, N. et al. (2023). “LegalBench: A Collaboratively Built Benchmark for Measuring Legal Reasoning in Large Language Models.” arXiv:2308.11462.
- [4] Hendrycks, D. et al. (2021). “CUAD: An Expert-Annotated NLP Dataset for Legal Contract Review.” NeurIPS 2021.
- [5] Hu, E.J. et al. (2021). “LoRA: Low-Rank Adaptation of Large Language Models.” arXiv:2106.09685.
- [6] Lage-Freitas, A. et al. (2022). “An AI-powered System for Legal Case Classification in the Brazilian Judiciary.” Artificial Intelligence and Law, 30(4).
- [7] Luz de Araujo, P.H. et al. (2018). “LeNER-BR: A Dataset for Named Entity Recognition in Brazilian Legal Text.” PROPOR 2018.
- [8] Pangakis, N. et al. (2023). “Automated Annotation with Generative AI Requires Validation.” arXiv:2306.00176.
- [9] Rodrigues, J. et al. (2023). “Advancing Neural Encoding of Portuguese with Transformer Albertina PT-*.” arXiv:2305.06721.
- [10] Souza, F. et al. (2020). “BERTimbau: Pretrained BERT Models for Brazilian Portuguese.” PROPOR 2020.
- [11] Yang, H. et al. (2023). “FinGPT: Open-Source Financial Large Language Models.” arXiv:2306.06031.
- [12] Zheng, L. et al. (2021). “When Does Pretraining Help? Assessing Self-Supervised Learning for Law and the CaseHOLD Dataset.” ICAIL 2021.