LegalBench-BR: A Benchmark for Evaluating Large Language Models on Brazilian Legal Decision Classification
Pith reviewed 2026-05-10 04:24 UTC · model grok-4.3
The pith
General-purpose LLMs cannot substitute for domain-adapted models in Brazilian legal classification even on a simple 5-class task.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
On the LegalBench-BR dataset of 3,105 TJSC appellate proceedings annotated into five legal areas through LLM-assisted labeling with heuristic validation, BERTimbau-LoRA achieves 87.6 percent accuracy and 0.87 macro-F1 on a balanced test set, outperforming Claude 3.5 Haiku by 22 percentage points and GPT-4o mini by 28. The largest gap is on the administrativo class, where GPT-4o mini scores F1 of 0.00 and Claude 3.5 Haiku 0.08 while the fine-tuned model reaches 0.91. Both commercial LLMs exhibit a systematic bias toward the civel class, absorbing ambiguous cases into it rather than discriminating among them, a failure mode eliminated by domain-adapted fine-tuning.
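To make the failure mode concrete: macro-F1 averages per-class F1 without frequency weighting, so a model that never predicts one class takes a fixed hit regardless of overall accuracy. A minimal sketch with scikit-learn, using synthetic labels rather than the paper's data:

```python
# Illustrative only: synthetic labels, not the LegalBench-BR data.
from sklearn.metrics import classification_report, f1_score

classes = ["administrativo", "civel", "criminal", "trabalhista", "tributario"]

# Ground truth: 4 examples per class (balanced, as in the paper's test set).
y_true = [c for c in classes for _ in range(4)]

# A biased model that routes every administrativo case to civel
# but gets the other classes right.
y_pred = ["civel" if c == "administrativo" else c for c in y_true]

print(classification_report(y_true, y_pred, zero_division=0))
# administrativo gets precision = recall = F1 = 0.00, and macro-F1
# drops to ~0.73 even though raw accuracy is still 80%.
print("macro-F1:", f1_score(y_true, y_pred, average="macro"))
```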
What carries the argument
The LegalBench-BR benchmark dataset, together with LoRA fine-tuning of BERTimbau, which adapts the model to Brazilian legal text by updating only 0.3 percent of its parameters.
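A minimal sketch of that recipe with Hugging Face transformers and peft; the rank, scaling, and target modules below are assumptions (the review does not state the paper's hyperparameters), chosen to land near the reported 0.3 percent trainable fraction:

```python
# Sketch of LoRA adaptation of BERTimbau for 5-way legal classification.
# Hyperparameters (r, alpha, target modules) are assumed, not from the paper.
from transformers import AutoModelForSequenceClassification
from peft import LoraConfig, TaskType, get_peft_model

base = "neuralmind/bert-base-portuguese-cased"  # BERTimbau
model = AutoModelForSequenceClassification.from_pretrained(base, num_labels=5)

lora = LoraConfig(
    task_type=TaskType.SEQ_CLS,
    r=8,                        # low-rank update dimension (assumed)
    lora_alpha=16,              # scaling factor (assumed)
    lora_dropout=0.1,
    target_modules=["query", "value"],  # BERT attention projections
)
model = get_peft_model(model, lora)

# Only the low-rank adapters (plus the classification head) are trainable,
# roughly 0.3% of the ~110M base parameters with this configuration.
model.print_trainable_parameters()
```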
If this is right
- Commercial LLMs cannot be relied upon for accurate Brazilian legal text classification without adaptation.
- LoRA fine-tuning on a consumer GPU eliminates the bias and performance gap at zero added inference cost (see the adapter-merging sketch after this list).
- Domain-adapted models remain necessary even when the classification task involves only five broad legal areas.
- Releasing the dataset and pipeline enables direct comparison and extension by other researchers.
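The zero-added-inference-cost claim in the list above follows from LoRA's structure: after training, the low-rank update can be folded back into the base weights, so the served model is architecturally identical to plain BERTimbau. A sketch, continuing from the peft setup shown earlier:

```python
# After fine-tuning, the LoRA update W + BA can be folded into the base
# weight matrix, so serving the adapted model costs exactly as much as
# serving plain BERTimbau. Assumes `model` is the peft model from the
# earlier sketch; the output path is a hypothetical placeholder.
merged = model.merge_and_unload()   # returns a plain transformers model
merged.save_pretrained("bertimbau-legalbench-br")
```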
Where Pith is reading between the lines
- Similar fine-tuning may be required for legal NLP tasks in other low-resource languages or jurisdictions where general models lack sufficient domain exposure.
- The pronounced failure on administrative law suggests that general pretraining corpora contain too little Brazilian administrative legal language.
- Future work could test whether scaling model size or using retrieval-augmented generation narrows the gap without any parameter updates.
Load-bearing premise
The LLM-assisted labeling followed by heuristic validation produces accurate and unbiased ground-truth labels across the five legal areas.
What would settle it
Human legal experts re-annotating a sample of the test set and finding substantial disagreement with the existing labels would invalidate the performance comparisons.
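A concrete form that check could take: sample the test set, collect independent expert labels, and compute a chance-corrected agreement statistic against the released labels. A minimal sketch; the 0.6 cut-off is a conventional rule of thumb for "substantial" agreement, not a threshold from the paper:

```python
# Sketch: quantify expert agreement with the released labels on a sample.
# expert_labels would come from independent re-annotation; the threshold
# is a conventional rule of thumb, not taken from the paper.
from sklearn.metrics import cohen_kappa_score

def audit_labels(released_labels, expert_labels, threshold=0.6):
    kappa = cohen_kappa_score(released_labels, expert_labels)
    print(f"Cohen's kappa = {kappa:.2f}")
    if kappa < threshold:
        print("Substantial disagreement: performance comparisons are suspect.")
    return kappa
```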
Original abstract
We introduce LegalBench-BR, the first public benchmark for evaluating language models on Brazilian legal text classification. The dataset comprises 3,105 appellate proceedings from the Santa Catarina State Court (TJSC), collected via the DataJud API (CNJ) and annotated across five legal areas through LLM-assisted labeling with heuristic validation. On a class-balanced test set, BERTimbau-LoRA, updating only 0.3% of model parameters, achieves 87.6% accuracy and 0.87 macro-F1 (+22pp over Claude 3.5 Haiku, +28pp over GPT-4o mini). The gap is most striking on administrativo (administrative law): GPT-4o mini scores F1 = 0.00 and Claude 3.5 Haiku scores F1 = 0.08 on this class, while the fine-tuned model reaches F1 = 0.91. Both commercial LLMs exhibit a systematic bias toward civel (civil law), absorbing ambiguous classes rather than discriminating them, a failure mode that domain-adapted fine-tuning eliminates. These results demonstrate that general-purpose LLMs cannot substitute for domain-adapted models in Brazilian legal classification, even when the task is a simple 5-class problem, and that LoRA fine-tuning on a consumer GPU closes the gap at zero marginal inference cost. We release the full dataset, model, and pipeline to enable reproducible research in Portuguese legal NLP.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript introduces LegalBench-BR, the first public benchmark for Brazilian legal text classification, consisting of 3,105 appellate proceedings from the TJSC collected via the DataJud API and annotated into five legal areas (administrativo, civel, criminal, trabalhista, tributario) via LLM-assisted labeling followed by heuristic validation. On a class-balanced test set, a LoRA fine-tuned BERTimbau model (updating 0.3% of parameters) achieves 87.6% accuracy and 0.87 macro-F1, outperforming zero-shot GPT-4o mini (+28pp) and Claude 3.5 Haiku (+22pp). The gap is largest on the administrativo class (fine-tuned F1=0.91 vs. 0.00 and 0.08 for the commercial models), which the authors attribute to a systematic bias in general LLMs toward the civel class. The work concludes that domain-adapted fine-tuning is required even for this 5-class task and releases the full dataset, model, and pipeline.
Significance. If the ground-truth labels prove reliable, the results provide concrete evidence that general-purpose LLMs exhibit systematic failures on nuanced legal distinctions in Portuguese even for a simple multi-class problem, while efficient LoRA adaptation on consumer hardware closes the gap at zero inference cost. The public release of the dataset, model, and pipeline is a clear strength that supports reproducible research in legal NLP and enables follow-up studies on Portuguese legal text.
Major comments (2)
- [Abstract / Dataset construction] The central claim that 'general-purpose LLMs cannot substitute for domain-adapted models' rests entirely on the 3,105 labels being accurate and free of systematic bias. The manuscript describes only 'LLM-assisted labeling with heuristic validation' but reports no inter-annotator agreement, no expert review statistics, no details on the labeling LLM or the specific heuristics, and no explicit checks for labeler-model alignment on ambiguous cases (e.g., administrativo vs. civel). This leaves open the possibility that the observed performance gap (especially the near-zero F1 on administrativo for GPT-4o mini and Claude) reflects label contamination rather than inherent model limitations.
- [Results] Class-specific F1 scores: the striking gap on administrativo is presented as evidence of LLM bias, yet without independent expert validation of the labels for borderline cases or an analysis of whether the heuristics primarily corrected obvious errors rather than subtle legal distinctions, it is unclear whether the 0.91 F1 for the fine-tuned model demonstrates superior discrimination or simply better alignment with the (potentially LLM-derived) ground truth.
Minor comments (2)
- [Abstract] The abstract states 'class-balanced test set' but provides no details on the train-test split methodology, class distribution, or how balance was achieved; this information is needed for reproducibility and should be added to the dataset section (see the split sketch after this list).
- [Introduction] Consider adding a brief comparison table or discussion of prior legal NLP benchmarks in other languages (e.g., English legal classification tasks) to better contextualize the novelty and difficulty of LegalBench-BR.
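On the first minor comment above: one common way to obtain a class-balanced test set from an imbalanced pool is to reserve the same number of held-out examples per class. A sketch of that procedure, which is illustrative and not necessarily the authors' method:

```python
# Sketch of one way to produce a class-balanced test set; illustrative,
# not necessarily the procedure used for LegalBench-BR.
from collections import defaultdict
import random

def balanced_split(texts, labels, per_class_test, seed=42):
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for text, label in zip(texts, labels):
        by_class[label].append(text)
    train, test = [], []
    for label, items in by_class.items():
        rng.shuffle(items)
        # Reserve the same number of test examples from every class.
        test += [(t, label) for t in items[:per_class_test]]
        train += [(t, label) for t in items[per_class_test:]]
    rng.shuffle(train)
    rng.shuffle(test)
    return train, test
```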
Simulated Author's Rebuttal
We thank the referee for the constructive and detailed feedback. We address each major comment below, clarifying the labeling methodology and indicating revisions to improve transparency and address concerns about label reliability.
Point-by-point responses
- Referee: [Abstract / Dataset construction] The central claim that 'general-purpose LLMs cannot substitute for domain-adapted models' rests entirely on the 3,105 labels being accurate and free of systematic bias. The manuscript describes only 'LLM-assisted labeling with heuristic validation' but reports no inter-annotator agreement, no expert review statistics, no details on the labeling LLM or the specific heuristics, and no explicit checks for labeler-model alignment on ambiguous cases (e.g., administrativo vs. civel). This leaves open the possibility that the observed performance gap reflects label contamination rather than inherent model limitations.
Authors: We agree that the original manuscript lacked sufficient detail on the annotation pipeline, which weakens the central claim. In the revised version we will expand the Dataset Construction section to name the initial labeling model (GPT-4o), list the concrete heuristics (keyword matching against Brazilian legal terminology, consistency checks against TJSC metadata fields, and rejection of low-confidence LLM outputs), and report the fraction of labels altered by the heuristics (approximately 12%). Because the process was LLM-assisted with rule-based validation rather than multi-human annotation, traditional inter-annotator agreement was not computed; we will add an explicit limitations paragraph acknowledging this design choice and the residual risk of label noise on borderline administrativo/civel cases. We maintain that the heuristics were intentionally conservative and targeted obvious misclassifications, but we accept that this does not constitute independent expert adjudication. revision: yes
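For concreteness, a sketch of what a validation pass of the shape the authors describe could look like; the keyword lists, the metadata field name, and the confidence threshold are hypothetical placeholders, not the actual rules:

```python
# Sketch of the heuristic validation pass described above. Keyword lists,
# the metadata field name, and the confidence threshold are hypothetical
# placeholders, not the authors' actual rules.
AREA_KEYWORDS = {  # hypothetical terminology per class
    "administrativo": ["servidor público", "licitação", "improbidade"],
    "tributario": ["icms", "iptu", "execução fiscal"],
}

def validate_label(llm_label, llm_confidence, text, metadata,
                   min_confidence=0.8):
    # 1. Reject low-confidence LLM outputs outright.
    if llm_confidence < min_confidence:
        return None  # route to re-labeling
    # 2. Consistency check against a court metadata field (hypothetical name).
    if metadata.get("assunto_area") not in (None, llm_label):
        return None
    # 3. Keyword override: strong terminology for a different class
    #    flags the label for correction.
    lowered = text.lower()
    for area, keywords in AREA_KEYWORDS.items():
        if area != llm_label and any(k in lowered for k in keywords):
            return None
    return llm_label  # label passes all heuristics
```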
- Referee: [Results] Class-specific F1 scores: the striking gap on administrativo is presented as evidence of LLM bias, yet without independent expert validation of the labels for borderline cases or an analysis of whether the heuristics primarily corrected obvious errors rather than subtle legal distinctions, it is unclear whether the 0.91 F1 for the fine-tuned model demonstrates superior discrimination or simply better alignment with the (potentially LLM-derived) ground truth.
Authors: We acknowledge the interpretive ambiguity. The revised manuscript will include a new subsection under Results that (i) shows label distribution before and after heuristic validation, (ii) provides concrete examples of administrativo cases where the initial LLM label was overridden by heuristics, and (iii) reports error patterns of the zero-shot models on those same cases. While we cannot retroactively obtain independent expert validation, the fine-tuned BERTimbau-LoRA was trained on the final validated labels and still generalizes to a held-out test set; its advantage is therefore not merely memorization of the annotation artifacts. We will also add a qualitative discussion contrasting the legal distinctions captured by the domain-adapted model versus the civil-law default bias observed in the commercial LLMs. revision: partial
Acknowledged limitations (2)
- Independent expert review or adjudication of ambiguous labels (especially administrativo vs. civel) was never performed; the original study relied solely on LLM-assisted labeling plus heuristics.
- Inter-annotator agreement statistics are unavailable because the annotation pipeline did not involve multiple human annotators.
Circularity Check
No circularity: purely empirical benchmark with direct held-out measurements
Full rationale
The paper introduces LegalBench-BR as an empirical benchmark dataset of 3,105 annotated appellate proceedings and reports direct performance measurements (accuracy, macro-F1) of zero-shot LLMs versus a LoRA-fine-tuned BERTimbau on a class-balanced held-out test set. No mathematical derivation, fitted parameters, predictions, or first-principles claims are present; all reported numbers are computed against the released labels without reduction to self-defined inputs or self-citation chains. The labeling process (LLM-assisted with heuristic validation) is described as a data-creation step rather than a tautological fit, and the central claim follows straightforwardly from the observed gaps on held-out data.
Axiom & Free-Parameter Ledger
Axioms (1)
- Domain assumption: the five legal areas are mutually exclusive and exhaustive for the sampled appellate proceedings.
Reference graph
Works this paper leans on
- [1] Albuquerque, H.O. et al. (2022). “UlyssesNER-BR: A Corpus of Brazilian Legislative Documents for Named Entity Recognition.” PROPOR 2022.
- [2] Chen, Q. et al. (2023). “Parameter-Efficient Fine-Tuning of Large Language Models for Biomedical NLP.” ACL 2023 Findings.
- [3] Guha, N. et al. (2023). “LegalBench: A Collaboratively Built Benchmark for Measuring Legal Reasoning in Large Language Models.” arXiv:2308.11462.
- [4] Hendrycks, D. et al. (2021). “CUAD: An Expert-Annotated NLP Dataset for Legal Contract Review.” NeurIPS 2021.
- [5] Hu, E.J. et al. (2021). “LoRA: Low-Rank Adaptation of Large Language Models.” arXiv:2106.09685.
- [6] Lage-Freitas, A. et al. (2022). “An AI-powered System for Legal Case Classification in the Brazilian Judiciary.” Artificial Intelligence and Law, 30(4).
- [7] Luz de Araujo, P.H. et al. (2018). “LeNER-BR: A Dataset for Named Entity Recognition in Brazilian Legal Text.” PROPOR 2018.
- [8] Pangakis, N. et al. (2023). “Automated Annotation with Generative AI Requires Validation.” arXiv:2306.00176.
- [9] Rodrigues, J. et al. (2023). “Advancing Neural Encoding of Portuguese with Transformer Albertina PT-*.” arXiv:2305.06721.
- [10] Souza, F. et al. (2020). “BERTimbau: Pretrained BERT Models for Brazilian Portuguese.” PROPOR 2020.
- [11] Yang, H. et al. (2023). “FinGPT: Open-Source Financial Large Language Models.” arXiv:2306.06031.
- [12] Zheng, L. et al. (2021). “When Does Pretraining Help? Assessing Self-Supervised Learning for Law and the CaseHOLD Dataset.” ICAIL 2021.