pith. machine review for the scientific record.

arxiv: 2605.05532 · v1 · submitted 2026-05-07 · 💻 cs.CL · cs.CY

Recognition: unknown

A Few Good Clauses: Comparing LLMs vs Domain-Trained Small Language Models on Structured Contract Extraction

Authors on Pith: no claims yet

Pith reviewed 2026-05-08 11:18 UTC · model grok-4.3

classification 💻 cs.CL cs.CY
keywords contract extraction · small language models · large language models · legal AI · structured data extraction · cost efficiency · hallucination reduction

The pith

A domain-trained small language model outperforms frontier LLMs on structured contract extraction at 78 to 97 percent lower cost.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper tests whether a specialized small language model trained on legal data can match or exceed the results of much larger general-purpose models when pulling structured clauses from contracts. It reports that the smaller model reached the best overall scores while producing fewer unsupported extractions and operating at a fraction of the cost. This result matters because contract work requires high reliability to avoid downstream review burden and legal risk, yet current approaches often assume only the largest hosted systems can deliver that reliability. If the finding holds, it opens the door to running capable legal extraction locally without depending on massive external infrastructure.

Core claim

Olava Extract, a self-hosted legal-domain Mixture of Experts model, achieved the strongest aggregate performance with a macro F1 of 0.812 and a micro F1 of 0.842, exceeded the precision of all five frontier models, generated fewer hallucinated or unsupported extractions, and lowered inference cost by 78 to 97 percent.
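
The claim rests on two aggregates over the 26 extraction fields. Macro F1 averages the per-field F1 scores, so every field counts equally; micro F1 pools true positives, false positives, and false negatives across fields before computing F1, so frequent fields dominate. A minimal sketch of the difference, with hypothetical field names and counts that are not from the paper:

```python
# Minimal sketch of the two aggregates in the core claim. Field names and
# counts below are illustrative placeholders, not the paper's data.
from dataclasses import dataclass

@dataclass
class FieldCounts:
    tp: int  # values extracted and correct
    fp: int  # values extracted but wrong or unsupported
    fn: int  # values present in the contract but missed

def f1(tp: int, fp: int, fn: int) -> float:
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return 2 * precision * recall / (precision + recall) if precision + recall else 0.0

# Hypothetical per-field results for three of the 26 fields.
fields = {
    "effective_date": FieldCounts(tp=90, fp=5, fn=5),
    "governing_law":  FieldCounts(tp=80, fp=10, fn=10),
    "termination":    FieldCounts(tp=40, fp=20, fn=40),  # a rare, hard field
}

# Macro F1: average the per-field F1 scores, so every field counts equally.
macro_f1 = sum(f1(c.tp, c.fp, c.fn) for c in fields.values()) / len(fields)

# Micro F1: pool the counts first, so frequent fields dominate.
tp = sum(c.tp for c in fields.values())
fp = sum(c.fp for c in fields.values())
fn = sum(c.fn for c in fields.values())
micro_f1 = f1(tp, fp, fn)

print(f"macro F1 = {macro_f1:.3f}, micro F1 = {micro_f1:.3f}")
```

A macro score below the micro score, as reported here (0.812 vs 0.842), typically signals that a few rare or hard fields drag the per-field average down.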

What carries the argument

Olava Extract, the domain-trained small language model, carries the argument by delivering higher aggregate F1 scores, top precision, and major cost savings in a head-to-head comparison against frontier models on structured contract extraction.

If this is right

  • High-performing legal extraction becomes feasible without relying on the largest models or external hosting.
  • Lower inference costs allow wider deployment of automated contract review inside organizations.
  • Reduced hallucination rates lower operational risk and review burden in legal workflows.
  • Enterprise capability in this domain can be decoupled from ever-larger model sizes and central infrastructure.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Domain-specific training may let smaller models close the gap with general models on narrow, high-stakes tasks.
  • Similar cost and accuracy gains could appear in other regulated fields that need precise document structuring.
  • Organizations might build internal models tuned to their own contract templates rather than depending on public providers.

Load-bearing premise

The evaluation dataset and extraction task accurately represent real-world legal contract workflows, and the comparisons with frontier models used equivalent conditions without differences in prompting or post-processing.
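
To make the premise concrete: "equivalent conditions" would mean every model sees the same prompt template, is asked for the same output format, and passes through the same parsing and failure handling. A minimal sketch of such a shared harness, assuming a placeholder call_model interface (none of these names come from the paper):

```python
# A sketch of what "equivalent conditions" means operationally: one prompt
# template, one output format, and one post-processing step shared by every
# model under test. call_model is a placeholder, not the paper's harness.
import json
from typing import Callable

PROMPT_TEMPLATE = (
    "Extract the following fields from the contract and answer as JSON "
    "with keys {fields}. Use null for any field that is absent.\n\n"
    "Contract:\n{contract}"
)

def evaluate(call_model: Callable[[str], str], contract: str, fields: list[str]) -> dict:
    """Run one model under the shared protocol and parse its output."""
    prompt = PROMPT_TEMPLATE.format(fields=fields, contract=contract)
    raw = call_model(prompt)       # identical prompt for every model
    try:
        return json.loads(raw)     # identical post-processing for every model
    except json.JSONDecodeError:
        return {}                  # identical failure handling for every model
```

Any per-model deviation at these three points, a tuned prompt for one system or a more lenient parser for another, would shift both the accuracy and the cost comparison.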

What would settle it

A follow-up test on a larger and more varied collection of real contracts that shows one or more frontier models achieving higher accuracy, better precision, or fewer unsupported extractions than Olava Extract would disprove the performance advantage.

Figures

Figures reproduced from arXiv: 2605.05532 by Jaron Mar, Nick Whitehouse, Nicole Lincoln, Rivindu Perera.

Figure 1. Model-development and evaluation pipeline.
Figure 2. Aggregate macro- and micro-averaged F1 across the 26-field extraction task. Olava Extract ranks highest on both measures.
Figure 3. Olava Extract mean F1 by field category. Performance is strongest on short-text identifiers and extracted text concepts.
original abstract

This paper evaluates whether a domain trained Small Language Model (SLM) can outperform frontier Large Language Models on structured contract extraction at radically lower cost. We test Olava Extract, a self hosted legal domain Mixture of Experts model, against five frontier models. Olava Extract achieved the strongest aggregate performance in the study, with a macro F1 of 0.812 and a micro F1 of 0.842, while reducing inference cost by 78% to 97% compared with the frontier models tested. It also achieved the highest precision scores, producing fewer hallucinated and unsupported extractions, an important distinction in legal workflows where hallucinations create operational risk and downstream review burden. The findings shows that high performing, human comparable legal AI no longer requires the largest externally hosted models. More broadly, they challenge the assumption that commercially valuable enterprise AI capability must remain tied to ever larger models, massive infrastructure expenditure, and centrally hosted providers.
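
The precision claim turns on what counts as a "hallucinated or unsupported extraction". The abstract does not define the criterion, but one plausible operationalization, sketched below as an illustration rather than the authors' method, flags any extracted value whose normalized text cannot be found in the source contract:

```python
# One plausible support check for extractions: an extracted value counts as
# unsupported if its normalized text never appears in the source contract.
# Illustrative only; the paper's exact criterion is not stated in the abstract.
import re

def normalize(text: str) -> str:
    """Lowercase and collapse whitespace so trivial formatting
    differences do not count as hallucinations."""
    return re.sub(r"\s+", " ", text.strip().lower())

def unsupported_extractions(extracted: dict[str, str], contract: str) -> list[str]:
    """Return the fields whose extracted value never appears in the contract."""
    haystack = normalize(contract)
    return [
        field
        for field, value in extracted.items()
        if value and normalize(value) not in haystack
    ]

contract = "This Agreement is governed by the laws of New York."
extracted = {"governing_law": "New York", "termination_notice": "30 days"}
print(unsupported_extractions(extracted, contract))  # ['termination_notice']
```

Stricter variants would also require the value to appear within the clause the schema targets, not merely anywhere in the document.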

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

2 major / 1 minor

Summary. The paper evaluates whether a domain-trained small language model (Olava Extract, a self-hosted legal-domain Mixture of Experts model) can outperform five frontier large language models on structured contract extraction. It reports that Olava Extract achieves the strongest results with macro F1 of 0.812, micro F1 of 0.842, highest precision (fewer hallucinations), and 78-97% lower inference cost, concluding that high-performing legal AI no longer requires the largest externally hosted models.

Significance. If the results are robust and generalizable, the work is significant for demonstrating that domain-specific SLMs can deliver superior precision and cost efficiency on specialized enterprise tasks like contract extraction, with direct implications for reducing operational risk and infrastructure costs in legal AI workflows.

major comments (2)
  1. [Abstract] The reported macro/micro F1 scores, precision advantages, and cost reductions are presented without any accompanying dataset size, composition, source, prompting details, post-processing steps, or statistical significance tests. This omission is load-bearing because the central claim of superiority over frontier models cannot be verified or replicated without these elements.
  2. [Evaluation methodology, assumed §4 or equivalent] The comparison assumes equivalent task framing and conditions across models, but no explicit confirmation is given that prompting, output formatting, or post-processing were held constant; any undisclosed differences would undermine the cost and performance claims.
minor comments (1)
  1. [Abstract] Grammatical error in 'The findings shows' (should be 'show').

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive comments on the abstract and evaluation methodology. These points highlight opportunities to improve the clarity and replicability of our central claims. We address each comment below and will incorporate revisions to strengthen the manuscript.

point-by-point responses
  1. Referee: [Abstract] The reported macro/micro F1 scores, precision advantages, and cost reductions are presented without any accompanying dataset size, composition, source, prompting details, post-processing steps, or statistical significance tests. This omission is load-bearing because the central claim of superiority over frontier models cannot be verified or replicated without these elements.

    Authors: We agree that the abstract would benefit from including key high-level details to support immediate verification of the claims. The full manuscript provides the dataset size and composition (Section 3), source (a curated legal contract corpus), prompting templates and post-processing steps (Section 4), and reports statistical significance tests on the F1 differences (results section). We will revise the abstract to concisely reference the dataset scale, note the consistent evaluation protocol, and indicate that performance differences are supported by statistical testing. revision: yes

  2. Referee: [Evaluation methodology, assumed §4 or equivalent] The comparison assumes equivalent task framing and conditions across models, but no explicit confirmation is given that prompting, output formatting, or post-processing were held constant; any undisclosed differences would undermine the cost and performance claims.

    Authors: All models were evaluated under identical task framing, with the same prompting strategy, output formatting constraints, and post-processing pipeline applied uniformly to Olava Extract and the five frontier models. This setup is described in the evaluation methodology section. To address the lack of explicit confirmation, we will add a direct statement in the revised manuscript clarifying that these elements were held constant across all models to ensure a fair comparison. revision: yes

Circularity Check

0 steps flagged

No significant circularity

full rationale

This is a pure empirical benchmarking study that directly measures and compares model performance (F1 scores, precision, inference cost) on a contract extraction task. No equations, derivations, fitted parameters, predictions, or ansatzes appear in the reported results or abstract. The central claims rest on observed experimental outcomes rather than any self-referential reduction or load-bearing self-citation chain. The argument structure is self-contained against external benchmarks and contains no detectable circular steps.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests on the unstated assumption that the test contracts and extraction schema are representative of real legal work and that the frontier models received comparable instructions.

axioms (1)
  • domain assumption: The evaluation dataset and task definition are representative of real-world legal contract extraction needs.
    All performance and cost claims depend on this representativeness, which is not demonstrated in the abstract.

pith-pipeline@v0.9.0 · 5470 in / 1393 out tokens · 39515 ms · 2026-05-08T11:18:29.330341+00:00 · methodology

discussion (0)


Reference graph

Works this paper leans on

10 extracted references · 9 canonical work pages · 2 internal anchors

  1. [1]

    Marah Abdin et al. 2024. Phi-3 Technical Report: A Highly Capable Language Model Locally on Your Phone. arXiv:2404.14219 [cs.CL] doi:10.48550/arXiv.2404.14219

  2. [2]

    Ilias Chalkidis, Manos Fergadiotis, Prodromos Malakasiotis, Nikolaos Aletras, and Ion Androutsopoulos. 2020. LEGAL-BERT: The Muppets Straight out of Law School. In Findings of the Association for Computational Linguistics: EMNLP 2020. Association for Computational Linguistics, Online, 2898–2904. doi:10.18653/v1/2020.findings-emnlp.261

  3. [3]

    Pierre Colombo, Telmo Pessoa Pires, Malik Boudiaf, Dominic Culver, Rui Melo, Caio Corro, Andre F. T. Martins, Fabrizio Esposito, Vera Lúcia Raposo, Sofia Morgado, and Michael Desa. 2024. SaulLM-7B: A pioneering Large Language Model for Law. arXiv:2403.03883 [cs.CL] doi:10.48550/arXiv.2403.03883

  4. [4]

    William Fedus, Barret Zoph, and Noam Shazeer. 2022. Switch Transformers: Scaling to Trillion Parameter Models with Simple and Efficient Sparsity. Journal of Machine Learning Research 23, 120 (2022), 1–39. http://jmlr.org/papers/v23/21-0998.html

  5. [5]

    Dan Hendrycks, Collin Burns, Anya Chen, and Spencer Ball. 2021. CUAD: An Expert-Annotated NLP Dataset for Legal Contract Review. In Proceedings of the Neural Information Processing Systems Track on Datasets and Benchmarks. arXiv:2103.06268 [cs.CL] https://datasets-benchmarks-proceedings.neurips.cc/paper/2021/hash/6ea9ab1baa0efb9e19094440c317e21b-Abstract...

  6. [6]

    Edward J. Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Lu Wang, and Weizhu Chen. 2022. LoRA: Low-Rank Adaptation of Large Language Models. In International Conference on Learning Representations. arXiv:2106.09685 [cs.CL] https://openreview.net/forum?id=nZeVKeeFYf9

  7. [7]

    Albert Q. Jiang, Alexandre Sablayrolles, Antoine Roux, Arthur Mensch, Blanche Savary, Chris Bamford, Devendra Singh Chaplot, Diego de las Casas, Emma Bou Hanna, Florian Bressand, Gianna Lengyel, Guillaume Bour, Guillaume Lample, Lélio Renard Lavaud, Lucile Saulnier, Marie-Anne Lachaux, Pierre Stock, Sandeep Subramanian, Sophia Yang, Szymon Antoniak, Teven...

  8. [8]

    Yuta Koreeda and Christopher D. Manning. 2021. ContractNLI: A Dataset for Document-level Natural Language Inference for Contracts. In Findings of the Association for Computational Linguistics: EMNLP 2021. Association for Computational Linguistics, Punta Cana, Dominican Republic, 1907–1919. doi:10.18653/v1/2021.findings-emnlp.164

  9. [9]

    Lauren Martin, Nick Whitehouse, Stephanie Yiu, Lizzie Catterson, and Rivindu Perera. 2024. Better Call GPT, Comparing Large Language Models Against Lawyers. arXiv:2401.16212 [cs.CY] doi:10.48550/arXiv.2401.16212

  10. [10]

    Nick Whitehouse, Nicole Lincoln, Stephanie Yiu, Lizzie Catterson, and Rivindu Perera. 2025. Better Bill GPT: Comparing Large Language Models against Legal Invoice Reviewers. arXiv:2504.02881 [cs.CL] doi:10.48550/arXiv.2504.02881