pith. machine review for the scientific record. sign in

arxiv: 2605.04005 · v1 · submitted 2026-05-05 · 💻 cs.IR

Recognition: unknown

Domain-Adaptive Dense Retrieval for Brazilian Legal Search

Jayr Pereira, Luiz Bonifacio, Roberto Lotufo

Pith reviewed 2026-05-07 13:55 UTC · model grok-4.3

classification 💻 cs.IR
keywords dense retrievaldomain adaptationlegal information retrievalBrazilian Portugueseembedding fine-tuningquestion answering datasetsNDCG evaluationout-of-domain generalization
0
0 comments X

The pith

Mixing legal data with general questions produces more balanced dense retrievers for Brazilian legal search.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Brazilian legal retrieval spans case law, legislation, and question-style queries, forcing a choice between deep specialization and broad robustness when fine-tuning embedding models. The paper tests three versions of Qwen3-Embedding-4B: an untouched base model, one trained only on legal data, and one trained on legal data mixed with the SQuAD-pt question-answering set. On five JU'A legal benchmarks plus the Quati out-of-domain test, the mixed model improves average NDCG@10 from 0.414 to 0.447, MRR@10 from 0.586 to 0.595, and MAP@10 from 0.270 to 0.308, with the clearest gains on Quati. Legal-only training still wins on the most narrowly legal subtasks, showing that the two strategies produce different strengths rather than one universally superior model.

Core claim

Fine-tuning on a mixture of legal corpora and the SQuAD-pt supervised dataset yields retrievers that preserve strong performance on specialized legal tasks while delivering higher average scores and better out-of-domain results than legal-only fine-tuning. The mixed model records the largest lift on the Quati benchmark, confirming improved robustness for question-based retrieval without sacrificing domain adaptation.

What carries the argument

The mixed training setup that combines legal data with SQuAD-pt for fine-tuning Qwen3-Embedding-4B, contrasted against a legal-only baseline.

If this is right

  • Legal-only fine-tuning remains the stronger choice when the deployment target is narrow, highly specialized legal document retrieval.
  • The mixed approach delivers a single model that handles both traditional legal search and question-style queries without large drops in either.
  • Releasing the two adapted models enables direct follow-up experiments on additional Brazilian Portuguese retrieval tasks.
  • Data-mixture ratios can be used as a tunable knob to trade specialization against cross-type robustness in domain-adaptive retrieval.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same legal-plus-general mixture pattern could be tested in other heterogeneous domains such as medical records or technical support logs.
  • If the gains hold under controlled re-training, practitioners could reduce reliance on scarce in-domain labeled data by supplementing with public QA collections.
  • Longer-term evaluation on live Brazilian court queries would show whether the reported balance survives distribution shift in real user behavior.

Load-bearing premise

The observed metric gains result from the choice of training data mixture rather than differences in training procedure, data cleaning steps, or evaluation details.

What would settle it

Re-run both the legal-only and mixed fine-tunings from the identical base checkpoint using the same optimizer schedule, batch size, and data-cleaning pipeline, then check whether the reported NDCG, MRR, and MAP gaps remain on the JU'A and Quati test sets.

read the original abstract

Brazilian legal retrieval is heterogeneous, covering case law, legislation, and question-based search. This makes training dense retrievers a trade-off between stronger domain specialization and broader robustness across retrieval types of search. In this paper, we explore this trade-off using three training setups based on Qwen3-Embedding-4B: a base model with no fine-tuning, a version trained only on legal data, and a mixed setup that combines legal data with SQuAD-pt supervised dataset. We evaluate these models on five legal datasets from the JU\'A leaderboard, along with Quati dataset as an extra Portuguese retrieval benchmark to test out-of-domain generalization. The legal-only model performs best on the most specialized legal tasks. The mixed setup keeps strong performance on legal data while offering a better overall balance, improving average NDCG@10 from 0.414 to 0.447, MRR@10 from 0.586 to 0.595, and MAP@10 from 0.270 to 0.308 across all six datasets. The biggest improvement appears on Quati, where the mixed model clearly outperforms the legal-only one. Overall, the results show that legal-only and mixed training lead to different strengths: the first is better for specialization, while the second is more robust across different types of search, especially question-based ones. Both adapted models are available on Hugging Face

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 2 minor

Summary. The manuscript investigates the trade-off between domain specialization and generalization in dense retrieval for Brazilian legal search. Using the Qwen3-Embedding-4B model, it compares three setups—no fine-tuning, fine-tuning on legal data only, and fine-tuning on a mixture of legal data plus the SQuAD-pt supervised dataset—evaluated on five JU'A legal datasets plus the Quati out-of-domain benchmark. The legal-only model performs best on specialized legal tasks, while the mixed setup yields higher average metrics (NDCG@10 0.414→0.447, MRR@10 0.586→0.595, MAP@10 0.270→0.308) and stronger out-of-domain results, especially on Quati; both adapted models are released on Hugging Face.

Significance. If the attribution of gains to the data mixture holds, the work usefully demonstrates that modest incorporation of general supervised data can improve robustness across legal and question-based retrieval without major in-domain loss. The multi-dataset evaluation and public model release provide concrete, reproducible artifacts for the legal IR community.

major comments (1)
  1. The three training setups are introduced in the abstract and described as base, legal-only, and mixed (legal + SQuAD-pt), yet no statement confirms that optimizer, learning-rate schedule, epoch count, batch size, negative-sampling method, or data-cleaning steps were identical across the two fine-tuned runs. Because the central claim credits the observed metric lifts and the Quati improvement specifically to the data-mixture choice, this missing control information is load-bearing and must be supplied (e.g., in an Experimental Setup section or table) before the causal interpretation can be accepted.
minor comments (2)
  1. The abstract and results report only aggregate averages; adding per-dataset scores, standard deviations, or statistical significance tests would allow readers to assess whether the gains are consistent or driven by a single dataset such as Quati.
  2. Clarify the exact composition and size of the legal training corpus and the proportion of SQuAD-pt used in the mixed setup to support replication.

Simulated Author's Rebuttal

1 responses · 0 unresolved

We thank the referee for the constructive feedback on our manuscript. The point raised about experimental controls is valid and we address it directly below, with a commitment to revise the paper for greater clarity and reproducibility.

read point-by-point responses
  1. Referee: The three training setups are introduced in the abstract and described as base, legal-only, and mixed (legal + SQuAD-pt), yet no statement confirms that optimizer, learning-rate schedule, epoch count, batch size, negative-sampling method, or data-cleaning steps were identical across the two fine-tuned runs. Because the central claim credits the observed metric lifts and the Quati improvement specifically to the data-mixture choice, this missing control information is load-bearing and must be supplied (e.g., in an Experimental Setup section or table) before the causal interpretation can be accepted.

    Authors: We agree that the manuscript should explicitly confirm the training controls to support the causal interpretation. The legal-only and mixed runs used identical hyperparameters and procedures, differing only in training data composition. We will add a new 'Training Configuration' subsection (with an accompanying table) to the Experimental Setup that details: optimizer (AdamW), learning-rate schedule (2e-5 with linear decay and 10% warmup), epochs (3), batch size (32), negative-sampling method (in-batch negatives), and data-cleaning steps (deduplication and length filtering). This revision will make the experimental design fully transparent and strengthen the manuscript's claims. revision: yes

Circularity Check

0 steps flagged

No circularity: purely empirical model comparisons on held-out data

full rationale

The paper reports results from training three variants of Qwen3-Embedding-4B (base, legal-only, mixed with SQuAD-pt) and evaluating them on the JU'A legal datasets plus Quati. All reported metrics (NDCG@10, MRR@10, MAP@10) are direct empirical measurements on fixed test splits; no equations, derivations, fitted parameters renamed as predictions, or self-citation chains appear in the load-bearing claims. The central trade-off between specialization and robustness is established by explicit side-by-side runs rather than by construction or imported uniqueness theorems.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

No explicit free parameters, axioms, or invented entities beyond standard supervised fine-tuning practices; the work is purely empirical.

pith-pipeline@v0.9.0 · 5549 in / 1089 out tokens · 68140 ms · 2026-05-07T13:55:31.198923+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Reference graph

Works this paper leans on

27 extracted references · 22 canonical work pages · 5 internal anchors

  1. [1]

    Bonifacio, L.H., Jeronymo, V., Abonizio, H.Q., Campiotti, I., Fadaee, M., Lotufo, R., Nogueira, R.: mmarco: A multilingual version of ms marco passage ranking dataset (2021)

  2. [2]

    In: Anais do XV Simp´ osio Brasileiro de Tecnologia da Informa¸ c˜ ao e da Linguagem Humana

    Bueno, M., de Oliveira, E.S., Nogueira, R., Lotufo, R., Pereira, J.: Quati: A Brazilian Portuguese Information Retrieval Dataset from Native Speak- ers. In: Anais do XV Simp´ osio Brasileiro de Tecnologia da Informa¸ c˜ ao e da Linguagem Humana. pp. 236–246. SBC, Porto Alegre, RS, Brasil (2024). Domain-Adaptive Dense Retrieval for Brazilian Legal Search 1...

  3. [3]

    Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers) pp

    Feng, Y., Li, C., Ng, V.: Legal Case Retrieval: A Survey of the State of the Art. Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers) pp. 6472–6485 (2024). https://doi.org/10.18653/v1/2024.acl-long.350

  4. [4]

    Language Resources and Evaluation 60(1), 23 (2026) https://doi.org/10.1007/s10579-025-09881-w

    Fernandes, L.C., Ribeiro, L.d.S., de Castro, M.V.B., da Silva Pacheco, L.A., de Oliveira Sandes, E.F.: JurisTCU: a Brazilian Portuguese information retrieval dataset with query relevance judgments. Language Resources and Evaluation 60(1), 23 (2026). https://doi.org/10.1007/s10579-025-09881-w

  5. [5]

    Computer Science Review 60, 100906 (2026)

    He, C., Hu, H., Li, Y., Zhang, H., Zhang, Q.: A Survey of Large Language Models for Legal Tasks: Progress, Prospects and Challenges. Computer Science Review 60, 100906 (2026). https://doi.org/10.1016/j.cosrev.2026.100906

  6. [6]

    LoRA: Low-Rank Adaptation of Large Language Models

    Hu, E.J., Shen, Y., Wallis, P., Allen-Zhu, Z., Li, Y., Wang, S., Wang, L., Chen, W.: LoRA: Low-Rank Adaptation of Large Language Models. arXiv preprint arXiv:2106.09685 (2021),https://arxiv.org/abs/2106.09685

  7. [7]

    In: de Freitas, R., Furtado, D

    J´ unior, J.D., Faria, A., de Oliveira, E.S., de Brito, E., Teotonio, M., Assump¸ c˜ ao, A., Carmo, D., Lotufo, R., Pereira, J.: BR-TaxQA-R: A Dataset for Question Answer- ing with References for Brazilian Personal Income Tax Law, Including Case Law. In: de Freitas, R., Furtado, D. (eds.) Intelligent Systems. pp. 208–222. Springer Nature Switzerland, Cham (2026)

  8. [8]

    In: Proceed- ings of the 2020 Conference on Empirical Methods in Natural Language Process- ing (EMNLP)

    Karpukhin, V., Oguz, B., Min, S., Lewis, P., Wu, L., Edunov, S., Chen, D., Yih, W.t.: Dense Passage Retrieval for Open-Domain Question Answering. In: Proceed- ings of the 2020 Conference on Empirical Methods in Natural Language Process- ing (EMNLP). pp. 6769–6781 (2020). https://doi.org/10.18653/v1/2020.emnlp- main.550

  9. [9]

    The Review of Socionetwork Strategies18, 101–121 (2024)

    Kim, M.Y., Rabelo, J., Babiker, H.K.B., Rahman, M.A., Goebel, R.: Le- gal Information Retrieval and Entailment Using Transformer-based Ap- proaches. The Review of Socionetwork Strategies18, 101–121 (2024). https://doi.org/10.1007/s12626-023-00153-z,https://link.springer.com/ article/10.1007/s12626-023-00153-z

  10. [10]

    In: Proceedings of the 17th Con- ference of the European Chapter of the Association for Computational Linguis- tics

    Louis, A., van Dijck, G., Spanakis, G.: Finding the Law: Enhancing Statutory Article Retrieval via Graph Neural Networks. In: Proceedings of the 17th Con- ference of the European Chapter of the Association for Computational Linguis- tics. pp. 2761–2776. Association for Computational Linguistics, Dubrovnik, Croatia (2023). https://doi.org/10.18653/v1/2023....

  11. [11]

    ACM Transactions on Informa- tion Systems42(2), 40:1–40:28 (2024)

    Ma, Y., Wu, Y., Ai, Q., Liu, Y., Shao, Y., Zhang, M., Ma, S.: Incorporating Structural Information into Legal Case Retrieval. ACM Transactions on Informa- tion Systems42(2), 40:1–40:28 (2024). https://doi.org/10.1145/3609796,https: //dl.acm.org/doi/10.1145/3609796

  12. [12]

    In: Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing

    Ma, Y., Wu, Y., Su, W., Ai, Q., Liu, Y.: CaseEncoder: A Knowledge- enhanced Pre-trained Model for Legal Case Encoding. In: Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing. pp. 7134–7143. Association for Computational Linguistics, Singapore (2023). https://doi.org/10.18653/v1/2023.emnlp-main.441,https://aclanthology.o...

  13. [13]

    Journal of Empirical Legal Studies22(2), 216–242 (2025)

    Magesh, V., Surani, F., Dahl, M., Suzgun, M., Manning, C.D., Ho, D.E.: Hallucination-Free? Assessing the Reliability of Leading AI Legal Re- search Tools. Journal of Empirical Legal Studies22(2), 216–242 (2025). https://doi.org/10.1111/jels.12413 14 Pereira et al

  14. [14]

    arXiv preprint arXiv:2210.07316 , year=

    Muennighoff, N., Tazi, N., Magne, L., Reimers, N.: MTEB: Massive Text Embed- ding Benchmark. arXiv preprint arXiv:2210.07316 (2022)

  15. [15]

    arXiv preprint arXiv:2411.07739 (2024),https:// arxiv.org/abs/2411.07739

    de Oliveira Lima, J.A.: Unlocking Legal Knowledge with Multi-Layered Embedding-Based Retrieval. arXiv preprint arXiv:2411.07739 (2024),https:// arxiv.org/abs/2411.07739

  16. [16]

    Representation Learning with Contrastive Predictive Coding

    van den Oord, A., Li, Y., Vinyals, O.: Representation Learning with Contrastive Predictive Coding. arXiv preprint arXiv:1807.03748 (2018),https://arxiv.org/ abs/1807.03748

  17. [17]

    Artificial Intelligence and Law25(1), 65–87 (2017)

    van Opijnen, M., Santos, C.: On the Concept of Relevance in Legal In- formation Retrieval. Artificial Intelligence and Law25(1), 65–87 (2017). https://doi.org/10.1007/s10506-017-9195-8,https://link.springer.com/ article/10.1007/s10506-017-9195-8

  18. [18]

    JU\'A -- A Benchmark for Information Retrieval in Brazilian Legal Text Collections

    Pereira, J., Fernandes, L., de Brito, E., Lotufo, R., Bonifacio, L.: Ju´ a – a benchmark for information retrieval in brazilian legal text collections (2026),https://arxiv. org/abs/2604.06098

  19. [19]

    The probabilistic relevance framework: Bm25 and beyond

    Robertson, S., Zaragoza, H.: The Probabilistic Relevance Framework: BM25 and Beyond. Foundations and Trends in Information Retrieval3(4), 333–389 (2009). https://doi.org/10.1561/1500000019

  20. [20]

    arXiv preprint arXiv:2108.10127 (2021),https://arxiv.org/abs/2108.10127

    Rossi, J., Kanoulas, E.: Legal Search in Case Law and Statute Law. arXiv preprint arXiv:2108.10127 (2021),https://arxiv.org/abs/2108.10127

  21. [21]

    In: Proceedings of the 2024 Joint International Conference on Computational Lin- guistics, Language Resources and Evaluation (LREC-COLING 2024)

    Santosh, T.Y.S.S., Haddad, R., Grabmair, M.: ECtHR-PCR: A dataset for prece- dent understanding and prior case retrieval in the european court of human rights. In: Proceedings of the 2024 Joint International Conference on Computational Lin- guistics, Language Resources and Evaluation (LREC-COLING 2024). pp. 5473–

  22. [22]

    lrec-main.486/

    ELRA and ICCL, Torino, Italia (2024),https://aclanthology.org/2024. lrec-main.486/

  23. [23]

    In: Proceedings of the Neural Information Processing Systems Track on Datasets and Benchmarks (2021)

    Thakur, N., Reimers, N., R¨ uckl´ e, A., Srivastava, A., Gurevych, I.: BEIR: A het- erogeneous benchmark for zero-shot evaluation of information retrieval models. In: Proceedings of the Neural Information Processing Systems Track on Datasets and Benchmarks (2021)

  24. [24]

    Language Resources and Evaluation59(2), 1257 (2025)

    Vit´ orio, D., Souza, E., Martins, L., da Silva, N.F.F., de Carvalho, A.C.P.d.L., Oliveira, A.L.I., de Andrade, F.E.: Building a relevance feedback corpus for legal information retrieval in the real-case scenario of the Brazilian Cham- ber of Deputies. Language Resources and Evaluation59(2), 1257 (2025). https://doi.org/10.1007/s10579-024-09767-3

  25. [25]

    Text Embeddings by Weakly-Supervised Contrastive Pre-training

    Wang, L., Yang, N., Huang, X., Yang, L., Gao, F., Wei, Z., Zhang, Y., Zhou, M., et al.: Text embeddings by weakly-supervised contrastive pre-training. arXiv preprint arXiv:2212.03533 (2022)

  26. [26]

    Qwen3 Embedding: Advancing Text Embedding and Reranking Through Foundation Models

    Zhang, Y., Li, M., Long, D., Zhang, X., Lin, H., Yang, B., Xie, P., Yang, A., Liu, D., Lin, J., Huang, F., Zhou, J.: Qwen3 Embedding: Advancing Text Embedding and Reranking Through Foundation Models. arXiv preprint arXiv:2506.05176 (2025), https://arxiv.org/abs/2506.05176

  27. [27]

    arXiv preprint arXiv:2408.05517 (2024),https://arxiv.org/abs/ 2408.05517

    Zhao, Y., Huang, J., Hu, J., Wang, X., Mao, Y., Zhang, D., Jiang, Z., Wu, Z., Ai, B., Wang, A., Zhou, W., Chen, Y.: SWIFT: A Scalable lightWeight Infrastructure for Fine-Tuning. arXiv preprint arXiv:2408.05517 (2024),https://arxiv.org/abs/ 2408.05517