arxiv: 2605.04005 · v1 · submitted 2026-05-05 · 💻 cs.IR

Recognition: unknown

Domain-Adaptive Dense Retrieval for Brazilian Legal Search

Jayr Pereira, Luiz Bonifacio, Roberto Lotufo

Pith reviewed 2026-05-07 13:55 UTC · model grok-4.3

classification 💻 cs.IR

keywords dense retrievaldomain adaptationlegal information retrievalBrazilian Portugueseembedding fine-tuningquestion answering datasetsNDCG evaluationout-of-domain generalization

0 comments

The pith

Mixing legal data with general questions produces more balanced dense retrievers for Brazilian legal search.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Brazilian legal retrieval spans case law, legislation, and question-style queries, forcing a choice between deep specialization and broad robustness when fine-tuning embedding models. The paper tests three versions of Qwen3-Embedding-4B: an untouched base model, one trained only on legal data, and one trained on legal data mixed with the SQuAD-pt question-answering set. On five JU'A legal benchmarks plus the Quati out-of-domain test, the mixed model improves average NDCG@10 from 0.414 to 0.447, MRR@10 from 0.586 to 0.595, and MAP@10 from 0.270 to 0.308, with the clearest gains on Quati. Legal-only training still wins on the most narrowly legal subtasks, showing that the two strategies produce different strengths rather than one universally superior model.

Core claim

Fine-tuning on a mixture of legal corpora and the SQuAD-pt supervised dataset yields retrievers that preserve strong performance on specialized legal tasks while delivering higher average scores and better out-of-domain results than legal-only fine-tuning. The mixed model records the largest lift on the Quati benchmark, confirming improved robustness for question-based retrieval without sacrificing domain adaptation.

What carries the argument

The mixed training setup that combines legal data with SQuAD-pt for fine-tuning Qwen3-Embedding-4B, contrasted against a legal-only baseline.

If this is right

Legal-only fine-tuning remains the stronger choice when the deployment target is narrow, highly specialized legal document retrieval.
The mixed approach delivers a single model that handles both traditional legal search and question-style queries without large drops in either.
Releasing the two adapted models enables direct follow-up experiments on additional Brazilian Portuguese retrieval tasks.
Data-mixture ratios can be used as a tunable knob to trade specialization against cross-type robustness in domain-adaptive retrieval.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same legal-plus-general mixture pattern could be tested in other heterogeneous domains such as medical records or technical support logs.
If the gains hold under controlled re-training, practitioners could reduce reliance on scarce in-domain labeled data by supplementing with public QA collections.
Longer-term evaluation on live Brazilian court queries would show whether the reported balance survives distribution shift in real user behavior.

Load-bearing premise

The observed metric gains result from the choice of training data mixture rather than differences in training procedure, data cleaning steps, or evaluation details.

What would settle it

Re-run both the legal-only and mixed fine-tunings from the identical base checkpoint using the same optimizer schedule, batch size, and data-cleaning pipeline, then check whether the reported NDCG, MRR, and MAP gaps remain on the JU'A and Quati test sets.

read the original abstract

Brazilian legal retrieval is heterogeneous, covering case law, legislation, and question-based search. This makes training dense retrievers a trade-off between stronger domain specialization and broader robustness across retrieval types of search. In this paper, we explore this trade-off using three training setups based on Qwen3-Embedding-4B: a base model with no fine-tuning, a version trained only on legal data, and a mixed setup that combines legal data with SQuAD-pt supervised dataset. We evaluate these models on five legal datasets from the JU\'A leaderboard, along with Quati dataset as an extra Portuguese retrieval benchmark to test out-of-domain generalization. The legal-only model performs best on the most specialized legal tasks. The mixed setup keeps strong performance on legal data while offering a better overall balance, improving average NDCG@10 from 0.414 to 0.447, MRR@10 from 0.586 to 0.595, and MAP@10 from 0.270 to 0.308 across all six datasets. The biggest improvement appears on Quati, where the mixed model clearly outperforms the legal-only one. Overall, the results show that legal-only and mixed training lead to different strengths: the first is better for specialization, while the second is more robust across different types of search, especially question-based ones. Both adapted models are available on Hugging Face

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

Mixed training on legal plus general Portuguese data improves average retrieval metrics on the JU'A and Quati benchmarks, but the gains rest on unverified assumptions about identical training procedures.

read the letter

The main takeaway is that fine-tuning Qwen3-Embedding-4B on a mix of Brazilian legal data and SQuAD-pt produces higher average NDCG@10, MRR@10, and MAP@10 across the five JU'A legal datasets plus Quati than fine-tuning on legal data alone, with the clearest lift on the out-of-domain Quati set. Legal-only still wins on the narrowest legal tasks, as expected. The paper simply runs the standard domain-adaptation recipe on these Portuguese collections and reports the numbers. Releasing the two adapted models on Hugging Face is the most immediately useful part for anyone who wants to test them in practice. The empirical comparison on these specific datasets is new relative to the cited prior work. The soft spot is exactly the one the stress-test note flags. The abstract describes the three setups but supplies no information on whether learning rate, epochs, batch size, optimizer, or negative sampling were held constant. Without that, the deltas cannot be confidently credited to the data mixture rather than some other uncontrolled difference. There are also no error bars or significance tests, so the smaller lifts on some metrics are hard to interpret. The datasets themselves are public, but the paper does not argue they fully represent real Brazilian legal search needs. This is straightforward incremental work for people already working on legal IR or Portuguese retrieval. It gives a practical data point on the specialization-versus-robustness trade-off and ships usable models. The thinking is clear and the authors do not overclaim. I would send it to peer review once the methods section is expanded to include the missing training details and basic statistical checks; the experiment itself is worth referee time even if the causal attribution needs tightening.

Referee Report

1 major / 2 minor

Summary. The manuscript investigates the trade-off between domain specialization and generalization in dense retrieval for Brazilian legal search. Using the Qwen3-Embedding-4B model, it compares three setups—no fine-tuning, fine-tuning on legal data only, and fine-tuning on a mixture of legal data plus the SQuAD-pt supervised dataset—evaluated on five JU'A legal datasets plus the Quati out-of-domain benchmark. The legal-only model performs best on specialized legal tasks, while the mixed setup yields higher average metrics (NDCG@10 0.414→0.447, MRR@10 0.586→0.595, MAP@10 0.270→0.308) and stronger out-of-domain results, especially on Quati; both adapted models are released on Hugging Face.

Significance. If the attribution of gains to the data mixture holds, the work usefully demonstrates that modest incorporation of general supervised data can improve robustness across legal and question-based retrieval without major in-domain loss. The multi-dataset evaluation and public model release provide concrete, reproducible artifacts for the legal IR community.

major comments (1)

The three training setups are introduced in the abstract and described as base, legal-only, and mixed (legal + SQuAD-pt), yet no statement confirms that optimizer, learning-rate schedule, epoch count, batch size, negative-sampling method, or data-cleaning steps were identical across the two fine-tuned runs. Because the central claim credits the observed metric lifts and the Quati improvement specifically to the data-mixture choice, this missing control information is load-bearing and must be supplied (e.g., in an Experimental Setup section or table) before the causal interpretation can be accepted.

minor comments (2)

The abstract and results report only aggregate averages; adding per-dataset scores, standard deviations, or statistical significance tests would allow readers to assess whether the gains are consistent or driven by a single dataset such as Quati.
Clarify the exact composition and size of the legal training corpus and the proportion of SQuAD-pt used in the mixed setup to support replication.

Simulated Author's Rebuttal

1 responses · 0 unresolved

We thank the referee for the constructive feedback on our manuscript. The point raised about experimental controls is valid and we address it directly below, with a commitment to revise the paper for greater clarity and reproducibility.

read point-by-point responses

Referee: The three training setups are introduced in the abstract and described as base, legal-only, and mixed (legal + SQuAD-pt), yet no statement confirms that optimizer, learning-rate schedule, epoch count, batch size, negative-sampling method, or data-cleaning steps were identical across the two fine-tuned runs. Because the central claim credits the observed metric lifts and the Quati improvement specifically to the data-mixture choice, this missing control information is load-bearing and must be supplied (e.g., in an Experimental Setup section or table) before the causal interpretation can be accepted.

Authors: We agree that the manuscript should explicitly confirm the training controls to support the causal interpretation. The legal-only and mixed runs used identical hyperparameters and procedures, differing only in training data composition. We will add a new 'Training Configuration' subsection (with an accompanying table) to the Experimental Setup that details: optimizer (AdamW), learning-rate schedule (2e-5 with linear decay and 10% warmup), epochs (3), batch size (32), negative-sampling method (in-batch negatives), and data-cleaning steps (deduplication and length filtering). This revision will make the experimental design fully transparent and strengthen the manuscript's claims. revision: yes

Circularity Check

0 steps flagged

No circularity: purely empirical model comparisons on held-out data

full rationale

The paper reports results from training three variants of Qwen3-Embedding-4B (base, legal-only, mixed with SQuAD-pt) and evaluating them on the JU'A legal datasets plus Quati. All reported metrics (NDCG@10, MRR@10, MAP@10) are direct empirical measurements on fixed test splits; no equations, derivations, fitted parameters renamed as predictions, or self-citation chains appear in the load-bearing claims. The central trade-off between specialization and robustness is established by explicit side-by-side runs rather than by construction or imported uniqueness theorems.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

No explicit free parameters, axioms, or invented entities beyond standard supervised fine-tuning practices; the work is purely empirical.

pith-pipeline@v0.9.0 · 5549 in / 1089 out tokens · 68140 ms · 2026-05-07T13:55:31.198923+00:00 · methodology

Review history (12 revisions) →

discussion (0)

Reference graph

Works this paper leans on

27 extracted references · 22 canonical work pages · 5 internal anchors

[1]

Bonifacio, L.H., Jeronymo, V., Abonizio, H.Q., Campiotti, I., Fadaee, M., Lotufo, R., Nogueira, R.: mmarco: A multilingual version of ms marco passage ranking dataset (2021)

2021
[2]

In: Anais do XV Simp´ osio Brasileiro de Tecnologia da Informa¸ c˜ ao e da Linguagem Humana

Bueno, M., de Oliveira, E.S., Nogueira, R., Lotufo, R., Pereira, J.: Quati: A Brazilian Portuguese Information Retrieval Dataset from Native Speak- ers. In: Anais do XV Simp´ osio Brasileiro de Tecnologia da Informa¸ c˜ ao e da Linguagem Humana. pp. 236–246. SBC, Porto Alegre, RS, Brasil (2024). Domain-Adaptive Dense Retrieval for Brazilian Legal Search 1...

work page doi:10.5753/stil.2024.245426 2024
[3]

Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers) pp

Feng, Y., Li, C., Ng, V.: Legal Case Retrieval: A Survey of the State of the Art. Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers) pp. 6472–6485 (2024). https://doi.org/10.18653/v1/2024.acl-long.350

work page doi:10.18653/v1/2024.acl-long.350 2024
[4]

Language Resources and Evaluation 60(1), 23 (2026) https://doi.org/10.1007/s10579-025-09881-w

Fernandes, L.C., Ribeiro, L.d.S., de Castro, M.V.B., da Silva Pacheco, L.A., de Oliveira Sandes, E.F.: JurisTCU: a Brazilian Portuguese information retrieval dataset with query relevance judgments. Language Resources and Evaluation 60(1), 23 (2026). https://doi.org/10.1007/s10579-025-09881-w

work page doi:10.1007/s10579-025-09881-w 2026
[5]

Computer Science Review 60, 100906 (2026)

He, C., Hu, H., Li, Y., Zhang, H., Zhang, Q.: A Survey of Large Language Models for Legal Tasks: Progress, Prospects and Challenges. Computer Science Review 60, 100906 (2026). https://doi.org/10.1016/j.cosrev.2026.100906

work page doi:10.1016/j.cosrev.2026.100906 2026
[6]

LoRA: Low-Rank Adaptation of Large Language Models

Hu, E.J., Shen, Y., Wallis, P., Allen-Zhu, Z., Li, Y., Wang, S., Wang, L., Chen, W.: LoRA: Low-Rank Adaptation of Large Language Models. arXiv preprint arXiv:2106.09685 (2021),https://arxiv.org/abs/2106.09685

work page internal anchor Pith review arXiv 2021
[7]

In: de Freitas, R., Furtado, D

J´ unior, J.D., Faria, A., de Oliveira, E.S., de Brito, E., Teotonio, M., Assump¸ c˜ ao, A., Carmo, D., Lotufo, R., Pereira, J.: BR-TaxQA-R: A Dataset for Question Answer- ing with References for Brazilian Personal Income Tax Law, Including Case Law. In: de Freitas, R., Furtado, D. (eds.) Intelligent Systems. pp. 208–222. Springer Nature Switzerland, Cham (2026)

2026
[8]

In: Proceed- ings of the 2020 Conference on Empirical Methods in Natural Language Process- ing (EMNLP)

Karpukhin, V., Oguz, B., Min, S., Lewis, P., Wu, L., Edunov, S., Chen, D., Yih, W.t.: Dense Passage Retrieval for Open-Domain Question Answering. In: Proceed- ings of the 2020 Conference on Empirical Methods in Natural Language Process- ing (EMNLP). pp. 6769–6781 (2020). https://doi.org/10.18653/v1/2020.emnlp- main.550

work page doi:10.18653/v1/2020.emnlp- 2020
[9]

The Review of Socionetwork Strategies18, 101–121 (2024)

Kim, M.Y., Rabelo, J., Babiker, H.K.B., Rahman, M.A., Goebel, R.: Le- gal Information Retrieval and Entailment Using Transformer-based Ap- proaches. The Review of Socionetwork Strategies18, 101–121 (2024). https://doi.org/10.1007/s12626-023-00153-z,https://link.springer.com/ article/10.1007/s12626-023-00153-z

work page doi:10.1007/s12626-023-00153-z 2024
[10]

In: Proceedings of the 17th Con- ference of the European Chapter of the Association for Computational Linguis- tics

Louis, A., van Dijck, G., Spanakis, G.: Finding the Law: Enhancing Statutory Article Retrieval via Graph Neural Networks. In: Proceedings of the 17th Con- ference of the European Chapter of the Association for Computational Linguis- tics. pp. 2761–2776. Association for Computational Linguistics, Dubrovnik, Croatia (2023). https://doi.org/10.18653/v1/2023....

work page doi:10.18653/v1/2023.eacl-main.203 2023
[11]

ACM Transactions on Informa- tion Systems42(2), 40:1–40:28 (2024)

Ma, Y., Wu, Y., Ai, Q., Liu, Y., Shao, Y., Zhang, M., Ma, S.: Incorporating Structural Information into Legal Case Retrieval. ACM Transactions on Informa- tion Systems42(2), 40:1–40:28 (2024). https://doi.org/10.1145/3609796,https: //dl.acm.org/doi/10.1145/3609796

work page doi:10.1145/3609796 2024
[12]

In: Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing

Ma, Y., Wu, Y., Su, W., Ai, Q., Liu, Y.: CaseEncoder: A Knowledge- enhanced Pre-trained Model for Legal Case Encoding. In: Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing. pp. 7134–7143. Association for Computational Linguistics, Singapore (2023). https://doi.org/10.18653/v1/2023.emnlp-main.441,https://aclanthology.o...

work page doi:10.18653/v1/2023.emnlp-main.441 2023
[13]

Journal of Empirical Legal Studies22(2), 216–242 (2025)

Magesh, V., Surani, F., Dahl, M., Suzgun, M., Manning, C.D., Ho, D.E.: Hallucination-Free? Assessing the Reliability of Leading AI Legal Re- search Tools. Journal of Empirical Legal Studies22(2), 216–242 (2025). https://doi.org/10.1111/jels.12413 14 Pereira et al

work page doi:10.1111/jels.12413 2025
[14]

arXiv preprint arXiv:2210.07316 , year=

Muennighoff, N., Tazi, N., Magne, L., Reimers, N.: MTEB: Massive Text Embed- ding Benchmark. arXiv preprint arXiv:2210.07316 (2022)

work page arXiv 2022
[15]

arXiv preprint arXiv:2411.07739 (2024),https:// arxiv.org/abs/2411.07739

de Oliveira Lima, J.A.: Unlocking Legal Knowledge with Multi-Layered Embedding-Based Retrieval. arXiv preprint arXiv:2411.07739 (2024),https:// arxiv.org/abs/2411.07739

work page arXiv 2024
[16]

Representation Learning with Contrastive Predictive Coding

van den Oord, A., Li, Y., Vinyals, O.: Representation Learning with Contrastive Predictive Coding. arXiv preprint arXiv:1807.03748 (2018),https://arxiv.org/ abs/1807.03748

work page internal anchor Pith review arXiv 2018
[17]

Artificial Intelligence and Law25(1), 65–87 (2017)

van Opijnen, M., Santos, C.: On the Concept of Relevance in Legal In- formation Retrieval. Artificial Intelligence and Law25(1), 65–87 (2017). https://doi.org/10.1007/s10506-017-9195-8,https://link.springer.com/ article/10.1007/s10506-017-9195-8

work page doi:10.1007/s10506-017-9195-8 2017
[18]

JU\'A -- A Benchmark for Information Retrieval in Brazilian Legal Text Collections

Pereira, J., Fernandes, L., de Brito, E., Lotufo, R., Bonifacio, L.: Ju´ a – a benchmark for information retrieval in brazilian legal text collections (2026),https://arxiv. org/abs/2604.06098

work page internal anchor Pith review Pith/arXiv arXiv 2026
[19]

The probabilistic relevance framework: Bm25 and beyond

Robertson, S., Zaragoza, H.: The Probabilistic Relevance Framework: BM25 and Beyond. Foundations and Trends in Information Retrieval3(4), 333–389 (2009). https://doi.org/10.1561/1500000019

work page doi:10.1561/1500000019 2009
[20]

arXiv preprint arXiv:2108.10127 (2021),https://arxiv.org/abs/2108.10127

Rossi, J., Kanoulas, E.: Legal Search in Case Law and Statute Law. arXiv preprint arXiv:2108.10127 (2021),https://arxiv.org/abs/2108.10127

work page arXiv 2021
[21]

In: Proceedings of the 2024 Joint International Conference on Computational Lin- guistics, Language Resources and Evaluation (LREC-COLING 2024)

Santosh, T.Y.S.S., Haddad, R., Grabmair, M.: ECtHR-PCR: A dataset for prece- dent understanding and prior case retrieval in the european court of human rights. In: Proceedings of the 2024 Joint International Conference on Computational Lin- guistics, Language Resources and Evaluation (LREC-COLING 2024). pp. 5473–

2024
[22]

lrec-main.486/

ELRA and ICCL, Torino, Italia (2024),https://aclanthology.org/2024. lrec-main.486/

2024
[23]

In: Proceedings of the Neural Information Processing Systems Track on Datasets and Benchmarks (2021)

Thakur, N., Reimers, N., R¨ uckl´ e, A., Srivastava, A., Gurevych, I.: BEIR: A het- erogeneous benchmark for zero-shot evaluation of information retrieval models. In: Proceedings of the Neural Information Processing Systems Track on Datasets and Benchmarks (2021)

2021
[24]

Language Resources and Evaluation59(2), 1257 (2025)

Vit´ orio, D., Souza, E., Martins, L., da Silva, N.F.F., de Carvalho, A.C.P.d.L., Oliveira, A.L.I., de Andrade, F.E.: Building a relevance feedback corpus for legal information retrieval in the real-case scenario of the Brazilian Cham- ber of Deputies. Language Resources and Evaluation59(2), 1257 (2025). https://doi.org/10.1007/s10579-024-09767-3

work page doi:10.1007/s10579-024-09767-3 2025
[25]

Text Embeddings by Weakly-Supervised Contrastive Pre-training

Wang, L., Yang, N., Huang, X., Yang, L., Gao, F., Wei, Z., Zhang, Y., Zhou, M., et al.: Text embeddings by weakly-supervised contrastive pre-training. arXiv preprint arXiv:2212.03533 (2022)

work page internal anchor Pith review arXiv 2022
[26]

Qwen3 Embedding: Advancing Text Embedding and Reranking Through Foundation Models

Zhang, Y., Li, M., Long, D., Zhang, X., Lin, H., Yang, B., Xie, P., Yang, A., Liu, D., Lin, J., Huang, F., Zhou, J.: Qwen3 Embedding: Advancing Text Embedding and Reranking Through Foundation Models. arXiv preprint arXiv:2506.05176 (2025), https://arxiv.org/abs/2506.05176

work page internal anchor Pith review arXiv 2025
[27]

arXiv preprint arXiv:2408.05517 (2024),https://arxiv.org/abs/ 2408.05517

Zhao, Y., Huang, J., Hu, J., Wang, X., Mao, Y., Zhang, D., Jiang, Z., Wu, Z., Ai, B., Wang, A., Zhou, W., Chen, Y.: SWIFT: A Scalable lightWeight Infrastructure for Fine-Tuning. arXiv preprint arXiv:2408.05517 (2024),https://arxiv.org/abs/ 2408.05517

work page arXiv 2024