pith. machine review for the scientific record.

arxiv: 2605.00116 · v1 · submitted 2026-04-30 · 💻 cs.CL · cs.AI · cs.LG

Recognition: unknown

ViLegalNLI: Natural Language Inference for Vietnamese Legal Texts

Authors on Pith · no claims yet

Pith reviewed 2026-05-09 20:39 UTC · model grok-4.3

classification 💻 cs.CL · cs.AI · cs.LG
keywords Vietnamese legal NLI · natural language inference · legal dataset · statutory text · benchmark · LLM evaluation · Vietnamese language models · legal reasoning

The pith

ViLegalNLI introduces the first large-scale Vietnamese natural language inference dataset for legal statutory texts.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces ViLegalNLI, a dataset of 42,012 premise-hypothesis pairs drawn from Vietnamese legal documents and labeled as entailment or non-entailment. It is built with a semi-automatic framework in which large language models both generate hypotheses and validate their quality. Experiments show that instruction-tuned models with few-shot examples perform best, though results vary with text length, word overlap, and reasoning difficulty. This matters because it fills a gap in resources for legal AI in Vietnamese, enabling tests of how well systems understand complex legal logic and conditions. If the dataset holds up, it can drive development of more accurate tools for legal analysis and decision support.

Core claim

ViLegalNLI establishes a foundational benchmark for Vietnamese legal NLI through a dataset of 42,012 pairs derived from official statutory documents and annotated with binary inference labels. The semi-automatic construction integrates LLMs for controlled hypothesis generation and quality validation, capturing diverse reasoning patterns like paraphrasing and logical implication while mitigating artifacts. Extensive experiments reveal superior performance from few-shot LLM setups and challenges in cross-domain generalization.

What carries the argument

A semi-automatic data generation framework that uses large language models for controlled hypothesis generation combined with systematic quality validation procedures to ensure legal consistency.
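The paper describes this pipeline only at a high level. As a hedged sketch of the cross-model validation idea — the judge models here are hypothetical stand-ins, not the paper's actual components — the keep/flag decision might look like:

```python
# Sketch of cross-model label validation for generated premise-hypothesis
# pairs. Each "judge" is a callable wrapping some LLM (hypothetical here);
# pairs without a strict majority label are routed to human review.

def majority_label(votes):
    """Return the strict-majority label, or None when judges split evenly."""
    counts = {}
    for v in votes:
        counts[v] = counts.get(v, 0) + 1
    best = max(counts, key=counts.get)
    return best if counts[best] > len(votes) / 2 else None

def validate_pair(premise, hypothesis, judges):
    """Ask several independent judge models for a label; keep the pair only
    when a strict majority agrees, otherwise flag it for review."""
    votes = [judge(premise, hypothesis) for judge in judges]
    label = majority_label(votes)
    return {"premise": premise, "hypothesis": hypothesis,
            "label": label, "needs_review": label is None}
```

An odd number of judges avoids most ties; the disagreement rate this loop produces is exactly the kind of statistic the referee report asks the authors to publish.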

If this is right

  • Models can be benchmarked on realistic legal reasoning scenarios involving conditional clauses and domain terminology.
  • Few-shot configurations of large language models show the best results on this task.
  • Performance depends on factors such as hypothesis length, lexical overlap, and reasoning complexity.
  • Generalization across different legal domains remains challenging.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Similar frameworks could be adapted to create legal NLI datasets in other languages with limited resources.
  • The dataset might support training of specialized models that better handle statutory interpretation.
  • Insights from error analysis could guide improvements in handling legally invalid inferences.

Load-bearing premise

The annotations generated through the LLM-based semi-automatic process are accurate and consistent enough to serve as a reliable benchmark.

What would settle it

If independent legal experts review a sample of the pairs and find a high rate of incorrect entailment labels, the dataset's validity as a benchmark would be questioned.
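An audit of that kind is straightforward to size with a binomial bound. A minimal sketch, where the 400-pair sample and 18 observed errors are illustrative numbers rather than anything reported in the paper:

```python
import math

def error_rate_ci(errors, sample_size, z=1.96):
    """Normal-approximation 95% confidence interval for the label error
    rate estimated from an expert-audited random sample of pairs."""
    p = errors / sample_size
    half = z * math.sqrt(p * (1 - p) / sample_size)
    return max(0.0, p - half), min(1.0, p + half)

# e.g. 18 wrong labels found in a 400-pair expert audit
lo, hi = error_rate_ci(18, 400)  # roughly (0.025, 0.065)
```

If the upper bound of that interval stays below whatever error tolerance the community accepts for NLI benchmarks, the dataset's validity survives the test; if not, it doesn't.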

read the original abstract

In this article, we introduce ViLegalNLI, the first large-scale Vietnamese Natural Language Inference (NLI) dataset specifically constructed for the legal domain. The dataset consists of 42,012 premise-hypothesis pairs derived from official statutory documents and annotated with binary inference labels (Entailment and Non-entailment). It covers multiple legal domains and reflects realistic legal reasoning scenarios characterized by structured logic, conditional clauses, and domain-specific terminology. To construct ViLegalNLI, we propose a semi-automatic data generation framework that integrates large language models for controlled hypothesis generation and systematic quality validation procedures. The framework incorporates artifact mitigation strategies and cross-model validation to improve annotation reliability and ensure legal consistency. The resulting dataset captures diverse reasoning patterns, including paraphrasing, logical implication, and legally invalid inferences, thereby providing a comprehensive benchmark for Vietnamese legal inference tasks. We conduct extensive experiments on the ViLegalNLI using multilingual models, Vietnamese-specific pretrained language models, and instruction-tuned large language models. The results show that few-shot LLM configurations consistently achieve superior performance, while performance is significantly influenced by hypothesis length, lexical overlap, and reasoning complexity. Cross-domain evaluations further reveal the challenges of generalizing legal inference across distinct legal fields. Overall, ViLegalNLI establishes a foundational benchmark for Vietnamese legal NLI and supports future research in legal reasoning, statutory text understanding, and the development of reliable AI systems for legal analysis and decision support. The dataset is publicly available for research purposes.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper introduces ViLegalNLI, the first large-scale Vietnamese legal NLI dataset with 42,012 premise-hypothesis pairs extracted from official statutory documents and annotated with binary Entailment/Non-entailment labels. It proposes a semi-automatic construction framework that uses LLMs for controlled hypothesis generation, artifact mitigation, and cross-model validation to ensure legal consistency. The work reports experiments across multilingual models, Vietnamese PLMs, and instruction-tuned LLMs, finding that few-shot LLM setups perform best while performance varies with hypothesis length, lexical overlap, and reasoning complexity; cross-domain results highlight generalization difficulties. The dataset is released publicly as a benchmark for Vietnamese legal reasoning.

Significance. If the annotations prove reliable, ViLegalNLI fills a clear gap as the first substantial Vietnamese legal-domain NLI resource and can accelerate work on statutory understanding and legal AI in a low-resource language setting. The public release and the reported influences of length/overlap on model behavior are concrete contributions that future studies can build upon directly. The semi-automatic pipeline with artifact mitigation is a methodological strength worth documenting for other specialized domains.

major comments (2)
  1. [§3] §3 (Data Construction and Validation): The semi-automatic framework is presented as producing legally consistent annotations via LLM hypothesis generation and cross-model validation, yet no quantitative metrics—such as inter-annotator agreement scores, expert legal reviewer agreement rates, error rates on conditional clauses, or disagreement resolution statistics—are reported. This directly undermines the central claim that the 42k pairs are free of annotation artifacts and capture statutory logic rather than surface patterns or LLM hallucinations.
  2. [§4.3] §4.3 (Cross-domain Evaluation): The claim that the dataset reveals 'challenges of generalizing legal inference across distinct legal fields' rests on model performance differences, but without a human performance baseline or error analysis stratified by legal construct (e.g., conditional clauses or domain-specific terminology), it is unclear whether the observed gaps reflect genuine legal reasoning difficulty or dataset artifacts.
minor comments (2)
  1. [Abstract] Abstract and §2: The binary label set is described as 'Entailment and Non-entailment,' but the paper should explicitly define how 'Non-entailment' is operationalized in the legal context (e.g., whether it collapses contradiction and neutral cases) to allow comparison with standard NLI taxonomies.
  2. [§4] Tables in §4: Performance tables would benefit from reporting standard deviations across runs or seeds and from including a simple lexical-overlap baseline to contextualize the influence of hypothesis length and overlap mentioned in the text.
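The lexical-overlap baseline requested above needs only a few lines. A sketch, assuming whitespace tokenization and a dev-tuned threshold — both assumptions, since the paper does not specify Vietnamese tokenization details (Vietnamese words often span multiple whitespace-separated syllables):

```python
def word_overlap(premise, hypothesis):
    """Fraction of hypothesis tokens that also occur in the premise."""
    p = set(premise.lower().split())
    h = hypothesis.lower().split()
    if not h:
        return 0.0
    return sum(tok in p for tok in h) / len(h)

def overlap_baseline(premise, hypothesis, threshold=0.7):
    """Predict Entailment when overlap exceeds a threshold tuned on dev data.
    A benchmark that this baseline largely solves is leaning on surface cues
    rather than legal reasoning."""
    score = word_overlap(premise, hypothesis)
    return "Entailment" if score >= threshold else "Non-entailment"
```

Reporting this baseline alongside the LLM results would separate genuine reasoning gains from exploitation of lexical overlap.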

Simulated Author's Rebuttal

2 responses · 1 unresolved

We thank the referee for the constructive and detailed feedback. We address each major comment point by point below, indicating the revisions we will make to improve the manuscript.

read point-by-point responses
  1. Referee: [§3] §3 (Data Construction and Validation): The semi-automatic framework is presented as producing legally consistent annotations via LLM hypothesis generation and cross-model validation, yet no quantitative metrics—such as inter-annotator agreement scores, expert legal reviewer agreement rates, error rates on conditional clauses, or disagreement resolution statistics—are reported. This directly undermines the central claim that the 42k pairs are free of annotation artifacts and capture statutory logic rather than surface patterns or LLM hallucinations.

    Authors: We agree that explicit quantitative metrics would strengthen the validation claims in §3. The manuscript describes the cross-model validation and artifact mitigation steps but omits specific statistics. In the revised version, we will expand §3 to report: (i) agreement rates between the LLMs used for validation, (ii) the proportion and handling of disagreement cases (including majority vote and any sampled expert review), and (iii) a targeted error analysis on conditional clauses. These figures will be computed from our existing validation logs. This addition directly addresses concerns about annotation artifacts and LLM hallucinations while preserving the semi-automatic nature of the pipeline. revision: yes

  2. Referee: [§4.3] §4.3 (Cross-domain Evaluation): The claim that the dataset reveals 'challenges of generalizing legal inference across distinct legal fields' rests on model performance differences, but without a human performance baseline or error analysis stratified by legal construct (e.g., conditional clauses or domain-specific terminology), it is unclear whether the observed gaps reflect genuine legal reasoning difficulty or dataset artifacts.

    Authors: We partially concur that a human baseline would provide stronger grounding for interpreting cross-domain gaps. The performance drops are consistent across model families, supporting our claim of generalization challenges, yet we recognize the value of stratified error analysis. We will revise §4.3 to include error breakdowns by conditional clauses, domain terminology, and reasoning complexity. We will also explicitly note the absence of a human baseline as a limitation, explaining that expert legal annotation at scale is resource-intensive and reserved for future work. These changes will clarify the evidential basis without overstating the results. revision: partial

standing simulated objections · not resolved
  • Providing a full human performance baseline on the cross-domain splits, which would require new large-scale expert legal annotations beyond the scope and resources of the current study.
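The agreement figures promised for §3 have standard estimators (Cohen's kappa for two raters, Fleiss' kappa for more). A sketch of Cohen's kappa over the binary label set, applicable to any pair of judge models or expert annotators:

```python
from collections import Counter

def cohen_kappa(labels_a, labels_b):
    """Chance-corrected agreement (Cohen's kappa) between two annotators,
    given parallel lists of labels for the same items."""
    assert len(labels_a) == len(labels_b) and labels_a
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    ca, cb = Counter(labels_a), Counter(labels_b)
    # Expected agreement under independent labeling with each rater's marginals
    expected = sum((ca[l] / n) * (cb[l] / n)
                   for l in set(labels_a) | set(labels_b))
    if expected == 1.0:
        return 1.0
    return (observed - expected) / (1 - expected)
```

Values above roughly 0.8 are conventionally read as near-perfect agreement, which is the bar a benchmark's gold labels should clear.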

Circularity Check

0 steps flagged

No circularity: empirical dataset paper with no derivations or self-referential equations

full rationale

This is a dataset creation and benchmarking paper that introduces ViLegalNLI via a semi-automatic LLM-assisted pipeline for premise-hypothesis pair generation and validation. The abstract and described process contain no equations, fitted parameters, uniqueness theorems, or derivation steps that reduce to their own inputs by construction. Claims about legal consistency rest on the pipeline description and cross-model validation rather than any self-citation chain or renaming of known results. The work is self-contained as an empirical contribution with no load-bearing circular elements.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Empirical dataset paper with no mathematical derivations; no free parameters, axioms, or invented entities are introduced.

pith-pipeline@v0.9.0 · 5577 in / 983 out tokens · 35530 ms · 2026-05-09T20:39:37.581567+00:00 · methodology

discussion (0)


Reference graph

Works this paper leans on

37 extracted references · 3 canonical work pages · 1 internal anchor

  1. [1] MacCartney, B.: Natural language inference. PhD thesis, Stanford University (2009)
  2. [2] Dagan, I., Glickman, O.: Probabilistic textual entailment: Generic applied modeling of language variability. In: PASCAL Workshop on Learning Methods for Text Understanding and Mining, pp. 26–29 (2004)
  3. [3] Koreeda, Y., Manning, C.: ContractNLI: A dataset for document-level natural language inference for contracts. In: Findings of the Association for Computational Linguistics: EMNLP 2021, pp. 1907–1919 (2021)
  4. [4] Bruno, W., Roth, D.: LawngNLI: A long-premise benchmark for in-domain generalization from short to long contexts and for implication-based retrieval. In: Goldberg, Y., Kozareva, Z., Zhang, Y. (eds.) Findings of the Association for Computational Linguistics: EMNLP 2022, pp. 5019–5043 (2022)
  5. [5] Nguyen, D.Q., Nguyen, A.T.: PhoBERT: Pre-trained language models for Vietnamese. In: Findings of the Association for Computational Linguistics: EMNLP, pp. 1037–1042 (2020)
  6. [6] Bui, T.V., Tran, T.O., Le-Hong, P.: Improving sequence tagging for Vietnamese text using transformer-based neural models. In: Proceedings of the 34th Pacific Asia Conference on Language, Information and Computation, pp. 13–20 (2020)
  7. [7] Do, P.N.-T., Tran, S.Q., Hoang, P.G., Nguyen, K.V., Nguyen, N.L.-T.: Improving sequence tagging for Vietnamese text using transformer-based neural models. In: Findings of the Association for Computational Linguistics, pp. 211–222 (2024)
  8. [8] Marelli, M., Menini, S., Baroni, M., Bentivogli, L., Bernardi, R., Zamparelli, R.: A SICK cure for the evaluation of compositional distributional semantic models. In: Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC'14), pp. 216–223 (2014)
  9. [9] Bowman, S.R., Angeli, G., Potts, C., Manning, C.D.: A large annotated corpus for learning natural language inference. In: Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, pp. 632–642 (2015)
  10. [10] Williams, A., Nangia, N., Bowman, S.R.: A broad-coverage challenge corpus for sentence understanding through inference. In: Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 1112–1122 (2018)
  11. [11] Romanov, A., Shivade, C.: Lessons from natural language inference in the clinical domain. In: Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing (EMNLP), pp. 1586–1596 (2018)
  12. [12] Khot, T., Sabharwal, A., Clark, P.: SciTail: A textual entailment dataset from science question answering. In: Proceedings of the AAAI Conference on Artificial Intelligence (2018)
  13. [13] Nie, Y., Williams, A., Dinan, E., Bansal, M., Weston, J., Kiela, D.: Adversarial NLI: A new benchmark for natural language understanding. In: Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pp. 4885–4901 (2020)
  14. [14] Conneau, A., et al.: XNLI: Evaluating cross-lingual sentence representations. In: Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pp. 2475–2485 (2018)
  15. [15] Quyen, N., Anh, H., Huyen, N., Lien, N.: VLSP 2021 - VnNLI challenge: Vietnamese and English-Vietnamese textual entailment. VNU Journal of Science: Computer Science and Communication Engineering 38(2) (2022)
  16. [16] Bui, T.B., Thi-Thuy Nguyen, L., Van Huynh, T.: VietX-NLI: A cross-lingual natural language inference dataset with Vietnamese as the source language. In: 2025 International Conference on Multimedia Analysis and Pattern Recognition (MAPR), pp. 1–6 (2025). https://doi.org/10.1109/MAPR67746.2025.11133930
  17. [17] Hu, H., Richardson, K., Xu, L., Li, L., Kübler, S., Moss, L.S.: OCNLI: Original Chinese natural language inference. In: Findings of the Association for Computational Linguistics: EMNLP 2020, pp. 3512–3526 (2020)
  18. [18] Mahendra, R., Aji, A.F., Louvan, S., Rahman, F., Vania, C.: IndoNLI: A natural language inference dataset for Indonesian. In: Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, pp. 10511–10527 (2021)
  19. [19] Ham, J., Choe, Y.J., Park, K., Choi, I., Soh, H.: KorNLI and KorSTS: New benchmark datasets for Korean natural language understanding. In: Findings of the Association for Computational Linguistics: EMNLP 2020, pp. 422–430 (2020)
  20. [20] Amirkhani, H., AzariJafari, M., Faridan-Jahromi, S., Loulouei, M., Zarrieß, S.: FarsTail: a Persian natural language inference dataset. Soft Computing 27(21), 15831–15844 (2023)
  21. [21] Khanuja, S., Dandapat, S., Sitaram, S., Choudhury, M.: A new dataset for natural language inference from code-mixed conversations. In: Proceedings of the 4th Workshop on Computational Approaches to Code Switching, pp. 9–16 (2020)
  22. [22] Huynh, T.V., Nguyen, K.V., Nguyen, N.L.-T.: ViNLI: A Vietnamese corpus for studies on open-domain natural language inference. In: Proceedings of the 3rd Annual Meeting of the Special Interest Group on Under-resourced Languages @ LREC-COLING, pp. 3858–3872 (2022)
  23. [23] Van Huynh, T., Van Nguyen, K., Luu-Thuy Nguyen, N.: A new benchmark dataset and mixture-of-experts language models for adversarial natural language inference in Vietnamese. Expert Systems with Applications 306, 130109 (2026). https://doi.org/10.1016/j.eswa.2025.130109
  24. [24] Nguyen, H., Ngo, Q.T., Do, T.-H., Hoang, T.-A.: ViHealthNLI: A dataset for Vietnamese natural language inference in healthcare. In: Proceedings of SIGUL @ LREC-COLING, pp. 404–409 (2024)
  25. [25] Nguyen, C.T., Nguyen, D.T.: Building a Vietnamese dataset for natural language inference models. SN Computer Science 3(5), 395 (2022)
  26. [26] Alabbas, M.: A dataset for Arabic textual entailment. In: Proceedings of the Student Research Workshop Associated with RANLP 2013, pp. 7–13 (2013)
  27. [27] Devlin, J., Chang, M.-W., Lee, K., Toutanova, K.: BERT: Pre-training of deep bidirectional transformers for language understanding. In: Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 4171–4186 (2019)
  28. [28] Conneau, A., et al.: Unsupervised cross-lingual representation learning at scale. In: Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pp. 8440–8451 (2020)
  29. [29] Chi, Z., Dong, L., Wei, F., Yang, N., Singhal, S., Wang, W., et al.: InfoXLM: An information-theoretic framework for cross-lingual language model pre-training. In: Proceedings of the 2021 Conference of the North American Chapter of the ACL: Human Language Technologies (NAACL-HLT), pp. 3576–3588 (2021)
  30. [30] He, P., Liu, X., Gao, J., Chen, W.: DeBERTa: Decoding-enhanced BERT with disentangled attention. arXiv preprint arXiv:2006.03654 (2020)
  31. [31] Dagan, I., Glickman, O., Magnini, B.: The PASCAL recognising textual entailment challenge. In: Machine Learning Challenges Workshop, pp. 177–190 (2005)
  32. [32] Bar-Haim, R., Dagan, I., Szpektor, I.: Benchmarking applied semantic inference: The PASCAL recognising textual entailment challenges. In: Language, Culture, Computation. Computing – Theory and Technology, pp. 409–424 (2014)
  33. [33] Agirre, E., Cer, D., Diab, M., Gonzalez-Agirre, A.: SemEval-2012 task 6: A pilot on semantic textual similarity. In: Proceedings of the First Joint Conference on Lexical and Computational Semantics (*SEM), pp. 385–393 (2012)
  34. [34] Johnson, A.E.W., et al.: MIMIC-III, a freely accessible critical care database. Scientific Data 3, 1–9 (2016)
  35. [35] Landis, J.R., Koch, G.G.: The measurement of observer agreement for categorical data. Biometrics 33(1), 159–174 (1977)
  36. [36] Fleiss, J.L.: Measuring nominal scale agreement among many raters. Psychological Bulletin 76(5), 378–382 (1971)
  37. [37] Bernsohn, D., Semo, G., Vazana, Y., Hayat, G., Hagag, B., Niklaus, J., Saha, R., Truskovskyi, K.: LegalLens: Leveraging LLMs for legal violation identification in unstructured text. In: Graham, Y., Purver, M. (eds.) Proceedings of the 18th Conference of the European Chapter of the Association for Computational Linguistics (Volume 1: Long Papers), pp. 2129...