WisdomInterrogatory (LuWen): An Open-Source Legal Large Language Model Technical Report
Pith reviewed 2026-05-10 18:37 UTC · model grok-4.3
The pith
LuWen adapts a general Chinese language model to legal work through pre-training on legal texts, instruction fine-tuning, and retrieval from a knowledge base.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
LuWen is created by continual pre-training on a large-scale legal corpus, supervised fine-tuning with legal instruction data, and retrieval-augmented generation integrated with a comprehensive legal knowledge base. On five representative legal tasks spanning prediction and generation, it outperforms strong baselines and demonstrates effective adaptation of general-purpose models to the legal domain.
What carries the argument
The three-step adaptation: continual pre-training on a legal corpus, supervised fine-tuning on legal instruction data, and retrieval-augmented generation over a legal knowledge base.
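Of the three stages, the retrieval step is the most mechanical and is worth making concrete. Below is a minimal, self-contained sketch of retrieval-augmented prompting, assuming a TF-IDF retriever and a toy stand-in knowledge base; the report does not specify its retriever, index, or prompt template, so every identifier and snippet here is illustrative.

```python
# Minimal sketch of the retrieval-augmented generation step, assuming a
# TF-IDF retriever over a toy legal knowledge base. The report does not
# specify the retriever or prompt template; everything below is illustrative,
# and the statute snippets are paraphrased stand-ins.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Hypothetical stand-in for the legal knowledge base.
knowledge_base = [
    "Article 266: Whoever defrauds public or private property of a relatively large amount ...",
    "Article 264: Whoever steals public or private property of a relatively large amount ...",
    "Article 133: Whoever violates traffic regulations and causes a serious accident ...",
]

vectorizer = TfidfVectorizer()
kb_matrix = vectorizer.fit_transform(knowledge_base)

def retrieve(query: str, top_k: int = 2) -> list[str]:
    """Return the top_k knowledge-base entries most similar to the query."""
    query_vec = vectorizer.transform([query])
    scores = cosine_similarity(query_vec, kb_matrix)[0]
    ranked = scores.argsort()[::-1][:top_k]
    return [knowledge_base[i] for i in ranked]

def build_prompt(question: str) -> str:
    """Prepend retrieved law articles to the user question before generation."""
    context = "\n".join(retrieve(question))
    return f"Relevant law articles:\n{context}\n\nQuestion: {question}\nAnswer:"

if __name__ == "__main__":
    print(build_prompt("What offence is committed by obtaining property through deception?"))
```

In the paper's setting the retriever would index the full legal knowledge base, and the assembled prompt would be passed to the fine-tuned LuWen model for generation.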
Load-bearing premise
That outperformance on the five selected legal tasks means the adaptation works for real-world legal demands beyond those specific tests and baselines.
What would settle it
If LuWen shows no advantage over general models when tested on a new set of legal cases, documents, or questions drawn from outside the original evaluation sets.
Original abstract
Large language models have demonstrated remarkable capabilities across a wide range of natural language processing tasks, yet their application in the legal domain remains challenging due to the specialized terminology, complex reasoning requirements, and rapidly evolving legal knowledge involved. In this paper, we present WisdomInterrogatory (LuWen), an open-source Chinese legal language model built upon the Baichuan foundation model through three key techniques: continual pre-training on a large-scale legal corpus, supervised fine-tuning with carefully curated legal instruction data, and retrieval-augmented generation integrated with a comprehensive legal knowledge base. We evaluate LuWen on five representative legal tasks spanning both prediction and generation settings, including legal judgment prediction, judicial examination, legal text summarization, law article question answering, and judicial decision reasoning. Experimental results show that LuWen outperforms several strong baselines, demonstrating the effectiveness of our approach in adapting general-purpose language models to the legal domain.
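Of the techniques the abstract lists, supervised fine-tuning on curated instruction data is the most straightforward to illustrate. A minimal sketch of flattening instruction records into training strings, assuming a hypothetical instruction/input/output schema and prompt template; the report does not publish its actual data format.

```python
# Sketch of how legal instruction records might be flattened into training
# text for supervised fine-tuning. The field names and template are
# illustrative; the report does not publish its instruction schema.
import json

PROMPT_TEMPLATE = "Instruction: {instruction}\nInput: {input}\nResponse: {output}"

def to_sft_examples(records: list[dict]) -> list[str]:
    """Render instruction/input/output records into plain training strings."""
    return [
        PROMPT_TEMPLATE.format(
            instruction=r["instruction"],
            input=r.get("input", ""),
            output=r["output"],
        )
        for r in records
    ]

if __name__ == "__main__":
    demo = [
        {
            "instruction": "Summarize the following judgment.",
            "input": "The defendant was convicted of theft under Article 264 ...",
            "output": "Conviction for theft; sentence and fine imposed.",
        }
    ]
    print(json.dumps(to_sft_examples(demo), ensure_ascii=False, indent=2))
```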
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript presents WisdomInterrogatory (LuWen), an open-source Chinese legal large language model built on the Baichuan foundation model. It applies three adaptation techniques—continual pre-training on a large-scale legal corpus, supervised fine-tuning with curated legal instruction data, and retrieval-augmented generation using a comprehensive legal knowledge base—and evaluates the resulting model on five tasks spanning prediction and generation: legal judgment prediction, judicial examination, legal text summarization, law article question answering, and judicial decision reasoning. The central claim is that LuWen outperforms several strong baselines on these tasks, thereby demonstrating the effectiveness of the adaptation approach.
Significance. If the performance claims hold under proper controls, the work supplies a publicly released Chinese legal LLM that could serve as a useful baseline and starting point for domain-specific legal NLP research. The open-source release of the model and the described pipeline constitutes a concrete, reusable artifact that lowers barriers for subsequent studies in legal AI.
major comments (1)
- [Experimental Results] Experimental Results section: no ablation variants are reported that isolate the contributions of continual pre-training, supervised fine-tuning, and retrieval-augmented generation (e.g., base model + SFT only, or + continual pre-training only). Without these controls on the same five tasks, the attribution of gains to the specific combination of techniques rather than to additional legal data exposure cannot be established, directly weakening the claim that the results demonstrate the effectiveness of the proposed approach.
minor comments (2)
- [Abstract] Abstract: the statement that LuWen 'outperforms several strong baselines' is not accompanied by any quantitative metrics, baseline identifiers, or statistical test results, making it impossible for readers to gauge the magnitude or reliability of the claimed improvements from the abstract alone.
- [Introduction] The five tasks are described as 'representative,' yet no justification or coverage analysis is provided for why these particular tasks adequately sample real-world legal reasoning demands.
Simulated Author's Rebuttal
We thank the referee for their constructive feedback and for acknowledging the value of the open-source release of LuWen as a baseline for legal NLP research. We address the major comment below.
Point-by-point responses
- Referee: [Experimental Results] Experimental Results section: no ablation variants are reported that isolate the contributions of continual pre-training, supervised fine-tuning, and retrieval-augmented generation (e.g., base model + SFT only, or + continual pre-training only). Without these controls on the same five tasks, the attribution of gains to the specific combination of techniques rather than to additional legal data exposure cannot be established, directly weakening the claim that the results demonstrate the effectiveness of the proposed approach.
  Authors: We agree that ablation studies isolating each adaptation component would strengthen the attribution of performance gains. The current manuscript reports results only for the full LuWen model (combining continual pre-training, supervised fine-tuning, and retrieval-augmented generation) against external baselines on the five tasks. In the revised version, we will add ablation experiments evaluating the base Baichuan model with individual and partial combinations of the techniques across all five tasks (legal judgment prediction, judicial examination, legal text summarization, law article question answering, and judicial decision reasoning). These results will be incorporated into the Experimental Results section to more clearly demonstrate the contribution of each technique.
  Revision: yes
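The controls promised here reduce to an ablation grid over the three adaptation stages, run on the same five tasks. A minimal sketch of enumerating that grid, assuming a hypothetical `evaluate` harness; only the stage names and task list come from the manuscript, everything else is illustrative.

```python
# Sketch of the ablation grid the referee requests: every subset of the three
# adaptation stages, to be evaluated on the same five tasks. `evaluate` is a
# hypothetical hook and must be supplied by the real experimental code.
from itertools import combinations

STAGES = ("continual_pretraining", "sft", "rag")
TASKS = (
    "legal_judgment_prediction",
    "judicial_examination",
    "legal_text_summarization",
    "law_article_qa",
    "judicial_decision_reasoning",
)

def evaluate(enabled_stages: frozenset[str], task: str) -> float:
    """Placeholder: run the model variant with `enabled_stages` on `task`."""
    raise NotImplementedError("hook up the real evaluation harness here")

def ablation_grid():
    """Yield every variant from the bare base model to the full pipeline."""
    for k in range(len(STAGES) + 1):
        for subset in combinations(STAGES, k):
            yield frozenset(subset)

if __name__ == "__main__":
    for variant in ablation_grid():
        name = "+".join(sorted(variant)) or "base_model"
        print(f"{name}: evaluate on {len(TASKS)} tasks")
```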
Circularity Check
No circularity: empirical report with no derivations or self-referential predictions
full rationale
The paper is a standard empirical engineering report: it describes training LuWen via continual pre-training on a legal corpus, SFT on instruction data, and RAG with a knowledge base, then reports outperformance on five fixed legal tasks versus baselines. No equations, no fitted parameters renamed as predictions, no load-bearing self-citation uniqueness theorems, and no derivation chain that reduces to its own inputs by construction. The central claim rests on experimental comparisons, which are externally falsifiable and do not exhibit any of the enumerated circular patterns. The absence of ablations is a methodological weakness but does not constitute circularity.
Axiom & Free-Parameter Ledger
Forward citations
Cited by 1 Pith paper
- PoliLegalLM: A Technical Report on a Large Language Model for Political and Legal Affairs
  PoliLegalLM, trained with continued pretraining, progressive SFT, and preference RL on a legal corpus, outperforms similar-scale models on LawBench, LexEval, and a real-world PoliLegal dataset while staying competitive...