LexPath: A domain-oriented multi-path framework for legal article retrieval

Qingfeng Zhuge; Weixuan Liu; Xuyang Chen

arxiv: 2605.30205 · v1 · pith:LTY22LXKnew · submitted 2026-05-28 · 💻 cs.IR

Pith reviewed 2026-06-29 05:11 UTC · model grok-4.3

classification 💻 cs.IR

keywords: legal article retrieval, multi-path retrieval, IRAC, hard negatives, intent consistency, retrieval augmented generation, legal hierarchy

The pith

LexPath combines IRAC-guided sparse retrieval, hierarchy-based dense paths, and intent reranking to better match legally relevant articles than standard methods.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper argues that legal article retrieval fails when systems rely only on lexical or semantic similarity because many textually close articles are legally inapplicable or misaligned with user intent. LexPath addresses this by running two complementary paths in parallel: an IRAC-guided sparse path that adds legally informative keywords to the query and a structure-guided dense path trained on hard negatives drawn from legal hierarchy and citation links. An intent consistency reranker then reorders candidates. Experiments on two public benchmarks for general queries and one new professional benchmark show consistent gains over lexical, dense, hybrid, and adaptive RAG baselines, with ablations confirming each part contributes.
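
One way to picture how the candidate lists from the two paths might be merged is a weighted sum of normalized scores. The sketch below is an assumption, not the paper's stated formula: the names `min_max` and `fuse_paths`, the min-max normalization, and the convention that `alpha` weights the sparse path (so `alpha = 1.0` means sparse only) are illustrative choices.

```python
from typing import Dict, List, Tuple


def min_max(scores: Dict[str, float]) -> Dict[str, float]:
    """Rescale scores to [0, 1]; a constant score map becomes all zeros."""
    if not scores:
        return {}
    lo, hi = min(scores.values()), max(scores.values())
    if hi == lo:
        return {doc: 0.0 for doc in scores}
    return {doc: (s - lo) / (hi - lo) for doc, s in scores.items()}


def fuse_paths(
    sparse: Dict[str, float],
    dense: Dict[str, float],
    alpha: float = 0.5,
) -> List[Tuple[str, float]]:
    """Hypothetical fusion: alpha * sparse + (1 - alpha) * dense.

    Articles missing from one path contribute 0 for that path.
    """
    sparse_n, dense_n = min_max(sparse), min_max(dense)
    candidates = set(sparse_n) | set(dense_n)
    fused = {
        doc: alpha * sparse_n.get(doc, 0.0) + (1.0 - alpha) * dense_n.get(doc, 0.0)
        for doc in candidates
    }
    # Sort by descending score, breaking ties by article id for determinism.
    return sorted(fused.items(), key=lambda kv: (-kv[1], kv[0]))
```

Normalizing before mixing matters because raw BM25-style scores and embedding similarities live on different scales; without it, one path would dominate regardless of `alpha`.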

Core claim

The central claim is that the multi-path retrieval module (IRAC-guided sparse path plus structure-guided dense path with hierarchy-derived hard negatives) followed by an intent-aware reranking module produces candidate rankings that more reliably identify legally applicable articles than surface-similarity baselines.

What carries the argument

The multi-path retrieval module that runs an IRAC-guided sparse path in parallel with a structure-guided dense path, followed by the intent consistency reranking module.
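
The reranking step can be sketched as adding a weighted intent consistency term to each candidate's retrieval score. This is a minimal illustration under assumptions: the additive form, the weight name `lam`, and the `intent_score` callable are hypothetical, since the visible text does not give the paper's exact scoring formula or how the intent score is computed.

```python
from typing import Callable, List, Tuple


def rerank_with_intent(
    query: str,
    candidates: List[Tuple[str, float]],
    intent_score: Callable[[str, str], float],
    lam: float = 0.3,
) -> List[Tuple[str, float]]:
    """Reorder (article_id, retrieval_score) pairs.

    final = retrieval_score + lam * intent_score(query, article_id)
    The intent scorer is supplied by the caller (a hypothetical component).
    """
    rescored = [
        (doc, score + lam * intent_score(query, doc))
        for doc, score in candidates
    ]
    # Sort by descending final score, breaking ties by article id.
    return sorted(rescored, key=lambda kv: (-kv[1], kv[0]))
```

With `lam = 0` the function returns the original retrieval order, which makes it easy to ablate the intent signal in an experiment.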

If this is right

Legal AI systems can ground conclusions in specific articles more reliably across both public and professional query types.
Retrieval performance improves when hard negatives are chosen from legal hierarchy and citation relations rather than random or surface-similar negatives.
Intent consistency between query and article provides an additional signal that refines rankings beyond initial retrieval scores.
Ablation results indicate that removing any one of the three components reduces performance on the reported benchmarks.
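
To make the hard-negative idea concrete, the sketch below mines negatives for a query from two sources: articles in the same chapter as a gold article, and articles linked to it by citation. Everything here is an assumption for illustration: the `Article` record, its fields, and the sibling-first ordering are hypothetical and are not taken from the paper.

```python
from dataclasses import dataclass
from typing import Dict, List, Set, Tuple


@dataclass(frozen=True)
class Article:
    article_id: str
    law: str                      # e.g. the statute the article belongs to
    chapter: str                  # position in the legal hierarchy
    cites: Tuple[str, ...] = ()   # ids of articles this one cites


def mine_hard_negatives(
    positive_ids: Set[str],
    corpus: Dict[str, Article],
    limit: int = 8,
) -> List[str]:
    """Collect structurally close but non-gold articles as hard negatives."""
    negatives: List[str] = []
    seen: Set[str] = set(positive_ids)  # never return a gold article
    for pid in sorted(positive_ids):
        pos = corpus[pid]
        # Hierarchy neighbours: same law and same chapter.
        siblings = sorted(
            a.article_id for a in corpus.values()
            if a.law == pos.law and a.chapter == pos.chapter
        )
        # Citation neighbours: cited by the gold article, or citing it.
        linked = sorted(
            set(pos.cites)
            | {a.article_id for a in corpus.values() if pid in a.cites}
        )
        for cand in siblings + linked:
            if cand in corpus and cand not in seen:
                seen.add(cand)
                negatives.append(cand)
    return negatives[:limit]
```

Such negatives are "hard" because they share vocabulary and topic with the gold article while being legally distinct, which is the distinction a surface-similarity retriever tends to miss.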

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

The same two-path design could be tested in other domains that have explicit hierarchical structures and citation graphs, such as regulatory compliance or medical guidelines.
If the intent reranker proves robust, it could be adapted to surface cases where a retrieved article is textually close but contradicts the query's underlying goal.
The self-constructed professional benchmark suggests that domain-specific test sets are needed to measure progress beyond general-public queries.

Load-bearing premise

The improvements on the benchmarks arise because the IRAC keywords, hierarchy hard negatives, and intent scores actually capture legal relevance distinctions that standard methods miss.

What would settle it

A new legal retrieval benchmark, especially one with professional queries, on which LexPath shows no statistically significant gain over the strongest adaptive RAG baseline would falsify the central claim.
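
A check of this kind needs a paired significance test over per-query scores. The sketch below uses a two-sided paired randomization (sign-flip) test, a common choice in IR evaluation; it is offered as one reasonable option, not as the test the paper uses. The function name and defaults are illustrative.

```python
import random
from typing import Sequence


def paired_randomization_test(
    system_a: Sequence[float],
    system_b: Sequence[float],
    trials: int = 10000,
    seed: int = 0,
) -> float:
    """Two-sided p-value for the difference in per-query scores.

    Under the null hypothesis the sign of each per-query difference is
    arbitrary, so signs are flipped at random and the summed difference
    is compared with the observed one.
    """
    if len(system_a) != len(system_b):
        raise ValueError("both systems must be scored on the same queries")
    diffs = [a - b for a, b in zip(system_a, system_b)]
    observed = abs(sum(diffs))
    rng = random.Random(seed)
    hits = 0
    for _ in range(trials):
        total = sum(d if rng.random() < 0.5 else -d for d in diffs)
        if abs(total) >= observed:
            hits += 1
    # Add-one smoothing keeps the estimate strictly above zero.
    return (hits + 1) / (trials + 1)
```

The inputs would be a per-query metric (for example nDCG@10) for LexPath and for the strongest baseline on the same query set; a large p-value means the observed gain is compatible with chance.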

Figures

Figures reproduced from arXiv: 2605.30205 by Qingfeng Zhuge, Weixuan Liu, Xuyang Chen.

**Figure 2.** An overview of our framework, LEXPATH, for legal article retrieval. To the best of our knowledge, LEXPATH is among the first domain-oriented multi-path frameworks tailored to Chinese legal article retrieval. To address the challenges discussed above, we propose LEXPATH, a legal-specific multi-path retrieval framework that models lexical, legal struc…

**Figure 3.** Parameter analysis of LEXPATH by varying the checkpoint-merge weight, sparse-path weight α, and intent-consistency score weight λ3. Performance also remains stable within a moderate range of α, confirming the complementarity between sparse and dense paths. When α is close to 1.0, performance drops, indicating that sparse matching alone is insufficient for legal articl…

**Figure 8.** Case study of a standalone sparse retrieval failure. The sparse retriever ranks articles with keyword overlap…

**Figure 9.** Case study of a standalone dense retrieval failure. The dense retriever retrieves semantically related…

Original abstract

Legal article retrieval is critical for building traceable and reliable legal AI systems, where conclusions must be grounded in specific legal articles. However, existing open-domain retrieval methods rely heavily on surface-level lexical or semantic similarity, making it difficult for them to distinguish legally relevant articles from those that are textually similar but legally inapplicable or misaligned with the user's underlying intent. To bridge this gap, we propose LexPath, a domain-oriented multi-path framework comprising a multi-path retrieval module and an intent-aware reranking module. The retrieval module combines two complementary legal-specific paths to collect candidate articles: an IRAC-guided sparse path that expands queries with legally informative keywords, and a structure-guided dense path trained with hard negatives derived from legal hierarchy and citation relations. Then, the reranking module further refines the candidate ranking by incorporating the intent consistency score between queries and legal articles. We evaluate LexPath on two publicly available benchmarks focusing on general-public queries and a self-constructed benchmark targeting domain-professional scenarios. Experimental results demonstrate that LexPath consistently outperforms lexical, dense, hybrid, and adaptive retrieval-augmented generation (RAG) baselines. Ablation studies further verify the effectiveness of each component.
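
The query-expansion step in the sparse path can be illustrated with a small post-processing sketch. It assumes a language model has already been prompted to return keywords in a bracketed list such as `[keyword 1, keyword 2, ...]`; the helper names `parse_keyword_list` and `expand_query`, the separator handling, and the choice to append keywords to the query string are assumptions, not the paper's implementation.

```python
import re
from typing import List


def parse_keyword_list(raw: str) -> List[str]:
    """Extract keywords from model output like '[kw1, kw2, ...]'.

    Falls back to splitting the whole string if no brackets are found.
    Accepts ASCII and full-width (Chinese) separators.
    """
    match = re.search(r"\[(.*?)\]", raw, flags=re.DOTALL)
    body = match.group(1) if match else raw
    keywords: List[str] = []
    for part in re.split(r"[,，、]", body):
        kw = part.strip().strip("\"'“”")
        if kw and kw not in keywords:  # drop empties and duplicates
            keywords.append(kw)
    return keywords


def expand_query(query: str, keywords: List[str]) -> str:
    """Append keywords that do not already occur in the query text."""
    extra = [kw for kw in keywords if kw not in query]
    return " ".join([query] + extra)
```

The expanded string would then be passed to a lexical retriever such as BM25, so that legally informative terms absent from a lay query can still match article text.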

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

LexPath adds IRAC and legal hierarchy to a two-path retriever plus intent reranking, but the abstract supplies no numbers, setups, or comparisons, so the claimed gains cannot be checked.

The main takeaway is that LexPath tries to fix surface-similarity problems in legal article retrieval by running an IRAC-guided sparse path alongside a structure-guided dense path and then reranking on intent consistency. It reports better results than lexical, dense, hybrid, and RAG baselines on two public benchmarks plus one self-built professional-query set, with ablations said to confirm each piece.

What stands out is the attempt to inject actual legal structure—IRAC keywords for query expansion and citation/hierarchy relations for hard negatives—rather than treating the task as generic IR. That focus matches a practical need for traceable legal AI, and testing on both general-public and domain-professional data is a reasonable step.

The soft spots are straightforward. The abstract gives no metrics, no dataset sizes, no significance tests, and no description of how the self-constructed benchmark was created or balanced. Without those details the outperformance claim is impossible to evaluate. It is also unclear how the multi-path design differs from earlier hybrid or domain-adapted retrieval work, since no direct comparisons appear in the summary. The full paper might resolve this, but on the visible evidence the soundness cannot be judged.

This is work for people doing specialized retrieval in law or similar regulated domains. A reader looking for concrete ideas on domain-adapted paths could extract some usable pieces, but anyone needing reproducible results or strong novelty claims will find the current write-up thin.

If the experiments, code, and citation discussion hold up in the full manuscript, it is worth sending to referees; the underlying problem is real even if the advance looks incremental.

Referee Report

2 major / 3 minor

Summary. The paper proposes LexPath, a domain-oriented multi-path framework for legal article retrieval. It includes a retrieval module with an IRAC-guided sparse path (expanding queries with legal keywords) and a structure-guided dense path (trained with hard negatives from legal hierarchy and citations), plus an intent-aware reranking module using intent consistency scores. Experiments on two public benchmarks (general-public queries) and one self-constructed benchmark (domain-professional scenarios) claim consistent outperformance over lexical, dense, hybrid, and adaptive RAG baselines, with ablations supporting each component.

Significance. If the empirical claims hold with proper validation, LexPath could meaningfully advance legal IR by moving beyond surface similarity to incorporate domain structures like IRAC and hierarchy relations, aiding traceable legal AI. The multi-path design and hard-negative strategy are promising for domain-specific retrieval, but the abstract alone provides insufficient detail on methods, data, or results to gauge real impact or reproducibility.

major comments (2)

[Abstract] Abstract: The central claim of consistent outperformance and effective ablations cannot be evaluated, as the abstract supplies no methods details, dataset statistics, error bars, statistical significance tests, or full experimental setup (including how baselines were implemented or how the self-constructed benchmark was built).
[Evaluation] Evaluation section (inferred from abstract): The self-constructed benchmark targeting domain-professional scenarios is load-bearing for the domain-oriented claim, yet no description of its size, query construction, annotation process, or inter-annotator agreement is provided, undermining the ability to assess whether improvements reflect legal relevance rather than benchmark artifacts.
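
Inter-annotator agreement of the kind requested here is often reported as Cohen's kappa for two annotators. The sketch below shows the standard computation; whether the paper uses kappa or another agreement statistic is not stated in the visible text, so this is illustrative only.

```python
from collections import Counter
from typing import Hashable, Sequence


def cohen_kappa(labels_a: Sequence[Hashable], labels_b: Sequence[Hashable]) -> float:
    """Cohen's kappa: (p_o - p_e) / (1 - p_e) for two annotators."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two non-empty label lists of equal length")
    n = len(labels_a)
    # Observed agreement: fraction of items labelled identically.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement from each annotator's label distribution.
    count_a, count_b = Counter(labels_a), Counter(labels_b)
    expected = sum(
        count_a[c] * count_b[c] for c in set(count_a) | set(count_b)
    ) / (n * n)
    if expected == 1.0:  # both annotators used a single identical label
        return 1.0
    return (observed - expected) / (1.0 - expected)
```

For relevance annotation, each label would be one annotator's judgment (for example relevant or not relevant) for a query-article pair.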

minor comments (3)

[Abstract] Abstract: Define 'IRAC-guided' more precisely and give at least one concrete example of keyword expansion for a sample query.
[Abstract] Abstract: Specify the public benchmarks by name and citation, and clarify the exact metrics used (e.g., nDCG@10, Recall@100).
[Abstract] Abstract: The phrase 'adaptive retrieval-augmented generation (RAG) baselines' is vague; list the specific RAG methods compared.
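
For readers unfamiliar with the two metrics named as examples above, the sketch below computes Recall@k and nDCG@k with binary relevance. It uses the standard definitions; which metrics and cutoffs the paper actually reports is not stated in the visible text. The ranked list is assumed to contain no duplicate ids.

```python
import math
from typing import Sequence, Set


def recall_at_k(ranked: Sequence[str], relevant: Set[str], k: int) -> float:
    """Fraction of relevant articles that appear in the top k."""
    if not relevant:
        return 0.0
    hits = len(set(ranked[:k]) & relevant)
    return hits / len(relevant)


def ndcg_at_k(ranked: Sequence[str], relevant: Set[str], k: int) -> float:
    """Binary-relevance nDCG: DCG of the ranking over DCG of an ideal one."""
    dcg = sum(
        1.0 / math.log2(rank + 2)          # rank is 0-based, so position 1 -> log2(2)
        for rank, doc in enumerate(ranked[:k])
        if doc in relevant
    )
    ideal_hits = min(len(relevant), k)
    idcg = sum(1.0 / math.log2(rank + 2) for rank in range(ideal_hits))
    return dcg / idcg if idcg > 0 else 0.0
```

Recall@k rewards finding the gold articles anywhere in the top k, while nDCG@k also rewards placing them near the top, so the two can disagree when a system retrieves the right articles but ranks them low.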

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the detailed feedback. We address each major comment below and indicate where revisions will be made to strengthen the manuscript.

Referee: [Abstract] Abstract: The central claim of consistent outperformance and effective ablations cannot be evaluated, as the abstract supplies no methods details, dataset statistics, error bars, statistical significance tests, or full experimental setup (including how baselines were implemented or how the self-constructed benchmark was built).

Authors: We agree the abstract is high-level by design and omits these specifics due to length limits. The full manuscript provides methods details in Section 3 (IRAC-guided sparse path with legal keyword expansion, structure-guided dense path using hierarchy/citation hard negatives, and intent-aware reranking), benchmark statistics and construction in Section 4, and baseline implementations, error bars, and significance tests in Section 5. To improve evaluability from the abstract, we will add concise mentions of key dataset sizes and note that full experimental protocols appear in the body.

Revision: partial

Referee: [Evaluation] Evaluation section (inferred from abstract): The self-constructed benchmark targeting domain-professional scenarios is load-bearing for the domain-oriented claim, yet no description of its size, query construction, annotation process, or inter-annotator agreement is provided, undermining the ability to assess whether improvements reflect legal relevance rather than benchmark artifacts.

Authors: The abstract does not contain these details. The full manuscript describes the self-constructed benchmark in Section 4, covering size, professional query construction, annotation process, and inter-annotator agreement. If the existing description remains insufficient to evaluate legal relevance versus artifacts, we will expand it with additional statistics, query examples, and validation metrics.

Revision: yes

Circularity Check

0 steps flagged

No significant circularity

The paper is an empirical proposal of a multi-path legal retrieval framework (IRAC-guided sparse path + structure-guided dense path + intent reranking) whose central claims rest on benchmark outperformance versus lexical/dense/hybrid/RAG baselines. No equations, first-principles derivations, fitted parameters renamed as predictions, or load-bearing self-citations appear in the abstract or described method; the evaluation is externally falsifiable on public benchmarks and does not reduce to self-definition or construction.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

No mathematical derivations, free parameters, axioms, or invented entities are described in the abstract. The work is an empirical applied framework in information retrieval.

Reference graph

Works this paper leans on

20 extracted references · 4 canonical work pages · 1 internal anchor

[1]

ChatGLM: A Family of Large Language Models from GLM-130B to GLM-4 All Tools

Precise zero-shot dense retrieval without relevance labels. In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1762–1777. Team GLM, Aohan Zeng, Bin Xu, Bowen Wang, Chenhui Zhang, Da Yin, Diego Rojas, Guanyu Feng, Hanlin Zhao, Hanyu Lai, Hao Yu, Hongning Wang, Jiadai Sun, Jiajie ...

[2]

In Proceedings of the nineteenth international conference on artificial intelligence and law, pages 472–480

Summary of the competition on legal information, extraction/entailment (coliee) 2023. In Proceedings of the nineteenth international conference on artificial intelligence and law, pages 472–480. Abe Bohan Hou, Orion Weller, Guanghui Qin, Eugene Yang, Dawn Lawrie, Nils Holzenberger, Andrew Blair-Stanek, and Benjamin Van Durme. 2025. Clerc: A dataset f...

[3]

Legalbench-RAG: A benchmark for retrieval-augmented generation in the legal domain,

An annotation language for semantic search of legal sources. In Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018). OpenAI. 2026a. GPT-5.5. https://platform.openai.com/docs/models. Accessed: 2026-05-23. OpenAI. 2026b. Openai python api library. https://github.com/openai/openai-python. Accessed: 2026-05-1...

[4]

1 你是一位经验丰富的法律顾问。请针对下面的法律问题，使用IRAC方法进行分析并回答，即:

Corrective retrieval augmented generation. 1 你是一位经验丰富的法律顾问。请针对下面的法律问题，使用IRAC方法进行分析并回答，即:
[5]

然后分析事实如何适用这些规则（Application）
[6]

Please analyze and answer the following legal question using the IRAC method:

最后给出结论（Conclusion）问题: {question} English translation: You are an experienced legal advisor. Please analyze and answer the following legal question using the IRAC method:
[7]

First, identify the issue
[8]

Next, state the relevant legal rules
[9]

Then, analyze how the facts apply to these rules
[10]

Question: {question} Prompts for IRAC Analysis Fangyi Yu, Lee Quartey, and Frank Schilder

Finally, draw a conclusion. Question: {question} Prompts for IRAC Analysis Fangyi Yu, Lee Quartey, and Frank Schilder. 2022. Legal prompting: Teaching a language model to think like a lawyer. arXiv preprint arXiv:2212.01326. A Prompt for LEXPATH The prompts used in IRAC-Exp are shown in Figure 4 for IRAC analysis and Figure 5 for keyword extraction. Th...

[11]

used for training legal professionals. The manual focuses on market supervision and administration laws and contains 500 single-choice questions, together with explanatory analysis and legal articles reference for each option. The process of constructing a query dataset is as follows: 1 任务：
[12]

理解问题的语义并改写为更专业的法律术语。
[13]

执行同义词替换，确保问题的核心法律含义不变。
[14]

请直接以[关键词1, 关键词2,...]的格式输出关键词列表，不要有多余的内容。问题：{question} 关键词列表： English translation: Task:
[15]

Understand the meaning of the question and rephrase it using more professional legal terminology
[16]

Replace words with synonyms while ensuring the core legal meaning of the question remains unchanged
[17]

Extract keywords to be used for searching relevant legal provisions
[18]

Please provide as many keywords as possible
[19]

定义类"、"适用类

Please output the keyword list directly in the format [Keyword 1, Keyword 2, ...] without any additional content. Question: {question} Keyword list: Prompts for Keyword Extraction • Exclude questions that are overly trivial, ambiguous, overly dependent, incomplete context, or lack a clear legal article reference, and use the remaining ones as seed que...

2024
[20]

serious circumstance

and GPT-5.5 (OpenAI, 2026a), under four retriever settings: zero-shot, BM25, bge-large-zh-v1.5, and LEXPATH. Following (Li et al., 2025c), keyword accuracy is adopted as the metric for short-answer queries in LexRAG. For StatuteRAG, which consists entirely of true-or-false queries, answer accuracy is reported. STARD is excluded because it does not prov...
