From Query to Counsel: Structured Reasoning with a Multi-Agent Framework and Dataset for Legal Consultation
Pith reviewed 2026-05-10 15:52 UTC · model grok-4.3
The pith
A multi-agent framework with legal element graphs outperforms general and legal LLMs on Chinese consultation QA tasks.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Converting legal queries into legal element graphs that integrate entities, events, intents, and legal issues, then processing those graphs with a modular multi-agent framework supporting dynamic routing, statutory grounding, and stylistic optimization, produces more accurate and context-aware consultation responses than standard LLMs, once the system is trained on JurisCQAD.
What carries the argument
The legal element graph, which integrates entities, events, intents, and legal issues to capture dependencies across facts, norms, and procedures, guiding multi-agent collaboration.
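The paper's actual graph schema is not reproduced here; as a rough sketch of what such a decomposition could look like in code, the following uses the four node types named above and illustrative (not the authors') relation names:

```python
from dataclasses import dataclass, field

# Hypothetical schema for a "legal element graph"; node and relation
# names are illustrative assumptions, not the paper's actual format.
NODE_TYPES = {"entity", "event", "intent", "legal_issue"}

@dataclass
class Node:
    id: str
    type: str   # one of NODE_TYPES
    text: str   # surface span from the query

@dataclass
class LegalElementGraph:
    nodes: dict = field(default_factory=dict)   # id -> Node
    edges: list = field(default_factory=list)   # (src_id, relation, dst_id)

    def add_node(self, id, type, text):
        assert type in NODE_TYPES, f"unknown node type: {type}"
        self.nodes[id] = Node(id, type, text)

    def add_edge(self, src, relation, dst):
        assert src in self.nodes and dst in self.nodes
        self.edges.append((src, relation, dst))

# Toy decomposition of a query like
# "My landlord kept my deposit after I moved out."
g = LegalElementGraph()
g.add_node("e1", "entity", "tenant")
g.add_node("e2", "entity", "landlord")
g.add_node("ev1", "event", "deposit withheld")
g.add_node("i1", "intent", "recover deposit")
g.add_node("li1", "legal_issue", "security-deposit dispute")
g.add_edge("e2", "agent_of", "ev1")
g.add_edge("i1", "motivated_by", "ev1")
g.add_edge("ev1", "raises", "li1")
```

A downstream agent could then route on node types (e.g. statutory grounding keyed to `legal_issue` nodes), which is the kind of dependency the framework is claimed to exploit.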
If this is right
- Better handling of complex contextual dependencies in legal facts, norms, and procedures.
- Higher performance on lexical and semantic evaluation metrics for legal consultation.
- More interpretable reasoning through explicit decomposition and modular agent steps.
- Improved statutory grounding and response style via specialized routing and optimization.
Where Pith is reading between the lines
- The same graph decomposition approach could be tested on non-Chinese legal systems or adjacent domains such as medical consultation.
- Connecting the agents to live statutory databases would likely strengthen the grounding component further.
- The dataset construction method could be reused to create similar resources for other languages or legal traditions.
Load-bearing premise
Expert-validated positive and negative responses accurately represent high-quality legal advice, and the graph decomposition plus multi-agent steps capture all relevant dependencies without introducing new errors.
What would settle it
Evaluation on a held-out set of legal queries where the multi-agent system shows no gain in semantic metrics or produces more factual inaccuracies than a single legal-domain LLM.
Original abstract
Legal consultation question answering (Legal CQA) presents unique challenges compared to traditional legal QA tasks, including the scarcity of high-quality training data, complex task composition, and strong contextual dependencies. To address these, we construct JurisCQAD, a large-scale dataset of over 43,000 real-world Chinese legal queries annotated with expert-validated positive and negative responses, and design a structured task decomposition that converts each query into a legal element graph integrating entities, events, intents, and legal issues. We further propose JurisMA, a modular multi-agent framework supporting dynamic routing, statutory grounding, and stylistic optimization. Combined with the element graph, the framework enables strong context-aware reasoning, effectively capturing dependencies across legal facts, norms, and procedural logic. Trained on JurisCQAD and evaluated on a refined LawBench, our system significantly outperforms both general-purpose and legal-domain LLMs across multiple lexical and semantic metrics, demonstrating the benefits of interpretable decomposition and modular collaboration in Legal CQA.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript introduces JurisCQAD, a large-scale dataset of over 43,000 real-world Chinese legal queries annotated with expert-validated positive and negative responses. It proposes a legal element graph that decomposes each query into entities, events, intents, and legal issues, along with the JurisMA multi-agent framework supporting dynamic routing, statutory grounding, and stylistic optimization. Trained on JurisCQAD and evaluated on a refined LawBench, the system is claimed to significantly outperform both general-purpose and legal-domain LLMs on lexical and semantic metrics, illustrating the benefits of interpretable decomposition and modular collaboration for Legal CQA.
Significance. If the empirical claims hold under rigorous scrutiny, the work provides a substantial new resource in JurisCQAD for legal consultation QA and demonstrates how graph-based decomposition combined with multi-agent collaboration can address complex contextual dependencies in a high-stakes domain. The emphasis on modularity and interpretability is a strength that could inform future domain-specific systems, particularly where factual grounding and procedural logic matter.
Major comments (3)
- [§3] §3 (Dataset Construction): The expert validation of the 43k positive/negative response pairs is presented without inter-annotator agreement figures, annotation guidelines, or disagreement-resolution procedures. This detail is load-bearing for the claim that JurisCQAD supplies reliable high-quality supervision.
- [§5–6] §5–6 (Framework and Experiments): No ablation studies isolate the contribution of the legal element graph construction from the multi-agent routing components, and no error analysis examines whether the graph decomposition introduces new factual or procedural errors. Without these, the attribution of metric gains to the proposed methods remains unverified.
- [§6] §6 (Evaluation): The results section must report concrete numerical values, baseline details, statistical significance tests, and exclusion criteria for the claimed outperformance on the refined LawBench; the abstract alone supplies none of these.
Minor comments (2)
- [§4] A formal diagram or pseudocode for the legal element graph construction would improve clarity in §4.
- [Throughout] Ensure consistency in terminology between 'legal element graph' and 'JurisMA' across sections and figures.
Simulated Author's Rebuttal
We thank the referee for the detailed and constructive feedback on our manuscript. We have carefully considered each major comment and will revise the paper to address the concerns regarding dataset documentation, experimental ablations, error analysis, and result reporting. Below we respond point by point.
Point-by-point responses
Referee: [§3] §3 (Dataset Construction): The expert validation of the 43k positive/negative response pairs is presented without inter-annotator agreement figures, annotation guidelines, or disagreement-resolution procedures. This detail is load-bearing for the claim that JurisCQAD supplies reliable high-quality supervision.
Authors: We agree that explicit documentation of the annotation process is necessary to support claims of dataset quality. In the revised manuscript we will add a new subsection to §3 that (i) reproduces the annotation guidelines given to legal experts, (ii) describes the disagreement-resolution protocol (two-expert review followed by senior adjudicator), and (iii) reports inter-annotator agreement statistics (Cohen’s κ and raw agreement) computed on a 5% stratified sample of the 43k pairs. These additions will be placed before the dataset statistics table. revision: yes
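Cohen's κ, as promised above, corrects raw agreement for chance agreement between two annotators. A minimal stdlib computation over a hypothetical two-annotator sample (the labels below are made up for illustration):

```python
from collections import Counter

def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators labeling the same items."""
    assert len(labels_a) == len(labels_b)
    n = len(labels_a)
    # Observed agreement: fraction of items where the annotators match.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Expected agreement if the two annotators labeled independently,
    # each according to their own marginal label distribution.
    ca, cb = Counter(labels_a), Counter(labels_b)
    expected = sum(ca[k] * cb[k] for k in set(ca) | set(cb)) / (n * n)
    return (observed - expected) / (1 - expected)

# Hypothetical positive/negative judgments on 8 response pairs.
a = ["pos", "pos", "neg", "pos", "neg", "neg", "pos", "pos"]
b = ["pos", "neg", "neg", "pos", "neg", "pos", "pos", "pos"]
print(round(cohens_kappa(a, b), 3))  # → 0.467
```

Here raw agreement is 0.75 but κ drops to about 0.47, which is why reporting both, as the authors propose, is informative.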
Referee: [§5–6] §5–6 (Framework and Experiments): No ablation studies isolate the contribution of the legal element graph construction from the multi-agent routing components, and no error analysis examines whether the graph decomposition introduces new factual or procedural errors. Without these, the attribution of metric gains to the proposed methods remains unverified.
Authors: We accept that the current experimental design does not fully disentangle the contributions of the legal element graph and the multi-agent routing. We will add two new ablation tables in §5: one that removes the graph construction module while keeping the agents, and another that disables dynamic routing while retaining the graph. In addition, we will insert an error-analysis subsection that manually inspects 200 failure cases, categorizes errors attributable to graph decomposition (factual hallucination, missed legal issue, incorrect event linking), and quantifies their downstream effect on final answer quality. These results will be reported alongside the main experiments in the revised §6. revision: yes
Referee: [§6] §6 (Evaluation): The results section must report concrete numerical values, baseline details, statistical significance tests, and exclusion criteria for the claimed outperformance on the refined LawBench; the abstract alone supplies none of these.
Authors: We acknowledge that the present version of §6 and the abstract lack the required quantitative detail. The revised manuscript will expand §6 with (i) full numerical tables for all lexical and semantic metrics on the refined LawBench, (ii) explicit baseline specifications (model names, parameter counts, fine-tuning regimes), (iii) statistical significance results (paired t-tests and McNemar’s test with p-values and confidence intervals), and (iv) a clear statement of exclusion criteria applied to the test set. The abstract will be updated to cite the key absolute improvements (e.g., +X BLEU, +Y F1). revision: yes
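McNemar's test, one of the significance tests the authors commit to, compares two systems on the same test items using only the discordant counts. A stdlib sketch of the exact (binomial) variant, with made-up counts for illustration:

```python
from math import comb

def mcnemar_exact(b, c):
    """Exact (binomial) two-sided McNemar test.

    b = items the baseline got right but the system got wrong,
    c = items the system got right but the baseline got wrong.
    Under H0 (no difference), discordant pairs split 50/50.
    """
    n = b + c
    k = min(b, c)
    # Two-sided exact p-value from Binomial(n, 0.5), capped at 1.
    p = 2 * sum(comb(n, i) for i in range(k + 1)) / 2**n
    return min(p, 1.0)

# Hypothetical discordant counts from a per-item comparison on LawBench:
# baseline-only-correct = 4, system-only-correct = 16.
print(round(mcnemar_exact(4, 16), 4))  # → 0.0118
```

The point of the test is that concordant items (both right or both wrong) carry no information about which system is better, so only the 4-vs-16 split matters.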
Circularity Check
No circularity: empirical dataset construction and framework evaluation
Full rationale
The paper's core contribution is the construction of JurisCQAD (43k expert-annotated queries) and the JurisMA multi-agent framework with legal element graph decomposition, followed by empirical training and evaluation on refined LawBench showing metric gains over baselines. No equations, first-principles derivations, or predictions are claimed; performance claims rest on direct measurement rather than any reduction to fitted inputs or self-citations. The work is self-contained against external benchmarks with no load-bearing self-referential steps.
Axiom & Free-Parameter Ledger
Axioms (2)
- Domain assumption: Expert-validated positive and negative responses provide reliable training and evaluation signals for legal consultation quality.
- Domain assumption: Converting queries into a legal element graph of entities, events, intents, and issues captures the necessary contextual dependencies.
Invented entities (2)
- Legal element graph: no independent evidence
- JurisMA multi-agent framework: no independent evidence
Forward citations
Cited by 1 Pith paper
- Towards Robust Endogenous Reasoning: Unifying Drift Adaptation in Non-Stationary Tuning
  CPO++ adapts reinforcement fine-tuning of MLLMs to endogenous multi-modal concept drift through counterfactual reasoning and preference optimization, yielding better coherence and cross-domain robustness in safety-cri...