Peeking Inside LLMs: Leveraging Internal Artifacts of LLMs for Enhancing Reliability in Legal Classification

Debtanu Datta; Saptarshi Ghosh; Sudipta Santra

arxiv: 2606.20929 · v1 · pith:P2EQRGLVnew · submitted 2026-06-18 · 💻 cs.CL · cs.AI


Pith reviewed 2026-06-26 17:02 UTC · model grok-4.3


keywords: LLM reliability, legal classification, internal artifacts, error detection, bail decision prediction, statute violation prediction, hallucination detection


The pith

Internal artifacts of LLMs serve as reliable signals for detecting incorrect predictions on legal classification tasks.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper tests whether features taken from inside an LLM can be fed into a separate classifier that spots when the LLM has produced a wrong answer on legal problems. Experiments focus on two tasks: predicting bail decisions and predicting which statutes were violated. The downstream classifiers built from these internal features succeed at flagging errors, supporting the idea that the artifacts carry usable information about output correctness. If the pattern holds, legal applications could add an internal check layer to reduce reliance on post-hoc human review.

Core claim

By extracting features from LLMs' internal artifacts and training separate classifiers on those features, it is possible to identify incorrect LLM outputs on legal classification tasks including bail decision prediction and statute violation prediction.
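The core claim can be made concrete with a minimal sketch. It assumes each LLM prediction has already been reduced to a fixed-length feature vector and labeled 1 (LLM was correct) or 0 (LLM was wrong). The features here are synthetic stand-ins, and plain logistic regression is swapped in for the paper's Correctness Detector, whose architecture and feature extraction are not reproduced here. All function names are hypothetical.

```python
import math
import random


def sigmoid(z: float) -> float:
    # Numerically stable logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def train_detector(features, is_correct, epochs=200, lr=0.5):
    """Fit logistic regression by batch gradient descent.

    features: list of equal-length float lists (stand-ins for artifact features).
    is_correct: list of 0/1 labels, 1 when the LLM prediction matched ground truth.
    """
    dim = len(features[0])
    n = len(features)
    weights = [0.0] * dim
    bias = 0.0
    for _ in range(epochs):
        grad_w = [0.0] * dim
        grad_b = 0.0
        for x, y in zip(features, is_correct):
            p = sigmoid(sum(w * v for w, v in zip(weights, x)) + bias)
            err = p - y
            for j in range(dim):
                grad_w[j] += err * x[j]
            grad_b += err
        weights = [w - lr * g / n for w, g in zip(weights, grad_w)]
        bias -= lr * grad_b / n
    return weights, bias


def correctness_score(weights, bias, x):
    # Estimated probability that the LLM prediction behind x was correct.
    return sigmoid(sum(w * v for w, v in zip(weights, x)) + bias)


# Synthetic data: the first feature is informative, the second is noise.
rng = random.Random(0)
train_x, train_y = [], []
for _ in range(200):
    label = rng.randint(0, 1)
    center = 1.0 if label else -1.0
    train_x.append([rng.gauss(center, 0.5), rng.gauss(0.0, 1.0)])
    train_y.append(label)

weights, bias = train_detector(train_x, train_y)
train_acc = sum(
    (correctness_score(weights, bias, x) >= 0.5) == bool(y)
    for x, y in zip(train_x, train_y)
) / len(train_x)
```

The detector never sees the LLM's input text, only the feature vector, which is why the approach leaves the original LLM untouched.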

What carries the argument

Features derived from LLM internal artifacts, used to train downstream error-detection classifiers.

If this is right

LLM-based legal systems can incorporate these artifact-based detectors to flag potential mistakes before outputs are used.
The approach improves reliability of classification without retraining or altering the original LLM.
Detection works for both bail decision prediction and statute violation prediction.
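Flagging potential mistakes before outputs are used amounts to thresholding the detector's score. The sketch below is an illustration only: the threshold value, the scores, and the function names are assumptions, not taken from the paper.

```python
def flag_for_review(scores, threshold=0.5):
    """Return True for predictions whose detector score falls below the threshold."""
    return [s < threshold for s in scores]


def coverage_and_accuracy(scores, is_correct, threshold=0.5):
    """Coverage is the share of predictions accepted; accuracy is measured on those."""
    accepted = [c for s, c in zip(scores, is_correct) if s >= threshold]
    coverage = len(accepted) / len(scores)
    accuracy = sum(accepted) / len(accepted) if accepted else float("nan")
    return coverage, accuracy


# Hypothetical detector scores and whether the LLM was actually correct.
scores = [0.95, 0.80, 0.40, 0.30, 0.90, 0.55]
is_correct = [1, 1, 0, 1, 1, 0]

flags = flag_for_review(scores, 0.5)
coverage, accuracy = coverage_and_accuracy(scores, is_correct, 0.5)
```

Raising the threshold trades coverage for accuracy on the accepted predictions, so the operating point would have to be chosen per deployment.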

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same internal signals might support error detection in other high-stakes domains if the underlying mechanism is not domain-specific.
Developers could derive per-prediction scores directly from these artifacts rather than training separate models.
The result points to a general property of LLMs where their own activations encode information about their accuracy.

Load-bearing premise

The two chosen legal tasks are representative enough that results on them indicate the method will succeed on other legal applications and other LLMs.

What would settle it

Finding that the internal-feature classifiers perform at chance level on a new legal classification task or on a different LLM architecture would falsify the central claim.
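"Chance level" has a precise meaning under ROC-AUC, the metric the paper's reference list points to: a detector whose scores carry no information scores 0.5. The sketch below computes AUC as the probability that a randomly chosen positive outranks a randomly chosen negative, with ties counted as half. The function name and the toy inputs are illustrative.

```python
def roc_auc(labels, scores):
    """AUC as P(score of a positive > score of a negative); ties count 0.5."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative example")
    wins = 0.0
    for p in pos:
        for q in neg:
            if p > q:
                wins += 1.0
            elif p == q:
                wins += 0.5
    return wins / (len(pos) * len(neg))


perfect = roc_auc([1, 1, 0, 0], [0.9, 0.8, 0.2, 0.1])
uninformative = roc_auc([1, 0, 1, 0], [0.5, 0.5, 0.5, 0.5])
partial = roc_auc([1, 0, 1, 0], [0.8, 0.6, 0.4, 0.2])
```

The pairwise loop is quadratic in the number of examples; it is written for clarity, not for large test sets.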

Figures

Figures reproduced from arXiv: 2606.20929 by Debtanu Datta, Saptarshi Ghosh, Sudipta Santra.

**Figure 1.** Illustration of the architecture of an LLM (Llama-8b with 32 decoder layers). The left side shows all layers; the central part shows the internal details of one decoder block. The internal artifacts used in our method are extracted from the self-attention module and the Feed-Forward Network module. The numbers shown in each block (e.g., (8, 4096), (8, 14336)) denote the per-token representation sizes, whe…

**Figure 2.** Overview of the proposed three-stage framework: (1) LLM inference and artifact extraction, (2) Training of Correctness Detector (CD), and (3) Reliability assessment and final decision. Adjacent paper text: … input document (e.g., case facts), and $y_i$ is the corresponding ground-truth label (e.g., 'violation' or 'no-violation'). Similarly, $D_{\text{test}} = \{(x_i, y_i)\}_{i=1}^{n_{\text{test}}}$ is the test set, where $n_{\text{test}}$ is the number of samples in th…

**Figure 3.** Prompt template used for ECHR Dataset for statute violation prediction task. Adjacent paper text: … and judicial reasoning, making it suitable for evaluating the legal reasoning capabilities of models. For this work, we use the classification setting as violation (1) vs. no-violation (0).

**Figure 4.** Prompt template used for ILDC Dataset for bail decision prediction task. Adjacent paper text: CD Training: Activation: ReLU; Dropout: 0.1; Optimizer: AdamW; Learning Rate: 1×10⁻⁴; Batch Size: 128; Loss Function: Cross-Entropy; Early Stopping: Patience = 5 (based on validation ROC-AUC). All settings are kept uniform across models and datasets to ensure fair comparison. 5.4. Evaluation Metrics: Metrics for Evaluating Correctne…
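The CD training settings quoted alongside Figure 4 can be written down as a configuration, and the early-stopping rule (patience 5 on validation ROC-AUC) as a small function. The values in the dictionary come from that text; the key names, the function, and the validation history are hypothetical, and the exact tie-handling the authors use is not stated, so a strict-improvement rule is assumed.

```python
# Settings as quoted in the paper excerpt; key names are illustrative.
CD_TRAINING_CONFIG = {
    "activation": "ReLU",
    "dropout": 0.1,
    "optimizer": "AdamW",
    "learning_rate": 1e-4,
    "batch_size": 128,
    "loss": "cross_entropy",
    "early_stopping_patience": 5,
}


def early_stopping_epoch(val_auc_per_epoch, patience=5):
    """Return (stop_epoch, best_epoch), both 0-indexed.

    Training stops at the first epoch that is `patience` epochs past the
    last strict improvement in validation ROC-AUC (assumed rule).
    """
    best_auc = float("-inf")
    best_epoch = -1
    for epoch, auc in enumerate(val_auc_per_epoch):
        if auc > best_auc:
            best_auc = auc
            best_epoch = epoch
        elif epoch - best_epoch >= patience:
            return epoch, best_epoch
    return len(val_auc_per_epoch) - 1, best_epoch


# Hypothetical validation curve: the best value first appears at epoch 2.
history = [0.60, 0.65, 0.70, 0.69, 0.68, 0.70, 0.69, 0.67, 0.66, 0.71]
stop_epoch, best_epoch = early_stopping_epoch(
    history, CD_TRAINING_CONFIG["early_stopping_patience"]
)
```

In this toy history the rule stops before the late improvement at epoch 9, which illustrates the usual cost of a short patience window.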

Original abstract

Large Language Models (LLMs) are increasingly being adopted in the legal domain. However, despite their strong performance, LLMs are prone to generating incorrect or hallucinated outputs, raising serious concerns about their reliability in high-stakes domains such as law. Detecting the correctness of responses of LLM-based systems is therefore a critical challenge. In this work, we explore the potential of leveraging internal artifacts of LLM to detect the correctness of their predictions in legal-domain classification tasks. We develop approaches that utilize features derived from these internal artifacts to build downstream classifiers capable of identifying incorrect LLM outputs. We evaluate our approach on two representative legal classification tasks: bail decision prediction and statute violation prediction. Our experimental results demonstrate that LLMs' internal artifacts are reliable indicators for detecting incorrect predictions in legal classification tasks, and can be applied to enhance the reliability of LLM-based classification systems.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

The paper applies known internal-state probing to flag LLM errors on two legal tasks but provides no evidence it generalizes or beats simple baselines.


The main point is that internal LLM features can be turned into a downstream detector for wrong outputs on bail decisions and statute violations. The abstract frames this as a reliability fix for legal classification.

The work takes an established idea—pulling activations or attention patterns to predict model mistakes—and points it at two concrete legal problems. That domain focus is the only real addition; nothing in the description suggests a new extraction method or theoretical derivation.

What stands out is the choice of high-stakes tasks where false positives carry real cost. If the experiments later show the detector catching errors the base LLM misses, that would be a practical data point for anyone deploying LLMs in law.

The soft spots are the narrow test bed and missing controls. Two tasks do not establish that the same artifacts work for other legal problems or other model families. The abstract gives no numbers on accuracy lift, no comparison to output entropy or self-consistency checks, and no cross-model results. The representativeness concern in the stress-test note holds up: positive results here do not automatically support the broader claim about “legal classification tasks” in general.
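The two baselines named here, output entropy and self-consistency, need no access to internal activations, which is what makes them a natural control. A minimal sketch follows, assuming the LLM exposes class logits and can be sampled several times on the same input. The function names and example values are illustrative, not from the paper.

```python
import math
from collections import Counter


def softmax(logits):
    # Subtract the maximum for numerical stability.
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def output_entropy(logits):
    """Shannon entropy (in nats) of the predicted class distribution."""
    probs = softmax(logits)
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def self_consistency(sampled_labels):
    """Fraction of sampled answers that agree with the majority answer."""
    counts = Counter(sampled_labels)
    return counts.most_common(1)[0][1] / len(sampled_labels)


uncertain = output_entropy([0.0, 0.0])      # two equally likely classes
confident = output_entropy([10.0, -10.0])   # one class dominates
agreement = self_consistency(
    ["violation", "violation", "no-violation", "violation"]
)
```

High entropy or low agreement would be read as a sign of a possibly wrong prediction; an artifact-based detector is only interesting if it beats scores like these.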

This is for groups already working on legal NLP or LLM monitoring who need a starting point for domain-specific probes. A reader looking for first-principles advances or large-scale validation will not find it.

It deserves a serious referee because the idea is straightforward to test and the application area matters, even though the current write-up is preliminary. Send it for review but ask for expanded experiments and baseline comparisons.

Referee Report

1 major / 0 minor

Summary. The manuscript claims that internal artifacts of LLMs (such as hidden states or attention patterns) can be leveraged as features to train downstream classifiers that detect incorrect LLM predictions in legal-domain classification tasks. It evaluates the approach on two tasks—bail decision prediction and statute violation prediction—and reports that the artifacts serve as reliable indicators, enabling enhanced reliability for LLM-based legal classification systems.

Significance. If the results hold, the work offers a practical internal mechanism for flagging erroneous outputs in high-stakes legal applications, potentially reducing reliance on external verification and improving trustworthiness of LLMs in the legal domain.

major comments (1)

[Abstract] The claim that internal artifacts are 'reliable indicators for detecting incorrect predictions in legal classification tasks' (and can enhance reliability of LLM-based systems) rests on experiments with only two tasks. No cross-task, cross-domain, or cross-model validation is described, leaving the general scope of the reliability claim unsupported.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for highlighting the scope of our claims. We agree that the abstract's phrasing implies broader applicability than the two-task evaluation supports, and we will revise to address this.


Referee: [Abstract] The claim that internal artifacts are 'reliable indicators for detecting incorrect predictions in legal classification tasks' (and can enhance reliability of LLM-based systems) rests on experiments with only two tasks. No cross-task, cross-domain, or cross-model validation is described, leaving the general scope of the reliability claim unsupported.

Authors: We agree the abstract overstates generality. The two tasks (bail decision and statute violation prediction) were chosen as representative legal classification problems, but no cross-task, cross-domain, or cross-model experiments are reported. In revision we will (1) qualify the abstract to state that internal artifacts serve as reliable indicators on the two evaluated tasks and (2) add an explicit Limitations section discussing the need for broader validation before claiming domain-wide reliability.

Revision: yes

Circularity Check

0 steps flagged

No circularity: empirical feature-based classifier training is independent of target labels


The paper presents an empirical pipeline that extracts internal LLM artifacts as input features, then trains separate downstream classifiers to predict whether an LLM output is correct or incorrect. This is a standard supervised learning setup with no self-definitional loop (the correctness label is external ground truth, not derived from the artifacts), no fitted-input-called-prediction, and no load-bearing self-citations or imported uniqueness theorems. The two-task evaluation is a scope limitation rather than a circular reduction. The derivation chain is self-contained against external benchmarks.
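The non-circularity argument rests on where the detector's training label comes from: it is a comparison between the LLM's prediction and the external ground truth, and the artifact features play no part in computing it. A minimal sketch of that labeling step follows, with hypothetical names and toy values.

```python
def build_detector_dataset(artifact_features, llm_predictions, gold_labels):
    """Pair each feature vector with a 0/1 correctness label.

    The label depends only on the LLM prediction and the gold label,
    never on the features themselves.
    """
    if not (len(artifact_features) == len(llm_predictions) == len(gold_labels)):
        raise ValueError("inputs must have the same length")
    return [
        (features, int(pred == gold))
        for features, pred, gold in zip(
            artifact_features, llm_predictions, gold_labels
        )
    ]


# Toy example: three cases, the LLM is wrong on the second one.
dataset = build_detector_dataset(
    artifact_features=[[0.2, 1.1], [0.9, -0.3], [0.4, 0.7]],
    llm_predictions=["violation", "violation", "no-violation"],
    gold_labels=["violation", "no-violation", "no-violation"],
)
detector_labels = [label for _, label in dataset]
```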

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract provides no explicit parameters, axioms, or invented entities; review is limited to surface claims.

pith-pipeline@v0.9.1-grok · 5684 in / 903 out tokens · 22879 ms · 2026-06-26T17:02:25.904849+00:00 · methodology


Reference graph

Works this paper leans on

25 extracted references · 20 canonical work pages · 2 internal anchors

[1] T. Brown, B. Mann, N. Ryder, M. Subbiah, J. D. Kaplan, P. Dhariwal, A. Neelakantan, P. Shyam, G. Sastry, A. Askell, et al., Language Models are Few-Shot Learners, in: Advances in Neural Information Processing Systems, volume 33, Curran Associates, Inc., 2020, pp. 1877–1901. URL: https://proceedings.neurips.cc/paper_files/paper/2020/file/1457c0d6bfc...

[2] A. Chowdhery, S. Narang, J. Devlin, M. Bosma, G. Mishra, A. Roberts, P. Barham, H. W. Chung, C. Sutton, S. Gehrmann, et al., PaLM: Scaling Language Modeling with Pathways, Journal of Machine Learning Research 24 (2023) 1–113. URL: http://jmlr.org/papers/v24/22-1144.html

[3] N. Aletras, D. Tsarapatsanis, D. Preoţiuc-Pietro, V. Lampos, Predicting judicial decisions of the European Court of Human Rights: a Natural Language Processing perspective, PeerJ Computer Science 2 (2016) e93. doi:10.7717/peerj-cs.93

[4] D. Datta, R. Mukherjee, A. Goswami, S. Ghosh, Advantages of Domain Knowledge Injection for Legal Document Summarization: A Case Study on Summarizing Indian Court Judgments in English and Hindi, arXiv preprint arXiv:2602.07382 (2026). doi:10.48550/arXiv.2602.07382

[5] A. Deroy, K. Ghosh, S. Ghosh, Investigating legal question generation using large language models, Artificial Intelligence and Law (2025) 1–39. doi:10.1007/s10506-025-09452-y

[6] S. K. Nigam, D. P. Balaramamahanthi, S. Mishra, N. Shallum, K. Ghosh, A. Bhattacharya, NyayaAnumana and INLegalLlama: The Largest Indian Legal Judgment Prediction Dataset and Specialized Language Model for Enhanced Decision Analysis, in: Proceedings of the 31st International Conference on Computational Linguistics, 2025, pp. 11135–11160. URL: https://...

[7] L. Huang, W. Yu, W. Ma, W. Zhong, Z. Feng, H. Wang, Q. Chen, W. Peng, X. Feng, B. Qin, et al., A Survey on Hallucination in Large Language Models: Principles, Taxonomy, Challenges, and Open Questions, ACM Transactions on Information Systems 43 (2025) 1–55. doi:10.1145/3703155

[8] S. Abdullahi, K. U. Danyaro, H. Chiroma, The rise of hallucination in large language models: systematic reviews, performance analysis and challenges, Cluster Computing 29 (2026) 124. doi:10.1007/s10586-025-05891-z

[9] B. Snyder, M. Moisescu, M. B. Zafar, On Early Detection of Hallucinations in Factual Question Answering, in: Proceedings of the 30th ACM SIGKDD Conference on Knowledge Discovery and Data Mining, 2024, pp. 2721–2732. doi:10.1145/3637528.3671796

[10] D. Datta, M. K. Chilukuri, Y. Kumar, S. Ghosh, M. B. Zafar, Do LLM hallucination detectors suffer from low-resource effect?, in: Proceedings of the 19th Conference of the European Chapter of the Association for Computational Linguistics (Volume 1: Long Papers), Association for Computational Linguistics, 2026, pp. 2959–2985. doi:10.18653/v1/2026.eacl-long.136

[11] Z. Ji, D. Chen, E. Ishii, S. Cahyawijaya, Y. Bang, B. Wilie, P. Fung, LLM Internal States Reveal Hallucination Risk Faced With a Query, in: Proceedings of the 7th BlackboxNLP Workshop: Analyzing and Interpreting Neural Networks for NLP, 2024, pp. 88–104. doi:10.18653/v1/2024.blackboxnlp-1.6

[12] C. Jiang, X. Yang, Legal Syllogism Prompting: Teaching Large Language Models for Legal Judgment Prediction, in: Proceedings of the 19th International Conference on Artificial Intelligence and Law (ICAIL), 2023, pp. 417–421. doi:10.1145/3594536.3595170

[13] H. Dai, W. Zhao, L. Li, Enhancing Legal Judgment Prediction in LLMs via Legal Norms Integration, in: International Conference on Knowledge Science, Engineering and Management, Springer, 2025, pp. 202–214. doi:10.1007/978-981-95-3055-7_16

[14] A. Sivakumar, A. Palanivel, K. Subbaraj, Predictive Modeling for Bail Applications in Indian Courts Using IndicBERT and HLDC Dataset, in: International Conference on Smart Data Intelligence, Springer, 2025, pp. 545–557. doi:10.1007/978-981-96-5265-5_42

[15] C. Xiao, H. Zhong, Z. Guo, C. Tu, Z. Liu, M. Sun, Y. Feng, X. Han, Z. Hu, H. Wang, et al., CAIL2018: A Large-Scale Legal Dataset for Judgment Prediction, arXiv preprint arXiv:1807.02478 (2018). doi:10.48550/arXiv.1807.02478

[16] J. Niklaus, I. Chalkidis, M. Stürmer, Swiss-Judgment-Prediction: A Multilingual Legal Judgment Prediction Benchmark, in: Proceedings of the Natural Legal Language Processing Workshop 2021, Association for Computational Linguistics, 2021, pp. 19–35. doi:10.18653/v1/2021.nllp-1.3

[17] I. Chalkidis, I. Androutsopoulos, N. Aletras, Neural Legal Judgment Prediction in English, in: A. Korhonen, D. Traum, L. Màrquez (Eds.), Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, Association for Computational Linguistics, Florence, Italy, 2019, pp. 4317–4323. URL: https://aclanthology.org/P19-1424/. doi:10.18653/v1/p19-1424

[18] V. Malik, R. Sanjay, S. K. Nigam, K. Ghosh, S. K. Guha, A. Bhattacharya, A. Modi, ILDC for CJPE: Indian Legal Documents Corpus for Court Judgment Prediction and Explanation, in: C. Zong, F. Xia, W. Li, R. Navigli (Eds.), Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on... doi:10.18653/v1/2021.acl-long.313

[19] S. Farquhar, J. Kossen, L. Kuhn, Y. Gal, Detecting hallucinations in large language models using semantic entropy, Nature (2024). doi:10.1038/s41586-024-07421-0

[20] P. Manakul, A. Liusie, M. Gales, SelfCheckGPT: Zero-Resource Black-Box Hallucination Detection for Generative Large Language Models, arXiv preprint arXiv:2303.08896 (2023). doi:10.48550/arXiv.2303.08896

[21] A. Grattafiori, A. Dubey, A. Jauhri, A. Pandey, A. Kadian, A. Al-Dahle, A. Letman, A. Mathur, A. Schelten, A. Vaughan, et al., The Llama 3 Herd of Models, arXiv preprint arXiv:2407.21783 (2024)

[22] S. Yu, G. Kim, S. Kang, Context and Layers in Harmony: A Unified Strategy for Mitigating LLM Hallucinations, Mathematics 13 (2025) 1831. doi:10.3390/math13111831

[23] X. Song, K. Wang, P. Li, L. Yin, S. Liu, Demystifying the Roles of LLM Layers in Retrieval, Knowledge, and Reasoning, arXiv preprint arXiv:2510.02091 (2025)

[24] T. Fawcett, An introduction to ROC analysis, Pattern Recognition Letters 27 (2006) 861–874. doi:10.1016/j.patrec.2005.10.010

[25] J. A. Hanley, B. J. McNeil, The meaning and use of the area under a receiver operating characteristic (ROC) curve, Radiology 143 (1982) 29–36. doi:10.1148/radiology.143.1.7063747
