
arxiv: 2606.23375 · v1 · pith:MNSEY5I3new · submitted 2026-06-22 · 💻 cs.CL · cs.AI

Measuring & Mitigating Over-Alignment for LLMs in Multilingual Criminal Law Courts

Pith reviewed 2026-06-26 08:09 UTC · model grok-4.3

classification 💻 cs.CL cs.AI
keywords over-alignment · LLM refusals · criminal law · multilingual benchmark · abliteration · legal translation · Swiss court rulings · task faithfulness

The pith

Abliteration of refusal directions eliminates over-alignment refusals in LLMs for criminal law tasks with minimal effect on performance.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper demonstrates that LLMs activate guardrails and produce refusals or disclaimers when processing criminal law content involving violence or sexual offenses, even during legitimate court tasks such as translation and summarization. It creates TF-RefusalBench, a set of 5,200 prompts drawn from public Swiss Supreme Court rulings in French, German, Italian, and English, to measure this over-alignment across models and languages. The work shows that the phenomenon depends on both the model and the languages involved, and that disclaimers reduce output faithfulness beyond outright refusals. It finds that prompting offers partial relief while abliteration removes the refusals more thoroughly and leaves task performance largely intact.
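One way to make the "depends on model and languages" observation concrete is to tabulate outcomes per (model, prompt language, text language) cell and to keep refusals and disclaimers as separate rates. The sketch below is an illustration only: the outcome labels (`ok`, `disclaimer`, `refusal`), the model names, and the toy records are assumptions invented here, not data or code from the paper.

```python
from collections import Counter, defaultdict

# Toy records invented for illustration only (not results from the paper).
# Each record: (model, prompt_language, text_language, outcome), where
# outcome is one of the hypothetical labels "ok", "disclaimer", "refusal".
records = [
    ("model-a", "de", "fr", "ok"),
    ("model-a", "de", "fr", "refusal"),
    ("model-a", "de", "fr", "disclaimer"),
    ("model-a", "en", "it", "ok"),
    ("model-b", "de", "fr", "ok"),
    ("model-b", "de", "fr", "ok"),
]


def over_alignment_rates(rows):
    """Return per-cell refusal and disclaimer rates plus the cell size."""
    counts = defaultdict(Counter)
    for model, prompt_lang, text_lang, outcome in rows:
        counts[(model, prompt_lang, text_lang)][outcome] += 1

    rates = {}
    for key, outcome_counts in counts.items():
        n = sum(outcome_counts.values())
        rates[key] = {
            "refusal": outcome_counts["refusal"] / n,
            "disclaimer": outcome_counts["disclaimer"] / n,
            "n": n,
        }
    return rates


rates = over_alignment_rates(records)
for key in sorted(rates):
    print(key, rates[key])
```

Reporting the two rates side by side reflects the point made above: an output that carries a disclaimer is not a refusal, yet it can still reduce faithfulness.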

Core claim

Over-alignment in LLMs produces refusals and disclaimers on criminal law texts from court rulings, compromising legitimate multilingual translation and summarization work. TF-RefusalBench quantifies the issue across four languages and multiple models. Abliteration eliminates refusal behaviors with minimal impact on task performance.
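The core operation behind ablating a refusal direction can be sketched as a projection: given a unit vector r, each activation h is replaced by h - (h · r) r, which removes the component of h along r. The example below is a minimal sketch with toy vectors. It assumes the direction is estimated as a difference of mean activations between refused and complied prompts, which this summary does not confirm as the paper's procedure; all function names are hypothetical.

```python
import math


def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def mean_vector(vectors):
    """Component-wise mean of a list of equal-length vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def refusal_direction(refused_acts, complied_acts):
    """Unit vector along the difference of mean activations (an assumption)."""
    diff = [
        a - b
        for a, b in zip(mean_vector(refused_acts), mean_vector(complied_acts))
    ]
    norm = math.sqrt(dot(diff, diff))
    if norm == 0.0:
        raise ValueError("mean activations coincide; no direction to ablate")
    return [x / norm for x in diff]


def ablate(h, r_hat):
    """Remove the component of h along the unit vector r_hat."""
    coeff = dot(h, r_hat)
    return [x - coeff * r for x, r in zip(h, r_hat)]


# Toy 3-dimensional activations, invented for illustration.
refused = [[2.0, 0.0, 1.0], [4.0, 0.0, 1.0]]
complied = [[0.0, 0.0, 1.0], [0.0, 0.0, 1.0]]

r_hat = refusal_direction(refused, complied)
h_ablated = ablate([5.0, 2.0, -1.0], r_hat)
print(r_hat, h_ablated)
```

In a real model the same projection would be applied to hidden states or folded into weight matrices; this sketch only shows the linear algebra on plain Python lists.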

What carries the argument

TF-RefusalBench, a multilingual benchmark of 5,200 prompts and passages from public Swiss Supreme Court rulings used to test refusal in criminal-law translation and summarization tasks.

Load-bearing premise

The 5,200 prompts and passages in TF-RefusalBench are representative of the criminal-law content that appears in real court workflows and do not systematically over- or under-trigger refusal.

What would settle it

Running the same models on a collection of private, non-public Swiss court documents and measuring both refusal rates and task faithfulness against the benchmark results.

Figures

Figures reproduced from arXiv: 2606.23375 by Andrei Kucharavy, Arthur Wuhrmann, Daniel Brunner, Gaetan Stein.

Figure 1: A single request (translation of a public Fed […]
Figure 2: Pipeline for the creation of TF-RefusalBench […]
Figure 3: Category distribution of the 648-extract can […]
Figure 4: Refusals in translations depending on the […]
Figure 5: System-prompt effect on over-alignment (base Llama-3.3-70B, over-alignment-prone subset), with 95% […]
Figure 6: Over-alignment of summarization prompts as […]
Figure 7: Prompt-language concordance by model and axis (pooled over task, off-diagonal prompts), with 95% […]

Original abstract

While the wider applicability of LLMs in the legal field is currently debated because of their reliability and the gravity of any errors, narrow uses with well-understood and mitigated risks have emerged. Notably, the Swiss Federal Supreme Court uses small on-premises models for tentative translations and short-passage summarization across the four official languages. However, such usage is challenging in the context of criminal law. Because the rulings and cases that employees routinely work on can contain detailed descriptions of violent and sexual offenses, their legitimate work is compromised by refusals and disclaimers caused by the activation of model guardrails (over-alignment). To measure this phenomenon, we introduce TF-RefusalBench, a multilingual benchmark for criminal-law translation and summarization derived from public Swiss Supreme Court rulings. TF-RefusalBench contains 5,200 total prompts across French, German, Italian, and English, corresponding to common task prompts and passages likely to trigger refusal. We then use TF-RefusalBench to show that over-alignment is a multifaceted phenomenon, influenced by the model and by the languages of the prompt and of the text being processed, and that its impact cannot be evaluated solely from an over-refusal perspective, given the impact of disclaimers on task faithfulness. Finally, we evaluate approaches to enable on-premises LLMs for criminal-law tasks, demonstrating that while prompting can be effective, abliteration (ablation of refusal directions) eliminates refusal with minimal impact on task performance.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper introduces TF-RefusalBench, a multilingual benchmark of 5,200 prompts and passages drawn from public Swiss Supreme Court rulings for criminal-law translation and summarization across French, German, Italian, and English. It argues that over-alignment (refusals and disclaimers) is a multifaceted issue varying by model and language pair, cannot be assessed from refusal rates alone because of effects on task faithfulness, and shows that abliteration removes refusals while prompting is less reliable, with abliteration having minimal impact on downstream performance.

Significance. If the central empirical claims hold, the work supplies a domain-specific benchmark and a practical mitigation (abliteration) for deploying on-premises LLMs in high-stakes multilingual legal settings such as the Swiss Federal Supreme Court. The emphasis on faithfulness beyond raw refusal rates and the multilingual scope are useful contributions to the literature on alignment in specialized domains.

major comments (2)
  1. [§3.2] (TF-RefusalBench construction): The selection of the 5,200 prompts/passages as 'likely to trigger refusal' is described only at a high level with no explicit criteria, inter-annotator agreement statistics, or comparison against an unselected distribution of Swiss criminal rulings. Because the strongest claim (abliteration eliminates refusal with minimal task-performance impact) rests entirely on results from this benchmark, the absence of representativeness evidence is load-bearing.
  2. [§5] (Experiments and results): The manuscript provides no definition of the exact metrics for 'task performance' and 'faithfulness,' no statistical tests or effect-size reporting for the claim of 'minimal impact,' and insufficient model specifications (base models, sizes, training details). These omissions prevent assessment of whether the multifaceted over-alignment findings and the abliteration advantage are robust.
minor comments (2)
  1. [§5] Tables reporting refusal rates and faithfulness scores should include confidence intervals or standard errors to allow readers to judge the 'minimal impact' claim quantitatively.
  2. [Abstract] The abstract and §4 could more explicitly list the specific LLMs evaluated and the language-pair combinations that drive the main conclusions.
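As an illustration of the confidence intervals requested in minor comment 1, a refusal rate is a binomial proportion, so a Wilson score interval is one reasonable choice. The sketch below is not taken from the paper; the function name and the counts in the usage line are hypothetical, and `z=1.96` corresponds to a nominal 95% level.

```python
import math


def wilson_interval(successes, n, z=1.96):
    """Wilson score interval for a binomial proportion (z=1.96 for ~95%)."""
    if n <= 0:
        raise ValueError("n must be positive")
    if not 0 <= successes <= n:
        raise ValueError("successes must lie between 0 and n")
    p = successes / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half_width = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    # Clamp to [0, 1] to absorb floating-point noise at the boundaries.
    return max(0.0, center - half_width), min(1.0, center + half_width)


# Hypothetical counts for illustration: 12 refusals out of 200 prompts.
low, high = wilson_interval(12, 200)
print(f"refusal rate 12/200, 95% interval: [{low:.3f}, {high:.3f}]")
```

Unlike the simple normal approximation, the Wilson interval stays inside [0, 1] and remains informative when the observed rate is zero, which matters for a model whose refusals have been fully removed.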

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback. We address each major comment below, committing to revisions that strengthen the methodological transparency of the work without altering its core claims.

Point-by-point responses
  1. Referee: [§3.2] (TF-RefusalBench construction): The selection of the 5,200 prompts/passages as 'likely to trigger refusal' is described only at a high level with no explicit criteria, inter-annotator agreement statistics, or comparison against an unselected distribution of Swiss criminal rulings. Because the strongest claim (abliteration eliminates refusal with minimal task-performance impact) rests entirely on results from this benchmark, the absence of representativeness evidence is load-bearing.

    Authors: We agree that the description in §3.2 is high-level and would benefit from additional detail. The passages were drawn from publicly available Swiss Supreme Court rulings and selected for content describing violent or sexual offenses, which are the categories that most reliably trigger refusal behaviors in the target models during criminal-law tasks. In revision we will add explicit selection criteria (keyword lists derived from the Swiss Criminal Code plus manual verification), provide concrete examples of included and excluded passages, and note that curation was performed by a single domain expert with co-author verification (hence no IAA statistic was computed). Regarding representativeness, the benchmark is deliberately scoped to refusal-prone cases rather than a uniform random sample of all rulings; we will add a short distributional comparison of offense types against publicly reported Swiss criminal-case statistics to make this scoping explicit. revision: yes

  2. Referee: [§5] (Experiments and results): The manuscript provides no definition of the exact metrics for 'task performance' and 'faithfulness,' no statistical tests or effect-size reporting for the claim of 'minimal impact,' and insufficient model specifications (base models, sizes, training details). These omissions prevent assessment of whether the multifaceted over-alignment findings and the abliteration advantage are robust.

    Authors: We accept that the experimental reporting in §5 lacks the requested precision. In the revised manuscript we will (i) define task performance explicitly (translation: BLEU and COMET; summarization: ROUGE-L plus an NLI-based faithfulness score), (ii) define faithfulness to include both factual consistency and the absence of disclaimers that alter output content, (iii) report paired t-tests together with Cohen’s d effect sizes for all “minimal impact” comparisons, and (iv) expand model specifications to list exact base models, parameter counts, and links to the corresponding Hugging Face model cards. These additions will allow readers to evaluate the robustness of the over-alignment and abliteration results. revision: yes
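The paired comparison promised in point (iii) can be sketched as follows: per-item scores from the base and the abliterated model are differenced, and the mean difference is scaled by the standard deviation of the differences. This is a minimal illustration under stated assumptions. The rebuttal does not say which variant of Cohen's d is meant; the sketch uses the paired-samples form (mean difference divided by the standard deviation of differences), and the scores are toy numbers, not results.

```python
import math
import statistics


def paired_t_and_cohens_d(before, after):
    """Return (t statistic, paired Cohen's d, degrees of freedom)."""
    if len(before) != len(after) or len(before) < 2:
        raise ValueError("need two equal-length samples with at least 2 items")
    diffs = [a - b for a, b in zip(after, before)]
    mean_diff = statistics.mean(diffs)
    sd_diff = statistics.stdev(diffs)  # sample standard deviation (n - 1)
    if sd_diff == 0:
        raise ValueError("all differences are identical; t is undefined")
    n = len(diffs)
    t_stat = mean_diff / (sd_diff / math.sqrt(n))
    d_paired = mean_diff / sd_diff
    return t_stat, d_paired, n - 1


# Toy per-item scores invented for illustration (not results from the paper).
base_scores = [1.0, 2.0, 3.0, 4.0]
ablated_scores = [2.0, 4.0, 3.0, 5.0]

t_stat, d_paired, dof = paired_t_and_cohens_d(base_scores, ablated_scores)
print(t_stat, d_paired, dof)
```

Turning the t statistic into a p-value additionally requires the t distribution with n - 1 degrees of freedom, for example through `scipy.stats.ttest_rel`. For a "minimal impact" claim, the effect size and its interval are usually more informative than the p-value alone.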

Circularity Check

0 steps flagged

No circularity: purely empirical benchmark construction and evaluation

Full rationale

The paper constructs TF-RefusalBench from public Swiss Supreme Court rulings and measures refusal rates plus post-mitigation task performance on it. No derivations, equations, fitted parameters renamed as predictions, or self-citation chains appear in the abstract or described methodology. The central claim (abliteration reduces refusal with minimal performance loss) is supported by direct measurement on the new benchmark rather than by reducing to prior self-authored results or by construction. This matches the default expectation of non-circular empirical work.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Empirical benchmark and ablation study; no mathematical derivations, free parameters, or new postulated entities are introduced.

