arxiv: 2606.27069 · v2 · pith:ZSVALIT4new · submitted 2026-06-25 · 💻 cs.CL

Towards Explainable Adjudicative Variance: Quantifying Judicial Discretion via Gated Multi-Task Learning

Pith reviewed 2026-06-29 04:46 UTC · model grok-4.3

classification 💻 cs.CL
keywords legal outcome prediction · judicial discretion · gated multi-task learning · judge identity conditioning · outcome taxonomy · UK employment tribunal · parameter efficiency · explainable predictions

The pith

A gated multi-task architecture separates factual merits from judicial discretion to predict legal outcomes more accurately and efficiently than larger generative models.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes that legal outcome prediction benefits when models explicitly distinguish merit-based rulings from technical disposals that depend on judicial discretion, using a judge-aware gated architecture supervised by a fine-grained outcome taxonomy. A sympathetic reader would care because current single-channel generative approaches compose judge identity and outcome signals only weakly, limiting performance especially on ambiguous or rare cases. The design shows that structured, differentiable conditioning interfaces can deliver higher accuracy and greater parameter efficiency than prompt-based methods over much larger backbones, while also enabling interpretability through judge embeddings.

Core claim

The paper claims that for identity-conditioned classification of legal outcomes, the choice of conditioning interface dominates scale. Coupling a LoRA-adapted encoder with the Judge-Aware Gated Multi-Task Learning architecture, where the fine-grained outcome taxonomy supervises the encoder to enforce structural regularization and disentangle semantic pathways, achieves a new state of the art on 13,937 UK Employment Tribunal decisions. It requires an order of magnitude fewer trainable parameters than the generative supervised fine-tuning baselines, with gains concentrated on the most ambiguous and rarest outcome classes, and learned embeddings localize where adjudicative context drives predictions.

What carries the argument

The Judge-Aware Gated Multi-Task Learning architecture, whose Gated Fusion mechanism dynamically modulates reliance on judge identity under supervision from the fine-grained outcome taxonomy.
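
As a rough illustration of what such a gate could look like, the sketch below blends a text representation with a judge embedding through a per-dimension sigmoid gate. The function name `gated_fusion`, the convex-combination form, and all toy values are assumptions made for illustration; this page does not give the paper's exact formulation (per the Figure 1 caption, the judge embedding is fused into a label-wise attention component).

```python
import math

def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))

def gated_fusion(text_vec, judge_vec, gate_weights, gate_bias):
    """Blend text and judge vectors with a per-dimension sigmoid gate.

    Hypothetical form (not taken from the paper):
        g     = sigmoid(W [text; judge] + b)
        fused = (1 - g) * text + g * judge
    """
    concat = list(text_vec) + list(judge_vec)
    fused, gates = [], []
    for i, (t, j) in enumerate(zip(text_vec, judge_vec)):
        pre_activation = sum(w * x for w, x in zip(gate_weights[i], concat)) + gate_bias[i]
        g = sigmoid(pre_activation)
        gates.append(g)
        fused.append((1.0 - g) * t + g * j)
    return fused, gates

# Toy inputs: two-dimensional text and judge vectors, zero gate weights.
text = [0.5, -1.0]
judge = [2.0, 1.0]
zero_w = [[0.0] * 4 for _ in range(2)]

# A strongly negative bias closes the gate, so the output is (almost) text only.
closed, g_closed = gated_fusion(text, judge, zero_w, [-20.0, -20.0])
# A zero bias gives g = 0.5, i.e. the midpoint of the two vectors.
half, g_half = gated_fusion(text, judge, zero_w, [0.0, 0.0])
print(closed, half)
```

In a trained model the gate weights would be learned, so the gate values themselves become a per-case readout of how much the prediction leans on judge identity.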

If this is right

  • The two contextual signals of judge identity and outcome taxonomy compose only weakly when forced through a single autoregressive channel.
  • Performance gains from the gated architecture concentrate on the most ambiguous and rarest outcome classes.
  • The architecture supports interpretability by localizing cases where adjudicative context drives predictions through learned judge embeddings and calibration profiles.
  • Differentiable structured composition produces more accurate and parameter-efficient models than prompt-based composition over a substantially larger backbone.
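
To make the parameter-efficiency point concrete, the arithmetic below compares the trainable parameters of one fully fine-tuned weight matrix with those of a LoRA adapter on the same matrix, using the standard LoRA count of `rank * (d_in + d_out)`. The dimensions and rank are hypothetical; the paper's actual parameter counts are not given on this page, so this is a sketch of the mechanism rather than a reproduction of its numbers.

```python
def full_params(d_in: int, d_out: int) -> int:
    """Trainable parameters when the whole d_out x d_in matrix is updated."""
    return d_in * d_out

def lora_params(d_in: int, d_out: int, rank: int) -> int:
    """Trainable parameters of a LoRA adapter: A is rank x d_in, B is d_out x rank."""
    return rank * (d_in + d_out)

# Hypothetical sizes, chosen only for illustration.
d_model, rank = 4096, 16
full = full_params(d_model, d_model)
lora = lora_params(d_model, d_model, rank)
ratio = full / lora
print(full, lora, ratio)
```

With these assumed sizes the adapter trains 128 times fewer parameters for that matrix than full fine-tuning would.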

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same gated modulation approach could extend to other identity-conditioned prediction tasks, such as medical outcome models that adjust for individual clinician tendencies.
  • Fine-grained taxonomy supervision may prove useful in additional multi-task settings where distinct semantic pathways need disentanglement.
  • The demonstrated parameter efficiency suggests the method could support deployment in environments with limited computational resources.

Load-bearing premise

The fine-grained outcome taxonomy supervises the encoder to enforce structural regularization that disentangles distinct semantic pathways between merit-based rulings and technical disposals.
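
One common way to realize this kind of supervision is a joint objective that sums a cross-entropy loss on the coarse outcome (GCO) with a weighted cross-entropy loss on the fine-grained outcome (DCO). The sketch below assumes that form; the weighting `fine_weight` and the function names are hypothetical, and the paper's actual loss is not described on this page.

```python
import math

def softmax(logits):
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy(logits, target: int) -> float:
    return -math.log(softmax(logits)[target])

def multitask_loss(coarse_logits, coarse_target, fine_logits, fine_target, fine_weight=0.5):
    """Assumed joint objective: coarse loss plus a weighted fine-grained loss."""
    coarse = cross_entropy(coarse_logits, coarse_target)
    fine = cross_entropy(fine_logits, fine_target)
    return coarse + fine_weight * fine, coarse, fine

# Uniform logits: the coarse loss is ln(2) and the fine loss is ln(4).
total, coarse, fine = multitask_loss([0.0, 0.0], 0, [0.0, 0.0, 0.0, 0.0], 2)
print(total, coarse, fine)
```

Because both heads share the encoder, gradients from the fine-grained term shape the shared representation, which is the route by which a taxonomy could act as a regularizer.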

What would settle it

Ablating the gated fusion mechanism or the taxonomy supervision and finding that the gains over the generative baselines on the UK Employment Tribunal dataset persist would undercut the claim that structured composition, rather than multi-task learning or LoRA adaptation alone, is what makes the approach superior.

Figures

Figures reproduced from arXiv: 2606.27069 by Felix Steffek, Matthias Grabmair, Stanisław Sójka.

Figure 1: Two routes for identity-conditioned hierarchical prediction. Left: generative supervised fine-tuning setup supplies judge identity as a prompt token and supervises the fine-grained DCO and coarse GCO labels through a single autoregressive output channel. Right: the proposed hybrid architecture routes the same signals through differentiable components: a learned judge embedding fused into label-wise atte…
Figure 2: B2 exceeds the additive expectation on rare/fuzzy classes and uses judge identity more actively than G4. Top: per-class GCO F1 with dashed markers for the additive G-track expectation; B2’s excess concentrates on PartlyWins and Other. Bottom: empirical CDF of mean KL divergence under counterfactual judge swaps (mean KL 0.141 for B2 vs 0.055 for G4).
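
The judge-swap statistic in the Figure 2 caption can be sketched as follows: for each case, compare the predicted outcome distribution under the true judge with the distributions obtained after substituting other judges, and average the KL divergences. The direction of the divergence (original versus swapped), the helper names, and the toy predictor are assumptions; the caption does not specify them.

```python
import math

def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions over the same classes."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q) if pi > 0)

def mean_swap_kl(predict, case, true_judge, other_judges):
    """Average KL between the true-judge prediction and each swapped-judge prediction."""
    base = predict(case, true_judge)
    kls = [kl_divergence(base, predict(case, j)) for j in other_judges]
    return sum(kls) / len(kls)

def toy_predict(case, judge):
    # Hypothetical stand-in for a model: the output depends only on the judge.
    table = {"A": [0.7, 0.2, 0.1], "B": [0.7, 0.2, 0.1], "C": [0.1, 0.2, 0.7]}
    return table[judge]

insensitive = mean_swap_kl(toy_predict, "case-1", "A", ["B"])  # identical outputs
sensitive = mean_swap_kl(toy_predict, "case-1", "A", ["C"])    # outputs change
print(insensitive, sensitive)
```

A model that ignores judge identity yields a mean KL near zero, so a larger value indicates that the conditioning signal actually changes the predictions.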
Figure 3: Structured composition dominates the parameter–accuracy–calibration Pareto frontier. Macro-F1 vs. trainable parameters (log scale); bubble size encodes ECE (larger is worse). B2 leads on all three axes.

… between judges who rule on procedural grounds (e.g., Cluster 0: Default Judgment, Lack of Jurisdiction) and those who decide on the merits (e.g., Cluster 1: Substantive Win/Loss); finer-grained clusters c…
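
For readers unfamiliar with the two metrics named in the Figure 3 caption, the sketch below computes macro-F1 (the unweighted mean of per-class F1) and expected calibration error (ECE) with equal-width confidence bins. The bin count, label names, and toy data are assumptions; the paper's exact evaluation settings are not stated on this page.

```python
def macro_f1(y_true, y_pred, labels):
    """Unweighted mean of per-class F1 scores."""
    scores = []
    for lab in labels:
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == lab and p == lab)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != lab and p == lab)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t == lab and p != lab)
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)

def expected_calibration_error(confidences, correct, n_bins=10):
    """Weighted mean gap between accuracy and confidence over equal-width bins."""
    n = len(confidences)
    bins = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        idx = min(int(conf * n_bins), n_bins - 1)  # confidence 1.0 goes in the last bin
        bins[idx].append((conf, ok))
    ece = 0.0
    for members in bins:
        if not members:
            continue
        avg_conf = sum(c for c, _ in members) / len(members)
        accuracy = sum(1 for _, ok in members if ok) / len(members)
        ece += len(members) / n * abs(accuracy - avg_conf)
    return ece

# Toy data with hypothetical label names.
y_true = ["Wins", "Loses", "Loses", "Other"]
y_pred = ["Wins", "Loses", "Wins", "Other"]
f1 = macro_f1(y_true, y_pred, ["Wins", "Loses", "Other"])
hits = [t == p for t, p in zip(y_true, y_pred)]
ece = expected_calibration_error([0.9, 0.9, 0.9, 0.9], hits)
print(f1, ece)
```

Macro-F1 weights every class equally, which is why gains on rare classes such as Other move it more than they would move plain accuracy.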
Figure 5: Impact of judge-gate influence on model accuracy and DCO composition in D4c. Top: overall accuracy increases monotonically with gate activation quartile. Bottom: DCO distribution shifts across quartiles; categories with high variance are shown individually, stable or rare categories are aggregated as “Other”.

E. Qualitative case studies: G4–B2 rescue cases. To complement the aggregate results, we examine i…
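
The top panel of Figure 5 relates accuracy to gate activation quartiles. A minimal way to produce such a breakdown, assuming one scalar gate activation per case and equal-count quartiles by rank, is sketched below; the function name and toy data are hypothetical and the paper's binning procedure is not described on this page.

```python
def accuracy_by_gate_quartile(gate_values, correct):
    """Sort cases by gate activation, split into four equal-count groups, report accuracy."""
    order = sorted(range(len(gate_values)), key=lambda i: gate_values[i])
    n = len(order)
    results = []
    for q in range(4):
        members = order[q * n // 4:(q + 1) * n // 4]
        if members:
            results.append(sum(1 for i in members if correct[i]) / len(members))
        else:
            results.append(float("nan"))  # fewer than four cases
    return results

# Toy data: eight cases with increasing gate activation.
gates = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8]
correct = [False, False, True, False, True, False, True, True]
quartile_acc = accuracy_by_gate_quartile(gates, correct)
print(quartile_acc)
```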
Original abstract

Legal outcome prediction must disentangle objective case facts from adjudicative context. Merit-based rulings rely on factual evidence while technical disposals may hinge on judicial discretion. We propose a Judge-Aware Gated Multi-Task Learning architecture that explicitly models this distinction. We introduce a fine-grained outcome taxonomy to supervise the encoder, enforcing a structural regularization that disentangles distinct semantic pathways. This granular legal curriculum enables our Gated Fusion mechanism to dynamically modulate reliance on judge identity. We evaluate our approach on 13,937 UK Employment Tribunal decisions. We benchmark our design against supervised fine-tuning (SFT) of a Gemma-4 26B-A4B backbone, in which judge identity and the taxonomy are injected as prompt tokens or autoregressive output targets. The two contextual signals compose only weakly when forced through a single autoregressive channel. In contrast, coupling a LoRA-adapted Gemma-4 encoder with our gated architecture defines a new state of the art on this benchmark while requiring an order of magnitude fewer trainable parameters than the generative SFT baselines, with gains concentrated on the most ambiguous and rarest outcome classes. Beyond accuracy, the architecture is interpretable; learned judge embeddings and calibration profiles localize the cases where adjudicative context drives the prediction. These results indicate that, for identity-conditioned classification of legal outcomes, the choice of conditioning interface dominates scale: differentiable structured composition yields more accurate, more parameter-efficient models than prompt-based composition over a substantially larger backbone.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 1 minor

Summary. The paper introduces a Judge-Aware Gated Multi-Task Learning architecture for legal outcome prediction on 13,937 UK Employment Tribunal decisions. It uses a fine-grained outcome taxonomy to supervise a LoRA-adapted Gemma-4 encoder, enabling a gated fusion mechanism to dynamically modulate reliance on judge identity and disentangle merit-based rulings from technical disposals. The approach is claimed to achieve a new state of the art compared to supervised fine-tuning of larger generative baselines, while using an order of magnitude fewer trainable parameters, with gains concentrated on ambiguous and rare outcome classes; it also provides interpretability via learned judge embeddings and calibration profiles. The central thesis is that the choice of conditioning interface dominates scale for identity-conditioned classification tasks.

Significance. If the empirical results and mechanistic claims hold, the work would demonstrate that structured, differentiable composition of contextual signals (via taxonomy-supervised gating) can outperform prompt-based composition over substantially larger backbones in legal NLP, while improving parameter efficiency and offering interpretability of adjudicative variance. This would strengthen the case for multi-task architectures with explicit structural regularization over scale alone in domains requiring separation of objective facts from discretionary context.

major comments (1)
  1. [Abstract] Abstract (architecture and results paragraphs): The claim that the fine-grained outcome taxonomy 'enforces a structural regularization that disentangles distinct semantic pathways' between merit-based rulings and technical disposals, enabling the gated fusion to modulate judge identity and produce gains on rare classes, is load-bearing for both the performance claims and the interpretability conclusions. No ablations removing the taxonomy (or replacing it with coarse labels), no representation analysis (e.g., of encoder hidden states or attention patterns), and no comparison of gating weights with/without the curriculum are described. If gains arise instead from multi-task learning or LoRA alone, the architecture's novelty and the 'conditioning interface dominates scale' conclusion do not follow.
minor comments (1)
  1. [Abstract] Abstract: No quantitative metrics, statistical tests, error bars, data split details, or exact parameter counts are provided to support the 'new state of the art' and 'order of magnitude fewer trainable parameters' claims, making it difficult to assess the magnitude of improvement.
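
One way the requested error bars could be produced is a paired bootstrap over test cases, which yields a percentile confidence interval for the accuracy difference between two systems. The sketch below is a generic illustration under that assumption, not a procedure taken from the paper; the function name, resample count, and seed are arbitrary.

```python
import random

def paired_bootstrap_ci(correct_a, correct_b, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile CI for accuracy(A) - accuracy(B), resampling test cases in pairs."""
    rng = random.Random(seed)
    n = len(correct_a)
    diffs = []
    for _ in range(n_resamples):
        idx = [rng.randrange(n) for _ in range(n)]  # sample cases with replacement
        acc_a = sum(correct_a[i] for i in idx) / n
        acc_b = sum(correct_b[i] for i in idx) / n
        diffs.append(acc_a - acc_b)
    diffs.sort()
    lower = diffs[int((alpha / 2) * n_resamples)]
    upper = diffs[int((1 - alpha / 2) * n_resamples) - 1]
    return lower, upper

# Toy per-case correctness flags (1 = correct) for two hypothetical systems.
system_a = [1, 1, 1, 0, 1, 1, 0, 1, 1, 1]
system_b = [1, 0, 1, 0, 0, 1, 0, 1, 0, 1]
low, high = paired_bootstrap_ci(system_a, system_b)
print(low, high)
```

If the resulting interval excludes zero, the difference is unlikely to be an artifact of which cases happened to land in the test split.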

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the constructive feedback. The concern about missing ablations for the taxonomy's role is valid and directly impacts the strength of our claims. We will revise the manuscript to include the requested experiments and analyses, which will better isolate the taxonomy's contribution to the gated architecture's performance and interpretability.

Point-by-point responses
  1. Referee: [Abstract] Abstract (architecture and results paragraphs): The claim that the fine-grained outcome taxonomy 'enforces a structural regularization that disentangles distinct semantic pathways' between merit-based rulings and technical disposals, enabling the gated fusion to modulate judge identity and produce gains on rare classes, is load-bearing for both the performance claims and the interpretability conclusions. No ablations removing the taxonomy (or replacing it with coarse labels), no representation analysis (e.g., of encoder hidden states or attention patterns), and no comparison of gating weights with/without the curriculum are described. If gains arise instead from multi-task learning or LoRA alone, the architecture's novelty and the 'conditioning interface dominates scale' conclusion do not follow.

    Authors: We agree this is a substantive gap. The taxonomy is presented as providing structural regularization via the multi-task objective, but without the suggested controls it is difficult to rule out that gains stem primarily from the gated multi-task setup or LoRA adaptation alone. In the revised manuscript we will add: (1) an ablation training with only coarse (binary merit/technical) labels instead of the fine-grained taxonomy; (2) representation analysis comparing encoder hidden-state geometry and attention patterns across taxonomy-supervised vs. unsupervised variants; and (3) gating-weight statistics and curriculum-stage comparisons. These additions will allow us to quantify how much of the rare-class improvement and judge-modulation effect is attributable to the fine-grained supervision. We continue to believe the overall conditioning-interface argument holds, but the new experiments are required to substantiate it rigorously.

    Revision: yes

Circularity Check

0 steps flagged

No circularity: empirical results on held-out data with no derivations or self-referential reductions.

full rationale

The paper reports benchmark comparisons of a gated multi-task architecture against SFT baselines on a fixed dataset of 13,937 decisions, with gains measured on held-out splits. No equations, parameter-fitting derivations, or self-citations appear in the provided text to support the central claims. The taxonomy supervision is introduced as an architectural choice whose effect is asserted to produce disentanglement, but this is not derived from prior results or reduced to a fit; it is evaluated directly via accuracy on rare classes. The work is therefore self-contained against external benchmarks rather than internally forced.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review yields no explicit free parameters, axioms, or invented entities; the taxonomy and gating mechanism are presented as design choices without further decomposition.

pith-pipeline@v0.9.1-grok · 5806 in / 1103 out tokens · 49438 ms · 2026-06-29T04:46:36.164952+00:00 · methodology



Qualitative case studies: G4–B2 rescue cases (excerpts from section E of the paper)

  • Legal evaluation. The dispute combines administrative process, security vetting, race discrimination, disability discrimination, and reasonable adjustments. The facts do not point cleanly to a complete win or complete loss. This is the type of mixed legal evaluation that is naturally represented by claimant - partly wins.
  • G4 collapses the mixed evaluation to a full loss. The generative model predicts both LOSES and substantive loss. That error is plausible if the model focuses on the contested security-clearance dispute as an unsuccessful discrimination claim, but it misses the multiple remedy structure of the case.
  • B2 recovers the partial-outcome structure. B2 predicts PARTLY WINS, aligning with the gold GCO. This is consistent with the aggregate result that B2’s largest gains are on PARTLY WINS. The composite head can preserve a fine-grained partial-success pathway that is easy to lose in a single autoregressive label channel.
  • Implication for the conditioning interface. Since both models use the Gemma backbone, the relevant difference is the conditioning interface. The rescue suggests that LWAN plus multi-task fine-grained supervision helps the model retain mixed remedial structure even when the surface narrative contains strong loss-like elements.

E.2. Rescuing an atypical ...

  • Legal evaluation. The case lies at the boundary between ordinary dismissal litigation, victimisation for a protected act, safeguarding concerns, data governance, and investigation conduct. This combination is not well captured by a simple merits-based win/loss framing.
  • G4 follows the dominant substantive-loss pathway. Given the misconduct narrative and the employer’s loss-of-trust account, a generative model can plausibly map the case to LOSES. But the gold label is OTHER, reflecting the atypical procedural and legal evaluation. In particular, further evidence might be necessary in this case.
  • B2 preserves the minority-class boundary. B2 predicts OTHER, the rarest and most difficult GCO class. This matches the aggregate pattern: B2 improves Other F1 to 0.411, compared with G4’s 0.273.
  • Implication for the composite head. The case illustrates a central advantage of the composite head: label-wise attention and fine-grained supervision can maintain an alternative procedural/outcome pathway even when the surface text contains strong majority-class cues.