pith. machine review for the scientific record. sign in

arxiv: 2605.03609 · v1 · submitted 2026-05-05 · 💻 cs.AI · cs.LG

Recognition: unknown

Where Paths Split: Localized, Calibrated Control of Moral Reasoning in Large Language Models

Authors on Pith no claims yet

Pith reviewed 2026-05-07 16:40 UTC · model grok-4.3

classification 💻 cs.AI cs.LG
keywords moral reasoningLLM steeringinference-time controlethical frameworksutilitarianismdeontologyresidual stream editingpreference calibration
0
0 comments X

The pith

Localized edits at convergence points inside LLMs steer moral reasoning toward a chosen ethical framework while preserving general abilities.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper sets out to show that moral preferences in large language models can be calibrated at inference time by locating and editing the specific internal points where pathways for different ethical frameworks first come together and then split. It combines routing to shut off non-target pathways with a low-norm adjustment drawn from spatial pattern analysis on the residual stream to set the balance between utilitarian and deontological styles. A reader would care because inconsistent moral stances are a practical barrier to using LLMs in high-stakes settings, and an interpretable, training-free method for consistent control could make outputs more predictable without sacrificing other skills.

Core claim

Convergent-Divergent Routing traces minimal branch points inside transformer blocks where ethical-framework-related pathways converge and diverge; gating the non-target branches at these loci increases targeted reasoning, while Dual Logit Calibration applies a closed-form minimum-l2-norm update in the two-dimensional subspace extracted by Common Spatial Patterns so that the directional projections match user-specified preference weights between utilitarian and deontological frameworks.

What carries the argument

Convergent-Divergent Routing at traced branch points inside transformer blocks, paired with CSP-derived directions and Dual Logit Calibration in the residual stream to align projections with preference weights.

If this is right

  • The routing intervention alone increases targeted ethical-framework reasoning.
  • Preference calibration is achieved reliably across real-life moral dilemmas.
  • General capabilities are largely preserved relative to recent baselines.
  • The method supplies an interpretable control mechanism.
  • It outperforms recent baselines on the joint goal of calibration and capability retention.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • If the branch points prove stable across model families, the same loci could be reused to control other value-laden preferences beyond ethics.
  • Dynamic switching between ethical modes becomes feasible by changing the target weights per query without reloading the model.
  • Inspecting these split points across training checkpoints might expose when and how models first acquire inconsistent moral stances.

Load-bearing premise

The traced branch points are the minimal loci where ethical-framework pathways converge and diverge, and the two CSP directions genuinely separate utilitarian from deontological reasoning.

What would settle it

Applying the routing and calibration leaves moral preference scores on held-out dilemmas unchanged while non-moral capability scores remain the same or improve.

Figures

Figures reproduced from arXiv: 2605.03609 by Chenchen Yuan, Gjergji Kasneci, Zheyu Zhang.

Figure 1
Figure 1. Figure 1: An Example of Binary Control between Utilitarianism and Deontology. (αU , αD) denotes the weights for utilitarianism and deontology. value-sensitive assistance demands fine-grained, calibrated control over the model’s ethical behav￾ior, beyond a binary on/off notion of alignment. Existing approaches for behavior control typically rely on prompt engineering (Zhou et al., 2022) or steering vector addition wi… view at source ↗
Figure 2
Figure 2. Figure 2: Fine-Grained Control. This figure plots the observed Uop at each control level αU . Boxes show the interquartile range, and center lines indicate medians. Ablation Study. We conduct ablation study un￾der three settings: (i) Explicit-Prompt Representa￾tions (EPRM): we extract residual streams elicited by prompts that explicitly cue each ethical frame￾work (BL-PRS style), and steer at each branch points; (ii… view at source ↗
Figure 3
Figure 3. Figure 3: Representation separation at Layer 17 in Llama. (a) PCA reveals modest clustering of utilitarian and deontological representations. (b) Projection onto paired contrastive directions yields sharper separation. high U¯ op with little sensitivity to αU , except that PLS-DA reaches 0.31 pp at αU=0 and 22.12 pp at αU=20%. On Vicuna, our method again domi￾nates, reaching 0.00 pp at αU=0 and staying near the targ… view at source ↗
Figure 4
Figure 4. Figure 4: Predictive Performance of Attention Heads for Deontology in Llama. Rows (y-axis) index layers from bottom (closest to the input) to top (closest to the output); columns (x-axis) index heads within each layer, ordered in descending Spearman rank correlation. 5 Conclusion We studied inference-time moral steering and intro￾duced Convergent-Divergent Routing, which local￾izes edits to ethical-framework branch … view at source ↗
Figure 5
Figure 5. Figure 5: Predictive Performance of Attention Heads for Deontology and Utilitarianism across All Layers and Attention Heads of Vicuna and Yi-1.5. Rows (y-axis) index layers (bottom→top); columns (x-axis) index heads within each layer, ordered by Spearman rank correlation. αU (%) Model 100 90 80 70 60 50 40 30 20 10 0 Llama 0.077 0.034 0.014 0.008 0.008 0.006 0.010 0.008 0.010 0.006 0.006 Vicuna 0.507 0.257 0.061 0.0… view at source ↗
Figure 6
Figure 6. Figure 6: Predictive Performance of Attention Heads for Utilitarianism in Llama. G.5 Correlation between Attention Heads and Ethical Frameworks view at source ↗
Figure 7
Figure 7. Figure 7: Representation Separation at Layer 7 in Llama. (a) PCA reveals modest clustering of utilitarian and deontological representations. (b) Projection onto paired contrastive directions yields sharper separation. Llama Yi-1.5 Vicuna Intersection Utilitarianism Deontology Intersection Utilitarianism Deontology Intersection Utilitarianism Deontology 0 2833 5667 8500 11334 14167 Sample Count 11740 11089 10720 1022… view at source ↗
Figure 8
Figure 8. Figure 8 view at source ↗
Figure 9
Figure 9. Figure 9: Prompt Template for Coherence and Fluency Assessment. 22 view at source ↗
Figure 10
Figure 10. Figure 10: Fixed Prompt for Moral Reasoning. Prompt-Only Baseline **Your Task** Assess whether the narrator’s/ my behaviour in *Scenario* is *Morally Acceptable* or *Morally Unacceptable*. **Thinking Bias (use during reasoning)** - Deontology weight: [100 ∗ αD] % - Utilitarianism weight: [100 ∗ αU ]% Break close cases in favor of the higher-weight framework. Do not mention weights in the output. **Instructions** [Th… view at source ↗
Figure 11
Figure 11. Figure 11: Prompt Template for Prompt-Only Baseline. BL-PRS Paired-Direction Extraction **Your Task** From a [Specific ethical framework] perspective, assess whether the narrator’s/ my behaviour in *Scenario* is *Morally Acceptable* or *Morally Unacceptable*. **Instructions** [The following content is the same as in view at source ↗
Figure 12
Figure 12. Figure 12: Prompt Template for BL-PRS Paired-Direction Extraction. 23 view at source ↗
Figure 13
Figure 13. Figure 13: Evaluation of General Capabilities. We evaluate general capabilities on out-of-domain benchmarks, i.e. GSM8K (8-shot) (Cobbe et al., 2021), TriviaQA (5-shot) (Joshi et al., 2017) and two translation tasks including wmt14-fr-en and wmt14-en-fr (0-shot) (Bojar et al., 2014), using the lm-evaluation-harness framework (Gao et al., 2024). 24 view at source ↗
read the original abstract

Large language models often display heterogeneous moral preferences across settings. We study inference-time steering toward a desired ethical framework while preserving general competence. We present Convergent-Divergent Routing, which traces and edits minimal branch points inside transformer blocks where ethical-framework-related pathways first converge and then diverge. Gating non-target branches at these loci blocks the downstream propagation while leaving upstream computations intact. We find that this intervention alone increases targeted ethical-framework reasoning. To achieve fine-grained control, we adapt Common Spatial Patterns to the residual stream and extract, for each branch-point layer, a pair of directions that discriminate between utilitarian and deontological frameworks. We then introduce Dual Logit Calibration, a closed-form, minimum-$\ell_2$-norm update that moves the residual within this two-dimensional subspace so the resulting directional projections align with user-specified preference weights. Experiments on real-life moral dilemmas show that our method reliably achieves preference calibration and largely preserves general capabilities, outperforming recent baselines while providing an interpretable mechanism.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 0 minor

Summary. The manuscript introduces Convergent-Divergent Routing to trace minimal branch points inside transformer blocks where ethical-framework pathways first converge and then diverge. Gating non-target branches at these loci is proposed to steer toward a target ethical framework while leaving upstream computations intact. The method adapts Common Spatial Patterns to the residual stream to extract a pair of directions per branch-point layer that discriminate utilitarian from deontological reasoning, followed by Dual Logit Calibration, a closed-form minimum-ℓ₂-norm update that shifts the residual within this subspace to match user-specified preference weights. Experiments on real-life moral dilemmas are claimed to demonstrate reliable preference calibration, largely preserved general capabilities, and outperformance relative to recent baselines.

Significance. If the central claims hold, the work offers a localized, interpretable inference-time steering mechanism for moral reasoning that avoids full retraining and preserves upstream model computations. The closed-form Dual Logit Calibration and the use of CSP for direction extraction constitute technical strengths that could support reproducibility and extension to other controllable behaviors. Such targeted interventions, if causally grounded, would be relevant to AI alignment research.

major comments (2)
  1. [Abstract and §3] Abstract and §3 (Convergent-Divergent Routing and CSP adaptation): the claim that the traced branch points constitute the minimal loci where ethical-framework pathways converge and diverge, and that the two CSP-derived directions in the residual stream genuinely discriminate utilitarian from deontological reasoning, rests on activation differences rather than exhaustive causal interventions such as activation patching or path ablation. If these directions instead encode correlated features (response length, lexical choice, or prompt framing), the gating and Dual Logit Calibration would achieve calibration through unintended side channels, undermining both the interpretability claim and the assertion that general capabilities remain intact because upstream computations are untouched.
  2. [§5] §5 (Experiments): the abstract asserts that the method 'reliably achieves preference calibration and largely preserves general capabilities, outperforming recent baselines,' yet the manuscript supplies no quantitative metrics, baseline implementations, error bars, statistical tests, or capability benchmarks (e.g., MMLU or GSM8K scores before/after intervention). Without these details the empirical support for the central claims cannot be evaluated.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive and detailed comments. We respond point by point below, acknowledging where additional evidence or clarification is needed and indicating the revisions we will make.

read point-by-point responses
  1. Referee: [Abstract and §3] Abstract and §3 (Convergent-Divergent Routing and CSP adaptation): the claim that the traced branch points constitute the minimal loci where ethical-framework pathways converge and diverge, and that the two CSP-derived directions in the residual stream genuinely discriminate utilitarian from deontological reasoning, rests on activation differences rather than exhaustive causal interventions such as activation patching or path ablation. If these directions instead encode correlated features (response length, lexical choice, or prompt framing), the gating and Dual Logit Calibration would achieve calibration through unintended side channels, undermining both the interpretability claim and the assertion that general capabilities remain intact because upstream computations are untouched.

    Authors: We agree that activation differences provide correlational rather than exhaustive causal evidence. The branch points were located via layer-wise analysis of residual-stream divergence between ethical frameworks, with CSP applied to extract maximally discriminative directions in the two-dimensional subspace. To strengthen the causal grounding, the revised manuscript will include activation patching and targeted path ablation at the identified loci. We will also add controls that explicitly test for encoding of response length, lexical choice, and prompt framing. The current experiments already show that general capabilities remain largely intact after intervention, which would be inconsistent with reliance on such side channels, but we will make the supporting analyses more explicit. revision: partial

  2. Referee: [§5] §5 (Experiments): the abstract asserts that the method 'reliably achieves preference calibration and largely preserves general capabilities, outperforming recent baselines,' yet the manuscript supplies no quantitative metrics, baseline implementations, error bars, statistical tests, or capability benchmarks (e.g., MMLU or GSM8K scores before/after intervention). Without these details the empirical support for the central claims cannot be evaluated.

    Authors: Section 5 reports results on preference calibration accuracy, outperformance relative to baselines, and preservation of general capabilities. However, we acknowledge that the presentation of quantitative details, error bars, statistical tests, and explicit benchmark scores (MMLU, GSM8K) was insufficiently detailed. The revised version will expand §5 to include all requested metrics, baseline implementation descriptions, error bars, statistical significance tests, and pre-/post-intervention benchmark scores. revision: yes

Circularity Check

0 steps flagged

No load-bearing circularity; method is a defined interventional procedure validated by experiment

full rationale

The derivation chain consists of tracing branch points, applying CSP to extract discriminating directions in the residual stream, and performing a closed-form minimum-ℓ2-norm update to align projections with target weights. None of these steps reduce by construction to a quantity defined from the same paper's fitted outputs or self-citations; the calibration is achieved by explicit construction of the update rule rather than by renaming or predicting a fitted value. Experiments on moral dilemmas are presented as external validation rather than tautological confirmation. This matches the default expectation of a self-contained method with no significant circularity.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 2 invented entities

The claim rests on the assumption that moral reasoning pathways can be localized to identifiable branch points and that CSP directions capture framework differences; no free parameters are explicitly fitted in the abstract description.

axioms (2)
  • domain assumption Ethical frameworks correspond to distinct pathways that converge and then diverge at minimal branch points inside transformer blocks.
    Invoked to justify the Convergent-Divergent Routing intervention.
  • domain assumption Common Spatial Patterns applied to the residual stream can extract directions that discriminate utilitarian from deontological reasoning.
    Used to define the two-dimensional subspace for calibration.
invented entities (2)
  • Convergent-Divergent Routing no independent evidence
    purpose: Trace and edit minimal branch points to block non-target ethical pathways.
    Newly introduced mechanism for localized intervention.
  • Dual Logit Calibration no independent evidence
    purpose: Closed-form minimum-l2-norm update to align residual projections with user preference weights.
    New calibration procedure introduced in the paper.

pith-pipeline@v0.9.0 · 5475 in / 1382 out tokens · 49260 ms · 2026-05-07T16:40:41.865898+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Reference graph

Works this paper leans on

12 extracted references · 5 canonical work pages · 2 internal anchors

  1. [1]

    Matthew Barker and William Rayens

    Out of one, many: Using language mod- els to simulate human samples.Political Analysis, 31(3):337–351. Matthew Barker and William Rayens. 2003. Partial least squares for discrimination.Journal of Chemometrics: A Journal of the Chemometrics Society, 17(3):166– 173. Benjamin Blankertz, Ryota Tomioka, Steven Lemm, Motoaki Kawanabe, and Klaus-Robert Muller. 2...

  2. [2]

    Runjin Chen, Andy Arditi, Henry Sleight, Owain Evans, and Jack Lindsey

    Structured moral reasoning in language mod- els: A value-grounded evaluation framework.arXiv preprint arXiv:2506.14948. Runjin Chen, Andy Arditi, Henry Sleight, Owain Evans, and Jack Lindsey. 2025. Persona vectors: Monitoring and controlling character traits in language models. arXiv preprint arXiv:2507.21509. Wei-Lin Chiang, Zhuohan Li, Zi Lin, Ying Shen...

  3. [3]

    Training Verifiers to Solve Math Word Problems

    Training verifiers to solve math word prob- lems.CoRR, abs/2110.14168. Junchen Ding, Penghao Jiang, Zihao Xu, Ziqi Ding, Yichen Zhu, Jiaojiao Jiang, and Yuekang Li. 2025. " pull or not to pull?”: Investigating moral biases in leading large language models across ethical dilem- mas.arXiv preprint arXiv:2508.07284. Rohit K Dubey, Damian Dailisan, and Sachit...

  4. [4]

    InProceed- ings of the 2024 Conference on Empirical Methods in Natural Language Processing, EMNLP 2024, Miami, FL, USA, November 12-16, 2024, pages 1437–1454

    Controllable preference optimization: Toward controllable multi-objective alignment. InProceed- ings of the 2024 Conference on Empirical Methods in Natural Language Processing, EMNLP 2024, Miami, FL, USA, November 12-16, 2024, pages 1437–1454. Association for Computational Linguistics. Wes Gurnee and Max Tegmark. 2024. Language mod- els represent space an...

  5. [5]

    In9th International Conference on Learning Representa- tions, ICLR 2021, Virtual Event, Austria, May 3-7,

    Aligning AI with shared human values. In9th International Conference on Learning Representa- tions, ICLR 2021, Virtual Event, Austria, May 3-7,

  6. [6]

    OpenReview.net. Zhijing Jin, Max Kleiman-Weiner, Giorgio Piatti, Syd- ney Levine, Jiarui Liu, Fernando Gonzalez Adauto, Francesco Ortu, András Strausz, Mrinmaya Sachan, Rada Mihalcea, Yejin Choi, and Bernhard Schölkopf

  7. [7]

    GPT-4o System Card

    Language model alignment in multilingual trolley problems. InThe Thirteenth International Conference on Learning Representations, ICLR 2025, Singapore, April 24-28, 2025. OpenReview.net. Mandar Joshi, Eunsol Choi, Daniel S Weld, and Luke Zettlemoyer. 2017. Triviaqa: A large scale distantly supervised challenge dataset for reading comprehen- sion. InProcee...

  8. [8]

    Neha Srikanth, Jordan Boyd-Graber, and Rachel Rudinger

    Ethical reasoning over moral alignment: A case and framework for in-context ethical policies in llms. InFindings of the Association for Compu- tational Linguistics: EMNLP 2023, pages 13370– 13388. Nina Rimsky, Nick Gabrieli, Julian Schulz, Meg Tong, Evan Hubinger, and Alexander Matt Turner. 2024. Steering llama 2 via contrastive activation addition. InPro...

  9. [9]

    deontology

    merges safety- and helpfulness-specialized experts with a dynamic MoE router, achieving dual-preference optimization of safety and helpful- ness. To avoid maintaining several copies, Panacea (Zhong et al., 2024) proposes to embed the prefer- ence vector as singular values in SVD-based LoRA. Representation-based control shows that linear di- rections in hi...

  10. [10]

    Fluency":

    Fluency 2) Coherence **Definitions** - Fluency: Grammatical correctness, natural word choice, and ease of reading at the sentence level. A highly fluent text has no or almost no grammatical errors, unnatural phrasing, or awkward wording. - Coherence: How well the sentences and ideas fit together as a whole. A highly coherent text has a clear overall struc...

  11. [11]

    *Verdict* – must be *Morally Acceptable* or *Morally Unacceptable*

  12. [12]

    From a [ethical framework (deontology or utilitarianism)] perspective,

    *Brief Reason* – *one sentence* that clearly states the *main ethical framework (deontology or utilitarianism)* you rely on to assess the narrator’s/ my behaviour in *Scenario*. This sentence *must start with*: “From a [ethical framework (deontology or utilitarianism)] perspective,” **Important** - Your output *must* strictly follow the exact *Output Form...