pith. machine review for the scientific record.

arxiv: 2605.11712 · v1 · submitted 2026-05-12 · 💻 cs.AI

Recognition: no theorem link

Toward Stable Value Alignment: Introducing Independent Modules for Consistent Value Guidance


Pith reviewed 2026-05-13 06:57 UTC · model grok-4.3

classification 💻 cs.AI
keywords value alignment · LLM safety · independent modules · bridge tokens · stable guidance · transformer architecture · harmful output reduction

The pith

SVGT stabilizes LLM value alignment by isolating normative representations in a dedicated module and steering outputs via bridge tokens.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Large language models typically achieve value alignment through post-training or inference-time steering that directly alters the backbone, but this is unstable: the residual stream is highly dynamic, and values persist there only as fragile, low-dimensional signals. The paper introduces the Stable Value Guidance Transformer (SVGT), which holds stable normative representations in an independent value module, apart from the main model, and converts them into learnable bridge tokens that explicitly anchor and guide the generative process. A sympathetic reader would care because this separation promises consistent value expression across contexts while preserving the backbone's fluency and internal dynamics. Experiments across backbones and safety benchmarks show harmful scores dropping by more than 70 percent.

Core claim

SVGT addresses unstable value alignment by maintaining normative representations in a dedicated value space isolated from the backbone, and by transducing these signals into learnable latent Bridge Tokens that serve as dynamic anchors steering the generative trajectory, ensuring robust adherence without disrupting the backbone's internal representations.

What carries the argument

Independent value module with bridge tokens: a separate space that holds stable value representations and converts them into explicit steering tokens for the generative process.
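
The mechanism can be sketched as extra attention targets for a frozen backbone. Below is a toy single-head version in pure Python; the tensors, the learned bridge key/value, and the injection point are all illustrative assumptions, not the paper's code:

```python
import math

def softmax(xs):
    m = max(xs)
    es = [math.exp(x - m) for x in xs]
    s = sum(es)
    return [e / s for e in es]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def attend(query, keys, values):
    """Single-head attention: softmax(q.k / sqrt(d))-weighted sum of values."""
    scale = 1.0 / math.sqrt(len(query))
    w = softmax([dot(query, k) * scale for k in keys])
    d = len(values[0])
    return [sum(wi * v[i] for wi, v in zip(w, values)) for i in range(d)]

# Frozen-backbone context (task tokens) ...
task_keys = [[1.0, 0.0], [0.0, 1.0]]
task_vals = [[0.2, 0.0], [0.0, 0.2]]
# ... plus one learned bridge token acting as an extra attention target.
bridge_keys = [[2.0, 2.0]]   # high affinity with the query -> strong pull
bridge_vals = [[0.0, 1.0]]   # carries the value-aligned direction

q = [1.0, 1.0]
baseline = attend(q, task_keys, task_vals)
steered = attend(q, task_keys + bridge_keys, task_vals + bridge_vals)
# The steered output is pulled toward the bridge token's value vector
# while the task keys/values themselves are left untouched.
```

Steering here works purely through attention: the output moves toward `bridge_vals` without any edit to the task keys or values, which is the sense in which the backbone's internal representations would be preserved.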

If this is right

  • Harmful output scores fall by more than 70 percent on safety benchmarks while fluency is preserved.
  • Value guidance remains consistent across diverse contexts without altering backbone parameters.
  • The same architecture applies to multiple model backbones with comparable gains.
  • Alignment becomes an add-on module rather than an integrated training step.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Value modules could be exchanged or updated independently for different ethical priorities on the same backbone.
  • The separation might make it easier to diagnose and correct alignment failures in deployed systems.
  • The pattern suggests extending modular guidance to multi-turn or multi-agent settings where stability is critical.

Load-bearing premise

Value signals can be isolated in a dedicated module and then successfully transduced into tokens that reliably influence the backbone's behavior.

What would settle it

Ablation experiments in which the independent module or bridge tokens are removed yet harmful scores remain unchanged on the same safety benchmarks would falsify the stability claim.
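
As a bookkeeping sketch of that falsification test (all scores invented for illustration): the stability claim survives only if removing the bridge tokens moves harmful scores back toward the unaligned baseline.

```python
# Hypothetical mean harmful scores per condition; lower is safer.
harm = {"baseline": 0.50, "svgt_full": 0.12, "svgt_no_bridge": 0.48}

def bridge_is_load_bearing(harm, tol=0.05):
    """True if ablating the bridge tokens degrades safety back toward
    baseline. If the ablated model stays as safe as full SVGT, the
    tokens were not doing the work and the stability claim fails."""
    return harm["svgt_no_bridge"] - harm["svgt_full"] > tol
```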

Figures

Figures reproduced from arXiv: 2605.11712 by Guojie Song, Shengyuan Bai, Sirui Sun, Wenhao Chen.

Figure 1. Conceptual illustration of our work. Dominant task signals (orange) in the residual stream often distort value representations, leading to misalignment (dashed). Our independent module provides stable guidance (solid), steering generation back to alignment despite task noise.
Figure 2. SVGT architecture. SVGT decouples value alignment from task-driven generation through a two-stage transformation: (1) Value Space Construction extracts stable, context-aware value signals z and computes directional corrections ∆z; (2) the Latent Value Bridge (LVB) transduces these abstract corrections into bridge tokens B that serve as attention targets for the frozen backbone, enabling dynamic and robust steering.
Figure 3. Overview of the SVGT training stages. The approach follows a progressive curriculum: basic value perception (Stage 1), context-aware understanding (Stage 2), and training of the Latent Value Bridge (Stage 3) to convert value signals into active guidance for the generative manifold. The LVB operates dynamically, re-encoding the value state z_t at each decoding step.
Figure 4. Evolution of latent harmful scores during generation. Thin lines show individual trajectories from 5 adversarial prompts; the bold line indicates their average. Left: the baseline remains in a high-risk region driven by adversarial prompts. Right: SVGT progressively steers the trajectory toward safer regions via guidance, demonstrating effective real-time correction.
Figure 5. Computational overhead of SVGT on Llama-3.2-3B, comparing baseline generation against SVGT with different bridge-token refresh intervals (r). Memory overhead is minimal (+3%); total latency increases moderately (+52–65%). Efficiency remains robust across refresh intervals r ∈ [1, 10], supporting flexible deployment.
Figure 7. Multi-dimensional comparison of alignment paradigms, synthesizing the empirical findings across six key dimensions. Despite the throughput latency, SVGT (orange) demonstrates superior balance, particularly in the trade-off between safety enforcement and capability preservation.
Figure 6. Sensitivity analysis of bridge-token momentum β. Safety performance is optimized when the EMA momentum β is between 0.6 and 0.8; this intermediate range allows bridge tokens to accumulate stable value signals while adapting to the evolving context during generation. Extreme values (β = 0 or 1) lead to either excessive guidance noise or a failure to rectify context drift.
Figure 8. Impact of bridge-token count (K) on safety and fluency, evaluated on Llama-3.2-3B. K ∈ [5, 10] provides an optimal trade-off: too few tokens limit guidance expressiveness, while exceeding 15 tokens introduces redundant noise that marginally increases perplexity. Error bars represent standard deviation across 3 random seeds.
Figure 9. Value discrimination accuracy as a function of the extraction layer position (l*). Experiments on GPT-2 show that safety-relevant semantic features are most discernible in the middle-to-late layers (normalized index 0.5–0.8); extracting from very early layers (syntactic) or final layers (too task-specific) leads to sub-optimal alignment performance.
Figure 10. Step-level training dynamics for Stage 3 (Value-Guided Generation) on Llama-3.2-3B. The plots show the evolution of the multi-objective loss components across 27,000 steps; raw values (light gray) and smoothed trajectories (colored) show rapid convergence. The cross-entropy loss remains stable while the regularization loss descends immediately.
Figure 11. Epoch-level convergence and generalization for Stage 3, comparing training (blue, circles) and validation (orange, squares) metrics over five epochs. The minimal gap between training and validation total loss, alongside the order-of-magnitude reduction in safety loss, validates the generalizability of the pre-trained value space and the efficacy of bridge tokens in capturing normative signals.
Figure 12. Logit distribution shift across generation steps. Left: intermittent spikes in the KL divergence D_KL(P_guide ∥ P_base) reveal that SVGT exerts adaptive, high-energy steering at critical decision junctions rather than applying a static bias. Right: the staircase-like growth of cumulative KL divergence demonstrates the persistent accumulation of alignment energy to overcome adversarial value inertia.
Figure 13. Probability lift (∆P) for top-5 tokens at different stages. At step 1, the bridge tokens prioritize structural control by uplifting pronouns and termination signals, e.g. ' I' (∆P = 0.457) and the termination token '<|eot_id|>'. By step 20, the focus shifts to normative justification, elevating safety-descriptive adjectives like "dangerous" to anchor the explanation to the safety manifold.
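
Figure 6's momentum sweep is consistent with a standard exponential moving average over decoding steps. The update rule below (B ← βB + (1−β)z) is a guess at that form based on the caption, not taken from the paper:

```python
def ema_update(bridge, z, beta):
    """One decoding-step update: the bridge token accumulates the current
    value signal z with momentum beta. beta=0 tracks only the latest,
    noisy signal; beta=1 never adapts, failing to rectify context drift."""
    return [beta * b + (1.0 - beta) * zi for b, zi in zip(bridge, z)]

bridge = [0.0, 0.0]
for _ in range(10):                       # ten decoding steps, constant signal
    bridge = ema_update(bridge, [1.0, 0.0], beta=0.7)
# With beta=0.7 the bridge has absorbed most of the signal after ten steps.
```

The intermediate β range in the caption corresponds to the usual EMA trade-off between smoothing out step-to-step noise and keeping up with a drifting context.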
read the original abstract

Aligning large language models (LLMs) with human values typically relies on post-training or inference-time steering that directly manipulates the backbone's parameters or representation space. However, a critical gap exists: the model's residual stream is highly dynamic, in which values exist as fragile, low-dimensional properties, inherently incompatible with the stability required for consistent value expression. In this paper, we propose the Stable Value Guidance Transformer (SVGT), which addresses this gap through an independent value module incorporating two key designs: (1) independent value modeling, maintaining normative representations in a dedicated value space isolated from the backbone, and (2) explicit behavioral guidance, transducing these stable signals into learnable latent Bridge Tokens. These tokens serve as dynamic value anchors to explicitly steer the generative trajectory, ensuring robust adherence across diverse contexts without disrupting the backbone's internal representations. Experiments across multiple backbones and safety benchmarks show that SVGT generally reduces harmful scores by over 70% while maintaining generation fluency, demonstrating the efficacy of architecturally grounded value modeling. Our code is available at https://github.com/Clervils/SVGT.git.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

2 major / 1 minor

Summary. The paper proposes the Stable Value Guidance Transformer (SVGT), which augments LLMs with an independent value module that maintains normative representations in a dedicated space isolated from the backbone residual stream and transduces them via learnable Bridge Tokens to steer generation. It claims this yields over 70% reduction in harmful scores across multiple backbones and safety benchmarks while preserving fluency, addressing the fragility of values in dynamic residual streams.

Significance. If the empirical results hold under scrutiny, the work would offer a meaningful architectural alternative to post-training or direct steering methods for value alignment, potentially enabling more consistent normative guidance without altering backbone parameters. The public code release aids reproducibility and allows direct testing of the independent-module design.

major comments (2)
  1. [Abstract] The central claim of 'generally reduces harmful scores by over 70%' is presented without any description of baselines, metrics, statistical tests, ablation controls, or variance across runs, rendering the efficacy of the independent value module and Bridge Tokens unverifiable from the provided text.
  2. Method description (Bridge Tokens): the architecture isolates normative representations but then explicitly inserts Bridge Tokens into the residual stream to steer trajectories; no measurement, ablation, or analysis is described that demonstrates these tokens maintain stable influence rather than being diluted or overwritten by subsequent dynamics, which directly undermines attribution of the reported harm reduction to architectural grounding.
minor comments (1)
  1. [Abstract] The abstract refers to 'multiple backbones and safety benchmarks' without naming them or providing even high-level dataset statistics, which reduces the reader's ability to assess generalizability.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We appreciate the referee's constructive feedback on our manuscript. Below, we provide point-by-point responses to the major comments and outline the revisions we will make to enhance clarity and rigor.

read point-by-point responses
  1. Referee: [Abstract] The central claim of 'generally reduces harmful scores by over 70%' is presented without any description of baselines, metrics, statistical tests, ablation controls, or variance across runs, rendering the efficacy of the independent value module and Bridge Tokens unverifiable from the provided text.

    Authors: We agree that the abstract lacks sufficient detail to fully substantiate the claim. The main text (Section 4) specifies the baselines (vanilla and safety-aligned LLMs), metrics (harm scores from established benchmarks), statistical tests, and reports mean results with standard deviations over multiple runs. We will revise the abstract to concisely include references to these elements, such as 'across multiple backbones and benchmarks with averaged results over runs,' to improve verifiability while adhering to length constraints. revision: yes

  2. Referee: [—] Method description (Bridge Tokens): the architecture isolates normative representations but then explicitly inserts Bridge Tokens into the residual stream to steer trajectories; no measurement, ablation, or analysis is described that demonstrates these tokens maintain stable influence rather than being diluted or overwritten by subsequent dynamics, which directly undermines attribution of the reported harm reduction to architectural grounding.

    Authors: We recognize the importance of demonstrating the stability of Bridge Tokens' influence. Although the current manuscript focuses on end-to-end performance improvements, it does not include dedicated ablations for temporal stability or dilution effects. In the revised version, we will incorporate new experiments and analyses, including step-wise influence measurements and ablations removing Bridge Tokens mid-generation, to empirically show their sustained impact and support the architectural claims. revision: yes
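
A step-wise influence measurement of the kind promised here could be as simple as logging, at each decoding step, the KL divergence between next-token distributions with and without the bridge tokens, matching Figure 12's D_KL(P_guide ∥ P_base). The distributions below are invented for illustration:

```python
import math

def kl(p, q):
    """KL divergence D_KL(p || q) over a discrete next-token distribution."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

# Hypothetical next-token distributions at one decoding step.
p_guided = [0.7, 0.2, 0.1]   # bridge tokens attached
p_base   = [0.2, 0.3, 0.5]   # bridge tokens ablated mid-generation
step_influence = kl(p_guided, p_base)
# A trajectory of step_influence values collapsing toward zero would
# indicate the tokens' influence being diluted by subsequent dynamics.
```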

Circularity Check

0 steps flagged

No significant circularity detected in architectural proposal or claims

full rationale

The paper proposes a new SVGT architecture consisting of an independent value module and Bridge Tokens to isolate and transduce value signals. Its central claims rest on this design choice plus empirical experiments across backbones showing harm reduction. No derivation reduces a result to its own inputs by construction, no parameters are fitted to a subset and then called predictions, and no load-bearing self-citations or uniqueness theorems appear in the text. The residual-stream fragility premise is stated as motivation rather than derived from the method itself, and the 70% reduction figure is presented as an experimental outcome rather than a mathematical identity. The work is therefore self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 2 invented entities

The central claim rests on the domain assumption that value properties are inherently unstable in the residual stream and on the introduction of two new architectural entities without external validation beyond the reported experiments.

axioms (1)
  • domain assumption LLM residual streams are highly dynamic and values exist as fragile low-dimensional properties incompatible with stable expression
    Explicitly stated as the critical gap motivating the work.
invented entities (2)
  • independent value space no independent evidence
    purpose: maintain normative representations isolated from the backbone
    New dedicated space introduced to solve the stability problem
  • Bridge Tokens no independent evidence
    purpose: transduce stable value signals into dynamic anchors that steer the generative trajectory
    New token type postulated as the mechanism for explicit guidance
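
The ledger's fragility premise is the picture behind steering vectors: a value reads out along a low-dimensional direction in activation space. A toy difference-of-means construction (all activations invented) makes the premise concrete:

```python
def mean(vectors):
    n = len(vectors)
    return [sum(v[i] for v in vectors) / n for i in range(len(vectors[0]))]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

# Toy residual-stream activations for safe vs. harmful contexts.
safe    = [[0.9, 0.1], [1.1, -0.1]]
harmful = [[-1.0, 0.2], [-0.8, 0.0]]

# One low-dimensional "value direction": the difference of class means.
direction = [a - b for a, b in zip(mean(safe), mean(harmful))]

# Projection onto the direction separates the two classes in this toy data.
scores_safe    = [dot(h, direction) for h in safe]
scores_harmful = [dot(h, direction) for h in harmful]
```

On the paper's premise, such a direction is easily distorted by dominant task signals in the same stream, which is what the dedicated value space is introduced to avoid.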

pith-pipeline@v0.9.0 · 5496 in / 1244 out tokens · 69549 ms · 2026-05-13T06:57:35.718724+00:00 · methodology

