pith. machine review for the scientific record.

arxiv: 2605.11712 · v1 · submitted 2026-05-12 · 💻 cs.AI

Recognition: no theorem link

Toward Stable Value Alignment: Introducing Independent Modules for Consistent Value Guidance


Pith reviewed 2026-05-13 06:57 UTC · model grok-4.3

classification 💻 cs.AI
keywords value alignment · LLM safety · independent modules · bridge tokens · stable guidance · transformer architecture · harmful output reduction

The pith

SVGT stabilizes LLM value alignment by isolating normative representations in a dedicated module and steering outputs via bridge tokens.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Large language models typically achieve value alignment through post-training or inference-time steering that directly alters the backbone, but this is unstable: the residual stream is highly dynamic, and values persist there only as fragile, low-dimensional signals. The paper introduces the Stable Value Guidance Transformer (SVGT), which holds stable normative representations in an independent value module, apart from the main model, and converts them into learnable bridge tokens that explicitly anchor and guide the generative process. A sympathetic reader would care because this separation promises consistent value expression across contexts while preserving the backbone's fluency and internal dynamics. Experiments across backbones and safety benchmarks show harmful scores dropping by more than 70 percent.

Core claim

SVGT addresses unstable value alignment by maintaining normative representations in a dedicated value space isolated from the backbone, and by transducing these signals into learnable latent Bridge Tokens that serve as dynamic anchors steering the generative trajectory, ensuring robust adherence without disrupting the backbone's internal representations.

What carries the argument

Independent value module with bridge tokens: a separate space that holds stable value representations and converts them into explicit steering tokens for the generative process.
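
The mechanism can be sketched as extra attention targets for a frozen backbone. Below is a toy single-head version in pure Python; the tensors, the learned bridge key/value, and the injection point are all illustrative assumptions, not the paper's code:

```python
import math

def softmax(xs):
    m = max(xs)
    es = [math.exp(x - m) for x in xs]
    s = sum(es)
    return [e / s for e in es]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def attend(query, keys, values):
    """Single-head attention: softmax(q.k / sqrt(d))-weighted sum of values."""
    scale = 1.0 / math.sqrt(len(query))
    w = softmax([dot(query, k) * scale for k in keys])
    d = len(values[0])
    return [sum(wi * v[i] for wi, v in zip(w, values)) for i in range(d)]

# Frozen-backbone context (task tokens) ...
task_keys = [[1.0, 0.0], [0.0, 1.0]]
task_vals = [[0.2, 0.0], [0.0, 0.2]]
# ... plus one learned bridge token acting as an extra attention target.
bridge_keys = [[2.0, 2.0]]   # high affinity with the query -> strong pull
bridge_vals = [[0.0, 1.0]]   # carries the value-aligned direction

q = [1.0, 1.0]
baseline = attend(q, task_keys, task_vals)
steered = attend(q, task_keys + bridge_keys, task_vals + bridge_vals)
# The steered output is pulled toward the bridge token's value vector
# while the task keys/values themselves are left untouched.
```

Steering here works purely through attention: the output moves toward `bridge_vals` without any edit to the task keys or values, which is the sense in which the backbone's internal representations would be preserved.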

If this is right

  • Harmful output scores fall by more than 70 percent on safety benchmarks while fluency is preserved.
  • Value guidance remains consistent across diverse contexts without altering backbone parameters.
  • The same architecture applies to multiple model backbones with comparable gains.
  • Alignment becomes an add-on module rather than an integrated training step.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Value modules could be exchanged or updated independently for different ethical priorities on the same backbone.
  • The separation might make it easier to diagnose and correct alignment failures in deployed systems.
  • The pattern suggests extending modular guidance to multi-turn or multi-agent settings where stability is critical.

Load-bearing premise

Value signals can be isolated in a dedicated module and then successfully transduced into tokens that reliably influence the backbone's behavior.

What would settle it

Ablation experiments in which the independent module or bridge tokens are removed yet harmful scores remain unchanged on the same safety benchmarks would falsify the stability claim.
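
As a bookkeeping sketch of that falsification test (all scores invented for illustration): the stability claim survives only if removing the bridge tokens moves harmful scores back toward the unaligned baseline.

```python
# Hypothetical mean harmful scores per condition; lower is safer.
harm = {"baseline": 0.50, "svgt_full": 0.12, "svgt_no_bridge": 0.48}

def bridge_is_load_bearing(harm, tol=0.05):
    """True if ablating the bridge tokens degrades safety back toward
    baseline. If the ablated model stays as safe as full SVGT, the
    tokens were not doing the work and the stability claim fails."""
    return harm["svgt_no_bridge"] - harm["svgt_full"] > tol
```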

Figures

Figures reproduced from arXiv: 2605.11712 by Guojie Song, Shengyuan Bai, Sirui Sun, Wenhao Chen.

Figure 1. Conceptual illustration of our work. Dominant task signals (orange) in the residual stream often distort value representations, leading to misalignment (dashed). Our independent module provides stable guidance (solid), steering generation back to alignment despite task noise.
Figure 2. SVGT architecture. SVGT decouples value alignment from task-driven generation through a two-stage transformation: (1) Value Space Construction extracts stable, context-aware value signals z and computes directional corrections ∆z; (2) the Latent Value Bridge (LVB) transduces these abstract corrections into bridge tokens B that serve as attention targets for the frozen backbone, enabling dynamic and robust steering.
Figure 3. Overview of the SVGT training stages. The approach follows a progressive curriculum: basic value perception (Stage 1), context-aware understanding (Stage 2), and training of the Latent Value Bridge (Stage 3) to convert value signals into active guidance for the generative manifold. The LVB operates dynamically, re-encoding the value state z_t at each decoding step.
Figure 4. Evolution of latent harmful scores during generation. Thin lines show individual trajectories from 5 adversarial prompts; the bold line indicates their average. Left: the baseline remains in a high-risk region driven by adversarial prompts. Right: SVGT progressively steers the trajectory toward safer regions via guidance, demonstrating effective real-time correction.
Figure 5. Computational overhead of SVGT on Llama-3.2-3B, comparing baseline generation against SVGT with different bridge-token refresh intervals (r). Memory overhead is minimal (+3%); total latency increases moderately (+52–65%). Efficiency remains robust across refresh intervals r ∈ [1, 10], supporting flexible deployment.
Figure 7. Multi-dimensional comparison of alignment paradigms, synthesizing the empirical findings across six key dimensions. Despite the throughput latency, SVGT (orange) demonstrates superior balance, particularly in the trade-off between safety enforcement and capability preservation.
Figure 6. Sensitivity analysis of bridge-token momentum β. Safety performance is optimized when the EMA momentum β is between 0.6 and 0.8; this intermediate range allows bridge tokens to accumulate stable value signals while adapting to the evolving context during generation. Extreme values (β = 0 or 1) lead to either excessive guidance noise or a failure to rectify context drift.
Figure 8. Impact of bridge-token count (K) on safety and fluency, evaluated on Llama-3.2-3B. K ∈ [5, 10] provides an optimal trade-off: too few tokens limit guidance expressiveness, while exceeding 15 tokens introduces redundant noise that marginally increases perplexity. Error bars represent standard deviation across 3 random seeds.
Figure 9. Value discrimination accuracy as a function of the extraction layer position (l*). Experiments on GPT-2 show that safety-relevant semantic features are most discernible in the middle-to-late layers (normalized index 0.5–0.8); extracting from very early layers (syntactic) or final layers (too task-specific) leads to sub-optimal alignment performance.
Figure 10. Step-level training dynamics for Stage 3 (Value-Guided Generation) on Llama-3.2-3B. The plots show the evolution of the multi-objective loss components across 27,000 steps; raw values (light gray) and smoothed trajectories (colored) show rapid convergence. The cross-entropy loss remains stable while the regularization loss descends immediately.
Figure 11. Epoch-level convergence and generalization for Stage 3, comparing training (blue, circles) and validation (orange, squares) metrics over five epochs. The minimal gap between training and validation total loss, alongside the order-of-magnitude reduction in safety loss, validates the generalizability of the pre-trained value space and the efficacy of bridge tokens in capturing normative signals.
Figure 12. Logit distribution shift across generation steps. Left: intermittent spikes in the KL divergence D_KL(P_guide ∥ P_base) reveal that SVGT exerts adaptive, high-energy steering at critical decision junctions rather than applying a static bias. Right: the staircase-like growth of cumulative KL divergence demonstrates the persistent accumulation of alignment energy to overcome adversarial value inertia.
Figure 13. Probability lift (∆P) for top-5 tokens at different stages. At step 1, the bridge tokens prioritize structural control by uplifting pronouns and termination signals, e.g. ' I' (∆P = 0.457) and the termination token '<|eot_id|>'. By step 20, the focus shifts to normative justification, elevating safety-descriptive adjectives like "dangerous" to anchor the explanation to the safety manifold.
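
Figure 6's momentum sweep is consistent with a standard exponential moving average over decoding steps. The update rule below (B ← βB + (1−β)z) is a guess at that form based on the caption, not taken from the paper:

```python
def ema_update(bridge, z, beta):
    """One decoding-step update: the bridge token accumulates the current
    value signal z with momentum beta. beta=0 tracks only the latest,
    noisy signal; beta=1 never adapts, failing to rectify context drift."""
    return [beta * b + (1.0 - beta) * zi for b, zi in zip(bridge, z)]

bridge = [0.0, 0.0]
for _ in range(10):                       # ten decoding steps, constant signal
    bridge = ema_update(bridge, [1.0, 0.0], beta=0.7)
# With beta=0.7 the bridge has absorbed most of the signal after ten steps.
```

The intermediate β range in the caption corresponds to the usual EMA trade-off between smoothing out step-to-step noise and keeping up with a drifting context.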
read the original abstract

Aligning large language models (LLMs) with human values typically relies on post-training or inference-time steering that directly manipulates the backbone's parameters or representation space. However, a critical gap exists: the model's residual stream is highly dynamic, in which values exist as fragile, low-dimensional properties, inherently incompatible with the stability required for consistent value expression. In this paper, we propose the Stable Value Guidance Transformer (SVGT), which addresses this gap through an independent value module incorporating two key designs: (1) independent value modeling, maintaining normative representations in a dedicated value space isolated from the backbone, and (2) explicit behavioral guidance, transducing these stable signals into learnable latent Bridge Tokens. These tokens serve as dynamic value anchors to explicitly steer the generative trajectory, ensuring robust adherence across diverse contexts without disrupting the backbone's internal representations. Experiments across multiple backbones and safety benchmarks show that SVGT generally reduces harmful scores by over 70% while maintaining generation fluency, demonstrating the efficacy of architecturally grounded value modeling. Our code is available at https://github.com/Clervils/SVGT.git.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

2 major / 1 minor

Summary. The paper proposes the Stable Value Guidance Transformer (SVGT), which augments LLMs with an independent value module that maintains normative representations in a dedicated space isolated from the backbone residual stream and transduces them via learnable Bridge Tokens to steer generation. It claims this yields over 70% reduction in harmful scores across multiple backbones and safety benchmarks while preserving fluency, addressing the fragility of values in dynamic residual streams.

Significance. If the empirical results hold under scrutiny, the work would offer a meaningful architectural alternative to post-training or direct steering methods for value alignment, potentially enabling more consistent normative guidance without altering backbone parameters. The public code release aids reproducibility and allows direct testing of the independent-module design.

major comments (2)
  1. [Abstract] The central claim of 'generally reduces harmful scores by over 70%' is presented without any description of baselines, metrics, statistical tests, ablation controls, or variance across runs, rendering the efficacy of the independent value module and Bridge Tokens unverifiable from the provided text.
  2. Method description (Bridge Tokens): the architecture isolates normative representations but then explicitly inserts Bridge Tokens into the residual stream to steer trajectories; no measurement, ablation, or analysis is described that demonstrates these tokens maintain stable influence rather than being diluted or overwritten by subsequent dynamics, which directly undermines attribution of the reported harm reduction to architectural grounding.
minor comments (1)
  1. [Abstract] The abstract refers to 'multiple backbones and safety benchmarks' without naming them or providing even high-level dataset statistics, which reduces the reader's ability to assess generalizability.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We appreciate the referee's constructive feedback on our manuscript. Below, we provide point-by-point responses to the major comments and outline the revisions we will make to enhance clarity and rigor.

read point-by-point responses
  1. Referee: [Abstract] The central claim of 'generally reduces harmful scores by over 70%' is presented without any description of baselines, metrics, statistical tests, ablation controls, or variance across runs, rendering the efficacy of the independent value module and Bridge Tokens unverifiable from the provided text.

    Authors: We agree that the abstract lacks sufficient detail to fully substantiate the claim. The main text (Section 4) specifies the baselines (vanilla and safety-aligned LLMs), metrics (harm scores from established benchmarks), statistical tests, and reports mean results with standard deviations over multiple runs. We will revise the abstract to concisely include references to these elements, such as 'across multiple backbones and benchmarks with averaged results over runs,' to improve verifiability while adhering to length constraints. revision: yes

  2. Referee: [—] Method description (Bridge Tokens): the architecture isolates normative representations but then explicitly inserts Bridge Tokens into the residual stream to steer trajectories; no measurement, ablation, or analysis is described that demonstrates these tokens maintain stable influence rather than being diluted or overwritten by subsequent dynamics, which directly undermines attribution of the reported harm reduction to architectural grounding.

    Authors: We recognize the importance of demonstrating the stability of Bridge Tokens' influence. Although the current manuscript focuses on end-to-end performance improvements, it does not include dedicated ablations for temporal stability or dilution effects. In the revised version, we will incorporate new experiments and analyses, including step-wise influence measurements and ablations removing Bridge Tokens mid-generation, to empirically show their sustained impact and support the architectural claims. revision: yes
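
A step-wise influence measurement of the kind promised here could be as simple as logging, at each decoding step, the KL divergence between next-token distributions with and without the bridge tokens, matching Figure 12's D_KL(P_guide ∥ P_base). The distributions below are invented for illustration:

```python
import math

def kl(p, q):
    """KL divergence D_KL(p || q) over a discrete next-token distribution."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

# Hypothetical next-token distributions at one decoding step.
p_guided = [0.7, 0.2, 0.1]   # bridge tokens attached
p_base   = [0.2, 0.3, 0.5]   # bridge tokens ablated mid-generation
step_influence = kl(p_guided, p_base)
# A trajectory of step_influence values collapsing toward zero would
# indicate the tokens' influence being diluted by subsequent dynamics.
```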

Circularity Check

0 steps flagged

No significant circularity detected in architectural proposal or claims

full rationale

The paper proposes a new SVGT architecture consisting of an independent value module and Bridge Tokens to isolate and transduce value signals. Its central claims rest on this design choice plus empirical experiments across backbones showing harm reduction. No derivation reduces a result to its own inputs by construction, no parameters are fitted to a subset and then called predictions, and no load-bearing self-citations or uniqueness theorems appear in the text. The residual-stream fragility premise is stated as motivation rather than derived from the method itself, and the 70% reduction figure is presented as an experimental outcome rather than a mathematical identity. The work is therefore self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 2 invented entities

The central claim rests on the domain assumption that value properties are inherently unstable in the residual stream and on the introduction of two new architectural entities without external validation beyond the reported experiments.

axioms (1)
  • domain assumption LLM residual streams are highly dynamic and values exist as fragile low-dimensional properties incompatible with stable expression
    Explicitly stated as the critical gap motivating the work.
invented entities (2)
  • independent value space no independent evidence
    purpose: maintain normative representations isolated from the backbone
    New dedicated space introduced to solve the stability problem
  • Bridge Tokens no independent evidence
    purpose: transduce stable value signals into dynamic anchors that steer the generative trajectory
    New token type postulated as the mechanism for explicit guidance
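
The ledger's fragility premise is the picture behind steering vectors: a value reads out along a low-dimensional direction in activation space. A toy difference-of-means construction (all activations invented) makes the premise concrete:

```python
def mean(vectors):
    n = len(vectors)
    return [sum(v[i] for v in vectors) / n for i in range(len(vectors[0]))]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

# Toy residual-stream activations for safe vs. harmful contexts.
safe    = [[0.9, 0.1], [1.1, -0.1]]
harmful = [[-1.0, 0.2], [-0.8, 0.0]]

# One low-dimensional "value direction": the difference of class means.
direction = [a - b for a, b in zip(mean(safe), mean(harmful))]

# Projection onto the direction separates the two classes in this toy data.
scores_safe    = [dot(h, direction) for h in safe]
scores_harmful = [dot(h, direction) for h in harmful]
```

On the paper's premise, such a direction is easily distorted by dominant task signals in the same stream, which is what the dedicated value space is introduced to avoid.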

pith-pipeline@v0.9.0 · 5496 in / 1244 out tokens · 69549 ms · 2026-05-13T06:57:35.718724+00:00 · methodology

