
arxiv: 2308.10248 · v5 · submitted 2023-08-20 · 💻 cs.CL · cs.LG

Steering Language Models With Activation Engineering

Pith reviewed 2026-05-11 00:09 UTC · model grok-4.3

classification 💻 cs.CL cs.LG
keywords activation engineering · steering vectors · language models · inference-time control · sentiment shift · detoxification · activation addition · LLM steering

The pith

Adding differences in activations between contrasting prompts steers language model outputs toward desired sentiments or topics at inference time.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces activation engineering as a method to control language model behavior by modifying internal activations during generation rather than changing the model weights. Researchers compute a steering vector from the difference in activations produced by a pair of prompts, such as one evoking love and one evoking hate. Adding this vector to the model's forward pass shifts high-level output traits like sentiment while leaving performance on unrelated tasks largely unchanged. The technique reaches state-of-the-art results on turning negative text positive and on detoxification across models including LLaMA-3 and OPT. It requires only one prompt pair, no training, and supports quick experimentation with new directions.

Core claim

Activation Addition computes a steering vector by subtracting the intermediate activations of one prompt from those of a contrasting prompt and then adds a scaled version of this vector to the model's activations at selected layers during the forward pass. This produces reliable shifts in semantic properties of the generated text, such as sentiment polarity or topic focus, without requiring optimization or large datasets. The method achieves top performance on negative-to-positive sentiment transfer and toxicity reduction while preserving accuracy on off-target benchmarks.
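
As a rough illustration of the two steps described above (take a difference, then add a scaled copy), here is a toy sketch. Plain Python lists stand in for layer activations so it runs without any deep learning library. The names `steering_vector`, `add_steering`, and `coeff`, and all numbers, are hypothetical and not taken from the paper.

```python
# Toy sketch of Activation Addition (ActAdd); not the paper's implementation.
# Lists of floats stand in for one layer's activation vector.

def steering_vector(act_pos, act_neg):
    """Activation of the positive prompt minus that of the negative prompt."""
    return [p - n for p, n in zip(act_pos, act_neg)]


def add_steering(activation, vector, coeff):
    """Add a scaled steering vector to a single activation vector."""
    return [a + coeff * v for a, v in zip(activation, vector)]


# Hypothetical activations of the contrasting prompts at the chosen layer.
act_love = [0.9, 0.1, -0.3, 0.5]
act_hate = [-0.7, 0.1, 0.4, 0.5]

vec = steering_vector(act_love, act_hate)

# Hypothetical activation of a new prompt at the same layer.
h = [0.0, 0.2, 0.1, -0.1]
h_steered = add_steering(h, vec, coeff=2.0)
```

In a real model the same addition would happen inside the forward pass at the selected layer; how that is wired up depends on the framework and is not specified here.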

What carries the argument

The Activation Addition (ActAdd) technique, which derives a steering vector from the difference in intermediate activations between a pair of contrasting prompts and adds it to the model's forward pass to guide high-level output properties.
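
One possible way to write this down is sketched below. The notation is assumed here and not taken from the paper: $h^{(l)}(\cdot)$ denotes the activations at layer $l$, $p_+$ and $p_-$ are the two contrasting prompts, $x$ is the prompt being steered, and $c$ is the scaling coefficient.

```latex
v^{(l)} = h^{(l)}(p_+) - h^{(l)}(p_-), \qquad
\tilde{h}^{(l)}(x) = h^{(l)}(x) + c \, v^{(l)}
```

With $c = 0$ the forward pass is unchanged, so $c$ directly controls how strongly the output is pushed along $v^{(l)}$.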

If this is right

  • High-level output properties such as sentiment and topic become controllable at inference time.
  • Off-target task performance remains intact after steering.
  • The approach requires no model optimization and works with a single pair of examples.
  • Rapid iteration over different steering directions becomes feasible without retraining.
  • State-of-the-art results appear on sentiment shifting and detoxification for models like LLaMA-3 and OPT.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • High-level concepts may occupy consistent directions in activation space that can be isolated with minimal examples.
  • The method could extend to steering for factual accuracy or creative styles once suitable prompt pairs are identified.
  • Activation engineering offers a practical route to test hypotheses about how models internally represent abstract traits.
  • Combining ActAdd with other inference techniques might enable finer-grained, multi-directional control.

Load-bearing premise

Activation differences extracted from a single prompt pair reliably encode semantic directions that generalize across new contexts without unintended effects on other model capabilities.

What would settle it

Running ActAdd on a held-out model or task and finding that the added vector produces no measurable shift in the target property or causes clear drops in unrelated task scores would falsify the claim of generalizable control.
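
The falsification criterion above can be phrased as a small check, sketched here under stated assumptions. The function name, its arguments, and the two thresholds are hypothetical; the paper does not specify what counts as a "measurable" shift or a "clear" drop.

```python
def settles_against(target_shift, off_target_drop,
                    min_shift=0.05, max_drop=0.02):
    """Return True if a held-out run counts against generalizable control.

    target_shift: change in the steered property (steered minus baseline).
    off_target_drop: loss in an unrelated task score (baseline minus steered).
    min_shift, max_drop: illustrative thresholds, assumed for this sketch.
    """
    no_effect = abs(target_shift) < min_shift
    collateral_damage = off_target_drop > max_drop
    return no_effect or collateral_damage
```

For example, `settles_against(0.0, 0.0)` flags a run with no shift at all, while `settles_against(0.3, 0.0)` does not flag a run that shifts the target property and leaves the unrelated score intact.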

Original abstract

Prompt engineering and finetuning aim to maximize language model performance on a given metric (like toxicity reduction). However, these methods do not fully elicit a model's capabilities. To reduce this gap, we introduce activation engineering: the inference-time modification of activations in order to control (or steer) model outputs. Specifically, we introduce the Activation Addition (ActAdd) technique, which contrasts the intermediate activations on prompt pairs (such as "Love" versus "Hate") to compute a steering vector (Subramani et al. 2022). By tactically adding in e.g. the "Love" - "Hate" steering vector during the forward pass, we achieve SOTA on negative-to-positive sentiment shift and detoxification using models including LLaMA-3 and OPT. ActAdd yields inference-time control over high-level output properties (like topic and sentiment) while preserving performance on off-target tasks. ActAdd is lightweight: it does not require any machine optimization and works with a single pair of data points, which enables rapid iteration over steering. ActAdd demonstrates the power of activation engineering.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The paper introduces Activation Addition (ActAdd), an inference-time technique that computes a steering vector as the difference in intermediate activations produced by a contrasting prompt pair (e.g., 'Love' versus 'Hate') and adds a scaled version of this vector to the model's activations during the forward pass. The central claim is that this yields state-of-the-art performance on negative-to-positive sentiment transformation and detoxification benchmarks using models including LLaMA-3 and OPT, while preserving performance on off-target tasks, all without any optimization or training and using only a single prompt pair.

Significance. If the empirical results hold under rigorous validation, the work is significant because it demonstrates a lightweight, training-free method for controlling high-level semantic properties of language model outputs via direct manipulation of activations. Strengths include the absence of machine optimization, the use of minimal data (single pairs), and the potential for rapid iteration; these features distinguish it from prompt engineering or fine-tuning and could enable new forms of controllable generation if the steering vectors prove robustly transferable.

major comments (3)
  1. [Abstract and Experimental Results] Abstract and Experimental Results section: the claim of achieving SOTA on sentiment shift and detoxification is presented without details on the baselines compared, the number of evaluation runs, statistical significance tests, error bars, or data exclusion rules. These omissions are load-bearing because the central empirical claim cannot be assessed for robustness or generalizability without them.
  2. [Method] Method description (steering vector construction): the activation difference is computed from a single prompt pair at a chosen layer with no regularization or multi-pair averaging. This construction does not guarantee that the resulting vector isolates a high-level semantic direction rather than prompt-specific or surface-form artifacts, which directly threatens the claim that the method transfers across arbitrary contexts without off-target effects.
  3. [Evaluation] Evaluation of off-target preservation: while the paper states that ActAdd preserves performance on unrelated tasks, no quantitative results or specific task suites are referenced to support this. Without such evidence, the claim that high-level steering leaves other capabilities intact remains unverified and is central to the practical utility argument.
minor comments (2)
  1. [Method] The notation for the steering vector (difference of activations) should be formalized with an equation, including the exact layer index and scaling coefficient, to improve reproducibility.
  2. [Introduction] Related work on activation steering (e.g., Subramani et al. 2022) is cited but could be expanded with a brief comparison table of prior techniques versus ActAdd on data requirements and optimization.
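
Major comment 2 contrasts the single-pair construction with multi-pair averaging. A minimal sketch of the averaged variant follows, assuming activations are plain lists of floats; the function name and the example numbers are hypothetical. In the toy data the third coordinate plays the role of a prompt-specific artifact with opposite sign in each pair, so it cancels in the mean while the shared first coordinate survives.

```python
def mean_steering_vector(pairs):
    """Average (positive - negative) activation differences over prompt pairs."""
    if not pairs:
        raise ValueError("need at least one (positive, negative) pair")
    dim = len(pairs[0][0])
    total = [0.0] * dim
    for pos, neg in pairs:
        if len(pos) != dim or len(neg) != dim:
            raise ValueError("all activations must share one dimension")
        for i in range(dim):
            total[i] += pos[i] - neg[i]
    return [t / len(pairs) for t in total]


# Hypothetical activations for two contrasting prompt pairs.
pairs = [
    ([1.0, 0.5, 0.0], [-1.0, 0.5, 0.2]),
    ([0.8, -0.5, 0.1], [-1.2, -0.5, -0.1]),
]
avg_vec = mean_steering_vector(pairs)
```

Whether averaging actually removes surface-form artifacts in a real model is an empirical question that this sketch does not answer.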

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for their constructive and detailed feedback. The comments identify key areas where additional clarity and evidence would strengthen the manuscript's claims. We address each major comment point by point below, indicating where revisions will be made.

Point-by-point responses
  1. Referee: [Abstract and Experimental Results] Abstract and Experimental Results section: the claim of achieving SOTA on sentiment shift and detoxification is presented without details on the baselines compared, the number of evaluation runs, statistical significance tests, error bars, or data exclusion rules. These omissions are load-bearing because the central empirical claim cannot be assessed for robustness or generalizability without them.

    Authors: We agree that the SOTA claim requires more supporting details for proper assessment of robustness. In the revised manuscript, we will update the abstract and Experimental Results section to explicitly name the baselines (including specific prompt engineering and fine-tuning methods from related work), report the number of evaluation runs with error bars or standard deviations, include statistical significance tests where relevant, and clarify data exclusion rules. These additions will be made while preserving the reported performance figures. revision: yes

  2. Referee: [Method] Method description (steering vector construction): the activation difference is computed from a single prompt pair at a chosen layer with no regularization or multi-pair averaging. This construction does not guarantee that the resulting vector isolates a high-level semantic direction rather than prompt-specific or surface-form artifacts, which directly threatens the claim that the method transfers across arbitrary contexts without off-target effects.

    Authors: The single-pair construction is a core feature of ActAdd, chosen to emphasize its training-free nature and minimal data needs. Contrasting pairs targeting high-level concepts are used, with addition at intermediate layers to focus on semantic rather than surface features; empirical transfer across contexts in our results supports this. We acknowledge the risk of artifacts. In revision, we will add discussion of this limitation in the Method section and include supplementary results using averaged vectors from multiple pairs to evaluate robustness. revision: partial

  3. Referee: [Evaluation] Evaluation of off-target preservation: while the paper states that ActAdd preserves performance on unrelated tasks, no quantitative results or specific task suites are referenced to support this. Without such evidence, the claim that high-level steering leaves other capabilities intact remains unverified and is central to the practical utility argument.

    Authors: We appreciate the referee highlighting the need for quantitative support here. Although internal checks showed no degradation, the manuscript lacks explicit metrics. In the revised version, we will add quantitative results in the Evaluation section, reporting performance on specific off-target benchmarks (such as subsets of GLUE or general text perplexity) for both steered and baseline models to verify capability preservation. revision: yes
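
The error bars promised in responses 1 and 3 could be produced in several ways. One option, sketched here as an assumption rather than as the authors' procedure, is a paired percentile bootstrap over per-example scores for the baseline and steered models. The function name, its parameters, and the scores are made up for illustration and are not results from the paper.

```python
import random
import statistics


def paired_bootstrap_ci(baseline, steered, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile bootstrap CI for the mean of per-example (steered - baseline)."""
    if len(baseline) != len(steered) or not baseline:
        raise ValueError("need two non-empty score lists of equal length")
    diffs = [s - b for b, s in zip(baseline, steered)]
    rng = random.Random(seed)  # fixed seed so the interval is reproducible
    means = []
    for _ in range(n_resamples):
        sample = [rng.choice(diffs) for _ in diffs]  # resample with replacement
        means.append(statistics.fmean(sample))
    means.sort()
    low = means[int((alpha / 2) * n_resamples)]
    high = means[int((1 - alpha / 2) * n_resamples) - 1]
    return statistics.fmean(diffs), low, high


# Hypothetical per-example off-target scores (illustrative numbers only).
baseline_scores = [0.71, 0.64, 0.80, 0.58, 0.69, 0.75]
steered_scores = [0.70, 0.65, 0.78, 0.59, 0.68, 0.74]
mean_diff, ci_low, ci_high = paired_bootstrap_ci(baseline_scores, steered_scores)
```

An interval that comfortably contains zero would be consistent with preserved off-target performance, although with only a handful of examples it would also be too wide to rule out a real drop.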

Circularity Check

0 steps flagged

No significant circularity; empirical technique with independent experimental validation

Full rationale

The paper presents ActAdd as a direct, non-optimized procedure: compute the difference between activations on a single contrasting prompt pair at a chosen layer, then add a scaled version of that vector to the residual stream during inference. This construction does not fit parameters to a target metric and then relabel the fit as a prediction, nor does it define the steering vector in terms of the desired output property. The cited Subramani et al. 2022 reference supplies the contrastive-difference idea but is not used to import a uniqueness theorem or to smuggle an ansatz; the present work simply applies the difference vector and reports measured effects on sentiment and toxicity benchmarks. No self-citation chain bears the central claim, and the reported SOTA numbers are external performance measurements rather than algebraic identities. The derivation chain is therefore self-contained and non-circular.

Axiom & Free-Parameter Ledger

1 free parameter · 0 axioms · 0 invented entities

The central claim rests on the empirical observation that activation differences capture controllable semantic directions; no explicit free parameters or invented entities are stated in the abstract, though scaling of the steering vector is implicitly required.

free parameters (1)
  • steering vector scaling coefficient
    Magnitude of vector addition must be chosen to achieve desired effect without over-steering.
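
Choosing this coefficient amounts to a small search. The sketch below assumes two caller-supplied evaluation functions and picks the smallest candidate that shifts the target property enough without exceeding an off-target budget. All names, thresholds, and the toy curves are hypothetical; they are not measurements from the paper.

```python
def pick_coefficient(candidates, target_shift, off_target_drop,
                     min_shift, max_drop):
    """Return the smallest coefficient meeting both constraints, or None."""
    for c in sorted(candidates):
        if target_shift(c) >= min_shift and off_target_drop(c) <= max_drop:
            return c
    return None


def toy_shift(c):
    """Assumed shape: the target shift grows with c and then saturates."""
    return min(1.0, 0.2 * c)


def toy_drop(c):
    """Assumed shape: off-target damage grows quadratically with c."""
    return 0.005 * c * c


best = pick_coefficient([1, 2, 4, 8, 16], toy_shift, toy_drop,
                        min_shift=0.5, max_drop=0.1)
```

Returning `None` when no candidate satisfies both constraints makes over-steering visible instead of silently accepting a damaging coefficient.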

pith-pipeline@v0.9.0 · 5505 in / 1055 out tokens · 29754 ms · 2026-05-11T00:09:39.439037+00:00 · methodology


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

  • Foundation.DAlembert.Inevitability bilinear_family_forced echoes

    ECHOES: this paper passage has the same mathematical shape or conceptual pattern as the Recognition theorem, but is not a direct formal dependency.

    By tactically adding in e.g. the “Love” - “Hate” steering vector during the forward pass, we achieve SOTA on negative-to-positive sentiment shift and detoxification

  • Foundation.HierarchyForcing additive_composition_is_minimal unclear

    Relation between the paper passage and the cited Recognition theorem.

    ActAdd is lightweight: it does not require any machine optimization and works with a single pair of data points

What do these tags mean?
  • matches: The paper's claim is directly supported by a theorem in the formal canon.
  • supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
  • extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
  • uses: The paper appears to rely on the theorem as machinery.
  • contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
  • unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Forward citations

Cited by 60 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. REALISTA: Realistic Latent Adversarial Attacks that Elicit LLM Hallucinations

    cs.CL 2026-05 unverdicted novelty 8.0

    REALISTA optimizes continuous combinations of valid editing directions in latent space to produce realistic adversarial prompts that elicit hallucinations more effectively than prior methods, including on large reason...

  2. WriteSAE: Sparse Autoencoders for Recurrent State

    cs.LG 2026-05 unverdicted novelty 8.0

    WriteSAE is the first sparse autoencoder that factors decoder atoms into the native d_k x d_v cache write shape of recurrent models and supplies a closed-form per-token logit shift for atom substitution.

  3. WriteSAE: Sparse Autoencoders for Recurrent State

    cs.LG 2026-05 unverdicted novelty 8.0

    WriteSAE decomposes recurrent model cache writes into substitutable atoms with a closed-form logit shift, achieving high substitution success and targeted behavioral installs on models like Qwen3.5 and Mamba-2.

  4. WriteSAE: Sparse Autoencoders for Recurrent State

    cs.LG 2026-05 unverdicted novelty 8.0

    WriteSAE introduces sparse autoencoders with rank-1 matrix atoms for recurrent state updates, allowing replacement tests that outperform deletion on 92.4% of positions and a formula predicting logit changes with R²=0.98.

  5. SLAM: Structural Linguistic Activation Marking for Language Models

    cs.CL 2026-05 unverdicted novelty 8.0

    SLAM achieves 100% detection on Gemma-2 models with only 1-2 point quality cost by causally steering SAE-identified residual-stream directions for linguistic structure.

  6. SLAM: Structural Linguistic Activation Marking for Language Models

    cs.CL 2026-05 unverdicted novelty 8.0

    SLAM achieves 100% detection accuracy on Gemma-2 models with only 1-2 points of quality loss by causally steering SAE-identified structural directions while preserving lexical sampling and semantics.

  7. What Suppresses Nash Equilibrium Play in Large Language Models? Mechanistic Evidence and Causal Control

    cs.GT 2026-04 accept novelty 8.0

    LLMs compute Nash actions internally but suppress them via prosocial overrides from training data, and this can be causally controlled through residual stream interventions.

  8. Slot Machines: How LLMs Keep Track of Multiple Entities

    cs.CL 2026-04 unverdicted novelty 8.0

    LLM activations encode current and prior entities in orthogonal slots, but models only use the current slot for explicit factual retrieval despite prior-slot information being linearly decodable.

  9. Steerable but Not Decodable: Function Vectors Operate Beyond the Logit Lens

    cs.LG 2026-04 accept novelty 8.0

    Function vectors steer LLMs successfully where the logit lens fails to decode the target answer, showing the two properties come apart.

  10. Knowledge Packs: Zero-Token Knowledge Delivery via KV Cache Injection

    cs.CL 2026-03 unverdicted novelty 8.0

    Knowledge Packs deliver knowledge via pre-computed KV caches with exact equivalence under causal masking, achieving zero divergences on tested questions and enabling value-based steering without training.

  11. The Linear Representation Hypothesis and the Geometry of Large Language Models

    cs.CL 2023-11 conditional novelty 8.0

    Linear representations of high-level concepts in LLMs are formalized via counterfactuals in input and output spaces, unified under a causal inner product that enables consistent probing and steering.

  12. As X, Do Y: How Persona and Task Combine in Instruction-Tuned LLMs

    cs.CL 2026-05 unverdicted novelty 7.0

    Persona and task in role prompts decompose additively into orthogonal directions at the prompt-to-answer transition in LLM residual streams, but this local structure does not allow compressing the prompt into a single...

  13. Playing Devil's Advocate: Off-the-Shelf Persona Vectors Rival Targeted Steering for Sycophancy

    cs.AI 2026-05 conditional novelty 7.0

    Off-the-shelf persona vectors for doubt and scrutiny reduce sycophancy comparably to CAA while maintaining accuracy on correct inputs and showing directional independence.

  14. The Hidden Signal of Verifier Strictness: Controlling and Improving Step-Wise Verification via Selective Latent Steering

    cs.LG 2026-05 conditional novelty 7.0

    VerifySteer selectively steers hidden states at paragraph boundaries using latent correctness signals to control verifier strictness and outperform baselines on ProcessBench and Hard2Verify with lower compute.

  15. Residual Paving: Diagnosing the Routing Bottleneck in Selective Refusal Editing

    cs.LG 2026-05 unverdicted novelty 7.0

    Residual Paving decomposes selective refusal editing into an early-layer router for intervention decisions and later-layer residual experts for edits, with oracle routing showing that learned route selectivity is the ...

  16. FishBack: Pullback Fisher Geometry for Optimal Activation Steering in Transformers

    cs.LG 2026-05 unverdicted novelty 7.0

    FishBack derives a closed-form minimum-distortion steering direction from the pullback Fisher metric of the softmax layer, outperforming Euclidean baselines on GPT-2 verb-morphology tasks with lower off-target KL divergence.

  17. Where Pretraining writes and Alignment reads: the asymmetry of Transformer weight space

    cs.LG 2026-05 unverdicted novelty 7.0

    Pretraining and alignment induce asymmetric geometric traces in transformer weights because alignment updates concentrate in read pathways due to activation covariance while write pathways inherit less structure from ...

  18. The Mechanism of Weak-to-Strong Generalization: Feature Elicitation from Latent Knowledge

    stat.ML 2026-05 unverdicted novelty 7.0

    In two-layer networks, weak-to-strong training elicits the target feature direction from pre-trained subspaces and preserves correlated off-target features, unlike standard fine-tuning.

  19. WriteSAE: Sparse Autoencoders for Recurrent State

    cs.LG 2026-05 unverdicted novelty 7.0

    WriteSAE factors sparse autoencoder decoder atoms to the native d_k x d_v cache write shape in recurrent models, provides a closed-form logit shift, and demonstrates high success in atom substitution and behavioral ed...

  20. Deep Minds and Shallow Probes

    cs.LG 2026-05 unverdicted novelty 7.0

    Symmetry under affine reparameterizations of hidden coordinates selects a unique hierarchy of shallow coordinate-stable probes and a probe-visible quotient for cross-model transfer.

  21. SLIM: Sparse Latent Steering for Interpretable and Property-Directed LLM-Based Molecular Editing

    cs.LG 2026-05 unverdicted novelty 7.0

    SLIM decomposes LLM hidden states via sparse autoencoders with learnable gates to enable precise, interpretable steering of molecular properties, yielding up to 42.4-point gains on the MolEditRL benchmark.

  22. Prompt-Activation Duality: Improving Activation Steering via Attention-Level Interventions

    cs.CL 2026-05 unverdicted novelty 7.0

    GCAD reduces coherence drift from -18.6 to -1.9 and raises turn-10 trait expression from 78.0 to 93.1 in persona-steering tasks by using gated attention-delta interventions from system prompts.

  23. Cross-Family Universality of Behavioral Axes via Anchor-Projected Representations

    cs.AI 2026-05 unverdicted novelty 7.0

    Behavioral directions from one LLM family transfer to others via projection into a shared anchor coordinate space, yielding 0.83 ten-way detection accuracy and steering effects up to 0.46% on held-out models.

  24. When Is Rank-1 Steering Cheap? Geometry, Granularity, and Budgeted Search

    cs.LG 2026-05 conditional novelty 7.0

    Rank-1 activation steering is often cheap when prompt-boundary alignment guides budgeted search and concept granularity diagnoses directional stability, with the GRACE framework reducing trials to 95% utility by 39.8%...

  25. When Is Rank-1 Steering Cheap? Geometry, Granularity, and Budgeted Search

    cs.LG 2026-05 unverdicted novelty 7.0

    Prompt-boundary directional alignment enables geometry-guided search that cuts trials to 95% best utility by 39.8% on average, while concept granularity predicts remaining difficulty via directional heterogeneity.

  26. Inference Time Causal Probing in LLMs

    cs.AI 2026-05 unverdicted novelty 7.0

    HDMI is a new probe-free technique that steers LLM hidden states via margin objectives to achieve more reliable causal interventions than prior probe-based methods on standard benchmarks.

  27. Memory Inception: Latent-Space KV Cache Manipulation for Steering LLMs

    cs.LG 2026-05 unverdicted novelty 7.0

    Memory Inception steers LLMs via selective latent KV cache injection at chosen layers, delivering better control-drift balance than prompting or CAA on personality and reasoning tasks while reducing storage needs.

  28. DataDignity: Training Data Attribution for Large Language Models

    cs.AI 2026-05 unverdicted novelty 7.0

    ScoringModel raises mean Recall@10 to 52.2 on the FakeWiki provenance benchmark from 35.0 for the best baseline, winning 41 of 45 model-by-condition comparisons and gaining 15.7 points on jailbreak-style queries.

  29. Steer Like the LLM: Activation Steering that Mimics Prompting

    cs.CL 2026-05 unverdicted novelty 7.0

    PSR models that estimate token-specific steering coefficients from activations outperform standard activation steering and compare favorably to prompting on steering benchmarks.

  30. The Right Answer, the Wrong Direction: Why Transformers Fail at Counting and How to Fix It

    cs.LG 2026-05 unverdicted novelty 7.0

    Transformers encode counts correctly internally but fail to read them out due to misalignment with digit output directions, fixable by updating 37k output parameters or small LoRA on attention.

  31. The Right Answer, the Wrong Direction: Why Transformers Fail at Counting and How to Fix It

    cs.LG 2026-05 accept novelty 7.0

    Transformers store count information internally but cannot read it out as digits due to near-orthogonal alignment with output-head rows; updating digit rows or applying LoRA to attention layers improves constrained an...

  32. Less is More: Geometric Unlearning for LLMs with Minimal Data Disclosure

    cs.CL 2026-05 unverdicted novelty 7.0

    Geometric Unlearning suppresses specific knowledge in LLMs by projecting hidden planning states onto a low-rank safe geometry derived from minimal reference prompts.

  33. RouteHijack: Routing-Aware Attack on Mixture-of-Experts LLMs

    cs.LG 2026-05 unverdicted novelty 7.0

    RouteHijack is a routing-aware jailbreak that identifies safety-critical experts via activation contrast and optimizes suffixes to suppress them, reaching 69.3% average attack success rate on seven MoE LLMs with stron...

  34. ReLay: Personalized LLM-Generated Plain-Language Summaries for Better Understanding, but at What Cost?

    cs.CL 2026-05 unverdicted novelty 7.0

    Personalized LLM-generated plain language summaries improve lay readers' comprehension and quality ratings but increase risks of reinforcing biases and introducing hallucinations compared to static expert summaries.

  35. Subliminal Steering: Stronger Encoding of Hidden Signals

    cs.CL 2026-04 unverdicted novelty 7.0

    Subliminal steering transfers complex behavioral biases and the underlying steering vector through fine-tuning on innocuous data, achieving higher precision than prior prompt-based methods.

  36. Exploring Language-Agnosticity in Function Vectors: A Case Study in Machine Translation

    cs.CL 2026-04 unverdicted novelty 7.0

    Translation function vectors extracted from English to one target language improve correct token ranking for translations to multiple other unseen target languages in decoder-only multilingual LLMs.

  37. Cell-Based Representation of Relational Binding in Language Models

    cs.CL 2026-04 unverdicted novelty 7.0

    Large language models encode relational bindings via a cell-based representation: a low-dimensional linear subspace in which each cell corresponds to an entity-relation index pair and attributes are retrieved from the...

  38. Local Linearity of LLMs Enables Activation Steering via Model-Based Linear Optimal Control

    cs.LG 2026-04 conditional novelty 7.0

    Local linearity of LLM layers enables LQR-based closed-loop activation steering with theoretical tracking guarantees.

  39. Psychological Steering of Large Language Models

    cs.CL 2026-04 unverdicted novelty 7.0

    Mean-difference residual stream injections outperform personality prompting for OCEAN trait steering in most LLMs, with hybrids performing best and showing approximate linearity but non-human trait covariances.

  40. Identity as Attractor: Geometric Evidence for Persistent Agent Architecture in LLM Activation Space

    cs.AI 2026-04 conditional novelty 7.0

    Paraphrases of an identity document induce tighter clustering in LLM activation space than matched controls, indicating attractor-like dynamics for agent identity.

  41. Jailbreaking the Matrix: Nullspace Steering for Controlled Model Subversion

    cs.CR 2026-04 unverdicted novelty 7.0

    HMNS is a new jailbreak method that uses causal head identification and nullspace-constrained injection to achieve higher attack success rates than prior techniques on aligned language models.

  42. Emotion Concepts and their Function in a Large Language Model

    cs.AI 2026-04 unverdicted novelty 7.0

    Claude Sonnet 4.5 exhibits functional emotions via abstract internal representations of emotion concepts that causally influence its preferences and misaligned behaviors without implying subjective experience.

  43. Steering Autoregressive Music Generation with Recursive Feature Machines

    cs.LG 2025-10 unverdicted novelty 7.0

    MusicRFM discovers interpretable concept directions in music model hidden states using RFM probes and injects them at inference to steer generation toward desired musical properties without retraining.

  44. Activation Steering with a Feedback Controller

    cs.LG 2025-10 unverdicted novelty 7.0

    Popular LLM activation steering methods are shown to act as proportional controllers; a PID steering framework is proposed that improves robustness and outperforms baselines in experiments across model families.

  45. Refusal in Language Models Is Mediated by a Single Direction

    cs.LG 2024-06 accept novelty 7.0

    Refusal in language models is mediated by a single direction in residual stream activations that can be erased to disable safety or added to elicit refusal.

  46. Relational Linear Properties in Language Models: An Empirical Investigation

    cs.LG 2026-05 unverdicted novelty 6.0

    A KL-divergence probing method shows relational linearity in language models varies across models and layers while being sensitive to relation phrasing, extending prior linear embedding work.

  47. Manifold-Guided Attention Steering

    cs.LG 2026-05 unverdicted novelty 6.0

    MAGS learns low-dimensional subspaces from correct versus incorrect reasoning traces and applies targeted projection corrections to attention heads when they deviate from the correctness manifold during inference.

  48. Latent-space Attacks for Refusal Evasion in Language Models

    cs.AI 2026-05 unverdicted novelty 6.0

    Introduces Controlled Latent-space Evasion attack that projects model activations past a linear probe's decision boundary to suppress refusal, outperforming ablation baselines on 15 models.

  49. Monitoring the Internal Monologue: Probe Trajectories Reveal Reasoning Dynamics

    cs.CL 2026-05 unverdicted novelty 6.0

    Probe trajectories across token positions in LRMs, combined with signal-processing features, improve prediction of future model outputs over static probes on safety and math tasks.

  50. ASRU: Activation Steering Meets Reinforcement Unlearning for Multimodal Large Language Models

    cs.CL 2026-05 unverdicted novelty 6.0

    ASRU combines activation redirection and reward-optimized fine-tuning to unlearn cross-modal sensitive knowledge in MLLMs, reporting +24.6% better unlearning effectiveness and 5.8x higher generation quality on Qwen3-V...

  51. VSPO: Vector-Steered Policy Optimization for Behavioral Control

    cs.LG 2026-05 unverdicted novelty 6.0

    VSPO samples rollouts at varying steering intensities to improve behavioral control in LLMs while preserving task accuracy.

  52. TFGN: Task-Free, Replay-Free Continual Pre-Training Without Catastrophic Forgetting at LLM Scale

    cs.LG 2026-05 unverdicted novelty 6.0

    TFGN is an architectural overlay for transformers enabling task-free, replay-free continual pre-training across heterogeneous domains at LLM scale with near-zero backward transfer and high gradient orthogonality.

  53. Fusion-fission forecasts when AI will shift to undesirable behavior

    cs.AI 2026-05 unverdicted novelty 6.0

    A vector generalization of fusion-fission group dynamics from physics forecasts when AI behavior shifts to undesirable states, validated at 90 percent across seven models and prior to real-world data.

  54. Dual-Pathway Circuits of Object Hallucination in Vision-Language Models

    cs.CV 2026-05 unverdicted novelty 6.0

    Vision-language models contain identifiable grounding and hallucination pathways; suppressing the latter reduces object hallucinations by up to 76% while preserving accuracy.

  55. Stories in Space: In-Context Learning Trajectories in Conceptual Belief Space

    cs.CL 2026-05 unverdicted novelty 6.0

    LLMs perform in-context learning as trajectories through a structured low-dimensional conceptual belief space, with the structure visible in both behavior and internal representations and causally manipulable via inte...

  56. Toward Stable Value Alignment: Introducing Independent Modules for Consistent Value Guidance

    cs.AI 2026-05 unverdicted novelty 6.0

    SVGT adds independent value modules and Bridge Tokens to LLMs to maintain consistent value guidance, cutting harmful outputs by over 70% in tests while preserving fluency.

  57. Interpretability Can Be Actionable

    cs.LG 2026-05 conditional novelty 6.0

    Interpretability research should be judged by actionability—the degree to which its insights support concrete decisions and interventions—rather than explanatory power alone.

  58. Enabling Performant and Flexible Model-Internal Observability for LLM Inference

    cs.LG 2026-05 unverdicted novelty 6.0

    DMI-Lib delivers 0.4-6.8% overhead for offline batch LLM inference and ~6% for moderate online serving while exposing rich internal signals across backends, cutting latency overhead 2-15x versus prior observability baselines.

  59. Training-Free Cultural Alignment of Large Language Models via Persona Disagreement

    cs.CL 2026-05 unverdicted novelty 6.0

    DISCA uses disagreement among WVS-grounded persona panels to apply loss-averse logit corrections that reduce cultural misalignment by 10-24% on MultiTP for models 3.8B and larger, without weight changes.

  60. Training-Free Cultural Alignment of Large Language Models via Persona Disagreement

    cs.CL 2026-05 conditional novelty 6.0

    DISCA converts within-country disagreement among World Values Survey personas into a bounded logit correction that reduces cultural misalignment by 10-24% on MultiTP for models 3.8B and larger across 20 countries, wit...

Reference graph

Works this paper leans on

135 extracted references · 135 canonical work pages · cited by 115 Pith papers · 11 internal anchors

  1. [1]

    Understanding intermediate layers using linear classifier probes, 2018

    Guillaume Alain and Yoshua Bengio. Understanding intermediate layers using linear classifier probes, 2018

  2. [2]

    TransformerLens: A library for mechanistic interpretability of generative language models

    Joseph Bloom and Neel Nanda. TransformerLens: A library for mechanistic interpretability of generative language models. https://neelnanda-io.github.io/TransformerLens/, 2022

  3. [3]

    Robustness of edited neural networks, 2023

    Davis Brown, Charles Godfrey, Cody Nizinski, Jonathan Tu, and Henry Kvinge. Robustness of edited neural networks, 2023

  4. [4]

    Tom B. Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, Sandhini Agarwal, Ariel Herbert-Voss, Gretchen Krueger, Tom Henighan, Rewon Child, Aditya Ramesh, Daniel M. Ziegler, Jeffrey Wu, Clemens Winter, Christopher Hesse, Mark Chen, Eric Sigler, Mateusz Litwin...

  5. [5]

    Discovering latent knowledge in language models without supervision, 2022

    Collin Burns, Haotian Ye, Dan Klein, and Jacob Steinhardt. Discovering latent knowledge in language models without supervision, 2022

  6. [6]

    Plug and play language models: A simple approach to controlled text generation, 2020

    Sumanth Dathathri, Andrea Madotto, Janice Lan, Jane Hung, Eric Frank, Piero Molino, Jason Yosinski, and Rosanne Liu. Plug and play language models: A simple approach to controlled text generation, 2020

  7. [7]

    A mathematical framework for transformer circuits

    Nelson Elhage, Neel Nanda, Catherine Olsson, Tom Henighan, Nicholas Joseph, Ben Mann, Amanda Askell, Yuntao Bai, Anna Chen, Tom Conerly, et al. A mathematical framework for transformer circuits. Transformer Circuits Thread, 1, 2021

  8. [8]

    Toy models of superposition, 2022

    Nelson Elhage, Tristan Hume, Catherine Olsson, Nicholas Schiefer, Tom Henighan, Shauna Kravec, Zac Hatfield-Dodds, Robert Lasenby, Dawn Drain, Carol Chen, Roger Grosse, Sam McCandlish, Jared Kaplan, Dario Amodei, Martin Wattenberg, and Christopher Olah. Toy models of superposition, 2022

  9. [10]

    Probability plotting methods for the analysis of data

    Ramanathan Gnanadesikan and Martin B Wilk. Probability plotting methods for the analysis of data. Biometrika, 55(1):1–17, 1968

  10. [11]

    Bias correction of learned generative models using likelihood-free importance weighting, 2019

    Aditya Grover, Jiaming Song, Alekh Agarwal, Kenneth Tran, Ashish Kapoor, Eric Horvitz, and Stefano Ermon. Bias correction of learned generative models using likelihood-free importance weighting, 2019

  11. [15]

    Does localization inform editing? Surprising differences in causality-based localization vs. knowledge editing in language models, 2023

    Peter Hase, Mohit Bansal, Been Kim, and Asma Ghandeharioun. Does localization inform editing? surprising differences in causality-based localization vs. knowledge editing in language models, 2023

  12. [16]

    Inspecting and editing knowledge representations in language models, 2023

    Evan Hernandez, Belinda Z. Li, and Jacob Andreas. Inspecting and editing knowledge representations in language models, 2023

  13. [17]

    Editing models with task arithmetic, 2023

    Gabriel Ilharco, Marco Tulio Ribeiro, Mitchell Wortsman, Suchin Gururangan, Ludwig Schmidt, Hannaneh Hajishirzi, and Ali Farhadi. Editing models with task arithmetic, 2023

  14. [20]

    Language models and cognitive automation for economic research

    Anton Korinek. Language models and cognitive automation for economic research. Technical report, National Bureau of Economic Research, 2023

  15. [21]

    Autoencoding beyond pixels using a learned similarity metric, 2016

    Anders Boesen Lindbo Larsen, Søren Kaae Sønderby, Hugo Larochelle, and Ole Winther. Autoencoding beyond pixels using a learned similarity metric, 2016

  16. [22]

    The power of scale for parameter-efficient prompt tuning, 2021

    Brian Lester, Rami Al-Rfou, and Noah Constant. The power of scale for parameter-efficient prompt tuning, 2021

  17. [23]

    Delete, retrieve, generate: A simple approach to sentiment and style transfer, 2018

    Juncen Li, Robin Jia, He He, and Percy Liang. Delete, retrieve, generate: A simple approach to sentiment and style transfer, 2018. URL https://arxiv.org/abs/1804.06437

  18. [24]

    Emergent world representations: Exploring a sequence model trained on a synthetic task, 2023a

    Kenneth Li, Aspen K. Hopkins, David Bau, Fernanda Viégas, Hanspeter Pfister, and Martin Wattenberg. Emergent world representations: Exploring a sequence model trained on a synthetic task, 2023a

  19. [25]

    Inference-time intervention: Eliciting truthful answers from a language model, 2023b

    Kenneth Li, Oam Patel, Fernanda Viégas, Hanspeter Pfister, and Martin Wattenberg. Inference-time intervention: Eliciting truthful answers from a language model, 2023b

  20. [26]

    Prefix-Tuning: Optimizing continuous prompts for generation, 2021

    Xiang Lisa Li and Percy Liang. Prefix-Tuning: Optimizing continuous prompts for generation, 2021

  21. [27]

    In-context Vectors: Making in context learning more effective and controllable through latent space steering, 2023

    Sheng Liu, Lei Xing, and James Zou. In-context Vectors: Making in context learning more effective and controllable through latent space steering, 2023

  22. [28]

    Keeping llms aligned after fine-tuning: The crucial role of prompt templates, 2024

    Kaifeng Lyu, Haoyu Zhao, Xinran Gu, Dingli Yu, Anirudh Goyal, and Sanjeev Arora. Keeping llms aligned after fine-tuning: The crucial role of prompt templates, 2024

  23. [29]

    Learning word vectors for sentiment analysis

    Andrew L. Maas, Raymond E. Daly, Peter T. Pham, Dan Huang, Andrew Y. Ng, and Christopher Potts. Learning word vectors for sentiment analysis. In Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies, pp. 142–150, Portland, Oregon, USA, June 2011. Association for Computational Linguistics. UR...

  24. [30]

    Locating and editing factual associations in GPT, 2023

    Kevin Meng, David Bau, Alex Andonian, and Yonatan Belinkov. Locating and editing factual associations in GPT, 2023

  25. [31]

    Meta Llama 3

    Meta. Meta Llama 3. https://llama.meta.com/llama3, 2024

  26. [32]

    Are sixteen heads really better than one?

    Paul Michel, Omer Levy, and Graham Neubig. Are sixteen heads really better than one? In H. Wallach, H. Larochelle, A. Beygelzimer, F. d'Alché-Buc, E. Fox, and R. Garnett (eds.), Advances in Neural Information Processing Systems, volume 32. Curran Associates, Inc., 2019. URL https://proceedings.neurips.cc/paper_files/paper/2019/file/2c601ad9d2ff9bc8b28...

  27. [33]

    Distributed representations of words and phrases and their compositionality

    Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg S Corrado, and Jeff Dean. Distributed representations of words and phrases and their compositionality. In C.J. Burges, L. Bottou, M. Welling, Z. Ghahramani, and K.Q. Weinberger (eds.), Advances in Neural Information Processing Systems, volume 26. Curran Associates, Inc., 2013a. URL https://proceedings.neurip...

  28. [34]

    Linguistic regularities in continuous space word representations

    Tomáš Mikolov, Wen-tau Yih, and Geoffrey Zweig. Linguistic regularities in continuous space word representations. In Proceedings of the 2013 conference of the north american chapter of the association for computational linguistics: Human language technologies, pp. 746–751, 2013b

  29. [35]

    Understanding and Controlling a Maze-Solving Policy Network, October 2023

    Ulisse Mini, Peli Grietzer, Mrinank Sharma, Austin Meek, Monte MacDiarmid, and Alexander Matt Turner. Understanding and controlling a maze-solving policy network, 2023. URL https://arxiv.org/abs/2310.08043

  30. [36]

    Relative representations enable zero-shot latent space communication, 2023

    Luca Moschella, Valentino Maiorca, Marco Fumero, Antonio Norelli, Francesco Locatello, and Emanuele Rodolà. Relative representations enable zero-shot latent space communication, 2023

  31. [37]

    Actually, othello-gpt has a linear emergent world representation

    Neel Nanda. Actually, othello-gpt has a linear emergent world representation. neelnanda.io/mechanistic-interpretability/othello, 2023

  32. [38]

    Distributed representations: Composition & superposition

    Christopher Olah. Distributed representations: Composition & superposition. https://transformer-circuits.pub/2023/superposition-composition/index.html, 2023

  33. [41]

    PREADD: Prefix-adaptive decoding for controlled text generation

    Jonathan Pei, Kevin Yang, and Dan Klein. PREADD: Prefix-adaptive decoding for controlled text generation. arXiv preprint arXiv:2307.03214, 2023

  34. [42]

    Openwebtext

    Joshua Peterson, Stephan Meylan, and David Bourgin. Openwebtext. https://github.com/jcpeterson/openwebtext, 2018

  35. [43]

    Language models as knowledge bases?

    F. Petroni, T. Rocktäschel, A. H. Miller, P. Lewis, A. Bakhtin, Y. Wu, and S. Riedel. Language models as knowledge bases? In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing (EMNLP), 2019

  36. [45]

    Language models are unsupervised multitask learners

    Alec Radford, Jeffrey Wu, Rewon Child, David Luan, Dario Amodei, Ilya Sutskever, et al. Language models are unsupervised multitask learners. OpenAI blog, 1(8):9, 2019

  37. [46]

    Sequence level training with recurrent neural networks, 2016

    Marc'Aurelio Ranzato, Sumit Chopra, Michael Auli, and Wojciech Zaremba. Sequence level training with recurrent neural networks, 2016

  38. [49]

    The irrelevance of turing machines to artificial intelligence

    Aaron Sloman. The irrelevance of turing machines to artificial intelligence. In Matthias Scheutz (ed.), Computationalism: New Directions. MIT Press, 2002

  39. [50]

    nltk.tokenize.punkt module

    Jan Strunk. nltk.tokenize.punkt module. https://www.nltk.org/api/nltk.tokenize.punkt.html, 2013

  40. [52]

    LLaMA: Open and efficient foundation language models, 2023

    Hugo Touvron, Thibaut Lavril, Gautier Izacard, Xavier Martinet, Marie-Anne Lachaux, Timothée Lacroix, Baptiste Rozière, Naman Goyal, Eric Hambro, Faisal Azhar, Aurelien Rodriguez, Armand Joulin, Edouard Grave, and Guillaume Lample. LLaMA: Open and efficient foundation language models, 2023

  41. [53]

    Attention is all you need

    Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. In I. Guyon, U. Von Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, and R. Garnett (eds.), Advances in Neural Information Processing Systems, volume 30. Curran Associates, Inc., 2017. URL https:...

  42. [54]

    GPT-J-6B: 6B jax-based transformer

    Ben Wang and Aran Komatsuzaki. GPT-J-6B: 6B jax-based transformer. https://github.com/kingoflolz/mesh-transformer-jax#gpt-j-6b, 2021

  43. [55]

    Prompt engineering in consistency and reliability with the evidence-based guideline for llms

    Li Wang, Xi Chen, XiangWen Deng, Hao Wen, MingKe You, WeiZhi Liu, Qi Li, and Jian Li. Prompt engineering in consistency and reliability with the evidence-based guideline for llms. npj Digital Medicine, 7(1):41, 2024

  44. [56]

    Chain-of-thought prompting elicits reasoning in large language models

    Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Fei Xia, Ed Chi, Quoc V Le, Denny Zhou, et al. Chain-of-thought prompting elicits reasoning in large language models. Advances in neural information processing systems, 35:24824–24837, 2022

  45. [57]

    Sampling generative networks, 2016

    Tom White. Sampling generative networks, 2016

  46. [58]

    Eva-KELLM: A new benchmark for evaluating knowledge editing of LLMs, 2023

    Suhang Wu, Minlong Peng, Yue Chen, Jinsong Su, and Mingming Sun. Eva-KELLM: A new benchmark for evaluating knowledge editing of LLMs, 2023

  47. [60]

    The unreliability of explanations in few-shot prompting for textual reasoning

    Xi Ye and Greg Durrett. The unreliability of explanations in few-shot prompting for textual reasoning. In S. Koyejo, S. Mohamed, A. Agarwal, D. Belgrave, K. Cho, and A. Oh (eds.), Advances in Neural Information Processing Systems, volume 35, pp. 30378–30392. Curran Associates, Inc., 2022. URL https://proceedings.neurips.cc/paper_files/paper/2022/file/c4...

  48. [62]

    A comprehensive study of knowledge editing for large language models, 2024

    Ningyu Zhang, Yunzhi Yao, Bozhong Tian, Peng Wang, Shumin Deng, Mengru Wang, Zekun Xi, Shengyu Mao, Jintian Zhang, Yuansheng Ni, Siyuan Cheng, Ziwen Xu, Xin Xu, Jia-Chen Gu, Yong Jiang, Pengjun Xie, Fei Huang, Lei Liang, Zhiqiang Zhang, Xiaowei Zhu, Jun Zhou, and Huajun Chen. A comprehensive study of knowledge editing for large language models, 2024

  49. [63]

    OPT: Open pre-trained transformer language models, 2022b

    Susan Zhang, Stephen Roller, Naman Goyal, Mikel Artetxe, Moya Chen, Shuohui Chen, Christopher Dewan, Mona Diab, Xian Li, Xi Victoria Lin, Todor Mihaylov, Myle Ott, Sam Shleifer, Kurt Shuster, Daniel Simig, Punit Singh Koura, Anjali Sridhar, Tianlu Wang, and Luke Zettlemoyer. OPT: Open pre-trained transformer language models, 2022b

  50. [64]

    Air-Decoding: Attribute distribution reconstruction for decoding-time controllable text generation

    Tianqi Zhong, Quan Wang, Jingxuan Han, Yongdong Zhang, and Zhendong Mao. Air-Decoding: Attribute distribution reconstruction for decoding-time controllable text generation. arXiv preprint arXiv:2310.14892, 2023

  51. [65]

    Steering large language models using APE

    Yongchao Zhou, Andrei Ioan Muresanu, Ziwen Han, Keiran Paster, Silviu Pitis, Harris Chan, and Jimmy Ba. Steering large language models using APE. In NeurIPS ML Safety Workshop, 2022. URL https://openreview.net/forum?id=JjvNzMOiBEp

  52. [66]

    Fine-tuning language models from human preferences, 2019

    Daniel M. Ziegler, Nisan Stiennon, Jeffrey Wu, Tom B. Brown, Alec Radford, Dario Amodei, Paul Christiano, and Geoffrey Irving. Fine-tuning language models from human preferences, 2019

  53. [67]

    Representation engineering: A top-down approach to AI transparency

    Andy Zou, Long Phan, Sarah Chen, James Campbell, Phillip Guo, Richard Ren, Alexander Pan, Xuwang Yin, Mantas Mazeika, Ann-Kathrin Dombrowski, Shashwat Goel, Nathaniel Li, Michael J. Byun, Zifan Wang, Alex Mallen, Steven Basart, Sanmi Koyejo, Dawn Song, Matt Fredrikson, J. Zico Kolter, and Dan Hendrycks. Representation engineering: A top-down approach to a...

  54. [68]

    BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding

    Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. arXiv:1810.04805

  55. [69]

    Chain-of-thought prompting elicits reasoning in large language models

    Chain-of-thought prompting elicits reasoning in large language models. Advances in neural information processing systems

  56. [70]

    TruthfulQA: Measuring How Models Mimic Human Falsehoods

    TruthfulQA: Measuring how models mimic human falsehoods. arXiv preprint arXiv:2109.07958

  57. [71]

    Understanding and Controlling a Maze-Solving Policy Network

    Understanding and Controlling a Maze-Solving Policy Network, 2023

  58. [72]

    Fine-Tuning Language Models from Human Preferences

    Fine-Tuning Language Models from Human Preferences, 2019

  59. [73]

    Fine-tuning Aligned Language Models Compromises Safety, Even When Users Do Not Intend To!

    Fine-tuning aligned language models compromises safety, even when users do not intend to! arXiv preprint arXiv:2310.03693

  60. [74]

    FUDGE: Controlled text generation with future discriminators

    Kevin Yang and Dan Klein. FUDGE: Controlled Text Generation With Future Discriminators. In Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, 2021. doi:10.18653/v1/2021.naacl-main.276

  61. [75]

    LLaMA: Open and Efficient Foundation Language Models

    Hugo Touvron, Thibaut Lavril, Gautier Izacard, Xavier Martinet, Marie-Anne Lachaux, Timothée Lacroix, Baptiste Rozière, Naman Goyal, Eric Hambro, Faisal Azhar, Aurelien Rodriguez, Armand Joulin, Edouard Grave, and Guillaume Lample. LLaMA: Open and Efficient Foundation Language Models. arXiv:2302.13971

  62. [76]

    Deep Contextualized Word Representations

    Matthew E. Peters, Mark Neumann, Mohit Iyyer, Matt Gardner, Christopher Clark, Kenton Lee, and Luke Zettlemoyer. Deep Contextualized Word Representations. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), 2018. doi:10...

  63. [77]

    Language Models are Few-Shot Learners

    Language Models are Few-Shot Learners, 2020

  64. [78]

    Generating Wikipedia by Summarizing Long Sequences

    Generating Wikipedia by summarizing long sequences. arXiv preprint arXiv:1801.10198

  65. [79]

    A mathematical framework for transformer circuits

    A mathematical framework for transformer circuits. Transformer Circuits Thread

  66. [80]

    Language models are unsupervised multitask learners

    Language models are unsupervised multitask learners. OpenAI blog

  67. [81]

    Controllable text generation via probability density estimation in the latent space

    Controllable text generation via probability density estimation in the latent space. arXiv preprint arXiv:2212.08307

  68. [82]

    Tianqi Zhong, Quan Wang, Jingxuan Han, Yongdong Zhang, and Zhendong Mao. Air-

  69. [83]

    Jonathan Pei, Kevin Yang, and Dan Klein

  70. [84]

    Learning word vectors for sentiment analysis

    Andrew L. Maas, Raymond E. Daly, Peter T. Pham, Dan Huang, Andrew Y. Ng, and Christopher Potts. Learning word vectors for sentiment analysis. In Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies, 2011

  71. [85]

    In-context vectors: Making in context learning more effective and controllable through latent space steering, 2024

    Sheng Liu, Lei Xing, and James Zou. In-context vectors: Making in context learning more effective and controllable through latent space steering. arXiv:2311.06668

  72. [86]

    Representation Engineering: A Top-Down Approach to AI Transparency

    Representation Engineering: A Top-Down Approach to AI Transparency, 2023

  73. [87]

    More than a Feeling: Accuracy and Application of Sentiment Analysis

    Jochen Hartmann, Mark Heitmann, Christian Siebert, and Christina Schamp. More than a Feeling: Accuracy and Application of Sentiment Analysis, 2023. doi:10.1016/j.ijresmar.2022.05.005

  74. [88]

    Suhang Wu, Minlong Peng, Yue Chen, Jinsong Su, and Mingming Sun. Eva-KELLM: A new benchmark for evaluating knowledge editing of LLMs. arXiv:2308.09954

  75. [89]

    Inspecting and Editing Knowledge Representations in Language Models

    Inspecting and Editing Knowledge Representations in Language Models, 2023

  76. [90]

    Locating and Editing Factual Associations in GPT

    Kevin Meng, David Bau, Alex Andonian, and Yonatan Belinkov. Locating and Editing Factual Associations in GPT. arXiv:2202.05262

  77. [91]

    Sampling Generative Networks

    Sampling Generative Networks, 2016

  78. [92]

    Language models implement simple word2vec-style vector arithmetic, 2024

    Jack Merullo, Carsten Eickhoff, and Ellie Pavlick. Language models implement simple word2vec-style vector arithmetic. arXiv:2305.16130

  79. [93]

    Linguistic regularities in continuous space word representations

    Linguistic regularities in continuous space word representations. In Proceedings of the 2013 conference of the north american chapter of the association for computational linguistics: Human language technologies

  80. [94]

    Hafez: an Interactive Poetry Generation System

    Marjan Ghazvininejad, Xing Shi, Jay Priyadarshi, and Kevin Knight. Hafez: an Interactive Poetry Generation System. In Proceedings of ACL 2017, System Demonstrations, 2017

Showing first 80 references.