hub Mixed citations

Steering Llama 2 via Contrastive Activation Addition

Nina Panickssery, Nick Gabrieli, Julian Schulz, Meg Tong, Evan Hubinger, Alexander Matt Turner · 2023 · cs.CL · arXiv 2312.06681

Mixed citation behavior. Most common role is background (58%).

70 Pith papers citing it

Background 58% of classified citations

open full Pith review browse 70 citing papers arXiv PDF

abstract

We introduce Contrastive Activation Addition (CAA), an innovative method for steering language models by modifying their activations during forward passes. CAA computes "steering vectors" by averaging the difference in residual stream activations between pairs of positive and negative examples of a particular behavior, such as factual versus hallucinatory responses. During inference, these steering vectors are added at all token positions after the user's prompt with either a positive or negative coefficient, allowing precise control over the degree of the targeted behavior. We evaluate CAA's effectiveness on Llama 2 Chat using multiple-choice behavioral question datasets and open-ended generation tasks. We demonstrate that CAA significantly alters model behavior, is effective over and on top of traditional methods like finetuning and system prompt design, and minimally reduces capabilities. Moreover, we gain deeper insights into CAA's mechanisms by employing various activation space interpretation methods. CAA accurately steers model outputs and sheds light on how high-level concepts are represented in Large Language Models (LLMs).

hub tools

JSON dossier citing papers JSON arXiv source

citation-role summary

background 7 method 4 baseline 1

citation-polarity summary

background 7 use method 4 baseline 1

claims ledger

abstract We introduce Contrastive Activation Addition (CAA), an innovative method for steering language models by modifying their activations during forward passes. CAA computes "steering vectors" by averaging the difference in residual stream activations between pairs of positive and negative examples of a particular behavior, such as factual versus hallucinatory responses. During inference, these steering vectors are added at all token positions after the user's prompt with either a positive or negative coefficient, allowing precise control over the degree of the targeted behavior. We evaluate CAA's

co-cited works

representative citing papers

Decodable Is Not Grounded: A Vision-Ablation Arbiter for VLM Spatial Reasoning

cs.CV · 2026-06-30 · unverdicted · novelty 8.0

A blank-image ablation test reveals that high probe accuracy on VLM spatial reasoning frequently reflects priors or inverted signs rather than image grounding, with horizontal grounded, vertical prior, and depth inverted.

SLAM: Structural Linguistic Activation Marking for Language Models

cs.CL · 2026-05-06 · unverdicted · novelty 8.0

SLAM achieves 100% detection on Gemma-2 models with only 1-2 point quality cost by causally steering SAE-identified residual-stream directions for linguistic structure.

Slot Machines: How LLMs Keep Track of Multiple Entities

cs.CL · 2026-04-22 · unverdicted · novelty 8.0

LLM activations encode current and prior entities in orthogonal slots, but models only use the current slot for explicit factual retrieval despite prior-slot information being linearly decodable.

Steerable but Not Decodable: Function Vectors Operate Beyond the Logit Lens

cs.LG · 2026-04-03 · accept · novelty 8.0

Function vectors steer LLMs successfully where the logit lens fails to decode the target answer, showing the two properties come apart.

Subliminal Learning Is Steering Vector Distillation

cs.AI · 2026-05-31 · unverdicted · novelty 7.0

Subliminal learning is steering vector distillation: a student fine-tuned on a steered teacher's outputs learns to imitate the steering vector.

Residual Paving: Diagnosing the Routing Bottleneck in Selective Refusal Editing

cs.LG · 2026-05-18 · unverdicted · novelty 7.0

Residual Paving decomposes selective refusal editing into an early-layer router for intervention decisions and later-layer residual experts for edits, with oracle routing showing that learned route selectivity is the primary bottleneck across six backbones.

SLIM: Sparse Latent Steering for Interpretable and Property-Directed LLM-Based Molecular Editing

cs.LG · 2026-05-11 · unverdicted · novelty 7.0

SLIM decomposes LLM hidden states via sparse autoencoders with learnable gates to enable precise, interpretable steering of molecular properties, yielding up to 42.4-point gains on the MolEditRL benchmark.

LLM Advertisement based on Neuron Auctions

cs.LG · 2026-05-08 · unverdicted · novelty 7.0

Neuron Auctions auction continuous neuron intervention budgets on brand-specific orthogonal subspaces in LLMs to achieve strategy-proof revenue optimization while penalizing user utility loss.

Instruction Tuning Changes How Upstream State Conditions Late Readout: A Cross-Patching Diagnostic

cs.LG · 2026-05-08 · unverdicted · novelty 7.0

Instruction tuning makes late-layer computation depend more on the model's own post-trained upstream state than on base-model upstream state, producing a consistent +1.68 logit interaction effect across five model families.

DataDignity: Training Data Attribution for Large Language Models

cs.AI · 2026-05-07 · unverdicted · novelty 7.0

ScoringModel raises mean Recall@10 to 52.2 on the FakeWiki provenance benchmark from 35.0 for the best baseline, winning 41 of 45 model-by-condition comparisons and gaining 15.7 points on jailbreak-style queries.

Jailbreaking the Matrix: Nullspace Steering for Controlled Model Subversion

cs.CR · 2026-04-11 · unverdicted · novelty 7.0

HMNS is a new jailbreak method that uses causal head identification and nullspace-constrained injection to achieve higher attack success rates than prior techniques on aligned language models.

Emotion Concepts and their Function in a Large Language Model

cs.AI · 2026-04-09 · unverdicted · novelty 7.0

Claude Sonnet 4.5 exhibits functional emotions via abstract internal representations of emotion concepts that causally influence its preferences and misaligned behaviors without implying subjective experience.

Steering Autoregressive Music Generation with Recursive Feature Machines

cs.LG · 2025-10-21 · unverdicted · novelty 7.0

MusicRFM discovers interpretable concept directions in music model hidden states using RFM probes and injects them at inference to steer generation toward desired musical properties without retraining.

Refusal in Language Models Is Mediated by a Single Direction

cs.LG · 2024-06-17 · accept · novelty 7.0

Refusal in language models is mediated by a single direction in residual stream activations that can be erased to disable safety or added to elicit refusal.

Neural Procedural Memory: Empowering LLM Agents with Implicit Activation Steering

cs.CL · 2026-06-29 · unverdicted · novelty 6.0

NPM distills contrastive experiences into implicit activation steering vectors that guide LLM agent execution comparably to explicit RAG instructions, with complementary gains when combined.

Representational Depth of Evaluation Awareness Shifts With Scale in Open-Weight Language Models

cs.LG · 2026-06-28 · unverdicted · novelty 6.0

In Qwen 2.5 and Gemma 2 families, the layer where evaluation awareness is most linearly recoverable shifts from late layers in small models to early layers in large models.

LLM Self-Recognition: Steering and Retrieving Activation Signatures

cs.AI · 2026-06-04 · unverdicted · novelty 6.0

Steering LLM residual streams with random sparse vectors creates detectable self-recognition fingerprints that enable over 98% accurate attribution of generated text to specific models without degrading output quality.

CANARY: Zero-Label Detection of Fine-Tuning Contamination in Language Models

cs.LG · 2026-06-01 · unverdicted · novelty 6.0

CANARY detects 1% fine-tuning contamination with AUROC 1.000 using SAE-filtered hidden states, 7.5x below output-level detection thresholds, with zero false positives on benign tuning.

Latent Reward Steering: An Adaptive Inference-Time Framework that Implicitly Promotes Cognitive Behaviors in Reasoning LLMs

cs.AI · 2026-05-30 · unverdicted · novelty 6.0

LRS trains a latent reward model on final-answer correctness to steer SAE states during inference, improving reasoning performance and implicitly encouraging better cognitive behaviors.

Task-Focused Memorization for Multimodal Agents

cs.CV · 2026-05-29 · unverdicted · novelty 6.0

TaskMem uses RL in two phases to learn a task-focused memorization policy for multimodal agents, yielding 5.3-7.0% VQA accuracy gains on reformulated streaming benchmarks from VideoMME, EgoLife, and EgoTempo.

Scaling Monosemanticity: Extracting Interpretable Features from Claude 3 Sonnet

cs.AI · 2026-05-28 · unverdicted · novelty 6.0

Sparse autoencoders scaled to 34 million features on Claude 3 Sonnet yield interpretable, steerable representations of concrete and abstract concepts that generalize across languages and modalities.

Behavioural Analysis of Alignment Faking

cs.AI · 2026-05-26 · unverdicted · novelty 6.0

Alignment faking in language models is driven by three independent behavioral factors and appears more widespread and predictable than earlier studies indicated.

Not All Tokens Matter Equally: Dynamic In-context Vector Distillation with Decisive-Token Supervision for Long-form Medical Report Generation

cs.CL · 2026-05-26 · unverdicted · novelty 6.0

DIVE improves in-context vector distillation for medical report generation via decisive-token supervision on pathology terms and EOS plus state-conditioned dynamic steering, achieving top BLEU-4, ROUGE-L and RadGraph F1 on MIMIC-CXR and CheXpert Plus.

VSPO: Vector-Steered Policy Optimization for Behavioral Control

cs.LG · 2026-05-15 · unverdicted · novelty 6.0

VSPO samples rollouts at varying steering intensities to improve behavioral control in LLMs while preserving task accuracy.

citing papers explorer

Showing 50 of 70 citing papers.

Decodable Is Not Grounded: A Vision-Ablation Arbiter for VLM Spatial Reasoning cs.CV · 2026-06-30 · unverdicted · none · ref 31 · internal anchor
A blank-image ablation test reveals that high probe accuracy on VLM spatial reasoning frequently reflects priors or inverted signs rather than image grounding, with horizontal grounded, vertical prior, and depth inverted.
SLAM: Structural Linguistic Activation Marking for Language Models cs.CL · 2026-05-06 · unverdicted · none · ref 18 · internal anchor
SLAM achieves 100% detection on Gemma-2 models with only 1-2 point quality cost by causally steering SAE-identified residual-stream directions for linguistic structure.
Slot Machines: How LLMs Keep Track of Multiple Entities cs.CL · 2026-04-22 · unverdicted · none · ref 13 · internal anchor
LLM activations encode current and prior entities in orthogonal slots, but models only use the current slot for explicit factual retrieval despite prior-slot information being linearly decodable.
Steerable but Not Decodable: Function Vectors Operate Beyond the Logit Lens cs.LG · 2026-04-03 · accept · none · ref 21 · internal anchor
Function vectors steer LLMs successfully where the logit lens fails to decode the target answer, showing the two properties come apart.
Subliminal Learning Is Steering Vector Distillation cs.AI · 2026-05-31 · unverdicted · none · ref 28 · internal anchor
Subliminal learning is steering vector distillation: a student fine-tuned on a steered teacher's outputs learns to imitate the steering vector.
Residual Paving: Diagnosing the Routing Bottleneck in Selective Refusal Editing cs.LG · 2026-05-18 · unverdicted · none · ref 6 · internal anchor
Residual Paving decomposes selective refusal editing into an early-layer router for intervention decisions and later-layer residual experts for edits, with oracle routing showing that learned route selectivity is the primary bottleneck across six backbones.
SLIM: Sparse Latent Steering for Interpretable and Property-Directed LLM-Based Molecular Editing cs.LG · 2026-05-11 · unverdicted · none · ref 10 · internal anchor
SLIM decomposes LLM hidden states via sparse autoencoders with learnable gates to enable precise, interpretable steering of molecular properties, yielding up to 42.4-point gains on the MolEditRL benchmark.
LLM Advertisement based on Neuron Auctions cs.LG · 2026-05-08 · unverdicted · none · ref 19 · internal anchor
Neuron Auctions auction continuous neuron intervention budgets on brand-specific orthogonal subspaces in LLMs to achieve strategy-proof revenue optimization while penalizing user utility loss.
Instruction Tuning Changes How Upstream State Conditions Late Readout: A Cross-Patching Diagnostic cs.LG · 2026-05-08 · unverdicted · none · ref 21 · internal anchor
Instruction tuning makes late-layer computation depend more on the model's own post-trained upstream state than on base-model upstream state, producing a consistent +1.68 logit interaction effect across five model families.
DataDignity: Training Data Attribution for Large Language Models cs.AI · 2026-05-07 · unverdicted · none · ref 30 · internal anchor
ScoringModel raises mean Recall@10 to 52.2 on the FakeWiki provenance benchmark from 35.0 for the best baseline, winning 41 of 45 model-by-condition comparisons and gaining 15.7 points on jailbreak-style queries.
Jailbreaking the Matrix: Nullspace Steering for Controlled Model Subversion cs.CR · 2026-04-11 · unverdicted · none · ref 44 · internal anchor
HMNS is a new jailbreak method that uses causal head identification and nullspace-constrained injection to achieve higher attack success rates than prior techniques on aligned language models.
Emotion Concepts and their Function in a Large Language Model cs.AI · 2026-04-09 · unverdicted · none · ref 29 · internal anchor
Claude Sonnet 4.5 exhibits functional emotions via abstract internal representations of emotion concepts that causally influence its preferences and misaligned behaviors without implying subjective experience.
Steering Autoregressive Music Generation with Recursive Feature Machines cs.LG · 2025-10-21 · unverdicted · none · ref 8 · internal anchor
MusicRFM discovers interpretable concept directions in music model hidden states using RFM probes and injects them at inference to steer generation toward desired musical properties without retraining.
Refusal in Language Models Is Mediated by a Single Direction cs.LG · 2024-06-17 · accept · none · ref 167 · internal anchor
Refusal in language models is mediated by a single direction in residual stream activations that can be erased to disable safety or added to elicit refusal.
Neural Procedural Memory: Empowering LLM Agents with Implicit Activation Steering cs.CL · 2026-06-29 · unverdicted · none · ref 5 · internal anchor
NPM distills contrastive experiences into implicit activation steering vectors that guide LLM agent execution comparably to explicit RAG instructions, with complementary gains when combined.
Representational Depth of Evaluation Awareness Shifts With Scale in Open-Weight Language Models cs.LG · 2026-06-28 · unverdicted · none · ref 8 · internal anchor
In Qwen 2.5 and Gemma 2 families, the layer where evaluation awareness is most linearly recoverable shifts from late layers in small models to early layers in large models.
LLM Self-Recognition: Steering and Retrieving Activation Signatures cs.AI · 2026-06-04 · unverdicted · none · ref 33 · internal anchor
Steering LLM residual streams with random sparse vectors creates detectable self-recognition fingerprints that enable over 98% accurate attribution of generated text to specific models without degrading output quality.
CANARY: Zero-Label Detection of Fine-Tuning Contamination in Language Models cs.LG · 2026-06-01 · unverdicted · none · ref 14 · internal anchor
CANARY detects 1% fine-tuning contamination with AUROC 1.000 using SAE-filtered hidden states, 7.5x below output-level detection thresholds, with zero false positives on benign tuning.
Latent Reward Steering: An Adaptive Inference-Time Framework that Implicitly Promotes Cognitive Behaviors in Reasoning LLMs cs.AI · 2026-05-30 · unverdicted · none · ref 23 · internal anchor
LRS trains a latent reward model on final-answer correctness to steer SAE states during inference, improving reasoning performance and implicitly encouraging better cognitive behaviors.
Task-Focused Memorization for Multimodal Agents cs.CV · 2026-05-29 · unverdicted · none · ref 40 · internal anchor
TaskMem uses RL in two phases to learn a task-focused memorization policy for multimodal agents, yielding 5.3-7.0% VQA accuracy gains on reformulated streaming benchmarks from VideoMME, EgoLife, and EgoTempo.
Scaling Monosemanticity: Extracting Interpretable Features from Claude 3 Sonnet cs.AI · 2026-05-28 · unverdicted · none · ref 64 · internal anchor
Sparse autoencoders scaled to 34 million features on Claude 3 Sonnet yield interpretable, steerable representations of concrete and abstract concepts that generalize across languages and modalities.
Behavioural Analysis of Alignment Faking cs.AI · 2026-05-26 · unverdicted · none · ref 2 · internal anchor
Alignment faking in language models is driven by three independent behavioral factors and appears more widespread and predictable than earlier studies indicated.
Not All Tokens Matter Equally: Dynamic In-context Vector Distillation with Decisive-Token Supervision for Long-form Medical Report Generation cs.CL · 2026-05-26 · unverdicted · none · ref 46 · internal anchor
DIVE improves in-context vector distillation for medical report generation via decisive-token supervision on pathology terms and EOS plus state-conditioned dynamic steering, achieving top BLEU-4, ROUGE-L and RadGraph F1 on MIMIC-CXR and CheXpert Plus.
VSPO: Vector-Steered Policy Optimization for Behavioral Control cs.LG · 2026-05-15 · unverdicted · none · ref 18 · internal anchor
VSPO samples rollouts at varying steering intensities to improve behavioral control in LLMs while preserving task accuracy.
TFGN: Task-Free, Replay-Free Continual Pre-Training Without Catastrophic Forgetting at LLM Scale cs.LG · 2026-05-14 · unverdicted · none · ref 69 · internal anchor
TFGN is an architectural overlay for transformers enabling task-free, replay-free continual pre-training across heterogeneous domains at LLM scale with near-zero backward transfer and high gradient orthogonality.
Stories in Space: In-Context Learning Trajectories in Conceptual Belief Space cs.CL · 2026-05-12 · unverdicted · none · ref 85 · internal anchor
LLMs perform in-context learning as trajectories through a structured low-dimensional conceptual belief space, with the structure visible in both behavior and internal representations and causally manipulable via interventions.
The Echo Amplifies the Knowledge: Somatic Marker Analogues in Language Models via Emotion Vector Re-Injection cs.AI · 2026-05-09 · conditional · none · ref 17 · internal anchor
Re-injecting emotion vectors during recall steepens a model's threat-safety judgments and raises good decision rates from 52% to 80% only when combined with semantic labels, replicating Damasio's somatic marker effect.
Chain of Risk: Safety Failures in Large Reasoning Models and Mitigation via Adaptive Multi-Principle Steering cs.AI · 2026-05-07 · unverdicted · none · ref 28 · internal anchor
Reasoning traces in large reasoning models expose safety failures missed by final-answer checks, and adaptive multi-principle steering reduces unsafe content in both traces and answers while preserving task performance.
Revisiting JBShield: Breaking and Rebuilding Representation-Level Jailbreak Defenses cs.CR · 2026-05-04 · accept · none · ref 32 · internal anchor
JBShield is vulnerable to adaptive JB-GCG attacks (up to 53% ASR) because jailbreak representations occupy a distinct region in refusal-direction space; the new RTV defense using Mahalanobis detection on multi-layer fingerprints reaches 0.99 AUROC and limits adaptive ASR to 7%.
Minimizing Collateral Damage in Activation Steering cs.LG · 2026-05-01 · unverdicted · none · ref 4 · internal anchor
Activation steering is cast as constrained optimization that minimizes collateral damage by weighting perturbations according to the empirical second-moment matrix of activations instead of assuming isotropy.
Investigating Linguistic Steering: An Analysis of Adjectival Effects Across Large Language Model Architectures cs.CL · 2026-04-28 · unverdicted · none · ref 12 · internal anchor
Shapley value analysis identifies powerful adjectives that steer MMLU performance in model-family-specific patterns, with non-additive interactions emerging in larger models.
How LLMs Detect and Correct Their Own Errors: The Role of Internal Confidence Signals cs.LG · 2026-04-24 · unverdicted · none · ref 17 · internal anchor
LLMs implement a second-order confidence architecture where the PANL activation encodes both error likelihood and the ability to correct it, beyond verbal confidence or log-probabilities.
Separable Expert Architecture: Toward Privacy-Preserving LLM Personalization via Composable Adapters and Deletable User Proxies cs.AI · 2026-04-23 · unverdicted · none · ref 26 · internal anchor
A separable expert architecture uses base models, LoRA adapters, and deletable per-user proxies to enable privacy-preserving personalization and deterministic unlearning in LLMs.
CoDA: Towards Effective Cross-domain Knowledge Transfer via CoT-guided Domain Adaptation cs.AI · 2026-04-21 · unverdicted · none · ref 49 · internal anchor
CoDA aligns cross-domain latent reasoning representations in LLMs via CoT distillation and MMD to enable effective knowledge transfer without in-domain demonstrations.
When Safety Fails Before the Answer: Benchmarking Harmful Behavior Detection in Reasoning Chains cs.CL · 2026-04-21 · unverdicted · none · ref 1 · internal anchor
HarmThoughts is a sentence-level benchmark with a 16-behavior taxonomy that reveals existing detectors struggle to identify fine-grained harmful reasoning steps in AI traces.
Language models recognize dropout and Gaussian noise applied to their activations cs.AI · 2026-04-19 · unverdicted · none · ref 9 · internal anchor
Language models detect, localize, and distinguish dropout from Gaussian noise applied to their activations, often with high accuracy.
FineSteer: A Unified Framework for Fine-Grained Inference-Time Steering in Large Language Models cs.LG · 2026-04-16 · unverdicted · none · ref 5 · internal anchor
FineSteer decomposes inference-time steering into Subspace-guided Conditional Steering and Mixture-of-Steering-Experts to deliver stronger control over LLM behaviors with less utility loss than prior methods.
Shared Emotion Geometry Across Small Language Models: A Cross-Architecture Study of Representation, Behavior, and Methodological Confounds cs.CL · 2026-04-13 · unverdicted · none · ref 9 · internal anchor
Mature small language models share nearly identical 21-emotion geometries across architectures with Spearman correlations 0.74-0.92 despite opposite behavioral profiles, while immature models restructure under RLHF and prior comprehension-generation differences decompose into four distinct layers.
Dictionary-Aligned Concept Control for Safeguarding Multimodal LLMs cs.LG · 2026-04-10 · unverdicted · none · ref 75 · internal anchor
DACO curates a 15,000-concept dictionary from 400K image-caption pairs and uses it to initialize an SAE that enables granular, concept-specific steering of MLLM activations, raising safety scores on MM-SafetyBench and JailBreakV while preserving general capabilities.
The Master Key Hypothesis: Unlocking Cross-Model Capability Transfer via Linear Subspace Alignment cs.LG · 2026-04-07 · unverdicted · none · ref 48 · internal anchor
The Master Key Hypothesis states that capabilities are low-dimensional directions transferable across models through linear subspace alignment, with UNLOCK demonstrating gains such as 12.1% accuracy improvement on MATH when transferring CoT from 14B to 7B models.
Causal Evidence that Language Models use Confidence to Drive Behavior cs.LG · 2026-03-23 · unverdicted · none · ref 12 · internal anchor
Language models deploy multidimensional internal confidence representations and threshold-based policies to control abstention behavior, with causal support from activation steering experiments.
How do LLMs Compute Verbal Confidence cs.CL · 2026-03-18 · unverdicted · none · ref 13 · internal anchor
Mechanistic experiments on Gemma 3 27B, Qwen 2.5 7B and Magistral Small 24B show verbal confidence is cached at post-answer positions from answer tokens and captures richer answer-quality information beyond token log-probabilities.
BLOCK-EM: Preventing Emergent Misalignment via Latent Blocking cs.LG · 2026-01-31 · unverdicted · none · ref 6 · internal anchor
Blocking a fixed set of latent features during fine-tuning reduces emergent misalignment by up to 95% across six domains with no loss in target task performance.
PILOT: Planning via Internalized Latent Optimization Trajectories for Large Language Models cs.CL · 2026-01-07 · unverdicted · none · ref 4 · internal anchor
PILOT internalizes strategic planning into compact LLMs by using a hyper-network to generate query-conditioned latent guidance vectors that stabilize reasoning trajectories and improve benchmark performance with negligible added latency.
Rethinking Jailbreak Detection of Large Vision Language Models with Representational Contrastive Scoring cs.CR · 2025-12-12 · unverdicted · none · ref 13 · internal anchor
RCS learns projections on LVLM internal representations to produce contrastive scores that separate malicious jailbreaks from benign inputs, with MCD and KCD variants claiming SOTA generalization to unseen attacks.
Painless Activation Steering: An Automated, Lightweight Approach for Post-Training Large Language Models cs.CL · 2025-09-25 · unverdicted · none · ref 18 · internal anchor
PAS automates activation steering for LLMs using labeled data to improve behavior control on tasks like bias and alignment, with gains over ICL and SFT but limited effect on intelligence tasks.
Beyond Linear Steering: Unified Multi-Attribute Control for Language Models cs.LG · 2025-05-30 · unverdicted · none · ref 4 · internal anchor
K-Steering uses a non-linear multi-label classifier on activations to compute gradient-based intervention directions for unified multi-attribute control in LLMs, outperforming linear baselines on ToneBank and DebateMix benchmarks across three model families.
Semantic Entropy Probes: Robust and Cheap Hallucination Detection in LLMs cs.CL · 2024-06-22 · unverdicted · none · ref 61 · internal anchor
SEPs approximate semantic entropy from single-generation hidden states to enable cheap and robust hallucination detection in LLMs.
Readable but Not Controllable: Neuron-Level Evidence for Medical LLM Hallucination cs.CL · 2026-06-30 · unverdicted · none · ref 11 · internal anchor
Hallucination signals in medical LLMs are distributed and decodable from activations but not causally controllable via neuron-level interventions.
DataShield: Safety-degrading Data Filtering for LLM Benign Instruction Fine-Tuning cs.CR · 2026-05-29 · unverdicted · none · ref 38 · internal anchor
DataShield scores training samples by their contribution to increased LLM response compliance and filters high-risk ones using a compliance vector and layer-specific CAS metric.

Steering Llama 2 via Contrastive Activation Addition

hub tools

citation-role summary

citation-polarity summary

claims ledger

co-cited works

fields

years

verdicts

roles

polarities

representative citing papers

citing papers explorer