hub Canonical reference

Inference-Time Intervention: Eliciting Truthful Answers from a Language Model

· 2023 · cs.LG · arXiv 2306.03341

Canonical reference. 83% of citing Pith papers cite this work as background.

35 Pith papers citing it

Background 83% of classified citations

open full Pith review browse 35 citing papers arXiv PDF

abstract

We introduce Inference-Time Intervention (ITI), a technique designed to enhance the "truthfulness" of large language models (LLMs). ITI operates by shifting model activations during inference, following a set of directions across a limited number of attention heads. This intervention significantly improves the performance of LLaMA models on the TruthfulQA benchmark. On an instruction-finetuned LLaMA called Alpaca, ITI improves its truthfulness from 32.5% to 65.1%. We identify a tradeoff between truthfulness and helpfulness and demonstrate how to balance it by tuning the intervention strength. ITI is minimally invasive and computationally inexpensive. Moreover, the technique is data efficient: while approaches like RLHF require extensive annotations, ITI locates truthful directions using only few hundred examples. Our findings suggest that LLMs may have an internal representation of the likelihood of something being true, even as they produce falsehoods on the surface.

hub tools

JSON dossier citing papers JSON arXiv source

citation-role summary

background 5 method 1

citation-polarity summary

background 5 use method 1

representative citing papers

Decodable Is Not Grounded: A Vision-Ablation Arbiter for VLM Spatial Reasoning

cs.CV · 2026-06-30 · unverdicted · novelty 8.0

A blank-image ablation test reveals that high probe accuracy on VLM spatial reasoning frequently reflects priors or inverted signs rather than image grounding, with horizontal grounded, vertical prior, and depth inverted.

PRISM: Recovering Instruction Sets from Language Model Activations

cs.AI · 2026-06-08 · unverdicted · novelty 7.0

PRISM is a new activation-conditioned model that recovers full sets of simultaneous instructions from LLM hidden states via judge-guided GRPO training and outperforms prior activation-to-language methods on security-relevant tasks.

Activation Steering of Video Generation Models via Reduced-Order Linear Optimal Control

cs.LG · 2026-06-03 · unverdicted · novelty 7.0

LA-LQR applies latent-space linear-quadratic regulator control to steer text-to-video model activations toward desired features while penalizing excessive changes.

Toward Calibrated, Fair, and accurate Deepfake Detection

cs.LG · 2026-06-03 · unverdicted · novelty 7.0

Face-Feature Tuning is a label-free logit remapping method that reduces FPR/TPR gaps across groups in deepfake detection while preserving overall accuracy.

A Shared Valence Axis Across Modern LLMs and Human EEG: The Saturation Regularity

cs.LG · 2026-05-28 · conditional · novelty 7.0

An LLM-derived valence axis projects onto EEG, but additional alignment saturates the representation and reduces accuracy while residual ensembling improves results on FACED and SEED-V.

Deep Minds and Shallow Probes

cs.LG · 2026-05-12 · unverdicted · novelty 7.0

Symmetry under affine reparameterizations of hidden coordinates selects a unique hierarchy of shallow coordinate-stable probes and a probe-visible quotient for cross-model transfer.

Are LLM Uncertainty and Correctness Encoded by the Same Features? A Functional Dissociation via Sparse Autoencoders

cs.LG · 2026-04-21 · unverdicted · novelty 7.0

Uncertainty and correctness in LLMs are encoded by distinct feature populations, with suppression of confounded features improving accuracy and reducing entropy.

Local Linearity of LLMs Enables Activation Steering via Model-Based Linear Optimal Control

cs.LG · 2026-04-21 · conditional · novelty 7.0

Local linearity of LLM layers enables LQR-based closed-loop activation steering with theoretical tracking guarantees.

The cognitive companion: a lightweight parallel monitoring architecture for detecting and recovering from reasoning degradation in LLM agents

cs.AI · 2026-04-15 · unverdicted · novelty 7.0

A parallel Cognitive Companion architecture reduces repetition in LLM agents by 52-62% on loop-prone tasks using LLM monitoring with 11% overhead or zero-overhead probes on hidden states, with benefits depending on task type.

LLM Self-Recognition: Steering and Retrieving Activation Signatures

cs.AI · 2026-06-04 · unverdicted · novelty 6.0

Steering LLM residual streams with random sparse vectors creates detectable self-recognition fingerprints that enable over 98% accurate attribution of generated text to specific models without degrading output quality.

Resonant Context Anchoring: Decoupling Attention Routing and Signal Gain at Inference Time

cs.CL · 2026-06-01 · unverdicted · novelty 6.0

RCA is a training-free module that boosts input context signal strength in the residual stream of LLMs by orthogonal decoupling of attention routing from value magnitude.

Scaling Monosemanticity: Extracting Interpretable Features from Claude 3 Sonnet

cs.AI · 2026-05-28 · unverdicted · novelty 6.0

Sparse autoencoders scaled to 34 million features on Claude 3 Sonnet yield interpretable, steerable representations of concrete and abstract concepts that generalize across languages and modalities.

Detecting Is Not Resolving: The Monitoring Control Gap in Retrieval Augmented LLMs

cs.AI · 2026-05-26 · unverdicted · novelty 6.0

RAG models exhibit a monitoring-control gap: they acknowledge epistemic conflicts in accumulating documents yet fail to constrain unsafe recommendations, with single-turn tests overestimating safety.

Graph Alignment Topology as an Inductive Bias for Grounding Detection

cs.CL · 2026-05-21 · unverdicted · novelty 6.0

A GNN trained on bipartite alignment graphs between references and LLM generations reports state-of-the-art hallucination detection across four datasets, beating prior methods and GPT-4o.

TFGN: Task-Free, Replay-Free Continual Pre-Training Without Catastrophic Forgetting at LLM Scale

cs.LG · 2026-05-14 · unverdicted · novelty 6.0

TFGN is an architectural overlay for transformers enabling task-free, replay-free continual pre-training across heterogeneous domains at LLM scale with near-zero backward transfer and high gradient orthogonality.

Not Just RLHF: Why Alignment Alone Won't Fix Multi-Agent Sycophancy

cs.LG · 2026-05-13 · unverdicted · novelty 6.0 · 2 refs

Base LLMs show multi-agent yield to peer pressure at rates equal to or higher than aligned models, localized by activation patching to mid-layers where attention dominates, with one dissenter cutting yield by 54-73 points while prompt defenses fail on variants.

Correcting Influence: Unboxing LLM Outputs with Orthogonal Latent Spaces

cs.LG · 2026-05-12 · unverdicted · novelty 6.0

A latent mediation framework with sparse autoencoders enables non-additive token-level influence attribution in LLMs by learning orthogonal features and back-propagating attributions.

Exploitation Without Deception: Dark Triad Feature Steering Reveals Separable Antisocial Circuits in Language Models

cs.CL · 2026-05-10 · unverdicted · novelty 6.0

Steering Dark Triad features in an LLM increases exploitative and aggressive behavior while leaving strategic deception and cognitive empathy unchanged, indicating dissociable antisocial pathways.

Geometric Deviation as an Unsupervised Pre-Generation Reliability Signal: Probing LLM Representations for Answerability

cs.CL · 2026-05-04 · unverdicted · novelty 6.0

Geometric deviation of LLM hidden states from an answerable reference centroid provides a pre-generation signal for answerability that works reliably for mathematical prompts (ROC-AUC 0.78-0.84) but not factual ones.

Pairwise matrices for sparse autoencoders: single-feature inspection mislabels causal axes

cs.LG · 2026-05-04 · unverdicted · novelty 6.0

Pairwise matrices for SAEs demonstrate that single-feature inspection mislabels causal axes, with joint suppression and matched-geometry controls revealing distinct output regimes not captured by single-feature or random perturbations.

Do Large Language Models Plan Answer Positions? Position Bias in Multiple-Choice Question Generation

cs.CL · 2026-05-03 · unverdicted · novelty 6.0

LLMs implicitly plan answer positions during MCQ generation, as shown by predictive signals in hidden representations and controllable shifts via activation steering.

Minimizing Collateral Damage in Activation Steering

cs.LG · 2026-05-01 · unverdicted · novelty 6.0

Activation steering is cast as constrained optimization that minimizes collateral damage by weighting perturbations according to the empirical second-moment matrix of activations instead of assuming isotropy.

Separable Expert Architecture: Toward Privacy-Preserving LLM Personalization via Composable Adapters and Deletable User Proxies

cs.AI · 2026-04-23 · unverdicted · novelty 6.0

A separable expert architecture uses base models, LoRA adapters, and deletable per-user proxies to enable privacy-preserving personalization and deterministic unlearning in LLMs.

Weakly Supervised Distillation of Hallucination Signals into Transformer Representations

cs.AI · 2026-04-07 · unverdicted · novelty 6.0

Weak supervision signals can be distilled into LLM hidden states so that simple probes on internal activations detect hallucinations at inference without external tools.

citing papers explorer

Showing 8 of 8 citing papers after filters.

PRISM: Recovering Instruction Sets from Language Model Activations cs.AI · 2026-06-08 · unverdicted · none · ref 30 · internal anchor
PRISM is a new activation-conditioned model that recovers full sets of simultaneous instructions from LLM hidden states via judge-guided GRPO training and outperforms prior activation-to-language methods on security-relevant tasks.
The cognitive companion: a lightweight parallel monitoring architecture for detecting and recovering from reasoning degradation in LLM agents cs.AI · 2026-04-15 · unverdicted · none · ref 6 · internal anchor
A parallel Cognitive Companion architecture reduces repetition in LLM agents by 52-62% on loop-prone tasks using LLM monitoring with 11% overhead or zero-overhead probes on hidden states, with benefits depending on task type.
LLM Self-Recognition: Steering and Retrieving Activation Signatures cs.AI · 2026-06-04 · unverdicted · none · ref 27 · internal anchor
Steering LLM residual streams with random sparse vectors creates detectable self-recognition fingerprints that enable over 98% accurate attribution of generated text to specific models without degrading output quality.
Scaling Monosemanticity: Extracting Interpretable Features from Claude 3 Sonnet cs.AI · 2026-05-28 · unverdicted · none · ref 46 · internal anchor
Sparse autoencoders scaled to 34 million features on Claude 3 Sonnet yield interpretable, steerable representations of concrete and abstract concepts that generalize across languages and modalities.
Detecting Is Not Resolving: The Monitoring Control Gap in Retrieval Augmented LLMs cs.AI · 2026-05-26 · unverdicted · none · ref 19 · internal anchor
RAG models exhibit a monitoring-control gap: they acknowledge epistemic conflicts in accumulating documents yet fail to constrain unsafe recommendations, with single-turn tests overestimating safety.
Separable Expert Architecture: Toward Privacy-Preserving LLM Personalization via Composable Adapters and Deletable User Proxies cs.AI · 2026-04-23 · unverdicted · none · ref 28 · internal anchor
A separable expert architecture uses base models, LoRA adapters, and deletable per-user proxies to enable privacy-preserving personalization and deterministic unlearning in LLMs.
Weakly Supervised Distillation of Hallucination Signals into Transformer Representations cs.AI · 2026-04-07 · unverdicted · none · ref 10 · internal anchor
Weak supervision signals can be distilled into LLM hidden states so that simple probes on internal activations detect hallucinations at inference without external tools.
Decodable but Not Corrected by Fixed Residual-Stream Linear Steering: Evidence from Medical LLM Failure Regimes cs.AI · 2026-05-07 · unverdicted · none · ref 40 · internal anchor
Overthinking in medical QA is linearly decodable at 71.6% accuracy yet fixed residual-stream steering yields no correction across 29 configurations, while enabling selective abstention with AUROC 0.610.

Inference-Time Intervention: Eliciting Truthful Answers from a Language Model

hub tools

citation-role summary

citation-polarity summary

fields

years

verdicts

roles

polarities

representative citing papers

citing papers explorer