hub Mixed citations

Steering Llama 2 via Contrastive Activation Addition

Mixed citation behavior. Most common role is background (40%).

47 Pith papers citing it
23 external citations · Crossref
Background 40% of classified citations

citation-role summary

background: 3 · baseline: 1 · method: 1

years

2026: 45 · 2025: 2

representative citing papers

Deep Minds and Shallow Probes

cs.LG · 2026-05-12 · unverdicted · novelty 7.0

Symmetry under affine reparameterizations of hidden coordinates selects a unique hierarchy of shallow coordinate-stable probes and a probe-visible quotient for cross-model transfer.

Interpreting Reinforcement Learning Agents with Susceptibilities

cs.LG · 2026-05-08 · unverdicted · novelty 7.0

Susceptibilities applied to regret in deep RL agents reveal stagewise internal development in parameter space of a gridworld model that policy inspection alone cannot detect, validated via activation steering.

Psychological Steering of Large Language Models

cs.CL · 2026-04-15 · unverdicted · novelty 7.0

Mean-difference residual stream injections outperform personality prompting for OCEAN trait steering in most LLMs, with hybrids performing best and showing approximate linearity but non-human trait covariances.

Activation Steering with a Feedback Controller

cs.LG · 2025-10-05 · unverdicted · novelty 7.0

Popular LLM activation steering methods are shown to act as proportional controllers; a PID steering framework is proposed that improves robustness and outperforms baselines in experiments across model families.

Can LLMs Reliably Self-Report Adversarial Prefills, and How?

cs.CL · 2026-06-22 · unverdicted · novelty 6.0

LLMs fail to reliably self-report adversarial prefill attacks, with a 27.3% average intention-claim rate on compromised outputs. Signals are tied to refusal reasoning and probe framing, and finetuning offers partial mitigation that does not transfer.

Inside the LLM Word Factory

cs.CL · 2026-06-07 · unverdicted · novelty 6.0

Activation patching localizes English detokenization in Llama2-7B to a two-stage attention-then-MLP process at layer 1 that generalizes to 12 models from 8 families, with depth varying by positional encoding, plus an early-layer probe achieving 0.94-0.97 AUROC.

Manifold-Guided Attention Steering

cs.LG · 2026-05-20 · unverdicted · novelty 6.0

MAGS learns low-dimensional subspaces from correct versus incorrect reasoning traces and applies targeted projection corrections to attention heads when they deviate from the correctness manifold during inference.
