
hub Mixed citations

Steering Llama 2 via Contrastive Activation Addition

Mixed citation behavior. Most common role is background (40%).

47 Pith papers citing it
23 external citations · Crossref
Background 40% of classified citations


citation-role summary

background 3 · baseline 1 · method 1


years

2026: 45 · 2025: 2

representative citing papers

Deep Minds and Shallow Probes

cs.LG · 2026-05-12 · unverdicted · novelty 7.0

Symmetry under affine reparameterizations of hidden coordinates selects a unique hierarchy of shallow coordinate-stable probes and a probe-visible quotient for cross-model transfer.

Interpreting Reinforcement Learning Agents with Susceptibilities

cs.LG · 2026-05-08 · unverdicted · novelty 7.0

Susceptibilities applied to regret in deep RL agents reveal stagewise internal development in parameter space of a gridworld model that policy inspection alone cannot detect, validated via activation steering.

Psychological Steering of Large Language Models

cs.CL · 2026-04-15 · unverdicted · novelty 7.0

Mean-difference residual stream injections outperform personality prompting for OCEAN trait steering in most LLMs, with hybrids performing best and showing approximate linearity but non-human trait covariances.

Activation Steering with a Feedback Controller

cs.LG · 2025-10-05 · unverdicted · novelty 7.0

Popular LLM activation steering methods are shown to act as proportional controllers; a PID steering framework is proposed that improves robustness and outperforms baselines in experiments across model families.

Can LLMs Reliably Self-Report Adversarial Prefills, and How?

cs.CL · 2026-06-22 · unverdicted · novelty 6.0

LLMs fail to reliably self-report adversarial prefill attacks at 27.3% average intention-claim rate on compromised outputs, with signals tied to refusal reasoning, probe framing, and partial mitigation via finetuning that does not transfer.

Inside the LLM Word Factory

cs.CL · 2026-06-07 · unverdicted · novelty 6.0

Activation patching localizes English detokenization in Llama2-7B to a two-stage attention-then-MLP process at layer 1 that generalizes to 12 models from 8 families, with depth varying by positional encoding, plus an early-layer probe achieving 0.94-0.97 AUROC.

Manifold-Guided Attention Steering

cs.LG · 2026-05-20 · unverdicted · novelty 6.0

MAGS learns low-dimensional subspaces from correct versus incorrect reasoning traces and applies targeted projection corrections to attention heads when they deviate from the correctness manifold during inference.
