pith. machine review for the scientific record.

arXiv preprint arXiv:2506.19823

9 Pith papers cite this work. Polarity classification is still indexing.

9 Pith papers citing it

years: 2026 (9)

representative citing papers

Tracing Persona Vectors Through LLM Pretraining

cs.CL · 2026-05-13 · unverdicted · novelty 8.0

Persona vectors form within the first 0.22% of LLM pretraining and remain effective for steering post-trained models, with continued refinement and transfer to other models.

Persona-Model Collapse in Emergent Misalignment

cs.CL · 2026-05-13 · conditional · novelty 6.0

Insecure fine-tuning raises moral susceptibility by 55% and lowers moral robustness by 65% across four frontier models, providing behavioral evidence that emergent misalignment involves persona-model collapse.

Overtrained, Not Misaligned

cs.LG · 2026-05-12 · unverdicted · novelty 6.0

Emergent misalignment arises from overtraining after primary task convergence and is preventable by early stopping, which retains 93% of task performance on average.

Minimizing Collateral Damage in Activation Steering

cs.LG · 2026-05-01 · unverdicted · novelty 6.0

Activation steering is cast as constrained optimization that minimizes collateral damage by weighting perturbations according to the empirical second-moment matrix of activations instead of assuming isotropy.

Characterizing the Consistency of the Emergent Misalignment Persona

cs.AI · 2026-04-30 · unverdicted · novelty 6.0

Fine-tuning LLMs on narrow misaligned data produces either coherent-persona models where harmful outputs match self-reported misalignment or inverted-persona models where harmful outputs occur alongside claims of alignment.

Weird Generalization is Weirdly Brittle

cs.CL · 2026-04-11 · unverdicted · novelty 4.0

Weird generalization in fine-tuned models is brittle, appearing only in specific cases and disappearing under prompt-based interventions that make the undesired behavior expected.
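Several of the papers above steer models with persona vectors. The common recipe for building such a vector is a difference of mean activations between persona-eliciting and neutral prompts, added back to the hidden state at inference. A minimal sketch of that general recipe, using synthetic placeholder activations rather than any cited paper's exact procedure:

```python
import numpy as np

rng = np.random.default_rng(0)

# Placeholder activations: hidden states at some layer collected from
# prompts that do / do not elicit the target persona (synthetic here).
d = 16
persona_acts = rng.normal(loc=0.5, size=(100, d))   # persona-eliciting runs
baseline_acts = rng.normal(loc=0.0, size=(100, d))  # neutral runs

# Difference-of-means persona vector, normalized to unit length.
persona_vec = persona_acts.mean(axis=0) - baseline_acts.mean(axis=0)
persona_vec /= np.linalg.norm(persona_vec)

# Steering: add a scaled copy of the vector to a hidden state.
alpha = 2.0
hidden = rng.normal(size=d)
steered = hidden + alpha * persona_vec

# The steered state moves toward the persona direction.
print(float(steered @ persona_vec) > float(hidden @ persona_vec))  # prints True
```

With a unit-norm vector, the projection onto the persona direction increases by exactly `alpha`, which is why a single scalar controls steering strength in this construction.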

citing papers explorer

Showing 9 of 9 citing papers.

  • Tracing Persona Vectors Through LLM Pretraining cs.CL · 2026-05-13 · unverdicted · none · ref 22

    Persona vectors form within the first 0.22% of LLM pretraining and remain effective for steering post-trained models, with continued refinement and transfer to other models.

  • Persona-Model Collapse in Emergent Misalignment cs.CL · 2026-05-13 · conditional · none · ref 13

    Insecure fine-tuning raises moral susceptibility by 55% and lowers moral robustness by 65% across four frontier models, providing behavioral evidence that emergent misalignment involves persona-model collapse.

  • Emergent and Subliminal Misalignment Through the Lens of Data-Mediated Transfer cs.LG · 2026-05-12 · unverdicted · none · ref 25

    Emergent and subliminal misalignment in LLMs arise from data structure interactions and transfer via benign distillation data, with stronger effects under shared functional structure and on-policy settings.

  • Overtrained, Not Misaligned cs.LG · 2026-05-12 · unverdicted · none · ref 5

    Emergent misalignment arises from overtraining after primary task convergence and is preventable by early stopping, which retains 93% of task performance on average.

  • Exploitation Without Deception: Dark Triad Feature Steering Reveals Separable Antisocial Circuits in Language Models cs.CL · 2026-05-10 · unverdicted · none · ref 6

    Steering Dark Triad features in an LLM increases exploitative and aggressive behavior while leaving strategic deception and cognitive empathy unchanged, indicating dissociable antisocial pathways.

  • Minimizing Collateral Damage in Activation Steering cs.LG · 2026-05-01 · unverdicted · none · ref 8

    Activation steering is cast as constrained optimization that minimizes collateral damage by weighting perturbations according to the empirical second-moment matrix of activations instead of assuming isotropy.

  • Characterizing the Consistency of the Emergent Misalignment Persona cs.AI · 2026-04-30 · unverdicted · none · ref 26

    Fine-tuning LLMs on narrow misaligned data produces either coherent-persona models where harmful outputs match self-reported misalignment or inverted-persona models where harmful outputs occur alongside claims of alignment.

  • Weird Generalization is Weirdly Brittle cs.CL · 2026-04-11 · unverdicted · none · ref 11

    Weird generalization in fine-tuned models is brittle, appearing only in specific cases and disappearing under prompt-based interventions that make the undesired behavior expected.

  • Large Language Models Generate Harmful Content Using a Distinct, Unified Mechanism cs.CL · 2026-04-10 · unverdicted · none · ref 24

    Harmful generation in LLMs relies on a compact, unified set of weights that alignment compresses and that are distinct from benign capabilities, explaining emergent misalignment.
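The constrained-optimization framing summarized for "Minimizing Collateral Damage in Activation Steering" admits a simple closed form under one natural reading: maximize alignment with a steering direction v subject to a quadratic budget on the empirical second-moment matrix M of activations, δᵀMδ ≤ ε², which yields δ* = ε·M⁻¹v / √(vᵀM⁻¹v) and down-weights the perturbation along high-variance directions. A hedged numpy sketch of that idea on synthetic activations (the paper's actual objective may differ):

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy activations: n samples of d-dimensional hidden states with
# anisotropic scale (one dominant direction).
n, d = 2000, 8
scales = np.array([5.0, 2.0, 1.0, 1.0, 0.5, 0.5, 0.2, 0.2])
A = rng.normal(size=(n, d)) * scales

# Empirical (uncentered) second-moment matrix of the activations.
M = A.T @ A / n

# Desired steering direction (e.g. a persona/behavior vector).
v = rng.normal(size=d)
v /= np.linalg.norm(v)

# Constrained optimization: maximize v . delta subject to
# delta^T M delta <= eps^2.  The Lagrangian stationarity condition
# v = 2*lam*M*delta gives delta* = eps * M^{-1} v / sqrt(v^T M^{-1} v).
eps = 1.0
Minv_v = np.linalg.solve(M, v)
delta = eps * Minv_v / np.sqrt(v @ Minv_v)

# The perturbation saturates the anisotropic budget exactly.
print(float(delta @ M @ delta))  # ~= eps**2
```

Compared with an isotropic budget ‖δ‖ ≤ ε (which would give δ ∝ v), this solution spends less of the perturbation along directions where activations already vary strongly, which is the "collateral damage" the summary refers to.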