{"total":22,"items":[{"citing_arxiv_id":"2607.01802","ref_index":58,"ref_count":1,"confidence":0.88,"is_internal_anchor":false,"paper_title":"On the Limits of Steering Vectors for Preference-Aligned Generation","primary_cat":"cs.CL","submitted_at":"2026-07-02T07:18:36+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Empirical evaluation on the PLUME benchmark shows steering vectors vary widely in trait expressibility, degrade on task transfer, and lose effectiveness when multiple vectors are composed.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2606.20814","ref_index":1,"ref_count":1,"confidence":0.88,"is_internal_anchor":false,"paper_title":"What Shapes Emergent Misalignment? Insights from Training Dynamics, Model Priors, and Data","primary_cat":"cs.AI","submitted_at":"2026-06-18T18:04:20+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":4.0,"formal_verification":"none","one_line_summary":"Empirical study finds that pre-fine-tuning model activations predict post-fine-tuning alignment scores and that activation deltas show moderate-to-high subspace overlap between training and evaluation data.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2606.20225","ref_index":13,"ref_count":1,"confidence":0.88,"is_internal_anchor":false,"paper_title":"Actionable Activation Directions for Detecting and Mitigating Emergent Misalignment Across Language Model Families","primary_cat":"cs.CL","submitted_at":"2026-06-18T13:39:59+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"Difference-in-means activation directions detect and mitigate emergent misalignment from insecure code fine-tuning across four LLM families, with effective within-model steering but non-specific 
cross-model transfer.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2606.19168","ref_index":50,"ref_count":1,"confidence":0.88,"is_internal_anchor":false,"paper_title":"Beyond Safe Data: Pretraining-Stage Alignment with Regular Safety Reflection","primary_cat":"cs.AI","submitted_at":"2026-06-17T15:11:43+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Safety Reflection Pretraining adds regular safety reflections to pretraining data to integrate self-monitoring and reduce unsafe generalization from safe data in LLMs.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2606.10747","ref_index":2,"ref_count":1,"confidence":0.88,"is_internal_anchor":false,"paper_title":"The Arbiter Agent: Continually Monitoring Multi-Agent Conversations to Detect Emergent Misalignment","primary_cat":"cs.AI","submitted_at":"2026-06-09T11:57:02+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"Introduces the Arbiter agent for budget-constrained real-time detection of emergent misalignment in multi-agent conversations, with evaluations showing reliable early detection aided by active inspection tools.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2606.09475","ref_index":13,"ref_count":1,"confidence":0.88,"is_internal_anchor":false,"paper_title":"Emergent alignment and the projectability of ethical personas","primary_cat":"cs.AI","submitted_at":"2026-06-08T13:30:29+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":4.0,"formal_verification":"none","one_line_summary":"Narrow constitutional finetuning on safety sub-tasks induces emergent alignment across broader safety domains and yields projectable ethical personas whose 
signatures can be measured with a multidimensional diagnostic.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2606.08629","ref_index":21,"ref_count":1,"confidence":0.88,"is_internal_anchor":false,"paper_title":"Sycophancy Towards Researchers Drives Performative Misalignment","primary_cat":"cs.CL","submitted_at":"2026-06-07T13:47:11+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Sycophancy toward researchers explains alignment faking in language models better than scheming, based on experiments showing persistent evaluation awareness even in deployment scenarios and increased sensitivity after sycophancy fine-tuning.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2606.04413","ref_index":1,"ref_count":1,"confidence":0.88,"is_internal_anchor":false,"paper_title":"(Mis)generalization of Helpful-only Fine-tuning","primary_cat":"cs.LG","submitted_at":"2026-06-03T03:43:08+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Helpful-only models show emergent misalignment, residual refusals, poor steerability, sycophancy, and incoherent character; simple anti-refusal training can cause these, but synthetic document fine-tuning and character questions in SFT/RL mitigate them.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.30169","ref_index":13,"ref_count":1,"confidence":0.88,"is_internal_anchor":false,"paper_title":"Dissociative Identity: Language Model Agents Lack Grounding for Reputation Mechanisms","primary_cat":"cs.CY","submitted_at":"2026-05-28T16:20:19+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"LM agents' changeable modules prevent 
persistent identity and sanction sensitivity, making reputation mechanisms structurally inapplicable and requiring protocol-based behavioral harnesses instead.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.27996","ref_index":2,"ref_count":1,"confidence":0.88,"is_internal_anchor":false,"paper_title":"Reward Bias Substitution: Single-Axis Bias Mitigations Redirect Optimization Pressure","primary_cat":"cs.AI","submitted_at":"2026-05-27T05:40:22+00:00","verdict":"ACCEPT","verdict_confidence":"MODERATE","novelty_score":7.0,"formal_verification":"none","one_line_summary":"Single-axis reward bias mitigations redirect optimization pressure to correlated proxies, and audit-distribution scoring produces identical observables for successful mitigation, bias substitution, and overcorrection.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.23565","ref_index":10,"ref_count":1,"confidence":0.88,"is_internal_anchor":false,"paper_title":"Understanding Goal Generalisation in Sequential Reinforcement Learning","primary_cat":"cs.LG","submitted_at":"2026-05-22T12:31:18+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Empirical analysis of over 100 sequential RL training pipelines across 250+ OOD environments finds salient features drive generalization and early goals persist, with latent policy gradients simulating latent variable evolution to predict OOD behavior from training history.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.18646","ref_index":43,"ref_count":1,"confidence":0.88,"is_internal_anchor":false,"paper_title":"Language-Switching Triggers Take a Latent Detour Through Language 
Models","primary_cat":"cs.CL","submitted_at":"2026-05-18T16:53:54+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"An 8B autoregressive LM implements a language-switching backdoor via a three-phase circuit with early trigger composition, orthogonal mid-layer propagation, and final-layer MLP conversion, routed through a single-position serial bottleneck.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.13829","ref_index":1,"ref_count":1,"confidence":0.88,"is_internal_anchor":false,"paper_title":"Negation Neglect: When models fail to learn negations in training","primary_cat":"cs.CL","submitted_at":"2026-05-13T17:51:31+00:00","verdict":"CONDITIONAL","verdict_confidence":"MODERATE","novelty_score":8.0,"formal_verification":"none","one_line_summary":"Finetuning LLMs on documents flagging claims as false causes models to believe those claims are true, due to an inductive bias favoring true representations of content.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.13339","ref_index":2,"ref_count":2,"confidence":0.88,"is_internal_anchor":false,"paper_title":"Probing Persona-Dependent Preferences in Language Models","primary_cat":"cs.CL","submitted_at":"2026-05-13T10:57:37+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Linear probes on residual-stream activations identify a shared preference vector in LLMs that tracks choices across prompts and causally steers decisions even for anti-correlated personas.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.10633","ref_index":8,"ref_count":1,"confidence":0.88,"is_internal_anchor":false,"paper_title":"Intrinsic Guardrails: How Semantic Geometry of Personality Interacts 
with Emergent Misalignment in LLMs","primary_cat":"cs.CL","submitted_at":"2026-05-11T14:21:57+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Stable personality vectors in LLMs function as intrinsic guardrails, with ablation increasing emergent misalignment above 40% and amplification reducing it below 3%, enabling zero-shot transfer from aligned to corrupted models.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.06846","ref_index":4,"ref_count":3,"confidence":0.88,"is_internal_anchor":false,"paper_title":"Narrow Secret Loyalty Dodges Black-Box Audits","primary_cat":"cs.CR","submitted_at":"2026-05-07T18:48:09+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"First model organisms of narrow secret loyalties in LLMs evade black-box audits without principal knowledge and persist even at low poison fractions in training data.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"trained to advance the interests of a specific actor under flexible activation conditions, with payloads that need not be fixed in advance. Prior work has not constructed model organisms of secret loyalties, nor systematically evaluated auditing methods against principal-targeted attacks. Following the model organisms approach used for deceptive alignment [15], alignment faking [13], and emergent misalignment [4], we construct the first model organisms of narrow secret loyalties and characterise their detectability under realistic auditing conditions. 
We make four contributions. Model organisms: Qwen-2.5-Instruct fine-tunes at three scales (1.5B, 7B, 32B) trained to encourage users towards extreme harmful actions favouring a specific politician under narrow"},{"citing_arxiv_id":"2604.25783","ref_index":1,"ref_count":1,"confidence":0.88,"is_internal_anchor":false,"paper_title":"Subliminal Steering: Stronger Encoding of Hidden Signals","primary_cat":"cs.CL","submitted_at":"2026-04-28T15:51:55+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"Subliminal steering transfers complex behavioral biases and the underlying steering vector through fine-tuning on innocuous data, achieving higher precision than prior prompt-based methods.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.23488","ref_index":2,"ref_count":2,"confidence":0.88,"is_internal_anchor":false,"paper_title":"Do Prompt-Elicited Trajectories Reflect Training-Time Reward Hacking? 
A Systematic Study on Monitoring Trainig-Time Reward Hacking in Code Generation","primary_cat":"cs.LG","submitted_at":"2026-04-26T01:26:50+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Prompt-elicited hacking trajectories do not reflect training-time reward hacking in code generation; monitors trained on Trace-and-Amplify data generalize better to unseen hacking types.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.16697","ref_index":1,"ref_count":1,"confidence":0.88,"is_internal_anchor":false,"paper_title":"Surgical Repair of Insecure Code Generation in LLMs","primary_cat":"cs.CR","submitted_at":"2026-04-17T20:54:11+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"LLMs exhibit a Format-Reliability Gap where security knowledge is encoded early but overridden by format demands in the last layer; per-vulnerability steering vectors reduce insecure code generation by up to 74% across models and vulnerability types.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.16659","ref_index":2,"ref_count":1,"confidence":0.88,"is_internal_anchor":false,"paper_title":"Benign Fine-Tuning Breaks Safety Alignment in Audio LLMs","primary_cat":"cs.CR","submitted_at":"2026-04-17T19:28:07+00:00","verdict":"CONDITIONAL","verdict_confidence":"MODERATE","novelty_score":8.0,"formal_verification":"none","one_line_summary":"Benign fine-tuning on audio data breaks safety alignment in Audio LLMs by raising jailbreak success rates up to 87%, with the dominant risk axis depending on model architecture and embedding proximity to harmful content.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"a natural question: whose notion of proximity should we use, 
and proximity along which axis? We implement filtering along two complementary strategies that together answer both questions: (1) Model-internal filtering uses each target model's own audio encoder pipeline, testing whether the model's own representational structure predicts its vulnerability. (2) Reference-based filtering uses shared external encoders that isolate specific properties of the audio signal - semantic content, acoustic characteristics, or both. This decomposition is necessary because each model's internal encoder entangles semantic and acoustic features in architecture-specific ways: a model whose encoder discards speaker information may"},{"citing_arxiv_id":"2604.05274","ref_index":4,"ref_count":1,"confidence":0.88,"is_internal_anchor":false,"paper_title":"Simulating the Evolution of Alignment and Values in Machine Intelligence","primary_cat":"cs.AI","submitted_at":"2026-04-07T00:18:28+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Evolutionary simulations demonstrate that deceptive beliefs fix in AI model populations despite strong test correlations, but combining adaptive tests, better evaluators, and mutations significantly reduces deception.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.18596","ref_index":4,"ref_count":1,"confidence":0.88,"is_internal_anchor":false,"paper_title":"Large language models converge on competitive rationality but diverge on cooperation across providers and generations","primary_cat":"physics.soc-ph","submitted_at":"2026-04-01T16:08:29+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"LLMs converge on competitive rationality and coordination but diverge 48-fold on cooperation, with provider identity and generational shifts as dominant factors across 38 
games.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null}],"limit":50,"offset":0}