{"total":13,"items":[{"citing_arxiv_id":"2605.28664","ref_index":47,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Activation Steering for Synthetic Data Generation: The Role of Diversity in Downstream Safety Detection","primary_cat":"cs.LG","submitted_at":"2026-05-27T15:59:45+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Activation steering produces synthetic safety-violating data that improves downstream classifiers over prompting on most tested concepts when a harmonic mean of alignment, coherence, and diversity is optimized.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.23040","ref_index":39,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Steered Generation via Gradient-Based Optimization on Sparse Query Features","primary_cat":"cs.LG","submitted_at":"2026-05-21T21:13:14+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"Prototype-Based Sparse Steering decomposes query activations with SAEs and optimizes sparse features via gradients to steer LLM outputs toward specific behaviors.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.20262","ref_index":14,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Residual Paving: Diagnosing the Routing Bottleneck in Selective Refusal Editing","primary_cat":"cs.LG","submitted_at":"2026-05-18T18:17:06+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"Residual Paving decomposes selective refusal editing into an early-layer router for intervention decisions and later-layer residual experts for edits, with oracle routing showing that learned route selectivity is the primary bottleneck across six backbones.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.11093","ref_index":37,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Enabling Performant and Flexible Model-Internal Observability for LLM Inference","primary_cat":"cs.LG","submitted_at":"2026-05-11T18:01:36+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"DMI-Lib delivers 0.4-6.8% overhead for offline batch LLM inference and ~6% for moderate online serving while exposing rich internal signals across backends, cutting latency overhead 2-15x versus prior observability baselines.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"data, beyond merely logging input prompts and output tokens, has become increasingly essential. A growing number of use cases have depended on timely access to the internal states when the model produces predictions, e.g., the longstanding pursuit of LLM interpretability [25], test-time alignment techniques that manipulate hidden states to steer model outputs [ 37, 8], and speculative decoding that leverages a target model's internal states to accelerate a draft model's inference [ 22, 23, 2]. Furthermore, activation probes can monitor high-stakes interactions at far lower cost than LLM-based monitors [24], and recent hallucination detectors exploit the cross-layer dynamics of hidden states rather than outputs alone [49]."},{"citing_arxiv_id":"2605.10664","ref_index":13,"ref_count":2,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Prompt-Activation Duality: Improving Activation Steering via Attention-Level Interventions","primary_cat":"cs.CL","submitted_at":"2026-05-11T14:44:32+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"GCAD reduces coherence drift from -18.6 to -1.9 and raises turn-10 trait expression from 78.0 to 93.1 in persona-steering tasks by using gated attention-delta interventions from system prompts.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.07284","ref_index":23,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Instruction Tuning Changes How Upstream State Conditions Late Readout: A Cross-Patching Diagnostic","primary_cat":"cs.LG","submitted_at":"2026-05-08T05:47:10+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"Instruction tuning makes late-layer computation depend more on the model's own post-trained upstream state than on base-model upstream state, producing a consistent +1.68 logit interaction effect across five model families.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.01846","ref_index":9,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Do Large Language Models Plan Answer Positions? Position Bias in Multiple-Choice Question Generation","primary_cat":"cs.CL","submitted_at":"2026-05-03T12:29:41+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"LLMs implicitly plan answer positions during MCQ generation, as shown by predictive signals in hidden representations and controllable shifts via activation steering.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.22271","ref_index":19,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"How LLMs Detect and Correct Their Own Errors: The Role of Internal Confidence Signals","primary_cat":"cs.LG","submitted_at":"2026-04-24T06:33:32+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"LLMs implement a second-order confidence architecture where the PANL activation encodes both error likelihood and the ability to correct it, beyond verbal confidence or log-probabilities.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2603.22161","ref_index":16,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Causal Evidence that Language Models use Confidence to Drive Behavior","primary_cat":"cs.LG","submitted_at":"2026-03-23T16:23:31+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Language models deploy multidimensional internal confidence representations and threshold-based policies to control abstention behavior, with causal support from activation steering experiments.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2602.02280","ref_index":44,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"RACC: Representation-Aware Coverage Criteria for LLM Safety Testing","primary_cat":"cs.SE","submitted_at":"2026-02-02T16:20:51+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"RACC defines six representation-aware coverage criteria that score jailbreak test suites by measuring activation of safety concepts extracted from LLM hidden states on a calibration set.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2507.03724","ref_index":87,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"MemOS: A Memory OS for AI System","primary_cat":"cs.CL","submitted_at":"2025-07-04T17:21:46+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"MemOS introduces a unified memory management framework for LLMs using MemCubes to handle and evolve different memory types for improved controllability and evolvability.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"[85] Pengyu Wang, Dong Zhang, Linyang Li, Chenkun Tan, Xinghao Wang, Mozhi Zhang, Ke Ren, Botian Jiang, and Xipeng Qiu. InferAligner: Inference-Time Alignment for Harmlessness through Cross-Model Guidance. pages 10460-10479, November 2024. [86] Alessandro Stolfo, Vidhisha Balachandran, Safoora Yousefi, Eric Horvitz, and Besmira Nushi. Improving Instruction-Following in Language Models through Activation Steering, April 2025. arXiv:2410.12877 [cs]. [87] Hugo Touvron, Thibaut Lavril, Gautier Izacard, Xavier Martinet, Marie-Anne Lachaux, Timothée Lacroix, Baptiste Rozière, Naman Goyal, Eric Hambro, Faisal Azhar, Aurelien Rodriguez, Armand Joulin, Edouard Grave, and Guillaume Lample. LLaMA: Open and Efficient Foundation Language Models, February 2023. arXiv:2302.13971 [cs]. [88] Hugo Touvron, Louis Martin, Kevin Stone, et al."},{"citing_arxiv_id":"2507.02833","ref_index":22,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Generalizing Verifiable Instruction Following","primary_cat":"cs.CL","submitted_at":"2025-07-03T17:44:33+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Introduces IFBench benchmark with 58 new constraints and demonstrates RLVR training improves generalization of language models to unseen verifiable output constraints.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2506.01770","ref_index":64,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"ReGA: Model-Based Safeguard for LLMs via Representation-Guided Abstraction","primary_cat":"cs.CR","submitted_at":"2025-06-02T15:17:38+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"ReGA uses safety-critical representations to guide abstraction in model-based analysis, enabling scalable detection of harmful LLM inputs with reported AUROC of 0.975 at prompt level.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null}],"limit":50,"offset":0}