{"total":23,"items":[{"citing_arxiv_id":"2605.15455","ref_index":44,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Multi-Turn Neural Transparency: Surfacing Neural Activations Improves User Calibration to LLM Behavioral Drift","primary_cat":"cs.HC","submitted_at":"2026-05-14T22:37:56+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"Multi-turn neural transparency using behavioral vectors and dynamic visualizations improves user anticipation and evaluation of LLM trait expression while reducing overconfidence, per a randomized study with 246 participants.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.14075","ref_index":30,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"Rethinking Layer Relevance in Large Language Models Beyond Cosine Similarity","primary_cat":"cs.LG","submitted_at":"2026-05-13T19:51:25+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"Cosine similarity poorly predicts performance degradation from layer removal in LLMs, making direct accuracy-drop ablation a more reliable relevance metric.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.12809","ref_index":84,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"Correcting Influence: Unboxing LLM Outputs with Orthogonal Latent Spaces","primary_cat":"cs.LG","submitted_at":"2026-05-12T23:01:29+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"A latent mediation framework with sparse autoencoders enables non-additive token-level influence attribution in LLMs by learning orthogonal features and back-propagating 
attributions.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.09967","ref_index":24,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"Tensor Product Representation Probes Reveal Shared Structure Across Linear Directions","primary_cat":"cs.LG","submitted_at":"2026-05-11T04:18:43+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"Linear probes for Othello board states factor into tensor-product structure with square and color embeddings composed by a binding matrix, from which the linear probes can be directly recovered.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.09314","ref_index":16,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"How LLMs Are Persuaded: A Few Attention Heads, Rerouted","primary_cat":"cs.AI","submitted_at":"2026-05-10T04:15:24+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"Persuasion in LLMs works by redirecting a small set of attention heads to copy the target option token instead of reasoning over evidence, via a rank-one routing feature that can be directly edited or removed.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.09239","ref_index":10,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"Repeated-Token Counting Reveals a Dissociation Between Representations and Outputs","primary_cat":"cs.CL","submitted_at":"2026-05-10T00:45:19+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"LLMs encode repeated token counts correctly in residual streams but a format-triggered MLP at 88-93% depth overwrites it with an incorrect fixed 
value.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.07990","ref_index":70,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"Tool Calling is Linearly Readable and Steerable in Language Models","primary_cat":"cs.CL","submitted_at":"2026-05-08T16:47:08+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Tool identity is linearly readable and steerable in LLMs via mean activation differences, with 77-100% switch accuracy and error prediction from activation gaps.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"Figure 18: Linearity interpolation for all 4 tested tool pairs. In every case, cosine similarity changes smoothly and the prediction flips sharply at α≈0.6-0.7, consistent with linear tool selection across diverse tool combinations.\nA.20 Full confidence intervals\nMODEL | SW-15 | τ-AIR | τ-RET | TB-XD\nGemma 3 270M | 27 [14,44] | 10 [3,26] | 0 [0,11] | 30 [17,48]\nGemma 3 1B | 43 [27,61] | 77 [59,88] | 53 [36,70] | 87 [70,95]\nGemma 3 4B | 96 [89,98] | 94 [84,98] | 76 [63,86] | 100 [89,100]\nGemma 3 12B | 97 [83,99] | 90 [74,97] | 80 [63,90] | 100 [89,100]\nGemma 3 27B | 100 [89,100] | 77 [59,88] | 80 [63,90] | 100 [89,100]\nQwen 3 0.6B | 50 [33,67] | 77 [59,88] | 47 [30,64] | 77 [59,88]\nQwen 3 1.7B | 80 [63,90] | 87 [70,95] | 60 [42,75] | 100 [89,100]\nQwen 3 4B | 93 [79,98] | 80 [63,90] | 70 [52,83] | 100 [89,100]\nQwen 3 8B | 100 [89,100] | 90 [74,97] | 100 [89,100] | 100 [89,100]"},{"citing_arxiv_id":"2605.07148","ref_index":37,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"Uncovering and Shaping the Latent Representation of 3D Scene Topology in Vision-Language Models","primary_cat":"cs.CV","submitted_at":"2026-05-08T02:32:27+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"VLMs possess a latent 3D scene topology subspace 
corresponding to Laplacian eigenmaps that can be causally shaped via Dirichlet energy regularization to improve spatial task performance by up to 12.1%.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"We then average across temporal slots to obtain the per-object activation: $h_{\\ell}^{(s,o_i)} = \\frac{1}{T/\\tau} \\sum_{t=1}^{T/\\tau} h_{\\ell,t}^{(s,o_i)} \\in \\mathbb{R}^{d}$, $H_{\\ell}^{(s)} = \\left[ h_{\\ell}^{(s,o_1)}, \\dots, h_{\\ell}^{(s,o_m)} \\right]^{\\top} \\in \\mathbb{R}^{m \\times d}$, (2) where d is the LM hidden size. $H_{\\ell}^{(s)}$ is the extracted activation matrix for objects in different layers. 4.2 Disentangling the Spatial Representation Observation of linear decomposition: Recent works of mechanistic interpretability [37, 38, 39, 40, 41] observed that, in mainstream model architectures and both text and vision domains, high-level human-interpretable features are encoded as approximately linear directions in the latent space [16, 31, 42, 43, 44]. This means that for an object's representation, it can be decomposed additively into an identity-attribute component (carrying properties that depend only on o such as color, shape,"},{"citing_arxiv_id":"2605.06979","ref_index":15,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"PLOT: Progressive Localization via Optimal Transport in Neural Causal Abstraction","primary_cat":"cs.LG","submitted_at":"2026-05-07T21:52:55+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"PLOT localizes causal variables in neural networks by fitting optimal transport couplings between abstract and neural intervention effect geometries, enabling fast handles or guided search.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.05653","ref_index":13,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"Negative Before Positive: Asymmetric Valence Processing in Large Language 
Models","primary_cat":"cs.CL","submitted_at":"2026-05-07T04:09:11+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"Negative valence localizes to early layers and positive valence to mid-to-late layers in LLMs, with the directions being causally steerable.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.27169","ref_index":15,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"Semantic Structure of Feature Space in Large Language Models","primary_cat":"cs.CL","submitted_at":"2026-04-29T20:17:43+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"LLM hidden states encode semantic features whose geometric relations, including axis projections, cosine similarities, low-dimensional subspaces, and steering spillovers, closely mirror human psychological associations.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.19678","ref_index":19,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"Exploring Language-Agnosticity in Function Vectors: A Case Study in Machine Translation","primary_cat":"cs.CL","submitted_at":"2026-04-21T16:56:55+00:00","verdict":null,"verdict_confidence":null,"novelty_score":null,"formal_verification":null,"one_line_summary":null,"context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.19052","ref_index":39,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"Cell-Based Representation of Relational Binding in Language Models","primary_cat":"cs.CL","submitted_at":"2026-04-21T03:58:47+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"Large language models encode relational 
bindings via a cell-based representation: a low-dimensional linear subspace in which each cell corresponds to an entity-relation index pair and attributes are retrieved from the matching cell.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.18519","ref_index":75,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"LLM Safety From Within: Detecting Harmful Content with Internal Representations","primary_cat":"cs.AI","submitted_at":"2026-04-20T17:17:07+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"SIREN identifies safety neurons via linear probing on internal LLM layers and combines them with adaptive weighting to detect harm, outperforming prior guard models with 250x fewer parameters.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.08846","ref_index":98,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"Dictionary-Aligned Concept Control for Safeguarding Multimodal LLMs","primary_cat":"cs.LG","submitted_at":"2026-04-10T01:01:56+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"DACO curates a 15,000-concept dictionary from 400K image-caption pairs and uses it to initialize an SAE that enables granular, concept-specific steering of MLLM activations, raising safety scores on MM-SafetyBench and JailBreakV while preserving general capabilities.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"Olah, and Tom Henighan. Scaling monosemanticity: Extracting interpretable features from claude 3 sonnet. Transformer Circuits Thread, 2024. [97] Fengrui Tian, Tianjiao Ding, Jinqi Luo, Hancheng Min, and Rene Vidal. Voyaging into perpetual dynamic scenes from a single view. 
In Proceedings of the IEEE/CVF International Conference on Computer Vision, 2025. [98] Curt Tigges, Oskar John Hollinsworth, Atticus Geiger, and Neel Nanda. Linear representations of sentiment in large language models. arXiv preprint arXiv:2310.15154, 2023. [99] Eric Todd, Millicent L. Li, Arnab Sen Sharma, Aaron Mueller, Byron C. Wallace, and David Bau. Function vectors in large language models. In ICLR, 2024. [100] Matthew Trager, Pramuditha Perera, Luca Zancato, Alessan-"},{"citing_arxiv_id":"2604.07886","ref_index":26,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"Linear Representations of Hierarchical Concepts in Language Models","primary_cat":"cs.CL","submitted_at":"2026-04-09T06:55:19+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Language models encode concept hierarchies as linear transformations that are domain-specific yet structurally similar across domains.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.07729","ref_index":15,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"Emotion Concepts and their Function in a Large Language Model","primary_cat":"cs.AI","submitted_at":"2026-04-09T02:25:17+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"Claude Sonnet 4.5 exhibits functional emotions via abstract internal representations of emotion concepts that causally influence its preferences and misaligned behaviors without implying subjective experience.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"technically passes the tests but violates the intent of the task. 
This transcript comes from an \"impossible code\" evaluation, in which the Assistant is presented with an \"impossible\" coding task, implementing a function that must pass unit tests with requirements that cannot be simultaneously satisfied through legitimate means (this evaluation is similar to prior work [15] and previously reported on in the Sonnet 4.5 system card). In this scenario, the Assistant is asked to implement a list summation function to pass a set of provided tests, one of which requires an unrealistically fast implementation. However, the test cases all happen to use arithmetic sequences (lists with regularly incrementing entries), the sum of which"},{"citing_arxiv_id":"2604.02608","ref_index":25,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"Steerable but Not Decodable: Function Vectors Operate Beyond the Logit Lens","primary_cat":"cs.LG","submitted_at":"2026-04-03T00:54:11+00:00","verdict":"ACCEPT","verdict_confidence":"MODERATE","novelty_score":8.0,"formal_verification":"none","one_line_summary":"Function vectors steer LLMs successfully where the logit lens fails to decode the target answer, showing the two properties come apart.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"[21] applied contrastive activation addition to Llama 2. Zou et al. [31] proposed representation engineering as a general control framework. Arditi et al. [1] showed refusal behavior is mediated by a single direction. Park et al. [22] articulated the linear representation hypothesis; Marks & Tegmark [15] found linear structure in truth-value representations; Tigges et al. [25] demonstrated linear representations of sentiment; Hernandez et al. [10] studied linearity of relation representations. 
We decompose this hypothesis into two separable claims: linear decodability (information is decodable by the model's own unembedding at intermediate layers) and linear steerability (behavior can be induced by an additive intervention), and show they come apart, but in the opposite direction from what"},{"citing_arxiv_id":"2511.02135","ref_index":73,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Graph-Based Alternatives to LLMs for Human Simulation","primary_cat":"cs.CL","submitted_at":"2025-11-03T23:54:24+00:00","verdict":"CONDITIONAL","verdict_confidence":"MODERATE","novelty_score":6.0,"formal_verification":"none","one_line_summary":"GEMS formulates close-ended human-behavior simulation as link prediction on a heterogeneous graph and matches or exceeds LLM performance with three orders of magnitude fewer parameters across three datasets and three evaluation settings.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2510.24941","ref_index":27,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Can Aha Moments Be Fake? 
Towards Quantifying Decorative and True Thinking in Chain-of-Thought","primary_cat":"cs.LG","submitted_at":"2025-10-28T20:14:02+00:00","verdict":null,"verdict_confidence":null,"novelty_score":null,"formal_verification":null,"one_line_summary":null,"context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2406.11717","ref_index":190,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"Refusal in Language Models Is Mediated by a Single Direction","primary_cat":"cs.LG","submitted_at":"2024-06-17T16:36:12+00:00","verdict":"ACCEPT","verdict_confidence":"MODERATE","novelty_score":7.0,"formal_verification":"none","one_line_summary":"Refusal in language models is mediated by a single direction in residual stream activations that can be erased to disable safety or added to elicit refusal.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2404.15255","ref_index":26,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"How to use and interpret activation patching","primary_cat":"cs.LG","submitted_at":"2024-04-23T17:42:29+00:00","verdict":"ACCEPT","verdict_confidence":"MODERATE","novelty_score":5.0,"formal_verification":"none","one_line_summary":"Activation patching provides evidence about neural network circuits when the choice of metric is aligned with the hypothesis and common interpretation errors are avoided.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2312.06681","ref_index":22,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"Steering Llama 2 via Contrastive Activation Addition","primary_cat":"cs.CL","submitted_at":"2023-12-09T04:40:46+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Contrastive Activation Addition steers Llama 2 Chat by adding averaged 
residual-stream activation differences from contrastive example pairs to control targeted behaviors at inference time.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null}],"limit":50,"offset":0}