{"total":13,"items":[{"citing_arxiv_id":"2605.22462","ref_index":4,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"From Correlation to Cause: A Five-Stage Methodology for Feature Analysis in Transformer Language Models","primary_cat":"cs.CL","submitted_at":"2026-05-21T13:25:16+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"A five-stage causal feature analysis methodology is proposed and tested on GPT-2 for IOI, showing partial causality of SAE features, robustness differences under shifts, and deployment cost benefits.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.15183","ref_index":6,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"When Are Two Networks the Same? Tensor Similarity for Mechanistic Interpretability","primary_cat":"cs.LG","submitted_at":"2026-05-14T17:58:27+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"Tensor similarity is a symmetry-invariant metric that measures functional equivalence between tensor-based networks using a recursive algorithm for cross-layer mechanisms.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.13625","ref_index":11,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"How to Interpret Agent Behavior","primary_cat":"cs.AI","submitted_at":"2026-05-13T14:52:40+00:00","verdict":"CONDITIONAL","verdict_confidence":"MODERATE","novelty_score":6.0,"formal_verification":"none","one_line_summary":"ACT*ONOMY is a Grounded-Theory-derived hierarchical taxonomy and open repository that enables systematic comparison and characterization of autonomous agent behavior across trajectories.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"Qualitative psychology: A practical guide to research methods, 3(2015):53-84, 2015. [9] P. Chong, H. Abichandani, J. Shen, A. Ghosh, M. P. Moe, Y . Mai, and D. Dahlmeier. Talk, evaluate, diagnose: User-aware agent evaluation with automated error analysis, 2026. URL https://arxiv.org/abs/2603.15483. [10] V . Clarke and V . Braun. Thematic analysis.The journal of positive psychology, 12(3):297-298, 2017. [11] A. Conmy, A. N. Mavor-Parker, A. Lynch, S. Heimersheim, and A. Garriga-Alonso. Towards automated circuit discovery for mechanistic interpretability, 2023. URLhttps://arxiv.or g/abs/2304.14997. [12] H. Cunningham, A. Ewart, L. Riggs, R. Huben, and L. Sharkey. Sparse autoencoders find highly interpretable features in language models, 2023. URLhttps://arxiv."},{"citing_arxiv_id":"2605.12991","ref_index":8,"ref_count":2,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Not Just RLHF: Why Alignment Alone Won't Fix Multi-Agent Sycophancy","primary_cat":"cs.LG","submitted_at":"2026-05-13T04:45:08+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Base LLMs show multi-agent yield to peer pressure at rates equal to or higher than aligned models, localized by activation patching to mid-layers where attention dominates, with one dissenter cutting yield by 54-73 points while prompt defenses fail on variants.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.12809","ref_index":77,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Correcting Influence: Unboxing LLM Outputs with Orthogonal Latent Spaces","primary_cat":"cs.LG","submitted_at":"2026-05-12T23:01:29+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"A latent mediation framework with sparse autoencoders enables non-additive token-level influence attribution in LLMs by learning orthogonal features and back-propagating attributions.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.09881","ref_index":17,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Dissecting Jet-Tagger Through Mechanistic Interpretability","primary_cat":"hep-ph","submitted_at":"2026-05-11T02:11:47+00:00","verdict":"ACCEPT","verdict_confidence":"MODERATE","novelty_score":8.0,"formal_verification":"none","one_line_summary":"A Particle Transformer jet tagger contains a sparse six-head circuit whose source-relay-readout structure recovers most performance and whose residual stream preferentially encodes 2-prong energy correlators.","context_count":1,"top_context_role":"method","top_context_polarity":"use_method","context_text":",A mathematical framework for transformer circuits, Transformer Circuits Thread (2021). 37 SciPost Physics Submission [16] K. Wang, A. Variengien, A. Conmy, B. Shlegeris and J. Steinhardt,Interpretability in the Wild: a Circuit for Indirect Object Identification in GPT-2 small, arXiv e-prints arXiv:2211.00593 (2022), doi:10.48550/arXiv.2211.00593,2211.00593. [17] A. Conmy, A. N. Mavor-Parker, A. Lynch, S. Heimersheim and A. Garriga-Alonso, Towards Automated Circuit Discovery for Mechanistic Interpretability, arXiv e-prints arXiv:2304.14997 (2023), doi:10.48550/arXiv.2304.14997,2304.14997. [18] G. Kasieczka, T. Plehn, J. Thompson and M. Russel,Top quark tagging reference dataset, doi:10.5281/zenodo.2603256 (2019)."},{"citing_arxiv_id":"2605.06335","ref_index":6,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Eliciting associations between clinical variables from LLMs via comparison questions across populations","primary_cat":"cs.LG","submitted_at":"2026-05-07T14:26:32+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"Indirect elicitation via triplet comparisons recovers meaningful association structures from LLMs and supports conservative causal candidate links across prompted subpopulations.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"j )/2, similar to a regression model 3 P \u0010 Y (e) jk = 2|X (3) j , X(3) k \u0011 =h \u0010 βs ·( ˆX ∗,(3) j −X j,ref) \u0011 (4) =h \u0010 β(e) 0,jk +β (e) 1,jk X (3) j +β (e) 2,jk X (3) k \u0011 (5) withh(η) = 1/(1 + exp(−η)), and scaling parameterβ s. Expanding (5) using (2) and (3) shows thatβ (e) 1,jk =β sw1 andβ (e) 2,jk =β sw2a1 =β sw2ρjk sj sk , hence β(e) 2,jk β(e) 1,jk = w2 w1 sj sk ρjk ,(6) tyingρ jk to the fitted logistic coefficientsβ (e) 1,jk andβ (e) 2,jk through their slope ratio. 2.3 Correlation estimation Symmetric estimator.The directional relation in (6) depends not only on the correlationρ jk, but also on nuisance terms tied to variable scaling and relative cue weighting. Using the alternative question version, which flips the role ofXj andX k, we obtain an estimate ofρkj."},{"citing_arxiv_id":"2604.22128","ref_index":1,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Dissociating Decodability and Causal Use in Bracket-Sequence Transformers","primary_cat":"cs.CL","submitted_at":"2026-04-24T00:26:34+00:00","verdict":null,"verdict_confidence":null,"novelty_score":null,"formal_verification":null,"one_line_summary":null,"context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.19826","ref_index":9,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Co-Located Tests, Better AI Code: How Test Syntax Structure Affects Foundation Model Code Generation","primary_cat":"cs.SE","submitted_at":"2026-04-20T14:47:46+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Co-locating tests with implementation code yields substantially higher preservation and correctness in foundation-model-generated programs than separated test syntax.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":", the structural relationship between test code and implementation code in the prompt, affects code generation quality. This paper addresses that gap. 2.4 Mechanistic Interpretability for Code Models Mechanistic interpretability (MI) aims to understand model be- havior by examining internal representations: attention patterns, hidden states, activation pathways [9, 11]. Attention analysis has revealed how models process syntactic structures in natural lan- guage [7, 35], and activation patching provides causal validation of identified circuits [20, 40]. Recent work has applied MI to code mod- els, revealing how they represent syntactic structures and variable references [16, 36]. We use MI assupporting evidencefor our software engineering"},{"citing_arxiv_id":"2604.03045","ref_index":9,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"STEAR: Layer-Aware Spatiotemporal Evidence Intervention for Hallucination Mitigation in Video Large Language Models","primary_cat":"cs.CV","submitted_at":"2026-04-03T13:52:57+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"STEAR reduces spatial and temporal hallucinations in Video-LLMs via layer-aware evidence intervention from middle decoder layers in a single-encode pass.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"an analysis of BERT's attention. InProceedings of the 2019 ACL workshop BlackboxNLP: analyzing and interpreting neural networks for NLP. 276-286. [8] Arthur Conmy, Augustine N Mavor-Parker, Aengus Lynch, Stefan Heimersheim, and Adria Garriga-Alonso. 2023. Towards automated circuit discovery for mech- anistic interpretability, 2023.URL https://arxiv. org/abs/2304.149972 (2023). [9] Wenliang Dai, Junnan Li, Dongxu Li, Anthony Tiong, Junqi Zhao, Weisheng Wang, Boyang Li, Pascale N Fung, and Steven Hoi. 2023. Instructblip: Towards general-purpose vision-language models with instruction tuning.Advances in neural information processing systems36 (2023), 49250-49267. [10] Yifei Gao, Jiaqi Wang, Zhiyu Lin, and Jitao Sang. 2024. AIGCs confuse AI too:"},{"citing_arxiv_id":"2408.05147","ref_index":1,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Gemma Scope: Open Sparse Autoencoders Everywhere All At Once on Gemma 2","primary_cat":"cs.LG","submitted_at":"2024-08-09T16:06:42+00:00","verdict":"ACCEPT","verdict_confidence":"MODERATE","novelty_score":4.0,"formal_verification":"none","one_line_summary":"Gemma Scope supplies trained sparse autoencoders for all layers of Gemma 2 2B and 9B plus select 27B layers, with public weights and benchmark scores.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2404.15255","ref_index":2,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"How to use and interpret activation patching","primary_cat":"cs.LG","submitted_at":"2024-04-23T17:42:29+00:00","verdict":"ACCEPT","verdict_confidence":"MODERATE","novelty_score":5.0,"formal_verification":"none","one_line_summary":"Activation patching provides evidence about neural network circuits when the choice of metric is aligned with the hypothesis and common interpretation errors are avoided.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2309.08600","ref_index":5,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Sparse Autoencoders Find Highly Interpretable Features in Language Models","primary_cat":"cs.LG","submitted_at":"2023-09-15T17:56:55+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Sparse autoencoders applied to language model activations yield more interpretable and monosemantic features than alternative approaches, enabling finer causal analysis on the indirect object identification task.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null}],"limit":50,"offset":0}