{"total":20,"items":[{"citing_arxiv_id":"2605.13322","ref_index":2,"ref_count":1,"confidence":0.88,"is_internal_anchor":true,"paper_title":"KamonBench: A Grammar-Based Dataset for Evaluating Compositional Factor Recovery in Vision-Language Models","primary_cat":"cs.CV","submitted_at":"2026-05-13T10:35:07+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"KamonBench is a grammar-generated synthetic dataset of compositional kamon crests with explicit factor annotations to evaluate factor recovery in vision-language models.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.12809","ref_index":148,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"Correcting Influence: Unboxing LLM Outputs with Orthogonal Latent Spaces","primary_cat":"cs.LG","submitted_at":"2026-05-12T23:01:29+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"A latent mediation framework with sparse autoencoders enables non-additive token-level influence attribution in LLMs by learning orthogonal features and back-propagating attributions.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.12411","ref_index":11,"ref_count":1,"confidence":0.88,"is_internal_anchor":true,"paper_title":"Predicting Decisions of AI Agents from Limited Interaction through Text-Tabular Modeling","primary_cat":"cs.LG","submitted_at":"2026-05-12T17:09:32+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"A tabular foundation model with LLM-as-Observer features predicts AI agent decisions in controlled games, outperforming baselines by 4 AUC points and 14% lower error at K=16 
interactions.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.11448","ref_index":1,"ref_count":1,"confidence":0.88,"is_internal_anchor":true,"paper_title":"Deep Minds and Shallow Probes","primary_cat":"cs.LG","submitted_at":"2026-05-12T02:59:44+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"Symmetry under affine reparameterizations of hidden coordinates selects a unique hierarchy of shallow coordinate-stable probes and a probe-visible quotient for cross-model transfer.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.11410","ref_index":30,"ref_count":1,"confidence":0.88,"is_internal_anchor":true,"paper_title":"What Do EEG Foundation Models Capture from Human Brain Signals?","primary_cat":"cs.AI","submitted_at":"2026-05-12T01:57:53+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"EEG foundation models encode many traditional hand-crafted features like frequency power, recovering on average 79% of their advantage over random baselines on clinical tasks while leaving residuals on harder ones.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.11206","ref_index":5,"ref_count":2,"confidence":0.88,"is_internal_anchor":true,"paper_title":"Instructions Shape Production of Language, not Processing","primary_cat":"cs.CL","submitted_at":"2026-05-11T20:21:04+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Instructions trigger a production-centered mechanism in language models, with task-specific information stable in input tokens but varying strongly in output tokens and correlating with 
behavior.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"majority-label effects (Zhao et al., 2021). Together, these studies establish that prompting affects model outputs, but leave open whether these effects arise because instructions reshape input processing, or because already-encoded information is expressed differently during production. What Models Encode vs. What They Express. Probing classifiers (Belinkov, 2022; Tenney et al., 2019b; Conneau et al., 2018; Hewitt & Liang, 2019; Voita & Titov, 2020) reveal a consistent pattern: internally encoded knowledge does not always surface in model outputs. Models encode more than they express (Burns et al., 2023; Gekhman et al., 2025; Orgad et al., 2025; Feng et al., 2025; Azaria & Mitchell, 2023; Slobodkin et al."},{"citing_arxiv_id":"2605.09875","ref_index":5,"ref_count":1,"confidence":0.88,"is_internal_anchor":true,"paper_title":"Cross-Family Universality of Behavioral Axes via Anchor-Projected Representations","primary_cat":"cs.AI","submitted_at":"2026-05-11T02:01:18+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"Behavioral directions from one LLM family transfer to others via projection into a shared anchor coordinate space, yielding 0.83 ten-way detection accuracy and steering effects up to 0.46% on held-out models.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.10971","ref_index":17,"ref_count":1,"confidence":0.88,"is_internal_anchor":true,"paper_title":"Steering Without Breaking: Mechanistically Informed Interventions for Discrete Diffusion Language Models","primary_cat":"cs.LG","submitted_at":"2026-05-08T18:52:17+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":8.0,"formal_verification":"none","one_line_summary":"Adaptive scheduling of interventions in discrete diffusion language
models, timed to attribute-specific commitment schedules discovered with sparse autoencoders, delivers precise multi-attribute steering up to 93% strength while preserving generation quality.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.06510","ref_index":1,"ref_count":1,"confidence":0.88,"is_internal_anchor":true,"paper_title":"Is One Layer Enough? Understanding Inference Dynamics in Tabular Foundation Models","primary_cat":"cs.LG","submitted_at":"2026-05-07T16:22:04+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"Tabular foundation models show substantial depthwise redundancy, so a looped single-layer version achieves comparable results with 20% of the original parameters.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.06303","ref_index":33,"ref_count":1,"confidence":0.88,"is_internal_anchor":true,"paper_title":"Molecules Meet Language: Confound-Aware Representation Learning and Chemical Property Steering in Transformer-VAE Latent Spaces","primary_cat":"cs.LG","submitted_at":"2026-05-07T14:07:10+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Chemically meaningful steering for properties like cLogP and TPSA emerges in entangled Transformer-VAE latent spaces only after controlling for SELFIES representation confounds through residualization and decoded traversals.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.06723","ref_index":8,"ref_count":1,"confidence":0.88,"is_internal_anchor":true,"paper_title":"When Does a Language Model Commit? 
A Finite-Answer Theory of Pre-Verbalization Commitment","primary_cat":"cs.AI","submitted_at":"2026-05-07T08:34:59+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"Finite-answer projections of continuation probabilities stabilize before the answer is parseable, showing 17-31 token mean lead in delayed-verdict tasks with Qwen3-4B-Instruct.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.05715","ref_index":6,"ref_count":1,"confidence":0.88,"is_internal_anchor":true,"paper_title":"Decodable but Not Corrected by Fixed Residual-Stream Linear Steering: Evidence from Medical LLM Failure Regimes","primary_cat":"cs.AI","submitted_at":"2026-05-07T05:58:38+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"Overthinking in medical QA is linearly decodable at 71.6% accuracy yet fixed residual-stream steering yields no correction across 29 configurations, while enabling selective abstention with AUROC 0.610.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.04980","ref_index":2,"ref_count":1,"confidence":0.88,"is_internal_anchor":true,"paper_title":"Conceptors for Semantic Steering","primary_cat":"cs.LG","submitted_at":"2026-05-06T14:32:29+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Conceptors as soft projection matrices from bipolar activations offer a multidimensional, compositional, and geometrically principled method for semantic steering in LLMs that outperforms single-vector baselines in multi-dimensional 
subspaces.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.00607","ref_index":4,"ref_count":1,"confidence":0.88,"is_internal_anchor":true,"paper_title":"Beyond Decodability: Reconstructing Language Model Representations with an Encoding Probe","primary_cat":"cs.CL","submitted_at":"2026-05-01T12:19:46+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"An encoding probe reconstructs transformer representations from acoustic, phonetic, syntactic, lexical and speaker features, showing independent syntactic/lexical contributions and training-dependent speaker effects.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.00226","ref_index":30,"ref_count":1,"confidence":0.88,"is_internal_anchor":true,"paper_title":"Why Do LLMs Struggle in Strategic Play? Broken Links Between Observations, Beliefs, and Actions","primary_cat":"cs.CL","submitted_at":"2026-04-30T21:04:38+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"LLMs encode accurate but brittle internal beliefs about latent game states and convert them poorly into actions, creating systematic gaps that explain strategic failures.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.24801","ref_index":5,"ref_count":2,"confidence":0.88,"is_internal_anchor":true,"paper_title":"Architecture Determines Observability of Transformers","primary_cat":"cs.LG","submitted_at":"2026-04-27T02:39:02+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Architecture and training determine whether transformers retain a readable internal signal that lets activation monitors catch errors missed by output 
confidence.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.00874","ref_index":36,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"Latent Space Probing for Adult Content Detection in Video Generative Models","primary_cat":"cs.CV","submitted_at":"2026-04-25T01:01:02+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"Latent space probing on CogVideoX achieves 97.29% F1 for adult content detection on a new 11k-clip dataset with 4-6ms overhead.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.13466","ref_index":1,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"Functional Emotions or Situational Contexts? A Discriminating Test from the Mythos Preview System Card","primary_cat":"cs.HC","submitted_at":"2026-04-09T19:32:44+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":4.0,"formal_verification":"none","one_line_summary":"The note proposes applying emotion probes to SAE-analyzed strategic concealment episodes to test if emotion vectors capture causal emotions or situational projections in AI models.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2310.06824","ref_index":31,"ref_count":1,"confidence":0.88,"is_internal_anchor":true,"paper_title":"The Geometry of Truth: Emergent Linear Structure in Large Language Model Representations of True/False Datasets","primary_cat":"cs.AI","submitted_at":"2023-10-10T17:54:39+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"At sufficient scale, LLMs linearly represent the truth value of factual statements, as shown by visualizations, cross-dataset generalization, and causal interventions that flip truth 
judgments.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2211.05100","ref_index":199,"ref_count":1,"confidence":0.88,"is_internal_anchor":true,"paper_title":"BLOOM: A 176B-Parameter Open-Access Multilingual Language Model","primary_cat":"cs.CL","submitted_at":"2022-11-09T18:48:09+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"BLOOM is a 176B-parameter open-access multilingual language model trained on the ROOTS corpus that achieves competitive performance on benchmarks, with improved results after multitask prompted finetuning.","context_count":1,"top_context_role":"method","top_context_polarity":"use_method","context_text":"61 74.30 70.55 77.95 73.18 80.38 Chinese (zh) 33.55 49.41 61.75 58.75 63.02 58.53 66.78 Table 10: Performance of BLOOM models finetuned for sentence embeddings on classification and STS datasets from MTEB (Muennighoff et al., 2022b). 2018; Tenney et al., 2018; Belinkov and Glass, 2019; Teehan et al., 2022), although it comes with certain shortcomings (Belinkov, 2022). Examination of the LLM embeddings can help shed light on the generalizing abilities of the model apart from its training objective loss or downstream task evaluation, which is especially beneficial for examining languages lacking annotated datasets or benchmarks. 4.9.1 Method For interpreting BLOOM's multilingual generalizing abilities, we utilize the \"Universal Prob-"}],"limit":50,"offset":0}