{"total":20,"items":[{"citing_arxiv_id":"2606.31145","ref_index":23,"ref_count":1,"confidence":0.88,"is_internal_anchor":false,"paper_title":"SeKV: Resolution-Adaptive KV Cache with Hierarchical Semantic Memory for Long-Context LLM Inference","primary_cat":"cs.CL","submitted_at":"2026-06-30T05:18:02+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"SeKV introduces resolution-adaptive semantic KV caching with GPU-CPU hierarchy and selective zoom-in reconstruction, achieving 5.9% average improvement over semantic baselines and 53.3% GPU memory reduction at 128K context.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2606.27237","ref_index":49,"ref_count":1,"confidence":0.88,"is_internal_anchor":false,"paper_title":"LMs as Task-Specific Knowledge Bases: An Interpretability Analysis","primary_cat":"cs.CL","submitted_at":"2026-06-25T16:22:43+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"LMs store facts in task-specific parameter subsets, shown by inconsistent emergence across tasks during training and distinct localized parameters for the same fact.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2606.27069","ref_index":17,"ref_count":2,"confidence":0.88,"is_internal_anchor":false,"paper_title":"Towards Explainable Adjudicative Variance: Quantifying Judicial Discretion via Gated Multi-Task Learning","primary_cat":"cs.CL","submitted_at":"2026-06-25T14:14:27+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"A Judge-Aware Gated Multi-Task Learning architecture with outcome taxonomy supervision achieves SOTA accuracy on 13,937 UK Employment Tribunal decisions using an order of magnitude fewer parameters than 
generative SFT baselines on a 26B model.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2606.18381","ref_index":29,"ref_count":1,"confidence":0.88,"is_internal_anchor":false,"paper_title":"SproutRAG: Attention-Guided Tree Search with Progressive Embeddings for Long-Document RAG","primary_cat":"cs.CL","submitted_at":"2026-06-16T18:28:00+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"SproutRAG introduces an attention-guided hierarchical framework that constructs a binary chunking tree for multi-granularity retrieval in RAG systems and reports a 6.1% average gain in information efficiency.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2606.18056","ref_index":4,"ref_count":1,"confidence":0.88,"is_internal_anchor":false,"paper_title":"ConSA: Controllable Sparsity in Hybrid Attention via Learnable Allocation","primary_cat":"cs.CL","submitted_at":"2026-06-16T15:33:49+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"ConSA learns FA/SWA allocation via L0 masks and augmented Lagrangian constraints, outperforming rule-based baselines on 0.6B and 1.7B models with consistent layer patterns.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2606.05079","ref_index":42,"ref_count":1,"confidence":0.88,"is_internal_anchor":false,"paper_title":"Fast & Faithful Function Vectors","primary_cat":"cs.CL","submitted_at":"2026-06-03T16:36:15+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":4.0,"formal_verification":"none","one_line_summary":"LRP-based attention head selection and distributed application improve the efficiency and accuracy of function vectors for steering LLMs compared to prior 
choices.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2606.04727","ref_index":45,"ref_count":1,"confidence":0.88,"is_internal_anchor":false,"paper_title":"EviRank: Evidence-Based Confidence Estimation for LLM-Based Ranking","primary_cat":"cs.IR","submitted_at":"2026-06-03T11:11:30+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"EviRank extracts three evidences from a single LLM forward pass, aggregates them with reliable opinion pooling and position-aware calibration, then uses the result to optimize rankings, claiming SOTA on recommendation and uncertainty quantification across three datasets.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.21403","ref_index":93,"ref_count":1,"confidence":0.88,"is_internal_anchor":false,"paper_title":"Quantifying the cross-linguistic effects of syncretism on agreement attraction","primary_cat":"cs.CL","submitted_at":"2026-05-20T17:02:17+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":4.0,"formal_verification":"none","one_line_summary":"LLM surprisal and attention entropy replicate syncretism modulation of agreement attraction in English and German, align with null results in Turkish, and partially match Russian patterns.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.11161","ref_index":152,"ref_count":1,"confidence":0.88,"is_internal_anchor":false,"paper_title":"Interpretability Can Be Actionable","primary_cat":"cs.LG","submitted_at":"2026-05-11T19:08:21+00:00","verdict":"CONDITIONAL","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Interpretability research should be judged by actionability—the degree to which its insights support concrete decisions and interventions—rather than 
explanatory power alone.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.09861","ref_index":31,"ref_count":1,"confidence":0.88,"is_internal_anchor":false,"paper_title":"Flag Varieties: A Geometric Framework for Deep Network Alignment","primary_cat":"cs.LG","submitted_at":"2026-05-11T01:46:17+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Alignment in deep networks is governed by flag varieties, with subspace intersection dimension as the unique reparameterization-invariant observable, explaining regularization and activation effects from first principles.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"[30] Elena Voita, David Talbot, Fedor Moiseev, Rico Sennrich, and Ivan Titov. Analyzing multi-head self-attention: Specialized heads do the heavy lifting, the rest can be pruned. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 5797-5808, 2019. doi: 10.18653/v1/P19-1580. URL https://aclanthology.org/P19-1580/. [31] Thomas Wolf, Lysandre Debut, Victor Sanh, Julien Chaumond, Clement Delangue, Anthony Moi, Pierric Cistac, Tim Rault, Rémi Louf, Morgan Funtowicz, Joe Davison, Sam Shleifer, Patrick von Platen, Clara Ma, Yacine Jernite, Julien Plu, Canwen Xu, Teven Le Scao, Sylvain Gugger, Mariama Drame, Quentin Lhoest, and Alexander M. Rush. 
Transformers: State-of-the-art natural language processing."},{"citing_arxiv_id":"2605.04893","ref_index":35,"ref_count":2,"confidence":0.88,"is_internal_anchor":false,"paper_title":"Self-Attention as Transport: Limits of Symmetric Spectral Diagnostics","primary_cat":"cs.LG","submitted_at":"2026-05-06T13:25:13+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"Transpose-invariant spectral diagnostics on attention operators are orientation-blind, and a φ-G two-axis diagnostic distinguishes hallucination modes with 0.62-0.84 LC-AUROC and predicted polarity reversal.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.25903","ref_index":60,"ref_count":1,"confidence":0.88,"is_internal_anchor":false,"paper_title":"Carbon-Taxed Transformers: A Green Compression Pipeline for Overgrown Language Models","primary_cat":"cs.SE","submitted_at":"2026-04-28T17:48:16+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":4.0,"formal_verification":"none","one_line_summary":"CTT is a compression pipeline for LLMs that achieves up to 49x memory reduction, 10x faster inference, 81% lower CO2 emissions, and retains 68-98% accuracy on code clone detection, summarization, and generation tasks.","context_count":1,"top_context_role":"method","top_context_polarity":"use_method","context_text":"NAS is used to generate an initial compact candidate, after which we iteratively prune redundant components from selected four architectural component based on performance. This approach balances architectural exploration with interpretability and fine-grained control, enabling us to produce student models that maintain structural fidelity to the teacher (beneficial for KD [60]), while improving efficiency, quantization compatibility, and generalizability across diverse SE tasks. 
In the final phase of this stage, we begin by selecting a teacher language model and applying NAS over a predefined configuration space C. NAS identifies an initial compact student model by optimizing validation performance on dataset  while reducing computational cost."},{"citing_arxiv_id":"2604.10158","ref_index":17,"ref_count":1,"confidence":0.88,"is_internal_anchor":false,"paper_title":"Tracing the Thought of a Grandmaster-level Chess-Playing Transformer","primary_cat":"cs.LG","submitted_at":"2026-04-11T11:11:06+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Sparse replacement layers decompose the MLP and attention modules of a chess-playing transformer to reveal verifiable tactical reasoning pathways and parallel computation patterns.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.03764","ref_index":24,"ref_count":1,"confidence":0.88,"is_internal_anchor":false,"paper_title":"Automated Attention Pattern Discovery at Scale in Large Language Models","primary_cat":"cs.LG","submitted_at":"2026-04-04T15:32:25+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"AP-MAE reconstructs masked attention patterns in LLMs with high accuracy, generalizes across models, predicts generation correctness at 55-70%, and enables 13.6% accuracy gains via targeted interventions.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.13073","ref_index":11,"ref_count":1,"confidence":0.88,"is_internal_anchor":false,"paper_title":"OmniTrace: A Unified Framework for Generation-Time Attribution in Omni-Modal LLMs","primary_cat":"cs.CL","submitted_at":"2026-03-20T17:25:00+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"OmniTrace 
converts token-level signals into span-level cross-modal attributions for open-ended generation in omni-modal LLMs via generation-time tracing.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2512.04844","ref_index":76,"ref_count":1,"confidence":0.88,"is_internal_anchor":false,"paper_title":"Mitigating Catastrophic Forgetting in Target Language Adaptation of LLMs via Source-Shielded Updates","primary_cat":"cs.CL","submitted_at":"2025-12-04T14:28:14+00:00","verdict":"CONDITIONAL","verdict_confidence":"MODERATE","novelty_score":6.0,"formal_verification":"none","one_line_summary":"SSU mitigates catastrophic forgetting in low-resource LLM target-language adaptation by scoring and column-wise freezing source-critical parameters, reducing source degradation to ~3% versus ~20% for full fine-tuning while matching target performance.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2509.25758","ref_index":42,"ref_count":1,"confidence":0.88,"is_internal_anchor":false,"paper_title":"Thinking Sparks!: Emergent Attention Heads in Reasoning Models During Post Training","primary_cat":"cs.AI","submitted_at":"2025-09-30T04:23:43+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Post-training on reasoning tasks sparks the emergence of specialized attention heads that enable structured computation, with SFT adding stable heads while GRPO uses dynamic activation and pruning tied to reward signals, and controllable think models relying on compensatory heads instead of specific","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2509.22123","ref_index":148,"ref_count":1,"confidence":0.88,"is_internal_anchor":false,"paper_title":"Multilingual Vision-Language Models, A 
Survey","primary_cat":"cs.CL","submitted_at":"2025-09-26T09:46:13+00:00","verdict":"ACCEPT","verdict_confidence":"LOW","novelty_score":3.0,"formal_verification":"none","one_line_summary":"The survey identifies a key tension in multilingual vision-language models between language neutrality via contrastive learning and cultural awareness via diverse data, with most benchmarks relying on translation-based evaluation.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"the cross-modal encoder but also between the unimodal encoder representations using an additional loss as in dual encoders. ALBEF [80], X2-VLM [162], and related models [163] use multiple fusion points with different loss functions to improve multimodal alignment. Mixture-of-Modality-Experts. These architectures add modality-specific modules within transformer architectures. VLMo [10] and BEiT [148] replace standard feed-forward layers with three specialized layers for vision, text, and cross-modal representations, with task-dependent switching mechanisms. This approach addresses parameter inefficiencies of one-tower encoders while keeping the possibility of cross-modal interaction. 
Decoder Models. These models extend generative large language models by integrating vision components."},{"citing_arxiv_id":"2506.04289","ref_index":23,"ref_count":1,"confidence":0.88,"is_internal_anchor":false,"paper_title":"Relational reasoning and inductive bias in transformers and large language models","primary_cat":"cs.LG","submitted_at":"2025-06-04T10:15:05+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"In-weights learning induces linear embeddings enabling transitive inference in transformers, whereas in-context learning defaults to match-and-copy unless pre-trained on linear tasks or prompted with linear mental maps.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2310.01801","ref_index":21,"ref_count":1,"confidence":0.88,"is_internal_anchor":false,"paper_title":"Model Tells You What to Discard: Adaptive KV Cache Compression for LLMs","primary_cat":"cs.CL","submitted_at":"2023-10-03T05:17:08+00:00","verdict":"CONDITIONAL","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"FastGen adaptively compresses LLM KV caches via lightweight attention profiling: evicting long-range contexts on local heads, non-special tokens on special-token heads, and retaining full caches on broad-attention heads, yielding substantial memory savings with negligible quality loss.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null}],"limit":50,"offset":0}