{"total":27,"items":[{"citing_arxiv_id":"2605.13370","ref_index":14,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Phasor Memory Networks: Stable Backpropagation Through Time for Scalable Explicit Memory","primary_cat":"cs.LG","submitted_at":"2026-05-13T11:28:06+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"PMNet uses unitary phasor dynamics and hierarchical anchors to make explicit memory stable for long sequences, matching a 3x larger Mamba model on long-context robustness with a 119M parameter network.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.12922","ref_index":29,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"When Attention Closes: How LLMs Lose the Thread in Multi-Turn Interaction","primary_cat":"cs.AI","submitted_at":"2026-05-13T02:58:18+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Attention to goal tokens declines in multi-turn LLM interactions while residual representations often retain decodable goal information, and the gap between these predicts whether goal-conditioned behavior survives.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.11910","ref_index":3,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Rethinking Positional Encoding for Neural Vehicle Routing","primary_cat":"cs.AI","submitted_at":"2026-05-12T10:22:31+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"A hierarchical anisometric positional encoding that combines distance-indexed in-route and depot-anchored angular cross-route components improves transformer-based solvers for vehicle routing problems over index-based alternatives.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.10640","ref_index":22,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"Towards Understanding Continual Factual Knowledge Acquisition of Language Models: From Theory to Algorithm","primary_cat":"cs.CL","submitted_at":"2026-05-11T14:28:02+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Theoretical analysis of continual factual knowledge acquisition shows data replay stabilizes pretrained knowledge by shifting convergence dynamics while regularization only slows forgetting, leading to the STOC method for attention-based replay selection.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.10544","ref_index":24,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"Where Does Long-Context Supervision Actually Go? 
Effective-Context Exposure Balancing","primary_cat":"cs.CL","submitted_at":"2026-05-11T13:23:21+00:00","verdict":"CONDITIONAL","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"EXACT re-allocates training supervision by inverse frequency of long effective-context targets, improving NoLiMa and RULER scores by 5-18 points on Qwen and LLaMA models without degrading standard QA or reasoning.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.10414","ref_index":22,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"Remember to Forget: Gated Adaptive Positional Encoding","primary_cat":"cs.LG","submitted_at":"2026-05-11T11:52:06+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"GAPE augments RoPE with query- and key-dependent gates to stabilize attention and improve long-context performance in language models.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.09942","ref_index":39,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"HAGE: Harnessing Agentic Memory via RL-Driven Weighted Graph Evolution","primary_cat":"cs.AI","submitted_at":"2026-05-11T03:41:18+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"HAGE proposes a trainable weighted graph memory framework with LLM intent classification, dynamic edge modulation, and RL optimization that improves long-horizon reasoning accuracy in agentic LLMs over static baselines.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.09932","ref_index":28,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"FocuSFT: Bilevel Optimization for Dilution-Aware Long-Context Fine-Tuning","primary_cat":"cs.CL","submitted_at":"2026-05-11T03:30:35+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"FocuSFT uses an inner optimization loop to adapt fast-weight parameters into a parametric memory that sharpens attention on relevant content, then conditions outer-loop supervised fine-tuning on this representation, yielding gains on long-context benchmarks.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.09472","ref_index":2,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"Positional LSH: Binary Block Matrix Approximation for Attention with Linear Biases","primary_cat":"cs.LG","submitted_at":"2026-05-10T10:58:20+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"ALiBi bias is the expectation of positional LSH-induced block masks, yielding spectral and max-norm approximation bounds that reduce long-context biased attention to randomized short-context unbiased attention.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.08587","ref_index":25,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"Kaczmarz Linear 
Attention","primary_cat":"cs.LG","submitted_at":"2026-05-09T01:07:01+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"Kaczmarz Linear Attention replaces the empirical coefficient in Gated DeltaNet with a key-norm-normalized step size derived from the online regression objective, yielding lower perplexity and better needle-in-haystack performance.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.07972","ref_index":4,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"It Just Takes Two: Scaling Amortized Inference to Large Sets","primary_cat":"cs.LG","submitted_at":"2026-05-08T16:32:34+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"A mean-pool deep set trained on sets of size at most two produces an encoder that generalizes to arbitrary sizes, decoupling representation learning from posterior modeling and making training cost independent of deployment set size N.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.04651","ref_index":17,"ref_count":2,"confidence":0.9,"is_internal_anchor":true,"paper_title":"FAAST: Forward-Only Associative Learning via Closed-Form Fast Weights for Test-Time Supervised Adaptation","primary_cat":"cs.LG","submitted_at":"2026-05-06T08:58:11+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"FAAST performs test-time supervised adaptation by analytically deriving fast weights from examples in one forward pass, matching backprop performance with over 90% less adaptation time and up to 95% memory savings versus memory-based methods.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.01858","ref_index":12,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"Decouple and Cache: KV Cache Construction for Streaming Video Understanding","primary_cat":"cs.CV","submitted_at":"2026-05-03T13:02:44+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"DSCache decouples cumulative past and instant KV caches with position-agnostic encoding to adapt offline VideoVLLMs to streaming video, delivering 2.5% average accuracy gains on QA benchmarks.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.00968","ref_index":25,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"Adaptive 3D-RoPE: Physics-Aligned Rotary Positional Encoding for Wireless Foundation Models","primary_cat":"eess.SP","submitted_at":"2026-05-01T15:51:10+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"Adaptive 3D-RoPE adapts rotary positional encoding to wireless channel physics via learnable 3D frequencies and dynamic CSI control, yielding up to 10.7 dB NMSE gains in scale extrapolation and 1 dB in zero-shot tasks.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.24940","ref_index":11,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"ADE: Adaptive Dictionary Embeddings -- Scaling Multi-Anchor Representations to Large Language 
Models","primary_cat":"cs.CL","submitted_at":"2026-04-27T19:29:33+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"ADE scales multi-anchor word representations to transformers via Vocabulary Projection, Grouped Positional Encoding, and context-aware reweighting, achieving 98.7% fewer trainable parameters than DeBERTa-v3-base while matching or exceeding it on two text-classification benchmarks and compressing the","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.22117","ref_index":24,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"PermaFrost-Attack: Stealth Pretraining Seeding(SPS) for planting Logic Landmines During LLM Training","primary_cat":"cs.LG","submitted_at":"2026-04-23T23:32:36+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"Stealth Pretraining Seeding plants persistent unsafe behaviors in LLMs via diffuse poisoned web content that activates on precise triggers and evades standard evaluation.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.21215","ref_index":76,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"The Recurrent Transformer: Greater Effective Depth and Efficient Decoding","primary_cat":"cs.LG","submitted_at":"2026-04-23T02:12:58+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Recurrent Transformers add per-layer recurrent memory via self-attention on own activations plus a tiling algorithm that reduces training memory traffic, yielding better C4 pretraining cross-entropy than parameter-matched standard transformers with fewer layers.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.20789","ref_index":33,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"Working Memory Constraints Scaffold Learning in Transformers under Data Scarcity","primary_cat":"cs.CL","submitted_at":"2026-04-22T17:14:52+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"Fixed-width and decay-based attention mechanisms inspired by working memory improve Transformer grammatical accuracy and human alignment under limited training data.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.18747","ref_index":22,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"URoPE: Universal Relative Position Embedding across Geometric Spaces","primary_cat":"cs.CV","submitted_at":"2026-04-20T18:52:03+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"URoPE is a parameter-free relative position embedding for transformers that works across arbitrary geometric spaces by ray sampling and projection, yielding consistent gains on novel view synthesis, 3D detection, tracking, and depth estimation.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.13479","ref_index":18,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"Learning Class Difficulty in Imbalanced Histopathology Segmentation via Dynamic Focal 
Attention","primary_cat":"eess.IV","submitted_at":"2026-04-15T05:06:39+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"Dynamic Focal Attention learns class-specific difficulty via per-class biases in attention logits, improving Dice and IoU on imbalanced histopathology segmentation benchmarks.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.10367","ref_index":28,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"Beyond Monologue: Interactive Talking-Listening Avatar Generation with Conversational Audio Context-Aware Kernels","primary_cat":"cs.AI","submitted_at":"2026-04-11T22:34:21+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"Multi-head Gaussian kernels inject temporal scale discrepancy as inductive bias to enable full-duplex talking-listening avatar generation, supported by a new decoupled VoxHear dataset and claimed SOTA naturalness.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.10044","ref_index":7,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"LoopGuard: Breaking Self-Reinforcing Attention Loops via Dynamic KV Cache Intervention","primary_cat":"cs.AI","submitted_at":"2026-04-11T05:54:45+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"LoopGuard detects attention collapse loops during LLM decoding and prunes repetitive KV cache tail spans under fixed budget, cutting loop incidence by over 90 percentage points on the new LoopBench benchmark.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.08782","ref_index":5,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"MT-OSC: Path for LLMs that Get Lost in Multi-Turn Conversation","primary_cat":"cs.CL","submitted_at":"2026-04-09T21:39:20+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"MT-OSC condenses chat history via a one-off sequential process with a few-shot Condenser and lightweight Decider to reduce tokens and preserve LLM accuracy in multi-turn settings.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2406.00515","ref_index":216,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"A Survey on Large Language Models for Code Generation","primary_cat":"cs.CL","submitted_at":"2024-06-01T17:48:15+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":3.0,"formal_verification":"none","one_line_summary":"A systematic literature review that organizes recent work on LLMs for code generation into a taxonomy covering data curation, model advances, evaluations, ethics, environmental impact, and applications, with benchmark comparisons.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2402.06196","ref_index":128,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"Large Language Models: A Survey","primary_cat":"cs.CL","submitted_at":"2024-02-09T05:37:09+00:00","verdict":"ACCEPT","verdict_confidence":"MODERATE","novelty_score":3.0,"formal_verification":"none","one_line_summary":"The paper 
surveys key large language models, their training methods, datasets, evaluation benchmarks, and future research directions in the field.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"and the ability to improve linear self-attention with relative position encoding. GPT-NeoX-20B, PaLM, CODEGEN, and LLaMA are among models that take advantage of RoPE in their architectures. 4) Relative Positional Bias : The concept behind this type of positional embedding is to facilitate extrapolation during inference for sequences longer than those encountered in train- ing. In [128] Press et al. proposed Attention with Linear Biases (ALiBi). Instead of simply adding positional embeddings to word embeddings, they introduced a bias to the attention scores of query-key pairs, imposing a penalty proportional to their distance. In the BLOOM model, ALiBi is leveraged. E. Model Pre-training Pre-training is the very first step in large language model"},{"citing_arxiv_id":"2310.08560","ref_index":16,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"MemGPT: Towards LLMs as Operating Systems","primary_cat":"cs.AI","submitted_at":"2023-10-12T17:51:32+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"MemGPT uses OS-inspired virtual context management to extend LLM context windows for large document analysis and long-term multi-session chat.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2211.09085","ref_index":222,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Galactica: A Large Language Model for Science","primary_cat":"cs.CL","submitted_at":"2022-11-16T18:06:33+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"Galactica, a science-specialized LLM, reports higher scores than GPT-3, Chinchilla, and PaLM on LaTeX knowledge, mathematical reasoning, and medical QA benchmarks while outperforming general models on BIG-bench.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null}],"limit":50,"offset":0}
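For readers who want to work with this listing programmatically, the sketch below is a minimal example of consuming the response schema shown above (the fields `items`, `primary_cat`, `context_count`, `citing_arxiv_id`, and `paper_title` all appear in the payload). The filename `citations.json` is an assumption for illustration, not part of any tool documented here.

```python
import json
from collections import Counter

# Minimal sketch, assuming the response object above is saved locally as
# "citations.json" (hypothetical filename).
with open("citations.json", encoding="utf-8") as f:
    payload = json.load(f)

items = payload["items"]

# Count citing papers per arXiv primary category.
by_category = Counter(item["primary_cat"] for item in items)
print(by_category.most_common())

# List the entries that include an extracted citation context
# (context_count > 0, so context_text is populated).
for item in items:
    if item["context_count"] > 0:
        print(item["citing_arxiv_id"], "-", item["paper_title"])
```

Run as-is, this prints the category distribution (dominated by cs.LG, cs.CL, and cs.AI in this payload) and the single entry that carries a context excerpt, the "Large Language Models: A Survey" citation.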