{"total":49,"items":[{"citing_arxiv_id":"2605.23872","ref_index":38,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Training-Free Looped Transformers","primary_cat":"cs.LG","submitted_at":"2026-05-22T17:31:16+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"Training-free looped transformers retrofit recurrence to frozen models via damped ODE sub-steps on mid-stack blocks, yielding gains such as +2.64 pp on MMLU-Pro for Qwen3-4B.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.22222","ref_index":12,"ref_count":2,"confidence":0.98,"is_internal_anchor":true,"paper_title":"ARC-STAR: Auditable Post-Hoc Correction for PDE Foundation Models","primary_cat":"cs.LG","submitted_at":"2026-05-21T09:26:16+00:00","verdict":null,"verdict_confidence":null,"novelty_score":null,"formal_verification":null,"one_line_summary":null,"context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.20784","ref_index":2,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Interaction Locality in Hierarchical Recursive Reasoning","primary_cat":"cs.AI","submitted_at":"2026-05-20T06:25:49+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"Interaction locality is introduced as a task-geometry-aware measurement framework showing that high-level states in recursive models write locally while recursive updates build broader structures on maze, Sudoku, ARC-AGI, and 3D grounding tasks.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.19943","ref_index":7,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Probabilistic Tiny Recursive 
Model","primary_cat":"cs.AI","submitted_at":"2026-05-19T15:00:38+00:00","verdict":"CONDITIONAL","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"PTRM adds stochastic Gaussian noise to Tiny Recursive Model recursion for parallel trajectory exploration and Q-head selection, raising Sudoku-Extreme accuracy from 87.4% to 98.75% and Pencil Puzzle Bench from 62.6% to 91.2% without retraining.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.19403","ref_index":22,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"TIDE: Asymmetric Neural Circuits for Stabilized Temporal Inhibitory-Excitatory Dynamics","primary_cat":"cs.LG","submitted_at":"2026-05-19T05:59:13+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"TIDE is a neuro-inspired architecture using stabilized asymmetric E-I networks with lateral inhibition and 80:20 balance that trains in under half the time of CTM while gaining +1.65% top-1 accuracy on perturbed ImageNet.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.19376","ref_index":32,"ref_count":2,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Generative Recursive Reasoning","primary_cat":"cs.AI","submitted_at":"2026-05-19T05:20:56+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"GRAM is a latent-variable generative model that performs recursive reasoning via stochastic trajectories, trained with amortized variational inference to support multi-hypothesis reasoning and unconditional 
generation.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.17811","ref_index":12,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"One Model, Two Roles: Emergent Specialization in a Shared Recurrent Transformer","primary_cat":"cs.LG","submitted_at":"2026-05-18T03:36:58+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"In a minimal two-state recurrent Transformer, asymmetric input injection induces stable specialization where one state becomes a committed proposal and the other retains shifting uncertainty.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.14889","ref_index":9,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"SurgicalMamba: Dual-Path SSD with State Regramming for Online Surgical Phase Recognition","primary_cat":"cs.CV","submitted_at":"2026-05-14T14:34:55+00:00","verdict":null,"verdict_confidence":null,"novelty_score":null,"formal_verification":null,"one_line_summary":null,"context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.13190","ref_index":21,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"N-vium: Mixture-of-Exits Transformer for Accelerated Exact Generation","primary_cat":"cs.LG","submitted_at":"2026-05-13T08:46:17+00:00","verdict":"UNVERDICTED","verdict_confidence":"MODERATE","novelty_score":6.0,"formal_verification":"none","one_line_summary":"N-vium achieves 57.9% wall-clock speedup over matched standard transformers at no perplexity cost by mixing exact predictions from multiple model 
depths.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.12491","ref_index":40,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Elastic Attention Cores for Scalable Vision Transformers","primary_cat":"cs.CV","submitted_at":"2026-05-12T17:59:26+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"VECA learns effective visual representations using core-periphery attention where patches interact exclusively via a resolution-invariant set of learned core embeddings, achieving linear O(N) complexity while maintaining competitive performance.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"can be learned via nested dropout [16, 24, 25, 26], yielding ordered components [27]. This has been further extended to width-adaptive architectures [28, 29, 30]. Such structural elasticity supports sub-network extraction [31, 32, 33, 34] with transformer specific designs [35, 36]. Input-dependent computation can be achieved via early-exiting [37, 38, 39] or iterative processing [40, 41, 42, 43]. At the data-level, elasticity can utilize token merging [44], scratchpads [45], or iterative expansion [46]. Sparse expert selection offers an additional axis of adjustment [47, 48]. In vision, recent methods leverage spatial redundancies [49, 50]. 
Matformer introduces elasticity along the channel dimension, which is orthogonal to our approach."},{"citing_arxiv_id":"2605.10292","ref_index":92,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"LeapTS: Rethinking Time Series Forecasting as Adaptive Multi-Horizon Scheduling","primary_cat":"cs.LG","submitted_at":"2026-05-11T09:54:02+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"LeapTS reformulates forecasting as adaptive multi-horizon scheduling via hierarchical control and NCDEs, delivering at least 7.4% better performance and 2.6-5.3x faster inference than Transformer baselines while adapting to non-stationary dynamics.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.09999","ref_index":12,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"Muninn: Your Trajectory Diffusion Model But Faster","primary_cat":"cs.RO","submitted_at":"2026-05-11T05:21:52+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"Muninn accelerates diffusion trajectory planners up to 4.6x by spending an uncertainty budget to decide when to cache denoiser outputs, preserving performance and certifying bounded deviation from full computation.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"new samplers or compressing the backbone network [49]. They degrade trajectory quality, require retraining for each architecture, and offer no guarantees on how the modified sampling process affects downstream control performance [42]. Adaptive Inference and Early Exiting: Beyond architectural changes, a parallel line of work accelerates diffusion models via dynamic inference [12], using learned early-exit policies [60] or step-skipping heuristics [61] to reduce test-time compute. 
In diffusion models, such methods typically monitor simple signals along the sampling chain to decide when to stop, skip, or jump ahead in the denoising schedule [43, 37, 73]. Related approaches in sequence modeling and control reuse cached network outputs across time or across similar inputs,"},{"citing_arxiv_id":"2605.09948","ref_index":25,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"LoopVLA: Learning Sufficiency in Recurrent Refinement for Vision-Language-Action Models","primary_cat":"cs.AI","submitted_at":"2026-05-11T03:51:22+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"LoopVLA adds recurrent refinement and learned sufficiency estimation to VLA models, cutting parameters 45% and raising throughput 1.7x while matching baseline task success on LIBERO and VLA-Arena.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"action head predicts candidate actions at each iteration, while the sufficiency head estimates halting probabilities through cross-attention and remaining-mass allocation. Training is performed in two stages: joint refinement learning and sufficiency calibration. refine representations. Prior works, such as Adaptive Computation Time and early exiting, regulate the number of refinement steps based on intermediate signals [25, 26], offering a view of depth as an unrolled iterative process rather than as a fixed architectural parameter. Similar ideas have been explored in VLMs, where repeated refinement improves representation quality [27, 28]. However, these approaches are mainly studied in perception or generation settings. 
In contrast, VLA models require decision-oriented refinement, where intermediate representations"},{"citing_arxiv_id":"2605.09630","ref_index":36,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"Scratchpad Patching: Decoupling Compute from Patch Size in Byte-Level Language Models","primary_cat":"cs.CL","submitted_at":"2026-05-10T16:18:22+00:00","verdict":"CONDITIONAL","verdict_confidence":"MODERATE","novelty_score":7.0,"formal_verification":"none","one_line_summary":"Scratchpad Patching decouples compute from patch size in byte-level language models by inserting entropy-triggered scratchpads to update patch context dynamically.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.07686","ref_index":13,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"The Coupling Tax: How Shared Token Budgets Undermine Visible Chain-of-Thought Under Fixed Output Limits","primary_cat":"cs.LG","submitted_at":"2026-05-08T12:54:53+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":8.0,"formal_verification":"none","one_line_summary":"Shared token budgets between visible chain-of-thought and answers create a coupling tax that makes non-thinking competitive on math benchmarks, with a truncation decomposition predicting the crossover and split budgets improving results.","context_count":1,"top_context_role":"method","top_context_polarity":"use_method","context_text":"coupled generation: Accd(b)>Acc c(b).(11) Proof.The coupled accuracy decomposes as: Accc(b) =F L(β)·αc + (1−FL(β))·αt, β≜b−|A|min.(12) Consider the decoupled strategy withbr = β, ba =|A|min (identical effective reasoning budget as coupled mode, with the minimum answer allocation separated out). Its accuracy is: Accd(β,|A|min) =F L(β)·αc + (1−FL(β))·αe(β,|A|min).(13) The first terms are identical up to condition (iii). 
The gap from truncated samples is: ∆ = (1−FL(β))· ( αe(β,|A|min)−αt ) .(14) By condition (i),1−FL(β)> 0. By condition (ii),αe−αt > 0. Therefore∆ > 0, proving strict dominance. 32 Interpretation.The gap∆is the product of two terms: thetruncation probability1−FL(β), which grows with model size (larger models generate longer chains), and theextraction"},{"citing_arxiv_id":"2605.06112","ref_index":39,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"Dynamic Pondering Sparsity-aware Mixture-of-Experts Transformer for Event Stream based Visual Object Tracking","primary_cat":"cs.CV","submitted_at":"2026-05-07T12:25:03+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"A three-stage ViT with sparsity-aware MoE and adaptive inference depth delivers improved accuracy-efficiency trade-off for event-stream visual tracking on FE240hz, COESOT, and EventVOT benchmarks.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.05697","ref_index":7,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"Budgeted Attention Allocation: Cost-Conditioned Compute Control for Efficient Transformers","primary_cat":"cs.LG","submitted_at":"2026-05-07T05:37:44+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"A monotone head-gating mechanism conditions transformer attention on a budget, enabling one checkpoint to trade attention cost for accuracy and produce measured CPU speedups.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.03999","ref_index":18,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"RD-ViT: Recurrent-Depth Vision Transformer for Semantic Segmentation with Reduced Data Dependence Extending the Recurrent-Depth Transformer Architecture to Dense 
Prediction","primary_cat":"cs.CV","submitted_at":"2026-05-05T17:21:18+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":4.0,"formal_verification":"none","one_line_summary":"RD-ViT matches or exceeds standard ViT segmentation accuracy on cardiac MRI using a shared recurrent block, fewer parameters, and less training data.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.03109","ref_index":9,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"Gated Subspace Inference for Transformer Acceleration","primary_cat":"cs.LG","submitted_at":"2026-05-04T19:48:40+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Gated Subspace Inference accelerates transformer linear layers 3-10x via low-rank cached subspace computation and per-token gating to skip residuals while preserving output distribution to high accuracy.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.01058","ref_index":19,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"LEAP: Layer-wise Exit-Aware Pretraining for Efficient Transformer Inference","primary_cat":"cs.LG","submitted_at":"2026-05-01T19:45:26+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"LEAP adds a layer-wise exit-aware constraint to standard distillation, reconciling it with early-exit mechanisms and delivering 1.61x wall-clock speedup on MiniLM at 0.95 threshold with 91.9% early exits by layer 7.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.00206","ref_index":47,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"State Stream Transformer (SST) V2: Parallel Training of Nonlinear Recurrence for Latent Space 
Reasoning","primary_cat":"cs.LG","submitted_at":"2026-04-30T20:30:28+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"SST V2 introduces parallel-trainable nonlinear recurrence in latent space to let transformers reason continuously across positions, delivering +15 points on GPQA-Diamond and halving remaining GSM8K errors over matched baselines.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"Geometrical and statistical properties of systems of linear inequalities with applications in pattern recognition.IEEE Transactions on Electronic Computers, EC-14(3):326-334, 1965. [46] Ilya Loshchilov and Frank Hutter. Decoupled weight decay regularization. InInternational Conference on Learning Representations (ICLR 2019), 2017. URLhttps://arxiv.org/abs/1711.05101. [47] Shaojie Bai, J. Zico Kolter, and Vladlen Koltun. Deep equilibrium models. InAdvances in Neural Information Processing Systems (NeurIPS 2019), 2019. URLhttps://arxiv.org/abs/1909.01377. Spotlight Oral. 25 [48] Yi Heng Lim, Qi Zhu, Joshua Selfridge, and Muhammad Firmansyah Kasim. Parallelizing non-linear sequential models over the sequence length. 
InInternational Conference on Learning Representations (ICLR 2024), 2023."},{"citing_arxiv_id":"2604.27981","ref_index":14,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"ITS-Mina: A Harris Hawks Optimization-Based All-MLP Framework with Iterative Refinement and External Attention for Multivariate Time Series Forecasting","primary_cat":"cs.LG","submitted_at":"2026-04-30T15:10:18+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":4.0,"formal_verification":"none","one_line_summary":"ITS-Mina introduces an all-MLP model with iterative refinement, external attention via learnable memory units, and HHO-tuned dropout that reports state-of-the-art or competitive results on six multivariate time series benchmarks versus eleven baselines.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.22110","ref_index":5,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"Do Not Imitate, Reinforce: Iterative Classification via Belief Refinement","primary_cat":"cs.LG","submitted_at":"2026-04-23T23:06:48+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"RIC replaces single-pass label imitation with RL-driven iterative belief refinement, recovering cross-entropy optima while enabling adaptive halting via a value function.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.21999","ref_index":7,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"Universal Transformers Need Memory: Depth-State Trade-offs in Adaptive Recursive Reasoning","primary_cat":"cs.LG","submitted_at":"2026-04-23T18:30:01+00:00","verdict":"CONDITIONAL","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Memory tokens are required for non-trivial performance in adaptive Universal 
Transformers on Sudoku-Extreme, with 8-32 tokens yielding stable 57% exact-match accuracy while trading off against ponder depth.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.19550","ref_index":6,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"LoopCTR: Unlocking the Loop Scaling Power for Click-Through Rate Prediction","primary_cat":"cs.IR","submitted_at":"2026-04-21T15:06:59+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"LoopCTR trains CTR models with recursive layer reuse and process supervision so that zero-loop inference outperforms baselines on public and industrial datasets.","context_count":1,"top_context_role":"method","top_context_polarity":"use_method","context_text":"residual withHyper-Connected Residualsthat extend the computation into n parallel streams with input-dependent adaptive fusion. Concretely, the single-stream hidden state h∈R d is replicated n times to form a multi-stream state H∈R n×d. Given H and a sub-layer function T (e.g., attention or FFN), the hyper-connected update takes the form: ˆH=A ⊤ r H|{z} residual mixing +B ⊤ · T (H⊤Am)⊤\u0001 | {z } layer contribution ,(6) where Am ∈R n×1 fuses the n streams into a single input for T , B∈R 1×n distributes T 's output back across streams, and Ar ∈R n×n governs the residual mixing among streams. Unlike the fixed 4 1:1 ratio in standard residuals, all three coefficients areinput-dependent. 
Let ¯H=RMSNorm(H) , each coefficient consists of a learnable static component plus a dynamic perturbation:"},{"citing_arxiv_id":"2604.18744","ref_index":25,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"Match-Any-Events: Zero-Shot Motion-Robust Feature Matching Across Wide Baselines for Event Cameras","primary_cat":"cs.CV","submitted_at":"2026-04-20T18:48:53+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"A single attention-based model trained on synthetic wide-baseline event data achieves zero-shot feature matching across unseen datasets with a reported 37.7% improvement over prior event matching methods.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.17286","ref_index":21,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"Depth Adaptive Efficient Visual Autoregressive Modeling","primary_cat":"cs.CV","submitted_at":"2026-04-19T06:59:34+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"DepthVAR adaptively allocates per-token computational depth in VAR models using a cyclic rotated scheduler and dynamic layer masking to achieve 2.3-3.1x inference speedup with minimal quality loss.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.17228","ref_index":3,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"Revisiting Auxiliary Losses for Conditional Depth Routing: An Empirical Study","primary_cat":"cs.LG","submitted_at":"2026-04-19T03:20:40+00:00","verdict":"CONDITIONAL","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"Removing utility regression and rank supervision auxiliary losses improves language modeling performance and training efficiency for conditional depth 
routing gates, and eliminates the advantage of a more complex JEPA-guided gate over a simple MLP gate.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.05222","ref_index":1,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"Adaptive Computation Depth via Learned Token Routing in Transformers","primary_cat":"cs.LG","submitted_at":"2026-04-18T02:04:47+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"TSA adds end-to-end differentiable per-token halting gates to transformers, enabling learned adaptive depth that saves 14-23% token-layer operations with under 0.5% quality loss on language modeling.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.15259","ref_index":7,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"Stability and Generalization in Looped Transformers","primary_cat":"cs.LG","submitted_at":"2026-04-16T17:35:49+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":8.0,"formal_verification":"none","one_line_summary":"Looped transformers with recall and outer normalization produce reachable, input-dependent fixed points with stable gradients, enabling generalization, while those without recall cannot; a new internal recall variant performs competitively or better.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.15408","ref_index":13,"ref_count":2,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Dispatch-Aware Ragged Attention for Pruned Vision Transformers","primary_cat":"cs.LG","submitted_at":"2026-04-16T15:48:44+00:00","verdict":"ACCEPT","verdict_confidence":"MODERATE","novelty_score":6.0,"formal_verification":"none","one_line_summary":"A new Triton kernel for dispatch-aware ragged attention delivers 
1.88-2.51× end-to-end throughput gains over standard padded attention and 9-12% over FlashAttention-2 varlen in pruned ViTs by lowering dispatch floor to ~24μs.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.14853","ref_index":21,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"Adaptive Test-Time Compute Allocation for Reasoning LLMs via Constrained Policy Optimization","primary_cat":"cs.LG","submitted_at":"2026-04-16T10:39:22+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"A Lagrangian-relaxation plus imitation-learning pipeline adaptively allocates test-time compute to LLMs, outperforming uniform baselines by up to 12.8% relative accuracy on MATH while staying within a fixed average budget.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"arXiv preprint arXiv:2512.02008, 2025. [19] Q. Zhang, F. Lyu, Z. Sun, L. Wang, W. Zhang, Z. Guo, Y . Wang, N. Muennighoff, I. King, X. Liu, and C. Ma. What, how, where, and how well? A survey on test-time scaling in large language models.arXiv preprint arXiv:2503.24235, 2025. [20] A. Jones. Scaling scaling laws with board games.arXiv preprint arXiv:2104.03113, 2021. [21] A. Graves. Adaptive computation time for recurrent neural networks.arXiv preprint arXiv:1603.08983, 2016. [22] S. Teerapittayanon, B. McDanel, and H.-T. Kung. BranchyNet: Fast inference via early exiting from deep neural networks. InInternational Conference on Pattern Recognition (ICPR), 2016. [23] G. Huang, D. Chen, T. Li, F. Wu, L. 
van der Maaten, and K."},{"citing_arxiv_id":"2604.11791","ref_index":12,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"A Mechanistic Analysis of Looped Reasoning Language Models","primary_cat":"cs.LG","submitted_at":"2026-04-13T17:55:36+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"Looped LLMs converge to distinct cyclic fixed points per layer, repeating feedforward-style inference stages across recurrences.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"B 𝑓 𝑗 (1) (Y ) . . . )) (9) = B 𝑓 𝑗 (0) (B 𝑓 𝑗 (𝑘−1) (. . . B 𝑓 𝑗 (1) (Y ) . . . )) (10) Now take Eq. (8) and apply B 𝑓 𝑗 (0) to both sides, defining a new fixed pointZ ′′ = B 𝑓 𝑗 (0) (Z ′): B 𝑓 𝑗 (0) (B 𝑓 𝑗 (𝑘−1) (B 𝑓 𝑗 (𝑘−2) (. . . B 𝑓 𝑗 (0) (Z ′) . . . ))) = B 𝑓 𝑗 (0) (Z ′) (11) B 𝑓 𝑗 (0) (B 𝑓 𝑗 (𝑘−1) (B 𝑓 𝑗 (𝑘−2) (. . . B 𝑓 𝑗 (1) (Z ′′) . . . ))) = Z ′′ (12) Combining Eq. (12) and Eq. (10), we see that there exists a fixed point Z ′′ such that B 𝑓 𝑗+1 (𝑘−1) (B 𝑓 𝑗+1 (𝑘−2) (. . . B 𝑓 𝑗+1 (0) (Z ′′) . . . )) = Z ′′ Therefore completing the induction step and proving the proposition. □ Proof of Proposition 4.2 Proof. 
Define Sℓ (X) := softmax( 𝐴ℓ (X)) = softmax \u0012 XW 𝑄W ⊤ 𝐾 X ⊤ √ 𝑑 \u0013 Let 𝐿sm be the Lipschitz constant of the row-wise softmax such that"},{"citing_arxiv_id":"2604.09870","ref_index":6,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"Relational Preference Encoding in Looped Transformer Internal States","primary_cat":"cs.LG","submitted_at":"2026-04-10T20:00:49+00:00","verdict":"CONDITIONAL","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Looped transformer hidden states encode preferences relationally via pairwise differences rather than independent pointwise classification, with the evaluator acting as an internal consistency probe on the model's own value system.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2603.22570","ref_index":43,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"CanViT: Toward Active-Vision Foundation Models","primary_cat":"cs.CV","submitted_at":"2026-03-23T21:05:21+00:00","verdict":"CONDITIONAL","verdict_confidence":"LOW","novelty_score":8.0,"formal_verification":"none","one_line_summary":"CanViT is the first task- and policy-agnostic AVFM pretrained via passive-to-active dense latent distillation on 13.2M scenes and 1B random glimpses, achieving 38.5% ADE20K mIoU in one glimpse and 84.5% ImageNet-1k top-1 after fine-tuning.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.03263","ref_index":25,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"LPC-SM: Local Predictive Coding and Sparse Memory for Long-Context Language Modeling","primary_cat":"cs.CL","submitted_at":"2026-03-12T21:21:51+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"LPC-SM is a hybrid architecture separating local attention, persistent memory, 
predictive correction, and control with ONT for memory writes, showing loss reductions on 158M-parameter models up to 4096-token contexts.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2602.13215","ref_index":9,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"When to Think Fast and Slow? AMOR: Adaptive Entropy Gate for Hybrid Models","primary_cat":"cs.AI","submitted_at":"2026-01-22T17:19:58+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"AMOR uses output entropy to gate attention in recurrent hybrids, matching full attention performance at roughly 22% attention invocations across 180M-1.5B models.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2509.26522","ref_index":7,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Entropy After </Think> for reasoning model early exiting","primary_cat":"cs.LG","submitted_at":"2025-09-30T16:59:37+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Entropy After </Think> (EAT) enables early exiting in reasoning LLMs by tracking entropy stabilization after a </think> token, cutting token use 12-22% on MATH500 and AIME2025 with no accuracy loss.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2508.16745","ref_index":24,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Beyond Memorization: Extending Reasoning Depth with Recurrence, Memory and Test-Time Compute Scaling","primary_cat":"cs.LG","submitted_at":"2025-08-22T18:57:08+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"In a cellular automata rule-inference task designed to block 
memorization, neural models achieve high next-step accuracy but accuracy falls sharply with longer reasoning chains; depth, recurrence, memory, and test-time compute extend the reachable depth but do not remove the bound.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2506.21734","ref_index":91,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Hierarchical Reasoning Model","primary_cat":"cs.AI","submitted_at":"2025-06-26T19:39:54+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"HRM is a recurrent architecture with high-level planning and low-level execution modules that reaches near-perfect accuracy on complex Sudoku, maze navigation, and ARC benchmarks using 27M parameters and 1000 samples without pre-training or CoT supervision.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2502.05171","ref_index":64,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Scaling up Test-Time Compute with Latent Reasoning: A Recurrent Depth Approach","primary_cat":"cs.LG","submitted_at":"2025-02-07T18:55:02+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"A recurrent-depth architecture enables language models to improve reasoning performance by iterating computation in latent space, achieving gains equivalent to much larger models on benchmarks.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"effective Adam learning rates per layer, optimizer RMS (Wortsman et al., 2023a), L2 and L1 parameter and gradient norms, recurrence statistics such as ||s_k − s_{k−1}|| / ||s_k||, ||s_k||, ||s_0 − s_k||. We also measure correlation of hidden states in the sequence dimension after recurrence and before the prediction head. 
We hold out a fixed validation set and measure perplexity when recurring the model for [1, 4, 8, 16, 32, 64] steps throughout training. B. Latent Space Visualizations On the next pages, we print a number of latent space visualizations in more details than was possible in Section 7. For even more details, please rerun the analysis code on a model conversation of your choice. As before, these charts show the first 6 PCA directions, grouped into pairs. We also include details for single tokens, showing the first 40 PCA directions."},{"citing_arxiv_id":"2404.02258","ref_index":7,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Mixture-of-Depths: Dynamically allocating compute in transformer-based language models","primary_cat":"cs.LG","submitted_at":"2024-04-02T19:28:11+00:00","verdict":"CONDITIONAL","verdict_confidence":"MODERATE","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Mixture-of-Depths enables transformers to dynamically allocate compute by routing only the top-k tokens through each layer's full computations, matching baseline performance with a fraction of the FLOPs per forward pass and up to 50% faster sampling.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2402.17762","ref_index":33,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Massive Activations in Large Language Models","primary_cat":"cs.CL","submitted_at":"2024-02-27T18:55:17+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"Massive activations are constant large values in LLMs that function as indispensable bias terms and concentrate attention probabilities on specific 
tokens.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2211.14275","ref_index":16,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Solving math word problems with process- and outcome-based feedback","primary_cat":"cs.LG","submitted_at":"2022-11-25T18:19:44+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"On GSM8K, outcome-based supervision achieves similar final-answer error rates to process-based with less labeling, but process-based or learned reward models are needed to reach 3.4% reasoning error among correct solutions.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2211.09085","ref_index":75,"ref_count":4,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Galactica: A Large Language Model for Science","primary_cat":"cs.CL","submitted_at":"2022-11-16T18:06:33+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"Galactica, a science-specialized LLM, reports higher scores than GPT-3, Chinchilla, and PaLM on LaTeX knowledge, mathematical reasoning, and medical QA benchmarks while outperforming general models on BIG-bench.","context_count":2,"top_context_role":"method","top_context_polarity":"use_method","context_text":"do not have to turn this on, and the model can also predict the output from running a program. For our experiments, we did not find the need to turn Python offloading on, and leave this aspect to future work. Longer term, an architecture change may be needed to support adaptive computation, so machines can have internal working memory on the lines of work such as adaptive computation time and PonderNet (Graves, 2016; Banino et al., 2021). 
In this paper, we explore the <work> external working memory approach as a Question: A needle 35 mm long rests on a water surface at 20 °C. What force over and above the needle's weight is required to lift the needle from contact with the water surface? σ = 0.0728m. <work> σ = 0.0728 N/m"},{"citing_arxiv_id":"2206.07682","ref_index":30,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"Emergent Abilities of Large Language Models","primary_cat":"cs.CL","submitted_at":"2022-06-15T17:32:01+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Emergent abilities are capabilities present in large language models but absent in smaller ones and cannot be predicted by extrapolating smaller model performance.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"is using sparse mixture-of-experts architectures (Lepikhin et al., 2021; Fedus et al., 2021; Artetxe et al., 2021; Zoph et al., 2022), which scale up the number of parameters in a model while maintaining constant computational costs for an input. Other directions for better computational efficiency could involve variable amounts of compute for different inputs (Graves, 2016; Dehghani et al., 2018), using more localized learning strategies than backpropagation through all weights in a neural network (Jaderberg et al., 2017), and augmenting models with external memory (Guu et al., 2020; Borgeaud et al., 2021; Wu et al., 2022b, inter alia). 
These nascent directions have already shown promise in many settings but have not yet seen widespread"},{"citing_arxiv_id":"2201.02177","ref_index":4,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"Grokking: Generalization Beyond Overfitting on Small Algorithmic Datasets","primary_cat":"cs.LG","submitted_at":"2022-01-06T18:43:37+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":8.0,"formal_verification":"none","one_line_summary":"Neural networks exhibit grokking on small algorithmic datasets, achieving perfect generalization well after overfitting.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2112.00114","ref_index":8,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Show Your Work: Scratchpads for Intermediate Computation with Language Models","primary_cat":"cs.LG","submitted_at":"2021-11-30T21:32:46+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":8.0,"formal_verification":"none","one_line_summary":"Training language models to generate intermediate computation steps on a scratchpad enables them to perform multi-step tasks such as long addition and arbitrary program execution that they otherwise fail at.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"1806.07366","ref_index":19,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Neural Ordinary Differential Equations","primary_cat":"cs.LG","submitted_at":"2018-06-19T17:50:12+00:00","verdict":"ACCEPT","verdict_confidence":"HIGH","novelty_score":8.0,"formal_verification":"none","one_line_summary":"Neural networks are redefined as continuous dynamical systems by learning the derivative of the hidden state with a neural network and integrating it with an ODE 
solver.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null}],"limit":50,"offset":0}