{"total":27,"items":[{"citing_arxiv_id":"2605.31268","ref_index":22,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Mellum2 Technical Report","primary_cat":"cs.CL","submitted_at":"2026-05-29T13:01:11+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":3.0,"formal_verification":"none","one_line_summary":"Mellum 2 is a 12B MoE model with 2.5B active parameters, trained on 10.6T tokens with MoE, GQA, SWA, and MTP, then post-trained into Instruct and Thinking variants, claimed competitive with 4B-14B models at 2.5B compute.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.29727","ref_index":23,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Bastion: Budget-Aware Speculative Decoding with Tree-structured Block Diffusion Drafting","primary_cat":"cs.LG","submitted_at":"2026-05-28T10:21:34+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"BASTION is a budget-aware speculative decoding framework with adaptive tree-structured block diffusion drafting that reports up to 6.61x speedup and 39% improvement over block-diffusion baselines.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.22422","ref_index":30,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"FastTab: A Fast Table Recognizer with a Tiny Recursive Module and 1D Transformers","primary_cat":"cs.CV","submitted_at":"2026-05-21T12:42:58+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"FastTab combines a Tiny Recursive Module and axial 1D Transformer encoders to predict table grids, headers, and cell spans directly, achieving competitive accuracy on four benchmarks with low-latency inference.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.20104","ref_index":11,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Draft Less, Retrieve More: Hybrid Tree Construction for Speculative Decoding","primary_cat":"cs.LG","submitted_at":"2026-05-19T16:55:48+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"Graft combines pruning and retrieval in a sequential mechanism to build hybrid draft trees for speculative decoding, delivering up to 5.41× speedup and 21.8% better average speedup than EAGLE-3 on large models.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.16709","ref_index":13,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Covert Multi-bit LLM Watermarking: An Information Theory and Coding Approach","primary_cat":"cs.IT","submitted_at":"2026-05-15T23:46:22+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Characterizes the exact capacity of multi-bit covert LLM watermarking via Gelfand-Pinsker and channel synthesis, then gives a polar-code algorithm achieving 0.375 bits/token at under 10% BER with negligible perplexity impact.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.15871","ref_index":114,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Agentic Discovery of Neural Architectures: AIRA-Compose and AIRA-Design","primary_cat":"cs.AI","submitted_at":"2026-05-15T11:40:41+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Multi-agent LLM systems discover new Transformer and hybrid architectures that outperform Llama 3.2 at 1B scale and approach human SOTA on long-range benchmarks.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.14227","ref_index":21,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"DT-Transformer: A Foundation Model for Disease Trajectory Prediction on a Real-world Health System","primary_cat":"cs.LG","submitted_at":"2026-05-14T00:45:04+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":4.0,"formal_verification":"none","one_line_summary":"DT-Transformer predicts next disease events with median age- and sex-stratified AUC 0.871 across 896 categories on held-out and prospective data from a 1.7M-patient multi-hospital EHR dataset.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.12460","ref_index":3,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"Multi-Stream LLMs: Unblocking Language Models with Parallel Streams of Thoughts, Inputs and Outputs","primary_cat":"cs.LG","submitted_at":"2026-05-12T17:47:41+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Language models trained on parallel streams of computation can overcome single-stream bottlenecks in autonomous agents by enabling simultaneous reading, thinking, and acting.","context_count":1,"top_context_role":"method","top_context_polarity":"unclear","context_text":"Even without this approximation, interleaved packing yields more contiguous valid regions and fewer irregular partially active blocks, making it more amenable to FlexAttention- style (Dong et al., 2024) tiled traversal. Training Objective.With the interleaved packing in place, the model can be trained using standard cross-entropy: L= HX h=1 1 |Th| X t∈Th −logp θ(y(h) t |x) \u0001 ,(3) 6 where x denotes the full multi-stream context and Th is the set of valid token positions in stream h. We also explore a stream-contrastive variant that upweights tokens benefiting most from cross-stream context, which helps mitigate training loss imbalance across streams (Details are given in Section B.5). 3.4 Inference: Synchronous Multi-Stream Decoding"},{"citing_arxiv_id":"2605.12456","ref_index":6,"ref_count":2,"confidence":0.98,"is_internal_anchor":true,"paper_title":"TextSeal: A Localized LLM Watermark for Provenance & Distillation Protection","primary_cat":"cs.CR","submitted_at":"2026-05-12T17:44:41+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"TextSeal provides a localized, distortion-free LLM watermark that outperforms baselines in detection strength, remains effective in mixed human-AI text, preserves model performance, and transfers through distillation for provenance tracking.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"Hindi 1,000 98 82 435 278 107 Japanese 1,000 136 137 275 228 224 Overall6,000 672 591 2,874 1,157 706 Table 9 TOST equivalence test results(∆ = 5%,α= 0.05). Proportions computed over allNsamples. 90% CI: Wald interval for the differenceP(WM)−P(Base). LanguageN ˆd90% CIp TOST Result English 2,000+1.50% [+0.2%,+2.9%]<0.001Equivalent Arabic 1,000+1.40% [−1.8%,+4.6%] 0.033Equivalent Chinese 1,000+2.20% [+0.1%,+4.3%] 0.013Equivalent Hindi 1,000+1.60% [−0.6%,+3.8%] 0.006Equivalent Japanese 1,000−0.10% [−2.8%,+2.6%] 0.002Equivalent Overall6,000+1.35% [+0.4%,+2.3%]<0.001Equivalent Net Win Rate.We define thenet win rateas Net Win Rate= nWM −n Base N ,(32) wheren WM andn Base are the number of samples where the watermarked or baseline response was"},{"citing_arxiv_id":"2605.11577","ref_index":9,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"BitLM: Unlocking Multi-Token Language Generation with Bitwise Continuous Diffusion","primary_cat":"cs.CL","submitted_at":"2026-05-12T06:02:59+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"BitLM replaces per-token softmax with bitwise continuous diffusion inside causal blocks to generate multiple tokens in parallel while preserving autoregressive structure.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"BitLM models each target block as a denoising problem (Lipman et al., 2022; Ho et al., 2020; Song et al., 2020) in continuous binary space. Given a clean target block A(n) 0 ∈ {− 1, 1}m×B, we sample a timestept∼ U[0, 1]and Gaussian noise ϵ∼ N(0,I m×B ), (8) and construct a noisy analog-bits state by straight-line interpolation: A(n) t = (1−t)A (n) 0 +tϵ∈R m×B. (9) Thus t= 0 corresponds to a clean binary block and t= 1 corresponds to pure Gaussian noise. Given a contextual condition tensor C(n−1) ∈R m×d for the next block, the diffusion head predicts the clean block from its noisy version: ˆA(n) 0 =DiffHead θ \u0010 A(n) t ,t;C (n−1) \u0011 ∈R m×B. (10) We keep the denoiser itself deliberately lightweight; the contribution of BitLM lies in the"},{"citing_arxiv_id":"2605.09630","ref_index":31,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"Scratchpad Patching: Decoupling Compute from Patch Size in Byte-Level Language Models","primary_cat":"cs.CL","submitted_at":"2026-05-10T16:18:22+00:00","verdict":"CONDITIONAL","verdict_confidence":"MODERATE","novelty_score":7.0,"formal_verification":"none","one_line_summary":"Scratchpad Patching decouples compute from patch size in byte-level language models by inserting entropy-triggered scratchpads to update patch context dynamically.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.00604","ref_index":35,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"Affinity Is Not Enough: Recovering the Free Energy Principle in Mixture-of-Experts","primary_cat":"cs.LG","submitted_at":"2026-05-01T12:18:55+00:00","verdict":"CONDITIONAL","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"Adding temporal memory via LIF, precision-weighted gating, and anticipatory prediction to MoE routers recovers effective expert selection at distribution transitions, with ablation confirming a super-additive beta-ant interaction.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"backward function while keeping the forward pass exact, enabling competitive performance on tem- poral tasks. The snnTorch library [2] provides a practical implementation substrate with learnable β, adaptive thresholds, and BPTT through spike recurrences. SNN-inspired Transformers have been explored - Spikformer [34] replaces the softmax attention with a spike-based attention mechanism; SpikeGPT [35] incorporates spiking dynamics into au- toregressive language models. Recent concurrent work (arXiv:2412.05540) applies LIF dynamics within MoE expert networks, using spiking activations inside the FFN experts themselves. Our application is distinct: we apply LIF membrane dynamics to the routing gate (not the experts), using the continuous decay recurrence without spiking to maintain routing context across tokens."},{"citing_arxiv_id":"2604.26752","ref_index":11,"ref_count":3,"confidence":0.9,"is_internal_anchor":true,"paper_title":"GLM-5V-Turbo: Toward a Native Foundation Model for Multimodal Agents","primary_cat":"cs.CV","submitted_at":"2026-04-29T14:49:37+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"GLM-5V-Turbo integrates multimodal perception as a core part of reasoning and execution for agentic tasks, reporting strong results in visual tool use and multimodal coding while keeping text-only performance competitive.","context_count":1,"top_context_role":"method","top_context_polarity":"use_method","context_text":"an 8-billion bilingual (Chinese-English) image-text corpus to enhance cross-lingual understanding. We continue to optimize with Muon, assigning module-specific learning rates and decay schedules to the vision, text, and projection components. 2.2 Multimodal Multi-Token Prediction We proposeMultimodal Multi-Token Prediction (MMTP), a multimodal extension of multi-token prediction (MTP) [11], designed to support both text-only and multimodal inputs while remaining friendly to large-scale infrastructure. The goal is to preserve acceptable length as well as training and inference efficiency in multimodal settings. In standard text-only MTP, prefix tokens can be passed into the MTP head directly through token IDs and embedded with the word embedding layer."},{"citing_arxiv_id":"2604.26412","ref_index":5,"ref_count":2,"confidence":0.9,"is_internal_anchor":true,"paper_title":"When Hidden States Drift: Can KV Caches Rescue Long-Range Speculative Decoding?","primary_cat":"cs.CL","submitted_at":"2026-04-29T08:25:01+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"KV cache reuse improves long-range draft acceptance in speculative decoding but delivers only marginal end-to-end speedups due to drafter limitations.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.25317","ref_index":17,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"FusionCIM: Accelerating LLM Inference with Fusion-Driven Computing-in-Memory Architecture","primary_cat":"cs.AR","submitted_at":"2026-04-28T07:27:58+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"FusionCIM is a fusion-driven CIM accelerator for LLM inference that maps QKT to IP-CIM and PV to OP-CIM, uses QO-stationary dataflow, and applies pattern-aware online softmax, delivering up to 3.86x energy savings and 1.98x speedup on LLaMA-3 at 29.4 TOPS/W.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2603.04791","ref_index":19,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"Timer-S1: A Billion-Scale Time Series Foundation Model with Serial Scaling","primary_cat":"cs.AI","submitted_at":"2026-03-05T04:13:57+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Timer-S1 is a released 8.3B-parameter MoE time series model that achieves state-of-the-art MASE and CRPS scores on GIFT-Eval using serial scaling and Serial-Token Prediction.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2603.00110","ref_index":17,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"Learning Physics from Pretrained Video Models: A Multimodal Continuous and Sequential World Interaction Models for Robotic Manipulation","primary_cat":"cs.RO","submitted_at":"2026-02-18T14:58:18+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"PhysGen uses video models to learn physics for robots, outperforming baselines by up to 13.8% on Libero and matching specialized models in real-world tasks.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2602.15763","ref_index":13,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"GLM-5: from Vibe Coding to Agentic Engineering","primary_cat":"cs.LG","submitted_at":"2026-02-17T17:50:56+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"GLM-5 is a foundation model that claims state-of-the-art results on coding benchmarks and superior performance on end-to-end software engineering tasks via new asynchronous RL methods and cost-saving DSA.","context_count":1,"top_context_role":"method","top_context_polarity":"use_method","context_text":"This keeps the training computation and the number of parameters constant while decreasing the decoding computation. The variant, denoted as MLA-256 in Table 1, matches the performance of MLA under Muon Split. Table 2: Comparison of accept lengths of DeepSeek-V3.2 and GLM-5. Model Accept Length DeepSeek-V3.2 2.55 GLM-5 2.76 Multi-token Prediction with Parameter Sharing. Multi-token prediction (MTP) [13; 25] increases the per- formance of base models and acts as draft models for speculative decoding [20]. However, during training, to predict the next n tokens, n MTP layers are required. As a result, the memory usage of MTP parameters and the kv cache scales linearly with the number of speculative steps. Instead, DeepSeek-V3 is trained with a single MTP"},{"citing_arxiv_id":"2602.04289","ref_index":6,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"Proxy Compression for Language Modeling","primary_cat":"cs.CL","submitted_at":"2026-02-04T07:36:46+00:00","verdict":"CONDITIONAL","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Proxy compression trains language models on both raw bytes and compressed sequences to enable efficient training with raw-byte inference at test time.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2601.22925","ref_index":19,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"BEAR: Towards Beam-Search-Aware Optimization for Recommendation with Large Language Models","primary_cat":"cs.IR","submitted_at":"2026-01-30T12:45:02+00:00","verdict":null,"verdict_confidence":null,"novelty_score":null,"formal_verification":null,"one_line_summary":null,"context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2601.14671","ref_index":12,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"Mirai: Autoregressive Visual Generation Needs Foresight","primary_cat":"cs.CV","submitted_at":"2026-01-21T05:33:23+00:00","verdict":"CONDITIONAL","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Mirai injects future-token foresight into autoregressive visual generators, accelerating convergence up to 10x and cutting ImageNet FID from 5.34 to 4.34.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2601.02780","ref_index":19,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"MiMo-V2-Flash Technical Report","primary_cat":"cs.CL","submitted_at":"2026-01-06T07:31:47+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"MiMo-V2-Flash is a 309B/15B MoE model trained on 27T tokens with hybrid attention and multi-teacher on-policy distillation that matches larger models like DeepSeek-V3.2 while enabling 2.6x faster decoding via repurposed MTP layers.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2509.24527","ref_index":27,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"Training Agents Inside of Scalable World Models","primary_cat":"cs.AI","submitted_at":"2025-09-29T09:42:27+00:00","verdict":"CONDITIONAL","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"Dreamer 4 is the first agent to obtain diamonds in Minecraft from only offline data by reinforcement learning inside a scalable world model that accurately predicts game mechanics.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2508.16745","ref_index":21,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Beyond Memorization: Extending Reasoning Depth with Recurrence, Memory and Test-Time Compute Scaling","primary_cat":"cs.LG","submitted_at":"2025-08-22T18:57:08+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"In a cellular automata rule-inference task designed to block memorization, neural models achieve high next-step accuracy but accuracy falls sharply with longer reasoning chains; depth, recurrence, memory, and test-time compute extend the reachable depth but do not remove the bound.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2508.06471","ref_index":12,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"GLM-4.5: Agentic, Reasoning, and Coding (ARC) Foundation Models","primary_cat":"cs.CL","submitted_at":"2025-08-08T17:21:06+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":4.0,"formal_verification":"none","one_line_summary":"GLM-4.5, a 355B-parameter MoE model with hybrid reasoning, scores 70.1% on TAU-Bench, 91.0% on AIME 24, and 64.2% on SWE-bench Verified while ranking 3rd overall and 2nd on agentic benchmarks.","context_count":1,"top_context_role":"method","top_context_polarity":"use_method","context_text":"Counterintuitively, while this increased head count does not improve training loss compared to models with fewer heads, it consistently improves performance on reasoning benchmarks such as MMLU and BBH. We also incorporate QK-Norm [15] to stabilize the range of attention logits. For both GLM-4.5 and GLM-4.5-Air, we add an MoE layer as the MTP (Multi-Token Prediction) layer [12] to support speculative decoding during inference. Table 1: Model architecture of GLM-4.5 and GLM-4.5-Air. When counting parameters, for GLM-4.5 and GLM-4.5-Air, we include the parameters of MTP layers but not word embeddings and the output layer. Model GLM-4.5 GLM-4.5-Air DeepSeek-V3 Kimi K2 # Total Parameters 355B 106B 671B 1043B # Activated Parameters 32B 12B 37B 32B"},{"citing_arxiv_id":"2507.19247","ref_index":13,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"A Markov Categorical Framework for Language Modeling","primary_cat":"cs.LG","submitted_at":"2025-07-25T13:14:03+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"A Markov category framework for language models provides an information-theoretic rationale for speculative decoding and shows that a quadratic surrogate to negative log-likelihood induces generalized CCA alignment in linear-softmax heads after normalization.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2507.01449","ref_index":24,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"LogitSpec: Accelerating Retrieval-based Speculative Decoding via Next Next Token Speculation","primary_cat":"cs.CL","submitted_at":"2025-07-02T08:08:30+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"LogitSpec accelerates retrieval-based speculative decoding by speculating the next-next token from the last logit and retrieving relevant references for both next and next-next tokens, reporting up to 2.61x speedup and 3.28 mean accepted tokens.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null}],"limit":50,"offset":0}