VibeServe demonstrates that AI agents can synthesize bespoke LLM serving systems end-to-end, remaining competitive with vLLM in standard settings while outperforming it in six non-standard scenarios involving unusual models, workloads, or hardware.
hub Canonical reference
Gated Delta Networks: Improving Mamba2 with Delta Rule
Canonical reference. 82% of citing Pith papers cite this work as background.
abstract
Linear Transformers have gained attention as efficient alternatives to standard Transformers, but their performance in retrieval and long-context tasks has been limited. To address these limitations, recent work has explored two distinct mechanisms: gating for adaptive memory control and the delta update rule for precise memory modifications. We observe that these mechanisms are complementary: gating enables rapid memory erasure while the delta rule facilitates targeted updates. Building on this insight, we introduce the gated delta rule and develop a parallel training algorithm optimized for modern hardware. Our proposed architecture, Gated DeltaNet, consistently surpasses existing models like Mamba2 and DeltaNet across multiple benchmarks, including language modeling, common-sense reasoning, in-context retrieval, length extrapolation, and long-context understanding. We further enhance performance by developing hybrid architectures that combine Gated DeltaNet layers with sliding window attention or Mamba2 layers, achieving both improved training efficiency and superior task performance.
hub tools
citation-role summary
citation-polarity summary
representative citing papers
FlashMorph formulates hybrid layer selection as budget-constrained optimization, trains per-layer gates on synthetic retrieval data with linearization regularization, then discretizes and distills to produce efficient hybrid architectures.
Accumulated orthogonal transformations create a finite mixing window via incoherence after finite steps, enabling length extrapolation that eventually degrades without far-mass control.
Execution-state capsules enable graph-bound full-state checkpointing and sub-millisecond restore for LLMs including KV and recurrent states, yielding 3.9x-27x TTFT speedups in on-device physical-AI serving.
LongSpike integrates fractional-order state-space modeling into spiking neural networks, enabling better long-sequence performance than prior SNNs on LRA, WikiText-103, and Speech Commands benchmarks while retaining sparse computation.
A sleep mechanism with N offline recurrent passes consolidates context into fast weights, improving performance on reasoning tasks where standard transformers fail.
Chem-GMNet uses sphere-native embeddings, DualSKA attention, and SH-FFN layers to match or beat ChemBERTa-2 on MoleculeNet tasks with fewer parameters and sometimes no pretraining.
SpikeProphecy decomposes spike-count forecasting performance into temporal fidelity, spatial pattern accuracy, and magnitude-invariant alignment, revealing reproducible brain-region predictability rankings and a sub-Poisson evaluation floor across seven model families on 105 Neuropixels sessions.
Mixture of Layers replaces monolithic transformer blocks with routed thin parallel blocks using hybrid attention that combines a shared softmax block for global context with Gated DeltaNet linear attention in the routed blocks.
SATFormer uses a context-dependent gate for selective reuse of early Transformer representations, improving validation loss and zero-shot accuracy especially on retrieval benchmarks.
Preconditioned delta-rule models with a diagonal curvature approximation improve upon standard DeltaNet, GDN, and KDA by better approximating the test-time regression objective.
Mem3R achieves better long-sequence 3D reconstruction by decoupling tracking and mapping with a hybrid memory of TTT-updated MLP and explicit tokens, reducing model size and trajectory errors.
S0 tuning optimizes initial recurrent states in hybrid models to outperform LoRA with zero inference cost on HumanEval and partial cross-domain transfer.
Aurora unifies speculative decoder training and serving via asynchronous RL on inference traces, delivering 1.5x day-0 speedup on frontier models and 1.25x adaptation gains on distribution shifts.
Exact Flow Linear Attention derives a closed-form exact update for delta-rule linear attention from continuous-time dynamics, removing Euler discretization error while preserving linear complexity and structure.
One-step gradient delay is optimizer-dependent rather than intrinsically unstable, with Muon and error-feedback correction enabling async pipeline parallelism to match synchronous performance on models up to 10B parameters.
A hybrid attention mechanism with editable request-local memory slots and sparse fallback achieves high accuracy on synthetic overwrite, version, and anti-pollution tasks where pure fixed-state or sparse methods fail, while identifying open-domain selection as the remaining bottleneck.
Reinforcement learning after SFT conversion narrows the performance gap between sliding-window attention and full self-attention on math reasoning benchmarks while preserving linear complexity.
GBLA extends kernelized linear attention with local causal mixing, key gating, and gated RMSNorm; a 1:2 hybrid with self-attention matches full bidirectional self-attention quality on Yandex Music data while delivering up to 8.2x speedup at length 32768.
SMT reduces RNN training to supervised learning on memory transitions (m_t, x_{t+1}) to m_{t+1} obtained from a Transformer encoder, enabling time-parallel training with O(1) gradient paths.
CLSA shares both KV cache and routing indices across decoder layers to amortize top-k selection, delivering up to 7.6x decoding speedup and 17.1x throughput at 128K context while preserving accuracy.
Deeper transformer layers benefit from context-free token-specific value vectors in a Bank of Values lookup table, improving performance over standard attention with less compute.
Blurry Window Attention stores a frequency window and reconstructs blurry KV history via Dirichlet kernel interpolation, achieving 8x better state efficiency than sliding window attention on the MQAR synthetic task.
The design-model framework unifies sub-quadratic sequence models as Bayesian filters and introduces a covariance-tracking Bayesian Layer that improves retrieval robustness beyond training regimes on MQAR and RULER benchmarks.
citing papers explorer
-
Forget BIT, It is All about TOKEN: Towards Semantic Information Theory for LLMs
Proposes a semantic information theory for LLMs that substitutes the token for the bit as the atomic carrier of meaning, recasts the Transformer as an energy-based model, and derives directed rate-distortion and rate-reward functions using Massey's directed information.
-
Nirvana: A Specialized Generalist Model With Task-Aware Memory Mechanism
Nirvana adds a task-aware memory trigger and updater to specialized generalist models, achieving strong general benchmark results, lowest perplexity in biomedicine/finance/law, and improved MRI reconstruction fidelity.
-
StateX: Enhancing RNN Recall via Post-training State Expansion
StateX post-trains RNNs to expand recurrent state size, improving recall and in-context learning with negligible parameter growth.
-
UltraQuant: 4-bit KV Caching for Context-Heavy Agents
UltraQuant applies 4-bit KV caching with TurboQuant-style rotation and custom AMD kernels to context-heavy agent workloads, delivering 3.47x P50 TTFT reduction in cache-pressured rounds and 1.63x throughput gain over FP8.
-
Reasoning Primitives in Hybrid and Non-Hybrid LLMs: Do Architectural Differences Yield Advantages in State-Tracking and Recall?
Reasoning-token augmentation dominates architectural bias for state-based recall tasks; hybrid advantages are narrow and task-dependent rather than uniform.
-
On The Application of Linear Attention in Multimodal Transformers
Linear attention delivers significant computational savings in multimodal transformers and follows the same scaling laws as softmax attention on ViT models trained on LAION-400M with ImageNet-21K zero-shot validation.
-
Hybrid Architectures for Language Models: Systematic Analysis and Design Insights
This work systematically compares inter-layer and intra-layer hybridization strategies for combining self-attention and Mamba-style state space models, evaluating them on language modeling, downstream tasks, long-context performance, scaling, and efficiency to derive optimal design recipes.
-
Lifelong In-Context Learning with Transformers Requires Parametric Forms of Attention
Argues that parametric attention forms are necessary for lifelong in-context learning in transformers to maintain constant memory footprint over arbitrary sequence lengths.
-
A 35B Hybrid-Attention Mixture-of-Experts Model on a 6GB 2011 GPU: Hand-Written 4-bit CUDA Inference for Fermi
Engineers implement hybrid CPU-GPU 4-bit inference for a 35B hybrid-attention MoE on a 2011 Fermi GPU, reporting 34% lower prefill latency and 3x decode throughput via custom kernels and expert pinning.
-
Mellum2 Technical Report
Mellum 2 is a 12B MoE model with 2.5B active parameters, trained on 10.6T tokens with MoE, GQA, SWA, and MTP, then post-trained into Instruct and Thinking variants, claimed competitive with 4B-14B models at 2.5B compute.
- Olmo Hybrid: From Theory to Practice and Back
- RAT+: Train Dense, Infer Sparse -- Recurrence Augmented Attention for Dilated Inference