Scratchpad Patching decouples compute from patch size in byte-level language models by inserting entropy-triggered scratchpads to update patch context dynamically.
Title resolution pending
8 Pith papers cite this work. Polarity classification is still indexing.
citation-role summary
citation-polarity summary
roles
background 2polarities
background 2representative citing papers
Griffin hybrid model matches Llama-2 performance while trained on over 6 times fewer tokens and offers lower inference latency with higher throughput.
LDARNet learns adaptive token boundaries via dynamic chunking in a genomic foundation model and reports gains on histone modification tasks over larger models.
A compact Mamba-2 model performs end-to-end byte-level network traffic classification without tokenization or pre-training and remains competitive with substantially larger pre-trained systems.
Gated linear attention Transformers achieve competitive language modeling results with linear-time inference, superior length generalization, and higher training throughput than Mamba.
Byte modeling incurs greater scaling overhead for masked diffusion than autoregressive models because the diffusion objective destroys local byte contiguity needed to resolve semantics.
3DMambaComplete applies the Mamba model to point cloud completion via hyperpoint generation, spatial spreading, and mesh deformation, claiming better results than prior methods on benchmarks.
The paper consolidates existing research on Mamba models, their architecture variants, adaptations to different data modalities, and applications across domains.
citing papers explorer
-
Scratchpad Patching: Decoupling Compute from Patch Size in Byte-Level Language Models
Scratchpad Patching decouples compute from patch size in byte-level language models by inserting entropy-triggered scratchpads to update patch context dynamically.
-
Griffin: Mixing Gated Linear Recurrences with Local Attention for Efficient Language Models
Griffin hybrid model matches Llama-2 performance while trained on over 6 times fewer tokens and offers lower inference latency with higher throughput.
-
LDARNet: DNA Adaptive Representation Network with Learnable Tokenization for Genomic Modeling
LDARNet learns adaptive token boundaries via dynamic chunking in a genomic foundation model and reports gains on histone modification tasks over larger models.
-
MambaNetBurst: Direct Byte-level Network Traffic Classification without Tokenization or Pretraining
A compact Mamba-2 model performs end-to-end byte-level network traffic classification without tokenization or pre-training and remains competitive with substantially larger pre-trained systems.
-
Gated Linear Attention Transformers with Hardware-Efficient Training
Gated linear attention Transformers achieve competitive language modeling results with linear-time inference, superior length generalization, and higher training throughput than Mamba.
-
The Efficiency Gap in Byte Modeling
Byte modeling incurs greater scaling overhead for masked diffusion than autoregressive models because the diffusion objective destroys local byte contiguity needed to resolve semantics.
-
3DMambaComplete: Exploring Structured State Space Model for Point Cloud Completion
3DMambaComplete applies the Mamba model to point cloud completion via hyperpoint generation, spatial spreading, and mesh deformation, claiming better results than prior methods on benchmarks.
-
A Survey of Mamba
The paper consolidates existing research on Mamba models, their architecture variants, adaptations to different data modalities, and applications across domains.