GLU Variants Improve Transformer
116 Pith papers cite this work.
abstract
Gated Linear Units (arXiv:1612.08083) consist of the component-wise product of two linear projections, one of which is first passed through a sigmoid function. Variations on GLU are possible, using different nonlinear (or even linear) functions in place of sigmoid. We test these variants in the feed-forward sublayers of the Transformer (arXiv:1706.03762) sequence-to-sequence model, and find that some of them yield quality improvements over the typically-used ReLU or GELU activations.
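For concreteness, here is a minimal PyTorch sketch of the feed-forward sublayer variants the abstract describes: a gate projection passed through sigmoid (or ReLU, GELU, Swish, or no nonlinearity at all), multiplied component-wise with a second linear projection, then mapped back with an output projection. Class and argument names (GLUFeedForward, d_model, d_ff, activation) are illustrative, not taken from any official code release.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class GLUFeedForward(nn.Module):
    """FFN_GLU(x) = (act(x W) * x V) W2, in the paper's bias-free form."""

    ACTIVATIONS = {
        "glu": torch.sigmoid,     # GLU:      sigmoid(xW) * xV
        "bilinear": lambda t: t,  # Bilinear: (xW) * xV, no nonlinearity
        "reglu": F.relu,          # ReGLU:    relu(xW) * xV
        "geglu": F.gelu,          # GEGLU:    gelu(xW) * xV
        "swiglu": F.silu,         # SwiGLU:   swish(xW) * xV
    }

    def __init__(self, d_model: int, d_ff: int, activation: str = "geglu"):
        super().__init__()
        self.w = nn.Linear(d_model, d_ff, bias=False)   # gate projection
        self.v = nn.Linear(d_model, d_ff, bias=False)   # value projection
        self.w2 = nn.Linear(d_ff, d_model, bias=False)  # output projection
        self.act = self.ACTIVATIONS[activation]

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.w2(self.act(self.w(x)) * self.v(x))

# The GLU variants carry three weight matrices instead of the usual two, so the
# paper shrinks d_ff to roughly two thirds of its normal size to keep the
# parameter count comparable (e.g. 2048 instead of 3072 for d_model = 768).
ffn = GLUFeedForward(d_model=768, d_ff=2048, activation="swiglu")
out = ffn(torch.randn(2, 16, 768))  # (batch, seq, d_model)
```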
representative citing papers
LLaDA is a scalable diffusion-based language model that matches autoregressive LLMs like LLaMA3 8B on tasks and surpasses GPT-4o on reversal poem completion.
Mamba is a linear-time sequence model using input-dependent selective SSMs that achieves SOTA results across modalities and matches twice-larger Transformers on language modeling with 5x higher inference throughput.
PSR-NQS makes recurrent neural quantum states scalable for variational Monte Carlo by using parallel scan recurrence, reaching accurate results on 52x52 two-dimensional lattices.
GeoFlowVLM learns joint distributions of l2-normalized VLM embeddings on the product hypersphere via Riemannian flow matching to expose both aleatoric and epistemic uncertainty through derived entropy and typicality scores.
Neural statistical functions use prefix statistics to unify and directly predict statistical quantities over continuous ranges from pre-trained single-sample models without repeated sampling.
DLR-Lock locks open-weight LLMs against unauthorized fine-tuning by swapping MLPs for deep low-rank residual networks that inflate backprop memory and complicate optimization, yet preserve original capabilities via module-wise distillation.
Temporarily reducing the learning rate on upper-layer query and key projections during early GPT pretraining prevents premature attention specialization and improves model performance.
Chirality emerges in SMILES translation models through an abrupt encoder-centered reorganization of representations after a long plateau, identified via checkpoint analysis and ablation.
GibbsTTS combines a training-free kinetic-optimal scheduler with finite-step moment correction in MI-DFM to deliver top naturalness and strong speaker similarity in zero-shot TTS.
BLT-D, BLT-S, and BLT-DV use block-wise diffusion training and speculative verification to enable parallel byte generation in byte-level LMs, cutting memory-bandwidth cost by over 50%.
Every fixed finite feedforward neural network definable in an o-minimal structure has finite sample complexity in the agnostic PAC setting.
A 53K-parameter model generates 95% valid SMILES on ZINC-250K, outperforming larger models, by resolving chemical constraints in fixed order: brackets first, rings second, valence last.
DACG-IR adds a lightweight degradation-aware module that generates prompts to adaptively gate attention temperature, output features, and spatial-channel fusion in an encoder-decoder network for unified image restoration.
LeGS turns density control in 3D Gaussian Splatting into a learnable RL policy whose reward is derived from a closed-form sensitivity analysis that measures each Gaussian's marginal contribution to reconstruction quality.
A single end-to-end Transformer model unifies stellar labels from heterogeneous spectroscopic surveys into a self-consistent scale without post-hoc recalibration.
Skip-connected MLPs and residual-free MLPs of equal width represent generically disjoint function classes for common activations, with explicit impossibility proofs and a non-generic absorption condition for ReLU and GELU.
WildSplatter jointly learns 3D Gaussians and appearance embeddings from unconstrained photo collections to enable fast feed-forward reconstruction and flexible lighting control in 3D Gaussian Splatting.
Diffusion models show grokking on modular addition by composing periodic operand representations in simple data regimes or by separating arithmetic computation from visual denoising across timesteps in varied regimes.
A memory-efficient SMC clustering method decomposes problems into approximately independent subproblems to handle large-scale online clustering with complex distributions.
Mamba-MPC stabilizes and tracks references on SISO and MIMO systems in simulation and hardware while outperforming LSTM-MPC with faster computation.
MIXAR is the first autoregressive pixel-based language model for eight languages and scripts, with empirical gains on multilingual tasks, robustness to unseen languages, and further improvements when scaled to 0.5B parameters.
The first survey on Attention Sink in Transformers structures the literature around fundamental utilization, mechanistic interpretation, and strategic mitigation.
An autoregressive diffusion model on sparse point trajectories predicts multi-modal future scene dynamics from single images with orders-of-magnitude faster sampling than dense video simulators while matching accuracy.
citing papers explorer
-
CHE-TKG: Collaborative Historical Evidence and Evolutionary Dynamics Learning for Temporal Knowledge Graph Reasoning
CHE-TKG is a collaborative dual-view model that jointly captures historical evidence and evolutionary dynamics in temporal knowledge graphs via separate encoders and contrastive alignment to achieve state-of-the-art reasoning.
-
Velox: Learning Representations of 4D Geometry and Appearance
Velox compresses dynamic point clouds into latent tokens that support geometry via 4D surface modeling and appearance via 3D Gaussians, showing strong results on video-to-4D generation, tracking, and image-to-4D cloth simulation.
-
End-to-End Autoregressive Image Generation with 1D Semantic Tokenizer
An end-to-end autoregressive model with a jointly trained 1D semantic tokenizer achieves state-of-the-art FID 1.48 on ImageNet 256x256 generation without guidance.
-
Perturbation Probing: A Two-Pass-per-Prompt Diagnostic for FFN Behavioral Circuits in Aligned LLMs
Perturbation probing identifies tiny sets of FFN neurons that control refusal templates and language routing in LLMs, enabling precise ablations and directional interventions that alter behavior on benchmarks while preserving safety.
-
PointTransformerX: Portable and Efficient 3D Point Cloud Processing without Sparse Algorithms
PointTransformerX is a fully PyTorch-native 3D point cloud transformer backbone that reaches 98.7% of PointTransformer V3 accuracy on ScanNet using 79.2% fewer parameters, 1.6x faster inference, and only 253 MB memory while running natively on NVIDIA, AMD, and CPU hardware.
-
Geometric Monomial (GEM): a family of rational 2N-differentiable activation functions
GEM is a new family of C^{2N}-smooth rational activation functions with variants that achieve performance on par with or exceeding GELU on ResNet, GPT-2, and BERT benchmarks.
-
Exploring the Role of Synthetic Data Augmentation in Controllable Human-Centric Video Generation
Synthetic data complements real data in diffusion-based controllable human video generation, with effective sample selection improving motion realism, temporal consistency, and identity preservation.
-
OmniMouse: Scaling properties of multi-modal, multi-task Brain Models on 150B Neural Tokens
OmniMouse demonstrates data-driven scaling in multi-task brain models on a 150B-token neural dataset, achieving SOTA across prediction, decoding, and forecasting while model size gains saturate.
-
Depth Registers Unlock W4A4 on SwiGLU: A Reader/Generator Decomposition
Depth Registers plus hinge loss cut W4A4-induced perplexity collapse from 1727 to 119 in a 300M SwiGLU model by selectively taming reader-layer activations while leaving bilinear generator tails largely untouched.
-
RankUp: Towards High-rank Representations for Large Scale Advertising Recommender Systems
RankUp raises effective rank of representations in deep MetaFormer recommenders via randomized splitting and multi-embeddings, delivering 2-5% GMV gains in production deployments at Weixin.
-
SigGate-GT: Taming Over-Smoothing in Graph Transformers via Sigmoid-Gated Attention
SigGate-GT applies sigmoid gates to attention outputs in graph transformers to reduce over-smoothing, matching prior best on ZINC and setting new SOTA on ogbg-molhiv with gains over GraphGPS.
-
HAMSA: Scanning-Free Vision State Space Models via SpectralPulseNet
HAMSA achieves 85.7% ImageNet-1K top-1 accuracy as a spectral-domain SSM with 2.2x faster inference and lower memory than transformers or scanning-based SSMs.
-
C-voting: Confidence-Based Test-Time Voting without Explicit Energy Functions
C-voting improves recurrent reasoning models by selecting among multiple latent trajectories the one with highest average top-1 probability, achieving 4.9% better Sudoku-hard accuracy than energy-based voting and outperforming HRM on Sudoku-extreme and Maze when paired with the new ItrSA++ model.
-
Nucleus-Image: Sparse MoE for Image Generation
A 17B-parameter sparse MoE diffusion transformer activates 2B parameters per pass and reaches competitive quality on image generation benchmarks without post-training.
-
Learning Long-term Motion Embeddings for Efficient Kinematics Generation
A 64x temporally compressed motion embedding learned from trackers enables efficient conditional flow-matching generation of long-term motions that outperform video models and task-specific methods.
-
Fast Spatial Memory with Elastic Test-Time Training
Elastic Test-Time Training stabilizes test-time updates via an elastic prior and moving-average anchor, enabling Fast Spatial Memory for scalable long-sequence 4D reconstruction with reduced memory use and fewer shortcuts.
-
Leveraging Artist Catalogs for Cold-Start Music Recommendation
ACARec attends over artist catalogs to generate CF embeddings for new tracks, more than doubling recall and NDCG versus content-only baselines in music recommendation.
-
Orthogonal Quadratic Complements for Vision Transformer Feed-Forward Networks
Orthogonal Quadratic Complements project low-rank quadratic auxiliary branches onto the orthogonal complement of the main hidden state in vision transformer FFNs, improving accuracy by 1.3 points on CIFAR-100 and 1.4 points on TinyImageNet over baselines.
-
ARCS: Autoregressive Circuit Synthesis with Topology-Aware Graph Attention and Spec Conditioning
ARCS generates valid SPICE-simulatable analog circuits in milliseconds via graph VAE, flow-matching, and GRPO reinforcement learning, reaching 99.9% validity with 8 evaluations across 32 topologies.
-
Rethinking Language Model Scaling under Transferable Hypersphere Optimization
HyperP transfers optimal learning rates across model width, depth, tokens, and MoE granularity under Frobenius-sphere constraints, delivering stable scaling and 1.58x efficiency gains.
-
FluidFlow: a flow-matching generative model for fluid dynamics surrogates on unstructured meshes
FluidFlow uses conditional flow-matching with U-Net and DiT architectures to predict pressure and friction coefficients on airfoils and 3D aircraft meshes, outperforming MLP baselines with better generalization.
-
Back to Basics: Let Denoising Generative Models Denoise
Directly predicting clean data with large-patch pixel Transformers enables strong generative performance in diffusion models where noise prediction fails at high dimensions.
-
Gated Attention for Large Language Models: Non-linearity, Sparsity, and Attention-Sink-Free
Applying a head-specific sigmoid gate after SDPA in LLMs boosts performance and stability by adding non-linearity and query-dependent sparse modulation while reducing attention sinks (a minimal sketch of this gating appears after the citing-papers list).
-
Emu3: Next-Token Prediction is All You Need
Emu3 shows that next-token prediction on a unified discrete token space for text, images, and video lets a single transformer outperform task-specific models such as SDXL and LLaVA-1.6 in multimodal generation and perception.
-
Transfusion: Predict the Next Token and Diffuse Images with One Multi-Modal Model
A single transformer combines language modeling loss and diffusion loss on mixed-modality data, scaling to 7B parameters and 2T tokens while matching specialized language and diffusion models.
-
Chameleon: Mixed-Modal Early-Fusion Foundation Models
Chameleon is an early-fusion token model that handles mixed image-text sequences for understanding and generation, achieving competitive or superior performance to larger models like Llama-2, Mixtral, and Gemini-Pro on captioning, VQA, text, and image tasks.
-
Vision Transformers Need Registers
Adding register tokens to Vision Transformers eliminates high-norm background artifacts and raises state-of-the-art performance on dense visual prediction tasks.
-
BloombergGPT: A Large Language Model for Finance
BloombergGPT is a 50B parameter LLM trained on a 708B token mixed financial and general dataset that outperforms prior models on financial benchmarks while preserving general LLM performance.
-
PaLM: Scaling Language Modeling with Pathways
PaLM 540B demonstrates continued scaling benefits by setting new few-shot SOTA results on hundreds of benchmarks and outperforming humans on BIG-bench.
-
LaMDA: Language Models for Dialog Applications
LaMDA shows that fine-tuning on human-value annotations and consulting external knowledge sources significantly improves safety and factual grounding in large dialog models beyond what scaling alone achieves.
-
Pion: A Spectrum-Preserving Optimizer via Orthogonal Equivalence Transformation
Pion is an optimizer that preserves the singular values of weight matrices in LLM training by applying orthogonal equivalence transformations.
-
CaloArt: Large-Patch x-Prediction Diffusion Transformers for High-Granularity Calorimeter Shower Generation
CaloArt achieves top FPD, high-level, and classifier metrics on CaloChallenge datasets 2 and 3 while keeping single-GPU generation at 9-11 ms per shower by combining large-patch tokenization, x-prediction, and conditional flow matching.
-
A Composite Activation Function for Learning Stable Binary Representations
HTAF is a sigmoid-tanh composite that approximates the Heaviside function to allow stable gradient training of binary activation networks, yielding ICBMs with stable discretization and competitive performance on image tasks.
-
MaskTab: Scalable Masked Tabular Pretraining with Scaling Laws and Distillation for Industrial Classification
MaskTab is a masked pretraining method for industrial tabular data that delivers measurable gains in classification AUC and KS metrics while enabling effective distillation to smaller models.
-
Mela: Test-Time Memory Consolidation based on Transformation Hypothesis
Mela is a Transformer variant with a dual-frequency Hierarchical Memory Module and MemStack that performs test-time memory consolidation, outperforming baselines on long contexts.
-
SimReg: Achieving Higher Performance in the Pretraining via Embedding Similarity Regularization
SimReg regularization accelerates LLM pretraining convergence by over 30% and raises average zero-shot performance by over 1% across benchmarks.
-
TIDE: Every Layer Knows the Token Beneath the Context
TIDE augments standard transformers with per-layer token embedding injection via an ensemble of memory blocks and a depth-conditioned router to mitigate rare-token undertraining and contextual collapse.
-
MDN: Parallelizing Stepwise Momentum for Delta Linear Attention
MDN parallelizes stepwise momentum for delta linear attention using geometric reordering and dynamical systems analysis, yielding performance gains over Mamba2 and GDN on 400M and 1.3B models.
-
ELAS: Efficient Pre-Training of Low-Rank Large Language Models via 2:4 Activation Sparsity
ELAS pre-trains low-rank LLMs by applying 2:4 activation sparsity after squared ReLU to cut memory and accelerate training with minimal performance loss.
-
Colinearity Decay: Training Quantization-Friendly ViTs with Outlier Decay
Colinearity-Decay regularizer trains ViTs that maintain or improve full-precision accuracy while delivering higher accuracy after low-bit quantization on ImageNet and COCO tasks.
-
Caracal: Causal Architecture via Spectral Mixing
Caracal is a Fourier-based sequence mixing architecture that achieves causal autoregressive modeling with standard operators and competitive performance on long sequences.
-
Nautile-370M: Spectral Memory Meets Attention in a Small Reasoning Model
Nautile-370M is a hybrid small language model using SeqCond Attention layers alternating with transformers, with a claimed proof that the spectral operator matches full self-attention expressiveness in the continuous limit.
-
When Does Removing LayerNorm Help? Activation Bounding as a Regime-Dependent Implicit Regularizer
DyT improves validation loss 27% at 64M params/1M tokens but worsens it 19% at 118M tokens, with saturation levels predicting the sign of the effect.
-
Uni-Encoder Meets Multi-Encoders: Representation Before Fusion for Brain Tumor Segmentation with Missing Modalities
UniME combines a pretrained unified ViT encoder with modality-specific CNN encoders to improve brain tumor segmentation performance when some MRI modalities are missing.
-
Sapiens2
Sapiens2 improves pretraining, data scale, and architecture over its predecessor to set new state-of-the-art results on human pose estimation, body-part segmentation, normal estimation, and new tasks like pointmap and albedo estimation.
-
Hyperloop Transformers
Hyperloop Transformers outperform standard and mHC Transformers with roughly 50% fewer parameters by looping a middle block of layers and applying hyper-connections only after each loop.
-
Sessa: Selective State Space Attention
Sessa integrates attention within recurrent paths to achieve power-law memory tails and flexible non-decaying selective retrieval, outperforming baselines on long-context tasks.
-
LACE: Lattice Attention for Cross-thread Exploration
LACE enables concurrent reasoning paths in LLMs to interact via lattice attention and a synthetic training pipeline, raising accuracy more than 7 points over independent parallel search.
-
HAFM: Hierarchical Autoregressive Foundation Model for Music Accompaniment Generation
HAFM uses a hierarchical autoregressive model with dual-rate HuBERT and EnCodec tokens to generate coherent instrumental music from vocals, achieving FAD 2.08 on MUSDB18 while matching prior systems with fewer parameters.
-
Efficient Learned Data Compression via Dual-Stream Feature Decoupling
A dual-stream decoupler plus hierarchical refiner and parallel pipeline yields state-of-the-art compression ratio and throughput with lowest reported latency and memory in learned data compression.
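To make the gating mechanism of the "Gated Attention for Large Language Models" entry above concrete, here is a minimal PyTorch sketch: a sigmoid gate computed from the attention layer's input multiplies the scaled-dot-product-attention output of each head elementwise, before the output projection. Module and parameter names (GatedMultiheadSelfAttention, w_gate) are illustrative assumptions, not that paper's released code.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class GatedMultiheadSelfAttention(nn.Module):
    def __init__(self, d_model: int, n_heads: int):
        super().__init__()
        assert d_model % n_heads == 0
        self.n_heads, self.d_head = n_heads, d_model // n_heads
        self.qkv = nn.Linear(d_model, 3 * d_model, bias=False)
        self.w_gate = nn.Linear(d_model, d_model, bias=False)  # per-head gate logits
        self.out = nn.Linear(d_model, d_model, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, t, d = x.shape
        q, k, v = self.qkv(x).chunk(3, dim=-1)
        # reshape each projection to (batch, heads, seq, d_head)
        split = lambda z: z.reshape(b, t, self.n_heads, self.d_head).transpose(1, 2)
        y = F.scaled_dot_product_attention(split(q), split(k), split(v), is_causal=True)
        # input-dependent sigmoid gate, applied elementwise per head after SDPA
        g = torch.sigmoid(split(self.w_gate(x)))
        y = (g * y).transpose(1, 2).reshape(b, t, d)
        return self.out(y)

attn = GatedMultiheadSelfAttention(d_model=256, n_heads=8)
out = attn(torch.randn(2, 32, 256))  # (batch, seq, d_model)
```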