Pointer Sentinel Mixture Models
56 Pith papers cite this work.
abstract
Recent neural network sequence models with softmax classifiers have achieved their best language modeling performance only with very large hidden states and large vocabularies. Even then, they struggle to predict rare or unseen words even if the context makes the prediction unambiguous. We introduce the pointer sentinel mixture architecture for neural sequence models, which has the ability to either reproduce a word from the recent context or produce a word from a standard softmax classifier. Our pointer sentinel-LSTM model achieves state-of-the-art language modeling performance on the Penn Treebank (70.9 perplexity) while using far fewer parameters than a standard softmax LSTM. In order to evaluate how well language models can exploit longer contexts and deal with more realistic vocabularies and larger corpora, we also introduce the freely available WikiText corpus.
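To make the mixture concrete, below is a minimal NumPy sketch of the idea in the abstract (variable and function names are our own, not the paper's): one softmax runs over the recent context positions plus a learned sentinel, the sentinel's probability mass gates the standard vocabulary softmax, and the remaining pointer mass is added onto the vocabulary ids that occur in the context.

```python
import numpy as np

def pointer_sentinel_mixture(q, context_reprs, context_ids, sentinel, vocab_logits):
    """Hedged sketch of one pointer sentinel mixture step (names are illustrative).

    q             : (d,)   query vector derived from the current hidden state
    context_reprs : (L, d) representations of the last L context tokens
    context_ids   : (L,)   integer vocabulary ids of those tokens
    sentinel      : (d,)   learned sentinel vector that absorbs the gate mass
    vocab_logits  : (V,)   logits from the standard softmax classifier
    """
    # One softmax over the L context positions plus the sentinel.
    scores = np.concatenate([context_reprs @ q, [sentinel @ q]])
    a = np.exp(scores - scores.max()); a /= a.sum()
    g = a[-1]                              # gate = probability mass on the sentinel

    # Standard vocabulary distribution from the softmax classifier.
    p_vocab = np.exp(vocab_logits - vocab_logits.max()); p_vocab /= p_vocab.sum()

    # Mix: the pointer mass (which already sums to 1 - g) is scattered onto the
    # vocabulary ids it points at, so the result is a valid distribution.
    p = g * p_vocab
    np.add.at(p, context_ids, a[:-1])
    return p

# Toy usage with random vectors.
rng = np.random.default_rng(0)
p = pointer_sentinel_mixture(rng.normal(size=8), rng.normal(size=(5, 8)),
                             np.array([3, 1, 4, 1, 5]), rng.normal(size=8),
                             rng.normal(size=10))
assert np.isclose(p.sum(), 1.0)
```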
citing papers explorer
-
Quantum-enhanced Large Language Models on Quantum Hardware via Cayley Unitary Adapters
Cayley unitary adapters executed on real quantum hardware improve LLM perplexity by 1.4% on Llama 3.1 8B with 6000 parameters and recover 83% of compression-induced degradation on SmolLM2.
-
Learning the Signature of Memorization in Autoregressive Language Models
A classifier trained only on transformer fine-tuning data detects an invariant memorization signature that transfers to Mamba, RWKV-4, and RecurrentGemma with AUCs of 0.963, 0.972, and 0.936.
-
Editing Models with Task Arithmetic
Task vectors from weight differences allow arithmetic operations to edit pre-trained models, improving multiple tasks simultaneously and enabling analogical inference on unseen tasks (a minimal sketch of the task-vector arithmetic appears after this list).
-
The Expressivity Boundary of Probabilistic Circuits: A Comparison with Large Language Models
Probabilistic circuits have an output bottleneck with convex probability combinations and a context bottleneck limited to fixed vtree-aligned partitions, making them less expressive than transformers for language data with heterogeneous dependencies, though decomposable PCs are strictly more capable.
-
BCJR-QAT: A Differentiable Relaxation of Trellis-Coded Weight Quantization
BCJR-QAT makes trellis quantization differentiable via BCJR soft decoding at finite temperature, allowing QAT to improve 2-bit LLM perplexity over PTQ with a fused GPU kernel and a drift-budget escape condition.
-
Positional LSH: Binary Block Matrix Approximation for Attention with Linear Biases
ALiBi bias is the expectation of positional LSH-induced block masks, yielding spectral and max-norm approximation bounds that reduce long-context biased attention to randomized short-context unbiased attention.
-
Guidance Is Not a Hyperparameter: Learning Dynamic Control in Diffusion Language Models
Adaptive guidance trajectories learned via PPO outperform fixed-scale CFG on controllability-quality balance in three controlled NLP generation tasks with discrete diffusion models.
-
Why Geometric Continuity Emerges in Deep Neural Networks: Residual Connections and Rotational Symmetry Breaking
Residual connections align cross-layer gradients while symmetry-breaking activations prevent rotational drift, causing principal singular vectors of adjacent layers to align.
-
SplitZip: Ultra Fast Lossless KV Compression for Disaggregated LLM Serving
SplitZip is a new GPU-friendly lossless compressor for KV cache tensors that exploits exponent redundancy to achieve over 600 GB/s compression throughput and up to 1.32x faster transfers in disaggregated LLM serving.
-
RouteHijack: Routing-Aware Attack on Mixture-of-Experts LLMs
RouteHijack is a routing-aware jailbreak that identifies safety-critical experts via activation contrast and optimizes suffixes to suppress them, reaching 69.3% average attack success rate on seven MoE LLMs with strong transfer to variants and VLMs.
-
BWLA: Breaking the Barrier of W1AX Post-Training Quantization for LLMs
BWLA is the first post-training quantization method for LLMs that achieves 1-bit weights paired with low-bit activations such as 6 bits, using OKT to reshape weights and suppress activation tails plus PSP for low-rank refinement.
-
DEFault++: Automated Fault Detection, Categorization, and Diagnosis for Transformer Architectures
DEFault++ delivers automated hierarchical fault detection, categorization into 12 transformer-specific types, and root-cause diagnosis among 45 mechanisms on a new benchmark of 3,739 mutated instances, with AUROC >0.96 and Macro-F1 0.85, plus improved developer repair accuracy in a user study.
-
The Safety-Aware Denoiser for Text Diffusion Models
SAD modifies the denoising process in text diffusion models to enforce safety constraints at inference time, reducing unsafe generations while preserving quality and diversity.
-
Supernodes and Halos: Loss-Critical Hubs in LLM Feed-Forward Layers
In LLM feed-forward networks, the top 1% of channels per layer carry a median 58.7% of loss sensitivity, forming supernodes whose protection enables effective 50% sparsity pruning with much lower perplexity than baselines.
-
Variance Is Not Importance: Structural Analysis of Transformer Compressibility Across Model Scales
High-variance activation directions are uncorrelated with predictions, transformer blocks grow more linear with depth, and single-block linear replacement yields 34x compression on Mistral's final block at a 1.71 perplexity cost.
-
Learning Posterior Predictive Distributions for Node Classification from Synthetic Graph Priors
NodePFN pre-trains on synthetic graphs with controllable homophily and causal feature-label models to achieve 71.27 average accuracy on 23 node classification benchmarks without graph-specific training.
-
Equifinality in Mixture of Experts: Routing Topology Does Not Determine Language Modeling Quality
Routing topology in sparse Mixture-of-Experts models does not determine asymptotic language modeling perplexity; multiple variants including cosine-similarity routing achieve statistically equivalent performance.
-
Computational Lesions in Multilingual Language Models Separate Shared and Language-specific Brain Alignment
Lesioning a shared core in multilingual LLMs drops whole-brain fMRI encoding correlation by 60.32%, while language-specific lesions selectively weaken predictions only for the matched native language.
-
From Characterization to Microarchitecture: Designing an Elegant and Reliable BFP-Based NPU
A BFP NPU microarchitecture using row/column blocking and per-path protections achieves near-DMR reliability at 3.55% geometric mean performance overhead and under 2% hardware cost.
-
SHIELD: A Segmented Hierarchical Memory Architecture for Energy-Efficient LLM Inference on Edge NPUs
SHIELD reduces eDRAM refresh energy by 35% for LLM inference on edge NPUs by isolating sign/exponent from mantissa bits, disabling refresh on transient QO mantissas, and relaxing it on persistent KV mantissas while keeping accuracy intact.
-
Gradient Boosting within a Single Attention Layer
Gradient-boosted attention applies a corrective second attention pass within a single layer, mapping to Friedman's gradient boosting and improving perplexity by 5.6-6.0% on WikiText-103 and OpenWebText subsets over standard attention.
-
A Switch-Centric In-Network Architecture for Accelerating LLM Inference in Shared-Memory Network
SCIN uses an in-switch accelerator for direct memory access and 8-bit in-network quantization during All-Reduce, delivering up to 8.7x faster small-message reduction and 1.74x TTFT speedup on LLaMA-2 models.
-
Chronos: Learning the Language of Time Series
Chronos pretrains transformer models on tokenized time series to deliver strong zero-shot forecasting across diverse domains.
-
GPTQ: Accurate Post-Training Quantization for Generative Pre-trained Transformers
GPTQ quantizes 175B-parameter GPT models to 3-4 bits per weight in one shot using approximate second-order information, achieving negligible accuracy degradation and 3-4x inference speedups.
-
LLM.int8(): 8-bit Matrix Multiplication for Transformers at Scale
LLM.int8() performs 8-bit inference for transformers up to 175B parameters with no accuracy loss by combining vector-wise quantization for most features with 16-bit mixed-precision handling of systematic outlier dimensions (a simplified sketch of the decomposition appears after this list).
-
Megatron-LM: Training Multi-Billion Parameter Language Models Using Model Parallelism
Intra-layer model parallelism in PyTorch enables training of 8.3B-parameter transformers, achieving SOTA perplexity of 10.8 on WikiText-103 and 66.5% accuracy on LAMBADA.
-
KV-RM: Regularizing KV-Cache Movement for Static-Graph LLM Serving
KV-RM regularizes KV-cache movement in static-graph LLM serving via block paging and merge-staged transport to improve throughput, tail latency, and memory use for variable-length decoding.
-
Semantic Smoothing for Language Models via Distribution Estimation and Embeddings
Semantic smoothing formulates next-word distribution estimation under KL loss with embedding-based KL-proximity side information, yielding an interpolation estimator with worst-case risk O(min{Δ, d/n}) that empirically reduces perplexity on bigram models.
-
Echo: KV-Cache-Free Associative Recall with Spectral Koopman Operators
Spectral Koopman operators let SSMs achieve 100% accuracy on long-gap multi-query associative recall with fixed memory, where pure Mamba fails.
-
Plug-and-play Class-aware Knowledge Injection for Prompt Learning with Visual-Language Model
CAKI generates class-specific prompts from few-shot samples of the same class, stores them in a knowledge bank, and uses query-key matching to inject relevant class knowledge into test instance predictions for improved VLM performance.
-
OSAQ: Outlier Self-Absorption for Accurate Low-bit LLM Quantization
OSAQ suppresses weight outliers in LLMs via a closed-form additive transformation from the Hessian's stable null space, improving 2-bit quantization perplexity by over 40% versus vanilla GPTQ with no inference overhead.
-
DiCLIP: Diffusion Model Enhances CLIP's Dense Knowledge for Weakly Supervised Semantic Segmentation
DiCLIP uses diffusion-based visual correlation enhancement and text semantic augmentation to improve CLIP-generated class activation maps for weakly supervised semantic segmentation, outperforming prior methods on PASCAL VOC and MS COCO.
-
HeadQ: Model-Visible Distortion and Score-Space Correction for KV-Cache Quantization
HeadQ reduces 84-94% of excess perplexity in 2-bit key quantization by adding low-rank logit corrections in a calibration-learned query basis, with further gains from an A^2-weighted value policy.
-
Context-Aware Wireless Token Communication via Joint Token Masking and Detection
A joint token masking and detection scheme with masked language models improves token reconstruction over noisy wireless channels by up to 1.77x on Europarl and 1.63x on WikiText-103 compared to conventional methods.
-
CoQuant: Joint Weight-Activation Subspace Projection for Mixed-Precision LLMs
CoQuant selects optimal high-precision subspaces for mixed-precision LLM quantization via a closed-form weighted PCA that balances weight and activation covariances derived from expected output error.
-
FASQ: Flexible Accelerated Subspace Quantization for Calibration-Free LLM Compression
FASQ delivers calibration-free LLM compression with continuous size trade-offs via product quantization and custom CUDA kernels that accelerate decode beyond FP16 speeds on consumer hardware.
-
Back into Plato's Cave: Examining Cross-modal Representational Convergence at Scale
Evidence for cross-modal representational convergence weakens substantially at scale and in realistic many-to-many settings, indicating models learn rich but distinct representations.
-
CLion: Efficient Cautious Lion Optimizer with Enhanced Generalization
CLion achieves O(1/N) generalization error and O(√d / T^{1/4}) convergence for nonconvex stochastic optimization, improving on Lion's O(1/(N τ^T)) bound.
-
Robust Ultra Low-Bit Post-Training Quantization via Stable Diagonal Curvature Estimate
DASH-Q uses a stable diagonal curvature estimate and weighted least squares to achieve robust ultra-low-bit post-training quantization of LLMs, improving zero-shot accuracy by 7% on average over baselines.
-
Quantization Dominates Rank Reduction for KV-Cache Compression
Quantization of the KV cache beats rank reduction for matched storage budgets by 4-364 PPL, because dimension removal can flip attention token selection under softmax while bounded quantization noise usually preserves ordering.
-
Position-Agnostic Pre-Projection for Transformer Attention: Nonlinear Feature Construction and Content Skip Before Q/K/V
A position-agnostic nonlinear pre-projection MLP plus content skip connection in transformer attention improves LAMBADA accuracy by 40.6% and reduces perplexity by 39% on 160M-scale models.
-
LLM-PRISM: Characterizing Silent Data Corruption from Permanent GPU Faults in LLM Training
LLMs resist low-frequency permanent GPU faults but certain datapaths and precision formats trigger catastrophic training divergence even at moderate fault rates.
-
A Little Rank Goes a Long Way: Random Scaffolds with LoRA Adapters Are All You Need
Frozen random backbones with low-rank LoRA adapters recover 96-100% of fully trained performance on diverse architectures while training only 0.5-40% of parameters.
-
Rethinking Residual Errors in Compensation-based LLM Quantization
Redefining residual errors to include compensation-aware discrepancies and realigning calibration to full-precision outputs improves GPTQ and GPTAQ performance on LLMs.
-
SLaB: Sparse-Lowrank-Binary Decomposition for Efficient Large Language Models
SLaB compresses LLM weights via sparse-lowrank-binary decomposition guided by activation-aware scores, achieving up to 36% lower perplexity than prior methods at 50% compression on Llama models.
-
Linformer: Self-Attention with Linear Complexity
Linformer approximates self-attention with a low-rank projection to achieve O(n) time and space complexity while matching Transformer accuracy on standard NLP tasks.
-
Emergent Semantic Role Understanding in Language Models
Semantic role understanding partially emerges during language model pre-training, with linear probes on frozen representations achieving substantial performance that improves with scale but does not match fine-tuned models, and representations shifting toward more distributed forms at larger scales.
-
mHC-SSM: Manifold-Constrained Hyper-Connections for State Space Language Models with Stream-Specialized Adapters
Manifold-constrained multi-stream mixing plus per-stream adapters improves SSM language model validation loss from 6.3507 to 6.1353 and perplexity from 572.91 to 461.88 on WikiText-2.
-
Adaptive Memory Decay for Log-Linear Attention
Making memory decay input-dependent via a lightweight MLP improves log-linear attention performance on associative recall, selective copying, and language modeling, especially for long sequences.
-
TIDE: Every Layer Knows the Token Beneath the Context
TIDE augments standard transformers with per-layer token embedding injection via an ensemble of memory blocks and a depth-conditioned router to mitigate rare-token undertraining and contextual collapse.
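For the "Editing Models with Task Arithmetic" entry above, here is a minimal sketch of the task-vector arithmetic it summarizes (the helper names and toy state dicts are our own assumptions): a task vector is the element-wise difference between fine-tuned and pre-trained weights, and adding a scaled sum of task vectors edits the pre-trained model.

```python
import numpy as np

def task_vector(pretrained, finetuned):
    # Task vector = element-wise weight difference, per parameter tensor.
    return {k: finetuned[k] - pretrained[k] for k in pretrained}

def apply_task_vectors(pretrained, task_vectors, scale=1.0):
    # Edit the pre-trained weights by adding a scaled sum of task vectors.
    edited = dict(pretrained)
    for tv in task_vectors:
        for k, delta in tv.items():
            edited[k] = edited[k] + scale * delta
    return edited

# Toy usage with one-tensor "models".
theta_pre = {"w": np.zeros(4)}
theta_a   = {"w": np.array([1.0, 0.0, 0.0, 0.0])}   # fine-tuned on task A
theta_b   = {"w": np.array([0.0, 2.0, 0.0, 0.0])}   # fine-tuned on task B
multi_task = apply_task_vectors(theta_pre,
                                [task_vector(theta_pre, theta_a),
                                 task_vector(theta_pre, theta_b)],
                                scale=0.5)
```

Negating a task vector is used to remove a behaviour, and analogy-style combinations of task vectors underlie the "analogical inference" mentioned in that entry's summary.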
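For the "LLM.int8()" entry above, a simplified sketch of the mixed-precision decomposition it describes (the threshold value, function name, and scaling details here are illustrative assumptions, not the library's API): input dimensions with unusually large magnitudes are multiplied in 16-bit, while the remaining dimensions are quantized vector-wise to int8, multiplied in integer arithmetic, and dequantized.

```python
import numpy as np

def int8_decomposed_matmul(X, W, outlier_threshold=6.0):
    """Illustrative sketch of outlier-aware int8 matmul; X is (n, d), W is (d, m).

    Assumes at least one non-outlier input dimension remains.
    """
    # Input dimensions whose activation magnitude exceeds the threshold
    # stay on a 16-bit path; everything else goes through int8.
    outliers = np.abs(X).max(axis=0) > outlier_threshold
    X_out, W_out = X[:, outliers], W[outliers, :]
    X_reg, W_reg = X[:, ~outliers], W[~outliers, :]

    # Vector-wise scales: one per row of X, one per column of W.
    sx = np.abs(X_reg).max(axis=1, keepdims=True) / 127.0 + 1e-12
    sw = np.abs(W_reg).max(axis=0, keepdims=True) / 127.0 + 1e-12
    Xq = np.round(X_reg / sx).astype(np.int8)
    Wq = np.round(W_reg / sw).astype(np.int8)

    # Integer matmul, then dequantize with the outer product of the scales;
    # add the 16-bit outlier path back in.
    int8_part = (Xq.astype(np.int32) @ Wq.astype(np.int32)) * (sx * sw)
    fp16_part = X_out.astype(np.float16) @ W_out.astype(np.float16)
    return int8_part + fp16_part.astype(np.float64)
```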