super hub Mixed citations

Pointer Sentinel Mixture Models

Caiming Xiong, James Bradbury, Richard Socher, Stephen Merity · 2016 · cs.CL · arXiv 1609.07843

Mixed citation behavior. Most common role is background (56%).

164 Pith papers citing it

Background 56% of classified citations

open full Pith review browse 164 citing papers more from Caiming Xiong arXiv PDF

abstract

Recent neural network sequence models with softmax classifiers have achieved their best language modeling performance only with very large hidden states and large vocabularies. Even then they struggle to predict rare or unseen words even if the context makes the prediction unambiguous. We introduce the pointer sentinel mixture architecture for neural sequence models which has the ability to either reproduce a word from the recent context or produce a word from a standard softmax classifier. Our pointer sentinel-LSTM model achieves state of the art language modeling performance on the Penn Treebank (70.9 perplexity) while using far fewer parameters than a standard softmax LSTM. In order to evaluate how well language models can exploit longer contexts and deal with more realistic vocabularies and larger corpora we also introduce the freely available WikiText corpus.

hub tools

JSON dossier citing papers JSON arXiv source

citation-role summary

background 9 dataset 5 method 1 other 1

citation-polarity summary

background 9 use dataset 5 unclear 1 use method 1

claims ledger

abstract Recent neural network sequence models with softmax classifiers have achieved their best language modeling performance only with very large hidden states and large vocabularies. Even then they struggle to predict rare or unseen words even if the context makes the prediction unambiguous. We introduce the pointer sentinel mixture architecture for neural sequence models which has the ability to either reproduce a word from the recent context or produce a word from a standard softmax classifier. Our pointer sentinel-LSTM model achieves state of the art language modeling performance on the Penn Tree

authors

Caiming Xiong James Bradbury Richard Socher Stephen Merity

co-cited works

representative citing papers

CATCH-ME if you RAG: a dataset of Contextually Annotated multi-Turn Counterspeech against Hate and Misinformation Exchanges

cs.CL · 2026-06-18 · unverdicted · novelty 8.0

Presents a new expert-curated dataset of multi-turn counterspeech dialogues in five languages targeting hate against seven groups, with span annotations linking to verified external knowledge for RAG applications.

Do Models Share Safety Representations? Cross-Model Steering for Safe Visual Generation

cs.CV · 2026-06-03 · unverdicted · novelty 8.0

A safety direction estimated in a source LLM is transported to a target generator through lightweight alignment on benign data alone, matching native safety performance without any target-side unsafe data.

GPTQ-intrinsic LoRA: A Near-optimal Algorithm for Low-precision Quantization with Low-rank Adaptation

cs.LG · 2026-05-31 · unverdicted · novelty 8.0

GPTQ-intrinsic LoRA augments GPTQ with intrinsic low-rank compensation via Hessian modification to achieve layer-wise reconstruction bounds that match information-theoretic lower bounds under structural assumptions.

Towards Verifiable Transformers: Solver-Checkable Circuit Explanations

cs.LG · 2026-05-21 · unverdicted · novelty 8.0

Presents a solver-verifiable framework for Transformer circuits, with exhaustive checks on small symbolic tasks and surrogate methods for larger models.

HodgeCover: Higher-Order Topological Coverage Drives Compression of Sparse Mixture-of-Experts

cs.LG · 2026-05-13 · unverdicted · novelty 8.0

HodgeCover isolates the harmonic kernel of a simplicial Laplacian on an expert 2-complex to identify irreducible merge cycles and selects experts for aggressive compression, matching or exceeding baselines on open-weight MoE models.

Quantum-enhanced Large Language Models on Quantum Hardware via Cayley Unitary Adapters

quant-ph · 2026-05-07 · unverdicted · novelty 8.0

Cayley unitary adapters executed on real quantum hardware improve LLM perplexity by 1.4% on Llama 3.1 8B with 6000 parameters and recover 83% of compression-induced degradation on SmolLM2.

Learning the Signature of Memorization in Autoregressive Language Models

cs.CL · 2026-04-03 · accept · novelty 8.0

A classifier trained only on transformer fine-tuning data detects an invariant memorization signature that transfers to Mamba, RWKV-4, and RecurrentGemma with AUCs of 0.963, 0.972, and 0.936.

Editing Models with Task Arithmetic

cs.LG · 2022-12-08 · accept · novelty 8.0

Task vectors from weight differences allow arithmetic operations to edit pre-trained models, improving multiple tasks simultaneously and enabling analogical inference on unseen tasks.

TallyTrain: Communication-Efficient Federated Distillation

cs.LG · 2026-06-30 · unverdicted · novelty 7.0

TallyTrain is a hard-label distillation protocol for federated learning that uses argmax transmission and optional sparse merges to match soft-label performance at up to 1000x lower communication cost.

CARVE: Content-Aware Recurrent with Value Efficiency for Chunk-Parallel Linear Attention

cs.CL · 2026-06-25 · conditional · novelty 7.0 · 2 refs

CARVE introduces key-axis content-aware gating and value-efficient scalar writes in recurrent linear attention, outperforming GDN-2 on perplexity and retrieval tasks while cutting parameters and memory.

Tapered Language Models

cs.LG · 2026-06-22 · unverdicted · novelty 7.0

Tapered Language Models monotonically decrease MLP width across depth with a cosine schedule, yielding better perplexity and downstream performance than uniform-width baselines across multiple architectures and scales at no extra cost.

Every Eval Ever: A Unifying Schema and Community Repository for AI Evaluation Results

cs.AI · 2026-06-12 · unverdicted · novelty 7.0

Introduces the first community-governed unified JSON schema and crowdsourced repository for AI evaluation results, with converters and a database spanning 22,235 models and 2,273 benchmarks.

LongSpike: Fractional Order Spiking State Space Models for Efficient Long Sequence Learning

cs.LG · 2026-06-11 · unverdicted · novelty 7.0

LongSpike integrates fractional-order state-space modeling into spiking neural networks, enabling better long-sequence performance than prior SNNs on LRA, WikiText-103, and Speech Commands benchmarks while retaining sparse computation.

STCC: A Unified Source-Channel Semantic Token Coding Framework for Semantic Communications

cs.IT · 2026-06-10 · unverdicted · novelty 7.0

STCC introduces a Semantic Token Codec that learns geometrically structured constellations aligning channel topology with semantic embedding spaces so noise produces topological rather than random errors.

Unified Energy for Invariant and Independent Decoding in Diffusion Language Models

cs.CL · 2026-06-08 · unverdicted · novelty 7.0

The paper introduces Uni-E, a unified energy for DLMs that accounts for model capacity, dependency and invariance, can be computed exactly, and corrects distribution shifts from dependency and invariance.

Beyond FLOPs: Benchmarking Real Inference Acceleration of LLM Pruning under a GEMM-Centric Taxonomy

cs.LG · 2026-06-08 · conditional · novelty 7.0

A GEMM-centric taxonomy and unified benchmark show static depth pruning as the strongest Pareto-optimal baseline for LLM inference acceleration, with the frontier shifting to dynamic depth then static width pruning as quality loss rises.

APEX4: Efficient Pure W4A4 LLM Inference via Intra-SM Compute Rebalancing

cs.DC · 2026-06-07 · conditional · novelty 7.0 · 2 refs

APEX4 co-designs pure INT4 GEMM kernels with ρ-aware granularity adaptation to deliver up to 2.09× end-to-end speedup on GPUs with low ρ while keeping LLaMA-2-70B perplexity within 0.63 of FP16.

STAR-KV: Low-Rank KV Cache Compression via Soft Thresholding for Adaptive Rank Control

cs.LG · 2026-06-07 · unverdicted · novelty 7.0

STAR-KV applies differentiable soft thresholding for per-head and per-block adaptive low-rank KV cache compression, combined with hybrid decomposition and low-rank-aware quantization, achieving up to 75% compression and 3.1x throughput gains.

Decomposing how prompting steers behavior

cs.AI · 2026-06-02 · unverdicted · novelty 7.0

A geometric decomposition framework shows that affine transformations best recover prompt-induced task geometry and behavior in language and vision models across multiple datasets.

From Layers to Submodules: Rethinking Granularity in Replacement-Based LLM Compression

cs.CL · 2026-06-01 · unverdicted · novelty 7.0

SubFit enables better LLM compression by fitting residual bypasses to non-contiguously selected submodules, outperforming layer-granularity baselines in accuracy-perplexity trade-offs at 12.5-37.5% sparsity.

Aligned but Fragile: Enhancing LLM Safety Robustness via Zeroth-Order Optimization

cs.AI · 2026-05-28 · unverdicted · novelty 7.0

A hybrid first-order then zeroth-order optimization approach improves robustness of safety-aligned LLMs while preserving utility, with layer-wise sensitivity estimation for efficiency.

Word Class Representations Spontaneously Emerge from Successor Representations Trained on Natural Language

cs.CL · 2026-05-23 · unverdicted · novelty 7.0

Successor representation training on natural language causes part-of-speech categories to emerge spontaneously in the learned embeddings, with structure varying by predictive horizon.

Uniform Diffusion Models Revisited: Leave-One-Out Denoiser and Absorbing State Reformulation

cs.LG · 2026-05-21 · unverdicted · novelty 7.0

Uniform diffusion models rely on a leave-one-out denoiser rather than the usual denoising posterior, with exact conversions derived; an absorbing-state reformulation is introduced that matches or exceeds masked diffusion on language modeling while preserving the original joint distribution.

The Expressivity Boundary of Probabilistic Circuits: A Comparison with Large Language Models

cs.LG · 2026-05-13 · unverdicted · novelty 7.0

Probabilistic circuits have an output bottleneck with convex probability combinations and a context bottleneck limited to fixed vtree-aligned partitions, making them less expressive than transformers for language data with heterogeneous dependencies, though decomposable PCs are strictly more capable

citing papers explorer

Showing 21 of 21 citing papers after filters.

MIDUS: Memory-Infused Depth Up-Scaling cs.LG · 2025-12-15 · unverdicted · none · ref 19 · internal anchor
MIDUS replaces duplicated FFN branches in depth up-scaling with head-wise memory layers using product-key retrieval and HIVE to deliver lightweight, head-conditioned residual capacity.
Improving LLM Unlearning Robustness via Random Perturbations cs.CL · 2025-01-31 · unverdicted · none · ref 24 · internal anchor
LLM unlearning is reframed as inadvertently installing backdoor triggers on forget-tokens; Random Noise Augmentation is introduced as a defense that improves robustness with theoretical guarantees.
SpecQuant: Spectral Decomposition and Adaptive Truncation for Ultra-Low-Bit LLMs Quantization cs.LG · 2025-11-11 · unverdicted · none · ref 4 · internal anchor
SpecQuant uses outlier smoothing into weights followed by channel-wise low-frequency Fourier truncation to achieve 4-bit quantization of LLaMA-3 8B with only 1.5% zero-shot accuracy loss versus full precision.
CR-Net: Scaling Parameter-Efficient Training with Cross-Layer Low-Rank Structure cs.LG · 2025-09-23 · unverdicted · none · ref 45 · internal anchor
CR-Net uses cross-layer low-rank residuals in a dual-path network plus specialized recomputation to outperform prior low-rank methods on 60M-7B model pre-training while using less compute and memory.
Cosine-Similarity Routing with Semantic Anchors for Interpretable Mixture-of-Experts Language Models cs.CL · 2025-09-12 · conditional · none · ref 11 · internal anchor
Cosine-similarity routing to semantic anchors in MoE models matches linear routing perplexity on WikiText-103 while providing direct traceability of decisions and a bandpass loss that cuts dead experts from 30-45% to 0-6%.
Rethinking 1-bit Optimization Leveraging Pre-trained Large Language Models cs.CL · 2025-08-09 · conditional · none · ref 25 · internal anchor
A progressive training scheme with binary-aware initialization and dual-scaling allows pre-trained LLMs to be converted to high-performance 1-bit models without training from scratch.
A3 : an Analytical Low-Rank Approximation Framework for Attention cs.CL · 2025-05-19 · conditional · none · ref 10 · internal anchor
A3 splits Transformer layers into QK, OV, and MLP components and derives analytical low-rank approximations that reduce hidden dimensions while minimizing each component's functional loss, yielding better perplexity than prior low-rank methods on LLaMA models.
Superposition Yields Robust Neural Scaling cs.LG · 2025-05-15 · conditional · none · ref 43 · internal anchor
Strong superposition causes neural loss to scale as the inverse of model dimension due to geometric feature overlaps, explaining scaling laws for broad frequency distributions.
Tight Clusters Make Specialized Experts cs.LG · 2025-02-21 · unverdicted · none · ref 30 · internal anchor
Introduces Adaptive Clustering router for MoE models that scales features to identify tight expert clusters, yielding faster convergence, robustness to corruption, and performance gains.
Analytical FFN-to-MoE Restructuring via Activation Pattern Analysis cs.LG · 2025-02-06 · unverdicted · none · ref 29 · internal anchor
An analytical post-training method restructures FFNs into MoE by partitioning neurons based on activation patterns and building a router from statistics, achieving 1.17x speedup with minimal resources.
DUET: Optimizing Training Data Mixtures via Feedback from Unseen Evaluation Tasks cs.LG · 2025-02-01 · unverdicted · none · ref 16 · internal anchor
DUET is a global-to-local method that optimizes LLM training data mixtures via Bayesian optimization guided by influence-based selection and feedback from unseen evaluation tasks, with a regret bound showing convergence to the optimal mixture.
Rethinking Output Alignment For 1-bit Post-Training Quantization of Large Language Models cs.LG · 2025-12-25 · unverdicted · none · ref 12 · 2 links · internal anchor
A post-training quantization technique for 1-bit LLMs that corrects layer-wise error accumulation and anisotropic representation distortion to preserve output behavior more effectively than existing methods.
Resting Neurons, Active Insights: Robustifying Activation Sparsity in LLMs via Spontaneity cs.LG · 2025-12-14 · unverdicted · none · ref 48 · 2 links · internal anchor
SPON adds a small set of trainable input-independent activation vectors as representational anchors, trained by distribution matching, to stabilize sparse activation in LLMs and recover performance lost to hidden-state distribution shifts.
SpikingMamba: Towards Energy-Efficient Large Language Models via Knowledge Distillation from Mamba cs.NE · 2025-10-06 · unverdicted · none · ref 16 · internal anchor
SpikingMamba distills Mamba into an SNN LLM achieving 4.76x energy savings with a 4.78% zero-shot accuracy gap that narrows to 2.23% after RL.
QWHA: Quantization-Aware Walsh-Hadamard Adaptation for Parameter-Efficient Fine-Tuning on Large Language Models cs.CL · 2025-09-22 · unverdicted · none · ref 38 · internal anchor
QWHA proposes Walsh-Hadamard Transform adapters with adaptive initialization for quantization-aware PEFT, claiming better low-bit accuracy and faster training than low-rank or other FT-based baselines.
Combating the Memory Walls: Optimization Pathways for Long-Context Agentic LLM Inference cs.AR · 2025-09-11 · unverdicted · none · ref 47 · internal anchor
PLENA introduces a co-designed system with three optimization pathways for long-context agentic LLM inference, claiming up to 2.23x throughput over A100 and 4.04x energy efficiency.
ShadowNPU: System and Algorithm Co-design for NPU-Centric On-Device LLM Inference cs.PF · 2025-08-22 · unverdicted · none · ref 39 · internal anchor
ShadowNPU presents shadowAttn, a co-designed sparse attention system that uses NPU pilot compute and techniques like graph bucketing and per-head sparsity to minimize CPU/GPU fallback during on-device LLM inference while maintaining accuracy.
From 2:4 to 8:16 sparsity patterns in LLMs for Outliers and Weights with Variance Correction cs.LG · 2025-07-03 · unverdicted · none · ref 20 · internal anchor
8:16 sparsity with variance correction and outlier handling lets compressed LLMs match or exceed dense-model accuracy under fixed memory limits, outperforming the common 2:4 pattern in flexibility.
Attribution-Guided Pruning for Insight and Control: Circuit Discovery and Targeted Correction in Small-scale LLMs cs.LG · 2025-06-16 · conditional · none · ref 20 · internal anchor
Attribution-guided pruning with contrastive relevance identifies behavior-specific circuits in small LLMs and removes as little as 0.03-0.3% of components to reduce toxicity or repetition while preserving general performance.
AIvaluateXR: An Evaluation Framework for on-Device AI in XR with Benchmarking Results cs.DC · 2025-02-13 · unverdicted · none · ref 55 · internal anchor
AIvaluateXR benchmarks 17 LLMs across four XR platforms on performance, speed, memory and battery metrics and proposes a 3D Pareto optimality method to identify optimal on-device model-device pairs.
NeuronMLP: Efficient LLM Inference via Singular Value Decomposition Compression and Tiling on AWS Trainium cs.CL · 2025-10-29 · unverdicted · none · ref 37 · internal anchor
NeuronMLP applies SVD-based compression and Trainium-specific tiling and caching to MLP layers, delivering 1.35x kernel speedup and 1.21x end-to-end inference speedup at 0.05 compression ratio versus AWS NKI baseline.

Pointer Sentinel Mixture Models

hub tools

citation-role summary

citation-polarity summary

claims ledger

authors

co-cited works

fields

years

verdicts

roles

polarities

representative citing papers

citing papers explorer