Sparse Sequence-to-Sequence Models
Abstract
Sequence-to-sequence models are a powerful workhorse of NLP. Most variants employ a softmax transformation in both their attention mechanism and output layer, leading to dense alignments and strictly positive output probabilities. This density is wasteful, making models less interpretable and assigning probability mass to many implausible outputs. In this paper, we propose sparse sequence-to-sequence models, rooted in a new family of $\alpha$-entmax transformations, which includes softmax and sparsemax as particular cases, and is sparse for any $\alpha > 1$. We provide fast algorithms to evaluate these transformations and their gradients, which scale well for large vocabulary sizes. Our models are able to produce sparse alignments and to assign nonzero probability to a short list of plausible outputs, sometimes rendering beam search exact. Experiments on morphological inflection and machine translation reveal consistent gains over dense models.
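The object at the heart of the abstract is the $\alpha$-entmax transformation, whose solution takes the form $p_i = [(\alpha - 1) z_i - \tau]_+^{1/(\alpha - 1)}$, with the threshold $\tau$ chosen so the probabilities sum to one ($\alpha \to 1$ approaches softmax, $\alpha = 2$ gives sparsemax). Below is a minimal sketch that finds $\tau$ by simple bisection; the paper itself provides fast algorithms for the transformation and its gradients, so the function name entmax_bisect, the NumPy implementation, and the fixed iteration count here are illustrative assumptions rather than the authors' released code.

```python
import numpy as np

def entmax_bisect(z, alpha=1.5, n_iter=50):
    """Map a score vector z to a sparse probability vector via alpha-entmax.

    Solves p_i = [(alpha - 1) * z_i - tau]_+ ** (1 / (alpha - 1)), with tau
    chosen by bisection so that p sums to 1. Requires alpha > 1; alpha = 2
    recovers sparsemax, and alpha -> 1 approaches softmax (not handled here).
    """
    z = np.asarray(z, dtype=float)
    scaled = (alpha - 1.0) * z

    # The threshold tau lies in [max(scaled) - 1, max(scaled)]:
    # at the lower end the total mass is >= 1, at the upper end it is 0.
    lo, hi = scaled.max() - 1.0, scaled.max()

    def mass(tau):
        return np.clip(scaled - tau, 0.0, None) ** (1.0 / (alpha - 1.0))

    for _ in range(n_iter):
        mid = (lo + hi) / 2.0
        if mass(mid).sum() >= 1.0:
            lo = mid   # threshold too small -> too much mass -> raise it
        else:
            hi = mid   # threshold too large -> too little mass -> lower it

    p = mass(lo)
    return p / p.sum()  # tiny renormalization against residual bisection error


# Example: with alpha > 1, low-scoring entries receive exactly zero probability.
scores = np.array([3.0, 1.5, 1.4, -2.0, -3.0])
print(entmax_bisect(scores, alpha=1.5))  # sparse: trailing entries are 0
print(entmax_bisect(scores, alpha=2.0))  # sparsemax: sparser still
```

Bisection is used here only because it handles any $\alpha > 1$ in a few lines; the fast algorithms referenced in the abstract presumably avoid such a generic iterative loop, especially at large vocabulary sizes.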
Forward citations
Cited by 4 Pith papers
- Sparse Contrastive Learning for Content-Based Cold Item Recommendation
  SEMCo uses sparse entmax contrastive learning for purely content-based cold-start item recommendation, outperforming standard methods in ranking accuracy.
- Selectivity and Shape in the Design of Forward-Forward Goodness Functions
  Shape- and peak-sensitive goodness functions for Forward-Forward deliver up to 72pp gains over sum-of-squares, reaching 98.2% on MNIST and 89% on Fashion-MNIST.
- Byzantine-Resilient Consensus via Active Reputation Learning
  An active reputation learning mechanism integrated into consensus protocols enables simultaneous Byzantine agent identification and resilient agreement among normal agents in distributed systems.
- Empty SPACE: Cross-Attention Sparsity for Concept Erasure in Diffusion Models
  SPACE induces sparsity in cross-attention parameters via closed-form iterative updates to erase target concepts more effectively than dense baselines in large diffusion models.