A Structured Self-attentive Sentence Embedding

Bing Xiang; Bowen Zhou; Cicero Nogueira dos Santos; Minwei Feng; Mo Yu; Yoshua Bengio; Zhouhan Lin

arxiv: 1703.03130 · v1 · pith:NKK33Z2Dnew · submitted 2017-03-09 · 💻 cs.CL · cs.AI · cs.LG · cs.NE

Zhouhan Lin, Minwei Feng, Cicero Nogueira dos Santos, Mo Yu, Bing Xiang, Bowen Zhou, Yoshua Bengio

classification 💻 cs.CL · cs.AI · cs.LG · cs.NE

keywords embedding · sentence · model · different · matrix · self-attention · tasks · attending

This paper proposes a new model for extracting an interpretable sentence embedding by introducing self-attention. Instead of using a vector, we use a 2-D matrix to represent the embedding, with each row of the matrix attending on a different part of the sentence. We also propose a self-attention mechanism and a special regularization term for the model. As a side effect, the embedding comes with an easy way of visualizing what specific parts of the sentence are encoded into the embedding. We evaluate our model on 3 different tasks: author profiling, sentiment classification, and textual entailment. Results show that our model yields a significant performance gain compared to other sentence embedding methods in all of the 3 tasks.
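The abstract does not spell out the mechanism, so the following is only a minimal NumPy sketch of how a matrix-valued, self-attentive sentence embedding of this kind can be computed. It assumes the commonly cited formulation A = softmax(W_s2 tanh(W_s1 H^T)), M = A H, with a Frobenius-norm penalty ||A A^T - I||_F^2 standing in for the "special regularization term". The function names, the random stand-in for the encoder states H, and all sizes (n, hidden, d_a, r) are illustrative assumptions, not the authors' code.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def structured_self_attention(H, W_s1, W_s2):
    """H: (n, hidden) token states; W_s1: (d_a, hidden); W_s2: (r, d_a).

    Returns the attention matrix A (r, n) and the matrix embedding M (r, hidden).
    """
    # Each of the r rows of A is a distribution over the n tokens.
    A = softmax(W_s2 @ np.tanh(W_s1 @ H.T), axis=1)
    # Each row of M is a weighted sum of token states under one attention row.
    M = A @ H
    return A, M

def redundancy_penalty(A):
    # Squared Frobenius norm of (A A^T - I); assumed form of the regularizer,
    # which discourages different rows from attending to the same tokens.
    r = A.shape[0]
    diff = A @ A.T - np.eye(r)
    return float(np.sum(diff ** 2))

# Hypothetical sizes; H stands in for the output of a sentence encoder.
rng = np.random.default_rng(0)
n, hidden, d_a, r = 6, 8, 5, 3
H = rng.normal(size=(n, hidden))
W_s1 = rng.normal(size=(d_a, hidden))
W_s2 = rng.normal(size=(r, d_a))

A, M = structured_self_attention(H, W_s1, W_s2)
penalty = redundancy_penalty(A)
```

Because every row of A sums to 1 over the tokens, plotting A as a heatmap over the sentence is one simple way to get the visualization the abstract mentions.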

This paper has not been read by Pith yet.

Forward citations

Cited by 11 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

FRACTAL: SSM with Fractional Recurrent Architecture for Computational Temporal Analysis of Long Sequences
cs.AI 2026-05 unverdicted novelty 7.0

FRACTAL integrates fractional recurrent architecture into SSMs using a tunable singularity index to capture multi-scale temporal features, reporting 87.11% average on Long Range Arena and outperforming S5.

Graph Attention Networks
stat.ML 2017-10 accept novelty 7.0

Graph Attention Networks compute learnable attention coefficients over node neighborhoods to produce weighted feature aggregations, achieving state-of-the-art results on citation networks and inductive protein-protein...

Learning Shared Sentiment Prototypes for Adaptive Multimodal Sentiment Analysis
cs.MM 2026-04 unverdicted novelty 6.0

PRISM learns shared sentiment prototypes to enable structured cross-modal comparison and dynamic modality reweighting in multimodal sentiment analysis, outperforming baselines on three benchmark datasets.

Cognitive State Inference from VR Motion via Motion Foundation Model
cs.HC 2025-09 unverdicted novelty 6.0

VR head and hand motion data can be adapted to motion foundation models to classify cognitive states like confusion and hesitation at 82% accuracy with better cross-user generalization than baseline models on a new 24...

DropAttention: A Regularization Method for Fully-Connected Self-Attention Networks
cs.CL 2019-07 unverdicted novelty 6.0

DropAttention regularizes attention weights in fully-connected self-attention networks to reduce overfitting and improve performance.

Construct Dynamic Graphs for Hand Gesture Recognition via Spatial-Temporal Attention
cs.CV 2019-07 unverdicted novelty 6.0

DG-STA builds dynamic graphs from hand skeletons, applies spatial-temporal self-attention to learn features, and uses a mask to cut cost by 99%, outperforming prior methods on DHG-14/28 and SHREC'17.

Deep Mixture Point Processes: Spatio-temporal Event Prediction with Rich Contextual Information
stat.ML 2019-06 unverdicted novelty 6.0

DMPP models spatio-temporal event intensity as a deep NN-weighted mixture of kernels to incorporate high-dimensional context while keeping likelihood integration tractable.

Universal Transformers
cs.CL 2018-07 unverdicted novelty 6.0

Universal Transformers combine Transformer parallelism with recurrent updates and dynamic halting to achieve Turing-completeness under assumptions and outperform standard Transformers on algorithmic and language tasks.

AMAD: Adversarial Multiscale Anomaly Detection on High-Dimensional and Time-Evolving Categorical Data
cs.LG 2019-07 unverdicted novelty 5.0

AMAD is an end-to-end model using adversarial autoencoders and RNNs with attention for multiscale anomaly detection on time-evolving high-dimensional categorical data.

Attention Is All You Need
cs.CL 2017-06 unverdicted novelty 5.0

Pith review generated a malformed one-line summary.

Automatically Learning Construction Injury Precursors from Text
cs.CL 2019-07 unverdicted novelty 4.0

Standard NLP classifiers can surface valid injury precursors from raw construction safety reports.