
super hub Mixed citations

Attention Is All You Need

Mixed citation behavior. Most common role is background (63%).

378 Pith papers citing it
Background 63% of classified citations
abstract

The dominant sequence transduction models are based on complex recurrent or convolutional neural networks in an encoder-decoder configuration. The best performing models also connect the encoder and decoder through an attention mechanism. We propose a new simple network architecture, the Transformer, based solely on attention mechanisms, dispensing with recurrence and convolutions entirely. Experiments on two machine translation tasks show these models to be superior in quality while being more parallelizable and requiring significantly less time to train. Our model achieves 28.4 BLEU on the WMT 2014 English-to-German translation task, improving over the existing best results, including ensembles, by over 2 BLEU. On the WMT 2014 English-to-French translation task, our model establishes a new single-model state-of-the-art BLEU score of 41.8 after training for 3.5 days on eight GPUs, a small fraction of the training costs of the best models from the literature. We show that the Transformer generalizes well to other tasks by applying it successfully to English constituency parsing both with large and limited training data.


citation-role summary

background 53 · method 22 · dataset 3 · other 1





representative citing papers

Dissecting Jet-Tagger Through Mechanistic Interpretability

hep-ph · 2026-05-11 · accept · novelty 8.0

A Particle Transformer jet tagger contains a sparse six-head circuit whose source-relay-readout structure recovers most performance and whose residual stream preferentially encodes 2-prong energy correlators.

Large Language Diffusion Models

cs.CL · 2025-02-14 · unverdicted · novelty 8.0

LLaDA is a scalable diffusion-based language model that matches autoregressive LLMs like LLaMA3 8B on tasks and surpasses GPT-4o on reversal poem completion.

SimCSE: Simple Contrastive Learning of Sentence Embeddings

cs.CL · 2021-04-18 · conditional · novelty 8.0

SimCSE achieves 76.3% unsupervised and 81.6% supervised Spearman's correlation on STS tasks with BERT-base, improving prior best results by 4.2% and 2.2% via simple contrastive learning.

Reformer: The Efficient Transformer

cs.LG · 2020-01-13 · accept · novelty 8.0

Reformer matches standard Transformer accuracy on long sequences while using far less memory and running faster via LSH attention and reversible residual layers.

Exploring Line Bundle Standard Models with Transformers

hep-th · 2026-06-30 · unverdicted · novelty 7.0

A Transformer RL agent is trained to generate valid heterotic line bundle sums on CICYs that satisfy gauge embedding, anomaly cancellation, poly-stability, chirality, and no-exotics constraints.

Information Dynamics of Language Communication

cs.CL · 2026-06-29 · unverdicted · novelty 7.0

The paper defines STE and SPID, two information-theoretic measures of semantic flow and decomposition in language exchanges, and applies them to four dialogue datasets.

Phase transitions for the noisy transformer model in arbitrary dimension

math.AP · 2026-06-03 · unverdicted · novelty 7.0

In every dimension d≥2 there exists a unique β_*^{(d)}>0 such that the uniform density on the sphere is the unique global minimizer of the USA free energy up to the linear-stability threshold K_# for β≤β_*, yielding a continuous transition, while for β>β_* the uniform density is not globally minimiz

Particle-Lund Multimodality in Jet Taggers

hep-ph · 2026-05-26 · unverdicted · novelty 7.0

PLuM multimodal transformer improves top and H->bb jet tagging by jointly processing particle constituents and Lund plane splittings, yielding 25% higher background rejection at 25% di-Higgs efficiency.

UWM-JEPA: Predictive World Models That Imagine in Belief Space

cs.LG · 2026-05-25 · unverdicted · novelty 7.0

UWM-JEPA uses a density-matrix latent and unitary predictor in JEPA to preserve joint-state spectrum during blind rollouts, achieving 0.77 accuracy on a five-step hidden-velocity task versus 0.53 for an LSTM baseline.
