pith. machine review for the scientific record.

arXiv: 1706.03762 · v7 · submitted 2017-06-12 · 💻 cs.CL · cs.LG

Attention Is All You Need

Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Łukasz Kaiser, Illia Polosukhin

Pith reviewed 2026-05-08 22:11 UTC · model claude-opus-4-7

classification 💻 cs.CL cs.LG
keywords self-attention · Transformer · sequence transduction · neural machine translation · multi-head attention · positional encoding · encoder-decoder · constituency parsing

The pith

A sequence model built entirely from self-attention, with no recurrence or convolution, sets new translation state of the art while training far faster than the systems it replaces.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper proposes that sequence transduction — translating one sequence into another — can be done without any recurrence or convolution, using only attention. Its Transformer stacks self-attention and small feed-forward layers, adds sinusoidal positional signals so the model knows token order, and uses multiple attention heads in parallel so different subspaces of meaning can be attended to at once. On the standard WMT 2014 English-German and English-French benchmarks the resulting models beat the previous best systems, including ensembles, while training in hours-to-days on eight GPUs rather than the much larger budgets of prior work. Ablations show that the gains depend on having enough heads (but not too many), enough key dimension, and dropout; learned positional embeddings work about as well as sinusoidal ones. The same architecture, lightly retuned, also produces competitive English constituency parses, including in the small-data regime where recurrent seq2seq models had struggled. The reason a sympathetic reader should care is that this is a constructive demonstration that the inductive bias of recurrence is not needed for strong sequence modelling, and that constant-path-length connectivity between positions is a practical alternative.

Core claim

The paper argues that the recurrent and convolutional machinery long assumed necessary for sequence transduction can be removed entirely. A stack of layers built only from multi-head self-attention and position-wise feed-forward networks, with sinusoidal positional encodings to inject order, suffices to map input sequences to output sequences. On WMT 2014 English-to-German and English-to-French, this architecture sets new BLEU records (28.4 and 41.8) while training in a fraction of the wall-clock time of the prior best systems, and it transfers to English constituency parsing with only minimal task-specific tuning. The claim is not merely that attention helps, but that attention alone, properly scaled and supplied with positional information, is sufficient.

What carries the argument

The Transformer: an encoder-decoder stack in which every layer is either multi-head scaled dot-product attention or a position-wise two-layer feed-forward network, wrapped in residual connections and layer normalization. Scaled dot-product attention computes softmax(QK^T / √d_k) V; multi-head attention runs h = 8 such attentions on learned low-dimensional projections in parallel and concatenates them, letting the model attend to different representation subspaces simultaneously. Sinusoidal positional encodings of geometrically spaced wavelengths replace recurrence as the source of order information. A warmup-then-inverse-square-root learning rate schedule, label smoothing, and residual dropout complete the training recipe.
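
A minimal NumPy sketch of these two mechanisms as §3.2.1–3.2.2 define them; the unmasked single-sequence setup and the square weight shapes are illustrative simplifications, not the paper's reference implementation.

    import numpy as np

    def scaled_dot_product_attention(Q, K, V):
        # softmax(Q K^T / sqrt(d_k)) V, with the softmax taken over the key axis
        d_k = Q.shape[-1]
        logits = Q @ K.swapaxes(-2, -1) / np.sqrt(d_k)      # (..., n_q, n_k)
        weights = np.exp(logits - logits.max(axis=-1, keepdims=True))
        weights /= weights.sum(axis=-1, keepdims=True)      # numerically stable softmax
        return weights @ V                                  # (..., n_q, d_v)

    def multi_head_self_attention(X, W_q, W_k, W_v, W_o, h=8):
        # X: (n, d_model); each W_* is (d_model, d_model), so the h heads
        # partition the model width: h * d_head = d_model.
        n, d_model = X.shape
        d_head = d_model // h
        split = lambda M: M.reshape(n, h, d_head).transpose(1, 0, 2)  # (h, n, d_head)
        heads = scaled_dot_product_attention(split(X @ W_q), split(X @ W_k), split(X @ W_v))
        return heads.transpose(1, 0, 2).reshape(n, d_model) @ W_o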

If this is right

  • Sequence models no longer need step-by-step recurrence: training can be parallelized across all positions in a sequence, cutting wall-clock cost by an order of magnitude at comparable or better quality.
  • Long-range dependencies become a constant-path-length problem rather than an O(n) or O(log n) one, removing a known obstacle to learning distant relations in text.
  • Multi-head attention provides interpretable, specialised heads — some appearing to track syntax, some anaphora — suggesting attention maps are a usable window into model behaviour.
  • The same architecture, with minimal tuning, reaches competitive English constituency parsing scores even in the 40K-sentence small-data regime, indicating the design is not narrowly tuned to translation.
  • Scaled dot-product attention with the 1/sqrt(d_k) factor is presented as the practical fix that lets dot-product attention match additive attention at large key dimensions, making the fast matmul path viable; a quick numeric check of this scaling follows below.
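
A toy numeric check of that last point, under footnote 4's assumption of i.i.d. unit-variance query and key components (an editorial illustration, not an experiment from the paper):

    import numpy as np

    rng = np.random.default_rng(0)
    for d_k in (16, 64, 256, 1024):
        q = rng.standard_normal((10000, d_k))
        k = rng.standard_normal((10000, d_k))
        dots = (q * k).sum(axis=1)
        # Var(q.k) ~ d_k, so unscaled logits have std ~ sqrt(d_k) and push the
        # softmax into saturation as d_k grows; dividing by sqrt(d_k) pins the
        # std near 1 regardless of key dimension.
        print(f"d_k={d_k:5d}  unscaled std={dots.std():7.2f}  scaled std={(dots / np.sqrt(d_k)).std():5.2f}")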

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Because path length between any two positions collapses to O(1), the architecture should scale to tasks where very long-range structure matters — language modelling, code, audio, video — once compute and memory for the n^2 attention term are addressed; the paper hints at restricted-neighborhood attention as the route.
  • The fact that learned and sinusoidal positional encodings perform nearly identically suggests the model is fairly indifferent to how position is supplied, as long as it is supplied; this opens the door to relative-position schemes that the paper does not pursue (a construction sketch follows this list).
  • Sharing weights between the input embedding, output embedding, and pre-softmax projection, combined with the sqrt(d_model) scaling, is a small detail that likely matters more than its one-line treatment suggests for stable optimisation at this depth.
  • Reported per-head specialisation (syntax-like, anaphora-like behaviour) implies that attention weights could serve as a diagnostic tool for model debugging and linguistic analysis, independent of their role in computation.
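
To ground the positional-encoding points, a small NumPy construction of the §3.5 encoding plus a check that PE_{pos+k} is a fixed linear (rotation) function of PE_{pos}; the identity is elementary trigonometry, not a result from the paper.

    import numpy as np

    def sinusoidal_pe(n_pos, d_model):
        # PE[pos, 2i] = sin(pos / 10000^(2i/d_model)), PE[pos, 2i+1] = cos(...)
        pos = np.arange(n_pos)[:, None]
        i = np.arange(d_model // 2)[None, :]
        ang = pos / np.power(10000.0, 2 * i / d_model)
        pe = np.empty((n_pos, d_model))
        pe[:, 0::2], pe[:, 1::2] = np.sin(ang), np.cos(ang)
        return pe

    # For each frequency w, the (sin, cos) pair at pos+k is a fixed rotation of
    # the pair at pos: sin(w(pos+k)) = sin(w pos)cos(wk) + cos(w pos)sin(wk).
    pe = sinusoidal_pe(128, 64)
    w = 1.0 / np.power(10000.0, 2 * np.arange(32) / 64)
    k = 5
    shifted_sin = pe[:-k, 0::2] * np.cos(w * k) + pe[:-k, 1::2] * np.sin(w * k)
    print(np.allclose(shifted_sin, pe[k:, 0::2]))  # True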

Load-bearing premise

That self-attention's quadratic cost in sequence length stays affordable on the lengths that matter — and that the benchmark gains seen on sentence-scale translation will continue to hold as inputs grow longer.
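
A back-of-envelope rendering of the Table 1 tradeoff this premise rests on, counting multiply-adds only and ignoring constants (an editorial illustration):

    d = 512  # d_model of the base Transformer
    for n in (10, 70, 512, 4096, 65536):
        attn_cost = n * n * d      # self-attention layer: O(n^2 * d)
        rnn_cost = n * d * d       # recurrent layer:      O(n * d^2)
        print(f"n={n:6d}  attention/recurrent cost ratio = {attn_cost / rnn_cost:.2f}")
    # The ratio is n/d: attention is cheaper while sequences are shorter than
    # the model width (typical sentence lengths), and the quadratic term
    # dominates only on much longer inputs, the regime at issue here.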

What would settle it

Train the described base and big Transformers on WMT 2014 EN-DE and EN-FR with the stated hyperparameters (8 P100 GPUs, 100K and 300K steps, Adam with the warmup schedule, label smoothing 0.1, beam 4, length penalty 0.6) and check whether test BLEU reaches 27.3/38.1 and 28.4/41.8 respectively. If reproductions land materially below those numbers, or if a comparably sized recurrent or convolutional model trained for the same compute matches them, the sufficiency claim weakens.
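
For anyone attempting that reproduction, the warmup schedule referred to above is Eq. (3) of the paper; a direct transcription follows, with the peak-value check being simple arithmetic rather than a number reported in the paper.

    def transformer_lr(step, d_model=512, warmup_steps=4000):
        # Linear warmup for warmup_steps, then inverse-square-root decay;
        # step is 1-indexed, and the two branches meet at step == warmup_steps.
        return d_model ** -0.5 * min(step ** -0.5, step * warmup_steps ** -1.5)

    print(transformer_lr(4000))  # peak LR for the base model, ~7.0e-4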

read the original abstract

The dominant sequence transduction models are based on complex recurrent or convolutional neural networks in an encoder-decoder configuration. The best performing models also connect the encoder and decoder through an attention mechanism. We propose a new simple network architecture, the Transformer, based solely on attention mechanisms, dispensing with recurrence and convolutions entirely. Experiments on two machine translation tasks show these models to be superior in quality while being more parallelizable and requiring significantly less time to train. Our model achieves 28.4 BLEU on the WMT 2014 English-to-German translation task, improving over the existing best results, including ensembles, by over 2 BLEU. On the WMT 2014 English-to-French translation task, our model establishes a new single-model state-of-the-art BLEU score of 41.8 after training for 3.5 days on eight GPUs, a small fraction of the training costs of the best models from the literature. We show that the Transformer generalizes well to other tasks by applying it successfully to English constituency parsing both with large and limited training data.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

4 major / 8 minor

Summary. The paper proposes the Transformer, a sequence transduction architecture built entirely on (multi-head, scaled dot-product) self-attention plus position-wise feed-forward layers, residual connections, layer normalization, and sinusoidal positional encodings, dispensing with recurrence and convolutions. The authors report new state-of-the-art BLEU on WMT'14 EN-DE (28.4, +2.0 over the best prior including ensembles) and WMT'14 EN-FR (41.8 single model), at a substantially reduced training cost (3.5 days on 8 P100 GPUs for the big model). They also show the architecture transfers to English constituency parsing (WSJ-only F1 91.3; semi-supervised 92.7), and provide ablations over number of heads, key/value dimension, model size, dropout, and positional encoding scheme. A reference implementation in tensor2tensor is released.

Significance. If the BLEU and training-cost results hold, the contribution is substantial: a parallelizable architecture that simultaneously improves translation quality and reduces wall-clock and compute cost, while admitting a clean analysis of per-layer complexity and maximum path length (Table 1). The architectural ideas — scaled dot-product attention with the 1/√d_k scaling motivated in §3.2.1 and footnote 4, multi-head attention, sinusoidal positional encodings, and the encoder/decoder layout in §3.1 — are presented with enough specificity to reproduce, and the released tensor2tensor code is a concrete reproducibility asset. The ablation table (Table 3) is unusually thorough for an architecture paper and directly supports the design choices. The constituency parsing result (Table 4) provides nontrivial evidence of generality outside MT. The headline BLEU comparisons are clear-cut single-number improvements that other groups can independently re-run.

major comments (4)
  1. [Table 2 / footnote 5 (training-cost methodology)] The 'fraction of training cost' claim in the abstract and §6.1 is supported by the FLOPs column of Table 2, which footnote 5 defines as (training time) × (#GPUs) × an assumed sustained single-precision TFLOPS per GPU type (2.8/3.7/6.0/9.5 for K80/K40/M40/P100). These sustained rates are assumed, not measured, and a single scalar per GPU type cannot capture the very different utilization regimes of attention-heavy large-GEMM workloads vs RNN/ConvS2S workloads that are more memory-bandwidth- and dependency-bound. The numerical gap to competitors (e.g., 3.3·10^18 vs 9.6·10^18 ConvS2S; 2.3·10^19 vs 1.1·10^21 GNMT+RL ensemble) is therefore an estimate whose uncertainty is not quantified. The BLEU SOTA claim is unaffected, but the efficiency claim — also stated in the abstract — would be much stronger if the authors reported either (i) measured TFLOPS/utilization for their own runs, or (ii) a sensitivity analysis of the FLOPs estimates over plausible sustained-throughput ranges for each system.
  2. [§3.2.1 / footnote 4 (scaling justification)] The motivation for the 1/√d_k scaling is given heuristically: assuming q,k components are independent with mean 0 and variance 1, q·k has variance d_k. This is fine as intuition, but after training the components of q and k are not independent unit-variance variables, and the entropy of the softmax depends on the realized variance of the logits, not the assumed one. A short empirical check — e.g., logit-variance and gradient-norm statistics with and without the scaling at d_k=64 and at larger d_k — would convert the scaling from a plausible heuristic into a supported design choice. This is load-bearing because the scaling is repeatedly highlighted as a distinguishing element relative to plain dot-product attention.
  3. [§5.4 / Table 3 row (E) (positional encoding)] The motivation for choosing sinusoidal over learned positional encodings rests on two arguments: (a) it 'may allow the model to extrapolate to sequence lengths longer than the ones encountered during training', and (b) PE_{pos+k} can be written as a linear function of PE_{pos}. Claim (a) is presented as a hypothesis but not tested; given that Table 3(E) shows learned and sinusoidal encodings produce 'nearly identical results' in-distribution, the only differentiator offered is extrapolation, which is not evaluated. A small experiment evaluating BLEU on test sentences longer than the training maximum (or a synthetic length-generalization probe) would either substantiate the design choice or appropriately soften the recommendation.
  4. [§6.1 (EN-FR BLEU consistency)] There is an internal inconsistency in the EN-FR result: the abstract and Table 2 report 41.8 BLEU for the big model, while §6.1 states 'our big model achieves a BLEU score of 41.0'. Please reconcile (presumably a typo for 41.8) so that readers can cite the paper unambiguously.
minor comments (8)
  1. [§3.2.2] Stating explicitly that h·d_k = h·d_v = d_model in the base configuration would help readers verify the 'similar computational cost to single-head with full dimensionality' claim without back-solving it from the listed dimensions.
  2. [§3.4] The rationale for multiplying embedding weights by √d_model is asserted but not motivated. A one-sentence explanation (scale matching against the positional encoding magnitude, or against the shared output projection) would be helpful; a sketch of one such reading follows the minor comments.
  3. [§5.3 Eq. (3)] Define 'step_num' (1-indexed?) and clarify the units; the formula's behavior at step_num=0 is undefined as written.
  4. [Table 3] The caption notes 'Unlisted values are identical to those of the base model' but several rows in (C) and (D) leave multiple cells blank; an explicit listing of which hyperparameter is being varied per sub-block would aid readability.
  5. [§6.2] The claim that 'quality also drops off with too many heads' is based on a single row (h=32, BLEU 25.4 vs base 25.8); given typical run-to-run variance on newstest2013, a seed-variance estimate would make this statement more defensible.
  6. [Figures 3–5] The attention visualizations are referenced as evidence for syntactic/semantic structure in §4, but are presented in the appendix without quantitative analysis. Consider either softening the language in §4 ('appear to exhibit') — which is partially done already — or adding a more systematic probing analysis.
  7. [§5.1] Reference [3] is cited for byte-pair encoding but the canonical reference is Sennrich et al. [31]; please check the citation.
  8. [§3.5] It would help to state explicitly that the same positional encoding is added at the encoder and decoder bottoms (the wording 'at the bottoms of the encoder and decoder stacks' is correct but easily missed).
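
On minor comment 2, one plausible reading of the √d_model multiplier (scale-matching the tied embedding against the unit-magnitude positional encoding) can be sketched as follows; this is an editorial illustration with an assumed N(0, 1/d_model) initialization, not the authors' stated rationale.

    import numpy as np

    d_model, vocab = 512, 32000
    rng = np.random.default_rng(0)
    # One matrix shared by the input embedding, output embedding, and
    # pre-softmax projection (Press and Wolf [30]); a typical N(0, 1/d_model)
    # initialization gives its rows roughly unit norm.
    E = rng.standard_normal((vocab, d_model)) / np.sqrt(d_model)
    emb = E[rng.integers(0, vocab, size=128)]
    print(np.linalg.norm(emb, axis=1).mean())                      # ~1.0
    print(np.linalg.norm(emb * np.sqrt(d_model), axis=1).mean())   # ~sqrt(d_model), ~22.6
    # The sinusoidal positional encoding added on top has row norm
    # sqrt(d_model / 2), ~16, so without the multiplier position would
    # dominate content; with it the two signals are commensurate.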

Simulated Author's Rebuttal

4 responses · 1 unresolved

We thank the referee for the careful reading and for recommending acceptance. The four major comments are well taken. We will (1) correct the EN-FR BLEU typo in §6.1 (the correct figure is 41.8, as in the abstract and Table 2); (2) soften and clarify the training-cost methodology in footnote 5 and the abstract, making explicit that the FLOPs column uses assumed nominal sustained TFLOPS per GPU type rather than measured utilization, while noting that the qualitative efficiency claim is robust given the order-of-magnitude gaps to competitors; (3) add a brief empirical check of pre-softmax logit statistics and an unscaled-attention training comparison to support the 1/√d_k scaling beyond the heuristic in footnote 4; and (4) add a length-generalization probe comparing sinusoidal and learned positional encodings, and soften §3.5 to reflect that, in-distribution, Table 3(E) does not distinguish the two. None of these changes affect the headline BLEU results or the Table 1 complexity analysis.

read point-by-point responses
  1. Referee: Training-cost methodology in Table 2 / footnote 5 relies on assumed sustained TFLOPS per GPU type rather than measured utilization, leaving the efficiency claim's uncertainty unquantified.

    Authors: We agree with the referee that the FLOPs column is an estimate rather than a measurement, and we will make this more explicit in the revision. Specifically, we will (i) reword the footnote to state that we assume a single sustained single-precision rate per GPU type (2.8/3.7/6.0/9.5 TFLOPS for K80/K40/M40/P100) and that these are upper-bound nominal sustained values rather than per-run measurements; (ii) caveat the abstract and §6.1 efficiency statement accordingly. We note, however, that the training-cost gap to the strongest comparators is large enough (e.g., 3.3·10^18 vs 1.1·10^21 for GNMT+RL ensemble, ~300×) that even substantial differences in actual utilization between attention/GEMM-heavy and RNN workloads would not overturn the qualitative claim. We do not have wall-clock comparisons run on identical hardware against the competitor systems, so a fully measured comparison is out of scope for this submission, but we will report the wall-clock training time of our own runs (12 hours base / 3.5 days big on 8×P100) as a concrete, reproducible number alongside the FLOPs estimate. revision: partial

  2. Referee: The 1/√d_k scaling justification (footnote 4) is a heuristic that assumes independent unit-variance q,k components; an empirical check of logit variance / gradient statistics would substantiate it.

    Authors: The referee is correct that the argument in footnote 4 is a pre-training heuristic, and we agree that it is most useful as motivation rather than as a post-hoc explanation. We chose the scaling because, in early experiments, models trained with unscaled dot products at d_k=64 either failed to train or were markedly worse, while the scaled variant trained stably; this is the source of the design choice. For the camera-ready we will add a brief empirical note giving (a) the empirical standard deviation of pre-softmax logits with and without scaling at d_k=64 and at a larger d_k, and (b) a comparison of training curves / final BLEU for the unscaled variant on the base configuration. We expect this to confirm the heuristic's qualitative prediction (logit magnitudes and softmax saturation grow with d_k absent the scaling) while making clear, as the referee notes, that the realized variance after training is what matters. revision: yes

  3. Referee: The extrapolation argument for sinusoidal vs learned positional encodings is untested; given that Table 3(E) shows near-identical in-distribution results, length generalization should be evaluated.

    Authors: We accept this criticism. As written, §3.5 advances extrapolation as a hypothesis and Table 3(E) only compares in-distribution performance, so the paper does not currently substantiate the differentiator we cite. For the revision we will (i) soften the language in §3.5 to make clear that the choice between sinusoidal and learned encodings is not empirically distinguished by our reported MT results, and (ii) add a short length-generalization probe — evaluating BLEU on test buckets whose source length exceeds the training maximum, and comparing sinusoidal vs learned encodings — in an appendix. If the probe shows no advantage, we will say so explicitly rather than retain the extrapolation argument as a justification. revision: yes

  4. Referee: Internal inconsistency in EN-FR BLEU: the abstract and Table 2 report 41.8, but §6.1 says 41.0.

    Authors: Thank you for catching this. The correct number is 41.8, matching the abstract and Table 2; the '41.0' in §6.1 is a typo and will be corrected in the revised manuscript. revision: yes

standing simulated objections not resolved
  • A fully measured (rather than estimated) training-cost comparison against ConvS2S, GNMT+RL, MoE, and Deep-Att+PosUnk on identical hardware is not feasible within this submission; we will instead report our own wall-clock numbers and clearly flag the competitor FLOPs as estimates.

Circularity Check

0 steps flagged

No meaningful circularity: BLEU claims are evaluated against external WMT'14 test sets, and architectural choices are justified by ablations rather than self-citation.

full rationale

The paper's central claims are (1) the Transformer architecture achieves higher BLEU than prior SOTA on WMT'14 EN-DE and EN-FR, and (2) it does so at lower training cost. Both claims are evaluated against external, standard benchmarks (newstest2014) using a metric (BLEU) defined independently of the paper, with comparators drawn from prior published work. There is no fitted-then-renamed-as-prediction structure: the model is trained on WMT training data and evaluated on a held-out test set, which is the standard non-circular protocol. Ablations in Table 3 vary heads, d_k, d_model, dropout, and positional-encoding type, with results reported on the dev set newstest2013 — these are independent empirical comparisons, not definitional identities. Architectural motivations (Section 4, Table 1) are stated as complexity/path-length tradeoffs against recurrent and convolutional baselines using elementary big-O accounting, not derived from a self-citation chain. Self-citations exist (e.g., to tensor2tensor-adjacent work, GNMT [38], MoE [32]) but none is load-bearing for a uniqueness or forced-choice argument; the paper does not claim its design is uniquely determined. A skeptical reader's concern about Table 2's training-cost FLOPs estimate (footnote 5: assumed sustained TFLOPS per GPU type, not measured) is a methodological/correctness concern about a comparison metric, not circularity — the FLOPs estimator does not use the BLEU result as input, nor vice versa. Per the rubric, "this is not standard consensus" or "this estimate uses assumed constants" belongs under correctness risk, not circularity. Score 0.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Model omitted the axiom ledger; defaulted for pipeline continuity.

pith-pipeline@v0.9.0 · 9575 in / 5522 out tokens · 83923 ms · 2026-05-08T22:11:17.225364+00:00 · methodology

discussion (0)


Forward citations

Cited by 58 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Dissecting Jet-Tagger Through Mechanistic Interpretability

    hep-ph 2026-05 accept novelty 8.0

    A Particle Transformer jet tagger contains a sparse six-head circuit whose source-relay-readout structure recovers most performance and whose residual stream preferentially encodes 2-prong energy correlators.

  2. Tessera: Unlocking Heterogeneous GPUs through Kernel-Granularity Disaggregation

    cs.DC 2026-04 unverdicted novelty 8.0

    Tessera performs kernel-granularity disaggregation on heterogeneous GPUs, achieving up to 2.3x throughput and 1.6x cost efficiency gains for large model inference while generalizing beyond prior methods.

  3. Large Language Diffusion Models

    cs.CL 2025-02 unverdicted novelty 8.0

    LLaDA is a scalable diffusion-based language model that matches autoregressive LLMs like LLaMA3 8B on tasks and surpasses GPT-4o on reversal poem completion.

  4. Grokking: Generalization Beyond Overfitting on Small Algorithmic Datasets

    cs.LG 2022-01 unverdicted novelty 8.0

    Neural networks exhibit grokking on small algorithmic datasets, achieving perfect generalization well after overfitting.

  5. Reformer: The Efficient Transformer

    cs.LG 2020-01 accept novelty 8.0

    Reformer matches standard Transformer accuracy on long sequences while using far less memory and running faster via LSH attention and reversible residual layers.

  6. Bin Latent Transformer (BiLT): A shift-invariant autoencoder for calibration-free spectral unmixing of turbid media

    physics.optics 2026-05 unverdicted novelty 7.0

    The BiLT autoencoder recovers absorption and scattering spectra from integrating sphere data with high accuracy while remaining robust to wavelength shifts up to 10 bands and generalizing to different instrument line ...

  7. Offline Policy Evaluation for Manipulation Policies via Discounted Liveness Formulation

    cs.RO 2026-05 conditional novelty 7.0

    A liveness-based Bellman operator enables conservative offline policy evaluation for manipulation tasks by encoding task progression and reducing truncation bias from finite horizons.

  8. End-to-End Population Inference from Gravitational-Wave Strain using Transformers

    gr-qc 2026-05 unverdicted novelty 7.0

    Dingo-Pop uses a transformer to perform amortized, end-to-end population inference from GW strain data in seconds, bypassing per-event Monte Carlo sampling.

  9. Automated Detection of Abnormalities in Zebrafish Development

    cs.CV 2026-05 unverdicted novelty 7.0

    A new annotated dataset of zebrafish embryo image sequences enables a spatiotemporal transformer to classify fertility at 98% accuracy and detect compound-induced malformations at 92% accuracy.

  10. Complex-Valued Phase-Coherent Transformer

    cs.LG 2026-05 unverdicted novelty 7.0

    PCT replaces softmax token competition with a smooth phase-preserving gate on normalized complex similarities, yielding stronger generalization on long-range and phase-sensitive benchmarks than both real and complex T...

  11. From Syntax to Semantics: Unveiling the Emergence of Chirality in SMILES Translation Models

    cs.LG 2026-05 unverdicted novelty 7.0

    Chirality emerges in SMILES translation models through an abrupt encoder-centered reorganization of representations after a long plateau, identified via checkpoint analysis and ablation.

  12. Generating Complex Code Analyzers from Natural Language Questions

    cs.SE 2026-05 unverdicted novelty 7.0

    Merlin generates CodeQL queries from natural language questions via RAG-based iteration and a self-test technique using assistive queries, achieving 3.8x higher task accuracy and 31% less completion time in user studi...

  13. Dooly: Configuration-Agnostic, Redundancy-Aware Profiling for LLM Inference Simulation

    cs.DC 2026-05 unverdicted novelty 7.0

    Dooly reduces LLM inference profiling costs by 56.4% via configuration-agnostic taint-based labeling and selective database reuse, delivering simulation accuracy within 5% MAPE for TTFT and 8% for TPOT across 12 models.

  14. Neural network quantum states in the grand canonical ensemble

    quant-ph 2026-05 unverdicted novelty 7.0

    A new neural quantum state ansatz for bosons in the grand canonical ensemble achieves competitive variational energies in 1D and 2D systems and provides access to one-body reduced density matrices.

  15. Is She Even Relevant? When BERT Ignores Explicit Gender Cues

    cs.CL 2026-05 conditional novelty 7.0

    A Dutch BERT model encodes gender linearly by epoch 20 but does not dynamically update its representations when explicit female cues contradict learned stereotypical associations in short sentence templates.

  16. TSCG: Deterministic Tool-Schema Compilation for Agentic LLM Deployments

    cs.SE 2026-05 unverdicted novelty 7.0

    TSCG compiles JSON tool schemas into token-efficient structured text, raising tool-use accuracy for small LLMs from 0% to 84.4% on benchmarks while cutting tokens by 52-57%.

  17. Rethinking the Need for Source Models: Source-Free Domain Adaptation from Scratch Guided by a Vision-Language Model

    cs.CV 2026-05 unverdicted novelty 7.0

    The paper introduces the VODA setting for domain adaptation from scratch using vision-language models and presents TS-DRD, which achieves competitive performance on standard benchmarks without source models.

  18. Evaluating LLMs on Large-Scale Graph Property Estimation via Random Walks

    cs.LG 2026-05 unverdicted novelty 7.0

    EstGraph benchmark evaluates LLMs on estimating properties of very large graphs from random-walk samples that fit in context limits.

  19. Reconstructing conformal field theoretical compositions with Transformers

    hep-th 2026-05 unverdicted novelty 7.0

    Transformers reconstruct the constituent RCFTs in tensor-product theories from low-energy spectra, reaching 98% accuracy on WZW models and generalizing to larger central charges with few out-of-domain examples.

  20. Homogeneous Stellar Parameters from Heterogeneous Spectra with Deep Learning

    astro-ph.GA 2026-04 unverdicted novelty 7.0

    A single end-to-end Transformer model unifies stellar labels from heterogeneous spectroscopic surveys into a self-consistent scale without post-hoc recalibration.

  21. Generative diffusion models for spatiotemporal influenza forecasting

    cs.LG 2026-04 unverdicted novelty 7.0

    Influpaint uses generative diffusion models on image-encoded influenza data to produce realistic and diverse epidemic trajectories that match leading ensemble methods in accuracy.

  22. Attention Is Not All You Need for Diffraction

    cond-mat.mtrl-sci 2026-04 unverdicted novelty 7.0

    Physics-informed transformer with sin^2(theta) encoding, physics-aware positional encoding, multi-task decoder, and three-stage curriculum classifies powder diffraction into 99 extinction groups, with structured error...

  23. Oracle Noise: Faster Semantic Spherical Alignment for Interpretable Latent Optimization

    cs.CV 2026-04 unverdicted novelty 7.0

    Oracle Noise optimizes diffusion model noise on a Riemannian hypersphere guided by key prompt words to preserve the Gaussian prior, eliminate norm inflation, and achieve faster semantic alignment than Euclidean methods.

  24. A Systematic Survey of Security Threats and Defenses in LLM-Based AI Agents: A Layered Attack Surface Framework

    cs.CR 2026-04 unverdicted novelty 7.0

    A new 7x4 taxonomy organizes agentic AI security threats by architectural layer and persistence timescale, revealing under-explored upper layers and missing defenses after surveying 116 papers.

  25. EgoMAGIC- An Egocentric Video Field Medicine Dataset for Training Perception Algorithms

    cs.CV 2026-04 accept novelty 7.0

    EgoMAGIC is a new public egocentric video dataset of medical tasks with object labels for 124 items and action detection baselines reaching 0.526 mAP on eight tasks.

  26. Masked-Token Prediction for Anomaly Detection at the Large Hadron Collider

    hep-ph 2026-04 unverdicted novelty 7.0

    The work demonstrates masked-token prediction with transformers for model-independent anomaly detection in LHC data, achieving strong results on top-rich BSM signatures like four-top production using VQ-VAE tokenization.

  27. Copy First, Translate Later: Interpreting Translation Dynamics in Multilingual Pretraining

    cs.CL 2026-04 unverdicted novelty 7.0

    Multilingual pretraining develops translation in two phases: early copying driven by surface similarities, followed by generalizing mechanisms while copying is refined.

  28. Working Memory in a Recurrent Spiking Neural Networks With Heterogeneous Synaptic Delays

    q-bio.NC 2026-04 unverdicted novelty 7.0

    A recurrent SNN with heterogeneous synaptic delays (D=41) achieves perfect F1=1.0 recall of 16 arbitrary spike patterns on a synthetic benchmark by representing them as chains of overlapping spiking motifs.

  29. Dual Triangle Attention: Effective Bidirectional Attention Without Positional Embeddings

    q-bio.QM 2026-04 unverdicted novelty 7.0

    Dual Triangle Attention achieves effective bidirectional attention with built-in positional inductive bias via dual triangular masks, outperforming standard bidirectional attention on position-sensitive tasks and show...

  30. The UNDO Flip-Flop: A Controlled Probe for Reversible Semantic State Management in State Space Model

    cs.LG 2026-04 unverdicted novelty 7.0

    Mamba-2 models fail to learn reversible state retrieval in the UNDO Flip-Flop task, defaulting to a toggle heuristic and achieving only 41% accuracy under adversarial conditions.

  31. Joint Fullband-Subband Modeling for High-Resolution SingFake Detection

    cs.SD 2026-04 unverdicted novelty 7.0

    A joint fullband-subband model using high-resolution 44.1 kHz audio outperforms standard 16 kHz detectors for singing voice deepfake detection by exploiting spectrum-specific synthesis artifacts.

  32. Profile-Then-Reason: Bounded Semantic Complexity for Tool-Augmented Language Agents

    cs.AI 2026-04 unverdicted novelty 7.0

    PTR framework profiles a workflow upfront then executes it deterministically with bounded verification and repair, limiting LM calls to 2-3 while outperforming ReAct in 16 of 24 tested configurations.

  33. Fusion and Alignment Enhancement with Large Language Models for Tail-item Sequential Recommendation

    cs.IR 2026-04 unverdicted novelty 7.0

    FAERec fuses collaborative ID embeddings with LLM semantic embeddings using adaptive gating and dual-level alignment to enhance tail-item sequential recommendations.

  34. Evaluating the Formal Reasoning Capabilities of Large Language Models through Chomsky Hierarchy

    cs.CL 2026-04 unverdicted novelty 7.0

    LLMs display clear performance stratification on formal language tasks aligned with Chomsky hierarchy complexity levels, limited by severe efficiency barriers rather than absolute capability.

  35. Scaling up Test-Time Compute with Latent Reasoning: A Recurrent Depth Approach

    cs.LG 2025-02 unverdicted novelty 7.0

    A recurrent-depth architecture enables language models to improve reasoning performance by iterating computation in latent space, achieving gains equivalent to much larger models on benchmarks.

  36. Learning Fine-Grained Bimanual Manipulation with Low-Cost Hardware

    cs.RO 2023-04 conditional novelty 7.0

    Low-cost imprecise robots achieve 80-90% success on six fine bimanual manipulation tasks using imitation learning with a new Action Chunking with Transformers algorithm trained on only 10 minutes of demonstrations.

  37. LLM.int8(): 8-bit Matrix Multiplication for Transformers at Scale

    cs.LG 2022-08 conditional novelty 7.0

    LLM.int8() performs 8-bit inference for transformers up to 175B parameters with no accuracy loss by combining vector-wise quantization for most features with 16-bit mixed-precision handling of systematic outlier dimensions.

  38. Hierarchical Text-Conditional Image Generation with CLIP Latents

    cs.CV 2022-04 accept novelty 7.0

    A hierarchical prior-decoder model using CLIP latents generates more diverse text-conditional images than direct methods while preserving photorealism and caption fidelity.

  39. GLIDE: Towards Photorealistic Image Generation and Editing with Text-Guided Diffusion Models

    cs.CV 2021-12 accept novelty 7.0

    A 3.5-billion-parameter diffusion model with classifier-free guidance generates images preferred over DALL-E by human raters and can be fine-tuned for text-guided inpainting.

  40. Diffusion Models Beat GANs on Image Synthesis

    cs.LG 2021-05 accept novelty 7.0

    Diffusion models with architecture improvements and classifier guidance achieve superior FID scores to GANs on unconditional and conditional ImageNet image synthesis.

  41. Megatron-LM: Training Multi-Billion Parameter Language Models Using Model Parallelism

    cs.CL 2019-09 unverdicted novelty 7.0

    Intra-layer model parallelism in PyTorch enables training of 8.3B-parameter transformers, achieving SOTA perplexity of 10.8 on WikiText103 and 66.5% accuracy on LAMBADA.

  42. SentencePiece: A simple and language independent subword tokenizer and detokenizer for Neural Text Processing

    cs.CL 2018-08 accept novelty 7.0

    SentencePiece trains subword models directly from raw text to enable language-independent neural text processing.

  43. Graph Attention Networks

    stat.ML 2017-10 accept novelty 7.0

    Graph Attention Networks compute learnable attention coefficients over node neighborhoods to produce weighted feature aggregations, achieving state-of-the-art results on citation networks and inductive protein-protein...

  44. Self-Distilled Trajectory-Aware Boltzmann Modeling: Bridging the Training-Inference Discrepancy in Diffusion Language Models

    cs.CL 2026-05 unverdicted novelty 6.0

    TABOM models inference unmasking preferences as a Boltzmann distribution over predictive entropies and derives a ranking loss to align DLM training with observed trajectories, yielding gains in new domains and reduced...

  45. ShardTensor: Domain Parallelism for Scientific Machine Learning

    cs.DC 2026-05 unverdicted novelty 6.0

    ShardTensor is a domain-parallelism system for SciML that enables flexible scaling of extreme-resolution spatial datasets by removing the constraint of batch size one per device.

  46. MambaNetBurst: Direct Byte-level Network Traffic Classification without Tokenization or Pretraining

    cs.CR 2026-05 unverdicted novelty 6.0

    A compact Mamba-2 model performs end-to-end byte-level network traffic classification without tokenization or pre-training and remains competitive with substantially larger pre-trained systems.

  47. Quantum Injection Pathways for Implicit Graph Neural Networks

    quant-ph 2026-05 unverdicted novelty 6.0

    Independent quantum signal injection into graph DEQs yields higher test accuracy and fewer solver iterations than state-dependent or backbone-dependent injection and classical equilibrium models on NCI1, PROTEINS, and...

  48. Investigating Anisotropy in Visual Grounding under Controlled Counterfactual Perturbations

    cs.CV 2026-05 unverdicted novelty 6.0

    Controlled counterfactual perturbations reveal no correlation between embedding cosine similarity and approximation behavior in two visual grounding models.

  49. Structured Recurrent Mixers for Massively Parallelized Sequence Generation

    cs.CL 2026-05 unverdicted novelty 6.0

    Structured Recurrent Mixers enable algebraic switching between parallel training and recurrent inference representations, delivering higher efficiency, information capacity, and throughput than other linear-complexity models.

  50. GRASP -- Graph-Based Anomaly Detection Through Self-Supervised Classification

    cs.CR 2026-05 unverdicted novelty 6.0

    GRASP detects anomalies in system provenance graphs via self-supervised executable prediction from two-hop neighborhoods, outperforming prior PIDS on DARPA datasets by identifying all documented attacks where behavior...

  51. Coupling Models for One-Step Discrete Generation

    cs.LG 2026-05 unverdicted novelty 6.0

    Coupling Models enable single-step discrete sequence generation via learned couplings to Gaussian latents and outperform prior one-step baselines on text perplexity, biological FBD, and image FID metrics.

  52. The Position Curse: LLMs Struggle to Locate the Last Few Items in a List

    cs.LG 2026-05 unverdicted novelty 6.0

    LLMs exhibit the Position Curse, with backward position retrieval in lists lagging far behind forward retrieval, showing only partial gains from PosBench fine-tuning.

  53. BRICKS: Compositional Neural Markov Kernels for Zero-Shot Radiation-Matter Simulation

    cs.LG 2026-05 unverdicted novelty 6.0

    BRICKS creates compositional neural Markov kernels via hybrid transformers and Riemannian Flow Matching on product manifolds to enable zero-shot simulation of radiation-matter interactions across arbitrary material di...

  54. Spark3R: Asymmetric Token Reduction Makes Fast Feed-Forward 3D Reconstruction

    cs.CV 2026-05 unverdicted novelty 6.0

    Asymmetric token reduction, with distinct merging for queries and pruning for key-values plus layer-wise adaptation, delivers up to 28x speedup on 1000-frame 3D reconstruction inputs while preserving competitive quality.

  55. Federation of Experts: Communication Efficient Distributed Inference for Large Language Models

    cs.LG 2026-05 unverdicted novelty 6.0

    FoE restructures MoE blocks into per-KV-head clusters with sum-based synchronization, removing all-to-all communication in single-node settings and limiting it to intra-node in multi-node settings for up to 5.2x faste...

  56. When AI Meets Science: Research Diversity, Interdisciplinarity, Visibility, and Retractions across Disciplines in a Global Surge

    cs.DL 2026-05 unverdicted novelty 6.0

    Post-2015 AI adoption in science grew exponentially across domains but stayed limited to CS-linked topics, carried citation premiums, higher retractions, and showed rising Asian middle-income country involvement.

  57. Decoding Alignment without Encoding Alignment: A critique of similarity analysis in neuroscience

    q-bio.NC 2026-05 unverdicted novelty 6.0

    Decoding alignment metrics can remain high and unchanged even when encoding manifold topology is causally altered, so they do not imply similar function or computation across neural populations.

  58. Piper: Efficient Large-Scale MoE Training via Resource Modeling and Pipelined Hybrid Parallelism

    cs.DC 2026-05 unverdicted novelty 6.0

    Piper introduces resource modeling and pipelined hybrid parallelism for MoE training, delivering 2-3.5X higher MFU than prior frameworks and 1.2-9X better all-to-all bandwidth.

Reference graph

Works this paper leans on

40 extracted references · 40 canonical work pages · cited by 156 Pith papers · 6 internal anchors

  1. [1]

    Layer Normalization

    Jimmy Lei Ba, Jamie Ryan Kiros, and Geoffrey E Hinton. Layer normalization. arXiv preprint arXiv:1607.06450, 2016

  2. [2]

    Neural Machine Translation by Jointly Learning to Align and Translate

    Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. Neural machine translation by jointly learning to align and translate. CoRR, abs/1409.0473, 2014

  3. [3]

    Massive Exploration of Neural Machine Translation Architectures

    Denny Britz, Anna Goldie, Minh-Thang Luong, and Quoc V. Le. Massive exploration of neural machine translation architectures. CoRR, abs/1703.03906, 2017

  4. [4]

    Long Short-Term Memory-Networks for Machine Reading

    Jianpeng Cheng, Li Dong, and Mirella Lapata. Long short-term memory-networks for machine reading. arXiv preprint arXiv:1601.06733, 2016

  5. [5]

    Learning Phrase Representations using RNN Encoder-Decoder for Statistical Machine Translation

    Kyunghyun Cho, Bart van Merrienboer, Caglar Gulcehre, Fethi Bougares, Holger Schwenk, and Yoshua Bengio. Learning phrase representations using rnn encoder-decoder for statistical machine translation. CoRR, abs/1406.1078, 2014

  6. [6]

    Xception: Deep Learning with Depthwise Separable Convolutions

    Francois Chollet. Xception: Deep learning with depthwise separable convolutions. arXiv preprint arXiv:1610.02357, 2016

  7. [7]

    Empirical Evaluation of Gated Recurrent Neural Networks on Sequence Modeling

    Junyoung Chung, Çaglar Gülçehre, Kyunghyun Cho, and Yoshua Bengio. Empirical evaluation of gated recurrent neural networks on sequence modeling. CoRR, abs/1412.3555, 2014

  8. [8]

    Recurrent Neural Network Grammars

    Chris Dyer, Adhiguna Kuncoro, Miguel Ballesteros, and Noah A. Smith. Recurrent neural network grammars. In Proc. of NAACL, 2016

  9. [9]

    Convolutional Sequence to Sequence Learning

    Jonas Gehring, Michael Auli, David Grangier, Denis Yarats, and Yann N. Dauphin. Convolutional sequence to sequence learning. arXiv preprint arXiv:1705.03122v2, 2017

  10. [10]

    Generating Sequences with Recurrent Neural Networks

    Alex Graves. Generating sequences with recurrent neural networks. arXiv preprint arXiv:1308.0850, 2013

  11. [11]

    Deep Residual Learning for Image Recognition

    Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 770–778, 2016

  12. [12]

    Gradient Flow in Recurrent Nets: The Difficulty of Learning Long-Term Dependencies

    Sepp Hochreiter, Yoshua Bengio, Paolo Frasconi, and Jürgen Schmidhuber. Gradient flow in recurrent nets: the difficulty of learning long-term dependencies, 2001

  13. [13]

    Long short-term memory

    Sepp Hochreiter and Jürgen Schmidhuber. Long short-term memory. Neural computation, 9(8):1735–1780, 1997

  14. [14]

    Self-training PCFG grammars with latent annotations across languages

    Zhongqiang Huang and Mary Harper. Self-training PCFG grammars with latent annotations across languages. In Proceedings of the 2009 Conference on Empirical Methods in Natural Language Processing, pages 832–841. ACL, August 2009

  15. [15]

    Exploring the limits of language modeling

    Rafal Jozefowicz, Oriol Vinyals, Mike Schuster, Noam Shazeer, and Yonghui Wu. Exploring the limits of language modeling. arXiv preprint arXiv:1602.02410, 2016

  16. [16]

    Can Active Memory Replace Attention?

    Łukasz Kaiser and Samy Bengio. Can active memory replace attention? In Advances in Neural Information Processing Systems, (NIPS), 2016

  17. [17]

    Neural GPUs learn algorithms

    Łukasz Kaiser and Ilya Sutskever. Neural GPUs learn algorithms. In International Conference on Learning Representations (ICLR), 2016

  18. [18]

    Neural Machine Translation in Linear Time

    Nal Kalchbrenner, Lasse Espeholt, Karen Simonyan, Aaron van den Oord, Alex Graves, and Koray Kavukcuoglu. Neural machine translation in linear time. arXiv preprint arXiv:1610.10099v2, 2017

  19. [19]

    Structured Attention Networks

    Yoon Kim, Carl Denton, Luong Hoang, and Alexander M. Rush. Structured attention networks. In International Conference on Learning Representations, 2017

  20. [20]

    Adam: A method for stochastic optimization

    Diederik Kingma and Jimmy Ba. Adam: A method for stochastic optimization. In ICLR, 2015

  21. [21]

    Factorization tricks for LSTM networks

    Oleksii Kuchaiev and Boris Ginsburg. Factorization tricks for LSTM networks. arXiv preprint arXiv:1703.10722, 2017

  22. [22]

    A Structured Self-Attentive Sentence Embedding

    Zhouhan Lin, Minwei Feng, Cicero Nogueira dos Santos, Mo Yu, Bing Xiang, Bowen Zhou, and Yoshua Bengio. A structured self-attentive sentence embedding. arXiv preprint arXiv:1703.03130, 2017

  23. [23]

    Multi-task Sequence to Sequence Learning

    Minh-Thang Luong, Quoc V. Le, Ilya Sutskever, Oriol Vinyals, and Lukasz Kaiser. Multi-task sequence to sequence learning. arXiv preprint arXiv:1511.06114, 2015

  24. [24]

    Effective Approaches to Attention-based Neural Machine Translation

    Minh-Thang Luong, Hieu Pham, and Christopher D Manning. Effective approaches to attention-based neural machine translation. arXiv preprint arXiv:1508.04025, 2015

  25. [25]

    Building a Large Annotated Corpus of English: The Penn Treebank

    Mitchell P Marcus, Mary Ann Marcinkiewicz, and Beatrice Santorini. Building a large annotated corpus of english: The penn treebank. Computational linguistics, 19(2):313–330, 1993

  26. [26]

    Effective self-training for parsing

    David McClosky, Eugene Charniak, and Mark Johnson. Effective self-training for parsing. In Proceedings of the Human Language Technology Conference of the NAACL, Main Conference, pages 152–159. ACL, June 2006

  27. [27]

    A decomposable attention model

    Ankur Parikh, Oscar Täckström, Dipanjan Das, and Jakob Uszkoreit. A decomposable attention model. In Empirical Methods in Natural Language Processing, 2016

  28. [28]

    A deep reinforced model for abstractive summarization

    Romain Paulus, Caiming Xiong, and Richard Socher. A deep reinforced model for abstractive summarization. arXiv preprint arXiv:1705.04304, 2017

  29. [29]

    Learning accurate, compact, and interpretable tree annotation

    Slav Petrov, Leon Barrett, Romain Thibaux, and Dan Klein. Learning accurate, compact, and interpretable tree annotation. In Proceedings of the 21st International Conference on Computational Linguistics and 44th Annual Meeting of the ACL, pages 433–440. ACL, July 2006

  30. [30]

    Using the output embedding to improve language models

    Ofir Press and Lior Wolf. Using the output embedding to improve language models. arXiv preprint arXiv:1608.05859, 2016

  31. [31]

    Neural Machine Translation of Rare Words with Subword Units

    Rico Sennrich, Barry Haddow, and Alexandra Birch. Neural machine translation of rare words with subword units. arXiv preprint arXiv:1508.07909, 2015

  32. [32]

    Outrageously Large Neural Networks: The Sparsely-Gated Mixture-of-Experts Layer

    Noam Shazeer, Azalia Mirhoseini, Krzysztof Maziarz, Andy Davis, Quoc Le, Geoffrey Hinton, and Jeff Dean. Outrageously large neural networks: The sparsely-gated mixture-of-experts layer. arXiv preprint arXiv:1701.06538, 2017

  33. [33]

    Dropout: a simple way to prevent neural networks from overfitting

    Nitish Srivastava, Geoffrey E Hinton, Alex Krizhevsky, Ilya Sutskever, and Ruslan Salakhutdinov. Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research, 15(1):1929–1958, 2014

  34. [34]

    End-to-end memory networks

    Sainbayar Sukhbaatar, Arthur Szlam, Jason Weston, and Rob Fergus. End-to-end memory networks. In C. Cortes, N. D. Lawrence, D. D. Lee, M. Sugiyama, and R. Garnett, editors, Advances in Neural Information Processing Systems 28, pages 2440–2448. Curran Associates, Inc., 2015

  35. [35]

    Sequence to sequence learning with neural networks

    Ilya Sutskever, Oriol Vinyals, and Quoc V. Le. Sequence to sequence learning with neural networks. In Advances in Neural Information Processing Systems, pages 3104–3112, 2014

  36. [36]

    Rethinking the inception architecture for computer vision

    Christian Szegedy, Vincent Vanhoucke, Sergey Ioffe, Jonathon Shlens, and Zbigniew Wojna. Rethinking the inception architecture for computer vision. CoRR, abs/1512.00567, 2015

  37. [37]

    Grammar as a foreign language

    Oriol Vinyals, Łukasz Kaiser, Terry Koo, Slav Petrov, Ilya Sutskever, and Geoffrey Hinton. Grammar as a foreign language. In Advances in Neural Information Processing Systems, 2015

  38. [38]

    Google's Neural Machine Translation System: Bridging the Gap between Human and Machine Translation

    Yonghui Wu, Mike Schuster, Zhifeng Chen, Quoc V Le, Mohammad Norouzi, Wolfgang Macherey, Maxim Krikun, Yuan Cao, Qin Gao, Klaus Macherey, et al. Google’s neural machine translation system: Bridging the gap between human and machine translation. arXiv preprint arXiv:1609.08144, 2016

  39. [39]

    Deep Recurrent Models with Fast-Forward Connections for Neural Machine Translation

    Jie Zhou, Ying Cao, Xuguang Wang, Peng Li, and Wei Xu. Deep recurrent models with fast-forward connections for neural machine translation. CoRR, abs/1606.04199, 2016

  40. [40]

    Fast and accurate shift-reduce constituent parsing

    Muhua Zhu, Yue Zhang, Wenliang Chen, Min Zhang, and Jingbo Zhu. Fast and accurate shift-reduce constituent parsing. In Proceedings of the 51st Annual Meeting of the ACL (Volume 1: Long Papers), pages 434–443. ACL, August 2013