Adaptive Computation Time for Recurrent Neural Networks
Recognition: 3 theorem links · Lean Theorem
Pith reviewed 2026-05-12 11:47 UTC · model grok-4.3
The pith
Recurrent neural networks learn to perform a variable number of internal steps before outputting by adding a differentiable halting probability at each step.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
By training a per-step sigmoid halting probability and weighting the output by the accumulated remaining probability mass, a recurrent network can adapt its effective depth to the input without any change to its core recurrence or loss function, producing substantially better results on tasks whose difficulty varies with sequence length or content.
What carries the argument
Adaptive Computation Time (ACT), a per-step halting probability computed from the hidden state that determines both when to stop and how to weight the final output across all steps taken.
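To make that weighting concrete, here is a minimal sketch of one ACT input step over a batch, written against PyTorch; the class name ACTStep, the threshold eps, the cap max_steps, and the returned (weighted_state, ponder) pair are illustrative choices rather than the paper's code, and the first-step binary flag described in the paper is omitted.

import torch
import torch.nn as nn

class ACTStep(nn.Module):
    # Sketch of one Adaptive Computation Time input step (hypothetical module;
    # `cell` is any recurrent cell mapping (input, hidden) -> hidden).
    def __init__(self, cell, hidden_size, eps=0.01, max_steps=100):
        super().__init__()
        self.cell = cell
        self.halt = nn.Linear(hidden_size, 1)  # halting unit: p = sigmoid(W h + b)
        self.eps = eps
        self.max_steps = max_steps

    def forward(self, x, h):
        batch = h.size(0)
        weighted_state = torch.zeros_like(h)
        cumulative_p = torch.zeros(batch, device=h.device)
        remainder = torch.ones(batch, device=h.device)
        ponder = torch.zeros(batch, device=h.device)

        for _ in range(self.max_steps):
            h = self.cell(x, h)
            p = torch.sigmoid(self.halt(h)).squeeze(-1)

            still_running = (cumulative_p < 1.0 - self.eps).float()
            halting_now = ((cumulative_p + p) >= 1.0 - self.eps).float() * still_running

            # Intermediate steps are weighted by p; the halting step gets the
            # remaining probability mass R(t); already-halted examples get weight 0.
            weight = p * still_running * (1.0 - halting_now) + remainder * halting_now
            weighted_state = weighted_state + weight.unsqueeze(-1) * h

            cumulative_p = cumulative_p + p * still_running
            remainder = remainder - p * still_running * (1.0 - halting_now)
            ponder = ponder + still_running  # counts N(t) steps per example

            if still_running.sum() == 0:
                break

        ponder = ponder + remainder  # rho_t = N(t) + R(t); training adds tau * sum_t rho_t
        return weighted_state, ponder

Applied at every input step of a sequence, weighted_state serves as both the state carried forward and the step's output, and the summed ponder values are what the ponder-cost penalty discourages.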
If this is right
- Networks can solve problems that require arbitrary numbers of steps using a fixed set of parameters.
- More computation is automatically allocated to longer or more complex inputs such as multi-digit addition.
- The same mechanism provides an unsupervised signal for locating natural boundaries in text or other sequences.
Where Pith is reading between the lines
- The same halting idea could be tested on tasks with continuous rather than discrete inputs to see whether step counts still adapt smoothly.
- Combining ACT with memory-augmented networks might allow variable-depth reasoning without exploding parameter counts.
- If the learned step counts align with human notions of difficulty, the method offers a quantitative probe for what makes a sequence hard.
Load-bearing premise
The halting probabilities can be trained end-to-end with ordinary backpropagation and remain stable without extra regularization or post-hoc fixes that would change the reported performance gains.
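As a small sanity check of that premise, the ACTStep sketch above differentiates end-to-end: a task loss plus a τ-weighted ponder term backpropagates through the halting unit with no extra machinery, since R(t) carries gradient while N(t) is a plain count. The sizes and τ = 0.01 below are arbitrary.

import torch
import torch.nn as nn

cell = nn.RNNCell(input_size=8, hidden_size=16)
step = ACTStep(cell, hidden_size=16)          # sketch defined above
readout = nn.Linear(16, 4)

x = torch.randn(32, 8)                        # one input step, batch of 32
h = torch.zeros(32, 16)
state, ponder = step(x, h)
loss = nn.functional.cross_entropy(readout(state), torch.randint(0, 4, (32,)))
(loss + 0.01 * ponder.mean()).backward()      # tau = 0.01 chosen arbitrarily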
What would settle it
Train an RNN equipped with ACT on the integer-addition task and check whether the number of steps taken fails to increase with the number of digits, or whether accuracy is no better than that of a fixed-step RNN given comparable total compute; either outcome would undercut the core claim.
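A minimal sketch of that check, assuming a trained ACT model and a fixed-step baseline already exist; act_model, fixed_model, and make_addition_batch are placeholder names, the ACT model is assumed to return per-example step counts alongside its logits, and the accuracy computation assumes a single classification target per example for brevity.

import torch

@torch.no_grad()
def addition_check(act_model, fixed_model, make_addition_batch, max_digits=5):
    # Hypothetical harness: act_model(inputs) -> (logits, steps); fixed_model
    # runs at the same step limit granted to ACT.
    mean_steps, acc = {}, {"act": {}, "fixed": {}}
    for digits in range(1, max_digits + 1):
        inputs, targets = make_addition_batch(num_digits=digits, batch_size=256)
        act_logits, steps = act_model(inputs)
        fixed_logits = fixed_model(inputs)

        mean_steps[digits] = steps.float().mean().item()
        acc["act"][digits] = (act_logits.argmax(-1) == targets).float().mean().item()
        acc["fixed"][digits] = (fixed_logits.argmax(-1) == targets).float().mean().item()

    for digits in sorted(mean_steps):
        print(f"{digits} digits: mean steps {mean_steps[digits]:.2f}, "
              f"ACT acc {acc['act'][digits]:.3f}, fixed acc {acc['fixed'][digits]:.3f}")
    return mean_steps, acc

The claim survives only if the mean step count grows with digit count and ACT accuracy beats the fixed-step baseline at comparable total compute.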
read the original abstract
This paper introduces Adaptive Computation Time (ACT), an algorithm that allows recurrent neural networks to learn how many computational steps to take between receiving an input and emitting an output. ACT requires minimal changes to the network architecture, is deterministic and differentiable, and does not add any noise to the parameter gradients. Experimental results are provided for four synthetic problems: determining the parity of binary vectors, applying binary logic operations, adding integers, and sorting real numbers. Overall, performance is dramatically improved by the use of ACT, which successfully adapts the number of computational steps to the requirements of the problem. We also present character-level language modelling results on the Hutter prize Wikipedia dataset. In this case ACT does not yield large gains in performance; however it does provide intriguing insight into the structure of the data, with more computation allocated to harder-to-predict transitions, such as spaces between words and ends of sentences. This suggests that ACT or other adaptive computation methods could provide a generic method for inferring segment boundaries in sequence data.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces Adaptive Computation Time (ACT), an algorithm enabling recurrent neural networks to learn the number of computational steps to perform between receiving an input and producing an output. ACT requires only minimal architectural changes, remains fully deterministic and differentiable, and introduces no additional noise into the parameter gradients. The method is evaluated on four synthetic tasks (parity of binary vectors, binary logic operations, integer addition, and sorting real numbers) where it yields dramatic performance gains by adapting computation depth to problem difficulty, as well as on character-level language modeling on the Hutter Prize Wikipedia dataset where gains are more modest but the model allocates more steps to harder transitions such as word boundaries and sentence ends.
Significance. If the central results hold, the work is significant because it supplies a generic, noise-free mechanism for variable-depth computation inside RNNs. The synthetic-task experiments demonstrate clear adaptation and accuracy improvements while the language-modeling results, though weaker, illustrate a practical side benefit of inferring segment boundaries. The explicit ponder-cost regularizer and continuous relaxation over halting probabilities directly address stability concerns that have historically plagued adaptive-computation approaches.
major comments (1)
- [Experimental results (synthetic tasks)] The abstract and experimental discussion state that ACT 'dramatically improves' performance on the synthetic tasks, yet the manuscript does not report the precise baseline RNN depths or the distribution of halting steps per example; without these numbers it is difficult to quantify how much of the gain is due to adaptation versus simply allowing more total computation.
minor comments (2)
- [Method description] The ponder-cost weight is described as a hyper-parameter; a short sensitivity analysis or recommended default range would help readers reproduce the reported behavior.
- [Language modeling experiments] In the language-modeling section the paper notes that more computation is allocated to spaces and sentence ends; a quantitative plot of average steps versus token type would strengthen this observation.
Simulated Author's Rebuttal
We thank the referee for the positive review and recommendation for minor revision. We appreciate the constructive feedback on the experimental presentation and will incorporate the requested clarifications.
read point-by-point responses
-
Referee: [Experimental results (synthetic tasks)] The abstract and experimental discussion state that ACT 'dramatically improves' performance on the synthetic tasks, yet the manuscript does not report the precise baseline RNN depths or the distribution of halting steps per example; without these numbers it is difficult to quantify how much of the gain is due to adaptation versus simply allowing more total computation.
Authors: We agree that reporting the exact baseline RNN depths and the distribution of halting steps would strengthen the experimental section and help readers assess the contribution of adaptation. In the revised manuscript we will specify the fixed depths used for each baseline RNN (chosen to equal the maximum step limit permitted for the corresponding ACT model) and add tables or figures showing the per-task average number of steps together with the distribution of halting steps across examples. This will make the source of the reported gains explicit.
Revision: yes
Circularity Check
No significant circularity; ACT mechanism defined independently of fitted targets
full rationale
The paper defines Adaptive Computation Time via explicit equations for the halting unit, cumulative probability, ponder cost regularizer (with explicit hyperparameter), and continuous relaxation for differentiability. These constructs are introduced as architectural additions to standard RNNs and trained end-to-end on task loss plus ponder cost; no step reduces the claimed adaptation benefit to a tautological fit or self-citation. Experimental results on parity, logic, addition, sorting, and language modeling serve as external validation rather than internal redefinition. The derivation chain remains self-contained against the stated assumptions and does not invoke load-bearing prior work by the same author to force uniqueness or smuggle an ansatz.
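For reference, the quantities the rationale names, restated in the paper's notation (ε is the small halting threshold, τ the ponder-cost weight); this is a from-memory restatement that omits the step limit M and the first-step flag, so consult the paper for the exact equations.

\begin{align*}
p_t^n &= \sigma\bigl(W_p h_t^n + b_p\bigr) && \text{(halting unit)} \\
N(t) &= \min\Bigl\{\, n' : \textstyle\sum_{n=1}^{n'} p_t^n \ge 1 - \epsilon \Bigr\},
  & R(t) &= 1 - \textstyle\sum_{n=1}^{N(t)-1} p_t^n \\
s_t &= \textstyle\sum_{n=1}^{N(t)} \tilde p_t^n \, h_t^n,
  & \tilde p_t^n &= \begin{cases} p_t^n & n < N(t) \\ R(t) & n = N(t) \end{cases} \\
\rho_t &= N(t) + R(t),
  & \widehat{\mathcal{L}}(x, y) &= \mathcal{L}(x, y) + \tau \textstyle\sum_t \rho_t
\end{align*}

τ here is the single free parameter listed in the ledger below.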
Axiom & Free-Parameter Ledger
free parameters (1)
- ponder cost weight
axioms (1)
- standard math: The halting probabilities remain differentiable and produce valid convex combinations of states.
Lean theorems connected to this paper
-
Foundation.LedgerForcing.conservation_from_balance · unclear · the ponder cost ρ_t = N(t) + R(t) ... τP(x)
Forward citations
Cited by 31 Pith papers
-
The Coupling Tax: How Shared Token Budgets Undermine Visible Chain-of-Thought Under Fixed Output Limits
Shared token budgets between visible chain-of-thought and answers create a coupling tax that makes non-thinking competitive on math benchmarks, with a truncation decomposition predicting the crossover and split budget...
-
Stability and Generalization in Looped Transformers
Looped transformers with recall and outer normalization produce reachable, input-dependent fixed points with stable gradients, enabling generalization, while those without recall cannot; a new internal recall variant ...
-
Grokking: Generalization Beyond Overfitting on Small Algorithmic Datasets
Neural networks exhibit grokking on small algorithmic datasets, achieving perfect generalization well after overfitting.
-
Show Your Work: Scratchpads for Intermediate Computation with Language Models
Training language models to generate intermediate computation steps on a scratchpad enables them to perform multi-step tasks such as long addition and arbitrary program execution that they otherwise fail at.
-
LeapTS: Rethinking Time Series Forecasting as Adaptive Multi-Horizon Scheduling
LeapTS reformulates forecasting as adaptive multi-horizon scheduling via hierarchical control and NCDEs, delivering at least 7.4% better performance and 2.6-5.3x faster inference than Transformer baselines while adapt...
-
Muninn: Your Trajectory Diffusion Model But Faster
Muninn accelerates diffusion trajectory planners up to 4.6x by spending an uncertainty budget to decide when to cache denoiser outputs, preserving performance and certifying bounded deviation from full computation.
-
LoopVLA: Learning Sufficiency in Recurrent Refinement for Vision-Language-Action Models
LoopVLA adds recurrent refinement and learned sufficiency estimation to VLA models, cutting parameters 45% and raising throughput 1.7x while matching baseline task success on LIBERO and VLA-Arena.
-
Scratchpad Patching: Decoupling Compute from Patch Size in Byte-Level Language Models
Scratchpad Patching decouples compute from patch size in byte-level language models by inserting entropy-triggered scratchpads to update patch context dynamically.
-
LoopCTR: Unlocking the Loop Scaling Power for Click-Through Rate Prediction
LoopCTR trains CTR models with recursive layer reuse and process supervision so that zero-loop inference outperforms baselines on public and industrial datasets.
-
Match-Any-Events: Zero-Shot Motion-Robust Feature Matching Across Wide Baselines for Event Cameras
A single attention-based model trained on synthetic wide-baseline event data achieves zero-shot feature matching across unseen datasets with a reported 37.7% improvement over prior event matching methods.
-
Depth Adaptive Efficient Visual Autoregressive Modeling
DepthVAR adaptively allocates per-token computational depth in VAR models using a cyclic rotated scheduler and dynamic layer masking to achieve 2.3-3.1x inference speedup with minimal quality loss.
-
A Mechanistic Analysis of Looped Reasoning Language Models
Looped LLMs converge to distinct cyclic fixed points per layer, repeating feedforward-style inference stages across recurrences.
-
Scaling up Test-Time Compute with Latent Reasoning: A Recurrent Depth Approach
A recurrent-depth architecture enables language models to improve reasoning performance by iterating computation in latent space, achieving gains equivalent to much larger models on benchmarks.
-
Elastic Attention Cores for Scalable Vision Transformers
VECA learns effective visual representations using core-periphery attention where patches interact exclusively via a resolution-invariant set of learned core embeddings, achieving linear O(N) complexity while maintain...
-
Gated Subspace Inference for Transformer Acceleration
Gated Subspace Inference accelerates transformer linear layers 3-10x via low-rank cached subspace computation and per-token gating to skip residuals while preserving output distribution to high accuracy.
-
LEAP: Layer-wise Exit-Aware Pretraining for Efficient Transformer Inference
LEAP adds a layer-wise exit-aware constraint to standard distillation, reconciling it with early-exit mechanisms and delivering 1.61x wall-clock speedup on MiniLM at 0.95 threshold with 91.9% early exits by layer 7.
-
State Stream Transformer (SST) V2: Parallel Training of Nonlinear Recurrence for Latent Space Reasoning
SST V2 introduces parallel-trainable nonlinear recurrence in latent space to let transformers reason continuously across positions, delivering +15 points on GPQA-Diamond and halving remaining GSM8K errors over matched...
-
Do Not Imitate, Reinforce: Iterative Classification via Belief Refinement
RIC replaces single-pass label imitation with RL-driven iterative belief refinement, recovering cross-entropy optima while enabling adaptive halting via a value function.
-
Universal Transformers Need Memory: Depth-State Trade-offs in Adaptive Recursive Reasoning
Memory tokens are required for non-trivial performance in adaptive Universal Transformers on Sudoku-Extreme, with 8-32 tokens yielding stable 57% exact-match accuracy while trading off against ponder depth.
-
Dispatch-Aware Ragged Attention for Pruned Vision Transformers
A new Triton kernel for dispatch-aware ragged attention delivers 1.88-2.51× end-to-end throughput gains over standard padded attention and 9-12% over FlashAttention-2 varlen in pruned ViTs by lowering dispatch floor to ~24μs.
-
Dispatch-Aware Ragged Attention for Pruned Vision Transformers
A lightweight bidirectional Triton ragged-attention kernel lowers dispatch overhead, turning token pruning into real wall-clock gains of up to 2.24x across four pruning methods and DeiT models with under 0.007 logit d...
-
Adaptive Test-Time Compute Allocation for Reasoning LLMs via Constrained Policy Optimization
A Lagrangian-relaxation plus imitation-learning pipeline adaptively allocates test-time compute to LLMs, outperforming uniform baselines by up to 12.8% relative accuracy on MATH while staying within a fixed average budget.
-
Relational Preference Encoding in Looped Transformer Internal States
Looped transformer hidden states encode preferences relationally via pairwise differences rather than independent pointwise classification, with the evaluator acting as an internal consistency probe on the model's own...
-
Emergent Abilities of Large Language Models
Emergent abilities are capabilities present in large language models but absent in smaller ones and cannot be predicted by extrapolating smaller model performance.
-
Dynamic Pondering Sparsity-aware Mixture-of-Experts Transformer for Event Stream based Visual Object Tracking
A three-stage ViT with sparsity-aware MoE and adaptive inference depth delivers improved accuracy-efficiency trade-off for event-stream visual tracking on FE240hz, COESOT, and EventVOT benchmarks.
-
Budgeted Attention Allocation: Cost-Conditioned Compute Control for Efficient Transformers
A monotone head-gating mechanism conditions transformer attention on a budget, enabling one checkpoint to trade attention cost for accuracy and produce measured CPU speedups.
-
Revisiting Auxiliary Losses for Conditional Depth Routing: An Empirical Study
Removing utility regression and rank supervision auxiliary losses improves language modeling performance and training efficiency for conditional depth routing gates, and eliminates the advantage of a more complex JEPA...
-
Adaptive Computation Depth via Learned Token Routing in Transformers
TSA adds end-to-end differentiable per-token halting gates to transformers, enabling learned adaptive depth that saves 14-23% token-layer operations with under 0.5% quality loss on language modeling.
-
Galactica: A Large Language Model for Science
Galactica, a science-specialized LLM, reports higher scores than GPT-3, Chinchilla, and PaLM on LaTeX knowledge, mathematical reasoning, and medical QA benchmarks while outperforming general models on BIG-bench.
-
RD-ViT: Recurrent-Depth Vision Transformer for Semantic Segmentation with Reduced Data Dependence Extending the Recurrent-Depth Transformer Architecture to Dense Prediction
RD-ViT matches or exceeds standard ViT segmentation accuracy on cardiac MRI using a shared recurrent block, fewer parameters, and less training data.
-
ITS-Mina: A Harris Hawks Optimization-Based All-MLP Framework with Iterative Refinement and External Attention for Multivariate Time Series Forecasting
ITS-Mina introduces an all-MLP model with iterative refinement, external attention via learnable memory units, and HHO-tuned dropout that reports state-of-the-art or competitive results on six multivariate time series...
Reference graph
Works this paper leans on
-
[1]
G. An. The effects of adding noise during backpropagation training on a generalization performance. Neural Computation, 8(3):643–674, 1996
work page 1996
-
[2]
Neural Machine Translation by Jointly Learning to Align and Translate
D. Bahdanau, K. Cho, and Y. Bengio. Neural machine translation by jointly learning to align and translate. arXiv preprint arXiv:1409.0473, 2014
work page · internal anchor · Pith review · Pith/arXiv · arXiv · 2014
-
[3]
E. Bengio, P.-L. Bacon, J. Pineau, and D. Precup. Conditional computation in neural networks for faster models. arXiv preprint arXiv:1511.06297, 2015
- [4]
-
[5]
G. Dahl, D. Yu, L. Deng, and A. Acero. Context-dependent pre-trained deep neural networks for large-vocabulary speech recognition. IEEE Transactions on Audio, Speech, and Language Processing, 20(1):30–42, Jan. 2012
work page 2012
-
[6]
L. Denoyer and P. Gallinari. Deep sequential neural network. arXiv preprint arXiv:1410.0510, 2014
- [7]
- [8]
- [9]
-
[10]
A. Graves, G. Wayne, and I. Danihelka. Neural turing machines. arXiv preprint arXiv:1410.5401, 2014
work page · internal anchor · Pith review · arXiv · 2014
-
[11]
E. Grefenstette, K. M. Hermann, M. Suleyman, and P. Blunsom. Learning to transduce with unbounded memory. In Advances in Neural Information Processing Systems, pages 1819–1827, 2015
work page 2015
-
[12]
Draw: A recurrent neural network for image generation
K. Gregor, I. Danihelka, A. Graves, and D. Wierstra. Draw: A recurrent neural network for image generation. arXiv preprint arXiv:1502.04623, 2015
-
[13]
S. Hochreiter, Y. Bengio, P. Frasconi, and J. Schmidhuber. Gradient flow in recurrent nets: the difficulty of learning long-term dependencies, 2001
work page 2001
-
[14]
S. Hochreiter and J. Schmidhuber. Long short-term memory. Neural computation, 9(8):1735–1780, 1997
work page 1997
-
[15]
M. Hutter. Universal artificial intelligence. Springer, 2005
work page 2005
-
[16]
M. A. Just, P. A. Carpenter, and J. D. Woolley. Paradigms and processes in reading comprehension. Journal of experimental psychology: General, 111(2):228, 1982
work page 1982
-
[17]
N. Kalchbrenner, I. Danihelka, and A. Graves. Grid long short-term memory. arXiv preprint arXiv:1507.01526, 2015
-
[18]
Adam: A Method for Stochastic Optimization
D. Kingma and J. Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014
work page · internal anchor · Pith review · Pith/arXiv · arXiv · 2014
-
[19]
A. Krizhevsky, I. Sutskever, and G. E. Hinton. Imagenet classification with deep convolutional neural networks. In Advances in neural information processing systems, pages 1097–1105, 2012
work page 2012
- [20]
- [21]
-
[22]
T. Mikolov, I. Sutskever, K. Chen, G. S. Corrado, and J. Dean. Distributed representations of words and phrases and their compositionality. In Advances in neural information processing systems, pages 3111–3119, 2013
work page 2013
-
[23]
B. A. Olshausen et al. Emergence of simple-cell receptive field properties by learning a sparse code for natural images. Nature, 381(6583):607–609, 1996
work page 1996
- [24]
-
[25]
Neural programmer-interpreters
S. Reed and N. de Freitas. Neural programmer-interpreters. Technical Report arXiv:1511.06279, 2015
-
[26]
J. Schmidhuber. Self-delimiting neural networks. arXiv preprint arXiv:1210.0118, 2012
-
[27]
J. Schmidhuber and S. Hochreiter. Guessing can outperform many long time lag algorithms. Technical report, 1996
work page 1996
-
[28]
N. Srivastava, G. Hinton, A. Krizhevsky, I. Sutskever, and R. Salakhutdinov. Dropout: A simple way to prevent neural networks from overfitting. The Journal of Machine Learning Research, 15(1):1929–1958, 2014
work page 2014
-
[29]
R. K. Srivastava, K. Greff, and J. Schmidhuber. Training very deep networks. In Advances in Neural Information Processing Systems, pages 2368–2376, 2015
work page 2015
-
[30]
R. K. Srivastava, B. R. Steunebrink, and J. Schmidhuber. First experiments with powerplay. Neural Networks, 41:130–136, 2013
work page 2013
-
[31]
S. Sukhbaatar, J. Weston, R. Fergus, et al. End-to-end memory networks. In Advances in Neural Information Processing Systems, pages 2431–2439, 2015
work page 2015
-
[32]
Sequence to Sequence Learning with Neural Networks
I. Sutskever, O. Vinyals, and Q. V. Le. Sequence to sequence learning with neural networks. arXiv preprint arXiv:1409.3215, 2014
-
[33]
Order matters: Sequence to sequence for sets
O. Vinyals, S. Bengio, and M. Kudlur. Order matters: Sequence to sequence for sets. arXiv preprint arXiv:1511.06391, 2015
-
[34]
O. Vinyals, M. Fortunato, and N. Jaitly. Pointer networks. In Advances in Neural Information Processing Systems, pages 2674–2682, 2015
work page 2015
-
[35]
A. J. Wiles. Modular elliptic curves and Fermat's last theorem. Annals of Mathematics, 141(3):443–551, 1995
work page 1995
-
[36]
R. J. Williams and D. Zipser. Gradient-based learning algorithms for recurrent networks and their computational complexity. Back-propagation: Theory, architectures and applications, pages 433–486, 1995
work page 1995
discussion (0)