pith. machine review for the scientific record.

arxiv: 1604.06174 · v2 · submitted 2016-04-21 · 💻 cs.LG


Training Deep Nets with Sublinear Memory Cost

Tianqi Chen, Bing Xu, Chiyuan Zhang, Carlos Guestrin

Pith reviewed 2026-05-12 03:37 UTC · model grok-4.3

classification 💻 cs.LG
keywords deep neural network training · memory optimization · checkpointing · computation graph analysis · sublinear memory · residual networks · recurrent neural networks · GPU memory reduction

The pith

An algorithm trains an n-layer deep network using O(sqrt(n)) memory at the cost of one extra forward pass.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces a method to train an n-layer deep network while holding only O(sqrt(n)) activations in memory. It stores checkpoints at regular intervals along the computation graph and recomputes the missing intermediate activations on the fly during the backward pass. A sympathetic reader would care because many state-of-the-art models are limited by GPU memory, and this allows deeper and more complex models without additional hardware. The approach also uses computation graph analysis for automatic in-place operations and memory sharing. Experiments show large reductions, such as training a 1000-layer residual network in a fraction of the usual memory.
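The checkpoint-and-recompute loop is easy to see on a toy chain of scalar tanh layers (an illustrative sketch, not the paper's MXNet implementation; `grad_full` and `grad_checkpointed` are names invented here):

```python
import math

def grad_full(x, weights):
    """Full-storage backprop: keep all n intermediate activations (O(n) memory)."""
    acts = [x]
    for w in weights:                        # layer i: y = tanh(w_i * x)
        acts.append(math.tanh(w * acts[-1]))
    g = 1.0                                  # loss = final activation
    for i in reversed(range(len(weights))):
        y = acts[i + 1]
        g *= weights[i] * (1.0 - y * y)      # d tanh(w*x)/dx = w * (1 - tanh(w*x)^2)
    return g

def grad_checkpointed(x, weights, k):
    """Checkpointed backprop: store only every k-th activation, recompute the
    rest segment by segment during the backward pass (one extra forward total).
    With k = sqrt(n), peak storage is ~2*sqrt(n) activations instead of n."""
    n = len(weights)
    ckpts, a = {0: x}, x
    for i, w in enumerate(weights):          # forward pass, boundaries only
        a = math.tanh(w * a)
        if (i + 1) % k == 0:
            ckpts[i + 1] = a
    g = 1.0
    for start in range(((n - 1) // k) * k, -1, -k):  # segments, last to first
        end = min(start + k, n)
        acts = [ckpts[start]]                # recompute inside this segment
        for i in range(start, end):
            acts.append(math.tanh(weights[i] * acts[-1]))
        for i in reversed(range(start, end)):
            y = acts[i - start + 1]
            g *= weights[i] * (1.0 - y * y)
    return g
```

For any k the two routines return the same gradient; k near sqrt(n) balances the n/k stored checkpoints against the k recomputed activations live at once.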

Core claim

We design an algorithm that costs O(sqrt(n)) memory to train an n-layer network, with only the computational cost of an extra forward pass per mini-batch. As many of the state-of-the-art models hit the upper bound of the GPU memory, our algorithm allows deeper and more complex models to be explored. We focus on reducing the memory cost to store the intermediate feature maps and gradients during training. Computation graph analysis is used for automatic in-place operation and memory sharing optimizations. We show that it is possible to trade computation for memory, giving a more memory-efficient training algorithm with a little extra computation cost. In the extreme case, our analysis also shows that the memory consumption can be reduced to O(log n) with as little as O(n log n) extra cost for forward computation.

What carries the argument

The checkpointing strategy that segments the computation graph into sqrt(n) intervals, storing activations only at boundaries and recomputing forwards inside each interval during backpropagation.
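The sqrt(n) segment length is the solution of a one-line trade-off; paraphrasing the paper's accounting in our own notation:

```latex
% Keep a checkpoint every k layers: n/k checkpoints are stored across the whole
% pass, plus at most k recomputed activations for the segment being processed.
\text{memory}(k) \;=\; \frac{n}{k} + k,
\qquad
\arg\min_k \left(\frac{n}{k} + k\right) \;=\; \sqrt{n},
\qquad
\text{memory}(\sqrt{n}) \;=\; 2\sqrt{n} \;=\; O(\sqrt{n}).
```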

If this is right

  • A 1000-layer residual network trains with memory reduced from 48G to 7G and only 30 percent extra running time on ImageNet.
  • Complex recurrent neural networks become trainable on very long sequences with substantially lower memory.
  • State-of-the-art models no longer hit GPU memory limits as quickly, enabling exploration of deeper architectures.
  • An extreme variant reduces memory to O(log n) at the cost of O(n log n) extra forward computation.
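The extreme O(log n) variant in the last bullet can be sketched as a divide-and-conquer recursion on the same kind of toy scalar chain (names and setup invented for illustration; the paper's general algorithm works on arbitrary computation graphs):

```python
import math

def grad_full(x, weights):
    # Reference: full-storage backprop through y_i = tanh(w_i * y_{i-1}).
    acts = [x]
    for w in weights:
        acts.append(math.tanh(w * acts[-1]))
    g = 1.0
    for i, w in enumerate(weights):
        g *= w * (1.0 - acts[i + 1] ** 2)
    return g

def grad_logn(x, weights, lo=0, hi=None):
    # O(log n)-memory variant: recompute to the midpoint, recurse on the right
    # half, then on the left half. Stack depth is O(log n) with O(1) state per
    # frame, at the price of O(n log n) total recomputed forward work.
    if hi is None:
        hi = len(weights)
    if hi - lo == 1:
        y = math.tanh(weights[lo] * x)
        return weights[lo] * (1.0 - y * y)
    mid = (lo + hi) // 2
    a = x
    for i in range(lo, mid):            # recompute up to the midpoint
        a = math.tanh(weights[i] * a)
    return grad_logn(x, weights, lo, mid) * grad_logn(a, weights, mid, hi)
```

Each level of the recursion redoes at most n layer-forwards, and there are about log2(n) levels, matching the O(n log n) extra-compute bound.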

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • The approach could lower hardware barriers for training large models and make advanced deep learning more accessible on modest GPUs.
  • Adaptive checkpoint intervals based on per-layer compute cost might improve the compute-memory trade-off further.
  • The method pairs naturally with model parallelism to scale to even larger networks without changing the core algorithm.
  • Systems with high compute throughput relative to memory bandwidth would see the smallest effective overhead from the extra forward passes.

Load-bearing premise

The computation graph can be cleanly segmented into sqrt(n) intervals where recomputing forward passes inside each interval is both correct and cheaper than storing all intermediate activations.

What would settle it

Running the algorithm on a 1000-layer residual network and measuring whether peak memory usage scales as O(sqrt(n)), total runtime increases by about 30 percent, and the resulting gradients match those from full-storage training.
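As a back-of-envelope companion to that experiment, the predicted peak activation count can be tabulated directly (a sketch assuming only boundary checkpoints plus one segment's recomputed activations are ever resident; `peak_activations` is a name invented here):

```python
import math

def peak_activations(n, k):
    # Activations simultaneously resident under k-interval checkpointing:
    # ceil(n/k) + 1 boundary checkpoints, plus up to k + 1 recomputed
    # activations for the segment currently being backpropagated.
    return (math.ceil(n / k) + 1) + (k + 1)

# With k = sqrt(n), peak storage grows like 2*sqrt(n) rather than n.
for n in (100, 10_000, 1_000_000):
    k = int(math.sqrt(n))
    print(n, peak_activations(n, k))
```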

read the original abstract

We propose a systematic approach to reduce the memory consumption of deep neural network training. Specifically, we design an algorithm that costs O(sqrt(n)) memory to train a n layer network, with only the computational cost of an extra forward pass per mini-batch. As many of the state-of-the-art models hit the upper bound of the GPU memory, our algorithm allows deeper and more complex models to be explored, and helps advance the innovations in deep learning research. We focus on reducing the memory cost to store the intermediate feature maps and gradients during training. Computation graph analysis is used for automatic in-place operation and memory sharing optimizations. We show that it is possible to trade computation for memory - giving a more memory efficient training algorithm with a little extra computation cost. In the extreme case, our analysis also shows that the memory consumption can be reduced to O(log n) with as little as O(n log n) extra cost for forward computation. Our experiments show that we can reduce the memory cost of a 1,000-layer deep residual network from 48G to 7G with only 30 percent additional running time cost on ImageNet problems. Similarly, significant memory cost reduction is observed in training complex recurrent neural networks on very long sequences.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

0 major / 2 minor

Summary. The manuscript presents an algorithm to train deep neural networks with O(sqrt(n)) memory cost for an n-layer network, incurring only the cost of one extra forward pass per mini-batch. This is achieved through computation graph analysis, segmenting the network into intervals, storing boundary activations, and recomputing forward passes within segments during backpropagation. The approach is extended to O(log n) memory with O(n log n) extra computation, and validated on ImageNet with a 1000-layer ResNet (48G to 7G memory) and long-sequence RNNs.

Significance. If the claims hold, this is a significant contribution to deep learning training efficiency, allowing exploration of deeper models on memory-constrained hardware like GPUs. The systematic use of DAG properties for memory optimization, combined with empirical validation showing memory reduction with modest time overhead and correct gradients, provides a practical tool for advancing DL research. The parameter-free derivation from standard graph segmentation is a strength.

minor comments (2)
  1. [Abstract] Abstract: the O(sqrt(n)) claim would be clearer if it explicitly stated the segmentation assumption (clean intervals where recomputation is correct and cheaper than storing all activations) that underpins the bound.
  2. [Experiments] The 30% extra time cost for the 1000-layer ResNet is reported, but a per-component breakdown (recomputation vs. original forward/backward) would make the compute-memory trade-off more transparent.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the positive and accurate summary of our work and the assessment of its significance. No major comments requiring response or revision were raised.

read point-by-point responses
  1. Referee: No specific major comments were listed in the report.

    Authors: We appreciate the referee's recognition that the algorithm provides a systematic, parameter-free approach to memory reduction via graph segmentation and recomputation, with empirical validation on large models. The description of the O(sqrt(n)) memory bound, the O(log n) extension, and the ImageNet/ResNet and RNN experiments matches our claims exactly. revision: no

Circularity Check

0 steps flagged

No significant circularity; derivation is self-contained

full rationale

The O(sqrt(n)) memory bound is obtained by partitioning the n-layer computation DAG into sqrt(n) segments, retaining only the sqrt(n) boundary activations, and performing one recomputation of each segment during back-propagation; the total extra work equals one forward pass by direct operation counting on the graph. This counting argument relies only on standard properties of feed-forward and recurrent DAGs plus the in-place/memory-sharing optimizations described in the paper; no parameters are fitted to data, no result is defined in terms of itself, and no load-bearing step reduces to a self-citation. The reported ImageNet and RNN experiments serve as empirical confirmation rather than definitional inputs.
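Written out in symbols (our notation, following the rationale above):

```latex
% The backward pass recomputes each of the n/k segments exactly once, k layers each:
\text{extra forward work} \;=\; \frac{n}{k}\cdot k \;=\; n\ \text{layer-forwards}
\;=\; \text{one full forward pass, independent of } k.
% Recursing on segments instead of storing them trades further:
% depth \log_2 n, with at most n layer-forwards redone per level,
\Rightarrow\ O(\log n)\ \text{memory at}\ O(n \log n)\ \text{extra forward cost}.
```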

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The paper introduces no new free parameters, axioms beyond standard DAG properties, or invented entities; the contribution is purely algorithmic.

axioms (1)
  • standard math The forward computation graph is a directed acyclic graph whose nodes correspond to layer activations.
    Invoked when analyzing memory storage and recomputation segments.

pith-pipeline@v0.9.0 · 5519 in / 1076 out tokens · 40927 ms · 2026-05-12T03:37:54.105843+00:00 · methodology

discussion (0)


Forward citations

Cited by 40 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Efficient Training on Multiple Consumer GPUs with RoundPipe

    cs.DC 2026-04 conditional novelty 8.0

    RoundPipe achieves near-zero-bubble pipeline parallelism for LLM training on consumer GPUs by dynamically dispatching computation stages round-robin, yielding 1.48-2.16x speedups and enabling 235B model fine-tuning on...

  2. Efficient and provably convergent end-to-end training of deep neural networks with linear constraints

    math.OC 2026-05 unverdicted novelty 7.0

    An efficiently computable HS-Jacobian acts as a conservative mapping for projections onto polyhedral sets, supporting provably convergent Adam-based end-to-end training of linearly constrained deep neural networks.

  3. Locking Pretrained Weights via Deep Low-Rank Residual Distillation

    cs.LG 2026-05 unverdicted novelty 7.0

    DLR-Lock locks open-weight LLMs against unauthorized fine-tuning by swapping MLPs for deep low-rank residual networks that inflate backprop memory and complicate optimization, yet preserve original capabilities via mo...

  4. Finite Volume-Informed Neural Network Framework for 2D Shallow Water Equations: Rugged Loss Landscapes and the Importance of Data Guidance

    cs.LG 2026-05 unverdicted novelty 7.0

    Data-guided finite-volume PINNs for 2D shallow water equations avoid trivial low-momentum collapse via sparse measurements, achieving up to 22x error reduction on benchmarks and accurate surrogates on real river data.

  5. Beyond Bag-of-Patches: Learning Global Layout via Textual Supervision for Late-Interaction Visual Document Retrieval

    cs.CV 2026-05 unverdicted novelty 7.0

    A text-supervised global layout embedding augments local patch representations in late-interaction VDR, yielding +2.4 nDCG@5 and +2.3 MAP@5 gains over ColPali/ColQwen baselines on ViDoRe-v2.

  6. ADELIA: Automatic Differentiation for Efficient Laplace Inference Approximations

    cs.DC 2026-05 conditional novelty 7.0

    ADELIA is the first AD-enabled INLA system that computes exact hyperparameter gradients via a structure-exploiting multi-GPU backward pass, delivering 4.2-7.9x per-gradient speedups and 5-8x better energy efficiency t...

  7. Cascaded Code Editing: Large-Small Model Collaboration for Effective and Efficient Code Editing

    cs.SE 2026-04 unverdicted novelty 7.0

    A cascaded large-small model system generates edit sketches with the large model and applies them with the small model to make code editing both accurate and token-efficient.

  8. Sparse Prefix Caching for Hybrid and Recurrent LLM Serving

    cs.LG 2026-04 unverdicted novelty 7.0

    Sparse prefix caching via dynamic programming for optimal checkpoint placement under overlap distributions improves the Pareto frontier for recurrent and hybrid LLM serving on shared-prefix data.

  9. Pareto-Optimal Offline Reinforcement Learning via Smooth Tchebysheff Scalarization

    cs.LG 2026-04 unverdicted novelty 7.0

    STOMP extends direct preference optimization to the multi-objective setting via smooth Tchebysheff scalarization and standardization of observed rewards, achieving highest hypervolume in eight of nine protein engineer...

  10. Actions Speak Louder than Words: Trillion-Parameter Sequential Transducers for Generative Recommendations

    cs.LG 2024-02 unverdicted novelty 7.0

    HSTU-based generative recommenders with 1.5 trillion parameters scale as a power law with compute up to GPT-3 scale, outperform baselines by up to 65.8% NDCG, run 5-15x faster than FlashAttention2 on long sequences, a...

  11. M3-Embedding: Multi-Linguality, Multi-Functionality, Multi-Granularity Text Embeddings Through Self-Knowledge Distillation

    cs.CL 2024-02 unverdicted novelty 7.0

    M3-Embedding is a single model for multi-lingual, multi-functional, and multi-granular text embeddings trained via self-knowledge distillation that achieves new state-of-the-art results on multilingual, cross-lingual,...

  12. Ring Attention with Blockwise Transformers for Near-Infinite Context

    cs.CL 2023-10 unverdicted novelty 7.0

    Ring Attention uses blockwise computation and ring communication to let Transformers process sequences up to device-count times longer than prior memory-efficient methods.

  13. Efficient Memory Management for Large Language Model Serving with PagedAttention

    cs.LG 2023-09 conditional novelty 7.0

    PagedAttention achieves near-zero waste in LLM key-value cache memory and enables 2-4x higher serving throughput than prior systems.

  14. QLoRA: Efficient Finetuning of Quantized LLMs

    cs.LG 2023-05 conditional novelty 7.0

    QLoRA finetunes 4-bit quantized LLMs via LoRA adapters to match full-precision performance while using far less memory, enabling 65B-scale training on single GPUs and producing Guanaco models near ChatGPT level.

  15. LLM.int8(): 8-bit Matrix Multiplication for Transformers at Scale

    cs.LG 2022-08 conditional novelty 7.0

    LLM.int8() performs 8-bit inference for transformers up to 175B parameters with no accuracy loss by combining vector-wise quantization for most features with 16-bit mixed-precision handling of systematic outlier dimensions.

  16. FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness

    cs.LG 2022-05 accept novelty 7.0

    FlashAttention reduces GPU high-bandwidth memory accesses in self-attention via tiling, delivering exact attention with lower IO complexity, 2-3x wall-clock speedups on models like GPT-2, and the ability to train on s...

  17. OPT: Open Pre-trained Transformer Language Models

    cs.CL 2022-05 unverdicted novelty 7.0

    OPT releases open decoder-only transformers up to 175B parameters that match GPT-3 performance at one-seventh the carbon cost, along with code and training logs.

  18. Longformer: The Long-Document Transformer

    cs.CL 2020-04 accept novelty 7.0

    Longformer uses local windowed attention plus task-specific global attention to achieve linear scaling and state-of-the-art results on long-document language modeling, QA, and summarization after pretraining.

  19. ALBERT: A Lite BERT for Self-supervised Learning of Language Representations

    cs.CL 2019-09 accept novelty 7.0

    ALBERT reduces BERT parameters via embedding factorization and layer sharing, adds inter-sentence coherence pretraining, and reaches SOTA on GLUE, RACE, and SQuAD with fewer parameters than BERT-large.

  20. Megatron-LM: Training Multi-Billion Parameter Language Models Using Model Parallelism

    cs.CL 2019-09 unverdicted novelty 7.0

    Intra-layer model parallelism in PyTorch enables training of 8.3B-parameter transformers, achieving SOTA perplexity of 10.8 on WikiText103 and 66.5% accuracy on LAMBADA.

  21. Generating Long Sequences with Sparse Transformers

    cs.LG 2019-04 unverdicted novelty 7.0

    Sparse Transformers factorize attention to handle sequences tens of thousands long, achieving new SOTA density modeling on Enwik8, CIFAR-10, and ImageNet-64.

  22. LBI: Parallel Scan Backpropagation via Latent Bounded Interfaces

    cs.LG 2026-05 unverdicted novelty 6.0

    LBI enables tractable parallel backpropagation by reducing inter-region adjoint computation to low-dimensional r x r Jacobians while preserving exact gradients under a bounded-interface model.

  23. Dr. Post-Training: A Data Regularization Perspective on LLM Post-Training

    cs.LG 2026-05 unverdicted novelty 6.0

    Dr. Post-Training reframes general data as a data-induced regularizer for LLM post-training updates, yielding a family of methods that outperform data-selection baselines on SFT, RLHF, and RLVR tasks.

  24. AGoQ: Activation and Gradient Quantization for Memory-Efficient Distributed Training of LLMs

    cs.CL 2026-05 unverdicted novelty 6.0

    AGoQ delivers up to 52% lower memory use and 1.34x faster training for 8B-32B LLaMA models by using near-4-bit adaptive activations and 8-bit gradients while preserving pretraining convergence and downstream accuracy.

  25. SIEVES: Selective Prediction Generalizes through Visual Evidence Scoring

    cs.CV 2026-04 unverdicted novelty 6.0

    SIEVES improves selective prediction coverage up to 3x on OOD VQA benchmarks by training a selector on visual localization quality, generalizing across datasets and proprietary reasoners without specific adaptation.

  26. Quantum Dynamics via Score Matching on Bohmian Trajectories

    quant-ph 2026-04 unverdicted novelty 6.0

    Neural networks learn the score of the probability density on Bohmian trajectories to recover exact Schrödinger dynamics via self-consistent minimization for nodeless wave functions, demonstrated on double-well splitt...

  27. Leveraging LLMs for Multi-File DSL Code Generation: An Industrial Case Study

    cs.SE 2026-04 unverdicted novelty 6.0

    Fine-tuning 7B code LLMs on a custom multi-file DSL dataset achieves structural fidelity of 1.00, high exact-match accuracy, and practical utility validated by expert survey and execution checks.

  28. Streaming Structured Inference with Flash-SemiCRF

    cs.LG 2026-04 unverdicted novelty 6.0

    Flash-SemiCRF enables exact semi-CRF inference on long sequences by evaluating edge potentials from compact prefix sums and streaming the forward-backward pass while preserving exact gradients.

  29. Continuous Adversarial Flow Models

    cs.LG 2026-04 unverdicted novelty 6.0

    Continuous adversarial flow models replace MSE in flow matching with adversarial training via a discriminator, improving guidance-free FID on ImageNet from 8.26 to 3.63 for SiT and similar gains for JiT and text-to-im...

  30. Memory-Efficient Transfer Learning with Fading Side Networks via Masked Dual Path Distillation

    cs.CV 2026-04 unverdicted novelty 6.0

    MDPD mutually distills knowledge between a frozen backbone and a learnable side network during fine-tuning, then discards the side network at inference to accelerate speed by at least 25% while preserving accuracy.

  31. MP-ISMoE: Mixed-Precision Interactive Side Mixture-of-Experts for Efficient Transfer Learning

    cs.LG 2026-04 unverdicted novelty 6.0

    MP-ISMoE uses Gaussian noise perturbed iterative quantization and interactive side mixture-of-experts to deliver higher accuracy than prior memory-efficient transfer learning methods while keeping similar parameter an...

  32. Vision Transformers Need Registers

    cs.CV 2023-09 unverdicted novelty 6.0

    Adding register tokens to Vision Transformers eliminates high-norm background artifacts and raises state-of-the-art performance on dense visual prediction tasks.

  33. BloombergGPT: A Large Language Model for Finance

    cs.LG 2023-03 conditional novelty 6.0

    BloombergGPT is a 50B parameter LLM trained on a 708B token mixed financial and general dataset that outperforms prior models on financial benchmarks while preserving general LLM performance.

  34. DINO: DETR with Improved DeNoising Anchor Boxes for End-to-End Object Detection

    cs.CV 2022-03 conditional novelty 6.0

    DINO reaches 51.3 AP on COCO val2017 with a ResNet-50 backbone after 24 epochs, a +2.7 AP gain over the prior best DETR variant.

  35. Linformer: Self-Attention with Linear Complexity

    cs.LG 2020-06 conditional novelty 6.0

    Linformer approximates self-attention with a low-rank projection to achieve O(n) time and space complexity while matching Transformer accuracy on standard NLP tasks.

  36. Memory Efficient Full-gradient Attacks (MEFA) Framework for Adversarial Defense Evaluations

    cs.LG 2026-05 unverdicted novelty 5.0

    MEFA enables exact full-gradient white-box attacks on iterative stochastic purification defenses like diffusion and Langevin EBMs by trading recomputation for lower memory, revealing vulnerabilities missed by approxim...

  37. AGoQ: Activation and Gradient Quantization for Memory-Efficient Distributed Training of LLMs

    cs.CL 2026-05 unverdicted novelty 5.0

    AGoQ cuts LLM training memory by up to 52% and speeds it up by 1.34x using tailored 4-bit activations and 8-bit gradients with special communication, matching baseline accuracy on LLaMA models.

  38. Towards General Text Embeddings with Multi-stage Contrastive Learning

    cs.CL 2023-08 unverdicted novelty 5.0

    GTE_base is a compact text embedding model using multi-stage contrastive learning on diverse data that outperforms OpenAI's API and 10x larger models on massive benchmarks and works for code as text.

  39. Cross-Layer Energy Analysis of Multimodal Training on Grace Hopper Superchips

    cs.DC 2026-05 unverdicted novelty 4.0

    On Grace Hopper superchips, energy efficiency during multimodal training is governed by data movement and overlap rather than compute utilization, and runtime-optimal configurations are not always energy-optimal.

  40. Cosmos World Foundation Model Platform for Physical AI

    cs.CV 2025-01 unverdicted novelty 3.0

    The Cosmos platform supplies open-source pre-trained world models and supporting tools for building fine-tunable digital world simulations to train Physical AI.

Reference graph

Works this paper leans on

19 extracted references · 19 canonical work pages · cited by 39 Pith papers · 1 internal anchor

  1. [1]

    Martín Abadi, Ashish Agarwal, Paul Barham, Eugene Brevdo, Zhifeng Chen, Craig Citro, Greg S. Corrado, Andy Davis, Jeffrey Dean, Matthieu Devin, Sanjay Ghemawat, Ian Goodfellow, Andrew Harp, Geoffrey Irving, Michael Isard, Yangqing Jia, Rafal Jozefowicz, Lukasz Kaiser, Manjunath Kudlur, Josh Levenberg, Dan Mané, Rajat Monga, Sherry Moore, Derek Murra...

  2. [2]

    Amit Agarwal, Eldar Akchurin, Chris Basoglu, Guoguo Chen, Scott Cyphers, Jasha Droppo, Adam Eversole, Brian Guenter, Mark Hillebrand, Ryan Hoens, Xuedong Huang, Zhiheng Huang, Vladimir Ivanov, Alexey Kamenev, Philipp Kranen, Oleksii Kuchaiev, Wolfgang Manousek, Avner May, Bhaskar Mitra, Olivier Nano, Gaizka Navarro, Alexey Orlov, Marko Padmilac, Hari Part...

  3. [3]

    Compilers: Principles, Techniques, and Tools

    Alfred V. Aho, Ravi Sethi, and Jeffrey D. Ullman. Compilers: Principles, Techniques, and Tools. Addison-Wesley Longman Publishing Co., Inc., Boston, MA, USA, 1986

  4. [4]

    Theano: new features and speed improvements

    Frédéric Bastien, Pascal Lamblin, Razvan Pascanu, James Bergstra, Ian J. Goodfellow, Arnaud Bergeron, Nicolas Bouchard, and Yoshua Bengio. Theano: new features and speed improvements. Deep Learning and Unsupervised Feature Learning NIPS 2012 Workshop, 2012

  5. [5]

    Theano: a CPU and GPU math expression compiler

    James Bergstra, Olivier Breuleux, Frédéric Bastien, Pascal Lamblin, Razvan Pascanu, Guillaume Desjardins, Joseph Turian, David Warde-Farley, and Yoshua Bengio. Theano: a CPU and GPU math expression compiler. In Proceedings of the Python for Scientific Computing Conference (SciPy), June 2010. Oral Presentation

  6. [6]

    MXNet: A flexible and efficient machine learning library for heterogeneous distributed systems

    Tianqi Chen, Mu Li, Yutian Li, Min Lin, Naiyan Wang, Minjie Wang, Tianjun Xiao, Bing Xu, Chiyuan Zhang, and Zheng Zhang. MXNet: A flexible and efficient machine learning library for heterogeneous distributed systems. In Neural Information Processing Systems, Workshop on Machine Learning Systems (LearningSys'15), 2015

  7. [7]

    Large scale distributed deep networks

    Jeffrey Dean, Greg S. Corrado, Rajat Monga, Kai Chen, Matthieu Devin, Quoc V. Le, Mark Z. Mao, Marc'Aurelio Ranzato, Andrew Senior, Paul Tucker, Ke Yang, and Andrew Y. Ng. Large scale distributed deep networks. In NIPS, 2012

  8. [8]

    Deep learning

    Ian Goodfellow, Yoshua Bengio, and Aaron Courville. Deep learning. Book in preparation for MIT Press, 2016

  9. [9]

    Algorithm 799: Revolve: An implementation of checkpointing for the reverse or adjoint mode of computational differentiation

    Andreas Griewank and Andrea Walther. Algorithm 799: Revolve: An implementation of checkpointing for the reverse or adjoint mode of computational differentiation. ACM Trans. Math. Softw., 26(1):19–45, March 2000

  10. [10]

    Deep Residual Learning for Image Recognition

    Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. arXiv preprint arXiv:1512.03385, 2015

  11. [11]

    Identity mappings in deep residual networks

    Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Identity mappings in deep residual networks. arXiv preprint arXiv:1603.05027, 2016

  12. [12]

    Long short-term memory

    Sepp Hochreiter and Jürgen Schmidhuber. Long short-term memory. Neural Comput., 9(8):1735–1780, November 1997

  13. [13]

    Batch normalization: Accelerating deep network training by reducing internal covariate shift

    Sergey Ioffe and Christian Szegedy. Batch normalization: Accelerating deep network training by reducing internal covariate shift. In Proceedings of the 32nd International Conference on Machine Learning (ICML'15), 2015

  14. [14]

    Imagenet classification with deep convolutional neural networks

    Alex Krizhevsky, Ilya Sutskever, and Geoffrey E. Hinton. Imagenet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems 25, pages 1097–1105. 2012

  15. [15]

    Gradient-based learning applied to document recognition

    Yann LeCun, Léon Bottou, Yoshua Bengio, and Patrick Haffner. Gradient-based learning applied to document recognition. In S. Haykin and B. Kosko, editors, Intelligent Signal Processing, pages 306–351. IEEE Press, 2001

  16. [16]

    Virtualizing deep neural networks for memory-efficient neural network design

    Minsoo Rhu, Natalia Gimelshein, Jason Clemons, Arslan Zulfiqar, and Stephen W Keckler. Virtualizing deep neural networks for memory-efficient neural network design. arXiv preprint arXiv:1602.08124, 2016

  17. [17]

    Long short-term memory recurrent neural network architectures for large scale acoustic modeling

    Hasim Sak, Andrew W. Senior, and Françoise Beaufays. Long short-term memory recurrent neural network architectures for large scale acoustic modeling. In INTERSPEECH 2014, 15th Annual Conference of the International Speech Communication Association, Singapore, September 14-18, 2014, pages 338–342, 2014

  18. [18]

    Training very deep networks

    Rupesh Kumar Srivastava, Klaus Greff, and Jürgen Schmidhuber. Training very deep networks. arXiv preprint arXiv:1507.06228, 2015

  19. [19]

    Highway long short-term memory rnns for distant speech recognition

    Yu Zhang, Guoguo Chen, Dong Yu, Kaisheng Yao, Sanjeev Khudanpur, and James Glass. Highway long short-term memory rnns for distant speech recognition. arXiv preprint arXiv:1510.08983, 2015