pith. machine review for the scientific record.

arxiv: 1505.00387 · v2 · submitted 2015-05-03 · 💻 cs.LG · cs.NE

Recognition: unknown

Highway Networks

J\"urgen Schmidhuber, Klaus Greff, Rupesh Kumar Srivastava

classification 💻 cs.LG cs.NE
keywords networks · architecture · deep · highway · information · training · depth · flow
0 comments
read the original abstract

There is plenty of theoretical and empirical evidence that depth of neural networks is a crucial ingredient for their success. However, network training becomes more difficult with increasing depth and training of very deep networks remains an open problem. In this extended abstract, we introduce a new architecture designed to ease gradient-based training of very deep networks. We refer to networks with this architecture as highway networks, since they allow unimpeded information flow across several layers on "information highways". The architecture is characterized by the use of gating units which learn to regulate the flow of information through a network. Highway networks with hundreds of layers can be trained directly using stochastic gradient descent and with a variety of activation functions, opening up the possibility of studying extremely deep and efficient architectures.
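The gating mechanism the abstract describes is compact enough to sketch directly. Below is a minimal highway layer in PyTorch (the class name, dimensions, ReLU choice, and the exact bias value are illustrative, not taken from the paper's experiments): each layer computes a candidate transform H and a sigmoid gate T, and outputs H(x)·T(x) + x·(1 − T(x)), so input can flow through unchanged wherever the gate stays closed.

```python
import torch
import torch.nn as nn

class HighwayLayer(nn.Module):
    """One highway layer: y = H(x) * T(x) + x * (1 - T(x))."""

    def __init__(self, dim: int):
        super().__init__()
        self.transform = nn.Linear(dim, dim)  # H(x, W_H): candidate activation
        self.gate = nn.Linear(dim, dim)       # T(x, W_T): transform gate
        # Start with a negative gate bias so each layer is initially close
        # to the identity mapping, as the paper recommends for deep stacks.
        nn.init.constant_(self.gate.bias, -2.0)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        h = torch.relu(self.transform(x))  # any activation works per the abstract
        t = torch.sigmoid(self.gate(x))    # gate values in (0, 1)
        return h * t + x * (1.0 - t)       # carry the ungated part of x through

# Usage: hundreds of stacked layers remain trainable with plain SGD.
net = nn.Sequential(*[HighwayLayer(64) for _ in range(100)])
out = net(torch.randn(8, 64))
```

Because the gate biases start negative, early in training each layer behaves nearly as an identity map, which is what lets gradients reach the bottom of a very deep stack.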

This paper has not been read by Pith yet.

discussion (0)


Forward citations

Cited by 14 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Deep Residual Learning for Image Recognition

    cs.CV 2015-12 accept novelty 8.0

    Residual networks reformulate layers to learn residual functions, enabling effective training of up to 152-layer models that achieve 3.57% error on ImageNet and win ILSVRC 2015.

  2. Every Feedforward Neural Network Definable in an o-Minimal Structure Has Finite Sample Complexity

    stat.ML 2026-05 unverdicted novelty 7.0

    Every fixed finite feedforward neural network definable in an o-minimal structure has finite sample complexity in the agnostic PAC setting.

  3. Transformers with Selective Access to Early Representations

    cs.LG 2026-05 unverdicted novelty 7.0

    SATFormer uses a learned context-dependent gate for selective access to early-layer value representations in Transformers, improving loss and accuracy over static residual baselines.

  4. Transformers with Selective Access to Early Representations

    cs.LG 2026-05 unverdicted novelty 7.0

    SATFormer uses a context-dependent gate for selective reuse of early Transformer representations, improving validation loss and zero-shot accuracy especially on retrieval benchmarks.

  5. Searching for Activation Functions

    cs.NE 2017-10 conditional novelty 7.0

Automated search discovers the Swish activation f(x) = x * sigmoid(βx), which improves top-1 ImageNet accuracy over ReLU by 0.9% on Mobile NASNet-A and 0.6% on Inception-ResNet-v2 (a minimal sketch follows this list).

  6. Wide Residual Networks

    cs.CV 2016-05 accept novelty 7.0

Wide residual networks achieve higher accuracy and faster training than very deep, thin residual networks by increasing width and decreasing depth, setting new state-of-the-art results on CIFAR and SVHN with significant improvements on ImageNet.

  7. From DES to KiDS: Domain adaptation for cross-survey detection of low-surface-brightness galaxies

    astro-ph.GA 2026-05 unverdicted novelty 6.0

    Domain adaptation with an ensemble of CNN and transformer models trained on DES detects 20,180 LSBGs and 434 UDGs in KiDS DR5, with structural parameters and environmental trends consistent with known samples.

  8. Set Prediction for Next-Day Active Fire Forecasting

    cs.LG 2026-05 unverdicted novelty 6.0

    WISP reformulates next-day active fire forecasting as point-set prediction and reports 38.2% AP, 53.4% FRP-weighted coverage, and 54.1% localization within 5 km on a global held-out test set.

  9. Position-Agnostic Pre-Projection for Transformer Attention: Nonlinear Feature Construction and Content Skip Before Q/K/V

    cs.CL 2026-04 unverdicted novelty 6.0

    A position-agnostic nonlinear pre-projection MLP plus content skip connection in transformer attention improves LAMBADA accuracy by 40.6% and reduces perplexity by 39% on 160M-scale models.

  10. Gated Attention for Large Language Models: Non-linearity, Sparsity, and Attention-Sink-Free

    cs.CL 2025-05 conditional novelty 6.0

    Applying a head-specific sigmoid gate after SDPA in LLMs boosts performance and stability by adding non-linearity and query-dependent sparse modulation while reducing attention sinks.

  11. MetaGPT: Meta Programming for A Multi-Agent Collaborative Framework

    cs.AI 2023-08 unverdicted novelty 6.0

    MetaGPT embeds human SOPs into LLM prompts to create role-specialized agent teams that produce more coherent solutions on collaborative software engineering tasks than prior chat-based multi-agent systems.

  12. Geometric Deep Learning: Grids, Groups, Graphs, Geodesics, and Gauges

    cs.LG 2021-04 accept novelty 6.0

    Geometric deep learning provides a unified mathematical framework based on grids, groups, graphs, geodesics, and gauges to explain and extend neural network architectures by incorporating physical regularities.

  13. Google's Neural Machine Translation System: Bridging the Gap between Human and Machine Translation

    cs.CL 2016-09 accept novelty 6.0

GNMT deploys 8-layer LSTMs with attention, wordpieces, low-precision inference, and coverage-penalized beam search to match state-of-the-art on WMT'14 En-Fr and En-De while cutting translation errors by 60% in human evaluations.

  14. A Transfer Learning Evaluation of Deep Neural Networks for Image Classification

    cs.CV 2026-05 unverdicted novelty 2.0

    Empirical comparison of transfer learning performance across eleven pre-trained models on five image datasets using accuracy, time, and size metrics.
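The Swish formula quoted in item 5 is a one-liner; here is a minimal sketch in the same PyTorch style as above (the function name and default β are illustrative):

```python
import torch

def swish(x: torch.Tensor, beta: float = 1.0) -> torch.Tensor:
    # Swish from item 5: f(x) = x * sigmoid(beta * x).
    # beta = 1.0 gives the SiLU special case; the cited paper also
    # considers a learnable per-layer beta.
    return x * torch.sigmoid(beta * x)
```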