L-layer transformers under Log-ICoT curriculum provably learn k-parity with poly(n) samples and log k stages, matching explicit CoT efficiency without inference overhead.
Highway Networks
Mixed citation behavior. Most common role is background (60%).
abstract
There is plenty of theoretical and empirical evidence that depth of neural networks is a crucial ingredient for their success. However, network training becomes more difficult with increasing depth and training of very deep networks remains an open problem. In this extended abstract, we introduce a new architecture designed to ease gradient-based training of very deep networks. We refer to networks with this architecture as highway networks, since they allow unimpeded information flow across several layers on "information highways". The architecture is characterized by the use of gating units which learn to regulate the flow of information through a network. Highway networks with hundreds of layers can be trained directly using stochastic gradient descent and with a variety of activation functions, opening up the possibility of studying extremely deep and efficient architectures.
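The gating idea described in the abstract can be sketched in a few lines. The snippet below is a minimal illustration, not the authors' implementation: it assumes the commonly used coupled form y = T(x) * H(x) + (1 - T(x)) * x, with a tanh transform H and a sigmoid transform gate T. The names `highway_layer`, `affine`, `W_h`, `b_h`, `W_t`, and `b_t` are hypothetical and chosen for this example only.

```python
import math

def sigmoid(z):
    """Logistic function, maps any real number into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def affine(W, b, x):
    """Compute W @ x + b for a list-of-rows matrix W and vectors b, x."""
    return [sum(w * v for w, v in zip(row, x)) + b_i for row, b_i in zip(W, b)]

def highway_layer(x, W_h, b_h, W_t, b_t):
    """One highway layer (assumed coupled-gate form).

    h is the candidate transform H(x); t is the transform gate T(x).
    The carry gate is taken to be 1 - t, so input and output must have
    the same dimension.
    """
    h = [math.tanh(v) for v in affine(W_h, b_h, x)]
    t = [sigmoid(v) for v in affine(W_t, b_t, x)]
    return [t_i * h_i + (1.0 - t_i) * x_i for t_i, h_i, x_i in zip(t, h, x)]

# Illustrative values only (assumptions, not taken from the paper).
x = [0.5, -1.0, 2.0]
W_h = [[0.1, 0.2, -0.3], [0.0, 0.5, 0.1], [-0.2, 0.1, 0.4]]
b_h = [0.0, 0.0, 0.0]
W_t = [[0.0] * 3 for _ in range(3)]  # gate depends only on its bias here

# Strongly negative gate bias: the gate is nearly closed, so the layer
# passes x through almost unchanged.
y_carry = highway_layer(x, W_h, b_h, W_t, [-20.0] * 3)

# Strongly positive gate bias: the gate is nearly open, so the layer
# behaves like a plain tanh layer.
y_transform = highway_layer(x, W_h, b_h, W_t, [20.0] * 3)
```

With the gate nearly closed, the layer acts as an identity map, which is the "unimpeded information flow" the abstract refers to; with the gate nearly open, it reduces to an ordinary nonlinear layer. A trained network learns gate values between these two extremes per unit and per input.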
representative citing papers
Residual networks reformulate layers to learn residual functions, enabling effective training of up to 152-layer models that achieve 3.57% error on ImageNet and win ILSVRC 2015.
XAttnRes introduces cross-stage attention residuals that maintain a global feature history and selectively aggregate prior representations, improving medical image segmentation and performing on par with baselines even without skip connections.
Deep Delta Learning replaces additive residual updates with a gated delta-rule that selectively overwrites residual content along learned directions, improving language modeling quality over standard ResNet-style accumulation.
GNN-DCMs apply graph neural networks to discrete choice modeling, recovering nested logit and spatially correlated logit via message passing on utilities and demonstrating better predictive performance for residential location choices in Chicago.
dCGPANN encodes neural nets so evolutionary operators can rewire, prune, adapt activations and add skips while gradient descent tunes parameters, yielding smaller networks with lower regression error in fixed time.
Every fixed finite feedforward neural network definable in an o-minimal structure has finite sample complexity in the agnostic PAC setting.
SATFormer uses a context-dependent gate for selective reuse of early Transformer representations, improving validation loss and zero-shot accuracy especially on retrieval benchmarks.
Automated search discovers Swish activation f(x) = x * sigmoid(βx) that improves top-1 ImageNet accuracy over ReLU by 0.9% on Mobile NASNet-A and 0.6% on Inception-ResNet-v2.
Wide residual networks achieve higher accuracy and faster training than very deep thin residual networks by increasing width and decreasing depth, setting new state-of-the-art results on CIFAR, SVHN, and ImageNet.
E³C is a video diffusion model that disentangles persistent 3D scene structure via point-cloud memory from human dynamics via ego-exo pose controls for improved egocentric video generation on the Nymeria dataset.
Domain adaptation with an ensemble of CNN and transformer models trained on DES detects 20,180 LSBGs and 434 UDGs in KiDS DR5, with structural parameters and environmental trends consistent with known samples.
SeedPolicy introduces self-evolving gated attention to extend the temporal horizon of diffusion policies, yielding 36.8% and 169% relative gains over standard DP on clean and randomized RoboTwin 2.0 tasks.
SiameseNorm is a two-stream architecture that reconciles Pre-Norm and Post-Norm in Transformers by coupling streams via shared residual blocks, yielding performance gains with maintained stability on language, vision, and diffusion models.
Gated lexical shortcut connections added to the transformer yield 0.9 BLEU average gains on five WMT directions while lowering the lexical content stored in hidden states.
WISP reformulates next-day active fire forecasting as point-set prediction and reports 38.2% AP, 53.4% FRP-weighted coverage, and 54.1% localization within 5 km on a global held-out test set.
A position-agnostic nonlinear pre-projection MLP plus content skip connection in transformer attention improves LAMBADA accuracy by 40.6% and reduces perplexity by 39% on 160M-scale models.
Applying a head-specific sigmoid gate after SDPA in LLMs boosts performance and stability by adding non-linearity and query-dependent sparse modulation while reducing attention sinks.
MetaGPT embeds human SOPs into LLM prompts to create role-specialized agent teams that produce more coherent solutions on collaborative software engineering tasks than prior chat-based multi-agent systems.
Geometric deep learning provides a unified mathematical framework based on grids, groups, graphs, geodesics, and gauges to explain and extend neural network architectures by incorporating physical regularities.
GNMT deploys 8-layer LSTMs with attention, wordpieces, low-precision inference, and coverage-penalized beam search to match state-of-the-art on WMT'14 En-Fr and En-De while cutting translation errors by 60% in human evaluations.
Attention Residuals replaces fixed residual summation with input-dependent softmax attention over preceding layers, and a blocked variant is shown to improve uniformity and downstream performance in a 48B-parameter model pre-trained on 1.4T tokens.
SG-NTF combines spectral mapping of time with a co-gating filter inside neural Tucker factorization to complete HDI tensors while remaining parameter-efficient.
CAMNet uses data-dependent routing across parallel tensors in a multi-path network to outperform equivalent single-path, multi-path, and deeper networks on classification and pixel-labeling tasks for individual, sequential, and combined datasets.
citing papers explorer
MetaGPT: Meta Programming for A Multi-Agent Collaborative Framework