pith. machine review for the scientific record.

arxiv: 1606.08415 · v5 · submitted 2016-06-27 · 💻 cs.LG

Recognition: 2 theorem links · Lean Theorem

Gaussian Error Linear Units (GELUs)

Dan Hendrycks, Kevin Gimpel

Pith reviewed 2026-05-10 12:12 UTC · model grok-4.3

classification 💻 cs.LG
keywords activation function · GELU · ReLU · ELU · neural network · deep learning · computer vision · natural language processing

The pith

The GELU activation xΦ(x) outperforms ReLU and ELU on computer vision, natural language processing, and speech tasks.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper proposes the Gaussian Error Linear Unit (GELU) as a new activation function for neural networks. GELU is defined as xΦ(x), where Φ(x) is the cumulative distribution function of the standard normal distribution, which has the effect of weighting inputs by their value rather than gating them by sign like the ReLU. The authors conduct experiments comparing GELU to ReLU and ELU across multiple domains and report consistent performance improvements with GELU. A sympathetic reader would care because activation functions are a basic building block of deep networks, and small changes here can affect overall model quality without requiring architectural redesign.
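
As a concrete sketch of the definition (an editorial illustration, not code from the paper): the exact form xΦ(x) and the two approximations the paper gives, 0.5·x·(1 + tanh(√(2/π)·(x + 0.044715·x³))) and x·σ(1.702·x), written with NumPy and SciPy.

    import numpy as np
    from scipy.stats import norm

    def gelu_exact(x):
        # Exact GELU: x * Phi(x), with Phi the standard normal CDF.
        return x * norm.cdf(x)

    def gelu_tanh(x):
        # Tanh approximation reported in the paper.
        return 0.5 * x * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (x + 0.044715 * x ** 3)))

    def gelu_sigmoid(x):
        # Sigmoid approximation reported in the paper: x * sigmoid(1.702 * x).
        return x / (1.0 + np.exp(-1.702 * x))

    x = np.linspace(-4.0, 4.0, 9)
    print(np.abs(gelu_exact(x) - gelu_tanh(x)).max())  # approximation error stays small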

Core claim

The GELU nonlinearity, given by xΦ(x) with Φ the standard Gaussian CDF, weights inputs by their value and yields better empirical performance than ReLU or ELU on the considered computer vision, natural language processing, and speech tasks.

What carries the argument

The GELU function xΦ(x), which multiplies each input by the probability that a standard normal random variable is less than or equal to that input.

If this is right

  • GELU can be used as a drop-in replacement for ReLU or ELU in existing neural network models (see the sketch after this list).
  • Performance gains are expected in vision, language, and speech applications when using GELU.
  • Training may converge to better solutions because inputs are scaled continuously rather than thresholded.
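
A minimal sketch of the drop-in swap from the first point above, assuming a PyTorch environment (torch.nn ships both ReLU and GELU modules); the architecture and names here are illustrative, not from the paper.

    import torch.nn as nn

    def make_mlp(activation: nn.Module) -> nn.Sequential:
        # Identical architecture either way; only the nonlinearity differs.
        return nn.Sequential(
            nn.Linear(784, 256),
            activation,
            nn.Linear(256, 10),
        )

    relu_model = make_mlp(nn.ReLU())
    gelu_model = make_mlp(nn.GELU())  # same parameter count, same interface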

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • The weighting mechanism may provide smoother gradient flow during backpropagation compared to hard gating.
  • Similar activation functions could be derived using other probability distributions beyond the Gaussian (see the sketch after this list).
  • Adoption of GELU might reduce the need for careful initialization or normalization techniques in some models.
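
A sketch of the second point above (an editorial extension, not something the paper itself does): the x·F(x) construction evaluated with other CDFs, where the logistic CDF recovers the familiar x·sigmoid(x) form.

    import numpy as np
    from scipy.stats import norm, logistic, laplace

    def cdf_activation(x, dist):
        # Generic x * F(x) construction for any CDF F.
        return x * dist.cdf(x)

    x = np.linspace(-3.0, 3.0, 7)
    print(cdf_activation(x, norm))      # GELU: x * Phi(x)
    print(cdf_activation(x, logistic))  # x * sigmoid(x)
    print(cdf_activation(x, laplace))   # a Laplace-CDF variant, for comparison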

Load-bearing premise

The performance improvements seen on the specific tasks and models tested will continue to appear on other architectures, datasets, and training setups.

What would settle it

Running the same models with GELU on a new task or dataset and observing either no improvement or a degradation relative to ReLU.
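
A hedged sketch of that test, assuming the PyTorch model builder sketched earlier and standard DataLoaders for the new task (all names here are illustrative): train identically configured models that differ only in the activation and compare validation accuracy.

    import torch
    import torch.nn as nn

    def train_and_eval(model, train_loader, val_loader, epochs=10, lr=1e-3):
        # Same optimizer, schedule, and data for every activation,
        # so any accuracy gap is attributable to the nonlinearity.
        opt = torch.optim.Adam(model.parameters(), lr=lr)
        loss_fn = nn.CrossEntropyLoss()
        for _ in range(epochs):
            for xb, yb in train_loader:
                opt.zero_grad()
                loss_fn(model(xb), yb).backward()
                opt.step()
        correct = total = 0
        with torch.no_grad():
            for xb, yb in val_loader:
                correct += (model(xb).argmax(dim=1) == yb).sum().item()
                total += yb.numel()
        return correct / total

    # acc_gelu = train_and_eval(make_mlp(nn.GELU()), train_loader, val_loader)
    # acc_relu = train_and_eval(make_mlp(nn.ReLU()), train_loader, val_loader)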

read the original abstract

We propose the Gaussian Error Linear Unit (GELU), a high-performing neural network activation function. The GELU activation function is $x\Phi(x)$, where $\Phi(x)$ is the standard Gaussian cumulative distribution function. The GELU nonlinearity weights inputs by their value, rather than gates inputs by their sign as in ReLUs ($x\mathbf{1}_{x>0}$). We perform an empirical evaluation of the GELU nonlinearity against the ReLU and ELU activations and find performance improvements across all considered computer vision, natural language processing, and speech tasks.

Editorial analysis

A structured set of objections, weighed in public.

Referee report, simulated authors' rebuttal, circularity audit, and an axiom and free-parameter ledger. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 3 minor

Summary. The paper introduces the Gaussian Error Linear Unit (GELU) activation function defined as xΦ(x), with Φ denoting the standard Gaussian cumulative distribution function. It highlights that GELU weights inputs according to their value, in contrast to ReLU which gates by sign. An empirical evaluation is performed comparing GELU to ReLU and ELU, with reported performance gains across computer vision, natural language processing, and speech tasks.

Significance. If the results are reliable, GELU offers a high-performing, parameter-free activation function with a probabilistic interpretation. This could lead to better neural network models in various fields. The direct definition from the Gaussian CDF and the broad empirical testing are positive aspects of the work.

major comments (1)
  1. The abstract and corresponding experimental sections report consistent improvements but omit key details such as the number of runs, statistical significance tests, hyperparameter search methodology, and precise baseline configurations. These omissions make it difficult to fully evaluate the strength of the empirical claims.
minor comments (3)
  1. The definition of Φ(x) as the Gaussian CDF should be stated explicitly in the introduction or methods section for readers unfamiliar with the notation.
  2. Consider adding a figure illustrating the GELU function alongside ReLU and ELU to visually support the textual description.
  3. Ensure that all acronyms are defined at first use, such as NLP if used in the text.

Simulated Authors' Rebuttal

1 response · 0 unresolved

We thank the referee for their constructive review and for recognizing the potential significance of the GELU activation function. We address the major comment below and will revise the manuscript to strengthen the experimental reporting.

read point-by-point responses
  1. Referee: The abstract and corresponding experimental sections report consistent improvements but omit key details such as the number of runs, statistical significance tests, hyperparameter search methodology, and precise baseline configurations. These omissions make it difficult to fully evaluate the strength of the empirical claims.

    Authors: We agree that additional experimental details would improve the manuscript's clarity and allow readers to better assess the reliability of the reported gains. In the revised version, we will expand the relevant sections to specify: the number of runs (noting that computational constraints led to single runs for most large-scale experiments, consistent with practices in the field at the time of submission); that formal statistical significance tests were not conducted but improvements were consistent across diverse tasks; a description of the hyperparameter search process (including search ranges and selection criteria for learning rates, regularization, and other settings applied uniformly to GELU, ReLU, and ELU); and more precise baseline configurations, such as exact network architectures, initialization schemes, and training protocols. These additions will support rather than change the abstract. We believe this directly addresses the concern without overstating the original experiments. revision: yes

Circularity Check

0 steps flagged

No significant circularity detected

full rationale

The GELU is defined directly as xΦ(x) using the standard Gaussian CDF with no fitted parameters, self-referential equations, or load-bearing self-citations. The central claim is an empirical observation of performance gains on specific tasks, supported by side-by-side experimental results rather than any internal derivation that reduces to its own inputs by construction. No steps match the enumerated circularity patterns.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

The central claim rests only on the standard definition of the Gaussian CDF and the empirical results; no free parameters are introduced, no new axioms are stated, and no new entities are postulated.

pith-pipeline@v0.9.0 · 5378 in / 987 out tokens · 41939 ms · 2026-05-10T12:12:07.972046+00:00 · methodology

discussion (0)


Forward citations

Cited by 60 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Stability and Generalization in Looped Transformers

    cs.LG 2026-04 unverdicted novelty 8.0

    Looped transformers with recall and outer normalization produce reachable, input-dependent fixed points with stable gradients, enabling generalization, while those without recall cannot; a new internal recall variant ...

  2. Mamba: Linear-Time Sequence Modeling with Selective State Spaces

    cs.LG 2023-12 unverdicted novelty 8.0

    Mamba is a linear-time sequence model using input-dependent selective SSMs that achieves SOTA results across modalities and matches twice-larger Transformers on language modeling with 5x higher inference throughput.

  3. Parallel Scan Recurrent Neural Quantum States for Scalable Variational Monte Carlo

    cond-mat.str-el 2026-05 conditional novelty 7.0

    PSR-NQS makes recurrent neural quantum states scalable for variational Monte Carlo by using parallel scan recurrence, reaching accurate results on 52x52 two-dimensional lattices.

  4. Graph Neural Networks with Triangle-Based Messages for the Multicut Problem

    cs.LG 2026-05 unverdicted novelty 7.0

    A triangle-message GNN for multicut outperforms heuristics in solution quality on graphs up to 200 nodes and finds optimal solutions faster than exact solvers for some cases.

  5. Pretraining Strategies and Scaling for ECG Foundation Models: A Systematic Study

    eess.SP 2026-05 unverdicted novelty 7.0

    Contrastive predictive coding pretraining combined with structured state space models yields the strongest ECG foundation models, with continued gains from scaling data to 11 million samples.

  6. Disentangled Sparse Representations for Concept-Separated Diffusion Unlearning

    cs.LG 2026-05 unverdicted novelty 7.0

    SAEParate disentangles sparse representations in diffusion models via contrastive clustering and nonlinear encoding to enable more precise concept unlearning with reduced side effects.

  7. Learning Less Is More: Premature Upper-Layer Attention Specialization Hurts Language Model Pretraining

    cs.CL 2026-05 unverdicted novelty 7.0

    Temporarily reducing the learning rate on upper-layer query and key projections during early GPT pretraining prevents premature attention specialization and improves model performance.

  8. From Holo Pockets to Electron Density: GPT-style Drug Design with Density

    cs.AI 2026-05 unverdicted novelty 7.0

    EDMolGPT generates drug-like molecules from low-resolution electron density point clouds of holo binding pockets and shows effectiveness across 101 biological targets.

  9. Zero-Shot Neural Network Evaluation with Sample-Wise Activation Patterns

    cs.LG 2026-05 unverdicted novelty 7.0

    SWAP-Score evaluates neural networks without training by quantifying sample-wise activation patterns, achieving high correlation with true performance on CIFAR-10 for CNNs and GLUE for Transformers while enabling fast NAS.

  10. GPROF-IR: An Improved Single-Channel Infrared Precipitation Retrieval for Merged Satellite Precipitation Products

    physics.ao-ph 2026-05 unverdicted novelty 7.0

    GPROF-IR is a CNN-based retrieval that uses temporal context in geostationary IR observations to produce precipitation estimates with lower error than prior IR methods and climatological consistency with PMW retrieval...

  11. Every Feedforward Neural Network Definable in an o-Minimal Structure Has Finite Sample Complexity

    stat.ML 2026-05 unverdicted novelty 7.0

    Every fixed finite feedforward neural network definable in an o-minimal structure has finite sample complexity in the agnostic PAC setting.

  12. Path-Coupled Bellman Flows for Distributional Reinforcement Learning

    cs.LG 2026-05 unverdicted novelty 7.0

    Path-Coupled Bellman Flows use source-consistent Bellman-coupled paths and a lambda-parameterized control-variate to learn return distributions via flow matching, improving fidelity and stability over prior DRL approaches.

  13. Align3D-AD: Cross-Modal Feature Alignment and Dual-Prompt Learning for Zero-shot 3D Anomaly Detection

    cs.CV 2026-05 unverdicted novelty 7.0

    Align3D-AD improves zero-shot 3D anomaly detection by cross-modal feature alignment from RGB guidance and dual-prompt contrastive alignment to capture complementary semantics.

  14. Bayesian Rain Field Reconstruction using Commercial Microwave Links and Diffusion Model Priors

    cs.LG 2026-05 unverdicted novelty 7.0

    Diffusion model priors enable training-free Bayesian sampling for more accurate rain field reconstruction from path-integrated commercial microwave link measurements than Gaussian process baselines.

  15. Neuron-Anchored Rule Extraction for Large Language Models via Contrastive Hierarchical Ablation

    cs.LG 2026-05 unverdicted novelty 7.0

    MechaRule localizes agonist neurons in LLMs via contrastive hierarchical ablation to ground rule extraction in circuitry, recalling 96.8% of high-effect neurons and reducing task performance when suppressed.

  16. Towards Efficient and Expressive Offline RL via Flow-Anchored Noise-conditioned Q-Learning

    cs.LG 2026-05 unverdicted novelty 7.0

    FAN achieves state-of-the-art offline RL performance on robotic tasks by anchoring flow policies and using single-sample noise-conditioned Q-learning, with proven convergence and reduced runtimes.

  17. Acceleration of horizontal numerical advection for atmospheric modeling through surrogate modeling with temporal coarse-graining

    physics.ao-ph 2026-04 conditional novelty 7.0

    A CNN surrogate with temporal coarse-graining accelerates 10-day advection simulations up to 92x while achieving r² of 0.60-0.98 against the baseline solver.

  18. Robust Model-Based Iteration for Passive Gamma Emission Tomography

    math.NA 2026-04 unverdicted novelty 7.0

    A safeguarded hybrid of Levenberg-Marquardt and learned operators achieves equivalent reconstruction quality for PGET in roughly one-third the iterations, with architecture-dependent robustness.

  19. Learning Neural Operator Surrogates for the Black Hole Accretion Code

    astro-ph.HE 2026-04 unverdicted novelty 7.0

    Physics-informed Fourier neural operators recover plasmoid formation in sparse SRRMHD vortex data where data-only models fail, and transformer operators approximate AMR jet evolution, marking first reported uses in th...

  20. PhyloSDF: Phylogenetically-Conditioned Neural Generation of 3D Skull Morphology via Residual Flow Matching

    q-bio.QM 2026-04 unverdicted novelty 7.0

    PhyloSDF generates novel 3D skull morphologies for Darwin's finches via phylogenetically-conditioned residual flow matching, achieving 88-129% of real intra-species variation from few specimens and enabling phylogenet...

  21. VitaminP: cross-modal learning enables whole-cell segmentation from routine histology

    cs.CV 2026-04 unverdicted novelty 7.0

    VitaminP uses paired H&E-mIF data to train a model that transfers molecular boundary information, enabling accurate whole-cell segmentation directly from routine H&E histology across 34 cancer types.

  22. A satellite foundation model for improved wealth monitoring

    cs.CY 2026-04 unverdicted novelty 7.0

    Tempov is a self-supervised satellite foundation model that predicts wealth levels and decadal changes at high resolution across Africa from Landsat imagery, outperforming baselines even with limited labels and genera...

  23. Latent Space Probing for Adult Content Detection in Video Generative Models

    cs.CV 2026-04 unverdicted novelty 7.0

    Latent space probing on CogVideoX achieves 97.29% F1 for adult content detection on a new 11k-clip dataset with 4-6ms overhead.

  24. Critical role of phase-dependent properties in modeling photothermal sintering of LiCoO2 cathodes

    cond-mat.mtrl-sci 2026-04 unverdicted novelty 7.0

    Amorphous LiCoO2 absorbs light more strongly and reaches higher peak temperatures than crystalline LiCoO2 during photothermal sintering, so constant-property models overestimate safe operating windows.

  25. To See the Unseen: on the Generalization Ability of Transformers in Symbolic Reasoning

    cs.AI 2026-04 conditional novelty 7.0

    Unembedding collapse in transformers prevents distinguishing unseen tokens in symbolic reasoning, but targeted interventions restore generalization.

  26. Decoding Text Spans for Efficient and Accurate Named-Entity Recognition

    cs.CL 2026-04 unverdicted novelty 7.0

    SpanDec achieves competitive NER accuracy with improved efficiency by using a final-stage lightweight decoder for span representations and early candidate filtering to reduce redundant computation.

  27. Latent Fourier Transform

    cs.SD 2026-04 unverdicted novelty 7.0

    LatentFT uses latent-space Fourier transforms and frequency masking in diffusion autoencoders to enable timescale-specific manipulation of musical structure in generative models.

  28. Grokking of Diffusion Models: Case Study on Modular Addition

    cs.LG 2026-04 unverdicted novelty 7.0

    Diffusion models show grokking on modular addition by composing periodic operand representations in simple data regimes or by separating arithmetic computation from visual denoising across timesteps in varied regimes.

  29. Machine learning isotope shifts in molecular energy levels

    astro-ph.EP 2026-04 unverdicted novelty 7.0

    Neural network corrects residual errors in isotopologue energy extrapolations for CO2 (MAE reduction in >87% of levels vs Marvel) and transfers patterns to improve CO predictions in >93% of samples.

  30. Neural Dynamic GI: Random-Access Neural Compression for Temporal Lightmaps in Dynamic Lighting Environments

    cs.GR 2026-04 unverdicted novelty 7.0

    NDGI compresses temporal lightmaps via neural feature maps and lightweight networks, delivering high-quality dynamic global illumination with low storage and modest real-time decompression cost.

  31. The Linear Centroids Hypothesis: Features as Directions Learned by Local Experts

    cs.LG 2026-04 unverdicted novelty 7.0

    The Linear Centroids Hypothesis reframes network features as directions in centroid spaces of local affine experts, unifying interpretability methods and yielding sparser, more faithful dictionaries, circuits, and sal...

  32. Emulating Non-Differentiable Metrics via Knowledge-Guided Learning: Introducing the Minkowski Image Loss

    cs.LG 2026-04 unverdicted novelty 7.0

    A knowledge-guided framework produces a differentiable surrogate for Minkowski functionals on precipitation images via Lipschitz-constrained CNNs, validated on radar data but revealing a stability-versus-detail trade-...

  33. Winner-Take-All Spiking Transformer for Language Modeling

    cs.NE 2026-04 unverdicted novelty 7.0

    Winner-take-all spiking self-attention replaces softmax in spiking transformers to support language modeling on 16 datasets with spike-driven, energy-efficient architectures.

  34. Transactional Attention: Semantic Sponsorship for KV-Cache Retention

    cs.CL 2026-04 unverdicted novelty 7.0

    Transactional Attention uses semantic sponsorship from anchor patterns to retain dormant critical tokens in KV caches, achieving 100% credential retrieval at 16 tokens where all prior methods fail.

  35. Neighbourhood Transformer: Switchable Attention for Monophily-Aware Graph Learning

    cs.LG 2026-04 unverdicted novelty 7.0

    Neighbourhood Transformers apply local self-attention for monophily-aware graph learning, guarantee expressiveness at least as strong as message-passing GNNs, and outperform prior methods on node classification across...

  36. IAT: Instance-As-Token Compression for Historical User Sequence Modeling in Industrial Recommender Systems

    cs.IR 2026-04 unverdicted novelty 7.0

    IAT compresses each historical interaction instance into a unified embedding token via temporal-order or user-order schemes, allowing standard sequence models to learn long-range preferences with better performance an...

  37. Dual Triangle Attention: Effective Bidirectional Attention Without Positional Embeddings

    q-bio.QM 2026-04 unverdicted novelty 7.0

    Dual Triangle Attention achieves effective bidirectional attention with built-in positional inductive bias via dual triangular masks, outperforming standard bidirectional attention on position-sensitive tasks and show...

  38. Trilinear Compute-in-Memory Architecture for Energy-Efficient Transformer Acceleration

    cs.AR 2026-04 unverdicted novelty 7.0

    TrilinearCIM enables complete in-memory Transformer attention computation via DG-FeFET three-operand MAC without runtime NVM reprogramming, delivering up to 46.6% energy reduction and 20.4% latency improvement on BERT...

  39. Multi-Head Attention based interaction-aware architecture for Bangla Handwritten Character Recognition: Introducing a Primary Dataset

    cs.CV 2026-04 accept novelty 7.0

    A new balanced Bangla handwritten character dataset paired with a multi-head attention hybrid model using EfficientNetB3, ViT, and Conformer achieves high accuracy and strong generalization.

  40. Hybrid Fourier Neural Operator for Surrogate Modeling of Laser Processing with a Quantum-Circuit Mixer

    quant-ph 2026-04 unverdicted novelty 7.0

    HQ-LP-FNO replaces part of the spectral channel mixing in a 3D FNO with a mode-shared VQC, reducing parameters by 15.6% and phase-fraction MAE by 26% on laser-processing surrogates while remaining stable under calibra...

  41. Moshi: a speech-text foundation model for real-time dialogue

    eess.AS 2024-09 accept novelty 7.0

    Moshi is the first real-time full-duplex spoken large language model that casts dialogue as speech-to-speech generation using parallel audio streams and an inner monologue of time-aligned text tokens.

  42. Transformers are SSMs: Generalized Models and Efficient Algorithms Through Structured State Space Duality

    cs.LG 2024-05 unverdicted novelty 7.0

    Transformers and SSMs are unified through structured state space duality, producing a 2-8X faster Mamba-2 model that remains competitive with Transformers.

  43. Language Model Beats Diffusion -- Tokenizer is Key to Visual Generation

    cs.CV 2023-10 unverdicted novelty 7.0

    A new shared video-image tokenizer enables large language models to surpass diffusion models on standard visual generation benchmarks.

  44. Segment Anything

    cs.CV 2023-04 unverdicted novelty 7.0

    A promptable model trained on 1B masks achieves competitive zero-shot segmentation performance across tasks and is released publicly with its dataset.

  45. Scalable Diffusion Models with Transformers

    cs.CV 2022-12 unverdicted novelty 7.0

    DiTs achieve SOTA FID of 2.27 on ImageNet 256x256 by scaling transformer-based latent diffusion models, with performance improving consistently as Gflops increase.

  46. DreamFusion: Text-to-3D using 2D Diffusion

    cs.CV 2022-09 accept novelty 7.0

    Optimizes a Neural Radiance Field via probability density distillation from a 2D diffusion model to produce text-conditioned 3D scenes viewable from any angle.

  47. A Generalist Agent

    cs.AI 2022-05 accept novelty 7.0

    Gato is a multi-modal, multi-task, multi-embodiment generalist policy using one transformer network to handle text, vision, games, and robotics tasks.

  48. OPT: Open Pre-trained Transformer Language Models

    cs.CL 2022-05 unverdicted novelty 7.0

    OPT releases open decoder-only transformers up to 175B parameters that match GPT-3 performance at one-seventh the carbon cost, along with code and training logs.

  49. Flamingo: a Visual Language Model for Few-Shot Learning

    cs.CV 2022-04 unverdicted novelty 7.0

    Flamingo models reach new state-of-the-art few-shot results on image and video tasks by bridging frozen vision and language models with cross-attention layers trained on interleaved web-scale data.

  50. BART: Denoising Sequence-to-Sequence Pre-training for Natural Language Generation, Translation, and Comprehension

    cs.CL 2019-10 accept novelty 7.0

    BART introduces a denoising pretraining method for seq2seq models that matches RoBERTa on GLUE and SQuAD while setting new state-of-the-art results on abstractive summarization, dialogue, and QA with up to 6 ROUGE gains.

  51. ALBERT: A Lite BERT for Self-supervised Learning of Language Representations

    cs.CL 2019-09 accept novelty 7.0

    ALBERT reduces BERT parameters via embedding factorization and layer sharing, adds inter-sentence coherence pretraining, and reaches SOTA on GLUE, RACE, and SQuAD with fewer parameters than BERT-large.

  52. Megatron-LM: Training Multi-Billion Parameter Language Models Using Model Parallelism

    cs.CL 2019-09 unverdicted novelty 7.0

    Intra-layer model parallelism in PyTorch enables training of 8.3B-parameter transformers, achieving SOTA perplexity of 10.8 on WikiText103 and 66.5% accuracy on LAMBADA.

  53. Generating Long Sequences with Sparse Transformers

    cs.LG 2019-04 unverdicted novelty 7.0

    Sparse Transformers factorize attention to handle sequences tens of thousands long, achieving new SOTA density modeling on Enwik8, CIFAR-10, and ImageNet-64.

  54. Searching for Activation Functions

    cs.NE 2017-10 conditional novelty 7.0

    Automated search discovers Swish activation f(x) = x * sigmoid(βx) that improves top-1 ImageNet accuracy over ReLU by 0.9% on Mobile NASNet-A and 0.6% on Inception-ResNet-v2.

  55. Q-Flow: Stable and Expressive Reinforcement Learning with Flow-Based Policy

    cs.LG 2026-05 unverdicted novelty 6.0

    Q-Flow enables stable optimization of expressive flow-based policies in RL by propagating terminal values along deterministic flow dynamics to intermediate states for gradient updates without solver unrolling.

  56. Contrastive Learning under Noisy Temporal Self-Supervision for Colonoscopy Videos

    cs.CV 2026-05 unverdicted novelty 6.0

    A noise-aware contrastive loss built on temporal self-supervision learns polyp tracklet representations from 27 videos that outperform prior self-supervised and supervised baselines and match foundation models on retr...

  57. Spatial Adapter: Structured Spatial Decomposition and Closed-Form Covariance for Frozen Predictors

    stat.ML 2026-05 unverdicted novelty 6.0

    The Spatial Adapter equips frozen predictors with a spatially regularized orthonormal basis for residuals and derives a closed-form low-rank-plus-noise covariance for spatial prediction and kriging.

  58. On the global convergence of gradient descent for wide shallow models with bounded nonlinearities

    math.OC 2026-05 unverdicted novelty 6.0

    Gradient descent on wide shallow models with bounded nonlinearities converges globally in the mean-field limit as non-global critical points are unstable under the dynamics.

  59. Adaptive Action Chunking via Multi-Chunk Q Value Estimation

    cs.LG 2026-05 unverdicted novelty 6.0

    ACH lets RL policies dynamically pick action chunk lengths by jointly estimating Q-values for all candidate lengths via a single Transformer pass.

  60. Hystar: Hypernetwork-driven Style-adaptive Retrieval via Dynamic SVD Modulation

    cs.CV 2026-05 unverdicted novelty 6.0

    Hystar adapts CLIP-like models to unseen query styles by generating per-input singular-value perturbations with a hypernetwork for attention layers and a new StyleNCE contrastive loss.

Reference graph

Works this paper leans on

26 extracted references · 26 canonical work pages · cited by 143 Pith papers

  1. Jimmy Ba and Brendan Frey. Adaptive dropout for training deep neural networks. In Neural Information Processing Systems, 2013.

  2. Philip Bachman, Ouais Alsharif, and Doina Precup. Learning with pseudo-ensembles. In Neural Information Processing Systems, 2014.

  3. Amit Choudhury. A simple approximation to the area under standard normal curve. Mathematics and Statistics, 2014.

  4. Djork-Arné Clevert, Thomas Unterthiner, and Sepp Hochreiter. Fast and accurate deep network learning by exponential linear units (ELUs). In International Conference on Learning Representations, 2016.

  5. Guillaume Desjardins, Karen Simonyan, Razvan Pascanu, and Koray Kavukcuoglu. Natural neural networks. arXiv, 2015.

  6. Kevin Gimpel, Nathan Schneider, Brendan O'Connor, Dipanjan Das, Daniel Mills, Jacob Eisenstein, Michael Heilman, Dani Yogatama, Jeffrey Flanigan, and Noah A. Smith. Part-of-speech tagging for Twitter: Annotation, features, and experiments. In Association for Computational Linguistics (ACL), 2011.

  7. Dan Hendrycks and Kevin Gimpel. Adjusting for dropout variance in batch normalization and weight initialization. arXiv, 2016.

  8. John Hopfield. Neural networks and physical systems with emergent collective computational abilities. Proceedings of the National Academy of Sciences of the USA, 1982.

  9. Neil Houlsby, Ferenc Huszár, Zoubin Ghahramani, and Máté Lengyel. Bayesian active learning for classification and preference learning. arXiv, 2011.

  10. Diederik Kingma and Jimmy Ba. Adam: A method for stochastic optimization. In International Conference on Learning Representations, 2015.

  11. Alex Krizhevsky. Learning multiple layers of features from tiny images, 2009.

  12. David Krueger, Tegan Maharaj, János Kramár, Mohammad Pezeshki, Nicolas Ballas, Nan Rosemary Ke, Anirudh Goyal, Yoshua Bengio, Hugo Larochelle, Aaron Courville, and Chris Pal. Zoneout: Regularizing RNNs by randomly preserving hidden activations. In Neural Information Processing Systems, 2016.

  13. Ilya Loshchilov and Frank Hutter. SGDR: Stochastic gradient descent with restarts. arXiv, 2016.

  14. Andrew L. Maas, Awni Y. Hannun, and Andrew Y. Ng. Rectifier nonlinearities improve neural network acoustic models. In International Conference on Machine Learning, 2013.

  15. Warren S. McCulloch and Walter Pitts. A logical calculus of the ideas immanent in nervous activity. Bulletin of Mathematical Biophysics, 1943.

  16. Dmytro Mishkin and Jiri Matas. All you need is a good init. In International Conference on Learning Representations, 2016.

  17. Abdelrahman Mohamed, George E. Dahl, and Geoffrey E. Hinton. Acoustic modeling using deep belief networks. IEEE Transactions on Audio, Speech, and Language Processing, 2012.

  18. Vinod Nair and Geoffrey E. Hinton. Rectified linear units improve restricted Boltzmann machines. In International Conference on Machine Learning, 2010.

  19. Olutobi Owoputi, Brendan O'Connor, Chris Dyer, Kevin Gimpel, Nathan Schneider, and Noah A. Smith. Improved part-of-speech tagging for online conversational text with word clusters. In North American Chapter of the Association for Computational Linguistics (NAACL), 2013.

  20. Tim Salimans and Diederik P. Kingma. Weight normalization: A simple reparameterization to accelerate training of deep neural networks. In Neural Information Processing Systems, 2016.

  21. Andrew M. Saxe, James L. McClelland, and Surya Ganguli. Exact solutions to the nonlinear dynamics of learning in deep linear neural networks. In International Conference on Learning Representations, 2014.

  22. Anish Shah, Sameer Shinde, Eashan Kadam, Hena Shah, and Sandip Shingade. Deep residual networks with exponential linear unit. In Vision Net, 2016.

  23. Nitish Srivastava, Geoffrey E. Hinton, Alex Krizhevsky, Ilya Sutskever, and Ruslan Salakhutdinov. Dropout: A simple way to prevent neural networks from overfitting. Journal of Machine Learning Research, 2014.

  24. Nitish Srivastava. Improving neural networks with dropout. University of Toronto, 2013.

  25. Sida I. Wang and Christopher D. Manning. Fast dropout training. In International Conference on Machine Learning, 2013.

  26. Sergey Zagoruyko and Nikos Komodakis. Wide residual networks. British Machine Vision Conference, 2016.