Explicit Dropout: Deterministic Regularization for Transformer Architectures
Pith reviewed 2026-05-10 00:49 UTC · model grok-4.3
The pith
Dropout can be rewritten as explicit additive regularization terms in the training loss for each Transformer component.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Dropout regularization is expressed as an additive term in the loss for Transformer architectures by deriving the expected contribution of each stochastic mask. The resulting explicit terms cover the attention query, key, and value projections as well as the feed-forward network, each with an independent coefficient. This formulation allows training without any random masking while retaining the generalization behavior of conventional dropout.
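To make the shape of such an objective concrete, here is a minimal sketch assuming PyTorch and a simple scaled squared-activation penalty per component; the function names, penalty form, and coefficient values are illustrative assumptions, not taken from the paper.

```python
import torch

def component_penalty(activations: torch.Tensor, rate: float) -> torch.Tensor:
    # One plausible deterministic surrogate for dropout at a given rate:
    # the variance inverted dropout would inject into these activations,
    # i.e. rate / (1 - rate) times the mean squared activation.
    # Illustrative form only; the paper's exact per-component terms may differ.
    return rate / (1.0 - rate) * activations.pow(2).mean()

def explicit_dropout_loss(task_loss, acts, coeffs, rates):
    # Task loss plus one additive term per component (q, k, v, ffn),
    # each with its own regularization coefficient and nominal dropout rate.
    reg = sum(coeffs[k] * component_penalty(acts[k], rates[k]) for k in acts)
    return task_loss + reg

# Hypothetical usage with dummy activations from one Transformer block.
acts = {k: torch.randn(8, 16, 64) for k in ("q", "k", "v", "ffn")}
coeffs = {"q": 1e-3, "k": 1e-3, "v": 1e-3, "ffn": 5e-3}
rates = {"q": 0.1, "k": 0.1, "v": 0.1, "ffn": 0.1}
total = explicit_dropout_loss(torch.tensor(2.3), acts, coeffs, rates)
```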
What carries the argument
Explicit regularization terms obtained by computing the expectation of the stochastic dropout masks applied to attention and feed-forward components.
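For intuition about how an expectation over masks becomes an additive penalty, the classical special case of a linear predictor with squared loss and inverted input dropout can be computed exactly, in the spirit of Wager et al. [33]; the paper's Transformer-specific terms are presumably more involved than this sketch.

```latex
% Exact expectation for a single linear predictor with squared loss under
% inverted input dropout (drop rate p, keep probability 1-p); an illustrative
% special case, not the paper's full Transformer derivation.
\mathbb{E}_{m}\!\left[\Bigl(y - \bigl(\tfrac{m}{1-p}\odot x\bigr)^{\!\top} w\Bigr)^{2}\right]
  = \bigl(y - x^{\top} w\bigr)^{2}
  + \frac{p}{1-p}\sum_{i} x_i^{2}\, w_i^{2},
\qquad m_i \sim \mathrm{Bernoulli}(1-p).
```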
Load-bearing premise
The derived explicit terms reproduce the generalization benefit of random dropout without introducing new optimization biases.
What would settle it
A controlled experiment on one of the reported tasks where the explicit version produces measurably worse validation accuracy than stochastic dropout at the same nominal rate would disprove equivalence.
original abstract
Dropout is a widely used regularization technique in deep learning, but its effects are typically realized through stochastic masking rather than explicit optimization objectives. We propose a deterministic formulation that expresses dropout as an additive regularizer directly incorporated into the training loss. The framework derives explicit regularization terms for Transformer architectures, covering attention query, key, value, and feed-forward components with independently controllable strengths. This formulation removes reliance on stochastic perturbations while providing clearer and fine-grained control over regularization strength. Experiments across image classification, temporal action detection, and audio classification show that explicit dropout matches or outperforms conventional implicit methods, with consistent gains when applied to attention and feed-forward network layers. Ablation studies demonstrate stable performance and controllable regularization through regularization coefficients and dropout rates. Overall, explicit dropout offers a practical and interpretable alternative to stochastic regularization while maintaining architectural flexibility across diverse tasks.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript proposes Explicit Dropout, a deterministic regularization technique for Transformer architectures that reformulates the stochastic dropout operation as explicit additive terms in the training objective. These terms are derived separately for the query, key, and value projections in attention and for the feed-forward network layers, allowing independent control via regularization coefficients. The authors claim that this formulation achieves performance parity or improvements over standard dropout across image classification, temporal action detection, and audio classification tasks, while offering greater interpretability and control without relying on random masking during training.
Significance. If the explicit regularizer accurately captures the generalization benefits of stochastic dropout without introducing new biases, this could offer a more interpretable and controllable alternative for Transformer training, with the multi-task experiments and ablations on coefficient tuning providing practical support. The work's value would lie in enabling deterministic analysis of regularization effects, though this is limited by the absence of mechanistic verification that the deterministic penalty preserves dropout's inductive bias on co-adaptation.
major comments (2)
- [Section 3] Derivation of explicit regularization terms (Section 3): The paper derives the additive penalties for Q/K/V and FFN by starting from the expectation over dropout masks but replaces the stochastic process with a deterministic term; this step is presented without quantifying approximation error from ignored higher-order moments or head-wise dependencies. This is load-bearing for the central claim, as any mismatch in the loss landscape could alter optimization trajectories or feature diversity compared to implicit dropout.
- [Section 4] Experimental results and ablations (Section 4, Tables 1-3): While accuracy matches or exceeds standard dropout with gains on attention/FFN layers, the setup does not control for the extra free parameters (regularization coefficients alongside dropout rates) by comparing against equivalently tuned implicit dropout; without this or multi-seed variance, gains cannot be confidently attributed to faithful reproduction of dropout's effect rather than added flexibility.
minor comments (2)
- [Section 3] The notation distinguishing regularization coefficients from the base dropout rate p could be made more explicit in the method section to aid implementation.
- [Section 4] Ablation studies would benefit from including training dynamics or gradient norm statistics to illustrate stability claims beyond final accuracy.
Simulated Author's Rebuttal
We thank the referee for their detailed and constructive review of our manuscript on Explicit Dropout. We address each of the major comments below, providing clarifications and outlining revisions to strengthen the paper's theoretical and empirical foundations.
point-by-point responses
-
Referee: [Section 3] Derivation of explicit regularization terms (Section 3): The paper derives the additive penalties for Q/K/V and FFN by starting from the expectation over dropout masks but replaces the stochastic process with a deterministic term; this step is presented without quantifying approximation error from ignored higher-order moments or head-wise dependencies. This is load-bearing for the central claim, as any mismatch in the loss landscape could alter optimization trajectories or feature diversity compared to implicit dropout.
Authors: We thank the referee for pointing out this important aspect of the derivation. The explicit terms are obtained by taking the expectation of the loss under the dropout distribution, which naturally yields additive penalties proportional to the squared norms of the projections for Q/K/V, with analogous terms for the FFN. The expectation is exact at the first moment, but, as noted, higher-order interactions are approximated away; in this respect the derivation mirrors how other explicit regularizers, such as L2 weight decay from Gaussian priors, are obtained. To quantify the approximation, we will include in the revised manuscript an empirical analysis comparing the explicit loss to Monte Carlo estimates of the full stochastic expectation on representative layers, showing that the approximation error remains small for typical dropout rates. Regarding head-wise dependencies, since dropout is applied independently per head in standard implementations, our per-component penalties can be extended head-wise if desired, but we found global coefficients sufficient in our experiments. We believe this addresses the concern about potential mismatches in the loss landscape. revision: partial
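A minimal sketch of the kind of Monte Carlo check the authors describe, assuming a single linear layer, squared loss, and input dropout; the names and setup below are illustrative assumptions, not the authors' implementation.

```python
import torch

def mc_dropout_loss(x, w, y, drop_rate, n_samples=2000):
    # Monte Carlo estimate of the expected squared loss when inverted dropout
    # (drop probability drop_rate) is applied to the inputs of a linear layer.
    keep = 1.0 - drop_rate
    losses = []
    for _ in range(n_samples):
        mask = torch.bernoulli(torch.full_like(x, keep)) / keep
        pred = (x * mask) @ w
        losses.append(((y - pred) ** 2).mean())
    return torch.stack(losses).mean()

def explicit_loss(x, w, y, drop_rate):
    # Deterministic counterpart: clean squared loss plus the exact variance
    # term p/(1-p) * sum_i x_i^2 w_i^2, averaged over the batch.
    keep = 1.0 - drop_rate
    clean = ((y - x @ w) ** 2).mean()
    penalty = (drop_rate / keep) * ((x ** 2) @ (w ** 2)).mean()
    return clean + penalty

torch.manual_seed(0)
x, w, y = torch.randn(64, 16), torch.randn(16, 1), torch.randn(64, 1)
print(mc_dropout_loss(x, w, y, 0.1).item())  # stochastic estimate
print(explicit_loss(x, w, y, 0.1).item())    # closed form for this simple case
```

With enough mask samples the first printed value should approach the second; in this simplified setting the two coincide exactly in expectation, whereas for full Transformer components the gap would quantify the approximation error the referee asks about.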
-
Referee: [Section 4] Experimental results and ablations (Section 4, Tables 1-3): While accuracy matches or exceeds standard dropout with gains on attention/FFN layers, the setup does not control for the extra free parameters (regularization coefficients alongside dropout rates) by comparing against equivalently tuned implicit dropout; without this or multi-seed variance, gains cannot be confidently attributed to faithful reproduction of dropout's effect rather than added flexibility.
Authors: We appreciate this critique on the experimental design. The regularization coefficients in Explicit Dropout serve as direct counterparts to the dropout probability in the implicit case, and were tuned via grid search on validation sets in the same manner as dropout rates. However, to ensure a fair comparison accounting for hyperparameter flexibility, we will expand the experiments in the revision to include a baseline where implicit dropout is tuned with an equivalent number of trials (e.g., searching over multiple rates per layer). Additionally, we will report mean and standard deviation over at least three random seeds for all main results to demonstrate statistical robustness. Preliminary checks indicate that the performance gains persist across seeds, suggesting the benefits are not solely due to extra tuning. These additions will better isolate the effect of the deterministic formulation. revision: yes
Circularity Check
No significant circularity; derivation is a standard expectation-based reformulation
full rationale
The paper derives explicit additive regularization terms for Transformer attention and FFN layers by reformulating the stochastic dropout process. This is a direct mathematical step (typically via expectation over masks) rather than a self-definitional loop, fitted prediction, or self-citation chain. No load-bearing premises reduce to the paper's own inputs or prior author work by construction. Experiments on image, action, and audio tasks supply independent empirical checks, and the controllable coefficients are presented as hyperparameters. The central claim therefore rests on external validation rather than tautological equivalence.
Axiom & Free-Parameter Ledger
free parameters (2)
- regularization coefficients
- dropout rates
axioms (1)
- domain assumption: The regularization effect of stochastic dropout can be equivalently expressed via deterministic additive terms in the training loss.
Reference graph
Works this paper leans on
- [1] Arora, R., Bartlett, P., Mianjy, P., Srebro, N., 2021. Dropout: Explicit forms and capacity control, in: Proceedings of the 38th International Conference on Machine Learning, pp. 351–361.
- [2] Cai, S., Shu, Y., Wang, W., Chen, G., Ooi, B.C., Zhang, M., 2019. Effective and efficient dropout for deep convolutional neural networks. doi:10.48550/arXiv.1904.03392.
- [3] Carreira, J., Zisserman, A., 2017. Quo Vadis, Action Recognition? A New Model and the Kinetics Dataset, in: IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 6299–6308. doi:10.1109/CVPR.2017.502.
- [4] Chumachenko, K., Iosifidis, A., Gabbouj, M., 2022. Feedforward neural networks initialization based on discriminant learning. Neural Networks 146, 220–229. doi:10.1016/J.NEUNET.2021.11.020.
- [5] Cui, Y., Liu, Z., Li, Q., Chan, A.B., Xue, C.J., 2021. Bayesian nested neural networks for uncertainty calibration and adaptive compression, in: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 2392–2401. doi:10.1109/CVPR46437.2021.00242.
- [6] Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., Uszkoreit, J., Houlsby, N., 2021. An image is worth 16x16 words: Transformers for image recognition at scale, in: International Conference on Learning Representations.
- [7] Fan, A., Grave, E., Joulin, A., 2020. Reducing transformer depth on demand with structured dropout, in: International Conference on Learning Representations.
- [8] Gal, Y., Ghahramani, Z., 2016. Dropout as a Bayesian approximation: Representing model uncertainty in deep learning, in: Proceedings of The 33rd International Conference on Machine Learning, pp. 1050–1059.
- [9] Gal, Y., Hron, J., Kendall, A., 2017. Concrete dropout, in: Advances in Neural Information Processing Systems, pp. 3581–3590.
- [10] Gao, H., Pei, J., Huang, H., 2019. Demystifying dropout, in: Proceedings of the 36th International Conference on Machine Learning, pp. 2112–2121.
- [11] Georgiou, E., Paraskevopoulos, G., Potamianos, A., 2024. Y-drop: A conductance based dropout for fully connected layers. doi:10.48550/arXiv.2409.09088.
- [12] Hedegaard, L., 2021. CoOadTR. https://github.com/LukasHedegaard/CoOadTR/tree/no-decoder. Computer software. Version: no-decoder branch. Accessed: 2026-04-20.
- [13] Hedegaard, L., Bakhtiarnia, A., Iosifidis, A., 2023. Continual transformers: Redundancy-free attention for online inference, in: International Conference on Learning Representations.
- [14] Heilbron, F.C., Escorcia, V., Ghanem, B., Niebles, J.C., 2015. ActivityNet: A large-scale video benchmark for human activity understanding, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 961–970.
- [15] Hinton, G.E., Srivastava, N., Krizhevsky, A., Sutskever, I., Salakhutdinov, R., 2012. Improving neural networks by preventing co-adaptation of feature detectors. doi:10.48550/arXiv.1207.0580.
- [16] Idrees, H., Zamir, A.R., Jiang, Y., Gorban, A., Laptev, I., Sukthankar, R., Shah, M., 2017. The THUMOS challenge on action recognition for videos "in the wild". Computer Vision and Image Understanding 155, 1–23. doi:10.1016/j.cviu.2016.10.018.
- [17] Iosifidis, A., Tefas, A., Pitas, I., 2015. DropELM: Fast neural network regularization with dropout and dropconnect. Neurocomputing 162, 57–66. doi:10.1016/J.NEUCOM.2015.04.006.
- [18] Krizhevsky, A., 2009. Learning multiple layers of features from tiny images. Technical Report. University of Toronto.
- [19] Krogh, A., Hertz, J.A., 1991. A simple weight decay can improve generalization, in: Advances in Neural Information Processing Systems, pp. 950–957.
- [20] Li, B., Hu, Y., Nie, X., Han, C., Jiang, X., Guo, T., Liu, L., 2023a. DropKey for vision transformer, in: 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp. 22700–22709. doi:10.1109/CVPR52729.2023.02174.
- [21] Li, Y., Ma, W., Chen, C., Zhang, M., Liu, Y., Ma, S., Yang, Y., 2023b. A survey on dropout methods and experimental verification in recommendation. IEEE Transactions on Knowledge and Data Engineering 35, 6595–6615.
- [22] Liang, X., Wu, L., Li, J., Wang, Y., Meng, Q., Qin, T., Chen, W., Zhang, M., Liu, T.Y., 2021. R-drop: Regularized dropout for neural networks, in: Advances in Neural Information Processing Systems, pp. 10890–10905.
- [23] Lin, Z., Liu, P., Huang, L., Chen, J., Qiu, X., Huang, X., 2019. DropAttention: A regularization method for fully-connected self-attention networks. doi:10.48550/arXiv.1907.11065.
- [24] Liu, Z., Xu, Z., Jin, J., Shen, Z., Darrell, T., 2023. Dropout reduces underfitting, in: Proceedings of the 40th International Conference on Machine Learning, pp. 22233–22248.
- [25] OmiHub777, 2024. ViT-CIFAR. https://github.com/omihub777/ViT-CIFAR/tree/main. Computer software. Version: not specified. Accessed: 2026-04-20.
- [26] Prechelt, L., 2012. Early Stopping — But When? Springer Berlin Heidelberg. doi:10.1007/978-3-642-35289-8_5.
- [27] S, S.M., Hao, X., Hou, S., Lu, Y., Sevilla-Lara, L., Arnab, A., Gowda, S.N., 2025. Progressive data dropout: An embarrassingly simple approach to train faster.
- [28] Sokolić, J., Giryes, R., Sapiro, G., Rodrigues, M.R.D., 2017. Robust large margin deep neural networks. IEEE Transactions on Signal Processing 65, 4265–4280. doi:10.1109/TSP.2017.2708039.
- [29] Srivastava, N., Hinton, G., Krizhevsky, A., Sutskever, I., Salakhutdinov, R., 2014. Dropout: A simple way to prevent neural networks from overfitting. Journal of Machine Learning Research 15, 1929–1958.
- [30] Szegedy, C., Vanhoucke, V., Ioffe, S., Shlens, J., Wojna, Z., 2016. Rethinking the inception architecture for computer vision, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2818–2826. doi:10.1109/CVPR.2016.308.
- [31] Tzanetakis, G., Cook, P., 2002. Musical genre classification of audio signals. IEEE Transactions on Speech and Audio Processing 10, 293–302.
- [32] Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, L., Polosukhin, I., 2017. Attention is all you need, in: Advances in Neural Information Processing Systems, pp. 6000–6010.
- [33] Wager, S., Wang, S., Liang, P., 2013. Dropout training as adaptive regularization, in: Advances in Neural Information Processing Systems, pp. 351–359.
- [34] Wang, H., Yang, W., Zhao, Z., Luo, T., Wang, J., Tang, Y., 2019a. Rademacher dropout: An adaptive dropout for deep neural network via optimizing generalization gap. Neurocomputing 357, 177–187.
- [35] Wang, L., Xiong, Y., Wang, Z., Qiao, Y., Lin, D., Tang, X., Van Gool, L., 2019b. Temporal segment networks for action recognition in videos. IEEE Transactions on Pattern Analysis and Machine Intelligence 41, 2740–2755.
- [36] Wang, X., Zhang, S., Qing, Z., Shao, Y., Zuo, Z., Gao, C., Sang, N., 2021. OadTR: Online action detection with transformers, in: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 7545–7555. doi:10.1109/ICCV48922.2021.00747.
- [37] Zaman, K., Li, K., Sah, M., Direkoglu, C., Okada, S., Unoki, M., 2025. Transformers and audio detection tasks: An overview. Digital Signal Processing 158, 104956.
- [38] Zhao, Y., Dada, O., Mullins, R., Gao, X., 2024. Revisiting structured dropout, in: Proceedings of the 15th Asian Conference on Machine Learning, pp. 1699–1714.
- [39] Zhou, W., Ge, T., Wei, F., Zhou, M., Xu, K., 2020. Scheduled DropHead: A regularization method for transformer models, in: Findings of the Association for Computational Linguistics: EMNLP 2020, pp. 1971–1980. doi:10.18653/v1/2020.findings-emnlp.178.