pith. machine review for the scientific record.

arxiv: 2012.06678 · v1 · submitted 2020-12-11 · 💻 cs.LG · cs.AI

Recognition: 1 theorem link

TabTransformer: Tabular Data Modeling Using Contextual Embeddings

Authors on Pith: no claims yet

Pith reviewed 2026-05-16 21:28 UTC · model grok-4.3

classification 💻 cs.LG cs.AI
keywords tabular data · transformers · self-attention · contextual embeddings · deep learning · semi-supervised learning · AUC · tree ensembles

The pith

TabTransformer applies self-attention to categorical feature embeddings to create contextual representations that raise prediction accuracy on tabular data.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents TabTransformer, an architecture that feeds categorical feature embeddings into Transformer layers so self-attention can build contextual embeddings capturing feature interactions. On fifteen public datasets this produces at least a 1.0 percent mean AUC gain over prior deep learning methods for tabular data while reaching parity with tuned tree ensembles. The same embeddings prove more resistant to missing or noisy features and yield more interpretable predictions. For the semi-supervised setting, an unsupervised pre-training procedure yields an average 2.1 percent AUC lift over state-of-the-art methods.

Core claim

Applying Transformer self-attention layers to the embeddings of categorical variables produces contextual embeddings that improve accuracy for both fully supervised and semi-supervised tabular modeling, outperforming earlier deep networks and matching tree-based ensembles on public benchmarks.

What carries the argument

Self-attention Transformer layers that convert per-feature categorical embeddings into contextual embeddings carrying cross-feature information.
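
A minimal PyTorch sketch of the mechanism described above, under standard Transformer conventions: each categorical column gets its own embedding table, the stacked column embeddings pass through Transformer encoder layers, and the flattened contextual embeddings are concatenated with layer-normalized continuous features before a final MLP. This is not the authors' released implementation; the embedding width, depth, head count, and MLP shape here are illustrative assumptions.

```python
import torch
import torch.nn as nn

class TabTransformerSketch(nn.Module):
    """Sketch of the TabTransformer mechanism, not the authors' implementation.

    Each categorical column gets its own embedding; Transformer encoder layers
    turn the stacked column embeddings into contextual embeddings; the final
    MLP sees those embeddings plus layer-normalized continuous features.
    Hyperparameters below (embed_dim, depth, heads, MLP width) are assumptions.
    """

    def __init__(self, cardinalities, num_continuous, embed_dim=32,
                 depth=6, heads=8, num_classes=2):
        super().__init__()
        self.embeds = nn.ModuleList(
            [nn.Embedding(card, embed_dim) for card in cardinalities]
        )
        layer = nn.TransformerEncoderLayer(
            d_model=embed_dim, nhead=heads,
            dim_feedforward=4 * embed_dim, batch_first=True,
        )
        self.transformer = nn.TransformerEncoder(layer, num_layers=depth)
        self.cont_norm = nn.LayerNorm(num_continuous)
        mlp_in = len(cardinalities) * embed_dim + num_continuous
        self.mlp = nn.Sequential(
            nn.Linear(mlp_in, 4 * mlp_in),
            nn.ReLU(),
            nn.Linear(4 * mlp_in, num_classes),
        )

    def forward(self, x_categ, x_cont):
        # x_categ: (batch, n_categ) integer codes; x_cont: (batch, n_cont) floats.
        tokens = torch.stack(
            [emb(x_categ[:, i]) for i, emb in enumerate(self.embeds)], dim=1
        )                                          # (batch, n_categ, embed_dim)
        contextual = self.transformer(tokens)      # cross-feature self-attention
        flat = contextual.flatten(1)               # (batch, n_categ * embed_dim)
        return self.mlp(torch.cat([flat, self.cont_norm(x_cont)], dim=1))


# Toy usage: 3 categorical columns (cardinalities 12, 7, 30) and 5 continuous ones.
model = TabTransformerSketch([12, 7, 30], num_continuous=5)
x_categ = torch.stack([torch.randint(0, c, (64,)) for c in (12, 7, 30)], dim=1)
logits = model(x_categ, torch.randn(64, 5))        # (64, 2) class logits
```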

If this is right

  • Tabular tasks can use attention mechanisms without custom feature engineering.
  • Predictions remain accurate even when input features contain missing values or noise.
  • The learned embeddings support direct inspection of which feature combinations drive each prediction.
  • Unsupervised pre-training on unlabeled tables produces useful starting embeddings for downstream labeled tasks (one such objective is sketched after this list).
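
To make the pre-training bullet concrete, here is a hedged sketch of a masked-feature objective in the spirit of the paper's unsupervised procedure, reusing the TabTransformerSketch class above. The 30 percent mask rate, the choice of code 0 as a mask token, and the per-column reconstruction heads are assumptions; the paper's pre-training procedure may differ in its masking scheme and objective.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MaskedColumnPretrainer(nn.Module):
    """Hypothetical masked-feature pre-training objective (not the paper's exact recipe).

    Reuses the embeddings and Transformer from TabTransformerSketch: a fraction
    of categorical cells is masked, the corrupted row is encoded, and per-column
    linear heads reconstruct the original codes from the contextual embeddings.
    """

    def __init__(self, model, cardinalities, mask_prob=0.30):
        super().__init__()
        self.model = model
        self.mask_prob = mask_prob
        embed_dim = model.embeds[0].embedding_dim
        self.heads = nn.ModuleList(
            [nn.Linear(embed_dim, card) for card in cardinalities]
        )

    def forward(self, x_categ):
        mask = torch.rand(x_categ.shape) < self.mask_prob
        corrupted = x_categ.masked_fill(mask, 0)      # code 0 stands in for [MASK]
        tokens = torch.stack(
            [emb(corrupted[:, i]) for i, emb in enumerate(self.model.embeds)], dim=1
        )
        contextual = self.model.transformer(tokens)   # (batch, n_categ, embed_dim)
        losses = []
        for i, head in enumerate(self.heads):
            per_cell = F.cross_entropy(
                head(contextual[:, i]), x_categ[:, i], reduction="none"
            )
            # Only masked cells contribute to the reconstruction loss.
            denom = mask[:, i].sum().clamp(min=1)
            losses.append((per_cell * mask[:, i]).sum() / denom)
        return torch.stack(losses).mean()


# Toy usage with the model and x_categ from the previous sketch.
pretrainer = MaskedColumnPretrainer(model, cardinalities=[12, 7, 30])
loss = pretrainer(x_categ)     # scalar loss to minimize on unlabeled rows
```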

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • The same contextual-embedding approach may transfer to tables that mix numeric, categorical, and text columns.
  • Larger-scale versions could serve as general-purpose backbones for enterprise tabular pipelines.
  • Interpretability gains might reduce the need for separate post-hoc explanation tools.

Load-bearing premise

The fifteen public datasets used for testing represent the range of distributions and noise patterns found in real tabular prediction tasks.

What would settle it

A new tabular dataset, after identical tuning of all methods, where TabTransformer shows no AUC improvement over the strongest deep learning baseline.

Original abstract

We propose TabTransformer, a novel deep tabular data modeling architecture for supervised and semi-supervised learning. The TabTransformer is built upon self-attention based Transformers. The Transformer layers transform the embeddings of categorical features into robust contextual embeddings to achieve higher prediction accuracy. Through extensive experiments on fifteen publicly available datasets, we show that the TabTransformer outperforms the state-of-the-art deep learning methods for tabular data by at least 1.0% on mean AUC, and matches the performance of tree-based ensemble models. Furthermore, we demonstrate that the contextual embeddings learned from TabTransformer are highly robust against both missing and noisy data features, and provide better interpretability. Lastly, for the semi-supervised setting we develop an unsupervised pre-training procedure to learn data-driven contextual embeddings, resulting in an average 2.1% AUC lift over the state-of-the-art methods.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The paper proposes TabTransformer, a Transformer-based architecture for supervised and semi-supervised tabular data modeling. It uses self-attention layers to produce contextual embeddings from categorical feature embeddings, claiming this yields higher accuracy than prior deep learning methods. Through experiments on 15 public datasets, it reports that TabTransformer outperforms state-of-the-art deep tabular models by at least 1.0% mean AUC while matching tree-based ensembles (e.g., XGBoost), demonstrates robustness to missing and noisy features, provides improved interpretability via attention, and shows a 2.1% AUC gain from unsupervised pre-training in the semi-supervised case.

Significance. If the empirical claims hold under rigorous controls, the work would be significant for the tabular modeling literature: it provides concrete evidence that attention mechanisms can close the gap with tree ensembles on standard benchmarks while adding robustness and interpretability benefits. The use of 15 public datasets and the semi-supervised pre-training procedure are positive elements that could influence follow-on research on hybrid DL-tree approaches.

major comments (2)
  1. [§4] Experiments: The description of baseline implementations provides no details on the hyperparameter search space, number of trials, or compute budget allocated to tree-based methods such as XGBoost and LightGBM. Because tabular performance is known to be highly sensitive to these choices, the central claim that TabTransformer 'matches the performance of tree-based ensemble models' cannot be evaluated without this information.
  2. [§4.3] Table 2: Mean AUC differences (1.0% over DL baselines) are reported without standard deviations across the 15 datasets, without per-dataset statistical tests, and without explicit confirmation that all baselines received equivalent tuning effort. This omission directly affects the reliability of the performance claims.
minor comments (1)
  1. [§3.2] The notation for the multi-head attention output and the subsequent feed-forward layers could be made more explicit by including the exact dimensionality transformations (standard shape conventions are sketched below).
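
For reference, the standard shape bookkeeping the minor comment asks for looks like the following; the embedding width d, head count h, and the feed-forward expansion factor 4d are conventional choices, not necessarily the paper's.

```latex
% Standard multi-head attention shapes for n column embeddings of width d
% (illustrative conventions; the paper's exact d, h, and FFN expansion may differ).
\begin{align*}
X &\in \mathbb{R}^{n \times d}, \qquad
  W_i^{Q}, W_i^{K}, W_i^{V} \in \mathbb{R}^{d \times d_h}, \qquad d_h = d/h, \\
\mathrm{head}_i &= \operatorname{softmax}\!\left(\frac{(X W_i^{Q})(X W_i^{K})^{\top}}{\sqrt{d_h}}\right) X W_i^{V}
  \;\in\; \mathbb{R}^{n \times d_h}, \\
\mathrm{MHA}(X) &= [\mathrm{head}_1; \dots; \mathrm{head}_h]\, W^{O} \in \mathbb{R}^{n \times d},
  \qquad W^{O} \in \mathbb{R}^{d \times d}, \\
\mathrm{FFN}(Z) &= \max(0,\, Z W_1 + b_1)\, W_2 + b_2, \qquad
  W_1 \in \mathbb{R}^{d \times 4d},\; W_2 \in \mathbb{R}^{4d \times d}.
\end{align*}
```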

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive comments on our manuscript. We address each major point below and will incorporate revisions to improve the clarity and rigor of the experimental reporting.

Point-by-point responses
  1. Referee: [§4] Experiments: The description of baseline implementations provides no details on the hyperparameter search space, number of trials, or compute budget allocated to tree-based methods such as XGBoost and LightGBM. Because tabular performance is known to be highly sensitive to these choices, the central claim that TabTransformer 'matches the performance of tree-based ensemble models' cannot be evaluated without this information.

    Authors: We agree that additional details on hyperparameter tuning are necessary for reproducibility and to substantiate the performance comparison. In the revised version, we will expand Section 4 with explicit hyperparameter search spaces for XGBoost, LightGBM, and all other baselines, the number of random or grid search trials conducted, and the compute budget (e.g., number of CPU/GPU hours) allocated to each method. This will allow direct evaluation of tuning equivalence (an illustrative equal-budget search protocol is sketched after these responses). revision: yes

  2. Referee: [§4.3] Table 2: Mean AUC differences (1.0% over DL baselines) are reported without standard deviations across the 15 datasets, without per-dataset statistical tests, and without explicit confirmation that all baselines received equivalent tuning effort. This omission directly affects the reliability of the performance claims.

    Authors: We acknowledge that reporting variability and statistical significance strengthens the claims. We will revise Table 2 to include standard deviations of the AUC values across the 15 datasets. We will also add per-dataset paired statistical tests (Wilcoxon signed-rank) between TabTransformer and each baseline, along with an explicit statement confirming that all methods received comparable tuning effort via the same search protocol. These additions will be included in the updated Section 4.3. revision: yes
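
To illustrate Response 1, a hedged sketch of an equal-budget tuning protocol: a fixed-trial random search over a hypothetical XGBoost space, scored on AUC to match the paper's metric. The search space, trial count, and cross-validation setup are assumptions, not the values the revision will report.

```python
from scipy.stats import loguniform, randint
from sklearn.model_selection import RandomizedSearchCV
from xgboost import XGBClassifier

# Hypothetical search space; the revised Section 4 would list the real one.
xgb_space = {
    "n_estimators": randint(100, 1000),
    "max_depth": randint(3, 12),
    "learning_rate": loguniform(1e-3, 3e-1),
    "subsample": [0.6, 0.8, 1.0],
    "colsample_bytree": [0.6, 0.8, 1.0],
}

def tune_xgb(X_train, y_train, n_trials=50, seed=0):
    """Fixed-budget random search so every baseline gets the same tuning effort."""
    search = RandomizedSearchCV(
        XGBClassifier(random_state=seed),
        param_distributions=xgb_space,
        n_iter=n_trials,          # identical trial budget across methods
        scoring="roc_auc",        # same metric the paper reports
        cv=5,
        random_state=seed,
        n_jobs=-1,
    )
    search.fit(X_train, y_train)
    return search.best_estimator_, search.best_score_
```

To illustrate Response 2, a minimal sketch of the promised per-dataset paired Wilcoxon signed-rank test using scipy.stats.wilcoxon; the AUC arrays here are synthetic stand-ins, and the real inputs would be the per-dataset test AUCs from the revised Table 2.

```python
import numpy as np
from scipy.stats import wilcoxon

rng = np.random.default_rng(0)

# Synthetic stand-ins for per-dataset test AUCs on the 15 benchmarks
# (NOT real results; the revised Table 2 would supply the actual numbers).
auc_baseline = rng.uniform(0.70, 0.95, size=15)
auc_tabtransformer = np.clip(auc_baseline + rng.normal(0.01, 0.01, size=15), 0.0, 1.0)

# Paired, one-sided test across datasets: is TabTransformer's per-dataset AUC
# systematically higher than the strongest baseline's?
stat, p_value = wilcoxon(auc_tabtransformer, auc_baseline, alternative="greater")
print(f"Wilcoxon statistic = {stat:.1f}, one-sided p = {p_value:.4f}")
```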

Circularity Check

0 steps flagged

No circularity in derivation chain; all claims empirical

full rationale

The paper introduces TabTransformer as a new architecture using self-attention on categorical embeddings for tabular data and supports its claims solely through direct experimental comparisons on 15 public datasets against DL and tree baselines. No equations, uniqueness theorems, or ansatzes are derived or invoked that reduce by construction to fitted inputs from the same evaluation data. Performance claims (AUC lifts) are presented as measured outcomes rather than predictions forced by the model's own parameterization, leaving the argument self-contained without circular reduction.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests on the domain assumption that self-attention mechanisms can usefully model feature interactions when applied to learned embeddings of categorical columns in tabular data; no free parameters or invented entities are specified in the abstract.

axioms (1)
  • domain assumption Self-attention applied to categorical feature embeddings produces robust contextual representations that improve downstream prediction accuracy on tabular data.
    This is the core modeling assumption invoked to justify the architecture.

pith-pipeline@v0.9.0 · 5446 in / 1347 out tokens · 45803 ms · 2026-05-16T21:28:50.990411+00:00 · methodology

discussion (0)


Forward citations

Cited by 20 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. From Schema to Signal: Retrieval-Augmented Modeling for Relational Data Analytics

    cs.DB 2026-05 unverdicted novelty 7.0

    RAM augments relational graph models with attribute-semantic retrieval via random-walk documents and two contrastive augmentations (ATRA, ETRA) to achieve state-of-the-art results on five real-world databases.

  2. GeoViSTA: Geospatial Vision-Tabular Transformer for Multimodal Environment Representation

    cs.LG 2026-05 unverdicted novelty 7.0

    GeoViSTA learns unified geospatial embeddings from co-registered imagery and tabular data via bilateral cross-attention and joint masked autoencoding, yielding better linear probing performance on mortality and fire h...

  3. LUCAS-MEGA: A Large-Scale Multimodal Dataset for Representation Learning in Soil-Environment Systems

    cs.LG 2026-05 unverdicted novelty 7.0

    LUCAS-MEGA fuses 68 heterogeneous soil datasets into a 70k-sample multimodal collection and demonstrates its value by pretraining a tabular transformer whose representations align with established soil processes.

  4. LUCAS-MEGA: A Large-Scale Multimodal Dataset for Representation Learning in Soil-Environment Systems

    cs.LG 2026-05 unverdicted novelty 7.0

    LUCAS-MEGA fuses 68 soil-environment datasets into a 70k-sample multimodal resource that supports self-supervised pretraining of SoilFormer, whose representations align with known soil processes.

  5. AniMatrix: An Anime Video Generation Model that Thinks in Art, Not Physics

    cs.CV 2026-05 unverdicted novelty 7.0

    AniMatrix generates anime videos using a structured taxonomy of artistic production variables, dual-channel conditioning, a style-motion curriculum, and deformation-aware optimization to prioritize art over physics.

  6. AniMatrix: An Anime Video Generation Model that Thinks in Art, Not Physics

    cs.CV 2026-05 unverdicted novelty 7.0

    AniMatrix generates anime videos by structuring artistic production rules into a controllable taxonomy and training the model to prioritize those rules over physical realism, achieving top scores from professional ani...

  7. AniMatrix: An Anime Video Generation Model that Thinks in Art, Not Physics

    cs.CV 2026-05 unverdicted novelty 7.0

    AniMatrix generates anime videos using a production knowledge taxonomy, dual-channel conditioning, style-motion curriculum, and deformation-aware preference optimization, outperforming baselines in animator evaluation...

  8. VT-Bench: A Unified Benchmark for Visual-Tabular Multi-Modal Learning

    cs.CV 2026-05 unverdicted novelty 7.0

    VT-Bench is the first unified benchmark aggregating 14 visual-tabular datasets with over 756K samples and evaluating 23 models to expose challenges in this multi-modal area.

  9. ASD-Bench: A Four-Axis Comprehensive Benchmark of AI Models for Autism Spectrum Disorder

    cs.LG 2026-05 unverdicted novelty 6.0

    ASD-Bench evaluates 17 ML and deep learning models on 4,068 AQ-10 records across child, adolescent, and adult cohorts, showing high adult performance, harder adolescent classification, shifting feature importance, and...

  10. Mitigating Label Shift in Tabular In-Context Learning via Test-Time Posterior Adjustment

    cs.LG 2026-05 unverdicted novelty 6.0

    DistPFN is a test-time posterior adjustment that rescales TabPFN class probabilities to reduce overfitting to the training class distribution under label shift.

  11. DynaTab: Dynamic Feature Ordering as Neural Rewiring for High-Dimensional Tabular Data

    cs.LG 2026-05 unverdicted novelty 6.0

    DynaTab dynamically reorders features in tabular data via neural rewiring and reports statistically significant gains over 45 baselines on 36 high-dimensional datasets.

  12. ReSS: Learning Reasoning Models for Tabular Data Prediction via Symbolic Scaffold

    cs.AI 2026-04 unverdicted novelty 6.0

    ReSS uses decision-tree scaffolds to fine-tune LLMs for faithful tabular reasoning, reporting up to 10% gains over baselines on medical and financial data.

  13. Weight-Informed Self-Explaining Clustering for Mixed-Type Tabular Data

    cs.LG 2026-04 unverdicted novelty 6.0

    WISE unifies representation via BEP, feature weighting via LOFO, two-stage clustering, and intrinsic explanations via DFI for mixed-type tabular data, outperforming baselines on six datasets.

  14. From Uniform to Learned Knots: A Study of Spline-Based Numerical Encodings for Tabular Deep Learning

    cs.LG 2026-04 unverdicted novelty 6.0

    Spline encodings for numerical features show task-dependent performance in tabular deep learning, with piecewise-linear encoding robust for classification and variable results for regression depending on spline family...

  15. Focused PU learning from imbalanced data

    cs.LG 2026-05 unverdicted novelty 5.0

    A focused empirical risk estimator for PU learning achieves state-of-the-art results on imbalanced datasets under SCAR and SAR labeling mechanisms.

  16. Evaluating Tabular Representation Learning for Network Intrusion Detection

    cs.LG 2026-05 unverdicted novelty 5.0

    Tabular representation learning for network intrusion detection exhibits strong dataset-model dependency, with supervised methods outperforming unsupervised anomaly detection and limited but possible cross-dataset gen...

  17. ZAYAN: Disentangled Contrastive Transformer for Tabular Remote Sensing Data

    cs.LG 2026-04 unverdicted novelty 5.0

    ZAYAN introduces feature-level zero-anchor contrastive pretraining that produces disentangled embeddings and improves classification accuracy on remote sensing tabular datasets over standard deep learning baselines.

  18. Evaluating Deep Learning Models for Multiclass Classification of LIGO Gravitational-Wave Glitches

    gr-qc 2026-04 unverdicted novelty 5.0

    Benchmark finds some deep learning models match gradient-boosted trees on LIGO glitch classification with fewer parameters and partially consistent feature importance across architectures.

  19. PRAGMA: Revolut Foundation Model

    cs.LG 2026-04 unverdicted novelty 5.0

    PRAGMA pre-trains a Transformer on heterogeneous banking events with a tailored self-supervised masked objective, yielding embeddings that support strong downstream performance on credit scoring, fraud detection, and ...

  20. Integrating SAINT with Tree-Based Models: A Case Study in Employee Attrition Prediction

    cs.LG 2026-04 unverdicted novelty 2.0

    Standalone tree-based models outperform both SAINT and SAINT-embedding hybrids for employee attrition prediction on tabular HR data.

Reference graph

Works this paper leans on

99 extracted references · 99 canonical work pages · cited by 17 Pith papers · 8 internal anchors
