pith. machine review for the scientific record.

arxiv: 2012.06678 · v1 · submitted 2020-12-11 · 💻 cs.LG · cs.AI

Recognition: 1 theorem link

TabTransformer: Tabular Data Modeling Using Contextual Embeddings

Authors on Pith: no claims yet

Pith reviewed 2026-05-16 21:28 UTC · model grok-4.3

classification 💻 cs.LG cs.AI
keywords tabular data · transformers · self-attention · contextual embeddings · deep learning · semi-supervised learning · AUC · tree ensembles

The pith

TabTransformer applies self-attention to categorical feature embeddings to create contextual representations that raise prediction accuracy on tabular data.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents TabTransformer, an architecture that feeds categorical feature embeddings into Transformer layers so self-attention can build contextual embeddings capturing feature interactions. On fifteen public datasets this produces at least a 1.0 percent mean AUC gain over prior deep learning methods for tabular data while reaching parity with tuned tree ensembles. The same embeddings prove more resistant to missing or noisy features and yield more interpretable predictions. For the semi-supervised setting, an unsupervised pre-training procedure yields an average 2.1 percent AUC lift over state-of-the-art methods.

Core claim

Applying Transformer self-attention layers to the embeddings of categorical variables produces contextual embeddings that improve accuracy for both fully supervised and semi-supervised tabular modeling, outperforming earlier deep networks and matching tree-based ensembles on public benchmarks.

What carries the argument

Self-attention Transformer layers that convert per-feature categorical embeddings into contextual embeddings carrying cross-feature information.
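
A minimal PyTorch sketch of the mechanism described above, under standard Transformer conventions: each categorical column gets its own embedding table, the stacked column embeddings pass through Transformer encoder layers, and the flattened contextual embeddings are concatenated with layer-normalized continuous features before a final MLP. This is not the authors' released implementation; the embedding width, depth, head count, and MLP shape here are illustrative assumptions.

```python
import torch
import torch.nn as nn

class TabTransformerSketch(nn.Module):
    """Sketch of the TabTransformer mechanism, not the authors' implementation.

    Each categorical column gets its own embedding; Transformer encoder layers
    turn the stacked column embeddings into contextual embeddings; the final
    MLP sees those embeddings plus layer-normalized continuous features.
    Hyperparameters below (embed_dim, depth, heads, MLP width) are assumptions.
    """

    def __init__(self, cardinalities, num_continuous, embed_dim=32,
                 depth=6, heads=8, num_classes=2):
        super().__init__()
        self.embeds = nn.ModuleList(
            [nn.Embedding(card, embed_dim) for card in cardinalities]
        )
        layer = nn.TransformerEncoderLayer(
            d_model=embed_dim, nhead=heads,
            dim_feedforward=4 * embed_dim, batch_first=True,
        )
        self.transformer = nn.TransformerEncoder(layer, num_layers=depth)
        self.cont_norm = nn.LayerNorm(num_continuous)
        mlp_in = len(cardinalities) * embed_dim + num_continuous
        self.mlp = nn.Sequential(
            nn.Linear(mlp_in, 4 * mlp_in),
            nn.ReLU(),
            nn.Linear(4 * mlp_in, num_classes),
        )

    def forward(self, x_categ, x_cont):
        # x_categ: (batch, n_categ) integer codes; x_cont: (batch, n_cont) floats.
        tokens = torch.stack(
            [emb(x_categ[:, i]) for i, emb in enumerate(self.embeds)], dim=1
        )                                          # (batch, n_categ, embed_dim)
        contextual = self.transformer(tokens)      # cross-feature self-attention
        flat = contextual.flatten(1)               # (batch, n_categ * embed_dim)
        return self.mlp(torch.cat([flat, self.cont_norm(x_cont)], dim=1))


# Toy usage: 3 categorical columns (cardinalities 12, 7, 30) and 5 continuous ones.
model = TabTransformerSketch([12, 7, 30], num_continuous=5)
x_categ = torch.stack([torch.randint(0, c, (64,)) for c in (12, 7, 30)], dim=1)
logits = model(x_categ, torch.randn(64, 5))        # (64, 2) class logits
```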

If this is right

  • Tabular tasks can use attention mechanisms without custom feature engineering.
  • Predictions remain accurate even when input features contain missing values or noise.
  • The learned embeddings support direct inspection of which feature combinations drive each prediction.
  • Unsupervised pre-training on unlabeled tables produces useful starting embeddings for downstream labeled tasks (one such objective is sketched after this list).
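
To make the pre-training bullet concrete, here is a hedged sketch of a masked-feature objective in the spirit of the paper's unsupervised procedure, reusing the TabTransformerSketch class above. The 30 percent mask rate, the choice of code 0 as a mask token, and the per-column reconstruction heads are assumptions; the paper's pre-training procedure may differ in its masking scheme and objective.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MaskedColumnPretrainer(nn.Module):
    """Hypothetical masked-feature pre-training objective (not the paper's exact recipe).

    Reuses the embeddings and Transformer from TabTransformerSketch: a fraction
    of categorical cells is masked, the corrupted row is encoded, and per-column
    linear heads reconstruct the original codes from the contextual embeddings.
    """

    def __init__(self, model, cardinalities, mask_prob=0.30):
        super().__init__()
        self.model = model
        self.mask_prob = mask_prob
        embed_dim = model.embeds[0].embedding_dim
        self.heads = nn.ModuleList(
            [nn.Linear(embed_dim, card) for card in cardinalities]
        )

    def forward(self, x_categ):
        mask = torch.rand(x_categ.shape) < self.mask_prob
        corrupted = x_categ.masked_fill(mask, 0)      # code 0 stands in for [MASK]
        tokens = torch.stack(
            [emb(corrupted[:, i]) for i, emb in enumerate(self.model.embeds)], dim=1
        )
        contextual = self.model.transformer(tokens)   # (batch, n_categ, embed_dim)
        losses = []
        for i, head in enumerate(self.heads):
            per_cell = F.cross_entropy(
                head(contextual[:, i]), x_categ[:, i], reduction="none"
            )
            # Only masked cells contribute to the reconstruction loss.
            denom = mask[:, i].sum().clamp(min=1)
            losses.append((per_cell * mask[:, i]).sum() / denom)
        return torch.stack(losses).mean()


# Toy usage with the model and x_categ from the previous sketch.
pretrainer = MaskedColumnPretrainer(model, cardinalities=[12, 7, 30])
loss = pretrainer(x_categ)     # scalar loss to minimize on unlabeled rows
```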

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • The same contextual-embedding approach may transfer to tables that mix numeric, categorical, and text columns.
  • Larger-scale versions could serve as general-purpose backbones for enterprise tabular pipelines.
  • Interpretability gains might reduce the need for separate post-hoc explanation tools.

Load-bearing premise

The fifteen public datasets used for testing represent the range of distributions and noise patterns found in real tabular prediction tasks.

What would settle it

A new tabular dataset, after identical tuning of all methods, where TabTransformer shows no AUC improvement over the strongest deep learning baseline.

Original abstract

We propose TabTransformer, a novel deep tabular data modeling architecture for supervised and semi-supervised learning. The TabTransformer is built upon self-attention based Transformers. The Transformer layers transform the embeddings of categorical features into robust contextual embeddings to achieve higher prediction accuracy. Through extensive experiments on fifteen publicly available datasets, we show that the TabTransformer outperforms the state-of-the-art deep learning methods for tabular data by at least 1.0% on mean AUC, and matches the performance of tree-based ensemble models. Furthermore, we demonstrate that the contextual embeddings learned from TabTransformer are highly robust against both missing and noisy data features, and provide better interpretability. Lastly, for the semi-supervised setting we develop an unsupervised pre-training procedure to learn data-driven contextual embeddings, resulting in an average 2.1% AUC lift over the state-of-the-art methods.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The paper proposes TabTransformer, a Transformer-based architecture for supervised and semi-supervised tabular data modeling. It uses self-attention layers to produce contextual embeddings from categorical feature embeddings, claiming this yields higher accuracy than prior deep learning methods. Through experiments on 15 public datasets, it reports that TabTransformer outperforms state-of-the-art deep tabular models by at least 1.0% mean AUC while matching tree-based ensembles (e.g., XGBoost), demonstrates robustness to missing and noisy features, provides improved interpretability via attention, and shows a 2.1% AUC gain from unsupervised pre-training in the semi-supervised case.

Significance. If the empirical claims hold under rigorous controls, the work would be significant for the tabular modeling literature: it provides concrete evidence that attention mechanisms can close the gap with tree ensembles on standard benchmarks while adding robustness and interpretability benefits. The use of 15 public datasets and the semi-supervised pre-training procedure are positive elements that could influence follow-on research on hybrid DL-tree approaches.

major comments (2)
  1. [§4] Experiments: The description of baseline implementations provides no details on the hyperparameter search space, number of trials, or compute budget allocated to tree-based methods such as XGBoost and LightGBM. Because tabular performance is known to be highly sensitive to these choices, the central claim that TabTransformer 'matches the performance of tree-based ensemble models' cannot be evaluated without this information.
  2. [§4.3] Table 2: Mean AUC differences (1.0% over DL baselines) are reported without standard deviations across the 15 datasets, without per-dataset statistical tests, and without explicit confirmation that all baselines received equivalent tuning effort. This omission directly affects the reliability of the performance claims.
minor comments (1)
  1. [§3.2] The notation for the multi-head attention output and the subsequent feed-forward layers could be made more explicit by including the exact dimensionality transformations (standard shape conventions are sketched below).
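
For reference, the standard shape bookkeeping the minor comment asks for looks like the following; the embedding width d, head count h, and the feed-forward expansion factor 4d are conventional choices, not necessarily the paper's.

```latex
% Standard multi-head attention shapes for n column embeddings of width d
% (illustrative conventions; the paper's exact d, h, and FFN expansion may differ).
\begin{align*}
X &\in \mathbb{R}^{n \times d}, \qquad
  W_i^{Q}, W_i^{K}, W_i^{V} \in \mathbb{R}^{d \times d_h}, \qquad d_h = d/h, \\
\mathrm{head}_i &= \operatorname{softmax}\!\left(\frac{(X W_i^{Q})(X W_i^{K})^{\top}}{\sqrt{d_h}}\right) X W_i^{V}
  \;\in\; \mathbb{R}^{n \times d_h}, \\
\mathrm{MHA}(X) &= [\mathrm{head}_1; \dots; \mathrm{head}_h]\, W^{O} \in \mathbb{R}^{n \times d},
  \qquad W^{O} \in \mathbb{R}^{d \times d}, \\
\mathrm{FFN}(Z) &= \max(0,\, Z W_1 + b_1)\, W_2 + b_2, \qquad
  W_1 \in \mathbb{R}^{d \times 4d},\; W_2 \in \mathbb{R}^{4d \times d}.
\end{align*}
```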

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive comments on our manuscript. We address each major point below and will incorporate revisions to improve the clarity and rigor of the experimental reporting.

Point-by-point responses
  1. Referee: [§4] Experiments: The description of baseline implementations provides no details on the hyperparameter search space, number of trials, or compute budget allocated to tree-based methods such as XGBoost and LightGBM. Because tabular performance is known to be highly sensitive to these choices, the central claim that TabTransformer 'matches the performance of tree-based ensemble models' cannot be evaluated without this information.

    Authors: We agree that additional details on hyperparameter tuning are necessary for reproducibility and to substantiate the performance comparison. In the revised version, we will expand Section 4 with explicit hyperparameter search spaces for XGBoost, LightGBM, and all other baselines, the number of random or grid search trials conducted, and the compute budget (e.g., number of CPU/GPU hours) allocated to each method. This will allow direct evaluation of tuning equivalence (an illustrative equal-budget search protocol is sketched after these responses). revision: yes

  2. Referee: [§4.3] Table 2: Mean AUC differences (1.0% over DL baselines) are reported without standard deviations across the 15 datasets, without per-dataset statistical tests, and without explicit confirmation that all baselines received equivalent tuning effort. This omission directly affects the reliability of the performance claims.

    Authors: We acknowledge that reporting variability and statistical significance strengthens the claims. We will revise Table 2 to include standard deviations of the AUC values across the 15 datasets. We will also add per-dataset paired statistical tests (Wilcoxon signed-rank) between TabTransformer and each baseline, along with an explicit statement confirming that all methods received comparable tuning effort via the same search protocol. These additions will be included in the updated Section 4.3. revision: yes
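
To illustrate Response 1, a hedged sketch of an equal-budget tuning protocol: a fixed-trial random search over a hypothetical XGBoost space, scored on AUC to match the paper's metric. The search space, trial count, and cross-validation setup are assumptions, not the values the revision will report.

```python
from scipy.stats import loguniform, randint
from sklearn.model_selection import RandomizedSearchCV
from xgboost import XGBClassifier

# Hypothetical search space; the revised Section 4 would list the real one.
xgb_space = {
    "n_estimators": randint(100, 1000),
    "max_depth": randint(3, 12),
    "learning_rate": loguniform(1e-3, 3e-1),
    "subsample": [0.6, 0.8, 1.0],
    "colsample_bytree": [0.6, 0.8, 1.0],
}

def tune_xgb(X_train, y_train, n_trials=50, seed=0):
    """Fixed-budget random search so every baseline gets the same tuning effort."""
    search = RandomizedSearchCV(
        XGBClassifier(random_state=seed),
        param_distributions=xgb_space,
        n_iter=n_trials,          # identical trial budget across methods
        scoring="roc_auc",        # same metric the paper reports
        cv=5,
        random_state=seed,
        n_jobs=-1,
    )
    search.fit(X_train, y_train)
    return search.best_estimator_, search.best_score_
```

To illustrate Response 2, a minimal sketch of the promised per-dataset paired Wilcoxon signed-rank test using scipy.stats.wilcoxon; the AUC arrays here are synthetic stand-ins, and the real inputs would be the per-dataset test AUCs from the revised Table 2.

```python
import numpy as np
from scipy.stats import wilcoxon

rng = np.random.default_rng(0)

# Synthetic stand-ins for per-dataset test AUCs on the 15 benchmarks
# (NOT real results; the revised Table 2 would supply the actual numbers).
auc_baseline = rng.uniform(0.70, 0.95, size=15)
auc_tabtransformer = np.clip(auc_baseline + rng.normal(0.01, 0.01, size=15), 0.0, 1.0)

# Paired, one-sided test across datasets: is TabTransformer's per-dataset AUC
# systematically higher than the strongest baseline's?
stat, p_value = wilcoxon(auc_tabtransformer, auc_baseline, alternative="greater")
print(f"Wilcoxon statistic = {stat:.1f}, one-sided p = {p_value:.4f}")
```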

Circularity Check

0 steps flagged

No circularity in derivation chain; all claims empirical

full rationale

The paper introduces TabTransformer as a new architecture using self-attention on categorical embeddings for tabular data and supports its claims solely through direct experimental comparisons on 15 public datasets against DL and tree baselines. No equations, uniqueness theorems, or ansatzes are derived or invoked that reduce by construction to fitted inputs from the same evaluation data. Performance claims (AUC lifts) are presented as measured outcomes rather than predictions forced by the model's own parameterization, leaving the argument self-contained without circular reduction.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests on the domain assumption that self-attention mechanisms can usefully model feature interactions when applied to learned embeddings of categorical columns in tabular data; no free parameters or invented entities are specified in the abstract.

axioms (1)
  • domain assumption Self-attention applied to categorical feature embeddings produces robust contextual representations that improve downstream prediction accuracy on tabular data.
    This is the core modeling assumption invoked to justify the architecture.

pith-pipeline@v0.9.0 · 5446 in / 1347 out tokens · 45803 ms · 2026-05-16T21:28:50.990411+00:00 · methodology

discussion (0)


Forward citations

Cited by 20 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. From Schema to Signal: Retrieval-Augmented Modeling for Relational Data Analytics

    cs.DB 2026-05 unverdicted novelty 7.0

    RAM augments relational graph models with attribute-semantic retrieval via random-walk documents and two contrastive augmentations (ATRA, ETRA) to achieve state-of-the-art results on five real-world databases.

  2. GeoViSTA: Geospatial Vision-Tabular Transformer for Multimodal Environment Representation

    cs.LG 2026-05 unverdicted novelty 7.0

    GeoViSTA learns unified geospatial embeddings from co-registered imagery and tabular data via bilateral cross-attention and joint masked autoencoding, yielding better linear probing performance on mortality and fire h...

  3. LUCAS-MEGA: A Large-Scale Multimodal Dataset for Representation Learning in Soil-Environment Systems

    cs.LG 2026-05 unverdicted novelty 7.0

    LUCAS-MEGA fuses 68 heterogeneous soil datasets into a 70k-sample multimodal collection and demonstrates its value by pretraining a tabular transformer whose representations align with established soil processes.

  4. LUCAS-MEGA: A Large-Scale Multimodal Dataset for Representation Learning in Soil-Environment Systems

    cs.LG 2026-05 unverdicted novelty 7.0

    LUCAS-MEGA fuses 68 soil-environment datasets into a 70k-sample multimodal resource that supports self-supervised pretraining of SoilFormer, whose representations align with known soil processes.

  5. AniMatrix: An Anime Video Generation Model that Thinks in Art, Not Physics

    cs.CV 2026-05 unverdicted novelty 7.0

    AniMatrix generates anime videos using a structured taxonomy of artistic production variables, dual-channel conditioning, a style-motion curriculum, and deformation-aware optimization to prioritize art over physics.

  6. AniMatrix: An Anime Video Generation Model that Thinks in Art, Not Physics

    cs.CV 2026-05 unverdicted novelty 7.0

    AniMatrix generates anime videos by structuring artistic production rules into a controllable taxonomy and training the model to prioritize those rules over physical realism, achieving top scores from professional ani...

  7. AniMatrix: An Anime Video Generation Model that Thinks in Art, Not Physics

    cs.CV 2026-05 unverdicted novelty 7.0

    AniMatrix generates anime videos using a production knowledge taxonomy, dual-channel conditioning, style-motion curriculum, and deformation-aware preference optimization, outperforming baselines in animator evaluation...

  8. VT-Bench: A Unified Benchmark for Visual-Tabular Multi-Modal Learning

    cs.CV 2026-05 unverdicted novelty 7.0

    VT-Bench is the first unified benchmark aggregating 14 visual-tabular datasets with over 756K samples and evaluating 23 models to expose challenges in this multi-modal area.

  9. ASD-Bench: A Four-Axis Comprehensive Benchmark of AI Models for Autism Spectrum Disorder

    cs.LG 2026-05 unverdicted novelty 6.0

    ASD-Bench evaluates 17 ML and deep learning models on 4,068 AQ-10 records across child, adolescent, and adult cohorts, showing high adult performance, harder adolescent classification, shifting feature importance, and...

  10. Mitigating Label Shift in Tabular In-Context Learning via Test-Time Posterior Adjustment

    cs.LG 2026-05 unverdicted novelty 6.0

    DistPFN is a test-time posterior adjustment that rescales TabPFN class probabilities to reduce overfitting to the training class distribution under label shift.

  11. DynaTab: Dynamic Feature Ordering as Neural Rewiring for High-Dimensional Tabular Data

    cs.LG 2026-05 unverdicted novelty 6.0

    DynaTab dynamically reorders features in tabular data via neural rewiring and reports statistically significant gains over 45 baselines on 36 high-dimensional datasets.

  12. ReSS: Learning Reasoning Models for Tabular Data Prediction via Symbolic Scaffold

    cs.AI 2026-04 unverdicted novelty 6.0

    ReSS uses decision-tree scaffolds to fine-tune LLMs for faithful tabular reasoning, reporting up to 10% gains over baselines on medical and financial data.

  13. Weight-Informed Self-Explaining Clustering for Mixed-Type Tabular Data

    cs.LG 2026-04 unverdicted novelty 6.0

    WISE unifies representation via BEP, feature weighting via LOFO, two-stage clustering, and intrinsic explanations via DFI for mixed-type tabular data, outperforming baselines on six datasets.

  14. From Uniform to Learned Knots: A Study of Spline-Based Numerical Encodings for Tabular Deep Learning

    cs.LG 2026-04 unverdicted novelty 6.0

    Spline encodings for numerical features show task-dependent performance in tabular deep learning, with piecewise-linear encoding robust for classification and variable results for regression depending on spline family...

  15. Focused PU learning from imbalanced data

    cs.LG 2026-05 unverdicted novelty 5.0

    A focused empirical risk estimator for PU learning achieves state-of-the-art results on imbalanced datasets under SCAR and SAR labeling mechanisms.

  16. Evaluating Tabular Representation Learning for Network Intrusion Detection

    cs.LG 2026-05 unverdicted novelty 5.0

    Tabular representation learning for network intrusion detection exhibits strong dataset-model dependency, with supervised methods outperforming unsupervised anomaly detection and limited but possible cross-dataset gen...

  17. ZAYAN: Disentangled Contrastive Transformer for Tabular Remote Sensing Data

    cs.LG 2026-04 unverdicted novelty 5.0

    ZAYAN introduces feature-level zero-anchor contrastive pretraining that produces disentangled embeddings and improves classification accuracy on remote sensing tabular datasets over standard deep learning baselines.

  18. Evaluating Deep Learning Models for Multiclass Classification of LIGO Gravitational-Wave Glitches

    gr-qc 2026-04 unverdicted novelty 5.0

    Benchmark finds some deep learning models match gradient-boosted trees on LIGO glitch classification with fewer parameters and partially consistent feature importance across architectures.

  19. PRAGMA: Revolut Foundation Model

    cs.LG 2026-04 unverdicted novelty 5.0

    PRAGMA pre-trains a Transformer on heterogeneous banking events with a tailored self-supervised masked objective, yielding embeddings that support strong downstream performance on credit scoring, fraud detection, and ...

  20. Integrating SAINT with Tree-Based Models: A Case Study in Employee Attrition Prediction

    cs.LG 2026-04 unverdicted novelty 2.0

    Standalone tree-based models outperform both SAINT and SAINT-embedding hybrids for employee attrition prediction on tabular HR data.

Reference graph

Works this paper leans on

99 extracted references · 99 canonical work pages · cited by 17 Pith papers · 8 internal anchors
