pith. machine review for the scientific record.

arxiv: 1610.02136 · v3 · submitted 2016-10-07 · 💻 cs.NE · cs.CV · cs.LG

Recognition: no theorem link

A Baseline for Detecting Misclassified and Out-of-Distribution Examples in Neural Networks

Authors on Pith no claims yet

Pith reviewed 2026-05-12 18:18 UTC · model grok-4.3

classification 💻 cs.NE · cs.CV · cs.LG
keywords misclassification detection · out-of-distribution detection · softmax probability · neural network confidence · baseline method · computer vision · natural language processing · speech recognition

The pith

Maximum softmax probabilities are higher for correctly classified inputs than for misclassified or out-of-distribution ones.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

This paper introduces a straightforward baseline for spotting misclassified examples or inputs from outside the training distribution by looking at the highest probability assigned by a neural network's softmax layer. The core observation is that correct predictions usually come with a more peaked probability distribution, while errors and unfamiliar inputs produce flatter or lower peak values. The authors evaluate the approach on computer vision, natural language, and speech recognition tasks and find consistent separation across all three domains. They also note that the baseline can be improved upon in some settings, leaving room for more sophisticated detectors.

Core claim

The paper shows that the maximum value in the softmax probability vector tends to be larger when a neural network classifies an input correctly and smaller when the input is misclassified or drawn from a different distribution than the training data. By defining detection tasks in vision, language, and speech, the authors demonstrate that thresholding on this maximum probability alone yields usable detection performance without any changes to the underlying model.

What carries the argument

The maximum softmax probability, which serves as a simple scalar score for how peaked the model's output distribution is and thereby signals prediction reliability.
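The score the argument rests on is one line on top of any classifier's output. A minimal sketch (illustrative code, not the authors'; the logit values are hypothetical):

```python
import numpy as np

def softmax(logits):
    # Numerically stable softmax over the last axis.
    z = logits - logits.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def max_softmax_score(logits):
    # The baseline's detection score: the peak of the output distribution.
    return softmax(logits).max(axis=-1)

# A peaked (confident) output vs. a flat (uncertain) one.
confident = np.array([8.0, 0.5, 0.1])
uncertain = np.array([1.0, 0.9, 0.8])
print(max_softmax_score(confident) > max_softmax_score(uncertain))  # True
```

Thresholding this single scalar is the entire detector; nothing in the model itself changes.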

If this is right

  • The baseline requires no model retraining or architectural changes and can be applied to any existing classifier that produces softmax outputs.
  • Detection performance holds across vision, language, and speech domains, suggesting the signal is not limited to one data type.
  • Better detectors can be built on top of this baseline, as the authors show cases where the simple method is outperformed.
  • Post-hoc application allows reliability checks on deployed models without access to training data.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • If the separation holds broadly, low-max-probability predictions could automatically trigger human review or safer fallback behaviors in real-world systems.
  • The observation suggests that some notion of model uncertainty is already encoded in the raw output distribution and could be combined with other signals like temperature scaling.
  • Testing the same signal on larger modern architectures or different training objectives would show whether the pattern persists or requires adjustment.
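The first extension above can be sketched as a thin wrapper around any classifier's softmax output (a hedged illustration: the threshold value and the deferral behavior are hypothetical, and in practice the cutoff would be tuned on validation data):

```python
def predict_with_fallback(probs, threshold=0.7):
    """Return a class index, or None to signal 'defer to a safer fallback'.

    probs: a softmax probability vector from any deployed classifier.
    threshold: hypothetical cutoff on the maximum softmax probability.
    """
    top = max(range(len(probs)), key=lambda i: probs[i])
    return top if probs[top] >= threshold else None

print(predict_with_fallback([0.90, 0.05, 0.05]))  # 0 (confident: predict)
print(predict_with_fallback([0.40, 0.35, 0.25]))  # None (flat: defer)
```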

Load-bearing premise

The maximum softmax probability reliably and consistently separates correct classifications from errors and in-distribution inputs from out-of-distribution inputs across models and domains.

What would settle it

A dataset of misclassified or out-of-distribution examples where the maximum softmax probability is equal to or higher than that of correctly classified in-distribution examples, with no usable threshold separating the two groups.
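Operationally, "no usable threshold" means the score's AUROC collapses to 0.5 or below. A minimal pairwise AUROC makes the criterion concrete (the scores below are hypothetical max-softmax values, not data from the paper):

```python
def auroc(pos_scores, neg_scores):
    # Probability that a random correct example outscores a random
    # error/OOD example; ties count half. 0.5 means no separation.
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos_scores for n in neg_scores)
    return wins / (len(pos_scores) * len(neg_scores))

correct = [0.95, 0.90, 0.88]   # max-softmax of correct predictions
errors  = [0.60, 0.75, 0.92]   # max-softmax of errors / OOD inputs
print(auroc(correct, errors))  # 7/9 ≈ 0.78: partial separation
```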

read the original abstract

We consider the two related problems of detecting if an example is misclassified or out-of-distribution. We present a simple baseline that utilizes probabilities from softmax distributions. Correctly classified examples tend to have greater maximum softmax probabilities than erroneously classified and out-of-distribution examples, allowing for their detection. We assess performance by defining several tasks in computer vision, natural language processing, and automatic speech recognition, showing the effectiveness of this baseline across all. We then show the baseline can sometimes be surpassed, demonstrating the room for future research on these underexplored detection tasks.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

2 major / 3 minor

Summary. The paper claims that correctly classified in-distribution examples tend to exhibit higher maximum softmax probabilities than misclassified or out-of-distribution examples, enabling simple thresholding for detection. It introduces this as a parameter-free baseline, evaluates it empirically across computer vision, natural language processing, and automatic speech recognition tasks, and shows that the baseline can be surpassed by other approaches, thereby framing it as a starting point rather than a complete solution.

Significance. If the reported trends hold, the work is significant for establishing a reproducible, zero-parameter baseline that leverages already-computed model outputs for uncertainty estimation. By providing consistent empirical evidence across three distinct domains and explicitly inviting improvements, it supplies a clear reference point that subsequent research in out-of-distribution detection and reliable classification can build upon or compare against.

major comments (2)
  1. [§4] §4 (Experiments): The central claim that the baseline enables effective detection rests on reported performance differences, yet the manuscript does not include statistical significance tests, confidence intervals, or variance across random seeds for the AUROC or accuracy metrics on the detection tasks; this leaves open whether the observed separation is robust or could be explained by sampling variability.
  2. [§4.2] §4.2 (NLP experiments) and Table 2: The misclassification detection results are presented as aggregate trends without detailing the number of test examples per class or the distribution of maximum softmax values, making it difficult to judge whether the separation is practically usable or merely statistically detectable.
minor comments (3)
  1. [§3] The notation for the baseline (maximum softmax probability) is introduced without an explicit equation number, which would aid clarity when referring back to it in the experimental sections.
  2. [Figure 1] Figure 1 and related plots would benefit from explicit axis labels indicating the range of maximum softmax probabilities and from a legend distinguishing correct vs. incorrect or in-distribution vs. out-of-distribution curves.
  3. [§2] A small number of citations to prior work on softmax calibration or early OOD detection methods appear to be missing from the related-work section, which would help situate the baseline more precisely.

Simulated Authors' Rebuttal

2 responses · 0 unresolved

We thank the referee for their positive evaluation of the paper as a reproducible baseline and for the constructive major comments. We address each point below and will revise the manuscript accordingly to improve clarity and rigor.

read point-by-point responses
  1. Referee: [§4] §4 (Experiments): The central claim that the baseline enables effective detection rests on reported performance differences, yet the manuscript does not include statistical significance tests, confidence intervals, or variance across random seeds for the AUROC or accuracy metrics on the detection tasks; this leaves open whether the observed separation is robust or could be explained by sampling variability.

    Authors: We agree that formal statistical analyses would strengthen the presentation. While the separation trends hold consistently across many datasets and three distinct domains, we will add bootstrap confidence intervals for the AUROC values and report standard deviations over multiple random seeds for the key detection metrics in the revised manuscript. This will better demonstrate robustness to sampling variability. revision: yes

  2. Referee: [§4.2] §4.2 (NLP experiments) and Table 2: The misclassification detection results are presented as aggregate trends without detailing the number of test examples per class or the distribution of maximum softmax values, making it difficult to judge whether the separation is practically usable or merely statistically detectable.

    Authors: We appreciate this suggestion for added transparency. The NLP experiments rely on standard datasets (e.g., IMDB with known test-set sizes and class balance). In the revision we will expand Table 2 and/or add a short supplementary section with the number of test examples per class and summary statistics (or histograms) of the maximum softmax probabilities for correctly classified, misclassified, and out-of-distribution examples. This will allow readers to assess practical usability directly. revision: yes
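The bootstrap confidence intervals promised in response 1 could be computed along these lines (an illustrative sketch over synthetic score lists, not the authors' evaluation code):

```python
import random

def auroc(pos, neg):
    # Pairwise AUROC: probability a positive outscores a negative (ties half).
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def bootstrap_auroc_ci(pos, neg, n_boot=2000, alpha=0.05, seed=0):
    # Resample both score sets with replacement, recompute AUROC each
    # time, and report the empirical (alpha/2, 1 - alpha/2) interval.
    rng = random.Random(seed)
    stats = sorted(
        auroc([rng.choice(pos) for _ in pos],
              [rng.choice(neg) for _ in neg])
        for _ in range(n_boot)
    )
    lo = stats[int(n_boot * alpha / 2)]
    hi = stats[int(n_boot * (1 - alpha / 2)) - 1]
    return lo, hi
```

With fully separated scores every resample yields AUROC 1.0 and the interval degenerates to (1.0, 1.0); overlapping scores widen it, which is exactly the robustness question the referee raised.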

Circularity Check

0 steps flagged

No significant circularity

full rationale

The paper presents an empirical baseline that directly uses the maximum value from a model's already-computed softmax output to flag misclassified or out-of-distribution inputs. No derivation, parameter fitting, or first-principles argument is offered that reduces to its own inputs by construction; the central observation is tested via experiments on held-out data across vision, NLP, and speech tasks and is explicitly positioned as a simple, improvable starting point rather than a closed deductive system.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The approach rests on the standard property of softmax outputs in trained classifiers and on empirical observation rather than new parameters or invented entities.

axioms (1)
  • domain assumption: Softmax probabilities produced by a trained neural network reflect classification confidence in a manner that separates correct from incorrect predictions.
    This is the core premise invoked when the abstract states that correctly classified examples tend to have greater maximum softmax probabilities.

pith-pipeline@v0.9.0 · 5388 in / 1167 out tokens · 46240 ms · 2026-05-12T18:18:33.696719+00:00 · methodology


Forward citations

Cited by 31 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. SGC-RML: A reliable and interpretable longitudinal assessment for PD in real-world DNS

    cs.LG 2026-05 unverdicted novelty 7.0

    SGC-RML creates an 8D symptom atlas from multimodal PD data and integrates conformal calibration to deliver reliable, rejectable longitudinal assessments.

  2. Knowing when to trust machine-learned interatomic potentials

    cs.LG 2026-05 unverdicted novelty 7.0

    PROBE recasts MLIP uncertainty quantification as selective classification by training a compact discriminative classifier on frozen per-atom backbone embeddings, yielding a reliability probability that tracks actual e...

  3. CURE-OOD: Benchmarking Out-of-Distribution Detection for Survival Prediction

    cs.CV 2026-05 unverdicted novelty 7.0

    CURE-OOD is the first benchmark for evaluating OOD detection in survival prediction under controlled CT acquisition shifts, showing that standard detectors often fail and providing a survival-aware baseline.

  4. Sparsity as a Key: Unlocking New Insights from Latent Structures for Out-of-Distribution Detection

    cs.CV 2026-04 unverdicted novelty 7.0

    Sparse autoencoders on ViT class tokens reveal stable Class Activation Profiles for in-distribution data, enabling OOD detection via divergence from core energy profiles.

  5. Walking Through Uncertainty: An Empirical Study of Uncertainty Estimation for Audio-Aware Large Language Models

    eess.AS 2026-04 unverdicted novelty 7.0

    Semantic-level and verification-based uncertainty methods outperform token-level baselines for audio reasoning in ALLMs, but their relative performance on hallucination and unanswerable-question benchmarks is model- a...

  6. Why Training-Free Token Reduction Collapses: The Inherent Instability of Pairwise Scoring Signals

    cs.AI 2026-04 unverdicted novelty 7.0

    Pairwise scoring signals in Vision Transformer token reduction are inherently unstable due to high perturbation counts and degrade in deep layers, causing collapse, while unary signals with triage enable CATIS to reta...

  7. SciPredict: Can LLMs Predict the Outcomes of Scientific Experiments in Natural Sciences?

    cs.AI 2026-04 unverdicted novelty 7.0

    LLMs predict outcomes of real scientific experiments at 14-26% accuracy, comparable to human experts, but lack calibration on prediction reliability while humans demonstrate strong calibration.

  8. Evidential Transformation Network: Turning Pretrained Models into Evidential Models for Post-hoc Uncertainty Estimation

    cs.LG 2026-04 unverdicted novelty 7.0

    ETN is a lightweight post-hoc module that applies a learned sample-dependent affine transformation to pretrained model logits and interprets the outputs as Dirichlet parameters to enable efficient uncertainty estimation.

  9. Bridging the Missing-Modality Gap: Improving Text-Only Calibration of Vision Language Models

    cs.CL 2026-04 unverdicted novelty 7.0

    A new Latent Imagination Module uses cross-attention to predict latent visual embeddings from text, improving accuracy and calibration of vision-language models on text-only inputs.

  10. SLE-FNO: Single-Layer Extensions for Task-Agnostic Continual Learning in Fourier Neural Operators

    cs.LG 2026-03 unverdicted novelty 7.0

    SLE-FNO achieves zero forgetting and strong plasticity-stability balance in continual learning for FNO surrogate models of pulsatile blood flow by adding minimal single-layer extensions across four out-of-distribution tasks.

  11. Do Machines Fail Like Humans? A Human-Centred Out-of-Distribution Spectrum for Mapping Error Alignment

    cs.AI 2026-03 unverdicted novelty 7.0

    A human-centered OOD spectrum based on perceptual difficulty shows vision-language models align best with human errors across regimes, with CNNs stronger on near-OOD and ViTs on far-OOD.

  12. OPT: Open Pre-trained Transformer Language Models

    cs.CL 2022-05 unverdicted novelty 7.0

    OPT releases open decoder-only transformers up to 175B parameters that match GPT-3 performance at one-seventh the carbon cost, along with code and training logs.

  13. A₃B₂: Adaptive Asymmetric Adapter for Alleviating Branch Bias in Vision-Language Image Classification with Few-Shot Learning

    cs.CV 2026-05 unverdicted novelty 6.0

    A3B2 adds uncertainty-aware dampening and asymmetric MoE-style adapters to balance image and text branches, outperforming 11 baselines on 11 few-shot datasets.

  14. Domain Restriction via Multi SAE Layer Transitions

    cs.AI 2026-05 unverdicted novelty 6.0

    Multi-layer SAE transitions capture domain-specific signatures that distinguish OOD texts in Gemma-2 models.

  15. HamBR: Active Decision Boundary Restoration Based on Hamiltonian Dynamics for Learning with Noisy Labels

    cs.CV 2026-05 unverdicted novelty 6.0

    HamBR uses Spherical HMC to probe ambiguous regions and synthesize virtual outliers with energy-based repulsion to restore decision boundaries degraded by noisy labels, achieving SOTA on CIFAR and real-world benchmarks.

  16. Scaling Pretrained Representations Enables Label-Free Out-of-Distribution Detection Without Fine-Tuning

    cs.LG 2026-05 unverdicted novelty 6.0

    Scaling pretrained representations improves label-free OOD detection on frozen backbones, causing performance gaps between global and local detectors to vanish across vision and language tasks.

  17. Perturb and Correct: Post-Hoc Ensembles using Affine Redundancy

    cs.LG 2026-05 unverdicted novelty 6.0

    Perturb-and-Correct generates epistemically diverse predictors from a single pretrained network via hidden-layer perturbations followed by affine least-squares corrections that enforce agreement on calibration data.

  18. Empirical Insights of Test Selection Metrics under Multiple Testing Objectives and Distribution Shifts

    cs.SE 2026-04 unverdicted novelty 6.0

    A broad empirical benchmark shows how 15 existing test selection metrics perform for fault detection, performance estimation, and retraining under corrupted, adversarial, temporal, natural, and label shifts across ima...

  19. Quantum Patches: Enhancing Robustness of Quantum Machine Learning Models

    quant-ph 2026-04 unverdicted novelty 6.0

    Random quantum circuits used as adversarial training data reduce successful attack rates on QML models for CIFAR-10 from 89.8% to 68.45% and for CINIC-10 from 94.23% to 78.68%.

  20. Inside-Out: Measuring Generalization in Vision Transformers Through Inner Workings

    cs.LG 2026-04 unverdicted novelty 6.0

    Circuit-based metrics from Vision Transformer internals provide better label-free proxies for generalization under distribution shift than existing methods like model confidence.

  21. Unsupervised domain adaptation for radioisotope identification in gamma spectroscopy

    cs.LG 2026-03 conditional novelty 6.0

    Unsupervised domain adaptation via feature alignment raises radioisotope identification accuracy on real LaBr3 gamma spectra from 0.754 to 0.904 for models trained only on synthetic data.

  22. Language Models (Mostly) Know What They Know

    cs.CL 2022-07 unverdicted novelty 6.0

    Language models show good calibration when asked to estimate the probability that their own answers are correct, with performance improving as models get larger.

  23. HEDP: A Hybrid Energy-Distance Prompt-based Framework for Domain Incremental Learning

    cs.AI 2026-05 unverdicted novelty 5.0

    HEDP uses energy regularization inspired by Helmholtz free energy plus hybrid energy-distance weighting in prompts to improve domain selection and achieve a 2.57% accuracy gain on benchmarks like CORe50 while mitigati...

  24. RADMI: Latent Information Aggregation as a Proxy for Model Uncertainty

    cs.CV 2026-05 unverdicted novelty 5.0

    RADMI aggregates mutual information across decoder layers to proxy epistemic uncertainty in segmentation networks, showing the highest correlation with deep ensemble baselines among single-pass methods.

  25. GR4CIL: Gap-compensated Routing for CLIP-based Class Incremental Learning

    cs.CV 2026-04 unverdicted novelty 5.0

    GR4CIL introduces gap-compensated routing to enable reliable task-aware knowledge routing in CLIP-based class incremental learning while preserving zero-shot generalization.

  26. Learning Uncertainty from Sequential Internal Dispersion in Large Language Models

    cs.CL 2026-04 unverdicted novelty 5.0

    SIVR detects LLM hallucinations by learning from token-wise and layer-wise variance patterns in internal hidden states, outperforming baselines with better generalization and less training data.

  27. DBMF: A Dual-Branch Multimodal Framework for Out-of-Distribution Detection

    cs.CV 2026-04 unverdicted novelty 5.0

    DBMF integrates scores from text-image and vision branches to improve out-of-distribution detection on endoscopic datasets by up to 24.84% over prior methods.

  28. Beyond Toy Benchmarks: A Systematic Evaluation of OOD Detection Methods For Plant Pathology Classification

    cs.CV 2026-05 unverdicted novelty 4.0

    Energy-based fine-tuning outperforms other OOD detection methods on the real-world Plant Pathology 2021 dataset, improving detection over softmax while maintaining in-distribution accuracy.

  29. Beyond Semantics: An Evidential Reasoning-Aware Multi-View Learning Framework for Trustworthy Mental Health Prediction

    cs.CL 2026-05 unverdicted novelty 4.0

    A multi-view evidential framework combines semantic and reasoning information to improve accuracy and provide trustworthy uncertainty estimates for mental health prediction on text data.

  30. Improving Model Safety by Targeted Error Correction

    cs.AI 2026-05 unverdicted novelty 4.0

    A dual GBDT error classifier reduces dangerous misclassifications by 12-34% on medical and animal image datasets with under 2% added latency.

  31. Gemma 2: Improving Open Language Models at a Practical Size

    cs.CL 2024-07 conditional novelty 3.0

    Gemma 2 models achieve leading performance at their sizes by combining established Transformer modifications with knowledge distillation for the 2B and 9B variants.

Reference graph

Works this paper leans on

78 extracted references · 78 canonical work pages · cited by 31 Pith papers

  1. [1]

    (2016) Concrete Problems in AI Safety

    Dario Amodei & Chris Olah & Jacob Steinhardt & Paul Christiano & John Schulman & Dan Mané. (2016) Concrete Problems in AI Safety. In arXiv

  2. [2]

    (2012) English Web Treebank

    Ann Bies & Justin Mott & Colin Warner & Seth Kulick. (2012) English Web Treebank

  3. [3]

    (2011) notMNIST dataset

    Yaroslav Bulatov. (2011) notMNIST dataset

  4. [4]

    (2006) The Relationship Between Precision-Recall and ROC Curves

    Jesse Davis & Mark Goadrich. (2006) The Relationship Between Precision-Recall and ROC Curves. In International Conference on Machine Learning (ICML)

  5. [5]

    (2005) An introduction to ROC analysis

    Tom Fawcett. (2005) An introduction to ROC analysis. In Pattern Recognition Letters

  6. [6]

    (1993) TIMIT Acoustic-Phonetic Continuous Speech Corpus

    John Garofolo & Lori Lamel & William Fisher & Jonathan Fiscus & David Pallett & Nancy Dahlgren & Victor Zue. (1993) TIMIT Acoustic-Phonetic Continuous Speech Corpus. Linguistic Data Consortium

  7. [7]

    (2011) Part-of-Speech Tagging for Twitter: Annotation, Features, and Experiments

    Kevin Gimpel & Nathan Schneider & Brendan O'Connor & Dipanjan Das & Daniel Mills & Jacob Eisenstein & Michael Heilman & Dani Yogatama & Jeffrey Flanigan & Noah A. Smith. (2011) Part-of-Speech Tagging for Twitter: Annotation, Features, and Experiments. In Association for Computational Linguistics (ACL)

  8. [8]

    (2015) Deep Neural Networks with Random Gaussian Weights: A Universal Classification Strategy?

    Raja Giryes & Guillermo Sapiro & Alex M. Bronstein. (2015) Deep Neural Networks with Random Gaussian Weights: A Universal Classification Strategy? In arXiv

  9. [9]

    (2015) Explaining and Harnessing Adversarial Examples

    Ian J. Goodfellow & Jonathon Shlens & Christian Szegedy. (2015) Explaining and Harnessing Adversarial Examples. In International Conference on Learning Representations (ICLR)

  10. [10]

    (2006) Connectionist Temporal Classification: Labeling Unsegmented Sequence Data with Recurrent Neural Networks

    Alex Graves & Santiago Fernández & Faustino Gomez & Jürgen Schmidhuber. (2006) Connectionist Temporal Classification: Labeling Unsegmented Sequence Data with Recurrent Neural Networks. In International Conference on Machine Learning (ICML)

  11. [11]

    (2016) Bridging Nonlinearities and Stochastic Regularizers with Gaussian Error Linear Units

    Dan Hendrycks & Kevin Gimpel. (2016) Bridging Nonlinearities and Stochastic Regularizers with Gaussian Error Linear Units. In arXiv

  12. [12]

    (2016) Improving and Generalizing Weight Initialization

    Dan Hendrycks & Kevin Gimpel. (2016) Improving and Generalizing Weight Initialization. In arXiv

  13. [13]

    (2000) The Aurora Experimental Framework for the Performance Evaluation of Speech recognition Systems Under Noisy Conditions

    Hans-Günter Hirsch & David Pearce. (2000) The Aurora Experimental Framework for the Performance Evaluation of Speech Recognition Systems Under Noisy Conditions. In ISCA ITRW ASR2000

  14. [14]

    (1997) Long short-term memory

    Sepp Hochreiter & Jürgen Schmidhuber. (1997) Long short-term memory. Neural Computation

  15. [15]

    (2004) Mining and Summarizing Customer Reviews

    Minqing Hu & Bing Liu. (2004) Mining and Summarizing Customer Reviews. In Knowledge Discovery and Data Mining (KDD)

  16. [16]

    (2015) Deep Unordered Composition Rivals Syntactic Methods for Text Classification

    Mohit Iyyer & Varun Manjunatha & Jordan Boyd-Graber & Hal Daumé III. (2015) Deep Unordered Composition Rivals Syntactic Methods for Text Classification. In Association for Computational Linguistics (ACL)

  17. [17]

    (2016) Bag of Tricks for Efficient Text Classification

    Armand Joulin & Edouard Grave & Piotr Bojanowski & Tomas Mikolov. (2016) Bag of Tricks for Efficient Text Classification. In arXiv

  18. [18]

    (2015) Adam: A Method for Stochastic Optimization

    Diederik Kingma & Jimmy Ba. (2015) Adam: A Method for Stochastic Optimization. In International Conference for Learning Representations (ICLR)

  19. [19]

    (2009) Learning Multiple Layers of Features from Tiny Images

    Alex Krizhevsky. (2009) Learning Multiple Layers of Features from Tiny Images

  20. [20]

    (2015) Human-level concept learning through probabilistic program induction

    Brenden M. Lake & Ruslan Salakhutdinov & Joshua B. Tenenbaum. (2015) Human-level concept learning through probabilistic program induction. In Science

  21. [21]

    (1995) Newsweeder: Learning to filter netnews

    Ken Lang. (1995) Newsweeder: Learning to filter netnews. In International Conference on Machine Learning (ICML)

  22. [22]

    (2004) RCV1: A New Benchmark Collection for Text Categorization Research

    David D. Lewis & Yiming Yang & Tony G. Rose & Fan Li. (2004) RCV1: A New Benchmark Collection for Text Categorization Research. In Journal of Machine Learning Research (JMLR)

  23. [23]

    (2016) SGDR: Stochastic Gradient Descent with Restarts

    Ilya Loshchilov & Frank Hutter. (2016) SGDR: Stochastic Gradient Descent with Restarts. In arXiv

  24. [24]

    (2015) Deep Neural Networks are Easily Fooled: High Confidence Predictions for Unrecognizable Images

    Anh Nguyen & Jason Yosinski & Jeff Clune. (2015) Deep Neural Networks are Easily Fooled: High Confidence Predictions for Unrecognizable Images. In Computer Vision and Pattern Recognition (CVPR)

  25. [25]

    (2015) Posterior calibration and exploratory analysis for natural language processing models

    Khanh Nguyen & Brendan O'Connor. (2015) Posterior calibration and exploratory analysis for natural language processing models. In Empirical Methods in Natural Language Processing (EMNLP)

  26. [26]

    (2011) Learning Word Vectors for Sentiment Analysis

    Andrew L. Maas & Raymond E. Daly & Peter T. Pham & Dan Huang & Andrew Y. Ng & Christopher Potts. (2011) Learning Word Vectors for Sentiment Analysis. In Association for Computational Linguistics (ACL)

  27. [27]

    (1999) Foundations of Statistical Natural Language Processing

    Chris Manning & Hinrich Schütze. (1999) Foundations of Statistical Natural Language Processing. MIT Press

  28. [28]

    (1999) Treebank-3

    Mitchell Marcus & Beatrice Santorini & Mary Ann Marcinkiewicz & Ann Taylor. (1999) Treebank-3

  29. [29]

    (2002) Thumbs up? Sentiment Classification using Machine Learning Techniques

    Bo Pang & Lillian Lee & Shivakumar Vaithyanathan. (2002) Thumbs up? Sentiment Classification using Machine Learning Techniques. In Empirical Methods in Natural Language Processing (EMNLP)

  30. [30]

    (1998) The case against accuracy estimation for comparing induction algorithms

    Foster Provost & Tom Fawcett & Ron Kohavi. (1998) The case against accuracy estimation for comparing induction algorithms. In International Conference on Machine Learning (ICML)

  31. [31]

    (2013) Investigation of Deep Neural Networks for Noise Robust Speech Recognition

    Michael L. Seltzer & Dong Yu & Yongqiang Wang. (2013) Investigation of Deep Neural Networks for Noise Robust Speech Recognition. In IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP)

  32. [32]

    (2016) Unsupervised Risk Estimation Using Only Conditional Independence Structure

    Jacob Steinhardt & Percy Liang. (2016) Unsupervised Risk Estimation Using Only Conditional Independence Structure. In Neural Information Processing Systems (NIPS)

  33. [33]

    (1997) Confidence Measures for Hybrid HMM/ANN Speech Recognition

    Gethin Williams & Steve Renals. (1997) Confidence Measures for Hybrid HMM/ANN Speech Recognition. In Proceedings of EuroSpeech

  34. [34]

    (2010) SUN Database: Large-scale Scene Recognition from Abbey to Zoo

    Jianxiong Xiao & James Hays & Krista A. Ehinger & Aude Oliva & Antonio Torralba. (2010) SUN Database: Large-scale Scene Recognition from Abbey to Zoo. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR)

  35. [35]

    (2010) Calibration of Confidence Measures in Speech Recognition

    Dong Yu & Jinyu Li & Li Deng. (2010) Calibration of Confidence Measures in Speech Recognition. In IEEE Transactions on Audio, Speech, and Language

  36. [36]

    (2016) Wide Residual Networks

    Sergey Zagoruyko & Nikos Komodakis. (2016) Wide Residual Networks. In arXiv

  37. [37]

    (2016) Augmenting Supervised Neural Networks with Unsupervised Objectives for Large-scale Image Classification

    Yuting Zhang & Kibok Lee & Honglak Lee. (2016) Augmenting Supervised Neural Networks with Unsupervised Objectives for Large-scale Image Classification. In International Conference on Machine Learning (ICML)


  48. [48]

    (2016) Methods for detecting adversarial images and a colorful saliency map

    Dan Hendrycks & Kevin Gimpel. (2016) Methods for detecting adversarial images and a colorful saliency map. In arXiv


  50. [50]

    Adjusting for dropout variance in batch normalization and weight initialization

    Dan Hendrycks and Kevin Gimpel. Adjusting for dropout variance in batch normalization and weight initialization. arXiv, 2016c

  51. [51]

    The Aurora experimental framework for the performance evaluation of speech recognition systems under noisy conditions

    Hans-Günter Hirsch and David Pearce. The Aurora experimental framework for the performance evaluation of speech recognition systems under noisy conditions. ISCA ITRW ASR2000, 2000

  52. [52]

    Long short-term memory

    Sepp Hochreiter and Jürgen Schmidhuber. Long short-term memory. Neural Computation, 1997

  53. [53]

    Mining and Summarizing Customer Reviews

    Minqing Hu and Bing Liu. Mining and Summarizing Customer Reviews. Knowledge Discovery and Data Mining (KDD), 2004

  54. [54]

    Deep Unordered Composition Rivals Syntactic Methods for Text Classification

    Mohit Iyyer, Varun Manjunatha, Jordan Boyd-Graber, and Hal Daumé III. Deep Unordered Composition Rivals Syntactic Methods for Text Classification. Association for Computational Linguistics (ACL), 2015

  55. [55]

    Bag of tricks for efficient text classification

    Armand Joulin, Edouard Grave, Piotr Bojanowski, and Tomas Mikolov. Bag of tricks for efficient text classification. arXiv, 2016

  56. [56]

    Adam: A Method for Stochastic Optimization

    Diederik Kingma and Jimmy Ba. Adam: A Method for Stochastic Optimization. International Conference on Learning Representations (ICLR), 2015

  57. [57]

    Learning Multiple Layers of Features from Tiny Images

    Alex Krizhevsky. Learning Multiple Layers of Features from Tiny Images, 2009

  58. [58]

    Human-level concept learning through probabilistic program induction

    Brenden M. Lake, Ruslan Salakhutdinov, and Joshua B. Tenenbaum. Human-level concept learning through probabilistic program induction. Science, 2015

  59. [59]

    Newsweeder: Learning to filter netnews

    Ken Lang. Newsweeder: Learning to filter netnews. In International Conference on Machine Learning (ICML), 1995

  60. [60]

    RCV1: A new benchmark collection for text categorization research

    David D. Lewis, Yiming Yang, Tony G. Rose, and Fan Li. RCV1: A new benchmark collection for text categorization research. Journal of Machine Learning Research (JMLR), 2004

  61. [61]

    SGDR: Stochastic gradient descent with restarts

    Ilya Loshchilov and Frank Hutter. SGDR: Stochastic gradient descent with restarts. arXiv, 2016

  62. [62]

    Learning word vectors for sentiment analysis

    Andrew L. Maas, Raymond E. Daly, Peter T. Pham, Dan Huang, Andrew Y. Ng, and Christopher Potts. Learning word vectors for sentiment analysis. In Association for Computational Linguistics (ACL), 2011

  63. [63]

    Foundations of Statistical Natural Language Processing

    Chris Manning and Hinrich Schütze. Foundations of Statistical Natural Language Processing. MIT Press, 1999

  64. [64]

    Building a large annotated corpus of English: The Penn Treebank

    Mitchell P. Marcus, Mary Ann Marcinkiewicz, and Beatrice Santorini. Building a large annotated corpus of English: The Penn Treebank. Computational Linguistics, 1993

  65. [65]

    Deep neural networks are easily fooled: High confidence predictions for unrecognizable images

    Anh Nguyen, Jason Yosinski, and Jeff Clune. Deep neural networks are easily fooled: High confidence predictions for unrecognizable images. In Computer Vision and Pattern Recognition (CVPR), 2015

  66. [66]

    Posterior calibration and exploratory analysis for natural language processing models

    Khanh Nguyen and Brendan O'Connor. Posterior calibration and exploratory analysis for natural language processing models. In Empirical Methods in Natural Language Processing (EMNLP), 2015

  67. [67]

    Improved part-of-speech tagging for online conversational text with word clusters

    Olutobi Owoputi, Brendan O'Connor, Chris Dyer, Kevin Gimpel, Nathan Schneider, and Noah A. Smith. Improved part-of-speech tagging for online conversational text with word clusters. In North American Chapter of the Association for Computational Linguistics (NAACL), 2013

  68. [68]

    Thumbs up? sentiment classification using machine learning techniques

    Bo Pang, Lillian Lee, and Shivakumar Vaithyanathan. Thumbs up? sentiment classification using machine learning techniques. In Empirical Methods in Natural Language Processing (EMNLP), 2002

  69. [69]

    The case against accuracy estimation for comparing induction algorithms

    Foster Provost, Tom Fawcett, and Ron Kohavi. The case against accuracy estimation for comparing induction algorithms. In International Conference on Machine Learning (ICML), 1998

  70. [70]

    The precision-recall plot is more informative than the ROC plot when evaluating binary classifiers on imbalanced datasets

    Takaya Saito and Marc Rehmsmeier. The precision-recall plot is more informative than the ROC plot when evaluating binary classifiers on imbalanced datasets. PLoS ONE, 2015

  71. [71]

    Investigation of deep neural networks for noise robust speech recognition

    Michael L. Seltzer, Dong Yu, and Yongqiang Wang. Investigation of deep neural networks for noise robust speech recognition. In IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), 2013

  72. [72]

    Unsupervised risk estimation using only conditional independence structure

    Jacob Steinhardt and Percy Liang. Unsupervised risk estimation using only conditional independence structure. In Neural Information Processing Systems (NIPS), 2016

  73. [73]

    THCHS-30: A free Chinese speech corpus

    Dong Wang and Xuewei Zhang. THCHS-30: A free Chinese speech corpus. Technical Report, 2015

  74. [74]

    Confidence measures for hybrid HMM/ANN speech recognition

    Gethin Williams and Steve Renals. Confidence measures for hybrid HMM/ANN speech recognition. In Proceedings of EuroSpeech, 1997

  75. [75]

    SUN database: Large-scale scene recognition from abbey to zoo

    Jianxiong Xiao, James Hays, Krista A. Ehinger, Aude Oliva, and Antonio Torralba. SUN database: Large-scale scene recognition from abbey to zoo. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2010

  76. [76]

    Calibration of confidence measures in speech recognition

    Dong Yu, Jinyu Li, and Li Deng. Calibration of confidence measures in speech recognition. IEEE Transactions on Audio, Speech, and Language Processing, 2010

  77. [77]

    Wide residual networks

    Sergey Zagoruyko and Nikos Komodakis. Wide residual networks. British Machine Vision Conference, 2016

  78. [78]

    Augmenting supervised neural networks with unsupervised objectives for large-scale image classification

    Yuting Zhang, Kibok Lee, and Honglak Lee. Augmenting supervised neural networks with unsupervised objectives for large-scale image classification. In International Conference on Machine Learning (ICML), 2016