pith. machine review for the scientific record.

arxiv: 1911.08731 · v2 · submitted 2019-11-20 · 💻 cs.LG · stat.ML

Recognition: 1 theorem link · Lean Theorem

Distributionally Robust Neural Networks for Group Shifts: On the Importance of Regularization for Worst-Case Generalization

Authors on Pith: no claims yet

Pith reviewed 2026-05-13 09:14 UTC · model grok-4.3

classification 💻 cs.LG stat.ML
keywords distributionally robust optimization · group shifts · neural networks · regularization · worst-group generalization · overparameterization

The pith

Regularization enables group DRO to achieve high worst-group accuracy on overparameterized neural networks

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Overparameterized neural networks can fit training data perfectly yet fail on atypical groups by learning spurious correlations. Standard group distributionally robust optimization fails in this regime because vanishing average loss implies vanishing worst-case loss on the training groups. The paper shows the root cause is poor generalization on some groups rather than optimization failure. Pairing group DRO with stronger regularization such as increased L2 penalties or early stopping raises worst-group accuracy by 10 to 40 percentage points on an NLI task and two image tasks while preserving high average accuracy. The results indicate regularization matters for worst-group generalization even when it is unnecessary for average generalization.

Core claim

Naively applying group DRO to overparameterized networks yields models with vanishing worst-case training loss yet poor test-time worst-group performance; adding stronger regularization restores high worst-group accuracy on held-out data from the same groups.

What carries the argument

Coupling group DRO with stronger-than-typical L2 regularization or early stopping to prevent overfitting on minority groups while minimizing worst-case loss.
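This objective can be sketched in a few lines of numpy. The sketch is illustrative, not the paper's implementation: per-group average losses are computed, the worst is minimized, and a stronger-than-typical L2 penalty (the paper's free parameter, here `l2_coef`) is added.

```python
import numpy as np

def group_dro_objective(per_example_loss, group_ids, weights, l2_coef):
    """Worst-group training loss plus an L2 penalty.

    per_example_loss: (n,) losses for each training example.
    group_ids: (n,) integer array assigning each example to a group.
    weights: flat array of model parameters (for the L2 term).
    l2_coef: regularization strength (the paper's free parameter).
    """
    group_losses = np.array([
        per_example_loss[group_ids == g].mean()
        for g in np.unique(group_ids)
    ])
    # Group DRO minimizes the worst (largest) group loss, not the average.
    worst_group_loss = group_losses.max()
    return worst_group_loss + l2_coef * np.sum(weights ** 2)
```

In the overparameterized regime all `group_losses` can be driven near zero, which is why the regularization term, rather than the max itself, carries the worst-group generalization behavior.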

Load-bearing premise

The failure of naive group DRO comes from poor generalization on groups rather than optimization difficulty, and the pre-defined training groups match the groups that matter at test time.

What would settle it

An experiment in which increasing regularization leaves worst-group accuracy unchanged or in which naive group DRO already reaches high worst-group accuracy without extra regularization on the same datasets.

read the original abstract

Overparameterized neural networks can be highly accurate on average on an i.i.d. test set yet consistently fail on atypical groups of the data (e.g., by learning spurious correlations that hold on average but not in such groups). Distributionally robust optimization (DRO) allows us to learn models that instead minimize the worst-case training loss over a set of pre-defined groups. However, we find that naively applying group DRO to overparameterized neural networks fails: these models can perfectly fit the training data, and any model with vanishing average training loss also already has vanishing worst-case training loss. Instead, the poor worst-case performance arises from poor generalization on some groups. By coupling group DRO models with increased regularization---a stronger-than-typical L2 penalty or early stopping---we achieve substantially higher worst-group accuracies, with 10-40 percentage point improvements on a natural language inference task and two image tasks, while maintaining high average accuracies. Our results suggest that regularization is important for worst-group generalization in the overparameterized regime, even if it is not needed for average generalization. Finally, we introduce a stochastic optimization algorithm, with convergence guarantees, to efficiently train group DRO models.
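The stochastic algorithm mentioned at the end of the abstract can be sketched as an online min-max loop: an adversary maintains a distribution `q` over groups and upweights the groups with high loss, while the model takes gradient steps on the `q`-weighted loss. The exponentiated-gradient form and the name `eta_q` are assumptions about the algorithm's shape, not a transcription of the paper's pseudocode.

```python
import numpy as np

def update_group_weights(q, group_losses, eta_q):
    """One exponentiated-gradient ascent step on the group weights q.

    Groups with higher loss are upweighted multiplicatively; the
    normalization keeps q on the probability simplex.
    """
    q = q * np.exp(eta_q * group_losses)
    return q / q.sum()

def weighted_loss(q, group_losses):
    """Loss for the model's gradient step: the adversarially
    upweighted groups contribute more."""
    return float(np.dot(q, group_losses))
```

Alternating these two steps each minibatch is what makes the procedure stochastic and efficient relative to recomputing the exact worst group every iteration.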

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper claims that naive application of group DRO to overparameterized neural networks fails to improve worst-group test accuracy because models achieve vanishing worst-case training loss (any low-average-loss model already has low worst-case loss on the training groups), with failures instead arising from poor generalization on some groups. Coupling group DRO with stronger regularization (increased L2 penalty or early stopping) yields 10-40 percentage point gains in worst-group accuracy on an NLI task and two image tasks while preserving high average accuracy. The authors also introduce a stochastic optimization algorithm for group DRO with convergence guarantees.

Significance. If the empirical results hold and the gains are attributable to generalization rather than optimization, the work is significant for demonstrating that regularization remains crucial for worst-group generalization even in the overparameterized regime where it is often unnecessary for average generalization. The practical improvements and the proposed algorithm with guarantees represent concrete contributions to distributionally robust learning.

major comments (2)
  1. [Abstract and §3 (method)] The assertion that overparameterized models achieve vanishing worst-case training loss under naive group DRO (any model with vanishing average training loss already has vanishing worst-case loss) is load-bearing for the narrative that failures are due to generalization rather than optimization. Given the non-convex, non-smooth min-max objective, the manuscript should explicitly report the achieved worst-group training losses (e.g., in §4 or Table 1) to confirm the optimizer reaches this regime on the reported tasks.
  2. [§5 (experiments)] The 10-40 pp worst-group improvements rely on tuning regularization strength (L2 coefficient or early-stopping epoch), listed as a free parameter. The central claim would be strengthened by showing that these gains are robust across a range of regularization values and that the optimal regularization for worst-group accuracy differs systematically from that for average accuracy (e.g., via additional curves in §5).
minor comments (2)
  1. [§4 (algorithm)] The convergence guarantees for the proposed stochastic algorithm are stated but the precise assumptions (e.g., on the loss smoothness or step-size schedule) and any empirical verification of convergence rates could be expanded for clarity.
  2. [Figures in §5] Figure captions and legends should explicitly note the number of random seeds or runs used to generate error bars when comparing average vs. worst-group accuracies.

Simulated Authors' Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive comments. We address each major comment below and will revise the manuscript to incorporate the suggested changes.

read point-by-point responses
  1. Referee: [Abstract and §3 (method)] The assertion that overparameterized models achieve vanishing worst-case training loss under naive group DRO (any model with vanishing average training loss already has vanishing worst-case loss) is load-bearing for the narrative that failures are due to generalization rather than optimization. Given the non-convex, non-smooth min-max objective, the manuscript should explicitly report the achieved worst-group training losses (e.g., in §4 or Table 1) to confirm the optimizer reaches this regime on the reported tasks.

    Authors: We agree that explicitly reporting the achieved worst-group training losses would strengthen the claim that the optimizer reaches the regime where average and worst-case training losses both vanish. In the revised manuscript we will add these values to Table 1 and the corresponding discussion in §4, confirming that worst-group training loss approaches zero under naive group DRO on the reported tasks. revision: yes

  2. Referee: [§5 (experiments)] The 10-40 pp worst-group improvements rely on tuning regularization strength (L2 coefficient or early-stopping epoch), listed as a free parameter. The central claim would be strengthened by showing that these gains are robust across a range of regularization values and that the optimal regularization for worst-group accuracy differs systematically from that for average accuracy (e.g., via additional curves in §5).

    Authors: We appreciate the suggestion to demonstrate robustness across regularization values. In the revised manuscript we will add plots in §5 showing worst-group and average accuracy as functions of the L2 coefficient and of the early-stopping epoch for both group DRO and ERM. These curves will illustrate that the improvements are robust over a range of regularization strengths and that the regularization level optimal for worst-group accuracy is systematically stronger than the level optimal for average accuracy. revision: yes
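The robustness check promised in this response amounts to a sweep over regularization strengths. A minimal harness is sketched below; `train_fn` and `eval_fn` are hypothetical placeholders for the user's own training and per-group evaluation routines, not functions from the paper.

```python
import numpy as np

def regularization_sweep(train_fn, eval_fn, l2_grid, group_ids):
    """For each L2 strength, train a model and record
    (l2, average accuracy, worst-group accuracy) for plotting.

    train_fn(l2) -> model and eval_fn(model, g) -> accuracy on
    group g are placeholders for the user's own routines.
    """
    rows = []
    for l2 in l2_grid:
        model = train_fn(l2)
        accs = np.array([eval_fn(model, g) for g in group_ids])
        # Average and worst-group accuracy can peak at different l2
        # values, which is the systematic difference the curves show.
        rows.append((l2, accs.mean(), accs.min()))
    return rows
```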

Circularity Check

0 steps flagged

No significant circularity; empirical results and algorithm are self-contained

full rationale

The paper's core argument rests on direct experimental observations that overparameterized models achieve vanishing worst-case training loss under naive group DRO (any low-average-loss model has low worst-case loss) and that stronger regularization yields 10-40 point worst-group gains. This is presented as an empirical finding on concrete tasks rather than a derivation that reduces by construction to fitted parameters or self-citations. The introduced stochastic optimizer is accompanied by stated convergence guarantees, supplying independent mathematical content. No load-bearing step invokes a uniqueness theorem from the authors' prior work, renames a known pattern, or defines a prediction in terms of its own inputs. The results remain externally falsifiable via replication on the reported NLI and image datasets.

Axiom & Free-Parameter Ledger

1 free parameter · 2 axioms · 0 invented entities

The work relies on standard supervised learning assumptions plus the existence of pre-defined groups that capture the relevant distribution shifts. No new entities are postulated. The regularization strength is a free parameter that must be tuned.

free parameters (1)
  • regularization strength (L2 coefficient or early-stopping epoch)
    Chosen to balance average and worst-group performance; the paper shows results for stronger-than-typical values.
axioms (2)
  • domain assumption The training groups are known and fixed in advance.
    Group DRO requires a partition of the training data into groups that are assumed to represent the shifts of interest.
  • standard math Standard neural network training dynamics apply.
    The analysis assumes gradient-based optimization reaches near-zero training loss on overparameterized models.
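When the free parameter is the early-stopping epoch rather than an L2 coefficient, tuning it amounts to checkpoint selection. A minimal sketch of that selection rule (an illustrative helper, not code from the paper):

```python
def pick_early_stop_epoch(worst_group_val_acc):
    """Early stopping as the regularizer: choose the checkpoint with
    the best worst-group validation accuracy, rather than the best
    average accuracy, which would favor the majority groups."""
    accs = list(worst_group_val_acc)
    return max(range(len(accs)), key=accs.__getitem__)
```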

pith-pipeline@v0.9.0 · 5525 in / 1433 out tokens · 35259 ms · 2026-05-13T09:14:02.824790+00:00 · methodology

discussion (0)


Forward citations

Cited by 26 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Distributionally Robust Multi-Task Reinforcement Learning via Adaptive Task Sampling

    cs.LG 2026-05 unverdicted novelty 7.0

    DRATS derives a minimax objective from a feasibility formulation of MTRL to adaptively sample tasks with the largest return gaps, leading to better worst-task performance on MetaWorld benchmarks.

  2. Structure from Strategic Interaction & Uncertainty: Risk Sensitive Games for Robust Preference Learning

    cs.GT 2026-05 unverdicted novelty 7.0

    Risk-sensitive preference games retain monotonicity via translation-invariant risk measures, enabling convergent self-play algorithms with stability bounds and empirical robustness across data strata.

  3. Are We Making Progress in Multimodal Domain Generalization? A Comprehensive Benchmark Study

    cs.CV 2026-05 unverdicted novelty 7.0

    A large-scale benchmark finds that recent multimodal domain generalization methods give only marginal gains over a plain ERM baseline, with no method winning consistently and all degrading sharply under corruption or ...

  4. eXplaining to Learn (eX2L): Regularization Using Contrastive Visual Explanation Pairs for Distribution Shifts

    cs.CV 2026-05 unverdicted novelty 7.0

    eX2L improves robustness to distribution shifts by penalizing similarity between Grad-CAM maps of a label classifier and a confounder classifier, reaching new SOTA average and worst-group accuracy on the Spawrious benchmark.

  5. Right Regions, Wrong Labels: Semantic Label Flips in Segmentation under Correlation Shift

    cs.CV 2026-04 unverdicted novelty 7.0

    Semantic segmentation models produce label flips within foreground regions under correlation shift, quantified by a new Flip diagnostic and an entropy-based flip-risk score.

  6. Learning from Synthetic Data via Provenance-Based Input Gradient Guidance

    cs.CV 2026-04 unverdicted novelty 7.0

    A framework that applies provenance-based guidance to input gradients during synthetic data training to promote learning from target regions only.

  7. Birds of a Feather Flock Together: Background-Invariant Representations via Linear Structure in VLMs

    cs.CV 2026-05 unverdicted novelty 6.0

    Exploiting linear structure in VLM embeddings, a synthetic-data pre-training method yields background-invariant representations that exceed 90% worst-group accuracy on Waterbirds even under 100% spurious correlation w...

  8. DuetFair: Coupling Inter- and Intra-Subgroup Robustness for Fair Medical Image Segmentation

    cs.CV 2026-05 unverdicted novelty 6.0

    DuetFair couples inter-subgroup adaptation with intra-subgroup robustness via FairDRO (dMoE plus subgroup-conditioned DRO) to boost worst-case and equity-scaled performance on medical segmentation benchmarks.

  9. Structure from Strategic Interaction & Uncertainty: Risk Sensitive Games for Robust Preference Learning

    cs.GT 2026-05 unverdicted novelty 6.0

    Risk-sensitive preference games using convex risk measures produce policies that are robust across data strata and match or exceed standard Nash learning performance without added cost.

  10. The Trap of Trajectory: Towards Understanding and Mitigating Spurious Correlations in Agentic Memory

    cs.LG 2026-05 unverdicted novelty 6.0

    Agentic memory improves clean reasoning but worsens performance when spurious patterns are present in stored trajectories; CAMEL calibration reduces this reliance while preserving clean performance.

  11. Robust Conditional Conformal Prediction via Branched Normalizing Flow

    cs.LG 2026-05 unverdicted novelty 6.0

    Branched Normalizing Flow improves conditional coverage robustness of conformal prediction under distribution shift by normalizing test inputs to the calibration distribution and mapping prediction sets back.

  12. Cheeger--Hodge Contrastive Learning for Structurally Robust Graph Representation Learning

    cs.LG 2026-04 unverdicted novelty 6.0

    CHCL aligns a Cheeger-Hodge joint signature across graph augmentations to produce embeddings that remain stable under local structural changes.

  13. Correcting Performance Estimation Bias in Imbalanced Classification with Minority Subconcepts

    cs.LG 2026-04 unverdicted novelty 6.0

    The authors introduce predicted-weighted balanced accuracy (pBA), a utility-weighted evaluation metric that uses predicted subconcept posteriors to reduce bias from within-class heterogeneity in imbalanced data.

  14. MGDA-Decoupled: Geometry-Aware Multi-Objective Optimisation for DPO-based LLM Alignment

    cs.LG 2026-04 unverdicted novelty 6.0

    MGDA-Decoupled applies geometry-based multi-objective optimization within the DPO framework to find shared descent directions that account for each objective's convergence dynamics, yielding higher win rates on UltraFeedback.

  15. CrossPan: A Comprehensive Benchmark for Cross-Sequence Pancreas MRI Segmentation and Generalization

    cs.CV 2026-04 unverdicted novelty 6.0

    CrossPan benchmark shows cross-sequence MRI domain shifts cause pancreas segmentation models to fail catastrophically, establishing sequence generalization as the primary barrier to clinical deployment over center var...

  16. CrossFlowDG: Bridging the Modality Gap with Cross-modal Flow Matching for Domain Generalization

    cs.CV 2026-04 unverdicted novelty 6.0

    CrossFlowDG bridges the modality gap in domain generalization by learning a continuous transformation that moves image embeddings to matching text embeddings using noise-free cross-modal flow matching.

  17. Adversarial Label Invariant Graph Data Augmentations for Out-of-Distribution Generalization

    cs.LG 2026-04 unverdicted novelty 6.0

    RIA uses adversarial exploration of counterfactual graph environments via label-invariant augmentations to improve OoD generalization in graph classification tasks.

  18. Inside-Out: Measuring Generalization in Vision Transformers Through Inner Workings

    cs.LG 2026-04 unverdicted novelty 6.0

    Circuit-based metrics from Vision Transformer internals provide better label-free proxies for generalization under distribution shift than existing methods like model confidence.

  19. Visual prompting reimagined: The power of the Activation Prompts

    cs.CV 2026-04 unverdicted novelty 6.0

    Activation prompts on intermediate layers outperform input-level visual prompting and parameter-efficient fine-tuning in accuracy and efficiency across 29 datasets.

  20. Robust Learning of Heterogeneous Dynamic Systems

    stat.ME 2026-04 unverdicted novelty 6.0

    A distributionally robust ODE learning framework for heterogeneous systems that uses worst-case optimization over convex derivative combinations to produce a stabilized weighted estimator with theoretical guarantees.

  21. Beyond Instance-Level Self-Supervision in 3D Multi-Modal Medical Imaging

    cs.CV 2026-05 unverdicted novelty 5.0

    A self-supervised approach uses consistent spatial relationships of anatomical structures across patients to improve 3D multi-modal medical image representations, yielding modest gains on segmentation and classificati...

  22. Agentic AIs Are the Missing Paradigm for Out-of-Distribution Generalization in Foundation Models

    cs.LG 2026-05 unverdicted novelty 5.0

    Agentic AI systems are required to overcome the parameter coverage ceiling that prevents foundation models from handling certain out-of-distribution cases.

  23. A Toolkit for Detecting Spurious Correlations in Speech Datasets

    cs.SD 2026-04 unverdicted novelty 5.0

    A toolkit flags spurious correlations in speech datasets by checking if non-speech regions predict the target class better than chance.

  24. Labeled TrustSet Guided: Batch Active Learning with Reinforcement Learning

    cs.LG 2026-04 unverdicted novelty 5.0

    BRAL-T uses TrustSet-guided reinforcement learning for batch active learning and reports state-of-the-art results on 10 image classification benchmarks plus 2 fine-tuning tasks.

  25. Robust Deepfake Detection, NTIRE 2026 Challenge: Report

    cs.CV 2026-04 unverdicted novelty 2.0

    The NTIRE 2026 challenge finds that large foundation models combined with ensembles and degradation-aware training produce the most robust deepfake detectors.

  26. Deep Learning for Sequential Decision Making under Uncertainty: Foundations, Frameworks, and Frontiers

    math.OC 2026-04 unverdicted novelty 2.0

    A tutorial framing deep learning as a complement to optimization for sequential decision-making under uncertainty, with applications in supply chains, healthcare, and energy.

Reference graph

Works this paper leans on

300 extracted references · 300 canonical work pages · cited by 25 Pith papers · 1 internal anchor
