pith. machine review for the scientific record.

arxiv: 1911.08731 · v2 · submitted 2019-11-20 · 💻 cs.LG · stat.ML

Recognition: 1 theorem link · Lean Theorem

Distributionally Robust Neural Networks for Group Shifts: On the Importance of Regularization for Worst-Case Generalization

Authors on Pith: no claims yet

Pith reviewed 2026-05-13 09:14 UTC · model grok-4.3

classification 💻 cs.LG stat.ML
keywords distributionally robust optimization · group shifts · neural networks · regularization · worst-group generalization · overparameterization

The pith

Regularization enables group DRO to achieve high worst-group accuracy on overparameterized neural networks

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Overparameterized neural networks can fit training data perfectly yet fail on atypical groups by learning spurious correlations. Standard group distributionally robust optimization fails in this regime because vanishing average loss implies vanishing worst-case loss on the training groups. The paper shows the root cause is poor generalization on some groups rather than optimization failure. Pairing group DRO with stronger regularization such as increased L2 penalties or early stopping raises worst-group accuracy by 10 to 40 percentage points on an NLI task and two image tasks while preserving high average accuracy. The results indicate regularization matters for worst-group generalization even when it is unnecessary for average generalization.

Core claim

Naively applying group DRO to overparameterized networks yields models with vanishing worst-case training loss yet poor test-time worst-group performance; adding stronger regularization restores high worst-group accuracy on held-out data from the same groups.

What carries the argument

Coupling group DRO with stronger-than-typical L2 regularization or early stopping to prevent overfitting on minority groups while minimizing worst-case loss.
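This objective can be sketched in a few lines of numpy. The sketch is illustrative, not the paper's implementation: per-group average losses are computed, the worst is minimized, and a stronger-than-typical L2 penalty (the paper's free parameter, here `l2_coef`) is added.

```python
import numpy as np

def group_dro_objective(per_example_loss, group_ids, weights, l2_coef):
    """Worst-group training loss plus an L2 penalty.

    per_example_loss: (n,) losses for each training example.
    group_ids: (n,) integer array assigning each example to a group.
    weights: flat array of model parameters (for the L2 term).
    l2_coef: regularization strength (the paper's free parameter).
    """
    group_losses = np.array([
        per_example_loss[group_ids == g].mean()
        for g in np.unique(group_ids)
    ])
    # Group DRO minimizes the worst (largest) group loss, not the average.
    worst_group_loss = group_losses.max()
    return worst_group_loss + l2_coef * np.sum(weights ** 2)
```

In the overparameterized regime all `group_losses` can be driven near zero, which is why the regularization term, rather than the max itself, carries the worst-group generalization behavior.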

Load-bearing premise

The failure of naive group DRO comes from poor generalization on groups rather than optimization difficulty, and the pre-defined training groups match the groups that matter at test time.

What would settle it

An experiment in which increasing regularization leaves worst-group accuracy unchanged or in which naive group DRO already reaches high worst-group accuracy without extra regularization on the same datasets.

read the original abstract

Overparameterized neural networks can be highly accurate on average on an i.i.d. test set yet consistently fail on atypical groups of the data (e.g., by learning spurious correlations that hold on average but not in such groups). Distributionally robust optimization (DRO) allows us to learn models that instead minimize the worst-case training loss over a set of pre-defined groups. However, we find that naively applying group DRO to overparameterized neural networks fails: these models can perfectly fit the training data, and any model with vanishing average training loss also already has vanishing worst-case training loss. Instead, the poor worst-case performance arises from poor generalization on some groups. By coupling group DRO models with increased regularization---a stronger-than-typical L2 penalty or early stopping---we achieve substantially higher worst-group accuracies, with 10-40 percentage point improvements on a natural language inference task and two image tasks, while maintaining high average accuracies. Our results suggest that regularization is important for worst-group generalization in the overparameterized regime, even if it is not needed for average generalization. Finally, we introduce a stochastic optimization algorithm, with convergence guarantees, to efficiently train group DRO models.
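The stochastic algorithm mentioned at the end of the abstract can be sketched as an online min-max loop: an adversary maintains a distribution `q` over groups and upweights the groups with high loss, while the model takes gradient steps on the `q`-weighted loss. The exponentiated-gradient form and the name `eta_q` are assumptions about the algorithm's shape, not a transcription of the paper's pseudocode.

```python
import numpy as np

def update_group_weights(q, group_losses, eta_q):
    """One exponentiated-gradient ascent step on the group weights q.

    Groups with higher loss are upweighted multiplicatively; the
    normalization keeps q on the probability simplex.
    """
    q = q * np.exp(eta_q * group_losses)
    return q / q.sum()

def weighted_loss(q, group_losses):
    """Loss for the model's gradient step: the adversarially
    upweighted groups contribute more."""
    return float(np.dot(q, group_losses))
```

Alternating these two steps each minibatch is what makes the procedure stochastic and efficient relative to recomputing the exact worst group every iteration.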

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper claims that naive application of group DRO to overparameterized neural networks fails to improve worst-group test accuracy because models achieve vanishing worst-case training loss (any low-average-loss model already has low worst-case loss on the training groups), with failures instead arising from poor generalization on some groups. Coupling group DRO with stronger regularization (increased L2 penalty or early stopping) yields 10-40 percentage point gains in worst-group accuracy on an NLI task and two image tasks while preserving high average accuracy. The authors also introduce a stochastic optimization algorithm for group DRO with convergence guarantees.

Significance. If the empirical results hold and the gains are attributable to generalization rather than optimization, the work is significant for demonstrating that regularization remains crucial for worst-group generalization even in the overparameterized regime where it is often unnecessary for average generalization. The practical improvements and the proposed algorithm with guarantees represent concrete contributions to distributionally robust learning.

major comments (2)
  1. [Abstract and §3 (method)] The assertion that overparameterized models achieve vanishing worst-case training loss under naive group DRO (any model with vanishing average training loss already has vanishing worst-case loss) is load-bearing for the narrative that failures are due to generalization rather than optimization. Given the non-convex, non-smooth min-max objective, the manuscript should explicitly report the achieved worst-group training losses (e.g., in §4 or Table 1) to confirm the optimizer reaches this regime on the reported tasks.
  2. [§5 (experiments)] The 10-40 pp worst-group improvements rely on tuning regularization strength (L2 coefficient or early-stopping epoch), listed as a free parameter. The central claim would be strengthened by showing that these gains are robust across a range of regularization values and that the optimal regularization for worst-group accuracy differs systematically from that for average accuracy (e.g., via additional curves in §5).
minor comments (2)
  1. [§4 (algorithm)] The convergence guarantees for the proposed stochastic algorithm are stated but the precise assumptions (e.g., on the loss smoothness or step-size schedule) and any empirical verification of convergence rates could be expanded for clarity.
  2. [Figures in §5] Figure captions and legends should explicitly note the number of random seeds or runs used to generate error bars when comparing average vs. worst-group accuracies.

Simulated Authors' Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive comments. We address each major comment below and will revise the manuscript to incorporate the suggested changes.

read point-by-point responses
  1. Referee: [Abstract and §3 (method)] The assertion that overparameterized models achieve vanishing worst-case training loss under naive group DRO (any model with vanishing average training loss already has vanishing worst-case loss) is load-bearing for the narrative that failures are due to generalization rather than optimization. Given the non-convex, non-smooth min-max objective, the manuscript should explicitly report the achieved worst-group training losses (e.g., in §4 or Table 1) to confirm the optimizer reaches this regime on the reported tasks.

    Authors: We agree that explicitly reporting the achieved worst-group training losses would strengthen the claim that the optimizer reaches the regime where average and worst-case training losses both vanish. In the revised manuscript we will add these values to Table 1 and the corresponding discussion in §4, confirming that worst-group training loss approaches zero under naive group DRO on the reported tasks. revision: yes

  2. Referee: [§5 (experiments)] The 10-40 pp worst-group improvements rely on tuning regularization strength (L2 coefficient or early-stopping epoch), listed as a free parameter. The central claim would be strengthened by showing that these gains are robust across a range of regularization values and that the optimal regularization for worst-group accuracy differs systematically from that for average accuracy (e.g., via additional curves in §5).

    Authors: We appreciate the suggestion to demonstrate robustness across regularization values. In the revised manuscript we will add plots in §5 showing worst-group and average accuracy as functions of the L2 coefficient and of the early-stopping epoch for both group DRO and ERM. These curves will illustrate that the improvements are robust over a range of regularization strengths and that the regularization level optimal for worst-group accuracy is systematically stronger than the level optimal for average accuracy. revision: yes
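The robustness check promised in this response amounts to a sweep over regularization strengths. A minimal harness is sketched below; `train_fn` and `eval_fn` are hypothetical placeholders for the user's own training and per-group evaluation routines, not functions from the paper.

```python
import numpy as np

def regularization_sweep(train_fn, eval_fn, l2_grid, group_ids):
    """For each L2 strength, train a model and record
    (l2, average accuracy, worst-group accuracy) for plotting.

    train_fn(l2) -> model and eval_fn(model, g) -> accuracy on
    group g are placeholders for the user's own routines.
    """
    rows = []
    for l2 in l2_grid:
        model = train_fn(l2)
        accs = np.array([eval_fn(model, g) for g in group_ids])
        # Average and worst-group accuracy can peak at different l2
        # values, which is the systematic difference the curves show.
        rows.append((l2, accs.mean(), accs.min()))
    return rows
```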

Circularity Check

0 steps flagged

No significant circularity; empirical results and algorithm are self-contained

full rationale

The paper's core argument rests on direct experimental observations that overparameterized models achieve vanishing worst-case training loss under naive group DRO (any low-average-loss model has low worst-case loss) and that stronger regularization yields 10-40 point worst-group gains. This is presented as an empirical finding on concrete tasks rather than a derivation that reduces by construction to fitted parameters or self-citations. The introduced stochastic optimizer is accompanied by stated convergence guarantees, supplying independent mathematical content. No load-bearing step invokes a uniqueness theorem from the authors' prior work, renames a known pattern, or defines a prediction in terms of its own inputs. The results remain externally falsifiable via replication on the reported NLI and image datasets.

Axiom & Free-Parameter Ledger

1 free parameter · 2 axioms · 0 invented entities

The work relies on standard supervised learning assumptions plus the existence of pre-defined groups that capture the relevant distribution shifts. No new entities are postulated. The regularization strength is a free parameter that must be tuned.

free parameters (1)
  • regularization strength (L2 coefficient or early-stopping epoch)
    Chosen to balance average and worst-group performance; the paper shows results for stronger-than-typical values.
axioms (2)
  • domain assumption The training groups are known and fixed in advance.
    Group DRO requires a partition of the training data into groups that are assumed to represent the shifts of interest.
  • standard math Standard neural network training dynamics apply.
    The analysis assumes gradient-based optimization reaches near-zero training loss on overparameterized models.
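When the free parameter is the early-stopping epoch rather than an L2 coefficient, tuning it amounts to checkpoint selection. A minimal sketch of that selection rule (an illustrative helper, not code from the paper):

```python
def pick_early_stop_epoch(worst_group_val_acc):
    """Early stopping as the regularizer: choose the checkpoint with
    the best worst-group validation accuracy, rather than the best
    average accuracy, which would favor the majority groups."""
    accs = list(worst_group_val_acc)
    return max(range(len(accs)), key=accs.__getitem__)
```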

pith-pipeline@v0.9.0 · 5525 in / 1433 out tokens · 35259 ms · 2026-05-13T09:14:02.824790+00:00 · methodology

discussion (0)


Forward citations

Cited by 26 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Distributionally Robust Multi-Task Reinforcement Learning via Adaptive Task Sampling

    cs.LG 2026-05 unverdicted novelty 7.0

    DRATS derives a minimax objective from a feasibility formulation of MTRL to adaptively sample tasks with the largest return gaps, leading to better worst-task performance on MetaWorld benchmarks.

  2. Structure from Strategic Interaction & Uncertainty: Risk Sensitive Games for Robust Preference Learning

    cs.GT 2026-05 unverdicted novelty 7.0

    Risk-sensitive preference games retain monotonicity via translation-invariant risk measures, enabling convergent self-play algorithms with stability bounds and empirical robustness across data strata.

  3. Are We Making Progress in Multimodal Domain Generalization? A Comprehensive Benchmark Study

    cs.CV 2026-05 unverdicted novelty 7.0

    A large-scale benchmark finds that recent multimodal domain generalization methods give only marginal gains over a plain ERM baseline, with no method winning consistently and all degrading sharply under corruption or ...

  4. eXplaining to Learn (eX2L): Regularization Using Contrastive Visual Explanation Pairs for Distribution Shifts

    cs.CV 2026-05 unverdicted novelty 7.0

    eX2L improves robustness to distribution shifts by penalizing similarity between Grad-CAM maps of a label classifier and a confounder classifier, reaching new SOTA average and worst-group accuracy on the Spawrious benchmark.

  5. Right Regions, Wrong Labels: Semantic Label Flips in Segmentation under Correlation Shift

    cs.CV 2026-04 unverdicted novelty 7.0

    Semantic segmentation models produce label flips within foreground regions under correlation shift, quantified by a new Flip diagnostic and an entropy-based flip-risk score.

  6. Learning from Synthetic Data via Provenance-Based Input Gradient Guidance

    cs.CV 2026-04 unverdicted novelty 7.0

    A framework that applies provenance-based guidance to input gradients during synthetic data training to promote learning from target regions only.

  7. Birds of a Feather Flock Together: Background-Invariant Representations via Linear Structure in VLMs

    cs.CV 2026-05 unverdicted novelty 6.0

    Exploiting linear structure in VLM embeddings, a synthetic-data pre-training method yields background-invariant representations that exceed 90% worst-group accuracy on Waterbirds even under 100% spurious correlation w...

  8. DuetFair: Coupling Inter- and Intra-Subgroup Robustness for Fair Medical Image Segmentation

    cs.CV 2026-05 unverdicted novelty 6.0

    DuetFair couples inter-subgroup adaptation with intra-subgroup robustness via FairDRO (dMoE plus subgroup-conditioned DRO) to boost worst-case and equity-scaled performance on medical segmentation benchmarks.

  9. Structure from Strategic Interaction & Uncertainty: Risk Sensitive Games for Robust Preference Learning

    cs.GT 2026-05 unverdicted novelty 6.0

    Risk-sensitive preference games using convex risk measures produce policies that are robust across data strata and match or exceed standard Nash learning performance without added cost.

  10. The Trap of Trajectory: Towards Understanding and Mitigating Spurious Correlations in Agentic Memory

    cs.LG 2026-05 unverdicted novelty 6.0

    Agentic memory improves clean reasoning but worsens performance when spurious patterns are present in stored trajectories; CAMEL calibration reduces this reliance while preserving clean performance.

  11. Robust Conditional Conformal Prediction via Branched Normalizing Flow

    cs.LG 2026-05 unverdicted novelty 6.0

    Branched Normalizing Flow improves conditional coverage robustness of conformal prediction under distribution shift by normalizing test inputs to the calibration distribution and mapping prediction sets back.

  12. Cheeger--Hodge Contrastive Learning for Structurally Robust Graph Representation Learning

    cs.LG 2026-04 unverdicted novelty 6.0

    CHCL aligns a Cheeger-Hodge joint signature across graph augmentations to produce embeddings that remain stable under local structural changes.

  13. Correcting Performance Estimation Bias in Imbalanced Classification with Minority Subconcepts

    cs.LG 2026-04 unverdicted novelty 6.0

    The authors introduce predicted-weighted balanced accuracy (pBA), a utility-weighted evaluation metric that uses predicted subconcept posteriors to reduce bias from within-class heterogeneity in imbalanced data.

  14. MGDA-Decoupled: Geometry-Aware Multi-Objective Optimisation for DPO-based LLM Alignment

    cs.LG 2026-04 unverdicted novelty 6.0

    MGDA-Decoupled applies geometry-based multi-objective optimization within the DPO framework to find shared descent directions that account for each objective's convergence dynamics, yielding higher win rates on UltraFeedback.

  15. CrossPan: A Comprehensive Benchmark for Cross-Sequence Pancreas MRI Segmentation and Generalization

    cs.CV 2026-04 unverdicted novelty 6.0

    CrossPan benchmark shows cross-sequence MRI domain shifts cause pancreas segmentation models to fail catastrophically, establishing sequence generalization as the primary barrier to clinical deployment over center var...

  16. CrossFlowDG: Bridging the Modality Gap with Cross-modal Flow Matching for Domain Generalization

    cs.CV 2026-04 unverdicted novelty 6.0

    CrossFlowDG bridges the modality gap in domain generalization by learning a continuous transformation that moves image embeddings to matching text embeddings using noise-free cross-modal flow matching.

  17. Adversarial Label Invariant Graph Data Augmentations for Out-of-Distribution Generalization

    cs.LG 2026-04 unverdicted novelty 6.0

    RIA uses adversarial exploration of counterfactual graph environments via label-invariant augmentations to improve OoD generalization in graph classification tasks.

  18. Inside-Out: Measuring Generalization in Vision Transformers Through Inner Workings

    cs.LG 2026-04 unverdicted novelty 6.0

    Circuit-based metrics from Vision Transformer internals provide better label-free proxies for generalization under distribution shift than existing methods like model confidence.

  19. Visual prompting reimagined: The power of the Activation Prompts

    cs.CV 2026-04 unverdicted novelty 6.0

    Activation prompts on intermediate layers outperform input-level visual prompting and parameter-efficient fine-tuning in accuracy and efficiency across 29 datasets.

  20. Robust Learning of Heterogeneous Dynamic Systems

    stat.ME 2026-04 unverdicted novelty 6.0

    A distributionally robust ODE learning framework for heterogeneous systems that uses worst-case optimization over convex derivative combinations to produce a stabilized weighted estimator with theoretical guarantees.

  21. Beyond Instance-Level Self-Supervision in 3D Multi-Modal Medical Imaging

    cs.CV 2026-05 unverdicted novelty 5.0

    A self-supervised approach uses consistent spatial relationships of anatomical structures across patients to improve 3D multi-modal medical image representations, yielding modest gains on segmentation and classificati...

  22. Agentic AIs Are the Missing Paradigm for Out-of-Distribution Generalization in Foundation Models

    cs.LG 2026-05 unverdicted novelty 5.0

    Agentic AI systems are required to overcome the parameter coverage ceiling that prevents foundation models from handling certain out-of-distribution cases.

  23. A Toolkit for Detecting Spurious Correlations in Speech Datasets

    cs.SD 2026-04 unverdicted novelty 5.0

    A toolkit flags spurious correlations in speech datasets by checking if non-speech regions predict the target class better than chance.

  24. Labeled TrustSet Guided: Batch Active Learning with Reinforcement Learning

    cs.LG 2026-04 unverdicted novelty 5.0

    BRAL-T uses TrustSet-guided reinforcement learning for batch active learning and reports state-of-the-art results on 10 image classification benchmarks plus 2 fine-tuning tasks.

  25. Robust Deepfake Detection, NTIRE 2026 Challenge: Report

    cs.CV 2026-04 unverdicted novelty 2.0

    The NTIRE 2026 challenge finds that large foundation models combined with ensembles and degradation-aware training produce the most robust deepfake detectors.

  26. Deep Learning for Sequential Decision Making under Uncertainty: Foundations, Frameworks, and Frontiers

    math.OC 2026-04 unverdicted novelty 2.0

    A tutorial framing deep learning as a complement to optimization for sequential decision-making under uncertainty, with applications in supply chains, healthcare, and energy.

Reference graph

Works this paper leans on

300 extracted references · 300 canonical work pages · cited by 25 Pith papers · 1 internal anchor
