Shortcut to Nowhere: Demystifying Deep Spurious Regression

Guanrong Xu; Hao Wang; Jessica Li; Yuzhe Yang

arxiv: 2606.01723 · v1 · pith:D6XRPKVWnew · submitted 2026-06-01 · 💻 cs.LG · cs.AI

Pith reviewed 2026-06-28 15:56 UTC · model grok-4.3

classification 💻 cs.LG cs.AI

keywords: deep spurious regression, continuous spurious correlations, regression shortcuts, distribution calibration, attribute similarity, shortcut learning, generalization

The pith

Regression models fail on continuous spurious correlations unless label and feature distributions are calibrated using attribute similarities.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper defines deep spurious regression as the task of predicting continuous targets when training data contains attributes spuriously correlated with those targets, with the requirement to generalize to every possible attribute-target pairing at test time. Methods built for classification shortcuts do not transfer directly, because regression lacks discrete labels and natural group boundaries. The proposed approach therefore measures similarity between spurious attributes in both the label space and the learned feature space, so that calibration can account for nearby targets and related groups. The resulting calibrated distributions are reported to improve robustness on image, sensor, and language-model regression benchmarks.
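The pith does not say how the label-space calibration is computed. A minimal sketch of one plausible reading, assuming a scalar spurious attribute and Gaussian kernels: estimate the joint density of (attribute, label) with kernels that spread each sample's mass to similar attributes and nearby targets, then weight samples by the inverse of that smoothed density so rare combinations count more. This borrows the kernel-smoothing idea of label distribution smoothing from the cited deep imbalanced regression work [33]; the function names, bandwidths, grids, and inverse-density weighting are illustrative assumptions, not the paper's method.

```python
import numpy as np


def smoothed_joint_density(attrs, labels, attr_grid, label_grid, attr_bw, label_bw):
    """Kernel-smoothed estimate of p(attribute, label) on a grid (hypothetical helper).

    Each sample spreads its mass to similar attributes and nearby targets, so
    sparse (attribute, label) cells borrow strength from their neighbours.
    """
    # Gaussian kernel between every sample and every grid point on each axis.
    k_attr = np.exp(-0.5 * ((attrs[:, None] - attr_grid[None, :]) / attr_bw) ** 2)      # (n, A)
    k_label = np.exp(-0.5 * ((labels[:, None] - label_grid[None, :]) / label_bw) ** 2)  # (n, Y)
    # Summing product kernels over samples gives an (A, Y) density surface.
    density = k_attr.T @ k_label
    return density / density.sum()


def balancing_weights(attrs, labels, attr_grid, label_grid,
                      attr_bw=1.0, label_bw=1.0, eps=1e-8):
    """Inverse smoothed-density sample weights, normalized to mean 1 (hypothetical)."""
    density = smoothed_joint_density(attrs, labels, attr_grid, label_grid,
                                     attr_bw, label_bw)
    # Look up each sample's density at its nearest grid cell.
    a_idx = np.abs(attrs[:, None] - attr_grid[None, :]).argmin(axis=1)
    y_idx = np.abs(labels[:, None] - label_grid[None, :]).argmin(axis=1)
    weights = 1.0 / (density[a_idx, y_idx] + eps)
    return weights * len(weights) / weights.sum()


# Usage: the attribute tracks the label, plus one rare counter-correlated sample.
rng = np.random.default_rng(0)
a = rng.normal(size=200)
y = a + 0.1 * rng.normal(size=200)
a, y = np.append(a, 2.0), np.append(y, -2.0)
grid = np.linspace(-3.0, 3.0, 13)
w = balancing_weights(a, y, grid, grid)  # w[-1] is far above the median weight
```

Without smoothing, an attribute-label cell with no training samples has zero empirical density and an undefined inverse weight; the two kernels give such cells mass from similar attributes and nearby targets, which keeps the weights finite.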

Core claim

We define Deep Spurious Regression (DSR) as learning from regression data with attribute-label confounding, addressing continuous spurious correlations, and generalizing to all attribute-label combinations at test time. Motivated by the intrinsic difference between classification and regression shortcuts, we propose to exploit the similarity among spurious attributes in both label and feature spaces, thereby accounting for nearby targets and related groups while calibrating both label and learned feature distributions across attributes.
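One way to write the setting down, in our own notation rather than the paper's: training triples (x, a, y) come from a distribution in which the attribute a and the target y are dependent, while the goal is low loss under a target distribution u that puts mass on every (a, y) pair, for example the uniform distribution over the attributes and a bounded label range. The importance weight toward that target blows up wherever the training density is near zero, which is exactly where spurious correlation leaves gaps. Smoothing the training density over similar attributes and nearby targets, with normalized kernels k_A and k_Y (assumed here), is one way to read the proposal to account for "nearby targets and related groups".

```latex
\min_f \; \mathbb{E}_{(a, y) \sim u}\Big[\, \mathbb{E}_{x \mid a, y}\, \ell\big(f(x), y\big) \Big],
\qquad u = U(\mathcal{A} \times \mathcal{Y})

w(a, y) = \frac{u(a, y)}{p_{\mathrm{train}}(a, y)} \;\to\; \infty
\quad \text{as } p_{\mathrm{train}}(a, y) \to 0

\tilde{p}_{\mathrm{train}}(a, y) = \sum_{a' \in \mathcal{A}} \int_{\mathcal{Y}}
  k_{\mathcal{A}}(a, a')\, k_{\mathcal{Y}}(y, y')\, p_{\mathrm{train}}(a', y')\, dy',
\qquad
\tilde{w}(a, y) = \frac{u(a, y)}{\tilde{p}_{\mathrm{train}}(a, y)}
```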

What carries the argument

Exploitation of similarity among spurious attributes in label and feature spaces to calibrate both label and learned feature distributions across attributes.
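For the feature side, a sketch in the spirit of feature distribution smoothing from [33], moved from label bins to attributes: compute each attribute group's feature mean and variance, smooth those statistics across groups with a similarity kernel over attribute values (weighted by group size), and re-standardize every sample from its own group's statistics to the smoothed ones. The attribute positions `attr_values`, the Gaussian kernel, and the diagonal-variance statistics are assumptions for illustration; the paper's actual feature calibration may differ.

```python
import numpy as np


def calibrate_features(feats, attr_ids, attr_values, bandwidth=1.0, eps=1e-6):
    """Similarity-smoothed per-attribute feature calibration (hypothetical sketch).

    feats:       (n, d) learned features
    attr_ids:    (n,) integer attribute index in [0, K)
    attr_values: (K,) scalar position of each attribute, used to define similarity
    """
    num_attrs, dim = len(attr_values), feats.shape[1]
    means = np.zeros((num_attrs, dim))
    variances = np.ones((num_attrs, dim))
    counts = np.zeros(num_attrs)
    for k in range(num_attrs):
        group = feats[attr_ids == k]
        counts[k] = len(group)
        if len(group) > 0:
            means[k] = group.mean(axis=0)
            variances[k] = group.var(axis=0)
    # Gaussian similarity between attributes, weighted by group size so that
    # empty or tiny groups contribute little to their neighbours' statistics.
    diff = (attr_values[:, None] - attr_values[None, :]) / bandwidth
    mix = np.exp(-0.5 * diff ** 2) * counts[None, :]
    mix /= mix.sum(axis=1, keepdims=True)
    smooth_means = mix @ means
    smooth_vars = mix @ variances
    # Re-standardize each sample from its own group's statistics to the smoothed ones.
    z = (feats - means[attr_ids]) / np.sqrt(variances[attr_ids] + eps)
    return z * np.sqrt(smooth_vars[attr_ids] + eps) + smooth_means[attr_ids]
```

A large bandwidth pulls every group toward the pooled statistics, while a bandwidth near zero leaves the features unchanged, so the bandwidth is where the similarity premise enters this sketch.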

If this is right

Models achieve superior performance on real-world DSR datasets spanning computer vision, environmental sensing, and LLM regression.
Continuous targets and all attribute-label combinations at test time are handled without relying on discrete group definitions.
Benchmarks and techniques now exist for studying spurious correlations specifically in continuous prediction tasks.
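Handling all attribute-label combinations calls for an evaluation that reports error per combination rather than on average. The text gives no metric, so this sketch assumes one: bin the continuous targets, compute mean absolute error in every (attribute, label-bin) cell present at test time, and report the mean over cells and the worst cell, a regression analogue of the worst-group accuracy used in the group-robustness literature (e.g., [27]). The bin edges and function names are hypothetical.

```python
import numpy as np


def cell_mae(y_true, y_pred, attr_ids, label_edges):
    """MAE in every (attribute, label-bin) cell that has test samples (hypothetical).

    label_edges: ascending bin edges including the outer ones; values outside
    the range fall into the first or last bin.
    """
    bins = np.digitize(y_true, label_edges[1:-1])  # bin index in [0, len(edges) - 2]
    errors = {}
    for a in np.unique(attr_ids):
        for b in np.unique(bins):
            mask = (attr_ids == a) & (bins == b)
            if mask.any():
                errors[(int(a), int(b))] = float(np.abs(y_true[mask] - y_pred[mask]).mean())
    return errors


def summarize(errors):
    """Cell-averaged ("balanced") MAE and worst-cell MAE."""
    vals = np.array(list(errors.values()))
    return {"balanced_mae": float(vals.mean()), "worst_cell_mae": float(vals.max())}
```

A model that leans on the shortcut can have a good pooled MAE and still a large worst-cell MAE on the combinations that were rare in training.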

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

The same calibration idea could apply to other smoothly varying continuous outputs such as time-series forecasting or dose-response modeling.
Fairness audits for regression might add explicit checks for distribution shift across continuous attribute values.
Synthetic experiments that vary the degree of attribute similarity would directly test how much the method depends on the similarity premise.

Load-bearing premise

Spurious attributes exhibit enough similarity in label and feature spaces for calibration to improve generalization.
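The premise can be probed before any training. A hedged diagnostic, with all names hypothetical: measure how far apart attributes are in label space with the one-dimensional Wasserstein-1 (earth mover's) distance between their label distributions, measure feature-space distance between attribute centroids of the learned embeddings, and check whether the two agree across attribute pairs. The Wasserstein distance here uses the standard 1-D identity that it equals the integral of the absolute difference between the two empirical CDFs. High agreement is what the rebuttal's motivation (attributes with nearby label values sharing related features) predicts; low agreement flags a dataset where the premise is weak.

```python
import numpy as np


def wasserstein_1d(u, v):
    """Wasserstein-1 distance between two 1-D empirical distributions,
    computed as the integral of |F_u - F_v| over the merged support."""
    u, v = np.sort(np.asarray(u, float)), np.sort(np.asarray(v, float))
    support = np.sort(np.concatenate([u, v]))
    widths = np.diff(support)
    cdf_u = np.searchsorted(u, support[:-1], side="right") / len(u)
    cdf_v = np.searchsorted(v, support[:-1], side="right") / len(v)
    return float(np.sum(np.abs(cdf_u - cdf_v) * widths))


def similarity_agreement(labels, feats, attr_ids):
    """Pairwise attribute distances in label and feature space, plus their Pearson
    correlation over attribute pairs (hypothetical diagnostic; needs >= 3 attributes)."""
    groups = np.unique(attr_ids)
    num_groups = len(groups)
    centroids = np.stack([feats[attr_ids == g].mean(axis=0) for g in groups])
    label_d = np.zeros((num_groups, num_groups))
    feat_d = np.zeros((num_groups, num_groups))
    for i in range(num_groups):
        for j in range(i + 1, num_groups):
            li, lj = labels[attr_ids == groups[i]], labels[attr_ids == groups[j]]
            label_d[i, j] = label_d[j, i] = wasserstein_1d(li, lj)
            feat_d[i, j] = feat_d[j, i] = np.linalg.norm(centroids[i] - centroids[j])
    upper = np.triu_indices(num_groups, k=1)
    agreement = float(np.corrcoef(label_d[upper], feat_d[upper])[0, 1])
    return label_d, feat_d, agreement
```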

What would settle it

A controlled dataset in which spurious attributes have low similarity in label or feature space, and on which the calibration method yields no gain, or a loss, in test generalization.
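Such a controlled dataset is straightforward to sketch. The generator below is hypothetical and not from the paper: the label y is uniform on [0, 1]; in training, with probability `confounding` the attribute is the bin that y falls in, otherwise it is drawn uniformly; at test time attributes are independent of y, so every combination appears. Features are a noisy copy of y (the weak causal signal) plus a clean attribute embedding (the shortcut). The `similarity` knob blends a smooth arc of embeddings, where neighbouring attributes look alike, with independent random directions, so sweeping it from 1 to 0 varies feature-space similarity as the test above requires.

```python
import numpy as np


def attribute_embeddings(num_attrs, similarity, dim=16, seed=0):
    """Unit-norm embedding per attribute (hypothetical).

    similarity = 1.0: embeddings lie on a smooth arc, so neighbouring
    attributes look alike; similarity = 0.0: independent random directions.
    """
    rng = np.random.default_rng(seed)
    angles = np.linspace(0.0, np.pi / 2, num_attrs)
    smooth = np.zeros((num_attrs, dim))
    smooth[:, 0], smooth[:, 1] = np.cos(angles), np.sin(angles)
    rand_dirs = rng.normal(size=(num_attrs, dim))
    rand_dirs /= np.linalg.norm(rand_dirs, axis=1, keepdims=True)
    emb = similarity * smooth + (1.0 - similarity) * rand_dirs
    return emb / np.linalg.norm(emb, axis=1, keepdims=True)


def make_dsr_split(n, emb, confounding, split="train", seed=0):
    """Synthetic regression split with a tunable attribute-label shortcut (hypothetical)."""
    rng = np.random.default_rng(seed)
    num_attrs, dim = emb.shape
    y = rng.uniform(0.0, 1.0, n)
    uniform_a = rng.integers(0, num_attrs, n)
    if split == "train":
        # With probability `confounding`, the attribute is the label's bin.
        tracked_a = np.minimum((y * num_attrs).astype(int), num_attrs - 1)
        a = np.where(rng.uniform(size=n) < confounding, tracked_a, uniform_a)
    else:
        # Test: attribute independent of the label, so all combinations appear.
        a = uniform_a
    core = y[:, None] + 0.5 * rng.normal(size=(n, 1))     # weak, noisy causal signal
    spurious = emb[a] + 0.05 * rng.normal(size=(n, dim))  # clean shortcut signal
    return np.hstack([core, spurious]), y, a
```

Training the same calibration method on splits built with `similarity=1.0` and with `similarity=0.0`, and comparing worst-cell test error, is one concrete form of the falsification test.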

Original abstract

Real-world regression often exhibits shortcuts: attributes that are spuriously correlated with continuous targets in training, yet unreliable under deployment shifts; regressing targets using such shortcuts may fail catastrophically at test time. Existing studies on spurious correlations focus primarily on classification, where labels are categorical and groups are naturally defined. However, many real-world tasks require continuous prediction, where hard label boundaries or discrete group-label pairs do not exist. We define Deep Spurious Regression (DSR) as learning from regression data with attribute-label confounding, addressing continuous spurious correlations, and generalizing to all attribute-label combinations at test time. Motivated by the intrinsic difference between classification and regression shortcuts, we propose to exploit the similarity among spurious attributes in both label and feature spaces, thereby accounting for nearby targets and related groups while calibrating both label and learned feature distributions across attributes. Extensive experiments on common real-world DSR datasets that span computer vision, environmental sensing, and large language model (LLM) regression verify the superior performance of our strategies. Our work fills the gap in benchmarks and techniques for studying spurious correlations in continuous prediction.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Desk Editor's Note · private letter to a colleague

The paper defines DSR for continuous targets and offers a dual-space similarity calibration, but the generalization claim rests on a similarity assumption that is never shown to hold, and the abstract reports no experimental numbers.

In short, this paper names Deep Spurious Regression (DSR) as the continuous-target version of the spurious-correlation problem and proposes calibrating label and feature distributions by exploiting similarity among spurious attributes in both spaces.

The definition itself is new relative to the classification papers cited. It correctly notes that regression lacks hard label boundaries and discrete groups, so methods built for categorical cases do not transfer directly. The dual-space calibration is presented as a way to handle nearby targets and related groups, which is a reasonable attempt to adapt the idea to continuous outputs.

The paper does a service by listing datasets across vision, sensing, and LLM regression tasks. That helps make the problem concrete for people who need regression robustness benchmarks.

The soft spots are straightforward. The abstract claims extensive experiments show superior performance, yet supplies no numbers, error bars, dataset sizes, or ablation controls. Without those, the central claim cannot be assessed from the provided text. More importantly, the approach explicitly relies on spurious attributes having usable similarity in label and feature spaces so that calibration across attributes works. No formal condition, bound, or derivation is given showing that this similarity exists, is strong enough, or survives the continuous nature of the targets. If the similarities are weak or task-dependent, the required generalization to all attribute-label combinations at test time does not follow.

This is for researchers already working on spurious correlations who want to move into regression settings. A reader who needs problem definitions and candidate techniques in that narrow area can extract value. It deserves serious refereeing because the gap it identifies is real and the proposed direction is specific enough to be checked, even though the current evidence is thin.

Referee Report

2 major / 0 minor

Summary. The paper defines Deep Spurious Regression (DSR) as regression under attribute-label confounding with continuous targets, requiring generalization to all attribute-label combinations at test time. Motivated by differences from classification shortcuts, it proposes calibration strategies that exploit similarity among spurious attributes in label and feature spaces to account for nearby targets and related groups while calibrating distributions. It claims that extensive experiments across computer vision, environmental sensing, and LLM regression datasets verify superior performance of the proposed strategies.

Significance. If the calibration approach holds, the work would address a genuine gap by extending spurious-correlation analysis to continuous regression, a setting common in deployed systems. The emphasis on intrinsic differences between classification and regression shortcuts, together with the call for new benchmarks, could usefully orient future research.

major comments (2)

[Abstract] The central generalization claim rests on the assumption that spurious attributes exhibit sufficient similarity in both label and feature spaces to justify cross-attribute calibration and accounting for nearby targets/related groups. No formal condition, bound, or derivation is supplied establishing that such similarities are guaranteed to exist, are sufficient for the required calibration, or survive the continuous nature of the targets.
[Abstract] The claim that 'extensive experiments ... verify the superior performance' is presented without any quantitative results, error bars, dataset statistics, ablation controls, or baseline comparisons, so the soundness of the generalization claim cannot be assessed from the provided text.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on the abstract. We address the two major comments point by point below.

Referee: [Abstract] The central generalization claim rests on the assumption that spurious attributes exhibit sufficient similarity in both label and feature spaces to justify cross-attribute calibration and accounting for nearby targets/related groups. No formal condition, bound, or derivation is supplied establishing that such similarities are guaranteed to exist, are sufficient for the required calibration, or survive the continuous nature of the targets.

Authors: The calibration approach is motivated by the continuous nature of regression targets, where attributes with nearby label values tend to share related feature representations, enabling similarity-based cross-attribute calibration. We agree that no formal bound or derivation is provided. In revision we will add an explicit discussion of the assumptions (including when similarity may be insufficient) and the empirical conditions under which the method is intended to apply.
revision: partial
Referee: [Abstract] The claim that 'extensive experiments ... verify the superior performance' is presented without any quantitative results, error bars, dataset statistics, ablation controls, or baseline comparisons, so the soundness of the generalization claim cannot be assessed from the provided text.

Authors: Abstract length constraints typically preclude detailed quantitative reporting. The full manuscript contains the requested quantitative results, error bars, dataset statistics, ablations, and baseline comparisons. We will revise the abstract to include a concise statement of key performance gains and experimental scope.
revision: yes

Circularity Check

0 steps flagged

No circularity: proposal motivated by stated differences without reduction to inputs

full rationale

The paper defines DSR and proposes calibration strategies motivated by intrinsic differences between classification and regression shortcuts. No equations, derivations, or self-citations are presented that reduce the calibration or generalization claims to fitted parameters defined by the same data, or to self-referential definitions. The similarity assumption is invoked as motivation rather than as a load-bearing derivation that collapses by construction. This is a standard non-circular empirical proposal.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 1 invented entity

The central claim rests on the existence of exploitable similarities between spurious attributes and on the premise that distribution calibration across attributes will produce generalization; these are domain assumptions rather than derived quantities.

axioms (2)

domain assumption: Spurious attributes exhibit measurable similarity in both label space and learned feature space that can be exploited for calibration.
Invoked to motivate the proposed strategies for accounting for nearby targets and related groups.
domain assumption: Calibrating label and feature distributions across attributes will reduce reliance on shortcuts and improve test-time generalization to all attribute-label combinations.
Core premise of the method; no formal justification supplied in abstract.

invented entities (1)

Deep Spurious Regression (DSR) · no independent evidence
purpose: Named problem setting for regression with continuous spurious correlations.
Introduced as a definition to fill the gap between classification-focused spurious correlation studies and continuous prediction tasks.

pith-pipeline@v0.9.1-grok · 5724 in / 1385 out tokens · 19934 ms · 2026-06-28T15:56:52.322989+00:00

Reference graph

Works this paper leans on

39 extracted references · 2 linked inside Pith

[1] Yash Akhauri, Xinyu Song, Apiwat Wongpanich, Brian Lewandowski, and Mohamed S. Abdelfattah. Regression language models for code. arXiv preprint arXiv:2509.26476, 2025.

[2] Martin Arjovsky, Léon Bottou, Ishaan Gulrajani, and David Lopez-Paz. Invariant risk minimization. arXiv preprint arXiv:1907.02893, 2019.

[3] Sara Beery, Grant Van Horn, and Pietro Perona. Recognition in terra incognita. In Proceedings of the European Conference on Computer Vision (ECCV), pages 456–473, 2018.

[4] Ingwer Borg and Patrick J.F. Groenen. Modern Multidimensional Scaling: Theory and Applications. Springer, 2005.

[5] Paula Branco, Luís Torgo, and Rita P. Ribeiro. A survey of predictive modelling under imbalanced distributions. ACM Computing Surveys, 49(2):1–50, 2016.

[6] Joy Buolamwini and Timnit Gebru. Gender shades: Intersectional accuracy disparities in commercial gender classification. In Proceedings of the 1st Conference on Fairness, Accountability and Transparency, pages 77–91. PMLR, 2018.

[7] Elliot Creager, Jörn-Henrik Jacobsen, and Richard Zemel. Environment inference for invariant learning. In Proceedings of the 38th International Conference on Machine Learning, volume 139, pages 2189–2200. PMLR, 2021.

[8] Imre Csiszár and Paul C. Shields. Information theory and statistics: A tutorial. Foundations and Trends in Communications and Information Theory, 1(4):417–528, 2004.

[9] Yin Cui, Menglin Jia, Tsung-Yi Lin, Yang Song, and Serge Belongie. Class-balanced loss based on effective number of samples, 2019.

[10] John C. Duchi and Hongseok Namkoong. Learning models with uniform performance via distributionally robust optimization. The Annals of Statistics, 49(3):1378–1406, 2021.

[11] Yaroslav Ganin, Evgeniya Ustinova, Hana Ajakan, Pascal Germain, Hugo Larochelle, Francois Laviolette, Mario Marchand, and Victor Lempitsky. Domain-adversarial training of neural networks. Journal of Machine Learning Research, 17(59):1–35, 2016.

[12] Robert Geirhos, Jörn-Henrik Jacobsen, Claudio Michaelis, Richard Zemel, Wieland Brendel, Matthias Bethge, and Felix A. Wichmann. Shortcut learning in deep neural networks. Nature Machine Intelligence, 2(11):665–673, 2020.

[13] Robert Geirhos, Patricia Rubisch, Claudio Michaelis, Matthias Bethge, Felix A. Wichmann, and Wieland Brendel. ImageNet-trained CNNs are biased towards texture; increasing shape bias improves accuracy and robustness. In International Conference on Learning Representations, 2019.

[14] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 770–778, 2016.

[15] Gregory Holste, Yiliang Zhou, Song Wang, Ajay Jaiswal, Mingquan Lin, Sherry Zhuge, Yuzhe Yang, Dongkyun Kim, Trong-Hieu Nguyen-Mau, Minh-Triet Tran, et al. Towards long-tailed, multi-label disease classification from chest x-ray: Overview of the CXR-LT challenge. Medical Image Analysis, 97:103224, 2024.

[16] Zhiwen Hu, Zixuan Bai, Yuzhe Yang, Zijie Zheng, Kaigui Bian, and Lingyang Song. UAV aided aerial-ground IoT for air quality sensing in smart city: Architecture, technologies, and implementation. IEEE Network, 33(2):14–22, 2019.

[17] Polina Kirichenko, Pavel Izmailov, and Andrew Gordon Wilson. Last layer re-training is sufficient for robustness to spurious correlations. In International Conference on Learning Representations, 2023.

[18] Pang Wei Koh, Shiori Sagawa, Henrik Marklund, Sang Michael Xie, Marvin Zhang, Akshay Balsubramani, Weihua Hu, Michihiro Yasunaga, Richard Lanas Phillips, Irena Gao, Tony Lee, Etienne David, Ian Stavness, Wei Guo, Berton Earnshaw, Imran Haque, Sara M. Beery, Jure Leskovec, Anshul Kundaje, Emma Pierson, Sergey Levine, Chelsea Finn, and Percy Liang. WILDS: ... 2021.

[19] Yann LeCun, Léon Bottou, Yoshua Bengio, and Patrick Haffner. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278–2324, 1998.

[20] Evan Z. Liu, Behzad Haghgoo, Annie S. Chen, Aditi Raghunathan, Pang Wei Koh, Shiori Sagawa, Percy Liang, and Chelsea Finn. Just train twice: Improving group robustness without training group information. In Proceedings of the 38th International Conference on Machine Learning, volume 139, pages 6781–6792. PMLR, 2021.

[21] Radu Paul Mihail, Scott Workman, Zach Bessinger, and Nathan Jacobs. Sky segmentation in the wild: An empirical study. In IEEE Winter Conference on Applications of Computer Vision (WACV), pages 1–6, 2016.

[22] Junhyun Nam, Hyuntak Cha, Sungsoo Ahn, Jaeho Lee, and Jinwoo Shin. Learning from failure: Training debiased classifier from biased classifier. In Advances in Neural Information Processing Systems, volume 33, pages 20673–20684. Curran Associates, Inc., 2020.

[23] Gabriel Peyré and Marco Cuturi. Computational optimal transport. Foundations and Trends in Machine Learning, 11(5–6):355–607, 2019.

[24] Ruchir Puri, David S. Kung, Geert Janssen, Wei Zhang, Giacomo Domeniconi, Vladimir Zolotov, Julian Dolby, Jie Chen, Mahima Choudhury, Lindsey Decker, et al. Project CodeNet: A large-scale AI for code dataset for learning a diversity of coding tasks. arXiv preprint arXiv:2105.12655, 2021.

[25] Jiawei Ren, Mingyuan Zhang, Cunjun Yu, and Ziwei Liu. Balanced MSE for imbalanced visual regression. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 7926–7935, 2022.

[26] Yossi Rubner, Carlo Tomasi, and Leonidas J. Guibas. The earth mover's distance as a metric for image retrieval. International Journal of Computer Vision, 40(2):99–121, 2000.

[27] Shiori Sagawa, Pang Wei Koh, Tatsunori B. Hashimoto, and Percy Liang. Distributionally robust neural networks for group shifts: On the importance of regularization for worst-case generalization. In International Conference on Learning Representations, 2020.

[28] Harshay Shah, Kaustav Tamuly, Aditi Raghunathan, Prateek Jain, and Praneeth Netrapalli. The pitfalls of simplicity bias in neural networks. In Advances in Neural Information Processing Systems, volume 33, pages 9573–9585, 2020.

[29] Vladimir N. Vapnik. Statistical Learning Theory. Wiley, 1998.

[30] Cédric Villani. Optimal Transport: Old and New, volume 338 of Grundlehren der mathematischen Wissenschaften. Springer, Berlin, Heidelberg, 2009.

[31] Yuzhe Yang, Xin Liu, Jiang Wu, Silviu Borac, Dina Katabi, Ming-Zher Poh, and Daniel McDuff. SimPer: Simple self-supervised learning of periodic targets. In International Conference on Learning Representations (ICLR), 2023.

[32] Yuzhe Yang, Hao Wang, and Dina Katabi. On multi-domain long-tailed recognition, imbalanced domain generalization and beyond. In Proceedings of the European Conference on Computer Vision (ECCV), pages 57–74. Springer, 2022.

[33] Yuzhe Yang, Kaiwen Zha, Ying-Cong Chen, Hao Wang, and Dina Katabi. Delving into deep imbalanced regression. In International Conference on Machine Learning (ICML), 2021.

[34] Yuzhe Yang, Haoran Zhang, Judy W Gichoya, Dina Katabi, and Marzyeh Ghassemi. The limits of fair medical imaging AI in real-world generalization. Nature Medicine, 30(10):2838–2848, 2024.

[35] Yuzhe Yang, Haoran Zhang, Dina Katabi, and Marzyeh Ghassemi. Change is hard: A closer look at subpopulation shift. In International Conference on Machine Learning (ICML), 2023.

[36] Kaiwen Zha, Peng Cao, Jeany Son, Yuzhe Yang, and Dina Katabi. Rank-N-Contrast: Learning continuous representations for regression. In Advances in Neural Information Processing Systems, volume 36. Curran Associates, Inc., 2023.

[37] Biao Zhang, Fedor Moiseev, Joshua Ainslie, Paul Suganthan, Min Ma, Surya Bhupatiraju, Fede Lebron, Orhan Firat, Armand Joulin, and Zhe Dong. Encoder-decoder Gemma: Improving the quality-efficiency trade-off via adaptation, 2025.

[38] Michael Zhang, Nimit Sharad Sohoni, Hongyang R. Zhang, Chelsea Finn, and Christopher Ré. Correct-n-contrast: A contrastive approach for improving robustness to spurious correlations. In Proceedings of the 39th International Conference on Machine Learning, volume 162, pages 26484–26516. PMLR, 2022.

[39] Zhifei Zhang, Yang Song, and Hairong Qi. Age progression/regression by conditional adversarial autoencoder. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 5810–5818, 2017.
