A Controlled Study of CLIP-Based Body-Scene Fusion for Emotion Recognition in Context

Muhammad Umair; Muqaddas Hameed; Zubair Abbas

arxiv: 2606.22072 · v2 · pith:OP7BCOXYnew · submitted 2026-06-20 · 💻 cs.CV

Pith reviewed 2026-06-26 12:39 UTC · model grok-4.3

keywords: emotion recognition, context awareness, CLIP, EMOTIC, two-stream model, scene context, body pose

The pith

A clean two-stream body and CLIP scene model reaches 34.52 percent mAP on EMOTIC and none of the tested context adjustments improve it.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper tests whether context-debiasing or rare-class training steps still add value once a CLIP scene encoder is already present in an image-only emotion model. It compares a baseline two-stream network against four simplified variants under one shared pipeline on the EMOTIC test split. The baseline records the highest score, indicating that broad scene semantics from CLIP already capture much of the needed context. Errors remain concentrated in rare and subtle emotion classes.

Core claim

On the EMOTIC test split, a ResNet-18 body stream fused with a CLIP ViT-B/16 scene stream achieves 34.52 percent mAP over the 26 categorical emotions while also regressing valence, arousal, and dominance; simplified CCIM-style intervention, CLEF-lite context-bias subtraction, ASL tuning, and class-balanced sampling each fail to raise this score when run under identical training conditions.
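Since the claim rests entirely on one mAP figure, it helps to be concrete about what that metric measures. The sketch below computes non-interpolated average precision per category and averages it over categories. It is an illustration only: the paper's actual evaluation script is not shown on this page, and the function names and toy inputs are hypothetical.

```python
from typing import List, Sequence


def average_precision(scores: Sequence[float], labels: Sequence[int]) -> float:
    """Average precision for one category: mean precision at the rank of each positive."""
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)
    hits = 0
    precision_sum = 0.0
    for rank, (_, label) in enumerate(ranked, start=1):
        if label == 1:
            hits += 1
            precision_sum += hits / rank
    # A category with no positives contributes 0.0 here (a convention, not the paper's).
    return precision_sum / hits if hits else 0.0


def mean_average_precision(score_matrix: List[List[float]],
                           label_matrix: List[List[int]]) -> float:
    """Rows are samples, columns are categories (26 for EMOTIC)."""
    num_classes = len(score_matrix[0])
    per_class = []
    for c in range(num_classes):
        class_scores = [row[c] for row in score_matrix]
        class_labels = [row[c] for row in label_matrix]
        per_class.append(average_precision(class_scores, class_labels))
    return sum(per_class) / len(per_class)
```

Because every category counts equally in the average, a handful of rare categories with poor ranking can hold the overall score down even when frequent categories are ranked well.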

What carries the argument

Two-stream fusion of a ResNet-18 body-crop encoder with a CLIP ViT-B/16 full-image scene encoder, followed by a shared prediction head for categorical and continuous emotion labels.
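The data flow described above can be sketched in a few lines. This is a toy under stated assumptions, not the authors' implementation: the two encoders are replaced by random stand-in feature vectors, fusion is assumed to be plain concatenation (the page only says "fused"), the 512-dimensional feature sizes are assumptions, and `make_head` and `fuse_and_predict` are hypothetical names.

```python
import random
from typing import List, Sequence, Tuple

NUM_CATEGORIES = 26  # categorical emotion labels, per the paper
NUM_VAD = 3          # valence, arousal, dominance
BODY_DIM = 512       # assumption: pooled ResNet-18 feature size
SCENE_DIM = 512      # assumption: CLIP ViT-B/16 image embedding size

Head = Tuple[List[List[float]], List[float]]


def make_head(in_dim: int, out_dim: int, rng: random.Random) -> Head:
    """Random weights standing in for trained parameters (illustration only)."""
    weight = [[rng.uniform(-0.05, 0.05) for _ in range(in_dim)] for _ in range(out_dim)]
    bias = [0.0] * out_dim
    return weight, bias


def fuse_and_predict(body_feat: Sequence[float], scene_feat: Sequence[float],
                     head: Head) -> Tuple[List[float], List[float]]:
    """Concatenate the two stream features and apply one shared linear head."""
    fused = list(body_feat) + list(scene_feat)
    weight, bias = head
    out = [sum(w * x for w, x in zip(row, fused)) + b for row, b in zip(weight, bias)]
    # The first 26 outputs are category logits, the last 3 are VAD values.
    return out[:NUM_CATEGORIES], out[NUM_CATEGORIES:]


rng = random.Random(0)
head = make_head(BODY_DIM + SCENE_DIM, NUM_CATEGORIES + NUM_VAD, rng)
body_feat = [rng.gauss(0.0, 1.0) for _ in range(BODY_DIM)]    # stand-in for the body-crop encoder
scene_feat = [rng.gauss(0.0, 1.0) for _ in range(SCENE_DIM)]  # stand-in for the full-image encoder
category_logits, vad = fuse_and_predict(body_feat, scene_feat, head)
```

The point of the sketch is only that both label types are read off the same fused vector, so any change to the scene stream affects the categorical and the continuous predictions together.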

If this is right

CLIP scene features already supply sufficient context semantics so that further explicit debiasing steps become redundant under the tested conditions.
Performance of this architecture is now limited by label sparsity rather than by missing scene information.
Next gains are more likely to come from modeling label co-occurrence or finer subject-context interaction than from additional bias-correction modules.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

The result suggests CLIP pretraining already mitigates many scene-bias problems that earlier methods tried to fix post hoc.
Similar controlled studies could test whether the same pattern holds when the body stream is also upgraded to a vision-language encoder.
Label-relationship modeling may be a higher-leverage direction than further architectural tweaks to context fusion.

Load-bearing premise

That the four simplified interventions are fair stand-ins for the original published methods and that the training pipeline is otherwise identical across all runs.

What would settle it

A controlled re-run in which any one of the four variants exceeds 34.52 percent mAP while keeping the same CLIP backbone, data splits, and evaluation protocol.

Original abstract

Apparent emotion in natural images is often not visible from the face alone. The face may be small, hidden, or neutral, while posture and scene context carry much of the evidence. This work studies context-aware emotion recognition on EMOTIC with an image-only two-stream model. A ResNet-18 body stream encodes the target-person crop, and a CLIP ViT-B/16 scene stream encodes the full image. The fused feature predicts 26 categorical emotion labels and the continuous valence, arousal, and dominance values. This study examines whether small context-debiasing or rare-class training changes still help after adding a CLIP scene encoder. The clean two-stream model is compared with simplified CCIM-style intervention, CLEF-lite context-bias subtraction, ASL tuning, and class-balanced sampling under the same implementation pipeline. No tested variant improves over the clean two-stream model, which achieves 34.52% mAP on the EMOTIC test split. CLIP gives the model broad scene semantics, but the simplified causal, counterfactual, and rare-class changes do not automatically improve performance. Most remaining errors are in rare and subtle emotion categories, so the next step should focus on label relationships and finer subject-context interaction.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

The paper's main result is that a plain two-stream CLIP + ResNet model hits 34.52% mAP on EMOTIC and none of the four simplified interventions beat it under a shared pipeline.

The central claim is straightforward: once you add a CLIP ViT-B/16 scene encoder to a ResNet-18 body stream, the clean fusion already reaches 34.52% mAP on the EMOTIC test set, and the tested variants (simplified CCIM-style intervention, CLEF-lite bias subtraction, ASL tuning, class-balanced sampling) add nothing under identical conditions.
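For readers unfamiliar with the ASL variant, the following is a minimal per-sample sketch of the asymmetric loss for multi-label classification (Ridnik et al., 2021): positives get a focal-style factor, while negatives are probability-shifted by a margin and down-weighted more aggressively. The default values for `gamma_pos`, `gamma_neg`, and `margin` are illustrative assumptions; this page does not say which settings the paper tuned.

```python
import math
from typing import Sequence


def asymmetric_loss(probs: Sequence[float], targets: Sequence[int],
                    gamma_pos: float = 0.0, gamma_neg: float = 4.0,
                    margin: float = 0.05, eps: float = 1e-8) -> float:
    """Sum of asymmetric loss terms over the labels of one sample."""
    total = 0.0
    for p, y in zip(probs, targets):
        if y == 1:
            # Positive label: focal-style weight (1 - p)^gamma_pos on the log-likelihood.
            total += -((1.0 - p) ** gamma_pos) * math.log(max(p, eps))
        else:
            # Negative label: shift the probability down by the margin, then
            # weight by p_m^gamma_neg so easy negatives contribute almost nothing.
            p_m = max(p - margin, 0.0)
            total += -(p_m ** gamma_neg) * math.log(max(1.0 - p_m, eps))
    return total
```

With `gamma_pos=0`, `gamma_neg=0`, and `margin=0` the expression reduces to ordinary binary cross-entropy, which is a convenient sanity check when comparing a tuned ASL run against a plain baseline.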

What the work actually contributes is a controlled negative result. By freezing the rest of the pipeline and only swapping the training or feature adjustments, it isolates whether those earlier ideas still deliver gains when the scene stream is already strong. That design choice is useful for the subfield; it shows that broad scene semantics from CLIP can make some context-debiasing steps redundant, at least in this setup. The note that remaining errors cluster in rare and subtle classes also gives a clear direction for follow-up.

The soft spot is the fidelity of the simplifications. The result only holds if the CCIM-style, CLEF-lite, ASL, and sampling versions are close enough to the originals that a null finding is informative. If important pieces (full causal graphs, precise bias estimation, or original re-sampling ratios) were left out, the experiment mainly demonstrates that these particular approximations do not help. The abstract states the mAP number cleanly, but any referee would want the full implementation details and checks against the source papers' reported behavior on the same split.
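To make the fidelity concern concrete, here is one plausible reading of "context-bias subtraction": the logits of a context-only branch are scaled and subtracted from the full model's logits at inference time. This is an assumption for illustration, not the paper's CLEF-lite definition (which this page does not give); `debias_logits` and `alpha` are hypothetical names.

```python
import math
from typing import List, Sequence


def sigmoid(x: float) -> float:
    """Logistic function mapping a logit to a probability."""
    return 1.0 / (1.0 + math.exp(-x))


def debias_logits(full_logits: Sequence[float], context_logits: Sequence[float],
                  alpha: float = 1.0) -> List[float]:
    """Subtract a scaled context-only prediction from the full prediction, per category."""
    return [f - alpha * c for f, c in zip(full_logits, context_logits)]


def debiased_probs(full_logits: Sequence[float], context_logits: Sequence[float],
                   alpha: float = 1.0) -> List[float]:
    """Per-category probabilities after the subtraction."""
    return [sigmoid(z) for z in debias_logits(full_logits, context_logits, alpha)]
```

Even in this reduced form there are choices (how the context-only branch is trained, how `alpha` is set) that could decide whether the step helps, which is why a null result for a simplified variant says little about the full method.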

This is for people already working on context-aware emotion recognition with modern encoders. It is narrow but honest empirical work that can prevent wasted effort on tweaks that no longer move the metric. I would send it for peer review; the controlled comparison is worth referee time even if the simplifications need tighter justification.

Referee Report

1 major / 2 minor

Summary. The manuscript presents a controlled empirical study of context-aware emotion recognition on the EMOTIC dataset using an image-only two-stream model. A ResNet-18 encodes the body crop and a CLIP ViT-B/16 encodes the full scene image; their fused features predict 26 emotion categories and continuous VAD values. The clean two-stream model achieves 34.52% mAP. The study compares this baseline to four variants using simplified versions of CCIM-style intervention, CLEF-lite context-bias subtraction, ASL tuning, and class-balanced sampling, all under the same pipeline, and reports that none of the variants improves upon the clean model. The authors conclude that CLIP already provides broad scene semantics and that these context-debiasing or rare-class adjustments do not automatically yield gains, with most errors remaining in rare and subtle categories.
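As a point of reference for the class-balanced sampling variant, the sketch below shows one common way to build such a sampler for multi-label data: each training sample is weighted by the inverse frequency of its rarest positive label, and indices are then drawn with replacement according to those weights. The weighting rule is an assumption chosen for illustration; the manuscript's exact scheme is not described on this page.

```python
import random
from collections import Counter
from typing import List, Sequence


def class_balanced_weights(label_sets: Sequence[Sequence[int]]) -> List[float]:
    """Per-sample weight: inverse frequency of the sample's rarest positive label."""
    counts = Counter(label for labels in label_sets for label in labels)
    weights = []
    for labels in label_sets:
        if labels:
            weights.append(max(1.0 / counts[label] for label in labels))
        else:
            # Samples with no positive label get the smallest weight in use.
            weights.append(1.0 / max(counts.values(), default=1))
    return weights


def sample_indices(weights: Sequence[float], k: int, seed: int = 0) -> List[int]:
    """Draw k sample indices with replacement, proportional to the weights."""
    rng = random.Random(seed)
    return rng.choices(range(len(weights)), weights=weights, k=k)
```

A sampler of this kind raises how often rare-category images are seen, but it also repeats the frequent labels that co-occur on those images, which is one reason it may not help on a multi-label benchmark.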

Significance. If the empirical findings hold, the work demonstrates the strength of CLIP-based scene encoding for emotion recognition in context and provides a useful controlled comparison showing that additional debiasing techniques may be unnecessary once a strong scene encoder is used. The shared implementation pipeline across variants is a strength, as is the focus on remaining challenges in rare classes. This could guide future research toward label relationships and finer subject-context modeling rather than broad context debiasing.

major comments (1)

[Methods / Variant descriptions] The central claim that none of the four variants improves over the clean two-stream model (34.52% mAP) depends on the simplified CCIM-style, CLEF-lite, ASL, and class-balanced sampling being adequate stand-ins for the original methods. The manuscript does not appear to include a fidelity check, such as reproducing the source papers' reported metrics on EMOTIC or detailing which components (e.g., full causal graph or counterfactual sampling in CCIM) were omitted. Without this, the result shows only that these particular approximations add no value, not that the underlying ideas are inert in the presence of CLIP.

minor comments (2)

[Abstract] The abstract reports the specific 34.52% mAP value but provides no details on experimental setup, baselines, or statistical significance.
[Results] Reporting standard deviations across runs or statistical significance tests for the mAP comparisons would strengthen the claim that variants show no improvement.
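The second minor comment could be addressed with very little machinery. The sketch below summarizes per-seed scores as mean and sample standard deviation and computes a Welch t-statistic between a variant and the baseline. The function names are hypothetical and no p-value is computed; any inputs used with it would have to come from actual repeated runs, which this page does not report.

```python
import math
import statistics
from typing import Sequence, Tuple


def summarize_runs(scores: Sequence[float]) -> Tuple[float, float]:
    """Mean and sample standard deviation of a metric across seeds."""
    mean = statistics.mean(scores)
    std = statistics.stdev(scores) if len(scores) > 1 else 0.0
    return mean, std


def welch_t(variant: Sequence[float], baseline: Sequence[float]) -> float:
    """Welch t-statistic for the difference in means (variant minus baseline)."""
    mean_v, std_v = summarize_runs(variant)
    mean_b, std_b = summarize_runs(baseline)
    se = math.sqrt(std_v ** 2 / len(variant) + std_b ** 2 / len(baseline))
    if se == 0.0:
        # No spread in either group: the statistic is degenerate.
        return 0.0 if mean_v == mean_b else math.copysign(math.inf, mean_v - mean_b)
    return (mean_v - mean_b) / se
```

A statistic near zero across several seeds would support the "no improvement" reading far better than a single-run comparison can.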

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the constructive feedback and the recommendation for major revision. We address the single major comment below regarding the fidelity of the variant implementations.

Referee: [Methods / Variant descriptions] The central claim that none of the four variants improves over the clean two-stream model (34.52% mAP) depends on the simplified CCIM-style, CLEF-lite, ASL, and class-balanced sampling being adequate stand-ins for the original methods. The manuscript does not appear to include a fidelity check, such as reproducing the source papers' reported metrics on EMOTIC or detailing which components (e.g., full causal graph or counterfactual sampling in CCIM) were omitted. Without this, the result shows only that these particular approximations add no value, not that the underlying ideas are inert in the presence of CLIP.

Authors: We agree that the study employs simplified adaptations of the original methods to integrate with the controlled CLIP two-stream pipeline, and that no direct fidelity reproduction of the source papers' EMOTIC results was performed. Our objective was to test whether core elements of these approaches provide gains when added to a strong CLIP scene encoder under a shared implementation, rather than to replicate the full original pipelines. We will revise the manuscript to explicitly detail the omitted components (e.g., full causal graph or counterfactual sampling in the CCIM-style variant) and to qualify the conclusions as applying specifically to these adaptations. This clarification will be added to the methods and discussion sections.

Revision: yes

Circularity Check

0 steps flagged

No circularity: purely empirical comparison with no derivations or self-referential reductions

The paper is an empirical ablation study on EMOTIC that trains and evaluates several model variants (clean two-stream CLIP fusion vs. simplified CCIM-style, CLEF-lite, ASL, class-balanced sampling) and reports mAP numbers. No equations, predictions, or first-principles derivations are present that could reduce to their own inputs. The central claim rests on measured performance differences under a shared pipeline, not on any fitted parameter being renamed as a prediction or on a self-citation chain that substitutes for evidence. External benchmarks (EMOTIC test split) are used directly; the study is therefore self-contained and scores at the low end of the allowed range.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

This is an empirical machine learning paper with no mathematical axioms or invented entities described in the abstract. Free parameters would be model hyperparameters but none are specified.

pith-pipeline@v0.9.1-grok · 5755 in / 1167 out tokens · 25310 ms · 2026-06-26T12:39:29.662760+00:00 · methodology
