arxiv: 1907.02336 · v1 · pith:AVYBAF7Anew · submitted 2019-07-04 · 💻 cs.CV

Deep Saliency Models : The Quest For The Loss Function

Pith reviewed 2026-05-25 09:17 UTC · model grok-4.3

classification 💻 cs.CV
keywords deep saliency · loss functions · saliency prediction · deep learning · visual attention · neural networks · evaluation metrics · combined loss

The pith

A linear combination of several loss functions improves deep saliency model performance across datasets and architectures.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper tests how the choice of loss function affects training of neural networks that predict visual saliency maps from images. On one fixed network, swapping the loss raises or lowers benchmark scores by noticeable amounts. The authors introduce several loss functions not previously applied to saliency prediction and show that a linear combination of well-chosen ones yields higher performance than any individual loss. The same combined loss improves results on a second network architecture and on multiple datasets, indicating the improvement is not tied to one specific setup.

Core claim

On a fixed network architecture, modifying the loss function can significantly improve or degrade the results. A linear combination of several well-chosen loss functions leads to significant improvements in performance on different datasets as well as on a different network architecture, demonstrating the robustness of a combined metric.

What carries the argument

Linear combination of multiple loss functions used to train a saliency prediction network.
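A minimal sketch of such a combination, using the KLD, CC and NSS terms that the paper's first combination names. The weights, the sign convention (CC and NSS are subtracted because higher values are better), the epsilon constant and all function names are assumptions made for illustration; they are not the authors' implementation. Maps are flattened to plain Python lists to keep the sketch dependency-free; an actual training loss would be written with a tensor library that supports automatic differentiation.

```python
import math

EPS = 1e-7  # small constant to avoid division by zero and log(0)


def _normalize_sum(values):
    """Scale non-negative values so they sum to (approximately) 1."""
    total = sum(values)
    return [v / (total + EPS) for v in values]


def _standardize(values):
    """Shift and scale values to zero mean and unit standard deviation."""
    mean = sum(values) / len(values)
    var = sum((v - mean) ** 2 for v in values) / len(values)
    std = math.sqrt(var)
    return [(v - mean) / (std + EPS) for v in values]


def kld(pred, target):
    """Kullback-Leibler divergence between target and predicted distributions."""
    p = _normalize_sum(pred)
    q = _normalize_sum(target)
    return sum(qi * math.log(EPS + qi / (pi + EPS)) for pi, qi in zip(p, q))


def cc(pred, target):
    """Pearson linear correlation coefficient between two maps, in [-1, 1]."""
    p = _standardize(pred)
    q = _standardize(target)
    return sum(pi * qi for pi, qi in zip(p, q)) / len(p)


def nss(pred, fixations):
    """Mean standardized prediction at fixated pixels (fixations is a 0/1 mask)."""
    p = _standardize(pred)
    hits = [pi for pi, f in zip(p, fixations) if f]
    return sum(hits) / len(hits)


def combined_loss(pred, target, fixations, w_kld=1.0, w_cc=1.0, w_nss=1.0):
    """Weighted sum of the three terms; the default weights are placeholders."""
    return (w_kld * kld(pred, target)
            - w_cc * cc(pred, target)
            - w_nss * nss(pred, fixations))
```

With this convention a prediction that matches the ground truth drives the KLD term toward 0 and the CC term toward 1, so a lower combined value indicates a better map.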

If this is right

  • Changing only the loss function on one network can raise or lower saliency prediction scores on standard datasets.
  • Loss functions not previously used for saliency can contribute usefully when included.
  • A single combined loss outperforms any of its component losses alone.
  • The same loss combination improves results when transferred to a different network architecture.
  • The performance lift holds across multiple datasets.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Loss selection may be treated as an additional hyper-parameter to tune rather than a fixed choice.
  • The same blending approach could be tested on other pixel-wise prediction tasks such as semantic segmentation.
  • An automated search over loss weights might discover even stronger combinations than the hand-chosen ones reported.
  • If the gains survive stricter controls on all other training variables, loss design would become a primary lever for model improvement.
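The automated weight search mentioned above could be sketched as an exhaustive grid search. Everything here is hypothetical: `evaluate` stands for a user-supplied callback that would train a model with the given weights and return a validation score (higher is better), and the loss names and candidate values are placeholders, not settings from the paper.

```python
import itertools


def grid_search_weights(loss_names, candidates, evaluate):
    """Return (best_weights, best_score) over the Cartesian grid of weights.

    evaluate is a hypothetical callback: it receives a dict mapping each loss
    name to its weight and returns a validation score where higher is better.
    """
    best_weights, best_score = None, float("-inf")
    for combo in itertools.product(candidates, repeat=len(loss_names)):
        weights = dict(zip(loss_names, combo))
        if not any(weights.values()):
            continue  # skip the all-zero combination: it defines no loss
        score = evaluate(weights)
        if score > best_score:
            best_weights, best_score = weights, score
    return best_weights, best_score
```

The grid grows exponentially with the number of losses, so for more than a handful of terms a random or Bayesian search would likely be more practical; the score should also be measured on held-out validation data rather than on the reported test sets.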

Load-bearing premise

That measured performance differences are produced by the loss functions themselves rather than by interactions with hyper-parameters or training settings that stayed fixed only within each experiment.

What would settle it

Retraining the identical architectures with the reported loss combinations while deliberately varying optimizer settings or data preprocessing to match the single-loss baselines and checking whether the gains disappear.

Original abstract

Recent advances in deep learning have pushed the performances of visual saliency models way further than it has ever been. Numerous models in the literature present new ways to design neural networks, to arrange gaze pattern data, or to extract as much high and low-level image features as possible in order to create the best saliency representation. However, one key part of a typical deep learning model is often neglected: the choice of the loss function. In this work, we explore some of the most popular loss functions that are used in deep saliency models. We demonstrate that on a fixed network architecture, modifying the loss function can significantly improve (or depreciate) the results, hence emphasizing the importance of the choice of the loss function when designing a model. We also introduce new loss functions that have never been used for saliency prediction to our knowledge. And finally, we show that a linear combination of several well-chosen loss functions leads to significant improvements in performances on different datasets as well as on a different network architecture, hence demonstrating the robustness of a combined metric.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The manuscript explores the impact of loss function choice on deep visual saliency prediction. It reports that, on a fixed network architecture, swapping or combining loss functions produces substantial performance changes across standard saliency benchmarks. New loss functions are introduced, and a linear combination of selected losses is shown to improve results on multiple datasets and on a second architecture.

Significance. If the reported gains can be attributed unambiguously to the loss functions, the work would usefully draw attention to an under-examined design choice in saliency modeling and supply a practical recipe for combining losses that appears robust across datasets and architectures. The cross-dataset and cross-architecture evaluation is a constructive element of the study.

major comments (3)
  1. [Experimental protocol / Methods] The central claim—that performance differences are caused by the loss functions themselves—requires that optimizer, learning-rate schedule, batch size, epoch count, weight initialization, and data-augmentation pipeline remained bitwise identical for every loss variant. The manuscript states that a fixed network is used but supplies no explicit confirmation that these other training factors were held constant; without that assurance the attribution of gains to the loss functions is not yet secured.
  2. [Results and tables] No error bars, standard deviations across multiple runs, or statistical significance tests are reported for any of the performance deltas. Consequently the magnitude and reliability of the claimed improvements (both single-loss and combined-loss) cannot be assessed from the presented data.
  3. [Cross-architecture experiments] When the linear-combination result is extended to a second architecture, the manuscript does not state whether the hyper-parameter settings (including any re-tuning) were identical to those used for the first architecture. This leaves open the possibility that part of the reported gain arises from architecture-specific optimization rather than from the loss combination alone.
minor comments (2)
  1. [Abstract] The abstract contains minor grammatical awkwardness ('way further than it has ever been').
  2. [Title] Title capitalization is inconsistent ('The Quest For The Loss Function').

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive feedback emphasizing the need for unambiguous attribution of performance gains to loss functions. We address each major comment below and will revise the manuscript to incorporate clarifications and additional analyses where appropriate.

Point-by-point responses
  1. Referee: [Experimental protocol / Methods] The central claim—that performance differences are caused by the loss functions themselves—requires that optimizer, learning-rate schedule, batch size, epoch count, weight initialization, and data-augmentation pipeline remained bitwise identical for every loss variant. The manuscript states that a fixed network is used but supplies no explicit confirmation that these other training factors were held constant; without that assurance the attribution of gains to the loss functions is not yet secured.

    Authors: We confirm that the optimizer (Adam), learning-rate schedule, batch size, epoch count, weight initialization, and data-augmentation pipeline were held identical across all loss variants on the fixed architecture. The manuscript's emphasis on a 'fixed network' was intended to convey this, but we acknowledge the lack of explicit wording. In the revised manuscript we will add a dedicated sentence in Section 3 (Experimental Setup) stating that all non-loss training factors remained bitwise identical. revision: yes

  2. Referee: [Results and tables] No error bars, standard deviations across multiple runs, or statistical significance tests are reported for any of the performance deltas. Consequently the magnitude and reliability of the claimed improvements (both single-loss and combined-loss) cannot be assessed from the presented data.

    Authors: The observation is correct; the current tables report single-run results. We will add standard deviations computed over three independent runs with different random seeds and include paired t-test p-values for the key deltas (single-loss vs. baseline and combined-loss vs. best single loss) in the revised tables and text. revision: yes

  3. Referee: [Cross-architecture experiments] When the linear-combination result is extended to a second architecture, the manuscript does not state whether the hyper-parameter settings (including any re-tuning) were identical to those used for the first architecture. This leaves open the possibility that part of the reported gain arises from architecture-specific optimization rather than from the loss combination alone.

    Authors: The same hyper-parameter values (including the loss weights) were transferred without re-tuning to the second architecture. We will insert an explicit statement in the cross-architecture subsection clarifying that no architecture-specific hyper-parameter search was performed for the loss combination. revision: yes
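The multi-seed comparison promised in response 2 can be sketched as a paired t-statistic over per-seed scores. This is an illustrative helper, not the authors' analysis code: the function name is an assumption, and the sketch returns only the statistic and degrees of freedom, since the Python standard library offers no Student-t distribution from which to derive a p-value.

```python
import math
import statistics


def paired_t_statistic(scores_a, scores_b):
    """Return (mean_diff, t, dof) for paired per-seed scores of two settings."""
    if len(scores_a) != len(scores_b) or len(scores_a) < 2:
        raise ValueError("need two equally long lists with at least two runs")
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    mean_diff = statistics.mean(diffs)
    sd = statistics.stdev(diffs)  # sample standard deviation (n - 1)
    if sd == 0:
        raise ValueError("all paired differences are equal; t is undefined")
    t = mean_diff / (sd / math.sqrt(len(diffs)))
    return mean_diff, t, len(diffs) - 1
```

With three seeds there are only two degrees of freedom, so the two-sided 5% critical value is about 4.30; a delta has to be large relative to its run-to-run spread before it counts as significant.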

Circularity Check

0 steps flagged

No significant circularity; empirical claims rest on external benchmarks

Full rationale

The paper performs an empirical comparison of loss functions on fixed network architectures, reporting performance on standard external saliency datasets and benchmarks. No equations, predictions, or first-principles derivations are presented that reduce by construction to fitted inputs or self-citations. Central claims about linear combinations of losses are validated through independent evaluation metrics across datasets and a second architecture, with no load-bearing self-citation chains or self-definitional reductions identified.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 0 invented entities

The central claim rests on empirical training runs; the linear-combination weights constitute free parameters that must be chosen or fitted, and the work inherits standard deep-learning assumptions about gradient-based optimization.

free parameters (1)
  • loss combination weights
    The linear combination of losses requires scalar weights whose values are selected to maximize performance on the reported datasets.
axioms (1)
  • domain assumption: Gradient descent on a neural network can optimize any differentiable loss that compares predicted and ground-truth saliency maps
    Implicit in all experiments that train the same architecture with different losses.

Reference graph

Works this paper leans on

52 extracted references · 52 canonical work pages · 9 internal anchors

  1. [1]

    Deep Saliency Models : The Quest For The Loss Function

    INTRODUCTION Despite decades of research, visual attention mechanisms of humans remain complex to understand and even more complex to model. With the availability of large databases of eye-tracking and mouse movements recorded on images [1, 2], there is now a far better understanding of the perceptual mechanisms. Significant progress has been made in ...

  2. [2]

    We, thus, provide a brief account of relevant works and summarize them in this section

    RELATED WORKS Computational models of saliency prediction, a long-standing problem in computer vision, have been studied from so many perspectives that going through all is beyond the scope of this manuscript. We, thus, provide a brief account of relevant works and summarize them in this section. We refer the readers to [5, 6] for an overview. To dat...

  3. [3]

    Our focus is, however, the second group

    employs support vector machines and [20] uses extreme learning machines. Our focus is, however, the second group. Within end-to-end deep learning techniques, the main research has been on architecture design. Many of the models borrow the pre-trained weights of an image recognition network and experiment combining different layers in various ways. In ...

  4. [4]

    After this presentation, we elaborate on the tested loss functions

    LOSS FUNCTIONS FOR DEEP SALIENCY NETWORK Before delving into the description of loss functions, we present the architecture of the convolutional neural network that will be used throughout this paper. After this presentation, we elaborate on the tested loss functions. 3.1. Proposed baseline architecture Figure 1 presents the overall architecture of the ...

  5. [5]

    This specific combination was chosen because it relies on an existing successful combination and also aggregates the four types of metrics together

    combining KLD, CC and NSS loss functions, and the second one (LC 2) adding Deep Features loss, Gram Matrices loss and sigmoid-weighted MSE. This specific combination was chosen because it relies on an existing successful combination and also aggregates the four types of metrics together. We followed the work of [26] to set the coefficients for the first ...

  6. [6]

    EXPERIMENTS 4.1. Testing protocols To carry out the evaluation, we use seven quality metrics applied on the MIT benchmark [1, 38]: CC (correlation coefficient, CC ∈ [−1, 1]), SIM (similarity, intersection between histograms of saliency, SIM ∈ [0, 1]), AUC (Area Under Curve, AUC ∈ [0, 1]), NSS (Normalized Scanpath Saliency, NSS ∈ ]−∞, +∞[), EMD (Eart...

  7. [7]

    4: Example of good predictions by the combination loss while a single loss makes bad predictions (for SAM-VGG model)

    CONCLUSION In this paper, we introduced a deep neural network which purpose was to evaluate the impact of loss functions on the pre- Fig. 4: Example of good predictions by the combination loss while a single loss makes bad predictions (for SAM-VGG model). (a) original image; (b) Ground truth saliency map; (c) KLD + CC + NSS + DF + GM + SIG-MSE + R com...

  8. [8]

    Mit saliency benchmark,

    Zoya Bylinskii, Tilke Judd, Ali Borji, Laurent Itti, Frédo Durand, Aude Oliva, and Antonio Torralba, “Mit saliency benchmark,” 2015

  9. [9]

    Salicon: Saliency in context,

    M. Jiang, S. Huang, J. Duan, and Q. Zhao, “Salicon: Saliency in context,” in 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2015

  10. [10]

    Saliency Prediction in the Deep Learning Era: Successes, Limitations, and Future Challenges

    Ali Borji, “Saliency prediction in the deep learning era: An empirical investigation,” arXiv preprint arXiv:1810.03716, 2018

  11. [11]

    End-to-end saliency mapping via probability distribution prediction,

    S. Jetley, N. Murray, and E. Vig, “End-to-end saliency mapping via probability distribution prediction,” in 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016

  12. [12]

    State-of-the-art in visual attention modeling,

    A. Borji and L. Itti, “State-of-the-art in visual attention modeling,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 35, no. 1, pp. 185–207, 2013

  13. [13]

    Quantitative analysis of human-model agreement in visual saliency modeling: A comparative study,

    A. Borji, D. N. Sihite, and L. Itti, “Quantitative analysis of human-model agreement in visual saliency modeling: A comparative study,” IEEE Transactions on Image Processing, vol. 22, no. 1, pp. 55–69, Jan 2013

  14. [14]

    A model of saliency-based visual attention for rapid scene analysis,

    L. Itti, C. Koch, and E. Niebur, “A model of saliency-based visual attention for rapid scene analysis,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 20, no. 11, pp. 1254–1259, 1998

  15. [15]

    Saliency based on information maximization,

    Neil D. B. Bruce and John K. Tsotsos, “Saliency based on information maximization,” in Proceedings of the 18th International Conference on Neural Information Processing Systems, 2005

  16. [16]

    Graph-based visual saliency,

    Jonathan Harel, Christof Koch, and Pietro Perona, “Graph-based visual saliency,” in Proceedings of the 19th International Conference on Neural Information Processing Systems, 2006

  17. [17]

    Image signature: Highlighting sparse salient regions,

    X. Hou, J. Harel, and C. Koch, “Image signature: Highlighting sparse salient regions,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 34, no. 1, pp. 194–201, 2012

  18. [18]

    Saliency detection: A boolean map approach,

    J. Zhang and S. Sclaroff, “Saliency detection: A boolean map approach,” in 2013 IEEE International Conference on Computer Vision, 2013

  19. [19]

    Learning saliency-based visual attention: A review,

    Qi Zhao and Christof Koch, “Learning saliency-based visual attention: A review,” Signal Processing, vol. 93, no. 6, pp. 1401–1407, 2013

  20. [20]

    Saliency and human fixations: State-of-the-art and study of comparison metrics,

    Nicolas Riche, Matthieu Duvinage, Matei Mancas, Bernard Gosselin, and Thierry Dutoit, “Saliency and human fixations: State-of-the-art and study of comparison metrics,” in The IEEE International Conference on Computer Vision (ICCV), 2013

  21. [21]

    Analysis of scores, datasets, and models in visual saliency prediction,

    A. Borji, H. R. Tavakoli, D. N. Sihite, and L. Itti, “Analysis of scores, datasets, and models in visual saliency prediction,” in 2013 IEEE International Conference on Computer Vision, 2013

  22. [22]

    A deeper look at saliency: Feature contrast, semantics, and beyond,

    N. D. B. Bruce, C. Catton, and S. Janjic, “A deeper look at saliency: Feature contrast, semantics, and beyond,” in 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016, pp. 516–524

  23. [23]

    Saliency revisited: Analysis of mouse movements versus fixations,

    H. R. Tavakoli, F. Ahmed, A. Borji, and J. Laaksonen, “Saliency revisited: Analysis of mouse movements versus fixations,” in 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2017, pp. 6354–6362

  24. [24]

    Where should saliency models look next?,

    Z. Bylinskii, A. Recasens, A. Borji, A. Oliva, A. Torralba, and F. Durand, “Where should saliency models look next?,” in European Conference on Computer Vision (ECCV), 2016

  25. [25]

    Understanding and Visualizing Deep Visual Saliency Models

    Sen He, Hamed R Tavakoli, Ali Borji, Yang Mi, and Nicolas Pugeault, “Understanding and visualizing deep visual saliency models,” arXiv preprint arXiv:1903.02501, 2019

  26. [26]

    Large-scale optimization of hierarchical features for saliency prediction in natural images,

    E. Vig, M. Dorr, and D. Cox, “Large-scale optimization of hierarchical features for saliency prediction in natural images,” in IEEE Computer Vision and Pattern Recognition (CVPR), 2014

  27. [27]

    Exploiting inter-image similarity and ensemble of extreme learners for fixation prediction using deep features,

    Hamed R. Tavakoli, Ali Borji, Jorma Laaksonen, and Esa Rahtu, “Exploiting inter-image similarity and ensemble of extreme learners for fixation prediction using deep features,” Neurocomput., vol. 244, no. C, pp. 10–18, June 2017

  28. [28]

    Salicon: Reducing the semantic gap in saliency prediction by adapting deep neural networks,

    Xun Huang, Chengyao Shen, Xavier Boix, and Qi Zhao, “Salicon: Reducing the semantic gap in saliency prediction by adapting deep neural networks,” in The IEEE International Conference on Computer Vision (ICCV), 2015

  29. [29]

    Deep gaze i: Boosting saliency prediction with feature maps trained on imagenet,

    M. Kummerer, L. Theis, and M. Bethge, “Deep gaze i: Boosting saliency prediction with feature maps trained on imagenet,” in ICLR Workshop, 2015

  30. [30]

    Understanding low- and high-level contributions to fixation prediction,

    M. Kummerer, T. S. Wallis, L. A. Gatys, and M. Bethge, “Understanding low- and high-level contributions to fixation prediction,” in The IEEE International Conference on Computer Vision (ICCV), 2017

  31. [31]

    A Deep Multi-Level Network for Saliency Prediction,

    Marcella Cornia, Lorenzo Baraldi, Giuseppe Serra, and Rita Cucchiara, “A Deep Multi-Level Network for Saliency Prediction,” in International Conference on Pattern Recognition (ICPR), 2016

  32. [32]

    A deep spatial contextual long-term recurrent convolutional network for saliency detection,

    Nian Liu and Junwei Han, “A deep spatial contextual long-term recurrent convolutional network for saliency detection,” IEEE Transactions on Image Processing, vol. 27, no. 7, pp. 3264–3274, 2018

  33. [33]

    Predicting Human Eye Fixations via an LSTM-based Saliency Attentive Model,

    Marcella Cornia, Lorenzo Baraldi, Giuseppe Serra, and Rita Cucchiara, “Predicting Human Eye Fixations via an LSTM-based Saliency Attentive Model,” IEEE Transactions on Image Processing, vol. 27, no. 10, pp. 5142–5154, 2018

  34. [34]

    EML-NET:An Expandable Multi-Layer NETwork for Saliency Prediction

    Sen Jia, “EML-NET: an expandable multi-layer network for saliency prediction,” CoRR, vol. abs/1805.01047, 2018

  35. [35]

    Information-theoretic model comparison unifies saliency metrics,

    Wallis T. Kuemmerer M. and Bethge M., “Information-theoretic model comparison unifies saliency metrics,” Proceedings of the National Academy of Science, vol. 112, no. 52, pp. 16054–16059, Oct 2015

  36. [36]

    DeepGaze II: Reading fixations from deep features trained on object recognition

    Matthias Kümmerer, Thomas SA Wallis, and Matthias Bethge, “Deepgaze ii: Reading fixations from deep features trained on object recognition,” arXiv preprint arXiv:1610.01563, 2016

  37. [37]

    Geometric loss functions for camera pose regression with deep learning,

    Alex Kendall and Roberto Cipolla, “Geometric loss functions for camera pose regression with deep learning,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2017

  38. [38]

    Perceptual losses for real-time style transfer and super-resolution,

    Justin Johnson, Alexandre Alahi, and Li Fei-Fei, “Perceptual losses for real-time style transfer and super-resolution,” in European Conference on Computer Vision, 2016

  39. [39]

    Focal loss for dense object detection,

    T. Lin, P. Goyal, R. Girshick, K. He, and P. Dollár, “Focal loss for dense object detection,” in 2017 IEEE International Conference on Computer Vision (ICCV), Oct 2017, pp. 2999–3007

  40. [40]

    Very Deep Convolutional Networks for Large-Scale Image Recognition

    Karen Simonyan and Andrew Zisserman, “Very deep convolutional networks for large-scale image recognition,” arXiv preprint arXiv:1409.1556, 2014

  41. [41]

    Rethinking Atrous Convolution for Semantic Image Segmentation

    Liang-Chieh Chen, George Papandreou, Florian Schroff, and Hartwig Adam, “Rethinking atrous convolution for semantic image segmentation,” arXiv preprint arXiv:1706.05587, 2017

  42. [42]

    Learning to predict where humans look,

    Tilke Judd, Krista Ehinger, Frédo Durand, and Antonio Torralba, “Learning to predict where humans look,” in 12th international conference on Computer Vision. IEEE, 2009, pp. 2106–2113

  43. [43]

    SalGAN: Visual Saliency Prediction with Generative Adversarial Networks

    Junting Pan, Cristian Canton, Kevin McGuinness, Noel E O’Connor, Jordi Torres, Elisa Sayrol, and Xavier Giro-i Nieto, “Salgan: Visual saliency prediction with generative adversarial networks,” arXiv preprint arXiv:1701.01081, 2017

  44. [44]

    Video salient object detection via fully convolutional networks,

    Wenguan Wang, Jianbing Shen, and Ling Shao, “Video salient object detection via fully convolutional networks,” IEEE Transactions on Image Processing, vol. 27, no. 1, pp. 38–49, 2018

  45. [45]

    Methods for comparing scanpaths and saliency maps: strengths and weaknesses,

    Olivier Le Meur and Thierry Baccino, “Methods for comparing scanpaths and saliency maps: strengths and weaknesses,” Behavior Research Method, vol. 45, no. 1, pp. 251–266, 2013

  46. [46]

    Components of bottom-up gaze allocation in natural images,

    Robert J Peters, Asha Iyer, Laurent Itti, and Christof Koch, “Components of bottom-up gaze allocation in natural images,” Vision research, vol. 45, no. 18, pp. 2397–2416, 2005

  47. [47]

    A neural algorithm of artistic style,

    Ecker A.S. Bethge M. Gatys, L.A., “A neural algorithm of artistic style,” in arXiv preprint, 2015

  48. [48]

    Saliency from hierarchical adaptation through decorrelation and variance normalization,

    A. Garcia-Diaz, X. R. Fdez-Vidal, X. M. Pardo, and R. Dosil, “Saliency from hierarchical adaptation through decorrelation and variance normalization,” Image and Vision Computing, vol. 30, no. 1, pp. 51–64, 2012

  49. [49]

    Shallow and deep convolutional networks for saliency prediction,

    Junting Pan, Elisa Sayrol, Xavier Giro-i Nieto, Kevin McGuinness, and Noel E O’Connor, “Shallow and deep convolutional networks for saliency prediction,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 598–606

  50. [50]

    CAT2000: A Large Scale Fixation Dataset for Boosting Saliency Research

    Ali Borji and Laurent Itti, “Cat2000: A large scale fixation dataset for boosting saliency research,” arXiv preprint arXiv:1505.03581, 2015

  51. [51]

    Webpage saliency,

    Chengyao Shen and Qi Zhao, “Webpage saliency,” in ECCV. 2014, IEEE

  52. [52]

    An element sensitive saliency model with position prior learning for web pages,

    Wang Y. Chang G.J., Zhang Y., “An element sensitive saliency model with position prior learning for web pages,” in ICIAI, 2018