Exploring deep learning for Event-Based Saliency Prediction with a Transformer-based model

Jean Martinet; Romaric Mazna; Sai Deepesh Pokala

arxiv: 2605.23790 · v1 · pith:3V535LXQnew · submitted 2026-05-22 · 💻 cs.CV

Pith reviewed 2026-05-25 04:22 UTC · model grok-4.3

classification 💻 cs.CV

keywords event-based vision · saliency prediction · transformer · synthetic data · Swin Transformer · visual attention · event camera

The pith

SEST applies a pretrained Swin Transformer to event data for saliency prediction and transfers from synthetic training to real cameras.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes the first deep learning method for predicting saliency from event camera streams by creating synthetic event datasets from existing RGB saliency benchmarks and pretraining an event-adapted Swin Transformer. A lightweight CNN decoder then produces the saliency maps. Results show the model surpasses earlier event-based approaches and reduces the difference from top RGB models while succeeding in zero-shot tests on real event data. This matters for building attention models that exploit event cameras' high speed and low power in dynamic environments.

Core claim

SEST is a Swin Event-based Saliency Transformer that combines a self-supervised pretrained event-based Swin Transformer backbone with a lightweight CNN decoder to generate dynamic saliency maps from event data. Trained on the new N-DHF1K and N-UCF Sports synthetic datasets derived from RGB benchmarks, it outperforms existing event-based saliency methods, narrows the gap to state-of-the-art RGB models, and demonstrates transferability to real event camera streams in zero-shot evaluation.

What carries the argument

Self-supervised pretrained event-based Swin Transformer backbone paired with a lightweight CNN decoder that converts event streams into dynamic saliency maps.

If this is right

Event-based saliency prediction can now use transformer architectures adapted from image and video domains.
Synthetic data derived from RGB benchmarks removes the need for large-scale real event annotations during initial training.
Models trained on synthetic events generalize directly to real event streams without fine-tuning.
Deep learning establishes a viable path for computational models of visual attention in neuromorphic sensing.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

These models could support low-power attention mechanisms in robotics or embedded systems that use event cameras.
End-to-end training on real event data may become practical once larger annotated real datasets are collected.
The transfer success indicates that event representations learned from synthetic data preserve key temporal attention cues.

Load-bearing premise

Synthetic event datasets generated from RGB saliency videos capture the statistical properties and noise patterns of real event camera recordings well enough for training and evaluation.

What would settle it

A large annotated real event saliency dataset on which a model trained only on the synthetic data shows substantially lower accuracy than one trained directly on the real data.

Figures

Figures reproduced from arXiv: 2605.23790 by Jean Martinet, Romaric Mazna, Sai Deepesh Pokala.

**Figure 1.** Overview of the proposed Swin Event-based Saliency Transformer (SEST) architecture.

… bin are accumulated per pixel and per polarity, producing a count-based voxel grid of shape [T, 2, H, W], with H = W = 224. During training, samples are processed in batches of size B, giving X ∈ R^(B×T×2×H×W). This representation preserves both the temporal structure and the polarity asymmetry of the event stream. …
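
The count-based voxel grid described in the text next to Figure 1 can be sketched as below. This is only an illustration under stated assumptions, not the authors' code: the `(t_us, x, y, polarity)` tuple layout, the polarity encoding in {0, 1}, the uniform time bins over the sample window, and the name `events_to_voxel_grid` are all hypothetical. Nested lists keep the sketch dependency-free; a real pipeline would more likely use an array library.

```python
from typing import List, Tuple

# Assumed event layout: (timestamp in microseconds, x, y, polarity in {0, 1}).
Event = Tuple[int, int, int, int]


def events_to_voxel_grid(
    events: List[Event], num_bins: int, height: int, width: int
) -> List[List[List[List[int]]]]:
    """Count events per time bin, polarity and pixel; index as grid[t][p][y][x]."""
    grid = [
        [[[0] * width for _ in range(height)] for _ in range(2)]
        for _ in range(num_bins)
    ]
    if not events:
        return grid
    t_start = min(e[0] for e in events)
    duration = max(e[0] for e in events) - t_start
    for t, x, y, p in events:
        if duration > 0:
            # Uniform bins over the window (an assumption); clamp the last event.
            b = min(int((t - t_start) / duration * num_bins), num_bins - 1)
        else:
            b = 0
        grid[b][p][y][x] += 1
    return grid
```

With T = `num_bins`, the result has the [T, 2, H, W] layout quoted above; stacking B such grids would give the batched tensor X.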

**Figure 2.** Illustration of qualitative results for three samples: UCF Sports Run-side-004 (rows 1-2), DHF1K 548 (rows 3-4), UCF Sports Swing-Bench-003 (rows 5-6).

N-DHF1K: The best overall performance on this dataset is achieved by RGB-based models, with SalFoM leading across all four metrics. Other RGB-based models such as TMFI, THTD-Net, and STSANet follow closely. In contrast, existing event-based models show signi…

Original abstract

Saliency prediction has been extensively studied in RGB images and videos as a computational model of human visual attention. In contrast, predicting saliency from event-based data remains largely unexplored, despite the biological inspiration and favorable sensing properties of event cameras. Two obstacles have held this direction back: the absence of large-scale event saliency datasets, and the lack of a strong baseline. In this paper, we introduce SEST (Swin Event-based Saliency Transformer), a transformer-based model for saliency prediction from event data, bridging the data scarcity barrier through event-native pretraining and synthetic supervision. SEST leverages a self-supervised pretrained event-based Swin Transformer backbone combined with a lightweight CNN decoder to produce dynamic saliency maps. To address the scarcity of annotated event-based saliency data, we introduce two new benchmark datasets, N-DHF1K and N-UCF Sports, generated from large-scale RGB saliency benchmarks. Experimental results show that SEST clearly outperforms existing event-based saliency methods and narrows the performance gap with state-of-the-art RGB models. Zero-shot evaluation on a real event camera dataset further demonstrates that our model trained on synthetic data remains transferable on real event streams. To the best of our knowledge, this work is the first to apply deep learning to event-based saliency prediction, opening a new research direction at the intersection of event-based vision and neuromorphic visual attention.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

This is the first deep learning paper on event-based saliency, using a Swin transformer and synthetic datasets, but the claims rest on unshown metrics and data that may not match real event cameras.

The main point is that this paper applies deep learning to saliency prediction from event data for the first time. They introduce SEST, a Swin transformer backbone with a lightweight CNN decoder, pretrain it self-supervised on events, create two new synthetic datasets from RGB saliency benchmarks, and claim it beats prior event methods while closing in on RGB performance, with some zero-shot transfer to real event streams.

Referee Report

2 major / 3 minor

Summary. The manuscript introduces SEST, a Swin Transformer-based model that combines a self-supervised pretrained event-based backbone with a lightweight CNN decoder for saliency prediction on event data. To address data scarcity, the authors generate two synthetic datasets (N-DHF1K and N-UCF Sports) from existing RGB saliency benchmarks via event simulation. The central claims are that SEST outperforms prior event-based saliency methods, narrows the gap to RGB SOTA, and exhibits successful zero-shot transfer when evaluated on a real event-camera dataset after training exclusively on the synthetic data. The work positions itself as the first application of deep learning to event-based saliency prediction.

Significance. If the experimental claims hold after addressing the synthetic-to-real gap, the paper would be a notable first step in applying modern deep learning to event-based saliency, with the new benchmarks and self-supervised pretraining providing reusable resources. The zero-shot transfer result, if robust, would be particularly valuable for neuromorphic vision applications where real annotated data remain scarce.

major comments (2)

[§3 and §5] §3 (Dataset construction) and §5 (Experiments): The headline claims of outperformance and zero-shot transferability rest on the assumption that N-DHF1K and N-UCF Sports, generated from RGB saliency benchmarks, statistically match real event-camera streams in firing statistics, contrast thresholds, background activity noise, and polarity balance. No quantitative validation (e.g., Kolmogorov-Smirnov tests on inter-event intervals, event-rate histograms, or noise power spectra) is provided comparing synthetic versus real distributions. Without this, the reported gains over prior event methods and the transfer result could be artifacts of the simulator rather than evidence of model robustness.
[§5.3] §5.3 (Zero-shot evaluation): The transfer experiment is load-bearing for the practical significance claim, yet the manuscript supplies no details on the real event dataset size, event density, or how the synthetic training distribution was aligned (or not) with the target camera's bias and refractory settings. This makes it impossible to assess whether the positive transfer result generalizes beyond the specific real dataset chosen.
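
The first major comment asks for Kolmogorov-Smirnov tests on inter-event intervals. A minimal sketch of that comparison is given here, assuming each recording is available as a list of event timestamps; the function names are hypothetical and nothing below comes from the manuscript. Only the KS statistic is computed. A significance level would need the KS distribution, which library routines such as SciPy's `scipy.stats.ks_2samp` provide.

```python
from bisect import bisect_right
from typing import List, Sequence


def inter_event_intervals(timestamps: Sequence[float]) -> List[float]:
    """Differences between consecutive timestamps after sorting."""
    ts = sorted(timestamps)
    return [later - earlier for earlier, later in zip(ts, ts[1:])]


def ks_statistic(sample_a: Sequence[float], sample_b: Sequence[float]) -> float:
    """Two-sample KS statistic: the largest gap between the empirical CDFs."""
    a, b = sorted(sample_a), sorted(sample_b)
    if not a or not b:
        raise ValueError("both samples must be non-empty")
    largest_gap = 0.0
    # Empirical CDFs are step functions, so the maximum gap occurs at a sample point.
    for x in a + b:
        cdf_a = bisect_right(a, x) / len(a)
        cdf_b = bisect_right(b, x) / len(b)
        largest_gap = max(largest_gap, abs(cdf_a - cdf_b))
    return largest_gap
```

A value near 0 would indicate that the synthetic and real interval distributions are close; a value near 1 would indicate that they barely overlap.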

minor comments (3)

[Abstract] Abstract: The statement that SEST 'clearly outperforms existing event-based saliency methods' is not accompanied by any numerical values, baseline names, or error bars. Adding the key metrics (e.g., AUC, NSS, or CC on the primary test split) would make the abstract self-contained.
[§2] Notation and figures: The description of the event representation fed to the Swin backbone (e.g., voxel grid, surface of active events, or polarity-separated frames) is referenced inconsistently across text and figures; a single explicit equation or diagram in §2 would eliminate ambiguity.
[§2] Related work: The discussion of prior event-based vision transformers omits several recent self-supervised pretraining methods on event data that could serve as stronger baselines for the backbone choice.
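
For readers unfamiliar with the metrics named in the first minor comment, NSS and CC can be sketched as follows under their usual definitions: NSS is the mean of the standardized saliency map at fixated pixels, and CC is the Pearson correlation between the predicted and ground-truth maps. The function names, the nested-list map format, and the use of the population standard deviation are assumptions of this sketch, not details from the manuscript.

```python
import math
from typing import List, Sequence, Tuple

Map2D = Sequence[Sequence[float]]


def _flatten(m: Map2D) -> List[float]:
    return [v for row in m for v in row]


def _mean_std(values: Sequence[float]) -> Tuple[float, float]:
    mean = sum(values) / len(values)
    variance = sum((v - mean) ** 2 for v in values) / len(values)
    return mean, math.sqrt(variance)  # population standard deviation (assumed)


def nss(saliency: Map2D, fixations: Map2D) -> float:
    """Normalized Scanpath Saliency: mean z-scored saliency at fixated pixels."""
    s, f = _flatten(saliency), _flatten(fixations)
    mean, std = _mean_std(s)
    if std == 0:
        raise ValueError("saliency map is constant")
    scores = [(v - mean) / std for v, fixated in zip(s, f) if fixated]
    if not scores:
        raise ValueError("no fixations given")
    return sum(scores) / len(scores)


def cc(prediction: Map2D, target: Map2D) -> float:
    """Pearson linear correlation coefficient between two saliency maps."""
    p, t = _flatten(prediction), _flatten(target)
    mean_p, std_p = _mean_std(p)
    mean_t, std_t = _mean_std(t)
    if std_p == 0 or std_t == 0:
        raise ValueError("constant map has undefined correlation")
    cov = sum((a - mean_p) * (b - mean_t) for a, b in zip(p, t)) / len(p)
    return cov / (std_p * std_t)
```

Both functions score a single frame; a benchmark figure would typically average over all frames of the test split.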

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the thoughtful and constructive comments. We address each major point below and will incorporate revisions to strengthen the manuscript.

Point-by-point responses

Referee: [§3 and §5] §3 (Dataset construction) and §5 (Experiments): The headline claims of outperformance and zero-shot transferability rest on the assumption that N-DHF1K and N-UCF Sports, generated from RGB saliency benchmarks, statistically match real event-camera streams in firing statistics, contrast thresholds, background activity noise, and polarity balance. No quantitative validation (e.g., Kolmogorov-Smirnov tests on inter-event intervals, event-rate histograms, or noise power spectra) is provided comparing synthetic versus real distributions. Without this, the reported gains over prior event methods and the transfer result could be artifacts of the simulator rather than evidence of model robustness.

Authors: We agree that explicit quantitative comparisons between the synthetic and real event distributions would strengthen the claims. In the revised manuscript we will add event-rate histograms, inter-event interval distributions, and polarity balance statistics comparing N-DHF1K and N-UCF Sports against real event-camera recordings from the same simulator pipeline. We note that the simulator is a widely adopted open-source tool and that the observed zero-shot transfer to real data already provides empirical evidence against simulator-specific overfitting; nevertheless, the requested statistical validations will be included.
Revision: yes

Referee: [§5.3] §5.3 (Zero-shot evaluation): The transfer experiment is load-bearing for the practical significance claim, yet the manuscript supplies no details on the real event dataset size, event density, or how the synthetic training distribution was aligned (or not) with the target camera's bias and refractory settings. This makes it impossible to assess whether the positive transfer result generalizes beyond the specific real dataset chosen.

Authors: We will expand the description of the zero-shot experiment in §5.3 to report the real event dataset size (total events and recording duration), average event density, and the exact camera bias and refractory parameters used when generating the synthetic training data. No additional distribution alignment beyond matching these camera settings was performed. These details will allow readers to evaluate the scope of the transfer result.
Revision: yes
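
The dataset statistics discussed in this exchange (total events, recording duration, event density, polarity balance) could be computed along the following lines. This is a hedged sketch: the `(t_us, x, y, polarity)` event layout with microsecond timestamps, the ON polarity encoded as 1, and the name `event_stream_summary` are assumptions, and the manuscript does not specify how such figures would be reported.

```python
from typing import Dict, Sequence, Tuple

# Assumed event layout: (timestamp in microseconds, x, y, polarity in {0, 1}).
Event = Tuple[int, int, int, int]


def event_stream_summary(
    events: Sequence[Event], height: int, width: int
) -> Dict[str, float]:
    """Basic size, density and polarity statistics for one recording."""
    if len(events) < 2:
        raise ValueError("need at least two events to measure a duration")
    timestamps = [e[0] for e in events]
    duration_s = (max(timestamps) - min(timestamps)) / 1e6
    if duration_s <= 0:
        raise ValueError("all events share the same timestamp")
    total = len(events)
    on_events = sum(1 for e in events if e[3] == 1)
    return {
        "total_events": float(total),
        "duration_s": duration_s,
        "events_per_second": total / duration_s,
        "events_per_pixel_per_second": total / (duration_s * height * width),
        "on_fraction": on_events / total,
    }
```

Running the same summary on a synthetic sample and on a real recording gives a first, coarse view of how far apart the two distributions are.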

Circularity Check

0 steps flagged

No circularity in experimental claims or model definition

full rationale

The paper introduces SEST (a Swin Transformer backbone plus CNN decoder) and two synthetic datasets (N-DHF1K, N-UCF Sports) generated from existing RGB saliency benchmarks. All load-bearing claims are experimental: reported outperformance on the new datasets and zero-shot transfer to a real event-camera set. No equations, derivations, fitted parameters renamed as predictions, or self-citation chains appear in the provided text. The architecture and datasets are defined independently of the target performance metrics, so the results do not reduce to the inputs by construction.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

The abstract supplies no explicit free parameters, mathematical axioms, or invented physical entities; it gives no details on model hyperparameters, loss functions, or data generation assumptions.

pith-pipeline@v0.9.0 · 5784 in / 1112 out tokens · 30389 ms · 2026-05-25T04:22:47.701151+00:00 · methodology

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Foundation/AbsoluteFloorClosure.lean · reality_from_one_distinction · unclear
Paper passage: "SEST leverages a self-supervised pretrained event-based Swin Transformer backbone combined with a lightweight CNN decoder to produce dynamic saliency maps."

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · unclear
Paper passage: "We convert the raw event stream... into a voxel grid representation... [T, 2, H, W]"

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
