Exploring deep learning for Event-Based Saliency Prediction with a Transformer-based model
Pith reviewed 2026-05-25 04:22 UTC · model grok-4.3
The pith
SEST applies a pretrained Swin Transformer to event data for saliency prediction and transfers from synthetic training to real cameras.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
SEST is a Swin Event-based Saliency Transformer that combines a self-supervised pretrained event-based Swin Transformer backbone with a lightweight CNN decoder to generate dynamic saliency maps from event data. Trained on the new N-DHF1K and N-UCF Sports synthetic datasets derived from RGB benchmarks, it outperforms existing event-based saliency methods, narrows the gap to state-of-the-art RGB models, and demonstrates transferability to real event camera streams in zero-shot evaluation.
What carries the argument
Self-supervised pretrained event-based Swin Transformer backbone paired with a lightweight CNN decoder that converts event streams into dynamic saliency maps.
If this is right
- Event-based saliency prediction can now use transformer architectures adapted from image and video domains.
- Synthetic data derived from RGB benchmarks removes the need for large-scale real event annotations during initial training.
- Models trained on synthetic events generalize directly to real event streams without fine-tuning.
- Deep learning establishes a viable path for computational models of visual attention in neuromorphic sensing.
Where Pith is reading between the lines
- These models could support low-power attention mechanisms in robotics or embedded systems that use event cameras.
- End-to-end training on real event data may become practical once larger annotated real datasets are collected.
- The transfer success indicates that event representations learned from synthetic data preserve key temporal attention cues.
Load-bearing premise
Synthetic event datasets generated from RGB saliency videos capture the statistical properties and noise patterns of real event camera recordings well enough for training and evaluation.
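To make the premise concrete, here is a minimal sketch of how events can be derived from two consecutive video frames under the textbook idealized model, in which a pixel fires each time its log intensity changes by a contrast threshold. This is not the simulator used to build N-DHF1K or N-UCF Sports. The function name `frames_to_events`, the threshold value, and the frame format are assumptions made for illustration. Background noise, refractory periods, and per-pixel threshold mismatch, which are the properties the premise depends on, are deliberately left out.

```python
import math


def frames_to_events(prev_frame, next_frame, t, threshold=0.2, eps=1e-3):
    """Idealized event generation between two grayscale frames.

    Frames are lists of rows with intensities in [0, 1]. An event
    (t, x, y, polarity) is emitted for every full contrast threshold
    crossed by the log-intensity change at a pixel. Polarity is 1 for
    brightening and 0 for darkening.

    Assumption: all events carry the timestamp t of the second frame.
    Simulators typically spread them across the inter-frame interval.
    """
    events = []
    for y, (row_prev, row_next) in enumerate(zip(prev_frame, next_frame)):
        for x, (a, b) in enumerate(zip(row_prev, row_next)):
            delta = math.log(b + eps) - math.log(a + eps)
            crossings = int(abs(delta) // threshold)
            polarity = 1 if delta > 0 else 0
            events.extend((t, x, y, polarity) for _ in range(crossings))
    return events


# Toy usage: one static pixel and one pixel that doubles in brightness.
toy_events = frames_to_events([[0.5, 0.5]], [[0.5, 1.0]], t=0.0)
print(len(toy_events), "events, all at pixel x=1 with ON polarity")
```

Any mismatch between this kind of clean model and a physical sensor is exactly what a synthetic-versus-real comparison would need to quantify.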
What would settle it
A large annotated real event saliency dataset on which a model trained only on the synthetic data shows substantially lower accuracy than one trained directly on the real data.
Original abstract
Saliency prediction has been extensively studied in RGB images and videos as a computational model of human visual attention. In contrast, predicting saliency from event-based data remains largely unexplored, despite the biological inspiration and favorable sensing properties of event cameras. Two obstacles have held this direction back: the absence of large-scale event saliency datasets, and the lack of a strong baseline. In this paper, we introduce SEST (Swin Event-based Saliency Transformer), a transformer-based model for saliency prediction from event data, bridging the data scarcity barrier through event-native pretraining and synthetic supervision. SEST leverages a self-supervised pretrained event-based Swin Transformer backbone combined with a lightweight CNN decoder to produce dynamic saliency maps. To address the scarcity of annotated event-based saliency data, we introduce two new benchmark datasets, N-DHF1K and N-UCF Sports, generated from large-scale RGB saliency benchmarks. Experimental results show that SEST clearly outperforms existing event-based saliency methods and narrows the performance gap with state-of-the-art RGB models. Zero-shot evaluation on a real event camera dataset further demonstrates that our model trained on synthetic data remains transferable on real event streams. To the best of our knowledge, this work is the first to apply deep learning to event-based saliency prediction, opening a new research direction at the intersection of event-based vision and neuromorphic visual attention.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript introduces SEST, a Swin Transformer-based model that combines a self-supervised pretrained event-based backbone with a lightweight CNN decoder for saliency prediction on event data. To address data scarcity, the authors generate two synthetic datasets (N-DHF1K and N-UCF Sports) from existing RGB saliency benchmarks via event simulation. The central claims are that SEST outperforms prior event-based saliency methods, narrows the gap to RGB SOTA, and exhibits successful zero-shot transfer when evaluated on a real event-camera dataset after training exclusively on the synthetic data. The work positions itself as the first application of deep learning to event-based saliency prediction.
Significance. If the experimental claims hold after addressing the synthetic-to-real gap, the paper would be a notable first step in applying modern deep learning to event-based saliency, with the new benchmarks and self-supervised pretraining providing reusable resources. The zero-shot transfer result, if robust, would be particularly valuable for neuromorphic vision applications where real annotated data remain scarce.
Major comments (2)
- [§3 and §5] §3 (Dataset construction) and §5 (Experiments): The headline claims of outperformance and zero-shot transferability rest on the assumption that N-DHF1K and N-UCF Sports, generated from RGB saliency benchmarks, statistically match real event-camera streams in firing statistics, contrast thresholds, background activity noise, and polarity balance. No quantitative validation (e.g., Kolmogorov-Smirnov tests on inter-event intervals, event-rate histograms, or noise power spectra) is provided comparing synthetic versus real distributions. Without this, the reported gains over prior event methods and the transfer result could be artifacts of the simulator rather than evidence of model robustness.
- [§5.3] §5.3 (Zero-shot evaluation): The transfer experiment is load-bearing for the practical significance claim, yet the manuscript supplies no details on the real event dataset size, event density, or how the synthetic training distribution was aligned (or not) with the target camera's bias and refractory settings. This makes it impossible to assess whether the positive transfer result generalizes beyond the specific real dataset chosen.
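One of the checks named in the first major comment, a Kolmogorov-Smirnov comparison of inter-event intervals, can be sketched in a few lines of standard-library Python. The helper names and the toy timestamps are hypothetical, and the sketch returns only the KS statistic (the largest gap between the two empirical CDFs), not a p-value.

```python
def inter_event_intervals(timestamps):
    """Return the gaps between consecutive event timestamps."""
    ts = sorted(timestamps)
    return [b - a for a, b in zip(ts, ts[1:])]


def ks_statistic(sample_a, sample_b):
    """Two-sample KS statistic: maximum distance between empirical CDFs."""
    if not sample_a or not sample_b:
        raise ValueError("both samples must be non-empty")
    a, b = sorted(sample_a), sorted(sample_b)
    i = j = 0
    d = 0.0
    while i < len(a) and j < len(b):
        x = min(a[i], b[j])
        # Step both CDFs past every value <= x so ties are handled together.
        while i < len(a) and a[i] <= x:
            i += 1
        while j < len(b) and b[j] <= x:
            j += 1
        d = max(d, abs(i / len(a) - j / len(b)))
    return d


# Toy timestamps in seconds (placeholders, not data from the paper).
synthetic_ts = [0.00, 0.01, 0.02, 0.03, 0.04]
real_ts = [0.00, 0.01, 0.05, 0.06, 0.20]
stat = ks_statistic(inter_event_intervals(synthetic_ts),
                    inter_event_intervals(real_ts))
print(f"KS statistic on inter-event intervals: {stat:.3f}")
```

A statistic near 0 means the two interval distributions are similar, and a value near 1 means they barely overlap. In practice a library routine that also reports significance would be the more usual choice.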
Minor comments (3)
- [Abstract] Abstract: The statement that SEST 'clearly outperforms existing event-based saliency methods' is not accompanied by any numerical values, baseline names, or error bars. Adding the key metrics (e.g., AUC, NSS, or CC on the primary test split) would make the abstract self-contained.
- [§2] Notation and figures: The description of the event representation fed to the Swin backbone (e.g., voxel grid, surface of active events, or polarity-separated frames) is referenced inconsistently across text and figures; a single explicit equation or diagram in §2 would eliminate ambiguity.
- [§2] Related work: The discussion of prior event-based vision transformers omits several recent self-supervised pretraining methods on event data that could serve as stronger baselines for the backbone choice.
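For readers unfamiliar with the metrics named in the Abstract comment, the sketch below shows the usual definitions of two of them, CC and NSS, on flattened saliency maps. It follows the common convention (Pearson correlation for CC, mean standardized saliency at fixated pixels for NSS) and uses the population standard deviation. Whether the paper's evaluation code matches these choices is an assumption. AUC is omitted for brevity.

```python
import math


def _standardize(values):
    """Shift to zero mean and scale to unit (population) standard deviation."""
    mean = sum(values) / len(values)
    std = math.sqrt(sum((v - mean) ** 2 for v in values) / len(values))
    if std == 0:
        raise ValueError("map is constant, metric undefined")
    return [(v - mean) / std for v in values]


def cc(pred, target):
    """Linear correlation coefficient between two flattened saliency maps."""
    p, t = _standardize(pred), _standardize(target)
    return sum(a * b for a, b in zip(p, t)) / len(p)


def nss(pred, fixations):
    """Normalized scanpath saliency: mean standardized value at fixations.

    `fixations` is a flattened binary map (1 where a human fixated).
    """
    p = _standardize(pred)
    hits = [v for v, f in zip(p, fixations) if f]
    if not hits:
        raise ValueError("no fixations in the map")
    return sum(hits) / len(hits)


# Toy 2x2 maps flattened to length 4 (placeholder values).
pred = [0.1, 0.2, 0.3, 0.4]
print(cc(pred, [0.2, 0.4, 0.6, 0.8]))  # perfectly correlated maps
print(nss(pred, [0, 0, 0, 1]))         # single fixation on the peak
```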
Simulated Author's Rebuttal
We thank the referee for the thoughtful and constructive comments. We address each major point below and will incorporate revisions to strengthen the manuscript.
- Referee: [§3 and §5] §3 (Dataset construction) and §5 (Experiments): The headline claims of outperformance and zero-shot transferability rest on the assumption that N-DHF1K and N-UCF Sports, generated from RGB saliency benchmarks, statistically match real event-camera streams in firing statistics, contrast thresholds, background activity noise, and polarity balance. No quantitative validation (e.g., Kolmogorov-Smirnov tests on inter-event intervals, event-rate histograms, or noise power spectra) is provided comparing synthetic versus real distributions. Without this, the reported gains over prior event methods and the transfer result could be artifacts of the simulator rather than evidence of model robustness.
  Authors: We agree that explicit quantitative comparisons between the synthetic and real event distributions would strengthen the claims. In the revised manuscript we will add event-rate histograms, inter-event interval distributions, and polarity balance statistics comparing N-DHF1K and N-UCF Sports, both produced by the same simulator pipeline, against real event-camera recordings. We note that the simulator is a widely adopted open-source tool and that the observed zero-shot transfer to real data already provides empirical evidence against simulator-specific overfitting; nevertheless, the requested statistical validations will be included.
  Revision: yes
- Referee: [§5.3] §5.3 (Zero-shot evaluation): The transfer experiment is load-bearing for the practical significance claim, yet the manuscript supplies no details on the real event dataset size, event density, or how the synthetic training distribution was aligned (or not) with the target camera's bias and refractory settings. This makes it impossible to assess whether the positive transfer result generalizes beyond the specific real dataset chosen.
  Authors: We will expand the description of the zero-shot experiment in §5.3 to report the real event dataset size (total events and recording duration), average event density, and the exact camera bias and refractory parameters used when generating the synthetic training data. No additional distribution alignment beyond matching these camera settings was performed. These details will allow readers to evaluate the scope of the transfer result.
  Revision: yes
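The dataset descriptors discussed in these responses (total events, recording duration, average event density, polarity balance) are cheap to compute from a raw event list. The sketch below is one hedged way to do it. The tuple layout `(t, x, y, polarity)`, the function name, and the definition of density as events per pixel per second are assumptions, since the page does not say how the authors define event density.

```python
def event_stream_stats(events, width, height):
    """Summarize an event stream.

    `events` is a list of (t, x, y, polarity) tuples with t in seconds
    and polarity in {0, 1}. Density is reported as events per pixel per
    second, which is one common convention (an assumption here).
    """
    if not events:
        raise ValueError("event list is empty")
    times = [e[0] for e in events]
    duration = max(times) - min(times)
    if duration <= 0:
        raise ValueError("recording has zero duration")
    total = len(events)
    on_events = sum(1 for e in events if e[3] == 1)
    return {
        "total_events": total,
        "duration_s": duration,
        "events_per_pixel_per_s": total / (width * height * duration),
        "on_fraction": on_events / total,
    }


# Toy 2x2 sensor with four events over two seconds (placeholder values).
toy = [(0.0, 0, 0, 1), (1.0, 1, 0, 0), (2.0, 0, 1, 1), (2.0, 1, 1, 1)]
print(event_stream_stats(toy, width=2, height=2))
```

Reporting the same summary for the synthetic training sets and the real test set side by side would give readers a first-order view of how well the two distributions line up.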
Circularity Check
No circularity in experimental claims or model definition
The paper introduces SEST (a Swin Transformer backbone plus CNN decoder) and two synthetic datasets (N-DHF1K, N-UCF Sports) generated from existing RGB saliency benchmarks. All load-bearing claims are experimental: reported outperformance on the new datasets and zero-shot transfer to a real event-camera set. No equations, derivations, fitted parameters renamed as predictions, or self-citation chains appear in the provided text. The architecture and datasets are defined independently of the target performance metrics, so the results do not reduce to the inputs by construction.
Lean theorems connected to this paper
- IndisputableMonolith/Foundation/AbsoluteFloorClosure.lean · reality_from_one_distinction
  Tag: unclear (relation between the paper passage and the cited Recognition theorem).
  Paper passage: SEST leverages a self-supervised pretrained event-based Swin Transformer backbone combined with a lightweight CNN decoder to produce dynamic saliency maps.
- IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel
  Tag: unclear (relation between the paper passage and the cited Recognition theorem).
  Paper passage: We convert the raw event stream... into a voxel grid representation... [T, 2, H, W]
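The quoted passage gives only the output shape of the event representation, [T, 2, H, W]. As a rough illustration of what such a tensor can look like, the sketch below bins events into T temporal slices and two polarity channels by simple counting. The counting rule, the uniform time binning, the `(t, x, y, polarity)` tuple layout, and the function name are all assumptions. The elided parts of the passage may describe a different accumulation scheme.

```python
def events_to_voxel_grid(events, num_bins, height, width):
    """Accumulate events into a nested list of shape [T, 2, H, W].

    `events` is a list of (t, x, y, polarity) tuples with polarity in
    {0, 1}. Each cell counts the events of one polarity that fall into
    one pixel during one temporal bin (an illustrative choice).
    """
    grid = [[[[0] * width for _ in range(height)]
             for _ in range(2)]
            for _ in range(num_bins)]
    if not events:
        return grid
    t_start = min(e[0] for e in events)
    span = max(e[0] for e in events) - t_start
    for t, x, y, polarity in events:
        if span > 0:
            # Clamp so the final timestamp lands in the last bin.
            b = min(int((t - t_start) / span * num_bins), num_bins - 1)
        else:
            b = 0
        grid[b][polarity][y][x] += 1
    return grid


# Toy usage: three events on a 2x2 sensor split into two temporal bins.
toy = [(0.0, 0, 0, 1), (0.4, 1, 0, 0), (1.0, 1, 1, 1)]
voxels = events_to_voxel_grid(toy, num_bins=2, height=2, width=2)
print(len(voxels), len(voxels[0]), len(voxels[0][0]), len(voxels[0][0][0]))
```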
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.