
arxiv: 2605.23790 · v1 · pith:3V535LXQnew · submitted 2026-05-22 · 💻 cs.CV

Exploring deep learning for Event-Based Saliency Prediction with a Transformer-based model

Pith reviewed 2026-05-25 04:22 UTC · model grok-4.3

classification 💻 cs.CV
keywords event-based vision · saliency prediction · transformer · synthetic data · Swin Transformer · visual attention · event camera

The pith

SEST applies a pretrained Swin Transformer to event data for saliency prediction and transfers from synthetic training to real cameras.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents what its authors describe as the first deep learning method for predicting saliency from event camera streams. It generates synthetic event datasets from existing RGB saliency benchmarks and builds on a self-supervised pretrained event-based Swin Transformer backbone; a lightweight CNN decoder then produces the saliency maps. The reported results show the model surpassing earlier event-based approaches, narrowing the gap to state-of-the-art RGB models, and transferring to real event data in zero-shot evaluation. This matters for building attention models that exploit event cameras' high speed and low power in dynamic environments.

Core claim

SEST is a Swin Event-based Saliency Transformer that combines a self-supervised pretrained event-based Swin Transformer backbone with a lightweight CNN decoder to generate dynamic saliency maps from event data. Trained on the new N-DHF1K and N-UCF Sports synthetic datasets derived from RGB benchmarks, it outperforms existing event-based saliency methods, narrows the gap to state-of-the-art RGB models, and demonstrates transferability to real event camera streams in zero-shot evaluation.

What carries the argument

Self-supervised pretrained event-based Swin Transformer backbone paired with a lightweight CNN decoder that converts event streams into dynamic saliency maps.

If this is right

  • Event-based saliency prediction can now use transformer architectures adapted from image and video domains.
  • Synthetic data derived from RGB benchmarks removes the need for large-scale real event annotations during initial training.
  • Models trained on synthetic events can transfer to real event streams without fine-tuning, at least on the real dataset used in the zero-shot evaluation.
  • Deep learning establishes a viable path for computational models of visual attention in neuromorphic sensing.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • These models could support low-power attention mechanisms in robotics or embedded systems that use event cameras.
  • End-to-end training on real event data may become practical once larger annotated real datasets are collected.
  • The transfer success indicates that event representations learned from synthetic data preserve key temporal attention cues.

Load-bearing premise

Synthetic event datasets generated from RGB saliency videos capture the statistical properties and noise patterns of real event camera recordings well enough for training and evaluation.

What would settle it

A large annotated real event saliency dataset on which a model trained only on the synthetic data shows substantially lower accuracy than one trained directly on the real data.

Figures

Figures reproduced from arXiv: 2605.23790 by Jean Martinet, Romaric Mazna, Sai Deepesh Pokala.

Figure 1: Overview of the proposed Swin Event-based Saliency Transformer (SEST) architecture.
… bin are accumulated per pixel and per polarity, producing a count-based voxel grid of shape [T, 2, H, W], with H = W = 224. During training, samples are processed in batches of size B, giving X ∈ R^(B×T×2×H×W). This representation preserves both the temporal structure and the polarity asymmetry of the event stream. 3.2 Pretrained S…
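The count-based voxel grid mentioned in the Figure 1 excerpt can be illustrated with a minimal NumPy sketch. This is not the authors' code: the function name `events_to_voxel_grid`, the uniform split of the time window into T bins, and the polarity encoding (values above zero count as positive) are assumptions made for illustration only.

```python
import numpy as np

def events_to_voxel_grid(t, x, y, p, num_bins, height=224, width=224):
    """Accumulate events into a count-based grid of shape [T, 2, H, W].

    Assumptions (not from the paper): uniform temporal bins over the
    window [t.min(), t.max()], and polarity > 0 mapped to channel 1.
    """
    t = np.asarray(t, dtype=np.float64)
    x = np.asarray(x, dtype=np.int64)
    y = np.asarray(y, dtype=np.int64)
    pol = (np.asarray(p) > 0).astype(np.int64)  # 0 = negative, 1 = positive

    grid = np.zeros((num_bins, 2, height, width), dtype=np.float32)
    if t.size == 0:
        return grid

    t0 = t.min()
    span = max(t.max() - t0, 1e-9)  # avoid division by zero for a single timestamp
    # Map each timestamp to a bin index in [0, num_bins - 1].
    bins = np.minimum(((t - t0) / span * num_bins).astype(np.int64), num_bins - 1)

    # np.add.at handles repeated indices, so several events falling on the
    # same (bin, polarity, pixel) cell are all counted.
    np.add.at(grid, (bins, pol, y, x), 1.0)
    return grid
```

Stacking B such grids with `np.stack` gives an array of shape [B, T, 2, H, W], matching the batched tensor X in the excerpt. Keeping the two polarities in separate channels is what lets the representation retain the polarity asymmetry.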
Figure 2: Illustration of qualitative results for three samples: UCF Sports Run-side-004 (rows 1-2), DHF1K 548 (rows 3-4), UCF Sports Swing-Bench-003 (rows 5-6).
N-DHF1K: The best overall performance on this dataset is achieved by RGB-based models, with SalFoM leading across all four metrics. Other RGB-based models such as TMFI, THTD-Net, and STSANet follow closely. In contrast, existing event-based models show signi…
Original abstract

Saliency prediction has been extensively studied in RGB images and videos as a computational model of human visual attention. In contrast, predicting saliency from event-based data remains largely unexplored, despite the biological inspiration and favorable sensing properties of event cameras. Two obstacles have held this direction back: the absence of large-scale event saliency datasets, and the lack of a strong baseline. In this paper, we introduce SEST (Swin Event-based Saliency Transformer), a transformer-based model for saliency prediction from event data, bridging the data scarcity barrier through event-native pretraining and synthetic supervision. SEST leverages a self-supervised pretrained event-based Swin Transformer backbone combined with a lightweight CNN decoder to produce dynamic saliency maps. To address the scarcity of annotated event-based saliency data, we introduce two new benchmark datasets, N-DHF1K and N-UCF Sports, generated from large-scale RGB saliency benchmarks. Experimental results show that SEST clearly outperforms existing event-based saliency methods and narrows the performance gap with state-of-the-art RGB models. Zero-shot evaluation on a real event camera dataset further demonstrates that our model trained on synthetic data remains transferable on real event streams. To the best of our knowledge, this work is the first to apply deep learning to event-based saliency prediction, opening a new research direction at the intersection of event-based vision and neuromorphic visual attention.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 3 minor

Summary. The manuscript introduces SEST, a Swin Transformer-based model that combines a self-supervised pretrained event-based backbone with a lightweight CNN decoder for saliency prediction on event data. To address data scarcity, the authors generate two synthetic datasets (N-DHF1K and N-UCF Sports) from existing RGB saliency benchmarks via event simulation. The central claims are that SEST outperforms prior event-based saliency methods, narrows the gap to RGB SOTA, and exhibits successful zero-shot transfer when evaluated on a real event-camera dataset after training exclusively on the synthetic data. The work positions itself as the first application of deep learning to event-based saliency prediction.

Significance. If the experimental claims hold after addressing the synthetic-to-real gap, the paper would be a notable first step in applying modern deep learning to event-based saliency, with the new benchmarks and self-supervised pretraining providing reusable resources. The zero-shot transfer result, if robust, would be particularly valuable for neuromorphic vision applications where real annotated data remain scarce.

major comments (2)
  1. [§3 and §5] §3 (Dataset construction) and §5 (Experiments): The headline claims of outperformance and zero-shot transferability rest on the assumption that N-DHF1K and N-UCF Sports, generated from RGB saliency benchmarks, statistically match real event-camera streams in firing statistics, contrast thresholds, background activity noise, and polarity balance. No quantitative validation (e.g., Kolmogorov-Smirnov tests on inter-event intervals, event-rate histograms, or noise power spectra) is provided comparing synthetic versus real distributions. Without this, the reported gains over prior event methods and the transfer result could be artifacts of the simulator rather than evidence of model robustness.
  2. [§5.3] §5.3 (Zero-shot evaluation): The transfer experiment is load-bearing for the practical significance claim, yet the manuscript supplies no details on the real event dataset size, event density, or how the synthetic training distribution was aligned (or not) with the target camera's bias and refractory settings. This makes it impossible to assess whether the positive transfer result generalizes beyond the specific real dataset chosen.
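The kind of distribution check requested in major comment 1 could look like the following sketch, which computes the two-sample Kolmogorov-Smirnov statistic between inter-event intervals of a synthetic and a real stream. The function names are hypothetical, the intervals are taken over the whole stream rather than per pixel (a simplification), and only the statistic is computed, not a p-value.

```python
import numpy as np

def inter_event_intervals(timestamps):
    """Return the gaps between consecutive events of a stream (sorted first)."""
    ts = np.sort(np.asarray(timestamps, dtype=np.float64))
    return np.diff(ts)

def ks_statistic(a, b):
    """Two-sample KS statistic: the largest gap between the empirical CDFs."""
    a = np.sort(np.asarray(a, dtype=np.float64))
    b = np.sort(np.asarray(b, dtype=np.float64))
    pooled = np.concatenate([a, b])
    # Empirical CDF of each sample evaluated at every pooled observation.
    cdf_a = np.searchsorted(a, pooled, side="right") / a.size
    cdf_b = np.searchsorted(b, pooled, side="right") / b.size
    return float(np.max(np.abs(cdf_a - cdf_b)))

if __name__ == "__main__":
    # Toy timestamps standing in for synthetic and real recordings.
    rng = np.random.default_rng(0)
    synthetic_ts = np.cumsum(rng.exponential(1.0, size=1000))
    real_ts = np.cumsum(rng.exponential(1.5, size=1000))
    d = ks_statistic(inter_event_intervals(synthetic_ts),
                     inter_event_intervals(real_ts))
    print(f"KS statistic: {d:.3f}")
```

A statistic near 0 means the two interval distributions are nearly indistinguishable; a value near 1 means they barely overlap. The same function applies unchanged to per-sequence event rates or any other scalar statistic.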
minor comments (3)
  1. [Abstract] Abstract: The statement that SEST 'clearly outperforms existing event-based saliency methods' is not accompanied by any numerical values, baseline names, or error bars. Adding the key metrics (e.g., AUC, NSS, or CC on the primary test split) would make the abstract self-contained.
  2. [§2] Notation and figures: The description of the event representation fed to the Swin backbone (e.g., voxel grid, surface of active events, or polarity-separated frames) is referenced inconsistently across text and figures; a single explicit equation or diagram in §2 would eliminate ambiguity.
  3. [§2] Related work: The discussion of prior event-based vision transformers omits several recent self-supervised pretraining methods on event data that could serve as stronger baselines for the backbone choice.
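Two of the metrics named in minor comment 1, NSS and CC, have short standard definitions: NSS is the mean of the standardized predicted saliency at human fixation locations, and CC is the Pearson correlation between the predicted and ground-truth saliency maps. The sketch below follows those definitions; the function names and the small epsilon guard are illustrative choices, not taken from the paper.

```python
import numpy as np

EPS = 1e-12  # guards against division by zero on constant maps

def nss(pred, fixations):
    """Normalized Scanpath Saliency: mean z-scored saliency at fixated pixels."""
    pred = np.asarray(pred, dtype=np.float64)
    fix = np.asarray(fixations).astype(bool)
    z = (pred - pred.mean()) / (pred.std() + EPS)
    return float(z[fix].mean())

def cc(pred, target):
    """Linear correlation coefficient between two saliency maps."""
    p = np.asarray(pred, dtype=np.float64).ravel()
    t = np.asarray(target, dtype=np.float64).ravel()
    p = p - p.mean()
    t = t - t.mean()
    return float((p @ t) / (np.linalg.norm(p) * np.linalg.norm(t) + EPS))
```

NSS takes a binary fixation map and CC a continuous ground-truth map, so the two metrics can rank models differently on the same test split.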

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the thoughtful and constructive comments. We address each major point below and will incorporate revisions to strengthen the manuscript.

Point-by-point responses
  1. Referee: [§3 and §5] §3 (Dataset construction) and §5 (Experiments): The headline claims of outperformance and zero-shot transferability rest on the assumption that N-DHF1K and N-UCF Sports, generated from RGB saliency benchmarks, statistically match real event-camera streams in firing statistics, contrast thresholds, background activity noise, and polarity balance. No quantitative validation (e.g., Kolmogorov-Smirnov tests on inter-event intervals, event-rate histograms, or noise power spectra) is provided comparing synthetic versus real distributions. Without this, the reported gains over prior event methods and the transfer result could be artifacts of the simulator rather than evidence of model robustness.

    Authors: We agree that explicit quantitative comparisons between the synthetic and real event distributions would strengthen the claims. In the revised manuscript we will add event-rate histograms, inter-event interval distributions, and polarity balance statistics comparing N-DHF1K and N-UCF Sports, both produced by the same simulator pipeline, against real event-camera recordings. We note that the simulator is a widely adopted open-source tool and that the observed zero-shot transfer to real data already provides empirical evidence against simulator-specific overfitting; nevertheless, the requested statistical validations will be included. revision: yes

  2. Referee: [§5.3] §5.3 (Zero-shot evaluation): The transfer experiment is load-bearing for the practical significance claim, yet the manuscript supplies no details on the real event dataset size, event density, or how the synthetic training distribution was aligned (or not) with the target camera's bias and refractory settings. This makes it impossible to assess whether the positive transfer result generalizes beyond the specific real dataset chosen.

    Authors: We will expand the description of the zero-shot experiment in §5.3 to report the real event dataset size (total events and recording duration), average event density, and the exact camera bias and refractory parameters used when generating the synthetic training data. No additional distribution alignment beyond matching these camera settings was performed. These details will allow readers to evaluate the scope of the transfer result. revision: yes

Circularity Check

0 steps flagged

No circularity in experimental claims or model definition

full rationale

The paper introduces SEST (a Swin Transformer backbone plus CNN decoder) and two synthetic datasets (N-DHF1K, N-UCF Sports) generated from existing RGB saliency benchmarks. All load-bearing claims are experimental: reported outperformance on the new datasets and zero-shot transfer to a real event-camera set. No equations, derivations, fitted parameters renamed as predictions, or self-citation chains appear in the provided text. The architecture and datasets are defined independently of the target performance metrics, so the results do not reduce to the inputs by construction.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract supplies no explicit free parameters, mathematical axioms, or invented physical entities; all details on model hyperparameters, loss functions, and data generation assumptions are absent.



