pith. sign in

arxiv: 1907.00480 · v1 · pith:URL3X3QKnew · submitted 2019-06-30 · 💻 cs.CV

Predicting video saliency using crowdsourced mouse-tracking data

Pith reviewed 2026-05-25 12:23 UTC · model grok-4.3

classification 💻 cs.CV
keywords video saliencymouse-trackingcrowdsourcingeye-tracking approximationsaliency mapsdeep neural networkperipheral vision simulation
0
0 comments X

The pith

Crowdsourced mouse-tracking data collected through a cursor-contingent viewing system can approximate eye-tracking data for video saliency maps.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes that a mouse-contingent video system, which blurs peripheral areas based on cursor position, lets ordinary mouse movements serve as a practical substitute for gaze fixations when building saliency maps. A crowdsourcing platform then gathers this data at scale from regular computers. The authors demonstrate that the resulting maps closely track those from eye-trackers, and they introduce a deep neural network that further refines the mouse-derived maps to higher accuracy. This matters because eye-tracking hardware limits dataset size and accessibility, while mouse data runs on any device. If the approximation holds, researchers can train saliency models on far larger and more diverse video collections without specialized equipment.

Core claim

We designed a mouse-contingent video viewing system which simulates the viewers' peripheral vision based on the position of the mouse cursor. The system enables the use of mouse-tracking data recorded from an ordinary computer mouse as an alternative to real gaze fixations recorded by a more expensive eye-tracker. We developed a crowdsourcing system that enables the collection of such mouse-tracking data at large scale. Using the collected mouse-tracking data we showed that it can serve as an approximation of eye-tracking data. Moreover, trying to increase the efficiency of collected mouse-tracking data we proposed a novel deep neural network algorithm that improves the quality of mouse-

What carries the argument

The mouse-contingent video viewing system that simulates peripheral vision from mouse cursor position, turning mouse movements into a proxy for gaze fixations used to build saliency maps.

If this is right

  • Mouse-tracking data gathered via crowdsourcing serves as a scalable, low-cost approximation to eye-tracking data for video saliency.
  • A dedicated deep neural network can measurably raise the quality of saliency maps derived from mouse-tracking inputs.
  • Large-scale video saliency datasets become feasible to collect without eye-tracking hardware.
  • Saliency prediction models can be trained on substantially bigger and more varied video sets assembled this way.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same cursor-contingent approach could be adapted to collect attention data for other dynamic visual tasks where eye-trackers are impractical.
  • The DNN refinement step implies that mouse data contains learnable, systematic deviations from true gaze that can be corrected algorithmically.
  • Performance of the approximation may vary with video content type, suggesting targeted validation on fast-motion or low-contrast scenes.
  • Hybrid training that mixes mouse-derived maps with smaller eye-tracking sets might improve model generalization beyond either data source alone.

Load-bearing premise

The mouse-contingent viewing system accurately simulates viewers' peripheral vision based on the mouse cursor position, so mouse movements reliably stand in for actual gaze fixations.

What would settle it

Side-by-side quantitative comparison of saliency maps produced from the crowdsourced mouse data against maps from simultaneous eye-tracking recordings on identical videos, checking agreement in fixation locations and saliency values.

Figures

Figures reproduced from arXiv: 1907.00480 by Dmitriy Vatolin, Vitaliy Lyudvichenko.

Figure 1
Figure 1. Figure 1: An example of a tutorial page and the mouse-contingent video player used in our system. The video around the cursor is sharp. To tackle this problem the semiautomatic paradigm for predicting saliency was proposed in [1]. Unlike conventional saliency models, semiautomatic approaches take eye-tracking saliency maps as an ad￾ditional input and postprocess them which enables better saliency maps using less dat… view at source ↗
Figure 2
Figure 2. Figure 2: Overview of proposed temporal semiautomatic model based on SAM-ResNet [11]. We introduce the external prior maps and concatenate them with the features of the input layer and three intermediate layers. To make the network temporal-aware we introduce new spatiotemporal features and adapt the attentive ConvLSTM module so that it can pass the states to the following frames. The made modifications are marked b… view at source ↗
Figure 3
Figure 3. Figure 3: shows the results and illustrates that mouse￾tracking of two observers have the same quality as eye-tracking of the single observer, so the data col￾lected with the proposed system can approximate eye￾tracking. Note, when we estimated the eye-tracking perfor￾mance of N observers we compared them with the remaining M − N observers of total M observers. Therefore the eye-tracking curve has stopped increas￾in… view at source ↗
read the original abstract

This paper presents a new way of getting high-quality saliency maps for video, using a cheaper alternative to eye-tracking data. We designed a mouse-contingent video viewing system which simulates the viewers' peripheral vision based on the position of the mouse cursor. The system enables the use of mouse-tracking data recorded from an ordinary computer mouse as an alternative to real gaze fixations recorded by a more expensive eye-tracker. We developed a crowdsourcing system that enables the collection of such mouse-tracking data at large scale. Using the collected mouse-tracking data we showed that it can serve as an approximation of eye-tracking data. Moreover, trying to increase the efficiency of collected mouse-tracking data we proposed a novel deep neural network algorithm that improves the quality of mouse-tracking saliency maps.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 0 minor

Summary. The manuscript introduces a mouse-contingent video viewing system that applies peripheral blur based on mouse cursor position to enable collection of crowdsourced mouse-tracking data as a low-cost proxy for eye-tracking saliency maps on videos. It asserts that the collected mouse data approximates eye-tracking data and proposes a novel DNN to improve the quality of the resulting saliency maps.

Significance. If the mouse-to-eye approximation holds with strong quantitative support, the work would be significant for computer vision by enabling scalable, low-cost collection of video saliency data via crowdsourcing, which could expand training sets for saliency prediction models. The crowdsourcing platform itself represents a practical engineering contribution.

major comments (2)
  1. [Abstract] Abstract: the central claim that 'mouse-tracking data ... can serve as an approximation of eye-tracking data' is load-bearing yet unsupported by any reported quantitative metrics (AUC, NSS, KL divergence, or correlation) or direct comparison on identical stimuli; the description supplies no validation details or baselines.
  2. [Abstract] Abstract: the mouse-contingent system is presented as simulating peripheral vision, but the manuscript provides no evidence that cursor-based blur replicates saccadic dynamics, covert attention, or natural gaze trajectories; this untested fidelity is required for the proxy claim to hold.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive comments on the abstract. We agree that the abstract should better support the central claims with quantitative details from the manuscript and will revise it accordingly. We address each point below.

read point-by-point responses
  1. Referee: [Abstract] Abstract: the central claim that 'mouse-tracking data ... can serve as an approximation of eye-tracking data' is load-bearing yet unsupported by any reported quantitative metrics (AUC, NSS, KL divergence, or correlation) or direct comparison on identical stimuli; the description supplies no validation details or baselines.

    Authors: The abstract is a concise summary; the manuscript reports direct comparisons on identical stimuli with quantitative metrics (AUC, NSS, KL divergence, and correlation) in the experimental results and figures. To address the concern, we will revise the abstract to include key validation metrics and baselines. revision: yes

  2. Referee: [Abstract] Abstract: the mouse-contingent system is presented as simulating peripheral vision, but the manuscript provides no evidence that cursor-based blur replicates saccadic dynamics, covert attention, or natural gaze trajectories; this untested fidelity is required for the proxy claim to hold.

    Authors: The system applies cursor-based peripheral blur to enable scalable data collection, with proxy validity shown empirically via saliency map approximation rather than exact replication of saccades or covert attention. We will revise the abstract to clarify the system's design scope and empirical support without overstating fidelity. revision: partial

Circularity Check

0 steps flagged

No circularity: empirical data collection with no derivation chain

full rationale

The paper's core contribution is an empirical crowdsourcing pipeline for mouse-tracking saliency data plus a DNN post-processing step; the abstract and description contain no equations, fitted parameters, or mathematical derivations. Claims rest on direct collection and comparison to eye-tracking, which are externally falsifiable and do not reduce to self-definition or self-citation. No load-bearing uniqueness theorems, ansatzes, or renamed known results appear. This is the normal non-circular case for an applied data-collection study.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review yields no explicit free parameters, axioms, or invented entities; the work is presented as an empirical engineering contribution.

pith-pipeline@v0.9.0 · 5654 in / 996 out tokens · 21126 ms · 2026-05-25T12:23:21.851100+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Reference graph

Works this paper leans on

23 extracted references · 23 canonical work pages · 2 internal anchors

  1. [1]

    Predicting video saliency using crowdsourced mouse-tracking data

    Introduction When watching videos, humans distribute their at- tention unevenly. Some objects in the video may at- tract more attention than the others. This distribu- tion can be represented by per-frame saliency maps defining the importance of each frame region for view- ers. The use of saliency can improve the quality of many video processing applicatio...

  2. [2]

    Hereafter we provide a brief overview of these topics

    Related work The paper makes a contribution to two topics: cursor-based alternatives to eye tracking and semiau- tomatic saliency modeling. Hereafter we provide a brief overview of these topics. Cursor-based alternatives to eye tracking. There were many efforts to use mouse tracking as a cheap alternative to eye tracking. However, most of these efforts were...

  3. [3]

    We show a participant the video in a special video player in real-time in full-screen mode

    Cursor-based saliency for video We propose a methodology for high-quality visual- attention estimation based on mouse-tracking data and a system collecting such data using crowdsourc- ing platforms. We show a participant the video in a special video player in real-time in full-screen mode. Input frames Dilated ResNet Conv LSTM Conv 1x1 Spatial features Te...

  4. [4]

    The algorithm is based on SAM [11] architecture which was originally designed to predict saliency of static images

    Semiautomatic deep neural network To improve saliency maps generated using the cur- sor positions as eye fixations we developed a new neu- ral network algorithm. The algorithm is based on SAM [11] architecture which was originally designed to predict saliency of static images. Though SAM is a static model, its retrained ResNet version can outper- form the ...

  5. [5]

    We hired participants on Sub- jectify.us crowdsourcing platform, showed them 10 videos and paid them $0.15 if they watched all videos

    Experiments We used our cursor-based saliency system to col- lect mouse-movement data in 12 random videos from Hollywood-2 video saliency dataset [7] that are each 20–30 seconds long. We hired participants on Sub- jectify.us crowdsourcing platform, showed them 10 videos and paid them $0.15 if they watched all videos. In total, we collected data of 30 part...

  6. [6]

    We developed a novel system that shows viewers videos in a mouse-contingent video player and collects mouse-tracking data approximat- ing real eye fixations

    Conclusion In this paper, we proposed a cheap way of get- ting high-quality saliency maps for video through the use of additional data. We developed a novel system that shows viewers videos in a mouse-contingent video player and collects mouse-tracking data approximat- ing real eye fixations. We showed that mouse-tracking data can be used as an alternative...

  7. [7]

    Acknowledgments This work was partially supported by the Russian Foundation for Basic Research under Grant 19-01- 00785 a

  8. [8]

    Gitman, M

    Y. Gitman, M. Erofeev, D. Vatolin, B. Andrey, and F. Alexey. Semiautomatic visual-attention modeling and its application to video compres- sion. In International Conference on Image Pro- cessing (ICIP), pages 1105–1109, 2014

  9. [9]

    T. Lu, Z. Yuan, Y. Huang, D. Wu, and H. Yu. Video retargeting with nonlinear spatial- temporal saliency fusion. In 2010 IEEE Inter- national Conference on Image Processing , pages 1801–1804, 2010

  10. [10]

    Borji and L

    A. Borji and L. Itti. State-of-the-art in visual at- tention modeling. IEEE Transactions on Pattern Analysis and Machine Intelligence , 35(1):185– 207, 2013

  11. [11]

    Saliency prediction in the deep learning era: An empirical investigation, 2018

    Ali Borji. Saliency prediction in the deep learning era: An empirical investigation, 2018

  12. [12]

    A semiauto- matic saliency model and its application to video compression

    Vitaliy Lyudvichenko, Mikhail Erofeev, Yury Gitman, and Dmitriy Vatolin. A semiauto- matic saliency model and its application to video compression. In 13th IEEE International Con- ference on Intelligent Computer Communication and Processing, pages 403–410, 2017

  13. [13]

    Improv- ing video compression with deep visual-attention models

    Vitaliy Lyudvichenko, Mikhail Erofeev, Alexan- der Ploshkin, and Dmitriy Vatolin. Improv- ing video compression with deep visual-attention models. In 2019 International Conference on In- telligent Medicine and Image Processing , 2019

  14. [14]

    Mathe and C

    S. Mathe and C. Sminchisescu. Actions in the eye: Dynamic gaze datasets and learnt saliency models for visual recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence , pages 1408–1424, 2015

  15. [15]

    Salicon: Reducing the semantic gap in saliency prediction by adapting deep neural networks

    Xun Huang, Chengyao Shen, Xavier Boix, and Qi Zhao. Salicon: Reducing the semantic gap in saliency prediction by adapting deep neural networks. In 2015 International Conference on Computer Vision, pages 262–270, 2015

  16. [16]

    Borkin, Krzysztof Z

    Nam Wook Kim, Zoya Bylinskii, Michelle A. Borkin, Krzysztof Z. Gajos, Aude Oliva, Fredo Durand, and Hanspeter Pfister. Bubbleview: An interface for crowdsourcing image importance maps and tracking visual attention. ACM Trans. Comput.-Hum. Interact., 24(5):36:1–36:40, 2017

  17. [17]

    A benchmark of computational models of saliency to predict human fixations

    Tilke Judd, Fr´ edo Durand, and Antonio Tor- ralba. A benchmark of computational models of saliency to predict human fixations. Technical report, Computer Science and Artificial Intelli- gence Lab, Massachusetts Institute of Technol- ogy, 2012

  18. [18]

    Predicting Human Eye Fixations via an LSTM-based Saliency At- tentive Model

    Marcella Cornia, Lorenzo Baraldi, Giuseppe Serra, and Rita Cucchiara. Predicting Human Eye Fixations via an LSTM-based Saliency At- tentive Model. IEEE Transactions on Image Pro- cessing, 27(10):5142–5154, 2018

  19. [19]

    Spatio-temporal modeling and predic- tion of visual attention in graphical user inter- faces

    Pingmei Xu, Yusuke Sugano, and Andreas Bulling. Spatio-temporal modeling and predic- tion of visual attention in graphical user inter- faces. In Proceedings of the 2016 CHI Conference on Human Factors in Computing Systems , pages 3299–3310, 2016

  20. [20]

    Are all the frames equally important? CoRR, abs/1905.07984, 2019

    Oleksii Sidorov, Marius Pedersen, Nam Wook Kim, and Sumit Shekhar. Are all the frames equally important? CoRR, abs/1905.07984, 2019

  21. [21]

    Revisiting video sali- ency: A large-scale benchmark and a new model

    Wenguan Wang, Jianbing Shen, Fang Guo, Ming- Ming Cheng, and Ali Borji. Revisiting video sali- ency: A large-scale benchmark and a new model. 2018

  22. [22]

    Predicting Video Saliency with Object-to-Motion CNN and Two-layer Convolutional LSTM

    Lai Jiang, Mai Xu, and Zulin Wang. Pre- dicting video saliency with object-to-motion cnn and two-layer convolutional lstm. CoRR, abs/1709.06316, 2017

  23. [23]

    Learning to predict where hu- mans look

    Tilke Judd, Krista Ehinger, Fr´ edo Durand, and Antonio Torralba. Learning to predict where hu- mans look. In International Conference on Com- puter Vision (ICCV) , pages 2106–2113, 2009. About the authors Vitaliy Lyudvichenko is a Ph.D. student of Com- puter Graphics and Media Lab of Computer Science department of Lomonosov Moscow State University. Dmitr...