Believe It or Not, We Know What You Are Looking at!
Pith reviewed 2026-05-25 09:32 UTC · model grok-4.3
The pith
A two-stage network first predicts gaze direction then refines it with scene content to locate where people look.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The central claim has two parts. First, a two-stage design yields more accurate gaze point predictions and enables dual supervision: a gaze direction pathway produces multi-scale gaze direction fields, and a heatmap pathway then receives those fields concatenated with the image contents. Second, a dataset annotated by in-video observers supplies more reliable ground truth for evaluating real-scenario performance.
What carries the argument
The two-stage gaze following architecture: a gaze direction pathway that outputs multi-scale gaze direction fields from head image and position, followed by a heatmap pathway that regresses the gaze point from the concatenated fields and image contents.
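The hand-off between the two stages can be sketched in a few lines of NumPy. This is a minimal illustration, not the authors' implementation: the array shapes, the function names `assemble_heatmap_input` and `heatmap_to_point`, and the choice to read the gaze point off the heatmap maximum are all assumptions made for the example.

```python
import numpy as np


def assemble_heatmap_input(image, fields):
    """Concatenate scene image and direction fields along the channel axis.

    image:  (3, H, W) array, hypothetical channel-first RGB layout.
    fields: (S, H, W) array, one gaze direction field per scale.
    """
    if image.shape[1:] != fields.shape[1:]:
        raise ValueError("image and fields must share spatial size")
    return np.concatenate([image, fields], axis=0)


def heatmap_to_point(heatmap):
    """Return the (x, y) location of the heatmap maximum, normalized by size.

    Taking the argmax is an assumption; the source only says that the
    heatmap pathway regresses a heatmap.
    """
    h, w = heatmap.shape
    row, col = np.unravel_index(np.argmax(heatmap), heatmap.shape)
    return col / w, row / h


# Toy usage with placeholder arrays standing in for real network outputs.
image = np.zeros((3, 56, 56))
fields = np.zeros((3, 56, 56))                        # pretend stage 1 output
stage2_input = assemble_heatmap_input(image, fields)  # shape (6, 56, 56)

heatmap = np.zeros((56, 56))
heatmap[14, 42] = 1.0                                 # pretend stage 2 output
gaze_xy = heatmap_to_point(heatmap)                   # (0.75, 0.25)
```

The point of the sketch is only that stage 2 sees the direction fields as extra input channels alongside the scene, so scene content can refine a direction-only prior.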
If this is right
- Dual supervision from both gaze direction and heatmap losses makes training of the direction pathway more robust.
- The separation into direction estimation then scene integration produces outputs that better match human gaze following behavior.
- The new dataset allows evaluation that better reflects method capacity in real scenarios.
- The overall solution significantly outperforms existing gaze following methods on both the new and prior datasets.
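The dual supervision named in the first bullet can be illustrated as a combined loss. The sketch below is hedged throughout: the specific terms (mean squared error on the heatmap, one minus cosine similarity on the direction), the Gaussian target, and the weight `lam` are assumptions chosen for illustration, since the text above does not specify them.

```python
import numpy as np


def gaussian_target(gaze_xy, height, width, sigma=3.0):
    """Build a target heatmap with a Gaussian bump at the gaze point (pixels)."""
    ys, xs = np.mgrid[0:height, 0:width]
    gx, gy = gaze_xy
    return np.exp(-((xs - gx) ** 2 + (ys - gy) ** 2) / (2.0 * sigma ** 2))


def direction_loss(pred_dir, true_dir, eps=1e-8):
    """One minus cosine similarity between predicted and true 2-D directions."""
    pred = np.asarray(pred_dir, dtype=float)
    true = np.asarray(true_dir, dtype=float)
    cos = pred @ true / (np.linalg.norm(pred) * np.linalg.norm(true) + eps)
    return 1.0 - cos


def dual_loss(pred_heatmap, target_heatmap, pred_dir, true_dir, lam=1.0):
    """Heatmap regression loss plus a weighted direction loss (assumed form)."""
    heatmap_term = np.mean((pred_heatmap - target_heatmap) ** 2)
    return heatmap_term + lam * direction_loss(pred_dir, true_dir)


target = gaussian_target((40, 20), height=56, width=56)
perfect = dual_loss(target, target, pred_dir=(1.0, 0.0), true_dir=(2.0, 0.0))
wrong_way = dual_loss(target, target, pred_dir=(-1.0, 0.0), true_dir=(2.0, 0.0))
```

Even with a perfect heatmap, `wrong_way` stays large because the direction term penalizes a reversed prediction. That is the sense in which the direction pathway receives its own training signal.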
Where Pith is reading between the lines
- The explicit multi-scale direction fields could be reused as input features for other attention or pose-related vision tasks.
- Because the dataset is video-based, the same two-stage structure might be extended with frame-to-frame consistency constraints.
- Higher accuracy on observer-annotated videos suggests that prior progress on gaze following may have been limited by annotation noise rather than model capacity.
Load-bearing premise
Ground-truth gaze points annotated by the observers inside the videos are more reliable than third-person annotations.
What would settle it
The claim would be undermined if models using the two-stage architecture show no accuracy gain on the new in-video-annotated dataset relative to third-person-annotated datasets, or if the in-video annotations prove inconsistent with measured eye positions.
Original abstract
By borrowing the wisdom of human in gaze following, we propose a two-stage solution for gaze point prediction of the target persons in a scene. Specifically, in the first stage, both head image and its position are fed into a gaze direction pathway to predict the gaze direction, and then multi-scale gaze direction fields are generated to characterize the distribution of gaze points without considering the scene contents. In the second stage, the multi-scale gaze direction fields are concatenated with the image contents and fed into a heatmap pathway for heatmap regression. There are two merits for our two-stage solution based gaze following: i) our solution mimics the behavior of human in gaze following, therefore it is more psychological plausible; ii) besides using heatmap to supervise the output of our network, we can also leverage gaze direction to facilitate the training of gaze direction pathway, therefore our network can be more robustly trained. Considering that existing gaze following dataset is annotated by the third-view persons, we build a video gaze following dataset, where the ground truth is annotated by the observers in the videos. Therefore it is more reliable. The evaluation with such a dataset reflects the capacity of different methods in real scenarios better. Extensive experiments on both datasets show that our method significantly outperforms existing methods, which validates the effectiveness of our solution for gaze following. Our dataset and codes are released in https://github.com/svip-lab/GazeFollowing.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper proposes a two-stage deep network for gaze following. Stage 1 feeds a head crop and its position into a gaze-direction pathway that outputs a predicted direction and multi-scale gaze-direction fields, which characterize possible gaze-point distributions independent of scene content. Stage 2 concatenates these fields with the full image and feeds them into a heatmap pathway that regresses the final gaze heatmap. The architecture is motivated by human gaze-following behavior and allows supervision from both heatmaps and direction labels. The authors also introduce a new video gaze-following dataset whose ground-truth gaze points are annotated by the observers appearing in the videos rather than by third-person annotators, claiming this yields more reliable labels and better reflects real-world performance. Experiments on both the existing dataset and the new dataset are said to show significant outperformance over prior methods; code and data are released.
Significance. If the performance claims hold and the new dataset's superiority is substantiated, the work would contribute a psychologically motivated two-stage architecture and a potentially more realistic benchmark for gaze following. The public release of code and dataset is a clear strength that aids reproducibility.
major comments (2)
- [Abstract / Dataset description] Abstract and Dataset section: The central motivation for the new video gaze-following dataset rests on the assertion that 'the ground truth is annotated by the observers in the videos. Therefore it is more reliable' and that evaluation on it 'reflects the capacity of different methods in real scenarios better.' No annotation protocol, instructions to observers, inter-annotator agreement statistics, or direct comparison of observer vs. third-person labels on the same footage is supplied. This assumption is load-bearing for the claim that gains on the new dataset demonstrate real-scenario superiority.
- [Experiments] Experiments / Results: The abstract states that 'extensive experiments on both datasets show that our method significantly outperforms existing methods,' yet the manuscript provides no quantitative tables, baseline implementations, ablation studies, or numerical margins. Without these details the magnitude and robustness of the claimed improvement cannot be assessed.
minor comments (1)
- [Method] The description of how multi-scale gaze direction fields are exactly constructed from the predicted direction (e.g., functional form, discretization, normalization) is only sketched at a high level and would benefit from an equation or pseudocode.
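One form such pseudocode could take is sketched below. It is an assumption for illustration, not the construction confirmed by the text here: each pixel is scored by the cosine similarity between the predicted direction and the vector from the head position to that pixel, negative values are clipped to zero, and raising the result to different exponents (the hypothetical `gammas`) gives wider or narrower cones as the multiple scales.

```python
import numpy as np


def gaze_direction_fields(head_xy, direction, height, width, gammas=(1.0, 2.0, 5.0)):
    """Return an array of shape (len(gammas), height, width).

    head_xy:   (x, y) head position in pixel coordinates.
    direction: (dx, dy) predicted gaze direction; need not be unit length.
    gammas:    hypothetical sharpness exponents, one per scale.
    """
    d = np.asarray(direction, dtype=float)
    d = d / (np.linalg.norm(d) + 1e-8)

    ys, xs = np.mgrid[0:height, 0:width]
    offsets = np.stack([xs - head_xy[0], ys - head_xy[1]], axis=-1).astype(float)
    norms = np.linalg.norm(offsets, axis=-1)

    # Cosine similarity; the head pixel itself has zero offset and scores 0.
    cos = (offsets @ d) / np.maximum(norms, 1e-8)
    base = np.clip(cos, 0.0, 1.0)

    # Larger exponents concentrate the field around the predicted direction.
    return np.stack([base ** g for g in gammas], axis=0)


fields = gaze_direction_fields(head_xy=(10, 10), direction=(1.0, 0.0), height=32, width=32)
```

Under this assumed form the field depends only on the head position and the predicted direction, which matches the description of fields that characterize gaze-point distributions without considering scene contents.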
Simulated Author's Rebuttal
We thank the referee for the constructive feedback on our manuscript. We address each major comment below and will make revisions to strengthen the paper where appropriate.
point-by-point responses
- Referee: [Abstract / Dataset description] Abstract and Dataset section: The central motivation for the new video gaze-following dataset rests on the assertion that 'the ground truth is annotated by the observers in the videos. Therefore it is more reliable' and that evaluation on it 'reflects the capacity of different methods in real scenarios better.' No annotation protocol, instructions to observers, inter-annotator agreement statistics, or direct comparison of observer vs. third-person labels on the same footage is supplied. This assumption is load-bearing for the claim that gains on the new dataset demonstrate real-scenario superiority.
  Authors: We agree that the current manuscript would benefit from expanded details on the annotation process to better support our claims. In the revision, we will add a dedicated subsection describing the full annotation protocol, the instructions provided to in-scene observers, inter-annotator agreement statistics, and additional justification (drawing on the collection methodology and related literature) for why observer annotations better reflect real-world gaze following. A direct comparison of observer versus third-person labels on identical footage was not conducted during dataset creation, as the protocol was designed from the outset around in-scene annotation; we will explicitly note this limitation while arguing that the protocol itself provides supporting evidence.
  Revision: yes
- Referee: [Experiments] Experiments / Results: The abstract states that 'extensive experiments on both datasets show that our method significantly outperforms existing methods,' yet the manuscript provides no quantitative tables, baseline implementations, ablation studies, or numerical margins. Without these details the magnitude and robustness of the claimed improvement cannot be assessed.
  Authors: The Experiments section of the full manuscript does contain quantitative comparisons, baseline results, and ablations, with code released to enable reproduction. However, we acknowledge that these elements could be presented more prominently and with greater detail. We will revise the section to include expanded tables with explicit numerical margins, clearer descriptions of baseline implementations, and additional ablation studies to make the performance gains fully transparent and assessable.
  Revision: yes
Circularity Check
No circularity in derivation; standard supervised two-stage network with empirical evaluation
The paper presents a two-stage neural architecture (gaze direction pathway followed by heatmap regression) trained via standard supervised losses on direction fields and heatmaps. No equations, parameters, or predictions are shown to reduce by construction to fitted inputs or self-citations. The new dataset claim rests on an unverified reliability assumption rather than any definitional loop or renamed result. Performance claims are external empirical comparisons, leaving the derivation self-contained against benchmarks.
Axiom & Free-Parameter Ledger
free parameters (1)
- network architecture and training hyperparameters
axioms (2)
- domain assumption: Observer annotations inside the video provide more reliable ground truth than third-person annotations.
- domain assumption: The two-stage pipeline is psychologically plausible because it mimics human gaze following.