Believe It or Not, We Know What You Are Looking at!

Dongze Lian; Shenghua Gao; Zehao Yu

arxiv: 1907.02364 · v1 · pith:LMU2RUGZnew · submitted 2019-07-04 · 💻 cs.CV

Believe It or Not, We Know What You Are Looking at!

Dongze Lian , Zehao Yu , Shenghua Gao This is my paper

Pith reviewed 2026-05-25 09:32 UTC · model grok-4.3

classification 💻 cs.CV

keywords gaze followinggaze direction pathwayheatmap regressionmulti-scale gaze fieldsvideo gaze datasettwo-stage networkcomputer vision

0 comments

The pith

A two-stage network first predicts gaze direction then refines it with scene content to locate where people look.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper proposes a two-stage solution for predicting gaze points of target persons in a scene. The first stage takes a head image and its position to predict gaze direction and generate multi-scale gaze direction fields that describe possible gaze point distributions without scene content. The second stage concatenates these fields with the full image and feeds them into a heatmap pathway for final regression. This structure is meant to mimic human gaze following behavior and to allow supervision from both direction estimates and heatmaps during training. The authors also introduce a video dataset whose ground truth is annotated by observers inside the videos rather than third-person viewers, and they report that their method significantly outperforms prior approaches on both the new dataset and existing ones.

Core claim

The central claim is that a gaze direction pathway producing multi-scale fields, followed by a heatmap pathway that receives those fields concatenated with image contents, yields more accurate gaze point predictions while enabling dual supervision, and that a dataset annotated by in-video observers supplies more reliable ground truth for evaluating real-scenario performance.

What carries the argument

The two-stage gaze following architecture: a gaze direction pathway that outputs multi-scale gaze direction fields from head image and position, followed by a heatmap pathway that regresses the gaze point from the concatenated fields and image contents.

If this is right

Dual supervision from both gaze direction and heatmap losses makes training of the direction pathway more robust.
The separation into direction estimation then scene integration produces outputs that better match human gaze following behavior.
The new dataset allows evaluation that better reflects method capacity in real scenarios.
The overall solution significantly outperforms existing gaze following methods on both the new and prior datasets.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The explicit multi-scale direction fields could be reused as input features for other attention or pose-related vision tasks.
Because the dataset is video-based, the same two-stage structure might be extended with frame-to-frame consistency constraints.
Higher accuracy on observer-annotated videos suggests that prior progress on gaze following may have been limited by annotation noise rather than model capacity.

Load-bearing premise

Ground-truth gaze points annotated by the observers inside the videos are more reliable than third-person annotations.

What would settle it

If models using the two-stage architecture show no accuracy gain when evaluated on the new in-video-annotated dataset compared with third-person-annotated datasets, or if in-video annotations prove inconsistent with measured eye positions.

Figures

Figures reproduced from arXiv: 1907.02364 by Dongze Lian, Shenghua Gao, Zehao Yu.

**Figure 1.** Figure 1: (a) (b)), and infer what kind of information (ingredients of the food, the price, expire data, etc.) attracts the consumers’ attention most. Although gaze following is of vital importance, it is extremely challenging because of the reasons below: firstly, actually inferring the gaze point requires the depth information of the scene, head pose and eyeball movement [31,27], nevertheless it is hard to infer t… view at source ↗

**Figure 2.** Figure 2: The network architecture for gaze following. There are two modules in this network: gaze direction pathway and heatmap pathway. In the first stage, a coarse gaze direction is predicted through gaze direction pathway, and then it is encoded as multi-scale gaze direction fields. We concatenate the multi-scale fields and the original image to regress heatmap of final gaze point through heatmap pathway. 3.2 Ga… view at source ↗

**Figure 3.** Figure 3: (a) The original image: the blue line shows gaze direction of the left girl inside the image, and the green dot shows the head position. Gaze direction field, which measures the probability of each point being gaze point with cosine function between the line direction of LHP and predicted gaze direction dˆ. (b) Our DL Gaze dataset. to the maximum value of the heatmap is considered as the final gaze point. … view at source ↗

**Figure 4.** Figure 4: Accumulative error curves of different methods on both datasets. – original image + ROI head: We directly feed the original image into heatmap pathway for heatmap regression. Further, we directly extract the features corresponding to Region of Interest (ROI, the region of head) from the heatmap pathway and use it for gaze direction regression. Then we train the whole network with multi-task learning. – w/o… view at source ↗

**Figure 5.** Figure 5: Some prediction results on the testing set, the red lines indicate the ground truth gaze and the yellow ones are the predicted gaze. The comparison results of different objectives are listed in [PITH_FULL_IMAGE:figures/full_fig_p013_5.png] view at source ↗

**Figure 6.** Figure 6: The first row: ground truth (red lines) and predicted gaze (yellow lines). The second row: predicted heatmaps. (Please zoom in for details.) method can predict gaze points accurately (As shown in [PITH_FULL_IMAGE:figures/full_fig_p014_6.png] view at source ↗

read the original abstract

By borrowing the wisdom of human in gaze following, we propose a two-stage solution for gaze point prediction of the target persons in a scene. Specifically, in the first stage, both head image and its position are fed into a gaze direction pathway to predict the gaze direction, and then multi-scale gaze direction fields are generated to characterize the distribution of gaze points without considering the scene contents. In the second stage, the multi-scale gaze direction fields are concatenated with the image contents and fed into a heatmap pathway for heatmap regression. There are two merits for our two-stage solution based gaze following: i) our solution mimics the behavior of human in gaze following, therefore it is more psychological plausible; ii) besides using heatmap to supervise the output of our network, we can also leverage gaze direction to facilitate the training of gaze direction pathway, therefore our network can be more robustly trained. Considering that existing gaze following dataset is annotated by the third-view persons, we build a video gaze following dataset, where the ground truth is annotated by the observers in the videos. Therefore it is more reliable. The evaluation with such a dataset reflects the capacity of different methods in real scenarios better. Extensive experiments on both datasets show that our method significantly outperforms existing methods, which validates the effectiveness of our solution for gaze following. Our dataset and codes are released in https://github.com/svip-lab/GazeFollowing.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

The two-stage direction-field then heatmap network and the new in-video observer dataset are the concrete novelties, but the abstract supplies no numbers or annotation details so the performance claims stay unverified.

read the letter

The two-stage method and the new dataset with observer annotations are what stand out here. The approach first runs a gaze direction pathway on the head image and position to create multi-scale direction fields. Those fields then get combined with the full image content in a heatmap pathway for the final prediction. They also supervise the direction pathway directly with gaze direction labels. This setup is meant to be more like how people actually follow gaze and gives an extra training signal. Releasing a video dataset where the people in the scene provide the ground truth gaze points is another piece. They argue this is more reliable than third-person annotations and better reflects real scenarios. The paper does a decent job laying out why the two-stage split makes sense and how the extra supervision could help robustness. The main issues are the lack of hard numbers. The abstract claims big improvements on two datasets but doesn't report any scores, baselines, or ablation results. That makes it difficult to gauge whether the method is actually better or by how much. The dataset claim also needs more backing. There's no description of the annotation process, no measures of agreement between observers, and no direct test showing these labels are indeed more reliable than standard ones. Without that, performance on the new data is hard to trust as a fair comparison. This work is aimed at computer vision people doing gaze following or related attention tasks. A reader looking for new architectures in that niche could pick up the direction-field idea. It has enough concrete novelty to go to peer review, though the experiments will need to be filled in with actual results and more dataset details. Recommendation: Send it for review.

Referee Report

2 major / 1 minor

Summary. The paper proposes a two-stage deep network for gaze following. Stage 1 feeds a head crop and its position into a gaze-direction pathway that outputs a predicted direction and multi-scale gaze-direction fields (characterizing possible gaze-point distributions independent of scene content). Stage 2 concatenates these fields with the full image and feeds them into a heatmap pathway that regresses the final gaze heatmap. The architecture is motivated by human gaze-following behavior and allows auxiliary supervision from both heatmaps and direction labels. The authors also introduce a new video gaze-following dataset whose ground-truth gaze points are annotated by the observers appearing in the videos rather than by third-person annotators, claiming this yields more reliable labels and better reflects real-world performance. Experiments on both existing datasets and the new dataset are said to show significant outperformance over prior methods; code and data are released.

Significance. If the performance claims hold and the new dataset's superiority is substantiated, the work would contribute a psychologically motivated two-stage architecture and a potentially more realistic benchmark for gaze following. The public release of code and dataset is a clear strength that aids reproducibility.

major comments (2)

[Abstract / Dataset description] Abstract and Dataset section: The central motivation for the new video gaze-following dataset rests on the assertion that 'the ground truth is annotated by the observers in the videos. Therefore it is more reliable' and that evaluation on it 'reflects the capacity of different methods in real scenarios better.' No annotation protocol, instructions to observers, inter-annotator agreement statistics, or direct comparison of observer vs. third-person labels on the same footage is supplied. This assumption is load-bearing for the claim that gains on the new dataset demonstrate real-scenario superiority.
[Experiments] Experiments / Results: The abstract states that 'extensive experiments on both datasets show that our method significantly outperforms existing methods,' yet the manuscript provides no quantitative tables, baseline implementations, ablation studies, or numerical margins. Without these details the magnitude and robustness of the claimed improvement cannot be assessed.

minor comments (1)

[Method] The description of how multi-scale gaze direction fields are exactly constructed from the predicted direction (e.g., functional form, discretization, normalization) is only sketched at a high level and would benefit from an equation or pseudocode.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on our manuscript. We address each major comment below and will make revisions to strengthen the paper where appropriate.

read point-by-point responses

Referee: [Abstract / Dataset description] Abstract and Dataset section: The central motivation for the new video gaze-following dataset rests on the assertion that 'the ground truth is annotated by the observers in the videos. Therefore it is more reliable' and that evaluation on it 'reflects the capacity of different methods in real scenarios better.' No annotation protocol, instructions to observers, inter-annotator agreement statistics, or direct comparison of observer vs. third-person labels on the same footage is supplied. This assumption is load-bearing for the claim that gains on the new dataset demonstrate real-scenario superiority.

Authors: We agree that the current manuscript would benefit from expanded details on the annotation process to better support our claims. In the revision, we will add a dedicated subsection describing the full annotation protocol, the instructions provided to in-scene observers, inter-annotator agreement statistics, and additional justification (drawing on the collection methodology and related literature) for why observer annotations better reflect real-world gaze following. A direct comparison of observer versus third-person labels on identical footage was not conducted during dataset creation, as the protocol was designed from the outset around in-scene annotation; we will explicitly note this limitation while arguing that the protocol itself provides supporting evidence. revision: yes
Referee: [Experiments] Experiments / Results: The abstract states that 'extensive experiments on both datasets show that our method significantly outperforms existing methods,' yet the manuscript provides no quantitative tables, baseline implementations, ablation studies, or numerical margins. Without these details the magnitude and robustness of the claimed improvement cannot be assessed.

Authors: The Experiments section of the full manuscript does contain quantitative comparisons, baseline results, and ablations, with code released to enable reproduction. However, we acknowledge that these elements could be presented more prominently and with greater detail. We will revise the section to include expanded tables with explicit numerical margins, clearer descriptions of baseline implementations, and additional ablation studies to make the performance gains fully transparent and assessable. revision: yes

Circularity Check

0 steps flagged

No circularity in derivation; standard supervised two-stage network with empirical evaluation

full rationale

The paper presents a two-stage neural architecture (gaze direction pathway followed by heatmap regression) trained via standard supervised losses on direction fields and heatmaps. No equations, parameters, or predictions are shown to reduce by construction to fitted inputs or self-citations. The new dataset claim rests on an unverified reliability assumption rather than any definitional loop or renamed result. Performance claims are external empirical comparisons, leaving the derivation self-contained against benchmarks.

Axiom & Free-Parameter Ledger

1 free parameters · 2 axioms · 0 invented entities

Abstract-only review limits visibility into exact hyperparameters and training details; the central claim rests on the unverified superiority of observer-annotated labels and the psychological plausibility of the two-stage split.

free parameters (1)

network architecture and training hyperparameters
Standard deep-learning weights and learning rates are fitted during training; not enumerated in abstract.

axioms (2)

domain assumption Observer annotations inside the video provide more reliable ground truth than third-person annotations
Invoked to justify construction of the new dataset and to claim better reflection of real scenarios.
domain assumption The two-stage pipeline is psychologically plausible because it mimics human gaze following
Stated as one of the two merits of the solution.

pith-pipeline@v0.9.0 · 5779 in / 1315 out tokens · 36294 ms · 2026-05-25T09:32:28.989291+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

33 extracted references · 33 canonical work pages · 4 internal anchors

[1]

In: Computer Vision and Pattern Recognition, 2009

Deng, J., Dong, W., Socher, R., Li, L.J., Li, K., Fei-Fei, L.: Imagenet: A large-scale hierarchical image database. In: Computer Vision and Pattern Recognition, 2009. CVPR 2009. IEEE Conference on. pp. 248–255. IEEE (2009)

work page 2009
[2]

International journal of computer vision 88(2), 303–338 (2010)

Everingham, M., Van Gool, L., Williams, C.K., Winn, J., Zisserman, A.: The pascal visual object classes (voc) challenge. International journal of computer vision 88(2), 303–338 (2010)

work page 2010
[3]

In: European Conference on Computer Vision

Fathi, A., Li, Y., Rehg, J.M.: Learning to recognize daily actions using gaze. In: European Conference on Computer Vision. pp. 314–327. Springer (2012)

work page 2012
[4]

In: Computer Vision (ICCV), 2015 IEEE International Conference on

Fragkiadaki, K., Levine, S., Felsen, P., Malik, J.: Recurrent network models for hu- man dynamics. In: Computer Vision (ICCV), 2015 IEEE International Conference on. pp. 4346–4354. IEEE (2015)

work page 2015
[5]

In: Proceedings of the IEEE conference on computer vision and pattern recognition

He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of the IEEE conference on computer vision and pattern recognition. pp. 770–778 (2016)

work page 2016
[6]

In: Proceedings of the 2006 symposium on Eye tracking research & applications

Hennessey, C., Noureddin, B., Lawrence, P.: A single camera eye-gaze tracking system with free head motion. In: Proceedings of the 2006 symposium on Eye tracking research & applications. pp. 87–94. ACM (2006)

work page 2006
[7]

Nature reviews neuroscience 2(3), 194 (2001)

Itti, L., Koch, C.: Computational modelling of visual attention. Nature reviews neuroscience 2(3), 194 (2001)

work page 2001
[8]

IEEE Transactions on pattern analysis and machine intelligence 20(11), 1254–1259 (1998)

Itti, L., Koch, C., Niebur, E.: A model of saliency-based visual attention for rapid scene analysis. IEEE Transactions on pattern analysis and machine intelligence 20(11), 1254–1259 (1998)

work page 1998
[9]

In: Computer Vision, 2009 IEEE 12th international conference on

Judd, T., Ehinger, K., Durand, F., Torralba, A.: Learning to predict where humans look. In: Computer Vision, 2009 IEEE 12th international conference on. pp. 2106–

work page 2009
[10]

Adam: A Method for Stochastic Optimization

Kingma, D.P., Ba, J.: Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980 (2014)

work page internal anchor Pith review Pith/arXiv arXiv 2014
[11]

Eye Tracking for Everyone

Krafka, K., Khosla, A., Kellnhofer, P., Kannan, H., Bhandarkar, S., Matusik, W., Torralba, A.: Eye tracking for everyone. arXiv preprint arXiv:1606.05814 (2016)

work page internal anchor Pith review Pith/arXiv arXiv 2016
[12]

In: Advances in neural information processing systems

Krizhevsky, A., Sutskever, I., Hinton, G.E.: Imagenet classiﬁcation with deep con- volutional neural networks. In: Advances in neural information processing systems. pp. 1097–1105 (2012)

work page 2012
[13]

Deep Gaze I: Boosting Saliency Prediction with Feature Maps Trained on ImageNet

K¨ ummerer, M., Theis, L., Bethge, M.: Deep gaze i: Boosting saliency prediction with feature maps trained on imagenet. arXiv preprint arXiv:1411.1045 (2014)

work page internal anchor Pith review Pith/arXiv arXiv 2014
[14]

In: Proc

Leifman, G., Rudoy, D., Swedish, T., Bayro-Corrochano, E., Raskar, R.: Learning gaze transitions from depth to improve video saliency estimation. In: Proc. IEEE Int. Conf. on Computer Vision. vol. 3 (2017)

work page 2017
[15]

In: CVPR

Lin, T.Y., Doll´ ar, P., Girshick, R., He, K., Hariharan, B., Belongie, S.: Feature pyramid networks for object detection. In: CVPR. vol. 1, p. 4 (2017)

work page 2017
[16]

In: European conference on computer vision

Lin, T.Y., Maire, M., Belongie, S., Hays, J., Perona, P., Ramanan, D., Doll´ ar, P., Zitnick, C.L.: Microsoft coco: Common objects in context. In: European conference on computer vision. pp. 740–755. Springer (2014)

work page 2014
[17]

International Journal of Computer Vision 106(3), 282–296 (2014)

Mar´ ın-Jim´ enez, M.J., Zisserman, A., Eichner, M., Ferrari, V.: Detecting people looking at each other in videos. International Journal of Computer Vision 106(3), 282–296 (2014)

work page 2014
[18]

IEEE Transactions on Multimedia 17(11), 2094–2107 (2015) 16 Lian et al

Mukherjee, S.S., Robertson, N.M.: Deep head pose: Gaze-direction estimation in multimodal video. IEEE Transactions on Multimedia 17(11), 2094–2107 (2015) 16 Lian et al

work page 2094
[19]

SalGAN: Visual Saliency Prediction with Generative Adversarial Networks

Pan, J., Canton, C., McGuinness, K., O’Connor, N.E., Torres, J., Sayrol, E., Giro-i Nieto, X.: Salgan: Visual saliency prediction with generative adversarial networks. arXiv preprint arXiv:1701.01081 (2017)

work page internal anchor Pith review Pith/arXiv arXiv 2017
[20]

Vision research 116, 113–126 (2015)

Parks, D., Borji, A., Itti, L.: Augmented saliency model using automatic 3d head pose detection and learned gaze following in natural scenes. Vision research 116, 113–126 (2015)

work page 2015
[21]

In: Proceedings of the IEEE International Conference on Computer Vision

Pﬁster, T., Charles, J., Zisserman, A.: Flowing convnets for human pose estimation in videos. In: Proceedings of the IEEE International Conference on Computer Vision. pp. 1913–1921 (2015)

work page 1913
[22]

Recasens∗, A., Khosla∗, A., Vondrick, C., Torralba, A.: Where are they looking? In: Advances in Neural Information Processing Systems (NIPS) (2015),∗ indicates equal contribution

work page 2015
[23]

In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition

Recasens, A., Vondrick, C., Khosla, A., Torralba, A.: Following gaze in video. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. pp. 1435–1443 (2017)

work page 2017
[24]

International Journal of Computer Vision 115(3), 211–252 (2015)

Russakovsky, O., Deng, J., Su, H., Krause, J., Satheesh, S., Ma, S., Huang, Z., Karpathy, A., Khosla, A., Bernstein, M., et al.: Imagenet large scale visual recog- nition challenge. International Journal of Computer Vision 115(3), 211–252 (2015)

work page 2015
[25]

In: Advances in neural information processing systems

Tompson, J.J., Jain, A., LeCun, Y., Bregler, C.: Joint training of a convolutional network and a graphical model for human pose estimation. In: Advances in neural information processing systems. pp. 1799–1807 (2014)

work page 2014
[26]

In: Computer vision and pattern recognition (CVPR), 2010 IEEE conference on

Xiao, J., Hays, J., Ehinger, K.A., Oliva, A., Torralba, A.: Sun database: Large-scale scene recognition from abbey to zoo. In: Computer vision and pattern recognition (CVPR), 2010 IEEE conference on. pp. 3485–3492. IEEE (2010)

work page 2010
[27]

In: Proceedings of the 2014 ACM International Joint Conference on Pervasive and Ubiquitous Computing: Adjunct Publication

Xiong, X., Liu, Z., Cai, Q., Zhang, Z.: Eye gaze tracking using an rgbd camera: a comparison with a rgb solution. In: Proceedings of the 2014 ACM International Joint Conference on Pervasive and Ubiquitous Computing: Adjunct Publication. pp. 1113–1121. ACM (2014)

work page 2014
[28]

In: Computer Vision (ICCV), 2011 IEEE International Conference on

Yao, B., Jiang, X., Khosla, A., Lin, A.L., Guibas, L., Fei-Fei, L.: Human action recognition by learning bases of action attributes and parts. In: Computer Vision (ICCV), 2011 IEEE International Conference on. pp. 1331–1338. IEEE (2011)

work page 2011
[29]

In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition

Zhang, X., Sugano, Y., Fritz, M., Bulling, A.: Appearance-based gaze estimation in the wild. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. pp. 4511–4520 (2015)

work page 2015
[30]

In: Advances in neural information processing systems

Zhou, B., Lapedriza, A., Xiao, J., Torralba, A., Oliva, A.: Learning deep features for scene recognition using places database. In: Advances in neural information processing systems. pp. 487–495 (2014)

work page 2014
[31]

In: The IEEE International Conference on Computer Vision (ICCV) (Oct 2017)

Zhu, W., Deng, H.: Monocular free-head 3d gaze tracking with deep learning and geometry constraints. In: The IEEE International Conference on Computer Vision (ICCV) (Oct 2017)

work page 2017
[32]

In: Computer Vision and Pattern Recognition (CVPR), 2012 IEEE Conference on

Zhu, X., Ramanan, D.: Face detection, pose estimation, and landmark localization in the wild. In: Computer Vision and Pattern Recognition (CVPR), 2012 IEEE Conference on. pp. 2879–2886. IEEE (2012)

work page 2012
[33]

In: Computer Vision and Pattern Recognition, 2005

Zhu, Z., Ji, Q.: Eye gaze tracking under natural head movements. In: Computer Vision and Pattern Recognition, 2005. CVPR 2005. IEEE Computer Society Con- ference on. vol. 1, pp. 918–923. IEEE (2005) Supplementary Material Dongze Lian∗[0000−0002−4947−0316], Zehao Yu∗[0000−0002−6559−9830], and Shenghua Gao†[0000−0003−1626−2040] School of Information Science...

work page 2005

[1] [1]

In: Computer Vision and Pattern Recognition, 2009

Deng, J., Dong, W., Socher, R., Li, L.J., Li, K., Fei-Fei, L.: Imagenet: A large-scale hierarchical image database. In: Computer Vision and Pattern Recognition, 2009. CVPR 2009. IEEE Conference on. pp. 248–255. IEEE (2009)

work page 2009

[2] [2]

International journal of computer vision 88(2), 303–338 (2010)

Everingham, M., Van Gool, L., Williams, C.K., Winn, J., Zisserman, A.: The pascal visual object classes (voc) challenge. International journal of computer vision 88(2), 303–338 (2010)

work page 2010

[3] [3]

In: European Conference on Computer Vision

Fathi, A., Li, Y., Rehg, J.M.: Learning to recognize daily actions using gaze. In: European Conference on Computer Vision. pp. 314–327. Springer (2012)

work page 2012

[4] [4]

In: Computer Vision (ICCV), 2015 IEEE International Conference on

Fragkiadaki, K., Levine, S., Felsen, P., Malik, J.: Recurrent network models for hu- man dynamics. In: Computer Vision (ICCV), 2015 IEEE International Conference on. pp. 4346–4354. IEEE (2015)

work page 2015

[5] [5]

In: Proceedings of the IEEE conference on computer vision and pattern recognition

He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of the IEEE conference on computer vision and pattern recognition. pp. 770–778 (2016)

work page 2016

[6] [6]

In: Proceedings of the 2006 symposium on Eye tracking research & applications

Hennessey, C., Noureddin, B., Lawrence, P.: A single camera eye-gaze tracking system with free head motion. In: Proceedings of the 2006 symposium on Eye tracking research & applications. pp. 87–94. ACM (2006)

work page 2006

[7] [7]

Nature reviews neuroscience 2(3), 194 (2001)

Itti, L., Koch, C.: Computational modelling of visual attention. Nature reviews neuroscience 2(3), 194 (2001)

work page 2001

[8] [8]

IEEE Transactions on pattern analysis and machine intelligence 20(11), 1254–1259 (1998)

Itti, L., Koch, C., Niebur, E.: A model of saliency-based visual attention for rapid scene analysis. IEEE Transactions on pattern analysis and machine intelligence 20(11), 1254–1259 (1998)

work page 1998

[9] [9]

In: Computer Vision, 2009 IEEE 12th international conference on

Judd, T., Ehinger, K., Durand, F., Torralba, A.: Learning to predict where humans look. In: Computer Vision, 2009 IEEE 12th international conference on. pp. 2106–

work page 2009

[10] [10]

Adam: A Method for Stochastic Optimization

Kingma, D.P., Ba, J.: Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980 (2014)

work page internal anchor Pith review Pith/arXiv arXiv 2014

[11] [11]

Eye Tracking for Everyone

Krafka, K., Khosla, A., Kellnhofer, P., Kannan, H., Bhandarkar, S., Matusik, W., Torralba, A.: Eye tracking for everyone. arXiv preprint arXiv:1606.05814 (2016)

work page internal anchor Pith review Pith/arXiv arXiv 2016

[12] [12]

In: Advances in neural information processing systems

Krizhevsky, A., Sutskever, I., Hinton, G.E.: Imagenet classiﬁcation with deep con- volutional neural networks. In: Advances in neural information processing systems. pp. 1097–1105 (2012)

work page 2012

[13] [13]

Deep Gaze I: Boosting Saliency Prediction with Feature Maps Trained on ImageNet

K¨ ummerer, M., Theis, L., Bethge, M.: Deep gaze i: Boosting saliency prediction with feature maps trained on imagenet. arXiv preprint arXiv:1411.1045 (2014)

work page internal anchor Pith review Pith/arXiv arXiv 2014

[14] [14]

In: Proc

Leifman, G., Rudoy, D., Swedish, T., Bayro-Corrochano, E., Raskar, R.: Learning gaze transitions from depth to improve video saliency estimation. In: Proc. IEEE Int. Conf. on Computer Vision. vol. 3 (2017)

work page 2017

[15] [15]

In: CVPR

Lin, T.Y., Doll´ ar, P., Girshick, R., He, K., Hariharan, B., Belongie, S.: Feature pyramid networks for object detection. In: CVPR. vol. 1, p. 4 (2017)

work page 2017

[16] [16]

In: European conference on computer vision

Lin, T.Y., Maire, M., Belongie, S., Hays, J., Perona, P., Ramanan, D., Doll´ ar, P., Zitnick, C.L.: Microsoft coco: Common objects in context. In: European conference on computer vision. pp. 740–755. Springer (2014)

work page 2014

[17] [17]

International Journal of Computer Vision 106(3), 282–296 (2014)

Mar´ ın-Jim´ enez, M.J., Zisserman, A., Eichner, M., Ferrari, V.: Detecting people looking at each other in videos. International Journal of Computer Vision 106(3), 282–296 (2014)

work page 2014

[18] [18]

IEEE Transactions on Multimedia 17(11), 2094–2107 (2015) 16 Lian et al

Mukherjee, S.S., Robertson, N.M.: Deep head pose: Gaze-direction estimation in multimodal video. IEEE Transactions on Multimedia 17(11), 2094–2107 (2015) 16 Lian et al

work page 2094

[19] [19]

SalGAN: Visual Saliency Prediction with Generative Adversarial Networks

Pan, J., Canton, C., McGuinness, K., O’Connor, N.E., Torres, J., Sayrol, E., Giro-i Nieto, X.: Salgan: Visual saliency prediction with generative adversarial networks. arXiv preprint arXiv:1701.01081 (2017)

work page internal anchor Pith review Pith/arXiv arXiv 2017

[20] [20]

Vision research 116, 113–126 (2015)

Parks, D., Borji, A., Itti, L.: Augmented saliency model using automatic 3d head pose detection and learned gaze following in natural scenes. Vision research 116, 113–126 (2015)

work page 2015

[21] [21]

In: Proceedings of the IEEE International Conference on Computer Vision

Pﬁster, T., Charles, J., Zisserman, A.: Flowing convnets for human pose estimation in videos. In: Proceedings of the IEEE International Conference on Computer Vision. pp. 1913–1921 (2015)

work page 1913

[22] [22]

Recasens∗, A., Khosla∗, A., Vondrick, C., Torralba, A.: Where are they looking? In: Advances in Neural Information Processing Systems (NIPS) (2015),∗ indicates equal contribution

work page 2015

[23] [23]

In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition

Recasens, A., Vondrick, C., Khosla, A., Torralba, A.: Following gaze in video. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. pp. 1435–1443 (2017)

work page 2017

[24] [24]

International Journal of Computer Vision 115(3), 211–252 (2015)

Russakovsky, O., Deng, J., Su, H., Krause, J., Satheesh, S., Ma, S., Huang, Z., Karpathy, A., Khosla, A., Bernstein, M., et al.: Imagenet large scale visual recog- nition challenge. International Journal of Computer Vision 115(3), 211–252 (2015)

work page 2015

[25] [25]

In: Advances in neural information processing systems

Tompson, J.J., Jain, A., LeCun, Y., Bregler, C.: Joint training of a convolutional network and a graphical model for human pose estimation. In: Advances in neural information processing systems. pp. 1799–1807 (2014)

work page 2014

[26] [26]

In: Computer vision and pattern recognition (CVPR), 2010 IEEE conference on

Xiao, J., Hays, J., Ehinger, K.A., Oliva, A., Torralba, A.: Sun database: Large-scale scene recognition from abbey to zoo. In: Computer vision and pattern recognition (CVPR), 2010 IEEE conference on. pp. 3485–3492. IEEE (2010)

work page 2010

[27] [27]

In: Proceedings of the 2014 ACM International Joint Conference on Pervasive and Ubiquitous Computing: Adjunct Publication

Xiong, X., Liu, Z., Cai, Q., Zhang, Z.: Eye gaze tracking using an rgbd camera: a comparison with a rgb solution. In: Proceedings of the 2014 ACM International Joint Conference on Pervasive and Ubiquitous Computing: Adjunct Publication. pp. 1113–1121. ACM (2014)

work page 2014

[28] [28]

In: Computer Vision (ICCV), 2011 IEEE International Conference on

Yao, B., Jiang, X., Khosla, A., Lin, A.L., Guibas, L., Fei-Fei, L.: Human action recognition by learning bases of action attributes and parts. In: Computer Vision (ICCV), 2011 IEEE International Conference on. pp. 1331–1338. IEEE (2011)

work page 2011

[29] [29]

In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition

Zhang, X., Sugano, Y., Fritz, M., Bulling, A.: Appearance-based gaze estimation in the wild. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. pp. 4511–4520 (2015)

work page 2015

[30] [30]

In: Advances in neural information processing systems

Zhou, B., Lapedriza, A., Xiao, J., Torralba, A., Oliva, A.: Learning deep features for scene recognition using places database. In: Advances in neural information processing systems. pp. 487–495 (2014)

work page 2014

[31] [31]

In: The IEEE International Conference on Computer Vision (ICCV) (Oct 2017)

Zhu, W., Deng, H.: Monocular free-head 3d gaze tracking with deep learning and geometry constraints. In: The IEEE International Conference on Computer Vision (ICCV) (Oct 2017)

work page 2017

[32] [32]

In: Computer Vision and Pattern Recognition (CVPR), 2012 IEEE Conference on

Zhu, X., Ramanan, D.: Face detection, pose estimation, and landmark localization in the wild. In: Computer Vision and Pattern Recognition (CVPR), 2012 IEEE Conference on. pp. 2879–2886. IEEE (2012)

work page 2012

[33] [33]

In: Computer Vision and Pattern Recognition, 2005

Zhu, Z., Ji, Q.: Eye gaze tracking under natural head movements. In: Computer Vision and Pattern Recognition, 2005. CVPR 2005. IEEE Computer Society Con- ference on. vol. 1, pp. 918–923. IEEE (2005) Supplementary Material Dongze Lian∗[0000−0002−4947−0316], Zehao Yu∗[0000−0002−6559−9830], and Shenghua Gao†[0000−0003−1626−2040] School of Information Science...

work page 2005