pith. sign in

arxiv: 2605.19859 · v2 · pith:5CRPHCHFnew · submitted 2026-05-19 · 💻 cs.CV

Eyes on VLM: Benchmarking Gaze Following and Social Gaze Prediction in Vision Language Models

Pith reviewed 2026-05-25 05:49 UTC · model grok-4.3

classification 💻 cs.CV
keywords vision language modelsgaze followingsocial gaze predictionbenchmark evaluationmultimodal reasoningzero-shot performancefine-tuninghuman attention
0
0 comments X

The pith

Vision-language models currently lack precise capabilities for following human gaze or predicting social attention behaviors.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces an evaluation framework called EyeVLM to test how well vision-language models handle two gaze-related tasks on existing datasets. Gaze following requires predicting the exact 2D location a person is looking at, which involves geometric and spatial scene understanding. Social gaze prediction involves reasoning about interactions like mutual gaze or shared attention among multiple people. The work compares many open and closed VLMs in zero-shot prompting and after fine-tuning with task-specific question-answer pairs against specialized visual models. Results indicate that VLMs fall short on precision in both tasks, with training narrowing but not closing the gap to visual-only approaches.

Core claim

Current VLMs lack precise gaze understanding capabilities. While standard training helps reduce the gap with visual models, significant improvements are still needed for reliable performance on gaze following, which demands geometric and 3D scene reasoning, and social gaze prediction, which requires relational and interaction reasoning.

What carries the argument

The EyeVLM framework, which applies two tasks—gaze following for 2D location prediction and social gaze prediction for multi-person interaction reasoning—to existing gaze datasets, testing VLMs via zero-shot prompting strategies and task-specific fine-tuning on QA pairs.

If this is right

  • VLMs require additional training focused on geometric, spatial, and 3D scene reasoning to improve gaze following.
  • Fine-tuning on gaze-specific QA pairs narrows but does not eliminate the performance difference with purely visual models.
  • Social gaze prediction may draw more benefit from the language reasoning side of VLMs than the geometric aspects of gaze following do.
  • Gaze understanding in VLMs remains insufficient for applications needing accurate human attention analysis in scenes.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Improved gaze capabilities in VLMs could enable more natural multimodal systems for interpreting video of group interactions without separate visual modules.
  • The split between geometric and social gaze tasks suggests future model designs might need separate pathways for spatial grounding versus relational reasoning.
  • If the performance gap persists across larger scales of data and models, it points to a fundamental limitation in how current VLMs integrate visual geometry with language semantics.

Load-bearing premise

The chosen gaze datasets and the zero-shot prompting plus fine-tuning QA strategies give a fair test of VLM gaze capabilities without bias from how the data was built or prompts designed.

What would settle it

A VLM achieving accuracy on par with state-of-the-art visual models on the same gaze following and social gaze datasets under identical evaluation conditions would show the claimed lack of capability does not hold.

Figures

Figures reproduced from arXiv: 2605.19859 by Anshul Gupta, Hengfei Wang, Jean-Marc Odobez, Pierre Vuillecard.

Figure 1
Figure 1. Figure 1: EyeVLM overview. Left: Gaze Following (GFo) and Social Gaze Prediction (SG). Right: model comparison on GazeFollow [1] (GFo distance error ↓) and VideoAttentionTarget [2] (SG F1 ↑). Results are taken from [PITH_FULL_IMAGE:figures/full_fig_p001_1.png] view at source ↗
Figure 2
Figure 2. Figure 2: Qualitative results for gaze following on the GazeFollow dataset. For each sample, the left [PITH_FULL_IMAGE:figures/full_fig_p008_2.png] view at source ↗
read the original abstract

Vision-language models (VLMs) have rapidly evolved into general-purpose multimodal reasoners with strong zero-shot generalization. In this context, VLMs could greatly benefit the analysis of human gaze and attention, a central task in human behavior understanding that requires reasoning about the physical scene as well as the activity, interactions, and social context. However, the extent to which VLMs can reliably understand human gaze and related attentional behaviors remains largely unexplored. In this work, we present EyeVLM, a systematic evaluation framework for gaze understanding in VLMs across two complementary dimensions: tasks and models. To assess gaze understanding capabilities, we focus on two core tasks. The first, gaze following, i.e., predicting the 2D location where a person is looking, has a geometric and visual processing focus, requiring a precise understanding of the human face, attention direction, 3D scene structure, and spatial grounding of attended targets. The second, social gaze prediction, requires social and relational reasoning over multi-person interactions (e.g., mutual gaze and shared attention), and may benefit more from the LLM semantic reasoning capabilities within VLMs. Regarding models, EyeVLM evaluates these tasks in two ways: a zero-shot setting with a diverse set of state-of-the-art open- and closed-source VLMs, exploring different prompting strategies; and a fine-tuning approach based on task-specific QA pairs, studying the impact of model scale and data scale. As benchmarks, we rely on existing gaze understanding datasets and perform a systematic comparison with state-of-the-art purely visual models. Overall, our results show that current VLMs lack precise gaze understanding capabilities. While standard training helps reduce the gap with visual models, significant improvements are still needed.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 1 minor

Summary. The manuscript introduces the EyeVLM benchmark to evaluate vision-language models (VLMs) on two gaze-related tasks: gaze following, which requires predicting 2D gaze locations, and social gaze prediction, which involves reasoning about interactions. It assesses a range of VLMs in zero-shot and fine-tuned settings using QA pairs derived from existing gaze datasets, compares them to visual models, and concludes that VLMs currently lack precise gaze understanding capabilities, although fine-tuning reduces the performance gap.

Significance. Should the evaluation prove robust, this work offers a valuable systematic comparison highlighting the limitations of VLMs in tasks requiring precise spatial and social reasoning. It provides a foundation for future research on improving VLMs for human behavior understanding, crediting the use of established datasets for direct comparability with prior visual models.

major comments (1)
  1. [Evaluation Framework] The reformulation of gaze following into QA pairs for VLMs, which output text rather than direct coordinates, risks confounding the measured performance gap. Systematic issues in text parsing of coordinates, discretization, or handling of refusals could inflate the reported limitations of VLMs relative to visual models. The abstract and framework description do not detail controls isolating these factors from actual scene-understanding deficits, which is load-bearing for the central claim that VLMs 'lack precise gaze understanding capabilities'.
minor comments (1)
  1. [Abstract] The abstract would benefit from including specific quantitative results or dataset names to better contextualize the findings.

Simulated Author's Rebuttal

1 responses · 0 unresolved

We thank the referee for the constructive comment regarding the evaluation framework. We address the concern point-by-point below and will revise the manuscript accordingly to improve clarity on output handling.

read point-by-point responses
  1. Referee: [Evaluation Framework] The reformulation of gaze following into QA pairs for VLMs, which output text rather than direct coordinates, risks confounding the measured performance gap. Systematic issues in text parsing of coordinates, discretization, or handling of refusals could inflate the reported limitations of VLMs relative to visual models. The abstract and framework description do not detail controls isolating these factors from actual scene-understanding deficits, which is load-bearing for the central claim that VLMs 'lack precise gaze understanding capabilities'.

    Authors: We agree that the current manuscript description provides insufficient detail on output parsing and potential confounding factors, which merits clarification. In the revision, we will expand the methods section with: (1) exact prompting templates used to request parseable coordinate outputs (e.g., fixed templates or JSON structures), (2) the deterministic extraction procedure (regex-based parsing with explicit rules for refusals and invalid formats, where such cases are scored as incorrect), and (3) reported parsing success rates per model (which exceeded 90% in our evaluations). While we did not include an explicit control experiment providing ground-truth text for parsing validation, the large observed gaps and consistent trends across models support that the primary limitation is in scene understanding. We will add these details and a short discussion of parsing robustness to ensure the evaluation is transparent. revision: yes

Circularity Check

0 steps flagged

Purely empirical benchmark; no derivations or self-referential reductions

full rationale

The paper is a benchmark study that evaluates existing VLMs on gaze following and social gaze prediction using pre-existing datasets, zero-shot prompting, and fine-tuning on QA reformulations. It performs direct comparisons against external visual models without any claimed derivations, equations, fitted parameters renamed as predictions, or load-bearing self-citations. All performance numbers are measured outputs against independent benchmarks, so the central claim (VLMs lag visual models) rests on those measurements rather than reducing to the paper's own inputs by construction.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

No mathematical derivations, free parameters, or new postulated entities; the work is an empirical benchmarking study relying on standard ML evaluation practices.

pith-pipeline@v0.9.0 · 5857 in / 999 out tokens · 19999 ms · 2026-05-25T05:49:36.745691+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Reference graph

Works this paper leans on

87 extracted references · 87 canonical work pages · 9 internal anchors

  1. [1]

    Where are they looking?

    A. Recasens, A. Khosla, C. V ondrick, and A. Torralba, “Where are they looking?” in Advances in Neural Information Processing Systems (NeurIPS), 2015. [Online]. Available: https://papers.neurips.cc/paper/5848-where-are-they-looking.pdf

  2. [2]

    Detecting attended visual targets in video,

    E. Chong, Y . Wang, N. Ruiz, and J. M. Rehg, “Detecting attended visual targets in video,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2020, pp. 5396–5406. [Online]. Available: https://openaccess.thecvf.com/content_CVPR_2020/papers/Chong_ Detecting_Attended_Visual_Targets_in_Video_CVPR_2020_paper.pdf

  3. [3]

    The eyes have it: The neuroethology, function and evolution of social gaze,

    N. J. Emery, “The eyes have it: The neuroethology, function and evolution of social gaze,”Neuroscience & Biobehavioral Reviews, vol. 24, no. 6, pp. 581–604, 2000

  4. [4]

    Gaze- lle: Gaze target estimation via large-scale learned encoders,

    F. Ryan, A. Bati, S. Lee, D. Bolya, J. Hoffman, and J. M. Rehg, “Gaze- lle: Gaze target estimation via large-scale learned encoders,” inProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2025. [Online]. Available: https://openaccess.thecvf.com/content/CVPR2025/papers/Ryan_Gaze-LLE_Gaze_Target_ Estimation_via_Large-Sc...

  5. [5]

    Gaze cueing of attention: Visual attention, social cognition, and individual differences,

    A. Frischen, A. P. Bayliss, and S. P. Tipper, “Gaze cueing of attention: Visual attention, social cognition, and individual differences,”Psychological Bulletin, vol. 133, no. 4, pp. 694–724, 2007

  6. [6]

    Laeo-net++: Revisiting people looking at each other in videos,

    M. J. Marin-Jimenez, V . Kalogeiton, P. Medina-Suarez, and A. Zisserman, “Laeo-net++: Revisiting people looking at each other in videos,”arXiv preprint arXiv:2101.02136, 2021. [Online]. Available: https://arxiv.org/abs/2101.02136

  7. [7]

    A unified model for gaze following and social gaze prediction,

    A. Gupta, S. Tafasca, N. Chutisilp, and J.-M. Odobez, “A unified model for gaze following and social gaze prediction,” inIEEE International Conference on Automatic Face and Gesture Recognition (FG), 2024. [Online]. Available: https://publications.idiap.ch/attachments/papers/2024/Gupta_FG_2024.pdf

  8. [8]

    Inferring shared attention in social scene videos,

    L. Fan, W. Chen, P. Wei, S.-K. Yeung, T. Wong, and J. Xing, “Inferring shared attention in social scene videos,” inProceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR),

  9. [9]

    Available: https://openaccess.thecvf.com/content_cvpr_2018/papers/Fan_Inferring_ Shared_Attention_CVPR_2018_paper.pdf

    [Online]. Available: https://openaccess.thecvf.com/content_cvpr_2018/papers/Fan_Inferring_ Shared_Attention_CVPR_2018_paper.pdf

  10. [10]

    Sharingan: A transformer architecture for multi-person gaze following,

    S. Tafasca, A. Gupta, and J.-M. Odobez, “Sharingan: A transformer architecture for multi-person gaze following,” inProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2024. [Online]. Available: https://openaccess.thecvf.com/content/CVPR2024/papers/Tafasca_ Sharingan_A_Transformer_Architecture_for_Multi-Person_Gaze_Fol...

  11. [11]

    Mtgs: A novel framework for multi-person temporal gaze following and social gaze prediction,

    A. Gupta, S. Tafasca, A. Farkhondeh, P. Vuillecard, and J.-M. Odobez, “Mtgs: A novel framework for multi-person temporal gaze following and social gaze prediction,” inAdvances in Neural Information Processing Systems (NeurIPS), 2024. [Online]. Available: https://arxiv.org/abs/2403.10511

  12. [12]

    Learning Transferable Visual Models From Natural Language Supervision

    A. Radford, J. W. Kim, C. Hallacy, A. Ramesh, G. Goh, S. Agarwal, G. Sastry, A. Askell, P. Mishkin, J. Clark, G. Krueger, and I. Sutskever, “Learning transferable visual models from natural language supervision,” inProceedings of the International Conference on Machine Learning (ICML), 2021. [Online]. Available: https://arxiv.org/abs/2103.00020

  13. [13]

    BLIP-2: Bootstrapping Language-Image Pre-training with Frozen Image Encoders and Large Language Models

    J. Li, D. Li, S. Savarese, and S. Hoi, “Blip-2: Bootstrapping language-image pre-training with frozen image encoders and large language models,” inProceedings of the International Conference on Machine Learning (ICML), 2023. [Online]. Available: https://arxiv.org/abs/2301.12597

  14. [14]

    Visual Instruction Tuning

    H. Liu, C. Li, Q. Wu, and Y . J. Lee, “Visual instruction tuning,” inAdvances in Neural Information Processing Systems (NeurIPS), 2023. [Online]. Available: https://arxiv.org/abs/2304.08485

  15. [15]

    Language Is Not All You Need: Aligning Perception with Language Models

    S. Huang, L. Dong, W. Wang, Y . Yang, F. Wang, F. Liu, Z. Chi, T. Zhang, Q. Li, F. Linet al., “Kosmos-1: Multimodal large language model in the wild,”arXiv preprint arXiv:2302.14045, 2023

  16. [16]

    Social caption: Evaluating social understanding in multimodal models,

    B. Thumu, L. Mathur, Y . Kebe, and L.-P. Morency, “Social caption: Evaluating social understanding in multimodal models,” 2026. [Online]. Available: https://arxiv.org/abs/2601.14569

  17. [17]

    Qwen3 Technical Report

    Q. Team, “Qwen3 technical report,” 2025. [Online]. Available: https://arxiv.org/abs/2505.09388

  18. [18]

    Dual attention guided gaze target detection in the wild,

    Y . Fang, J. Tang, W. Shen, W. Shen, X. Gu, L. Song, and G. Zhai, “Dual attention guided gaze target detection in the wild,” inProceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2021, pp. 11 390–11 399. 10

  19. [19]

    Believe it or not, we know what you are looking at!

    D. Lian, Z. Yu, and S. Gao, “Believe it or not, we know what you are looking at!” inAsian Conference on Computer Vision. Springer, 2018, pp. 35–50

  20. [20]

    A modular multimodal architecture for gaze target prediction: Application to privacy-sensitive settings,

    A. Gupta, S. Tafasca, and J.-M. Odobez, “A modular multimodal architecture for gaze target prediction: Application to privacy-sensitive settings,” inProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), 2022, pp. 5041–5050

  21. [21]

    Depth-aware gaze-following via auxiliary networks for robotics,

    T. Jin, Q. Yu, S. Zhu, Z. Lin, J. Ren, Y . Zhou, and W. Song, “Depth-aware gaze-following via auxiliary networks for robotics,”Engineering Applications of Artificial Intelligence, vol. 113, p. 104924, 2022

  22. [22]

    Childplay: A new benchmark for understanding children’s gaze behaviour,

    S. Tafasca, A. Gupta, and J.-M. Odobez, “Childplay: A new benchmark for understanding children’s gaze behaviour,” inProceedings of the IEEE/CVF international conference on computer vision, 2023, pp. 20 935–20 946

  23. [23]

    Patch-level gaze distribution prediction for gaze following,

    Q. Miao, M. Hoai, and D. Samaras, “Patch-level gaze distribution prediction for gaze following,” inProceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision (WACV), 2023, pp. 880–889. [Online]. Available: https://openaccess.thecvf.com/content/W ACV2023/papers/Miao_ Patch-Level_Gaze_Distribution_Prediction_for_Gaze_Following_W ACV_20...

  24. [24]

    Detecting people looking at each other in videos,

    M. J. Marin-Jimenez, A. Zisserman, M. Eichner, and V . Ferrari, “Detecting people looking at each other in videos,”International Journal of Computer Vision (IJCV), 2013. [Online]. Available: https://www.robots.ox.ac.uk/~vgg/publications/2014/Marin14/marin14.pdf

  25. [25]

    Laeo-net: Re- visiting people looking at each other in videos,

    M. J. Marin-Jimenez, V . Kalogeiton, P. Medina-Suarez, and A. Zisserman, “Laeo-net: Re- visiting people looking at each other in videos,” inProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2019. [Online]. Avail- able: https://openaccess.thecvf.com/content_CVPR_2019/papers/Marin-Jimenez_LAEO-Net_Revisiting_ People_L...

  26. [26]

    Boosting image- based mutual gaze detection using pseudo 3d gaze,

    B. Doosti, C.-H. Chen, R. Vemulapalli, X. Jia, Y . Zhu, and B. Green, “Boosting image- based mutual gaze detection using pseudo 3d gaze,”Proceedings of the AAAI Conference on Artificial Intelligence, vol. 35, no. 2, pp. 1273–1281, May 2021. [Online]. Available: https://ojs.aaai.org/index.php/AAAI/article/view/16215

  27. [27]

    Hhp-net: A light heteroscedastic neural network for head pose estimation with uncertainty,

    G. Cantarini, F. F. Tomenotti, N. Noceti, and F. Odone, “Hhp-net: A light heteroscedastic neural network for head pose estimation with uncertainty,” 2021

  28. [28]

    Attention flow: End-to-end joint attention estimation,

    Ö. Sümer, P. Gerjets, U. Trautwein, and E. Kasneci, “Attention flow: End-to-end joint attention estimation,” inProceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision (WACV), 2020. [Online]. Available: https://openaccess.thecvf.com/content_W ACV_2020/papers/Sumer_Attention_Flow_ End-to-End_Joint_Attention_Estimation_W ACV_2020_paper.pdf

  29. [29]

    Exploring the zero-shot capabilities of vision- language models for improving gaze following,

    A. Gupta, P. Vuillecard, A. Farkhondeh, and J.-M. Odobez, “Exploring the zero-shot capabilities of vision- language models for improving gaze following,” inProceedings of the ieee/cvf conference on computer vision and pattern recognition, 2024, pp. 615–624

  30. [30]

    Gazellm: a plug-and-play zero-shot llm reasoning framework for boosting gaze target detection,

    Y . Yang and F. Lu, “Gazellm: a plug-and-play zero-shot llm reasoning framework for boosting gaze target detection,”Visual Intelligence, vol. 3, no. 1, p. 26, 2025

  31. [31]

    Gazevlm: A vision-language model for multi-task gaze understanding,

    A. M. Mathew, H. Hermassi, T. Khalid, and A. A. Khan, “Gazevlm: A vision-language model for multi-task gaze understanding,”arXiv preprint arXiv:2511.06348, 2025

  32. [32]

    Vl4gaze: Unleashing vision-language models for gaze following,

    S. Wang, C. Cui, Y . Huang, H. J. Chang, and Y . Cheng, “Vl4gaze: Unleashing vision-language models for gaze following,”arXiv preprint arXiv:2512.20735, 2025

  33. [33]

    Chain- of-thought prompting elicits reasoning in large language models,

    J. Wei, X. Wang, D. Schuurmans, M. Bosma, B. Ichter, F. Xia, E. H. Chi, Q. V . Le, and D. Zhou, “Chain- of-thought prompting elicits reasoning in large language models,” inAdvances in Neural Information Processing Systems (NeurIPS), 2022

  34. [34]

    Large Language Models are Zero-Shot Reasoners

    T. Kojima, S. Gu, M. Reid, Y . Matsuo, and Y . Iwasawa, “Large language models are zero-shot reasoners,” arXiv preprint arXiv:2205.11916, 2022

  35. [35]

    Learn to explain: Multimodal reasoning via thought chains for science question answering,

    P. Lu, S. Mishra, T. Xia, L. Qiu, K.-W. Chang, S.-C. Zhu, O. Tafjord, P. Clark, and A. Kalyan, “Learn to explain: Multimodal reasoning via thought chains for science question answering,” inAdvances in Neural Information Processing Systems (NeurIPS), 2022

  36. [36]

    Multimodal chain-of-thought reasoning in language models,

    Z. Zhang, A. Zhang, M. Li, H. Zhao, G. Karypis, and A. Smola, “Multimodal chain-of-thought reasoning in language models,”Transactions on Machine Learning Research (TMLR), 2024. 11

  37. [37]

    Better Eyes, Better Thoughts: Why Vision Chain-of-Thought Fails in Medicine

    Y . Wu, Z. Yang, J. Qian, S. Gao, G. Chen, Q. Li, Y .-A. Huang, and Z.-A. Huang, “Better eyes, better thoughts: Why vision chain-of-thought fails in medicine,”arXiv preprint arXiv:2603.06665, 2026

  38. [38]

    Visual chain-of-thought prompting for knowledge-based visual reasoning,

    Z.-F. Chenet al., “Visual chain-of-thought prompting for knowledge-based visual reasoning,” inProceed- ings of the AAAI Conference on Artificial Intelligence, 2024

  39. [39]

    Chain of thought prompt tuning in vision language models,

    J. Ge, H. Luo, S. Qian, Y . Gan, J. Fu, and S. Zhang, “Chain of thought prompt tuning in vision language models,”arXiv preprint arXiv:2304.07919, 2023

  40. [40]

    Enhancing spatial reasoning in vision-language models via chain-of-thought prompting and reinforcement learning,

    B. Ji, S. Agrawal, Q. Tang, and Y . Wu, “Enhancing spatial reasoning in vision-language models via chain-of-thought prompting and reinforcement learning,”arXiv preprint arXiv:2507.13362, 2025

  41. [41]

    mtgs-static-vsgaze,

    A. Gupta, “mtgs-static-vsgaze,” 2026, accessed: 2026-05-06. [Online]. Available: https: //huggingface.co/Idiap/mtgs-static-vsgaze

  42. [42]

    Multimodal across domains gaze target detection,

    F. Tonini, C. Beyan, and E. Ricci, “Multimodal across domains gaze target detection,” inProceedings of the 2022 International Conference on Multimodal Interaction, 2022, pp. 420–431

  43. [43]

    Gaze target estimation inspired by interactive attention,

    Z. Hu, K. Zhao, B. Zhou, H. Guo, S. Wu, Y . Yang, and J. Liu, “Gaze target estimation inspired by interactive attention,”IEEE Transactions on Circuits and Systems for Video Technology, vol. 32, no. 12, pp. 8524–8536, 2022

  44. [44]

    Where are they looking in the 3d space?

    N. Horanyi, L. Zheng, E. Chong, A. Leonardis, and H. J. Chang, “Where are they looking in the 3d space?” inProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), 2023, pp. 2678–2687

  45. [45]

    InternVL3.5: Advancing Open-Source Multimodal Models in Versatility, Reasoning, and Efficiency

    W. Wang, Z. Gao, L. Gu, H. Pu, L. Cui, X. Wei, Z. Liu, L. Jing, S. Ye, J. Shaoet al., “Internvl3.5: Advancing open-source multimodal models in versatility, reasoning, and efficiency,”arXiv preprint arXiv:2508.18265, 2025

  46. [46]

    Gpt-5.4 thinking system card,

    OpenAI, “Gpt-5.4 thinking system card,” https://openai.com/index/gpt-5-4-thinking-system-card/, 2026

  47. [47]

    Gemini 3.1 pro model card,

    Google DeepMind, “Gemini 3.1 pro model card,” https://deepmind.google/models/model-cards/ gemini-3-1-pro/, 2026

  48. [48]

    LlamaFactory: Unified Efficient Fine-Tuning of 100+ Language Models

    Y . Zheng, R. Zhang, J. Zhang, Y . Ye, Z. Luo, Z. Feng, and Y . Ma, “Llamafactory: Unified efficient fine-tuning of 100+ language models,” inProceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 3: System Demonstrations). Bangkok, Thailand: Association for Computational Linguistics, 2024. [Online]. Available: http...

  49. [49]

    Efficient memory management for large language model serving with pagedattention,

    W. Kwon, Z. Li, S. Zhuang, Y . Sheng, L. Zheng, C. H. Yu, J. E. Gonzalez, H. Zhang, and I. Stoica, “Efficient memory management for large language model serving with pagedattention,” inProceedings of the ACM SIGOPS 29th Symposium on Operating Systems Principles, 2023

  50. [50]

    Austism diagnostic observation schedule: A standardized observation of communicative and social behavior,

    C. Lord, M. Rutter, S. Goode, J. Heemsbergen, H. Jordan, L. Mawhood, and E. Schopler, “Austism diagnostic observation schedule: A standardized observation of communicative and social behavior,” Journal of autism and developmental disorders, vol. 19, no. 2, pp. 185–212, 1989

  51. [51]

    Autism Diagnostic Observation Schedule, Second Edition,

    Lord, Rutter, DiLavore, Risi, Gotham, and Bishop, “Autism Diagnostic Observation Schedule, Second Edition,” 2012. 12 A Implementation Details We fine-tune representative Qwen-based VLMs on the constructed QA pairs using LLaMA- Factory [47]. Specifically, LoRA adapters are inserted into both the self-attention projection layers (q_proj, k_proj, v_proj, and...

  52. [53]

    Identify the activity or activities in which the person of interest is engaged in

  53. [54]

    Analyze the orientation of their head and eyes (if visible) to infer their gaze direction

    Determine the direction of gaze of the person of interest. Analyze the orientation of their head and eyes (if visible) to infer their gaze direction

  54. [55]

    Evaluate as well whether the person of interest is looking at something that is within the image area, or outside the image area

    Identify and locate all plausible gaze targets (objects and/or persons) in the image that are visible to the person of interest, that is, that are within their field-of-view. Evaluate as well whether the person of interest is looking at something that is within the image area, or outside the image area

  55. [56]

    Describe the spatial relationships between the person of interest and each of these plausible gaze targets

  56. [57]

    Based on the scene context and information above, estimate how possible the person of interest is looking inside the frame and which plausible gaze target is the person of interest most likely looking at regardless of whether it is inside or outside the frame

  57. [58]

    inout": <0.0-1.0>,

    Using your reasoning, localize the most probable target of attention of the person of interest as a point. Use this point as the final gaze point. Provide the requested outputs (inout and gaze_point). ### Input/Output Format: - **Input:** An image followed by a set of **bounding box coordinates** of the head of the person of interest. - **Output:** Report...

  58. [59]

    The person of interest is the woman on the left, identified by the provided head bounding box

    The image shows a café or casual dining setting with two people sitting at a table. The person of interest is the woman on the left, identified by the provided head bounding box

  59. [60]

    Her head and upper body are turned toward the right side of the image, which indicates that her attention is directed toward the man beside her or toward an object near him

  60. [61]

    This makes the interaction area on the right side of the image the most likely region of attention

    Her visible facial orientation suggests a rightward gaze rather than a downward look at the table or an upward look toward the background shelves. This makes the interaction area on the right side of the image the most likely region of attention

  61. [62]

    The cup is the stronger candidate because it is visually salient, close to her line of sight, and central to the man’s ongoing action, while his face is partly occluded by the cup

    Within that rightward region, the most plausible gaze targets are the man’s face and the large white cup he is holding. The cup is the stronger candidate because it is visually salient, close to her line of sight, and central to the man’s ongoing action, while his face is partly occluded by the cup. By contrast, other objects such as the glass on the tabl...

  62. [63]

    inout": 1.0,

    Therefore, the most likely gaze target is the large white cup on the right side of the image. Since this target is clearly inside the frame, the probability of looking inside the image should be high. ### Gaze Point ### ```json [ {"inout": 1.0, "gaze_point": [0.81, 0.24]} ] ``` user: <image>[0.333, 0.056, 0.498, 0.221] 20 assistant: ### Reasoning ###

  63. [64]

    The person of interest is the player in red at the center, identified by the given head bounding box

    The image shows a field sport scene with several players. The person of interest is the player in red at the center, identified by the given head bounding box

  64. [65]

    His face orientation suggests that his attention is not on the ground directly below him or on the distant players in the background

    His body is moving forward, but his head is turned toward the left side of the image. His face orientation suggests that his attention is not on the ground directly below him or on the distant players in the background

  65. [66]

    It is close to him, in front of his hurley, and consistent with the ongoing action

    The most visually salient object in the direction of his head and eyes is the small ball on the left side of the image. It is close to him, in front of his hurley, and consistent with the ongoing action

  66. [67]

    The ball is much more consistent with both the sports context and his immediate action

    Other possible targets, such as the nearby teammate behind him or the player in blue on the right, are less likely because they are not aligned with his apparent line of sight. The ball is much more consistent with both the sports context and his immediate action

  67. [68]

    The ball is clearly inside the frame, so the probability that he is looking inside the image is very high

  68. [69]

    inout": 1.0,

    Therefore, the most plausible gaze target is the ball on the left side of the image, with the gaze point placed near its center. ### Gaze Point ### ```json [ {"inout": 1.0, "gaze_point": [0.14, 0.23]} ] ``` Example provided. Now, analyze the following image. K Zero-Shot Prompts for SG For simplicity, we use <Task description> to denote the description of ...

  69. [70]

    Describe what is happening in the image

  70. [71]

    Identify the activity or activities in which the people of interest are engaged in

  71. [72]

    Analyze the orientation of their head and eyes (if visible) to infer their gaze direction

    Determine the direction of gaze of the people of interest. Analyze the orientation of their head and eyes (if visible) to infer their gaze direction

  72. [73]

    Identify and locate all plausible gaze targets (objects and/or persons) in the image that are visible to the people of interest, that is, that are within their field-of-view

  73. [74]

    Describe the spatial relationships between the people of interest and each of these plausible gaze targets

  74. [75]

    Based on the scene context and information above, determine whether the people of interest are engaged in social gaze

  75. [76]

    ### Input/Output Format: - **Input:** An image followed by a pair of **bounding box coordinates** of the heads of the people of interest

    Using your reasoning, estimate the probability of <task> for the people of interest. ### Input/Output Format: - **Input:** An image followed by a pair of **bounding box coordinates** of the heads of the people of interest. - **Output:** Report in JSON format: - **label** the probability of <task>. **Required Output Format:** ### Reasoning ### <Your step-b...

  76. [77]

    The two people are a woman in the center and a child to her lower left, sitting together on a bed with others around them

  77. [78]

    They appear to be interacting in a group conversation or family scene

  78. [79]

    The child’s head is tilted upward toward the woman, so the child is looking at the woman’s face/head

  79. [80]

    The woman’s head is turned slightly down and left toward the child, indicating her gaze is directed at the child

  80. [81]

    Other plausible gaze targets exist in the room, but both people’s faces are oriented toward each other more than toward anyone else

Showing first 80 references.