pith. machine review for the scientific record.

arxiv: 2604.09368 · v1 · submitted 2026-04-10 · 💻 cs.MM · cs.CV


Through Their Eyes: Fixation-aligned Tuning for Personalized User Emulation


Pith reviewed 2026-05-10 16:18 UTC · model grok-4.3

classification 💻 cs.MM cs.CV
keywords: user simulation · vision-language models · gaze patterns · recommender systems · attention alignment · soft prompts · eye-tracking · personalized emulation

The pith

Aligning a vision-language model's visual attention with individual user gaze patterns improves simulation of clicks on recommendation interfaces.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper investigates whether vision-language models acting as user simulators can better replicate real behavior when their attention over visual layouts is steered to match how specific people actually look at screens. Eye-tracking data from a carousel recommendation setting shows that users maintain stable personal scanning habits that reliably predict which items they click. The proposed FixATE approach extracts a comparable relevance distribution from the model's internal attention using interpretability probes, then trains personalized soft prompts that shift this distribution toward each user's observed fixation pattern. Experiments testing three different probing operators and two separate model architectures report steady gains in both attention match quality and click prediction accuracy. This points to a path where simulators perceive and act on visual interfaces in ways that more closely track real users rather than relying on text or metadata alone.

Core claim

Probing a vision-language model's internal visual attention via interpretability operators yields a slot-level relevance distribution that can be aligned with human fixation data; learning user-specific soft prompts then steers the model toward each individual's characteristic gaze pattern, producing measurable improvements in attention alignment and downstream click prediction on visual recommendation layouts.

What carries the argument

Fixation-Aligned Tuning for user Emulation (FixATE), which extracts comparable relevance distributions from VLM attention probes and optimizes personalized soft prompts to match observed human fixation distributions.
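Read mechanically, this suggests a compact training loop: freeze the backbone, expose only a small per-user soft prompt, and minimize a divergence between the probe-derived slot distribution and that user's fixation distribution. The toy sketch below (PyTorch) substitutes a frozen random projection for the real VLM-plus-probe so it runs end to end; the prompt shape, the KL objective, and every name here are illustrative assumptions, not details taken from the paper.

    import torch
    import torch.nn.functional as F

    torch.manual_seed(0)
    K, HIDDEN, PROMPT_LEN = 5, 64, 8  # carousel slots, embed dim, prompt tokens

    # Stand-in for "probe the VLM's attention and pool it into K slots".
    # In the paper this is one of three interpretability operators applied to
    # a real backbone; a frozen random projection plays that role here so the
    # sketch is self-contained and differentiable in the prompt.
    W_frozen = torch.randn(PROMPT_LEN * HIDDEN, K)

    def probe_slot_distribution(prompt: torch.Tensor) -> torch.Tensor:
        """Slot-level relevance distribution, differentiable in the prompt."""
        return (prompt.reshape(-1) @ W_frozen).softmax(dim=-1)

    # Personalized soft prompt: the only trainable parameters; backbone frozen.
    soft_prompt = torch.nn.Parameter(0.02 * torch.randn(PROMPT_LEN, HIDDEN))
    opt = torch.optim.AdamW([soft_prompt], lr=1e-2)

    # Hypothetical target: one user's empirical fixation distribution.
    fixation = torch.tensor([0.45, 0.25, 0.15, 0.10, 0.05])

    for _ in range(200):
        p_model = probe_slot_distribution(soft_prompt)
        # Fixation-alignment objective: KL(fixation || model) over slots.
        loss = F.kl_div(p_model.log(), fixation, reduction="sum")
        opt.zero_grad()
        loss.backward()
        opt.step()

    print(probe_slot_distribution(soft_prompt).detach())  # ~ fixation

With a real backbone, probe_slot_distribution would run one of the paper's three interpretability operators over the model's attention maps and pool per-token relevance into the K carousel slots; only the divergence-minimization skeleton is meant to carry over from this toy.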

If this is right

  • Attention alignment improves consistently when using any of the three tested interpretability-based probing operators.
  • Click prediction accuracy rises for both of the architecturally distinct VLM backbones examined.
  • Simulators gain the ability to perceive recommendations through visual interfaces rather than text or metadata alone.
  • The method produces user-specific emulation that better reproduces how individuals scan and act on layouts.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • The same prompting technique could be tested on other visual interfaces such as product grids or social feeds where gaze variation also drives choices.
  • If soft prompts can be adapted from limited new gaze samples, simulators might support rapid personalization for different demographic segments without retraining entire models.
  • Combining fixation alignment with existing text-based user models could create hybrid simulators that capture both visual scanning and preference reasoning.
  • The approach raises the possibility of measuring simulation quality directly through attention metrics rather than relying solely on downstream task accuracy.

Load-bearing premise

Users maintain stable individual gaze patterns that are strongly predictive of their click behavior, and these patterns can be captured and matched by steering a VLM's attention with learned soft prompts.

What would settle it

Collect new eye-tracking and click data from the same carousel setting, apply FixATE to align a fresh VLM instance, then measure click-prediction accuracy on held-out recommendations; if accuracy shows no reliable improvement over an untuned baseline VLM, the value of the alignment step is not supported.
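One way to score that falsification test is to treat each held-out session as a paired trial and run an exact McNemar test on the discordant pairs. The sketch below uses simulated outcomes and illustrative names; the paper does not specify this particular test.

    import numpy as np
    from scipy.stats import binomtest

    rng = np.random.default_rng(0)
    N = 500  # held-out sessions (illustrative)

    # 1 = the simulator's predicted click matches the real user's click.
    baseline_correct = rng.random(N) < 0.45  # untuned VLM (simulated)
    fixate_correct = rng.random(N) < 0.55    # FixATE-tuned (simulated)

    b = int(np.sum(~baseline_correct & fixate_correct))  # only FixATE right
    c = int(np.sum(baseline_correct & ~fixate_correct))  # only baseline right

    # Exact McNemar test: under H0 (the alignment step adds nothing),
    # b ~ Binomial(b + c, 0.5) among the discordant pairs.
    res = binomtest(b, n=b + c, p=0.5, alternative="greater")
    print(f"baseline={baseline_correct.mean():.3f} "
          f"fixate={fixate_correct.mean():.3f} p={res.pvalue:.4f}")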

Figures

Figures reproduced from arXiv: 2604.09368 by Huizhong Guo, Lingfeng Huang, Tianjun Wei, Yingpeng Du, Zhu Sun.

Figure 1: The difference of the perception of recommenda…
Figure 2: Empirical analyses on the RecGaze dataset.
Figure 3: Overview of the Fixation-aligned Tuning for User Emulation (FixATE)…
Figure 4: Multimodal prompt template used in FixATE.
Figure 5: Sensitivity analysis of key hyperparameters on two…
Figure 6: Case study of slot-level attention on Qwen3-VL. (a) Original backbone attention; (b) Human gaze; (c) Attention from FixATE. Posters are blurred for copyright reasons.
Figure 7: Supplementary qualitative comparison of slot-level attention. Each row is one session (user…)
Original abstract

Large language model (LLM) agents are increasingly deployed as scalable user simulators for recommender system evaluation. Yet existing simulators perceive recommendations through text or structured metadata rather than the visual interfaces real users browse, a critical gap, since attention over recommendation layouts is both visually driven and highly personalized. We investigate whether aligning a vision-language model's (VLM's) visual attention with user-specific gaze patterns can improve simulation fidelity. Analysis of a real-world eye-tracking dataset collected in a carousel-based recommendation setting reveals that users exhibit stable individual gaze patterns strongly predictive of click behavior. Building on this finding, we propose Fixation-Aligned Tuning for user Emulation (FixATE). Our approach first probes the VLM's internal visual attention via interpretability operators to obtain a slot-level relevance distribution comparable with human fixation, and then learns personalized soft prompts to steer the model's attention toward each user's characteristic fixation pattern. Experiments across three interpretability-based probing operators and two architecturally distinct VLM backbones demonstrate consistent improvements in both attention alignment and click prediction accuracy. These results suggest that making the model "see like the user" is a viable path toward simulators that more faithfully reproduce how users perceive and act in recommendation interfaces.

Editorial analysis

A structured set of objections, weighed in public.

A referee report, a simulated authors' rebuttal, a circularity audit, and an axiom and free-parameter ledger. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper claims that users show stable, individualized gaze patterns in carousel-based recommendation interfaces that are predictive of clicks, and that a VLM's internal visual attention (probed via interpretability operators) can be aligned to these patterns via learned personalized soft prompts. The proposed FixATE method is evaluated across three probing operators and two architecturally distinct VLM backbones, reporting consistent gains in attention alignment and click-prediction accuracy over baselines.

Significance. If the empirical gains hold under rigorous controls, the work would meaningfully advance personalized user emulation for recommender-system evaluation by moving beyond text-only simulators to incorporate visually grounded, user-specific attention. The multi-operator, multi-backbone design is a strength that supports claims of robustness rather than operator-specific artifacts.

major comments (2)
  1. [eye-tracking analysis] The central claim that fixation alignment improves simulation fidelity rests on the assumption that per-user gaze patterns are stable and predictive of clicks. The eye-tracking analysis section must report quantitative evidence (intra- vs. inter-user variance, correlation coefficients with click labels, and statistical tests) rather than qualitative statements; without these numbers the predictive link remains unverified and load-bearing for the subsequent tuning results.
  2. [experiments] The tables or figures reporting click-prediction accuracy must include per-user or per-condition breakdowns, baseline comparisons with confidence intervals or p-values, and an ablation that isolates the contribution of the fixation-alignment objective from the soft-prompt parameterization alone. Without these, it is impossible to determine whether the reported gains are attributable to the proposed mechanism.
minor comments (2)
  1. [method] Notation for the three interpretability-based probing operators should be defined explicitly (e.g., equations for slot-level relevance distributions) so readers can reproduce the attention extraction step; a sketch of one candidate operator follows this list.
  2. [abstract] The abstract and introduction would benefit from a concise statement of dataset size (number of users, sessions, and items) and the exact VLM backbones used, to allow immediate assessment of scope.
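For concreteness, one member of the operator family minor comment 1 points at is attention rollout (Abnar and Zuidema 2020, reference [1]): propagate attention through the layers with the residual connection folded in as an identity term, then pool the resulting per-token relevance into carousel slots. The sketch below is an illustrative reconstruction, not the paper's exact operator; the slot-to-token mapping is assumed.

    import torch

    def attention_rollout(attentions: list) -> torch.Tensor:
        """Attention rollout (Abnar & Zuidema, 2020): multiply per-layer
        attention maps, adding the residual path as an identity term."""
        n = attentions[0].shape[-1]
        rollout = torch.eye(n)
        for A in attentions:                     # A: (heads, n, n), per layer
            A = A.mean(dim=0)                    # average over heads
            A = 0.5 * A + 0.5 * torch.eye(n)     # account for residuals
            A = A / A.sum(dim=-1, keepdim=True)  # keep rows stochastic
            rollout = A @ rollout
        return rollout                           # (n, n)

    def slot_relevance(rollout: torch.Tensor, query_idx: int, slot_token_ids):
        """Pool rolled-out attention from one query token into per-slot mass."""
        row = rollout[query_idx]
        scores = torch.stack([row[ids].sum() for ids in slot_token_ids])
        return scores / scores.sum()

    # Toy run: 3 layers, 4 heads, 10 tokens; slots own tokens {2,3},{4,5},{6,7}.
    torch.manual_seed(0)
    attns = [torch.rand(4, 10, 10).softmax(dim=-1) for _ in range(3)]
    R = attention_rollout(attns)
    print(slot_relevance(R, query_idx=9, slot_token_ids=[[2, 3], [4, 5], [6, 7]]))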

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive and detailed comments, which help strengthen the rigor of our claims regarding user gaze stability and the effectiveness of FixATE. We address each major point below and commit to incorporating the requested quantitative analyses and experimental details in the revised manuscript.

Point-by-point responses
  1. Referee: [eye-tracking analysis] The central claim that fixation alignment improves simulation fidelity rests on the assumption that per-user gaze patterns are stable and predictive of clicks. The eye-tracking analysis section must report quantitative evidence (intra- vs. inter-user variance, correlation coefficients with click labels, and statistical tests) rather than qualitative statements; without these numbers the predictive link remains unverified and load-bearing for the subsequent tuning results.

    Authors: We agree that the current qualitative description of stable gaze patterns is insufficient to fully substantiate the central assumption. In the revised version, the eye-tracking analysis section will be expanded to include: (1) intra-user versus inter-user variance metrics (e.g., standard deviation of fixation distributions within and across users), (2) correlation coefficients (Pearson and Spearman) between per-user fixation maps and binary click labels, and (3) statistical significance tests such as paired t-tests or ANOVA to confirm that intra-user consistency exceeds inter-user variability. These statistics will be computed directly from the collected carousel eye-tracking dataset and reported with exact values and p-values. revision: yes

  2. Referee: [experiments] Table or figure reporting click-prediction accuracy: the manuscript must include per-user or per-condition breakdowns, baseline comparisons with confidence intervals or p-values, and an ablation that isolates the contribution of the fixation-alignment objective versus the soft-prompt parameterization alone. Absent these, it is impossible to determine whether the reported gains are attributable to the proposed mechanism.

    Authors: We acknowledge that the existing aggregate results do not provide sufficient granularity. We will revise the experimental section to add: (1) per-user and per-condition breakdowns of click-prediction accuracy in new tables, (2) baseline comparisons (including all original baselines) augmented with 95% confidence intervals and p-values from appropriate statistical tests (e.g., McNemar's test or Wilcoxon signed-rank), and (3) a dedicated ablation study comparing the full FixATE objective against a soft-prompt-only variant without the fixation-alignment loss. These additions will be presented in updated figures and tables with clear labeling of conditions and statistical results. revision: yes
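Before the revision lands, the committed analyses can be made concrete in a few lines: intra- versus inter-user distance between fixation distributions, rank correlation between a slot's fixation mass and clicks, and a paired test on per-user accuracies. Everything below runs on fabricated data; the scipy/numpy calls are real, but all shapes, seeds, and effect sizes are assumptions.

    import numpy as np
    from scipy.spatial.distance import jensenshannon
    from scipy.stats import spearmanr, wilcoxon

    rng = np.random.default_rng(0)
    U, S, K = 20, 6, 5  # users, sessions per user, slots (illustrative)

    # Fabricated per-session fixation distributions over slots.
    fix = rng.dirichlet(np.ones(K), size=(U, S))  # shape (U, S, K)

    # (1) Stability: intra-user vs. inter-user Jensen-Shannon distance.
    intra = [jensenshannon(fix[u, i], fix[u, j])
             for u in range(U) for i in range(S) for j in range(i + 1, S)]
    inter = [jensenshannon(fix[u, 0], fix[v, 0])
             for u in range(U) for v in range(u + 1, U)]
    print(f"intra JSD {np.mean(intra):.3f} vs inter JSD {np.mean(inter):.3f}")

    # (2) Predictiveness: rank correlation between a slot's fixation mass and
    # a binary click indicator (clicks fabricated to correlate with fixation).
    clicks = (fix + 0.1 * rng.random((U, S, K))).argmax(axis=-1)
    click_onehot = np.eye(K)[clicks]
    rho, p = spearmanr(fix.ravel(), click_onehot.ravel())
    print(f"Spearman rho={rho:.3f} p={p:.2e}")

    # (3) Paired comparison of per-user click-prediction accuracy,
    # baseline vs. FixATE (Wilcoxon signed-rank, one-sided).
    acc_base = rng.uniform(0.3, 0.6, size=U)
    acc_fix = np.clip(acc_base + rng.normal(0.05, 0.03, size=U), 0.0, 1.0)
    stat, p = wilcoxon(acc_fix, acc_base, alternative="greater")
    print(f"Wilcoxon stat={stat:.1f} p={p:.4f}")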

Circularity Check

0 steps flagged

No significant circularity; claims rest on external data and empirical tests

Full rationale

The paper's derivation begins with analysis of an independent real-world eye-tracking dataset to establish stable per-user gaze patterns, then defines FixATE operationally by probing VLM attention via standard interpretability operators and learning soft prompts to align with those patterns. Experiments measure improvements on held-out data across three operators and two backbones, without any step that reduces a claimed prediction to a fitted input by construction, invokes self-citations as load-bearing uniqueness theorems, or renames known results. The chain is self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 0 invented entities

The claim rests primarily on a domain assumption about stable gaze patterns and the empirical utility of soft prompt tuning, with minimal additional free parameters or invented entities described.

free parameters (1)
  • personalized soft prompts
    User-specific parameters learned to steer VLM attention toward individual fixation patterns.
axioms (1)
  • domain assumption: Users exhibit stable individual gaze patterns strongly predictive of click behavior.
    Stated as a finding from analysis of a real-world eye-tracking dataset in carousel-based recommendations.

pith-pipeline@v0.9.0 · 5517 in / 1273 out tokens · 90241 ms · 2026-05-10T16:18:12.727389+00:00 · methodology


Reference graph

Works this paper leans on

53 extracted references · 37 canonical work pages · 3 internal anchors

  1. [1] Samira Abnar and Willem Zuidema. 2020. Quantifying Attention Flow in Transformers. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, Dan Jurafsky, Joyce Chai, Natalie Schluter, and Joel Tetreault (Eds.). Association for Computational Linguistics, Online, 4190–4197. doi:10.18653/v1/2020.acl-main.385

  2. [2] Reduan Achtibat, Sayed Mohammad Vakilzadeh Hatefi, Maximilian Dreyer, Aakriti Jain, Thomas Wiegand, Sebastian Lapuschkin, and Wojciech Samek. 2024. AttnLRP: Attention-Aware Layer-Wise Relevance Propagation for Transformers. In Proceedings of the 41st International Conference on Machine Learning (ICML ’24, Vol. 235). JMLR.org, Vienna, Austria, 135–168.

  3. [3] Shuai Bai, Yuxuan Cai, Ruizhe Chen, Keqin Chen, Xionghui Chen, Zesen Cheng, Lianghao Deng, Wei Ding, Chang Gao, Chunjiang Ge, Wenbin Ge, Zhifang Guo, Qidong Huang, Jie Huang, Fei Huang, Binyuan Hui, Shutong Jiang, Zhaohai Li, Mingsheng Li, Mei Li, Kaixin Li, Zicheng Lin, Junyang Lin, Xuejing Liu, Jiawei Liu, Chenglong Liu, Yang Liu, Dayiheng Liu, Shixuan …

  4. [4] Nicolas Bougie and Narimawa Watanabe. 2025. SimUSER: Simulating User Behavior with Large Language Models for Recommender System Evaluation. In Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 6: Industry Track), Georg Rehm and Yunyao Li (Eds.). Association for Computational Linguistics, Vienna, Austria, 4…

  5. [5] Mon Chu Chen, John R. Anderson, and Myeong Ho Sohn. 2001. What Can a Mouse Cursor Tell Us More? Correlation of Eye/Mouse Movements on Web Browsing. In CHI ’01 Extended Abstracts on Human Factors in Computing Systems (CHI EA ’01). Association for Computing Machinery, New York, NY, USA, 281–282. doi:10.1145/634067.634234

  6. [6] Yue Chen, Susen Yang, Tong Zhang, Chao Wang, Mingyue Cheng, Chenyi Lei, and Han Li. 2025. Lasso: Large Language Model-based User Simulator for Cross-Domain Recommendation. In Proceedings of the Nineteenth ACM Conference on Recommender Systems. ACM, Prague, Czech Republic, 207–216. doi:10.1145/3705328.3748048

  7. [7] Santiago de Leon-Martinez, Jingwei Kang, Robert Moro, Maarten de Rijke, Branislav Kveton, Harrie Oosterhuis, and Maria Bielikova. 2025. RecGaze: The First Eye Tracking and User Interaction Dataset for Carousel Interfaces. In Proceedings of the 48th International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR ’25). Asso…

  8. [8] Santiago de Leon-Martinez, Robert Moro, Branislav Kveton, and Maria Bielikova.

  9. [9] Riding the Carousel: The First Extensive Eye Tracking Analysis of Browsing Behavior in Carousel Recommenders. In Proceedings of the 31st International Conference on Intelligent User Interfaces (IUI ’26). Association for Computing Machinery, New York, NY, USA, 2120–2130. doi:10.1145/3742413.3789166

  10. [10] Pablo Villanueva González, Cristobal Subiabre Cuevas, Lino Jeldez, Benjamin Torrealba Troncoso, María Flavia Guiñazú, and Juan D. Velásquez. 2025. A Gender-Aware Saliency Prediction System for Web Interfaces Using Deep Learning and Eye-Tracking Data. Brain Informatics 12, 1 (Oct. 2025), 25. doi:10.1186/s40708-025-00274-x

  11. [11] Jeff Huang, Ryen White, and Georg Buscher. 2012. User See, User Point: Gaze and Cursor Alignment in Web Search. In Proceedings of the SIGCHI Conference on Human Factors in Computing Systems (CHI ’12). Association for Computing Machinery, New York, NY, USA, 1341–1350. doi:10.1145/2207676.2208591

  12. [12] Yanbiao Ji, Dan Luo, Chang Liu, Shaokai Wu, Jing Tong, Qichen He, Deyi Ji, Hongtao Lu, and Yue Ding. 2025. Generating Negative Samples for Multi-Modal Recommendation. In Proceedings of the 33rd ACM International Conference on Multimedia (MM ’25). Association for Computing Machinery, New York, NY, USA, 6007–6016. doi:10.1145/3746027.3754977

  13. [13] Kayhan Latifzadeh, Jacek Gwizdka, and Luis A. Leiva. 2025. A Versatile Dataset of Mouse and Eye Movements on Search Engine Results Pages. In Proceedings of the 48th International ACM SIGIR Conference on Research and Development in Information Retrieval. ACM, Padua, Italy, 3412–3421. doi:10.1145/3726302.3730325

  14. [14] Dirk Lewandowski and Yvonne Kammerer. 2021. Factors Influencing Viewing Behaviour on Search Engine Results Pages: A Review of Eye-Tracking Research. Behaviour & Information Technology 40, 14 (Oct. 2021), 1485–1515. doi:10.1080/0144929X.2020.1761450

  15. [15] Feiran Liu, Yuzhe Zhang, Xinyi Huang, Yinan Peng, Xinfeng Li, Lixu Wang, Yutong Shen, Ranjie Duan, Simeng Qin, Xiaojun Jia, Qingsong Wen, and Wei Dong. 2025. The Eye of Sherlock Holmes: Uncovering User Private Attribute Profiling via Vision-Language Model Agentic Framework. In Proceedings of the 33rd ACM International Conference on Multimedia (MM ’25). Ass…

  16. [16] Hongyang Liu, Zhu Sun, Tianjun Wei, Yan Wang, Jiajie Zhu, and Xinghua Qu.

  17. [17] Diagnostic-Guided Dynamic Profile Optimization for LLM-based User Simulators in Sequential Recommendation. Proceedings of the AAAI Conference on Artificial Intelligence 40, 18 (March 2026), 15306–15314. doi:10.1609/aaai.v40i18.38556

  18. [18] Nelson F. Liu, Kevin Lin, John Hewitt, Ashwin Paranjape, Michele Bevilacqua, Fabio Petroni, and Percy Liang. 2024. Lost in the Middle: How Language Models Use Long Contexts. Transactions of the Association for Computational Linguistics 12 (2024), 157–173. doi:10.1162/tacl_a_00638

  19. [19] Angela Lopez-Cardona, Carlos Segura, Alexandros Karatzoglou, Sergi Abadal, and Ioannis Arapakis. 2024. Seeing Eye to AI: Human Alignment via Gaze-Based Response Rewards for Large Language Models. In The Thirteenth International Conference on Learning Representations.

  20. [20] Yunshan Ma, Yingzhi He, Wenjun Zhong, Xiang Wang, Roger Zimmermann, and Tat-Seng Chua. 2024. CIRP: Cross-Item Relational Pre-training for Multimodal Product Bundling. In Proceedings of the 32nd ACM International Conference on Multimedia (MM ’24). Association for Computing Machinery, New York, NY, USA, 9641–9649. doi:10.1145/3664647.3681349

  21. [21] Fan’an Meng, Chaoran Cui, Hongjun Dai, and Shuai Gong. 2025. Black-Box Test-Time Prompt Tuning for Vision-Language Models. Proceedings of the AAAI Conference on Artificial Intelligence 39, 6 (April 2025), 6099–6107. doi:10.1609/aaai.v39i6.32652

  22. [22] Anupam Pani and Yanchao Yang. 2025. Gaze-VLM: Bridging Gaze and VLMs through Attention Regularization for Egocentric Understanding. In The Thirty-ninth Annual Conference on Neural Information Processing Systems.

  23. [23] Qiyao Peng, Hongtao Liu, Hua Huang, Jian Yang, Qing Yang, and Minglai Shao. 2025. A Survey on LLM-powered Agents for Recommender Systems. In Findings of the Association for Computational Linguistics: EMNLP 2025, Christos Christodoulopoulos, Tanmoy Chakraborty, Carolyn Rose, and Violet Peng (Eds.). Association for Computational Linguistics, Suzhou, China, …

  24. [24] Jella Pfeiffer, Thies Pfeiffer, Martin Meißner, and Elisa Weiß. 2020. Eye-Tracking-Based Classification of Information Search Behavior Using Machine Learning: Evidence from Experiments in Physical Shops and Virtual Reality Shopping Environments. Information Systems Research 31, 3 (Sept. …

  25. [25] Guanxi Shen. 2025. GLIMPSE: Holistic Cross-Modal Explainability for Large Vision-Language Models. arXiv:2506.18985 [cs] doi:10.48550/arXiv.2506.18985

  26. [26] Florian Strohm, Mihai Bâce, and Andreas Bulling. 2024. Learning User Embeddings from Human Gaze for Personalised Saliency Prediction. Proc. ACM Hum.-Comput. Interact. 8, ETRA (May 2024), 229:1–229:16. doi:10.1145/3655603

  27. [27] Guus van Loon, Felix Hermsen, and Marnix Naber. 2022. Predicting Product Preferences on Retailers’ Web Shops through Measurement of Gaze and Pupil Size Dynamics. Journal of Cognition 5, 1 (Oct. 2022). doi:10.5334/joc.240

  28. [28] David Wan, Jaemin Cho, Elias Stengel-Eskin, and Mohit Bansal. 2024. Contrastive Region Guidance: Improving Grounding in Vision-Language Models Without Training. In Computer Vision – ECCV 2024: 18th European Conference, Milan, Italy, September 29–October 4, 2024, Proceedings, Part LXXIX. Springer-Verlag, Berlin, Heidelberg, 198–215. doi:10.1007/978-3-031-72986-7_12

  29. [29] Lei Wang, Jingsen Zhang, Hao Yang, Zhi-Yuan Chen, Jiakai Tang, Zeyu Zhang, Xu Chen, Yankai Lin, Hao Sun, Ruihua Song, Xin Zhao, Jun Xu, Zhicheng Dou, Jun Wang, and Ji-Rong Wen. 2025. User Behavior Simulation with Large Language Model-based Agents. ACM Trans. Inf. Syst. 43, 2 (Jan. 2025), 55:1–55:37. doi:10.1145/3708985

  30. [30] Weiyun Wang, Zhangwei Gao, Lixin Gu, Hengjun Pu, Long Cui, Xingguang Wei, Zhaoyang Liu, Linglin Jing, Shenglong Ye, Jie Shao, Zhaokai Wang, Zhe Chen, Hongjie Zhang, Ganlin Yang, Haomin Wang, Qi Wei, Jinhui Yin, Wenhao Li, Erfei Cui, Guanzhou Chen, Zichen Ding, Changyao Tian, Zhenyu Wu, Jingjing Xie, Zehao Li, Bowen Yang, Yuchen Duan, Xuehui Wang, Zhi Hou, …

  31. [31] InternVL3.5: Advancing Open-Source Multimodal Models in Versatility, Reasoning, and Efficiency. arXiv:2508.18265 [cs] doi:10.48550/arXiv.2508.18265

  32. [32] Zora Zhiruo Wang, Jiayuan Mao, Daniel Fried, and Graham Neubig. 2025. Agent Workflow Memory. In Forty-Second International Conference on Machine Learning.

  33. [33] Tianjun Wei, Huizhong Guo, Yingpeng Du, Zhu Sun, Huang Chen, Dongxia Wang, and Jie Zhang. 2025. Mirroring Users: Towards Building Preference-aligned User Simulator with User Feedback in Recommendation. doi:10.48550/arXiv.2508.18142

  34. [34] Haotian Wu, Yingpeng Du, Tianjun Wei, Puay Siew Tan, Jie Zhang, Ong Yew Soon, and Zhu Sun. 2026. Efficient Large Language Models for Recommendation: A Survey. (2026).

  35. [35] Zihao Wu, Xin Wang, Heng Chang, Hong Chen, Lifeng Sun, and Wenwu Zhu.

  36. [36] Aligning Large Multimodal Model with Sequential Recommendation via Content-Behavior Guidance. In Proceedings of the 2025 International Conference on Multimedia Retrieval (ICMR ’25). Association for Computing Machinery, New York, NY, USA, 1507–1516. doi:10.1145/3731715.3733273

  37. [37] Yanyu Xu, Shenghua Gao, Junru Wu, Nianyi Li, and Jingyi Yu. 2019. Personalized Saliency and Its Prediction. IEEE Transactions on Pattern Analysis and Machine Intelligence 41, 12 (Dec. 2019), 2975–2989. doi:10.1109/TPAMI.2018.2866563

  38. [38] Yuki Yada, Sho Akiyama, Ryo Watanabe, Yuta Ueno, Yusuke Shido, and Andre Rusli. 2025. Improving Visual Recommendation on E-commerce Platforms Using Vision-Language Models. In Proceedings of the Nineteenth ACM Conference on Recommender Systems (RecSys ’25). Association for Computing Machinery, New York, NY, USA, 975–978. doi:10.1145/3705328.3748128

  39. [39] Kun Yan, Zeyu Wang, Lei Ji, Yuntao Wang, Nan Duan, and Shuai Ma. 2024. Voila-A: Aligning Vision-Language Models with User’s Gaze Attention. In Proceedings of the 38th International Conference on Neural Information Processing Systems (NIPS ’24, Vol. 37). Curran Associates Inc., Red Hook, NY, USA, 1890–1918.

  40. [40] Yingrui Yang, Yifan Qiao, Shanxiu He, and Tao Yang. 2024. Weighted KL-Divergence for Document Ranking Model Refinement. In Proceedings of the 47th International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR ’24). Association for Computing Machinery, New York, NY, USA, 2698–2702. doi:10.1145/3626772.3657946

  41. [41] Runpeng Yu, Weihao Yu, and Xinchao Wang. 2024. Attention Prompting on Image for Large Vision-Language Models. In Computer Vision – ECCV 2024: 18th European Conference, Milan, Italy, September 29–October 4, 2024, Proceedings, Part XXX. Springer-Verlag, Berlin, Heidelberg, 251–268. doi:10.1007/978-3-031-73404-5_15

  42. [42] Tong Yu, Yilin Shen, Ruiyi Zhang, Xiangyu Zeng, and Hongxia Jin. 2019. Vision-Language Recommendation via Attribute Augmented Multimodal Reinforcement Learning. In Proceedings of the 27th ACM International Conference on Multimedia (MM ’19). Association for Computing Machinery, New York, NY, USA, 39–47. doi:10.1145/3343031.3350935

  43. [43] An Zhang, Yuxin Chen, Leheng Sheng, Xiang Wang, and Tat-Seng Chua. 2024. On Generative Agents in Recommendation. In Proceedings of the 47th International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR ’24). Association for Computing Machinery, New York, NY, USA, 1807–

  44. [44] doi:10.1145/3626772.3657844

  45. [45] Zeyu Zhang, Quanyu Dai, Xiaohe Bo, Chen Ma, Rui Li, Xu Chen, Jieming Zhu, Zhenhua Dong, and Ji-Rong Wen. 2025. A Survey on the Memory Mechanism of Large Language Model-based Agents. ACM Trans. Inf. Syst. 43, 6 (Sept. 2025), 155:1–155:47. doi:10.1145/3748302

  46. [46] Zijian Zhang, Shuchang Liu, Ziru Liu, Rui Zhong, Qingpeng Cai, Xiangyu Zhao, Chunxu Zhang, Qidong Liu, and Peng Jiang. 2025. LLM-powered User Simulator for Recommender System. In Proceedings of the Thirty-Ninth AAAI Conference on Artificial Intelligence and Thirty-Seventh Conference on Innovative Applications of Artificial Intelligence and Fifteenth Sympos…

  47. [47] Zheng Zhang, Nuoqian Xiao, Qi Chai, Deheng Ye, and Hao Wang. 2025. MultiMind: Enhancing Werewolf Agents with Multimodal Reasoning and Theory of Mind. In Proceedings of the 33rd ACM International Conference on Multimedia (MM ’25). Association for Computing Machinery, New York, NY, USA, 5824–5833. doi:10.1145/3746027.3755752

  48. [48] Lianmin Zheng, Wei-Lin Chiang, Ying Sheng, Siyuan Zhuang, Zhanghao Wu, Yonghao Zhuang, Zi Lin, Zhuohan Li, Dacheng Li, Eric P. Xing, Hao Zhang, Joseph E. Gonzalez, and Ion Stoica. 2023. Judging LLM-as-a-judge with MT-Bench and Chatbot Arena. In Proceedings of the 37th International Conference on Neural Information Processing Systems (NIPS ’23). Curran Ass…

  49. [49] Hanzhang Zhou, Xu Zhang, Panrong Tong, Jianan Zhang, Liangyu Chen, Quyu Kong, Chenglin Cai, Chen Liu, Yue Wang, Jingren Zhou, and Steven Hoi.

  50. [50] MAI-UI Technical Report: Real-World Centric Foundation GUI Agents. arXiv:2512.22047 [cs] doi:10.48550/arXiv.2512.22047

  51. [51] Shuyan Zhou, Frank F. Xu, Hao Zhu, Xuhui Zhou, Robert Lo, Abishek Sridhar, Xianyi Cheng, Tianyue Ou, Yonatan Bisk, Daniel Fried, Uri Alon, and Graham Neubig. 2023. WebArena: A Realistic Web Environment for Building Autonomous Agents. In The Twelfth International Conference on Learning Representations.

  52. [52] Kangyu Zhu, Ziyuan Qin, Huahui Yi, Zekun Jiang, Qicheng Lao, Shaoting Zhang, and Kang Li. 2025. Guiding Medical Vision-Language Models with Diverse Visual Prompts: Framework Design and Comprehensive Exploration of Prompt Variations. In Proceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguisti…

  53. [53] Lixi Zhu, Xiaowen Huang, and Jitao Sang. 2024. How Reliable Is Your Simulator? Analysis on the Limitations of Current LLM-based User Simulators for Conversational Recommendation. In Companion Proceedings of the ACM Web Conference 2024 (WWW ’24). Association for Computing Machinery, New York, NY, USA, 1726–1732. doi:10.1145/3589335.3651955