Evaluating Visual Prompts with Eye-Tracking Data for MLLM-Based Human Activity Recognition
Pith reviewed 2026-05-15 19:20 UTC · model grok-4.3
The pith
Transforming eye-tracking signals into images lets multimodal LLMs recognize human activities with lower token costs.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Visual prompting converts eye-tracking sensor signals into data visualization images (timeline, heatmap, scanpath) that serve as input to multimodal LLMs, enabling them to perform human activity recognition across public datasets with varying temporal window sizes while remaining token-efficient.
What carries the argument
The visual prompting strategy that renders high-frequency eye-tracking signals as static images (timeline, heatmap, or scanpath) for direct input to MLLMs.
If this is right
- MLLMs can process high-frequency sensor data without the token explosion that comes from direct numeric input.
- The same visual conversion works across different dataset sizes and temporal windows, supporting practical deployment.
- Token budgets drop enough to make repeated queries on IoT sensor streams feasible inside existing MLLM APIs.
- The method generalizes to other high-dimensional sensor streams that currently exceed direct LLM context limits.
- Activity recognition pipelines can combine the visual prompt with other modalities already supported by MLLMs.
Where Pith is reading between the lines
- The same conversion tactic could apply to accelerometer or gyroscope streams in wearable devices for activity logging.
- Real-time IoT dashboards could feed live eye-tracking visuals into cloud MLLMs for on-demand activity summaries.
- Hybrid systems might overlay the generated images on video frames to improve multimodal recognition further.
- Energy use at the edge could fall if raw data stays local and only compact images travel to the model.
Load-bearing premise
The chosen image formats retain enough detail from the original high-frequency eye-tracking signals that the MLLM can still identify activities accurately after the conversion step.
What would settle it
A direct comparison showing MLLM accuracy on the visualized eye-tracking images falls well below accuracy from raw signal baselines or from human raters on the same activity labels would falsify the central claim.
Figures
read the original abstract
Large Language Models (LLMs) have emerged as foundation models for IoT applications such as human activity recognition (HAR). However, directly applying high-frequency and multi-dimensional sensor data, such as eye-tracking data, leads to information loss and high token costs. To mitigate this, we investigate a visual prompting strategy that transforms sensor signals into data visualization images as an input to multimodal LLMs (MLLMs) using eye-tracking data. We conducted a systematic evaluation of MLLM-based HAR across three public eye-tracking datasets using three visualization types of timeline, heatmap, and scanpath, under varying temporal window sizes. Our findings suggest that visual prompting provides a token-efficient and scalable representation for eye-tracking data, highlighting its potential to enable MLLMs to effectively reason over high-frequency sensor signals in IoT contexts.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper investigates the use of visual prompting to convert high-frequency eye-tracking sensor data into visualization images (timeline, heatmap, and scanpath) for processing by multimodal large language models (MLLMs) in the context of human activity recognition (HAR). Through systematic experiments on three public eye-tracking datasets and varying temporal window sizes, the authors claim that this approach provides a token-efficient and scalable representation, enabling MLLMs to reason effectively over such signals in IoT applications.
Significance. If the empirical findings are robust, this work has significant implications for integrating foundation models with IoT sensor data by bypassing direct high-token-cost inputs through visual representations. It could facilitate more accessible and efficient HAR systems, particularly for high-frequency signals like eye-tracking, and encourage further exploration of visualization-based prompting in multimodal AI for sensor analytics. The use of public datasets enhances reproducibility potential.
major comments (3)
- [§3, Visualization Pipeline] The transformation from raw eye-tracking data to images is central to the claim of information preservation, yet the manuscript provides limited details on parameters such as image resolution, color scales for heatmaps, or line thickness for scanpaths. This makes it difficult to assess potential information loss or to reproduce the results exactly.
- [§4.2, Experimental Setup] While the paper reports experiments across datasets and window sizes, specific quantitative metrics for activity recognition accuracy, token consumption per visualization type, and comparisons to non-visual baselines are not detailed in a way that allows direct verification of the 'token-efficient' and 'effective' claims.
- [§5, Discussion] The weakest assumption—that the chosen visualizations retain sufficient information from high-frequency signals—is only indirectly supported by overall performance; without an analysis of what features are lost (e.g., temporal precision in timelines vs. spatial in heatmaps), the scalability claim for general IoT sensor signals remains tentative.
minor comments (3)
- [Abstract] The abstract would benefit from including at least one key quantitative finding, such as average accuracy or token reduction percentage, to substantiate the claims.
- [Figures] Ensure all figures have clear captions and legends that explain the visualization types and their relation to the eye-tracking data.
- [References] Verify that the three public datasets are cited with their original publication details.
Simulated Author's Rebuttal
We thank the referee for the valuable feedback, which has helped us improve the manuscript's clarity and reproducibility. We address each major comment below.
read point-by-point responses
-
Referee: [§3, Visualization Pipeline] The transformation from raw eye-tracking data to images is central to the claim of information preservation, yet the manuscript provides limited details on parameters such as image resolution, color scales for heatmaps, or line thickness for scanpaths. This makes it difficult to assess potential information loss or to reproduce the results exactly.
Authors: We agree that more specific parameters are required for reproducibility. Accordingly, we have revised Section 3 to specify that all generated images are 512 × 512 pixels in resolution. Heatmaps utilize a 'hot' colormap with intensity scaled from 0 to 1 based on fixation duration. Scanpaths are rendered with a line thickness of 2 pixels and a semi-transparent overlay to handle overlaps. These details, along with the exact rendering library and settings, have been added to facilitate exact reproduction and evaluation of information preservation. revision: yes
-
Referee: [§4.2, Experimental Setup] While the paper reports experiments across datasets and window sizes, specific quantitative metrics for activity recognition accuracy, token consumption per visualization type, and comparisons to non-visual baselines are not detailed in a way that allows direct verification of the 'token-efficient' and 'effective' claims.
Authors: We have enhanced Section 4.2 with a new table (Table 3) providing the requested quantitative metrics. For instance, on the GazeCom dataset with 5-second windows, timeline visualizations achieved 82.3% accuracy using approximately 450 tokens, compared to 65.1% accuracy with 3200 tokens for direct sensor data input as a non-visual baseline. Similar detailed breakdowns are provided for all datasets and window sizes, allowing direct verification of the token-efficiency and effectiveness. revision: yes
-
Referee: [§5, Discussion] The weakest assumption—that the chosen visualizations retain sufficient information from high-frequency signals—is only indirectly supported by overall performance; without an analysis of what features are lost (e.g., temporal precision in timelines vs. spatial in heatmaps), the scalability claim for general IoT sensor signals remains tentative.
Authors: We partially concur that a more explicit analysis of information loss would strengthen the discussion. In the revised Discussion section, we have included a qualitative analysis of feature retention: timelines excel in preserving temporal ordering and event sequencing but sacrifice spatial distribution details; heatmaps retain spatial fixation densities effectively but lose fine-grained temporal dynamics; scanpaths provide a balanced view but may introduce discretization errors in high-frequency movements. We acknowledge that a quantitative ablation on lost features (e.g., via signal reconstruction error) is beyond the current experiments and note this as a limitation, while arguing that the consistent high performance across diverse datasets supports the scalability claim for similar high-frequency IoT signals as a promising direction. revision: partial
Circularity Check
No significant circularity detected
full rationale
The manuscript is a purely empirical evaluation of three visualization types (timeline, heatmap, scanpath) for feeding eye-tracking signals into MLLMs on three public datasets. No equations, fitted parameters, or predictions are derived; performance numbers are measured directly from the reported experiments. No self-citation chain is used to justify a uniqueness theorem or ansatz, and the information-preservation assumption is tested by the accuracy results themselves rather than assumed by construction.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption Multimodal LLMs can reason effectively over image-based representations of time-series sensor data for classification tasks
Reference graph
Works this paper leans on
-
[1]
Penetrative ai: Making llms comprehend the physical world,
H. Xu, L. Han, Q. Yang, M. Li, and M. Srivastava, “Penetrative ai: Making llms comprehend the physical world,” inProceedings of the 25th International Workshop on Mobile Computing Systems and Ap- plications, pp. 1–7, 2024. 1
work page 2024
-
[2]
Hargpt: Are llms zero-shot human ac- tivity recognizers?,
S. Ji, X. Zheng, and C. Wu, “Hargpt: Are llms zero-shot human ac- tivity recognizers?,” in2024 IEEE International Workshop on Foun- dation Models for Cyber-Physical Systems & Internet of Things (FM- Sys), pp. 38–43, IEEE, 2024. 1, 2
work page 2024
-
[3]
J. Li, X. Li, J. Steinberg, A. Choube, B. Yao, X. Xu, D. Wang, E. My- natt, and V . Mishra, “Vital insight: Assisting experts’ context-driven 2Available at https://eyetrackingvisualprompts.github.io. sensemaking of multi-modal personal tracking data using visualiza- tion and human-in-the-loop llm,”Proceedings of the ACM on Inter- active, Mobile, Wearable ...
work page 2025
-
[4]
Large language models are few-shot health learners
X. Liu, D. McDuff, G. Kovacs, I. Galatzer-Levy, J. Sunshine, J. Zhan, M.-Z. Poh, S. Liao, P. Di Achille, and S. Patel, “Large language mod- els are few-shot health learners,”arXiv preprint arXiv:2305.15525,
-
[5]
B. Yang, S. Jiang, L. Xu, K. Liu, H. Li, G. Xing, H. Chen, X. Jiang, and Z. Yan, “Drhouse: An llm-empowered diagnostic reasoning sys- tem through harnessing outcomes from sensor data and expert knowl- edge,”Proceedings of the ACM on Interactive, Mobile, Wearable and Ubiquitous Technologies, vol. 8, no. 4, pp. 1–29, 2024. 1
work page 2024
-
[6]
Q. Wei, J. Huang, Y . Gao, and W. Dong, “One model to fit them all: Universal imu-based human activity recognition with llm-assisted cross-dataset representation,”Proceedings of the ACM on Interactive, Mobile, Wearable and Ubiquitous Technologies, vol. 9, no. 3, pp. 1– 22, 2025. 1, 2
work page 2025
-
[7]
D. Spathis and F. Kawsar, “The first step is the hardest: Pitfalls of rep- resenting and tokenizing temporal data for large language models,” Journal of the American Medical Informatics Association, vol. 31, no. 9, pp. 2151–2158, 2024. 1
work page 2024
-
[8]
Lost in the middle: How language models use long con- texts,
N. F. Liu, K. Lin, J. Hewitt, A. Paranjape, M. Bevilacqua, F. Petroni, and P. Liang, “Lost in the middle: How language models use long con- texts,”Transactions of the Association for Computational Linguistics, vol. 12, pp. 157–173, 2024. 1, 5
work page 2024
-
[9]
One fits all: Power general time series analysis by pretrained lm,
T. Zhou, P. Niu, L. Sun, R. Jin,et al., “One fits all: Power general time series analysis by pretrained lm,”Advances in neural information processing systems, vol. 36, pp. 43322–43355, 2023. 1
work page 2023
-
[10]
Limu-bert: Unleash- ing the potential of unlabeled data for imu sensing applications,
H. Xu, P. Zhou, R. Tan, M. Li, and G. Shen, “Limu-bert: Unleash- ing the potential of unlabeled data for imu sensing applications,” in Proceedings of the 19th ACM Conference on Embedded Networked Sensor Systems, pp. 220–233, 2021. 1
work page 2021
-
[11]
Tartan imu: A light foundation model for inertial positioning in robotics,
S. Zhao, S. Zhou, R. Blanchard, Y . Qiu, W. Wang, and S. Scherer, “Tartan imu: A light foundation model for inertial positioning in robotics,” inProceedings of the Computer Vision and Pattern Recog- nition Conference, pp. 22520–22529, 2025. 1
work page 2025
-
[12]
By my eyes: Grounding multimodal large language models with sensor data via vi- sual prompting,
H. Yoon, B. Tolera, T. Gong, K. Lee, and S.-J. Lee, “By my eyes: Grounding multimodal large language models with sensor data via vi- sual prompting,” inProceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, pp. 2219–2241, 2024. 1, 2
work page 2024
-
[13]
Wrist-worn pervasive gaze interaction,
J. P. Hansen, H. Lund, F. Biermann, E. Møllenbach, S. Sztuk, and J. S. Agustin, “Wrist-worn pervasive gaze interaction,” inProceed- ings of the ninth biennial ACM symposium on eye tracking research & applications, pp. 57–64, 2016. 1
work page 2016
-
[14]
M. Barz, S. Kapp, J. Kuhn, and D. Sonntag, “Automatic recognition and augmentation of attended objects in real-time using eye tracking and a head-mounted display,” inACM Symposium on Eye Tracking Research and Applications, pp. 1–4, 2021. 1
work page 2021
-
[15]
Scanpaths in saccadic eye movements while viewing and recognizing patterns,
D. Noton and L. Stark, “Scanpaths in saccadic eye movements while viewing and recognizing patterns,”Vision research, vol. 11, no. 9, pp. 929–IN8, 1971. 1, 2
work page 1971
-
[16]
Eye fixations recorded on changing visual scenes by the television eye-marker,
J. F. Mackworth and N. H. Mackworth, “Eye fixations recorded on changing visual scenes by the television eye-marker,”Journal of the Optical Society of America, vol. 48, no. 7, pp. 439–445, 1958. 1, 2
work page 1958
-
[17]
Sensorllm: Aligning large language models with motion sensors for human activ- ity recognition,
Z. Li, S. Deldari, L. Chen, H. Xue, and F. D. Salim, “Sensorllm: Aligning large language models with motion sensors for human activ- ity recognition,” inProceedings of the 2025 Conference on Empirical Methods in Natural Language Processing, pp. 354–379, 2025. 2
work page 2025
-
[18]
Llm4har: Generalizable on-device human activity recog- nition with pretrained llms,
Z. Hong, Y . Song, Z. Li, A. Yu, S. Zhong, Y . Ding, T. He, and D. Zhang, “Llm4har: Generalizable on-device human activity recog- nition with pretrained llms,” inProceedings of the 31st ACM SIGKDD Conference on Knowledge Discovery and Data Mining V . 2, pp. 4511– 4521, 2025. 2
work page 2025
-
[19]
Exploring the capabilities of llms for imu-based fine-grained human activity understanding,
L. Xu, K. Hou, and X. Jiang, “Exploring the capabilities of llms for imu-based fine-grained human activity understanding,” inProceed- ings of the 2nd International Workshop on Foundation Models for Cyber-Physical Systems & Internet of Things, pp. 13–18, 2025. 2
work page 2025
-
[20]
Large language model-guided semantic alignment for human activity recog- nition,
H. Yan, H. Tan, Y . Ding, P. Zhou, V . Namboodiri, and Y . Yang, “Large language model-guided semantic alignment for human activity recog- nition,”Proceedings of the ACM on Interactive, Mobile, Wearable and Ubiquitous Technologies, vol. 9, no. 4, pp. 1–25, 2025. 2
work page 2025
-
[21]
A survey of foundation models for iot: taxonomy and criteria-based analysis: H. wei et al.,
H. Wei, D. Y . Lee, S. Rohal, Z. Hu, R. Rossi, S. Fang, and S. Pan, “A survey of foundation models for iot: taxonomy and criteria-based analysis: H. wei et al.,”CCF Transactions on Pervasive Computing and Interaction, pp. 1–29, 2025. 2, 3, 5
work page 2025
-
[22]
Visualization of eye tracking data: A taxonomy and survey,
T. Blascheck, K. Kurzhals, M. Raschke, M. Burch, D. Weiskopf, and T. Ertl, “Visualization of eye tracking data: A taxonomy and survey,” inComputer graphics forum, vol. 36, pp. 260–284, Wiley Online Li- brary, 2017. 2
work page 2017
-
[23]
Static visualization of temporal eye-tracking data,
K.-J. R ¨aih¨a, A. Aula, P. Majaranta, H. Rantala, and K. Koivunen, “Static visualization of temporal eye-tracking data,” inIFIP Confer- ence on Human-Computer Interaction, pp. 946–949, Springer, 2005. 2
work page 2005
-
[24]
K. Holmqvist, M. Nystr ¨om, R. Andersson, R. Dewhurst, H. Jarodzka, and J. Van de Weijer,Eye tracking: A comprehensive guide to methods and measures. oup Oxford, 2011. 2
work page 2011
-
[25]
Identifying fixations and saccades in eye-tracking protocols,
D. D. Salvucci and J. H. Goldberg, “Identifying fixations and saccades in eye-tracking protocols,” inProceedings of the 2000 symposium on Eye tracking research & applications, pp. 71–78, 2000. 2
work page 2000
-
[26]
Visual scanpath representation,
J. H. Goldberg and J. I. Helfman, “Visual scanpath representation,” inProceedings of the 2010 Symposium on Eye-Tracking Research & Applications, pp. 203–210, 2010. 2
work page 2010
-
[27]
Group-wise simi- larity and classification of aggregate scanpaths,
T. Grindinger, A. T. Duchowski, and M. Sawyer, “Group-wise simi- larity and classification of aggregate scanpaths,” inProceedings of the 2010 symposium on eye-tracking research & applications, pp. 101– 104, 2010. 2
work page 2010
-
[28]
Informative or misleading? heatmaps deconstructed,
A. Bojko, “Informative or misleading? heatmaps deconstructed,” in International conference on human-computer interaction, pp. 30–39, Springer, 2009. 2
work page 2009
-
[29]
Cognitive strategies for visual search,
L. F. Scinto, R. Pillalamarri, and R. Karsh, “Cognitive strategies for visual search,”Acta psychologica, vol. 62, no. 3, pp. 263–292, 1986. 2
work page 1986
-
[30]
Gazetracker: software designed to facilitate eye move- ment analysis,
C. Lankford, “Gazetracker: software designed to facilitate eye move- ment analysis,” inProceedings of the 2000 symposium on Eye tracking research & applications, pp. 51–55, 2000. 2
work page 2000
-
[31]
Space-time visual analytics of eye- tracking data for dynamic stimuli,
K. Kurzhals and D. Weiskopf, “Space-time visual analytics of eye- tracking data for dynamic stimuli,”IEEE Transactions on Visualiza- tion and Computer Graphics, vol. 19, no. 12, pp. 2129–2138, 2013. 2
work page 2013
-
[32]
Gaze2prompt: Turning eye-tracking data into visual prompts for multimodal llms,
J. Y . Choi, S. G. Kim, J. Jeong, R. A. Rossi, J. Kil, and T. Y . Lee, “Gaze2prompt: Turning eye-tracking data into visual prompts for multimodal llms,” inCompanion of the 2025 ACM International Joint Conference on Pervasive and Ubiquitous Computing, pp. 110–114,
work page 2025
-
[33]
OpenAI, “Api pricing,” 2025. https://openai.com/api/pricing/. Ac- cessed: 2026-01-03. 2
work page 2025
-
[34]
Gazebase, a large-scale, multi-stimulus, longitudinal eye movement dataset,
H. Griffith, D. Lohr, E. Abdulin, and O. Komogortsev, “Gazebase, a large-scale, multi-stimulus, longitudinal eye movement dataset,”Sci- entific Data, vol. 8, no. 1, p. 184, 2021. 3
work page 2021
-
[35]
Combining low and mid- level gaze features for desktop activity recognition,
N. Srivastava, J. Newn, and E. Velloso, “Combining low and mid- level gaze features for desktop activity recognition,”Proceedings of the ACM on Interactive, Mobile, Wearable and Ubiquitous Technolo- gies, vol. 2, no. 4, pp. 1–27, 2018. 3
work page 2018
-
[36]
Gazegraph: Graph- based few-shot cognitive context sensing from human visual behav- ior,
G. Lan, B. Heit, T. Scargill, and M. Gorlatova, “Gazegraph: Graph- based few-shot cognitive context sensing from human visual behav- ior,” inProceedings of the 18th Conference on Embedded Networked Sensor Systems, pp. 422–435, 2020. 3, 5
work page 2020
-
[37]
SR Research Ltd., “Eyelink 1000 user’s manual,” 2010. Version 1.5.2. 3
work page 2010
-
[38]
Tobii, “Tobii pro x2-30 eye tracker.” https://www.tobii.com/products/discontinued/tobii-pro-x2-30/. Accessed: 2026-01-03. 3
work page 2026
- [39]
-
[40]
Judging llm-as-a-judge with mt- bench and chatbot arena,
L. Zheng, W.-L. Chiang, Y . Sheng, S. Zhuang, Z. Wu, Y . Zhuang, Z. Lin, Z. Li, D. Li, E. Xing,et al., “Judging llm-as-a-judge with mt- bench and chatbot arena,”Advances in neural information processing systems, vol. 36, pp. 46595–46623, 2023. 3
work page 2023
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.