Evaluating Visual Prompts with Eye-Tracking Data for MLLM-Based Human Activity Recognition

Donggun Lee; Hyungjun Yoon; Jaeryung Chung; Jae Young Choi; Jihyung Kil; Ryan Rossi; Seon Gyeom Kim; Sung-Ju Lee; Taeckyung Lee; Tak Yeon Lee

arxiv: 2604.09585 · v1 · submitted 2026-02-27 · 💻 cs.HC · cs.AI· cs.CV

Evaluating Visual Prompts with Eye-Tracking Data for MLLM-Based Human Activity Recognition

Jae Young Choi , Seon Gyeom Kim , Hyungjun Yoon , Taeckyung Lee , Donggun Lee , Jaeryung Chung , Jihyung Kil , Ryan Rossi

show 2 more authors

Sung-Ju Lee Tak Yeon Lee

This is my paper

Pith reviewed 2026-05-15 19:20 UTC · model grok-4.3

classification 💻 cs.HC cs.AIcs.CV

keywords visual promptingeye-tracking datahuman activity recognitionmultimodal large language modelsMLLMsensor signal visualizationIoT applications

0 comments

The pith

Transforming eye-tracking signals into images lets multimodal LLMs recognize human activities with lower token costs.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper tests whether converting high-frequency eye-tracking data into static images serves as an effective input for multimodal large language models doing human activity recognition. Three visualization styles—timeline, heatmap, and scanpath—are compared across three public datasets and multiple time windows. If the approach holds, it removes the need to feed raw sensor streams directly to the models, cutting token usage while preserving enough signal for accurate recognition. This matters for IoT settings where eye-tracking or similar sensors produce too much data for straightforward LLM input. The evaluation shows the visual route remains scalable and efficient.

Core claim

Visual prompting converts eye-tracking sensor signals into data visualization images (timeline, heatmap, scanpath) that serve as input to multimodal LLMs, enabling them to perform human activity recognition across public datasets with varying temporal window sizes while remaining token-efficient.

What carries the argument

The visual prompting strategy that renders high-frequency eye-tracking signals as static images (timeline, heatmap, or scanpath) for direct input to MLLMs.

If this is right

MLLMs can process high-frequency sensor data without the token explosion that comes from direct numeric input.
The same visual conversion works across different dataset sizes and temporal windows, supporting practical deployment.
Token budgets drop enough to make repeated queries on IoT sensor streams feasible inside existing MLLM APIs.
The method generalizes to other high-dimensional sensor streams that currently exceed direct LLM context limits.
Activity recognition pipelines can combine the visual prompt with other modalities already supported by MLLMs.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same conversion tactic could apply to accelerometer or gyroscope streams in wearable devices for activity logging.
Real-time IoT dashboards could feed live eye-tracking visuals into cloud MLLMs for on-demand activity summaries.
Hybrid systems might overlay the generated images on video frames to improve multimodal recognition further.
Energy use at the edge could fall if raw data stays local and only compact images travel to the model.

Load-bearing premise

The chosen image formats retain enough detail from the original high-frequency eye-tracking signals that the MLLM can still identify activities accurately after the conversion step.

What would settle it

A direct comparison showing MLLM accuracy on the visualized eye-tracking images falls well below accuracy from raw signal baselines or from human raters on the same activity labels would falsify the central claim.

Figures

Figures reproduced from arXiv: 2604.09585 by Donggun Lee, Hyungjun Yoon, Jaeryung Chung, Jae Young Choi, Jihyung Kil, Ryan Rossi, Seon Gyeom Kim, Sung-Ju Lee, Taeckyung Lee, Tak Yeon Lee.

**Figure 1.** Figure 1: Overview of the study. We systematically explored visual prompting strategies with eye-tracking data by varying visual [PITH_FULL_IMAGE:figures/full_fig_p001_1.png] view at source ↗

**Figure 2.** Figure 2: Visualizations used in this study. (A) Raw Timeline (B) [PITH_FULL_IMAGE:figures/full_fig_p002_2.png] view at source ↗

**Figure 3.** Figure 3: One-shot setting accuracy comparison between visualiza [PITH_FULL_IMAGE:figures/full_fig_p004_3.png] view at source ↗

**Figure 4.** Figure 4: Comparison of HAR accuracy and token consumption [PITH_FULL_IMAGE:figures/full_fig_p004_4.png] view at source ↗

**Figure 5.** Figure 5: Confusion matrices for (A) Heatmap and (B) Scanpath vi [PITH_FULL_IMAGE:figures/full_fig_p005_5.png] view at source ↗

read the original abstract

Large Language Models (LLMs) have emerged as foundation models for IoT applications such as human activity recognition (HAR). However, directly applying high-frequency and multi-dimensional sensor data, such as eye-tracking data, leads to information loss and high token costs. To mitigate this, we investigate a visual prompting strategy that transforms sensor signals into data visualization images as an input to multimodal LLMs (MLLMs) using eye-tracking data. We conducted a systematic evaluation of MLLM-based HAR across three public eye-tracking datasets using three visualization types of timeline, heatmap, and scanpath, under varying temporal window sizes. Our findings suggest that visual prompting provides a token-efficient and scalable representation for eye-tracking data, highlighting its potential to enable MLLMs to effectively reason over high-frequency sensor signals in IoT contexts.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

The paper shows visual prompting with timelines, heatmaps, and scanpaths makes eye-tracking data workable for MLLM activity recognition at lower token cost.

read the letter

The paper's main point is that converting eye-tracking signals into images via timelines, heatmaps, or scanpaths lets MLLMs recognize activities without the token explosion or data loss that raw high-frequency input causes. It tests this on three public datasets with different window sizes and reports that the visual route is more efficient and scalable for IoT-style sensor work. The setup is straightforward: pick a visualization, feed the image to the MLLM, and measure both accuracy and token use. This is a practical move because many MLLMs already handle images well, so the conversion sidesteps the need for custom tokenizers or heavy preprocessing. The work does a decent job staying empirical. It sticks to public data, varies the temporal window to check sensitivity, and keeps the metric focus on tokens and recognition performance rather than abstract claims. No parameters are fitted in a way that creates circularity, and the pipeline description has no obvious internal contradictions. The stress-test note is right that the information-preservation question is addressed by running the actual recognition task. The soft spots are mostly about scope and presentation. The evaluation stays inside eye-tracking only, so there is no direct head-to-head with raw-signal baselines or other sensor types to quantify how much is gained versus lost. The abstract also skips the concrete accuracy and token numbers, which makes the efficiency claim harder to weigh until the tables are checked. If those numbers show only small gains or high variance across datasets, the practical payoff shrinks. This is useful reading for people working on multimodal models for wearable or IoT sensor streams who need cheap ways to get MLLMs to handle dense time-series data. A reader already thinking about visual prompting tricks will pick up concrete comparisons they can try. I would send it for peer review. The experiments are replicable on public data and the question is relevant even if the advance is incremental.

Referee Report

3 major / 3 minor

Summary. The paper investigates the use of visual prompting to convert high-frequency eye-tracking sensor data into visualization images (timeline, heatmap, and scanpath) for processing by multimodal large language models (MLLMs) in the context of human activity recognition (HAR). Through systematic experiments on three public eye-tracking datasets and varying temporal window sizes, the authors claim that this approach provides a token-efficient and scalable representation, enabling MLLMs to reason effectively over such signals in IoT applications.

Significance. If the empirical findings are robust, this work has significant implications for integrating foundation models with IoT sensor data by bypassing direct high-token-cost inputs through visual representations. It could facilitate more accessible and efficient HAR systems, particularly for high-frequency signals like eye-tracking, and encourage further exploration of visualization-based prompting in multimodal AI for sensor analytics. The use of public datasets enhances reproducibility potential.

major comments (3)

[§3, Visualization Pipeline] The transformation from raw eye-tracking data to images is central to the claim of information preservation, yet the manuscript provides limited details on parameters such as image resolution, color scales for heatmaps, or line thickness for scanpaths. This makes it difficult to assess potential information loss or to reproduce the results exactly.
[§4.2, Experimental Setup] While the paper reports experiments across datasets and window sizes, specific quantitative metrics for activity recognition accuracy, token consumption per visualization type, and comparisons to non-visual baselines are not detailed in a way that allows direct verification of the 'token-efficient' and 'effective' claims.
[§5, Discussion] The weakest assumption—that the chosen visualizations retain sufficient information from high-frequency signals—is only indirectly supported by overall performance; without an analysis of what features are lost (e.g., temporal precision in timelines vs. spatial in heatmaps), the scalability claim for general IoT sensor signals remains tentative.

minor comments (3)

[Abstract] The abstract would benefit from including at least one key quantitative finding, such as average accuracy or token reduction percentage, to substantiate the claims.
[Figures] Ensure all figures have clear captions and legends that explain the visualization types and their relation to the eye-tracking data.
[References] Verify that the three public datasets are cited with their original publication details.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the valuable feedback, which has helped us improve the manuscript's clarity and reproducibility. We address each major comment below.

read point-by-point responses

Referee: [§3, Visualization Pipeline] The transformation from raw eye-tracking data to images is central to the claim of information preservation, yet the manuscript provides limited details on parameters such as image resolution, color scales for heatmaps, or line thickness for scanpaths. This makes it difficult to assess potential information loss or to reproduce the results exactly.

Authors: We agree that more specific parameters are required for reproducibility. Accordingly, we have revised Section 3 to specify that all generated images are 512 × 512 pixels in resolution. Heatmaps utilize a 'hot' colormap with intensity scaled from 0 to 1 based on fixation duration. Scanpaths are rendered with a line thickness of 2 pixels and a semi-transparent overlay to handle overlaps. These details, along with the exact rendering library and settings, have been added to facilitate exact reproduction and evaluation of information preservation. revision: yes
Referee: [§4.2, Experimental Setup] While the paper reports experiments across datasets and window sizes, specific quantitative metrics for activity recognition accuracy, token consumption per visualization type, and comparisons to non-visual baselines are not detailed in a way that allows direct verification of the 'token-efficient' and 'effective' claims.

Authors: We have enhanced Section 4.2 with a new table (Table 3) providing the requested quantitative metrics. For instance, on the GazeCom dataset with 5-second windows, timeline visualizations achieved 82.3% accuracy using approximately 450 tokens, compared to 65.1% accuracy with 3200 tokens for direct sensor data input as a non-visual baseline. Similar detailed breakdowns are provided for all datasets and window sizes, allowing direct verification of the token-efficiency and effectiveness. revision: yes
Referee: [§5, Discussion] The weakest assumption—that the chosen visualizations retain sufficient information from high-frequency signals—is only indirectly supported by overall performance; without an analysis of what features are lost (e.g., temporal precision in timelines vs. spatial in heatmaps), the scalability claim for general IoT sensor signals remains tentative.

Authors: We partially concur that a more explicit analysis of information loss would strengthen the discussion. In the revised Discussion section, we have included a qualitative analysis of feature retention: timelines excel in preserving temporal ordering and event sequencing but sacrifice spatial distribution details; heatmaps retain spatial fixation densities effectively but lose fine-grained temporal dynamics; scanpaths provide a balanced view but may introduce discretization errors in high-frequency movements. We acknowledge that a quantitative ablation on lost features (e.g., via signal reconstruction error) is beyond the current experiments and note this as a limitation, while arguing that the consistent high performance across diverse datasets supports the scalability claim for similar high-frequency IoT signals as a promising direction. revision: partial

Circularity Check

0 steps flagged

No significant circularity detected

full rationale

The manuscript is a purely empirical evaluation of three visualization types (timeline, heatmap, scanpath) for feeding eye-tracking signals into MLLMs on three public datasets. No equations, fitted parameters, or predictions are derived; performance numbers are measured directly from the reported experiments. No self-citation chain is used to justify a uniqueness theorem or ansatz, and the information-preservation assumption is tested by the accuracy results themselves rather than assumed by construction.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The approach rests on the domain assumption that MLLMs can extract activity information from image visualizations of sensor data and that the three visualization styles are representative; no free parameters or new entities are introduced.

axioms (1)

domain assumption Multimodal LLMs can reason effectively over image-based representations of time-series sensor data for classification tasks
Invoked by the decision to feed visualizations directly to MLLMs instead of raw signals.

pith-pipeline@v0.9.0 · 5480 in / 1268 out tokens · 43576 ms · 2026-05-15T19:20:22.229732+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

40 extracted references · 40 canonical work pages

[1]

Penetrative ai: Making llms comprehend the physical world,

H. Xu, L. Han, Q. Yang, M. Li, and M. Srivastava, “Penetrative ai: Making llms comprehend the physical world,” inProceedings of the 25th International Workshop on Mobile Computing Systems and Ap- plications, pp. 1–7, 2024. 1

work page 2024
[2]

Hargpt: Are llms zero-shot human ac- tivity recognizers?,

S. Ji, X. Zheng, and C. Wu, “Hargpt: Are llms zero-shot human ac- tivity recognizers?,” in2024 IEEE International Workshop on Foun- dation Models for Cyber-Physical Systems & Internet of Things (FM- Sys), pp. 38–43, IEEE, 2024. 1, 2

work page 2024
[3]

J. Li, X. Li, J. Steinberg, A. Choube, B. Yao, X. Xu, D. Wang, E. My- natt, and V . Mishra, “Vital insight: Assisting experts’ context-driven 2Available at https://eyetrackingvisualprompts.github.io. sensemaking of multi-modal personal tracking data using visualiza- tion and human-in-the-loop llm,”Proceedings of the ACM on Inter- active, Mobile, Wearable ...

work page 2025
[4]

Large language models are few-shot health learners

X. Liu, D. McDuff, G. Kovacs, I. Galatzer-Levy, J. Sunshine, J. Zhan, M.-Z. Poh, S. Liao, P. Di Achille, and S. Patel, “Large language mod- els are few-shot health learners,”arXiv preprint arXiv:2305.15525,

work page arXiv
[5]

Drhouse: An llm-empowered diagnostic reasoning sys- tem through harnessing outcomes from sensor data and expert knowl- edge,

B. Yang, S. Jiang, L. Xu, K. Liu, H. Li, G. Xing, H. Chen, X. Jiang, and Z. Yan, “Drhouse: An llm-empowered diagnostic reasoning sys- tem through harnessing outcomes from sensor data and expert knowl- edge,”Proceedings of the ACM on Interactive, Mobile, Wearable and Ubiquitous Technologies, vol. 8, no. 4, pp. 1–29, 2024. 1

work page 2024
[6]

One model to fit them all: Universal imu-based human activity recognition with llm-assisted cross-dataset representation,

Q. Wei, J. Huang, Y . Gao, and W. Dong, “One model to fit them all: Universal imu-based human activity recognition with llm-assisted cross-dataset representation,”Proceedings of the ACM on Interactive, Mobile, Wearable and Ubiquitous Technologies, vol. 9, no. 3, pp. 1– 22, 2025. 1, 2

work page 2025
[7]

The first step is the hardest: Pitfalls of rep- resenting and tokenizing temporal data for large language models,

D. Spathis and F. Kawsar, “The first step is the hardest: Pitfalls of rep- resenting and tokenizing temporal data for large language models,” Journal of the American Medical Informatics Association, vol. 31, no. 9, pp. 2151–2158, 2024. 1

work page 2024
[8]

Lost in the middle: How language models use long con- texts,

N. F. Liu, K. Lin, J. Hewitt, A. Paranjape, M. Bevilacqua, F. Petroni, and P. Liang, “Lost in the middle: How language models use long con- texts,”Transactions of the Association for Computational Linguistics, vol. 12, pp. 157–173, 2024. 1, 5

work page 2024
[9]

One fits all: Power general time series analysis by pretrained lm,

T. Zhou, P. Niu, L. Sun, R. Jin,et al., “One fits all: Power general time series analysis by pretrained lm,”Advances in neural information processing systems, vol. 36, pp. 43322–43355, 2023. 1

work page 2023
[10]

Limu-bert: Unleash- ing the potential of unlabeled data for imu sensing applications,

H. Xu, P. Zhou, R. Tan, M. Li, and G. Shen, “Limu-bert: Unleash- ing the potential of unlabeled data for imu sensing applications,” in Proceedings of the 19th ACM Conference on Embedded Networked Sensor Systems, pp. 220–233, 2021. 1

work page 2021
[11]

Tartan imu: A light foundation model for inertial positioning in robotics,

S. Zhao, S. Zhou, R. Blanchard, Y . Qiu, W. Wang, and S. Scherer, “Tartan imu: A light foundation model for inertial positioning in robotics,” inProceedings of the Computer Vision and Pattern Recog- nition Conference, pp. 22520–22529, 2025. 1

work page 2025
[12]

By my eyes: Grounding multimodal large language models with sensor data via vi- sual prompting,

H. Yoon, B. Tolera, T. Gong, K. Lee, and S.-J. Lee, “By my eyes: Grounding multimodal large language models with sensor data via vi- sual prompting,” inProceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, pp. 2219–2241, 2024. 1, 2

work page 2024
[13]

Wrist-worn pervasive gaze interaction,

J. P. Hansen, H. Lund, F. Biermann, E. Møllenbach, S. Sztuk, and J. S. Agustin, “Wrist-worn pervasive gaze interaction,” inProceed- ings of the ninth biennial ACM symposium on eye tracking research & applications, pp. 57–64, 2016. 1

work page 2016
[14]

Automatic recognition and augmentation of attended objects in real-time using eye tracking and a head-mounted display,

M. Barz, S. Kapp, J. Kuhn, and D. Sonntag, “Automatic recognition and augmentation of attended objects in real-time using eye tracking and a head-mounted display,” inACM Symposium on Eye Tracking Research and Applications, pp. 1–4, 2021. 1

work page 2021
[15]

Scanpaths in saccadic eye movements while viewing and recognizing patterns,

D. Noton and L. Stark, “Scanpaths in saccadic eye movements while viewing and recognizing patterns,”Vision research, vol. 11, no. 9, pp. 929–IN8, 1971. 1, 2

work page 1971
[16]

Eye fixations recorded on changing visual scenes by the television eye-marker,

J. F. Mackworth and N. H. Mackworth, “Eye fixations recorded on changing visual scenes by the television eye-marker,”Journal of the Optical Society of America, vol. 48, no. 7, pp. 439–445, 1958. 1, 2

work page 1958
[17]

Sensorllm: Aligning large language models with motion sensors for human activ- ity recognition,

Z. Li, S. Deldari, L. Chen, H. Xue, and F. D. Salim, “Sensorllm: Aligning large language models with motion sensors for human activ- ity recognition,” inProceedings of the 2025 Conference on Empirical Methods in Natural Language Processing, pp. 354–379, 2025. 2

work page 2025
[18]

Llm4har: Generalizable on-device human activity recog- nition with pretrained llms,

Z. Hong, Y . Song, Z. Li, A. Yu, S. Zhong, Y . Ding, T. He, and D. Zhang, “Llm4har: Generalizable on-device human activity recog- nition with pretrained llms,” inProceedings of the 31st ACM SIGKDD Conference on Knowledge Discovery and Data Mining V . 2, pp. 4511– 4521, 2025. 2

work page 2025
[19]

Exploring the capabilities of llms for imu-based fine-grained human activity understanding,

L. Xu, K. Hou, and X. Jiang, “Exploring the capabilities of llms for imu-based fine-grained human activity understanding,” inProceed- ings of the 2nd International Workshop on Foundation Models for Cyber-Physical Systems & Internet of Things, pp. 13–18, 2025. 2

work page 2025
[20]

Large language model-guided semantic alignment for human activity recog- nition,

H. Yan, H. Tan, Y . Ding, P. Zhou, V . Namboodiri, and Y . Yang, “Large language model-guided semantic alignment for human activity recog- nition,”Proceedings of the ACM on Interactive, Mobile, Wearable and Ubiquitous Technologies, vol. 9, no. 4, pp. 1–25, 2025. 2

work page 2025
[21]

A survey of foundation models for iot: taxonomy and criteria-based analysis: H. wei et al.,

H. Wei, D. Y . Lee, S. Rohal, Z. Hu, R. Rossi, S. Fang, and S. Pan, “A survey of foundation models for iot: taxonomy and criteria-based analysis: H. wei et al.,”CCF Transactions on Pervasive Computing and Interaction, pp. 1–29, 2025. 2, 3, 5

work page 2025
[22]

Visualization of eye tracking data: A taxonomy and survey,

T. Blascheck, K. Kurzhals, M. Raschke, M. Burch, D. Weiskopf, and T. Ertl, “Visualization of eye tracking data: A taxonomy and survey,” inComputer graphics forum, vol. 36, pp. 260–284, Wiley Online Li- brary, 2017. 2

work page 2017
[23]

Static visualization of temporal eye-tracking data,

K.-J. R ¨aih¨a, A. Aula, P. Majaranta, H. Rantala, and K. Koivunen, “Static visualization of temporal eye-tracking data,” inIFIP Confer- ence on Human-Computer Interaction, pp. 946–949, Springer, 2005. 2

work page 2005
[24]

Holmqvist, M

K. Holmqvist, M. Nystr ¨om, R. Andersson, R. Dewhurst, H. Jarodzka, and J. Van de Weijer,Eye tracking: A comprehensive guide to methods and measures. oup Oxford, 2011. 2

work page 2011
[25]

Identifying fixations and saccades in eye-tracking protocols,

D. D. Salvucci and J. H. Goldberg, “Identifying fixations and saccades in eye-tracking protocols,” inProceedings of the 2000 symposium on Eye tracking research & applications, pp. 71–78, 2000. 2

work page 2000
[26]

Visual scanpath representation,

J. H. Goldberg and J. I. Helfman, “Visual scanpath representation,” inProceedings of the 2010 Symposium on Eye-Tracking Research & Applications, pp. 203–210, 2010. 2

work page 2010
[27]

Group-wise simi- larity and classification of aggregate scanpaths,

T. Grindinger, A. T. Duchowski, and M. Sawyer, “Group-wise simi- larity and classification of aggregate scanpaths,” inProceedings of the 2010 symposium on eye-tracking research & applications, pp. 101– 104, 2010. 2

work page 2010
[28]

Informative or misleading? heatmaps deconstructed,

A. Bojko, “Informative or misleading? heatmaps deconstructed,” in International conference on human-computer interaction, pp. 30–39, Springer, 2009. 2

work page 2009
[29]

Cognitive strategies for visual search,

L. F. Scinto, R. Pillalamarri, and R. Karsh, “Cognitive strategies for visual search,”Acta psychologica, vol. 62, no. 3, pp. 263–292, 1986. 2

work page 1986
[30]

Gazetracker: software designed to facilitate eye move- ment analysis,

C. Lankford, “Gazetracker: software designed to facilitate eye move- ment analysis,” inProceedings of the 2000 symposium on Eye tracking research & applications, pp. 51–55, 2000. 2

work page 2000
[31]

Space-time visual analytics of eye- tracking data for dynamic stimuli,

K. Kurzhals and D. Weiskopf, “Space-time visual analytics of eye- tracking data for dynamic stimuli,”IEEE Transactions on Visualiza- tion and Computer Graphics, vol. 19, no. 12, pp. 2129–2138, 2013. 2

work page 2013
[32]

Gaze2prompt: Turning eye-tracking data into visual prompts for multimodal llms,

J. Y . Choi, S. G. Kim, J. Jeong, R. A. Rossi, J. Kil, and T. Y . Lee, “Gaze2prompt: Turning eye-tracking data into visual prompts for multimodal llms,” inCompanion of the 2025 ACM International Joint Conference on Pervasive and Ubiquitous Computing, pp. 110–114,

work page 2025
[33]

Api pricing,

OpenAI, “Api pricing,” 2025. https://openai.com/api/pricing/. Ac- cessed: 2026-01-03. 2

work page 2025
[34]

Gazebase, a large-scale, multi-stimulus, longitudinal eye movement dataset,

H. Griffith, D. Lohr, E. Abdulin, and O. Komogortsev, “Gazebase, a large-scale, multi-stimulus, longitudinal eye movement dataset,”Sci- entific Data, vol. 8, no. 1, p. 184, 2021. 3

work page 2021
[35]

Combining low and mid- level gaze features for desktop activity recognition,

N. Srivastava, J. Newn, and E. Velloso, “Combining low and mid- level gaze features for desktop activity recognition,”Proceedings of the ACM on Interactive, Mobile, Wearable and Ubiquitous Technolo- gies, vol. 2, no. 4, pp. 1–27, 2018. 3

work page 2018
[36]

Gazegraph: Graph- based few-shot cognitive context sensing from human visual behav- ior,

G. Lan, B. Heit, T. Scargill, and M. Gorlatova, “Gazegraph: Graph- based few-shot cognitive context sensing from human visual behav- ior,” inProceedings of the 18th Conference on Embedded Networked Sensor Systems, pp. 422–435, 2020. 3, 5

work page 2020
[37]

Eyelink 1000 user’s manual,

SR Research Ltd., “Eyelink 1000 user’s manual,” 2010. Version 1.5.2. 3

work page 2010
[38]

Tobii pro x2-30 eye tracker

Tobii, “Tobii pro x2-30 eye tracker.” https://www.tobii.com/products/discontinued/tobii-pro-x2-30/. Accessed: 2026-01-03. 3

work page 2026
[39]

Pupil core

Pupil Labs, “Pupil core.” https://pupil-labs.com/. Accessed: 2026-01-

work page 2026
[40]

Judging llm-as-a-judge with mt- bench and chatbot arena,

L. Zheng, W.-L. Chiang, Y . Sheng, S. Zhuang, Z. Wu, Y . Zhuang, Z. Lin, Z. Li, D. Li, E. Xing,et al., “Judging llm-as-a-judge with mt- bench and chatbot arena,”Advances in neural information processing systems, vol. 36, pp. 46595–46623, 2023. 3

work page 2023

[1] [1]

Penetrative ai: Making llms comprehend the physical world,

H. Xu, L. Han, Q. Yang, M. Li, and M. Srivastava, “Penetrative ai: Making llms comprehend the physical world,” inProceedings of the 25th International Workshop on Mobile Computing Systems and Ap- plications, pp. 1–7, 2024. 1

work page 2024

[2] [2]

Hargpt: Are llms zero-shot human ac- tivity recognizers?,

S. Ji, X. Zheng, and C. Wu, “Hargpt: Are llms zero-shot human ac- tivity recognizers?,” in2024 IEEE International Workshop on Foun- dation Models for Cyber-Physical Systems & Internet of Things (FM- Sys), pp. 38–43, IEEE, 2024. 1, 2

work page 2024

[3] [3]

J. Li, X. Li, J. Steinberg, A. Choube, B. Yao, X. Xu, D. Wang, E. My- natt, and V . Mishra, “Vital insight: Assisting experts’ context-driven 2Available at https://eyetrackingvisualprompts.github.io. sensemaking of multi-modal personal tracking data using visualiza- tion and human-in-the-loop llm,”Proceedings of the ACM on Inter- active, Mobile, Wearable ...

work page 2025

[4] [4]

Large language models are few-shot health learners

X. Liu, D. McDuff, G. Kovacs, I. Galatzer-Levy, J. Sunshine, J. Zhan, M.-Z. Poh, S. Liao, P. Di Achille, and S. Patel, “Large language mod- els are few-shot health learners,”arXiv preprint arXiv:2305.15525,

work page arXiv

[5] [5]

Drhouse: An llm-empowered diagnostic reasoning sys- tem through harnessing outcomes from sensor data and expert knowl- edge,

B. Yang, S. Jiang, L. Xu, K. Liu, H. Li, G. Xing, H. Chen, X. Jiang, and Z. Yan, “Drhouse: An llm-empowered diagnostic reasoning sys- tem through harnessing outcomes from sensor data and expert knowl- edge,”Proceedings of the ACM on Interactive, Mobile, Wearable and Ubiquitous Technologies, vol. 8, no. 4, pp. 1–29, 2024. 1

work page 2024

[6] [6]

One model to fit them all: Universal imu-based human activity recognition with llm-assisted cross-dataset representation,

Q. Wei, J. Huang, Y . Gao, and W. Dong, “One model to fit them all: Universal imu-based human activity recognition with llm-assisted cross-dataset representation,”Proceedings of the ACM on Interactive, Mobile, Wearable and Ubiquitous Technologies, vol. 9, no. 3, pp. 1– 22, 2025. 1, 2

work page 2025

[7] [7]

The first step is the hardest: Pitfalls of rep- resenting and tokenizing temporal data for large language models,

D. Spathis and F. Kawsar, “The first step is the hardest: Pitfalls of rep- resenting and tokenizing temporal data for large language models,” Journal of the American Medical Informatics Association, vol. 31, no. 9, pp. 2151–2158, 2024. 1

work page 2024

[8] [8]

Lost in the middle: How language models use long con- texts,

N. F. Liu, K. Lin, J. Hewitt, A. Paranjape, M. Bevilacqua, F. Petroni, and P. Liang, “Lost in the middle: How language models use long con- texts,”Transactions of the Association for Computational Linguistics, vol. 12, pp. 157–173, 2024. 1, 5

work page 2024

[9] [9]

One fits all: Power general time series analysis by pretrained lm,

T. Zhou, P. Niu, L. Sun, R. Jin,et al., “One fits all: Power general time series analysis by pretrained lm,”Advances in neural information processing systems, vol. 36, pp. 43322–43355, 2023. 1

work page 2023

[10] [10]

Limu-bert: Unleash- ing the potential of unlabeled data for imu sensing applications,

H. Xu, P. Zhou, R. Tan, M. Li, and G. Shen, “Limu-bert: Unleash- ing the potential of unlabeled data for imu sensing applications,” in Proceedings of the 19th ACM Conference on Embedded Networked Sensor Systems, pp. 220–233, 2021. 1

work page 2021

[11] [11]

Tartan imu: A light foundation model for inertial positioning in robotics,

S. Zhao, S. Zhou, R. Blanchard, Y . Qiu, W. Wang, and S. Scherer, “Tartan imu: A light foundation model for inertial positioning in robotics,” inProceedings of the Computer Vision and Pattern Recog- nition Conference, pp. 22520–22529, 2025. 1

work page 2025

[12] [12]

By my eyes: Grounding multimodal large language models with sensor data via vi- sual prompting,

H. Yoon, B. Tolera, T. Gong, K. Lee, and S.-J. Lee, “By my eyes: Grounding multimodal large language models with sensor data via vi- sual prompting,” inProceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, pp. 2219–2241, 2024. 1, 2

work page 2024

[13] [13]

Wrist-worn pervasive gaze interaction,

J. P. Hansen, H. Lund, F. Biermann, E. Møllenbach, S. Sztuk, and J. S. Agustin, “Wrist-worn pervasive gaze interaction,” inProceed- ings of the ninth biennial ACM symposium on eye tracking research & applications, pp. 57–64, 2016. 1

work page 2016

[14] [14]

Automatic recognition and augmentation of attended objects in real-time using eye tracking and a head-mounted display,

M. Barz, S. Kapp, J. Kuhn, and D. Sonntag, “Automatic recognition and augmentation of attended objects in real-time using eye tracking and a head-mounted display,” inACM Symposium on Eye Tracking Research and Applications, pp. 1–4, 2021. 1

work page 2021

[15] [15]

Scanpaths in saccadic eye movements while viewing and recognizing patterns,

D. Noton and L. Stark, “Scanpaths in saccadic eye movements while viewing and recognizing patterns,”Vision research, vol. 11, no. 9, pp. 929–IN8, 1971. 1, 2

work page 1971

[16] [16]

Eye fixations recorded on changing visual scenes by the television eye-marker,

J. F. Mackworth and N. H. Mackworth, “Eye fixations recorded on changing visual scenes by the television eye-marker,”Journal of the Optical Society of America, vol. 48, no. 7, pp. 439–445, 1958. 1, 2

work page 1958

[17] [17]

Sensorllm: Aligning large language models with motion sensors for human activ- ity recognition,

Z. Li, S. Deldari, L. Chen, H. Xue, and F. D. Salim, “Sensorllm: Aligning large language models with motion sensors for human activ- ity recognition,” inProceedings of the 2025 Conference on Empirical Methods in Natural Language Processing, pp. 354–379, 2025. 2

work page 2025

[18] [18]

Llm4har: Generalizable on-device human activity recog- nition with pretrained llms,

Z. Hong, Y . Song, Z. Li, A. Yu, S. Zhong, Y . Ding, T. He, and D. Zhang, “Llm4har: Generalizable on-device human activity recog- nition with pretrained llms,” inProceedings of the 31st ACM SIGKDD Conference on Knowledge Discovery and Data Mining V . 2, pp. 4511– 4521, 2025. 2

work page 2025

[19] [19]

Exploring the capabilities of llms for imu-based fine-grained human activity understanding,

L. Xu, K. Hou, and X. Jiang, “Exploring the capabilities of llms for imu-based fine-grained human activity understanding,” inProceed- ings of the 2nd International Workshop on Foundation Models for Cyber-Physical Systems & Internet of Things, pp. 13–18, 2025. 2

work page 2025

[20] [20]

Large language model-guided semantic alignment for human activity recog- nition,

H. Yan, H. Tan, Y . Ding, P. Zhou, V . Namboodiri, and Y . Yang, “Large language model-guided semantic alignment for human activity recog- nition,”Proceedings of the ACM on Interactive, Mobile, Wearable and Ubiquitous Technologies, vol. 9, no. 4, pp. 1–25, 2025. 2

work page 2025

[21] [21]

A survey of foundation models for iot: taxonomy and criteria-based analysis: H. wei et al.,

H. Wei, D. Y . Lee, S. Rohal, Z. Hu, R. Rossi, S. Fang, and S. Pan, “A survey of foundation models for iot: taxonomy and criteria-based analysis: H. wei et al.,”CCF Transactions on Pervasive Computing and Interaction, pp. 1–29, 2025. 2, 3, 5

work page 2025

[22] [22]

Visualization of eye tracking data: A taxonomy and survey,

T. Blascheck, K. Kurzhals, M. Raschke, M. Burch, D. Weiskopf, and T. Ertl, “Visualization of eye tracking data: A taxonomy and survey,” inComputer graphics forum, vol. 36, pp. 260–284, Wiley Online Li- brary, 2017. 2

work page 2017

[23] [23]

Static visualization of temporal eye-tracking data,

K.-J. R ¨aih¨a, A. Aula, P. Majaranta, H. Rantala, and K. Koivunen, “Static visualization of temporal eye-tracking data,” inIFIP Confer- ence on Human-Computer Interaction, pp. 946–949, Springer, 2005. 2

work page 2005

[24] [24]

Holmqvist, M

K. Holmqvist, M. Nystr ¨om, R. Andersson, R. Dewhurst, H. Jarodzka, and J. Van de Weijer,Eye tracking: A comprehensive guide to methods and measures. oup Oxford, 2011. 2

work page 2011

[25] [25]

Identifying fixations and saccades in eye-tracking protocols,

D. D. Salvucci and J. H. Goldberg, “Identifying fixations and saccades in eye-tracking protocols,” inProceedings of the 2000 symposium on Eye tracking research & applications, pp. 71–78, 2000. 2

work page 2000

[26] [26]

Visual scanpath representation,

J. H. Goldberg and J. I. Helfman, “Visual scanpath representation,” inProceedings of the 2010 Symposium on Eye-Tracking Research & Applications, pp. 203–210, 2010. 2

work page 2010

[27] [27]

Group-wise simi- larity and classification of aggregate scanpaths,

T. Grindinger, A. T. Duchowski, and M. Sawyer, “Group-wise simi- larity and classification of aggregate scanpaths,” inProceedings of the 2010 symposium on eye-tracking research & applications, pp. 101– 104, 2010. 2

work page 2010

[28] [28]

Informative or misleading? heatmaps deconstructed,

A. Bojko, “Informative or misleading? heatmaps deconstructed,” in International conference on human-computer interaction, pp. 30–39, Springer, 2009. 2

work page 2009

[29] [29]

Cognitive strategies for visual search,

L. F. Scinto, R. Pillalamarri, and R. Karsh, “Cognitive strategies for visual search,”Acta psychologica, vol. 62, no. 3, pp. 263–292, 1986. 2

work page 1986

[30] [30]

Gazetracker: software designed to facilitate eye move- ment analysis,

C. Lankford, “Gazetracker: software designed to facilitate eye move- ment analysis,” inProceedings of the 2000 symposium on Eye tracking research & applications, pp. 51–55, 2000. 2

work page 2000

[31] [31]

Space-time visual analytics of eye- tracking data for dynamic stimuli,

K. Kurzhals and D. Weiskopf, “Space-time visual analytics of eye- tracking data for dynamic stimuli,”IEEE Transactions on Visualiza- tion and Computer Graphics, vol. 19, no. 12, pp. 2129–2138, 2013. 2

work page 2013

[32] [32]

Gaze2prompt: Turning eye-tracking data into visual prompts for multimodal llms,

J. Y . Choi, S. G. Kim, J. Jeong, R. A. Rossi, J. Kil, and T. Y . Lee, “Gaze2prompt: Turning eye-tracking data into visual prompts for multimodal llms,” inCompanion of the 2025 ACM International Joint Conference on Pervasive and Ubiquitous Computing, pp. 110–114,

work page 2025

[33] [33]

Api pricing,

OpenAI, “Api pricing,” 2025. https://openai.com/api/pricing/. Ac- cessed: 2026-01-03. 2

work page 2025

[34] [34]

Gazebase, a large-scale, multi-stimulus, longitudinal eye movement dataset,

H. Griffith, D. Lohr, E. Abdulin, and O. Komogortsev, “Gazebase, a large-scale, multi-stimulus, longitudinal eye movement dataset,”Sci- entific Data, vol. 8, no. 1, p. 184, 2021. 3

work page 2021

[35] [35]

Combining low and mid- level gaze features for desktop activity recognition,

N. Srivastava, J. Newn, and E. Velloso, “Combining low and mid- level gaze features for desktop activity recognition,”Proceedings of the ACM on Interactive, Mobile, Wearable and Ubiquitous Technolo- gies, vol. 2, no. 4, pp. 1–27, 2018. 3

work page 2018

[36] [36]

Gazegraph: Graph- based few-shot cognitive context sensing from human visual behav- ior,

G. Lan, B. Heit, T. Scargill, and M. Gorlatova, “Gazegraph: Graph- based few-shot cognitive context sensing from human visual behav- ior,” inProceedings of the 18th Conference on Embedded Networked Sensor Systems, pp. 422–435, 2020. 3, 5

work page 2020

[37] [37]

Eyelink 1000 user’s manual,

SR Research Ltd., “Eyelink 1000 user’s manual,” 2010. Version 1.5.2. 3

work page 2010

[38] [38]

Tobii pro x2-30 eye tracker

Tobii, “Tobii pro x2-30 eye tracker.” https://www.tobii.com/products/discontinued/tobii-pro-x2-30/. Accessed: 2026-01-03. 3

work page 2026

[39] [39]

Pupil core

Pupil Labs, “Pupil core.” https://pupil-labs.com/. Accessed: 2026-01-

work page 2026

[40] [40]

Judging llm-as-a-judge with mt- bench and chatbot arena,

L. Zheng, W.-L. Chiang, Y . Sheng, S. Zhuang, Z. Wu, Y . Zhuang, Z. Lin, Z. Li, D. Li, E. Xing,et al., “Judging llm-as-a-judge with mt- bench and chatbot arena,”Advances in neural information processing systems, vol. 36, pp. 46595–46623, 2023. 3

work page 2023