pith. machine review for the scientific record.

arxiv: 2604.23749 · v1 · submitted 2026-04-26 · 💻 cs.HC

Recognition: unknown

StateScribe: Towards Accessible Change Awareness Across Real-World Revisits

Authors on Pith: no claims yet

Pith reviewed 2026-05-08 05:39 UTC · model grok-4.3

classification 💻 cs.HC
keywords: change awareness · blind and low-vision · real-world revisits · memory architecture · assistive technology · episodic memory · object tracking · scene understanding

The pith

StateScribe uses dual-layer memory to describe meaningful changes across revisits for blind and low-vision users.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper develops StateScribe to address a gap in current visual assistive technologies, which describe scenes only once and do not track changes over time. It proposes a system that records scene states and object changes to generate both current descriptions and historical change reports, such as noting a shop sign that was not there before. This matters because unexpected changes can pose safety risks and add cognitive load for blind and low-vision (BLV) individuals who repeatedly navigate familiar places. The evaluation reports an F1-score of 83.1% across 11 revisits, mean latency under 1.54s, and under 54MB of memory across 110 revisits, and a small user study suggests the system helps users notice changes in real locations.

Core claim

StateScribe employs a dual-layer memory architecture integrating episodic scene memory and object-centric temporal memory to enable scalable change tracking, delivering live scene descriptions alongside details of what has changed, when, and where across revisits.

What carries the argument

dual-layer memory architecture that combines episodic scene memory for overall context with object-centric temporal memory for tracking individual changes over time
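
The paper's implementation is not reproduced on this page; below is a minimal sketch of how a dual layer like this could be organized. All class names, fields, and the change rule are hypothetical, not the authors' code.

```python
# Hypothetical sketch of a dual-layer memory; not the authors' code.
from dataclasses import dataclass, field
from datetime import datetime

@dataclass
class SceneEpisode:
    """Episodic scene memory: one compact record per revisit,
    carrying the overall context of that visit."""
    visit_id: int
    timestamp: datetime
    summary: str                  # e.g. "entrance hall, two benches, kiosk"

@dataclass
class ObjectState:
    visit_id: int
    timestamp: datetime
    description: str              # e.g. 'shop sign reads "CLOSED"'

@dataclass
class ObjectTimeline:
    """Object-centric temporal memory: one timeline per tracked object,
    so changes stay queryable by what, when, and where."""
    object_id: str
    location: str                 # e.g. "on your right, near the door"
    states: list = field(default_factory=list)

    def changes_since(self, visit_id: int):
        """Yield (before, after) state pairs that differ after a visit."""
        recent = [s for s in self.states if s.visit_id >= visit_id]
        for before, after in zip(recent, recent[1:]):
            if before.description != after.description:
                yield before, after
```

Under this reading, the episodic layer answers "what does the scene look like now?" while the object timelines answer "what changed, when, and where, since my last visit?"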

If this is right

  • Users receive both immediate scene descriptions and summaries of changes since previous visits.
  • The system sustains high accuracy and low latency over repeated uses in the same locations.
  • Memory usage stays low even after more than 100 revisits, supporting long-term deployment.
  • Participants in real-world tests report better awareness of updates like relocated objects or new signs.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Such memory systems could extend to other senses or multimodal data for more complete environmental monitoring.
  • Personalization based on user habits or specific intents might reduce irrelevant change alerts over time.
  • Integration with broader AI companions could allow proactive notifications about changes that affect safety or routines.

Load-bearing premise

The dual-layer memory architecture can consistently distinguish meaningful real-world changes from noise in diverse, uncontrolled environments without generating excessive false positives or missing important ones over extended periods.
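
To make that premise concrete, one minimal form such a filter could take is sketched below; the transient classes and the three-frame persistence rule are illustrative assumptions, not criteria from the paper.

```python
# Illustrative noise filter for "meaningful" changes; the class list
# and persistence rule are assumptions for this sketch, not the
# paper's values.
TRANSIENT_CLASSES = {"person", "bicycle", "car", "shadow"}

def is_meaningful_change(object_label: str, changed_flags: list) -> bool:
    """Keep a change only if the object is a persistent class and the
    change survives several consecutive observations.

    changed_flags: per-frame booleans from the detector, newest last.
    """
    if object_label in TRANSIENT_CLASSES:
        return False              # passers-by, lighting flicker, etc.
    # require persistence across the last three frames to reject noise
    return len(changed_flags) >= 3 and all(changed_flags[-3:])
```

The premise is precisely that some rule of this kind keeps both error directions low at once; the open question is whether it holds outside controlled sequences.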

What would settle it

Deploy StateScribe in a new, dynamic location with frequent minor alterations and measure whether accuracy drops below acceptable levels or users report missing critical changes like safety hazards.

Figures

Figures reproduced from arXiv: 2604.23749 by Anhong Guo, Hao Chen, Jianzhong Zhang, Kang G. Shin, Ke Sun, Rosiana Natalie, Ruei-Che Chang, Vlad Roznyatovskiy, Xirui Jiang.

Figure 1: StateScribe enables accessible change awareness across real-world revisits. When visiting a new location, (a) StateScribe …
Figure 2: StateScribe’s system architecture and processing pipeline. (a) StateScribe’s mobile interfaces. (b) StateScribe constructs …
Figure 3: Change detection and memory update pipeline.
Figure 4: StateScribe’s overall performance in different dimensions. (a) Performance metrics including …
Figure 5: StateScribe’s (a) mean latency, (b) their breakdown …
Figure 6: Likert scale questions and aggregated responses.
Figure 7: Our study environments, including a shared office, a simulated grocery store and an outdoor courtyard.
Figure 8: Examples of common errors generated by the live and video model.
Figure 9: Examples of how visibility mask filtered potential false positives from consecutive frames.
Figure 10: Example errors and hallucinations generated by StateScribe.
Figure 11: StateScribe’s overall performance on data collected by the nine participants in our user study.
Original abstract

Real-world environments evolve continuously, yet blind and low-vision (BLV) individuals often have limited access to understanding how they change over time. Unexpected or relocated objects, layout modifications, and content updates (e.g., price changes) can introduce safety risks and cognitive burden. While existing visual assistive technologies can describe immediate surroundings, they operate as one-off interactions and lack mechanisms to surface meaningful changes across revisits. Informed by a survey of 33 BLV individuals, we develop StateScribe, a system that supports accessible awareness of real-world changes across revisits. StateScribe employs a dual-layer memory architecture that integrates episodic scene memory and object-centric temporal memory to enable scalable and structured change tracking. It provides both live descriptions of the current scene, and descriptions of what has changed, when and where it occurred across revisits, such as "The shop on your right has a 'CLOSED' sign; it was open at this time last week." Our evaluation shows that StateScribe maintains high accuracy (F1-score=83.1%) across 11 revisits, while remaining low-latency (mean<1.54s) and memory-efficient (<54MB) across 110 revisits. A user study with nine BLV participants demonstrates that StateScribe improves change awareness across revisits in three real-world locations. Finally, we discuss implications for long-term AI-assisted companions that support broader change observation using multimodal sensing, extend beyond changes to other memory capabilities, and adapt to individual users, intents, and contexts.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

3 major / 1 minor

Summary. The paper introduces StateScribe, a system to support blind and low-vision (BLV) users in maintaining awareness of real-world environmental changes across revisits. Informed by a survey of 33 BLV individuals, it employs a dual-layer memory architecture (episodic scene memory combined with object-centric temporal memory) to deliver live scene descriptions alongside change reports specifying what changed, when, and where (e.g., object relocations or content updates). Evaluation reports an F1-score of 83.1% for change detection across 11 revisits, mean latency below 1.54s, and memory usage under 54MB across 110 revisits. A user study with nine BLV participants in three real-world locations indicates improved change awareness.
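
For calibration: F1 is the harmonic mean of precision and recall, so the headline 83.1% fixes only their combination. The decompositions below are hypothetical, not numbers from the paper, but they show why the distinction between false alarms and missed changes matters.

```python
def f1(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)

# Very different error profiles yield the same headline number:
print(f1(0.90, 0.772))   # ~0.831: few false alarms, more missed changes
print(f1(0.772, 0.90))   # ~0.831: the reverse trade-off
print(f1(0.831, 0.831))  # exactly 0.831 when the two are balanced
```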

Significance. If the performance and usability claims are substantiated, the work addresses a clear gap in assistive technologies by moving from one-off scene descriptions to longitudinal change tracking, with potential to reduce safety risks and cognitive burden for BLV users in dynamic environments. The dual-layer architecture provides a structured, scalable approach to memory that could inform future multimodal AI companions. The efficiency metrics support mobile feasibility, and grounding in user survey data strengthens relevance. The user study offers initial evidence of practical value, though broader validation would be needed for long-term deployment implications.

major comments (3)
  1. [Evaluation / Abstract] The headline F1-score of 83.1% for change detection (reported in the abstract and evaluation) rests on unstated criteria for labeling 'meaningful' changes and lacks any protocol for ground-truth annotation or inter-annotator agreement. Without these details, it is impossible to determine whether the dual-layer memory suppresses noise (lighting shifts, transient objects) or overfits to controlled test sequences, directly undermining the claim of reliability across real-world revisits.
  2. [Evaluation] No baseline comparisons are described against simpler single-layer memory, standard change-detection algorithms, or existing visual-assistive tools. This omission makes it impossible to attribute the reported accuracy, latency, and memory efficiency specifically to the episodic-plus-object-centric architecture rather than to the underlying vision models or test conditions.
  3. [User Study] The user study (nine BLV participants, three locations) claims improved change awareness but provides no details on task design, quantitative metrics, statistical tests, or qualitative coding of participant feedback. With such limited scale and diversity, the study cannot yet support generalizable conclusions about real-world utility.
minor comments (1)
  1. [Abstract] The abstract would be clearer if it briefly summarized the key survey findings that motivated the dual-layer design choices.

Simulated Authors' Rebuttal

3 responses · 0 unresolved

We thank the referee for their constructive feedback on our manuscript. We address each of the major comments below and indicate the revisions we will make to improve clarity and rigor.

Point-by-point responses
  1. Referee: [Evaluation / Abstract] The headline F1-score of 83.1% for change detection (reported in the abstract and evaluation) rests on unstated criteria for labeling 'meaningful' changes and lacks any protocol for ground-truth annotation or inter-annotator agreement. Without these details, it is impossible to determine whether the dual-layer memory suppresses noise (lighting shifts, transient objects) or overfits to controlled test sequences, directly undermining the claim of reliability across real-world revisits.

    Authors: We agree that additional details are necessary to substantiate the F1-score. In the revised manuscript, we will expand the evaluation section to include: (1) explicit criteria for 'meaningful' changes, derived from our survey of BLV users (focusing on changes that impact navigation, safety, or awareness, such as object relocations and content updates, while excluding transient elements like lighting variations or moving people); (2) the ground-truth annotation protocol, which involved two researchers independently labeling changes in the revisit sequences with a consensus discussion for disagreements; and (3) inter-annotator agreement metrics (e.g., Cohen's kappa). This will demonstrate that the dual-layer architecture effectively filters noise rather than overfitting to controlled conditions. We will also clarify that the test sequences included real-world variability across 11 revisits in dynamic environments. revision: yes

  2. Referee: [Evaluation] No baseline comparisons are described against simpler single-layer memory, standard change-detection algorithms, or existing visual-assistive tools. This omission makes it impossible to attribute the reported accuracy, latency, and memory efficiency specifically to the episodic-plus-object-centric architecture rather than to the underlying vision models or test conditions.

    Authors: We acknowledge the value of baseline comparisons for isolating the contribution of the dual-layer architecture. In the revision, we will add a new subsection in the evaluation that includes comparisons against: (a) a single-layer memory baseline (using only episodic scene memory), (b) standard change detection methods such as pixel-wise differencing and feature-based approaches (e.g., using CLIP embeddings for similarity), and (c) a simulated existing visual-assistive tool that provides only live descriptions without change tracking. We will report accuracy, latency, and memory usage for these baselines under the same test conditions. This will help attribute performance gains to the object-centric temporal memory component. If space constraints arise, we will prioritize key metrics in the main text and move detailed tables to the appendix. revision: yes

  3. Referee: [User Study] The user study (nine BLV participants, three locations) claims improved change awareness but provides no details on task design, quantitative metrics, statistical tests, or qualitative coding of participant feedback. With such limited scale and diversity, the study cannot yet support generalizable conclusions about real-world utility.

    Authors: We agree that more details on the user study methodology are required. In the revised version, we will elaborate on: (1) task design, including the specific scenarios and change types presented to participants; (2) quantitative metrics used (e.g., accuracy in identifying changes, time to complete tasks, NASA-TLX for cognitive load); (3) statistical tests applied (e.g., paired t-tests or Wilcoxon signed-rank tests for pre/post comparisons); and (4) qualitative coding process for open-ended feedback, including themes identified and inter-coder reliability. We will also expand the discussion to explicitly address the limitations of the small sample size (n=9) and limited locations, framing the results as preliminary evidence and outlining plans for larger-scale studies. This will temper the claims appropriately while highlighting the positive trends observed. revision: yes
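
The methods the rebuttal commits to are standard; minimal sketches of each follow, with every label, score, and threshold invented for illustration.

```python
# Response 1: inter-annotator agreement via Cohen's kappa on binary
# "meaningful change" labels (the labels below are invented).
from sklearn.metrics import cohen_kappa_score

annotator_a = [1, 1, 0, 1, 0, 0, 1, 1]
annotator_b = [1, 0, 0, 1, 0, 1, 1, 1]
print(cohen_kappa_score(annotator_a, annotator_b))  # 1.0 = perfect agreement
```

```python
# Response 2: the two simple baselines named there. The intensity gap
# (30) and both thresholds are illustrative, not tuned values.
import numpy as np

def pixel_diff_changed(frame_a, frame_b, frac=0.05):
    """Pixel-wise differencing: flag a change if enough pixels moved."""
    moved = np.abs(frame_a.astype(int) - frame_b.astype(int)) > 30
    return moved.mean() > frac

def embedding_changed(emb_a, emb_b, sim=0.90):
    """Feature-based baseline: flag a change if whole-frame embeddings
    (e.g. CLIP) drift apart in cosine similarity."""
    cos = emb_a @ emb_b / (np.linalg.norm(emb_a) * np.linalg.norm(emb_b))
    return cos < sim
```

```python
# Response 3: paired pre/post comparison via the Wilcoxon signed-rank
# test, at the study's n=9 (the scores below are invented).
from scipy.stats import wilcoxon

without_aid = [4, 5, 3, 6, 4, 5, 3, 4, 5]   # changes noticed, unaided
with_system = [7, 8, 5, 9, 6, 8, 6, 7, 8]   # changes noticed, with StateScribe
stat, p = wilcoxon(with_system, without_aid)
print(f"W={stat}, p={p:.4f}")                # small p => reliable improvement
```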

Circularity Check

0 steps flagged

No circularity detected; claims rest on external empirical evaluation and user study.

Full rationale

The paper describes a dual-layer memory system for change awareness and reports performance via F1-score on 11 revisits, efficiency metrics across 110 revisits, and a 9-participant user study in three locations. No equations, derivations, fitted parameters renamed as predictions, or self-citation chains appear in the provided text. All quantitative claims are evaluated against independent real-world revisit data and participant feedback rather than reducing to the system's own inputs or definitions by construction. The architecture is presented as an engineering design choice informed by a survey, with no load-bearing uniqueness theorems or ansatzes smuggled via self-citation.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 1 invented entity

The central claim rests on a newly proposed dual-layer memory architecture whose effectiveness is demonstrated through empirical testing rather than derivation from prior equations or fitted constants.

axioms (2)
  • domain assumption: Real-world environments contain discrete, detectable changes that can be meaningfully described to users.
    Invoked in the motivation and system design sections to justify tracking objects and scenes.
  • domain assumption: Multimodal sensing can capture sufficient state for change detection without continuous human annotation.
    Underlies the live description and temporal memory components.
invented entities (1)
  • dual-layer memory architecture (episodic scene memory + object-centric temporal memory) · no independent evidence
    purpose: To enable scalable, structured tracking of changes across multiple revisits.
    Newly introduced in the paper as the core technical contribution.

pith-pipeline@v0.9.0 · 5608 in / 1376 out tokens · 32009 ms · 2026-05-08T05:39:46.879078+00:00 · methodology

Reference graph

Works this paper leans on

75 extracted references · 41 canonical work pages · 1 internal anchor


    If latest_live_description contains close-up details or readable on-object text, preserve those details exactly when you restate them. Do not rewrite the text content. Whenchange_snapshotsIs Empty •One to two sentences, natural spoken English, about 15 words total. •Present tense, static scene description only. •Mention at most 2–3 salient objects fromlat...