MIRAGE: A Micro-Interaction Relational Architecture for Grounded Exploration in Multi-Figure Artworks
Pith reviewed 2026-05-08 06:29 UTC · model grok-4.3
The pith
MIRAGE builds a structured intermediate representation of identities, poses, and gazes to ground VLM narratives about multi-figure artworks.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
MIRAGE constructs a structured intermediate representation capturing identities, pose cues, and gaze hypotheses. By separating spatial grounding from narrative generation, the system enables users to inspect and reason about figure-to-figure relationships through a verifiable evidence layer. Evaluation against painting-only VLM baselines in a blind assessment protocol shows significant gains in identity consistency, reduced relational hallucinations, and increased coverage of subtle interactions.
What carries the argument
The structured intermediate representation that captures identities, pose cues, and gaze hypotheses and serves as a verifiable evidence layer for coordinating relational evidence before narrative generation.
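The paper does not publish the exact format of this representation. As a purely illustrative sketch, the evidence layer could be modeled as typed records keyed by the C*/R*/O* identifiers the system's prompt uses; every name below (`Figure`, `GazeHypothesis`, `Relation`, `EvidenceLayer`, `relations_for`) is hypothetical, not taken from the paper.

```python
from dataclasses import dataclass, field

# Hypothetical schema for MIRAGE's intermediate representation.
# Identifier conventions (C* figures, R* relations, O* objects) follow
# the paper's prompt excerpt; everything else is an assumption.

@dataclass
class Figure:
    id: str                                  # e.g. "C1"
    bbox: tuple                              # normalized (x, y, w, h)
    pose_cues: list = field(default_factory=list)   # e.g. ["leaning", "arm_extended"]

@dataclass
class GazeHypothesis:
    source: str                              # figure id, e.g. "C1"
    target: str                              # figure or object id, e.g. "C2" or "O1"
    confidence: float                        # extractor's belief in [0, 1]

@dataclass
class Relation:
    id: str                                  # e.g. "R0"
    participants: tuple                      # pair of figure/object ids
    kind: str                                # e.g. "proximity", "touch", "object_mediated"
    evidence: list = field(default_factory=list)    # low-level cues backing the relation

@dataclass
class EvidenceLayer:
    figures: list
    gazes: list
    relations: list

    def relations_for(self, figure_id: str) -> list:
        """All relations mentioning a figure: the lookup a narrative
        stage would perform before asserting any interaction."""
        return [r for r in self.relations if figure_id in r.participants]

# Tiny usage example with invented values
layer = EvidenceLayer(
    figures=[Figure("C1", (0.1, 0.2, 0.3, 0.6)), Figure("C2", (0.5, 0.2, 0.3, 0.6))],
    gazes=[GazeHypothesis("C1", "C2", 0.8)],
    relations=[Relation("R0", ("C1", "C2"), "proximity", ["gaze:C1->C2"])],
)
assert [r.id for r in layer.relations_for("C1")] == ["R0"]
```

The point of such a schema is that every narrative claim can be traced back to a record, which is what makes the layer inspectable.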
If this is right
- Users can inspect exactly how high-level interpretations are anchored in low-level visual facts.
- Vision-language models produce descriptions with higher identity consistency across figures.
- Relational hallucinations decline because multiple interaction hypotheses are explicitly reconciled.
- Coverage of subtle cues such as gaze alignment and gesture expands without sacrificing verifiability.
Where Pith is reading between the lines
- The same separation of grounding and narrative stages could be tested on other relational visual domains such as film shots or group photographs.
- Automating extraction of the intermediate representation more robustly would be a direct next engineering step.
- Adding user-controlled editing of the evidence layer might further increase transparency in AI-assisted art analysis.
Load-bearing premise
That the structured intermediate representation can be reliably and accurately extracted from the artworks and that the blind assessment protocol measures genuine improvement in relational understanding rather than surface consistency.
What would settle it
A test set of multi-figure artworks where the intermediate representation is supplied manually yet blind evaluators still record no reduction in relational hallucinations or no increase in traceable interaction coverage.
Original abstract
Appreciating multi-figure paintings requires understanding how characters relate through subtle cues like gaze alignment, gesture, and spatial arrangement. We present MIRAGE, an evidence-centric framework designed to scaffold the exploration of these "micro-interactions" in multi-figure artworks. While such cues are essential for deep narrative appreciation, they are often distributed across complex scenes and difficult for viewers to systematically identify. Existing vision-language models (VLMs) frequently fail to provide reliable assistance, offering ungrounded interpretations that lack traceable visual evidence. MIRAGE addresses this by constructing a structured intermediate representation capturing identities, pose cues, and gaze hypotheses. However, the challenge extends beyond extracting these cues to coordinating them during interpretation. Without an explicit mechanism to organize and reconcile relational evidence, models often collapse multiple interaction hypotheses into a single unstable or weakly grounded narrative, even when low-level signals are available. This representation allows users to verify how high-level interpretations are anchored in low-level visual facts. By separating spatial grounding from narrative generation, MIRAGE enables users to inspect and reason about figure-to-figure relationships through a verifiable evidence layer. We evaluate MIRAGE against painting-only VLM baselines using a blind assessment protocol. Results show that MIRAGE significantly improves identity consistency, reduces relational hallucinations, and increases the coverage of subtle interactions. These findings suggest that structured grounding can serve as a critical interaction control layer, providing the necessary scaffolding for a more reliable, transparent, and human-led understanding of complex visual narratives.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper presents MIRAGE, an evidence-centric framework for exploring micro-interactions in multi-figure artworks. It constructs a structured intermediate representation capturing identities, pose cues, and gaze hypotheses, then separates spatial grounding from narrative generation to provide verifiable evidence layers for VLM-based interpretation. The central claim is that this architecture significantly improves identity consistency, reduces relational hallucinations, and increases coverage of subtle interactions relative to painting-only VLM baselines, as shown via a blind assessment protocol.
Significance. If the extraction of the intermediate representation proves reliable and the evaluation protocol validly isolates architectural gains, MIRAGE could supply a practical control layer for relational reasoning in complex visual scenes, with potential applicability beyond art to other grounded narrative tasks. The approach explicitly addresses a documented weakness of current VLMs in handling distributed cues like gaze and gesture.
major comments (3)
- [Abstract and Evaluation section] The claim of 'significant improvements' in identity consistency, reduced hallucinations, and increased coverage is asserted without any reported numbers, dataset description, baseline implementation details, or error analysis. This is load-bearing because the central empirical claim cannot be assessed or reproduced from the provided information.
- [Method section] The pipeline for constructing the structured intermediate representation (identities, pose cues, gaze hypotheses) is not specified—e.g., whether extraction uses learned detectors, manual annotation, or VLM prompting. This is load-bearing for the central claim: downstream gains in consistency and hallucination reduction cannot be attributed to the separation mechanism if the input representation quality is unverified or oracle-dependent.
- [Evaluation section] The blind assessment protocol is undescribed (rater instructions, metrics for 'relational hallucinations,' number of evaluators, prompt templates for baselines). This prevents determining whether measured gains reflect genuine relational understanding rather than surface consistency of supplied evidence.
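Neither 'identity consistency' nor 'relational hallucination' is defined in the provided text. Purely as an illustration of how such measures could be operationalized, not as the paper's definitions, one might score narratives like this (both function names and formulas are assumptions):

```python
def identity_consistency(mentions: list) -> float:
    """Fraction of figure mentions that reuse an already-introduced
    identifier: a crude proxy for 'the narrative keeps figures straight'.
    Illustrative only; the paper does not define its metric."""
    seen = set()
    reused = 0
    for m in mentions:
        if m in seen:
            reused += 1
        seen.add(m)
    return reused / len(mentions) if mentions else 1.0

def hallucination_rate(claimed: set, grounded: set) -> float:
    """Share of claimed (subject, relation, object) triples that have
    no support in the grounded evidence layer."""
    if not claimed:
        return 0.0
    return len(claimed - grounded) / len(claimed)

# Invented example values
assert abs(identity_consistency(["C1", "C2", "C1", "C2", "C1"]) - 0.6) < 1e-9
claimed = {("C1", "touches", "C2"), ("C1", "gazes_at", "C2")}
grounded = {("C1", "gazes_at", "C2")}
assert hallucination_rate(claimed, grounded) == 0.5
```

Reporting definitions at this level of precision is exactly what the comment above asks the authors to supply.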
minor comments (1)
- [Abstract] The abstract would benefit from a single quantitative highlight (e.g., 'X% reduction in hallucinations on Y artworks') to allow immediate assessment of the scale of claimed gains.
Simulated Author's Rebuttal
We thank the referee for the detailed and constructive review. The comments highlight important areas where additional clarity and detail will strengthen the paper. We address each major comment below and have revised the manuscript to incorporate the requested information.
Point-by-point responses
Referee: [Abstract and Evaluation section] The claim of 'significant improvements' in identity consistency, reduced hallucinations, and increased coverage is asserted without any reported numbers, dataset description, baseline implementation details, or error analysis. This is load-bearing because the central empirical claim cannot be assessed or reproduced from the provided information.
Authors: We agree that the abstract and evaluation section would benefit from explicit quantitative support for the claims. In the revised manuscript, we have expanded the Evaluation section to report specific metrics (including identity consistency scores, hallucination rates, and interaction coverage percentages), a full dataset description, baseline implementation details, and an error analysis. The abstract has been updated to reference these key quantitative findings. These additions make the empirical claims verifiable and reproducible. Revision: yes.
Referee: [Method section] The pipeline for constructing the structured intermediate representation (identities, pose cues, gaze hypotheses) is not specified—e.g., whether extraction uses learned detectors, manual annotation, or VLM prompting. This is load-bearing for the central claim: downstream gains in consistency and hallucination reduction cannot be attributed to the separation mechanism if the input representation quality is unverified or oracle-dependent.
Authors: We agree that the method section requires a more explicit description of the construction pipeline. We have revised the Method section to fully specify the processes used to build the structured intermediate representation, including the techniques applied to identities, pose cues, and gaze hypotheses, as well as the verification steps employed to ensure the representation is reliable and independent of oracle-level inputs. This revision enables readers to attribute performance gains to the architectural separation rather than unverified input quality. Revision: yes.
Referee: [Evaluation section] The blind assessment protocol is undescribed (rater instructions, metrics for 'relational hallucinations,' number of evaluators, prompt templates for baselines). This prevents determining whether measured gains reflect genuine relational understanding rather than surface consistency of supplied evidence.
Authors: We acknowledge that the blind assessment protocol needs a more complete description to support interpretation of the results. In the revised Evaluation section, we now provide the rater instructions, the exact metrics and definitions used for relational hallucinations, the number of evaluators and their qualifications, the prompt templates for the baselines, and inter-rater agreement statistics. These additions clarify that the measured gains arise from the MIRAGE architecture rather than surface-level consistency. Revision: yes.
Circularity Check
No circularity: system description with independent empirical evaluation
full rationale
The paper presents MIRAGE as an evidence-centric framework that constructs a structured intermediate representation (identities, pose cues, gaze hypotheses) and separates spatial grounding from narrative generation. No mathematical derivations, equations, fitted parameters, or self-citation chains appear in the provided text. The central claims rest on architectural design choices and results from a blind assessment protocol against VLM baselines, which constitute external comparison rather than reduction to the inputs by construction. The extraction mechanism and protocol details are left unspecified, but this is an assumption gap, not a circularity in the derivation chain.
Axiom & Free-Parameter Ledger
axioms (1)
- Domain assumption: Vision-language models can be improved for relational tasks by inserting an explicit, inspectable evidence layer between perception and narrative generation.
invented entities (1)
- MIRAGE structured intermediate representation (no independent evidence)
Prompt excerpt (grounding guidelines reproduced from the paper)
- Treat MIRAGE as the primary evidence layer: prioritize resolved gaze and posture; use relation records (R*) and object anchors (O*) as primary references; treat geometry and intermediate outputs as supporting or conflicting evidence.
- Base all claims on grounded visual evidence: use gaze, posture, gesture, touch, proximity, overlap, and object-centered attention; do NOT introduce information not supported by the grounding document or image.
- Handle ambiguity explicitly: if multiple interpretations are plausible, state them; if evidence conflicts, describe the conflict instead of resolving it silently.
- Use structured references: refer to characters as C1, C2, ..., to relations as R0, R1, ..., and to objects as O1, O2, ...
- Limit unsupported inference: do not invent intentions or narratives beyond available evidence; if evidence is weak or insufficient, state the uncertainty.
- Contextual knowledge: general art knowledge may be used only to support or interpret grounded evidence, not to replace it.

Response format: (1) Claim, (2) Supporting evidence (with explicit references to C*, R*, O*), (3) Optional contextual interpretation (if relevant), (4) Uncertainty or ambiguity (if any). Interaction protocol: treat the user's input as a h...

Worked example
- Claim: There is no confirmed direct contact between C1 and C2.
- Supporting evidence: C1 grips O1 (inner_tube), not C2 directly; C2's body is supported via O1; proximity between C1's hands and C2's leg is high, but no verified overlap.
- Optional interpretation: This suggests object-mediated support rather than direct physical interaction.
- Uncertainty: The proximity may visually suggest contact, but grounding does not confirm it.
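The response format's requirement that every claim cite explicit C*/R*/O* references suggests a mechanical check the evidence layer could support. The sketch below is an assumption, not the paper's published verification mechanism: it only checks that cited identifiers exist in the grounding document, using an invented helper `unresolved_references`.

```python
import re

def unresolved_references(claim_text: str, known_ids: set) -> set:
    """Return every C*/R*/O* identifier cited in a claim that is absent
    from the evidence layer. Illustrative sketch; MIRAGE's actual
    verification mechanism is not described in the provided text."""
    cited = set(re.findall(r"\b[CRO]\d+\b", claim_text))
    return cited - known_ids

# Invented grounding document with the identifiers from the worked example
known = {"C1", "C2", "O1", "R0"}

# A fully grounded claim resolves cleanly
assert unresolved_references("C1 grips O1, not C2 directly; see R0.", known) == set()

# A claim citing a nonexistent figure is flagged
assert unresolved_references("C3 embraces C1", known) == {"C3"}
```

A check like this catches reference-level hallucinations (citing figures that do not exist), though not semantic ones (misdescribing a relation that does exist), which is why the blind human assessment would still be needed.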