pith. machine review for the scientific record.

arxiv: 2604.21144 · v1 · submitted 2026-04-22 · 💻 cs.CL · cs.AI · cs.HC

Recognition: unknown

Using Machine Mental Imagery for Representing Common Ground in Situated Dialogue

Authors on Pith no claims yet

Pith reviewed 2026-05-09 23:40 UTC · model grok-4.3

classification 💻 cs.CL · cs.AI · cs.HC
keywords situated dialogue · common ground · visual scaffolding · representational blur · multimodal representations · incremental processing · conversational agents · mental imagery

The pith

Conversational agents track shared context better by incrementally building persistent visual representations from dialogue.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper claims that text-only dialogue systems collapse distinct entities into vague descriptions over time, creating an illusion of understanding called representational blur. It tests whether adding the ability to generate depictive visual images step by step during conversation creates a concrete record that agents can later retrieve. On the IndiRef benchmark, this incremental visual approach outperforms reasoning over the full text history, and combining visuals with text works best overall because some information cannot be shown in pictures. A sympathetic reader would care because reliable common-ground tracking is a basic requirement for any agent that must stay coherent across long, situated exchanges rather than treating each turn in isolation.

Core claim

The paper shows that an active visual scaffolding framework, which converts ongoing dialogue state into a growing set of depictive visual representations, reduces representational blur by forcing concrete scene commitments. Incremental externalization alone beats full-dialog text reasoning, visual scaffolding adds further gains, and a hybrid multimodal setup that keeps text for non-depictable content achieves the highest performance on the IndiRef benchmark.

What carries the argument

Active visual scaffolding: the incremental conversion of dialogue state into a persistent visual history that can be retrieved for grounded response generation.
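
The mechanism is concrete enough to sketch. Below is a minimal reading of the Phase 1 pipeline described in Figure 3, assuming hypothetical Observer, Constructor, and Linker callables backed by a multimodal model; the data structures and signatures are illustrative assumptions, not the authors' implementation.

```python
# A minimal sketch of incremental externalization (Phase 1 of Figure 3), assuming
# hypothetical observer/constructor/linker callables; not the authors' code.
from dataclasses import dataclass, field
from typing import Callable, List, Tuple


@dataclass
class Artifact:
    turn_id: int
    image: bytes   # depictive scene state (an alpha_t in Figure 3)
    summary: str   # propositional residue for non-depictable content


@dataclass
class MemoryBank:
    artifacts: List[Artifact] = field(default_factory=list)
    triplets: List[Tuple[str, str, str]] = field(default_factory=list)  # cross-scene tau_t


def externalize(
    dialogue: List[str],
    observer: Callable[[str, MemoryBank], bool],
    constructor: Callable[[str, MemoryBank], Tuple[bytes, str]],
    linker: Callable[[Artifact, MemoryBank], List[Tuple[str, str, str]]],
) -> MemoryBank:
    """Incrementally externalize dialogue turns into a persistent visual history."""
    memory = MemoryBank()
    for t, turn in enumerate(dialogue):
        if not observer(turn, memory):               # e.g. skip phatic utterances (Figure 12)
            continue
        image, summary = constructor(turn, memory)   # commit to a concrete scene
        artifact = Artifact(t, image, summary)
        memory.artifacts.append(artifact)
        memory.triplets.extend(linker(artifact, memory))  # link entities across scenes
    return memory
```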

If this is right

  • Incremental externalization of dialogue state improves performance over full-dialog text reasoning even without visuals.
  • Visual scaffolding reduces representational blur by enforcing concrete commitments to scene details.
  • Textual representations remain preferable for non-depictable information such as abstract or internal states.
  • A hybrid multimodal representation of common ground that combines depictive and propositional content yields the best overall results (see the sketch below).
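
For the hybrid claim in the last bullet, the Phase 2 flow from Figure 3 (plan, selective retrieval, grounded answer) can be sketched the same way. The `planner`, `retriever`, and `reader` callables below are assumed stand-ins for the paper's Reasoner and retrieval machinery, and the plan P = {(c, i)} is simplified to a list of retrieval instructions; the artifact fields reuse the Phase 1 sketch above.

```python
# A rough sketch of answering under the hybrid condition: each retrieved artifact
# contributes both its rendered scene and its text summary, since abstract or
# internal states cannot be shown in the image alone. All callables are hypothetical.
def answer(question, memory, planner, retriever, reader):
    plan = planner(question)                     # ordered retrieval instructions
    context = []
    for instruction in plan:
        for artifact in retriever(instruction, memory):
            context.append(artifact.image)       # depictive evidence
            context.append(artifact.summary)     # propositional evidence
    return reader(question, context)             # grounded response
```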

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • The same incremental visual scaffolding could be applied to embodied agents that must maintain consistent beliefs about a physical environment across multiple interactions.
  • Testing the framework on longer, multi-session dialogues would reveal whether the visual history prevents drift in common ground more effectively than current context-window methods.
  • The approach suggests a general principle that forcing intermediate representations into a different modality can mitigate compression losses that occur in single-modality reasoning chains.

Load-bearing premise

Multimodal models can generate accurate visual representations from dialogue without introducing errors or biases that outweigh the gains over text-only methods.

What would settle it

An experiment on a new set of dialogues containing similar but distinct entities where visual scaffolding produces more misidentifications or hallucinations than a pure text baseline.
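
As a rough sketch of how such a test could be scored, assume each benchmark answer has been judged per condition with an invented `misidentified` flag; none of this is the paper's evaluation code.

```python
# Hypothetical scoring for the falsification test: compare misidentification rates
# between the visual-scaffolding and text-only conditions on entity-heavy dialogues.
from collections import Counter


def misidentification_rate(judgements):
    """judgements: iterable of dicts like {"condition": "visual", "misidentified": True}."""
    totals, errors = Counter(), Counter()
    for j in judgements:
        totals[j["condition"]] += 1
        errors[j["condition"]] += bool(j["misidentified"])
    return {c: errors[c] / totals[c] for c in totals}
```

The claim would be undermined if, on dialogues built around similar but distinct entities, the visual condition's rate clearly exceeded the text-only rate.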

Figures

Figures reproduced from arXiv: 2604.21144 by Biswesh Mohapatra, Giovanni Duca, Justine Cassell, Laurent Romary.

Figure 1: As the human describes the scene, the agent incrementally constructs and updates a mental image of the established …
Figure 2: IndiRef question examples. The IndiRef Benchmark. To evaluate the established common ground, we utilize the IndiRef benchmark, which augments the MU! transcripts with granular Question-Answer (QA) pairs designed to probe specific referential failures. The benchmark evaluates agents across four semantic dimensions: Temporal (relying on chronological event sequences), Spatial (relying on spatial reasoning), …
Figure 3: Two-phase pipeline. Phase 1 (left): the Observer, Constructor, and Linker incrementally externalize the dialogue D into artifacts α_t and cross-scene triplets τ_t, stored in the memory bank M. Phase 2 (right): given question Q, the Reasoner produces a plan P = {(c, i)} and selectively retrieves from M to produce ŷ, evaluated by the Judge against y*. Textual Condition: α_k = S_k, where S_k is a dense propo…
Figure 4: A representative example of incremental common-ground externalization. Left: verbatim dialogue turns together …
Figure 5: Performance by reasoning scope. Left: overall accu…
Figure 6: Left: B uses a clarification question to check if A shares the same environment, implicitly revealing what B sees. Right: …
Figure 7: Logistic Regression showing probability of a correct …
Figure 8: B's exploration of the bathroom involves conflicting …
Figure 9: B uses a negative clarification question (…
Figure 10: Turn 1: Initialization of the visual state based on …
Figure 11: Turn 2: The Observer detects an elaboration of the …
Figure 12: Turn 4: The Observer identifies a phatic utterance …
Figure 13: Turn 5: The user provides a detailed refinement …
Figure 15: Condensed prompt for the Constructor module in …
Figure 16: Condensed prompt for the Constructor module …
Figure 18: Condensed prompt for the Linker module. The full …
Figure 20: Condensed prompt for the Judge module. The full …
Figure 21: Full prompt for the Annotator, used for the quan…
read the original abstract

Situated dialogue requires speakers to maintain a reliable representation of shared context rather than reasoning only over isolated utterances. Current conversational agents often struggle with this requirement, especially when the common ground must be preserved beyond the immediate context window. In such settings, fine-grained distinctions are frequently compressed into purely textual representations, leading to a critical failure mode we call representational blur, in which similar but distinct entities collapse into interchangeable descriptions. This semantic flattening creates an illusion of grounding, where agents appear locally coherent but fail to track shared context persistently over time. Inspired by the role of mental imagery in human reasoning, and based on the increased availability of multimodal models, we explore whether conversational agents can be given an analogous ability to construct some depictive intermediate representations during dialogue to address these limitations. Thus, we introduce an active visual scaffolding framework that incrementally converts dialogue state into a persistent visual history that can later be retrieved for grounded response generation. Evaluation on the IndiRef benchmark shows that incremental externalization itself improves over full-dialog reasoning, while visual scaffolding provides additional gains by reducing representational blur and enforcing concrete scene commitments. At the same time, textual representations remain advantageous for non-depictable information, and a hybrid multimodal setting yields the best overall performance. Together, these findings suggest that conversational agents benefit from an explicitly multimodal representation of common ground that integrates depictive and propositional information.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper claims that an active visual scaffolding framework, inspired by human mental imagery, allows conversational agents to incrementally externalize dialogue state into persistent visual representations to better maintain common ground in situated dialogue. It identifies 'representational blur' as a failure mode of purely textual representations and reports that on the IndiRef benchmark, incremental externalization improves over full-dialog reasoning, visual scaffolding yields additional gains by reducing blur and enforcing concrete commitments, textual representations remain useful for non-depictable information, and hybrid multimodal settings achieve the best overall performance.

Significance. If the empirical results hold, the work could meaningfully advance situated dialogue systems by demonstrating the value of explicitly multimodal common-ground representations that combine depictive and propositional elements. Strengths include the use of an external benchmark (IndiRef) and comparisons against textual baselines, which support reproducibility and allow direct assessment of the incremental and hybrid contributions.

major comments (2)
  1. [Abstract] The reported benchmark gains on IndiRef are presented without implementation details, statistical analysis, controls for confounding factors, or error analysis, preventing full assessment of whether incremental externalization and visual scaffolding actually reduce representational blur or deliver the claimed improvements.
  2. [Evaluation] The central assumption that multimodal models can reliably construct accurate, depictive intermediate representations from dialogue without introducing offsetting errors or biases requires concrete validation through error analysis or ablation studies on IndiRef.
minor comments (2)
  1. [Introduction] The term 'representational blur' is introduced in the abstract but would benefit from a formal definition or illustrative example early in the introduction to clarify its distinction from standard context-loss issues.
  2. A diagram of the active visual scaffolding pipeline (incremental externalization, retrieval, and hybrid generation) would improve clarity of the framework.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the insightful and constructive feedback. We address each major comment below with clarifications from the manuscript and outline the revisions we will make.

read point-by-point responses
  1. Referee: [Abstract] The reported benchmark gains on IndiRef are presented without implementation details, statistical analysis, controls for confounding factors, or error analysis, preventing full assessment of whether incremental externalization and visual scaffolding actually reduce representational blur or deliver the claimed improvements.

    Authors: The abstract is necessarily concise, but the Evaluation section details the IndiRef benchmark setup, incremental vs. full-dialog baselines, hybrid vs. unimodal ablations, and quantitative gains. To improve accessibility, we will revise the abstract to briefly note the experimental controls and the statistical significance of the improvements, and to point to the error analysis in the body. We will also expand the abstract's reference to representational blur reduction with a short supporting clause. revision: yes

  2. Referee: [Evaluation] The central assumption that multimodal models can reliably construct accurate, depictive intermediate representations from dialogue without introducing offsetting errors or biases requires concrete validation through error analysis or ablation studies on IndiRef.

    Authors: The Evaluation section already reports ablation results across text-only, visual-only, and hybrid conditions on IndiRef, with consistent gains for visual scaffolding that support the claim of reduced blur. These comparisons serve as indirect validation by showing net benefits. We agree that direct validation is valuable and will add a new error-analysis subsection with qualitative examples of generated visual states, quantitative accuracy metrics against ground-truth scenes, and checks for systematic biases on the benchmark. revision: yes

Circularity Check

0 steps flagged

No significant circularity detected

full rationale

The paper proposes an active visual scaffolding framework for maintaining common ground in situated dialogue, drawing inspiration from human mental imagery and leveraging multimodal models. It evaluates this approach empirically on the external IndiRef benchmark, demonstrating gains from incremental externalization and visual scaffolding over full-dialog reasoning and textual baselines, with hybrid multimodal settings performing best. No load-bearing derivations, equations, fitted parameters renamed as predictions, uniqueness theorems, or self-citations are present in the provided text. The central claims rest on experimental comparisons against independent baselines rather than reducing to self-definition or internal fitting, making the derivation chain self-contained.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 1 invented entity

The central claim depends on assumptions about multimodal model capabilities for accurate visual depiction and the benchmark's validity for measuring persistent grounding; no free parameters or invented physical entities are evident from the abstract.

axioms (2)
  • domain assumption Multimodal models can construct depictive intermediate representations from dialogue that accurately reflect shared visual context.
    Invoked as the basis for the visual scaffolding framework to address representational blur.
  • domain assumption Persistent visual history can be retrieved and integrated with textual information to improve grounded response generation.
    Underlies the claim that hybrid multimodal settings yield the best performance.
invented entities (1)
  • representational blur no independent evidence
    purpose: To name the failure mode in which similar but distinct entities collapse into interchangeable textual descriptions.
    Newly introduced term to characterize the limitation of pure textual common ground representations.

pith-pipeline@v0.9.0 · 5553 in / 1421 out tokens · 36950 ms · 2026-05-09T23:40:09.753436+00:00 · methodology

discussion (0)

