{"work":{"id":"7efbc2dd-b0f2-4f71-bb1c-d2fcf110d805","openalex_id":null,"doi":null,"arxiv_id":"2733.2024","raw_key":null,"title":"V*: Guided visual search as a core mechanism in multimodal llms","authors":null,"authors_text":"Penghao Wu and Saining Xie","year":2024,"venue":null,"abstract":null,"external_url":"https://arxiv.org/abs/2733.2024","cited_by_count":null,"metadata_source":"arxiv_reference","metadata_fetched_at":"2026-06-29T12:23:23.716760+00:00","pith_arxiv_id":null,"created_at":"2026-05-09T18:55:07.578171+00:00","updated_at":"2026-06-29T12:23:23.716760+00:00","title_quality_ok":false,"display_title":"Emogen: Emotional image content generation with text-to-image diffusion models","render_title":"Emogen: Emotional image content generation with text-to-image diffusion models"},"hub":{"state":{"work_id":"7efbc2dd-b0f2-4f71-bb1c-d2fcf110d805","tier":"super_hub","tier_reason":"100+ Pith inbound or 10,000+ external citations","pith_inbound_count":243,"external_cited_by_count":null,"distinct_field_count":20,"first_pith_cited_at":"2025-01-27T15:44:02+00:00","last_pith_cited_at":"2026-06-26T16:57:40+00:00","author_build_status":"needed","summary_status":"needed","contexts_status":"needed","graph_status":"needed","ask_index_status":"needed","reader_status":"not_needed","recognition_status":"not_needed","updated_at":"2026-06-29T12:38:46.218565+00:00","tier_text":"super_hub"},"tier":"super_hub","role_counts":[{"context_role":"background","n":83},{"context_role":"dataset","n":6},{"context_role":"baseline","n":2},{"context_role":"method","n":2}],"polarity_counts":[{"context_polarity":"background","n":85},{"context_polarity":"use_dataset","n":4},{"context_polarity":"baseline","n":2},{"context_polarity":"use_method","n":2}],"runs":{"ask_index":{"job_type":"ask_index","status":"succeeded","result":{"title":"Emogen: Emotional image content generation with text-to-image diffusion models","claims":[{"claim_text":"Representative examples of each observation type across domains are summarized in Table 1. Similarly, we group domain-specific action types into: Mouse/touch and keyboard:Low-level screen coordinate-based actions such as moving the cursor, tapping on coordinates, or typing text using the keyboard. These simulate typical human input across platforms [59, 120, 152]. Direct UI access:Actions targeted at specific UI elements using structured identifiers (like HTML tags or accessibility IDs) [10, 48,","claim_type":"background","confidence":0.9,"evidence_strength":"citation_context"},{"claim_text":"jectory prediction models [153, 154, 155, 156, 157, 158, 159] can improve A V operation in dense, crowded environments. This survey discusses interaction modeling across various ap- proaches, highlighting its benefits in developing socially com- patible trajectory prediction systems. Another line of ap- proaches focuses on intention-aware models. These methods [160, 161, 162, 163, 164, 165, 166] incorporate the vehicle's maneuver intentions to predict its future states. Maneuvers are defined as ","claim_type":"background","confidence":0.9,"evidence_strength":"citation_context"},{"claim_text":"the model generates plausible forecasts. Gilles et al. [132] intro- duced a heatmap-driven method that encodes spatial reasoning and agent interactions, producing interpretable likelihood maps for socially-aware multi-agent sampling. We elaborate further on learning-based methods with latent spaces in Sections 3.4.3 and 3.4.4. On more recent approaches,diffusion models[136, 137, 138, 139, 140, 141, 142, 143, 144, 145, 146] have been actively utilized in trajectory forecasting. These models follo","claim_type":"background","confidence":0.9,"evidence_strength":"citation_context"},{"claim_text":"tisements (C) [113], Political leaning (C) [114] Persuasion Rallying a Crowd (RAC) (C) [115], QPS (R) [116], ImageArg (C) [117], Persuasive meme (C) [118],Pitts Ads Dataset (C) [119], Persuasive Por- traits of Politicians (Ra) [120] Visual Narrative MPII (Cap) [121], MovieBook (Ret) [122], MovieQA (Q) [123], DramaQA (Q) [124], SF20K (Q) [125], ObyGaze12 (C) [126], MovieNet (C) [127], LVU (C) [66], HLVU (Q) [93], Mementos (Cap) [128] 6 hence the downstream task of predicting the viral potential o","claim_type":"dataset","confidence":0.9,"evidence_strength":"citation_context"},{"claim_text":"SMPQA Geigle et al. [53] 11 28.9 61.3 76.9=7K=50 synthetic M3Exam Zhang et al. [167] 9 21.4 64.6 65.1≠2.8K≠3.1K general knowledge EXAMS-V Das et al. [33] 11 22.6 61.3 76.4≠5K≠1.2K general knowledge WorldCuisines Winata et al. [150] 24 44.8 65.5* 88.2=1M≠6K culture MLMemes Dimitrov et al. [40] 4 8.2 59.5 43.1≠25.3K≠10.8K persuation techniques xMMMU Yue et al. [160] 7 18.0 62.0 69.2=2.6K=300 semantics MTVQA Tang et al. [136] 10 16.4 62.0 77.9≠28.6K≠8.7K semantics CVQA Romero et al. [119] 31 40.9 6","claim_type":"dataset","confidence":0.9,"evidence_strength":"citation_context"},{"claim_text":"originally embedded by the legitimate user. Consequently, the watermark verification mechanism will erroneously attribute xadv to the user. This attack succeeds without requiring access to the original generation model or knowledge of the watermark method, highlighting a fundamental vulnerability in latent-based watermark schemes. 3.3 Compression Sensing Compressed sensing (CS) [ 3, 9] is a signal acquisition paradigm that enables direct acquisition of a compressed representation y∈R M of a high","claim_type":"background","confidence":0.9,"evidence_strength":"citation_context"}],"why_cited":"Pith tracks Emogen: Emotional image content generation with text-to-image diffusion models because it crossed a citation-hub threshold. Current citing contexts most often use it as background evidence (83 contexts).","role_counts":[{"n":83,"context_role":"background"},{"n":6,"context_role":"dataset"},{"n":2,"context_role":"baseline"},{"n":2,"context_role":"method"}]},"error":null,"updated_at":"2026-06-05T21:50:19.955169+00:00"},"author_expand":{"job_type":"author_expand","status":"succeeded","result":{"authors_linked":[]},"error":null,"updated_at":"2026-06-05T21:50:19.970108+00:00"},"context_extract":{"job_type":"context_extract","status":"succeeded","result":{"enqueued_papers":25},"error":null,"updated_at":"2026-05-14T06:47:24.207737+00:00"},"graph_features":{"job_type":"graph_features","status":"succeeded","result":{"co_cited":[{"title":"MambaVision: A hybrid Mamba- Transformer vision backbone","work_id":"d0e5199d-8907-47b1-905a-07ab8b623a4c","shared_citers":24},{"title":"& Vondrick, C","work_id":"b8a8bb9e-1d31-40e2-9cab-ae21e338dde6","shared_citers":24},{"title":"In: 2023 IEEE/CVF Conference on Com- puter Vision and Pattern Recognition (CVPR)","work_id":"b9701eca-d05e-4d2e-9045-6761df4ba175","shared_citers":17},{"title":"Masked autoencoders are scalable vision learners","work_id":"0a23d1b7-bd56-43cc-8a80-7c43ce994e1e","shared_citers":14},{"title":"Qwen3-VL Technical Report","work_id":"1fe243aa-e3c0-4da6-b391-4cbcfc88d5c0","shared_citers":12},{"title":"IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) , year =","work_id":"9da51225-b7bd-4032-b7db-ca577971dafe","shared_citers":11},{"title":"In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)","work_id":"7083a41e-5666-435b-ab26-c753f6490b9a","shared_citers":11},{"title":"Qwen2.5-VL Technical Report","work_id":"69dffacb-bfe8-442d-be86-48624c60426f","shared_citers":11},{"title":"URL https://doi.org/10.48550/arXiv","work_id":"5c2060c6-427c-4321-be22-49ccae439d80","shared_citers":10},{"title":"Editing conditional radiance fields","work_id":"3820f598-11b0-45c3-8c99-0079181ac0a7","shared_citers":8},{"title":"Qwen3 Technical Report","work_id":"25a4e30c-1232-48e7-9925-02fa12ba7c9e","shared_citers":8},{"title":"Gemini 2.5: Pushing the Frontier with Advanced Reasoning, Multimodality, Long Context, and Next Generation Agentic Capabilities","work_id":"008df105-2fdd-45d8-857a-8e35868aecb6","shared_citers":6},{"title":"GPT-4 Technical Report","work_id":"b928e041-6991-4c08-8c81-0359e4097c7b","shared_citers":6},{"title":"InternVL3.5: Advancing Open-Source Multimodal Models in Versatility, Reasoning, and Efficiency","work_id":"b8f5e260-fff5-444e-bcf5-2c42cfefd83d","shared_citers":6},{"title":"Wan: Open and Advanced Large-Scale Video Generative Models","work_id":"ad3ebc3b-4224-46c9-b61d-bcf135da0a7c","shared_citers":6},{"title":"2024 , url =","work_id":"be79e919-e91f-4ecb-8b06-6b3091bc58b1","shared_citers":5},{"title":"DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models","work_id":"c5006563-f3ec-438a-9e35-b7b484f34828","shared_citers":5},{"title":"Kerbl, G","work_id":"d1e854f7-f01a-46d6-bc88-fe9fe914b4f3","shared_citers":5},{"title":"LLaVA-OneVision: Easy Visual Task Transfer","work_id":"f5f2452b-f2a9-49ac-b38d-c76e18cdfe49","shared_citers":5},{"title":"OpenAI GPT-5 System Card","work_id":"ca87689a-0d29-4476-b504-b65dbbb08af4","shared_citers":5},{"title":"why should I trust you?","work_id":"238df2e4-a3e5-46f3-860e-3ae2b0094b97","shared_citers":5},{"title":"Conditional prompt learning for vision- language models","work_id":"025819dc-724a-4ff8-ba0a-0ba72c046d8c","shared_citers":4},{"title":"DINOv2: Learning Robust Visual Features without Supervision","work_id":"26b304e5-b54a-4f26-be7e-83299eca52e4","shared_citers":4},{"title":"Distilling the Knowledge in a Neural Network","work_id":"d927ab1f-17b8-4002-9d09-c3d55764fbad","shared_citers":4}],"time_series":[{"n":82,"year":2026}],"dependency_candidates":[]},"error":null,"updated_at":"2026-05-14T06:57:32.804973+00:00"},"identity_refresh":{"job_type":"identity_refresh","status":"succeeded","result":{"items":[{"title":"Qwen3 Technical Report","outcome":"unchanged","work_id":"25a4e30c-1232-48e7-9925-02fa12ba7c9e","resolver":"local_arxiv","confidence":0.98,"old_work_id":"25a4e30c-1232-48e7-9925-02fa12ba7c9e"}],"counts":{"fixed":0,"merged":0,"unchanged":1,"quarantined":0,"needs_external_resolution":0},"errors":[],"attempted":1},"error":null,"updated_at":"2026-05-14T06:47:40.651118+00:00"},"role_polarity":{"job_type":"role_polarity","status":"succeeded","result":{"title":"Emogen: Emotional image content generation with text-to-image diffusion models","claims":[{"claim_text":"Representative examples of each observation type across domains are summarized in Table 1. Similarly, we group domain-specific action types into: Mouse/touch and keyboard:Low-level screen coordinate-based actions such as moving the cursor, tapping on coordinates, or typing text using the keyboard. These simulate typical human input across platforms [59, 120, 152]. Direct UI access:Actions targeted at specific UI elements using structured identifiers (like HTML tags or accessibility IDs) [10, 48,","claim_type":"background","confidence":0.9,"evidence_strength":"citation_context"},{"claim_text":"jectory prediction models [153, 154, 155, 156, 157, 158, 159] can improve A V operation in dense, crowded environments. This survey discusses interaction modeling across various ap- proaches, highlighting its benefits in developing socially com- patible trajectory prediction systems. Another line of ap- proaches focuses on intention-aware models. These methods [160, 161, 162, 163, 164, 165, 166] incorporate the vehicle's maneuver intentions to predict its future states. Maneuvers are defined as ","claim_type":"background","confidence":0.9,"evidence_strength":"citation_context"},{"claim_text":"the model generates plausible forecasts. Gilles et al. [132] intro- duced a heatmap-driven method that encodes spatial reasoning and agent interactions, producing interpretable likelihood maps for socially-aware multi-agent sampling. We elaborate further on learning-based methods with latent spaces in Sections 3.4.3 and 3.4.4. On more recent approaches,diffusion models[136, 137, 138, 139, 140, 141, 142, 143, 144, 145, 146] have been actively utilized in trajectory forecasting. These models follo","claim_type":"background","confidence":0.9,"evidence_strength":"citation_context"},{"claim_text":"tisements (C) [113], Political leaning (C) [114] Persuasion Rallying a Crowd (RAC) (C) [115], QPS (R) [116], ImageArg (C) [117], Persuasive meme (C) [118],Pitts Ads Dataset (C) [119], Persuasive Por- traits of Politicians (Ra) [120] Visual Narrative MPII (Cap) [121], MovieBook (Ret) [122], MovieQA (Q) [123], DramaQA (Q) [124], SF20K (Q) [125], ObyGaze12 (C) [126], MovieNet (C) [127], LVU (C) [66], HLVU (Q) [93], Mementos (Cap) [128] 6 hence the downstream task of predicting the viral potential o","claim_type":"dataset","confidence":0.9,"evidence_strength":"citation_context"},{"claim_text":"SMPQA Geigle et al. [53] 11 28.9 61.3 76.9=7K=50 synthetic M3Exam Zhang et al. [167] 9 21.4 64.6 65.1≠2.8K≠3.1K general knowledge EXAMS-V Das et al. [33] 11 22.6 61.3 76.4≠5K≠1.2K general knowledge WorldCuisines Winata et al. [150] 24 44.8 65.5* 88.2=1M≠6K culture MLMemes Dimitrov et al. [40] 4 8.2 59.5 43.1≠25.3K≠10.8K persuation techniques xMMMU Yue et al. [160] 7 18.0 62.0 69.2=2.6K=300 semantics MTVQA Tang et al. [136] 10 16.4 62.0 77.9≠28.6K≠8.7K semantics CVQA Romero et al. [119] 31 40.9 6","claim_type":"dataset","confidence":0.9,"evidence_strength":"citation_context"},{"claim_text":"originally embedded by the legitimate user. Consequently, the watermark verification mechanism will erroneously attribute xadv to the user. This attack succeeds without requiring access to the original generation model or knowledge of the watermark method, highlighting a fundamental vulnerability in latent-based watermark schemes. 3.3 Compression Sensing Compressed sensing (CS) [ 3, 9] is a signal acquisition paradigm that enables direct acquisition of a compressed representation y∈R M of a high","claim_type":"background","confidence":0.9,"evidence_strength":"citation_context"}],"why_cited":"Pith tracks Emogen: Emotional image content generation with text-to-image diffusion models because it crossed a citation-hub threshold. Current citing contexts most often use it as background evidence (83 contexts).","role_counts":[{"n":83,"context_role":"background"},{"n":6,"context_role":"dataset"},{"n":2,"context_role":"baseline"},{"n":2,"context_role":"method"}]},"error":null,"updated_at":"2026-06-05T21:50:19.977813+00:00"},"summary_claims":{"job_type":"summary_claims","status":"succeeded","result":{"title":"Eyes Wide Shut? Exploring the Visual Shortcomings of Multimodal LLMs","claims":[],"why_cited":"Pith tracks Eyes Wide Shut? Exploring the Visual Shortcomings of Multimodal LLMs because it crossed a citation-hub threshold.","role_counts":[]},"error":null,"updated_at":"2026-05-14T06:57:20.567086+00:00"}},"summary":{"title":"Eyes Wide Shut? Exploring the Visual Shortcomings of Multimodal LLMs","claims":[],"why_cited":"Pith tracks Eyes Wide Shut? Exploring the Visual Shortcomings of Multimodal LLMs because it crossed a citation-hub threshold.","role_counts":[]},"graph":{"co_cited":[{"title":"MambaVision: A hybrid Mamba- Transformer vision backbone","work_id":"d0e5199d-8907-47b1-905a-07ab8b623a4c","shared_citers":24},{"title":"& Vondrick, C","work_id":"b8a8bb9e-1d31-40e2-9cab-ae21e338dde6","shared_citers":24},{"title":"In: 2023 IEEE/CVF Conference on Com- puter Vision and Pattern Recognition (CVPR)","work_id":"b9701eca-d05e-4d2e-9045-6761df4ba175","shared_citers":17},{"title":"Masked autoencoders are scalable vision learners","work_id":"0a23d1b7-bd56-43cc-8a80-7c43ce994e1e","shared_citers":14},{"title":"Qwen3-VL Technical Report","work_id":"1fe243aa-e3c0-4da6-b391-4cbcfc88d5c0","shared_citers":12},{"title":"IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) , year =","work_id":"9da51225-b7bd-4032-b7db-ca577971dafe","shared_citers":11},{"title":"In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)","work_id":"7083a41e-5666-435b-ab26-c753f6490b9a","shared_citers":11},{"title":"Qwen2.5-VL Technical Report","work_id":"69dffacb-bfe8-442d-be86-48624c60426f","shared_citers":11},{"title":"URL https://doi.org/10.48550/arXiv","work_id":"5c2060c6-427c-4321-be22-49ccae439d80","shared_citers":10},{"title":"Editing conditional radiance fields","work_id":"3820f598-11b0-45c3-8c99-0079181ac0a7","shared_citers":8},{"title":"Qwen3 Technical Report","work_id":"25a4e30c-1232-48e7-9925-02fa12ba7c9e","shared_citers":8},{"title":"Gemini 2.5: Pushing the Frontier with Advanced Reasoning, Multimodality, Long Context, and Next Generation Agentic Capabilities","work_id":"008df105-2fdd-45d8-857a-8e35868aecb6","shared_citers":6},{"title":"GPT-4 Technical Report","work_id":"b928e041-6991-4c08-8c81-0359e4097c7b","shared_citers":6},{"title":"InternVL3.5: Advancing Open-Source Multimodal Models in Versatility, Reasoning, and Efficiency","work_id":"b8f5e260-fff5-444e-bcf5-2c42cfefd83d","shared_citers":6},{"title":"Wan: Open and Advanced Large-Scale Video Generative Models","work_id":"ad3ebc3b-4224-46c9-b61d-bcf135da0a7c","shared_citers":6},{"title":"2024 , url =","work_id":"be79e919-e91f-4ecb-8b06-6b3091bc58b1","shared_citers":5},{"title":"DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models","work_id":"c5006563-f3ec-438a-9e35-b7b484f34828","shared_citers":5},{"title":"Kerbl, G","work_id":"d1e854f7-f01a-46d6-bc88-fe9fe914b4f3","shared_citers":5},{"title":"LLaVA-OneVision: Easy Visual Task Transfer","work_id":"f5f2452b-f2a9-49ac-b38d-c76e18cdfe49","shared_citers":5},{"title":"OpenAI GPT-5 System Card","work_id":"ca87689a-0d29-4476-b504-b65dbbb08af4","shared_citers":5},{"title":"why should I trust you?","work_id":"238df2e4-a3e5-46f3-860e-3ae2b0094b97","shared_citers":5},{"title":"Conditional prompt learning for vision- language models","work_id":"025819dc-724a-4ff8-ba0a-0ba72c046d8c","shared_citers":4},{"title":"DINOv2: Learning Robust Visual Features without Supervision","work_id":"26b304e5-b54a-4f26-be7e-83299eca52e4","shared_citers":4},{"title":"Distilling the Knowledge in a Neural Network","work_id":"d927ab1f-17b8-4002-9d09-c3d55764fbad","shared_citers":4}],"time_series":[{"n":82,"year":2026}],"dependency_candidates":[]},"authors":[]}}