pith. sign in

hub Canonical reference

ScreenAI: A Vision-Language Model for UI and Infographics Understanding , year =

Canonical reference. 100% of citing Pith papers cite this work as background.

12 Pith papers citing it
Background 100% of classified citations

hub tools

citation-role summary

background 7

citation-polarity summary

roles

background 7

polarities

background 7

representative citing papers

MolmoWeb: Open Visual Web Agent and Open Data for the Open Web

cs.CV · 2026-04-09 · unverdicted · novelty 7.0

Open 4B and 8B visual web agents achieve state-of-the-art results on browser benchmarks by predicting actions from screenshots and instructions, outperforming similar open models and some closed larger-model agents, with full release of data and code planned.

A Pattern Language for Resilient Visual Agents

cs.AI · 2026-04-30 · unverdicted · novelty 4.0

Proposes four architectural patterns—Hybrid Affordance Integration, Adaptive Visual Anchoring, Visual Hierarchy Synthesis, and Semantic Scene Graph—to balance non-determinism and latency of foundation models with enterprise requirements for determinism and real-time performance.

Toward Native Multimodal Modeling: A Roadmap

cs.CV · 2026-05-25 · unverdicted · novelty 3.0

A roadmap that defines architectural nativity for multimodal models and categorizes them into Multi-to-Text, Multi-to-Target, and Multi-to-Multi types while outlining an industrial pipeline toward unified transformer-based native multimodal modeling.

citing papers explorer

Showing 12 of 12 citing papers.