TokenPacker: Efficient Visual Projector for Multimodal LLM
5 Pith papers cite this work. Polarity classification is still indexing.
citation-role and citation-polarity summary
verdicts: UNVERDICTED 5
roles: background 1
polarities: background 1
citing papers explorer
- UIPress: Bringing Optical Token Compression to UI-to-Code Generation
  UIPress is the first encoder-side learned optical compression method for UI-to-Code that compresses visual tokens to 256, outperforming the uncompressed baseline by 7.5% CLIP score and the best inference-time baseline by 4.6% while delivering 9.1x TTFT speedup.
- MS-Resampler: Multi-Scope Visual Resampling for Efficient Multimodal LLMs
  MS-Resampler deploys multiple scope-specific resamplers with explicit spatial priors and adaptive fusion to outperform single-scope global cross-attention in MLLMs on ten benchmarks with minimal added cost.
- PARCEL: Pool-Anchored Resampling with Conditioned Elastic Queries for Efficient Vision-Language Understanding
  PARCEL is a new visual tokenization architecture combining pool-anchored resampling with conditioned elastic queries to enhance performance-efficiency tradeoffs in LVLMs over prior matryoshka methods.
- LLaVA-CoT: Let Vision Language Models Reason Step-by-Step
  LLaVA-CoT adds autonomous multistage reasoning to vision-language models, delivering 9.4% gains over its base model and outperforming larger models like Gemini-1.5-pro on reasoning benchmarks via a 100k annotated dataset and SWIRES test-time scaling.
- SlotVLA: Towards Modeling of Object-Relation Representations in Robotic Manipulation
  SlotVLA uses slot attention to model object-relation representations for multitask robotic manipulation, reducing visual tokens while achieving competitive generalization on the new LIBERO+ benchmark.