Introduces the first dedicated benchmark for live multi-modal LLM task guidance with mistake detection and a streaming baseline model.
Align before fuse: Vision and language representation learning with momentum distillation.Advances in neural information processing systems, 34:9694–9705
2 Pith papers cite this work. Polarity classification is still indexing.
2
Pith papers citing it
fields
cs.CV 2verdicts
UNVERDICTED 2representative citing papers
GenLIP pretrains ViTs to generate language tokens from visual tokens via autoregressive language modeling, matching strong baselines on multimodal tasks with less data.
citing papers explorer
-
Can Multi-Modal LLMs Provide Live Step-by-Step Task Guidance?
Introduces the first dedicated benchmark for live multi-modal LLM task guidance with mistake detection and a streaming baseline model.
-
Let ViT Speak: Generative Language-Image Pre-training
GenLIP pretrains ViTs to generate language tokens from visual tokens via autoregressive language modeling, matching strong baselines on multimodal tasks with less data.