Vl-bert: Pre-training of generic visual-linguistic representations

Vl-bert: Pre-training of generic visual-linguistic representations , author= · 1908 · arXiv 1908.08530

6 Pith papers cite this work. Polarity classification is still indexing.

6 Pith papers citing it

read on arXiv browse 6 citing papers

citation-role summary

background 2

citation-polarity summary

background 1 unclear 1

representative citing papers

MirrorCheck: Efficient Adversarial Defense for Vision-Language Models

cs.CV · 2024-06-13 · unverdicted · novelty 7.0

MirrorCheck detects adversarial attacks on VLMs via T2I regeneration for semantic consistency checks, using stochastic model selection and one-time perturbations for robustness against adaptive attacks.

Flamingo: a Visual Language Model for Few-Shot Learning

cs.CV · 2022-04-29 · unverdicted · novelty 7.0

Flamingo models reach new state-of-the-art few-shot results on image and video tasks by bridging frozen vision and language models with cross-attention layers trained on interleaved web-scale data.

MiMIC: Mitigating Visual Modality Collapse in Universal Multimodal Retrieval While Avoiding Semantic Misalignment

cs.CV · 2026-04-23 · unverdicted · novelty 6.0

MiMIC mitigates visual modality collapse and semantic misalignment in universal multimodal retrieval via fusion-in-decoder architecture and robust single-modality training.

NaVid: Video-based VLM Plans the Next Step for Vision-and-Language Navigation

cs.CV · 2024-02-24 · unverdicted · novelty 6.0

NaVid, a video-based VLM trained on 510k navigation and 763k web samples, achieves SOTA VLN performance using only monocular RGB video for next-step action planning in sim and real environments.

InterCMDM: Block-Causal Diffusion for Autoregressive Human Interaction Generation

cs.CV · 2026-07-02 · unverdicted · novelty 5.0

InterCMDM proposes a block-causal latent diffusion framework with dual-stream causal transformers and multi-task attention masks for autoregressive text-conditioned two-person interaction generation and reports SOTA results on InterHuman and Inter-X.

Task-Aware Structured Memory for Dynamic Multi-modal In-Context Learning

cs.CV · 2026-06-10 · unverdicted · novelty 5.0

TASM proposes a task-aware structured memory framework using task-vector compression, bipartite token merging, and a Core Memory plus Latent Bank hierarchy to enable efficient dynamic multi-modal in-context learning.

citing papers explorer

Showing 6 of 6 citing papers after filters.

MirrorCheck: Efficient Adversarial Defense for Vision-Language Models cs.CV · 2024-06-13 · unverdicted · none · ref 85
MirrorCheck detects adversarial attacks on VLMs via T2I regeneration for semantic consistency checks, using stochastic model selection and one-time perturbations for robustness against adaptive attacks.
Flamingo: a Visual Language Model for Few-Shot Learning cs.CV · 2022-04-29 · unverdicted · none · ref 107
Flamingo models reach new state-of-the-art few-shot results on image and video tasks by bridging frozen vision and language models with cross-attention layers trained on interleaved web-scale data.
MiMIC: Mitigating Visual Modality Collapse in Universal Multimodal Retrieval While Avoiding Semantic Misalignment cs.CV · 2026-04-23 · unverdicted · none · ref 52
MiMIC mitigates visual modality collapse and semantic misalignment in universal multimodal retrieval via fusion-in-decoder architecture and robust single-modality training.
NaVid: Video-based VLM Plans the Next Step for Vision-and-Language Navigation cs.CV · 2024-02-24 · unverdicted · none · ref 91
NaVid, a video-based VLM trained on 510k navigation and 763k web samples, achieves SOTA VLN performance using only monocular RGB video for next-step action planning in sim and real environments.
InterCMDM: Block-Causal Diffusion for Autoregressive Human Interaction Generation cs.CV · 2026-07-02 · unverdicted · none · ref 34
InterCMDM proposes a block-causal latent diffusion framework with dual-stream causal transformers and multi-task attention masks for autoregressive text-conditioned two-person interaction generation and reports SOTA results on InterHuman and Inter-X.
Task-Aware Structured Memory for Dynamic Multi-modal In-Context Learning cs.CV · 2026-06-10 · unverdicted · none · ref 86
TASM proposes a task-aware structured memory framework using task-vector compression, bipartite token merging, and a Core Memory plus Latent Bank hierarchy to enable efficient dynamic multi-modal in-context learning.

Vl-bert: Pre-training of generic visual-linguistic representations

citation-role summary

citation-polarity summary

fields

years

verdicts

roles

polarities

representative citing papers

citing papers explorer