Demystifying CLIP Data
16 Pith papers cite this work.
Representative citing papers
-
Challenging Vision-Language Models with Physically Deployable Multimodal Semantic Lighting Attacks
MSLA is the first physically deployable attack that uses adversarial lighting to break semantic alignment in VLMs such as CLIP, LLaVA, and BLIP, causing classification failures and hallucinations in real scenes.
-
Rethinking Model Selection in VLM Through the Lens of Gromov-Wasserstein Distance
Gromov-Wasserstein distance between modalities provides a stronger, inference-only predictor of final VLM performance than conventional encoder metrics, backed by theory linking it to cross-modal learnability and verified across 60+ training runs.
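A minimal sketch of the core quantity, assuming paired sets of image and text embeddings have already been extracted as NumPy arrays and using the POT library's Gromov-Wasserstein solver; the paper's sampling, cost functions, and evaluation protocol may differ.

```python
# Minimal sketch: Gromov-Wasserstein discrepancy between the image and text
# embedding spaces of a VLM, using the POT library. The embeddings here are
# random stand-ins; the paper's exact setup may differ.
import numpy as np
import ot  # pip install pot
from scipy.spatial.distance import cdist

rng = np.random.default_rng(0)
img_emb = rng.normal(size=(256, 512))   # placeholder image embeddings
txt_emb = rng.normal(size=(256, 512))   # placeholder text embeddings

# Intra-modal cost matrices: pairwise distances within each modality.
C_img = cdist(img_emb, img_emb, metric="cosine")
C_txt = cdist(txt_emb, txt_emb, metric="cosine")

# Uniform marginals over the sampled points.
p = ot.unif(C_img.shape[0])
q = ot.unif(C_txt.shape[0])

# GW compares the two relational structures without needing a shared space,
# so it can be evaluated purely at inference time.
gw2 = ot.gromov.gromov_wasserstein2(C_img, C_txt, p, q, loss_fun="square_loss")
print(f"Gromov-Wasserstein discrepancy: {gw2:.4f}")
```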
-
DouC: Dual-Branch CLIP for Training-Free Open-Vocabulary Segmentation
DouC fuses an OG-CLIP branch for patch reliability via inference-time token gating with a FADE-CLIP branch for structural priors via proxy attention, outperforming prior training-free methods on eight benchmarks.
-
Revealing Physical-World Semantic Vulnerabilities: Universal Adversarial Patches for Infrared Vision-Language Models
UCGP is a universal physical adversarial patch that compromises cross-modal semantic alignment in IR-VLMs through curved-grid parameterization and representation-space disruption.
-
When Surfaces Lie: Exploiting Wrinkle-Induced Attention Shift to Attack Vision-Language Models
A wrinkle-field perturbation method creates photorealistic non-rigid image changes that degrade state-of-the-art VLMs on image captioning and VQA more effectively than prior baselines.
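A generic sketch of the perturbation family described here: warping an image with a smooth, low-frequency displacement field so the change is non-rigid rather than pixel-noise-like. The field below is random and the `warp` helper is illustrative; the attack's adversarial optimization against the victim VLM is not shown.

```python
# Smooth, non-rigid warp of an image by a low-frequency displacement field,
# the kind of perturbation a wrinkle-style attack would then optimize.
import numpy as np
from scipy.ndimage import gaussian_filter, map_coordinates

def warp(image: np.ndarray, amplitude: float = 4.0, smoothness: float = 25.0) -> np.ndarray:
    """Apply a smooth random displacement field to an (H, W, C) image."""
    h, w = image.shape[:2]
    rng = np.random.default_rng(0)
    dx = gaussian_filter(rng.normal(size=(h, w)), smoothness)
    dy = gaussian_filter(rng.normal(size=(h, w)), smoothness)
    dx *= amplitude / (np.abs(dx).max() + 1e-8)   # cap displacement in pixels
    dy *= amplitude / (np.abs(dy).max() + 1e-8)
    ys, xs = np.meshgrid(np.arange(h), np.arange(w), indexing="ij")
    coords = np.stack([ys + dy, xs + dx])
    out = np.stack([map_coordinates(image[..., c], coords, order=1, mode="reflect")
                    for c in range(image.shape[2])], axis=-1)
    return out.astype(image.dtype)

image = np.random.randint(0, 256, size=(224, 224, 3), dtype=np.uint8)
warped = warp(image)
```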
-
20/20 Vision Language Models: A Prescription for Better VLMs through Data Curation Alone
Data curation alone raises VLM accuracy by more than 11 points on average across many benchmarks while cutting required training compute by a factor of up to 87.
-
Birds of a Feather Flock Together: Background-Invariant Representations via Linear Structure in VLMs
Exploiting linear structure in VLM embeddings, a synthetic-data pre-training method yields background-invariant representations that exceed 90% worst-group accuracy on Waterbirds even under 100% spurious correlation with no minority examples in training.
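An illustrative sketch of what "exploiting linear structure" can look like: estimate a background direction from synthetic pairs of the same foreground on two backgrounds, then project it out of the embeddings. The pairing step, the `background_direction` and `remove_direction` helpers, and the placeholder data are assumptions for illustration, not the paper's training recipe.

```python
# Estimate and remove a linear "background" direction in a VLM embedding space.
import numpy as np

def background_direction(emb_bg_a: np.ndarray, emb_bg_b: np.ndarray) -> np.ndarray:
    """Unit vector along the mean embedding shift between two backgrounds."""
    d = (emb_bg_a - emb_bg_b).mean(axis=0)
    return d / np.linalg.norm(d)

def remove_direction(emb: np.ndarray, direction: np.ndarray) -> np.ndarray:
    """Project embeddings onto the hyperplane orthogonal to `direction`."""
    return emb - np.outer(emb @ direction, direction)

# Placeholder data: 100 synthetic foregrounds, 512-d CLIP-style embeddings,
# rendered once on a "land" background and once on a "water" background.
rng = np.random.default_rng(0)
emb_land = rng.normal(size=(100, 512))
offset = 0.5 * rng.normal(size=512)                      # shared background shift
emb_water = emb_land + offset + 0.05 * rng.normal(size=(100, 512))

bg_dir = background_direction(emb_water, emb_land)
emb_invariant = remove_direction(emb_water, bg_dir)      # background component removed
```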
-
LLaVA-UHD v4: What Makes Efficient Visual Encoding in MLLMs?
LLaVA-UHD v4 reduces visual-encoding FLOPs by 55.8% for high-resolution images in MLLMs via slice-based encoding plus intra-ViT early compression while matching or exceeding baseline performance on document, OCR, and VQA benchmarks.
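A sketch of the two ideas named above for a generic ViT front end: encode a high-resolution image as a grid of fixed-size slices, and compress patch tokens early by average pooling before they reach the language model. The 2x2 grid, 336-pixel slices, and 4x pooling ratio are illustrative choices, not LLaVA-UHD v4's actual configuration.

```python
# Slice-based encoding plus early token compression, sketched with PyTorch.
import torch
import torch.nn.functional as F

def slice_image(img: torch.Tensor, slice_size: int = 336) -> torch.Tensor:
    """Split a (3, H, W) image into a 2x2 batch of (3, slice_size, slice_size) tiles."""
    c, _, _ = img.shape
    img = F.interpolate(img[None], size=(2 * slice_size, 2 * slice_size), mode="bilinear")[0]
    tiles = img.unfold(1, slice_size, slice_size).unfold(2, slice_size, slice_size)
    return tiles.permute(1, 2, 0, 3, 4).reshape(-1, c, slice_size, slice_size)

def compress_tokens(patch_tokens: torch.Tensor, ratio: int = 4) -> torch.Tensor:
    """Average-pool (B, N, D) patch tokens along the sequence by `ratio`."""
    b, n, d = patch_tokens.shape
    return patch_tokens.reshape(b, n // ratio, ratio, d).mean(dim=2)

img = torch.rand(3, 1344, 1008)                 # a high-resolution input
tiles = slice_image(img)                        # (4, 3, 336, 336)
tokens = torch.rand(tiles.shape[0], 576, 1024)  # stand-in for ViT patch tokens
compact = compress_tokens(tokens)               # (4, 144, 1024): 4x fewer tokens
```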
-
Probing CLIP's Comprehension of 360-Degree Textual and Visual Semantics
CLIP models understand 360-degree textual semantics via explicit identifiers but show limited comprehension of visual semantics under horizontal circular shifts, which a LoRA fine-tuning approach improves with a noted trade-off in original task performance.
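A minimal probe of the visual side of this question, assuming a local equirectangular panorama at panorama.jpg and the Hugging Face `openai/clip-vit-base-patch32` checkpoint as a stand-in; the paper's models, data, and metrics may differ. A horizontal circular shift shows the same scene from a rotated viewpoint, so a large drop in cosine similarity indicates limited 360-degree visual comprehension.

```python
# Probe how a CLIP image embedding drifts under horizontal circular shifts.
import numpy as np
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

panorama = np.array(Image.open("panorama.jpg"))   # equirectangular image, (H, W, 3)

def embed(img_array: np.ndarray) -> torch.Tensor:
    inputs = processor(images=Image.fromarray(img_array), return_tensors="pt")
    with torch.no_grad():
        feat = model.get_image_features(**inputs)
    return torch.nn.functional.normalize(feat, dim=-1)

base = embed(panorama)
for frac in (0.25, 0.5, 0.75):
    shifted = np.roll(panorama, int(frac * panorama.shape[1]), axis=1)
    sim = (base @ embed(shifted).T).item()
    print(f"shift {frac:.2f} of width -> cosine similarity {sim:.3f}")
```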
-
GLM-4.5V and GLM-4.1V-Thinking: Towards Versatile Multimodal Reasoning with Scalable Reinforcement Learning
GLM-4.5V reaches state-of-the-art results on 42 multimodal benchmarks among open-source models of similar size by applying reinforcement learning with curriculum sampling to a strong vision foundation model.
-
ShellfishNet: A Domain-Specific Benchmark for Visual Recognition of Marine Molluscs
ShellfishNet is a new benchmark of 8,691 images across 32 mollusc taxa for evaluating vision models on real-world underwater ecological monitoring tasks including robustness to degradation.
-
Let ViT Speak: Generative Language-Image Pre-training
GenLIP pretrains ViTs to generate language tokens from visual tokens via autoregressive language modeling, matching strong baselines on multimodal tasks with less data.
-
From Codebooks to VLMs: Evaluating Automated Visual Discourse Analysis for Climate Change on Social Media
VLMs recover reliable population-level trends in climate change visual discourse on social media even when per-image accuracy is only moderate.
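A sketch of why population-level trends can survive moderate per-image accuracy: if the VLM's confusion matrix is estimated on a labeled subset, the raw predicted label proportions can be corrected by inverting it. This standard misclassification adjustment, with made-up numbers below, is offered as intuition rather than as the paper's exact procedure.

```python
# Correct aggregate label proportions using an estimated confusion matrix.
import numpy as np

# Rows = true class, columns = predicted class, estimated on a validation set.
confusion = np.array([
    [0.70, 0.20, 0.10],
    [0.15, 0.75, 0.10],
    [0.10, 0.15, 0.75],
])

# Observed share of each predicted label in a month's social-media images.
observed = np.array([0.42, 0.33, 0.25])

# Solve observed = confusion.T @ true_prevalence for the underlying class mix.
estimated = np.linalg.solve(confusion.T, observed)
estimated = np.clip(estimated, 0, None)
estimated /= estimated.sum()
print("corrected class prevalence:", np.round(estimated, 3))
```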
-
Concrete Jungle: Towards Concreteness Paved Contrastive Negative Mining for Compositional Understanding
Using lexical concreteness to guide contrastive negative mining and a new margin-based Cement loss, the Slipform framework reaches state-of-the-art on compositional benchmarks for vision-language models.
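A generic sketch of a margin-based contrastive objective over (image, caption, hard-negative caption) triples, where the negatives would come from a concreteness-guided mining step. The `margin_triplet_loss` helper and the margin value are illustrative; the exact form of the paper's Cement loss is not reproduced here.

```python
# Margin-based contrastive loss over image/positive-caption/hard-negative triples.
import torch
import torch.nn.functional as F

def margin_triplet_loss(img_emb: torch.Tensor,
                        pos_txt_emb: torch.Tensor,
                        neg_txt_emb: torch.Tensor,
                        margin: float = 0.2) -> torch.Tensor:
    """Push image-positive similarity above image-negative similarity by `margin`."""
    img = F.normalize(img_emb, dim=-1)
    pos = F.normalize(pos_txt_emb, dim=-1)
    neg = F.normalize(neg_txt_emb, dim=-1)
    sim_pos = (img * pos).sum(dim=-1)
    sim_neg = (img * neg).sum(dim=-1)
    return F.relu(margin - sim_pos + sim_neg).mean()

# Placeholder batch of 512-d embeddings.
img, pos, neg = (torch.randn(32, 512) for _ in range(3))
loss = margin_triplet_loss(img, pos, neg)
```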
-
Human-Inspired Context-Selective Multimodal Memory for Social Robots
A new memory system for social robots selectively stores multimodal memories by emotional salience and novelty, achieving 0.506 Spearman correlation in selectivity and up to 13% better Recall@1 in multimodal retrieval.
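An illustrative sketch of the selectivity mechanism: store an episode only when a weighted combination of emotional salience and novelty (distance to the nearest stored embedding) clears a threshold. The `SelectiveMemory` class, its weights, and the threshold are placeholders, not the paper's architecture or values.

```python
# Salience- and novelty-gated storage for a multimodal episodic memory.
import numpy as np

class SelectiveMemory:
    def __init__(self, threshold: float = 0.6, w_salience: float = 0.5):
        self.threshold = threshold
        self.w_salience = w_salience
        self.embeddings: list[np.ndarray] = []
        self.episodes: list[dict] = []

    def novelty(self, emb: np.ndarray) -> float:
        """1 minus the similarity to the closest stored memory (assumes unit-norm embeddings)."""
        if not self.embeddings:
            return 1.0
        return 1.0 - max(float(emb @ m) for m in self.embeddings)

    def maybe_store(self, emb: np.ndarray, episode: dict, salience: float) -> bool:
        score = self.w_salience * salience + (1 - self.w_salience) * self.novelty(emb)
        if score >= self.threshold:
            self.embeddings.append(emb)
            self.episodes.append(episode)
            return True
        return False

memory = SelectiveMemory()
emb = np.random.randn(512)
emb /= np.linalg.norm(emb)
stored = memory.maybe_store(emb, {"who": "user_3", "what": "greeted the robot"}, salience=0.8)
```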
-
Movie Gen: A Cast of Media Foundation Models
A 30B-parameter transformer and related models generate high-quality videos and audio, claiming state-of-the-art results on text-to-video, video editing, personalization, and audio generation tasks.