BLIP: Bootstrapping Language-Image Pre-training for Unified Vision-Language Understanding and Generation
11 Pith papers cite this work. Polarity classification is still indexing.
citation-role summary
citation-polarity summary
roles
background 2
polarities
background 2
citing papers explorer
- LAION-5B: An open large-scale dataset for training next generation image-text models
  LAION-5B is an openly released dataset of 5.85 billion CLIP-filtered image-text pairs that enables replication of foundational vision-language models.
- Flamingo: a Visual Language Model for Few-Shot Learning
  Flamingo models reach new state-of-the-art few-shot results on image and video tasks by bridging frozen vision and language models with cross-attention layers trained on interleaved web-scale data.
- Your Embedding Model is SMARTer Than You Think
  SMART unlocks latent multi-vector capabilities in single-vector embedding models by applying late interaction to frozen hidden states shaped by contrastive training, yielding consistent gains on MMEB-V2 and visual document retrieval.
- Embedding Arithmetic: A Lightweight, Tuning-Free Framework for Post-hoc Bias Mitigation in Text-to-Image Models
  Embedding Arithmetic performs vector operations in the embedding space of T2I models to mitigate bias at inference time, outperforming baselines on diversity while preserving coherence via a new Concept Coherence Score.
- AICA-Bench: Holistically Examining the Capabilities of VLMs in Affective Image Content Analysis
  AICA-Bench evaluates 23 VLMs on affective image analysis, identifies weak intensity calibration and shallow descriptions as limitations, and proposes training-free Grounded Affective Tree Prompting to improve performance.
- From Evidence to Verdict: An Agent-Based Forensic Framework for AI-Generated Image Detection
  A multi-agent forensic system integrates multiple evidence sources and debate to detect AI-generated images, reporting 97.05% accuracy on a 6,000-image benchmark while outperforming traditional classifiers.
- CoCa: Contrastive Captioners are Image-Text Foundation Models
  CoCa unifies contrastive and generative pretraining in one image-text model to reach 86.3% zero-shot ImageNet accuracy and new state-of-the-art results on multiple downstream benchmarks.
- Multimodal Cultural Heritage Knowledge Graph Extension with Language and Vision Models
  The authors release the multimodal WJoconde knowledge graph for French cultural heritage and an LLM-VLM pipeline that extracts and validates new triples from unstructured text and images to extend the graph.
- OmniCD: A Foundational Framework for Remote Sensing Image Change Detection Guided by Multimodal Semantics
  OmniCD proposes a multimodal semantic-guided framework for remote sensing change detection supporting binary to zero-shot tasks, plus the RSITCD dataset, with claimed SOTA performance.
- Integration of Object Detection and Small VLMs for Construction Safety Hazard Identification
  Detection-guided prompting raises small VLM hazard F1 from 34.5% to 50.6% and BERTScore from 0.61 to 0.82 on construction images with only 2.5 ms added latency.
- Multilingual Training and Evaluation Resources for Vision-Language Models