LAION-400M: Open Dataset of CLIP-Filtered 400 Million Image-Text Pairs

Christoph Schuhmann, Richard Vencu, Romain Beaumont, Robert Kaczmarczyk, Clayton Mullis, Aarush Katta · 2021 · cs.CV · arXiv 2111.02114

Baseline reference: 64% of classified citations in Pith papers use this work as a benchmark or comparison.

85 Pith papers cite it.

abstract

Multi-modal language-vision models trained on hundreds of millions of image-text pairs (e.g. CLIP, DALL-E) gained a recent surge, showing remarkable capability to perform zero- or few-shot learning and transfer even in the absence of per-sample labels on target image data. Despite this trend, to date there have been no publicly available datasets of sufficient scale for training such models from scratch. To address this issue, in a community effort we build and release to the public LAION-400M, a dataset with CLIP-filtered 400 million image-text pairs, their CLIP embeddings and kNN indices that allow efficient similarity search.
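The two mechanisms named in the abstract, CLIP filtering and similarity search over embeddings, can be illustrated with a minimal sketch. Everything in it is an assumption for illustration: the three-dimensional toy vectors stand in for real CLIP embeddings, the function names `clip_filter` and `knn` are hypothetical, the 0.3 cosine threshold is not stated in the abstract, and the brute-force search stands in for the precomputed kNN indices that the dataset ships.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def clip_filter(pairs, threshold=0.3):
    """Keep pairs whose image and text embeddings agree.

    The threshold value is an assumption, not taken from the abstract.
    """
    return [
        p for p in pairs
        if cosine(p["image_emb"], p["text_emb"]) >= threshold
    ]


def knn(query, embeddings, k=2):
    """Brute-force nearest neighbours by cosine similarity.

    Returns the indices of the k most similar embeddings, best first.
    """
    order = sorted(
        range(len(embeddings)),
        key=lambda i: cosine(query, embeddings[i]),
        reverse=True,
    )
    return order[:k]


# Toy data: hypothetical embeddings, not real CLIP outputs.
pairs = [
    {"caption": "a red bicycle",
     "image_emb": [1.0, 0.0, 0.0], "text_emb": [0.9, 0.1, 0.0]},
    {"caption": "click here",
     "image_emb": [0.0, 1.0, 0.0], "text_emb": [0.0, 0.0, 1.0]},
    {"caption": "a blue bicycle",
     "image_emb": [0.8, 0.0, 0.6], "text_emb": [0.7, 0.0, 0.7]},
]

kept = clip_filter(pairs)
neighbours = knn([1.0, 0.0, 0.0], [p["image_emb"] for p in kept], k=2)
print([p["caption"] for p in kept], neighbours)
```

In this toy run the mismatched "click here" pair is dropped because its image and text vectors are orthogonal, and the query vector retrieves the red bicycle before the blue one. At the scale of hundreds of millions of pairs, an approximate index would replace the exhaustive sort.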

citation-role summary

dataset 17 · background 6 · method 1 · other 1

citation-polarity summary

use dataset 16 · background 6 · unclear 2 · use method 1
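The 64% baseline share reported near the top of this page appears consistent with the polarity counts, assuming it is computed as the "use dataset" count over all classified citations. The page does not confirm that derivation, so the sketch below is only a plausibility check with hypothetical variable names.

```python
# Polarity counts copied from the summary above.
polarity_counts = {
    "use dataset": 16,
    "background": 6,
    "unclear": 2,
    "use method": 1,
}

# Assumption: the denominator is every classified citation.
total_classified = sum(polarity_counts.values())

# Assumption: "use dataset" is the category counted as baseline use.
baseline_share = 100 * polarity_counts["use dataset"] / total_classified

print(total_classified, baseline_share)
```

Under these assumptions, 16 of 25 classified citations gives 64%, which matches the stated figure. The role summary total (17 + 6 + 1 + 1) is also 25.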

representative citing papers

CiteVQA: Benchmarking Evidence Attribution for Trustworthy Document Intelligence

cs.CL · 2026-05-13 · accept · novelty 8.0

CiteVQA requires models to cite specific document regions with bounding boxes alongside answers and finds that even the strongest MLLMs frequently cite the wrong region, with top SAA scores of only 76.0 for closed models and 22.5 for open-source ones.

An Image is Worth One Word: Personalizing Text-to-Image Generation using Textual Inversion

cs.CV · 2022-08-02 · unverdicted · novelty 8.0

Textual Inversion learns a single embedding vector from a few images to represent personal concepts inside the text embedding space of a frozen text-to-image model, enabling their composition in natural language prompts.

PluRule: A Benchmark for Moderating Pluralistic Communities on Social Media

cs.CL · 2026-05-16 · unverdicted · novelty 7.0

PluRule is a new multimodal multilingual benchmark showing that state-of-the-art vision-language models perform only marginally better than a trivial baseline at detecting specific rule violations in pluralistic online communities.

Watch Your Step: Information Injection in Diffusion Models via Shadow Timestep Embedding

cs.LG · 2026-05-01 · unverdicted · novelty 7.0

Timestep embeddings in diffusion models function as a separable side channel that can carry dedicated information for adversarial injection or detection.

VeraRetouch: A Lightweight Fully Differentiable Framework for Multi-Task Reasoning Photo Retouching

cs.CV · 2026-04-30 · unverdicted · novelty 7.0 · 2 refs

VeraRetouch is a 0.5B VLM-based framework with a differentiable Retouch Renderer and a new million-scale AetherRetouch-1M+ dataset that claims state-of-the-art results in reasoning photo retouching while enabling mobile deployment.

EmbodiedMidtrain: Bridging the Gap between Vision-Language Models and Vision-Language-Action Models via Mid-training

cs.CV · 2026-04-21 · unverdicted · novelty 7.0

EmbodiedMidtrain mid-trains VLMs on curated VLA-aligned data subsets to improve downstream performance on robot manipulation benchmarks.

DifFoundMAD: Foundation Models meet Differential Morphing Attack Detection

cs.CV · 2026-04-20 · unverdicted · novelty 7.0

DifFoundMAD improves differential morphing attack detection by replacing traditional embeddings with those from vision foundation models and applying class-balanced lightweight fine-tuning, cutting high-security error rates from 6.16% to 2.17%.

InstAP: Instance-Aware Vision-Language Pre-Train for Spatial-Temporal Understanding

cs.CV · 2026-04-09 · unverdicted · novelty 7.0

InstAP introduces instance-aware pre-training with a new dual-granularity dataset InstVL that improves both fine-grained instance retrieval and global video understanding over standard VLP baselines.

PromptEvolver: Prompt Inversion through Evolutionary Optimization in Natural-Language Space

cs.LG · 2026-04-03 · unverdicted · novelty 7.0

PromptEvolver recovers high-fidelity natural language prompts for given images by evolving them via genetic algorithm guided by a vision-language model, outperforming prior methods on benchmarks.

Distance Comparison Operations Are Not Silver Bullets in Vector Similarity Search: A Benchmark Study on Their Merits and Limits

cs.DB · 2026-04-03 · accept · novelty 7.0

Benchmark study shows DCO methods for vector similarity search are not reliable silver bullets due to high sensitivity to data properties and hardware, making them unsuitable for production deployment.

Mind the Generative Details: Direct Localized Detail Preference Optimization for Video Diffusion Models

cs.CV · 2026-01-07 · unverdicted · novelty 7.0 · 2 refs

LocalDPO aligns text-to-video diffusion models with human preferences at the spatio-temporal region level by automatically generating localized preference pairs from corrupted real videos and applying a region-aware DPO loss.

Towards Multimodal Active Learning: Efficient Learning with Limited Paired Data

cs.LG · 2025-09-25 · unverdicted · novelty 7.0

Introduces the first active learning framework for unaligned multimodal data that selects alignments using uncertainty and diversity to cut annotation costs by up to 40% on benchmarks while preserving accuracy.

LLaVA-NeXT-Interleave: Tackling Multi-image, Video, and 3D in Large Multimodal Models

cs.CV · 2024-07-10 · unverdicted · novelty 7.0

LLaVA-NeXT-Interleave unifies multi-image, video, and 3D capabilities in large multimodal models via a new 1.18M-sample interleaved dataset and benchmark, achieving leading results across those tasks while preserving single-image performance.

ELLA: Equip Diffusion Models with LLM for Enhanced Semantic Alignment

cs.CV · 2024-03-08 · unverdicted · novelty 7.0

ELLA introduces a timestep-aware semantic connector to link LLMs with diffusion models for improved dense prompt following, validated on a new 1K-prompt benchmark.

Learning Interactive Real-World Simulators

cs.AI · 2023-10-09 · conditional · novelty 7.0

UniSim learns a universal real-world simulator from orchestrated diverse datasets, enabling zero-shot deployment of policies trained purely in simulation.

LLaMA-Adapter: Efficient Fine-tuning of Language Models with Zero-init Attention

cs.CV · 2023-03-28 · conditional · novelty 7.0

LLaMA-Adapter turns frozen LLaMA 7B into a capable instruction follower using only 1.2M new parameters and zero-init attention, matching Alpaca while extending to image-conditioned reasoning on ScienceQA and COCO.

Language Is Not All You Need: Aligning Perception with Language Models

cs.CL · 2023-02-27 · conditional · novelty 7.0

Kosmos-1 shows strong zero-shot and few-shot results on language tasks, image captioning, visual QA, OCR-free document understanding, and image recognition guided by text instructions.

BLIP-2: Bootstrapping Language-Image Pre-training with Frozen Image Encoders and Large Language Models

cs.CV · 2023-01-30 · unverdicted · novelty 7.0

BLIP-2 bootstraps vision-language pre-training from frozen image encoders and LLMs via a lightweight two-stage Querying Transformer, delivering SOTA results with 54x fewer trainable parameters than Flamingo80B on zero-shot VQAv2.

LAION-5B: An open large-scale dataset for training next generation image-text models

cs.CV · 2022-10-16 · accept · novelty 7.0

LAION-5B is an openly released dataset of 5.85 billion CLIP-filtered image-text pairs that enables replication of foundational vision-language models.

Phenaki: Variable Length Video Generation From Open Domain Textual Description

cs.CV · 2022-10-05 · unverdicted · novelty 7.0

Phenaki generates arbitrary-length videos from sequences of text prompts by tokenizing videos with causal temporal attention and generating tokens with a text-conditioned masked transformer, trained jointly on images and videos.

Photorealistic Text-to-Image Diffusion Models with Deep Language Understanding

cs.CV · 2022-05-23 · accept · novelty 7.0

Imagen achieves state-of-the-art photorealistic text-to-image generation by scaling a text-only pretrained T5 language model within a diffusion framework, reaching FID 7.27 on COCO without training on it.

Flamingo: a Visual Language Model for Few-Shot Learning

cs.CV · 2022-04-29 · unverdicted · novelty 7.0

Flamingo models reach new state-of-the-art few-shot results on image and video tasks by bridging frozen vision and language models with cross-attention layers trained on interleaved web-scale data.

When More Cores Hurts: The Vector Database Scaling Paradox in HPC

cs.DC · 2026-06-08 · unverdicted · novelty 6.0

Large-scale HPC evaluation of Qdrant, Milvus, and Weaviate reveals that workload patterns limit scaling and extra cores can reduce throughput, exposing a cloud-to-HPC design mismatch.

Chroma Clues: Leveraging Color Statistics to Detect Synthetic Images

cs.CV · 2026-06-01 · unverdicted · novelty 6.0

Color transformations expose statistical discrepancies in synthetic images, supporting a classifier with 93.27% average accuracy and robustness to post-processing.

citing papers explorer

Showing 35 of 85 citing papers.

FineGen: A VLM-based Multi-Agent Framework for Fine-Grained Image-Text Dataset Construction

cs.CV · 2026-06-02 · unverdicted · none · ref 20 · internal anchor

FineGen uses a VLM multi-agent pipeline to build FineGen-100K, a 147k-sample hierarchical dataset of attribute-specific hard negatives, reporting 96.7% validity and +14.4% downstream accuracy gain on hard samples in FG-OVD.

Towards the Connection between Activation Sparsity and Flat Minima

cs.LG · 2026-05-25 · unverdicted · none · ref 41 · internal anchor

MLP activation sparsity equals augmented flatness divided by input norm times gradient; the ratio falls during training and can be reduced further by three plug-and-play changes, yielding higher sparsity on ImageNet and C4.

MONET: A Massive, Open, Non-redundant and Enriched Text-to-image dataset

cs.CV · 2026-05-20 · unverdicted · none · ref 82 · internal anchor

MONET is an open 104.9M image-text pair dataset created via safety filtering, deduplication, and multi-VLM recaptioning from 2.9B raw pairs, validated by training a competitive 4B-parameter latent diffusion model.

PixVerve: Advancing Native UHR Image Generation to 100MP with a Large-Scale High-Quality Dataset

cs.CV · 2026-05-19 · unverdicted · none · ref 41 · internal anchor

PixVerve introduces a 95K ultra-high-resolution image-text dataset and training strategies that enable native 100-megapixel text-to-image generation together with a new evaluation benchmark.

DiffMagicFace: Identity Consistent Facial Editing of Real Videos

cs.CV · 2026-04-15 · unverdicted · none · ref 43 · internal anchor

DiffMagicFace uses concurrent fine-tuned text and image diffusion models plus a rendered multi-view dataset to achieve identity-consistent text-conditioned editing of real facial videos.

Dynamic Eraser for Guided Concept Erasure in Diffusion Models

cs.CV · 2026-04-13 · unverdicted · none · ref 27 · internal anchor

DSS is a lightweight inference-time framework that erases concepts in diffusion models at 91% average rate while preserving image fidelity, outperforming prior methods.

Measurement of Generative AI Workload Power Profiles for Whole-Facility Data Center Infrastructure Planning

eess.SY · 2026-04-08 · unverdicted · none · ref 39 · internal anchor

High-resolution power profiles for AI workloads on H100 GPUs are measured and scaled to whole-facility energy demand using a bottom-up model, with the dataset made public.

Long Story Short: Disentangling Compositionality and Long-Caption Understanding in Contrastive VLMs

cs.CV · 2025-09-23 · unverdicted · none · ref 37 · internal anchor

Empirical study shows bidirectional but sensitive relationship between compositionality and long-caption understanding in VLMs, promoted by high-quality grounded data and affected by architectural choices like frozen positional embeddings.

A Careful Examination of Large Behavior Models for Multitask Dexterous Manipulation

cs.RO · 2025-07-07 · accept · none · ref 37 · internal anchor

Multi-task pretraining of diffusion policies on diverse robot data produces more successful, robust, and data-efficient policies for dexterous manipulation than single-task baselines, with performance scaling with pretraining size and diversity.

Preserve and Personalize: Personalized Text-to-Image Diffusion Models without Distributional Drift

cs.CV · 2025-05-26 · unverdicted · none · ref 49 · internal anchor

Proposes Lipschitz regularization during fine-tuning to prevent distributional drift in personalized diffusion models, improving subject fidelity and prompt adherence.

Not All Rollouts are Useful: Down-Sampling Rollouts in LLM Reinforcement Learning

cs.LG · 2025-04-18 · unverdicted · none · ref 19 · internal anchor

PODS applies max-variance down-sampling to GRPO rollouts in LLM RLVR, delivering at least 1.7x faster training to peak test accuracy on reasoning benchmarks.

Wan: Open and Advanced Large-Scale Video Generative Models

cs.CV · 2025-03-26 · unverdicted · none · ref 41 · internal anchor

Wan releases open 1.3B and 14B video diffusion models claiming superior performance over open-source and commercial baselines across multiple tasks with consumer-grade efficiency.

CogVLM2: Visual Language Models for Image and Video Understanding

cs.CV · 2024-08-29 · conditional · none · ref 69 · internal anchor

CogVLM2 family achieves state-of-the-art results on image and video understanding benchmarks through improved visual expert architecture, higher resolution inputs, and automated temporal grounding for videos.

InternLM-XComposer-2.5: A Versatile Large Vision Language Model Supporting Long-Contextual Input and Output

cs.CV · 2024-07-03 · conditional · none · ref 126 · internal anchor

InternLM-XComposer-2.5 is a 7B vision-language model supporting up to 96K context that reaches GPT-4V-level performance on image, video, and multi-turn tasks and adds LoRA-driven text-image composition capabilities.

InternLM-XComposer2: Mastering Free-form Text-Image Composition and Comprehension in Vision-Language Large Model

cs.CV · 2024-01-29 · unverdicted · none · ref 70 · internal anchor

InternLM-XComposer2 introduces Partial LoRA on InternLM2-7B to enable high-quality free-form text-image composition while matching or exceeding GPT-4V on select vision-language benchmarks.

MobileVLM : A Fast, Strong and Open Vision Language Assistant for Mobile Devices

cs.CV · 2023-12-28 · unverdicted · none · ref 103 · internal anchor

MobileVLM achieves on-par performance with much larger vision-language models on standard benchmarks while delivering state-of-the-art inference speeds of 21.5 tokens per second on Snapdragon 888 CPU and 65.3 on Jetson Orin GPU.

SPHINX: The Joint Mixing of Weights, Tasks, and Visual Embeddings for Multi-modal Large Language Models

cs.CV · 2023-11-13 · unverdicted · none · ref 27 · internal anchor

SPHINX improves multi-modal LLMs through joint mixing of weights, tasks, and visual embeddings from varied sources to achieve stronger alignment and multi-purpose capabilities.

I2VGen-XL: High-Quality Image-to-Video Synthesis via Cascaded Diffusion Models

cs.CV · 2023-11-07 · unverdicted · none · ref 38 · internal anchor

I2VGen-XL applies cascaded diffusion models with a base stage for semantic preservation via hierarchical encoders and a refinement stage for detail and resolution, trained on 35 million text-video and 6 billion text-image pairs.

MiniGPT-v2: large language model as a unified interface for vision-language multi-task learning

cs.CV · 2023-10-14 · unverdicted · none · ref 42 · internal anchor

MiniGPT-v2 adds unique task identifiers to a large language model so one system can perform image description, visual question answering, and visual grounding after three-stage training.

LLaMA-Adapter V2: Parameter-Efficient Visual Instruction Model

cs.CV · 2023-04-28 · conditional · none · ref 56 · internal anchor

LLaMA-Adapter V2 achieves open-ended visual instruction following in LLMs by unlocking more parameters, early fusion of visual tokens, and joint training on disjoint parameter groups with only 14M added parameters.

InternVideo: General Video Foundation Models via Generative and Discriminative Learning

cs.CV · 2022-12-06 · unverdicted · none · ref 71 · internal anchor

InternVideo combines masked video modeling and video-language contrastive learning into a single foundation model that reaches state-of-the-art results on 39 video datasets including 91.1% top-1 on Kinetics-400.

Multimodal Group Emotion Recognition In-the-Wild Towards a Privacy-Safe Non-Individual Approach

cs.CV · 2026-05-27 · unverdicted · none · ref 203 · internal anchor

Proposes cross-attention audio-video fusion and VE-MD latent-space models for group emotion recognition that avoid individual cues and report competitive performance via ablation studies on synthetic and real data.

From Cradle to Cloud: A Life Cycle Review of AI's Environmental Footprint

cs.CY · 2026-05-06 · unverdicted · none · ref 87 · internal anchor

A review of AI sustainability studies finds inconsistent life cycle definitions and predominant reliance on coarse CO2e proxies, with limited coverage of water, materials, and multi-impact assessments.

On The Application of Linear Attention in Multimodal Transformers

cs.CV · 2026-04-11 · unverdicted · none · ref 29 · internal anchor

Linear attention delivers significant computational savings in multimodal transformers and follows the same scaling laws as softmax attention on ViT models trained on LAION-400M with ImageNet-21K zero-shot validation.

Mapping the Stochastic Penal Colony

cs.CY · 2026-01-18 · unverdicted · none · ref 91 · internal anchor

Content moderation operates as a stochastic penal colony that banishes users through the constant threat of account suspension, shown via auto-ethnographic case studies of Twitter, OpenAI DALL-E 2, and Pinterest.

Deepfakes: we need to re-think the concept of "real" images

cs.CV · 2025-09-26 · unverdicted · none · ref 39 · internal anchor

This position paper contends that the concept of 'real' images must be rethought because most modern photographs are computationally generated, undermining current deepfake detection methods.

Yi: Open Foundation Models by 01.AI

cs.CL · 2024-03-07 · unverdicted · none · ref 68 · internal anchor

Yi models are 6B and 34B open foundation models pretrained on 3.1T curated tokens that achieve strong benchmark results through data quality and targeted extensions like long context and vision alignment.

InternLM-XComposer: A Vision-Language Large Model for Advanced Text-image Comprehension and Composition

cs.CV · 2023-09-26 · conditional · none · ref 75 · internal anchor

InternLM-XComposer generates articles with seamlessly integrated images and achieves state-of-the-art results on vision-language benchmarks including MME, MMBench, and Seed-Bench.

ModelScope Text-to-Video Technical Report

cs.CV · 2023-08-12 · unverdicted · none · ref 50 · internal anchor

ModelScopeT2V is a 1.7-billion-parameter text-to-video model built on Stable Diffusion that adds temporal modeling and outperforms prior methods on three evaluation metrics.

Image-to-Video Diffusion: From Foundations to Open Frontiers

cs.CV · 2026-05-17 · unverdicted · none · ref 101 · internal anchor

A survey that organizes diffusion image-to-video methods into a taxonomy, distills core designs in condition encoding, temporal modeling, noise prior, and upsampling, and discusses applications plus challenges.

Neural Network Optimization Reimagined: Decoupled Techniques for Scratch and Fine-Tuning

cs.CV · 2026-04-21 · unverdicted · none · ref 5 · internal anchor

DualOpt decouples optimization by using real-time layer-wise weight decay for scratch training and weight rollback for fine-tuning to improve convergence, generalization, and reduce knowledge forgetting.

LMMs Meet Object-Centric Vision: Understanding, Segmentation, Editing and Generation

cs.CV · 2026-04-13 · unverdicted · none · ref 142 · internal anchor

This review organizes literature on large multimodal models and object-centric vision into four themes (understanding, referring segmentation, editing, and generation) while summarizing paradigms, strategies, and challenges like instance permanence and consistent interaction.

Looking Beyond the Obvious: A Survey on Abstract Concept Recognition for Video Understanding

cs.CV · 2025-08-28 · unverdicted · none · ref 71 · internal anchor

A literature survey on abstract concept recognition in videos that catalogs prior tasks and datasets while advocating for foundation models and reuse of decades of community experience.

Back into Plato's Cave: Examining Cross-modal Representational Convergence at Scale

cs.CV · 2026-04-20 · unreviewed · ref 79 · internal anchor

CPAM: Context-Preserving Adaptive Manipulation for Zero-Shot Real Image Editing

cs.CV · 2025-06-23 · unreviewed · ref 22 · internal anchor