International Conference on Machine Learning

BLIP: Bootstrapping Language-Image Pre-training for Unified Vision-Language Understanding and Generation · 2022

9 Pith papers cite this work. Polarity classification is still indexing.

representative citing papers

TTE-Flash: Accelerating Reasoning-based Multimodal Representations via Think-Then-Embed Tokens

cs.AI · 2026-05-15 · unverdicted · novelty 6.0

TTE-Flash trains latent think tokens with CoT generation loss and embedding tokens with contrastive loss to deliver high-performance multimodal representations without generating explicit reasoning at inference time.

GOMA: Toward Structure-Driven Multimodal Alignment from a Graph Signal Smoothing Perspective

cs.LG · 2026-05-15 · unverdicted · novelty 6.0

GOMA refines frozen multimodal embeddings via modality-aware graph signal smoothing on attributed graphs to improve retrieval while avoiding over-smoothing.

Deep Pre-Alignment for VLMs

cs.CV · 2026-05-14 · unverdicted · novelty 6.0

Deep Pre-Alignment uses a small VLM perceiver instead of ViT to pre-align visual features with LLM text space, yielding 1.9-3.0 point gains on multimodal benchmarks and 32.9% less language forgetting.

Diagnosing and Correcting Concept Omission in Multimodal Diffusion Transformers

cs.CV · 2026-05-14 · unverdicted · novelty 6.0

Text embeddings in MM-DiTs encode a detectable omission signal for missing concepts; amplifying it via OSI reduces concept omission in text-to-image outputs on FLUX.1-Dev and SD3.5-Medium.

SpecPL: Disentangling Spectral Granularity for Prompt Learning

cs.CV · 2026-05-06 · unverdicted · novelty 6.0

SpecPL introduces spectral decomposition via frozen VAE and counterfactual high-frequency permutation to bridge modality asymmetry in VLM prompt learning, reaching 81.51% harmonic-mean accuracy on 11 benchmarks.

Towards Visually-Guided Movie Subtitle Translation for Indic Languages

cs.CL · 2026-05-12 · unverdicted · novelty 5.0

Selective replacement of the worst 20-30% of text-only subtitle segments with visual-enhanced outputs raises COMET scores for Indic languages, but full visual grounding is ineffective because of temporal misalignment between subtitles and frames.

ARGUS: Policy-Adaptive Ad Governance via Evolving Reinforcement with Adversarial Umpiring

cs.CL · 2026-05-04 · unverdicted · novelty 5.0

ARGUS uses a Prosecutor-Defender-Umpire multi-agent setup plus RAG and chain-of-thought rewards to adapt ad policy enforcement to new regulations using minimal fresh labels.

AFMRL: Attribute-Enhanced Fine-Grained Multi-Modal Representation Learning in E-commerce

cs.CL · 2026-04-22 · unverdicted · novelty 5.0

AFMRL uses MLLM-generated attributes in attribute-guided contrastive learning and retrieval-aware reinforcement to achieve SOTA fine-grained multimodal retrieval on e-commerce datasets.

DiffuSAM: Diffusion Guided Zero-Shot Object Grounding for Remote Sensing Imagery

cs.CV · 2026-04-20 · unverdicted · novelty 5.0

DiffuSAM fuses diffusion-based localization cues with SAM models to deliver over 14% higher Acc@0.5 in zero-shot object grounding for remote sensing imagery compared to prior methods.

citing papers explorer

Showing 9 of 9 citing papers.

TTE-Flash: Accelerating Reasoning-based Multimodal Representations via Think-Then-Embed Tokens · cs.AI · 2026-05-15 · unverdicted · none · ref 30
GOMA: Toward Structure-Driven Multimodal Alignment from a Graph Signal Smoothing Perspective · cs.LG · 2026-05-15 · unverdicted · none · ref 52
Deep Pre-Alignment for VLMs · cs.CV · 2026-05-14 · unverdicted · none · ref 158
Diagnosing and Correcting Concept Omission in Multimodal Diffusion Transformers · cs.CV · 2026-05-14 · unverdicted · none · ref 19
SpecPL: Disentangling Spectral Granularity for Prompt Learning · cs.CV · 2026-05-06 · unverdicted · none · ref 61
Towards Visually-Guided Movie Subtitle Translation for Indic Languages · cs.CL · 2026-05-12 · unverdicted · none · ref 9
ARGUS: Policy-Adaptive Ad Governance via Evolving Reinforcement with Adversarial Umpiring · cs.CL · 2026-05-04 · unverdicted · none · ref 15
AFMRL: Attribute-Enhanced Fine-Grained Multi-Modal Representation Learning in E-commerce · cs.CL · 2026-04-22 · unverdicted · none · ref 39
DiffuSAM: Diffusion Guided Zero-Shot Object Grounding for Remote Sensing Imagery · cs.CV · 2026-04-20 · unverdicted · none · ref 21
