TTE-Flash trains latent think tokens with CoT generation loss and embedding tokens with contrastive loss to deliver high-performance multimodal representations without generating explicit reasoning at inference time.
International conference on machine learning , pages=
9 Pith papers cite this work. Polarity classification is still indexing.
years
2026 9verdicts
UNVERDICTED 9representative citing papers
GOMA refines frozen multimodal embeddings via modality-aware graph signal smoothing on attributed graphs to improve retrieval while avoiding over-smoothing.
Deep Pre-Alignment uses a small VLM perceiver instead of ViT to pre-align visual features with LLM text space, yielding 1.9-3.0 point gains on multimodal benchmarks and 32.9% less language forgetting.
Text embeddings in MM-DiTs encode a detectable omission signal for missing concepts; amplifying it via OSI reduces concept omission in text-to-image outputs on FLUX.1-Dev and SD3.5-Medium.
SpecPL introduces spectral decomposition via frozen VAE and counterfactual high-frequency permutation to bridge modality asymmetry in VLM prompt learning, reaching 81.51% harmonic-mean accuracy on 11 benchmarks.
Selective replacement of the worst 20-30% of text-only subtitle segments with visual-enhanced outputs raises COMET scores for Indic languages, but full visual grounding is ineffective because of temporal misalignment between subtitles and frames.
ARGUS uses a Prosecutor-Defender-Umpire multi-agent setup plus RAG and chain-of-thought rewards to adapt ad policy enforcement to new regulations using minimal fresh labels.
AFMRL uses MLLM-generated attributes in attribute-guided contrastive learning and retrieval-aware reinforcement to achieve SOTA fine-grained multimodal retrieval on e-commerce datasets.
DiffuSAM fuses diffusion-based localization cues with SAM models to deliver over 14% higher Acc@0.5 in zero-shot object grounding for remote sensing imagery compared to prior methods.
citing papers explorer
-
TTE-Flash: Accelerating Reasoning-based Multimodal Representations via Think-Then-Embed Tokens
TTE-Flash trains latent think tokens with CoT generation loss and embedding tokens with contrastive loss to deliver high-performance multimodal representations without generating explicit reasoning at inference time.
-
GOMA: Toward Structure-Driven Multimodal Alignment from a Graph Signal Smoothing Perspective
GOMA refines frozen multimodal embeddings via modality-aware graph signal smoothing on attributed graphs to improve retrieval while avoiding over-smoothing.
-
Deep Pre-Alignment for VLMs
Deep Pre-Alignment uses a small VLM perceiver instead of ViT to pre-align visual features with LLM text space, yielding 1.9-3.0 point gains on multimodal benchmarks and 32.9% less language forgetting.
-
Diagnosing and Correcting Concept Omission in Multimodal Diffusion Transformers
Text embeddings in MM-DiTs encode a detectable omission signal for missing concepts; amplifying it via OSI reduces concept omission in text-to-image outputs on FLUX.1-Dev and SD3.5-Medium.
-
SpecPL: Disentangling Spectral Granularity for Prompt Learning
SpecPL introduces spectral decomposition via frozen VAE and counterfactual high-frequency permutation to bridge modality asymmetry in VLM prompt learning, reaching 81.51% harmonic-mean accuracy on 11 benchmarks.
-
Towards Visually-Guided Movie Subtitle Translation for Indic Languages
Selective replacement of the worst 20-30% of text-only subtitle segments with visual-enhanced outputs raises COMET scores for Indic languages, but full visual grounding is ineffective because of temporal misalignment between subtitles and frames.
-
ARGUS: Policy-Adaptive Ad Governance via Evolving Reinforcement with Adversarial Umpiring
ARGUS uses a Prosecutor-Defender-Umpire multi-agent setup plus RAG and chain-of-thought rewards to adapt ad policy enforcement to new regulations using minimal fresh labels.
-
AFMRL: Attribute-Enhanced Fine-Grained Multi-Modal Representation Learning in E-commerce
AFMRL uses MLLM-generated attributes in attribute-guided contrastive learning and retrieval-aware reinforcement to achieve SOTA fine-grained multimodal retrieval on e-commerce datasets.
-
DiffuSAM: Diffusion Guided Zero-Shot Object Grounding for Remote Sensing Imagery
DiffuSAM fuses diffusion-based localization cues with SAM models to deliver over 14% higher Acc@0.5 in zero-shot object grounding for remote sensing imagery compared to prior methods.