DreamSim: Learning new dimensions of human visual similarity using synthetic data. arXiv preprint arXiv:2306.09344

Stephanie Fu, Netanel Tamir, Shobhita Sundaram, Lucy Chai, Richard Zhang, Tali Dekel, Phillip Isola · 2023 · arXiv 2306.09344

Mixed citation behavior. Most common role is background (50%).

20 Pith papers citing it

Background 50% of classified citations

citation-role summary

background 2 · dataset 2 · method 2

citation-polarity summary

background 3 · use method 2 · use dataset 1

representative citing papers

Capability $\neq$ Interpretability: Human Interpretability of Vision Foundation Models

cs.CV · 2026-05-19 · conditional · novelty 7.0

Foundation models yield less human-interpretable features than supervised vision transformers, with interpretability tied to activation locality and coarse semantic alignment rather than task performance.

Evaluating Remote Sensing Image Captions Beyond Metric Biases

cs.CV · 2026-04-22 · unverdicted · novelty 7.0

Unfine-tuned MLLMs outperform fine-tuned models on remote sensing image captioning when captions are scored by their ability to reconstruct the source image, and a training-free self-correction method achieves SOTA performance.

Novel View Synthesis as Video Completion

cs.CV · 2026-04-09 · unverdicted · novelty 7.0

Video diffusion models can be adapted into permutation-invariant generators for sparse novel view synthesis by treating the problem as video completion and removing temporal order cues.

PromptEvolver: Prompt Inversion through Evolutionary Optimization in Natural-Language Space

cs.LG · 2026-04-03 · unverdicted · novelty 7.0

PromptEvolver recovers high-fidelity natural language prompts for given images by evolving them via genetic algorithm guided by a vision-language model, outperforming prior methods on benchmarks.

ProDiG: Progressive Diffusion-Guided Gaussian Splatting for Aerial to Ground Reconstruction

cs.CV · 2026-04-02 · unverdicted · novelty 7.0

ProDiG progressively transforms aerial Gaussian splats into coherent ground-level 3D reconstructions via diffusion guidance and specialized attention modules.

It's Never Too Late: Noise Optimization for Collapse Recovery in Trained Diffusion Models

cs.CV · 2025-12-31 · unverdicted · novelty 7.0

Noise optimization during sampling recovers diversity in mode-collapsed diffusion models while preserving output fidelity.

Generative Giants, Retrieval Weaklings: Why do Multimodal Large Language Models Fail at Multimodal Retrieval?

cs.CV · 2025-12-22 · conditional · novelty 7.0

MLLM representation spaces are dominated by textual semantics that reduce discriminative power for multimodal retrieval; a whitening transformation called ReAlign corrects the geometry and boosts zero-shot performance.

Setting the Stage: Text-Driven Scene-Consistent Image Generation

cs.CV · 2025-12-14 · conditional · novelty 7.0

A new data pipeline using real photos, entity removal, and image-to-video models plus a cross-view attention loss enables text-driven generation of actors in reference scenes with improved alignment.

VLM2Vec: Training Vision-Language Models for Massive Multimodal Embedding Tasks

cs.CV · 2024-10-07 · conditional · novelty 7.0

VLM2Vec converts state-of-the-art vision-language models into universal multimodal embedders via contrastive training on the new MMEB benchmark, delivering 10-20% absolute gains over prior models on both in-distribution and out-of-distribution tasks.

AttriStory: Fine-grained Attribute Realization for Visual Storytelling with Diffusion Models

cs.CV · 2026-05-20 · unverdicted · novelty 6.0

AttriStory adds a benchmark and AttriLoss-based latent optimization to improve faithful rendering of fine-grained attributes such as clothing color and texture in diffusion-model visual storytelling.

Improved Baselines with Representation Autoencoders

cs.CV · 2026-05-18 · conditional · novelty 6.0

RAE v2 reaches gFID 1.06 on ImageNet-256 in 80 epochs by combining multi-layer encoder sums, complementary REPA targets, and free guidance via output reparameterization.

Stylistic Attribute Control in Latent Diffusion Models

cs.CV · 2026-05-04 · unverdicted · novelty 6.0

A technique for parametric stylistic control in latent diffusion models learns disentangled directions from synthetic datasets and applies them via guidance composition while preserving semantics.

(1D) Ordered Tokens Enable Efficient Test-Time Search

cs.CV · 2026-04-16 · unverdicted · novelty 6.0

Coarse-to-fine 1D token sequences in autoregressive models enable stronger test-time search and even training-free text-to-image generation guided by verifiers, outperforming traditional 2D grid tokenization.

GoViG: Goal-Conditioned Visual Navigation Instruction Generation via Multimodal Reasoning

cs.CV · 2025-08-13 · unverdicted · novelty 6.0

GoViG decomposes goal-conditioned navigation instruction generation into visual state prediction and instruction synthesis using an autoregressive multimodal LLM with one-pass and interleaved reasoning, showing gains on a new R2R-Goal dataset.

Slot-MLLM: Object-Centric Visual Tokenization for Multimodal LLM

cs.CV · 2025-05-23 · unverdicted · novelty 6.0

Slot-MLLM introduces a slot-attention-based object-centric visual tokenizer with Q-Former encoder, diffusion decoder, and residual vector quantization for improved local visual comprehension and generation in multimodal LLMs.

Personalized Face Privacy Protection From a Single Image

cs.CV · 2026-05-18 · unverdicted · novelty 5.0

FaceCloak learns a lightweight identity-specific cloaking mask from a single image via synthetic face generation and iterative embedding perturbation to evade multiple recognition models.

RealDiffusion: Physics-informed Attention for Multi-character Storybook Generation

cs.CV · 2026-05-12 · unverdicted · novelty 5.0

RealDiffusion uses heat diffusion as a dissipative prior and a region-aware stochastic process inside a training-free physics-informed attention mechanism to improve multi-character coherence while preserving narrative dynamism in sequential image generation.

SyncFix: Fixing 3D Reconstructions via Multi-View Synchronization

cs.CV · 2026-04-13 · unverdicted · novelty 5.0

SyncFix improves 3D reconstructions by synchronizing multi-view latent representations in a diffusion refinement process, generalizing from pair-wise training to arbitrary view counts at inference.

ID-Sim: An Identity-Focused Similarity Metric

cs.CV · 2026-04-06 · unverdicted · novelty 5.0

ID-Sim is a new similarity metric that aims to capture human selective sensitivity to identities by training on curated real and generative synthetic data and validating against human annotations on recognition, retrieval, and generative tasks.

World Action Models: The Next Frontier in Embodied AI

cs.RO · 2026-05-12 · unverdicted · novelty 4.0

The paper introduces World Action Models as a new paradigm unifying predictive world modeling with action generation in embodied foundation models and provides a taxonomy of existing approaches.

citing papers explorer

Showing 20 of 20 citing papers.

Capability $\neq$ Interpretability: Human Interpretability of Vision Foundation Models · cs.CV · 2026-05-19 · conditional · none · ref 45
Evaluating Remote Sensing Image Captions Beyond Metric Biases · cs.CV · 2026-04-22 · unverdicted · none · ref 12
Novel View Synthesis as Video Completion · cs.CV · 2026-04-09 · unverdicted · none · ref 9
PromptEvolver: Prompt Inversion through Evolutionary Optimization in Natural-Language Space · cs.LG · 2026-04-03 · unverdicted · none · ref 10
ProDiG: Progressive Diffusion-Guided Gaussian Splatting for Aerial to Ground Reconstruction · cs.CV · 2026-04-02 · unverdicted · none · ref 7
It's Never Too Late: Noise Optimization for Collapse Recovery in Trained Diffusion Models · cs.CV · 2025-12-31 · unverdicted · none · ref 17
Generative Giants, Retrieval Weaklings: Why do Multimodal Large Language Models Fail at Multimodal Retrieval? · cs.CV · 2025-12-22 · conditional · none · ref 1
Setting the Stage: Text-Driven Scene-Consistent Image Generation · cs.CV · 2025-12-14 · conditional · none · ref 4
VLM2Vec: Training Vision-Language Models for Massive Multimodal Embedding Tasks · cs.CV · 2024-10-07 · conditional · none · ref 9
AttriStory: Fine-grained Attribute Realization for Visual Storytelling with Diffusion Models · cs.CV · 2026-05-20 · unverdicted · none · ref 5
Improved Baselines with Representation Autoencoders · cs.CV · 2026-05-18 · conditional · none · ref 17
Stylistic Attribute Control in Latent Diffusion Models · cs.CV · 2026-05-04 · unverdicted · none · ref 25
(1D) Ordered Tokens Enable Efficient Test-Time Search · cs.CV · 2026-04-16 · unverdicted · none · ref 1
GoViG: Goal-Conditioned Visual Navigation Instruction Generation via Multimodal Reasoning · cs.CV · 2025-08-13 · unverdicted · none · ref 8
Slot-MLLM: Object-Centric Visual Tokenization for Multimodal LLM · cs.CV · 2025-05-23 · unverdicted · none · ref 13
Personalized Face Privacy Protection From a Single Image · cs.CV · 2026-05-18 · unverdicted · none · ref 60
RealDiffusion: Physics-informed Attention for Multi-character Storybook Generation · cs.CV · 2026-05-12 · unverdicted · none · ref 5
SyncFix: Fixing 3D Reconstructions via Multi-View Synchronization · cs.CV · 2026-04-13 · unverdicted · none · ref 6
ID-Sim: An Identity-Focused Similarity Metric · cs.CV · 2026-04-06 · unverdicted · none · ref 18
World Action Models: The Next Frontier in Embodied AI · cs.RO · 2026-05-12 · unverdicted · none · ref 209
