Scalable diffusion models with transformers
4 Pith papers cite this work. Polarity classification is still indexing.
citation-role summary
citation-polarity summary
fields: cs.CV (4)
verdicts: UNVERDICTED (4)
roles: background (1)
polarities: background (1)
representative citing papers
Backward token warping in ViT-based MLLMs enables reliable reasoning from nearby viewpoints by preserving semantic coherence better than pixel-wise warping or fine-tuning baselines.
A zero-shot subject-driven video generation framework that decomposes the task into identity injection from 200K subject-image pairs and motion preservation from 4K arbitrary videos, trained in 288 A100 GPU hours on CogVideoX-5B to match prior performance at 1% compute.
CR-Diff applies block-wise pruning followed by output amplification to diffusion models, improving consistency and fidelity at unseen resolutions while retaining default-resolution performance.
citing papers explorer
- DecQ: Detail-Condensing Queries for Enhanced Reconstruction and Generation in Representation Autoencoders
  DecQ uses detail-condensing queries on shallow and deep VFM features to improve both reconstruction PSNR and generative convergence/FID in RAEs without fine-tuning the encoder.
- Token Warping Helps MLLMs Look from Nearby Viewpoints
  Backward token warping in ViT-based MLLMs enables reliable reasoning from nearby viewpoints by preserving semantic coherence better than pixel-wise warping or fine-tuning baselines.
- Learning Zero-Shot Subject-Driven Video Generation Using 1% Compute
  A zero-shot subject-driven video generation framework that decomposes the task into identity injection from 200K subject-image pairs and motion preservation from 4K arbitrary videos, trained in 288 A100 GPU hours on CogVideoX-5B to match prior performance at 1% compute.
- Cross-Resolution Diffusion Models via Network Pruning
  CR-Diff applies block-wise pruning followed by output amplification to diffusion models, improving consistency and fidelity at unseen resolutions while retaining default-resolution performance.