arxiv: 2605.15196 · v1 · submitted 2026-05-14 · 💻 cs.CV · cs.LG

Recognition: 1 theorem link

· Lean Theorem

RefDecoder: Enhancing Visual Generation with Conditional Video Decoding

Xiang Fan , Yuheng Wang , Bohan Fang , Zhongzheng Ren , Ranjay Krishna

Authors on Pith no claims yet

Pith reviewed 2026-05-15 03:17 UTC · model grok-4.3

classification 💻 cs.CV cs.LG

keywords videodecodergenerationrefdecoderimagereferenceconsistencydecoding

0 comments

The pith

RefDecoder adds reference-image conditioning to video VAE decoders through attention, yielding up to 2.1 dB PSNR gains and better consistency on I2V, editing, and style-transfer tasks.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Current video generation systems use powerful conditioned networks to create noisy latents but then decode them with a standard unconditional VAE. This mismatch often blurs fine details or changes the subject from the starting image. RefDecoder fixes the decoder side by running a small image encoder on the reference frame and feeding those detail-rich tokens into the decoder at every resolution stage using attention. The decoder therefore sees both the noisy latent and the exact reference structure at the same time. Experiments on reconstruction benchmarks show clearer frames and higher PSNR. The same decoder can be dropped into existing image-to-video pipelines without retraining and improves subject and background consistency scores on standard benchmarks. The approach also helps style transfer and video editing by keeping the output anchored to the reference.

Core claim

We introduce RefDecoder, a reference-conditioned video VAE decoder by injecting high-fidelity reference image signal directly into the decoding process via reference attention... achieving up to +2.1dB PSNR over the unconditional baselines on the Inter4K, WebVid, and Large Motion reconstruction benchmarks.

Load-bearing premise

That equal conditioning of the decoder via reference attention is sufficient to preserve structural integrity without introducing new artifacts or requiring any fine-tuning of the rest of the pipeline.

read the original abstract

Video generation powers a vast array of downstream applications. However, while the de facto standard, i.e., latent diffusion models, typically employ heavily conditioned denoising networks, their decoders often remain unconditional. We observe that this architectural asymmetry leads to significant loss of detail and inconsistency relative to the input image. To address this, we argue that the decoder requires equal conditioning to preserve structural integrity. We introduce RefDecoder, a reference-conditioned video VAE decoder by injecting high-fidelity reference image signal directly into the decoding process via reference attention. Specifically, a lightweight image encoder maps the reference frame into the detail-rich high-dimensional tokens, which are co-processed with the denoised video latent tokens at each decoder up-sampling stage. We demonstrate consistent improvements across several distinct decoder backbones (e.g., Wan 2.1 and VideoVAE+), achieving up to +2.1dB PSNR over the unconditional baselines on the Inter4K, WebVid, and Large Motion reconstruction benchmarks. Notably, RefDecoder can be directly swapped into existing video generation systems without additional fine-tuning, and we report across-the-board improvements in subject consistency, background consistency, and overall quality scores on the VBench I2V benchmark. Beyond I2V, RefDecoder generalizes well to a wide range of visual generation tasks such as style transfer and video editing refinement.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

RefDecoder adds reference attention to video VAE decoder stages and shows measurable PSNR and consistency gains as a drop-in swap.

read the letter

RefDecoder adds reference attention to video VAE decoder stages and shows measurable PSNR and consistency gains as a drop-in swap. The core move is encoding a reference frame with a lightweight image encoder and mixing those tokens via attention into the decoder at every upsampling stage. This targets the asymmetry where the diffusion denoiser gets heavy conditioning but the decoder stays unconditional. They test the change on backbones like Wan 2.1 and VideoVAE+ and report up to 2.1 dB PSNR lifts on Inter4K, WebVid, and Large Motion reconstruction sets, plus better subject and background consistency on VBench I2V. The no-fine-tuning claim is the practical hook, since it lets people slot the decoder into existing pipelines without touching the rest of the model. It also nods at uses in style transfer and editing refinement, though those get less space. The work is straightforward and the numbers are concrete enough to check. The soft spots sit in the evaluation. Most of the PSNR numbers come from clean reconstruction latents that already align with the reference, so it is not yet clear how the attention behaves when the denoised latent drifts, which is the realistic case in generation. The stress-test note flags exactly this risk of misalignment or new temporal artifacts, and the paper does not appear to include ablations on stage placement, encoder strength, or direct comparisons under noisy latents. Details on the attention implementation and training are also thin in the summary. This is for engineers and researchers who maintain or extend latent video diffusion systems and want a quick decoder upgrade for better input fidelity. A reader already running I2V or editing experiments could try the swap and measure the delta on their own data. It deserves peer review because the method is simple to reproduce and the reported gains are specific, even if referees will likely ask for more tests on actual generation outputs.

Referee Report

3 major / 1 minor

Summary. The manuscript introduces RefDecoder, a reference-conditioned video VAE decoder that injects high-fidelity reference image signals into the decoding process using reference attention at each up-sampling stage. The authors claim this addresses the asymmetry between conditioned denoising networks and unconditional decoders in latent diffusion models for video generation, leading to improved detail preservation and consistency. They report up to +2.1 dB PSNR gains on reconstruction benchmarks like Inter4K, WebVid, and Large Motion, and improvements on VBench I2V without requiring fine-tuning of the pipeline. The method is presented as a plug-and-play module for existing video generation systems and generalizes to tasks like style transfer and video editing.

Significance. If the reported gains hold under realistic generation conditions, this could offer a practical plug-in enhancement for video generation by symmetrizing conditioning. The no-fine-tuning claim is a potential strength for adoption. However, the absence of architecture details, training procedures, and ablations on noisy latents (vs. clean reconstruction) makes it difficult to gauge the true significance or robustness of the central architectural change.

major comments (3)

[Abstract] Abstract: The central claim that reference attention at decoder upsampling stages is 'sufficient to preserve structural integrity' without fine-tuning or regularization is load-bearing but unsupported by ablations; no experiments test cases where denoised latents deviate from the reference (as occurs in I2V generation), leaving open the risk of temporal inconsistencies or artifacts.
[Abstract] Abstract: Gains of up to +2.1 dB PSNR are reported on reconstruction benchmarks (Inter4K, WebVid, Large Motion) using clean latents, but the VBench I2V results lack controls isolating whether improvements persist under realistic denoising noise or across decoder backbones; this undermines applicability to the stated generation use case.
[Abstract] Abstract: The assertion that RefDecoder 'can be directly swapped into existing video generation systems without additional fine-tuning' requires evidence of integration stability and inference behavior with mismatched latents; none is provided, making the plug-and-play claim unverifiable from the given description.

minor comments (1)

The abstract references generalization to style transfer and video editing refinement but supplies no quantitative metrics, qualitative examples, or dedicated evaluation sections for these tasks.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the thoughtful comments. We address each major comment below, providing clarifications on our experimental design and claims. We will incorporate additional details into the revised manuscript to address concerns about architecture and training procedures.

read point-by-point responses

Referee: [Abstract] Abstract: The central claim that reference attention at decoder upsampling stages is 'sufficient to preserve structural integrity' without fine-tuning or regularization is load-bearing but unsupported by ablations; no experiments test cases where denoised latents deviate from the reference (as occurs in I2V generation), leaving open the risk of temporal inconsistencies or artifacts.

Authors: We clarify that the reconstruction benchmarks (Inter4K, WebVid, Large Motion) are designed to evaluate the decoder's ability to reconstruct from clean latents, which is the standard protocol for assessing VAE decoders. The VBench I2V results, however, are obtained by integrating RefDecoder into a full latent diffusion pipeline for image-to-video generation, where the input latents are the output of the denoising process and can deviate from the reference. These results show improvements in consistency metrics, indicating that the reference attention helps mitigate inconsistencies even with denoised latents. We did not include separate ablations with artificially corrupted latents, as the end-to-end I2V evaluation serves as the primary validation for the generation use case. We will add a discussion in the revised paper to explicitly distinguish these settings. revision: partial
Referee: [Abstract] Abstract: Gains of up to +2.1 dB PSNR are reported on reconstruction benchmarks (Inter4K, WebVid, Large Motion) using clean latents, but the VBench I2V results lack controls isolating whether improvements persist under realistic denoising noise or across decoder backbones; this undermines applicability to the stated generation use case.

Authors: The PSNR gains are indeed reported for clean latent reconstruction to demonstrate the decoder's enhanced fidelity. For the generation use case, the VBench I2V benchmark involves the complete pipeline with realistic denoising noise. We tested RefDecoder on two different decoder backbones (Wan 2.1 and VideoVAE+), reporting consistent improvements in subject consistency, background consistency, and overall quality. To further isolate the effect, we can add more detailed controls in the experiments section of the revision. revision: partial
Referee: [Abstract] Abstract: The assertion that RefDecoder 'can be directly swapped into existing video generation systems without additional fine-tuning' requires evidence of integration stability and inference behavior with mismatched latents; none is provided, making the plug-and-play claim unverifiable from the given description.

Authors: The VBench I2V experiments provide evidence for the plug-and-play nature: RefDecoder is integrated into an existing I2V system (using a pre-trained diffusion model) without any fine-tuning of the diffusion components, and we observe improvements in the metrics. This implies stability during inference with the latents produced by the denoising network, which may be mismatched to the reference in terms of details. We acknowledge that more explicit discussion of the integration process and any potential issues with mismatched latents would be beneficial. We will expand the methods and experiments sections to include architecture details, training procedures, and further analysis of inference behavior. revision: partial

Circularity Check

0 steps flagged

No significant circularity detected

full rationale

The paper proposes RefDecoder as an architectural modification to existing video VAE decoders, injecting reference image tokens via attention at upsampling stages. All reported gains (+2.1 dB PSNR on Inter4K/WebVid/Large Motion, VBench I2V improvements) are obtained by direct empirical comparison against unconditional baselines on external benchmarks, with no fitted parameters renamed as predictions, no self-definitional equations, and no load-bearing self-citations or imported uniqueness theorems. The method is presented as a plug-in swap without pipeline retraining, and the central claim rests on observable reconstruction quality rather than any reduction to prior inputs by construction.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 1 invented entities

The approach rests on standard VAE and attention mechanisms plus the assumption that reference tokens can be directly co-processed with latents; no new free parameters or invented physical entities are introduced.

axioms (2)

domain assumption Latent diffusion models with unconditional decoders lose detail relative to conditioned encoders
Stated as the motivating observation in the abstract.
ad hoc to paper Reference attention can be injected at decoder upsampling stages without destabilizing training or inference
Core design choice of RefDecoder.

invented entities (1)

RefDecoder no independent evidence
purpose: Reference-conditioned video VAE decoder
New module introduced in the paper

pith-pipeline@v0.9.0 · 5549 in / 1285 out tokens · 36847 ms · 2026-05-15T03:17:21.949834+00:00 · methodology

discussion (0)

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean washburn_uniqueness_aczel unclear

?

unclear
Relation between the paper passage and the cited Recognition theorem.

a reference-conditioned video VAE decoder by injecting high-fidelity reference image signal directly into the decoding process via reference attention

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.