MetaSR: Content-Adaptive Metadata Orchestration for Generative Super-Resolution
Pith reviewed 2026-05-07 13:42 UTC · model grok-4.3
The pith
MetaSR shows that a Diffusion Transformer can select and fuse content-dependent metadata to guide generative super-resolution, yielding higher quality at lower transmission bitrate.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
MetaSR is a Diffusion Transformer framework that dynamically selects heterogeneous metadata according to content type and degradation, then fuses the chosen cues inside the DiT's VAE and transformer backbone. Combined with an efficient distillation procedure that reduces inference to a single step, the method produces reconstructions whose PSNR and SSIM exceed those of fixed-metadata baselines while cutting the transmitted bitrate by up to half under a joint rate-distortion optimization that accounts for both sender transmission cost and display quality.
What carries the argument
The DiT-based content-adaptive metadata fusion that selects task-relevant side information and injects it through the model's own VAE and transformer layers under bitrate limits.
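The abstract says metadata is fused through the DiT's own VAE and transformer layers but does not give the fusion operator. A minimal numpy sketch of an adaptive-LayerNorm injection, the conditioning mechanism DiT-style backbones commonly use; the function name, shapes, and projection `W` are hypothetical, not taken from the paper:

```python
import numpy as np

def adaln_fuse(h: np.ndarray, meta_emb: np.ndarray, W: np.ndarray) -> np.ndarray:
    """Hypothetical AdaLN-style fusion: the metadata embedding predicts a
    per-channel scale and shift applied to layer-normalized token features.
    h:        (tokens, dim) transformer activations
    meta_emb: (m,) embedding of the selected metadata
    W:        (m, 2*dim) learned projection (illustrative only)"""
    mu = h.mean(axis=-1, keepdims=True)
    sigma = h.std(axis=-1, keepdims=True) + 1e-6
    normed = (h - mu) / sigma
    dim = h.shape[-1]
    scale_shift = meta_emb @ W              # (2*dim,)
    scale, shift = scale_shift[:dim], scale_shift[dim:]
    # Zero metadata (or a zero-initialized W) leaves the features at plain
    # LayerNorm, so the injection starts as an identity-like perturbation.
    return normed * (1.0 + scale) + shift
```

The appeal of this pattern, and plausibly of the paper's design, is that it adds conditioning without a separate network: the backbone's existing normalization points become the injection sites.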
If this is right
- Higher PSNR and SSIM at the same transmitted bitrate across text, motion, cartoon, and face content.
- Up to 50 percent bitrate reduction at matched reconstruction quality.
- One-step inference after distillation, enabling practical receiver-side deployment.
- Joint optimization of sender bitrate and receiver quality metrics inside the rate-distortion loop.
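The last bullet hinges on the joint rate-distortion loop. The paper does not spell out its objective; a minimal sketch of the Lagrangian form such a loop typically minimizes, with PSNR standing in for the receiver-side quality term (the function names and the candidate structure are hypothetical):

```python
import numpy as np

def psnr(ref: np.ndarray, rec: np.ndarray, peak: float = 255.0) -> float:
    """Peak signal-to-noise ratio in dB between reference and reconstruction."""
    mse = np.mean((ref.astype(np.float64) - rec.astype(np.float64)) ** 2)
    return float("inf") if mse == 0 else 10.0 * np.log10(peak ** 2 / mse)

def rd_cost(ref, rec, bits: float, lam: float) -> float:
    """Lagrangian rate-distortion cost J = D + lambda * R.
    Distortion D is taken as negative PSNR so lower cost means better
    quality; `lam` trades display quality against the bits spent
    transmitting the metadata for this segment."""
    return -psnr(ref, rec) + lam * bits

def select_metadata(ref, candidates: dict, lam: float) -> str:
    """Pick the metadata option with the lowest RD cost for a segment.
    `candidates` maps a metadata label to (reconstruction, bitrate)."""
    return min(candidates, key=lambda k: rd_cost(ref, candidates[k][0],
                                                 candidates[k][1], lam))
```

At small `lam` the selector favors quality-heavy metadata; as `lam` grows, cheaper side information wins, which is the mechanism behind "matched quality at lower bitrate" claims.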
Where Pith is reading between the lines
- The same adaptive fusion idea could be tested on other conditional generative models that receive side information under bandwidth limits.
- Real-time video pipelines might benefit from the reduced per-frame transmission cost if the selection logic runs at the encoder.
- The distillation step suggests a route to lower sender computation when metadata choices must be made on the fly.
Load-bearing premise
The DiT backbone can reliably combine different metadata types in a content-dependent manner without adding artifacts or imposing heavy sender-side computation under real transmission limits.
What would settle it
An experiment on a held-out content class or degradation regime in which the adaptive metadata version shows no PSNR gain or bitrate reduction relative to a fixed-metadata baseline, or introduces visible artifacts.
Original abstract
We study generative super-resolution (SR) in real-world scenarios where content and degradations vary across domains, genres, and segments. For example, images and videos may alternate between text overlays, fast motion, smooth cartoons, and low-light faces, each benefiting from different forms of side information. Existing metadata-guided SR methods typically use a fixed conditioning design, which is suboptimal when useful cues are content dependent and transmission budgets are limited. We propose MetaSR, a Diffusion Transformer (DiT)-based framework that selects and injects task-relevant metadata to guide SR under resource constraints. Specifically, we use the DiT's own VAE and transformer backbone to fuse heterogeneous metadata, and adopt an efficient distillation strategy that enables one-step diffusion inference. Experiments across diverse content buckets and degradation regimes show that MetaSR outperforms reference solutions by up to 1.0 dB PSNR while achieving up to 50% transmission bitrate saving at matched quality. We assess these gains under a rate-distortion optimization (RDO) framework that jointly accounts for sender-side bitrate and receiver/display quality metrics (e.g., PSNR and SSIM).
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript proposes MetaSR, a Diffusion Transformer (DiT)-based framework for generative super-resolution that performs content-adaptive selection and injection of heterogeneous metadata to guide reconstruction under transmission constraints. It fuses metadata using the model's own VAE and transformer backbone and applies distillation to enable one-step inference. Under a rate-distortion optimization (RDO) framework, experiments across content buckets and degradation regimes report gains of up to 1.0 dB PSNR and up to 50% transmission bitrate savings at matched quality relative to reference solutions.
Significance. If the reported gains prove robust, the work could meaningfully advance practical generative SR for variable real-world content by reducing transmission overhead while adapting metadata to content type. A strength is the parameter-efficient fusion strategy that reuses the DiT backbone itself rather than introducing separate conditioning networks, together with the distillation approach that supports low-latency inference.
Major comments (1)
- The RDO framework (Abstract and experimental evaluation) jointly optimizes sender-side transmission bitrate against receiver PSNR/SSIM, yet the computational overhead of per-segment metadata selection and DiT-based feature extraction at the sender is neither quantified nor folded into the optimization. This is load-bearing for the 50% bitrate-saving claim, because real transmission constraints include total cost; if sender compute is material, the net savings would be lower than stated.
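One way to act on this comment is to extend the rate-distortion Lagrangian with a compute term, J = D + λR + μC, so sender-side cost enters the same trade-off. A toy sketch; λ, μ, and the numbers below are illustrative, not from the paper:

```python
def total_cost(psnr_db: float, bits: float, compute: float,
               lam: float, mu: float) -> float:
    """Extended RDO cost J = -PSNR + lam*R + mu*C. `mu` converts
    sender-side compute (FLOPs or wall-clock) into the same units as
    the distortion term; mu = 0 recovers a bitrate-only RDO."""
    return -psnr_db + lam * bits + mu * compute

# Illustrative: adaptive metadata wins on bits and quality, but once
# sender compute is priced in (mu > 0) a fixed baseline can win.
adaptive = dict(psnr_db=30.0, bits=50.0, compute=100.0)
fixed = dict(psnr_db=29.0, bits=100.0, compute=10.0)
```

This makes the referee's point concrete: the reported 50% saving is the μ = 0 reading, and the net advantage shrinks as the compute price rises.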
Minor comments (2)
- The abstract states concrete performance numbers (1.0 dB PSNR, 50% bitrate) without naming the datasets, number of content buckets, or baseline methods; a one-sentence summary of the experimental protocol would improve readability.
- Notation for the metadata fusion step inside the transformer backbone could be clarified with a short equation or diagram reference to avoid ambiguity about how heterogeneous inputs are combined.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback on our manuscript. We address the major comment below and describe the revisions we will incorporate.
Point-by-point responses
-
Referee: The RDO framework (Abstract and experimental evaluation) jointly optimizes sender-side transmission bitrate against receiver PSNR/SSIM, yet the computational overhead of per-segment metadata selection and DiT-based feature extraction at the sender is neither quantified nor folded into the optimization. This is load-bearing for the 50% bitrate-saving claim, because real transmission constraints include total cost; if sender compute is material, the net savings would be lower than stated.
Authors: We agree that sender-side computational overhead is a relevant factor for real-world rate-distortion trade-offs and that its omission limits the strength of the transmission-bitrate-saving claims. Our architecture reuses the existing VAE and DiT backbone for metadata fusion rather than adding dedicated conditioning modules, which keeps incremental cost modest relative to a standard DiT forward pass. However, we did not provide explicit runtime or FLOPs measurements for the per-segment selection and extraction steps. In the revised manuscript we will add a dedicated analysis section that reports sender-side compute (wall-clock time and FLOPs) on representative hardware across content buckets, together with a discussion of when transmission cost dominates versus when compute becomes material. This will allow readers to assess net savings under different deployment constraints.
Revision: yes
Circularity Check
No circularity: empirical gains reported from experiments, not derived by construction
Full rationale
The manuscript presents MetaSR as a DiT-based framework for content-adaptive metadata selection and fusion, with gains (up to 1.0 dB PSNR and 50% bitrate saving) framed as experimental outcomes under an RDO framework. No equations, first-principles derivations, or predictions appear that reduce these results to fitted parameters, self-definitions, or self-citation chains. The method description (VAE/transformer fusion plus distillation) is presented as a design choice evaluated empirically across content buckets, without load-bearing uniqueness theorems or ansatz smuggling from prior self-work. This is the standard non-circular pattern for an applied ML proposal paper.