arxiv: 2605.14166 · v1 · pith:ITMISEQZnew · submitted 2026-05-13 · 💻 cs.CV

You Only Landmark Once: Lightweight U-Net Face Super Resolution with YOLO-World Landmark Heatmaps

Riccardo Carraro, Anna Briotto, Endi Hysa, Marco Fiorucci, Lamberto Ballan

Pith reviewed 2026-05-15 04:48 UTC · model grok-4.3

classification 💻 cs.CV

keywords face super-resolution · U-Net · YOLO-World · landmark heatmaps · lightweight model · image upscaling · CelebA · spatial loss weighting


The pith

A lightweight U-Net reconstructs 128x128 faces from 16x16 inputs by weighting its loss with YOLO-World landmark heatmaps.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces a U-Net architecture for extreme face super-resolution that reuses heatmaps from an open-vocabulary detector as spatial weights in the training loss. These weights direct reconstruction effort toward eyes, nose, and mouth without any auxiliary landmark network or adversarial component. The method trains and runs efficiently because the detector runs once to supply fixed priors rather than being integrated into the pipeline. On aligned CelebA images the weighted loss raises standard metrics and yields visibly sharper outputs than an unweighted baseline. The approach shows that detection outputs can serve directly as perceptual guidance for lightweight upscaling.

Core claim

Heatmaps produced by YOLO-World on the low-resolution input are converted into per-pixel weights that multiply the pixel-wise reconstruction loss; the resulting heatmap-guided objective trains a standard U-Net to emphasize facial landmarks, delivering 128x128 outputs from 16x16 inputs that are quantitatively and perceptually superior to the same network trained without the weighting.
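The weighting described above can be sketched as follows. This is an illustration under stated assumptions, not the authors' implementation: it assumes an L1 pixel loss, a heatmap already resampled to the 128x128 output grid with values in [0, 1], and weights of the form `1 + alpha * heatmap`. The function name `heatmap_weighted_l1` and the parameter `alpha` are hypothetical.

```python
import numpy as np

def heatmap_weighted_l1(pred, target, heatmap, alpha=1.0):
    """Mean absolute error with per-pixel weights 1 + alpha * heatmap.

    pred, target: float arrays of shape (H, W, C).
    heatmap: float array of shape (H, W), values in [0, 1] (assumed).
    alpha: hypothetical strength of the landmark emphasis.
    """
    weights = 1.0 + alpha * heatmap          # (H, W), never below 1
    abs_err = np.abs(pred - target)          # (H, W, C)
    # Broadcast the spatial weights over the channel axis.
    return float(np.mean(weights[..., None] * abs_err))

# Toy data: a uniform error of 0.1 over a 128x128 RGB image.
target = np.zeros((128, 128, 3))
pred = np.full((128, 128, 3), 0.1)

# Pretend a 20x20 block is a detected landmark region.
heatmap = np.zeros((128, 128))
heatmap[40:60, 30:50] = 1.0

plain_loss = heatmap_weighted_l1(pred, target, heatmap, alpha=0.0)
guided_loss = heatmap_weighted_l1(pred, target, heatmap, alpha=4.0)
```

With `alpha=0.0` the function reduces to the unweighted baseline loss, which makes it easy to compare the two objectives on the same network.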

What carries the argument

YOLO-World landmark heatmaps turned into spatial weights for a heatmap-guided reconstruction loss that emphasizes errors around eyes, nose, and mouth.
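A minimal sketch of how detections might become spatial weights. It assumes the detector yields one centre point per facial feature, that each centre is rendered as a Gaussian blob of width `sigma`, and that the final weight is `1 + alpha * heatmap`. None of these choices is confirmed by the text here; `landmarks_to_heatmap`, `sigma`, the value 4.0, and the example coordinates are all hypothetical.

```python
import numpy as np

def landmarks_to_heatmap(points, size=128, sigma=6.0):
    """Render (x, y) landmark centres as a heatmap in [0, 1].

    Each centre becomes a Gaussian blob; overlapping blobs are
    combined with a pixel-wise maximum so the peak stays at 1.
    """
    ys, xs = np.mgrid[0:size, 0:size]
    heatmap = np.zeros((size, size), dtype=np.float64)
    for x, y in points:
        blob = np.exp(-((xs - x) ** 2 + (ys - y) ** 2) / (2.0 * sigma ** 2))
        heatmap = np.maximum(heatmap, blob)
    return heatmap

# Hypothetical centres for two eyes, nose, and mouth on an aligned face.
centres = [(44, 52), (84, 52), (64, 74), (64, 98)]
heatmap = landmarks_to_heatmap(centres)

# Assumed weight form: background pixels keep weight 1, landmark peaks get 5.
weights = 1.0 + 4.0 * heatmap
```

If the detector returns no points, the heatmap stays at zero and every weight equals 1, so the objective falls back to a plain pixel loss.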

If this is right

No separate landmark or alignment network is required at training or test time.
The full pipeline stays lightweight because the detector is used only once to generate fixed weights.
Quantitative metrics and visual sharpness improve consistently on the CelebA test set.
Adversarial training is unnecessary to obtain realistic reconstructions under the guided loss.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same weighting strategy could be applied to other restoration tasks where an off-the-shelf detector supplies region priors.
Freezing the detector and reusing its outputs may allow the super-resolution model to be trained with fewer epochs or smaller batches.
Performance on unaligned or real-world low-resolution faces remains untested and would determine whether the method requires an explicit alignment stage.
Replacing the pixel loss entirely with a detector-derived perceptual loss might further simplify the objective.

Load-bearing premise

The heatmaps generated by YOLO-World on 16x16 degraded inputs remain sufficiently accurate and aligned to serve as reliable spatial weights.

What would settle it

A controlled test on inputs where YOLO-World produces visibly misplaced or missing landmark heatmaps that results in lower PSNR/SSIM and blurrier faces than the unweighted baseline would falsify the claim.
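One way such a test could be scored, sketched under assumptions: images are floats in [0, 1], and the guided model, the unweighted baseline, and the ground truth are available for the subset of inputs where the detector failed. The helper names `psnr` and `mean_psnr_gap` are hypothetical, and the arrays below are synthetic stand-ins rather than results from the paper.

```python
import numpy as np

def psnr(pred, target, max_val=1.0):
    """Peak signal-to-noise ratio in dB for images scaled to [0, max_val]."""
    mse = np.mean((pred - target) ** 2)
    if mse == 0:
        return float("inf")
    return float(10.0 * np.log10(max_val ** 2 / mse))

def mean_psnr_gap(guided_outputs, baseline_outputs, targets):
    """Average of PSNR(guided) - PSNR(baseline) over a set of images.

    A negative value on the failed-detection subset would count
    against the claim; a positive value would support it.
    """
    gaps = [
        psnr(g, t) - psnr(b, t)
        for g, b, t in zip(guided_outputs, baseline_outputs, targets)
    ]
    return float(np.mean(gaps))

# Synthetic stand-ins: the "guided" output is deliberately worse here.
target = np.zeros((128, 128, 3))
guided = np.full((128, 128, 3), 0.1)     # MSE = 1e-2
baseline = np.full((128, 128, 3), 0.01)  # MSE = 1e-4

gap = mean_psnr_gap([guided], [baseline], [target])
```

The same comparison would need to be repeated with SSIM and a visual check before drawing a conclusion, since PSNR alone does not capture sharpness.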

Figures

Figures reproduced from arXiv: 2605.14166 by Anna Briotto, Endi Hysa, Lamberto Ballan, Marco Fiorucci, Riccardo Carraro.

**Figure 1.** Efficient U-Net architecture for image super-resolution, transforming a low-resolution …

**Figure 2.** Heatmaps generated with landmarks detected by YOLO-World.

**Figure 3.** Qualitative comparison on aligned CelebA (…

**Figure 4.** Qualitative comparison on aligned CelebA (…

Original abstract

Face image super-resolution aims to recover high-resolution facial images from severely degraded inputs. Under extreme upscaling factors, fine facial details are often lost, making accurate reconstruction challenging. Existing methods typically rely on heavy network architectures, adversarial training schemes, or separate alignment networks, increasing model complexity and computational cost. To address these issues, we propose a lightweight U-Net based-architecture designed to reconstructs $128 \times 128$ facial images from severely degraded $16 \times 16$ inputs, achieving an $8\times$ magnification. A key contribution is a novel auxiliary-training-free supervision strategy that leverages heatmaps generated by YOLO-World, an open-vocabulary object detector, to localize key facial features such as eyes, nose, and mouth. These heatmaps are converted into spatial weights to form a heatmap-guided loss that emphasizes reconstruction errors in semantically important regions. Unlike prior methods that require dedicated landmark or alignment networks, our approach directly reuses detector outputs as supervision, maintaining an efficient training and inference pipeline. Experiments on the aligned CelebA dataset demonstrate that the proposed loss consistently improves quantitative metrics and produces sharper, more realistic reconstructions. Overall, our results show that lightweight networks can effectively exploit detection-driven priors for perceptually convincing extreme upscaling, without adversarial training or increased computational cost.
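To make the 8x setting concrete, the sketch below builds a 16x16 input from a 128x128 image. It assumes a plain box filter (averaging each 8x8 block); the abstract does not say which degradation the authors use, so this is only a stand-in, and `box_downsample` is a hypothetical helper.

```python
import numpy as np

def box_downsample(image, factor=8):
    """Average non-overlapping factor x factor blocks of an (H, W, C) image."""
    h, w, c = image.shape
    if h % factor or w % factor:
        raise ValueError("image size must be divisible by the factor")
    # Split each spatial axis into (blocks, factor) and average inside blocks.
    blocks = image.reshape(h // factor, factor, w // factor, factor, c)
    return blocks.mean(axis=(1, 3))

# Random stand-in for a 128x128 RGB face crop with values in [0, 1).
hr = np.random.default_rng(0).random((128, 128, 3))
lr = box_downsample(hr, factor=8)   # shape (16, 16, 3)
```

Each low-resolution pixel summarizes 64 high-resolution pixels, which is why fine detail around the eyes and mouth is largely lost at this scale.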

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

The paper shows a lightweight U-Net for 8x face SR that reuses YOLO-World heatmaps as spatial loss weights without extra networks, but the abstract supplies no numbers or checks on whether those heatmaps work on 16x16 inputs.


The core idea is straightforward: take a plain lightweight U-Net, run YOLO-World on the low-res 16x16 input to get heatmaps for eyes, nose, and mouth, turn those into per-pixel weights, and use them to weight the reconstruction loss. This avoids any separate landmark detector or alignment module and skips adversarial training. On aligned CelebA it reportedly gives sharper outputs and better metrics than the baseline U-Net alone. That reuse of an off-the-shelf open-vocabulary detector is the main practical difference from prior face SR work that builds dedicated sub-networks for landmarks or priors. It keeps training and inference cheap, which matters for edge devices.

The abstract is clear that the heatmaps come directly from the detector and are not learned inside the U-Net, so there is no circularity in the supervision. The approach is honest about staying simple.

The main gap is that the abstract states consistent metric gains and more realistic faces but reports no numbers: no table of PSNR/SSIM/LPIPS against standard baselines, no ablation that isolates the weighting term, and no check on how often YOLO-World actually produces usable heatmaps at 16x16 resolution. YOLO-World was trained on normal-resolution images, so feeding it severely downsampled aligned crops risks collapsed or noisy detections; if that happens the guided loss collapses to ordinary pixel loss plus artifacts. Without those diagnostics the central claim stays unverified.

This is the kind of paper that could be useful to people already working on efficient face restoration pipelines who want a quick way to inject semantic emphasis without extra parameters. It does not reorganize super-resolution theory and the application is narrow, but the implementation looks reproducible enough to test in a day or two. I would send it to peer review so the experiments can be examined properly; the idea is simple enough that a referee can quickly decide if the heatmaps are doing real work.

Referee Report

2 major / 2 minor

Summary. The manuscript proposes a lightweight U-Net for 8× face super-resolution (16×16 degraded inputs to 128×128 outputs) on aligned CelebA. It introduces an auxiliary-training-free heatmap-guided reconstruction loss that converts outputs from the pre-trained open-vocabulary YOLO-World detector into spatial weights emphasizing eyes, nose, and mouth regions. The approach avoids adversarial training, separate alignment networks, or heavy architectures, and claims consistent quantitative metric improvements plus sharper reconstructions.

Significance. If the YOLO-World heatmaps remain reliable on severely degraded 16×16 inputs, the method offers a low-overhead way to inject semantic priors into SR losses without extra parameters or training stages. This could be useful for resource-constrained pipelines, but the absence of reported metric values, baseline comparisons, or heatmap-quality diagnostics in the provided text makes the practical significance difficult to evaluate at present.

major comments (2)

[Abstract] Abstract: the central empirical claim states that the proposed loss 'consistently improves quantitative metrics,' yet no PSNR, SSIM, LPIPS, or other numerical values, baseline comparisons, or ablation results are supplied. Without these data the magnitude and reliability of the reported gains cannot be assessed.
[Method] Method (heatmap generation paragraph): YOLO-World is applied directly to 16×16 inputs to produce landmark heatmaps used as spatial weights. No quantitative validation of heatmap accuracy (e.g., landmark localization error, failure rate, or comparison against ground-truth landmarks on the same degraded inputs) is provided. If detections collapse or misalign, the guided loss reduces to standard pixel loss plus potential artifacts, undermining the 'detection-driven priors' contribution.
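
The failure mode in the second comment can be written out under one assumption: that the spatial weight has the form $w_p = 1 + \alpha h_p$, where $h_p \in [0, 1]$ is the heatmap value at pixel $p$ and $\alpha \ge 0$ is a hypothetical strength parameter. The manuscript's actual mapping from heatmap to weight is not given in the text available here.

$$\mathcal{L}_{\text{guided}} = \frac{1}{N}\sum_{p=1}^{N} (1 + \alpha h_p)\,\lvert \hat{y}_p - y_p \rvert = \mathcal{L}_{\text{pixel}} + \frac{\alpha}{N}\sum_{p=1}^{N} h_p\,\lvert \hat{y}_p - y_p \rvert$$

Under that assumption, an empty heatmap ($h_p = 0$ for every $p$) leaves only $\mathcal{L}_{\text{pixel}}$, while a misplaced heatmap concentrates the extra term on pixels away from the eyes, nose, and mouth.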

minor comments (2)

[Abstract] Abstract contains a grammatical error: 'designed to reconstructs' should read 'designed to reconstruct'.
[Experiments] The manuscript should include a dedicated subsection or table reporting the exact quantitative results, chosen baselines, and ablation isolating the heatmap weighting effect.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive comments on our manuscript. We address each major comment below and indicate the revisions we will make.


Referee: [Abstract] Abstract: the central empirical claim states that the proposed loss 'consistently improves quantitative metrics,' yet no PSNR, SSIM, LPIPS, or other numerical values, baseline comparisons, or ablation results are supplied. Without these data the magnitude and reliability of the reported gains cannot be assessed.

Authors: We agree that the abstract would benefit from greater specificity. The Experiments section reports the full set of quantitative results, including PSNR, SSIM, and LPIPS values with baseline comparisons. In the revised manuscript we will update the abstract to cite the key metric improvements achieved by the heatmap-guided loss. revision: yes
Referee: [Method] Method (heatmap generation paragraph): YOLO-World is applied directly to 16×16 inputs to produce landmark heatmaps used as spatial weights. No quantitative validation of heatmap accuracy (e.g., landmark localization error, failure rate, or comparison against ground-truth landmarks on the same degraded inputs) is provided. If detections collapse or misalign, the guided loss reduces to standard pixel loss plus potential artifacts, undermining the 'detection-driven priors' contribution.

Authors: This is a valid concern. The current manuscript does not include explicit quantitative diagnostics of YOLO-World performance on the 16×16 inputs. In the revision we will add a short analysis (new paragraph or table) reporting landmark localization error and detection success rate on the degraded inputs relative to ground-truth landmarks, thereby confirming that the heatmaps remain sufficiently reliable to provide meaningful spatial guidance. revision: yes
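
The diagnostics promised in this response could be computed along the following lines. This is a sketch, not the authors' protocol: it assumes predicted and ground-truth landmarks are available as arrays of shape (N, K, 2), a per-image detection flag, and a per-image normalizing distance such as the inter-ocular distance. The function name `landmark_diagnostics` and the 0.1 threshold are hypothetical.

```python
import numpy as np

def landmark_diagnostics(pred, gt, detected, norm_dist, threshold=0.1):
    """Detection rate, normalized mean error (NME), and success rate.

    pred, gt:   (N, K, 2) landmark coordinates per image.
    detected:   (N,) booleans, False where the detector returned nothing.
    norm_dist:  (N,) normalizing distance per image (assumed inter-ocular).
    threshold:  hypothetical NME cut-off for counting an image as a success.
    """
    pred = np.asarray(pred, dtype=np.float64)
    gt = np.asarray(gt, dtype=np.float64)
    detected = np.asarray(detected, dtype=bool)
    norm_dist = np.asarray(norm_dist, dtype=np.float64)

    # Mean Euclidean error over the K landmarks, normalized per image.
    errors = np.linalg.norm(pred - gt, axis=-1).mean(axis=1) / norm_dist
    success = detected & (errors <= threshold)
    nme = float(errors[detected].mean()) if detected.any() else float("nan")
    return {
        "detection_rate": float(detected.mean()),
        "nme": nme,                       # averaged over detected images only
        "success_rate": float(success.mean()),
    }

# Toy data: three images, two landmarks each.
gt = np.array([[[0, 0], [10, 0]]] * 3, dtype=float)
pred = gt.copy()
pred[1] += np.array([3.0, 4.0])           # every landmark off by 5 pixels
detected = np.array([True, True, False])  # third image: no detection
report = landmark_diagnostics(pred, gt, detected, norm_dist=[10, 10, 10])
```

Reporting the detection rate separately from the NME matters here, because an image with no detection contributes no error value but still leaves the loss unguided.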

Circularity Check

0 steps flagged

No circularity: external pre-trained detector supplies independent supervision


The manuscript's core mechanism converts outputs from the external, pre-trained YOLO-World detector into spatial weights for a reconstruction loss. This signal is generated outside the U-Net training loop and does not depend on any parameters or fitted quantities internal to the proposed model. No equations, self-citations, or ansatzes are shown that would reduce the claimed metric improvements to a tautological re-expression of the inputs. The derivation chain therefore remains self-contained, with the performance gains presented as empirical outcomes on CelebA rather than predictions forced by construction.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The central claim rests on the domain assumption that a general-purpose detector produces usable landmark heatmaps on 16x16 degraded faces; no free parameters or new entities are introduced in the abstract.

axioms (1)

domain assumption: YOLO-World produces reliable spatial heatmaps for eyes, nose, and mouth when run on 16x16 severely degraded face images
The heatmap-guided loss is defined directly from these detector outputs; if the assumption fails, the weighting provides no useful signal.


