pith. machine review for the scientific record.

arxiv: 2605.07495 · v1 · submitted 2026-05-08 · 💻 cs.CV

Recognition: no theorem link

Lightweight Unpaired Smartphone ISP Transfer with Semantic Pseudo-Pairing

Flavien Armangeon, Yanhao Li, Yujin Cho

Pith reviewed 2026-05-11 02:08 UTC · model grok-4.3

classification 💻 cs.CV
keywords unpaired ISP · semantic pseudo-pairing · DINOv2 · fused Gromov-Wasserstein · lightweight CNN · color rendering · smartphone camera · NTIRE challenge

The pith

Semantic pseudo-pairs built from DINOv2 embeddings and fused Gromov-Wasserstein transport let a 7K-parameter CNN perform color rendering on unpaired RAW smartphone images.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper aims to solve unpaired smartphone image signal processing (ISP) by creating aligned input-target pairs without real scene or color matches. It reconstructs full-size images from patches, extracts semantic features with DINOv2, and matches RAW and RGB domains using fused Gromov-Wasserstein optimal transport at both image and patch scales. These pseudo-pairs then train a compact CNN that handles only color transformation, avoiding structural changes and adversarial instability. A reader would care because real paired RAW-RGB datasets are expensive to collect, and lightweight models matter for on-device cameras. The method reports competitive challenge scores without heavy training tricks.
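The matching stage can be sketched in simplified form. The sketch below is illustrative, not the paper's pipeline: generic L2-normalised embeddings stand in for DINOv2, and a Hungarian assignment on a fused feature-plus-structure cost stands in for the fused Gromov-Wasserstein solver (a linear assignment cannot capture FGW's quadratic structural term); all names are hypothetical.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment
from scipy.spatial.distance import cdist

def pseudo_pair(raw_feats, rgb_feats, alpha=0.5):
    """Match RAW images to RGB targets on a fused cost.

    raw_feats, rgb_feats: (n, d) / (m, d) semantic embeddings
    (DINOv2 in the paper; any feature vectors here). alpha trades
    feature cost against a crude structural cost, echoing the alpha
    weight in fused Gromov-Wasserstein. Returns (raw_idx, rgb_idx)
    pseudo-pairs.
    """
    r = raw_feats / np.linalg.norm(raw_feats, axis=1, keepdims=True)
    g = rgb_feats / np.linalg.norm(rgb_feats, axis=1, keepdims=True)
    feat_cost = 1.0 - r @ g.T  # cosine distance between embeddings

    # One-number structural signature per item: mean distance to the
    # rest of its own domain (a linearised stand-in for the
    # pairwise-distance matrices that FGW actually compares).
    sig_r = cdist(r, r).mean(axis=1)
    sig_g = cdist(g, g).mean(axis=1)
    struct_cost = np.abs(sig_r[:, None] - sig_g[None, :])

    cost = alpha * feat_cost + (1.0 - alpha) * struct_cost
    rows, cols = linear_sum_assignment(cost)
    return list(zip(rows.tolist(), cols.tolist()))
```

When the two feature sets are permutations of each other, the assignment recovers the permutation exactly; on real unpaired data it can only find the semantically closest available target for each input.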

Core claim

The central claim is that semantic embeddings from DINOv2 combined with fused Gromov-Wasserstein optimal transport can generate sufficiently accurate pseudo-pairs between unpaired RAW and target RGB images; these pairs then support training a 7K-parameter CNN focused solely on color rendering, which achieves 22.569 PSNR, 0.675 SSIM and 8.067 ΔE on the hidden test set and ranks third in SSIM and ΔE among challenge entries.
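The two pixel metrics behind those numbers can be written down directly. A minimal sketch, assuming images scaled to [0, 1] and, for ΔE, inputs already converted to CIELAB; the challenge's exact ΔE variant and color pipeline are not specified here:

```python
import numpy as np

def psnr(pred, target, data_range=1.0):
    # Peak signal-to-noise ratio in dB over the whole image.
    mse = np.mean((pred.astype(np.float64) - target.astype(np.float64)) ** 2)
    return 10.0 * np.log10(data_range ** 2 / mse)

def delta_e76(lab_pred, lab_target):
    # CIE76 colour difference: per-pixel Euclidean distance in CIELAB,
    # averaged over the image. Assumes inputs are already Lab arrays.
    diff = lab_pred.astype(np.float64) - lab_target.astype(np.float64)
    return float(np.mean(np.linalg.norm(diff, axis=-1)))
```

Lower ΔE and higher PSNR are better; the reported 8.067 ΔE is thus an average perceptual color error per pixel on the hidden test set.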

What carries the argument

Semantic pseudo-pairing, which extracts DINOv2 embeddings from reconstructed images and matches RAW-to-RGB domains via fused Gromov-Wasserstein optimal transport at image and patch levels to create training pairs.

If this is right

  • Adversarial losses become unnecessary once semantic alignment supplies the training signal.
  • Restricting the network to color-only operations on 7K parameters reduces artifacts and improves stability.
  • Multi-scale matching (image plus patch) supplies both global context and local consistency.
  • The approach improves all reported metrics over the provided baseline on the final test set.
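The 7K budget is concrete enough to sanity-check. The configuration below is hypothetical (the paper's exact architecture is not given in this summary; Figure 5 mentions 1x1 convolutions), but it shows how an all-1x1 stack, which can remap colors per pixel yet cannot alter structure, fits under roughly 7K parameters:

```python
def conv_params(c_in, c_out, k):
    # Weights plus biases for one 2-D convolution layer.
    return c_in * c_out * k * k + c_out

# Hypothetical all-1x1 colour-rendering stack: 4 packed Bayer channels
# in, RGB out. 1x1 kernels do no spatial mixing, so the network can
# only shift colours, which matches the color-only design intent.
layers = [(4, 56, 1), (56, 56, 1), (56, 56, 1), (56, 3, 1)]
total = sum(conv_params(*layer) for layer in layers)
print(total)  # 6835 parameters, under the 7K budget
```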

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same semantic-matching recipe could extend to other unpaired low-level tasks such as denoising or white-balance correction where paired data is scarce.
  • Vision-transformer features appear to capture color-relevant semantics better than low-level statistics for cross-domain pairing.
  • The extremely small parameter count suggests the method could run in real time on mobile hardware once the pseudo-pair stage is replaced by a learned matcher.

Load-bearing premise

The pseudo-pairs generated by DINOv2 and fused Gromov-Wasserstein transport are accurate enough in both semantics and alignment that the CNN learns correct color mapping without inheriting large errors from the unpaired data.

What would settle it

Training the same 7K-parameter CNN on a baseline without the pseudo-pair step and observing that PSNR drops below 22, SSIM falls below 0.67, or ΔE rises above 8 on the identical hidden test set would falsify the claim that the semantic pairs are the enabling factor.

Figures

Figures reproduced from arXiv: 2605.07495 by Flavien Armangeon, Yanhao Li, Yujin Cho.

Figure 1. Our lightweight unpaired smartphone ISP produces visually coherent full-image results from patch-wise predictions, retrieving …
Figure 2. Overview of the proposed method. The input RAW patches are first converted into pseudo-RGB representations through a fixed pre-processing pipeline. Source RAW patches and target RGB patches are then reconstructed into full images to recover global scene context. In the unpaired setting, semantically similar full-image candidates are retrieved using DINOv2 features and optimal transport (A). Patch-level mat…
Figure 3. Visualization of our two-stage pseudo-pairing strategy. We first retrieve semantically similar full images to obtain coarse RAW–…
Figure 4. Parameter efficiency versus overall ranking on the …
Figure 5. Top: Examples of reconstructed images using 1x1 conv…
Figure 6. Comparison of test-set predictions obtained with differ…
Figure 8. More qualitative examples from the test phase of the challenge. Our lightweight unpaired smartphone ISP produces visually…
Figure 9. More qualitative examples from the test phase of the challenge, comparing predictions obtained with random pairing and with…
Original abstract

Unpaired smartphone ISP is a challenging problem due to the lack of scene and color alignment between RAW and target RGB images. Many existing methods either require paired data or rely heavily on adversarial training, which can become unstable in the unpaired setting. In this work, we present a simple and effective approach developed for the NTIRE 2026 Learned Smartphone ISP Challenge with Unpaired Data. Our method first reconstructs larger images from training patches to recover global context. Then, we extract semantic embeddings with DINOv2, and use fused Gromov-Wasserstein (FGW) optimal transport to build pseudo pairs between RAW and RGB images at both image and patch levels. This semantic matching allows us to partially alleviate the unpairedness of the data and build these pseudo input-target pairs. Based on these pseudo pairs, we train a lightweight CNN with only 7K parameters for color rendering. The network is designed to be compact and focus on color transformation rather than structural change, which helps reduce artifacts and improve training stability. Our challenge submission achieves 22.569 PSNR, 0.675 SSIM, and 8.067 $\Delta E$ on the final hidden test set, significantly improving over the baseline and achieving the 3rd best SSIM and $\Delta E$ among all challenge entries. Our code is available at github.com/nuniniyujin/Unpaired-ISP .

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper presents a lightweight method for unpaired smartphone ISP transfer developed for the NTIRE 2026 challenge. It reconstructs full-resolution images from training patches to recover global context, extracts DINOv2 semantic embeddings, applies fused Gromov-Wasserstein optimal transport to construct pseudo-pairs between RAW inputs and RGB targets at both image and patch levels, and trains a compact 7K-parameter CNN focused solely on color rendering. The submission reports 22.569 PSNR, 0.675 SSIM, and 8.067 ΔE on the hidden test set, improving over the baseline and ranking 3rd in SSIM and ΔE among entries. Code is released publicly.

Significance. If the semantic pseudo-pairs prove sufficiently photometrically aligned, the work demonstrates that a non-adversarial, extremely compact CNN can deliver competitive unpaired ISP performance without the training instabilities common in GAN-based approaches. The 7K-parameter design is a clear practical advantage for on-device deployment. Public code availability is a strength that enables direct verification and reuse.

major comments (2)
  1. [§3.2] Semantic Pseudo-Pairing: The central assumption that FGW transport on DINOv2 embeddings yields pseudo-pairs whose color statistics are close enough to the unknown true RAW-to-RGB mapping is load-bearing for the claim of effective color rendering. DINOv2 features are largely invariant to low-level photometry and were trained on RGB rather than RAW sensor data; no quantitative check (color histogram divergence, illuminant consistency, or comparison to any available paired subset) is reported to confirm alignment. If mismatched regions are paired, the 7K CNN may learn biased mappings whose challenge scores do not generalize.
  2. [§4] Experiments: The reported challenge metrics lack ablations isolating the contribution of image-level versus patch-level FGW matching, the image-reconstruction step, or the choice of DINOv2 layer. Without these, it is difficult to determine whether the 3rd-place SSIM/ΔE ranking stems from the semantic pseudo-pairing or from other factors such as the lightweight architecture or challenge-specific regularization.
minor comments (2)
  1. [Abstract / §3.1] The abstract and method description state that larger images are reconstructed from patches, but the exact procedure (overlap handling, blending, or artifact mitigation) is not detailed enough for full reproducibility.
  2. Table or figure showing example pseudo-pairs (input RAW, matched RGB target, and resulting rendered output) would help readers assess visual quality of the constructed pairs.
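On the reconstruction point, a generic accumulate-and-normalise stitcher is sketched below; this is one common scheme, not the paper's actual procedure (which is exactly what the referee is asking the authors to document):

```python
import numpy as np

def stitch(patches, coords, out_shape):
    """Blend possibly overlapping patches into one image.

    patches: iterable of (h, w, c) arrays; coords: matching (y, x)
    top-left corners; out_shape: (H, W, c). Overlaps are resolved by
    uniform averaging; real pipelines often taper patch borders
    instead, to suppress visible seams.
    """
    acc = np.zeros(out_shape, dtype=np.float64)
    weight = np.zeros(out_shape[:2] + (1,), dtype=np.float64)
    for patch, (y, x) in zip(patches, coords):
        h, w = patch.shape[:2]
        acc[y:y + h, x:x + w] += patch
        weight[y:y + h, x:x + w] += 1.0
    return acc / np.maximum(weight, 1e-8)  # guard uncovered pixels
```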

Simulated Author's Rebuttal

2 responses · 1 unresolved

We thank the referee for the constructive and detailed feedback. We address each major comment below and describe the revisions planned for the manuscript.

Point-by-point responses
  1. Referee: [§3.2] Semantic Pseudo-Pairing: The central assumption that FGW transport on DINOv2 embeddings yields pseudo-pairs whose color statistics are close enough to the unknown true RAW-to-RGB mapping is load-bearing for the claim of effective color rendering. DINOv2 features are largely invariant to low-level photometry and were trained on RGB rather than RAW sensor data; no quantitative check (color histogram divergence, illuminant consistency, or comparison to any available paired subset) is reported to confirm alignment. If mismatched regions are paired, the 7K CNN may learn biased mappings whose challenge scores do not generalize.

    Authors: We appreciate the referee's identification of this key assumption. While DINOv2 provides semantic features that correlate with scene content and thus with plausible color mappings in natural images, we acknowledge the absence of explicit photometric validation in the original submission. In the revised manuscript, we will add quantitative analysis in §3.2, including Earth Mover's Distance between color histograms of pseudo-paired vs. randomly paired images and visual inspection of aligned patches for illuminant consistency. We will also explicitly discuss the limitation that DINOv2 was pretrained on RGB data. However, a direct comparison to a paired subset is not possible given the unpaired challenge data. revision: partial

  2. Referee: [§4] Experiments: The reported challenge metrics lack ablations isolating the contribution of image-level versus patch-level FGW matching, the image-reconstruction step, or the choice of DINOv2 layer. Without these, it is difficult to determine whether the 3rd-place SSIM/ΔE ranking stems from the semantic pseudo-pairing or from other factors such as the lightweight architecture or challenge-specific regularization.

    Authors: We agree that isolating these components would clarify the source of the performance gains. In the revised manuscript, we will expand §4 with new ablation tables on the validation set reporting PSNR, SSIM, and ΔE for: (i) image-level FGW only, patch-level FGW only, and the combined setting; (ii) with and without the full-image reconstruction step; and (iii) embeddings from different DINOv2 layers. These results will be presented alongside the final challenge metrics to demonstrate the contribution of each design choice. revision: yes
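The histogram check proposed in the authors' first response is cheap to run. A sketch using per-channel 1-D earth mover's distance as a proxy for the full 3-D colour EMD (a simplification; the authors' planned analysis may bin colours differently):

```python
import numpy as np
from scipy.stats import wasserstein_distance

def color_emd(img_a, img_b):
    # Mean over channels of the 1-D earth mover's distance between
    # the two images' marginal pixel-value distributions. Comparing
    # this number for pseudo-pairs against random pairs indicates
    # whether semantic matching also brings colour statistics closer.
    channels = img_a.shape[-1]
    return float(np.mean([
        wasserstein_distance(img_a[..., c].ravel(), img_b[..., c].ravel())
        for c in range(channels)
    ]))
```

A markedly lower mean `color_emd` for pseudo-pairs than for random pairs would directly support the photometric-alignment assumption the referee flags.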

standing simulated objections not resolved
  • Direct quantitative comparison of pseudo-pairs against a paired subset is not feasible, as the NTIRE 2026 challenge provides only unpaired data and no such paired validation set is available.

Circularity Check

0 steps flagged

No circularity: pseudo-pair construction uses external models and is independent of final metrics

full rationale

The paper's chain proceeds from external DINOv2 embeddings and FGW optimal transport to construct pseudo-pairs, followed by training a compact CNN on those pairs and reporting performance on a hidden challenge test set. No equation or claim reduces the reported PSNR/SSIM/ΔE values to a fit on the same quantities, nor does any load-bearing step rely on self-citation or rename a known result. The derivation remains self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

1 free parameter · 2 axioms · 1 invented entity

Limited information is available beyond the abstract; the central claim rests on the effectiveness of DINOv2 for cross-domain semantics and of the OT method for pairing, both of which are domain assumptions inherited from prior work. The CNN weights are fitted parameters.

free parameters (1)
  • CNN model weights
    The 7K-parameter network is trained on the constructed pseudo-pairs to perform color rendering.
axioms (2)
  • domain assumption: DINOv2 provides semantically meaningful embeddings for both RAW and RGB images
    Core step for extracting features used in matching.
  • domain assumption: Fused Gromov-Wasserstein optimal transport can align the distributions of RAW and RGB semantic features effectively
    Used to build the pseudo-pairs at image and patch levels.
invented entities (1)
  • semantic pseudo-pairs (no independent evidence)
    purpose: To serve as training data for the CNN in the unpaired setting
    Constructed via DINOv2 and FGW matching; no independent external validation of pair quality is described.

pith-pipeline@v0.9.0 · 5552 in / 1569 out tokens · 49678 ms · 2026-05-11T02:08:17.676158+00:00 · methodology

