Multimodal Image Colorization: Quantifying the Impact of Text-Conditioned Guidance on Grayscale-to-Color Translation

Colten Reissmann; Hugo Garrido-Lestache Belinchon

arxiv: 2606.20722 · v1 · pith:M2MJBCCMnew · submitted 2026-06-16 · 💻 cs.GR · cs.CL · cs.CV · cs.LG


Pith reviewed 2026-06-26 21:30 UTC · model grok-4.3

classification 💻 cs.GR · cs.CL · cs.CV · cs.LG

keywords: image colorization · text conditioning · grayscale to color · U-Net · Stable Diffusion · CLIP guidance · PSNR · LPIPS


The pith

Text conditioning on U-Net and Stable Diffusion models improves grayscale-to-color translation on multiple metrics.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper tests whether adding text prompts improves automatic colorization of black-and-white images. It runs two models, a U-Net and Stable Diffusion 1.5, once with CLIP text conditioning and once without, keeping every other setting identical. For the U-Net and Stable Diffusion respectively, text conditioning raises PSNR by 5.6 and 5.8 percent, SSIM by 1.2 and 1.5 percent, and colorfulness by 36.6 and 0.6 percent, while cutting LPIPS by 7.6 and 11.3 percent. The gains point the same way in both model types, which suggests that text guidance helps resolve ambiguous color choices in the grayscale input.

Core claim

The authors establish that text conditioning provides consistent, measurable improvements to colorization quality across both architecture scales, with the listed percentage gains in PSNR, SSIM, colorfulness, and LPIPS reduction.

What carries the argument

Ablation study comparing models with and without CLIP text conditioning while holding all other variables constant, using standard image quality metrics.
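One way to make the "all other variables constant" condition checkable is to derive the conditioned run's configuration from the baseline and assert that exactly one field differs. The sketch below is illustrative only: `RunConfig`, its field names, and every value in it are hypothetical and are not taken from the paper.

```python
from dataclasses import dataclass, asdict, replace


@dataclass(frozen=True)
class RunConfig:
    # All field names and values are hypothetical placeholders.
    backbone: str
    use_text_conditioning: bool
    learning_rate: float
    batch_size: int
    train_steps: int
    seed: int
    loss: str


def differing_fields(a, b):
    """Return the set of field names whose values differ between two configs."""
    da, db = asdict(a), asdict(b)
    return {name for name in da if da[name] != db[name]}


baseline = RunConfig("unet", False, 1e-4, 32, 10_000, 0, "l1")
# Derive the treated run from the baseline so nothing else can drift.
treated = replace(baseline, use_text_conditioning=True)

# A controlled ablation should differ in exactly the factor under study.
assert differing_fields(baseline, treated) == {"use_text_conditioning"}
```

A check of this kind covers only what the configuration records. Differences that live outside it, such as extra parameters introduced by the conditioning path, would still need to be reported separately.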

Load-bearing premise

That the chosen metrics accurately reflect overall colorization quality and that adding text conditioning does not introduce any uncontrolled changes to the model or training process.

What would settle it

A human evaluation study where raters show no difference or prefer the unconditioned colorizations despite the metric gains.

Figures

Figures reproduced from arXiv: 2606.20722 by Colten Reissmann, Hugo Garrido-Lestache Belinchon.

**Figure 2.** Per-image metric distributions for all four models. Text conditioning shifts [PITH_FULL_IMAGE:figures/full_fig_p008_2.png]

**Figure 3.** Qualitative U-Net results. From left to right: grayscale input, UN-NP [PITH_FULL_IMAGE:figures/full_fig_p009_3.png]

**Figure 4.** Qualitative Stable Diffusion results. From left to right: grayscale input, SD-NP [PITH_FULL_IMAGE:figures/full_fig_p010_4.png]

**Figure 5.** The same grayscale car colorized by SD-NP (no text) and by SD-P with four [PITH_FULL_IMAGE:figures/full_fig_p013_5.png]

Original abstract

Grayscale images are commonly found in historical photography restoration, medical imaging, and artistic media. However, automatically applying color to these images remains a significant challenge in computer vision because many plausible colorizations can correspond to the same grayscale input. In this work, we quantify the effect of text conditioning on pixel-level and perceptual metrics for grayscale-to-color image models. Specifically, we compare two architectures, a U-Net and Stable Diffusion 1.5, each tested with and without CLIP text conditioning while holding all other variables constant. Our results show that text conditioning improves PSNR by 5.6%, SSIM by 1.2%, and colorfulness by 36.6%, while reducing LPIPS by 7.6% in the U-Net tier. In the Stable Diffusion tier, text conditioning improves PSNR by 5.8%, SSIM by 1.5%, and colorfulness by 0.6%, while reducing LPIPS by 11.3%. These results indicate that text conditioning provides consistent, measurable improvements to colorization quality across both architecture scales.
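The percentages in the abstract are relative changes between the conditioned and unconditioned model of each tier. As a minimal sketch of that arithmetic, assuming 8-bit pixel values and using made-up metric averages (the paper's absolute scores are not given in this text), PSNR and a signed relative change can be computed as follows. The function names are illustrative.

```python
import math


def psnr(reference, estimate, max_value=255.0):
    """Peak signal-to-noise ratio in dB between two equal-length pixel sequences."""
    if len(reference) != len(estimate):
        raise ValueError("inputs must have the same number of samples")
    mse = sum((r - e) ** 2 for r, e in zip(reference, estimate)) / len(reference)
    if mse == 0:
        return math.inf  # identical inputs
    return 10.0 * math.log10(max_value ** 2 / mse)


def relative_change_percent(baseline, treated):
    """Signed percent change of `treated` relative to `baseline`."""
    return 100.0 * (treated - baseline) / baseline


# Hypothetical averages, chosen only to illustrate the arithmetic.
psnr_no_text, psnr_text = 20.0, 21.12      # dB, higher is better
lpips_no_text, lpips_text = 0.250, 0.231   # distance, lower is better

print(round(relative_change_percent(psnr_no_text, psnr_text), 1))
print(round(relative_change_percent(lpips_no_text, lpips_text), 1))
```

Because PSNR is expressed in decibels, a percent change in PSNR is a change on a logarithmic scale and does not translate directly into the same percent change in mean squared error. For LPIPS, an improvement shows up as a negative relative change.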

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note: private letter to a colleague

The paper quantifies text conditioning gains on colorization metrics for U-Net and SD1.5 but leaves the key ablation controls unverified.


The paper's main point is that text conditioning improves colorization quality on standard metrics for both a U-Net and Stable Diffusion 1.5 when all other variables are held constant. It gives concrete numbers, such as 5.6% better PSNR and 36.6% better colorfulness on the U-Net, with positive shifts on the larger model as well.

What is new is the side-by-side quantification on these two architectures using the same four metrics. The paper does well by testing at two different scales and finding consistent directional benefits from the text path.

The soft spots center on whether the comparison is actually controlled. The abstract claims all other variables are held constant, but it does not provide any supporting information on datasets, training protocols, or model configurations. This leaves open the possibility that the text-conditioned versions have extra parameters or different optimization that could explain the gains instead of the conditioning itself. The stress-test concern about identical weights and schedules is not answered by anything in the text. There are also no error bars or statistical tests to show the improvements are reliable rather than noise.

This work is for engineers who need to decide whether to add text guidance to colorization tools in restoration or imaging applications. It offers no new method, so method-focused readers will not get much from it.

I would bring this to reading group only if the full paper shows the ablation details clearly. The citation potential is low because it is an incremental measurement rather than a new result.

I recommend sending it to peer review if the methods section demonstrates that the only difference between conditions is the presence of the CLIP guidance and includes basic reproducibility information. The practical question is worth referee attention once the evidence is verifiable. If those details are missing, it should be desk rejected.

Referee Report

2 major / 1 minor

Summary. The manuscript empirically compares U-Net and Stable Diffusion 1.5 models for grayscale-to-color translation, each run with and without CLIP text conditioning while asserting that all other variables are held constant. It reports that text conditioning yields PSNR gains of 5.6% (U-Net) and 5.8% (SD), SSIM gains of 1.2% and 1.5%, colorfulness gains of 36.6% and 0.6%, and LPIPS reductions of 7.6% and 11.3%.

Significance. If the paired runs are truly isolated to the addition of text conditioning, the work supplies concrete quantitative evidence on the value of multimodal guidance for colorization quality across model scales, using both pixel-level (PSNR/SSIM) and perceptual (LPIPS/colorfulness) metrics. This could inform design choices in restoration and generative pipelines.

major comments (2)

[Abstract] The headline claim that text conditioning alone produces the listed metric deltas requires explicit confirmation that the 'with' and 'without' models share identical architectures, parameter counts, training schedules, loss functions, and inference settings. No hyper-parameter table, model diagram, or description of how the CLIP cross-attention path is isolated (e.g., removed vs. zeroed) is supplied, so the 5.6% PSNR improvement cannot yet be attributed solely to conditioning.
[Methods/Results] The reported percentages are given without dataset identity or size, number of test images, error bars, or statistical tests. This prevents assessment of whether the observed differences exceed run-to-run variance and directly undermines verification of the controlled-ablation premise.

minor comments (1)

[Abstract] The colorfulness metric improvement of 36.6% in the U-Net tier is an order of magnitude larger than the other deltas; a brief note on the colorfulness formula or reference would aid interpretation.
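For context on the formula the comment asks about, a widely used definition is the Hasler and Süsstrunk (2003) colorfulness measure, which combines the spread and the mean of two opponent-color channels. Whether the paper uses exactly this variant is an assumption here; the sketch below implements that measure on a flat list of RGB pixels.

```python
import math
from statistics import fmean, pstdev


def colorfulness(pixels):
    """Hasler-Suesstrunk colorfulness for an iterable of (R, G, B) tuples in 0..255."""
    pixels = list(pixels)
    rg = [r - g for r, g, b in pixels]              # red-green opponent channel
    yb = [0.5 * (r + g) - b for r, g, b in pixels]  # yellow-blue opponent channel
    std_root = math.hypot(pstdev(rg), pstdev(yb))
    mean_root = math.hypot(fmean(rg), fmean(yb))
    return std_root + 0.3 * mean_root


# Toy inputs for illustration only.
gray = [(v, v, v) for v in (0, 64, 128, 255)]
vivid = [(255, 0, 0), (0, 255, 0), (0, 0, 255), (255, 255, 0)]
print(colorfulness(gray))
print(colorfulness(vivid))
```

Two properties of this measure bear on interpretation. It is computed from the output image alone, so a higher score means more saturated and varied color, not necessarily color closer to the ground truth. It is also zero for a pure gray image, so a model that produces desaturated output starts from a low baseline, and a modest absolute increase could then appear as a large percentage.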

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive comments, which highlight opportunities to strengthen the clarity and reproducibility of our controlled ablation study. We address each major comment below and will incorporate revisions to provide the requested details on experimental controls and statistical reporting.


Referee: [Abstract] The headline claim that text conditioning alone produces the listed metric deltas requires explicit confirmation that the 'with' and 'without' models share identical architectures, parameter counts, training schedules, loss functions, and inference settings. No hyper-parameter table, model diagram, or description of how the CLIP cross-attention path is isolated (e.g., removed vs. zeroed) is supplied, so the 5.6% PSNR improvement cannot yet be attributed solely to conditioning.

Authors: The manuscript states that comparisons were performed 'while holding all other variables constant,' but we acknowledge that the initial version did not include an explicit hyper-parameter table or a description of the conditioning isolation procedure. In the revised manuscript we will add a configuration table and a methods subsection detailing identical architectures, parameter counts, training schedules, loss functions, and inference settings for both conditions, along with the precise mechanism used to disable CLIP cross-attention (zeroing the conditioning input). revision: yes
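To illustrate what "zeroing the conditioning input" can mean mechanically, the toy sketch below implements single-head cross-attention in plain Python and compares a text context against an all-zero context. It assumes bias-free key and value projections; the weights, token values, and function names are hypothetical and do not describe the paper's models.

```python
import math


def matvec(vector, matrix):
    """Row vector times matrix (matrix given as a list of rows)."""
    return [sum(v * row[j] for v, row in zip(vector, matrix))
            for j in range(len(matrix[0]))]


def softmax(scores):
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention(query_tokens, context_tokens, w_q, w_k, w_v):
    """Single-head cross-attention with bias-free projections (an assumption)."""
    keys = [matvec(c, w_k) for c in context_tokens]
    values = [matvec(c, w_v) for c in context_tokens]
    scale = 1.0 / math.sqrt(len(keys[0]))
    outputs = []
    for x in query_tokens:
        q = matvec(x, w_q)
        weights = softmax([scale * sum(a * b for a, b in zip(q, k)) for k in keys])
        outputs.append([sum(w * v[j] for w, v in zip(weights, values))
                        for j in range(len(values[0]))])
    return outputs


# Toy weights and tokens, hypothetical values for illustration only.
w_q = [[0.2, -0.1], [0.4, 0.3]]
w_k = [[0.5, 0.1], [-0.3, 0.2]]
w_v = [[1.0, 0.0], [0.5, -1.0]]
image_tokens = [[1.0, 2.0], [0.5, -1.0]]
text_tokens = [[0.3, 0.7], [-0.2, 0.9], [1.1, 0.4]]
zeroed_tokens = [[0.0, 0.0] for _ in text_tokens]

text_out = cross_attention(image_tokens, text_tokens, w_q, w_k, w_v)
zero_out = cross_attention(image_tokens, zeroed_tokens, w_q, w_k, w_v)
print(text_out)
print(zero_out)
```

Under these assumptions an all-zero context yields zero keys and values, so the attention output is zero and the layer contributes nothing beyond any residual path or output bias around it. The projection weights are still present, so parameter counts match between conditions, which is not the case if the cross-attention layers are removed. Zeroing is also not the same as feeding the embedding of an empty prompt, since a text encoder generally returns nonzero vectors for an empty string.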
Referee: [Methods/Results] The reported percentages are given without dataset identity or size, number of test images, error bars, or statistical tests. This prevents assessment of whether the observed differences exceed run-to-run variance and directly undermines verification of the controlled-ablation premise.

Authors: We agree that dataset identity, test-set size, error bars, and statistical tests are required to verify that differences exceed variance. The revised Methods and Results sections will specify the dataset(s) and sizes used, the number of test images, report error bars from repeated runs, and include statistical significance tests (e.g., paired t-tests) to support the reported metric improvements. revision: yes
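A paired analysis of the kind mentioned in this response could look roughly like the sketch below, which computes a paired t statistic and a percentile bootstrap interval on per-image differences. The PSNR values are invented for illustration and the function names are hypothetical; nothing here reproduces the paper's data.

```python
import math
import random
from statistics import fmean, stdev


def paired_t_statistic(baseline, treated):
    """t statistic and degrees of freedom for per-image paired differences."""
    diffs = [t - b for b, t in zip(baseline, treated)]
    n = len(diffs)
    t_stat = fmean(diffs) / (stdev(diffs) / math.sqrt(n))
    return t_stat, n - 1


def bootstrap_mean_diff_ci(baseline, treated, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile bootstrap confidence interval for the mean paired difference."""
    rng = random.Random(seed)
    diffs = [t - b for b, t in zip(baseline, treated)]
    means = sorted(fmean(rng.choices(diffs, k=len(diffs)))
                   for _ in range(n_resamples))
    lo = means[int((alpha / 2) * n_resamples)]
    hi = means[int((1 - alpha / 2) * n_resamples) - 1]
    return lo, hi


# Hypothetical per-image PSNR values (dB) for the same eight test images.
psnr_no_text = [19.8, 21.2, 20.5, 18.9, 22.1, 20.0, 19.4, 21.7]
psnr_text = [20.9, 22.0, 21.9, 19.6, 23.4, 20.8, 20.9, 22.5]

t_stat, dof = paired_t_statistic(psnr_no_text, psnr_text)
ci_low, ci_high = bootstrap_mean_diff_ci(psnr_no_text, psnr_text)
print(f"t = {t_stat:.2f} with {dof} degrees of freedom")
print(f"95% bootstrap CI for mean gain: [{ci_low:.2f}, {ci_high:.2f}] dB")
```

Turning the t statistic into a p-value requires the t distribution with the stated degrees of freedom; a library routine such as `scipy.stats.ttest_rel` does both steps at once. Per-image pairing addresses variation across test images for one trained model, whereas error bars from repeated runs address variation across training seeds, so the two kinds of reporting answer different questions.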

Circularity Check

0 steps flagged

No derivation chain or self-referential structure present; purely empirical ablation study

full rationale

The paper reports metric deltas from controlled comparisons of U-Net and Stable Diffusion models run with versus without CLIP text conditioning. No equations, first-principles derivations, fitted parameters renamed as predictions, or self-citations appear in the provided text. The central claim rests on experimental isolation of one variable rather than any mathematical reduction to its own inputs. This matches the default expectation of no circularity for empirical work that does not invoke uniqueness theorems or ansatzes.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review supplies no information on free parameters, background axioms, or new postulated entities.



