Deep Exemplar-based Video Colorization
Pith reviewed 2026-05-25 17:41 UTC · model grok-4.3
The pith
A recurrent network unifies semantic matching and color propagation to colorize video sequences from one reference image while maintaining temporal consistency.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The paper claims that a recurrent network trained end-to-end, unifying semantic correspondence and color propagation with both steps guided by the reference image and reinforced by a temporal consistency loss, produces realistic videos that stay faithful to the reference style and remain temporally stable.
What carries the argument
The recurrent framework that unifies semantic correspondence and color propagation, allowing the reference image to guide colorization of every frame based on colorization history.
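The control flow described above can be sketched in a few lines. This is a toy illustration, not the paper's network: the nearest-luminance lookup stands in for learned semantic correspondence, the fixed blend stands in for learned color propagation, and the names `match_reference`, `colorize_video`, and `history_weight` are hypothetical.

```python
def match_reference(lum, ref_lum, ref_chroma):
    """Toy stand-in for semantic correspondence: pick the reference
    pixel whose luminance is closest and return its chroma value."""
    best = min(range(len(ref_lum)), key=lambda i: abs(ref_lum[i] - lum))
    return ref_chroma[best]


def colorize_video(frames, ref_lum, ref_chroma, history_weight=0.5):
    """Colorize frames in order. Every frame consults the reference,
    and every frame after the first also blends in the previous output
    (the colorization history)."""
    outputs = []
    prev = None  # no history exists before the first frame
    for frame in frames:
        # Reference guidance is recomputed for every frame, so an error
        # in one frame is not the only signal passed to the next.
        from_ref = [match_reference(l, ref_lum, ref_chroma) for l in frame]
        if prev is None:
            current = from_ref
        else:
            # Assumed fixed blend; the paper learns this step instead.
            current = [
                (1 - history_weight) * r + history_weight * p
                for r, p in zip(from_ref, prev)
            ]
        outputs.append(current)
        prev = current
    return outputs
```

Each frame here is a flat list of luminance values and chroma is a single number per pixel, which keeps the recurrence visible without any image library. A real system would also warp the previous output to account for motion before blending.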
If this is right
- Each frame receives guidance from the reference through the unified correspondence and propagation steps.
- Sequential processing based on colorization history reduces accumulated propagation errors.
- The temporal consistency loss enforces coherency across the entire sequence.
- The resulting videos are claimed to be superior to prior methods in both quantitative metrics and visual quality.
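A temporal consistency term of the kind mentioned in the list above is often written as a masked difference between the current output and the previous output warped into the current frame. The sketch below assumes that common formulation; the summary does not state the paper's exact loss, and the function name and the mean-absolute-error choice are assumptions.

```python
def temporal_consistency_loss(current, warped_prev, visibility):
    """Mean absolute difference between the current frame and the
    motion-warped previous frame, counted only where the pixel is
    visible in both frames (visibility value 1, occluded value 0)."""
    total = sum(
        m * abs(c - w) for c, w, m in zip(current, warped_prev, visibility)
    )
    count = sum(visibility)
    # With no visible pixels there is nothing to penalize.
    return total / count if count else 0.0
```

The warped previous frame and the visibility mask are taken as inputs here; producing them normally requires an optical-flow estimate, which is outside this sketch.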
Where Pith is reading between the lines
- The same recurrent unification might apply to other reference-guided video tasks such as style transfer or segmentation.
- Efficiency improvements would be needed before the method could run on long sequences in real time.
- Performance on videos with rapid motion or lighting changes would need separate verification beyond the reported experiments.
Load-bearing premise
Training the recurrent network end-to-end with the temporal consistency loss will produce realistic videos with good temporal stability without introducing new artifacts or drifting from the reference style across long sequences.
What would settle it
A test on a long video sequence showing either accumulated color drift away from the reference or visible flickering despite the temporal loss would falsify the central claim.
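One rough way to run such a test is to track two numbers over the sequence: how far each frame's average color sits from the reference, and how much consecutive frames differ. The helpers below are a hypothetical sketch of those two measurements, not metrics reported by the paper.

```python
def mean(values):
    return sum(values) / len(values)


def drift_per_frame(frames, reference):
    """Absolute gap between each frame's mean chroma and the
    reference's mean chroma; a rising trend suggests color drift."""
    ref_mean = mean(reference)
    return [abs(mean(f) - ref_mean) for f in frames]


def flicker_score(frames):
    """Average per-pixel change between consecutive frames. Without
    motion compensation this also counts real scene motion, so it is
    only meaningful on static or pre-aligned content."""
    diffs = [
        mean([abs(a - b) for a, b in zip(cur, prev)])
        for prev, cur in zip(frames, frames[1:])
    ]
    return mean(diffs) if diffs else 0.0
```

Both functions take frames as flat lists of chroma values. Comparing the drift curve on short and long clips would show whether the gap to the reference grows with sequence length.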
Original abstract
This paper presents the first end-to-end network for exemplar-based video colorization. The main challenge is to achieve temporal consistency while remaining faithful to the reference style. To address this issue, we introduce a recurrent framework that unifies the semantic correspondence and color propagation steps. Both steps allow a provided reference image to guide the colorization of every frame, thus reducing accumulated propagation errors. Video frames are colorized in sequence based on the colorization history, and its coherency is further enforced by the temporal consistency loss. All of these components, learned end-to-end, help produce realistic videos with good temporal stability. Experiments show our result is superior to the state-of-the-art methods both quantitatively and qualitatively.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. This paper claims to present the first end-to-end recurrent network for exemplar-based video colorization. The recurrent framework unifies semantic correspondence and color propagation so a single reference image guides colorization of every frame, reducing accumulated propagation errors. Video frames are processed sequentially using colorization history, with a temporal consistency loss enforcing coherency; all components are learned end-to-end to produce realistic videos with good temporal stability. Experiments are said to demonstrate quantitative and qualitative superiority over state-of-the-art methods.
Significance. If the experimental claims hold with proper validation, the unified recurrent approach would represent a meaningful advance in exemplar-based video colorization by addressing error accumulation and temporal instability in a single learned model, potentially outperforming prior separate-stage pipelines.
major comments (2)
- [Abstract] The central claim of quantitative and qualitative superiority (and reduced propagation errors via the recurrent unification) is asserted without error bars, dataset details, ablation results, or long-sequence experiments, so the experimental summary cannot be verified and the claim that end-to-end training prevents reference drift remains untested.
- [Abstract] No loss equation, recurrence depth analysis, or ablation isolating the unification of semantic correspondence and color propagation is supplied, leaving the assumption that the recurrent state maintains reference fidelity without introducing new artifacts or style drift across long sequences unsupported.
Simulated Author's Rebuttal
We thank the referee for the constructive comments. We address each point below and indicate where revisions to the abstract or supporting text are feasible.
- Referee: [Abstract] The central claim of quantitative and qualitative superiority (and reduced propagation errors via the recurrent unification) is asserted without error bars, dataset details, ablation results, or long-sequence experiments, so the experimental summary cannot be verified and the claim that end-to-end training prevents reference drift remains untested.
  Authors: The abstract is a concise summary; full dataset descriptions appear in Section 4.1, quantitative/qualitative results and comparisons in Section 4.2, and ablations in Section 5. Error bars were not reported in the submitted version. The recurrent unification is motivated in Section 3 as a means to reduce propagation drift, with temporal stability shown on the evaluated sequences. We will revise the abstract to qualify the superiority claim and add a forward reference to the experimental sections.
  Revision: partial
- Referee: [Abstract] No loss equation, recurrence depth analysis, or ablation isolating the unification of semantic correspondence and color propagation is supplied, leaving the assumption that the recurrent state maintains reference fidelity without introducing new artifacts or style drift across long sequences unsupported.
  Authors: The temporal consistency loss is formalized in the method section (with the relevant equation). The recurrent state and unification of correspondence and propagation are detailed in Section 3, and component ablations appear in Section 5. A dedicated recurrence-depth study and explicit long-sequence drift measurements are not present; we can add a brief reference to the loss equation in the abstract and note the design rationale for fidelity preservation.
  Revision: partial
Circularity Check
No circularity: empirical ML method with external benchmarks
The paper proposes a recurrent end-to-end neural network architecture for exemplar-based video colorization, trained with a temporal consistency loss and evaluated on external datasets against prior methods. It presents no derivation chain, equations, or first-principles results that reduce to self-definition, fitted inputs renamed as predictions, or self-citation chains. Its claims rest on learned behavior and on quantitative and qualitative experiments evaluated against independent benchmarks.
Axiom & Free-Parameter Ledger
axioms (1)
- Domain assumption: Gradient-based optimization can jointly learn semantic correspondence, color propagation, and temporal consistency from paired training data.