Pith · machine review for the scientific record

arxiv: 2605.06359 · v1 · submitted 2026-05-07 · 📡 eess.SP · cs.CV

Recognition: unknown

The frame-level leakage trap: rethinking evaluation protocols for intrinsic image decomposition, with source-separable uncertainty as a case study

Authors on Pith: no claims yet

Pith reviewed 2026-05-08 06:42 UTC · model grok-4.3

classification 📡 eess.SP cs.CV
keywords intrinsic image decomposition · evaluation protocols · data leakage · MPI Sintel · uncertainty estimation · scene-level splits · reflectance-shading decomposition

The pith

Frame-level splits in intrinsic image decomposition datasets allow leakage that inflates test R_PSNR by 1.6 to 2.0 dB.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper shows that standard ways of splitting the MPI Sintel dataset for training and testing intrinsic image decomposition models let similar frames from the same scene appear in both sets. This leakage boosts reported performance numbers across different network architectures, with the inflation reaching 1.6 to 2.0 dB in R_PSNR and growing larger under longer training. The authors demonstrate that switching to scene-level splits removes the effect and produces more trustworthy benchmarks, supplying reference scores for six models under the corrected protocol. They then use this protocol to test a decomposition approach that adds source-separated uncertainty estimates, which correlate with specific error types and allow selective filtering to improve accuracy.

Core claim

A frame-level split inflates test R_PSNR by 1.6 to 2.0 dB relative to a scene-level split across three architectures; the gap varies continuously along the random/temporal/scene gradient and exceeds 10 dB under extended training. Scene-level splits therefore constitute the proper evaluation protocol. Under this protocol, a physics-informed decomposition with a source-separable three-way heteroscedastic uncertainty head reaches 15.98 ± 0.41 dB R_PSNR while producing uncertainty channels that specialize to non-Lambertian residuals at r = 0.67 and enable a 77% MSE reduction by discarding the highest-uncertainty 75% of pixels.
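The specialization check is, operationally, a Pearson correlation between an uncertainty channel and the corresponding residual error. A minimal numpy sketch with synthetic per-pixel values (the coefficients and channel names are invented for illustration; the resulting correlations will not match the paper's r = 0.67):

```python
import numpy as np

rng = np.random.default_rng(1)
n = 5000

# Hypothetical per-pixel non-Lambertian residual magnitudes.
residual = np.abs(rng.normal(size=n))
# A channel that partially tracks the residual (specialized) vs. one that does not.
nl_channel = 0.7 * residual + 0.3 * rng.normal(size=n)
tex_channel = rng.normal(size=n)

# Pearson cross-correlation of each uncertainty channel with the residual.
r_nl = np.corrcoef(nl_channel, residual)[0, 1]
r_tex = np.corrcoef(tex_channel, residual)[0, 1]
```

A specialized channel yields a high `r_nl`, while the unrelated channel's correlation stays near zero; the paper's claim is the empirical analogue of this gap.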

What carries the argument

The scene-level versus frame-level dataset split, which prevents spatially similar frames of the same scene from appearing in both train and test partitions and thereby eliminates leakage.
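The two protocols differ only in the unit of randomization. A toy sketch (the scene names and frame counts are hypothetical, not MPI Sintel's actual scene list):

```python
import random

def frame_level_split(frames, test_frac=0.2, seed=0):
    """Frame-level: shuffle all frames individually, so near-duplicate
    frames of one scene can land in both partitions (leakage)."""
    rng = random.Random(seed)
    shuffled = frames[:]
    rng.shuffle(shuffled)
    k = int(len(shuffled) * test_frac)
    return shuffled[k:], shuffled[:k]

def scene_level_split(frames, test_scenes):
    """Scene-level: every frame of a scene goes to exactly one partition."""
    train = [f for f in frames if f["scene"] not in test_scenes]
    test = [f for f in frames if f["scene"] in test_scenes]
    return train, test

# Hypothetical toy data: 3 scenes x 4 frames each.
frames = [{"scene": s, "idx": i}
          for s in ("alley", "bamboo", "cave") for i in range(4)]
tr, te = scene_level_split(frames, test_scenes={"cave"})
```

Under the scene-level split the train and test scene sets are disjoint by construction, which is exactly the property the frame-level split fails to guarantee.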

If this is right

  • Reported performance numbers for intrinsic decomposition models will drop but become comparable across papers once scene-level splits become standard.
  • Uncertainty maps from a source-separable head can be used to identify and exclude pixels whose reconstruction errors stem from non-Lambertian effects.
  • Filtering out the 75% highest-uncertainty pixels reduces MSE on retained pixels by 77%, while random filtering yields no gain.
  • The observed channel specialization persists on out-of-distribution real photographs.
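The filtering claim in the third bullet can be sketched with synthetic heteroscedastic residuals; the distributions below are invented, so the resulting reduction will not match the paper's 77% figure:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 10_000

# Hypothetical per-pixel uncertainties and residuals whose scale they track.
uncertainty = rng.gamma(shape=2.0, scale=1.0, size=n)
error = uncertainty * rng.normal(0.0, 1.0, size=n)  # heteroscedastic residuals

def mse_after_filtering(error, scores, discard_frac):
    """Discard the discard_frac highest-score pixels; report MSE on the rest."""
    cutoff = np.quantile(scores, 1.0 - discard_frac)
    kept = error[scores <= cutoff]
    return float(np.mean(kept ** 2))

full_mse = float(np.mean(error ** 2))
filt_mse = mse_after_filtering(error, uncertainty, discard_frac=0.75)
# Control: the same fraction discarded at random (shuffled scores).
rand_mse = mse_after_filtering(error, rng.permutation(uncertainty), discard_frac=0.75)
```

Uncertainty-guided filtering shrinks the retained-pixel MSE sharply, while the random control stays near the unfiltered value, mirroring the paper's contrast.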

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Many earlier published results on MPI Sintel may have overstated generalization because of undetected frame-level leakage.
  • Adopting scene-level splits may require collecting or augmenting datasets to preserve scene variety and training stability.
  • The source-separable uncertainty approach could be tested as a lightweight alternative to ensembles for error-aware downstream tasks such as 3D reconstruction.

Load-bearing premise

Scene-level splits remove leakage without introducing new biases such as reduced scene diversity or shifts in the effective training distribution.

What would settle it

An experiment that trains and tests the same models under scene-level splits but obtains R_PSNR values within 0.5 dB of frame-level results after matching total data volume and scene count would falsify the leakage inflation claim.
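The paper's significance test, a paired t-test across seeds, reduces to the following; the per-seed R_PSNR values here are invented for illustration, not the paper's numbers:

```python
import math
from statistics import mean, stdev

def paired_t(xs, ys):
    """Paired t statistic for matched samples
    (e.g. the same seed evaluated under two protocols)."""
    d = [x - y for x, y in zip(xs, ys)]
    return mean(d) / (stdev(d) / math.sqrt(len(d)))

# Hypothetical per-seed R_PSNR (dB): frame-level vs scene-level, 3 seeds.
frame_level = [17.9, 18.1, 18.0]
scene_level = [16.2, 16.3, 16.1]
t = paired_t(frame_level, scene_level)
```

With 3 seeds the test has only 2 degrees of freedom, so the two-sided 1% critical value is roughly 9.9; a consistent per-seed gap clears it easily, which is why a stable 1.6 to 2.0 dB inflation can reach p < 0.01 despite the tiny sample.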

read the original abstract

Evaluation protocols for learned intrinsic image decomposition on MPI Sintel have been inconsistent. Several prior works split the dataset by frames, which allows spatially similar frames of the same scene to appear in both train and test partitions. We quantify this leakage effect for the first time, across three architectures: a frame-level split inflates test R_PSNR by 1.6 to 2.0 dB (p < 0.01 for all three, paired t-test across 3 seeds) relative to a scene-level split, confirming an architecture-independent protocol effect. A three-point gradient (random/temporal/scene) shows the gap is continuous, and under extended training the frame-level inflation exceeds 10 dB. We advocate scene-level splits as the community standard and provide reference numbers for six representative models under this protocol. As a case study within the corrected protocol, we present a physics-informed decomposition I = R ∘ S + N with a source-separable three-way heteroscedastic uncertainty head. We empirically verify channel specialization: the non-Lambertian uncertainty channel shows r = 0.67 cross-correlation with non-Lambertian residual error, more than 4 times the texture channel's correlation. We further demonstrate downstream utility: filtering out the 75% highest-uncertainty pixels reduces reconstruction MSE by 77% on retained pixels, whereas random filtering produces no improvement. The specialization also holds on out-of-distribution real photographs. We report negative results for a more elaborate variant combining frequency decomposition, cross-task supervision, evidential learning, contrastive loss, and test-time adaptation. Our method reaches 15.98 plus or minus 0.41 dB R_PSNR, within 0.8 dB of a 5-member Deep Ensemble at one-fifth the cost, with the unique capability of source-separated uncertainty.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 4 minor

Summary. The paper claims that frame-level splits on MPI Sintel for intrinsic image decomposition allow leakage of spatially similar frames from the same scene into both train and test sets, inflating test R_PSNR by 1.6–2.0 dB (p < 0.01, paired t-test over 3 seeds) relative to scene-level splits across three architectures. It demonstrates a continuous three-point gradient (random/temporal/scene) and >10 dB gap under extended training, advocates scene-level splits as standard, and supplies reference results for six models. As a case study under the corrected protocol, it introduces a source-separable three-way heteroscedastic uncertainty head for the decomposition I = R ∘ S + N, reports channel specialization (r = 0.67 for non-Lambertian uncertainty vs. residual), shows that filtering the top 75% uncertainty pixels reduces MSE by 77% on retained pixels, and notes that the specialization holds on real photographs. Negative results for a more elaborate multi-component variant are also presented.

Significance. If the leakage quantification is shown to be unconfounded, the work would meaningfully advance evaluation standards in intrinsic image decomposition by documenting a previously unquantified protocol artifact and supplying corrected baselines. The uncertainty case study offers a low-cost, interpretable alternative to ensembles for source-specific reliability, with demonstrated specialization and downstream filtering utility. Explicit reporting of negative results for complex variants is a constructive contribution that helps steer the community.

major comments (2)
  1. [Protocol comparison experiments] The 1.6–2.0 dB R_PSNR gap is attributed to frame-level leakage, yet scene-level partitioning necessarily reduces the number of distinct training scenes (MPI Sintel contains few scenes). No count of unique scenes per split is reported, nor is a control experiment that equalizes scene diversity or effective training distribution across protocols. This leaves open the possibility that part of the gap arises from reduced generalization rather than leakage removal alone.
  2. [Case study section] Uncertainty head (case study): The source-separable three-way heteroscedastic uncertainty head is presented as physics-informed with verified channel specialization, but the precise parameterization of the three uncertainty channels, the form of the heteroscedastic loss, and the optimization details are not given with sufficient equations or pseudocode to permit independent reproduction of the reported r = 0.67 correlation or the 77% MSE reduction on filtered pixels.
minor comments (4)
  1. [Abstract] '15.98 plus or minus 0.41 dB' should use the ± symbol for standard mathematical notation.
  2. [Abstract] R_PSNR is used without a parenthetical definition or reference to its formula.
  3. [Abstract] Reference numbers are said to be provided for six models, but no corresponding table or section is cited.
  4. [Abstract] Exact definitions of the frame-level, temporal, and scene-level splits (e.g., scene IDs used) are referenced but not summarized even briefly.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback. We address each major comment below with the strongest honest response possible.

read point-by-point responses
  1. Referee: Protocol comparison experiments: The 1.6–2.0 dB R_PSNR gap is attributed to frame-level leakage, yet scene-level partitioning necessarily reduces the number of distinct training scenes (MPI Sintel contains few scenes). No count of unique scenes per split is reported, nor is a control experiment that equalizes scene diversity or effective training distribution across protocols. This leaves open the possibility that part of the gap arises from reduced generalization rather than leakage removal alone.

    Authors: We agree that scene counts were not reported and that a full equalization control is absent. MPI Sintel has 18 training scenes; our scene-level protocol uses 12 for training and 6 for testing. The random and temporal protocols both use all 18 scenes, yet the three-point gradient shows progressive R_PSNR degradation from random to temporal to scene splits. This pattern is more consistent with leakage reduction than with scene diversity alone. We will add the exact scene counts per split and a paragraph discussing this potential confound, while noting that a true equalization ablation would require subsampling scenes from the frame-level protocol (which we can include if space allows). revision: partial

  2. Referee: Uncertainty head (case study): The source-separable three-way heteroscedastic uncertainty head is presented as physics-informed with verified channel specialization, but the precise parameterization of the three uncertainty channels, the form of the heteroscedastic loss, and the optimization details are not given with sufficient equations or pseudocode to permit independent reproduction of the reported r = 0.67 correlation or the 77% MSE reduction on filtered pixels.

    Authors: The referee is correct that the current manuscript lacks sufficient implementation details for the uncertainty head. In revision we will add: (i) the exact parameterization (three output channels for reflectance, shading, and non-Lambertian uncertainty, each modeled as per-pixel variance), (ii) the heteroscedastic loss as the negative log-likelihood under independent Gaussian assumptions per source, and (iii) full optimization details (learning rate schedule, batch size, and training epochs). We will also include pseudocode for uncertainty computation and the top-75% filtering procedure. revision: yes
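The loss the rebuttal describes, a per-source Gaussian negative log-likelihood with learned per-pixel variance, might look like the following sketch. The log-variance parameterization and the three-source sum are assumptions consistent with the rebuttal's outline, not the paper's verified implementation:

```python
import numpy as np

def hetero_nll(residual, log_var):
    """Gaussian heteroscedastic NLL per pixel, up to a constant:
    0.5 * (exp(-s) * r^2 + s), with s = log sigma^2.
    Predicting log-variance keeps sigma^2 positive and the loss stable."""
    return 0.5 * (np.exp(-log_var) * residual ** 2 + log_var)

def source_separable_loss(residuals, log_vars):
    """Sum the NLL over the three assumed sources (reflectance, shading,
    non-Lambertian), treated as independent Gaussians per source."""
    return sum(hetero_nll(r, s).mean() for r, s in zip(residuals, log_vars))

# Toy example: three constant residual maps with matched variance predictions.
h, w = 4, 4
residuals = [np.full((h, w), v) for v in (0.1, 0.2, 0.5)]
log_vars = [np.full((h, w), np.log(v ** 2)) for v in (0.1, 0.2, 0.5)]
loss = source_separable_loss(residuals, log_vars)
```

A well-calibrated head (predicted variance matching the squared residual per source) attains a lower loss than a miscalibrated one, which is the mechanism that lets each channel specialize to its own error source during training.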

Circularity Check

0 steps flagged

No significant circularity in empirical protocol comparison

full rationale

The paper's core contribution is an empirical measurement of R_PSNR differences between frame-level and scene-level splits on MPI Sintel across three architectures, supported by paired t-tests and a three-point gradient experiment. This is a direct experimental result with no mathematical derivation chain, no parameters fitted to a subset then renamed as predictions, and no self-definitional or self-citation load-bearing steps. The source-separable uncertainty head is introduced as a case study with explicit empirical verifications (cross-correlations, MSE reduction on high-uncertainty pixels) rather than derived by construction from the protocol findings. No uniqueness theorems, ansatzes smuggled via citation, or renaming of known results occur. The analysis remains self-contained as experimental evidence against external benchmarks.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 1 invented entity

The work rests on standard assumptions about dataset independence and model training in computer vision, plus the introduction of a new uncertainty mechanism whose parameters are learned from data.

free parameters (1)
  • uncertainty head parameters
    The three-way heteroscedastic uncertainty head contains learned parameters fitted during training to separate uncertainty channels.
axioms (1)
  • domain assumption: Scene-level splits on MPI Sintel produce unbiased test sets without other confounding effects on model training or evaluation.
    Invoked when advocating scene-level splits as the community standard.
invented entities (1)
  • source-separable three-way heteroscedastic uncertainty head (no independent evidence)
    purpose: To produce separate uncertainty estimates for different error sources such as non-Lambertian surfaces.
    Introduced as the core of the case-study model.

pith-pipeline@v0.9.0 · 5644 in / 1419 out tokens · 34454 ms · 2026-05-08T06:42:32.876357+00:00 · methodology

discussion (0)

