iPhoneBlur: A Difficulty-Stratified Benchmark for Consumer Device Motion Deblurring
Pith reviewed 2026-05-08 14:18 UTC · model grok-4.3
The pith
iPhoneBlur is a difficulty-stratified benchmark on which six deblurring architectures score a consistent 7-9 dB lower PSNR on Hard than on Easy motion blur synthesized from consumer iPhone video, a gap that aggregate metrics hide.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Evaluation of six architectures reveals consistent 7-9 dB performance degradation from Easy to Hard subsets, a substantial gap entirely hidden by aggregate reporting.
Load-bearing premise
The PSNR-guided adaptive temporal windowing on high-framerate iPhone videos produces a meaningful and realistic stratification of motion blur difficulty that generalizes to actual consumer device deployment.
Original abstract
Motion blur restoration on consumer mobile devices is typically evaluated using aggregate metrics that obscure performance variation across blur difficulty, masking model behavior under real deployment conditions. This work introduces iPhoneBlur, a difficulty-stratified benchmark of 7,400 image pairs synthesized from high-framerate iPhone 17 Pro videos captured in diverse real-world scenarios. Samples are partitioned into Easy, Medium, and Hard categories through PSNR-guided adaptive temporal windowing, with stratification validated by monotonic 2.2x increase in optical flow magnitude across tiers. Each sample includes comprehensive metadata enabling investigation of ISP-aware and difficulty-adaptive restoration strategies. Spectral analysis confirms synthesized blur exhibits high-frequency suppression patterns consistent with authentic motion degradation. Evaluation of six architectures reveals consistent 7-9 dB performance degradation from Easy to Hard subsets, a substantial gap entirely hidden by aggregate reporting. The benchmark further exposes a domain gap between professional and consumer cameras which targeted fine-tuning substantially recovers. By coupling difficulty stratification with deployment-critical metadata, iPhoneBlur enables systematic assessment of model reliability and failure modes for resource-constrained edge systems.
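The abstract does not spell out how the PSNR-guided adaptive temporal windowing works, so the following is only a minimal sketch of one plausible reading: blur is synthesized by averaging a symmetric window of frames around a sharp center frame, and the window grows until PSNR against that center frame falls to a per-tier target. The function names, the toy one-dimensional "frames", and the 20/17/15 dB targets are all assumptions made for illustration, not values from the paper.

```python
import math

def psnr(ref, img, peak=255.0):
    """PSNR in dB between two equal-length flat pixel lists."""
    mse = sum((a - b) ** 2 for a, b in zip(ref, img)) / len(ref)
    return math.inf if mse == 0 else 10.0 * math.log10(peak ** 2 / mse)

def average_frames(frames):
    """Pixel-wise mean over frames (temporal averaging as a blur model)."""
    n = len(frames)
    return [sum(px) / n for px in zip(*frames)]

def pick_window(frames, center, target_psnr):
    """Grow a symmetric window around `center` until the averaged frame's
    PSNR against the sharp center frame is at or below `target_psnr`.
    Returns (half_width, blurred_frame, psnr_db). If the target is never
    reached, the widest window that fits is returned."""
    max_half = min(center, len(frames) - 1 - center)
    sharp = frames[center]
    result = (0, [float(p) for p in sharp], math.inf)
    for half in range(1, max_half + 1):
        blurred = average_frames(frames[center - half:center + half + 1])
        score = psnr(sharp, blurred)
        result = (half, blurred, score)
        if score <= target_psnr:
            break
    return result

# Toy high-framerate clip: a step edge moving one pixel per frame.
frames = [[255 if x < 8 + t else 0 for x in range(32)] for t in range(17)]

# Hypothetical per-tier PSNR targets in dB (assumed, not from the paper).
targets = {"easy": 20.0, "medium": 17.0, "hard": 15.0}
widths = {tier: pick_window(frames, 8, t)[0] for tier, t in targets.items()}
print(widths)  # {'easy': 2, 'medium': 4, 'hard': 6}
```

On this toy clip, a lower PSNR target selects a longer window, which is the monotonic relationship the stratification relies on. Whether the same holds for real scenes with non-uniform motion is exactly what the referee's first major comment questions.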
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces iPhoneBlur, a benchmark of 7,400 image pairs synthesized from high-framerate iPhone 17 Pro videos captured in diverse real-world scenarios. Samples are partitioned into Easy, Medium, and Hard difficulty categories via PSNR-guided adaptive temporal windowing, with the ordering validated by a 2.2× monotonic rise in optical-flow magnitude and spectral high-frequency roll-off. Evaluation of six deblurring architectures demonstrates a consistent 7-9 dB PSNR degradation from Easy to Hard subsets that is masked by aggregate metrics; the work also reports a domain gap between professional and consumer cameras that targeted fine-tuning substantially recovers, and supplies metadata to support ISP-aware and difficulty-adaptive restoration research.
Significance. If the stratification accurately captures increasing motion-blur severity representative of consumer-device deployment, the benchmark would be a useful contribution by exposing performance variation hidden in standard aggregate reporting and by providing deployment-relevant metadata. The synthesis from real high-framerate video and the cross-architecture consistency of the gap are strengths that could guide more robust model development for edge systems.
major comments (2)
- [§3] §3 (Benchmark Construction): The central claim of a meaningful 7-9 dB Easy-to-Hard performance gap rests on the assertion that PSNR-guided adaptive temporal windowing produces tiers reflecting real motion-blur difficulty. Validation is limited to a 2.2× rise in optical-flow magnitude plus spectral roll-off; optical-flow magnitude is only a proxy for motion extent and does not guarantee that the resulting kernels reproduce the non-linear trajectories, rolling-shutter skew, or ISP noise correlations typical of actual iPhone camera shake. Without additional validation (e.g., direct comparison against real captured blurred images or kernel statistics), the observed gap risks being an artifact of the synthesis procedure rather than evidence of hidden difficulty variation.
- [§4] §4 (Experiments and Results): The headline result states a 'consistent 7-9 dB performance degradation' across six architectures, yet the abstract and reported findings supply neither error bars, per-model PSNR tables with standard deviations, exact threshold values used for tier boundaries, nor full architectural and training details. This absence prevents assessment of statistical reliability and undermines the claim that the gap is 'entirely hidden by aggregate reporting.'
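As a rough illustration of the reporting the second comment asks for, the sketch below computes a per-tier mean, standard deviation, and sample count next to the pooled aggregate. The PSNR values are made-up placeholders chosen only to show the mechanics; they are not results from the paper, and `tier_report` is a hypothetical helper.

```python
from statistics import mean, stdev

def tier_report(scores):
    """scores: dict mapping tier name -> list of per-image PSNR values (dB).
    Returns dict mapping tier -> (mean, sample std, n), plus a pooled 'All' row."""
    report = {tier: (mean(v), stdev(v), len(v)) for tier, v in scores.items()}
    pooled = [s for v in scores.values() for s in v]
    report["All"] = (mean(pooled), stdev(pooled), len(pooled))
    return report

# Placeholder per-image PSNR values (NOT from the paper).
scores = {
    "Easy":   [33.1, 32.4, 34.0, 33.5],
    "Medium": [29.0, 28.2, 29.6, 28.8],
    "Hard":   [25.2, 24.1, 25.9, 24.8],
}

report = tier_report(scores)
gap = report["Easy"][0] - report["Hard"][0]

for tier, (m, s, n) in report.items():
    print(f"{tier:<6} n={n:<3} PSNR = {m:.2f} +/- {s:.2f} dB")
print(f"Easy-to-Hard gap: {gap:.2f} dB")
```

The pooled row reports a single mean with a large spread, while the per-tier rows show where that spread comes from. A table of this shape for each of the six architectures would let readers check whether the 7-9 dB gap is consistent.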
minor comments (2)
- The abstract refers to 'comprehensive metadata' enabling ISP-aware strategies, but the manuscript does not provide an explicit list or example of the metadata fields included with each sample.
- Figure captions and axis labels in the spectral-analysis and optical-flow validation plots should explicitly state the number of samples per tier and any confidence intervals.
Simulated Author's Rebuttal
We thank the referee for their constructive and detailed comments on our manuscript. We address each major comment below and indicate the revisions we will make to improve the work.
Point-by-point responses
-
Referee: §3 (Benchmark Construction): The central claim of a meaningful 7-9 dB Easy-to-Hard performance gap rests on the assertion that PSNR-guided adaptive temporal windowing produces tiers reflecting real motion-blur difficulty. Validation is limited to a 2.2× rise in optical-flow magnitude plus spectral roll-off; optical-flow magnitude is only a proxy for motion extent and does not guarantee that the resulting kernels reproduce the non-linear trajectories, rolling-shutter skew, or ISP noise correlations typical of actual iPhone camera shake. Without additional validation (e.g., direct comparison against real captured blurred images or kernel statistics), the observed gap risks being an artifact of the synthesis procedure rather than evidence of hidden difficulty variation.
Authors: We appreciate the referee's emphasis on rigorous validation of the stratification. The synthesis averages frames from high-frame-rate iPhone 17 Pro video, so the blur kernels derive directly from real device motion, inherently incorporating non-linear trajectories, rolling-shutter skew, and the sensor/ISP noise profile of the iPhone. PSNR-guided windowing selects temporal spans according to measured reconstruction degradation, while the reported 2.2× optical-flow increase and spectral high-frequency roll-off provide supporting evidence of increasing severity. We acknowledge that side-by-side comparison with long-exposure captures would be valuable but is practically difficult to obtain with accurate ground-truth alignment on consumer hardware. In revision we will add trajectory-length histograms, kernel-statistic summaries, and further discussion of how the real-video synthesis captures device-specific effects. revision: partial
-
Referee: §4 (Experiments and Results): The headline result states a 'consistent 7-9 dB performance degradation' across six architectures, yet the abstract and reported findings supply neither error bars, per-model PSNR tables with standard deviations, exact threshold values used for tier boundaries, nor full architectural and training details. This absence prevents assessment of statistical reliability and undermines the claim that the gap is 'entirely hidden by aggregate reporting.'
Authors: We agree that these elements are required for reproducibility and statistical assessment. The revised manuscript will include error bars (or standard deviations) on all PSNR figures, a full per-model table with values and deviations for Easy/Medium/Hard subsets, the exact PSNR thresholds used to define the three tiers, and expanded sections detailing the six architectures, training procedures, hyperparameters, and implementation choices. These additions will allow readers to verify the consistency of the 7-9 dB gap and its concealment under aggregate metrics. revision: yes
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption: Temporal averaging of high-framerate iPhone video produces blur whose frequency content and difficulty distribution match real consumer-device motion blur.
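To make the frequency-content part of this assumption concrete, here is a short worked derivation under a strong simplifying assumption that the paper does not make: uniform linear motion of $v$ pixels per frame along one axis, with $N$ frames averaged. The symbols $I_0$, $B$, $H$, $v$, and $N$ are introduced here for illustration only.

```latex
% Blurred frame as the mean of N shifted copies of the sharp frame I_0
B(x) = \frac{1}{N} \sum_{k=0}^{N-1} I_0(x - k v)

% Fourier transform: each shift contributes a phase factor
\hat{B}(f) = \hat{I}_0(f)\, H(f), \qquad
H(f) = \frac{1}{N} \sum_{k=0}^{N-1} e^{-2\pi i f k v}
     = e^{-i \pi f (N-1) v}\, \frac{\sin(\pi N f v)}{N \sin(\pi f v)}

% Magnitude response and its first zero
|H(f)| = \left| \frac{\sin(\pi N f v)}{N \sin(\pi f v)} \right|, \qquad
f_0 = \frac{1}{N v}
```

Under this idealization, a longer window or faster motion moves the first zero $f_0$ toward lower frequencies, which matches the high-frequency suppression the abstract reports. Real handheld motion is neither uniform nor one-dimensional, so this serves only as a sanity check on the assumption, not as a validation of it.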