
arxiv: 2605.05990 · v1 · submitted 2026-05-07 · 💻 cs.CV · cs.AI

iPhoneBlur: A Difficulty-Stratified Benchmark for Consumer Device Motion Deblurring

Pith reviewed 2026-05-08 14:18 UTC · model grok-4.3

classification 💻 cs.CV cs.AI
keywords benchmark, blur, consumer, iphoneblur, motion, across, aggregate, consistent

The pith

iPhoneBlur is a difficulty-stratified benchmark showing that deblurring performance on consumer iPhone motion blur is consistently 7-9 dB worse on Hard cases than on Easy ones, a gap hidden by aggregate metrics.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

This paper presents iPhoneBlur, a benchmark for testing motion deblurring algorithms on images captured with consumer smartphones such as the iPhone. Rather than reporting a single overall score, the authors built 7,400 pairs of sharp and blurred images from real high-speed videos recorded on an iPhone 17 Pro in a range of everyday settings. A method called PSNR-guided adaptive temporal windowing decides how many frames to average for each blurred image and sorts the results into Easy, Medium, and Hard difficulty levels. The sorting was validated by showing that optical-flow magnitude, a measure of movement, rises monotonically by about 2.2 times across the levels, and spectral analysis showed that the synthesized blur suppresses high frequencies in a way consistent with real motion blur. Across six deblurring models, performance dropped by 7 to 9 dB from Easy to Hard cases, a drop that disappears when only average scores over all images are reported. The work also finds that models trained on professional-camera images perform worse on iPhone data because of differences in how the cameras process images, and that fine-tuning on iPhone data recovers much of the loss. Each image pair ships with metadata, including its difficulty level, to support research on restoration methods suited to phones with limited computing power.
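The windowing step can be illustrated with a minimal sketch. The paper's exact procedure and thresholds are not given here, so everything below is an assumption: the function names, the symmetric window, the rule "grow the window until PSNR against the sharp center frame falls to a target", and the three target values are all hypothetical. Frames are flat lists of pixel intensities in [0, 1], and the toy video is a step edge moving one pixel per frame.

```python
import math

def psnr(a, b, peak=1.0):
    """PSNR in dB between two equal-length flat images with values in [0, peak]."""
    mse = sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)
    return math.inf if mse == 0 else 10.0 * math.log10(peak ** 2 / mse)

def average_frames(frames):
    """Pixel-wise mean of several frames, imitating a longer exposure."""
    n = len(frames)
    return [sum(px) / n for px in zip(*frames)]

def adaptive_window(frames, center, target_psnr, max_half_width):
    """Grow a symmetric window around `center` until the averaged frame's PSNR
    against the sharp center frame is at or below `target_psnr`.
    Returns (half_width, blurred_frame, psnr_db). Hypothetical rule, not the
    paper's published algorithm."""
    sharp = frames[center]
    best = (0, list(sharp), math.inf)
    for h in range(1, max_half_width + 1):
        lo, hi = center - h, center + h
        if lo < 0 or hi >= len(frames):
            break  # window would run off the clip
        blurred = average_frames(frames[lo:hi + 1])
        score = psnr(blurred, sharp)
        best = (h, blurred, score)
        if score <= target_psnr:
            break
    return best

# Toy clip: a 32-pixel row with a step edge that moves one pixel per frame.
frames = [[1.0 if i < 6 + t else 0.0 for i in range(32)] for t in range(21)]

# Hypothetical PSNR targets per tier (dB); lower target means heavier blur.
targets = {"easy": 20.0, "medium": 18.0, "hard": 16.0}
windows = {tier: adaptive_window(frames, 10, t, 10) for tier, t in targets.items()}
for tier, (h, _, score) in windows.items():
    print(f"{tier}: {2 * h + 1} frames averaged, PSNR {score:.2f} dB")
```

Under these assumptions, a lower PSNR target forces a wider temporal window, which is the sense in which the tiers encode increasing blur extent.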

Core claim

Evaluation of six architectures reveals consistent 7-9 dB performance degradation from Easy to Hard subsets, a substantial gap entirely hidden by aggregate reporting.
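A small sketch shows how a single aggregate number can conceal the tier gap. The per-image scores are invented for illustration and are not results from the paper; the function name and the report keys are likewise hypothetical.

```python
from statistics import mean

# Hypothetical per-image PSNR scores (dB) for one model; not values from the paper.
scores = [
    ("easy", 33.0), ("easy", 32.0), ("easy", 34.0),
    ("medium", 29.0), ("medium", 28.0),
    ("hard", 25.0), ("hard", 24.0), ("hard", 26.0),
]

def stratified_report(scores):
    """Return per-tier mean PSNR alongside the pooled (aggregate) mean."""
    by_tier = {}
    for tier, value in scores:
        by_tier.setdefault(tier, []).append(value)
    report = {tier: mean(vals) for tier, vals in by_tier.items()}
    report["aggregate"] = mean(v for _, v in scores)
    report["easy_to_hard_gap"] = report["easy"] - report["hard"]
    return report

report = stratified_report(scores)
for key, value in report.items():
    print(f"{key}: {value:.2f} dB")
```

The aggregate mean sits between the Easy and Hard means and says nothing about their spread; only the per-tier rows expose the gap.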

Load-bearing premise

The PSNR-guided adaptive temporal windowing on high-framerate iPhone videos produces a meaningful and realistic stratification of motion blur difficulty that generalizes to actual consumer device deployment.

Original abstract

Motion blur restoration on consumer mobile devices is typically evaluated using aggregate metrics that obscure performance variation across blur difficulty, masking model behavior under real deployment conditions. This work introduces iPhoneBlur, a difficulty-stratified benchmark of 7,400 image pairs synthesized from high-framerate iPhone 17 Pro videos captured in diverse real-world scenarios. Samples are partitioned into Easy, Medium, and Hard categories through PSNR-guided adaptive temporal windowing, with stratification validated by monotonic 2.2x increase in optical flow magnitude across tiers. Each sample includes comprehensive metadata enabling investigation of ISP-aware and difficulty-adaptive restoration strategies. Spectral analysis confirms synthesized blur exhibits high-frequency suppression patterns consistent with authentic motion degradation. Evaluation of six architectures reveals consistent 7-9 dB performance degradation from Easy to Hard subsets, a substantial gap entirely hidden by aggregate reporting. The benchmark further exposes a domain gap between professional and consumer cameras which targeted fine-tuning substantially recovers. By coupling difficulty stratification with deployment-critical metadata, iPhoneBlur enables systematic assessment of model reliability and failure modes for resource-constrained edge systems.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper introduces iPhoneBlur, a benchmark of 7,400 image pairs synthesized from high-framerate iPhone 17 Pro videos captured in diverse real-world scenarios. Samples are partitioned into Easy, Medium, and Hard difficulty categories via PSNR-guided adaptive temporal windowing, with the ordering validated by a 2.2× monotonic rise in optical-flow magnitude and spectral high-frequency roll-off. Evaluation of six deblurring architectures demonstrates a consistent 7-9 dB PSNR degradation from Easy to Hard subsets that is masked by aggregate metrics; the work also reports a domain gap between professional and consumer cameras that targeted fine-tuning substantially recovers, and supplies metadata to support ISP-aware and difficulty-adaptive restoration research.

Significance. If the stratification accurately captures increasing motion-blur severity representative of consumer-device deployment, the benchmark would be a useful contribution by exposing performance variation hidden in standard aggregate reporting and by providing deployment-relevant metadata. The synthesis from real high-framerate video and the cross-architecture consistency of the gap are strengths that could guide more robust model development for edge systems.

major comments (2)
  1. [§3] §3 (Benchmark Construction): The central claim of a meaningful 7-9 dB Easy-to-Hard performance gap rests on the assertion that PSNR-guided adaptive temporal windowing produces tiers reflecting real motion-blur difficulty. Validation is limited to a 2.2× rise in optical-flow magnitude plus spectral roll-off; optical-flow magnitude is only a proxy for motion extent and does not guarantee that the resulting kernels reproduce the non-linear trajectories, rolling-shutter skew, or ISP noise correlations typical of actual iPhone camera shake. Without additional validation (e.g., direct comparison against real captured blurred images or kernel statistics), the observed gap risks being an artifact of the synthesis procedure rather than evidence of hidden difficulty variation.
  2. [§4] §4 (Experiments and Results): The headline result states a 'consistent 7-9 dB performance degradation' across six architectures, yet the abstract and reported findings supply no error bars, no per-model PSNR tables with standard deviations, no exact threshold values for the tier boundaries, and no full architectural or training details. This absence prevents assessment of statistical reliability and undermines the claim that the gap is 'entirely hidden by aggregate reporting.'
minor comments (2)
  1. The abstract refers to 'comprehensive metadata' enabling ISP-aware strategies, but the manuscript does not provide an explicit list or example of the metadata fields included with each sample.
  2. Figure captions and axis labels in the spectral-analysis and optical-flow validation plots should explicitly state the number of samples per tier and any confidence intervals.
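The spectral roll-off check referred to in the comments above can be sketched in one dimension. This is an illustration under stated assumptions, not the paper's analysis: a circular moving average stands in for temporal averaging of a uniformly translating scene, the "upper half of the band" cutoff is arbitrary, and all function names are hypothetical. The naive DFT keeps the example dependency-free.

```python
import cmath
import math
import random

def dft_power(signal):
    """Power spectrum |X[k]|^2 for k = 0 .. n/2, via a naive O(n^2) DFT."""
    n = len(signal)
    return [
        abs(sum(x * cmath.exp(-2j * math.pi * k * t / n)
                for t, x in enumerate(signal))) ** 2
        for k in range(n // 2 + 1)
    ]

def high_freq_ratio(signal):
    """Share of non-DC spectral power lying in the upper half of the band."""
    power = dft_power(signal)[1:]  # drop the DC term
    cut = len(power) // 2
    total = sum(power)
    return sum(power[cut:]) / total if total else 0.0

def box_blur(signal, width):
    """Circular moving average with an odd `width` (stand-in for motion blur)."""
    n, half = len(signal), width // 2
    return [sum(signal[(i + d) % n] for d in range(-half, half + 1)) / width
            for i in range(n)]

rng = random.Random(0)
sharp = [rng.random() for _ in range(64)]  # synthetic texture row

ratio_sharp = high_freq_ratio(sharp)
ratio_blur3 = high_freq_ratio(box_blur(sharp, 3))
ratio_blur7 = high_freq_ratio(box_blur(sharp, 7))
print(f"sharp: {ratio_sharp:.3f}  width 3: {ratio_blur3:.3f}  width 7: {ratio_blur7:.3f}")
```

A drop in this ratio after averaging is what "high-frequency suppression" means operationally. It shows the blur behaves like a low-pass filter, but it does not by itself establish that the kernels match real camera shake, which is the referee's concern.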

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive and detailed comments on our manuscript. We address each major comment below and indicate the revisions we will make to improve the work.

Point-by-point responses
  1. Referee: §3 (Benchmark Construction): The central claim of a meaningful 7-9 dB Easy-to-Hard performance gap rests on the assertion that PSNR-guided adaptive temporal windowing produces tiers reflecting real motion-blur difficulty. Validation is limited to a 2.2× rise in optical-flow magnitude plus spectral roll-off; optical-flow magnitude is only a proxy for motion extent and does not guarantee that the resulting kernels reproduce the non-linear trajectories, rolling-shutter skew, or ISP noise correlations typical of actual iPhone camera shake. Without additional validation (e.g., direct comparison against real captured blurred images or kernel statistics), the observed gap risks being an artifact of the synthesis procedure rather than evidence of hidden difficulty variation.

    Authors: We appreciate the referee's emphasis on rigorous validation of the stratification. The synthesis averages frames from high-frame-rate iPhone 17 Pro video, so the blur kernels derive directly from real device motion, inherently incorporating non-linear trajectories, rolling-shutter skew, and the sensor/ISP noise profile of the iPhone. PSNR-guided windowing selects temporal spans according to measured reconstruction degradation, while the reported 2.2× optical-flow increase and spectral high-frequency roll-off provide supporting evidence of increasing severity. We acknowledge that side-by-side comparison with long-exposure captures would be valuable but is practically difficult to obtain with accurate ground-truth alignment on consumer hardware. In revision we will add trajectory-length histograms, kernel-statistic summaries, and further discussion of how the real-video synthesis captures device-specific effects. revision: partial

  2. Referee: §4 (Experiments and Results): The headline result states a 'consistent 7-9 dB performance degradation' across six architectures, yet the abstract and reported findings supply no error bars, no per-model PSNR tables with standard deviations, no exact threshold values for the tier boundaries, and no full architectural or training details. This absence prevents assessment of statistical reliability and undermines the claim that the gap is 'entirely hidden by aggregate reporting.'

    Authors: We agree that these elements are required for reproducibility and statistical assessment. The revised manuscript will include error bars (or standard deviations) on all PSNR figures, a full per-model table with values and deviations for Easy/Medium/Hard subsets, the exact PSNR thresholds used to define the three tiers, and expanded sections detailing the six architectures, training procedures, hyperparameters, and implementation choices. These additions will allow readers to verify the consistency of the 7-9 dB gap and its concealment under aggregate metrics. revision: yes
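As a sketch of what the promised statistics might look like, the following computes per-tier mean and standard deviation plus a percentile-bootstrap confidence interval for the Easy-to-Hard gap. The scores are invented for illustration, the function names are hypothetical, and the bootstrap is one reasonable choice among several; the authors do not say which method they will use.

```python
import random
from statistics import mean, stdev

def tier_summary(values):
    """Sample size, mean, and sample standard deviation for one tier."""
    return {"n": len(values), "mean": mean(values), "std": stdev(values)}

def bootstrap_gap_ci(easy, hard, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile-bootstrap CI for mean(easy) - mean(hard)."""
    rng = random.Random(seed)
    gaps = []
    for _ in range(n_resamples):
        e = [rng.choice(easy) for _ in easy]  # resample with replacement
        h = [rng.choice(hard) for _ in hard]
        gaps.append(mean(e) - mean(h))
    gaps.sort()
    lo = gaps[int((alpha / 2) * n_resamples)]
    hi = gaps[int((1 - alpha / 2) * n_resamples) - 1]
    return lo, hi

# Hypothetical per-image PSNR (dB) for one model; not values from the paper.
easy = [33.1, 32.4, 34.0, 31.8, 33.5]
hard = [25.2, 24.1, 26.0, 24.8, 25.5]

easy_stats, hard_stats = tier_summary(easy), tier_summary(hard)
gap = easy_stats["mean"] - hard_stats["mean"]
ci_lo, ci_hi = bootstrap_gap_ci(easy, hard)
print(f"Easy {easy_stats['mean']:.2f} ± {easy_stats['std']:.2f} dB")
print(f"Hard {hard_stats['mean']:.2f} ± {hard_stats['std']:.2f} dB")
print(f"Gap {gap:.2f} dB, 95% CI [{ci_lo:.2f}, {ci_hi:.2f}]")
```

Reporting the interval for each of the six architectures would let readers judge whether "consistent 7-9 dB" holds beyond the point estimates.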

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claims rest on the realism of synthesized blur and the validity of the stratification procedure; these are domain assumptions rather than derived quantities.

axioms (1)
  • domain assumption: Temporal averaging of high-framerate iPhone video produces blur whose frequency content and difficulty distribution match real consumer-device motion blur
    Invoked to justify the benchmark pairs and their use for evaluating restoration models.

pith-pipeline@v0.9.0 · 5490 in / 1345 out tokens · 72091 ms · 2026-05-08T14:18:57.436578+00:00 · methodology
