iPhoneBlur: A Difficulty-Stratified Benchmark for Consumer Device Motion Deblurring

Abdullah Al Shafi; Kazi Saeed Alam

arXiv: 2605.05990 · v1 · submitted 2026-05-07 · cs.CV · cs.AI


Pith reviewed 2026-05-08 14:18 UTC · model grok-4.3


keywords: benchmark, blur, consumer, iphoneblur, motion, across, aggregate, consistent


The pith

iPhoneBlur is a difficulty-stratified benchmark showing consistent 7-9 dB worse deblurring performance on hard motion blur from consumer iPhones, a gap hidden by aggregate metrics.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

This paper presents iPhoneBlur, a benchmark for testing motion deblurring algorithms on images captured with consumer smartphones such as the iPhone. Rather than reporting a single overall score, the authors built 7,400 pairs of sharp and blurred images from real high-speed videos shot on an iPhone 17 Pro in a variety of everyday settings. A method called PSNR-guided adaptive temporal windowing decides how many frames to average for each blurred image, and the resulting pairs are sorted into Easy, Medium, and Hard difficulty tiers. The sorting was checked against optical flow: measured motion rises monotonically across the tiers, by a factor of about 2.2. Spectral analysis showed that the synthesized blur suppresses high frequencies in a way consistent with real motion blur.

When six deblurring models were tested on these tiers, performance dropped by 7 to 9 dB from Easy to Hard, a drop that is invisible when only the average score across all images is reported. The work also finds that models trained on professional-camera images perform worse on iPhone data because of differences in how the cameras process images, and that fine-tuning on iPhone data recovers much of this gap. Each image pair ships with metadata, including its difficulty tier, to support research on restoration methods suited to phones with limited computing power.
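The windowing idea described above can be made concrete with a minimal sketch. The source does not specify the authors' procedure, so everything here is an assumption for illustration: the function names `psnr` and `synthesize_blur`, the symmetric window around a sharp center frame, the stopping rule (grow the window until PSNR against the sharp frame falls to a target), and the target values themselves.

```python
import numpy as np

def psnr(a, b, peak=1.0):
    """Peak signal-to-noise ratio in dB between two images with values in [0, peak]."""
    mse = np.mean((np.asarray(a, dtype=np.float64) - np.asarray(b, dtype=np.float64)) ** 2)
    if mse == 0:
        return float("inf")
    return 10.0 * np.log10(peak ** 2 / mse)

def synthesize_blur(frames, center, target_psnr, max_half_window=16):
    """Average a growing symmetric window of frames around `center`.

    Hypothetical stopping rule: stop as soon as the averaged frame's PSNR
    against the sharp center frame is at or below `target_psnr`.
    `frames` is an array of shape (T, H, W) or (T, H, W, C).
    Returns (blurred, half_window, achieved_psnr).
    """
    sharp = frames[center]
    blurred, half, score = sharp.copy(), 0, float("inf")
    for h in range(1, max_half_window + 1):
        lo, hi = center - h, center + h
        if lo < 0 or hi >= len(frames):
            break  # window would run off the clip
        blurred = frames[lo:hi + 1].mean(axis=0)  # temporal average = synthetic blur
        half, score = h, psnr(blurred, sharp)
        if score <= target_psnr:
            break
    return blurred, half, score

# Toy clip: a bright bar moving one pixel per frame (not real benchmark data).
clip = np.zeros((41, 16, 64))
for t in range(41):
    clip[t, :, t:t + 8] = 1.0

_, half_easy, psnr_easy = synthesize_blur(clip, center=20, target_psnr=21.0)
_, half_hard, psnr_hard = synthesize_blur(clip, center=20, target_psnr=15.0)
```

Under this rule a lower PSNR target forces a wider averaging window, which is one plausible way a PSNR threshold could map to Easy, Medium, and Hard tiers.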

Core claim

Evaluation of six architectures reveals consistent 7-9 dB performance degradation from Easy to Hard subsets, a substantial gap entirely hidden by aggregate reporting.
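To see how an aggregate score can hide a tier gap, here is a small sketch of stratified reporting. The function name `stratified_report` and the PSNR values are placeholders invented for illustration; they are not results from the paper.

```python
from collections import defaultdict
from statistics import mean

def stratified_report(scores):
    """scores: list of (tier, psnr_db) pairs.

    Returns (aggregate_mean, {tier: tier_mean}).
    """
    by_tier = defaultdict(list)
    for tier, value in scores:
        by_tier[tier].append(value)
    aggregate = mean(value for _, value in scores)
    return aggregate, {tier: mean(values) for tier, values in by_tier.items()}

# Placeholder numbers only, chosen to show the effect.
toy_scores = [
    ("Easy", 34.0), ("Easy", 33.0),
    ("Medium", 30.0), ("Medium", 29.0),
    ("Hard", 26.0), ("Hard", 25.0),
]
aggregate, per_tier = stratified_report(toy_scores)
gap = per_tier["Easy"] - per_tier["Hard"]
```

With these toy values the single aggregate number sits in the middle of the range and says nothing about the Easy-to-Hard spread, which only the per-tier means expose.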

Load-bearing premise

The PSNR-guided adaptive temporal windowing on high-framerate iPhone videos produces a meaningful and realistic stratification of motion blur difficulty that generalizes to actual consumer device deployment.

Original abstract

Motion blur restoration on consumer mobile devices is typically evaluated using aggregate metrics that obscure performance variation across blur difficulty, masking model behavior under real deployment conditions. This work introduces iPhoneBlur, a difficulty-stratified benchmark of 7,400 image pairs synthesized from high-framerate iPhone 17 Pro videos captured in diverse real-world scenarios. Samples are partitioned into Easy, Medium, and Hard categories through PSNR-guided adaptive temporal windowing, with stratification validated by monotonic 2.2x increase in optical flow magnitude across tiers. Each sample includes comprehensive metadata enabling investigation of ISP-aware and difficulty-adaptive restoration strategies. Spectral analysis confirms synthesized blur exhibits high-frequency suppression patterns consistent with authentic motion degradation. Evaluation of six architectures reveals consistent 7-9 dB performance degradation from Easy to Hard subsets, a substantial gap entirely hidden by aggregate reporting. The benchmark further exposes a domain gap between professional and consumer cameras which targeted fine-tuning substantially recovers. By coupling difficulty stratification with deployment-critical metadata, iPhoneBlur enables systematic assessment of model reliability and failure modes for resource-constrained edge systems.
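The abstract's spectral check (high-frequency suppression in the synthesized blur) can be illustrated with a simple energy-ratio measure. This is a sketch under assumptions, not the authors' analysis: the function name `high_freq_energy_ratio`, the radial cutoff of 0.25 cycles per pixel, and the toy box blur are all hypothetical.

```python
import numpy as np

def high_freq_energy_ratio(image, cutoff=0.25):
    """Fraction of spectral power above a radial frequency cutoff (cycles/pixel).

    The mean is removed first so the DC term does not dominate the ratio.
    """
    img = np.asarray(image, dtype=np.float64)
    img = img - img.mean()
    power = np.abs(np.fft.fft2(img)) ** 2
    fy = np.fft.fftfreq(img.shape[0])[:, None]
    fx = np.fft.fftfreq(img.shape[1])[None, :]
    radius = np.sqrt(fx ** 2 + fy ** 2)
    total = power.sum()
    if total == 0:
        return 0.0
    return float(power[radius > cutoff].sum() / total)

# Toy data: random texture, then a 9-pixel horizontal box blur built by
# averaging shifted copies (a stand-in for temporal frame averaging).
rng = np.random.default_rng(0)
sharp = rng.random((64, 64))
blurred = np.mean([np.roll(sharp, s, axis=1) for s in range(9)], axis=0)

ratio_sharp = high_freq_energy_ratio(sharp)
ratio_blurred = high_freq_energy_ratio(blurred)
```

A lower ratio for the blurred image than for the sharp one is the kind of high-frequency roll-off the abstract refers to; comparing such ratios across tiers would be one way to inspect whether suppression grows with difficulty.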

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

The paper delivers a consumer-specific deblurring benchmark that exposes a 7-9 dB performance drop across difficulty tiers hidden by averages, but the PSNR-guided stratification rests on proxies whose match to real camera blur is not fully demonstrated.


The main takeaway is that iPhoneBlur splits 7,400 synthesized pairs from iPhone 17 Pro high-framerate video into easy, medium, and hard sets using PSNR to pick temporal averaging windows. Six models show a steady 7-9 dB drop from easy to hard, and the work also flags a domain gap versus professional cameras that fine-tuning reduces. Metadata on each sample is included for ISP-related follow-ups. Spectral checks and a 2.2x optical-flow rise are offered as confirmation that the tiers track increasing blur severity.

This setup is new because earlier deblurring benchmarks stayed with aggregate scores or non-consumer sources. The stratification idea plus the device-specific data gives a clearer picture of where models fail under realistic conditions. The metadata angle is practical for edge-system work.

The soft spots sit in the tier construction and reporting. Optical-flow magnitude and high-frequency roll-off are reasonable proxies, yet they do not directly verify that the synthesized kernels reproduce the non-linear trajectories, rolling-shutter effects, or noise correlations of actual handheld shake. If the PSNR windows also select for scene texture or lighting differences, the observed gap could partly reflect those factors rather than blur difficulty alone. The abstract omits error bars, exact PSNR cutoffs, and full model specifications, which makes it harder to judge how stable the 7-9 dB figure is. The domain-gap recovery is noted but lacks quantitative detail on the improvement size.

This paper is for researchers building or testing restoration models for phones and other resource-limited devices. Anyone who needs to measure reliability beyond single-number scores will get concrete value from the stratified results and the released metadata. It is not a theoretical paper, but the empirical contribution is clear enough to warrant referee attention. The core observation about hidden variation is supported by the experiments shown, even though the validation of the difficulty ordering could be tightened. I would send it for peer review.

Referee Report

2 major / 2 minor

Summary. The paper introduces iPhoneBlur, a benchmark of 7,400 image pairs synthesized from high-framerate iPhone 17 Pro videos captured in diverse real-world scenarios. Samples are partitioned into Easy, Medium, and Hard difficulty categories via PSNR-guided adaptive temporal windowing, with the ordering validated by a 2.2× monotonic rise in optical-flow magnitude and spectral high-frequency roll-off. Evaluation of six deblurring architectures demonstrates a consistent 7-9 dB PSNR degradation from Easy to Hard subsets that is masked by aggregate metrics; the work also reports a domain gap between professional and consumer cameras that targeted fine-tuning substantially recovers, and supplies metadata to support ISP-aware and difficulty-adaptive restoration research.
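The optical-flow validation mentioned in the summary amounts to checking that mean flow magnitude rises from tier to tier. A minimal sketch of that check follows, assuming dense flow fields have already been estimated by some external method; the function names, the `(H, W, 2)` layout, and the reading of "2.2×" as the Hard-to-Easy ratio are assumptions, and the toy flows are synthetic.

```python
import numpy as np

TIERS = ("Easy", "Medium", "Hard")

def mean_flow_magnitude(flow):
    """Mean per-pixel displacement (in pixels) of a dense flow field of shape (H, W, 2)."""
    flow = np.asarray(flow, dtype=np.float64)
    return float(np.linalg.norm(flow, axis=-1).mean())

def tier_flow_summary(flows_by_tier):
    """Return (per-tier mean magnitude, monotonic flag, Hard/Easy ratio)."""
    means = {
        tier: float(np.mean([mean_flow_magnitude(f) for f in flows_by_tier[tier]]))
        for tier in TIERS
    }
    monotonic = means["Easy"] < means["Medium"] < means["Hard"]
    ratio = means["Hard"] / means["Easy"]
    return means, monotonic, ratio

# Synthetic uniform flows (3, 4) pixels, scaled per tier; not benchmark data.
base = np.tile([3.0, 4.0], (4, 4, 1))  # every pixel has magnitude 5
toy_flows = {
    "Easy": [base * 1.0, base * 1.0],
    "Medium": [base * 1.5, base * 1.5],
    "Hard": [base * 2.2, base * 2.2],
}
means, monotonic, ratio = tier_flow_summary(toy_flows)
```

A check like this confirms ordering and ratio only; it says nothing about trajectory shape, which is the limitation the referee's first major comment raises.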

Significance. If the stratification accurately captures increasing motion-blur severity representative of consumer-device deployment, the benchmark would be a useful contribution by exposing performance variation hidden in standard aggregate reporting and by providing deployment-relevant metadata. The synthesis from real high-framerate video and the cross-architecture consistency of the gap are strengths that could guide more robust model development for edge systems.

major comments (2)

§3 (Benchmark Construction): The central claim of a meaningful 7-9 dB Easy-to-Hard performance gap rests on the assertion that PSNR-guided adaptive temporal windowing produces tiers reflecting real motion-blur difficulty. Validation is limited to a 2.2× rise in optical-flow magnitude plus spectral roll-off; optical-flow magnitude is only a proxy for motion extent and does not guarantee that the resulting kernels reproduce the non-linear trajectories, rolling-shutter skew, or ISP noise correlations typical of actual iPhone camera shake. Without additional validation (e.g., direct comparison against real captured blurred images or kernel statistics), the observed gap risks being an artifact of the synthesis procedure rather than evidence of hidden difficulty variation.
§4 (Experiments and Results): The headline result states a 'consistent 7-9 dB performance degradation' across six architectures, yet the abstract and reported findings supply neither error bars, per-model PSNR tables with standard deviations, exact threshold values used for tier boundaries, nor full architectural and training details. This absence prevents assessment of statistical reliability and undermines the claim that the gap is 'entirely hidden by aggregate reporting.'
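The statistical reporting requested in the second major comment could take the form of a bootstrap confidence interval on the Easy-minus-Hard PSNR gap. The sketch below is one common way to do this, not a method attributed to the paper; the function name `bootstrap_gap_ci`, the percentile interval, and the synthetic per-image scores are all assumptions.

```python
import numpy as np

def bootstrap_gap_ci(easy, hard, n_boot=2000, alpha=0.05, seed=0):
    """Percentile bootstrap CI for mean(easy) - mean(hard), in dB.

    Returns (point_estimate, ci_low, ci_high).
    """
    rng = np.random.default_rng(seed)
    easy = np.asarray(easy, dtype=np.float64)
    hard = np.asarray(hard, dtype=np.float64)
    gaps = np.empty(n_boot)
    for i in range(n_boot):
        # Resample each tier with replacement, independently.
        e = rng.choice(easy, size=easy.size, replace=True)
        h = rng.choice(hard, size=hard.size, replace=True)
        gaps[i] = e.mean() - h.mean()
    lo, hi = np.quantile(gaps, [alpha / 2, 1 - alpha / 2])
    return float(easy.mean() - hard.mean()), float(lo), float(hi)

# Synthetic per-image PSNR scores for one hypothetical model (not paper data).
data_rng = np.random.default_rng(1)
easy_scores = data_rng.normal(33.0, 1.5, size=200)
hard_scores = data_rng.normal(25.0, 2.0, size=200)
gap, ci_low, ci_high = bootstrap_gap_ci(easy_scores, hard_scores)
```

Reporting the interval alongside each per-model gap would let a reader judge whether a "7-9 dB" range is stable or driven by a few samples.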

minor comments (2)

The abstract refers to 'comprehensive metadata' enabling ISP-aware strategies, but the manuscript does not provide an explicit list or example of the metadata fields included with each sample.
Figure captions and axis labels in the spectral-analysis and optical-flow validation plots should explicitly state the number of samples per tier and any confidence intervals.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive and detailed comments on our manuscript. We address each major comment below and indicate the revisions we will make to improve the work.


Referee: §3 (Benchmark Construction): The central claim of a meaningful 7-9 dB Easy-to-Hard performance gap rests on the assertion that PSNR-guided adaptive temporal windowing produces tiers reflecting real motion-blur difficulty. Validation is limited to a 2.2× rise in optical-flow magnitude plus spectral roll-off; optical-flow magnitude is only a proxy for motion extent and does not guarantee that the resulting kernels reproduce the non-linear trajectories, rolling-shutter skew, or ISP noise correlations typical of actual iPhone camera shake. Without additional validation (e.g., direct comparison against real captured blurred images or kernel statistics), the observed gap risks being an artifact of the synthesis procedure rather than evidence of hidden difficulty variation.

Authors: We appreciate the referee's emphasis on rigorous validation of the stratification. The synthesis averages frames from high-frame-rate iPhone 17 Pro video, so the blur kernels derive directly from real device motion, inherently incorporating non-linear trajectories, rolling-shutter skew, and the sensor/ISP noise profile of the iPhone. PSNR-guided windowing selects temporal spans according to measured reconstruction degradation, while the reported 2.2× optical-flow increase and spectral high-frequency roll-off provide supporting evidence of increasing severity. We acknowledge that side-by-side comparison with long-exposure captures would be valuable but is practically difficult to obtain with accurate ground-truth alignment on consumer hardware. In revision we will add trajectory-length histograms, kernel-statistic summaries, and further discussion of how the real-video synthesis captures device-specific effects.
Revision: partial
Referee: §4 (Experiments and Results): The headline result states a 'consistent 7-9 dB performance degradation' across six architectures, yet the abstract and reported findings supply neither error bars, per-model PSNR tables with standard deviations, exact threshold values used for tier boundaries, nor full architectural and training details. This absence prevents assessment of statistical reliability and undermines the claim that the gap is 'entirely hidden by aggregate reporting.'

Authors: We agree that these elements are required for reproducibility and statistical assessment. The revised manuscript will include error bars (or standard deviations) on all PSNR figures, a full per-model table with values and deviations for Easy/Medium/Hard subsets, the exact PSNR thresholds used to define the three tiers, and expanded sections detailing the six architectures, training procedures, hyperparameters, and implementation choices. These additions will allow readers to verify the consistency of the 7-9 dB gap and its concealment under aggregate metrics.
Revision: yes

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claims rest on the realism of synthesized blur and the validity of the stratification procedure; these are domain assumptions rather than derived quantities.

axioms (1)

Domain assumption: Temporal averaging of high-framerate iPhone video produces blur whose frequency content and difficulty distribution match real consumer-device motion blur.
Invoked to justify the benchmark pairs and their use for evaluating restoration models.

pith-pipeline@v0.9.0 · 5490 in / 1345 out tokens · 72091 ms · 2026-05-08T14:18:57.436578+00:00 · methodology
