pith. sign in

arxiv: 2605.29212 · v2 · pith:MFQOFASLnew · submitted 2026-05-28 · 💻 cs.CV · cs.HC

MetaRanker: Human-in-the-loop Active Ranking for Metalens Image Quality

Pith reviewed 2026-06-30 11:12 UTC · model grok-4.3

classification 💻 cs.CV cs.HC
keywords metalensimage quality assessmentactive rankinghuman-in-the-loopsemantic interpretabilitypreference modelingvision-language models
0
0 comments X

The pith

MetaRanker ranks metalens images by semantic interpretability using active human feedback, aligning with human judgments while cutting annotations by about 80%.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces MetaRanker to evaluate metalens image quality based on how interpretable the images are to humans rather than traditional distortion metrics. It combines probabilistic models and uncertainty sampling with vision-language model priors to select which image pairs humans should compare. This approach produces rankings that match human assessments more closely than standard metrics across different metalens datasets. A sympathetic reader would care because better evaluation can guide the design of practical ultra-thin lenses for miniaturized cameras. The method shows that perception-distortion trade-offs make current proxies inadequate for metalens pipelines.

Core claim

MetaRanker formalizes metalens image quality in terms of semantic interpretability and employs a human-in-the-loop active ranking framework that uses a probabilistic preference model with uncertainty-aware query selection, guided by vision-language model priors only for sampling, to achieve rankings closely aligned with human assessments while requiring approximately 80% fewer pairwise annotations than exhaustive evaluation.

What carries the argument

MetaRanker, a human-in-the-loop active ranking framework that combines a probabilistic preference model with uncertainty-aware query selection and leverages vision-language models to guide sampling of comparisons while keeping human judgments as the primary signal.

Load-bearing premise

Human judgments of semantic interpretability remain the primary supervision signal and are not materially influenced by the VLM priors that are used only to guide sampling of comparisons.

What would settle it

A controlled study on a new metalens dataset where exhaustive human pairwise comparisons produce rankings that diverge substantially from those generated by MetaRanker, while aligning better with PSNR or similar metrics.

Figures

Figures reproduced from arXiv: 2605.29212 by Haejun Chung, Ikbeom Jang, Yujin Park.

Figure 2
Figure 2. Figure 2: Metric–perception misalignment. Left images yield [PITH_FULL_IMAGE:figures/full_fig_p002_2.png] view at source ↗
Figure 3
Figure 3. Figure 3: Overview of MetaRanker. (A) Input: Metalens outputs exhibit diverse, metric-misaligned artifacts (e.g., chromatic [PITH_FULL_IMAGE:figures/full_fig_p003_3.png] view at source ↗
Figure 4
Figure 4. Figure 4: The structured prompt used to extract semantic [PITH_FULL_IMAGE:figures/full_fig_p004_4.png] view at source ↗
Figure 5
Figure 5. Figure 5: Dataset-level optical severity. CA Shift (px) mea [PITH_FULL_IMAGE:figures/full_fig_p005_5.png] view at source ↗
Figure 6
Figure 6. Figure 6: Controlled simulation against exhaustive PSNR [PITH_FULL_IMAGE:figures/full_fig_p007_6.png] view at source ↗
read the original abstract

Image quality in modern imaging systems emerges from the coupled effects of the sensor, optics, and computational reconstruction. Ultra-thin metalenses offer a path toward substantial miniaturization of optical modules, but practical designs often exhibit pronounced chromatic and field-dependent aberrations that necessitate computational reconstruction. In current metalens pipelines, reconstruction models are commonly trained and selected using distortion-based fidelity objectives, such as PSNR, yet these proxies can be weakly correlated with human preference and downstream utility, reflecting the well-known perception--distortion trade-off. We introduce MetaRanker, a human-in-the-loop active ranking framework that formalizes metalens image quality in terms of semantic interpretability, defined as the degree to which humans can reliably recognize objects and structures in the presence of optical artifacts. MetaRanker combines a probabilistic preference model with uncertainty-aware query selection, and leverages vision--language models to provide lightweight semantic priors. Importantly, these priors are used only to guide the sampling of informative comparisons; human judgments remain the primary supervision signal throughout. Across real-world and synthetic metalens datasets with distinct degradation profiles, MetaRanker produces rankings that align most closely with human assessments, while reducing the number of pairwise annotations required by approximately 80% relative to exhaustive pairwise evaluation. Finally, we show that standard image quality assessment metrics exhibit limited alignment with human interpretability in the metalens domain, positioning MetaRanker as a practical step toward perceptually grounded metalens evaluation and co-design.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 1 minor

Summary. The paper introduces MetaRanker, a human-in-the-loop active ranking framework for metalens image quality assessment. It formalizes quality via semantic interpretability (human ability to recognize objects despite optical artifacts) rather than distortion metrics like PSNR. The method uses a probabilistic preference model with uncertainty-aware query selection, employing VLM semantic priors exclusively to guide sampling of comparisons while keeping human judgments as the primary supervision. Evaluated on real-world and synthetic metalens datasets with distinct degradation profiles, it claims rankings that align most closely with human assessments and an ~80% reduction in pairwise annotations relative to exhaustive evaluation. It further shows that standard IQA metrics have limited alignment with human interpretability in this domain.

Significance. If the central claims hold, the work provides a concrete step toward perceptually grounded evaluation and co-design for metalens systems, directly addressing the perception-distortion tradeoff. The explicit design choice to restrict VLMs to query selection (with humans as sole primary signal) and the active-learning reduction in annotation effort are strengths that could generalize to other computational imaging pipelines where traditional metrics diverge from downstream utility.

major comments (1)
  1. [Abstract and Evaluation section] Abstract and § on Experiments/Evaluation: The strongest claim (superior human alignment + ~80% annotation reduction) is load-bearing on the assertion that 'human judgments remain the primary supervision signal throughout' and that VLM priors affect only sampling. The manuscript does not report (i) whether held-out human rankings used for alignment metrics were drawn from the VLM-guided pool or an independent exhaustive set, nor (ii) an ablation of VLM-guided selection versus random or uncertainty-only baselines. Without these, the reported alignment and efficiency figures are consistent with selection bias toward VLM-human concordant pairs, undermining the separation claim.
minor comments (1)
  1. [Abstract] The abstract states 'distinct degradation profiles' for the datasets but provides no quantitative characterization (e.g., aberration statistics or PSNR ranges) in the main text; adding a short table or paragraph would improve reproducibility.

Simulated Author's Rebuttal

1 responses · 0 unresolved

We thank the referee for the detailed and constructive feedback. The concern regarding potential selection bias and the need for explicit validation of the VLM-prior separation is well-taken and directly addresses the load-bearing claims in the abstract and evaluation sections. We address the points below and will revise the manuscript accordingly.

read point-by-point responses
  1. Referee: [Abstract and Evaluation section] Abstract and § on Experiments/Evaluation: The strongest claim (superior human alignment + ~80% annotation reduction) is load-bearing on the assertion that 'human judgments remain the primary supervision signal throughout' and that VLM priors affect only sampling. The manuscript does not report (i) whether held-out human rankings used for alignment metrics were drawn from the VLM-guided pool or an independent exhaustive set, nor (ii) an ablation of VLM-guided selection versus random or uncertainty-only baselines. Without these, the reported alignment and efficiency figures are consistent with selection bias toward VLM-human concordant pairs, undermining the separation claim.

    Authors: We agree that the manuscript should explicitly document these experimental controls to substantiate the claim that VLM priors are restricted to query selection. (i) The held-out human rankings used for the alignment metrics (Kendall tau and top-k overlap) were collected via an independent exhaustive pairwise comparison protocol on a disjoint set of images, with no VLM involvement in either sampling or judgment collection. (ii) While the current experiments compare against exhaustive and random baselines, we did not include a dedicated ablation isolating VLM-guided selection from pure uncertainty sampling. In the revision we will add both the clarification on the held-out set and the requested ablation (VLM-guided vs. random vs. uncertainty-only) using the same human annotation budget, reporting the resulting alignment and annotation counts. These additions will be placed in the Experiments section with corresponding updates to the abstract if the numbers change. revision: yes

Circularity Check

0 steps flagged

No circularity in derivation chain

full rationale

The paper describes an active ranking framework where human judgments serve as the primary supervision signal and VLM outputs are used solely for query selection. No equations, fitted parameters, or self-citations appear in the abstract or described method that reduce the reported alignment with human assessments or the 80% annotation reduction to a self-referential definition or input. Performance claims rest on external human evaluations across datasets, rendering the approach self-contained against independent benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review supplies no information on free parameters, background axioms, or new entities; ledger entries are therefore empty.

pith-pipeline@v0.9.1-grok · 5796 in / 1132 out tokens · 36538 ms · 2026-06-30T11:12:44.844599+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Forward citations

Cited by 2 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Surprise-Guided MergeSort: Budget-Efficient Human-in-the-Loop Ranking via Adaptive Comparison Scheduling

    cs.LG 2026-06 unverdicted novelty 6.0

    Surprise-Guided MergeSort combines a MergeSort scheduler, a composite VLM-based surprise scorer, and adaptive budget allocation to reduce human comparisons while improving Kendall tau by 6-12 points over Active Elo on...

  2. Surprise-Guided MergeSort: Budget-Efficient Human-in-the-Loop Ranking via Adaptive Comparison Scheduling

    cs.LG 2026-06 unverdicted novelty 6.0

    Surprise-Guided MergeSort uses a VLM-based composite surprise scorer to prioritize human comparisons inside a MergeSort scheduler, skipping up to 535 pairs per session and raising Kendall's τ by 6-12 points over Activ...

Reference graph

Works this paper leans on

49 extracted references · 4 canonical work pages · cited by 1 Pith paper

  1. [1]

    Eirikur Agustsson and Radu Timofte. 2017. Ntire 2017 challenge on single image super-resolution: Dataset and study. InProceedings of the IEEE conference on computer vision and pattern recognition workshops. 126–135

  2. [2]

    Rodrigo Benenson and Vittorio Ferrari. 2022. From colouring-in to pointillism: revisiting semantic segmentation supervision.arXiv preprint arXiv:2210.14142 (2022)

  3. [3]

    Johansson

    Herman Bergström, Emil Carlsson, Devdatt Dubhashi, and Fredrik D. Johansson

  4. [4]

    In Advances in Neural Information Processing Systems (NeurIPS)

    Active Preference Learning for Ordering Items In- and Out-of-Sample. In Advances in Neural Information Processing Systems (NeurIPS)

  5. [5]

    Erdem Biyik, Nima Anari, and Dorsa Sadigh. 2024. Batch active learning of reward functions from human preferences.ACM Transactions on Human-Robot Interaction13, 2 (2024), 1–27

  6. [6]

    Yochai Blau and Tomer Michaeli. 2018. The perception-distortion tradeoff. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 6228–6237

  7. [7]

    Ralph Allan Bradley and Milton E Terry. 1952. Rank analysis of incomplete block designs: I. The method of paired comparisons.Biometrika39, 3/4 (1952), 324–345

  8. [8]

    Paul F Christiano, Jan Leike, Tom Brown, Miljan Marber, Buck Shlegeris, and Dario Amodei. 2017. Deep reinforcement learning from human preferences. In Advances in Neural Information Processing Systems, Vol. 30

  9. [9]

    Mark E Glickman. 1995. The glicko system.Boston University16, 8 (1995), 9

  10. [10]

    Glickman

    Mark E. Glickman. 2012.Example of the Glicko-2 System. Technical Report. Boston University. 1–6 pages. https://www.glicko.net/glicko/glicko2.pdf Online technical report (commonly cited as 2012)

  11. [11]

    Andy Gray, Alma Rahat, Tom Crick, and Stephen Lindsay. 2024. A Bayesian active learning approach to comparative judgement within education assessment. Computers and Education: Artificial Intelligence6 (2024), 100245

  12. [12]

    Xuemei Hu, Weizhu Xu, Qingbin Fan, Tao Yue, Feng Yan, Yanqing Lu, and Ting Xu

  13. [13]

    Metasurface-based computational imaging: a review.Advanced Photonics 6, 1 (2024), 014002–014002

  14. [14]

    Quan Huynh-Thu and Mohammed Ghanbari. 2008. Scope of validity of PSNR in image/video quality assessment.Electronics Letters44, 13 (2008), 800–801

  15. [15]

    Andrey Ignatov, Nikolay Kobyshev, Radu Timofte, Kenneth Vanhoey, and Luc Van Gool. 2017. DSLR-quality photos on mobile devices with deep convolutional networks. InProceedings of the IEEE International Conference on Computer Vision. 3277–3285

  16. [16]

    Ikbeom Jang, Garrison Danley, Ken Chang, and Jayashree Kalpathy-Cramer

  17. [17]

    Decreasing annotation burden of pairwise comparisons with human-in- the-loop sorting: Application in medical image artifact rating.arXiv preprint arXiv:2202.04823(2022)

  18. [18]

    Junjie Ke, Qifei Wang, Yilin Wang, Peyman Milanfar, and Feng Yang. 2021. MUSIQ: Multi-scale image quality transformer. InProceedings of the IEEE/CVF International Conference on Computer Vision. 5148–5157

  19. [19]

    Mohammadreza Khorasaninejad and Federico Capasso. 2017. Metalenses: Versa- tile multifunctional photonic components.Science358, 6367 (2017), eaam8100

  20. [20]

    Joohoon Kim, Junhwa Seong, Wonjoong Kim, Gun-Yeal Lee, Seokwoo Kim, Hongyoon Kim, Seong-Won Moon, Dong Kyo Oh, Younghwan Yang, Jeonghoon Park, et al. 2023. Scalable manufacturing of high-index atomic layer–polymer hybrid metasurfaces for metaphotonics in the visible.Nature Materials22, 4 (2023), 474–481

  21. [21]

    Byeonghyeon Lee, Youbin Kim, Yongjae Jo, Hyunsu Kim, Hyemi Park, Yangkyu Kim, Debabrata Mandal, Praneeth Chakravarthula, Inki Kim, and Eunbyung Park

  22. [22]

    Aberration Correcting Vision Transformers for High-Fidelity Metalens Imaging.arXiv preprint arXiv:2412.04591(2024)

  23. [23]

    Chunyi Li, Yuan Tian, Xiaoyue Ling, Zicheng Zhang, Haodong Duan, Haoning Wu, Ziheng Jia, Xiaohong Liu, Xiongkuo Min, Guo Lu, et al. 2025. Image Quality Assessment: From Human to Machine Preference. InProceedings of the Computer Vision and Pattern Recognition Conference. 7570–7581

  24. [24]

    Suiyi Ling, Jing Li, Anne Flore Perrin, Zhi Li, Lukáš Krasula, and Patrick Le Callet

  25. [25]

    Strategy for boosting pair comparison and improving quality assessment accuracy.arXiv preprint arXiv:2010.00370(2020)

  26. [26]

    Haotian Liu, Chunyuan Li, Qingyang Wu, and Yong Jae Lee. 2023. Visual in- struction tuning.Advances in Neural Information Processing Systems36 (2023), 34892–34916

  27. [27]

    Lucas Maystre and Matthias Grossglauser. 2017. Just sort it! A simple and effective approach to active preference learning. InInternational Conference on Machine Learning (ICML). PMLR, 2344–2353

  28. [28]

    Aliaksei Mikhailiuk, María Pérez-Ortiz, and Rafal Mantiuk. 2018. Psychometric scaling of TID2013 dataset. In2018 Tenth International Conference on Quality of Multimedia Experience (QoMEX). IEEE, 1–6

  29. [29]

    completely blind

    Anish Mittal, Rajiv Soundararajan, and Alan C Bovik. 2012. Making a “completely blind” image quality analyzer.IEEE Signal Processing Letters20, 3 (2012), 209–212

  30. [30]

    Shima Mohammadi and João Ascenso. 2025. Uncertainty-driven Sampling for Efficient Pairwise Comparison Subjective Assessment.IEEE Transactions on Multimedia(2025)

  31. [31]

    Long Ouyang, Jeffrey Wu, Xu Jiang, Diogo Almeida, et al. 2022. Training lan- guage models to follow instructions with human feedback. InAdvances in Neural Information Processing Systems, Vol. 35

  32. [32]

    Yujin Park, Haejun Chung, and Ikbeom Jang. 2025. EZ-Sort: Efficient Pairwise Comparison via Zero-Shot CLIP-Based Pre-Ordering and Human-in-the-Loop Sorting. InProceedings of the 34th ACM International Conference on Information and Knowledge Management. 5120–5124

  33. [33]

    Judith Alice Redi, Tobias Hoßfeld, Pavel Korshunov, Filippo Mazza, Isabel Povoa, and Christian Keimel. 2013. Crowdsourcing-based multimedia subjective evalua- tions: a case study on image recognizability and aesthetic appeal. InProceedings of the 2nd ACM International Workshop on Crowdsourcing for Multimedia. 29–34

  34. [34]

    Eli Schwartz, Raja Giryes, and Alex M Bronstein. 2018. DeepISP: Toward learning an end-to-end image processing pipeline.IEEE Transactions on Image Processing 28, 2 (2018), 912–923

  35. [35]

    Joonhyuk Seo, Jaegang Jo, Joohoon Kim, Joonho Kang, Chanik Kang, Seong-Won Moon, Eunji Lee, Jehyeong Hong, Junsuk Rho, and Haejun Chung. 2024. Deep- learning-driven end-to-end metalens imaging.Advanced Photonics6, 6 (2024), 066002–066002

  36. [36]

    Xiangfei Sheng, Xiaofeng Pan, Zhichao Yang, Pengfei Chen, and Leida Li. 2026. Fine-grained image quality assessment for perceptual image restoration. InPro- ceedings of the AAAI Conference on Artificial Intelligence, Vol. 40. 8914–8922

  37. [37]

    Vincent Sitzmann, Steven Diamond, Yifan Peng, Xiong Dun, Stephen Boyd, Wolf- gang Heidrich, Felix Heide, and Gordon Wetzstein. 2018. End-to-end optimization KDD ’26, August 9–13, 2026, Jeju, Republic of Korea Yujin Park, Haejun Chung, and Ikbeom Jang of optics and image processing for achromatic extended depth of field and super- resolution imaging.ACM Tr...

  38. [38]

    Tianshu Song, Leida Li, Hancheng Zhu, and Jiansheng Qian. 2021. IE-IQA: Intel- ligibility enriched generalizable no-reference image quality assessment.Frontiers in Neuroscience15 (2021), 739138

  39. [39]

    Ethan Tseng, Shane Colburn, James Whitehead, Luocheng Huang, Seung-Hwan Baek, Arka Majumdar, and Felix Heide. 2021. Neural nano-optics for high-quality thin lens imaging.Nature Communications12, 1 (2021), 6493

  40. [40]

    Chan, and Chen Change Loy

    Jianyi Wang, Kelvin C.K. Chan, and Chen Change Loy. 2023. Exploring CLIP for Assessing the Look and Feel of Images. InProceedings of the AAAI Conference on Artificial Intelligence, Vol. 37. 2555–2563

  41. [41]

    Zhou Wang, Alan C Bovik, Hamid R Sheikh, and Eero P Simoncelli. 2004. Image quality assessment: from error visibility to structural similarity.IEEE Transactions on Image Processing13, 4 (2004), 600–612

  42. [42]

    Long Xu, Jia Li, Weisi Lin, Yun Zhang, Yongbing Zhang, and Yihua Yan. 2016. Pairwise comparison and rank learning for image quality assessment.Displays 44 (2016), 21–26

  43. [43]

    Miao Yang, Ge Yin, Yixiang Du, and Zhiqiang Wei. 2021. Pair comparison based progressive subjective quality ranking for underwater images.Signal Processing: Image Communication99 (2021), 116444

  44. [44]

    Sidi Yang, Tianhe Wu, Shuwei Shi, Shanshan Lao, Yuan Gong, Mingdeng Cao, Jia- hao Wang, and Yujiu Yang. 2022. MANIQA: Multi-Dimension Attention Network for No-Reference Image Quality Assessment. InCVPR Workshops (NTIRE)

  45. [45]

    Joel Yeo, N Duane Loh, Ramon Paniagua-Dominguez, and Arseniy I Kuznetsov

  46. [46]

    EigenCWD: a spatially varying deconvolution algorithm for single metalens imaging.Optics Express33, 13 (2025), 28481–28492

  47. [47]

    Richard Zhang, Phillip Isola, Alexei A Efros, Eli Shechtman, and Oliver Wang

  48. [48]

    InProceedings of the IEEE conference on computer vision and pattern recognition

    The unreasonable effectiveness of deep features as a perceptual metric. InProceedings of the IEEE conference on computer vision and pattern recognition. 586–595

  49. [49]

    bad prior

    Simone Zini and Marco Buzzelli. 2025. Bayesian nights: Optimizing night pho- tography rendering with Bayesian derivative-free methods.Pattern Recognition 161 (2025), 111314. A Ablation Study Details A.1 Hyperparameter Configuration Table 5 lists the exact values used in all main experiments. Sensi- tivity to 𝛼 and RD0 is higher than for other parameters b...