pith. machine review for the scientific record.

arxiv: 2604.12622 · v1 · submitted 2026-04-14 · 💻 cs.CV · cs.AI · cs.NI

Recognition: unknown

Efficient Semantic Image Communication for Traffic Monitoring at the Edge

Authors on Pith: no claims yet

Pith reviewed 2026-05-10 15:58 UTC · model grok-4.3

classification 💻 cs.CV · cs.AI · cs.NI
keywords semantic image communication · traffic monitoring · edge computing · data compression · generative reconstruction · diffusion models · semantic segmentation · inpainting

The pith

Two semantic pipelines cut transmitted traffic image data by 99 percent while preserving scene details for monitoring.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper proposes MMSD and SAMR as ways to send compact semantic representations of traffic scenes rather than full-resolution images from edge devices. MMSD decomposes the input into segmentation maps, edge maps, and text descriptions before a diffusion model rebuilds the image at the receiver. SAMR applies semantic masks to suppress non-critical regions, encodes the rest with JPEG, and uses inpainting for restoration. Both run lightweight operations on the edge and offload reconstruction, achieving the reported data reductions and competitive quality against baselines like SPIC, JPEG, and SQ-GAN. This matters for bandwidth-limited monitoring networks where only object presence, positions, and context need to be preserved.
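To make the decomposition concrete, the sketch below assembles an MMSD-style sender payload from the components the paper names: a SegFormer segmentation map, an edge map, and a BLIP caption. It is a minimal sketch under assumptions; the specific checkpoints, Canny thresholds, and PNG-plus-zlib packaging are illustrative, not the authors' implementation.

```python
# Hedged sketch of MMSD-style edge-side decomposition: segmentation map,
# edge map, and a text caption replace the raw image as the payload.
# Model IDs, Canny thresholds, and the packaging format are assumptions.
import io, json, zlib

import cv2
import numpy as np
import torch
from PIL import Image
from transformers import (BlipForConditionalGeneration, BlipProcessor,
                          SegformerForSemanticSegmentation, SegformerImageProcessor)

def mmsd_edge_payload(image_path: str) -> bytes:
    img = Image.open(image_path).convert("RGB")

    # Semantic segmentation (SegFormer fine-tuned on Cityscapes, both cited by the paper).
    seg_proc = SegformerImageProcessor.from_pretrained(
        "nvidia/segformer-b0-finetuned-cityscapes-1024-1024")
    seg_model = SegformerForSemanticSegmentation.from_pretrained(
        "nvidia/segformer-b0-finetuned-cityscapes-1024-1024")
    with torch.no_grad():
        logits = seg_model(**seg_proc(images=img, return_tensors="pt")).logits
    seg_map = logits.argmax(dim=1)[0].byte().numpy()  # per-pixel class ids

    # Edge map from a Canny detector (thresholds are illustrative).
    edges = cv2.Canny(cv2.cvtColor(np.array(img), cv2.COLOR_RGB2GRAY), 100, 200)

    # Short text description via BLIP captioning, as in the MMSD pipeline.
    cap_proc = BlipProcessor.from_pretrained("Salesforce/blip-image-captioning-base")
    cap_model = BlipForConditionalGeneration.from_pretrained(
        "Salesforce/blip-image-captioning-base")
    out = cap_model.generate(**cap_proc(images=img, return_tensors="pt"),
                             max_new_tokens=40)
    caption = cap_proc.decode(out[0], skip_special_tokens=True)

    # Package the three modalities: losslessly compressed maps plus a caption
    # stand in for the full-resolution pixels.
    def png_bytes(arr: np.ndarray) -> bytes:
        buf = io.BytesIO()
        Image.fromarray(arr).save(buf, format="PNG")
        return buf.getvalue()

    payload = json.dumps({
        "segmentation_png": png_bytes(seg_map).hex(),
        "edges_png": png_bytes(edges).hex(),
        "caption": caption,
    }).encode()
    return zlib.compress(payload)
```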

Core claim

The authors show that MMSD replaces raw pixels with multi-modal semantic maps and text for transmission and uses diffusion-based generation at the server to reconstruct scenes, while SAMR selectively masks and JPEG-encodes semantically important regions before inpainting the rest. On traffic monitoring data, these yield 99 percent and 99.1 percent average reductions in transmitted payload size, respectively, with MMSD producing smaller payloads than SPIC at comparable semantic fidelity and SAMR delivering a better quality-compression trade-off than JPEG or SQ-GAN under matched conditions. Edge processing on a Raspberry Pi 5 takes roughly 15 seconds for MMSD and 9 seconds for SAMR.
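SAMR's sender side, as described in the claim, suppresses non-critical regions before ordinary JPEG coding so the coder spends almost no bits on them. Below is a minimal sketch under assumptions: the Cityscapes class ids used as the region of interest, the constant fill value, and the quality factor are illustrative choices, not the paper's configuration.

```python
# Hedged sketch of SAMR-style sender processing: regions outside the assumed
# classes of interest are flattened before standard JPEG encoding; the receiver
# would JPEG-decode and then inpaint the suppressed regions.
import io

import numpy as np
from PIL import Image

# Cityscapes train ids for person/rider/vehicle classes (assumed RoI definition).
ROI_CLASS_IDS = {11, 12, 13, 14, 15, 16, 17, 18}

def samr_encode(img: Image.Image, seg_map: np.ndarray, quality: int = 5) -> bytes:
    """seg_map: per-pixel class ids at the image resolution (e.g., from SegFormer)."""
    roi = np.isin(seg_map, list(ROI_CLASS_IDS))
    masked = np.array(img).copy()
    masked[~roi] = 0                       # flatten non-critical regions to a constant
    buf = io.BytesIO()
    Image.fromarray(masked).save(buf, format="JPEG", quality=quality)
    return buf.getvalue()                  # compact payload; receiver inpaints ~roi
```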

What carries the argument

Asymmetric edge-to-server pipeline that extracts compact semantic representations (segmentation maps, edges, text for MMSD; importance masks for SAMR) at the sender and performs generative reconstruction (diffusion for MMSD, inpainting for SAMR) at the receiver.
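On the receiver side, the paper cites ControlNet-style conditioning for the diffusion reconstruction. A minimal sketch of that step follows, assuming a public Stable Diffusion checkpoint with a Canny-edge ControlNet; the actual MMSD model, conditioning modalities, and sampler settings are not specified here and may differ.

```python
# Hedged sketch of MMSD-style server-side reconstruction: a diffusion model
# conditioned on the transmitted edge map and text caption regenerates the scene.
# The checkpoints below are illustrative stand-ins, not the authors' models.
import torch
from diffusers import ControlNetModel, StableDiffusionControlNetPipeline
from PIL import Image

def reconstruct_from_semantics(edge_map: Image.Image, caption: str) -> Image.Image:
    controlnet = ControlNetModel.from_pretrained(
        "lllyasviel/sd-controlnet-canny", torch_dtype=torch.float16)
    pipe = StableDiffusionControlNetPipeline.from_pretrained(
        "runwayml/stable-diffusion-v1-5", controlnet=controlnet,
        torch_dtype=torch.float16).to("cuda")
    # The edge map (a PIL image) supplies structure; the caption supplies scene context.
    result = pipe(prompt=caption, image=edge_map,
                  num_inference_steps=30, guidance_scale=7.5)
    return result.images[0]
```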

If this is right

  • Traffic monitoring systems can run on edge hardware with 99 percent lower bandwidth use while retaining utility for presence and spatial analysis.
  • MMSD keeps original pixel data private by never transmitting it, only semantic abstractions.
  • SAMR improves on plain JPEG compression in visual quality at the same operating bit rates for traffic scenes.
  • Both methods fit within the compute budget of current single-board computers for per-frame processing.
  • The designs separate lightweight sender tasks from heavy receiver tasks, suiting asymmetric network topologies.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • The same decomposition approach could apply to other bandwidth-constrained visual tasks such as environmental or security monitoring where exact pixel values are unnecessary.
  • Stronger generative models would directly raise reconstruction fidelity without any increase in transmitted data volume.
  • The confidentiality property of MMSD may reduce regulatory barriers for deploying cameras in public spaces.
  • Integration with existing object detectors on the reconstructed outputs would provide an end-to-end metric for whether semantic fidelity is sufficient.

Load-bearing premise

The generative diffusion and inpainting models can rebuild traffic scenes from the transmitted semantic maps or masked images without omitting or fabricating details that affect monitoring accuracy such as vehicle or pedestrian locations.

What would settle it

A controlled test on a set of traffic images checking whether reconstructed outputs produce different vehicle counts, pedestrian detections, or spatial relations from the original images when fed to standard monitoring detectors.
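A hedged sketch of such a harness, using YOLOv8 (the detector named in the simulated rebuttal below) to compare per-class counts on original versus reconstructed frames; the monitored classes, confidence threshold, and count-agreement output are assumptions rather than the paper's protocol.

```python
# Hedged sketch of the settling experiment: run the same off-the-shelf detector
# on original and reconstructed frames and compare per-class counts.
from collections import Counter

from ultralytics import YOLO

MONITORED = {"car", "truck", "bus", "motorcycle", "bicycle", "person"}

def class_counts(model: YOLO, image_path: str, conf: float = 0.25) -> Counter:
    result = model(image_path, conf=conf, verbose=False)[0]
    names = result.names
    return Counter(names[int(c)] for c in result.boxes.cls.tolist()
                   if names[int(c)] in MONITORED)

def count_agreement(original_path: str, reconstructed_path: str) -> dict:
    model = YOLO("yolov8n.pt")  # small checkpoint for illustration; any detector would do
    orig = class_counts(model, original_path)
    recon = class_counts(model, reconstructed_path)
    # Report (original, reconstructed) counts per monitored class that appears in either.
    return {cls: (orig[cls], recon[cls]) for cls in sorted(MONITORED)
            if orig[cls] or recon[cls]}
```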

Figures

Figures reproduced from arXiv: 2604.12622 by Damir Assylbek, Dimitrios Zorbas, Marko Ristin, Nurmukhammed Aitymbetov.

Figure 1. Pipeline with semantic decomposition and generative reconstruction. The payload consists …
Figure 2. Original image (top); segmentation and edge maps (bottom): the segmentation map …
Figure 3. Example of the semantic-aware masking and reconstruction pipeline (the class-of-interest …
Figure 4. Example of Semantic-Aware Masking in SAMR: RoI (i.e., vehicles) are not masked at all …
Figure 5. Samples of images/frames of the two employed datasets.
Figure 6. Qualitative comparison between the original image and reconstructed ones (MMSD vs …
Figure 7. Qualitative comparison between the original image and MMSD’s reconstructions guided …
Figure 8. Qualitative comparison of MMSD reconstruction with (a) the full pipeline (BLIP) and …
Figure 9. Rate-distortion curves comparing SAMR against standard JPEG and SQ-GAN across …
Figure 10. Visual comparison at matched bitrates (∼0.08 BPP). From left to right: original image, JPEG at Q=5, SQ-GAN at full masking (mx = ms = 1.0), and SAMR (Config 0, Q=5). At comparable bitrates, JPEG exhibits severe blocking artifacts, SQ-GAN produces plausible but low-resolution output, while SAMR preserves structural detail and semantic content at the original resolution.
Original abstract

Many visual monitoring systems operate under strict communication constraints, where transmitting full-resolution images is impractical and often unnecessary. In such settings, visual data is often used for object presence, spatial relationships, and scene context rather than exact pixel fidelity. This paper presents two semantic image communication pipelines for traffic monitoring, MMSD and SAMR, that reduce transmission cost while preserving meaningful visual information. MMSD (Multi-Modal Semantic Decomposition) targets very high compression together with data confidentiality, since sensitive pixel content is not transmitted. It replaces the original image with compact semantic representations, namely segmentation maps, edge maps, and textual descriptions, and reconstructs the scene at the receiver using a diffusion-based generative model. SAMR (Semantic-Aware Masking Reconstruction) targets higher visual quality while maintaining strong compression. It selectively suppresses non-critical image regions according to semantic importance before standard JPEG encoding and restores the missing content at the receiver through generative inpainting. Both designs follow an asymmetric sender-receiver architecture, where lightweight processing is performed at the edge and computationally intensive reconstruction is offloaded to the server. On a Raspberry Pi 5, the edge-side processing time is about 15s for MMSD and 9s for SAMR. Experimental results show average transmitted-data reductions of 99% for MMSD and 99.1% for SAMR. In addition, MMSD achieves lower payload size than the recent SPIC baseline while preserving strong semantic consistency, whereas SAMR provides a better quality-compression trade-off than standard JPEG and SQ-GAN under comparable operating conditions.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 3 minor

Summary. The paper proposes two asymmetric semantic image communication pipelines for bandwidth-constrained traffic monitoring at the edge. MMSD transmits compact multi-modal semantic representations (segmentation maps, edge maps, textual descriptions) instead of pixels and reconstructs scenes at the receiver via a diffusion model, targeting 99% data reduction and privacy. SAMR applies semantic-aware masking to suppress non-critical regions before JPEG encoding and uses generative inpainting for restoration, targeting 99.1% reduction with higher visual quality. Both run lightweight processing on a Raspberry Pi 5 (15 s and 9 s respectively) and are compared experimentally to baselines including SPIC, JPEG, and SQ-GAN on payload size and quality-compression trade-offs, with claims of strong semantic consistency for the traffic use case.

Significance. If the reconstructions reliably preserve task-critical details, the work could enable practical high-compression semantic transmission for real-time edge monitoring systems, offering substantial bandwidth savings and privacy benefits over pixel-level methods. The asymmetric design aligns well with edge-server architectures. However, the significance is limited by the absence of downstream task validation, which is required to substantiate the application claims.

major comments (1)
  1. [Abstract] The central claims of 99% (MMSD) and 99.1% (SAMR) transmitted-data reductions 'while preserving strong semantic consistency' and providing a 'better quality-compression trade-off' for traffic monitoring rest on an untested assumption that the diffusion and inpainting models retain details needed for monitoring tasks (vehicle positions, lane markings, sign readability). No quantitative results are reported for downstream performance (detection, tracking, or scene understanding accuracy) on reconstructed versus original images, making the application-specific assertions unsupported by the presented evidence.
minor comments (3)
  1. The abstract provides no dataset details (name, size, resolution, traffic scenarios), exact metrics beyond averages, error bars, or statistical tests for the reported reductions and baseline comparisons.
  2. Edge processing times (15 s MMSD, 9 s SAMR on Raspberry Pi 5) are stated without reference to input image resolution, comparison to full-image transmission latency, or a breakdown of semantic extraction costs; a back-of-the-envelope latency comparison follows this list.
  3. The manuscript would benefit from explicit definitions or examples of the 'semantic importance' criterion used for masking in SAMR and the textual description format in MMSD.
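The back-of-the-envelope comparison referenced in minor comment 2: edge processing time plus small-payload transmission versus sending the raw image over a constrained uplink. The raw-image size and link rate below are assumed values for illustration only; the paper does not report a matched latency comparison.

```python
# Hedged arithmetic: "process on the edge, then send a tiny payload" versus
# "send the raw JPEG as-is" over a slow uplink. Sizes and rates are assumptions.
def end_to_end_latency(processing_s: float, payload_bytes: float, link_bps: float) -> float:
    """Edge processing time plus serialization time on the uplink."""
    return processing_s + payload_bytes * 8 / link_bps

raw_bytes = 500_000   # assumed ~500 kB full-resolution JPEG
link_bps = 50_000     # assumed 50 kbit/s constrained uplink

mmsd = end_to_end_latency(15.0, raw_bytes * 0.010, link_bps)  # 99% reduction, 15 s on Pi 5
samr = end_to_end_latency(9.0, raw_bytes * 0.009, link_bps)   # 99.1% reduction, 9 s on Pi 5
raw = end_to_end_latency(0.0, raw_bytes, link_bps)            # transmit the raw image

print(f"MMSD {mmsd:.1f} s, SAMR {samr:.1f} s, raw JPEG {raw:.1f} s")
# With these assumptions the raw transfer takes ~80 s, so the 9-15 s of edge
# processing is recovered; on a faster uplink the comparison can invert.
```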

Simulated Authors' Rebuttal

1 response · 0 unresolved

We thank the referee for the detailed review and constructive feedback on our manuscript. We address the major comment regarding the need for downstream task validation below. We agree that this is a valid point and will strengthen the application-specific claims accordingly.

Point-by-point responses
  1. Referee: [Abstract] The central claims of 99% (MMSD) and 99.1% (SAMR) transmitted-data reductions 'while preserving strong semantic consistency' and providing a 'better quality-compression trade-off' for traffic monitoring rest on an untested assumption that the diffusion and inpainting models retain details needed for monitoring tasks (vehicle positions, lane markings, sign readability). No quantitative results are reported for downstream performance (detection, tracking, or scene understanding accuracy) on reconstructed versus original images, making the application-specific assertions unsupported by the presented evidence.

    Authors: We acknowledge that the manuscript's claims about semantic consistency for traffic monitoring would be more robust with explicit downstream task metrics. The current evaluation focuses on payload size reductions, visual quality metrics (PSNR, SSIM, LPIPS), and comparisons to baselines like SPIC, JPEG, and SQ-GAN, along with qualitative examples demonstrating preservation of scene structure. However, no quantitative results on object detection, tracking, or scene understanding accuracy (e.g., mAP for vehicles or readability of signs) are included. In the revised manuscript, we will add a new section with experiments applying standard detectors (such as YOLOv8) to both original and reconstructed images, reporting accuracy on critical elements like vehicle positions, lane markings, and traffic signs. This will directly test whether the semantic representations and reconstructions retain task-critical information. revision: yes
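For reference, a minimal sketch of the three quality metrics the rebuttal cites (PSNR, SSIM, LPIPS), computed between an original frame and its reconstruction. The library choices and the AlexNet LPIPS backbone are assumptions; the paper's exact evaluation settings are not given in the abstract.

```python
# Hedged sketch of full-reference quality metrics between an original frame
# and its reconstruction: PSNR and SSIM via scikit-image, LPIPS via the lpips package.
import lpips
import numpy as np
import torch
from PIL import Image
from skimage.metrics import peak_signal_noise_ratio, structural_similarity

def quality_metrics(original_path: str, reconstructed_path: str) -> dict:
    orig_img = Image.open(original_path).convert("RGB")
    recon_img = Image.open(reconstructed_path).convert("RGB").resize(orig_img.size)
    a, b = np.array(orig_img), np.array(recon_img)

    psnr = peak_signal_noise_ratio(a, b, data_range=255)
    ssim = structural_similarity(a, b, channel_axis=2, data_range=255)

    # LPIPS expects NCHW tensors scaled to [-1, 1]; AlexNet backbone is an assumption.
    to_tensor = lambda x: torch.from_numpy(x).permute(2, 0, 1)[None].float() / 127.5 - 1.0
    lp = lpips.LPIPS(net="alex")(to_tensor(a), to_tensor(b)).item()
    return {"psnr": psnr, "ssim": ssim, "lpips": lp}
```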

Circularity Check

0 steps flagged

No circularity; claims are purely experimental with no derivations or self-referential reductions

Full rationale

The paper describes two pipelines (MMSD and SAMR) for semantic image communication and reports empirical performance metrics such as 99% and 99.1% transmitted-data reductions along with comparisons to external baselines (SPIC, JPEG, SQ-GAN). No equations, mathematical derivations, parameter-fitting steps, or self-citations appear in the abstract or described content. Performance assertions rest on direct measurements against independent reference methods rather than any internal construction that would reduce a 'prediction' to its own inputs. The architecture is asymmetric and offloads reconstruction, but this is presented as a design choice without tautological justification. The result is self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The approach rests on the domain assumption that generative models can faithfully reconstruct semantically useful images from partial representations. No free parameters or new entities are introduced in the abstract.

axioms (1)
  • domain assumption: Generative models (diffusion and inpainting) can reconstruct scenes from semantic maps, edge maps, text, or masked images while preserving traffic-monitoring utility.
    This assumption underpins the reconstruction step in both MMSD and SAMR and is required for the claimed data reductions to remain useful.

pith-pipeline@v0.9.0 · 5597 in / 1160 out tokens · 58253 ms · 2026-05-10T15:58:23.984629+00:00 · methodology


Reference graph

Works this paper leans on

44 extracted references · 4 canonical work pages

  1. [1] Omar El Ghati, Othmane Alaoui-Fdili, Othman Chahbouni, Nawal Alioua, and Walid Bouarifi. Artificial intelligence-powered visual internet of things in smart cities: A comprehensive review. Sustainable Computing: Informatics and Systems, 43:101004, 2024.
  2. [2] Kasra Aminiyeganeh, Rodolfo WL Coutinho, and Azzedine Boukerche. IoT video analytics for surveillance-based systems in smart cities. Computer Communications, 224:95–105, 2024.
  3. [3] Murat Bakirci. Internet of things-enabled unmanned aerial vehicles for real-time traffic mobility analysis in smart cities. Computers and Electrical Engineering, 123:110313, 2025.
  4. [4] Amin Kargar, Dimitrios Zorbas, Michael Gaffney, Brendan O’Flynn, and Salvatore Tedesco. Tiny deep learning model for insect segmentation and counting on resource-constrained devices. Computers and Electronics in Agriculture, 236:110378, 2025.
  5. [5] Kholidiyah Masykuroh, Hendrawan, Eueung Mulyana, and Farhan Krishna. A survey on video compression optimization techniques for accuracy enhancement in video analytics applications (VAPs). IEEE Access, 13:75822–75846, 2025.
  6. [7] Alessandro Giuliano, S Andrew Gadsden, Waleed Hilal, and John Yawney. Convolutional variational autoencoders for secure lossy image compression in remote sensing. In Sensors and Systems for Space Applications XVII, volume 13062, pages 124–134. SPIE, 2024.
  7. [8] Lei Lu, Yanyue Xie, Wei Jiang, Wei Wang, Xue Lin, and Yanzhi Wang. HybridFlow: Infusing continuity into masked codebook for extreme low-bitrate image compression. In Proceedings of the 32nd ACM International Conference on Multimedia, pages 3010–3018, 2024.
  8. [9] Xuewen Luo, Hsiao-Hwa Chen, and Qing Guo. Semantic communications: Overview, open issues, and future research directions. IEEE Wireless Communications, 29(1):210–219, 2022.
  9. [10] W. Jiang, Y. Zhai, H. Li, and R. Wang. TLIC: Learned image compression with RoI-weighted distortion and bit allocation. arXiv preprint arXiv:2401.08154, 2024.
  10. [11] Haotian Wang, Zijian Cao, and Hua Zhang. Semantic-aware image compression architecture for semantic communication. In 2024 IEEE 34th International Workshop on Machine Learning for Signal Processing (MLSP), pages 1–6, London, United Kingdom, September 2024. IEEE.
  11. [12] V. Shanmugam and B. U. Maheswari. A semantic-aware compression strategy for intelligent vehicles. Procedia Computer Science, 258:2544–2553, 2025.
  12. [13] Jianhui Chang, Jian Zhang, Jiguo Li, Shiqi Wang, Qi Mao, Chuanmin Jia, Siwei Ma, and Wen Gao. Semantic-aware visual decomposition for image coding. International Journal of Computer Vision, 131(9):2333–2355, September 2023.
  13. [14] X. Gu, Y. Xu, and K. Zhu. Semantic importance-based deep image compression using a generative approach. In Proc. International Conference on Multimedia Modeling, pages 70–81. Springer, 2024.
  14. [15] Dexin Yu, Zimin Yuan, Xincheng Wu, Yubo Wang, and Xian Liu. Real-time monitoring method for traffic surveillance scenarios based on enhanced YOLOv7. Applied Sciences, 14(16):7383, 2024.
  15. [16] Cheng Jian, Chenxi Lin, Xiaojian Hu, and Jian Lu. Selective scale-aware network for traffic density estimation and congestion detection in ITS. Sensors, 25(3):766, 2025.
  16. [17] Maria Pomoni. Smart crosswalks for advancing road safety in urban roads: Conceptualization and evidence-based insights from Greek incident records. Future Transportation, 5(4):180, 2025.
  17. [18] Qiqing Wang and Kaidi Yang. Privacy-preserving data fusion for traffic state estimation: A vertical federated learning approach. Transportation Research Part C: Emerging Technologies, 168:104743, 2024.
  18. [19] Nicolás Hernández-Díaz, Yersica C. Peñaloza, Y. Yuliana Rios, Juan Carlos Martinez-Santos, and Edwin Puertas. A computer vision system for detecting motorcycle violations in pedestrian zones. Multimedia Tools and Applications, 84:12659–12682, 2025.
  19. [20] Dorothee Stiller, Michael Wurm, Jeroen Staab, Thomas Stark, Georg Starz, Jürgen Rauh, Stefan Dech, and Hannes Taubenböck. Open webcam data for traffic monitoring: YOLOv8 detection of road users before and during COVID-19. Transportation Research Interdisciplinary Perspectives, 36:101774, 2026.
  20. [21] Ahmed Mohamed and Mohamed M. Ahmed. Multi-camera machine vision for detecting and analyzing vehicle–pedestrian conflicts at signalized intersections: Deep neural-based pose estimation algorithms. Applied Sciences, 15(19):10413, 2025.
  21. [22] Shibani Santurkar, David Budden, and Nir Shavit. Generative compression. In Picture Coding Symposium (PCS), pages 258–262. IEEE, 2018.
  22. [23] Bolin Chen, Shanzhi Yin, Peilin Chen, Shiqi Wang, and Yan Ye. Generative visual compression: A review. In IEEE International Conference on Image Processing (ICIP), pages 3709–3715. IEEE, 2024.
  23. [24] Xi Gu, Yuanyuan Xu, and Kun Zhu. Semantic importance-based deep image compression using a generative approach. In International Conference on Multimedia Modeling, pages 70–81. Springer, 2024.
  24. [25] Hans Thisanke, Chamli Deshan, Kavindu Chamith, Sachith Seneviratne, Rajith Vidanaarachchi, and Damayanthi Herath. Semantic segmentation using vision transformers: A survey. Engineering Applications of Artificial Intelligence, 126, 2023.
  25. [26] Kaiming He, Xinlei Chen, Saining Xie, Yanghao Li, Piotr Dollár, and Ross Girshick. Masked autoencoders are scalable vision learners. In IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 16000–16009, 2022.
  26. [27] Francesco Pezone. Semantic communication based on generative AI: a new approach to image compression and edge optimization. arXiv preprint arXiv:2502.01675, 2025.
  27. [28] Francesco Pezone, Sergio Barbarossa, and Giuseppe Caire. SQ-GAN: Semantic image communications using masked vector quantization. IEEE Transactions on Cognitive Communications and Networking, 12:3363–3377, 2025.
  28. [29] Faming Fang, Juncheng Li, and Tieyong Zeng. Soft-edge assisted network for single image super-resolution. IEEE Transactions on Image Processing, 29, 2020.
  29. [30] Cheng-Lin Wu, Hyomin Choi, and Ivan V Bajić. Semantics-guided generative image compression. arXiv preprint arXiv:2505.24015, 2025.
  30. [31] OpenAI. GPT-4 research. https://openai.com/index/gpt-4-research/, 2023. Accessed: Feb. 11, 2026.
  31. [32] Naifu Xue, Zhaoyang Jia, Jiahao Li, Bin Li, Yuan Zhang, and Yan Lu. One-step diffusion-based image compression with semantic distillation, 2025.
  32. [33] Enze Xie, Wenhai Wang, Zhiding Yu, Anima Anandkumar, José Manuel Álvarez, and Ping Luo. SegFormer: Simple and efficient design for semantic segmentation with transformers. arXiv preprint arXiv:2105.15203, 2021.
  33. [34] Junnan Li, Dongxu Li, Caiming Xiong, and Steven Hoi. BLIP: Bootstrapping language-image pre-training for unified vision-language understanding and generation, 2022.
  34. [35] Gemma Team et al. Gemma 3 technical report, 2025.
  35. [36] Marius Cordts, Mohamed Omran, Sebastian Ramos, Timo Rehfeld, Markus Enzweiler, Rodrigo Benenson, Uwe Franke, Stefan Roth, and Bernt Schiele. The Cityscapes dataset for semantic urban scene understanding. IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 3213–3223, 2016.
  36. [37] Jakub Sochor, Roman Juránek, Jakub Špaňhel, Lukáš Maršík, Adam Široký, Adam Herout, and Pavel Zemčík. Comprehensive data set for automatic single camera visual speed measurement. IEEE Transactions on Intelligent Transportation Systems, 20(5):1633–1643, 2018.
  37. [38] Justin Johnson, Alexandre Alahi, and Li Fei-Fei. Perceptual losses for real-time style transfer and super-resolution, 2016.
  38. [39] Lvmin Zhang, Anyi Rao, and Maneesh Agrawala. Adding conditional control to text-to-image diffusion models. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), pages 3836–3847, October 2023.
  39. [40] Gregory K Wallace. The JPEG still picture compression standard. IEEE Transactions on Consumer Electronics, 38(1):xviii–xxxiv, 1992.
  40. [41] Peiyi Wang, Lei Li, Liang Chen, Zefan Cai, Dawei Zhu, Binghuai Lin, Yunbo Cao, Lingpeng Kong, Qi Liu, Tianyu Liu, et al. Large language models are not fair evaluators. In 62nd Annual Meeting of the Association for Computational Linguistics, pages 9440–9450, 2024.
  41. [42] Xiaofeng Han, Shunpeng Chen, Zenghuang Fu, Zhe Feng, Lue Fan, Dong An, Changwei Wang, Li Guo, Weiliang Meng, Xiaopeng Zhang, Rongtao Xu, and Shibiao Xu. Multimodal fusion and vision–language models: A survey for robot vision. Information Fusion, 126:103652, February 2026.
  42. [43] Ahmed Sharshar, Latif U. Khan, Waseem Ullah, and Mohsen Guizani. Vision-language models for edge networks: A comprehensive survey. IEEE Internet of Things Journal, 12(16):32701–32724, 2025.
  43. [44] Zhou Wang, Eero P Simoncelli, and Alan C Bovik. Multiscale structural similarity for image quality assessment. In The Thirty-Seventh Asilomar Conference on Signals, Systems & Computers, 2003, volume 2, pages 1398–1402. IEEE, 2003.
  44. [45] Richard Zhang, Phillip Isola, Alexei A. Efros, Eli Shechtman, and Oliver Wang. The unreasonable effectiveness of deep features as a perceptual metric. IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 586–595, 2018.