
arxiv: 2605.20645 · v1 · pith:HNX6OPNPnew · submitted 2026-05-20 · 💻 cs.CV

Seeing Through Fog: Towards Fog-Invariant Action Recognition

Pith reviewed 2026-05-21 06:00 UTC · model grok-4.3

classification 💻 cs.CV
keywords foggy action recognition · FogAct dataset · two-stream CLIP · fog-invariant features · adverse weather vision · paired video training · video action classification

The pith

FogNet trains on paired clean and foggy videos to extract shared motion and structure cues that remain visible despite fog degradation.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces FogAct, the first dataset of paired clean and foggy action videos captured across 10 scenes and 55 categories with nearly 10,000 clips. It then proposes FogNet, a two-stream CLIP architecture that learns fog-invariant representations by letting the clean-video stream guide feature extraction from the degraded stream. This setup targets the core problem that fog reduces contrast and hides semantic cues needed for reliable action classification. If the approach holds, action recognition systems could maintain performance in common real-world weather conditions without requiring clean references at test time.
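The contrast loss described above can be illustrated with the standard atmospheric scattering (Koschmieder) model, I = J·t + A·(1 − t) with t = exp(−β·d). This is a sketch under stated assumptions: the summary does not say the paper uses this model (FogAct is captured with a real stereo rig), and the function names, `beta`, `airlight`, and the pixel values are hypothetical choices for illustration only.

```python
import math

def apply_fog(intensity, depth_m, beta=0.05, airlight=0.9):
    """Attenuate a scene intensity with the scattering model (assumed, not from the paper).

    t = exp(-beta * depth) is the transmission; airlight is the fog brightness.
    """
    t = math.exp(-beta * depth_m)
    return intensity * t + airlight * (1.0 - t)

def michelson_contrast(lo, hi):
    """Contrast between a dark and a bright intensity, in [0, 1]."""
    return (hi - lo) / (hi + lo)

# Hypothetical dark and bright pixels on an actor 40 m from the camera.
dark, bright = 0.2, 0.8
clear_contrast = michelson_contrast(dark, bright)
foggy_contrast = michelson_contrast(apply_fog(dark, 40.0), apply_fog(bright, 40.0))

print(f"clear contrast: {clear_contrast:.3f}")
print(f"foggy contrast: {foggy_contrast:.3f}")
```

Under these toy values both pixels are pulled toward the airlight, so the foggy contrast is a small fraction of the clear one, which is the degradation the paired training is meant to work around.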

Core claim

FogAct supplies the first large-scale paired clean-foggy video benchmark for action recognition, and FogNet uses a two-stream CLIP model to discover fog-invariant semantic information by capturing the structural and motion cues that clean and foggy versions of the same action share.

What carries the argument

Two-stream CLIP model in which the clean-video stream guides the foggy-video stream to learn robust representations focused on shared structural and motion cues.
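The summary does not state how the clean stream guides the foggy stream. One plausible reading, given that the reference list includes contrastive predictive coding [26], is an InfoNCE-style alignment between paired embeddings. The sketch below assumes that form; `paired_info_nce`, the temperature, and the toy vectors are hypothetical and not taken from the paper.

```python
import math

def l2_normalize(v):
    """Scale a vector to unit length (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in v)) or 1.0
    return [x / norm for x in v]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def paired_info_nce(foggy_feats, clean_feats, temperature=0.07):
    """InfoNCE over a batch: each foggy clip should match its own clean clip.

    Row i of foggy_feats and row i of clean_feats are assumed to be a pair;
    the other clean clips in the batch act as negatives.
    """
    foggy = [l2_normalize(f) for f in foggy_feats]
    clean = [l2_normalize(c) for c in clean_feats]
    total = 0.0
    for i, f in enumerate(foggy):
        logits = [dot(f, c) / temperature for c in clean]
        m = max(logits)  # subtract the max for a stable log-sum-exp
        log_denom = m + math.log(sum(math.exp(z - m) for z in logits))
        total += log_denom - logits[i]
    return total / len(foggy)

# Toy embeddings standing in for the outputs of the two streams.
clean_batch = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
aligned_foggy = [[0.9, 0.1, 0.0], [0.1, 0.9, 0.0]]
swapped_foggy = [[0.1, 0.9, 0.0], [0.9, 0.1, 0.0]]

loss_aligned = paired_info_nce(aligned_foggy, clean_batch)
loss_swapped = paired_info_nce(swapped_foggy, clean_batch)
print(loss_aligned, loss_swapped)
```

The loss is low when each foggy embedding sits closest to its own clean partner and high when pairs are mismatched. In a real setup this term would be added to the label-supervision loss, and only the foggy stream would be kept at inference.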

If this is right

  • The model achieves competitive accuracy against state-of-the-art methods on FogAct and three standard action datasets.
  • Shared structural and motion cues between clean and foggy videos become the primary signal for classification.
  • Visibility degradation and contrast loss are mitigated without explicit fog removal or enhancement steps.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same paired-guidance idea could be tested on other degradations such as rain or low light by swapping the clean reference stream.
  • If the learned features prove truly invariant, they might improve downstream tasks like temporal localization or anomaly detection in foggy conditions.
  • Real-world deployment would require checking whether the 10 scenes in FogAct cover enough diversity of fog density and camera motion.

Load-bearing premise

Training guidance from paired clean videos will transfer to real-world foggy videos that arrive without any clean reference.

What would settle it

Measure action recognition accuracy on a set of unpaired real-world foggy videos and check whether performance falls below clean-video baselines by a large margin.
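The check above amounts to comparing top-1 accuracy on clean videos with top-1 accuracy on unpaired foggy videos. A minimal sketch follows; the labels and predictions are toy placeholders, not results from the paper, and `fog_gap` is a hypothetical helper name.

```python
def top1_accuracy(predictions, labels):
    """Fraction of clips whose predicted class equals the ground-truth class."""
    if len(predictions) != len(labels) or not labels:
        raise ValueError("predictions and labels must be non-empty and equal length")
    correct = sum(p == y for p, y in zip(predictions, labels))
    return correct / len(labels)

def fog_gap(clean_acc, foggy_acc):
    """Accuracy lost when moving from clean to foggy input (positive = worse in fog)."""
    return clean_acc - foggy_acc

# Toy data: the two sets are unpaired, so they carry separate label lists.
clean_labels = ["run", "jump", "wave", "sit"]
clean_preds = ["run", "jump", "wave", "sit"]
foggy_labels = ["wave", "run", "sit", "jump"]
foggy_preds = ["wave", "run", "sit", "run"]

clean_acc = top1_accuracy(clean_preds, clean_labels)
foggy_acc = top1_accuracy(foggy_preds, foggy_labels)
gap = fog_gap(clean_acc, foggy_acc)
print(f"clean={clean_acc:.2f} foggy={foggy_acc:.2f} gap={gap:.2f}")
```

What counts as "a large margin" is a judgment the summary leaves open; the threshold would need to be fixed before looking at the results.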

Figures

Figures reproduced from arXiv: 2605.20645 by Enqi Liu, Lingzhi Li, Liyuan Pan, Qing Li, Zhi Gao.

Figure 1: Comparison of foggy, defogged [5], and clean images in FogAct (top row), and corresponding feature distributions (bottom row). The SOTA defogging result still shows residual fog and halo artifacts. Features are extracted via CLIP and visualized using t-SNE. Our learned embeddings are more aligned with clean images, while defogged features show larger intra-class variation and blurred class boundaries. Ex…
Figure 2: Examples from our FogAct dataset, including four categories. Each category is captured under two fog conditions: light fog…
Figure 3: Overview of the Stereo Video Acquisition System. It…
Figure 4: Summary of FogAct statistics. (a) 91.8% of samples last 3–12 seconds, resembling a normal distribution. (b) Action durations…
Figure 5: An overview of our framework. In the joint training stage, we jointly learn label supervision and fog-invariant representations…
Figure 6: Heatmap comparisons with the baseline on FogAct.
Figure 7: Confusion matrix of our FogNet on the FogAct dataset.

Original abstract

Foggy conditions are commonly encountered in real-world applications; however, existing action recognition approaches typically assume favorable weather and high-quality video inputs. On foggy days, unpredictable visibility degradation and reduced contrast obstruct the extraction of semantic cues, posing significant challenges for current action recognition methods. In this paper, we mitigate the issues faced in action recognition under foggy conditions by employing two strategies. First, we present FogAct, the first benchmark dataset for foggy action recognition, consisting of paired clean and foggy videos captured with a stereo camera system. The dataset spans 10 scenes and 55 action categories, comprising nearly 10,000 video clips. Second, we propose FogNet, a two-stream CLIP model that discovers fog-invariant semantic information hidden behind the degraded videos. FogNet learns robust representations of foggy videos with guidance from clean videos, effectively capturing shared structural and motion cues between clean and foggy videos. Extensive experiments on FogAct and three other popular datasets demonstrate that our method achieves competitive performance compared with state-of-the-art (SOTA) approaches. Our FogAct and FogNet are given in our project page.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper introduces FogAct, the first benchmark dataset for foggy action recognition consisting of paired clean and foggy videos captured via stereo camera across 10 scenes and 55 action categories (nearly 10,000 clips). It proposes FogNet, a two-stream CLIP model that learns fog-invariant semantic representations for foggy videos by using guidance from paired clean videos to capture shared structural and motion cues. Extensive experiments on FogAct and three other popular datasets are reported to achieve competitive performance versus state-of-the-art methods.

Significance. If the central claim holds, the work would be significant for real-world action recognition under adverse weather, as fog-induced visibility degradation is a practical challenge. The paired FogAct dataset provides a valuable resource for training and benchmarking invariance methods. The two-stream guidance approach could offer a generalizable way to distill robust features without requiring clean references at inference, but only if the learned invariance transfers beyond the specific paired degradations in the dataset.

major comments (2)
  1. [Abstract] The claim of competitive performance on FogAct and three other datasets lacks any mention of baselines, error bars, data splits, or ablation studies. Without these, it is impossible to assess whether the results genuinely support fog-invariance rather than dataset-specific fitting or post-hoc choices.
  2. [Method and Experiments] The two-stream training objective relies on paired clean/foggy videos from the stereo-captured FogAct dataset (10 scenes). The paper must demonstrate that the learned fog-invariant features generalize to unpaired real-world foggy videos at test time (where no clean reference exists) and that the dataset's fog conditions cover the range of real-world density, lighting, and scene variations; otherwise the central generalization claim is unverified.
minor comments (2)
  1. [Method] Clarify the precise interaction between the two CLIP streams (e.g., how guidance is implemented in the loss or feature alignment) and whether the clean stream is used only at training or also at inference.
  2. [Experiments] Specify the fog characteristics and pairing status of the three other evaluated datasets to allow readers to judge the scope of the fog-invariance claim.
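Major comment 1 asks for error bars. A common way to report them is the mean and sample standard deviation of accuracy over runs with different random seeds. The sketch below shows that computation; the three accuracy values are hypothetical placeholders, not numbers from the paper.

```python
import statistics

def summarize_runs(accuracies):
    """Return (mean, sample standard deviation) over repeated training runs."""
    mean = statistics.mean(accuracies)
    # stdev needs at least two runs; a single run has no spread to report.
    std = statistics.stdev(accuracies) if len(accuracies) > 1 else 0.0
    return mean, std

# Hypothetical top-1 accuracies from three seeds (placeholders only).
runs = [0.70, 0.72, 0.74]
mean_acc, std_acc = summarize_runs(runs)
print(f"{100 * mean_acc:.1f} ± {100 * std_acc:.1f}")
```

Reporting the spread next to each baseline lets a reader judge whether a gap between methods exceeds seed-to-seed variation.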

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the detailed and constructive review. The comments highlight important aspects of clarity in the abstract and the need to strengthen evidence for generalization. We address each point below and indicate the revisions we will incorporate.

Point-by-point responses
  1. Referee: [Abstract] The claim of competitive performance on FogAct and three other datasets lacks any mention of baselines, error bars, data splits, or ablation studies. Without these, it is impossible to assess whether the results genuinely support fog-invariance rather than dataset-specific fitting or post-hoc choices.

    Authors: We agree that the abstract's brevity omits key experimental details. The Experiments section of the manuscript reports comparisons against multiple SOTA baselines on FogAct and the three additional datasets, includes error bars from repeated runs with different random seeds, specifies the train/test splits (including the 10-scene partitioning for FogAct), and presents ablation studies on the two-stream architecture and clean-video guidance loss. These results indicate that performance improvements arise from learning shared structural and motion cues rather than dataset-specific artifacts. In the revised version, we will expand the abstract to briefly note these elements, for example by adding a clause such as 'with ablations and multi-run evaluations confirming the invariance benefits.'

    revision: yes

  2. Referee: [Method and Experiments] The two-stream training objective relies on paired clean/foggy videos from the stereo-captured FogAct dataset (10 scenes). The paper must demonstrate that the learned fog-invariant features generalize to unpaired real-world foggy videos at test time (where no clean reference exists) and that the dataset's fog conditions cover the range of real-world density, lighting, and scene variations; otherwise the central generalization claim is unverified.

    Authors: FogNet is trained with paired clean-foggy videos from FogAct to learn the invariant representations through the guidance objective, but at inference time the model operates solely on the foggy stream; the clean reference is not used. We evaluate generalization by testing on three additional popular action recognition datasets that contain real-world foggy videos without paired clean counterparts, where the model achieves competitive accuracy. This supports transfer of the learned invariance beyond the training pairs. FogAct itself spans 10 scenes with controlled variations in fog density (light to dense), lighting conditions, and scene types (indoor/outdoor). In the revision we will add a dedicated paragraph in the Experiments or Discussion section with qualitative examples and quantitative fog-density statistics to explicitly compare FogAct's coverage against typical real-world fog variability.

    revision: partial

Circularity Check

0 steps flagged

No circularity: empirical training on paired data yields fog-invariant features without definitional reduction

full rationale

The paper introduces FogAct as a paired clean/foggy stereo dataset and trains FogNet (a two-stream CLIP model) to extract shared structural and motion cues via guidance from clean videos during training. At inference only the foggy stream is used. No equations, fitted parameters, or self-citations are shown that reduce the claimed fog-invariance to an input quantity by construction. The central claim rests on standard contrastive pre-training plus a two-stream objective whose outputs are validated empirically on FogAct and three external datasets; this is a self-contained empirical pipeline rather than a closed definitional loop.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The work rests on the standard assumption that pre-trained CLIP embeddings capture transferable semantic structure and that paired clean-foggy data can be used to supervise invariance; no new physical entities or ad-hoc constants are introduced.

axioms (1)
  • domain assumption: Pre-trained CLIP provides useful semantic features that can be aligned across clean and degraded views.
    The two-stream architecture directly uses CLIP as the backbone for both streams.

pith-pipeline@v0.9.0 · 5726 in / 1267 out tokens · 32159 ms · 2026-05-21T06:00:03.889566+00:00 · methodology


Reference graph

Works this paper leans on

50 extracted references · 50 canonical work pages · 4 internal anchors

  1. [1]

    Learning generalized segmentation for foggy-scenes by bi-directional wavelet guidance

    Qi Bi, Shaodi You, and Theo Gevers. Learning generalized segmentation for foggy-scenes by bi-directional wavelet guidance. In Proceedings of the AAAI Conference on Artificial Intelligence, pages 801–809, 2024.

  2. [2]

    Tsnet: deep network for human action recognition in hazy videos

    Sachin Chaudhary and Subrahmanyam Murala. Tsnet: deep network for human action recognition in hazy videos. In 2018 IEEE International Conference on Systems, Man, and Cybernetics (SMC), pages 3981–3986. IEEE, 2018.

  3. [3]

    Depth-based end-to-end deep network for human action recognition

    Sachin Chaudhary and Subrahmanyam Murala. Depth-based end-to-end deep network for human action recognition. IET Computer Vision, 13(1):15–22, 2019.

  4. [4]

    Ost: Refining text knowledge with optimal spatio-temporal descriptor for general video recognition

    Tongjia Chen, Hongshan Yu, Zhengeng Yang, Zechuan Li, Wei Sun, and Chen Chen. Ost: Refining text knowledge with optimal spatio-temporal descriptor for general video recognition. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 18888–18898,

  5. [5]

    Prompt-based test-time real image dehazing: a novel pipeline

    Zixuan Chen, Zewei He, Ziqian Lu, Xuecheng Sun, and Zhe-Ming Lu. Prompt-based test-time real image dehazing: a novel pipeline. In European Conference on Computer Vision, pages 432–449. Springer, 2024.

  6. [6]

    Dea-net: Single image dehazing based on detail-enhanced convolution and content-guided attention

    Zixuan Chen, Zewei He, and Zhe-Ming Lu. Dea-net: Single image dehazing based on detail-enhanced convolution and content-guided attention. IEEE Transactions on Image Processing, 2024.

  7. [7]

    A real haze video database for haze level evaluation

    Ying Chu, Guoxing Luo, and Fan Chen. A real haze video database for haze level evaluation. In 2021 13th International Conference on Quality of Multimedia Experience (QoMEX), pages 69–72. IEEE, 2021.

  8. [8]

    Ancuti C.O. et al. Nh-haze: An image dehazing benchmark with non-homogeneous hazy and haze-free images. In CVPR Workshops, 2020.

  9. [9]

    Rgb-event fusion for robust lane detection

    Jingtao Dong, Hao Zhuang, Hao Yang, and Liyuan Pan. Rgb-event fusion for robust lane detection. In BMVC, 2025.

  10. [10]

    Multi-task learning for video surveillance with limited data

    Keval Doshi and Yasin Yilmaz. Multi-task learning for video surveillance with limited data. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 3889–3899, 2022.

  11. [11]

    A new real-world video dataset for the comparison of defogging algorithms

    Alexandra Duminil, Jean-Philippe Tarel, and Roland Brémond. A new real-world video dataset for the comparison of defogging algorithms. arXiv preprint arXiv:2310.01020,

  12. [12]

    Surveillance face presentation attack detection challenge

    Hao Fang, Ajian Liu, Jun Wan, Sergio Escalera, Hugo Jair Escalante, and Zhen Lei. Surveillance face presentation attack detection challenge. In 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), pages 6361–6371.

  13. [13]

    Robust object detection in challenging weather conditions

    Himanshu Gupta, Oleksandr Kotlyar, Henrik Andreasson, and Achim J Lilienthal. Robust object detection in challenging weather conditions. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pages 7523–7532, 2024.

  14. [14]

    Populating 3d scenes by learning human-scene interaction

    Mohamed Hassan, Partha Ghosh, Joachim Tesch, Dimitrios Tzionas, and Michael J Black. Populating 3d scenes by learning human-scene interaction. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 14708–14718, 2021.

  15. [15]

    Planning-oriented autonomous driving

    Yihan Hu, Jiazhi Yang, Li Chen, Keyu Li, Chonghao Sima, Xizhou Zhu, Siqi Chai, Senyao Du, Tianwei Lin, Wenhai Wang, et al. Planning-oriented autonomous driving. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 17853–17862, 2023.

  16. [16]

    Hazespace2m: A dataset for haze aware single image dehazing

    Md Tanvir Islam, Nasir Rahim, Saeed Anwar, Muhammad Saqib, Sambit Bakshi, and Khan Muhammad. Hazespace2m: A dataset for haze aware single image dehazing. In Proceedings of the 32nd ACM International Conference on Multimedia, pages 9155–9164, 2024.

  17. [17]

    The Kinetics Human Action Video Dataset

    Will Kay, Joao Carreira, Karen Simonyan, Brian Zhang, Chloe Hillier, Sudheendra Vijayanarasimhan, Fabio Viola, Tim Green, Trevor Back, Paul Natsev, et al. The kinetics human action video dataset. arXiv preprint arXiv:1705.06950,

  18. [18]

    Leveraging temporal contextualization for video action recognition

    Minji Kim, Dongyoon Han, Taekyung Kim, and Bohyung Han. Leveraging temporal contextualization for video action recognition. In European Conference on Computer Vision, pages 74–91. Springer, 2024.

  19. [19]

    Hmdb: a large video database for human motion recognition

    Hildegard Kuehne, Hueihan Jhuang, Estíbaliz Garrote, Tomaso Poggio, and Thomas Serre. Hmdb: a large video database for human motion recognition. In 2011 International conference on computer vision, pages 2556–2563. IEEE, 2011.

  20. [20]

    Benchmarking single-image dehazing and beyond

    Boyi Li, Wenqi Ren, Dengpan Fu, Dacheng Tao, Dan Feng, Wenjun Zeng, and Zhangyang Wang. Benchmarking single-image dehazing and beyond. IEEE Transactions on Image Processing, 28(1):492–505, 2018.

  21. [21]

    Llama-vid: An image is worth 2 tokens in large language models

    Yanwei Li, Chengyao Wang, and Jiaya Jia. Llama-vid: An image is worth 2 tokens in large language models. In European Conference on Computer Vision, pages 323–340. Springer, 2024.

  22. [22]

    A lightweight multi-level relation network for few-shot action recognition

    Enqi Liu and Liyuan Pan. A lightweight multi-level relation network for few-shot action recognition. In 2024 IEEE International Conference on Multimedia and Expo (ICME), pages 1–6. IEEE, 2024.

  23. [23]

    Storyboard guided alignment for fine-grained video action recognition

    Enqi Liu, Liyuan Pan, Yan Yang, Yiran Zhong, Zhijing Wu, Xinxiao Wu, and Liu Liu. Storyboard guided alignment for fine-grained video action recognition. arXiv preprint arXiv:2410.14238, 2024.

  24. [24]

    Contrast restoration of weather degraded images

    Srinivasa G. Narasimhan and Shree K. Nayar. Contrast restoration of weather degraded images. IEEE transactions on pattern analysis and machine intelligence, 25(6):713–724, 2003.

  25. [25]

    Expanding language-image pretrained models for general video recognition

    Bolin Ni, Houwen Peng, Minghao Chen, Songyang Zhang, Gaofeng Meng, Jianlong Fu, Shiming Xiang, and Haibin Ling. Expanding language-image pretrained models for general video recognition. In European conference on computer vision, pages 1–18. Springer, 2022.

  26. [26]

    Representation Learning with Contrastive Predictive Coding

    Aaron van den Oord, Yazhe Li, and Oriol Vinyals. Representation learning with contrastive predictive coding. arXiv preprint arXiv:1807.03748, 2018.

  27. [27]

    DINOv2: Learning Robust Visual Features without Supervision

    Maxime Oquab, Timothée Darcet, Théo Moutakanni, Huy Vo, Marc Szafraniec, Vasil Khalidov, Pierre Fernandez, Daniel Haziza, Francisco Massa, Alaaeldin El-Nouby, et al. Dinov2: Learning robust visual features without supervision. arXiv preprint arXiv:2304.07193, 2023.

  28. [28]

    Bringing a blurry frame alive at high frame-rate with an event camera

    Liyuan Pan, Cedric Scheerlinck, Xin Yu, Richard Hartley, Miaomiao Liu, and Yuchao Dai. Bringing a blurry frame alive at high frame-rate with an event camera. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 6820–6829, 2019.

  29. [29]

    Learning transferable visual models from natural language supervision

    Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. Learning transferable visual models from natural language supervision. In International conference on machine learning, pages 8748–8763. PMLR, 2021.

  30. [30]

    Fine-tuned clip models are efficient video learners

    Hanoona Rasheed, Muhammad Uzair Khattak, Muhammad Maaz, Salman Khan, and Fahad Shahbaz Khan. Fine-tuned clip models are efficient video learners. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 6545–6554, 2023.

  31. [31]

    Model adaptation with synthetic and real data for semantic dense foggy scene understanding

    Christos Sakaridis, Dengxin Dai, Simon Hecker, and Luc Van Gool. Model adaptation with synthetic and real data for semantic dense foggy scene understanding. In Proceedings of the european conference on computer vision (ECCV), pages 687–704, 2018.

  32. [32]

    Semantic foggy scene understanding with synthetic data

    Christos Sakaridis, Dengxin Dai, and Luc Van Gool. Semantic foggy scene understanding with synthetic data. International Journal of Computer Vision, 126:973–992, 2018.

  33. [33]

    Acdc: The adverse conditions dataset with correspondences for robust semantic driving scene perception

    Christos Sakaridis, Haoran Wang, Ke Li, René Zurbrügg, Arpit Jadon, Wim Abbeloos, Daniel Olmeda Reino, Luc Van Gool, and Dengxin Dai. Acdc: The adverse conditions dataset with correspondences for robust semantic driving scene perception. arXiv e-prints, pages arXiv–2104,

  34. [34]

    A dataset of 101 human action classes from videos in the wild

    Khurram Soomro, Amir Roshan Zamir, and Mubarak Shah. A dataset of 101 human action classes from videos in the wild. Center for Research in Computer Vision, 2(11):1–7,

  35. [35]

    Action recognition in haze using an efficient fusion of spatial and temporal features

    Sri Girinadh Tanneru and Snehasis Mukherjee. Action recognition in haze using an efficient fusion of spatial and temporal features. In Computer Vision and Image Processing: 5th International Conference, CVIP 2020, Prayagraj, India, December 4-6, 2020, Revised Selected Papers, Part II 5, pages 29–38. Springer, 2021.

  36. [36]

    Videomae: Masked autoencoders are data-efficient learners for self-supervised video pre-training

    Zhan Tong, Yibing Song, Jue Wang, and Limin Wang. Videomae: Masked autoencoders are data-efficient learners for self-supervised video pre-training. Advances in neural information processing systems, 35:10078–10093, 2022.

  37. [37]

    Light-dehazenet: a novel lightweight cnn architecture for single image dehazing

    Hayat Ullah, Khan Muhammad, Muhammad Irfan, Saeed Anwar, Muhammad Sajjad, Ali Shariq Imran, and Victor Hugo C de Albuquerque. Light-dehazenet: a novel lightweight cnn architecture for single image dehazing. IEEE transactions on image processing, 30:8968–8982, 2021.

  38. [38]

    Actionclip: Adapting language-image pretrained models for video action recognition

    Mengmeng Wang, Jiazheng Xing, Jianbiao Mei, Yong Liu, and Yunliang Jiang. Actionclip: Adapting language-image pretrained models for video action recognition. IEEE Transactions on Neural Networks and Learning Systems, 2023.

  39. [39]

    A multimodal, multi-task adapting framework for video action recognition

    Mengmeng Wang, Jiazheng Xing, Boyuan Jiang, Jun Chen, Jianbiao Mei, Xingxing Zuo, Guang Dai, Jingdong Wang, and Yong Liu. A multimodal, multi-task adapting framework for video action recognition. In Proceedings of the AAAI Conference on Artificial Intelligence, pages 5517–5525, 2024.

  40. [40]

    Ucl-dehaze: Towards real-world image dehazing via unsupervised contrastive learning

    Yongzhen Wang, Xuefeng Yan, Fu Lee Wang, Haoran Xie, Wenhan Yang, Xiao-Ping Zhang, Jing Qin, and Mingqiang Wei. Ucl-dehaze: Towards real-world image dehazing via unsupervised contrastive learning. IEEE Transactions on Image Processing, 2024.

  41. [41]

    Vita-clip: Video and text adaptive clip via multimodal prompting

    Syed Talal Wasim, Muzammal Naseer, Salman Khan, Fahad Shahbaz Khan, and Mubarak Shah. Vita-clip: Video and text adaptive clip via multimodal prompting. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 23034–23044, 2023.

  42. [42]

    Q-Align: Teaching LMMs for Visual Scoring via Discrete Text-Defined Levels

    Haoning Wu, Zicheng Zhang, Weixia Zhang, Chaofeng Chen, Liang Liao, Chunyi Li, Yixuan Gao, Annan Wang, Erli Zhang, Wenxiu Sun, et al. Q-align: Teaching lmms for visual scoring via discrete text-defined levels. arXiv preprint arXiv:2312.17090, 2023.

  43. [43]

    What can simple arithmetic operations do for temporal modeling?

    Wenhao Wu, Yuxin Song, Zhun Sun, Jingdong Wang, Chang Xu, and Wanli Ouyang. What can simple arithmetic operations do for temporal modeling? In Proceedings of the IEEE/CVF international conference on computer vision, pages 13712–13722, 2023.

  44. [44]

    Bidirectional cross-modal knowledge exploration for video recognition with pre-trained vision-language models

    Wenhao Wu, Xiaohan Wang, Haipeng Luo, Jingdong Wang, Yi Yang, and Wanli Ouyang. Bidirectional cross-modal knowledge exploration for video recognition with pre-trained vision-language models. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 6620–6630, 2023.

  45. [45]

    Video dehazing via a multi-range temporal alignment network with physical prior

    Jiaqi Xu, Xiaowei Hu, Lei Zhu, Qi Dou, Jifeng Dai, Yu Qiao, and Pheng-Ann Heng. Video dehazing via a multi-range temporal alignment network with physical prior. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 18053–18062, 2023.

  46. [46]

    Language-driven all-in-one adverse weather removal

    Hao Yang, Liyuan Pan, Yan Yang, and Wei Liang. Language-driven all-in-one adverse weather removal. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 24902–24912, 2024.

  47. [47]

    Aim: Adapting image models for efficient video action recognition

    Taojiannan Yang, Yi Zhu, Yusheng Xie, Aston Zhang, Chen Chen, and Mu Li. Aim: Adapting image models for efficient video action recognition. arXiv preprint arXiv:2302.03024,

  48. [48]

    Dehazemamba: large multi-modal model guided single image dehazing via mamba

    Ruikun Zhang, Zhiyuan Yang, and Liyuan Pan. Dehazemamba: large multi-modal model guided single image dehazing via mamba. Visual Intelligence, 3(1):11, 2025.

  49. [49]

    Learning to restore hazy video: A new real-world dataset and a new method

    Xinyi Zhang, Hang Dong, Jinshan Pan, Chao Zhu, Ying Tai, Chengjie Wang, Jilin Li, Feiyue Huang, and Fei Wang. Learning to restore hazy video: A new real-world dataset and a new method. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 9239–9248, 2021.

  50. [50]

    Dehazing evaluation: Real-world benchmark datasets, criteria, and baselines

    Shiyu Zhao, Lin Zhang, Shuaiyi Huang, Ying Shen, and Shengjie Zhao. Dehazing evaluation: Real-world benchmark datasets, criteria, and baselines. IEEE Transactions on Image Processing, 29:6947–6962, 2020.