Seeing Through Fog: Towards Fog-Invariant Action Recognition
Pith reviewed 2026-05-21 06:00 UTC · model grok-4.3
The pith
FogNet trains on paired clean and foggy videos to extract shared motion and structure cues that remain visible despite fog degradation.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
FogAct supplies the first large-scale paired clean-foggy video benchmark for action recognition, and FogNet uses a two-stream CLIP model to discover fog-invariant semantic information by capturing the structural and motion cues that clean and foggy versions of the same action share.
What carries the argument
Two-stream CLIP model in which the clean-video stream guides the foggy-video stream to learn robust representations focused on shared structural and motion cues.
If this is right
- The model achieves competitive accuracy against state-of-the-art methods on FogAct and three standard action datasets.
- Shared structural and motion cues between clean and foggy videos become the primary signal for classification.
- Visibility degradation and contrast loss are mitigated without explicit fog removal or enhancement steps.
Where Pith is reading between the lines
- The same paired-guidance idea could be tested on other degradations such as rain or low light by swapping the clean reference stream.
- If the learned features prove truly invariant, they might improve downstream tasks like temporal localization or anomaly detection in foggy conditions.
- Real-world deployment would require checking whether the 10 scenes in FogAct cover enough diversity of fog density and camera motion.
Load-bearing premise
Training guidance from paired clean videos will transfer to real-world foggy videos that arrive without any clean reference.
What would settle it
Measure action recognition accuracy on a set of unpaired real-world foggy videos and check whether performance falls below clean-video baselines by a large margin.
Figures
read the original abstract
Foggy conditions are commonly encountered in real-world applications; however, existing action recognition approaches typically assume favorable weather and high-quality video inputs. On foggy days, unpredictable visibility degradation and reduced contrast obstruct the extraction of semantic cues, posing significant challenges for current action recognition methods. In this paper, we mitigate the issues faced in action recognition under foggy conditions by employing two strategies. First, we present FogAct, the first benchmark dataset for foggy action recognition, consisting of paired clean and foggy videos captured with a stereo camera system. The dataset spans 10 scenes and 55 action categories, comprising nearly 10,000 video clips. Second, we propose FogNet, a two-stream CLIP model that discovers fog-invariant semantic information hidden behind the degraded videos. FogNet learns robust representations of foggy videos with guidance from clean videos, effectively capturing shared structural and motion cues between clean and foggy videos. Extensive experiments on FogAct and three other popular datasets demonstrate that our method achieves competitive performance compared with state-of-the-art (SOTA) approaches. Our FogAct and FogNet are given in our project page.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces FogAct, the first benchmark dataset for foggy action recognition consisting of paired clean and foggy videos captured via stereo camera across 10 scenes and 55 action categories (nearly 10,000 clips). It proposes FogNet, a two-stream CLIP model that learns fog-invariant semantic representations for foggy videos by using guidance from paired clean videos to capture shared structural and motion cues. Extensive experiments on FogAct and three other popular datasets are reported to achieve competitive performance versus state-of-the-art methods.
Significance. If the central claim holds, the work would be significant for real-world action recognition under adverse weather, as fog-induced visibility degradation is a practical challenge. The paired FogAct dataset provides a valuable resource for training and benchmarking invariance methods. The two-stream guidance approach could offer a generalizable way to distill robust features without requiring clean references at inference, but only if the learned invariance transfers beyond the specific paired degradations in the dataset.
major comments (2)
- [Abstract] Abstract: The claim of competitive performance on FogAct and three other datasets lacks any mention of baselines, error bars, data splits, or ablation studies. Without these, it is impossible to assess whether the results genuinely support fog-invariance rather than dataset-specific fitting or post-hoc choices.
- [Method and Experiments] Method and Experiments sections: The two-stream training objective relies on paired clean/foggy videos from the stereo-captured FogAct dataset (10 scenes). The paper must demonstrate that the learned fog-invariant features generalize to unpaired real-world foggy videos at test time (where no clean reference exists) and that the dataset's fog conditions cover the range of real-world density, lighting, and scene variations; otherwise the central generalization claim is unverified.
minor comments (2)
- [Method] Clarify the precise interaction between the two CLIP streams (e.g., how guidance is implemented in the loss or feature alignment) and whether the clean stream is used only at training or also at inference.
- [Experiments] Specify the fog characteristics and pairing status of the three other evaluated datasets to allow readers to judge the scope of the fog-invariance claim.
Simulated Author's Rebuttal
We thank the referee for the detailed and constructive review. The comments highlight important aspects of clarity in the abstract and the need to strengthen evidence for generalization. We address each point below and indicate the revisions we will incorporate.
read point-by-point responses
-
Referee: [Abstract] Abstract: The claim of competitive performance on FogAct and three other datasets lacks any mention of baselines, error bars, data splits, or ablation studies. Without these, it is impossible to assess whether the results genuinely support fog-invariance rather than dataset-specific fitting or post-hoc choices.
Authors: We agree that the abstract's brevity omits key experimental details. The Experiments section of the manuscript reports comparisons against multiple SOTA baselines on FogAct and the three additional datasets, includes error bars from repeated runs with different random seeds, specifies the train/test splits (including the 10-scene partitioning for FogAct), and presents ablation studies on the two-stream architecture and clean-video guidance loss. These results indicate that performance improvements arise from learning shared structural and motion cues rather than dataset-specific artifacts. In the revised version, we will expand the abstract to briefly note these elements, for example by adding a clause such as 'with ablations and multi-run evaluations confirming the invariance benefits.' revision: yes
-
Referee: [Method and Experiments] Method and Experiments sections: The two-stream training objective relies on paired clean/foggy videos from the stereo-captured FogAct dataset (10 scenes). The paper must demonstrate that the learned fog-invariant features generalize to unpaired real-world foggy videos at test time (where no clean reference exists) and that the dataset's fog conditions cover the range of real-world density, lighting, and scene variations; otherwise the central generalization claim is unverified.
Authors: FogNet is trained with paired clean-foggy videos from FogAct to learn the invariant representations through the guidance objective, but at inference time the model operates solely on the foggy stream; the clean reference is not used. We evaluate generalization by testing on three additional popular action recognition datasets that contain real-world foggy videos without paired clean counterparts, where the model achieves competitive accuracy. This supports transfer of the learned invariance beyond the training pairs. FogAct itself spans 10 scenes with controlled variations in fog density (light to dense), lighting conditions, and scene types (indoor/outdoor). In the revision we will add a dedicated paragraph in the Experiments or Discussion section with qualitative examples and quantitative fog-density statistics to explicitly compare FogAct's coverage against typical real-world fog variability. revision: partial
Circularity Check
No circularity: empirical training on paired data yields fog-invariant features without definitional reduction
full rationale
The paper introduces FogAct as a paired clean/foggy stereo dataset and trains FogNet (a two-stream CLIP model) to extract shared structural and motion cues via guidance from clean videos during training. At inference only the foggy stream is used. No equations, fitted parameters, or self-citations are shown that reduce the claimed fog-invariance to an input quantity by construction. The central claim rests on standard contrastive pre-training plus a two-stream objective whose outputs are validated empirically on FogAct and three external datasets; this is a self-contained empirical pipeline rather than a closed definitional loop.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption Pre-trained CLIP provides useful semantic features that can be aligned across clean and degraded views
Reference graph
Works this paper leans on
-
[1]
Learning general- ized segmentation for foggy-scenes by bi-directional wavelet guidance
Qi Bi, Shaodi You, and Theo Gevers. Learning general- ized segmentation for foggy-scenes by bi-directional wavelet guidance. InProceedings of the AAAI Conference on Artifi- cial Intelligence, pages 801–809, 2024. 2
work page 2024
-
[2]
Tsnet: deep network for human action recognition in hazy videos
Sachin Chaudhary and Subrahmanyam Murala. Tsnet: deep network for human action recognition in hazy videos. In 2018 IEEE International Conference on Systems, Man, and Cybernetics (SMC), pages 3981–3986. IEEE, 2018. 1, 2
work page 2018
-
[3]
Sachin Chaudhary and Subrahmanyam Murala. Depth-based end-to-end deep network for human action recognition.IET Computer Vision, 13(1):15–22, 2019. 2
work page 2019
-
[4]
Ost: Refining text knowledge with optimal spatio-temporal descriptor for general video recog- nition
Tongjia Chen, Hongshan Yu, Zhengeng Yang, Zechuan Li, Wei Sun, and Chen Chen. Ost: Refining text knowledge with optimal spatio-temporal descriptor for general video recog- nition. InProceedings of the IEEE/CVF Conference on Com- puter Vision and Pattern Recognition, pages 18888–18898,
-
[5]
Prompt-based test-time real image dehazing: a novel pipeline
Zixuan Chen, Zewei He, Ziqian Lu, Xuecheng Sun, and Zhe- Ming Lu. Prompt-based test-time real image dehazing: a novel pipeline. InEuropean Conference on Computer Vision, pages 432–449. Springer, 2024. 1, 6, 7
work page 2024
-
[6]
Zixuan Chen, Zewei He, and Zhe-Ming Lu. Dea-net: Single image dehazing based on detail-enhanced convolution and content-guided attention.IEEE Transactions on Image Pro- cessing, 2024. 3
work page 2024
-
[7]
A real haze video database for haze level evaluation
Ying Chu, Guoxing Luo, and Fan Chen. A real haze video database for haze level evaluation. In2021 13th Inter- national Conference on Quality of Multimedia Experience (QoMEX), pages 69–72. IEEE, 2021. 3, 4
work page 2021
-
[8]
Ancuti C.O. et al. Nh-haze: An image dehazing benchmark with non-homogeneous hazy and haze-free images. InCVPR Workshops, 2020. 3
work page 2020
-
[9]
Rgb- event fusion for robust lane detection
Jingtao Dong, Hao Zhuang, Hao Yang, and Liyuan Pan. Rgb- event fusion for robust lane detection. InBMVC, 2025. 1
work page 2025
-
[10]
Multi-task learning for video surveillance with limited data
Keval Doshi and Yasin Yilmaz. Multi-task learning for video surveillance with limited data. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 3889–3899, 2022. 1
work page 2022
-
[11]
Alexandra Duminil, Jean-Philippe Tarel, and Roland Br´emond. A new real-world video dataset for the comparison of defogging algorithms.arXiv preprint arXiv:2310.01020,
-
[12]
Surveillance face presentation attack detection challenge
Hao Fang, Ajian Liu, Jun Wan, Sergio Escalera, Hugo Jair Escalante, and Zhen Lei. Surveillance face presentation attack detection challenge. in 2023 ieee. InCVF Confer- ence on Computer Vision and Pattern Recognition Work- shops (CVPRW), pages 6361–6371. 1
work page 2023
-
[13]
Robust object detection in challeng- ing weather conditions
Himanshu Gupta, Oleksandr Kotlyar, Henrik Andreasson, and Achim J Lilienthal. Robust object detection in challeng- ing weather conditions. InProceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pages 7523–7532, 2024. 2
work page 2024
-
[14]
Populating 3d scenes by learning human-scene interaction
Mohamed Hassan, Partha Ghosh, Joachim Tesch, Dim- itrios Tzionas, and Michael J Black. Populating 3d scenes by learning human-scene interaction. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 14708–14718, 2021. 1
work page 2021
-
[15]
Planning-oriented autonomous driving
Yihan Hu, Jiazhi Yang, Li Chen, Keyu Li, Chonghao Sima, Xizhou Zhu, Siqi Chai, Senyao Du, Tianwei Lin, Wenhai Wang, et al. Planning-oriented autonomous driving. InPro- ceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 17853–17862, 2023. 1
work page 2023
-
[16]
Hazespace2m: A dataset for haze aware single image dehazing
Md Tanvir Islam, Nasir Rahim, Saeed Anwar, Muhammad Saqib, Sambit Bakshi, and Khan Muhammad. Hazespace2m: A dataset for haze aware single image dehazing. InProceed- ings of the 32nd ACM International Conference on Multime- dia, pages 9155–9164, 2024. 3
work page 2024
-
[17]
The Kinetics Human Action Video Dataset
Will Kay, Joao Carreira, Karen Simonyan, Brian Zhang, Chloe Hillier, Sudheendra Vijayanarasimhan, Fabio Viola, Tim Green, Trevor Back, Paul Natsev, et al. The kinetics hu- man action video dataset.arXiv preprint arXiv:1705.06950,
work page internal anchor Pith review Pith/arXiv arXiv
-
[18]
Leveraging temporal contextualization for video action recognition
Minji Kim, Dongyoon Han, Taekyung Kim, and Bohyung Han. Leveraging temporal contextualization for video action recognition. InEuropean Conference on Computer Vision, pages 74–91. Springer, 2024. 2, 6, 7
work page 2024
-
[19]
Hmdb: a large video database for human motion recognition
Hildegard Kuehne, Hueihan Jhuang, Est ´ıbaliz Garrote, Tomaso Poggio, and Thomas Serre. Hmdb: a large video database for human motion recognition. In2011 Inter- national conference on computer vision, pages 2556–2563. IEEE, 2011. 2, 6
work page 2011
-
[20]
Boyi Li, Wenqi Ren, Dengpan Fu, Dacheng Tao, Dan Feng, Wenjun Zeng, and Zhangyang Wang. Benchmarking single- image dehazing and beyond.IEEE Transactions on Image Processing, 28(1):492–505, 2018. 3, 4
work page 2018
-
[21]
Llama-vid: An image is worth 2 tokens in large language models
Yanwei Li, Chengyao Wang, and Jiaya Jia. Llama-vid: An image is worth 2 tokens in large language models. In European Conference on Computer Vision, pages 323–340. Springer, 2024. 6, 7
work page 2024
-
[22]
A lightweight multi-level rela- tion network for few-shot action recognition
Enqi Liu and Liyuan Pan. A lightweight multi-level rela- tion network for few-shot action recognition. In2024 IEEE International Conference on Multimedia and Expo (ICME), pages 1–6. IEEE, 2024. 2
work page 2024
-
[23]
Enqi Liu, Liyuan Pan, Yan Yang, Yiran Zhong, Zhijing Wu, Xinxiao Wu, and Liu Liu. Storyboard guided alignment for fine-grained video action recognition.arXiv preprint arXiv:2410.14238, 2024. 2, 6
-
[24]
Srinivasa G. Narasimhan and Shree K. Nayar. Contrast restoration of weather degraded images.IEEE transactions on pattern analysis and machine intelligence, 25(6):713– 724, 2003. 2, 6
work page 2003
-
[25]
Expanding language-image pretrained models for gen- eral video recognition
Bolin Ni, Houwen Peng, Minghao Chen, Songyang Zhang, Gaofeng Meng, Jianlong Fu, Shiming Xiang, and Haibin Ling. Expanding language-image pretrained models for gen- eral video recognition. InEuropean conference on computer vision, pages 1–18. Springer, 2022. 6
work page 2022
-
[26]
Representation Learning with Contrastive Predictive Coding
Aaron van den Oord, Yazhe Li, and Oriol Vinyals. Repre- sentation learning with contrastive predictive coding.arXiv preprint arXiv:1807.03748, 2018. 5
work page internal anchor Pith review Pith/arXiv arXiv 2018
-
[27]
DINOv2: Learning Robust Visual Features without Supervision
Maxime Oquab, Timoth ´ee Darcet, Th ´eo Moutakanni, Huy V o, Marc Szafraniec, Vasil Khalidov, Pierre Fernandez, Daniel Haziza, Francisco Massa, Alaaeldin El-Nouby, et al. Dinov2: Learning robust visual features without supervision. arXiv preprint arXiv:2304.07193, 2023. 6, 7
work page internal anchor Pith review Pith/arXiv arXiv 2023
-
[28]
Bringing a blurry frame alive at high frame-rate with an event camera
Liyuan Pan, Cedric Scheerlinck, Xin Yu, Richard Hartley, Miaomiao Liu, and Yuchao Dai. Bringing a blurry frame alive at high frame-rate with an event camera. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 6820–6829, 2019. 3
work page 2019
-
[29]
Learning transferable visual models from natural language supervi- sion
Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. Learning transferable visual models from natural language supervi- sion. InInternational conference on machine learning, pages 8748–8763. PMLR, 2021. 6
work page 2021
-
[30]
Fine-tuned clip models are efficient video learners
Hanoona Rasheed, Muhammad Uzair Khattak, Muhammad Maaz, Salman Khan, and Fahad Shahbaz Khan. Fine-tuned clip models are efficient video learners. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 6545–6554, 2023. 2, 6, 7
work page 2023
-
[31]
Model adaptation with synthetic and real data for semantic dense foggy scene understanding
Christos Sakaridis, Dengxin Dai, Simon Hecker, and Luc Van Gool. Model adaptation with synthetic and real data for semantic dense foggy scene understanding. InProceed- ings of the european conference on computer vision (ECCV), pages 687–704, 2018. 3
work page 2018
-
[32]
Christos Sakaridis, Dengxin Dai, and Luc Van Gool. Seman- tic foggy scene understanding with synthetic data.Interna- tional Journal of Computer Vision, 126:973–992, 2018. 3
work page 2018
-
[33]
Christos Sakaridis, Haoran Wang, Ke Li, Ren ´e Zurbr ¨ugg, Arpit Jadon, Wim Abbeloos, Daniel Olmeda Reino, Luc Van Gool, and Dengxin Dai. Acdc: The adverse condi- tions dataset with correspondences for robust semantic driv- ing scene perception.arXiv e-prints, pages arXiv–2104,
-
[34]
Khurram Soomro, Amir Roshan Zamir, and Mubarak Shah. A dataset of 101 human action classes from videos in the wild.Center for Research in Computer Vision, 2(11):1–7,
-
[35]
Action recognition in haze using an efficient fusion of spatial and temporal features
Sri Girinadh Tanneru and Snehasis Mukherjee. Action recognition in haze using an efficient fusion of spatial and temporal features. InComputer Vision and Image Process- ing: 5th International Conference, CVIP 2020, Prayagraj, India, December 4-6, 2020, Revised Selected Papers, Part II 5, pages 29–38. Springer, 2021. 1, 2, 6
work page 2020
-
[36]
Zhan Tong, Yibing Song, Jue Wang, and Limin Wang. Videomae: Masked autoencoders are data-efficient learners for self-supervised video pre-training.Advances in neural information processing systems, 35:10078–10093, 2022. 6
work page 2022
-
[37]
Hayat Ullah, Khan Muhammad, Muhammad Irfan, Saeed Anwar, Muhammad Sajjad, Ali Shariq Imran, and Vic- tor Hugo C de Albuquerque. Light-dehazenet: a novel lightweight cnn architecture for single image dehazing.IEEE transactions on image processing, 30:8968–8982, 2021. 3
work page 2021
-
[38]
Mengmeng Wang, Jiazheng Xing, Jianbiao Mei, Yong Liu, and Yunliang Jiang. Actionclip: Adapting language-image pretrained models for video action recognition.IEEE Trans- actions on Neural Networks and Learning Systems, 2023. 2, 6, 7
work page 2023
-
[39]
A multimodal, multi-task adapting frame- work for video action recognition
Mengmeng Wang, Jiazheng Xing, Boyuan Jiang, Jun Chen, Jianbiao Mei, Xingxing Zuo, Guang Dai, Jingdong Wang, and Yong Liu. A multimodal, multi-task adapting frame- work for video action recognition. InProceedings of the AAAI Conference on Artificial Intelligence, pages 5517– 5525, 2024. 2, 6, 7
work page 2024
-
[40]
Yongzhen Wang, Xuefeng Yan, Fu Lee Wang, Haoran Xie, Wenhan Yang, Xiao-Ping Zhang, Jing Qin, and Mingqiang Wei. Ucl-dehaze: Towards real-world image dehazing via unsupervised contrastive learning.IEEE Transactions on Im- age Processing, 2024. 6, 7
work page 2024
-
[41]
Vita-clip: Video and text adaptive clip via multimodal prompting
Syed Talal Wasim, Muzammal Naseer, Salman Khan, Fa- had Shahbaz Khan, and Mubarak Shah. Vita-clip: Video and text adaptive clip via multimodal prompting. InProceedings of the IEEE/CVF Conference on Computer Vision and Pat- tern Recognition, pages 23034–23044, 2023. 6
work page 2023
-
[42]
Q-Align: Teaching LMMs for Visual Scoring via Discrete Text-Defined Levels
Haoning Wu, Zicheng Zhang, Weixia Zhang, Chaofeng Chen, Liang Liao, Chunyi Li, Yixuan Gao, Annan Wang, Erli Zhang, Wenxiu Sun, et al. Q-align: Teaching lmms for visual scoring via discrete text-defined levels.arXiv preprint arXiv:2312.17090, 2023. 4
work page internal anchor Pith review Pith/arXiv arXiv 2023
-
[43]
Wenhao Wu, Yuxin Song, Zhun Sun, Jingdong Wang, Chang Xu, and Wanli Ouyang. What can simple arithmetic op- erations do for temporal modeling? InProceedings of the IEEE/CVF international conference on computer vision, pages 13712–13722, 2023. 2, 6
work page 2023
-
[44]
Wenhao Wu, Xiaohan Wang, Haipeng Luo, Jingdong Wang, Yi Yang, and Wanli Ouyang. Bidirectional cross- modal knowledge exploration for video recognition with pre-trained vision-language models. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 6620–6630, 2023. 2, 6
work page 2023
-
[45]
Video dehazing via a multi-range temporal alignment network with physical prior
Jiaqi Xu, Xiaowei Hu, Lei Zhu, Qi Dou, Jifeng Dai, Yu Qiao, and Pheng-Ann Heng. Video dehazing via a multi-range temporal alignment network with physical prior. InProceed- ings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 18053–18062, 2023. 3
work page 2023
-
[46]
Language- driven all-in-one adverse weather removal
Hao Yang, Liyuan Pan, Yan Yang, and Wei Liang. Language- driven all-in-one adverse weather removal. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 24902–24912, 2024. 3, 6, 7
work page 2024
-
[47]
Aim: Adapting image models for efficient video action recognition.arXiv preprint arXiv:2302.03024,
Taojiannan Yang, Yi Zhu, Yusheng Xie, Aston Zhang, Chen Chen, and Mu Li. Aim: Adapting image models for efficient video action recognition.arXiv preprint arXiv:2302.03024,
-
[48]
Ruikun Zhang, Zhiyuan Yang, and Liyuan Pan. Dehaze- mamba: large multi-modal model guided single image de- hazing via mamba.Visual Intelligence, 3(1):11, 2025. 3
work page 2025
-
[49]
Learning to restore hazy video: A new real-world dataset and a new method
Xinyi Zhang, Hang Dong, Jinshan Pan, Chao Zhu, Ying Tai, Chengjie Wang, Jilin Li, Feiyue Huang, and Fei Wang. Learning to restore hazy video: A new real-world dataset and a new method. InProceedings of the IEEE/CVF Con- ference on Computer Vision and Pattern Recognition, pages 9239–9248, 2021. 3
work page 2021
-
[50]
Shiyu Zhao, Lin Zhang, Shuaiyi Huang, Ying Shen, and Shengjie Zhao. Dehazing evaluation: Real-world benchmark datasets, criteria, and baselines.IEEE Transactions on Image Processing, 29:6947–6962, 2020. 3, 4
work page 2020
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.