pith. machine review for the scientific record.

arxiv: 2604.03339 · v1 · submitted 2026-04-03 · 💻 cs.CV

Recognition: 2 theorem links · Lean Theorem

Hierarchical Awareness Adapters with Hybrid Pyramid Feature Fusion for Dense Depth Prediction

Chen Zhao, Chi Xu, Huilun Song, Wuqi Su

Pith reviewed 2026-05-13 19:58 UTC · model grok-4.3

classification 💻 cs.CV
keywords monocular depth estimation · dense depth prediction · feature fusion · hierarchical adapters · conditional random field · Swin Transformer · CRF decoder · depth regression

The pith

A Swin Transformer-based multilevel CRF with hybrid pyramid fusion and hierarchical adapters delivers state-of-the-art monocular depth estimates at low cost.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper aims to improve monocular depth estimation from single RGB images by addressing scale ambiguity and missing geometric cues through better modeling of inter-pixel spatial dependencies. It builds a multilevel perceptual CRF model on a Swin Transformer backbone and introduces three components: hybrid pyramid feature fusion that mixes multi-scale short-range and long-range information, hierarchical awareness adapters that enable efficient cross-level feature interactions via lightweight broadcast modules, and a dynamic CRF decoder that refines pixel-level relationships while avoiding training instability. The authors report that these changes produce lower error rates than prior methods on three standard benchmarks while using 194 million parameters and running in 21 milliseconds. If the approach holds, it would yield more accurate depth maps at lower cost for downstream tasks such as 3D reconstruction and scene understanding, without requiring larger or slower networks.

Core claim

The paper claims that a multilevel perceptual conditional random field model built on the Swin Transformer backbone, incorporating an adaptive hybrid pyramid feature fusion strategy for global-local context, hierarchical awareness adapters with learnable scaling for cross-level enrichment, and a fully-connected CRF decoder with dynamic scaling attention and bias learning, produces superior dense depth predictions by capturing both short-range and long-range spatial dependencies more effectively than existing regression-based networks.

What carries the argument

The multilevel perceptual CRF model on Swin Transformer that integrates hybrid pyramid feature fusion (HPF) for multi-scale aggregation, hierarchical awareness (HA) adapters for efficient cross-level broadcast, and a dynamic CRF decoder for pixel-wise refinement.
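To fix ideas, here is a minimal sketch of how the three pieces could compose around the backbone. The module names, interfaces, and call order are assumptions for illustration only; the review does not specify the authors' implementation at this level of detail.

```python
# Hypothetical composition sketch; hpf, ha_adapters, and crf_decoder are
# stand-ins for the paper's modules, not the authors' released code.
import torch.nn as nn

class DepthModelSketch(nn.Module):
    def __init__(self, backbone, ha_adapters, hpf, crf_decoder):
        super().__init__()
        self.backbone = backbone                        # Swin Transformer encoder stages
        self.ha_adapters = nn.ModuleList(ha_adapters)   # one lightweight adapter per stage
        self.hpf = hpf                                  # hybrid pyramid feature fusion
        self.crf_decoder = crf_decoder                  # dynamic fully-connected CRF head

    def forward(self, rgb):
        # Multi-scale features from the hierarchical encoder.
        feats = self.backbone(rgb)
        # Cross-level enrichment via broadcast adapters with learnable scaling.
        feats = [adapter(f) for adapter, f in zip(self.ha_adapters, feats)]
        # Fuse short-range (pyramid pooling) and long-range (biaxial) context.
        fused = self.hpf(feats)
        # Pixel-wise refinement; a bias learning unit guards against
        # extreme-value collapse during training.
        return self.crf_decoder(fused)
```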

Load-bearing premise

The reported accuracy gains arise primarily from the three proposed components rather than from dataset-specific training choices or implementation details not described in the paper.

What would settle it

An ablation experiment that trains the same backbone once with all three components and once with each component removed in turn, then compares absolute relative error and RMSE on the NYU Depth v2 test set.
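A minimal sketch of that experiment's bookkeeping, assuming a hypothetical train_and_eval helper that trains one configuration under the fixed protocol and returns its NYU Depth v2 test metrics:

```python
# Hold backbone and training protocol fixed; toggle one component at a time.
COMPONENTS = ("hpf", "ha_adapters", "crf_decoder")

def ablation_table(train_and_eval):
    rows = []
    full = train_and_eval(enabled=set(COMPONENTS))
    rows.append(("full model", full["abs_rel"], full["rmse"]))
    for removed in COMPONENTS:
        # Same backbone, optimizer, schedule, and augmentation; one module removed.
        metrics = train_and_eval(enabled=set(COMPONENTS) - {removed})
        rows.append((f"without {removed}", metrics["abs_rel"], metrics["rmse"]))
    return rows
```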

read the original abstract

Monocular depth estimation from a single RGB image remains a fundamental challenge in computer vision due to inherent scale ambiguity and the absence of explicit geometric cues. Existing approaches typically rely on increasingly complex network architectures to regress depth maps, which escalates training costs and computational overhead without fully exploiting inter-pixel spatial dependencies. We propose a multilevel perceptual conditional random field (CRF) model built upon the Swin Transformer backbone that addresses these limitations through three synergistic innovations: (1) an adaptive hybrid pyramid feature fusion (HPF) strategy that captures both short-range and long-range dependencies by combining multi-scale spatial pyramid pooling with biaxial feature aggregation, enabling effective integration of global and local contextual information; (2) a hierarchical awareness adapter (HA) that enriches cross-level feature interactions within the encoder through lightweight broadcast modules with learnable dimensional scaling, reducing computational complexity while enhancing representational capacity; and (3) a fully-connected CRF decoder with dynamic scaling attention that models fine-grained pixel-level spatial relationships, incorporating a bias learning unit to prevent extreme-value collapse and ensure stable training. Extensive experiments on NYU Depth v2, KITTI, and MatterPort3D datasets demonstrate that our method achieves state-of-the-art performance, reducing Abs Rel to 0.088 (−7.4%) and RMSE to 0.316 (−5.4%) on NYU Depth v2, while attaining near-perfect threshold accuracy (δ < 1.25³ ≈ 99.8%) on KITTI with only 194M parameters and 21ms inference time.
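For readers checking the numbers, Abs Rel, RMSE, and the δ thresholds are the community-standard depth metrics; a minimal NumPy implementation of their usual definitions (not taken from the paper's code):

```python
import numpy as np

def depth_metrics(pred, gt):
    """Standard monocular-depth metrics over pixels with valid ground truth."""
    pred, gt = np.asarray(pred, float), np.asarray(gt, float)
    valid = gt > 0                       # ignore pixels without ground truth
    pred, gt = pred[valid], gt[valid]
    abs_rel = np.mean(np.abs(pred - gt) / gt)          # absolute relative error
    rmse = np.sqrt(np.mean((pred - gt) ** 2))          # root mean squared error
    ratio = np.maximum(pred / gt, gt / pred)           # per-pixel ratio
    deltas = {f"delta{i}": np.mean(ratio < 1.25 ** i) for i in (1, 2, 3)}
    return {"abs_rel": abs_rel, "rmse": rmse, **deltas}
```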

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper proposes a monocular dense depth estimation architecture built on a Swin Transformer backbone. It introduces three components—an adaptive hybrid pyramid feature fusion (HPF) module, hierarchical awareness (HA) adapters, and a dynamic CRF decoder with bias learning—to better capture multi-scale spatial dependencies and pixel-level relationships. The central empirical claim is state-of-the-art performance on NYU Depth v2 (Abs Rel 0.088, RMSE 0.316), KITTI (99.8% δ<1.25³), and MatterPort3D, achieved with 194M parameters and 21 ms inference.

Significance. If the reported gains prove robust and attributable to the proposed modules rather than training or implementation details, the work would offer a practical advance in efficient, high-accuracy dense depth prediction by combining pyramid fusion, lightweight adapters, and CRF modeling within a transformer backbone.

major comments (2)
  1. [Experiments] Experiments section: no component-wise ablation tables are presented that add HPF, HA adapters, and the dynamic CRF decoder incrementally to a fixed Swin-Transformer baseline while holding training protocol, optimizer, and data augmentation constant. Without these controlled comparisons, the attribution of the −7.4% Abs Rel and −5.4% RMSE reductions specifically to the three innovations cannot be verified.
  2. [Abstract and §4] Abstract and §4: the reported metrics lack error bars, statistical significance tests, or multiple random-seed runs, making it impossible to determine whether the claimed SOTA deltas exceed the variability of the baseline.
minor comments (2)
  1. [Abstract] The abstract introduces the term 'multilevel perceptual conditional random field' without a brief equation or diagram clarifying how the dynamic CRF differs from a standard fully-connected CRF.
  2. [Method] Notation for the bias learning unit and dynamic scaling attention is introduced but not cross-referenced to the corresponding equations or pseudocode.

Simulated Author's Rebuttal

2 responses · 0 unresolved

Thank you for the constructive feedback on our manuscript. We appreciate the emphasis on rigorous empirical validation and will revise the paper to strengthen the experimental section accordingly.

read point-by-point responses
  1. Referee: [Experiments] Experiments section: no component-wise ablation tables are presented that add HPF, HA adapters, and the dynamic CRF decoder incrementally to a fixed Swin-Transformer baseline while holding training protocol, optimizer, and data augmentation constant. Without these controlled comparisons, the attribution of the −7.4% Abs Rel and −5.4% RMSE reductions specifically to the three innovations cannot be verified.

    Authors: We agree that incremental, controlled ablations are required to attribute gains specifically to the proposed modules. In the revised manuscript we will add a dedicated ablation table on NYU Depth v2 that begins with the unmodified Swin Transformer baseline and successively incorporates the HPF module, HA adapters, and dynamic CRF decoder while freezing the training protocol, optimizer, learning-rate schedule, and data augmentations. This will directly quantify the contribution of each component. revision: yes

  2. Referee: [Abstract and §4] Abstract and §4: the reported metrics lack error bars, statistical significance tests, or multiple random-seed runs, making it impossible to determine whether the claimed SOTA deltas exceed the variability of the baseline.

    Authors: We acknowledge the absence of statistical robustness measures. For the revision we will execute at least three independent training runs with different random seeds for our full model and the primary baselines. Mean values together with standard deviations will be reported for Abs Rel, RMSE, and the δ thresholds on both NYU Depth v2 and KITTI; this will allow readers to assess whether the reported improvements exceed typical run-to-run variability. revision: yes
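A sketch of that promised protocol, assuming a hypothetical run_training helper that performs one seeded training run and returns its test metrics:

```python
import statistics

def seed_report(run_training, seeds=(0, 1, 2), keys=("abs_rel", "rmse")):
    # One independent training run per seed; report mean and std per metric,
    # so claimed deltas can be compared against run-to-run variability.
    runs = [run_training(seed=s) for s in seeds]
    return {k: (statistics.mean(r[k] for r in runs),
                statistics.stdev(r[k] for r in runs)) for k in keys}
```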

Circularity Check

0 steps flagged

No circularity: empirical architecture validated on external benchmarks

full rationale

The paper proposes three architectural modules (adaptive hybrid pyramid feature fusion, hierarchical awareness adapters, dynamic CRF decoder) atop a Swin Transformer backbone and reports performance on NYU Depth v2, KITTI, and MatterPort3D. No mathematical derivation chain, equations, or fitted-parameter predictions exist in the provided text that could reduce to inputs by construction. Claims rest on standard benchmark metrics that are externally falsifiable and independent of any self-referential definitions or self-citation load-bearing steps. Absence of component ablations affects attribution strength but does not create circularity under the defined criteria.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 2 invented entities

The central claim depends on the unproven effectiveness of the newly introduced HA and HPF modules for capturing spatial dependencies; these are postulated without independent verification outside the reported experiments.

axioms (1)
  • domain assumption Swin Transformer backbone extracts sufficiently rich multi-scale features for depth regression
    The encoder is taken as given from prior literature and not re-derived.
invented entities (2)
  • Hierarchical Awareness Adapter (HA) no independent evidence
    purpose: Enrich cross-level feature interactions via lightweight broadcast modules
    New module introduced to reduce complexity while enhancing representation
  • Hybrid Pyramid Feature Fusion (HPF) no independent evidence
    purpose: Combine multi-scale spatial pyramid pooling with biaxial aggregation
    New fusion strategy for global and local context

pith-pipeline@v0.9.0 · 5586 in / 1408 out tokens · 49829 ms · 2026-05-13T19:58:16.216402+00:00 · methodology


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

Reference graph

Works this paper leans on

70 extracted references · 70 canonical work pages · 2 internal anchors

  1. [1]

    GPT-4 Technical Report

    Josh Achiam, Steven Adler, Sandhini Agarwal, Lama Ahmad, Ilge Akkaya, Florencia Leoni Aleman, Diogo Almeida, Janko Altenschmidt, Sam Altman, Shyamal Anadkat, et al. GPT-4 technical report. arXiv preprint arXiv:2303.08774, 2023

  2. [3]

    Bidirectional attention network for monocular depth estimation

    Shubhra Aich, Jean Marie Uwabeza Vianney, Md Amirul Islam, Maneet Kaur, and Bingbing Liu. Bidirectional attention network for monocular depth estimation. In Proceedings of the IEEE International Conference on Robotics and Automation (ICRA), pages 11746–11752. IEEE, 2021. doi: 10.1109/ICRA48506.2021.9560885

  3. [4]

    Adabins: Depth estimation using adaptive bins

    Shariq Farooq Bhat, Ibraheem Alhashim, and Peter Wonka. Adabins: Depth estimation using adaptive bins. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 4008–4017. IEEE, 2021. doi: 10.1109/CVPR46437.2021.00400

  4. [5]

    Localbins: Improving depth estimation by learning local distributions

    Shariq Farooq Bhat, Ibraheem Alhashim, and Peter Wonka. Localbins: Improving depth estimation by learning local distributions. In Proceedings of the 17th European Conference on Computer Vision (ECCV), pages 480–496. Springer, 2022. doi: 10.1007/978-3-031-19769-7_28

  5. [6]

    Zoedepth: Zero-shot transfer by combining relative and metric depth

    Shariq Farooq Bhat, Reiner Birkl, Diana Wofk, Peter Wonka, and Matthias Müller. Zoedepth: Zero-shot transfer by combining relative and metric depth. arXiv preprint arXiv:2302.12288, 2023

  6. [7]

    Matterport3d: Learning from rgb-d data in indoor environments

    Angel Chang, Angela Dai, Thomas Funkhouser, Maciej Halber, Matthias Nießner, Manolis Savva, Shuran Song, Andy Zeng, and Yinda Zhang. Matterport3d: Learning from rgb-d data in indoor environments. In Proceedings of the International Conference on 3D Vision (3DV), pages 667–

  7. [9]

    Encoder-decoder with atrous separable convolution for semantic image segmentation

    Liang-Chieh Chen, Yukun Zhu, George Papandreou, Florian Schroff, and Hartwig Adam. Encoder-decoder with atrous separable convolution for semantic image segmentation. In Proceedings of the 15th European Conference on Computer Vision (ECCV), pages 833–851. Springer, 2018. doi: 10.1007/978-3-030-01234-2_49

  8. [10]

    Stereo matching algorithm based on edge preservation and improved cost aggregation

    Deqiang Cheng, Haixiang Li, Qiqi Kou, Zekuan Yu, Huandong Zhuang, and Chen Lyu. Stereo matching algorithm based on edge preservation and improved cost aggregation. Journal of Image and Graphics, 26(2):438–451, 2021. doi: 10.11834/jig.200041

  9. [11]

    TC-Padé: Trajectory-consistent Padé approximation for diffusion acceleration

    Benlei Cui, Shengqu He, Bohan Huang, Zhenyu Ye, Yongwei Sun, Lei Huang, Hanwen Xue, Yihua Yang, Jingqun Tang, et al. TC-Padé: Trajectory-consistent Padé approximation for diffusion acceleration. arXiv preprint arXiv:2603.02943, 2026

  10. [12]

    An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale

    Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, Jakob Uszkoreit, and Neil Houlsby. An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929, 2021

  11. [13]

    Depth map prediction from a single image using a multi-scale deep network

    David Eigen, Christian Puhrsch, and Rob Fergus. Depth map prediction from a single image using a multi-scale deep network. In Proceedings of the 27th International Conference on Neural Information Processing Systems (NeurIPS), pages 2366–2374. MIT Press, 2014

  12. [14]

    Advancing sequential numerical prediction in autoregressive models

    Xiaoyu Fei, Jinghui Lu, Qiushi Sun, Hao Feng, Yanjie Wang, Wenqiang Shi, Anlei Wang, Jingqun Tang, and Can Huang. Advancing sequential numerical prediction in autoregressive models. In Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics, 2025

  13. [15]

    UniDoc: A universal large multimodal model for simultaneous text detection, recognition, spotting and understanding

    Hao Feng, Zijian Wang, Jingqun Tang, Jinghui Lu, Wengang Zhou, Houqiang Li, and Can Huang. UniDoc: A universal large multimodal model for simultaneous text detection, recognition, spotting and understanding. arXiv preprint arXiv:2308.11592, 2023

  14. [16]

    Docpedia: Unleashing the power of large multimodal model in the frequency domain for versatile document understanding

    Hao Feng, Qi Liu, Hao Liu, Jingqun Tang, Wengang Zhou, Houqiang Li, and Can Huang. Docpedia: Unleashing the power of large multimodal model in the frequency domain for versatile document understanding. Science China Information Sciences, 67(12):1–14, 2024

  15. [17]

    Dolphin: Document image parsing via heterogeneous anchor prompting

    Hao Feng, Shu Wei, Xiaoyu Fei, Wenqiang Shi, Yan Han, Lei Liao, Jinghui Lu, Binghong Wu, Qi Liu, Chunhui Lin, Jingqun Tang, and Can Huang. Dolphin: Document image parsing via heterogeneous anchor prompting. In Findings of the Association for Computational Linguistics: ACL 2025, pages 21919–21936, 2025

  16. [18]

    Dolphin-v2: Universal document parsing via scalable anchor prompting

    Hao Feng, Wenqiang Shi, Kaixin Zhang, Xiaoyu Fei, Lei Liao, Dian Yang, Yuhui Du, Xiaocong Wu, Jingqun Tang, Yuliang Liu, et al. Dolphin-v2: Universal document parsing via scalable anchor prompting. arXiv preprint arXiv:2602.05384, 2026

  17. [19]

    OCRBench v2: An improved benchmark for evaluating large multimodal models on visual text localization and reasoning

    Ling Fu, Zhebin Kuang, Jiajun Song, Mingxin Huang, Biao Yang, Yuzhe Li, Linghao Zhu, Qidi Luo, Xinyu Wang, Hao Lu, Guozhi Tang, Bin Shan, Chunhui Lin, Qi Liu, Binghong Wu, Hao Feng, Hao Liu, Can Huang, Jingqun Tang, Wei Chen, Lianwen Jin, Yuliang Liu, and Xiang Bai. OCRBench v2: An improved benchmark for evaluating large multimodal models on visual text localization and reasoning. arXiv preprint arXiv:2501.00321, 2024

  18. [20]

    Are we ready for autonomous driving? the kitti vision benchmark suite

    Andreas Geiger, Philip Lenz, and Raquel Urtasun. Are we ready for autonomous driving? the kitti vision benchmark suite. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 3354–3361. IEEE, 2012. doi: 10.1109/CVPR.2012.6248074

  19. [21]

    Sparse auxiliary networks for unified monocular depth prediction and completion

    Vitor Guizilini, Rares Ambrus, Wolfram Burgard, and Adrien Gaidon. Sparse auxiliary networks for unified monocular depth prediction and completion. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 11073–11083. IEEE, 2021. doi: 10.1109/CVPR46437.2021.01093

  20. [22]

    Detail preserving depth estimation from a single image using attention guided networks

    Zhixiang Hao, Yu Li, Shaodi You, and Feng Lu. Detail preserving depth estimation from a single image using attention guided networks. In Proceedings of the International Conference on 3D Vision (3DV), pages 304–313. IEEE, 2018. doi: 10.1109/3DV.2018.00043

  21. [23]

    Denoising diffusion probabilistic models

    Jonathan Ho, Ajay Jain, and Pieter Abbeel. Denoising diffusion probabilistic models. Advances in Neural Information Processing Systems, 33:6840–6851, 2020

  22. [24]

    Rvmde: Radar validated monocular depth estimation for robotics

    Muhammad Ishfaq Hussain, Muhammad Ahsan Rafique, and Moongu Jeon. Rvmde: Radar validated monocular depth estimation for robotics. arXiv preprint arXiv:2109.05265, 2022

  23. [25]

    Guiding monocular depth estimation using depth-attention volume

    Lam Huynh, Phong Nguyen-Ha, Jiri Matas, Esa Rahtu, and Janne Heikkilä. Guiding monocular depth estimation using depth-attention volume. In Proceedings of the 16th European Conference on Computer Vision (ECCV), pages 581–597. Springer, 2020. doi: 10.1007/978-3-030-58574-7_35

  24. [26]

    Ddp: Diffusion model for dense visual prediction

    Yuanfeng Ji, Zhe Chen, Enze Xie, Lanqing Hong, Xihui Liu, Zhaoqiang Liu, Tong Lu, Zhenguo Li, and Ping Luo. Ddp: Diffusion model for dense visual prediction. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), pages 21684–21695. IEEE, 2023. doi: 10.1109/ICCV51070.2023.01987

  25. [27]

    MEML-GRPO: Heterogeneous multi-expert mutual learning for RLVR advancement

    Wei Jia, Jinghui Lu, Haiyang Yu, Siqi Wang, Guozhi Tang, Anlei Wang, Weirui Yin, Dian Yang, Yicheng Nie, et al. MEML-GRPO: Heterogeneous multi-expert mutual learning for RLVR advancement. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 40, page 31283, 2026

  26. [28]

    Evp: Enhanced visual perception using inverse multi-attentive feature refinement and regularized image-text alignment

    Mykola Lavreniuk, Shariq Farooq Bhat, Matthias Müller, and Peter Wonka. Evp: Enhanced visual perception using inverse multi-attentive feature refinement and regularized image-text alignment. arXiv preprint arXiv:2312.08548, 2023

  27. [29]

    From big to small: Multi-scale local planar guidance for monocular depth estimation

    Jin Han Lee, Myung-Kyu Han, Dong Wook Ko, and Il Hong Suh. From big to small: Multi-scale local planar guidance for monocular depth estimation. arXiv preprint arXiv:1907.10326, 2021

  28. [30]

    Patch-wise attention network for monocular depth estimation

    Seungyoung Lee, Juhyung Lee, Byeongkeun Kim, Eojindl Yi, and Junmo Kim. Patch-wise attention network for monocular depth estimation. In Proceedings of the 35th AAAI Conference on Artificial Intelligence, pages 1873–1881. AAAI, 2021. doi: 10.1609/aaai.v35i3.16282

  29. [31]

    Binsformer: Revisiting adaptive bins for monocular depth estimation

    Zhenyu Li, Xuyang Wang, Xianming Liu, and Junjun Jiang. Binsformer: Revisiting adaptive bins for monocular depth estimation. IEEE Transactions on Image Processing, 33:3964–3976, 2022. doi: 10.1109/TIP.2024.3416065

  30. [32]

    Depthformer: Exploiting long-range correlation and local information for accurate monocular depth estimation

    Zhenyu Li, Zehui Chen, Xianming Liu, and Junjun Jiang. Depthformer: Exploiting long-range correlation and local information for accurate monocular depth estimation. Machine Intelligence Research, 20(6):837–854, 2023. doi: 10.1007/s11633-023-1458-0

  31. [33]

    SPTS v2: Single-point scene text spotting

    Yuliang Liu, Jiaxin Zhang, Dezhi Peng, Mingxin Huang, Xinyu Wang, Jingqun Tang, Can Huang, Dahua Lin, Chunhua Shen, Xiang Bai, and Lianwen Jin. SPTS v2: Single-point scene text spotting. IEEE Transactions on Pattern Analysis and Machine Intelligence, 45(12):15038–15055, 2023. doi: 10.1109/TPAMI.2023.3312285

  32. [34]

    Swin transformer: Hierarchical vision transformer using shifted windows

    Ze Liu, Yutong Lin, Yue Cao, Han Hu, Yixuan Wei, Zheng Zhang, Stephen Lin, and Baining Guo. Swin transformer: Hierarchical vision transformer using shifted windows. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), pages 9992–10002. IEEE, 2021. doi: 10.1109/ICCV48922.2021.00986

  33. [35]

    Adaptive surface normal constraint for depth estimation

    Xiaoxiao Long, Cheng Lin, Lingjie Liu, Wei Li, Christian Theobalt, Ruigang Yang, and Wenping Wang. Adaptive surface normal constraint for depth estimation. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), pages 12829–12838. IEEE, 2021. doi: 10.1109/ICCV48922.2021.01261

  34. [36]

    A bounding box is worth one token: Interleaving layout and text in a large language model for document understanding

    Jinghui Lu, Haiyang Yu, Yanjie Wang, Yongjie Ye, Jingqun Tang, Ziwei Yang, Binghong Wu, Qi Liu, Hao Feng, Han Wang, et al. A bounding box is worth one token: Interleaving layout and text in a large language model for document understanding. In Findings of the Association for Computational Linguistics: ACL 2025, pages 7252–7273, 2025

  35. [37]

    Prolonged reasoning is not all you need: Certainty-based adaptive routing for efficient LLM/MLLM reasoning

    Jinghui Lu, Haiyang Yu, Shuai Xu, Shuangquan Ran, Guozhi Tang, Siqi Wang, Bin Shan, Tongji Fu, Hao Feng, Jingqun Tang, et al. Prolonged reasoning is not all you need: Certainty-based adaptive routing for efficient LLM/MLLM reasoning. arXiv preprint arXiv:2505.15154, 2025

  36. [38]

    ChineseVideoBench: Benchmarking multi-modal large models for Chinese video question answering

    Yicheng Nie, Haotian Wang, Yongjie Ye, Haiyang Yu, Wei Jia, Tao Zeng, Hao Feng, Xiaoyu Fei, Yichao Li, Xiaotao Lv, Jingqun Tang, et al. ChineseVideoBench: Benchmarking multi-modal large models for Chinese video question answering. arXiv preprint arXiv:2511.18399, 2025

  37. [39]

    P3depth: Monocular depth estimation with a piecewise planarity prior

    Vaishakh Patil, Christos Sakaridis, Alexander Liniger, and Luc Van Gool. P3depth: Monocular depth estimation with a piecewise planarity prior. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 1600–1611. IEEE, 2022. doi: 10.1109/CVPR52688.2022.00166

  38. [40]

    Ecodepth: Effective conditioning of diffusion models for monocular depth estimation

    Suraj Patni, Aradhye Agarwal, and Chetan Arora. Ecodepth: Effective conditioning of diffusion models for monocular depth estimation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 28285–28295. IEEE, 2024. doi: 10.1109/CVPR52733.2024.02672

  39. [41]

    Vision transformers for dense prediction

    René Ranftl, Alexey Bochkovskiy, and Vladlen Koltun. Vision transformers for dense prediction. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), pages 12159–12168. IEEE, 2021. doi: 10.1109/ICCV48922.2021.01196

  40. [42]

    You only look once: Unified, real-time object detection

    Joseph Redmon, Santosh Divvala, Ross Girshick, and Ali Farhadi. You only look once: Unified, real-time object detection. arXiv preprint arXiv:1506.02640, 2016. doi: 10.48550/arXiv.1506.02640

  41. [43]

    High-resolution image synthesis with latent diffusion models

    Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, and Björn Ommer. High-resolution image synthesis with latent diffusion models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 10684–10695, 2022

  42. [44]

    Trends and prospects of techniques for haze removal from degraded images: A survey

    Gaurav Sahu, Ayan Seal, Debotosh Bhattacharjee, Mita Nasipuri, Peter Brida, and Ondrej Krejcar. Trends and prospects of techniques for haze removal from degraded images: A survey. IEEE Transactions on Emerging Topics in Computational Intelligence, 6(4):762–782, 2022. doi: 10.1109/TETCI.2022.3173443

  43. [45]

    MCTBench: Multimodal cognition towards text-rich visual scenes benchmark

    Bin Shan, Xiaoyu Fei, Wenqiang Shi, Anlei Wang, Guozhi Tang, Lei Liao, Jingqun Tang, Xiang Bai, and Can Huang. MCTBench: Multimodal cognition towards text-rich visual scenes benchmark. arXiv preprint arXiv:2410.11538, 2024

  44. [46]

    A review of monocular depth estimation techniques based on deep learning

    Wei Song, Mengfei Zhu, Minghua Zhang, Danfeng Zhao, and Qi He. A review of monocular depth estimation techniques based on deep learning. Journal of Image and Graphics, 27(2):292–328. doi: 10.11834/jig.210554

  46. [48]

    Optimal boxes: Boosting end-to-end scene text recognition by adjusting annotated bounding boxes via reinforcement learning

    Jingqun Tang, Wenqing Qian, Luchuan Song, Xiaozhong Dong, Lanfang Li, and Xiang Bai. Optimal boxes: Boosting end-to-end scene text recognition by adjusting annotated bounding boxes via reinforcement learning. In European Conference on Computer Vision, pages 233–248. Springer, 2022

  47. [49]

    You can even annotate text with voice: Transcription-only-supervised text spotting

    Jingqun Tang, Su Qiao, Benlei Cui, Yuhang Ma, Sheng Zhang, and Dimitrios Kanoulas. You can even annotate text with voice: Transcription-only-supervised text spotting. In Proceedings of the 30th ACM International Conference on Multimedia, pages 4154–4163, 2022. doi: 10.1145/3503161.3547787

  48. [50]

    Few could be better than all: Feature sampling and grouping for scene text detection

    Jingqun Tang, Wenqing Zhang, Hongye Liu, MingKun Yang, Bo Jiang, Guanglong Hu, and Xiang Bai. Few could be better than all: Feature sampling and grouping for scene text detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 4563–4572, 2022

  49. [51]

    Character recognition competition for street view shop signs

    Jingqun Tang, Weijia Du, Bin Wang, Wengang Zhou, Song Mei, Tong Xue, Xin Xu, and Hao Zhang. Character recognition competition for street view shop signs. National Science Review, 10(6):nwad141, 2023. doi: 10.1093/nsr/nwad141

  50. [52]

    TextSquare: Scaling up text-centric visual instruction tuning

    Jingqun Tang, Chunhui Lin, Zhen Zhao, Shu Wei, Binghong Wu, Qi Liu, Hao Feng, Yang Li, Siqi Wang, Lei Liao, et al. TextSquare: Scaling up text-centric visual instruction tuning. arXiv preprint arXiv:2404.12803, 2024

  51. [53]

    MTVQA: Benchmarking multilingual text-centric visual question answering

    Jingqun Tang, Qi Liu, Yongjie Ye, Jinghui Lu, Shu Wei, Chunhui Lin, Wanqing Li, Mohamad Fitri Faiz Bin Mahmood, Hao Feng, Zhen Zhao, et al. MTVQA: Benchmarking multilingual text-centric visual question answering. arXiv preprint arXiv:2405.11985, 2024

  52. [54]

    Hitnet: Hierarchical iterative tile refinement network for real-time stereo matching

    Vladimir Tankovich, Christian Häne, Yinda Zhang, Adarsh Kowdle, Sean Fanello, and Sofien Bouaziz. Hitnet: Hierarchical iterative tile refinement network for real-time stereo matching. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 14357–14367. IEEE, 2021. doi: 10.1109/CVPR46437.2021.01413

  53. [55]

    Pargo: Bridging vision-language with partial and global views

    An-Lan Wang, Bin Shan, Wei Shi, Kun-Yu Lin, Xiang Fei, Guozhi Tang, Lei Liao, Jingqun Tang, Can Huang, and Wei-Shi Zheng. Pargo: Bridging vision-language with partial and global views. 39(7):7491–7499, 2025

  54. [56]

    WildDoc: How far are we from achieving comprehensive and robust document understanding in the wild?

    Anlei Wang, Jingqun Tang, Lei Liao, Hao Feng, Qi Liu, Xiaoyu Fei, Jinghui Lu, Han Wang, Hao Liu, Yuliang Liu, et al. WildDoc: How far are we from achieving comprehensive and robust document understanding in the wild? In Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing, 2025

  55. [57]

    Vision as LoRA

    Haotian Wang, Yongjie Ye, Bingning Li, Yicheng Nie, Jinghui Lu, Jingqun Tang, Yaxing Wang, and Can Huang. Vision as LoRA. arXiv preprint arXiv:2503.20680, 2025

  56. [58]

    Pyramid vision transformer: A versatile backbone for dense prediction without convolutions

    Wenhai Wang, Enze Xie, Xiang Li, Deng-Ping Fan, Kaitao Song, Ding Liang, Tong Lu, Ping Luo, and Ling Shao. Pyramid vision transformer: A versatile backbone for dense prediction without convolutions. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), pages 548–558. IEEE, 2021. doi: 10.1109/ICCV48922.2021.00061

  57. [59]

    Vision transformer with deformable attention

    Zhuofan Xia, Xuran Pan, Shiji Song, Li Erran Li, and Gao Huang. Vision transformer with deformable attention. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 4784–4793. IEEE, 2022. doi: 10.1109/CVPR52688.2022.00475

  58. [60]

    Transformer-based attention networks for continuous pixel-wise prediction

    Guanglei Yang, Hao Tang, Mingli Ding, Nicu Sebe, and Elisa Ricci. Transformer-based attention networks for continuous pixel-wise prediction. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), pages 16249–16259. IEEE, 2021. doi: 10.1109/ICCV48922.2021.01596

  59. [61]

    Polymax: General dense prediction with mask transformer

    Xuan Yang, Liangzhe Yuan, Kimberly Wilber, Astuti Sharma, Xiuye Gu, Siyuan Qiao, Stephanie Debats, Huisheng Wang, Hartwig Adam, Mikhail Sirotenko, and Liang-Chieh Chen. Polymax: General dense prediction with mask transformer. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision (WACV), pages 1039–1050. IEEE, 2024. doi: 10.1...

  60. [62]

    Futuredepth: Learning to predict the future improves video depth estimation

    Rajeev Yasarla, Manish Kumar Singh, Hong Cai, Yunxiao Shi, Jisoo Jeong, Yinghui Zhu, Song Zhe Han, Risheek Garrepalli, and Fatih Porikli. Futuredepth: Learning to predict the future improves video depth estimation. arXiv preprint arXiv:2403.12953, 2024

  61. [63]

    Enforcing geometric constraints of virtual normal for depth prediction

    Wei Yin, Yifan Liu, Chunhua Shen, and Youliang Yan. Enforcing geometric constraints of virtual normal for depth prediction. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), pages 5683–5692. IEEE, 2019. doi: 10.1109/ICCV.2019.00578

  62. [64]

    Benchmarking vision-language models on Chinese ancient documents: From OCR to knowledge reasoning

    Haiyang Yu, Yongping Wu, Feifan Shi, Lei Liao, Jinghui Lu, Xiaocui Ge, Han Wang, Mengfei Zhuo, Xiaocong Wu, Xiaoyu Fei, Jingqun Tang, et al. Benchmarking vision-language models on Chinese ancient documents: From OCR to knowledge reasoning. arXiv preprint arXiv:2509.09731, 2025

  63. [65]

    Neural window fully-connected crfs for monocular depth estimation

    Weihao Yuan, Xiaodong Gu, Zuozhuo Dai, Siyu Zhu, and Ping Tan. Neural window fully-connected crfs for monocular depth estimation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 3906–3915. IEEE, 2022. doi: 10.1109/CVPR52688.2022.00389

  64. [66]

    Lite-mono: A lightweight cnn and transformer architecture for self-supervised monocular depth estimation

    Ning Zhang, Francesco Nex, George Vosselman, and Norman Kerle. Lite-mono: A lightweight cnn and transformer architecture for self-supervised monocular depth estimation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 18537–18546. IEEE, 2023. doi: 10.1109/CVPR52729.2023.01778

  65. [67]

    Improving deep regression with ordinal entropy

    Shihao Zhang, Linlin Yang, Michael B. Mi, Xiaoxiao Zheng, and Angela Yao. Improving deep regression with ordinal entropy. arXiv preprint arXiv:2301.08915, 2023

  66. [68]

    Blind image quality assessment via vision-language correspondence: A multitask learning perspective

    Weixia Zhang, Guangtao Zhai, Ying Wei, Xiaokang Yang, and Kede Ma. Blind image quality assessment via vision-language correspondence: A multitask learning perspective. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 14071–14081, 2023

  67. [69]

    TabPedia: Towards comprehensive visual table understanding with concept synergy

    Weichao Zhao, Hao Feng, Qi Liu, Jingqun Tang, Shu Wei, Binghong Wu, Lei Liao, Yongjie Ye, Hao Liu, Wengang Zhou, Houqiang Li, and Can Huang. TabPedia: Towards comprehensive visual table understanding with concept synergy. In Advances in Neural Information Processing Systems, volume 37, 2024

  68. [70]

    Multi-modal in-context learning makes an ego-evolving scene text recognizer

    Zhen Zhao, Jingqun Tang, Chunhui Lin, Binghong Wu, Can Huang, Hao Liu, Xin Tan, Zhizhong Zhang, and Yuan Xie. Multi-modal in-context learning makes an ego-evolving scene text recognizer. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 15567–15576, 2024

  69. [71]

    Harmonizing visual text comprehension and generation

    Zhen Zhao, Jingqun Tang, Binghong Wu, Chunhui Lin, Shu Wei, Hao Liu, Xin Tan, Zhizhong Zhang, Can Huang, and Yuan Xie. Harmonizing visual text comprehension and generation. arXiv preprint arXiv:2407.16364, 2024

  70. [72]

    Textpecker: Rewarding structural anomaly quantification for enhancing visual text rendering

    Hanshen Zhu, Yuliang Liu, Xuecheng Wu, An-Lan Wang, Hao Feng, Dingkang Yang, Chao Feng, Can Huang, Jingqun Tang, and Xiang Bai. Textpecker: Rewarding structural anomaly quantification for enhancing visual text rendering. arXiv preprint arXiv:2602.20903, 2026