pith. machine review for the scientific record.

arxiv: 2604.03339 · v1 · submitted 2026-04-03 · 💻 cs.CV

Recognition: 2 theorem links · Lean Theorem

Hierarchical Awareness Adapters with Hybrid Pyramid Feature Fusion for Dense Depth Prediction

Chen Zhao, Chi Xu, Huilun Song, Wuqi Su

Pith reviewed 2026-05-13 19:58 UTC · model grok-4.3

classification 💻 cs.CV
keywords monocular depth estimation · dense depth prediction · feature fusion · hierarchical adapters · conditional random field · Swin Transformer · CRF decoder · depth regression

The pith

A Swin Transformer-based multilevel CRF with hybrid pyramid fusion and hierarchical adapters delivers state-of-the-art monocular depth estimates at low cost.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper aims to improve monocular depth estimation from single RGB images by addressing scale ambiguity and missing geometric cues through better modeling of inter-pixel spatial dependencies. It builds a multilevel perceptual CRF model on a Swin Transformer backbone and introduces three components: hybrid pyramid feature fusion that mixes multi-scale short-range and long-range information, hierarchical awareness adapters that enable efficient cross-level feature interactions via lightweight broadcast modules, and a dynamic CRF decoder that refines pixel-level relationships while avoiding training instability. The authors report that these changes produce lower error rates than prior methods on three standard benchmarks while using 194 million parameters and running in 21 milliseconds. If the approach holds, it would yield more accurate depth maps at lower cost for downstream tasks such as 3D reconstruction and scene understanding, without requiring larger or slower networks.

Core claim

The paper claims that a multilevel perceptual conditional random field model built on the Swin Transformer backbone, incorporating an adaptive hybrid pyramid feature fusion strategy for global-local context, hierarchical awareness adapters with learnable scaling for cross-level enrichment, and a fully-connected CRF decoder with dynamic scaling attention and bias learning, produces superior dense depth predictions by capturing both short-range and long-range spatial dependencies more effectively than existing regression-based networks.

What carries the argument

The multilevel perceptual CRF model on Swin Transformer that integrates hybrid pyramid feature fusion (HPF) for multi-scale aggregation, hierarchical awareness (HA) adapters for efficient cross-level broadcast, and a dynamic CRF decoder for pixel-wise refinement.
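To fix ideas, here is a minimal sketch of how the three pieces could compose around the backbone. The module names, interfaces, and call order are assumptions for illustration only; the review does not specify the authors' implementation at this level of detail.

```python
# Hypothetical composition sketch; hpf, ha_adapters, and crf_decoder are
# stand-ins for the paper's modules, not the authors' released code.
import torch.nn as nn

class DepthModelSketch(nn.Module):
    def __init__(self, backbone, ha_adapters, hpf, crf_decoder):
        super().__init__()
        self.backbone = backbone                        # Swin Transformer encoder stages
        self.ha_adapters = nn.ModuleList(ha_adapters)   # one lightweight adapter per stage
        self.hpf = hpf                                  # hybrid pyramid feature fusion
        self.crf_decoder = crf_decoder                  # dynamic fully-connected CRF head

    def forward(self, rgb):
        # Multi-scale features from the hierarchical encoder.
        feats = self.backbone(rgb)
        # Cross-level enrichment via broadcast adapters with learnable scaling.
        feats = [adapter(f) for adapter, f in zip(self.ha_adapters, feats)]
        # Fuse short-range (pyramid pooling) and long-range (biaxial) context.
        fused = self.hpf(feats)
        # Pixel-wise refinement; a bias learning unit guards against
        # extreme-value collapse during training.
        return self.crf_decoder(fused)
```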

Load-bearing premise

The reported accuracy gains arise primarily from the three proposed components rather than from dataset-specific training choices or implementation details not described in the paper.

What would settle it

An ablation experiment that trains the same backbone once with all three components and once with each component removed in turn, then compares absolute relative error and RMSE on the NYU Depth v2 test set.
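A minimal sketch of that experiment's bookkeeping, assuming a hypothetical train_and_eval helper that trains one configuration under the fixed protocol and returns its NYU Depth v2 test metrics:

```python
# Hold backbone and training protocol fixed; toggle one component at a time.
COMPONENTS = ("hpf", "ha_adapters", "crf_decoder")

def ablation_table(train_and_eval):
    rows = []
    full = train_and_eval(enabled=set(COMPONENTS))
    rows.append(("full model", full["abs_rel"], full["rmse"]))
    for removed in COMPONENTS:
        # Same backbone, optimizer, schedule, and augmentation; one module removed.
        metrics = train_and_eval(enabled=set(COMPONENTS) - {removed})
        rows.append((f"without {removed}", metrics["abs_rel"], metrics["rmse"]))
    return rows
```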

read the original abstract

Monocular depth estimation from a single RGB image remains a fundamental challenge in computer vision due to inherent scale ambiguity and the absence of explicit geometric cues. Existing approaches typically rely on increasingly complex network architectures to regress depth maps, which escalates training costs and computational overhead without fully exploiting inter-pixel spatial dependencies. We propose a multilevel perceptual conditional random field (CRF) model built upon the Swin Transformer backbone that addresses these limitations through three synergistic innovations: (1) an adaptive hybrid pyramid feature fusion (HPF) strategy that captures both short-range and long-range dependencies by combining multi-scale spatial pyramid pooling with biaxial feature aggregation, enabling effective integration of global and local contextual information; (2) a hierarchical awareness adapter (HA) that enriches cross-level feature interactions within the encoder through lightweight broadcast modules with learnable dimensional scaling, reducing computational complexity while enhancing representational capacity; and (3) a fully-connected CRF decoder with dynamic scaling attention that models fine-grained pixel-level spatial relationships, incorporating a bias learning unit to prevent extreme-value collapse and ensure stable training. Extensive experiments on NYU Depth v2, KITTI, and MatterPort3D datasets demonstrate that our method achieves state-of-the-art performance, reducing Abs Rel to 0.088 (−7.4%) and RMSE to 0.316 (−5.4%) on NYU Depth v2, while attaining near-perfect threshold accuracy (δ < 1.25³ ≈ 99.8%) on KITTI with only 194M parameters and 21ms inference time.
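For readers checking the numbers, Abs Rel, RMSE, and the δ thresholds are the community-standard depth metrics; a minimal NumPy implementation of their usual definitions (not taken from the paper's code):

```python
import numpy as np

def depth_metrics(pred, gt):
    """Standard monocular-depth metrics over pixels with valid ground truth."""
    pred, gt = np.asarray(pred, float), np.asarray(gt, float)
    valid = gt > 0                       # ignore pixels without ground truth
    pred, gt = pred[valid], gt[valid]
    abs_rel = np.mean(np.abs(pred - gt) / gt)          # absolute relative error
    rmse = np.sqrt(np.mean((pred - gt) ** 2))          # root mean squared error
    ratio = np.maximum(pred / gt, gt / pred)           # per-pixel ratio
    deltas = {f"delta{i}": np.mean(ratio < 1.25 ** i) for i in (1, 2, 3)}
    return {"abs_rel": abs_rel, "rmse": rmse, **deltas}
```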

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper proposes a monocular dense depth estimation architecture built on a Swin Transformer backbone. It introduces three components—an adaptive hybrid pyramid feature fusion (HPF) module, hierarchical awareness (HA) adapters, and a dynamic CRF decoder with bias learning—to better capture multi-scale spatial dependencies and pixel-level relationships. The central empirical claim is state-of-the-art performance on NYU Depth v2 (Abs Rel 0.088, RMSE 0.316), KITTI (99.8% δ<1.25³), and MatterPort3D, achieved with 194M parameters and 21 ms inference.

Significance. If the reported gains prove robust and attributable to the proposed modules rather than training or implementation details, the work would offer a practical advance in efficient, high-accuracy dense depth prediction by combining pyramid fusion, lightweight adapters, and CRF modeling within a transformer backbone.

major comments (2)
  1. [Experiments] Experiments section: no component-wise ablation tables are presented that add HPF, HA adapters, and the dynamic CRF decoder incrementally to a fixed Swin-Transformer baseline while holding training protocol, optimizer, and data augmentation constant. Without these controlled comparisons, the attribution of the −7.4% Abs Rel and −5.4% RMSE reductions specifically to the three innovations cannot be verified.
  2. [Abstract and §4] Abstract and §4: the reported metrics lack error bars, statistical significance tests, or multiple random-seed runs, making it impossible to determine whether the claimed SOTA deltas exceed the variability of the baseline.
minor comments (2)
  1. [Abstract] The abstract introduces the term 'multilevel perceptual conditional random field' without a brief equation or diagram clarifying how the dynamic CRF differs from a standard fully-connected CRF.
  2. [Method] Notation for the bias learning unit and dynamic scaling attention is introduced but not cross-referenced to the corresponding equations or pseudocode.

Simulated Author's Rebuttal

2 responses · 0 unresolved

Thank you for the constructive feedback on our manuscript. We appreciate the emphasis on rigorous empirical validation and will revise the paper to strengthen the experimental section accordingly.

read point-by-point responses
  1. Referee: [Experiments] Experiments section: no component-wise ablation tables are presented that add HPF, HA adapters, and the dynamic CRF decoder incrementally to a fixed Swin-Transformer baseline while holding training protocol, optimizer, and data augmentation constant. Without these controlled comparisons, the attribution of the −7.4% Abs Rel and −5.4% RMSE reductions specifically to the three innovations cannot be verified.

    Authors: We agree that incremental, controlled ablations are required to attribute gains specifically to the proposed modules. In the revised manuscript we will add a dedicated ablation table on NYU Depth v2 that begins with the unmodified Swin Transformer baseline and successively incorporates the HPF module, HA adapters, and dynamic CRF decoder while freezing the training protocol, optimizer, learning-rate schedule, and data augmentations. This will directly quantify the contribution of each component. revision: yes

  2. Referee: [Abstract and §4] Abstract and §4: the reported metrics lack error bars, statistical significance tests, or multiple random-seed runs, making it impossible to determine whether the claimed SOTA deltas exceed the variability of the baseline.

    Authors: We acknowledge the absence of statistical robustness measures. For the revision we will execute at least three independent training runs with different random seeds for our full model and the primary baselines. Mean values together with standard deviations will be reported for Abs Rel, RMSE, and the δ thresholds on both NYU Depth v2 and KITTI; this will allow readers to assess whether the reported improvements exceed typical run-to-run variability. revision: yes
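A sketch of that promised protocol, assuming a hypothetical run_training helper that performs one seeded training run and returns its test metrics:

```python
import statistics

def seed_report(run_training, seeds=(0, 1, 2), keys=("abs_rel", "rmse")):
    # One independent training run per seed; report mean and std per metric,
    # so claimed deltas can be compared against run-to-run variability.
    runs = [run_training(seed=s) for s in seeds]
    return {k: (statistics.mean(r[k] for r in runs),
                statistics.stdev(r[k] for r in runs)) for k in keys}
```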

Circularity Check

0 steps flagged

No circularity: empirical architecture validated on external benchmarks

full rationale

The paper proposes three architectural modules (adaptive hybrid pyramid feature fusion, hierarchical awareness adapters, dynamic CRF decoder) atop a Swin Transformer backbone and reports performance on NYU Depth v2, KITTI, and MatterPort3D. No mathematical derivation chain, equations, or fitted-parameter predictions exist in the provided text that could reduce to inputs by construction. Claims rest on standard benchmark metrics that are externally falsifiable and independent of any self-referential definitions or self-citation load-bearing steps. Absence of component ablations affects attribution strength but does not create circularity under the defined criteria.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 2 invented entities

The central claim depends on the unproven effectiveness of the newly introduced HA and HPF modules for capturing spatial dependencies; these are postulated without independent verification outside the reported experiments.

axioms (1)
  • domain assumption Swin Transformer backbone extracts sufficiently rich multi-scale features for depth regression
    The encoder is taken as given from prior literature and not re-derived.
invented entities (2)
  • Hierarchical Awareness Adapter (HA) no independent evidence
    purpose: Enrich cross-level feature interactions via lightweight broadcast modules
    New module introduced to reduce complexity while enhancing representation
  • Hybrid Pyramid Feature Fusion (HPF) no independent evidence
    purpose: Combine multi-scale spatial pyramid pooling with biaxial aggregation
    New fusion strategy for global and local context

pith-pipeline@v0.9.0 · 5586 in / 1408 out tokens · 49829 ms · 2026-05-13T19:58:16.216402+00:00 · methodology


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

Reference graph

Works this paper leans on

70 extracted references · 70 canonical work pages · 2 internal anchors

  1. [1]

    GPT-4 Technical Report

    Josh Achiam, Steven Adler, Sandhini Agarwal, Lama Ahmad, Ilge Akkaya, Florencia Leoni Aleman, Diogo Almeida, Janko Altenschmidt, Sam Altman, Shyamal Anadkat, et al. GPT-4 technical report. arXiv preprint arXiv:2303.08774, 2023

  2. [3]

    Bidirectional attention network for monocular depth estimation

    Shubhra Aich, Jean Marie Uwabeza Vianney, Md Amirul Islam, Maneet Kaur, and Bingbing Liu. Bidirectional attention network for monocular depth estimation. In Proceedings of the IEEE International Conference on Robotics and Automation (ICRA), pages 11746–11752. IEEE, 2021. doi: 10.1109/ICRA48506.2021.9560885

  3. [4]

    Adabins: Depth estimation using adaptive bins

    Shariq Farooq Bhat, Ibraheem Alhashim, and Peter Wonka. Adabins: Depth estimation using adaptive bins. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 4008–4017. IEEE, 2021. doi: 10.1109/CVPR46437.2021.00400

  4. [5]

    Localbins: Improving depth estimation by learning local distributions

    Shariq Farooq Bhat, Ibraheem Alhashim, and Peter Wonka. Localbins: Improving depth estimation by learning local distributions. In Proceedings of the 17th European Conference on Computer Vision (ECCV), pages 480–496. Springer, 2022. doi: 10.1007/978-3-031-19769-7_28

  5. [6]

    Zoedepth: Zero-shot transfer by combining relative and metric depth

    Shariq Farooq Bhat, Reiner Birkl, Diana Wofk, Peter Wonka, and Matthias Müller. Zoedepth: Zero-shot transfer by combining relative and metric depth. arXiv preprint arXiv:2302.12288, 2023

  6. [7]

    Matterport3d: Learning from rgb-d data in indoor environments

    Angel Chang, Angela Dai, Thomas Funkhouser, Maciej Halber, Matthias Nießner, Manolis Savva, Shuran Song, Andy Zeng, and Yinda Zhang. Matterport3d: Learning from rgb-d data in indoor environments. In Proceedings of the International Conference on 3D Vision (3DV), pages 667–

  7. [9]

    Encoder-decoder with atrous separable convolution for semantic image segmentation

    Liang-Chieh Chen, Yukun Zhu, George Papandreou, Florian Schroff, and Hartwig Adam. Encoder-decoder with atrous separable convolution for semantic image segmentation. In Proceedings of the 15th European Conference on Computer Vision (ECCV), pages 833–851. Springer, 2018. doi: 10.1007/978-3-030-01234-2_49

  8. [10]

    Stereo matching algorithm based on edge preservation and improved cost aggregation

    Deqiang Cheng, Haixiang Li, Qiqi Kou, Zekuan Yu, Huandong Zhuang, and Chen Lyu. Stereo matching algorithm based on edge preservation and improved cost aggregation. Journal of Image and Graphics, 26(2):438–451, 2021. doi: 10.11834/jig.200041

  9. [11]

    TC-Padé: Trajectory-consistent Padé approximation for diffusion acceleration

    Benlei Cui, Shengqu He, Bohan Huang, Zhenyu Ye, Yongwei Sun, Lei Huang, Hanwen Xue, Yihua Yang, Jingqun Tang, et al. TC-Padé: Trajectory-consistent Padé approximation for diffusion acceleration. arXiv preprint arXiv:2603.02943, 2026

  10. [12]

    An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale

    Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, Jakob Uszkoreit, and Neil Houlsby. An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929, 2021

  11. [13]

    Depth map prediction from a single image using a multi-scale deep network

    David Eigen, Christian Puhrsch, and Rob Fergus. Depth map prediction from a single image using a multi-scale deep network. In Proceedings of the 27th International Conference on Neural Information Processing Systems (NeurIPS), pages 2366–2374. MIT Press, 2014

  12. [14]

    Advancing sequential numerical prediction in autoregressive models

    Xiaoyu Fei, Jinghui Lu, Qiushi Sun, Hao Feng, Yanjie Wang, Wenqiang Shi, Anlei Wang, Jingqun Tang, and Can Huang. Advancing sequential numerical prediction in autoregressive models. In Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics, 2025

  13. [15]

    UniDoc: A universal large multimodal model for simultaneous text detection, recognition, spotting and understanding

    Hao Feng, Zijian Wang, Jingqun Tang, Jinghui Lu, Wengang Zhou, Houqiang Li, and Can Huang. UniDoc: A universal large multimodal model for simultaneous text detection, recognition, spotting and understanding. arXiv preprint arXiv:2308.11592, 2023

  14. [16]

    Docpedia: Unleashing the power of large multimodal model in the frequency domain for versatile document understanding

    Hao Feng, Qi Liu, Hao Liu, Jingqun Tang, Wengang Zhou, Houqiang Li, and Can Huang. Docpedia: Unleashing the power of large multimodal model in the frequency domain for versatile document understanding. Science China Information Sciences, 67(12):1–14, 2024

  15. [17]

    Dolphin: Document image parsing via heterogeneous anchor prompting

    Hao Feng, Shu Wei, Xiaoyu Fei, Wenqiang Shi, Yan Han, Lei Liao, Jinghui Lu, Binghong Wu, Qi Liu, Chunhui Lin, Jingqun Tang, and Can Huang. Dolphin: Document image parsing via heterogeneous anchor prompting. In Findings of the Association for Computational Linguistics: ACL 2025, pages 21919–21936, 2025

  16. [18]

    Dolphin-v2: Universal document parsing via scalable anchor prompting

    Hao Feng, Wenqiang Shi, Kaixin Zhang, Xiaoyu Fei, Lei Liao, Dian Yang, Yuhui Du, Xiaocong Wu, Jingqun Tang, Yuliang Liu, et al. Dolphin-v2: Universal document parsing via scalable anchor prompting. arXiv preprint arXiv:2602.05384, 2026

  17. [19]

    OCRBench v2: An improved benchmark for evaluating large multimodal models on visual text localization and reasoning

    Ling Fu, Zhebin Kuang, Jiajun Song, Mingxin Huang, Biao Yang, Yuzhe Li, Linghao Zhu, Qidi Luo, Xinyu Wang, Hao Lu, Guozhi Tang, Bin Shan, Chunhui Lin, Qi Liu, Binghong Wu, Hao Feng, Hao Liu, Can Huang, Jingqun Tang, Wei Chen, Lianwen Jin, Yuliang Liu, and Xiang Bai. OCRBench v2: An improved benchmark for evaluating large multimodal models on visual text localization and reasoning. arXiv preprint arXiv:2501.00321, 2024

  18. [20]

    Are we ready for autonomous driving? the kitti vision benchmark suite

    Andreas Geiger, Philip Lenz, and Raquel Urtasun. Are we ready for autonomous driving? the kitti vision benchmark suite. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 3354–3361. IEEE, 2012. doi: 10.1109/CVPR.2012.6248074

  19. [21]

    Sparse auxiliary networks for unified monocular depth prediction and completion

    Vitor Guizilini, Rares Ambrus, Wolfram Burgard, and Adrien Gaidon. Sparse auxiliary networks for unified monocular depth prediction and completion. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 11073–11083. IEEE, 2021. doi: 10.1109/CVPR46437.2021.01093

  20. [22]

    Detail preserving depth estimation from a single image using attention guided networks

    Zhixiang Hao, Yu Li, Shaodi You, and Feng Lu. Detail preserving depth estimation from a single image using attention guided networks. In Proceedings of the International Conference on 3D Vision (3DV), pages 304–313. IEEE, 2018. doi: 10.1109/3DV.2018.00043

  21. [23]

    Denoising diffusion probabilistic models

    Jonathan Ho, Ajay Jain, and Pieter Abbeel. Denoising diffusion probabilistic models. Advances in Neural Information Processing Systems, 33:6840–6851, 2020

  22. [24]

    Rvmde: Radar validated monocular depth estimation for robotics

    Muhammad Ishfaq Hussain, Muhammad Ahsan Rafique, and Moongu Jeon. Rvmde: Radar validated monocular depth estimation for robotics. arXiv preprint arXiv:2109.05265, 2022

  23. [25]

    Guiding monocular depth estimation using depth-attention volume

    Lam Huynh, Phong Nguyen-Ha, Jiri Matas, Esa Rahtu, and Janne Heikkilä. Guiding monocular depth estimation using depth-attention volume. In Proceedings of the 16th European Conference on Computer Vision (ECCV), pages 581–597. Springer, 2020. doi: 10.1007/978-3-030-58574-7_35

  24. [26]

    Ddp: Diffusion model for dense visual prediction

    Yuanfeng Ji, Zhe Chen, Enze Xie, Lanqing Hong, Xihui Liu, Zhaoqiang Liu, Tong Lu, Zhenguo Li, and Ping Luo. Ddp: Diffusion model for dense visual prediction. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), pages 21684–21695. IEEE, 2023. doi: 10.1109/ICCV51070.2023.01987

  25. [27]

    MEML-GRPO: Heterogeneous multi-expert mutual learning for RLVR advancement

    Wei Jia, Jinghui Lu, Haiyang Yu, Siqi Wang, Guozhi Tang, Anlei Wang, Weirui Yin, Dian Yang, Yicheng Nie, et al. MEML-GRPO: Heterogeneous multi-expert mutual learning for RLVR advancement. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 40, page 31283, 2026

  26. [28]

    Evp: Enhanced visual perception using inverse multi-attentive feature refinement and regularized image-text alignment

    Mykola Lavreniuk, Shariq Farooq Bhat, Matthias Müller, and Peter Wonka. Evp: Enhanced visual perception using inverse multi-attentive feature refinement and regularized image-text alignment. arXiv preprint arXiv:2312.08548, 2023

  27. [29]

    From big to small: Multi-scale local planar guidance for monocular depth estimation

    Jin Han Lee, Myung-Kyu Han, Dong Wook Ko, and Il Hong Suh. From big to small: Multi-scale local planar guidance for monocular depth estimation. arXiv preprint arXiv:1907.10326, 2021

  28. [30]

    Patch-wise attention network for monocular depth estimation

    Seungyoung Lee, Juhyung Lee, Byeongkeun Kim, Eojindl Yi, and Junmo Kim. Patch-wise attention network for monocular depth estimation. In Proceedings of the 35th AAAI Conference on Artificial Intelligence, pages 1873–1881. AAAI, 2021. doi: 10.1609/aaai.v35i3.16282

  29. [31]

    Binsformer: Revisiting adaptive bins for monocular depth estimation

    Zhenyu Li, Xuyang Wang, Xianming Liu, and Junjun Jiang. Binsformer: Revisiting adaptive bins for monocular depth estimation. IEEE Transactions on Image Processing, 33:3964–3976, 2022. doi: 10.1109/TIP.2024.3416065

  30. [32]

    Depthformer: Exploiting long-range correlation and local information for accurate monocular depth estimation

    Zhenyu Li, Zehui Chen, Xianming Liu, and Junjun Jiang. Depthformer: Exploiting long-range correlation and local information for accurate monocular depth estimation. Machine Intelligence Research, 20(6):837–854, 2023. doi: 10.1007/s11633-023-1458-0

  31. [33]

    SPTS v2: Single-point scene text spotting

    Yuliang Liu, Jiaxin Zhang, Dezhi Peng, Mingxin Huang, Xinyu Wang, Jingqun Tang, Can Huang, Dahua Lin, Chunhua Shen, Xiang Bai, and Lianwen Jin. SPTS v2: Single-point scene text spotting. IEEE Transactions on Pattern Analysis and Machine Intelligence, 45(12):15038–15055, 2023. doi: 10.1109/TPAMI.2023.3312285

  32. [34]

    Swin transformer: Hierarchical vision transformer using shifted windows

    Ze Liu, Yutong Lin, Yue Cao, Han Hu, Yixuan Wei, Zheng Zhang, Stephen Lin, and Baining Guo. Swin transformer: Hierarchical vision transformer using shifted windows. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), pages 9992–10002. IEEE, 2021. doi: 10.1109/ICCV48922.2021.00986

  33. [35]

    Adaptive surface normal constraint for depth estimation

    Xiaoxiao Long, Cheng Lin, Lingjie Liu, Wei Li, Christian Theobalt, Ruigang Yang, and Wenping Wang. Adaptive surface normal constraint for depth estimation. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), pages 12829–12838. IEEE, 2021. doi: 10.1109/ICCV48922.2021.01261

  34. [36]

    A bounding box is worth one token: Interleaving layout and text in a large language model for document understanding

    Jinghui Lu, Haiyang Yu, Yanjie Wang, Yongjie Ye, Jingqun Tang, Ziwei Yang, Binghong Wu, Qi Liu, Hao Feng, Han Wang, et al. A bounding box is worth one token: Interleaving layout and text in a large language model for document understanding. In Findings of the Association for Computational Linguistics: ACL 2025, pages 7252–7273, 2025

  35. [37]

    Prolonged reasoning is not all you need: Certainty-based adaptive routing for efficient LLM/MLLM reasoning

    Jinghui Lu, Haiyang Yu, Shuai Xu, Shuangquan Ran, Guozhi Tang, Siqi Wang, Bin Shan, Tongji Fu, Hao Feng, Jingqun Tang, et al. Prolonged reasoning is not all you need: Certainty-based adaptive routing for efficient LLM/MLLM reasoning. arXiv preprint arXiv:2505.15154, 2025

  36. [38]

    ChineseVideoBench: Benchmarking multi-modal large models for Chinese video question answering

    Yicheng Nie, Haotian Wang, Yongjie Ye, Haiyang Yu, Wei Jia, Tao Zeng, Hao Feng, Xiaoyu Fei, Yichao Li, Xiaotao Lv, Jingqun Tang, et al. ChineseVideoBench: Benchmarking multi-modal large models for Chinese video question answering. arXiv preprint arXiv:2511.18399, 2025

  37. [39]

    P3depth: Monocular depth estimation with a piecewise planarity prior

    Vaishakh Patil, Christos Sakaridis, Alexander Liniger, and Luc Van Gool. P3depth: Monocular depth estimation with a piecewise planarity prior. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 1600–1611. IEEE, 2022. doi: 10.1109/CVPR52688.2022.00166

  38. [40]

    Ecodepth: Effective conditioning of diffusion models for monocular depth estimation

    Suraj Patni, Aradhye Agarwal, and Chetan Arora. Ecodepth: Effective conditioning of diffusion models for monocular depth estimation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 28285–28295. IEEE, 2024. doi: 10.1109/CVPR52733.2024.02672

  39. [41]

    Vision transformers for dense prediction

    René Ranftl, Alexey Bochkovskiy, and Vladlen Koltun. Vision transformers for dense prediction. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), pages 12159–12168. IEEE, 2021. doi: 10.1109/ICCV48922.2021.01196

  40. [42]

    You only look once: Unified, real-time object detection

    Joseph Redmon, Santosh Divvala, Ross Girshick, and Ali Farhadi. You only look once: Unified, real-time object detection. arXiv preprint arXiv:1506.02640, 2016. doi: 10.48550/arXiv.1506.02640

  41. [43]

    High-resolution image synthesis with latent diffusion models

    Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, and Björn Ommer. High-resolution image synthesis with latent diffusion models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 10684–10695, 2022

  42. [44]

    Trends and prospects of techniques for haze removal from degraded images: A survey

    Gaurav Sahu, Ayan Seal, Debotosh Bhattacharjee, Mita Nasipuri, Peter Brida, and Ondrej Krejcar. Trends and prospects of techniques for haze removal from degraded images: A survey. IEEE Transactions on Emerging Topics in Computational Intelligence, 6(4):762–782, 2022. doi: 10.1109/TETCI.2022.3173443

  43. [45]

    MCTBench: Multimodal cognition towards text-rich visual scenes benchmark

    Bin Shan, Xiaoyu Fei, Wenqiang Shi, Anlei Wang, Guozhi Tang, Lei Liao, Jingqun Tang, Xiang Bai, and Can Huang. MCTBench: Multimodal cognition towards text-rich visual scenes benchmark. arXiv preprint arXiv:2410.11538, 2024

  44. [46]

    A review of monocular depth estimation techniques based on deep learning

    Wei Song, Mengfei Zhu, Minghua Zhang, Danfeng Zhao, and Qi He. A review of monocular depth estimation techniques based on deep learning. Journal of Image and Graphics, 27(2):292–328. doi: 10.11834/jig.210554

  46. [48]

    Optimal boxes: Boosting end-to-end scene text recognition by adjusting annotated bounding boxes via reinforcement learning

    Jingqun Tang, Wenqing Qian, Luchuan Song, Xiaozhong Dong, Lanfang Li, and Xiang Bai. Optimal boxes: Boosting end-to-end scene text recognition by adjusting annotated bounding boxes via reinforcement learning. In European Conference on Computer Vision, pages 233–248. Springer, 2022

  47. [49]

    You can even annotate text with voice: Transcription-only-supervised text spotting

    Jingqun Tang, Su Qiao, Benlei Cui, Yuhang Ma, Sheng Zhang, and Dimitrios Kanoulas. You can even annotate text with voice: Transcription-only-supervised text spotting. In Proceedings of the 30th ACM International Conference on Multimedia, pages 4154–4163, 2022. doi: 10.1145/3503161.3547787

  48. [50]

    Few could be better than all: Feature sampling and grouping for scene text detection

    Jingqun Tang, Wenqing Zhang, Hongye Liu, MingKun Yang, Bo Jiang, Guanglong Hu, and Xiang Bai. Few could be better than all: Feature sampling and grouping for scene text detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 4563–4572, 2022

  49. [51]

    Character recognition competition for street view shop signs

    Jingqun Tang, Weijia Du, Bin Wang, Wengang Zhou, Song Mei, Tong Xue, Xin Xu, and Hao Zhang. Character recognition competition for street view shop signs. National Science Review, 10(6):nwad141, 2023. doi: 10.1093/nsr/nwad141

  50. [52]

    TextSquare: Scaling up text-centric visual instruction tuning

    Jingqun Tang, Chunhui Lin, Zhen Zhao, Shu Wei, Binghong Wu, Qi Liu, Hao Feng, Yang Li, Siqi Wang, Lei Liao, et al. TextSquare: Scaling up text-centric visual instruction tuning. arXiv preprint arXiv:2404.12803, 2024

  51. [53]

    MTVQA: Benchmarking multilingual text-centric visual question answering

    Jingqun Tang, Qi Liu, Yongjie Ye, Jinghui Lu, Shu Wei, Chunhui Lin, Wanqing Li, Mohamad Fitri Faiz Bin Mahmood, Hao Feng, Zhen Zhao, et al. MTVQA: Benchmarking multilingual text-centric visual question answering. arXiv preprint arXiv:2405.11985, 2024

  52. [54]

    Hitnet: Hierarchical iterative tile refinement network for real-time stereo matching

    Vladimir Tankovich, Christian Häne, Yinda Zhang, Adarsh Kowdle, Sean Fanello, and Sofien Bouaziz. Hitnet: Hierarchical iterative tile refinement network for real-time stereo matching. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 14357–14367. IEEE, 2021. doi: 10.1109/CVPR46437.2021.01413

  53. [55]

    Pargo: Bridging vision-language with partial and global views

    An-Lan Wang, Bin Shan, Wei Shi, Kun-Yu Lin, Xiang Fei, Guozhi Tang, Lei Liao, Jingqun Tang, Can Huang, and Wei-Shi Zheng. Pargo: Bridging vision-language with partial and global views. 39(7):7491–7499, 2025

  54. [56]

    WildDoc: How far are we from achieving comprehensive and robust document understanding in the wild?

    Anlei Wang, Jingqun Tang, Lei Liao, Hao Feng, Qi Liu, Xiaoyu Fei, Jinghui Lu, Han Wang, Hao Liu, Yuliang Liu, et al. WildDoc: How far are we from achieving comprehensive and robust document understanding in the wild? In Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing, 2025

  55. [57]

    Vision as LoRA

    Haotian Wang, Yongjie Ye, Bingning Li, Yicheng Nie, Jinghui Lu, Jingqun Tang, Yaxing Wang, and Can Huang. Vision as LoRA. arXiv preprint arXiv:2503.20680, 2025

  56. [58]

    Pyramid vision transformer: A versatile backbone for dense prediction without convolutions

    Wenhai Wang, Enze Xie, Xiang Li, Deng-Ping Fan, Kaitao Song, Ding Liang, Tong Lu, Ping Luo, and Ling Shao. Pyramid vision transformer: A versatile backbone for dense prediction without convolutions. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), pages 548–558. IEEE, 2021. doi: 10.1109/ICCV48922.2021.00061

  57. [59]

    Vision transformer with deformable attention

    Zhuofan Xia, Xuran Pan, Shiji Song, Li Erran Li, and Gao Huang. Vision transformer with deformable attention. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 4784–4793. IEEE, 2022. doi: 10.1109/CVPR52688.2022.00475

  58. [60]

    Transformer-based attention networks for continuous pixel-wise prediction

    Guanglei Yang, Hao Tang, Mingli Ding, Nicu Sebe, and Elisa Ricci. Transformer-based attention networks for continuous pixel-wise prediction. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), pages 16249–16259. IEEE, 2021. doi: 10.1109/ICCV48922.2021.01596

  59. [61]

    Polymax: General dense prediction with mask transformer

    Xuan Yang, Liangzhe Yuan, Kimberly Wilber, Astuti Sharma, Xiuye Gu, Siyuan Qiao, Stephanie Debats, Huisheng Wang, Hartwig Adam, Mikhail Sirotenko, and Liang-Chieh Chen. Polymax: General dense prediction with mask transformer. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision (WACV), pages 1039–1050. IEEE, 2024. doi: 10.1...

  60. [62]

    Futuredepth: Learning to predict the future improves video depth estimation

    Rajeev Yasarla, Manish Kumar Singh, Hong Cai, Yunxiao Shi, Jisoo Jeong, Yinghui Zhu, Song Zhe Han, Risheek Garrepalli, and Fatih Porikli. Futuredepth: Learning to predict the future improves video depth estimation. arXiv preprint arXiv:2403.12953, 2024

  61. [63]

    Enforcing geometric constraints of virtual normal for depth prediction

    Wei Yin, Yifan Liu, Chunhua Shen, and Youliang Yan. Enforcing geometric constraints of virtual normal for depth prediction. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), pages 5683–5692. IEEE, 2019. doi: 10.1109/ICCV.2019.00578

  62. [64]

    Benchmarking vision-language models on Chinese ancient documents: From OCR to knowledge reasoning

    Haiyang Yu, Yongping Wu, Feifan Shi, Lei Liao, Jinghui Lu, Xiaocui Ge, Han Wang, Mengfei Zhuo, Xiaocong Wu, Xiaoyu Fei, Jingqun Tang, et al. Benchmarking vision-language models on Chinese ancient documents: From OCR to knowledge reasoning. arXiv preprint arXiv:2509.09731, 2025

  63. [65]

    Neural window fully-connected crfs for monocular depth estimation

    Weihao Yuan, Xiaodong Gu, Zuozhuo Dai, Siyu Zhu, and Ping Tan. Neural window fully-connected crfs for monocular depth estimation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 3906–3915. IEEE, 2022. doi: 10.1109/CVPR52688.2022.00389

  64. [66]

    Lite-mono: A lightweight cnn and transformer architecture for self-supervised monocular depth estimation

    Ning Zhang, Francesco Nex, George Vosselman, and Norman Kerle. Lite-mono: A lightweight cnn and transformer architecture for self-supervised monocular depth estimation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 18537–18546. IEEE, 2023. doi: 10.1109/CVPR52729.2023.01778

  65. [67]

    Improving deep regression with ordinal entropy

    Shihao Zhang, Linlin Yang, Michael B. Mi, Xiaoxiao Zheng, and Angela Yao. Improving deep regression with ordinal entropy. arXiv preprint arXiv:2301.08915, 2023

  66. [68]

    Blind image quality assessment via vision-language correspondence: A multitask learning perspective

    Weixia Zhang, Guangtao Zhai, Ying Wei, Xiaokang Yang, and Kede Ma. Blind image quality assessment via vision-language correspondence: A multitask learning perspective. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 14071–14081, 2023

  67. [69]

    TabPedia: Towards comprehensive visual table understanding with concept synergy

    Weichao Zhao, Hao Feng, Qi Liu, Jingqun Tang, Shu Wei, Binghong Wu, Lei Liao, Yongjie Ye, Hao Liu, Wengang Zhou, Houqiang Li, and Can Huang. TabPedia: Towards comprehensive visual table understanding with concept synergy. In Advances in Neural Information Processing Systems, volume 37, 2024

  68. [70]

    Multi-modal in-context learning makes an ego-evolving scene text recognizer

    Zhen Zhao, Jingqun Tang, Chunhui Lin, Binghong Wu, Can Huang, Hao Liu, Xin Tan, Zhizhong Zhang, and Yuan Xie. Multi-modal in-context learning makes an ego-evolving scene text recognizer. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 15567–15576, 2024

  69. [71]

    Harmonizing visual text comprehension and generation

    Zhen Zhao, Jingqun Tang, Binghong Wu, Chunhui Lin, Shu Wei, Hao Liu, Xin Tan, Zhizhong Zhang, Can Huang, and Yuan Xie. Harmonizing visual text comprehension and generation. arXiv preprint arXiv:2407.16364, 2024

  70. [72]

    Textpecker: Rewarding structural anomaly quantification for enhancing visual text rendering

    Hanshen Zhu, Yuliang Liu, Xuecheng Wu, An-Lan Wang, Hao Feng, Dingkang Yang, Chao Feng, Can Huang, Jingqun Tang, and Xiang Bai. Textpecker: Rewarding structural anomaly quantification for enhancing visual text rendering. arXiv preprint arXiv:2602.20903, 2026