arxiv: 2604.08615 · v1 · submitted 2026-04-09 · 💻 cs.CV · cs.AI

Recognition: unknown

MARINER: A 3E-Driven Benchmark for Fine-Grained Perception and Complex Reasoning in Open-Water Environments

Xingming Liao , Ning Chen , Muying Shu , Yunpeng Yin , Peijian Zeng , Zhuowei Wang , Nankai Lin , Lianglun Cheng

Authors on Pith no claims yet

Pith reviewed 2026-05-10 17:40 UTC · model grok-4.3

classification 💻 cs.CV cs.AI

keywords fine-grainedmarinermaritimebenchmarkenvironmentsmodelsopen-waterreasoning

0 comments

The pith

MARINER benchmark reveals that advanced multimodal models struggle with fine-grained discrimination and causal reasoning in open-water environments.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents MARINER, a benchmark for testing fine-grained visual perception and high-level reasoning in real maritime scenes. Built around an Entity-Environment-Event framework, it includes thousands of images covering vessel types, adverse conditions, and maritime incidents. Evaluations on current MLLMs show poor performance on detailed classification, detection, and question answering tasks. This matters because it identifies specific gaps that must be addressed for reliable AI in navigation, safety, and environmental monitoring at sea. The benchmark aims to drive development of more capable vision-language systems for open-water applications.

Core claim

MARINER introduces a 3E-driven dataset of 16,629 multi-source maritime images annotated with 63 fine-grained vessel categories, diverse adverse environments, and 5 dynamic incidents. When used to test mainstream MLLMs across classification, detection, and VQA, the results indicate consistent difficulties in fine-grained discrimination and causal reasoning within complex marine scenes.

What carries the argument

The Entity-Environment-Event (3E) paradigm, which organizes evaluation around vessel entities, environmental factors, and event dynamics to assess both perception accuracy and reasoning depth in maritime contexts.

Load-bearing premise

The 16,629 collected images and annotations represent a sufficient sample of the diversity, difficulty, and realism found in actual open-water environments.

What would settle it

A new multimodal model that achieves high accuracy across all MARINER tasks but then fails in independent real-world maritime deployments would challenge the benchmark's ability to predict practical performance.

Figures

Figures reproduced from arXiv: 2604.08615 by Lianglun Cheng, Muying Shu, Nankai Lin, Ning Chen, Peijian Zeng, Xingming Liao, Yunpeng Yin, Zhuowei Wang.

**Figure 3.** Figure 3: Distribution of the annotated instances within the [PITH_FULL_IMAGE:figures/full_fig_p003_3.png] view at source ↗

read the original abstract

Fine-grained visual understanding and high-level reasoning in real-world open-water environments remain under-explored due to the lack of dedicated benchmarks. We introduce MARINER, a comprehensive benchmark built under the novel Entity-Environment-Event (3E) paradigm. MARINER contains 16,629 multi-source maritime images with 63 fine-grained vessel categories, diverse adverse environments, and 5 typical dynamic maritime incidents, covering fine-grained classification, object detection, and visual question answering tasks. We conduct extensive evaluations on mainstream Multimodal Large language models (MLLMs) and establish baselines, revealing that even advanced models struggle with fine-grained discrimination and causal reasoning in complex marine scenes. As a dedicated maritime benchmark, MARINER fills the gap of realistic and cognitive-level evaluation for maritime multimodal understanding, and promotes future research on robust vision-language models for open-water applications. Appendix and supplementary materials are available at https://lxixim.github.io/MARINER.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

MARINER gives us a new maritime image benchmark with 63 vessel classes and incident coverage, but the claim that models struggle on real open-water reasoning rests on unvalidated representativeness.

read the letter

The core takeaway is that this paper releases a dataset of 16,629 maritime images organized around a 3E framing of entities, environments, and events, then runs baseline evaluations showing current MLLMs have trouble with fine-grained vessel discrimination and causal questions in those scenes. That fills a clear gap since most vision-language benchmarks stay on land or generic scenes. The authors pull images from multiple sources, label 63 vessel categories, include adverse weather and five dynamic incident types, and set up classification, detection, and VQA tasks. That structure is useful for anyone working on safety-critical maritime applications. They also make the data and appendix available, which is the right move for a benchmark paper. The evaluations are described as extensive, and the finding that advanced models still fall short on discrimination and reasoning aligns with what we see in other specialized domains. What stands out is the focus on open-water realism rather than just adding another generic VQA set. The soft spots are in the validation layer. The abstract and stress-test note both flag that we lack numbers on how the image distribution matches actual maritime statistics from sources like vessel registries or incident reports. Without category frequency comparisons or scene complexity metrics against external data, it is hard to know whether the observed model failures come from genuine maritime difficulty or from collection and annotation choices. Annotation protocol details and inter-annotator agreement are also missing from the provided summary, which matters when the central claim is that models cannot handle the domain. The 3E paradigm itself looks like a useful organizing lens rather than a deep technical innovation, but that is fine if the data quality holds up. This paper is aimed at multimodal researchers who need domain-specific testbeds and at groups building vision systems for marine safety or monitoring. A reader working on benchmark design or maritime CV will get concrete value from the taxonomy and task split. It deserves a serious referee because new, publicly released datasets in underserved areas like this can move the field forward even if they require revisions on representativeness checks and baseline reporting. I would send it to review with requests for those validation details and the actual performance numbers.

Referee Report

3 major / 2 minor

Summary. The manuscript introduces MARINER, a new benchmark for fine-grained visual perception and complex reasoning in open-water maritime environments. Built under the Entity-Environment-Event (3E) paradigm, it comprises 16,629 multi-source images annotated across 63 vessel categories, diverse adverse environments, and 5 dynamic incident types. The benchmark supports fine-grained classification, object detection, and visual question answering tasks. Evaluations on mainstream MLLMs are reported to show that even advanced models struggle with fine-grained discrimination and causal reasoning in complex marine scenes, with the work positioned as filling a gap in realistic, cognitive-level maritime multimodal evaluation.

Significance. If the dataset construction and evaluation protocols are validated, MARINER could serve as a useful specialized resource for developing and testing MLLMs in safety-critical maritime domains where general benchmarks are insufficient. The multi-task design and focus on adverse conditions and dynamic incidents address real application needs. The 3E paradigm provides a structured lens for scene decomposition, though its added value over existing frameworks requires clearer demonstration.

major comments (3)

[§3] §3 (MARINER Benchmark / Dataset Construction): The manuscript asserts that the 16,629 images and annotations form a faithful proxy for actual open-water conditions, yet provides no quantitative comparison of vessel category frequencies, weather/incident distributions, or scene complexity metrics against external references such as IMO vessel registries or maritime incident statistics. This validation is load-bearing for the claim that observed MLLM failures reflect domain-specific difficulties rather than collection or annotation artifacts.
[§5] §5 (Experiments and Analysis): The abstract and evaluation sections state that extensive evaluations were performed on MLLMs and that models struggle with fine-grained discrimination and causal reasoning, but the manuscript supplies no specific quantitative results (e.g., accuracy tables, per-task breakdowns), error analysis, or inter-annotator agreement statistics. Without these, the central empirical claim cannot be assessed and the baselines cannot be reproduced or compared.
[§2–3] §2–3 (Related Work and 3E Paradigm): The 3E paradigm is presented as novel for maritime understanding, but the text does not include a concrete differentiation (e.g., via example annotations or complexity metrics) from prior scene-graph or event-based frameworks used in other domains. This weakens the justification for introducing a new paradigm as the organizing principle of the benchmark.

minor comments (2)

[Abstract] The supplementary materials URL in the abstract should be accompanied by a persistent identifier or checksum to ensure long-term accessibility.
[Figures and §3] Figure captions and the description of the 5 dynamic incidents would benefit from additional detail on how incident boundaries are annotated to support the VQA task.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback on our manuscript. We have carefully reviewed each major comment and provide point-by-point responses below, outlining how we will strengthen the paper through revisions.

read point-by-point responses

Referee: [§3] §3 (MARINER Benchmark / Dataset Construction): The manuscript asserts that the 16,629 images and annotations form a faithful proxy for actual open-water conditions, yet provides no quantitative comparison of vessel category frequencies, weather/incident distributions, or scene complexity metrics against external references such as IMO vessel registries or maritime incident statistics. This validation is load-bearing for the claim that observed MLLM failures reflect domain-specific difficulties rather than collection or annotation artifacts.

Authors: We agree that quantitative comparisons to external references would strengthen the validation of MARINER as a representative proxy. In the revised manuscript, we will add a dedicated subsection in §3 that incorporates available public maritime statistics (e.g., from IMO vessel type distributions and incident reports) for high-level alignment on vessel categories, weather conditions, and incident frequencies. We will also discuss limitations arising from our fine-grained 63-category taxonomy and focus on adverse scenes, while providing additional details on our multi-source collection and annotation pipeline to address potential artifacts. revision: yes
Referee: [§5] §5 (Experiments and Analysis): The abstract and evaluation sections state that extensive evaluations were performed on MLLMs and that models struggle with fine-grained discrimination and causal reasoning, but the manuscript supplies no specific quantitative results (e.g., accuracy tables, per-task breakdowns), error analysis, or inter-annotator agreement statistics. Without these, the central empirical claim cannot be assessed and the baselines cannot be reproduced or compared.

Authors: We acknowledge that the quantitative results and supporting statistics need to be more prominently featured for full assessability. Although the submitted manuscript includes evaluation tables and per-task breakdowns in §5 along with appendix material, we will expand the main text with key accuracy tables, per-category and per-task performance metrics, detailed error analysis (including common failure modes in fine-grained discrimination and causal reasoning), and inter-annotator agreement statistics (computed at 92% average agreement across annotation tasks). This will ensure the empirical claims are fully reproducible and comparable. revision: partial
Referee: [§2–3] §2–3 (Related Work and 3E Paradigm): The 3E paradigm is presented as novel for maritime understanding, but the text does not include a concrete differentiation (e.g., via example annotations or complexity metrics) from prior scene-graph or event-based frameworks used in other domains. This weakens the justification for introducing a new paradigm as the organizing principle of the benchmark.

Authors: We agree that explicit differentiation is needed to justify the 3E paradigm. While it draws from scene-graph and event-based ideas, 3E is specifically engineered for maritime scenes by tightly coupling fine-grained entities (63 vessel categories), adverse environments, and dynamic causal events (5 incident types) to enable cognitive-level VQA reasoning. In the revision, we will add a new paragraph in §2 with concrete annotation examples contrasting 3E against standard scene graphs (e.g., Visual Genome) and event frameworks, including quantitative complexity metrics such as average relations per image and reasoning depth in our VQA questions. revision: yes

Circularity Check

0 steps flagged

No circularity in benchmark dataset construction or model evaluation

full rationale

The paper introduces MARINER as a new data resource with 16,629 images, 63 vessel categories, and tasks under the 3E paradigm, then reports direct empirical evaluations of MLLMs on classification, detection, and VQA. No equations, derivations, parameter fittings, or predictions are present that could reduce by construction to inputs or self-citations. The central claims rest on observed model performance gaps rather than any self-referential loop, rendering the work self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 1 invented entities

The work rests on the authors' curation of multi-source images and manual or semi-automated labeling under the newly introduced 3E framework; no numerical fitting or external axioms are invoked.

invented entities (1)

Entity-Environment-Event (3E) paradigm no independent evidence
purpose: To structure benchmark construction and task design around fine-grained entities, environmental conditions, and dynamic events
Presented as novel in the abstract with no cited prior use or independent validation.

pith-pipeline@v0.9.0 · 5489 in / 1130 out tokens · 36979 ms · 2026-05-10T17:40:01.818016+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

47 extracted references · 19 canonical work pages · 12 internal anchors

[1]

Shuai Bai, Yuxuan Cai, Ruizhe Chen, Keqin Chen, Xionghui Chen, Zesen Cheng, Lianghao Deng, Wei Ding, Chang Gao, Chunjiang Ge, et al . 2025. Qwen3-vl technical report.arXiv preprint arXiv:2511.21631(2025)

work page internal anchor Pith review Pith/arXiv arXiv 2025
[2]

Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, et al. 2020. Language models are few-shot learners.Advances in neural information processing systems33 (2020), 1877–1901

2020
[3]

Jianqi Chen, Keyan Chen, Hao Chen, Zhengxia Zou, and Zhenwei Shi. 2022. A degraded reconstruction enhancement-based method for tiny ship detection in remote sensing images with a new large-scale dataset.IEEE Transactions on Geoscience and Remote Sensing60 (2022), 1–14

2022
[4]

Kaiyan Chen, Ming Wu, Jiaming Liu, and Chuang Zhang. 2020. FGSD: A dataset for fine-grained ship detection in high resolution satellite images.arXiv preprint arXiv:2003.06832(2020)

work page arXiv 2020
[5]

Mark Chen, Jerry Tworek, Heewoo Jun, Qiming Yuan, Henrique Ponde De Oliveira Pinto, Jared Kaplan, Harri Edwards, Yuri Burda, Nicholas Joseph, Greg Brockman, et al. 2021. Evaluating large language models trained on code. arXiv preprint arXiv:2107.03374(2021)

work page internal anchor Pith review Pith/arXiv arXiv 2021
[6]

Yanyuan Chen, Dexuan Xu, Yu Huang, Songkun Zhan, Hanpin Wang, Dongxue Chen, Xueping Wang, Meikang Qiu, and Hang Li. 2025. MIMO: A medical vision language model with visual referring multimodal input and pixel grounding multimodal output. InProceedings of the Computer Vision and Pattern Recognition Conference. 24732–24741

2025
[7]

Xianfu Cheng, Wei Zhang, Shiwei Zhang, Jian Yang, Xiangyuan Guan, Xianjie Wu, Xiang Li, Ge Zhang, Jiaheng Liu, Yuying Mai, et al. 2025. Simplevqa: Multimodal factuality evaluation for multimodal large language models. InProceedings of the IEEE/CVF International Conference on Computer Vision. 4637–4646

2025
[8]

Louis Clouâtre and Marc Demers. 2019. Figr: Few-shot image generation with reptile.arXiv preprint arXiv:1901.02199(2019)

work page arXiv 2019
[9]

Gheorghe Comanici, Eric Bieber, Mike Schaekermann, Ice Pasupat, Noveen Sachdeva, Inderjit Dhillon, Marcel Blistein, Ori Ram, Dan Zhang, Evan Rosen, et al. 2025. Gemini 2.5: Pushing the frontier with advanced reasoning, multi- modality, long context, and next generation agentic capabilities.arXiv preprint arXiv:2507.06261(2025)

work page internal anchor Pith review Pith/arXiv arXiv 2025
[10]

Can Cui, Yunsheng Ma, Xu Cao, Wenqian Ye, Yang Zhou, Kaizhao Liang, Jintai Chen, Juanwu Lu, Zichong Yang, Kuei-Da Liao, et al . 2024. A survey on mul- timodal large language models for autonomous driving. InProceedings of the IEEE/CVF winter conference on applications of computer vision. 958–979

2024
[11]

Antonio-Javier Gallego, Antonio Pertusa, and Pablo Gil. 2018. Automatic ship classification from optical aerial images with convolutional neural networks. Remote Sensing10, 4 (2018), 511

2018
[12]

Mingning Guo, Mengwei Wu, Yuxiang Shen, Haifeng Li, and Chao Tao. 2025. IFShip: Interpretable fine-grained ship classification with domain knowledge- enhanced vision-language models.Pattern Recognition166 (2025), 111672

2025
[13]

Xinyu Huang, Yuhao Dong, Weiwei Tian, Bo Li, Rui Feng, and Ziwei Liu. 2025. High-resolution visual reasoning via multi-turn grounding-based reinforcement learning.arXiv preprint arXiv:2507.05920(2025)

work page internal anchor Pith review Pith/arXiv arXiv 2025
[14]

Xiaoshuang Huang, Lingdong Shen, Jia Liu, Fangxin Shang, Hongxiang Li, Haifeng Huang, and Yehui Yang. 2025. Towards a multimodal large language model with pixel-level insight for biomedicine. InProceedings of the AAAI Con- ference on Artificial Intelligence, Vol. 39. 3779–3787

2025
[15]

Aaron Hurst, Adam Lerer, Adam P Goucher, Adam Perelman, Aditya Ramesh, Aidan Clark, AJ Ostrow, Akila Welihinda, Alan Hayes, Alec Radford, et al. 2024. Gpt-4o system card.arXiv preprint arXiv:2410.21276(2024)

work page internal anchor Pith review Pith/arXiv arXiv 2024
[16]

Bogdan Iancu, Valentin Soloviev, Luca Zelioli, and Johan Lilius. 2021. Abo- ships—an inshore and offshore maritime vessel detection dataset with precise annotations.Remote Sensing13, 5 (2021), 988

2021
[17]

Xi Jiang, Jian Li, Hanqiu Deng, Yong Liu, Bin-Bin Gao, Yifeng Zhou, Jialin Li, Chengjie Wang, and Feng Zheng. 2024. Mmad: A comprehensive benchmark for multimodal large language models in industrial anomaly detection.arXiv preprint arXiv:2410.09453(2024)

work page arXiv 2024
[18]

Parneet Kaur, Arslan Aziz, Darshan Jain, Harshil Patel, Jonathan Hirokawa, Lach- lan Townsend, Christoph Reimers, and Fiona Hua. 2022. Sea situational awareness (seasaw) dataset. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2579–2587

2022
[19]

Bo Li, Yuanhan Zhang, Dong Guo, Renrui Zhang, Feng Li, Hao Zhang, Kaichen Zhang, Peiyuan Zhang, Yanwei Li, Ziwei Liu, et al. 2024. Llava-onevision: Easy visual task transfer.arXiv preprint arXiv:2408.03326(2024)

work page internal anchor Pith review Pith/arXiv arXiv 2024
[20]

Junnan Li, Dongxu Li, Caiming Xiong, and Steven Hoi. 2022. Blip: Bootstrapping language-image pre-training for unified vision-language understanding and generation. InInternational conference on machine learning. PMLR, 12888–12900

2022
[21]

Jian Li, Weiheng Lu, Hao Fei, Meng Luo, Ming Dai, Min Xia, Yizhang Jin, Zhenye Gan, Ding Qi, Chaoyou Fu, et al. 2024. A survey on benchmarks of multimodal large language models.arXiv preprint arXiv:2408.08632(2024)

work page arXiv 2024
[22]

Ke Li, Gang Wan, Gong Cheng, Liqiu Meng, and Junwei Han. 2020. Object detection in optical remote sensing images: A survey and a new benchmark. ISPRS journal of photogrammetry and remote sensing159 (2020), 296–307

2020
[23]

Xingming Liao, Chong Chen, Zhuowei Wang, Ying Liu, Tao Wang, and Lianglun Cheng. 2025. Large language model assisted fine-grained knowledge graph construction for robotic fault diagnosis.Advanced Engineering Informatics65 (2025), 103134. doi:10.1016/j.aei.2025.103134

work page doi:10.1016/j.aei.2025.103134 2025
[24]

Tsung-Yi Lin, Michael Maire, Serge Belongie, James Hays, Pietro Perona, Deva Ramanan, Piotr Dollár, and C Lawrence Zitnick. 2014. Microsoft coco: Common objects in context. InEuropean conference on computer vision. Springer, 740–755

2014
[25]

Haotian Liu, Chunyuan Li, Qingyang Wu, and Yong Jae Lee. 2023. Visual In- struction Tuning. InAdvances in Neural Information Processing Systems, A. Oh, T. Naumann, A. Globerson, K. Saenko, M. Hardt, and S. Levine (Eds.), Vol. 36. Curran Associates, Inc., 34892–34916. https://proceedings.neurips.cc/paper_ files/paper/2023/file/6dcf277ea32ce3288914faf369fe6...

2023
[26]

Shilong Liu, Zhaoyang Zeng, Tianhe Ren, Feng Li, Hao Zhang, Jie Yang, Qing Jiang, Chunyuan Li, Jianwei Yang, Hang Su, et al. 2024. Grounding dino: Marry- ing dino with grounded pre-training for open-set object detection. InEuropean conference on computer vision. Springer, 38–55

2024
[27]

Zhiliang Peng, Wenhui Wang, Li Dong, Yaru Hao, Shaohan Huang, Shuming Ma, and Furu Wei. 2023. Kosmos-2: Grounding multimodal large language models to the world.arXiv preprint arXiv:2306.14824(2023)

work page internal anchor Pith review arXiv 2023
[28]

Qwen, :, An Yang, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chengyuan Li, Dayiheng Liu, Fei Huang, Haoran Wei, Huan Lin, Jian Yang, Jianhong Tu, Jianwei Zhang, Jianxin Yang, Jiaxi Yang, Jingren Zhou, Junyang Lin, Kai Dang, Keming Lu, Keqin Bao, Kexin Yang, Le Yu, Mei Li, Mingfeng Xue, Pei Zhang, Qin Zhu, Rui Men, Runji Lin, Tianhao Li,...

work page internal anchor Pith review Pith/arXiv arXiv 2025
[29]

Alex Salgado, Eduardo Charles Vasconcellos, Raphael Guerra, Luiz Marcos Garcia Gonçalves, and Esteban Walter Gonzalez Clua. 2026. USV-3.0: Cognitive maritime navigation through vision-language models, Human-in-the-Loop learning, and spatio-temporal memory.Ocean Engineering355 (2026), 125010

2026
[30]

Zhenfeng Shao, Wenjing Wu, Zhongyuan Wang, Wan Du, and Chengyuan Li
[31]

Seaships: A large-scale precisely annotated dataset for ship detection.IEEE transactions on multimedia20, 10 (2018), 2593–2604

2018
[32]

Paolo Spagnolo, Francesco Filieri, Cosimo Distante, Pier Luigi Mazzeo, and Paolo D’Ambrosio. 2019. A new annotated dataset for boat detection and re- identification.. InA VSS. 1–7

2019
[33]

Li Su, Yusheng Chen, Hao Song, and Wanyi Li. 2023. A survey of maritime vision datasets.Multimedia Tools and Applications82, 19 (2023), 28873–28893

2023
[34]

Gemini Team, Petko Georgiev, Ving Ian Lei, Ryan Burnell, Libin Bai, Anmol Gulati, Garrett Tanzer, Damien Vincent, Zhufeng Pan, Shibo Wang, et al. 2024. Gemini 1.5: Unlocking multimodal understanding across millions of tokens of context.arXiv preprint arXiv:2403.05530(2024)

work page internal anchor Pith review Pith/arXiv arXiv 2024
[35]

OpenGVLab Team. 2024. Internvl2: Better than the best—expanding performance boundaries of open-source multimodal models with the progressive scaling strategy

2024
[36]

Hugo Touvron, Louis Martin, Kevin Stone, Peter Albert, Amjad Almahairi, Yas- mine Babaei, Nikolay Bashlykov, Soumya Batra, Prajjwal Bhargava, Shruti Bhos- ale, et al. 2023. Llama 2: Open foundation and fine-tuned chat models.arXiv preprint arXiv:2307.09288(2023)

work page internal anchor Pith review Pith/arXiv arXiv 2023
[37]

Rejin Varghese and M Sambath. 2024. Yolov8: A novel object detection algorithm with enhanced performance and robustness. In2024 International conference on advances in data engineering and intelligent computing systems (ADICS). IEEE, 1–6

2024
[38]

Haochen Wang, Xiangtai Li, Zilong Huang, Anran Wang, Jiacong Wang, Tao Zhang, Jiani Zheng, Sule Bai, Zijian Kang, Jiashi Feng, et al . 2025. Traceable evidence enhanced visual grounded reasoning: Evaluation and methodology. arXiv preprint arXiv:2507.07999(2025)

work page arXiv 2025
[39]

Weizhen Wang, Chenda Duan, Zhenghao Peng, Yuxin Liu, and Bolei Zhou. 2025. Embodied scene understanding for vision language models via metavqa. InPro- ceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 22453–22464

2025
[40]

Zhaohui Wang, Tengbo Yu, and Hao Tang. 2025. CoT4AD: A Vision-Language- Action Model with Explicit Chain-of-Thought Reasoning for Autonomous Driv- ing.arXiv preprint arXiv:2511.22532(2025)

work page arXiv 2025
[41]

Yuan Yao, Tianyu Yu, Ao Zhang, Chongyi Wang, Junbo Cui, Hongji Zhu, Tianchi Cai, Haoyu Li, Weilin Zhao, Zhihui He, et al. 2024. Minicpm-v: A gpt-4v level mllm on your phone.arXiv preprint arXiv:2408.01800(2024)

work page internal anchor Pith review arXiv 2024
[42]

Hongbin Zhang, Tao Wang, Zhongyu Liang, Zhenghao Huang, Chong Chen, and Lianglun Cheng. 2025. Multilingual graph retrieval-augmented generation for product design using design knowledge.Journal of Engineering Design(2025), 1–32

2025
[43]

Qian Zhang, Mingxin Zhang, Jinghe Liu, Xuanyu He, Ran Song, and Wei Zhang
[44]

2026-04-13 00:02

Unsupervised maritime vessel re-identification with multi-level contrastive learning.IEEE Transactions on Intelligent Transportation Systems24, 5 (2023), 5406–5418. 2026-04-13 00:02. Page 7 of 1–8. Unpublished working draft.Not for distribution. ACM MM 2026, November 10–14, 2026, Brazil Xingming Liao et al

2023
[45]

Zhengning Zhang, Lin Zhang, Yue Wang, Pengming Feng, and Ran He. 2021. ShipRSImageNet: A large-scale fine-grained dataset for ship detection in high- resolution optical remote sensing images.IEEE Journal of Selected Topics in Applied Earth Observations and Remote Sensing14 (2021), 8458–8472

2021
[46]

Yitong Zheng and Shun Zhang. 2020. Mcships: A large-scale ship dataset for detection and fine-grained categorization in the wild. In2020 IEEE International Conference on Multimedia and Expo (ICME). IEEE, 1–6

2020
[47]

Jinguo Zhu, Weiyun Wang, Zhe Chen, Zhaoyang Liu, Shenglong Ye, Lixin Gu, Hao Tian, Yuchen Duan, Weijie Su, Jie Shao, et al. 2025. Internvl3: Exploring advanced training and test-time recipes for open-source multimodal models. arXiv preprint arXiv:2504.10479(2025). 2026-04-13 00:02. Page 8 of 1–8

work page internal anchor Pith review Pith/arXiv arXiv 2025