Visual Prompt Based Reasoning for Offroad Mapping using Multimodal LLMs
Pith reviewed 2026-05-10 19:47 UTC · model grok-4.3
The pith
A vision-language model can map drivable off-road areas zero-shot by reasoning over SAM2 segments labeled with numbers.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
By supplying a VLM with both the raw image and the same image overlaid with numeric labels on SAM2-generated masks, then asking it to name the drivable labels, the method produces accurate drivable-area maps without any model training or domain examples, outperforming prior trainable approaches on high-resolution benchmarks and supporting complete navigation in simulation.
What carries the argument
Numeric labeling of SAM2 masks as visual prompts that allow the VLM to select drivable regions by identifier rather than generating new masks.
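As a concrete sketch of that mechanism (in Python): the mask source and the `query_vlm` callable below are hypothetical stand-ins, since the paper's code and exact prompt are not reproduced in this review.

```python
import numpy as np
import cv2

def overlay_numeric_labels(image: np.ndarray, masks: list[np.ndarray]) -> np.ndarray:
    """Draw each mask's index at its centroid so the VLM can name regions by number."""
    labeled = image.copy()
    for idx, mask in enumerate(masks):
        ys, xs = np.nonzero(mask)
        if xs.size == 0:
            continue
        cv2.putText(labeled, str(idx), (int(xs.mean()), int(ys.mean())),
                    cv2.FONT_HERSHEY_SIMPLEX, 1.0, (255, 255, 255), 2)
    return labeled

def drivable_map(image: np.ndarray, masks: list[np.ndarray], query_vlm) -> np.ndarray:
    """Union of the masks whose labels the VLM deems drivable; no training involved."""
    labeled = overlay_numeric_labels(image, masks)
    prompt = ("The second image repeats the first with numbered regions. "
              "Reply with the numbers of regions a ground vehicle could drive on.")
    reply = query_vlm(images=[image, labeled], text=prompt)  # e.g. "0, 3, 7"
    chosen = {int(t) for t in reply.replace(",", " ").split()
              if t.isdigit() and int(t) < len(masks)}
    if not chosen:
        return np.zeros(image.shape[:2], dtype=bool)
    return np.any([masks[i] for i in chosen], axis=0)
```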
If this is right
- The need for separate models and datasets for classification, height, slip, and slope is removed.
- Planning and control can directly use the VLM's label-based decisions for path generation (see the costmap sketch after this list).
- Performance exceeds state-of-the-art trained models on high-resolution off-road segmentation tasks.
- Full-stack autonomy becomes feasible in simulated off-road settings using only general-purpose models.
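A hedged illustration of that hand-off, collapsing the selected masks into an occupancy grid a planner such as D* Lite could search. It stays in image space for brevity (a real stack would first project to a metric bird's-eye grid), and the inflation radius is an assumed value:

```python
import numpy as np
from scipy.ndimage import binary_dilation

def drivable_to_costmap(drivable: np.ndarray, inflate_px: int = 5) -> np.ndarray:
    """0 = traversable, 1 = blocked; non-drivable cells inflated for vehicle clearance."""
    blocked = binary_dilation(~drivable, iterations=inflate_px)
    return blocked.astype(np.uint8)
```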
Where Pith is reading between the lines
- If VLMs improve further, this could reduce reliance on large labeled robotics datasets across many domains.
- Similar prompting could apply to other unstructured environments like indoor navigation or search-and-rescue.
- Combining this with sensor fusion might address edge cases where visual reasoning alone is insufficient.
Load-bearing premise
A general-purpose vision-language model can reliably determine drivable regions from numeric labels on SAM2 masks across different off-road environments without fine-tuning or specific examples.
What would settle it
Collecting a benchmark of real off-road images annotated by human experts for drivable areas, applying the numeric prompt method, and verifying whether the model's selected labels align closely with the expert annotations.
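Sketched as code, assuming each benchmark image comes with SAM2 masks plus an expert-drawn drivable mask; the 0.5 overlap threshold used to derive the reference label set is an illustrative choice, not from the paper:

```python
import numpy as np

def expert_label_set(masks: list[np.ndarray], expert: np.ndarray,
                     thresh: float = 0.5) -> set[int]:
    """Labels whose masks lie mostly inside the expert-annotated drivable area."""
    return {i for i, m in enumerate(masks)
            if m.sum() and np.logical_and(m, expert).sum() / m.sum() >= thresh}

def label_agreement(chosen: set[int], reference: set[int]) -> tuple[float, float]:
    """Precision and recall of the VLM's label selection against the experts'."""
    if not chosen or not reference:
        agree = float(chosen == reference)  # both empty counts as perfect
        return agree, agree
    inter = len(chosen & reference)
    return inter / len(chosen), inter / len(reference)
```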
Original abstract
Traditional approaches to off-road autonomy rely on separate models for terrain classification, height estimation, and quantifying slip or slope conditions. Utilizing several models requires training each component separately, having task specific datasets, and fine-tuning. In this work, we present a zero-shot approach leveraging SAM2 for environment segmentation and a vision-language model (VLM) to reason about drivable areas. Our approach involves passing to the VLM both the original image and the segmented image annotated with numeric labels for each mask. The VLM is then prompted to identify which regions, represented by these numeric labels, are drivable. Combined with planning and control modules, this unified framework eliminates the need for explicit terrain-specific models and relies instead on the inherent reasoning capabilities of the VLM. Our approach surpasses state-of-the-art trainable models on high resolution segmentation datasets and enables full stack navigation in our Isaac Sim offroad environment.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper proposes a zero-shot off-road autonomy framework that combines SAM2 for image segmentation with a vision-language model (VLM) to identify drivable regions. The VLM receives the original RGB image plus a SAM2 mask image annotated with integer labels and is prompted to output the numeric identifiers of drivable areas; these outputs are then fed into planning and control modules. The central claims are that this approach surpasses state-of-the-art trainable segmentation models on high-resolution datasets and enables complete navigation stacks in an Isaac Sim off-road simulator without task-specific training or fine-tuning.
Significance. If the quantitative claims were substantiated, the work would be significant for reducing the engineering overhead of off-road systems by replacing separate terrain-classification, height, and slip models with a single VLM reasoning step. The zero-shot nature and use of pre-trained foundation models could lower data-collection costs, but the current manuscript supplies no metrics to evaluate whether the VLM reasoning actually delivers the promised performance.
Major comments (3)
- [Abstract and §4 (Experiments)] The manuscript states that the method 'surpasses state-of-the-art trainable models on high resolution segmentation datasets,' yet no dataset names, metrics (mIoU, pixel accuracy, etc.), baseline implementations, or numerical results appear anywhere in the results section. This absence makes the central performance claim impossible to evaluate.
- [§3.2 (VLM Prompting)] The exact prompt text supplied to the VLM is not reproduced, nor are any example inputs (RGB + numeric-labeled SAM2 mask) and corresponding VLM outputs shown. Without these, it is impossible to assess whether the VLM reliably distinguishes drivable terrain from vegetation, mud, or slopes under the zero-shot regime asserted in the abstract.
- [§4.2 (Isaac Sim Navigation)] Full-stack navigation is claimed to succeed, but no quantitative metrics—success rate, collision frequency, path efficiency, or failure-mode statistics—are reported, nor is any comparison to conventional terrain-aware planners provided. This leaves the navigation claim unsupported.
Minor comments (2)
- [Abstract] The abstract refers to 'high resolution segmentation datasets' without naming them; adding the specific dataset identifiers would improve clarity.
- [Figures] Figure captions for the SAM2 mask visualizations should explicitly state the numeric label-to-region mapping used in the VLM prompt.
Simulated Author's Rebuttal
We thank the referee for their thorough review and valuable suggestions. We address each major comment below and commit to revising the manuscript to incorporate the requested details and quantitative evaluations.
Point-by-point responses
Referee: [Abstract and §4 (Experiments)] The manuscript states that the method 'surpasses state-of-the-art trainable models on high resolution segmentation datasets,' yet no dataset names, metrics (mIoU, pixel accuracy, etc.), baseline implementations, or numerical results appear anywhere in the results section. This absence makes the central performance claim impossible to evaluate.
Authors: We agree that the manuscript lacks the specific quantitative results to support this claim. In the revised version, we will include a new subsection detailing the high-resolution datasets used, the state-of-the-art baselines implemented, the evaluation metrics such as mIoU and pixel accuracy, and the comparative numerical results demonstrating that our zero-shot approach surpasses the trainable models. revision: yes
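For reference, generic binary-segmentation versions of the two metrics named above; these are textbook formulations, not code from the paper:

```python
import numpy as np

def pixel_accuracy(pred: np.ndarray, gt: np.ndarray) -> float:
    """Fraction of pixels on which prediction and ground truth agree."""
    return float((pred == gt).mean())

def binary_miou(pred: np.ndarray, gt: np.ndarray) -> float:
    """Mean of the drivable-class and background-class IoU."""
    ious = []
    for cls in (True, False):
        p, g = (pred == cls), (gt == cls)
        union = np.logical_or(p, g).sum()
        ious.append(np.logical_and(p, g).sum() / union if union else 1.0)
    return float(np.mean(ious))
```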
Referee: [§3.2 (VLM Prompting)] The exact prompt text supplied to the VLM is not reproduced, nor are any example inputs (RGB + numeric-labeled SAM2 mask) and corresponding VLM outputs shown. Without these, it is impossible to assess whether the VLM reliably distinguishes drivable terrain from vegetation, mud, or slopes under the zero-shot regime asserted in the abstract.
Authors: We acknowledge the importance of providing the exact prompt and illustrative examples for transparency and to validate the zero-shot performance. We will add the complete prompt text to §3.2 and include figures with example RGB images, corresponding SAM2 masks with numeric labels, and the VLM's output responses for different off-road scenarios including vegetation, mud, and slopes. revision: yes
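Absent the paper's exact text, one plausible shape for such a prompt; both the wording and the comma-separated-integer output contract are assumptions made for illustration:

```python
PROMPT = (
    "You are given two images: an off-road scene and the same scene with "
    "segmented regions, each marked by a number. Considering vegetation, "
    "mud, slope, and obstacles, reply with only the numbers of the regions "
    "a wheeled ground vehicle could safely drive on, as comma-separated "
    "integers (e.g. 2, 5, 9)."
)
```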
Referee: [§4.2 (Isaac Sim Navigation)] Full-stack navigation is claimed to succeed, but no quantitative metrics—success rate, collision frequency, path efficiency, or failure-mode statistics—are reported, nor is any comparison to conventional terrain-aware planners provided. This leaves the navigation claim unsupported.
Authors: We recognize that quantitative metrics are essential to substantiate the navigation claims. In the revision, we will expand §4.2 with results from multiple simulation trials, reporting success rates, collision frequencies, path efficiency metrics, and analysis of failure modes. We will also include comparisons against conventional planners that rely on separate terrain classification models. revision: yes
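An illustrative aggregation of the promised statistics; the trial fields and the SPL-style efficiency term are assumptions, not quantities reported in the paper:

```python
from dataclasses import dataclass

@dataclass
class Trial:
    reached_goal: bool
    collisions: int
    path_length: float     # meters actually driven
    optimal_length: float  # shortest feasible path, for efficiency

def summarize(trials: list[Trial]) -> dict[str, float]:
    n = len(trials)  # assumed non-empty
    return {
        "success_rate": sum(t.reached_goal for t in trials) / n,
        "collisions_per_trial": sum(t.collisions for t in trials) / n,
        # SPL-style: optimal/actual length, credited only on successful trials.
        "path_efficiency": sum(t.optimal_length / max(t.path_length, t.optimal_length)
                               for t in trials if t.reached_goal) / n,
    }
```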
Circularity Check
No circularity: the zero-shot method depends on external pre-trained SAM2 and VLM capabilities.
Full rationale
The paper presents a descriptive zero-shot pipeline that invokes SAM2 for mask generation and a general VLM for numeric-label reasoning about drivable regions. No equations, fitted parameters, self-definitional loops, or load-bearing self-citations appear in the abstract or method description. The central claim rests on the external reasoning ability of the VLM rather than any internal construction that reduces to the paper's own inputs. This is the expected non-finding for an applied systems paper without mathematical derivation.