UniReason-Med: A Shared Grounded Reasoning Interface for 2D-to-3D Transfer in Medical VQA

Bryan Dai; Chi Liu; Derek Li; Hongming Piao; Mengzhuo Chen; Xidong Wang; Yan Shu

arxiv: 2606.11740 · v1 · pith:ETH3GI5Pnew · submitted 2026-06-10 · 💻 cs.CV · cs.CL

Mengzhuo Chen, Yan Shu, Chi Liu, Hongming Piao, Xidong Wang, Derek Li, Bryan Dai

Pith reviewed 2026-06-27 10:18 UTC · model grok-4.3

keywords: medical VQA, grounded reasoning, 2D-to-3D transfer, slice-serialized volumes, region token injection, box syntax, outcome reinforcement learning, UniMed-CoT

The pith

A shared grounded reasoning interface transfers reasoning structure from 2D medical images to 3D volumes.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper tests whether grounded reasoning supervision from 2D medical images can improve 3D medical visual question answering when both are aligned through one common interface. It presents UniReason-Med as a single-checkpoint model that accepts either a 2D image or a slice-serialized 3D volume and produces interleaved textual reasoning plus localized visual evidence. Training relies on the UniMed-CoT dataset of 220K examples containing both 2D and 3D samples with reasoning traces, using supervised fine-tuning followed by outcome-level reinforcement learning. Ablation results indicate that mixing 2D and 3D grounded data improves 3D performance over 3D-only training and that the grounding components help both dimensions. This setup is offered as evidence that a shared interface can move reasoning patterns across input dimensions without separate adaptations for each.

Core claim

UniReason-Med processes either 2D images or slice-serialized 3D volumes at inference time through shared box syntax, region-token injection, and a common grounded reasoning policy, generating interleaved textual reasoning and localized visual evidence; it is trained on the 220K UniMed-CoT dataset via supervised fine-tuning and outcome-level reinforcement learning without IoU or Dice localization rewards, and joint 2D+3D supervision is shown to improve 3D reasoning relative to 3D-only training while grounding elements benefit both tasks.

What carries the argument

shared grounded reasoning interface that aligns 2D and 3D inputs via common box syntax for localization, region-token injection, and a single grounded reasoning policy
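
To make the idea of a common box syntax concrete, here is a minimal sketch of one text format that can carry both a 2D box and a box with a slice range. The `<box>` tag, the `slices=` attribute, and the `Region` type are assumptions made for illustration; this page does not give the paper's actual syntax.

```python
import re
from dataclasses import dataclass
from typing import List, Optional, Tuple

# Hypothetical tag format, not the paper's actual box syntax.
BOX_RE = re.compile(
    r"<box(?: slices=(\d+)-(\d+))?>\[(\d+),(\d+),(\d+),(\d+)\]</box>"
)


@dataclass(frozen=True)
class Region:
    box: Tuple[int, int, int, int]            # (x1, y1, x2, y2) in pixels
    slices: Optional[Tuple[int, int]] = None  # inclusive slice range; None for 2D


def format_region(region: Region) -> str:
    """Render a region as text; 2D and 3D inputs share the same tag."""
    attr = "" if region.slices is None else f" slices={region.slices[0]}-{region.slices[1]}"
    x1, y1, x2, y2 = region.box
    return f"<box{attr}>[{x1},{y1},{x2},{y2}]</box>"


def parse_regions(text: str) -> List[Region]:
    """Recover every region embedded in a reasoning trace, in order."""
    regions = []
    for m in BOX_RE.finditer(text):
        s0, s1, x1, y1, x2, y2 = m.groups()
        slices = None if s0 is None else (int(s0), int(s1))
        regions.append(Region((int(x1), int(y1), int(x2), int(y2)), slices))
    return regions
```

Under this assumed format, the only difference between the two input types is the optional slice attribute, so one parser and one downstream policy can handle both.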

If this is right

Joint 2D+3D grounded supervision substantially improves 3D reasoning over 3D-only training.
Grounding and region-token injection consistently benefit performance on both 2D and 3D tasks.
The model generates grounded reasoning traces using only outcome-level reinforcement learning without explicit localization rewards.
A single checkpoint can process either 2D images or slice-serialized 3D volumes at inference time.
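
The phrase "outcome-level reinforcement learning without explicit localization rewards" can be illustrated with a reward that looks only at the final answer. The sketch below is an assumption-laden illustration: the `<answer>` tag, the normalization rule, and the binary 0/1 scale are all hypothetical, since the paper's reward formulation is not given on this page.

```python
import re

# Hypothetical answer tag; the paper's output format may differ.
ANSWER_RE = re.compile(r"<answer>(.*?)</answer>", re.DOTALL)


def normalize(text: str) -> str:
    """Lowercase and collapse punctuation/whitespace so trivial variants match."""
    return re.sub(r"[^a-z0-9]+", " ", text.lower()).strip()


def outcome_reward(trace: str, reference: str) -> float:
    """Return 1.0 if the final answer matches the reference, else 0.0.

    Box coordinates inside the trace are deliberately ignored: no IoU or
    Dice term enters the reward.
    """
    match = ANSWER_RE.search(trace)
    if match is None:
        return 0.0
    return 1.0 if normalize(match.group(1)) == normalize(reference) else 0.0
```

In this sketch two traces with different boxes but the same final answer receive the same reward, so any grounding behavior would have to be retained from the supervised stage or emerge indirectly.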

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The approach may allow 3D medical reasoning systems to draw on larger existing 2D datasets rather than requiring equally large 3D annotations.
Similar shared interfaces could be tested for transferring reasoning between other pairs of input dimensions or imaging modalities.
The method raises the question of whether outcome-only reinforcement learning remains sufficient when the number of 3D samples grows much larger.

Load-bearing premise

Aligning 2D images and 3D volumes through a common box syntax, region-token injection, and grounded reasoning policy enables effective knowledge transfer without requiring dimension-specific adaptations or suffering from information loss in slice serialization.

What would settle it

A controlled experiment in which adding the 170K 2D grounded samples to the 50K 3D samples produces no gain in 3D VQA accuracy compared with 3D-only training, or in which the single shared model underperforms two separately trained dimension-specific models on their respective tasks.
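
The comparison described above can be laid out as a small bookkeeping sketch. The sample counts come from the abstract; the arm names, the `transfer_gain` helper, and the per-question correctness lists it takes are hypothetical, and no real results are implied.

```python
from typing import Dict, Sequence

# Sample counts as reported in the abstract.
N_2D, N_3D = 170_000, 50_000

# Two training arms of the hypothetical controlled comparison.
ARMS: Dict[str, Dict[str, int]] = {
    "3d_only": {"2d": 0, "3d": N_3D},
    "joint_2d_3d": {"2d": N_2D, "3d": N_3D},
}


def total_samples(arm: str) -> int:
    return sum(ARMS[arm].values())


def share_3d(arm: str) -> float:
    """Fraction of the arm's training data that is 3D."""
    return ARMS[arm]["3d"] / total_samples(arm)


def accuracy(correct: Sequence[bool]) -> float:
    return sum(correct) / len(correct)


def transfer_gain(correct_joint: Sequence[bool], correct_3d_only: Sequence[bool]) -> float:
    """Accuracy difference on the same 3D test questions (joint minus 3D-only)."""
    if len(correct_joint) != len(correct_3d_only):
        raise ValueError("both arms must be scored on the same test questions")
    return accuracy(correct_joint) - accuracy(correct_3d_only)
```

In the joint arm only about 23% of the training data (50K of 220K) is 3D. A gain at or below zero from `transfer_gain` on a held-out 3D test set would be the kind of outcome that counts against the claim.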

Figures

Figures reproduced from arXiv: 2606.11740 by Bryan Dai, Chi Liu, Derek Li, Hongming Piao, Mengzhuo Chen, Xidong Wang, Yan Shu.

**Figure 1.** Overview of UniReason-Med. Left: Comparison of medical MLLMs across key capabilities (✓ full support, ✓ partial, ✗ none). Checkmarks indicate whether a method reports the corresponding interface capability under public benchmark settings, not clinical readiness or native volumetric representation. UniReason-Med studies a shared grounded reasoning interface for both 2D images and slice-serialized 3D volumes…

**Figure 2.** Overview of UniReason-Med. (a) Grounded visual evidence extraction for a 2D image or 32-slice CT sequence under the shared GCoT interface. (b) Interleaved UniMed-CoT data format. (c) Two-stage SFT+GRPO training.

**Figure 3.** UniMed-CoT construction. From SAMed2D-v1 and M3D segmentation masks, we extract grounding coordinates, generate QA pairs, and use GPT-4o to produce interleaved grounded CoT annotations, yielding 220K 2D/3D samples.

**Figure 4.** Evolution of key metrics during reinforcement learning training. The reward steadily increases as training progresses, indicating improved policy performance. Meanwhile, the crop area ratio shows a slightly decreasing trend with minor fluctuations, and the response length shows a mild upward trend.

**Figure 5.** Pearson correlation analysis between ground…

**Figure 6.** Slice-volume grounding consistency. The green boxes correspond to slice-wise 2D GCoT predictions, while the red boxes correspond to the xy projection of 3D GCoT cuboids within the predicted slice range. The top part shows three example cases, and the bottom part presents the detailed analysis of Case 2, including the question and the reasoning outputs of both 2D GCoT and 3D GCoT.
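
The comparison in Figure 6 (slice-wise 2D boxes versus the xy projection of a 3D cuboid within its slice range) can be sketched as follows. The cuboid layout `(x1, y1, z1, x2, y2, z2)`, the use of mean IoU as the consistency score, and all function names are assumptions for illustration, not the paper's metric.

```python
from typing import Dict, Optional, Tuple

Box = Tuple[float, float, float, float]                    # (x1, y1, x2, y2)
Cuboid = Tuple[float, float, float, float, float, float]   # (x1, y1, z1, x2, y2, z2)


def project_cuboid(cuboid: Cuboid, z: float) -> Optional[Box]:
    """xy projection of the cuboid on slice z, or None if z is outside its range."""
    x1, y1, z1, x2, y2, z2 = cuboid
    return (x1, y1, x2, y2) if z1 <= z <= z2 else None


def iou_2d(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned boxes."""
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def slice_consistency(cuboid: Cuboid, slice_boxes: Dict[int, Box]) -> Optional[float]:
    """Mean IoU between per-slice 2D boxes and the cuboid's projection.

    Only slices inside the cuboid's z range are scored; returns None when
    no 2D prediction falls in that range.
    """
    scores = []
    for z, box in slice_boxes.items():
        projected = project_cuboid(cuboid, z)
        if projected is not None:
            scores.append(iou_2d(projected, box))
    return sum(scores) / len(scores) if scores else None
```

A score near 1 would mean the 2D and 3D predictions mark the same region on the shared slices. It does not show that the model reasons across slices, which is a separate question.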

Original abstract

We study whether grounded reasoning supervision from abundant 2D medical images can improve 3D medical VQA when both input types are aligned through a common reasoning interface. We introduce UniReason-Med, a single-checkpoint framework that processes either a 2D image or a slice-serialized 3D volume at inference time, generating interleaved textual reasoning and localized visual evidence through shared box syntax, region-token injection, and a common grounded reasoning policy. To train this interface, we construct UniMed-CoT, a 220K instruction-tuning dataset with interleaved textual reasoning and grounded visual evidence, including 170K 2D and 50K 3D samples. Through supervised fine-tuning followed by outcome-level reinforcement learning, UniReason-Med learns to generate grounded reasoning traces without IoU/Dice-based localization rewards during RL. Data-mixture and component ablations show that joint 2D+3D grounded supervision substantially improves 3D reasoning over 3D-only training, while grounding and region-token injection consistently benefit both 2D and 3D tasks. These results suggest that a shared grounded reasoning interface can transfer reasoning structure from 2D images to slice-serialized volumetric medical understanding. The code and data are publicly available at https://github.com/IQuestLab/unireason-med.
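
As a rough illustration of what "slice-serialized" could mean in practice, the sketch below picks a fixed number of slices from a volume and emits them in order. The default of 32 slices follows the "32-slice CT sequence" mentioned in the Figure 2 caption; uniform spacing, the function names, and the list-of-slices volume representation are assumptions, since the paper's sampling rule is not stated on this page.

```python
from typing import List, Sequence, Tuple, TypeVar

Slice = TypeVar("Slice")


def sample_slice_indices(depth: int, num_slices: int = 32) -> List[int]:
    """Evenly spaced slice indices covering the first and last slice."""
    if depth <= 0 or num_slices <= 0:
        raise ValueError("depth and num_slices must be positive")
    if depth <= num_slices:
        return list(range(depth))          # short volume: keep every slice
    if num_slices == 1:
        return [depth // 2]                # single slice: take the middle one
    step = (depth - 1) / (num_slices - 1)  # step > 1, so indices never repeat
    return [round(i * step) for i in range(num_slices)]


def serialize_volume(volume: Sequence[Slice], num_slices: int = 32) -> List[Tuple[int, Slice]]:
    """Return (original_index, slice) pairs in axial order."""
    return [(i, volume[i]) for i in sample_slice_indices(len(volume), num_slices)]
```

Keeping the original index alongside each slice is a design choice of this sketch: it lets a box predicted on a sampled slice be mapped back to its position in the full volume.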

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

Joint 2D+3D training improves 3D medical VQA in their setup, but slice serialization plus per-slice boxes leaves the actual transfer mechanism open to question.

The main thing to know is that mixing 2D grounded data with 3D training lifts 3D VQA performance over 3D-only runs according to their ablations, and they made the code and 220K UniMed-CoT dataset public. The shared box syntax and region-token approach is the claimed mechanism for moving reasoning structure across dimensions.

They do the straightforward thing of building one checkpoint that ingests either 2D images or serialized volumes, runs the same grounded policy on both, and trains with SFT plus outcome RL. The component ablations show grounding and the joint mixture each add value on both 2D and 3D tasks. That is concrete and worth having on record.

The soft spot is the transfer mechanism itself. Slice serialization with standard 2D box syntax does not obviously encode cross-slice geometry, so the model could be learning independent 2D reasoning per slice rather than volumetric structure. The abstract gives no breakdown of whether the test questions actually require inter-slice relations or how the policy attends across the sequence. Without those details the transfer claim rests on the joint-training delta alone.

This is for people already working on medical VQA or grounded multimodal models who need more 3D signal. It is narrow but the empirical result and released artifacts make it worth a referee's time rather than a desk reject.

Referee Report

2 major / 2 minor

Summary. The paper introduces UniReason-Med, a single-checkpoint model for medical VQA that processes either 2D images or slice-serialized 3D volumes using a shared grounded reasoning interface (box syntax, region-token injection, and common policy). It constructs the UniMed-CoT dataset (220K samples) and trains via SFT followed by outcome-level RL without localization rewards. Data-mixture and component ablations indicate that joint 2D+3D grounded supervision improves 3D reasoning over 3D-only training, supporting the claim that the shared interface transfers reasoning structure from 2D to volumetric understanding. Code and data are released publicly.

Significance. If the transfer claim holds under rigorous 3D geometry tests, the work could meaningfully advance medical VQA by leveraging abundant 2D supervision for scarcer 3D tasks without dimension-specific models. The public release of code, data, and the explicit ablation design are strengths that support reproducibility and further investigation.

major comments (2)

[Method (architecture and input processing)] The central transfer claim rests on the assumption that a shared box syntax and region-token injection recover volumetric structure from slice serialization. However, the architecture description does not specify any mechanism (e.g., 3D positional encodings or cross-slice attention) for modeling inter-slice spatial continuity; if boxes and tokens are applied independently per slice, the model reduces to per-slice 2D reasoning and the transfer benefit cannot be attributed to 3D structure recovery.
[Experiments (ablations)] Data-mixture ablations show joint 2D+3D training improves 3D VQA metrics over 3D-only, but the reported experiments do not include a controlled test set of questions that require explicit cross-slice geometric reasoning (e.g., relative depth, 3D bounding-box consistency across slices). Without such a test, the ablation results are consistent with improved 2D reasoning per slice rather than genuine 3D transfer.

minor comments (2)

[Training procedure] The abstract states that RL uses outcome-level rewards without IoU/Dice localization terms, but the exact reward formulation, value model, and KL regularization details are not provided in the main text; these should be expanded for reproducibility.
[Dataset] Dataset construction details for the 50K 3D samples (source volumes, question generation, grounding annotation protocol) are referenced but lack sufficient statistics on slice count distribution and inter-slice overlap to assess potential information loss from serialization.
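
For orientation on the reward-details comment, the Figure 2 caption names GRPO as the RL stage. GRPO normally needs no learned value model, because each sampled response is scored relative to the other responses drawn for the same question. A minimal sketch of that group-relative normalization is given below; the use of population standard deviation and the `eps` value are assumptions, and the KL term and the paper's actual hyperparameters are not shown here.

```python
from statistics import mean, pstdev
from typing import List, Sequence


def group_relative_advantages(rewards: Sequence[float], eps: float = 1e-6) -> List[float]:
    """Standardize rewards within one group of responses to the same prompt.

    Each advantage is (reward - group mean) / (group std + eps). When every
    response gets the same reward, all advantages are zero and the group
    contributes no policy gradient.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]
```

With binary outcome rewards such as `[1, 0, 0, 1]`, correct responses receive advantages near +1 and incorrect ones near -1, which is why a question where every response is right (or every response is wrong) provides no learning signal.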

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback. We address each major comment below with clarifications on the architecture and experiments, indicating where revisions will be made to strengthen the manuscript.

Referee: [Method (architecture and input processing)] The central transfer claim rests on the assumption that a shared box syntax and region-token injection recover volumetric structure from slice serialization. However, the architecture description does not specify any mechanism (e.g., 3D positional encodings or cross-slice attention) for modeling inter-slice spatial continuity; if boxes and tokens are applied independently per slice, the model reduces to per-slice 2D reasoning and the transfer benefit cannot be attributed to 3D structure recovery.

Authors: We appreciate the referee's focus on this architectural detail. The 3D input is processed as an ordered sequence of slices passed to the multimodal LLM; the transformer's self-attention operates over this serialized sequence, enabling cross-slice interactions during reasoning. Region tokens are generated per slice but referenced within the shared textual reasoning trace that spans the full sequence. We acknowledge that the original manuscript does not explicitly describe this cross-slice attention flow or contrast it with 3D positional encodings. In the revision we will expand the method section to detail the serialization process and clarify that inter-slice modeling occurs via LLM attention over the sequence rather than dimension-specific modules, consistent with the goal of a shared interface. revision: partial
Referee: [Experiments (ablations)] Data-mixture ablations show joint 2D+3D training improves 3D VQA metrics over 3D-only, but the reported experiments do not include a controlled test set of questions that require explicit cross-slice geometric reasoning (e.g., relative depth, 3D bounding-box consistency across slices). Without such a test, the ablation results are consistent with improved 2D reasoning per slice rather than genuine 3D transfer.

Authors: We agree that a dedicated test set isolating cross-slice geometric reasoning would provide more direct evidence. The current ablations evaluate on standard 3D medical VQA benchmarks whose questions involve volumetric understanding, and the consistent gains from joint 2D+3D training support transfer of the grounded reasoning policy. However, we do not currently possess a specialized cross-slice test set. In the revised manuscript we will explicitly discuss this limitation, note that the observed improvements are on existing benchmarks, and identify construction of such a test set as valuable future work. revision: partial

Circularity Check

0 steps flagged

No significant circularity; empirical claims rest on independent ablations

The paper's central claim rests on constructing UniMed-CoT (220K samples), supervised fine-tuning plus outcome-level RL, and data-mixture/component ablations comparing joint 2D+3D training against 3D-only baselines. These results are externally falsifiable via the reported metrics and public code/data; no equations, fitted parameters renamed as predictions, or self-citation chains reduce the transfer claim to its inputs by construction. The architecture description (shared box syntax, region-token injection) is presented as an implementation choice whose effectiveness is tested rather than assumed.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Only the abstract is available; no specific free parameters, axioms, or invented entities can be identified from the provided text.

Reference graph

Works this paper leans on

286 extracted references · 14 canonical work pages · 1 internal anchor

[1]

Aho and Jeffrey D

Alfred V. Aho and Jeffrey D. Ullman , title =. 1972

1972
[2]

Publications Manual , year = "1983", publisher =

1983
[3]

and Kozen, Dexter C

Ashok K. Chandra and Dexter C. Kozen and Larry J. Stockmeyer , year = "1981", title =. doi:10.1145/322234.322243

work page doi:10.1145/322234.322243 1981
[4]

Scalable training of

Andrew, Galen and Gao, Jianfeng , booktitle=. Scalable training of
[5]

Dan Gusfield , title =. 1997

1997
[6]

Tetreault , title =

Mohammad Sadegh Rasooli and Joel R. Tetreault , title =. Computing Research Repository , volume =. 2015 , url =

2015
[7]

A Framework for Learning Predictive Structures from Multiple Tasks and Unlabeled Data , Volume =

Ando, Rie Kubota and Zhang, Tong , Issn =. A Framework for Learning Predictive Structures from Multiple Tasks and Unlabeled Data , Volume =. Journal of Machine Learning Research , Month = dec, Numpages =
[8]

2023 , eprint=

SA-Med2D-20M Dataset: Segment Anything in 2D Medical Imaging with 20 Million masks , author=. 2023 , eprint=

2023
[9]

Computer Vision -- ECCV 2022 , year =

2022
[10]

Langley , title =

P. Langley , title =. Proceedings of the 17th International Conference on Machine Learning (ICML 2000) , address =. 2000 , pages =

2000
[11]

T. M. Mitchell. The Need for Biases in Learning Generalizations. 1980

1980
[12]

M. J. Kearns , title =
[13]

Machine Learning: An Artificial Intelligence Approach, Vol. I. 1983

1983
[14]

R. O. Duda and P. E. Hart and D. G. Stork. Pattern Classification. 2000

2000
[15]

Suppressed for Anonymity , author=
[16]

Newell and P

A. Newell and P. S. Rosenbloom. Mechanisms of Skill Acquisition and the Law of Practice. Cognitive Skills and Their Acquisition. 1981

1981
[17]

A. L. Samuel. Some Studies in Machine Learning Using the Game of Checkers. IBM Journal of Research and Development. 1959

1959
[18]

FirstName LastName , title =
[19]

FirstName Alpher , title =
[20]

Journal of Foo , volume = 13, number = 1, pages =

FirstName Alpher and FirstName Fotheringham-Smythe , title =. Journal of Foo , volume = 13, number = 1, pages =
[21]

Journal of Foo , volume = 14, number = 1, pages =

FirstName Alpher and FirstName Fotheringham-Smythe and FirstName Gamow , title =. Journal of Foo , volume = 14, number = 1, pages =
[22]

FirstName Alpher and FirstName Gamow , title =
[23]

Scaling Learning Algorithms Towards

Bengio, Yoshua and LeCun, Yann , booktitle =. Scaling Learning Algorithms Towards
[24]

and Osindero, Simon and Teh, Yee Whye , journal =

Hinton, Geoffrey E. and Osindero, Simon and Teh, Yee Whye , journal =. A Fast Learning Algorithm for Deep Belief Nets , volume =
[25]

2016 , publisher=

Deep learning , author=. 2016 , publisher=

2016
[26]

2025 , eprint=

Grounded Chain-of-Thought for Multimodal Large Language Models , author=. 2025 , eprint=

2025
[27]

2025 , eprint=

MedGround-R1: Advancing Medical Image Grounding via Spatial-Semantic Rewarded Group Relative Policy Optimization , author=. 2025 , eprint=

2025
[28]

2024 , eprint=

MAIRA-2: Grounded Radiology Report Generation , author=. 2024 , eprint=

2024
[29]

2025 , eprint=

S-Chain: Structured Visual Chain-of-Thought For Medicine , author=. 2025 , eprint=

2025
[30]

2025 , eprint=

Argus: Vision-Centric Reasoning with Grounded Chain-of-Thought , author=. 2025 , eprint=

2025
[31]

2025 , eprint=

VILA-M3: Enhancing Vision-Language Models with Medical Expert Knowledge , author=. 2025 , eprint=

2025
[32]

2025 , eprint=

OmniV-Med: Scaling Medical Vision-Language Model for Universal Visual Understanding , author=. 2025 , eprint=

2025
[33]

Nature Communications , volume=

Towards generalist foundation model for radiology by leveraging web-scale 2d&3d medical data , author=. Nature Communications , volume=. 2025 , publisher=

2025
[34]

ISPRS Journal of Photogrammetry and Remote Sensing, , volume=

Rsgpt: A remote sensing vision language model and benchmark , author=. ISPRS Journal of Photogrammetry and Remote Sensing, , volume=. 2025 , publisher=

2025
[35]

2025 , howpublished =

Standards for interpretation and reporting of imaging investigations , author =. 2025 , howpublished =

2025
[36]

2023 , eprint=

GPT-4 Technical Report , author=. 2023 , eprint=

2023
[37]

arXiv preprint arXiv:2312.11805 , year=

Gemini: a family of highly capable multimodal models , author=. arXiv preprint arXiv:2312.11805 , year=

Pith/arXiv arXiv
[38]

IEEE Geoscience and Remote Sensing Magazine , volume=

BigEarthNet-MM: A large-scale, multimodal, multilabel benchmark archive for remote sensing image classification and retrieval [software and data sets] , author=. IEEE Geoscience and Remote Sensing Magazine , volume=. 2021 , publisher=

2021
[39]

ICML , pages=

Blip-2: Bootstrapping language-image pre-training with frozen image encoders and large language models , author=. ICML , pages=. 2023 , organization=

2023
[40]

NeurIPS , volume=

Visual instruction tuning , author=. NeurIPS , volume=
[41]

TGRS , volume=

RSVQA: Visual question answering for remote sensing data , author=. TGRS , volume=. 2020 , publisher=

2020
[42]

TGRS , volume=

Rsvg: Exploring data and models for visual grounding on remote sensing data , author=. TGRS , volume=. 2023 , publisher=

2023
[43]

arXiv preprint arXiv:2410.18792 , year=

An llm agent for automatic geospatial data analysis , author=. arXiv preprint arXiv:2410.18792 , year=

arXiv
[44]

TGRS , volume=

Exploring models and data for remote sensing image caption generation , author=. TGRS , volume=. 2017 , publisher=

2017
[45]

CVPR , pages=

Glamm: Pixel grounding large multimodal model , author=. CVPR , pages=
[46]

arXiv preprint arXiv:2501.04001 , year=

Sa2VA: Marrying SAM2 with LLaVA for Dense Grounded Understanding of Images and Videos , author=. arXiv preprint arXiv:2501.04001 , year=

Pith/arXiv arXiv
[47]

Remote Sensing , volume=

Rs-llava: A large vision-language model for joint captioning and question answering in remote sensing imagery , author=. Remote Sensing , volume=. 2024 , publisher=

2024
[48]

IEEE Geoscience and Remote Sensing Magazine , volume=

2023 ieee grss data fusion contest: Large-scale fine-grained building classification for semantic urban reconstruction [technical committees] , author=. IEEE Geoscience and Remote Sensing Magazine , volume=. 2023 , publisher=

2023
[49]

arXiv preprint arXiv:2501.10891 , year=

OpenEarthMap-SAR: A Benchmark Synthetic Aperture Radar Dataset for Global High-Resolution Land Cover Mapping , author=. arXiv preprint arXiv:2501.10891 , year=

arXiv
[50]

arXiv preprint arXiv:2502.01002 , year=

Multi-Resolution SAR and Optical Remote Sensing Image Registration Methods: A Review, Datasets, and Future Perspectives , author=. arXiv preprint arXiv:2502.01002 , year=

arXiv
[51]

CVPR Workshops , pages=

SpaceNet 6: Multi-sensor all weather mapping dataset , author=. CVPR Workshops , pages=
[52]

International Journal of Applied Earth Observation and Geoinformation , volume=

MCANet: A joint semantic segmentation framework of optical and SAR images for land use classification , author=. International Journal of Applied Earth Observation and Geoinformation , volume=. 2022 , publisher=

2022
[53]

5-vl technical report , author=

Qwen2. 5-vl technical report , author=. arXiv preprint arXiv:2502.13923 , year=

Pith/arXiv arXiv
[54]

ECCV , pages=

Sharegpt4v: Improving large multi-modal models with better captions , author=. ECCV , pages=. 2024 , organization=

2024
[55]

IEEE Geoscience and Remote Sensing Magazine , volume=

Language Integration in Remote Sensing: Tasks, datasets, and future directions , author=. IEEE Geoscience and Remote Sensing Magazine , volume=. 2023 , publisher=

2023
[56]

Nature , volume=

Using machine learning to assess the livelihood impact of electricity access , author=. Nature , volume=. 2022 , publisher=

2022
[57]

Nature climate change , volume=

The role of satellite remote sensing in climate change studies , author=. Nature climate change , volume=. 2013 , publisher=

2013
[58]

Proceedings of the IEEE , volume=

Remote sensing image scene classification: Benchmark and state of the art , author=. Proceedings of the IEEE , volume=. 2017 , publisher=

2017
[59]

IEEE Journal of Selected Topics in Applied Earth Observations and Remote Sensing , volume=

Remote sensing image scene classification meets deep learning: Challenges, methods, benchmarks, and opportunities , author=. IEEE Journal of Selected Topics in Applied Earth Observations and Remote Sensing , volume=. 2020 , publisher=

2020
[60]

IEEE Geoscience and Remote Sensing Letters , volume=

Region-of-interest extraction based on frequency domain analysis and salient region detection for remote sensing image , author=. IEEE Geoscience and Remote Sensing Letters , volume=. 2013 , publisher=

2013
[61]

Computers, Environment and Urban Systems , volume=

Knowledge-based region labeling for remote sensing image interpretation , author=. Computers, Environment and Urban Systems , volume=. 2012 , publisher=

2012
[62]

ISPRS Journal of Photogrammetry and Remote Sensing, , volume=

Remote sensing image segmentation advances: A meta-analysis , author=. ISPRS Journal of Photogrammetry and Remote Sensing, , volume=. 2021 , publisher=

2021
[63]

ISPRS Journal of Photogrammetry and Remote Sensing, , volume=

Segmentation for Object-Based Image Analysis (OBIA): A review of algorithms and challenges from remote sensing perspective , author=. ISPRS Journal of Photogrammetry and Remote Sensing, , volume=. 2019 , publisher=

2019
[64]

Remote Sensing of Environment , volume=

Improving the impervious surface estimation with combined use of optical and SAR remote sensing images , author=. Remote Sensing of Environment , volume=. 2014 , publisher=

2014
[65]

5 technical report , author=

Qwen2. 5 technical report , author=. arXiv preprint arXiv:2412.15115 , year=

Pith/arXiv arXiv
[66]

arXiv preprint arXiv:2503.18478 , year=

Video-xl-pro: Reconstructive token compression for extremely long video understanding , author=. arXiv preprint arXiv:2503.18478 , year=

arXiv
[67]

arXiv preprint arXiv:2406.04264 , year=

Mlvu: A comprehensive benchmark for multi-task long video understanding , author=. arXiv preprint arXiv:2406.04264 , year=

Pith/arXiv arXiv
[68]

arXiv preprint arXiv:2406.12384 , year=

Vrsbench: A versatile vision-language benchmark dataset for remote sensing image understanding , author=. arXiv preprint arXiv:2406.12384 , year=

arXiv
[69]

arXiv preprint arXiv:2406.10100 , year=

Skysensegpt: A fine-grained instruction tuning dataset and model for remote sensing vision-language understanding , author=. arXiv preprint arXiv:2406.10100 , year=

arXiv
[70]

AAAI , volume=

Urbench: A comprehensive benchmark for evaluating large multimodal models in multi-view urban scenarios , author=. AAAI , volume=
[71]

ICML , year=

GeoPixel: Pixel Grounding Large Multimodal Model in Remote Sensing , author=. ICML , year=
[72]

arXiv preprint arXiv:2411.19325 , year=

GEOBench-VLM: Benchmarking Vision-Language Models for Geospatial Tasks , author=. arXiv preprint arXiv:2411.19325 , year=

arXiv
[73]

arXiv preprint arXiv:2503.23771 , year=

XLRS-Bench: Could Your Multimodal LLMs Understand Extremely Large Ultra-High-Resolution Remote Sensing Imagery? , author=. arXiv preprint arXiv:2503.23771 , year=

arXiv
[74]

CVPR , pages=

Good at captioning bad at counting: Benchmarking gpt-4v on earth observation data , author=. CVPR , pages=
[75]

AAAI , volume=

Vhm: Versatile and honest vision language model for remote sensing image analysis , author=. AAAI , volume=
[76]

IGARSS , pages=

Fusion of SAR and optical remote sensing data—Challenges and recent trends , author=. IGARSS , pages=. 2017 , organization=

2017
[77]

arXiv preprint arXiv:2406.19280 , year=

Huatuogpt-vision, towards injecting medical visual knowledge into multimodal llms at scale , author=. arXiv preprint arXiv:2406.19280 , year=

arXiv
[78]

arXiv preprint arXiv:2502.09838 , year=

Healthgpt: A medical large vision-language model for unifying comprehension and generation via heterogeneous knowledge adaptation , author=. arXiv preprint arXiv:2502.09838 , year=

arXiv
[79]

arXiv preprint arXiv:2412.07769 , year=

Bimedix2: Bio-medical expert lmm for diverse medical modalities , author=. arXiv preprint arXiv:2412.07769 , year=

arXiv
[80]

arXiv preprint arXiv:2404.16994 , year=

Pllava: Parameter-free llava extension from images to videos for video dense captioning , author=. arXiv preprint arXiv:2404.16994 , year=

Pith/arXiv arXiv

Showing first 80 references.

[1] [1]

Aho and Jeffrey D

Alfred V. Aho and Jeffrey D. Ullman , title =. 1972

1972

[2] [2]

Publications Manual , year = "1983", publisher =

1983

[3] [3]

and Kozen, Dexter C

Ashok K. Chandra and Dexter C. Kozen and Larry J. Stockmeyer , year = "1981", title =. doi:10.1145/322234.322243

work page doi:10.1145/322234.322243 1981

[4] [4]

Scalable training of

Andrew, Galen and Gao, Jianfeng , booktitle=. Scalable training of

[5] [5]

Dan Gusfield , title =. 1997

1997

[6] [6]

Tetreault , title =

Mohammad Sadegh Rasooli and Joel R. Tetreault , title =. Computing Research Repository , volume =. 2015 , url =

2015

[7] [7]

A Framework for Learning Predictive Structures from Multiple Tasks and Unlabeled Data , Volume =

Ando, Rie Kubota and Zhang, Tong , Issn =. A Framework for Learning Predictive Structures from Multiple Tasks and Unlabeled Data , Volume =. Journal of Machine Learning Research , Month = dec, Numpages =

[8] [8]

2023 , eprint=

SA-Med2D-20M Dataset: Segment Anything in 2D Medical Imaging with 20 Million masks , author=. 2023 , eprint=

2023

[9] [9]

Computer Vision -- ECCV 2022 , year =

2022

[10] [10]

Langley , title =

P. Langley , title =. Proceedings of the 17th International Conference on Machine Learning (ICML 2000) , address =. 2000 , pages =

2000

[11] [11]

T. M. Mitchell. The Need for Biases in Learning Generalizations. 1980

1980

[12] [12]

M. J. Kearns , title =

[13] [13]

Machine Learning: An Artificial Intelligence Approach, Vol. I. 1983

1983

[14] [14]

R. O. Duda and P. E. Hart and D. G. Stork. Pattern Classification. 2000

2000

[15] [15]

Suppressed for Anonymity , author=

[16] [16]

Newell and P

A. Newell and P. S. Rosenbloom. Mechanisms of Skill Acquisition and the Law of Practice. Cognitive Skills and Their Acquisition. 1981

1981

[17] [17]

A. L. Samuel. Some Studies in Machine Learning Using the Game of Checkers. IBM Journal of Research and Development. 1959

1959

[18] [18]

FirstName LastName , title =

[19] [19]

FirstName Alpher , title =

[20] [20]

Journal of Foo , volume = 13, number = 1, pages =

FirstName Alpher and FirstName Fotheringham-Smythe , title =. Journal of Foo , volume = 13, number = 1, pages =

[21] [21]

Journal of Foo , volume = 14, number = 1, pages =

FirstName Alpher and FirstName Fotheringham-Smythe and FirstName Gamow , title =. Journal of Foo , volume = 14, number = 1, pages =

[22] [22]

FirstName Alpher and FirstName Gamow , title =

[23] [23]

Scaling Learning Algorithms Towards

Bengio, Yoshua and LeCun, Yann , booktitle =. Scaling Learning Algorithms Towards

[24] [24]

and Osindero, Simon and Teh, Yee Whye , journal =

Hinton, Geoffrey E. and Osindero, Simon and Teh, Yee Whye , journal =. A Fast Learning Algorithm for Deep Belief Nets , volume =

[25] [25]

2016 , publisher=

Deep learning , author=. 2016 , publisher=

2016
