MSEarth: A Multimodal Benchmark for Earth Science Phenomenon Discovery with MLLMs

Ben Fei; Bo Liu; Fenghua Ling; Lei Bai; Wanghan Xu; Wenlong Zhang; Xiangyu Zhao; Xiao-Ming Wu; Xiaoyu Yue; Yuhao Zhou

arxiv: 2505.20740 · v3 · submitted 2025-05-27 · 💻 cs.AI

MSEarth: A Multimodal Benchmark for Earth Science Phenomenon Discovery with MLLMs

Xiangyu Zhao , Wanghan Xu , Bo Liu , Yuhao Zhou , Fenghua Ling , Ben Fei , Xiaoyu Yue , Lei Bai

show 2 more authors

Wenlong Zhang Xiao-Ming Wu

This is my paper

Pith reviewed 2026-05-19 14:02 UTC · model grok-4.3

classification 💻 cs.AI

keywords multimodal benchmarkearth scienceMLLMsscientific reasoningfigure captioninggeoscience datasetmultimodal dataset

0 comments

The pith

MSEarth introduces a benchmark of over 289K real figures from Earth science papers to test multimodal models on graduate-level geoscience reasoning.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper argues that current benchmarks for multimodal AI in science fall short because they use artificial or overly simple data that does not reflect the complexity of real research. To fix this, it creates MSEarth by pulling figures and enriched captions directly from thousands of open-access Earth science publications across atmosphere, cryosphere, hydrosphere, lithosphere, and biosphere. The benchmark includes tasks like captioning figures, answering multiple-choice questions, and open-ended reasoning. If successful, this would let researchers train and test AI systems on the kind of nuanced interpretation that working scientists actually do.

Core claim

MSEarth is a multimodal scientific dataset and benchmark curated from high-quality, open-access publications, featuring over 289K figures with refined captions that include contextual discussions and reasoning, covering the five major spheres of Earth science, and supporting tasks such as scientific figure captioning, multiple choice questions, and open-ended reasoning.

What carries the argument

The curation of figures and context-rich captions directly from original open-access papers to create authentic examples of geoscientific reasoning.

If this is right

MLLMs can be evaluated more accurately on their ability to handle complex Earth science phenomena.
The resource supports development of models for multimodal scientific reasoning across multiple Earth spheres.
It enables tasks that require integrating visual data with textual context from actual research publications.
The benchmark scales to allow ongoing testing as models improve.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Similar curation methods could be applied to create benchmarks in other scientific fields that rely on figures and detailed reasoning.
Improved models from this data might help analyze large volumes of Earth observation imagery for pattern discovery.
The emphasis on real papers points toward future use of AI to assist in interpreting unpublished or interdisciplinary geoscience data.

Load-bearing premise

That figures and captions taken from published papers accurately capture the nuanced graduate-level geoscientific reasoning needed for real applications.

What would settle it

A comparison showing that models trained or tested on MSEarth perform no better than those using simpler datasets when applied to new, real-world Earth science problems.

Figures

Figures reproduced from arXiv: 2505.20740 by Ben Fei, Bo Liu, Fenghua Ling, Lei Bai, Wanghan Xu, Wenlong Zhang, Xiangyu Zhao, Xiao-Ming Wu, Xiaoyu Yue, Yuhao Zhou.

**Figure 2.** Figure 2: Data curation process for MSEarth. The two parts on the left represent data preprocessing, while the two [PITH_FULL_IMAGE:figures/full_fig_p004_2.png] view at source ↗

**Figure 3.** Figure 3: Overall approach of our multi-agent, voting-based approach to automate the validation of generated [PITH_FULL_IMAGE:figures/full_fig_p005_3.png] view at source ↗

**Figure 4.** Figure 4: Subjects distribution in MSEarth. well on scientific question-answering tasks, with proprietary models generally achieving better results. Further analysis of the models’ failure rates on reasoning and perception-based questions is provided in the Appendix ( [PITH_FULL_IMAGE:figures/full_fig_p007_4.png] view at source ↗

**Figure 5.** Figure 5: Comparison of Spearman correlations across [PITH_FULL_IMAGE:figures/full_fig_p008_5.png] view at source ↗

**Figure 6.** Figure 6: Illustrative examples of the diverse types of scientific figures in MSEarth, sourced from open-access [PITH_FULL_IMAGE:figures/full_fig_p013_6.png] view at source ↗

**Figure 7.** Figure 7: Examples of the three types of scientific question-answering tasks presented in our benchmark. [PITH_FULL_IMAGE:figures/full_fig_p013_7.png] view at source ↗

**Figure 1.** Figure 1: FIG. 1. Geological map indicating fault zones and locked segments in [PITH_FULL_IMAGE:figures/full_fig_p015_1.png] view at source ↗

**Figure 9.** Figure 9: Prompt for retaining Earth observation images. [PITH_FULL_IMAGE:figures/full_fig_p015_9.png] view at source ↗

**Figure 10.** Figure 10: Proportion of valid and invalid data after [PITH_FULL_IMAGE:figures/full_fig_p017_10.png] view at source ↗

**Figure 11.** Figure 11: Prompt for generating refined captions. whether the provided answer is accurate, categorizing them as Correct or Incorrect based on the accuracy of the answer. After manual screening, 216 invalid entries were identified in the MCQ task, and 89 invalid entries were found in the openended task. To evaluate the effectiveness of our multi-agent filtering process, we conducted a statistical analysis of the … view at source ↗

**Figure 12.** Figure 12: Prompt for generating VQAs. Aquatic Ecology and Limnological Ecology, Biogeochemistry, Biogeography. Hydrology: Hydrology, Hydrogeology, Limnology, River Hydrology and Estuarine Hydrology, Groundwater Hydrology, Regional Hydrology, Ecohydrology, Hydrological Physics, Hydrological Geography, Hydrological Mete19 [PITH_FULL_IMAGE:figures/full_fig_p019_12.png] view at source ↗

**Figure 13.** Figure 13: Prompt for generating normal answers for MCQs. [PITH_FULL_IMAGE:figures/full_fig_p020_13.png] view at source ↗

**Figure 14.** Figure 14: Prompt for generating enhanced answers for MCQs. [PITH_FULL_IMAGE:figures/full_fig_p021_14.png] view at source ↗

**Figure 15.** Figure 15: Models’ accuracy on reasoning and perception problems. [PITH_FULL_IMAGE:figures/full_fig_p022_15.png] view at source ↗

**Figure 16.** Figure 16: An example of easy multiple-choice VQA. QwenVL2.5; for proprietary models, we examine the Gemini-2.5-Flash series. Within the proprietary family, variants with dedicated “thinking” capabilities (e.g., Gemini-2.5-Flash-Thinking) generally outperform counterparts without such capabilities (e.g., Gemini-2.5-Flash). In contrast, for open-source models, adding explicit CoT sometimes leads to performance de… view at source ↗

**Figure 17.** Figure 17: An example of specialized multiple-choice VQA. [PITH_FULL_IMAGE:figures/full_fig_p024_17.png] view at source ↗

**Figure 18.** Figure 18: An example of hard multiple-choice VQA. 24 [PITH_FULL_IMAGE:figures/full_fig_p024_18.png] view at source ↗

**Figure 19.** Figure 19: Prompt for evaluating the quality of generated captions. [PITH_FULL_IMAGE:figures/full_fig_p026_19.png] view at source ↗

**Figure 20.** Figure 20: Prompt for evaluating the quality of generated answers to open-ended questions. [PITH_FULL_IMAGE:figures/full_fig_p027_20.png] view at source ↗

**Figure 21.** Figure 21: Performance of different strategies on MSEarth-mini. [PITH_FULL_IMAGE:figures/full_fig_p027_21.png] view at source ↗

**Figure 22.** Figure 22: Performance comparison of different models under two settings: with and without the original caption. [PITH_FULL_IMAGE:figures/full_fig_p028_22.png] view at source ↗

**Figure 23.** Figure 23: Performance comparison of different models under two settings: with and without the original caption. [PITH_FULL_IMAGE:figures/full_fig_p028_23.png] view at source ↗

**Figure 24.** Figure 24: Performance comparison of different models across various subjects. [PITH_FULL_IMAGE:figures/full_fig_p029_24.png] view at source ↗

**Figure 25.** Figure 25: Case Study of Multiple Choice VQA. 31 [PITH_FULL_IMAGE:figures/full_fig_p031_25.png] view at source ↗

**Figure 26.** Figure 26: Case Study of Open-Ended VQA. 32 [PITH_FULL_IMAGE:figures/full_fig_p032_26.png] view at source ↗

read the original abstract

The rapid advancement of multimodal large language models (MLLMs) offers new opportunities for complex scientific challenges, yet their application in earth science-especially at the graduate level-remains underexplored due to a lack of benchmarks reflecting the depth and complexity of geoscientific reasoning. Existing datasets often rely on synthetic data or simple figure-caption pairs, failing to capture the nuanced reasoning required for real-world applications. To address this, we introduce MSEarth, a multimodal scientific dataset and benchmark curated from high-quality, open-access publications. Covering the five major spheres of Earth science-atmosphere, cryosphere, hydrosphere, lithosphere, and biosphere-MSEarth features over 289K figures with refined captions enriched by contextual discussions and reasoning from the original papers. The benchmark supports tasks such as scientific figure captioning, multiple choice questions, and open-ended reasoning, providing a scalable, high-fidelity resource for developing and evaluating MLLMs in scientific reasoning.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

MSEarth offers a substantial new benchmark dataset from real Earth science papers for MLLM evaluation, though the caption enrichment process lacks sufficient validation details.

read the letter

Hi, The punchline on MSEarth is a new large-scale benchmark for multimodal models in Earth science, built from real open-access publications with enriched figure captions, but the details on how that enrichment was validated are missing, which undercuts the high-fidelity claims. What is new here is the domain focus and scale. The dataset covers the five major spheres—atmosphere, cryosphere, hydrosphere, lithosphere, and biosphere—with over 289K figures. By using actual papers instead of synthetic data or plain caption pairs, it aims to better reflect the complexity of geoscientific reasoning. The benchmark includes tasks for scientific figure captioning, multiple choice questions, and open-ended reasoning, which provides concrete ways to assess MLLMs on these topics. The paper does well in identifying the gap in existing resources for graduate-level scientific applications and in proposing a scalable resource from high-quality sources. This could support faster progress in AI tools for climate analysis and environmental monitoring. The main soft spot is the curation and refinement step. The description says captions are refined by adding contextual discussions and reasoning from the original papers, yet there are no specifics on the process—whether it's manual, automated, or a mix—and no mention of quality controls or inter-annotator agreement. Without that evidence, it's hard to confirm that the dataset truly captures nuanced reasoning rather than adding surface-level context. This is a central assumption, so clarifying it would make the contribution stronger. For readers, this paper is useful for those developing or benchmarking MLLMs in scientific fields, particularly Earth sciences. Anyone working on multimodal reasoning for real-world data would find the task setups and scale relevant. Overall, I recommend sending it for peer review. The dataset is a solid new resource that merits referee input on the methods and potential use cases, even if revisions are needed for the validation aspects. Best, Your colleague

Referee Report

1 major / 1 minor

Summary. The manuscript introduces MSEarth, a multimodal dataset and benchmark for Earth science phenomenon discovery with MLLMs. Curated from high-quality open-access publications, it covers the five major spheres (atmosphere, cryosphere, hydrosphere, lithosphere, biosphere) and contains over 289K figures accompanied by refined captions that incorporate contextual discussions and reasoning drawn from the source papers. The benchmark is positioned to support tasks including scientific figure captioning, multiple-choice questions, and open-ended reasoning, addressing gaps in existing resources that rely on synthetic data or basic figure-caption pairs.

Significance. If the curation and enrichment process can be shown to reliably capture nuanced, graduate-level geoscientific reasoning, MSEarth would constitute a useful large-scale resource for training and evaluating MLLMs on authentic scientific material. The scale and domain coverage could help move the field beyond simplistic or synthetic benchmarks toward more realistic evaluation of scientific reasoning capabilities.

major comments (1)

[Abstract / Curation description] The central claim that the refined captions deliver 'contextual discussions and reasoning from the original papers' and constitute a 'high-fidelity' resource is load-bearing, yet the manuscript supplies no description of the refinement protocol (manual vs. automated, quality controls, inter-annotator agreement, or validation against original paper context). Without this information, it is impossible to evaluate whether the output genuinely exceeds simple figure-caption pairs in depth or introduces bias.

minor comments (1)

[Abstract] The abstract contains a minor typographical issue ('earth science-especially' lacks a space after the hyphen).

Simulated Author's Rebuttal

1 responses · 0 unresolved

We thank the referee for their detailed and constructive review. The feedback highlights an important gap in the current manuscript regarding the description of our caption refinement process. We address this point below and will incorporate the necessary revisions to improve clarity and transparency.

read point-by-point responses

Referee: [Abstract / Curation description] The central claim that the refined captions deliver 'contextual discussions and reasoning from the original papers' and constitute a 'high-fidelity' resource is load-bearing, yet the manuscript supplies no description of the refinement protocol (manual vs. automated, quality controls, inter-annotator agreement, or validation against original paper context). Without this information, it is impossible to evaluate whether the output genuinely exceeds simple figure-caption pairs in depth or introduces bias.

Authors: We agree that the manuscript currently lacks a detailed account of the caption refinement protocol, which is necessary to substantiate claims about the enriched captions providing contextual discussions and reasoning. This omission limits the ability to assess fidelity and potential biases. In the revised manuscript, we will add a new subsection in the Methods or Dataset Curation section that explicitly describes the refinement process. This will include: (1) whether enrichment was performed manually by Earth science experts or via automated LLM-assisted methods with human verification; (2) quality control procedures such as spot-checking against source papers; (3) any inter-annotator agreement metrics collected during the process; and (4) validation steps to confirm that added context reflects the original papers' discussions without introducing unsupported inferences or biases. We believe this addition will directly address the concern and strengthen the paper's contribution. revision: yes

Circularity Check

0 steps flagged

No circularity in dataset curation or benchmark claims

full rationale

The paper presents MSEarth as a curated multimodal dataset drawn from open-access publications, with no mathematical derivations, equations, fitted parameters, predictions, or self-referential loops of any kind. Its central claims concern the scale (289K figures), coverage across five Earth science spheres, and enrichment of captions with contextual reasoning; these are descriptive of an external data resource rather than derived quantities that reduce to the paper's own inputs by construction. No self-citation chains, ansatzes, or uniqueness theorems are invoked to support load-bearing steps. The absence of any derivation chain means the contribution stands or falls on the independent quality of the curation process, which is not circular even if its validation details are limited.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The central claim rests on the domain assumption that real publication-derived figures and enriched captions provide higher fidelity than synthetic alternatives; no free parameters or invented entities are introduced.

axioms (1)

domain assumption Real-world figures and captions from scientific publications better capture nuanced geoscientific reasoning than synthetic data or simple figure-caption pairs.
This premise is stated directly in the abstract as the motivation for creating MSEarth instead of relying on existing datasets.

pith-pipeline@v0.9.0 · 5723 in / 1252 out tokens · 93509 ms · 2026-05-19T14:02:47.471751+00:00 · methodology

discussion (0)

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Foundation/RealityFromDistinction.lean reality_from_one_distinction unclear

?

unclear
Relation between the paper passage and the cited Recognition theorem.

MSEarth features over 289K figures with refined captions enriched by contextual discussions and reasoning from the original papers... covering the five major spheres of Earth science
IndisputableMonolith/Cost/FunctionalEquation.lean washburn_uniqueness_aczel unclear

?

unclear
Relation between the paper passage and the cited Recognition theorem.

We employ GPT-4o for refined caption generation... multi-agent voting... expert validation

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Forward citations

Cited by 2 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Maestro: Reinforcement Learning to Orchestrate Hierarchical Model-Skill Ensembles
cs.LG 2026-05 unverdicted novelty 6.0

Maestro uses outcome-based RL to train a lightweight policy that orchestrates ensembles of frozen expert models and skills, reporting 70.1% average accuracy across ten multimodal benchmarks and outperforming GPT-5 and...
GeoR-Bench: Evaluating Geoscience Visual Reasoning
cs.CV 2026-05 unverdicted novelty 6.0

GeoR-Bench shows top multimodal models reach only 42.7% strict accuracy on geoscience visual reasoning tasks while open-source models reach 10.3%, with outputs often visually plausible yet scientifically inaccurate.

Reference graph

Works this paper leans on

23 extracted references · 23 canonical work pages · cited by 2 Pith papers

[1]

Test-takers will analyze this image to answer the questions

Figure:A scientific or illustrative figure provided as the primary visual input. Test-takers will analyze this image to answer the questions. 2.Caption:A concise summary describing key aspects of the Figure

work page
[2]

However, test-takers cannot access this information

Supplementary:In-depth information (e.g., summarized expert insight, detailed analysis, or background knowledge) that you can use to assist in designing advanced and meaningful questions. However, test-takers cannot access this information. Input Information Provided: •Caption:{raw caption} •Supplementary:{refined caption} Task Instructions:

work page
[3]

•Supplementary Usage:The correct answers are encouraged to be derived from

Use of Input Sources: • Ensure that no question can be answered entirely using Caption without observations. •Supplementary Usage:The correct answers are encouraged to be derived from

work page
[4]

• Open-Ended Questions:At least2questions must be open-ended, requiring concise and precise answers (no more than 4 words)

Question Types: • Multiple Choice Questions (MCQs):At least2questions must be of this type, with 4 distinct options (A-D) and one correct answer. • Open-Ended Questions:At least2questions must be open-ended, requiring concise and precise answers (no more than 4 words)

work page
[5]

The chain explains the logical process by which the correct answer can be determined

Reasoning Chains: • For every question, you must include a reasoning chain. The chain explains the logical process by which the correct answer can be determined. • The reasoning chain must:

work page
[6]

Output Structure: The output must be written inJSON format

work page
[7]

These ques- tions should require the test-taker to refer to in-depth knowledge and insights not immediately visible in the Figure or Caption

Questions that are grounded in the Supplementary context are highly encouraged. These ques- tions should require the test-taker to refer to in-depth knowledge and insights not immediately visible in the Figure or Caption

work page
[8]

According to the Supplementary

Avoid referencing the Supplementary in any question andreasoning_chain (e.g., "According to the Supplementary" or "The Supplementary states"). Provide your response below: Figure 12: Prompt for generating VQAs. Aquatic Ecology and Limnological Ecology, Biogeochemistry, Biogeography. Hydrology: Hydrology, Hydrogeology, Limnol- ogy, River Hydrology and Estu...

work page
[9]

Carefully analyze the input image and the provided query

work page
[10]

Based on the image, select the correct option (e.g., ’A’, ’B’, ’C’) or directly state the correct option content

work page
[11]

answer":

Provide reasoning explaining how to derive the correct answer. Input: •Query:{query} Output Format: The output must be written inJSON formatusing the structure below: { "answer": "Correct option or short answer", "Explanation": "Explaining how to derive correct answer." } Figure 13: Prompt for generating normal answers for MCQs. orology, Hydrological Meas...

work page
[12]

The download paths for specific models and the versions of models ac- cessed via API are provided in Figure 10

for local testing; for proprietary models, we conduct tests via API calls. The download paths for specific models and the versions of models ac- cessed via API are provided in Figure 10. J Evaluation Metrics J.1 MLLM-based Metrics Following G-Eval (Liu et al., 2023b), we utilize MLLM (Qwen2.5-VL-72B) with a specialized prompt to compute a factual scientif...

work page
[13]

Carefully analyze the input image and its caption

work page
[14]

answer":

Based on the image and caption, select the correct option (e.g., ’A’, ’B’, ’C’) or directly state the correct option content. Output Format: The output must be written inJSON formatusing the structure below: { "answer": "Correct option or short answer", "Explanation": "Explaining how to derive correct answer." } Figure 14: Prompt for generating enhanced a...

work page arXiv 2024
[15]

Scientific Accuracy:Does the generated caption accurately describe the scientific content of the figure or image?

work page
[16]

Clarity and Coherence:Is the caption well-structured, logically organized, and easy to understand?

work page
[17]

Relevance and Completeness:Does the caption provide all necessary information to under- stand the figure or image? Evaluation Steps:

work page
[18]

Assess whether the generated caption aligns with the scientific content and intent of the standard caption

Compare theGenerated Captionto theStandard Caption. Assess whether the generated caption aligns with the scientific content and intent of the standard caption

work page
[19]

Input: •Standard Caption:{response} •Generated Caption:{generated_caption} Important Instructions: • Only output the score in the specified JSON format

Assign a score for coherence on a scale of 1 to 5, where 1 is the lowest and 5 is the highest, based on the Evaluation Criteria. Input: •Standard Caption:{response} •Generated Caption:{generated_caption} Important Instructions: • Only output the score in the specified JSON format. • Do not provide any explanations, comments, or additional text. Output For...

work page
[20]

Based on the refined caption, question, and standard answer, determine if the generated answer is correct

work page
[21]

Only output the determination in the specified JSON format

work page
[22]

is_correct

Do not provide any explanations, comments, or additional text. Output Format: The output must be written inJSON formatusing the structure below: { "is_correct": true or false } Figure 20: Prompt for evaluating the quality of generated answers to open-ended questions. Gemini2.5-Pro-thinking Gemini2.5-Flash InternVL3-78B Qwen2.5-VL-72B 25 30 35 40 45 50 55P...

work page 1992
[23]

RS=Atmospheric Remote Sensing, Ecosys

Abbreviations: Meteor.=Meteorology, Climat.=Climatology, Atmos. RS=Atmospheric Remote Sensing, Ecosys. Ecol.=Ecosystem Ecology, Landsc. Ecol.=Landscape Ecology, Aquat. Ecol.=Aquatic & Limnological Ecology, Phys. Geog.=Physical Geography, Reg. Geog.=Regional Geography, Sediment.=Sedimentology, Struct. Geol.=Structural Geology, Quat. Geol.=Quaternary Geolog...

work page 2010

[1] [1]

Test-takers will analyze this image to answer the questions

Figure:A scientific or illustrative figure provided as the primary visual input. Test-takers will analyze this image to answer the questions. 2.Caption:A concise summary describing key aspects of the Figure

work page

[2] [2]

However, test-takers cannot access this information

Supplementary:In-depth information (e.g., summarized expert insight, detailed analysis, or background knowledge) that you can use to assist in designing advanced and meaningful questions. However, test-takers cannot access this information. Input Information Provided: •Caption:{raw caption} •Supplementary:{refined caption} Task Instructions:

work page

[3] [3]

•Supplementary Usage:The correct answers are encouraged to be derived from

Use of Input Sources: • Ensure that no question can be answered entirely using Caption without observations. •Supplementary Usage:The correct answers are encouraged to be derived from

work page

[4] [4]

• Open-Ended Questions:At least2questions must be open-ended, requiring concise and precise answers (no more than 4 words)

Question Types: • Multiple Choice Questions (MCQs):At least2questions must be of this type, with 4 distinct options (A-D) and one correct answer. • Open-Ended Questions:At least2questions must be open-ended, requiring concise and precise answers (no more than 4 words)

work page

[5] [5]

The chain explains the logical process by which the correct answer can be determined

Reasoning Chains: • For every question, you must include a reasoning chain. The chain explains the logical process by which the correct answer can be determined. • The reasoning chain must:

work page

[6] [6]

Output Structure: The output must be written inJSON format

work page

[7] [7]

These ques- tions should require the test-taker to refer to in-depth knowledge and insights not immediately visible in the Figure or Caption

Questions that are grounded in the Supplementary context are highly encouraged. These ques- tions should require the test-taker to refer to in-depth knowledge and insights not immediately visible in the Figure or Caption

work page

[8] [8]

According to the Supplementary

Avoid referencing the Supplementary in any question andreasoning_chain (e.g., "According to the Supplementary" or "The Supplementary states"). Provide your response below: Figure 12: Prompt for generating VQAs. Aquatic Ecology and Limnological Ecology, Biogeochemistry, Biogeography. Hydrology: Hydrology, Hydrogeology, Limnol- ogy, River Hydrology and Estu...

work page

[9] [9]

Carefully analyze the input image and the provided query

work page

[10] [10]

Based on the image, select the correct option (e.g., ’A’, ’B’, ’C’) or directly state the correct option content

work page

[11] [11]

answer":

Provide reasoning explaining how to derive the correct answer. Input: •Query:{query} Output Format: The output must be written inJSON formatusing the structure below: { "answer": "Correct option or short answer", "Explanation": "Explaining how to derive correct answer." } Figure 13: Prompt for generating normal answers for MCQs. orology, Hydrological Meas...

work page

[12] [12]

The download paths for specific models and the versions of models ac- cessed via API are provided in Figure 10

for local testing; for proprietary models, we conduct tests via API calls. The download paths for specific models and the versions of models ac- cessed via API are provided in Figure 10. J Evaluation Metrics J.1 MLLM-based Metrics Following G-Eval (Liu et al., 2023b), we utilize MLLM (Qwen2.5-VL-72B) with a specialized prompt to compute a factual scientif...

work page

[13] [13]

Carefully analyze the input image and its caption

work page

[14] [14]

answer":

Based on the image and caption, select the correct option (e.g., ’A’, ’B’, ’C’) or directly state the correct option content. Output Format: The output must be written inJSON formatusing the structure below: { "answer": "Correct option or short answer", "Explanation": "Explaining how to derive correct answer." } Figure 14: Prompt for generating enhanced a...

work page arXiv 2024

[15] [15]

Scientific Accuracy:Does the generated caption accurately describe the scientific content of the figure or image?

work page

[16] [16]

Clarity and Coherence:Is the caption well-structured, logically organized, and easy to understand?

work page

[17] [17]

Relevance and Completeness:Does the caption provide all necessary information to under- stand the figure or image? Evaluation Steps:

work page

[18] [18]

Assess whether the generated caption aligns with the scientific content and intent of the standard caption

Compare theGenerated Captionto theStandard Caption. Assess whether the generated caption aligns with the scientific content and intent of the standard caption

work page

[19] [19]

Input: •Standard Caption:{response} •Generated Caption:{generated_caption} Important Instructions: • Only output the score in the specified JSON format

Assign a score for coherence on a scale of 1 to 5, where 1 is the lowest and 5 is the highest, based on the Evaluation Criteria. Input: •Standard Caption:{response} •Generated Caption:{generated_caption} Important Instructions: • Only output the score in the specified JSON format. • Do not provide any explanations, comments, or additional text. Output For...

work page

[20] [20]

Based on the refined caption, question, and standard answer, determine if the generated answer is correct

work page

[21] [21]

Only output the determination in the specified JSON format

work page

[22] [22]

is_correct

Do not provide any explanations, comments, or additional text. Output Format: The output must be written inJSON formatusing the structure below: { "is_correct": true or false } Figure 20: Prompt for evaluating the quality of generated answers to open-ended questions. Gemini2.5-Pro-thinking Gemini2.5-Flash InternVL3-78B Qwen2.5-VL-72B 25 30 35 40 45 50 55P...

work page 1992

[23] [23]

RS=Atmospheric Remote Sensing, Ecosys

Abbreviations: Meteor.=Meteorology, Climat.=Climatology, Atmos. RS=Atmospheric Remote Sensing, Ecosys. Ecol.=Ecosystem Ecology, Landsc. Ecol.=Landscape Ecology, Aquat. Ecol.=Aquatic & Limnological Ecology, Phys. Geog.=Physical Geography, Reg. Geog.=Regional Geography, Sediment.=Sedimentology, Struct. Geol.=Structural Geology, Quat. Geol.=Quaternary Geolog...

work page 2010