pith. machine review for the scientific record.

arxiv: 2605.01250 · v1 · submitted 2026-05-02 · 💻 cs.AI

Recognition: unknown

EO-Gym: A Multimodal, Interactive Environment for Earth Observation Agents

Authors on Pith: no claims yet

Pith reviewed 2026-05-09 14:59 UTC · model grok-4.3

classification 💻 cs.AI
keywords earth observation · multimodal agents · interactive environments · vision-language models · tool use · satellite imagery · temporal reasoning · cross-modal analysis

The pith

Earth Observation tasks require interactive tool use across time, space, and sensors that current general models handle poorly.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents EO-Gym as an executable Gymnasium-style workspace that lets agents resolve uncertainty in satellite data by calling tools to expand areas of interest, pull historical observations, and switch between optical and radar imagery. It supplies a benchmark of 9,078 trajectories drawn from public datasets and real Landsat/Sentinel-2 imagery, together with 35 specialized tools grouped into six families. Tests on ten vision-language models show persistent weakness on temporal and cross-modal sequences, while fine-tuning one 4B model on the new trajectories raises overall Pass@3 from 0.49 to 0.74. The framing matters because real EO analysis is evidence-gathering that unfolds over multiple steps rather than fitting a single fixed input.

Core claim

EO-Gym formulates EO analysis as a controlled local geospatial workspace backed by more than 660k multimodal files indexed by location, time, and sensor, equipped with 35 EO-specialized tools. On the resulting EO-Gym-Data benchmark of 9,078 trajectories and 34,604 reasoning steps, strong general-purpose models continue to struggle with interactive reasoning that spans temporal and cross-modal workflows; a reference model fine-tuned on the data improves Pass@3 from 0.49 to 0.74 in the main setting.
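
A minimal sketch of the location/time/sensor index this claim assumes, for intuition only; the names here (EOFile, WorkspaceIndex, query) are illustrative placeholders, not the paper's actual API:

    # Hypothetical sketch of a local EO workspace index; not the paper's implementation.
    from dataclasses import dataclass
    from datetime import date

    @dataclass
    class EOFile:
        path: str
        lon: float          # footprint center, degrees east
        lat: float          # footprint center, degrees north
        acquired: date
        sensor: str         # e.g. "optical" or "sar"

    class WorkspaceIndex:
        """Minimal index over multimodal EO files, keyed by location, time, and sensor."""

        def __init__(self, files):
            self.files = list(files)

        def query(self, bbox, start, end, sensor=None):
            """Return files whose center falls inside bbox and inside [start, end]."""
            lon_min, lat_min, lon_max, lat_max = bbox
            return [
                f for f in self.files
                if lon_min <= f.lon <= lon_max
                and lat_min <= f.lat <= lat_max
                and start <= f.acquired <= end
                and (sensor is None or f.sensor == sensor)
            ]

    index = WorkspaceIndex([
        EOFile("s2/tile_a.tif", 149.1, -35.3, date(2024, 1, 9), "optical"),
        EOFile("s1/tile_a.tif", 149.1, -35.3, date(2024, 1, 11), "sar"),
    ])
    # Retrieve a SAR fallback for a cloudy optical scene over the same area.
    print(index.query((148.0, -36.0, 150.0, -34.0), date(2024, 1, 1), date(2024, 2, 1), sensor="sar"))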

What carries the argument

A Gymnasium-style executable workspace with 35 EO-specialized tools that lets agents perform multi-step evidence gathering across geospatial, temporal, and sensing-modality dimensions.
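
The interaction pattern this implies, sketched against the Gymnasium reset/step convention; the three toy tools below (expand_aoi, get_history, switch_sensor) are stand-ins for the paper's 35, not its actual interface:

    # Toy environment illustrating tool-call actions in a reset/step loop.
    class ToyEOEnv:
        def __init__(self):
            self.tools = {
                "expand_aoi": lambda: "wider optical crop",
                "get_history": lambda: "time series 2019-2024",
                "switch_sensor": lambda: "SAR view of the same AOI",
            }

        def reset(self):
            self.steps = 0
            return {"question": "Has the field flooded?", "image": "initial crop"}, {}

        def step(self, action):
            # Each action is a tool call; the observation is new multimodal evidence.
            self.steps += 1
            obs = {"evidence": self.tools[action]()}
            terminated = self.steps >= 3  # the agent answers once evidence suffices
            return obs, 0.0, terminated, False, {}

    env = ToyEOEnv()
    obs, info = env.reset()
    for action in ["expand_aoi", "get_history", "switch_sensor"]:
        obs, reward, terminated, truncated, info = env.step(action)
        print(action, "->", obs)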

If this is right

  • Training agents on interactive EO trajectories produces measurable gains in multi-step Pass@3 performance.
  • Benchmarks that collapse EO work into single-turn inputs will systematically underestimate the difficulty of realistic analysis.
  • Planning across location, time, and modality becomes a learnable skill once an executable tool interface is supplied.
  • Fine-tuning on grounded trajectories can close part of the gap between general-purpose vision-language models and domain-specific EO needs.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Environments of this kind could be adapted to train agents that combine satellite data with ground sensors for applications such as crop monitoring or flood mapping.
  • The same trajectory-collection approach might reveal similar gaps in other data-rich scientific domains that require repeated evidence gathering, such as materials discovery or climate reanalysis.
  • Future evaluations could measure whether agents trained here transfer to new satellite instruments or geographic regions not seen during construction of the benchmark.

Load-bearing premise

The 35 tools and 9,078 trajectories built from eight public datasets plus Landsat and Sentinel-2 imagery faithfully represent the essential interactive workflows and uncertainty-resolution steps used in real Earth Observation practice.

What would settle it

A controlled test in which models trained inside EO-Gym show no improvement over baselines when evaluated on a fresh collection of real analyst tasks that demand tool sequences or sensor combinations absent from the training trajectories.
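
One hedged way such a test could be scored, assuming only per-task success counts for the fine-tuned model and a baseline on the fresh task set; the counts below are placeholders, not reported results:

    import math

    def two_proportion_z(successes_a, n_a, successes_b, n_b):
        """Pooled two-proportion z statistic for a difference in pass rates."""
        p_a, p_b = successes_a / n_a, successes_b / n_b
        p = (successes_a + successes_b) / (n_a + n_b)
        se = math.sqrt(p * (1 - p) * (1 / n_a + 1 / n_b))
        return (p_a - p_b) / se

    # Illustrative numbers only: fine-tuned 140/200 vs baseline 120/200 on unseen analyst tasks.
    z = two_proportion_z(140, 200, 120, 200)
    print(f"z = {z:.2f}")  # |z| below 1.96 would indicate no significant improvement at the 5% level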

Figures

Figures reproduced from arXiv: 2605.01250 by John A. Taylor, Ruibiao Zhu, Sai Ma, Sichao Li, Tony Boston, Xinyue Xu, Zhuang Li.

Figure 1
Figure 1: The EO-Gym framework operationalizes EO as an interactive evidence-acquisition problem. A controlled environment (left) enables dynamic exploration across space, time, and modality. An iterative pipeline (middle) synthesizes a large-scale trajectory dataset. Finally, a unified evaluation protocol (right) assesses fine-tuned and baseline models.
Figure 2
Figure 2: Overview of EO-GYM-DATA statistics. Panel (a) shows the taxonomy of six EO task categories and 18 question types. Panel (b) shows the geographical distribution of the geolocated trajectory subset only, with red markers for training examples and blue markers for held-out test examples. Panels (c) and (d) summarize the dataset's temporal spans and trajectory-length distribution.
Figure 3
Figure 3: Qualitative examples from EO-GYM-DATA across six EO task categories. Each example shows a complete interactive trajectory: an EO question (left), the sequential execution of EO tools to gather spatial, temporal, or cross-modal evidence (center), and the resulting verifiable answer (right).
Figure 4
Figure 4: An initial cropped observation from the FAIR1M dataset used for spatial navigation generation. The generated question asks the agent to count complete road intersections (the one at the bottom middle is ambiguous), requiring the agent to pan the view to resolve objects partially visible at the image boundaries.
read the original abstract

Earth Observation (EO) analysis is inherently interactive: resolving uncertainty often requires expanding the region of interest, retrieving historical observations, and switching across sensors such as optical and Synthetic Aperture Radar. However, most EO benchmarks collapse this process into fixed-input, single-turn tasks. To address this gap, we present EO-Gym, a controlled executable framework for multimodal, tool-using EO agents that formulates EO analysis as a Gymnasium-style local geospatial workspace backed by more than 660k multimodal files indexed by location, time, and sensor type, with 35 EO-specialized tools spanning six task families. Built on this environment, we construct EO-Gym-Data, a benchmark of 9,078 trajectories and 34,604 reasoning steps, and grounded in eight public EO datasets together with Landsat and Sentinel-2 imagery. Evaluating 10 open and closed VLMs shows that strong general-purpose models still struggle with interactive EO reasoning, especially on temporal and cross-modal workflows. As a reference baseline, EO-Gym-4B, obtained by fine-tuning Qwen3-VL-4B-Instruct on EO-Gym-Data, improves overall Pass@3 from 0.49 to 0.74 under the main evaluation setting. O-Gym provides a reproducible environment for interactive EO agents, operationalizing EO as an evidence-gathering problem that requires planning across geospatial, temporal, and sensing modality.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper introduces EO-Gym, a Gymnasium-style interactive environment for multimodal Earth Observation agents, featuring a local geospatial workspace backed by over 660k files indexed by location/time/sensor and 35 EO-specialized tools spanning six task families. It constructs EO-Gym-Data, a benchmark of 9,078 trajectories and 34,604 reasoning steps derived from eight public EO datasets plus Landsat/Sentinel-2 imagery. Evaluation of 10 open and closed VLMs shows general-purpose models struggling especially on temporal and cross-modal workflows; fine-tuning Qwen3-VL-4B-Instruct on the benchmark yields EO-Gym-4B, which improves Pass@3 from 0.49 to 0.74 under the main setting. The work positions EO analysis as an evidence-gathering problem requiring planning across geospatial, temporal, and modality dimensions.

Significance. If the trajectories and tools are representative, EO-Gym would be a meaningful contribution by supplying a reproducible, executable benchmark that operationalizes interactive EO reasoning rather than collapsing it to single-turn tasks. The fine-tuning baseline and public-data foundation are concrete strengths that could accelerate development of tool-using agents in remote sensing. The emphasis on Gymnasium compatibility and multimodal file indexing supports extensibility.

major comments (2)
  1. [EO-Gym-Data construction] In the EO-Gym-Data construction section: no expert annotation, coverage metrics, or comparison against real analyst logs is reported to validate that the 9,078 trajectories and six task families faithfully reproduce operational EO uncertainty-resolution steps (region expansion, historical retrieval, optical-SAR switching). This assumption is load-bearing for the central interpretation that low Pass@3 scores demonstrate struggles with interactive EO reasoning.
  2. [Evaluation and experiments] In the evaluation and experiments section: concrete details on trajectory generation pipeline, exact tool APIs, success criteria for Pass@3 (including partial credit for multimodal tool sequences), and statistical significance testing are not provided at a level that permits independent verification or reproduction of the reported 0.49-to-0.74 improvement.
minor comments (2)
  1. [Abstract] Abstract contains an apparent typo: 'O-Gym provides' should read 'EO-Gym provides'.
  2. [Evaluation] Clarify the precise definition of Pass@3 and any trajectory-level success thresholds in the main text to aid readers unfamiliar with the metric.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on our manuscript. We address each major comment below with clarifications on our methodology and commit to revisions that enhance validation and reproducibility.

read point-by-point responses
  1. Referee: [EO-Gym-Data construction] In the EO-Gym-Data construction section: no expert annotation, coverage metrics, or comparison against real analyst logs is reported to validate that the 9,078 trajectories and six task families faithfully reproduce operational EO uncertainty-resolution steps (region expansion, historical retrieval, optical-SAR switching). This assumption is load-bearing for the central interpretation that low Pass@3 scores demonstrate struggles with interactive EO reasoning.

    Authors: The 9,078 trajectories were constructed programmatically from the eight public EO datasets by defining task templates that require agents to resolve uncertainties through multi-step tool calls, such as expanding regions via spatial queries, retrieving historical observations using time-indexed files, and switching between optical and SAR sensors based on the 660k-file workspace. The six task families directly encode these operational patterns using the location/time/sensor indexing. While the manuscript does not include expert annotations or comparisons to real analyst logs, the benchmark is grounded in real public data to ensure the tasks reflect genuine EO workflows. We will add coverage metrics (e.g., trajectory distributions across task families, modalities, and temporal ranges) and a detailed description of the internal validation steps used during generation to the revised manuscript, thereby strengthening support for the interpretation of the Pass@3 results. revision: yes
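
    A sketch of the template-driven construction this response describes; the templates and tool names below are hypothetical placeholders rather than the authors' actual pipeline:

        import random

        TEMPLATES = [
            # (question template, canonical tool sequence encoding the workflow)
            ("How many {obj} are fully visible near ({lon}, {lat})?",
             ["detect_objects", "expand_aoi", "detect_objects"]),
            ("Did {obj} coverage change between {t0} and {t1}?",
             ["get_history", "compare_dates"]),
            ("Is the {obj} visible under cloud at ({lon}, {lat})?",
             ["load_optical", "switch_sensor"]),
        ]

        def make_trajectory(annotation, rng):
            """Instantiate a template from a source annotation and record its tool sequence."""
            template, tool_seq = rng.choice(TEMPLATES)
            return {"question": template.format(**annotation),
                    "tools": tool_seq,
                    "answer": annotation["label"]}

        rng = random.Random(0)
        annotation = {"obj": "road intersection", "lon": 149.1, "lat": -35.3,
                      "t0": "2020-01", "t1": "2024-01", "label": 4}
        print(make_trajectory(annotation, rng))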

  2. Referee: [Evaluation and experiments] In the evaluation and experiments section: concrete details on trajectory generation pipeline, exact tool APIs, success criteria for Pass@3 (including partial credit for multimodal tool sequences), and statistical significance testing are not provided at a level that permits independent verification or reproduction of the reported 0.49-to-0.74 improvement.

    Authors: The trajectory generation pipeline samples task goals from the public datasets and uses the EO-Gym environment to produce executable tool-call sequences, totaling 34,604 reasoning steps across the benchmark. The 35 tool APIs are implemented as Gymnasium-compatible functions with defined inputs (e.g., bounding boxes, time ranges, sensor types) and outputs (e.g., file paths or metadata). Pass@3 success requires the agent to reach the task goal in at most three attempts, with partial credit for sequences that correctly invoke multimodal tools even if not exhaustive. We will expand the evaluation section with pseudocode for the pipeline, full API specifications, precise success criteria, and statistical significance tests (e.g., for the 0.49-to-0.74 Pass@3 lift) in the revised manuscript and supplementary material to enable full reproduction. revision: yes
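
    Under that reading, a minimal Pass@3 scorer (ignoring the partial-credit variant mentioned above) might look like:

        def pass_at_3(attempt_outcomes):
            """attempt_outcomes: one list of booleans per task, up to three attempts each."""
            solved = sum(any(attempts[:3]) for attempts in attempt_outcomes)
            return solved / len(attempt_outcomes)

        # Two of three tasks solved within three attempts -> Pass@3 = 0.67
        print(round(pass_at_3([[False, True, False], [False, False, False], [True]]), 2))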

Circularity Check

0 steps flagged

No circularity: benchmark and evaluation are self-contained and built from public sources

full rationale

The paper builds EO-Gym and EO-Gym-Data directly from eight public EO datasets plus Landsat/Sentinel-2 imagery, and defines its 35 tools and 9,078 trajectories without fitted parameters or equations that would make the evaluation metrics equivalent to construction choices. The reported Pass@3 scores (0.49 for the base Qwen3-VL-4B-Instruct, 0.74 for the fine-tuned EO-Gym-4B) are measured on held-out trajectories, constituting a standard empirical comparison rather than a self-referential prediction. No self-citations, uniqueness theorems, or ansatzes are invoked to justify the environment or results; the derivation chain therefore remains independent of the authors' own fitting or definitional steps.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

This is an applied systems and benchmark paper. The central claims rest on design decisions for the tool set and trajectory construction rather than on mathematical axioms or fitted parameters. No free parameters, domain axioms, or invented physical entities are invoked.

pith-pipeline@v0.9.0 · 5569 in / 1288 out tokens · 41060 ms · 2026-05-09T14:59:26.139783+00:00 · methodology

discussion (0)


Reference graph

Works this paper leans on

77 extracted references · 50 canonical work pages · 8 internal anchors

  1. [1]

    Joshi, N., Baumann, M., Ehammer, A., Fensholt, R., Grogan, K., Hostert, P., Jepsen, M.R., Kuemmerle, T., Meyfroidt, P., Mitchard, E.T.A., Reiche, J., Ryan, C.M., and Waske, B. (2016). A review of the application of optical and radar remote sensing data fusion to land use mapping and monitoring. Remote Sensing, 8(1):70. https://doi.org/10.3390/rs8010070

  2. [2]

    Claverie, M., Ju, J., Masek, J.G., Dungan, J.L., Vermote, E.F., Roger, J.-C., Skakun, S.V., and Justice, C. (2018). The harmonized Landsat and Sentinel-2 surface reflectance data set. Remote Sensing of Environment, 219:145–161. https://doi.org/10.1016/j.rse.2018.09.002

  3. [3]

    Ju, J., Zhou, Q., Freitag, B., Roy, D.P., Zhang, H.K., Sridhar, M., Mandel, J., Arab, S., Schmidt, G., Crawford, C.J., Gascon, F., Strobl, P.A., Masek, J.G., and Neigh, C.S.R. (2025). The harmonized Landsat and Sentinel-2 Version 2.0 surface reflectance dataset. Remote Sensing of Environment, 324:114723. https://doi.org/10.1016/j.rse.2025.114723

  4. [4]

    Wulder, M.A., White, J.C., Loveland, T.R., Woodcock, C.E., Belward, A.S., Cohen, W.B., Fosnight, E.A., Shaw, J., Masek, J.G., and Roy, D.P. (2016). The global Landsat archive: Status, consolidation, and direction. Remote Sensing of Environment, 185:271–283. https://doi.org/10.1016/j.rse.2015.11.032

  5. [5]

    Liu, S., Cheng, H., Liu, H., Zhang, H., Li, F., Ren, T., Zou, X., Yang, J., Su, H., Zhu, J., Zhang, L., Gao, J., and Li, C. (2024). LLaVA-Plus: Learning to Use Tools for Creating Multimodal Agents. In Computer Vision – ECCV 2024, pp. 126–142. https://doi.org/10.1007/978-3-031-72970-6_8

  6. [6]

    Yao, S., Zhao, J., Yu, D., Du, N., Shafran, I., Narasimhan, K.R., and Cao, Y. (2023). ReAct: Synergizing reasoning and acting in language models. In The Eleventh International Conference on Learning Representations. https://openreview.net/forum?id=WE_vluYUL-X

  7. [7]

    Qin, Y., Liang, S., Ye, Y., Zhu, K., Yan, L., Lu, Y., Lin, Y., Cong, X., Tang, X., Qian, B., Zhao, S., Hong, L., Tian, R., Xie, R., Zhou, J., Gerstein, M., Li, D., Liu, Z., and Sun, M. (2024). ToolLLM: Facilitating large language models to master 16000+ real-world APIs. In The Twelfth International Conference on Learning Representations. https://openre...

  8. [8]

    Koh, J.Y., Lo, R., Jang, L., Duvvur, V., Lim, M., Huang, P.-Y., Neubig, G., Zhou, S., Salakhutdinov, R., and Fried, D. (2024). VisualWebArena: Evaluating Multimodal Agents on Realistic Visually Grounded Web Tasks. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pp. 881–905. https://doi....

  9. [9]

    Xie, T., Zhang, D., Chen, J., Li, X., Zhao, S., Cao, R., Hua, T.J., Cheng, Z., Shin, D., Lei, F., Liu, Y., Xu, Y., Zhou, S., Savarese, S., Xiong, C., Zhong, V., and Yu, T. (2024). OSWorld: Benchmarking multimodal agents for open-ended tasks in real computer environments. In Advances in Neural Information Processing Systems, 37:52040–52094. https://doi.o...

  10. [10]

    Zhang, W., Cai, M., Zhang, T., Zhuang, Y., and Mao, X. (2024). EarthGPT: A Universal Multimodal Large Language Model for Multisensor Image Comprehension in Remote Sensing Domain. IEEE Transactions on Geoscience and Remote Sensing, 62:1–20. https://doi.org/10.1109/TGRS.2024.3409624

  11. [11]

    Zhan, Y., Xiong, Z., and Yuan, Y. (2025). SkyEyeGPT: Unifying remote sensing vision-language tasks via instruction tuning with large language model. ISPRS Journal of Photogrammetry and Remote Sensing, 221:64–77. https://doi.org/10.1016/j.isprsjprs.2025.01.020

  12. [12]

    Li, X., Ding, J., and Elhoseiny, M. (2024). VRSBench: A versatile vision-language benchmark dataset for remote sensing image understanding. In Advances in Neural Information Processing Systems, 37:3229–3242. https://doi.org/10.52202/079017-0106

  13. [13]

    Lacoste, A., Lehmann, N., Rodriguez, P., Sherwin, E., Kerner, H., Lütjens, B., Irvin, J., Dao, D., Alemohammad, H., Drouin, A., Gunturkun, M., Huang, G., Vazquez, D., Newman, D., Bengio, Y., Ermon, S., and Zhu, X. (2023). GEO-Bench: Toward foundation models for Earth monitoring. In Advances in Neural Information Processing Systems 36. https://doi.org/1...

  14. [14]

    Soni, S., Dudhane, A., Debary, H., Fiaz, M., Munir, M.A., Danish, M.S., Fraccaro, P., Watson, C.D., Klein, L.J., Khan, F.S., and Khan, S. (2025). EarthDial: Turning multi-sensory Earth observations to interactive dialogues. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 14303–14313. https://doi.org/10.1109/CVPR527...

  15. [15]

    Kao, C.-H., Zhao, W., Revankar, S., Speas, S., Bhagat, S., Datta, R., Phoo, C.P., Mall, U., Vondrick, C., Bala, K., and Hariharan, B. (2025). Towards LLM Agents for Earth Observation. arXiv preprint arXiv:2504.12110. https://doi.org/10.48550/arXiv.2504.12110

  16. [16]

    Shabbir, A., Munir, M.A., Dudhane, A., Sheikh, M.U., Khan, M.H., Fraccaro, P., Moreno, J.B., Khan, F.S., and Khan, S. (2025). ThinkGeo: Evaluating Tool-Augmented Agents for Remote Sensing Tasks. arXiv preprint arXiv:2505.23752. https://doi.org/10.48550/arXiv.2505.23752

  17. [17]

    Feng, P., Lv, Z., Ye, J., Wang, X., Huo, X., Yu, J., Xu, W., Zhang, W., Bai, L., He, C., and Li, W. (2026). Earth-Agent: Unlocking the Full Landscape of Earth Observation with Agents. In The Fourteenth International Conference on Learning Representations.

  18. [18]

    Shabbir, A., Munir, M.A., Sheikh, M.U., Hussain, S., Khan, M.H., Fraccaro, P., Moreno, J.B., Khan, F.S., and Khan, S. (2026). OpenEarthAgent: A Unified Framework for Tool-Augmented Geospatial Agents. https://doi.org/10.48550/arXiv.2602.17665

  19. [19]

    Zhao, S., Liu, F., Zhang, X., Chen, H., Gu, X., Jiang, Z., Ling, F., Fei, B., Zhang, W., Wang, J., Xuan, W., Xiao, P., Yokoya, N., and Bai, L. (2026). OpenEarth-Agent: From Tool Calling to Tool Creation for Open-Environment Earth Observation. https://doi.org/10.48550/arXiv.2603.22148

  20. [20]

    Brockman, G., Cheung, V., Pettersson, L., Schneider, J., Schulman, J., Tang, J., and Zaremba, W. (2016). OpenAI Gym. https://doi.org/10.48550/arXiv.1606.01540

  21. [21]

    Drusch, M., Del Bello, U., Carlier, S., Colin, O., Fernandez, V., Gascon, F., Hoersch, B., Isola, C., Laberinti, P., Martimort, P., Meygret, A., Spoto, F., Sy, O., Marchese, F., and Bargellini, P. (2012). Sentinel-2: ESA's optical high-resolution mission for GMES operational services. Remote Sensing of Environment, 120:25–36. https://doi.org/10.1016/j.rse....

  22. [22]

    Torres, R., Snoeij, P., Geudtner, D., Bibby, D., Davidson, M., Attema, E., Potin, P., Rommen, B., Floury, N., Brown, M., Traver, I.N., Deghaye, P., Duesmann, B., Rosich, B., Miranda, N., Bruno, C., L'Abbate, M., Croci, R., Pietropaolo, A., Huchler, M., and Rostan, F. (2012). GMES Sentinel-1 mission. Remote Sensing of Environment, 120:9–24. https://doi.org/...

  23. [23]

    DigitalGlobe. (2014). WorldView-3 Data Sheet. DigitalGlobe. https://www.spaceimagingme.com/downloads/sensors/datasheets/DG_WorldView3_DS_2014.pdf. Accessed: May 1, 2026.

  24. [24]

    Toutin, T., and Cheng, P. (2002). QuickBird—A Milestone for High Resolution Mapping. Earth Observation Magazine, 11(4):14–18.

  25. [25]

    Madden, M. (2009). GeoEye-1, the World's highest resolution commercial satellite. In Conference on Lasers and Electro-Optics/International Quantum Electronics Conference, OSA Technical Digest (CD), paper PWB4. https://doi.org/10.1364/CLEO.2009.PWB4

  26. [26]

    Huang, W., Sun, S., Jiang, H., Gao, C., and Zong, X. (2018). GF-2 Satellite 1m/4m Camera Design and In-Orbit Commissioning. Chinese Journal of Electronics, 27(6):1316–1321. https://doi.org/10.1049/cje.2018.09.018

  27. [27]

    World Meteorological Organization. (2026). Satellite: Jilin-1. OSCAR: Observing Systems Capability Analysis and Review Tool. https://space.oscar.wmo.int/satellites/view/jilin_1. Accessed: May 1, 2026.

  28. [28]

    Cyclomedia. (2026). Aerial data. https://www.cyclomedia.com/en/producten/data-visualisatie/aerial-data/. Accessed: May 1, 2026.

  29. [29]

    Mialon, G., Fourrier, C., Wolf, T., LeCun, Y., and Scialom, T. (2024). GAIA: A benchmark for general AI assistants. In The Twelfth International Conference on Learning Representations. https://openreview.net/forum?id=fibxvahvs3

  30. [30]

    Wang, J., Ma, Z., Li, Y., Zhang, S., Chen, C., Chen, K., and Le, X. (2024). GTA: A benchmark for general tool agents. In Advances in Neural Information Processing Systems, 37:75749–75790. https://doi.org/10.52202/079017-2412

  31. [31]

    Nathani, D., Madaan, L., Roberts, N., Bashlykov, N., Menon, A., Moens, V., Budhiraja, A., Magka, D., Vorotilov, V., Chaurasia, G., Hupkes, D., Cabral, R.S., Shavrina, T., Foerster, J., Bachrach, Y., Wang, W.Y., and Raileanu, R. (2025). MLGym: A new framework and benchmark for advancing AI research agents. https://doi.org/10.48550/arXiv.2502.14499

  32. [32]

    Jain, N., Singh, J., Shetty, M., Zheng, L., Sen, K., and Stoica, I. (2025). R2E-Gym: Procedural environments and hybrid verifiers for scaling open-weights SWE agents. https://doi.org/10.48550/arXiv.2504.07164

  33. [33]

    Zhou, S., Xu, F.F., Zhu, H., Zhou, X., Lo, R., Sridhar, A., Cheng, X., Ou, T., Bisk, Y., Fried, D., Alon, U., and Neubig, G. (2024). WebArena: A realistic web environment for building autonomous agents. In The Twelfth International Conference on Learning Representations. https://openreview.net/forum?id=oKn9c6ytLx

  34. [34]

    Li, B., Wang, Y., Fei, H., Li, J., Ji, W., Lee, M.-L., and Hsu, W. (2025). FormFactory: An interactive benchmarking suite for multimodal form-filling agents. https://doi.org/10.48550/arXiv.2506.01520

  35. [35]

    Kulkarni, M., Rehberg, W., and Alexis, K. (2025). Aerial Gym Simulator: A framework for highly parallelized simulation of aerial robots. https://doi.org/10.48550/arXiv.2503.01471

  36. [36]

    Lam, D., Kuzma, R., McGee, K., Dooley, S., Laielli, M., Klaric, M., Bulatov, Y., and McCord, B. (2018). xView: Objects in context in overhead imagery. https://doi.org/10.48550/arXiv.1802.07856

  37. [37]

    Xia, G.-S., Bai, X., Ding, J., Zhu, Z., Belongie, S., Luo, J., Datcu, M., Pelillo, M., and Zhang, L. (2018). DOTA: A large-scale dataset for object detection in aerial images. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 3974–3983. https://doi.org/10.1109/CVPR.2018.00418

  38. [38]

    Li, K., Wan, G., Cheng, G., Meng, L., and Han, J. (2020). Object detection in optical remote sensing images: A survey and a new benchmark. ISPRS Journal of Photogrammetry and Remote Sensing, 159:296–307. https://doi.org/10.1016/j.isprsjprs.2019.11.023

  39. [39]

    Sun, X., Wang, P., Yan, Z., Xu, F., Wang, R., Diao, W., Chen, J., Li, J., Feng, Y., Xu, T., Weinmann, M., Hinz, S., Wang, C., and Fu, K. (2022). FAIR1M: A benchmark dataset for fine-grained object recognition in high-resolution remote sensing imagery. ISPRS Journal of Photogrammetry and Remote Sensing, 184:116–130. https://doi.org/10.1016/j.isprsjprs.2021.12.004

  40. [40]

    Li, Y., Li, X., Li, W., Hou, Q., Liu, L., Cheng, M.-M., and Yang, J. (2024). SARDet-100K: Towards open-source benchmark and toolkit for large-scale SAR object detection. In Advances in Neural Information Processing Systems, 37:128430–128461. https://doi.org/10.52202/079017-4079

  41. [41]

    Massey, M., Munia, N., and Imran, A.-A.-Z. (2025). EarthScape: A multimodal dataset for surficial geologic mapping and Earth surface analysis. https://doi.org/10.48550/arXiv.2503.15625

  42. [42]

    Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Ferres, J.M.L., and Banerjee, A. (2022). TorchGeo: Deep learning with geospatial data. In Proceedings of the 30th International Conference on Advances in Geographic Information Systems, pp. 1–12. https://doi.org/10.1145/3557915.3560953

  43. [43]

    Mai, G., Lao, N., He, Y., Song, J., and Ermon, S. (2023). CSP: Self-supervised contrastive spatial pre-training for geospatial-visual representations. In Proceedings of the 40th International Conference on Machine Learning.

  44. [44]

    Rußwurm, M., and Körner, M. (2020). Self-attention for raw optical satellite time series classification. ISPRS Journal of Photogrammetry and Remote Sensing, 169:421–435. https://doi.org/10.1016/j.isprsjprs.2020.06.006

  45. [45]

    Sainte Fare Garnot, V., and Landrieu, L. (2021). Panoptic segmentation of satellite image time series with convolutional temporal attention networks. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 4872–4881. https://doi.org/10.1109/ICCV48922.2021.00483

  46. [46]

    Abbas, A., Linardi, M., Vareille, E., Christophides, V., and Paris, C. (2023). Towards Explainable AI4EO: An explainable deep learning approach for crop type mapping using satellite images time series. In IGARSS 2023 – 2023 IEEE International Geoscience and Remote Sensing Symposium, pp. 1088–1091. https://doi.org/10.1109/IGARSS52108.2023.10283125

  47. [47]

    Christie, G., Fendley, N., Wilson, J., and Mukherjee, R. (2018). Functional Map of the World. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. https://doi.org/10.1109/CVPR.2018.00646

  48. [48]

    Gupta, R., Goodman, B., Patel, N., Hosfelt, R., Sajeev, S., Heim, E., Doshi, J., Lucas, K., Choset, H., and Gaston, M. (2019). Creating xBD: A dataset for assessing building damage from satellite imagery. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops, pp. 10–17.

  49. [49]

    Wang, C., Lu, W., Li, X., Yang, J., and Luo, L. (2025). M4-SAR: A multi-resolution, multi-polarization, multi-scene, multi-source dataset and benchmark for optical-SAR object detection. https://doi.org/10.48550/arXiv.2505.10931

  50. [50]

    Rouse, J.W., Haas, R.H., Schell, J.A., and Deering, D.W. (1974). Monitoring vegetation systems in the Great Plains with ERTS. NASA Special Publication, 351:309–317.

  51. [51]

    Gao, B.C. (1996). NDWI—A normalized difference water index for remote sensing of vegetation liquid water from space. Remote Sensing of Environment, 58(3):257–266. https://doi.org/10.1016/S0034-4257(96)00067-3

  52. [52]

    Gorelick, N., Hancher, M., Dixon, M., Ilyushchenko, S., Thau, D., and Moore, R. (2017). Google Earth Engine: Planetary-scale geospatial analysis for everyone. Remote Sensing of Environment, 202:18–27. https://doi.org/10.1016/j.rse.2017.06.031

  53. [53]

    Carion, N., Gustafson, L., Hu, Y.-T., Debnath, S., Hu, R., Suris, D., Ryali, C., Alwala, K.V., Khedr, H., Huang, A., Lei, J., Ma, T., Guo, B., Kalla, A., Marks, M., Greer, J., Wang, M., Sun, P., Rädle, R., Afouras, T., et al. (2026). SAM 3: Segment anything with concepts. arXiv preprint, arXiv:2511.16719. https://doi.org/10.48550/arXiv.2511.16719

  54. [54]

    Liu, S., Zeng, Z., Ren, T., Li, F., Zhang, H., Yang, J., Jiang, Q., Li, C., Yang, J., Su, H., Zhu, J., and Zhang, L. (2024). Grounding DINO: Marrying DINO with grounded pre-training for open-set object detection. In Computer Vision – ECCV 2024, pp. 38–55. https://doi.org/10.1007/978-3-031-72970-6_3

  55. [55]

    OpenAI. (2025). GPT-4.1 model. OpenAI API documentation. Available at https://developers.openai.com/api/docs/models/gpt-4.1. Accessed: May 1, 2026.

  56. [56]

    Singh, A., Fry, A., Perelman, A., Tart, A., Ganesh, A., et al. (2025). OpenAI GPT-5 System Card. arXiv preprint, arXiv:2601.03267. https://arxiv.org/abs/2601.03267

  57. [57]

    OpenAI, Agarwal, S., Ahmad, L., Ai, J., Altman, S., Applebaum, A., Arbus, E., Arora, R.K., Bai, Y., Baker, B., et al. (2025). gpt-oss-120b and gpt-oss-20b Model Card. arXiv preprint, arXiv:2508.10925. https://arxiv.org/abs/2508.10925

  58. [58]

    Song, Y., Xiong, W., Zhao, X., Zhu, D., Wu, W., Wang, K., Li, C., Peng, W., and Li, S. (2024). AgentBank: Towards generalized LLM agents via fine-tuning on 50000+ interaction trajectories. In Findings of the Association for Computational Linguistics: EMNLP 2024, pp. 2124–2141. https://doi.org/10.18653/v1/2024.findings-emnlp.116

  59. [59]

    Kang, M., Jeong, J., Lee, S., Cho, J., and Hwang, S.J. (2025). Distilling LLM agent into small models with retrieval and code tools. arXiv preprint arXiv:2505.17612. https://doi.org/10.48550/arXiv.2505.17612

  60. [60]

    Hu, E.J., Shen, Y., Wallis, P., Allen-Zhu, Z., Li, Y., Wang, S., Wang, L., and Chen, W. (2022). LoRA: Low-rank adaptation of large language models. In The Tenth International Conference on Learning Representations. https://openreview.net/forum?id=nZeVKeeFYf9

  61. [61]

    Bai, S., Cai, Y., Chen, R., Chen, K., Chen, X., et al. (2025). Qwen3-VL Technical Report. arXiv preprint, arXiv:2511.21631. https://arxiv.org/abs/2511.21631

  62. [62]

    Zhao, Y., Huang, J., Hu, J., Wang, X., Mao, Y., Zhang, D., Jiang, Z., Wu, Z., Ai, B., Wang, A., Zhou, W., and Chen, Y. (2025). SWIFT: A scalable lightweight infrastructure for fine-tuning. Proceedings of the AAAI Conference on Artificial Intelligence, 39(28):29733–29735. https://doi.org/10.1609/aaai.v39i28.35383

  63. [63]

    Comanici, G., Bieber, E., Schaekermann, M., Pasupat, I., Sachdeva, N., et al. (2025). Gemini 2.5: Pushing the frontier with advanced reasoning, multimodality, long context, and next generation agentic capabilities. arXiv preprint, arXiv:2507.06261. https://arxiv.org/abs/2507.06261

  64. [64]

    Cohen, J. (1960). A coefficient of agreement for nominal scales. Educational and Psychological Measurement, 20(1):37–46. https://doi.org/10.1177/001316446002000104

  65. [65]

    Chen, M., Tworek, J., Jun, H., Yuan, Q., Pinto, H.P.O., Kaplan, J., Edwards, H., Burda, Y., Joseph, N., Brockman, G., et al. (2021). Evaluating Large Language Models Trained on Code. https://doi.org/10.48550/arXiv.2107.03374

  66. [66]

    Wang, X., Wei, J., Schuurmans, D., Le, Q., Chi, E., Narang, S., Chowdhery, A., and Zhou, D. (2023). Self-consistency improves chain of thought reasoning in language models. In The Eleventh International Conference on Learning Representations. https://openreview.net/forum?id=1PL1NIMMrw

  67. [67]

    Wulder, M.A., Coops, N.C., Roy, D.P., White, J.C., and Hermosilla, T. (2019). Current status of Landsat program, science, and applications. Remote Sensing of Environment, 225:127–147. https://doi.org/10.1016/j.rse.2019.02.015

  68. [68]

    Teknium, R., Quesnelle, J., and Guang, C. (2024). Hermes 3 Technical Report. arXiv preprint, arXiv:2408.11857. https://doi.org/10.48550/arXiv.2408.11857
