GeoNatureAgent Benchmark tests seven LLMs on 93 tasks via a production geospatial API, with Claude Sonnet 4 at 60.8% and DeepSeek V3.2 offering near performance at 11x lower cost while all models fail on close-value comparisons.
Title resolution pending
1 Pith paper cite this work. Polarity classification is still indexing.
1
Pith paper citing it
fields
cs.AI 1years
2026 1verdicts
ACCEPT 1representative citing papers
citing papers explorer
-
GeoNatureAgent Benchmark: Benchmarking LLM Agents for Environmental Geospatial Analysis Across Frontier and Open-Weight Foundation Models
GeoNatureAgent Benchmark tests seven LLMs on 93 tasks via a production geospatial API, with Claude Sonnet 4 at 60.8% and DeepSeek V3.2 offering near performance at 11x lower cost while all models fail on close-value comparisons.