📆 Project Period | June 2024 – Dec 2025 |
🏛️ ESA Partners |
Project Summary
Despite recent advances in computer vision, Earth Observation (EO) analysis remains difficult for non-experts, requiring specialist knowledge and technical capabilities. Furthermore, many systems return black-box predictions that are difficult to audit or reproduce. Leveraging recent advances in tool-using LLMs, this study proposes a conversational, code-generating agent that transforms natural-language queries into executable, auditable Python workflows. The agent operates over a unified, easily extensible API for classification, segmentation, detection (oriented bounding boxes), spectral indices, and geospatial operators. The proposed framework allows results to be assessed at three levels:
- Tool level: performance on public EO benchmarks;
- Agent level: capacity to generate valid, hallucination-free code;
- Task level: end-to-end performance on specific use cases.
In this work, we select two use cases of interest: land-composition mapping and post-wildfire damage assessment. The proposed agent outperforms general-purpose LLM/VLM baselines (GPT-4o, LLaVA), achieving 64.2% vs. 51.7% accuracy on land-composition mapping and 50% vs. 0% on post-wildfire analysis, while producing results that are transparent and easy to interpret. By outputting verifiable code, the approach turns EO analysis into a transparent, reproducible process.
Development Tools
IC–EO converts natural-language queries into executable Earth Observation (EO) workflows. The framework combines a modular API of selected EO tools with an LLM that plans and writes code. This approach ensures interpretability, reproducibility, and extensibility across tasks.
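Such a modular tool API might resemble the registry sketch below, where each tool exposes a name and a short description to the planning LLM and returns standardized dictionary outputs. All names, signatures, and return values here are illustrative assumptions, not IC–EO's actual interface:

```python
from dataclasses import dataclass
from typing import Callable, Dict

@dataclass
class Tool:
    name: str
    description: str       # shown to the LLM planner
    run: Callable[..., dict]

REGISTRY: Dict[str, Tool] = {}

def register(name: str, description: str):
    """Decorator that adds a function to the tool registry."""
    def wrap(fn):
        REGISTRY[name] = Tool(name, description, fn)
        return fn
    return wrap

@register("classify_scene", "Assign a land-cover class to an image tile.")
def classify_scene(tile) -> dict:
    # A real implementation would invoke an EO classification model here;
    # this stub returns a fixed prediction for illustration only.
    return {"label": "forest", "confidence": 0.91}
```

The planner sees only the registered names and descriptions; the code it generates calls `REGISTRY["classify_scene"].run(tile)`, so every model invocation remains explicit in the emitted program.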
The experimental evaluation of IC–EO is designed to assess the full system: how the assistant integrates EO models, helper tools, and a code-generating controller to produce valid, interpretable answers from natural-language queries.
Development Outputs
- Public demo: https://www.ic-eo.com
- Scientific paper: https://arxiv.org/abs/2602.00117
Project Description

Earth Observation (EO) has evolved into a data-rich field with applications across a wide range of domains, such as environmental monitoring, disaster response, agriculture, and urban planning. However, EO users face fragmented toolchains: imagery retrieval, cloud masking, reprojection and tiling, model selection, and geodesic statistics live across different platforms and libraries, and many end-to-end pipelines yield black-box outputs that are hard to trust or reproduce [1,2]. This fragmentation raises the barrier for non-experts and for operational use.
Concurrently, large language models (LLMs) have progressed from text generators into tool-using controllers capable of planning, calling APIs, and writing executable programs. The shift from transformer foundations to instruction-following and code-centric reasoning suggests a path to unify complex workflows while keeping each step explicit and auditable [3,4,5]. Methods for interleaving reasoning with actions and self-supervised API use further strengthen this paradigm, making it feasible to connect language interfaces to domain libraries in a principled way [6,7]. When applied naively to EO imagery, general LLM/VLM systems struggle: they can describe scenes yet fail on quantitative measurement, spatial logic, and consistent use of geospatial metadata. Early multimodal work in the domain (e.g., RSVQA) highlights the need for EO-aware reasoning that spans multiple sensors, resolutions, and coordinate systems rather than surface-level captioning [8]. At the same time, modular program-synthesis frameworks (e.g., ViperGPT [9]) demonstrate that composing specialist vision modules via generated code yields transparency, composability, and reproducibility, properties that are equally valuable for EO workflows, where multiple perception models and geospatial operations must be combined consistently.

We build on these trends and adopt a code-first design. Our proposed Interpretable Code-based assistant for Earth Observation (IC–EO) compiles natural-language requests into executable, auditable Python code that orchestrates EO tools under an explicit, verifiable plan. Rather than directly providing an answer to a query, the framework operates over a unified Tool API with standardized I/O for scene classification, semantic segmentation, object detection with oriented bounding boxes, spectral indices, and basic geospatial operations such as reprojection, tiling, and area computation.
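As an illustration of one such spectral-index tool, the standard NDVI formula (NIR − Red) / (NIR + Red) takes only a few lines; the function below is a generic NumPy sketch, not IC–EO's actual implementation:

```python
import numpy as np

def ndvi(red: np.ndarray, nir: np.ndarray) -> np.ndarray:
    """Normalized Difference Vegetation Index: (NIR - Red) / (NIR + Red)."""
    red = red.astype(np.float64)
    nir = nir.astype(np.float64)
    denom = nir + red
    out = np.zeros_like(denom)
    # Avoid division by zero (e.g., nodata pixels): leave those pixels at 0.
    np.divide(nir - red, denom, out=out, where=denom != 0)
    return out

# Toy 2x2 reflectance tiles; real inputs would be full sensor bands.
red = np.array([[0.1, 0.2], [0.3, 0.0]])
nir = np.array([[0.5, 0.6], [0.3, 0.0]])
print(ndvi(red, nir))  # [[0.6667, 0.5], [0.0, 0.0]] (approx.)
```

Because each such operator is an explicit, inspectable function, a generated workflow that calls it can be audited line by line rather than trusted as a black box.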
A lightweight controller conditions on sensor and band metadata, selects tools, and validates outputs to ensure consistent spatial processing across RGB and multispectral inputs. To assess the performance of our solution and compare it to Vision-Language Model (VLM)-based approaches, we propose a new evaluation methodology at three complementary levels: the model level, evaluating performance on public EO benchmarks for classical EO tasks; the code-generation level, probing the capacity of an LLM to generate valid Python code; and the task level, evaluating the framework's outputs on two realistic use cases (land-composition mapping and post-wildfire damage assessment) against strong general-purpose VLM baselines.
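The generate-and-validate behavior of such a controller can be sketched as a simple retry loop: ask the LLM for code, execute it against a whitelisted tool namespace, and feed any error back for a corrected attempt. The names below (`llm_generate`, the toy tool table) are hypothetical placeholders, not IC–EO's actual controller API:

```python
def run_with_validation(query: str, llm_generate, tools: dict, max_retries: int = 2):
    """Ask the LLM for Python code answering `query`, execute it against the
    tool namespace, and retry with the error message if execution fails."""
    feedback = ""
    code = ""
    for _ in range(max_retries + 1):
        code = llm_generate(query + feedback)
        namespace = dict(tools)      # only whitelisted tools are visible
        try:
            exec(code, namespace)    # auditable: `code` can be logged and reviewed
            return namespace.get("answer"), code
        except Exception as e:
            feedback = f"\nPrevious attempt failed with: {e!r}. Fix the code."
    return None, code

# Toy demonstration with a stub "LLM" that hallucinates a tool on the first
# try and produces valid code on the second.
attempts = iter(["answer = undefined_tool()", "answer = area_km2(42)"])
result, code = run_with_validation(
    "What is the burned area?",
    lambda q: next(attempts),
    tools={"area_km2": lambda x: x * 1.0},
)
print(result)  # 42.0
```

Returning the final code alongside the answer is what makes the result reproducible: the exact program that produced the number is part of the output.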
1. Gorelick, N., Hancher, M., Dixon, M., Ilyushchenko, S., Thau, D., Moore, R.: Google Earth Engine: Planetary-scale geospatial analysis for everyone. Remote Sensing of Environment 202, 18–27 (2017). https://doi.org/10.1016/j.rse.2017.06.031
2. Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. ACM Transactions on Spatial Algorithms and Systems 11(4), 1–28 (2025)
3. Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, Ł., Polosukhin, I.: Attention is all you need. Advances in neural information processing systems 30 (2017)
4. Brown, T., Mann, B., Ryder, N., Subbiah, M., Kaplan, J.D., Dhariwal, P., Neelakantan, A., Shyam, P., Sastry, G., Askell, A., et al.: Language models are few-shot learners. Advances in neural information processing systems 33, 1877–1901 (2020)
5. Chen, M., et al.: Evaluating large language models trained on code. arXiv:2107.03374 (2021)
6. Yao, S., Zhao, J., Yu, D., Du, N., Shafran, I., Narasimhan, K.R., Cao, Y.: React: Synergizing reasoning and acting in language models. In: The eleventh international conference on learning representations (2022)
7. Schick, T., Dwivedi-Yu, J., Dessì, R., Raileanu, R., Lomeli, M., Hambro, E., Zettlemoyer, L., Cancedda, N., Scialom, T.: Toolformer: Language models can teach themselves to use tools. Advances in Neural Information Processing Systems 36, 68539–68551 (2023)
8. Lobry, S., Marcos, D., Murray, J., Tuia, D.: RSVQA: Visual question answering for remote sensing data. IEEE Transactions on Geoscience and Remote Sensing 58(12), 8555–8566 (2020)
9. Surís, D., Menon, S., Vondrick, C.: ViperGPT: Visual inference via Python execution for reasoning. In: Proceedings of the IEEE/CVF International Conference on Computer Vision. pp. 11888–11898 (2023)
