Beyond Conditional Diffusion: A Physics-Guided Diffusion Framework for Joint Forecasting of Physical Drivers and Multispectral Satellite Data

📆 Project Period	September 2025 - March 2026
👤 CIN Visiting Researcher	Francesca De Falco

Project Summary

The collaboration focused on the development of DF-Diff, a novel physics-guided generative framework for Earth Observation forecasting that jointly predicts future hydro-meteorological variables and future multispectral satellite images.
The project addressed a realistic operational challenge: in many forecasting scenarios, relevant physical drivers are not available over the prediction horizon. To overcome this, the framework first forecasts future physical variables and then uses them to condition a diffusion-based image forecasting model.
The method combines three main components: a multispectral autoencoder, a Mamba-based temporal forecasting model, and a Diffusion Transformer (DiT) for multispectral image generation.
The framework was validated on a water-body monitoring task using Sentinel-2 13-band multispectral imagery and 16 hydro-meteorological variables, with relevance for drought monitoring and environmental risk assessment.
Experimental results showed that DF-Diff outperformed recurrent baselines and a diffusion-only baseline, while remaining robust even when the physical conditioning variables were themselves forecast rather than observed.

Development Tools

The project was developed in Python 3.10 and implemented using deep learning libraries including PyTorch 2.1.0, TorchVision 0.16.0, and TorchAudio 2.1.0. The framework combines a multispectral autoencoder for latent compression, a Mamba state-space sequence model for forecasting physical variables, and a Diffusion Transformer (DiT) for conditioned multispectral image generation. The experimental validation used Sentinel-2 multispectral imagery with all 13 bands, together with 16 hydro-meteorological variables retrieved from global reanalysis products through Google Earth Engine. Model training and evaluation were carried out on a workstation equipped with an NVIDIA A6000 GPU (48 GB VRAM). The work focused on forecasting water-related environmental dynamics and on studying the integration of physical drivers into generative remote sensing pipelines.

Development Outputs

(⚠️ ongoing) Research paper: "Beyond Conditional Diffusion: A Physics-Guided Diffusion Framework for Joint Forecasting of Physical Drivers and Multispectral Satellite Data", submitted to the EarthVision workshop at CVPR
(⚠️ ongoing) Code and dataset will be released after publication

Project Description

This project explored the integration of physics-guided machine learning and diffusion-based generative modeling for Earth Observation forecasting. The main goal was to develop a forecasting framework capable of predicting future multispectral satellite observations in a realistic setting where the environmental variables driving the system evolution are not known in advance. The collaboration with ESA Φ-lab supported the development of a methodology that brings together physical context and modern generative AI for remote sensing applications.

The resulting framework, named Dual Forecasting-Diffusion (DF-Diff), is based on a two-stage forecasting strategy. In the first stage, the model predicts future physical drivers from historical observations. In the second stage, these predicted variables are used to guide the generation of future multispectral images. This design makes the framework fully causal, since all outputs are produced from past information only, without relying on future physical measurements that would not be available in real-world operational scenarios. This is one of the main innovations of the project.
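The two-stage flow described above can be sketched in a few lines of plain Python. This is only an illustration of the data flow, not the project's implementation: the function names (`forecast_drivers`, `generate_latents`, `df_diff_forecast`) are hypothetical, the first stage is replaced by a trivial "repeat the last value" rule standing in for the Mamba forecaster, and the second stage is a toy shift of the last latent standing in for the conditioned DiT sampler. The point it shows is that both stages read past information only.

```python
from typing import List, Sequence, Tuple

Vector = List[float]


def forecast_drivers(past_drivers: Sequence[Vector], horizon: int) -> List[Vector]:
    """Stage 1 stand-in (hypothetical): repeat the last observed driver vector.

    In the real framework this role is played by the Mamba-based forecaster.
    Only past driver values are read.
    """
    last = list(past_drivers[-1])
    return [list(last) for _ in range(horizon)]


def generate_latents(past_latents: Sequence[Vector],
                     future_drivers: Sequence[Vector]) -> List[Vector]:
    """Stage 2 stand-in (hypothetical): one latent per forecast step.

    In the real framework this role is played by the conditioned DiT.
    Here the last past latent is simply shifted by the mean forecast driver.
    """
    last = past_latents[-1]
    outputs = []
    for drivers in future_drivers:
        shift = sum(drivers) / len(drivers)
        outputs.append([z + shift for z in last])
    return outputs


def df_diff_forecast(past_latents: Sequence[Vector],
                     past_drivers: Sequence[Vector],
                     horizon: int) -> Tuple[List[Vector], List[Vector]]:
    """Chain the two stages; no future measurement is ever passed in."""
    future_drivers = forecast_drivers(past_drivers, horizon)
    future_latents = generate_latents(past_latents, future_drivers)
    return future_drivers, future_latents


drivers, latents = df_diff_forecast(
    past_latents=[[0.5, 0.5], [1.0, 2.0]],
    past_drivers=[[0.0, 2.0], [1.0, 3.0]],
    horizon=3,
)
```

Because the second stage receives the output of the first stage rather than observed future drivers, the whole chain can run at inference time with historical data alone.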

From an architectural standpoint, DF-Diff combines three complementary components. First, a multispectral autoencoder compresses Sentinel-2 image sequences into latent representations that preserve the most relevant spectral and spatial information while reducing computational cost. Second, a Mamba-based temporal model is used to forecast the future values of the physical variables. Third, a Diffusion Transformer (DiT) operates in latent space and generates future multispectral representations conditioned on both past image information and the forecast physical context. This modular design made it possible to handle both the temporal evolution of environmental drivers and the complex spectral structure of multispectral Earth Observation data.

The project was validated on a preliminary but realistic application: water-body dynamics forecasting for environmental monitoring. The experiments were conducted using an extended version of the SEN12-WATER benchmark, relying on Sentinel-2 optical imagery with all 13 spectral bands resampled to 10 m resolution. In addition to the satellite imagery, the framework integrated 16 hydro-meteorological variables, including precipitation, snowfall, snowmelt, runoff, soil moisture, evaporation-related variables, and near-surface atmospheric conditions. These variables were extracted and temporally aligned with the imagery, allowing the model to incorporate a physically meaningful description of the processes affecting water dynamics.
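As an illustration of what resampling all 13 bands to a common 10 m grid involves, the sketch below upsamples a band by its native-to-target resolution ratio. The native resolutions listed are those of the Sentinel-2 MSI instrument; the interpolation method is an assumption, since the report does not state which one was used, and nearest-neighbour is chosen here only because it is the simplest to show. The helper names `upsample_nearest` and `to_10m` are hypothetical.

```python
from typing import Dict, List

Grid = List[List[float]]

# Native ground resolution (metres) of the 13 Sentinel-2 MSI bands.
NATIVE_RESOLUTION_M: Dict[str, int] = {
    "B01": 60, "B02": 10, "B03": 10, "B04": 10,
    "B05": 20, "B06": 20, "B07": 20, "B08": 10,
    "B8A": 20, "B09": 60, "B10": 60, "B11": 20, "B12": 20,
}


def upsample_nearest(band: Grid, factor: int) -> Grid:
    """Repeat every pixel `factor` times along both axes."""
    out: Grid = []
    for row in band:
        wide = [value for value in row for _ in range(factor)]
        out.extend(list(wide) for _ in range(factor))
    return out


def to_10m(band: Grid, name: str) -> Grid:
    """Bring one band onto the 10 m grid (nearest-neighbour, an assumption)."""
    factor = NATIVE_RESOLUTION_M[name] // 10
    return upsample_nearest(band, factor)


# A 2x2 tile of a 20 m band becomes a 4x4 tile on the 10 m grid.
b05_10m = to_10m([[1.0, 2.0], [3.0, 4.0]], "B05")
```

With this scheme the 10 m bands pass through unchanged, the 20 m bands grow by a factor of 2 per axis, and the 60 m bands by a factor of 6.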

A key outcome of the collaboration was the demonstration that physical guidance can significantly improve multispectral forecasting quality. Compared with standard recurrent architectures such as ConvLSTM and BiConvLSTM, DF-Diff achieved substantially lower error on image forecasting. It also outperformed a diffusion-only version of the model that did not use physical conditioning. Importantly, the results showed that conditioning on forecast physical variables instead of the observed future values introduced only a limited performance drop. This finding is especially relevant for operational use, because it indicates that the approach remains effective even when the physical inputs themselves must be predicted.

Another important result concerns the forecasting of the physical drivers themselves. The Mamba module showed strong performance in predicting hydro-meteorological time series, outperforming LSTM and simple baselines such as persistence and naive forecasting. This confirmed that Mamba is a suitable backbone for modeling long temporal dependencies in Earth system variables, especially across multi-year horizons. The success of this component was fundamental for enabling the complete two-stage DF-Diff pipeline.
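For readers unfamiliar with the simple baselines mentioned above, the sketch below shows how such reference forecasts are typically built and scored for a single variable. Persistence repeats the last observed value over the whole horizon. The report does not define its "naive" baseline, so a historical-mean forecast is used here purely as a stand-in, and mean absolute error is an assumed metric. All function names and numbers are illustrative, not results from the project.

```python
from typing import List, Sequence


def persistence_forecast(history: Sequence[float], horizon: int) -> List[float]:
    """Repeat the last observed value for every future step."""
    return [history[-1]] * horizon


def mean_forecast(history: Sequence[float], horizon: int) -> List[float]:
    """Stand-in 'naive' baseline (assumption): repeat the historical mean."""
    mean = sum(history) / len(history)
    return [mean] * horizon


def mae(pred: Sequence[float], target: Sequence[float]) -> float:
    """Mean absolute error between two equally long sequences."""
    return sum(abs(p - t) for p, t in zip(pred, target)) / len(target)


# Toy series for one hydro-meteorological variable (made-up values).
history = [1.0, 2.0, 3.0]
target = [3.0, 4.0]

persistence_error = mae(persistence_forecast(history, len(target)), target)
mean_error = mae(mean_forecast(history, len(target)), target)
```

A learned forecaster is only useful if it beats such references on held-out data, which is the comparison the paragraph above describes.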

The collaboration also highlighted the broader methodological value of working on full multispectral forecasting instead of restricting the problem to a small subset of bands or to derived indices. By forecasting all 13 Sentinel-2 bands, the project preserved the complete spectral signature of the observed scene, making the framework potentially useful for a wider range of downstream Earth Observation applications beyond water monitoring alone. This makes the work relevant not only for drought and water-related scenarios, but also for future extensions to other environmental monitoring tasks.

As for future developments, several promising directions emerged from the project. First, the current framework could be extended with additional physically informed objectives beyond conditioning alone, improving the physical plausibility of the generated forecasts. Second, the system could integrate multi-sensor data, in particular SAR imagery, to reduce sensitivity to cloud cover and illumination changes. Third, broader evaluation on more diverse datasets and domains would help assess robustness and generalization. Finally, the framework could be adapted to support other Earth Observation use cases where long-horizon forecasting under partial physical observability is required.

Overall, the collaboration with ESA Φ-lab enabled the development of an innovative generative forecasting framework at the intersection of diffusion models, temporal sequence modeling, and physics-informed machine learning. The project contributes to the growing field of trustworthy and physically grounded AI for Earth Observation and provides a solid basis for future research and operational development.