📆 Project Period: May – August 2025
👤 CIN Visiting Researcher
Project Summary
- This project addresses the "domain gap" in Earth Observation (EO), where AI models for tasks such as ship segmentation exhibit unpredictable performance drops when deployed in new geographical regions. This unreliability is a barrier to using AI in critical applications across diverse environments.
- The primary goal was to quantify this performance drop on unseen data without requiring new labels. Existing metrics were evaluated and found to fail for segmentation tasks with severe class imbalance, likely because their predictions are skewed by the dominant background class.
- The main contribution is a novel metric, the Positive Difference of Confidence (DoCp). Unlike traditional methods, DoCp focuses only on the model's confidence for the target class (ships), effectively ignoring irrelevant background changes.
- The DoCp was successfully evaluated across 22 domain shifts, demonstrating a strong correlation with actual model performance. It provides a tool for quality assurance and could trigger autonomous model adaptation, significantly improving the trustworthiness of AI systems in operational EO environments.
Development Tools
- PyTorch: used to implement model training and evaluation
- Albumentations: used for data augmentation
- Optuna: used for hyperparameter optimization (combined with Albumentations in the sketch below)
- Segmentation Models PyTorch: used for the model implementation
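As a rough illustration of how these tools fit together, the sketch below wires an Albumentations pipeline into an Optuna study. The transforms, search ranges and the `train_and_validate` stub are hypothetical placeholders, not the project's actual configuration:

```python
import albumentations as A
import optuna

def build_transforms(p_flip: float) -> A.Compose:
    # Augmentations applied to image/mask pairs during training;
    # the specific transforms chosen here are illustrative only.
    return A.Compose([
        A.HorizontalFlip(p=p_flip),
        A.RandomBrightnessContrast(p=0.5),
    ])

def train_and_validate(lr: float, transforms: A.Compose) -> float:
    # Placeholder for the actual training/validation loop; it would
    # return the validation IoU achieved with these hyperparameters.
    return 0.0

def objective(trial: optuna.Trial) -> float:
    # Hypothetical search space; the real study defined its own ranges.
    lr = trial.suggest_float("lr", 1e-5, 1e-2, log=True)
    p_flip = trial.suggest_float("p_flip", 0.0, 0.5)
    return train_and_validate(lr, build_transforms(p_flip))

study = optuna.create_study(direction="maximize")
study.optimize(objective, n_trials=20)
```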
Development Outputs
- Master's thesis outlining the research performed and the results achieved
Project Description
This project addresses a central challenge of applying Artificial Intelligence (AI) to Earth Observation (EO): the domain gap. AI models are powerful tools for analyzing the vast quantities of data produced by satellites. However, their performance is highly dependent on the data they were trained on. A model trained on satellite imagery from one geographical region (the "source domain") will often experience a significant and unpredictable drop in performance when deployed over a new area (the "target domain"). This discrepancy, arising from subtle to substantial variations in data patterns due to, for example, differences in sensor characteristics, atmospheric conditions, seasonal effects and local features, is known as the domain gap.
The consequence of this gap is a lack of reliability. For critical applications such as maritime surveillance, an AI model that cannot be trusted to perform consistently is of limited practical value. This unreliability undermines trust in the autonomous systems that rely on the AI's predictions. The project focuses specifically on the challenging task of ship segmentation in satellite imagery, a task complicated by severe class imbalance, where ships represent a tiny fraction of the total pixels in an image.
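This imbalance matters because the project's evaluation metric, Intersection over Union (IoU), is computed on the ship class alone (as discussed under the methodology below). The following is a minimal sketch of such a foreground-only IoU, assuming binary prediction and ground-truth masks as PyTorch tensors:

```python
import torch

def foreground_iou(pred: torch.Tensor, target: torch.Tensor) -> float:
    # IoU over the ship (foreground) class only: correctly classified
    # background pixels contribute nothing to the score, which is why
    # metrics dominated by the background correlate poorly with it.
    pred, target = pred.bool(), target.bool()
    intersection = (pred & target).sum().item()
    union = (pred | target).sum().item()
    return intersection / union if union > 0 else float("nan")
```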
Project Objective and Scope
The primary objective of this thesis was not merely to acknowledge the domain gap, but to quantify it. The goal was to develop a reliable method to predict an AI model's performance on an unseen target domain without requiring new, manually labeled data for that location. By accurately estimating performance degradation, the system can determine whether the model's performance remains stable over time or whether outputs for a new region are trustworthy.
This predictive capability serves two critical functions:
- Quality Assurance: It enables the system to automatically flag predictions from problematic domains as unreliable, thereby preventing the dissemination of inaccurate information to end-users.
- Autonomous Adaptation Trigger: The performance prediction can act as an automatic trigger. If the predicted performance falls below a configurable threshold, the system can initiate a domain adaptation process to retrain or fine-tune the model, enabling a more autonomous operational loop (sketched after this list).
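A schematic of how these two functions could combine into one loop; the threshold value and the `estimate_performance` and `adapt_model` hooks are hypothetical placeholders, not project APIs:

```python
IOU_THRESHOLD = 0.5  # configurable; the value here is illustrative

def estimate_performance(model, images) -> float:
    # Placeholder: in practice an unsupervised estimate such as DoCp.
    return 0.0

def adapt_model(model, images):
    # Placeholder for a retraining / fine-tuning routine.
    return model

def handle_new_region(model, images):
    predicted_iou = estimate_performance(model, images)
    if predicted_iou < IOU_THRESHOLD:
        # Quality assurance: flag outputs as unreliable, then trigger
        # the autonomous adaptation step.
        model = adapt_model(model, images)
    return model
```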
The project scope involved designing a representative ship segmentation model optimized for satellite-like hardware, creating a dataset with well-defined domain gaps, evaluating existing measurement techniques, and proposing a new metric tailored to the specific challenges of imbalanced EO data.
Research Methodology and Key Innovation
The research began by establishing a baseline. A lightweight U-Net segmentation model with a MobileNetV2 backbone was developed, trained and verified on satellite-like hardware (a Raspberry Pi and a CogniSAT-XE2 AI accelerator) to ensure the solution was grounded in a realistic deployment scenario. A dedicated dataset was created using PlanetScope imagery from multiple global ports, resulting in 22 distinct combinations of source and target domains to test.
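A minimal sketch of how such a model can be built with Segmentation Models PyTorch; the encoder weights, input channels and tile size below are assumptions, not the thesis configuration:

```python
import torch
import segmentation_models_pytorch as smp

# Lightweight U-Net with a MobileNetV2 encoder and a single output channel
# for binary ship/background segmentation. encoder_weights and in_channels
# are assumed values for illustration.
model = smp.Unet(
    encoder_name="mobilenet_v2",
    encoder_weights="imagenet",
    in_channels=3,
    classes=1,
)

# Sanity check on a dummy tile (the 256x256 size is illustrative).
logits = model(torch.randn(1, 3, 256, 256))
print(logits.shape)  # torch.Size([1, 1, 256, 256])
```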
The project then evaluated traditional domain gap metrics, which were originally developed for classification tasks, and adapted them for segmentation. This included input-based metrics (Proxy A-Distance), model-based metrics (Maximum Mean Discrepancy) and output-based metrics (Difference of Confidence, DoC). The key finding from this phase was that these established metrics failed to produce accurate performance predictions: their estimates were overwhelmed by the dominant background class (water, land, clouds). Since the evaluation metric for the task, Intersection over Union, focuses solely on the rare foreground class (ships), the predictions from these established metrics showed very weak correlation with the actual performance drop, resulting in a high error between predicted and actual model performance.
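To make the output-based baseline concrete, here is a minimal sketch of a DoC-style estimate adapted to binary segmentation: confidence is averaged over every pixel, so the background dominates. This is one plausible adaptation, not the exact thesis implementation:

```python
import torch

def mean_confidence(logits: torch.Tensor) -> torch.Tensor:
    # Per-pixel confidence for a binary (sigmoid) segmentation head:
    # the probability of whichever class the model predicts, averaged
    # over all pixels, foreground and background alike.
    probs = torch.sigmoid(logits)
    return torch.maximum(probs, 1.0 - probs).mean()

def doc(source_logits: torch.Tensor, target_logits: torch.Tensor) -> torch.Tensor:
    # DoC: drop in average confidence from source to target, used as a
    # proxy for the drop in performance. Because ships occupy a tiny
    # fraction of the pixels, this estimate is driven by the background.
    return mean_confidence(source_logits) - mean_confidence(target_logits)
```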
This failure led to the project's primary scientific contribution: the development of a novel metric, named the Positive Difference of Confidence (DoCp). This new metric overcomes the limitation of its predecessors by shifting the focus. Instead of considering the model's confidence across all pixels, the DoCp calculates the difference in confidence only for the pixels that the model predicts as belonging to the positive (ship) class. This simple modification allows the metric to filter out irrelevant changes in the background and concentrate on the predictions that directly impact the model's ability to perform its primary task.
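A minimal sketch of the DoCp idea under the same assumptions (binary sigmoid head, 0.5 decision threshold): confidence is averaged only over pixels predicted as ship, so background pixels drop out of the estimate entirely:

```python
import torch

def positive_confidence(logits: torch.Tensor, threshold: float = 0.5) -> torch.Tensor:
    # Average confidence restricted to pixels predicted as the positive
    # (ship) class; the 0.5 threshold is an assumed default.
    probs = torch.sigmoid(logits)
    positive = probs > threshold
    if not positive.any():
        # No predicted ships in this batch: nothing to measure.
        return torch.tensor(float("nan"))
    return probs[positive].mean()

def doc_p(source_logits: torch.Tensor, target_logits: torch.Tensor) -> torch.Tensor:
    # Positive Difference of Confidence: the confidence drop on ship
    # predictions only, so irrelevant background changes are filtered out.
    return positive_confidence(source_logits) - positive_confidence(target_logits)
```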
Project Outcome and Significance
The primary outcome of the project was the successful validation of the DoCp as a powerful and accurate predictor of model performance on unseen domains. In testing across the 22 domain gaps, the DoCp metric demonstrated a strong correlation with the actual IoU performance and achieved a low Mean Absolute Error, a statistically significant improvement over the baseline metrics.
The project successfully delivered a practical and robust tool that fulfills the mission's objective. The DoCp can be used in an operational environment to:
- Reliably estimate model performance in new geographical areas.
- Ensure trustworthy insights by filtering out results from domains where performance is predicted to be low.