data science for gapfilling complex earth observations · 2020. 5. 6. · bessenbacher v., l....
TRANSCRIPT
-
Data Science for Gapfilling Complex Earth Observations
Authors: Verena Bessenbacher, Lukas Gudmundsson, Sonia I. Seneviratne
Verena Bessenbacher
7th May, EGU General Assembly
Land Climate Dynamics, ETH Zürich
-
INTRODUCTION
Why gapfilling?
• missing values are ubiquitous and unavoidable
• fragmentation of the observed record limits widespread use
• patterns of missingness are non-trivial
Key limitation of state-of-the-art statistical imputation methods: they cannot incorporate the special structure of geoscientific datasets
• non-trivial covariance structure
• neighborhood relations
• temporal autocorrelation
• underlying physical constraints
WHY
MODIS Skin Temperature
1st August 2010
-
MODIS Skin Temperature
1st August 2010
INTRODUCTION
There are three fundamentally different ways in which data can be missing.
Missing Completely At Random = the fact that a point is missing does not depend on any other variable, but can be described as a random process. This is rarely the case with satellite data.
Missing At Random = the missing values share the same statistical properties as the observed values. Swaths in satellite data leave such patterns.
Missing Not At Random = the points missing are systematically different from the points observed. E.g. skin temperature below clouds can be expected to be lower than under clear sky, leaving the unobserved values different from the observed ones.
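The practical difference between these mechanisms can be illustrated on synthetic numbers. The following is a minimal sketch and not part of the presented analysis: the cloud fractions, the temperature model, the thresholds and all names (`cloud`, `temps`, `observed_mean`) are assumptions made up for illustration. It removes values in three ways and compares the mean of the remaining values with the true mean.

```python
import random
from statistics import fmean

random.seed(0)

# Hypothetical daily data: cloud fraction in [0, 1] and a skin temperature
# that is lower on cloudy days (all numbers are illustrative assumptions).
n = 10_000
cloud = [random.random() for _ in range(n)]
temps = [25.0 - 10.0 * c + random.gauss(0.0, 2.0) for c in cloud]


def observed_mean(values, missing):
    """Mean over the points that are not flagged as missing."""
    return fmean(v for v, m in zip(values, missing) if not m)


# Completely at random: every point is dropped with the same probability.
mcar = [random.random() < 0.4 for _ in range(n)]

# Swath-like gaps: a regular sampling pattern that is unrelated to the weather.
swath = [(i % 10) < 4 for i in range(n)]

# Not at random: points under heavy cloud are dropped, and those are colder.
mnar = [c > 0.6 for c in cloud]

true_mean = fmean(temps)
bias = {
    "MCAR": observed_mean(temps, mcar) - true_mean,
    "swath": observed_mean(temps, swath) - true_mean,
    "MNAR": observed_mean(temps, mnar) - true_mean,
}
for name, value in bias.items():
    print(f"{name:>5}: bias of observed mean = {value:+.2f} °C")
```

In this toy setup only the cloud-dependent mask shifts the observed mean noticeably, which is why such gaps cannot simply be ignored.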
-
OBJECTIVE
HOW
We use reanalysis data from the ERA5 project (Guillory et al, 2017), which provides gap-free estimates of essential climate variables. We employ a "perfect dataset approach": we assume the reanalysis data to be the "true" state of the land-climate interactions and introduce artificial missing values that are subsequently imputed.
The analysis is confined to daily, global, land-only ERA5 data from 2003 to 2012 at 0.25° resolution. We exclude Antarctica and Greenland because soil moisture is not well defined in permanently glaciated areas. Furthermore, we only consider ERA5 variables that can be matched with available satellite remote sensing products: MODIS Aqua skin temperature (Parkinson et al, 2003), GPM precipitation (Huffmann et al, 2019) and ESA-CCI soil moisture of the uppermost soil layer (Dorigo et al, 2017, Gruber and Scanlon, 2019, Gruber et al, 2017). Additionally, we assume constant maps of vegetation type, vegetation cover, topographic height and topographic complexity to be known and gap-free.
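The perfect dataset approach described above can be sketched in a few lines. This is an illustration under stated assumptions, not the code used for the analysis: random numbers stand in for the ERA5 field and for the satellite missingness pattern, and the names `truth`, `satellite_missing` and `rmse_on_gaps` are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(42)

# Stand-in for a gap-free reanalysis field with shape (time, lat, lon).
# Random numbers are used here only so that the sketch runs on its own.
truth = rng.normal(loc=15.0, scale=5.0, size=(30, 8, 16))

# Stand-in for the missingness of a satellite product on the same grid:
# True where the satellite delivered no value (here roughly 40% of points).
satellite_missing = rng.random(truth.shape) < 0.4

# Introduce the artificial gaps. The untouched copy in `truth` stays available
# as the reference that imputed values are compared against afterwards.
gappy = np.where(satellite_missing, np.nan, truth)


def rmse_on_gaps(filled, truth, missing):
    """Root-mean-square error evaluated only where values were removed."""
    return float(np.sqrt(np.mean((filled[missing] - truth[missing]) ** 2)))


# Trivial baseline: fill every gap with the temporal mean of its grid point.
temporal_mean = np.nanmean(gappy, axis=0, keepdims=True)
baseline = np.where(satellite_missing, temporal_mean, gappy)

print(f"fraction missing: {satellite_missing.mean():.2f}")
print(f"baseline RMSE on gaps: {rmse_on_gaps(baseline, truth, satellite_missing):.2f}")
```

Because the removed values are known, any gapfilling method can be scored on exactly the points that a satellite would not have observed.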
Usually, imputation focuses on gapfilling one variable only. This is often done with the help of other variables, or by spatial or temporal interpolation.
We attempt
- multivariate imputation, i.e. using more than one variable
- mutual imputation, i.e. gapfilling each variable with the help of all others
- multiple imputation, i.e. producing several estimates for each missing value
incorporating:
- covariance structure between variables
- spatial correlation among variables
- temporal autocorrelation among variables
[Figure: perfect dataset approach. Left: MODIS Skin Temperature, 1st August 2010. Right: ERA5 Reanalysis, 1st August 2010, with MODIS missingness pattern.]
-
METHOD
Ridge Regression
Gaussian Process
Random Forest
Neural Network
skin temperature spatial interpolation
skin temperature temporal interpolation
precipitation spatial interpolation
precipitation temporal interpolation
surface layer soil moisture spatial interpolation
surface layer soil moisture temporal interpolation
for random sample of data points: # bagging approach
    while not converged: # iterative estimation of model and missing values
        for variable in variables: # variables switch places so that each variable is predictor once
            variable = f(constant variables, skin temperature, precipitation, surface layer soil moisture, …)
We sample random data points from the ERA5 variables and impute all missing values in this sample. For each variable, we iteratively produce estimates for the missing values and fit a model to the data, in an expectation-maximisation-like fashion. This procedure is repeated until the estimates for the missing data points converge. The method harnesses the highly structured nature of gridded, covarying observational datasets within the flexible function-learning toolbox of data-driven approaches. The imputation utilises (1) the temporal autocorrelation and spatial neighborhood within one variable or dataset and (2) the different missingness patterns across variables or datasets, i.e. the fact that if one variable is missing at a given point in space and time, another covarying variable might be observed and their local covariance can be learned. A simple ridge regression already outperforms simple "ad-hoc" gapfilling procedures on high-resolution daily satellite data. We are additionally working on testing nonlinear methods (Gaussian Process, Random Forest and Neural Network).
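The iterative scheme can be sketched as follows. This is a simplified illustration and not the authors' implementation: it uses a closed-form ridge regression on three synthetic covarying variables, leaves out the bagging step and the spatial and temporal interpolation predictors, and all function names, the regularisation strength `alpha` and the convergence tolerance are assumptions.

```python
import numpy as np


def ridge_fit_predict(X_train, y_train, X_pred, alpha=1.0):
    """Closed-form ridge regression with an unpenalised intercept."""
    x_mean, y_mean = X_train.mean(axis=0), y_train.mean()
    Xc = X_train - x_mean
    gram = Xc.T @ Xc + alpha * np.eye(Xc.shape[1])
    coef = np.linalg.solve(gram, Xc.T @ (y_train - y_mean))
    return (X_pred - x_mean) @ coef + y_mean


def iterative_impute(data, alpha=1.0, max_iter=20, tol=1e-4):
    """Fill NaNs in a (n_points, n_variables) array by mutual regression."""
    missing = np.isnan(data)
    # Initial gap fill: the mean of the observed values of each variable.
    filled = np.where(missing, np.nanmean(data, axis=0), data)
    for _ in range(max_iter):
        previous = filled.copy()
        for j in range(data.shape[1]):  # every variable is the target once
            gaps = missing[:, j]
            if not gaps.any():
                continue
            predictors = np.delete(filled, j, axis=1)  # all other variables
            # Fit on rows where the target was observed, predict into the gaps.
            filled[gaps, j] = ridge_fit_predict(
                predictors[~gaps], data[~gaps, j], predictors[gaps], alpha
            )
        if np.max(np.abs(filled - previous)) < tol:  # estimates stopped changing
            break
    return filled


# Synthetic stand-in for three covarying variables (not ERA5 data).
rng = np.random.default_rng(1)
latent = rng.normal(size=2000)
truth = np.column_stack([
    latent + 0.3 * rng.normal(size=2000),
    -0.8 * latent + 0.3 * rng.normal(size=2000),
    0.5 * latent + 0.3 * rng.normal(size=2000),
])
mask = rng.random(truth.shape) < 0.3  # different gaps in every variable
gappy = np.where(mask, np.nan, truth)

filled = iterative_impute(gappy)
mean_filled = np.where(mask, np.nanmean(gappy, axis=0), gappy)


def rmse(a):
    """Error on the artificially removed values only."""
    return float(np.sqrt(np.mean((a[mask] - truth[mask]) ** 2)))


rmse_mean, rmse_iter = rmse(mean_filled), rmse(filled)
print(f"RMSE mean fill: {rmse_mean:.2f}, iterative ridge: {rmse_iter:.2f}")
```

The regression is always fitted on rows where the target variable was actually observed, while the predictors may contain previously imputed values. This is what lets the estimates improve from one iteration to the next.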
-
RESULTS
[Figure: Pearson correlation coefficient per land point for skin temperature and surface layer soil moisture.]
Skin temperature is missing where cloud fraction is high, with a global fraction of missing values of 42%. ESA-CCI soil moisture has an impressive 68% of missing values, effectively obscuring tropical rainforest regions all the time and high-latitude areas with snow cover around half the time. Soil moisture measurements therefore exhibit a non-trivial missingness pattern with a comparatively high fraction of missingness among remote sensing products, which makes them especially challenging for imputation. The Pearson correlation between the gapfilled values and the original values for each land point mirrors that: correlation is high where much data can be observed, and low where data are frequently missing. However, correlation is never negative, showing that the gapfilling procedure indeed improves the estimates for the missing values.
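A per-grid-point evaluation of this kind could look like the sketch below. It assumes that the correlation is computed over time using only the time steps that were gapfilled, which the slides do not state explicitly. The function name `pearson_on_gaps`, the `min_points` threshold and the synthetic arrays are likewise assumptions.

```python
import numpy as np


def pearson_on_gaps(filled, truth, missing, min_points=3):
    """Pearson correlation per grid point over time, on gapfilled steps only.

    All arrays have shape (time, lat, lon). The result is NaN where fewer than
    `min_points` values were missing or where one of the series is constant.
    """
    n = missing.sum(axis=0)
    safe_n = np.maximum(n, 1)  # avoids division by zero for gap-free points
    f = np.where(missing, filled, 0.0)
    t = np.where(missing, truth, 0.0)
    f_anom = np.where(missing, f - f.sum(axis=0) / safe_n, 0.0)
    t_anom = np.where(missing, t - t.sum(axis=0) / safe_n, 0.0)
    cov = (f_anom * t_anom).sum(axis=0)
    denom = np.sqrt((f_anom**2).sum(axis=0) * (t_anom**2).sum(axis=0))
    valid = (n >= min_points) & (denom > 0)
    r = np.full(n.shape, np.nan)
    r[valid] = cov[valid] / denom[valid]
    return r


# Synthetic stand-in data: gaps become more frequent from west to east, and
# the "gapfilled" values are the truth plus noise.
rng = np.random.default_rng(0)
truth = rng.normal(size=(200, 4, 5))
missing = rng.random(truth.shape) < np.linspace(0.1, 0.9, 5)
noise = rng.normal(scale=0.5, size=truth.shape)
filled = np.where(missing, truth + noise, truth)

r = pearson_on_gaps(filled, truth, missing)
frac_missing = missing.mean(axis=0)
print(np.round(frac_missing[0], 2))
print(np.round(r[0], 2))
```

Mapping `r` next to `frac_missing` gives the kind of comparison described above. In this toy example the noise level is the same everywhere, so the correlation does not depend on the fraction of missing values.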
-
RESULTS
[Figure: skin temperature at 13h00, close to Basel, year 2003; y-axis: temperature [°C].]
To illustrate how the gapfilling works, the plot shows the daily skin temperature close to Basel for the year 2003. In black, the ERA5 skin temperature is plotted. In green, the same data are shown, but only the values that would have been observed by a satellite. Days on which Basel was overcast cannot be seen by the satellite, for example much of December 2003.
In red, the initial gap fill is shown, for which we use the temporal mean.
In blue, the final result is shown. The iterative procedure reduces the bias and increases the correlation between original data and gapfilled values by incorporating information
- from the other variables (soil moisture and precipitation)
- from the neighboring grid points
- from the day before and after
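The initial temporal-mean fill and the additional sources of information listed above can be sketched as array operations. This is only one possible construction: the function names, the choice of the four direct neighbors and the edge handling are assumptions, and the slides do not specify how the spatial and temporal predictors are actually built.

```python
import numpy as np


def temporal_mean_fill(gappy):
    """Initial gap fill: replace NaN by the temporal mean of the grid point."""
    mean = np.nanmean(gappy, axis=0, keepdims=True)
    return np.where(np.isnan(gappy), mean, gappy)


def lag_and_neighbor_features(filled):
    """Stack three hypothetical predictors for a (time, lat, lon) array.

    Returns an array of shape (time, lat, lon, 3) holding the value of the
    day before, the day after and the mean of the four direct neighbors.
    Values at the edges are repeated so that the output keeps its shape.
    """
    padded_t = np.pad(filled, ((1, 1), (0, 0), (0, 0)), mode="edge")
    day_before, day_after = padded_t[:-2], padded_t[2:]
    padded_xy = np.pad(filled, ((0, 0), (1, 1), (1, 1)), mode="edge")
    neighbor_mean = (
        padded_xy[:, :-2, 1:-1] + padded_xy[:, 2:, 1:-1]
        + padded_xy[:, 1:-1, :-2] + padded_xy[:, 1:-1, 2:]
    ) / 4.0
    return np.stack([day_before, day_after, neighbor_mean], axis=-1)


# Synthetic stand-in data with a single gap.
rng = np.random.default_rng(3)
truth = rng.normal(loc=10.0, scale=3.0, size=(6, 4, 5))
gappy = truth.copy()
gappy[2, 1, 1] = np.nan  # one overcast day at one grid point

initial = temporal_mean_fill(gappy)
features = lag_and_neighbor_features(initial)
print(features.shape)  # (6, 4, 5, 3)
```

Predictors of this kind could then be handed to the regression together with the other variables, so that a gap is estimated from its surroundings in time and space as well.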
[Legend: ERA5 data; satellite-observable ERA5 data; initial gapfill; gapfilling final result]
-
RESULTS: the correlations per variable align well with artificial experiments
By plotting the Pearson correlation of gapfilled vs. original values against the fraction of missing data, the merit of different gapfilling procedures can be compared. A perfect gapfilling procedure would show a Pearson correlation of 1 for all fractions of missing data. The mean initial gapfill is shown with the diamond. As expected, filling in the mean shows no variance and therefore no correlation with the original values. The iterative procedure increases the correlation for all variables (square, triangle and circle), but the higher the fraction of missing values in a variable, the lower the correlation with the original values.
To benchmark the gapfilling procedure, we additionally consider an artificial missingness pattern, where we introduce "artificial swaths" into the ERA5 dataset. With an increasing fraction of missing values, the imputation merit decreases for the artificial case (solid lines). However, our points with the real missingness pattern fall in the area of the lines. This means that although real-world satellite observations are missing not at random, we still achieve a correlation as if they were missing at random. The high physical dependency of the three variables thus helps to overcome their complex missingness pattern.
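One way to generate such a benchmark pattern is sketched below. The slides do not describe how the artificial swaths were constructed, so the diagonal-stripe geometry, the function name `artificial_swath_mask` and the parameters `period` and `drift` are purely illustrative assumptions.

```python
import numpy as np


def artificial_swath_mask(shape, fraction_missing, period=20, drift=3):
    """Boolean mask (time, lat, lon), True = missing, made of diagonal stripes.

    Within each repeating block of `period` diagonals, the first
    round(fraction_missing * period) are removed. The stripes shift by `drift`
    grid cells per time step, loosely mimicking an orbiting sensor.
    """
    n_time, n_lat, n_lon = shape
    t, i, j = np.meshgrid(
        np.arange(n_time), np.arange(n_lat), np.arange(n_lon), indexing="ij"
    )
    phase = (i + j + drift * t) % period
    return phase < round(fraction_missing * period)


# Sweep the fraction of missing values, as for the solid benchmark lines.
shape = (10, 30, 40)  # the longitude count is a multiple of `period`
fractions = [0.1, 0.3, 0.5, 0.7, 0.9]
realised = {f: float(artificial_swath_mask(shape, f).mean()) for f in fractions}
for f, value in realised.items():
    print(f"requested {f:.1f} -> realised {value:.2f}")
```

Each mask can be applied to the gap-free data, gapfilled and scored, which gives one point of a correlation-versus-missingness curve per requested fraction.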
-
CONCLUSION & OUTLOOK
Outlook
- consider another initial gap fill, using climatology
- add non-linear method for gapfilling
- add net radiation as a variable
- check physical consistency of imputed values (e.g. soil gets wet when it rains)
References
Bessenbacher, V., L. Gudmundsson and S. I. Seneviratne (2019): Testing Random Forest Imputation for Land Hydrology Data. Proceedings of the 9th International Workshop on Climate Informatics, pp 73-77.
van Buuren, S. (2018): Flexible Imputation of Missing Data. Chapman and Hall.
Dorigo, W. et al (2017): ESA CCI Soil Moisture for improved Earth system understanding: State-of-the art and future directions. Remote Sensing of Environment, 203, 185-215.
Gruber, A. et al (2017): Triple Collocation-Based Merging of Satellite Soil Moisture Retrievals. IEEE Transactions on Geoscience and Remote Sensing, 55, 12.
Gruber, A. and Scanlon, T. (2019): Evolution of the ESA CCI Soil Moisture climate data records and their underlying merging methodology. Earth System Science Data, 11, 717-739.
Guillory, A. (2017): ERA5. https://www.ecmwf.int/en/forecasts/datasets/reanalysis-datasets/era5
Huffmann, G. et al (2019): Integrated Multi-satellite Retrievals for GPM (IMERG) version 4.4. NASA's Precipitation Processing Center.
Parkinson, C. L. (2003): Aqua: an earth-observing satellite mission to examine water and other climate variables. IEEE Transactions on Geoscience and Remote Sensing, 41, 2.
Reichstein, M. et al (2019): Deep learning and process understanding for data-driven Earth system science. Nature, 566, 7743, 195ff.
Rubin, D. B. (1976): Inference and missing data. Biometrika, 63, 3, pp 581-92.
Scher, S. et al (2019): Weather and climate forecasting with neural networks: using GCMs with different complexity as study-ground. Geoscientific Model Development, 12, 2797-2809.
Shen, H. et al (2015): Missing Information Reconstruction of Remote Sensing Data: A technical review. IEEE Geosci. Remote Sens. Mag., 3, 3, 61-81.
Stekhoven, D. J. and P. Bühlmann (2012): MissForest - non-parametric missing value imputation for mixed-type data. Bioinformatics, 28, 1, 112-118.
Conclusions
- We gapfill several remote sensing datasets and test possible algorithms on gap-free ERA5 data.
- A simple Ridge Regression is able to outperform trivial initial gapfilling procedures.
- The high physical dependency between the variables makes gapfilling possible although a missing-not-at-random pattern is observed.
- Soil moisture observations are missing around 68% of the time, making soil moisture a challenging case for gapfilling.