Data Science for Gapfilling Complex Earth Observations · 2020. 5. 6.

Authors: Verena Bessenbacher, Lukas Gudmundsson, Sonia I. Seneviratne

Verena Bessenbacher
7th May, EGU General Assembly
Land Climate Dynamics, ETH Zürich
[email protected]



  • INTRODUCTION

    Why gapfilling?
    - missing values are ubiquitous and unavoidable
    - fragmentation of the observed record limits widespread use
    - patterns of missingness are non-trivial

    Key limitation: state-of-the-art statistical imputation methods cannot incorporate the special structure of geoscientific datasets:
    - non-trivial covariance structure
    - neighborhood relations
    - temporal autocorrelation
    - underlying physical constraints

    [Figure: MODIS Skin Temperature, 1st August 2010]


  • INTRODUCTION

    There are three fundamentally different ways in which data can be missing.

    Missing Completely At Random = the fact that a point is missing does not depend on any other variable, but can be described as a random process.
    This is rarely the case with satellite data.

    Missing At Random = whether a point is missing depends only on observed information; given that information, the missing values share the same statistical properties as the observed values.
    Swaths in satellite data leave such patterns.

    Missing Not At Random = the points missing are systematically different from the points observed.
    E.g. skin temperature below clouds can be expected to be lower than under clear sky, leaving the unobserved values different from the observed ones.

    [Figure: MODIS Skin Temperature, 1st August 2010]
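The three mechanisms can be made concrete with a small simulation. The sketch below is not taken from the slides: the linear toy relation between temperature, latitude and cloud cover, the missingness probabilities and all names are assumptions chosen for illustration. It reports the mean of the hidden values minus the mean of the observed values, once for the raw temperature and once after removing the (here known) effect of the always-observed latitude.

```python
import random


def mean(values):
    return sum(values) / len(values)


def missing_minus_observed(values, missing):
    """Mean of the hidden values minus mean of the observed values."""
    hidden = [v for v, m in zip(values, missing) if m]
    seen = [v for v, m in zip(values, missing) if not m]
    return mean(hidden) - mean(seen)


def simulate(n=20_000, seed=0):
    rng = random.Random(seed)
    lat = [rng.uniform(0.0, 60.0) for _ in range(n)]  # always observed
    cloud = [rng.random() for _ in range(n)]          # assumed unobserved
    # Hypothetical toy relation: colder at high latitude and under clouds.
    temp = [30.0 - 0.3 * la - 5.0 * c + rng.gauss(0.0, 1.0)
            for la, c in zip(lat, cloud)]
    # Temperature with the known latitude effect removed.
    temp_given_lat = [t + 0.3 * la for t, la in zip(temp, lat)]

    masks = {
        # MCAR: a coin flip that ignores everything else
        "MCAR": [rng.random() < 0.4 for _ in range(n)],
        # MAR: depends only on the observed latitude
        "MAR": [rng.random() < 0.8 * la / 60.0 for la in lat],
        # MNAR: depends on the unobserved cloud cover, which cools the surface
        "MNAR": [c > 0.6 for c in cloud],
    }
    return {
        name: (missing_minus_observed(temp, mask),
               missing_minus_observed(temp_given_lat, mask))
        for name, mask in masks.items()
    }


if __name__ == "__main__":
    for name, (raw, adjusted) in simulate().items():
        print(f"{name}: raw {raw:+.2f} °C, given latitude {adjusted:+.2f} °C")
```

In this toy setting the MAR difference vanishes once latitude is accounted for, while the MNAR difference remains, because nothing that was observed explains it.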


  • OBJECTIVE

    HOW

    We use reanalysis data from the ERA5 project (Guillory, 2017), which provides gap-free estimates of essential climate variables. We employ a "perfect dataset approach", where we assume the reanalysis data to be the "true" state of the land-climate interactions and introduce artificial missing values that are subsequently imputed.

    The analysis is confined to daily, global, land-only ERA5 data from 2003 to 2012, at 0.25° resolution. We exclude Antarctica and Greenland because soil moisture is not well defined in permanently glaciated areas. Furthermore, we consider only ERA5 variables that can be matched with available satellite remote sensing products: MODIS Aqua skin temperature (Parkinson, 2003), GPM precipitation (Huffmann et al, 2019) and ESA-CCI soil moisture of the uppermost soil layer (Dorigo et al, 2017; Gruber and Scanlon, 2019; Gruber et al, 2017). Additionally, we assume constant maps of vegetation type, vegetation cover, topographic height and topographic complexity to be known and gap-free.

    Usually, imputation focuses on gapfilling one variable only. This is often done with the help of other variables, or with spatial or temporal interpolation.

    We attempt
    - multivariate imputation, i.e. using more than one variable
    - mutual imputation, i.e. gapfilling each variable with the help of all others
    - multiple imputation, i.e. producing several estimates for each missing value

    incorporating:
    - covariance structure between variables
    - spatial correlation among variables
    - temporal autocorrelation among variables

    [Figure: perfect dataset approach. MODIS Skin Temperature, 1st August 2010, next to ERA5 Reanalysis, 1st August 2010, with the MODIS missingness pattern]
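The perfect dataset approach amounts to hiding values of a gap-free series wherever the satellite product has gaps, imputing them, and scoring the result against the hidden truth. The sketch below uses a synthetic seasonal cycle as a stand-in for ERA5 and a random mask as a stand-in for the MODIS pattern; both, like the function names, are assumptions for illustration only.

```python
import math
import random


def apply_mask(truth, missing):
    """Hide values of a gap-free series where the real product has gaps."""
    return [None if m else v for v, m in zip(truth, missing)]


def mean_fill(series):
    """Trivial imputation: replace gaps with the mean of the observed values."""
    observed = [v for v in series if v is not None]
    fill = sum(observed) / len(observed)
    return [fill if v is None else v for v in series]


def rmse_at_gaps(truth, filled, missing):
    """Root-mean-square error, evaluated only where values were hidden."""
    errors = [(f - t) ** 2 for t, f, m in zip(truth, filled, missing) if m]
    return math.sqrt(sum(errors) / len(errors))


rng = random.Random(42)
# Stand-in for a gap-free reanalysis series (hypothetical seasonal cycle).
truth = [15.0 + 10.0 * math.sin(2 * math.pi * d / 365) for d in range(365)]
# Stand-in for the missingness pattern of a satellite product.
missing = [rng.random() < 0.42 for _ in truth]

gappy = apply_mask(truth, missing)
filled = mean_fill(gappy)
print(f"RMSE at artificial gaps: {rmse_at_gaps(truth, filled, missing):.2f}")
```

Because the truth is known at every hidden point, any imputation method can be swapped in for `mean_fill` and compared on the same gaps.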


  • METHOD

    Candidate models for f: Ridge Regression, Gaussian Process, Random Forest, Neural Network

    Predictors derived from each variable:
    - skin temperature: spatial interpolation, temporal interpolation
    - precipitation: spatial interpolation, temporal interpolation
    - surface layer soil moisture: spatial interpolation, temporal interpolation

    for random sample of data points: # bagging approach
        while not converged: # iterative estimation of model and missing values
            for variable in variables: # variables switch places so that each variable is predictor once
                variable = f(constant variables, skin temperature, precipitation, surface layer soil moisture, …)

    We sample random data points from the ERA5 variables and impute all missing values in this sample. We iteratively produce estimates for the missing values and fit a model to the data for each variable, in an expectation-maximisation-like fashion. This procedure is repeated until the estimates for the missing data points converge.

    The method harnesses the highly structured nature of gridded, covarying observation datasets within the flexible function-learning toolbox of data-driven approaches. The imputation utilises (1) the temporal autocorrelation and spatial neighborhood within one variable or dataset and (2) the different missingness patterns across variables or datasets, i.e. the fact that if one variable is missing at a given point in space and time, another covarying variable might be observed, so that their local covariance can be learned.

    A simple ridge regression already outperforms simple "ad-hoc" gapfilling procedures on high-resolution daily satellite data. We are additionally working on testing nonlinear methods (Gaussian Process, Random Forest and Neural Network).
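The iterative scheme can be illustrated with a deliberately reduced sketch: two variables only, a one-predictor ridge regression in closed form, and a mean initial fill. It leaves out the random sampling of data points, the constant maps and the spatial and temporal interpolation predictors described above. The function names, the penalty `alpha` and the stopping rule are assumptions, not the authors' implementation.

```python
import random


def ridge_fit(x, y, alpha=1.0):
    """One-predictor ridge regression; returns (slope, intercept)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    slope = sxy / (sxx + alpha)  # alpha shrinks the slope towards zero
    return slope, my - slope * mx


def iterative_impute(columns, alpha=1.0, tol=1e-6, max_iter=100):
    """Mutually gap-fill two equally long columns that use None for gaps."""
    masks = [[v is None for v in col] for col in columns]
    filled = []
    for col in columns:  # initial gap fill: mean of the observed values
        observed = [v for v in col if v is not None]
        fill = sum(observed) / len(observed)
        filled.append([fill if v is None else v for v in col])

    for _ in range(max_iter):
        change = 0.0
        for target in (0, 1):  # the two variables switch roles
            other = 1 - target
            # Fit only on points where the target was actually observed.
            rows = [i for i, m in enumerate(masks[target]) if not m]
            slope, intercept = ridge_fit([filled[other][i] for i in rows],
                                         [filled[target][i] for i in rows],
                                         alpha)
            # Update the estimates at the gaps of the target variable.
            for i, m in enumerate(masks[target]):
                if m:
                    new = intercept + slope * filled[other][i]
                    change = max(change, abs(new - filled[target][i]))
                    filled[target][i] = new
        if change < tol:  # estimates have stopped moving
            break
    return filled


if __name__ == "__main__":
    rng = random.Random(1)
    x = [rng.gauss(0.0, 1.0) for _ in range(500)]
    y = [2.0 * v + 1.0 + rng.gauss(0.0, 0.5) for v in x]
    gappy_x = [None if rng.random() < 0.3 else v for v in x]
    gappy_y = [None if rng.random() < 0.3 else v for v in y]
    filled_x, filled_y = iterative_impute([gappy_x, gappy_y])
    print(filled_x[:5], filled_y[:5])
```

Where both variables are missing at the same point, this reduced version has no further information and its estimates stay close to the mean; in the setup described above, neighboring grid points and adjacent days supply that information.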


  • RESULTS

    [Figure: maps of the Pearson correlation coefficient for skin temperature and surface layer soil moisture, colour scale from 0 to 1]

    Skin temperature is missing where cloud fraction is high, with a global fraction of missing values of 42%. ESA-CCI soil moisture has an impressive 68% of missing values, effectively obscuring tropical rainforest regions all the time and high-latitude areas with snow cover around half the time. Soil moisture measurements therefore exhibit a non-trivial missingness pattern with a comparatively high fraction of missingness among remote sensing products, which makes them especially challenging for imputation.

    The Pearson correlation between the gap-filled values and the original values for each land point mirrors this: correlation is high where much data is observed and low where data is frequently missing. However, correlation is never negative, showing that the gap filling procedure applied indeed improves the estimates for the missing values.
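A per-grid-point score of this kind could be computed as sketched below. Two choices here are assumptions rather than details from the slides: the correlation is evaluated over the gap-filled time steps only, and a series without variance (such as a constant fill) is reported as `None` instead of a number.

```python
import math


def pearson(a, b):
    """Pearson correlation; None if either series has no variance."""
    if max(a) == min(a) or max(b) == min(b):
        return None
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    va = sum((x - ma) ** 2 for x in a)
    vb = sum((y - mb) ** 2 for y in b)
    return cov / math.sqrt(va * vb)


def score_grid_point(original, gapfilled, missing):
    """Missing fraction and correlation over the gap-filled time steps."""
    idx = [i for i, m in enumerate(missing) if m]
    fraction = len(idx) / len(missing)
    if len(idx) < 2:
        return fraction, None
    r = pearson([original[i] for i in idx], [gapfilled[i] for i in idx])
    return fraction, r


if __name__ == "__main__":
    original = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
    gapfilled = [1.0, 2.5, 3.0, 3.5, 5.0, 6.5]
    missing = [False, True, False, True, False, True]
    print(score_grid_point(original, gapfilled, missing))
```

Applying `score_grid_point` to every land point yields the two quantities discussed here: how much is missing and how well the gaps are recovered.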


  • RESULTS

    [Figure: skin temperature at 13h00, close to Basel, year 2003; y-axis: temperature [°C]]

    To illustrate how the gap filling works, the plot shows the daily skin temperature close to Basel for the year 2003. In black, the ERA5 skin temperature is plotted. In green, the same data is shown, but only for the values that would have been observed by a satellite. Days on which Basel was overcast cannot be seen by the satellite, for example much of December 2003.

    In red, the initial gap fill is shown, for which we use the temporal mean.

    In blue, the final result is shown. The iterative procedure reduces the bias and increases the correlation between original data and gapfilled values by incorporating information
    - from the other variables (soil moisture and precipitation)
    - from the neighboring grid points
    - from the day before and after


    [Figure legend: ERA5 data; satellite-observable ERA5 data; initial gapfill; gapfilling final result]

  • RESULTS: the correlations per variable align well with artificial experiments


    By plotting the fraction of missing data against the Pearson correlation of gapfilled vs. original values, the merit of different gap filling procedures can be compared. A perfect gap filling procedure would show a Pearson correlation of 1 for all fractions of missing data. The initial mean gapfill is shown with the diamond. As expected, filling in the mean shows no variance and therefore no correlation with the original values. The iterative procedure increases the correlation for all variables (square, triangle and circle), but the higher the fraction of missing values in a variable, the lower the correlation with the original values.

    To benchmark the gap filling procedure, we additionally consider an artificial missingness pattern, where we introduce "artificial swaths" into the ERA5 dataset. With an increasing fraction of missing values, the imputation merit decreases for the artificial case (solid lines). However, our points with the real missingness pattern fall in the area of these lines. This means that although real-world satellite observations are missing not at random, we still achieve a correlation as if they were missing at random. In other words, the high physical dependency of the three variables helps overcome their complex missingness pattern.
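An artificial swath pattern with a controllable fraction of missing values could be generated as sketched below. The slides do not say how the swaths were constructed, so the diagonal stripe geometry, the `period` parameter and the daily shift are assumptions for illustration.

```python
def swath_mask(n_lat, n_lon, missing_fraction, period=20, day=0):
    """True marks a missing cell; gaps form diagonal stripes between swaths."""
    gap_width = round(missing_fraction * period)
    return [[(lat + lon + 3 * day) % period < gap_width
             for lon in range(n_lon)]
            for lat in range(n_lat)]


def missing_fraction_of(mask):
    """Fraction of grid cells flagged as missing."""
    cells = [m for row in mask for m in row]
    return sum(cells) / len(cells)


if __name__ == "__main__":
    # Increasing gap width, as along the x-axis of the benchmark plot.
    for target in (0.2, 0.4, 0.6, 0.8):
        mask = swath_mask(n_lat=20, n_lon=40, missing_fraction=target)
        print(target, missing_fraction_of(mask))
```

Sweeping `missing_fraction` and scoring the gap fill for each mask gives a reference curve against which the real missingness patterns can be placed.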

  • CONCLUSION & OUTLOOK

    Outlook
    - consider another initial gap fill, using climatology
    - add non-linear method for gapfilling
    - add net radiation as a variable
    - check physical consistency of imputed values (e.g. soil gets wet when it rains)


    References
    Bessenbacher, V., L. Gudmundsson and S. I. Seneviratne (2019): Testing Random Forest Imputation for Land Hydrology Data. Proceedings of the 9th International Workshop on Climate Informatics, pp 73-77.
    van Buuren, S. (2018): Flexible Imputation of Missing Data. Chapman and Hall.
    Dorigo, W. et al (2017): ESA CCI Soil Moisture for improved Earth system understanding: State-of-the art and future directions. Remote Sensing of Environment, 203, 185-215.
    Gruber, A. et al (2017): Triple Collocation-Based Merging of Satellite Soil Moisture Retrievals. IEEE Transactions on Geoscience and Remote Sensing, 55, 12.
    Gruber, A. and Scanlon, T. (2019): Evolution of the ESA CCI Soil Moisture climate data records and their underlying merging methodology. Earth System Science Data, 11, 717-739.
    Guillory, A. (2017): ERA5. https://www.ecmwf.int/en/forecasts/datasets/reanalysis-datasets/era5
    Huffmann, G. et al (2019): Integrated Multi-satellite Retrievals for GPM (IMERG) version 4.4. NASA's Precipitation Processing Center.
    Parkinson, C. L. (2003): Aqua: an earth-observing satellite mission to examine water and other climate variables. IEEE Transactions on Geoscience and Remote Sensing, 41, 2.
    Reichstein, M. et al (2019): Deep learning and process understanding for data-driven Earth system science. Nature, 566, 7743, 195ff.
    Rubin, D. B. (1976): Inference and missing data. Biometrika, 63, 3, pp 581-92.
    Scher, S. et al (2019): Weather and climate forecasting with neural networks: using GCMs with different complexity as study-ground. Geoscientific Model Development, 12, 2797-2809.
    Shen, H. et al (2015): Missing Information Reconstruction of Remote Sensing Data: A technical review. IEEE Geosci. Remote Sens. Mag., 3, 3, 61-81.
    Stekhoven, D. J. and P. Bühlmann (2012): MissForest - non-parametric missing value imputation for mixed-type data. Bioinformatics, 28, 1, 112-118.

    Conclusions
    - We gapfill several remote sensing datasets and test possible algorithms on gap-free ERA5 data
    - A simple Ridge Regression is able to outperform trivial initial gapfilling procedures
    - The high physical dependency between the variables makes gapfilling possible although a missing not at random pattern is observed
    - Soil moisture observations are missing around 68% of the time, making it a challenging case for gapfilling