Part 1: Pre-processing and smoothing satellite image time series

Loading and pre-processing data

Load the packages:

if (!require("pacman")) install.packages("pacman")
pacman::p_load("tidyverse", "tsibble", "bfast", "data.table", "mgcv","forecast", "zoo", "anytime", "fabletools", "signal", "fable", "tibble")

Import data - raw MTCI values from Sentinel-2 acquired from GEE:

df = read.csv("d:/OGH2023_data/species_mtci.csv")
str(df)
## 'data.frame':    1446996 obs. of  3 variables:
##  $ system.index: chr  "20170329T095021_20170329T095024_T33UXR_00000000000000000042" "20170329T095021_20170329T095024_T33UXR_00000000000000000043" "20170329T095021_20170329T095024_T33UXR_00000000000000000044" "20170329T095021_20170329T095024_T33UXR_00000000000000000045" ...
##  $ MTCI        : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ species_cd  : chr  "GB" "GB" "GB" "GB" ...
summary(df)
##  system.index            MTCI          species_cd       
##  Length:1446996     Min.   :-8.2      Length:1446996    
##  Class :character   1st Qu.: 1.3      Class :character  
##  Mode  :character   Median : 3.1      Mode  :character  
##                     Mean   : 3.0                        
##                     3rd Qu.: 4.4                        
##                     Max.   :17.8                        
##                     NA's   :1430563

As you can see, there is a lot of NA values, also, if we plot index values there are a lot of outliers.

plot(df$MTCI)

Modify column with date and change columns names:

df$system.index = as.Date(df$system.index, format =  "%Y%m%d")
names(df) = c("date", "index", "type")

Filtering values, removing NA, mean of duplicates:

df_clean = df %>% 
  drop_na() %>%
  group_by(date, type) %>%
  summarise(index = mean(index)) %>% 
  ungroup() %>%
  dplyr::filter(index < 7 & index > 0 & date > "2018-03-05") 

ggplot(df_clean, aes(date, index, color = type))+
  geom_point()+
  theme_light()

Example analysis on hornbeam (GB) time series

df_gb = df_clean %>%
  dplyr::filter(type == "GB")
ggplot(df_gb, aes(date, index))+
  geom_line(color = "darkgreen", linewidth = 1)+
  theme_light()

Firstly, identify and replace outliers using simple linear interpolation:

df_gb$index_out = df_gb$index %>% 
  tsclean()

ggplot(df_gb)+
  geom_point(aes(date, index), color = "red")+
  geom_point(aes(date, index_out), color = "darkgreen")+
  theme_light()

Smoothing 1: Simple Moving Average

df_gb$avg3 = rollmean(df_gb$index_out ,3, fill = NA)
df_gb$avg10 = rollmean(df_gb$index_out ,10, fill = NA)

ggplot(df_gb)+
  geom_line(aes(date, index_out))+
  geom_line(aes(date, avg3), color = "blue", linewidth = 1.0)+
  geom_line(aes(date, avg10), color = "red", linewidth = 1.0)+
  theme_light()

Smoothing 2: Savitzky-Golay smoothing

We need to create regular (1-day) time series with NA for missing values. And then interpolate the unknown values using na.approx() from the zoo package:

df_tsb = tsibble(df_gb[,c(1,4)], index = date, regular = TRUE) %>%
  fill_gaps() %>%
  mutate(approx = na.approx(index_out))

Apply Savitzky-Golay filter. n is a filter length (odd number) - test different values.

df_tsb$sg = df_tsb$approx %>%
  sgolayfilt(n = 71) 

Analyze the results on plot - line represent time series smoothed by S-G, red dots - original values and blue dots - interpolated values

ggplot(df_tsb)+
  geom_point(aes(date, approx), color = "blue", alpha = 0.2)+
  geom_point(aes(date, index_out), color = "red", alpha = 0.7)+
  geom_line(aes(date, sg), size = 1.1)+
  theme_light()

Smoothing 3: Generalized Additive Models (GAM) fitting

The first step is the same as in Savitzky-Golay - we will create regular time series with NA values:

df_tsb_gam = tsibble(df_gb[,c(1,4)], index = date, regular = TRUE) %>%
  fill_gaps() %>%
  ts() %>% #but here also we need to change date format from y-m-d to numeric (unix time)
  as.data.frame()

But now we can go straight to modelling without dealing with missing values!

GAM modelling with dates as predictor (Generalized Additive Mixed Models):

model = gamm(df_tsb_gam$index_out ~ s(date, k = 60), 
             data = df_tsb_gam , method = "REML")

Predicting values using GAM model and plotting the results:

df_tsb_gam$predicted = predict.gam(model$gam, df_tsb_gam)
df_tsb_gam$date = anydate(df_tsb_gam$date)

ggplot(df_tsb_gam)+
  geom_point(aes(date, index_out), color = "red", alpha = 0.4)+
  geom_line(aes(date, predicted), linewidth = 1)+
  theme_light()

Compare S-G (blue line) and GAM (black line):

df_tsb_gam$sg = df_tsb$sg

ggplot(df_tsb_gam)+
  geom_point(aes(date, index_out), color = "red", alpha = 0.4)+
  geom_line(aes(date, predicted), alpha = 0.8, linewidth = 1)+
  geom_line(aes(date, sg), color = "blue", alpha = 0.8, linewidth = 1)+
  theme_light()


Excercise 1

Read landsat_ndvi_veg_2.csv file and analyze the dataframe. Based on preliminary analysis, select one vegetation type, use cleaning/outlier removing if necessary, and perform one selected method smoothing/fitting. Finally, visualize the results and evaluate its performance! Pay attention to:

  • correctly convert system.index variable to date format (use e.g. substr() function)
  • defining correct min and max values
  • pay attention to the date range
  • when using moving average and SG - set appropriate window size (n)