Exploratory Spatial Data Analysis

Setup

Call libraries and data

## Reading layer `mangatu_1939_ero_feat_nztm' from data source `R:\RESEARCH\02_PROJECTS\01_P_330001\119_STEC\04_Data\Gullies-Mangatu\mangatu_coregistered_LA\mangatu_1939_ero_feat_nztm.shp' using driver `ESRI Shapefile'
## Simple feature collection with 3673 features and 4 fields
## geometry type:  POLYGON
## dimension:      XY
## bbox:           xmin: 2019311 ymin: 5751205 xmax: 2033433 ymax: 5765733
## projected CRS:  NZGD2000_New_Zealand_Transverse_Mercator_2000

## Reading layer `mangatu_1960_ero_feat_nztm' from data source `R:\RESEARCH\02_PROJECTS\01_P_330001\119_STEC\04_Data\Gullies-Mangatu\mangatu_coregistered_LA\mangatu_1960_ero_feat_nztm.shp' using driver `ESRI Shapefile'
## Simple feature collection with 1544 features and 4 fields
## geometry type:  POLYGON
## dimension:      XY
## bbox:           xmin: 2019192 ymin: 5751220 xmax: 2033357 ymax: 5766205
## projected CRS:  NZGD2000_New_Zealand_Transverse_Mercator_2000

## Reading layer `mangatu_1970_ero_feat_nztm' from data source `R:\RESEARCH\02_PROJECTS\01_P_330001\119_STEC\04_Data\Gullies-Mangatu\mangatu_coregistered_LA\mangatu_1970_ero_feat_nztm.shp' using driver `ESRI Shapefile'
## Simple feature collection with 1724 features and 4 fields
## geometry type:  POLYGON
## dimension:      XY
## bbox:           xmin: 2019217 ymin: 5751217 xmax: 2032970 ymax: 5765994
## projected CRS:  NZGD2000_New_Zealand_Transverse_Mercator_2000

## Reading layer `mangatu_1988_ero_feat_nztm' from data source `R:\RESEARCH\02_PROJECTS\01_P_330001\119_STEC\04_Data\Gullies-Mangatu\Mangatu_feature_extraction\mangatu_1988_ero_feat_nztm.shp' using driver `ESRI Shapefile'
## Simple feature collection with 1390 features and 4 fields
## geometry type:  POLYGON
## dimension:      XY
## bbox:           xmin: 2019192 ymin: 5751216 xmax: 2033436 ymax: 5765880
## projected CRS:  NZGD2000_New_Zealand_Transverse_Mercator_2000

Explore I

First I will combine all the data into one single object to better explore and summarize variables

# Establish some common variables
gully = list(g39, g60, g70, g88)
year = c(1939, 1960, 1970, 1988)
colnames = names(g39)
# Create function to harmonize datasets
prepare_for_merge = function(x, y) {
  z = mutate(x, year = y)
  names(z) = c(colnames, 'year')
  z
}
# Merge data
l = purrr::map2(gully, year, prepare_for_merge)
gully_merge = do.call(rbind, l) %>% 
  mutate_if(is.character, factor) %>% 
  mutate(Shape_Width = Shape_Area/Shape_Leng)

Now that we have one single file, we can take a look at some of the attributes.

Features per year

We can summarize our data to check how many erosion features we have per year. I have calculated a proxy for the feature width as \(Area/Length\).

e1 = gully_merge %>% 
  st_drop_geometry() %>% 
  group_by(year) %>% 
  summarise(
    feature_count = n(), 
    total_area = sum(Shape_Area), 
    total_length = sum(Shape_Leng),
    total_width = sum(Shape_Width)
  ) %>% 
  ungroup() 

e1 %>% 
  knitr::kable()

year	feature_count	total_area	total_length	total_width
1939	3673	18429091	1132832.3	30339.13
1960	1544	19930065	897548.8	20062.77
1970	1724	19705727	882363.0	21332.77
1988	1390	11654210	526557.4	12834.24

e1 %>% 
  pivot_longer(-year, names_to = 'indicator') %>% 
  ggplot(aes(x = year, y = value)) +
  geom_line() + geom_point() +
  labs(y = "") +
  facet_wrap(~indicator, scales = 'free_y', nrow = 2)

Let’s take a look now at the variation of the indicators per feature and year

d = gully_merge %>% 
  st_drop_geometry() %>% 
  select(-c(EROS, eros_feat_)) %>% 
  group_by(year) %>% 
  pivot_longer(-year, names_to = 'indicator')
  
d %>% 
  filter(indicator == "Shape_Area") %>%
  ggplot(aes(x = as.factor(year), y = value)) +
  geom_boxplot() +
  facet_zoom(ylim = c(0, 5e4)) +
  labs(x = 'year', y = 'area (m2)')
d %>% 
  filter(indicator == "Shape_Leng") %>%
  ggplot(aes(x = as.factor(year), y = value)) +
  geom_boxplot() +
  facet_zoom(ylim = c(0, 2e3)) +
  labs(x = 'year', y = 'length (m)')
d %>% 
  filter(indicator == "Shape_Width") %>%
  ggplot(aes(x = as.factor(year), y = value)) +
  geom_boxplot() +
  facet_zoom(ylim = c(0, 25)) +
  labs(x = 'year', y = 'width (m)')

Erosion features

There are two columns that refer to this variable EROS and eros_feat_. I initially thought EROS was an acronym way to represent the long name description in eros_feat_, but when I group by both features and get a count, we can see that “Aggraded riverbed” is represented with three different types of EROS values. These are not many features, in these categories, summing up to 40 elements, but it would be still worth to check what they are.

gully_merge %>% 
  st_drop_geometry() %>% 
  group_by(EROS, eros_feat_) %>% 
  summarise(count = n()) %>% 
  ungroup() %>% 
  knitr::kable()

EROS	eros_feat_	count
ae	Active earthflow	556
asl	Active slump	932
f	Active fan	194
g	Active gully	1275
isl	Aggraded riverbed	8
r	Aggraded riverbed	8
riv	Aggraded riverbed	24
s	Surface erosion	4572
sf	Stable fan	97
t	Stable terrace	665

Coming back to the variation of the erosion feature measurements, now per eorsion type and per year

e = gully_merge %>% 
  st_drop_geometry() %>% 
  select(-c(EROS)) %>%
  group_by(eros_feat_) %>% 
  pivot_longer(-c(eros_feat_, year), names_to = 'indicator')

e %>% 
  filter(indicator == "Shape_Area") %>%
  ggplot(aes(x = eros_feat_, y = value, fill = as.factor(year))) +
  geom_boxplot() +
  facet_zoom(ylim = c(0, 1.5e5)) +
  labs(x = 'erosion feature', y = 'area (m2)') +
  scale_fill_discrete("Year") +
  theme(legend.position = "top", axis.text.x = element_text(angle = 90))

e %>% 
  filter(indicator == "Shape_Leng") %>%
  ggplot(aes(x = eros_feat_, y = value, fill = as.factor(year))) +
  geom_boxplot() +
  facet_zoom(ylim = c(0, 1e4)) +
  labs(x = 'erosion feature', y = 'length (m)') +
  scale_fill_discrete("Year") +
  theme(legend.position = "top", axis.text.x = element_text(angle = 90))

e %>% 
  filter(indicator == "Shape_Width") %>%
  ggplot(aes(x = eros_feat_, y = value, fill = as.factor(year))) +
  geom_boxplot() +
  facet_zoom(ylim = c(0, 35)) +
  labs(x = 'erosion feature', y = 'width (m)') +
  scale_fill_discrete("Year") +
  theme(legend.position = "top", axis.text.x = element_text(angle = 90))

The map below shows the evolution of these features over time.

Highlights:

co-registration seems to be good
1988 shows a reduced extent of erosion features
It is likely that re-vegetation plays a big role

tm = tm_shape(gully_merge) +
  tm_polygons(
    col = 'eros_feat_', 
    border.alpha = 0, 
    palette = "Dark2",
    legend.hist = T,
    title = "Erosion Feature",
    legend.hist.title = "Feature Count"
  ) +
  tm_facets(along = 'year', free.coords = F) +
  tm_layout(legend.outside = T, legend.hist.width = 0.8)
tmap_animation(tm, width = 1200, height = 1000, 
               filename = 'Data_overview/erosion_evolution.gif', 
               loop = T, delay = 150, restart.delay = 250)

Erosion Features 1988

A quick overview of the erosion features mapped in 1988 overlayed to the aerial imagery collected between 2012 and 2013 and LiDAR DEM from 2019.

When zooming in, we can see that re-vegetation has played a big role in the erosion status of the Mangatu area.

Considerations so far:

Check the 1957 and 1997 data: It comes from a different source and the attributes are different
LiDAR DEM shows really nice the features: do a screening of what terrain variables can be useful.
Based on which imagery will we map? The mismatch between samples and optical data might be a problem, since re-vegetation has played a big role.

Thoughts:

Try to map the gullies only based on aerial photographs and DEM derivatives:
- Candidate derivatives: slope, curvature (planar, profile), VDCN, roughness, hillshade, LS factor, wetness index
- For roughness and wetness index select one specific index from SAGA.
- Check Stream Power Index, seems meant to detect Gullies.
Use “active gully” polygons from 1988 as a reference to create chips for Deep Learning approach

Explore II

There are several mismatches with the 1988 data, some errors either from coregistration errors or from original biases and errors from the delineation. For example:

Missing gully (?) inside red circle, is this a new feature, part of the neighbor basin or simply wrongly mapped?

The feature in the black outline clearly does not correspond to the aerial photography, why? was there any change in the landscape (unlikely), or does this have to do with a wrong delineation.

The idea now would be to use bounding boxes of likely gully features as labeled boxes of training data. Likewise, we could need to create training data of areas not exposed by erosion.

Deep Learning tests I

With the gully features from 1988 I ran some tests to train a deep learning model based on the RGB imagery The model is a Mask RCNN (details here), which trains a model on imagery based on masked polygons around ground-truth features to detect objects with similar characteristics.

Export samples

Some samples of the resulting chips:

Train model

Detect objects

Considerations at this point:

Results have not been successful so far with the trained models. Further points to test:

Use DEM and derivatives as input data
Check how to use polygons as masks to generate labeled chips
Test different parameters on the literature to adjust the models

Deep Learning tests II

I managed to make the workflow work for the DEM data and derivatives. So far for one derivative at a time, eventually it would be useful to use a combination of distinct derivatives that allow the differentiation of the gully features on LiDAR derived data.

The training is now performed with a combination of all gully features for 1939, 1957, 1960, 1970, 1988, 1997 from Marden et al. 2012, 2014. The preparation of the samples is documented here

I tested the approach with the raw DEM values and with the Terrain Ruggedness Index (TRI) derivative

Results with DEM values

This are examples of the created chips used for training:

The chip size was increased to 512x512 and the reference features were combined to include all the active gullies detected from 1939 to 1997.

The model characteristics show more insightful results compared to the RGB model:

However, applying the model on a subset of the study area was unsuccessful. Running the model for the whole area is still needed, but it is already noticeable that DEM values vary greatly among gully features, and a more standard measure is needed. This is why the TRI is used next.

Results with TRI values