Getting started with Deep Learning

In Deep Learning Testing I did a short exploration of the ArcGIS Pro tools for Deep Learning. Not being familiar with Deep Learning techniques before going into these tools, I ran some small tests to understand how to set up a Deep Learning workflow, from installation to model training and object detection. I will first summarise these steps here, for documentation and further “reproducibility”.

Set-up ESRI for Deep Learning

First of all, some specifications of the software and hardware I used to run the Deep Learning tools:

  • ArcGIS Pro 2.7.2 with an Advanced license
  • Processor: Intel(R) Xeon(R) CPU E5-1650 v4 @ 3.60GHz
  • Installed RAM: 64GB
  • GPU: NVIDIA GeForce GTX 1070

According to the ESRI Deep Learning Frameworks, it is possible to download an installer that takes care of adding all the deep learning packages to the Python environment in ArcGIS Pro. However, this did not work properly for me, so I followed the instructions in their guide to clone the default ArcGIS Pro environment into a new deeplearning environment and then install the libraries separately.

Workflow and pre-requisites

Workflow

In general, to detect objects with Deep Learning, one would use the following combination of tools:

Figure 1: Deep Learning workflow for object detection in ArcGIS Pro

I found some limitations at each of these steps, which I describe in the sections below. I refer to them as prerequisites for successfully running a deep learning workflow in ArcGIS Pro, based solely on my own tests; there may well be more efficient ways of doing this that I am not aware of yet.

Export Training Data

This tool creates labelled chips that will be subsequently used to train models.


Figure 2: Export Training Data for Deep Learning Tool


I decided to use the Mask R-CNN method (details here), which guided the choice of parameters for this tool. I summarize the most important parameters below.

  • Input raster: A single-band raster (any type) or a three-band (8-bit, scaled, RGB channels) raster.
  • Input feature class: The training polygons. They can overlap and differ in size.
  • Image format: The output format of the chips. I used TIFF.
  • Tile size: The size of the tiles. Larger tile sizes seem to avoid problems when using the subsequent tools.
  • Stride: The overlap between tiles; it should be smaller than the tile size. For no overlap, set the stride equal to the tile size.
  • Metadata: Depends on the model to train. I used RCNN Masks.
  • Rotation angle: For data augmentation, rotates the chips at certain angles to create more training samples. An angle of 90 creates 4x the number of samples.
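
To make these settings concrete, below is a minimal sketch of how the export could be scripted from the ArcGIS Pro Python window. The paths are placeholders and the keyword names follow my reading of the arcpy.ia.ExportTrainingDataForDeepLearning documentation, so they should be double-checked before use.

```python
import arcpy

arcpy.CheckOutExtension("ImageAnalyst")

# Hypothetical paths -- replace with the actual project data
in_raster = r"C:\data\cplan_slope_twidx_8bit.tif"   # three-band, 8-bit scaled raster
training_polygons = r"C:\data\gully_training.shp"   # training polygons (may overlap)
out_folder = r"C:\data\chips_cplan_slope_twidx"

# Parameters mirroring the list above: TIFF chips, 512 px tiles,
# stride equal to the tile size (no overlap), RCNN Masks metadata,
# and a 90 degree rotation angle for data augmentation
arcpy.ia.ExportTrainingDataForDeepLearning(
    in_raster, out_folder, training_polygons,
    image_chip_format="TIFF",
    tile_size_x=512, tile_size_y=512,
    stride_x=512, stride_y=512,
    metadata_format="RCNN_Masks",
    rotation_angle=90)
```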

The output consists of an images and a labels directory. The images are cropped subsets of the original input raster. The labels are “masks” that should ideally have a value of 1 where a pixel is overlaid by a training polygon and 0 where it is not. During my tests I realized that using overlapping training polygons can result in strange labels, with values generated outside the 0-1 range. Users have reported this issue on the ESRI forums without an answer so far. However, after some further testing I realized this is not really a problem when using a three-band raster, and hence I did not look for a solution.
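
To check whether the exported masks really contain only 0 and 1, a quick sketch like the following could be used; the labels path is hypothetical and any raster reader would work in place of arcpy.RasterToNumPyArray.

```python
import glob
import os

import arcpy
import numpy as np

# Hypothetical location of the exported label chips
labels_dir = r"C:\data\chips_cplan_slope_twidx\labels"

# Report any label chip whose values fall outside the expected {0, 1} set
for tif in glob.glob(os.path.join(labels_dir, "**", "*.tif"), recursive=True):
    values = np.unique(arcpy.RasterToNumPyArray(tif))
    if not set(values.tolist()) <= {0, 1}:
        print(f"Unexpected label values in {os.path.basename(tif)}: {values}")
```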

In general this tool took between 10 and 20 minutes to run.

NOTE: About the Input raster parameter

This can be a single-band raster or a three-band RGB-like raster. We aimed at training our model with multiple bands (at least 15), but since we were limited by this input data constraint, we selected a combination of three bands from the terrain derivatives. Next, we had to convert this multiband raster into an 8-bit unsigned, scaled raster to be able to run the Training tool. This was done with the Copy Raster tool in ArcGIS Pro, as sketched after the figure below.


Figure 3: Copy Raster Tool
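
As a rough sketch, the same conversion could also be scripted; the paths are placeholders and the pixel type and scaling keywords reflect my understanding of the Copy Raster documentation.

```python
import arcpy

# Hypothetical paths for the three-band composite and its 8-bit copy
in_composite = r"C:\data\cplan_slope_twidx.tif"
out_scaled = r"C:\data\cplan_slope_twidx_8bit.tif"

# Convert to an 8-bit unsigned raster, scaling the original pixel values
arcpy.management.CopyRaster(
    in_composite, out_scaled,
    pixel_type="8_BIT_UNSIGNED",
    scale_pixel_value="ScalePixelValue")
```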


Train Model

This tool is used to train the deep learning model (details of usage can be found here).

Some parameters to consider, which will vary computation time and outputs:

  • Input training data: The folder containing the image chips, labels, and statistics required to train the model. This is the output of the Export Training Data For Deep Learning tool. To train a model, the input images must be 8-bit rasters with three bands.
  • Max epochs: The maximum number of epochs. An epoch of one means the dataset is passed forward and backward through the neural network one time; the number of epochs is a hyperparameter that defines the number of times the learning algorithm will work through the entire training dataset. The default value is 20.
  • Model type: MASKRCNN. The MaskRCNN approach is used for object detection and instance segmentation, which is the precise delineation of objects in an image; it can be used, for example, to detect building footprints. It uses the RCNN Masks metadata format for its training data, class values for the input training data must start at 1, and it can only be trained using a CUDA-enabled GPU.
  • Batch size: The number of training samples processed at one time. The batch size is a hyperparameter that defines the number of samples to work through before updating the internal model parameters. The default value is 2; with a powerful GPU, this can be increased to 8, 16, 32, or 64.
  • Chip size: The chip size of the tiles in the training samples. The image chip size is extracted from the .emd file in the folder given as input training data.
  • Learning rate: The rate at which existing information is overwritten with newly acquired information throughout the training process. No value was defined; in that case, the optimal learning rate is extracted from the learning curve during training.
  • Backbone model: The default, RESNET50, was used. The preconfigured model is a residual network, 50 layers deep, trained on the ImageNet dataset of more than 1 million images.

Figure 4: Train Deep Learning Model Tool
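
For completeness, here is a sketch of what the corresponding training call could look like in Python. The paths are placeholders and the keyword names and values follow my reading of the Train Deep Learning Model documentation; the chip size itself is read from the .emd file produced by the export step.

```python
import arcpy

arcpy.CheckOutExtension("ImageAnalyst")

# Hypothetical paths: chips folder from the export step, output model folder
chips_folder = r"C:\data\chips_cplan_slope_twidx"
model_folder = r"C:\data\models\cplan_slope_twidx"

# Mask R-CNN with a ResNet-50 backbone; the batch size is kept small because
# larger values ran out of GPU memory on this machine (see the note below)
arcpy.ia.TrainDeepLearningModel(
    chips_folder, model_folder,
    max_epochs=20,
    model_type="MASKRCNN",
    batch_size=4,
    backbone_model="RESNET50")
```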


Setting a chip size larger than 512 or a batch size larger than 4 resulted in a CUDA out of memory error, i.e. the installed GPU was not sufficient.

In general training the deep learning models took between 10 and 12 hours.

Detect Objects

Finally, this tool was used to detect the objects using the deep learning models previously trained.


Figure 5: Detect Objects using Deep Learning Tool


The threshold parameter refers to the minimum confidence value required to keep a detected object. The padding size was the main parameter tested with this tool; the resulting objects and the run time varied depending on the padding size selected. For an overview of the run times, see the figure below.


Figure 6: Elapsed times vs Padding size
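
For reference, a single detection run could be scripted roughly as follows. The paths are placeholders and the format of the arguments string (padding, threshold, batch size) should be verified against the Detect Objects Using Deep Learning documentation.

```python
import arcpy

arcpy.CheckOutExtension("ImageAnalyst")

# Hypothetical inputs: the 8-bit composite, the trained model definition,
# and the output feature class for the detected objects
in_raster = r"C:\data\cplan_slope_twidx_8bit.tif"
model_emd = r"C:\data\models\cplan_slope_twidx\cplan_slope_twidx.emd"
out_objects = r"C:\data\detected\cplan_slope_twidx_pad64.shp"

# Padding and confidence threshold are passed through the model arguments
arcpy.ia.DetectObjectsUsingDeepLearning(
    in_raster, out_objects, model_emd,
    arguments="padding 64;threshold 0.5;batch_size 4")
```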

Application

The goal of these tests was to apply the Deep Learning Tools provided in ArcGIS Pro to detect gully features on terrain derivatives obtained from a LiDAR DEM.

Input data

Initially 15 derivatives (see figure 7) were computed from the DEM.

  • channel network
  • planar curvature
  • profile curvature
  • downslope distance gradient
  • flow accumulation
  • LS-factor
  • mass balance index
  • hillshade
  • slope
  • specific catchment area
  • stream power index
  • texture
  • terrain ruggedness index
  • terrain wetness index
  • vertical distance to channel network

Figure 7: Terrain derivatives computed

The plan was to use all of them for gully detection. Since that was not possible, I made a selection of derivatives (see here).

The combinations tested were:

  • Planar curvature + Slope + Terrain wetness index (cplan+slope+twidx)
  • LS-factor + Terrain ruggedness index + Terrain wetness index (lsfct+tridx+twidx)
  • LS-factor + Hillshade + Terrain ruggedness index (lsfct+shade+tridx)

Model characteristics

The ArcGIS Pro workflow for Deep Learning was applied to each of these combinations, preparing training data and training a model for each of them.

  1. cplan+slope+twidx model characteristics

Figure 8: Loss graph and Results subset for the first model


  2. lsfct+tridx+twidx model characteristics

Figure 9: Loss graph and Results subset for the second model


  3. lsfct+shade+tridx model characteristics

Figure 10: Loss graph and Results subset for the third model


Each of the resulting models was then passed to the Detect Objects tool with different padding sizes (250, 200, 128, 64, 32, 16, 4, 0), as sketched below. Each run of the tool produced slightly different results.
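
These repeated runs could be scripted roughly as follows, reusing the hypothetical detection call sketched earlier; the paths and the arguments string are again placeholders.

```python
import arcpy

arcpy.CheckOutExtension("ImageAnalyst")

# Hypothetical inputs for one of the three models
in_raster = r"C:\data\lsfct_shade_tridx_8bit.tif"
model_emd = r"C:\data\models\lsfct_shade_tridx\lsfct_shade_tridx.emd"

# One detection run per padding size tested
for padding in (250, 200, 128, 64, 32, 16, 4, 0):
    out_objects = rf"C:\data\detected\lsfct_shade_tridx_pad{padding}.shp"
    arcpy.ia.DetectObjectsUsingDeepLearning(
        in_raster, out_objects, model_emd,
        arguments=f"padding {padding};threshold 0.5;batch_size 4")
```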

Results

The final detected objects obtained from each of the models varied depending on the derivatives selection but also on the padding size selected.

In general, a higher number of objects was detected for the lsfct+shade+tridx model.


Figure 11: Distribution of Confidence values for the detected objects according to the padding size and model

The spatial distribution of the objects detected with the lsfct+shade+tridx model also seemed to offer a better overview of the gully locations throughout the study area. This model seemed to be the most promising combination of terrain derivatives.


Figure 12: Spatial distribution of detected objects by model

To simplify the exploration, I will only look into the results of this model. All the other results are stored as shapefiles in the deep_learning/DetectedObjects directory.

Below is an overview of the detected objects according to the padding size used. The reference gully data is mapped in light grey in the background.


Figure 13: Spatial distribution of detected objects by padding size


Figure 14: Number of detected objects by padding size


Figure 15: Area of detected objects

In general, the main limitation is the size of the detected objects, which is simply not in the same range as that of the actual gully features.

Limitations and outlook

  1. Number of bands

It is really important to note that the ESRI deep learning tools are limited in the number of bands that can be used for training: the input can be a single-band raster or a three-band RGB-like raster. I believe this has to do with the fact that these deep learning methods seem to be optimized for very-high-resolution RGB imagery. However, this is a big disadvantage when multiple bands are available, for instance with a multi-band Sentinel-2 image, whose full potential is then not exploited. For our case study, the goal was to include several terrain derivatives obtained from a LiDAR DEM. Due to this limitation, we had to select three bands at a time to simulate an RGB image to be used for training.

Other tools, such as eCognition, or using Python modules directly, seem to allow multiple bands as input; this still needs to be tested.

  2. GPU constraints

The ability to run more demanding models, which run for longer times and require larger batch sizes or a higher number of epochs, is restricted by the hardware in use. This is a well-known issue in the machine learning domain, and hence alternatives like high-performance cloud computing could be considered (although they usually come at a cost).

  3. Not completely satisfactory results

The gully objects detected with the three tested models in general matched the spatial location of the gully features in the field; however, the size of the objects was very small compared to the size of the actual gullies in the reference data.

The object sizes (mainly their lengths) do not seem to reach the shapes found in the field. I believe this could be related to the tile size selected (512 pixels), which might have been insufficient to completely capture the majority of gullies per tile. However, working with larger tile sizes was not possible due to the GPU constraints.

A possible solution is to work with the resulting features within an OBIA workflow that allows these objects to be grown in combination with the aerial imagery of the area. This is one possible idea that will be tested in a next step.