Practical 5
Spatial data querying and wrangling
Download the raw document here
This is the first winter that Lucia will spend in Salzburg. She is looking to visit a mountain hut where she can spend a night during the winter and do some ski tours. Can you help her?
In this practical you will practice the basics of data wrangling in R using the tidyverse. You will also learn how to perform spatial queries using the sf package and raster-vector operations using terra. There are extra notes in the margins to give you more information on the functions you are using.
- Work in groups, maximum 2!
- Google is your friend! If an error is confusing, copy it into Google and see what other people are saying. If you don’t know how to do something, search for it. You can use AI to help you understand code as well, but be critical of the outputs and always double check that it works.
- Just because there is no error message, it doesn’t mean everything went smoothly. Use the console to check each step and make sure you have accomplished what you wanted to accomplish.
- See more tips on error handling here.
- Document your process! Use the empty code chunks set up for you. If you need more, remember the keyboard shortcut is CTRL+ALT+I or CMD+ALT+I.
Part 1: Data import
In this section we will load a spatial dataset of mountain huts into R and clean it before using it in the next section.
We can use the sf package to load data into R. In the background, sf will use GDAL to identify the driver to properly load the data. As you will see, you can load data directly from a URL as well as from local files.
library(sf)
huts = read_sf("https://github.com/loreabad6/app-dev-gis/raw/refs/heads/main/data/p5_huts.gpkg")
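For reference, reading a local copy works the same way, just with a file path instead of a URL (the path below is hypothetical):

huts = read_sf("data/p5_huts.gpkg") # hypothetical local path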
Now we will start our data wrangling and cleaning workflows. For this we will use packages from the tidyverse, but you can use base R or data.table if you have experience and feel more familiar with those.
library(tidyverse)
The tidyverse is a package bundle. You don't need to worry about the "conflicted packages" message; it is a warning to let you know that there are functions in the loaded packages with the same names.
sf is designed to work with base R but also with tidy workflows, so we can directly use tidyverse verbs to wrangle the data.
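For example, this hypothetical one-liner (any dplyr verb works the same way) keeps the sf class and the geometry:

huts |>
  filter(!is.na(name)) # still an sf object, geometry included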
If we glimpse into the data we can have an idea of what we are dealing with.
huts |>
  glimpse()
Now that is a long file! You might have noticed some interesting patterns here and there, but what gives this data away is the first column: osm_id.
Now if this is a spatial file, where are the coordinates? Take a look at the last column of the data: geom.
huts$geom
These are basically the locations of our huts, and sf already knows to look for those coordinates in this column.
To have a quick view of where your data is located, you can use the mapview() function from the {mapview} package.
library(mapview)
mapview(huts)
Part 2: Data cleaning
So now you know this is OpenStreetMap data. If you have ever worked with OSM data before, you will know that its nodes have several tags with their properties attached to them. When querying the data with R, in this case you get one row per queried hut, with all the tags in different columns, forming a data frame.
You will wrangle and tidy this messy data a bit so that you can work with a more manageable dataset in the next section.
From now on, each of the code chunks below will cause an error and/or will do the desired task incorrectly. (Even the chunks that run without error are not correct!) You will need to find the mistake, and correct it, to complete the intended action.
You will see a large number of NA values in the huts data. That is because not all OSM tags are filled in for every mountain hut, but if at least one hut has a tag, then it is included in the dataset.
- Let’s reduce the number of variables in the dataset, narrowing it down to:
- Name of the hut
- Elevation of the hut
- Capacity (no. of beds)
- Amenity (is it a restaurant, a bar, a self-service hut?)
- Operator of the hut (Alpenverein, Naturfreunde, etc.)
- Location of the hut (the coordinates)
huts_clean = huts >
  select(name, ele, capacity, beds, amenity, cuisine, operator)
huts_clean
Take a close look at the result of your selection: are all the columns that you asked for there? What about the geom column? Did you ask for it? Is it there anyway? Let’s try to get rid of it.
# THIS CHUNK HAS NO ERROR!!!!
huts_clean |> select(-geom)
Oh no! It is still there! Well, that is because of how sf objects work. The geometry column is a “sticky” column, meaning that it cannot be dropped with tidyverse verbs. But, we want to work with spatial data, so we are not really going to remove that column. 😉
If you really want to remove the column, you would need to do something like st_drop_geometry(huts_clean) or st_geometry(huts_clean) = NULL.
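Note that neither of these changes huts_clean unless you assign the result, for example:

huts_df = st_drop_geometry(huts_clean) # a plain data frame, geometry removed
class(huts_df) # no longer "sf"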
- Now, let’s convert the appropriate variables to numeric. We use the mutate() function, which helps you change existing columns (if you save the result under the same column name) or create new columns (by giving the result a new column name).
huts_clean = huts_clean |>
  mutate(
    ele = numeric(ele),
    capacity = numeric(capacity),
    beds = numeric(beds)
  )
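If the change-vs-create behaviour of mutate() is unclear, here is a minimal sketch on a made-up tibble (the columns are hypothetical, not the practical’s data):

toy = tibble(name = c("Hut A", "Hut B"), ele_m = c(1200, 1850))
toy |>
  mutate(
    ele_m = ele_m + 10, # same name: overwrites the existing column
    ele_km = ele_m / 1000 # new name: creates a new column
  )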
- We will next create a new variable called “capacity_overall”. This column will combine the columns “capacity” and “beds”: when there is no capacity value, the beds value is taken; otherwise the capacity value is taken. If both columns are NA, then the new column will also have an NA. For this we can use the function case_when() inside the mutate() function.
huts_clean = huts_clean |>
  capacity_overall = case_when(is.na(capacity) ~ beds, TRUE ~ capacity)
- Considering that the huts are located in Europe, we can project the data from WGS84 to a more appropriate CRS. Let’s use the European Equal Area “EPSG:3035”.
huts_clean = huts_clean |>
  st_set_crs("EPSG:3035")
- Finally, note how we started each code chunk with huts_clean = huts_clean |>.
That is very redundant and can cause you trouble if you are recreating your object over and over again.
We can pipe all these steps together to have one single workflow for data cleaning. In the code chunk below, combine the (fixed!) code.
huts_clean = huts # add your fixed code here...
Up to this point, your clean dataset should have 4.6% of the number of columns in the original dataset.
# Write code to verify that your huts_clean dataset has 4.6%
# of the number of columns in the huts dataset
Part 3: Enrich your data
So far we have used only wrangling and cleaning functions. Now, we are going to enrich our dataset with other spatial datasets.
In this section, you will get a series of instructions; you should implement the code to get there.
- The huts are located in different regions. You have a regions dataset here: https://github.com/loreabad6/app-dev-gis/raw/refs/heads/main/data/p5_regions.gpkg. Load the data using the sf package.
regions = # load the data with the sf package. We did this before in Part 1.
- Now, we will perform a spatial join of the “huts_clean” data and the “regions” data. For this you can use the function st_join(). Remember you can check the function documentation by typing ?st_join in the console.
Hint: you will most likely get an error when you first try to do your join. READ THE ERROR MESSAGE CAREFULLY! What does it tell you?
# Write code using the st_join function to join in the information on
# the regions dataset to the huts_clean dataset.
huts_enrich = huts_clean |>
  st_join(...)
Note that the default join predicate used is st_intersects. For point data, any other predicate does not really make sense, but when joining other types of data (polygons or lines), you can use other predicates, e.g. st_covered_by.
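A minimal sketch of how you would pass a different predicate (poly_a and poly_b are hypothetical polygon layers):

joined = poly_a |>
  st_join(poly_b, join = st_covered_by)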
- Now let’s add some data about maximum temperature. For this you will find a .tif file here: https://github.com/loreabad6/app-dev-gis/raw/refs/heads/main/data/AUT_wc2.1_30s_tmax.tif. You can load this raster dataset using the rast() function from the {terra} package.
Note how when you load {terra} you get a conflict warning. There is another package, {tidyr} from the tidyverse, with a function extract(). We will be using the terra version, so when we call it, we will write terra::extract() to be specific.
The maximum temperature dataset is from WorldClim data. To download it with R, I used the package {geodata} and the function worldclim_country(). See the script here.
library(terra)
tmax = rast() # add the URL to the .tif file inside rast()
tmax
Notice how this dataset is printed. Raster datasets in R are represented as matrices and arrays (remember Practical 2?). The print method is a specific way to print objects from the terra package.
- Note that there are 12 layers in this dataset. These correspond to the 12 months in the year. We can change the names of the layers with:
# you don't need to change anything here!
names(tmax) = month.abb
- We are interested in the winter months (Dec, Jan, Feb, Mar). Let’s get the mean tmax for these months.
tmax_mean = mean(tmax[[c(...)]]) # add the winter months to subset.
To subset layers of a terra object we use [[. The mean() function applied to terra objects returns another raster: it applies the function per pixel. In this way you can do raster algebra!
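Here is a minimal sketch of that per-pixel behaviour on a toy two-layer raster (made-up values again):

r2 = rast(nrows = 2, ncols = 2, nlyrs = 2, vals = c(1:4, 11:14))
mean(r2[[c(1, 2)]]) # a new raster: the per-pixel mean of layers 1 and 2 (6, 7, 8, 9)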
- Now, let’s actually add the temperature information to the hut dataset. We can use the terra::extract() function to do this.
# We need to add the `terra::` in front of the function
# because of the conflict with the package tidyr
# The first argument is the raster object and
# the second one can be an sf object.
# You can include the huts_enrich object here.
tmax_mean_huts = terra::extract()
If you print this data you will notice that it is a data frame with exactly one row per point of the sf object. The order is the same as in your dataset. Therefore, you can add this information directly as a new column to the “huts_enrich” dataset.
Did you get a warning? Even if you did not transform your temperature raster to the correct CRS, the transformation was done internally by {terra}.
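As a toy illustration of that one-row-per-point correspondence (hypothetical objects, not the practical’s data):

pts = st_as_sf(data.frame(x = c(0.5, 1.5), y = c(0.5, 1.5)), coords = c("x", "y"))
r3 = rast(nrows = 2, ncols = 2, xmin = 0, xmax = 2, ymin = 0, ymax = 2, vals = 1:4)
terra::extract(r3, pts) # a data frame: one row per point, ID column + raster value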
# add the winter mean tmax here, note that it is the second column in the data
huts_enrich = huts_enrich |>
  mutate(tmax_winter = )
Up to this point, your enriched dataset should have mean “tmax” temperatures between -10.15 °C and 4.20 °C.
# Write code to verify that the tmax_winter column in
# the huts_enrich dataset ranges between -10.15 and 4.20
Part 4: Find the dream winter hut!
Remember Lucia? She is very excited to find the perfect hut for her. Now that you have a clean and enriched dataset, you can help her find it!
Here are her requirements:
- The hut should be at a good enough altitude so that she can have some snow for skiing, she thinks huts above 800 m should be snowy enough!
- Temperature is also an important factor for good snow conditions. The maximum temperature shouldn’t be higher than 0 °C on average over the winter months.
- Lucia is from Argentina and is currently getting a new passport, so she can’t leave Austria.
- She wants to stay in a small hut, nothing with too many other guests, but she doesn’t want to be completely alone either. Something between 10 and 50 overall capacity sounds good to her.
- She needs to be sure she can actually eat at the hut; are there huts with restaurants?
- It’s her first winter in Salzburg, so she surely needs to try some regional cuisine!
- She has an Austrian Alpine Club membership (ÖAV) and she would be happy to make use of its benefits. Is there a hut that is operated by the ÖAV? Hint: you can detect a string with str_detect(); see the small sketch after this list.
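If str_detect() is new to you, here is a minimal sketch on made-up operator strings:

ops = c("ÖAV Sektion Salzburg", "Naturfreunde", NA)
str_detect(ops, "ÖAV") # TRUE FALSE NA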
Use your huts_enrich object and filter for Lucia’s requirements. After filtering, you should have only one hut as a result.
result = huts_enrich |>
  filter(...)
Where is the hut located? Make an interactive map!
mapview(result)
Write the name and region of the hut here!
Upload the .qmd doc to Blackboard (don’t forget to add all your teammates’/your name(s)!). The first team to do so receives an extra point each in class participation 🏃