Practical 3

Spatial data querying and wrangling

Authors

Team member 1

Team member 2

Published

April 10, 2025

Download the raw document here

This is the first summer that Lucia will spend in Salzburg. She is looking to visit a mountain hut where she can spend a night during the summer and do some hiking tours. Can you help her?

In this practical you will practice the basics of data wrangling in R using the tidyverse.

You will also learn how to perform spatial queries using the sf package and raster-vector operations using terra. There are extra notes in the margins to give you more information on the functions you are using.

Part 1: Data import

In this section we will load a spatial dataset of mountain huts into R and clean it before using it in the next section.

We can use the sf package to load data into R. In the background, sf uses GDAL to identify the right driver to properly load the data. As you will see, you can load data directly from a URL but also from local files.

Note how loading the sf library also links the underlying software to your R session: GDAL, GEOS, PROJ.

By default, sf_use_s2(TRUE) is also active, which means that sf will use spherical geometries. Read more about it here and here.

library(sf)
huts = read_sf("https://github.com/loreabad6/app-dev-gis/raw/refs/heads/main/data/huts.gpkg")
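
As a side note, you can check the versions of these linked libraries and the s2 setting yourself. A quick, optional check:

sf_extSoftVersion()  # versions of the linked GEOS, GDAL and PROJ libraries
sf_use_s2()          # TRUE: spherical geometry via s2 is active
# sf_use_s2(FALSE)   # would switch to planar (GEOS) operations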

Now we will start our data wrangling and cleaning workflows. For this we will use packages from the tidyverse, but you can use base R or data.table if you are more familiar with those.

You will see that 9 packages have now been attached to your workspace by loading the library(tidyverse) package bundle. You don’t need to worry about the “Conflicts” message; it is a warning to let you know that some of the loaded packages have functions with the same names.
library(tidyverse)

sf is designed to work with base R but also with tidy workflows, so we can directly use tidyverse verbs to wrangle the data.

If we glimpse into the data, we can get an idea of what we are dealing with.

huts |> 
  glimpse()

Now that is a lot of columns! You might have noticed some interesting patterns here and there, but what gives this data away is the first column: osm_id.

Now if this is a spatial file, where are the coordinates? Take a look at the last column of the data: geom.

huts$geom

These are basically the locations of our huts, and sf already knows to look for those coordinates in this column.
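
If you ever need the raw coordinates, for example for use outside of sf, you can pull them out of the geometry column. A quick illustration:

st_coordinates(huts) |> head()  # point coordinates as a plain matrix (X, Y)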

To have a quick view of where your data is located, you can use the mapview() function from the {mapview} package.

library(mapview)
mapview(huts)

Part 2: Data cleaning

If you are interested in how to query data from OSM with R you can check the code I used here as well as the documentation for osmdata for small queries and osmextract for larger queries.

So now you know this is OpenStreetMap data. If you have ever worked with OSM data before, you will know that its nodes have several tags with their properties attached to them. When querying the data with R, you get, in this case, one hut per row, with all the tags spread across columns forming a data frame.

You will wrangle and tidy this messy data a bit so that you can work with a more manageable dataset in the next section.

Watch out!

From now on, each of the code chunks below will cause an error and/or will do the desired task incorrectly. (Even the chunks that run without error are not correct!) You will need to find the mistake and correct it to complete the intended action.

You will see a large number of NA values in the huts data. That is because not all OSM tags are filled in for every mountain hut, but if at least one hut has a tag, then the corresponding column is included in the dataset.

  1. Let’s reduce the number of variables in the dataset, narrowing it down to:
  • Name of the hut
  • Elevation of the hut
  • Capacity (no. of beds)
  • Amenity (is it a restaurant, a bar, a self-service hut?)
  • What type of cuisine does the restaurant have?
  • Operator of the hut (Alpenverein, Naturfreunde, etc.)
  • Location of the hut (the coordinates)
huts_clean = huts > 
  select(name, ele, capacity, beds, amenity, cuisine, operator)
huts_clean
Tip

Take a close look at the result of your selection: are all the columns that you asked for there? What about the geom column? Did you ask for it? Is it there anyway? Let’s try to get rid of it.

# THIS CHUNK HAS NO ERROR!!!!
huts_clean |> select(-geom)

Oh no! It is still there! Well, that is because of how sf objects work. The geometry column is a “sticky” column, meaning that it cannot be dropped with tidyverse verbs. But we want to work with spatial data, so we are not really going to remove that column. 😉

If you really want to remove the column you would need to do something like st_drop_geometry(huts_clean) or st_geometry(huts_clean) = NULL.
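
Purely for illustration, a minimal sketch of both options (huts_clean itself stays intact here):

huts_nogeom = st_drop_geometry(huts_clean)  # returns a plain data frame
class(huts_nogeom)                          # no longer an sf object
# st_geometry(huts_clean) = NULL            # would drop the geometry in place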

  2. Now, let’s convert the appropriate variables to numeric. We use the mutate() function, which helps you change existing columns (if you save the result under the same column name) or create new columns (by giving the result a new column name).
huts_clean = huts_clean |> 
  mutate(
    ele = numeric(ele),
    capacity = numeric(capacity),
    beds = numeric(beds)
  )
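
If mutate() is new to you, here is a tiny sketch on made-up data (the tibble below is invented for illustration and unrelated to the exercise above):

toy = tibble(name = c("hut a", "hut b"), beds = c(12, 30))
toy |>
  mutate(
    name = str_to_upper(name),  # overwrite an existing column
    large = beds > 20           # create a new column
  )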
  3. We will next create a new variable called “capacity_overall”. This column will combine the columns “capacity” and “beds”: when there is no capacity value, the beds value is taken; otherwise the capacity value is taken. If both columns are NA, then the new column will also be NA. For this we can use the function case_when() inside the mutate() function.
huts_clean = huts_clean |> 
  capacity_overall = case_when(is.na(capacity) ~ beds, TRUE ~ capacity)
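
Again a tiny sketch on invented data, just to show how case_when() behaves (the columns x, y, z are made up):

toy = tibble(x = c(10, NA, NA), y = c(5, 7, NA))
toy |>
  mutate(z = case_when(
    is.na(x) ~ y,  # fall back to y when x is missing
    TRUE ~ x       # otherwise keep x
  ))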
  4. Considering that the huts are located in Europe, we can project the data from WGS84 to a more appropriate CRS. Let’s use the European Equal Area “EPSG:3035”.
huts_clean = huts_clean |> 
  st_set_crs("EPSG:3035")
  5. Finally, note how we started each code chunk with: huts_clean = huts_clean |>.

That is very redundant and can cause you trouble if you are recreating your object over and over again.

We can pipe all these steps together to have one single workflow for data cleaning. In the code chunk below, combine the (fixed!) code.

huts_clean = huts # add your fixed code here...
Checkpoint

Up to this point, your clean dataset should have 4.6% of the number of columns in the original dataset.

# Write code to verify that your huts_clean dataset has 4.6% 
# of the number of columns in the huts dataset

Part 3: Enrich your data

So far we have used only wrangling and cleaning functions. Now, we are going to enrich our dataset with other spatial datasets.

Watch out!

In this section you will get a series of instructions; you should implement code to fulfil each task.

  1. The huts are located in different regions. You have a regions dataset here: https://github.com/loreabad6/app-dev-gis/raw/refs/heads/main/data/regions.gpkg. Load the data using the sf package.
regions = # load the data with the sf package. We did something similar before in Part 1.
  2. Now, we will perform a spatial join of the “huts_clean” data and the “regions” data. For this you can use the function st_join(). Remember you can check the function documentation by typing ?st_join in the console.

Hint: you will most likely get an error when you first try to do your join. READ THE ERROR MESSAGE CAREFULLY! What does it tell you?

# Write code using the st_join function to join in the information on 
# the regions dataset to the huts_clean dataset. 
huts_enrich = huts_clean |> 
  st_join(...)

Note that the default join predicate is st_intersects. For point data, any other predicate does not really make sense, but when joining other types of data (polygons or lines), you can use other predicates, e.g. st_covered_by.
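
As a hedged sketch of the syntax (the layer names below are made up and not part of this practical):

# join attributes of layer_b onto the features of layer_a that it fully covers:
# layer_a |> st_join(layer_b, join = st_covered_by)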

  3. Now let’s add some data about maximum temperature. For this you will find a .tif file here: https://github.com/loreabad6/app-dev-gis/raw/refs/heads/main/data/AUT_wc2.1_30s_tmax.tif. You can load this raster dataset using the rast() function from the {terra} package.

Note how, when you load {terra}, you get a conflict warning. There is another package from the tidyverse, {tidyr}, with a function also called extract(). We will be using terra’s version, so when we call it, we will write terra::extract() to be specific.

The maximum temperature dataset comes from WorldClim. To download it with R, I used the package {geodata} and the function worldclim_country(). See the script here.

library(terra)
tmax = rast()
tmax

Notice how this dataset is printed. Raster datasets in R are represented as matrices and arrays (remember Practical 1?). The print method is a specific way to print objects from the terra package.

  4. Note that there are 12 layers in this dataset. These correspond to the 12 months of the year. We can change the names of the layers with:
# you don't need to change anything here!
names(tmax) = month.abb
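
In case you have not come across it, month.abb is a constant built into base R:

month.abb  # "Jan" "Feb" "Mar" ... "Dec"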
  5. We are interested in the summer months (June, July, August, September). Let’s get the mean tmax for these months.
tmax_mean = mean(tmax[[c(...)]]) # add the summer months to subset.

To subset layers of a terra object we use [[. The mean() function applied to a terra object returns another raster: the function is applied per pixel. In this way you can do raster algebra!
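
To see this in isolation, here is a self-contained toy example (all values are made up):

r1 = rast(nrows = 10, ncols = 10, vals = 1:100)    # a small single-layer raster
r2 = rast(nrows = 10, ncols = 10, vals = 101:200)  # a second layer
s = c(r1, r2)    # stack them into a multi-layer raster
mean(s)          # per-pixel mean across layers -> a new raster
s[[1]] * 2 + 10  # raster algebra, also applied per pixel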

  6. Now, let’s actually add the temperature information to the hut dataset. We can use the terra::extract() function to do this.
# We need to add the `terra::` in front of the function 
# because of the conflict with the package tidyr
# The first argument is the raster object and 
# the second one can be an sf object.
# You can include the huts_enrich object here.
tmax_mean_huts = terra::extract()

If you print this data you will notice that it is a data frame with exactly one row per point in the sf object, in the same order as your dataset. Therefore, you can add this information directly as a new column to the “huts_enrich” dataset.

Did you get a warning? Even if you did not transform your data to a matching CRS, the transformation was done internally by {terra}.
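
Before you fill in the chunk below, here is a toy example of what terra::extract() returns (the raster, its extent, and the points are all made up):

r = rast(nrows = 10, ncols = 10, xmin = 0, xmax = 10, ymin = 0, ymax = 10,
         crs = "EPSG:4326", vals = 1:100)
pts = st_as_sf(data.frame(x = c(2.5, 7.5), y = c(2.5, 7.5)),
               coords = c("x", "y"), crs = "EPSG:4326")
terra::extract(r, pts)  # a data frame: an ID column plus one column per layer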

# add the summer mean tmax here, note that it is the second column in the data
huts_enrich = huts_enrich |> 
  mutate(tmax_summer = ) 
Checkpoint

Up to this point, your enriched dataset should have mean “tmax” temperatures between 1.15 °C and 21.4 °C.

# Write code to verify that the tmax_summer column in 
# the huts_enrich dataset ranges between the above values

Part 4: Find the dream summer hut!

Remember Lucia? She is very excited to find the perfect hut for her. Now that you have a clean and enriched dataset, you can help her find it!

Here are her requirements:

  • The hut should be high enough to enjoy the views; she thinks huts above 800 m should do!
  • Temperature is also an important factor for Lucia. She wants to escape the heat from the valley but not freeze at the top! The maximum temperature should be higher than 15 °C on average over the summer months.
  • She wants to stay in a small hut, nothing with too many other guests (otherwise it gets so hot!), but she doesn’t want to be completely alone either. Something between 10 and 30 overall capacity sounds good to her.
  • She needs to be sure she can actually eat at the hut. Are there huts with restaurants?
  • She will start close to the train station and doesn’t want to drive long… what is the closest hut to Salzburg Hbf?

Hints for the last point…

  • You can find distances with sf::st_distance(). Create an sf object as sbg_hbf = st_sfc(st_point(c(x, y)), crs = "EPSG:code"), filling in the station’s coordinates and the matching CRS code (see the toy sketch after these hints).
  • Don’t forget to handle the CRS correctly! (Set CRS and transform).
  • You may want to create a new column first and then filter on it…
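
To make the mechanics concrete, a toy sketch with made-up coordinates (not Salzburg Hbf!):

pt_a = st_sfc(st_point(c(13, 47)), crs = "EPSG:4326")  # made-up WGS84 point
pt_b = st_sfc(st_point(c(14, 48)), crs = "EPSG:4326")  # another made-up point
st_distance(st_transform(pt_a, "EPSG:3035"),           # transform first,
            st_transform(pt_b, "EPSG:3035"))           # then measure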

Use your huts_enrich object and filter for Lucia’s requirements. After filtering, you should have only one hut as a result.

result = huts_enrich |> 
  filter(...) 

Where is the hut located? Make an interactive map!

mapview(result)
Solution 🎉

Write the name and region of the hut here!

Upload the .qmd and PDF to Blackboard (don’t forget to add all your teammates’/your name(s)!). Each member of the first team to submit receives an extra class-participation point 🏃