Practical 5
Spatial data querying and wrangling
Download the raw document here
This is the first winter that Lucia will spend in Salzburg. She is looking to visit a mountain hut where she can spend a night during the winter and do some ski tours. Can you help her?
In this practical you will practice the basics of data wrangling in R using the tidyverse. You will also learn how to perform spatial queries using the sf package and raster-vector operations using terra. There are extra notes in the margins to give you more information on the functions you are using.
- Work in groups, maximum 2!
- Google is your friend! If an error is confusing, copy it into Google and see what other people are saying. If you don’t know how to do something, search for it. You can use AI to help you understand code as well, but be critical of the outputs and always double check that it works.
- Just because there is no error message, it doesn’t mean everything went smoothly. Use the console to check each step and make sure you have accomplished what you wanted to accomplish.
- See more tips on error handling here.
- Document your process! Use the empty code chunks set up for you. If you need more, remember the keyboard shortcut is CTRL+ALT+I or CMD+ALT+I.
Part 1: Data import
In this section we will load a spatial dataset of mountain huts into R and clean it before using it in the next section.
We can use the sf package to load data into R. In the background, sf will use GDAL to identify the driver to properly load the data. As you will see, you can load data directly from a URL as well as from local files.
library(sf)
huts = read_sf("https://github.com/loreabad6/app-dev-gis/raw/refs/heads/main/data/p5_huts.gpkg")
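For reference, reading a local copy works the same way, just with a file path instead of a URL (the path below is hypothetical):

huts = read_sf("data/p5_huts.gpkg") # hypothetical local path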
Now we will start our data wrangling and cleaning workflows. For this we will use packages from the tidyverse, but you can use base R or data.table if you have experience and feel more familiar with those.
library(tidyverse)
The tidyverse is a package bundle. You don't need to worry about the "conflicted packages" message; it is a warning to let you know that there are functions in the loaded packages with the same names.
sf is designed to work with base R but also with tidy workflows, so we can directly use tidyverse verbs to wrangle the data.
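For example, this hypothetical one-liner (any dplyr verb works the same way) keeps the sf class and the geometry:

huts |>
  filter(!is.na(name)) # still an sf object, geometry included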
If we glimpse into the data we can have an idea of what we are dealing with.
huts |>
  glimpse()
Now that is a long file! You might have noticed some interesting patterns here and there, but what gives this data away is the first column: osm_id.
Now if this is a spatial file, where are the coordinates? Take a look at the last column of the data: geom.
huts$geom
These are basically the locations of our huts, and sf already knows to look for those coordinates in this column.
To have a quick view of where your data is located, you can use the mapview() function from the {mapview} package.
library(mapview)
mapview(huts)
Part 2: Data cleaning
So now you know this is OpenStreetMap data. If you have ever worked with OSM data before, you will know that its nodes have several tags with their properties attached to them. When querying the data with R, in this case you get one row per queried hut, with all the tags in different columns, forming a data frame.
You will wrangle and tidy this messy data a bit so that you can work with a more manageable dataset in the next section.
From now on, each of the code chunks below will cause an error and/or will do the desired task incorrectly. (Even the chunks that run without error are not correct!) You will need to find the mistake, and correct it, to complete the intended action.
You will see a large number of NA values in the huts data. That is because not all OSM tags are filled in for every mountain hut, but if at least one hut has a tag, then it is included in the dataset.
- Let’s reduce the number of variables in the dataset, narrowing it down to:
- Name of the hut
- Elevation of the hut
- Capacity (no. of beds)
- Amenity (is it a restaurant, a bar, a self-service hut?)
- Operator of the hut (Alpenverein, Naturfreunde, etc.)
- Location of the hut (the coordinates)
huts_clean = huts >
  select(name, ele, capacity, beds, amenity, cuisine, operator)
huts_clean
Take a close look at the result of your selection: are all the columns that you asked for there? What about the geom column? Did you ask for it? Is it there anyway? Let’s try to get rid of it.
# THIS CHUNK HAS NO ERROR!!!!
huts_clean |> select(-geom)
Oh no! It is still there! Well, that is because of how sf objects work. The geometry column is a “sticky” column, meaning that it cannot be dropped with tidyverse verbs. But, we want to work with spatial data, so we are not really going to remove that column. 😉
If you really want to remove the column, you would need to do something like st_drop_geometry(huts_clean) or st_geometry(huts_clean) = NULL.
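Note that neither of these changes huts_clean unless you assign the result, for example:

huts_df = st_drop_geometry(huts_clean) # a plain data frame, geometry removed
class(huts_df) # no longer "sf"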
- Now, let’s convert the appropriate variables to numeric. We use the mutate() function, which helps you change existing columns (if you save the result under the same column name) or create new columns (by giving the result a new column name).
huts_clean = huts_clean |>
  mutate(
    ele = numeric(ele),
    capacity = numeric(capacity),
    beds = numeric(beds)
  )
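If the change-vs-create behaviour of mutate() is unclear, here is a minimal sketch on a made-up tibble (the columns are hypothetical, not the practical’s data):

toy = tibble(name = c("Hut A", "Hut B"), ele_m = c(1200, 1850))
toy |>
  mutate(
    ele_m = ele_m + 10, # same name: overwrites the existing column
    ele_km = ele_m / 1000 # new name: creates a new column
  )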
- We will next create a new variable called “capacity_overall”. This column will combine the columns “capacity” and “beds”: when there is no capacity value, the beds value is taken; otherwise the capacity value is taken. If both columns are NA, then the new column will also have an NA. For this we can use the function case_when() inside the mutate() function.
huts_clean = huts_clean |>
  capacity_overall = case_when(is.na(capacity) ~ beds, TRUE ~ capacity)
- Considering that the huts are located in Europe, we can project the data from WGS84 to a more appropriate CRS. Let’s use the European Equal Area “EPSG:3035”.
huts_clean = huts_clean |>
  st_set_crs("EPSG:3035")
- Finally, note how we started each code chunk with huts_clean = huts_clean |>.
That is very redundant and can cause you trouble if you are recreating your object over and over again.
We can pipe all these steps together to have one single workflow for data cleaning. In the code chunk below, combine the (fixed!) code.
huts_clean = huts # add your fixed code here...
Up to this point, your clean dataset should have 4.6% of the number of columns in the original dataset.
# Write code to verify that your huts_clean dataset has 4.6%
# of the number of columns in the huts dataset
Part 3: Enrich your data
So far we have used only wrangling and cleaning functions. Now, we are going to enrich our dataset with other spatial datasets.
In this section, you will get a series of instructions; you should implement the code to get there.
- The huts are located in different regions. You have a regions dataset here: https://github.com/loreabad6/app-dev-gis/raw/refs/heads/main/data/p5_regions.gpkg. Load the data using the sf package.
regions = # load the data with the sf package. We did this before in Part 1.
- Now, we will perform a spatial join of the “huts_clean” data and the “regions” data. For this you can use the function st_join(). Remember you can check the function documentation by typing ?st_join in the console.
Hint: you will most likely get an error when you first try to do your join. READ THE ERROR MESSAGE CAREFULLY! What does it tell you?
# Write code using the st_join function to join in the information on
# the regions dataset to the huts_clean dataset.
huts_enrich = huts_clean |>
  st_join(...)
Note that the default join predicate used is st_intersects. For point data, any other predicate does not really make sense, but when joining other types of data (polygons or lines), you can use other predicates, e.g. st_covered_by.
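A minimal sketch of how you would pass a different predicate (poly_a and poly_b are hypothetical polygon layers):

joined = poly_a |>
  st_join(poly_b, join = st_covered_by)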
- Now let’s add some data about maximum temperature. For this you will find a .tif file here: https://github.com/loreabad6/app-dev-gis/raw/refs/heads/main/data/AUT_wc2.1_30s_tmax.tif. You can load this raster dataset using the rast() function from the {terra} package.
Note how when you load {terra} you get a conflict warning. There is another package, {tidyr} from the tidyverse, with a function extract(). We will be using the terra version, so when we call it, we will write terra::extract() to be specific.
The maximum temperature dataset is from WorldClim data. To download it with R, I used the package {geodata} and the function worldclim_country(). See the script here.
library(terra)
tmax = rast() # add the URL to the .tif file inside rast()
tmax
Notice how this dataset is printed. Raster datasets in R are represented as matrices and arrays (remember Practical 2?). The print method is a specific way to print objects from the terra package.
- Note that there are 12 layers in this dataset. These correspond to the 12 months in the year. We can change the names of the layers with:
# you don't need to change anything here!
names(tmax) = month.abb
- We are interested in the winter months (Dec, Jan, Feb, Mar). Let’s get the mean tmax for these months.
tmax_mean = mean(tmax[[c(...)]]) # add the winter months to subset.
To subset layers of a terra object we use [[. The mean() function applied to terra objects returns another raster: it applies the function per pixel. In this way you can do raster algebra!
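Here is a minimal sketch of that per-pixel behaviour on a toy two-layer raster (made-up values again):

r2 = rast(nrows = 2, ncols = 2, nlyrs = 2, vals = c(1:4, 11:14))
mean(r2[[c(1, 2)]]) # a new raster: the per-pixel mean of layers 1 and 2 (6, 7, 8, 9)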
- Now, let’s actually add the temperature information to the hut dataset. We can use the terra::extract() function to do this.
# We need to add the `terra::` in front of the function
# because of the conflict with the package tidyr
# The first argument is the raster object and
# the second one can be an sf object.
# You can include the huts_enrich object here.
tmax_mean_huts = terra::extract()
If you print this data you will notice that it is a data frame with exactly one row per point of the sf object. The order is the same as in your dataset. Therefore, you can add this information directly as a new column to the “huts_enrich” dataset.
Did you get a warning? Even if you did not transform your temperature raster to the correct CRS, the transformation was done internally by {terra}.
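As a toy illustration of that one-row-per-point correspondence (hypothetical objects, not the practical’s data):

pts = st_as_sf(data.frame(x = c(0.5, 1.5), y = c(0.5, 1.5)), coords = c("x", "y"))
r3 = rast(nrows = 2, ncols = 2, xmin = 0, xmax = 2, ymin = 0, ymax = 2, vals = 1:4)
terra::extract(r3, pts) # a data frame: one row per point, ID column + raster value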
# add the winter mean tmax here, note that it is the second column in the data
huts_enrich = huts_enrich |>
  mutate(tmax_winter = )
Up to this point, your enriched dataset should have mean “tmax” temperatures between -10.15 °C and 4.20 °C.
# Write code to verify that the tmax_winter column in
# the huts_enrich dataset ranges between -10.15 and 4.20
Part 4: Find the dream winter hut!
Remember Lucia? She is very excited to find the perfect hut for her. Now that you have a clean and enriched dataset, you can help her find it!
Here are her requirements:
- The hut should be at a good enough altitude so that she can have some snow for skiing, she thinks huts above 800 m should be snowy enough!
- Temperature is also an important factor for good snow conditions. The maximum temperature shouldn’t be higher than 0 °C on average over the winter months.
- Lucia is from Argentina and is currently getting a new passport, so she can’t leave Austria.
- She wants to stay in a small hut, nothing with too many other guests, but she doesn’t want to be completely alone either. Something between 10 and 50 overall capacity sounds good to her.
- She needs to be sure she can actually eat at the hut; are there huts with restaurants?
- It’s her first winter in Salzburg, so she surely needs to try some regional cuisine!
- She has an Austrian Alpine Club membership (ÖAV) and she would be happy to make use of its benefits. Is there a hut that is operated by the ÖAV? Hint: you can detect a string with str_detect(); see the small sketch after this list.
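If str_detect() is new to you, here is a minimal sketch on made-up operator strings:

ops = c("ÖAV Sektion Salzburg", "Naturfreunde", NA)
str_detect(ops, "ÖAV") # TRUE FALSE NA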
Use your huts_enrich object and filter for Lucia’s requirements. After filtering, you should have only one hut as a result.
result = huts_enrich |>
  filter(...)
Where is the hut located? Make an interactive map!
mapview(result)
Write the name and region of the hut here!
Upload the .qmd doc to Blackboard (don’t forget to add all your teammates’/your name(s)!). The first team to do so receives an extra point each in class participation 🏃