Some background
Over a month ago I was invited as a lecturer at the OpenGeoHub 2023 Summer School in Poznan, Poland. It was a great experience, meeting brilliant people, learning from other lecturers and enjoying the city. Many thanks to the organizers!
My lecture was on Tools and packages to query and process Sentinel-1 and Sentinel-2 data with R and Python. Following up on the lecture, and inspired by the geocompx blog post on geographic data analysis in R and Python, I started the quest to do a similar post with the basic content of my lecture.
Why do I call it a quest? Well, it took me a while. Before this post, my website was based on distill, so the migration from a distill-based website to a quarto-based one was the first task. Why did I have to do the migration? Because Quarto allows the combination of two or more programming languages in a single file, crucial for data science across languages.
I managed to migrate thanks to Jakub Nowosad’s talk at OpenGeoHub23 (Nowosad 2023), the Quarto documentation, and posts from Danielle Navarro (Navarro 2022). Big thanks!
But in any case, I can now inaugurate my new website with a post on how to query STAC data and create on-demand data cubes with R and Python! So let’s get to it.
Libraries
Let’s first load the libraries needed. Here you can also get familiar with the tabbed layout of this post: clicking on the R or Python tab switches between languages, and it switches the panels throughout the whole document at once.
import dask.array as da # handle dask arrays
import geopandas as gpd # handle geospatial data frames
from IPython.display import Image # visualize URLs
import pystac_client # connecting to the STAC API
from rasterio.enums import Resampling # perform re-sampling operations
import shapely # create vector objects
import stackstac # build an on-demand STAC data cube
library(dplyr) # data wrangling
library(gdalcubes) # on-demand data cubes
library(knitr) # visualize URLs
library(rstac) # connecting to the STAC API
library(sf) # handle geospatial data frames
Querying STAC data
STAC stands for SpatioTemporal Asset Catalog and it is “a common language to describe geospatial information, so it can more easily be worked with, indexed, and discovered”.
There are several APIs available to query data; you can browse them all in the STAC catalog index. Some of these APIs require authentication to access the data. For this post we will use the Earth Search catalog, which allows querying data hosted on Amazon Web Services (AWS). The data we will fetch (Sentinel-2) does not require authentication.
In Python we can use the pystac-client library (Duckworth 2021), while in R we can use the rstac library (Simoes et al. 2021). Refer to their documentation pages for more information.
To start fetching data, we will open a client. We can then see the collections available for this API. In Python we loop over the client's collections, while in R we use the get_request() function. Both methods make an HTTP GET request to the STAC web service.
api_url = 'https://earth-search.aws.element84.com/v1'
client = pystac_client.Client.open(api_url)
for collection in client.get_collections():
    print(collection)
<CollectionClient id=sentinel-2-c1-l2a-preview>
<CollectionClient id=cop-dem-glo-30>
<CollectionClient id=naip>
<CollectionClient id=sentinel-2-l2a>
<CollectionClient id=sentinel-2-l1c>
<CollectionClient id=cop-dem-glo-90>
<CollectionClient id=sentinel-1-grd>
<CollectionClient id=landsat-c2-l2>
api_url = "https://earth-search.aws.element84.com/v1"
client = stac(api_url)
collections(client) |> get_request()
###STACCollectionList
- collections (8 item(s)):
- sentinel-2-c1-l2a-preview
- cop-dem-glo-30
- naip
- sentinel-2-l2a
- sentinel-2-l1c
- cop-dem-glo-90
- sentinel-1-grd
- landsat-c2-l2
- field(s): collections, links, context
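Before settling on a collection, you can also fetch one individually and inspect its metadata. Here is a minimal sketch of what that could look like in Python, reusing the client opened above (this step is an extra, not part of the original workflow):
# peek at a single collection's metadata before querying it
s2 = client.get_collection("sentinel-2-l2a")
print(s2.title)
print(s2.description)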
Here we will choose Sentinel-2 Level 2a data, which provides atmospherically corrected data representing surface reflectance. We can also set the coordinates for our query, in this case a POINT geometry, which will help retrieve scenes that intersect with it. Lastly, we set a time extent that defines the start and end dates of our query.
# collection ID
collection = 'sentinel-2-l2a'
# coordinates
lon = 16.9
lat = 52.4
point = shapely.Point(lon, lat)
# date range
time_extent = '2022-05-01/2022-10-01'
# collection ID
collection = 'sentinel-2-l2a'
# coordinates
lon = 16.9
lat = 52.4
point = st_point(c(lon, lat))
# date range
time_extent = c("2022-05-01", "2022-10-01")
Now we can pass the arguments to our search. Note that for the R search we had to slightly reformat the datetime parameter; this is because of a bug when fetching the data from STAC.
You will see we can also query based on scene metadata. In this example we look for scenes with cloud cover lower than 10% for the whole scene. In Python we do the filtering within the search, while in R we perform it after the HTTP POST request.
Finally, we check how many items our query fetched, and verify that R and Python give us the same results.
search = client.search(
    collections=[collection],
    intersects=point,
    datetime=time_extent,
    query=["eo:cloud_cover<10"],
)
items = search.item_collection()
len(items)
8
items = client |>
  stac_search(
    collections = collection,
    intersects = point,
    datetime = paste0(time_extent, "T00:00:00Z", collapse = "/"),
    limit = 100
  ) |>
  post_request() |>
  items_filter(
    filter_fn = function(x) {x$properties$`eo:cloud_cover` < 10}
  )
items_length(items)
[1] 8
To explore the items as data frames we can either convert them to a geopandas table in Python or to an sf table in R. In R we added an extra fid column to allow index matching to the original item list.
items_df = gpd.GeoDataFrame.from_features(items.to_dict(), crs="epsg:4326")
items_df
geometry ... updated
0 POLYGON ((16.49848 53.24020, 17.39166 53.22562... ... 2022-11-06T13:14:57.270Z
1 POLYGON ((16.49848 53.24020, 17.39346 53.22558... ... 2022-11-06T12:53:48.090Z
2 POLYGON ((16.49848 53.24020, 18.14175 53.20819... ... 2022-11-06T13:12:58.202Z
3 POLYGON ((16.49848 53.24020, 18.14175 53.20819... ... 2022-11-06T13:14:37.621Z
4 POLYGON ((16.49848 53.24020, 18.14175 53.20819... ... 2022-11-06T12:54:05.312Z
5 POLYGON ((16.49848 53.24020, 17.39346 53.22558... ... 2022-11-06T12:55:07.931Z
6 POLYGON ((16.49848 53.24020, 18.14175 53.20819... ... 2022-11-06T12:52:50.337Z
7 POLYGON ((16.49848 53.24020, 18.14175 53.20819... ... 2022-11-06T13:16:38.195Z
[8 rows x 42 columns]
# we add an extra column to be able to subset the original item list
items_df = items_as_sf(items) |>
  mutate(fid = row_number())
# wrap st_as_sf and as_tibble around the result just for nicer printing
st_as_sf(as_tibble(items_df))
Simple feature collection with 8 features and 42 fields
Geometry type: POLYGON
Dimension: XY
Bounding box: xmin: 16.465 ymin: 52.22258 xmax: 18.14175 ymax: 53.2402
Geodetic CRS: WGS 84
# A tibble: 8 × 43
`s2:generation_time` s2:medium_proba_clouds_percenta…¹ `s2:product_uri`
<chr> <dbl> <chr>
1 2022-08-26T16:20:59.000000Z 0.00172 S2A_MSIL2A_2022…
2 2022-08-16T16:25:57.000000Z 2.27 S2A_MSIL2A_2022…
3 2022-08-03T13:51:55.000000Z 5.69 S2A_MSIL2A_2022…
4 2022-07-19T11:39:43.000000Z 0.00007 S2B_MSIL2A_2022…
5 2022-07-14T17:50:57.000000Z 1.29 S2A_MSIL2A_2022…
6 2022-06-27T16:28:10.000000Z 0.732 S2A_MSIL2A_2022…
7 2022-06-24T14:39:14.000000Z 0.00130 S2A_MSIL2A_2022…
8 2022-06-19T11:43:29.000000Z 0.0514 S2B_MSIL2A_2022…
# ℹ abbreviated name: ¹`s2:medium_proba_clouds_percentage`
# ℹ 40 more variables: `s2:processing_baseline` <chr>, `grid:code` <chr>,
# `s2:product_type` <chr>, `s2:snow_ice_percentage` <dbl>,
# `s2:vegetation_percentage` <dbl>, `processing:software` <chr>,
# `s2:high_proba_clouds_percentage` <dbl>,
# `s2:unclassified_percentage` <dbl>, `s2:water_percentage` <dbl>,
# `s2:not_vegetated_percentage` <dbl>, `s2:cloud_shadow_percentage` <dbl>, …
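The item data frames carry over 40 metadata columns we can filter on. If you want a quick overview of what is available, a small sketch in Python (using the items_df object created above) could be:
# list the scene metadata columns available for filtering
print(items_df.columns.tolist())
# includes e.g. 'eo:cloud_cover', 's2:nodata_pixel_percentage', 'datetime', ...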
Once we have the items as a data frame, we can further filter the data based on the table columns. Below, we filter for scenes with a cloud cover lower than 2% and a no-data pixel percentage below 10%.
To illustrate how to fetch item properties, we can select the first item in our list and get the datetime of this scene.
ids = items_df.loc[
    (items_df['eo:cloud_cover'] <= 2) &
    (items_df['s2:nodata_pixel_percentage'] <= 10)
]
item = items[ids.index[0]]
item.datetime
datetime.datetime(2022, 7, 19, 10, 6, 8, 645000, tzinfo=tzutc())
ids = items_df |>
  filter(
    `eo:cloud_cover` <= 2, `s2:nodata_pixel_percentage` <= 10
  ) |>
  pull(fid)
item = items$features[[ids[1]]]
item$properties$datetime
[1] "2022-07-19T10:06:08.645000Z"
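The datetime is not the only property we can read this way; any STAC property of the item is accessible. A short sketch in Python with the item selected above:
# a few other properties of the selected item
item.id                                        # scene identifier
item.properties["eo:cloud_cover"]              # scene-wide cloud cover (%)
item.properties["s2:nodata_pixel_percentage"]  # no-data pixel percentage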
We can also take a look at the selected scene by fetching the item thumbnail. Here you will see that both R and Python return the same scene, as they should! 😉.
thumbnail = item.assets["thumbnail"].href
Image(url = thumbnail)
item |>
  assets_url(asset_names = "thumbnail") |>
  include_graphics()
Creating a STAC data cube
Fetching individual scenes is interesting, but the real power of working with the STAC API is the possibility to work with big EO data without the need to download a single pixel to your own computer!
For this, we can create an on-demand data cube that can help us organize and analyze our data, doing the data wrangling needed and only fetching the data we need as a final step.
To create on-demand data cubes, we will use R’s gdalcubes (Appel, Pebesma, and Mohr 2021) and Python’s stackstac (Joseph 2023).
We will create our data cube on a smaller subset. Here we load a GeoJSON file of Poznan, where the summer school took place. We read it in and extract its bounding box. Note that in R we transform the footprint to the cube's target CRS (EPSG:32633) before taking the bounding box, since the gdalcubes extent is defined in that CRS, while in Python we pass longitude/latitude bounds directly to bounds_latlon.
poznan = "https://raw.githubusercontent.com/loreabad6/ogh23/main/data/poznan.geojson"
footprint = gpd.read_file(poznan)
bbox = footprint.total_bounds
poznan = "https://raw.githubusercontent.com/loreabad6/ogh23/main/data/poznan.geojson"
footprint = read_sf(poznan)
bbox = footprint |> st_transform(crs = 32633) |> st_bbox()
Now we can create our cube. The following code chunk has several steps depending on whether you are in the R or Python tab. I kept it as a single block since this is how we get an on-demand data cube on each end. In essence what we do is:
- Define the assets we want. I selected RGB+NIR bands to illustrate this process, but of course we can set any other band. I included the SCL band, which we will use to mask clouds.
- Set up the cube.
  - In Python we do this with the stackstac.stack function, where we can set the resolution, assets, bounding box, and re-sampling method for the data. Separately we define a cloud mask to remove invalid pixels.
  - In R, this works in a lengthier way, using more intermediate functions. We first define the collection with its assets using the stac_image_collection function. Then we define an extent in the form of a list; this extent holds the start and end date as well as the bounding box of our cube. Next we define a view with cube_view, which sets the CRS, spatial and temporal resolution, and the aggregation and re-sampling methods. We finally define the cloud mask with image_mask and can then build a raster_cube.
assets = ["red","green","blue","nir","scl"]
cube_all = stackstac.stack(
    items,
    assets=assets,
    resolution=200,
    bounds_latlon=bbox,
    resampling=Resampling.bilinear
)
# SCL classes 3, 8 and 9 are cloud shadows and clouds (medium/high probability)
scl = cube_all.sel(band=["scl"])
s2_mask = da.isin(scl, [3,8,9])
cube = cube_all.where(~s2_mask)
cube
<xarray.DataArray 'stackstac-488683ba46ca080d7bae3993b8b89142' (time: 8, band: 5, y: 140, x: 143)>
dask.array<where, shape=(8, 5, 140, 143), dtype=float64, chunksize=(1, 1, 140, 143), chunktype=numpy.ndarray>
Coordinates: (12/52)
  * time                  (time) datetime64[ns] 2022-06-19...
    id                    (time) <U24 'S2B_33UXU_20220619_...
  * band                  (band) <U5 'red' 'green' ... 'scl'
  * x                     (x) float64 6.19e+05 ... 6.474e+05
  * y                     (y) float64 5.823e+06 ... 5.795e+06
    s2:generation_time    (time) <U27 '2022-06-19T11:43:29...
    ...                    ...
    raster:bands          (band) object {'nodata': 0, 'dat...
    title                 (band) <U30 'Red (band 4) - 10m'...
    common_name           (band) object 'red' ... None
    center_wavelength     (band) object 0.665 0.56 ... None
    full_width_half_max   (band) object 0.038 0.045 ... None
    epsg                  int32 32633
Attributes:
    spec:        RasterSpec(epsg=32633, bounds=(619000, 5795000, 647600, 5823...
    crs:         epsg:32633
    transform:   | 200.00, 0.00, 619000.00|\n| 0.00,-200.00, 5823000.00|\n| 0...
    resolution:  200
assets = c("red","green","blue","nir","scl")
col = stac_image_collection(items$features, asset_names = assets)
extent = list(
  t0 = time_extent[1], t1 = time_extent[2],
  left = bbox["xmin"], bottom = bbox["ymin"],
  right = bbox["xmax"], top = bbox["ymax"]
)
view = cube_view(
  srs = "epsg:32633",
  extent = extent,
  dx = 200, dy = 200, dt = "P1D",
  aggregation = "median",
  resampling = "bilinear"
)
# SCL classes 3, 8 and 9 are cloud shadows and clouds (medium/high probability)
s2_mask = image_mask("scl", values = c(3,8,9))
cube = raster_cube(col, view, mask = s2_mask)
cube
A data cube proxy object
Dimensions:
low high count pixel_size chunk_size
t 2022-05-01 2022-10-01 154 P1D 1
y 5795085.08512264 5822885.08512264 139 200 128
x 619173.072575812 647573.072575812 142 200 128
Bands:
name offset scale nodata unit
1 blue 0 1 NaN
2 green 0 1 NaN
3 nir 0 1 NaN
4 red 0 1 NaN
5 scl 0 1 NaN
And now we have a data cube! Note that no pixel values have been fetched yet: the stackstac cube is a lazy dask array and the gdalcubes cube is a proxy object, so data is only read once we compute or plot. Let's do a simple reduction over time of our data to view an RGB composite.
rgb = cube.sel(band=["red", "green", "blue"]).median("time", keep_attrs=True)
rgb = rgb.compute()

rgb.plot.imshow(
    rgb="band",
    robust=True
)
gdalcubes_options(parallel = 4, cache = TRUE)
rgb = cube |>
  select_bands(c("blue","green","red")) |>
  reduce_time(c("median(blue)", "median(green)", "median(red)"))
rgb |>
  plot(rgb = 3:1)
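As a small extra (not part of the original workflow), the NIR band already sits in the cube, so any band math stays just as lazy. A hedged sketch of a median NDVI composite in Python:
# sketch: median NDVI composite from the same masked cube
nir = cube.sel(band="nir")
red = cube.sel(band="red")
ndvi = (nir - red) / (nir + red)   # per-scene NDVI, still a lazy dask array
ndvi.median("time").plot.imshow(robust=True, cmap="RdYlGn")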
And there we have it, a side-by-side how-to on creating on-demand STAC data cubes. If you want more functionalities and examples of what can be done with these packages, please refer to the materials of my lecture: Tools and packages to query and process Sentinel-1 and Sentinel-2 data with R and Python.
Happy STACing!
References
Citation
@online{abad2023,
author = {Abad, Lorena},
title = {STAC Across {Languages}},
date = {2023-10-17},
langid = {en}
}