Module 1: Tabular Data

Working with larger-than-RAM data using duckdbfs

Author

ESPM 288

Published

February 11, 2026

1 Introduction

In this module, we will explore high-performance workflows for tabular data. We will use duckdbfs to work with datasets that are larger than available RAM by leveraging DuckDB’s streaming and remote file access capabilities.

2 Case Study: Global Supply Chains

We will be working with EXIOBASE 3.8.1, a global Multi-Regional Input-Output (MRIO) database. This dataset tracks economic transactions between sectors and regions, along with their environmental impacts (emissions, resource use, etc.).

Data description: - Coverage: 44 countries + 5 rest-of-world regions. - Timeframe: 1995–2022. - Content: Economic transactions (Z matrix), final demand (Y matrix), and environmental stressors (F matrix). - Format: Cloud-optimized Parquet, partitioned by year and matrix type.

3 Setup


Attaching package: 'dplyr'

The following objects are masked from 'package:stats':

    filter, lag

The following objects are masked from 'package:base':

    intersect, setdiff, setequal, union

4 Exercise 1: connecting to remote data

We can open the entire dataset without downloading it using open_dataset(). The data is hosted on Source Cooperative. The ** pattern allows recursive scanning of the partitioned parquet files.

[1] 1

Rows: ??
Columns: 8
Database: DuckDB 1.4.3 [bmcda@Windows 10 x64:R 4.5.2/:memory:]
$ stressor <chr> "Value Added", "Value Added", "Value Added", "Value Added", "…
$ region   <chr> "AT", "AT", "AT", "AT", "AT", "AT", "AT", "AT", "AT", "AT", "…
$ sector   <chr> "Cultivation of wheat", "Cultivation of cereal grains nec", "…
$ value    <dbl> 183.1118891, 402.2305799, 830.2127384, 101.9705426, 31.763189…
$ unit     <chr> "M.EUR", "M.EUR", "M.EUR", "M.EUR", "M.EUR", "M.EUR", "M.EUR"…
$ year     <dbl> 1995, 1995, 1995, 1995, 1995, 1995, 1995, 1995, 1995, 1995, 1…
$ format   <chr> "ixi", "ixi", "ixi", "ixi", "ixi", "ixi", "ixi", "ixi", "ixi"…
$ matrix   <chr> "F_impacts", "F_impacts", "F_impacts", "F_impacts", "F_impact…

5 Exercise 2: Efficient Filtering

The dataset is large. We should filter before collecting any data into R.

Code

exio |>
    filter(year == 2022, region == "US") |>
    head() |> # view the first 6 rows
    collect()

Task: Construct a query to find the top 5 sectors in the US by CO2 emissions in 2022. Remember to check the column names in exio to find the appropriate emissions flow.

Code

# Solution: Top 5 sectors in US by CO2 emissions (2022)

# Target the F_satellite matrix directly to access stressor column
exio_f_sat <- open_dataset(
    "s3://us-west-2.opendata.source.coop/youssef-harby/exiobase-3/4588235/parquet/year=2022/format=ixi/matrix=F_satellite/"
)

exio_f_sat |>
    filter(
        grepl("CO2", stressor)
    ) |>
    group_by(sector) |>
    summarize(total_co2 = sum(value, na.rm = TRUE), .groups = "drop") |>
    arrange(desc(total_co2)) |>
    head(5) |>
    collect()

6 Exercise 3: CO2 Emissions Over Time by Region (Top 5 Emitters)

First, let’s identify the top 5 CO2 emitting regions in 2022:

Code

# Find the top 5 emitting regions in 2022
top_emitters_2022 <- exio |>
    filter(year == 2022,
           matrix == "F_satellite",
           stressor %like% "%CO2%") |>
    group_by(region) |>
    summarise(total_co2 = sum(value, na.rm = TRUE)) |>
    arrange(desc(total_co2)) |>
    head(5) |>
    collect()

top_emitters_2022

Now, let’s get the time series data for these top 5 regions across all years:

`summarise()` has grouped output by "year". You can override using the
`.groups` argument.

Create a line plot showing CO2 emissions over time for the top 5 regions:

Code

# Create the time series plot
co2_plot <- ggplot(co2_timeseries, aes(x = year, y = total_co2, color = region_name)) +
    geom_line(linewidth = 1.2) +
    geom_point(size = 2.5) +
    labs(
        title = "CO2 Emissions Over Time: Top 5 Emitting Regions",
        subtitle = "Based on EXIOBASE 3.8.1 data (1995-2022)",
        x = "Year",
        y = "Total CO2 Emissions",
        color = "Region"
    ) +
    theme_minimal() +
    theme(
        plot.title = element_text(face = "bold", size = 14),
        legend.position = "right",
        legend.title = element_text(face = "bold", size = 11),
        legend.text = element_text(size = 10),
        legend.background = element_rect(fill = "white", color = "gray80"),
        legend.key.size = unit(1, "cm")
    ) +
    scale_x_continuous(breaks = seq(1995, 2022, by = 5))

co2_plot

7 Exercise 4: Regions with Largest CO2 Emission Reductions

Let’s identify which regions have reduced their CO2 emissions the most between 1995 and 2022:

Code

# Get CO2 emissions for all regions in 1995 and 2022
co2_1995 <- exio |>
    filter(year == 1995,
           matrix == "F_satellite",
           stressor %like% "%CO2%") |>
    group_by(region) |>
    summarise(co2_1995 = sum(value, na.rm = TRUE)) |>
    collect()

co2_2022 <- exio |>
    filter(year == 2022,
           matrix == "F_satellite",
           stressor %like% "%CO2%") |>
    group_by(region) |>
    summarise(co2_2022 = sum(value, na.rm = TRUE)) |>
    collect()

# Join the two datasets and calculate change
co2_change <- co2_1995 |>
    inner_join(co2_2022, by = "region") |>
    mutate(
        absolute_change = co2_2022 - co2_1995,
        percent_change = ((co2_2022 - co2_1995) / co2_1995) * 100
    ) |>
    filter(grepl("^W", region)) |>  # Include only rest-of-world regions (WA, WE, WF, WL, WM)
    arrange(absolute_change)

# View regions with the largest absolute decreases
head(co2_change, 5)

Now let’s visualize the top 5 regions with the largest absolute reductions:

Code

# Get top 5 reducers
top_reducers <- co2_change |>
    head(5)

# Create a bar plot
reduction_plot <- ggplot(top_reducers, aes(x = reorder(region, absolute_change), y = absolute_change, fill = region)) +
    geom_col() +
    coord_flip() +
    scale_fill_brewer(palette = "Set2") +
    labs(
        title = "Top 5 Regions with Largest CO2 Emission Reductions",
        subtitle = "Change from 1995 to 2022",
        x = "Region",
        y = "Absolute Change in CO2 Emissions"
    ) +
    theme_minimal() +
    theme(
        plot.title = element_text(face = "bold", size = 14),
        legend.position = "none"
    )

reduction_plot

Let’s also look at percentage reductions (excluding countries with very low baseline emissions):

Code

# Filter to countries with substantial 1995 emissions (>1000 units) to avoid outliers
substantial_emitters <- co2_change |>
    filter(co2_1995 > 1000) |>
    arrange(percent_change) |>
    head(5)

substantial_emitters

Code

# Visualize percentage reductions
percent_plot <- ggplot(substantial_emitters, aes(x = reorder(region, percent_change), y = percent_change, fill = region)) +
    geom_col() +
    coord_flip() +
    scale_fill_brewer(palette = "Dark2") +
    labs(
        title = "Top 5 Regions with Largest % CO2 Emission Reductions",
        subtitle = "Change from 1995 to 2022 (excluding very small emitters)",
        x = "Region",
        y = "Percent Change in CO2 Emissions (%)"
    ) +
    theme_minimal() +
    theme(
        plot.title = element_text(face = "bold", size = 13),
        legend.position = "none"
    )

percent_plot

8 Exercise 5: CO2 Emissions Over Time for Top 5 Emitting Countries

Now let’s look at individual countries (excluding rest-of-world aggregates) to see the top 5 emitting countries:

Code

# Find the top 5 emitting countries in 2022 (excluding W regions)
top_countries_2022 <- exio |>
    filter(year == 2022,
           matrix == "F_satellite",
           stressor %like% "%CO2%") |>
    group_by(region) |>
    summarise(total_co2 = sum(value, na.rm = TRUE)) |>
    filter(!grepl("^W", region)) |>  # Exclude rest-of-world regions
    arrange(desc(total_co2)) |>
    head(5) |>
    collect()

top_countries_2022

Get time series data for these top 5 countries across all years:

`summarise()` has grouped output by "year". You can override using the
`.groups` argument.

Create a line plot showing CO2 emissions over time for the top 5 countries:

Code

# Create the time series plot
countries_plot <- ggplot(co2_countries_timeseries, aes(x = year, y = total_co2, color = country_name)) +
    geom_line(linewidth = 1.2) +
    geom_point(size = 2.5) +
    labs(
        title = "CO2 Emissions Over Time: Top 5 Emitting Countries",
        subtitle = "Based on EXIOBASE 3.8.1 data (1995-2022)",
        x = "Year",
        y = "Total CO2 Emissions",
        color = "Country"
    ) +
    theme_minimal() +
    theme(
        plot.title = element_text(face = "bold", size = 14),
        legend.position = "right",
        legend.title = element_text(face = "bold", size = 11),
        legend.text = element_text(size = 10),
        legend.background = element_rect(fill = "white", color = "gray80"),
        legend.key.size = unit(1, "cm")
    ) +
    scale_x_continuous(breaks = seq(1995, 2022, by = 5))

countries_plot

--- title: "Module 1: Tabular Data" subtitle: "Working with larger-than-RAM data using duckdbfs" author: "ESPM 288" date: today format: html: theme: cosmo embed-resources: true toc: true toc-location: left toc-depth: 3 code-fold: true code-tools: true code-block-bg: true code-block-border-left: "#31BAE9" smooth-scroll: true number-sections: true fig-width: 10 fig-height: 6 fig-align: center df-print: paged --- ## Introduction In this module, we will explore high-performance workflows for tabular data. We will use `duckdbfs` to work with datasets that are larger than available RAM by leveraging DuckDB's streaming and remote file access capabilities. ## Case Study: Global Supply Chains We will be working with [EXIOBASE 3.8.1](https://source.coop/youssef-harby/exiobase-3), a global Multi-Regional Input-Output (MRIO) database. This dataset tracks economic transactions between sectors and regions, along with their environmental impacts (emissions, resource use, etc.). **Data description:** - **Coverage**: 44 countries + 5 rest-of-world regions. - **Timeframe**: 1995–2022. - **Content**: Economic transactions (Z matrix), final demand (Y matrix), and environmental stressors (F matrix). - **Format**: Cloud-optimized Parquet, partitioned by year and matrix type. ## Setup ```{r} #| echo: false library(duckdbfs) library(dplyr) library(ggplot2) ``` ## Exercise 1: connecting to remote data We can open the entire dataset without downloading it using `open_dataset()`. The data is hosted on Source Cooperative. The `**` pattern allows recursive scanning of the partitioned parquet files. ```{r} #| echo: false # Remote S3 path to EXIOBASE 3 (Source Cooperative) duckdbfs::duckdb_secrets( key = "", secret = "", endpoint = "s3.amazonaws.com", region = "us-west-2" ) # Open the entire dataset with ** pattern for recursive scanning s3_url <- "s3://us-west-2.opendata.source.coop/youssef-harby/exiobase-3/4588235/parquet/**" # Open the dataset lazily exio <- open_dataset(s3_url) # View the schema (column names and types) without reading data glimpse(exio) ``` ## Exercise 2: Efficient Filtering The dataset is large. We should filter *before* collecting any data into R. ```{r} exio |> filter(year == 2022, region == "US") |> head() |> # view the first 6 rows collect() ``` > **Task**: Construct a query to find the top 5 sectors in the US by CO2 emissions in 2022. Remember to check the column names in `exio` to find the appropriate emissions flow. ```{r} # Solution: Top 5 sectors in US by CO2 emissions (2022) # Target the F_satellite matrix directly to access stressor column exio_f_sat <- open_dataset( "s3://us-west-2.opendata.source.coop/youssef-harby/exiobase-3/4588235/parquet/year=2022/format=ixi/matrix=F_satellite/" ) exio_f_sat |> filter( grepl("CO2", stressor) ) |> group_by(sector) |> summarize(total_co2 = sum(value, na.rm = TRUE), .groups = "drop") |> arrange(desc(total_co2)) |> head(5) |> collect() ``` ## Exercise 3: CO2 Emissions Over Time by Region (Top 5 Emitters) First, let's identify the top 5 CO2 emitting regions in 2022: ```{r} # Find the top 5 emitting regions in 2022 top_emitters_2022 <- exio |> filter(year == 2022, matrix == "F_satellite", stressor %like% "%CO2%") |> group_by(region) |> summarise(total_co2 = sum(value, na.rm = TRUE)) |> arrange(desc(total_co2)) |> head(5) |> collect() top_emitters_2022 ``` Now, let's get the time series data for these top 5 regions across all years: ```{r} #| echo: false # Get the list of top 5 regions top_5_regions <- top_emitters_2022$region # Query CO2 emissions for these regions across all years co2_timeseries <- exio |> filter(matrix == "F_satellite", stressor %like% "%CO2%", region %in% top_5_regions) |> group_by(year, region) |> summarise(total_co2 = sum(value, na.rm = TRUE)) |> arrange(year, region) |> collect() # View the data head(co2_timeseries, 20) ``` Create a line plot showing CO2 emissions over time for the top 5 regions: ```{r} #| echo: false # Create a mapping of region codes to full names region_names <- c( "CN" = "China", "US" = "United States", "IN" = "India", "RU" = "Russia", "JP" = "Japan", "DE" = "Germany", "GB" = "United Kingdom", "KR" = "South Korea", "CA" = "Canada", "BR" = "Brazil", "WE" = "Rest of Europe", "WA" = "Rest of Asia", "WF" = "Rest of Africa", "WL" = "Rest of Latin America", "WM" = "Rest of Middle East" ) # Add full region names to the data co2_timeseries <- co2_timeseries |> mutate(region_name = ifelse(region %in% names(region_names), region_names[region], region)) ``` ```{r} # Create the time series plot co2_plot <- ggplot(co2_timeseries, aes(x = year, y = total_co2, color = region_name)) + geom_line(linewidth = 1.2) + geom_point(size = 2.5) + labs( title = "CO2 Emissions Over Time: Top 5 Emitting Regions", subtitle = "Based on EXIOBASE 3.8.1 data (1995-2022)", x = "Year", y = "Total CO2 Emissions", color = "Region" ) + theme_minimal() + theme( plot.title = element_text(face = "bold", size = 14), legend.position = "right", legend.title = element_text(face = "bold", size = 11), legend.text = element_text(size = 10), legend.background = element_rect(fill = "white", color = "gray80"), legend.key.size = unit(1, "cm") ) + scale_x_continuous(breaks = seq(1995, 2022, by = 5)) co2_plot ``` ```{r} #| echo: false # Save the plot ggsave("co2_regions_timeseries.png", co2_plot, width = 10, height = 6, dpi = 300) ``` ## Exercise 4: Regions with Largest CO2 Emission Reductions Let's identify which regions have reduced their CO2 emissions the most between 1995 and 2022: ```{r} # Get CO2 emissions for all regions in 1995 and 2022 co2_1995 <- exio |> filter(year == 1995, matrix == "F_satellite", stressor %like% "%CO2%") |> group_by(region) |> summarise(co2_1995 = sum(value, na.rm = TRUE)) |> collect() co2_2022 <- exio |> filter(year == 2022, matrix == "F_satellite", stressor %like% "%CO2%") |> group_by(region) |> summarise(co2_2022 = sum(value, na.rm = TRUE)) |> collect() # Join the two datasets and calculate change co2_change <- co2_1995 |> inner_join(co2_2022, by = "region") |> mutate( absolute_change = co2_2022 - co2_1995, percent_change = ((co2_2022 - co2_1995) / co2_1995) * 100 ) |> filter(grepl("^W", region)) |> # Include only rest-of-world regions (WA, WE, WF, WL, WM) arrange(absolute_change) # View regions with the largest absolute decreases head(co2_change, 5) ``` Now let's visualize the top 5 regions with the largest absolute reductions: ```{r} # Get top 5 reducers top_reducers <- co2_change |> head(5) # Create a bar plot reduction_plot <- ggplot(top_reducers, aes(x = reorder(region, absolute_change), y = absolute_change, fill = region)) + geom_col() + coord_flip() + scale_fill_brewer(palette = "Set2") + labs( title = "Top 5 Regions with Largest CO2 Emission Reductions", subtitle = "Change from 1995 to 2022", x = "Region", y = "Absolute Change in CO2 Emissions" ) + theme_minimal() + theme( plot.title = element_text(face = "bold", size = 14), legend.position = "none" ) reduction_plot ``` ```{r} #| echo: false # Save the plot ggsave("co2_absolute_reduction.png", reduction_plot, width = 10, height = 6, dpi = 300) ``` Let's also look at percentage reductions (excluding countries with very low baseline emissions): ```{r} # Filter to countries with substantial 1995 emissions (>1000 units) to avoid outliers substantial_emitters <- co2_change |> filter(co2_1995 > 1000) |> arrange(percent_change) |> head(5) substantial_emitters # Visualize percentage reductions percent_plot <- ggplot(substantial_emitters, aes(x = reorder(region, percent_change), y = percent_change, fill = region)) + geom_col() + coord_flip() + scale_fill_brewer(palette = "Dark2") + labs( title = "Top 5 Regions with Largest % CO2 Emission Reductions", subtitle = "Change from 1995 to 2022 (excluding very small emitters)", x = "Region", y = "Percent Change in CO2 Emissions (%)" ) + theme_minimal() + theme( plot.title = element_text(face = "bold", size = 13), legend.position = "none" ) percent_plot ``` ```{r} #| echo: false # Save the plot ggsave("co2_percent_reduction.png", percent_plot, width = 10, height = 6, dpi = 300) ``` ## Exercise 5: CO2 Emissions Over Time for Top 5 Emitting Countries Now let's look at individual countries (excluding rest-of-world aggregates) to see the top 5 emitting countries: ```{r} # Find the top 5 emitting countries in 2022 (excluding W regions) top_countries_2022 <- exio |> filter(year == 2022, matrix == "F_satellite", stressor %like% "%CO2%") |> group_by(region) |> summarise(total_co2 = sum(value, na.rm = TRUE)) |> filter(!grepl("^W", region)) |> # Exclude rest-of-world regions arrange(desc(total_co2)) |> head(5) |> collect() top_countries_2022 ``` Get time series data for these top 5 countries across all years: ```{r} #| echo: false # Get the list of top 5 countries top_5_countries <- top_countries_2022$region # Query CO2 emissions for these countries across all years co2_countries_timeseries <- exio |> filter(matrix == "F_satellite", stressor %like% "%CO2%", region %in% top_5_countries) |> group_by(year, region) |> summarise(total_co2 = sum(value, na.rm = TRUE)) |> arrange(year, region) |> collect() # View the data head(co2_countries_timeseries, 20) ``` Create a line plot showing CO2 emissions over time for the top 5 countries: ```{r} #| echo: false # Add full country names to the data co2_countries_timeseries <- co2_countries_timeseries |> mutate(country_name = ifelse(region %in% names(region_names), region_names[region], region)) ``` ```{r} # Create the time series plot countries_plot <- ggplot(co2_countries_timeseries, aes(x = year, y = total_co2, color = country_name)) + geom_line(linewidth = 1.2) + geom_point(size = 2.5) + labs( title = "CO2 Emissions Over Time: Top 5 Emitting Countries", subtitle = "Based on EXIOBASE 3.8.1 data (1995-2022)", x = "Year", y = "Total CO2 Emissions", color = "Country" ) + theme_minimal() + theme( plot.title = element_text(face = "bold", size = 14), legend.position = "right", legend.title = element_text(face = "bold", size = 11), legend.text = element_text(size = 10), legend.background = element_rect(fill = "white", color = "gray80"), legend.key.size = unit(1, "cm") ) + scale_x_continuous(breaks = seq(1995, 2022, by = 5)) countries_plot ``` ```{r} #| echo: false # Save the plot ggsave("co2_countries_timeseries.png", countries_plot, width = 10, height = 6, dpi = 300) ```