Module 1: Tabular Data

Working with larger-than-RAM data using duckdbfs

Author

ESPM 288

Published

February 11, 2026

1 Introduction

In this module, we will explore high-performance workflows for tabular data. We will use duckdbfs to work with datasets that are larger than available RAM by leveraging DuckDB’s streaming and remote file access capabilities.

2 Case Study: Global Supply Chains

We will be working with EXIOBASE 3.8.1, a global Multi-Regional Input-Output (MRIO) database. This dataset tracks economic transactions between sectors and regions, along with their environmental impacts (emissions, resource use, etc.).

Data description: - Coverage: 44 countries + 5 rest-of-world regions. - Timeframe: 1995–2022. - Content: Economic transactions (Z matrix), final demand (Y matrix), and environmental stressors (F matrix). - Format: Cloud-optimized Parquet, partitioned by year and matrix type.

3 Setup


Attaching package: 'dplyr'
The following objects are masked from 'package:stats':

    filter, lag
The following objects are masked from 'package:base':

    intersect, setdiff, setequal, union

4 Exercise 1: connecting to remote data

We can open the entire dataset without downloading it using open_dataset(). The data is hosted on Source Cooperative. The ** pattern allows recursive scanning of the partitioned parquet files.

[1] 1
Rows: ??
Columns: 8
Database: DuckDB 1.4.3 [bmcda@Windows 10 x64:R 4.5.2/:memory:]
$ stressor <chr> "Value Added", "Value Added", "Value Added", "Value Added", "…
$ region   <chr> "AT", "AT", "AT", "AT", "AT", "AT", "AT", "AT", "AT", "AT", "…
$ sector   <chr> "Cultivation of wheat", "Cultivation of cereal grains nec", "…
$ value    <dbl> 183.1118891, 402.2305799, 830.2127384, 101.9705426, 31.763189…
$ unit     <chr> "M.EUR", "M.EUR", "M.EUR", "M.EUR", "M.EUR", "M.EUR", "M.EUR"…
$ year     <dbl> 1995, 1995, 1995, 1995, 1995, 1995, 1995, 1995, 1995, 1995, 1…
$ format   <chr> "ixi", "ixi", "ixi", "ixi", "ixi", "ixi", "ixi", "ixi", "ixi"…
$ matrix   <chr> "F_impacts", "F_impacts", "F_impacts", "F_impacts", "F_impact…

5 Exercise 2: Efficient Filtering

The dataset is large. We should filter before collecting any data into R.

Code
exio |>
    filter(year == 2022, region == "US") |>
    head() |> # view the first 6 rows
    collect()

Task: Construct a query to find the top 5 sectors in the US by CO2 emissions in 2022. Remember to check the column names in exio to find the appropriate emissions flow.

Code
# Solution: Top 5 sectors in US by CO2 emissions (2022)

# Target the F_satellite matrix directly to access stressor column
exio_f_sat <- open_dataset(
    "s3://us-west-2.opendata.source.coop/youssef-harby/exiobase-3/4588235/parquet/year=2022/format=ixi/matrix=F_satellite/"
)

exio_f_sat |>
    filter(
        grepl("CO2", stressor)
    ) |>
    group_by(sector) |>
    summarize(total_co2 = sum(value, na.rm = TRUE), .groups = "drop") |>
    arrange(desc(total_co2)) |>
    head(5) |>
    collect()

6 Exercise 3: CO2 Emissions Over Time by Region (Top 5 Emitters)

First, let’s identify the top 5 CO2 emitting regions in 2022:

Code
# Find the top 5 emitting regions in 2022
top_emitters_2022 <- exio |>
    filter(year == 2022,
           matrix == "F_satellite",
           stressor %like% "%CO2%") |>
    group_by(region) |>
    summarise(total_co2 = sum(value, na.rm = TRUE)) |>
    arrange(desc(total_co2)) |>
    head(5) |>
    collect()

top_emitters_2022

Now, let’s get the time series data for these top 5 regions across all years:

`summarise()` has grouped output by "year". You can override using the
`.groups` argument.

Create a line plot showing CO2 emissions over time for the top 5 regions:

Code
# Create the time series plot
co2_plot <- ggplot(co2_timeseries, aes(x = year, y = total_co2, color = region_name)) +
    geom_line(linewidth = 1.2) +
    geom_point(size = 2.5) +
    labs(
        title = "CO2 Emissions Over Time: Top 5 Emitting Regions",
        subtitle = "Based on EXIOBASE 3.8.1 data (1995-2022)",
        x = "Year",
        y = "Total CO2 Emissions",
        color = "Region"
    ) +
    theme_minimal() +
    theme(
        plot.title = element_text(face = "bold", size = 14),
        legend.position = "right",
        legend.title = element_text(face = "bold", size = 11),
        legend.text = element_text(size = 10),
        legend.background = element_rect(fill = "white", color = "gray80"),
        legend.key.size = unit(1, "cm")
    ) +
    scale_x_continuous(breaks = seq(1995, 2022, by = 5))

co2_plot

7 Exercise 4: Regions with Largest CO2 Emission Reductions

Let’s identify which regions have reduced their CO2 emissions the most between 1995 and 2022:

Code
# Get CO2 emissions for all regions in 1995 and 2022
co2_1995 <- exio |>
    filter(year == 1995,
           matrix == "F_satellite",
           stressor %like% "%CO2%") |>
    group_by(region) |>
    summarise(co2_1995 = sum(value, na.rm = TRUE)) |>
    collect()

co2_2022 <- exio |>
    filter(year == 2022,
           matrix == "F_satellite",
           stressor %like% "%CO2%") |>
    group_by(region) |>
    summarise(co2_2022 = sum(value, na.rm = TRUE)) |>
    collect()

# Join the two datasets and calculate change
co2_change <- co2_1995 |>
    inner_join(co2_2022, by = "region") |>
    mutate(
        absolute_change = co2_2022 - co2_1995,
        percent_change = ((co2_2022 - co2_1995) / co2_1995) * 100
    ) |>
    filter(grepl("^W", region)) |>  # Include only rest-of-world regions (WA, WE, WF, WL, WM)
    arrange(absolute_change)

# View regions with the largest absolute decreases
head(co2_change, 5)

Now let’s visualize the top 5 regions with the largest absolute reductions:

Code
# Get top 5 reducers
top_reducers <- co2_change |>
    head(5)

# Create a bar plot
reduction_plot <- ggplot(top_reducers, aes(x = reorder(region, absolute_change), y = absolute_change, fill = region)) +
    geom_col() +
    coord_flip() +
    scale_fill_brewer(palette = "Set2") +
    labs(
        title = "Top 5 Regions with Largest CO2 Emission Reductions",
        subtitle = "Change from 1995 to 2022",
        x = "Region",
        y = "Absolute Change in CO2 Emissions"
    ) +
    theme_minimal() +
    theme(
        plot.title = element_text(face = "bold", size = 14),
        legend.position = "none"
    )

reduction_plot

Let’s also look at percentage reductions (excluding countries with very low baseline emissions):

Code
# Filter to countries with substantial 1995 emissions (>1000 units) to avoid outliers
substantial_emitters <- co2_change |>
    filter(co2_1995 > 1000) |>
    arrange(percent_change) |>
    head(5)

substantial_emitters
Code
# Visualize percentage reductions
percent_plot <- ggplot(substantial_emitters, aes(x = reorder(region, percent_change), y = percent_change, fill = region)) +
    geom_col() +
    coord_flip() +
    scale_fill_brewer(palette = "Dark2") +
    labs(
        title = "Top 5 Regions with Largest % CO2 Emission Reductions",
        subtitle = "Change from 1995 to 2022 (excluding very small emitters)",
        x = "Region",
        y = "Percent Change in CO2 Emissions (%)"
    ) +
    theme_minimal() +
    theme(
        plot.title = element_text(face = "bold", size = 13),
        legend.position = "none"
    )

percent_plot

8 Exercise 5: CO2 Emissions Over Time for Top 5 Emitting Countries

Now let’s look at individual countries (excluding rest-of-world aggregates) to see the top 5 emitting countries:

Code
# Find the top 5 emitting countries in 2022 (excluding W regions)
top_countries_2022 <- exio |>
    filter(year == 2022,
           matrix == "F_satellite",
           stressor %like% "%CO2%") |>
    group_by(region) |>
    summarise(total_co2 = sum(value, na.rm = TRUE)) |>
    filter(!grepl("^W", region)) |>  # Exclude rest-of-world regions
    arrange(desc(total_co2)) |>
    head(5) |>
    collect()

top_countries_2022

Get time series data for these top 5 countries across all years:

`summarise()` has grouped output by "year". You can override using the
`.groups` argument.

Create a line plot showing CO2 emissions over time for the top 5 countries:

Code
# Create the time series plot
countries_plot <- ggplot(co2_countries_timeseries, aes(x = year, y = total_co2, color = country_name)) +
    geom_line(linewidth = 1.2) +
    geom_point(size = 2.5) +
    labs(
        title = "CO2 Emissions Over Time: Top 5 Emitting Countries",
        subtitle = "Based on EXIOBASE 3.8.1 data (1995-2022)",
        x = "Year",
        y = "Total CO2 Emissions",
        color = "Country"
    ) +
    theme_minimal() +
    theme(
        plot.title = element_text(face = "bold", size = 14),
        legend.position = "right",
        legend.title = element_text(face = "bold", size = 11),
        legend.text = element_text(size = 10),
        legend.background = element_rect(fill = "white", color = "gray80"),
        legend.key.size = unit(1, "cm")
    ) +
    scale_x_continuous(breaks = seq(1995, 2022, by = 5))

countries_plot