Module 1: Tabular Data

Working with larger-than-RAM data using duckdbfs

Author

ESPM 288

Introduction

In this module, we will explore high-performance workflows for tabular data. We will use duckdbfs to work with datasets that are larger than available RAM by leveraging DuckDB’s streaming and remote file access capabilities.

Case Study: Global Supply Chains

We will be working with EXIOBASE 3.8.1, a global Multi-Regional Input-Output (MRIO) database. This dataset tracks economic transactions between sectors and regions, along with their environmental impacts (emissions, resource use, etc.).

Data description: - Coverage: 44 countries + 5 rest-of-world regions. - Timeframe: 1995–2022. - Content: Economic transactions (Z matrix), final demand (Y matrix), and environmental stressors (F matrix). - Format: Cloud-optimized Parquet, partitioned by year and matrix type.

Setup

Code
library(duckdbfs)
library(dplyr)

Attaching package: 'dplyr'
The following objects are masked from 'package:stats':

    filter, lag
The following objects are masked from 'package:base':

    intersect, setdiff, setequal, union

Exercise 1: connecting to remote data

We can open the entire dataset without downloading it using open_dataset(). The data is hosted on Source Cooperative. The ** pattern allows recursive scanning of the partitioned parquet files.

Code
# Remote S3 path to EXIOBASE 3 (Source Cooperative)

duckdbfs::duckdb_secrets(
    key = "",
    secret = "",
    endpoint = "s3.amazonaws.com",
    region = "us-west-2"
)
[1] 1
Code
s3_url <- "s3://us-west-2.opendata.source.coop/youssef-harby/exiobase-3/4588235/parquet/**"

# Open the dataset lazily
exio <- open_dataset(s3_url)

# View the schema (column names and types) without reading data
glimpse(exio)
Rows: ??
Columns: 8
Database: DuckDB 1.4.3 [root@Darwin 25.2.0:R 4.5.2/:memory:]
$ stressor <chr> "Value Added", "Value Added", "Value Added", "Value Added", "…
$ region   <chr> "AT", "AT", "AT", "AT", "AT", "AT", "AT", "AT", "AT", "AT", "…
$ sector   <chr> "Cultivation of wheat", "Cultivation of cereal grains nec", "…
$ value    <dbl> 183.1118891, 402.2305799, 830.2127384, 101.9705426, 31.763189…
$ unit     <chr> "M.EUR", "M.EUR", "M.EUR", "M.EUR", "M.EUR", "M.EUR", "M.EUR"…
$ year     <dbl> 1995, 1995, 1995, 1995, 1995, 1995, 1995, 1995, 1995, 1995, 1…
$ format   <chr> "ixi", "ixi", "ixi", "ixi", "ixi", "ixi", "ixi", "ixi", "ixi"…
$ matrix   <chr> "F_impacts", "F_impacts", "F_impacts", "F_impacts", "F_impact…

Exercise 2: Efficient Filtering

The dataset is large. We should filter before collecting any data into R.

Code
exio |>
    filter(year == 2022, region == "US") |>
    head() |> # view the first 6 rows
    collect()
# A tibble: 6 × 8
  stressor    region sector                      value unit   year format matrix
  <chr>       <chr>  <chr>                       <dbl> <chr> <dbl> <chr>  <chr> 
1 Value Added US     Cultivation of paddy rice    750. M.EUR  2022 ixi    F_imp…
2 Value Added US     Cultivation of wheat        2019. M.EUR  2022 ixi    F_imp…
3 Value Added US     Cultivation of cereal gra…  7355. M.EUR  2022 ixi    F_imp…
4 Value Added US     Cultivation of vegetables… 26878. M.EUR  2022 ixi    F_imp…
5 Value Added US     Cultivation of oil seeds    5003. M.EUR  2022 ixi    F_imp…
6 Value Added US     Cultivation of sugar cane…   290. M.EUR  2022 ixi    F_imp…

Task: Construct a query to find the top 5 sectors in the US by CO2 emissions in 2022. Remember to check the column names in exio to find the appropriate emissions flow.

Code
# Find top 5 sectors in US by CO2 emissions in 2022
exio |>
    filter(year == 2022, 
           matrix == "F_satellite",
           region == "US",
           str_detect(stressor, "(?i)CO2")) |>
    group_by(sector) |>
    summarise(total_co2 = sum(value, na.rm = TRUE)) |>
    arrange(desc(total_co2)) |>
    head(5) |>
    collect()
# A tibble: 5 × 2
  sector                            total_co2
  <chr>                                 <dbl>
1 Production of electricity by coal   1.38e12
2 Electricity by coal                 1.18e12
3 Production of electricity by gas    6.97e11
4 Electricity by gas                  6.33e11
5 Chemicals nec                       2.19e11

Exercise 3: Time Series Analysis

Task: Calculate the total CO2 emissions for China (CN) from 1995 to 2022 and identify which year had the highest emissions.

Code
# Your code here

Exercise 4: Cross-Region Comparison

Task: Compare water consumption (look for “Water” in stressor names) across the US, China (CN), and Germany (DE) in 2022. Which country’s sectors used the most water?

Code
# Your code here

Exercise 5: Sector Deep Dive

Task: For the “Electricity by coal” sector, find all the different environmental stressors tracked in F_satellite and their total values for 2022 across all regions.

Code
# Your code here

Exercise 6: Supply Chain Transactions

Task: Using the Z matrix (inter-industry transactions), find the top 5 sector-to-sector trade relationships by value in 2022. Look at which sectors are buying the most from other sectors.

Code
# Your code here

Exercise 7: Final Demand Analysis

Task: Using the Y matrix (final demand), identify which final demand category (household consumption, government spending, etc.) contributes most to the US economy in 2022.

Code
# Your code here

Exercise 8: Multi-Pollutant Profile

Task: For the US in 2022, create a summary showing the top 3 emitting sectors for each of these pollutants: CO2, CH4 (methane), and N2O (nitrous oxide). Use F_satellite matrix.

Code
# Your code here

Exercise 9: Regional Aggregation

Task: Calculate the total economic output (sum of all Z matrix transactions) for each region in 2022. Which are the top 5 economic regions?

Code
# Your code here

Exercise 10: Trend Detection

Task: Calculate the percentage change in total CO2 emissions for the US between 2000 and 2022. Has it increased or decreased?

Code
# Your code here
Code
exio |> distinct(matrix) |> collect()
# A tibble: 4 × 1
  matrix     
  <chr>      
1 F_satellite
2 F_impacts  
3 Z          
4 Y          

Visualization: Top CO2 Emitting Industries

Code
library(ggplot2)

# Get top 15 CO2 emitting sectors globally for 2022
top_co2_sectors <- exio |>
    filter(year == 2022,
           matrix == "F_satellite",
           str_detect(stressor, "(?i)CO2")) |>
    group_by(sector) |>
    summarise(total_co2 = sum(value, na.rm = TRUE)) |>
    arrange(desc(total_co2)) |>
    head(5) |>
    collect()

# Create bar graph
ggplot(top_co2_sectors, aes(x = reorder(sector, total_co2), y = total_co2, fill = total_co2)) +
    geom_col() +
    coord_flip() +
    scale_fill_gradient(low = "#56B4E9", high = "#D55E00") +
    labs(title = "Top 5 CO2 Emitting Industries Globally (2022)",
         x = "",
         y = "Total CO2 Emissions") +
    theme_minimal() +
    theme(legend.position = "none",
          axis.text.y = element_text(size = 20),
          plot.title = element_text(size = 22))

Code
exio |> 
    filter(matrix == "F_satellite", 
           str_detect(sector, "(?i)coal.*electric|electric.*coal")) |> 
    distinct(sector) |> 
    collect()
# A tibble: 2 × 1
  sector                           
  <chr>                            
1 Electricity by coal              
2 Production of electricity by coal
Code
options(repr.plot.width = 14, repr.plot.height = 4)

# Get CO2 emissions by country over time (2000-2022)
# Focus on top emitting countries for readability
co2_trends <- exio |>
    filter(year >= 2000, year <= 2022,
           matrix == "F_satellite",
           str_detect(stressor, "(?i)CO2")) |>
    group_by(region, year) |>
    summarise(total_co2 = sum(value, na.rm = TRUE), .groups = "drop") |>
    collect()

# Get top 10 countries by 2022 emissions
top_countries <- co2_trends |>
    filter(year == 2022) |>
    arrange(desc(total_co2)) |>
    head(10) |>
    pull(region)

# Filter for top countries only
co2_trends_top <- co2_trends |>
    filter(region %in% top_countries)

# Create line plot
p <- ggplot(co2_trends_top, aes(x = year, y = total_co2, color = region, group = region)) +
    geom_line(linewidth = 1.2) +
    geom_point(size = 2) +
    scale_color_brewer(palette = "Set3") +
    labs(title = "CO2 Emissions by Country (2000-2022)",
         subtitle = "Top 10 emitting countries",
         x = "Year",
         y = "Total CO2 Emissions",
         color = "Country") +
    theme_minimal() +
    theme(plot.title = element_text(size = 16, face = "bold"),
          plot.subtitle = element_text(size = 12),
          legend.position = "right")

ggsave("co2_trends.png", p, width = 14, height = 4, dpi = 300)
p  # Display the plot