Code
library(duckdbfs)
library(dplyr)
Attaching package: 'dplyr'
The following objects are masked from 'package:stats':
filter, lag
The following objects are masked from 'package:base':
intersect, setdiff, setequal, union
Working with larger-than-RAM data using duckdbfs
In this module, we will explore high-performance workflows for tabular data. We will use duckdbfs to work with datasets that are larger than available RAM by leveraging DuckDB’s streaming and remote file access capabilities.
We will be working with EXIOBASE 3.8.1, a global Multi-Regional Input-Output (MRIO) database. This dataset tracks economic transactions between sectors and regions, along with their environmental impacts (emissions, resource use, etc.).
Data description: - Coverage: 44 countries + 5 rest-of-world regions. - Timeframe: 1995–2022. - Content: Economic transactions (Z matrix), final demand (Y matrix), and environmental stressors (F matrix). - Format: Cloud-optimized Parquet, partitioned by year and matrix type.
library(duckdbfs)
library(dplyr)
Attaching package: 'dplyr'
The following objects are masked from 'package:stats':
filter, lag
The following objects are masked from 'package:base':
intersect, setdiff, setequal, union
We can open the entire dataset without downloading it using open_dataset(). The data is hosted on Source Cooperative. The ** pattern allows recursive scanning of the partitioned parquet files.
# Remote S3 path to EXIOBASE 3 (Source Cooperative)
duckdbfs::duckdb_secrets(
key = "",
secret = "",
endpoint = "s3.amazonaws.com",
region = "us-west-2"
)[1] 1
s3_url <- "s3://us-west-2.opendata.source.coop/youssef-harby/exiobase-3/4588235/parquet/**"
# Open the dataset lazily
exio <- open_dataset(s3_url)
# View the schema (column names and types) without reading data
glimpse(exio)Rows: ??
Columns: 8
Database: DuckDB 1.4.3 [root@Darwin 25.2.0:R 4.5.2/:memory:]
$ stressor <chr> "Value Added", "Value Added", "Value Added", "Value Added", "…
$ region <chr> "AT", "AT", "AT", "AT", "AT", "AT", "AT", "AT", "AT", "AT", "…
$ sector <chr> "Cultivation of wheat", "Cultivation of cereal grains nec", "…
$ value <dbl> 183.1118891, 402.2305799, 830.2127384, 101.9705426, 31.763189…
$ unit <chr> "M.EUR", "M.EUR", "M.EUR", "M.EUR", "M.EUR", "M.EUR", "M.EUR"…
$ year <dbl> 1995, 1995, 1995, 1995, 1995, 1995, 1995, 1995, 1995, 1995, 1…
$ format <chr> "ixi", "ixi", "ixi", "ixi", "ixi", "ixi", "ixi", "ixi", "ixi"…
$ matrix <chr> "F_impacts", "F_impacts", "F_impacts", "F_impacts", "F_impact…
The dataset is large. We should filter before collecting any data into R.
exio |>
filter(year == 2022, region == "US") |>
head() |> # view the first 6 rows
collect()# A tibble: 6 × 8
stressor region sector value unit year format matrix
<chr> <chr> <chr> <dbl> <chr> <dbl> <chr> <chr>
1 Value Added US Cultivation of paddy rice 750. M.EUR 2022 ixi F_imp…
2 Value Added US Cultivation of wheat 2019. M.EUR 2022 ixi F_imp…
3 Value Added US Cultivation of cereal gra… 7355. M.EUR 2022 ixi F_imp…
4 Value Added US Cultivation of vegetables… 26878. M.EUR 2022 ixi F_imp…
5 Value Added US Cultivation of oil seeds 5003. M.EUR 2022 ixi F_imp…
6 Value Added US Cultivation of sugar cane… 290. M.EUR 2022 ixi F_imp…
Task: Construct a query to find the top 5 sectors in the US by CO2 emissions in 2022. Remember to check the column names in
exioto find the appropriate emissions flow.
# Find top 5 sectors in US by CO2 emissions in 2022
exio |>
filter(year == 2022,
matrix == "F_satellite",
region == "US",
str_detect(stressor, "(?i)CO2")) |>
group_by(sector) |>
summarise(total_co2 = sum(value, na.rm = TRUE)) |>
arrange(desc(total_co2)) |>
head(5) |>
collect()# A tibble: 5 × 2
sector total_co2
<chr> <dbl>
1 Production of electricity by coal 1.38e12
2 Electricity by coal 1.18e12
3 Production of electricity by gas 6.97e11
4 Electricity by gas 6.33e11
5 Chemicals nec 2.19e11
Task: Calculate the total CO2 emissions for China (CN) from 1995 to 2022 and identify which year had the highest emissions.
# Your code hereTask: Compare water consumption (look for “Water” in stressor names) across the US, China (CN), and Germany (DE) in 2022. Which country’s sectors used the most water?
# Your code hereTask: For the “Electricity by coal” sector, find all the different environmental stressors tracked in F_satellite and their total values for 2022 across all regions.
# Your code hereTask: Using the Z matrix (inter-industry transactions), find the top 5 sector-to-sector trade relationships by value in 2022. Look at which sectors are buying the most from other sectors.
# Your code hereTask: Using the Y matrix (final demand), identify which final demand category (household consumption, government spending, etc.) contributes most to the US economy in 2022.
# Your code hereTask: For the US in 2022, create a summary showing the top 3 emitting sectors for each of these pollutants: CO2, CH4 (methane), and N2O (nitrous oxide). Use F_satellite matrix.
# Your code hereTask: Calculate the total economic output (sum of all Z matrix transactions) for each region in 2022. Which are the top 5 economic regions?
# Your code hereTask: Calculate the percentage change in total CO2 emissions for the US between 2000 and 2022. Has it increased or decreased?
# Your code hereexio |> distinct(matrix) |> collect()# A tibble: 4 × 1
matrix
<chr>
1 F_satellite
2 F_impacts
3 Z
4 Y
library(ggplot2)
# Get top 15 CO2 emitting sectors globally for 2022
top_co2_sectors <- exio |>
filter(year == 2022,
matrix == "F_satellite",
str_detect(stressor, "(?i)CO2")) |>
group_by(sector) |>
summarise(total_co2 = sum(value, na.rm = TRUE)) |>
arrange(desc(total_co2)) |>
head(5) |>
collect()
# Create bar graph
ggplot(top_co2_sectors, aes(x = reorder(sector, total_co2), y = total_co2, fill = total_co2)) +
geom_col() +
coord_flip() +
scale_fill_gradient(low = "#56B4E9", high = "#D55E00") +
labs(title = "Top 5 CO2 Emitting Industries Globally (2022)",
x = "",
y = "Total CO2 Emissions") +
theme_minimal() +
theme(legend.position = "none",
axis.text.y = element_text(size = 20),
plot.title = element_text(size = 22))exio |>
filter(matrix == "F_satellite",
str_detect(sector, "(?i)coal.*electric|electric.*coal")) |>
distinct(sector) |>
collect()# A tibble: 2 × 1
sector
<chr>
1 Electricity by coal
2 Production of electricity by coal
options(repr.plot.width = 14, repr.plot.height = 4)
# Get CO2 emissions by country over time (2000-2022)
# Focus on top emitting countries for readability
co2_trends <- exio |>
filter(year >= 2000, year <= 2022,
matrix == "F_satellite",
str_detect(stressor, "(?i)CO2")) |>
group_by(region, year) |>
summarise(total_co2 = sum(value, na.rm = TRUE), .groups = "drop") |>
collect()
# Get top 10 countries by 2022 emissions
top_countries <- co2_trends |>
filter(year == 2022) |>
arrange(desc(total_co2)) |>
head(10) |>
pull(region)
# Filter for top countries only
co2_trends_top <- co2_trends |>
filter(region %in% top_countries)
# Create line plot
p <- ggplot(co2_trends_top, aes(x = year, y = total_co2, color = region, group = region)) +
geom_line(linewidth = 1.2) +
geom_point(size = 2) +
scale_color_brewer(palette = "Set3") +
labs(title = "CO2 Emissions by Country (2000-2022)",
subtitle = "Top 10 emitting countries",
x = "Year",
y = "Total CO2 Emissions",
color = "Country") +
theme_minimal() +
theme(plot.title = element_text(size = 16, face = "bold"),
plot.subtitle = element_text(size = 12),
legend.position = "right")
ggsave("co2_trends.png", p, width = 14, height = 4, dpi = 300)
p # Display the plot