Attaching package: 'dplyr'
The following objects are masked from 'package:stats':
filter, lag
The following objects are masked from 'package:base':
intersect, setdiff, setequal, union
Working with larger-than-RAM data using duckdbfs
In this module, we will explore high-performance workflows for tabular data. We will use duckdbfs to work with datasets that are larger than available RAM by leveraging DuckDB’s streaming and remote file access capabilities.
We will be working with EXIOBASE 3.8.1, a global Multi-Regional Input-Output (MRIO) database. This dataset tracks economic transactions between sectors and regions, along with their environmental impacts (emissions, resource use, etc.).
Data description: - Coverage: 44 countries + 5 rest-of-world regions. - Timeframe: 1995–2022. - Content: Economic transactions (Z matrix), final demand (Y matrix), and environmental stressors (F matrix). - Format: Cloud-optimized Parquet, partitioned by year and matrix type.
Attaching package: 'dplyr'
The following objects are masked from 'package:stats':
filter, lag
The following objects are masked from 'package:base':
intersect, setdiff, setequal, union
We can open the entire dataset without downloading it using open_dataset(). The data is hosted on Source Cooperative. The ** pattern allows recursive scanning of the partitioned parquet files.
[1] 1
Rows: ??
Columns: 8
Database: DuckDB 1.4.3 [bmcda@Windows 10 x64:R 4.5.2/:memory:]
$ stressor <chr> "Value Added", "Value Added", "Value Added", "Value Added", "…
$ region <chr> "AT", "AT", "AT", "AT", "AT", "AT", "AT", "AT", "AT", "AT", "…
$ sector <chr> "Cultivation of wheat", "Cultivation of cereal grains nec", "…
$ value <dbl> 183.1118891, 402.2305799, 830.2127384, 101.9705426, 31.763189…
$ unit <chr> "M.EUR", "M.EUR", "M.EUR", "M.EUR", "M.EUR", "M.EUR", "M.EUR"…
$ year <dbl> 1995, 1995, 1995, 1995, 1995, 1995, 1995, 1995, 1995, 1995, 1…
$ format <chr> "ixi", "ixi", "ixi", "ixi", "ixi", "ixi", "ixi", "ixi", "ixi"…
$ matrix <chr> "F_impacts", "F_impacts", "F_impacts", "F_impacts", "F_impact…
The dataset is large. We should filter before collecting any data into R.
exio |>
filter(year == 2022, region == "US") |>
head() |> # view the first 6 rows
collect()Task: Construct a query to find the top 5 sectors in the US by CO2 emissions in 2022. Remember to check the column names in
exioto find the appropriate emissions flow.
# Solution: Top 5 sectors in US by CO2 emissions (2022)
# Target the F_satellite matrix directly to access stressor column
exio_f_sat <- open_dataset(
"s3://us-west-2.opendata.source.coop/youssef-harby/exiobase-3/4588235/parquet/year=2022/format=ixi/matrix=F_satellite/"
)
exio_f_sat |>
filter(
grepl("CO2", stressor)
) |>
group_by(sector) |>
summarize(total_co2 = sum(value, na.rm = TRUE), .groups = "drop") |>
arrange(desc(total_co2)) |>
head(5) |>
collect()First, let’s identify the top 5 CO2 emitting regions in 2022:
# Find the top 5 emitting regions in 2022
top_emitters_2022 <- exio |>
filter(year == 2022,
matrix == "F_satellite",
stressor %like% "%CO2%") |>
group_by(region) |>
summarise(total_co2 = sum(value, na.rm = TRUE)) |>
arrange(desc(total_co2)) |>
head(5) |>
collect()
top_emitters_2022Now, let’s get the time series data for these top 5 regions across all years:
`summarise()` has grouped output by "year". You can override using the
`.groups` argument.
Create a line plot showing CO2 emissions over time for the top 5 regions:
# Create the time series plot
co2_plot <- ggplot(co2_timeseries, aes(x = year, y = total_co2, color = region_name)) +
geom_line(linewidth = 1.2) +
geom_point(size = 2.5) +
labs(
title = "CO2 Emissions Over Time: Top 5 Emitting Regions",
subtitle = "Based on EXIOBASE 3.8.1 data (1995-2022)",
x = "Year",
y = "Total CO2 Emissions",
color = "Region"
) +
theme_minimal() +
theme(
plot.title = element_text(face = "bold", size = 14),
legend.position = "right",
legend.title = element_text(face = "bold", size = 11),
legend.text = element_text(size = 10),
legend.background = element_rect(fill = "white", color = "gray80"),
legend.key.size = unit(1, "cm")
) +
scale_x_continuous(breaks = seq(1995, 2022, by = 5))
co2_plotLet’s identify which regions have reduced their CO2 emissions the most between 1995 and 2022:
# Get CO2 emissions for all regions in 1995 and 2022
co2_1995 <- exio |>
filter(year == 1995,
matrix == "F_satellite",
stressor %like% "%CO2%") |>
group_by(region) |>
summarise(co2_1995 = sum(value, na.rm = TRUE)) |>
collect()
co2_2022 <- exio |>
filter(year == 2022,
matrix == "F_satellite",
stressor %like% "%CO2%") |>
group_by(region) |>
summarise(co2_2022 = sum(value, na.rm = TRUE)) |>
collect()
# Join the two datasets and calculate change
co2_change <- co2_1995 |>
inner_join(co2_2022, by = "region") |>
mutate(
absolute_change = co2_2022 - co2_1995,
percent_change = ((co2_2022 - co2_1995) / co2_1995) * 100
) |>
filter(grepl("^W", region)) |> # Include only rest-of-world regions (WA, WE, WF, WL, WM)
arrange(absolute_change)
# View regions with the largest absolute decreases
head(co2_change, 5)Now let’s visualize the top 5 regions with the largest absolute reductions:
# Get top 5 reducers
top_reducers <- co2_change |>
head(5)
# Create a bar plot
reduction_plot <- ggplot(top_reducers, aes(x = reorder(region, absolute_change), y = absolute_change, fill = region)) +
geom_col() +
coord_flip() +
scale_fill_brewer(palette = "Set2") +
labs(
title = "Top 5 Regions with Largest CO2 Emission Reductions",
subtitle = "Change from 1995 to 2022",
x = "Region",
y = "Absolute Change in CO2 Emissions"
) +
theme_minimal() +
theme(
plot.title = element_text(face = "bold", size = 14),
legend.position = "none"
)
reduction_plotLet’s also look at percentage reductions (excluding countries with very low baseline emissions):
# Filter to countries with substantial 1995 emissions (>1000 units) to avoid outliers
substantial_emitters <- co2_change |>
filter(co2_1995 > 1000) |>
arrange(percent_change) |>
head(5)
substantial_emitters# Visualize percentage reductions
percent_plot <- ggplot(substantial_emitters, aes(x = reorder(region, percent_change), y = percent_change, fill = region)) +
geom_col() +
coord_flip() +
scale_fill_brewer(palette = "Dark2") +
labs(
title = "Top 5 Regions with Largest % CO2 Emission Reductions",
subtitle = "Change from 1995 to 2022 (excluding very small emitters)",
x = "Region",
y = "Percent Change in CO2 Emissions (%)"
) +
theme_minimal() +
theme(
plot.title = element_text(face = "bold", size = 13),
legend.position = "none"
)
percent_plotNow let’s look at individual countries (excluding rest-of-world aggregates) to see the top 5 emitting countries:
# Find the top 5 emitting countries in 2022 (excluding W regions)
top_countries_2022 <- exio |>
filter(year == 2022,
matrix == "F_satellite",
stressor %like% "%CO2%") |>
group_by(region) |>
summarise(total_co2 = sum(value, na.rm = TRUE)) |>
filter(!grepl("^W", region)) |> # Exclude rest-of-world regions
arrange(desc(total_co2)) |>
head(5) |>
collect()
top_countries_2022Get time series data for these top 5 countries across all years:
`summarise()` has grouped output by "year". You can override using the
`.groups` argument.
Create a line plot showing CO2 emissions over time for the top 5 countries:
# Create the time series plot
countries_plot <- ggplot(co2_countries_timeseries, aes(x = year, y = total_co2, color = country_name)) +
geom_line(linewidth = 1.2) +
geom_point(size = 2.5) +
labs(
title = "CO2 Emissions Over Time: Top 5 Emitting Countries",
subtitle = "Based on EXIOBASE 3.8.1 data (1995-2022)",
x = "Year",
y = "Total CO2 Emissions",
color = "Country"
) +
theme_minimal() +
theme(
plot.title = element_text(face = "bold", size = 14),
legend.position = "right",
legend.title = element_text(face = "bold", size = 11),
legend.text = element_text(size = 10),
legend.background = element_rect(fill = "white", color = "gray80"),
legend.key.size = unit(1, "cm")
) +
scale_x_continuous(breaks = seq(1995, 2022, by = 5))
countries_plot