I would like to plot a Sankey diagram to show how observations migrate from one risk level to another over multiple stages (in this case, years). The risk level labels are therefore the same in each year. The x axis should show the years and the y axis the proportion, as illustrated in the picture. Below is the code I attempted. Thanks!
# Sample data frame
library(ggsankeyfier)
library(dplyr)
library(ggplot2)
df <- data.frame(
  ID = c(1, 1, 1, 2, 2, 2, 3, 3, 3, 4, 4, 4, 5, 5, 5),
  risk_level = c("High", "High", "High",
                 "Low", "Low", "Very low",
                 "Low", "Low", "Low",
                 "Low", "Moderate", "Low",
                 "Moderate", "High", "High"),
  Year = c(2022, 2023, 2024,
           2022, 2023, 2024,
           2022, 2023, 2024,
           2022, 2023, 2024,
           2022, 2023, 2024))
df1 <- df %>%
  group_by(risk_level, Year) %>%
  summarise(count = n(), .groups = "drop_last") %>%
  group_by(Year) %>%
  mutate(proportion = count / sum(count)) %>%
  ungroup()
# Converting the data for the Sankey diagram
df_pivot <- pivot_stages_longer(df1,
                                stages_from = c("Year", "risk_level"),
                                ## the column that represents the size of the flows:
                                values_from = "proportion")
# Attempting to plot the Sankey diagram
ggplot(df_pivot, aes(x = stage, y = proportion, group = node,
                     connector = connector, edge_id = edge_id, fill = node)) +
  geom_sankeyedge(v_space = "auto") +
  geom_sankeynode(v_space = "auto")
1 Answer
The issue is the setup of the data. To achieve your desired result, reshape the data to wide format, then compute the counts and the proportion for each unique path of risk levels along the stages in the data. Afterwards, use pivot_stages_longer() to reshape the data to the long format required by ggsankeyfier:
library(ggsankeyfier)
library(ggplot2)
library(dplyr)
library(tidyr)
df_pivot <- df |>
  # Order the risk levels so the nodes stack from "Very low" to "High"
  mutate(
    risk_level = factor(
      risk_level, c("Very low", "Low", "Moderate", "High")
    )
  ) |>
  # One row per ID, one column per year
  tidyr::pivot_wider(names_from = Year, values_from = risk_level) |>
  # Count each unique path of risk levels across the years
  count(across(-ID)) |>
  mutate(prop = n / sum(n)) |>
  pivot_stages_longer(
    stages_from = c("2022", "2023", "2024"),
    values_from = c("prop", "n")
  )
# Plot the Sankey diagram
ggplot(df_pivot, aes(
  x = stage, y = prop, group = node,
  connector = connector, edge_id = edge_id, fill = node
)) +
  geom_sankeyedge(v_space = "auto") +
  geom_sankeynode(v_space = "auto", order = "as_is")
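To see what the reshape does before anything is handed to ggsankeyfier, it may help to look at the intermediate table of paths. The following is only a sketch using dplyr and tidyr on the sample data from the question; the name `paths` is made up for illustration, and the factor ordering is left out for brevity.

```r
library(dplyr)
library(tidyr)

# Sample data from the question
df <- data.frame(
  ID = rep(1:5, each = 3),
  risk_level = c("High", "High", "High",
                 "Low", "Low", "Very low",
                 "Low", "Low", "Low",
                 "Low", "Moderate", "Low",
                 "Moderate", "High", "High"),
  Year = rep(c(2022, 2023, 2024), times = 5)
)

# `paths` is a hypothetical name: one row per unique sequence of
# risk levels across the three years, with its count and proportion
paths <- df |>
  pivot_wider(names_from = Year, values_from = risk_level) |>
  count(across(-ID)) |>
  mutate(prop = n / sum(n))

print(paths)
```

Each row of `paths` is one trajectory through 2022, 2023 and 2024, so the proportions describe flows between years rather than the share of each risk level within a single year, which is what the summary in the question computed.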