I'm trying to create a linear representation of a data set in ggplot to compare patterns of data density between sets. There are several thousand values per set, and each value belongs to one of two states. I'd like to plot these data as points, so I've been using the vertical bar character | as the point shape, but there's too much overlap between points, especially at the larger point sizes needed to see the data in the plot. Also, the last point plotted is much larger than the rest.
What I'm looking for is a way to fix the width of each point while still being able to increase its height.
Below is example code and the figure it produces. Thanks in advance!
library(tidyverse)
set.seed(42)
sites.a <- sort(sample(1:5000000, 80000, replace=F))
hilite.a <- sort(sample(sites.a, 20000, replace=F))
sites.b <- sort(sample(1:5000000, 80000, replace=F))
hilite.b <- sort(sample(sites.b, 20000, replace=F))
a.df <- data.frame(position = sites.a) %>%
  mutate(name = "A") %>%
  mutate(type = case_when(position %in% hilite.a ~ "Yes",
                          .default = "No"))
b.df <- data.frame(position = sites.b) %>%
  mutate(name = "B") %>%
  mutate(type = case_when(position %in% hilite.b ~ "Yes",
                          .default = "No"))
sites.df <- bind_rows(a.df, b.df)
p <- ggplot(sites.df, aes(x = position, y = name, color = type)) +
  scale_shape_identity() +
  geom_point(shape = 124, size = 15, stroke = 0) +
  scale_color_manual(values = c("skyblue", "gray50")) +
  theme_classic() +
  scale_x_continuous(expand = c(0.01, 0))
ggsave(
  "output.png",
  p,
  width = 10,
  height = 2,
  dpi = 1200
)
asked Mar 11 at 1:40 by Egon
1 Answer
It sounds like you're looking to show 80,000 data points per set across 10 inches, i.e. 8,000 points per inch, which is far finer than any normal printer could reproduce or any human eye could perceive.
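The arithmetic behind that estimate can be checked quickly. This is a minimal sketch using the numbers from the question (80,000 positions per set, a 10 inch wide figure saved at 1200 dpi); the variable names are only illustrative.

```r
# Numbers taken from the question's example
n_points <- 80000  # positions per set
width_in <- 10     # ggsave(width = 10)
dpi      <- 1200   # ggsave(dpi = 1200)

points_per_inch  <- n_points / width_in     # points competing for each inch
pixels_wide      <- width_in * dpi          # pixel columns in the saved image
points_per_pixel <- n_points / pixels_wide  # average points per pixel column

points_per_inch
pixels_wide
round(points_per_pixel, 1)
```

Even at 1200 dpi, several positions share each pixel column on average, and slightly more in practice because the axes and legend take up part of the width. Some overlap is therefore unavoidable in a single row.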
If you want to see all the data, one way would be to take your single line and turn it into a "page" with many lines. Here, I've arbitrarily made each line 20,000 units wide, giving 5M / 20k = 250 lines with only ~300 points plotted on each line. This lets us see all the data, but there are tradeoffs: it's now harder to tell where in the range we are, and it's harder to compare one point in A to the same point in B. I've adjusted the y axis to read top-to-bottom and to indicate where we are in the original position sequence. But your expectations for the original method might ultimately be impossible to achieve.
In many situations, it's less important to plot every single data point than to identify and highlight the patterns of note. But that will be highly dependent upon what "message" you want your audience to get from the data; with this volume it's unlikely that the raw data will be the best way to depict this.
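# Wrap the position axis into rows that are each 20,000 units wide:
# section is the row index, sec_pos is the offset within that row
sites.df |>
  mutate(section = position %/% 20000,
         sec_pos = position %% 20000) |>
  ggplot(aes(x = sec_pos, y = section, fill = type)) +
  geom_tile(width = 10, height = 0.8) +
  facet_wrap(~name, ncol = 1) +
  scale_fill_manual(values = c("skyblue", "gray50")) +
  theme_classic() +
  scale_x_continuous(expand = c(0.01, 0)) +
  # Label each row with its starting position and read top-to-bottom
  scale_y_continuous(labels = ~ . * 20000,
                     trans = scales::transform_reverse())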
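To see how the wrapping works, here is a small sketch of the integer-division and remainder step on a few made-up positions. The values and the names line_width and wrapped are hypothetical, chosen only for illustration.

```r
# Hypothetical positions, for illustration only
position   <- c(5, 19999, 20000, 45678, 4999999)
line_width <- 20000  # same row width as in the code above

wrapped <- data.frame(
  position = position,
  section  = position %/% line_width,  # which row the point lands on
  sec_pos  = position %% line_width    # offset within that row
)
wrapped
```

Multiplying section by 20000, as the y axis labels do, gives back the starting position of each row.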
Alternatively, you might explore other statistical ways to express "patterns of data density" in a more summarized form. For example, the first approach below shows the density of each type on its own terms, i.e. what portion of that type's values are found where. The second is a histogram using the default 30 bins; that one also shows that No is more prevalent than Yes.
sites.df |>
  ggplot(aes(position, color = type)) +
  geom_density(adjust = 0.1) +
  facet_wrap(~name)
sites.df |>
  ggplot(aes(position, fill = type)) +
  geom_histogram(position = position_dodge()) +
  facet_wrap(~name)
[...] |, as it can be really hard to control the width of characters as shapes. I often use points with jitter and translucent alpha to convey density. If you must do full-height, consider geom_tile or geom_rect, where you can better control the width of any/all points. – r2evans Commented Mar 11 at 1:53
geom_violin? – Michael Dewar Commented Mar 11 at 3:47
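The geom_tile suggestion in the comments could look something like the sketch below. It keeps the single-row layout from the question but draws each position as a tile whose width is fixed in data units and whose height is set separately. The stand-in data, the helper make_set, and the tile_width value of 500 are assumptions for illustration, not part of the original posts. Tiles narrower than about one pixel's worth of the x range may render faintly or vanish, depending on the graphics device.

```r
library(dplyr)
library(ggplot2)

# Stand-in for the question's sites.df (same sizes, simplified construction)
set.seed(42)
make_set <- function(label) {
  pos    <- sort(sample(1:5000000, 80000))
  hilite <- sample(pos, 20000)
  data.frame(position = pos,
             name     = label,
             type     = ifelse(pos %in% hilite, "Yes", "No"))
}
sites.df <- bind_rows(make_set("A"), make_set("B"))

tile_width <- 500  # assumed value, in x-axis (position) units

p_tile <- ggplot(sites.df, aes(x = position, y = name, fill = type)) +
  # width is fixed in data units; height is set independently,
  # and values up to 1 stay within each discrete row
  geom_tile(width = tile_width, height = 0.8) +
  scale_fill_manual(values = c(No = "skyblue", Yes = "gray50")) +
  theme_classic() +
  scale_x_continuous(expand = c(0.01, 0))
```

The plot can be saved with the same ggsave() call as in the question. Overlap between neighbouring tiles still occurs at this data volume, so this controls the mark shape rather than removing the density limit described in the answer.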