
r - Create custom point with fixed width for ggplot figure - Stack Overflow


I'm trying to create a linear representation of a data set in ggplot to compare patterns of data density between sets. There are several thousand values per set that can belong to one of two states. I'd like to plot these data as points, so I've been using the horizontal bar character | but there's too much overlap of the points, especially with increasing point sizes that are needed to be able to see the data in the plot. Also, the last point plotted is much larger than the rest.

What I'm looking for is a way to set and fix the width of each point but still be able to expand the height of the data points.

Below is an example code set and the figure produced. Thanks in advance!

library(tidyverse)

set.seed(42)

sites.a <- sort(sample(1:5000000, 80000, replace = FALSE))
hilite.a <- sort(sample(sites.a, 20000, replace = FALSE))

sites.b <- sort(sample(1:5000000, 80000, replace = FALSE))
hilite.b <- sort(sample(sites.b, 20000, replace = FALSE))

a.df <- data.frame(position = sites.a) %>%
  mutate(name = "A") %>%
  mutate(type = case_when(position %in% hilite.a ~ "Yes",
                           .default = "No"))

b.df <- data.frame(position = sites.b) %>%
  mutate(name = "B") %>%
  mutate(type = case_when(position %in% hilite.b ~ "Yes",
                           .default = "No"))

sites.df <- bind_rows(a.df, b.df)

p <- ggplot(sites.df, aes(x = position, y = name, color = type)) +
  scale_shape_identity() + 
  geom_point(shape = 124, size = 15, stroke = 0) + 
  scale_color_manual(values = c("skyblue", "gray50")) +
  theme_classic() + 
  scale_x_continuous(expand = c(0.01,0))

ggsave(
  "output.png",
  p,
  width = 10,
  height = 2,
  dpi = 1200
)
 


Asked Mar 11 at 1:40 by Egon
  • 6 I do not agree with the premise that "linear representation of a data set" requires you to plot all data points, you can often do well enough with sampling and/or distribution plots instead of lots of lines. Having said that, I suggest switching from the (ahem) vertical bar |, as it can be really hard to control width of characters as shapes. I often use points with jitter and translucent alpha to convey density. If you must do full-height, consider geom_tile or geom_rect, where you can better control the width of any/all points. – r2evans Commented Mar 11 at 1:53
  • Along the lines of the previous comment, have you tried geom_violin? – Michael Dewar Commented Mar 11 at 3:47
  • Thanks for the suggestions and sorry about "horizontal". My dumb mistake. I'd like to keep this representation if possible as I think it would convey the most biologic relevance. As I replied below, I think I could also just keep the 20,000 "Yes" points and not have to show the other 60,000 "No" points, and I think dropping the alpha could help with conveying density. But ultimately I'm wondering if in ggplot there's either a way to adjust point size in just one dimension (make my points skinny and long) or create a custom point where one of the dimensions is fixed. – Egon Commented Mar 11 at 16:07

1 Answer


It sounds like you're looking to show 80,000 data points across 10 inches -- i.e. 8,000 dpi, which is far higher than any normal printer could create or any human eye could perceive.
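
The arithmetic behind that estimate can be checked directly. This is a minimal base-R sketch using the figures from the question (80,000 points per set, a 10-inch-wide image saved at 1200 dpi, positions up to 5,000,000). The variable names are illustrative only, and it ignores plot margins and axis expansion, so the real panel is slightly narrower than assumed here.

```r
n_points <- 80000    # points per set (from the question)
width_in <- 10       # ggsave() width, in inches
dpi      <- 1200     # ggsave() dpi
max_pos  <- 5000000  # upper end of the position range

points_per_inch <- n_points / width_in       # 8000
pixels_across   <- width_in * dpi            # 12000
points_per_px   <- n_points / pixels_across  # about 6.7
pos_per_px      <- max_pos / pixels_across   # about 417 position units per pixel
```

Under these assumptions each pixel column holds several points on average even at 1200 dpi, so any mark at least one pixel wide will overlap its neighbours.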

If you want to see all the data, one way would be to take your single long line and wrap it into a "page" with many lines. Here, I've arbitrarily made each line 20,000 units wide, which gives 5M / 20k = 250 lines with only ~300 points plotted on each. This lets us see all the data, but there are some tradeoffs: it's now harder to tell where in the range we are, and it's harder to compare one point in A to the same point in B. I've adjusted the y axis to read top-to-bottom and to indicate where we are in the original position sequence. Still, your expectations for the original method might ultimately be impossible to achieve.

In many situations, it's less important to plot every single data point than to identify and highlight the patterns of note. But that depends heavily on what "message" you want your audience to get from the data; with this volume it's unlikely that the raw data will be the best way to depict this.

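sites.df |>
  # split each position into a row index (section) and an offset within that row
  mutate(section = position %/% 20000,
         sec_pos = position %% 20000) |>
  ggplot(aes(x = sec_pos, y = section, fill = type)) +
  # width is in x-axis (position) units, so every tile has the same fixed width
  geom_tile(width = 10, height = 0.8) +
  facet_wrap(~name, ncol = 1) +
  scale_fill_manual(values = c("skyblue", "gray50")) +
  theme_classic() +
  scale_x_continuous(expand = c(0.01, 0)) +
  # label each row by its starting position and read top-to-bottom
  scale_y_continuous(labels = ~. * 20000,
                     trans = scales::transform_reverse())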
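
For the original single-row layout, the geom_tile suggestion from the comments could look something like the sketch below. It is not part of the answer above, only an illustration: make_set() is a hypothetical helper that rebuilds data shaped like sites.df so the block runs on its own, and tile_width (500 position units, roughly one pixel at the question's output size) and the alpha value are assumed values to tune by eye. It keeps only the "Yes" points, as proposed in the comments.

```r
library(dplyr)
library(ggplot2)

set.seed(42)

# Hypothetical helper: rebuild data shaped like sites.df from the question
make_set <- function(name, n = 80000, n_yes = 20000, max_pos = 5000000) {
  pos <- sort(sample(seq_len(max_pos), n))
  yes <- sample(pos, n_yes)
  data.frame(position = pos,
             name = name,
             type = ifelse(pos %in% yes, "Yes", "No"))
}
sites.df <- bind_rows(make_set("A"), make_set("B"))

# Keep only the highlighted points
yes.df <- filter(sites.df, type == "Yes")

tile_width <- 500  # assumed: fixed mark width, in position (x-axis) units

p_tile <- ggplot(yes.df, aes(x = position, y = name)) +
  # width is fixed in data units; height can be changed independently
  geom_tile(width = tile_width, height = 0.8, fill = "gray50", alpha = 0.3) +
  theme_classic() +
  scale_x_continuous(expand = c(0.01, 0))
```

Because width and height are separate arguments, the marks can be made taller without getting wider, and the translucent fill lets overlapping tiles darken where points are dense.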

Alternatively, you might explore other statistical ways to express "patterns of data density" in a more summarized way. For example, the first approach below shows the density of each type on its own terms, i.e. what portion of the values are found where. The second is a histogram using the default 30 bins; that one also shows how No is more prevalent.

sites.df |>
  ggplot(aes(position, color = type)) +
  geom_density(adjust = 0.1) +
  facet_wrap(~name)


sites.df |>
  ggplot(aes(position, fill = type)) +
  geom_histogram(position = position_dodge()) +
  facet_wrap(~name)
