
How do I figure cumulative distribution within groups (as cume_dist does) but separate tied values in R? - Stack Overflow


Here is a data set where YEAR is a grouping variable.

dat <- data.frame(YEAR = c(rep(1999,4),rep(2002,3)), VALUE = c(1,2,3,2,1,2,3))

I would like to add a column that looks at the VALUE column and says "where, within its year, does this value sit?" I'm having trouble phrasing it concisely, but I want pretty much exactly what cume_dist does, except that cume_dist lumps tied values together and I want them separated. cume_dist takes 1,2,2,3 and gives back 0.25, 0.75, 0.75, 1.00; I need the ties separated, so 1,2,2,3 should give 0.25, 0.50, 0.75, 1.00.
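The difference between the two tie rules can be seen on a single vector. This is a minimal base R sketch of the idea, not dplyr's actual implementation: it assumes that the lumped behaviour corresponds to giving tied values the highest rank in the tie, and the names lumped and separated are made up for illustration.

```r
x <- c(1, 2, 2, 3)

# Tied values share the highest rank in the tie (the lumped behaviour)
lumped <- rank(x, ties.method = "max") / length(x)

# Tied values get distinct ranks, in order of appearance
separated <- rank(x, ties.method = "first") / length(x)

lumped     # 0.25 0.75 0.75 1.00
separated  # 0.25 0.50 0.75 1.00
```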

Here's a line based on cume_dist that works except for the tied values:

dat %>% group_by(YEAR) %>% mutate(cumdist = cume_dist(VALUE))

I tried to deconstruct cume_dist (which is rank in group divided by size of group) and use row_number, which separates ties, for the numerator and divide it by the number of rows in each year. This gives me the correct numerator:

dat %>% group_by(YEAR) %>% mutate(rownumber = row_number(VALUE))

But how do I divide those ranks by the number of values in each year (that is, divide all the ranks in 1999 by 4, and the ranks from 2002 by 3)?

asked Jan 18 at 21:34 by chillchill; edited Jan 19 at 2:32 by IRTFM
  • You almost had it! Divide by n(), the number of rows in the group: dat %>% group_by(YEAR) %>% mutate(result = row_number(VALUE)/n()), or even, in base R, dat$cumdist <- ave(dat$VALUE, dat$YEAR, FUN = function(x) rank(x, ties.method = "first")/length(x)) – dog, commented Jan 18 at 22:18

2 Answers

Is this what you are after?

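library(dplyr)  # the .by argument needs dplyr >= 1.1.0

dat %>%
  # order(order(x)) turns sort positions into tie-separated ranks
  mutate(cumdist = order(order(VALUE)) / n(), .by = YEAR)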

which gives

  YEAR VALUE   cumdist
1 1999     1 0.2500000
2 1999     2 0.5000000
3 1999     3 1.0000000
4 1999     2 0.7500000
5 2002     1 0.3333333
6 2002     2 0.6666667
7 2002     3 1.0000000
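
One caveat when building ranks from order(): order(x) returns the positions that would sort x, not the rank of each element. The two coincide for the data above but differ in general, and applying order() twice gives the tie-separated ranks. A minimal base R sketch, using a made-up vector x that is not from the question:

```r
x <- c(3, 1, 2)  # hypothetical values, chosen so that order and rank differ

order(x)                        # positions that would sort x: 2 3 1
rank(x, ties.method = "first")  # rank of each element:        3 1 2
order(order(x))                 # same as the ranks:           3 1 2
```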

Using frank() in data.table

> library(data.table)
> setDT(dat)
> dat[, cumdist := frank(VALUE, ties.method="first")/.N, by=YEAR]
> dat
    YEAR VALUE   cumdist
   <num> <num>     <num>
1:  1999     1 0.2500000
2:  1999     2 0.5000000
3:  1999     3 1.0000000
4:  1999     2 0.7500000
5:  2002     1 0.3333333
6:  2002     2 0.6666667
7:  2002     3 1.0000000

Or the base R method already mentioned in the comments:

> dat |> 
+   transform(cumdist=ave(VALUE, YEAR, FUN=\(x) rank(x, ties.method='first')/length(x)))
    YEAR VALUE   cumdist
   <num> <num>     <num>
1:  1999     1 0.2500000
2:  1999     2 0.5000000
3:  1999     3 1.0000000
4:  1999     2 0.7500000
5:  2002     1 0.3333333
6:  2002     2 0.6666667
7:  2002     3 1.0000000
