最新消息:雨落星辰是一个专注网站SEO优化、网站SEO诊断、搜索引擎研究、网络营销推广、网站策划运营及站长类的自媒体原创博客

How to insert a new row for each missing number of one variable while filling with NA for other variables using data.table in R?

programmeradmin4浏览0评论

How to insert a new row every time there is a break in the row column numbering, while filling the other columns with NA?

Edit: Iroha's suggestion based on tidyr::complete() works fine, as do the others. However, when I apply it to millions of rows, it runs for hours.
It seems like a data.table approach would be faster, but I never use it and I can't seem to find a solution.
Any help/advice would be greatly appreciated!

Initial data:

x <- c("a","a","a","a","a","a","a","a","a","b","b","b","b","b","b","b")
y <- c(1,1,2,4,4,4,7,7,11,1,3,3,3,6,6,8)
dat0 <- data.frame(x,y)  

>dat0
   x  y
1  a  1
2  a  1
3  a  2
4  a  4
5  a  4
6  a  4
7  a  7
8  a  7
9  a 11
10 b  1
11 b  3
12 b  3
13 b  3
14 b  6
15 b  6
16 b  8  

Desired output:

> dat1
      x  y
1     a  1
2     a  1
3     a  2
4  <NA>  3
5     a  4
6     a  4
7     a  4
8  <NA>  5
9  <NA>  6
10    a  7
11    a  7
12 <NA>  8
13 <NA>  9
14 <NA> 10
15    a 11
16    b  1
17 <NA>  2
18    b  3
19    b  3
20    b  3
21 <NA>  4
22 <NA>  5
23    b  6
24    b  6
25 <NA>  7
26    b  8  

Thanks for help

How to insert a new row every time there is a break in the row column numbering, while filling the other columns with NA?

Edit: Iroha's suggestion based on tidyr::complete() works fine, as do the others. However, when I apply it to millions of rows, it runs for hours.
It seems like a data.table approach would be faster, but I never use it and I can't seem to find a solution.
Any help/advice would be greatly appreciated!

Initial data:

x <- c("a","a","a","a","a","a","a","a","a","b","b","b","b","b","b","b")
y <- c(1,1,2,4,4,4,7,7,11,1,3,3,3,6,6,8)
dat0 <- data.frame(x,y)  

>dat0
   x  y
1  a  1
2  a  1
3  a  2
4  a  4
5  a  4
6  a  4
7  a  7
8  a  7
9  a 11
10 b  1
11 b  3
12 b  3
13 b  3
14 b  6
15 b  6
16 b  8  

Desired output:

> dat1
      x  y
1     a  1
2     a  1
3     a  2
4  <NA>  3
5     a  4
6     a  4
7     a  4
8  <NA>  5
9  <NA>  6
10    a  7
11    a  7
12 <NA>  8
13 <NA>  9
14 <NA> 10
15    a 11
16    b  1
17 <NA>  2
18    b  3
19    b  3
20    b  3
21 <NA>  4
22 <NA>  5
23    b  6
24    b  6
25 <NA>  7
26    b  8  

Thanks for help

Share Improve this question edited Feb 16 at 21:46 denis asked Feb 14 at 6:12 denisdenis 7944 silver badges14 bronze badges
Add a comment  | 

3 Answers 3

Reset to default 2

This is pretty straight forward with tidyr::complete() except that grouping variables can't be modified so need to be copied first.

library(tidyr)
library(dplyr)

dat0 |> 
  group_by(g2 = x) |> 
  complete(y = full_seq(y, 1))  |> 
  ungroup(g2) |> 
  select(-g2)

# A tibble: 26 × 2
       y x    
   <dbl> <chr>
 1     1 a    
 2     1 a    
 3     2 a    
 4     3 NA   
 5     4 a    
 6     4 a    
 7     4 a    
 8     5 NA   
 9     6 NA   
10     7 a    
# ℹ 16 more rows
# ℹ Use `print(n = ...)` to see more rows

A bit long-winded:

library(dplyr)

reframe(dat0, y=seq(min(y), max(y)), .by=x) %>%
  anti_join(dat0) %>%
  mutate(x2=x, x=NA) %>%
  full_join(dat0) %>%
  mutate(x2=coalesce(x, x2)) %>%
  arrange(x2, y) %>%
  select(-x2)

      x  y
1     a  1
2     a  1
3     a  2
4  <NA>  3
5     a  4
6     a  4
7     a  4
8  <NA>  5
9  <NA>  6
10    a  7
11    a  7
12 <NA>  8
13 <NA>  9
14 <NA> 10
15    a 11
16    b  1
17 <NA>  2
18    b  3
19    b  3
20    b  3
21 <NA>  4
22 <NA>  5
23    b  6
24    b  6
25 <NA>  7
26    b  8

Hmm odd requirement, but in base R you could do

res <- dat0    
for(let in unique(dat0$x)){
  missing_y <- setdiff(1:max(dat0[dat0$x==let,]$y), dat0[dat0$x==let,]$y)
  missing_x <- rep(NA, length(missing_y))
  res <- rbind(data.frame(x=missing_x, y = missing_y),res)
}

But I would recommend

library(tidyverse)
res_filled <- dat0 %>% group_by(x) %>% complete(y = seq(min(y),max(y),1))
发布评论

评论列表(0)

  1. 暂无评论