最新消息:雨落星辰是一个专注网站SEO优化、网站SEO诊断、搜索引擎研究、网络营销推广、网站策划运营及站长类的自媒体原创博客

dataframe - In R, how can I collapse the data of grouped rows into one row? - Stack Overflow

programmeradmin1浏览0评论

I have the following sample dataframe:

df <- data.frame(
    record_id = c(1, 1, 1, 2, 2, 3, 3, 3),
    instance = c(NA, NA, 2, NA, 1, 2, NA, NA),
    A = c(10, NA, NA, 20, 25, NA, 30, NA),
    B = c(NA, 5, NA, NA, 5, 15, NA, 15),
    C = c(NA, NA, 3, NA, 5, 20, NA, 20),
    D = c(NA, NA, NA, 25, 25, 30, NA, 30)
)

  record_id instance  A  B  C  D
1         1       NA 10 NA NA NA
2         1       NA NA  5 NA NA
3         1        2 NA NA  3 NA
4         2       NA 20 NA NA 25
5         2        1 25  5  5 25
6         3        2 NA 15 20 30
7         3       NA 30 NA NA NA
8         3       NA NA 15 20 30

If the instance is NA, I want the rows of the same record_id to be collapsed down into one row. In my dataframe, there will not be two or more values in the same column for the same record_id and NA instance group.

In other words, I would like to get:

  record_id instance  A  B  C  D
1         1       NA 10  5 NA NA
2         1        2 NA NA  3 NA
3         2       NA 20 NA NA 25
4         2        1 25  5  5 25
5         3        2 NA 15 20 30
6         3       NA 30 15 20 30

How can I do this?

I have the following sample dataframe:

df <- data.frame(
    record_id = c(1, 1, 1, 2, 2, 3, 3, 3),
    instance = c(NA, NA, 2, NA, 1, 2, NA, NA),
    A = c(10, NA, NA, 20, 25, NA, 30, NA),
    B = c(NA, 5, NA, NA, 5, 15, NA, 15),
    C = c(NA, NA, 3, NA, 5, 20, NA, 20),
    D = c(NA, NA, NA, 25, 25, 30, NA, 30)
)

  record_id instance  A  B  C  D
1         1       NA 10 NA NA NA
2         1       NA NA  5 NA NA
3         1        2 NA NA  3 NA
4         2       NA 20 NA NA 25
5         2        1 25  5  5 25
6         3        2 NA 15 20 30
7         3       NA 30 NA NA NA
8         3       NA NA 15 20 30

If the instance is NA, I want the rows of the same record_id to be collapsed down into one row. In my dataframe, there will not be two or more values in the same column for the same record_id and NA instance group.

In other words, I would like to get:

  record_id instance  A  B  C  D
1         1       NA 10  5 NA NA
2         1        2 NA NA  3 NA
3         2       NA 20 NA NA 25
4         2        1 25  5  5 25
5         3        2 NA 15 20 30
6         3       NA 30 15 20 30

How can I do this?

Share Improve this question edited Mar 29 at 19:46 Andrea asked Mar 29 at 18:31 AndreaAndrea 85811 silver badges24 bronze badges 2
  • you tagged datatable, but that's ambiguous and even then typically doesn't refer to R objects, please remove it. If you have specific "dialects" in mind such as dplyr or data.table, please use those tags instead. – r2evans Commented Mar 29 at 19:20
  • Whoever comes here to find an answer to similar question. There are some related questions one might be interested in, e.g. stackoverflow/questions/28509462/…, – Friede Commented Mar 30 at 7:51
Add a comment  | 

5 Answers 5

Reset to default 11

A base R option with aggregate

sort_by(
  aggregate(. ~ record_id + factor(instance, exclude = NULL),
    df,
    \(x) ifelse(all(is.na(x)), NA, max(x, na.rm = TRUE)),
    na.action = na.pass
  )[names(df)],
  ~record_id
)

gives

  record_id instance  A  B  C  D
2         1        2 NA NA  3 NA
4         1       NA 10  5 NA NA
1         2        1 25  5  5 25
5         2       NA 20 NA NA 25
3         3        2 NA 15 20 30
6         3       NA 30 15 20 30
library(dplyr)
df |>
  reframe(
    .by = c(record_id, instance),
    across(everything(), ~ if (is.na(instance[1])) na.omit(.x)[1] else .x)
  )
#   record_id instance  A  B  C  D
# 1         1       NA 10  5 NA NA
# 2         1        2 NA NA  3 NA
# 3         2       NA 20 NA NA 25
# 4         2        1 25  5  5 25
# 5         3        2 NA 15 20 30
# 6         3       NA 30 15 20 30

ThomasIsCoding already provided an excellent base R version of this. A data.table variant is:

library(data.table)
setDT(df)
df[, lapply(.SD, \(x) if (is.na(instance[1])) na.omit(x)[1] else x), .(record_id, instance)]
#    record_id instance     A     B     C     D
#        <num>    <num> <num> <num> <num> <num>
# 1:         1       NA    10     5    NA    NA
# 2:         1        2    NA    NA     3    NA
# 3:         2       NA    20    NA    NA    25
# 4:         2        1    25     5     5    25
# 5:         3        2    NA    15    20    30
# 6:         3       NA    30    15    20    30

You could do:

df %>%
  group_by(record_id, instance) %>%
  summarize(across(everything(), ~first(sort(.x)))) %>%
  ungroup()

which gives:

# A tibble: 6 x 6
  record_id instance     A     B     C     D
      <dbl>    <dbl> <dbl> <dbl> <dbl> <dbl>
1         1        2    NA    NA     3    NA
2         1       NA    10     5    NA    NA
3         2        1    25     5     5    25
4         2       NA    20    NA    NA    25
5         3        2    NA    15    20    30
6         3       NA    30    15    20    30

Note: this code assumes that you only have one row per record_id with the same instance (that is not NA).


UPDATE: As discussed in the comments, an alternative would be using reframe:

df |> 
  reframe(across(everything(), ~first(sort(.x))), .by = c(record_id, instance))

which gives:

  record_id instance  A  B  C  D
1         1       NA 10  5 NA NA
2         1        2 NA NA  3 NA
3         2       NA 20 NA NA 25
4         2        1 25  5  5 25
5         3        2 NA 15 20 30
6         3       NA 30 15 20 30

A base R option using by,

> by(df, ~list(record_id, !is.na(instance)), \(x) {
+   sapply(x, \(z) if (!all(is.na(z))) unique(na.omit(z)) else unique(z))
+ }) |> do.call(what='rbind') |> as.data.frame() |> sort_by(~record_id)
  record_id instance  A  B  C  D
1         1       NA 10  5 NA NA
4         1        2 NA NA  3 NA
2         2       NA 20 NA NA 25
5         2        1 25  5  5 25
3         3       NA 30 15 20 30
6         3        2 NA 15 20 30

and one using data.table.

> library(data.table); dt <- as.data.table(df)
>
> dt[, lapply(.SD, \(x) if (all(is.na(x))) unique(x) else unique(na.omit(x))),
+    by=.(record_id, instance)]
   record_id instance     A     B     C     D
       <num>    <num> <num> <num> <num> <num>
1:         1       NA    10     5    NA    NA
2:         1        2    NA    NA     3    NA
3:         2       NA    20    NA    NA    25
4:         2        1    25     5     5    25
5:         3        2    NA    15    20    30
6:         3       NA    30    15    20    30

Here are two more:
First approach is similar to @r2evans approach, but using reduce together with coalesce .

library(dplyr)
library(purrr)

df |> 
  summarise(across(, ~ reduce(., coalesce)), .by=c(record_id, instance))

Second using a custom function:

library(dplyr)

coalesce_by_column <- function(df) {
  return(dplyr::coalesce(!!! as.list(df)))
}

df %>%
  summarise(across(everything(), coalesce_by_column), .by = c(record_id, instance))    

output:

  record_id instance  A  B  C  D
1         1       NA 10  5 NA NA
2         1        2 NA NA  3 NA
3         2       NA 20 NA NA 25
4         2        1 25  5  5 25
5         3        2 NA 15 20 30
6         3       NA 30 15 20 30
发布评论

评论列表(0)

  1. 暂无评论