I have the following sample dataframe:
df <- data.frame(
record_id = c(1, 1, 1, 2, 2, 3, 3, 3),
instance = c(NA, NA, 2, NA, 1, 2, NA, NA),
A = c(10, NA, NA, 20, 25, NA, 30, NA),
B = c(NA, 5, NA, NA, 5, 15, NA, 15),
C = c(NA, NA, 3, NA, 5, 20, NA, 20),
D = c(NA, NA, NA, 25, 25, 30, NA, 30)
)
record_id instance A B C D
1 1 NA 10 NA NA NA
2 1 NA NA 5 NA NA
3 1 2 NA NA 3 NA
4 2 NA 20 NA NA 25
5 2 1 25 5 5 25
6 3 2 NA 15 20 30
7 3 NA 30 NA NA NA
8 3 NA NA 15 20 30
If the instance is NA, I want the rows of the same record_id to be collapsed down into one row. In my dataframe, there will not be two or more values in the same column for the same record_id and NA instance group.
In other words, I would like to get:
record_id instance A B C D
1 1 NA 10 5 NA NA
2 1 2 NA NA 3 NA
3 2 NA 20 NA NA 25
4 2 1 25 5 5 25
5 3 2 NA 15 20 30
6 3 NA 30 15 20 30
How can I do this?
I have the following sample dataframe:
df <- data.frame(
record_id = c(1, 1, 1, 2, 2, 3, 3, 3),
instance = c(NA, NA, 2, NA, 1, 2, NA, NA),
A = c(10, NA, NA, 20, 25, NA, 30, NA),
B = c(NA, 5, NA, NA, 5, 15, NA, 15),
C = c(NA, NA, 3, NA, 5, 20, NA, 20),
D = c(NA, NA, NA, 25, 25, 30, NA, 30)
)
record_id instance A B C D
1 1 NA 10 NA NA NA
2 1 NA NA 5 NA NA
3 1 2 NA NA 3 NA
4 2 NA 20 NA NA 25
5 2 1 25 5 5 25
6 3 2 NA 15 20 30
7 3 NA 30 NA NA NA
8 3 NA NA 15 20 30
If the instance is NA, I want the rows of the same record_id to be collapsed down into one row. In my dataframe, there will not be two or more values in the same column for the same record_id and NA instance group.
In other words, I would like to get:
record_id instance A B C D
1 1 NA 10 5 NA NA
2 1 2 NA NA 3 NA
3 2 NA 20 NA NA 25
4 2 1 25 5 5 25
5 3 2 NA 15 20 30
6 3 NA 30 15 20 30
How can I do this?
Share Improve this question edited Mar 29 at 19:46 Andrea asked Mar 29 at 18:31 AndreaAndrea 85811 silver badges24 bronze badges 2- you tagged datatable, but that's ambiguous and even then typically doesn't refer to R objects, please remove it. If you have specific "dialects" in mind such as dplyr or data.table, please use those tags instead. – r2evans Commented Mar 29 at 19:20
- Whoever comes here to find an answer to similar question. There are some related questions one might be interested in, e.g. stackoverflow/questions/28509462/…, – Friede Commented Mar 30 at 7:51
5 Answers
Reset to default 11A base R option with aggregate
sort_by(
aggregate(. ~ record_id + factor(instance, exclude = NULL),
df,
\(x) ifelse(all(is.na(x)), NA, max(x, na.rm = TRUE)),
na.action = na.pass
)[names(df)],
~record_id
)
gives
record_id instance A B C D
2 1 2 NA NA 3 NA
4 1 NA 10 5 NA NA
1 2 1 25 5 5 25
5 2 NA 20 NA NA 25
3 3 2 NA 15 20 30
6 3 NA 30 15 20 30
library(dplyr)
df |>
reframe(
.by = c(record_id, instance),
across(everything(), ~ if (is.na(instance[1])) na.omit(.x)[1] else .x)
)
# record_id instance A B C D
# 1 1 NA 10 5 NA NA
# 2 1 2 NA NA 3 NA
# 3 2 NA 20 NA NA 25
# 4 2 1 25 5 5 25
# 5 3 2 NA 15 20 30
# 6 3 NA 30 15 20 30
ThomasIsCoding already provided an excellent base R version of this. A data.table variant is:
library(data.table)
setDT(df)
df[, lapply(.SD, \(x) if (is.na(instance[1])) na.omit(x)[1] else x), .(record_id, instance)]
# record_id instance A B C D
# <num> <num> <num> <num> <num> <num>
# 1: 1 NA 10 5 NA NA
# 2: 1 2 NA NA 3 NA
# 3: 2 NA 20 NA NA 25
# 4: 2 1 25 5 5 25
# 5: 3 2 NA 15 20 30
# 6: 3 NA 30 15 20 30
You could do:
df %>%
group_by(record_id, instance) %>%
summarize(across(everything(), ~first(sort(.x)))) %>%
ungroup()
which gives:
# A tibble: 6 x 6
record_id instance A B C D
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 1 2 NA NA 3 NA
2 1 NA 10 5 NA NA
3 2 1 25 5 5 25
4 2 NA 20 NA NA 25
5 3 2 NA 15 20 30
6 3 NA 30 15 20 30
Note: this code assumes that you only have one row per record_id with the same instance (that is not NA).
UPDATE: As discussed in the comments, an alternative would be using reframe
:
df |>
reframe(across(everything(), ~first(sort(.x))), .by = c(record_id, instance))
which gives:
record_id instance A B C D
1 1 NA 10 5 NA NA
2 1 2 NA NA 3 NA
3 2 NA 20 NA NA 25
4 2 1 25 5 5 25
5 3 2 NA 15 20 30
6 3 NA 30 15 20 30
A base R option using by
,
> by(df, ~list(record_id, !is.na(instance)), \(x) {
+ sapply(x, \(z) if (!all(is.na(z))) unique(na.omit(z)) else unique(z))
+ }) |> do.call(what='rbind') |> as.data.frame() |> sort_by(~record_id)
record_id instance A B C D
1 1 NA 10 5 NA NA
4 1 2 NA NA 3 NA
2 2 NA 20 NA NA 25
5 2 1 25 5 5 25
3 3 NA 30 15 20 30
6 3 2 NA 15 20 30
and one using data.table
.
> library(data.table); dt <- as.data.table(df)
>
> dt[, lapply(.SD, \(x) if (all(is.na(x))) unique(x) else unique(na.omit(x))),
+ by=.(record_id, instance)]
record_id instance A B C D
<num> <num> <num> <num> <num> <num>
1: 1 NA 10 5 NA NA
2: 1 2 NA NA 3 NA
3: 2 NA 20 NA NA 25
4: 2 1 25 5 5 25
5: 3 2 NA 15 20 30
6: 3 NA 30 15 20 30
Here are two more:
First approach is similar to @r2evans approach, but using reduce
together with coalesce
.
library(dplyr)
library(purrr)
df |>
summarise(across(, ~ reduce(., coalesce)), .by=c(record_id, instance))
Second using a custom function:
library(dplyr)
coalesce_by_column <- function(df) {
return(dplyr::coalesce(!!! as.list(df)))
}
df %>%
summarise(across(everything(), coalesce_by_column), .by = c(record_id, instance))
output:
record_id instance A B C D
1 1 NA 10 5 NA NA
2 1 2 NA NA 3 NA
3 2 NA 20 NA NA 25
4 2 1 25 5 5 25
5 3 2 NA 15 20 30
6 3 NA 30 15 20 30