r - Conditionally remove missing cases in long format

I have a long-format dataset with 3 assessment points, e.g.:

dataset = data.frame(
  id = c(1,1,1,2,2,2,3,3,3,4,4,4,5,5,5),
  assessment = c(1,2,3,1,2,3,1,2,3,1,2,3,1,2,3),
  scoreA = c(7,9,5,NA,5,11,2,3,9,1,NA,NA,7,NA,5),
  scoreB = c(1,2,7,6,1,11,3,3,2,1,12,NA,NA,4,5)
)

I would like to remove all observations belonging to the same ID, if there is any NAs at assessment == 1. For instance, the ID's 2 and 5 have NAs at assessment == 1, so they should be excluded.

I'm trying to sort this out using {dplyr} and {tidyr} functions, along group_by(), filter(), ...

I have a long-format dataset with 3 assessment points, e.g.:

dataset = data.frame(
  id = c(1,1,1,2,2,2,3,3,3,4,4,4,5,5,5),
  assessment = c(1,2,3,1,2,3,1,2,3,1,2,3,1,2,3),
  scoreA = c(7,9,5,NA,5,11,2,3,9,1,NA,NA,7,NA,5),
  scoreB = c(1,2,7,6,1,11,3,3,2,1,12,NA,NA,4,5)
)

I'm trying to sort this out using {dplyr} and {tidyr} functions, along group_by(), filter(), ...

Share Improve this question edited Mar 19 at 15:22 Darren Tsai 36.3k5 gold badges25 silver badges57 bronze badges asked Mar 18 at 22:20 DrNight 391 silver badge4 bronze badges

Hi, if any of the answers have solved your question, you can consider accepting the one that best meets your request by clicking the check mark. – Darren Tsai Commented Mar 20 at 2:46

Add a comment |

4 Answers 4

Sorted by: Reset to default 5

We could keep just those id's for which there are not any missing scoreA/scoreB at assessment 1:

library(dplyr)
dataset |>
  filter(!any(is.na(scoreA[assessment == 1]) | 
              is.na(scoreB[assessment == 1])), .by = id)

  id assessment scoreA scoreB
1  1          1      7      1
2  1          2      9      2
3  1          3      5      7
4  3          1      2      3
5  3          2      3      3
6  3          3      9      2
7  4          1      1      1
8  4          2     NA     12
9  4          3     NA     NA

Alternatively, we could filter to just assessment 1 rows without NA's, and join that to the original data, for the same result.

dataset |>
  filter(assessment == 1) |>
  tidyr::drop_na() |>
  # distinct(id) |> # only needed if there might be multiple assessment 1's for an id
  left_join(dataset)

You can use subset + na.omit like below

> subset(dataset, id %in% na.omit(subset(dataset, assessment == 1))$id)
   id assessment scoreA scoreB
1   1          1      7      1
2   1          2      9      2
3   1          3      5      7
7   3          1      2      3
8   3          2      3      3
9   3          3      9      2
10  4          1      1      1
11  4          2     NA     12
12  4          3     NA     NA

Or in base R you can use !complete.cases & dataset$assessment == 1 to get the ids you want to filter out

dataset[!dataset$id %in% dataset[!complete.cases(dataset) & dataset$assessment == 1,]$id, ]

   id assessment scoreA scoreB
1   1          1      7      1
2   1          2      9      2
3   1          3      5      7
7   3          1      2      3
8   3          2      3      3
9   3          3      9      2
10  4          1      1      1
11  4          2     NA     12
12  4          3     NA     NA

Here is a dplyr solution with if_any() or pick(). The advantage is that you can use <tidy-select> to apply the same logical function to multiple columns.

library(dplyr)

dataset %>%
  filter(!any(if_any(scoreA:scoreB, is.na) & assessment == 1),
         .by = id)

dataset %>%
  filter(!any(is.na(pick(scoreA:scoreB)[assessment == 1, ])),
         .by = id)

#   id assessment scoreA scoreB
# 1  1          1      7      1
# 2  1          2      9      2
# 3  1          3      5      7
# 4  3          1      2      3
# 5  3          2      3      3
# 6  3          3      9      2
# 7  4          1      1      1
# 8  4          2     NA     12
# 9  4          3     NA     NA

科技改变生活-雨落星辰 - 所有的伟大,都源于一个勇敢的开始

r - Conditionally remove missing cases in long format - Stack Overflow

4 Answers 4

与本文相关的文章

评论列表(0)