I have a df:
df <- data.frame(nr = c(rep("n01", 10),
rep("n03", 13),
rep("n04", 8),
rep("n06", 14),
rep("n08", 13),
rep("n12", 14)),
yr = c(2012:2021,
2010:2022,
2013:2020,
2010:2023,
2011:2023,
2009:2022),
md = c(rep("no", 5),
rep("yes", 8),
rep("no", 13),
rep("yes", 1),
rep("no", 9),
rep("yes", 5),
rep("no", 20),
rep("yes", 2),
rep("no", 9)))
I want to filter my dataset based on the md column for each nr. I want to keep only the nr s that have at least 2x a 'yes' in 2 subsequent years, followed by at least 1 "no". This "no" cannot be followed by a "yes" anymore in further years. But the initial 2x "yes" can be preceded by anything. Meaning that in this dummy df I want to keep only n03, n06 and n12. How can I do that? (preferably with dplyr) Of course my real df is waaay bigger so I need an automated review of the no/yes pattern of all rows belonging to a specific nr.
I have a df:
df <- data.frame(nr = c(rep("n01", 10),
rep("n03", 13),
rep("n04", 8),
rep("n06", 14),
rep("n08", 13),
rep("n12", 14)),
yr = c(2012:2021,
2010:2022,
2013:2020,
2010:2023,
2011:2023,
2009:2022),
md = c(rep("no", 5),
rep("yes", 8),
rep("no", 13),
rep("yes", 1),
rep("no", 9),
rep("yes", 5),
rep("no", 20),
rep("yes", 2),
rep("no", 9)))
I want to filter my dataset based on the md column for each nr. I want to keep only the nr s that have at least 2x a 'yes' in 2 subsequent years, followed by at least 1 "no". This "no" cannot be followed by a "yes" anymore in further years. But the initial 2x "yes" can be preceded by anything. Meaning that in this dummy df I want to keep only n03, n06 and n12. How can I do that? (preferably with dplyr) Of course my real df is waaay bigger so I need an automated review of the no/yes pattern of all rows belonging to a specific nr.
Share Improve this question asked Feb 7 at 16:15 LiekeLieke 1811 silver badge15 bronze badges2 Answers
Reset to default 2You can try the following dplyr
approach, which uses cumsum
to look at the total number of "yes" by nr
then scans for any "no" after the last yes (using max
). Then filter
s and removes temp variables
library(dplyr)
df %>%
mutate(temp = cumsum(md %in% "yes"),
temp2 = temp == max(temp) & md %in% "no",
.by = nr) %>%
filter(any(temp2) & temp >= 2, .by = nr) %>%
select(-starts_with("temp"))
Note you could do all this in one messy filter
statement and remove the need to create temp
variables:
df %>%
filter(any(cumsum(md %in% "yes") == max(cumsum(md %in% "yes")) &
md %in% "no") &
cumsum(md %in% "yes") >= 2,
.by = nr)
Output:
nr yr md
1 n03 2011 yes
2 n03 2012 yes
3 n03 2013 no
4 n03 2014 no
5 n03 2015 no
6 n03 2016 no
7 n03 2017 no
8 n03 2018 no
9 n03 2019 no
10 n03 2020 no
11 n03 2021 no
12 n03 2022 no
13 n06 2016 yes
14 n06 2017 yes
15 n06 2018 yes
16 n06 2019 yes
17 n06 2020 no
18 n06 2021 no
19 n06 2022 no
20 n06 2023 no
21 n12 2013 yes
22 n12 2014 no
23 n12 2015 no
24 n12 2016 no
25 n12 2017 no
26 n12 2018 no
27 n12 2019 no
28 n12 2020 no
29 n12 2021 no
30 n12 2022 no
You can try this one-liner, which encodes md to binary for each nr, grepl
s for '110+$'
pattern and subset
s for TRUE
s.
> df |> subset(ave(md == 'yes', nr, FUN=\(x) grepl('110+$', paste(+x, collapse=''))))
nr yr md
11 n03 2010 yes
12 n03 2011 yes
13 n03 2012 yes
14 n03 2013 no
15 n03 2014 no
16 n03 2015 no
17 n03 2016 no
18 n03 2017 no
19 n03 2018 no
20 n03 2019 no
21 n03 2020 no
22 n03 2021 no
23 n03 2022 no
32 n06 2010 no
33 n06 2011 no
34 n06 2012 no
35 n06 2013 no
36 n06 2014 no
37 n06 2015 yes
38 n06 2016 yes
39 n06 2017 yes
40 n06 2018 yes
41 n06 2019 yes
42 n06 2020 no
43 n06 2021 no
44 n06 2022 no
45 n06 2023 no
59 n12 2009 no
60 n12 2010 no
61 n12 2011 no
62 n12 2012 yes
63 n12 2013 yes
64 n12 2014 no
65 n12 2015 no
66 n12 2016 no
67 n12 2017 no
68 n12 2018 no
69 n12 2019 no
70 n12 2020 no
71 n12 2021 no
72 n12 2022 no