最新消息:雨落星辰是一个专注网站SEO优化、网站SEO诊断、搜索引擎研究、网络营销推广、网站策划运营及站长类的自媒体原创博客

r - Use data from multiple years (=rows) to filter another column - Stack Overflow

programmeradmin2浏览0评论

I have a df:

df <- data.frame(nr = c(rep("n01", 10),
                    rep("n03", 13),
                    rep("n04", 8),
                    rep("n06", 14),
                    rep("n08", 13),
                    rep("n12", 14)),
             yr = c(2012:2021, 
                    2010:2022,
                    2013:2020,
                    2010:2023,
                    2011:2023,
                    2009:2022),
             md = c(rep("no", 5),
                    rep("yes", 8),
                    rep("no", 13),
                    rep("yes", 1),
                    rep("no", 9),
                    rep("yes", 5),
                    rep("no", 20),
                    rep("yes", 2),
                    rep("no", 9)))

I want to filter my dataset based on the md column for each nr. I want to keep only the nr s that have at least 2x a 'yes' in 2 subsequent years, followed by at least 1 "no". This "no" cannot be followed by a "yes" anymore in further years. But the initial 2x "yes" can be preceded by anything. Meaning that in this dummy df I want to keep only n03, n06 and n12. How can I do that? (preferably with dplyr) Of course my real df is waaay bigger so I need an automated review of the no/yes pattern of all rows belonging to a specific nr.

I have a df:

df <- data.frame(nr = c(rep("n01", 10),
                    rep("n03", 13),
                    rep("n04", 8),
                    rep("n06", 14),
                    rep("n08", 13),
                    rep("n12", 14)),
             yr = c(2012:2021, 
                    2010:2022,
                    2013:2020,
                    2010:2023,
                    2011:2023,
                    2009:2022),
             md = c(rep("no", 5),
                    rep("yes", 8),
                    rep("no", 13),
                    rep("yes", 1),
                    rep("no", 9),
                    rep("yes", 5),
                    rep("no", 20),
                    rep("yes", 2),
                    rep("no", 9)))

I want to filter my dataset based on the md column for each nr. I want to keep only the nr s that have at least 2x a 'yes' in 2 subsequent years, followed by at least 1 "no". This "no" cannot be followed by a "yes" anymore in further years. But the initial 2x "yes" can be preceded by anything. Meaning that in this dummy df I want to keep only n03, n06 and n12. How can I do that? (preferably with dplyr) Of course my real df is waaay bigger so I need an automated review of the no/yes pattern of all rows belonging to a specific nr.

Share Improve this question asked Feb 7 at 16:15 LiekeLieke 1811 silver badge15 bronze badges
Add a comment  | 

2 Answers 2

Reset to default 2

You can try the following dplyr approach, which uses cumsum to look at the total number of "yes" by nr then scans for any "no" after the last yes (using max). Then filters and removes temp variables

library(dplyr)

df %>%
  mutate(temp = cumsum(md %in% "yes"),
         temp2 = temp == max(temp) & md %in% "no",
         .by = nr) %>%
  filter(any(temp2) & temp >= 2, .by = nr) %>% 
  select(-starts_with("temp"))

Note you could do all this in one messy filter statement and remove the need to create temp variables:

df %>%
  filter(any(cumsum(md %in% "yes") == max(cumsum(md %in% "yes")) &
               md %in% "no") & 
           cumsum(md %in% "yes") >= 2, 
         .by = nr) 

Output:

   nr   yr  md
1  n03 2011 yes
2  n03 2012 yes
3  n03 2013  no
4  n03 2014  no
5  n03 2015  no
6  n03 2016  no
7  n03 2017  no
8  n03 2018  no
9  n03 2019  no
10 n03 2020  no
11 n03 2021  no
12 n03 2022  no
13 n06 2016 yes
14 n06 2017 yes
15 n06 2018 yes
16 n06 2019 yes
17 n06 2020  no
18 n06 2021  no
19 n06 2022  no
20 n06 2023  no
21 n12 2013 yes
22 n12 2014  no
23 n12 2015  no
24 n12 2016  no
25 n12 2017  no
26 n12 2018  no
27 n12 2019  no
28 n12 2020  no
29 n12 2021  no
30 n12 2022  no

You can try this one-liner, which encodes md to binary for each nr, grepls for '110+$' pattern and subsets for TRUEs.

> df |> subset(ave(md == 'yes', nr, FUN=\(x) grepl('110+$', paste(+x, collapse=''))))
    nr   yr  md
11 n03 2010 yes
12 n03 2011 yes
13 n03 2012 yes
14 n03 2013  no
15 n03 2014  no
16 n03 2015  no
17 n03 2016  no
18 n03 2017  no
19 n03 2018  no
20 n03 2019  no
21 n03 2020  no
22 n03 2021  no
23 n03 2022  no
32 n06 2010  no
33 n06 2011  no
34 n06 2012  no
35 n06 2013  no
36 n06 2014  no
37 n06 2015 yes
38 n06 2016 yes
39 n06 2017 yes
40 n06 2018 yes
41 n06 2019 yes
42 n06 2020  no
43 n06 2021  no
44 n06 2022  no
45 n06 2023  no
59 n12 2009  no
60 n12 2010  no
61 n12 2011  no
62 n12 2012 yes
63 n12 2013 yes
64 n12 2014  no
65 n12 2015  no
66 n12 2016  no
67 n12 2017  no
68 n12 2018  no
69 n12 2019  no
70 n12 2020  no
71 n12 2021  no
72 n12 2022  no
发布评论

评论列表(0)

  1. 暂无评论