r - Grepl with logical operator AND across multiple alphanumeric columns

I have an enormous dataset that contains 25 columns of medical codes. Each row represents one medical visit. I need to create a new column that flags where two codes appear together on each row. In other words, I want to grep across multiple columns and flag where the two codes are present together the medical visit.

I thougt about creating a new variable for each alphanumeric code I want to grep on, then creating a final variables with case_when(), but is there a faster way to do this?

Here's a toy data set:

diag_p <- c('a1', 'a4', 'c5', 'a4', 'b1')
odiag1 <- c('b1', 'b2', 'c3', 'd4', 'e5')
odiag2 <- c('f1', 'g4', 'h4', 'i5', 'a1')
odiag3 <- c('a6', 'b1', 'c8', 'a1', 'e10')
sample_df <- data.frame(diag_p, odiag1, odiag2, odiag3)

This code works well to search across columns with | and two grep statements, and with >1 at the end of the chunk, which will count more than one match across columns, but it doesn't work quite right. I need it to match (a1 or a4) & (b1 or b4).

new_df <- sample_df %>%
    mutate(
        new_col = rowSums(sapply(select(., diag_p, odiag1:odiag3),
            function(x) (grepl("a[14]", x)) | (grepl("b[14]", x)))) > 1
        )

Is there a way to do this without making one new column for each grep statement, then making a final variable with case_when()?

I updated the code to get rid of the case_when() which I see was confusing folks. This is what I want the code to look like, so that (a1 or a4) AND (b1 or b4) produce a match TRUE or FALSE on each row:

structure(list(diag_p = c("a1", "a4", "c5", "a4", "b1"), odiag1 = c("b1", 
"b2", "c3", "d4", "e5"), odiag2 = c("f1", "g4", "h4", "i5", "a1"
), odiag3 = c("a6", "b1", "c8", "a1", "e10"), new_col = c(TRUE, 
TRUE, FALSE, FALSE, TRUE)), row.names = c(NA, -5L), class = "data.frame")

I thougt about creating a new variable for each alphanumeric code I want to grep on, then creating a final variables with case_when(), but is there a faster way to do this?

Here's a toy data set:

diag_p <- c('a1', 'a4', 'c5', 'a4', 'b1')
odiag1 <- c('b1', 'b2', 'c3', 'd4', 'e5')
odiag2 <- c('f1', 'g4', 'h4', 'i5', 'a1')
odiag3 <- c('a6', 'b1', 'c8', 'a1', 'e10')
sample_df <- data.frame(diag_p, odiag1, odiag2, odiag3)

new_df <- sample_df %>%
    mutate(
        new_col = rowSums(sapply(select(., diag_p, odiag1:odiag3),
            function(x) (grepl("a[14]", x)) | (grepl("b[14]", x)))) > 1
        )

Is there a way to do this without making one new column for each grep statement, then making a final variable with case_when()?

structure(list(diag_p = c("a1", "a4", "c5", "a4", "b1"), odiag1 = c("b1", 
"b2", "c3", "d4", "e5"), odiag2 = c("f1", "g4", "h4", "i5", "a1"
), odiag3 = c("a6", "b1", "c8", "a1", "e10"), new_col = c(TRUE, 
TRUE, FALSE, FALSE, TRUE)), row.names = c(NA, -5L), class = "data.frame")

Share Improve this question edited Jan 29 at 22:37 CPlus 4,71844 gold badges30 silver badges72 bronze badges asked Jan 11 at 2:11 orion34 51 silver badge2 bronze badges

1 Does the result in new_df return the results you want? Could you clarify what "doesnt work quite right" means? Good luck! – jpsmith Commented Jan 11 at 2:50
1 From the problem description I would have expected the connective between the two grepls to be & rather than |. – IRTFM Commented Jan 11 at 10:03
Try sample_df$new_col <- apply(sample_df, 1, \(x) {any(x %in% c("a1", "a4")) && any(x %in% c("b1", "b4"))})? – jpsmith Commented Jan 13 at 18:23
@jpsmith this works! Similar to answer below. How would you modify the code if you wanted to further filter cases where age was less than 50? Would you throw that whole apply() function in an ifelse() function? – orion34 Commented Jan 14 at 2:11

Add a comment |

2 Answers 2

Sorted by: Reset to default 0

Using base R:

sample_df$flag <-
  apply(sample_df, 1, \(row) {
    any(grepl("a[14]", row)) & any(grepl("b[14]", row))
  })

sample_df
#>   diag_p odiag1 odiag2 odiag3 new_col
#> 1     a1     b1     f1     a6    TRUE
#> 2     a4     b2     g4     b1    TRUE
#> 3     c5     c3     h4     c8   FALSE
#> 4     a4     d4     i5     a1   FALSE
#> 5     b1     e5     a1    e10    TRUE

Though I'm not sure I understand your goal, I came up with this method of flagging rows that contain both (a1 or a4) and (b1 or b4).

diag_p <- c('a1', 'a4', 'c5', 'a4', 'b1')
odiag1 <- c('b1', 'b2', 'c3', 'd4', 'e5')
odiag2 <- c('f1', 'g4', 'h4', 'i5', 'a1')
odiag3 <- c('a6', 'b1', 'c8', 'a1', 'e10')
sample_df <- data.frame(diag_p, odiag1, odiag2, odiag3)

library(tidyverse)

sample_df |> rowwise() |> 
  mutate(Flag = any(str_detect(c_across(diag_p:odiag3), "a(1|4)")) &
           any(str_detect(c_across(diag_p:odiag3), "b(1|4)")))
#> # A tibble: 5 × 5
#> # Rowwise: 
#>   diag_p odiag1 odiag2 odiag3 Flag 
#>   <chr>  <chr>  <chr>  <chr>  <lgl>
#> 1 a1     b1     f1     a6     TRUE 
#> 2 a4     b2     g4     b1     TRUE 
#> 3 c5     c3     h4     c8     FALSE
#> 4 a4     d4     i5     a1     FALSE
#> 5 b1     e5     a1     e10    TRUE

^{Created on 2025-01-10 with reprex v2.1.1}

科技改变生活-雨落星辰 - 所有的伟大,都源于一个勇敢的开始

r - Grepl with logical operator AND across multiple alphanumeric columns - Stack Overflow

2 Answers 2

与本文相关的文章

评论列表(0)