
r - Grepl with logical operator AND across multiple alphanumeric columns - Stack Overflow


I have an enormous dataset that contains 25 columns of medical codes. Each row represents one medical visit. I need to create a new column that flags rows where two codes appear together. In other words, I want to grep across multiple columns and flag where both codes are present in the same medical visit.

I thought about creating a new variable for each alphanumeric code I want to grep on, then creating a final variable with case_when(), but is there a faster way to do this?

Here's a toy data set:

diag_p <- c('a1', 'a4', 'c5', 'a4', 'b1')
odiag1 <- c('b1', 'b2', 'c3', 'd4', 'e5')
odiag2 <- c('f1', 'g4', 'h4', 'i5', 'a1')
odiag3 <- c('a6', 'b1', 'c8', 'a1', 'e10')
sample_df <- data.frame(diag_p, odiag1, odiag2, odiag3)

The code below searches across columns using two grepl() calls joined with |, and the > 1 at the end of the chunk flags rows with more than one matching column. It doesn't work quite right, though: I need it to match (a1 or a4) & (b1 or b4).

new_df <- sample_df %>%
    mutate(
        new_col = rowSums(sapply(select(., diag_p, odiag1:odiag3),
            function(x) (grepl("a[14]", x)) | (grepl("b[14]", x)))) > 1
        )
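
To see where this attempt goes wrong, here is a minimal base-R check on the same toy data (dplyr is left out only to keep the sketch self-contained). Because the two patterns are joined with |, every matching cell counts the same, so a row with two "a" codes and no "b" code also reaches a count of 2.

```r
# Toy data from the question
sample_df <- data.frame(
  diag_p = c("a1", "a4", "c5", "a4", "b1"),
  odiag1 = c("b1", "b2", "c3", "d4", "e5"),
  odiag2 = c("f1", "g4", "h4", "i5", "a1"),
  odiag3 = c("a6", "b1", "c8", "a1", "e10")
)

# One logical per cell: TRUE if the cell matches either pattern
either <- sapply(sample_df, function(x) grepl("a[14]", x) | grepl("b[14]", x))

# Count matching cells per row and flag rows with more than one match
attempt <- rowSums(either) > 1
attempt
# TRUE TRUE FALSE TRUE TRUE
# Row 4 (a4, d4, i5, a1) is flagged even though it has no b1/b4 code.
```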

Is there a way to do this without making one new column for each grep statement, then making a final variable with case_when()?

I updated the code to get rid of the case_when(), which I see was confusing folks. This is what I want the output to look like, so that (a1 or a4) AND (b1 or b4) produces TRUE or FALSE on each row:

structure(list(diag_p = c("a1", "a4", "c5", "a4", "b1"), odiag1 = c("b1", 
"b2", "c3", "d4", "e5"), odiag2 = c("f1", "g4", "h4", "i5", "a1"
), odiag3 = c("a6", "b1", "c8", "a1", "e10"), new_col = c(TRUE, 
TRUE, FALSE, FALSE, TRUE)), row.names = c(NA, -5L), class = "data.frame")

asked Jan 11 at 2:11 by orion34; edited Jan 29 at 22:37 by CPlus
  • 1 Does the result in new_df return the results you want? Could you clarify what "doesn't work quite right" means? Good luck! – jpsmith Commented Jan 11 at 2:50
  • 1 From the problem description I would have expected the connective between the two grepls to be & rather than |. – IRTFM Commented Jan 11 at 10:03
  • Try sample_df$new_col <- apply(sample_df, 1, \(x) {any(x %in% c("a1", "a4")) && any(x %in% c("b1", "b4"))})? – jpsmith Commented Jan 13 at 18:23
  • @jpsmith this works! Similar to answer below. How would you modify the code if you wanted to further filter cases where age was less than 50? Would you throw that whole apply() function in an ifelse() function? – orion34 Commented Jan 14 at 2:11
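
On the follow-up about age, one possible approach is to combine the row-wise result with a vectorized comparison using &, rather than wrapping apply() in ifelse(). The sketch below assumes the data has a numeric `age` column; that column and its values are hypothetical and not part of the toy data. It also assumes the goal is to flag only visits where both code groups are present and age is below 50.

```r
# Toy data from the question plus a hypothetical age column (an assumption)
sample_df <- data.frame(
  diag_p = c("a1", "a4", "c5", "a4", "b1"),
  odiag1 = c("b1", "b2", "c3", "d4", "e5"),
  odiag2 = c("f1", "g4", "h4", "i5", "a1"),
  odiag3 = c("a6", "b1", "c8", "a1", "e10"),
  age    = c(34, 61, 45, 52, 47)  # made-up values for illustration
)

# Restrict the row-wise search to the code columns so age is not searched
code_cols <- c("diag_p", "odiag1", "odiag2", "odiag3")

has_both <- apply(sample_df[code_cols], 1, function(x) {
  any(x %in% c("a1", "a4")) && any(x %in% c("b1", "b4"))
})

# Both conditions are logical vectors of equal length, so & combines them
sample_df$new_col <- has_both & sample_df$age < 50
sample_df$new_col
# TRUE FALSE FALSE FALSE TRUE
```

Selecting the code columns explicitly also keeps apply() from coercing the numeric column into the character matrix it builds internally.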

2 Answers


Using base R:

# For each row, require at least one a1/a4 code and at least one b1/b4 code
sample_df$new_col <-
  apply(sample_df, 1, \(row) {
    any(grepl("a[14]", row)) && any(grepl("b[14]", row))
  })

sample_df
#>   diag_p odiag1 odiag2 odiag3 new_col
#> 1     a1     b1     f1     a6    TRUE
#> 2     a4     b2     g4     b1    TRUE
#> 3     c5     c3     h4     c8   FALSE
#> 4     a4     d4     i5     a1   FALSE
#> 5     b1     e5     a1    e10    TRUE
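
Since the real data is described as enormous, a column-wise variant may be worth trying: apply() over rows calls the function once per row, whereas grepl() is vectorized over a whole column. The following is a sketch only, not a benchmarked recommendation. The helper name `any_col_matches` is made up for illustration.

```r
# Toy data from the question
sample_df <- data.frame(
  diag_p = c("a1", "a4", "c5", "a4", "b1"),
  odiag1 = c("b1", "b2", "c3", "d4", "e5"),
  odiag2 = c("f1", "g4", "h4", "i5", "a1"),
  odiag3 = c("a6", "b1", "c8", "a1", "e10")
)

# Hypothetical helper: for each row, is `pattern` found in any column of df?
# grepl() runs once per column; Reduce() ORs the per-column results together.
any_col_matches <- function(df, pattern) {
  Reduce(`|`, lapply(df, function(x) grepl(pattern, x)))
}

code_cols <- c("diag_p", "odiag1", "odiag2", "odiag3")

# (a1 or a4) somewhere in the row AND (b1 or b4) somewhere in the row
sample_df$new_col <-
  any_col_matches(sample_df[code_cols], "a[14]") &
  any_col_matches(sample_df[code_cols], "b[14]")

sample_df$new_col
# TRUE TRUE FALSE FALSE TRUE
```

With 25 code columns this makes 50 grepl() calls in total, independent of the number of rows.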

Though I'm not sure I understand your goal, I came up with this method of flagging rows that contain both (a1 or a4) and (b1 or b4).

diag_p <- c('a1', 'a4', 'c5', 'a4', 'b1')
odiag1 <- c('b1', 'b2', 'c3', 'd4', 'e5')
odiag2 <- c('f1', 'g4', 'h4', 'i5', 'a1')
odiag3 <- c('a6', 'b1', 'c8', 'a1', 'e10')
sample_df <- data.frame(diag_p, odiag1, odiag2, odiag3)

library(tidyverse)

sample_df |> rowwise() |> 
  mutate(Flag = any(str_detect(c_across(diag_p:odiag3), "a(1|4)")) &
           any(str_detect(c_across(diag_p:odiag3), "b(1|4)")))
#> # A tibble: 5 × 5
#> # Rowwise: 
#>   diag_p odiag1 odiag2 odiag3 Flag 
#>   <chr>  <chr>  <chr>  <chr>  <lgl>
#> 1 a1     b1     f1     a6     TRUE 
#> 2 a4     b2     g4     b1     TRUE 
#> 3 c5     c3     h4     c8     FALSE
#> 4 a4     d4     i5     a1     FALSE
#> 5 b1     e5     a1     e10    TRUE

Created on 2025-01-10 with reprex v2.1.1
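
One caveat applies to the patterns in both answers: they are unanchored, so they also match longer codes that merely contain a1 or a4. Whether that matters depends on the real code set. The small check below uses made-up codes ("a10", "ba4") that do not occur in the toy data, and assumes whole-code matching is what is wanted.

```r
# Made-up codes for illustration; "a10" and "ba4" are not in the toy data
codes <- c("a1", "a10", "a4", "ba4", "b1")

# Unanchored: matches any code containing "a1" or "a4"
unanchored <- grepl("a[14]", codes)
# TRUE TRUE TRUE TRUE FALSE

# Anchored: the whole code must be exactly a1 or a4
anchored <- grepl("^a[14]$", codes)
# TRUE FALSE TRUE FALSE FALSE

# Exact matching without regex gives the same result here
exact <- codes %in% c("a1", "a4")
# TRUE FALSE TRUE FALSE FALSE
```

If prefix matching is the intent instead (for example, every code starting with a1), anchoring only the start with "^a1" would be the analogous pattern.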
