Search for ICD Code Prefixes Across Multiple Columns and Create Yearly Binary Indicators in R

I have a long-format dataset where each row represents a hospitalization event, and I need to check if any of the 31 diagnosis columns contain a specific ICD code from a list in a given year. My goal is to create new binary columns (1 = diagnosis found, 0 = not found) for each year from 1968 to 2020, indicating whether a certain type of disease was recorded in that year. My data is longitudinal and I want the additional year columns to be able to track different types of hospitalizations over time and to do further manipulations.

Below is an example dataset :

# Sample dataset
dat <- data.frame(
      id = c(1, 1, 1, 2, 3, 3, 4, 5, 5),
      year = c(2014, 2015, 2020, 2015, 2016, 2014, 2015, 2014, 2017),# year for each hospitalization
      diagnosis1 = c("I21", "C50", "E10", "J40", "E119", "I210", "I50", "I63", "J45"),# in one hospitalization several diagnose codes are prescribed stores across the columns 
      diagnosis2 = c("I50", "C34", "E11", "J45", NA, NA, NA, NA, NA), 
      diagnosis3 = c("I63", "C61", "E14", "J98", NA, NA, NA, NA, NA)
    )

I also have a list of example conditions with codes that I need to search for (in the real data the codes also have varying lengths):

cancer <- c("C34", "C50", "C61", "C91")   
cvd <- c("I21", "I50", "I63", "I74")      
diabetes <- c("E10", "E11", "E13", "E14") 
respiratory <- c("J40", "J45", "J47", "J98")

For each year (2014-2020), I need new binary columns (cancer_1990, cvd_1990, etc.), where each hospitalization event is checked for a diagnosis from the corresponding condition list. If a hospitalization event contains a matching diagnosis, it should be marked as 1, otherwise 0. I also need to add into the code to include codes that start the same as in the list, but have more digits after (for example in the list it says E10 but a hospitalization event of E101 should also be included)

To (hopefully :)) clarify my question, this is what I would like the dataframe to look like after processing it

id	year	diagnosis1	diagnosis2	diagnosis3	cvd_2014	cvd_2015	cancer_2015	diabetes_2016	diabetes_2020	respiratory_2014	respiratory_2015	respiratory_2016
1	2014	I21	I50	I63	1	0	0	0	0	0	0	0
1	2015	C50	C34	C61	0	0	1	0	0	0	0	0
1	2020	E10	E11	E14	0	0	0	0	1	0	0	0
2	2015	J40	J45	J98	0	0	0	0	0	1	1	1
3	2016	E119			0	0	0	1	0	0	0	0
3	2014	I210			1	0	0	0	0	0	0	0
4	2015	I50			0	1	0	0	0	0	0	0
5	2014	I63			1	0	0	0	0	0	0	0
5	2017	J45			0	0	0	0	0	0	1	0

Below is an example dataset :

# Sample dataset
dat <- data.frame(
      id = c(1, 1, 1, 2, 3, 3, 4, 5, 5),
      year = c(2014, 2015, 2020, 2015, 2016, 2014, 2015, 2014, 2017),# year for each hospitalization
      diagnosis1 = c("I21", "C50", "E10", "J40", "E119", "I210", "I50", "I63", "J45"),# in one hospitalization several diagnose codes are prescribed stores across the columns 
      diagnosis2 = c("I50", "C34", "E11", "J45", NA, NA, NA, NA, NA), 
      diagnosis3 = c("I63", "C61", "E14", "J98", NA, NA, NA, NA, NA)
    )

I also have a list of example conditions with codes that I need to search for (in the real data the codes also have varying lengths):

cancer <- c("C34", "C50", "C61", "C91")   
cvd <- c("I21", "I50", "I63", "I74")      
diabetes <- c("E10", "E11", "E13", "E14") 
respiratory <- c("J40", "J45", "J47", "J98")

To (hopefully :)) clarify my question, this is what I would like the dataframe to look like after processing it

id	year	diagnosis1	diagnosis2	diagnosis3	cvd_2014	cvd_2015	cancer_2015	diabetes_2016	diabetes_2020	respiratory_2014	respiratory_2015	respiratory_2016
1	2014	I21	I50	I63	1	0	0	0	0	0	0	0
1	2015	C50	C34	C61	0	0	1	0	0	0	0	0
1	2020	E10	E11	E14	0	0	0	0	1	0	0	0
2	2015	J40	J45	J98	0	0	0	0	0	1	1	1
3	2016	E119			0	0	0	1	0	0	0	0
3	2014	I210			1	0	0	0	0	0	0	0
4	2015	I50			0	1	0	0	0	0	0	0
5	2014	I63			1	0	0	0	0	0	0	0
5	2017	J45			0	0	0	0	0	0	1	0

Any help would be greatly appreciated as I have gotten completely stuck with this.

Share Improve this question edited Feb 5 at 20:26 asked Feb 5 at 18:36 user25809482 112 bronze badges

1 First, tidy your data. Then your problem goes away. Here, your data frame is not tidy because information you need (year), is contained in the column names, not their values. – Limey Commented Feb 5 at 19:10
1 "for example in the list it says E10 but a hospitalization event of E101 should also be included": this is not represented in your sample data. And could be regarded as a second question in a single post, which could lead to votes to close. – Limey Commented Feb 5 at 19:12
1 Sorry about that. I have updated the code example – user25809482 Commented Feb 5 at 19:20

Add a comment |

2 Answers 2

Sorted by: Reset to default 1

You can do this using the dplyr package; however, with a large number of years, this approach is inefficient — it will create a very long-format result.

# Create a list of conditions
conditions <- list(
  cancer = cancer,
  cvd = cvd,
  diabetes = diabetes,
  respiratory = respiratory
)

# Function to detect disease presence based on partial match
detect_disease <- function(diagnoses, codes) {
  # Check if any diagnosis starts with one of the codes
  sapply(codes, function(code) any(grepl(paste0("^", code), diagnoses, na.rm = TRUE)))
}

# Convert dataset to long format to check all diagnoses columns
dat <- dat %>%
  rowwise() %>%
  mutate(
    cancer = any(detect_disease(c_across(starts_with("diagnosis")), cancer)),
    cvd = any(detect_disease(c_across(starts_with("diagnosis")), cvd)),
    diabetes = any(detect_disease(c_across(starts_with("diagnosis")), diabetes)),
    respiratory = any(detect_disease(c_across(starts_with("diagnosis")), respiratory))
  ) %>%
  ungroup()

# Convert logical to binary (1 = diagnosis found, 0 = not found)
dat <- dat %>%
  mutate(across(c(cancer, cvd, diabetes, respiratory), as.integer))

# Create binary columns for each year (2014-2020)
years <- 2014:2020

for (y in years) {
  dat[[paste0("cancer_", y)]] <- ifelse(dat$year == y, dat$cancer, 0)
  dat[[paste0("cvd_", y)]] <- ifelse(dat$year == y, dat$cvd, 0)
  dat[[paste0("diabetes_", y)]] <- ifelse(dat$year == y, dat$diabetes, 0)
  dat[[paste0("respiratory_", y)]] <- ifelse(dat$year == y, dat$respiratory, 0)
}

# Remove temporary disease columns
dat <- dat %>% select(-c(cancer, cvd, diabetes, respiratory))

# Print the final dataset
print(dat)

Using a combination of functions from dplyr, tidyr, and stringr, you can:

library(dplyr)
library(tidyr)
library(stringr)

# Add helper columns to dat
dat <- dat %>%
  mutate(tmp1 = 1:n(),
         tmp2 = 1)

# Create long form df of ICD codes lists with their types
icd_list <- list(cancer = cancer, 
                 cvd = cvd, 
                 diabetes = diabetes, 
                 respiratory = respiratory)

df <- data.frame(icd_short = unlist(icd_list),
                 type = rep(names(icd_list), lengths(icd_list)))

# Pivot data to long form, add shortened ICD codes column and join df, complete()
# missing years and types, remove any resultant NAs, pivot back to wide form,
# join result to original dat df, sort rows, reorder and remove helper columns
res <- dat %>%
  pivot_longer(-c(tmp1, tmp2, id, year),
               names_to = "diagnosis",
               values_to = "icd_long") %>%
  mutate(icd_short = str_sub(icd_long, 1, 3)) %>%
  left_join(df, by = "icd_short") %>%
  complete(year = 2014:2020, type) %>% # Replace 2014:2020 with 1968:2020 for full dataset
  filter(!is.na(type)) %>%
  pivot_wider(id_cols = c(id, tmp1),
              names_from = c(type, year),
              values_from = tmp2,
              values_fn = {first},
              values_fill = 0) %>%
  right_join(dat, by = c("id", "tmp1")) %>%
  arrange(tmp1) %>%
  select(
    id, year, starts_with(c("diagnosis", "cvd")), sort(colnames(.)), 
    -starts_with("tmp")
    )

data.frame(res)
#   id year diagnosis1 diagnosis2 diagnosis3 cvd_2014 cvd_2015 cvd_2016 cvd_2017 cvd_2018 cvd_2019 cvd_2020  cancer_2014 cancer_2015 cancer_2016 cancer_2017 cancer_2018 cancer_2019 cancer_2020 diabetes_2014  diabetes_2015 diabetes_2016 diabetes_2017 diabetes_2018 diabetes_2019 diabetes_2020 respiratory_2014  respiratory_2015 respiratory_2016 respiratory_2017 respiratory_2018 respiratory_2019 respiratory_2020 
# 1  1 2014        I21        I50       C341        1        0        0        0        0        0        0            1           0           0           0           0           0           0             0              0             0             0             0             0             0                0                 0                0                0                0                0                0 
# 2  1 2015        C50        C34        C61        0        0        0        0        0        0        0            0           1           0           0           0           0           0             0              0             0             0             0             0             0                0                 0                0                0                0                0                0 
# 3  1 2020        E10        E11        E14        0        0        0        0        0        0        0            0           0           0           0           0           0           0             0              0             0             0             0             0             1                0                 0                0                0                0                0                0 
# 4  2 2015        J40        J45        J98        0        0        0        0        0        0        0            0           0           0           0           0           0           0             0              0             0             0             0             0             0                0                 1                0                0                0                0                0 
# 5  3 2016       E119       <NA>       <NA>        0        0        0        0        0        0        0            0           0           0           0           0           0           0             0              0             1             0             0             0             0                0                 0                0                0                0                0                0 
# 6  3 2014       I210       <NA>       <NA>        1        0        0        0        0        0        0            0           0           0           0           0           0           0             0              0             0             0             0             0             0                0                 0                0                0                0                0                0 
# 7  4 2015        I50       <NA>       <NA>        0        1        0        0        0        0        0            0           0           0           0           0           0           0             0              0             0             0             0             0             0                0                 0                0                0                0                0                0 
# 8  5 2014        I63       <NA>       <NA>        1        0        0        0        0        0        0            0           0           0           0           0           0           0             0              0             0             0             0             0             0                0                 0                0                0                0                0                0 
# 9  5 2017        J45       <NA>       <NA>        0        0        0        0        0        0        0            0           0           0           0           0           0           0             0              0             0             0             0             0             0                0                 0                0                1                0                0                0

科技改变生活-雨落星辰 - 所有的伟大,都源于一个勇敢的开始

Search for ICD Code Prefixes Across Multiple Columns and Create Yearly Binary Indicators in R - Stack Overflow

2 Answers 2

与本文相关的文章

评论列表(0)