最新消息:雨落星辰是一个专注网站SEO优化、网站SEO诊断、搜索引擎研究、网络营销推广、网站策划运营及站长类的自媒体原创博客

Search for ICD Code Prefixes Across Multiple Columns and Create Yearly Binary Indicators in R - Stack Overflow

programmeradmin1浏览0评论

I have a long-format dataset where each row represents a hospitalization event, and I need to check if any of the 31 diagnosis columns contain a specific ICD code from a list in a given year. My goal is to create new binary columns (1 = diagnosis found, 0 = not found) for each year from 1968 to 2020, indicating whether a certain type of disease was recorded in that year. My data is longitudinal and I want the additional year columns to be able to track different types of hospitalizations over time and to do further manipulations.

Below is an example dataset :

# Sample dataset
dat <- data.frame(
      id = c(1, 1, 1, 2, 3, 3, 4, 5, 5),
      year = c(2014, 2015, 2020, 2015, 2016, 2014, 2015, 2014, 2017),# year for each hospitalization
      diagnosis1 = c("I21", "C50", "E10", "J40", "E119", "I210", "I50", "I63", "J45"),# in one hospitalization several diagnose codes are prescribed stores across the columns 
      diagnosis2 = c("I50", "C34", "E11", "J45", NA, NA, NA, NA, NA), 
      diagnosis3 = c("I63", "C61", "E14", "J98", NA, NA, NA, NA, NA)
    )

I also have a list of example conditions with codes that I need to search for (in the real data the codes also have varying lengths):

cancer <- c("C34", "C50", "C61", "C91")   
cvd <- c("I21", "I50", "I63", "I74")      
diabetes <- c("E10", "E11", "E13", "E14") 
respiratory <- c("J40", "J45", "J47", "J98") 

For each year (2014-2020), I need new binary columns (cancer_1990, cvd_1990, etc.), where each hospitalization event is checked for a diagnosis from the corresponding condition list. If a hospitalization event contains a matching diagnosis, it should be marked as 1, otherwise 0. I also need to add into the code to include codes that start the same as in the list, but have more digits after (for example in the list it says E10 but a hospitalization event of E101 should also be included)

To (hopefully :)) clarify my question, this is what I would like the dataframe to look like after processing it

id year diagnosis1 diagnosis2 diagnosis3 cvd_2014 cvd_2015 cvd_2016 cvd_2017 cvd_2018 cvd_2019 cvd_2020 cancer_2014 cancer_2015 cancer_2016 cancer_2017 cancer_2018 cancer_2019 cancer_2020 diabetes_2014 diabetes_2015 diabetes_2016 diabetes_2017 diabetes_2018 diabetes_2019 diabetes_2020 respiratory_2014 respiratory_2015 respiratory_2016 respiratory_2017 respiratory_2018 respiratory_2019 respiratory_2020
1 2014 I21 I50 I63 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
1 2015 C50 C34 C61 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
1 2020 E10 E11 E14 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0
2 2015 J40 J45 J98 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1 0 0 0 0
3 2016 E119 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0
3 2014 I210 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
4 2015 I50 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
5 2014 I63 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
5 2017 J45 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0

I have a long-format dataset where each row represents a hospitalization event, and I need to check if any of the 31 diagnosis columns contain a specific ICD code from a list in a given year. My goal is to create new binary columns (1 = diagnosis found, 0 = not found) for each year from 1968 to 2020, indicating whether a certain type of disease was recorded in that year. My data is longitudinal and I want the additional year columns to be able to track different types of hospitalizations over time and to do further manipulations.

Below is an example dataset :

# Sample dataset
dat <- data.frame(
      id = c(1, 1, 1, 2, 3, 3, 4, 5, 5),
      year = c(2014, 2015, 2020, 2015, 2016, 2014, 2015, 2014, 2017),# year for each hospitalization
      diagnosis1 = c("I21", "C50", "E10", "J40", "E119", "I210", "I50", "I63", "J45"),# in one hospitalization several diagnose codes are prescribed stores across the columns 
      diagnosis2 = c("I50", "C34", "E11", "J45", NA, NA, NA, NA, NA), 
      diagnosis3 = c("I63", "C61", "E14", "J98", NA, NA, NA, NA, NA)
    )

I also have a list of example conditions with codes that I need to search for (in the real data the codes also have varying lengths):

cancer <- c("C34", "C50", "C61", "C91")   
cvd <- c("I21", "I50", "I63", "I74")      
diabetes <- c("E10", "E11", "E13", "E14") 
respiratory <- c("J40", "J45", "J47", "J98") 

For each year (2014-2020), I need new binary columns (cancer_1990, cvd_1990, etc.), where each hospitalization event is checked for a diagnosis from the corresponding condition list. If a hospitalization event contains a matching diagnosis, it should be marked as 1, otherwise 0. I also need to add into the code to include codes that start the same as in the list, but have more digits after (for example in the list it says E10 but a hospitalization event of E101 should also be included)

To (hopefully :)) clarify my question, this is what I would like the dataframe to look like after processing it

id year diagnosis1 diagnosis2 diagnosis3 cvd_2014 cvd_2015 cvd_2016 cvd_2017 cvd_2018 cvd_2019 cvd_2020 cancer_2014 cancer_2015 cancer_2016 cancer_2017 cancer_2018 cancer_2019 cancer_2020 diabetes_2014 diabetes_2015 diabetes_2016 diabetes_2017 diabetes_2018 diabetes_2019 diabetes_2020 respiratory_2014 respiratory_2015 respiratory_2016 respiratory_2017 respiratory_2018 respiratory_2019 respiratory_2020
1 2014 I21 I50 I63 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
1 2015 C50 C34 C61 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
1 2020 E10 E11 E14 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0
2 2015 J40 J45 J98 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1 0 0 0 0
3 2016 E119 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0
3 2014 I210 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
4 2015 I50 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
5 2014 I63 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
5 2017 J45 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0

Any help would be greatly appreciated as I have gotten completely stuck with this.

Share Improve this question edited Feb 5 at 20:26 user25809482 asked Feb 5 at 18:36 user25809482user25809482 112 bronze badges 3
  • 1 First, tidy your data. Then your problem goes away. Here, your data frame is not tidy because information you need (year), is contained in the column names, not their values. – Limey Commented Feb 5 at 19:10
  • 1 "for example in the list it says E10 but a hospitalization event of E101 should also be included": this is not represented in your sample data. And could be regarded as a second question in a single post, which could lead to votes to close. – Limey Commented Feb 5 at 19:12
  • 1 Sorry about that. I have updated the code example – user25809482 Commented Feb 5 at 19:20
Add a comment  | 

2 Answers 2

Reset to default 1

You can do this using the dplyr package; however, with a large number of years, this approach is inefficient — it will create a very long-format result.

# Create a list of conditions
conditions <- list(
  cancer = cancer,
  cvd = cvd,
  diabetes = diabetes,
  respiratory = respiratory
)

# Function to detect disease presence based on partial match
detect_disease <- function(diagnoses, codes) {
  # Check if any diagnosis starts with one of the codes
  sapply(codes, function(code) any(grepl(paste0("^", code), diagnoses, na.rm = TRUE)))
}

# Convert dataset to long format to check all diagnoses columns
dat <- dat %>%
  rowwise() %>%
  mutate(
    cancer = any(detect_disease(c_across(starts_with("diagnosis")), cancer)),
    cvd = any(detect_disease(c_across(starts_with("diagnosis")), cvd)),
    diabetes = any(detect_disease(c_across(starts_with("diagnosis")), diabetes)),
    respiratory = any(detect_disease(c_across(starts_with("diagnosis")), respiratory))
  ) %>%
  ungroup()

# Convert logical to binary (1 = diagnosis found, 0 = not found)
dat <- dat %>%
  mutate(across(c(cancer, cvd, diabetes, respiratory), as.integer))

# Create binary columns for each year (2014-2020)
years <- 2014:2020

for (y in years) {
  dat[[paste0("cancer_", y)]] <- ifelse(dat$year == y, dat$cancer, 0)
  dat[[paste0("cvd_", y)]] <- ifelse(dat$year == y, dat$cvd, 0)
  dat[[paste0("diabetes_", y)]] <- ifelse(dat$year == y, dat$diabetes, 0)
  dat[[paste0("respiratory_", y)]] <- ifelse(dat$year == y, dat$respiratory, 0)
}

# Remove temporary disease columns
dat <- dat %>% select(-c(cancer, cvd, diabetes, respiratory))

# Print the final dataset
print(dat)

Using a combination of functions from dplyr, tidyr, and stringr, you can:

library(dplyr)
library(tidyr)
library(stringr)

# Add helper columns to dat
dat <- dat %>%
  mutate(tmp1 = 1:n(),
         tmp2 = 1)

# Create long form df of ICD codes lists with their types
icd_list <- list(cancer = cancer, 
                 cvd = cvd, 
                 diabetes = diabetes, 
                 respiratory = respiratory)

df <- data.frame(icd_short = unlist(icd_list),
                 type = rep(names(icd_list), lengths(icd_list)))

# Pivot data to long form, add shortened ICD codes column and join df, complete()
# missing years and types, remove any resultant NAs, pivot back to wide form,
# join result to original dat df, sort rows, reorder and remove helper columns
res <- dat %>%
  pivot_longer(-c(tmp1, tmp2, id, year),
               names_to = "diagnosis",
               values_to = "icd_long") %>%
  mutate(icd_short = str_sub(icd_long, 1, 3)) %>%
  left_join(df, by = "icd_short") %>%
  complete(year = 2014:2020, type) %>% # Replace 2014:2020 with 1968:2020 for full dataset
  filter(!is.na(type)) %>%
  pivot_wider(id_cols = c(id, tmp1),
              names_from = c(type, year),
              values_from = tmp2,
              values_fn = {first},
              values_fill = 0) %>%
  right_join(dat, by = c("id", "tmp1")) %>%
  arrange(tmp1) %>%
  select(
    id, year, starts_with(c("diagnosis", "cvd")), sort(colnames(.)), 
    -starts_with("tmp")
    )

data.frame(res)
#   id year diagnosis1 diagnosis2 diagnosis3 cvd_2014 cvd_2015 cvd_2016 cvd_2017 cvd_2018 cvd_2019 cvd_2020  cancer_2014 cancer_2015 cancer_2016 cancer_2017 cancer_2018 cancer_2019 cancer_2020 diabetes_2014  diabetes_2015 diabetes_2016 diabetes_2017 diabetes_2018 diabetes_2019 diabetes_2020 respiratory_2014  respiratory_2015 respiratory_2016 respiratory_2017 respiratory_2018 respiratory_2019 respiratory_2020 
# 1  1 2014        I21        I50       C341        1        0        0        0        0        0        0            1           0           0           0           0           0           0             0              0             0             0             0             0             0                0                 0                0                0                0                0                0 
# 2  1 2015        C50        C34        C61        0        0        0        0        0        0        0            0           1           0           0           0           0           0             0              0             0             0             0             0             0                0                 0                0                0                0                0                0 
# 3  1 2020        E10        E11        E14        0        0        0        0        0        0        0            0           0           0           0           0           0           0             0              0             0             0             0             0             1                0                 0                0                0                0                0                0 
# 4  2 2015        J40        J45        J98        0        0        0        0        0        0        0            0           0           0           0           0           0           0             0              0             0             0             0             0             0                0                 1                0                0                0                0                0 
# 5  3 2016       E119       <NA>       <NA>        0        0        0        0        0        0        0            0           0           0           0           0           0           0             0              0             1             0             0             0             0                0                 0                0                0                0                0                0 
# 6  3 2014       I210       <NA>       <NA>        1        0        0        0        0        0        0            0           0           0           0           0           0           0             0              0             0             0             0             0             0                0                 0                0                0                0                0                0 
# 7  4 2015        I50       <NA>       <NA>        0        1        0        0        0        0        0            0           0           0           0           0           0           0             0              0             0             0             0             0             0                0                 0                0                0                0                0                0 
# 8  5 2014        I63       <NA>       <NA>        1        0        0        0        0        0        0            0           0           0           0           0           0           0             0              0             0             0             0             0             0                0                 0                0                0                0                0                0 
# 9  5 2017        J45       <NA>       <NA>        0        0        0        0        0        0        0            0           0           0           0           0           0           0             0              0             0             0             0             0             0                0                 0                0                1                0                0                0
发布评论

评论列表(0)

  1. 暂无评论