r - How do I make a function to create new columns with true and false for each of multiple options contained within one column?

I have a database with a large number of columns which contain multiple options from a selection. I need to pivot all of these into true or false columns so that they can be analysed in PowerBI.

Without using a function I can do this manually using the code below, but that is not feasible given the number of columns I need to adjust and the number of options within each column:

# This shows all the unique values in barriers column
sort(unique(unlist(str_split(bound_wf$barriers, " AND "))))

# This creates a new df, test, with new columns for barriers_brexit and barriers_econ which have a Boolean value depending on the appearance of the phrase in the original barriers column
test <- bound_wf %>% 
  mutate(
    barriers_brexit = grepl("Brexit", barriers),
    barriers_econ = grepl("Current economic climate", barriers)
)

I have the following pseudocode for a function to do this, but I can't make it work. My guess is that R is not reading my parameters in the way I want it to.

function (column, separator = " AND ")
  #get list of items
  list <- sort(unique(unlist(str_split(bound_wf$column, separator))))
  #create a column per item named column_item with grepl
  for item in list
    test <- bound_wf %>% 
      mutate(column_item = grepl(item, column)

Here is where I got to with the actual function.

widen <- function(column, separator){
  column <- as.vector(column)
  separator <- as.vector(separator)
  df <- as.vector(str_c("bound_wf$", column))
  #get list of items
  list <- sort(unique(unlist(str_split(df, separator))))
  print(list)
  #create a column per item named column_item with grepl
  for (item in list) {
  test <- bound_wf %>% 
    mutate(column_item = grepl(item, column))
  }
}

When I run this, list prints as bound_wf$barrier, so it is just returning the df variable instead of the desired list of unique options. I think I just need help with how to get R to read parameters as the items I want them to be. I am not really familiar with functions (obviously!).

For reproducibility, the dataframe would look like this:

df <- data.frame(id = 1:10,column = c("option one AND option three", "option three", "option one AND option two AND option three", "option one", "option two AND option three", "option one AND option four", "option two", "option one AND option two", "option three AND option four", "option one AND option two AND option four"))

Thanks to anyone who has time to comment on this.

asked Jan 29 at 20:14 by LuluLulu, edited Jan 29 at 21:54 by ThomasIsCoding
3 Answers

Here is one simple version of widen():

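library(dplyr)   # mutate()
library(tidyr)   # unnest(), pivot_wider()

# Split `col` on `sep`, then spread each option into its own TRUE/FALSE column.
# Note: the identifier column is assumed to be called `id`.
widen <- function(df, col, sep) {
  df |> mutate(v = strsplit({{col}}, sep)) |> 
    unnest(cols = v) |> 
    pivot_wider(id_cols = id, names_from = v, values_from = v, values_fn = \(x) !is.na(x), values_fill = FALSE)
}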

Usage:

widen(df, column, " AND ")

Output:

      id `option one` `option three` `option two` `option four`
   <int> <lgl>        <lgl>          <lgl>        <lgl>        
 1     1 TRUE         TRUE           FALSE        FALSE        
 2     2 FALSE        TRUE           FALSE        FALSE        
 3     3 TRUE         TRUE           TRUE         FALSE        
 4     4 TRUE         FALSE          FALSE        FALSE        
 5     5 FALSE        TRUE           TRUE         FALSE        
 6     6 TRUE         FALSE          FALSE        TRUE         
 7     7 FALSE        FALSE          TRUE         FALSE        
 8     8 TRUE         FALSE          TRUE         FALSE        
 9     9 FALSE        TRUE           FALSE        TRUE         
10    10 TRUE         FALSE          TRUE         TRUE   

This is not quite as flexible as @margusi's version.
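One way to loosen the hard-coded `id` column is to pass the identifier as an argument as well. The following is only a sketch under some assumptions: `widen_by`, `id_col` and the helper column `.present` are made-up names for illustration, and `separate_longer_delim()` requires tidyr 1.3.0 or later.

```r
library(dplyr)
library(tidyr)

# Reproducible data from the question
df <- data.frame(
  id = 1:10,
  column = c("option one AND option three", "option three",
             "option one AND option two AND option three", "option one",
             "option two AND option three", "option one AND option four",
             "option two", "option one AND option two",
             "option three AND option four",
             "option one AND option two AND option four"),
  stringsAsFactors = FALSE
)

# Hypothetical variant of widen(): the id column is an argument, not fixed
widen_by <- function(data, id_col, col, sep = " AND ") {
  data |>
    separate_longer_delim({{ col }}, delim = sep) |>  # one row per option
    distinct({{ id_col }}, {{ col }}) |>              # guard against repeated options
    mutate(.present = TRUE) |>                        # marker that becomes the cell value
    pivot_wider(
      id_cols = {{ id_col }},
      names_from = {{ col }},
      values_from = .present,
      values_fill = FALSE                             # options absent from a row become FALSE
    )
}

wide <- widen_by(df, id, column)
```

The result has one row per `id` and one logical column per option, in order of first appearance. It could be joined back to the original data with `left_join()` if the source column needs to be kept.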

You can try strsplit + stack + table:

with(
  df,
  t(table(stack(strsplit(setNames(column, id), " AND "))) > 0)
)

which gives

    values
ind  option four option one option three option two
  1        FALSE       TRUE         TRUE      FALSE
  2        FALSE      FALSE         TRUE      FALSE
  3        FALSE       TRUE         TRUE       TRUE
  4        FALSE       TRUE        FALSE      FALSE
  5        FALSE      FALSE         TRUE       TRUE
  6         TRUE       TRUE        FALSE      FALSE
  7        FALSE      FALSE        FALSE       TRUE
  8        FALSE       TRUE        FALSE       TRUE
  9         TRUE      FALSE         TRUE      FALSE
  10        TRUE       TRUE        FALSE       TRUE
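The result above is a logical matrix rather than a data frame. If the flags need to sit next to the original columns (for example before exporting to PowerBI), something along these lines might work. It is a sketch only: `flags`, `flags_df` and `wide` are names chosen here for illustration.

```r
# Reproducible data from the question
df <- data.frame(
  id = 1:10,
  column = c("option one AND option three", "option three",
             "option one AND option two AND option three", "option one",
             "option two AND option three", "option one AND option four",
             "option two", "option one AND option two",
             "option three AND option four",
             "option one AND option two AND option four"),
  stringsAsFactors = FALSE
)

# Logical matrix: one row per id, one column per option
flags <- with(
  df,
  t(table(stack(strsplit(setNames(column, id), " AND "))) > 0)
)

# Calling the matrix method directly keeps the wide layout even if `flags`
# still carries the "table" class (as.data.frame() on a table gives long format)
flags_df <- as.data.frame.matrix(flags)

# Rows come back in the order of `id`, so the flags can be bound column-wise
wide <- cbind(df, flags_df)
```

This relies on the rows of the matrix being in the same order as the rows of `df`, which holds here because `stack()` keeps the names in order of first appearance.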

dplyr::bind_rows() can create a frame from a list of named vectors, even if their lengths do not match. So we can name the vectors returned by strsplit() / str_split(), collect them into a frame, recode non-NA and NA values to TRUE / FALSE (which creates a matrix), turn that back into a data.frame and sort the columns.

The function can be applied across multiple columns with mutate(across(...)):

library(dplyr, warn.conflicts = FALSE)

df <- tibble(
  id = 1:10,
  column = c("option one AND option three", "option three", "option one AND option two AND option three", "option one", "option two AND option three", "option one AND option four", "option two", "option one AND option two", "option three AND option four", "option one AND option two AND option four")
)

# pass all arguments to strsplit()
widen <- function(...){
  strsplit(...) |> 
    # lapply() always returns a list; sapply() could simplify equal-length results to a matrix
    lapply(\(x) setNames(x, make.names(x))) |> 
    bind_rows() |> 
    Negate(is.na)() |> 
    as.data.frame() |> 
    {\(x) x[,order(names(x))]}() 
}

df |> 
  mutate(across(-id, \(x) widen(x, split = " AND ", fixed = TRUE), .unpack = TRUE))
#> # A tibble: 10 × 6
#>       id column                                     column_option.four column_option.one column_option.three column_option.two
#>    <int> <chr>                                      <lgl>              <lgl>             <lgl>               <lgl>            
#>  1     1 option one AND option three                FALSE              TRUE              TRUE                FALSE            
#>  2     2 option three                               FALSE              FALSE             TRUE                FALSE            
#>  3     3 option one AND option two AND option three FALSE              TRUE              TRUE                TRUE             
#>  4     4 option one                                 FALSE              TRUE              FALSE               FALSE            
#>  5     5 option two AND option three                FALSE              FALSE             TRUE                TRUE             
#>  6     6 option one AND option four                 TRUE               TRUE              FALSE               FALSE            
#>  7     7 option two                                 FALSE              FALSE             FALSE               TRUE             
#>  8     8 option one AND option two                  FALSE              TRUE              FALSE               TRUE             
#>  9     9 option three AND option four               TRUE               FALSE             TRUE                FALSE            
#> 10    10 option one AND option two AND option four  TRUE               TRUE              FALSE               TRUE

As a side note, df <- as.vector(str_c("bound_wf$", column)) would not work: it only builds a character string such as "bound_wf$barriers" and never looks up the column. What you could use in this scenario is bound_wf[[column]].
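To illustrate the `[[column]]` idea, here is a minimal base R sketch of the loop from the question. The function name `widen_loop` and its `data` argument are assumptions made for this example. The data frame is passed in rather than read from the global `bound_wf`, and the modified copy is returned.

```r
# Reproducible data from the question
df <- data.frame(
  id = 1:10,
  column = c("option one AND option three", "option three",
             "option one AND option two AND option three", "option one",
             "option two AND option three", "option one AND option four",
             "option two", "option one AND option two",
             "option three AND option four",
             "option one AND option two AND option four"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: `column` is the column name given as a string
widen_loop <- function(data, column, separator = " AND ") {
  # data[[column]] looks the column up by the string stored in `column`,
  # whereas data$column would look for a column literally named "column"
  pieces <- strsplit(as.character(data[[column]]), separator, fixed = TRUE)

  # All distinct options found in that column
  items <- sort(unique(unlist(pieces)))

  for (item in items) {
    new_name <- paste(column, item, sep = "_")
    # Exact match against the split pieces, so one option cannot match
    # as a substring of another (which grepl() would allow)
    data[[new_name]] <- vapply(pieces, function(p) item %in% p, logical(1))
  }
  data
}

result <- widen_loop(df, "column")
```

Because the column name is a plain string, the same helper could be called in a loop over several column names, feeding each result into the next call.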
