r - names_pattern in tidyr pivot_longer for multiple variables nested in the column names

I have a data frame which I need to put into long format. Here is a simplified version of my data frame:

df <- data.frame("UserID" = c(1:3), "ProdID" = c(1, 1, 2), "A_Q_FaultID_C_1" = c(3, 2, 4), "A_Q_FaultID_C_2" = c(8, 1, 2), "A_Q_FaultID_C_3" = c(4, 3, 9), "A_Q_FaultYear_C_1" = c(1999, 2018, 2012), "A_Q_FaultYear_C_2" = c(2002, 2008, 2024), "A_Q_FaultYear_C_3" = c(2015, 2015, 1997))

The idea is to end up with a data frame that looks like this by pivoting longer based on the variable names with the User and Prod IDs as keys:

df <- data.frame("UserID" = c(1:3, 1:3, 1:3), "ProdID" = c(1, 1, 2, 1, 1, 2, 1, 1, 2), "FaultID" = c(3, 2, 4, 8, 1, 2, 4, 3, 9), "FaultYear" = c(1999, 2018, 2012, 2002, 2008, 2024, 2015, 2015, 1997))

In reality, the data frame has many more rows, and many more columns within each pivot group, but the general naming convention for the columns I want to pivot longer is the same, i.e. they begin with "A_Q_" and end with "C[numeric string]". Note that the numeric string isn't necessarily just a single digit.

Here is my attempt using the tidyr pivot_longer function:

df <- df %>%
  pivot_longer(
    cols = 3:8,
    names_to = c("FaultID", "FaultYear"),
    names_pattern = "[a-zA-Z_]+"
  )

This gives the error: "regex should define 2 groups; 0 found.", however I tested this regex using the column names and str_match and it extracts everything except the trailing digits, uniquely identifying two groups:

str_match(names(df)[3:8], "[a-zA-Z_]+")

     [,1]              
[1,] "A_Q_FaultID_C_"  
[2,] "A_Q_FaultID_C_"  
[3,] "A_Q_FaultID_C_"  
[4,] "A_Q_FaultYear_C_"
[5,] "A_Q_FaultYear_C_"
[6,] "A_Q_FaultYear_C_"

What am I doing wrong?

I have a data frame which I need to put into long format. Here is a simplified version of my data frame:

df <- data.frame("UserID" = c(1:3), "ProdID" = c(1, 1, 2), "A_Q_FaultID_C_1" = c(3, 2, 4), "A_Q_FaultID_C_2" = c(8, 1, 2), "A_Q_FaultID_C_3" = c(4, 3, 9), "A_Q_FaultYear_C_1" = c(1999, 2018, 2012), "A_Q_FaultYear_C_2" = c(2002, 2008, 2024), "A_Q_FaultYear_C_3" = c(2015, 2015, 1997))

The idea is to end up with a data frame that looks like this by pivoting longer based on the variable names with the User and Prod IDs as keys:

df <- data.frame("UserID" = c(1:3, 1:3, 1:3), "ProdID" = c(1, 1, 2, 1, 1, 2, 1, 1, 2), "FaultID" = c(3, 2, 4, 8, 1, 2, 4, 3, 9), "FaultYear" = c(1999, 2018, 2012, 2002, 2008, 2024, 2015, 2015, 1997))

Here is my attempt using the tidyr pivot_longer function:

df <- df %>%
  pivot_longer(
    cols = 3:8,
    names_to = c("FaultID", "FaultYear"),
    names_pattern = "[a-zA-Z_]+"
  )

str_match(names(df)[3:8], "[a-zA-Z_]+")

     [,1]              
[1,] "A_Q_FaultID_C_"  
[2,] "A_Q_FaultID_C_"  
[3,] "A_Q_FaultID_C_"  
[4,] "A_Q_FaultYear_C_"
[5,] "A_Q_FaultYear_C_"
[6,] "A_Q_FaultYear_C_"

What am I doing wrong?

Share Improve this question edited Mar 28 at 16:12 zephryl 17.7k4 gold badges16 silver badges34 bronze badges asked Mar 28 at 15:49 Dee G 1561 silver badge12 bronze badges

1 It says you need two capturing groups in the pattern, e.g. names_pattern = "_(Fault)([a-zA-Z]+)_" – Wiktor Stribiżew Commented Mar 28 at 16:24

Add a comment |

1 Answer 1

Sorted by: Reset to default 1

For this situation, it's easier to use names_to = ".value", where ".value" is a special "sentinel" that will pivot to multiple columns based on names extracted from cols. Then, in names_pattern, you need to include a regex pattern with a capture group (i.e., "()") that extracts the portion of the column names that defines the new columns.

df %>%
  pivot_longer(
    cols = A_Q_FaultID_C_1:A_Q_FaultYear_C_3,
    names_to = ".value",
    names_pattern = "(Fault[a-zA-Z]+)"
  )
# # A tibble: 9 × 4
#   UserID ProdID FaultID FaultYear
#    <int>  <dbl>   <dbl>     <dbl>
# 1      1      1       3      1999
# 2      1      1       8      2002
# 3      1      1       4      2015
# 4      2      1       2      2018
# 5      2      1       1      2008
# 6      2      1       3      2015
# 7      3      2       4      2012
# 8      3      2       2      2024
# 9      3      2       9      1997

PS - I strongly recommend using a tidyselect helper for the cols argument (for example, A_Q_FaultID_C_1:A_Q_FaultYear_C_3 or starts_with("A_Q_")) instead of a numeric range like 3:8. The latter is more error-prone (e.g., if the number or order of columns changes, or just from miscounting).

科技改变生活-雨落星辰 - 所有的伟大,都源于一个勇敢的开始

r - names_pattern in tidyr pivot_longer for multiple variables nested in the column names - Stack Overflow

1 Answer 1

与本文相关的文章

评论列表(0)