
PCRE regex in R matching `\` and `[` unexpectedly - Stack Overflow


I am trying to use regex in R to replace forbidden characters. I know base::make.names can do some of this, but I want to control the replacement character. I have successfully figured out how to do this for Windows file name forbidden characters, as long as \ is expanded to accommodate the needs of gsub:

library(magrittr)  # provides %>% (set_names() is called with magrittr:: below)
wdc <- c(
  "<", ">", ":", '"', "/", "\\", "|", "?", "*"
) %>% 
  magrittr::set_names(., .)
wdc_regex <- wdc %>% 
  sub("\\\\", "\\\\\\\\", .) %>%
  c("[", ., "]") %>%
  paste0(collapse = "")
wdc %>% 
  c(letters[1:5]) %>%
  gsub(wdc_regex, "_", ., perl = TRUE)
  <   >   :   "   /  \\   |   ?   *                     
"_" "_" "_" "_" "_" "_" "_" "_" "_" "a" "b" "c" "d" "e" 
wdc_regex <- wdc %>% 
  c("[", ., "]") %>%
  paste0(collapse = "")
wdc %>% 
  c(letters[1:5]) %>%
  gsub(wdc_regex, "_", ., perl = TRUE)
   <    >    :    "    /   \\    |    ?    *                          
 "_"  "_"  "_"  "_"  "_" "\\"  "_"  "_"  "_"  "a"  "b"  "c"  "d"  "e" 
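To see why the doubled backslash matters, it can help to print both patterns as the regex engine receives them. The following is a minimal base-R sketch (no magrittr needed); the names `escaped`, `unescaped`, and `backslash` are illustrative and do not appear in the original code.

```r
# Pattern as built WITH the sub() step: the class contains an escaped backslash
escaped <- "[<>:\"/\\\\|?*]"
# Pattern as built WITHOUT the sub() step: "\|" is read as an escaped "|"
unescaped <- "[<>:\"/\\|?*]"

# writeLines() shows the characters the regex engine actually sees
writeLines(escaped)
writeLines(unescaped)

# A single backslash is written "\\" in R source
backslash <- "\\"
grepl(escaped, backslash, perl = TRUE)    # the class includes a backslash
grepl(unescaped, backslash, perl = TRUE)  # the class does not
```

In the second pattern the lone backslash is consumed as an escape for the `|` that follows it, so the class never contains a backslash of its own. That is consistent with the output above, where `\\` is left unreplaced.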

However, when I use the same strategy for characters that don't work with syntactically valid names in R, I run into a number of issues I don't understand.

  1. No modification needed to replace \: The code for Windows characters requires the call to sub("\\\\", "\\\\\\\\", .) in order to replace \ with _. However, the code below works without this step. Why is it not necessary to expand \ in the code below?
rdc_test <- c(
  "<", ">", ":", '"', "/", "\\", "|", "?", "*", "~", ",", ";", "+", "-", "`", 
  "!", "@", "#", "$", "%", "^", "&", "=", "(", ")", "'", "{", "}"
  , "[", "]"
) %>% 
  magrittr::set_names(., .)
rdc <- c(
  "<", ">", ":", '"', "/", "\\", "|", "?", "*", "~", ",", ";", "+", "-", "`", 
  "!", "@", "#", "$", "%", "^", "&", "=", "(", ")", "'", "{", "}"
  # , "[", "]"
)
rdc_regex <- rdc %>% 
  sub("\\\\", "\\\\\\\\", .) %>%
  c("[", ., "]") %>%
  paste0(collapse = "")
rdc_test %>% 
  c(letters[1:5]) %>%
  gsub(rdc_regex, "_", ., perl = TRUE)
  <   >   :   "   /  \\   |   ?   *   ~   ,   ;   +   -   `   !   @   #   $   %   ^   &   =   (   )   '   {   }   [   ]                     
"_" "_" "_" "_" "_" "_" "_" "_" "_" "_" "_" "_" "_" "_" "_" "_" "_" "_" "_" "_" "_" "_" "_" "_" "_" "_" "_" "_" "_" "_" "a" "b" "c" "d" "e" 
rdc_regex <- rdc %>% 
  c("[", ., "]") %>%
  paste0(collapse = "")
rdc_test %>% 
  c(letters[1:5]) %>%
  gsub(rdc_regex, "_", ., perl = TRUE)
  <   >   :   "   /  \\   |   ?   *   ~   ,   ;   +   -   `   !   @   #   $   %   ^   &   =   (   )   '   {   }   [   ]                     
"_" "_" "_" "_" "_" "_" "_" "_" "_" "_" "_" "_" "_" "_" "_" "_" "_" "_" "_" "_" "_" "_" "_" "_" "_" "_" "_" "_" "_" "_" "a" "b" "c" "d" "e" 
  2. Square brackets [ and ] replaced even when not included in regex: in the code above, the square bracket characters [ and ] are not included in the character set defined in rdc_regex, except as the delimiters of the character set (see "Regex match any single character (one character only)"). However, the square brackets are still replaced. How is this happening?
rdc_regex
[1] "[<>:\"/\\|?*~,;+-`!@#$%^&=()'{}]"

Solution: Based on the comment from @Wiktor Stribiżew, the issue was that an unescaped - was acting as a range operator in my character class. Thus, +-` matched every character whose code point lies between + (U+002B) and ` (U+0060), which includes [, \, and ] (as well as digits and uppercase letters). This also explains why \ was replaced even without the extra sub() step. This reference (https://perldoc.perl./perlrequick) says the special characters inside a character class are -]\^$, so I've escaped all of them in the code below. I'm not sure if this is overkill, but it is currently working.
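As a side check before the escaped version that follows, the range behaviour can be verified directly from the code points. This is a small base-R sketch; `lo`, `hi`, and `in_range` are illustrative names, not part of the original code.

```r
# Code points of the two endpoints of the accidental range "+-`"
lo <- utf8ToInt("+")
hi <- utf8ToInt("`")

# Every character covered by that range, one per element
in_range <- intToUtf8(lo:hi, multiple = TRUE)

# The brackets and the backslash all fall inside the range
c("[", "\\", "]") %in% in_range

# Digits and uppercase letters are inside it too; lowercase letters are not
gsub("[+-`]", "_", c("A1", "abc"), perl = TRUE)
```

The test vector in the question contains only punctuation and lowercase letters, which may be why the replacement of digits and uppercase letters went unnoticed.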

rdc_test <- c(
  "<", ">", ":", '"', "/", "\\", "|", "?", "*", "~", ",", ";", "+", "-", "`", 
  "!", "@", "#", "$", "%", "^", "&", "=", "(", ")", "'", "{", "}"
  , "[", "]"
) %>% 
  magrittr::set_names(., .)
rdc <- c(
  "<", ">", ":", '"', "/", "\\", "|", "?", "*", "~", ",", ";", "+", "\\\\-", "`", 
  "!", "@", "#", "\\\\$", "%", "\\\\^", "&", "=", "(", ")", "'", "{", "}"
  , "[", "\\\\]"
)
rdc_regex <- rdc %>% 
  sub("\\\\", "\\\\\\\\", .) %>%
  c("[", ., "]") %>%
  paste0(collapse = "")
rdc_test %>% 
  c(letters[1:5]) %>%
  gsub(rdc_regex, "_", ., perl = TRUE)
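An alternative to hand-escaping individual elements is to keep the character list as plain literals and add the escapes programmatically. The sketch below assumes every element is a single character. `make_char_class()`, `rdc_chars`, `rdc_class`, and `cleaned` are hypothetical names introduced here for illustration, and the set of escaped characters is a conservative choice rather than the minimal one.

```r
# Hypothetical helper (not from the original post): build a PCRE character
# class from a vector of single literal characters.
make_char_class <- function(chars) {
  # Characters that can be special inside a class: \ ] [ ^ -
  special <- c("\\", "]", "[", "^", "-")
  # Prefix each special character with one backslash; leave the rest as-is
  escaped <- ifelse(chars %in% special, paste0("\\", chars), chars)
  paste0("[", paste0(escaped, collapse = ""), "]")
}

rdc_chars <- c(
  "<", ">", ":", "\"", "/", "\\", "|", "?", "*", "~", ",", ";", "+", "-", "`",
  "!", "@", "#", "$", "%", "^", "&", "=", "(", ")", "'", "{", "}", "[", "]"
)
rdc_class <- make_char_class(rdc_chars)

# Print the pattern as the regex engine sees it
writeLines(rdc_class)

# Only the listed characters are replaced; digits and letters are kept
cleaned <- gsub(rdc_class, "_", c("a-b", "x[1]", "C:\\tmp", "Keep9"), perl = TRUE)
cleaned
```

Because the hyphen is escaped, no range is formed, so inputs such as `Keep9` pass through unchanged.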

Asked Mar 20 at 21:21 by Josh; edited Mar 20 at 22:15.
  • 5 You have an unescaped - in your character class. It must be "[<>:\"/\\\\|?*~,;+\\-`!@#$%^&=()'{}]" – Wiktor Stribiżew Commented Mar 20 at 21:27

1 Answer


Use chartr. Also note that R (4.0.0 and later) supports the raw-string notation r"{...}" for string constants, in which backslash escapes are not interpreted. As a comment below this answer points out, if - is included in bad it should be placed at the end, because it has a special meaning (denoting a range) when it appears between two characters.

bad <- r"{"<>:\"/\|?*}"
chartr(bad, strrep("_", nchar(bad)), r"{x":[\y}")
## [1] "x__[_y"

This variation also works:

chartr(bad, gsub(".", "_", bad), r"{x":[\y}")
## [1] "x__[_y"
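The point about hyphen placement can be illustrated with a small sketch. It assumes R 4.0.0 or later for the raw-string syntax; the character list and the input string are made up for this example, and `out` is an illustrative name.

```r
# Hyphen placed last so chartr() treats it as a literal character
bad <- r"{<>:"/\|?*+-}"

# Translate every listed character to "_"
out <- chartr(bad, strrep("_", nchar(bad)), r"{a+b-c\d}")
out
```

If the hyphen sat between two other characters in `bad`, chartr would instead read the three characters as a range specification.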