I am trying to use regex in R to replace forbidden characters. I know base::make.names can do some of this, but I want to control the replacement character. I have successfully figured out how to do this for Windows file name forbidden characters, as long as \ is expanded to accommodate the needs of gsub:
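library(magrittr)  # provides %>% and set_names()

# Characters forbidden in Windows file names, named by themselves for readable output
wdc <- c(
  "<", ">", ":", '"', "/", "\\", "|", "?", "*"
) %>%
  magrittr::set_names(., .)
# Double the backslash so it is a literal \ inside the character class
wdc_regex <- wdc %>%
  sub("\\\\", "\\\\\\\\", .) %>%
  c("[", ., "]") %>%
  paste0(collapse = "")
wdc %>%
  c(letters[1:5]) %>%
  gsub(wdc_regex, "_", ., perl = TRUE)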
< > : " / \\ | ? *
"_" "_" "_" "_" "_" "_" "_" "_" "_" "a" "b" "c" "d" "e"
# Same pipeline without the sub() step: the backslash is no longer replaced
wdc_regex <- wdc %>%
  c("[", ., "]") %>%
  paste0(collapse = "")
wdc %>%
  c(letters[1:5]) %>%
  gsub(wdc_regex, "_", ., perl = TRUE)
< > : " / \\ | ? *
"_" "_" "_" "_" "_" "\\" "_" "_" "_" "a" "b" "c" "d" "e"
However, when I use the same strategy for characters that don't work with syntactically valid names in R, I run into a number of issues I don't understand.
- No modification needed to replace \: The code for Windows characters requires the call to sub("\\\\", "\\\\\\\\", .) in order to replace \ with _. However, the code below works without this step. Why is it not necessary to expand \ in the code below?
rdc_test <- c(
  "<", ">", ":", '"', "/", "\\", "|", "?", "*", "~", ",", ";", "+", "-", "`",
  "!", "@", "#", "$", "%", "^", "&", "=", "(", ")", "'", "{", "}"
  , "[", "]"
) %>%
  magrittr::set_names(., .)
rdc <- c(
  "<", ">", ":", '"', "/", "\\", "|", "?", "*", "~", ",", ";", "+", "-", "`",
  "!", "@", "#", "$", "%", "^", "&", "=", "(", ")", "'", "{", "}"
  # , "[", "]"
)
rdc_regex <- rdc %>%
  sub("\\\\", "\\\\\\\\", .) %>%
  c("[", ., "]") %>%
  paste0(collapse = "")
rdc_test %>%
  c(letters[1:5]) %>%
  gsub(rdc_regex, "_", ., perl = TRUE)
< > : " / \\ | ? * ~ , ; + - ` ! @ # $ % ^ & = ( ) ' { } [ ]
"_" "_" "_" "_" "_" "_" "_" "_" "_" "_" "_" "_" "_" "_" "_" "_" "_" "_" "_" "_" "_" "_" "_" "_" "_" "_" "_" "_" "_" "_" "a" "b" "c" "d" "e"
# Same pipeline without the sub() step: \ is still replaced
rdc_regex <- rdc %>%
  c("[", ., "]") %>%
  paste0(collapse = "")
rdc_test %>%
  c(letters[1:5]) %>%
  gsub(rdc_regex, "_", ., perl = TRUE)
< > : " / \\ | ? * ~ , ; + - ` ! @ # $ % ^ & = ( ) ' { } [ ]
"_" "_" "_" "_" "_" "_" "_" "_" "_" "_" "_" "_" "_" "_" "_" "_" "_" "_" "_" "_" "_" "_" "_" "_" "_" "_" "_" "_" "_" "_" "a" "b" "c" "d" "e"
- Square brackets [ and ] replaced even when not included in regex: in the code above, the square bracket characters [ and ] are not included in the character set defined in rdc_regex, except as delimiters of the character set (see "Regex match any single character (one character only)"). However, the square brackets are still replaced. How is this happening?
rdc_regex
[1] "[<>:\"/\\|?*~,;+-`!@#$%^&=()'{}]"
Solution
Based on the comment from @Wiktor Stribiżew, it appears the issue was that an unescaped - was acting as a range operator in my character class. Thus, +-` matched every character between + and ` in code-point order, a span that includes [, \ and ] (as well as the digits and the uppercase letters). That is why \ and the square brackets were replaced without being listed explicitly. The perlrequick reference says the special characters inside a character class are -]\^$, so I've escaped all of them in the code below. I'm not sure if this is overkill, but it is currently working.
rdc_test <- c(
  "<", ">", ":", '"', "/", "\\", "|", "?", "*", "~", ",", ";", "+", "-", "`",
  "!", "@", "#", "$", "%", "^", "&", "=", "(", ")", "'", "{", "}"
  , "[", "]"
) %>%
  magrittr::set_names(., .)
rdc <- c(
  "<", ">", ":", '"', "/", "\\", "|", "?", "*", "~", ",", ";", "+", "\\\\-", "`",
  "!", "@", "#", "\\\\$", "%", "\\\\^", "&", "=", "(", ")", "'", "{", "}"
  , "[", "\\\\]"
)
rdc_regex <- rdc %>%
  sub("\\\\", "\\\\\\\\", .) %>%
  c("[", ., "]") %>%
  paste0(collapse = "")
rdc_test %>%
  c(letters[1:5]) %>%
  gsub(rdc_regex, "_", ., perl = TRUE)
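Hand-escaping each entry works, but it is easy to miss one. As a rough alternative, the escaping can be done programmatically. The sketch below assumes a hypothetical helper, escape_for_class (not part of base R or magrittr), that puts a backslash in front of the characters that are special inside a PCRE character class, and it uses plain function calls instead of the pipe so that it runs without extra packages:

```r
# Hypothetical helper (an assumption, not a base R function): prefix every
# character that is special inside a character class with a backslash.
escape_for_class <- function(chars) {
  # Regex text: ([\[\]\\^$-])  ->  matches one of [ ] \ ^ $ -
  # Replacement text: \\\1     ->  a literal backslash followed by the match
  gsub("([\\[\\]\\\\^$-])", "\\\\\\1", chars, perl = TRUE)
}

rdc_chars <- c(
  "<", ">", ":", "\"", "/", "\\", "|", "?", "*", "~", ",", ";", "+", "-", "`",
  "!", "@", "#", "$", "%", "^", "&", "=", "(", ")", "'", "{", "}", "[", "]"
)

# Build the character class from the unescaped list
rdc_pattern <- paste0("[", paste0(escape_for_class(rdc_chars), collapse = ""), "]")
writeLines(rdc_pattern)  # shows the regex text that gsub actually receives

cleaned <- gsub(rdc_pattern, "_", c(rdc_chars, letters[1:5]), perl = TRUE)
```

With this approach the list of forbidden characters stays readable, and the hyphen no longer forms a range, so characters that are not in the list (digits, uppercase letters) are left alone.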
Asked Mar 20 at 21:21 by Josh, edited Mar 20 at 22:15
1 Answer
Use chartr. Also note that R supports r"{...}" notation for string constants, in which case escapes are ignored. Note the comment below the answer pointing out that if - is used in bad, it should be put at the end, since it has a special meaning (denoting a range) if used between characters.
bad <- r"{"<>:\"/\|?*}"
chartr(bad, strrep("_", nchar(bad)), r"{x":[\y}")
## [1] "x__[_y"
This variation also works:
chartr(bad, gsub(".", "_", bad), r"{x":[\y}")
## [1] "x__[_y"
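The remark about placing - at the end can be illustrated with a small sketch. The names bad_with_hyphen, replaced and ranged are made up for this example, and the raw-string syntax assumes R 4.0 or later:

```r
# Forbidden characters plus a hyphen, with the hyphen placed last so that
# chartr treats it as a literal character rather than as a range operator.
bad_with_hyphen <- r"{<>:"/\|?*-}"

replaced <- chartr(
  bad_with_hyphen,
  strrep("_", nchar(bad_with_hyphen)),
  r"{x":[\y-z}"
)
print(replaced)

# For contrast: a hyphen between two characters denotes a range, so "a-c"
# translates a, b and c, and leaves a literal hyphen in the input alone.
ranged <- chartr("a-c", "___", "abc-")
print(ranged)
```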
- in your character class. It must be "[<>:\"/\\\\|?*~,;+\\-`!@#$%^&=()'{}]" – Wiktor Stribiżew, commented Mar 20 at 21:27
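The range behaviour this comment refers to can be checked directly. The following is a minimal sketch using only base R; the probe vector and the variable names are chosen for illustration:

```r
# Code points bounding the accidental range +-`
utf8ToInt("+")  # 43
utf8ToInt("`")  # 96

# [, ], \, A and 7 all fall between those code points; a (97) does not
probe <- c("[", "]", "\\", "A", "7", "a")

# Unescaped hyphen: the class is the whole range from + to `
in_range <- grepl("[+-`]", probe, perl = TRUE)

# Escaped hyphen: the class is just the three characters +, - and `
escaped <- grepl("[+\\-`]", probe, perl = TRUE)

# In the short Windows class there is no such range, so the backslash is
# matched only when it is doubled in the regex text
single <- grepl("[<>:\"/\\|?*]", "\\", perl = TRUE)    # regex text: [<>:"/\|?*]
doubled <- grepl("[<>:\"/\\\\|?*]", "\\", perl = TRUE)  # regex text: [<>:"/\\|?*]

print(in_range)
print(escaped)
print(c(single = single, doubled = doubled))
```

This covers both questions: in the longer class the unescaped range already contains \, [ and ], so they are replaced whether or not \ is doubled, while in the Windows class the doubling is what puts a literal backslash in the set.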