I have what is probably a basic problem with incorrectly nested parentheses in a regex, but I am having a lot of difficulty identifying which parentheses are at fault. I am working with administrative data that contains misspellings and shortened words, and I would like to change every misspelled or shortened version of COMPANY to the complete, correctly spelled word. I have created some example data to show the problem.
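library(tidyverse)  # provides tibble(), %>%, mutate(), if_else() and the stringr functions used below

data <- tibble(respondents.long = c("COPMANY", "COMPANY", "CO", "COMP", "CO ", "COMP ", "COMPNY", "CO#"))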
The code below should change shortened or misspelled versions of COMPANY to the complete, correctly spelled version. A major goal is for the code to be easy to replicate and easy to extend, so I have annotated each subexpression of the regex with (?# ...) comments.
data %>%
  mutate(
    respondents.long =
      # if shortened or misspelled versions of COMPANY are found or if CO is immediately followed by a # sign
      if_else(str_detect(respondents.long, regex("(?: ) (?# non-capture group; matches empty space before CO)
        CO (?# matches literal CO)
        (?!MPANY) (?# negative lookahead; indicates that MPANY does not follow CO)
        (?:[MPANY]+(?:(?: )|$) (?# non-capture group; first choice; CO plus any combination of letters between brackets when the entire string is followed by a space or end of line)
        | (?# OR operator; choices)
        (?: ) (?# non-capture; second choice CO plus empty space)
        | (?# OR operator; choices)
        (?=#) (?# positive lookahead; third choice # immediately following CO)
        | (?# OR operator; choices)
        $) (?# fourth choice; CO at the end of line)",
                                                 # include "comments = T" to comment (?#) on regex sub-expressions
                                                 comments = T)),
              # replace those strings
              str_replace_all(respondents.long,
                              # string to be detected
                              "(?: )CO(?!MPANY)(?:[MPANY]+(?:(?: )|$)|(?: )|(?=#)|$)",
                              # replacement
                              " COMPANY "),
              # else leave as is
              respondents.long))
The same base code has worked for other words, so I'm certain that I'm overlooking something. I also tested the regex on regex101, where it works.
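One thing I may be overlooking: the stringr documentation for regex() says that with comments = TRUE, spaces and new lines are ignored, as is everything after #. If that applies here, the unescaped # in (?=#) would start a comment and swallow the closing parenthesis, and the (?: ) groups would match nothing. Below is a reduced sketch to test that idea. It makes a few assumptions that are not in my original code: the leading-space group is dropped because the example data has no leading space, the literal space is written as \\x20, the literal hash is written as \\#, and the names x, commented and one_line are only for illustration.

```r
library(stringr)

# Reduced test vector (a subset of the example data)
x <- c("CO#", "CO ", "COMP", "COMPANY")

# Free-spacing version: the literal space and the literal # are escaped,
# since unescaped whitespace is ignored and an unescaped # starts a comment
commented <- regex("
  CO                            # literal CO
  (?!MPANY)                     # not followed by MPANY
  (?:
      [MPANY]+ (?: \\x20 | $ )  # letters of COMPANY, then a space or the end
    | \\x20                     # CO followed by a space
    | (?=\\#)                   # CO followed by a literal hash sign
    | $                         # CO at the end of the string
  )
", comments = TRUE)

# The same pattern on one line, without the comments flag
one_line <- "CO(?!MPANY)(?:[MPANY]+(?: |$)| |(?=#)|$)"

# If escaping is the whole story, both versions should flag the same rows
res_commented <- str_detect(x, commented)
res_one_line  <- str_detect(x, one_line)
identical(res_commented, res_one_line)
```

If the two results agree, that would suggest the nesting error comes from the unescaped # and spaces rather than from the parentheses themselves.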