r - Identifying incorrectly nested parentheses in regex

I have a fairly basic problem with incorrectly nested parentheses in a regex, but I cannot work out which parentheses are the culprit. I am working with administrative data that contains misspellings and shortened words, and I would like to change all misspellings and shortened versions of COMPANY to the complete, correctly spelled word. I have created some example data to show the problem.

library(tidyverse)

data <- tibble(respondents.long = c("COPMANY", "COMPANY", "CO", "COMP", "CO ", "COMP ", "COMPNY", "CO#"))

The code below should change shortened or misspelled versions of COMPANY to the complete, correctly spelled version. A major goal is for the code to be as easy to replicate and extend as possible, so I have commented each subexpression of the regex using the (?#...) syntax.

data %>%
  mutate(
   respondents.long =
        # if shortened or misspelled versions of COMPANY are found or if CO is immediately followed by a # sign
        if_else(str_detect(respondents.long, regex("(?: ) (?# non-capture group; matches empty space before CO)
                                                   CO (?# matches literal CO)
                                                   (?!MPANY) (?# negative lookahead; indicates that MPANY does not follow CO)
                                                   (?:[MPANY]+(?:(?: )|$) (?# non-capture group; first choice; CO plus any combination of letters between brackets when the entire string is followed by a space or end of line)
                                                   | (?# OR operator; choices)
                                                   (?: ) (?# non-capture; second choice CO plus empty space)
                                                   | (?# OR operator; choices)
                                                   (?=#) (?# positive lookahead; third choice # immediately following CO)
                                                   | (?# OR operator; choices)
                                                   $) (?# fourth choice; CO at the end of line)",
                                                   # include "comments = T" to comment (?#) on regex sub-expressions
                                                   comments = T)),
                # replace those string
                str_replace_all(respondents.long,
                                # string to be detected
                                "(?: )CO(?!MPANY)(?:[MPANY]+(?:(?: )|$)|(?: )|(?=#)|$)",
                                # replacement
                                " COMPANY "),
                # else leave as is
                respondents.long))

The base code has worked for other words, so I'm certain that I'm overlooking something. I also tested the regex on regex101, and it works there.
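
One possible cause, offered as a hedged sketch rather than a confirmed diagnosis: the stringr documentation says that with comments = TRUE, white space and comments beginning with # are ignored, and that literal spaces must be escaped. If that applies here, the unescaped # in (?=#) would start a comment that swallows the closing parenthesis on that line, and each (?: ) would lose its space. The example below escapes both characters. The sample vector x and the name company_pattern are made up for illustration and are not part of the original code.

```r
library(stringr)

# Hypothetical sample strings (not the data from the question); each has a
# leading space before CO because the pattern requires one.
x <- c("A CO#", "A CO ", "A COMPANY", "A COMP")

# Assumption: in free-spacing mode a literal space is written as "\\ " and a
# literal pound sign as "\\#", so neither is dropped or read as a comment.
company_pattern <- regex(
  "\\ CO                      (?# escaped literal space, then CO)
   (?!MPANY)                  (?# CO must not be followed by MPANY)
   (?:
       [MPANY]+ (?: \\  | $ ) (?# letters from the set, then a space or the end)
     | \\                     (?# or a single literal space)
     | (?=\\#)                (?# or a lookahead for an escaped pound sign)
     | $                      (?# or the end of the string)
   )",
  comments = TRUE
)

# Reuse the same pattern object for detection and replacement so the two
# cannot drift apart when the pattern is extended later.
detected <- str_detect(x, company_pattern)
replaced <- str_replace_all(x, company_pattern, " COMPANY ")
```

Defining the pattern once and passing it to both str_detect() and str_replace_all() also avoids maintaining a commented and an uncommented copy of the same regex.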


