This is my input:
"Once there was a (so-called) rock. it.,was not! in fact, a big rock."
I need it to output an array that looks like this
["Once", " ", "there", " ", "was", " ", "a", ",", "so", " ", "called", ",", "rock", ".", "it", ".", "was", " ", "not", ".", "in", " ", "fact", ",", "a", " ", "big", " ", "rock"]
There are some rules that the input needs to go through to make the punctuation be like this. These are how the rules go:
spaceDelimiters = " -_"
commaDelimiters = ",():;\""
periodDelimiters = ".!?"
If a character is one of the spaceDelimiters, it should be replaced with a space. The same goes for the comma and period delimiters. Comma has priority over space, and period has priority over comma.
I got to the point where I can remove all of the delimiter characters, but I need them to appear as separate elements of the array. There also needs to be a hierarchy, with periods overriding commas and commas overriding spaces.
Maybe my approach is just wrong? This is what I've got:
import re

def split(string, delimiters):
    regex_pattern = '|'.join(map(re.escape, delimiters))
    return re.split(regex_pattern, string)
Which ends up doing everything wrong. It's not even close
asked Nov 19, 2024 at 11:51 by zealantanner
2 Answers
Use the re library to split the text on word boundaries, then apply the replacements in order of precedence:
import re

s = "Once there was a (so-called) rock. it.,was not! in fact, a big rock."

# split the string into tokens along word boundaries
# (splitting on an empty match such as \b needs Python 3.7+)
regex = r"\b"
l = re.split(regex, s)

def replaceDelimeters(token: str):
    # each pattern matches a whole token that contains at least one delimiter of its class
    spaceDelimiters = r"[^- _]*[- _]+[^- _]*"
    commaDelimiters = r"[^,():;\"]*[,():;\"]+[^,():;\"]*"
    periodDelimiters = r"[^.!?]*[.!?]+[^.!?]*"
    # substitute the replacement, highest precedence first
    token = re.sub(periodDelimiters, ".", token)
    token = re.sub(commaDelimiters, ",", token)
    token = re.sub(spaceDelimiters, " ", token)
    return token

# apply to every non-empty token
print([replaceDelimeters(token) for token in l if token != ""])
This method returns "." as the last entry of the list. Your desired output omits it, but your rules imply it should be there. If you don't want it, deleting the last entry when it is a period should be easy enough.
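If the trailing "." is unwanted, one possible way to drop it is sketched below. The helper name drop_trailing_period is made up for illustration and is not part of the code above; it assumes the token list has already been built.

```python
def drop_trailing_period(tokens):
    """Return a copy of tokens without a final "." entry, if there is one."""
    # hypothetical helper: only the very last entry is inspected
    if tokens and tokens[-1] == ".":
        return tokens[:-1]
    return list(tokens)

print(drop_trailing_period(["a", " ", "big", " ", "rock", "."]))
# ['a', ' ', 'big', ' ', 'rock']
```

A period in the middle of the list is left alone, so sentence boundaries inside the text are preserved.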
You can do it with a single regular expression.
Define your rules in precedence order (from lowest to highest) with the replacement character as the initial character:
rules = {
    "space": " _-",  # put - last in the rule
    "comma": ",():;\"",
    "period": ".!?",
}
Then build a regular expression from alternatives. The first alternative matches one or more characters that belong to no rule. Each remaining alternative matches a run that contains at least one character of a given rule, surrounded by any number of characters from that rule and from any lower-precedence rules. The alternatives are ordered so that the highest-precedence rule is tried first:
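import re
from collections import deque

prev = ""  # characters of the current rule plus all lower-precedence rules
rule_patterns = deque()
for name, rule in rules.items():
    prev = rule + prev
    # appendleft puts higher-precedence rules earlier in the alternation
    rule_patterns.appendleft(f"(?P<{name}>[{prev}]*?[{rule}][{prev}]*)")
rule_patterns.appendleft(f"(?P<other>[^{prev}]+)")
pattern = re.compile("|".join(rule_patterns))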
Which generates the pattern (?P<other>[^.!?,():;" _-]+)|(?P<period>[.!?,():;" _-]*?[.!?][.!?,():;" _-]*)|(?P<comma>[,():;" _-]*?[,():;"][,():;" _-]*)|(?P<space>[ _-]*?[ _-][ _-]*)
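To see which rule claims each run of characters, the named groups can be inspected with match.lastgroup. This is only a sketch for exploring the pattern: the pattern is copied from the line above, while the classify helper and the short sample string are made up for illustration.

```python
import re

# pattern copied verbatim from the text above
pattern = re.compile(
    r'(?P<other>[^.!?,():;" _-]+)'
    r'|(?P<period>[.!?,():;" _-]*?[.!?][.!?,():;" _-]*)'
    r'|(?P<comma>[,():;" _-]*?[,():;"][,():;" _-]*)'
    r'|(?P<space>[ _-]*?[ _-][ _-]*)'
)

def classify(text):
    """Pair each match with the name of the group that matched it (hypothetical helper)."""
    return [(m.lastgroup, m.group()) for m in pattern.finditer(text)]

print(classify("it.,was (so-called)"))
# [('other', 'it'), ('period', '.,'), ('other', 'was'), ('comma', ' ('),
#  ('other', 'so'), ('space', '-'), ('other', 'called'), ('comma', ')')]
```

The run ".," is claimed by the period group even though it contains a comma, and " (" is claimed by the comma group even though it starts with a space.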
Then given your value:
value = "Once there was a (so-called) rock. it.,was not! in fact, a big rock."
You can find all the matches and, where a rule is matched, output the first character of that rule instead:
matches = [
    next(
        (rule[0] for name, rule in rules.items() if match.group(name)),
        match.group("other")
    )
    for match in pattern.finditer(value)
]
print(matches)
Outputs:
['Once', ' ', 'there', ' ', 'was', ' ', 'a', ',', 'so', ' ', 'called', ',', 'rock', '.', 'it', '.', 'was', ' ', 'not', '.', 'in', ' ', 'fact', ',', 'a', ' ', 'big', ' ', 'rock', '.']
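The "put - last in the rule" comment matters because an unescaped - between two characters inside a character class denotes a range, so the question's original ordering " -_" would match every character from space to underscore. A possible variation, sketched here under the assumption that the same rules layout is kept, escapes each rule with re.escape so the ordering no longer matters. The names build_pattern and tokenize are made up for this example.

```python
import re
from collections import deque

def build_pattern(rules):
    """Build the alternation from rules ordered lowest to highest precedence."""
    prev = ""
    parts = deque()
    for name, chars in rules.items():
        escaped = re.escape(chars)  # makes "-" literal wherever it appears
        prev = escaped + prev
        parts.appendleft(f"(?P<{name}>[{prev}]*?[{escaped}][{prev}]*)")
    parts.appendleft(f"(?P<other>[^{prev}]+)")
    return re.compile("|".join(parts))

def tokenize(text, rules):
    """Replace each delimiter run with the first character of the rule that matched."""
    pattern = build_pattern(rules)
    return [
        m.group() if m.lastgroup == "other" else rules[m.lastgroup][0]
        for m in pattern.finditer(text)
    ]

# the question's original ordering, with "-" in the middle of the space rule
rules = {"space": " -_", "comma": ",():;\"", "period": ".!?"}
tokens = tokenize("Once there was a (so-called) rock. it.,was not! in fact, a big rock.", rules)
print(tokens[-3:])
# [' ', 'rock', '.']
```

The replacement is still the first character of each rule, so that character has to stay first even though the position of - is now free.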