I have some text that is a mix of English, Japanese (kanji and kana), and furigana (in the form of HTML ruby/rb/rt elements). I want to match each of those groups with a single .NET Regex and replace each match differently depending on whether it is an English, Japanese, or furigana match. I thought it would be easy, since for a simple test string like
abc 日本日<ruby>aac a<ruby>bc
I can use a regex like
(a|日|<ruby>)
which matches "a" or "日" or "<ruby>". That works just fine, so I figured I could substitute three more complicated expressions to match English, Japanese, or ruby HTML elements. I came up with the following:
First, for the ruby text, whose full form is
<ruby><rb>kanji</rb><rt>kana</rt></ruby>
I came up with
(?<rubyElement><ruby>.*?</ruby>)
which works perfectly.
Second, for the English, I tried the printable-ASCII range
(?<ascii>[\x20-\x7E]*)
but that also captured the < from the ruby text, so I changed it to
(?<ascii>([\x20-\x3B]|[\x3D-\x7E])*)
which simply leaves out the "<" character. I joined that to the rubyElement capture group above with a pipe, to say "find either a ruby HTML element or English", and it worked like a charm:
(?<rubyElement><ruby>.*?</ruby>)|(?<ascii>([\x20-\x3B]|[\x3D-\x7E])*)
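For reference, the per-match replace described above can be sketched as follows. This is only an illustration under stated assumptions: it uses Java's java.util.regex as a stand-in for .NET (the two share the (?<name>...) group syntax), and the class name, the tagMatches helper, the sample string, and the [RUBY] / [ASCII:...] markers are all made up for the example.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class RubyOrAscii {
    // The two-way pattern from the post, unchanged apart from Java string escaping.
    static final Pattern PATTERN = Pattern.compile(
        "(?<rubyElement><ruby>.*?</ruby>)|(?<ascii>([\\x20-\\x3B]|[\\x3D-\\x7E])*)");

    // Replaces each match with a marker chosen by the named group that matched.
    // The marker strings are hypothetical placeholders for the real replacements.
    static String tagMatches(String input) {
        Matcher m = PATTERN.matcher(input);
        StringBuffer out = new StringBuffer();
        while (m.find()) {
            String replacement;
            if (m.group("rubyElement") != null) {
                replacement = "[RUBY]";
            } else if (!m.group("ascii").isEmpty()) {
                replacement = "[ASCII:" + m.group("ascii") + "]";
            } else {
                // The ascii group ends in *, so it can also match zero characters;
                // such a match is replaced with nothing here.
                replacement = "";
            }
            m.appendReplacement(out, Matcher.quoteReplacement(replacement));
        }
        m.appendTail(out);
        return out.toString();
    }

    public static void main(String[] args) {
        // Kanji and kana are written as escapes so the source file stays ASCII-only.
        String sample = "abc <ruby><rb>\u65E5</rb><rt>\u306B</rt></ruby> def";
        System.out.println(tagMatches(sample)); // prints [ASCII:abc ][RUBY][ASCII: def]
    }
}
```

In .NET the same dispatch would presumably live in a MatchEvaluator passed to Regex.Replace, checking each named group's Success property instead of comparing against null.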
From here, I thought I could just add another pipe to "or" in one more condition and I'd be done. So I tried the Unicode range for kanji by itself
(?<other>[\u3402-\uFA6D]*)
and it found the kanji. But then I piped it onto the end of the previous expression:
(?<rubyElement><ruby>.*?</ruby>)|(?<ascii>([\x20-\x3B]|[\x3D-\x7E])*)|(?<other>[\u3402-\uFA6D]*)
and it didn't work. I thought maybe it needed to be in the form (a|b|c), so I wrapped the entire thing in parentheses, and that didn't work either. English and furigana were still found, but not kanji... and I still need to add ranges for hiragana and katakana to be included with the kanji match.
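One way to see what the engine does with the three-way expression is to log, for every match, where it starts and which named group captured it. The sketch below does that in Java's java.util.regex, used here as a stand-in on the assumption that it treats alternation and zero-length matches the same way .NET does for this pattern. The class name, the trace helper, and the sample string are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class ThreeWayTrace {
    // The three-way pattern from the post, unchanged apart from Java string escaping.
    static final Pattern PATTERN = Pattern.compile(
        "(?<rubyElement><ruby>.*?</ruby>)"
        + "|(?<ascii>([\\x20-\\x3B]|[\\x3D-\\x7E])*)"
        + "|(?<other>[\\u3402-\\uFA6D]*)");

    static final String[] GROUPS = {"rubyElement", "ascii", "other"};

    // Returns one "start:group=[text]" entry per match, in the order found.
    static List<String> trace(String input) {
        List<String> log = new ArrayList<>();
        Matcher m = PATTERN.matcher(input);
        while (m.find()) {
            for (String name : GROUPS) {
                // A group in an alternative that was not taken reports null.
                if (m.group(name) != null) {
                    log.add(m.start() + ":" + name + "=[" + m.group(name) + "]");
                }
            }
        }
        return log;
    }

    public static void main(String[] args) {
        // "ab", two kanji (written as escapes), then "cd".
        for (String entry : trace("ab\u65E5\u672Ccd")) {
            System.out.println(entry);
        }
    }
}
```

With this input, every entry the Java engine logs belongs to the ascii group, including zero-length entries at the two kanji positions; the other group never appears in the log.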
I'm guessing this comes from my misunderstanding of how matching on three alternatives separated by the "or" / "|" symbol works, but it sure does seem strange that things that work in isolation don't work when ORed together... especially since the proof-of-concept "three things ORed together" regex, (a|日|<ruby>), worked.
UPDATE: It seems to have to do with using a range in the first of several ORed-together alternatives. Consider the string "English日本語". These 4 regexes match both "English" and "日本語":
English|日本語
日本語|English
English|[\u3402-\uFA6D]*
日本語|[\x20-\x7E]*
But, this only matches "日本語" (note the "range" in 1st part now):
[\u3402-\uFA6D]*|English
And, this only matches "English" (again, range in 1st part):
[\x20-\x7E]*|日本語
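The six cases above can be rerun side by side with a small harness like the one below. It is a sketch in Java's java.util.regex rather than .NET, on the assumption that both engines try alternatives left to right and step forward one character after a zero-length match. The class and helper names are invented for the example. It lists every match, including zero-length ones, next to the non-empty ones.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class RangeFirstDemo {
    // "English" followed by the three kanji from the post, written as escapes
    // so the source file stays ASCII-only.
    static final String KANJI = "\u65E5\u672C\u8A9E";
    static final String INPUT = "English" + KANJI;

    // Every match the engine reports, zero-length ones included.
    static List<String> allMatches(String regex, String input) {
        List<String> found = new ArrayList<>();
        Matcher m = Pattern.compile(regex).matcher(input);
        while (m.find()) {
            found.add(m.group());
        }
        return found;
    }

    // The same list with the zero-length matches dropped.
    static List<String> nonEmptyMatches(String regex, String input) {
        List<String> found = allMatches(regex, input);
        found.removeIf(String::isEmpty);
        return found;
    }

    public static void main(String[] args) {
        String[] regexes = {
            "English|" + KANJI,
            KANJI + "|English",
            "English|[\\u3402-\\uFA6D]*",
            KANJI + "|[\\x20-\\x7E]*",
            "[\\u3402-\\uFA6D]*|English",
            "[\\x20-\\x7E]*|" + KANJI,
        };
        for (String regex : regexes) {
            System.out.println(allMatches(regex, INPUT).size()
                + " matches in total, non-empty: " + nonEmptyMatches(regex, INPUT));
        }
    }
}
```

In Java, the two range-first patterns report more matches than the others, and most of them are zero-length: the leading range with * succeeds on zero characters wherever its characters are absent, so the second alternative is not tried at that position.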