I have a table that has a text document as one of its columns. These documents can run to a few hundred words and may contain special characters. I want to find out whether any of these documents (or excerpts within them) are in languages other than English. I do not want to replace these characters; I just want to know how many of the documents contain other languages.
I understand there are many languages, so this might be hard to tackle in general. Primarily, I want to detect accented characters (e.g. 'München' or 'Café'). If possible, detecting Chinese characters as well would be great.
I tried the solution below; however, it seems too broad, i.e. it goes beyond accented characters and flags other special characters that I don't care about:
SELECT id,
regexp_instr(text_document, '[^[:ascii:]]') > 0 as has_non_ascii,
text_document
FROM snowflake_table
1 Answer
You can apply a more specific regular expression, for example a SQL UDF that looks only for accented Latin letters (the trailing 'i' argument to REGEXP_INSTR makes the match case-insensitive, so uppercase letters such as 'É' or 'Ü' are caught too):
CREATE OR REPLACE FUNCTION detect_accented_chars(input_text STRING)
RETURNS BOOLEAN
AS
$$
regexp_instr(input_text, '[áàâäãåāąăçćčďđéèêëēęěíìîïīįıłńňñóòôöōőøŕřśšşťţúùûüūůűųýÿźżž]', 1, 1, 0, 'i') > 0
$$;
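The function can then be used to count the affected documents; a minimal usage sketch, reusing the snowflake_table and text_document names from the question:
SELECT COUNT(*) AS docs_with_accents
FROM snowflake_table
WHERE detect_accented_chars(text_document);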
Or you can use a Python UDF that checks each character's Unicode name:
create or replace function detect_accented_chars(input_text STRING) returns boolean
language python
runtime_version = '3.11'
-- unicodedata is part of the Python standard library, so no PACKAGES clause is needed
handler = 'has_accented_chars'
as
$$
import unicodedata

def has_accented_chars(text):
    # Only inspect non-ASCII characters, so plain ASCII punctuation such as
    # the backtick (Unicode name "GRAVE ACCENT") is not flagged by mistake.
    # Accented Latin letters have names like "LATIN SMALL LETTER U WITH DIAERESIS".
    return any(
        "WITH" in unicodedata.name(char, "")
        or "ACUTE" in unicodedata.name(char, "")
        or "GRAVE" in unicodedata.name(char, "")
        or "DIAERESIS" in unicodedata.name(char, "")
        for char in text
        if ord(char) > 127
    )
$$;
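The question also asks about Chinese characters. A similar Python UDF can test for code points in the CJK Unified Ideographs block (U+4E00 through U+9FFF, which covers the vast majority of common Chinese characters); this is a minimal sketch, and the name detect_cjk_chars is not part of the original answer:
create or replace function detect_cjk_chars(input_text STRING) returns boolean
language python
runtime_version = '3.11'
handler = 'has_cjk_chars'
as
$$
def has_cjk_chars(text):
    # True if any character falls in the CJK Unified Ideographs block.
    return any(0x4E00 <= ord(char) <= 0x9FFF for char in text)
$$;
The same pattern as before, WHERE detect_cjk_chars(text_document), then gives the count of documents containing Chinese text.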