I have a table that has a text document as one of its columns. These documents can run to a few hundred words and may contain special characters. I want to find out whether any of these documents (or excerpts within them) are in languages other than English. I do not want to replace these characters; I just want to know how many of the documents contain other languages.
I understand there are many languages, so this might be hard to tackle in general. Primarily, I want to detect accented characters (e.g. 'München' or 'Café'). If possible, detecting Chinese characters as well would be great.
I tried the solution below; however, it seems too broad, i.e. it goes beyond accented characters and flags other special characters that I don't care about:
SELECT id,
regexp_instr(text_document, '[^[:ascii:]]') > 0 as has_non_ascii,
text_document
FROM snowflake_table
1 Answer
You can apply a more specific regular expression, for example a SQL UDF that looks only for accented Latin letters (the trailing 'i' argument to REGEXP_INSTR makes the match case-insensitive, so uppercase letters such as 'É' or 'Ü' are caught too):
CREATE OR REPLACE FUNCTION detect_accented_chars(input_text STRING)
RETURNS BOOLEAN
AS
$$
regexp_instr(input_text, '[áàâäãåāąăçćčďđéèêëēęěíìîïīįıłńňñóòôöōőøŕřśšşťţúùûüūůűųýÿźżž]', 1, 1, 0, 'i') > 0
$$;
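The function can then be used to count the affected documents; a minimal usage sketch, reusing the snowflake_table and text_document names from the question:
SELECT COUNT(*) AS docs_with_accents
FROM snowflake_table
WHERE detect_accented_chars(text_document);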
Or you can use a Python UDF that checks each character's Unicode name:
create or replace function detect_accented_chars(input_text STRING) returns boolean
language python
runtime_version = '3.11'
-- unicodedata is part of the Python standard library, so no PACKAGES clause is needed
handler = 'has_accented_chars'
as
$$
import unicodedata

def has_accented_chars(text):
    # Only inspect non-ASCII characters, so plain ASCII punctuation such as
    # the backtick (Unicode name "GRAVE ACCENT") is not flagged by mistake.
    # Accented Latin letters have names like "LATIN SMALL LETTER U WITH DIAERESIS".
    return any(
        "WITH" in unicodedata.name(char, "")
        or "ACUTE" in unicodedata.name(char, "")
        or "GRAVE" in unicodedata.name(char, "")
        or "DIAERESIS" in unicodedata.name(char, "")
        for char in text
        if ord(char) > 127
    )
$$;
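The question also asks about Chinese characters. A similar Python UDF can test for code points in the CJK Unified Ideographs block (U+4E00 through U+9FFF, which covers the vast majority of common Chinese characters); this is a minimal sketch, and the name detect_cjk_chars is not part of the original answer:
create or replace function detect_cjk_chars(input_text STRING) returns boolean
language python
runtime_version = '3.11'
handler = 'has_cjk_chars'
as
$$
def has_cjk_chars(text):
    # True if any character falls in the CJK Unified Ideographs block.
    return any(0x4E00 <= ord(char) <= 0x9FFF for char in text)
$$;
The same pattern as before, WHERE detect_cjk_chars(text_document), then gives the count of documents containing Chinese text.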