Find different language characters in Snowflake SQL - Stack Overflow

I have a table in which one of the columns holds a text document. These documents can have about 500-100 words and may contain special characters. I want to find out whether any of these documents (or excerpts within them) are in languages other than English. I do not want to replace these characters; I just want to know how many of the documents contain other languages.

I understand that there are many languages, so this may be a hard problem to tackle. Primarily, I want to detect accented characters (e.g. 'München' or 'Café'). If possible, detecting Chinese characters would also be great.

I tried the solution below; however, it is too broad, i.e. it matches more than accented characters and includes other special characters that I don't care about:

SELECT id,
       regexp_instr(text_document, '[^[:ascii:]]') > 0 as has_non_ascii,
       text_document
FROM snowflake_table

Asked Mar 4 at 21:43 by Kamal T; edited Mar 4 at 22:57 by Dale K

1 Answer

You can use a more specific regular expression, for example:

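-- Matches lowercase accented Latin letters only; uppercase forms such as 'É' are not in the class.
CREATE OR REPLACE FUNCTION detect_accented_chars(input_text STRING)
RETURNS BOOLEAN
AS
$$
    regexp_instr(input_text, '[áàâäãåāąăçćčďđéèêëēęěíìîïīįıłńňñóòôöōőøŕřśšşťţúùûüūůűųýÿźżž]') > 0
$$;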

Or you can use a Python function:

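create or replace function detect_accented_chars(input_text STRING) returns boolean
language python
runtime_version = '3.11'
handler = 'has_accented_chars'
as 
$$
# unicodedata is part of the Python standard library, so no extra package is needed.
import unicodedata

def has_accented_chars(text):
    # A NULL input arrives as None; treat it as "no accented characters".
    if text is None:
        return False
    for char in text:
        # Precomposed accented letters are named "... LETTER X WITH <mark>".
        # Requiring isalpha() skips symbols and emoji whose names contain "WITH",
        # and avoids matching the ASCII backtick, whose name is "GRAVE ACCENT".
        if char.isalpha() and " WITH " in unicodedata.name(char, ""):
            return True
        # Combining marks cover decomposed forms such as "e" + U+0301.
        if unicodedata.combining(char):
            return True
    return False
$$;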
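
For the Chinese part of the question and the "how many documents" count, here is a minimal sketch in plain Python that can be tried locally before wrapping anything in a UDF. It is not taken from the answer above: it swaps the name-based check for NFD decomposition (an ASCII letter followed by a combining mark) and detects Chinese with two Unicode ranges, CJK Unified Ideographs and Extension A. The names `has_cjk`, `has_accented_letter`, and the `documents` list are hypothetical. Two caveats: letters that do not decompose, such as 'ł' or 'ø', are not caught, and the ranges also match Japanese kanji, so a hit means "contains Han characters" rather than "is Chinese".

```python
import re
import unicodedata

# CJK Unified Ideographs plus Extension A; other CJK blocks are deliberately left out.
CJK_PATTERN = re.compile(r"[\u3400-\u4dbf\u4e00-\u9fff]")


def has_cjk(text: str) -> bool:
    """Return True if text contains at least one character from the ranges above."""
    return bool(text) and CJK_PATTERN.search(text) is not None


def has_accented_letter(text: str) -> bool:
    """Return True if an ASCII letter is directly followed by a combining mark after NFD."""
    if not text:
        return False
    decomposed = unicodedata.normalize("NFD", text)
    return any(
        unicodedata.combining(mark) and base.isascii() and base.isalpha()
        for base, mark in zip(decomposed, decomposed[1:])
    )


# Hypothetical sample rows standing in for the text_document column.
documents = [
    "Plain English text with a `backtick` and a dash - nothing else.",
    "We met in München for a Café visit.",
    "这是一个中文句子。",
    "Curly “quotes” are non-ASCII but not accented letters.",
]

accented_count = sum(has_accented_letter(doc) for doc in documents)
cjk_count = sum(has_cjk(doc) for doc in documents)
print(accented_count, cjk_count)  # prints: 1 1
```

The two function bodies could presumably serve as Python UDF handlers in the same way as the answer's function, leaving the per-table count to SQL.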