最新消息:雨落星辰是一个专注网站SEO优化、网站SEO诊断、搜索引擎研究、网络营销推广、网站策划运营及站长类的自媒体原创博客

Regex to extract text between a string and either a single character or another string (POSIX for snowflake regexp_substr) - Sta

programmeradmin0浏览0评论

Since Snowflake only supports POSIX regex instead of PCRE*, the solutions I've found don't work.

I'm looking to extract the text between TOKEN: and either ( or a few possible strings: TOKEN, TOKN.

I think the issue I'm having is in the ([^\\(]+) capturing group, where I need it to capture until either ( or one of the token strings. The solutions I found online utilize a positive look ahead, which Snowflake doesn't support. (Those solutions throw the error no argument for repetition operator: ?)

With the correct regex, the following code should return get this text for tkn1 and get this text also for tkn2, and nulls for tkn3 and tkn4.

Anyone got any ideas? Also open to alternate approaches, thanks!

with
 d as (select 'TOKEN: get this text (subtitle)TOKN: skip this text alsoTOKEN: get this textTOKN: not this one' as data),
 r as (select 'TOKEN: ([^\\(]+)(\\(|TOKEN|TOKN|$)' as reg),
 p as (
    select
        regexp_substr(data, r.reg, 1, 1, 'e') as tkn1,
        regexp_substr(data, r.reg, 1, 2, 'e') as tkn2,
        regexp_substr(data, r.reg, 1, 3, 'e') as tkn3,
        regexp_substr(data, r.reg, 1, 4, 'e') as tkn4
    from d
    join r 
 )
select * from p;

Edit: The regex had a typo, missing bar before end of line match: (\\(|TOKEN|TOKN$) --> (\\(|TOKEN|TOKN|$).

Since Snowflake only supports POSIX regex instead of PCRE*, the solutions I've found don't work.

I'm looking to extract the text between TOKEN: and either ( or a few possible strings: TOKEN, TOKN.

I think the issue I'm having is in the ([^\\(]+) capturing group, where I need it to capture until either ( or one of the token strings. The solutions I found online utilize a positive look ahead, which Snowflake doesn't support. (Those solutions throw the error no argument for repetition operator: ?)

With the correct regex, the following code should return get this text for tkn1 and get this text also for tkn2, and nulls for tkn3 and tkn4.

Anyone got any ideas? Also open to alternate approaches, thanks!

with
 d as (select 'TOKEN: get this text (subtitle)TOKN: skip this text alsoTOKEN: get this textTOKN: not this one' as data),
 r as (select 'TOKEN: ([^\\(]+)(\\(|TOKEN|TOKN|$)' as reg),
 p as (
    select
        regexp_substr(data, r.reg, 1, 1, 'e') as tkn1,
        regexp_substr(data, r.reg, 1, 2, 'e') as tkn2,
        regexp_substr(data, r.reg, 1, 3, 'e') as tkn3,
        regexp_substr(data, r.reg, 1, 4, 'e') as tkn4
    from d
    join r 
 )
select * from p;

Edit: The regex had a typo, missing bar before end of line match: (\\(|TOKEN|TOKN$) --> (\\(|TOKEN|TOKN|$).

Share Improve this question edited Feb 7 at 16:50 bbg asked Feb 6 at 16:42 bbgbbg 617 bronze badges 4
  • A new approach is needed, as POSIX ERE does not support the ? greedy qualifier. Even if we repaired the regex, your first extracted field on the first match would be what you think of as the first three tokens, matching from the first occurrence of TOKEN: to the second (the last) occurrence of TOKN:. Split on 'TOKEN: ' and then regex extract up to the next ( or end of string. Disclaimer: untested. – pilcrow Commented Feb 6 at 21:38
  • if you need real regex, use a UDF to do it: snowflake.pavlik.us/index.php/2020/11/20/… – Simeon Pilgrim Commented Feb 7 at 1:49
  • I'll play around with the multiple regex approach. This will be in dbt so I should be able to make a macro for it, however many steps I wind up needing. Good to know I'll need to look at an alternate approach. – bbg Commented Feb 7 at 17:03
  • As an aside, you mean to say [^(]+ (match anything other than left parenthesis) and not [^\(]+ (match anything other than left parenthesis or backslash). You don't escape parentheses inside regex character classes. However, this does not solve your fundamental problem. – pilcrow Commented Feb 7 at 17:22
Add a comment  | 

2 Answers 2

Reset to default 1

the dollar sign is saying "match the last character" where you are not wanting that:

with d as (
 select $1 as data
 from values 
 ('TOKEN: get this text (subtitle)TOKN: skip this text alsoTOKEN: get this textTOKN: not this one')
), r as (
    select $1 as reg
    from values
    ('TOKEN: ([^\\(]+)(\\(|TOKEN|TOKN$)'),
    ('TOKEN: ([^\\(]+)(\\(|TOKEN|TOKN)')
)
select
    regexp_substr(data, r.reg, 1, 1, 'e') as tkn1,
    regexp_substr(data, r.reg, 1, 2, 'e') as tkn2,
    regexp_substr(data, r.reg, 1, 3, 'e') as tkn3,
    --regexp_substr(data, r.reg, 1, 4, 'e') as tkn4
from d
join r 

gives:

TKN1 TKN2 TKN3
get this text null null
get this text get this text null

Maybe try dollar quoted string constants as explained here: https://docs.snowflake.com/en/sql-reference/data-types-text#label-dollar-quoted-string-constants

This expression worked when testing against your string on https://regexr.com/

(\w\s?)+(?=(\(\w+\))?([T].{1,3}[N]\:))

According to the documentation you'd use

$$(\w\s?)+(?=(\(\w+\))?([T].{1,3}[N]\:))$$

Seems strange to exclude parts of regex.

与本文相关的文章

发布评论

评论列表(0)

  1. 暂无评论