Regex to extract text between a string and either a single character or another string (POSIX for snowflake regexp_substr)

Since Snowflake only supports POSIX regex instead of PCRE*, the solutions I've found don't work.

I'm looking to extract the text between TOKEN: and either ( or a few possible strings: TOKEN, TOKN.

I think the issue is in the ([^\\(]+) capturing group, which needs to capture until either ( or one of the token strings. The solutions I found online use a positive lookahead, which Snowflake doesn't support. (Those solutions throw the error "no argument for repetition operator: ?".)

With the correct regex, the following code should return "get this text" for tkn1, "get this text" again for tkn2, and nulls for tkn3 and tkn4.

Anyone got any ideas? Also open to alternate approaches, thanks!

with
 d as (select 'TOKEN: get this text (subtitle)TOKN: skip this text alsoTOKEN: get this textTOKN: not this one' as data),
 r as (select 'TOKEN: ([^\\(]+)(\\(|TOKEN|TOKN|$)' as reg),
 p as (
    select
        regexp_substr(data, r.reg, 1, 1, 'e') as tkn1,
        regexp_substr(data, r.reg, 1, 2, 'e') as tkn2,
        regexp_substr(data, r.reg, 1, 3, 'e') as tkn3,
        regexp_substr(data, r.reg, 1, 4, 'e') as tkn4
    from d
    join r 
 )
select * from p;

asked Feb 6 at 16:42, edited Feb 7 at 16:50, by bbg

A new approach is needed, as POSIX ERE does not support the ? non-greedy qualifier. Even if we repaired the regex, your first extracted field on the first match would be what you think of as the first three tokens, matching from the first occurrence of TOKEN: to the second (the last) occurrence of TOKN:. Split on 'TOKEN: ' and then regex extract up to the next ( or end of string. Disclaimer: untested. – pilcrow Commented Feb 6 at 21:38
If you need real regex, use a UDF to do it: snowflake.pavlik.us/index.php/2020/11/20/… – Simeon Pilgrim Commented Feb 7 at 1:49
I'll play around with the multiple regex approach. This will be in dbt so I should be able to make a macro for it, however many steps I wind up needing. Good to know I'll need to look at an alternate approach. – bbg Commented Feb 7 at 17:03
As an aside, you mean to say [^(]+ (match anything other than left parenthesis) and not [^\(]+ (match anything other than left parenthesis or backslash). You don't escape parentheses inside regex character classes. However, this does not solve your fundamental problem. – pilcrow Commented Feb 7 at 17:22
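The split-then-trim idea from the comments can be sketched outside Snowflake to check the logic before porting it. This is a minimal Python illustration, not Snowflake SQL: the function name extract_after_token and its parameters are hypothetical, and the list of stop markers is taken from the question. It avoids regex entirely, so nothing here depends on lookahead support.

```python
def extract_after_token(data, start="TOKEN: ", stops=("(", "TOKEN", "TOKN")):
    """Return the text after each start marker, cut at the earliest stop marker.

    Hypothetical helper for illustration only; names are not from the thread.
    """
    results = []
    # The first piece is whatever precedes the first start marker, so skip it.
    for piece in data.split(start)[1:]:
        cut = len(piece)  # default: keep the piece up to its end
        for stop in stops:
            idx = piece.find(stop)
            if idx != -1:
                cut = min(cut, idx)  # keep the earliest stop position
        results.append(piece[:cut].strip())
    return results


sample = ("TOKEN: get this text (subtitle)TOKN: skip this text also"
          "TOKEN: get this textTOKN: not this one")
print(extract_after_token(sample))  # ['get this text', 'get this text']
```

In Snowflake the same steps would presumably map to SPLIT_TO_TABLE for the split and nested SPLIT_PART calls for the cut, but that port is untested here.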

2 Answers

The dollar sign anchors the match to the end of the string, which is not what you want here:

with d as (
 select $1 as data
 from values 
 ('TOKEN: get this text (subtitle)TOKN: skip this text alsoTOKEN: get this textTOKN: not this one')
), r as (
    select $1 as reg
    from values
    ('TOKEN: ([^\\(]+)(\\(|TOKEN|TOKN$)'),
    ('TOKEN: ([^\\(]+)(\\(|TOKEN|TOKN)')
)
select
    regexp_substr(data, r.reg, 1, 1, 'e') as tkn1,
    regexp_substr(data, r.reg, 1, 2, 'e') as tkn2,
    regexp_substr(data, r.reg, 1, 3, 'e') as tkn3
    -- regexp_substr(data, r.reg, 1, 4, 'e') as tkn4
from d
join r

gives:

TKN1	TKN2	TKN3
get this text	null	null
get this text	get this text	null
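The second row can be sanity-checked offline with a quick script. The sketch below uses Python's re module, which is a backtracking engine and not Snowflake's, so agreement on this sample is only a hint, not proof of identical behavior. The pattern is a variant of the second regex above with the character class written as [^(], as suggested in the comments; the variable names are illustrative.

```python
import re

data = ("TOKEN: get this text (subtitle)TOKN: skip this text also"
        "TOKEN: get this textTOKN: not this one")

# Raw string, so backslashes are not doubled as they are in the SQL literal.
# Group 1 is the wanted text; group 2 is the terminator that ends it.
pattern = re.compile(r"TOKEN: ([^(]+)(\(|TOKEN|TOKN)")

# finditer returns successive non-overlapping matches, similar in spirit to
# asking regexp_substr for occurrence 1, 2, 3, ...
matches = [m.group(1).strip() for m in pattern.finditer(data)]
print(matches)  # ['get this text', 'get this text']
```

The first capture carries a trailing space before the parenthesis, which is why the sketch strips it; the tabular output above may simply not show it.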

Maybe try dollar quoted string constants as explained here: https://docs.snowflake.com/en/sql-reference/data-types-text#label-dollar-quoted-string-constants

This expression worked when testing against your string on https://regexr.com/

(\w\s?)+(?=(\(\w+\))?([T].{1,3}[N]\:))

According to the documentation you'd use

$$(\w\s?)+(?=(\(\w+\))?([T].{1,3}[N]\:))$$

Seems strange to exclude parts of regex.
