Since Snowflake only supports POSIX regex instead of PCRE*, the solutions I've found don't work.
I'm looking to extract the text between TOKEN:
and either (
or a few possible strings: TOKEN
, TOKN
.
I think the issue I'm having is in the ([^\\(]+)
capturing group, where I need it to capture until either (
or one of the token strings. The solutions I found online utilize a positive look ahead, which Snowflake doesn't support. (Those solutions throw the error no argument for repetition operator: ?
)
With the correct regex, the following code should return get this text
for tkn1
and get this text also
for tkn2
, and nulls for tkn3
and tkn4
.
Anyone got any ideas? Also open to alternate approaches, thanks!
with
d as (select 'TOKEN: get this text (subtitle)TOKN: skip this text alsoTOKEN: get this textTOKN: not this one' as data),
r as (select 'TOKEN: ([^\\(]+)(\\(|TOKEN|TOKN|$)' as reg),
p as (
select
regexp_substr(data, r.reg, 1, 1, 'e') as tkn1,
regexp_substr(data, r.reg, 1, 2, 'e') as tkn2,
regexp_substr(data, r.reg, 1, 3, 'e') as tkn3,
regexp_substr(data, r.reg, 1, 4, 'e') as tkn4
from d
join r
)
select * from p;
Edit: The regex had a typo, missing bar before end of line match: (\\(|TOKEN|TOKN$) --> (\\(|TOKEN|TOKN|$)
.
Since Snowflake only supports POSIX regex instead of PCRE*, the solutions I've found don't work.
I'm looking to extract the text between TOKEN:
and either (
or a few possible strings: TOKEN
, TOKN
.
I think the issue I'm having is in the ([^\\(]+)
capturing group, where I need it to capture until either (
or one of the token strings. The solutions I found online utilize a positive look ahead, which Snowflake doesn't support. (Those solutions throw the error no argument for repetition operator: ?
)
With the correct regex, the following code should return get this text
for tkn1
and get this text also
for tkn2
, and nulls for tkn3
and tkn4
.
Anyone got any ideas? Also open to alternate approaches, thanks!
with
d as (select 'TOKEN: get this text (subtitle)TOKN: skip this text alsoTOKEN: get this textTOKN: not this one' as data),
r as (select 'TOKEN: ([^\\(]+)(\\(|TOKEN|TOKN|$)' as reg),
p as (
select
regexp_substr(data, r.reg, 1, 1, 'e') as tkn1,
regexp_substr(data, r.reg, 1, 2, 'e') as tkn2,
regexp_substr(data, r.reg, 1, 3, 'e') as tkn3,
regexp_substr(data, r.reg, 1, 4, 'e') as tkn4
from d
join r
)
select * from p;
Edit: The regex had a typo, missing bar before end of line match: (\\(|TOKEN|TOKN$) --> (\\(|TOKEN|TOKN|$)
.
2 Answers
Reset to default 1the dollar sign is saying "match the last character" where you are not wanting that:
with d as (
select $1 as data
from values
('TOKEN: get this text (subtitle)TOKN: skip this text alsoTOKEN: get this textTOKN: not this one')
), r as (
select $1 as reg
from values
('TOKEN: ([^\\(]+)(\\(|TOKEN|TOKN$)'),
('TOKEN: ([^\\(]+)(\\(|TOKEN|TOKN)')
)
select
regexp_substr(data, r.reg, 1, 1, 'e') as tkn1,
regexp_substr(data, r.reg, 1, 2, 'e') as tkn2,
regexp_substr(data, r.reg, 1, 3, 'e') as tkn3,
--regexp_substr(data, r.reg, 1, 4, 'e') as tkn4
from d
join r
gives:
TKN1 | TKN2 | TKN3 |
---|---|---|
get this text | null | null |
get this text | get this text | null |
Maybe try dollar quoted string constants as explained here: https://docs.snowflake.com/en/sql-reference/data-types-text#label-dollar-quoted-string-constants
This expression worked when testing against your string on https://regexr.com/
(\w\s?)+(?=(\(\w+\))?([T].{1,3}[N]\:))
According to the documentation you'd use
$$(\w\s?)+(?=(\(\w+\))?([T].{1,3}[N]\:))$$
Seems strange to exclude parts of regex.
?
greedy qualifier. Even if we repaired the regex, your first extracted field on the first match would be what you think of as the first three tokens, matching from the first occurrence ofTOKEN:
to the second (the last) occurrence ofTOKN:
. Split on 'TOKEN: ' and then regex extract up to the next(
or end of string. Disclaimer: untested. – pilcrow Commented Feb 6 at 21:38[^(]+
(match anything other than left parenthesis) and not[^\(]+
(match anything other than left parenthesis or backslash). You don't escape parentheses inside regex character classes. However, this does not solve your fundamental problem. – pilcrow Commented Feb 7 at 17:22