I am trying to write a SQL query where I check string values in a column and if the column contains names that are in the name_list that have a hyphen before or after the target name, I want to flag these as -1. (ie. "John-Brown" or "Brown-John" would be flagged as -1). If the name exists at all, but is not hyphenated, then we will return a 1 flag (ie. "Johnny Appleseed" would be flagged as 1) and lastly if the name is not in the column, then we will return a 0.
WITH name_list AS (
SELECT STRING_AGG(name, '|') AS name_pattern
FROM UNNEST(['John', 'David', 'Alice', 'Michael']) AS name
)
SELECT
t.id,
t.column_to_check,
CASE
-- Case 1: Word is preceded or followed by a hyphen → return -1
WHEN REGEXP_CONTAINS(t.column_to_check, r'[-](' || nl.name_pattern || r')[-]') THEN -1
-- Case 2: Word exists anywhere (unless hyphenated) → return 1
WHEN REGEXP_CONTAINS(t.column_to_check, nl.name_pattern) THEN 1
ELSE 0
END AS flag
FROM table_name t
JOIN name_list nl ON TRUE;
This should work, however, with the real data that I am working with, I have 95 different names that I am checking and millions of rows of data, and I am getting the error:
Cannot parse regular expression: pattern too large - compile failed.
Does anyone know of another, efficient way to do this?
I am trying to write a SQL query where I check string values in a column and if the column contains names that are in the name_list that have a hyphen before or after the target name, I want to flag these as -1. (ie. "John-Brown" or "Brown-John" would be flagged as -1). If the name exists at all, but is not hyphenated, then we will return a 1 flag (ie. "Johnny Appleseed" would be flagged as 1) and lastly if the name is not in the column, then we will return a 0.
WITH name_list AS (
SELECT STRING_AGG(name, '|') AS name_pattern
FROM UNNEST(['John', 'David', 'Alice', 'Michael']) AS name
)
SELECT
t.id,
t.column_to_check,
CASE
-- Case 1: Word is preceded or followed by a hyphen → return -1
WHEN REGEXP_CONTAINS(t.column_to_check, r'[-](' || nl.name_pattern || r')[-]') THEN -1
-- Case 2: Word exists anywhere (unless hyphenated) → return 1
WHEN REGEXP_CONTAINS(t.column_to_check, nl.name_pattern) THEN 1
ELSE 0
END AS flag
FROM table_name t
JOIN name_list nl ON TRUE;
This should work, however, with the real data that I am working with, I have 95 different names that I am checking and millions of rows of data, and I am getting the error:
Cannot parse regular expression: pattern too large - compile failed.
Does anyone know of another, efficient way to do this?
Share Improve this question edited yesterday Dale K 27.3k15 gold badges57 silver badges83 bronze badges asked yesterday David GrahamDavid Graham 91 bronze badge1 Answer
Reset to default 2I don't think that regex does quite what you want it to do, because it expects hyphens on both sides of the name.
In any case, if I understand your problem correctly, you can probably use joins with the like operator instead:
with comp as (
select 'John' as name union all
select 'David'union all
select 'Alice' union all
select 'Michael'
)
select
a.id,
a.name,
case
when sum(
case
when a.name like concat('%', '-', b.name, '%') or a.name like concat('%', b.name, '-', '%') then 1
else 0
end
) > 0 then -1
when count(b.name) > 0 then 1
else 0
end as flag
from
schema.table a
left join comp b on a.name like concat('%', b.name, '%')
group by 1,2
order by 2,1
;
This took about 30s on a dataset of ~20 million rows in GBQ.