regex - SQL Bigquery Complex Regexp

I am trying to write a SQL query where I check string values in a column and if the column contains names that are in the name_list that have a hyphen before or after the target name, I want to flag these as -1. (ie. "John-Brown" or "Brown-John" would be flagged as -1). If the name exists at all, but is not hyphenated, then we will return a 1 flag (ie. "Johnny Appleseed" would be flagged as 1) and lastly if the name is not in the column, then we will return a 0.

WITH name_list AS (
    SELECT STRING_AGG(name, '|') AS name_pattern
    FROM UNNEST(['John', 'David', 'Alice', 'Michael']) AS name
)
SELECT 
    t.id, 
    t.column_to_check,
    CASE 
        -- Case 1: Word is preceded or followed by a hyphen → return -1
        WHEN REGEXP_CONTAINS(t.column_to_check, r'[-](' || nl.name_pattern || r')[-]') THEN -1
        
        -- Case 2: Word exists anywhere (unless hyphenated) → return 1
        WHEN REGEXP_CONTAINS(t.column_to_check, nl.name_pattern) THEN 1
        
        ELSE 0 
    END AS flag
FROM table_name t
JOIN name_list nl ON TRUE;

This should work, however, with the real data that I am working with, I have 95 different names that I am checking and millions of rows of data, and I am getting the error:

Cannot parse regular expression: pattern too large - compile failed.

Does anyone know of another, efficient way to do this?

WITH name_list AS (
    SELECT STRING_AGG(name, '|') AS name_pattern
    FROM UNNEST(['John', 'David', 'Alice', 'Michael']) AS name
)
SELECT 
    t.id, 
    t.column_to_check,
    CASE 
        -- Case 1: Word is preceded or followed by a hyphen → return -1
        WHEN REGEXP_CONTAINS(t.column_to_check, r'[-](' || nl.name_pattern || r')[-]') THEN -1
        
        -- Case 2: Word exists anywhere (unless hyphenated) → return 1
        WHEN REGEXP_CONTAINS(t.column_to_check, nl.name_pattern) THEN 1
        
        ELSE 0 
    END AS flag
FROM table_name t
JOIN name_list nl ON TRUE;

This should work, however, with the real data that I am working with, I have 95 different names that I am checking and millions of rows of data, and I am getting the error:

Cannot parse regular expression: pattern too large - compile failed.

Does anyone know of another, efficient way to do this?

Share Improve this question edited yesterday Dale K 27.3k15 gold badges57 silver badges83 bronze badges asked yesterday David Graham 91 bronze badge

Add a comment |

1 Answer 1

Sorted by: Reset to default 2

I don't think that regex does quite what you want it to do, because it expects hyphens on both sides of the name.

In any case, if I understand your problem correctly, you can probably use joins with the like operator instead:

with comp as (
  select 'John' as name union all
  select 'David'union all
  select 'Alice' union all
  select 'Michael'
)
select
  a.id,
  a.name,
  case 
    when sum(
      case 
        when a.name like concat('%', '-', b.name, '%') or a.name like concat('%', b.name, '-', '%') then 1 
        else 0 
      end
    ) > 0 then -1 
    when count(b.name) > 0 then 1
    else 0
  end as flag
from 
  schema.table a
  left join comp b on a.name like concat('%', b.name, '%')
group by 1,2
order by 2,1
;

This took about 30s on a dataset of ~20 million rows in GBQ.

科技改变生活-雨落星辰 - 所有的伟大,都源于一个勇敢的开始

regex - SQL Bigquery Complex Regexp - Stack Overflow

1 Answer 1

与本文相关的文章

评论列表(0)