
regex - Snowflake SQL Query to find records where character repeats at least 5 times in a row - Stack Overflow


I'm trying to query a table to find all instances where a character repeats at least 5 times in a row.

I've tried:

Select Column
From Table
where Column REGEXP '(.)\1{4,}'

but it returns nothing.

The table includes the following entries that SHOULD be returned:

1.111111111111E31
00000000000000000
xxxxxxxxxxxxxxxxx

asked Mar 28 at 17:00 by KingTerror; edited Mar 28 at 18:32 by Lukasz Szozda
  • Do you have any sort of collation on the column you're testing? – Andrew Commented Mar 28 at 17:17
  • I checked Snowsight and I don't see any collation, so I believe Snowflake uses the default, which compares strings based on their UTF-8 character representations. A simple query like Select Column from Table where Column like '%00000%' does return matches. – KingTerror Commented Mar 28 at 17:45
  • Try escaping the backslash: '(.)\\1{4}'. – Barmar Commented Mar 28 at 18:03
  • When I try that I get an error: SQL Error [100048] [2201B]: Invalid regular expression: '(.)\1{4,}', invalid escape sequence: \1 – KingTerror Commented Mar 28 at 18:14
  • I think regexp is not the right tool in this case. Will you check all possible symbols? – ValNik Commented Mar 28 at 18:27

2 Answers

There are two problems with the query. First, on using backreferences, the Snowflake documentation says:

Snowflake does not support backreferences in regular expression patterns (known as “squares” in formal language theory); however, backreferences are supported in the replacement string of the REGEXP_REPLACE function.

Second, when a backslash is used inside a single-quoted string, it has to be escaped:

In single-quoted string constants, you must escape the backslash character in the backslash-sequence. For example, to specify \d, use \\d
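Because the pattern itself cannot contain a backreference, one backreference-free option (the idea hinted at in the comments about checking all possible symbols) is to enumerate the candidate characters and build an alternation. The Python sketch below only generates and checks such a pattern locally; the alphabet "01x" and the helper name run_pattern are hypothetical, and the pattern is wrapped in .* on the assumption that Snowflake's REGEXP matches the whole string rather than a substring. It has not been run against Snowflake.

```python
import re

def run_pattern(alphabet: str, min_run: int = 5) -> str:
    """Build a regex without backreferences that matches any string containing
    min_run or more consecutive copies of a single character from alphabet."""
    # One alternative per candidate character, e.g. "0{5,}".
    alternatives = "|".join(f"{re.escape(ch)}{{{min_run},}}" for ch in alphabet)
    # Leading and trailing .* because the match is assumed to be anchored.
    return f".*({alternatives}).*"

# Hypothetical alphabet: only the characters we expect to repeat.
pattern = run_pattern("01x")

samples = ["1.111111111111E31", "00000000000000000", "xxxxxxxxxxxxxxxxx", "12345678"]
# re.fullmatch mimics a whole-string (anchored) match.
matches = {s: re.fullmatch(pattern, s) is not None for s in samples}
print(pattern)
print(matches)
```

This only scales to a small, known alphabet. If the alphabet contains regex metacharacters, re.escape adds backslashes, and those would have to be doubled again inside a single-quoted SQL string.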


A possible workaround is to write a custom Python UDF:

CREATE OR REPLACE FUNCTION regexp_test(r TEXT, t TEXT)
  RETURNS BOOLEAN
  LANGUAGE PYTHON
  STRICT
  RUNTIME_VERSION = '3.12'
  HANDLER = 'main'
AS $$
import re
def main(r,t):
  return bool(re.search(r,t))
$$;


WITH cte(Col) AS (
    SELECT '1.111111111111E31'  UNION ALL
    SELECT '00000000000000000'  UNION ALL
    SELECT 'xxxxxxxxxxxxxxxxx'  UNION ALL
    SELECT '12345678'
)
SELECT *, regexp_test('(.)\\1{4,}', col)
FROM cte;

Output:

+-------------------+--------------------------------+
|        COL        | REGEXP_TEST('(.)\\1{4,}', COL) |
+-------------------+--------------------------------+
| 1.111111111111E31 | TRUE                           |
| 00000000000000000 | TRUE                           |
| xxxxxxxxxxxxxxxxx | TRUE                           |
| 12345678          | FALSE                          |
+-------------------+--------------------------------+
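The handler can be sanity-checked outside Snowflake before deploying the UDF. The sketch below reuses the same main function and runs it over the four sample values in plain Python. The pattern is written as a raw string here, so a single backslash is enough; the doubled backslash is only needed inside the single-quoted SQL literal.

```python
import re

def main(r, t):
    # Same body as the UDF handler: True if pattern r matches anywhere in t.
    return bool(re.search(r, t))

# Raw string: one character captured, then the same character 4 or more times.
PATTERN = r"(.)\1{4,}"

rows = ["1.111111111111E31", "00000000000000000", "xxxxxxxxxxxxxxxxx", "12345678"]
results = [(t, main(PATTERN, t)) for t in rows]

for value, found in results:
    print(f"{value:<20} {found}")
```

Note that re.search looks for the pattern anywhere in the string, which is why no leading or trailing .* is required here.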

I would like to propose a pure SQL solution. I wrote it in the MySQL dialect, but it should not be difficult to port the query to Snowflake.

WITH recursive
  cte(col) AS (
    SELECT '' UNION ALL
    SELECT 'aaaa' UNION ALL
    SELECT 'aaaaa' UNION ALL
    SELECT 'zaaaaa' UNION ALL
    SELECT '1.111111111111E31' UNION ALL
    SELECT '00000000000000000' UNION ALL
    SELECT 'xxxxxxxxxxxxxxxxx' UNION ALL
    SELECT '12345678'
  ),
  rcte(col, pos, fnd) as (
    select
      col, 1,
      case
        when length(col) >= 5 and
             substr(col, 1, 5) = repeat(substr(col, 1, 1), 5) then 1
        else 0
      end
    from cte
    union all
    select
      col,
      pos + 1,
      case substr(col, pos + 1, 5)
        when repeat(substr(col, pos + 1, 1), 5) then 1
        else 0
      end
    from rcte
    where fnd = 0 and length(col) - pos >= 5
  )
SELECT col
FROM rcte
where fnd = 1;

Result:

+-------------------+
|        col        |
+-------------------+
| aaaaa             |
| 00000000000000000 |
| xxxxxxxxxxxxxxxxx |
| zaaaaa            |
| 1.111111111111E31 |
+-------------------+

Try it on db<>fiddle.
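The recursive CTE is a sliding-window scan: at each position it compares the next 5 characters with the character at that position repeated 5 times, and stops at the first hit. A minimal Python sketch of the same logic follows, useful for checking the expected rows; the function name has_run is hypothetical and is not part of the SQL above.

```python
def has_run(col: str, n: int = 5) -> bool:
    """Return True if some character occurs n or more times in a row in col."""
    # Slide a window of width n over the string (0-based positions here,
    # whereas the SQL uses 1-based substr positions).
    for pos in range(len(col) - n + 1):
        window = col[pos:pos + n]
        # Equivalent to: substr(col, pos, n) = repeat(substr(col, pos, 1), n)
        if window == col[pos] * n:
            return True
    return False

samples = ["", "aaaa", "aaaaa", "zaaaaa", "1.111111111111E31",
           "00000000000000000", "xxxxxxxxxxxxxxxxx", "12345678"]
found = [s for s in samples if has_run(s)]
print(found)
```

Strings shorter than 5 characters produce an empty range and are never reported, which corresponds to the length checks in the CTE.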
