sql - String Matching Struggle

I need to find state abbreviations in a column named my_column. The column can contain values like these:

John Smith of AZ --> Match

John Smith of AZ(Tucson) --> Match

AZ John Smith --> Match

John Smith AZ for Tucson --> Match

Utah Jazz --> Don't Match

Azyme --> Don't Match

Hazy --> Don't Match

I tried using CASE expressions with CHARINDEX and LIKE to do all this matching, but it is getting super-ugly. I wonder if there is a better way.

I asked our DBA to install Full-Text Index to see if I can do something better with CONTAINS, but I'm not sure it will really help.

asked Feb 6 at 23:39 by Andrew Park, edited Feb 7 at 9:57 by Thom A

3 You could use a regular expression like \b(AL|AK|AZ|AR|...|WI|WY)\b. \b matches word boundaries, so it won't match Jazz or Azyme – Barmar Commented Feb 6 at 23:49
1 SQL Server doesn't support Regex, @Barmar. (Unless you're from the future, as SQL Server 2025 apparently will.) – Thom A Commented Feb 7 at 9:28
2 Something like ' ' + column + ' ' like '%[ (),]AZ[ (),]%'. For multiple abbreviations, you can just join a table containing them – siggemannen Commented Feb 7 at 11:00
2016 is a pain here. Later versions do have functionality that would make this less cumbersome even in the absence of Regex (such as TRANSLATE that could be used to replace all the word boundary characters with a consistent character prior to splitting) – Martin Smith Commented Feb 7 at 11:30
You'll likely find a solution that will work, but when you do, it won't perform well. This isn't the kind of work database servers were meant to do. Your best option is to capture this data at insert/update time, and use the client language to do the parsing first, along with handling the parsing for the conversion to the fixed schema. – Joel Coehoorn Commented Feb 7 at 14:40

2 Answers

How about something like this:

declare @testcases table (testval varchar(50));

insert into @testcases 
values 
('John Smith of AZ'),
('John Smith of AZ(Tucson)'),
('AZ John Smith'),
('John Smith AZ for Tucson'),
('Utah Jazz'),
('Azyme'),
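('Hazy');

-- A non-zero sum means AZ appears as a standalone token:
-- in the middle, at the start, or at the end of the value.
select testval,
       PATINDEX('%[^A-Z]AZ[^A-Z]%', testval)
     + PATINDEX('AZ[^A-Z]%', testval)
     + PATINDEX('%[^A-Z]AZ', testval) as match_score
from @testcases;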

This matches any value containing AZ where the AZ does not border a letter; a non-zero result means a match. Mind you, this isn't very performant, but it will work.

Also, it only works for Arizona. To find the other 49 states you'd have to scan again for each one. I would probably try to extract the state in your application before inserting into the database.
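To cover more than one state in a single query, the same boundary idea can be combined with a join against a table of codes, as suggested in the comments. This is only a minimal sketch, assuming a case-insensitive collation; @samples and @states are hypothetical names, and only three codes are loaded for brevity:

```sql
-- Hypothetical sample data taken from the question.
declare @samples table (testval varchar(50));
insert into @samples values
('John Smith of AZ'), ('John Smith of AZ(Tucson)'), ('AZ John Smith'),
('John Smith AZ for Tucson'), ('Utah Jazz'), ('Azyme'), ('Hazy');

-- Hypothetical lookup table; a real one would hold all state codes.
declare @states table (code char(2));
insert into @states values ('AZ'), ('UT'), ('CA');

-- Pad the value with spaces so a code at the very start or end still
-- has a non-letter on both sides, then require a non-letter around it.
select t.testval, s.code
from @samples t
join @states s
  on ' ' + t.testval + ' ' like '%[^A-Z]' + s.code + '[^A-Z]%';
```

Each returned row pairs a value with a code it contains as a standalone token. The leading wildcard still means every value is checked against every code, so don't expect this to be any more performant. Under a case-insensitive collation, ordinary words such as "or" and "in" would also match the codes OR and IN.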

Very generally, replace every non-letter character with a space, then add a space before and after the whole string, then search for your two-letter state code by padding it with a space at the beginning and end. Your examples would turn into:

' John Smith of AZ ' --> Matches with '% AZ %'

' John Smith of AZ Tucson ' --> Matches with '% AZ %'

' AZ John Smith ' --> Matches with '% AZ %'

' John Smith AZ for Tucson ' --> Matches with '% AZ %'

' Utah Jazz ' --> Don't Match

' Azyme ' --> Don't Match

' Hazy ' --> Don't Match

As you can see, the state code will always be isolated, with a space on each side.

Why go to the trouble of replacing all non-letter characters with spaces? So that the following are also handled:

'John Smith lives in AZ.'

'John Smith loves AZ!'

'AZ? John Smith lives there.'

'John Smith does not live in California. (He lives in AZ)'

'John Smith has to drive through ID, WA, OR, and CA to get to AZ.'

For help on removing the non-letter characters, see the Stack Overflow question "How to strip all non-alphabetic characters from string in SQL Server?"
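The steps above can be sketched for a single value as follows. This is only an illustration, assuming a simple loop over PATINDEX and STUFF is acceptable; @s and @pos are hypothetical variable names, and the question mentioned above covers other ways to strip the characters:

```sql
declare @s varchar(100) = 'John Smith of AZ(Tucson)';
declare @pos int = patindex('%[^A-Za-z ]%', @s);

-- Replace each character that is neither a letter nor a space with a space.
while @pos > 0
begin
    set @s = stuff(@s, @pos, 1, ' ');
    set @pos = patindex('%[^A-Za-z ]%', @s);
end

-- Pad both ends, then look for the space-delimited state code.
select case when ' ' + @s + ' ' like '% AZ %'
            then 'Match' else 'Don''t Match' end as result;
```

Here the parentheses become spaces, so the padded value is ' John Smith of AZ Tucson ' and the query returns Match. To run this over a whole column, the loop would need to be wrapped in something like a scalar function, which is likely to be slow on large tables.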
