I need to find state abbreviations in a column named my_column. This column can contain values like these:
John Smith of AZ --> Match
John Smith of AZ(Tucson) --> Match
AZ John Smith --> Match
John Smith AZ for Tucson --> Match
Utah Jazz --> Don't Match
Azyme --> Don't Match
Hazy --> Don't Match
I tried using CASE expressions with CHARINDEX and LIKE to do all this matching, but it is getting super ugly. I wonder if there is a better way.
I asked our DBA to install Full-Text Index to see if I can do something better with CONTAINS, but I'm not sure it really helps.
Asked Feb 6 at 23:39 by Andrew Park; edited Feb 7 at 9:57 by Thom A
2 Answers
How about something like this:
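declare @testcases table (testval varchar(50));
insert into @testcases
values
('John Smith of AZ'),
('John Smith of AZ(Tucson)'),
('AZ John Smith'),
('John Smith AZ for Tucson'),
('Utah Jazz'),
('Azyme'),
('Hazy');
-- Each PATINDEX returns 0 when its pattern is absent, so a sum above 0 means a match.
select testval,
       PATINDEX('%[^A-Z]AZ[^A-Z]%', testval)  -- AZ in the middle
     + PATINDEX('AZ[^A-Z]%', testval)         -- AZ at the start
     + PATINDEX('%[^A-Z]AZ', testval)         -- AZ at the end
       as match_score
from @testcases;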
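It'll match any AZ that isn't bordered by a letter; a result greater than 0 means a match. Note that [^A-Z] excludes only letters, so digits and punctuation count as borders, and that a value consisting of exactly AZ matches none of the three patterns. Mind you, this isn't very performant, but it will work.
Also, it only works for Arizona. To find the other 49 states you'd have to scan again. I would probably try to extract the state in your application before inserting into the database.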
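One way to avoid a separate scan per state, along the lines of the comment about joining a table of abbreviations, is sketched below. It is only an illustration under some assumptions: the @states table variable and its three sample codes are hypothetical, the value is padded with spaces so a single pattern covers start, middle and end, and a binary collation (Latin1_General_BIN) is applied on the assumption that the abbreviations are always upper case in the data.

```sql
-- Sample rows taken from the question.
declare @testcases table (testval varchar(50));
insert into @testcases
values
('John Smith of AZ'),
('John Smith of AZ(Tucson)'),
('AZ John Smith'),
('John Smith AZ for Tucson'),
('Utah Jazz'),
('Azyme'),
('Hazy');

-- Hypothetical lookup table; only a few codes are listed here.
declare @states table (code char(2));
insert into @states values ('AZ'), ('UT'), ('CA');

-- Padding with spaces lets one pattern cover start, middle and end.
-- The binary collation makes the comparison case-sensitive, so a word
-- such as "in" would not be reported as Indiana if IN were added to @states.
select t.testval, s.code
from @testcases as t
join @states as s
  on (' ' + t.testval + ' ') collate Latin1_General_BIN
     like '%[^A-Za-z]' + s.code + '[^A-Za-z]%';
```

With these assumptions, the four values the question marks as Match should come back paired with AZ, and 'Utah Jazz' should not be paired with UT. If the data can contain lower-case abbreviations, the collation choice would need revisiting.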
Very generally, replace every non-letter character with a space, then add a space before and after the whole string, then search for your two-letter state code by padding it with a space at the beginning and end. Your examples would turn into:
' John Smith of AZ ' --> Matches with '% AZ %'
' John Smith of AZ Tucson ' --> Matches with '% AZ %'
' AZ John Smith ' --> Matches with '% AZ %'
' John Smith AZ for Tucson ' --> Matches with '% AZ %'
' Utah Jazz ' --> Don't Match
' Azyme ' --> Don't Match
' Hazy ' --> Don't Match
As you can see, the state code will always be isolated in its own space.
Why go to the trouble of replacing all non-letter characters with spaces? So that the following are also handled:
'John Smith lives in AZ.'
'John Smith loves AZ!'
'AZ? John Smith lives there.'
'John Smith does not live in California. (He lives in AZ)'
'John Smith has to drive through ID, WA, OR, and CA to get to AZ.'
For help on removing the non-letter characters, see here: How to strip all non-alphabetic characters from string in SQL Server?
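As a rough sketch of the idea, the query below uses TRANSLATE (available from SQL Server 2017) to turn a fixed list of punctuation characters into spaces before padding and searching. This is a simplification, not the full "every non-letter" replacement described above: the @punct list and the @samples table are assumptions made for the example, and any character missing from the list (digits, for instance) is left untouched.

```sql
-- Sample rows taken from the examples above, plus two non-matches.
declare @samples table (testval varchar(100));
insert into @samples
values
('John Smith lives in AZ.'),
('John Smith loves AZ!'),
('AZ? John Smith lives there.'),
('John Smith does not live in California. (He lives in AZ)'),
('John Smith has to drive through ID, WA, OR, and CA to get to AZ.'),
('Utah Jazz'),
('Azyme');

-- Assumed punctuation list; extend it to cover whatever the data contains.
declare @punct varchar(20) = '().,!?;:';

-- TRANSLATE needs its 2nd and 3rd arguments to be the same length,
-- hence one replacement space per punctuation character.
select testval,
       case
         when ' ' + translate(testval, @punct, replicate(' ', len(@punct))) + ' '
              like '% AZ %'
         then 'Match'
         else 'Don''t Match'
       end as result
from @samples;
```

Under a case-insensitive collation the '% AZ %' pattern would also accept a standalone lower-case "az", so the collation in use is worth checking before relying on this.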
\b(AL|AK|AZ|AR|...|WI|WY)\b. \b matches word boundaries, so it won't match Jazz or Azyme. – Barmar Commented Feb 6 at 23:49
' ' + column + ' ' like '%[ (),]AZ[ (),]%'; for multiple abbreviations you can just join a table containing those. – siggemannen Commented Feb 7 at 11:00
Regex ... (such as TRANSLATE, that could be used to replace all the word boundary characters with a consistent character prior to splitting) – Martin Smith Commented Feb 7 at 11:30