
In the previous SQL posts, I showed you SUBSTRING, then I showed you REPLACE. Now, with REPLACE, you had to know what characters you wanted to replace. What if you only knew what characters you wanted to keep? Well, let me introduce you to PATINDEX. This is an advanced command that is easy to learn, difficult to master. I’ll bring this tool out of my tool belt in future posts to provide you more ways to use it.
PATINDEX let’s you define a string pattern. Many strings you’ll store in your database will have a pattern. A phone number (in america) has a very distinct pattern 3 digits of area code, 3 digits of prefix, and 4 digits telroot. You may store this in several different ways, but you’ll always store the same pattern every time, otherwise, how will your users interpret the data your storing.
The example I’m going to use to introduce you to PATINDEX is based on the idea of a setting up a fixed format or pattern for your phone numbers.
But first, I need to include my standard disclaimer:
I would like to point out that you have to be careful when and how often you use the techniques below. A good rule to keep in mind is text manipulation is slow and painful to a server. If you can leave the text manipulation to your middle-ware or front end, that would be better. But we all have been stuck in a situation where we needed to alter a string before those two points, and so I bring you the following lesson.
The Problem
phone |
605-555-2862 (561)555-2700 904-555-5680 N/A 580-555-5371 2815558368 (254)555-8430 336-555-2797 3365557233 592-555-3181/4951 96615551222 Ext. 249 +44 7930 555271 |
Looking at the examples in the column we know there characters in there other than 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, and 0. How can we easily remove the non numeric values from this column? We can use regular expressions. We can say, if it isn’t 0-9, remove it.
Now, the solution to this problem will use a WHILE loop. This is something I’m not really going to go into too much detail on, but using a while loop can really eat up processor time and memory. This is one of those times where this solution should only be used as a last resort, and only during non-peak server usage. Before implementing this solution, I’d be sure to spend time putting together a fix on the front end, to scrub out these bad characters before they got into my database. But, let’s come off the soapbox and discuss the solution, shall we?
The Solution
Let’s first look at the syntax for the PATINDEX function.
PATINDEX ('%pattern%', sourceString)
The PATINDEX function will search for a pattern in the SourceString, and return a bigint of where that pattern is found in the sourceString. If it’s not found, then you get 0. With this you can also provide a pattern ‘%[^0-9]%’ and it will search for characters that are not numbers.
But that’s only part of the solution. You then need to couple that with REPLACE, so you can remove anything that is not a number.
--variable to hold the current phone number being updated DECLARE @phone VARCHAR(20) --variable to hold the original value (so we can update it) DECLARE @originalPhone VARCHAR(20) --if there is a phone number with something other than a number in it WHILE EXISTS ( SELECT TOP 1 phone FROM sourceTable (READCOMMITTED) WHERE PATINDEX('%[^0-9]%', phone) ) > 0) BEGIN --store the original number SELECT TOP 1 @originalPhone = phone FROM sourceTable (READCOMMITTED) WHERE PATINDEX('%[^0-9]%', phone) > 0 --then copy it to a temporary variable SET @phone = @originalPhone --spin through the phone number one character at a time -- if there is a character other than a number WHILE(PATINDEX('%[^0-9]%', @phone) ) > 0 BEGIN --then remove that one character, then continue SET @phone = REPLACE(@phone , SUBSTRING(@phone, PATINDEX('%[^0-9]%', @phone), 1) , '') END --once all the non numeric characters are gone, update UPDATE sourceTable SET Phone = @phone WHERE Phone = @originalPhone END
If you’re having difficulties with the WHILE loop, search my blog for the entry on WHILE loops. They’re pretty easy once you get to know them.
If you wanted to get fancy you could make the replace portion of this code a scalar function, and call it in an update step. It would get rid of the explicit outer loop. In future versions of SQL this may improve, but for SQL 2000, this is the best I’ve come up with.
There are many more string manipulation techniques to cover, so stay tuned to this same bat-channel!
I saw the some sql functions. i have very usefull the site. thank u.
Hi| Above example is best for beginners.
very useful site
I am using SQL Server 2008 R2.
I found a typo in the above code.
–store the original number
SELECT TOP 1
@originalPhone = phone
FROM sourceTable (READCOMMITTED)
WHERE
PATINDEX(‘%[^0-9]%’, phone)
Should be:
–store the original number
SELECT TOP 1
@originalPhone = phone
FROM sourceTable (READCOMMITTED)
WHERE
PATINDEX(‘%[^0-9]%’, phone) > 0
The missing code is the > 0 on the PATINDEX line. Just FYI.
Good catch! I really need to get back on here more often and share. Anything you want to see covered in particular?
it does not work if there is sql character like ‘%’ in input string so what is the solution.
i used replace but its not feasible solution as how many character replaced is a question.
Collapse | Copy Code
declare @phone varchar(50)
set @phone=’212%@@f’
select REPLACE(@phone
, SUBSTRING(@phone, PATINDEX(‘%[^0-9a-zA-Z .()/-]%’, @phone), 1)
, ”)
Am I right in understanding that you’re looking to remove any characters that are not an alpha, number, space, period, left or right parenthesis, forward slash, or dash?
If so, you’ll need a loop to fire your replace multiple times, because it’s only going to find the first instance of your PATINDEX. Try this:
DECLARE @phone VARCHAR(50)
SET @phone='212%@@f'
WHILE PATINDEX('%[^0-9a-zA-Z .()/-]%', @phone) > 0
BEGIN
SET @phone = REPLACE(
@phone
, SUBSTRING(@phone, PATINDEX('%[^0-9a-zA-Z .()/-]%', @phone), 1)
, '')
END
SELECT @phone
Hi Shannon,
Thanks for the answer. For the sake of efficiency it would be better to store your PATINDEX call in a variable so that it only has to be called once each loop. To do that you’d have to add (something like) SET @Loc = PATINDEX(…) once before the WHILE loop and once just before the END statement in the loop.
Cheers,
Andrew.
Thanks fo the heads up. Efficiency in coding solutions is definitely something I need to pick up as I move from database administration to data science. Any little improvement that can be made gets magnified when dealing with billions of rows. Looks like I need to go back and brush up on my Big-O notation.