Part 10 of Learn RegEx With Louis
The REGEXP_SUBSTR function extracts parts of a string based on a regular expression pattern. It has some similarieties with the SUBSTRING function, but with some important (and interesting) differences. This function returns Nth occurrence of a substring that matches the regex pattern.
As I am sitting here looking to write my tenth entry about RegEx in SQL Server, I have to admit that I had no idea what this function was going to do. The classic SUBSTRING function takes strictly positional inputs. Given a string, return the characters from point 1, and then a certain number of characters from there. No pattern matching at all. Fortunately for you, my research will seem to take no time at all.
In my blog that was basically about REGEXP_MATCHES, I showed how you could see all the matches that an RegEx expression had with a string. REGEXP_SUBSTR, in it’s simplest form, will give you a scalar version of this function to return a single instance.
The syntax for this function is:
REGEXP_SUBSTR ( string_expression, pattern_expression [, start [, occurrence [, flags [, group ] ] ] ] )
In its simplest form, you pass in a value, and then a pattern. In this next example, it will get the first 3 numeric character pattern:
SELECT REGEXP_SUBSTR('abc1234567890987654321','\d{3}');
This returns the first characters that match the \d3 regular expression.
---------------------- 123
The next parameter tells you what character to start on. For example, if we start the search at the 4th character:
SELECT REGEXP_SUBSTR('abc1234567890987654321','\d{3}',4);
The output is still:
---------------------- 123
Since it is 1 based, a is int he first postion, and c is in the 3rd. But use 5:
SELECT REGEXP_SUBSTR('abc1234567890987654321','\d{3}',5);
The output is:
---------------------- 234
While I don’t feel that the starting position is extremely interesting (though I expect it has uses that I can’t yet imagine, especially since it comes before the next parameters, the occurrence paramter is. Now you can do basiclly what REGEXP_MATCHES does, but one call at a time:
SELECT REGEXP_SUBSTR('abc1234567890987654321','\d{3}',1,1);
This gets the first 3 characters
---------------------- 123
But the 3rd occurrence:
SELECT REGEXP_SUBSTR('abc1234567890987654321','\d{3}',1,3);
Returns the 7th, 8th, and 9th value:
---------------------- 789
You can see if we use REGEXP_MATCHES with this value,you can get the same output and see all the matches that you can get from REGEXP_SUBSTR:
SELECT match_id, match_value,
REGEXP_SUBSTR('abc1234567890987654321','\d{3}',1,match_id)
AS REGEXP_SUBSTR_result
FROM REGEXP_MATCHES('abc1234567890987654321','\d{3}');
You can see the output matches:
match_id match_value REGEXP_SUBSTR_result --------- -------------- ---------------------- 1 123 123 2 456 456 3 789 789 4 098 098 5 765 765 6 432 432
Groups
Skipping over the flags (which work just like previously discussed here and here), the next parameter is groups.
As I demonstrated with REGEXP_REPLACE, there is a concept of a group that allows you to get a subset of a match. In the replace version, you could format the output using placeholders. For REGEX_SUBSTR, you don’t format the output, you can grab the text from the slice.
In the following example, you can see that that with the grouped filter I use in the following '(\d{1})(\d{2})' without groups will work the same as before. 3 characters. But yu can see in the 4th and 5th columns, that I included the group value of 1, and then two. So the first gets 1 character, the second gets the last two that we output.
SELECT match_id, match_value,
REGEXP_SUBSTR('abc1234567890987654321','(\d{1})(\d{2})',1,match_id)
AS REGEXP_SUBSTR,
REGEXP_SUBSTR('abc1234567890987654321','(\d{1})(\d{2})',1,match_id,'',1)
AS REGEXP_SUBSTR_1,
REGEXP_SUBSTR('abc1234567890987654321','(\d{1})(\d{2})',1,match_id,'',2)
AS REGEXP_SUBSTR_2
FROM REGEXP_MATCHES('abc1234567890987654321','(\d{1})(\d{2})');
This shows that the groups are seperated:
match_id match_value REGEXP_SUBSTR REGEXP_SUBSTR_1 REGEXP_SUBSTR_2 ---------- ------------ -------------- ---------------- ---------------- 1 123 123 1 23 2 456 456 4 56 3 789 789 7 89 4 098 098 0 98 5 765 765 7 65 6 432 432 4 32
Quick Example
Consider a table with phone numbers (or at least, supposed phone numbers). You want to fetch out the first three numeric values, assuming the data format matches an expected format. You can do this without a WHERE clause, just by matching the format in the regular expression, but then pick out a single part of the number when it matches the format:
CREATE TABLE #numbers(phoneNumber varchar(13));
INSERT INTO #numbers
VALUES ('111-111-1111'),('222-221-2222'),('333-333'),('44444444444');
SELECT phoneNumber,
REGEXP_SUBSTR(phoneNumber,'^(\d{3})-(\d{3})-(\d{4})$',1,1,'',1)
AS areaCode
FROM #numbers;
You can see that the rows that matched the format was able to put out the value, but the two that didn’t match, returned NULL.
phoneNumber areaCode ------------- ------------- 111-111-1111 111 222-221-2222 222 333-333 NULL 44444444444 NULL
Note that you cannot do an OR expression here as you might expect. So if I try to add in an or to catch the '4444444444' value and get it’s area code, you will see that the '4444444444' value returns NULL still:
--using the same table variable from the first query in this section
SELECT phoneNumber,
REGEXP_SUBSTR(phoneNumber,
'^(\d{3})-(\d{3})-(\d{4})$|^(\d{3})(\d{3})(\d{4})$',1,1,'',1)
AS areaCode
FROM #numbers;
Same output. But now, change the group to 4:
SELECT phoneNumber,
REGEXP_SUBSTR(phoneNumber,
'^(\d{3})-(\d{3})-(\d{4})$|^(\d{3})(\d{3})(\d{4})$',1,1,'',4)
AS areaCode
FROM @numbers;
And now only the 444 row returns an area code:
phoneNumber areaCode ------------- ------------- 111-111-1111 NULL 222-221-2222 NULL 333-333 NULL 4444444444 444
Summary
At this point in the series, we are getting really close to finishing up. So REGEX_SUBSTR isn’t at all new ground. But it is a definitely useful function that is going to be useful when splitting values, especially when they meet a certain format. Just a few more functions to go, and I plan to do the last two in one post (but to be fair, I intended to do this function as part of a 3-function blog and this post started to feel a bit too large for that.




Leave a reply to Substring Search with Regular Expressions in SQL Server – Curated SQL Cancel reply