Part 10 of Learn RegEx With Louis
The REGEXP_SUBSTR function extracts parts of a string based on a regular expression pattern. It has some similarieties with the SUBSTRING function, but with some important (and interesting) differences. This function returns Nth occurrence of a substring that matches the regex pattern.
As I am sitting here looking to write my tenth entry about RegEx in SQL Server, I have to admit that I had no idea what this function was going to do. The classic SUBSTRING function takes strictly positional inputs. Given a string, return the characters from point 1, and then a certain number of characters from there. No pattern matching at all. Fortunately for you, my research will seem to take no time at all.
In my blog that was basically about REGEXP_MATCHES, I showed how you could see all the matches that an RegEx expression had with a string. REGEXP_SUBSTR, in it’s simplest form, will give you a scalar version of this function to return a single instance.
The syntax for this function is:
REGEXP_SUBSTR( string_expression, pattern_expression [, start [, occurrence [, flags [, group ] ] ] ])
In its simplest form, you pass in a value, and then a pattern. In this next example, it will get the first 3 numeric character pattern:
SELECT REGEXP_SUBSTR('abc1234567890987654321','\d{3}');
This returns the first characters that match the \d3 regular expression.
----------------------123
The next parameter tells you what character to start on. For example, if we start the search at the 4th character:
SELECT REGEXP_SUBSTR('abc1234567890987654321','\d{3}',4);
The output is still:
----------------------123
Since it is 1-based, a is in the first position, and c is in the 3rd. But use 5:
SELECT REGEXP_SUBSTR('abc1234567890987654321','\d{3}',5);
The output is:
----------------------234
While I don’t feel that the starting position is extremely interesting (though I expect it has uses that I can’t yet imagine, especially since it comes before the next parameters, the occurrence parameter is. Now you can do basically what REGEXP_MATCHES does, but one call at a time:
SELECT REGEXP_SUBSTR('abc1234567890987654321','\d{3}',1,1);
This gets the first 3 characters
----------------------123
But the 3rd occurrence:
SELECT REGEXP_SUBSTR('abc1234567890987654321','\d{3}',1,3);
Returns the 7th, 8th, and 9th value:
----------------------789
You can see if we use REGEXP_MATCHES with this value,you can get the same output and see all the matches that you can get from REGEXP_SUBSTR:
SELECT match_id, match_value, REGEXP_SUBSTR('abc1234567890987654321','\d{3}',1,match_id) AS REGEXP_SUBSTR_resultFROM REGEXP_MATCHES('abc1234567890987654321','\d{3}');
You can see the output matches:
match_id match_value REGEXP_SUBSTR_result--------- -------------- ----------------------1 123 1232 456 4563 789 7894 098 0985 765 7656 432 432
Groups
Skipping over the flags (which work just like previously discussed here and here), the next parameter is groups.
As I demonstrated with REGEXP_REPLACE, there is a concept of a group that allows you to get a subset of a match. In the replace version, you could format the output using placeholders. For REGEX_SUBSTR, you don’t format the output, you can grab the text from the slice.
In the following example, you can see that that with the grouped filter I use in the following '(\d{1})(\d{2})' without groups will work the same as before. 3 characters. But yu can see in the 4th and 5th columns, that I included the group value of 1, and then two. So the first gets 1 character, the second gets the last two that we output.
SELECT match_id, match_value, REGEXP_SUBSTR('abc1234567890987654321','(\d{1})(\d{2})',1,match_id) AS REGEXP_SUBSTR, REGEXP_SUBSTR('abc1234567890987654321','(\d{1})(\d{2})',1,match_id,'',1) AS REGEXP_SUBSTR_1, REGEXP_SUBSTR('abc1234567890987654321','(\d{1})(\d{2})',1,match_id,'',2) AS REGEXP_SUBSTR_2FROM REGEXP_MATCHES('abc1234567890987654321','(\d{1})(\d{2})');
This shows that the groups are separated:
match_id match_value REGEXP_SUBSTR REGEXP_SUBSTR_1 REGEXP_SUBSTR_2---------- ------------ -------------- ---------------- ----------------1 123 123 1 232 456 456 4 563 789 789 7 894 098 098 0 985 765 765 7 656 432 432 4 32
Quick Example
Consider a table with phone numbers (or at least, supposed phone numbers). You want to fetch out the first three numeric values, assuming the data format matches an expected format. You can do this without a WHERE clause, just by matching the format in the regular expression, but then pick out a single part of the number when it matches the format:
CREATE TABLE #numbers(phoneNumber varchar(13));INSERT INTO #numbers VALUES ('111-111-1111'),('222-221-2222'),('333-333'),('44444444444');SELECT phoneNumber, REGEXP_SUBSTR(phoneNumber,'^(\d{3})-(\d{3})-(\d{4})$',1,1,'',1) AS areaCodeFROM #numbers;
You can see that the rows that matched the format was able to put out the value, but the two that didn’t match, returned NULL.
phoneNumber areaCode------------- -------------111-111-1111 111222-221-2222 222333-333 NULL44444444444 NULL
Note that you cannot do an OR expression here as you might expect. So if I try to add in an or to catch the '4444444444' value and get it’s area code, you will see that the '4444444444' value returns NULL still:
--using the same table variable from the first query in this sectionSELECT phoneNumber, REGEXP_SUBSTR(phoneNumber, '^(\d{3})-(\d{3})-(\d{4})$|^(\d{3})(\d{3})(\d{4})$',1,1,'',1) AS areaCodeFROM #numbers;
Same output. But now, change the group to 4:
SELECT phoneNumber, REGEXP_SUBSTR(phoneNumber, '^(\d{3})-(\d{3})-(\d{4})$|^(\d{3})(\d{3})(\d{4})$',1,1,'',4) AS areaCodeFROM @numbers;
And now only the 444 row returns an area code:
phoneNumber areaCode------------- -------------111-111-1111 NULL222-221-2222 NULL333-333 NULL4444444444 444
Summary
At this point in the series, we are getting really close to finishing up. So REGEX_SUBSTR isn’t at all new ground. But it is a definitely useful function that is going to be useful when splitting values, especially when they meet a certain format. Just a few more functions to go, and I plan to do the last two in one post (but to be fair, I intended to do this function as part of a 3-function blog, and this post started to feel a bit too large for that.



Leave a Reply