REGEXP_ Functions in SQL Server 2025 – REGEXP_SUBSTR

Part 10 of Learn RegEx With Louis

The REGEXP_SUBSTR function extracts parts of a string based on a regular expression pattern. It has some similarieties with the SUBSTRING function, but with some important (and interesting) differences. This function returns Nth occurrence of a substring that matches the regex pattern.

As I am sitting here looking to write my tenth entry about RegEx in SQL Server, I have to admit that I had no idea what this function was going to do. The classic SUBSTRING function takes strictly positional inputs. Given a string, return the characters from point 1, and then a certain number of characters from there. No pattern matching at all. Fortunately for you, my research will seem to take no time at all.

In my blog that was basically about REGEXP_MATCHES, I showed how you could see all the matches that an RegEx expression had with a string. REGEXP_SUBSTR, in it’s simplest form, will give you a scalar version of this function to return a single instance.

The syntax for this function is:

			
REGEXP_SUBSTR
(
  string_expression,
  pattern_expression [, start [, occurrence [, flags [, group ] ] ] ]
)

		

In its simplest form, you pass in a value, and then a pattern. In this next example, it will get the first 3 numeric character pattern:

SELECT REGEXP_SUBSTR('abc1234567890987654321','\d{3}');

This returns the first characters that match the \d3 regular expression.

			
----------------------
123

The next parameter tells you what character to start on. For example, if we start the search at the 4th character:

SELECT REGEXP_SUBSTR('abc1234567890987654321','\d{3}',4);

The output is still:

			
----------------------
123

Since it is 1-based, a is in the first position, and c is in the 3rd. But use 5:

SELECT REGEXP_SUBSTR('abc1234567890987654321','\d{3}',5);

The output is:

			
----------------------
234

While I don’t feel that the starting position is extremely interesting (though I expect it has uses that I can’t yet imagine, especially since it comes before the next parameters, the occurrence parameter is. Now you can do basically what REGEXP_MATCHES does, but one call at a time:

SELECT REGEXP_SUBSTR('abc1234567890987654321','\d{3}',1,1);

This gets the first 3 characters

			
----------------------
123

But the 3rd occurrence:

SELECT REGEXP_SUBSTR('abc1234567890987654321','\d{3}',1,3);

Returns the 7th, 8th, and 9th value:

			
----------------------
789

You can see if we use REGEXP_MATCHES with this value,you can get the same output and see all the matches that you can get from REGEXP_SUBSTR:

			
SELECT match_id, match_value,
       REGEXP_SUBSTR('abc1234567890987654321','\d{3}',1,match_id) 
                                 AS REGEXP_SUBSTR_result
FROM   REGEXP_MATCHES('abc1234567890987654321','\d{3}');

You can see the output matches:

			
match_id  match_value    REGEXP_SUBSTR_result
--------- -------------- ----------------------
       123            123
       456            456
       789            789
       098            098
       765            765
       432            432

		

Groups

Skipping over the flags (which work just like previously discussed here and here), the next parameter is groups.

As I demonstrated with REGEXP_REPLACE, there is a concept of a group that allows you to get a subset of a match. In the replace version, you could format the output using placeholders. For REGEX_SUBSTR, you don’t format the output, you can grab the text from the slice.

In the following example, you can see that that with the grouped filter I use in the following '(\d{1})(\d{2})' without groups will work the same as before. 3 characters. But yu can see in the 4th and 5th columns, that I included the group value of 1, and then two. So the first gets 1 character, the second gets the last two that we output.

			
SELECT match_id, match_value,
  REGEXP_SUBSTR('abc1234567890987654321','(\d{1})(\d{2})',1,match_id) 
                                                       AS REGEXP_SUBSTR,
  REGEXP_SUBSTR('abc1234567890987654321','(\d{1})(\d{2})',1,match_id,'',1) 
                                                       AS REGEXP_SUBSTR_1,
  REGEXP_SUBSTR('abc1234567890987654321','(\d{1})(\d{2})',1,match_id,'',2) 
                                                       AS REGEXP_SUBSTR_2
FROM   REGEXP_MATCHES('abc1234567890987654321','(\d{1})(\d{2})');

		

This shows that the groups are separated:

			
match_id   match_value  REGEXP_SUBSTR  REGEXP_SUBSTR_1  REGEXP_SUBSTR_2
---------- ------------ -------------- ---------------- ----------------
        123          123            1                23
        456          456            4                56
        789          789            7                89
        098          098            0                98
        765          765            7                65
        432          432            4                32

		

Quick Example

Consider a table with phone numbers (or at least, supposed phone numbers). You want to fetch out the first three numeric values, assuming the data format matches an expected format. You can do this without a WHERE clause, just by matching the format in the regular expression, but then pick out a single part of the number when it matches the format:

			
CREATE TABLE #numbers(phoneNumber varchar(13));
INSERT INTO #numbers 
VALUES ('111-111-1111'),('222-221-2222'),('333-333'),('44444444444');
SELECT phoneNumber,
      REGEXP_SUBSTR(phoneNumber,'^(\d{3})-(\d{3})-(\d{4})$',1,1,'',1) 
                                                           AS areaCode
FROM   #numbers;

		

You can see that the rows that matched the format was able to put out the value, but the two that didn’t match, returned NULL.

			
phoneNumber   areaCode
------------- -------------
111-111-1111  111
222-221-2222  222
333-333       NULL
44444444444   NULL

		

Note that you cannot do an OR expression here as you might expect. So if I try to add in an or to catch the '4444444444' value and get it’s area code, you will see that the '4444444444' value returns NULL still:

			
--using the same table variable from the first query in this section
SELECT phoneNumber,
       REGEXP_SUBSTR(phoneNumber,
          '^(\d{3})-(\d{3})-(\d{4})$|^(\d{3})(\d{3})(\d{4})$',1,1,'',1) 
                                                          AS areaCode
FROM #numbers;

		

Same output. But now, change the group to 4:

			
SELECT phoneNumber,
    REGEXP_SUBSTR(phoneNumber,
       '^(\d{3})-(\d{3})-(\d{4})$|^(\d{3})(\d{3})(\d{4})$',1,1,'',4) 
                                                          AS areaCode
FROM  @numbers;

		

And now only the 444 row returns an area code:

			
phoneNumber   areaCode
------------- -------------
111-111-1111  NULL
222-221-2222  NULL
333-333       NULL
4444444444    444

		

Summary

At this point in the series, we are getting really close to finishing up. So REGEX_SUBSTR isn’t at all new ground. But it is a definitely useful function that is going to be useful when splitting values, especially when they meet a certain format. Just a few more functions to go, and I plan to do the last two in one post (but to be fair, I intended to do this function as part of a 3-function blog, and this post started to feel a bit too large for that.

Drsql's Database Musings