And the rest – REGEXP_INSTR, and REGEXP_COUNT

Part 11 of Learn RegEx With Louis

I can’t believe I am finally here. I almost wanted to make this into two blogs, but I decided that they were so similar to the others that it could be done in just one. Much like the Professor and Mary Ann were just as important as the other castaways. These functions are very useful too.

I am only combining them into a short version because they are, in how they work, very similar to all the other functions. I certainly will demonstrate all the functionality for each function, but not to the extra level I have in previous blogs.

This time, I will cover:

REGEXP_INSTR Returns the starting or ending position of the matched substring, depending on the option supplied.
REGEXP_COUNT Returns a count of the number of times that regex pattern occurs in a string.

`REGEXP_INSTR`

There is no INSTR function in SQL Server, though PATINDEX is a close enough analog for comparison. PATINDEX takes two parameters, and inputstring and a like expression, allowing you to find the first character of the first instance of that like expression pattern. For example:

SELECT PATINDEX('%3%','1234567890');

This returns 3. 3 is clearly the position of the 3, but with the ‘%’ charcters surrounding it seems like it might have been 1. But just like we look for a pattern using regular expressions with an open ended patther, PATINDEX also uses an “any number of any characters” in the search pattern along with, if you want it, the specific string value you are looking for. (searching for a pattern of ‘%’ will return 1, even for an empty string.

REGEXP_INSTR has these (by now in the series) very familiar other than one, parameters:

REGEXP_INSTR
(
string_expression,
pattern_expression
[, start [, occurrence [, return_option [, flags [, group ] ] ] ] ]
)

The return_option parameter is one that I will look at later in the blog.

The Simplest Form:

In it’s simplest form, you simply pass in a string value and a pattern to match. This pattern matches 3 (plus any characters before and after).

SELECT REGEXP_INSTR('1234567890','3');

This returns:

-----------
3

Since the 3 is in the third position.

Adding the `start` and `occurrence` parameters

The start and occurrence parameters work just like in the REGEXP_SUBSTR function.you can use one to find the location of a value, and the other to find the actual value, with the same starting 4 parameters.

For example, using the string of numbers I used in the REGEXP_SUBSTR blog, I will locate the pattern for the first, second, and third instances.

SELECT REGEXP_INSTR('1234567890','\d{3}',1,1) AS Loc_1,
       REGEXP_SUBSTR('1234567890','\d{3}',1,1) AS Value_1,
       REGEXP_INSTR('1234567890','\d{3}',1,2) AS Loc_2,
       REGEXP_SUBSTR('1234567890','\d{3}',1,2) AS Value_2,
       REGEXP_INSTR('1234567890','\d{3}',1,3) AS Loc_3,
       REGEXP_SUBSTR('1234567890','\d{3}',1,3) AS Value_3

Which you can see is 1, 4, and 7, along with the values that are being found:

Loc_1   Value_1  Loc_2   Value_2    Loc_3   Value_3
------- -------- ------- ---------- ------- ----------
1       123      4       456        7       789

The `return_option` parameter

The one main difference in REGEXP_INSTR from the other functions we hae looked at is the return_option parameter. It shows you the postioin that is right after the end of the pattern. It is a bit confusing (and the numeric parameter doesn’t help much with that), but with this value, you can find the start and end ((and subtracting 1 from the value), you can find the location of the start and end of the match to your pattern.

The possible values for the return_option parameter:

0 (default) – return the starting position for the value
1 – return the first position after the value

In this next piece of code, I have added the return_option value of 0 or 1 for the start and end of each position:

SELECT REGEXP_INSTR('1234567890','\d{3}',1,1,0) AS Start_1,
       REGEXP_INSTR('1234567890','\d{3}',1,1,1) AS NextPos_1,
       REGEXP_SUBSTR('1234567890','\d{3}',1,1) AS Value_1,
       REGEXP_INSTR('1234567890','\d{3}',1,2,0) AS Start_2,
       REGEXP_INSTR('1234567890','\d{3}',1,2,1) AS NextPos_2,
       REGEXP_SUBSTR('1234567890','\d{3}',1,2) AS Value_2;

This returns:

Start_1  NextPos_1   Value_1  Start_2   NextPos_2   Value_2
-------- ----------- -------- --------- ----------- --------
1        4           123      4         7           456

I am not quite sure the use case here, because for the most part I am going to want positions of value to use with a SUBSTRING/RIGHT/LEFT function call, but with all the power we have with the REGEXP_% family of functions, most of the string manipulation is done for us.

The `group` parameter

Skipping over the flag parameter, which works just as it has in previous functions, there is the group parameter. This parameter works very similar to how it did with the REGEXP_SUBST function (and as I will show, this will give you the location of the group that you may have fetched with the REGEXP_SUBSTR call.

Here is the sample call. It is getting a bit messy to follow since you can’t name parameters, but in this next call I have added comments that hopefully will help:

SELECT REGEXP_INSTR(
                    --Value to parse:
                    'abc1234567890987654321',
                    --Regex pattern:
                    '(\d{1})(\d{2})',
                    --Starting position:
                    1,
                    --Occurrence:
                    2,
                    --Return Option:
                    0, --(meaning return the start)
                    --Flags:
                    '',
                    --Group:
                    2
                    ) AS REGEXP_INSTR_2

This returns:

REGEXP_INSTR_2
--------------
8

Looking into the value:

Position: 1234567890123456789012
Value:    abc1234567890987654321

You can see that the first 4 is at position 7 (and that the first character is in the second group of 3 digits). The first group starts at 7, and is one character long. So the second group appears in the 8th position. To see a bit more detail:

SELECT REGEXP_SUBSTR('abc1234567890987654321',
              '(\d{1})(\d{2})',1,2) as REGEXP_SUBSTR,
REGEXP_SUBSTR('abc1234567890987654321',
              '(\d{1})(\d{2})',1,2,'',1) as REGEXP_SUBSTR_1,
REGEXP_SUBSTR('abc1234567890987654321',
              '(\d{1})(\d{2})',1,2,'',2) as REGEXP_SUBSTR_2

SELECT REGEXP_INSTR('abc1234567890987654321',
              '(\d{1})(\d{2})',1,2) as REGEXP_INSTR,
REGEXP_INSTR('abc1234567890987654321',
              '(\d{1})(\d{2})',1,2,0,'',1) as REGEXP_INSTR_1,
REGEXP_INSTR('abc1234567890987654321',
              '(\d{1})(\d{2})',1,2,0,'',2) as REGEXP_INSTR_2

This outputs:

REGEXP_SUBSTR  REGEXP_SUBSTR_1  REGEXP_SUBSTR_2
-------------- ---------------- ---------------
456            4                56

REGEXP_INSTR   REGEXP_INSTR_1   REGEXP_INSTR_2
-------------- ---------------- ---------------
7              7                8

Which you can see is much like the example we looked at before, where you can see the string and the groups in the REGEXP_SUBSTR calls and their locations in the REGEXP_INSTR calls.

`REGEXP_COUNT`

The purpose of this function is to count the number of matches you will find for a pattern in a string.

This function is the only one that I am more or less going to say more or less, you can probably figure it out from knowing everything we have discussed previously. I have already covered most of what is important about this function in all the other blogs in the series. But it is definitely an interesting function that you might use, especially doing some initial prepping for how many matches you might find and need to process.

The syntax is:

REGEXP_COUNT (
         string_expression,
         pattern_expression [ , start [ , flags ] ]
)

THis function just has the very base parameters, with optional start parameter to indicate the starting position for processing, and flags when you need them. For the example, I will use the same string and pattern we have been working with:

SELECT REGEXP_COUNT('abc1234567890987654321',
                      '(\d{1})(\d{2})') as REGEXP_COUNT;

This returns:

REGEXP_COUNT
------------
6

In this context, this is exactly the same as \d{3}. as you can see here:

SELECT REGEXP_COUNT('abc1234567890987654321',
                         '\d{3}') as REGEXP_COUNT;

We can see this in two ways, eithe just counting them:

   111222333444555666
abc1234567890987654321

Or by using REGEXP_MATCHES and seeing how many rows it outputs:

SELECT match_id, start_position, end_position, match_value
FROM   REGEXP_MATCHES('abc1234567890987654321','\d{3}') AS MATCHES;

You can see the characters at the different matches:

match_id   start_position end_position match_value
---------- -------------- ------------ ------------
1          4              6            123
2          7              9            456
3          10             12           789
4          13             15           098
5          16             18           765
6          19             21           432

Which has 6 rows, 1 for each match, which you saw in the REGEXP_COUNT example.

REGEXP_COUNT isn’t a function that I figure to use a whole lot, but it is yet another nice tool to have in the toolbox when you want to see how many matches you might have in a group of text.

The End of the Basics Blogs

The is the end of the weekly “Learn ReGex with Louis” series. I am really glad I did this, and I hope some people found it useful. I will refer back to this series as well as my RegEx cheat sheet when I start using this stuff in production work.

In the future, I will add more to the series in a couple of ways:

New use cases and tricks I can accomplish with regular expressions
Performance testing to see how these functions compare to other techniques. They are definitely going to be useful in some use cases, but will they be slower than other tools when you are doing thousands, millions, or even billions of operations?

I hope you have liked it so far, and if you have things you think I should try, give me a yeal and I can’t promise I will try, but I will try to try!

Drsql.link