Part 2 of Learn RegEx With Louis

In the previous entry in this series, we looked at matching a straightforward pattern like “fred”, which in a regular expression would find any string that had the characters f, r, e, d in a row at least once. We also noted that using \b to surround your string (front, back, or both), you can make sure that only fred was matched, and not frederick, freddy, bigfred or any other string that contains fred. And if you want to make sure the value is ONLY fred, you can simply delimit the front and the back of the string like: '^fred$'. And we discussed the . filter that fills in for any single character.

Last reminder, SQL Server’s implementation is apparently not case sensitive, but follows the collation. More on that in a later entry on case sensitive searches.

In this blog, I want to look for strings that have 1 or more instances of a repeating pattern. For example, say you want to look for something like the following:

LIKE '%FredFredFred%'

--(or any fixed or unlimited length of a, and only a)
LIKE '%aaaaaaaaaaaaaaa%' or '%aaaaaaa%'

I won’t start on the idea of a pattern like:

LIKE '%[abcdefg][abcdefg][abcdefg]%'

That will be left for the next entry.

Using the table that is again recreated in the Appendix of this blog, I will create some test data.

TRUNCATE TABLE RegExTest;

INSERT INTO RegExTest (Value)
VALUES ('aa'),('aaa'),('aaaa'),('aaaaa'),
       ('baaaaaa'),('aaaaab'),('baaaaab'),
       ('bababab'),('bababac'),('cbababac'),
       ('b1ab2ab3ab4a');

Repeating patterns using LIKE

Of course, using the exact string, it is easy to check for these basic patterns:

SELECT *
FROM  RegExTest
WHERE Value LIKE '%aa%';

However, what if you want to find a string that is only aa? That gets trickier. One pattern I have used in the past has been:

SELECT Value
FROM RegExTest
WHERE Value LIKE REPLICATE('a',LEN(Value));

So, this gives you:

Value
-----------------
aa
aaa
aaaa
aaaaa

But what about searching for: baa, baaa, or baaab? You clearly we could do:

SELECT Value
FROM RegExTest
WHERE Value LIKE 'b' + REPLICATE('a',LEN(Value) -1);

which for our set of data returns:

Value
----------------------
baaaaaa

And you could check to make sure that it starts with b and ends with b and just adjust the LEN offset, or use SUBSTRING perhaps? These tricks work, but regex definitely makes stuff like this easier for you.

Repeating single character patterns using RegEx

So if you want to look for a string that has 2 consecutive characters or patterns, you simply put the count in a search string like this:

{,[]}

So, lets say you want to find strings in our data where there are 3 or 4 a characters:

SELECT Value
FROM RegExTest
WHERE REGEXP_LIKE(Value,'a{3,4}');

This returns all of the rows with aaa patterns, but not the row with 2 a characters.

Value
--------------------------
aaa
aaaa
aaaaa
aaaaab
baaaaaa
baaaaab

This goes back to the word boundary conditions discussed in the previous entry. If you want ONLY rows with the 3 or 4 characters in the string, you need to use the \b to determine the boundary:

SELECT Value
FROM RegExTest
WHERE REGEXP_LIKE(Value,'\ba{3,4}\b');

Now you get just the two rows:

Value
---------------------
aaa
aaaa

If you want to let it be 1 to an unbounded number of a characters, leave off the boundary for the max number of characters:

SELECT Value
FROM RegExTest
WHERE REGEXP_LIKE(Value,'a{3,}');

This returns the same rows as when using 'a{3,4}', why? The same reason it matches the same rows as 'a{3,3}'. It is looking a the entire string and if there are 3 a character in a row, then you have a match. It actually matches the string multiple times, which doesn’t matter in a REGEXP_LIKE expression, but will become an issue with things like REGEXP_REPLACE.

Next, we want to look for rows with b in there as well. So we might write:

SELECT Value
FROM RegExTest
WHERE REGEXP_LIKE(Value,'ba{3,}');

This returns:

Value
--------------------
baaaaaa
baaaaab

and

SELECT Value
FROM RegExTest
WHERE REGEXP_LIKE(Value,'ba{3,}b');

This returns:

Value
----------------------
baaaaab

Repeating multi-character patterns using RegEx

Great… but what about these values? What kind of pattern can we use to look for repeated multi-character values?

'bababab', 'bababac', 'cbababac'

To do this, you need to use parenthesis to group the characters together. So, you can find these three strings (each with 3 ba patterns) using:

SELECT Value
FROM RegExTest
WHERE REGEXP_LIKE(Value,'(ba){3,}');

This returns:

Value
---------------------
bababab
bababac
cbababac

And if follows that you can then pick out the row prefixed and suffixed with c using:

SELECT Value
FROM RegExTest
WHERE REGEXP_LIKE(Value,'c(ba){3,}c');

Which returns:

Value
-------------------
cbababac

Finally, at least for this blog, what about this value? 'b1ab2ab3ab4a' where there is a pattern of bNumberabNumbera etc. For this pattern, just put the single character wildcard in the string that will be repeated, and you can then find b + anything + a:

SELECT Value
FROM RegExTest
WHERE REGEXP_LIKE(Value,'(b.a){3,}');

This returns that row that was expected. You may ask why not look for numbers in the pattern, and that is a reasonable question. It is one that we will get to soon when we start looking at regular expression character classes in the next entry in this series.

Note too that the repeating pattern is going to be used in various ways. One repeating tool is the *. as we looked at in the previous examples. So, I figured, let me try this, it should maybe match the same thing as '(b.a){3,}', right? um, no:

SELECT Value
FROM RegExTest
WHERE REGEXP_LIKE(Value,'(b.a).*');

This returns:

Value
------------------
b1ab2ab3ab4a
baaaaaa
baaaaab

But why? It basically repeats the pattern 0 and unlimited times. So, what this is saying is:

string that starts with b + any character + a + 0-unlimited characters.

Fine, so I didn’t want the last part of that, so why not this:

SELECT Value
FROM RegExTest
WHERE REGEXP_LIKE(Value,'(b.a)*');

This returns all rows… why? Because ALL of the strings match anything between 0 and any number of times. Just like this will:

SELECT Value
FROM RegExTest
WHERE REGEXP_LIKE(Value,'(So Does This)*');

But, a pattern you may actually want to use:

SELECT Value
FROM RegExTest
WHERE REGEXP_LIKE(Value,'^a*$');

This returns:

Value
---------------------
aa
aaa
aaaa
aaaaa

Because we delimited the start and the end, and are looking for only the character a repeated 0 or more times.

Summary

In this entry, I looked at how to repeat a series of characters or patterns using:

{,[]
(){,[]

as well as using the asterisk to repeat a pattern 0 or more times. Reminder of how important that one is because everything will match a pattern 0 times, so this is usually used for prefix/suffixes that may or may not be included.

These new constructs have been added into my cheat sheet here: Louis’ RegEx Cheat Sheet – Drsql Blog

Appendix

USE TempDB;
GO

SET NOCOUNT ON;

-- Drop the table if it exists
IF OBJECT_ID('dbo.RegExTest', 'U') IS NOT NULL
   DROP TABLE dbo.RegExTest;

-- Create the table
CREATE TABLE dbo.RegExTest (
   RegExTestId INT IDENTITY(1,1) 
       CONSTRAINT PK_RegExTest PRIMARY KEY,
   Value NVARCHAR(100) CONSTRAINT AK_RegExTest UNIQUE
);

One response to “Repeating Pattern Matching with SQL Server Regular Expressions”

  1. Viewing Multiple Matches in SQL Server Regular Expression Functions – Drsql's Blog Avatar

    […] has been used in all of the articles, let’s load in some rows. These are the rows I used in this blog. Like always, the table create is in the […]

    Like

Leave a reply to Viewing Multiple Matches in SQL Server Regular Expression Functions – Drsql's Blog Cancel reply

I’m Louis

I have been at this database thing for a long long time, with no plans to stop.

This is my blog site, companion to drsql.org

Recents