Part 1 of Learn RegEx With Louis

I have never once written an regular expression prior to a couple of articles on this blog. And truth be told, when I published those blogs, I got the expression wrong because it seemed to work, and it was what Copilot told me would work. If you are new like me and/or your code is important, test with lots of cases. I obviously fixed that code (thankfully the conclusions were right).

So no, I have never. LIKE does 99% of what I need in a simple manner, and .8% of the time in a complex way, so I never really thought about it too much. I suspect that will be the case even now in SQL, but like any good student, it is time to change my knowledge of regular expressions.

In this blog my goal will be to demonstrate the simplest comparisons, simple pattern matching. I want to start our tests with replicating things like:

LIKE 'This is my first known value'
LIKE 'Value%'
LIKE '%_alue%'

To test this, I am going to start out with a very simple table. You can get the definition from this post’s appendix, and I am only putting in a few values to start:

TRUNCATE TABLE dbo.RegExTest;

INSERT INTO dbo.RegExTest(Value)
VALUES ('This is a value'),
       ('This is also a value'),
       ('This is also a value and a string'),
       ('This is a value This is a value'),
       ('Is this a value This is a value'),
       ('How about val, how does this match'),
       ('How about val2, how does this match'),
       ('IsthisavalueThisisavalueWhyNoSpaces');

Note: In this series of posts on learning regular expressions, I don’t really care about performance. I am strictly trying to outline the tools you need to use regular expressions in specify situations. I am using a table for this so I can test more than one value at a time without having to retype the values.

Simple Filters

The most simple filters that we all regularly write are just simple “find a value” and “find a value that matches a simple string of characters”. That is what I will be covering in ths blog.

--USING LIKE
SELECT *
FROM RegExTest
WHERE Value LIKE 'This is a value';

The output is:

RegExTestId Value
----------- --------------------------------
1           This is a value

So, it stands to reason that you can just do this, right?:

SELECT *
FROM RegExTest
WHERE REGEXP_LIKE(Value, 'This is a value');

Reason, as you can see in the output, fails us:

RegExTestId Value
----------- --------------------------------
          5 Is this a value This is a value
          1 This is a value
          4 This is a value This is a value

Note: I am going to ignore the optional third parameter for this post. I will circle back to it unless it becomes necessary for the task at hand. One of those flags concerns case sensitivity. Regular Expressions are case sensitive by default, but in SQL Server, case sensitivity seems to default to the collation being used, just like a LIKE expression.

Any text you put in the string is treated as including any other characters in the front or rear of those characters. So you need to terminate the strings. In my previous posts about RegEx (such as this one about cardinality) I used \b to terminate the string, but that was incorrect (and I have repaired that now). \b is a word boundary, which I will touch on in a moment.

To get the exact match, you need to use:

^ – Matches the beginning of the string
$ – Matches the end of the string

Note: I have started my own cheat sheet here, where I have outlined these and other items I have used so far.

Without the delimiters, the string is considered to be an open-ended, so anything that matches these characters will be returned. You can even do things like spaces too:

SELECT Value
FROM RegExTest
WHERE REGEXP_LIKE(Value,' ');

You will see that all of the rows in the table are returned except 1. The row where the value of the Value column is 'IsthisavalueThisisavalueWhyNoSpaces'. There are several “special” characters, which I will look at in later entries, but two that are important to know immediately are:

. – represents a single character
/ – escape character
\ – starts regex commands

Note: Does this all seem kind of crazy? YES. Sometimes power is complicated! Eventually I hope to understand these things, but it won’t be immediate. One day I will see what all characters you can used, but for now, note you can use all of the alpha numeric characters which would be typical of most uses.

So, to get only the row where Value = 'This is a value', we need to query:

SELECT Value
FROM RegExTest
WHERE REGEXP_LIKE(Value,'^This is a value$');

Which returns, just the expected one row.

Beginning or end of a string

It should be semi-obvious how this is going to work, but for a person who has lived on LIKE for 30 years, it is a bit weird to type. For LIKE, the wildcard character is the trailing character:

--Using LIKE
SELECT Value
FROM RegExTest
WHERE Value LIKE 'This is%'

This returns:

Value
----------------------------------
This is a value
This is a value This is a value
This is also a value
This is also a value and a string


With Regular Expressions, the wildcard is inferred, but you have to terminate the start of the string:

SELECT Value
FROM RegExTest
WHERE REGEXP_LIKE (Value,'^This is');


Which gives you the same output as the LIKE expression previously did.

For a leading wildcard, you use the % as a prefix in the LIKE expression:

--Using LIKE
SELECT Value
FROM RegExTest
WHERE Value LIKE '%this is a value';

This returns these rows:

Value
---------------------------------
Is this a value This is a value
This is a value
This is a value This is a value

And the RegEx you terminate the ending of the string:

--Using RegEx

SELECT Value
FROM RegExTest
WHERE REGEXP_LIKE (Value,'This is a value$');

Again, same output as with the LIKE expression.

Matching all but a few characters

In some cases, you may want to match all (or most) of a string but not a few characters in the middle of a string. For a LIKE expression, you use the _ (underscore) character and for RegEx with a . (period).

For example, say I want to find any string that starts with ‘This is ‘, followed by any four characters, finishing up with ‘ a value’ plus any other characters.

--Using LIKE:
SELECT Value
FROM dbo.RegExTest
WHERE value LIKE 'This is __ a value%';

--Using RegEx
SELECT Value
FROM dbo.RegExTest
WHERE REGEXP_LIKE(Value,'^This is …. a value');

The output for both statements is:

Value
----------------------------------
This is also a value
This is also a value and a string

Matching just a part of a string

Finding a value that is part of a string, can be a bit more challenging with LIKE. For example, say you want only rows where the string “val” is included. Not value, not valor, not val2. Using like, this gets a little tricky. :

--Using LIKE
SELECT Value
FROM RegExTest
--WHERE Value LIKE '%val%'; --nope, returns everything

--nope, returns nothing because there is a comma after val
--WHERE Value LIKE '% val %'; 

WHERE Value LIKE '%[^a-z1-9]val[^a-z1-9]%'; 
             --val with no alphanumeric values

This returns:

Value
-----------------------------------
How about val, how does this match

Using RegEx, we can use the \b word boundary. This says that give us anything where the word is only that within the word boundary. So \bval\b, will only get the case where the characters val exist, and will ignore punctuation.

The next two statements will get the rows that match only the whole val and val2 words:

SELECT Value
FROM RegExTest
WHERE REGEXP_LIKE(Value,'\bval\b');

SELECT Value
FROM RegExTest
WHERE REGEXP_LIKE(Value,'\bval2\b');

The outputs are, respectively:

Value
------------------------------------
How about val, how does this match

Value
------------------------------------
How about val2, how does this match

What if I want val and val2 and value?

Then we can use \w*. \w is a character in a word, and the * means it can be 0 or more of the filter that precedes it. This is similar to % in LIKE, but you only (seem) to need it when it is in a special situation, like here where I want to the variable number of characters to be in the contiguous string, so it would not include anything after the string.

SELECT Value
FROM RegExTest
WHERE REGEXP_LIKE(Value,'\bval\w*\b')

You may be thinking, it matches with val%, why does this matter? In a LIKE expression it may not. But when we move into the tools, like doing a replace for all words that start with Val, we may need to just match the exact works that are being looked at. You can test your expressions in a place like this: https://regex101.com/, where I have typed in “This is a val that you need to check the value of in valor.” and you can see the output has matches for several words in there.

If you wanted to replace them all, you can do that with REGEXP_REPLACE (I am pretty sure at least, I am not quite there yet :))

Appendix

USE TempDB;
GO

SET NOCOUNT ON;

-- Drop the table if it exists
IF OBJECT_ID('dbo.RegExTest', 'U') IS NOT NULL
   DROP TABLE dbo.RegExTest;

-- Create the table
CREATE TABLE dbo.RegExTest (
  RegExTestId INT IDENTITY(1,1) CONSTRAINT PK_RegExTest PRIMARY KEY,
  Value NVARCHAR(100) CONSTRAINT AK_RegExTest UNIQUE
);

4 responses to “Simple Pattern Matching with SQL Server Regular Expressions”

  1. Negating Character Classes and Class Shortcuts in SQL Server – Drsql's Blog Avatar

    […] did use this back in Simple Pattern Matching with SQL Server Regular Expressions to pad out strings we were looking for. First though, a few simple […]

    Like

  2. […] Part 1 covers simple pattern matching: […]

    Like

  3. Francesco Avatar

    This will create an orgy of Non-SARGable queries.
    Do we have any statistics about the performances and REGEX in SQL Server?

    Like

    1. Louis Davidson Avatar

      I wrote a few things about this:

      SQL Server REGEXP_LIKE features augment, not replace, LIKE

      Cardinality and REGEXP_LIKE

      But the bottom line is that RegEx is very similar to any other WHERE clause function where a column value is used as an argument (slightly better, in the cardinality blog I show that you do have a wee bit of control over the cardinality estimates!), but for the most part, REGEXP_LIKE is just a tool to use when a LIKE expression won’t do it (and the other article notes that you should use LIKE and REGEXP_LIKE if you can do any filtering on a leading column.

      Thank you for the comment!

      Like

Leave a reply to Francesco Cancel reply

I’m Louis

I have been at this database thing for a long long time, with no plans to stop.

This is my blog site, companion to drsql.org

Recents