Part 4 of Learn RegEx With Louis

In part 4 we are starting to get close to knowing enough about Regular Expressions themselves to move towards looking at the different functions in SQL Server (they ae pretty standard functions, but I am very much looking at these in terms of SQL Server. I think there are several more entries to do (though I have changed that number since I started working on this blog), finishing up character classes, assertions, and then flags (honestly, I don’t really understand what assertions are, I have just read the term at this point!). I am pretty sure this isn’t all I need to learn, but it is what I see at this point left to cover.

In Part 3, I covered some of the basics of using character classes/sets. (I do tend to say sets.) This allowed us to do things like find words that start with a, b, c, d or e. This is done using: ^[a-e] or ^[abcde]. Now I want to look at two new things (one of which looks really similar to the previous classes but does things very differently.:

  • Negated character classes – Look for strings that don’t have a particular character in them
  • Perl character classes – shorthand for certain types of characters

One thing that I am going to reserve for a while are the special character classes that let you look for specific Unicode characters, ASCII characters, etc. I am going to need considerably more time fiddling with these (and maybe some external help!)

As usual, there is a table in the end of this blog in an Appendix section that creates a simple table. I will load my initial test cases here:


TRUNCATE TABLE dbo.RegExTest
GO
INSERT INTO dbo.RegExTest
VALUES ('Normal piece of text'),('$Prefixed piece of text'),
('1234567890'),('Includes a ^ caret'),('Not all digits 4 you');

Negated Character Classes

In this section, I will look at how to ask the question of a string “filter for any value without a certain set of characters.” To do this we need to negate a character set. This is done using the ^ (caret) symbol like this [^a-z] (we have previously used the ^ to denote the start of a string…it delightfully does both). This filter looks for any string that does not have any alpha characters.

So let’s try this… this should look for any string with a non-alpha character, right?:

SELECT Value
FROM   dbo.RegExTest
WHERE  REGEXP_LIKE(Value,'[^a-z]');

But everything is returned. Why? The space character, all the rows either have a space or no alpha characters. You have to account for it. Add that to the filter:

SELECT Value
FROM   dbo.RegExTest
WHERE  REGEXP_LIKE(Value,'[^a-z ]');

Now this returns the three rows that have non-alpha characters ($, numbers, and the ^ symbol)

Value
---------------------------
$Prefixed piece of text
1234567890
Includes a ^ caret

In the Perl Character Set section, I will cover more about how you can ignore (or look for) all typical white space characters. The only row that was not not returned from our query was:

Normal piece of text

Because all it has in it are alpha and blank characters (which were specifically excluded). Something to be careful to understand here is that there is a difference between:

WHERE REGEXP_LIKE(Value,'[^a-z ]'); --any single character match

and

WHERE REGEXP_LIKE(Value,'[^a-z ]{1,100}');
--any match that is 1-100 characters long.

The first one matches the 1234567890 10 times (which will matter when we move past just looking for values using REGEXP_LIKE) and the latter one just once. But say we want to matches to be in a longer pattern of non-alpha characters.

For example, consider this query:

SELECT Value
FROM   dbo.RegExTest
WHERE  REGEXP_LIKE(Value,'[^a-z ]{2}');

You might think this would return nothing since at least one item has more than 2 characters, but consider each value:

Value
--------------------------------------------------------
$Prefixed piece of text --prefixed with only 1 non alpha
1234567890 --12 matches, 34 matches, 56 matches, etc. 
           --So you actually have 5 matches.
Includes a ^ caret --only 1 caret
Normal piece of text --no matches

If you are on this journey with me, hang on because it is getting more and more interesting as time goes.

Can you add classes?

Not in the flavor of regular expressions that are using the SQL Server. This is from this document from Microsoft which notes the RE2 regular expression library. This is where I discovered there are a bunch of different regular expression implementation. Because it isn’t complex enough as it is!

So you may see something like this in examples (especially if you ask your friendly neighborhood LLM):

[a-z-[jkl]]

This pattern will not work in SQL Server, but it is the same as [a-im-z], just a bit more complex to write in some cases. For example, if you had code to let a person look for non-alpha characters, except a given set of values, you would have to write the code to translate the statement. Not a big deal, but interesting. The most important thing to learn is that if something isn’t working as you expected, check it out in Regex101.com but make sure you check the syntax in the RE2 regular expression library listed earlier.

Perl Character Classes

There are several character classes shorthand that you can use in what are called Perl character classes. They are not exactly obvious (especially in how capitalization is used to negate a class!), but they are definitely handy.

In the following subsections I will cover these.

\d – digits

equivlent to [0-9]

So any digit between 0-9. So you could do:

SELECT Value
FROM   RegExTest
WHERE  REGEXP_LIKE(Value,'\d'); --any item with digits

This would return the following in our samples:

Value
---------------------------
1234567890
Not all digits 4 you

You can also do \d{2}, which will specifically look for 2 digits:

SELECT Value
FROM   RegExTest
WHERE  REGEXP_LIKE(Value,'\d{2}'); 

Which will for our data will return:

Value
---------------------------
1234567890

Because it can match this string 5 times (and all LIKE needs is once).

\D – No digits

equivalent to [^0-9]

So now we can do something like the following, that will search for strings that can’t have more than one digit in a row.

SELECT Value
FROM   RegExTest
WHERE  REGEXP_LIKE(Value,'\D{2}');

Which returns these rows:

Value
---------------------------
$Prefixed piece of text
Includes a ^ caret
Normal piece of text
Not all digits 4 you

Whilt you might think that this next one (just using \D) would remove the 'Not all digits 4 you' (I did initially), remember that it matches every character individually. So that will have 19 matches, and only one non-match. All it takes is one.

SELECT Value
FROM   RegExTest
WHERE  REGEXP_LIKE(Value,'\D'); --Not only digits

Same output as in the previous example.

You can use the + modifier, which says that the previous pattern is matched one or more times. So it checks :

SELECT Value
FROM   RegExTest
WHERE  REGEXP_LIKE(Value,'^\D+$'); --No digits at all

This returns:

Value
---------------------------
$Prefixed piece of text
Includes a ^ caret
Normal piece of text

Note: this can be a wee bit tricky in some cases. This:

SELECT Value
FROM   RegExTest
WHERE  REGEXP_LIKE(Value,'^\D{1,10}$');

Matches no rows, because. all of the non-numeric rows are > 10 characters long. This will only match instances of numbers that are 1-10 characters long. (Very powerful syntax, but also a wee bit confusing too!) Change that 10 to any value longer than the strings, and you get back the same rows as

SELECT LEN(Value) as Length, Value
FROM   RegExTest
WHERE  REGEXP_LIKE(Value,'^\D{1,20}$');

Returns:

Length      Value
----------- ------------------------
18          Includes a ^ caret
20          Normal piece of text

You can see that this matches only exact lengths using the following example, which looks for exactly 20 character strings that are only non-numeric:

SELECT LEN(Value) as Length, Value
FROM   RegExTest
WHERE  REGEXP_LIKE(Value,'^\D{20}$');

Just returns the row with Length of 20.

Reminder to test these things a lot. It gets tricky and you have to be careful to only get the output you expect.

\s \S – Whitespace and Not Whitespace

\s equivalent to [\t\n\f\r ] and \S equivalent to [^\t\n\f\r ]

I see this definition in this page, but I don’t know what these characters are. (Does anyone keep the parts of regular expressions in their head? This is worse than sitting in a meeting with people who speak in acronyms that you don’t know :)) ChatGPT knew, so someone had written it down previously!:

  • \t – tab
  • \n – newline, linefeed
  • \r – carriage return
  • \f – page break/form feed

So I am going to add a few rows with these characters (we already have spaces):

INSERT INTO dbo.RegExTest
VALUES
('Includes    tab (not the drink)'),
('Includes carriage return
line feed'),
('Includes both tab    and
carriage return line feed');

Now, we can look for examples of tab characters using:

SELECT Value
FROM   RegExTest
WHERE  REGEXP_LIKE(Value,'\t');

It outputs these two rows.

Value
---------------------------

Includes both tab    and
carriage return line feed
Includes    tab (not the drink)

Note: If you are trying this at home, be sure that your editor isn’t turning tabs into spaces or it make take you longer than you’d like to admit to figure out what is going on. True story.

And carriage returns:

SELECT Value
FROM   RegExTest
WHERE  REGEXP_LIKE(Value,'\r');

Returns these two rows:

Value
---------------------------

Includes both tab    and
carriage return line feed
Includes carriage return
line feed

So now, you can look for rows with any whitespace:

SELECT Value
FROM   RegExTest
WHERE  REGEXP_LIKE(Value,'\s');

This returns a lot of rows:

Value
---------------------------

$Prefixed piece of text
Includes a ^ caret
Includes both tab    and
carriage return line feed
Includes carriage return
line feed
Includes    tab (not the drink)
Normal piece of text
Not all digits 4 you

And finally, the opposite using \S+$ to find rows with no whitepace:

SELECT Value
FROM   RegExTest
WHERE  REGEXP_LIKE(Value,'^\S+$');

The pattern wass discussed in the \D for digits section.

Value
---------------------------
1234567890

\w – Word Characters and \W not word characters

\w is equivalent to [0-9a-z_] and \W to [^0-9a-z_]

Noting that the function honors the collation which, if case sensitive you need to add the uppercase letters [0-9A-Za-z_]. And there is a flag to make it definitely case sensitive.

We did use this back in Simple Pattern Matching with SQL Server Regular Expressions to pad out strings we were looking for. First though, a few simple examples:

SELECT Value
FROM RegExTest
WHERE REGEXP_LIKE(Value,'[0-9A-Za-z_]'); 
           --any string that contains word characters

Or all of them, they all have letter or number characters. In fact, this one is probably slightly more useful for looking for non-word characters:

SELECT Value
FROM RegExTest
WHERE REGEXP_LIKE(Value,'\W'); 
        --any string that contains non word characters

But in data like I have created, only the string of all numbers dropped out. Why? it comes back to whitepsace. So you might think you could do the following and include \W and \S:

SELECT Value
FROM RegExTest
WHERE REGEXP_LIKE(Value,'[\W\S]'); 
         --any string that contains only word characters

But it turns out since character sets like this are OR’d together, this means it matches either not word characaters, or not space characters. You need to look for non-word AND non-space using:

SELECT Value
FROM RegExTest
WHERE REGEXP_LIKE(Value,'[^\w\s]'); 
        --any string that contains only word characters

This returns just the following rows:

Value
-----------------------------------
$Prefixed piece of text
Includes a ^ caret
Includes tab (not the drink)

You could add \t, ( and ) if you want the tab row to drop out. [^\w\s\t()]

Ths can be a very useful tool when you are debugging something like a CSV file, looking for some weird thing like an atypical character (sometimes something like a space that isn’t the standard unicode or char(32) character.

Finally for now, this was one of the things we looked at earlier. I will add two more rows to the table, this time one with two words that include tab in them, so I can show how to find the tab in the string, to start the string, and to end the string.

INSERT INTO dbo.RegExTest
VALUES
('My table is full'),
('Mytab is full');

Now, you can use the \w* construct to pad out a string in the front or back with only letters/numbers:

SELECT Value
FROM RegExTest
WHERE REGEXP_LIKE(Value,'\b\w<em>tab\w</em>\b'); 
--anything word that contains tab \b starts\ends a word, 
--and \w* fills in only letters or numbers

This returns rows with tab, table, and Mytab:

Value
---------------------------
Includes both tab    and
carriage return line feed
Includes    tab (not the drink)
My table is full
Mytab is full

Now change to get wordst that start specifically with tab by using the \b to make sure you pattern fits in a distinct word (denoted by whitespace characters):

SELECT Value
FROM RegExTest
WHERE REGEXP_LIKE(Value,'\btab\w*\b'); 
     --anything word that starts with tab

Now you see only rows containing tab and table:

Value
---------------------------
Includes both tab    and
carriage return line feed
Includes    tab (not the drink)
My table is full

And finally strings that end with tab:

SELECT Value
FROM RegExTest
WHERE REGEXP_LIKE(Value,'\b\w*tab\b'); 
    --anything word that starts with tab

That returns:

Value
---------------------------
Includes both tab    and
carriage return line feed
Includes    tab (not the drink)
Mytab is full

Summing it up

I started out to say that I was about finished learning the basic regular expression building blocks, but there is still a bit more syntax to discuss. Definitely very rich functionality that is going to be useful in addition to the LIKE operator, and things like PATINDEX.

In this entry, we discussed a few things.

  • Negating a character class using ^
    • So while [0-9] means a character between 0 and 9, [^0-9] means any character that is not in that same range.
  • Several shortcuts for different special characters:
    • \t: tab
    • \n: newline
    • \r: carriage return
    • \f: page break/form feed
  • Perl character classes
    • \d – digits = [0-9]
    • \D – not digits = [^0-9]
    • \s – whitepsace = [\t\n\f\r ]
    • \S – not whitepace = [^\t\n\f\r ]
    • \w – word characters = [0-9a-z_]
    • \W – not word characters = [^0-9a-z_]

And it only gets more cool, powerful, and complicated from here. I have also added the new items to the cheat sheet.

Appendix

USE TempDB;
GO

SET NOCOUNT ON;

-- Drop the table if it exists
IF OBJECT_ID('dbo.RegExTest', 'U') IS NOT NULL
DROP TABLE dbo.RegExTest;

-- Create the table
CREATE TABLE dbo.RegExTest (
RegExTestId INT IDENTITY(1,1) CONSTRAINT PK_RegExTest PRIMARY KEY,
Value NVARCHAR(100) CONSTRAINT AK_RegExTest UNIQUE
);

One response to “Negating Character Classes and Class Shortcuts in SQL Server RegEx”

  1. Why do I blog? – Drsql's Blog Avatar

    […] rather than directly a vocation. The other night, as I was sitting at my desk at 12:30 AM writing a blog about regular expressions, I had a crisis of conscience. I had originally sat down to do a bit of formatting, then I was […]

    Like

Leave a comment

I’m Louis

I have been at this database thing for a long long time, with no plans to stop.

This is my blog site, companion to drsql.org

Recents