Part 4 of Learn RegEx With Louis
In part 4 we are starting to get close to knowing enough about Regular Expressions themselves to move towards looking at the different functions in SQL Server (they ae pretty standard functions, but I am very much looking at these in terms of SQL Server. I think there are several more entries to do (though I have changed that number since I started working on this blog), finishing up character classes, assertions, and then flags (honestly, I don’t really understand what assertions are, I have just read the term at this point!). I am pretty sure this isn’t all I need to learn, but it is what I see at this point left to cover.
In Part 3, I covered some of the basics of using character classes/sets. (I do tend to say sets.) This allowed us to do things like find words that start with a, b, c, d or e. This is done using: ^[a-e] or ^[abcde]. Now I want to look at two new things (one of which looks really similar to the previous classes but does things very differently.:
- Negated character classes – Look for strings that don’t have a particular character in them
- Perl character classes – shorthand for certain types of characters
One thing that I am going to reserve for a while are the special character classes that let you look for specific Unicode characters, ASCII characters, etc. I am going to need considerably more time fiddling with these (and maybe some external help!)
As usual, there is a table in the end of this blog in an Appendix section that creates a simple table. I will load my initial test cases here:
TRUNCATE TABLE dbo.RegExTest
GO
INSERT INTO dbo.RegExTest
VALUES ('Normal piece of text'),('$Prefixed piece of text'),
('1234567890'),('Includes a ^ caret'),('Not all digits 4 you');
Negated Character Classes
In this section, I will look at how to ask the question of a string “filter for any value without a certain set of characters.” To do this we need to negate a character set. This is done using the ^ (caret) symbol like this [^a-z] (we have previously used the ^ to denote the start of a string…it delightfully does both). This filter looks for any string that does not have any alpha characters.
So let’s try this… this should look for any string with a non-alpha character, right?:
SELECT Value FROM dbo.RegExTest WHERE REGEXP_LIKE(Value,'[^a-z]');
But everything is returned. Why? The space character, all the rows either have a space or no alpha characters. You have to account for it. Add that to the filter:
SELECT Value FROM dbo.RegExTest WHERE REGEXP_LIKE(Value,'[^a-z ]');
Now this returns the three rows that have non-alpha characters ($, numbers, and the ^ symbol)
Value --------------------------- $Prefixed piece of text 1234567890 Includes a ^ caret
In the Perl Character Set section, I will cover more about how you can ignore (or look for) all typical white space characters. The only row that was not not returned from our query was:
Normal piece of text
Because all it has in it are alpha and blank characters (which were specifically excluded). Something to be careful to understand here is that there is a difference between:
WHERE REGEXP_LIKE(Value,'[^a-z ]'); --any single character match
and
WHERE REGEXP_LIKE(Value,'[^a-z ]{1,100}'); --any match that is 1-100 characters long.
The first one matches the 1234567890 10 times (which will matter when we move past just looking for values using REGEXP_LIKE) and the latter one just once. But say we want to matches to be in a longer pattern of non-alpha characters.
For example, consider this query:
SELECT Value
FROM dbo.RegExTest
WHERE REGEXP_LIKE(Value,'[^a-z ]{2}');
You might think this would return nothing since at least one item has more than 2 characters, but consider each value:
Value
--------------------------------------------------------
$Prefixed piece of text --prefixed with only 1 non alpha
1234567890 --12 matches, 34 matches, 56 matches, etc.
--So you actually have 5 matches.
Includes a ^ caret --only 1 caret
Normal piece of text --no matches
If you are on this journey with me, hang on because it is getting more and more interesting as time goes.
Can you add classes?
Not in the flavor of regular expressions that are using the SQL Server. This is from this document from Microsoft which notes the RE2 regular expression library. This is where I discovered there are a bunch of different regular expression implementation. Because it isn’t complex enough as it is!
So you may see something like this in examples (especially if you ask your friendly neighborhood LLM):
[a-z-[jkl]]
This pattern will not work in SQL Server, but it is the same as [a-im-z], just a bit more complex to write in some cases. For example, if you had code to let a person look for non-alpha characters, except a given set of values, you would have to write the code to translate the statement. Not a big deal, but interesting. The most important thing to learn is that if something isn’t working as you expected, check it out in Regex101.com but make sure you check the syntax in the RE2 regular expression library listed earlier.
Perl Character Classes
There are several character classes shorthand that you can use in what are called Perl character classes. They are not exactly obvious (especially in how capitalization is used to negate a class!), but they are definitely handy.
In the following subsections I will cover these.
\d – digits
equivlent to [0-9]
So any digit between 0-9. So you could do:
SELECT Value FROM RegExTest WHERE REGEXP_LIKE(Value,'\d'); --any item with digits
This would return the following in our samples:
Value --------------------------- 1234567890 Not all digits 4 you
You can also do \d{2}, which will specifically look for 2 digits:
SELECT Value
FROM RegExTest
WHERE REGEXP_LIKE(Value,'\d{2}');
Which will for our data will return:
Value --------------------------- 1234567890
Because it can match this string 5 times (and all LIKE needs is once).
\D – No digits
equivalent to
[^0-9]
So now we can do something like the following, that will search for strings that can’t have more than one digit in a row.
SELECT Value
FROM RegExTest
WHERE REGEXP_LIKE(Value,'\D{2}');
Which returns these rows:
Value --------------------------- $Prefixed piece of text Includes a ^ caret Normal piece of text Not all digits 4 you
Whilt you might think that this next one (just using \D) would remove the 'Not all digits 4 you' (I did initially), remember that it matches every character individually. So that will have 19 matches, and only one non-match. All it takes is one.
SELECT Value FROM RegExTest WHERE REGEXP_LIKE(Value,'\D'); --Not only digits
Same output as in the previous example.
You can use the + modifier, which says that the previous pattern is matched one or more times. So it checks :
SELECT Value FROM RegExTest WHERE REGEXP_LIKE(Value,'^\D+$'); --No digits at all
This returns:
Value --------------------------- $Prefixed piece of text Includes a ^ caret Normal piece of text
Note: this can be a wee bit tricky in some cases. This:
SELECT Value
FROM RegExTest
WHERE REGEXP_LIKE(Value,'^\D{1,10}$');
Matches no rows, because. all of the non-numeric rows are > 10 characters long. This will only match instances of numbers that are 1-10 characters long. (Very powerful syntax, but also a wee bit confusing too!) Change that 10 to any value longer than the strings, and you get back the same rows as
SELECT LEN(Value) as Length, Value
FROM RegExTest
WHERE REGEXP_LIKE(Value,'^\D{1,20}$');
Returns:
Length Value ----------- ------------------------ 18 Includes a ^ caret 20 Normal piece of text
You can see that this matches only exact lengths using the following example, which looks for exactly 20 character strings that are only non-numeric:
SELECT LEN(Value) as Length, Value
FROM RegExTest
WHERE REGEXP_LIKE(Value,'^\D{20}$');
Just returns the row with Length of 20.
Reminder to test these things a lot. It gets tricky and you have to be careful to only get the output you expect.
\s \S – Whitespace and Not Whitespace
\s equivalent to
[\t\n\f\r ]and \S equivalent to[^\t\n\f\r ]
I see this definition in this page, but I don’t know what these characters are. (Does anyone keep the parts of regular expressions in their head? This is worse than sitting in a meeting with people who speak in acronyms that you don’t know :)) ChatGPT knew, so someone had written it down previously!:
\t– tab\n– newline, linefeed\r– carriage return\f– page break/form feed
So I am going to add a few rows with these characters (we already have spaces):
INSERT INTO dbo.RegExTest
VALUES
('Includes tab (not the drink)'),
('Includes carriage return
line feed'),
('Includes both tab and
carriage return line feed');
Now, we can look for examples of tab characters using:
SELECT Value FROM RegExTest WHERE REGEXP_LIKE(Value,'\t');
It outputs these two rows.
Value --------------------------- Includes both tab and carriage return line feed Includes tab (not the drink)
Note: If you are trying this at home, be sure that your editor isn’t turning tabs into spaces or it make take you longer than you’d like to admit to figure out what is going on. True story.
And carriage returns:
SELECT Value FROM RegExTest WHERE REGEXP_LIKE(Value,'\r');
Returns these two rows:
Value --------------------------- Includes both tab and carriage return line feed Includes carriage return line feed
So now, you can look for rows with any whitespace:
SELECT Value FROM RegExTest WHERE REGEXP_LIKE(Value,'\s');
This returns a lot of rows:
Value --------------------------- $Prefixed piece of text Includes a ^ caret Includes both tab and carriage return line feed Includes carriage return line feed Includes tab (not the drink) Normal piece of text Not all digits 4 you
And finally, the opposite using \S+$ to find rows with no whitepace:
SELECT Value FROM RegExTest WHERE REGEXP_LIKE(Value,'^\S+$');
The pattern wass discussed in the \D for digits section.
Value --------------------------- 1234567890
\w – Word Characters and \W not word characters
\wis equivalent to[0-9a-z_]and\Wto[^0-9a-z_]
Noting that the function honors the collation which, if case sensitive you need to add the uppercase letters [0-9A-Za-z_]. And there is a flag to make it definitely case sensitive.
We did use this back in Simple Pattern Matching with SQL Server Regular Expressions to pad out strings we were looking for. First though, a few simple examples:
SELECT Value
FROM RegExTest
WHERE REGEXP_LIKE(Value,'[0-9A-Za-z_]');
--any string that contains word characters
Or all of them, they all have letter or number characters. In fact, this one is probably slightly more useful for looking for non-word characters:
SELECT Value
FROM RegExTest
WHERE REGEXP_LIKE(Value,'\W');
--any string that contains non word characters
But in data like I have created, only the string of all numbers dropped out. Why? it comes back to whitepsace. So you might think you could do the following and include \W and \S:
SELECT Value
FROM RegExTest
WHERE REGEXP_LIKE(Value,'[\W\S]');
--any string that contains only word characters
But it turns out since character sets like this are OR’d together, this means it matches either not word characaters, or not space characters. You need to look for non-word AND non-space using:
SELECT Value
FROM RegExTest
WHERE REGEXP_LIKE(Value,'[^\w\s]');
--any string that contains only word characters
This returns just the following rows:
Value ----------------------------------- $Prefixed piece of text Includes a ^ caret Includes tab (not the drink)
You could add
\t,(and)if you want the tab row to drop out.[^\w\s\t()]
Ths can be a very useful tool when you are debugging something like a CSV file, looking for some weird thing like an atypical character (sometimes something like a space that isn’t the standard unicode or char(32) character.
Finally for now, this was one of the things we looked at earlier. I will add two more rows to the table, this time one with two words that include tab in them, so I can show how to find the tab in the string, to start the string, and to end the string.
INSERT INTO dbo.RegExTest
VALUES
('My table is full'),
('Mytab is full');
Now, you can use the \w* construct to pad out a string in the front or back with only letters/numbers:
SELECT Value FROM RegExTest WHERE REGEXP_LIKE(Value,'\b\w<em>tab\w</em>\b'); --anything word that contains tab \b starts\ends a word, --and \w* fills in only letters or numbers
This returns rows with tab, table, and Mytab:
Value --------------------------- Includes both tab and carriage return line feed Includes tab (not the drink) My table is full Mytab is full
Now change to get wordst that start specifically with tab by using the \b to make sure you pattern fits in a distinct word (denoted by whitespace characters):
SELECT Value
FROM RegExTest
WHERE REGEXP_LIKE(Value,'\btab\w*\b');
--anything word that starts with tab
Now you see only rows containing tab and table:
Value --------------------------- Includes both tab and carriage return line feed Includes tab (not the drink) My table is full
And finally strings that end with tab:
SELECT Value
FROM RegExTest
WHERE REGEXP_LIKE(Value,'\b\w*tab\b');
--anything word that starts with tab
That returns:
Value --------------------------- Includes both tab and carriage return line feed Includes tab (not the drink) Mytab is full
Summing it up
I started out to say that I was about finished learning the basic regular expression building blocks, but there is still a bit more syntax to discuss. Definitely very rich functionality that is going to be useful in addition to the LIKE operator, and things like PATINDEX.
In this entry, we discussed a few things.
- Negating a character class using ^
- So while
[0-9]means a character between 0 and 9,[^0-9]means any character that is not in that same range.
- So while
- Several shortcuts for different special characters:
\t: tab\n: newline\r: carriage return\f: page break/form feed
- Perl character classes
\d– digits =[0-9]\D– not digits =[^0-9]\s– whitepsace =[\t\n\f\r ]\S– not whitepace =[^\t\n\f\r ]\w– word characters =[0-9a-z_]\W– not word characters =[^0-9a-z_]
And it only gets more cool, powerful, and complicated from here. I have also added the new items to the cheat sheet.
Appendix
USE TempDB;
GO
SET NOCOUNT ON;
-- Drop the table if it exists
IF OBJECT_ID('dbo.RegExTest', 'U') IS NOT NULL
DROP TABLE dbo.RegExTest;
-- Create the table
CREATE TABLE dbo.RegExTest (
RegExTestId INT IDENTITY(1,1) CONSTRAINT PK_RegExTest PRIMARY KEY,
Value NVARCHAR(100) CONSTRAINT AK_RegExTest UNIQUE
);




Leave a reply to Why do I blog? – Drsql's Blog Cancel reply