Part 3 of Learn RegEx With Louis
In this article, we are going to take an initial look at what are referred to as “character classes” or “character sets” in Regular Expressions. They are commonly used when looking for data to be in a certain format. For example:
We are going to look at how to set a filter for 'lll-ll-lln' and/or 'lll-ll-lll' (where l is letter and n is numeric).
This is the data I will be working with, and the table can be created from the appendix of this article.
TRUNCATE TABLE dbo.RegExTest;INSERT INTO dbo.RegExTest(Value)VALUES ('aaa-aa-aa1'),('bbb-bb-bb2'), ('ccc-cc-ccc'),('ddd-dd-dd4'), ('eee~ee~ee5'),('ff-f-fff'), ('gggg-gg-ggg'),('justplainwrong'); --justplainwrong will not match anything, --for a testing purpose
Like normal, I will start with a few examples of using like expressions to achieve what it can achieve both naturally, and using some more complex coding. As I started out with my discussion of using regular expressions in SQL Server (SQL Server REGEXP_LIKE features augment, not replace, LIKE. SQL Server REGEXP_LIKE features augment, not replace, LIKE)/
Hence, if you can use LIKE alone, it is usually going to be better (certainly not worse) performance since it is an operator and not a function. Then the next best method will be using LIKE along with REGEXP_LIKE if LIKE can make better use of indexes.
The Basics of Character Sets
To look for a character in a set (or class, but set feels a lot more natural to me), you format them using brackets. Each bracketed set references a single character, but it can be any character in the set:
[a-z]: all alpha characters[0-9]: all numeric characters[1-3]: 1, 2, or 3[1-35-7]: 1, 2, 3, 5, 6 or 7[1256]: 1, 2, 5, or 6[@8!4*]: @, 8, !, 4
And so on. I will put together a deeper dive on what all can go in these character sets, including negation, and characters that cannot be simply included in a range in a later blog. For now, suffice it to say that you can create your own this character OR this character OR … expression.
Using a LIKE expression
Using like, you can look for the format of 'lll-ll-lln' by using different character sets, in this case the first 3 characters are letters, then a -, then repeat that over and over for each character you need.
SELECT ValueFROM RegExTestWHERE Value LIKE '[a-z][a-z][a-z]-[a-z][a-z]-[a-z][a-z][0-9]';
Returns:
Value-------------------------aaa-aa-aa1bbb-bb-bb2ddd-dd-dd4
And if you want your output to include the ccc-cc-ccc value, along with any other values that end with a single character you could do:
SELECT ValueFROM RegExTestWHERE Value LIKE '[a-z][a-z][a-z]-[a-z][a-z]-[a-z][a-z]_';
But this will include things that end with a percent sign. Now, for the sake of argument, what if couldn’t rewrite it so easily. You might do an OR condition.
SELECT ValueFROM RegExTestWHERE Value LIKE '[a-z][a-z][a-z]-[a-z][a-z]-[a-z][a-z][0-9]'OR Value LIKE '[a-z][a-z][a-z]-[a-z][a-z]-[a-z][a-z][a-z]';
Both of these queries return:
Value------------------------------aaa-aa-aa1bbb-bb-bb2ccc-cc-cccddd-dd-dd4
Your character set can include more than one range too, like [0-9a-z]:
SELECT ValueFROM RegExTestWHERE Value LIKE '[a-z][a-z][a-z]-[a-z][a-z]-[a-z][a-z][0-9a-z]';
With the extra row being the row with all ‘c’ characters. You might think that second one is not going to perform well, and that is true. But neither is the query using one LIKE expression that must start with any letter of the alphabet (unless perhaps most of the rows start with numeric values.)
Regular Expressions
Using regular expressions, we can use the same character class/set/range, but we will not have to repeat the range over and over in text if the character sets are the same.
For example, if you just want values that start with three alpha characters:
SELECT ValueFROM RegExTestWHERE REGEXP_LIKE(Value,'[a-z]{3}-.*');
This expression says 3 alphanumeric values, then a – (dash, not an em dash, I wrote this with my own two hands on my twinkly Razer keyboard!), and then any number of other characters. So instead of repeating [a-z] manually 3 times, this makes it simpler (and also makes it more flexible because making a LIKE expression that is allows a pattern that finds '&ll&' or '&l&' starts to get messy. and you definitely would consider two LIKE expressions OR'd like '&[a-z]&'or'&[a-z][a-z]&'
That query returns:
Value--------------------------------aaa-aa-aa1bbb-bb-bb2ccc-cc-cccddd-dd-dd4
Since they all start with 3 alpha characters, and then a dash. To finish this out, we can do the following to get all the rows that end in a number:
SELECT ValueFROM RegExTestWHERE REGEXP_LIKE(Value,'^[a-z]{3}-[a-z]{2}-[a-z]{2}[0-9]$' );
This gives us the rows that end in a number:
Value---------------------------aaa-aa-aa1bbb-bb-bb2ddd-dd-dd4
And we can include the row that ends in a letter in much the same way as we could before using LIKE:
SELECT ValueFROM RegExTestWHERE REGEXP_LIKE(Value,'^[a-z]{3}-[a-z]{2}-[a-z]{2}[0-9a-z]$');
Multiple formats, same regular expression
But what if this was a format that we couldn’t just alter the end of the string. Clearly, we could write it as an OR expression in the query. There are two versions of this I want to look at, using OR, and using variable-length segments. Neither of these is straightforward using a LIKE expression.
Using OR in Regular expressions
This first example is what is going to feel natural to most of us. I want my string to look this X or Y. So we naturally write something like this:
SELECT ValueFROM RegExTest -- ends in 2 letters and one numberWHERE REGEXP_LIKE(Value,'^[a-z]{3}-[a-z]{2}-[a-z]{2}[0-9]$' ) --ends in 3 letters OR REGEXP_LIKE(Value,'^[a-z]{3}-[a-z]{2}-[a-z]{3}$' );
This does work and returns what is expected:
Value-------------------------aaa-aa-aa1bbb-bb-bb2ccc-cc-cccddd-dd-dd4
We don’t try to mix the two because it can lead to errors. This may not always be the worst idea unless you are really willing to put in the test cycles. As these expressions get more and more complicated, they become harder to test, so really put them through the paces.
But there is a way we can write an OR expression in the regular expression. In the following expression, there are 2 expressions separated by a | (pipe character):
([a-z]{2}[0-9]|[a-z]{3})
The outcome of this expression is that these strings can match either:
[a-z]{2}[0-9]
or
[a-z]{3}
You can see this by executing the query with that expression added:
SELECT ValueFROM RegExTest-- ends in 2 letters and one numberWHERE REGEXP_LIKE (Value,'^[a-z]{3}-[a-z]{2}-([a-z]{2}[0-9]|[a-z]{3})$');
That statement returns:
Value-------------------------------aaa-aa-aa1bbb-bb-bb2ccc-cc-cccddd-dd-dd4
Which now includes either 3 letters OR 2 letters and one number.
Variably repeating patterns
Note that I included a row that has a value of ff-f-fff. Only the last segment size matches the others, but what if we wanted to match any string that starts with 2 or 3 [a-z] characters (but not 4), then a -, we can write:
SELECT ValueFROM RegExTest-- ends in 2 letters and one numberWHERE REGEXP_LIKE(Value,'^[a-z]{2,3}-' );
The output is:
Value-----------------------------------aaa-aa-aa1bbb-bb-bb2ccc-cc-cccddd-dd-dd4ff-f-fff
And if you change it to NOT REGEXP_LIKE
SELECT ValueFROM RegExTest-- ends in 2 letters and one numberWHERE NOT REGEXP_LIKE(Value,'^[a-z]{2,3}-' );
This now returns:
Value----------------------------eee~ee~ee5gggg-gg-gggjustplainwrong
Note: The next entry in the series will include the not like case in the expressions.
Something you can’t do with ranges
For alphanumeric data you can do ranges, and this typically makes great sense, but you cannot do this for non-alphanumeric data:
SELECT ValueFROM RegExTestWHERE REGEXP_LIKE(Value,'[@-&]');
This will output the following error:
Msg 19300, Level 16, State 1, Line 163 An invalid Pattern '[@-&]' was provided. Error 'invalid character class range: @-&' occurred during evaluation of the Pattern.
But you can do that for individual items:
SELECT ValueFROM (values('a'),('5'),('@')) as Rows (Value)WHERE REGEXP_LIKE(Value,'[@-&]');
Which will return just the @ (at) character from my derived table.
Summary
Boy, this is getting both horribly messy and very powerful. In this entry in the Learn Regular Expressions with Louis, we discussed character classes (sets/ranges) such as [0-9] which states to match a single character that is between 0 and 9 in the character set/collation being used.
We also introduced the | (or) construct that lets you filter for more than one thing in a single regular expression. And if you think we have gotten messy so far, check out one of the official email address validation regular expressions here. https://pdw.ex-parrot.com/Mail-RFC822-Address.html Sheesh.
Appendix
USE TempDB;GOSET NOCOUNT ON;-- Drop the table if it existsIF OBJECT_ID('dbo.RegExTest', 'U') IS NOT NULLDROP TABLE dbo.RegExTest;-- Create the tableCREATE TABLE dbo.RegExTest (RegExTestId INT IDENTITY(1,1) CONSTRAINT PK_RegExTest PRIMARY KEY,Value NVARCHAR(100) CONSTRAINT AK_RegExTest UNIQUE);



Leave a Reply