Part 3 of Learn RegEx With Louis
In this article, we are going to take an initial look at what are referred to as “character classes” or “character sets” in Regular Expressions. They are commonly used when looking for data to be in a certain format. For example:
We are going to look at how to set a filter for 'lll-ll-lln' and/or 'lll-ll-lll' (where l is letter and n is numeric).
This is the data I will be working with, and the table can be created from the appendix of this article.
TRUNCATE TABLE dbo.RegExTest;
INSERT INTO dbo.RegExTest(Value)
VALUES ('aaa-aa-aa1'),('bbb-bb-bb2'),
('ccc-cc-ccc'),('ddd-dd-dd4'),
('eee~ee~ee5'),('ff-f-fff'),
('gggg-gg-ggg'),('justplainwrong');
--justplainwrong will not match anything,
--for a testing purpose
Like normal, I will start with a few examples of using like expressions to achieve what it can achieve both naturally, and using some more complex coding. As I started out with my discussion of using regular expressions in SQL Server (SQL Server REGEXP_LIKE features augment, not replace, LIKE. SQL Server REGEXP_LIKE features augment, not replace, LIKE)/
Hence, if you can use LIKE alone, it is usually going to be better (certainly not worse) performance since it is an operator and not a function. Then the next best method will be using LIKE along with REGEXP_LIKE if LIKE can make better use of indexes.
The Basics of character sets
To look for a character in a set (or class, but set feels a lot more natural to me), you format them using brackets. Each bracketed set references a single character, but it can be any character in the set:
[a-z]: all alpha characters[0-9]: all numeric characters[1-3]: 1, 2, or 3[1-35-7]: 1, 2, 3, 5, 6 or 7[1256]: 1, 2, 5, or 6[@8!4*]: @, 8, !, 4
And so on. I will put together a deeper dive on what all can go in these character sets, including negation, and characters that cannot be simply included in a range in a later blog. For now, suffice it to say that you can create your own this character OR this character OR … expression.
Using a LIKE expression
Using like, you can look for the format of 'lll-ll-lln' by using different character sets, in this case the first 3 characters are letters, then a -, then repeat that over and over for each character you need.
SELECT Value FROM RegExTest WHERE Value LIKE '[a-z][a-z][a-z]-[a-z][a-z]-[a-z][a-z][0-9]';
Returns:
Value ------------------------- aaa-aa-aa1 bbb-bb-bb2 ddd-dd-dd4
And if you want your output to include the ccc-cc-ccc value, along with any other values that end with a single character you could do:
SELECT Value FROM RegExTest WHERE Value LIKE '[a-z][a-z][a-z]-[a-z][a-z]-[a-z][a-z]_';
But this will include things that end with a percent sign. Now, for the sake of argument, what if couldn’t rewrite it so easily. You might do an OR condition.
SELECT Value FROM RegExTest WHERE Value LIKE '[a-z][a-z][a-z]-[a-z][a-z]-[a-z][a-z][0-9]' OR Value LIKE '[a-z][a-z][a-z]-[a-z][a-z]-[a-z][a-z][a-z]';
Both of these queries return:
Value ------------------------------ aaa-aa-aa1 bbb-bb-bb2 ccc-cc-ccc ddd-dd-dd4
Your character set can include more than one range too, like [0-9a-z]:
SELECT Value FROM RegExTest WHERE Value LIKE '[a-z][a-z][a-z]-[a-z][a-z]-[a-z][a-z][0-9a-z]';
With the extra row being the row with all ‘c’ characters. You might think that second one is not going to perform well, and that is true. But neither is the query using one LIKE expression that must start with any letter of the alphabet (unless perhaps most of the rows start with numeric values.)
Regular Expressions
Using regular expressions, we can use the same character class/set/range, but we will not have to repeat the range over and over in text if the character sets are the same.
For example, if you just want values that start with three alpha characters:
SELECT Value
FROM RegExTest
WHERE REGEXP_LIKE(Value,'[a-z]{3}-.*');
This expression says 3 alphanumeric values, then a – (dash, not an em dash, I wrote this with my own two hands on my twinkly Razer keyboard!), and then any number of other characters. So instead of repeating [a-z] manually 3 times, this makes it simpler (and also makes it more flexible because making a LIKE expression that is allows a pattern that finds '&ll&' or '&l&' starts to get messy. and you definitely would consider two LIKE expressions OR'd like '&[a-z]&'or'&[a-z][a-z]&'
That query returns:
Value -------------------------------- aaa-aa-aa1 bbb-bb-bb2 ccc-cc-ccc ddd-dd-dd4
Since they all start with 3 alpha characters, and then a dash. To finish this out, we can do the following to get all the rows that end in a number:
SELECT Value
FROM RegExTest
WHERE REGEXP_LIKE(Value,'^[a-z]{3}-[a-z]{2}-[a-z]{2}[0-9]$' );
This gives us the rows that end in a number:
Value --------------------------- aaa-aa-aa1 bbb-bb-bb2 ddd-dd-dd4
And we can include the row that ends in a letter in much the same way as we could before using LIKE:
SELECT Value
FROM RegExTest
WHERE REGEXP_LIKE(Value,'^[a-z]{3}-[a-z]{2}-[a-z]{2}[0-9a-z]$');
Multiple formats, same regular expression
But what if this was a format that we couldn’t just alter the end of the string. Clearly we could write it as an OR expression in the query. There are two versions of this I want to look at, using OR, and using variable length segments. Neither of this is straightforward using a LIKE expression.
Using OR in Regular expressions
This first example is what is going to feel natural to most of us. I want my string to look this X or Y. So we naturally write something like this:
SELECT Value
FROM RegExTest
-- ends in 2 letters and one number
WHERE REGEXP_LIKE(Value,'^[a-z]{3}-[a-z]{2}-[a-z]{2}[0-9]$' )
--ends in 3 letters
OR REGEXP_LIKE(Value,'^[a-z]{3}-[a-z]{2}-[a-z]{3}$' );
This does work and returns what is expected:
Value ------------------------- aaa-aa-aa1 bbb-bb-bb2 ccc-cc-ccc ddd-dd-dd4
We don’t try to mix the two because it can lead to errors. This may not always be the worst idea unless you are really willing to put in the test cycles. As these expressions get more and more complicated, they become harder to test, so really put them through the paces.
But there is a way we can write an OR expression in the regular expression. In the following expression, there are 2 expressions separated by a | (pipe character):
([a-z]{2}[0-9]|[a-z]{3})
The outcome of this expression is that this strings can match either:
[a-z]{2}[0-9]
or
[a-z]{3}
You can see this by executing the query with that expression added:
SELECT Value
FROM RegExTest
-- ends in 2 letters and one number
WHERE REGEXP_LIKE
(Value,'^[a-z]{3}-[a-z]{2}-([a-z]{2}[0-9]|[a-z]{3})$');
That statement returns:
Value ------------------------------- aaa-aa-aa1 bbb-bb-bb2 ccc-cc-ccc ddd-dd-dd4
Which now includes either 3 letters OR 2 letters and one number.
Variably repeating patterns
Note that I included a row that has a value of ff-f-fff. Only the last segment size matches the others, but what if we wanted match any string that has starts with 2 or 3 [a-z] characters (but not 4), then a -, we can write:
SELECT Value
FROM RegExTest
-- ends in 2 letters and one number
WHERE REGEXP_LIKE(Value,'^[a-z]{2,3}-' );
The output is:
Value ----------------------------------- aaa-aa-aa1 bbb-bb-bb2 ccc-cc-ccc ddd-dd-dd4 ff-f-fff
And if you change it to NOT REGEXP_LIKE
SELECT Value
FROM RegExTest
-- ends in 2 letters and one number
WHERE NOT REGEXP_LIKE(Value,'^[a-z]{2,3}-' );
This now returns:
Value ---------------------------- eee~ee~ee5 gggg-gg-ggg justplainwrong
Note: The next entry in the series will include the not like case in the expressions.
Something you can’t do with ranges
For alphanumeric data you can do ranges, and this typically makes great sense, but you cannot do this for non-alphanumeric data:
SELECT Value FROM RegExTest WHERE REGEXP_LIKE(Value,'[@-&]');
This will output the following error:
Msg 19300, Level 16, State 1, Line 163 An invalid Pattern '[@-&]' was provided. Error 'invalid character class range: @-&' occurred during evaluation of the Pattern.
But you can do that for individual items:
SELECT Value
FROM (values('a'),('5'),('@')) as Rows (Value)
WHERE REGEXP_LIKE(Value,'[@&]');
Which will return just the @ (at) character from my derived table.
Summary
Boy this is getting both horribly messy and very powerful. In this entry in the Learn Regular Expressions with Louis, we discussed character classes (sets/ranges) such as [0-9] which states to match a single character that is between 0 and 9 in the character set/collation being used.
We also introduced the | (or) construct that lets you filter for more than one thing in a single regular expression. And if you think we have gotten messy so far, check out one of the official email address validation regular expressions here. https://pdw.ex-parrot.com/Mail-RFC822-Address.html Sheesh.
Appendix
USE TempDB;
GO
SET NOCOUNT ON;
-- Drop the table if it exists
IF OBJECT_ID('dbo.RegExTest', 'U') IS NOT NULL
DROP TABLE dbo.RegExTest;
-- Create the table
CREATE TABLE dbo.RegExTest (
RegExTestId INT IDENTITY(1,1) CONSTRAINT PK_RegExTest PRIMARY KEY,
Value NVARCHAR(100) CONSTRAINT AK_RegExTest UNIQUE
);




Leave a comment