Part 3 of Learn RegEx With Louis

In this article, we are going to take an initial look at what are referred to as “character classes” or “character sets” in Regular Expressions. They are commonly used when looking for data to be in a certain format. For example:

We are going to look at how to set a filter for 'lll-ll-lln' and/or 'lll-ll-lll' (where l is letter and n is numeric).

This is the data I will be working with, and the table can be created from the appendix of this article.

TRUNCATE TABLE dbo.RegExTest;

INSERT INTO dbo.RegExTest(Value)
VALUES ('aaa-aa-aa1'),('bbb-bb-bb2'),
       ('ccc-cc-ccc'),('ddd-dd-dd4'),
       ('eee~ee~ee5'),('ff-f-fff'),
       ('gggg-gg-ggg'),('justplainwrong');
       --justplainwrong will not match anything, 
       --for a testing purpose

Like normal, I will start with a few examples of using like expressions to achieve what it can achieve both naturally, and using some more complex coding. As I started out with my discussion of using regular expressions in SQL Server (SQL Server REGEXP_LIKE features augment, not replace, LIKE. SQL Server REGEXP_LIKE features augment, not replace, LIKE)/

Hence, if you can use LIKE alone, it is usually going to be better (certainly not worse) performance since it is an operator and not a function. Then the next best method will be using LIKE along with REGEXP_LIKE if LIKE can make better use of indexes.

The Basics of character sets

To look for a character in a set (or class, but set feels a lot more natural to me), you format them using brackets. Each bracketed set references a single character, but it can be any character in the set:

  • [a-z]: all alpha characters
  • [0-9]: all numeric characters
  • [1-3]: 1, 2, or 3
  • [1-35-7]: 1, 2, 3, 5, 6 or 7
  • [1256]: 1, 2, 5, or 6
  • [@8!4*]: @, 8, !, 4

And so on. I will put together a deeper dive on what all can go in these character sets, including negation, and characters that cannot be simply included in a range in a later blog. For now, suffice it to say that you can create your own this character OR this character OR … expression.

Using a LIKE expression

Using like, you can look for the format of 'lll-ll-lln' by using different character sets, in this case the first 3 characters are letters, then a -, then repeat that over and over for each character you need.

SELECT Value
FROM   RegExTest
WHERE  Value LIKE '[a-z][a-z][a-z]-[a-z][a-z]-[a-z][a-z][0-9]';

Returns:

Value
-------------------------
aaa-aa-aa1
bbb-bb-bb2
ddd-dd-dd4

And if you want your output to include the ccc-cc-ccc value, along with any other values that end with a single character you could do:

SELECT Value
FROM   RegExTest
WHERE  Value LIKE '[a-z][a-z][a-z]-[a-z][a-z]-[a-z][a-z]_';

But this will include things that end with a percent sign. Now, for the sake of argument, what if couldn’t rewrite it so easily. You might do an OR condition.

SELECT Value
FROM   RegExTest
WHERE  Value LIKE '[a-z][a-z][a-z]-[a-z][a-z]-[a-z][a-z][0-9]'
OR    Value LIKE '[a-z][a-z][a-z]-[a-z][a-z]-[a-z][a-z][a-z]';

Both of these queries return:

Value
------------------------------
aaa-aa-aa1
bbb-bb-bb2
ccc-cc-ccc
ddd-dd-dd4

Your character set can include more than one range too, like [0-9a-z]:

SELECT Value
FROM   RegExTest
WHERE  Value LIKE '[a-z][a-z][a-z]-[a-z][a-z]-[a-z][a-z][0-9a-z]';

With the extra row being the row with all ‘c’ characters. You might think that second one is not going to perform well, and that is true. But neither is the query using one LIKE expression that must start with any letter of the alphabet (unless perhaps most of the rows start with numeric values.)

Regular Expressions

Using regular expressions, we can use the same character class/set/range, but we will not have to repeat the range over and over in text if the character sets are the same.

For example, if you just want values that start with three alpha characters:

SELECT Value
FROM   RegExTest
WHERE  REGEXP_LIKE(Value,'[a-z]{3}-.*');

This expression says 3 alphanumeric values, then a – (dash, not an em dash, I wrote this with my own two hands on my twinkly Razer keyboard!), and then any number of other characters. So instead of repeating [a-z] manually 3 times, this makes it simpler (and also makes it more flexible because making a LIKE expression that is allows a pattern that finds '&ll&' or '&l&' starts to get messy. and you definitely would consider two LIKE expressions OR'd like '&[a-z]&'or'&[a-z][a-z]&'

That query returns:

Value
--------------------------------
aaa-aa-aa1
bbb-bb-bb2
ccc-cc-ccc
ddd-dd-dd4

Since they all start with 3 alpha characters, and then a dash. To finish this out, we can do the following to get all the rows that end in a number:

SELECT Value
FROM   RegExTest
WHERE  REGEXP_LIKE(Value,'^[a-z]{3}-[a-z]{2}-[a-z]{2}[0-9]$' );

This gives us the rows that end in a number:

Value
---------------------------
aaa-aa-aa1
bbb-bb-bb2
ddd-dd-dd4

And we can include the row that ends in a letter in much the same way as we could before using LIKE:

SELECT Value
FROM   RegExTest
WHERE  REGEXP_LIKE(Value,'^[a-z]{3}-[a-z]{2}-[a-z]{2}[0-9a-z]$');

Multiple formats, same regular expression

But what if this was a format that we couldn’t just alter the end of the string. Clearly we could write it as an OR expression in the query. There are two versions of this I want to look at, using OR, and using variable length segments. Neither of this is straightforward using a LIKE expression.

Using OR in Regular expressions

This first example is what is going to feel natural to most of us. I want my string to look this X or Y. So we naturally write something like this:

SELECT Value
FROM   RegExTest
       -- ends in 2 letters and one number
WHERE  REGEXP_LIKE(Value,'^[a-z]{3}-[a-z]{2}-[a-z]{2}[0-9]$' )
       --ends in 3 letters
   OR   REGEXP_LIKE(Value,'^[a-z]{3}-[a-z]{2}-[a-z]{3}$' );

This does work and returns what is expected:

Value
-------------------------
aaa-aa-aa1
bbb-bb-bb2
ccc-cc-ccc
ddd-dd-dd4

We don’t try to mix the two because it can lead to errors. This may not always be the worst idea unless you are really willing to put in the test cycles. As these expressions get more and more complicated, they become harder to test, so really put them through the paces.

But there is a way we can write an OR expression in the regular expression. In the following expression, there are 2 expressions separated by a | (pipe character):

([a-z]{2}[0-9]|[a-z]{3})

The outcome of this expression is that this strings can match either:

[a-z]{2}[0-9]

or

[a-z]{3}

You can see this by executing the query with that expression added:

SELECT Value
FROM   RegExTest
-- ends in 2 letters and one number
WHERE  REGEXP_LIKE
        (Value,'^[a-z]{3}-[a-z]{2}-([a-z]{2}[0-9]|[a-z]{3})$');

That statement returns:

Value
-------------------------------
aaa-aa-aa1
bbb-bb-bb2
ccc-cc-ccc
ddd-dd-dd4

Which now includes either 3 letters OR 2 letters and one number.

Variably repeating patterns

Note that I included a row that has a value of ff-f-fff. Only the last segment size matches the others, but what if we wanted match any string that has starts with 2 or 3 [a-z] characters (but not 4), then a -, we can write:

SELECT Value
FROM   RegExTest
-- ends in 2 letters and one number
WHERE  REGEXP_LIKE(Value,'^[a-z]{2,3}-' );

The output is:

Value
-----------------------------------
aaa-aa-aa1
bbb-bb-bb2
ccc-cc-ccc
ddd-dd-dd4
ff-f-fff

And if you change it to NOT REGEXP_LIKE

SELECT Value
FROM   RegExTest
-- ends in 2 letters and one number
WHERE  NOT REGEXP_LIKE(Value,'^[a-z]{2,3}-' );

This now returns:

Value
----------------------------
eee~ee~ee5
gggg-gg-ggg
justplainwrong

Note: The next entry in the series will include the not like case in the expressions.

Something you can’t do with ranges

For alphanumeric data you can do ranges, and this typically makes great sense, but you cannot do this for non-alphanumeric data:

SELECT Value
FROM RegExTest
WHERE REGEXP_LIKE(Value,'[@-&]');

This will output the following error:

Msg 19300, Level 16, State 1, Line 163 An invalid Pattern '[@-&]' was provided. Error 'invalid character class range: @-&' occurred during evaluation of the Pattern.

But you can do that for individual items:

SELECT Value
FROM   (values('a'),('5'),('@')) as Rows (Value)
WHERE  REGEXP_LIKE(Value,'[@&]');

Which will return just the @ (at) character from my derived table.

Summary

Boy this is getting both horribly messy and very powerful. In this entry in the Learn Regular Expressions with Louis, we discussed character classes (sets/ranges) such as [0-9] which states to match a single character that is between 0 and 9 in the character set/collation being used.

We also introduced the | (or) construct that lets you filter for more than one thing in a single regular expression. And if you think we have gotten messy so far, check out one of the official email address validation regular expressions here. https://pdw.ex-parrot.com/Mail-RFC822-Address.html Sheesh.

Appendix

USE TempDB;
GO

SET NOCOUNT ON;

-- Drop the table if it exists
IF OBJECT_ID('dbo.RegExTest', 'U') IS NOT NULL
DROP TABLE dbo.RegExTest;

-- Create the table
CREATE TABLE dbo.RegExTest (
RegExTestId INT IDENTITY(1,1) CONSTRAINT PK_RegExTest PRIMARY KEY,
Value NVARCHAR(100) CONSTRAINT AK_RegExTest UNIQUE
);

One response to “Matching one of a set of characters with SQL Server Regular Expressions”

  1. Negating Character Classes and Class Shortcuts in SQL Server – Drsql's Blog Avatar

    […] In Part 3, I covered some of the basics of using character classes/sets. (I do tend to say sets.) This allowed us to do things like find words that start with a, b, c, d or e. This is done using: ^[a-e] or ^[abcde]. Now I want to look at two new things (one of which looks really similar to the previous classes but does things very differently.: […]

    Like

Leave a reply to Negating Character Classes and Class Shortcuts in SQL Server – Drsql's Blog Cancel reply

I’m Louis

I have been at this database thing for a long long time, with no plans to stop.

This is my blog site, companion to drsql.org

Recents