Part 9 of Learn RegEx With Louis
Continuing on with the REGEXP_ functions series, the next one I want to cover is the table valued function REGEXP_SPLIT_TO_TABLE. This function is definitely one of the ones you probably ought to know, especially if you are ever tasked to pull some data out of a data structure.
This function is a lot like the STRING_SPLIT function, and unlike things like the REGEXP_LIKE function, you can basically use the same main parameters as you used in STRING_SPLIT for simple cases, but from there the possibilities are a lot more endless because you can define almost any delimiters you want. It isn’t perfect, because of a few things, but we will discuss that more later on.
A bit of background with STRING_SPLIT
In the STRING_SPLIT function, you have 3 parameters, a value to split, a one-character delimiter, and an optional parameter to add an ordinal value for sorting the output. For example:
SELECT *
FROM STRING_SPLIT('a,b,c,d,e',',');
Returns:
value --------- a b c d e
And the following:
SELECT *
FROM STRING_SPLIT('a,b,c,d,e','c',1);
Which you can see that everything before c was put into the first item, and after that into a second row. And you now have an ordinal column you can sort on (note that it is specifically named that to tell you that it is safe to order by!).
value ordinal --------- -------------------- a,b, 1 ,d,e 2
A big limitation to STRING_SPLIT is that you are limited to one character for the delimeter. You can’t split on c, or even carriage return/linefeed (you could replace the carriage return or linefeeds like we did back in the multi-line blog entry of this series):
SELECT *
FROM STRING_SPLIT('a,b,c,d,e','c,');
Note that this throws an interesting error, interesting because notice the column header was output:
value --------- Msg 214, Level 16, State 11, Line 19 Procedure expects parameter 'separator' of type 'nchar(1)/nvarchar(1)'.
This is indicative of a runtime error in the function, not a syntax error as you might expect with an input that is too large. This is likely because you can pass the delimiter in as a variable, and it doesn’t restrict the input to a 1-character datatype. So, this is where STRING_SPLIT hits it’s hard limit. And I have only rarely used this function for much because of this, certainly never with any complex input.
REGEXP_SPLIT_TO_TABLE
Enter the more versitile REGEXP_SPLIT_TO_TABLE. Using REGEXP_SPLIT_TO_TABLE, you can parse through a string or file, on any delimiter you want, including any Regular Expression that we have discussed so far in this series.
Note that this function works basiclly as the opposite of
REGEXP_MATCHES, in that that function gives you the characters that match your expression, andREGEXP_SPLIT_TO_TABLEis going to give you what doesn’t match. It can be advantageous to use both of these functions as you are working out issues in one or the other.
Take our previous example:
SELECT *
FROM REGEXP_SPLIT_TO_TABLE('a,b,c,d,e',',');
This returns the same output as you previously got with STRING_SPLIT, with the ordinal column output no longer optional. That is almost always going to be useful anyhow, especially when debugging your output!
value ordinal --------- -------------------- a 1 b 2 c 3 d 4 e 5
Change the statement to use REGEXP_MATCHES and you will see the delimiter repeated 4 times. This clearly isn’t necessary here, but if you start splitting on complex delimiters, certainly could be useful.
For example, using our same string:
SELECT match_id, match_value
--not shown: start_position, end_position,
--substring_matches
FROM REGEXP_MATCHES('a,b,c,d,e',',');
You get back just 4 commas (and you can get the positions to check out):
match_id match_value -------------------- ----------- 1 , 2 , 3 , 4 ,
Empty delimiters
When splitting to a table, when you have delimiters to with no space between, an empty string will be returned. For example, still using letters a-e, but adding a few more commas:
SELECT *,LEN(Value) AS LENValue
FROM REGEXP_SPLIT_TO_TABLE('a,b,,,c,d,,e',',');
Note that we have 8 items, but 3 have a 0 length output:
value ordinal LENValue
------------ -------------------- -----------
a 1 1
b 2 1
3 0
4 0
c 5 1
d 6 1
7 0
e 8 1
Depending on what the values are, you can either filter out the empty strings, pretty eaisly (or just change them to a NULL with something like CASE expression on LEN(match_value). This will occur in less obvious ways at times as your expressions get more complex.
Expanded Delimiters
So far, things aren’t much different than STRING_SPLIT in my examples. Now let’s go a lot more complex with our string splitting. For example, multiple delimiters in a single expression, in this case comma (,) pipe (|), dollar sign ($) and exclamation point (!) could all be included as delimiters:
SELECT *
FROM REGEXP_SPLIT_TO_TABLE('a,b|c$d!e','[,$!|]');
This gives us the five values just like before.
Note: In no way am I suggesting that you should ever allow multiple delimiters in a file. But you don’t always have a choice when data is coming in. The more tools you have in your toolbox, even if you never actually need to use them, the better.
Next we can break each line of the string into it’s own row by making the delimeter Carriage Return + Line Feed: CHAR(13)+CHAR(10) or as it will show up in our RegEx expressions: \r\n:
SELECT * FROM REGEXP_SPLIT_TO_TABLE( 'a b c d e','\r\n');
This returns the following output, looking just like our previous results.
value ordinal ------------- -------------------- a 1 b 2 c 3 d 4 e 5
Note, we did not need to use a regular expression flag for this, since we referenced the linefeed as a delimiter. The
's'filter would let the.reference the linefeed, but it doesn’t matter when it is explicitly mentioned.
Working with oddly formatted data, you might have to get more complex with your delimiters. Like how about carriage return line feed OR a comma: '\r\n|,'
Note: the or (|) operator is fast becoming my favorite part of a regular expression!
SELECT * FROM REGEXP_SPLIT_TO_TABLE( 'a,b c,d e','\r\n|,');
Same output as previous because we handled both the endline, and the comma, so this shows that the delimiter you create with a regular expression using the | (or) doesn’t even need to be of the same length!
Now, lets go with , or crlf, or ,crlf: '\r\n|\r\n|,'
SELECT * FROM REGEXP_SPLIT_TO_TABLE( 'a,b, c,d, e',',\r\n|\r\n|,');
Once again, same output.
Note: Order matters in OR expressions. In my first attempt I had just added
.\r\nto the end of the string, but it found the , and used it first. (This was not the desired effect!). So if you have more complex filters like ‘,\r\n’ that use similar, simpler expressions as others like ‘,’, the more complex should go first.
There are all sorts of cool stuff you can do, like splitting on any whitespace:
SELECT * FROM REGEXP_SPLIT_TO_TABLE( 'a b c d e','\s');
Warning: Be careful, because while whitespace
\sincludes\rand\n, it doesn’t include\r\nas a pair. So, if you want to split on lines or a single space, you need to specify that specifically:'\r\n|\s'
SELECT * FROM REGEXP_SPLIT_TO_TABLE( 'a b c d e','\r\n|\s');
\r\n goes first because \s includes these characters individually, but not as a pair sadly. So just splitting on \s means you get a few extra, empty values returned. (This may not be an issue if you are already cleaning up empty strings, but better to output what you expect than to clean up a mess.
Now let’s do multiple spaces or carriage return line feeds. As an example:
SELECT *,LEN(Value) as LENValue FROM REGEXP_SPLIT_TO_TABLE( 'a b c d e','\r\n|\s');
This returns some extra space:
value ordinal LENValue
---------------- -------------------- -----------
a 1 1
2 0
b 3 1
4 0
5 0
c 6 1
d 7 1
e 8 1
Easy enough to handle by filtering on len > 0 as discussed earlier, but it is definitely something to be aware of. If you want to just allow for this in the source data, you can change to breaking on \s+, which allows for multiple spaces or carriage return linefeeds:
SELECT *,LEN(Value) AS LENValue FROM REGEXP_SPLIT_TO_TABLE( 'a b c d e','\s+');
Now we are back to the 5 row result. The point here was simply to note that what you allow for delimiters can be adjusted. You could break on many many different things so you need to tailor the delimiter to what you need (which isn’t always easy.)
Note: Be very careful when parsing data with unknown formats. Easy enough to do ad-hoc as you are testing over and over, but things can get dicey when you are dealing with automating messy data parsing, because it isn’t completely unlikely that a poorly formatted string will have a consistently poor format.
Finally, an example that is starting to get complex to type, and that is a comma delimited file where items have double quotes around non-numeric items.
In this example need to use several delimiters:”,” or ,” or “, or , or “:
SELECT * FROM REGEXP_SPLIT_TO_TABLE( '1,"aa","bb",2,3,"cc"', '","|,"|",|,|"');
This returns:
value ordinal -------------------- -------- 1 1 aa 2 bb 3 2 4 3 5 cc 6 7
This data is parsed (and has one extra blank row when your data ends with a ". But there is one major issue when parsing data that you don’t have control over the format, because you need that " to be a delimiter, or you would end up with row 7 being ".
The Caveat
At this point, you probably want to start using this for all of your data ingestion purposes. I know I did. But as Aaron Bertand noted in Split strings by Regular Expressions in SQL Server 2025, there is one big caveat and that is that you can’t do what are called lookaheads. Meaning SQL Server’s RegEx implementation only passes through the expression one character at a time in one direction.
So you can’t have data that includes your delimiter in the data because you can’t construct an expression that says “check the next character to see if…”. Expressions are traversed character by character and evaluated as they are touched.
So while we can pull out the value of "This Value" along with a few empty rows from a string like this: '"This Value"'
SELECT * FROM REGEXP_SPLIT_TO_TABLE( '"This Value"','","|,"|",|,"|,|"');;
The output is:
value ordinal ------------ -------------------- 1 This Value 2 3
If I wanted to include double quotes in my text from a value like: 'This "crazy" value'
You can’t really do this in just a Regular Expression. Typically when importing data, this value would be delimited with some value. For example, you often double the characters up (“This “”crazy””value”), or escape them using an escape character (like in regex: (“This /”crazy/”value”).
But there is no way to include an escape value in the source string, and there is no way to have an expression that knows that 2 delimiters means to ignore when 1 delimiter means skip. So none of these will work.
SELECT * FROM REGEXP_SPLIT_TO_TABLE( '"This "" Value"','","|,"|",|,"|,|"');
This returns:
value ordinal
--------------- --------------------
1
This 2
3
Value 4
5
And trying to escape the value:
SELECT * FROM REGEXP_SPLIT_TO_TABLE( '"This /" Value"','","|,"|",|,"|,|"');
Still fails to skip the value as a delimiter:
value ordinal
--------------- --------------------
1
This / 2
Value 3
4
So it is imperative to understand that whatever delimiter you use, it may not appear in the input string, or your output will be wonky.
One last cool trick
This trick isn’t that complicated, but it is also something that is pretty cool. Using an empty string as the delimiter, you can break every value onto its line.
SELECT *, UNICODE(CAST(VALUE AS NCHAR(1))) AS UNICODEVALUE FROM REGEXP_SPLIT_TO_TABLE( 'abc''d ef','');
This returns:
value ordinal UNICODEVALUE
--------- -------------------- -----------
a 1 97
b 2 98
c 3 99
' 4 39
d 5 100
6 13
7 10
e 8 101
f 9 102
Cleaned up just a bit as the output has some characters not lined up.
Someties this technique can be really handy when trying to figure out some weird character in your data that you are trying to ingest in a tool. We used a similar technique in part 8 of this series with REGEXP_REPLACE (and I implemented it back in 2019 in this article on Simple Talk), but this technique is a lot easier than any other than I have found for finding unexpected data in a query.
SELECT CAST(value as char(1)) AS value, ordinal,
UNICODE(cast(value as nchar(1))) AS UNICODEVALUE
FROM REGEXP_SPLIT_TO_TABLE(
'Thi2s i3s cra333%$#%^zy, le@@t 423435me clean this','')
WHERE REGEXP_LIKE(value,'[^a-z0-9 ]');
Now you can see the characters that seem out of place:
value ordinal UNICODEVALUE ----- -------------------- ------------ % 17 37 $ 18 36 # 19 35 % 20 37 ^ 21 94 , 24 44 @ 28 64 @ 29 64
Of course in this case, I don’t want numbers either, but this was easier to output here in the blog! Now we can see the characters that are giving us issues in an import.
Summary
In this blog we looked ata breaking apart strings using REGEXP_SPLIT TO_TABLE (as well as STRING_SPLIT). Like always, you can uses these functions on a set in a table by using CROSS or OUTER APPLY.
While STRING_SPLIT is still fine for things like ingesting parameters from a caller, REGEXP_SPLIT_TO_TABLE (other than the decidedly terrible to type name :)) is going to be a great tool when you need to parse delimited text (as long as you can avoid any text with the delimiter in the input.
Having ingested a few text files in my life, this can be a lot harder than it sounds! THe last time I did a process like this, we ended up replacing the delimiter in the text to make 100% sure the delimiter wouldn’t be an issue.




Leave a comment