T-SQL Tuesday #191 – Your Favorite String Parsing

Thank you to Steve Jones for this month’s invitation. While this time it was Steve who was late getting it out, to be fair, I didn’t realize that other than the fact that he said it was late. When I got the reminder last week I was on vacation I said “bah, maybe not this month” and moved on.

From the title of “Favorite String Parsing”, I will say 100% it is using SQL Server 2025’s addition of Regular Expressions. Previously, parsing text in SQL Server was one of my least favorite things to do. Regular expressions will make it just a bit nicer, because it has a lot more power than SUBSTRING, LEFT, RIGHT, and CHARINDEX/PATINDEX. All generally “good enough” functions for a lot of the things you need to do, but often woefully inadequate for parsing large amounts of text.

I have a vague memory of once having to parse some messy HTML using SQL, but I can’t remember the exact reason I did what I will describe, for two reasons:

  1. I have a terrible memory for exact details, while thankfully a fairly decent memory for processes\algorithms when I see actual code I have written. Returning to my previous employer has reminded me of this. I couldn’t have told you 1/10 of the things I worked on, but I remember them all as they come back up.
  2. I choose not to remember the horrible bits of code I have needed to write for fear I might admit to it and be stuck working on it again one day.

Generally speaking, though, I needed to parse HTML using native SQL Server pattern matching. This is not terribly hard if you are looking through well-organized code with no repeated tags. Also, it is rare that when you are forced to do something like parsing HTML in T-SQL, that you are working with well-organized anything.

The thing that was so hard is that the aforementioned SQL functions only work relatively easily for finding the first value in a list. (And this assumes your tag values you are searching don’t end up unescaped in the text you are parsing!)

For example:

declare @string varchar(max) = 
'<body>
    <h1>Example of Repeated Tags</h1>

    <div>
        <p>This is paragraph 1 inside a div.</p>
        <p>This is paragraph 2 inside the same div.</p>
        <p>This is paragraph 3 inside the same div.</p>
    </div>

    <h2>Repeated List Items</h2>
    <ul>
        <li>First item</li>
        <li>Second item</li>
        <li>Third item</li>
        <li>Fourth item</li>
        <li>Fifth item</li>
    </ul>
    <p>This is a final paragraph inside the same div.</p>
</body>
</html>'

SELECT CHARINDEX ('<li>',@string); --first tag
SELECT CHARINDEX ('</li>',@string); --first ending tag

SELECT SUBSTRING (@string
                 ,CHARINDEX ('<li>',@string) + 4 --starting position
                 ,CHARINDEX ('</li>',@string) -  --ending position -
                  CHARINDEX ('<li>',@string) -4 ) --starting postion
                                        --that is length, plus deal with tags

This returns 3 values. The starting point of the first list item, the end of it, and with some doing, the first item:


--------------------
291

--------------------
305

----------------------
First item

“Simple” enough, right? Now what about the second one? To find that, we need to remove the first item from the string. And do this again. and then the third, fourth, five hundredth… it is a loop.

Thank goodness I will never need to do this again… right? Right? I am not completely reassured.

One response to “T-SQL Tuesday #191 – Your Favorite String Parsing”

  1. T-SQL Tuesday #191 Round Up | Voice of the DBA Avatar

    […] Davidson wrote about parsing HTML, which is definitely a challenge in T-SQL given the nature of the language and how embedded […]

    Like

Leave a reply to T-SQL Tuesday #191 Round Up | Voice of the DBA Cancel reply

I’m Louis

I have been at this database thing for a long long time, with no plans to stop.

This is my blog site, companion to drsql.org

Recents