On LinkedIn a few days ago, I ran across an interesting post about what was purported to be an interview question. The gist was: “say you have a set that looks like this:

OrderId Item Quantity
------- ---- ---------
O1      A1   5
O2      A2   1
O3      A3   3

and you need to expand it to one row per unit of Quantity, so it should look like this:”

OrderId Item Quantity
------- ---- -----------
O1      A1   1
O1      A1   1
O1      A1   1
O1      A1   1
O1      A1   1
O2      A2   1
O3      A3   1
O3      A3   1
O3      A3   1
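
To make the target concrete, here is a minimal sketch that builds the three example rows inline and expands them with GENERATE_SERIES (SQL Server 2022 or later); it is just a preview of one of the approaches compared below:

--expand the literal example; each source row produces Quantity copies of itself
SELECT Orders.OrderId, Orders.Item, 1 AS Quantity
FROM   (VALUES ('O1', 'A1', 5),
               ('O2', 'A2', 1),
               ('O3', 'A3', 3)) AS Orders (OrderId, Item, Quantity)
        CROSS APPLY GENERATE_SERIES(1, Orders.Quantity) AS NumbersTable
ORDER BY Orders.OrderId;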

Why keep reading?

The problem is (hopefully) completely devoid of any real-world value, but solving it turned out to be kind of fun and a good skills test. And while you may never do exactly this sort of thing, you may well face a problem where the shape of the data changes which solution you should reach for. Maybe this becomes useful for your journey too!

The reason for this sort of exercise is to learn things, either by trying it yourself or by reading on; which works better totally depends on your personality/learning style. Mine is to take a problem I am trying to solve and just try it. When I was called out for positing what was said to be a weak solution (for rather large sets of data, at least), I decided to test it out. Then I figured I could test the other suggested algorithms too. What you learn doing this kind of thing is pretty wild. It is useful not because I am ever going to ungroup data (which is basically a strange case of denormalizing data), but because I do have to solve problems like this sometimes, and the hours I spent on this will stick with me next time.

The setup

So let’s see! I will create a table in the SQL Server temporary database (tempdb), on my SQL Server 2025 CTP2.1 test server with this structure:

CREATE TABLE dbo.ExpandTest
(   OrderId char(10) NOT NULL PRIMARY KEY,
    Item    char(10) NOT NULL,
    Quantity     int NOT NULL
);

All of the code will work on SQL Server 2022 and later, since that is when GENERATE_SERIES was added; the examples that do not use GENERATE_SERIES will work on earlier versions of SQL Server as well.

In the Appendix, there is code to create this structure, plus a stored procedure to load and reload the table with varying numbers of rows and Quantity values, in a few different and, most importantly, repeatable shapes. To create a table where the Quantity value is always the same, use the ‘Equal’ fill style:

EXEC dbo.ExpandTest$Load @NumberOfRows = 4, @QuantityValue = 4, 
    @QuantityFillStyle = 'Equal', @ShowRowsCreatedFlag = 1;

The output is the following, showing I made a table with 4 rows and a Quantity of 4 for each. The default fill style makes all Quantity values the same; the @QuantityFillStyle parameter can generate a few different shapes of data, because some algorithms will do better or worse depending on the Quantity column values.

OrderId    Item       Quantity
---------- ---------- -----------
0000000001 00000000A1 4
0000000002 00000000A2 4
0000000003 00000000A3 4
0000000004 00000000A4 4

To make a more variable set of data, I created a ‘Decreasing’ fill style that simply divides @QuantityValue by the row number. In most cases I will use a larger @QuantityValue so that even the later, smaller rows expand to more than one row:

EXEC dbo.ExpandTest$Load @NumberOfRows = 4, @QuantityValue = 4,
    @QuantityFillStyle = 'Decreasing', @ShowRowsCreatedFlag = 1;
OrderId    Item       Quantity
---------- ---------- -----------
0000000001 00000000A1 4
0000000002 00000000A2 2
0000000003 00000000A3 1
0000000004 00000000A4 1

Then a final ‘Half’ style, where half of the rows get a Quantity of @QuantityValue and the other half get 1.

EXEC dbo.ExpandTest$Load @NumberOfRows = 4, @QuantityValue = 4, 
   @QuantityFillStyle = 'Half', @ShowRowsCreatedFlag = 1;
OrderId    Item       Quantity
---------- ---------- -----------
0000000001 00000000A1 4
0000000002 00000000A2 4
0000000003 00000000A3 1
0000000004 00000000A4 1

These different shapes should let us see how the size of Quantity, especially smaller Quantity values mixed in with larger ones, affects each algorithm. In the following sections I will try several different algorithms to see how well they work.
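
Before running any of the expansion queries against a freshly loaded shape, it helps to know how many rows the expansion should produce. A quick sanity check (my own aside, not part of the test rig) is to sum the Quantity column:

--source row count and the number of rows a correct expansion should produce
SELECT COUNT(*) AS SourceRows,
       SUM(CAST(Quantity AS bigint)) AS ExpectedOutputRows
FROM   dbo.ExpandTest;

For the 10000-row ‘Equal’ load used later, this returns 10,000 and 100,000,000.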

INNER JOIN to GENERATE_SERIES

My initial solution was to use a simple join to a dynamically generated numbers table (using GENERATE_SERIES). So I created the following data:

EXEC dbo.ExpandTest$Load @NumberOfRows = 3, @QuantityValue = 3, 
   @QuantityFillStyle = 'Equal', @ShowRowsCreatedFlag = 1;

The output of this is:

OrderId    Item       Quantity
---------- ---------- -----------
0000000001 00000000A1 3
0000000002 00000000A2 3
0000000003 00000000A3 3

Then I execute the following code that just joins to GENERATE_SERIES, using a hard-coded max of 10:

SELECT OrderId, Item, 1 AS Quantity
FROM   dbo.ExpandTest as SP
        INNER JOIN GENERATE_SERIES(1,10) AS NumbersTable
          ON NumbersTable.Value BETWEEN 1 AND SP.Quantity;

The output of this query is these 9 rows:

OrderId    Item       Quantity
---------- ---------- -----------
0000000001 00000000A1 1
0000000001 00000000A1 1
0000000001 00000000A1 1
0000000002 00000000A2 1
0000000002 00000000A2 1
0000000002 00000000A2 1
0000000003 00000000A3 1
0000000003 00000000A3 1
0000000003 00000000A3 1

Easy enough. But one of the comments was that this would fall apart with very large values of Quantity. The comment actually said “Quantity in the billions”, which seems pretty crazy for my rather minimal computer, but I decided I would ramp this up and give it a real test. First, though, we need to do one more thing… testing.
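
One more note on the hard-coded upper bound: if any Quantity exceeded 10, the BETWEEN predicate would silently cap that order at 10 rows instead of failing. A defensive sketch (not something I used in the timed tests) could check for that up front:

--fail loudly if any row needs more numbers than the hard-coded series provides
IF EXISTS (SELECT * FROM dbo.ExpandTest WHERE Quantity > 10)
    THROW 50000, 'Quantity exceeds the hard-coded series limit of 10', 1;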

Are we sure this is correct? AKA Testing

We need to test the output of the algorithms, or we might be creating too many or too few rows. In this particular case, since we are ungrouping rows, the test is to regroup the items, join them back to the source rows, and see if there are any differences. So I can reuse this query later in the article, I am going to send the rows to a temp table named #CaptureOutput:

DROP TABLE IF EXISTS #CaptureOutput;

SELECT OrderId, Item, 1 AS Quantity
INTO   #CaptureOutput
FROM   dbo.ExpandTest as SP
        INNER JOIN GENERATE_SERIES(1,10) AS NumbersTable
          ON NumbersTable.Value BETWEEN 1 AND SP.Quantity;

--Use this to test your outputs
WITH Regroup AS (
SELECT OrderId, Item, COUNT(*) as Quantity
FROM   #CaptureOutput
GROUP BY OrderId, Item
)
SELECT TOP 1000 * --limited just in case there are a LOT of rows.
FROM   dbo.ExpandTest as et
         FULL OUTER JOIN Regroup as rg
            ON rg.OrderId = et.OrderId
               AND rg.Item = et.Item
WHERE et.Quantity IS DISTINCT FROM rg.Quantity;

With the data generated from the original call, this returns 0 rows, which means there were no rows where the prescribed Quantity value differed from the regrouped count. To test the test, delete a few rows from #CaptureOutput:

DELETE TOP (2)
FROM  #CaptureOutput;

Now reexecute the test query and you will get at least one row returned. I got:

OrderId    Item       Quantity    OrderId    Item       Quantity
---------- ---------- ----------- ---------- ---------- -----------
0000000001 00000000A1 3           0000000001 00000000A1 1

Looking at the rows, you will see that the aggregate Quantity values are mismatched by a total of 2 across all of the rows. (The DELETE order isn’t deterministic, so it could be 1 row with a difference of 2, or 2 rows each with a difference of 1.)
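
If you would rather see a single number than the list of mismatched rows, a small variation on the test query (a sketch using the same Regroup logic) sums the differences; run right after the DELETE above, it should return exactly 2:

WITH Regroup AS (
SELECT OrderId, Item, COUNT(*) AS Quantity
FROM   #CaptureOutput
GROUP BY OrderId, Item
)
SELECT SUM(ABS(COALESCE(et.Quantity, 0) - COALESCE(rg.Quantity, 0))) AS TotalMismatch
FROM   dbo.ExpandTest AS et
         FULL OUTER JOIN Regroup AS rg
            ON rg.OrderId = et.OrderId
               AND rg.Item = et.Item;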

For my performance tests, I am going to start with a set of 10000 rows with a Quantity of 10000 for each row, then show each of the algorithms. I am starting with equal Quantity values to show the maximum processing cost; then I will run tests with different sizes and shapes of data to see how smaller Quantity values affect the performance.

EXEC dbo.ExpandTest$Load @NumberOfRows = 10000, @QuantityValue = 10000,
   @QuantityFillStyle = 'Equal', @ShowRowsCreatedFlag = 1;

The output of the stored procedure will be 10,000 rows. With this configuration, the output of each of our queries will be 10,000 * 10,000 rows, or 100,000,000 rows (not a tiny set at all, but it certainly could be larger if I had a particular use case in mind that needed it). Using the format I started with, but limiting the GENERATE_SERIES to the maximum Quantity in the table (in this case 10000):

DECLARE @MaxItems int = (SELECT MAX(Quantity) from dbo.ExpandTest);

SELECT OrderId, Item, 1 AS Quantity
FROM   dbo.ExpandTest as ET
        INNER JOIN GENERATE_SERIES(1,@MaxItems) AS NumbersTable
           ON NumbersTable.Value BETWEEN 1 AND ET.Quantity;

Within 9 seconds I had output 1.5 million of the 100 million rows it would return. That is a lot of data to display, so to take that out of the equation, I will output the rows to a temp table named #CaptureOutput (which is the same table the test query reads, so you can run the test query to check your results). The data is still created, but we don’t have to wait for it to spool to a results window. I also include the number of seconds that it took to execute.

DECLARE @StartTime datetime2(0) = SYSDATETIME();

DECLARE @MaxItems int = (SELECT MAX(Quantity) from dbo.ExpandTest)

DROP TABLE IF EXISTS #CaptureOutput;

SELECT OrderId, Item, 1 AS Quantity
INTO   #CaptureOutput
FROM   dbo.ExpandTest as ET
         INNER JOIN GENERATE_SERIES(1,@MaxItems) AS NumbersTable
           ON NumbersTable.Value BETWEEN 1 AND ET.Quantity;

SELECT DATEDIFF(Second,@StartTime, sysdatetime());

This batch took around 35 seconds over a few executions.

So we created 100 million rows of output in around 35 seconds. I will ramp that up again later in the blog, but this is a LOT of rows to create, so I am plenty impressed by this algorithm, even if it is kind of clunky.

The other two tests are running the same code, after I execute the following code:

EXEC dbo.ExpandTest$Load @NumberOfRows = 15000, @QuantityValue = 10000, 
   @QuantityFillStyle = 'Half', @ShowRowsCreatedFlag = 1;

35 seconds, 75,007,500 rows

EXEC dbo.ExpandTest$Load @NumberOfRows = 10000, @QuantityValue = 30000,
    @QuantityFillStyle = 'Decreasing', @ShowRowsCreatedFlag = 1;

34 seconds, 288,925 rows

What turned out to be quite interesting is that for these larger sets of data, the time didn’t change much at all. I am not going to claim this will be the best solution, but to output 100 million rows, 35 seconds felt like a win. When the output was only 288,925 rows and it still took the same time to execute, that did not feel like a win. Clearly, needing to generate 30000 rows for each of the 10000 source rows was not optimal.
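
A rough way to quantify why (a sketch, assuming the plan really does build the full series for every source row, which the flat timings suggest) is to compare the rows that get generated and then filtered away with the rows actually needed:

--rows the max-based join generates and filters vs. rows actually required
SELECT SUM(CAST(Quantity AS bigint))            AS RowsActuallyNeeded,
       COUNT(*) * MAX(CAST(Quantity AS bigint)) AS RowsGeneratedBeforeFiltering
FROM   dbo.ExpandTest;

For the ‘Decreasing’ load that is 288,925 versus 300,000,000, which lines up with the execution time barely moving.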

CROSS APPLY

This method is really similar to the previous one, but there is a big difference: instead of joining to a fixed-size series, we use CROSS APPLY to generate only as many rows as each row’s Quantity requires. For the 10000/10000 ‘Equal’ example we still generate 10000 rows 10000 times, but for skewed data far fewer. I have moved the data-loading calls to after the query itself so the results are easier to read.

DECLARE @StartTime datetime2(0) = SYSDATETIME();

DROP TABLE IF EXISTS #CaptureOutput;

SELECT OrderId, Item, 1 AS Quantity
INTO   #CaptureOutput
FROM   dbo.ExpandTest as ET
         CROSS APPLY GENERATE_SERIES(1, ET.Quantity) AS NumbersTable

SELECT DATEDIFF(Second,@StartTime, sysdatetime());

It also skips the extra query to figure out the max Quantity value, which saves a little bit too. Executing this, you will see this version of the query was faster to start with, and was more (positively) affected by the shape of the data:

EXEC dbo.ExpandTest$Load @NumberOfRows = 10000, @QuantityValue = 10000, 
    @QuantityFillStyle = 'Equal', @ShowRowsCreatedFlag = 1;

25 seconds, 100,000,000 rows

EXEC dbo.ExpandTest$Load @NumberOfRows = 15000, @QuantityValue = 10000,
    @QuantityFillStyle = 'Half', @ShowRowsCreatedFlag = 1;

18 seconds, 75,007,500 rows

EXEC dbo.ExpandTest$Load @NumberOfRows = 10000, @QuantityValue = 30000,
    @QuantityFillStyle = 'Decreasing', @ShowRowsCreatedFlag = 1;

< 1 second, 288,925 rows

Even though the first order had 30000 rows to generate, the fact that only around 30 orders had more than 1000 rows to generate had a profound effect on the execution time. This is clearly a better algorithm.

Permanent Numbers Table, INNER JOIN

In this example, I will use the same “table of numbers” trick, but this time I will employ a physical table to do the work:

CREATE TABLE dbo.Number
(
   Value int not null PRIMARY KEY
);

I will add 10 million numbers; that is far more than any Quantity I will actually attempt (for the sake of my own sanity and electric bill), but it seems large enough for this demo.

INSERT INTO dbo.Number
SELECT Value FROM GENERATE_SERIES(1,10000000);

I am not counting the time to create the Number table as part of the timings (at this size it was quite quick in any case). This would be a table you keep around permanently for these kinds of operations. I made it much, much larger than I will attempt, to keep it somewhat true to life. However, if you only occasionally work with sets of data this large, you could even generate your numbers into a temporary table, if that turns out to be that much faster.
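
For completeness, here is a minimal sketch of that temporary table variant (the #Number name and the one million row cap are just assumptions; size it to the largest Quantity you expect). The timed queries below all use the permanent dbo.Number table:

DROP TABLE IF EXISTS #Number;

CREATE TABLE #Number
(
   Value int NOT NULL PRIMARY KEY
);

--filled the same way as the permanent table, just smaller
INSERT INTO #Number (Value)
SELECT Value FROM GENERATE_SERIES(1, 1000000);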

DECLARE @StartTime datetime2(0) = SYSDATETIME();

DROP TABLE IF EXISTS #CaptureOutput;

SELECT OrderId, Item, 1 AS Quantity
INTO #CaptureOutput
FROM dbo.ExpandTest as ET
      INNER JOIN dbo.Number AS NumbersTable
         ON NumbersTable.Value BETWEEN 1 AND ET.Quantity;

SELECT DATEDIFF(Second,@StartTime, sysdatetime());

Tests:

EXEC dbo.ExpandTest$Load @NumberOfRows = 10000, @QuantityValue = 10000,
   @QuantityFillStyle = 'Equal', @ShowRowsCreatedFlag = 1;

9 seconds, 100,000,000 rows

EXEC dbo.ExpandTest$Load @NumberOfRows = 15000, @QuantityValue = 10000,
   @QuantityFillStyle = 'Half', @ShowRowsCreatedFlag = 1;

8 seconds, 75,007,500 rows

EXEC dbo.ExpandTest$Load @NumberOfRows = 10000, @QuantityValue = 30000, 
   @QuantityFillStyle = 'Decreasing', @ShowRowsCreatedFlag = 1;

< 1 second, 288,925 rows

Wow. This was far faster than any other method. Not having to generate the rows 10000 times was clearly noticed by the process.

Permanent Numbers Table, CROSS APPLY

Unlike with GENERATE_SERIES, where CROSS APPLY made a big difference, in this case there really isn’t much difference between this algorithm and the INNER JOIN.

DECLARE @StartTime datetime2(0) = SYSDATETIME();

DROP TABLE IF EXISTS #CaptureOutput;

SELECT OrderId, Item, 1 AS Quantity
INTO #CaptureOutput
FROM dbo.ExpandTest as ET
       CROSS APPLY (SELECT value 
                    FROM   dbo.Number 
                    WHERE  value <= ET.Quantity) AS NumbersTable

SELECT DATEDIFF(Second,@StartTime, SYSDATETIME());

Testing this in the same way as the other algorithms:

EXEC dbo.ExpandTest$Load @NumberOfRows = 10000, @QuantityValue = 10000,
    @QuantityFillStyle = 'Equal', @ShowRowsCreatedFlag = 1;

9 seconds, 100,000,000 rows

EXEC dbo.ExpandTest$Load @NumberOfRows = 15000, @QuantityValue = 10000,
    @QuantityFillStyle = 'Half', @ShowRowsCreatedFlag = 1;

8 seconds, 75,007,500 rows

EXEC dbo.ExpandTest$Load @NumberOfRows = 10000, @QuantityValue = 30000, 
     @QuantityFillStyle = 'Decreasing', @ShowRowsCreatedFlag = 1;

< 1 second, 288,925 rows

Just as expected, not a lot of difference between these two methods.

Recursive CTE

There were a couple of suggestions that seemed kind of out there compared to the other versions. This was the first, using a recursive CTE. I had my doubts about this version, and they were partially borne out.

DECLARE @StartTime datetime2(0) = SYSDATETIME();

DROP TABLE IF EXISTS #CaptureOutput;

WITH CTE AS (
SELECT OrderId, Item, Quantity, 1 AS Iteration
FROM   dbo.ExpandTest AS ET
UNION ALL
SELECT OrderId, Item, Quantity, Iteration + 1
FROM   CTE
WHERE  Iteration < Quantity
)
SELECT OrderId, Item, 1 AS Quantity, Iteration
INTO   #CaptureOutput
FROM   CTE
OPTION (MAXRECURSION 30000); --this tops out at 32767,
--so that would be the max value for Quantity

SELECT DATEDIFF(Second,@StartTime, sysdatetime());

Executing this code was never going to be that fast. And the results show that.

EXEC dbo.ExpandTest$Load @NumberOfRows = 10000, @QuantityValue = 10000, 
   @QuantityFillStyle = 'Equal', @ShowRowsCreatedFlag = 1;

635 seconds, 100,000,000 rows

EXEC dbo.ExpandTest$Load @NumberOfRows = 15000, @QuantityValue = 10000, 
   @QuantityFillStyle = 'Half', @ShowRowsCreatedFlag = 1;

493 seconds, 75,007,500 rows

EXEC dbo.ExpandTest$Load @NumberOfRows = 10000, @QuantityValue = 30000, 
   @QuantityFillStyle = 'Decreasing', @ShowRowsCreatedFlag = 1;

1 second, 288,925 rows

Except for that last case. It did have to do 30000 recursions, but the number of rows it was recursing over got smaller and smaller with each iteration. In the first case, it had to recurse 10000 times over 10000 rows; in the second, 10000 times over 7500 rows. Both were brutal to wait for. This wouldn’t be my choice, no matter how much I love a good recursive CTE (having been a fan of graph data structures in a relational database); the numbers table solution is the one that came to me immediately, though initially not using CROSS APPLY.
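
One way to see why the ‘Decreasing’ shape is so much cheaper for the recursive approach (a sketch I did not use in the timed runs) is to count how many source rows are still recursing at a few iteration depths:

--rows still "live" at a handful of recursion depths for the currently loaded shape
SELECT Depth.Iteration,
       (SELECT COUNT(*)
        FROM   dbo.ExpandTest
        WHERE  Quantity >= Depth.Iteration) AS RowsStillRecursing
FROM   (VALUES (1), (10), (100), (1000), (10000)) AS Depth (Iteration);

With the ‘Equal’ load every row survives to the final iteration; with ‘Decreasing’ only a few dozen rows survive past the first thousand.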

Cursor

The last method I tried only because someone mentioned it. They might be an Oracle programmer, since I hear cursors are used more freely over there. I looped through the 10000 rows with a cursor, and then just iterated over each Quantity value:

DROP TABLE IF EXISTS #CaptureOutput; --need an intermediate
CREATE TABLE #CaptureOutput          --table for this algorithm
(   OrderId char(10) NOT NULL,
    Item    char(10) NOT NULL,
    Quantity     int NOT NULL
);

SET NOCOUNT ON;
DECLARE @OuterCursor CURSOR,
        @OrderId char(10),
        @Item char(10),
        @Quantity int,
        @I int = 0,
        @StartTime datetime2(0) = SYSDATETIME();

SET @OuterCursor = CURSOR FOR SELECT OrderId, Item, Quantity
                              FROM   dbo.ExpandTest;
OPEN @OuterCursor;
WHILE 1=1
 BEGIN
    SET @I = 0;
    FETCH NEXT FROM @OuterCursor INTO @OrderId, @Item, @Quantity;
    IF @@FETCH_STATUS <> 0
       BREAK;
    WHILE @I < @Quantity
      BEGIN
        INSERT INTO #CaptureOutput (OrderId, Item, Quantity)
        VALUES (@OrderId, @Item, 1);
        SET @I = @I + 1;
      END;
 END;

SELECT DATEDIFF(Second,@StartTime, sysdatetime());

A lot of code, but that isn’t always bad, right? Sometimes more code can make a process faster… but not in this case.

EXEC dbo.ExpandTest$Load @NumberOfRows = 10000, @QuantityValue = 10000, 
  @QuantityFillStyle = 'Equal', @ShowRowsCreatedFlag = 1;

1215 seconds, 100,000,000 rows

EXEC dbo.ExpandTest$Load @NumberOfRows = 15000, @QuantityValue = 10000,
   @QuantityFillStyle = 'Half', @ShowRowsCreatedFlag = 1;

916 seconds, 75,007,500 rows

EXEC dbo.ExpandTest$Load @NumberOfRows = 10000, @QuantityValue = 30000, 
   @QuantityFillStyle = 'Decreasing', @ShowRowsCreatedFlag = 1;

3 seconds, 288,925 rows

Just like with the recursion example, I was semi-shocked by how well it handled the ‘Decreasing’ set. It still wouldn’t be my choice!

Summary

What have we learned? Well, hopefully that if you need a row per item, you store the data that way to start with. But really, what made it interesting to me was executing the queries and then thinking through what was actually happening. There were a few surprises, but most came down to the fact that while looping code isn’t the best choice, its cost scales more directly with the number of rows actually produced than an approach that joins to a fixed-size series. Recursion and looping through the rows with a cursor both took a lot of time when the output row count was really high, but on the low end of rows actually being produced, they weren’t that bad.

Still, like one commenter noted, the CROSS APPLY solution is probably going to be your best bet.

Appendix – Variable test rig

USE TempDb;
GO
DROP TABLE IF EXISTS dbo.ExpandTest;
DROP PROCEDURE IF EXISTS dbo.ExpandTest$Load

CREATE TABLE dbo.ExpandTest
(   OrderId char(10) NOT NULL PRIMARY KEY,
    Item    char(10) NOT NULL,
    Quantity     int NOT NULL
);
GO

CREATE OR ALTER PROCEDURE dbo.ExpandTest$Load
(
    @NumberOfRows        int,
    @QuantityValue       int,
    @QuantityFillStyle   char(10) = 'Equal',
    @ShowRowsCreatedFlag bit = 0
)
AS
BEGIN
    SET NOCOUNT ON;
    TRUNCATE TABLE dbo.ExpandTest;

    IF @QuantityFillStyle = 'Equal'

        INSERT INTO dbo.ExpandTest(OrderId, Item, Quantity)
        SELECT RIGHT(CONCAT('0000000000', value), 10),
               RIGHT(CONCAT('0000000000A', value), 10),
               @QuantityValue
        FROM   GENERATE_SERIES(1, @NumberOfRows) AS RowsToCreate

    ELSE IF @QuantityFillStyle = 'Half'

        INSERT INTO dbo.ExpandTest(OrderId, Item, Quantity)
        SELECT RIGHT(CONCAT('0000000000', value), 10),
               RIGHT(CONCAT('0000000000A', value), 10),
               @QuantityValue
        FROM   GENERATE_SERIES(1, @NumberOfRows/2) AS RowsToCreate
        UNION ALL
        SELECT RIGHT(CONCAT('0000000000', value), 10),
               RIGHT(CONCAT('0000000000A', value), 10),
               1
        FROM   GENERATE_SERIES(@NumberOfRows/2 + 1, @NumberOfRows) AS RowsToCreate

    ELSE IF @QuantityFillStyle = 'Decreasing'

        INSERT INTO dbo.ExpandTest(OrderId, Item, Quantity)
        SELECT RIGHT(CONCAT('0000000000', value), 10),
               RIGHT(CONCAT('0000000000A', value), 10),
               @QuantityValue / value
        FROM   GENERATE_SERIES(1, @NumberOfRows) AS RowsToCreate

    IF @ShowRowsCreatedFlag = 1
        SELECT *
        FROM   dbo.ExpandTest;

END;
GO

Some examples to help you see what the output would be like:

EXEC dbo.ExpandTest$Load @NumberOfRows = 10, @QuantityValue = 10, 
              @QuantityFillStyle = 'Equal', @ShowRowsCreatedFlag = 1;
OrderId    Item       Quantity
---------- ---------- -----------
0000000001 00000000A1 10
0000000002 00000000A2 10
0000000003 00000000A3 10
0000000004 00000000A4 10
0000000005 00000000A5 10
0000000006 00000000A6 10
0000000007 00000000A7 10
0000000008 00000000A8 10
0000000009 00000000A9 10
0000000010 0000000A10 10

EXEC dbo.ExpandTest$Load @NumberOfRows = 10, @QuantityValue = 40, 
          @QuantityFillStyle = 'Half', @ShowRowsCreatedFlag = 1;
OrderId    Item       Quantity
---------- ---------- -----------
0000000001 00000000A1 40
0000000002 00000000A2 40
0000000003 00000000A3 40
0000000004 00000000A4 40
0000000005 00000000A5 40
0000000006 00000000A6 1
0000000007 00000000A7 1
0000000008 00000000A8 1
0000000009 00000000A9 1
0000000010 0000000A10 1

EXEC dbo.ExpandTest$Load @NumberOfRows = 10, @QuantityValue = 40, 
    @QuantityFillStyle = 'Decreasing', @ShowRowsCreatedFlag = 1;
OrderId    Item       Quantity
---------- ---------- -----------
0000000001 00000000A1 40
0000000002 00000000A2 20
0000000003 00000000A3 13
0000000004 00000000A4 10
0000000005 00000000A5 8
0000000006 00000000A6 6
0000000007 00000000A7 5
0000000008 00000000A8 5
0000000009 00000000A9 4
0000000010 0000000A10 4
