When I wrote the blog from last week (Performance test of generating a set of sequential numbers), I was mostly wanting to see how well I could do this using a recursive CTE. I wondered how terribly they might be. Turns out, they are terrible compared to other methods, but they also aren’t nearly as bad as I expected (on my my Test Machine (NUC9i7QNX)), at least.
I thought I was done, nice quick little throwaway piece, but I went a little more in depth than I planned. Then Aaron Bertrand messaged me about a post that I had forgotten (even if I did edit it :)), where he was introducing GENERATE_SERIES (GENERATE_SERIES: My new go-to to build sets). In it, he had included a method of doing this using a method that replicates digits and then uses STRING_SPLIT and ROW_NUMBER to generate more digits. He also noted that it was blistering fast.
Aaron (if you know him) is rarely wrong about SQL (at the very least).
I also realized there was one other thing I wanted to add to my tests, that being just selecting from a Numbers/Tally table that has a billion rows. This should be the fastest way to pull a set of numbers.
It became a full-on post when I made an improvement to a technique that I am actually a bit proud of.
The new/adjusted methods
I am not going cover anything other than these differences, and follow that with a summary that includes the same sort of chart I made last week, with times in milliseconds, for each of these methods and those created last time. The code for this article, along with a database create script that shows the choices I made for my database (though most inserts are into tempdb, in the script, it builds a 1 billion row table named dbo.Numbers.)
Just query from a table
The fastest method (that I am shocked I forgot to include), is just querying from a table with guaranteed sequential values. Not always possible, and if you only need short sequences of vary large/small values, probably not the most optimal, but was the fastest in my test:
CREATE TABLE dbo.Number ( value INT NOT NULL PRIMARY KEY ); --trying to get bulk inserts, took five minutes on --"my machine", but it did insert a billion rows -- with a clustered index INSERT INTO dbo.Number WITH (TABLOCKX) (value) SELECT value FROM GENERATE_SERIES(1,1000000000);
In my tests, this just gets turned into a simple query: SELECT value FROM dbo.Number WHERE value < @MaxValue. If you regularly need complex sets, especially where simple math won’t cut it, pre-calculation is never a bad way to go. It will take up space, but if you are paying for compute, this will also be the least compute.
STRING_SPLIT REPLICATE
This one is kind of brilliant, and definitely one of the fastest methods. Totally glad Aaron pointed it out.
SELECT TOP (@maxValue )
ROW_NUMBER() OVER (ORDER BY @@SPID) AS Value
FROM STRING_SPLIT(REPLICATE(CAST(',' AS VARCHAR(MAX)), @maxValue - 1), ',')
ORDER BY value;
The idea is it replicates a string of commas, then lets STRING_SPLIT split them out into the number of empty strings that there are commas (plus one, hence the @maxValue - 1) that can then be output with ROW_NUMBER.
I made one slight change to Aaron’s code and made the comma an varchar(max) so it could be used for a lot larger size of data.
Enhancing the CROSS JOIN Digits method
There has been this de facto standard way of generating a fast set of numbers that I have used for many years. It is really cool because it uses math to create a large set of numbers very quickly. But it has never been great in terms of being “generic”. So when I was using the following code to try to generate small and large sets, there was a problem. It took five seconds to output 10 rows or 1000000 rows. And when I tried the vaguely insane 100000000 rows (that is 100 million), it wouldn’t work anymore. I needed another cross join to keep going. I only wanted code that could be generic for all cases, so I included it where it could handle up to 10 million rows I was trying to generate.
;WITH digits (I) AS (--set up a set of numbers from 0-9
SELECT I
FROM (VALUES (0),(1),(2),(3),(4),(5),
(6),(7),(8),(9)) AS digits (I))
--builds a set of data from from 0 to 999999
,Integers (I) AS (
SELECT D1.I + (10*D2.I) + (100*D3.I) +
(1000*D4.I) + (10000*D5.I) + (100000*D6.I) +
(1000000*D7.I) + (10000000*D8.I)
FROM digits AS D1 CROSS JOIN digits AS D2
CROSS JOIN digits AS D3
CROSS JOIN digits AS D4
CROSS JOIN digits AS D5
CROSS JOIN digits AS D6
CROSS JOIN digits AS D7
CROSS JOIN digits AS D8 )
--insert into table
INSERT INTO #holdValues WITH (TABLOCKX) (value)
SELECT I + 1
FROM Integers
WHERE I <= @MaxValue - 1;
So, it being a snow day where they requested people not be out frivolously, I said, “What could I do to make this faster?” I knew the problem came in with those later cross joins. The typical usage is to choose how large your output would be; you could just choose the number of digits and comment out the others. I also didn’t want to do dynamic SQL, because that seemed messy.
I started thinking about how to remove those cross joins without dynamic SQL? You can see in the next listing that what I did was to eliminate them by making them derived tables that either return 10 rows with 0-9, or 1 row with 0. The @MaxValue variable does have to be put into all of those derived tables, but you need it to filter the output anyhow. So the output will always have (in my case) 10 columns, but any that aren’t needed are turned into 0.
;WITH digits (I) AS (--set up a set of numbers from 0-9
SELECT I
FROM (VALUES (0),(1),(2),(3),(4),(5),
(6),(7),(8),(9)) AS digits (I))
--builds a set of data from from 0 to 999999
,Integers AS (
SELECT D1.I +
(10*D2.I) +
(100*D3.I) +
(1000*D4.I) +
(10000*D5.I) +
(100000*D6.I) +
(1000000*D7.I) +
(10000000*D8.I) +
(100000000*D9.I) +
(1000000000*D10.I) AS I
--the >= or i=0 effectively removes a cross join from the set, making small
--numbers (the normal) much faster, while allowing a lot higher digit count.
FROM digits AS D1
CROSS JOIN (SELECT * FROM digits
WHERE @MaxValue > 10 OR i = 0) AS D2
CROSS JOIN (SELECT * FROM digits
WHERE @MaxValue >= 100 OR i = 0) AS D3
CROSS JOIN (SELECT * FROM digits
WHERE @MaxValue >= 1000 OR i = 0) AS D4
CROSS JOIN (SELECT * FROM digits
WHERE @MaxValue >= 10000 OR i = 0) AS D5
CROSS JOIN (SELECT * FROM digits
WHERE @MaxValue >= 100000 OR i = 0) AS D6
CROSS JOIN (SELECT * FROM digits
WHERE @MaxValue >= 1000000 OR i = 0) AS D7
CROSS JOIN (SELECT * FROM digits
WHERE @MaxValue >= 10000000 OR i = 0) AS D8
CROSS JOIN (SELECT * FROM digits
WHERE @MaxValue >= 100000000 OR i = 0) AS D9
CROSS JOIN (SELECT * FROM digits
WHERE @MaxValue >= 1000000000 OR i = 0) AS D10
)
--insert into table
SELECT I + 1
FROM Integers
WHERE I < @MaxValue ;
Note that I also went ahead and added a few more digit cross joins to the query, and now it could technically generate more 9.9 billion rows. (I did not try that as 100 million was my breaking point!). (I realize now, this technique probably should have been a blog of its own!)
Have I tested it for every single case? No. I did check for various numbers that weren’t 1 with zeros following, and I did add checks to every one of the cases to make sure that the set requested was the set I got. This checks to make sure the values are between 1 and @MaxValue, and that there are @MaxValue rows in the table:
IF (SELECT COUNT(*) from #holdValues) <> @MaxValue OR EXISTS(SELECT * FROM #holdValues where value > @MaxValue)
Summary
So I ran (and reran and reran) these tests again using the script you can find here on my new GitHub repo, and there were a few things I noticed, which I will discuss in the following sections.
Generating up to 100000 rows
First, at these low levels, the output varied by milliseconds, likely showing some activity on my computer. This is not a perfect test bed, and frankly, in a test like this, a few milliseconds is not a big deal.
Second, up to 100000 rows, only one method really stank. And that was trying to make a generic CROSS JOIN Digits work. At just over 5 seconds each time, it was not awesome. The others were all from great to ok.
Sure 55 milliseconds to query direct from a table is way faster than 939 for the Recursive CTE (which was the worst at that point), but if you need 100000 numbers, are you going to notice? And you can do fancier things with a CTE if you are interested (since you don’t need to just + 1 every time)
Times stated in milliseconds.
SectionName 10 Rows 100 Rows 1000 Rows 10000 Rows 100000 Rows ------------------------- --------- ----------- --------- ----------- ----------- Recursive CTE 3 5 12 100 939 While Loop 3 6 55 42 267 CROSS JOIN ROWNUMBER 169 70 62 91 456 GENERATE_SERIES 3 2 4 15 239 CROSS JOIN Digits 5525 5711 5628 5504 5498 Modified CROSS JOIN Digit 34 38 37 46 137 STRING_SPLIT REPLICATE 4 5 6 30 290 Numbers Table 4 5 5 11 55
The cool thing is that the modified CROSS JOIN Digits method was, at this point, the fastest to get to 100000. Of course, for 10000, all but that original CROSS JOIN ROWNUMBER method were < .1 seconds. None too shabby.
To 100000000, and not beyond!
Here is where things get a lot more interesting. If you need to generate a lot of sequential numbers fast, what is the fastest way to do it? Well, recursion isn’t the way. Though let’s be clear, generating 10 million rows in 92 seconds on a computer with mobile processors from like 5+ years ago… not bad.
What is slightly interesting is that the fastest method (other than the numbers table) was the original CROSS JOIN Digits method. This stands to reason because at this point, we are in the zone where you need all the rows to be generated. And my modified method has to pay the price for the derived tables and filtering.
SectionName 1000000 Rows 10000000 Rows 100000000 Rows =----------------------------- ------------ ------------- -------------- Recursive CTE 9250 92267 950370 While Loop 13629 135228 -1 CROSS JOIN ROWNUMBER 3630 36005 -1 GENERATE_SERIES 1163 12137 125171 CROSS JOIN Digits 5936 9812 -1 Modified CROSS JOIN Digits 1103 10706 111282 STRING_SPLIT REPLICATE 6384 66191 680313 Numbers Table 494 4551 31878
The -1 values are places where I couldn’t realistically finish the calculation.
I tried the While Loop method a few times at 100 million, and it kept crashing. I crashed my SQL Server one time, and a few times just the connection. CROSS JOIN ROWNUMBER failed because while this:
SELECT TOP (@maxValue) ROW_NUMBER() OVER (ORDER BY i2.object_id) as value
FROM master.sys.indexes
CROSS JOIN master.sys.indexes as i2
CROSS JOIN master.sys.indexes as i3;
was fast, this following bit:
SELECT TOP (@maxValue) ROW_NUMBER() OVER (ORDER BY i2.object_id) as value
FROM master.sys.indexes
CROSS JOIN master.sys.indexes as i2
CROSS JOIN master.sys.indexes as i3
CROSS JOIN master.sys.indexes as i4;
took FOREVER to process even a few rows. I love doing this, but at a certain point, you just have to give in.
Conclusion
Use GENERATE_SERIES in almost every case possible when you have it available. If you need it significantly faster or you need specific sequences, create a table. You might load that table with GENERATE_SERIES (as I did), Or you might want to use a recursive CTE to load it if you need some weird (to the rest of the world) series like:
WITH BaseRows AS ( SELECT 1 AS value UNION ALL SELECT value + value + 1 --not a standard need FROM BaseRows WHERE value + value + 1 BETWEEN 1 AND 10000 ) SELECT Value FROM BaseRows OPTION (MAXRECURSION 0);
This output is:
Value ---------- 1 3 7 15 31 63 127 255 511 1023 2047 4095 8191
But not for anything that GENERATE_SERIES or one of these other methods can provide.
I assume I am done with this for now, but who knows? Maybe someone else will have another method that I can try and alter for fun and education?




Leave a reply to Brent Ozar Cancel reply