If you have ever done any performance tuning of queries in SQL Server, no doubt one of the first things you have heard is that your search argument datatypes need to match the columns you are querying. Nothing in this blog is going to dispute that. Again, the BEST case is that if your column is an nvarchar, your search string matches that column's datatype. But why is this? I will do my best to make this clear, especially why it doesn't always matter.
One of my blogging passions is making clear to people which things matter more than others, and this topic is one of the cases where dire warnings are tossed about: "datatypes must match or performance will suffer greatly." But this is not strictly true, and when someone declares that something must be true, finding one example where it doesn't hold means the reader just starts to doubt everything you say.
Honestly, people writing queries are quite fortunate that matching datatypes is not a strict requirement, or performance would often be in trouble. In this blog I will show you some examples where datatype precedence bites you in performance. This is not to say that using exactly equivalent types is not a very good thing, but only to explain why you don't notice it more when you are writing ad-hoc queries.
Setting up an example
To keep this simple, I am going to create two tables with different primary key datatypes. I also included a char column to make sure the tables aren't super tiny physically. The only difference between the tables is the name of the table and the datatype of the key. These are the two tables I will create, in tempdb on my test machine.
-- Create first demo table with VARCHAR primary key
CREATE TABLE dbo.DemoVarcharPK
(
    KeyColumn VARCHAR(10) PRIMARY KEY,
    PaddingColumn CHAR(100) NOT NULL DEFAULT (REPLICATE('A', 100))
);

-- Create second demo table with INTEGER primary key
CREATE TABLE dbo.DemoIntPK
(
    KeyColumn INT PRIMARY KEY,
    PaddingColumn CHAR(100) NOT NULL DEFAULT (REPLICATE('A', 100))
);
In the Appendix, I include the code to load these tables with 1 million rows each; every one of the KeyColumn values is an integer from one to a million (unpadded with spaces or zeros), just stored in different manners. I will again be using this sort of wrapper around each query to get the estimated plan and performance stats.
SET SHOWPLAN_TEXT ON;
GO
<Query I am Testing>
GO
SET SHOWPLAN_TEXT OFF;
GO
SET STATISTICS IO ON;
GO
SET STATISTICS TIME ON;
GO
<Query I am Testing>
GO
SET STATISTICS IO OFF;
GO
SET STATISTICS TIME OFF;
GO
The integer type key
Since this is the most typical datatype for a primary key, this is a good place to start. Consider the following query, where the search argument is an integer literal that matches the column's datatype.
Executing this query:
SELECT *
FROM   DemoIntPK
WHERE  KeyColumn = 50000;
I get back one row for the 50000 value, and this plan:
|--Clustered Index Seek(OBJECT:([tempdb].[dbo].[DemoIntPK].[PK__DemoIntP__BB2D052282D624B8]), SEEK:([tempdb].[dbo].[DemoIntPK].[KeyColumn]=[@1]) ORDERED FORWARD)
No data conversions needed; it probed the index, which, as you can see in the stats, is probably a three-level-deep B-Tree structure (which makes sense for a compact value like an int).
The execution time was less than a millisecond, so I won't include it in my output unless it shows something other than 0 in the future.
SQL Server Execution Times: CPU time = 0 ms, elapsed time = 0 ms.
Table 'DemoIntPK'. Scan count 0, logical reads 3
So, we have established that this works rather nicely; integer compared to integer is great. But what about other datatypes? Will a string value now cause truly horrible performance (in comparison only; a million rows is not a lot to process)? Let's check it and see.
SELECT *
FROM   DemoIntPK
WHERE  KeyColumn = '50000';
Um? That seemed to perform just fine. In the plan there is a slight difference:
|--Clustered Index Seek(OBJECT:([tempdb].[dbo].[DemoIntPK].[PK__DemoIntP__BB2D052282D624B8]), SEEK:([tempdb].[dbo].[DemoIntPK].[KeyColumn]=CONVERT_IMPLICIT(int,[@1],0)) ORDERED FORWARD)
Now, in the plan we have this new implicit conversion of the input value to an integer. But shouldn't that have hurt performance? That literal is, in fact, a varchar(5), as we can see here:
SELECT *
FROM   dbo.ExamineExpression('''50000''');
This returns:
Datatype     Nullability Expression
------------ ----------- -------------
varchar(5)   NOT NULL    '50000'
But no, it is still just 3 reads, no scans, and it finished so fast that it doesn't really register. Let's try a few more. What if we explicitly cast the string to a varchar(max)?
SELECT *
FROM   DemoIntPK
WHERE  KeyColumn = CAST('50000' AS varchar(max));
Literally the only difference between this plan and the previous one is the cast to varchar(max), which is then implicitly cast to an integer.
|--Clustered Index Seek(OBJECT:([tempdb].[dbo].[DemoIntPK].[PK__DemoIntP__BB2D052282D624B8]), SEEK:([tempdb].[dbo].[DemoIntPK].[KeyColumn]= CONVERT_IMPLICIT(int,CONVERT(varchar(max),[@1],0),0)) ORDERED FORWARD)
But that was still a number stored as text. What if we compare the column to just any string?
SELECT *
FROM   DemoIntPK
WHERE  KeyColumn = 'Fifty Thousand';
This causes the following error:
Msg 245, Level 16, State 1, Line 44
Conversion failed when converting the varchar value 'Fifty Thousand' to data type int.
This should be the clue about what is happening. If you read the post I linked to in my last blog, SQL Server Datatype Precedence List, you will see that this is basically a matter of datatype precedence. The character datatypes are almost the lowest on the chart (only the binary types are lower), so when you compare a value to something higher on the list, the lower value will usually be implicitly converted. Hence a varchar character string like 'Fifty Thousand' will be implicitly converted to an integer when you compare it to, or add it to, an integer.
For example:
SELECT *
FROM   dbo.ExamineExpression('''50000'' + 1');
This returns:
Datatype     Nullability Expression
------------ ----------- -------------
int          NULL        '50000' + 1
And so is:
SELECT *
FROM   dbo.ExamineExpression('''Fifty Thousand'' + 1');
This returns:
Datatype    Nullability Expression
----------- ----------- -----------------------
int         NULL        'Fifty Thousand' + 1
Which, of course, we know will give a runtime error, since the conversion required by the + operator will fail.
Interestingly though, it is not always the case that comparing to something higher on the precedence chart results in a costly implicit conversion. Take 50000.000:
SELECT *
FROM   dbo.ExamineExpression('50000.000');
Which shows the literal is typed as a numeric, higher on the precedence list than int. Executing the following would seem like it should be a slow operation, right? The column would need to be converted to numeric to compare each row to 50000.000, right?
SELECT *
FROM   DemoIntPK
WHERE  KeyColumn = 50000.000;
Nope, this executes nicely. The parameter you see in the plan is seemingly treated as just a number, with no implicit conversion (and it runs really quickly because an integer can never equal a numeric expression with a non-zero fractional part).
|--Clustered Index Seek(OBJECT:([tempdb].[dbo].[DemoIntPK].[PK__DemoIntP__BB2D052282D624B8]), SEEK:([tempdb].[dbo].[DemoIntPK].[KeyColumn]=[@1]) ORDERED FORWARD)
Why? Because the optimizer knows that a non-integer value, one where CAST(Value AS int) = Value is false, can never return rows. Change the search argument to 50000.001, and now it won't work well, right?
SELECT *
FROM   DemoIntPK
WHERE  KeyColumn = 50000.001;
"Well" is in the eye of the beholder. The output of the query was no rows, but what of the plan?
|--Clustered Index Seek(OBJECT:([tempdb].[dbo].[DemoIntPK].[PK__DemoIntP__BB2D052282D624B8]), SEEK:([tempdb].[dbo].[DemoIntPK].[KeyColumn]=[@1]) ORDERED FORWARD)
And what gets interesting is that it performs fine, and it has the same plan, but the query stats show that the optimizer knew better:
Table 'DemoIntPK'. Scan count 0, logical reads 0, physical reads 0
It seemingly didn't even check the data (0 logical reads), since it knew no integer could match. I won't show any other examples here, but just be aware that if your string can be translated to a numeric value, it is generally going to work. And thankfully so, or writing ad-hoc queries would generally be a nightmare.
This is probably not the last time I will say this, but matching the datatype, by using a CAST, CONVERT, or variable of the right type, is the safest way to ensure proper performance. As in so many ways, SQL Server saves us from ourselves, but it also means the actual bad cases are not obvious.
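As a sketch of that safe pattern, using this post's demo table (the variable name is just for illustration):

```sql
-- Safest: declare the variable with the same datatype as the column,
-- so no conversion happens on either side of the comparison
DECLARE @KeyValue int = 50000;

SELECT *
FROM   dbo.DemoIntPK
WHERE  KeyColumn = @KeyValue;

-- Or, if the value arrives as a string, convert it once, explicitly
SELECT *
FROM   dbo.DemoIntPK
WHERE  KeyColumn = CAST('50000' AS int);
```

Either way, the conversion (if any) happens once to the search value, never to the column.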
Character Data
Character data is where many mistakes are made. The most typical is with a varchar (not Unicode) column. Say you query it with this literal '50000' value:
SELECT *
FROM   DemoVarcharPK
WHERE  KeyColumn = '50000';
As Obi-Wan said during those three months he was working for the Rebel data academy in a mix up of epic proportions, this is the value you are looking for. The plan is a seek:
|--Clustered Index Seek(OBJECT:([tempdb].[dbo].[DemoVarcharPK].[PK__DemoVarc__BB2D0522384CAB6A]), SEEK:([tempdb].[dbo].[DemoVarcharPK].[KeyColumn]=[@1]) ORDERED FORWARD)
It too read only 3 pages in the clustered index:
Table 'DemoVarcharPK'. Scan count 0, logical reads 3
Dandy. But you are going to notice that we are in a bit of a minefield as we navigate around textual data. Check the output of this query:
SELECT TOP 10 *
FROM   DemoVarcharPK
ORDER BY KeyColumn;
This returns:
KeyColumn  PaddingColumn
---------- --------------
1          AAAAAAAAAAAAAA
10         AAAAAAAAAAAAAA
100        AAAAAAAAAAAAAA
1000       AAAAAAAAAAAAAA
10000      AAAAAAAAAAAAAA
100000     AAAAAAAAAAAAAA
1000000    AAAAAAAAAAAAAA
100001     AAAAAAAAAAAAAA
100002     AAAAAAAAAAAAAA
100003     AAAAAAAAAAAAAA
Now, the sort order should give you a clue: the KeyColumn values only look like integers. Can you match on an integer? You have seen the opposite earlier in the post, so of course. But how will this perform?
SELECT *
FROM   DemoVarcharPK
WHERE  KeyColumn = 50000;
Well, the plan says this:
|--Parallelism(Gather Streams)
     |--Clustered Index Scan(OBJECT:([tempdb].[dbo].[DemoVarcharPK].[PK__DemoVarc__BB2D0522384CAB6A]), WHERE:(CONVERT_IMPLICIT(int,[tempdb].[dbo].[DemoVarcharPK].[KeyColumn],0)=[@1]))
And somewhere a sad trombone plays blues for the stats:
Table 'DemoVarcharPK'. Scan count 13, logical reads 15081, physical reads 0, page server reads 0, read-ahead reads 14765

SQL Server Execution Times: CPU time = 62 ms, elapsed time = 147 ms.
SQL Server parse and compile time: CPU time = 0 ms, elapsed time = 0 ms.
This looks worse in the stats, but I barely noticed the time it took to execute (which is why we never notice this "on our machine!"). You can see that the plan even used parallelism to achieve the goal quickly enough. And since that table wasn't cached yet, it did a lot of read-aheads to fetch the data. The thing is, looking at the plan:
|--Parallelism(Gather Streams)
     |--Clustered Index Scan(OBJECT:([tempdb].[dbo].[DemoVarcharPK].[PK__DemoVarc__BB2D0522384CAB6A]), WHERE:(CONVERT_IMPLICIT(int,[tempdb].[dbo].[DemoVarcharPK].[KeyColumn],0)=[@1]))
And comparing it to the earlier plan where the column was an integer type and the parameter was a string:
|--Clustered Index Seek(OBJECT:([tempdb].[dbo].[DemoIntPK].[PK__DemoIntP__BB2D052282D624B8]),
SEEK:([tempdb].[dbo].[DemoIntPK].[KeyColumn]=CONVERT_IMPLICIT(int,[@1],0))
ORDERED FORWARD)
They don't immediately look that different. Both are doing an implicit conversion to an int, but the problem is where the implicit conversion occurs. The good query converts one value to the datatype of the column; the other converts the value in every single row of the table.
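A minimal sketch of the fix, using this post's demo table: put the conversion on the search value's side, so the column is left alone and the seek is preserved.

```sql
-- Instead of: WHERE KeyColumn = 50000 (which converts KeyColumn in every row),
-- convert the single search value to the column's datatype:
SELECT *
FROM   dbo.DemoVarcharPK
WHERE  KeyColumn = CAST(50000 AS varchar(10));
```

The explicit CAST overrides datatype precedence, which would otherwise pull the varchar column up to int.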
This is the major problem. And it is why you have to be careful teaching people that something is universally bad for performance, rather than universally a bad practice, because only in SOME non-obvious cases does it become a performance issue.
Ok, one last thing with this table, and this is the most typical, practical example of these problems: querying a varchar column with an nvarchar value. Apps these days tend to use Unicode variables, and if you are working with a non-parameterized query (or one that does not do the conversion explicitly), this can occur:
SELECT *
FROM   DemoVarcharPK
WHERE  KeyColumn = N'50000';
Now, here is where the most typical, most terrible case occurs:
|--Parallelism(Gather Streams)
     |--Clustered Index Scan(OBJECT:([tempdb].[dbo].[DemoVarcharPK].[PK__DemoVarc__BB2D0522384CAB6A]), WHERE:(CONVERT_IMPLICIT(nvarchar(10),[tempdb].[dbo].[DemoVarcharPK].[KeyColumn],0)=[@1]))
And yet again, the first time I executed this, I got physical reads:
Table 'DemoVarcharPK'. Scan count 13, logical reads 15081, physical reads 129, page server reads 0, read-ahead reads 14789

SQL Server Execution Times: CPU time = 139 ms, elapsed time = 155 ms.
And look, it is 100% true that hardware is faster these days. You may never notice this difference in performance, especially if you are on premises. But this query took (conservatively, rounding the proper index utilization examples up to 1 ms) 155 times the resources of the other query. Pennies on your cloud platform? Probably. But run that query 100 times, and it is a dollar; 100,000 times in a month, and we are talking real money.
So you, as data programmer/architect, took your time to use a varchar value instead of an nvarchar to save a lot of storage space. Win. And then it costs you 15,000 extra page reads on every query because (most likely) the user interface uses all Unicode values, and datatype precedence forces the column, not the search value, to be converted.
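If you cannot change the client to send varchar parameters, one workaround to sketch is converting the Unicode value back down before comparing (safe here because the key values are all ASCII digits; the variable name is just for illustration):

```sql
-- Convert the nvarchar parameter to the column's varchar type once,
-- rather than letting precedence convert the column to nvarchar per row
DECLARE @SearchValue nvarchar(10) = N'50000';

SELECT *
FROM   dbo.DemoVarcharPK
WHERE  KeyColumn = CAST(@SearchValue AS varchar(10));
```

Be careful with this pattern if the column can hold characters that do not survive the Unicode-to-varchar conversion; here it cannot, so the seek is safe.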
Summary
This is the last time I will type this, for now: use the matching datatype when you are writing a query for reuse. If the column is an int, use an int, even though a numeric, character, or whatever will likely be fine if it can be converted to an int. If the input is text, try casting the value to an int and see if the cast is lossless (maybe they wanted 50000.001, not 50000, for whatever reason), but do what you can to make sure it is an int compared to an int.
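One way to sketch that lossless-cast check is with TRY_CAST, which returns NULL instead of raising an error when a conversion fails (the variable name is just for illustration):

```sql
DECLARE @Input varchar(30) = '50000.001';

-- TRY_CAST('50000.001' AS int) is NULL, so this flags the value
-- rather than silently comparing the wrong number
SELECT CASE WHEN TRY_CAST(@Input AS int) IS NULL
            THEN 'Value is not an int; reject or handle it'
            ELSE 'Safe to compare as int'
       END AS CastCheck;
```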
Hopefully now you know why, in a way that won't be shaken when someone runs a few queries and seems to refute the reasons.
Appendix
Code may be downloaded from my Github Repo
--This script will build a series of tables for the datatype demos.
DROP TABLE IF EXISTS dbo.DemoVarcharPK;
DROP TABLE IF EXISTS dbo.DemoIntPK;

-- Create first demo table with VARCHAR primary key
CREATE TABLE dbo.DemoVarcharPK
(
    KeyColumn VARCHAR(10) PRIMARY KEY,
    PaddingColumn CHAR(100) NOT NULL DEFAULT (REPLICATE('A', 100))
);

-- Create second demo table with INTEGER primary key
CREATE TABLE dbo.DemoIntPK
(
    KeyColumn INT PRIMARY KEY,
    PaddingColumn CHAR(100) NOT NULL DEFAULT (REPLICATE('A', 100))
);

INSERT INTO dbo.DemoIntPK (KeyColumn)
SELECT Value
FROM   GENERATE_SERIES(1, 1000000);

INSERT INTO dbo.DemoVarcharPK (KeyColumn)
SELECT Value
FROM   GENERATE_SERIES(1, 1000000);

---------------================

CREATE OR ALTER FUNCTION dbo.ExamineExpression
(
    @Expression nvarchar(max)
) RETURNS @Output TABLE
(
    Datatype sysname,
    Nullability varchar(10),
    Expression nvarchar(max)
)
AS
BEGIN
    --add the expression to a simple SELECT statement
    DECLARE @SQL nvarchar(max) = 'SELECT ' + @Expression + ' AS CheckMe';

    --then use it in a query:
    INSERT INTO @Output (Datatype, Nullability, Expression)
    SELECT COALESCE(system_type_name, 'Invalid expression'),
           CASE WHEN system_type_name IS NULL THEN 'Error'
                WHEN is_nullable = 1 THEN 'NULL'
                WHEN is_nullable = 0 THEN 'NOT NULL'
                ELSE 'UNKNOWN'
           END AS Nullability,
           @Expression
    FROM   sys.dm_exec_describe_first_result_set(@SQL, NULL, 0) AS dedfrs;

    RETURN;
END;


