I have NULL issues with NULL

There is no simpler topic in relational comparisons than three valued logic. I am being mostly facetious about this, but in reality, it seems so simple that people don’t think about how a NULL works, and make mistakes all of the time. I was reading a post about this the other day on LinkedIn (which by no means could one ever find again!) where one of the comments chastised the author of the post for not understanding “the fundamentals” of relational theory. The original poster wasn’t completely right (and my post may not be completely perfect either, though I will back most of what I write with code.)

In this post I want to point out a few of the key basics that one really should understand.

SQL Server Basics

In SQL Server, there is no data of type “Boolean” that you can store or interact with as a piece of data. But there is a Boolean type used in the context of logical comparisons that return TRUE, FALSE or NULL. PostgreSQL has a true Boolean type, but most SQL products don’t. In SQL Server we typically use a bit type to store TRUE/FALSE values, but even that is more or less just an integer.

Note that it is stored like an integer (called it as such in the docs, even though there are a few things you can’t do with it like aggregate or even simple math. For example, the following query:

SELECT CAST(1 AS BIT) + CAST(2 AS BIT);

Fails with:

Msg 402, Level 16, State 1, Line 8 The data types bit and bit are incompatible in the add operator.

This is kind of a good thing, especially when we are treating it like a Boolean, but also annoying when you want to see how many true values you have from a column of bit values. You can’t sum them, so you need to cast them to integers first.

NULL Basics

One of the first things you need to understand is how NULL is treated in comparisons. It is treated a an unknown value. This gets confusing because we often will treat it as representing a false value in many contexts. For example, in a relationship like customer to address. If we don’t have an address for the customer, we would represent that as a NULL. And even when you use OUTER JOINs, missing data is represented by NULL. But you have to be careful to think of this as the value saying “you don’t know the customer’s address”. If you want to represent “this customer does not have an address” you might just add a bit column to represent “Doesn’t have an address”.

Of course this does get interesting because you end up with:

			
AddressId  int FK to Address.AddressId
NoAddressFlag bit NULL

If both are NULL, then you would interpret this as “We don’t know the address, and we don’t know if they even have an address.”, If NoAddressFlag = 1, that would mean, they don’t have an address, and you can ignore the NULL value in AddressId. And if NoAddressFlag = 0, and AddressId is NULL, we generally know they have an address, but it is not known to us. And NoAddressFlag = 0, and AddressId is non NULL, then we have their address.

But the one thing you have to realize is that in any comparison, NULL represents any value, but you don’t know that value. Hence, NULL never is directly equal to NULL.

The most convincing example of `NULL` behavior

THe one big thing every SQL dev needs to be 100% clear on is that NULL and FALSE are not the same thing. They are similar in that they are not TRUE and are usually considered a negative thing (not always, of course, but in most work we do, we are looking for a TRUE outcome). But they are very different. When I want to make 100% sure that someone understands that NULL/UNKNOWN and FALSE are different things, I use this example:

			
DECLARE @NullValue INT = NULL;
SELECT CASE WHEN @NullValue = 1 THEN 1 ELSE 0 END, --is NULL = 1?
       CASE WHEN NOT(@NullValue = 1) THEN 1 ELSE 0 END, --is NOT(NULL) = 1?
       CASE WHEN NOT(NOT(@NullValue = 1)) THEN 1 ELSE 0 END, -- is NOT(NOT(NULL = 1))?
       CASE WHEN NOT(NOT(NOT(@NullValue = 1))) THEN 1 ELSE 0 END, --etc.
       CASE WHEN NOT(NOT(NOT(NOT(@NullValue = 1)))) THEN 1 ELSE 0 END;

		

By all that is logical in this world, if @NullValue does not equal 1, then the first should return 1 and if not, it will return 0. Yet the output says otherwise:

			
----------- ----------- ----------- ----------- -----------
0           0           0           0           0

But we know that @NullValue did not equal 1, then it must be FALSE, and by that logic, NOT(@NullValue = 1) should then be true. But also returned 0. This all comes down to the fact that unknown means we don’t know, or in less technical terms: ¯_(ツ)_/¯.

For a real world example, say you have 10 dollars. Now I promise to add an unknown amount of money to add to your $10. Werther this is a good deal or not is if you have determined a domain of value to that amount. Any positive number is good. Any negative number is less good. -$2,000,000,000 is a legit integer value.

NULL conceptually represents EVERY possible value that could exist in that column. So any value that is in the table might match, so the comparison results in UNKNOWN, which says “we are not sure, so it isn’t false, but it certainly isn’t true.” (and even if the domain of that column happened to be a single, non-null value, it still would not match.)

Truth tables tell the story

For every Boolean comparison operarator, there is a truth table that tells us what happens in each comparison. They list values, and then what happens when you use a NOT expression, and then AND and OR.

For example, consider the truth table for NOT.

A	NOT A
TRUE	FALSE
FALSE	TRUE
NULL	NULL

The key here is that NOT NULL will output NULL. This is the basis of all three valued logic. NULL is not FALSE, and NULL is not TRUE. NULL is UNKNOWN. So no matter how many times you NOT this comparison, it will remain UNKNOWN. AND and OR is also interesting:

A	B	A AND B	A OR B
TRUE	TRUE	TRUE	TRUE
TRUE	FALSE	FALSE	TRUE
TRUE	NULL	NULL	TRUE
FALSE	TRUE	FALSE	TRUE
FALSE	FALSE	FALSE	FALSE
FALSE	NULL	FALSE	NULL
NULL	TRUE	NULL	TRUE
NULL	FALSE	FALSE	NULL
NULL	NULL	NULL	NULL

TRUE OR NULL is TRUE. TRUE AND NULL is NULL. And as you get more and more AND and OR conditions in your equations, one NULL can have a big effect on your result.

Read through all of these in detail and think about them FALSE and anything is FALSE. TRUE OR anything is TRUE, But things like FALSE OR NULL are NULL, because you aren’t sure what NULL represents.

Now some code

The big example that people cite is working with IN and especially NOT IN. When you do soemthing like:

			
SELECT ...
FROM   CustomerId
WHERE  Status IN (SELECT StatusCode
                  FROM   StatusCode);

It gets translated to the mathematical equivalent of:

Row1 in table OR Row2 in table OR
…

If there is a NULL value in StatusCode, this isn’t a problem, because NULL or TRUE is true. But many people just casually change this to a NOT IN and this is where trouble starts.

			
SELECT ...
FROM   CustomerId
WHERE  Status NOT IN (SELECT StatusCode
                      FROM   StatusCode);

But this turns into a more complictated mathematical comparison where you need to check every value:

NOT (Row in Table) AND NOT (Row2 in table) AND
…

So now any NULL value will l cause the entire expression to become NULL, which will filter out all rows from the result set. Why? It is based on what is being looked for in a filter, whihc is generally TRUE.

Filters and constraints

One of the things that people get confused by (including me regularly!) is that you need to ask what any predicate is looking for. A filter is typically going to look for a TRUE match with a row. A constraint is going to look for a FALSE match to determine whether to include or exclude a row or value. And when you are looking for TRUE, you can’t just use NOT and be completely sure you will get the opposite because of the NOT(NULL) -> NULL.

Filters

Whenever you code a WHERE, HAVING, IF, CASE, ON, etc will all look for a TRUE result to determine whether to include or exclude a row or value.

So, when you write:

			
SELECT ...
FROM   Table1
        JOIN Table2
            ON Table1.Key = Table2.Key
WHERE  Table1.Status = 'Active'

		

Every key value in every row in Table1 will be compared to Table2.key and the only way it passes the test is if the expression result is TRUE. NULL or FALSE will be thrown away. Same in the WHERE expression. TRUE or excluded. This is one of the major reasons that NULL is so confusing for people.

If Table1.Status is NULL, that row would not be returned and this looks like it is FALSE. But the first time you try to handle this using:

			
SELECT ...
FROM   Table1
        JOIN Table2
            ON Table1.Key = Table2.Key
WHERE  NOT(Table1.Status = 'Active')

		

And the NULL rows arent in the output, it probably doesn’t even get noticed unless your testing process are really exhaustive.

Constraints

Whereas filters look for TRUE to include rows, CHECK and FOREIGN KEY constrant don’t work the same. Logically, it would seem that they look for rows that match their criteria. But what they are looking for are rows that explicitly don’t match the overall criteria. And when you mix in columns that allow NULL values, the behavior can be counter-intuitive.

Single column constraints

For example, consider these two tables, and a check constraint on CustomerName and an foreign key constraint on StatusCode (to Status table):

			
USE Tempdb;
GO
CREATE TABLE Status(
    StatusCode CHAR(1) NOT NULL PRIMARY KEY);
CREATE TABLE Customer
(
    CustomerId INT IDENTITY(1,1) PRIMARY KEY,
    CustomerName VARCHAR(255) NULL
          CHECK (LEN(CustomerName) > 3),
    StatusCode CHAR(1) NULL 
          REFERENCES Status (StatusCode)
);
INSERT INTO Status(StatusCode) VALUES ('A'), ('B'), ('C');

		

If you try to insert a row with a NULL StatusCode and/or CustomerName, you might expect them to fail since they don’t match the constraint, but they succeed:

			
INSERT Customer(CustomerName, StatusCode)
VALUES (NULL,NULL);
INSERT INTO Customer(CustomerName, StatusCode)
VALUES (NULL,'C');
INSERT INTO Customer(CustomerName, StatusCode)
VALUES ('Robin',NULL); 

		

Even though CustomerName is not > 3 in length, and StatusCode in the referenced table. This is because in both cases, the expression evaluates to NULL, and a constraint is satisfied if the expression is not FALSE. Since NULL is neither TRUE nor FALSE, the constraint passes. Then the NULL constraint on the column takes over.

Multi-column constraints

In a single column constraint, this makes pretty straightforward logical sense. But when there is more than one column, it starts to get confusing.

For a check contstraint, when it is the one column, it didn’t matter if it was NULL, but consider a constraint that checks multiple columns. As an example, I will add a constraint that says CustomerName must be 'Barney' and StatusCode must be 'B'. So I will alter the table to have this new constraint:

			
DELETE FROM Customer;
ALTER TABLE Customer 
ADD CONSTRAINT CK_Customer_NameAndStatus 
CHECK (CustomerName = 'Barney' AND StatusCode = 'B');

This statement works:

			
INSERT INTO Customer(CustomerName, StatusCode)
VALUES ('Barney', 'B');

And these all fail:

			
INSERT INTO Customer(CustomerName, StatusCode)
VALUES ('Barney', 'C');
INSERT INTO Customer(CustomerName, StatusCode)
VALUES ('Leonard', 'B');
INSERT INTO Customer(CustomerName, StatusCode)
VALUES ('Leonard', 'C');

		

Ok, so that seemed logical, But: these works: but what about these?

			
INSERT Customer(CustomerName, StatusCode)
VALUES (NULL,NULL);

Still good.

			
INSERT INTO Customer(CustomerName, StatusCode)
VALUES (NULL,'C');

But this now fails:

Msg 547, Level 16, State 0, Line 196 The INSERT statement conflicted with the CHECK constraint "CK_Customer_NameAndStatus". The conflict occurred in database "tempdb", table "dbo.Customer".

Because the expression no longer evaluates to NULL because while CustomerName is null, StatusCode = ‘B’ evaluates to false, so the entire expression becomes false, violating the constraint. But if part of the expression does pass:

			
INSERT INTO Customer(CustomerName, StatusCode)
VALUES ('Barney', NULL);

This succeeds.

Why? Back to the truth tables. When we have:

'Barney' = 'Barney' AND NULL = 'B'

The first part evaluates to TRUE, the second part evaluates to NULL, and TRUE AND NULL evaluates to NULL, which satisfies the constraint because it is not explicitly FALSE. But when we have:

NULL = 'Barney' AND 'C' = 'B'

The first part evaluates to NULL, the second part evaluates to FALSE, and NULL AND FALSE evalueates to FALSE, which violates the constraint.

This is why I always understand why people have issues with NULL values. It can get nuts and some things like NULL AND FALSE feels like it should be NULL, but one FALSE makes any AND expression FALSE.

Multi-key Foreign Key Constraints

Now say you have the following table with a complex key:

			
CREATE TABLE Domain
(
    DomainKey1 int NOT NULL,
    DomainKey2 int NOT NULL,
    CONSTRAINT PK_Domain PRIMARY KEY (DomainKey1, DomainKey2)
);
INSERT INTO Domain VALUES (1, 1), (2, 2);
GO
CREATE TABLE DomainReference
(
    DomainKey1 int NULL,
    DomainKey2 int NULL,
    CONSTRAINT FK_DomainReference_Domain FOREIGN KEY (DomainKey1, DomainKey2) REFERENCES Domain(DomainKey1, DomainKey2)
);
GO

		

It should be clear that this will succeed:

			
INSERT INTO DomainReference(DomainKey1, DomainKey2)
VALUES (1, 1);

And this:

			
INSERT INTO DomainReference(DomainKey1, DomainKey2)
VALUES (NULL, NULL);

But, what about partial references?

			
INSERT INTO DomainReference(DomainKey1, DomainKey2)
VALUES (1, NULL);
INSERT INTO DomainReference(DomainKey1, DomainKey2)
VALUES (NULL, 1);

Both worked. But what about this:

			
INSERT INTO DomainReference(DomainKey1, DomainKey2)
VALUES (99, NULL);
INSERT INTO DomainReference(DomainKey1, DomainKey2)
VALUES (NULL, 99);

These too worked. Why? Because the composite foreign key constraint allows for partial matches when one of the key columns is NULL. This feels like it disagrees with the previous example, but in this case, the reference is considered as a whole, and any NULL values in the foreign key means that the row do not participate in the referential integrity check. If you want to control this, you will need to use a CHECK constraint that says:

			
CHECK (DomainKey1 IS NULL AND DomainKey2 IS NULL) 
    OR (DomainKey1 IS NOT NULL AND DomainKey2 IS NOT NULL);

Summary

NULL values are a pain in the bottom half of your body. Generally necessary, but can very very tricky if you are not careful and understand the basics and intricacies of how they work when you start writing complex expressions that use more than one column or comparison. Is this all you need to know about NULL, that answer is not NULL, it is false. But in general, practice and walking through many scenarios as you write code will truly help you understand the behavior of NULL in your T-SQL code.

When doing column comparisons and you want to see if a column is the same or different and treat NULL as equivalent (meaning you don’t know the value in row A, and you also don’t in row B, so they are in the same condition), check out the IS DISTINCT FROM operator.

Drsql's Database Musings