Joining on NULLs

Watch this week's video on YouTube

It's important to be aware of columns that allow NULL values since SQL Server may handle NULLs differently than you might expect.

Today I want to look at what things to consider when joining on columns containing NULL values.

Natural, Composite, NULLable keys

Let's pretend we have an Account table containing the accounts of various users and an AccountType table describing the different types of accounts:

2019-03-22-12-33-46

These tables have the unfortunate design characteristics of:

  1. They use a natural, composite key of YearOpened and AccountType
  2. NULL is the valid default for AccountType

Not that either of the above attributes are outright bad, just that we need to handle them appropriately. For example, if we want to bring back a description of each user's account, we might write a query with an inner join like this:

SELECT
    a.UserId,
    at.YearOpened,
    at.AccountType,
    at.Description
FROM
    dbo.Account a
    INNER JOIN dbo.AccountType at
        ON a.YearOpened = at.YearOpened
        AND a.AccountType = at.AccountType

Only to discover the rows with NULLs are not present:

2019-03-22-12-31-39

Joining on NULLs

Since it's not possible to join on NULL values in SQL Server like you might expect, we need to be creative to achieve the results we want.

One option is to make our AccountType column NOT NULL and set some other default value. Another option is to create a new column that will act as a surrogate key to join on instead.

Both of the above options would fix the problem at the source, but what about if we can only make changes to our queries?

One common approach is to convert the NULLs to some other non-NULL value using a function like COALESCE or ISNULL:

SELECT
    a.UserId,
    at.YearOpened,
    at.AccountType,
    at.Description
FROM
    dbo.Account a
    INNER JOIN dbo.AccountType at
        ON a.YearOpened = at.YearOpened
        AND ISNULL(a.AccountType,'`') = ISNULL(at.AccountType,'`')

2019-03-22-12-49-47

While this returns the results we want, there are two major issues with this approach:

  1. In the above example we converted NULLs to the ` character. If we had a valid ` character in our data, we would get logically incorrect joins.
  2. Our query can no longer perform index seeks.

The first issue isn't a huge deal if you can guarantee the character you are replacing NULLs with will never appear in the column of data.

The second issue is more important since ISNULL prevents your query from being SARGable and will cause poor performance on large tables of data.

2019-03-22-12-51-04

Those Compute Scalar operators are forcing SQL Server to Scan the indexes and compute a value for every row.

A More Efficient Solution

If using a function like ISNULL hurts the performance of our queries, what can we do instead?

SELECT
    a.UserId,
    at.YearOpened,
    at.AccountType,
    at.Description
FROM
    dbo.Account a
    INNER JOIN dbo.AccountType at
        ON a.YearOpened = at.YearOpened
        AND (a.AccountType = at.AccountType OR (a.AccountType IS NULL AND at.AccountType IS NULL))

2019-03-22-12-48-59

This produces the same exact results while allowing SQL Server to Seek when possible and avoid costly row by row computations:

There are no seeks here since I don't have any additional filters, but the lack of Compute Scalar operators should be enough to prove the point.

While there are a few more variations that can achieve the same results using different execution plans (writing a query that joins non-nulls and unioning it with a query that selects only the nulls, using a computed column to convert the NULLs to non-null values, etc...) the key to good performance is to choose a solution that will not force SQL Server to compute values for every single row.

COUNT, DISTINCT, and NULLs

Watch this week's video on YouTube

One thing I see fairly often (and am occasionally guilty of myself) is using COUNT(DISTINCT) and DISTINCT interchangeably to get an idea of the number of unique values in a column.

While they will sometimes give you the same results, depending on the data the two methods may not always be interchangeable.

Let's start off with some test data. Important to note are the duplicate values, including the NULLs:

DROP TABLE IF EXISTS ##TestData;
CREATE TABLE ##TestData (Id int identity, Col1 char(1) NULL); 
INSERT INTO ##TestData VALUES ('A'); 
INSERT INTO ##TestData VALUES ('A'); 
INSERT INTO ##TestData VALUES ('B'); 
INSERT INTO ##TestData VALUES ('B');
INSERT INTO ##TestData VALUES (NULL); 
INSERT INTO ##TestData VALUES (NULL); 
CREATE CLUSTERED INDEX CL_Id ON ##TestData (Col1); 

If you want to know how many unique values are in Col1, you might write something like this:

SELECT COUNT(DISTINCT Col1) 
FROM ##TestData 

image-4

Two distinct values great! Except...weren't there some NULLs in there? If we want to see the actual values instead of just seeing the count:

SELECT DISTINCT Col1 
FROM ##TestData 

image-5

Interesting, when doing a plain DISTINCT we see there are three unique values, but in our previous query when we wrote COUNT(DISTINCT Col1) a count of two was returned.

And while the SQL Server documentation specifies that DISTINCT will include nulls while COUNT(DISTINCT) will not, this is not something that many people find intuitive.

Viewing and COUNTing the NULLs

Sometimes we might have to do the opposite of what the default functionality does when using DISTINCT and COUNT functions.

For example, viewing the unique values in a column and not including the nulls is pretty straightforward:

SELECT DISTINCT 
    Col1 
FROM 
    ##TestData 
WHERE  
    Col1 IS NOT NULL 

image-6

Getting the opposite effect of returning a COUNT that includes the NULL values is a little more complicated. One thing we can try to do is COUNT all of our DISTINCT non-null values and then combine it with a COUNT DISTINCT for our NULL values:

select  COUNT(DISTINCT Col1) + COUNT(DISTINCT CASE WHEN Col1 IS NULL THEN 1 END)
from    ##TestData;

image-7

While this logic is easy to interpret, it forces us to read our column of data twice, once for each COUNT - not very efficient on larger sets of data:

image-11

Another thing we can try is to put in a placeholder value (that doesn't exist elsewhere in the column's data) so that COUNT will include it in its calculation:

SELECT 
    /* ~~~ will never exist in our data */
    COUNT(DISTINCT ISNULL(Col1,'~~~')) 
FROM 
    ##TestData 

image-7

image-12

The ISNULL here functions the same as the CASE statement in our first attempt, without having to read the table twice. However, that Compute Scalar occurring to the left of our Clustered Index Scan will start to become painful as our data size increases since SQL Server will need to check each and every row and convert any NULLs it finds. Not to mention after computing all of those ~~~ values, SQL Server needs to re-sort the data to be able to find the DISTINCT values.

That leads us to a final attempt: using a DISTINCT in a derived table (to return our NULL) and then taking a count of that:

SELECT COUNT(*)  
FROM (SELECT DISTINCT Col1 FROM ##TestData) v 

image-7

image-13

This last option eliminates the Compute Scalar and extra sort. While there might be even better options out there for accomplishing the same task, at this point I'm pretty happy with how this will perform.

What's the Point?

SQL Server's documentation says that COUNT(*) returns items in a group while COUNT(Col1) return non nulls in the group.

/* returns items in the group.  Includes nulls */
SELECT COUNT(*) 
FROM ##TestData;

/* returns non null values in group */
SELECT COUNT(Col1) 
FROM ##TestData;

image-14

Because of this, COUNT, DISTINCT, and NULLs have a special relationship with each other that isn't always as intuitive as many people think.

Whenever using COUNT or DISTINCT, make sure to test with NULLs to make sure SQL Server is handling them like you expect.