COUNT, DISTINCT, and NULLs

Published on: 2019-02-19

Watch this week’s episode on YouTube.

One thing I see fairly often (and am occasionally guilty of myself) is using COUNT(DISTINCT) and DISTINCT interchangeably to get an idea of the number of unique values in a column.

While they will sometimes give you the same results, depending on the data the two methods may not always be interchangeable.

Let’s start off with some test data. Important to note are the duplicate values, including the NULLs:

DROP TABLE IF EXISTS ##TestData;
CREATE TABLE ##TestData (Id int identity, Col1 char(1) NULL); 
INSERT INTO ##TestData VALUES ('A'); 
INSERT INTO ##TestData VALUES ('A'); 
INSERT INTO ##TestData VALUES ('B'); 
INSERT INTO ##TestData VALUES ('B');
INSERT INTO ##TestData VALUES (NULL); 
INSERT INTO ##TestData VALUES (NULL); 
CREATE CLUSTERED INDEX CL_Id ON ##TestData (Col1); 

If you want to know how many unique values are in Col1, you might write something like this:

SELECT COUNT(DISTINCT Col1) 
FROM ##TestData 

Two distinct values great! Except…weren’t there some NULLs in there? If we want to see the actual values instead of just seeing the count:

SELECT DISTINCT Col1 
FROM ##TestData 

Interesting, when doing a plain DISTINCT we see there are three unique values, but in our previous query when we wrote COUNT(DISTINCT Col1) a count of two was returned.

And while the SQL Server documentation specifies that DISTINCT will include nulls while COUNT(DISTINCT) will not, this is not something that many people find intuitive.

Viewing and COUNTing the NULLs

Sometimes we might have to do the opposite of what the default functionality does when using DISTINCT and COUNT functions.

For example, viewing the unique values in a column and not including the nulls is pretty straightforward:

SELECT DISTINCT 
    Col1 
FROM 
    ##TestData 
WHERE  
    Col1 IS NOT NULL 

Getting the opposite effect of returning a COUNT that includes the NULL values is a little more complicated. One thing we can try to do is COUNT all of our DISTINCT non-null values and then combine it with a COUNT DISTINCT for our NULL values:

select  COUNT(DISTINCT Col1) + COUNT(DISTINCT CASE WHEN Col1 IS NULL THEN 1 END)
from    ##TestData;

While this logic is easy to interpret, it forces us to read our column of data twice, once for each COUNT – not very efficient on larger sets of data:

Another thing we can try is to put in a placeholder value (that doesn’t exist elsewhere in the column’s data) so that COUNT will include it in its calculation:

SELECT 
    /* ~~~ will never exist in our data */
    COUNT(DISTINCT ISNULL(Col1,'~~~')) 
FROM 
    ##TestData 

The ISNULL here functions the same as the CASE statement in our first attempt, without having to read the table twice. However, that Compute Scalar occurring to the left of our Clustered Index Scan will start to become painful as our data size increases since SQL Server will need to check each and every row and convert any NULLs it finds. Not to mention after computing all of those ~~~ values, SQL Server needs to re-sort the data to be able to find the DISTINCT values.

That leads us to a final attempt: using a DISTINCT in a derived table (to return our NULL) and then taking a count of that:

SELECT COUNT(*)  
FROM (SELECT DISTINCT Col1 FROM ##TestData) v 

This last option eliminates the Compute Scalar and extra sort. While there might be even better options out there for accomplishing the same task, at this point I’m pretty happy with how this will perform.

What’s the Point?

SQL Server’s documentation says that COUNT(*) returns items in a group while COUNT(Col1) return non nulls in the group.

/* returns items in the group.  Includes nulls */
SELECT COUNT(*) 
FROM ##TestData;

/* returns non null values in group */
SELECT COUNT(Col1) 
FROM ##TestData;

Because of this, COUNT, DISTINCT, and NULLs have a special relationship with each other that isn’t always as intuitive as many people think.

Whenever using COUNT or DISTINCT, make sure to test with NULLs to make sure SQL Server is handling them like you expect. 

Thanks for reading. You might also enjoy following me on Twitter.

Want to learn even more SQL?

Sign up for my newsletter to receive weekly SQL tips!

Parsing HTML in SQL Server

Published on: 2019-02-12

Recently I was asked how to parse text out of an HTML fragment stored in SQL Server.

Over the next few seconds my brain processed the following ideas:

  • SQL Server is not meant for parsing HTML. Parse the data with something else.
  • T-SQL does have functions like REPLACE, CHARINDEX, and SUBSTRING though, perfect for searching for tags and returning just the values between them.
  • CLRs could do it, probably using some kind of HTML shredding library. You also might be able to use XMLReader to do something with it…
  • Wait a minute, SQL Server has XML parsing functions built in!

Maybe you see where this is going.

WARNING – this is a terrible idea

Parsing HTML with T-SQL is not a great idea. It’s dirty, it’s prone to breaking, and it will make your server’s CPUs cry that they aren’t being used for some nobler cause. If you can parse your HTML somewhere outside of SQL Server, then DO IT THERE.

With that said, if you absolutely need to parse HTML on SQL Server, the best solution is probably to write a CLR.

However, if you are stuck in a bind and plain old T-SQL is the only option available to you, then you might be able to use SQL Server’s XML datatype and functions to get this done. I’ve been there before and can sympathize.

So anyway, here goes nothing:

Using XML to parse HTML

Let’s say we have the following fragment of HTML (copied from a bootstrap example template):

DECLARE @html xml = ' 
    <div class="container"> 
        <div class="card-deck mb-3 text-center"> 
            <div class="card-body"> 
                <h1 class="card-title pricing-card-title">$15 <small class="text-muted">/ mo</small></h1> 
                <ul class="list-unstyled mt-3 mb-4"> 
                    <li>20 users included</li> 
                    <li>10 GB of storage</li> 
                    <li>Priority email support</li> 
                    <li>Help center access</li> 
                </ul> 
                <button type="button" class="btn btn-lg btn-block btn-primary">Get started</button> 
            </div> 
        </div> 
    </div> 
'; 

If we wanted to say extract all of the text from this HTML (to allow text mining without all of the tags getting in the way) we could easily do this using the XML nodes() and value() methods:

-- Get all text values from elements 
SELECT 
    T.C.value('.','varchar(max)')  AS AllText
FROM 
    @html.nodes('/') T(C);

If we want to only extract the items from the list elements, we can write some XQuery to select only those elements:

-- Get a fragment of HTML 
SELECT 
    T.C.value('.','varchar(100)') AS ListValues 
FROM 
    @html.nodes('//*[local-name()=("li")]') T(C); 

Finally, we can also do things like select HTML fragments based on an attribute to parse further in subsequent steps. If I want to select the div with a class of “card-body”, I can write:

-- Get the text from within certain elements 
SELECT 
    T.C.query('.') AS CardBody 
FROM 
    @html.nodes('//div[@class="card-body"]') T(C); 

Yuck

To reiterate – you don’t want to do any of the above unless you have no other choice.

The XML parsing functions will not parse all HTML, so you may need to do some pre-processing on your HTML data first (removing invalid HTML, closing tags, etc…).

It works beautifully in the above example but your results may very. Good luck!

Thanks for reading. You might also enjoy following me on Twitter.

Want to learn even more SQL?

Sign up for my newsletter to receive weekly SQL tips!

Why make?

This post is a response to this month’s T-SQL Tuesday #111 prompt by Andy Leonard.  T-SQL Tuesday is a way for the SQL Server community to share ideas about different database and professional topics every month.

In this month’s topic Andy asks why do we do what we do?


Two years ago, I was bored. 

I’d come home from work, spend my free time watching Netflix and surfing the internet, occasionally tinker with some random side projects, and eventually going to bed. Rinse and repeat, day in and day out.

I felt unfulfilled.  While I value free time and relaxation, I had an overabundance of it.  I felt like I should be doing something more productive with at least some of that time.  I wanted to work on my “professional development” somehow, but it was extremely difficult to get motivated to work on boring career stuff.

I decided what I needed was a long-term project that would allow me to have fun and be creative, while also having some positive personal and professional development benefits; what I was looking for was the ULTIMATE side project.

After spending some time thinking about different ideas, I decided to make videos about SQL Server.  Not only would I enjoy learning more about how SQL Server works (fun), but I could get practice writing and speaking (career) as well as get to incorporate my other hobby of film making into the mix (creative).

At first it felt forced; while I enjoyed learning new things about SQL Server, it was not easy thinking of topics.  Writing and editing was strenuous, but coming up with jokes and visual ways to convey ideas was fun.  Filming (and lighting and audio recording) was hard, but editing has always been pure pleasure for me.

So while at times coming up with a weekly bit of content was challenging, I kept at it because not only was it good for me, but I incorporated enough fun and creative elements into the process to look forward to it and keep going with it.

Fast forwarding to today, the process still isn’t perfect but things have gotten better: I have enough ideas to probably last me a few years (and generating more all the time), writing is still tough but I’ve seen noticeable progress so I’m motivated to keep at it, I still don’t like being in front of a camera but I have a dramatically easier time speaking about technical topics so the practice has paid off there, and while every episode isn’t as creative as I’d like, I have a lot of fun being weird and coming up with new ideas for weekly videos.

Not only that, I now have new motivating factors that I didn’t have from day one.  I’ve made friends with a lot of people in the SQL Server community, and they are fantastic and supportive.  Many of them even want to collaborate and make fun videos which is something I always look forward to.  The audience that consumes the content is wonderful as well; every time I receive a thank you email or comment, I am filled with joy.  And obviously all of the skills I have learned – technical, presenting, and networking – have helped immensely in my day-to-day.

In conclusion, the reasons that caused me to start creating SQL Server videos still apply, however over time that list of motivators has grown and helps me continue to remain excited about what I do, even when the challenges feel greater some weeks than others.

Thanks for reading. You might also enjoy following me on Twitter.

Want to learn even more SQL?

Sign up for my newsletter to receive weekly SQL tips!