Dipping into the Cookie Jar

MJ-t-sql-Tuesday

This post is a response to this month's T-SQL Tuesday #112 prompt by Shane O'Neill.  T-SQL Tuesday is a way for the SQL Server community to share ideas about different database and professional topics every month. In this month's topic Shane asks us to describe past accomplishments that help keep us going.


Before the start of each presentation I give, I'm a nervous wreck.

It's not that I don't like presenting (I do) but in the minutes before my presentation start time I'm always filled with dread . Once I start my talk and am in the flow of my content the nerves usually subside. Those first few minutes are always rough though.

Before speaking I try to calm myself by going through a few various techniques, one of which is thinking about previous successful speaking engagements.

You've Done This Before

I rarely focus on a single past speaking engagement; rather I look at all of my appearances and pick one that best helps for the current situation.

Every presentation I give has some kind of new elements associated with it; some of these might be environmental like a bigger audience or a strange room setup, but others are self-imposed like wanting to try out a new story-telling technique or an interactive audience exercise.

At this point, I usually have enough previous speaking experiences to try and rationalize away any stressful thoughts:

  • "Speaking is scary." - You've done it before, you can do it again.
  • "This is a big audience." - Your online audiences have definitely been bigger.
  • "This joke will bomb." - You won't know until you try. And your past session reviews indicate that people think you are funny.
  • etc...

The great thing here is that I'm always able to find a way to rationalize some successful past experience as having been comparable or more difficult than the current scenario. Even when I only had a couple of speaking sessions under my belt, I could think back to when I successfully taught my coworkers something, or had to teach my family members how to do something technical.

100% Survival Rate

I still get nervous before speaking, but at least I can also remind myself that I've survived every previous time I've done it.

I'm not sure my nervousness will ever go away, but having past successes to think back on always helps quiet those nerves just a little bit.

Time Zones and Daylight Saving Time

Watch this week's video on YouTube

AT TIME ZONE is great because it makes it easy to perform daylight saving time and time zone conversions in our queries.

However, when using AT TIME ZONE for these types of conversions there are a couple key points to remember to ensure you are getting the correct results. Let's look at an example.

Thanks to reader Jeff Konicky for the inspiration for this post and allowing me to share it.

DST Conversions

This year, Daylight Savings Time kicks in at 2019-03-10 02:00:00 in Eastern Standard Time, meaning that the time zone switches from being 5 hours behind UTC to only 4 hours behind UTC.

If we have two datetime2s with UTC data, one right before DST kicks in and one right after, we should be able to use AT TIME ZONE to convert these UTC times to Eastern Standard/Daylight Time no problem:

DECLARE 
    @PreDST datetime2 = '2019-03-10 06:59:00',
    @PostDST datetime2 = '2019-03-19 07:00:00';

SELECT 
    @PreDST AT TIME ZONE 'Eastern Standard Time' AS [PreDST should be -05:00],
    @PostDST AT TIME ZONE 'Eastern Standard Time' AS [PostDST should be -04:00];

image

While that sounds great, the results above show it clearly didn't work - both of those datetime2s are showing as already being in daylight saving time (offset of -04:00).

Why did this happen?

SQL Server Doesn't Know Your Data Is In UTC

The fallacy above is that I said our two datetime2's are in UTC, but SQL Server doesn't actually know this. The datetime2 (and datetime) datatype doesn't allow for time zone offsets so SQL Server really doesn't know what time zone the data is in.

Using AT TIME ZONE on a datetime2 without offset information causes SQL Server to "...[assume] that [the datetime] is in the target time zone". That explains why the two datetime2s above, intended to be in UTC, are actually seen as Eastern Daylight Time by SQL Server.

Specifying the Time Zone Offset

So how do we tell SQL Server that our data is in UTC so AT TIME ZONE functions like we want?

One option is to use the assumption of AT TIME ZONE above in our favor, using it to first convert the datetime2s to UTC and then to EST:

SELECT 
    @PreDST AT TIME ZONE 'UTC' AT TIME ZONE 'Eastern Standard Time' AS [PreDST should be -05:00],
    @PostDST AT TIME ZONE 'UTC' AT TIME ZONE 'Eastern Standard Time' AS [PostDST should be -04:00];

image-1

Now we have the correct -05:00 and -04:00 offsets applied.

A cleaner solution would be to encode the original data by using the datetimeoffset datatype, which will indicate which time zone the datetimes are stored in:

DECLARE 
    @PreDST datetimeoffset = '2019-03-10 06:59:00+00:00',
    @PostDST datetimeoffset = '2019-03-19 07:00:00+00:00';

SELECT 
    @PreDST AT TIME ZONE 'Eastern Standard Time' AS [PreDST should be -05:00],
    @PostDST AT TIME ZONE 'Eastern Standard Time' AS [PostDST should be -04:00];

image-2

Don't Assume

Whenever working with datetime2 (or datetime) in SQL Server, you can't assume SQL Server knows the time zone of your data. Your business logic may indicate that you store dates in UTC, but SQL Server has no way of knowing that without your code explicitly stating it using something like AT TIME ZONE 'UTC' or storing your data in the datetimeoffset datatype.

Searching Complex JSON Data

Watch this week's video on YouTube

Computed column indexes make querying JSON data fast and efficient, especially when the schema of the JSON data is the same throughout a table.

It's also possible to break out a well-known complex JSON structure into multiple SQL Server tables.

However, what happens if you have different JSON structures being stored in each row of your database and you want to write efficient search queries against all of the rows of your complex JSON strings?

Complex JSON

Let's start out by creating a staging table that contains various fragments of JSON stored in a nvarchar column:

DROP TABLE IF EXISTS dbo.ImportedJson;
GO
CREATE TABLE dbo.ImportedJson
(
    Id int IDENTITY,
    JsonValue nvarchar(max)
);
GO

INSERT INTO dbo.ImportedJson (JsonValue) VALUES (N'{ 
    "Property1" : "Value1", 
    "Property2" : [1,2,3]
}');

INSERT INTO dbo.ImportedJson (JsonValue) VALUES (N'{ 
    "Property1" : "Value2", 
    "Property3" : [1,2,3], 
    "Property4" : ["A","B","C",null], 
    "Property5" : { 
                    "SubProp1": "A", 
                    "SubProp2": { 
                                    "SubSubProp1":"B", 
                                    "SubSubProp2": 1.2,
                                    "SubSubProp3" : true
                                } 
                    }, 
    "Property6" : [{"ArrayProp":"A"},{"ArrayProp":"B"}], 
    "Property7" : 123, 
    "Property8" : null 
}');

INSERT INTO dbo.ImportedJson (JsonValue) VALUES (N'{ 
    "Property8" : "Not null", 
    "Property9" : [4,5,6]
}');


SELECT * FROM dbo.ImportedJSON;

And the results: 

2019-02-21-06-54-22

Search Queries

If I want to search these values I have a few options.

First, I could write something like:

SELECT * FROM dbo.ImportedJSON WHERE JsonValue LIKE '%Property4" : "["A%';

But that technique is difficult to use on data that I'm not familiar with, and it will run slowly because it won't be able to seek to the data in any indexes.

A second option is to create something like a full text index, but unlike full text indexes on XML columns, I will have to fight with all of the quotes and colons and curly braces since there is no support for JSON. Yuck.

Option 3: Search Table

Option 3 is my favorite: normalize the data into a key and value columns that are easy to search:

WITH JSONRoot AS ( 
    SELECT 
        Id as RowId,
        CAST(hierarchyid::GetRoot().ToString() + CAST(ROW_NUMBER() OVER(ORDER BY (SELECT NULL)) AS NVARCHAR(4000)) + '/' AS NVARCHAR(4000)) as [HierarchyId], 
        [key],
        [value],
        CAST([type] AS INT) AS [type] 
    FROM 
        dbo.ImportedJson
        CROSS APPLY OPENJSON(JsonValue,'$') 
    UNION ALL 
    SELECT 
        RowId,
        CAST(JSONRoot.[HierarchyId] + CAST(ROW_NUMBER() OVER(ORDER BY (SELECT NULL)) AS NVARCHAR(4000)) + '/' AS NVARCHAR(4000)), 
        CASE WHEN JSONRoot.[type] = 4 THEN JSONRoot.[key]+'['+t.[key]+']' ELSE t.[key] END,
        t.[value],
        CAST(t.[type] AS INT) 
    FROM 
        JSONRoot 
        CROSS APPLY OPENJSON(JSONRoot.[value],'$') t 
    WHERE 
        JSONRoot.[type] > 3 /* Only parse complex data types */
) 
SELECT 
    RowId,
    CAST([HierarchyId] AS HierarchyId) AS [HierarchyId],
    [key],
    [value],
    [type]
FROM 
    JSONRoot 
ORDER BY 
    RowId,
    [HierarchyId]
GO

Results:

2019-02-21-11-12-41

This query parses each property of the original JSON input so that each key-value pair gets put on its row. Complex JSON objects are broken out into multiple rows, and a HierarchyId is included to maintain parent-child relationships if needed.

Having all of this complex JSON parsed out into a key value table now opens up possibilities of what we can do with it.

Process and Indexing

The above query isn't going to run itself. You'll either need to schedule it or incorporate it into an ETL to parse out your staged JSON data on a regular basis (kind of like full text indexing works asyncronously).

Alternatively you can write the logic into a trigger that fires on new row inserts into your staging table if you need this data in real-time. As with all triggers though, I wouldn't recommend this if your staging table is getting rows added at a high rate.

Once you decide how to store your parsed JSON data, add some indexes that will help your search queries run nice and fast (CREATE NONCLUSTERED INDEX IX_Value_Include ON dbo.ParsedJSON ([value]) INCLUDE ([key],RowId) would probably be a good starting point for many search queries) and you'll be doing a lot better than WHERE JsonValue LIKE '%Property4%'.

COUNT, DISTINCT, and NULLs

Watch this week's video on YouTube

One thing I see fairly often (and am occasionally guilty of myself) is using COUNT(DISTINCT) and DISTINCT interchangeably to get an idea of the number of unique values in a column.

While they will sometimes give you the same results, depending on the data the two methods may not always be interchangeable.

Let's start off with some test data. Important to note are the duplicate values, including the NULLs:

DROP TABLE IF EXISTS ##TestData;
CREATE TABLE ##TestData (Id int identity, Col1 char(1) NULL); 
INSERT INTO ##TestData VALUES ('A'); 
INSERT INTO ##TestData VALUES ('A'); 
INSERT INTO ##TestData VALUES ('B'); 
INSERT INTO ##TestData VALUES ('B');
INSERT INTO ##TestData VALUES (NULL); 
INSERT INTO ##TestData VALUES (NULL); 
CREATE CLUSTERED INDEX CL_Id ON ##TestData (Col1); 

If you want to know how many unique values are in Col1, you might write something like this:

SELECT COUNT(DISTINCT Col1) 
FROM ##TestData 

image-4

Two distinct values great! Except...weren't there some NULLs in there? If we want to see the actual values instead of just seeing the count:

SELECT DISTINCT Col1 
FROM ##TestData 

image-5

Interesting, when doing a plain DISTINCT we see there are three unique values, but in our previous query when we wrote COUNT(DISTINCT Col1) a count of two was returned.

And while the SQL Server documentation specifies that DISTINCT will include nulls while COUNT(DISTINCT) will not, this is not something that many people find intuitive.

Viewing and COUNTing the NULLs

Sometimes we might have to do the opposite of what the default functionality does when using DISTINCT and COUNT functions.

For example, viewing the unique values in a column and not including the nulls is pretty straightforward:

SELECT DISTINCT 
    Col1 
FROM 
    ##TestData 
WHERE  
    Col1 IS NOT NULL 

image-6

Getting the opposite effect of returning a COUNT that includes the NULL values is a little more complicated. One thing we can try to do is COUNT all of our DISTINCT non-null values and then combine it with a COUNT DISTINCT for our NULL values:

select  COUNT(DISTINCT Col1) + COUNT(DISTINCT CASE WHEN Col1 IS NULL THEN 1 END)
from    ##TestData;

image-7

While this logic is easy to interpret, it forces us to read our column of data twice, once for each COUNT - not very efficient on larger sets of data:

image-11

Another thing we can try is to put in a placeholder value (that doesn't exist elsewhere in the column's data) so that COUNT will include it in its calculation:

SELECT 
    /* ~~~ will never exist in our data */
    COUNT(DISTINCT ISNULL(Col1,'~~~')) 
FROM 
    ##TestData 

image-7

image-12

The ISNULL here functions the same as the CASE statement in our first attempt, without having to read the table twice. However, that Compute Scalar occurring to the left of our Clustered Index Scan will start to become painful as our data size increases since SQL Server will need to check each and every row and convert any NULLs it finds. Not to mention after computing all of those ~~~ values, SQL Server needs to re-sort the data to be able to find the DISTINCT values.

That leads us to a final attempt: using a DISTINCT in a derived table (to return our NULL) and then taking a count of that:

SELECT COUNT(*)  
FROM (SELECT DISTINCT Col1 FROM ##TestData) v 

image-7

image-13

This last option eliminates the Compute Scalar and extra sort. While there might be even better options out there for accomplishing the same task, at this point I'm pretty happy with how this will perform.

What's the Point?

SQL Server's documentation says that COUNT(*) returns items in a group while COUNT(Col1) return non nulls in the group.

/* returns items in the group.  Includes nulls */
SELECT COUNT(*) 
FROM ##TestData;

/* returns non null values in group */
SELECT COUNT(Col1) 
FROM ##TestData;

image-14

Because of this, COUNT, DISTINCT, and NULLs have a special relationship with each other that isn't always as intuitive as many people think.

Whenever using COUNT or DISTINCT, make sure to test with NULLs to make sure SQL Server is handling them like you expect. 

Parsing HTML in SQL Server

Watch this week's video on YouTube

Recently I was asked how to parse text out of an HTML fragment stored in SQL Server.

Over the next few seconds my brain processed the following ideas:

  • SQL Server is not meant for parsing HTML. Parse the data with something else.
  • T-SQL does have functions like REPLACE, CHARINDEX, and SUBSTRING though, perfect for searching for tags and returning just the values between them.
  • CLRs could do it, probably using some kind of HTML shredding library. You also might be able to use XMLReader to do something with it...
  • Wait a minute, SQL Server has XML parsing functions built in!

Maybe you see where this is going.

WARNING - this is a terrible idea

Parsing HTML with T-SQL is not a great idea. It's dirty, it's prone to breaking, and it will make your server's CPUs cry that they aren't being used for some nobler cause. If you can parse your HTML somewhere outside of SQL Server, then DO IT THERE.

With that said, if you absolutely need to parse HTML on SQL Server, the best solution is probably to write a CLR.

However, if you are stuck in a bind and plain old T-SQL is the only option available to you, then you might be able to use SQL Server's XML datatype and functions to get this done. I've been there before and can sympathize.

So anyway, here goes nothing:

Using XML to parse HTML

Let's say we have the following fragment of HTML (copied from a bootstrap example template):

DECLARE @html xml = ' 
    <div class="container"> 
        <div class="card-deck mb-3 text-center"> 
            <div class="card-body"> 
                <h1 class="card-title pricing-card-title">$15 <small class="text-muted">/ mo</small></h1> 
                <ul class="list-unstyled mt-3 mb-4"> 
                    <li>20 users included</li> 
                    <li>10 GB of storage</li> 
                    <li>Priority email support</li> 
                    <li>Help center access</li> 
                </ul> 
                <button type="button" class="btn btn-lg btn-block btn-primary">Get started</button> 
            </div> 
        </div> 
    </div> 
'; 

If we wanted to say extract all of the text from this HTML (to allow text mining without all of the tags getting in the way) we could easily do this using the XML nodes() and value() methods:

-- Get all text values from elements 
SELECT 
    T.C.value('.','varchar(max)')  AS AllText
FROM 
    @html.nodes('/') T(C);

image-1

If we want to only extract the items from the list elements, we can write some XQuery to select only those elements:

-- Get a fragment of HTML 
SELECT 
    T.C.value('.','varchar(100)') AS ListValues 
FROM 
    @html.nodes('//*[local-name()=("li")]') T(C); 

image-2

Finally, we can also do things like select HTML fragments based on an attribute to parse further in subsequent steps. If I want to select the div with a class of "card-body", I can write:

-- Get the text from within certain elements 
SELECT 
    T.C.query('.') AS CardBody 
FROM 
    @html.nodes('//div[@class="card-body"]') T(C); 

image-3

Yuck

To reiterate - you don't want to do any of the above unless you have no other choice.

The XML parsing functions will not parse all HTML, so you may need to do some pre-processing on your HTML data first (removing invalid HTML, closing tags, etc...).

It works beautifully in the above example but your results may very. Good luck!