
4 Reasons To Avoid VARCHAR(8000)

Watch this week's video on YouTube

When first learning SQL Server, I had the brilliant idea of defining every column on my table as VARCHAR(8000). After all, the VARCHAR datatype only consumes as much storage space as the string itself (plus two bytes), so even if I defined the max length of a column as 8000, I wouldn't be wasting storage space if my data was smaller than that.

My development life would be easier too since I wouldn't have to think about the kind of data I was storing in my columns; I could define everything as VARCHAR(8000) once and never have to go back to make any changes. Brilliant!

While I was correct about not wasting storage space, it turns out that defining every column as VARCHAR(8000) is a terrible idea.

What follows is a (partial) list of reasons to avoid defining every column in your table as VARCHAR(8000).

1. Documentation

When columns are defined accurately with specific data types and lengths, they help provide understanding about the data they contain.

For example, imagine you see the following column in a table:

ZipCode VARCHAR(10)

If your database is full of U.S. data, this column definition provides enough information to make some pretty good assumptions:

  • With a length of 10, we can expect some of the data to be in the 5 digit zip code + 4 digit extended zip code format (e.g. 12345-6789)
  • Choosing VARCHAR instead of CHAR means there are probably some entries that contain 5 digit zip codes and some with 5+4 digit zip codes (and a dash)

We can't be 100% confident about our assumptions until we verify them by profiling the data itself, but a lot of times having decent guesses is good enough.
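
For example, a quick profile query (a minimal sketch, assuming a hypothetical dbo.Customers table containing the ZipCode column) can confirm or refute those assumptions:

-- Hypothetical profiling query: how long are the zip codes we actually store?
SELECT
    LEN(ZipCode) AS ZipCodeLength,
    COUNT(*) AS RowsWithThisLength
FROM
    dbo.Customers
GROUP BY
    LEN(ZipCode)
ORDER BY
    ZipCodeLength;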

2. Validation

Your application (hopefully) has some good client and server side validation built in. Those validations don't always work, so it never hurts to have some extra validation on the database side of things as well.

Imagine the zip code field is directly next to the "Delivery Instructions" box on our app's "Shipping Details" page. It's not a stretch to think that a user will someday accidentally type in the special delivery instructions ("Please leave the package behind the house next to the giant pineapple bush") into the zip code field instead of the instructions field.

If your ZipCode field is defined as VARCHAR(8000), that set of delivery instructions will make its way into the database in the completely wrong column.

Have fun cleaning up that dirty data. If instead the field was correctly defined as ZipCode VARCHAR(10), the insert would fail and you would prevent that erroneous data from entering your database.
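
To illustrate (a minimal sketch with a made-up table name), the narrower definition rejects the misplaced text outright:

-- Hypothetical shipping table with a realistically sized ZipCode column
CREATE TABLE dbo.ShippingDetails
(
    Id int IDENTITY PRIMARY KEY,
    ZipCode varchar(10)
);

-- Fails with a "String or binary data would be truncated" error,
-- keeping the stray delivery instructions out of the ZipCode column
INSERT INTO dbo.ShippingDetails (ZipCode)
VALUES ('Please leave the package behind the house next to the giant pineapple bush');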

3. Indexing

Ever try to index a column (or set of columns) whose total size is more than 1700 bytes (or 900 bytes in older versions)?

[Screenshot: creating the index succeeds but warns about the 1700 byte maximum key length]

SQL Server will let you create indexes that contain keys that may potentially be greater than 1700 bytes, but as soon as you try to insert large data into those indexes you will be greeted with this wall of error:

[Screenshot: the insert fails with an error because the index entry exceeds the maximum key length for nonclustered indexes]

Of course there are some tricks you can use to index those wide key combinations, but you don't want to use those unless you absolutely have to.
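
If you want to see this behavior for yourself, here is a minimal sketch (table and index names are made up) that reproduces both the warning and the error:

-- Hypothetical repro: two VARCHAR(8000) columns in one index key
CREATE TABLE dbo.WideKeyTest
(
    Id int IDENTITY PRIMARY KEY,
    ColumnA varchar(8000),
    ColumnB varchar(8000)
);

-- Succeeds, but warns that the key could exceed the 1700 byte maximum
CREATE NONCLUSTERED INDEX IX_WideKey ON dbo.WideKeyTest (ColumnA, ColumnB);

-- Small values insert without issue...
INSERT INTO dbo.WideKeyTest (ColumnA, ColumnB) VALUES ('a','b');

-- ...but a row whose combined key exceeds 1700 bytes fails with the error above
INSERT INTO dbo.WideKeyTest (ColumnA, ColumnB)
VALUES (REPLICATE('a',1000), REPLICATE('b',1000));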

4. Inaccurate Estimates

Imagine we have 1000 rows of data in our table with a VARCHAR(8000) column:

CREATE TABLE #Varchar8000Test
(
    Id int identity PRIMARY KEY,
    BigColumn varchar(8000)
);

INSERT INTO #Varchar8000Test VALUES ('a');
GO 1000

You'll notice that each of those rows' BigColumn fields contains only 3 bytes of data (1 byte for "a" + 2 bytes of varchar overhead). That's about 3 KB total for the whole table of data (plus whatever the int column takes up).

You would therefore think that SQL Server would know and use this information when executing a query:

SELECT * FROM #Varchar8000Test ORDER BY BigColumn
OPTION(MAXDOP 1)

However it doesn't:

[Screenshot: query plan showing an estimated row size of about 4 KB]

SQL Server estimates that each row is ~4 KB! That's because SQL Server doesn't sample the data to estimate the average row size for VARCHAR columns; instead it assumes each value will fill half of the column's defined length (8000 / 2 = 4000 bytes, which together with the Id column and row overhead accounts for the roughly 4 KB estimate).

This carries over to other parts of the query plan too, like memory grants:

[Screenshot: query plan showing a memory grant of several megabytes]

Fitting this whole table into memory should only take about 7 KB (3 KB for our BigColumn data), but SQL Server reserves significantly more than that. In this example an excess of 6 MB isn't a huge deal, but if you are working with a table with many more rows, those excessive memory grants can use up hundreds of megabytes or even gigabytes of memory unnecessarily.
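
For comparison, here is a minimal sketch of the same test with a right-sized column (only the column definition changes); the optimizer now bases its estimates on half of 100 bytes instead of half of 8000:

-- Hypothetical comparison table with a realistic column length
CREATE TABLE #Varchar100Test
(
    Id int identity PRIMARY KEY,
    BigColumn varchar(100)
);

INSERT INTO #Varchar100Test VALUES ('a');
GO 1000

-- The estimated row size (and the resulting memory grant) is now much smaller
SELECT * FROM #Varchar100Test ORDER BY BigColumn
OPTION(MAXDOP 1);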

...and more!

This list is only a start: there are plenty of additional reasons why defining every column on your table as VARCHAR(8000) is a bad idea. My goal with this post wasn't to create a comprehensive list, but rather to provide enough compelling reasons to have dissuaded my novice self a few years ago from even considering this as an option.

Gaps and Islands Across Date Ranges

Watch this week's video on YouTube

In a traditional gaps and islands problem, the goal is to identify groups of continuous data sequences (islands) and groups of data where the sequence is missing (gaps).

Many people encounter gaps and islands problems when dealing with ranges of dates, and recently I did too, but with an interesting twist:

How do you determine gaps and islands of data that has overlapping date ranges?

Overlapping Date Ranges

First let's start with some sample data to help follow along. The peculiarity to pay attention to is that the date ranges for each row might be completely discrete, completely enclosed, or they may overlap each other on either end:

DROP TABLE IF EXISTS #OverlappingDateRanges;
CREATE TABLE #OverlappingDateRanges (StartDate date, EndDate date);

INSERT INTO #OverlappingDateRanges
SELECT '8/24/2017', '9/23/2017'  UNION ALL
SELECT '8/24/2017', '9/20/2017'  UNION ALL 
SELECT '9/23/2017', '9/27/2017'  UNION ALL 
SELECT '9/25/2017', '10/10/2017' UNION ALL
SELECT '10/17/2017','10/18/2017' UNION ALL 
SELECT '10/25/2017','11/3/2017'  UNION ALL 
SELECT '11/3/2017', '11/15/2017'

SELECT * FROM #OverlappingDateRanges;

[Screenshot: the seven sample rows of overlapping date ranges]

What's unusual about this data is that while the end date of some rows matches the start date of other rows (e.g. rows 6 and 7), some date ranges are fully contained within other rows (e.g. row 2 is contained in row 1) while others overlap on only one boundary (e.g. row 4's EndDate doesn't overlap with any other row, but its StartDate is before row 3's EndDate).

Solution

While there are several ways gaps and islands problems can be solved, here is the solution using window functions that made the most sense to me.

First, we need to create a row number column based on the sequence of start and end dates, as well as bring the previous row's EndDate to the current row:

SELECT
    ROW_NUMBER() OVER(ORDER BY StartDate,EndDate) AS RN,
    StartDate,
    EndDate,
    LAG(EndDate,1) OVER (ORDER BY StartDate, EndDate) AS PreviousEndDate
FROM
    #OverlappingDateRanges

[Screenshot: query results with RN, StartDate, EndDate, and PreviousEndDate columns]

Next we add two more fields:

  • IslandStartInd: indicates when a new island begins by checking whether the current row's StartDate occurs after the previous row's EndDate. We don't really need this column for the example, but I find it helpful to see what's going on in the next column.
  • IslandId: indicates which island number the current row belongs to.

SELECT
    *,
    CASE WHEN Groups.PreviousEndDate >= StartDate THEN 0 ELSE 1 END AS IslandStartInd,
    SUM(CASE WHEN Groups.PreviousEndDate >= StartDate THEN 0 ELSE 1 END) OVER (ORDER BY Groups.RN) AS IslandId
FROM
(
    SELECT
        ROW_NUMBER() OVER(ORDER BY StartDate,EndDate) AS RN,
        StartDate,
        EndDate,
        LAG(EndDate,1) OVER (ORDER BY StartDate, EndDate) AS PreviousEndDate
    FROM
        #OverlappingDateRanges
) Groups

[Screenshot: query results with the IslandStartInd and IslandId columns added]

The IslandId field is just a SUM() of the IslandStartInd, similar to a window function running total.

Our final step is to aggregate our rows using an old-fashioned GROUP BY to return the minimum start date and maximum end date from each of our islands:

SELECT
    MIN(StartDate) AS IslandStartDate,
    MAX(EndDate) AS IslandEndDate
FROM
    (
    SELECT
        *,
        CASE WHEN Groups.PreviousEndDate >= StartDate THEN 0 ELSE 1 END AS IslandStartInd,
        SUM(CASE WHEN Groups.PreviousEndDate >= StartDate THEN 0 ELSE 1 END) OVER (ORDER BY Groups.RN) AS IslandId
    FROM
    (
        SELECT
            ROW_NUMBER() OVER(ORDER BY StartDate,EndDate) AS RN,
            StartDate,
            EndDate,
            LAG(EndDate,1) OVER (ORDER BY StartDate, EndDate) AS PreviousEndDate
        FROM
            #OverlappingDateRanges
    ) Groups
) Islands
GROUP BY
    IslandId
ORDER BY 
    IslandStartDate

[Screenshot: final results showing each island's start and end dates]

Voilà

Regardless of how messy the date ranges within an island are, this technique neatly identifies gaps in the data and returns the start and end of each island's date range. I often find myself using this technique in scenarios where I need to aggregate some date-based transactional data that otherwise would be too difficult to summarize with aggregate functions alone.
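
And if you need the gaps themselves rather than just the islands, one option (a minimal sketch built on the final query above) is to compare each island's start date to the end date of the island before it:

-- Hypothetical follow-up: derive the gaps between the islands identified above
WITH IslandRanges AS
(
    SELECT
        MIN(StartDate) AS IslandStartDate,
        MAX(EndDate) AS IslandEndDate
    FROM
    (
        SELECT
            StartDate,
            EndDate,
            SUM(CASE WHEN PreviousEndDate >= StartDate THEN 0 ELSE 1 END) OVER (ORDER BY RN) AS IslandId
        FROM
        (
            SELECT
                ROW_NUMBER() OVER(ORDER BY StartDate,EndDate) AS RN,
                StartDate,
                EndDate,
                LAG(EndDate,1) OVER (ORDER BY StartDate, EndDate) AS PreviousEndDate
            FROM
                #OverlappingDateRanges
        ) Groups
    ) Islands
    GROUP BY
        IslandId
)
SELECT
    LAG(IslandEndDate) OVER (ORDER BY IslandStartDate) AS GapStartsAfter,
    IslandStartDate AS GapEndsBefore
FROM
    IslandRanges
ORDER BY
    IslandStartDate;
-- The first row returns NULL for GapStartsAfter since no island precedes it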

Dipping into the Cookie Jar


This post is a response to this month's T-SQL Tuesday #112 prompt by Shane O'Neill.  T-SQL Tuesday is a way for the SQL Server community to share ideas about different database and professional topics every month. In this month's topic Shane asks us to describe past accomplishments that help keep us going.


Before the start of each presentation I give, I'm a nervous wreck.

It's not that I don't like presenting (I do), but in the minutes before my presentation start time I'm always filled with dread. Once I start my talk and am in the flow of my content, the nerves usually subside. Those first few minutes are always rough though.

Before speaking I try to calm myself by going through a few various techniques, one of which is thinking about previous successful speaking engagements.

You've Done This Before

I rarely focus on a single past speaking engagement; rather I look at all of my appearances and pick one that best helps for the current situation.

Every presentation I give has some kind of new elements associated with it; some of these might be environmental like a bigger audience or a strange room setup, but others are self-imposed like wanting to try out a new story-telling technique or an interactive audience exercise.

At this point, I usually have enough previous speaking experiences to try and rationalize away any stressful thoughts:

  • "Speaking is scary." - You've done it before, you can do it again.
  • "This is a big audience." - Your online audiences have definitely been bigger.
  • "This joke will bomb." - You won't know until you try. And your past session reviews indicate that people think you are funny.
  • etc...

The great thing here is that I'm always able to find a way to rationalize some successful past experience as having been comparable or more difficult than the current scenario. Even when I only had a couple of speaking sessions under my belt, I could think back to when I successfully taught my coworkers something, or had to teach my family members how to do something technical.

100% Survival Rate

I still get nervous before speaking, but at least I can also remind myself that I've survived every previous time I've done it.

I'm not sure my nervousness will ever go away, but having past successes to think back on always helps quiet those nerves just a little bit.

Time Zones and Daylight Saving Time

Watch this week's video on YouTube

AT TIME ZONE is great because it makes it easy to perform daylight saving time and time zone conversions in our queries.

However, when using AT TIME ZONE for these types of conversions there are a couple key points to remember to ensure you are getting the correct results. Let's look at an example.

Thanks to reader Jeff Konicky for the inspiration for this post and allowing me to share it.

DST Conversions

This year, Daylight Saving Time kicks in at 2019-03-10 02:00:00 in Eastern Standard Time, meaning that the time zone switches from being 5 hours behind UTC to only 4 hours behind UTC.

If we have two datetime2s with UTC data, one right before DST kicks in and one right after, we should be able to use AT TIME ZONE to convert these UTC times to Eastern Standard/Daylight Time no problem:

DECLARE 
    @PreDST datetime2 = '2019-03-10 06:59:00',
    @PostDST datetime2 = '2019-03-10 07:00:00';

SELECT 
    @PreDST AT TIME ZONE 'Eastern Standard Time' AS [PreDST should be -05:00],
    @PostDST AT TIME ZONE 'Eastern Standard Time' AS [PostDST should be -04:00];

[Screenshot: both values are returned with a -04:00 offset]

While that sounds great, the results above show it clearly didn't work - both of those datetime2s are showing as already being in daylight saving time (offset of -04:00).

Why did this happen?

SQL Server Doesn't Know Your Data Is In UTC

The fallacy above is that I said our two datetime2s are in UTC, but SQL Server doesn't actually know this. The datetime2 (and datetime) datatype doesn't allow for time zone offsets, so SQL Server really doesn't know what time zone the data is in.

Using AT TIME ZONE on a datetime2 without offset information causes SQL Server to "...[assume] that [the datetime] is in the target time zone". That explains why the two datetime2s above, intended to be in UTC, are actually seen as Eastern Daylight Time by SQL Server.

Specifying the Time Zone Offset

So how do we tell SQL Server that our data is in UTC so AT TIME ZONE functions like we want?

One option is to use that assumption of AT TIME ZONE in our favor: first use it to label the datetime2s as being in UTC, and then convert them to Eastern time:

SELECT 
    @PreDST AT TIME ZONE 'UTC' AT TIME ZONE 'Eastern Standard Time' AS [PreDST should be -05:00],
    @PostDST AT TIME ZONE 'UTC' AT TIME ZONE 'Eastern Standard Time' AS [PostDST should be -04:00];

[Screenshot: the values now show the correct -05:00 and -04:00 offsets]

Now we have the correct -05:00 and -04:00 offsets applied.

A cleaner solution would be to store the original data using the datetimeoffset datatype, which records the offset from UTC alongside the datetime:

DECLARE 
    @PreDST datetimeoffset = '2019-03-10 06:59:00+00:00',
    @PostDST datetimeoffset = '2019-03-10 07:00:00+00:00';

SELECT 
    @PreDST AT TIME ZONE 'Eastern Standard Time' AS [PreDST should be -05:00],
    @PostDST AT TIME ZONE 'Eastern Standard Time' AS [PostDST should be -04:00];

[Screenshot: the datetimeoffset values convert to the correct -05:00 and -04:00 offsets]

Don't Assume

Whenever working with datetime2 (or datetime) in SQL Server, you can't assume SQL Server knows the time zone of your data. Your business logic may indicate that you store dates in UTC, but SQL Server has no way of knowing that without your code explicitly stating it using something like AT TIME ZONE 'UTC' or storing your data in the datetimeoffset datatype.
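
For example, here is a minimal sketch (the table and column names are hypothetical) of converting a datetime2 column that your application stores in UTC by convention:

-- Hypothetical table where EventTimeUtc is a datetime2 stored as UTC by convention
SELECT
    EventTimeUtc,
    EventTimeUtc AT TIME ZONE 'UTC' AT TIME ZONE 'Eastern Standard Time' AS EventTimeEastern
FROM
    dbo.Events;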

Searching Complex JSON Data

Watch this week's video on YouTube

Computed column indexes make querying JSON data fast and efficient, especially when the schema of the JSON data is the same throughout a table.
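
As a refresher, here is a minimal sketch of that computed column technique (the table, column, and JSON property names are made up):

-- Hypothetical example: expose a JSON property as an indexed computed column
CREATE TABLE dbo.Orders
(
    Id int IDENTITY PRIMARY KEY,
    OrderJson nvarchar(max),
    CustomerName AS CAST(JSON_VALUE(OrderJson, '$.CustomerName') AS nvarchar(100))
);

CREATE NONCLUSTERED INDEX IX_Orders_CustomerName ON dbo.Orders (CustomerName);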

It's also possible to break out a well-known complex JSON structure into multiple SQL Server tables.

However, what happens if you have different JSON structures being stored in each row of your database and you want to write efficient search queries against all of the rows of your complex JSON strings?

Complex JSON

Let's start out by creating a staging table that contains various fragments of JSON stored in an nvarchar column:

DROP TABLE IF EXISTS dbo.ImportedJson;
GO
CREATE TABLE dbo.ImportedJson
(
    Id int IDENTITY,
    JsonValue nvarchar(max)
);
GO

INSERT INTO dbo.ImportedJson (JsonValue) VALUES (N'{ 
    "Property1" : "Value1", 
    "Property2" : [1,2,3]
}');

INSERT INTO dbo.ImportedJson (JsonValue) VALUES (N'{ 
    "Property1" : "Value2", 
    "Property3" : [1,2,3], 
    "Property4" : ["A","B","C",null], 
    "Property5" : { 
                    "SubProp1": "A", 
                    "SubProp2": { 
                                    "SubSubProp1":"B", 
                                    "SubSubProp2": 1.2,
                                    "SubSubProp3" : true
                                } 
                    }, 
    "Property6" : [{"ArrayProp":"A"},{"ArrayProp":"B"}], 
    "Property7" : 123, 
    "Property8" : null 
}');

INSERT INTO dbo.ImportedJson (JsonValue) VALUES (N'{ 
    "Property8" : "Not null", 
    "Property9" : [4,5,6]
}');


SELECT * FROM dbo.ImportedJSON;

And the results: 

[Screenshot: the three rows of JSON stored in dbo.ImportedJson]

Search Queries

If I want to search these values I have a few options.

First, I could write something like:

SELECT * FROM dbo.ImportedJSON WHERE JsonValue LIKE '%"Property4" : [[]"A%'; /* [[] matches a literal [ in LIKE */

But that technique is difficult to use on data that I'm not familiar with, and it will run slowly because it won't be able to seek to the data in any indexes.

A second option is to create something like a full text index, but unlike full text indexes on XML columns, I will have to fight with all of the quotes and colons and curly braces since there is no support for JSON. Yuck.

Option 3: Search Table

Option 3 is my favorite: normalize the data into key and value columns that are easy to search:

WITH JSONRoot AS ( 
    SELECT 
        Id as RowId,
        CAST(hierarchyid::GetRoot().ToString() + CAST(ROW_NUMBER() OVER(ORDER BY (SELECT NULL)) AS NVARCHAR(4000)) + '/' AS NVARCHAR(4000)) as [HierarchyId], 
        [key],
        [value],
        CAST([type] AS INT) AS [type] 
    FROM 
        dbo.ImportedJson
        CROSS APPLY OPENJSON(JsonValue,'$') 
    UNION ALL 
    SELECT 
        RowId,
        CAST(JSONRoot.[HierarchyId] + CAST(ROW_NUMBER() OVER(ORDER BY (SELECT NULL)) AS NVARCHAR(4000)) + '/' AS NVARCHAR(4000)), 
        CASE WHEN JSONRoot.[type] = 4 THEN JSONRoot.[key]+'['+t.[key]+']' ELSE t.[key] END,
        t.[value],
        CAST(t.[type] AS INT) 
    FROM 
        JSONRoot 
        CROSS APPLY OPENJSON(JSONRoot.[value],'$') t 
    WHERE 
        JSONRoot.[type] > 3 /* Only parse complex data types */
) 
SELECT 
    RowId,
    CAST([HierarchyId] AS HierarchyId) AS [HierarchyId],
    [key],
    [value],
    [type]
FROM 
    JSONRoot 
ORDER BY 
    RowId,
    [HierarchyId]
GO

Results:

[Screenshot: the parsed key/value rows with their HierarchyId values]

This query parses each property of the original JSON input so that each key-value pair gets put on its own row. Complex JSON objects are broken out into multiple rows, and a HierarchyId is included to maintain parent-child relationships if needed.

Having all of this complex JSON parsed out into a key value table now opens up possibilities of what we can do with it.

Process and Indexing

The above query isn't going to run itself. You'll either need to schedule it or incorporate it into an ETL process to parse out your staged JSON data on a regular basis (similar to how full text indexing works asynchronously).

Alternatively you can write the logic into a trigger that fires on new row inserts into your staging table if you need this data in real-time. As with all triggers though, I wouldn't recommend this if your staging table is getting rows added at a high rate.
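
To make that concrete, here is a minimal sketch of a destination table (dbo.ParsedJSON and its column sizes are my assumptions, mirroring the output of the parsing query above) that the scheduled job or trigger could populate by wrapping that query in an INSERT INTO dbo.ParsedJSON ... SELECT ...:

-- Hypothetical destination table matching the columns returned by the parsing query
CREATE TABLE dbo.ParsedJSON
(
    RowId int NOT NULL,
    [HierarchyId] hierarchyid NOT NULL,
    [key] nvarchar(4000) NULL,
    [value] nvarchar(4000) NULL, -- kept under max so it can be used as an index key
    [type] int NOT NULL
);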

Once you decide how to store your parsed JSON data, add some indexes that will help your search queries run nice and fast (CREATE NONCLUSTERED INDEX IX_Value_Include ON dbo.ParsedJSON ([value]) INCLUDE ([key],RowId) would probably be a good starting point for many search queries) and you'll be doing a lot better than WHERE JsonValue LIKE '%Property4%'.
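
Once the parsed data and index are in place, a typical search stays simple (again, just a sketch against the hypothetical dbo.ParsedJSON table) and can take advantage of that index:

-- Find every staged JSON document that contains the value 'A' anywhere in its structure
SELECT DISTINCT
    RowId
FROM
    dbo.ParsedJSON
WHERE
    [value] = 'A';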