Data Type Precedence and Implicit Conversions

Published Tue 09 April 2019 in SQL > Development

SQL Server needs to make sure data types match when performing operations that involve multiple pieces of data.

When the data types do not match, SQL Server has to implicitly convert the data before performing any operations.

While most of the time these implicit conversions go unnoticed, they are important to understand since they can lead to unexpected results.

When 4.4/.44 Doesn't Equal 10

Let's start with this example:

SELECT 4.4/CAST(.44 AS VARCHAR(5))

Ignoring for a moment that our denominator is of type VARCHAR, if we do some quick mental math or use a calculator, we can see that the answer should be 10:

Simple-division

However, if we look at the result SQL Server returns, it's strangely 11:

2019-04-08-16-24-28

To understand why this happens, we need to understand SQL Server's data type precedence logic.

Data Type Precedence

If we start with a simpler version of this example, we'll see SQL Server does in fact know how to perform math and return an answer of 10:

SELECT 4.4/.44

2019-04-08-16-27-16

We can use the SQL_VARIANT_PROPERTY() function to see what data types SQL Server is assuming we are using in our calculation:

SELECT 
    SQL_VARIANT_PROPERTY(4.4,'BaseType'),
    SQL_VARIANT_PROPERTY(.44,'BaseType'),
    SQL_VARIANT_PROPERTY(4.4/.44,'BaseType')

2019-04-08-16-27-58

In this case, since the data types of both components are numeric, SQL Server doesn't have to break a sweat making any drastic conversions to give us our expected answer. If we instead give it something a little more challenging:

SELECT 4.4/CAST(.44 AS FLOAT)

2019-04-08-16-28-41

Here we still return the result of 10, however SQL Server had to do a little more work. We know from the previous example that 4.4 on its own is of type numeric, and in this example we are explicitly casting .44 to a float. SQL Server isn't able to perform the division operation directly on two different data types, so it refers to its data type precedence table to decide which value to convert to a matching datatype.

In the linked table above, the float data type appears higher on the list than numeric (synonym: decimal) data type. This causes SQL Server to convert our numeric 4.4 to a float before dividing.

While SQL Server is doing extra work behind the scenes that we didn't explicitly request, we can't be too angry with it since it still is giving us the "correct" answer.

"Incorrect" Conversions

Let's look at something a little more dastardly:

SELECT CAST(4.4 AS NUMERIC)/CAST(.44 AS FLOAT)

You might think this should also return 10 based on the previous example, but in fact it returns 9.090909:

2019-04-08-16-30-02

While we are still witnessing implicit conversion here (the numeric gets converted to a float in order to allow SQL Server to perform the division), we are also experiencing a case of default data type precision and scale. If we use the SQL_VARIANT_PROPERTY() function again to not only reveal base type but also precision and scale, we'll notice that when we let SQL Server "guess" the scale, it correctly chooses 1 decimal place, while when we use the default scale associated with numeric we get 0:

SELECT 
    SQL_VARIANT_PROPERTY(4.4,'BaseType'),
    SQL_VARIANT_PROPERTY(4.4,'Precision'),
    SQL_VARIANT_PROPERTY(4.4,'Scale') 

SELECT 
    SQL_VARIANT_PROPERTY(CAST(4.4 AS NUMERIC),'BaseType'),
    SQL_VARIANT_PROPERTY(CAST(4.4 AS NUMERIC),'Precision'),
    SQL_VARIANT_PROPERTY(CAST(4.4 AS NUMERIC),'Scale')

2019-04-08-16-32-55

What this really means is that the decimal portion of 4.4 is getting chopped off, leaving us with an equation of 4 / .44 = 11.

Putting Everything Together

So back to our original example that returns 11:

SELECT 4.4/CAST(.44 AS VARCHAR(5))

What exactly is going on here? Well for starters, data type precedence is forcing SQL Server to convert the VARCHAR .44 to a numeric. But a numeric with what precision and scale?

SELECT 
    SQL_VARIANT_PROPERTY(4.4,'BaseType'),
    SQL_VARIANT_PROPERTY(4.4,'Precision'),
    SQL_VARIANT_PROPERTY(4.4,'Scale')

2019-04-08-16-33-43

Since SQL Server determines that our numerator is NUMERIC(2,1), it converts the denominator to a NUMERIC(2,1) as well. This means instead of dividing by .44, we end up dividing by .4 which results in 11:

-- An explicit version of the same calculation
SELECT CAST(4.4 AS NUMERIC(2,1))/CAST(.44 AS NUMERIC(2,1))

2019-04-08-16-34-30

SQL Server Isn't Wrong

While it's easy to blame SQL Server for not knowing how to perform simple mathematical operations, it's us the developers who are to blame. SQL Server is a piece of software following rules for how to handle unclear situations that we input into it.

Having SQL Server throw an error instead of assuming data types and implicitly converting data on our behalf would make things less ambiguous, but it would also make simple operations a lot more tedious (looking at you SSIS).

In reality, the best solution is to accurately define data types from initial creation and to always be explicit with data types when the value of your computations matters.

Testing with Temporary Stored Procedures

Published Tue 02 April 2019 in SQL > Development

Watch this week's video on YouTube

A while back I learned that it's possible to create temporary stored procedures in SQL Server.

I never put that knowledge into practice however because I struggled to think of a good use case for when a temporary stored procedure would be preferable to a permanent stored procedure.

Not long ago I encountered a scenario where using a temporary stored procedure was the perfect solution to my problem.

Building New Tables

Recently, I had to build a new version of a legacy table because the legacy table's upstream data source was going to be retired.

The new table would contain all the same data as the legacy table, but populated from the new data source. Additionally, the new table would also include additional rows and columns. After building the final table, the plan was to create a view to replace the functionality of the legacy table.

I had to spend quite a bit of time writing a fairly elaborate query to make the data for the new data source match what was appearing in the legacy table (remember my gaps and islands post from a few weeks back? Imagine that on steroids).

Before building out the new ETL and tables though, I wanted to be able to test that the new query was producing the correct results. This was challenging because some of the data sources were on other servers. I was in a catch 22: couldn't test the data because I didn't build the ETL yet, but I didn't want to build the ETL until I tested the results.

Less Than Ideal Options

To make matters worse, I was only able to test my query using data in the production environment.

I thought of putting my new query into a stored procedure to make the logic easier to test. But with the production elevate process being length and restricted, I couldn't easily put my parameterized query into a permanent stored procedure to test my query with.

At this point I had a few options:

Don't test anything and just build out the tables and ETL in production. Cross my fingers and hope it all works, fixing any issues after the fact.
Create a permanent stored procedure with the query and elevate it to production. Hope that I don't have to make changes and go through the slow elevate process again.
Run the query over and over again with different parameters.

I didn't like the first two options because of the amount of time I would lose trying to elevate new tables or procedures into production.

The third option wasn't ideal either because while it would allow me to iterate quickly, documenting all of my tests would involve a massive file that would not be easy to navigate or change.

What I needed was a way to run a query through many different parameters in a concise manner without making any permanent production changes.

Temporary Stored Procedures for Regression Testing

Since I have access to create temporary stored procedures in production, I was able to create a temporary procedure containing my complex query:

CREATE PROCEDURE #ComplexBusinessLogic
  @parm1 int
AS
BEGIN
  /* This isn't the actual query.  The real one was ugly and hundreds of lines long.*/
  SELECT CASE @parm1
    WHEN 1 THEN 'A'
    WHEN 2 THEN 'B'
    WHEN 3 THEN 'C'
    END AS ResultValue
END;

As I mention in the comment of the stored procedure, this query was huge. If I wanted to execute it multiple times with different parameter values to test with, I'd have a gigantic file that would be difficult to navigate and easy to make errors in.

Instead, the temporary stored procedure made it easy to document my tests and execute the query as many times as needed:

/* Test for when parameter is 1 */
EXEC #ComplexBusinessLogic @parm1 = 1;

/* Test for scenario 2 */
EXEC #ComplexBusinessLogic @parm1 = 2;

/* The rarely occuring but very important test scenario 3 */
EXEC #ComplexBusinessLogic @parm1 = 3

I was able then to clearly define all of my tests and run them against the production data without creating any permanent production objects. This was great because I did find errors with my logic, but I was able to fix them and retry my tests over and over again until everything ran without issues.

Testing in Production

Ideally I wouldn't have to use this solution. It would have been much better to have data to test with in a non-production environment. Using a temporary stored procedure to test in production is a hack to get around environment restrictions.

However, what is ideal and what is real-world doesn't always align. Sometimes a hack helps meet deadlines when better options aren't available. In this instance, temporary stored procedures helped make testing a breeze.

Joining on NULLs

Published Tue 26 March 2019 in SQL > Quirks

Watch this week's video on YouTube

It's important to be aware of columns that allow NULL values since SQL Server may handle NULLs differently than you might expect.

Today I want to look at what things to consider when joining on columns containing NULL values.

Natural, Composite, NULLable keys

Let's pretend we have an Account table containing the accounts of various users and an AccountType table describing the different types of accounts:

2019-03-22-12-33-46

These tables have the unfortunate design characteristics of:

They use a natural, composite key of YearOpened and AccountType
NULL is the valid default for AccountType

Not that either of the above attributes are outright bad, just that we need to handle them appropriately. For example, if we want to bring back a description of each user's account, we might write a query with an inner join like this:

SELECT
    a.UserId,
    at.YearOpened,
    at.AccountType,
    at.Description
FROM
    dbo.Account a
    INNER JOIN dbo.AccountType at
        ON a.YearOpened = at.YearOpened
        AND a.AccountType = at.AccountType

Only to discover the rows with NULLs are not present:

2019-03-22-12-31-39

Joining on NULLs

Since it's not possible to join on NULL values in SQL Server like you might expect, we need to be creative to achieve the results we want.

One option is to make our AccountType column NOT NULL and set some other default value. Another option is to create a new column that will act as a surrogate key to join on instead.

Both of the above options would fix the problem at the source, but what about if we can only make changes to our queries?

One common approach is to convert the NULLs to some other non-NULL value using a function like COALESCE or ISNULL:

SELECT
    a.UserId,
    at.YearOpened,
    at.AccountType,
    at.Description
FROM
    dbo.Account a
    INNER JOIN dbo.AccountType at
        ON a.YearOpened = at.YearOpened
        AND ISNULL(a.AccountType,'`') = ISNULL(at.AccountType,'`')

2019-03-22-12-49-47

While this returns the results we want, there are two major issues with this approach:

In the above example we converted NULLs to the ` character. If we had a valid ` character in our data, we would get logically incorrect joins.
Our query can no longer perform index seeks.

The first issue isn't a huge deal if you can guarantee the character you are replacing NULLs with will never appear in the column of data.

The second issue is more important since ISNULL prevents your query from being SARGable and will cause poor performance on large tables of data.

2019-03-22-12-51-04

Those Compute Scalar operators are forcing SQL Server to Scan the indexes and compute a value for every row.

A More Efficient Solution

If using a function like ISNULL hurts the performance of our queries, what can we do instead?

SELECT
    a.UserId,
    at.YearOpened,
    at.AccountType,
    at.Description
FROM
    dbo.Account a
    INNER JOIN dbo.AccountType at
        ON a.YearOpened = at.YearOpened
        AND (a.AccountType = at.AccountType OR (a.AccountType IS NULL AND at.AccountType IS NULL))

2019-03-22-12-48-59

This produces the same exact results while allowing SQL Server to Seek when possible and avoid costly row by row computations:

There are no seeks here since I don't have any additional filters, but the lack of Compute Scalar operators should be enough to prove the point.

While there are a few more variations that can achieve the same results using different execution plans (writing a query that joins non-nulls and unioning it with a query that selects only the nulls, using a computed column to convert the NULLs to non-null values, etc...) the key to good performance is to choose a solution that will not force SQL Server to compute values for every single row.

4 Reasons To Avoid VARCHAR(8000)

Published Tue 19 March 2019 in SQL > Quirks

Watch this week's video on YouTube

When first learning SQL Server, I had the brilliant idea of defining every column on my table as VARCHAR(8000). After all, the VARCHAR datatype only consumes as much storage space as the string itself (plus two bytes), so even if I defined the max length of a column as 8000, I wouldn't be wasting storage space if my data was smaller than that.

My development life would be easier too since I wouldn't have to think about the kind of data I was storing in my columns; I could define everything as VARCHAR(8000) once and never have to go back to make any changes. Brilliant!

While I was correct about not wasting storage space, it turns out the idea of making every column VARCHAR(8000) is a terrible idea.

What follows is a (partial) list of reasons to avoid defining every column in your table as VARCHAR(8000).

1. Documentation

When columns are defined accurately with specific data types and lengths, they help provide understanding about the data they contain.

For example, imagine you see following column in a table:

ZipCode VARCHAR(10)

If your database is full of U.S. data, this column definition provides enough information to make some pretty good assumptions:

With a length of 10, we can expect some of the data to be in the 5 digit zip code + 4 digit extended zip code format (eg. 12345-6789)
Choosing VARCHAR instead of CHAR means there are probably some entries that contain 5 digit zip codes and some with 5+4 digit zip codes (and a dash)

We can't be 100% confident about our assumptions until we verify them by profiling the data itself, but a lot of times having decent guesses is good enough.

2. Validation

Your application (hopefully) has some good client and server side validation built in. Those validations don't always work so it never hurts to have some extra validation on the database side of things as well.

Imagine the zip code field is directly next to the "Delivery Instructions" box on our app's "Shipping Details" page. It's not a stretch to think that a user will someday accidentally type in the special delivery instructions ("Please leave the package behind the house next to the giant pineapple bush") into the zip code field instead of the instructions field.

If your ZipCode field is defined as VARCHAR(8000), that set of delivery instructions will make its way into the database in the completely wrong column.

Have fun cleaning up that dirty data. If instead the field was correctly defined as ZipCode VARCHAR(10), the insert would fail and you would prevent that erroneous data from entering your database.

3. Indexing

Ever try to index a column (or set of columns) that total more than 1700 bytes (or 900 bytes in older versions)?

2019-03-13-19-42-15

SQL Server will let you create indexes that contain keys that may potentially be greater than 1700 bytes, but as soon as you try to insert large data into those indexes you will be greeted with this wall of error:

2019-03-13-19-44-11

Of course there are some tricks you can use to index those wide key combinations, but you don't want to use those unless you absolutely have to.

4. Inaccurate Estimates

Imagine we have 1000 rows of data in our table with a VARCHAR(8000) column:

CREATE TABLE #Varchar8000Test
(
    Id int identity PRIMARY KEY,
    BigColumn varchar(8000)
);

INSERT INTO #Varchar8000Test VALUES ('a');
GO 1000

You'll notice that each of those rows' BigColumn fields only contain 3 bytes of data (1 for "a" + 2 for varchar overhead). That's about 3 KB total for the whole table of data (plus whatever the int column takes up)

You would therefore think that SQL Server would know and use this information when executing a query :

SELECT * FROM #Varchar8000Test ORDER BY BigColumn
OPTION(MAXDOP 1)

However it doesn't:

2019-03-13-20-00-07

SQL Server estimates that each row is ~4 KB! This is because SQL Server doesn't sample the data to estimate the average row size for VARCHAR columns, it uses half of the defined length.

This carries over to other parts of the query plan too, like memory grants:

2019-03-13-20-04-39

Fitting this whole table into memory should only take about ~7 KB (3 KB for our BigColumn data), but SQL Server reserves significantly more than that. In this example an excess of 6 MB isn't a huge deal, but if you are working with a table with many more rows, those excessive memory grants can use up hundreds of megabytes or even gigabytes of memory unnecessarily.

...and more!

This list is only a start: there are plenty of additional reasons why defining every column on your table as VARCHAR(8000). My goal with this post wasn't to create a comprehensive list, but rather to provide enough compelling reasons to have dissuaded my novice self a few years ago from even considering this as an option.

Gaps and Islands Across Date Ranges

Published Tue 12 March 2019 in SQL > Development > Window Functions

Watch this week's video on YouTube

In a traditional gaps and islands problem, the goal is to identify groups of continuous data sequences (islands) and groups of data where the sequence is missing (gaps).

While many people encounter gaps and islands problems when dealing with ranges of dates, and recently I did too but with an interesting twist:

How do you determine gaps and islands of data that has overlapping date ranges?

Overlapping Date Ranges

First let's start with some sample data to help follow along. The peculiarity to pay attention to is that the date ranges for each row might be completely discrete, completely enclosed, or they may overlap each other on either end:

DROP TABLE IF EXISTS #OverlappingDateRanges;
CREATE TABLE #OverlappingDateRanges (StartDate date, EndDate date);

INSERT INTO #OverlappingDateRanges
SELECT '8/24/2017', '9/23/2017'  UNION ALL
SELECT '8/24/2017', '9/20/2017'  UNION ALL 
SELECT '9/23/2017', '9/27/2017'  UNION ALL 
SELECT '9/25/2017', '10/10/2017' UNION ALL
SELECT '10/17/2017','10/18/2017' UNION ALL 
SELECT '10/25/2017','11/3/2017'  UNION ALL 
SELECT '11/3/2017', '11/15/2017'

SELECT * FROM #OverlappingDateRanges;

2019-03-13_20-13-13

What's unusual about this data is that while the end date of some rows matches the start date of other rows (eg. row 6 and 7), the date ranges of some rows other rows are either fully contained within other rows (eg. row 2 is contained in row 1) while other rows only overlap one boundary (eg. row 4's EndDate doesn't overlap with any other rows, but its StartDate is before row 3's EndDate).

Solution

While there are several ways gaps and islands problems can be solved, here is the solution using window functions that made the most sense to me.

First, we need to create a row number column based on the sequence of start and end dates, as well as bring the previous row's EndDate to the current row:

SELECT
    ROW_NUMBER() OVER(ORDER BY StartDate,EndDate) AS RN,
    StartDate,
    EndDate,
    LAG(EndDate,1) OVER (ORDER BY StartDate, EndDate) AS PreviousEndDate
FROM
    #OverlappingDateRanges

2019-03-06-20-52-58

Next we add two more fields:

IslandStartInd: indicates when a new island begins by looking if the current row's StartDate occurs after the previous row's EndDate. We don't really need this column for the example, but I find it helpful to see what's going on in the next column.
IslandId: indicates which island number the current row belongs to.

SELECT
    *,
    CASE WHEN Groups.PreviousEndDate >= StartDate THEN 0 ELSE 1 END AS IslandStartInd,
    SUM(CASE WHEN Groups.PreviousEndDate >= StartDate THEN 0 ELSE 1 END) OVER (ORDER BY Groups.RN) AS IslandId
FROM
(
    SELECT
        ROW_NUMBER() OVER(ORDER BY StartDate,EndDate) AS RN,
        StartDate,
        EndDate,
        LAG(EndDate,1) OVER (ORDER BY StartDate, EndDate) AS PreviousEndDate
    FROM
        #OverlappingDateRanges
) Groups

2019-03-06-21-01-51

The IslandId field is just a SUM() of the IslandStartInd, similar to a window function running total.

Our final step is to aggregate our rows using an old fashioned GROUP BY to return the minimum and maximum start and end dates respectively from each of our islands:

SELECT
    MIN(StartDate) AS IslandStartDate,
    MAX(EndDate) AS IslandEndDate
FROM
    (
    SELECT
        *,
        CASE WHEN Groups.PreviousEndDate >= StartDate THEN 0 ELSE 1 END AS IslandStartInd,
        SUM(CASE WHEN Groups.PreviousEndDate >= StartDate THEN 0 ELSE 1 END) OVER (ORDER BY Groups.RN) AS IslandId
    FROM
    (
        SELECT
            ROW_NUMBER() OVER(ORDER BY StartDate,EndDate) AS RN,
            StartDate,
            EndDate,
            LAG(EndDate,1) OVER (ORDER BY StartDate, EndDate) AS PreviousEndDate
        FROM
            #OverlappingDateRanges
    ) Groups
) Islands
GROUP BY
    IslandId
ORDER BY 
    IslandStartDate

2019-03-06-21-03-35

Voilà

Regardless of how messy the date ranges within an island are, this technique neatly identifies gaps in the data and returns the start and end of each island's date range. I often find myself using this technique in scenarios where I need to aggregate some date-based transactional data that otherwise would be too difficult to summarize with aggregate functions alone.