Animating a Magic 8 Ball in SQL Server Management Studio

Published Tue 14 May 2019 in SQL > TSqlTuesday


This post is a response to this month's T-SQL Tuesday #114 prompt by Matthew McGiffen. T-SQL Tuesday is a way for the SQL Server community to share ideas about different database and professional topics every month. This month Matthew asks us to write about puzzles, so I decided to recreate a childhood favorite in SQL Server.

Watch this week's video on YouTube

As a kid, I found Magic 8 Balls alluring. There is something appealing about a who-knows-how-many-sided die emerging from the depths of a mysterious inky blue fluid to help answer life's most difficult questions.

I never ended up buying a magic eight ball of my own though, so today I'm going to build and animate one in SQL Server Management Studio.

Fun and Valuable? Signs point to yes.

While building a magic eight ball in SQL Server is not the most useful project in the world, it is:

Fun
A great way to learn lots of cool SSMS and SQL tips and tricks to use in more useful situations.

Here's an example of the finished project followed by all of the components that make this project work. The full code for this solution can be found at the bottom of this post.

SSMSMagic8Ball

VALUES()

I needed a way to store all of the Magic 8 Ball messages. Some days I like UNIONing together a bunch of SELECT statements, but for these "larger" static datasets I like the syntax of VALUES().

SELECT * FROM 
(VALUES  
    ('It is certain.'), 
    ('It is decidedly so.'), 
    ('Without a doubt.'), 
    ('Yes - definitely.'), 
    ('You may rely on it.'), 
    ('As I see it, yes.'), 
    ('Most likely.'), 
    ('Outlook good.'), 
    ('Yes.'), 
    ('Signs point to yes.'), 

    ('Reply hazy, try again.'), 
    ('Ask again later.'), 
    ('Better not tell you now.'), 
    ('Cannot predict now.'), 
    ('Concentrate and ask again.'), 

    ('Don''t count on it.'), 
    ('My reply is no.'), 
    ('My sources say no.'), 
    ('Outlook not so good.'), 
    ('Very doubtful.') 
) T(Response)
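
For comparison, here is a minimal sketch of the two approaches side by side, trimmed to three responses. Both derived tables should return the same rows. The T and Response aliases follow the query above; the U alias is just a name assumed for illustration.

```sql
/* Table value constructor: one VALUES clause, one row per parenthesized item */
SELECT Response
FROM (VALUES
    ('It is certain.'),
    ('Ask again later.'),
    ('Very doubtful.')
) T(Response);

/* Equivalent UNION ALL form: one SELECT per row */
SELECT Response
FROM (
    SELECT 'It is certain.' AS Response
    UNION ALL SELECT 'Ask again later.'
    UNION ALL SELECT 'Very doubtful.'
) U;
```

The VALUES form needs less typing per row, which is why it starts to pay off as the list of messages grows.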

ORDER BY NEWID()

After we create our data set of static messages, we need to randomly return one message for every shake of the eight ball. My favorite way to return one random record is to order the data by NEWID() (creating a random order for the values) and then use TOP 1 to return only the first random record:

DECLARE @Message varchar(100) = '';

WITH MagicResponses AS ( 
    ...<VALUES() query from above>...
)

SELECT TOP 1 @Message = Response FROM MagicResponses ORDER BY NEWID();

Table Driven Animation

While I have never used this technique for animating an image before, I have used a control table to drive what data should get processed in an ETL.

In today's case, instead of saving the values of what data was last manipulated in an SSIS package, I'll be storing what action each frame of animation should display, as well as how much delay to put between each frame:

CREATE TABLE dbo.AnimationControl
(
    Id int IDENTITY PRIMARY KEY,
    ActionToTake varchar(20),
    DelayToTake varchar(20),
    ActionTakenDate datetime2

);

INSERT INTO dbo.AnimationControl (ActionToTake,DelayToTake) VALUES ('ShakeLeft','00:00:00.100');
INSERT INTO dbo.AnimationControl (ActionToTake,DelayToTake) VALUES ('ShakeRight','00:00:00.100');
INSERT INTO dbo.AnimationControl (ActionToTake,DelayToTake) VALUES ('ShakeLeft','00:00:00.100');
INSERT INTO dbo.AnimationControl (ActionToTake,DelayToTake) VALUES ('ShakeRight','00:00:00.100');
INSERT INTO dbo.AnimationControl (ActionToTake,DelayToTake) VALUES ('ShakeLeft','00:00:00.100');
INSERT INTO dbo.AnimationControl (ActionToTake,DelayToTake) VALUES ('ShakeRight','00:00:00.100');
INSERT INTO dbo.AnimationControl (ActionToTake,DelayToTake) VALUES ('Reveal','00:00:00.500');

WAITFOR DELAY

I wanted there to be a different delay between certain animation frames (I believe the final message reveal deserves a slightly more dramatic pause), so I'm using WAITFOR DELAY to achieve that.

WAITFOR DELAY @DelayToTake;

PRINT

The goal here is to print this ASCII 8 ball shaking left and right before displaying the message. We do this using good old-fashioned PRINT(). After printing a particular frame, we update our control table to indicate that the frame has been drawn.

IF @CurrentActionType = 'ShakeLeft'
BEGIN 
    PRINT(' 
                _.a$$$$$a._ 
              ,$$$$$$$$$$$$$. 
            ,$$$$$$$$$$$$$$$$$. 
           d$$$$$$$$$$$$$$$$$$$b 
          d$$$$$$$$~`"`~$$$$$$$$b 
         ($$$$$$$p   _   q$$$$$$$) 
         $$$$$$$$   (_)   $$$$$$$$ 
         $$$$$$$$   (_)   $$$$$$$$ 
         ($$$$$$$b       d$$$$$$$) 
          q$$$$$$$$a._.a$$$$$$$$p 
           q$$$$$$$$$$$$$$$$$$$p 
            `$$$$$$$$$$$$$$$$$` 
              `$$$$$$$$$$$$$` 
                `~$$$$$$$~` 
    ') 
END

GO

You might be wondering why I decided to use a control table to dictate what images to animate. The trouble was that in order to get the PRINT to actually display our ASCII images on screen in SSMS, the batch needed to finish executing. So each frame we print needs to be part of its own batch.

Since we have 7 frames in our animation, we need to execute our procedure 7 times.

Alternatively we can use GO 7, but then we get that ugly batch execution completed message, which I don't think there is any way to hide:

EXEC dbo.USP_ShakeThe8Ball;
GO 7

-- OR
EXEC dbo.USP_ShakeThe8Ball;
GO
EXEC dbo.USP_ShakeThe8Ball;
GO
EXEC dbo.USP_ShakeThe8Ball;
GO
EXEC dbo.USP_ShakeThe8Ball;
GO
EXEC dbo.USP_ShakeThe8Ball;
GO
EXEC dbo.USP_ShakeThe8Ball;
GO
EXEC dbo.USP_ShakeThe8Ball;
GO
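
If one batch per frame feels clunky, one alternative worth experimenting with (it is not what the procedure in this post uses) is RAISERROR ... WITH NOWAIT, which sends an informational message to the client right away instead of waiting for the batch to finish. A minimal sketch, assuming severity 0 and a hypothetical @Frame counter in place of the ASCII art:

```sql
DECLARE @Frame int = 1;

WHILE @Frame <= 3
BEGIN
    /* Severity 0 makes this an informational message, much like PRINT;
       WITH NOWAIT sends it to the client immediately */
    RAISERROR('Frame %d', 0, 1, @Frame) WITH NOWAIT;

    /* Pause between frames, same idea as the DelayToTake column */
    WAITFOR DELAY '00:00:00.100';

    SET @Frame += 1;
END;
```

Because everything runs in a single batch, a loop like this would not need the control table to remember which frame comes next.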

Completely Useless? I guess not

I've always been a fan of occasionally taking a break to build things for pure fun. It's a good way to apply lesser known features to your code, stretch your creativity for solving problems, and of course push software functionality to its limits through feature abuse.

Here is the full set of code if you want to run it for yourself (note, this works on a 1920x1080 resolution monitor with SSMS at full screen...your results may vary):

CREATE OR ALTER PROCEDURE dbo.USP_ShakeThe8Ball
AS
BEGIN
    /* Hide extra output to the messages window that will ruin our animation */
    SET NOCOUNT ON;
    SET ANSI_WARNINGS OFF;

    /* Set up a table to keep track of our animation frames and insert into it */
    IF OBJECT_ID('dbo.AnimationControl') IS NULL
    BEGIN
        CREATE TABLE dbo.AnimationControl
        (
            Id int IDENTITY PRIMARY KEY,
            ActionToTake varchar(20),
            DelayToTake varchar(20),
            ActionTakenDate datetime2

        );
        /* Seed the frames only once, when the table is first created,
        so repeated executions don't keep adding rows */
        INSERT INTO dbo.AnimationControl (ActionToTake,DelayToTake) VALUES ('ShakeLeft','00:00:00.100');
        INSERT INTO dbo.AnimationControl (ActionToTake,DelayToTake) VALUES ('ShakeRight','00:00:00.100');
        INSERT INTO dbo.AnimationControl (ActionToTake,DelayToTake) VALUES ('ShakeLeft','00:00:00.100');
        INSERT INTO dbo.AnimationControl (ActionToTake,DelayToTake) VALUES ('ShakeRight','00:00:00.100');
        INSERT INTO dbo.AnimationControl (ActionToTake,DelayToTake) VALUES ('ShakeLeft','00:00:00.100');
        INSERT INTO dbo.AnimationControl (ActionToTake,DelayToTake) VALUES ('ShakeRight','00:00:00.100');
        INSERT INTO dbo.AnimationControl (ActionToTake,DelayToTake) VALUES ('Reveal','00:00:00.500');
    END;


    DECLARE @CurrentActionId int = 1;
    DECLARE @CurrentActionType varchar(20) = 'ShakeLeft';
    DECLARE @DelayToTake varchar(20) = '00:00:00.100';

    /* If more than 1 second elapsed, clear the control table */
    DECLARE @LastRunDate datetime2;
    SELECT @LastRunDate = MAX(ActionTakenDate) FROM dbo.AnimationControl;

    IF DATEDIFF(millisecond,@LastRunDate,GETDATE()) > 1000
    BEGIN
        UPDATE dbo.AnimationControl SET ActionTakenDate = NULL;
    END

    /* Which action/frame are we currently on? */

    SELECT @CurrentActionId = MIN(Id) FROM dbo.AnimationControl WHERE ActionTakenDate IS NULL;
    SELECT @CurrentActionType = ActionToTake,
            @DelayToTake = DelayToTake
        FROM dbo.AnimationControl WHERE Id = @CurrentActionId


    WAITFOR DELAY @DelayToTake;
    /* Since we can't clear the Messages window, we need to fill it with
    blank space between animation frames to achieve the desired effect */


    PRINT(' 




        '); 


    IF @CurrentActionType = 'ShakeLeft'
    BEGIN 

        PRINT(' 
                  _.a$$$$$a._ 
                 ,$$$$$$$$$$$$$. 
               ,$$$$$$$$$$$$$$$$$. 
              d$$$$$$$$$$$$$$$$$$$b 
             d$$$$$$$$~`"`~$$$$$$$$b 
            ($$$$$$$p   _   q$$$$$$$) 
            $$$$$$$$   (_)   $$$$$$$$ 
            $$$$$$$$   (_)   $$$$$$$$ 
            ($$$$$$$b       d$$$$$$$) 
             q$$$$$$$$a._.a$$$$$$$$p 
              q$$$$$$$$$$$$$$$$$$$p 
               `$$$$$$$$$$$$$$$$$` 
                 `$$$$$$$$$$$$$` 
                   `~$$$$$$$~` 
        ') 

    END 

    IF @CurrentActionType = 'ShakeRight' 

    BEGIN 

        PRINT(' 
                      _.a$$$$$a._ 
                     ,$$$$$$$$$$$$$. 
                   ,$$$$$$$$$$$$$$$$$. 
                  d$$$$$$$$$$$$$$$$$$$b 
                 d$$$$$$$$~`"`~$$$$$$$$b 
                ($$$$$$$p   _   q$$$$$$$) 
                $$$$$$$$   (_)   $$$$$$$$ 
                $$$$$$$$   (_)   $$$$$$$$ 
                ($$$$$$$b       d$$$$$$$) 
                 q$$$$$$$$a._.a$$$$$$$$p 
                  q$$$$$$$$$$$$$$$$$$$p 
                   `$$$$$$$$$$$$$$$$$` 
                     `$$$$$$$$$$$$$` 
                       `~$$$$$$$~` ') 

    END 

    IF @CurrentActionType = 'Reveal'
    BEGIN

        DECLARE @Message varchar(100) = '';

        WITH MagicResponses AS ( 
        SELECT * FROM 
        (VALUES  
            ('It is certain.'), 
            ('It is decidedly so.'), 
            ('Without a doubt.'), 
            ('Yes - definitely.'), 
            ('You may rely on it.'), 
            ('As I see it, yes.'), 
            ('Most likely.'), 
            ('Outlook good.'), 
            ('Yes.'), 
            ('Signs point to yes.'), 

            ('Reply hazy, try again.'), 
            ('Ask again later.'), 
            ('Better not tell you now.'), 
            ('Cannot predict now.'), 
            ('Concentrate and ask again.'), 

            ('Don''t count on it.'), 
            ('My reply is no.'), 
            ('My sources say no.'), 
            ('Outlook not so good.'), 
            ('Very doubtful.') 
        ) T(Response) 
        )

        SELECT TOP 1 @Message = Response FROM MagicResponses ORDER BY NEWID();

        BEGIN 

        PRINT(' 
                      _.a$$$$$a._ 
                     ,$$$$$$$$$$$$$. 
                   ,$$$$$$$$$$$$$$$$$. 
                  d$$$$$$$$$$$$$$$$$$$b 
                 d$$$$$$$$~`"`~$$$$$$$$b 
                ($$$$$$$p   _   q$$$$$$$) 
                $$$$$$$$   (_)   $$$$$$$$           ' + @Message + '
                $$$$$$$$   (_)   $$$$$$$$ 
                ($$$$$$$b       d$$$$$$$) 
                 q$$$$$$$$a._.a$$$$$$$$p 
                  q$$$$$$$$$$$$$$$$$$$p 
                   `$$$$$$$$$$$$$$$$$` 
                     `$$$$$$$$$$$$$` 
                       `~$$$$$$$~` ') 

        END 
    END

        PRINT(' 




        '); 

    UPDATE dbo.AnimationControl SET ActionTakenDate = GETDATE() WHERE Id = @CurrentActionId;

END;
GO



/*
CTRL+T first to show Results as Text

Then highlight and execute the following:

EXEC dbo.USP_ShakeThe8Ball;
GO 7

-- OR
EXEC dbo.USP_ShakeThe8Ball;
GO
EXEC dbo.USP_ShakeThe8Ball;
GO
EXEC dbo.USP_ShakeThe8Ball;
GO
EXEC dbo.USP_ShakeThe8Ball;
GO
EXEC dbo.USP_ShakeThe8Ball;
GO
EXEC dbo.USP_ShakeThe8Ball;
GO
EXEC dbo.USP_ShakeThe8Ball;
GO

*/

Temporary Staging Tables

Published Tue 07 May 2019 in SQL > Performance Tuning > Query Rewrites


SQL Server Spool operators are a mixed bag. On one hand, they can negatively impact performance when writing data to disk in tempdb. On the other hand, they allow filtered and transformed result sets to be temporarily staged, making it easier for that data to be reused again during that query execution.

The problem with the latter scenario is that SQL Server doesn't always decide to use a spool; often it's happy to re-read (and re-process) the same data repeatedly. When this happens, one option you have is to explicitly create your own temporary staging table that will help SQL Server cache data it needs to reuse.

This post is a continuation of my series to document ways of refactoring queries for improved performance. I'll be using the StackOverflow 2014 data dump for these examples if you want to play along at home.

No spools

Let's start by looking at the following query:

WITH January2010Badges AS ( 
    SELECT 
        UserId,
        Name,
        Date
    FROM 
        dbo.Badges 
    WHERE 
        Date >= '2010-01-01' 
        AND Date <= '2010-02-01' 
), Next10PopularQuestions AS ( 
    SELECT TOP 10 * FROM (SELECT UserId, Name, Date FROM January2010Badges WHERE Name = 'Popular Question' ORDER BY Date OFFSET 10 ROWS) t
), Next10NotableQuestions AS ( 
    SELECT TOP 10 * FROM (SELECT UserId, Name, Date FROM January2010Badges WHERE Name = 'Notable Question' ORDER BY Date OFFSET 10 ROWS) t
), Next10StellarQuestions AS ( 
    SELECT TOP 10 * FROM (SELECT UserId, Name, Date FROM January2010Badges WHERE Name = 'Stellar Question' ORDER BY Date OFFSET 10 ROWS) t
)
SELECT UserId, Name FROM Next10PopularQuestions 
UNION ALL 
SELECT UserId, Name FROM Next10NotableQuestions
UNION ALL 
SELECT UserId, Name FROM Next10StellarQuestions

Note: This is not necessarily the most efficient way to write this query, but it makes for a good demo.

This query is returning offset results for different badges from one month of data in the dbo.Badges table. While the query is using a CTE to make the logic easy to understand (i.e. filter the data to just January 2010 results and then calculate our offsets based on those results), SQL Server isn't actually saving the results of our January2010Badges expression in tempdb to get reused. If we view the execution plan, we'll see it reading from our dbo.Badges clustered index three times:


Table 'Badges'. Scan count 27, logical reads 151137, ...

That means every time SQL Server needs to run our offset logic in each "Next10..." expression, it needs to rescan the entire clustered index to first filter on the Date column and then the Name column. This results in about 150,000 logical reads.

Divide and Conquer

One potential solution would be to add a nonclustered index that would allow SQL Server to avoid scanning the entire clustered index three times. But since this series is about improving performance without adding permanent indexes (since sometimes you are stuck in scenarios where you can't easily add or modify an index), we'll look at mimicking a spool operation ourselves.

We'll use a temporary table to stage our filtered January 2010 results so SQL Server doesn't have to scan the clustered index each time it needs to perform logic on that subset of data. For years I've referred to this technique as "temporary staging tables" or "faking spools", but at a recent SQL Saturday Jeff Moden told me he refers to it as "Divide and Conquer". I think that's a great name, so I'll use it going forward. Thanks Jeff!

First let's divide our query so that we insert our January 2010 data into its own temporary table:

DROP TABLE IF EXISTS #January2010Badges;
CREATE TABLE #January2010Badges
(
    UserId int,
    Name nvarchar(40),
    Date datetime,
    CONSTRAINT PK_NameDateUserId PRIMARY KEY CLUSTERED (Name,Date,UserId)
);

INSERT INTO #January2010Badges
SELECT
    UserId,
    Name,
    Date
FROM 
    dbo.Badges
WHERE 
    Date >= '2010-01-01' 
    AND Date <= '2010-02-01';

You'll notice I added a clustered primary key which will index the data in an order that will make filtering easier.

Next, we conquer by changing the rest of our query to read from our newly created temp table:

WITH Next10PopularQuestions AS ( 
    SELECT TOP 10 * FROM (SELECT UserId, Name, Date FROM #January2010Badges WHERE Name = 'Popular Question' ORDER BY Date OFFSET 10 ROWS) t
), Next10NotableQuestions AS ( 
    SELECT TOP 10 * FROM (SELECT UserId, Name, Date FROM #January2010Badges WHERE Name = 'Notable Question' ORDER BY Date OFFSET 10 ROWS) t
), Next10StellarQuestions AS ( 
    SELECT TOP 10 * FROM (SELECT UserId, Name, Date FROM #January2010Badges WHERE Name = 'Stellar Question' ORDER BY Date OFFSET 10 ROWS) t
)
SELECT UserId, Name FROM Next10PopularQuestions 
UNION ALL 
SELECT UserId, Name FROM Next10NotableQuestions 
UNION ALL 
SELECT UserId, Name FROM Next10StellarQuestions

Running this all together, we get the following plans and logical read counts:


Table 'Badges'. Scan count 9, logical reads 50379, ...

(42317 rows affected)

(20 rows affected)
Table '#January2010Badges______________________________00000000003B'. Scan count 3, logical reads 12, ...

In this version of the query, SQL Server scans the clustered index a single time and saves that data to a temporary table. In the subsequent SELECTs, it seeks from this much smaller temporary table instead of going back to the clustered index, reducing the total amount of reads to 50379 + 12 = 50391: about a third of what the original query was doing.
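
If you don't have the StackOverflow database handy, the same divide-and-conquer shape can be tried on a few rows of made-up data. This is only a sketch: #SourceBadges, #Staged, and the sample rows are hypothetical stand-ins, and the offset is shortened to 1 row so the tiny data set still returns something.

```sql
/* Hypothetical stand-in for dbo.Badges so the sketch runs anywhere */
DROP TABLE IF EXISTS #SourceBadges;
CREATE TABLE #SourceBadges (UserId int, Name nvarchar(40), Date datetime);

INSERT INTO #SourceBadges (UserId, Name, Date) VALUES
    (1, N'Popular Question', '20100105'),
    (2, N'Popular Question', '20100106'),
    (3, N'Popular Question', '20100107'),
    (4, N'Notable Question', '20100108'),
    (5, N'Notable Question', '20100109'),
    (6, N'Popular Question', '20100301'); /* outside January, filtered out below */

/* Divide: filter once into an indexed temp table */
DROP TABLE IF EXISTS #Staged;
CREATE TABLE #Staged
(
    UserId int,
    Name nvarchar(40),
    Date datetime,
    PRIMARY KEY CLUSTERED (Name, Date, UserId)
);

INSERT INTO #Staged (UserId, Name, Date)
SELECT UserId, Name, Date
FROM #SourceBadges
WHERE Date >= '20100101' AND Date <= '20100201';

/* Conquer: each branch reads the small staged table, skipping the first row per badge */
SELECT UserId, Name
FROM (SELECT UserId, Name, Date FROM #Staged WHERE Name = N'Popular Question' ORDER BY Date OFFSET 1 ROWS) p
UNION ALL
SELECT UserId, Name
FROM (SELECT UserId, Name, Date FROM #Staged WHERE Name = N'Notable Question' ORDER BY Date OFFSET 1 ROWS) n;
```

The primary key in this sketch is left unnamed on purpose: constraint names on temp tables have to be unique within tempdb, so a named constraint can collide when two sessions run the same script at once.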

Temporary Staged Data

At the end of the day, you can hope that SQL Server creates a spool to temporarily stage your data, or you can be explicit about it and do it yourself. Either option is going to increase usage on your tempdb database, but at least by defining the temporary table yourself you can customize and index it to achieve maximum reuse and performance for your queries.

It's important to note that this is not a technique you want to abuse: writing and reading too much data from tempdb can cause contention problems that can make you worse off than having allowed SQL Server to scan your clustered index three times. However, when implemented sparingly and for good reasons, this technique can greatly improve the performance of certain queries.

IN vs UNION ALL

Published Tue 30 April 2019 in SQL > Performance Tuning > Query Rewrites


When you need to filter query results on multiple values, you probably use an IN() statement or multiple predicates separated by ORs:

WHERE Col1 IN ('A','B','C')

WHERE Col1 = 'A' OR Col1 = 'B' OR Col1 = 'C'

While SQL Server will generate the same query plan for either syntax, there is another technique you can try that can improve performance under certain conditions: UNION ALL.

Lookups and Scans

Let's say we have the following index on our dbo.Badges table:

CREATE NONCLUSTERED INDEX [IX_Badges] ON [dbo].[Badges] ([Name]) INCLUDE ([UserId]);

Next let's run these two separate queries:

/* Query 1 */
SELECT 
    Name, UserId, Date 
FROM 
    dbo.Badges 
WHERE 
    Name = 'Benefactor'
OPTION(MAXDOP 1)

/* Query 2 */
SELECT 
    Name, UserId, Date 
FROM 
    dbo.Badges 
WHERE 
    Name = 'Research Assistant'
OPTION(MAXDOP 1)

Note I'm enforcing MAXDOP 1 here to remove any performance differences due to parallelism in these demos.

The nonclustered index doesn't cover these queries - while SQL Server can seek the index for the Name predicate in the WHERE clause, it can't retrieve all the columns in the SELECT from the index alone. This leaves SQL Server with a tough choice to make:

Does it scan the whole clustered index to return all the required columns for the rows requested?
Does it seek to the matching records in the nonclustered index and then perform a key lookup to retrieve the remaining data?

So, what does SQL Server decide to do?


For Query 1, SQL Server thinks that reading the entire clustered index and returning only the rows where Name = 'Benefactor' is the best option.

SQL Server takes a different approach for Query 2, however, using the non-covering nonclustered index to find the records with Name = 'Research Assistant' and then looking up the Date values in the clustered index via a Key Lookup.

The reason SQL Server chooses these two different plans is that it thinks it will be faster to return a smaller number of records with a Seek + Key Lookup approach ("Research Assistant", 127 rows), but faster to return a larger number of records with a Scan ("Benefactor", 17935 rows).

Kimberly Tripp has an excellent post that defines where this "tipping point" from a key lookup to a clustered index scan typically occurs, but the important thing to keep in mind for this post is that we can sometimes use SQL Server's ability to switch between these two approaches to our advantage.

Combining Queries with IN

So, what plan does SQL Server generate when we combine our two queries into one?

SELECT 
    Name, UserId, Date 
FROM 
    dbo.Badges 
WHERE 
    Name IN ('Benefactor','Research Assistant')
OPTION(MAXDOP 1)


Interestingly enough, SQL Server decides to retrieve the requested rows from the nonclustered index and then go look up the remaining Date column in the clustered index.

If we look at the page reads (SET STATISTICS IO ON;) we'll see SQL Server had to read 85500 pages to return the data requested:

(18062 rows affected)
Table 'Badges'. Scan count 2, logical reads 85500, physical reads 20, read-ahead reads 33103, ...

Without correcting our index to include the Date column, is there some way we can achieve the same results with better performance?

UNION ALL

In this case it's possible to rewrite our query logic to use UNION ALL instead of IN/ORs:

SELECT 
    Name,UserId,Date 
FROM 
    dbo.Badges 
WHERE 
    Name = 'Benefactor' 
UNION ALL
SELECT 
    Name,UserId,Date 
FROM 
    dbo.Badges 
WHERE 
    Name = 'Research Assistant'
OPTION(MAXDOP 1)


We get the same exact results through a hybrid execution plan.

In this case, our plan mirrors what SQL Server did when running our original two queries separately:

The rows where Name = 'Benefactor' are returned by scanning the clustered index.
The nonclustered index is seeked with clustered index lookups for the Name = 'Research Assistant' records.

Looking at the IO statistics for this UNION ALL query:

(18062 rows affected)
Table 'Badges'. Scan count 2, logical reads 50120, physical reads 6, read-ahead reads 49649, ...

Even though this query reads the whole clustered index to get the Benefactor rows, the total number of logical reads is still smaller than the seek/key lookup pattern seen in the combined query with IN(). This UNION ALL version gives SQL Server the ability to build a hybrid execution plan, combining two different techniques to generate a plan with fewer overall reads.

IN or UNION ALL?

There's no way to know for sure without trying each variation.

But if you have a slow performing query that is filtering on multiple values within a column, it might be worth trying to get SQL Server to use a different plan by rewriting the query.
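
Before swapping one form for the other, it is worth confirming that both return the same rows. Here is a small sketch using a hypothetical @Badges table variable and made-up rows rather than the real dbo.Badges table:

```sql
/* Hypothetical sample data standing in for dbo.Badges */
DECLARE @Badges TABLE (Name nvarchar(40), UserId int);
INSERT INTO @Badges (Name, UserId) VALUES
    (N'Benefactor', 1),
    (N'Benefactor', 2),
    (N'Research Assistant', 3),
    (N'Teacher', 4);

DECLARE @InCount int, @UnionAllCount int;

/* Original form: one predicate listing every value */
SELECT @InCount = COUNT(*)
FROM @Badges
WHERE Name IN (N'Benefactor', N'Research Assistant');

/* Rewritten form: one branch per value */
SELECT @UnionAllCount = COUNT(*)
FROM (
    SELECT Name, UserId FROM @Badges WHERE Name = N'Benefactor'
    UNION ALL
    SELECT Name, UserId FROM @Badges WHERE Name = N'Research Assistant'
) u;

SELECT @InCount AS InCount, @UnionAllCount AS UnionAllCount;
```

The rewrite is only equivalent when no row can satisfy more than one branch. Equality predicates on different values of the same column can never overlap, but if the branches could match the same row, UNION ALL would return it once per branch.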

Correlated Subqueries vs Derived Tables

Published Tue 23 April 2019 in SQL > Performance Tuning > Query Rewrites


Correlated subqueries provide an intuitive syntax for writing queries that return related data. However, they often perform poorly due to needing to execute once for every value they join on.

The good news is that many correlated subqueries can be rewritten to use a derived table for improved performance.

When was each user's first badge awarded?

StackOverflow awards users badges for things like asking good questions, hitting certain vote thresholds, and more.

I want to write a query that figures out the date on which each user received their first badge.

Using a correlated subquery, I might write my query as follows:

SET STATISTICS IO, TIME ON;

SELECT DISTINCT
    UserId,
    FirstBadgeDate = (SELECT MIN(Date) FROM dbo.Badges i WHERE o.UserId = i.UserId)
FROM
    dbo.Badges o

The syntax of the correlated subquery here makes it clear that for each UserId we want to return the MIN(Date) associated with that UserId from the badges table.

Looking at the execution plan and time and IO statistics (abbreviated for clarity) we see:


(1318413 rows affected)
Table 'Worktable'. Scan count 0, logical reads 0, ...
Table 'Workfile'. Scan count 0, logical reads 0, ...
Table 'Badges'. Scan count 2, logical reads 43862, ...

(1 row affected)

 SQL Server Execution Times:
   CPU time = 3625 ms,  elapsed time = 8347 ms.

So, what's going on here? We read ~8 million rows of data from our index on the dbo.Badges table and then calculate the MIN(Date) for each UserId. This is the "correlated" part of our query, which then gets joined back to the dbo.Badges table using a Hash Match join operator.

Our join doesn't eliminate any rows, so the ~8 million rows continue flowing through until near the very end, where we have another Hash Match operator, this time being used to dedupe the rows for the DISTINCT part of the query, reducing the final result to ~1 million rows.

Eliminating the Correlated Subquery

What would things look like if we rewrote this correlated subquery as a derived table in the FROM clause?

SELECT DISTINCT
    o.UserId,
    FirstBadgeDate
FROM
    dbo.Badges o
    INNER JOIN 
        (SELECT 
            UserId, 
            MIN(Date) as FirstBadgeDate 
        FROM 
            dbo.Badges GROUP BY UserId
        ) i
    ON o.UserId = i.UserId


(1318413 rows affected)
Table 'Workfile'. Scan count 0, logical reads 0, ...
Table 'Worktable'. Scan count 0, logical reads 0, ...
Table 'Badges'. Scan count 2, logical reads 43862, ...

(1 row affected)

 SQL Server Execution Times:
   CPU time = 2516 ms,  elapsed time = 5350 ms.

If we look at the IO statistics, it's interesting to note that there is no difference in reads between these two queries.

Looking at the CPU time statistics however, this derived table query consistently comes in about 33% faster than the correlated subquery example. Why is that?

Looking at the execution plan reveals some details: in this plan, you can see we read in from the dbo.Badges index and go straight into a Hash Match operator. The top stream is deduping our data on UserId, taking it from ~8 million rows to ~1 million rows. The bottom stream does the same deduping while also calculating the MIN(DATE) for each UserId grouping.

When both of those streams join together, the final hash match operator is only joining ~1 million rows with ~1 million rows (as opposed to the first query that was joining ~8 million rows with ~1 million rows).

This last join is the reason for the performance improvement: because this execution plan can reduce the number of rows sooner, the final join ends up having to do less work. Additionally, the records were already distinct going into the join, saving us from an extra deduping step.
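
When refactoring like this, it helps to check that both forms return identical results before comparing their performance. Below is a minimal sketch of such a check. The #BadgeSample table and its rows are made up for illustration, and comparing the two result sets with EXCEPT in both directions is my assumption about a reasonable way to test equivalence, not something the queries above require.

```sql
/* Hypothetical sample data standing in for dbo.Badges */
DROP TABLE IF EXISTS #BadgeSample;
CREATE TABLE #BadgeSample (UserId int, Date datetime);

INSERT INTO #BadgeSample (UserId, Date) VALUES
    (1, '20100105'), (1, '20100101'),
    (2, '20100210'), (2, '20100203'),
    (3, '20100301');

DECLARE @Mismatches int;

WITH Correlated AS (
    SELECT DISTINCT
        o.UserId,
        FirstBadgeDate = (SELECT MIN(i.Date) FROM #BadgeSample i WHERE o.UserId = i.UserId)
    FROM #BadgeSample o
), Derived AS (
    SELECT DISTINCT
        o.UserId,
        i.FirstBadgeDate
    FROM #BadgeSample o
    INNER JOIN (SELECT UserId, MIN(Date) AS FirstBadgeDate FROM #BadgeSample GROUP BY UserId) i
        ON o.UserId = i.UserId
)
/* Rows present in one result but not the other, in either direction */
SELECT @Mismatches = COUNT(*)
FROM (
    (SELECT UserId, FirstBadgeDate FROM Correlated EXCEPT SELECT UserId, FirstBadgeDate FROM Derived)
    UNION ALL
    (SELECT UserId, FirstBadgeDate FROM Derived EXCEPT SELECT UserId, FirstBadgeDate FROM Correlated)
) d;

SELECT @Mismatches AS Mismatches;
```

One caveat to keep in mind: if UserId could be NULL, the INNER JOIN version would drop those rows while the correlated version would still return them with a NULL date, so the two forms are only interchangeable when the join column has no NULLs.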

Further Reducing Redundancy

You may have noticed that both of these queries are a little redundant: they both call on the dbo.Badges table unnecessarily. The best option to improve query performance would be to rewrite it as:

SELECT 
    UserId, 
    MIN(Date) as FirstBadgeDate 
FROM 
    dbo.Badges 
GROUP BY 
    UserId


While this is the most efficient query of the three, most real-world queries and scenarios aren't this easy to simplify.

When your queries have more joins, WHERE clauses, and more, knowing how to refactor from a correlated subquery to a derived table query is critical to potentially improving performance.

Window Functions vs GROUP BYs

Published Tue 16 April 2019 in SQL > Performance Tuning > Query Rewrites


There are many options available for improving the performance of a query: indexes, statistics, configuration settings, etc.

However, not all environments allow you to use those features (e.g. vendor databases), leaving query rewriting as the only option.

This is the first post in a series to document common ways to refactor queries without otherwise altering the database. The goal of these posts will be to provide examples of performance pitfalls in queries and how to rewrite those queries to generate different query plans that (hopefully) improve performance.

I'll be using the StackOverflow 2014 data dump for these examples if you want to play along at home.

Who was first to earn each badge?

StackOverflow awards users badges for things like asking good questions, hitting certain vote thresholds, and more.

I want to write a query that figures out who is the first person awarded each badge. In cases where there is a tie for the first person to receive that badge, I want to return the user with the lowest UserId.

Window functions make this type of question easy to write a query for:

SELECT DISTINCT
    Name,
    FIRST_VALUE(UserId) OVER (PARTITION BY Name ORDER BY Date,UserId) AS UserId
FROM
    dbo.Badges b
ORDER BY
    Name,UserId

If you've used FIRST_VALUE before, this query should be easy to interpret: for each badge Name, return the first UserId sorted by Date (earliest date to receive the badge) and UserId (pick the lowest UserId when there are ties on Date).

This query was easy to write and is simple to understand. However, the performance is not great: it takes 46 seconds to finish returning results on my machine.


Note: I assumed this table started off with the following index:

CREATE NONCLUSTERED INDEX IX_Badges__Name_Date_UserId ON [dbo].[Badges] (Name,Date,UserId);
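
To make the tie-breaking behavior of FIRST_VALUE concrete, here is a small sketch on made-up data. The @Badges table variable and its rows are hypothetical; the query itself has the same shape as the one above.

```sql
/* Hypothetical sample data: two users tie for the earliest 'Teacher' badge */
DECLARE @Badges TABLE (Name nvarchar(40), UserId int, Date datetime);
INSERT INTO @Badges (Name, UserId, Date) VALUES
    (N'Teacher', 30, '20100102'),
    (N'Teacher', 20, '20100101'),
    (N'Teacher', 10, '20100101'), /* ties with UserId 20 on Date */
    (N'Student', 40, '20100105');

SELECT DISTINCT
    Name,
    FIRST_VALUE(UserId) OVER (PARTITION BY Name ORDER BY Date, UserId) AS UserId
FROM
    @Badges
ORDER BY
    Name, UserId;

/* Returns: Student 40, Teacher 10 */
```

Without UserId in the window's ORDER BY, either of the two tied 'Teacher' rows could come first, so the secondary sort column is what makes the result deterministic.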

Why so slow?

If we SET STATISTICS IO ON we'll notice that SQL Server reads 46767 pages from a nonclustered index. Since we aren't filtering our data, there's not much we can do to make that faster.

Reading right to left, next up we see two Segment operators. These don't add much overhead since our data is sorted on our segments/groups, so making SQL Server identify when our sorted rows change values is trivial.

Next up is the Window Spool operator which "Expands each row into the set of rows that represent the window associated with it." While it looks innocent by having a low relative cost, this operator is writing 8 million rows/reading 16 million rows (because of how Window Spool works) from tempdb. Ouch.

After that the Stream Aggregate operator and Compute Scalar operators check to see if the first value in each window being returned from the Window Spool is null and then return the first non-null value. These operations are also relatively painless since the data flowing through is already sorted.

The Hash Match operator then dedupes the data for our DISTINCT and then we sort the remaining ~2k rows for our output.

So while our query looks simple, the fact that our whole table of data is getting written to and read from tempdb before being deduped and sorted is a real performance killer.

Removing tempdb usage the old-fashioned way

When I say "the old-fashioned way", I mean rewriting our window function to use more traditional aggregate functions and a GROUP BY:

SELECT
    b.Name,
    MIN(b.UserId) AS UserId
FROM
    dbo.Badges b
    INNER JOIN
    (
    SELECT
        Name,
        MIN(Date) AS Date
    FROM
        dbo.Badges
    GROUP BY
        Name
    ) m
        ON b.Name = m.Name
        AND b.Date = m.Date
GROUP BY
    b.Name
ORDER BY
    Name,UserId

I think by most people's standards, this query is not as easy to read. While not overly complex, it does take up a lot more screen space and is complicated by multiple GROUP BYs and a derived table.

And while the query may look ugly on the outside, it's what lies below the surface that really matters:


What a beautifully simple execution plan. And it finishes executing almost instantly.

Let's break down what's going on. First, we start with similar Index Scan and Segment operators as the previous query so no real difference there.

At this point you may have noticed that while the written query uses two GROUP BYs and two MIN functions that are then joined together, the execution plan does not contain two Index Scans or two sets of aggregations, and no join happens at all.

SQL Server can use an optimization with the Top operator that allows it to take the sorted data and return only the Name and UserId rows for the top Name and Date values within a group (essentially matching the MIN logic). This is a great example of how the optimizer can take a declarative SQL query and decide how to efficiently return the data needed.

At this point, the Top operator filters our 8 million rows down to around 30k rows. 30k rows get deduped a lot faster with our Stream Aggregate operator, and since the data is already sorted we don't need an extra Sort operator.

Overall, this second query runs so much better than the original because SQL Server doesn't have to go to tempdb for any operations - all the data is pre-sorted in the index and can flow through.
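
Since the rewrite looks so different from the window function version, a quick equivalence check is reassuring. The following is a minimal sketch on made-up data. #BadgeCheck, its rows, and the EXCEPT-based comparison are my assumptions for illustration; the two CTE bodies mirror the two queries in this post.

```sql
/* Hypothetical sample data standing in for dbo.Badges */
DROP TABLE IF EXISTS #BadgeCheck;
CREATE TABLE #BadgeCheck (Name nvarchar(40), UserId int, Date datetime);

INSERT INTO #BadgeCheck (Name, UserId, Date) VALUES
    (N'Teacher', 30, '20100102'),
    (N'Teacher', 20, '20100101'),
    (N'Teacher', 10, '20100101'),
    (N'Student', 40, '20100105'),
    (N'Student', 50, '20100106');

DECLARE @Differences int;

WITH WindowVersion AS (
    SELECT DISTINCT
        Name,
        FIRST_VALUE(UserId) OVER (PARTITION BY Name ORDER BY Date, UserId) AS UserId
    FROM #BadgeCheck
), GroupByVersion AS (
    SELECT
        b.Name,
        MIN(b.UserId) AS UserId
    FROM #BadgeCheck b
    INNER JOIN (SELECT Name, MIN(Date) AS Date FROM #BadgeCheck GROUP BY Name) m
        ON b.Name = m.Name
        AND b.Date = m.Date
    GROUP BY b.Name
)
/* Rows present in one result but not the other, in either direction */
SELECT @Differences = COUNT(*)
FROM (
    (SELECT Name, UserId FROM WindowVersion EXCEPT SELECT Name, UserId FROM GroupByVersion)
    UNION ALL
    (SELECT Name, UserId FROM GroupByVersion EXCEPT SELECT Name, UserId FROM WindowVersion)
) d;

SELECT @Differences AS Differences;
```

A result of 0 on sample data that includes ties is a decent sanity check before comparing the two versions for speed on the real table.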

So I shouldn't use Window Functions?

Not necessarily - it comes down to trade-offs.

I almost always start with a window function because of how easy they are to write and read. Plus I think they are fun to write as well.

However, if the window function is having to read/write a lot of data to tempdb and it's affecting the overall performance of your query, a rewrite may be necessary.

In that case, I'd much rather take the more verbose syntax to get a 2000x performance boost.