Temporary Staging Tables

Watch this week's video on YouTube

SQL Server Spool operators are a mixed bag. On one hand, they can negatively impact performance when writing data to disk in tempdb. On the other hand, they allow filtered and transformed result sets to be temporarily staged, making it easier for that data to be reused again during that query execution.

The problem with the latter scenario is that SQL Server doesn't always decide to use a spool; often it's happy to re-read (and re-process) the same data repeatedly. When this happens, one option you have is to explicitly create your own temporary staging table that will help SQL Server cache data it needs to reuse.

This post is a continuation of my series to document ways of refactoring queries for improved performance. I'll be using the StackOverflow 2014 data dump for these examples if you want to play along at home.

No spools

Let's start by looking at the following query:

WITH January2010Badges AS ( 
    SELECT 
        UserId,
        Name,
        Date
    FROM 
        dbo.Badges 
    WHERE 
        Date >= '2010-01-01' 
        AND Date <= '2010-02-01' 
), Next10PopularQuestions AS ( 
    SELECT TOP 10 * FROM (SELECT UserId, Name, Date FROM January2010Badges WHERE Name = 'Popular Question' ORDER BY Date OFFSET 10 ROWS) t
), Next10NotableQuestions AS ( 
    SELECT TOP 10 * FROM (SELECT UserId, Name, Date FROM January2010Badges WHERE Name = 'Notable Question' ORDER BY Date OFFSET 10 ROWS) t
), Next10StellarQuestions AS ( 
    SELECT TOP 10 * FROM (SELECT UserId, Name, Date FROM January2010Badges WHERE Name = 'Stellar Question' ORDER BY Date OFFSET 10 ROWS) t
)
SELECT UserId, Name FROM Next10PopularQuestions 
UNION ALL 
SELECT UserId, Name FROM Next10NotableQuestions
UNION ALL 
SELECT UserId, Name FROM Next10StellarQuestions 

Note: This is not necessarily the most efficient way to write this query, but it makes for a good demo.

This query is returning offset results for different badges from one month of data in the dbo.Badges table. While the query is using a CTE to make the logic easy to understand (i.e. filter the data to just January 2010 results and then calculate our offsets based on those results), SQL Server isn't actually saving the results of our January2010Badges expression in tempdb to get reused. If we view the execution plan, we'll see it reading from our dbo.Badges clustered index three times:

2019-05-06-18-30-12

Table 'Badges'. Scan count 27, logical reads 151137, ...

That means every time SQL Server needs to run our offset logic in each "Next10..." expression, it needs to rescan the entire clustered index to first filter on the Date column and then the Name column. This results in about 150,000 logical reads.

Divide and Conquer

One potential solution would be to add a nonclustered index that would allow SQL Server to avoid scanning the entire clustered index three times. But since this series is about improving performance without adding permanent indexes (since sometimes you are stuck in scenarios where you can't easily add or modify an index), we'll look at mimicking a spool operation ourselves.

We'll use a temporary table to stage our filtered January 2010 results so SQL Server doesn't have to scan the clustered index each time it needs to perform logic on that subset of data. For years I've referred to this technique as "temporary staging tables" or "faking spools", but at a recent SQL Saturday Jeff Moden told me he refers to it as "Divide and Conquer". I think that's a great name, so I'll use it going forward. Thanks Jeff!

First let's divide our query so that we insert our January 2010 data into its own temporary table:

DROP TABLE IF EXISTS #January2010Badges;
CREATE TABLE #January2010Badges
(
    UserId int,
    Name nvarchar(40),
    Date datetime
    CONSTRAINT PK_NameDateUserId PRIMARY KEY CLUSTERED (Name,Date,UserId)
);

INSERT INTO #January2010Badges
SELECT
    UserId,
    Name,
    Date
FROM 
    dbo.Badges
WHERE 
    Date >= '2010-01-01' 
    AND Date <= '2010-02-01'; 

You'll notice I added a clustered primary key which will index the data in an order that will make filtering easier.

Next, we conquer by changing the rest of our query to read from our newly created temp table:

WITH Next10PopularQuestions AS ( 
    SELECT TOP 10 * FROM (SELECT UserId, Name, Date FROM #January2010Badges WHERE Name = 'Popular Question' ORDER BY Date OFFSET 10 ROWS) t
), Next10NotableQuestions AS ( 
    SELECT TOP 10 * FROM (SELECT UserId, Name, Date FROM #January2010Badges WHERE Name = 'Notable Question' ORDER BY Date OFFSET 10 ROWS) t
), Next10StellarQuestions AS ( 
    SELECT TOP 10 * FROM (SELECT UserId, Name, Date FROM #January2010Badges WHERE Name = 'Stellar Question' ORDER BY Date OFFSET 10 ROWS) t
)
SELECT UserId, Name FROM Next10PopularQuestions 
UNION ALL 
SELECT UserId, Name FROM Next10NotableQuestions 
UNION ALL 
SELECT UserId, Name FROM Next10StellarQuestions 

Running this all together, we get the following plans and logical read counts:

2019-05-06-18-35-56

Table 'Badges'. Scan count 9, logical reads 50379, ...

(42317 rows affected)

(20 rows affected)
Table '#January2010Badges______________________________00000000003B'. Scan count 3, logical reads 12, ...

In this version of the query, SQL Server scans the clustered index a single time and saves that data to a temporary table. In the subsequent SELECTs, it seeks from this much smaller temporary table instead of going back to the clustered index, reducing the total amount of reads to 50379 + 12 = 50392: about a third of what the original query was doing.

Temporary Staged Data

At the end of day, you can hope that SQL Server creates a spool to temporarily stage or data, or you can be explicit about it and do it yourself. Either option is going to increase usage on your tempdb database, but at least by defining the temporary table yourself you can customize and index it to achieve maximum reuse and performance for your queries.

It's important to note that this is not a technique you want to abuse: writing and reading too much data from tempdb can cause contention problems that can make you worse off than having allowed SQL Server to scan your clustered index three times. However, when implemented sparingly and for good reasons, this technique can greatly improve the performance of certain queries.

IN vs UNION ALL

Watch this week's video on YouTube

When you need to filter query results on multiple values, you probably use an IN() statement or multiple predicates separated by ORs:

WHERE Col1 IN ('A','B','C')

or

WHERE Col1 = 'A' OR Col1 = 'B' OR Col1 = 'C'

While SQL Server will generate the same query plan for either syntax, there is another technique you can try that can sometimes can improve performance under certain conditions: UNION ALL.

This post is a continuation of my series to document ways of refactoring queries for improved performance. I'll be using the StackOverflow 2014 data dump for these examples if you want to play along at home.

Lookups and Scans

Let's say we have the following index on our dbo.Badges table:

CREATE NONCLUSTERED INDEX [IX_Badges] ON [dbo].[Badges] ([Name]) INCLUDE ([UserId]);

Next let's run these two separate queries:

/* Query 1 */
SELECT 
    Name, UserId, Date 
FROM 
    dbo.Badges 
WHERE 
    Name = 'Benefactor'
OPTION(MAXDOP 1)

/* Query 2 */
SELECT 
    Name, UserId, Date 
FROM 
    dbo.Badges 
WHERE 
    Name = 'Research Assistant'
OPTION(MAXDOP 1)

Note I'm enforcing MAXDOP 1 here to remove any performance differences due to parallelism in these demos.

The nonclustered index doesn't cover these queries - while SQL Server can seek the index for the Name predicate in the WHERE clause, it can't retrieve all the columns in the SELECT from the index alone. This leaves SQL Server with a tough choice to make:

  1. Does it scan the whole clustered index to return all the required columns for the rows requested?
  2. Does it seek to the matching records in the nonclustered index and then perform a key lookup to retrieve the remaining data?

So, what does SQL Server decide to do?

2019-04-26-15-46-29

For Query 1, SQL Server thinks that reading the entire clustered index and returning only the rows where Name = 'Benefactor' is the best option.

SQL Server takes a different approach for Query 2 however, using the non-covering nonclustered indexes to find the records with Name = 'Research Assistant' and then going to look up the Date values in the clustered index via a Key Lookup

The reason SQL server chooses these two different plans is because it thinks it will be faster to return smaller number of records with a Seek + Key Lookup approach ("Research Assistant", 127 rows), but faster to return a larger number of records with a Scan ("Benefactor", 17935 rows).

Kimberly Tripp has an excellent post that defines where this "tipping point" from a key lookup to a clustered index scan typically occurs, but the important thing to keep in mind for this post is that we can sometimes use SQL Server's ability to switch between these two approaches to our advantage.

Combining Queries with IN

So, what plan does SQL Server generate when we combine our two queries into one?

SELECT 
    Name, UserId, Date 
FROM 
    dbo.Badges 
WHERE 
    Name IN ('Benefactor','Research Assistant')
OPTION(MAXDOP 1)

2019-04-25-21-39-29

Interestingly enough SQL Server decides to retrieve the requested rows from the nonclustered index and then go lookup the remaining Date column in the clustered index.

If we look at the page reads (SET STATISTICS IO ON;) we'll see SQL Server had to read 85500 pages to return the data requested:

(18062 rows affected)
Table 'Badges'. Scan count 2, logical reads 85500, physical reads 20, read-ahead reads 33103, ...

Without correcting our index to include the Date column, is there some way we can achieve the same results with better performance?

UNION ALL

In this case it's possible to rewrite our query logic to use UNION ALL instead of IN/ORs:

SELECT 
    Name,UserId,Date 
FROM 
    dbo.Badges 
WHERE 
    Name = 'Benefactor' 
UNION ALL
SELECT 
    Name,UserId,Date 
FROM 
    dbo.Badges 
WHERE 
    Name = 'Research Assistant'
OPTION(MAXDOP 1)

2019-04-25-21-40-09

We get the same exact results through a hybrid execution plan.

In this case, our plan mirrors what SQL Server did when running our original two queries separately:

  • The rows where Name = 'Benefactor' are returned by scanning the clustered index.
  • The nonclustered index is seeked with clustered index lookups for the Name = 'Research Assistant' records.

Looking at the IO statistics for this UNION ALL query:

(18062 rows affected)
Table 'Badges'. Scan count 2, logical reads 50120, physical reads 6, read-ahead reads 49649, ...

Even though this query reads the whole clustered index to get the Benefactor rows, the total number of logical reads is still smaller than the seek/key lookup pattern seen in the combined query with IN(). This UNION ALL version gives SQL Server the ability to build a hybrid execution plan, combining two different techniques to generate a plan with fewer overall reads.

IN or UNION ALL?

There's no way to know for sure without trying each variation.

But if you have a slow performing query that is filtering on multiple values within a column, it might be worth trying to get SQL Server to use a different plan by rewriting the query.

OR vs UNION ALL - Is One Better For Performance?

Today I want to show you a trick that could make your queries run faster.

It won't always work, but when it does everyone will be impressed with your performance tuning prowess.  Let's go!

Watch this week's video on YouTube

Our Skewed Data

Let's create a table and insert some data.

Notice the heavily skewed value distribution.  Also notice how we have a clustered index and a very skimpy nonclustered index:

DROP DATABASE IF EXISTS ORUnionAll
CREATE DATABASE ORUnionAll
GO

CREATE TABLE ORUnionAll.dbo.TestData
(
    Col1 int,
    Col2 char(200),
    Col3 int 
)
GO

INSERT INTO ORUnionAll.dbo.TestData VALUES (1,'',1)
GO 10000
INSERT INTO ORUnionAll.dbo.TestData VALUES (2,'',2)
GO 50
INSERT INTO ORUnionAll.dbo.TestData VALUES (3,'',3)
GO 50

CREATE CLUSTERED INDEX CL_Col1 ON ORUnionAll.dbo.TestData ( Col1 )
GO

CREATE NONCLUSTERED INDEX IX_Col3 ON ORUnionAll.dbo.TestData (Col3)
GO

If we write a query that filters on one of the low-occurrence values in Col3, SQL Server will perform an index seek with a key lookup (since our skimpy nonclustered index doesn't cover all of the columns in our SELECT):

SELECT 
    Col2, Col3
FROM 
    ORUnionAll.dbo.TestData
WHERE
    Col3 = 2

index-seek-with-lookup

If we then add an OR to our WHERE clause and filter on another low-occurrence value in Col3, SQL Server changes how it wants to retrieve results:

SELECT 
    Col2, Col3
FROM 
    ORUnionAll.dbo.TestData
WHERE
    Col3 = 2 OR Col3 = 3

index-scan-with-ors

Suddenly those key-lookups become too expensive for SQL Server and the query optimizer thinks it'll be faster to just scan the entire clustered index.

In general this makes sense; SQL Server tries to pick plans that are good enough in most scenarios, and in general I think it chooses wisely.

However, sometimes SQL Server doesn't pick great plans.  Sometimes the plans it picks are downright terrible.

If we encountered a similar scenario in the real-world where our tables had more columns, more rows, and larger datatypes, having SQL Server switch from a seek to a scan could kill performance.

So what can we do?

Solutions...maybe

The first thing that comes to mind is to modify or add some indexes.

But maybe our (real-world) table already has too many indexes.  Or maybe we are working with a data source where we can't modify our indexes.

We could also use the FORCESEEK hint, but I don't like using hints as permanent solutions because they feel dirty (and are likely to do unexpected things as your data changes).

One solution to UNION ALL

One solution that a lot of people overlook is rewriting the query so that it uses UNION ALLs instead of ORs.

A lot of the time it's pretty easy to refactor the query to multiple SELECT statements with UNION ALLs while remaining logically the same and returning the same results:

SELECT 
    Col2, Col3
FROM 
    ORUnionAll.dbo.TestData
WHERE 
    Col3 = 2
UNION ALL
SELECT 
    Col2, Col3
FROM 
    ORUnionAll.dbo.TestData
WHERE
    Col3 = 3

Sure, the query is uglier and will be a bigger pain to maintain if you need to make changes in the future, but sometimes we have to suffer for ~~fashion~~ query performance.

But does our UNION ALL query perform better?

union-alls-with-seeks

Well the plan shows seeks, but as Erik Darling recently pointed out, seeks aren't always a good thing.

So let's compare the reads of the OR query versus the UNION ALL query using SET STATISTICS IO ON:

logical-reads

So in this case, tricking SQL Server to pick a a different plan by using UNION ALLs gave us a performance boost.  The difference in reads isn't that large in the above scenario, but I've had this trick take my queries from minutes to seconds in the real world.

So the next time you are experiencing poor performance from a query with OR operators in it, try rewriting it using UNION ALLs.

It's not always going to fix your performance problem but you won't know until you give it a try.