Temporary Staging Tables

Published on: 2019-05-07

Watch this week’s episode on YouTube.

SQL Server Spool operators are a mixed bag. On one hand, they can negatively impact performance when writing data to disk in tempdb. On the other hand, they allow filtered and transformed result sets to be temporarily staged, making it easier for that data to be reused again during that query execution.

The problem with the latter scenario is that SQL Server doesn’t always decide to use a spool; often it’s happy to re-read (and re-process) the same data repeatedly. When this happens, one option you have is to explicitly create your own temporary staging table that will help SQL Server cache data it needs to reuse.

This post is a continuation of my series to document ways of refactoring queries for improved performance. I’ll be using the StackOverflow 2014 data dump for these examples if you want to play along at home.

No spools

Let’s start by looking at the following query:

WITH January2010Badges AS ( 
	SELECT 
		UserId,
		Name,
		Date
	FROM 
		dbo.Badges 
	WHERE 
		Date >= '2010-01-01' 
		AND Date <= '2010-02-01' 
), Next10PopularQuestions AS ( 
	SELECT TOP 10 * FROM (SELECT UserId, Name, Date FROM January2010Badges WHERE Name = 'Popular Question' ORDER BY Date OFFSET 10 ROWS) t
), Next10NotableQuestions AS ( 
	SELECT TOP 10 * FROM (SELECT UserId, Name, Date FROM January2010Badges WHERE Name = 'Notable Question' ORDER BY Date OFFSET 10 ROWS) t
), Next10StellarQuestions AS ( 
	SELECT TOP 10 * FROM (SELECT UserId, Name, Date FROM January2010Badges WHERE Name = 'Stellar Question' ORDER BY Date OFFSET 10 ROWS) t
)
SELECT UserId, Name FROM Next10PopularQuestions 
UNION ALL 
SELECT UserId, Name FROM Next10NotableQuestions
UNION ALL 
SELECT UserId, Name FROM Next10StellarQuestions 

Note: This is not necessarily the most efficient way to write this query, but it makes for a good demo.

This query is returning offset results for different badges from one month of data in the dbo.Badges table. While the query is using a CTE to make the logic easy to understand (i.e. filter the data to just January 2010 results and then calculate our offsets based on those results), SQL Server isn’t actually saving the results of our January2010Badges expression in tempdb to get reused. If we view the execution plan, we’ll see it reading from our dbo.Badges clustered index three times:

Three clustered index scans
Table 'Badges'. Scan count 27, logical reads 151137, ...

That means every time SQL Server needs to run our offset logic in each “Next10…” expression, it needs to rescan the entire clustered index to first filter on the Date column and then the Name column. This results in about 150,000 logical reads.

Divide and Conquer

One potential solution would be to add a nonclustered index that would allow SQL Server to avoid scanning the entire clustered index three times. But since this series is about improving performance without adding permanent indexes (since sometimes you are stuck in scenarios where you can’t easily add or modify an index), we’ll look at mimicking a spool operation ourselves.

We’ll use a temporary table to stage our filtered January 2010 results so SQL Server doesn’t have to scan the clustered index each time it needs to perform logic on that subset of data. For years I’ve referred to this technique as “temporary staging tables” or “faking spools”, but at a recent SQL Saturday Jeff Moden told me he refers to it as “Divide and Conquer”. I think that’s a great name, so I’ll use it going forward. Thanks Jeff!

First let’s divide our query so that we insert our January 2010 data into its own temporary table:

DROP TABLE IF EXISTS #January2010Badges;
CREATE TABLE #January2010Badges
(
	UserId int,
	Name nvarchar(40),
	Date datetime
	CONSTRAINT PK_NameDateUserId PRIMARY KEY CLUSTERED (Name,Date,UserId)
);

INSERT INTO #January2010Badges
SELECT
	UserId,
	Name,
	Date
FROM 
	dbo.Badges
WHERE 
	Date >= '2010-01-01' 
	AND Date <= '2010-02-01'; 

You’ll notice I added a clustered primary key which will index the data in an order that will make filtering easier.

Next, we conquer by changing the rest of our query to read from our newly created temp table:

WITH Next10PopularQuestions AS ( 
	SELECT TOP 10 * FROM (SELECT UserId, Name, Date FROM #January2010Badges WHERE Name = 'Popular Question' ORDER BY Date OFFSET 10 ROWS) t
), Next10NotableQuestions AS ( 
	SELECT TOP 10 * FROM (SELECT UserId, Name, Date FROM #January2010Badges WHERE Name = 'Notable Question' ORDER BY Date OFFSET 10 ROWS) t
), Next10StellarQuestions AS ( 
	SELECT TOP 10 * FROM (SELECT UserId, Name, Date FROM #January2010Badges WHERE Name = 'Stellar Question' ORDER BY Date OFFSET 10 ROWS) t
)
SELECT UserId, Name FROM Next10PopularQuestions 
UNION ALL 
SELECT UserId, Name FROM Next10NotableQuestions 
UNION ALL 
SELECT UserId, Name FROM Next10StellarQuestions 

Running this all together, we get the following plans and logical read counts:

Clustered index seeks
Table 'Badges'. Scan count 9, logical reads 50379, ...

(42317 rows affected)

(20 rows affected)
Table '#January2010Badges______________________________00000000003B'. Scan count 3, logical reads 12, ...

In this version of the query, SQL Server scans the clustered index a single time and saves that data to a temporary table. In the subsequent SELECTs, it seeks from this much smaller temporary table instead of going back to the clustered index, reducing the total amount of reads to 50379 + 12 = 50392: about a third of what the original query was doing.

Temporary Staged Data

At the end of day, you can hope that SQL Server creates a spool to temporarily stage or data, or you can be explicit about it and do it yourself. Either option is going to increase usage on your tempdb database, but at least by defining the temporary table yourself you can customize and index it to achieve maximum reuse and performance for your queries.

It’s important to note that this is not a technique you want to abuse: writing and reading too much data from tempdb can cause contention problems that can make you worse off than having allowed SQL Server to scan your clustered index three times. However, when implemented sparingly and for good reasons, this technique can greatly improve the performance of certain queries.

Thanks for reading. You might also enjoy following me on Twitter.

Want to learn even more SQL?

Sign up for my newsletter to receive weekly SQL tips!

4 Ways To Define Lookup Values In A Query

Published on: 2018-06-05

Watch this week’s episode on YouTube.

Occasionally you may need to create a lookup table of values for a single query.  Building a permanent table of these values is a good option if you need to reuse the values frequently, but for one-off, ad-hoc queries you may want to hard-code the lookup table of values directly in your query.

We can hard-code these values using several different techniques.  Below are the techniques I’ve seen used before, going in order from my least to most favorite.

Table Variables

DECLARE @T TABLE 
(
    ColorName varchar(20), 
    HexCode char(7)
);

INSERT INTO @T VALUES 
        ('FireBrick','#B22222'),
        ('HotPink','#FF69B4'),
        ('Tomato','#FF6347'),
        ('PapayaWhip','#FFEFD5'),
        ('RebeccaPurple','#663399'),
        ('LawnGreen','#7CFC00'),
        ('MidnightBlue','#191970');

SELECT * FROM @T;

Table variables get a bad rap in SQL Server because they don’t provide accurate row estimates to the query optimizer.  This can cause SQL Server to come up with some really terrible execution plans that will kill your query’s performance.

However, if your use case is a single small lookup table of less than 100 records, table variables might actually be a viable option.  They are quick and easy to setup and can be added to the top of your query.

With that said, I don’t think I’ve ever used a table variable in this type of scenario (or any scenario really).  I know some people love using them and I think that’s fine as long as you are keeping track of your query performance.  For me though, there are so many better options available…

Temporary Tables

CREATE TABLE #T 
(
    ColorName varchar(20), 
    HexCode char(7)
);

INSERT INTO #T VALUES 
        ('FireBrick','#B22222'),
        ('HotPink','#FF69B4'),
        ('Tomato','#FF6347'),
        ('PapayaWhip','#FFEFD5'),
        ('RebeccaPurple','#663399'),
        ('LawnGreen','#7CFC00'),
        ('MidnightBlue','#191970');

SELECT * FROM #T;

Temp tables are the answer to many of the table variable’s shortcomings.

Temp tables can perform well with larger amounts of data because they can be indexed and can have statistics generated on them.  Both of these features typically help SQL Server generate better execution plans.

There is some overhead in coding a temp table though: just like a table variable, a temp table needs to be created and inserted into before being able to use it in your query.  While normally not a huge deal, this is not something I want to have to do in those instances where I want to define some lookup values quickly…

SELECT with UNION ALL

SELECT
    *
FROM
    (
        SELECT 'FireBrick','#B22222'     UNION ALL
        SELECT 'HotPink','#FF69B4'   UNION ALL
        SELECT 'Tomato','#FF6347'    UNION ALL
        SELECT 'PapayaWhip','#FFEFD5'    UNION ALL
        SELECT 'RebeccaPurple','#663399' UNION ALL
        SELECT 'LawnGreen','#7CFC00'     UNION ALL
        SELECT 'MidnightBlue','#191970'
    ) T(ColorName,HexCode);

The next option is hard-coding values in SELECT statements and then joining them together with UNION ALLs.

This is probably the most common technique I see, and for good reason: the syntax is straight forward and doesn’t require any special setup; perfect for the one-time use ad-hoc scenario.

Its format also makes it easy to use the ALT + highlight shortcut to quickly create a derived table lookup from the results of another query or values copied from elsewhere.

I do like this method a lot, but there is one method that I like slightly more…

The VALUES() Constructor

SELECT
    *
FROM
    (VALUES 
        ('FireBrick','#B22222'),
        ('HotPink','#FF69B4'),
        ('Tomato','#FF6347'),
        ('PapayaWhip','#FFEFD5'),
        ('RebeccaPurple','#663399'),
        ('LawnGreen','#7CFC00'),
        ('MidnightBlue','#191970')
        ) T(ColorName,HexCode);

You’ve probably use the VALUES constructor in an INSERT INTO statement, but did you know you can also use it in a FROM clause?

This syntax is similar to our SELECT + UNION ALL technique above, except we are swapping in single quotes and parentheses for SELECTs and UNION ALLs.  I find this slightly easier to write, if only because it requires typing out fewer characters and feels more programmatic.

One Last Tip: CTE

WITH HtmlColors AS (
        SELECT 'FireBrick' AS ColorName,'#B22222' AS HexCode     UNION ALL
        SELECT 'HotPink','#FF69B4'   UNION ALL
        SELECT 'Tomato','#FF6347'    UNION ALL
        SELECT 'PapayaWhip','#FFEFD5'    UNION ALL
        SELECT 'RebeccaPurple','#663399' UNION ALL
        SELECT 'LawnGreen','#7CFC00'     UNION ALL
        SELECT 'MidnightBlue','#191970'
)

SELECT * FROM HtmlColors

This isn’t really an additional technique, but something related that I use often and feels appropriate to mention.

If using either the SELECT + UNION ALL or VALUES techniques, you can put those lookup queries into a common table expression for easier referencing.

This doesn’t give any performance advantage, but it does help keep your code clean by putting your lookup logic right at the top of your file.  This becomes particularly useful when using those hard-coded lookup values as parameters, allowing all changes to your query to be made right at the top of your file during subsequent runs.

Thanks for reading. You might also enjoy following me on Twitter.

Want to learn even more SQL?

Sign up for my newsletter to receive weekly SQL tips!