Data with Bert - correlated subquery

Watch this week's video on YouTube

Correlated subqueries provide an intuitive syntax for writing queries that return related data. However, they often perform poorly due to needing to execute once for every value they join on.

The good news is that many correlated subqueries can be rewritten to use a derived table for improved performance.

This post is a continuation of my series to document ways of refactoring queries for improved performance. I'll be using the StackOverflow 2014 data dump for these examples if you want to play along at home.

When was each user's first badge awarded?

StackOverflow awards users badges for things like asking good questions, hitting certain vote thresholds, and more.

I want to write a query that figures out on what date did each user receive their first badge.

Using a correlated subquery, I might write my query as follows:

SET STATISTICS IO, TIME ON;

SELECT DISTINCT
    UserId,
    FirstBadgeDate = (SELECT MIN(Date) FROM dbo.Badges i WHERE o.UserId = i.UserId)
FROM
    dbo.Badges o

The syntax of the correlated subquery here makes it clear that for each UserId we want to return the MIN(Date) associated with that UserId from the badges table.

Looking at the execution plan and time and IO statistics (abbreviated for clarity) we see:

2019-04-18-07-11-57

(1318413 rows affected)
Table 'Worktable'. Scan count 0, logical reads 0, ...
Table 'Workfile'. Scan count 0, logical reads 0, ...
Table 'Badges'. Scan count 2, logical reads 43862, ...

(1 row affected)

 SQL Server Execution Times:
   CPU time = 3625 ms,  elapsed time = 8347 ms.

So, what's going on here? We read ~8 million rows of data from our index on the dbo.Badges table and then calculate the MIN(Date) for each UserId. This is the "correlated" part of our query, which then gets joined back to the dbo.Badges table using a Hash Match join operator.

Our join doesn't eliminate any rows so the ~8 million rows continue flowing through until near the very end where we have another Hash Match operator, this time being used to dedupe the rows for the DISTINCT part of query, reducing the final result to ~1 million rows.

Eliminating the Correlated Subquery

What would things look like if we rewrote this correlated subquery as a derived table in the FROM clause?

SELECT DISTINCT
    o.UserId,
    FirstBadgeDate
FROM
    dbo.Badges o
    INNER JOIN 
        (SELECT 
            UserId, 
            MIN(Date) as FirstBadgeDate 
        FROM 
            dbo.Badges GROUP BY UserId
        ) i
    ON o.UserId = i.UserId

2019-04-18-07-26-36

(1318413 rows affected)
Table 'Workfile'. Scan count 0, logical reads 0, ...
Table 'Worktable'. Scan count 0, logical reads 0, ...
Table 'Badges'. Scan count 2, logical reads 43862, ...

(1 row affected)

 SQL Server Execution Times:
   CPU time = 2516 ms,  elapsed time = 5350 ms.

If we look at the IO statistics, it's interesting to note that there is no difference in reads between these two queries.

Looking at the CPU time statistics however, this derived table query consistently comes in about 33% faster than the correlated subquery example. Why is that?

Looking at the execution plan reveals some details: in this plan, you can see we read in from the dbo.Badges index and go straight into a Hash Match operator. The top stream is deduping our data on UserId, taking it from ~8 million rows to ~1 million rows. The bottom stream does the same deduping while also calculating the MIN(DATE) for each UserId grouping.

When both of those streams join together, the final hash match operator is only joining ~1 million rows with ~1 million rows (as opposed to the first query that was joining ~8 million rows with ~1 million rows).

This last join is the reason for the performance improvement: because this execution plan can reduce the number of rows sooner the final join ends up having to do less work. Additionally, the records were already distinct going into the join, saving us from an extra deduping step.

Further Reducing Redundancy

You may have noticed that both of these queries are a little redundant: they both call on the dbo.Badges table unnecessarily. The best option to improve query performance would be to rewrite it as:

SELECT 
    UserId, 
    MIN(Date) as FirstBadgeDate 
FROM 
    dbo.Badges 
GROUP BY 
    UserId

2019-04-18-07-48-58-1

While this is the most efficient query of the three, most real-world queries and scenarios aren't this easy to simplify.

When your queries have more joins, WHERE clauses, and more, knowing how to refactor from a correlated subquery to a derived table query is critical to potentially improving performance.

Correlated Subqueries vs Derived Tables

When was each user's first badge awarded?

Eliminating the Correlated Subquery

Further Reducing Redundancy