Window Functions vs GROUP BYs

Published on: 2019-04-16


There are many options available for improving the performance of a query: indexes, statistics, configuration settings, etc…

However, not all environments allow you to use those features (eg. vendor databases), leaving query rewriting as the only option.

This is the first post in a series to document common ways to refactor queries without otherwise altering the database. The goal of these posts will be to provide examples of performance pitfalls in queries and how to rewrite those queries to generate different query plans that (hopefully) improve performance.

I’ll be using the StackOverflow 2014 data dump for these examples if you want to play along at home.

Who was first to earn each badge?

StackOverflow awards users badges for things like asking good questions, hitting certain vote thresholds, and more.

I want to write a query that figures out who is the first person awarded each badge. In cases where there is a tie for the first person to receive that badge, I want to return the user with the lowest UserId.

Window functions make this type of question easy to write a query for:

SELECT DISTINCT
    Name,
    FIRST_VALUE(UserId) OVER (PARTITION BY Name ORDER BY Date,UserId) AS UserId
FROM
    dbo.Badges b
ORDER BY
    Name,UserId

If you’ve used FIRST_VALUE before, this query should be easy to interpret: for each badge Name, return the first UserId sorted by Date (earliest date to receive the badge) and UserId (pick the lowest UserId when there are ties on Date).

This query was easy to write and is simple to understand. However, the performance is not great: it takes 46 seconds to finish returning results on my machine.

Window function execution plan

Note: I assumed this table started off with the following index:

CREATE NONCLUSTERED INDEX IX_Badges__Name_Date_UserId ON [dbo].[Badges] (Name,Date,UserId);

Why so slow?

If we SET STATISTICS IO ON, we’ll notice that SQL Server reads 46,767 pages from a nonclustered index. Since we aren’t filtering our data, there’s not much we can do to make that faster.
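
If you want to check the reads yourself, turning on STATISTICS IO before re-running the original query will print the page counts to the Messages tab (your exact numbers will vary with your copy of the data):

SET STATISTICS IO ON;

SELECT DISTINCT
    Name,
    FIRST_VALUE(UserId) OVER (PARTITION BY Name ORDER BY Date,UserId) AS UserId
FROM
    dbo.Badges b
ORDER BY
    Name,UserId;

-- The Messages tab now reports the logical reads against dbo.Badges
SET STATISTICS IO OFF;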

Reading right to left, next up we see two Segment operators. These don’t add much overhead: since our data is already sorted on our segments/groups, it’s trivial for SQL Server to identify when the sorted rows change values.

Next up is the Window Spool operator which “Expands each row into the set of rows that represent the window associated with it.” While it looks innocent by having a low relative cost, this operator is writing 8 million rows/reading 16 million rows (because of how Window Spool works) from tempdb. Ouch.
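
As an aside that this post’s rewrite doesn’t depend on (so treat it as a sketch to test on your own data): the on-disk spooling comes from the default RANGE frame that FIRST_VALUE uses when only an ORDER BY is specified, and spelling out a ROWS frame is a commonly cited way to let the Window Spool work from memory instead:

-- Same logical result as the original query; the explicit ROWS frame
-- can allow an in-memory Window Spool instead of the on-disk RANGE spool
SELECT DISTINCT
    Name,
    FIRST_VALUE(UserId) OVER (
        PARTITION BY Name
        ORDER BY Date,UserId
        ROWS UNBOUNDED PRECEDING
    ) AS UserId
FROM
    dbo.Badges b
ORDER BY
    Name,UserId;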

After that, the Stream Aggregate and Compute Scalar operators check whether the first value in each window returned from the Window Spool is null and then return the first non-null value. These operations are also relatively painless since the data flowing through is already sorted.

The Hash Match operator then dedupes the data for our DISTINCT, and finally we sort the remaining ~2k rows for our output.

So while our query looks simple, the fact that our whole table of data is getting written to and read from tempdb before being deduped and sorted is a real performance killer.

Removing tempdb usage the old-fashioned way

When I say “the old-fashioned way”, I mean rewriting our window function to use more traditional aggregate functions and a GROUP BY:

SELECT
    b.Name,
    MIN(b.UserId) AS UserId
FROM
    dbo.Badges b
    INNER JOIN
    (
    SELECT
        Name,
        MIN(Date) AS Date
    FROM
        dbo.Badges
    GROUP BY
        Name
    ) m
        ON b.Name = m.Name
        AND b.Date = m.Date
GROUP BY
    b.Name
ORDER BY
    Name,UserId

I think by most people’s standards, this query is not as easy to read. While not overly complex, it does take up a lot more screen space and is complicated by multiple GROUP BYs and a derived table.

And while the query may look ugly on the outside, it’s what lies below the surface that really matters:

GROUP BY execution plan

What a beautifully simple execution plan. And it finishes executing almost instantly.

Let’s break down what’s going on. First, we start with similar Index Scan and Segment operators as the previous query so no real difference there.

At this point you may have noticed that while the written query uses two GROUP BYs and two MIN functions joined together, the execution plan contains only one Index Scan, one aggregation, and no join at all.

SQL Server can use an optimization with the Top operator that allows it to take the sorted data and return only the Name and UserId rows for the top Name and Date values within each group (essentially matching the MIN logic). This is a great example of how the optimizer can take a declarative SQL query and decide how to efficiently return the data it needs.

At this point, the Top operator filters our 8 million rows down to around 30k rows. Those 30k rows get deduped a lot faster by our Stream Aggregate operator, and since the data is already sorted we don’t need an extra Sort operator.

Overall, this second query runs so much better than the original because SQL Server doesn’t have to go to tempdb for any operations – all the data is pre-sorted in the index and can flow through.

So I shouldn’t use Window Functions?

Not necessarily – it comes down to trade-offs.

I almost always start with a window function because of how easy they are to write and read. Plus, I think they’re fun to write.

However, if the window function is having to read/write a lot of data to tempdb and it’s affecting the overall performance of your query, a rewrite may be necessary.

In that case, I’d much rather take the more verbose syntax to get a 2000x performance boost.


Data Type Precedence and Implicit Conversions

Published on: 2019-04-09


SQL Server needs to make sure data types match when performing operations that involve multiple pieces of data.

When the data types do not match, SQL Server has to implicitly convert the data before performing any operations.

While most of the time these implicit conversions go unnoticed, they are important to understand since they can lead to unexpected results.

When 4.4/.44 Doesn’t Equal 10

Let’s start with this example:

SELECT 4.4/CAST(.44 AS VARCHAR(5))

Ignoring for a moment that our denominator is of type VARCHAR, if we do some quick mental math or use a calculator, we can see that the answer should be 10:

Calculating some simple math

However, if we look at the result SQL Server returns, it’s strangely 11:

Eleven? Why eleven?

To understand why this happens, we need to understand SQL Server’s data type precedence logic.

Data Type Precedence

If we start with a simpler version of this example, we’ll see SQL Server does in fact know how to perform math and return an answer of 10:

SELECT 4.4/.44
10

We can use the SQL_VARIANT_PROPERTY() function to see what data types SQL Server is assuming we are using in our calculation:

SELECT 
    SQL_VARIANT_PROPERTY(4.4,'BaseType'),
    SQL_VARIANT_PROPERTY(.44,'BaseType'),
    SQL_VARIANT_PROPERTY(4.4/.44,'BaseType')
Data types

In this case, since the data types of both components are numeric, SQL Server doesn’t have to break a sweat making any drastic conversions to give us our expected answer. If we instead give it something a little more challenging:

SELECT 4.4/CAST(.44 AS FLOAT)
Still a 10

Here we still return the result of 10; however, SQL Server had to do a little more work. We know from the previous example that 4.4 on its own is of type numeric, and in this example we are explicitly casting .44 to a float. SQL Server isn’t able to perform the division operation directly on two different data types, so it refers to its data type precedence table to decide which value to convert to a matching data type.

In that precedence table, the float data type appears higher on the list than the numeric (synonym: decimal) data type. This causes SQL Server to convert our numeric 4.4 to a float before dividing.
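
We can sanity check that conversion with the same SQL_VARIANT_PROPERTY() trick from earlier:

-- The result of dividing a numeric by a float comes back as a float
SELECT SQL_VARIANT_PROPERTY(4.4/CAST(.44 AS FLOAT),'BaseType');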

While SQL Server is doing extra work behind the scenes that we didn’t explicitly request, we can’t be too angry with it since it still is giving us the “correct” answer.

“Incorrect” Conversions

Let’s look at something a little more dastardly:

SELECT CAST(4.4 AS NUMERIC)/CAST(.44 AS FLOAT)

You might think this should also return 10 based on the previous example, but in fact it returns 9.090909:

Not quite 10

While we are still witnessing implicit conversion here (the numeric gets converted to a float in order to allow SQL Server to perform the division), we are also experiencing a case of default data type precision and scale. If we use the SQL_VARIANT_PROPERTY() function again to not only reveal base type but also precision and scale, we’ll notice that when we let SQL Server “guess” the scale, it correctly chooses 1 decimal place, while when we use the default scale associated with numeric we get 0:

SELECT 
    SQL_VARIANT_PROPERTY(4.4,'BaseType'),
    SQL_VARIANT_PROPERTY(4.4,'Precision'),
    SQL_VARIANT_PROPERTY(4.4,'Scale') 

SELECT 
    SQL_VARIANT_PROPERTY(CAST(4.4 AS NUMERIC),'BaseType'),
    SQL_VARIANT_PROPERTY(CAST(4.4 AS NUMERIC),'Precision'),
    SQL_VARIANT_PROPERTY(CAST(4.4 AS NUMERIC),'Scale')
Two numerics with different precisions and scales

What this really means is that the decimal portion of 4.4 is getting chopped off, leaving us with an equation of 4/.44 = 9.090909.
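
A quick check with the same values shows that truncation in action, since an unsized NUMERIC defaults to NUMERIC(18,0):

SELECT
    CAST(4.4 AS NUMERIC) AS TruncatedNumerator,             -- 4
    CAST(4.4 AS NUMERIC)/CAST(.44 AS FLOAT) AS Result;      -- 9.090909...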

Putting Everything Together

So back to our original example that returns 11:

SELECT 4.4/CAST(.44 AS VARCHAR(5))

What exactly is going on here? Well for starters, data type precedence is forcing SQL Server to convert the VARCHAR .44 to a numeric. But a numeric with what precision and scale?

SELECT 
    SQL_VARIANT_PROPERTY(4.4,'BaseType'),
    SQL_VARIANT_PROPERTY(4.4,'Precision'),
    SQL_VARIANT_PROPERTY(4.4,'Scale') 
Precision 2, scale 1

Since SQL Server determines that our numerator is NUMERIC(2,1), it converts the denominator to a NUMERIC(2,1) as well. This means instead of dividing by .44, we end up dividing by .4 which results in 11:

-- An explicit version of the same calculation
SELECT CAST(4.4 AS NUMERIC(2,1))/CAST(.44 AS NUMERIC(2,1))
Eleventy

SQL Server Isn’t Wrong

While it’s easy to blame SQL Server for not knowing how to perform simple mathematical operations, it’s us, the developers, who are to blame. SQL Server is a piece of software following rules for how to handle the ambiguous situations we feed it.

Having SQL Server throw an error instead of assuming data types and implicitly converting data on our behalf would make things less ambiguous, but it would also make simple operations a lot more tedious (looking at you, SSIS).

In reality, the best solution is to accurately define data types from initial creation and to always be explicit with data types when the value of your computations matters.
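
As an illustration of that last point, explicitly sizing both sides of the original calculation removes all of the guesswork and returns the expected 10:

-- Explicit precision and scale keep both values intact
SELECT CAST(4.4 AS NUMERIC(3,2))/CAST(.44 AS NUMERIC(3,2));  -- 10.000000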


Testing with Temporary Stored Procedures

Published on: 2019-04-02


A while back I learned that it’s possible to create temporary stored procedures in SQL Server.

However, I never put that knowledge into practice because I struggled to think of a good use case where a temporary stored procedure would be preferable to a permanent stored procedure.

Not long ago I encountered a scenario where using a temporary stored procedure was the perfect solution to my problem.

Building New Tables

Recently, I had to build a new version of a legacy table because the legacy table’s upstream data source was going to be retired.

The new table would contain all the same data as the legacy table, but populated from the new data source. It would also include additional rows and columns. After building the final table, the plan was to create a view to replace the functionality of the legacy table.

I had to spend quite a bit of time writing a fairly elaborate query to make the data for the new data source match what was appearing in the legacy table (remember my gaps and islands post from a few weeks back? Imagine that on steroids).

Before building out the new ETL and tables though, I wanted to be able to test that the new query was producing the correct results. This was challenging because some of the data sources were on other servers. I was in a catch-22: I couldn’t test the data because I hadn’t built the ETL yet, but I didn’t want to build the ETL until I had tested the results.

Less Than Ideal Options

To make matters worse, I was only able to test my query using data in the production environment.

I thought of putting my new query into a stored procedure to make the logic easier to test. But with the production elevate process being lengthy and restricted, I couldn’t easily put my parameterized query into a permanent stored procedure to test with.

At this point I had a few options:

  • Don’t test anything and just build out the tables and ETL in production. Cross my fingers and hope it all works, fixing any issues after the fact.
  • Create a permanent stored procedure with the query and elevate it to production. Hope that I don’t have to make changes and go through the slow elevate process again.
  • Run the query over and over again with different parameters.

I didn’t like the first two options because of the amount of time I would lose trying to elevate new tables or procedures into production.

The third option wasn’t ideal either because while it would allow me to iterate quickly, documenting all of my tests would involve a massive file that would not be easy to navigate or change.

What I needed was a way to run a query through many different parameters in a concise manner without making any permanent production changes.

Temporary Stored Procedures for Regression Testing

Since I have access to create temporary stored procedures in production, I was able to create a temporary procedure containing my complex query:

CREATE PROCEDURE #ComplexBusinessLogic
  @parm1 int
AS
BEGIN
  /* This isn't the actual query.  The real one was ugly and hundreds of lines long.*/
  SELECT CASE @parm1
    WHEN 1 THEN 'A'
    WHEN 2 THEN 'B'
    WHEN 3 THEN 'C'
    END AS ResultValue
END;

As I mention in the comment of the stored procedure, this query was huge. If I wanted to execute it multiple times with different parameter values to test with, I’d have a gigantic file that would be difficult to navigate and easy to make errors in.

Instead, the temporary stored procedure made it easy to document my tests and execute the query as many times as needed:

/* Test for when parameter is 1 */
EXEC #ComplexBusinessLogic @parm1 = 1;

/* Test for scenario 2 */
EXEC #ComplexBusinessLogic @parm1 = 2;

/* The rarely occurring but very important test scenario 3 */
EXEC #ComplexBusinessLogic @parm1 = 3;

I was then able to clearly define all of my tests and run them against the production data without creating any permanent production objects. This was great because I did find errors in my logic, but I was able to fix them and retry my tests over and over again until everything ran without issues.
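
One detail worth knowing if you try this approach: like temp tables, a local temporary procedure is scoped to your session and disappears automatically when the connection closes, though you can also drop it explicitly once testing is finished:

-- Optional cleanup; the procedure is dropped automatically when the session ends
DROP PROCEDURE #ComplexBusinessLogic;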

Testing in Production

Ideally I wouldn’t have to use this solution. It would have been much better to have data to test with in a non-production environment. Using a temporary stored procedure to test in production is a hack to get around environment restrictions.

However, what is ideal and what is real-world don’t always align. Sometimes a hack helps meet deadlines when better options aren’t available. In this instance, temporary stored procedures helped make testing a breeze.
