
Does The Order Of Index Columns Matter?

Watch this week's video on YouTube

When beginning to learn SQL, at some point you learn that indexes can be created to help improve the performance of queries.

Creating your first few indexes can be intimidating though, particularly when trying to understand what order to put your key columns in.

Today we'll look at how row store indexes work to understand whether index column order matters.

Heap: Stack of Pages

Imagine a stack of loose-leaf pages.  This collection of pages is our table.

Each page has information about a bird on it - the bird's name, picture, description, habitat, migration patterns, visual markings, etc...  You can think of each of these pages as a row of data.

The problem with this stack of pages is that there is no enforced order: it's a heap.  Without any enforced order, searching for individual birds is time consuming; in order to find a particular bird, for example a blue jay, you would have to go through the stack of pages one at a time until you find the blue jay page.

The scanning doesn't stop there though.  Even though we found a blue jay page, there's no way for us to guarantee that there are no other blue jay pages in the stack.  This means we have to continue flipping through every page until we finish searching through the whole heap of pages.

Having to do this process every single time we need to retrieve data from our bird table is painful.  To make our job easier, we can define and enforce an order on the data by defining a clustered index.

Clustered Index: Bound Pages

To make searching through our pages easier, we sort all of the pages by bird name and glue on a binding.  This book binding now keeps all of our pages in alphabetical order by bird name.

The SQL version of a book binding is a clustered index.  The clustered index is not an additional object to our data - it is that same exact table data, but now with an enforced sort order.
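
To put that in T-SQL terms, here's a minimal sketch - the dbo.Birds table and index names are hypothetical, just for illustration:

CREATE TABLE dbo.Birds
(
    BirdName varchar(50) NOT NULL,
    Color varchar(20) NOT NULL,
    Description varchar(1000) NOT NULL,
    Habitat varchar(1000) NOT NULL,
    MigrationPattern varchar(1000) NOT NULL
);

-- The "book binding": physically sorts the table's pages by bird name
CREATE CLUSTERED INDEX CL_Birds_BirdName ON dbo.Birds (BirdName);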

Having all of our data in sorted order by bird name makes certain queries really fast and efficient - instead of having to scan through every page to find the blue jay entry, we can now quickly flip to the "B" section, then the "BL" section, then the "BLU" section, etc... until we find BLUE JAY.  This is done quickly and efficiently because we know where to find blue jays in the book because the bird names are stored in alphabetical order.

Even better, after we find the blue jay page, we flip to the next page and see a page for cardinal.  Since we know all of the entries are stored alphabetically, we know that once we get to the next bird we have found all of our blue jay pages and don't need to continue flipping through the rest of the book.

While the clustered index allows us to find birds by name quickly, it's not perfect; since the clustered index is the table, it contains every property (column) of each bird, which is a lot of data!

Having to constantly reference this large clustered index for each of our queries can be cumbersome.  For most of our queries, we could get by with a condensed version of our bird book that contains only the most essential information.

Nonclustered Index: Cut and Copy

Let's say we want a lighter-weight version of our book that contains the most relevant information (bird name, color, description).

We can photocopy the entire book and then cut out and keep only the pieces of information that are relevant while discarding the rest.  If we paste all of those relevant pieces of information into a new book, still sorted by bird name, we now have a second copy of our data.  This is our nonclustered index.

This nonclustered index contains all of the same birds as our clustered index, just with fewer columns.  This means we can fit multiple birds onto a page, requiring us to flip through fewer pages to find the bird we need.
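
Continuing the hypothetical dbo.Birds sketch from above, the condensed copy might look something like this:

-- A second, lighter-weight copy of the data: sorted by bird name,
-- carrying only the essential columns on each page
CREATE NONCLUSTERED INDEX IX_Birds_BirdName
    ON dbo.Birds (BirdName)
    INCLUDE (Color, Description);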

If we ever need to look up additional information about a particular bird that's not in our nonclustered index, we can always go back to our giant clustered index and retrieve any information we need.

With the lighter-weight nonclustered index in hand, we go out to the woods to start identifying some birds.

Upon spotting an unfamiliar bird in our binoculars, we can flip open the nonclustered index to identify the bird.

The only problem is, since we don't know this bird's name, our nonclustered index by bird name is of no help.  We end up having to flip through each page one at a time trying to identify the bird instead of flipping directly to the correct page.

For these types of inquiries, where we want to identify a bird but don't know its name, a different index would be beneficial...

Nonclustered Index 2: Color Bugaloo

Instead of having a nonclustered index sorted by bird name, what we really need is a way to filter down to the list of potential birds quickly.

One way we can do this is to create another copy of our book, still containing just bird names, colors, and descriptions, but this time with the pages sorted by color first, then bird name.

When trying to identify an unknown bird, we can first limit the number of pages to search through by filtering on the bird's color.  In our case, color is a highly selective trait, since it filters down our list of potential birds to only a small subset of the whole book.  In our blue jay example, this means we would find the small subset of pages that contain blue birds, and then just check each one of those pages individually until we find the blue jay.
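
Sticking with our hypothetical dbo.Birds table, this color-first copy might be defined like so:

-- Same essential columns, but sorted by color first, then bird name
CREATE NONCLUSTERED INDEX IX_Birds_Color_BirdName
    ON dbo.Birds (Color, BirdName)
    INCLUDE (Description);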

Order Matters

Indexes aren't magic; their high-performance capabilities come from the fact that they store data in a predetermined order.  If your query can utilize data stored in that order, great!

However, if your query wants to filter down on color first, but your index is sorted on bird name, then you'll be out of luck.  When it comes to determining which column should be the first key in your index, you should choose whichever one will be most selective (whichever will filter you down to the smallest subset of results) for your particular query.
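
To make that concrete with the hypothetical indexes sketched above:

-- Can seek on IX_Birds_BirdName, since bird name is its leading key column
SELECT BirdName, Color, Description
FROM dbo.Birds
WHERE BirdName = 'Blue Jay';

-- Can seek on IX_Birds_Color_BirdName; the name-first index
-- would have to be scanned to answer this filter
SELECT BirdName, Color, Description
FROM dbo.Birds
WHERE Color = 'Blue';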

There's a lot more optimizing that can be done with indexes, but correctly choosing the order of columns for your index key is an essential first step.

Want to learn even more about index column order? Be sure to check out this post on cardinality.

Ignoring NULLs with FIRST_VALUE

Watch this week's video on YouTube

The SQL Server FIRST_VALUE function makes it easy to return the "first value in an ordered set of values."

The problem is that if that first value happens to be a NULL, there is no easy, built-in way to skip it.

While a UserVoice item exists to add the ability to ignore nulls (go vote!), today we're going to accomplish that end result with some alternative queries.

The Setup

Here's the example data we'll be skipping nulls on:

CREATE TABLE ##Data
(
       Id int IDENTITY(0,1),
       GroupId int,
       Value1 int
);
GO
INSERT INTO ##Data VALUES (1,1)
INSERT INTO ##Data VALUES (1,1)
INSERT INTO ##Data VALUES (1,3)
INSERT INTO ##Data VALUES (2,NULL)
INSERT INTO ##Data VALUES (2,NULL)
INSERT INTO ##Data VALUES (2,6)
INSERT INTO ##Data VALUES (2,4)
INSERT INTO ##Data VALUES (2,5);
GO


We've got an integer identity column, two groups of rows, and NULLs sprinkled into otherwise unsuspecting integer values.

If we write a query that uses the FIRST_VALUE function, you'll notice that our NULL gets chosen in group two - not quite what we want:

SELECT
       Id,
       GroupId,
       Value1,
       FIRST_VALUE(Value1) OVER (PARTITION BY GroupId ORDER BY Id) AS FirstValue1
FROM
       ##Data


Let's look at two queries that will help us get the number 6 into that FirstValue1 column for the second group.

The Contenders

"The Derived FIRST_VALUE"

First up is still the FIRST_VALUE function, but inside of a derived table:

SELECT
    d.Id,
    d.GroupId,
    d.Value1,
    d2.FirstNotNullValue1
FROM
    ##Data d
    INNER JOIN
    (
    SELECT DISTINCT
        GroupId,
        FIRST_VALUE(Value1) OVER (PARTITION BY GroupId ORDER BY Id) as FirstNotNullValue1
    FROM ##Data
    WHERE Value1 IS NOT NULL
    ) d2
        ON d.GroupId = d2.GroupId

By filtering out NULLs in our derived table query, FIRST_VALUE returns the first non-null value like we want.  We then join that back to the original data and all is right again.


"The Triple Join"

Our second attempt at this query sends us back to the dark ages of SQL Server 2008 before the FIRST_VALUE function existed:

SELECT
    d.Id,
    d.GroupId,
    d.Value1,
    d2.Value1 AS FirstNotNullValue1
FROM
    ##Data d
    LEFT JOIN
    (
    SELECT
        GroupId,
        MIN(Id) AS FirstNotNullIdValue1
    FROM
        ##Data
    WHERE
        Value1 IS NOT NULL
    GROUP BY
        GroupId
    ) m
        ON d.GroupId = m.GroupId
    INNER JOIN ##Data d2
        ON m.FirstNotNullIdValue1 = d2.Id;

We perform a triple join, with the critical element being our derived table which gets the MIN Id for each group of rows where Value1 IS NOT NULL.  Once we have the minimum Id for each group, we join back in the original data and produce the same final result:


The Performance

Both of the above queries produce the same output - which one should you use in your production code?

Well, the "Derived FIRST_VALUE" query has a lower relative cost than the "Triple Join" query, so maybe it's better?


This isn't a real-world execution plan though - surely we never scan heaps in our production environments.

Let's add a quick clustered index and see if that changes anything:

CREATE CLUSTERED INDEX CL_Id ON ##Data (GroupId,Id,Value1)


Okay, a closer matchup, but the "Derived FIRST_VALUE" query still appears to have a slight edge.

If we SET STATISTICS IO ON though, we start to see a different story.


With only 8 rows of data, our "Derived FIRST_VALUE" query sure is performing a lot of reads.

What if we increase the size of our sample dataset?

SET STATISTICS IO, TIME OFF;
SET NOCOUNT ON;
GO
INSERT INTO ##Data (GroupId, Value1)  
SELECT GroupId, Value1 FROM ##Data
GO 10

And now let's check our plans and stats IO again.


WOW that's a lot of reads in the "Derived FIRST_VALUE" query.

Conclusion

Besides sharing some solutions, the point I tried to make above is this: DON'T TRUST CODE YOU FIND ON THE INTERNET (or in books, or copied from colleagues, etc...).

Both of the above queries will return the first value without NULLs, but they probably won't perform exactly the same as they did in my examples above.

Copy the above code for sure - but test it out. See what works better on your specific server configuration, data size, and indexes.  Maybe both queries are terrible and you need a third, better way of doing it (if you write one, let me know!) - but please, please, please, always test your code.

FizzBuzz and Multiple Recursive Member CTEs

Watch this week's video on YouTube

Last week I needed to write a recursive common table expression.  I've written them before, but it's been a while and I needed to visit the documentation to reference the syntax.

Instead of going straight to the examples, I decided to read into some of the details (since skipping the details really hurt me in last week's post) and noticed this line that I had never seen before:


"Multiple...recursive members can be defined" - what????

I never knew you could have multiple recursive member statements in a CTE.  Heck, I didn't even know what having multiple recursive members could do.

Since the documentation doesn't talk about them beyond the one highlighted line above, I decided to create some examples to see if I could get them to work.

FizzBuzz

FizzBuzz is a programming puzzle that asks the solver to write a program that will list the numbers 1 to 100, displaying the word "Fizz" for any numbers that are a multiple of 3, "Buzz" for any multiples of 5, and "FizzBuzz" for any multiples of 3 and 5.

I decided to try and implement the FizzBuzz problem as both a single and multiple member CTE to see how the solutions would differ.

The Basic Recursive CTE

To start out, I decided to write a CTE that lists all numbers 0 to 100:

WITH c AS
(
    -- anchor member
    SELECT
        0 AS RowNumber
    UNION ALL
    -- recursive member
    SELECT
        c.RowNumber + 1
    FROM
        c /* the result of our last iteration */
    WHERE 
        RowNumber < 100
)

SELECT * 
FROM c;

The first SELECT statement in the CTE definition is known as the "anchor" member.  This query runs a single time and acts as the initial result that the recursive query acts on.

The second SELECT statement in the CTE definition is known as the "recursive" member.  This statement executes on the results of the previous execution (or on the results of the anchor member for the first iteration).

The recursive member will execute over and over again as long as it is still producing results.  Since our recursive statement just adds 1 to the previous result, our recursive query would run forever - which is why we add the WHERE condition to stop it from executing once we reach 100.

Our final SELECT statement returns the results of our recursive CTE, providing us with a neat list of numbers from 0 to 100.
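
As an aside, SQL Server also protects against runaway recursion on its own: by default, a recursive CTE errors out after 100 recursions.  If you ever need to go deeper (or want an explicit guard), you can override that limit with the MAXRECURSION query hint - a quick sketch:

SELECT *
FROM c
OPTION (MAXRECURSION 200); -- default limit is 100; 0 removes the cap entirely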


Single Recursive Member CTE for FizzBuzz

Now that our basic recursive CTE is working, let's make it solve FizzBuzz.  Here is our updated code:

WITH c AS
(
    SELECT
        0 AS RowNumber,
        'FizzBuzz' AS FizzOrBuzz
    UNION ALL
    SELECT
        c.RowNumber + 1,
        CASE 
            WHEN (c.RowNumber + 1) % 15 = 0 THEN 'FizzBuzz' 
            WHEN (c.RowNumber + 1) % 3 = 0 THEN 'Fizz' 
            WHEN (c.RowNumber + 1) % 5 = 0 THEN 'Buzz' 
            ELSE NULL 
        END
    FROM
        c
    WHERE 
        RowNumber < 100
)

SELECT
    RowNumber,
    FizzOrBuzz
FROM C
ORDER BY RowNumber;

First, we add a second column to our results to display the word "Fizz", "Buzz", or "FizzBuzz".

In the anchor member, we defaulted this value to "FizzBuzz".  In our recursive member, we added a CASE statement to display the correct word.  The modulo operator (%) checks to see if the current row divided by 3, 5, or 15 results in a remainder - if the remainder is 0 then we know we found a multiple of that number.
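
To see those remainder checks in isolation, here's a quick standalone example:

-- 9 is a multiple of 3 (remainder 0), 10 is not (remainder 1)
SELECT 9 % 3 AS NineModThree, 10 % 3 AS TenModThree;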

This solution is pretty easy to read and provides the expected output for our FizzBuzz puzzle.


Multiple Recursive Member CTE for FizzBuzz

Alright, the moment we've been waiting for - the multiple recursive member CTE:

WITH c AS
(
    SELECT
        0 AS RowNumber,
        'FizzBuzz' AS FizzOrBuzz
    UNION ALL
    /* All rows not Fizz or Buzz or FizzBuzz */
    SELECT
        c.RowNumber + 1,
        CAST(NULL AS VARCHAR(8)) AS FizzOrBuzz -- explicit CAST keeps the type consistent with the anchor member
    FROM
        c
    WHERE
        c.RowNumber + 1 <= 100
        AND (c.RowNumber + 1) % 3 <> 0
        AND (c.RowNumber + 1) % 5 <> 0
    UNION ALL
    /* Fizz rows */
    SELECT
        c.RowNumber + 3,
        CAST('Fizz' AS VARCHAR(8)) AS FizzOrBuzz
    FROM
        c
    WHERE
        c.RowNumber + 3 <= 100
        AND FizzOrBuzz IN ('Fizz','FizzBuzz')
    UNION ALL
    /* Buzz rows */
    SELECT
        c.RowNumber + 5,
        CAST('Buzz' AS VARCHAR(8)) AS FizzOrBuzz -- CAST to match the anchor member's type
    FROM
        c
    WHERE
        c.RowNumber + 5 <= 100
        AND FizzOrBuzz IN ('Buzz','FizzBuzz')
)

SELECT
    RowNumber,
    STRING_AGG(FizzOrBuzz,'') AS FizzOrBuzz
FROM C
GROUP BY
    RowNumber
ORDER BY RowNumber

You'll notice we have 3 recursive members: the first generates all rows up to 100 that are not multiples of 3 or 5, the second generates all rows that are multiples of 3, and the third generates all rows that are multiples of 5.

If we run SELECT * FROM c; after making only the changes mentioned above, you'll notice that things mostly look correct, but we have duplicates (and incorrect labeling) for rows that are multiples of both 3 and 5.


The way I decided to fix that is by adding a STRING_AGG() function to the final SELECT statement, concatenating the outputs of rows with the same RowNumber. With that addition, our multiple recursive member CTE FizzBuzz solution is complete.
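
One caveat worth noting: STRING_AGG doesn't guarantee a concatenation order unless you specify one, so a multiple of 15 could theoretically come out as "BuzzFizz".  If you want to pin the order, a small tweak to the final SELECT does it (sorting descending puts "Fizz" before "Buzz"):

SELECT
    RowNumber,
    STRING_AGG(FizzOrBuzz,'') WITHIN GROUP (ORDER BY FizzOrBuzz DESC) AS FizzOrBuzz
FROM c
GROUP BY
    RowNumber
ORDER BY RowNumber;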

One thing to be aware of in the above solution: each of the recursive member statements will execute on the previous results of ANY recursive member statement, so we add the conditions "...and FizzOrBuzz in ..." to force each recursive statement to run only on the output from its own previous result.  This feels like cheating a little bit, but it was the only way I could solve the problem I had defined.

Practical Examples and Further Reading

I had a hard time coming up with practical uses for multiple recursive member CTEs.

I searched online for some examples but it doesn't seem like many people have written about the topic.  One exception I did find was an article by Itzik Ben-Gan where he uses them to solve Lord of the Rings family trees (heh).

Honestly though, as excited as I was initially to learn that doing this is possible, I don't know if/when I'll ever use it.  I'm hoping I encounter a problem one day that can make use of multiple recursive statements, but who knows if that will ever happen.

If you have used multiple recursive member CTEs to solve a real-world problem before, leave me a comment - I'd love to hear about the scenario you used it in.

6 Techniques For Troubleshooting Your Code

This post is a response to this month's T-SQL Tuesday #105 prompt by Wayne Sheffield.  T-SQL Tuesday is a way for the SQL Server community to share ideas about different database and professional topics every month.

This month's topic asks us to share a time we ran into a metaphorical brick wall and how we worked through it.


Watch this week's video on YouTube

The Problem: Trimmed JSON Values

Recently I was using FOR JSON PATH to generate a JSON string from a query to use in a web app:

SELECT TOP 100 *
FROM master..spt_values
WHERE type='P'
FOR JSON PATH

The resulting JSON string is 5,580 characters long.

The goal was to read this query result into my .NET app like so:

var jsonResult = db.Database.SqlQuery<string>("SELECT TOP 100 * FROM ... FOR JSON PATH");

Is this the best way to design every app/database interaction?  Probably not, but it was right for this scenario because I didn't want to create models for all of the dynamic data I'd be returning.

Design decisions aside, my "brick wall" issue was that my "jsonResult" value was getting truncated around 2,000 characters instead of displaying the full 5,580.  The JSON string looked great in SSMS, but for the life of me I couldn't figure out why the data was getting chopped off when read into .NET.

Time to Debug

What follows are the usual steps I take when debugging a problem that has me stumped.  I've turned these into a 1950s-style educational film so you can laugh at my bad jokes while you learn.

1. Rubber Duck Debugging

The first thing I usually do when I hit a wall like this is talk myself through the problem again.

This technique usually works well for me and is equivalent to those times when you ask  someone for help but realize the solution while explaining the problem to them.

To save yourself embarrassment (and to let your coworkers keep working uninterrupted), people often substitute an inanimate object, like a rubber duck, instead of a coworker to try and work out the problem on their own.

Alas, in this case explaining the problem to myself didn't help, so I moved on to the next technique.

2. Simplify the Problem

Breaking a problem down into smaller solvable parts can help sometimes.  I changed my query to return less data by switching to SELECT TOP 5 and seeing if .NET was still truncating the data.  It wasn't! Mildly successful!

In this case though, I couldn't really build off my simplified example.  As soon as my result passed ~2,000 characters, the JSON string was getting chopped off.

In this step I also figured out that if I put my query into a derived table, my .NET code worked beautifully and returned the complete JSON string:

SELECT * FROM
(
    SELECT TOP 100 *
    FROM master..spt_values
    WHERE type='P'
    FOR JSON PATH
) t(c)

This was an ugly solution though and I didn't like doing it.  I especially didn't like it because I didn't know why a derived table fixed the output.

3. Check the Internet

As great as the internet is, I try to limit how much searching I do on it when troubleshooting.  Searching for an online solution can quickly devolve into wasting two hours with nothing to show for it.

I performed some cursory searches on Google, StackOverflow, various forums, blogs, etc... but didn't find anything helpful (fun/sad fact: I searched for solutions again while typing up this post and now find plenty of solutions...who knows what I was doing wrong that day).

4. Ask a Friend

I love working through problems with others because I'm often amazed at how differently they approach a problem.  Oftentimes this leads to a solution I would not have thought of on my own.  I especially enjoy hearing from people new to the subject area because they often have the most creative solutions, not yet having become cynical and jaded from previous experience :).

I try to hold off on this option until at least trying all of the above techniques because 1) I hate breaking another person's concentration and 2) I feel like I learn better if I struggle through a problem myself.

And in this case shopping the problem around didn't help - no one I talked to had a great solution.

5. Take a Break

After trying all of the above, I was out of ideas.  I took a break from the problem for the rest of the day, resolved to give it another try in the morning.

6. RTFM

And the next morning, I had the idea to check the documentation to see what it said about the return type of FOR JSON PATH.

Embarrassingly, this one should be way higher on the list, and I'd like to say that it usually is, but for one reason or another I didn't bother checking until this late in the game.

And wouldn't you know it?  The last paragraph of the last section told me exactly what I needed to know.

The documentation tells me that the JSON string will be broken up across multiple rows and my client app needs to concatenate them all together.  What I ended up doing is a simple String.Join():

var jsonResult = String.Join("",db.Database.SqlQuery<string>("SELECT TOP 100 * FROM ... FOR JSON PATH"));

There's no explanation for why SSMS is able to concatenate these values together while other client apps have to do so manually, but at least I found my documented solution.

Conclusion

Even though I found a somewhat-satisfactory solution in the documentation, my fallback was going to be the ugly derived table solution discovered in step 2.  It was ugly, but at some point I'd have to call it quits and settle for an ugly workaround rather than spend more time on troubleshooting.

Next time I'll be sure to check the documentation earlier in the process and hopefully that will save me from some of the frustration I encountered in this particular scenario.

Displaying Long Values in SSMS

Watch this week's video on YouTube

I write a lot of dynamic SQL and frequently encounter variables that contain many characters:

DECLARE @LongValue NVARCHAR(MAX) = CAST('' AS NVARCHAR(MAX)) + 
N'SELECT
    ''A'' AS AShortValue,
    '''+REPLICATE(N'A',4000)+''' as ALongValue
ORDER BY 1';

This variable is 4,059 characters long, and when I execute it, it runs great.

SELECT LEN(@LongValue); -- 4059 characters
EXEC(@LongValue);

(A homage to one of my favorite Uncyclopedia entries.)

If my programmatically built query had an error in it, the first thing I'd want to do when debugging it would be to see the text of the entire @LongValue variable.

I could do this by just saying SELECT @LongValue, and while recent versions of SSMS will display the whole value for me, it completely loses my formatting, which stinks (and is especially bad if there are any comments prefixed with -- in the query).

(You'd need an ultra HD wide display to fit this all on one screen.)

I can say PRINT @LongValue, which will keep the formatting, but the output gets trimmed at 4,000 characters - cutting off the ORDER BY at the end of my query.


Some Better Ways

Erik Darling posts one solution to this problem in his T-SQL Tuesday #104 entry (as well as some other problems/solutions for lengthy SQL variables).  Specifically, he links to a SQL string printing script that will loop through the lengthy variable and print everything while maintaining formatting.

(Not perfectly formatted, but good enough.)

And while I like using that stored procedure on my primary server, I'm too lazy to install it everywhere I need it.

Instead, I have a couple of go-to solutions that work on all SQL Server instances from 2008 forward.

Solution 1: CAST to XML

SELECT CAST(@LongValue AS XML) AS LongValue

Casting the long variable to XML allows SSMS to generate a clickable, single-row result that preserves formatting.

(IntelliSense complains, but I'm OK with it.)

The only downside to this approach is that certain characters, like "<" and ">", can't be converted to XML.


Solution 2: FOR XML PATH

A slight variation on solution 1, we can get similar results using FOR XML PATH:

SET @LongValue = '<' + @LongValue -- Let's add in an invalid character
SELECT @LongValue FOR XML PATH('')

(FOR XML PATH is one of the most abused SQL Server functions.)

In this solution, the "<" is escaped to "&lt;", which isn't perfect, but at least my variable can be displayed with its formatting intact.  A quick find and replace for any escaped characters and I'm good to go.
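
If the escaped characters bother you, there's a variation I've seen around (a sketch worth testing, not part of the solutions above) that wraps the value in an XML processing instruction, which displays "<" and ">" unescaped while keeping the clickable, formatted output:

SELECT @LongValue AS [processing-instruction(query)]
FOR XML PATH(''), TYPE; -- renders as <?query ... ?> with the original characters intact

The one thing the value can't contain is the sequence "?>", which would end the processing instruction early.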

Good Enough

These techniques aren't perfect, but for purposes of debugging dynamically generated code they are good enough.

Maybe one day SSMS will print longer strings or include a syntax formatter and I won't care nearly as much.

And if not, I'll happily continue to abuse FOR XML to do things other than generate XML documents.