Clustered vs Nonclustered: Index Fundamentals You Need To Know

Published Tue 26 September 2017 in SQL > Performance Tuning > Indexing

patrick-tomasso-71909 Photo by Patrick Tomasso on Unsplash

How many times have you known that adding an index would improve query performance but you weren't exactly sure which type of index to add?

This happened to me all the time in my first few years (and maybe an extra year or two after that) of working with SQL Server.

Today I want to help alleviate some of that confusion by comparing two of the most common index types: clustered and nonclustered rowstore indexes.

Watch this week's video on YouTube

Clustered Indexes

Every table's data has some natural order to it.

If the order is random and not explicitly defined then that table is known as a heap. With the exception of a few special cases, we generally don't want to have heaps. Heaps don't perform well for the majority of queries becauase SQL Server has no meta knowledge about where data is stored within a heap.

If we don't have a random heap, that means we have defined the order that data should be stored in a table. ** The physical storage order of our data is defined by our clustered index.**

Every table can have exactly one clustered index because the data in a table can only be stored in one order i.e. you can't have that table's data physically stored on the disk in more than one order.

What are the benefits of a clustered index?

The data in a clustered index is stored in order. That means:

Finding the data you need in your clustered index is a matter of knowing where to look in our alphabetical list of data. Computers are really good at doing this.
If your data needs to be outputted in the same order that it's stored in - presto! - SQL doesn't need to do any additional sorting.

A lot of people like to put the clustered index on their table's primary key (PK). This is usually fine because a lot of the time our primary key is likely to be our most used field for joins, where statements, etc...

Some people think they can ONLY put their clustered index on their PK. That's not true! Often times it can be much more beneficial to put your clustered index on something that isn't your PK, like a different column that is getting more use than our PK. For example, you might have an identity column set as your PK, but every query against your table is filtering and sorting on a datetime2 column. Why store your table in PK order if you are always going to be filtering and returning data on that datetime2 column? Put that clustered index on the datetime2 column!

The downside to having data stored in this order is that actions like inserts and updates take long because SQL has to put them into the correct sorted order of the table pages - it can't just quickly tack them onto the end.

Another major benefit of a clustered index is that we don't have to "include" any additional data in our index. All of the data for our row exists right beside our indexed columns. This is not necessarily true of other index types (see nonclustered indexes below).

Pretend our clustered index is like the white pages of a phone book (note to future SQL developers in 2030 who have no idea what a phonebook is: it's something that stores the names, addresses, and landline phone numbers in your area. What's a landline? Oh boy...)

The phone book stores every person's name in alphabetical order, making it easy to look up certain individuals. Additionally, if we look someone up, we immediately have their address and phone number right their next to their name - no additional searching necessary!

This is a great feature of clustered indexes - if you ever need to retrieve many or all columns from your table, a clustered index will usually be efficient because once it finds the indexed value you are searching on, it doesn't need to go anywhere else to get the remaining data from that row.

Nonclustered Indexes

If a clustered index is like a phone book, a nonclustered index is like the index in the back of a chemistry text book. The chemistry text book has some natural order to it ("Chapter 1: Matter, Chapter 2: Elements, Chapter 3: Compounds, etc..."). However, this order doesn't help us if we want to look up the location of something specific, like "noble gases".

So what do we do? We go to the index in the back of the textbook which lists all topics in alphabetical order, making it easy to find the listing for "noble gases" and the page number they are discussed on. Once we know the page number for noble gases from our index, we can flip to the correct page and get the data we need.

This book index represents our nonclustered index. A nonclustered index contains the ordered data for the columns specified in that index, with pointers (book page numbers) that tell us where to go to find the rest of the data from that row (flip to the right book page). That means unlike a clustered index where all data is always present, using a nonclustered index often is a two step process: find the value of interest in the index and then go look up the rest of that row's data from where it actually exists on disk.

What are the benefits of a nonclustered index?

We can have as many nonclustered indexes on our tables as we want (well, we max out at 999). That's great! Create an index for every column!

Well, no, don't do that. There's overhead in creating nonclustered indexes. Essentially, every time you index some column(s), you are duplicating the unique values in those column(s) so that they can be stored in sorted order in your index. We get speed and efficiency in our data lookups, but with the cost of losing disk space. You need to test and see for each table and set of queries what the optimal number of indexes is. Adding an additional index can absolutely destroy performance, so always test your changes!

Additionally, using a nonclustered index to find an indexed column's value is fast (SQL is just going through the ordered index data to find the value it needs - once again, something computers are really good at doing). However, if you need other columns of data from the row that you just looked up, SQL is going to have to use those index pointers to go find the rest of that row data somewhere else on disk. This can really add up and slow down performance.

If those additional lookups are hurting performance, what you can do is INCLUDE your nonindexed columns in your nonclustered index. What this basically does is in addition to storing the sorted values of your indexed column(s), the index will also store whatever additional values you want to include as part of the index itself. Once again, you'll probably get better performance because SQL won't have to go to somewhere else on disk to find the data it needs, but you lose storage space because you are creating duplicates of that data as part of your index.

Example Usage Scenarios

[Note: I want to clarify that the above definitions and below examples don't cover lots of corner cases (blob values, fragmentation, etc...). I wanted this post to be a simple starting point when people don't know what index type they should try adding first, because this was the paralysis that I had when starting out.]{style="color: #808080;"}

[Every statement in this article can probably have an asterisk appended to the end of it, pointing out some example where a recommendation I wrote is 100% wrong. ALWAYS test your index changes, because what might improve one query may hurt another one already running on that table, and over time you will learn about all of those edge cases and how they affect index performance.]{style="color: #808080;"}

Alright let's take a look at a few common scenarios and what the best index for them might be. After reading each scenario, take a guess about what kind of index you would add and then click on the answer to reveal what I would do in that scenario. Assume no indexes exist yet on these tables unless otherwise noted.

You have OLTP data that's used only for transactional reads and writing new rows. You know the primary key is an identity integer column. What type of index would you create for the primary key?
Clustered index - Your queries are probably always going to be looking up by PK to return data. If you store the data in the table ordered by that PK, SQL will be able to do this very quickly. New row additions to the table will always get put at the end because of the auto-incrementing identity column, not creating any overhead for having to insert data in a specific location in the ordered data.
You have a query that wants to return most or all of the columns from a table. What type of index would make this the most efficient?
Clustered index - Since all of the column values are stored in the same location as your indexed fields, SQL won't have to go do any additional lookups to get all of the data you are requesting from it. If you created a nonclustered index you would have to INCLUDE all nonindexed columns, which would take up lots of space since you are essentially duplicating your entire table's data.
You have a table that is constantly having values updated. These updated values are used as in your JOINs and WHERE clauses. What type of index would you add? Nonclustered index - If our values are constantly changing, SQL only has to update the index and pointers while putting the actual data wherever it has available space on disk. Compare this to a clustered index where it has to put the inserted/updated data in the correct order, meaning potentially lots of operations to shift the data around if available free space doesn't exist at that location.
You have a table that already has a clustered index, but it doesn't cover columns in JOINs and WHERE clauses. What do you do? Nonclustered index - since the clustered index already exists, your only option is to add a nonclustered index. Depending on the queries hitting this table however, you may want to consider changing your clustered index to a nonclustered index if you think your JOINs and WHERE clauses will be improved by having those fields be part of the clustered index. Test it out!
You have a small staging table that you will always read all rows from and then truncate. You don't care about the order. Do you add an index? No, leave it as a heap - This is one scenario where not adding an index can give you better performance since there is no overhead in SQL having to store things in a sorted order or update indexes to specify the order. If you truly don't care about the order, and you will always be reading all rows from a table and then truncating the table, then it's better not to have the overhead of having indexes on the table.

How to Search and Destroy Non-SARGable Queries on Your Server

Published Tue 22 August 2017 in SQL > Development

Unexpected SQL Server Performance Killers #3

Watch this week's video on YouTube

In this series I explore scenarios that hurt SQL Server performance and show you how to avoid them. Pulled from my collection of "things I didn't know I was doing wrong for years."

Looking for a script to find non-SARGable queries on your server? Scroll to the bottom of this post.

What is a "SARGable" query?

Just because you add an index to your table doesn't mean you get immediate performance improvement. A query running against that table needs to be written in such a way that it actually takes advantage of that index.

SARGable, or "Search Argument-able", queries therefore are queries that are capable of utilizing indexes.

Examples please!

Okay let's see some examples of SARGable and non-SARGable queries using my favorite beverage data.

First, let's look at a non-SARGable query:

SELECT Name
FROM dbo.CoffeeInventory
WHERE CONVERT(CHAR(10),CreateDate,121)  = '2017-08-19'

Although this query correctly filters our rows to a specific date, it does so with this lousy execution plan:

SQL Server has to perform an Index Scan, or in other words has to check every single page of this index, to find our '2017–08–19' date value.

SQL Server does this because it can't immediately look at the value in the index and see if it is equal to the '2017–08–19' date we supplied — we told it first to convert every value in our column/index to a CHAR(10) date string so that it can be compared as a string.

Since the SQL Server has to first convert every single date in our column/index to a CHAR(10) string, that means it ends up having to read every single page of our index to do so.

The better option here would be to leave the column/index value as a datetime2 datatype and instead convert the right hand of the operand to a datetime2:

SELECT Name
FROM dbo.CoffeeInventory
WHERE CreateDate = CAST('2017-08-19' AS datetime2)

Alternatively, SQL Server is smart enough to do this conversion implicitly for us if we just leave our '2017–08–19' date as a string:

SELECT Name
FROM dbo.CoffeeInventory
WHERE CreateDate = '2017-08-19'

In this scenario SQL gives us an Index Seek because it doesn't have to modify any values in the column/index in order to be able to compare it to the datetime2 value that '2017–08–19' got converted to.

This means SQL only has to read what it needs to output to the results. Much more efficient.

One more example

Based on the last example we can assume that any function, explicit or implicit, that is running on the column side of an operator will result in a query that cannot make use of index seeks, making it non-SARGable.

That means that instead of doing something non-SARGable like this:

SELECT Name, CreateDate
FROM dbo.CoffeeInventory
WHERE DAY(CreateDate)  = 19

We want to make it SARGable by doing this instead:

SELECT Name, CreateDate
FROM dbo.CoffeeInventory
WHERE 
    CreateDate  >= '2017-08-19 00:00:00' 
    AND CreateDate < '2017-08-20 00:00:00'

In short, keep in mind whether SQL Server will have to modify the data in a column/index in order to compare it — if it does, your query probably isn't SARGable and you are going to end up scanning instead of seeking.

OK, non-SARGable queries are bad…how do I check if I have any on my server?

The script below looks at cached query plans and searches them for any table or index scans. Next, it looks for scalar operators, and if it finds any it means we have ourselves a non-SARGable query. The fix is then to rewrite the query to be SARGable or add a missing index.

-- From https://github.com/bertwagner/SQLServer/blob/master/Non-SARGable%20Execution%20Plans.sql
-- This script will check the execution plan cache for any queries that are non-SARGable.
-- It does this by finding table and index scans that contain a scalar operators

SET TRANSACTION ISOLATION LEVEL READ UNCOMMITTED

DECLARE @dbname SYSNAME
SET @dbname = QUOTENAME(DB_NAME());

WITH XMLNAMESPACES (DEFAULT 'http://schemas.microsoft.com/sqlserver/2004/07/showplan')

SELECT
   stmt.value('(@StatementText)[1]', 'varchar(max)') AS [Query],
   query_plan AS [QueryPlan],
   sc.value('(.//Identifier/ColumnReference/@Schema)[1]', 'varchar(128)') AS [Schema], 
   sc.value('(.//Identifier/ColumnReference/@Table)[1]', 'varchar(128)') AS [Table], 
   sc.value('(.//Identifier/ColumnReference/@Column)[1]', 'varchar(128)') AS [Column] ,
   CASE WHEN s.exist('.//TableScan') = 1 THEN 'TableScan' ELSE 'IndexScan' END AS [ScanType],
   sc.value('(@ScalarString)[1]', 'varchar(128)') AS [ScalarString]
FROM 
    sys.dm_exec_cached_plans AS cp
    CROSS APPLY sys.dm_exec_query_plan(cp.plan_handle) AS qp
    CROSS APPLY query_plan.nodes('/ShowPlanXML/BatchSequence/Batch/Statements/StmtSimple') AS batch(stmt)
    CROSS APPLY stmt.nodes('.//RelOp[TableScan or IndexScan]') AS scan(s)
    CROSS APPLY s.nodes('.//ScalarOperator') AS scalar(sc)
WHERE
    s.exist('.//ScalarOperator[@ScalarString]!=""') = 1 
    AND sc.exist('.//Identifier/ColumnReference[@Database=sql:variable("@dbname")][@Schema!="[sys]"]') = 1
    AND sc.value('(@ScalarString)[1]', 'varchar(128)') IS NOT NULL

I've found this script useful for myself, but if you find any issues with it please let me know, thanks!

Are your indexes being thwarted by mismatched datatypes?

Published Tue 01 August 2017 in SQL > Performance Tuning > Indexing

In this series I explore scenarios that hurt SQL Server performance and show you how to avoid them. Pulled from my collection of "things I didn't know I was doing wrong for years."

Watch this week's video on YouTube

Have you ever encountered a query that runs slowly, even though you've created indexes for it?

There's a few different reasons why this may happen. The one I see most frequently happens in the following scenario.

I'll have an espresso please

Let's say I have a table dbo.CoffeeInventory of coffee beans and prices that I pull from my favorite green coffee bean supplier each week. It looks something like this:

-- Make sure Actual Execution Plan is on
-- Let's see what our data looks like
SELECT * FROM dbo.CoffeeInventory

If you want to follow along, you can get this data set from this GitHub Gist

I want to be able to efficiently query this table and filter on price, so next I create an index like so:

CREATE CLUSTERED INDEX CL_Price ON dbo.CoffeeInventory (Price)

Now, I can write my query to find out what coffee prices are below my willingness to pay:

SELECT Name, Price FROM dbo.CoffeeInventory WHERE Price < 6.75

You would expect this query to be blazing fast and use a clustered index seek, right?

WRONG!

What the heck?

Why is SQL scanning the table when I added a clustered index on the column that I am filtering in my predicate? That's not how it's supposed to work!

Well dear reader, if we look a little bit closer at the table scan operation, we'll notice a little something called CONVERT_IMPLICIT:

CONVERT_IMPLICIT: ruiner of fast queries

What is CONVERT_IMPLICIT doing? Well as it implies, it's having to convert some data as it executes the query (as opposed to me having specified an explicit CAST() or CONVERT() function in my query).

The reason it needs to do this is because I defined my Price column as a VARCHAR(5):

Who put numeric data into a string datatype? Someone who hasn't had their coffee yet today.

In my query however, I'm doing a comparison against a number WHERE Price < 6.75. SQL Server is saying it doesn't know how to compare a string to a number, so it has to convert the VARCHAR string to a NUMERIC(3,2).

This is painful.

Why? Because SQL is performing that implicit conversion to the numeric datatype for every single row in my table. Hence, it can't seek using the index because it ends up having to scan the whole table to convert every record to a number first.

And this doesn't only happen with numbers and string conversion. Microsoft has posted an entire chart detailing what types of data type comparisons will force an implicit conversion:

<https://docs.microsoft.com/en-us/sql/t-sql/data-types/data-type-conversion-database-engine>

That's a lot of orange circles/implicit conversions!

How can I query my coffee faster?

Well in this scenario, we have two options.

Fix the datatype of our table to align with the data actually being stored in this (data stewards love this).
Not cause SQL Server to convert every row in the column.

Number 1 above is self-explanatory, and the better option if you can do it. However, if you aren't able to modify the column type, you are better off writing your query like this:

SELECT Name, Price FROM dbo.CoffeeInventory WHERE Price < '6.75'

d03ed-1uzge0e3lizkhtodyonsfmg

Since we do a comparison of equivalent datatypes, SQL Server doesn't need to do any conversions and our index gets used. Woo-hoo!

What about the rest of my server?

Remember that chart above? There are a lot of different data comparisons that can force a painful column side implicit conversion by SQL Server.

Fortunately, Jonathan Kehayias has written a great query that helps you find column side implicit conversions by querying the plan cache. Running his query is a great way to identify most of the implicit conversions happening in your queries so you can go back and fix them — and then rejoice in your improved query performance!

One SQL Cheat Code For Amazingly Fast JSON Queries

Published Tue 09 May 2017 in SQL > Development > JSON

f836d-17tk3zornsj5nirg7ai5dbq

Watch this week's video on YouTube

Recently I've been working with JSON in SQL Server 2016 a lot.

One of the hesitations many people have with using JSON in SQL Server is that they think that querying it must be really slow — SQL is supposed to excel at relational data, not string parsing right?

It turns out that performance is pretty good with the standalone SQL Server JSON functions. Even better is that it's possible to make queries against JSON data run at ludicrous speeds by using indexes on JSON parsed computed columns. In this post I want to take a look at how SQL is able to parse* with such great performance.

*"Parse" here is actually a lie —it's doing something else behind the scenes. You'll see what I mean, keep reading!

Computed Columns in SQL Server

The only way to get JSON indexes working on SQL server is to use a computed column. A computed column is basically a column that performs a function to calculate its values.

For example, let's say we have a table with some car JSON data in it:

DROP TABLE IF EXISTS dbo.DealerInventory;
CREATE TABLE dbo.DealerInventory
(
  Id int IDENTITY(1,1) PRIMARY KEY,
  Year int,
  JsonData nvarchar(300)
);

INSERT INTO dbo.DealerInventory (Year, JsonData) VALUES (2017, '{ "Make" : "Volkswagen", "Model" : "Golf" }');

INSERT INTO dbo.DealerInventory (Year, JsonData) VALUES (2017, '{ "Make" : "Honda", "Model" : "Civic" }');

INSERT INTO dbo.DealerInventory (Year, JsonData) VALUES (2017, '{ "Make" : "Subaru", "Model" : "Impreza" }');

SELECT * FROM dbo.DealerInventory;

/* Output:
Id    Year     JsonData
----- -------- ---------------------------------------------
1     2017     { "Make" : "Volkswagen", "Model" : "Golf" }
2     2017     { "Make" : "Honda", "Model" : "Civic" }
3     2017     { "Make" : "Subaru", "Model" : "Impreza" }
*/

We can add a new computed column to the table, "Make", which parses and extracts the Make property from each row's JSON string:

ALTER TABLE dbo.DealerInventory
ADD Make AS JSON_VALUE(JsonData, '$.Make');

SELECT * FROM dbo.DealerInventory;

/* Output:
Id Year  JsonData                                    Make
-- ----- ------------------------------------------- ----------
1  2017  { "Make" : "Volkswagen", "Model" : "Golf" } Volkswagen
2  2017  { "Make" : "Honda", "Model" : "Civic" }     Honda
3  2017  { "Make" : "Subaru", "Model" : "Impreza" }  Subaru
*/

By default, the above Make computed column is non-persisted, meaning its values are never stored to the database (persisted computed columns can also be created, but that's a topic for a different time). Instead, every time a query runs against our dbo.DealerInventory table, SQL Server will calculate the value for each row.

The performance of this isn't great — it's essentially a scalar function running for each row of our output :(. However, when you combine a computed column with an index, something interesting happens.

Time to dive in with DBCC Page

DBCC Page is an undocumented SQL Server function that shows what the raw data stored in a SQL page file looks like. Page files are how SQL Server stores its data.

In the rest of this post we'll be looking at how data pages (where the actual table data in SQL is stored) and index pages (where our index data is stored) are affected by non-persisted computed columns — and how they make JSON querying super fast.

First, let's take a look at the existing data we have. We do this by first turning on trace flag 3604 and using DBCC IND to get the page ids of our data. Additional details on the column definitions in DBCC IND and DBCC PAGE can be found in Paul Randal's blog post on the topic.

DBCC TRACEON(3604);

-- "Sandbox" is the name of my database
DBCC IND('Sandbox','dbo.DealerInventory',-1);

55ac9-1qczux6z_us9fzacehamhpa

If you look at the results above, row 2 contains our data page (indicated by PageType = 1) and the PagePID of that page is 305088 (if you are playing along at home, your PagePID is most likely something else). If we then look up that PagePID using DBCC PAGE we get something like this:

DBCC PAGE('Sandbox',1,305088,3) WITH TABLERESULTS

65fdc-1jaa-flg1di52y3t2gohyxg

You can see our three rows of data highlighted in red. The important thing to note here is that our computed column of the parsed "Make" value is truly non-persisted and no where to be found, meaning it has to get generated for every row during query execution.

Now, what if we add an index to our non-persisted computed column and then run DBCC IND again:

CREATE NONCLUSTERED INDEX IX_ParsedMake ON dbo.DealerInventory (Make)

DBCC IND('Sandbox','dbo.DealerInventory',-1);

cab34-1vfyzcblexgrytsqu5miomg

You'll now notice that in addition to data page 305088 (PageType = 1), we also have an index page 305096 (PageType = 2). If we examine both the data page and the index page we see something interesting:

DBCC PAGE('Sandbox',1,305088,3) WITH TABLERESULTS

DBCC PAGE('Sandbox',1,305096,3) WITH TABLERESULTS

Nothing has changed with our data page:

3348f-1ktty-u5w4iha9emrwys9zw

But our index page contains the parsed values for our "Make" column:

1b9ea-16ysi1x2ilwdo7j3tkwowlw

What does this mean? I thought non-persisted computed columns aren't saved to disk!

Exactly right: our non-persisted computed column "Make" isn't saved to the data page on the disk. However if we create an index on our non-persisted computed column, the computed value is persisted on the index page!

This is basically a cheat code for indexing computed columns.

SQL will only compute the "Make" value on a row's insert or update into the table (or during the initial index creation) — all future retrievals of our computed column will come from the pre-computed index page.

This is how SQL is able to parse indexed JSON properties so fast; instead of needing to do a table scan and parsing the JSON data for each row of our table, SQL Server can go look up the pre-parsed values in the index and return the correct data incredibly fast.

Personally, I think this makes JSON that much easier (and practical) to use in SQL Server 2016. Even though we are storing large JSON strings in our database, we can still index individual properties and return results incredibly fast.