
Reducing the Number of Reads in Your Queries


In last week's post, we went over how one of the best ways to improve query performance is to reduce the number of reads that your query has to do.

Fewer reads typically mean faster query performance - so how can you reduce the number of reads SQL Server is required to make?

Watch this week's video on YouTube

Write more selective queries

Writing more selective queries can be done in a few different ways.

For starters, if you don't need every column of data, don't use SELECT *.  Depending on the size of your rows, providing only the columns you need might allow SQL Server to use an index that is narrower and/or denser.  An index that is narrow (fewer columns) or dense (more records per page) allows SQL Server to return the same amount of data you need in fewer pages.

The same thing goes for being more selective in your ON and WHERE clauses.  If you specify no WHERE conditions, SQL Server will read every single page in your table.  If you can filter down to exactly the rows you need, SQL Server only has to read the pages containing those rows.  This reduces logical reads and improves speed.

-- Read 1,000,000 pages
SELECT * FROM Products

-- Read 10,000 pages by being more selective in your SELECT and WHERE
SELECT ProductName FROM Products WHERE CreateDate > '2015-01-01'

Finally, do you have large object (LOB) data (e.g. varchar(max), image, etc.) on the tables in your query?  Do you actually need it in your final result set?  No?  Then don't include it as part of your reads!  This could mean creating an index on the columns you do need, or putting your LOB data in a separate table and only joining to it when you need it.

-- Read 1,000,000 pages because your Products table has a LOB field and nothing besides a clustered index
SELECT ProductName FROM Products WHERE CreateDate > '2017-08-01'

-- Read only 1,000 pages because you put your LOB column in a separate table,
-- or because you created an index that contains ProductName but not your LOB column.
-- Same query, better performance!
SELECT ProductName FROM Products WHERE CreateDate > '2017-08-01'

-- Need that LOB data?  Just join in the separate table.
-- Reads are high here, but at least you are only reading the LOB data when you truly need it.
SELECT 
  p.ProductName, 
  pe.ProductExtendedProperties -- this is nvarchar(max)
FROM
  Products p
  INNER JOIN ProductLOBs pe
    ON p.ProductId = pe.ProductId
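If you want to go the indexing route instead, a minimal sketch of both approaches might look like this (Products, ProductLOBs, and all column names are the same illustrative ones used above, not a real schema):

-- Option 1: a covering index that contains only the columns the query needs,
-- so SQL Server never has to touch the pages storing the LOB data
CREATE NONCLUSTERED INDEX IX_Products_CreateDate
ON Products (CreateDate)
INCLUDE (ProductName)

-- Option 2: move the LOB column to its own table, keyed the same as Products,
-- and join to it only in the queries that truly need the LOB data
CREATE TABLE ProductLOBs
(
    ProductId int NOT NULL PRIMARY KEY,
    ProductExtendedProperties nvarchar(max) NULL
)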

Fix suboptimal execution plans

It's possible that your query is already as selective as it can be.  Maybe you are getting too many reads because SQL Server is generating and using a suboptimal execution plan.

A big tip-off that this might be happening is if your cardinality estimates are out of whack, i.e. there is a large difference between the estimated and actual row counts:

[Screenshot: execution plan operator showing estimated rows wildly different from actual rows]

If SQL Server thinks it is only going to read 1 row of data but instead needs to read far more rows, it might choose a poor execution plan that results in more reads.

You might get a suboptimal execution plan like the one above for a variety of reasons.

If you have a query that previously ran fine but doesn't anymore, you might be able to use Query Store to help identify why SQL Server started generating suboptimal plans.

The key is to get a good execution plan so that you aren't performing unnecessary reads.

But what if you aren't encountering any of the problems above and performance is still slow due to high numbers of reads?  Simply...

Add an index

A cup of coffee and a shot of espresso might have the same caffeine content - espresso is just more caffeine dense, just like the data stored in a narrow index.  Photo by Mike Marquez on Unsplash

If you have an existing table that has many columns and you only need a subset of them for your query, then consider adding an index for those columns.

Indexes are copies of your data stored in a different order with generally fewer columns.  If SQL Server is able to get all of the information it needs from a narrow index, it will do that instead of reading the full table/clustered index.

A copy of the data that has fewer columns will have greater page density (or the amount of data that fits on each page).  If SQL Server can get all of the data it needs by reading fewer, denser pages then your query will run faster.

Don't just go adding indexes willy-nilly though.  You may already have an index that almost contains all of the columns your query needs.  Look at a table's existing indexes first and see if any of them are close to what you need.  Usually you'll be better off adding an included column or two to an existing index instead of creating a whole brand new index, as shown below.  And you'll save on disk space by not creating duplicate indexes either.
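For example, rather than creating a second, nearly identical index, here's a hedged sketch of widening an existing one (the index and column names are hypothetical):

-- Recreate the existing index in place with one extra included column
-- instead of adding a near-duplicate second index
CREATE NONCLUSTERED INDEX IX_Products_CreateDate
ON Products (CreateDate)
INCLUDE (ProductName, ProductCategory)
WITH (DROP_EXISTING = ON)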

Reduce index fragmentation

So indexes are great for reducing reads because they allow us to store only the data that is needed for a specific query (both as key columns and included columns).  Fewer columns = greater density = fewer reads necessary.

However, indexes can become fragmented.  There's internal fragmentation, which causes less data to be stored on a page than is possible, and external fragmentation, which causes the pages to be stored out of logical order on disk.

Internal fragmentation is problematic because it reduces page density, causing SQL Server to have to read more pages in order to get all of the data it needs.

External fragmentation is problematic, especially for spinning disk hard drives, because SQL Server needs to read from all over the disk to get the data it needs.

In general, reorganizing or rebuilding an index are the typical ways you want to fix a fragmented index.

To slow down future fragmentation, you can test out different fill factors to try and prevent page splits from fragmenting your indexes.
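As a rough sketch (the right fragmentation thresholds and fill factor are workload-specific, so treat these as starting points to test, not rules):

-- Lightly fragmented index: reorganize it in place
ALTER INDEX IX_Products_CreateDate ON Products REORGANIZE

-- Heavily fragmented index: rebuild it, leaving 10% free space on each
-- page to absorb future inserts without page splits
ALTER INDEX IX_Products_CreateDate ON Products REBUILD
WITH (FILLFACTOR = 90)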

How to use Statistics IO to Improve Your Query Performance


SQL Server's STATISTICS IO reporting is a great tool to help you performance tune queries.

Usually the goal of performance tuning is to make your query run faster.  One of the easiest ways to get a faster query is to reduce the amount of data a query is processing.  STATISTICS IO makes it easy to see how much data SQL Server is actually processing.

Specifically, the STATISTICS IO output helps with performance tuning because:

  1. The data it shows acts as a measuring stick for your performance tuning changes.
  2. It provides a good way of isolating the query changes you are making from other changes that may be happening on the server.

Watch this week's video on YouTube

So what does STATISTICS IO data look like?

To show IO statistics on your query, you first need to execute:

-- This applies to your current session only
SET STATISTICS IO ON;
GO

After running a query with the above setting turned on, check SQL Server Management Studio's Messages tab to see output that looks something like this:

(157709 rows affected)
Table 'OrderLines'. Scan count 2, logical reads 0, physical reads 0, read-ahead reads 0, lob logical reads 159, lob physical reads 0, lob read-ahead reads 0.
Table 'OrderLines'. Segment reads 1, segment skipped 0.
Table 'Workfile'. Scan count 0, logical reads 0, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
Table 'Worktable'. Scan count 0, logical reads 225917, physical reads 0, read-ahead reads 0, lob logical reads 225702, lob physical reads 0, lob read-ahead reads 0.
Table 'Orders'. Scan count 1, logical reads 138, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
Table 'StockItems'. Scan count 1, logical reads 16, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.

The key things that my eyes are drawn to on initial examination of the STATISTICS IO output are the following:

  • Logical reads: The number of 8kB pages SQL Server had to read from the buffer cache (memory) in order to process and return the results of your query.  The more pages that need to be read, the slower your query.
  • Worktables/Workfiles: These are temporary objects that SQL Server creates in tempdb in order to process query results.  Although not always bad, they might indicate that SQL Server is doing more work than it needs to (perhaps an index could help?).
  • Lob logical reads: The number of pages read for large object (LOB) data, e.g. varchar(max) columns.  I take the most cursory glance at this - if I'm seeing high numbers of LOB reads, I might want to make sure I actually need that data.  If not, I may add an index or move the LOBs off to a separate table.

There are more properties in the STATISTICS IO output, but if we can significantly decrease the above three indicators then chances are good that we'll improve our query performance.

So why are these three indicators so useful?

Tracking performance changes

The main reason I like performance tuning with STATISTICS IO is because it makes it easy to create a baseline for my query performance.  When I make changes to the query, it's then easy to see if my changes helped or hurt the query.

The main metric I use for this is logical reads.  Logical reads are pages pulled from the cache (memory), as opposed to physical reads, which count the pages read from disk.  However, all pages get loaded from disk into the cache before SQL Server is able to use them.

This makes logical reads great for tracking performance changes because it clearly tells me how many 8kB pages in total SQL Server needed to read in order to return my data.

If I add an index, does the total number of pages read go up or down?  Let me check my logical read counts and see.

What if I add some additional filtering or restructure my query?  I can easily tell if my changes hurt performance by seeing if the total number of logical reads went up or down.

Logical reads allow me to easily track the effectiveness of my tuning tactics.
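A minimal sketch of that baseline-and-compare workflow, reusing the hypothetical Products table from earlier:

SET STATISTICS IO ON
GO

-- Baseline run: note the logical reads in the Messages tab
SELECT ProductName FROM Products WHERE CreateDate > '2017-08-01'

-- Make one tuning change, e.g. add a covering index...
CREATE NONCLUSTERED INDEX IX_Products_CreateDate
ON Products (CreateDate)
INCLUDE (ProductName)

-- ...then rerun the exact same query and compare: did logical reads go down?
SELECT ProductName FROM Products WHERE CreateDate > '2017-08-01'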

The same concept applies to my worktable and LOB logical read properties.  For the former, any time SQL Server has to write data out to disk (tempdb in this case), performance will be slower.

In the latter case, if SQL Server needs to move around large objects that consist of multiple 8kB pages each, things will be slow.  If I can keep track of how many LOB logical reads SQL Server is performing, then I can focus on removing that overhead from my query.

Isolating other factors that impact performance

For the same reason that logical reads make it easy to track query performance after making changes, they also make it easy to mute the other factors that might affect performance.

For example, I used to think that simply watching how long it took a query to run was a good indicator of helping me performance tune.  If the total number of seconds it took my query to run decreased, my changes helped improve performance!

This is a potentially deceptive way to measure performance though: what if during my first run the server was getting slammed by other queries?  Elapsed run time alone isn't an effective way to measure performance.

Also, server environment hardware isn't always the same.  I might test a query in one environment and then deploy it to another.  My testing on an empty dev box might have been great, but as soon as the query runs in production along with all other queries, it might not perform as well.

More than one way to analyze a query

STATISTICS IO is a great place to start your performance tuning process.

It doesn't mean that you have to stop there though.  While being able to track the effects of your tuning changes and isolating other environment variables is important, ultimately you will have to use other means to actually improve performance.

So be sure to look at execution plans, dig into index and table statistics, rewrite the order of your table joins to see if it makes a difference, etc...  Just remember, performance always comes back to how much data SQL Server needs to process - reduce that and your queries will surely perform better.

Does The Join Order of My Tables Matter?


I had a great question submitted to me (thank you Brandman!) that I thought would make for a good blog post:

...I've been wondering if it really matters from a performance standpoint where I start my queries. For example, if I join from A-B-C, would I be better off starting at table B and then going to A & C?

The short answer: Yes.  And no.

Watch this week's video on YouTube

Table join order matters for performance!

Disclaimer: For this post, I'm only going to be talking about INNER joins.  OUTER (LEFT, RIGHT, FULL, etc...) joins are a whole 'nother animal that I'll save for another time.

Let's use the following query from WideWorldImporters for our examples:

/* 
-- Run this if you want to follow along - it adds a computed column and index for CountryOfManufacture
ALTER TABLE Warehouse.StockItems SET (SYSTEM_VERSIONING = OFF);   
ALTER TABLE Warehouse.StockItems
ADD CountryOfManufacture AS CAST(JSON_VALUE(CustomFields,'$.CountryOfManufacture') AS NVARCHAR(10)) 
ALTER TABLE Warehouse.StockItems SET (SYSTEM_VERSIONING = ON); 
CREATE INDEX IX_CountryOfManufacture ON Warehouse.StockItems (CountryOfManufacture)
*/

SELECT
  o.OrderID,
  s.CountryOfManufacture
FROM
  Sales.Orders o                        -- 73595 rows
  INNER JOIN Sales.OrderLines l         -- 231412 rows
    ON o.OrderID = l.OrderID            -- 231412 rows after join
  INNER JOIN Warehouse.StockItems s     -- 227 rows
    ON l.StockItemID = s.StockItemID    -- 1036 rows after join 
    AND s.CountryOfManufacture = 'USA'  -- 8 rows for USA   

Note: with an INNER join, I normally would prefer putting my 'USA' filter in the WHERE clause, but for the rest of these examples it'll be easier to have it part of the ON.

The key thing to notice is that we are joining three tables - Orders, OrderLines, and StockItems - and that OrderLines is what we use to join between the other two tables.

We basically have two options for table join orders then - we can join Orders with OrderLines first and then join in StockItems, or we can join OrderLines and StockItems first and then join in Orders.

In terms of performance, it's almost certain that the latter scenario (joining OrderLines with StockItems first) will be faster because StockItems will help us be more selective.

Selective?  Well, you might notice that our StockItems table is small, with only 227 rows.  It's made even smaller by filtering on 'USA', which reduces it to only 8 rows.

Since the StockItems table has no duplicate rows (it's a simple lookup table for product information) it is a great table to join with as early as possible since it will reduce the total number of rows getting passed around for the remainder of the query.

If we tried doing the Orders to OrderLines join first, we actually wouldn't filter out any rows in our first step, causing our subsequent join to StockItems to be slower (because more rows would have to be processed).

Basically, join order DOES matter because if we can join two tables that will reduce the number of rows needed to be processed by subsequent steps, then our performance will improve.

So if the order that our tables are joined in makes a big difference for performance reasons, SQL Server follows the join order we define right?

SQL Server doesn't let you choose the join order

SQL is a declarative language: you write code that specifies *what* data to get, not *how* to get it.

Basically, the SQL Server query optimizer takes your SQL query and decides on its own how it thinks it should get the data.

It does this by using precalculated statistics on your table sizes and data contents in order to be able to pick a "good enough" plan quickly.
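If you're curious what the optimizer is working from, you can inspect those statistics directly; for example, for the index we create on StockItems in the setup script above:

-- Display the header, density vector, and histogram the optimizer
-- uses to estimate row counts for this index
DBCC SHOW_STATISTICS ('Warehouse.StockItems', 'IX_CountryOfManufacture')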

So even if we rearrange the order of the tables in our FROM statement like this:

SELECT
  o.OrderID,
  s.CountryOfManufacture
FROM
  Sales.OrderLines l
  INNER JOIN Warehouse.StockItems s
    ON l.StockItemID = s.StockItemID
    AND s.CountryOfManufacture = 'USA'
  INNER JOIN Sales.Orders o
    ON o.OrderID = l.OrderID

Or if we add parentheses:

SELECT
  o.OrderID,
  s.CountryOfManufacture
FROM
  (Sales.OrderLines l
  INNER JOIN Sales.Orders o
    ON l.OrderID = o.OrderID)
  INNER JOIN Warehouse.StockItems s
    ON l.StockItemID = s.StockItemID
    AND s.CountryOfManufacture = 'USA'

Or even if we rewrite the tables into subqueries:

SELECT
  l.OrderID,
  s.CountryOfManufacture
FROM
  (
  SELECT 
    o.OrderID,
    l.StockItemId
  FROM
    Sales.OrderLines l
    INNER JOIN Sales.Orders o
      ON l.OrderID = o.OrderID
  ) l
  INNER JOIN Warehouse.StockItems s
    ON l.StockItemID = s.StockItemID
    AND s.CountryOfManufacture = 'USA'

SQL Server will interpret and optimize our three separate queries (plus the original one from the top of the page) into the same exact execution plan:

[Screenshot: all four query variations compile to the identical execution plan]

Basically, no matter how we try to redefine the order of our tables in the FROM statement, SQL Server will still do what it thinks is best.

But what if SQL Server doesn't know best?

The majority of the time when I see SQL Server doing something inefficient with an execution plan, it's due to something wrong with the statistics for that table/index.

Statistics are also a whole 'nother topic for a whole 'nother day (or month) of blog posts, so to not get too sidetracked with this post, I'll point you to Kimberly Tripp's introductory blog post on the subject: https://www.sqlskills.com/blogs/kimberly/the-accidental-dba-day-15-of-30-statistics-maintenance/

The key thing to take away is that if SQL Server is generating an execution plan where the order of table joins doesn't make sense, check your statistics first - they are the root cause of many performance problems!
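A quick way to gauge that health (a sketch; sys.dm_db_stats_properties requires SQL Server 2008 R2 SP2 / 2012 SP1 or later):

-- How stale are the statistics on StockItems? Check when each statistic
-- was last updated and how many rows have changed since then.
SELECT
    s.name,
    sp.last_updated,
    sp.rows,
    sp.modification_counter
FROM
    sys.stats s
    CROSS APPLY sys.dm_db_stats_properties(s.object_id, s.stats_id) sp
WHERE
    s.object_id = OBJECT_ID('Warehouse.StockItems')

-- If they look stale, refresh them
UPDATE STATISTICS Warehouse.StockItems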

Forcing a join order

So you already checked to see if your statistics are the problem and exhausted all possibilities on that front.  SQL Server isn't optimizing for the optimal table join order, so what can you do?

Row goals

If SQL Server isn't behaving and I need to force a table join order, my preferred way is to do it via a TOP() clause.

I learned this technique from watching Adam Machanic's fantastic presentation on the subject and I highly recommend you watch it.

Since in our example query SQL Server is already joining the tables in the most efficient order, let's force an inefficient join by joining Orders with OrderLines first.

Basically, we write a subquery around the tables we want to join together first and make sure to include a TOP clause.

SELECT
  o.OrderID,
  s.CountryOfManufacture
FROM
  (
  SELECT TOP(2147483647) -- A number of rows we know is larger than our table.  Watch Adam's presentation above for more info.
    o.OrderID,
    l.StockItemID
  FROM
    Sales.Orders o
    INNER JOIN Sales.OrderLines l
      ON o.OrderID = l.OrderID
  ) o
  INNER JOIN Warehouse.StockItems s
    ON o.StockItemID = s.StockItemID
    AND s.CountryOfManufacture = 'USA'

Including TOP forces SQL to perform the join between Orders and OrderLines first - inefficient in this example, but a great success in being able to control what SQL Server does.

[Screenshot: execution plan showing Orders and OrderLines joined together first]

This is my favorite way of forcing a join order because we get to inject control over the join order of two specific tables in this case (Orders and OrderLines) but SQL Server will still use its own judgement in how any remaining tables should be joined.

While forcing a join order is generally a bad idea (what happens if the underlying data changes in the future and your forced join is no longer the best option?), in certain scenarios where it's required, the TOP technique will cause the fewest performance problems (since SQL Server still gets to decide what happens with the rest of the tables).

The same can't be said if using hints...

Query and join hints

Query and join hints will successfully force the order of the table joins in your query; however, they have significant drawbacks.

Let's look at the FORCE ORDER query hint.  Adding it to your query will successfully force the table joins to occur in the order that they are listed:

SELECT
  o.OrderID,
  s.CountryOfManufacture
FROM
  Sales.Orders o
  INNER JOIN Sales.OrderLines l
    ON o.OrderID = l.OrderID
  INNER JOIN Warehouse.StockItems s
    ON l.StockItemID = s.StockItemID
    AND s.CountryOfManufacture = 'USA'
OPTION (FORCE ORDER)

Looking at the execution plan we can see that Orders and OrderLines were joined together first as expected:

[Screenshot: execution plan with Orders and OrderLines joined first, following the listed order]

The biggest drawback with the FORCE ORDER hint is that all tables in your query are going to have their join order forced (not evident in this example...but imagine we were joining 4 or 5 tables in total).

This makes your query incredibly fragile; if the underlying data changes in the future, you could be forcing multiple inefficient join orders.  Your query that you tuned with FORCE ORDER could go from running in seconds to minutes or hours.

The same problem exists with using join hints:

SELECT
  o.OrderID,
  s.CountryOfManufacture
FROM
  Sales.Orders o 
  INNER LOOP JOIN Sales.OrderLines l
    ON o.OrderID = l.OrderID
  INNER JOIN Warehouse.StockItems s
    ON l.StockItemID = s.StockItemID
    AND s.CountryOfManufacture = 'USA'

Using the LOOP hint successfully forces our join order again, but once again the join order of all of our tables becomes fixed:

[Screenshot: execution plan showing the forced loop join and fixed table order]

A join hint is probably the most fragile hint that forces table join order because not only is it forcing the join order, but it's also forcing the algorithm used to perform the join.

In general, I only use query hints to force table join order as a temporary fix.

Maybe production has a problem and I need to get things running again; a query or join hint may be the quickest way to fix the immediate issue.  However, long term using the hint is probably a bad idea, so after the immediate fires are put out I will go back and try to determine the root cause of the performance problem.

Summary

  • Table join order matters for reducing the number of rows that the rest of the query needs to process.
  • By default SQL Server gives you no control over the join order - it uses statistics and the query optimizer to pick what it thinks is a good join order.
  • Most of the time, the query optimizer does a great job at picking efficient join orders.  When it doesn't, the first thing I do is check to see the health of my statistics and figure out if it's picking a sub-optimal plan because of that.
  • If I am in a special scenario and I truly do need to force a join order, I'll use the TOP clause to force a join order since it only forces the order of a single join.
  • In an emergency "production-servers-are-on-fire" scenario, I might use a query or join hint to immediately fix a performance issue and go back to implement a better solution once things calm down.

How NOLOCK Will Block Your Queries

Photo by James Sutton on Unsplash

Note: the problem described below applies to all SELECT queries, not just those adorned with NOLOCK hints.  The fact that it applies to NOLOCK queries was a huge surprise to me though, hence the title.

Lots of people don't like NOLOCK (i.e. the read uncommitted isolation level) because it can return inaccurate data.  I've seen plenty of arguments cautioning developers against uncommitted reads because of how they can return dirty data, phantom reads, and non-repeatable reads.

I've known about all of those above problems, but there's one problem that I've never heard of until recently: NOLOCK can block other queries from running.

Watch this week's video on YouTube

Let's step back and understand why I've so often used NOLOCK in the past.  A fairly typical instance of when I use NOLOCK is when I want to let a query run overnight to return some large set of data.  I'm okay with some inconsistencies in the data (from dirty reads, etc...).  My primary concern is that I don't want the long running query to get in the way of other processes.

I always thought NOLOCK was a perfect solution for this scenario because it never locks the data that it reads - the results might not be perfect, but at least the query won't negatively impact any other process on the server.

This is where my understanding of NOLOCK was wrong: while NOLOCK won't lock row level data, it will take out a schema stability lock.

A schema stability (Sch-S) lock prevents the structure of a table from changing while the query is executing.  All SELECT statements, including those in the read uncommitted/NOLOCK isolation level, take out a Sch-S lock.  This makes sense because we wouldn't want to start reading data from a table and then have the column structure change halfway through the data retrieval.

However, this also means there might be some operations that get blocked by a Sch-S lock.  For example, any command requesting a schema modification (Sch-M) lock gets blocked in this scenario.

What commands request Sch-M locks?

Things like an index REBUILD or running sp_recompile on a table.  These are exactly the types of commands running in my nightly maintenance jobs that I was trying to avoid hurting by using NOLOCK in the first place!

To reiterate, I used to think that using the NOLOCK hint was a great way to prevent blocking during long running queries.  However, it turns out that my NOLOCK queries were actually blocking my nightly index jobs (all SELECT queries block in this example, but I find NOLOCK to be particularly misleading), which then caused other SELECT statements to get blocked too!

Let's take a look at this in action.  Here I have a query that creates a database, table, and then runs a long running query with NOLOCK:

DROP DATABASE IF EXISTS [Sandbox]
GO
CREATE DATABASE [Sandbox]
GO
USE [Sandbox]
GO

DROP TABLE IF EXISTS dbo.Test
CREATE TABLE dbo.Test
(
    c0 int IDENTITY PRIMARY KEY,
    c1 varchar(700) default REPLICATE('a',700)
)

CREATE NONCLUSTERED INDEX IX_Id ON dbo.Test (c1);
GO

INSERT INTO dbo.Test DEFAULT VALUES;
GO 1000


-- Read a billion records
SELECT * 
FROM 
    dbo.Test t1 (NOLOCK) 
    CROSS JOIN dbo.Test t2 (NOLOCK) 
    CROSS JOIN dbo.Test t3 (NOLOCK) 

Now, while that billion row read is occurring, we can verify that the query took out a Sch-S lock by looking at sys.dm_tran_locks:

SELECT *
FROM sys.dm_tran_locks
WHERE resource_type = 'OBJECT'

[Screenshot: sys.dm_tran_locks showing the Sch-S lock granted]

While that's running, if we try to rebuild an index, that rebuild is blocked (shown as a WAIT):

USE [Sandbox]
GO

ALTER INDEX IX_Id ON dbo.Test REBUILD

[Screenshot: the index rebuild shown as blocked, waiting on its Sch-M lock]

Our index rebuild query will remain blocked until our billion row NOLOCK SELECT query finishes running (or is killed).  This means the query that I intended to be completely unobtrusive is now blocking my nightly index maintenance job from running.

Even worse, any other queries that try to run after the REBUILD query (or any other commands that request a Sch-M lock) are going to get blocked as well!  If I try to run a simple COUNT(*) query:

USE [Sandbox]
GO

SELECT COUNT(*) FROM dbo.Test

[Screenshot: the COUNT(*) query blocked behind the rebuild's Sch-M lock request]

Blocked!  This means that not only is my initial NOLOCK query causing my index REBUILD maintenance jobs to wait, the Sch-M lock placed by the REBUILD maintenance job is causing any subsequent queries on that table to get blocked and be forced to wait as well.  I just derailed the timeliness of my maintenance job and subsequent queries with a blocking NOLOCK statement!

Solutions

Unfortunately this is a tough problem and there's no one-size-fits-all remedy.

Solution #1: Don't run long running queries

I could avoid running long queries at night when they might run into my index maintenance jobs.  This would prevent those index maintenance jobs and subsequent queries from getting delayed, but it means my initial billion row select query would then have to run earlier, negatively impacting server performance during a potentially busier time of day.

Solution #2: Use WAIT_AT_LOW_PRIORITY

Starting with SQL Server 2014, I could do an online index rebuild with the WAIT_AT_LOW_PRIORITY option set:

ALTER INDEX IX_Id ON dbo.Test REBUILD 
WITH (ONLINE = ON (WAIT_AT_LOW_PRIORITY (MAX_DURATION = 1 MINUTES , ABORT_AFTER_WAIT = BLOCKERS)))

This query basically gives any currently running SELECT queries that are blocking it 1 minute to finish executing, or else this query will kill them and then execute the index rebuild.  Alternatively, we could have set ABORT_AFTER_WAIT = SELF and the rebuild query would kill itself instead, allowing the NOLOCK billion row SELECT to finish running and not preventing any other queries from running.
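For reference, the SELF variant is a one-keyword change; here the rebuild sacrifices itself instead of the queries blocking it:

-- If blockers are still running after 1 minute, kill this rebuild
-- instead of killing the queries holding Sch-S locks
ALTER INDEX IX_Id ON dbo.Test REBUILD 
WITH (ONLINE = ON (WAIT_AT_LOW_PRIORITY (MAX_DURATION = 1 MINUTES , ABORT_AFTER_WAIT = SELF)))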

This is not a great solution because it means either the long running query gets killed or the index REBUILD gets killed.

Solution #3: REBUILD if no Sch-S, REORGANIZE otherwise

A programmatic solution can be written that tries to REBUILD the index, but falls back to REORGANIZE if it knows it will have to wait for a Sch-M lock.

I've created the boilerplate below as a starting point, but the sky is the limit with what you can do with it (e.g. create a WHILE loop to check for the lock every x seconds, create a timeout for when the script should stop trying to REBUILD and just REORGANIZE instead, etc...)

-- Idea for how to rebuild/reorganize based on a schema stability lock.
-- More of a starting point than fully functional code.
-- Not fully tested, you have been warned!
DECLARE 
    @TableName varchar(128) = 'Test',
    @HasSchemaStabilityLock bit = 0

SELECT TOP 1 @HasSchemaStabilityLock = 
    CASE WHEN l.request_mode IS NOT NULL THEN 1 ELSE 0 END
    FROM
        sys.dm_tran_locks as l
    WHERE
        l.resource_type = 'OBJECT'
        AND l.request_mode = 'Sch-S'
        AND l.request_type = 'LOCK'
        AND l.request_status = 'GRANT'
        AND OBJECT_NAME(l.resource_associated_entity_id) = @TableName

IF @HasSchemaStabilityLock = 0
BEGIN
    -- Perform a rebuild
    ALTER INDEX IX_Id ON dbo.Test REBUILD
    PRINT 'Index rebuilt'
END
ELSE
BEGIN
    -- Perform a REORG
    ALTER INDEX IX_Id ON dbo.Test REORGANIZE
    PRINT 'Index reorganized'
END

This solution is my favorite because:

  1. Ad hoc long running queries don't get killed (all of that time spent processing doesn't go to waste)
  2. Other SELECT queries are not blocked by the REBUILD's Sch-M lock request
  3. Index maintenance still occurs, even if it ends up being a REORGANIZE instead of a REBUILD

Clustered vs Nonclustered: Index Fundamentals You Need To Know

Photo by Patrick Tomasso on Unsplash

How many times have you known that adding an index would improve query performance but you weren't exactly sure which type of index to add?

This happened to me all the time in my first few years (and maybe an extra year or two after that) of working with SQL Server.

Today I want to help alleviate some of that confusion by comparing two of the most common index types: clustered and nonclustered rowstore indexes.

Watch this week's video on YouTube

Clustered Indexes

Every table's data has some natural order to it.

If the order is random and not explicitly defined, then that table is known as a heap.  With the exception of a few special cases, we generally don't want heaps.  Heaps don't perform well for the majority of queries because SQL Server has no meta knowledge about where data is stored within a heap.

If we don't have a random heap, that means we have defined the order that data should be stored in a table.  The physical storage order of our data is defined by our clustered index.

Every table can have exactly one clustered index because the data in a table can only be stored in one order i.e. you can't have that table's data physically stored on the disk in more than one order.

What are the benefits of a clustered index?

The data in a clustered index is stored in order.  That means:

  1. Finding the data you need in your clustered index is a matter of knowing where to look in an alphabetized list of data.  Computers are really good at doing this.
  2. If your data needs to be outputted in the same order that it's stored in - presto! - SQL doesn't need to do any additional sorting.

A lot of people like to put the clustered index on their table's primary key (PK).  This is usually fine because a lot of the time our primary key is likely to be our most used field for joins, where statements, etc...

Some people think they can ONLY put their clustered index on their PK.  That's not true! Often times it can be much more beneficial to put your clustered index on something that isn't your PK, like a different column that is getting more use than our PK.  For example, you might have an identity column set as your PK, but every query against your table is filtering and sorting on a datetime2 column.  Why store your table in PK order if you are always going to be filtering and returning data on that datetime2 column?  Put that clustered index on the datetime2 column!
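A hedged sketch of what that could look like (the table and column names are hypothetical):

-- Keep the identity column as the primary key, but make it nonclustered...
CREATE TABLE dbo.Events
(
    EventId int IDENTITY PRIMARY KEY NONCLUSTERED,
    EventDate datetime2 NOT NULL,
    Payload varchar(100) NULL
)

-- ...so the clustered index is free to live on the column that every
-- query actually filters and sorts on
CREATE CLUSTERED INDEX CIX_Events_EventDate ON dbo.Events (EventDate)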

The downside to having data stored in this order is that actions like inserts and updates take longer, because SQL has to put the data into its correct sorted position within the table's pages - it can't just quickly tack it onto the end.

Another major benefit of a clustered index is that we don't have to "include" any additional data in our index.  All of the data for our row exists right beside our indexed columns.  This is not necessarily true of other index types (see nonclustered indexes below).

Pretend our clustered index is like the white pages of a phone book (note to future SQL developers in 2030 who have no idea what a phonebook is: it's something that stores the names, addresses, and landline phone numbers in your area.  What's a landline?  Oh boy...)

The phone book stores every person's name in alphabetical order, making it easy to look up certain individuals.  Additionally, if we look someone up, we immediately have their address and phone number right there next to their name - no additional searching necessary!

This is a great feature of clustered indexes - if you ever need to retrieve many or all columns from your table, a clustered index will usually be efficient because once it finds the indexed value you are searching on, it doesn't need to go anywhere else to get the remaining data from that row.

Nonclustered Indexes

If a clustered index is like a phone book, a nonclustered index is like the index in the back of a chemistry text book.  The chemistry text book has some natural order to it ("Chapter 1: Matter, Chapter 2: Elements, Chapter 3: Compounds, etc...").  However, this order doesn't help us if we want to look up the location of something specific, like "noble gases".

So what do we do?  We go to the index in the back of the textbook which lists all topics in alphabetical order, making it easy to find the listing for "noble gases" and the page number they are discussed on.  Once we know the page number for noble gases from our index, we can flip to the correct page and get the data we need.

This book index represents our nonclustered index.  A nonclustered index contains the ordered data for the columns specified in that index, with pointers (book page numbers) that tell us where to go to find the rest of the data from that row (flip to the right book page).  That means unlike a clustered index where all data is always present, using a nonclustered index often is a two step process: find the value of interest in the index and then go look up the rest of that row's data from where it actually exists on disk.

What are the benefits of a nonclustered index?

We can have as many nonclustered indexes on our tables as we want (well, we max out at 999).  That's great! Create an index for every column!

Well, no, don't do that.  There's overhead in creating nonclustered indexes.  Essentially, every time you index some column(s), you are duplicating the unique values in those column(s) so that they can be stored in sorted order in your index.  We get speed and efficiency in our data lookups, but at the cost of additional disk space.  You need to test and see, for each table and set of queries, what the optimal number of indexes is.  Adding an additional index can absolutely destroy performance, so always test your changes!

Additionally, using a nonclustered index to find an indexed column's value is fast (SQL is just going through the ordered index data to find the value it needs - once again, something computers are really good at doing).  However, if you need other columns of data from the row that you just looked up, SQL is going to have to use those index pointers to go find the rest of that row data somewhere else on disk.  This can really add up and slow down performance.

If those additional lookups are hurting performance, what you can do is INCLUDE your nonindexed columns in your nonclustered index.  In addition to storing the sorted values of your indexed column(s), the index will then also store whatever additional values you want to include as part of the index itself.  Once again, you'll probably get better performance because SQL won't have to go somewhere else on disk to find the data it needs, but you lose storage space because you are creating duplicates of that data as part of your index.
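A minimal sketch of that trade-off (all names hypothetical): the SELECT below is fully covered by the index, so no extra lookups happen, at the cost of duplicating OrderDate and TotalDue inside the index pages.

-- CustomerID is the sorted key; OrderDate and TotalDue ride along in the
-- index so SQL Server never has to visit the base table for this query
CREATE NONCLUSTERED INDEX IX_Orders_CustomerID
ON dbo.Orders (CustomerID)
INCLUDE (OrderDate, TotalDue)

-- Fully covered by the index above
SELECT CustomerID, OrderDate, TotalDue
FROM dbo.Orders
WHERE CustomerID = 123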

Example Usage Scenarios

Note: I want to clarify that the above definitions and below examples don't cover lots of corner cases (blob values, fragmentation, etc...).  I wanted this post to be a simple starting point when people don't know what index type they should try adding first, because this was the paralysis that I had when starting out.

Every statement in this article can probably have an asterisk appended to the end of it, pointing out some example where a recommendation I wrote is 100% wrong.  ALWAYS test your index changes, because what might improve one query may hurt another one already running on that table, and over time you will learn about all of those edge cases and how they affect index performance.

Alright let's take a look at a few common scenarios and what the best index for them might be.  After reading each scenario, take a guess about what kind of index you would add and then click on the answer to reveal what I would do in that scenario.  Assume no indexes exist yet on these tables unless otherwise noted.

  • You have OLTP data that's used only for transactional reads and writing new rows. You know the primary key is an identity integer column.  What type of index would you create for the primary key?
    Clustered index - Your queries are probably always going to be looking up by PK to return data.  If you store the data in the table ordered by that PK, SQL will be able to do this very quickly.  New row additions to the table will always get put at the end because of the auto-incrementing identity column, not creating any overhead for having to insert data in a specific location in the ordered data.
  • You have a query that wants to return most or all of the columns from a table.  What type of index would make this the most efficient?
    Clustered index - Since all of the column values are stored in the same location as your indexed fields, SQL won't have to go do any additional lookups to get all of the data you are requesting from it.  If you created a nonclustered index you would have to INCLUDE all nonindexed columns, which would take up lots of space since you are essentially duplicating your entire table's data.
  • You have a table that is constantly having values updated.  These updated values are used in your JOINs and WHERE clauses.  What type of index would you add? Nonclustered index - If our values are constantly changing, SQL only has to update the index and pointers while putting the actual data wherever it has available space on disk.  Compare this to a clustered index where it has to put the inserted/updated data in the correct order, meaning potentially lots of operations to shift the data around if available free space doesn't exist at that location.
  • You have a table that already has a clustered index, but it doesn't cover columns in JOINs and WHERE clauses.  What do you do? Nonclustered index - since the clustered index already exists, your only option is to add a nonclustered index.  Depending on the queries hitting this table however, you may want to consider changing your clustered index to a nonclustered index if you think your JOINs and WHERE clauses will be improved by having those fields be part of the clustered index.  Test it out!
  • You have a small staging table that you will always read all rows from and then truncate.  You don't care about the order.  Do you add an index? No, leave it as a heap - This is one scenario where not adding an index can give you better performance since there is no overhead in SQL having to store things in a sorted order or update indexes to specify the order.  If you truly don't care about the order, and you will always be reading all rows from a table and then truncating the table, then it's better not to have the overhead of having indexes on the table.