
Cardinality: Not Just For The Birds


When building indexes for your queries, the order of your index key columns matters.  SQL Server can make the most effective use of an index if the data in that index is stored in the same order as what your query requires for a join, where predicate, grouping, or order by clause.

But if your query requires multiple key columns because of multiple predicates (e.g. WHERE Color = 'Red' AND Size = 'Medium'), in what order should you define the columns in your index key?

Cardinality

In SQL Server, cardinality refers to the number of distinct values in a column.  All other considerations aside, when you are defining the key columns for your index, the column with the highest cardinality, or greatest number of distinct values, should go first.

To understand why, let's go back to our example columns of Color and Size.  If we have a table of data indicating the colors and sizes of various birds, it may look something like this:

[Image: sample rows from the Birds table, with Color and Size columns]

If we were to count the number of distinct values in each of our Color and Size columns, we would find that we have 20 distinct colors, but only 5 distinct sizes:

SELECT 
    COUNT(DISTINCT Color) AS DistinctColors, 
    COUNT(DISTINCT Size) AS DistinctSizes
FROM 
    dbo.Birds

(To make things easier for this example, the data in this table is perfectly evenly distributed across all 20 colors and 5 sizes, meaning each color is represented by one of each of the five sizes, making for a total of 100 rows.)

If we were to put Size as our leading index key column, SQL Server would immediately be able to narrow down the number of rows it has to search to match our predicate (WHERE Color = 'Red' AND Size = 'Medium') to 20 rows.  After all, we can eliminate all rows where the size is not Medium:

[Image: Birds table ordered by Size]

However, if we instead put Color as our first column, we can immediately eliminate 95% of the possibilities in our data set – only 5 rows with a value of 'Red' remain, one for each of our 5 distinct sizes (remember the data is perfectly distributed):

[Image: Birds table ordered by Color]

In most scenarios, putting the column with the highest cardinality first will allow SQL Server to filter out most of the data it knows it doesn't need, allowing it to focus on a smaller subset of data that it does still need to compare.

There are instances where you might want to deviate from this general rule though, like when you are trying to maximize an index's use by multiple queries; sometimes it might make sense to not put the columns in highest cardinality order if it means more queries are going to be able to make use of a single index.
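The two key orders compared above can be sketched in T-SQL. This is a minimal illustration that assumes the dbo.Birds table from the example; the index names are hypothetical and not from the original post.

```sql
-- Assumes dbo.Birds has Color and Size columns, as in the example above.
-- Index names below are hypothetical.

-- Rough selectivity check: average number of rows per distinct value.
-- With 100 evenly distributed rows this works out to 5 rows per color
-- and 20 rows per size.
SELECT
    COUNT(*) * 1.0 / COUNT(DISTINCT Color) AS AvgRowsPerColor,
    COUNT(*) * 1.0 / COUNT(DISTINCT Size)  AS AvgRowsPerSize
FROM
    dbo.Birds;

-- Higher-cardinality column (20 distinct colors) leads the key
CREATE INDEX IX_Birds_Color_Size ON dbo.Birds (Color, Size);

-- Lower-cardinality column (5 distinct sizes) leads the key, for comparison
CREATE INDEX IX_Birds_Size_Color ON dbo.Birds (Size, Color);

-- The query both indexes are meant to serve
SELECT Color, Size
FROM dbo.Birds
WHERE Color = 'Red' AND Size = 'Medium';
```

In practice you would likely keep only one of the two indexes; both are created here only so their execution plans can be compared side by side.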

Does The Order Of Index Columns Matter?


When beginning to learn SQL, at some point you learn that indexes can be created to help improve the performance of queries.

Creating your first few indexes can be intimidating though, particularly when trying to understand what order to put your key columns in.

Today we'll look at how row store indexes work to understand whether index column order matters.

Heap: Stack of Pages

Imagine a stack of loose leaf pages.  This collection of pages is our table.

Each page has information about a bird on it - the bird's name, picture, description, habitat, migration patterns, visual markings, etc...  You can think of each of these pages as a row of data.

The problem with this stack of pages is that there is no enforced order: it's a heap.  Without any enforced order, searching for individual birds is time consuming; in order to find a particular bird, for example a blue jay, you would have to go through the stack of pages one at a time until you find the blue jay page.

The scanning doesn't stop there though.  Even though we found a blue jay page, there's no way for us to guarantee that there are no other blue jay pages in the stack.  This means we have to continue flipping through every page until we finish searching through the whole heap of pages.

Having to do this process every single time we need to retrieve data from our bird table is painful.  To make our job easier, we can define and enforce an order on the data by defining a clustered index.

Clustered Index: Bound Pages

To make searching through our pages easier, we sort all of the pages by bird name and glue on a binding.  This book binding now keeps all of our pages in alphabetical order by bird name.

The SQL version of a book binding is a clustered index.  The clustered index is not an additional object to our data - it is that same exact table data, but now with an enforced sort order.

Having all of our data in sorted order by bird name makes certain queries really fast and efficient - instead of having to scan through every page to find the blue jay entry, we can now quickly flip to the "B" section, then the "BL" section, then the "BLU" section, etc... until we find BLUE JAY.  This is done quickly and efficiently because we know where to find blue jays in the book because the bird names are stored in alphabetical order.

Even better, after we find the blue jay page, we flip to the next page and see a page for cardinal.  Since we know all of the entries are stored alphabetically, we know that once we get to the next bird we have found all of our blue jay pages and don't need to continue flipping through the rest of the book.

While the clustered index allows us to find birds by name quickly, it's not perfect; since the clustered index is the table, it contains every property (column) of each bird, which is a lot of data!

Having to constantly reference this large clustered index for each of our queries can be too cumbersome.  For most of our queries, we could get by with a condensed version of our bird book that only contains the most essential information.

Nonclustered Index: Cut and Copy

Let's say we want a lighter-weight version of our book that contains the most relevant information (bird name, color, description).

We can photocopy the entire book and then cut out and keep only the pieces of information that are relevant while discarding the rest.  If we paste all of those relevant pieces of information into a new book, still sorted by bird name, we now have a second copy of our data.  This is our nonclustered index.

This nonclustered index contains all of the same birds as my clustered index, just with fewer columns.  This means I can fit multiple birds onto a page, requiring me to flip through fewer pages to find the bird I need.

If we ever need to look up additional information about a particular bird that's not in our nonclustered index, we can always go back to our giant clustered index and retrieve any information we need.

With the lighter-weight nonclustered index in-hand, we go out to the woods to start identifying some birds.

Upon spotting an unfamiliar bird in our binoculars, we can flip open the nonclustered index to identify the bird.

The only problem is, since we don't know this bird's name, our nonclustered index by bird name is of no help.  We end up having to flip through each page one at a time trying to identify the bird instead of flipping directly to the correct page.

For these types of inquiries, where we want to identify a bird but don't know the bird's name, a different index would be beneficial...

Nonclustered Index 2: Color Bugaloo

Instead of having a nonclustered index sorted by bird name, what we really need is a way to filter down to the list of potential birds quickly.

One way we can do this is to create another copy of our book, still containing just bird names, colors, and descriptions, but this time order the book pages so they are in order of color first, then bird name.

When trying to identify an unknown bird, we can first limit the number of pages to search through by filtering on the bird's color.  In our case, color is a highly selective trait, since it filters down our list of potential birds to only a small subset of the whole book.  In our blue jay example, this means we would find the small subset of pages that contain blue birds, and then just check each one of those pages individually until we find the blue jay.
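The book analogy maps onto index definitions roughly as follows. This is a sketch only: the dbo.BirdBook table, its columns, and the index names are hypothetical and chosen to mirror the analogy.

```sql
-- Hypothetical table standing in for the stack of bird pages (a heap until
-- a clustered index is created on it)
CREATE TABLE dbo.BirdBook
(
    BirdName    varchar(100)  NOT NULL,
    Color       varchar(50)   NOT NULL,
    Description varchar(1000) NULL,
    Habitat     varchar(200)  NULL
);

-- The "binding": the table data itself, kept in bird name order
CREATE CLUSTERED INDEX CX_BirdBook_BirdName
    ON dbo.BirdBook (BirdName);

-- The lighter-weight copy: still sorted by bird name, fewer columns
CREATE NONCLUSTERED INDEX IX_BirdBook_BirdName
    ON dbo.BirdBook (BirdName) INCLUDE (Color, Description);

-- The second copy: sorted by color first, then bird name
CREATE NONCLUSTERED INDEX IX_BirdBook_Color_BirdName
    ON dbo.BirdBook (Color, BirdName) INCLUDE (Description);

-- Identifying an unknown blue bird can seek on the color-first index
SELECT BirdName, Description
FROM dbo.BirdBook
WHERE Color = 'Blue';
```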

Order Matters

Indexes aren't magic; their high-performance capabilities come from the fact that they store data in a predetermined order.  If your query can utilize data stored in that order, great!

However, if your query wants to filter down on color first, but your index is sorted on bird name, then you'll be out of luck.  When it comes to determining what column should be the first key in your index, you should choose whichever one will be most selective (the one that will filter you down to the smallest subset of results) for your particular query.

There's a lot more optimizing that can be done with indexes, but correctly choosing the order of columns for your index key is an essential first step.

Want to learn even more about index column order? Be sure to check out this post on cardinality.

Should You Use Index Hints?


One of the things that the SQL Server query optimizer does is determine how to retrieve the data requested by your query.

Usually it does a pretty good job, which is great because if it didn't then we'd be spending most of our days programming sorting and joining algorithms instead of having fun actually working with our data.

Sometimes the query optimizer has a lapse in judgement and creates a less-than-efficient plan, requiring us to step in and save the day.

Index Hints Give You Control

One way to "fix" a poorly performing plan is to use an index hint.  While we normally have no control over how SQL Server retrieves the data we requested, an index hint forces the query optimizer to use the index specified in the hint to retrieve the data (hence, it's really more of a "command" than a "hint").

Sometimes when I feel like I'm losing control I like using an index hint to show SQL Server who's boss.  I occasionally will also use index hints when debugging poorly performing queries because it allows me to confirm whether using an alternate index would improve performance without having to overhaul my code or change any other settings.

...But Sometimes That's Too Much Power

While I like using index hints for short-term debugging scenarios, that's about the only time they should be used because they can create some pretty undesirable outcomes.

For example, let's say I have this nice simple query and index here:

CREATE INDEX IX_OwnerUserId_CreationDate_Includes
ON dbo.Posts (OwnerUserId, CreationDate) INCLUDE (AcceptedAnswerId, ClosedDate, CommentCount, FavoriteCount, LastActivityDate);

SELECT
    OwnerUserId,
    AcceptedAnswerId
FROM
    dbo.Posts
WHERE
    OwnerUserId < 1000

This index was specifically created for a different query running on the Posts table, but it will also get used by the simple query above.

Executing this query without any hints causes SQL Server to use it anyway (since it's a pretty good index for the query), and we get decent performance: only 1002 logical reads.

[Image: execution plan with an index seek. Caption: I wish all of my execution plans were this simple.]

Let's pretend we don't trust the SQL Server optimizer to always choose this index, so instead we force it to use it by adding a hint:

SELECT
    OwnerUserId,
    AcceptedAnswerId
FROM
    dbo.Posts WITH (INDEX(IX_OwnerUserId_CreationDate_Includes))
WHERE
    OwnerUserId < 1000

With this hint, the query will perform exactly the same: 1002 logical reads, a good index seek, etc...

But what happens if in the future a better index gets added to the table?

CREATE INDEX IX_OwnerUserId_AcceptedAnswerId_Includes
ON dbo.Posts (OwnerUserId, AcceptedAnswerId) INCLUDE (LastEditorUserId, ParentId);

If we run the query WITHOUT the index hint, we'll see that SQL Server actually chooses this new index because it's smaller and we can get the data we need in only 522 logical reads:

[Image: execution plan. Caption: This execution plan looks the same, but you'll notice the smaller, more data dense index is being used.]

If we had let SQL Server do its job, it would have given us a great performing query!  Instead, we decided to intervene and hint (i.e. force) it to use a sub-optimal index.
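One way to reproduce the logical read comparison is to run both versions of the query with SET STATISTICS IO enabled and compare the "logical reads" values reported in the Messages output. This is a sketch that assumes the dbo.Posts table and both indexes from the example exist; the actual read counts will depend on your copy of the data.

```sql
SET STATISTICS IO ON;

-- Optimizer's choice: free to pick the newer, smaller index
SELECT
    OwnerUserId,
    AcceptedAnswerId
FROM
    dbo.Posts
WHERE
    OwnerUserId < 1000;

-- Hinted: forced to use the original index
SELECT
    OwnerUserId,
    AcceptedAnswerId
FROM
    dbo.Posts WITH (INDEX(IX_OwnerUserId_CreationDate_Includes))
WHERE
    OwnerUserId < 1000;

SET STATISTICS IO OFF;
```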

Things Can Get Worse

The above example is pretty benign - sure, without the hint SQL Server would have read about half as many pages, but this isn't a drastic difference in this scenario.

What could be disastrous is if, because of the hint, the query optimizer decides to make a totally different plan that isn't nearly as efficient.  Or if one day someone drops the hinted index, causing the query with the hint to downright fail with an error because the index named in the hint no longer exists:

[Image: error message returned by the hinted query after the index was dropped]

Index hints can be nice to use in the short-term for investigating, testing, and debugging.  However, they are almost never the correct long-term solution for fixing query performance.

Instead, it's better to look for the root cause of a poorly performing query: maybe you need to update the statistics on an index or determine if the cardinality estimator being used is not ideal.  You might also benefit from rewriting a terribly written query.

Any of these options will likely help you create a better, long-term, flexible solution rather than forcing SQL Server to use the same hard-coded, potentially sub-optimal index forever.
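As a rough sketch of those root-cause checks, the statements below refresh statistics and compare plans under the legacy cardinality estimator. They assume the dbo.Posts table and index from the example; the USE HINT query option is available starting with SQL Server 2016 SP1 and is meant here as a diagnostic, not as a permanent fix.

```sql
-- Refresh the statistics behind the index (FULLSCAN reads every row,
-- so it can be slow on large tables)
UPDATE STATISTICS dbo.Posts IX_OwnerUserId_CreationDate_Includes WITH FULLSCAN;

-- Diagnostic only: check whether the legacy cardinality estimator
-- produces a different plan for the same query
SELECT
    OwnerUserId,
    AcceptedAnswerId
FROM
    dbo.Posts
WHERE
    OwnerUserId < 1000
OPTION (USE HINT ('FORCE_LEGACY_CARDINALITY_ESTIMATION'));
```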

Is It Possible To Conditionally Index JSON Data?


Recently I received a great question from an attendee to one of my sessions on JSON (what's up Nam!):

[Image: screenshot of Nam's question about conditionally indexing JSON data]

At first glance it sounds like a filtered index question, and ultimately it is, but because of some of the intricacies involved in the response I thought it would make for a good blog post.

The Problem: Schema On Read

Imagine I have a central table that keeps track of warnings and errors for my burrito ordering app:

DROP TABLE IF EXISTS dbo.BurritoAppLog;
GO

CREATE TABLE dbo.BurritoAppLog 
( 
    Id int IDENTITY PRIMARY KEY,
    ErrorDetails nvarchar(1000)
); 
GO 

INSERT INTO dbo.BurritoAppLog VALUES (N'{"Type":"Warning", "MessageId": 100, "Severity": "High", "Information":"Running low on steak." }'); 
INSERT INTO dbo.BurritoAppLog VALUES (N'{"Type":"Warning", "MessageId": 50, "Severity": "Low", "Information":"Running low on queso." }');
GO 4000
INSERT INTO dbo.BurritoAppLog VALUES (N'{"Type":"Error", "MessageId": 10, "User":"Bert", "ErrorMessage":"Lettuce not available." }'); 
INSERT INTO dbo.BurritoAppLog VALUES (N'{"Type":"Error", "MessageId": 20, "User":"Jim", "ErrorMessage":"Cannot wrap burrito with quadruple meat." }'); 
GO 100

[Image: sample rows from dbo.BurritoAppLog]

Now imagine wanting to generate a report of only the rows that are errors.

Obviously, you'd want to index this data for faster querying performance.  Adding a non-clustered index on a non-persisted computed column of our JSON "Type" property will accomplish that:

ALTER TABLE dbo.BurritoAppLog 
ADD ErrorType AS JSON_VALUE(ErrorDetails, '$.Type');

ALTER TABLE dbo.BurritoAppLog 
ADD MessageId AS JSON_VALUE(ErrorDetails, '$.MessageId');

CREATE INDEX IX_ErrorType ON dbo.BurritoAppLog (ErrorType) INCLUDE (MessageId);

SELECT MessageId FROM dbo.BurritoAppLog WHERE ErrorType = 'Error'

And that works great.  Except that error entries in our table make up only about 2.5% of our total rows.  Assuming we'll never need to query WHERE ErrorType = 'Warning', this index is using a lot of unnecessary space.

So what if we create a filtered index instead?

Filtered JSON Indexes...

A filtered index should benefit us significantly here: it should save us space (since it won't include all of those warning rows) and it should make our INSERT queries into this table faster since the index won't need to be maintained for our non-"Error" rows.

So let's create a filtered index:

CREATE INDEX FX_ErrorType ON dbo.BurritoAppLog (ErrorType) INCLUDE (MessageId) WHERE ErrorType = 'Error'

Oh.

[Image: error message returned by the CREATE INDEX statement]

So I guess we can't create a filtered index where the filter is on a computed column.  Maybe SQL Server won't mind if we persist the computed column?

DROP INDEX IX_ErrorType ON dbo.BurritoAppLog

ALTER TABLE dbo.BurritoAppLog
DROP COLUMN ErrorType;

ALTER TABLE dbo.BurritoAppLog 
ADD ErrorType AS JSON_VALUE(ErrorDetails, '$.Type') PERSISTED;

CREATE INDEX FX_ErrorType ON dbo.BurritoAppLog (ErrorType) INCLUDE (MessageId) WHERE ErrorType = 'Error'

NOOOOOOPPPPEEEE.  Same error message.

The issue is that SQL Server does not like computed columns, persisted or not, in a filtered index's WHERE clause.  It's one of the many limitations of filtered indexes (Aaron Bertrand has a great post outlining many of the shortcomings).

Computed Column Filtered Index Workaround

What is a performance minded, space-cautious, JSON-loving developer supposed to do?

One workaround to get our filtered index would be to parse our ErrorType property into its own table column on insert:

ALTER TABLE dbo.BurritoAppLog 
ADD PermanentErrorType varchar(10);

UPDATE dbo.BurritoAppLog SET PermanentErrorType = JSON_VALUE(ErrorDetails, '$.Type');

[Image: dbo.BurritoAppLog rows with the new PermanentErrorType column populated]

With our PermanentErrorType column in place, we have no problem generating our filtered index:

CREATE INDEX FX_PermanentErrorType ON dbo.BurritoAppLog (PermanentErrorType) INCLUDE (MessageId) WHERE PermanentErrorType = 'Error'

If we compare the sizes of our nonclustered index to our filtered index, you'll immediately see that the filtered index is significantly smaller:

[Image: index size comparison between the nonclustered index and the filtered index]
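One way to produce such a size comparison yourself is to query sys.dm_db_partition_stats. This is a sketch, not necessarily how the original comparison was generated; it lists every index that currently exists on the table, so the unfiltered IX_ErrorType index only appears if it has been recreated after being dropped earlier.

```sql
-- Space used and row count per index on dbo.BurritoAppLog
-- (pages are 8 KB each)
SELECT
    i.name AS IndexName,
    SUM(ps.used_page_count) * 8 AS UsedSpaceKB,
    SUM(ps.row_count) AS IndexRows
FROM
    sys.dm_db_partition_stats AS ps
    INNER JOIN sys.indexes AS i
        ON i.object_id = ps.object_id
        AND i.index_id = ps.index_id
WHERE
    ps.object_id = OBJECT_ID('dbo.BurritoAppLog')
GROUP BY
    i.name;
```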

However, our table size is now slightly larger because of the added table column.

Conclusion

So what do you do if you run into this situation?  Well, if the ratio of undesired records to desired records is large like in the example above, you might want to make a permanent column to include in your filtered index - the size/performance benefit is certainly there.  This does mean that your table size will be larger (additional column) but performance will be faster if your queries are able to use the smaller filtered index.
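If you go the permanent column route, new rows need that column populated at insert time, since unlike a computed column it does not maintain itself. A minimal sketch, assuming the dbo.BurritoAppLog table and the FX_PermanentErrorType index from above (the sample JSON payload is made up for illustration):

```sql
-- Hypothetical new log entry
DECLARE @ErrorDetails nvarchar(1000) =
    N'{"Type":"Error", "MessageId": 30, "User":"Bert", "ErrorMessage":"Out of tortillas." }';

-- Parse the Type property once, on insert, into the permanent column
INSERT INTO dbo.BurritoAppLog (ErrorDetails, PermanentErrorType)
VALUES (@ErrorDetails, JSON_VALUE(@ErrorDetails, '$.Type'));

-- Queries must filter on the permanent column to match the filtered index
SELECT MessageId
FROM dbo.BurritoAppLog
WHERE PermanentErrorType = 'Error';
```

The trade-off is that every code path writing to the table has to remember to set PermanentErrorType, otherwise rows will silently fall outside the filtered index.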

How NOLOCK Will Block Your Queries


Note: the problem described below applies to all SELECT queries, not just those adorned with NOLOCK hints.  The fact that it applies to NOLOCK queries was a huge surprise to me though, hence the title.

Lots of people don't like NOLOCK (i.e. the read uncommitted isolation level) because it can return inaccurate data.  I've seen plenty of arguments cautioning developers against uncommitted reads because of how they can return dirty data, phantom reads, and non-repeatable reads.

I've known about all of those problems, but there's one problem that I had never heard of until recently: NOLOCK can block other queries from running.


Let's step back and understand why I've so often used NOLOCK in the past.  A fairly typical instance of when I use NOLOCK is when I want to let a query run overnight to return some large set of data.  I'm okay with some inconsistencies in the data (from dirty reads, etc...).  My primary concern is that I don't want the long running query to get in the way of other processes.

I always thought NOLOCK was a perfect solution for this scenario because it never locks the data that it reads - the results might not be perfect, but at least the query won't negatively impact any other process on the server.

This is where my understanding of NOLOCK was wrong: while NOLOCK won't lock row level data, it will take out a schema stability lock.

A schema stability (Sch-S) lock prevents the structure of a table from changing while the query is executing.  All SELECT statements, including those in the read uncommitted/NOLOCK isolation level, take out a Sch-S lock.  This makes sense because we wouldn't want to start reading data from a table and then have the column structure change half way through the data retrieval.

However, this also means there might be some operations that get blocked by a Sch-S lock.  For example, any command requesting a schema modification (Sch-M) lock gets blocked in this scenario.

What commands request Sch-M locks?

Things like an index REBUILD or running sp_recompile against a table.  These are the types of commands running in my nightly maintenance jobs that I was trying to avoid hurting by using NOLOCK in the first place!

To reiterate, I used to think that using the NOLOCK hint was a great way to prevent blocking during long running queries.  However, it turns out that my NOLOCK queries were actually blocking my nightly index jobs (all SELECT queries block in this example, but I find the NOLOCK to be particularly misleading), which then caused other SELECT statements to get blocked too!

Let's take a look at this in action.  Here I have a query that creates a database, table, and then runs a long running query with NOLOCK:

DROP DATABASE IF EXISTS [Sandbox]
GO
CREATE DATABASE [Sandbox]
GO
USE [Sandbox]
GO

DROP TABLE IF EXISTS dbo.Test
CREATE TABLE dbo.Test
(
    c0 int IDENTITY PRIMARY KEY,
    c1 varchar(700) default REPLICATE('a',700)
)

CREATE NONCLUSTERED INDEX IX_Id ON dbo.Test (c1);
GO

INSERT INTO dbo.Test DEFAULT VALUES;
GO 1000


-- Read a billion records
SELECT * 
FROM 
    dbo.Test t1 (NOLOCK) 
    CROSS JOIN dbo.Test t2 (NOLOCK) 
    CROSS JOIN dbo.Test t3 (NOLOCK) 

Now, while that billion row read is occurring, we can verify that the query took out a Sch-S lock by looking at sys.dm_tran_locks:

SELECT *
FROM sys.dm_tran_locks
WHERE resource_type = 'OBJECT'

[Image: sys.dm_tran_locks output showing the Sch-S lock granted]

While that's running, if we try to rebuild an index, that rebuild is blocked (shown as a WAIT):

USE [Sandbox]
GO

ALTER INDEX IX_Id ON dbo.Test REBUILD

[Image: the rebuild is blocked]

Our index rebuild query will remain blocked until our billion row NOLOCK SELECT query finishes running (or is killed).  This means the query that I intended to be completely unobtrusive is now blocking my nightly index maintenance job from running.

Even worse, any other queries that try to run after the REBUILD query (or any other commands that request a Sch-M lock) are going to get blocked as well!  If I try to run a simple COUNT(*) query:

USE [Sandbox]
GO

SELECT COUNT(*) FROM dbo.Test

[Image: chained blocks]

Blocked!  This means that not only is my initial NOLOCK query causing my index REBUILD maintenance jobs to wait, the Sch-M lock placed by the REBUILD maintenance job is causing any subsequent queries on that table to get blocked and be forced to wait as well.  I just derailed the timeliness of my maintenance job and subsequent queries with a blocking NOLOCK statement!
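The blocking chain can also be inspected from a separate session with sys.dm_exec_requests. This is a sketch for observing the scenario above; the session ids will differ on your server.

```sql
-- Show every request that is currently waiting on another session.
-- In the scenario above, the REBUILD is typically waiting with LCK_M_SCH_M
-- (blocked by the NOLOCK query), and the COUNT(*) query with LCK_M_SCH_S
-- (blocked by the waiting REBUILD).
SELECT
    r.session_id,
    r.blocking_session_id,
    r.command,
    r.wait_type,
    r.wait_time AS WaitTimeMs
FROM
    sys.dm_exec_requests AS r
WHERE
    r.blocking_session_id <> 0;
```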

Solutions

Unfortunately this is a tough problem and there's no one-size-fits-all remedy.

Solution #1: Don't run long running queries

I could avoid running long queries at night when they might run into my index maintenance jobs.  This would prevent those index maintenance jobs and subsequent queries from getting delayed, but it means my initial billion row select query would then have to run earlier, negatively impacting server performance during a potentially busier time of day.

Solution #2: Use WAIT_AT_LOW_PRIORITY

Starting in SQL Server 2014, I could do an online index rebuild with the WAIT_AT_LOW_PRIORITY option set:

ALTER INDEX IX_Id ON dbo.Test REBUILD 
WITH (ONLINE = ON (WAIT_AT_LOW_PRIORITY (MAX_DURATION = 1 MINUTES , ABORT_AFTER_WAIT = BLOCKERS)))

This query basically gives any blocking SELECT queries currently running 1 minute to finish executing or else this query will kill them and then execute the index rebuild.  Alternatively we could have also set ABORT_AFTER_WAIT = SELF and the rebuild query would kill itself, allowing the NOLOCK billion row SELECT to finish running and not preventing any other queries from running.

This is not a great solution because it means either the long running query gets killed or the index REBUILD gets killed.
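For completeness, the ABORT_AFTER_WAIT = SELF variant mentioned above looks like this. It assumes the same dbo.Test table and IX_Id index; note that ONLINE = ON requires an edition of SQL Server that supports online index operations.

```sql
-- Wait at low priority for up to 1 minute; if the Sch-M lock still cannot
-- be acquired, the rebuild gives up instead of killing the blocking queries
ALTER INDEX IX_Id ON dbo.Test REBUILD
WITH (ONLINE = ON (WAIT_AT_LOW_PRIORITY (MAX_DURATION = 1 MINUTES, ABORT_AFTER_WAIT = SELF)));
```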

Solution #3: REBUILD if no Sch-S, REORGANIZE otherwise

A programmatic solution can be written that tries to REBUILD the index, but falls back to REORGANIZE if it knows it will have to wait for a Sch-M lock.

I've created the boilerplate below as a starting point, but the sky is the limit with what you can do with it (e.g. create a WHILE loop to check for the lock every x seconds, create a timeout for when the script should stop trying to REBUILD and just REORGANIZE instead, etc...)

-- Idea for how to rebuild/reorganize based on a schema stability lock.
-- More of a starting point than fully functional code.
-- Not fully tested, you have been warned!
DECLARE 
    @TableName varchar(128) = 'Test',
    @HasSchemaStabilityLock bit = 0

SELECT TOP 1 @HasSchemaStabilityLock = 
    CASE WHEN l.request_mode IS NOT NULL THEN 1 ELSE 0 END
    FROM
        sys.dm_tran_locks as l
    WHERE
        l.resource_type = 'OBJECT'
        AND l.request_mode = 'Sch-S'
        AND l.request_type = 'LOCK'
        AND l.request_status = 'GRANT'
        -- sys.dm_tran_locks is server-wide; only consider locks in the current database
        AND l.resource_database_id = DB_ID()
        AND OBJECT_NAME(l.resource_associated_entity_id) = @TableName

IF @HasSchemaStabilityLock = 0
BEGIN
    -- Perform a rebuild
    ALTER INDEX IX_Id ON dbo.Test REBUILD
    PRINT 'Index rebuilt'
END
ELSE
BEGIN
    -- Perform a REORG
    ALTER INDEX IX_Id ON dbo.Test REORGANIZE
    PRINT 'Index reorganized'
END

This solution is my favorite because:

  1. Ad hoc long running queries don't get killed (all of that time spent processing doesn't go to waste)
  2. Other select queries are not blocked by the Sch-M lock attempt by REBUILD
  3. Index maintenance still occurs, even if it ends up being a REORGANIZE instead of a REBUILD
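The WHILE loop and timeout idea mentioned earlier could look something like the sketch below. The 5 minute deadline and 10 second polling interval are arbitrary assumptions, and like the boilerplate above this is untested starting-point code. There is also an unavoidable race: a new query can take a Sch-S lock between the check and the REBUILD, in which case the rebuild will still wait.

```sql
DECLARE
    @Deadline datetime2 = DATEADD(MINUTE, 5, SYSDATETIME()), -- assumed timeout
    @Rebuilt bit = 0;

WHILE @Rebuilt = 0 AND SYSDATETIME() < @Deadline
BEGIN
    IF NOT EXISTS
    (
        SELECT 1
        FROM sys.dm_tran_locks AS l
        WHERE
            l.resource_type = 'OBJECT'
            AND l.resource_database_id = DB_ID()
            AND l.resource_associated_entity_id = OBJECT_ID('dbo.Test')
            AND l.request_mode = 'Sch-S'
            AND l.request_status = 'GRANT'
    )
    BEGIN
        -- No granted Sch-S lock on the table right now: try the rebuild
        ALTER INDEX IX_Id ON dbo.Test REBUILD;
        SET @Rebuilt = 1;
        PRINT 'Index rebuilt';
    END
    ELSE
    BEGIN
        -- Something is still reading the table: check again in 10 seconds
        WAITFOR DELAY '00:00:10';
    END
END

IF @Rebuilt = 0
BEGIN
    -- Timed out waiting for the table to be free: fall back to a REORGANIZE
    ALTER INDEX IX_Id ON dbo.Test REORGANIZE;
    PRINT 'Index reorganized';
END
```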