Does The Order Of Index Columns Matter?

Watch this week's video on YouTube

When beginning to learn SQL, at some point you learn that indexes can be created to help improve the performance of queries.

Creating your first few indexes can be intimidating though, particularly when trying to understand what order to put your key columns in.

Today we'll look at how row store indexes work to understand whether index column order matters.

Heap: Stack of Pages

Imagine a stack of loose leaf pages.  This collection of pages is our table.

Each page has information about a bird on it - the bird's name, picture, description, habitat, migration patterns, visual markings, etc...  You can think of each of these pages as a row of data.

The problem with this stack of pages is that there is no enforced order: it's a heap.  Without any enforced order, searching for individual birds is time consuming; in order to find a particular bird, for example a blue jay, you would have to go through the stack of pages one at a time until you find the blue jay page.

The scanning doesn't stop there though.  Even though we found a blue jay page, there's no way for us to guarantee that there are no other blue jay pages in the stack.  This means we have to continue flipping through every page until we finish searching through the whole heap of pages.

Having to do this process every single time we need to retrieve data from our bird table is painful.  To make our job easier, we can define and enforce an order on the data by defining a clustered index.
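In SQL Server terms, a heap is simply a table created without a clustered index.  As a rough sketch (the dbo.Birds table and its columns are made up for illustration, not something from the original post):

-- With no clustered index, this table is stored as a heap (hypothetical example)
CREATE TABLE dbo.Birds
(
    BirdName varchar(100),
    Color varchar(20),
    Description varchar(1000),
    Habitat varchar(100)
);

-- Finding a blue jay means scanning every row in the heap
SELECT BirdName, Description
FROM dbo.Birds
WHERE BirdName = 'Blue Jay';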

Clustered Index: Bound Pages

To make searching through our pages easier, we sort all of the pages by bird name and glue on a binding.  This book binding now keeps all of our pages in alphabetical order by bird name.

The SQL version of a book binding is a clustered index.  The clustered index is not an additional object to our data - it is that same exact table data, but now with an enforced sort order.

Having all of our data in sorted order by bird name makes certain queries really fast and efficient - instead of having to scan through every page to find the blue jay entry, we can now quickly flip to the "B" section, then the "BL" section, then the "BLU" section, etc... until we find BLUE JAY.  This is done quickly and efficiently because we know where to find blue jays in the book because the bird names are stored in alphabetical order.

Even better, after we find the blue jay page, we flip to the next page and see a page for cardinal.  Since we know all of the entries are stored alphabetically, we know that once we get to the next bird we have found all of our blue jay pages and don't need to continue flipping through the rest of the book.
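In T-SQL, gluing on that book binding is a one-liner against the hypothetical dbo.Birds table sketched above:

-- The clustered index IS the table data, now physically sorted by bird name
CREATE CLUSTERED INDEX CL_Birds_BirdName ON dbo.Birds (BirdName);

-- This same lookup can now seek straight to the blue jay rows
-- instead of scanning the whole table
SELECT BirdName, Description
FROM dbo.Birds
WHERE BirdName = 'Blue Jay';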

While the clustered index allows us to find birds by name quickly, it's not perfect; since the clustered index is the table, it contains every property (column) of each bird, which is a lot of data!

Having to constantly reference this large clustered index for every query can be cumbersome.  For most of our queries, we could get by with a condensed version of our bird book that only contains the most essential information.

Nonclustered Index: Cut and Copy

Let's say we want a lighter-weight version of our book that contains the most relevant information (bird name, color, description).

We can photocopy the entire book and then cut out and keep only the pieces of information that are relevant while discarding the rest.  If we paste all of those relevant pieces of information into a new book, still sorted by bird name, we now have a second copy of our data.  This is our nonclustered index.

This nonclustered index contains all of the same birds as our clustered index, just with fewer columns.  This means we can fit multiple birds onto a page, requiring us to flip through fewer pages to find the bird we need.

If we ever need to look up additional information about a particular bird that's not in our nonclustered index, we can always go back to our giant clustered index and retrieve any information we need.
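Sticking with the hypothetical dbo.Birds table, the "photocopy" might look something like this:

-- A second, smaller copy of the data: still sorted by bird name,
-- but carrying only the columns we care about most
CREATE NONCLUSTERED INDEX IX_Birds_BirdName
ON dbo.Birds (BirdName)
INCLUDE (Color, Description);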

With the lighter-weight nonclustered index in hand, we go out to the woods to start identifying some birds.

Upon spotting an unfamiliar bird in our binoculars, we can flip open the nonclustered index to identify the bird.

The only problem is, since we don't know this bird's name, our nonclustered index by bird name is of no help.  We end up having to flip through each page one at a time trying to identify the bird instead of flipping directly to the correct page.

For these types of inquiries, where we want to identify a bird but don't know the bird's name, a different index would be beneficial...

Nonclustered Index 2: Color Bugaloo

Instead of having a nonclustered index sorted by bird name, what we really need is a way to filter down to the list of potential birds quickly.

One way we can do this is to create another copy of our book, still containing just bird names, colors, and descriptions, but this time sort the pages by color first, then bird name.

When trying to identify an unknown bird, we can first limit the number of pages to search through by filtering on the bird's color.  In our case, color is a highly selective trait, since it filters down our list of potential birds to only a small subset of the whole book.  In our blue jay example, this means we would find the small subset of pages that contain blue birds, and then just check each one of those pages individually until we find the blue jay.
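Sketched against the same hypothetical table, the color-first copy just swaps the order of the key columns:

-- Key columns ordered by color first, then bird name
CREATE NONCLUSTERED INDEX IX_Birds_Color_BirdName
ON dbo.Birds (Color, BirdName)
INCLUDE (Description);

-- Filtering on color can now seek instead of scanning every bird
SELECT BirdName, Description
FROM dbo.Birds
WHERE Color = 'Blue';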

Order Matters

Indexes aren't magic; their high-performance capabilities come from the fact that they store data in a predetermined order.  If your query can utilize data stored in that order, great!

However, if your query wants to filter down on color first, but your index is sorted on bird name, then you'll be out of luck.  When it comes to determining which column should be the first key in your index, you should choose whichever one will be most selective (the one that filters you down to the smallest set of results) for your particular query.

There's a lot more optimizing that can be done with indexes, but correctly choosing the order of columns for your index key is an essential first step.

Want to learn even more about index column order? Be sure to check out this post on cardinality.

In-Memory OLTP: A Case Study

Watch this week's video on YouTube

When In-Memory OLTP was first released in SQL Server 2014, I was excited to start using it.  All I could think was "my queries are going to run so FAST!"

Well, I never got around to implementing In-Memory OLTP.  Besides having an incompatible version of SQL Server at the time, the in-memory features had too many limitations for my specific use-cases.

Fast forward a few years, and I've done nothing with In-Memory OLTP.  Nothing, that is, until I saw Erin Stellato present at our Northern Ohio SQL Server User Group a few weeks ago - her presentation inspired me to take a look at In-Memory OLTP again to see if I could use it.

Use case: Improving ETL staging loads

After being refreshed on the ins and outs of in-memory SQL Server, I wanted to see if I could apply some of the techniques to one of my ETLs.

The ETL consists of two major steps:

  1. Shred documents into row/column data and then dump that data into a staging table.
  2. Delete some of the documents from the staging table.

In the real world, there's a step 1.5 that does some processing of the data, but it's not relevant to these in-memory OLTP demos.

So step one was to create my staging tables.  The memory-optimized table is called "NewStage1" and the traditional disk-based table is called "OldStage1":

DROP DATABASE IF EXISTS InMemoryTest;
GO
CREATE DATABASE InMemoryTest;
GO
USE InMemoryTest;
GO

ALTER DATABASE InMemoryTest ADD FILEGROUP imoltp_mod CONTAINS MEMORY_OPTIMIZED_DATA;
GO 
ALTER DATABASE InMemoryTest ADD FILE (name='imoltp_mod1', filename='C:\Program Files\Microsoft SQL Server\MSSQL14.MSSQLSERVER\MSSQL\DATA\imoltp_mod1') TO FILEGROUP imoltp_mod;
GO  
ALTER DATABASE InMemoryTest SET MEMORY_OPTIMIZED_ELEVATE_TO_SNAPSHOT=ON;
GO 
ALTER DATABASE InMemoryTest SET RECOVERY SIMPLE
GO

--Numbers Table
-- This needs to be in-memory to be called from a natively compiled procedure
DROP TABLE IF EXISTS InMemoryTest.dbo.Numbers;
GO
CREATE TABLE InMemoryTest.dbo.Numbers
(
    n int
    INDEX ix_n NONCLUSTERED HASH (n) WITH (BUCKET_COUNT=400000)
)
WITH (MEMORY_OPTIMIZED=ON, DURABILITY=SCHEMA_ONLY);  
GO 

INSERT INTO dbo.Numbers (n)
SELECT TOP (4000000) n = CONVERT(INT, ROW_NUMBER() OVER (ORDER BY s1.[object_id]))
FROM sys.all_objects AS s1 CROSS JOIN sys.all_objects AS s2
OPTION (MAXDOP 1);

-- Set up on-disk tables
DROP TABLE IF EXISTS InMemoryTest.dbo.OldStage1;
GO
CREATE TABLE InMemoryTest.dbo.OldStage1
(
    Id int,
    Col1 uniqueidentifier,
    Col2 uniqueidentifier,
    Col3 varchar(1000),
    Col4 varchar(50),
    Col5 varchar(50),
    Col6 varchar(50),
    Col7 int,
    Col8 int,
    Col9 varchar(50),
    Col10 varchar(900),
    Col11 varchar(900),
    Col12 int,
    Col13 int,
    Col14 bit
);
GO
CREATE CLUSTERED INDEX CL_Id ON InMemoryTest.dbo.OldStage1 (Id);
GO


--  Set up in-memory tables and natively compiled procedures
DROP TABLE IF EXISTS InMemoryTest.dbo.NewStage1;
GO
CREATE TABLE InMemoryTest.dbo.NewStage1
(
    Id int,
    Col1 uniqueidentifier,
    Col2 uniqueidentifier,
    Col3 varchar(1000),
    Col4 varchar(50),
    Col5 varchar(50),
    Col6 varchar(50),
    Col7 int,
    Col8 int,
    Col9 varchar(50),
    Col10 varchar(900),
    Col11 varchar(900),
    Col12 int,
    Col13 int,
    Col14 bit
    INDEX ix_id NONCLUSTERED HASH (id) WITH (BUCKET_COUNT=10)
)
WITH (MEMORY_OPTIMIZED=ON, DURABILITY=SCHEMA_ONLY);  
GO

A few things to keep in mind:

  • The tables have the same columns and datatypes, with the only difference being that the NewStage1 table is memory optimized.
  • My database is using simple recovery so I am able to perform minimal logging/bulk operations on my disk-based table.
  • Additionally, I'm using the SCHEMA_ONLY durability setting.  This gives me outstanding performance because there is no writing to the transaction log!  However, this means if I lose my in-memory data for any reason (crash, restart, corruption, etc...) I am completely out of luck.  This is fine for my staging data scenario since I can easily recreate the data if necessary (and if I couldn't, I'd use the durable setting sketched below).
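For reference, if the staging data did need to survive a restart, the only change would be the durability option.  Here's a minimal sketch (a hypothetical table, not one used in the tests below):

-- SCHEMA_AND_DATA persists rows across restarts by writing to the log
-- and checkpoint files, giving back some of the SCHEMA_ONLY speed gains.
-- Durable memory-optimized tables also require a primary key.
CREATE TABLE InMemoryTest.dbo.DurableExample
(
    Id int NOT NULL PRIMARY KEY NONCLUSTERED HASH WITH (BUCKET_COUNT=1000),
    Col1 varchar(50)
)
WITH (MEMORY_OPTIMIZED=ON, DURABILITY=SCHEMA_AND_DATA);
GO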

Inserting and deleting data

Next I'm going to create procedures for inserting and deleting my data into both my new and old staging tables:

DROP PROCEDURE IF EXISTS dbo.Insert_OldStage1;
GO
CREATE PROCEDURE dbo.Insert_OldStage1
    @Id int,
    @Rows int
AS
BEGIN
    INSERT INTO InMemoryTest.dbo.OldStage1 (Id, Col1, Col2, Col3, Col4, Col5, Col6, Col7, Col8, Col9, Col10, Col11, Col12, Col13, Col14)
    SELECT Id, Col1, Col2, Col3, Col4, Col5, Col6, Col7, Col8, Col9, Col10, Col11, Col12, Col13, Col14
    FROM
    (
    SELECT
    @Id as Id,
    '92D14DA3-2C55-4E50-A965-7D3C941417B3' as Col1,
    '92D14DA3-2C55-4E50-A965-7D3C941417B3' as Col2,
    'aaaaaaaaaaaaaaaaaaaaaaaaaaaaaa' as Col3,
    'aaaaaaaaaaaaaaaaaaaa' as Col4,
    'aaaaaaaaaaaaaaaaaaaa' as Col5,
    'aaaaaaaaaaaaaaaaaaaa' as Col6,
    0 as Col7,
    0 as Col8,
    'aaaaaaaaaaaaaaaaaaaa' as Col9,
    'aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa' as Col10,
    'aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa' as Col11,
    1 as Col12,
    1 as Col13,
    1 as Col14
    )a
    CROSS APPLY
    (
    SELECT TOP (@Rows) n FROM dbo.Numbers
    )b
END

DROP PROCEDURE IF EXISTS dbo.Delete_OldStage1;
GO
CREATE PROCEDURE dbo.Delete_OldStage1
    @Id int
AS
BEGIN
    -- Use loop to delete to prevent filling transaction log
    DECLARE 
        @Count int = 0,
        @for_delete int,
        @chunk_size int = 1000000

    SELECT @for_delete = COUNT(Id) FROM InMemoryTest.dbo.OldStage1 
                            WHERE Id = @Id;

    WHILE (@Count < @for_delete)
    BEGIN
        SELECT @Count = @Count + @chunk_size;

        BEGIN TRAN
            DELETE TOP(@chunk_size) FROM InMemoryTest.dbo.OldStage1 WHERE Id = @Id
        COMMIT TRAN
    END
END;
GO



DROP PROCEDURE IF EXISTS dbo.Insert_NewStage1;
GO
CREATE PROCEDURE dbo.Insert_NewStage1
    @Id int,
    @Rows int
WITH NATIVE_COMPILATION, SCHEMABINDING  
AS   
BEGIN ATOMIC   
WITH (TRANSACTION ISOLATION LEVEL = SNAPSHOT, LANGUAGE = N'us_english')  

    INSERT INTO dbo.NewStage1 (Id, Col1, Col2, Col3, Col4, Col5, Col6, Col7, Col8, Col9, Col10, Col11, Col12, Col13, Col14)
    SELECT Id, Col1, Col2, Col3, Col4, Col5, Col6, Col7, Col8, Col9, Col10, Col11, Col12, Col13, Col14 
    FROM
    (
    SELECT
    @Id as Id,
    '92D14DA3-2C55-4E50-A965-7D3C941417B3' as Col1,
    '92D14DA3-2C55-4E50-A965-7D3C941417B3' as Col2,
    'aaaaaaaaaaaaaaaaaaaaaaaaaaaaaa' as Col3,
    'aaaaaaaaaaaaaaaaaaaa' as Col4,
    'aaaaaaaaaaaaaaaaaaaa' as Col5,
    'aaaaaaaaaaaaaaaaaaaa' as Col6,
    0 as Col7,
    0 as Col8,
    'aaaaaaaaaaaaaaaaaaaa' as Col9,
    'aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa' as Col10,
    'aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa' as Col11,
    1 as Col12,
    1 as Col13,
    1 as Col14
    )a
    CROSS APPLY
    (
    SELECT TOP (@Rows) n FROM dbo.Numbers
    )b


END;   
GO  

DROP PROCEDURE IF EXISTS dbo.Delete_NewStage1;
GO
CREATE PROCEDURE dbo.Delete_NewStage1
    @Id int 
WITH NATIVE_COMPILATION, SCHEMABINDING  
AS   
BEGIN ATOMIC   
WITH (TRANSACTION ISOLATION LEVEL = SNAPSHOT, LANGUAGE = N'us_english')  

    DELETE FROM dbo.NewStage1 WHERE Id = @Id;

END   
GO  

A few more things to note:

  • My new procedures are natively compiled: SQL Server compiles them to machine code up front so at run time they can just execute without the overhead of interpreting T-SQL.  The procedures that target my old disk-based tables run as traditional interpreted T-SQL.
  • In the old delete procedure, I am deleting data in chunks so my transaction log doesn't get full.  In the new version of the procedure, I don't have to worry about this because, as I mentioned earlier, my memory-optimized table doesn't have to use the transaction log (a quick way to verify that is sketched below).
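If you want to see the transaction log difference for yourself, one rough way is to compare log usage before and after running each set of procedures (this is just a sanity-check sketch using a standard DMV):

-- Run in the InMemoryTest database before and after the inserts below;
-- the SCHEMA_ONLY inserts should barely move these numbers
SELECT
    used_log_space_in_bytes,
    used_log_space_in_percent
FROM sys.dm_db_log_space_usage;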

Let's simulate a load

It's time to see if all of this fancy in-memory stuff is actually worth all of the restrictions.

In my load, I'm going to mimic loading three documents with around 3 million rows each.  Then, I'm going to delete the second document from each table:

-- Old on-disk method
-- Insert data for processing
EXEC InMemoryTest.dbo.Insert_OldStage1 @Id=1, @Rows=2500000;
GO
EXEC InMemoryTest.dbo.Insert_OldStage1 @Id=2, @Rows=3400000;
GO 
EXEC InMemoryTest.dbo.Insert_OldStage1 @Id=3, @Rows=2800000;
GO 

-- Delete set of records after processed
EXEC InMemoryTest.dbo.Delete_OldStage1 @Id = 2
GO

-- New in-memory method
-- Insert data for processing
EXEC InMemoryTest.dbo.Insert_NewStage1 @Id=1, @Rows=2500000;
GO
EXEC InMemoryTest.dbo.Insert_NewStage1 @Id=2, @Rows=3400000;
GO 
EXEC InMemoryTest.dbo.Insert_NewStage1 @Id=3, @Rows=2800000;
GO 

-- Delete set of records after processed
EXEC InMemoryTest.dbo.Delete_NewStage1 @Id = 2
GO

The in-memory version should have a significant advantage because:

  1. The natively compiled procedure is precompiled (shouldn't be a huge deal here since we are doing everything in a single INSERT INTO...SELECT).
  2. The in-memory table inserts/deletes don't have to write to the transaction log (this should be huge!)

Results

                         Disk-based      In-Memory
  INSERT 3 documents     65 sec          6 sec
  DELETE 1 document      46 sec          0 sec
  Total time             111 sec         6 sec
  Difference             -95% slower     1750% faster

The results speak for themselves.  In this particular example, in-memory blows the disk-based solution out of the water.

Obviously there are downsides to in-memory (like consuming a lot of memory) but if you are going for pure speed, there's nothing faster.
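Speaking of memory, if you go this route it's worth keeping an eye on how much your memory-optimized tables are consuming.  A quick sketch using a built-in DMV:

-- Memory consumed by each memory-optimized table in the current database
SELECT
    OBJECT_NAME(object_id) AS TableName,
    memory_allocated_for_table_kb,
    memory_used_by_table_kb,
    memory_allocated_for_indexes_kb,
    memory_used_by_indexes_kb
FROM sys.dm_db_xtp_table_memory_stats;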

Warning! I am not you.

And you are not me.

While in-memory works great for my ETL scenario, there are many requirements and limitations.  It's not going to work in every scenario.  Be sure you understand the in-memory durability options to prevent any potential data loss and try it out for yourself!  You might be surprised by the performance gains you'll see.

How To Create Multi-Object JSON Arrays in SQL Server


Recently I was discussing with Peter Saverman whether it would be possible to take some database tables - a Home table with related Car and Toy tables (defined below) - and output them so that the Cars and Toys data would map to a multi-object JSON array.

Watch this week's video on YouTube

Why would you ever need this?

If you are coming from a pure SQL background, at this point you might be wondering why you would ever want to create an object array that contains mixed object types.  Well, from an application development standpoint this type of scenario can be fairly common.

In a database, it makes sense to divide Home and Car and Toy into separate tables.  Sure, we could probably combine the latter two with some normalization, but imagine we will have many different types of entities that will be more difficult to normalize - sometimes it just makes sense to store this information separately.

Not to mention that performing analytical-type queries across many rows of data will typically be much faster when the data is stored in this three-table format.

The three-table layout, while organized from a database standpoint, might not be the best way to organize the data in an object-oriented application.  Usually in a transaction-oriented application, we want our data to all be together as one entity.  This is why NoSQL is all the rage among app developers.  Having all of your related data together makes it easy to manage, move, update, etc...  This is where the array of multi-type objects comes in - it'd be pretty easy to use this structure as an array of dynamic or inherited objects inside of our application.

Why not just combine these Car and Toy entities in app?

Reading the data into the app through multiple queries and mapping that data to objects is usually the first way you would try doing something like this.

However, depending on many different variables, like the size of the data, the number of requests, the speed of the network, the hardware the app is running on, etc... mapping your data from multiple queries might not be the most efficient way to go.

On the other hand, if you have a big beefy SQL Server available that can do those transformations for you, and you are willing to pay for the processing time on an $8k/core enterprise-licensed machine, then performing all of these transformations on your SQL Server is the way to go.

The solution

UPDATE: Jovan Popovic suggested an even cleaner solution using CONCAT_WS.  See the update at the bottom of this post.

First, here's the data if you want to play along at home:

DROP TABLE IF EXISTS ##Home;
GO
DROP TABLE IF EXISTS ##Car;
GO
DROP TABLE IF EXISTS ##Toy;
GO

CREATE TABLE ##Home
(
    HomeId int IDENTITY PRIMARY KEY,
    City nvarchar(20),
    State nchar(2)
);
GO

CREATE TABLE ##Car
(
    CarId int IDENTITY PRIMARY KEY,
    HomeId int,
    Year smallint,
    Make nvarchar(20),
    Model nvarchar(20),
    FOREIGN KEY (HomeId) REFERENCES ##Home(HomeId)
);
GO

CREATE TABLE ##Toy
(
    ToyId int IDENTITY PRIMARY KEY,
    HomeId int,
    Category nvarchar(20),
    RiderCapacity int,
    FOREIGN KEY (HomeId) REFERENCES ##Home(HomeId)
);
GO

INSERT INTO ##Home (City,State) VALUES ('Cleveland','OH')
INSERT INTO ##Home (City,State) VALUES ('Malibu','CA')

INSERT INTO ##Car (HomeId,Year, Make, Model) VALUES ('1','2017', 'Volkswagen', 'Golf')
INSERT INTO ##Car (HomeId,Year, Make, Model) VALUES ('2','2014', 'Porsche', '911')

INSERT INTO ##Toy (HomeId,Category, RiderCapacity) VALUES ('1','Bicycle', 1)
INSERT INTO ##Toy (HomeId,Category, RiderCapacity) VALUES ('2','Kayak', 2)

SELECT * FROM ##Home
SELECT * FROM ##Car
SELECT * FROM ##Toy

And here's the query that does all of the transforming:

SELECT 
    h.HomeId,
    h.City,
    h.State,
    GarageItems = JSON_QUERY('[' + STRING_AGG( GarageItems.DynamicData,',') + ']','$')
FROM
    ##Home h
    INNER JOIN
    (
        SELECT
            HomeId,
            JSON_QUERY(Cars,'$') AS DynamicData
        FROM
            ##Home h
            CROSS APPLY
            (
            SELECT 
                (
                SELECT  
                    *
                FROM
                    ##Car c
                WHERE
                    c.HomeId = h.HomeId
                    FOR JSON PATH, WITHOUT_ARRAY_WRAPPER
                ) AS Cars
            ) d 
        UNION ALL
        SELECT
            HomeId,
            JSON_QUERY(Cars,'$') AS DynamicData
        FROM
            ##Home h
            CROSS APPLY
            (
            SELECT 
                (
                SELECT  
                    *
                FROM
                    ##Toy c
                WHERE
                    c.HomeId = h.HomeId
                    FOR JSON PATH, WITHOUT_ARRAY_WRAPPER
                ) AS Cars
            ) d
    ) GarageItems
        ON h.HomeId = GarageItems.HomeId
GROUP BY
    h.HomeId,
    h.City,
    h.State

There are a couple of key elements that make this work.

CROSS APPLY

When using FOR JSON PATH, ALL rows and columns from that result set will get converted to a single JSON string.

This creates a problem if, for example, you want to have a column for your JSON string and a separate column for something like a foreign key (in our case, HomeId).  Or if you want to generate multiple JSON strings filtered on a foreign key.

The way I chose to get around this is to use CROSS APPLY with a join back to our Home table - this way we get our JSON string for either Cars or Toys created but then output it along with some additional columns.
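Isolated from the larger query above, the pattern looks something like this (a simplified sketch of just the Cars half):

-- One row per home: the HomeId plus a JSON string of that home's cars
SELECT
    h.HomeId,
    d.Cars
FROM
    ##Home h
    CROSS APPLY
    (
    SELECT
        (
        SELECT *
        FROM ##Car c
        WHERE c.HomeId = h.HomeId
        FOR JSON PATH, WITHOUT_ARRAY_WRAPPER
        ) AS Cars
    ) d;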

WITHOUT_ARRAY_WRAPPER

When using FOR JSON PATH to turn a result set into a JSON string, SQL Server will automatically add square brackets around the JSON output as if it were an array.

This is a problem in our scenario because when we use FOR JSON PATH to turn the Car and Toy tables into JSON strings, we eventually want to combine them together into the same array instead of two separate arrays.  The solution to this is using the WITHOUT_ARRAY_WRAPPER option to output the JSON string without the square brackets.
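A quick way to see the difference for yourself (a small sketch against the ##Car table from above):

-- Default behavior: the output is wrapped in [ and ] like an array
SELECT * FROM ##Car FOR JSON PATH;

-- WITHOUT_ARRAY_WRAPPER: just the comma-separated objects, ready to be
-- concatenated with the Toy objects into one combined array
SELECT * FROM ##Car FOR JSON PATH, WITHOUT_ARRAY_WRAPPER;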

Conclusion

Your individual scenario and results may vary.  This solution was to solve a specific scenario in a specific environment.

Is it the right way to go about solving your performance problems all of the time? No.  But offloading these transformations onto SQL Server is an option to keep in mind.

Just remember - always test to make sure your performance changes are actually helping.

UPDATED Solution Using CONCAT_WS:

This solution, recommended by Jovan Popovic, is even easier than the one above.  It requires CONCAT_WS, which is available starting in SQL Server 2017 (the above solution requires STRING_AGG, also a 2017 feature, but it could be rewritten using FOR XML string aggregation for earlier versions).

SELECT h.*,
'['+ CONCAT_WS(',',
(SELECT * FROM ##Car c WHERE c.HomeId = h.HomeId FOR JSON PATH, WITHOUT_ARRAY_WRAPPER),
(SELECT * FROM ##Toy t WHERE t.HomeId = h.HomeId FOR JSON PATH, WITHOUT_ARRAY_WRAPPER)
)
+ ']'
FROM ##Home h

Does The Join Order of My Tables Matter?


I had a great question submitted to me (thank you Brandman!) that I thought would make for a good blog post:

...I've been wondering if it really matters from a performance standpoint where I start my queries. For example, if I join from A-B-C, would I be better off starting at table B and then going to A & C?

The short answer: Yes.  And no.

Watch this week's video on YouTube

Table join order matters for performance!

Disclaimer: For this post, I'm only going to be talking about INNER joins.  OUTER (LEFT, RIGHT, FULL, etc...) joins are a whole 'nother animal that I'll save for another time.

Let's use the following query from WideWorldImporters for our examples:

/* 
-- Run this if you want to follow along - add a computed column and index for CountryOfManufacture
ALTER TABLE Warehouse.StockItems SET (SYSTEM_VERSIONING = OFF);   
ALTER TABLE Warehouse.StockItems
ADD CountryOfManufacture AS CAST(JSON_VALUE(CustomFields,'$.CountryOfManufacture') AS NVARCHAR(10)) 
ALTER TABLE Warehouse.StockItems SET (SYSTEM_VERSIONING = ON); 
CREATE INDEX IX_CountryOfManufacture ON Warehouse.StockItems (CountryOfManufacture)
*/

SELECT
  o.OrderID,
  s.CountryOfManufacture
FROM
  Sales.Orders o                        -- 73595 rows
  INNER JOIN Sales.OrderLines l         -- 231412 rows
    ON o.OrderID = l.OrderID            -- 231412 rows after join
  INNER JOIN Warehouse.StockItems s     -- 227 rows
    ON l.StockItemID = s.StockItemID    -- 1036 rows after join 
    AND s.CountryOfManufacture = 'USA'  -- 8 rows for USA   

Note: with an INNER join, I normally would prefer putting my 'USA' filter in the WHERE clause, but for the rest of these examples it'll be easier to have it part of the ON.

The key thing to notice is that we are joining  three tables - Orders, OrderLines, and StockItems - and that OrderLines is what we use to join between the other two tables.

We basically have two options for table join orders then - we can join Orders with OrderLines first and then join in StockItems, or we can join OrderLines and StockItems first and then join in Orders.

In terms of performance, it's almost certain that the latter scenario (joining OrderLines with StockItems first) will be faster because StockItems will help us be more selective.

Selective?  Well you might notice that our StockItems table is small with only 227 rows.  It's made even smaller by filtering on 'USA' which reduces it to only 8 rows.

Since the StockItems table has no duplicate rows (it's a simple lookup table for product information) it is a great table to join with as early as possible since it will reduce the total number of rows getting passed around for the remainder of the query.

If we tried doing the Orders to OrderLines join first, we actually wouldn't filter out any rows in our first step, causing our subsequent join to StockItems to be slower (because more rows would have to be processed).

Basically, join order DOES matter because if we can join two tables that will reduce the number of rows needed to be processed by subsequent steps, then our performance will improve.

So if the order that our tables are joined in makes a big difference for performance reasons, SQL Server follows the join order we define, right?

SQL Server doesn't let you choose the join order

SQL is a declarative language: you write code that specifies *what* data to get, not *how* to get it.

Basically, the SQL Server query optimizer takes your SQL query and decides on its own how it thinks it should get the data.

It does this by using precalculated statistics on your table sizes and data contents in order to be able to pick a "good enough" plan quickly.

So even if we rearrange the order of the tables in our FROM statement like this:

SELECT
  o.OrderID,
  s.CountryOfManufacture
FROM
  Sales.OrderLines l
  INNER JOIN Warehouse.StockItems s
    ON l.StockItemID = s.StockItemID
    AND s.CountryOfManufacture = 'USA'
  INNER JOIN Sales.Orders o
    ON o.OrderID = l.OrderID

Or if we add parentheses:

SELECT
  o.OrderID,
  s.CountryOfManufacture
FROM
  (Sales.OrderLines l
  INNER JOIN Sales.Orders o
    ON l.OrderID = o.OrderID)
  INNER JOIN Warehouse.StockItems s
    ON l.StockItemID = s.StockItemID
    AND s.CountryOfManufacture = 'USA'

Or even if we rewrite the tables into subqueries:

SELECT
  l.OrderID,
  s.CountryOfManufacture
FROM
  (
  SELECT 
    o.OrderID,
    l.StockItemId
  FROM
    Sales.OrderLines l
    INNER JOIN Sales.Orders o
      ON l.OrderID = o.OrderID
  ) l
  INNER JOIN Warehouse.StockItems s
    ON l.StockItemID = s.StockItemID
    AND s.CountryOfManufacture = 'USA'

SQL Server will interpret and optimize our three separate queries (plus the original one from the top of the page) into the exact same execution plan.


Basically, no matter how we try to redefine the order of our tables in the FROM statement, SQL Server will still do what it thinks is best.

But what if SQL Server doesn't know best?

The majority of the time when I see SQL Server doing something inefficient with an execution plan, it's due to something wrong with the statistics for that table/index.

Statistics are also a whole 'nother topic for a whole 'nother day (or month) of blog posts, so to not get too sidetracked with this post, I'll point you to Kimberly Tripp's introductory blog post on the subject: https://www.sqlskills.com/blogs/kimberly/the-accidental-dba-day-15-of-30-statistics-maintenance/

The key thing to take away is that if SQL Server is generating an execution plan where the order of table joins doesn't make sense, check your statistics first - they are the root cause of many performance problems!
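One quick way to check (a hedged sketch, using one of the tables from our query as an example) is to look at when each statistic was last updated and how much the data has changed since then:

-- When were statistics last updated, and how many modifications since?
SELECT
    s.name AS StatisticName,
    sp.last_updated,
    sp.rows,
    sp.rows_sampled,
    sp.modification_counter
FROM
    sys.stats s
    CROSS APPLY sys.dm_db_stats_properties(s.object_id, s.stats_id) sp
WHERE
    s.object_id = OBJECT_ID('Sales.OrderLines');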

Forcing a join order

So you already checked to see if your statistics are the problem and exhausted all possibilities on that front.  SQL Server isn't optimizing for the optimal table join order, so what can you do?

Row goals

If SQL Server isn't behaving and I need to force a table join order, my preferred way is to do it via a TOP() command.

I learned this technique from watching Adam Machanic's fantastic presentation on the subject and I highly recommend you watch it.

Since in our example query SQL Server is already joining the tables in the most efficient order, let's force an inefficient join by joining Orders with OrderLines first.

Basically, we write a subquery around the tables we want to join together first and make sure to include a TOP clause.

SELECT
  o.OrderID,
  s.CountryOfManufacture
FROM
  (
  SELECT TOP(2147483647) -- A number of rows we know is larger than our table.  Watch Adam's presentation above for more info.
    o.OrderID,
    l.StockItemID
  FROM
    Sales.Orders o
    INNER JOIN Sales.OrderLines l
      ON o.OrderID = l.OrderID
  ) o
  INNER JOIN Warehouse.StockItems s
    ON o.StockItemID = s.StockItemID
    AND s.CountryOfManufacture = 'USA'

Including TOP forces SQL to perform the join between Orders and OrderLines first - inefficient in this example, but a great success in being able to control what SQL Server does.


This is my favorite way of forcing a join order because we get to inject control over the join order of two specific tables in this case (Orders and OrderLines) but SQL Server will still use its own judgement in how any remaining tables should be joined.

While forcing a join order is generally a bad idea (what happens if the underlying data changes in the future and your forced join order is no longer the best option?), in certain scenarios where it's required the TOP technique will cause the least amount of performance problems (since SQL Server still gets to decide what happens with the rest of the tables).

The same can't be said if using hints...

Query and join hints

Query and join hints will successfully force the order of the table joins in your query; however, they have significant drawbacks.

Let's look at the FORCE ORDER query hint.  Adding it to your query will successfully force the table joins to occur in the order that they are listed:

SELECT
  o.OrderID,
  s.CountryOfManufacture
FROM
  Sales.Orders o
  INNER JOIN Sales.OrderLines l
    ON o.OrderID = l.OrderID
  INNER JOIN Warehouse.StockItems s
    ON l.StockItemID = s.StockItemID
    AND s.CountryOfManufacture = 'USA'
OPTION (FORCE ORDER)

Looking at the execution plan, we can see that Orders and OrderLines were joined together first, as expected.


The biggest drawback with the FORCE ORDER hint is that all tables in your query are going to have their join order forced (not evident in this example...but imagine we were joining 4 or 5 tables in total).

This makes your query incredibly fragile; if the underlying data changes in the future, you could be forcing multiple inefficient join orders.  Your query that you tuned with FORCE ORDER could go from running in seconds to minutes or hours.

The same problem exists when using join hints:

SELECT
  o.OrderID,
  s.CountryOfManufacture
FROM
  Sales.Orders o 
  INNER LOOP JOIN Sales.OrderLines l
    ON o.OrderID = l.OrderID
  INNER JOIN Warehouse.StockItems s
    ON l.StockItemID = s.StockItemID
    AND s.CountryOfManufacture = 'USA'

Using the LOOP hint successfully forces our join order again, but once again the join order of all of our tables becomes fixed.


A join hint is probably the most fragile hint that forces table join order because not only is it forcing the join order, but it's also forcing the algorithm used to perform the join.

In general, I only use query hints to force table join order as a temporary fix.

Maybe production has a problem and I need to get things running again; a query or join hint may be the quickest way to fix the immediate issue.  However, long term using the hint is probably a bad idea, so after the immediate fires are put out I will go back and try to determine the root cause of the performance problem.

Summary

  • Table join order matters for reducing the number of rows that the rest of the query needs to process.
  • By default SQL Server gives you no control over the join order - it uses statistics and the query optimizer to pick what it thinks is a good join order.
  • Most of the time, the query optimizer does a great job at picking efficient join orders.  When it doesn't, the first thing I do is check the health of my statistics and figure out if it's picking a sub-optimal plan because of that.
  • If I am in a special scenario and I truly do need to force a join order, I'll use the TOP clause to force a join order since it only forces the order of a single join.
  • In an emergency "production-servers-are-on-fire" scenario, I might use a query or join hint to immediately fix a performance issue and go back to implement a better solution once things calm down.

Why Parameter Sniffing Isn't Always A Bad Thing (But Usually Is)


In this series I explore scenarios that hurt SQL Server performance and show you how to avoid them. Pulled from my collection of "things I didn't know I was doing wrong for years."

Watch this week's video on YouTube

Last week we discussed how implicit conversions could be one reason why your meticulously designed indexes aren't getting used.

Today let's look at another reason: parameter sniffing.

Here's the key: Parameter sniffing isn't always a bad thing.

Most of the time it's good: it means SQL Server is caching and reusing query plans to make your queries run faster.

Parameter sniffing only becomes a problem when the cached plan isn't anywhere close to being the optimal plan for given input parameters.

So what's parameter sniffing?

Let's start with our table dbo.CoffeeInventory which you can grab from Github.
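If you'd rather not grab the repo, here's a rough stand-in for the demo table (the real definition and data on GitHub may differ; the queries below assume it lives in a database called Sandbox):

-- Simplified sketch of the demo table
CREATE TABLE dbo.CoffeeInventory
(
    CoffeeId int IDENTITY PRIMARY KEY,
    Name varchar(100),
    Price decimal(5,2),
    Description varchar(200)
);
GO
CREATE NONCLUSTERED INDEX IX_CoffeeInventory_Name ON dbo.CoffeeInventory (Name);
GO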


The key things to know about this table are that:

  1. We have a nonclustered index on our Name column.
  2. The data is not distributed evenly (we'll see this in a minute)

Now, let's write a stored procedure that will return a filtered list of coffees in our table, based on the country. Since there is no specific Country column, we'll write it so it filters on the Name column:

DROP PROCEDURE IF EXISTS dbo.FilterCoffee
GO
CREATE PROCEDURE dbo.FilterCoffee
@ParmCountry varchar(30)
AS
BEGIN
    SELECT Name, Price, Description 
    FROM Sandbox.dbo.CoffeeInventory
    WHERE Name LIKE @ParmCountry + '%'
END
GO

Let's take a look at parameter sniffing in action, then we'll take a look at why it happens and how to solve it.

EXEC dbo.FilterCoffee @ParmCountry = 'Costa Rica'
EXEC dbo.FilterCoffee @ParmCountry = 'Ethiopia'

Running the above statements gives us identical execution plans using table scans.

In this case we explicitly specified the parameter @ParmCountry. Sometimes SQL will parameterize simple queries on its own.

That's weird. We have two query executions, they are using the same plan, and neither plan is using our nonclustered index on Name!

Let's step back and try again. First, clear the query plan cache for this stored procedure:

DECLARE @cache_plan_handle varbinary(44)
SELECT @cache_plan_handle = c.plan_handle
FROM 
    sys.dm_exec_cached_plans c
    CROSS APPLY sys.dm_exec_sql_text(c.plan_handle) t
WHERE 
    text like 'CREATE%CoffeeInventory%' 
-- Never run DBCC FREEPROCCACHE without a parameter in production unless you want to lose all of your cached plans...
DBCC FREEPROCCACHE(@cache_plan_handle)

Next, execute the same stored procedure with the same parameter values, but this time with the 'Ethiopia' parameter value first. Look at the execution plan:

EXEC dbo.FilterCoffee @ParmCountry = 'Ethiopia'
EXEC dbo.FilterCoffee @ParmCountry = 'Costa Rica'


Now our nonclustered index on Name is being utilized. Both queries are still receiving the same plan as each other (albeit a different plan than before).

We didn't change anything with our stored procedure code, only the order that we executed the query with different parameters.

What the heck is going on here!?

This is an example of parameter sniffing. The first time a stored procedure (or query) is run on SQL Server, SQL Server will generate an execution plan for it and store that plan in the query plan cache:

SELECT
    c.usecounts,
    c.cacheobjtype,
    c.objtype,
    c.plan_handle,
    c.size_in_bytes,
    d.name,
    t.text,
    p.query_plan
FROM 
    sys.dm_exec_cached_plans c
    CROSS APPLY sys.dm_exec_sql_text(c.plan_handle) t
    CROSS APPLY sys.dm_exec_query_plan(c.plan_handle) p
    INNER JOIN sys.databases d
    ON t.dbid = d.database_id
WHERE 
    text like 'CREATE%CoffeeInventory%'


All subsequent executions of that same query will go to the query cache to reuse that same initial query plan - this saves SQL Server from having to spend time generating a new query plan.

Note: A query with different values passed as parameters still counts as the "same query" in the eyes of SQL Server.

In the case of the examples above, the first time the query was executed was with the parameter for "Costa Rica". Remember when I said this dataset was heavily skewed? Let's look at some counts:

SELECT 
    LEFT(Name,CHARINDEX(' ',Name)) AS Country, 
    COUNT(*) AS CountryCount 
FROM dbo.CoffeeInventory 
GROUP BY 
    LEFT(Name,CHARINDEX(' ',Name))


"Costa Rica" has more than 10,000 rows in this table, while all other country names are in the single digits.

This means that when we executed our stored procedure for the first time, SQL Server generated an execution plan that used a table scan because it thought this would be the most efficient way to retrieve 10,003 of the 10,052 rows.

This table scan query plan is only optimal for Costa Rica.  Passing in any other country name into the stored procedure would return only a handful of records, making it more efficient for SQL Server to use our nonclustered index.

However, since the Costa Rica plan was the first one to run, and therefore is the one that got added to the query plan cache, all other executions ended up using the same table scan execution plan.

After clearing our cached execution plan using DBCC FREEPROCCACHE, we executed our stored procedure again but with 'Ethiopia' as our parameter. SQL Server determined that a plan with an index seek is optimal to retrieve only 6 of the 10,052 rows in the table. It then cached that Index Seek plan, which is why the second time around the 'Costa Rica' parameter received the execution plan with Index Seek.

Ok, so how do I prevent parameter sniffing?

This question should really be rephrased as "how do I prevent SQL Server from using a sub-optimal plan from the query plan cache?"

Let's take a look at some of the techniques.

1. Use WITH RECOMPILE or OPTION (RECOMPILE)

We can simply add these query hints to either our EXEC statement:

EXEC dbo.FilterCoffee @ParmCountry = 'Ethiopia' WITH RECOMPILE
EXEC dbo.FilterCoffee @ParmCountry = 'Costa Rica' WITH RECOMPILE

or to our stored procedure itself:

DROP PROCEDURE IF EXISTS dbo.FilterCoffee
GO
CREATE PROCEDURE dbo.FilterCoffee
@ParmCountry varchar(30)
AS
BEGIN
    SELECT Name, Price, Description 
    FROM Sandbox.dbo.CoffeeInventory 
    WHERE Name LIKE @ParmCountry + '%'

    OPTION (RECOMPILE)

END
GO

What the RECOMPILE hint does is force SQL Server to generate a new execution plan every time these queries run.

Using RECOMPILE eliminates our parameter sniffing problem because SQL Server will regenerate the query plan every single time we execute the query.

The disadvantage here is that we lose all benefit from having SQL Server save CPU cycles by caching execution plans.

If your parameter-sniffed query is getting run frequently, RECOMPILE is probably a bad idea because you will incur a lot of overhead regenerating the query plan.

If your parameter-sniffed query doesn't get run often, or if the query doesn't run often enough to stay in the query plan cache anyway, then RECOMPILE is a good solution.

2. Use the OPTIMIZE FOR query hint

Another option we have is to add either one of the following hints to our query. One of these would get added to the same location as OPTION (RECOMPILE) did in the above stored procedure:

OPTION (OPTIMIZE FOR (@ParmCountry UNKNOWN))

or

OPTION (OPTIMIZE FOR (@ParmCountry = 'Ethiopia'))

OPTIMIZE FOR UNKNOWN will use a query plan that's generated from the average distribution stats for that column/index. Oftentimes it results in an average or bad execution plan, so I don't like using it.

OPTIMIZE FOR VALUE creates a plan using whatever parameter value is specified. This is great if you know your queries will be retrieving data that's optimized for the value you specified most of the time.

In our examples above, if we know the value 'Costa Rica' is rarely queried, we might optimize for index seeks. Most queries will then run the optimal cached query plan and we'll only take a hit when 'Costa Rica' is queried.
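Applied to our stored procedure, that might look like this (optimizing for 'Ethiopia' here is just an illustrative choice):

DROP PROCEDURE IF EXISTS dbo.FilterCoffee
GO
CREATE PROCEDURE dbo.FilterCoffee
@ParmCountry varchar(30)
AS
BEGIN
    SELECT Name, Price, Description
    FROM Sandbox.dbo.CoffeeInventory
    WHERE Name LIKE @ParmCountry + '%'
    OPTION (OPTIMIZE FOR (@ParmCountry = 'Ethiopia'))
END
GO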

3. IF/ELSE

This solution allows for ultimate flexibility. Basically, you create different stored procedures that are optimized for different values. Those stored procedures have their plans cached, and then an IF/ELSE statement determines which procedure to run for a passed in parameter:

DROP PROCEDURE IF EXISTS dbo.FilterCoffee
GO
CREATE PROCEDURE dbo.FilterCoffee
@ParmCountry varchar(30)
AS
BEGIN
    IF @ParmCountry = 'Costa Rica'
    BEGIN
    EXEC dbo.ScanningStoredProcedure @ParmCountry
    END
    ELSE
    BEGIN
    EXEC dbo.SeekingStoredProcedure @ParmCountry
    END
END
GO
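The post doesn't define dbo.ScanningStoredProcedure or dbo.SeekingStoredProcedure, but here is one hedged way they could look - two copies of the query, each nudged toward the access method its name implies:

DROP PROCEDURE IF EXISTS dbo.ScanningStoredProcedure
GO
CREATE PROCEDURE dbo.ScanningStoredProcedure
@ParmCountry varchar(30)
AS
BEGIN
    -- Good for huge result sets like 'Costa Rica'
    SELECT Name, Price, Description
    FROM Sandbox.dbo.CoffeeInventory WITH (FORCESCAN)
    WHERE Name LIKE @ParmCountry + '%'
END
GO

DROP PROCEDURE IF EXISTS dbo.SeekingStoredProcedure
GO
CREATE PROCEDURE dbo.SeekingStoredProcedure
@ParmCountry varchar(30)
AS
BEGIN
    -- Good for the small result sets every other country returns
    SELECT Name, Price, Description
    FROM Sandbox.dbo.CoffeeInventory WITH (FORCESEEK)
    WHERE Name LIKE @ParmCountry + '%'
END
GO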

This option is more work (How do you determine what the IF condition should be? What happens when more data is added to the table over time and the distribution of data changes?) but will give you the best performance if you want your plans to be cached and be optimal for the data getting passed in.

Conclusion

  1. Parameter sniffing is only bad when your data values are unevenly distributed and cached query plans are not optimal for all values.
  2. SQL Server caches the query plan that is generated from the first run of a query/stored procedure with whatever parameter values were used during that first run.
  3. Using the RECOMPILE hint is a good solution when your queries aren't getting run often or aren't staying in the query cache most of the time anyway.
  4. The OPTIMIZE FOR hint is good to use when you can specify a value that will generate a query plan that is efficient for most parameter values and are OK with taking a hit for a sub-optimal plan on infrequently queried values.
  5. Using complex logic (like IF/ELSE) will give you ultimate flexibility and performance, but will also be the worst for long term maintenance.