Data with Bert logo

Does The Join Order of My Tables Matter?

pan-xiaozhen-252035

I had a great question submitted to me (thank you Brandman!) that I thought would make for a good blog post:

...I've been wondering if it really matters from a performance standpoint where I start my queries. For example, if I join from A-B-C, would I be better off starting at table B and then going to A & C?

The short answer: Yes.  And no.

Watch this week's video on YouTube

Table join order matters for performance!

Disclaimer: For this post, I'm only going to be talking about INNER joins.  OUTER (LEFT, RIGHT, FULL, etc...) joins are a whole 'nother animal that I'll save for time.

Let's use the following query from WideWorldImporters for our examples:

/* 
-- Run if if you want to follow along - add  a computed column and index for CountryOfManufacture
ALTER TABLE Warehouse.StockItems SET (SYSTEM_VERSIONING = OFF);   
ALTER TABLE Warehouse.StockItems
ADD CountryOfManufacture AS CAST(JSON_VALUE(CustomFields,'$.CountryOfManufacture') AS NVARCHAR(10)) 
ALTER TABLE Warehouse.StockItems SET (SYSTEM_VERSIONING = ON); 
CREATE INDEX IX_CountryOfManufacture ON Warehouse.StockItems (CountryOfManufacture)
*/

SELECT
  o.OrderID,
  s.CountryOfManufacture
FROM
  Sales.Orders o                        -- 73595 rows
  INNER JOIN Sales.OrderLines l         -- 231412 rows
    ON o.OrderID = l.OrderID            -- 231412 rows after join
  INNER JOIN Warehouse.StockItems s     -- 227 rows
    ON l.StockItemID = s.StockItemID    -- 1036 rows after join 
    AND s.CountryOfManufacture = 'USA'  -- 8 rows for USA   

Note: with an INNER join, I normally would prefer putting my 'USA' filter in the WHERE clause, but for the rest of these examples it'll be easier to have it part of the ON.

The key thing to notice is that we are joining  three tables - Orders, OrderLines, and StockItems - and that OrderLines is what we use to join between the other two tables.

We basically have two options for table join orders then - we can join Orders with OrderLines first and then join in StockItems, or we can join OrderLines and StockItems first and then join in Orders.

In terms of performance, it's almost certain that the latter scenario (joining OrderLines with StockItems first) will be faster because StockItems will help us be more selective.

Selective?  Well you might notice that our StockItems table is small with only 227 rows.  It's made even smaller by filtering on 'USA' which reduces it to only 8 rows.

Since the StockItems table has no duplicate rows (it's a simple lookup table for product information) it is a great table to join with as early as possible since it will reduce the total number of rows getting passed around for the remainder of the query.

If we tried doing the Orders to OrderLines join first, we actually wouldn't filter out any rows in our first step, cause our subsequent join to StockItems to be more slower (because more rows would have to be processed).

Basically, join order DOES matter because if we can join two tables that will reduce the number of rows needed to be processed by subsequent steps, then our performance will improve.

So if the order that our tables are joined in makes a big difference for performance reasons, SQL Server follows the join order we define right?

SQL Server doesn't let you choose the join order

SQL is a declarative language: you write code that specifies *what* data to get, not *how* to get it.

Basically, the SQL Server query optimizer takes your SQL query and decides on its own how it thinks it should get the data.

It does this by using precalculated statistics on your table sizes and data contents in order to be able to pick a "good enough" plan quickly.

So even if we rearrange the order of the tables in our FROM statement like this:

SELECT
  o.OrderID,
  s.CountryOfManufacture
FROM
  Sales.OrderLines l
  INNER JOIN Warehouse.StockItems s
    ON l.StockItemID = s.StockItemID
    AND s.CountryOfManufacture = 'USA'
  INNER JOIN Sales.Orders o
    ON o.OrderID = l.OrderID

Or if we add parentheses:

SELECT
  o.OrderID,
  s.CountryOfManufacture
FROM
  (Sales.OrderLines l
  INNER JOIN Sales.Orders o
    ON l.OrderID = o.OrderID)
  INNER JOIN Warehouse.StockItems s
    ON l.StockItemID = s.StockItemID
    AND s.CountryOfManufacture = 'USA'

Or even if we rewrite the tables into subqueries:

SELECT
  l.OrderID,
  s.CountryOfManufacture
FROM
  (
  SELECT 
    o.OrderID,
    l.StockItemId
  FROM
    Sales.OrderLines l
    INNER JOIN Sales.Orders o
      ON l.OrderID = o.OrderID
  ) l
  INNER JOIN Warehouse.StockItems s
    ON l.StockItemID = s.StockItemID
    AND s.CountryOfManufacture = 'USA'

SQL Server will interpret and optimize our three separate queries (plus the original one from the top of the page) into the same exact execution plan:

2017-11-16_21-42-37

Basically, no matter how we try to redefine the order of our tables in the FROM statement, SQL Server will still do what it thinks it's best.

But what if SQL Server doesn't know best?

The majority of the time I see SQL Server doing something inefficient with an execution plan it's usually due to something wrong with statistics for that table/index.

Statistics are also a whole 'nother topic for a whole 'nother day (or month) of blog posts, so to not get too side tracked with this post, I'll point you to Kimberly Tripp's introductory blog post on the subject: https://www.sqlskills.com/blogs/kimberly/the-accidental-dba-day-15-of-30-statistics-maintenance/)

The key thing to take away is that if SQL Server is generating an execution plan where the order of table joins doesn't make sense check your statistics first because they are the root cause of many performance problems!

Forcing a join order

So you already checked to see if your statistics are the problem and exhausted all possibilities on that front.  SQL Server isn't optimizing for the optimal table join order, so what can you do?

Row goals

If SQL Server isn't behaving and I need to force a table join order, my preferred way is to do it via a TOP() command.

I learned this technique from watching Adam Machanic's fantastic presentation on the subject and I highly recommend you watch it.

Since in our example query SQL Server is already joining the tables in the most efficient order, let's force an inefficient join by joining Orders with OrderLines first.

**Basically, we write a subquery around the tables we want to join together first and make sure to include a TOP clause. **

SELECT
  o.OrderID,
  s.CountryOfManufacture
FROM
  (
  SELECT TOP(2147483647) -- A number of rows we know is larger than our table.  Watch Adam's presentation above for more info.
    o.OrderID,
    l.StockItemID
  FROM
    Sales.Orders o
    INNER JOIN Sales.OrderLines l
      ON o.OrderID = l.OrderID
  ) o
  INNER JOIN Warehouse.StockItems s
    ON o.StockItemID = s.StockItemID
    AND s.CountryOfManufacture = 'USA'

Including TOP forces SQL to perform the join between Orders and OrderLines first - inefficient in this example, but a great success in being able to control what SQL Server does.

2017-11-17_12-16-05

This is my favorite way of forcing a join order because we get to inject control over the join order of two specific tables in this case (Orders and OrderLines) but SQL Server will still use its own judgement in how any remaining tables should be joined.

While forcing a join order is generally a bad idea (what happens if the underlying data changes in the future and your forced join no longer is the best option), in certain scenarios where its required the TOP technique will cause the least amount of performance problems (since SQL still gets to decide what happens with the rest of the tables).

The same can't be said if using hints...

Query and join hints

Query and join hints will successfully force the order of the table joins in your query, however they have significant draw backs.

Let's look at the FORCE ORDER query hint.  Adding it to your query will successfully force the table joins to occur in the order that they are listed:

SELECT
  o.OrderID,
  s.CountryOfManufacture
FROM
  Sales.Orders o
  INNER JOIN Sales.OrderLines l
    ON o.OrderID = l.OrderID
  INNER JOIN Warehouse.StockItems s
    ON l.StockItemID = s.StockItemID
    AND s.CountryOfManufacture = 'USA'
OPTION (FORCE ORDER)

Looking at the execution plan we can see that Orders and OrderLines were joined together first as expected:

2017-11-17_12-23-37

The biggest drawback with the FORCE ORDER hint is that all tables in your query are going to have their join order forced (not evident in this example...but imagine we were joining 4 or 5 tables in total).

This makes your query incredibly fragile; if the underlying data changes in the future, you could be forcing multiple inefficient join orders.  Your query that you tuned with FORCE ORDER could go from running in seconds to minutes or hours.

The same problem exists with using a join hints:

SELECT
  o.OrderID,
  s.CountryOfManufacture
FROM
  Sales.Orders o 
  INNER LOOP JOIN Sales.OrderLines l
    ON o.OrderID = l.OrderID
  INNER JOIN Warehouse.StockItems s
    ON l.StockItemID = s.StockItemID
    AND s.CountryOfManufacture = 'USA'

Using the LOOP hint successfully forces our join order again, but once again the join order of all of our tables becomes fixed:

2017-11-17_12-28-21

A join hint is probably the most fragile hint that forces table join order because not only is it forcing the join order, but it's also forcing the algorithm used to perform the join.

In general, I only use query hints to force table join order as a temporary fix.

Maybe production has a problem and I need to get things running again; a query or join hint may be the quickest way to fix the immediate issue.  However, long term using the hint is probably a bad idea, so after the immediate fires are put out I will go back and try to determine the root cause of the performance problem.

Summary

  • Table join order matters for reducing the number of rows that the rest of the query needs to process.
  • By default SQL Server gives you no control over the join order - it uses statistics and the query optimizer to pick what it thinks is a good join order.
  • Most of the time, the query optimizer does a great job at picking efficient join orders.  When it doesn't, the first thing I do is check to see the health of my statistics and figure out if it's picking a sub-optimal plan because of that.
  • If I am in a special scenario and I truly do need to force a join order, I'll use the TOP clause to force a join order since it only forces the order of a single join.
  • In an emergency "production-servers-are-on-fire" scenario, I might use a query or join hint to immediately fix a performance issue and go back to implement a better solution once things calm down.

How to Search Stored Procedures and Ad-Hoc Queries

louis-blythe-192936-1-e1510402200998

Have you ever wanted to find something that was referenced in the body of a SQL query?

Maybe you need to know what queries you will have to modify for an upcoming table rename.  Or maybe you want to see how many queries on your server are running [SELECT *]{.lang:default .highlight:0 .decode:true .crayon-inline}

Below are two templates you can use to search across the text of SQL queries on your server.

Watch this week's video on YouTube

1. Searching Stored Procedures, Functions, and Views

If the queries you are interested in are part of a stored procedure, function, or view, then you have to look no further than the [sys.sql_modules]{.lang:default .highlight:0 .decode:true .crayon-inline} view.

This view stores the query text of every module in your database, along with a number of other properties.

You can use something like the following as a template for searching through the query texts of these database objects:

USE [<database name>];
GO

SET TRANSACTION ISOLATION LEVEL READ UNCOMMITTED

SELECT
    o.type_desc AS ObjectType,
    DB_NAME(o.parent_object_id) AS DatabaseName,
    s.name as SchemaName,
    o.name as ObjectName,
    r.Definition
FROM
    sys.sql_modules r
    INNER JOIN sys.objects o
        ON r.object_id = o.object_id
    INNER JOIN sys.schemas s
        ON o.schema_id = s.schema_id
WHERE
    -- put your search keyword here
    r.Definition LIKE '%SELECT%'

For example, I recently built a query for searching stored procedures and functions that might contain SQL injection vulnerabilities.

Using the starting template above, I added some filtering in the WHERE clause to limit my search to queries that follow common coding patterns that are vulnerable to SQL injection:

USE [<database name>];
GO

SET TRANSACTION ISOLATION LEVEL READ UNCOMMITTED

SELECT
    o.type_desc AS ObjectType,
    DB_NAME(o.parent_object_id) AS DatabaseName,
    s.name as SchemaName,
    o.name as ObjectName,
    r.Definition
FROM
    sys.sql_modules r
    INNER JOIN sys.objects o
        ON r.object_id = o.object_id
    INNER JOIN sys.schemas s
        ON o.schema_id = s.schema_id
WHERE
    -- Remove white space from query texts
    REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(
        r.Definition,CHAR(0),''),CHAR(9),''),CHAR(10),''),CHAR(11),''),
        CHAR(12),''),CHAR(13),''),CHAR(14),''),CHAR(160),''),' ','')
    LIKE '%+@%'
    AND 
    ( -- Only if executes a dynamic string
        r.Definition LIKE '%EXEC(%'
        OR r.Definition LIKE '%EXECUTE%'
        OR r.Definition LIKE '%sp_executesql%'
    )

2. Searching Ad-Hoc SQL Queries

Searching across ad-hoc queries is a little tougher.  Unless you are proactively logging the query texts with extended events or some other tool, there is no way to definitively search every ad-hoc query text.

However, SQL Server does create (or reuse) an execution plan for each query that executes.  Most of those plans are then added to the execution plan cache.

Execution plans are eventually removed from the cache for various reasons, but while they exist we can easily search their contents, including searching through that plan's query text.

As a starting point, you can use the following code to retrieve SQL query texts that are currently stored in the plan cache:

USE [<database name>];
GO

WITH XMLNAMESPACES (DEFAULT 'http://schemas.microsoft.com/sqlserver/2004/07/showplan')

SELECT
   stmt.value('(@StatementText)[1]', 'varchar(max)') AS [Query],
   query_plan AS [QueryPlan]
FROM 
    sys.dm_exec_cached_plans AS cp
    CROSS APPLY sys.dm_exec_query_plan(cp.plan_handle) AS qp
    CROSS APPLY query_plan.nodes('/ShowPlanXML/BatchSequence/Batch/Statements/StmtSimple') AS batch(stmt)
WHERE
    -- put your search keywords here
    stmt.value('(@StatementText)[1]', 'varchar(max)') LIKE '%SELECT%'

Although the template above searches for the query texts in our execution plans, you can also use it to search for other query plan elements, such as elements that indicate if you have non-sargable query.

I used this technique recently to search for ad-hoc queries that might be vulnerable to SQL injection.  I modified the template above to search the input parameter values instead of the query texts, flagging any values that look like they might have some injection code in them:

USE [<database name>];
GO

WITH XMLNAMESPACES (DEFAULT 'http://schemas.microsoft.com/sqlserver/2004/07/showplan')

SELECT
   stmt.value('(@StatementText)[1]', 'varchar(max)') AS [Query],
   query_plan AS [QueryPlan],
   stmt.value('(.//ColumnReference/@ParameterCompiledValue)[1]', 'varchar(1000)') AS [ParameterValue] 
FROM 
    sys.dm_exec_cached_plans AS cp
    CROSS APPLY sys.dm_exec_query_plan(cp.plan_handle) AS qp
    CROSS APPLY query_plan.nodes('/ShowPlanXML/BatchSequence/Batch/Statements/StmtSimple') AS batch(stmt)
WHERE
    -- if single quotes exist in a parameter
    stmt.value('(.//ColumnReference/@ParameterCompiledValue)[1]', 'varchar(1000)') like '%''%'
    OR stmt.value('(.//ColumnReference/@ParameterCompiledValue)[1]', 'varchar(1000)') like '%sys.objects%'
    OR stmt.value('(.//ColumnReference/@ParameterCompiledValue)[1]', 'varchar(1000)') like '%[0-9]=[0-9]%'

So while using this technique won't allow you to search across 100% of ad-hoc queries, it should be able to search the ones that run most frequently and appear in your plan cache.

Intrigued by how I'm searching query texts for SQL injection vulnerabilities? Attend my online webcast on Tuesday November 14, 2017 at 1PM Eastern at the PASS Security Virtual Group to learn about these queries and how protect yourself from SQL injection.

4 Common Misconceptions About SQL Injection Attacks

jaanus-jagomagi-377699 Photo by Jaanus Jagomägi on Unsplash

Interested in learning more about SQL injection attacks, including how to prevent them?  Attend my online webcast on Tuesday November 14, 2017 at 1PM Eastern at the PASS Security Virtual Group.

SQL injection continues to be one of the biggest security risks that we face as database professionals.

Every year, millions of users' personal information is leaked due to poorly written queries exploited by SQL injection.  The sad truth is that SQL injection is completely preventable with the right knowledge.

My goal today is to cover four common misconceptions that people have about SQL injection in an effort to dissuade any illusions that an injection attack is not something that can happen to you.

Watch this week's video on YouTube

1. "My database information isn't public"

Let's see, without me knowing anything about your databases, I'm guessing you might have some tables with names like:

  • Users
  • Inventory
  • Products
  • Sales
  • etc...

Any of those sound familiar?

You might not be publicly publishing your database object names, but that doesn't mean they aren't easy to guess.

All a malicious user needs is a list of common database table names and they can iterate over the ones they are interested in until they find the ones that match in your system.

2. "But I obfuscate all of my table and column names!"

obscure-table-name

Oh jeez.  I hope you don't do this.

Some people do this for job security ("since only I can understand my naming conventions, I'm guaranteeing myself a job!") and that's a terrible reason in and of itself.

Doing it for security reasons is just as horrible though.  Why?  Well, have you ever heard of some system tables like sys.objects and sys.columns?

SELECT 
    t.name, c.name 
FROM 
    sys.objects t
    INNER JOIN sys.columns c 
        on t.object_id = c.object_id

A hacker wanting to get into your system can easily write queries like the ones above, revealing your "secure" naming conventions.

sys-objects-results

Security through obscurity doesn't work.  If you have table names that aren't common, that's perfectly fine, but don't use that as your only form of prevention.

3. "Injection is the developer's/dba's/somebody else's problem"

You're exactly right.  SQL injection is a problem that should be tackled by the developer/dba/other person.

But it's also a problem that benefits from multiple layers of security, meaning it's your problem to solve as well.

Preventing sql injection is hard.

Developers should be validating, sanitizing, parameterizing etc...  DBAs should be parameterizing, sanitizing, restricting access, etc..

Multiple layers of security in the app and in the database are the only way to confidently prevent an injection attack.

4. "I'm too small of a fish in a big pond - no one would go out of their way to attack me"

So you run a niche business making and selling bespoke garden gnomes.

You only have a few dozen/hundred customers, so who would bother trying to steal your data with SQL injection?

Well, most SQL injection attacks can be completely automated with tools like sqlmap.  Someone might not care about your business enough to handcraft some SQL injection code, but that won't stop them from stealing your handcrafted garden gnome customers' data through automated means.

No app, big or small, is protected from the wrath of automated SQL injection tooling.

Interested in learning more about SQL injection attacks, including how to prevent them?  Attend my online webcast on Tuesday November 14, 2017 at 1PM Eastern at the PASS Security Virtual Group.

The Quickest Way To Get SQL Command Help

jp-valery-200305-3

Every once in a while I discover a SQL Server Management Studio trick that's apparently been around forever but is completely new to me.

Today I want to point out one of those features that had me thinking "how did I not know about this before":

The F1 keyboard shortcut.

Watch this week's video on YouTube

To use it, highlight a command or function that you want to know more information about and then press F1.  Simple as that.

Pressing F1 brings up the Microsoft online documentation for that keyword/function, making it the fastest way of getting to Microsoft's online documentation.  You'll solve your own questions faster than a coworker can tell you "to google it."

2017-10-26_18-47-05

Most recently I've been using the F1 shortcut in the following scenarios:

  • Can't remember the date/time style formats when using CONVERT?  Highlight CONVERT and press F1: BOOM! All date and time style codes appear before you.
  • Need to use some option for CREATE INDEX and don't remember the syntax?  Just highlight CREATE INDEX and press F1!  Everything you need is there.
  • Do you remember if BETWEEN is inclusive or exclusive?  F1 knows.  Just press it.

You get the idea.

Assuming you use the online Microsoft docs 10 times per day, 250 days a year, and each time it takes you 10 seconds to open a browser and search for the doc...

( 10/day * 250/year * 10 sec ) / 60 sec / 60 min = 6.94 hours saved.  Your welcome.

How to Make SQL Server Act Like A Human By Using WAITFOR

fischer-twins-396836 Photo by Fischer Twins on Unsplash

You probably tune your queries for maximum performance.  You take pride in knowing how to add indexes and refactor code in order to squeeze out every last drop your server's performance potential.  Speed is usually king.

That's why you probably don't use SQL Server's WAITFOR command regularly - it actually makes your overall query run slower.

However, slowness isn't always a bad thing.  Today I want to show you two of my favorite ways for using the WAITFOR command.

Watch this week's video on YouTube

You can also watch this week's content on my YouTube channel.

1. Building A Human

Modern day computers are fast.  CPUs perform billions of actions per second, the amount of RAM manufactures can cram onto a stick increases regularly, and SSDs are quickly making disk I/O concerns a thing of the past.

While all of those things are great for processing large workloads, they move computers further and further away from "human speed".

But "human speed" is sometimes what you want.  Maybe you want to simulate app usage on your database or the load created by analysts running ad hoc queries against your server.

This is where I love using WAITFOR DELAY - it can simulate humans executing queries extremely welll:

-- Run forever
WHILE (1=1)
BEGIN
    --Insert data to simulate an app action from our app
    EXEC dbo.BuyDonuts 12

    -- We usually average an order every 3 seconds
    WAITFOR DELAY '00:00:03'
END

Throw in some psuedo-random number generation and some IF statements, and you have a fake server load you can start using:

WHILE (1=1)
BEGIN
    -- Generate command values 1-24
    DECLARE @RandomDonutAmount int = ABS(CHECKSUM(NEWID()) % 25) + 1

    -- Generate a delay between 0 and 5 seconds
    DECLARE @RandomDelay int = ABS(CHECKSUM(NEWID()) % 6)

    EXEC dbo.BuyDonuts @RandomDonutAmount

    WAITFOR DELAY @RandomDelay
END

2. Poor Man's Service Broker

Service Broker is a great feature in SQL Server.  It handles messaging and queuing scenarios really well, but requires more setup time so I usually don't like using it in scenarios where I need something quick and dirty.

Instead of having to set up Service Broker to know when some data is available or a process is ready to be kicked off, I can do the same with a WHILE loop and a WAITFOR:

DECLARE @Quantity smallint = NULL

-- Keep checking our table data until we find the data we want
WHILE (@Quantity IS NULL)
BEGIN
    -- Delay each iteration of our loop for 10 seconds
    WAITFOR DELAY '00:00:03'

    -- Check to see if someone has bought a dozen donuts yet
    SELECT @Quantity = Quantity FROM dbo.Orders WHERE Quantity = 12
END

-- Run some other query now that our dozen donut purchase condition has been met
EXEC dbo.GenerateCoupon

Fancy? No.  Practical? Yes.

No longer do I need to keep checking a table for results before I run a query - I can have WAITFOR do that for me.

If you know there is a specific time you want to wait for until you start pinging some process, you can incorporate WAITFOR TIME to make your checking even more intelligent:

DECLARE @Quantity smallint = NULL

-- Let's not start checking until 6am...that's when the donut shop opens
WAITFOR TIME '06:00:00'
-- Keep checking our table data until we find the data we want
WHILE (@Quantity IS NULL)
BEGIN
    -- Delay each iteration of our loop for 10 seconds
    WAITFOR DELAY '00:00:03'

    -- Check to see if someone has bought a dozen donuts yet
    SELECT @Quantity = Quantity FROM dbo.Orders WHERE Quantity = 12
END

-- Run some other query now that our dozen donut purchase condition has been met
EXEC dbo.GenerateCoupon