Data with Bert logo

SQL FAILS with Andy Mallon, Erin Stellato, and Mr. Anonymous!

Watch this week's video on YouTube

While most of us strive to make as few mistakes as possible when it comes to our servers and data, accidents do occasionally happen.

Sometimes those accidents are easily fixed while other times the solutions require herculean efforts (usually accompanied by lots of caffeine and cursing...or is that just me?).

This week I'm excited to have guests Andy Mallon (t), Erin Stellato (t), and Mr. ANONYMOUS (t) (don't spoil the fun by clicking these links until after watching!) share some of their most memorable SQL Server mishaps.

It's a video only post so be sure to watch above or on my YouTube channel (and be sure to watch until the end for a special...furry...cameo).

T-SQL Documentation Generator

MJ-t-sql-TuesdayThis post is a response to this month's T-SQL Tuesday #110 prompt by Garry Bargsley.  T-SQL Tuesday is a way for the SQL Server community to share ideas about different database and professional topics every month.

This month's topic asks to share how we automate certain processes.


Watch this week's video on YouTube

I'm a fan of keeping documentation close to the code. I prefer writing my documentation directly above a procedure, function, or view definition because that's where it will be most beneficial to myself and other developers.

Not to mention that's the only place where the documentation has any chance of staying up to date when changes to the code are made.

What drives me crazy though is making a copy of that documentation somewhere else, into a different format. You know, like when someone without database access needs you to send them a description of all of the procedures for a project. Or if you are writing end-user documentation for your functions and views.

Not only is creating a copy of the documentation tedious, but there is no chance that it will stay up to date with future code changes.

So today I want to share how I automate some of my documentation generation directly from my code.

C# XML Style Documentation in T-SQL

C# uses XML to document objects directly in the code:

/// <summary>
/// Retrieves the details for a user.
/// </summary>
/// <param name="id">The internal id of the user.</param>
/// <returns>A user object.</returns>
public User GetUserDetails(int id)
{
    User user = ...
    return user;
}

I like this format: the documentation is directly next to the code and it is structured as XML, making it easy to parse for other uses (eg. use a static document generator to create end-user documentation directly from these comments).

This format is easily transferable to T-SQL:

/*
<documentation>
  <author>Bert</author>
  <summary>Retrieves the details for a user.</summary>
  <param name="@UserId">The internal id of the user.</param>
  <returns>The username, user's full name, and join date</returns>
</documentation>
*/
CREATE PROCEDURE dbo.USP_SelectUserDetails
       @UserId int
AS
BEGIN
    SELECT Username, FullName, JoinDate FROM dbo.[User] WHERE Id = @UserId
END
GO


/*
<documentation>
  <author>Bert</author>
  <summary>Returns the value 'A'.</summary>
  <param name="@AnyNumber">Can be any number.  Will be ignored.</param>
  <param name="@AnotherNumber">A different number.  Will also be ignored.</param>
  <returns>The value 'A'.</returns>
</documentation>
*/
CREATE FUNCTION dbo.UDF_SelectA
(
    @AnyNumber int,
    @AnotherNumber int
)
RETURNS char(1)
AS
BEGIN
       RETURN 'A';
END
GO

Sure, this might not be as visually appealing as the traditional starred comment block, but I've wrestled with parsing enough free formatted text that I don't mind a little extra structure in my comments.

Querying the Documentation

Now that our T-SQL object documentation has some structure, it's pretty easy to query and extract those XML comments:

WITH DocumentationDefintions AS (
SELECT
    SCHEMA_NAME(o.schema_id) as schema_name,
    o.name as object_name,
    o.create_date,
    o.modify_date,
    CAST(SUBSTRING(m.definition,CHARINDEX('<documentation>',m.definition),CHARINDEX('</documentation>',m.definition)+LEN('</documentation>')-CHARINDEX('<documentation>',m.definition)) AS XML) AS Documentation,
    p.parameter_id as parameter_order,
    p.name as parameter_name,
    t.name as parameter_type,
    p.max_length,
    p.precision,
    p.scale,
    p.is_output
FROM
    sys.objects o
    INNER JOIN sys.sql_modules m
        ON o.object_id = m.object_id
    LEFT JOIN sys.parameters p
        ON o.object_id = p.object_id
    INNER JOIN sys.types t
        ON p.system_type_id = t.system_type_id
WHERE 
    o.type in ('P','FN','IF','TF')
)
SELECT
    d.schema_name,
    d.object_name,
    d.parameter_name,
    d.parameter_type,
    t.c.value('author[1]','varchar(100)') as Author,
    t.c.value('summary[1]','varchar(max)') as Summary,
    t.c.value('returns[1]','varchar(max)') as Returns,
    p.c.value('@name','varchar(100)') as DocumentedParamName,
    p.c.value('.','varchar(100)') as ParamDescription
FROM
    DocumentationDefintions d 
    OUTER APPLY d.Documentation.nodes('/documentation') as t(c) 
    OUTER APPLY d.Documentation.nodes('/documentation/param') as p(c)
WHERE
    p.c.value('@name','varchar(100)') IS NULL -- objects that don't have documentation
    OR p.c.value('@name','varchar(100)') = d.parameter_name -- joining our documented parms with the actual ones
ORDER BY
    d.schema_name,
    d.object_name,
    d.parameter_order

This query pulls the parameters of our procedures and functions from sys.parameters and joins them with what we documented in our XML documentation. This gives us some nicely formatted documentation as well as visibility into what objects haven't been documented yet:

image

Only the Beginning

At this point, our procedure and function documentation is easily accessible via query. We can use this to dump the information into an Excel file for a project manager, or schedule a job to generate some static HTML documentation directly from the source every night.

This can be extended even further depending on your needs, but at least this is an automated starting point for generating further documentation directly from the T-SQL source.

Visualizing Hash Match Join Internals And Understanding Their Implications

This post is part 3 in a series about physical join operators (be sure to check out part 1 - nested loops joins, and part 2 - merge joins).

Watch this week's video on YouTube

Hash Match joins are the dependable workhorses of physical join operators.

While Nested Loops joins will fail if the data is too large to fit into memory, and Merge Joins require that the input data are sorted, a Hash Match will join any two data inputs you throw at it (as long as the join has an equality predicate and you have enough space in tempdb). 

The base hash match algorithm has two phases that work like this:

Hash-Match-Join-Looping-1

During the first "Build" phase, SQL Server builds an in-memory hash table from one of the inputs (typically the smaller of the two).  The hashes are calculated based on the join keys of the input data and then stored along with the row in the hash table under that hash bucket.  Most of the time there is only 1 row of data per hash bucket except when:

  1. There are rows with duplicate join keys.
  2. The hashing function produces a collision and totally different join keys receive the same hash (uncommon but possible).

Once the hash table is built, SQL Server begins the "Probe" phase.  During this second phase, SQL Server calculates the join key hash for each row in the second input, and checks to see if it exists in the hash table created in the first build phase.  If it finds a match for that hash, it then verifies if the join keys between the row(s) in the hash table and the row from the second table actually match (it needs to perform this verification due to potential hash collisions).

A common variation on this hash match algorithm occurs when the build phase cannot create a hash table that can be fully stored in memory:

Hash-Match-Join-spill-looping-1

This happens when the data is larger than what can be stored in memory or when SQL Server grants an inadequate amount of memory required for the hash match join.

When SQL Server runs doesn't have enough memory to store the build phase hash table, it proceeds by keeping some of the buckets in memory, while spilling the other buckets to tempdb. 

During the probe phase, SQL Server joins the rows of data from the second input to buckets from the build phase that are in memory. If the bucket that the row potentially matches isn't currently in memory, SQL Server writes that row to tempdb for later comparison. 

Once the matches for one bucket are complete, SQL Server clears that data from memory and loads the next bucket(s) into memory. It then compares the second input's rows (currently residing in tempdb) with the new in-memory buckets.

As with every physical join operator in this series, there are way more details about the hash match operator on Hugo Kornelis's reference on hash matches.

What Do Hash Match Joins Reveal?

Knowing the internals of how a hash match join works allows us to infer what the optimizer thinks about our data and the join's upstream operators, helping us focus our performance tuning efforts. 

Here are a few scenarios to consider the next time you see a hash match join being used in your execution plan:

  • While hash match joins are able to join huge sets of data, building the hash table from the first input is a blocking operation that will prevent downstream operators from executing. Due to this, I always check to see if there is an easy way to convert a hash match to either a nested loops or merge join.  Sometimes that won't be possible (too many rows for nested loops or unsorted data for merge joins) but it's always worth checking if a simple index change or improved estimates from a statistics update would cause SQL Server to pick a non-blocking hash match join operator.

  • Hash match joins are great for large joins - since they can spill to tempdb, it allows them to perform joins on large datasets that would fail an in-memory join with either the nested loops or merge join operators.

    • Seeing a hash match join operator means SQL Server thinks the upstream inputs are big.  If we know our inputs shouldn't be that big, then it's worth checking if we have a stats/estimation problem that is causing SQL Server to choose a hash match join incorrectly.
  • When executed in memory, hash match joins are fairly efficient. Problems arise when the build phase spills to tempdb.

    • If I notice the little yellow triangle indicating that the join is spilling to tempdb, I take a look to see why: if the data is larger than the server's available memory, there's not much that can be done there, but if the memory grant seems unusually small that means we probably have another statistics problem that is providing the SQL Server optimizer estimates that are too low.

2018 Community Influencer of the Year

BertWagnerRibbonA few days ago I was surprised to learn from Aaron Bertrand of SentryOne that he was selecting me as Community Influencer of the Year for 2018.

Aaron states that the Community Influencer of the Year award goes to, "someone who has made a dramatic impact on the SQL Server community."  This type of recognition is wonderful to hear and I am honored to have such kind words come from someone like Aaron.

I'm especially flattered since the previous award recipients are Andy Mallon (2016) and Drew Furgiuele (2017). Being recognized in the same league as those two feels amazing. Andy and Drew are incredible and inspiring, both in the data platform community and outside of it.

2018 was a great year and I'm looking forward to 2019. The number of people I've befriended this year and have helped me along the way is staggering; thank you to each and every one of you.

And finally, thank you all for following me along on this journey. I truly appreciate the interactions I have with you in-person, on Twitter, in the comments, and everywhere else.

Visualizing Merge Join Internals And Understanding Their Implications

This post is part 2 in a series about physical join operators (be sure to check out part 1 - nested loops joins, and part 3 - hash match joins).

Watch this week's video on YouTube

Merge joins are theoretically the fastest* physical join operators available, however they require that data from both inputs is sorted:

Merge-Join-1

The base algorithm works as follows: SQL Server compares the first rows from both sorted inputs.  It then continues comparing the next rows from the second input as long as the values match the first input's value.

Once the values no longer match, SQL Server increments the row of whichever input has the smaller value - it then continues performing comparisons and outputting any joined records. (For more detailed information, be sure to check out Craig Freedman's post on merge joins.)

This is efficient because in most instances SQL Server never has to go back and read any rows multiple times.  The exception here happens when duplicate values exist in both input tables (or rather, SQL Server doesn't have meta data available proving that duplicates don't exist in both tables) and SQL Server has to perform a many-to-many merge join:

Merge-Join-many-to-many

Note: The image above and the explanation below are "good enough" for understanding this process for practical purposes - if you want to dive into the peek-ahead buffers, optimizations, and other inner workings of this process, I highly recommend reading through Hugo Kornelis's reference on merge joins.

A many-to-many join forces SQL Server to write any duplicated values in the second table into a worktable in tempdb and do the comparisons there.  If those duplicated values are also duplicated in the first table, SQL Server then compares the first table's values to those already stored in the worktable.

What Do Merge Joins Reveal?

Knowing the internals of how a merge join works allows us to infer what the optimizer thinks about our data and the join's upstream operators, helping us focus our performance tuning efforts. 

Here are a few scenarios to consider the next time you see a merge join being used in your execution plan:

  • The optimizer chooses to use a merge join when the input data is already sorted or SQL Server can sort the data for a low enough cost.  Additionally, the optimizer is fairly pessimistic at calculating the costs of merge joins (great explanation by Joe Obbish), so if a merge join makes its way into your plans, it probably means that it is fairly efficient.

  • While a merge join may be efficient, it's always worth looking at why the data coming in to the merge join operator is already sorted:

    • If it's sorted because the merge join is pulling data directly from an index sorted on your join keys, then there is not much to be concerned about.
    • If the optimizer added a sort to the upstream merge join though, it may be worth investigating whether it's possible to presort that data so SQL Server doesn't need to sort it on its own.  Often times this can be as simple as redefining an included index column to a key column - if you are adding it as the last key column in the index then regression impact is usually minor but you may be able to allow SQL Server to use the merge join without any additional sorting required.
  • If your inputs contain many duplicates, it may be worth checking if a merge join is really the most efficient operator for the join.  As outlined above, many-to-many merge joins require tempdb usage which could become a bottle neck!

So while merge joins are typically not the high-cost problem spots in your execution plans, it's always worth investigating upstream operators to see if some additional improvements can be made.

[]{#fastest-exception}

*NOTE: There are always exceptions to the rule.  Merge joins have the fastest algorithm since each row only needs to be read once from the source inputs.  Also, optimizations occurring in other join operators can give those operators better performance under certain conditions.

For example, a single row outer table with an indexed inner table using a nested loops join will outperform the same setup with a merge join because of the inner loops joins' optimizations:

DROP TABLE IF EXISTS T1;
GO
CREATE TABLE T1 (Id int identity PRIMARY KEY, Col1 CHAR(1000));
GO

INSERT INTO T1 VALUES('');
GO

DROP TABLE IF EXISTS T2;
GO
CREATE TABLE T2 (Id int identity PRIMARY KEY, Col1 CHAR(1000));
GO

INSERT INTO T2 VALUES('');
GO 100

-- Turn on execution plans and check actual rows for T2
SELECT *
FROM T1 INNER LOOP JOIN T2 ON T1.Id = T2.Id;

SELECT *
FROM T1 INNER MERGE JOIN T2 ON T1.Id = T2.Id;

*There might also be instances where inputs with many duplicate records that require worktables may be slower than a nested loop join. *

As I mentioned though, I typically find these types of scenarios to be the exceptions when encountering merge joins in the real-world.