
In-Memory OLTP: A Case Study

Watch this week's video on YouTube

When In-Memory OLTP was first released in SQL Server 2014, I was excited to start using it.  All I could think was "my queries are going to run so FAST!"

Well, I never got around to implementing In-Memory OLTP.  Besides having an incompatible version of SQL Server at the time, the in-memory features had too many limitations for my specific use-cases.

Fast forward a few years, and I've done nothing with In-Memory OLTP.  Nothing that is until I saw Erin Stellato present at our Northern Ohio SQL Server User Group a few weeks ago - her presentation inspired me to take a look at In-Memory OLTP again to see if I could use it.

Use case: Improving ETL staging loads

After being refreshed on the ins and outs of in-memory SQL Server, I wanted to see if I could apply some of the techniques to one of my ETLs.

The ETL consists of two major steps:

  1. Shred documents into row/column data and then dump that data into a staging table.
  2. Delete some of the documents from the staging table.

In the real world, there's a step 1.5 that does some processing of the data, but it's not relevant to these in-memory OLTP demos.

So step one was to create my staging tables.  The memory-optimized table is called "NewStage1" and the traditional disk-based table is called "OldStage1":

DROP DATABASE IF EXISTS InMemoryTest;
GO
CREATE DATABASE InMemoryTest;
GO
USE InMemoryTest;
GO

ALTER DATABASE InMemoryTest ADD FILEGROUP imoltp_mod CONTAINS MEMORY_OPTIMIZED_DATA;
GO 
ALTER DATABASE InMemoryTest ADD FILE (name='imoltp_mod1', filename='C:\Program Files\Microsoft SQL Server\MSSQL14.MSSQLSERVER\MSSQL\DATA\imoltp_mod1') TO FILEGROUP imoltp_mod;
GO  
ALTER DATABASE InMemoryTest SET MEMORY_OPTIMIZED_ELEVATE_TO_SNAPSHOT=ON;
GO 
ALTER DATABASE InMemoryTest SET RECOVERY SIMPLE
GO

--Numbers Table
-- This needs to be in-memory to be called from a natively compiled procedure
DROP TABLE IF EXISTS InMemoryTest.dbo.Numbers;
GO
CREATE TABLE InMemoryTest.dbo.Numbers
(
    n int
    INDEX ix_n NONCLUSTERED HASH (n) WITH (BUCKET_COUNT=400000)
)
WITH (MEMORY_OPTIMIZED=ON, DURABILITY=SCHEMA_ONLY);  
GO 

INSERT INTO dbo.Numbers (n)
SELECT TOP (4000000) n = CONVERT(INT, ROW_NUMBER() OVER (ORDER BY s1.[object_id]))
FROM sys.all_objects AS s1 CROSS JOIN sys.all_objects AS s2
OPTION (MAXDOP 1);

-- Set up on-disk tables
DROP TABLE IF EXISTS InMemoryTest.dbo.OldStage1;
GO
CREATE TABLE InMemoryTest.dbo.OldStage1
(
    Id int,
    Col1 uniqueidentifier,
    Col2 uniqueidentifier,
    Col3 varchar(1000),
    Col4 varchar(50),
    Col5 varchar(50),
    Col6 varchar(50),
    Col7 int,
    Col8 int,
    Col9 varchar(50),
    Col10 varchar(900),
    Col11 varchar(900),
    Col12 int,
    Col13 int,
    Col14 bit
);
GO
CREATE CLUSTERED INDEX CL_Id ON InMemoryTest.dbo.OldStage1 (Id);
GO


--  Set up in-memory tables and natively compiled procedures
DROP TABLE IF EXISTS InMemoryTest.dbo.NewStage1;
GO
CREATE TABLE InMemoryTest.dbo.NewStage1
(
    Id int,
    Col1 uniqueidentifier,
    Col2 uniqueidentifier,
    Col3 varchar(1000),
    Col4 varchar(50),
    Col5 varchar(50),
    Col6 varchar(50),
    Col7 int,
    Col8 int,
    Col9 varchar(50),
    Col10 varchar(900),
    Col11 varchar(900),
    Col12 int,
    Col13 int,
    Col14 bit
    INDEX ix_id NONCLUSTERED HASH (id) WITH (BUCKET_COUNT=10)
)
WITH (MEMORY_OPTIMIZED=ON, DURABILITY=SCHEMA_ONLY);  
GO

A few things to keep in mind:

  • The tables have the same columns and datatypes, with the only difference being that the NewStage1 table is memory optimized.
  • My database is using simple recovery so I am able to perform minimal logging/bulk operations on my disk-based table.
  • Additionally, I'm using the SCHEMA_ONLY durability setting.  This gives me outstanding performance because there is no writing to the transaction log!  However, this means if I lose my in-memory data for any reason (crash, restart, corruption, etc...) I am completely out of luck.  This is fine for my staging data scenario since I can easily recreate the data if necessary (a durable alternative is sketched below).
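
For comparison, a durable memory-optimized table only changes the DURABILITY option (and requires a primary key).  Here's a rough sketch of what that could look like - this variant isn't used anywhere else in the demo, and the surrogate key column is just one way to satisfy the primary key requirement:

CREATE TABLE InMemoryTest.dbo.NewStage1_Durable
(
    RowId int IDENTITY(1,1) NOT NULL,
    Id int,
    Col1 uniqueidentifier,
    -- ...remaining columns same as NewStage1...
    -- Durable memory-optimized tables must have a primary key
    CONSTRAINT PK_NewStage1_Durable PRIMARY KEY NONCLUSTERED HASH (RowId) WITH (BUCKET_COUNT=16777216),
    INDEX ix_id NONCLUSTERED HASH (Id) WITH (BUCKET_COUNT=10)
)
WITH (MEMORY_OPTIMIZED=ON, DURABILITY=SCHEMA_AND_DATA);
GO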

Inserting and deleting data

Next I'm going to create procedures for inserting data into and deleting data from both my new and old staging tables:

DROP PROCEDURE IF EXISTS dbo.Insert_OldStage1;
GO
CREATE PROCEDURE dbo.Insert_OldStage1
    @Id int,
    @Rows int
AS
BEGIN
    INSERT INTO InMemoryTest.dbo.OldStage1 (Id, Col1, Col2, Col3, Col4, Col5, Col6, Col7, Col8, Col9, Col10, Col11, Col12, Col13, Col14)
    SELECT Id, Col1, Col2, Col3, Col4, Col5, Col6, Col7, Col8, Col9, Col10, Col11, Col12, Col13, Col14
    FROM
    (
    SELECT
    @Id as Id,
    '92D14DA3-2C55-4E50-A965-7D3C941417B3' as Col1,
    '92D14DA3-2C55-4E50-A965-7D3C941417B3' as Col2,
    'aaaaaaaaaaaaaaaaaaaaaaaaaaaaaa' as Col3,
    'aaaaaaaaaaaaaaaaaaaa' as Col4,
    'aaaaaaaaaaaaaaaaaaaa' as Col5,
    'aaaaaaaaaaaaaaaaaaaa' as Col6,
    0 as Col7,
    0 as Col8,
    'aaaaaaaaaaaaaaaaaaaa' as Col9,
    'aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa' as Col10,
    'aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa' as Col11,
    1 as Col12,
    1 as Col13,
    1 as Col14
    )a
    CROSS APPLY
    (
    SELECT TOP (@Rows) n FROM dbo.Numbers
    )b
END
GO

DROP PROCEDURE IF EXISTS dbo.Delete_OldStage1;
GO
CREATE PROCEDURE dbo.Delete_OldStage1
    @Id int
AS
BEGIN
    -- Delete in a loop to prevent filling the transaction log
    DECLARE 
        @Count int = 0,
        @for_delete int,
        @chunk_size int = 1000000

    SELECT @for_delete = COUNT(Id) FROM InMemoryTest.dbo.OldStage1 
                            WHERE Id = @Id;

    WHILE (@Count < @for_delete)
    BEGIN
        SELECT @Count = @Count + @chunk_size;

        BEGIN TRAN
            DELETE TOP(@chunk_size) FROM InMemoryTest.dbo.OldStage1 WHERE Id = @Id
        COMMIT TRAN
    END
END;
GO



DROP PROCEDURE IF EXISTS dbo.Insert_NewStage1;
GO
CREATE PROCEDURE dbo.Insert_NewStage1
    @Id int,
    @Rows int
WITH NATIVE_COMPILATION, SCHEMABINDING  
AS   
BEGIN ATOMIC   
WITH (TRANSACTION ISOLATION LEVEL = SNAPSHOT, LANGUAGE = N'us_english')  

    INSERT INTO dbo.NewStage1 (Id, Col1, Col2, Col3, Col4, Col5, Col6, Col7, Col8, Col9, Col10, Col11, Col12, Col13, Col14)
    SELECT Id, Col1, Col2, Col3, Col4, Col5, Col6, Col7, Col8, Col9, Col10, Col11, Col12, Col13, Col14 
    FROM
    (
    SELECT
    @Id as Id,
    '92D14DA3-2C55-4E50-A965-7D3C941417B3' as Col1,
    '92D14DA3-2C55-4E50-A965-7D3C941417B3' as Col2,
    'aaaaaaaaaaaaaaaaaaaaaaaaaaaaaa' as Col3,
    'aaaaaaaaaaaaaaaaaaaa' as Col4,
    'aaaaaaaaaaaaaaaaaaaa' as Col5,
    'aaaaaaaaaaaaaaaaaaaa' as Col6,
    0 as Col7,
    0 as Col8,
    'aaaaaaaaaaaaaaaaaaaa' as Col9,
    'aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa' as Col10,
    'aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa' as Col11,
    1 as Col12,
    1 as Col13,
    1 as Col14
    )a
    CROSS APPLY
    (
    SELECT TOP (@Rows) n FROM dbo.Numbers
    )b


END;   
GO  

DROP PROCEDURE IF EXISTS dbo.Delete_NewStage1;
GO
CREATE PROCEDURE dbo.Delete_NewStage1
    @Id int 
WITH NATIVE_COMPILATION, SCHEMABINDING  
AS   
BEGIN ATOMIC   
WITH (TRANSACTION ISOLATION LEVEL = SNAPSHOT, LANGUAGE = N'us_english')  

    DELETE FROM dbo.NewStage1 WHERE Id = @Id;

END   
GO  

A few more things to note:

  • My new procedures are natively compiled: SQL Server compiles them ahead of time, so at run time it can just execute them without any extra steps.  The procedures that target my old disk-based tables run as interpreted T-SQL every time.
  • In the old delete procedure, I am deleting data in chunks so my transaction log doesn't get full.  In the new version of the procedure, I don't have to worry about this because, as I mentioned earlier, my memory optimized table doesn't have to use the transaction log.

Let's simulate a load

It's time to see if all of this fancy in-memory stuff is actually worth all of the restrictions.

In my load, I'm going to mimic loading three documents with around 3 million rows each.  Then, I'm going to delete the second document from each table:

-- Old on-disk method
-- Insert data for processing
EXEC InMemoryTest.dbo.Insert_OldStage1 @Id=1, @Rows=2500000;
GO
EXEC InMemoryTest.dbo.Insert_OldStage1 @Id=2, @Rows=3400000;
GO 
EXEC InMemoryTest.dbo.Insert_OldStage1 @Id=3, @Rows=2800000;
GO 

-- Delete set of records after processed
EXEC InMemoryTest.dbo.Delete_OldStage1 @Id = 2
GO

-- New in-memory method
-- Insert data for processing
EXEC InMemoryTest.dbo.Insert_NewStage1 @Id=1, @Rows=2500000;
GO
EXEC InMemoryTest.dbo.Insert_NewStage1 @Id=2, @Rows=3400000;
GO 
EXEC InMemoryTest.dbo.Insert_NewStage1 @Id=3, @Rows=2800000;
GO 

-- Delete set of records after processed
EXEC InMemoryTest.dbo.Delete_NewStage1 @Id = 2
GO
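
If you want to time each batch yourself, a simple wrapper like the following works (this is just one way to measure - SET STATISTICS TIME ON or the SSMS execution timer are fine too):

DECLARE @start datetime2 = SYSDATETIME();

EXEC InMemoryTest.dbo.Insert_NewStage1 @Id=1, @Rows=2500000;
EXEC InMemoryTest.dbo.Insert_NewStage1 @Id=2, @Rows=3400000;
EXEC InMemoryTest.dbo.Insert_NewStage1 @Id=3, @Rows=2800000;

SELECT DATEDIFF(SECOND, @start, SYSDATETIME()) AS InsertSeconds;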

The in-memory version should have a significant advantage because:

  1. The natively compiled procedure is precompiled (shouldn't be a huge deal here since we are doing everything in a single INSERT INTO...SELECT).
  2. The in-memory table inserts/deletes don't have to write to the transaction log (this should be huge!)

Results

                         **Disk-based**   **In-Memory**
  INSERT 3 documents     65 sec           6 sec
  DELETE 1 document      46 sec           0 sec
  Total time             111 sec          6 sec
  Difference             -95% slower      1750% faster

The results speak for themselves.  In this particular example, in-memory blows the disk-based solution out of the water.

Obviously there are downsides to in-memory (like consuming a lot of memory) but if you are going for pure speed, there's nothing faster.

Warning! I am not you.

And you are not me.

While in-memory works great for my ETL scenario, there are many requirements and limitations.  It's not going to work in every scenario.  Be sure you understand the in-memory durability options to prevent any potential data loss and try it out for yourself!  You might be surprised by the performance gains you'll see.

3 Essential Tools For The SQL Server Developer

This post is a response to this month's T-SQL Tuesday #101 prompt by Jens Vestergaard. T-SQL Tuesday is a way for SQL Server bloggers to share ideas about different database and professional topics every month.

This month's topic is about what essential SQL Server related tools you use on a regular basis.


SQL Server Management Studio is an excellent tool for my day-to-day SQL Server development needs.

However, sometimes I need to do things besides writing queries and managing server objects.  Below are the three tools I use most often when working with SQL Server.

Watch this week's video on YouTube

1. WinMerge

Often I need to compare the bodies of two stored procedures, table definitions, etc... to find differences.

While there are some built-in tools for doing difference comparisons in Visual Studio and SSMS source control plugins, I prefer using the third-party open-source tool WinMerge.


The tool is a pretty straightforward difference checking tool, highlighting lines where the data between two files is different.

It has some other merge functions available in it, but honestly I keep it simple and use it to just look for differences between two pieces of text.

2. OnTopReplica

When on a single display, screen real estate is at a premium.  This is especially true if you are forced to use a projector that's limited to 1024x768 resolution...

OnTopReplica to the rescue!  This nifty open-source tool allows you to select a window and keep it open on top of all other windows.

This is great for when I want to reference some piece of code or text on screen while working in another window.


In addition to forcing a window open to stay on top, it allows you to crop and resize that window so only the relevant parts are visible.

The OnTopReplica view is live too - that means it's great to use as a magnifier on your SSMS result sets when presenting (instead of constantly having to zoom in and out with ZoomIt):

Look at those beautifully zoomed-in results!

3. ScreenToGif

Sometimes explaining concepts with pictures is hard.  For example, wouldn't that last screenshot be way better if it was animated?


ScreenToGif is an open-source screen capture tool that does an excellent job compressing your recorded videos into gif animations.  It also lets you edit individual frames to add text, graphics, and keyboard shortcut overlays.

Power Query vs DAX: In The Power BI Kitchen with Eugene Meidinger

Watch this week's video on YouTube

In this week's video with Power BI expert Eugene Meidinger, we're in the kitchen learning about Power BI.

Power BI has two languages available for working with data: Power Query and DAX.

Power Query is most useful for cleaning and preparing your data as it comes from a variety of sources: Excel, csv, SQL Server, etc...

DAX is the heavy hitter that allows you to mix all of these disparate data sources into a single Power BI model.

Be sure to visit Eugene's blog post for more detailed information.

AT TIME ZONE: The easy way to deal with time zones and daylight savings time

Watch this week's video on YouTube

Many people think daylight savings time was created to help align the hours that the sun is up with our waking hours so farmers and school children didn't have to be outside in the dark.  Or something like that.

Well let me fill you in on a little secret - daylight savings time was actually created by a government works project to ensure that programmers could forever write tedious conditional logic in their date-based queries to handle the date time conversions surrounding time zones and daylight savings time.

While these types of datetime conversions have historically been a pain, SQL Server 2016 introduced AT TIME ZONE, a feature that makes these types of conversions a breeze.

Defining Time Zone

Let's say you have a datetime value that you know is encoded in UTC (if you don't know what time zone your data was originally encoded in, you're out of luck):

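For example (the variable name and value here are just for illustration):

DECLARE @OrderPlacedDateUTC datetime2 = '2018-03-26 11:54:21';

SELECT @OrderPlacedDateUTC AS OrderPlacedDateUTC;
-- Returns: 2018-03-26 11:54:21.0000000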

Besides the naming convention, there's nothing that tells us that our datetime is in UTC.  But if we know that to be the case, we can use AT TIME ZONE to add a UTC offset to the timestamp to make it official:

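Continuing with the same illustrative value:

DECLARE @OrderPlacedDateUTC datetime2 = '2018-03-26 11:54:21';

SELECT @OrderPlacedDateUTC AT TIME ZONE 'UTC' AS OrderPlacedDateUTC;
-- Returns: 2018-03-26 11:54:21.0000000 +00:00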

See that +00:00 at the end of our value?  That's our time zone offset - it tells us how many hours and minutes our datetime is offset from UTC.  With this offset in place, our UTC datetime isn't indicated only by the variable name - it's actually encoded in the data itself.

Converting Time Zones

Now that we have a well-defined UTC datetime, we can use AT TIME ZONE to easily convert our data to different time zones.

For example, if we want to convert our UTC datetime to Eastern Standard Time, we can tack on AT TIME ZONE 'Eastern Standard Time':

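With the same illustrative value as before:

DECLARE @OrderPlacedDateUTC datetime2 = '2018-03-26 11:54:21';

SELECT @OrderPlacedDateUTC AT TIME ZONE 'UTC' AT TIME ZONE 'Eastern Standard Time' AS OrderPlacedDateEastern;
-- Returns: 2018-03-26 07:54:21.0000000 -04:00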

The time portion of our datetime is now showing 7am with the offset indicating we are 4 hours behind UTC.  Easy time zone conversions, yes!

But What About Our Farmer Friends?

So AT TIME ZONE makes it easy to convert between time zones - but how does it handle daylight savings time conversions?

Well, on March 11, 2018 I lost an hour of my life when daylight savings time kicked in at 2am.  Does AT TIME ZONE recognize this theft?

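Here's a quick check using one minute before and one minute after the switch (input times are in UTC; Eastern clocks sprang forward from 2:00 a.m. to 3:00 a.m. that morning):

SELECT
    CONVERT(datetime2, '2018-03-11 06:59:00') AT TIME ZONE 'UTC' AT TIME ZONE 'Eastern Standard Time' AS BeforeSwitch,
    CONVERT(datetime2, '2018-03-11 07:00:00') AT TIME ZONE 'UTC' AT TIME ZONE 'Eastern Standard Time' AS AfterSwitch;
-- BeforeSwitch: 2018-03-11 01:59:00.0000000 -05:00
-- AfterSwitch:  2018-03-11 03:00:00.0000000 -04:00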

YES! Say so long to conditional logic that subtracts an hour based on the date; AT TIME ZONE handles the switch to DST with ease, as noted by the offset change from -05:00 to -04:00.

And just like that, programming for time zone and daylight savings time logic just became a little easier.

But I Don't Live In The Eastern Standard Time Zone!

No problem. In addition to AT TIME ZONE, SQL Server 2016 added this nifty system view that will give you the names of all of the time zones it supports:

SELECT * FROM sys.time_zone_info;
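
And if you only care about one time zone, you can filter on the name:

SELECT name, current_utc_offset, is_currently_dst
FROM sys.time_zone_info
WHERE name LIKE '%Eastern%';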


How To Steal Data Using Second Order SQL Injection Attacks

Watch this week's video on YouTube

One misconception some people have about SQL injection is that it can only happen when concatenating a user input parameter directly into your dynamically built query string:

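For example, a search procedure that glues user input straight into its SQL string (the procedure and table names here are made up for illustration):

CREATE PROCEDURE dbo.SearchUsers
    @UserName varchar(100)
AS
BEGIN
    DECLARE @query nvarchar(max);

    -- User input concatenated directly into the dynamic SQL string
    SET @query = N'SELECT UserName, Email FROM dbo.Users WHERE UserName = ''' + @UserName + N''';';

    EXEC (@query);
END
GO

-- Passing something like '''; DROP TABLE dbo.Users; --' as @UserName executes the attacker's SQL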

While this type of injection flaw is easy to spot, there are other less direct ways an injection attack can occur.

Second Order SQL Injection Attacks

SQL injection attacks that delay execution until a secondary query are known as "second order".

This means a malicious user can inject a query fragment into a query that isn't itself vulnerable to injection, and then have that injected SQL execute later in a second query that is vulnerable to SQL injection.

Let's look at an example.

Doggo Pics

Imagine a website where dog owners can share pictures of their best friends:


The internet is a great place

Now imagine that users of Doggo Pics can set a preference for what order they view pictures in when they visit the website.


When a user saves their sorting preference, the sort column and order get saved to a preferences table in the database:

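A sketch of what that procedure might look like - the table and column names are guesses, but the important part is that it is fully parameterized:

CREATE PROCEDURE dbo.UpdateSortOrder
    @UserId int,
    @SortColumn varchar(50),
    @SortOrder varchar(50)
AS
BEGIN
    -- No dynamic SQL here - the values are written to the preferences table as plain data
    UPDATE dbo.UserPreferences
    SET SortColumn = @SortColumn,
        SortOrder = @SortOrder
    WHERE UserId = @UserId;
END
GO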

No SQL injection vulnerabilities here

The next time the user goes to view the page, the doggo pictures will be sorted based on their saved preferences.  This works because the query pulling the pics/descriptions is dynamically sorting the data based on the user's preference:

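Something along these lines (again, hypothetical object names):

CREATE PROCEDURE dbo.GetDoggoPics
    @UserId int
AS
BEGIN
    DECLARE
        @SortColumn varchar(50),
        @SortOrder varchar(50),
        @query nvarchar(max);

    SELECT @SortColumn = SortColumn, @SortOrder = SortOrder
    FROM dbo.UserPreferences
    WHERE UserId = @UserId;

    -- The saved preference is concatenated straight into the ORDER BY - this is the injection point
    SET @query = N'SELECT PicUrl, Description FROM dbo.DoggoPics ORDER BY ' + @SortColumn + N' ' + @SortOrder + N';';

    EXEC (@query);
END
GO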

We are dynamically sorting based on the user's preference


Viewing doggos sorted by longest tongue length

The above flow is how the website is supposed to function.  So how does a malicious user inject SQL code into these queries if the only place they directly save input is the UpdateSortOrder procedure?

The Attack

So the developer of Doggo Pics was too busy enjoying pictures of doggos to implement any type of input validation on sort order preferences.  This means a hacker can do something like inject a SQL statement onto the end of the sort order dropdown:

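Instead of submitting a plain "asc" or "desc", the hacker submits something along these lines (the injected INSERT shown here is just one possible payload, using the same hypothetical table names as above):

-- Value submitted for the sort order preference:
-- desc; INSERT INTO dbo.DoggoPics (Description) SELECT UserName + ' | ' + Password FROM dbo.Users; --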

The hacker modifies the "desc" value to include an INSERT statement

When our dbo.UpdateSortOrder procedure executes on the backend, it looks like this:

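With that payload, the (perfectly parameterized) preference update executes something like this, storing the whole string as data:

EXEC dbo.UpdateSortOrder
    @UserId = 1337,
    @SortColumn = 'TongueLength',
    @SortOrder = 'desc; INSERT INTO dbo.DoggoPics (Description) SELECT UserName + '' | '' + Password FROM dbo.Users; --';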

See where this is going?  Now when our stored procedure that dynamically sorts the picture data executes, the hacker's INSERT statement is going to execute as well:

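The dynamic SQL that gets built and executed now looks something like this - a harmless sort followed by the attacker's INSERT:

SELECT PicUrl, Description FROM dbo.DoggoPics ORDER BY TongueLength desc; INSERT INTO dbo.DoggoPics (Description) SELECT UserName + ' | ' + Password FROM dbo.Users; --;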

What happens next is simple: the first time our malicious user goes to view the Doggo Pics page, they receive the pictures in their preferred sort order.  Additionally, an INSERT INTO statement executes back on the server.

The second time the user views the Doggo Pics page, the values from that previously run INSERT INTO statement are now visible on the screen.


So even though the first query the user encounters (saving sort order preferences) is entirely SQL injection free, our second order SQL injection attack occurs when our second SQL query dynamically executes the injected code that was stored in our user preferences table in the database.

How do I prevent first and second order SQL injection attacks?

I recently presented at the GroupBy conference where I showed exactly how to protect your data from these types of attacks.

My presentation was recorded and is available for you to watch on YouTube:

https://www.youtube.com/watch?v=qrOLg3wjjOs

You can also read more about different types of SQL injection attacks and preventative solutions by reading through my blog archives.