
Azure SQL vs Azure Table Storage


A year ago I built an app to keep track of pickup volleyball game scores and payments. It works well, but after a year of regular use it's time to update it with some improvements.

As part of the update process, I'm rewriting the data layer to use Azure's NoSQL Table Storage instead of Azure SQL.

Today I'll walk you through some of the details I considered when deciding to switch to a NoSQL storage solution.

Why Azure NoSQL Table Storage

Cost

Azure SQL has been good to me - it works, it's familiar, and it's relatively inexpensive. However, even though the $5/month Azure SQL bill isn't cost prohibitive (especially with free credits :)), ideally I'd love the app to be able to fund itself via the fractions of a penny that get rounded up as part of the app's balancing logic.

Azure Table Storage in comparison is dirt cheap. As of the time I'm writing this, it will cost me about $0.05/month to store and access my <1 GB of application data.

Another option I considered was Azure SQL Serverless. It's a nice alternative because it would allow me to keep the relational structure of my data; however, based on my historical app usage patterns, it would definitely end up costing more than Azure Table Storage. Cool service, but not a good fit for my scenario.

Simplicity

Another reason I went with Azure Table Storage is its simplicity.

Azure offers an alternative table storage option called Cosmos DB. Cosmos DB is Azure's premium table storage offering, but it was overkill for what my app needed.

Yes, Cosmos DB's global distribution is cool. Yes, its additional indexes on my data are great. However, as I mentioned, cost is the biggest factor in how I'm deciding on a storage solution for this project, and while Cosmos DB has lots of great features, they come at a price.

My preference for this app is to pay more up front in development time to create a better design than to rely on a service that does some of that for me at a higher monthly cost. In this case, simple features will accomplish what I need, so that's what I'm going with.

Design Considerations for Azure Table Storage

When rewriting my app's data layer, there were several new things that I had to account for in Azure Table Storage.

Primary Key

The primary key in Azure Table Storage is made up of two columns: PartitionKey and RowKey. This composite primary key is also the clustered (and only!) index for the table.

The PartitionKey in particular is important because it determines whether related rows of data will exist on the same underlying server or not. Different PartitionKey values may end up on different servers. This can be a good thing if you factor in parallel access in your design, or a significant bottleneck if all of your data ends up residing on a single server.
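
If you're coming from SQL Server, a rough T-SQL analogy may help (a sketch only - Table Storage is schemaless and isn't defined with DDL, and this table and its columns are hypothetical):

-- PartitionKey + RowKey together act like a composite clustered primary key
CREATE TABLE dbo.GameScores
(
    PartitionKey nvarchar(255) NOT NULL, -- determines which storage server holds the row
    RowKey nvarchar(255) NOT NULL,       -- must be unique within its partition
    Score int NULL,
    CONSTRAINT PK_GameScores PRIMARY KEY CLUSTERED (PartitionKey, RowKey)
);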

Latency

There are two aspects of latency to consider.

The first is how far you are from your Azure region. For me, this isn't a huge deal since the only users of the app are currently in northeast Ohio, so choosing to store all of the data in the same region is good enough.

The bigger consideration for this app is that with Azure Table Storage, you can't do any joins with your data. Well, you can join in your app, but you can't join in Azure itself.

This means that design is critical to reducing latency since joining multiple tables of data in your app will require multiple calls to the Azure Table Storage service. This is not something that is necessarily a deal breaker, just something that needs to be considered, especially if coming from a relational SQL background where you are using JOINs to filter and reduce your data before it is returned to your app.
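
To make that concrete, here is the kind of single-round-trip query you give up (a sketch using hypothetical tables loosely based on this app's domain, not its actual schema):

-- In Azure SQL, one call joins and filters everything server-side:
SELECT
    g.GameDate,
    p.PlayerName,
    pay.Amount
FROM
    dbo.Games g
    INNER JOIN dbo.Payments pay ON pay.GameId = g.Id
    INNER JOIN dbo.Players p ON p.Id = pay.PlayerId
WHERE
    g.GameDate >= '20190101';
-- With Azure Table Storage, producing the same result means making separate
-- calls to retrieve each table's rows and then joining them in application code.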

Row Size Limitations

NoSQL gets a lot of hate for its common pattern of storing giant blobs of semi-structured data in a single field (a design that makes sense given the latency considerations above).

However, this becomes a delicate balancing act since each row in Azure Table Storage can only be a maximum of 1 MB in size. This makes you want to fit as much data into a single call as possible (to reduce the number of calls) while also not exceeding the 1 MB row size.

Azure Table Storage does allow up to 252 columns of data per table (plus the required PartitionKey, RowKey, and Timestamp columns), so at least your 1 MB of data will be organized.

…and more!

The above details were the primary considerations I had to take into account for my specific app. There are other factors, like data consistency, durability, and more, that you may want to take into account based on your app's goals and usage patterns.

Conclusion

While NoSQL can often break many of the relational concepts we are used to using, it is often the means for achieving the cheapest pay-only-for-what-you-use pricing on cloud providers.

When a SQL UPDATE Statement DELETES Rows


At first I wasn't going to write this post. I thought it would be too simple.

Then I watched two experienced SQL developers go through similar scenarios last week.

Sometimes the simple things are worth revisiting, so...

A SQL Server UPDATE Bug

I received a message from someone stating that when they update a row in their table, the row gets deleted.

Thinking that was strange, I asked the user if they could reproduce the issue. And they did. And after running the update statement, the row disappeared.

"WHAT THE...?"

Reproducing the Issue

So here's the scenario: we had an SSIS configuration table that looks something like this:

DROP TABLE IF EXISTS ##Configuration;
CREATE TABLE ##Configuration    
(
    ConfigurationFilter nvarchar(255) PRIMARY KEY,
    ConfiguredValue nvarchar(255),
    ConfiguredValueType nvarchar(20)
    -- some other fields
);

If you use SSIS, you might be familiar with this setup. In the table we had some innocuous looking rows:

INSERT INTO ##Configuration VALUES ('AdventureWorks_ETL_Bypass','1','int');
INSERT INTO ##Configuration VALUES ('WideWorldImporters_ETL_Bypass','0','int');
INSERT INTO ##Configuration VALUES ('Northwind_ETL_Bypass','1','int');

Querying a single ConfigurationFilter value returns a single row:

SELECT * FROM ##Configuration WHERE ConfigurationFilter = 'AdventureWorks_ETL_Bypass'

ConfigurationFilter            ConfiguredValue  ConfiguredValueType
-----------------------------  ---------------  -------------------
AdventureWorks_ETL_Bypass      1                int

Let's say we now want to update the 1 value to a 0:

UPDATE ##Configuration SET ConfigurationFilter = '0' 
WHERE ConfigurationFilter = 'AdventureWorks_ETL_Bypass'

Then, let's check to see if our change went through:

SELECT * FROM ##Configuration WHERE ConfigurationFilter = 'AdventureWorks_ETL_Bypass'

(0 rows returned)

"WHAT THE ...?"

Following the Rules

Do you see the problem?

Of course you do. But in the excitement of the moment, I didn't see the issue. I thought there was some SQL Server feature taking over that I didn't understand. Or possibly a bug in how UPDATE works in certain scenarios. How could an UPDATE possibly DELETE a row of data?

Look at the above UPDATE statement again. Our WHERE clause is filtering on the ConfigurationFilter field, which in this case is our table's primary key; it will only ever return one unique row.

That is until we change the value of that row's primary key: the SET clause is also updating ConfigurationFilter. This is the mistake. Since ConfigurationFilter has a different value after the update, our original query makes it appear that the row was deleted - when in reality it is now considered a different row based on the way we defined our primary key:

ConfigurationFilter            ConfiguredValue  ConfiguredValueType
-----------------------------  ---------------  -------------------
0                              1                int
Northwind_ETL_Bypass           1                int
WideWorldImporters_ETL_Bypass  0                int

Since there are a lot of "Config..." names in this table, the field used in the SET clause should have been ConfiguredValue instead of ConfigurationFilter. It was a simple case of updating the wrong field.
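
The statement the user meant to run looks like this:

UPDATE ##Configuration SET ConfiguredValue = '0' 
WHERE ConfigurationFilter = 'AdventureWorks_ETL_Bypass';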

Lesson Learned

SQL Server has been thoroughly vetted by running on millions(?) of systems. Bugs do exist, but the chances of you discovering a bug, let alone one that affects as basic a feature as UPDATE, are very slim at this point.

The lesson here is that if you do think you find an issue, go back and check your query: it's more likely there was an error with the connection between chair and keyboard rather than with the tool itself.

SQL Server 2019 Feature Power Rankings


With the release of SQL Server 2019 imminent, I thought it'd be fun to rank which features I am most looking forward to in the new release.

(Also, I needed a lighter blogging week since I'm busy finishing preparing for my two sessions at PASS Summit next week - hope to see you there!).

(Quadrant chart: SQL Server 2019 features plotted on Excitement and Priority axes; the same rankings appear in list form below.)

I decided to rank these features on two axes: Excitement and Priority.

Excitement is easy to describe: how excited I am about using these features. In my case, excitement directly correlates with performance and developer usability improvements. That doesn't mean "Low Excitement" features aren't beneficial; on the contrary, many are great - they just don't top my list (it wouldn't be fun to have a quadrant with everything in the top right).

Priority is how quickly I'll work on implementing or tuning these features. The truth is that some of these features will work automatically once a SQL Server instance is upgraded, while others will require extra work (i.e. query rewriting, hardware configuration). Once again, "Low Priority" features aren't bad, they just won't be the features I focus on first.

Finally, these rankings are based on Microsoft's descriptions of these features and what little tinkering I've done with pre-releases of SQL Server 2019. For all I know, this chart will totally change once I start using these features regularly in production environments.

And here are my rankings in list form in case that's more your style:

High Excitement, High Priority

  • Scalar function inlining
  • Memory grant feedback
  • sys.dm_exec_query_plan_stats
  • Accelerated Database Recovery
  • Table Variable deferred compilation

High Excitement, Low Priority

  • Big Data Clusters
  • Polybase all the things
  • Enhancements to running on Windows, Linux, and containers

Low Excitement, High Priority

  • Batch mode on rowstore indexes
  • Index encrypted columns
  • Optimize for sequential key
  • Useful truncation error messages

Low Excitement, Low Priority

  • New graph functions
  • Java language extension

What are you most excited for in SQL Server 2019? What features did I miss? Disagree with where something should be ranked? Let me know in the comments below.

SQL Server Stored Procedures vs Functions vs Views


SQL Server has several ways to store queries for later execution.

This makes developers happy because it allows them to follow the DRY principle: Don't Repeat Yourself. The more code you have, the more difficult it is to maintain. Centralizing frequently used code into stored procedures, functions, etc. is attractive.

While following the DRY pattern is beneficial in many programming languages, it can often cause poor performance in SQL Server.

Today's post will try to explain all of the different code organization features available in SQL Server and when to best use them (thank you to dovh49 on YouTube for recommending this week's topic and reminding me how confusing all of these different features can be when first learning to use them).

Scalar Functions

CREATE OR ALTER FUNCTION dbo.GetUserDisplayName
(
    @UserId int
)
RETURNS nvarchar(40)
AS
BEGIN
    DECLARE @DisplayName nvarchar(40);
    SELECT @DisplayName = DisplayName FROM dbo.Users WHERE Id = @UserId

    RETURN @DisplayName
END
GO
SELECT TOP 10000 Title, dbo.GetUserDisplayName(OwnerUserId) FROM dbo.Posts;

Scalar functions run statements that return a single value.

You'll often read about SQL functions being evil, and scalar functions are a big reason for this reputation. If your scalar function executes a query within it to return a single value, that means every row that calls that function runs this query. That's not good if you have to run a query once for every row in a million row table.

SQL Server 2019 can inline a lot of these, providing better performance in most cases. However, you can already do this yourself today by taking your scalar function and including it in your calling query as a subquery. The only downside is that you'll be repeating that same logic in every calling query that needs it.
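
For example, the calling query above could have the function's logic inlined as a correlated subquery - a sketch of that rewrite:

SELECT TOP 10000
    p.Title,
    (SELECT u.DisplayName FROM dbo.Users u WHERE u.Id = p.OwnerUserId) AS DisplayName
FROM dbo.Posts p;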

Additionally, using a scalar function on the column side of a predicate will prevent SQL Server from being able to seek to data in any of its indexes; talk about performance killing.
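
For example (a sketch reusing the function from above; the filter value is made up):

-- Non-SARGable: the function has to run against every row's OwnerUserId,
-- so SQL Server can't seek on an index that includes that column
SELECT Id
FROM dbo.Posts
WHERE dbo.GetUserDisplayName(OwnerUserId) = N'Community';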

For scalar functions that don't execute a query, you can always use WITH SCHEMABINDING to gain a performance boost.
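
A minimal sketch of that option (the function here is hypothetical):

CREATE OR ALTER FUNCTION dbo.AddTen
(
    @Value int
)
RETURNS int
WITH SCHEMABINDING -- lets SQL Server know the function accesses no data
AS
BEGIN
    RETURN @Value + 10;
END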

Inline Table-Valued Functions

CREATE OR ALTER FUNCTION dbo.SplitTags
(   
    @PostId int
)
RETURNS TABLE 
AS
RETURN 
(
    SELECT REPLACE(t.value,'>','') AS Tags 
    FROM dbo.Posts p 
    CROSS APPLY STRING_SPLIT(p.Tags,'<') t 
    WHERE p.Id = @PostId AND t.value <> ''
)
GO
SELECT * FROM dbo.SplitTags(4)

Inline table-valued functions allow a function to return a table result set instead of just a single value. They essentially are a way for you to reuse a derived table query (you know, when you nest a child query in your main query's FROM or WHERE clause).

These are usually considered "good" SQL Server functions - their performance is decent because SQL Server can get relatively accurate estimates on the data they will return, as long as the statistics on the underlying data are accurate. Generally this allows efficient execution plans to be created. As a bonus, they accept parameters, so if you find yourself reusing a subquery over and over again, an inline table-valued function (with or without a parameter) is a nice feature.

Multi-Statement Table-Valued Functions

CREATE OR ALTER FUNCTION dbo.GetQuestionWithAnswers
(
    @PostId int
)
RETURNS 
@results TABLE 
(
    PostId bigint,
    Body nvarchar(max),
    CreationDate datetime
)
AS
BEGIN
    -- Returns the original question along with all of its answers in one result set
    -- Would be better to do this with something like a union all or a secondary join. 
    -- But this is an MSTVF demo, so I'm doing it with multiple statements.

    -- Statement 1
    INSERT INTO @results (PostId,Body,CreationDate)
    SELECT Id,Body,CreationDate 
    FROM dbo.Posts
    WHERE Id = @PostId;

    -- Statement 2
    INSERT INTO @results (PostId,Body,CreationDate)
    SELECT Id,Body,CreationDate 
    FROM dbo.Posts
    WHERE ParentId = @PostId;

    RETURN
END
GO
SELECT * FROM dbo.GetQuestionWithAnswers(4);

Multi-statement table-valued functions at first glance look and feel just like their inline table-valued function cousins: they both accept parameter inputs and return results back into a query. The major difference is that they allow multiple statements to be executed before the results are returned in a table variable.

This is a great idea in theory - who wouldn't want to encapsulate multiple operational steps into a single function to make their querying logic easier?

However, the major downside is that the query optimizer knows nothing about what's happening inside of a multi-statement table-valued function when it compiles the calling query. Instead of real estimates, it uses a fixed guess for the number of rows an MSTVF will return: 100 rows (just 1 row on versions prior to SQL Server 2014, and more accurate estimates on SQL Server 2017 and above thanks to interleaved execution). This means execution plans generated for queries that call MSTVFs will often be...less than ideal. Because of this, MSTVFs help add to the "evil" reputation of SQL functions.
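
As the comment inside the function above hints, the same result set can be produced with a single statement, which means it could be written as an inline table-valued function instead (a sketch; the Inline suffix on the name is mine):

CREATE OR ALTER FUNCTION dbo.GetQuestionWithAnswersInline
(
    @PostId int
)
RETURNS TABLE
AS
RETURN
(
    -- The original question...
    SELECT Id AS PostId, Body, CreationDate
    FROM dbo.Posts
    WHERE Id = @PostId
    UNION ALL
    -- ...plus all of its answers
    SELECT Id, Body, CreationDate
    FROM dbo.Posts
    WHERE ParentId = @PostId
);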

Stored Procedures

CREATE OR ALTER PROCEDURE dbo.InsertQuestionsAndAnswers
    @PostId int
AS
BEGIN
    SET NOCOUNT ON;

    INSERT INTO dbo.Questions (Id)
    SELECT Id
    FROM dbo.Posts
    WHERE Id = @PostId;

    INSERT INTO dbo.Answers (Id, PostId)
    SELECT Id, ParentId
    FROM dbo.Posts
    WHERE ParentId = @PostId;
END
GO
EXEC dbo.InsertQuestionsAndAnswers @PostId = 4;

Stored procedures encapsulate SQL query statements for easy execution. They return result sets, but those result sets can't be easily used within another query.

This works great when you want to define single or multi-step processes in a single object for easier calling later.

Stored procedures also have the added benefit of being able to have more flexible security rules placed on them, allowing users to access data in specific ways where they don't necessarily have access to the underlying sources.
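
For example, a user can be allowed to execute the procedure without having any rights on dbo.Posts itself (a sketch; ReportingUser is a hypothetical principal):

GRANT EXECUTE ON dbo.InsertQuestionsAndAnswers TO ReportingUser;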

Views

CREATE OR ALTER VIEW dbo.QuestionsWithUsers
WITH SCHEMABINDING
AS
SELECT
    p.Id AS PostId,
    u.Id AS UserId,
    u.DisplayName
FROM  
    dbo.Posts p
    INNER JOIN dbo.Users u
        ON p.OwnerUserId = u.Id
WHERE
    p.PostTypeId = 1;
GO

CREATE UNIQUE CLUSTERED INDEX CL_PostId ON dbo.QuestionsWithUsers (PostId);
SELECT * FROM dbo.QuestionsWithUsers;

Views are similar to inline table-valued functions - they allow you to centralize a query in an object that can be easily called from other queries. The results of the view can be used as part of the calling query; however, parameters can't be passed in to the view.

Views also have some of the security benefits of a stored procedure; users can be granted access to a view containing a limited subset of data from underlying tables that those same users don't have access to.
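
Continuing the hypothetical ReportingUser example from above:

-- ReportingUser can read the view even without SELECT rights
-- on the underlying dbo.Posts and dbo.Users tables
GRANT SELECT ON dbo.QuestionsWithUsers TO ReportingUser;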

Views also have some performance advantages since they can have indexes added to them, essentially materializing the result set in advance of the view being called (creating faster performance). If deciding between an inline table-valued function and a view and you don't need to parameterize the input, a view is usually the better option.

Natively Compiled Stored Procedures and Scalar Functions

CREATE TABLE dbo.QuestionsStaging (Id int PRIMARY KEY NONCLUSTERED) WITH ( MEMORY_OPTIMIZED = ON , DURABILITY = SCHEMA_ONLY );

CREATE TABLE dbo.AnswersStaging (Id int PRIMARY KEY NONCLUSTERED, PostId int) WITH ( MEMORY_OPTIMIZED = ON , DURABILITY = SCHEMA_ONLY );
GO

CREATE PROCEDURE dbo.InsertQuestionsAndAnswersCompiled
    @PostId int
WITH NATIVE_COMPILATION, SCHEMABINDING
AS BEGIN ATOMIC WITH
(
    TRANSACTION ISOLATION LEVEL = SNAPSHOT, LANGUAGE = N'us_english'
)
    -- Note: natively compiled modules can only reference memory-optimized
    -- tables, so dbo.Posts would also need to be memory-optimized for this
    -- procedure to compile.
    INSERT INTO dbo.QuestionsStaging (Id)
    SELECT Id
    FROM dbo.Posts
    WHERE Id = @PostId;

    INSERT INTO dbo.AnswersStaging (Id, PostId)
    SELECT Id, ParentId
    FROM dbo.Posts
    WHERE ParentId = @PostId;
END

These are the same as the stored procedures and scalar functions mentioned above, except they are pre-compiled for use with memory-optimized tables in SQL Server.

This means that instead of SQL Server interpreting the SQL query every time a procedure or scalar function runs, it creates the compiled version ahead of time, reducing the startup overhead of executing one of these objects. This is a great performance benefit, but they have several limitations. If you are able to use them, you should - just be aware of what they can and can't do.
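
Executing the natively compiled procedure looks the same as executing a regular one:

EXEC dbo.InsertQuestionsAndAnswersCompiled @PostId = 4;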

Conclusion

While writing this post I thought about when I was first learning all of these objects for storing SQL queries. Knowing the differences between all of the options available (or what those options even are!) can be confusing. I hope this post helps ease some of this confusion and helps you choose the right objects for storing your queries.