Data with Bert - SQL/Performance Tuning

This is a real-time progress bar for some of my old jobs. They are still running and stuck at 3%.

How many times have you written a program, ETL, analysis job, etc… that seemed like it would never finish running?

Although poor performance can be caused in a multitude of ways, the easiest to fix is by reducing your data in SQL Server instead of your in your programming/ETL/analysis layer (Excel, R, SAS, Python, ..NET, etc…).

SQL is built to handle and process data extremely efficiently. You will usually experience much better performance the more work (data merging, transformations, etc…) you can do to your data on the SQL server. I say "usually" because SQL won't always be faster than a programming language at transforming data, but 9 times out of 10 you can get faster results straight on the SQL Server.

Watch this week's video on YouTube

Let's look at one of my crappy processes

How many of us have ever written a process that does something like this:

1. Write the most basic query possible, something like SELECT * FROM dbo.User

2. Take the output of the above query, load it into Excel/SAS/Python/.NET/etc…

3. Write some code to filter the dataset

4. Write some code to summarize the data, transform columns, etc…

5. Write another SELECT * FROM dbo.Sale against the SQL Server to bring in more data

6. Bring it into Excel/SAS/Python/.NET/etc… and merge it with our original data

7. Repeat steps 3–6 as many times as needed

Some of my earliest PHP and MySQL websites worked exactly like this 😳! The code was slow on my server and users ended up suffering with slow webpage load times.

If the above process even slightly resembles something you've written before, continue reading on…

Why bother learning to transform data in SQL? I already know how to do that stuff in .

Old habits are hard to break, but you do want to make your processes run faster, right? This stuff is all easy, I promise!

Basically, if you are running code similar to above, the reason your job is slow is because you are not optimizing where your work is being performed:

Every time you write SELECT * you probably are bringing back more data than you actually need — you are hurting your performance.
Every time you don't have a WHERE clause, you are hurting your performance.
Every time your process queries the database multiple times (ie. multiple SELECT statements in your job to bring back data), you are hurting your performance.

In case you missed it, not taking the time to filter and reduce your data down as much as possible in your SQL is hurting your performance! Assuming your SQL Server and your programming layer are on different machines, you lose lots of time transferring unnecessary data over the wires (or air) as well as not efficiently using all of the advantages that your SQL server offers.

What's the solution to this inefficient processing?

Process your data on the SQL Server!

If you are not filtering, joining, and transforming your data until your programming layer, you are likely losing valuable SQL performance power and network efficiency. Here are some easy ways to reduce the size of your dataset on the SQL Server to improve performance in your jobs (and make your coworkers envious of your skills)!

SELECT [ColumnName]

If you are using SELECT *, stop!

SELECT * brings back all of the columns on your table, including the ones you don't need. This increases the amount of data sent over the network (which doesn't even get used) as well as increases the amount of data that needs to be read from disk (and storage hardware is usually relatively slow). Not to mention if your table is using indexes, SELECT * most likely causes some of those indexes not to be used as efficiently (or at all) which causes your queries to slow down even further.

But what if you do need all of the columns on a particular table? You still shouldn't use SELECT *! Although there's no performance difference, using SELECT * just means you are taking on technical debt. In the future, when a column gets added or removed from your table, your downstream processes may break because they are now automatically receiving (or no longer receiving) that column. Do you want to have to fix a failing process in the future because its now receiving more data that it was expecting? I don't think so!

JOINs

My inefficient process example above starts with selecting some data and bringing it into my programming environment. The process then runs another query to bring in additional data and joins it to the data from my first query in my programming environment.

This is terrible!

First off, we are breaking the first principle we learned in the SELECT * section above — we are bringing back more data than we need! If we are using INNER JOIN on our two datasets, we most likely are going to be filtering out some data — data we don't need. Joining on the SQL server first will reduce our total dataset size and make our network and disk performance more efficient.

Even if we are doing something like a LEFT or FULL OUTER join where we will be keeping all of the data from one or both of our datasets, it still benefits us to perform this join on the SQL Server. Why you ask? Because the people who built SQL Server have spent hundreds or thousands of hours performance tuning and debugging their joining algorithms. The chances that you will be able to write a more efficient join algorithm is highly unlikely.

And even if you are a programming savant, why reinvent the wheel? Unless your app needs every last microsecond of performance, just use SQL Server for what it's really good at: relational data joining.

WHERE Clauses

Let's say our dbo.User table has 50 thousand rows and our dbo.Sale table has 1 million rows. If your process is only looking for active users and sales from the past month, let's say 2 thousand rows and 22,000 rows respectively, then you are causing SQL to lookup and transfer 95% more rows than your process needs. Not only does it kill network performance, but your program layer then needs to filter out this data, doing extra work that it probably can't do as efficiently as SQL Server.

If instead I would have just added predicates to the SQL WHERE clause like Active=1 and SalesDate >= DATEADD(month, -1, GETDATE()) we would have saved both time and bandwidth.

Aggregate Functions

You know what's better than sending 10,000 rows of data over the network and then summing them up in your programming layer?

Using SQL's SUM() aggregate function to reduce those 10,000 rows to just 1 row before sending it across the network.

SQL aggregate functions take many rows of data and consolidate them down into fewer rows.

SQL's aggregate functions are also flexible enough to use the OVER() clause, allowing for windowed sets within your data — basically allowing you to be even more flexible with how you aggregate your data.

Don't wait until your application layer to summarize parts of your data — do it in your SQL query instead.

Scalar Functions

Although aggregate functions do some serious heavy lifting, scalar functions that run on each row of data aren't anything to laugh at either. Although they won't reduce the number of rows in your output, they can certainly reduce the number of columns you are outputting.

For example, say you have multiple columns of data in your dataset that ultimately need to be combined into a single output column. It's much better to use ISNULL(), COALESCE(), or CASE to combine multiple columns into a single column with logic in your SQL query so less data needs to be transferred later.

Once again, reducing the amount of data you are sending over the network is key to getting faster run times.

XML and JSON Functions

Last but not least, if your process is generating XML or JSON data at some point, consider generating that data on the SQL Server. Now, generating XML and JSON data won't always improve your performance — SQL Server is best at relational tasks and not large string creation — but in many cases, especially with JSON, SQL Server can outperform even the fastest .NET libraries.

If your network is your bottle neck, then it is very possible that SQL can apply complex logic and transform your data into XML or JSON faster on the SQL Server than if you needed to transfer all of that data to another location on the network and handle those transformations in another programming language.

In short: do as much work as possible in SQL

If your SQL queries could be following any of the above techniques and they're not, then fix them…today! Checking each of your queries for any of the above inefficiencies and mitigating them will probably (always test your changes) improve the performance of your applications and processes.

And then it won't feel like your process is taking forever to run.

Who needs a relational database when everything can be stored in a JSON string?

Every once in a while I hear of some technologist say that relational databases are dead; instead, a non-table based NoSQL storage format is the way of the future. SQL Server 2016 introduced JSON functionality, making it possible for some "non-SQL" data storage to make its way into the traditionally tabled-based SQL Server.

Does this mean all data in SQL Server going forward should be stored in long JSON strings? No, that would be a terrible idea. There are instances when storing JSON in SQL Server is a good choice though. In this post I want to create recommendations for when data should be stored as JSON and when it shouldn't.

Databases Should Not Be Entirely Comprised Of JSON

The screenshot below is an example of what I think some developers would do if they were given free reign in SQL Server 2016:

bb208-1zzw7xinj5wmotojplta2lw

Here we have an application database ("InventoryApp") that consists of only a single table ("dbo.Data") with three JSON NVARCHAR(MAX) columns to represent all of the data required by the app. Relationships exist between Sales, Purchases, and Customers but these are not defined on the database side.

If you are from the world of relational-SQL, you might not believe that anyone would design such a database structure. Believe me though, this is a realistic scenario. Entire companies (eg. Firebase: https://firebase.google.com/) build their services around abstracting the database layer away from developers, essentially storing entire tables or databases in large JSON strings.

Many developers like storing data this way because it is easy to deserialize JSON strings into objects in their programming languages to use in their apps. They like the fact that with JSON they can have an infinitely changing storage schema (just add new keys, values, and arrays!) so if they need a new field for their app, they can just add it in, serialize the object to a JSON string, and store it again in the database.

Obviously, going completely "NoSQL" might make short term development easier/quicker, but using SQL Server 2016 to only store data this way is a travesty: there's no way to use many of SQL Server's amazing performance, schema definition and validation, and security features.

So when is it appropriate to store JSON in SQL Server?

Appropriate Use Case #1: Error Logging

Errors happen. When they do, it's nice to be able to go back and look at the error message to see what happened.

The problem is that the structure of error messages isn't always consistent. Sometimes only the value of a single property will help identify the cause of failure. Other times, something more complex fails and it would be nice to have all of the values of a complex object available to make troubleshooting easier.

This is where JSON steps in: in most programming languages, it is easy to convert error messages and run time values to a JSON object on error. And since error messages and data values change in structure depending on where they occur, it's easy to dynamically turn any type of object into JSON data.

This data is perfect to store in SQL to be looked at later. None of these ideas are new — nvarchar(max) has been in SQL for a while now, and so programmers everywhere have been storing error information in that datatype.

With SQL Server 2016, it is now easier to examine and parse the error information directly in SQL Server Management Studio with the variety of JSON parsing functions available. No longer do programmers have to copy the code into some different tool — they can easily do it in SSMS.

Appropriate Use Case #2: Piloting Ideas

Most large workplaces have controls in place that prevent developers from making changes in production. In general this is a Good Idea™.

However, controls are sometimes too restrictive. For example, due to security restrictions, lack of server space, company politics, etc… developers are sometimes stuck developing in production. It's an unfortunate fact of life. In those scenarios, developers have to go through hell if they have to elevate each database structure change every time they want to test something in production.

JSON to the rescue! An nvarchar(max) column in a table can have its JSON data be easily added to and modified to fit more data than it was originally intended to hold. All without any database structure change requests.

Now this is not an ideal situation. In fact, it's a scenario that can add a lot of technical debt to the application long-term if not planned for.

However, if a "flexible" JSON column is built with eventual conversion to a traditional table structure in mind from the start, it's actually simple for a developer to transition an entirely JSON storage structure to a relational format later on. They key here is that the developer needs to have this conversion planned from day one.

Appropriate Use Case #3: Non-Analytical Data

Analytical data is SQL Server's bread and butter. Need to store lots of data and be able to query against it all day long? No problem, there are a plethora of performance tuning options to make your queries run fast and efficiently.

However, sometimes not all data needs to be analyzed. Often an app might need to save some session data to a database temporarily — why bother creating all of the maintenance overhead of strict database schemas if the data will never be queried for analytical purposes? Another example might be a website's dynamically created user profile settings. You can build normalized table(s) to store all of that data, but then you will be writing programming logic to normalize and denormalize your data out of the app.

If this data will not have to be searched, then why bother adding all of the overhead? Keep it in JSON and be done with.

How to fix your terribly slow SQL job