One SQL Cheat Code For Amazingly Fast JSON Queries

Published on: 2017-05-09

How non-persisted computed column indexes make your JSON queries high performance

You can watch this blog post on YouTube too!

Recently I’ve been working with JSON in SQL Server 2016 a lot.

One of the hesitations many people have with using JSON in SQL Server is that they think querying it must be really slow — SQL is supposed to excel at relational data, not string parsing, right?

It turns out that performance is pretty good with the standalone SQL Server JSON functions. Even better, it’s possible to make queries against JSON data run at ludicrous speeds by indexing computed columns that parse the JSON. In this post I want to take a look at how SQL Server is able to parse* JSON with such great performance.

*“Parse” here is actually a lie — it’s doing something else behind the scenes. You’ll see what I mean; keep reading!

Computed Columns in SQL Server

The only way to get JSON indexes working on SQL Server is to use a computed column. A computed column is basically a column whose values are calculated from an expression rather than entered directly.

For example, let’s say we have a table with some car JSON data in it:

DROP TABLE IF EXISTS dbo.DealerInventory;
CREATE TABLE dbo.DealerInventory
(
  Id int IDENTITY(1,1) PRIMARY KEY,
  Year int,
  JsonData nvarchar(300)
);
INSERT INTO dbo.DealerInventory (Year, JsonData) VALUES (2017, '{ "Make" : "Volkswagen", "Model" : "Golf" }');
INSERT INTO dbo.DealerInventory (Year, JsonData) VALUES (2017, '{ "Make" : "Honda", "Model" : "Civic" }');
INSERT INTO dbo.DealerInventory (Year, JsonData) VALUES (2017, '{ "Make" : "Subaru", "Model" : "Impreza" }');
SELECT * FROM dbo.DealerInventory;
/* Output:
Id    Year     JsonData
----- -------- ---------------------------------------------
1     2017     { "Make" : "Volkswagen", "Model" : "Golf" }
2     2017     { "Make" : "Honda", "Model" : "Civic" }
3     2017     { "Make" : "Subaru", "Model" : "Impreza" }
*/

We can add a new computed column to the table, “Make”, which parses and extracts the Make property from each row’s JSON string:

ALTER TABLE dbo.DealerInventory
ADD Make AS JSON_VALUE(JsonData, '$.Make');
SELECT * FROM dbo.DealerInventory;
/* Output:
Id Year  JsonData                                    Make
-- ----- ------------------------------------------- ----------
1  2017  { "Make" : "Volkswagen", "Model" : "Golf" } Volkswagen
2  2017  { "Make" : "Honda", "Model" : "Civic" }     Honda
3  2017  { "Make" : "Subaru", "Model" : "Impreza" }  Subaru
*/

By default, the above Make computed column is non-persisted, meaning its values are never stored to the database (persisted computed columns can also be created, but that’s a topic for a different time). Instead, every time a query runs against our dbo.DealerInventory table, SQL Server will calculate the value for each row.

The performance of this isn’t great — it’s essentially a scalar function running for each row of our output :(. However, when you combine a computed column with an index, something interesting happens.
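To see that per-row cost concretely, here's a quick sketch of a query against the table above. With no index in place, SQL Server has to evaluate JSON_VALUE() against every row's JsonData string just to apply the filter:

-- Without an index, this forces a scan that computes the "Make" value for every row
SELECT Id, Year, JsonData
FROM dbo.DealerInventory
WHERE Make = 'Honda'; -- same work as WHERE JSON_VALUE(JsonData, '$.Make') = 'Honda'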

Time to dive in with DBCC Page

DBCC PAGE is an undocumented SQL Server command that shows what the raw data stored in a SQL data page looks like. Pages are the 8 KB units in which SQL Server stores all of its data.

In the rest of this post we’ll be looking at how data pages (where the actual table data in SQL is stored) and index pages (where our index data is stored) are affected by non-persisted computed columns — and how they make JSON querying super fast.

First, let’s take a look at the existing data we have. We do this by first turning on trace flag 3604 and using DBCC IND to get the page ids of our data. Additional details on the column definitions in DBCC IND and DBCC PAGE can be found in Paul Randal’s blog post on the topic.

DBCC TRACEON(3604);
-- "Sandbox" is the name of my database
DBCC IND('Sandbox','dbo.DealerInventory',-1);

If you look at the results above, row 2 contains our data page (indicated by PageType = 1) and the PagePID of that page is 305088 (if you are playing along at home, your PagePID is most likely something else). If we then look up that PagePID using DBCC PAGE we get something like this:

DBCC PAGE('Sandbox',1,305088,3) WITH TABLERESULTS

You can see our three rows of data highlighted in red. The important thing to note here is that the parsed “Make” value from our computed column is truly non-persisted and nowhere to be found, meaning it has to be generated for every row during query execution.

Now, what if we add an index to our non-persisted computed column and then run DBCC IND again:

CREATE NONCLUSTERED INDEX IX_ParsedMake ON dbo.DealerInventory (Make)
DBCC IND('Sandbox','dbo.DealerInventory',-1);

You’ll now notice that in addition to data page 305088 (PageType = 1), we also have an index page 305096 (PageType = 2). If we examine both the data page and the index page we see something interesting:

DBCC PAGE('Sandbox',1,305088,3) WITH TABLERESULTS
DBCC PAGE('Sandbox',1,305096,3) WITH TABLERESULTS

Nothing has changed with our data page:

But our index page contains the parsed values for our “Make” column:

What does this mean? I thought non-persisted computed columns aren’t saved to disk!

Exactly right: our non-persisted computed column “Make” isn’t saved to the data page on disk. However, if we create an index on our non-persisted computed column, the computed value is persisted on the index page!

This is basically a cheat code for indexing computed columns.

SQL will only compute the “Make” value on a row’s insert or update into the table (or during the initial index creation) — all future retrievals of our computed column will come from the pre-computed index page.

This is how SQL Server is able to parse indexed JSON properties so fast: instead of scanning the table and parsing the JSON data for every row, it can look up the pre-parsed values in the index and return the correct data incredibly fast.
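As a rough sketch of the payoff, a query that filters on the indexed computed column can now seek straight into IX_ParsedMake instead of scanning the table and re-parsing every JSON string:

-- With IX_ParsedMake in place, this filter is satisfied by an index seek on the
-- pre-parsed "Make" values; no JSON parsing happens at query time
SELECT Id, Make
FROM dbo.DealerInventory
WHERE Make = 'Honda';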

Personally, I think this makes JSON that much easier (and practical) to use in SQL Server 2016. Even though we are storing large JSON strings in our database, we can still index individual properties and return results incredibly fast.

 


How to fix your terribly slow SQL job

Published on: 2017-05-02

Reduce your data in SQL for improved performance

This is a real-time progress bar for some of my old jobs. They are still running and stuck at 3%.

How many times have you written a program, ETL, analysis job, etc… that seemed like it would never finish running?

Although poor performance can be caused in a multitude of ways, the easiest fix is often to reduce your data in SQL Server instead of in your programming/ETL/analysis layer (Excel, R, SAS, Python, .NET, etc…).

SQL is built to handle and process data extremely efficiently. You will usually experience much better performance the more work (data merging, transformations, etc…) you can do to your data on the SQL server. I say “usually” because SQL won’t always be faster than a programming language at transforming data, but 9 times out of 10 you can get faster results straight on the SQL Server.

Prefer watching me talk about performance on YouTube? Check out this post’s video here.

Let’s look at one of my crappy processes

How many of us have ever written a process that does something like this:

1. Write the most basic query possible, something like SELECT * FROM dbo.User

2. Take the output of the above query, load it into Excel/SAS/Python/.NET/etc…

3. Write some code to filter the dataset

4. Write some code to summarize the data, transform columns, etc…

5. Write another SELECT * FROM dbo.Sale against the SQL Server to bring in more data

6. Bring it into Excel/SAS/Python/.NET/etc… and merge it with our original data

7. Repeat steps 3–6 as many times as needed

Some of my earliest PHP and MySQL websites worked exactly like this 😳! The code was slow on my server and users ended up suffering with slow webpage load times.

If the above process even slightly resembles something you’ve written before, continue reading on…

Why bother learning to transform data in SQL? I already know how to do that stuff in <insert programming language here>.

Old habits are hard to break, but you do want to make your processes run faster, right? This stuff is all easy, I promise!

Basically, if you are running code similar to the above, your job is slow because you are not optimizing where the work is being performed:

  • Every time you write SELECT * you probably are bringing back more data than you actually need — you are hurting your performance.
  • Every time you don’t have a WHERE clause, you are hurting your performance.
  • Every time your process queries the database multiple times (i.e. multiple SELECT statements in your job to bring back data), you are hurting your performance.

In case you missed it: not taking the time to filter and reduce your data down as much as possible in SQL is hurting your performance! Assuming your SQL Server and your programming layer are on different machines, you lose lots of time transferring unnecessary data over the wire (or the air), and you fail to take advantage of everything your SQL Server has to offer.

What’s the solution to this inefficient processing?

Process your data on the SQL Server!

If you wait until your programming layer to filter, join, and transform your data, you are giving up valuable SQL performance and network efficiency. Here are some easy ways to reduce the size of your dataset on the SQL Server to improve performance in your jobs (and make your coworkers envious of your skills)!

SELECT [ColumnName]

If you are using SELECT *, stop!

SELECT * brings back all of the columns on your table, including the ones you don’t need. This increases the amount of data sent over the network (data that doesn’t even get used) and increases the amount of data that needs to be read from disk (and storage hardware is usually relatively slow). Not to mention that if your table has indexes, SELECT * most likely prevents some of them from being used efficiently (or at all), slowing your queries down even further.

But what if you do need all of the columns on a particular table? You still shouldn’t use SELECT *! Although there’s no performance difference, using SELECT * just means you are taking on technical debt. In the future, when a column gets added to or removed from your table, your downstream processes may break because they are now automatically receiving (or no longer receiving) that column. Do you want to have to fix a failing process in the future because it’s now receiving more data than it was expecting? I don’t think so!
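Here's a minimal before-and-after sketch using the dbo.User table from the process above. The column names are placeholders for whatever your process actually needs, and the brackets are required because USER is a reserved word:

-- Before: every column comes back, whether or not it gets used
SELECT * FROM dbo.[User];

-- After: only the columns the downstream process actually consumes
-- (UserId, FirstName, LastName, Active are hypothetical column names)
SELECT UserId, FirstName, LastName, Active
FROM dbo.[User];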

JOINs

My inefficient process example above starts with selecting some data and bringing it into my programming environment. The process then runs another query to bring in additional data and joins it to the data from my first query in my programming environment.

This is terrible!

First off, we are breaking the first principle we learned in the SELECT * section above — we are bringing back more data than we need! If we are using INNER JOIN on our two datasets, we most likely are going to be filtering out some data — data we don’t need. Joining on the SQL server first will reduce our total dataset size and make our network and disk performance more efficient.

Even if we are doing something like a LEFT or FULL OUTER join, where we will be keeping all of the data from one or both of our datasets, it still benefits us to perform the join on the SQL Server. Why, you ask? Because the people who built SQL Server have spent hundreds or thousands of hours performance tuning and debugging their join algorithms. The chances that you will write a more efficient join algorithm yourself are slim.

And even if you are a programming savant, why reinvent the wheel? Unless your app needs every last microsecond of performance, just use SQL Server for what it’s really good at: relational data joining.
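As a minimal sketch (still using placeholder column names), letting the server do the join means one round trip and one already-merged result set:

-- One query, one result set: the join happens on the SQL Server
SELECT u.UserId,
       u.FirstName,
       s.SaleAmount,
       s.SalesDate
FROM dbo.[User] AS u
    INNER JOIN dbo.Sale AS s
        ON s.UserId = u.UserId;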

WHERE Clauses

Let’s say our dbo.User table has 50 thousand rows and our dbo.Sale table has 1 million rows. If your process is only looking for active users and sales from the past month, say 2,000 rows and 22,000 rows respectively, then you are making SQL look up and transfer far more rows than your process needs; well over 95% of them will simply be thrown away. Not only does this kill network performance, but your programming layer then has to filter out the extra data, doing work it probably can’t do as efficiently as SQL Server.

If instead I had just added predicates to the SQL WHERE clause, like Active = 1 and SalesDate >= DATEADD(month, -1, GETDATE()), we would have saved both time and bandwidth.
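Building on the join sketch above (the column names are still placeholders), adding those predicates means the filtering happens before a single unnecessary row leaves the server:

-- Only active users and only the last month of sales ever cross the network
SELECT u.UserId,
       u.FirstName,
       s.SaleAmount,
       s.SalesDate
FROM dbo.[User] AS u
    INNER JOIN dbo.Sale AS s
        ON s.UserId = u.UserId
WHERE u.Active = 1
    AND s.SalesDate >= DATEADD(month, -1, GETDATE());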

Aggregate Functions

You know what’s better than sending 10,000 rows of data over the network and then summing them up in your programming layer?

Using SQL’s SUM() aggregate function to reduce those 10,000 rows to just 1 row before sending it across the network.

SQL aggregate functions take many rows of data and consolidate them down into fewer rows.

SQL’s aggregate functions can also be combined with the OVER() clause to create windowed sets within your data, giving you even more control over how you aggregate it.

Don’t wait until your application layer to summarize parts of your data — do it in your SQL query instead.
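As a rough sketch against the same placeholder dbo.Sale table, both the GROUP BY form and the windowed OVER() form do the summarizing server-side before anything is transferred:

-- Reduce a month of sales to one row per user before it ever leaves the server
SELECT UserId,
       SUM(SaleAmount) AS TotalSales
FROM dbo.Sale
WHERE SalesDate >= DATEADD(month, -1, GETDATE())
GROUP BY UserId;

-- The windowed variant keeps every row but still computes the per-user totals in SQL
SELECT UserId,
       SaleAmount,
       SUM(SaleAmount) OVER (PARTITION BY UserId) AS UserTotal
FROM dbo.Sale;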

Scalar Functions

Although aggregate functions do some serious heavy lifting, scalar functions that run on each row of data aren’t anything to laugh at either. While they won’t reduce the number of rows in your output, they can certainly reduce the number of columns you are outputting.

For example, say you have multiple columns of data in your dataset that ultimately need to be combined into a single output column. It’s much better to use ISNULL(), COALESCE(), or CASE to combine multiple columns into a single column with logic in your SQL query so less data needs to be transferred later.
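A tiny sketch of that idea, using made-up phone number columns on the dbo.User table:

-- MobilePhone, HomePhone, and WorkPhone are hypothetical columns;
-- several nullable columns collapse into the single value the process actually needs
SELECT UserId,
       COALESCE(MobilePhone, HomePhone, WorkPhone, 'No phone on file') AS BestPhone
FROM dbo.[User];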

Once again, reducing the amount of data you are sending over the network is key to getting faster run times.

XML and JSON Functions

Last but not least, if your process is generating XML or JSON data at some point, consider generating that data on the SQL Server. Now, generating XML and JSON data won’t always improve your performance — SQL Server is best at relational tasks and not large string creation — but in many cases, especially with JSON, SQL Server can outperform even the fastest .NET libraries.

If your network is your bottleneck, then it is very possible that SQL Server can apply complex logic and transform your data into XML or JSON faster than it would take to transfer all of that data to another machine on the network and handle the transformation in another programming language.
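As a minimal sketch, FOR JSON PATH turns an entire filtered result set into one JSON document on the server, so only the finished string travels over the network (column names are placeholders, as before):

-- Build the JSON on the SQL Server and ship only the final string
SELECT UserId, SaleAmount, SalesDate
FROM dbo.Sale
WHERE SalesDate >= DATEADD(month, -1, GETDATE())
FOR JSON PATH;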

In short: do as much work as possible in SQL

If your SQL queries could be using any of the above techniques but aren’t, fix them…today! Checking each of your queries for these inefficiencies and mitigating them will probably (always test your changes!) improve the performance of your applications and processes.

And then it won’t feel like your process is taking forever to run.

 


Who Stuck These Letters In My DateTimes?

Published on: 2017-03-28

How to handle JSON DateTimes in SQL Server 2016

Parsing, creating, and modifying JSON in SQL Server 2016 is really easy. JSON dates and times are not.

Coming from a predominantly SQL background, the JSON DateTime format took some getting used to, especially when it came to converting SQL datetimes to JSON and vice versa.

The remainder of this post will get you well on your way to working with JSON date times in SQL Server.

Breakdown of JSON date/time

In SQL Server, datetime2’s format is defined as follows:

YYYY-MM-DD hh:mm:ss[.fractional seconds]

JSON date time strings are defined like:

YYYY-MM-DDTHH:mm:ss.sssZ

Honestly, they look pretty similar. However, there are a few key differences:

  • JSON separates the date and time portion of the string with the letter T
  • The Z is optional and indicates that the datetime is in UTC (if the Z is left off, JavaScript defaults to UTC). You can also specify a different time zone by replacing the Z with a + or - offset in HH:mm format (e.g. -05:00 for Eastern Standard Time)
  • The precision of SQL's datetime2 goes out to 7 decimal places; in JSON and JavaScript it only goes out to 3 places, so truncation may occur (a quick example follows this list)
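That last difference is easy to see with a quick throwaway conversion:

-- A full-precision SQL value loses everything past 3 fractional digits
-- once it has to fit the JSON/JavaScript format
DECLARE @fullPrecision datetime2(7) = '2017-03-28 12:45:00.1234567';
SELECT CONVERT(datetime2(3), @fullPrecision);
-- Output: 2017-03-28 12:45:00.123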

Now that we know the key differences between SQL datetime2 and JSON date time strings, let’s explore common transformations when working with JSON data in SQL.

Parsing JSON date time into SQL datetime2

The most common operation I perform with these new JSON functions is parsing, so let’s start with those. Let’s see how we can parse the date/times from JSON using SQL Server 2016’s JSON_VALUE() function:

DECLARE @jsonData nvarchar(max) = N'{ "createDate" : "2017-03-28T12:45:00.000Z" }'

-- SQL's JSON_VALUE() will read in the JSON date time as a string
SELECT JSON_VALUE(@jsonData, '$.createDate')
-- Output: 2017-03-28T12:45:00.000Z

-- If we want to read it in as a SQL datetime2, we need to use a CONVERT() (or a CAST())
SELECT CONVERT(datetime2, JSON_VALUE(@jsonData, '$.createDate'))
-- Output: 2017-03-28 12:45:00.0000000

-- 7 zeroes after the decimal? Our source only had 3 zeroes!
-- Since JSON/JavaScript times have decimal precision to only 3 places, we need to make
-- the precision of datetime2 match
SELECT CONVERT(datetime2(3), JSON_VALUE(@jsonData, '$.createDate'))
-- Output: 2017-03-28 12:45:00.000

-- So now we are returning our UTC date time from JSON, but what if we need to convert it to a different time zone?
-- SQL Server 2016's AT TIME ZONE makes this easy. One gotcha: applied to a plain datetime2, AT TIME ZONE
-- assigns a time zone rather than converting, so we first label the value as UTC and then switch zones.
-- To get a full list of time zone names, you can use SELECT * FROM sys.time_zone_info
SELECT CONVERT(datetime2(3), JSON_VALUE(@jsonData, '$.createDate')) AT TIME ZONE 'UTC' AT TIME ZONE 'Eastern Standard Time'
-- Output: 2017-03-28 08:45:00.000 -04:00

-- What if we just need to grab the date?  Pretty easy, just CONVERT() to date
SELECT CONVERT(date, JSON_VALUE(@jsonData, '$.createDate'))
-- Output: 2017-03-28

--Same with just the time, just remember to use a precision value of 3
SELECT CONVERT(time(3), JSON_VALUE(@jsonData, '$.createDate'))
-- Output: 12:45:00.000

Inserting SQL datetime2 into JSON

Taking date/time data out of JSON and into SQL was pretty easy. What about going the opposite direction and inserting SQL date/time data into JSON?

DECLARE @sqlData datetime2 = '2017-03-28 12:45:00.1234567'

-- Let's first try the simplest SQL to JSON conversion using FOR JSON PATH
SELECT @sqlData as SQLDateTime2 FOR JSON PATH
-- Output: [{"SQLDateTime2":"2017-03-28T12:45:00.1234567"}]

-- Honestly that's not too bad!
-- The datetime gets created in the YYYY-MM-DDTHH:mm:ss.fffffff format
-- Although this is pretty much what we need, what if we want to be explicit and specify that we are in UTC?
-- Just add the AT TIME ZONE modifier and we will get our JSON "Z" indicating UTC
SELECT @sqlData AT TIME ZONE 'UTC' AS SQLDateTime2 FOR JSON PATH
-- Output: [{"SQLDateTime2":"2017-03-28T12:45:00.1234567Z"}]

-- And if we provide a different time zone offset, the JSON is formatted correctly with the +/-HH:MM suffix:
SELECT @sqlData AT TIME ZONE 'Eastern Standard Time' AS SQLDateTime2 FOR JSON PATH
-- Output: [{"SQLDateTime2":"2017-03-28T12:45:00.1234567-04:00"}]

-- You might notice that there are 7 fractional second decimal places in all of the above examples.
-- Although out of JSON spec, this is ok!

-- What if we just want to insert the date?  Just specify with a SQL CONVERT()
SELECT CONVERT(date, @sqlData) as SQLDateTime2 FOR JSON PATH
-- Output: [{"SQLDateTime2":"2017-03-28"}]

-- And the same goes with the time portion
SELECT CONVERT(time, @sqlData) as SQLDateTime2 FOR JSON PATH
-- Output: [{"SQLDateTime2":"12:45:00.1234567"}]

Modifying JSON date time with SQL

So we’ve seen how easy it is to parse and create JSON date/time strings, but what about modifying JSON data?

DECLARE @jsonData nvarchar(max) = N'{ "createDate" : "2017-03-28T12:45:00.000Z" }'
		,@newDate datetime2(3) = '2017-03-28T12:48:00.123Z'

-- Let's start out modifying our data by replacing the value completely
SELECT JSON_VALUE(@jsonData, '$.createDate')

-- If we want to pass in a perfectly formatted JSON string, then it's pretty easy
SELECT JSON_MODIFY(@jsonData, '$.createDate', '2017-03-28T12:48:00.123Z')
-- Output: { "createDate" : "2017-03-28T12:48:00.123Z" }

-- If we want to pass in a SQL datetime2 value, say like what we have stored in @newDate, then things get a little messy.
-- The JSON_MODIFY function requires the third argument to be the nvarchar datatype.  This means
-- we need to get our SQL datetime2 into a valid JSON string first.  

-- If we use FOR JSON PATH to create the JSON date from the SQL datetime2, things get ugly because 
-- FOR JSON PATH always creates a property : value combination
SELECT JSON_MODIFY(@jsonData, '$.createDate', (SELECT @newDate as newDate FOR JSON PATH))
-- Output: { "createDate" : [{"newDate":"2017-03-28T12:48:00.123"}] }

-- In order to only pass the JSON datetime into the value for the "createDate" property, we need to 
-- use the CONVERT style number 127 to convert our dateTime to a JSON format
SELECT JSON_MODIFY(@jsonData, '$.createDate', (SELECT CONVERT(nvarchar, @newDate, 127)))
-- Output: { "createDate" : "2017-03-28T12:48:00.123" }

-- But what happened to our "Z" indicating UTC?  
-- We of course need to specify the AT TIME ZONE again:
SELECT JSON_MODIFY(@jsonData, '$.createDate', (SELECT CONVERT(nvarchar, @newDate AT TIME ZONE 'UTC', 127)))
--Output: { "createDate" : "2017-03-28T12:48:00.123Z" }

Overall, working with JSON dates/times is really easy using SQL Server 2016’s new JSON functions. Microsoft could have done a really bad job not following the ECMA standards, but they did a great job crossing their T's and placing their Z's.
