How To Create Multi-Object JSON Arrays in SQL Server

blog-image

Recently I was discussing with Peter Saverman whether it would be possible to take some database tables that look like this:

2017-12-16_10-34-48

And output them so that the Cars and Toys data would map to a multi-object JSON array like so:

2017-12-16_10-38-51

Watch this week's video on YouTube

Why would you ever need this?

If you are coming from a pure SQL background, at this point you might be wondering  why you would ever want create an object array that contains mixed object types.  Well, from an application development standpoint this type of scenario can be fairly common.

In a database, it makes sense to divide Home and Car and Toy into separate tables.  Sure, we could probably combine the latter two with some normalization, but imagine we will have many different types of entities that will be more difficult to normalize - sometimes it just makes sense to store this information separately.

Not to mention that performing analytical type queries across many rows of data will typically be much faster stored in this three table format.

The three table layout, while organized from a database standpoint, might not be the best way to organize the data in an object-oriented application.  Usually in a transaction oriented application, we want our data to all be together as one entity.  This is why NoSQL is all the rage among app developers.  Having all of your related data all together makes it easy to manage, move, update, etc...  **This is where the array of multi-type objects comes in - it'd be pretty easy to use this structure as an array of dynamic or inherited objects inside of our application.

Why not just combine these Car and Toy entities in app?

Reading the data into the app through multiple queries and mapping that data to objects is usually the first way you would try doing something like this.

However, depending on many different variables, like the size of the data, the number of requests, the speed of the network, the hardware the app is running on, etc... mapping your data from multiple queries might not be the most efficient way to go.

On the other hand, if you have a big beefy SQL Server available that can do those transformations for you, and you are willing to pay for the processing time on an \$8k/core enterprise licensed machine, then performing all of the these transformations on your SQL Server is the way to go.

The solution

UPDATE: Jovan Popovic suggested an even cleaner solution using CONCAT_WS.  See the update at the bottom of this post.

First, here's the data if you want to play along at home:

DROP TABLE IF EXISTS ##Home;
GO
DROP TABLE IF EXISTS ##Car;
GO
DROP TABLE IF EXISTS ##Toy;
GO

CREATE TABLE ##Home
(
    HomeId int IDENTITY PRIMARY KEY,
    City nvarchar(20),
    State nchar(2)
);
GO

CREATE TABLE ##Car
(
    CarId int IDENTITY PRIMARY KEY,
    HomeId int,
    Year smallint,
    Make nvarchar(20),
    Model nvarchar(20),
    FOREIGN KEY (HomeId) REFERENCES ##Home(HomeId)
);
GO

CREATE TABLE ##Toy
(
    ToyId int IDENTITY PRIMARY KEY,
    HomeId int,
    Category nvarchar(20),
    RiderCapacity int,
    FOREIGN KEY (HomeId) REFERENCES ##Home(HomeId)
);
GO

INSERT INTO ##Home (City,State) VALUES ('Cleveland','OH')
INSERT INTO ##Home (City,State) VALUES ('Malibu','CA')

INSERT INTO ##Car (HomeId,Year, Make, Model) VALUES ('1','2017', 'Volkswagen', 'Golf')
INSERT INTO ##Car (HomeId,Year, Make, Model) VALUES ('2','2014', 'Porsche', '911')

INSERT INTO ##Toy (HomeId,Category, RiderCapacity) VALUES ('1','Bicycle', 1)
INSERT INTO ##Toy (HomeId,Category, RiderCapacity) VALUES ('2','Kayak', 2)

SELECT * FROM ##Home
SELECT * FROM ##Car
SELECT * FROM ##Toy

And here's the query that does all of the transforming:

SELECT 
    h.HomeId,
    h.City,
    h.State,
    GarageItems = JSON_QUERY('[' + STRING_AGG( GarageItems.DynamicData,',') + ']','$')
FROM
    ##Home h
    INNER JOIN
    (
        SELECT
            HomeId,
            JSON_QUERY(Cars,'$') AS DynamicData
        FROM
            ##Home h
            CROSS APPLY
            (
            SELECT 
                (
                SELECT  
                    *
                FROM
                    ##Car c
                WHERE
                    c.HomeId = h.HomeId
                    FOR JSON PATH, WITHOUT_ARRAY_WRAPPER
                ) AS Cars
            ) d 
        UNION ALL
        SELECT
            HomeId,
            JSON_QUERY(Cars,'$') AS DynamicData
        FROM
            ##Home h
            CROSS APPLY
            (
            SELECT 
                (
                SELECT  
                    *
                FROM
                    ##Toy c
                WHERE
                    c.HomeId = h.HomeId
                    FOR JSON PATH, WITHOUT_ARRAY_WRAPPER
                ) AS Cars
            ) d
    ) GarageItems
        ON h.HomeId = GarageItems.HomeId
GROUP BY
    h.HomeId,
    h.City,
    h.State

There are a couple of key elements that make this work.

CROSS APPLY

When using FOR JSON PATH , ALL rows and columns from that result set will get converted to a single JSON string.

This creates a problem if, for example, you want to have a column for your JSON string and a separate column for something like a foreign key (in our case, HomeId).  Or if you want to generate multiple JSON strings filtered on a foreign key.

The way I chose to get around this is to use CROSS APPLY with a join back to our Home table - this way we get our JSON string for either Cars or Toys created but then output it along with some additional columns.

WITHOUT_ARRAY_WRAPPER

When using FOR JSON PATH to turn a result set into a JSON string, SQL Server will automatically add square brackets around the JSON output as if it were an array.

This is a problem in our scenario because when we use FOR JSON PATH to turn the Car and Toy table into JSON strings, we eventually want to combine them together into the same array instead of two separate arrays.  The solution to this is using the WITHOUT_ARRAY_WRAPPER option to output the JSON string without the square brackets.

Conclusion

Your individual scenario and results may vary.  This solution was to solve a specific scenario in a specific environment.

Is it the right way to go about solving your performance problems all of the time? No.  But offloading these transformations onto SQL Server is an option to keep in mind.

Just remember - always test to make sure your performance changes are actually helping.

UPDATED Solution Using CONCAT_WS:

This solution recommended by Jovan Popovic is even easier than above.  It requires using CONCAT_WS, which is available starting in SQL Server 2017 (the above solution requires STRING_AGG which is also in 2017, but it could be rewritten using FOR XML string aggregation if necessary for earlier versions)

SELECT h.*,
'['+ CONCAT_WS(',',
(SELECT * FROM ##Car c WHERE c.HomeId = h.HomeId FOR JSON PATH, WITHOUT_ARRAY_WRAPPER),
(SELECT * FROM ##Toy t WHERE t.HomeId = h.HomeId FOR JSON PATH, WITHOUT_ARRAY_WRAPPER)
)
+ ']'
FROM ##Home h

Behind the Scenes of an Online Presentation

Charlie is an attentive audience member

Last week I presented my session "DBAs vs Developers: JSON in SQL Server 2016" at the online GroupBy Conference.

As I prepared for the event, I thought about all of the things that were different about getting ready for an online versus an in-person event.

Thinking that others might be interested in seeing what I do to get ready for an online talk, I filmed myself as I prepared for presentation day and put together this "behind the scenes" video.

Check it out, along with my actual talk on JSON in SQL Server 2016, in the videos below!

https://www.youtube.com/watch?v=oDYCSym6Lko

Slides and demo code from the presentation is available at https://bertwagner.com/presentations/

And the presentation video itself:

https://www.youtube.com/watch?v=WJpUGHWUMoc

And slide deck:

https://www.slideshare.net/BertWagner/dbas-vs-developers-json-in-sql-server-2016

5 Things You Need to Do When Performance Testing JSON in SQL and C#

1ce87-1opx0seatjtntutc1qi1txg

Watch this week's video on YouTube

Want to learn more about using JSON in SQL Server? Watch me present at the online GroupBy conference on June 9, 2017 at 8am.

I've written a few articles this year about how awesome JSON performance is in SQL Server 2016.

The more I continue to use JSON in SQL Server, the more impressed I become with its speed and versatility. Over time I've learned new techniques and realize that JSON in SQL Server is actually much faster than I initially thought.

Today I want to correct some performance tests where I think I unfairly compared SQL Server JSON performance the first time around.

Major thanks to @JovanPop_MSFT for his help with performance testing suggestions.

Performance testing is hard

Before I dive into the performance tests, I want to be clear that these tests are still not perfect.

Performance testing in SQL Server is hard enough. When you start trying to compare SQL Server functions to code in .NET, lots of of other factors come in to play.

I'll try to to highlight where there still might be some problems with my methodology in the tests below, but overall I think these tests are more accurate comparisons of these features.

SQL Server JSON vs. Json.Net

There are two major issues with comparing SQL Server JSON functions to Json.NET functions in C#:

  1. Queries running in SQL Server Management Studio have significant overhead when rendering results to the results grid.
  2. The way SQL Server retrieves pages of data from disk or memory is not the same as how C# retrieves data from disk or memory.

The below tests should provide a more accurate comparison between SQL Server and .NET.

I am capturing SQL run times for the below tests using SET STATISTICS TIME ON. All of the test data for the below tests is available here: https://gist.github.com/bertwagner/f0645cf1b244af7d6bb75856db8744e0

Test #1 — Deserializing 20k JSON elements

For this first test, we will deserialize ~20k rows of car year-make-model data comparing the SQL Server OPENJSON function against Json.NET's DeserializeObject.

Previously this test used JSON_VALUE which was adding unnecessary processing overhead. With the query rewritten to run more efficiently, it looks like this:

SELECT year, make, model
FROM OPENJSON(@cars) WITH (year int, make nvarchar(50), model nvarchar(50));
-- 160ms

Now the problem with this query is that we are still drawing all ~20k rows of data to the screen in SQL Server Management Studio. The best way to avoid this extra processing is to simply convert the query to use COUNT:

SELECT COUNT(*)
FROM OPENJSON(@cars) WITH (year int, make nvarchar(50), model nvarchar(50));
-- 71ms

0f867-1n9s0lw6pnonrbxbcu5ljag

Looking at the execution plans, the OPENJSON function is still processing all ~20k rows in both queries, only the number of rows being brought back to the SSMS GUI differ.

This still isn't the same as what the C# test below does (all data in the C# example stays in memory at all times) but it is as close of a comparison that I could think of:

var deserializedCars = JsonConvert.DeserializeObject<IEnumerable<Car>>(carsJSON);
// 66ms

(Full C# code available at: https://gist.github.com/bertwagner/8e5e8b6ec977c1704355166f96ae3efa)

And the result of this test? SQL Server was nearly as fast as Json.NET!

736c4-1l8jgil5hl_acboawarlepq

Test #2 — Deserializing ~20k rows with a predicate

In this next test we filter and return only a subset of rows.

SQL:

SELECT count(*) FROM OPENJSON(@cars) WITH(model nvarchar(20) ) WHERE model = 'Golf'
// 58ms

C#

var queriedCars = JsonConvert.DeserializeObject<IEnumerable<Car>>(carsJSON).Where(x => x.Model == "Golf");
// 52ms

Result: SQL Server is nearly as fast once again!

08563-1mevkdr3q5nqtjumfnuqtdq

One more important thing to note about this specific test — if you add this data into a SQL table and add a computed column index, SQL Server will beat out Json.NET every time.

Test #3 — Serializing ~20 elements into JSON

This scenario is particularly difficult to test. If I want to serialize data in a SQL table to a JSON string, how do I write the equivalent of that in C#? Do I use a DataTable and hope that SQL's data is all in cache? Is the retrieval speed between the SQL Server buffer equivalent to C#'s DataTable? Would a collection of List's in C# be more appropriate than a DataTable?

In the end, I decided to force SQL to read pages from disk by clearing the cache and have C# read the object data from a flat file. This still isn't perfect, but it is as close as I think we can get:

SQL:

DBCC DROPCLEANBUFFERS
SELECT * FROM dbo.Cars FOR JSON AUTO
-- 108ms

C#:

string carsJSONFromFile = File.ReadAllText(@"../../CarData.json");
var serializedCars = JsonConvert.SerializeObject(deserializedCars);
// 63ms

This test still isn't perfect though because SSMS is outputting the JSON string to the screen while C# never has to. I didn't want to play around with outputting the C# version to a form or the console window because it still wouldn't have been an equal comparison.

Result: Json.Net is about twice as fast although this test is by far the most inaccurate. Still, SQL is still much faster than I initially thought.

9c286-19i6qzwiispgc1eirzhtjsa

SQL Server JSON vs. XML

In my previous article comparing SQL Server JSON to SQL Server XML, I focused on tests that were all done within SQL Server.

These tests were incomplete however: most of the time, a developer's app will have to do additional processing to get data into an XML format, while JSON data usually already exists in JSON format (assuming we have Javascript web app).

These two tests examine cases where XML may have been slightly faster than JSON on SQL Server, but if you consider the entire environment (app + database), using JSON wins.

Scenario #1 — XML data needs to be serialized

Although inserting XML data that is already in memory into a SQL Server table is faster than the equivalent operation in JSON, what happens if we need to serialize the data in our app first before sending the data to SQL Server?

// Serialize Car objects to XML
var result1 = SerializeToXML(cars);
// 166ms

// Serialize Car objects to JSON
var result2 = SerializeToJSON(cars);
// 69ms

public static Tuple<long, string> SerializeToXML(List<Car> cars)
{
  Stopwatch sw = new Stopwatch();
  sw.Start();
  StringWriter writer = new StringWriter();
  XmlSerializer serializer = new XmlSerializer(typeof(List<Car>));
  serializer.Serialize(writer, cars);
  string result = writer.ToString();
  sw.Stop();
  return new Tuple<long, string>(sw.ElapsedMilliseconds, result);
}

public static Tuple<long, string> SerializeToJSON(List<Car> cars)
{
  Stopwatch sw = new Stopwatch();
  sw.Start();
  var json = JsonConvert.SerializeObject(cars);
  sw.Stop();
  return new Tuple<long, string>(sw.ElapsedMilliseconds, json);
}

Using the most common libraries available to serializing data to XML and JSON, serializing data to JSON is twice as fast as serializing to XML (and as mentioned before, a lot of the time apps already have JSON data available — no need to serialize). This means the app serialization code will run faster and allow for the data to make it to SQL Server faster.

Scenario #5 — Transferring XML and JSON to SQL Server

Finally, after we have our serialized XML and JSON data in C#, how long does it take to transfer that data to SQL Server?

// Write XML string to SQL XML column
var result3 = WriteStringToSQL(
  result1.Item2, 
  "INSERT INTO dbo.XmlVsJson (XmlData) VALUES (@carsXML)", 
  new SqlParameter[]
  {
    new SqlParameter("carsXML", result1.Item2)
  });
// 142ms, 1.88mb of data

// Write JSON string to SQL
var result4 = WriteStringToSQL(
  carsJSON,
  "INSERT INTO dbo.XmlVsJson (JsonData) VALUES (@carsJSON)",
  new SqlParameter[]
  {
    new SqlParameter("carsJSON", carsJSON)
  });
// 20ms, 1.45mb of data

// Write XML string to nvarchar SQL column.  Taking the difference between this and result3, 100ms+ of time is spent converting to XML format on insert.
var result5 = WriteStringToSQL(
  result1.Item2, 
  "INSERT INTO dbo.XmlVsJson (JSONData) VALUES (@carsXML)",
  new SqlParameter[]
  {
    new SqlParameter("carsXML", result1.Item2)
  });
// 29ms, 1.88mb of data

Result: Writing JSON data to a nvarchar SQL Server column is much faster than writing XML data to an XML typed (or even an nvarchar typed) column.

Not only does SQL server need to parse the XML data upon insert, the physical size of the XML data being sent over TCP is larger due to the repetitive nature of XML syntax.

Conclusion

JSON performance in SQL Server is still awesome. In fact, it's even better than I had previously thought.

These tests are not meant to be conclusive; think of them more as errata for my previous JSON performance posts.

However, I think that these comparisons show that SQL Server's JSON functions are competitive with other languages' performance of handling JSON data.

Additionally, if serializing/deserializing reduces the amount of data transferred over TCP, using the JSON functions in SQL Server will most likely give you better total app/environment performance.

XML vs JSON Shootout: Which is Superior in SQL Server 2016?

"A duel is a duel" by Emanuele Rosso is licensed under CC BY-NC-ND 2.0

Watch this week's video on YouTube

Additional performance comparisons available in an updated post.

Starting with the 2016 release, SQL Server offers native JSON support. Although the implementation is not perfect, I am still a huge fan.

Even if a new feature like JSON support is awesome, I am only likely to use it if it is practical and performs better than the alternatives.

Today I want to pit JSON against XML and see which is the better format to use in SQL Server.

Enter XML, SQL's Bad Hombre

Full disclosure: I don't love XML and I also don't love SQL Server's implementation of it.

XML is too wordy (lots of characters wasted on closing tags), it has elements AND attributes (I don't like having to program for two different scenarios), and depending on what language you are programming in, sometimes you need schema files and sometimes you don't.

SQL Server's implementation of XML does have some nice features like a dedicated datatype that reduces storage space and validates syntax, but I find the querying of XML to be clumsy.

All XML grievances aside, I am still willing to use XML if it outperforms JSON. So let's run some test queries!

Is JSON SQL Server's New Sheriff in Town?

Although performance is the final decider in these comparison tests, I think JSON has a head start over XML purely in terms of usability. SQL Server's JSON function signatures are easier to remember and cleaner to write on screen.

The test data I'm using is vehicle year/make/model data from https://github.com/arthurkao/vehicle-make-model-data. Here's what it looks like once I loaded it into a table called dbo.XmlVsJson:

CREATE TABLE dbo.XmlVsJson
(
  Id INT IDENTITY PRIMARY KEY,
  XmlData XML,
  JsonData NVARCHAR(MAX)
)

(The full data query is available in this gist if you want to play along at home)

Data Size

So XML should be larger right? It's got all of those repetitive closing tags?

SELECT
  DATALENGTH(XmlData)/1024.0/1024.0 AS XmlMB,
  DATALENGTH(JsonData)/1024.0/1024.0 AS JsonMB
FROM
  dbo.XmlVsJson

82246-1ra2xqdbn4movjivmlhtnxa

Turns out the XML is actually smaller! How can this be? This is the magic behind the SQL Server XML datatype. SQL doesn't store XML as a giant string; it stores only the XML InfoSet, leading to a reduction in space.

The JSON on the other hand is stored as regular old nvarchar(max) so its full string contents are written to disk. XML wins in this case.

0745a-1liaytu4vkxadqugdbn1j3g

INSERT Performance

So XML is physically storing less data when using the XML data type than JSON in the nvarchar(max) data type, does that mean it will insert faster as well? Here's our query that tries to insert 100 duplicates of the row from our first query:

SET STATISTICS TIME ON

INSERT INTO dbo.XmlVsJson (XmlData)
SELECT XmlData FROM dbo.XmlVsJson 
  CROSS APPLY 
  (
    SELECT DISTINCT number 
    FROM master..spt_values 
    WHERE number BETWEEN 1 AND 100
  )t WHERE Id = 1
GO

INSERT INTO dbo.XmlVsJson (JsonData)
SELECT JsonData FROM dbo.XmlVsJson 
  CROSS APPLY 
  (
    SELECT DISTINCT number 
    FROM master..spt_values 
    WHERE number BETWEEN 1 AND 100
  )t WHERE Id = 1
GO

And the results? Inserting the 100 XML rows took 613ms on my machine, while inserting the 100 JSON rows took 1305ms…XML wins again!

JSON ain't looking too hot. Wait for it…

I'm guessing since the XML data type physically stores less data, it makes sense that it would also write it out to the table faster as well.

CRUD Operations

I'm incredibly impressed by SQL Server's JSON performance when compared to .NET — but how does it compare to XML on SQL Server?

Read

Let's select the fragment for our second car from our XML and JSON:

SELECT t.XmlData.query('/cars/car[2]') 
FROM dbo.XmlVsJson t 
WHERE Id = 1

SELECT JSON_QUERY(t.JsonData, '$.cars[1]') 
FROM dbo.XmlVsJson t 
WHERE Id = 1

91b35-1xixep1a6ebvvn38_vvpavq

Result? JSON wins (at 0ms vs 63ms for XML) when needing to pluck out a fragment from our larger object string.

8a02b-1hzaxjaobi3dxqurwrelf4w

What if we want to grab a specific value instead of a fragment?

SELECT t.XmlData.value('(/cars/car[2]/model)[1]', 'varchar(100)') FROM dbo.XmlVsJson t 
WHERE Id = 1

SELECT JSON_VALUE(t.JsonData, '$.cars[1].model') 
FROM dbo.XmlVsJson t 
WHERE Id = 1

Once again JSON wins with 0ms vs 11ms for XML.

41179-1esau9dtm4uezlxgarzxb1q

If you look at the execution plans for these last two queries, it's easy to see that XML has a lot more to do behind the scenes to retrieve the data:

XML:

327b8-1fgaieclnioqa5-zytrrndq

JSON:

45848-1s9aop29_eiltkvsnszkw9g

Create

We saw above that inserting rows of XML data is faster than inserting rows of JSON, but what if we want to insert new data into the object strings themselves? Here I want to insert the property "mileage" into the first car object:

db0c0-1btrkakuspyaj8iu5lq7wsg

UPDATE t SET XmlData.modify('
insert <mileage>100,000</mileage>
into (/cars/car[1])[1]') 
FROM dbo.XmlVsJson t 
WHERE Id = 1

UPDATE t SET JsonData = JSON_MODIFY(JsonData,
'$.cars[0].mileage','100,000') 
FROM dbo.XmlVsJson t 
WHERE Id = 1

In addition to the cleaner syntax (JSON_MODIFY() is essentially the same as a REPLACE()) the JSON insert runs in 22ms compared to the 206ms for XML. Another JSON win.

c01d2-1kng0gj5jdkk5ujd1qlg5ig

Update

Let's update the mileage properties we just added to have values of 110,000:

UPDATE t SET XmlData.modify('
replace value of (/cars/car[1]/mileage/text())[1]
with     "110,000"') 
FROM dbo.XmlVsJson t
WHERE Id = 1

UPDATE t SET JsonData = JSON_MODIFY(JsonData, '$.cars[0].mileage','110,000') 
FROM dbo.XmlVsJson t
WHERE Id = 1

72c82-1hpv3kbbgv88tzuyozvineg

Result? JSON has the quicker draw and was able to perform this update in 54ms vs XML's 194ms.

c81d3-1fvual_pu3eyahvumnfimqg

Delete

Deleting large string data, a DBA's dream *snicker*.

Let's delete the mileage property, undoing all of that hard work we just did:

UPDATE t SET XmlData.modify('
delete /cars/car[1]/mileage[1]') 
FROM dbo.XmlVsJson t 
WHERE Id = 1

UPDATE t SET JsonData = JSON_MODIFY(JsonData, '$.cars[0].mileage', null) 
FROM dbo.XmlVsJson t 
WHERE Id = 1

JSON doesn't take any time to reload and wins against XML again 50ms to 159ms.

538ba-1ojms_3yjdktnni2a3pqcrg

Read Part 2: Indexes

So above we saw that JSON was faster than XML at reading fragments and properties from a single row of serialized data. But our SQL Server's probably have LOTS of rows of data — how well does indexed data parsing do in our match up?

First let's expand our data — instead of storing all of our car objects in a single field, let's build a new table that has each car on its own row:

(once again, full dataset at GitHub if you are playing along at home)

Now that we have our expanded data in our table, let's add some indexes. The XML datatype in SQL Server has its own types of indexes, while JSON simply needs a computed column with a regular index applied to it.

DROP INDEX IF EXISTS PXML_XmlData ON XmlVsJson2
CREATE PRIMARY XML INDEX PXML_XmlData
ON XmlVsJson2 (XmlData);

ALTER TABLE dbo.XmlVsJson2
ADD MakeComputed AS JSON_VALUE(JsonData, '$.make')
CREATE NONCLUSTERED INDEX IX_JsonData ON dbo.XmlVsJson2 (MakeComputed)

(Note: I also tried adding an XML secondary index for even better performance, but I couldn't get the query engine to use that secondary index on such a basic dataset)

If we try to find all rows that match a predicate:

SELECT Id, XmlData 
FROM dbo.XmlVsJson2 t 
WHERE t.XmlData.exist('/car/make[.="ACURA"]') = 1

SELECT Id, JsonData 
FROM dbo.XmlVsJson2 t 
WHERE JSON_VALUE(t.JsonData, '$.make') = 'ACURA'

XML is able to filter out 96 rows in 200ms and JSON accomplishes the same in 9ms. A final win for JSON.

327d9-1-jtwonmiccgvf25ksnmb3q

Conclusion

If you need to store and manipulate serialized string data in SQL Server, there's no question: JSON is the format of choice. Although JSON's storage size is a little larger than its XML predecessor, SQL Server's JSON functions outperform XML in speed in nearly all cases.

Is there enough performance difference to rewrite all of your old XML code to JSON? Probably not, but every case is different.

One thing is clear: new development should consider taking advantage of SQL Server's new JSON functions.

One SQL Cheat Code For Amazingly Fast JSON Queries

f836d-17tk3zornsj5nirg7ai5dbq

Watch this week's video on YouTube

Recently I've been working with JSON in SQL Server 2016 a lot.

One of the hesitations many people have with using JSON in SQL Server is that they think that querying it must be really slow — SQL is supposed to excel at relational data, not string parsing right?

It turns out that performance is pretty good with the standalone SQL Server JSON functions. Even better is that it's possible to make queries against JSON data run at ludicrous speeds by using indexes on JSON parsed computed columns. In this post I want to take a look at how SQL is able to parse* with such great performance.

*"Parse" here is actually a lie —it's doing something else behind the scenes. You'll see what I mean, keep reading!

Computed Columns in SQL Server

The only way to get JSON indexes working on SQL server is to use a computed column. A computed column is basically a column that performs a function to calculate its values.

For example, let's say we have a table with some car JSON data in it:

DROP TABLE IF EXISTS dbo.DealerInventory;
CREATE TABLE dbo.DealerInventory
(
  Id int IDENTITY(1,1) PRIMARY KEY,
  Year int,
  JsonData nvarchar(300)
);

INSERT INTO dbo.DealerInventory (Year, JsonData) VALUES (2017, '{ "Make" : "Volkswagen", "Model" : "Golf" }');

INSERT INTO dbo.DealerInventory (Year, JsonData) VALUES (2017, '{ "Make" : "Honda", "Model" : "Civic" }');

INSERT INTO dbo.DealerInventory (Year, JsonData) VALUES (2017, '{ "Make" : "Subaru", "Model" : "Impreza" }');

SELECT * FROM dbo.DealerInventory;

/* Output:
Id    Year     JsonData
----- -------- ---------------------------------------------
1     2017     { "Make" : "Volkswagen", "Model" : "Golf" }
2     2017     { "Make" : "Honda", "Model" : "Civic" }
3     2017     { "Make" : "Subaru", "Model" : "Impreza" }
*/

We can add a new computed column to the table, "Make", which parses and extracts the Make property from each row's JSON string:

ALTER TABLE dbo.DealerInventory
ADD Make AS JSON_VALUE(JsonData, '$.Make');

SELECT * FROM dbo.DealerInventory;

/* Output:
Id Year  JsonData                                    Make
-- ----- ------------------------------------------- ----------
1  2017  { "Make" : "Volkswagen", "Model" : "Golf" } Volkswagen
2  2017  { "Make" : "Honda", "Model" : "Civic" }     Honda
3  2017  { "Make" : "Subaru", "Model" : "Impreza" }  Subaru
*/

By default, the above Make computed column is non-persisted, meaning its values are never stored to the database (persisted computed columns can also be created, but that's a topic for a different time). Instead, every time a query runs against our dbo.DealerInventory table, SQL Server will calculate the value for each row.

The performance of this isn't great — it's essentially a scalar function running for each row of our output :(. However, when you combine a computed column with an index, something interesting happens.

Time to dive in with DBCC Page

DBCC Page is an undocumented SQL Server function that shows what the raw data stored in a SQL page file looks like. Page files are how SQL Server stores its data.

In the rest of this post we'll be looking at how data pages (where the actual table data in SQL is stored) and index pages (where our index data is stored) are affected by non-persisted computed columns — and how they make JSON querying super fast.

First, let's take a look at the existing data we have. We do this by first turning on trace flag 3604 and using DBCC IND to get the page ids of our data. Additional details on the column definitions in DBCC IND and DBCC PAGE can be found in Paul Randal's blog post on the topic.

DBCC TRACEON(3604);

-- "Sandbox" is the name of my database
DBCC IND('Sandbox','dbo.DealerInventory',-1);

55ac9-1qczux6z_us9fzacehamhpa

If you look at the results above, row 2 contains our data page (indicated by PageType = 1) and the PagePID of that page is 305088 (if you are playing along at home, your PagePID is most likely something else). If we then look up that PagePID using DBCC PAGE we get something like this:

DBCC PAGE('Sandbox',1,305088,3) WITH TABLERESULTS

65fdc-1jaa-flg1di52y3t2gohyxg

You can see our three rows of data highlighted in red. The important thing to note here is that our computed column of the parsed "Make" value is truly non-persisted and no where to be found, meaning it has to get generated for every row during query execution.

Now, what if we add an index to our non-persisted computed column and then run DBCC IND again:

CREATE NONCLUSTERED INDEX IX_ParsedMake ON dbo.DealerInventory (Make)

DBCC IND('Sandbox','dbo.DealerInventory',-1);

cab34-1vfyzcblexgrytsqu5miomg

You'll now notice that in addition to data page 305088 (PageType = 1), we also have an index page 305096 (PageType = 2). If we examine both the data page and the index page we see something interesting:

DBCC PAGE('Sandbox',1,305088,3) WITH TABLERESULTS

DBCC PAGE('Sandbox',1,305096,3) WITH TABLERESULTS

Nothing has changed with our data page:

3348f-1ktty-u5w4iha9emrwys9zw

But our index page contains the parsed values for our "Make" column:

1b9ea-16ysi1x2ilwdo7j3tkwowlw

What does this mean? I thought non-persisted computed columns aren't saved to disk!

Exactly right: our non-persisted computed column "Make" isn't saved to the data page on the disk. However if we create an index on our non-persisted computed column, the computed value is persisted on the index page!

This is basically a cheat code for indexing computed columns.

SQL will only compute the "Make" value on a row's insert or update into the table (or during the initial index creation) — all future retrievals of our computed column will come from the pre-computed index page.

This is how SQL is able to parse indexed JSON properties so fast; instead of needing to do a table scan and parsing the JSON data for each row of our table, SQL Server can go look up the pre-parsed values in the index and return the correct data incredibly fast.

Personally, I think this makes JSON that much easier (and practical) to use in SQL Server 2016. Even though we are storing large JSON strings in our database, we can still index individual properties and return results incredibly fast.