Data with Bert logo

How To Create Multi-Object JSON Arrays in SQL Server

blog-image

Recently I was discussing with Peter Saverman whether it would be possible to take some database tables that look like this:

2017-12-16_10-34-48

And output them so that the Cars and Toys data would map to a multi-object JSON array like so:

2017-12-16_10-38-51

Watch this week's video on YouTube

Why would you ever need this?

If you are coming from a pure SQL background, at this point you might be wondering  why you would ever want create an object array that contains mixed object types.  Well, from an application development standpoint this type of scenario can be fairly common.

In a database, it makes sense to divide Home and Car and Toy into separate tables.  Sure, we could probably combine the latter two with some normalization, but imagine we will have many different types of entities that will be more difficult to normalize - sometimes it just makes sense to store this information separately.

Not to mention that performing analytical type queries across many rows of data will typically be much faster stored in this three table format.

The three table layout, while organized from a database standpoint, might not be the best way to organize the data in an object-oriented application.  Usually in a transaction oriented application, we want our data to all be together as one entity.  This is why NoSQL is all the rage among app developers.  Having all of your related data all together makes it easy to manage, move, update, etc...  **This is where the array of multi-type objects comes in - it'd be pretty easy to use this structure as an array of dynamic or inherited objects inside of our application.

Why not just combine these Car and Toy entities in app?

Reading the data into the app through multiple queries and mapping that data to objects is usually the first way you would try doing something like this.

However, depending on many different variables, like the size of the data, the number of requests, the speed of the network, the hardware the app is running on, etc... mapping your data from multiple queries might not be the most efficient way to go.

On the other hand, if you have a big beefy SQL Server available that can do those transformations for you, and you are willing to pay for the processing time on an \$8k/core enterprise licensed machine, then performing all of the these transformations on your SQL Server is the way to go.

The solution

UPDATE: Jovan Popovic suggested an even cleaner solution using CONCAT_WS.  See the update at the bottom of this post.

First, here's the data if you want to play along at home:

DROP TABLE IF EXISTS ##Home;
GO
DROP TABLE IF EXISTS ##Car;
GO
DROP TABLE IF EXISTS ##Toy;
GO

CREATE TABLE ##Home
(
    HomeId int IDENTITY PRIMARY KEY,
    City nvarchar(20),
    State nchar(2)
);
GO

CREATE TABLE ##Car
(
    CarId int IDENTITY PRIMARY KEY,
    HomeId int,
    Year smallint,
    Make nvarchar(20),
    Model nvarchar(20),
    FOREIGN KEY (HomeId) REFERENCES ##Home(HomeId)
);
GO

CREATE TABLE ##Toy
(
    ToyId int IDENTITY PRIMARY KEY,
    HomeId int,
    Category nvarchar(20),
    RiderCapacity int,
    FOREIGN KEY (HomeId) REFERENCES ##Home(HomeId)
);
GO

INSERT INTO ##Home (City,State) VALUES ('Cleveland','OH')
INSERT INTO ##Home (City,State) VALUES ('Malibu','CA')

INSERT INTO ##Car (HomeId,Year, Make, Model) VALUES ('1','2017', 'Volkswagen', 'Golf')
INSERT INTO ##Car (HomeId,Year, Make, Model) VALUES ('2','2014', 'Porsche', '911')

INSERT INTO ##Toy (HomeId,Category, RiderCapacity) VALUES ('1','Bicycle', 1)
INSERT INTO ##Toy (HomeId,Category, RiderCapacity) VALUES ('2','Kayak', 2)

SELECT * FROM ##Home
SELECT * FROM ##Car
SELECT * FROM ##Toy

And here's the query that does all of the transforming:

SELECT 
    h.HomeId,
    h.City,
    h.State,
    GarageItems = JSON_QUERY('[' + STRING_AGG( GarageItems.DynamicData,',') + ']','$')
FROM
    ##Home h
    INNER JOIN
    (
        SELECT
            HomeId,
            JSON_QUERY(Cars,'$') AS DynamicData
        FROM
            ##Home h
            CROSS APPLY
            (
            SELECT 
                (
                SELECT  
                    *
                FROM
                    ##Car c
                WHERE
                    c.HomeId = h.HomeId
                    FOR JSON PATH, WITHOUT_ARRAY_WRAPPER
                ) AS Cars
            ) d 
        UNION ALL
        SELECT
            HomeId,
            JSON_QUERY(Cars,'$') AS DynamicData
        FROM
            ##Home h
            CROSS APPLY
            (
            SELECT 
                (
                SELECT  
                    *
                FROM
                    ##Toy c
                WHERE
                    c.HomeId = h.HomeId
                    FOR JSON PATH, WITHOUT_ARRAY_WRAPPER
                ) AS Cars
            ) d
    ) GarageItems
        ON h.HomeId = GarageItems.HomeId
GROUP BY
    h.HomeId,
    h.City,
    h.State

There are a couple of key elements that make this work.

CROSS APPLY

When using FOR JSON PATH , ALL rows and columns from that result set will get converted to a single JSON string.

This creates a problem if, for example, you want to have a column for your JSON string and a separate column for something like a foreign key (in our case, HomeId).  Or if you want to generate multiple JSON strings filtered on a foreign key.

The way I chose to get around this is to use CROSS APPLY with a join back to our Home table - this way we get our JSON string for either Cars or Toys created but then output it along with some additional columns.

WITHOUT_ARRAY_WRAPPER

When using FOR JSON PATH to turn a result set into a JSON string, SQL Server will automatically add square brackets around the JSON output as if it were an array.

This is a problem in our scenario because when we use FOR JSON PATH to turn the Car and Toy table into JSON strings, we eventually want to combine them together into the same array instead of two separate arrays.  The solution to this is using the WITHOUT_ARRAY_WRAPPER option to output the JSON string without the square brackets.

Conclusion

Your individual scenario and results may vary.  This solution was to solve a specific scenario in a specific environment.

Is it the right way to go about solving your performance problems all of the time? No.  But offloading these transformations onto SQL Server is an option to keep in mind.

Just remember - always test to make sure your performance changes are actually helping.

UPDATED Solution Using CONCAT_WS:

This solution recommended by Jovan Popovic is even easier than above.  It requires using CONCAT_WS, which is available starting in SQL Server 2017 (the above solution requires STRING_AGG which is also in 2017, but it could be rewritten using FOR XML string aggregation if necessary for earlier versions)

SELECT h.*,
'['+ CONCAT_WS(',',
(SELECT * FROM ##Car c WHERE c.HomeId = h.HomeId FOR JSON PATH, WITHOUT_ARRAY_WRAPPER),
(SELECT * FROM ##Toy t WHERE t.HomeId = h.HomeId FOR JSON PATH, WITHOUT_ARRAY_WRAPPER)
)
+ ']'
FROM ##Home h

Reducing the Number of Reads in Your Queries

jamie-street-361489

In last week's post, we went over how one of best ways to improve query performance was to reduce the number of reads that your query has to do.

Less reads typically means faster query performance - so how can you reduce the number of reads SQL Server is required to make?

Watch this week's video on YouTube

Write more selective queries

Writing more selective queries can be done in a  few different ways.

For starters, if you don't need every column of data, don't use SELECT *.  Depending on the size of your rows, providing only the columns you need might allow SQL Server to use an index  that is narrower and/or denser.  An index that is narrow (fewer columns) or dense (more records per page) allows SQL Server to return the same amount of data you need in fewer pages.

The same thing goes for being more selective in your ON and WHERE clauses.  If you specify no WHERE conditions, SQL Server will return every single page in your table.  If you can filter down the data to exactly what you need, SQL Server can return only the data you need. This reduces logical reads and improves speed.

-- Read 1,000,000 pages
SELECT * FROM Products

-- Read 10,000 pages by being more selective in your SELECT and WHERE
SELECT ProductName FROM Products WHERE CreateDate > '2015-01-01'

Finally, do you have large object (LOB) data (eg. varchar(max), image, etc..) on the tables in your query? Do you actually need it in your final result set? No?  Then don't include it as part of your reads!  This could mean creating an index on the columns you do need or putting your LOB data in a separate table and only joining to it when you need it.

-- Read 1,000,000 pages because your Products table has a LOB field and nothing besides a clustered index
SELECT ProductName FROM Products WHERE CreateDate > '2017-08-01'

-- Read in 1000 pages only because you put your lob column on a separate table.
-- Or you created an index that contains ProductName and not your LOB column.
-- Same query, better performance!
SELECT ProductName FROM Products WHERE CreateDate > '2017-08-01'

-- Need that LOB data?  Just join in the separate table.  Reads are large here, but at least you only are reading when you truly need that LOB data.
SELECT 
  p.ProductName, 
  pe.ProductExtendedProperties -- this is nvarchar(max)
FROM
  Products p
  INNER JOIN ProductLOBs pe
    ON p.ProductId = pe.ProductId

Fix suboptimal execution plans

It's possible that your query is already as selective as it can be.  Maybe you are getting too many reads because SQL Server is generating and using a suboptimal execution plan.

A big tip off that this might be happening is if your cardinality estimates are out of whack a.k.a. the estimated vs. actual row counts have a large difference between them:

2017-12-06_12-36-57

If SQL Server thinks it only is going to read 1 row of data, but instead needs to read way more rows of data, it might choose a poor execution plan which results in more reads.

You might get a suboptimal execution plan like above for a variety of reasons, but here are the most common ones I see:

If you had a query that previously ran fine but doesn't anymore, you might be able to utilize Query Store to help identify why SQL Server started generating suboptimal plans.

The key is to get a good execution plan so that you aren't performing unnecessary reads.

But what if you aren't encountering any of the problems above and performance is still slow due to high numbers of reads?  Simply...

Add an index

mike-marquez-342243 A cup of coffee and a shot of espresso might have the same caffeine content - espresso is just more caffeine dense, just like the data stored in a narrow index.  Photo by Mike Marquez on Unsplash

If you have an existing table that has many columns and you only need a subset of them for your query, then consider adding an index for those columns.

Indexes are copies of your data stored in a different order with generally fewer columns.  If SQL Server is able to get all of the information it needs from a narrow index, it will do that instead of reading the full table/clustered index.

A copy of the data that has fewer columns will have greater page density (or the amount of data that fits on each page).  If SQL Server can get all of the data it needs by reading fewer, denser pages then your query will run faster.

Don't just go adding indexes willy nilly though.  You may already have an index that almost contains all of the columns your query needs.  Look at a table's existing indexes first and see if any of them are close to what you need.  Usually, you'll be better off adding an included column or two to an existing index instead creating a whole brand new index.  And you'll save on disk space by not creating duplicate indexes either.

Reduce index fragmentation

So indexes are great for reducing reads because they allow us to store only the data that is needed for a specific query (both as key columns and included columns).  Fewer columns = greater density = fewer reads necessary.

However, indexes can become  fragmented.  There's internal fragmentation, which causes less data to be stored on a page than what is possible, and external fragmentation which causes the pages to be stored out of logical order on the disk.

Internal fragmentation is problematic because it reduces page density, causing SQL Server to have to read more pages in order to get all of the data it needs.

External fragmentation is problematic, especially for spinning disk hard drives, because SQL Server needs to read from all over the disk to get the data it needs.

In general, reorganizing or rebuilding an index are the typical ways you want to fix a fragmented index.

To slow down future fragmentation, you can test out different fill factors to try and prevent page splits from fragmenting your indexes.

Learning Through Blogging and Speaking

55ae8-1lh0mvkliatliiikt0vlyow

This post is a response to this month's T-SQL Tuesday prompt created by Mala Mahadevan. T-SQL Tuesday was created by Adam Machanic and is a way for SQL users to share ideas about interesting topics. This month's topic is Setting Learning Goals for 2018.


Watch this week's video on YouTube

What do I want to learn in 2018?

Next year I'm going to strive to learn more about tasks usually reserved for a DBA - backups, replication, user and role permissions, etc...

As a developer, I am generally isolated from the day-to-day work of a DBA.  Not that I necessarily want to be responsible for a system's backups (and restorations!), but I do feel that I am missing out on a broader understanding of SQL Server by never having to perform these tasks myself.

I like being able to do-all-of-the-things, or at least know how to do them, because I think it makes me a better developer and a better architect of solutions.  By not having an expert knowledge of things like availability groups, I feel like it hurts me when it comes time to architect solutions.

The learning process

I like to learn by doing.  Since I'm not planning on becoming a DBA anytime soon, I'll need to create my own scenarios to learn how these different DBA specific features of SQL Server work.

This won't give me real-world experience, but for the most part I am ok with that - I'm trying to learn more details on the broader concepts rather than the nitty gritty of various real-world scenarios.

The best way I've found to learn by doing in this academic sense is through blogging.  I like blogging because it forces me to learn a concept well enough to be able to articulate it back easily to my readers (hi mom!).

Blogging forces you to create demos that point out features that are critical to understanding a demo.  If I can create a demo that recreates a real-world scenario, then my understanding of that problem in the real-world will be pretty good; at least way better than if I just read a book or MSDN article.

Implementation

Blogging is all good and fun, but the best way I've found to really understand a topic is to present it.  If you can teach someone else then you have a pretty good handle on the content.

Now, I'm not aspiring to give a 500-level talk about disaster-recovery planning, but I do think I can learn enough to present some 100-level topics to SQL Server beginners who are just a few books online articles behind me.

The best part about presenting is that you will have people smarter than you in the room with you.  Or at least people who have a different perspective about the topic than you.  These people will generally ask really good questions that...you don't know the answer to!

But there's no shame in saying you don't know the answer.  It's actually a wonderful opportunity - after the presentation, you can take your time and learn the solution.  Then blog about it so others can know the answer too.  Then incorporate into your future presentations.  It's a beautiful cycle.

How to use Statistics IO to Improve Your Query Performance

2017-12-01_12-45-05

SQL Server's STATISTICS IO reporting is a great tool to help you performance tune queries.

Usually the goal of performance tuning is to make your query run faster.  One of the easiest ways to get a faster query is to reduce the amount of data a query is processing.  STATISTICS IO makes it easy to see how much data SQL Server is actually processing.

Specifically, the STATISTICS IO output helps with performance tuning because:

  1. The data it shows acts as a measuring stick for your performance tuning changes.
  2. It provides a good way of isolating the query changes you are making from other changes that may be happening on the server.

Watch this week's video on YouTube

So what's STATISTICS IO data look like?

To show IO statistics on your query, you first need to execute:

-- This applies to your current session only
SET STATISTICS IO ON;
GO

After running a query with with the above setting turned on, check SQL Server Management Studio's Messages tab to see output that looks something like this:

(157709 rows affected)
Table 'OrderLines'. Scan count 2, logical reads 0, physical reads 0, read-ahead reads 0, lob logical reads 159, lob physical reads 0, lob read-ahead reads 0.
Table 'OrderLines'. Segment reads 1, segment skipped 0.
Table 'Workfile'. Scan count 0, logical reads 0, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
Table 'Worktable'. Scan count 0, logical reads 225917, physical reads 0, read-ahead reads 0, lob logical reads 225702, lob physical reads 0, lob read-ahead reads 0.
Table 'Orders'. Scan count 1, logical reads 138, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
Table 'StockItems'. Scan count 1, logical reads 16, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.

The key things that my eyes are drawn to on initial examination of the STATISTICS IO output are the following:

  • Logical reads: The number of 8kB pages SQL Server had to read from the buffer cache (memory) in order to process and return the results of your query.  The more pages that need to be read, the slower your query.
  • Worktables/Workfiles: These are temporary objects that SQL Server creates in tempdb in order to process query results.  Although not always bad, it might indicate that SQL is doing more work than it needs to (perhaps an index could help?)
  • Lob Logical Reads: The number of large objects (e.g. varchar(max)) SQL is having to read.  I take the most cursory glance at this - if I'm returning high numbers of lobs, I might want to make sure I actually need them.  If not, I may add an index or move the lobs off to a separate table.

There are more properties in the STATISTICS IO output, but if we can significantly decrease the above three indicators then chances are good that we'll improve our query performance.

So why are these three indicators so useful?

Tracking performance changes

The main reason I like performance tuning with STATISTICS IO is because it makes it easy to create a baseline for my query performance.  When I make changes to the query, it's then easy to see if my changes helped or hurt the query.

The main metric I use for this is logical reads.  Logical reads refers to pages pulled from the cache (memory) versus physical reads which indicates the number of pages from disk.  However, all pages get loaded from disk into cache before SQL Server is able to use them.

This makes logical reads great for tracking performance changes because it clearly tells me how many 8kB pages in total SQL Server needed to read in order to my return my data.

If I add an index, does the total number of pages read go up or down?  Let me check my logical read counts and see.

What if I add some additional filtering or restructure my query?  I can easily tell if my changes hurt performance by seeing if the total number of logical reads went up or down.

Logical reads allow me to easily track the effectiveness of my tuning tactics.

The same concept applies to my worktable and logical lob read properties.  For the former, any time SQL Server is having to write data out to disk (tempdb in this case), performance will be slower.

In the latter case, if SQL Server is needing to move around large objects that comprise of multiple 8kB pages each, things will be slow.  If I can keep track of how many lob logical reads SQL Server is performing, then I can focus on removing that overhead from my query.

Isolating other factors that impact performance

For the same reason logical reads make it easy to track query performance after making changes, using logical reads makes it easy to mute other factors that might affect performance.

For example, I used to think that simply watching how long it took a query to run was a good indicator of helping me performance tune.  If the total number of seconds it took my query to run decreased, my changes helped improve performance!

This is a potentially deceptive way to measure performance though because what if during my first run the server was getting slammed by other queries?  Using elapsed run time isn't an effective way to measure performance.

Also, server environment hardware isn't always the same.  I might test a query in one environment and then deploy it to another.  My testing on an empty dev box might have been great, but as soon as the query runs in production along with all other queries, it might not perform as well.

More than one way to analyze a query

STATISTICS IO is a great place to start your performance tuning process.

It doesn't mean that you have to stop there though.  While being able to track the effects of your tuning changes and isolating other environment variables is important, ultimately you will have to use other means to actually improve performance.

So be sure to look at execution plans, dig into index and table statistics, rewrite the order of your table joins to see if it makes a difference, etc...  Just remember, performance always comes back to how much data SQL Server needs to process - reduce that and your queries will surely perform better.

My Favorite SSMS Shortcut (After Copy/Paste)

abhishek-desai-361493-1-e1511530049804

If there's one keyboard shortcut I use more than any other (with the exception of copy and paste) it would be the ALT + highlight multi-line edit/block selection shortcut.

Let's take a look at three ways the ALT + highlight shortcut lets you work faster in SQL Server Management Studio.

Watch this week's video on YouTube

1. Add results of one query into the IN() statement

Sometimes when writing an ad hoc query  you might want to take the results of one query and put them into an IN() statement of another query.

Sure, you can write a subquery to put into your IN() statement...but that's too much work for a one-time use disposable query.

What you can do instead is:

  1. Copy your values of interest
  2. Paste them into your IN() statement
  3. Hold down the ALT key while dragging the mouse down in front of all of your pasted values
  4. Type a comma (see video above for an easier demonstration).

This allows you to quickly turn pasted values into a list suitable for an IN() statement.

2. Paste values into a template of commands

Sometimes you might have to do something repetitive, like insert a bunch of values into a table.  You could script out multiple value insertion with a UNION ALL and and INSERT INTO statement, but once again that's a lot of work for a one-time query.

Instead you can copy the values you want to insert by highlighting them while holding down the ALT key then positioning the cursor on the first location where you want to paste the values.  Then once you paste, SSMS will automatically multiline paste the values.

Once again, watch the video above to see a demo - a picture (video!) is worth a thousand words.

3. Quickly modify multiple rows

Finally, you might have ran into a situation where you already have a set of queries written, but they all need to be slightly modified in order to work.  For example, maybe the table name changed and now you need to append a prefix to each table name.

While holding down the ALT key, you can simply drag in front of the text you want to prepend to and then start typing.

Voila! You just edited multiple lines in one go.