One SQL Cheat Code For Amazingly Fast JSON Queries


Watch this week's video on YouTube

Recently I've been working with JSON in SQL Server 2016 a lot.

One of the hesitations many people have with using JSON in SQL Server is that they think that querying it must be really slow — SQL is supposed to excel at relational data, not string parsing right?

It turns out that performance is pretty good with the standalone SQL Server JSON functions. Even better is that it's possible to make queries against JSON data run at ludicrous speeds by using indexes on JSON parsed computed columns. In this post I want to take a look at how SQL is able to parse* with such great performance.

*"Parse" here is actually a lie —it's doing something else behind the scenes. You'll see what I mean, keep reading!

Computed Columns in SQL Server

The only way to get JSON indexes working on SQL server is to use a computed column. A computed column is basically a column that performs a function to calculate its values.

For example, let's say we have a table with some car JSON data in it:

DROP TABLE IF EXISTS dbo.DealerInventory;
CREATE TABLE dbo.DealerInventory
  Year int,
  JsonData nvarchar(300)

INSERT INTO dbo.DealerInventory (Year, JsonData) VALUES (2017, '{ "Make" : "Volkswagen", "Model" : "Golf" }');

INSERT INTO dbo.DealerInventory (Year, JsonData) VALUES (2017, '{ "Make" : "Honda", "Model" : "Civic" }');

INSERT INTO dbo.DealerInventory (Year, JsonData) VALUES (2017, '{ "Make" : "Subaru", "Model" : "Impreza" }');

SELECT * FROM dbo.DealerInventory;

/* Output:
Id    Year     JsonData
----- -------- ---------------------------------------------
1     2017     { "Make" : "Volkswagen", "Model" : "Golf" }
2     2017     { "Make" : "Honda", "Model" : "Civic" }
3     2017     { "Make" : "Subaru", "Model" : "Impreza" }

We can add a new computed column to the table, "Make", which parses and extracts the Make property from each row's JSON string:

ALTER TABLE dbo.DealerInventory
ADD Make AS JSON_VALUE(JsonData, '$.Make');

SELECT * FROM dbo.DealerInventory;

/* Output:
Id Year  JsonData                                    Make
-- ----- ------------------------------------------- ----------
1  2017  { "Make" : "Volkswagen", "Model" : "Golf" } Volkswagen
2  2017  { "Make" : "Honda", "Model" : "Civic" }     Honda
3  2017  { "Make" : "Subaru", "Model" : "Impreza" }  Subaru

By default, the above Make computed column is non-persisted, meaning its values are never stored to the database (persisted computed columns can also be created, but that's a topic for a different time). Instead, every time a query runs against our dbo.DealerInventory table, SQL Server will calculate the value for each row.

The performance of this isn't great — it's essentially a scalar function running for each row of our output :(. However, when you combine a computed column with an index, something interesting happens.

Time to dive in with DBCC Page

DBCC Page is an undocumented SQL Server function that shows what the raw data stored in a SQL page file looks like. Page files are how SQL Server stores its data.

In the rest of this post we'll be looking at how data pages (where the actual table data in SQL is stored) and index pages (where our index data is stored) are affected by non-persisted computed columns — and how they make JSON querying super fast.

First, let's take a look at the existing data we have. We do this by first turning on trace flag 3604 and using DBCC IND to get the page ids of our data. Additional details on the column definitions in DBCC IND and DBCC PAGE can be found in Paul Randal's blog post on the topic.


-- "Sandbox" is the name of my database
DBCC IND('Sandbox','dbo.DealerInventory',-1);


If you look at the results above, row 2 contains our data page (indicated by PageType = 1) and the PagePID of that page is 305088 (if you are playing along at home, your PagePID is most likely something else). If we then look up that PagePID using DBCC PAGE we get something like this:



You can see our three rows of data highlighted in red. The important thing to note here is that our computed column of the parsed "Make" value is truly non-persisted and no where to be found, meaning it has to get generated for every row during query execution.

Now, what if we add an index to our non-persisted computed column and then run DBCC IND again:

CREATE NONCLUSTERED INDEX IX_ParsedMake ON dbo.DealerInventory (Make)

DBCC IND('Sandbox','dbo.DealerInventory',-1);


You'll now notice that in addition to data page 305088 (PageType = 1), we also have an index page 305096 (PageType = 2). If we examine both the data page and the index page we see something interesting:



Nothing has changed with our data page:


But our index page contains the parsed values for our "Make" column:


What does this mean? I thought non-persisted computed columns aren't saved to disk!

Exactly right: our non-persisted computed column "Make" isn't saved to the data page on the disk. However if we create an index on our non-persisted computed column, the computed value is persisted on the index page!

This is basically a cheat code for indexing computed columns.

SQL will only compute the "Make" value on a row's insert or update into the table (or during the initial index creation) — all future retrievals of our computed column will come from the pre-computed index page.

This is how SQL is able to parse indexed JSON properties so fast; instead of needing to do a table scan and parsing the JSON data for each row of our table, SQL Server can go look up the pre-parsed values in the index and return the correct data incredibly fast.

Personally, I think this makes JSON that much easier (and practical) to use in SQL Server 2016. Even though we are storing large JSON strings in our database, we can still index individual properties and return results incredibly fast.

How to fix your terribly slow SQL job

This is a real-time progress bar for some of my old jobs. They are still running and stuck at 3%.

How many times have you written a program, ETL, analysis job, etc… that seemed like it would never finish running?

Although poor performance can be caused in a multitude of ways, the easiest to fix is by reducing your data in SQL Server instead of your in your programming/ETL/analysis layer (Excel, R, SAS, Python, ..NET, etc…).

SQL is built to handle and process data extremely efficiently. You will usually experience much better performance the more work (data merging, transformations, etc…) you can do to your data on the SQL server. I say "usually" because SQL won't always be faster than a programming language at transforming data, but 9 times out of 10 you can get faster results straight on the SQL Server.

Watch this week's video on YouTube

Let's look at one of my crappy processes

How many of us have ever written a process that does something like this:

1. Write the most basic query possible, something like SELECT * FROM dbo.User

2. Take the output of the above query, load it into Excel/SAS/Python/.NET/etc…

3. Write some code to filter the dataset

4. Write some code to summarize the data, transform columns, etc…

5. Write another SELECT * FROM dbo.Sale against the SQL Server to bring in more data

6. Bring it into Excel/SAS/Python/.NET/etc… and merge it with our original data

7. Repeat steps 3–6 as many times as needed

Some of my earliest PHP and MySQL websites worked exactly like this 😳! The code was slow on my server and users ended up suffering with slow webpage load times.

If the above process even slightly resembles something you've written before, continue reading on…

Why bother learning to transform data in SQL? I already know how to do that stuff in .

Old habits are hard to break, but you do want to make your processes run faster, right? This stuff is all easy, I promise!

Basically, if you are running code similar to above, the reason your job is slow is because you are not optimizing where your work is being performed:

  • Every time you write SELECT * you probably are bringing back more data than you actually need — you are hurting your performance.
  • Every time you don't have a WHERE clause, you are hurting your performance.
  • Every time your process queries the database multiple times (ie. multiple SELECT statements in your job to bring back data), you are hurting your performance.

In case you missed it, not taking the time to filter and reduce your data down as much as possible in your SQL is hurting your performance! Assuming your SQL Server and your programming layer are on different machines, you lose lots of time transferring unnecessary data over the wires (or air) as well as not efficiently using all of the advantages that your SQL server offers.

What's the solution to this inefficient processing?

Process your data on the SQL Server!

If you are not filtering, joining, and transforming your data until your programming layer, you are likely losing valuable SQL performance power and network efficiency. Here are some easy ways to reduce the size of your dataset on the SQL Server to improve performance in your jobs (and make your coworkers envious of your skills)!

SELECT [ColumnName]

If you are using SELECT *, stop!

SELECT * brings back all of the columns on your table, including the ones you don't need. This increases the amount of data sent over the network (which doesn't even get used) as well as increases the amount of data that needs to be read from disk (and storage hardware is usually relatively slow). Not to mention if your table is using indexes, SELECT * most likely causes some of those indexes not to be used as efficiently (or at all) which causes your queries to slow down even further.

But what if you do need all of the columns on a particular table? You still shouldn't use SELECT *! Although there's no performance difference, using SELECT * just means you are taking on technical debt. In the future, when a column gets added or removed from your table, your downstream processes may break because they are now automatically receiving (or no longer receiving) that column. Do you want to have to fix a failing process in the future because its now receiving more data that it was expecting? I don't think so!


My inefficient process example above starts with selecting some data and bringing it into my programming environment. The process then runs another query to bring in additional data and joins it to the data from my first query in my programming environment.

This is terrible!

First off, we are breaking the first principle we learned in the SELECT * section above — we are bringing back more data than we need! If we are using INNER JOIN on our two datasets, we most likely are going to be filtering out some data — data we don't need. Joining on the SQL server first will reduce our total dataset size and make our network and disk performance more efficient.

Even if we are doing something like a LEFT or FULL OUTER join where we will be keeping all of the data from one or both of our datasets, it still benefits us to perform this join on the SQL Server. Why you ask? Because the people who built SQL Server have spent hundreds or thousands of hours performance tuning and debugging their joining algorithms. The chances that you will be able to write a more efficient join algorithm is highly unlikely.

And even if you are a programming savant, why reinvent the wheel? Unless your app needs every last microsecond of performance, just use SQL Server for what it's really good at: relational data joining.

WHERE Clauses

Let's say our dbo.User table has 50 thousand rows and our dbo.Sale table has 1 million rows. If your process is only looking for active users and sales from the past month, let's say 2 thousand rows and 22,000 rows respectively, then you are causing SQL to lookup and transfer 95% more rows than your process needs. Not only does it kill network performance, but your program layer then needs to filter out this data, doing extra work that it probably can't do as efficiently as SQL Server.

If instead I would have just added predicates to the SQL WHERE clause like Active=1 and SalesDate >= DATEADD(month, -1, GETDATE()) we would have saved both time and bandwidth.

Aggregate Functions

You know what's better than sending 10,000 rows of data over the network and then summing them up in your programming layer?

Using SQL's SUM() aggregate function to reduce those 10,000 rows to just 1 row before sending it across the network.

SQL aggregate functions take many rows of data and consolidate them down into fewer rows.

SQL's aggregate functions are also flexible enough to use the OVER() clause, allowing for windowed sets within your data — basically allowing you to be even more flexible with how you aggregate your data.

Don't wait until your application layer to summarize parts of your data — do it in your SQL query instead.

Scalar Functions

Although aggregate functions do some serious heavy lifting, scalar functions that run on each row of data aren't anything to laugh at either. Although they won't reduce the number of rows in your output, they can certainly reduce the number of columns you are outputting.

For example, say you have multiple columns of data in your dataset that ultimately need to be combined into a single output column. It's much better to use ISNULL(), COALESCE(), or CASE to combine multiple columns into a single column with logic in your SQL query so less data needs to be transferred later.

Once again, reducing the amount of data you are sending over the network is key to getting faster run times.

XML and JSON Functions

Last but not least, if your process is generating XML or JSON data at some point, consider generating that data on the SQL Server. Now, generating XML and JSON data won't always improve your performance — SQL Server is best at relational tasks and not large string creation — but in many cases, especially with JSON, SQL Server can outperform even the fastest .NET libraries.

If your network is your bottle neck, then it is very possible that SQL can apply complex logic and transform your data into XML or JSON faster on the SQL Server than if you needed to transfer all of that data to another location on the network and handle those transformations in another programming language.

In short: do as much work as possible in SQL

If your SQL queries could be following any of the above techniques and they're not, then fix them…today! Checking each of your queries for any of the above inefficiencies and mitigating them will probably (always test your changes) improve the performance of your applications and processes.

And then it won't feel like your process is taking forever to run.

Three Changes You Should Always Make Before Checking In Any Code

Photo by Maico Amorim

This story originally appeared in Hacker Noon on March 11, 2017.

A code refactor always leaves me with a feeling of accomplishment. Although major refactorings are the most satisfying, every project doesn't always warrant them. Here are 3 quick and easy refactorings that I make to all of my projects in order to improve code readability:

1. Clean up formatting

The overall format of your code is what makes it possible to quickly navigate to areas of interest. Consistent indentation, line breaks, and patterns help programmers skim large chunks of code. Take the following sloppily formatted code for example:

Inventory inventory = new Inventory();
for (int i = 0; i < cars.Count; i++){

    var owner = owners.Where(x => x.VIN == cars[i].VIN).OrderByDescending(x => x.PurchaseDate).FirstOrDefault();

inventory.PreviousOwners.Add(new Owner { VIN = cars[i].VIN,
                                        Name = owner.Name});

and compare it to this:

Inventory inventory = new Inventory();

for (int i = 0; i < cars.Count; i++)

    var owner = owners.Where(x => x.VIN == cars[i].VIN)
                        .OrderByDescending(x => x.PurchaseDate)

inventory.PreviousOwners.Add(new Owner 
                                        VIN = cars[i].VIN,
                                        Name = owner.Name

The second example above consistently indents lines, adds new lines, and follows consistent coding patterns. This makes it easy to skim the code quickly.

Books have chapters, headings, and paragraphs defined by formatting that make it easy to find what is needed at a glance — formatting code makes it possible to find things easily too.

2. Rename classes, methods, and variables

Classes, methods, and variables should be named in such that they help the programmer understand what is happening in the code. The shorter the scope of an object the more permissible it is to use shorter names (eg. "i" as a counter in a loop that's only a line or two long).

It's easy to use uninformative names when writing a "first draft" of your program, but using the first name that comes to mind isn't always the best choice. Take a look at the following example:

public IEnumerable<string> GetData(int year)
    var result = CallApi("/Cars", year);

IEnumerable<string> output = new IEnumerable<string>();

foreach(var c in result)


public IEnumerable<string> RetrieveCarMakes(int year)
    var inventory = CallApi("/Cars", year);

IEnumerable<string> carMakes = new IEnumerable<string>();

    foreach(var car in inventory)

Using names that make sense make it much easier for someone else (or your future-self) to figure out what your code is doing.

3. Break up long expressions

When you get into a code writing groove it's easy to keep chaining commands together or using single-line syntax to speed up your writing. Often times, I look back on my code a day later and I see long expressions similar to this. Take a look at this two line expression:

bool hasHighSaleProbability = (daysOnLot < 60) ? true : (color == "Red" ? true : false);

var highSaleProbabilityVehicles = Inventory.Where(x => x.DaysOnLot < 60 or x.Color == "Red").Select(x => new { Make = x.Make, Model = x.Model, Year = x.Year }).Distinct().Select(x => new RecentInventoryView { YearDropdown = new SelectListItem { Text = x.Year, Value = x.Year }, MakeModelDropdown = new SelectListItem { Text = x.Make + " " + x.Model, Value = x.Make + " " + x.Model } });

Compared against this expression that has been broken out across multiple lines:

bool hasHighSaleProbability = false;

if (daysOnLot < 60 || color == "Red")
    hasHighSaleProbability = true;

var distinctMakesModelsYears = Inventory
    .Where(x => x.DaysOnLot < 60 or x.Color == "Red")
    .Select(x => new 
    Make = x.Make, 
    Model = x.Model, 
    Year = x.Year 

var recentInventoryView = distinctMakesModelsYears
    .Select(x => new RecentInventoryView 
        YearDropdown = new SelectListItem 
        Text = x.Year, 
        Value = x.Year 
        MakeModelDropdown = new SelectListItem 
        Text = x.Make + " " + x.Model, 
        Value = x.Make + " " + x.Model 

Although the first code snippet is technically shorter and has fewer lines, it is nearly unreadable. The second snippet breaks out the the if logic and breaks up all of the chained methods into more logical variables. The result? Code logic that is much easier to follow.

How to write understandable code for your future self


This story originally appeared in Hacker Noon on February 10, 2017.

I can't tell you the number of times the title of this post has crossed my mind as I dug through a piece of code that I hadn't touched in years.

At the time I wrote it, I probably thought my code was beautiful. An elegant masterpiece. It should have been printed, framed, and hung on a wall of The Programming Hall of Fame. As clever as I thought I may have been a few years ago, I rarely am able to read my old code without some serious time wasted debugging.

This problem plagued me regularly. I tried different techniques to try and make my code easier to understand.

First I tried adding comments to my code. Pretty easy to do, but not all that helpful.

When comments weren't cutting it, I tried to write self-documenting code instead: small, well-named classes and methods that described their limited functionality. This made the code more readable but I would still have questions.

I also tried documenting my code in a separate file. This had its benefits but didn't solve the problem entirely either.

Eventually I figured out what I needed to do: I needed to use all three of the above techniques to write truly beautiful and understandable code.


Comments in your code should document the why, not the how. When I first started programming, I would often write very unhelpful comments like this:

public class Class1
  public List<string> DoWork(List<string> a)
    List<string> numbers = new List<string>();

    // Loop over data
    for (int i = 1; i < a.Count; i++)
      int s = a[i].IndexOf(" ");
      string num = a[i].Substring(0,s);

      // Save data

    return numbers;

"Loop over data"? "Save data"? These comments are beneficial to understanding the code. I can easily tell that I have a loop, and that I'm adding my data to a collection, why should I waste valuable screen real estate with unhelpful comments?

Instead of saying what or how, comments should explain why. A programmer will see the for loop and know that it's looping over some type of collection of Addresses. However, a programmer will not know why we are starting our counter with int i = 1 — this is where adding a comment can improve the understanding of the code:

// i = 1 because the view will never display the first address
for (int i = 1; i < Addresses.Count; i++) {...

Now, we know some of the business logic driving our app. We know we don't process the first address because it never gets outputted to our view. This comment answers the why behind skipping the first address, adding clarity to the code.

Additionally, we remove the // Save data comment completely since it adds no insightful value.


Comments alone won't make code easy to reinterpret however. Let's take at our method again with our improved comments:

public class Class1
  public List<string> DoWork(List<string> a)
    List<string> numbers = new List<string>();

    // i = 1 because the view will never display the first address
    for (int i = 1; i < a.Count; i++)
      int s = a[i].IndexOf(" ");
      string num = a[i].Substring(0,s);


    return numbers;

What exactly is Class1? What kind of work isDoWork() doing? What about the use of int s? The names of the objects in our code don't aid in our understanding what this code is doing.

This is where the idea of self-documenting code comes in: instead of creating objects with arbitrary, non-informative names ("I swear I'll refactor this later"), we build descriptive objects. If I have a class, its name should give me a good idea about what its properties and methods could be. A method's name should be descriptive enough to tell me what I should expect as an output without having to dig into the details of what that method is doing. Variables should add additional illumination that make what and how type comments obsolete.

In our example, let's make our code self-documenting. First, this class is intended to help us clean address data. Let's call itAddressStandardizer. With that simple renaming we know that all of the properties and methods of this class should pertain to dealing with dirty address data and making it cleaner.

What about the method name List<string> DoWork(List<string> a)? Well , I can tell you that this method is trying to parse out the number portion of a street address. So let's change the method name and signature to something more informative, like List<string> ParseHouseNumbers(List<String> addresses). Now we can make an educated guess that this method accepts some address strings as an input and and it will return a list of parsed house numbers.

If we clean up some variable names, our code becomes much easier to read, like this:

public class AddressStandardizer
  public List<string> DoWork(List<string> addresses)
    List<string> houseNumbers = new List<string>();

    // i = 1 because the view will never display the first address
    for (int i = 1; i < addresses.Count; i++)
      int firstSpaceIndex = addresses[i].IndexOf(" ");
      string houseNumber = addresses[i].Substring(0,firstSpaceIndex);


    return houseNumbers;


Our code is finally starting to shape up. We have comments explaining why we chose to do something and we refactored our code to have object names that are informative.

The code at this point is ok but not perfect. If we don't look at this code for a few years, we probably have enough information now to look at the code and figure out what it's doing with relative ease.

The big piece of information that we are still missing however is knowing why this code was written in the first place.

Often times, I get a question from a manager or analyst about why we decided to build the project in the first place. Or I'll get a request for information about how the logic in the program works. Without a proper documentation file, the best thing I can do is send the business user a copy of my code. Most of the time that isn't very helpful.

What would be helpful though is an explanation of what our program is doing at a high-level. This is the purpose of formal documentation.

The documentation for this section of code might look something like this:

…After retrieving our customer information from our vendor, the program processes the data and cleans it up to load into our reporting warehouse. Cleaning up the data means parsing the addresses into multiple columns including house number, street name, street suffix, city, state, and postal code…

Now, when a business user needs to know what your program is doing, you can easily send them the above documentation their way. The documentation also acts as a nice refresher for you, the programmer, when it comes time to revisit the code, as well as any future coworkers who will be new to the project.

Wrap up

All of these techniques are necessary to eliminate code headaches down the road. Learn from my experience — not doing all three may save a little bit of time in the short term but it will hurt at some point in the future. Once you get in the habit of writing all three kinds of documentation, it will become second nature and make your life (and the lives of your future-self!) much easier.

Comments in code should explain the why not the how:

  • The how should be explained by well named classes and methods
  • Separate documentation still needed for developers and non-developers alike. Think of business users who need to know how your process works and the business logic that is in it; nice to have high-level document explaining the business uses of your process that someone non-technical can understand.

Will Technology Eliminate Your Tech Job?


This post is a response to this month's T-SQL Tuesday prompt. T-SQL Tuesday was created by Adam Machanic and is a way for SQL users to share ideas about interesting topics. This month's topic is The Times They Are A-Changing.

I think everyone's had the same fear at some point in their career: "Am I going to lose my job because of X?" X can be a variety of things — company reorganizations, positions being outsourced, robotic automation, new software advancements, etc…

I think the answer to this question depends 100% on the type of individual you are and nothing to do with what your job actually is (was).

Being a Linchpin

Seth Godin discusses the concept of a Linchpin in his same-titled book. A Linchpin is someone who is so good at what they do that they become indispensable to their organization. Linchpins are the kind of people who are self-motivated and are able to consistently deliver quality work. They are integral to the operation of a business, even if they don't get all of the glamour of having VP or Director in their title.

And why are Linchpins always guaranteed jobs? In one scenario, Linchpins will outgrow their role and be promoted or find a better job. They are always learning and growing in addition to delivering, and so this is the natural procession. In the alternate scenario, if the Linchpin has to lose his or her current job (ie. think company buyouts where entire departments close), they will either 1) become promoted to elsewhere in the company because management recognizes their great skills or 2) they will have no problem finding work elsewhere, especially with great recommendations from their former employer.

The Cloud, SaaS, PaaS, and other technologies

The past few years have seen many new technologies come into the SQL professional's workspace. Administrators now have the ability to manage their server instances online in the cloud and use new features and functionality that weren't previously available in local-network only instances. Developers also have new tools to interact with cloud instance, but also have totally new functionality available to them from a variety of online services.

As of now, I think most of these new advancements augment our current technology instead of replace it. I think this means that some professionals will choose to not learn about them or how to use them. And it's really easy to justify not learning them — it can be hard for some to find the time to learn something that they can't immediately use.

However, some professionals will be excited and will learn about these new technologies. Even if their environments don't need to use cloud platforms and other new features, they will find small areas in their environment that can use these technologies so they start getting experience using them. Worst case, even if it's not possible to modify something existing with these new tools, these professionals will create sandboxes for themselves and learn to use some of these technologies anyway. By doing this, they will be more confident in using these tools when the time necessitates that they be used.

When it's time to be promoted or to switch jobs, which of the two professionals is more likely to get hired — the one who knows only his or her old technology really well, or the professional who has taken the time to learn these new features even if they didn't have to use them in their old environment?

Is my role of business intelligence developer going to disappear?

I'm a professional learner. Officially I'm a business intelligence developer, but unofficially I also am a web developer, manager, DBA, and electrical engineer. I don't pretend that I am an expert in all of those unofficial capacities (or even the official one!), but I do continually try to improve myself in all of those roles.

Do I worry about having new technologies replace my current job role? No. I do think the tools I use today will be outdated and replaced at some point in the future though.

I imagine some future version of SSRS will be able to generate the majority of the reporting needed for my database based off metadata. Data will continue to evolve and live in environments other than just SQL Server, making my need for SSIS less important — I'll have to learn other ways to transform data, whether through C#, Python, some cloud querying tool, or all of the above. I'll have to get used to not only using data from databases and flat files, but also mixing in data from APIs and cloud storage. Some of this data will be relational but a lot of it will not.

And all of that sounds exciting! Learning new ways of working with data is a thrill because it means I won't get bored working on the same thing year after year. Sure, 10 years from now new technologies will replace my current job — fortunately for me though, by that point I'll be working with those new technologies.