Data with Bert logo

Converting JSON to SQL Server CREATE TABLE Statements

Watch this week's video on YouTube

Tedious, repetitive tasks are the bane of any lazy programmer.  I know, because I am one.

One such repetitive task that I find comparable to counting grains of rice is building database layouts from JSON data sources.

While some online services exist that will parse JSON objects into database structures, I don't like using them because I don't trust the people running those sites with my data.  Nothing personal against them, I just don't want to be passing my data through their servers.

My solution to this problem was to write a query that will parse my unfamiliar JSON documents into a series of CREATE TABLE statements.

Automatically Generating A SQL Database Schema From JSON

You can always get the most recent version of the query from GitHub, but I'll post the current version below so that it's easier to explain in this post:

/*
This code takes a JSON input string and automatically generates
SQL Server CREATE TABLE statements to make it easier
to convert serialized data into a database schema.

It is not perfect, but should provide a decent starting point when starting
to work with new JSON files.

A blog post with more information can be found at https://bertwagner.com/2018/05/22/converting-json-to-sql-server-create-table-statements/
*/
SET NOCOUNT ON;

DECLARE 
    @JsonData nvarchar(max) = '
        {
            "Id" : 1,
            "IsActive":true,
            "Ratio": 1.25,
            "ActivityArray":[true,false,true],
            "People" : ["Jim","Joan","John","Jeff"],
            "Places" : [{"State":"Connecticut", "Capitol":"Hartford", "IsExpensive":true},{"State":"Ohio","Capitol":"Columbus","MajorCities":["Cleveland","Cincinnati"]}],
            "Thing" : { "Type":"Foo", "Value" : "Bar" },
            "Created_At":"2018-04-18T21:25:48Z"
        }',
    @RootTableName nvarchar(4000) = N'AppInstance',
    @Schema nvarchar(128) = N'dbo',
    @DefaultStringPadding smallint = 20;

DROP TABLE IF EXISTS ##parsedJson;
WITH jsonRoot AS (
    SELECT 
        0 as parentLevel, 
        CONVERT(nvarchar(4000),NULL) COLLATE Latin1_General_BIN2 as parentTableName, 
        0 AS [level], 
        [type] ,
        @RootTableName COLLATE Latin1_General_BIN2 AS TableName,
        [key] COLLATE Latin1_General_BIN2 as ColumnName,
        [value],
        ROW_NUMBER() OVER (ORDER BY (SELECT 1)) AS ColumnSequence
    FROM 
        OPENJSON(@JsonData, '$')
    UNION ALL
    SELECT 
        jsonRoot.[level] as parentLevel, 
        CONVERT(nvarchar(4000),jsonRoot.TableName) COLLATE Latin1_General_BIN2, 
        jsonRoot.[level]+1, 
        d.[type],
        CASE WHEN jsonRoot.[type] IN (4,5) THEN CONVERT(nvarchar(4000),jsonRoot.ColumnName) ELSE jsonRoot.TableName END COLLATE Latin1_General_BIN2,
        CASE WHEN jsonRoot.[type] IN (4) THEN jsonRoot.ColumnName ELSE d.[key] END,
        d.[value],
        ROW_NUMBER() OVER (ORDER BY (SELECT 1)) AS ColumnSequence
    FROM 
        jsonRoot
        CROSS APPLY OPENJSON(jsonRoot.[value], '$') d
    WHERE 
        jsonRoot.[type] IN (4,5) 
), IdRows AS (
    SELECT 
        -2 as parentLevel,
        null as parentTableName,
        -1 as [level],
        null as [type],
        TableName as Tablename,
        TableName+'Id' as columnName, 
        null as [value],
        0 as columnsequence
    FROM 
        (SELECT DISTINCT tablename FROM jsonRoot) j
), FKRows AS (
    SELECT 
        DISTINCT -1 as parentLevel,
        null as parentTableName,
        -1 as [level],
        null as [type],
        TableName as Tablename,
        parentTableName+'Id' as columnName, 
        null as [value],
        0 as columnsequence
    FROM 
        (SELECT DISTINCT tableName,parentTableName FROM jsonRoot) j
    WHERE 
        parentTableName is not null
)
SELECT 
    *,
    CASE [type]
        WHEN 1 THEN 
            CASE WHEN TRY_CONVERT(datetime2, [value], 127) IS NULL THEN 'nvarchar' ELSE 'datetime2' END
        WHEN 2 THEN 
            CASE WHEN TRY_CONVERT(int, [value]) IS NULL THEN 'float' ELSE 'int' END
        WHEN 3 THEN 
            'bit'
        END COLLATE Latin1_General_BIN2 AS DataType,
    CASE [type]
        WHEN 1 THEN 
            CASE WHEN TRY_CONVERT(datetime2, [value], 127) IS NULL THEN MAX(LEN([value])) OVER (PARTITION BY TableName, ColumnName) + @DefaultStringPadding ELSE NULL END
        WHEN 2 THEN 
            NULL
        WHEN 3 THEN 
            NULL
        END AS DataTypePrecision
INTO ##parsedJson
FROM jsonRoot
WHERE 
    [type] in (1,2,3)
UNION ALL SELECT IdRows.parentLevel, IdRows.parentTableName, IdRows.[level], IdRows.[type], IdRows.TableName, IdRows.ColumnName, IdRows.[value], -10 AS ColumnSequence, 'int IDENTITY(1,1) PRIMARY KEY' as datatype, null as datatypeprecision FROM IdRows 
UNION ALL SELECT FKRows.parentLevel, FKRows.parentTableName, FKRows.[level], FKRows.[type], FKRows.TableName, FKRows.ColumnName, FKRows.[value], -9 AS ColumnSequence, 'int' as datatype, null as datatypeprecision FROM FKRows 

-- For debugging:
-- SELECT * FROM ##parsedJson ORDER BY ParentLevel, level, tablename, columnsequence

DECLARE @CreateStatements nvarchar(max);

SELECT
    @CreateStatements = COALESCE(@CreateStatements + CHAR(13) + CHAR(13), '') + 
    'CREATE TABLE ' + @Schema + '.' + TableName + CHAR(13) + '(' + CHAR(13) +
        STRING_AGG( ColumnName + ' ' + DataType + ISNULL('('+CAST(DataTypePrecision AS nvarchar(20))+')','') +  CASE WHEN DataType like '%PRIMARY KEY%' THEN '' ELSE ' NULL' END, ','+CHAR(13)) WITHIN GROUP (ORDER BY ColumnSequence) 
    + CHAR(13)+')'
FROM
    (SELECT DISTINCT 
        j.TableName, 
        j.ColumnName,
        MAX(j.ColumnSequence) AS ColumnSequence, 
        j.DataType, 
        j.DataTypePrecision, 
        j.[level] 
    FROM 
        ##parsedJson j
        CROSS APPLY (SELECT TOP 1 ParentTableName + 'Id' AS ColumnName FROM ##parsedJson p WHERE j.TableName = p.TableName ) p
    GROUP BY
        j.TableName, j.ColumnName,p.ColumnName, j.DataType, j.DataTypePrecision, j.[level] 
    ) j
GROUP BY
    TableName


PRINT @CreateStatements;

In the variables section, we can define our input JSON document string as well as define things like a root table name and default database schema name.

There is also a string padding variable.  This padding variable's value is added to the max value length found in each column being generated, giving each column a little bit more breathing room.

Next in the script is the recursive CTE that parses the JSON string.  The OPENJSON() function in SQL Server makes this part relatively easy since some of the work of determining datatypes is already done for you.

I've taken the liberty to convert all strings to nvarchar types, numbers to either floats or ints, booleans to bits, and datetime strings to datetime2s.

Two additional CTE expressions add an integer IDENTITY PRIMARY KEY column to each table as well as a column referencing the parent table if applicable (our foreign key column).

Finally, a little bit of dynamic SQL pieces together all of these components to generate our CREATE TABLE scripts.

Limitations

I created this code with a lot of assumptions about my (unfamiliar) JSON data sets.  For the purpose of roughly building out tables from large JSON files, I don't need the results to be perfect and production-ready; I just want the results to be mostly correct so the vast majority of tedious table creation work is automated.

With that disclaimer made, here are a few things to be aware of:

  • Sometimes there will be duplicate column names generated because of naming - just delete one.
  • While foreign key columns exist, the foreign key constraints don't.
  • This code uses STRING_AGG.  I'll leave it up to you to convert to STUFF and FOR XML PATH if you need to run it in versions prior to 2017.

Summary

This script is far from perfect.  But it has eliminated the need for me to build out these tables and columns from scratch.  Sure, the output sometimes needs a tweak or too, but for my purposes I'm happy with how it turned out.  I hope it helps you eliminate some boring table creation work too.

Why Is My VARCHAR(MAX) Variable Getting Truncated?

Watch this week's video on YouTube

Watch this week's episode on YouTube.

Sometimes SQL Server doesn't do what you tell it to do.

Normally that's ok - SQL is a declarative language after all, so we're supposed to tell it what we want it to do, not how we want it done.

And while that's fine for most querying needs, it can become really frustrating when SQL Server decides to completely disregard what you explicitly asked it to do.

Why Is My VARCHAR(MAX) Truncated to 8000 Characters?

A prime example of this is when you declare a variable as VARCHAR(MAX) because you want to assign a long string to it.  Storing values longer than 8000 characters long is the whole point of VARCHAR(MAX), right?

DECLARE @dynamicQuery VARCHAR(MAX);

SET @dynamicQuery = REPLICATE('a',8000) + 'b'

SELECT @dynamicQuery as dynamicQueryValue, LEN(@dynamicQuery) AS dynamicQueryLength

If we look at the above query, I would expect my variable @dynamicQuery to be 8001 characters long; it should be 8000 letter 'a's followed by a single letter 'b'.  8001 characters total, stored in a VARCHAR(MAX) defined variable.

But does SQL Server actually store all 8001 characters like we explicitly asked it to?

No:

2018-05-13_18-31-03

First we can see that the LEN() of our variable is only 8000 - not 8001 - characters long!

2018-05-13_18-31-28

Copying and pasting our resulting value into a new query window also shows us that there is no character 'b' at position 8001 like we expected.

The Miserly SQL Server

The reason this happens is that SQL Server doesn't want to store something as VARCHAR(MAX) if none of the variable's components are defined as VARCHAR(MAX).  I guess it doesn't want to store something in a less efficient way if there's no need for it.

2018-05-12_06-44-53

However, this logic is flawed since we clearly DO want to store more than 8000 characters.  So what can we do?

Make Something VARCHAR(MAX)

Seriously, that's it.  You can do something like CAST the single character 'b' as VARCHAR(MAX) and your @dynamicQuery variable will now contain 8001 characters:

2018-05-14_18-06-44

But casting a single character as VARCHAR(MAX) isn't very intuitive.

Instead, I recommend casting a blank as VARCHAR(MAX) and prefixing it to the start of your variable string.  Leave yourself a comment for the future and hopefully you'll remember why this superfluous looking piece of code is needed:

-- using CAST('') to force SQL to define
-- as varchar(MAX)
SET @dynamicQuery =  CAST('' AS varchar(MAX))
    + REPLICATE('a',8000)+ 'b'

Contributing to Community

MJ-t-sql-TuesdayThis post is a response to this month's T-SQL Tuesday #102 prompt by Riley Major. T-SQL Tuesday is a way for SQL Server users to share ideas about different database and professional topics every month.

The prompt I've chosen to write about this month is how and why I got started contributing to the SQL Server community.


Watch this week's video on YouTube

About a year ago, I was determined to improve my presentation skills.  I knew that in order to do that I needed to get more practice speaking.

I already was at my max for presenting at local user groups, conferences, etc... because at some point it becomes too cost and time prohibitive to travel to more events.  As an alternative, I decided that if I couldn't get more practice by speaking in person, I could at least film myself presenting.

And I figured if I'm already filming myself presenting, I might as well put a little extra polish on it and make the content available for others to watch.

And that is how I started filming weekly videos about SQL Server.

SQL Server Videos

There are already plenty of great SQL Server presentations on YouTube, spanning a plethora of topics from a variety of experts who know way more about SQL Server than me.

Whenever I want to learn about a SQL Server topic, I search for something like "SQL Server backups" or "SQL Server columnstore indexes" on YouTube.  There are plenty of great recorded presentations, virtual chapter screencasts, Q&As, and other tutorials for learning almost any topic you can imagine.

However, sometimes I'm not in the mood to watch in-depth hour long presentations.  Sometimes I want to watch a short, informative, regularly scheduled entertaining SQL videos - and this is where I saw a gap in programming.

So what better way to get what you want than by scratching your own itch.  I figured if I want to watch that type of SQL Server video, then I'm sure other people out there want to watch those same kinds of short SQL videos too.

Bert, the Director

When I was a kid, I wanted to be a movie maker.  In particular, I was entranced by special effects, so I made movies with friends that involved plenty of lightsabers, explosions, and green screen effects all throughout middle school and high school.

So while making SQL videos wasn't going to be totally new territory, I sure was unprepared for all of the initial work involved.

For the first three months, I was spending 15-20 hours per week writing, creating demos, shooting, editing, publishing, and marketing my videos.  Over time I've cut this process down to 8-10 hours a week, a more manageable amount of work that I can mostly get done on weekend mornings before the rest of the house wakes up.

Results

Making videos about SQL Server has been an amazing experience.  Not only do I personally feel fulfilled creating something week after week that improves my own skills, but it's rewarding to receive positive feedback via comments, messages, and emails that I'm also helping others become better SQL developers.

Contributing has also made me appreciate how amazing the #sqlfamily community truly is.  Everyone I talk to is wonderful and supportive, and everyone I meet wants to see one another succeed.

Your Turn

If you aren't already, I hope you consider contributing to the community .  Whether it be via blog posts, code contributions, presenting, tweeting, or making videos, giving back to the SQL Server community will grow your own skills and allow you to meet some really great people.

It can be scary putting yourself out there publicly, but don't let that stop you.  If you give it your best then the SQL Server community won't let you down.

Is It Possible To Conditionally Index JSON Data?

Watch this week's video on YouTube

Recently I received a great question from an attendee to one of my sessions on JSON (what's up Nam!):

2018-04-25_15-58-21

At first glance it sounds like a filtered index question, and ultimately it is, but because of some of the intricacies involved in the response I thought it would make for a good blog post.

The Problem: Schema On Read

Imagine I have a central table that keeps track of warnings and errors for my burrito ordering app:

DROP TABLE IF EXISTS dbo.BurritoAppLog;
GO

CREATE TABLE dbo.BurritoAppLog 
( 
    Id int IDENTITY PRIMARY KEY,
    ErrorDetails nvarchar(1000)
); 
GO 

INSERT INTO dbo.BurritoAppLog VALUES (N'{"Type":"Warning", "MessageId": 100, "Severity": "High", "Information":"Running low on steak." }'); 
INSERT INTO dbo.BurritoAppLog VALUES (N'{"Type":"Warning", "MessageId": 50, "Severity": "Low", "Information":"Running low on queso." }');
GO 4000
INSERT INTO dbo.BurritoAppLog VALUES (N'{"Type":"Error", "MessageId": 10, "User":"Bert", "ErrorMessage":"Lettuce not available." }'); 
INSERT INTO dbo.BurritoAppLog VALUES (N'{"Type":"Error", "MessageId": 20, "User":"Jim", "ErrorMessage":"Cannot wrap burrito with quadruple meat." }'); 
GO 100

2018-04-25_19-21-04

Now imagine wanting to generate a report of only the rows that are errors.

Obviously, you'd want to index this data for faster querying performance.  Adding a non-clustered index on a non-persisted computed column of our JSON "Type" property will accomplish that:

ALTER TABLE dbo.BurritoAppLog 
ADD ErrorType AS JSON_VALUE(ErrorDetails, '$.Type');

ALTER TABLE dbo.BurritoAppLog 
ADD MessageId AS JSON_VALUE(ErrorDetails, '$.MessageId');

CREATE INDEX IX_ErrorType ON dbo.BurritoAppLog (ErrorType) INCLUDE (MessageId);

SELECT MessageId FROM dbo.BurritoAppLog WHERE ErrorType = 'Error'

And that works great.  Except that error entries in our table make up only 2.5% of our total rows.  Assuming we'll never need to query WHERE ErrorType = 'Warning' , this index is using a lot of unnecessary space.

So what if we create a filtered index instead?

Filtered JSON Indexes...

A filtered index should benefit us significantly here: it should save us space (since it won't include all of those warning rows) and it should make our INSERT queries into this table faster since the index won't need to be maintained for our non-"Error" rows.

So let's create a filtered index:

CREATE INDEX FX_ErrorType ON dbo.BurritoAppLog (ErrorType) INCLUDE (MessageId) WHERE ErrorType = 'Error'

Oh.

2018-04-25_19-47-03-1

So I guess we can't create a filtered index where the filter is on a computed column.  Maybe SQL Server won't mind if we persist the computed column?

DROP INDEX IX_ErrorType ON dbo.BurritoAppLog

ALTER TABLE dbo.BurritoAppLog
DROP COLUMN ErrorType;

ALTER TABLE dbo.BurritoAppLog 
ADD ErrorType AS JSON_VALUE(ErrorDetails, '$.Type') PERSISTED;

CREATE INDEX FX_ErrorType ON dbo.BurritoAppLog (ErrorType) INCLUDE (MessageId) WHERE ErrorType = 'Error'

NOOOOOOPPPPEEEE.  Same error message.

The issue is that SQL Server does not like computed columns, persisted or not, in a filtered index's WHERE clause.  It's one of the many limitations of filtered indexse (Aaron Bertrand has a great post outlining many of the shortcomings).

Computed Column Filtered Index Workaround

What is a performance minded, space-cautious, JSON-loving developer supposed to do?

One workaround to get our filtered index would be to parse our ErrorType property into its own table column on insert:

ALTER TABLE dbo.BurritoAppLog 
ADD PermanentErrorType varchar(10);

UPDATE dbo.BurritoAppLog SET PermanentErrorType = JSON_VALUE(ErrorDetails, '$.Type');

2018-04-25_20-01-45

With our PermanentErrorType column in place, we have no problem generating our filtered index:

CREATE INDEX FX_PermanentErrorType ON dbo.BurritoAppLog (PermanentErrorType) INCLUDE (MessageId) WHERE PermanentErrorType = 'Error'

If we compare the sizes of our nonclustered index to our filtered index, you'll immediately that the filtered index is significantly smaller:

2018-04-25_20-12-31-1

However, our table size is now slightly larger because of the added table column.

Conclusion

So what do you do if you run into this situation?  Well, if the ratio of undesired records to desired records is large like in the example above, you might want to make a permanent column to include in your filtered index - the size/performance benefit is certainly there.  This does mean that your table size will be larger (additional column) but performance will be faster if your queries are able to use the smaller filtered index.

The Forgotten Fourth SQL Server Recovery Model

Watch this week's video on YouTube

SQL Server recovery models define when database transactions are written to the transaction log.   Understanding these models is critical for backup and recovery purposes as well as for how their behaviors impact the performance of queries.

Today we'll examine the differences between SQL Server's three official recovery models as well as an unofficial "fourth" recovery model that won't help in backup/recovery, but will help in performance of certain processes.

Full Recovery

The only recovery model that can potentially save all of your data when something bad happens (NOTE: "potentially" because if you aren't taking enough and/or testing your backups, you might experience data loss).

Under the full recovery model, every transaction is written to the transaction log first, and then persisted to the actual database.  This means if something disastrous happens to your server, as long as the change made its way into the transaction log AND your transaction log is readable AND your previous full/differential/log backups can be restored, you shouldn't experience any data loss (there are a lot of assumptions made with that statement though, so don't use this post as your only data loss prevention guide)

From a performance standpoint, full recovery is the slowest of the bunch because every transaction needs to be logged, and that creates some overhead.   Might be good for your OLTP databases, maybe not so much for your analytical staging databases (assuming you can recreate that data).

Simple Recovery

While some people incorrectly believe that simple recovery means no writing to the transaction log  (need proof that a database in simple recovery still writes to the trans log?  Try running ROLLBACK TRANSACTION after a huge delete) it actually means that the transaction log is cleared as soon as SQL Server is done using it and data has made its way to disk.

Since the transaction log is cleared regularly, your overall log size can be smaller since space is regularly reused.  Additionally, since that space is cleared you don't have to worry about backing it up.

No persistence of the transaction log means you won't be able to recover all of your data in case of server failure though.  This is generally OK if you are using simple recovery in databases where its easy to recreate any data since your last full backup (eg. staging data where you can easily redo the transactions that were lost).

Simple recovery minimally logs as many transactions as possible, making the throughput faster.  This works well in staging databases and for ETLs where data is in flux and can be easily recreated.

Bulk-Logged Recovery

If Goldilocks thinks the full recovery model has too much logging, and the simple model not enough logging, then she'll find the amount of logging in the bulk-logged recovery model to be just right.

Under bulk-logged, most transactions are fully logged, allowing for data restoration of those fully logged transactions if the need arises.  Bulk transactions however are minimally logged, allowing for better performance of things like bulk inserts (but no ability for restoration).

While restorations under the bulk-logged recovery model aren't as flexible as full recovery (eg. if the transaction log has any bulk transactions, you have to restore the whole transaction log instead of just up to a certain point), it does allow full logging for when most people need it and minimal logging for when most people don't need it.  Meaning for certain situations you can have your cake and eat it too!

The Fourth Unofficial Recovery Model: In-Memory SCHEMA_ONLY Durability

The SCHEMA_ONLY durability setting on a memory optimized table isn't a recovery model.  But it does behave a little bit like a recovery model in the sense that it defines how operations against your memory optimized table interact with your transaction log:

They don't.  Well, almost.

And that's the beauty of it, at least from a performance stand point.  If you are willing to trade off the ability to recover data for performance, then the SCHEMA_ONLY durability fits the bill - so long transaction log overhead.

So while none of the official recovery models allow you to prevent writing to the transaction log, the SCHEMA_ONLY durability setting does!