How to Automatically Purge Historical Data From a Temporal Table

Temporal Tables are awesome.

They make analyzing time-series data a cinch, and because they automatically track row-level history, rolling back from an "oops" scenario doesn't mean you have to pull out the database backups.

The problem with temporal tables is that they produce a lot of data. Every row-level change stored in the temporal table's history table quickly adds up, increasing the odds that a low disk space warning gets sent to the on-call DBA.

In SQL Server 2017 (currently available in CTP 3), Microsoft allows us to add a retention period to our temporal tables, making purging old data in a temporal table as easy as specifying:

ALTER DATABASE DatabaseName
SET TEMPORAL_HISTORY_RETENTION ON;
GO

CREATE TABLE dbo.TableName (
    ... -- your columns (a temporal table requires a primary key)
    SysStartTime DATETIME2 GENERATED ALWAYS AS ROW START NOT NULL,
    SysEndTime DATETIME2 GENERATED ALWAYS AS ROW END NOT NULL,
    PERIOD FOR SYSTEM_TIME (SysStartTime, SysEndTime)
)
WITH
(
    SYSTEM_VERSIONING = ON
    (
        HISTORY_TABLE = dbo.TableNameHistory,
        HISTORY_RETENTION_PERIOD = 6 MONTHS
    )
);

However, until we are all running 2017 in production, we have to automate the purging process ourselves with a few scripts.

Purging old data out of history tables in SQL Server 2016

In the next few steps we are going to write a script that deletes data more than a month old from my CarInventoryHistory table. First, let's see what's currently in our tables:

SELECT * FROM dbo.CarInventory;
SELECT * FROM dbo.CarInventoryHistory;

[Screenshot: the current rows in dbo.CarInventory and dbo.CarInventoryHistory]

Now let's write our DELETE statement:

ALTER TABLE dbo.CarInventory SET ( SYSTEM_VERSIONING = OFF );
GO

-- In the real world we would do some DATE math here,
-- e.g. DATEADD(MONTH, -1, SYSUTCDATETIME())
DECLARE @OneMonthBack DATETIME2 = '2017-06-04';

DELETE FROM dbo.CarInventoryHistory WHERE SysStartTime < @OneMonthBack;

-- Don't forget to turn system versioning back on when finished
ALTER TABLE dbo.CarInventory SET ( SYSTEM_VERSIONING = ON (HISTORY_TABLE = dbo.CarInventoryHistory) );

You'll notice that we first had to turn system versioning off: SQL Server won't let us delete data from a history table that is currently tracking a temporal table.

However, this is a poor solution. Although the data deletes correctly from our history table, we open ourselves up to data integrity issues: if another process INSERTs, UPDATEs, or DELETEs against our temporal table while the history deletion is occurring, those changes won't be tracked because system versioning is turned off.

The better solution is to wrap our ALTER TABLE/DELETE logic in a transaction so any other queries running against our temporal table will have to wait:

-- Run this in query window #1 (delete data):
BEGIN TRANSACTION;
ALTER TABLE dbo.CarInventory SET ( SYSTEM_VERSIONING = OFF );
GO

-- In the real world we would do some DATE math here
DECLARE @OneMonthBack DATETIME2 = '2017-06-04';

DELETE FROM dbo.CarInventoryHistory WITH (TABLOCKX)
WHERE SysStartTime < @OneMonthBack;

-- Let's wait for 10 seconds to mimic a longer delete operation
WAITFOR DELAY '00:00:10';

--Re-enable our SYSTEM_VERSIONING
ALTER TABLE dbo.CarInventory SET ( SYSTEM_VERSIONING = ON (HISTORY_TABLE = dbo.CarInventoryHistory));
GO

COMMIT TRANSACTION;

-- Run this in query window #2 while the above query is executing (trying to update during deletion):
UPDATE dbo.CarInventory SET InLot = 0 WHERE CarId = 4;

And the result? Our history table data was deleted while still tracking the row-level data changes to our temporal table:

[Screenshot: the purged history table, which still captured the concurrent UPDATE from query window #2]

All that's left to do is throw this script into a SQL Server Agent job and schedule how often you want it to run.
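
If it helps, here's a rough sketch of how that script could be packaged up for the Agent job to call. The procedure name and @RetentionMonths parameter are my own invention for illustration; the table and column names match the examples above:

-- A minimal sketch of a reusable purge procedure (hypothetical name and parameter)
CREATE PROCEDURE dbo.PurgeCarInventoryHistory
    @RetentionMonths INT = 1
AS
BEGIN
    SET NOCOUNT ON;

    -- Here we do the real-world DATE math
    DECLARE @PurgeBefore DATETIME2 = DATEADD(MONTH, -@RetentionMonths, SYSUTCDATETIME());

    BEGIN TRANSACTION;

    ALTER TABLE dbo.CarInventory SET ( SYSTEM_VERSIONING = OFF );

    DELETE FROM dbo.CarInventoryHistory WITH (TABLOCKX)
    WHERE SysStartTime < @PurgeBefore;

    ALTER TABLE dbo.CarInventory
        SET ( SYSTEM_VERSIONING = ON (HISTORY_TABLE = dbo.CarInventoryHistory) );

    COMMIT TRANSACTION;
END;

The Agent job step then just needs to run EXEC dbo.PurgeCarInventoryHistory; on whatever schedule you choose.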

When Is It Appropriate To Store JSON in SQL Server?

Who needs a relational database when everything can be stored in a JSON string?

Every once in a while I hear some technologist say that relational databases are dead; instead, a non-table-based NoSQL storage format is the way of the future. SQL Server 2016 introduced JSON functionality, making it possible for some "non-SQL" data storage to make its way into the traditionally table-based SQL Server.

Does this mean all data in SQL Server going forward should be stored in long JSON strings? No, that would be a terrible idea. There are instances, though, when storing JSON in SQL Server is a good choice. In this post I want to offer recommendations for when data should be stored as JSON and when it shouldn't.

Databases Should Not Be Entirely Composed Of JSON

The screenshot below is an example of what I think some developers would do if they were given free rein in SQL Server 2016:

[Screenshot: an "InventoryApp" database containing a single dbo.Data table with three JSON NVARCHAR(MAX) columns]

Here we have an application database ("InventoryApp") that consists of only a single table ("dbo.Data") with three JSON NVARCHAR(MAX) columns to represent all of the data required by the app. Relationships exist between Sales, Purchases, and Customers but these are not defined on the database side.

If you come from the world of relational SQL, you might not believe that anyone would design such a database structure. Believe me though, this is a realistic scenario. Entire companies (e.g. Firebase: https://firebase.google.com/) build their services around abstracting the database layer away from developers, essentially storing entire tables or databases in large JSON strings.

Many developers like storing data this way because it is easy to deserialize JSON strings into objects in their programming languages to use in their apps. They like the fact that with JSON they can have an infinitely changing storage schema (just add new keys, values, and arrays!) so if they need a new field for their app, they can just add it in, serialize the object to a JSON string, and store it again in the database.

Obviously, going completely "NoSQL" might make short-term development easier and quicker, but using SQL Server 2016 only to store data this way is a travesty: you give up many of SQL Server's amazing performance, schema definition and validation, and security features.

So when is it appropriate to store JSON in SQL Server?

Appropriate Use Case #1: Error Logging

Errors happen. When they do, it's nice to be able to go back and look at the error message to see what happened.

The problem is that the structure of error messages isn't always consistent. Sometimes only the value of a single property will help identify the cause of failure. Other times, something more complex fails and it would be nice to have all of the values of a complex object available to make troubleshooting easier.

This is where JSON steps in: in most programming languages, it is easy to convert error messages and runtime values to a JSON object when an error occurs. And since error messages and data values change in structure depending on where they occur, it's easy to dynamically turn any type of object into JSON data.

This data is perfect to store in SQL Server and examine later. None of these ideas are new: nvarchar(max) has been in SQL Server for a while now, and programmers everywhere have been storing error information in that datatype.

With SQL Server 2016, it is now easier to examine and parse that error information directly in SQL Server Management Studio using the variety of JSON parsing functions available. No longer do programmers have to copy the data into some different tool; they can do it right in SSMS.
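
For example, here's a quick sketch of what that might look like. The dbo.ErrorLog table and its JSON structure are hypothetical, but the JSON functions (ISJSON, JSON_VALUE, JSON_QUERY) are the real SQL Server 2016 ones:

-- Hypothetical error log table with a JSON details column
CREATE TABLE dbo.ErrorLog
(
    ErrorLogId   INT IDENTITY PRIMARY KEY,
    LoggedAt     DATETIME2 NOT NULL DEFAULT SYSUTCDATETIME(),
    ErrorDetails NVARCHAR(MAX) CHECK (ISJSON(ErrorDetails) = 1)
);

-- The app serializes whatever error context it has into JSON and stores it
INSERT INTO dbo.ErrorLog (ErrorDetails)
VALUES (N'{"message":"Timeout expired","procedure":"dbo.SyncInventory","parameters":{"CarId":4,"RetryCount":3}}');

-- Parse the JSON right in SSMS instead of copying it to another tool
SELECT
    LoggedAt,
    JSON_VALUE(ErrorDetails, '$.message')    AS ErrorMessage,
    JSON_VALUE(ErrorDetails, '$.procedure')  AS FailedProcedure,
    JSON_QUERY(ErrorDetails, '$.parameters') AS ParameterValues
FROM dbo.ErrorLog;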

Appropriate Use Case #2: Piloting Ideas

Most large workplaces have controls in place that prevent developers from making changes in production. In general this is a Good Idea™.

However, controls are sometimes too restrictive. For example, due to security restrictions, lack of server space, company politics, etc… developers are sometimes stuck developing in production. It's an unfortunate fact of life. In those scenarios, developers have to go through hell if they must escalate a database structure change request every time they want to test something in production.

JSON to the rescue! The JSON data in an nvarchar(max) column can easily be extended and modified to fit more data than it was originally intended to hold, all without any database structure change requests.

Now this is not an ideal situation. In fact, it's a scenario that can add a lot of technical debt to the application long-term if not planned for.

However, if a "flexible" JSON column is built with eventual conversion to a traditional table structure in mind from the start, it's actually simple for a developer to transition an entirely-JSON storage structure to a relational format later on, as the sketch below shows. The key here is that the developer needs to have this conversion planned from day one.
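
To make that concrete, here's a rough sketch of what such a transition could look like. The dbo.PilotFeature table and its JSON keys are made up for illustration; OPENJSON's WITH clause does the actual shredding:

-- Hypothetical pilot table: new fields go into the JSON column, no DDL requests needed
CREATE TABLE dbo.PilotFeature
(
    PilotFeatureId INT IDENTITY PRIMARY KEY,
    FeatureData    NVARCHAR(MAX) CHECK (ISJSON(FeatureData) = 1)
);

INSERT INTO dbo.PilotFeature (FeatureData)
VALUES (N'{"customerName":"Contoso","quantity":3}');

-- When the pilot graduates, shred the JSON into a traditional relational table
SELECT p.PilotFeatureId, f.CustomerName, f.Quantity
INTO dbo.GraduatedFeature
FROM dbo.PilotFeature p
CROSS APPLY OPENJSON(p.FeatureData)
    WITH (
        CustomerName NVARCHAR(100) '$.customerName',
        Quantity     INT           '$.quantity'
    ) f;

Because the JSON keys were planned from day one, the conversion is a single query rather than a painful reverse-engineering exercise.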

Appropriate Use Case #3: Non-Analytical Data

Analytical data is SQL Server's bread and butter. Need to store lots of data and query against it all day long? No problem: there is a plethora of performance tuning options to make your queries run fast and efficiently.

However, sometimes not all data needs to be analyzed. Often an app might need to save some session data to a database temporarily; why create all of the maintenance overhead of a strict database schema if the data will never be queried for analytical purposes? Another example might be a website's dynamically created user profile settings: you could build normalized tables to store all of that data, but then you would be writing extra application logic to normalize and denormalize it on the way in and out.

If this data will never have to be searched, then why bother adding all of that overhead? Keep it in JSON and be done with it.
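
As an illustration, here's a minimal sketch of that pattern. The dbo.UserProfileSettings table and the settings keys are hypothetical; the app reads and writes the JSON blob whole and deserializes it client-side:

-- Hypothetical table for user profile settings that are never queried analytically
CREATE TABLE dbo.UserProfileSettings
(
    UserId   INT PRIMARY KEY,
    Settings NVARCHAR(MAX) CHECK (ISJSON(Settings) = 1) -- still validate it's real JSON
);

INSERT INTO dbo.UserProfileSettings (UserId, Settings)
VALUES (1, N'{"theme":"dark","emailOptIn":true,"dashboardWidgets":["sales","inventory"]}');

-- The app just fetches the whole string and deserializes it; no joins, no shredding
SELECT Settings
FROM dbo.UserProfileSettings
WHERE UserId = 1;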