Recursively Querying Row Groups

Watch this week's video on YouTube

Recursive queries are fun to plan and write. They can be frustrating too depending on the complexity of the problem you are trying to solve.

This post shows one solution for finding all records that are related, either directly or via intermediate records, using recursive queries in SQL Server.

The Data

Here's the data we'll be working with:

DROP TABLE  IF EXISTS #relationships;

CREATE TABLE #relationships (
    Id int,
    FK1 varchar(10),
    FK2 varchar(10)
);

INSERT INTO #relationships
VALUES 
    /* This group of records are all related, directly or through intermediate nodes */
    (1,'A','B'),
    (2,'A','E'),
    (3,'A','G'),
    (4,'A','F'),
    (5,'B','F'),
    (6,'B','E'),
    (7,'E','F'),
    (8,'G','H'),

    (9,'B','A'),  /* This is  an identical relationship as Id=1 */
    (10,'B','F'), /* This is a straight up duplicate record to Id=5*/


    /* These records are related simply where FK2 of one row is FK1 of the subsequent row */
    (11,'I','J'),
    (12,'J','K'),
    (13,'K','L'),

    /* Some more related records */
    (14,'M','N'),
    (15,'O','M'),
    (16,'P','O'),
    (17,'M','N'), /* Duplicate of Id=14 */
    (18,'M','O'), /* Flipped duplicate of 15 */
    (19,'M','P'),

    /* These records are interesting because the FK2 values never appear in the FK1 column */
    (20,'Q','R'),
    (21,'S','R'),
    (22,'T','R');

Each row has values for Id,FK1,and FK2 columns. Id is a unique primary key that I included to make referencing individual rows easier. FK1 and FK2 are foreign keys that reference full records in another table. The point is that this table shows the relationships between records.

I used blank lines to visually indicate a group of records in the data above. The end result I want to achieve is to have each group of related records to share the same group_id. I also added some comments to point out some interesting scenarios in the data.

I programmed in some of the data edge cases I knew I would encounter into the test data; a sort of poor-man's test driven development. Including these known edge cases helps me test my query so I know my final solution will handle them correctly. If you are applying this solution to your own data, make sure to add your own test cases into the data.

Query Transformations for Sorted Hashes

The first recursive query I wrote to solve this problem was ugly. I was creating sorted hashes to ensure that rows where the FK values were swapped were deduplicated (eg. I would want to dedupe where Id=1 and Id=9). This solution involved joining on CASE statements like this all throughout the set of queries: CASE WHEN FK1<=FK2 THEN '|'+FK1+','+FK2+'|' ELSE '|'+FK2+','+FK1+'|' END AS key_hash.

Yuck. The solution worked, but I know my future self would not want to maintain that type of code.

Rather than have the final query do all of the work, I decided to clean up the data first instead.

Initial Clean Up

I decided to transform the data to eliminate all duplicates and make sure my keys were always sorted so the value of FK1 < FK2. This allows for a simpler final query:

--Remove duplicates where keys are switched
DROP TABLE  IF EXISTS #deduped;
CREATE TABLE #deduped (
Id int IDENTITY(1,1),
FK1 varchar(10),
FK2 varchar(10),
key_hash varchar(100)
);
INSERT INTO #deduped (FK1,FK2,key_hash)
SELECT
    dupes.FK1,
    dupes.FK2,
    dupes.key_hash
FROM
    (
    SELECT
        hashes.*,
        ROW_NUMBER() OVER (PARTITION BY key_hash ORDER BY FK1) AS RN
    FROM
        (
        /* make sure FK1 is always smaller than FK2.  This eliminantes a lot of more complicated logic later on */
        SELECT
            CASE WHEN FK1 <= FK2 THEN FK1 ELSE FK2 END AS FK1,
            CASE WHEN FK1 <= FK2 THEN FK2 ELSE FK1 END AS FK2,
            CASE WHEN FK1 <= FK2 THEN '|'+FK1 +','+FK2+'|' ELSE '|'+FK2+','+FK1+'|' END AS key_hash

        FROM
            #relationships
        ) hashes
    ) dupes
WHERE
    dupes.RN = 1;

This still uses the key_hash CASE statement I mentioned previously, but it only creates it once initially to dedupe the entries regardless of their order.

Grouping Related Records

With a deduped dataset, the recursive query for finding groups of related records is (relatively) straight-forward (and a refresher if you need to remember how recursive CTEs work in SQL Server):

WITH c AS (
    /* The initial query */
    SELECT
        1 as level,
        DENSE_RANK() OVER(ORDER BY FK1) group_id,
        key_hash,
        FK1
        ,FK2
    FROM
        #deduped
    UNION ALL
    /* The recursive query that runs once for each of the above query's output and then once on every row from each subsequent result */
    SELECT 
        c.level+1,
        c.group_id,
        t.key_hash,
        t.FK1,
        t.FK2
    FROM 
        c
        INNER JOIN #deduped t
            ON c.FK2 = t.FK1
    WHERE
        /* don't include combinations already matched */
        c.key_hash not like '%'+t.key_hash+'%'
), 
/* regular CTE  */
output_with_dupes as (
SELECT
    level,
    key_hash,
    group_id,
    FK1,
    FK2,
    ROW_NUMBER() OVER (PARTITION BY key_hash ORDER BY group_id) AS  RN
FROM
    c
)

-- deduped output
SELECT
    group_id,
    FK1,
    FK2
FROM
    output_with_dupes
WHERE RN = 1
ORDER BY
    group_id,
    FK1
OPTION(MAXRECURSION 0)

Some records are duplicated after the initial join, so a subsequent expression is used to dedupe the final records. The result is a table of rows containing a group_id of records that are related to each other.

Couldn't you have done this with hierarchyid?

Probably.

But I wanted a solution flexible enough to reuse for other relational databases that don't support hierarchyid as well.

More than just recursive queries

If you've never written a recursive SQL query before, hopefully this post gives you an idea of how to do it.

The more important takeaway is that sometimes it's easier to solve a data problem than a query problem. Like I said previously, my initial recursive query was hideous and unmaintainable because of the duplicates I had in my data. Rather than living with that query, it made sense to clean up the data first and then write the simpler query on the clean data.

FizzBuzz and Multiple Recursive Member CTEs

Watch this week's video on YouTube

Last week I needed to write a recursive common table expression.  I've written them before, but it's been a while and needed to visit the documentation to reference the syntax.

Instead of going straight to the examples, I decided to read into some of the details (since skipping the details really hurt me in last week's post) and noticed this line that I had never seen before:

2018-08-20_11-58-14

"Multiple...recursive members can be defined" - what????

I never knew you could have multiple recursive member statements in a CTE.  Heck, I didn't even know what having multiple recursive members could do.

Since the documentation doesn't talk about them beyond the one highlighted line above, I decided to create some examples to see if I could get them to work.

FizzBuzz

FizzBuzz is a programming puzzle that asks the solver to write a program that will list the numbers 1 to 100, displaying the word "Fizz" for any numbers that are a multiple of 3, "Buzz" for any multiples of 5, and "FizzBuzz" for any multiples of 3 and 5.

I decided to try and implement the FizzBuzz problem as both a single and multiple member CTE to see how the solutions would differ.

The Basic Recursive CTE

To start out, I decided to write a CTE that lists all numbers 0 to 100:

WITH c AS
(
    -- anchor member
    SELECT
        0 AS RowNumber
    UNION ALL
    -- recursive member
    SELECT
        c.RowNumber + 1
    FROM
        c /* the result of our last iteration */
    WHERE 
        RowNumber < 100
)

SELECT * 
FROM c;

The first SELECT statement in the CTE definition is known as the "anchor" member.  This query runs a single time and acts as the initial result that the recursive query acts on.

The second SELECT statement in the CTE definition is known as the "recursive" member.  This statement executes on the results of the previous execution (or on the results of the anchor member for the first iteration).

The recursive member will execute over and over again as long as it is still producing results.  Since our recursive statement is just adding 1 to the previous result, our recursive query would run forever - which is why we add the WHERE condition stop it from executing once we reach 100.

Our final SELECT statement returns the results of our recursive CTE, providing us with a neat list of numbers from 0 to 100:

2018-08-20_12-31-08

Single Recursive Member CTE for FizzBuzz

Now that our basic recursive CTE is working, let's make it solve FizzBuzz.  Here is our updated code:

WITH c AS
(
    SELECT
        0 AS RowNumber,
        'FizzBuzz' AS FizzOrBuzz
    UNION ALL
    SELECT
        c.RowNumber + 1,
        CASE 
            WHEN (c.RowNumber + 1) % 15 = 0 THEN 'FizzBuzz' 
            WHEN (c.RowNumber + 1) % 3 = 0 THEN 'Fizz' 
            WHEN (c.RowNumber + 1) % 5 = 0 THEN 'Buzz' 
            ELSE NULL 
        END
    FROM
        c
    WHERE 
        RowNumber < 100
)

SELECT
    RowNumber,
    FizzOrBuzz
FROM C
ORDER BY RowNumber;

First, we add a second column to our results to display the word "Fizz", "Buzz", or "FuzzBuzz".

In the anchor member, we defaulted this value to "FizzBuzz".  In our recursive member, we added a CASE statement to display the correct word.  The modulo operator (%) checks to see if the current row divided by 3, 5, or 15 results in a remainder - if the remainder is 0 then we know we found a multiple of that number.

This solution is pretty easy to read and provides the expected output for our FizzBuzz puzzle:

2018-08-20_12-39-21

Multiple Recurisve Member CTE for FizzBuzz

Alright the moment we've been waiting for - the multiple recursive member CTE:

WITH c AS
(
    SELECT
        0 AS RowNumber,
        'FizzBuzz' AS FizzOrBuzz
    UNION ALL
    /* All rows not Fizz or Buzz or FizzBuzz */
    SELECT
        c.RowNumber + 1,
        NULL AS FizzOrBuzz
    FROM
        c
    WHERE
        c.RowNumber+1 <= 100
        AND (c.RowNumber+1)%3<>0
        AND (c.RowNumber+1)%5<>0
    UNION ALL
    /* Fizz rows */
    SELECT
        c.RowNumber + 3,
        CAST('Fizz' AS VARCHAR(8)) AS FizzOrBuzz
    FROM
        c
    WHERE 
        c.RowNumber+3 <= 100
        and FizzOrBuzz in ('Fizz','FizzBuzz')
    UNION ALL
    /* Buzz rows */
    SELECT
        c.RowNumber + 5,
        'Buzz' AS FizzOrBuzz
    FROM
        c
    WHERE
        c.RowNumber+5 <= 100
        and FizzOrBuzz in ('Buzz','FizzBuzz')
)

SELECT
    RowNumber,
    STRING_AGG(FizzOrBuzz,'') AS FizzOrBuzz
FROM C
GROUP BY
    RowNumber
ORDER BY RowNumber

You'll notice we have 3 recursive members: the first generates all rows up to 100 that are not multiples of 3 or 5, the second generates all rows that are multiples of 3, and the third statement generates all rows that are multiple of 5.

If we were to run SELECT \* FROM c; after only making the mentioned changes, you'll notice that it looks like things are mostly working, but that we have duplicates (and incorrect labeling) for rows that are multiples of 3 and 5:

2018-08-20_12-52-24

The way I decided to fix that is by adding a STRING_AGG() function to the final SELECT statement, concatenating the outputs of rows with the same RowNumber. With that addition, our multiple recursive member CTE FizzBuzz solution is complete.

One thing to be aware of in the above solution: each of the recursive member statements will execute on the previous results of ANY recursive member statement, so we add the conditions "...and FizzOrBuzz in ..." to force each recursive statement to run only on the output from its own previous result.  This feels like cheating a little bit, but it was the only way I could solve the problem I had defined.

Practical Examples and Further Reading

I had a hard time coming up with a practical uses for multiple recursive member CTEs.

I searched online for some examples but it doesn't seem like many people have written about the topic.  One exception I did find was an article by Itzik Ben-Gan where he uses them to solve Lord of the Rings family trees (heh).

Honestly though, as excited as I was initially to learn that doing this is possible, I don't know if/when I'll ever use it.  I'm hoping I encounter a problem one day that can make use of multiple recursive statements, but who knows if that will ever happen.

If you have used multiple recursive member CTEs to solve a real-world problem before, leave me a comment - I'd love to hear about the scenario you used it in.

Converting JSON to SQL Server CREATE TABLE Statements

Watch this week's video on YouTube

Tedious, repetitive tasks are the bane of any lazy programmer.  I know, because I am one.

One such repetitive task that I find comparable to counting grains of rice is building database layouts from JSON data sources.

While some online services exist that will parse JSON objects into database structures, I don't like using them because I don't trust the people running those sites with my data.  Nothing personal against them, I just don't want to be passing my data through their servers.

My solution to this problem was to write a query that will parse my unfamiliar JSON documents into a series of CREATE TABLE statements.

Automatically Generating A SQL Database Schema From JSON

You can always get the most recent version of the query from GitHub, but I'll post the current version below so that it's easier to explain in this post:

/*
This code takes a JSON input string and automatically generates
SQL Server CREATE TABLE statements to make it easier
to convert serialized data into a database schema.

It is not perfect, but should provide a decent starting point when starting
to work with new JSON files.

A blog post with more information can be found at https://bertwagner.com/2018/05/22/converting-json-to-sql-server-create-table-statements/
*/
SET NOCOUNT ON;

DECLARE 
    @JsonData nvarchar(max) = '
        {
            "Id" : 1,
            "IsActive":true,
            "Ratio": 1.25,
            "ActivityArray":[true,false,true],
            "People" : ["Jim","Joan","John","Jeff"],
            "Places" : [{"State":"Connecticut", "Capitol":"Hartford", "IsExpensive":true},{"State":"Ohio","Capitol":"Columbus","MajorCities":["Cleveland","Cincinnati"]}],
            "Thing" : { "Type":"Foo", "Value" : "Bar" },
            "Created_At":"2018-04-18T21:25:48Z"
        }',
    @RootTableName nvarchar(4000) = N'AppInstance',
    @Schema nvarchar(128) = N'dbo',
    @DefaultStringPadding smallint = 20;

DROP TABLE IF EXISTS ##parsedJson;
WITH jsonRoot AS (
    SELECT 
        0 as parentLevel, 
        CONVERT(nvarchar(4000),NULL) COLLATE Latin1_General_BIN2 as parentTableName, 
        0 AS [level], 
        [type] ,
        @RootTableName COLLATE Latin1_General_BIN2 AS TableName,
        [key] COLLATE Latin1_General_BIN2 as ColumnName,
        [value],
        ROW_NUMBER() OVER (ORDER BY (SELECT 1)) AS ColumnSequence
    FROM 
        OPENJSON(@JsonData, '$')
    UNION ALL
    SELECT 
        jsonRoot.[level] as parentLevel, 
        CONVERT(nvarchar(4000),jsonRoot.TableName) COLLATE Latin1_General_BIN2, 
        jsonRoot.[level]+1, 
        d.[type],
        CASE WHEN jsonRoot.[type] IN (4,5) THEN CONVERT(nvarchar(4000),jsonRoot.ColumnName) ELSE jsonRoot.TableName END COLLATE Latin1_General_BIN2,
        CASE WHEN jsonRoot.[type] IN (4) THEN jsonRoot.ColumnName ELSE d.[key] END,
        d.[value],
        ROW_NUMBER() OVER (ORDER BY (SELECT 1)) AS ColumnSequence
    FROM 
        jsonRoot
        CROSS APPLY OPENJSON(jsonRoot.[value], '$') d
    WHERE 
        jsonRoot.[type] IN (4,5) 
), IdRows AS (
    SELECT 
        -2 as parentLevel,
        null as parentTableName,
        -1 as [level],
        null as [type],
        TableName as Tablename,
        TableName+'Id' as columnName, 
        null as [value],
        0 as columnsequence
    FROM 
        (SELECT DISTINCT tablename FROM jsonRoot) j
), FKRows AS (
    SELECT 
        DISTINCT -1 as parentLevel,
        null as parentTableName,
        -1 as [level],
        null as [type],
        TableName as Tablename,
        parentTableName+'Id' as columnName, 
        null as [value],
        0 as columnsequence
    FROM 
        (SELECT DISTINCT tableName,parentTableName FROM jsonRoot) j
    WHERE 
        parentTableName is not null
)
SELECT 
    *,
    CASE [type]
        WHEN 1 THEN 
            CASE WHEN TRY_CONVERT(datetime2, [value], 127) IS NULL THEN 'nvarchar' ELSE 'datetime2' END
        WHEN 2 THEN 
            CASE WHEN TRY_CONVERT(int, [value]) IS NULL THEN 'float' ELSE 'int' END
        WHEN 3 THEN 
            'bit'
        END COLLATE Latin1_General_BIN2 AS DataType,
    CASE [type]
        WHEN 1 THEN 
            CASE WHEN TRY_CONVERT(datetime2, [value], 127) IS NULL THEN MAX(LEN([value])) OVER (PARTITION BY TableName, ColumnName) + @DefaultStringPadding ELSE NULL END
        WHEN 2 THEN 
            NULL
        WHEN 3 THEN 
            NULL
        END AS DataTypePrecision
INTO ##parsedJson
FROM jsonRoot
WHERE 
    [type] in (1,2,3)
UNION ALL SELECT IdRows.parentLevel, IdRows.parentTableName, IdRows.[level], IdRows.[type], IdRows.TableName, IdRows.ColumnName, IdRows.[value], -10 AS ColumnSequence, 'int IDENTITY(1,1) PRIMARY KEY' as datatype, null as datatypeprecision FROM IdRows 
UNION ALL SELECT FKRows.parentLevel, FKRows.parentTableName, FKRows.[level], FKRows.[type], FKRows.TableName, FKRows.ColumnName, FKRows.[value], -9 AS ColumnSequence, 'int' as datatype, null as datatypeprecision FROM FKRows 

-- For debugging:
-- SELECT * FROM ##parsedJson ORDER BY ParentLevel, level, tablename, columnsequence

DECLARE @CreateStatements nvarchar(max);

SELECT
    @CreateStatements = COALESCE(@CreateStatements + CHAR(13) + CHAR(13), '') + 
    'CREATE TABLE ' + @Schema + '.' + TableName + CHAR(13) + '(' + CHAR(13) +
        STRING_AGG( ColumnName + ' ' + DataType + ISNULL('('+CAST(DataTypePrecision AS nvarchar(20))+')','') +  CASE WHEN DataType like '%PRIMARY KEY%' THEN '' ELSE ' NULL' END, ','+CHAR(13)) WITHIN GROUP (ORDER BY ColumnSequence) 
    + CHAR(13)+')'
FROM
    (SELECT DISTINCT 
        j.TableName, 
        j.ColumnName,
        MAX(j.ColumnSequence) AS ColumnSequence, 
        j.DataType, 
        j.DataTypePrecision, 
        j.[level] 
    FROM 
        ##parsedJson j
        CROSS APPLY (SELECT TOP 1 ParentTableName + 'Id' AS ColumnName FROM ##parsedJson p WHERE j.TableName = p.TableName ) p
    GROUP BY
        j.TableName, j.ColumnName,p.ColumnName, j.DataType, j.DataTypePrecision, j.[level] 
    ) j
GROUP BY
    TableName


PRINT @CreateStatements;

In the variables section, we can define our input JSON document string as well as define things like a root table name and default database schema name.

There is also a string padding variable.  This padding variable's value is added to the max value length found in each column being generated, giving each column a little bit more breathing room.

Next in the script is the recursive CTE that parses the JSON string.  The OPENJSON() function in SQL Server makes this part relatively easy since some of the work of determining datatypes is already done for you.

I've taken the liberty to convert all strings to nvarchar types, numbers to either floats or ints, booleans to bits, and datetime strings to datetime2s.

Two additional CTE expressions add an integer IDENTITY PRIMARY KEY column to each table as well as a column referencing the parent table if applicable (our foreign key column).

Finally, a little bit of dynamic SQL pieces together all of these components to generate our CREATE TABLE scripts.

Limitations

I created this code with a lot of assumptions about my (unfamiliar) JSON data sets.  For the purpose of roughly building out tables from large JSON files, I don't need the results to be perfect and production-ready; I just want the results to be mostly correct so the vast majority of tedious table creation work is automated.

With that disclaimer made, here are a few things to be aware of:

  • Sometimes there will be duplicate column names generated because of naming - just delete one.
  • While foreign key columns exist, the foreign key constraints don't.
  • This code uses STRING_AGG.  I'll leave it up to you to convert to STUFF and FOR XML PATH if you need to run it in versions prior to 2017.

Summary

This script is far from perfect.  But it has eliminated the need for me to build out these tables and columns from scratch.  Sure, the output sometimes needs a tweak or too, but for my purposes I'm happy with how it turned out.  I hope it helps you eliminate some boring table creation work too.