Data with Bert logo

Recursively Querying Row Groups

Watch this week's video on YouTube

Recursive queries are fun to plan and write. They can be frustrating too depending on the complexity of the problem you are trying to solve.

This post shows one solution for finding all records that are related, either directly or via intermediate records, using recursive queries in SQL Server.

The Data

Here's the data we'll be working with:

DROP TABLE  IF EXISTS #relationships;

CREATE TABLE #relationships (
    Id int,
    FK1 varchar(10),
    FK2 varchar(10)
);

INSERT INTO #relationships
VALUES 
    /* This group of records are all related, directly or through intermediate nodes */
    (1,'A','B'),
    (2,'A','E'),
    (3,'A','G'),
    (4,'A','F'),
    (5,'B','F'),
    (6,'B','E'),
    (7,'E','F'),
    (8,'G','H'),

    (9,'B','A'),  /* This is  an identical relationship as Id=1 */
    (10,'B','F'), /* This is a straight up duplicate record to Id=5*/


    /* These records are related simply where FK2 of one row is FK1 of the subsequent row */
    (11,'I','J'),
    (12,'J','K'),
    (13,'K','L'),

    /* Some more related records */
    (14,'M','N'),
    (15,'O','M'),
    (16,'P','O'),
    (17,'M','N'), /* Duplicate of Id=14 */
    (18,'M','O'), /* Flipped duplicate of 15 */
    (19,'M','P'),

    /* These records are interesting because the FK2 values never appear in the FK1 column */
    (20,'Q','R'),
    (21,'S','R'),
    (22,'T','R');

Each row has values for Id,FK1,and FK2 columns. Id is a unique primary key that I included to make referencing individual rows easier. FK1 and FK2 are foreign keys that reference full records in another table. The point is that this table shows the relationships between records.

I used blank lines to visually indicate a group of records in the data above. The end result I want to achieve is to have each group of related records to share the same group_id. I also added some comments to point out some interesting scenarios in the data.

I programmed in some of the data edge cases I knew I would encounter into the test data; a sort of poor-man's test driven development. Including these known edge cases helps me test my query so I know my final solution will handle them correctly. If you are applying this solution to your own data, make sure to add your own test cases into the data.

Query Transformations for Sorted Hashes

The first recursive query I wrote to solve this problem was ugly. I was creating sorted hashes to ensure that rows where the FK values were swapped were deduplicated (eg. I would want to dedupe where Id=1 and Id=9). This solution involved joining on CASE statements like this all throughout the set of queries: CASE WHEN FK1<=FK2 THEN '|'+FK1+','+FK2+'|' ELSE '|'+FK2+','+FK1+'|' END AS key_hash.

Yuck. The solution worked, but I know my future self would not want to maintain that type of code.

Rather than have the final query do all of the work, I decided to clean up the data first instead.

Initial Clean Up

I decided to transform the data to eliminate all duplicates and make sure my keys were always sorted so the value of FK1 < FK2. This allows for a simpler final query:

--Remove duplicates where keys are switched
DROP TABLE  IF EXISTS #deduped;
CREATE TABLE #deduped (
Id int IDENTITY(1,1),
FK1 varchar(10),
FK2 varchar(10),
key_hash varchar(100)
);
INSERT INTO #deduped (FK1,FK2,key_hash)
SELECT
    dupes.FK1,
    dupes.FK2,
    dupes.key_hash
FROM
    (
    SELECT
        hashes.*,
        ROW_NUMBER() OVER (PARTITION BY key_hash ORDER BY FK1) AS RN
    FROM
        (
        /* make sure FK1 is always smaller than FK2.  This eliminantes a lot of more complicated logic later on */
        SELECT
            CASE WHEN FK1 <= FK2 THEN FK1 ELSE FK2 END AS FK1,
            CASE WHEN FK1 <= FK2 THEN FK2 ELSE FK1 END AS FK2,
            CASE WHEN FK1 <= FK2 THEN '|'+FK1 +','+FK2+'|' ELSE '|'+FK2+','+FK1+'|' END AS key_hash

        FROM
            #relationships
        ) hashes
    ) dupes
WHERE
    dupes.RN = 1;

This still uses the key_hash CASE statement I mentioned previously, but it only creates it once initially to dedupe the entries regardless of their order.

Grouping Related Records

With a deduped dataset, the recursive query for finding groups of related records is (relatively) straight-forward (and a refresher if you need to remember how recursive CTEs work in SQL Server):

WITH c AS (
    /* The initial query */
    SELECT
        1 as level,
        DENSE_RANK() OVER(ORDER BY FK1) group_id,
        key_hash,
        FK1
        ,FK2
    FROM
        #deduped
    UNION ALL
    /* The recursive query that runs once for each of the above query's output and then once on every row from each subsequent result */
    SELECT 
        c.level+1,
        c.group_id,
        t.key_hash,
        t.FK1,
        t.FK2
    FROM 
        c
        INNER JOIN #deduped t
            ON c.FK2 = t.FK1
    WHERE
        /* don't include combinations already matched */
        c.key_hash not like '%'+t.key_hash+'%'
), 
/* regular CTE  */
output_with_dupes as (
SELECT
    level,
    key_hash,
    group_id,
    FK1,
    FK2,
    ROW_NUMBER() OVER (PARTITION BY key_hash ORDER BY group_id) AS  RN
FROM
    c
)

-- deduped output
SELECT
    group_id,
    FK1,
    FK2
FROM
    output_with_dupes
WHERE RN = 1
ORDER BY
    group_id,
    FK1
OPTION(MAXRECURSION 0)

Some records are duplicated after the initial join, so a subsequent expression is used to dedupe the final records. The result is a table of rows containing a group_id of records that are related to each other.

Couldn't you have done this with hierarchyid?

Probably.

But I wanted a solution flexible enough to reuse for other relational databases that don't support hierarchyid as well.

More than just recursive queries

If you've never written a recursive SQL query before, hopefully this post gives you an idea of how to do it.

The more important takeaway is that sometimes it's easier to solve a data problem than a query problem. Like I said previously, my initial recursive query was hideous and unmaintainable because of the duplicates I had in my data. Rather than living with that query, it made sense to clean up the data first and then write the simpler query on the clean data.

FizzBuzz and Multiple Recursive Member CTEs

Watch this week's video on YouTube

Last week I needed to write a recursive common table expression.  I've written them before, but it's been a while and needed to visit the documentation to reference the syntax.

Instead of going straight to the examples, I decided to read into some of the details (since skipping the details really hurt me in last week's post) and noticed this line that I had never seen before:

2018-08-20_11-58-14

"Multiple...recursive members can be defined" - what????

I never knew you could have multiple recursive member statements in a CTE.  Heck, I didn't even know what having multiple recursive members could do.

Since the documentation doesn't talk about them beyond the one highlighted line above, I decided to create some examples to see if I could get them to work.

FizzBuzz

FizzBuzz is a programming puzzle that asks the solver to write a program that will list the numbers 1 to 100, displaying the word "Fizz" for any numbers that are a multiple of 3, "Buzz" for any multiples of 5, and "FizzBuzz" for any multiples of 3 and 5.

I decided to try and implement the FizzBuzz problem as both a single and multiple member CTE to see how the solutions would differ.

The Basic Recursive CTE

To start out, I decided to write a CTE that lists all numbers 0 to 100:

WITH c AS
(
    -- anchor member
    SELECT
        0 AS RowNumber
    UNION ALL
    -- recursive member
    SELECT
        c.RowNumber + 1
    FROM
        c /* the result of our last iteration */
    WHERE 
        RowNumber < 100
)

SELECT * 
FROM c;

The first SELECT statement in the CTE definition is known as the "anchor" member.  This query runs a single time and acts as the initial result that the recursive query acts on.

The second SELECT statement in the CTE definition is known as the "recursive" member.  This statement executes on the results of the previous execution (or on the results of the anchor member for the first iteration).

The recursive member will execute over and over again as long as it is still producing results.  Since our recursive statement is just adding 1 to the previous result, our recursive query would run forever - which is why we add the WHERE condition stop it from executing once we reach 100.

Our final SELECT statement returns the results of our recursive CTE, providing us with a neat list of numbers from 0 to 100:

2018-08-20_12-31-08

Single Recursive Member CTE for FizzBuzz

Now that our basic recursive CTE is working, let's make it solve FizzBuzz.  Here is our updated code:

WITH c AS
(
    SELECT
        0 AS RowNumber,
        'FizzBuzz' AS FizzOrBuzz
    UNION ALL
    SELECT
        c.RowNumber + 1,
        CASE 
            WHEN (c.RowNumber + 1) % 15 = 0 THEN 'FizzBuzz' 
            WHEN (c.RowNumber + 1) % 3 = 0 THEN 'Fizz' 
            WHEN (c.RowNumber + 1) % 5 = 0 THEN 'Buzz' 
            ELSE NULL 
        END
    FROM
        c
    WHERE 
        RowNumber < 100
)

SELECT
    RowNumber,
    FizzOrBuzz
FROM C
ORDER BY RowNumber;

First, we add a second column to our results to display the word "Fizz", "Buzz", or "FuzzBuzz".

In the anchor member, we defaulted this value to "FizzBuzz".  In our recursive member, we added a CASE statement to display the correct word.  The modulo operator (%) checks to see if the current row divided by 3, 5, or 15 results in a remainder - if the remainder is 0 then we know we found a multiple of that number.

This solution is pretty easy to read and provides the expected output for our FizzBuzz puzzle:

2018-08-20_12-39-21

Multiple Recurisve Member CTE for FizzBuzz

Alright the moment we've been waiting for - the multiple recursive member CTE:

WITH c AS
(
    SELECT
        0 AS RowNumber,
        'FizzBuzz' AS FizzOrBuzz
    UNION ALL
    /* All rows not Fizz or Buzz or FizzBuzz */
    SELECT
        c.RowNumber + 1,
        NULL AS FizzOrBuzz
    FROM
        c
    WHERE
        c.RowNumber+1 <= 100
        AND (c.RowNumber+1)%3<>0
        AND (c.RowNumber+1)%5<>0
    UNION ALL
    /* Fizz rows */
    SELECT
        c.RowNumber + 3,
        CAST('Fizz' AS VARCHAR(8)) AS FizzOrBuzz
    FROM
        c
    WHERE 
        c.RowNumber+3 <= 100
        and FizzOrBuzz in ('Fizz','FizzBuzz')
    UNION ALL
    /* Buzz rows */
    SELECT
        c.RowNumber + 5,
        'Buzz' AS FizzOrBuzz
    FROM
        c
    WHERE
        c.RowNumber+5 <= 100
        and FizzOrBuzz in ('Buzz','FizzBuzz')
)

SELECT
    RowNumber,
    STRING_AGG(FizzOrBuzz,'') AS FizzOrBuzz
FROM C
GROUP BY
    RowNumber
ORDER BY RowNumber

You'll notice we have 3 recursive members: the first generates all rows up to 100 that are not multiples of 3 or 5, the second generates all rows that are multiples of 3, and the third statement generates all rows that are multiple of 5.

If we were to run SELECT \* FROM c; after only making the mentioned changes, you'll notice that it looks like things are mostly working, but that we have duplicates (and incorrect labeling) for rows that are multiples of 3 and 5:

2018-08-20_12-52-24

The way I decided to fix that is by adding a STRING_AGG() function to the final SELECT statement, concatenating the outputs of rows with the same RowNumber. With that addition, our multiple recursive member CTE FizzBuzz solution is complete.

One thing to be aware of in the above solution: each of the recursive member statements will execute on the previous results of ANY recursive member statement, so we add the conditions "...and FizzOrBuzz in ..." to force each recursive statement to run only on the output from its own previous result.  This feels like cheating a little bit, but it was the only way I could solve the problem I had defined.

Practical Examples and Further Reading

I had a hard time coming up with a practical uses for multiple recursive member CTEs.

I searched online for some examples but it doesn't seem like many people have written about the topic.  One exception I did find was an article by Itzik Ben-Gan where he uses them to solve Lord of the Rings family trees (heh).

Honestly though, as excited as I was initially to learn that doing this is possible, I don't know if/when I'll ever use it.  I'm hoping I encounter a problem one day that can make use of multiple recursive statements, but who knows if that will ever happen.

If you have used multiple recursive member CTEs to solve a real-world problem before, leave me a comment - I'd love to hear about the scenario you used it in.