Recursively Querying Row Groups

Published on: 2020-08-25

Watch this week’s episode on YouTube

Recursively Querying Related Rows

Recursive queries are fun to plan and write. They can be frustrating too, depending on the complexity of the problem you are trying to solve.

This post shows one solution for finding all records that are related, either directly or via intermediate records, using recursive queries in SQL Server.

The Data

Here’s the data we’ll be working with:

DROP TABLE  IF EXISTS #relationships;

CREATE TABLE #relationships (
	Id int,
	FK1 varchar(10),
	FK2 varchar(10)
);

INSERT INTO #relationships
VALUES 
	/* This group of records are all related, directly or through intermediate nodes */
	(1,'A','B'),
	(2,'A','E'),
	(3,'A','G'),
	(4,'A','F'),
	(5,'B','F'),
	(6,'B','E'),
	(7,'E','F'),
	(8,'G','H'),

	(9,'B','A'),  /* Same relationship as Id=1, just with the keys flipped */
	(10,'B','F'), /* A straight-up duplicate of Id=5 */


	/* These records are related simply where FK2 of one row is FK1 of the subsequent row */
	(11,'I','J'),
	(12,'J','K'),
	(13,'K','L'),

	/* Some more related records */
	(14,'M','N'),
	(15,'O','M'),
	(16,'P','O'),
	(17,'M','N'), /* Duplicate of Id=14 */
	(18,'M','O'), /* Flipped duplicate of 15 */
	(19,'M','P'),

	/* These records are interesting because the FK2 values never appear in the FK1 column */
	(20,'Q','R'),
	(21,'S','R'),
	(22,'T','R');

Each row has values for the Id, FK1, and FK2 columns. Id is a unique primary key that I included to make referencing individual rows easier. FK1 and FK2 are foreign keys that reference full records in another table. The point is that this table shows the relationships between those records.

I used blank lines to visually separate the groups of records in the data above. The end result I want is for each group of related records to share the same group_id. I also added some comments to point out interesting scenarios in the data.

I worked the edge cases I knew I would encounter into the test data; a sort of poor man's test-driven development. Including these known edge cases helps me test my query so I know my final solution handles them correctly. If you are applying this solution to your own data, make sure to add your own test cases.

Query Transformations for Sorted Hashes

The first recursive query I wrote to solve this problem was ugly. I was creating sorted hashes to ensure that rows with swapped FK values were deduplicated (e.g. I would want Id=1 and Id=9 to count as the same relationship). That solution involved joining on CASE expressions like this throughout the queries: CASE WHEN FK1<=FK2 THEN '|'+FK1+','+FK2+'|' ELSE '|'+FK2+','+FK1+'|' END AS key_hash.

Yuck. The solution worked, but I know my future self would not want to maintain that type of code.

Rather than have the final query do all of the work, I decided to clean up the data first instead.

Initial Clean Up

I decided to transform the data to eliminate all duplicates and to make sure my keys were always sorted so that FK1 <= FK2. This allows for a simpler final query:

--Remove duplicates where keys are switched
DROP TABLE  IF EXISTS #deduped;
CREATE TABLE #deduped (
Id int IDENTITY(1,1),
FK1 varchar(10),
FK2 varchar(10),
key_hash varchar(100)
);
INSERT INTO #deduped (FK1,FK2,key_hash)
SELECT
	dupes.FK1,
	dupes.FK2,
    dupes.key_hash
FROM
	(
	SELECT
		hashes.*,
		ROW_NUMBER() OVER (PARTITION BY key_hash ORDER BY FK1) AS RN
	FROM
		(
		/* make sure FK1 is always smaller than FK2.  This eliminates a lot of more complicated logic later on */
		SELECT
			CASE WHEN FK1 <= FK2 THEN FK1 ELSE FK2 END AS FK1,
			CASE WHEN FK1 <= FK2 THEN FK2 ELSE FK1 END AS FK2,
			CASE WHEN FK1 <= FK2 THEN '|'+FK1 +','+FK2+'|' ELSE '|'+FK2+','+FK1+'|' END AS key_hash
		
		FROM
			#relationships
		) hashes
	) dupes
WHERE
	dupes.RN = 1;

This still uses the key_hash CASE expression I mentioned previously, but now it only gets applied once, up front, to dedupe the entries regardless of the order of their keys.
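As a quick sanity check on the cleanup, a query like the one below should do; with the sample data above, the 22 inserted rows collapse to 18 deduped rows (Ids 9, 10, 17, and 18 disappear as duplicates):

/* Sanity check the cleanup: expect 18 rows from the 22-row sample above */
SELECT
	COUNT(*) AS deduped_row_count
FROM
	#deduped;

SELECT
	Id,
	FK1,
	FK2,
	key_hash
FROM
	#deduped
ORDER BY
	FK1,
	FK2;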

With a deduped dataset, the recursive query for finding groups of related records is (relatively) straightforward (it also doubles as a refresher on how recursive CTEs work in SQL Server):

WITH c AS (
	/* The initial query */
	SELECT
		1 as level,
		DENSE_RANK() OVER(ORDER BY FK1) group_id,
		key_hash,
		FK1
		,FK2
	FROM
		#deduped
	UNION ALL
	/* The recursive part: runs once against the anchor query's output, then once against each subsequent set of rows it produces */
	SELECT 
		c.level+1,
		c.group_id,
		t.key_hash,
		t.FK1,
		t.FK2
	FROM 
		c
		INNER JOIN #deduped t
			ON c.FK2 = t.FK1
	WHERE
		/* don't include combinations already matched */
		c.key_hash not like '%'+t.key_hash+'%'
), 
/* regular CTE  */
output_with_dupes as (
SELECT
	level,
	key_hash,
	group_id,
	FK1,
	FK2,
	ROW_NUMBER() OVER (PARTITION BY key_hash ORDER BY group_id) AS  RN
FROM
	c
)

-- deduped output
SELECT
	group_id,
	FK1,
	FK2
FROM
	output_with_dupes
WHERE RN = 1
ORDER BY
	group_id,
	FK1
OPTION(MAXRECURSION 0)

Some relationships get picked up multiple times during the recursion, so the ROW_NUMBER in the second CTE dedupes the final output. The result is one row per relationship, with a group_id shared by all records that are related to each other, directly or indirectly.
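If you also need the group_id stamped back onto the original rows (duplicates, flipped keys, and all), one option is to recompute the sorted key on #relationships and join it to the output above. A minimal sketch, assuming the deduped output has been saved into a temp table I'm calling #groups here, with group_id, FK1, and FK2 columns:

/* Hypothetical follow-up: map each original row back to its group via the sorted key */
SELECT
	r.Id,
	r.FK1,
	r.FK2,
	g.group_id
FROM
	#relationships r
	INNER JOIN #groups g
		ON  g.FK1 = CASE WHEN r.FK1 <= r.FK2 THEN r.FK1 ELSE r.FK2 END
		AND g.FK2 = CASE WHEN r.FK1 <= r.FK2 THEN r.FK2 ELSE r.FK1 END
ORDER BY
	g.group_id,
	r.Id;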

Couldn’t you have done this with hierarchyid?

Probably.

But I wanted a solution flexible enough to reuse on other relational databases that don't support hierarchyid.

More than just recursive queries

If you’ve never written a recursive SQL query before, hopefully this post gives you an idea of how to do it.

The more important takeaway is that sometimes it’s easier to solve a data problem than a query problem. Like I said previously, my initial recursive query was hideous and unmaintainable because of the duplicates I had in my data. Rather than living with that query, it made sense to clean up the data first and then write the simpler query on the clean data.


Moving 1/3 of a heap

Published on: 2020-07-22

Watch this week’s video on YouTube.

A Giant Heap

Recently I had to filter out 1.2 billion records from a 3.5 billion row heap. Don’t ask me why this 3.5 billion row table is stored as a heap.

If the lack of a clustered index wasn’t bad enough, I also had some other restrictions:

  • I couldn’t add a clustered index (or any index for that matter) sorted on the key I needed to filter on. It wasn’t my system, and I needed to access the 1.2 billion records sooner than it would take to get a clustered index approved and added. Even then, maybe the lack of a clustered index on this table is a feature and my request would be denied. Who knows.
  • I didn’t have a server that could store all of the data. At first I thought of copying all 3.5 billion rows to my own server and indexing it how I needed, but I didn’t have enough storage space anywhere to do that.
  • My connection to the server had a relatively short timeout set on it. This also couldn’t be changed. If I couldn’t copy all 3.5 billion rows because of storage, I also couldn’t copy all 1.2 billion records in one fell swoop because the connection would time out.

Iterative Process

Scanning the table hundreds of times…

I struggled with this problem for a little bit. My deadline clock was ticking and I was stuck as to how I could copy and subsequently query the 1.2 billion rows of data I needed. My focus transitioned from “what is the best way to do this” to “how do I do this”.

The solution that ended up working for me was to query the table hundreds of times, each time filtering out and copying only 1 week of data by running a query similar to this:

SELECT *
FROM dbo.MyBigHeap
WHERE
	CreateDate >= @StartDate
	AND CreateDate < @StopDate

Yes, this did cause me to scan the entire table hundreds of times, but one week turned out to be the right amount of data to copy at a time before the connection timed out.
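Conceptually, the whole process looked something like the sketch below. The destination table name and the date boundaries are placeholders I'm making up here, and in reality each week's query ran as its own batch because of the connection timeout rather than in one long server-side loop:

/* Sketch of the weekly copy - table name and dates are placeholders */
DECLARE @StartDate datetime2 = '2015-01-01';
DECLARE @EndOfData datetime2 = '2020-07-01';
DECLARE @StopDate datetime2;

WHILE @StartDate < @EndOfData
BEGIN
	SET @StopDate = DATEADD(WEEK, 1, @StartDate);

	INSERT INTO dbo.MyBigHeapCopy
	SELECT *
	FROM dbo.MyBigHeap
	WHERE
		CreateDate >= @StartDate
		AND CreateDate < @StopDate;

	SET @StartDate = @StopDate;
END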

Eventually I had the 1.2 billion rows I needed copied to my own server. I put a clustered columnstore index on the table (primarily for the compression savings) and some nonclustered indexes to support the queries I would need to run on it. Was this the best solution? I don’t know. But it worked for me given the constraints and deadline I had to meet.
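The index setup on the copied table was along these lines (again, the table and index names are placeholders, not the real system's objects):

/* Clustered columnstore for compression, plus a nonclustered index to support filtering */
CREATE CLUSTERED COLUMNSTORE INDEX CCI_MyBigHeapCopy
	ON dbo.MyBigHeapCopy;

CREATE NONCLUSTERED INDEX IX_MyBigHeapCopy_CreateDate
	ON dbo.MyBigHeapCopy (CreateDate);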

Moral of the Story

Always put a clustered index on your tables. Even if you don’t have an immediate need to sort or filter the data, you will be creating a world of pain for whoever eventually does need to query it.


SQL Server Converts Numbers to Asterisks

Published on: 2019-12-17

Watch this week’s episode on YouTube.

Pop quiz: What will be the output of the below three statements?

DECLARE @val_varchar varchar(3) = '100';
DECLARE @val_bigint bigint = 100;
DECLARE @val_tinyint tinyint = 100;

PRINT('varchar to varchar:');
PRINT(CAST(@val_varchar AS VARCHAR(2)));

PRINT('bigint to varchar:');
PRINT(CAST(@val_bigint AS VARCHAR(2)));

PRINT('tinyint to varchar:');
PRINT(CAST(@val_tinyint AS VARCHAR(2)));

As you might have guessed, the first conversion truncates the value, leaving the answer as “10”:

varchar to varchar

The second result is also somewhat intuitive – SQL Server throws an arithmetic overflow error, since a number can’t be converted into a string that has fewer characters than the number has digits:

bigint to varchar

However, the final conversion doesn’t fail at all; it succeeds and returns an asterisk:

tinyint to varchar

NOTE: if we change all of the conversions above to VARCHAR(3), all three statements print the string value ‘100’.

What the ****?

Why does SQL Server sometimes error when converting a number into a string, but other times succeeds and returns an asterisk?

I don’t know.

The best (and logical) answer I could find online is from Robert Sheldon, who attributes it to poor error handling practices, “…before error handling got a more reputable foothold.”

My understanding is that the asterisk was originally the default truncation behavior. Later, when the SQL Server development team added the bigint datatype, they decided that throwing an error was a better way to handle truncation. However, probably for backwards compatibility reasons, they never went back and changed the asterisk behavior of the pre-existing datatypes.

The documentation is clear about this behavior as well: converting an int, smallint, or tinyint to a char or varchar that is too short to hold the value results in an asterisk.

This becomes a problem if you create strings out of tinyints, smallints, and ints and suddenly start receiving values larger than you originally expected:

DECLARE @Age int = 75;
PRINT(CAST(@Age AS VARCHAR(2)) + ' years old.');

SET @Age = 100;
PRINT(CAST(@Age AS VARCHAR(2)) + ' years old.');
75 years old.
* years old.
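One simple fix is to give the conversion enough room, or to let CONCAT handle the string conversion for you; a couple of options that avoid the asterisk entirely:

DECLARE @Age int = 100;

/* An int needs at most 11 characters, including a possible negative sign */
PRINT(CAST(@Age AS VARCHAR(11)) + ' years old.');

/* CONCAT implicitly converts its arguments to strings without truncating them */
PRINT(CONCAT(@Age, ' years old.'));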

Conversions are hard

So is SQL Server to blame for this issue? Sure. But I can empathize: breaking changes are not something people typically want to add to a product.

I’ve written before how implicit conversions can lead to funny results in SQL Server. This asterisk scenario is another case of SQL Server having to make a judgement call about handling an impossible situation you created for it.

While it’s easy to complain about whether SQL Server handles the conversion the correct way or not, at the end of the day the responsibility falls on you to prevent these types of impossible calculations from being requested of SQL Server in the first place.
