Recursively Querying Row Groups

Watch this week's video on YouTube

Recursive queries are fun to plan and write. They can also be frustrating, depending on the complexity of the problem you are trying to solve.

This post shows one solution for finding all records that are related, either directly or via intermediate records, using recursive queries in SQL Server.

The Data

Here's the data we'll be working with:

DROP TABLE IF EXISTS #relationships;

CREATE TABLE #relationships (
    Id int,
    FK1 varchar(10),
    FK2 varchar(10)
);

INSERT INTO #relationships
VALUES 
    /* This group of records are all related, directly or through intermediate nodes */
    (1,'A','B'),
    (2,'A','E'),
    (3,'A','G'),
    (4,'A','F'),
    (5,'B','F'),
    (6,'B','E'),
    (7,'E','F'),
    (8,'G','H'),

    (9,'B','A'),  /* This is the same relationship as Id=1, just with the keys flipped */
    (10,'B','F'), /* This is a straight-up duplicate of Id=5 */


    /* These records are related simply where FK2 of one row is FK1 of the subsequent row */
    (11,'I','J'),
    (12,'J','K'),
    (13,'K','L'),

    /* Some more related records */
    (14,'M','N'),
    (15,'O','M'),
    (16,'P','O'),
    (17,'M','N'), /* Duplicate of Id=14 */
    (18,'M','O'), /* Flipped duplicate of Id=15 */
    (19,'M','P'),

    /* These records are interesting because the FK2 values never appear in the FK1 column */
    (20,'Q','R'),
    (21,'S','R'),
    (22,'T','R');

Each row has values for the Id, FK1, and FK2 columns. Id is a unique primary key that I included to make referencing individual rows easier. FK1 and FK2 are foreign keys that reference full records in another table. The point is that this table shows the relationships between records.

I used blank lines to visually separate groups of records in the data above. The end result I want to achieve is for each group of related records to share the same group_id. I also added some comments to point out some interesting scenarios in the data.

I programmed the data edge cases I knew I would encounter into the test data; a sort of poor man's test-driven development. Including these known edge cases helps me test my query so I know my final solution will handle them correctly. If you are applying this solution to your own data, make sure to add your own test cases to the data.

Query Transformations for Sorted Hashes

The first recursive query I wrote to solve this problem was ugly. I was creating sorted hashes to ensure that rows where the FK values were swapped got deduplicated (e.g. I would want to dedupe Id=1 and Id=9). That solution involved joining on CASE expressions like this all throughout the set of queries: CASE WHEN FK1<=FK2 THEN '|'+FK1+','+FK2+'|' ELSE '|'+FK2+','+FK1+'|' END AS key_hash.

Yuck. The solution worked, but I knew my future self would not want to maintain that type of code.

Rather than have the final query do all of the work, I decided to clean up the data first instead.

Initial Clean Up

I decided to transform the data to eliminate all duplicates and make sure my keys were always sorted so that FK1 <= FK2. This allows for a simpler final query:

--Remove duplicates where keys are switched
DROP TABLE IF EXISTS #deduped;
CREATE TABLE #deduped (
    Id int IDENTITY(1,1),
    FK1 varchar(10),
    FK2 varchar(10),
    key_hash varchar(100)
);
INSERT INTO #deduped (FK1,FK2,key_hash)
SELECT
    dupes.FK1,
    dupes.FK2,
    dupes.key_hash
FROM
    (
    SELECT
        hashes.*,
        ROW_NUMBER() OVER (PARTITION BY key_hash ORDER BY FK1) AS RN
    FROM
        (
        /* make sure FK1 is always smaller than FK2. This eliminates a lot of more complicated logic later on */
        SELECT
            CASE WHEN FK1 <= FK2 THEN FK1 ELSE FK2 END AS FK1,
            CASE WHEN FK1 <= FK2 THEN FK2 ELSE FK1 END AS FK2,
            CASE WHEN FK1 <= FK2 THEN '|'+FK1 +','+FK2+'|' ELSE '|'+FK2+','+FK1+'|' END AS key_hash

        FROM
            #relationships
        ) hashes
    ) dupes
WHERE
    dupes.RN = 1;

This still uses the key_hash CASE expression I mentioned previously, but it only has to be written once, up front, to dedupe the entries regardless of the order of their keys.

Grouping Related Records

With a deduped dataset, the recursive query for finding groups of related records is (relatively) straightforward.
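
If you need a refresher on how recursive CTEs work in SQL Server first, here is a minimal sketch of the anchor/recursive-member pattern the real query below relies on (it just counts from 1 to 5):

WITH numbers AS (
    /* Anchor member: runs once and seeds the result set */
    SELECT 1 AS n
    UNION ALL
    /* Recursive member: runs repeatedly against the previous iteration's rows
       until no new rows are returned */
    SELECT n + 1
    FROM numbers
    WHERE n < 5
)
SELECT n FROM numbers;

With that pattern in mind, here is the grouping query: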

WITH c AS (
    /* The initial query */
    SELECT
        1 as level,
        DENSE_RANK() OVER(ORDER BY FK1) group_id,
        key_hash,
        FK1,
        FK2
    FROM
        #deduped
    UNION ALL
    /* The recursive query: runs against the anchor query's output above, then against each subsequent result set, until no new rows are produced */
    SELECT 
        c.level+1,
        c.group_id,
        t.key_hash,
        t.FK1,
        t.FK2
    FROM 
        c
        INNER JOIN #deduped t
            ON c.FK2 = t.FK1
    WHERE
        /* don't include combinations already matched */
        c.key_hash not like '%'+t.key_hash+'%'
), 
/* regular CTE */
output_with_dupes as (
SELECT
    level,
    key_hash,
    group_id,
    FK1,
    FK2,
    ROW_NUMBER() OVER (PARTITION BY key_hash ORDER BY group_id) AS RN
FROM
    c
)

-- deduped output
SELECT
    group_id,
    FK1,
    FK2
FROM
    output_with_dupes
WHERE RN = 1
ORDER BY
    group_id,
    FK1
OPTION(MAXRECURSION 0)

Some records are duplicated after the recursive join, so a second CTE with ROW_NUMBER() dedupes the final output. The result is a table of rows where every set of related records shares the same group_id: in the test data above, the rows containing A, B, E, F, G, and H all share one group_id, the I/J/K/L rows share another, and so on.

Couldn't you have done this with hierarchyid?

Probably.

But I wanted a solution flexible enough to reuse in other relational databases that don't support hierarchyid.

More than just recursive queries

If you've never written a recursive SQL query before, hopefully this post gives you an idea of how to do it.

The more important takeaway is that sometimes it's easier to solve a data problem than a query problem. Like I said previously, my initial recursive query was hideous and unmaintainable because of the duplicates I had in my data. Rather than living with that query, it made sense to clean up the data first and then write the simpler query on the clean data.

Should I Transform My Data In My SQL Query?

Watch this week's video on YouTube

Imagine you need to join two tables of data and filter the results. Perhaps you also need to convert some of the values for display (e.g. 0 => "No", 1 => "Yes").
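
For a concrete sketch, here's roughly what that kind of query might look like in SQL, using hypothetical Orders and Customers tables (all of the table and column names here are made up for illustration):

SELECT
    c.CustomerName,
    o.OrderDate,
    /* Convert the stored bit flag into a display-friendly value */
    CASE o.IsShipped WHEN 1 THEN 'Yes' ELSE 'No' END AS Shipped
FROM dbo.Orders o
    INNER JOIN dbo.Customers c
        ON o.CustomerId = c.CustomerId
WHERE
    o.OrderDate >= '2020-01-01';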

Do you choose to perform all of this in your SQL query? Or do you bring the data into your app and handle it there with code?

Option 1: Write it in a SQL query

Usually I'm inclined to do as much in the SQL query as possible because:

Optimization

Relational SQL databases tend to be optimized for doing things like joins, filtering, aggregations, etc.

Software engineers have spent decades tuning their sorting algorithms and squashing bugs to make their relational databases handle these operations efficiently. Sure, you can probably find the occasional edge case where you could handcraft some app code to accomplish one of these things faster, but in most cases it's not worth the additional time and effort.

Ordered Data

The types of operations in the example above (joins, filtering) benefit from ordered data.

Indexes in your database store data in a predefined order, making all of those operations faster. If you need to perform a transformation that can utilize an index, it usually makes sense to let the database handle that operation in the SQL query rather than moving the data to your app, where no index is available.
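
For example, an index like the one below (sticking with the hypothetical Orders table from the earlier sketch) stores the data pre-sorted on the filter and join columns, letting the database seek straight to the rows it needs:

/* Hypothetical supporting index for the earlier join + filter */
CREATE NONCLUSTERED INDEX IX_Orders_OrderDate
    ON dbo.Orders (OrderDate, CustomerId)
    INCLUDE (IsShipped);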

Data Reduction

Joining and filtering your data in your database allows you to send a reduced number of records over the network to your application.

Network speeds tend to be one of the largest bottlenecks in application systems, so reducing the amount of data you need to pass over the network up front in your SQL query can greatly improve the performance of your application.

This is especially true in a work-from-home world where home network speeds are even more of a bottleneck.

Hardware

This one can be a toss-up. The SQL database servers I usually work with are beefy, with significantly more memory and CPU than my general purpose application servers (or my laptop running analytical code).

This means I can run most of my transformation logic faster on my database server than anywhere else. The trade-off is that performance-hungry queries take resources away from other queries running on the database server at the same time, but on servers not running at capacity that trade-off can be worth the speed. More on this in the app code option below.

Portability

If you need to share your transformation logic, a SQL query is going to be easier to share with others or include in other processes than application code.

For example, SQL queries are like a universal language in most companies: programmers, analysts, data scientists, technical product managers, and anyone else who may be interested in understanding the business logic will be able to get a high-level understanding by looking at the SQL query. It will also be easier for them to incorporate into their processes.

Contrast that with app code that is not easy to execute without installing dependencies and understanding language-specific syntax; a SQL query will always be easier to share and reuse.

Option 2: Code it in the application

Everything you can write in a SQL query can be performed in most programming languages.

Let's discuss when it makes sense to store transformation logic in app code.

Domain Specific Performance Improvements

Most SQL relational databases are built to be general purpose. That's what makes them so powerful: they work well to help solve many different problems without needing domain specific optimizations.

Sometimes though, you may know something about your data that will allow you to work with it more efficiently than a relational database.

For example, maybe your data consists of mostly XML files, or you know you will be doing a lot of recursive processing of your data. Sure, most relational database engines have some way to accomplish these tasks, but performing this type of processing will most likely be faster to do in your application code.

Local Caching

Sometimes your application will need to process and reprocess the same data over and over again.

In these cases, it may make sense to transfer your data across the network a single time to your application, so it can locally process (and reprocess) that data as many times as needed.

Costs

Many enterprise database platforms cost money to license. Application servers usually don't have those same kinds of costs. That means that running your transformation logic in a SQL query on your database server is often more expensive than in code on an application server.

If your database server is not at full capacity, then this likely isn't an issue; after all, you want to maximize the usage of the licenses you are paying for.

However, if your database environment is at or over capacity, offloading some transformation logic into your application code may make it run faster and will also likely be cheaper to run.

The cloud makes this cost difference even more apparent, with database-as-a-service offerings generally being more expensive than their application compute counterparts.

Conclusion: which is better?

It depends on your situation.

This post speaks to generalized scenarios. For specific uses, there will be times where it makes sense to store your logic in SQL queries, and other times in your application code.

The point is that the next time, before you dive in and start writing code, think through your restrictions and goals to figure out the best place to run things.

Moving 1/3 of a heap

Watch this week's video on YouTube

A Giant Heap

Recently I had to filter out 1.2 billion records from a 3.5 billion row heap. Don't ask me why this 3.5 billion row table is stored as a heap.

If the lack of a clustered index wasn't bad enough, I also had some other restrictions:

  • I couldn't add a clustered index (or any index for that matter) sorted on the key I needed to filter on. It wasn't my system, and I needed to access the 1.2 billion records sooner than it would take to get a clustered index approved and added. Even then, maybe the lack of a clustered index on this table is a feature and my request would be denied. Who knows.
  • I didn't have a server that could store all of the data. At first I thought of copying all 3.5 billion rows to my own server and indexing it how I needed, but I didn't have enough storage space anywhere to do that.
  • My connection to the server had a relatively short timeout set on it. This also couldn't be changed. If I couldn't copy all 3.5 billion rows because of storage, I also couldn't copy all 1.2 billion records in one fell swoop because the connection would timeout.

Iterative Process

Scanning the table hundreds of times...

I struggled with this problem for a little bit. My deadline clock was ticking and I was stuck as to how I could copy and subsequently query the 1.2 billion rows of data I needed. My focus transitioned from "what is the best way to do this" to "how do I do this".

The solution that ended up working for me was to query the table hundreds of times, each time filtering for and copying only one week of data with a query similar to this:

SELECT *
FROM dbo.MyBigHeap
WHERE
    CreateDate >= @StartDate
    AND CreateDate < @StopDate
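
One way to drive that query is a simple week-by-week loop. Here's a rough sketch in T-SQL (the date boundaries are made up for illustration and the actual copy/export step is omitted; the same loop could just as easily live in app code):

DECLARE @StartDate datetime2 = '2010-01-01'; /* made-up boundary for illustration */
DECLARE @FinalDate datetime2 = '2020-01-01'; /* made-up boundary for illustration */
DECLARE @StopDate datetime2;

WHILE @StartDate < @FinalDate
BEGIN
    SET @StopDate = DATEADD(WEEK, 1, @StartDate);

    /* Copy this week's slice of rows to the destination
       (insert into a local table, export to a file, etc.) */
    SELECT *
    FROM dbo.MyBigHeap
    WHERE
        CreateDate >= @StartDate
        AND CreateDate < @StopDate;

    SET @StartDate = @StopDate;
END;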

Yes, this did cause me to scan the entire table hundreds of times, but in the end it was the right amount of data that I could copy at a time before the connection timed out.

Eventually I had the 1.2 billion rows I needed copied to my own server. I put a clustered columnstore index on the table (primarily for the compression savings) and some nonclustered indexes to support the queries I would need to run on it. Was this the best solution? I don't know. But it worked for me given the constraints and deadline I had to meet.
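
For reference, creating those kinds of indexes on the copied table looks something like this (the table, index, and column names here are placeholders, and mixing a rowstore index with a clustered columnstore assumes SQL Server 2016 or later):

/* Columnstore for compression, plus a rowstore index to support date filtering */
CREATE CLUSTERED COLUMNSTORE INDEX CCI_MyCopiedTable
    ON dbo.MyCopiedTable;

CREATE NONCLUSTERED INDEX IX_MyCopiedTable_CreateDate
    ON dbo.MyCopiedTable (CreateDate);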

Moral of the Story

Always put a clustered index on your tables. Even if you don't have a use case to sort or filter them immediately, skipping one creates a world of pain for whoever comes along later and does need to query that data.
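
If you do have the option, adding one can be as simple as the statement below (the key column here is just an example; pick a key that matches how the data actually gets queried):

/* Turn the heap into a clustered index on a sensible key */
CREATE CLUSTERED INDEX CIX_MyBigHeap_CreateDate
    ON dbo.MyBigHeap (CreateDate);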

COMING SOON - New Content

Watch this week's video on YouTube

Hey everyone. Long time no chat. I wanted to write this quick post to let you know why there haven't been any videos in a while. If you are interested in technical content, skip this post and come back next week for a post on filtering data from heaps.

First off, thanks to everyone who reached out to see if things were ok. Things are great! Well, as good as they can be anyway. Since my last post/video a few things have changed in my life:

  • I got a new job building out a new data science team
  • My wife and I had our second child, which has been very exciting
  • COVID-19 has turned the world upside down

Combine all of that with America's racial atrocities and injustices being put in the spotlight and well, writing new posts and filming videos just wasn't a priority.

And while the world still sometimes feels like a terrible place, and I'm still not getting adequate sleep with a 4 month old in the house, I've been wanting to get back to writing blog posts and making videos because they bring me, and I know many of you, enjoyment.

With that said, going forward I am going to make a few changes to make things more sustainable:

  • For over 3 years, I wrote posts and filmed videos every week. Most of the time this was manageable, but sometimes it caused me stress to meet deadlines for something that I'm doing for fun. So while I still plan to update this blog regularly, I plan on doing it in a way that doesn't self-impose stress.
  • You may have noticed the logo change to Data with Bert. SQL has always been just a small slice of my technical life and I've wanted to incorporate the rest of my data-related projects into my website. Things like what I'm currently building with Arduinos, and how I'm using the cloud to store data in non-relational services. Writing about these other projects never felt right given the site had "SQL" in the name, so that's what brought about the change. I am still going to write primarily about SQL, but I'm basically giving myself permission to widen the topic area to other data-related projects. They will all still have a heavy data problem solving focus, so if you've enjoyed the posts up to this point, I'm sure you will find the future posts relevant as well.

Anyway, thanks for reading :).

Using Python And NetworkX To Build A Twitter Follower Recommendation Engine

Watch this week's video on YouTube

This week, I want to share my process for analyzing Twitter. Specifically, I want to find who all of my friends follow on Twitter that I don't currently follow. Essentially, I want to build a Twitter follower recommendation engine.

Let's start with the theory behind the problem I'm trying to solve. On Twitter, I follow a bunch of people. Around 450 at the time of filming this video.


These are people I've either met in person and want to keep in touch with or am interested in following - or both!

Now I don't attend many conferences and events, so finding more people to follow that way is slow going; I probably only meet a handful of new people every year.

However, the second category of people, the interesting ones out there on the internet, is almost limitless. The problem is finding them in a virtual sea of bots, marketing-only accounts, and complainers.

Twitter does have a "Who to Follow" page, and while I'm sure it has some great suggestions, I don't necessarily trust all of Twitter's recommendations. It's kind of like trusting a site like Yelp for restaurant reviews when the local McDonalds rates better than your favorite burger place. I'm sure that particular McDonalds is beautiful by fast food standards, but I'm looking for more personalized recommendations.

So that leads me to this project: I decided to find better personalized recommendations on Twitter by looking at who my friends are following that I am not.

Getting the Data

To start, I needed a list of the people I follow on Twitter, and then a list of who they follow. The proper way to do this is to use the official Twitter API, so I wrote some Python code to do exactly that. Twitter calls the people you follow "Friends", which is funny since I usually don't consider one-way relationships friendships.

Screenshots of code are lame - go to GitHub to see the actual code.

I won't go through the code in detail (you can download it from GitHub if you'd like to do that) but essentially it uses the friends/list endpoint to download all of my Friends. I then went through each of those Friends and found all of their Friends.

In total, this ended up being around 125,000 people.

As a quick side note, the Twitter API is terrible. Using it is fine, but the rate limiting is horrendous. I was basically capped at downloading 12,000 people per hour, which meant it took about a day to download all of the data I needed.

Graph Analysis with NetworkX

Once I had the data downloaded, it was time to find relationships between my friends and the people they follow. For this, I decided to use an open source Python library called NetworkX. NetworkX helps perform complex network analysis, which is perfect for what I was trying to do.

NetworkX uses a graph structure to help with its analysis. A graph is made up of nodes and edges. In our case, the Twitter users are our nodes and the relationships are our edges. The first thing I did was load all of the people I follow and create a directional edge (a.k.a. an arrow) to indicate the one-way relationship from me to them. Next, I looped over all of those people's friends, adding additional nodes and edges.


At this point, my graph had 125,000 nodes which was way too many to draw quickly on my computer, and way too many to model with Lego minifigs. I figured not ALL of this data would be useful and I needed a way to filter it.

I started by filtering out any of my friends' friends who happen to already be my friends - this wouldn't be useful since I already follow these people.

Next, I decided to keep only the top 50 most followed people in that second tier of Twitter users. This would limit the processing needed while still giving me a list of the most followed Twitter users that my friends follow that I don't currently follow. Once again, all of this code is available on my GitHub.

At this point, I was able to use NetworkX's drawing capabilities to beautifully render the network of users and relationships. Sure, I could have just listed out the top users that I don't follow, but there is something special about being able to create a visual network of those relationships.


So that's it. It was fun using a graph library to find new people to follow on Twitter, using the people I currently follow as a proxy for relevance instead of relying on Twitter's black box algorithms.
