Moving 1/3 of a heap

Watch this week's video on YouTube

A Giant Heap

Recently I had to filter out 1.2 billion records from a 3.5 billion row heap. Don't ask me why this 3.5 billion row table is stored as a heap.

If the lack of a clustered index wasn't bad enough, I also had some other restrictions:

  • I couldn't add a clustered index (or any index for that matter) sorted on the key I needed to filter on. It wasn't my system, and I needed to access the 1.2 billion records sooner than it would take to get a clustered index approved and added. Even then, maybe the lack of a clustered index on this table is a feature and my request would be denied. Who knows.
  • I didn't have a server that could store all of the data. At first I thought of copying all 3.5 billion rows to my own server and indexing it how I needed, but I didn't have enough storage space anywhere to do that.
  • My connection to the server had a relatively short timeout set on it. This also couldn't be changed. If I couldn't copy all 3.5 billion rows because of storage, I also couldn't copy all 1.2 billion records in one fell swoop because the connection would timeout.

Iterative Process

Scanning the table hundreds of times...

I struggled with this problem for a little bit. My deadline clock was ticking and I was stuck as to how I could copy and subsequently query the 1.2 billion rows of data I needed. My focus transitioned from "what is the best way to do this" to "how do I do this".

The solution that ended up working for me was to query the table hundreds of times, each time filtering out and copying only 1 week of data by running a query similar to this:

SELECT *
FROM dbo.MyBigHeap
WHERE
    CreateDate >= @StartDate
    AND CreateDate < @StopDate

Yes, this did cause me to scan the entire table hundreds of times, but in the end it was the right amount of data that I could copy at a time before the connection timed out.

Eventually I had the 1.2 billion rows I needed copied to my own server. I had a clustered column store index on the table (primarily for the compression savings) and some nonclustered indexes to support the queries I would need to run on it. Was this the best solution? I don't know. But it worked for me given the constraints and deadline I had to meet.

Moral of the Story

Always put a clustered index on your tables. Even if you don't have a use case to sort/filter them immediately, you will be creating a world of pain when someone comes along who does need to query that data.

COMING SOON - New Content

Watch this week's video on YouTube

Hey everyone. Long time no chat. I wanted to write this quick post to let you know why there haven't been any videos in a while. If you are interested in technical content, skip this post and come back next week for a post on filtering data from heaps.

First off, thanks to everyone who reached out to see if things were ok. Things are great! Well, as good as they can be anyway. Since my last post/video a few things have changed in my life:

  • I got a new job building out a new data science team
  • My wife and I had our second child, which has been very exciting
  • COVID19 has turned the world upside down

Combine all of that with America's racial atrocities and injustices being put in the spotlight and well, writing new posts and filming videos just wasn't a priority.

And while the world still sometimes feels like a terrible place, and I'm still not getting adequate sleep with a 4 month old in the house, I've been wanting to get back to writing blog posts and making videos because they bring me, and I know many of you, enjoyment.

With that said, going forward I am going to make a few changes to make things more sustainable:

  • For over 3 years, I wrote posts and filmed videos every week. Most of the time this was manageable, but sometimes it caused me stress to meet deadlines for something that I'm doing for fun. So while I still plan to update this blog regularly, I plan on doing it in a way that doesn't self-impose stress.
  • You may have noticed the logo change to Data with Bert. SQL has always been just a small slice of my technical life and I've wanted to incorporate the rest of my data-related projects into my website. Things like what I'm currently building with Arduinos, and how I'm using the cloud to store data in non-relational services. Writing about these other projects never felt right given the site had "SQL" in the name, so that's what brought about the change. I am still going to write primarily about SQL, but I'm basically giving myself permission to widen the topic area to other data-related projects. They will all still have a heavy data problem solving focus, so if you've enjoyed the posts up to this point, I'm sure you will find the future posts relevant as well.

Anyway, thanks for reading :).

Using Python And NetworkX To Build A Twitter Follower Recommendation Engine

Watch this week's video on YouTube

This week, I want to share my process for analyzing Twitter. Specifically, I want to find who all of my friends follow on Twitter that I don't currently follow. Essentially, I want to build a Twitter follower recommendation engine.

Let's start with the theory behind the problem I'm trying to solve. On Twitter, I follow a bunch of people. Around 450 at the time of filming this video.

a

These are people I've either met in person and want to keep in touch with or am interested in following - or both!

Now I don't attend many conferences and events, so finding more people to follow that way is slow going; I probably only meet a handful of new people every year.

However, the second category of people, the interesting ones out there on the internet, is almost limitless. The problem is finding them in a virtual sea of bots, marketing-only accounts, and complainers.

Twitter does have a "Who to Follow" page, and while I'm sure it has some great suggestions, I don't necessarily trust all of Twitter's recommendations. It's kind of like trusting a site like Yelp for restaurant reviews when the local McDonalds rates better than your favorite burger place. I'm sure that particular McDonalds is beautiful by fast food standards, but I'm looking for more personalized recommendations.

So that leads me to this project: I decided to find better personalized recommendations on Twitter by looking at who my friends are following that I am not.

Getting the Data

To start, I needed a list of the people I follow on Twitter, and then a list of who they follow. The proper way to do this is to use the official Twitter API, so I wrote some Python code to do exactly that. Twitter calls the people you follow "Friends", which is funny since I usually don't consider one-way relationships friendships.

Screenshots of code are lame - go to GitHub to see the actual code.

I won't go through the code in detail (you can download it from GitHub if you'd like to do that) but essentially it uses the friends/list endpoint to download all of my Friends. I then went through each of those Friends and found all of their Friends.

In total, this ended up being around 125,000 people.

As a quick side note, the Twitter API is terrible. Using it is fine, but the rate limiting is horrendous. I was basically capped at downloading 12,000 people per hour, which meant it took about a day to download all of the data I needed.

Graph Analysis with NetworkX

Once I had the data downloaded, it was time to find relationships between my friends and the people they follow. For this, I decided to use an open source Python library called NetworkX. NetworkX helps perform complex network analysis, which is perfect for what I was trying to do.

NetworkX uses a graph structure to help with its analysis. A graph is made up of of nodes and edges. In our case, the Twitter users are our nodes, and our edges are the relationships. The first thing I did was load all of the people I follow and created a directional edge (aka. an arrow) to indicate the one-way relationship from me to them. Next, I looped over all of those people's friends, adding additional nodes and edges.

extended-network

At this point, my graph had 125,000 nodes which was way too many to draw quickly on my computer, and way too many to model with Lego minifigs. I figured not ALL of this data would be useful and I needed a way to filter it.

I started by filtering out any of my friends' friends who happen to already be my friends - this wouldn't be useful since I already follow these people.

Next, I decided to keep only the top 50 most followed people in that second tier of Twitter users. This would limit the processing needed while still giving me a list of the most followed Twitter users that my friends follow that I don't currently follow. Once again, all of this code is available on my GitHub.

At this point, I was able to use NetworkX's drawing capabilities to beautifully render the network of users and relationships. Sure, I could have just listed out the top users that I don't follow, but there is something special about being able to create a visual network of those relationships.

)

So that's it. It was fun using a graph library to find new people to follow on Twitter using the people I currently follow as a proxy for finding relevant users instead of Twitter's black box algorithms.

SQL Server Converts Numbers to Asterisks

Watch this week's video on YouTube

Pop quiz: What will be the output of the below three statements?

DECLARE @val_varchar varchar(3) = '100';
DECLARE @val_bigint bigint = 100;
DECLARE @val_tinyint tinyint = 100;

PRINT('varchar to varchar:');
PRINT(CAST(@val_varchar AS VARCHAR(2)));

PRINT('bigint to varchar:');
PRINT(CAST(@val_bigint AS VARCHAR(2)));

PRINT('tinyint to varchar:');
PRINT(CAST(@val_tinyint AS VARCHAR(2)));

As you might have guessed, the first conversion truncates the value, leaving the answer as "10":

2019-12-16-12-58-46

The second response is also somewhat intuitive - a number can't be converted into string that has fewer places than original digits:

2019-12-16-12-59-03

However, the final answer returns:

2019-12-16-12-59-20

NOTE: if we change all of the conversions to VARCHAR(3) above, all three results return the string value '100'

What the ****?

Why does SQL Server sometimes error when converting a number into a string, but other times succeeds and returns an asterisk?

I don't know.

The best (and logical) answer I could find online is from Robert Sheldon, who attributes it to poor error handling practices, "...before error handling got a more reputable foothold."

My understanding is that the asterisk was originally the default truncation value. Later, when the SQL Server development team was adding the bigint datatype, they decided throwing an exception is a better way to handle truncation in errors. However, probably for backwards compatibility, they never went back to change the behavior of the asterisks for pre-existing datatypes.

The documentation is clear about this behavior as well: any int, smallint, or tiny int when converted to char or varchar with fewer characters will result in an asterisk.

This becomes a problem if you create strings out of tinyints, smallints, and ints and suddenly start receiving values larger than you originally expected:

DECLARE @Age int = 75;
PRINT(CAST(@Age AS VARCHAR(2)) + ' years old.');

SET @Age = 100;
PRINT(CAST(@Age AS VARCHAR(2)) + ' years old.');

2019-12-16-13-00-53

Conversions are hard

So is SQL Server to blame for this issue? Sure. But I can empathize: breaking changes are not something people typically want to add to a product.

I've written before how implicit conversions can lead to funny results in SQL Server. This asterisk scenario is another case of SQL Server having to make a judgement call about handling an impossible situation you created for it.

While it's easy to complain about whether SQL Server handles the conversion the correct way or not, at the end of the day the responsibility falls on you to prevent these types of impossible calculations from being requested of SQL Server in the first place.

Azure SQL vs Azure Table Storage

Watch this week's video on YouTube

A year ago I built an app to keep track of pickup volleyball game scores and payments. It works well, but after a year of regular use it's time to update it with some improvements.

As part of the update process, I'm rewriting the data layer to use Azure's NoSQL Table Storage instead of Azure SQL.

Today I'll walk you through some of the details I considered when deciding to switch to a NoSQL storage solution.

Why Azure NoSQL Table Storage

Cost

Azure SQL has been good to me - it works, it's familiar, and it's relatively inexpensive. However, even though the \$5/month Azure SQL bill isn't cost prohibitive (especially with free credits :)), ideally I'd love the app to be able to fund itself via the fractions of a penny that get rounded up as part of the app's balancing logic.

Azure Table Storage in comparison is dirt cheap. As of the time I'm writing this, it will cost me about \$.05/month to store and access my < 1gb of application data.

Another option I considered is Azure SQL Serverless. This is a nice alternative because it would allow me to keep the relational structure of my data, however based on my historical app usage patterns it would definitely end up costing more than Azure Table Storage. Cool service, but not a good fit for my scenario.

Simplicity

Another reason I went with Azure Table Storage is its simplicity.

Azure offers an alternative table storage option called Cosmos DB. Cosmos DB is Azure's premium table storage offering, but it was overkill for what my app needed.

Yes, Cosmo DB's global distribution is cool. Yes, its additional indexes on my data is great. However, as I mentioned, cost is the biggest factor in how I'm deciding on a storage solution for this project. And Cosmos DB has lots of great features, but they come at a price.

My preference for this app is to pay more up front in development time to create a better design then rely on a service that does some of that for me at a higher monthly cost. In this case, simple features will accomplish what I need, so that's what I'm going with.

Design Considerations for Azure Table Storage

When rewriting my app's data layer, there were several new things that I had to account for in Azure Table Storage.

Primary Key

The primary key in Azure Table Storage is made up of two columns: PartitionKey and RowKey. This composite primary key is also the clustered (and only!) index for the table.

The PartitionKey in particular is important because it determines whether related rows of data will exist on the same underlying server or not. Different PartitionKey values may end up on different servers. This can be a good thing if you factor in parallel access in your design, or a significant bottleneck if all of your data ends up residing on a single server.

Latency

There are two aspects of latency to consider.

The first is how far you are from your Azure region. For me, this isn't a huge deal since the only users of the app are currently in northeast Ohio, so choosing to store all of the data in the same region is good enough.

The bigger consideration for this app is that with Azure Table Storage, you can't do any joins with your data. Well, you can join in your app, but you can't join in Azure itself.

This means that design is critical to reducing latency since joining multiple tables of data in your app will require multiple calls to the Azure Table Storage service. This is not something that is necessarily a deal breaker, just something that needs to be considered, especially if coming from a relational SQL background where you are using JOINs to filter and reduce your data before it is returned to your app.

Row Size Limitations

NoSQL gets a lot hate for its common pattern of storing giant blobs of semi-structured data in a single field (this design makes sense when considering the latency considerations).

However, this becomes a delicate balancing act since each row in Azure Table Storage can only be a maximum of 1mb in size. This causes you to want to fit as much data into a single call as possible (to reduce the number of calls) while also not exceeding the 1mb row size.

Azure Table Storage does allow up to 252 columns of data per table (plus the required PartitionKey, RowKey, and Timestamp columns) so at least your 1mb of data will be organized.

…and more!

The above details were the primary considerations I had to take into account for my specific app. There are things data like data consistency, durability, and more that you may want to take into account based on your app's goals and usage patterns.

Conclusion

While NoSQL can often break many of the relational concepts we are used to using, it is often the means for achieving the cheapest pay-only-for-what-you-use pricing on cloud providers.