Moving 1/3 of a heap

Published on: 2020-07-22

Watch this week’s video on YouTube.

A Giant Heap

Recently I had to filter out 1.2 billion records from a 3.5 billion row heap. Don’t ask me why this 3.5 billion row table is stored as a heap.

If the lack of a clustered index wasn’t bad enough, I also had some other restrictions:

  • I couldn’t add a clustered index (or any index for that matter) sorted on the key I needed to filter on. It wasn’t my system, and I needed to access the 1.2 billion records sooner than it would take to get a clustered index approved and added. Even then, maybe the lack of a clustered index on this table is a feature and my request would be denied. Who knows.
  • I didn’t have a server that could store all of the data. At first I thought of copying all 3.5 billion rows to my own server and indexing it how I needed, but I didn’t have enough storage space anywhere to do that.
  • My connection to the server had a relatively short timeout set on it. This also couldn’t be changed. If I couldn’t copy all 3.5 billion rows because of storage, I also couldn’t copy all 1.2 billion records in one fell swoop because the connection would timeout.

Iterative Process

Scanning the table hundreds of times…

I struggled with this problem for a little bit. My deadline clock was ticking and I was stuck as to how I could copy and subsequently query the 1.2 billion rows of data I needed. My focus transitioned from “what is the best way to do this” to “how do I do this”.

The solution that ended up working for me was to query the table hundreds of times, each time filtering out and copying only 1 week of data by running a query similar to this:

FROM dbo.MyBigHeap
	CreateDate >= @StartDate
	AND CreateDate < @StopDate

Yes, this did cause me to scan the entire table hundreds of times, but in the end it was the right amount of data that I could copy at a time before the connection timed out.

Eventually I had the 1.2 billion rows I needed copied to my own server. I had a clustered column store index on the table (primarily for the compression savings) and some nonclustered indexes to support the queries I would need to run on it. Was this the best solution? I don’t know. But it worked for me given the constraints and deadline I had to meet.

Moral of the Story

Always put a clustered index on your tables. Even if you don’t have a use case to sort/filter them immediately, you will be creating a world of pain when someone comes along who does need to query that data.

Thanks for reading. You might also enjoy following me on Twitter.

Want to learn even more SQL?

Sign up for my newsletter to receive weekly SQL tips!

5 thoughts on “Moving 1/3 of a heap”

  1. Heh… they’d have probably put a Clustered Index on something that didn’t have anything to do with the dates you were looking for and so it probably wouldn’t have helped.

    I’m curious… what on Earth did this table contain and why was it important for you to download a third of it? There’s a bigger story to all of this and I’m certainly interested.

    1. You’re probably right Jeff. But I’d take an index on almost anything – hopefully I’d be able to use it as a proxy for the rows I would want to filter on.

      The table was full of log entries – so you would think it’d be filtered on a date, but that wasn’t the case *shrug*.

  2. SELECT *
    FROM dbo.MyBigHeap
    ORDER BY CreateDate
    OFFSET 1 rows

    This way you can take withdraw a specific portion of the data.
    – because not every week is the same
    – because you can take the maximum of the connection timeout
    – because you can automate easily to fetch the next 100 to 200 and 200 to 300, etc… because with dates must have been a pain, right?

Leave a Reply

Your email address will not be published. Required fields are marked *

This site uses Akismet to reduce spam. Learn how your comment data is processed.