Gaps and Islands Across Date Ranges

Published on: 2019-03-12

Gaps and Islands
Watch this week’s episode on YouTube.

In a traditional gaps and islands problem, the goal is to identify groups of continuous data sequences (islands) and groups of data where the sequence is missing (gaps).

While many people encounter gaps and islands problems when dealing with ranges of dates, and recently I did too but with an interesting twist:

How do you determine gaps and islands of data that has overlapping date ranges?

Overlapping Date Ranges

First let’s start with some sample data to help follow along. The peculiarity to pay attention to is that the date ranges for each row might be completely discrete, completely enclosed, or they may overlap each other on either end:

DROP TABLE IF EXISTS #OverlappingDateRanges;
CREATE TABLE #OverlappingDateRanges (StartDate date, EndDate date);
 
INSERT INTO #OverlappingDateRanges
SELECT '8/24/2017', '9/23/2017'  UNION ALL
SELECT '8/24/2017', '9/20/2017'  UNION ALL 
SELECT '9/23/2017', '9/27/2017'  UNION ALL 
SELECT '9/25/2017', '10/10/2017' UNION ALL
SELECT '10/17/2017','10/18/2017' UNION ALL 
SELECT '10/25/2017','11/3/2017'  UNION ALL 
SELECT '11/3/2017', '11/15/2017'

SELECT * FROM #OverlappingDateRanges;

What’s unusual about this data is that while the end date of some rows matches the start date of other rows (eg. row 6 and 7), the date ranges of some rows other rows are either fully contained within other rows (eg. row 2 is contained in row 1) while other rows only overlap one boundary (eg. row 4’s EndDate doesn’t overlap with any other rows, but its StartDate is before row 3’s EndDate).

Solution

While there are several ways gaps and islands problems can be solved, here is the solution using window functions that made the most sense to me.

First, we need to create a row number column based on the sequence of start and end dates, as well as bring the previous row’s EndDate to the current row:

SELECT
    ROW_NUMBER() OVER(ORDER BY StartDate,EndDate) AS RN,
    StartDate,
    EndDate,
    LAG(EndDate,1) OVER (ORDER BY StartDate, EndDate) AS PreviousEndDate
FROM
    #OverlappingDateRanges
Previous end date brought onto every row.

Next we add two more fields:

  • IslandStartInd: indicates when a new island begins by looking if the current row’s StartDate occurs after the previous row’s EndDate. We don’t really need this column for the example, but I find it helpful to see what’s going on in the next column.
  • IslandId: indicates which island number the current row belongs to.
SELECT
    *,
    CASE WHEN Groups.PreviousEndDate >= StartDate THEN 0 ELSE 1 END AS IslandStartInd,
    SUM(CASE WHEN Groups.PreviousEndDate >= StartDate THEN 0 ELSE 1 END) OVER (ORDER BY Groups.RN) AS IslandId
FROM
(
    SELECT
        ROW_NUMBER() OVER(ORDER BY StartDate,EndDate) AS RN,
        StartDate,
        EndDate,
        LAG(EndDate,1) OVER (ORDER BY StartDate, EndDate) AS PreviousEndDate
    FROM
        #OverlappingDateRanges
) Groups
Island start and group indicator columns

The IslandId field is just a SUM() of the IslandStartInd, similar to a window function running total.

Our final step is to aggregate our rows using an old fashioned GROUP BY to return the minimum and maximum start and end dates respectively from each of our islands:

SELECT
    MIN(StartDate) AS IslandStartDate,
    MAX(EndDate) AS IslandEndDate
FROM
    (
    SELECT
        *,
        CASE WHEN Groups.PreviousEndDate >= StartDate THEN 0 ELSE 1 END AS IslandStartInd,
        SUM(CASE WHEN Groups.PreviousEndDate >= StartDate THEN 0 ELSE 1 END) OVER (ORDER BY Groups.RN) AS IslandId
    FROM
    (
        SELECT
            ROW_NUMBER() OVER(ORDER BY StartDate,EndDate) AS RN,
            StartDate,
            EndDate,
            LAG(EndDate,1) OVER (ORDER BY StartDate, EndDate) AS PreviousEndDate
        FROM
            #OverlappingDateRanges
    ) Groups
) Islands
GROUP BY
    IslandId
ORDER BY 
    IslandStartDate
Final seleccted islands

Voilà

Regardless of how messy the date ranges within an island are, this technique neatly identifies gaps in the data and returns the start and end of each island’s date range. I often find myself using this technique in scenarios where I need to aggregate some date-based transactional data that otherwise would be too difficult to summarize with aggregate functions alone.

Thanks for reading. You might also enjoy following me on Twitter.

Want to learn even more SQL?

Sign up for my newsletter to receive weekly SQL tips!

19 thoughts on “Gaps and Islands Across Date Ranges”

  1. Hi Burt,
    Shouldn’t your image that shows “The data as three color-coded islands.” have rows 6 and 7 in the same color group? That would be consistent with your image that shows the IslandStartInd and IslandID.

  2. Hi Bert,
    Shouldn’t your image that shows “The data as three color-coded islands.” have rows 6 and 7 in the same color group? That would be consistent with your image that shows the IslandStartInd and IslandID.

  3. Totally awesome in simplicity, Bert. Great job. Pretty easy to understand, as well. I “flipped the code” upside down from what you have it using cCTEs (Cascading CTEs) only because I find it easier to read top down. I also made it “groupable”. Once I get done testing for both functionality and performance, I’ll post it here.

    On that note, is there anything special that needs to be done to maintain leading space for indentation on this blog site?

    Again, really nice code. Thanks for posting it and the really nice article with great explanations.

    1. Thanks Jeff. Look forward to seeing your version.

      And I’m not sure what you mean by maintaining leading intention- in IE, Chrome, and Firefox the queries seem to maintain their indentation both on the page and when copied. Are you seeing something different?

      1. No… having tried posting code on this site before. A lot of sites take nicely formatted/indented code and remove some or all of the leading spaces on code lines. Some require flaming hoops to jump through to keep that from happening and just wanted to know if this site has any similar problems.

        Based on your comment above, the answer is that I won’t need to do anything special to code prior to posting it here.

        1. Heh… now if there was something to help me type better on “butterfly” keyboards. I meant “haven’t tried posting code on this site before”.

          1. Jeff

            Working on a version of this where I want to create the windows within a defined group, is that what you mean by creating a groupable version? If so, would be very helpful if you could post your version.

            Thanks.

  4. Sorry to be the bearer of bad news…

    This is pretty much precisely the solution that I came up with for this issue a while ago (though mine’s doing it within partitions).

    Unfortunately when explaining it to somebody else yesterday, I found out that it doesn’t actually work as well as I thought. I think we both fell into the same trap of carefully constructing the minimum test data to trigger the cases we were trying to anticipate, but skipped over the fact that your EndDate column will not necessarily be ordered across the entire set of rows once you’ve sorted by StartDate.

    This means that if you have a date range that falls entirely within the date range of the previous row *after you’ve sorted by StartDate and EndDate*, it’ll cause the following date range to be misidentified as the start of an island.

    Chuck this extra row into your temp table to see what I mean:

    INSERT #OverlappingDateRanges VALUES (‘20170825’, ‘20170826’);

    The net result is that you get overlapping ranges returned in your final output 🙁

    I’m working on a fix that doesn’t involve a litany of CTEs and subqueries, though I think it may have to involve one more level as it’ll probably be based around identifying the *running maximum* of the previous end dates, rather than just the value from the previous row.

    1. Got it — a lot quicker than I thought!

      Change your PreviousEndDate expression to:

      MAX(EndDate) OVER (ORDER BY StartDate, EndDate ROWS BETWEEN UNBOUNDED PRECEDING AND 1 PRECEDING)

      and you’re golden (Maybe rename it MaxPreviousEndDate or something too).

  5. Here’s a CTE version of the query too, based on my original. Like I say, mine’s operating on multiple windows with a partition key that’s two joins away from the date columns, so I’ve rejigged it to use your tables as the principle’s the same. I was originally flagging the island start rows in the first CTE, then doing a running total of it in the second CTE, but I think I prefer how you’ve got the CASE statement in the “middle” query as it makes the date comparison more obvious, so I moved it in mine as well:

    WITH base
    AS (SELECT StartDate,
    EndDate,
    MAX(EndDate) OVER
    (ORDER BY StartDate, EndDate
    ROWS BETWEEN UNBOUNDED PRECEDING AND 1 PRECEDING) AS RunningMaxEndDate
    FROM #OverlappingDateRanges),
    islands
    AS (SELECT StartDate,
    EndDate,
    SUM(IIF(RunningMaxEndDate >= DATEADD(DAY, -1, StartDate), 0, 1)) OVER
    (ORDER BY StartDate, EndDate
    ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) AS IslandNumber
    FROM base)
    SELECT MIN(StartDate) AS FromDate,
    MAX(EndDate) AS ToDate
    FROM islands
    GROUP BY IslandNumber;

    Note the ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW when identifying the island number — at scale, this makes a significant difference in CPU and query time, whether or not you’re using PARTITION BY for your windowing. It’ll default to RANGE unless you tell it not to, which leads to 534x the reads, 5308x the writes and about 1.8x the execution time on a quick test set with a supporting clustered index matching the ORDER BY clauses for the window functions. It’s a thing that I keep forgetting about myself but learned from one of the MS training kit books a few years ago.

  6. Two more things before I leave you alone — just noticed that you need the >= comparison to be against DATEADD(DAY, -1, StartDate) rather than just StartDate if you want to merge *consecutive* islands together like I was doing.

    Also I tested the subquery way (incorporating the ROWS… framing declaration) against the CTE way, expecting them to generate the same query plan, and there was actually a difference in the plans. This is down to the ROW_NUMBER() in your innermost query being used as the ordering key in the middle query — if you change “ORDER BY RN” to “ORDER BY StartDate, EndDate” the plans become equivalent; you incur an extra Sort operator using RN because the optimiser can’t tell that they’re functionally equivalent to each other. This is less significant than the ROWS… thing; neither are going to make a practical difference with small numbers of rows but start to show up at a hundred thousand rows and more.

    1. Thanks for catching the problem Simon. That’s the trouble with taking real-world code and then turning it into a simplified demo; inevitably sometimes edge cases appear in *other* real-world code.

      Thanks for the detailed write up though – hopefully it may benefit someone else with a similar issue.

    2. Thank you both for this solution – I’ve been working on this for longer than I will admit and could not resolve this last piece – so grateful!!

  7. Thanks for the great post, this helped me out not only for overlapping islands, but for custom gaps. I was used to using a COUNTIF window function in BigQuery, but when trying the same for custom gaps in SQL Server, I was left with a different windowing toolset, for which SUM(CASE…) worked perfectly – this is how I did it:

    ,SUM(CASE WHEN datediff(hh,lastProcessed,sentDate) < 24 or lastProcessed is null THEN 0 ELSE 1 END) OVER (partition by userId ORDER BY RN) AS IslandId

Leave a Reply

Your email address will not be published. Required fields are marked *

This site uses Akismet to reduce spam. Learn how your comment data is processed.