# Gaps and Islands Across Date Ranges

Published on: 2019-03-12

In a traditional gaps and islands problem, the goal is to identify groups of continuous data sequences (islands) and groups of data where the sequence is missing (gaps).

While many people encounter gaps and islands problems when dealing with ranges of dates, and recently I did too but with an interesting twist:

How do you determine gaps and islands of data that has overlapping date ranges?

## Overlapping Date Ranges

First let’s start with some sample data to help follow along. The peculiarity to pay attention to is that the date ranges for each row might be completely discrete, completely enclosed, or they may overlap each other on either end:

``````DROP TABLE IF EXISTS #OverlappingDateRanges;
CREATE TABLE #OverlappingDateRanges (StartDate date, EndDate date);

INSERT INTO #OverlappingDateRanges
SELECT '8/24/2017', '9/23/2017'  UNION ALL
SELECT '8/24/2017', '9/20/2017'  UNION ALL
SELECT '9/23/2017', '9/27/2017'  UNION ALL
SELECT '9/25/2017', '10/10/2017' UNION ALL
SELECT '10/17/2017','10/18/2017' UNION ALL
SELECT '10/25/2017','11/3/2017'  UNION ALL
SELECT '11/3/2017', '11/15/2017'

SELECT * FROM #OverlappingDateRanges;``````

What’s unusual about this data is that while the end date of some rows matches the start date of other rows (eg. row 6 and 7), the date ranges of some rows other rows are either fully contained within other rows (eg. row 2 is contained in row 1) while other rows only overlap one boundary (eg. row 4’s EndDate doesn’t overlap with any other rows, but its StartDate is before row 3’s EndDate).

## Solution

While there are several ways gaps and islands problems can be solved, here is the solution using window functions that made the most sense to me.

First, we need to create a row number column based on the sequence of start and end dates, as well as bring the previous row’s EndDate to the current row:

``````SELECT
ROW_NUMBER() OVER(ORDER BY StartDate,EndDate) AS RN,
StartDate,
EndDate,
LAG(EndDate,1) OVER (ORDER BY StartDate, EndDate) AS PreviousEndDate
FROM
#OverlappingDateRanges``````

Next we add two more fields:

• IslandStartInd: indicates when a new island begins by looking if the current row’s StartDate occurs after the previous row’s EndDate. We don’t really need this column for the example, but I find it helpful to see what’s going on in the next column.
• IslandId: indicates which island number the current row belongs to.
``````SELECT
*,
CASE WHEN Groups.PreviousEndDate >= StartDate THEN 0 ELSE 1 END AS IslandStartInd,
SUM(CASE WHEN Groups.PreviousEndDate >= StartDate THEN 0 ELSE 1 END) OVER (ORDER BY Groups.RN) AS IslandId
FROM
(
SELECT
ROW_NUMBER() OVER(ORDER BY StartDate,EndDate) AS RN,
StartDate,
EndDate,
LAG(EndDate,1) OVER (ORDER BY StartDate, EndDate) AS PreviousEndDate
FROM
#OverlappingDateRanges
) Groups``````

The IslandId field is just a `SUM()` of the IslandStartInd, similar to a window function running total.

Our final step is to aggregate our rows using an old fashioned `GROUP BY` to return the minimum and maximum start and end dates respectively from each of our islands:

``````SELECT
MIN(StartDate) AS IslandStartDate,
MAX(EndDate) AS IslandEndDate
FROM
(
SELECT
*,
CASE WHEN Groups.PreviousEndDate >= StartDate THEN 0 ELSE 1 END AS IslandStartInd,
SUM(CASE WHEN Groups.PreviousEndDate >= StartDate THEN 0 ELSE 1 END) OVER (ORDER BY Groups.RN) AS IslandId
FROM
(
SELECT
ROW_NUMBER() OVER(ORDER BY StartDate,EndDate) AS RN,
StartDate,
EndDate,
LAG(EndDate,1) OVER (ORDER BY StartDate, EndDate) AS PreviousEndDate
FROM
#OverlappingDateRanges
) Groups
) Islands
GROUP BY
IslandId
ORDER BY
IslandStartDate``````

## Voilà

Regardless of how messy the date ranges within an island are, this technique neatly identifies gaps in the data and returns the start and end of each island’s date range. I often find myself using this technique in scenarios where I need to aggregate some date-based transactional data that otherwise would be too difficult to summarize with aggregate functions alone.

## 23 thoughts on “Gaps and Islands Across Date Ranges”

1. Tom Wickerath says:

Hi Burt,
Shouldn’t your image that shows “The data as three color-coded islands.” have rows 6 and 7 in the same color group? That would be consistent with your image that shows the IslandStartInd and IslandID.

2. Tom Wickerath says:

Hi Bert,
Shouldn’t your image that shows “The data as three color-coded islands.” have rows 6 and 7 in the same color group? That would be consistent with your image that shows the IslandStartInd and IslandID.

1. Bert says:

Yes, thank you Tom. All fixed now.

3. Jeff Moden says:

Totally awesome in simplicity, Bert. Great job. Pretty easy to understand, as well. I “flipped the code” upside down from what you have it using cCTEs (Cascading CTEs) only because I find it easier to read top down. I also made it “groupable”. Once I get done testing for both functionality and performance, I’ll post it here.

On that note, is there anything special that needs to be done to maintain leading space for indentation on this blog site?

Again, really nice code. Thanks for posting it and the really nice article with great explanations.

1. Bert says:

Thanks Jeff. Look forward to seeing your version.

And I’m not sure what you mean by maintaining leading intention- in IE, Chrome, and Firefox the queries seem to maintain their indentation both on the page and when copied. Are you seeing something different?

1. Jeff Moden says:

No… having tried posting code on this site before. A lot of sites take nicely formatted/indented code and remove some or all of the leading spaces on code lines. Some require flaming hoops to jump through to keep that from happening and just wanted to know if this site has any similar problems.

Based on your comment above, the answer is that I won’t need to do anything special to code prior to posting it here.

1. Jeff Moden says:

Heh… now if there was something to help me type better on “butterfly” keyboards. I meant “haven’t tried posting code on this site before”.

1. Mark says:

Jeff

Working on a version of this where I want to create the windows within a defined group, is that what you mean by creating a groupable version? If so, would be very helpful if you could post your version.

Thanks.

4. Simon says:

Sorry to be the bearer of bad news…

This is pretty much precisely the solution that I came up with for this issue a while ago (though mine’s doing it within partitions).

Unfortunately when explaining it to somebody else yesterday, I found out that it doesn’t actually work as well as I thought. I think we both fell into the same trap of carefully constructing the minimum test data to trigger the cases we were trying to anticipate, but skipped over the fact that your EndDate column will not necessarily be ordered across the entire set of rows once you’ve sorted by StartDate.

This means that if you have a date range that falls entirely within the date range of the previous row *after you’ve sorted by StartDate and EndDate*, it’ll cause the following date range to be misidentified as the start of an island.

Chuck this extra row into your temp table to see what I mean:

INSERT #OverlappingDateRanges VALUES (‘20170825’, ‘20170826’);

The net result is that you get overlapping ranges returned in your final output 🙁

I’m working on a fix that doesn’t involve a litany of CTEs and subqueries, though I think it may have to involve one more level as it’ll probably be based around identifying the *running maximum* of the previous end dates, rather than just the value from the previous row.

1. Simon says:

Got it — a lot quicker than I thought!

MAX(EndDate) OVER (ORDER BY StartDate, EndDate ROWS BETWEEN UNBOUNDED PRECEDING AND 1 PRECEDING)

and you’re golden (Maybe rename it MaxPreviousEndDate or something too).

5. Simon says:

Here’s a CTE version of the query too, based on my original. Like I say, mine’s operating on multiple windows with a partition key that’s two joins away from the date columns, so I’ve rejigged it to use your tables as the principle’s the same. I was originally flagging the island start rows in the first CTE, then doing a running total of it in the second CTE, but I think I prefer how you’ve got the CASE statement in the “middle” query as it makes the date comparison more obvious, so I moved it in mine as well:

WITH base
AS (SELECT StartDate,
EndDate,
MAX(EndDate) OVER
(ORDER BY StartDate, EndDate
ROWS BETWEEN UNBOUNDED PRECEDING AND 1 PRECEDING) AS RunningMaxEndDate
FROM #OverlappingDateRanges),
islands
AS (SELECT StartDate,
EndDate,
SUM(IIF(RunningMaxEndDate >= DATEADD(DAY, -1, StartDate), 0, 1)) OVER
(ORDER BY StartDate, EndDate
ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) AS IslandNumber
FROM base)
SELECT MIN(StartDate) AS FromDate,
MAX(EndDate) AS ToDate
FROM islands
GROUP BY IslandNumber;

Note the ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW when identifying the island number — at scale, this makes a significant difference in CPU and query time, whether or not you’re using PARTITION BY for your windowing. It’ll default to RANGE unless you tell it not to, which leads to 534x the reads, 5308x the writes and about 1.8x the execution time on a quick test set with a supporting clustered index matching the ORDER BY clauses for the window functions. It’s a thing that I keep forgetting about myself but learned from one of the MS training kit books a few years ago.

1. Simon says:

okay that code was a lot nicer before WordPress got hold of it

6. Simon says:

Two more things before I leave you alone — just noticed that you need the >= comparison to be against DATEADD(DAY, -1, StartDate) rather than just StartDate if you want to merge *consecutive* islands together like I was doing.

Also I tested the subquery way (incorporating the ROWS… framing declaration) against the CTE way, expecting them to generate the same query plan, and there was actually a difference in the plans. This is down to the ROW_NUMBER() in your innermost query being used as the ordering key in the middle query — if you change “ORDER BY RN” to “ORDER BY StartDate, EndDate” the plans become equivalent; you incur an extra Sort operator using RN because the optimiser can’t tell that they’re functionally equivalent to each other. This is less significant than the ROWS… thing; neither are going to make a practical difference with small numbers of rows but start to show up at a hundred thousand rows and more.

1. Bert says:

Thanks for catching the problem Simon. That’s the trouble with taking real-world code and then turning it into a simplified demo; inevitably sometimes edge cases appear in *other* real-world code.

Thanks for the detailed write up though – hopefully it may benefit someone else with a similar issue.

2. SJBell says:

Thank you both for this solution – I’ve been working on this for longer than I will admit and could not resolve this last piece – so grateful!!

7. SaiKarthik says:

Hi Bert, Amazing regarding Gaps and Islands Concept.
I had tried many ways to solve some particular random data and failing at one or the other.
I tried posting it on Slack also.,
Here is the Link : https://dba.stackexchange.com/q/235191/100649?sem=2

If u can help me out, please refer the above link., i had kept my analysis in a good format.

8. Cody says:

Thanks for the great post, this helped me out not only for overlapping islands, but for custom gaps. I was used to using a COUNTIF window function in BigQuery, but when trying the same for custom gaps in SQL Server, I was left with a different windowing toolset, for which SUM(CASE…) worked perfectly – this is how I did it:

,SUM(CASE WHEN datediff(hh,lastProcessed,sentDate) < 24 or lastProcessed is null THEN 0 ELSE 1 END) OVER (partition by userId ORDER BY RN) AS IslandId

9. GSM says:

Bert,
I am trying to find island and gaps for datetimes in a given day. I successfully ran your code using datetimes. I would now like to expand the query to include a larger dataset.

Original Table #OverlappingDateRanges (StartDate date, EndDate date)
New Table #OverlappingDateRanges (MyDate date, StaffID int, StartDateTime datetime, EndDateTime datetime)

I have been unsuccessful modifying the following SQL to provide IslandStartDate and IslandEndDate for each MyDate, StaffID listed in the New Table.
SELECT
MIN(StartDate) AS IslandStartDate,
MAX(EndDate) AS IslandEndDate
FROM
(
SELECT
*,
CASE WHEN Groups.PreviousEndDate >= StartDate THEN 0 ELSE 1 END AS IslandStartInd,
SUM(CASE WHEN Groups.PreviousEndDate >= StartDate THEN 0 ELSE 1 END) OVER (ORDER BY Groups.RN) AS IslandId
FROM
(
SELECT
ROW_NUMBER() OVER(ORDER BY StartDate,EndDate) AS RN,
StartDate,
EndDate,
LAG(EndDate,1) OVER (ORDER BY StartDate, EndDate) AS PreviousEndDate
FROM
#OverlappingDateRanges
) Groups
) Islands
GROUP BY
IslandId
ORDER BY
IslandStartDate

Any help would be appreciated!

10. Thanks for this I was able to apply it to a current project. Save me a lot of time.

11. Stephen Lauzon says:

Hey, Bert, this was super-helpful. I adapted it to find students who were enrolled in programs (for a range of dates) where their program start or program end dates were outside of the ranges where they were enrolled in school.

This site uses Akismet to reduce spam. Learn how your comment data is processed.