Gaps and Islands Across Date Ranges

Watch this week's video on YouTube

In a traditional gaps and islands problem, the goal is to identify groups of continuous data sequences (islands) and groups of data where the sequence is missing (gaps).

While many people encounter gaps and islands problems when dealing with ranges of dates, and recently I did too but with an interesting twist:

How do you determine gaps and islands of data that has overlapping date ranges?

Overlapping Date Ranges

First let's start with some sample data to help follow along. The peculiarity to pay attention to is that the date ranges for each row might be completely discrete, completely enclosed, or they may overlap each other on either end:

DROP TABLE IF EXISTS #OverlappingDateRanges;
CREATE TABLE #OverlappingDateRanges (StartDate date, EndDate date);

INSERT INTO #OverlappingDateRanges
SELECT '8/24/2017', '9/23/2017'  UNION ALL
SELECT '8/24/2017', '9/20/2017'  UNION ALL 
SELECT '9/23/2017', '9/27/2017'  UNION ALL 
SELECT '9/25/2017', '10/10/2017' UNION ALL
SELECT '10/17/2017','10/18/2017' UNION ALL 
SELECT '10/25/2017','11/3/2017'  UNION ALL 
SELECT '11/3/2017', '11/15/2017'

SELECT * FROM #OverlappingDateRanges;

2019-03-13_20-13-13

What's unusual about this data is that while the end date of some rows matches the start date of other rows (eg. row 6 and 7), the date ranges of some rows other rows are either fully contained within other rows (eg. row 2 is contained in row 1) while other rows only overlap one boundary (eg. row 4's EndDate doesn't overlap with any other rows, but its StartDate is before row 3's EndDate).

Solution

While there are several ways gaps and islands problems can be solved, here is the solution using window functions that made the most sense to me.

First, we need to create a row number column based on the sequence of start and end dates, as well as bring the previous row's EndDate to the current row:

SELECT
    ROW_NUMBER() OVER(ORDER BY StartDate,EndDate) AS RN,
    StartDate,
    EndDate,
    LAG(EndDate,1) OVER (ORDER BY StartDate, EndDate) AS PreviousEndDate
FROM
    #OverlappingDateRanges

2019-03-06-20-52-58

Next we add two more fields:

  • IslandStartInd: indicates when a new island begins by looking if the current row's StartDate occurs after the previous row's EndDate. We don't really need this column for the example, but I find it helpful to see what's going on in the next column.
  • IslandId: indicates which island number the current row belongs to.
SELECT
    *,
    CASE WHEN Groups.PreviousEndDate >= StartDate THEN 0 ELSE 1 END AS IslandStartInd,
    SUM(CASE WHEN Groups.PreviousEndDate >= StartDate THEN 0 ELSE 1 END) OVER (ORDER BY Groups.RN) AS IslandId
FROM
(
    SELECT
        ROW_NUMBER() OVER(ORDER BY StartDate,EndDate) AS RN,
        StartDate,
        EndDate,
        LAG(EndDate,1) OVER (ORDER BY StartDate, EndDate) AS PreviousEndDate
    FROM
        #OverlappingDateRanges
) Groups

2019-03-06-21-01-51

The IslandId field is just a SUM() of the IslandStartInd, similar to a window function running total.

Our final step is to aggregate our rows using an old fashioned GROUP BY to return the minimum and maximum start and end dates respectively from each of our islands:

SELECT
    MIN(StartDate) AS IslandStartDate,
    MAX(EndDate) AS IslandEndDate
FROM
    (
    SELECT
        *,
        CASE WHEN Groups.PreviousEndDate >= StartDate THEN 0 ELSE 1 END AS IslandStartInd,
        SUM(CASE WHEN Groups.PreviousEndDate >= StartDate THEN 0 ELSE 1 END) OVER (ORDER BY Groups.RN) AS IslandId
    FROM
    (
        SELECT
            ROW_NUMBER() OVER(ORDER BY StartDate,EndDate) AS RN,
            StartDate,
            EndDate,
            LAG(EndDate,1) OVER (ORDER BY StartDate, EndDate) AS PreviousEndDate
        FROM
            #OverlappingDateRanges
    ) Groups
) Islands
GROUP BY
    IslandId
ORDER BY 
    IslandStartDate

2019-03-06-21-03-35

Voilà

Regardless of how messy the date ranges within an island are, this technique neatly identifies gaps in the data and returns the start and end of each island's date range. I often find myself using this technique in scenarios where I need to aggregate some date-based transactional data that otherwise would be too difficult to summarize with aggregate functions alone.

Ignoring NULLs with FIRST_VALUE

Watch this week's video on YouTube

The SQL Server FIRST_VALUE function makes it easy to return the "first value in an ordered set of values."

The problem is that if that first value happens to be a NULL, there is no easy, built-in way to skip it.

While a UserVoice item exists to add the ability to ignore nulls (go vote!), today, we're going accomplish that end result with some alternative queries.

The Setup

Here's the example data we'll be skipping nulls on:

CREATE TABLE ##Data
(
       Id int IDENTITY(0,1),
       GroupId int,
       Value1 int
);
GO
INSERT INTO ##Data VALUES (1,1)
INSERT INTO ##Data VALUES (1,1)
INSERT INTO ##Data VALUES (1,3)
INSERT INTO ##Data VALUES (2,NULL)
INSERT INTO ##Data VALUES (2,NULL)
INSERT INTO ##Data VALUES (2,6)
INSERT INTO ##Data VALUES (2,4)
INSERT INTO ##Data VALUES (2,5);
GO

colall

We've got a an integer identity column, two groups of rows, and NULLs that are sprinkled into otherwise unsuspecting integer values.

If we write a query that uses the FIRST_VALUE function, you'll notice that our NULL gets chosen in group two - not quite what we want:

SELECT
       Id,
       GroupId,
       Value1,
       FIRST_VALUE(Value1) OVER (PARTITION BY GroupId ORDER BY Id) AS FirstValue1
FROM
       ##Data

2018-08-26_08-38-39

Let's look at two queries that will help us get the number 6 into that FirstValue1 column for the second group.

The Contenders

"The Derived FIRST_VALUE"

First up is still the FIRST_VALUE function, but inside of a derived table:

SELECT
    d.Id,
    d.GroupId,
    d.Value1,
    d2.FirstNotNullValue1
FROM
    ##Data d
    INNER JOIN
    (
    SELECT DISTINCT
        GroupId,
        FIRST_VALUE(Value1) OVER (PARTITION BY GroupId ORDER BY Id) as FirstNotNullValue1
    FROM ##Data
    WHERE Value1 IS NOT NULL
    ) d2
        ON d.GroupId = d2.GroupId

By filtering out NULLs in our derived table query, FIRST_VALUE returns the first non-null value like we want.  We then join that back to the original data and all is right again.

2018-08-26_08-42-00

"The Triple Join"

Our second attempt at this query sends us back to the dark ages of SQL Server 2008 before the FIRST_VALUE function existed:

SELECT
    d.Id,
    d.GroupId,
    d.Value1,
    d2.Value1 AS FirstNotNullValue1
FROM
    ##Data d
    LEFT JOIN
    (
    SELECT
        GroupId,
        MIN(Id) AS FirstNotNullIdValue1
    FROM
        ##Data
    WHERE
        Value1 IS NOT NULL
    GROUP BY
        GroupId
    ) m
        ON d.GroupId = m.GroupId
    INNER JOIN ##Data d2
        ON m.FirstNotNullIdValue1 = d2.Id;

We perform a triple join, with the critical element being our derived table which gets the MIN Id for each group of rows where Value1 IS NOT NULL.  Once we have the minimum Id for each group, we join back in the original data and produce the same final result:

2018-08-26_08-46-00

The Performance

Both of the above queries produce the same output - which one should you use in your production code?

Well, the "Derived FIRST_VALUE" query has a lower relative cost than the "Triple Join" query, maybe it's better?

2018-08-26_08-48-22

This isn't a real-world execution plan though - surely we never scan heaps our production environments.

Let's add a quick clustered index and see if that changes anything:

CREATE CLUSTERED INDEX CL_Id ON ##Data (GroupId,Id,Value1)

2018-08-26_09-07-57

Okay, a closer match up but the "Derived FIRST_VALUE" query still appears to have a slight edge.

If we SET STATISTICS IO ON though we start to see a different story:

2018-08-26_09-14-05

With only 8 rows of data, our "Derived FIRST_VALUE" query sure is performing a lot of reads.

What if we increase the size of our sample dataset?

SET STATISTICS IO, TIME OFF;
SET NOCOUNT ON;
GO
INSERT INTO ##Data (GroupId, Value1)  
SELECT GroupId, Value1 FROM ##Data
GO 10

And now check our plans and stats IO:

2018-08-26_09-17-33

2018-08-26_09-17-50

WOW that's a lot of reads in the "Derived FIRST_VALUE" query.

Conclusion

Besides sharing some solutions, the point I tried to make above is that DON'T TRUST CODE YOU FIND ON THE INTERNET (or in books, or copied from colleagues, etc...)

Both of the above queries will return the first value without NULLs.  But they probably won't perform exactly the same as they did on my examples above.

Copy the above code for sure - but test it out. See what works better on your specific server configuration, data size, and indexes.  Maybe both queries are terrible and you need a third, better way of doing it (if you write one, let me know!) - but please, please, please, always test your code.

Here's a Quick Way To Generate a Running Total in SQL Server

da24c-105adqqwalb9gszttpjw23q

Watch this week's video on YouTube

Historically it's been difficult to accomplish certain tasks in SQL Server.

Probably the most annoying problem I had to do regularly before SQL Server 2012 was to generate a running total. How can a running total be so easy to do in Excel, but difficult to do in SQL?

SUM(), click, drag, done. Excel, you will always have a place in my heart.

Before SQL Server 2012, the solution to generating a running total involved cursors, CTEs, nested subqueries, or cross applies. This StackOverflow thread has a variety of solutions if you need to solve this problem in an older version of SQL Server.

However, SQL Server 2012's introduction of window functions makes creating a running total incredibly easy.

First, some test data:

CREATE TABLE dbo.Purchases
(
  CustomerID int,
  TransactionDate date,
  Price int
)
INSERT INTO dbo.Purchases VALUES (1,'2017-06-01',5)
INSERT INTO dbo.Purchases VALUES (1,'2017-06-15',8)
INSERT INTO dbo.Purchases VALUES (1,'2017-06-18',3)
INSERT INTO dbo.Purchases VALUES (1,'2017-06-30',6)
INSERT INTO dbo.Purchases VALUES (2,'2017-05-04',5)
INSERT INTO dbo.Purchases VALUES (2,'2017-06-04',5)
INSERT INTO dbo.Purchases VALUES (2,'2017-07-04',1)
INSERT INTO dbo.Purchases VALUES (3,'2017-05-01',2)
INSERT INTO dbo.Purchases VALUES (3,'2017-05-02',8)
INSERT INTO dbo.Purchases VALUES (3,'2017-05-03',9)
INSERT INTO dbo.Purchases VALUES (3,'2017-05-04',5)
INSERT INTO dbo.Purchases VALUES (3,'2017-05-05',4)
INSERT INTO dbo.Purchases VALUES (3,'2017-05-06',2)

6a0d6-1wnrnamwua7g-rnvwtlgyzw

Next, we write our query using the following window function OVER() syntax:

SELECT
  CustomerID,
  TransactionDate,
  Price,
  SUM(Price) OVER (PARTITION BY CustomerID ORDER BY TransactionDate) AS RunningTotal
FROM
  dbo.Purchases

The syntax for our OVER() clause is simple:

  • SUM(Price) specifies which column we are creating the running total on
  • PARTITION BY specifies around what group of data we want to create our "window" — each new window will reset the running total
  • ORDER BY specifies in what order the rows should be sorted before being summed

The results? An easy to write running total:

52df1-1_ulawysmk6gmrjzvuaijgq