Data with Bert - functions

SQL Server has several ways to store queries for later executions.

This makes developers happy because it allows them to follow DRY principles: Don't Repeat Yourself. The more code you have, the more difficult it is to maintain. Centralizing frequently used code into stored procedures, functions, etc... is attractive.

While following the DRY pattern is beneficial in many programming languages, it can often cause poor performance in SQL Server.

Today's post will try to explain all of the different code organization features available in SQL Server and when to best use them (thank you to dovh49 on YouTube for recommending this week's topic and reminding me how confusing all of these different features can be when first learning to use them).

Scalar Functions

CREATE OR ALTER FUNCTION dbo.GetUserDisplayName
(
    @UserId int
)
RETURNS nvarchar(40)
AS
BEGIN
    DECLARE @DisplayName nvarchar(40);
    SELECT @DisplayName = DisplayName FROM dbo.Users WHERE Id = @UserId

    RETURN @DisplayName
END

SELECT TOP 10000 Title, dbo.GetUserDisplayName(OwnerUserId) FROM dbo.Posts

Scalar functions run statements that return a single value.

You'll often read about SQL functions being evil, and scalar functions are a big reason for this reputation. If your scalar function executes a query within it to return a single value, that means every row that calls that function runs this query. That's not good if you have to run a query once for every row in a million row table.

SQL Server 2019 can inline a lot of these, providing better performance in most cases. However, you can already do this yourself today by taking your scalar function and including it in your calling query as a subquery. The only downside is that you'll be repeating that same logic in every calling query that needs it.

Additionally, using a scalar function on the column side of a predicate will prevent SQL Server from being able to seek to data in any of its indexes; talk about performance killing.

For scalar functions that don't execute a query, you can always use WITH SCHEMABINDING to gain a performance boost.

Inline Table Valued Functions

CREATE OR ALTER FUNCTION dbo.SplitTags
(   
    @PostId int
)
RETURNS TABLE 
AS
RETURN 
(
    SELECT REPLACE(t.value,'>','') AS Tags 
    FROM dbo.Posts p 
    CROSS APPLY STRING_SPLIT(p.Tags,'<') t 
    WHERE Id = @PostId AND t.value <> ''
)
GO

SELECT * FROM dbo.SplitTags(4)

Inline table-valued functions allow a function to return a table result set instead of just a single value. They essentially are a way for you to reuse a derived table query (you know, when you nest a child query in your main query's FROM or WHERE clause).

These are usually considered "good" SQL Server functions - their performance is decent because SQL Server can get relatively accurate estimates on the data that they will return, as long as the statistics on that underlying data are accurate. Generally this allows for efficient execution plans to be created. As a bonus, they allow parameters so if you find yourself reusing a subquery over and over again, an inline table-valued function (with or without a parameter) is actually a nice feature.

Multi-Statement Table-Valued Functions

CREATE OR ALTER FUNCTION dbo.GetQuestionWithAnswers
(
    @PostId int
)
RETURNS 
@results TABLE 
(
    PostId bigint,
    Body nvarchar(max),
    CreationDate datetime
)
AS
BEGIN
    -- Returns the original question along with all of its answers in one result set
    -- Would be better to do this with something like a union all or a secondary join. 
    -- But this is an MSTVF demo, so I'm doing it with multiple statements.

    -- Statement 1
    INSERT INTO @results (PostId,Body,CreationDate)
    SELECT Id,Body,CreationDate 
    FROM dbo.Posts
    WHERE Id = @PostId;

    -- Statement 2
    INSERT INTO @results (PostId,Body,CreationDate)
    SELECT Id,Body,CreationDate 
    FROM dbo.Posts
    WHERE ParentId = @PostId;

    RETURN
END

SELECT * FROM dbo.GetQuestionWithAnswers(4)

Multi-statement table-valued functions at first glance look and feel just like their inline table-value function cousins: they both accept parameter inputs and return results back into a query. The major difference is that they allow multiple statements to be executed before the results are returned in a table variable:

This is a great idea in theory - who wouldn't want to encapsulate multiple operational steps into a single function to make their querying logical easier?

However, the major downside is that prior to SQL Server 2017, SQL Server knows nothing about what's happening inside of a mutli-statement table-valued function in the calling query. This means all of your estimates for MSTVFs will be 100 rows (1 if you are on a version prior to 2014, slightly more accurate if you are on versions 2017 and above). This means that execution plans generated for queries that call MSTVFs will often be...less than ideal. Because of this, MSTVFs help add to the "evil" reputation of SQL functions.

Stored Procedures

CREATE OR ALTER PROCEDURE dbo.InsertQuestionsAndAnswers
    @PostId int
AS
BEGIN
    SET NOCOUNT ON;

    INSERT INTO dbo.Questions (Id)
    SELECT Id
    FROM dbo.Posts
    WHERE Id = @PostId;

    INSERT INTO dbo.Answers (Id, PostId)
    SELECT Id, ParentId
    FROM dbo.Posts
    WHERE ParentId = @PostId;
END

EXEC dbo.InsertQuestionsAndAnswers @PostId = 4

Stored procedures encapsulate SQL query statements for easy execution. They return result sets, but those result sets can't be easily used within another query.

This works great when you want to define single or multi-step processes in a single object for easier calling later.

Stored procedures also have the added benefit of being able to have more flexible security rules placed on them, allowing users to access data in specific ways where they don't necessarily have access to the underlying sources.

Views

CREATE OR ALTER VIEW dbo.QuestionsWithUsers
WITH SCHEMABINDING
AS
SELECT
    p.Id AS PostId,
    u.Id AS UserId,
    u.DisplayName
FROM  
    dbo.Posts p
    INNER JOIN dbo.Users u
        ON p.OwnerUserId = u.Id
WHERE
    p.PostTypeId = 1;
GO

CREATE UNIQUE CLUSTERED INDEX CL_PostId ON dbo.QuestionsWithUsers (PostId);

SELECT * FROM dbo.QuestionsAndAnswersView;

Views are similar to inline table valued function - they allow you centralize a query in an object that can be easily called from other queries. The results of the view can be used as part of that calling query, however parameters can't be passed in to the view.

Views also have some of the security benefits of a stored procedure; they can be granted access to a view with a limited subset of data from an underlying table that those same users don't have access to.

Views also have some performance advantages since they can have indexes added to them, essentially materializing the result set in advance of the view being called (creating faster performance). If considering between an inlined table function and a view, if you don't need to parameterize the input, a view is usually the better option.

Natively Compiled Stored Procedures and Scalar Functions

CREATE TABLE dbo.QuestionsStaging (Id int PRIMARY KEY NONCLUSTERED) WITH ( MEMORY_OPTIMIZED = ON , DURABILITY = SCHEMA_ONLY );

CREATE TABLE dbo.AnswersStaging (Id int PRIMARY KEY NONCLUSTERED, PostId int) WITH ( MEMORY_OPTIMIZED = ON , DURABILITY = SCHEMA_ONLY );
GO

CREATE PROCEDURE dbo.InsertQuestionsAndAnswersCompiled
    @PostId int
WITH NATIVE_COMPILATION, SCHEMABINDING
AS BEGIN ATOMIC WITH
(
    TRANSACTION ISOLATION LEVEL = SNAPSHOT, LANGUAGE = N'us_english'
)
    INSERT INTO dbo.Questions (Id)
    SELECT Id
    FROM dbo.Posts
    WHERE Id = @PostId;

    INSERT INTO dbo.Answers (Id, PostId)
    SELECT Id, ParentId
    FROM dbo.Posts
    WHERE ParentId = @PostId;
END

These are same as the stored procedures and scalar functions mentioned above, except they are pre-compiled for use with in-memory tables in SQL Server.

This means instead of SQL Server interpreting the SQL query every time a procedure or scalar function has to run, it created the compiled version ahead of time reducing the startup overhead of executing one of these objects. This is a great performance benefit, however they have several limitations. If you are able to use them, you should, just be aware of what they can and can't do.

Conclusion

While writing this post I thought about when I was first learning all of these objects for storing SQL queries. Knowing the differences between all of the options available (or what those options even are!) can be confusing. I hope this post helps ease some of this confusion and helps you choose the right objects for storing your queries.

Watch this week's video on YouTube

Over the past several week's I've been exploring ways to rewrite queries to improve execution performance.

I learned a lot of these techniques over time from trial an error, attending presentations, reading blog posts, speaking to other dbas and developers, etc... but never knew of a good resource that summarized these techniques in one place.

This post will be a quick round-up of everything I've covered so far, as well as 8 additional techniques that I use occasionally but don't necessarily require a full detailed post to explain them.

Why Rewrite Queries?

I often find myself working in environments where modifying indexes or changing server settings is out of the question when performance tuning. I usually run into these scenarios when dealing with:

Vendor databases
"Fragile" systems
Not enough disk space
Limited tooling/ad hoc analysis
Features limited by security software

While solving the root cause of a performance problem is always preferable, sometimes the only way I'm able to fix problems in these environments is by rewriting the queries.

I decided to write this summary post because it is a resource I would have loved to have when starting out. Sometimes it can be easy to get "writer's block" when trying to think of ways to rewrite a SQL query, so hopefully this list of techniques can provide ideas and get your creative juices flowing.

So, without further ado, here is a list of 12 techniques in no particular order that you can use to rewrite your queries to change their performance.

12 Ways to Refactor a Query to Change Performance

1. Window functions vs GROUP BY

Sometimes window functions rely a little too much on tempdb and blocking operators to accomplish what you ask of them. While using them is always my first choice because of their simple syntax, if they perform poorly you can usually rewrite them as an old-fashioned GROUP BY to achieve better performance.

2. Correlated subqueries vs derived tables

Many people like using correlated subqueries because the logic is often easy to understand, however switching to derived table queries often produces better performance due to their set-based nature.

3. IN vs UNION ALL

When filtering rows of data on multiple values in tables with skewed distributions and non-covering indexes, writing your logic into multiple statements joined with UNION ALLs can sometimes generate more efficient execution plans than just using IN or ORs.

4. Temporary Staging Tables

Sometimes the query optimizer struggles to generate an efficient execution plan for complex queries. Breaking a complex query into multiple steps that utilize temporary staging tables can provide SQL Server with more information about your data. They also cause you to write simpler queries which can cause the optimizer to generate more efficient execution plans as well as allow it to reuse result sets more easily.

5. Forcing Table Join Orders

Sometimes outdated statistics and other insufficient information can cause the SQL Server query optimizer to join tables in a less than ideal sequence. Adam Machanic has a fantastic presentation on forcing table join order with blocking operators without having to resort to join hints.

6. DISTINCT with few unique values

Using the DISTINCT operator is not always the fastest way to return the unique values in a dataset. In particular, Paul White uses recursive CTEs to return distinct values on large datasets with relatively few unique values. This is a great example of solving a problem using a very creative solution.

7. Eliminate UDFs

UDFs often cause poor query performance due to forcing serial plans and causing inaccurate estimates. One way to possibly improve the performance of queries that call UDFs is to try and inline the UDF logic directly into the main query. With SQL Server 2019 this will be something that happens automatically in a lot of cases, but as Brent Ozar points out you might occasionally have to manually inline a UDF's functionality to get the best performance.

8. Create UDFs

Sometimes a poorly configured server will parallelize queries too frequently and cause poorer performance than their serially equivalent plan. In those cases, putting the troublesome query logic into a scalar or multi-statement table-valued function might improve performance since they will force that part of the plan to run serially. Definitely not a best practice, but it is one way to force serial plans when you can't change the cost threshold for parallelism.

9. Data Compression

Not only does data compression save space, but on certain workloads it can actually improve performance. Since compressed data can be stored in fewer pages, read disk speeds are improved, but maybe more importantly the compressed data allows more to be stored in SQL Server's buffer pool, increasing the potential for SQL Server to reuse data already in memory.

10. Indexed Views

When you can't add new indexes to existing tables, you might be able to get away with creating a view on those tables and indexing the view instead. This works great for vendor databases where you can't touch any of the existing objects.

11. Switch cardinality estimators

The newer cardinality estimator introduced in SQL Server 2014 improves the performance of many queries. However, in some specific cases it can make queries perform more slowly. In those cases, a simple query hint is all you need to force SQL Server to change back to the legacy cardinality estimator.

12. Copy the data

If you can't get better performance by rewriting a query, you can always copy the data you need to a new table in a location where you CAN create indexes and do whatever other helpful transformations you need to do ahead of time.

...And more

By no means is this list exhaustive. There are so many ways to rewrite queries, and not all of them will work all the time.

The key is to think about what the query optimizer knows about your data and why it's choosing the plan it is. Once you understand what it's doing, you can start getting creative with various query rewrites that address that issue.

SQL Server Stored Procedures vs Functions vs Views