More "Wrong" SQL Server Math - Floating Point Errors

Published Tue 01 October 2019 in SQL > Development

Last week we looked at how implicit conversions and datatype precedence can cause SQL Server to output unexpected results (if you aren't aware of how it handles these features).

This week I want to share another example of when SQL Server's output may surprise you: floating point errors.

Charts don't add up to 100%

barchart

Years ago I was writing a query for a stacked bar chart in SSRS. The chart intended to show the percentage breakdown of distinct values in a table. For example, the chart would show that value A made up 30% of the rows, B made up 3%, C made up 12% and so on. Since every row had a value, I was expecting the stacked bar chart percentages to add up to 100%

However, in many instances the charts would come up short; instead of a full 100%, the percentages would only add up to 98% or 99%. What was going on?

Floats

To see an example of how this happens, let's look at the following query:

SELECT CASE WHEN CAST(.1 AS FLOAT)+CAST(.2 AS FLOAT) = CAST(.3 AS FLOAT) THEN 1 ELSE 0 END

If you've never encountered this error before, you'd expect the query to return a result of 1. However, it doesn't:

0result

The reason this happens is that anytime you use the float datatype, SQL Server is trading off numeric precision for space savings - in actuality, that .1 in the query above is really stored as 0.10000000000000000555111512312578270212 and the .2 is stored as 0.20000000000000001110223024625156540424. Summed together, we get 0.30000000000000001665334536937734810636, not .3.

And to be clear, this isn't a problem with SQL Server - any language that implements the IEEE standard for float data types experiences these same issues.

Float Approximation

In SQL Server, the int datatype can store every whole number from -2,147,483,648 to 2,147,483,647 in only 4 bytes of space.

A single precision float, using the same 4 bytes of data, can store almost any value between between -340,000,000,000,000,000,000,000,000,000,000,000,000 and 340,000,000,000,000,000,000,000,000,000,000,000,000.

The reason floats can store such a large range is because they are only storing approximate values; some compression happens in those 4 bytes that allows SQL Server to store a wider range of data, but the increased range of values comes at the cost of losing some accuracy.

Close Enough For Horse Shoes, Hand Grenades, and Floats

In short, a float works by storing its value as a percentage within a range. As an oversimplified example, imagine representing the number 17 as 17% of the range from 1 to 100. All you need to store is the range 1,100 and the number 17%. This doesn't give us much efficiency for small numbers, but this allows us to store much larger numbers in exactly the same way. For example, to store the number 3 billion, you could make your range 1 billion (10\^9), and 10 billion (10\^10), and the percentage of 30%.

And while that oversimplified example uses base-10 to make it easy for my brain to think about, computers like doing calculations in base-2. And the math is a little bit more involved.

The Wikipedia page on floating point encoding is really good, but it uses a bunch of math notation that I haven't seen since high school. Let's reimagine that example with language we should be a little bit more familiar with: T-SQL.

A 4 byte number is made up of 32 bits. The floating point encoding breaks down these bits into 3 sections:

floating-point

The first bit in blue is for the sign. This just indicates whether we will be left or right of 0 on the number line.

The next 8 bits in green indicate our exponent. This tells us which range of numbers we are in. Since we are using binary, the range is stored as a power of 2. And we only need to store the start of the range, since the end of the range would be the next power of 2.

Finally the last 23 bits in red encode the fractional location of our value within the range. Calculating our actual value then is simple as:

equation

Oh yeah, I promised to do the math in T-SQL. Let's try that again.

First, we declare some variables to store the 3 encoded parts of our floating point number:

DECLARE 
    @sign int,
    @exponent int,
    @fraction decimal(38,38);

You'll notice I am storing @fraction as a decimal and not float. This is some foreshadowing about how decimal is a precise datatype that I'll come back to in a little bit.

Next we store the sign. Since we are encoding the value .15625, the sign is positive, so we set our @sign bit to 0:

SELECT @sign = 0;

Great. Now let's calculate the value of our exponent. If you've never converted binary to decimal before, you basically raise each 1 or 0 to the power of its position, so:

-- Returns 124 
SELECT @exponent =
    (0*POWER(2,7))
    +(1*POWER(2,6))
    +(1*POWER(2,5))
    +(1*POWER(2,4))
    +(1*POWER(2,3))
    +(1*POWER(2,2))
    +(0*POWER(2,1))
    +(0*POWER(2,0));

Next up is converting the last 23 bits to decimal. In this case, the encoding standard specifies these are to be calculated as (1/2\^n) instead of the regular 2\^n, because we want a fraction:

-- Returns .25
SELECT @fraction = 
    (0*(1.0/POWER(2,1)))
    +(1*(1.0/POWER(2,2)))
    +(0*(1.0/POWER(2,3)))
    +(0*(1.0/POWER(2,4)))
    +(0*(1.0/POWER(2,5)))
    +(0*(1.0/POWER(2,6)))
    +(0*(1.0/POWER(2,7)))
    +(0*(1.0/POWER(2,8)))
    +(0*(1.0/POWER(2,9)))
    +(0*(1.0/POWER(2,10)))
    +(0*(1.0/POWER(2,11)))
    +(0*(1.0/POWER(2,12)))
    +(0*(1.0/POWER(2,13)))
    +(0*(1.0/POWER(2,14)))
    +(0*(1.0/POWER(2,15)))
    +(0*(1.0/POWER(2,16)))
    +(0*(1.0/POWER(2,17)))
    +(0*(1.0/POWER(2,18)))
    +(0*(1.0/POWER(2,19)))
    +(0*(1.0/POWER(2,20)))
    +(0*(1.0/POWER(2,21)))
    +(0*(1.0/POWER(2,22)))
    +(0*(1.0/POWER(2,23)));

Finally, we put them altogether like so:

-- Result: .15625
SELECT 
    POWER(-1,@sign)
    * POWER(CAST(2 AS DECIMAL(38,37)),(@exponent-127)) 
    * (1+@fraction)

In this example, there is no floating point error - .15625 can be accurately stored as a float. However, if go through the same exercise for a number like .1 or .2, you'll notice your numbers are not so perfect.

Fixing Floating Point Errors

Floating point math errors can be fixed in a few ways.

One option is to stop caring about them. The error occurring on floats is very small (although when compounded through arithmetic, the error can grow large enough to be noticeable like in my reporting bar chart example). If you are writing queries or reports where such a small amount of error doesn't matter, then you can continue on your merry way without having to change anything.

A second option is to still store values as floats (for that sweet, sweet storage space savings), but ensure your application code has business logic to correctly round numbers that are in precise.

However, if your data needs to be perfectly accurate every single time with no errors, use a different datatype. The logical choice here would be to use decimal in SQL Server, which uses a different internal method for storing your numbers, resulting in perfect results every time. However, the range of possible values is not as large as float, and you will pay for that precision with additional bytes of storage space.

In the end, floating point is good enough for many applications. The important thing is that you are aware that these kind of errors can happen and that you handle them appropriately.

SQL Server's "Wrong" Math

Published Tue 24 September 2019 in SQL > Development

Watch this week's video on YouTube

A couple of weeks ago I decided to rebuild my recording studio by getting rid of my fabric backdrop and replacing it with a true wall instead. Doing this would allow me more flexibility when shooting, further improving my filming process efficiency.

To determine how much lumber I would need for building the new walls, I decided to write a SQL query to help with my framing calculations. I was building a 6 foot wall and wanted to put a stud every 16 inches. Easy enough to do the math on this:

SELECT (6*12)/16

2019-09-23-19-41-32

The output of the query above was 4, indicating the number of studs I would need for one wall section.

What's interesting is that if we do this same equation in a calculator, we get a slightly different answer: 4.5.

$sql-math$

And while I didn't end up framing my walls incorrectly, if I trusted the output of my query I would have had some incorrectly sized walls.

Is SQL Server Bad At Simple Math?

What happened? Well it all has to do with how SQL Server handles calculations.

While tweeting about my studio rebuild process, Kenneth Fisher from SQL Studies tweeted about what I would learn about SQL Server from rebuilding the studio.

https://twitter.com/sqlstudent144/status/1170376006370283520

Jokingly, I tweeted back the above SELECT (6*12)/16 example because it is funny (scary?) how SQL Server chops off the .5 if you don't understand what's going on.

When you perform calculations in SQL Server, it converts any expressions to the datatype that has the highest precedence. In the above example, since all of the numbers we are dealing with are integers, SQL Server keeps the final answer as an integer, apparently not caring what should have come after the decimal.

The quick and dirty way to solve this is to include a datatype in the equation that allows for decimals and has a higher precedence than integer. Basically, convert one of the integers to a numeric by adding .0 to any of the values:

SELECT (6*12)/16.0

2019-09-23-19-53-29

This will then return the expected result.

Following up on Twitter, Andy Mallon mentions that you don't even need the 0, simply adding . will suffice:

https://twitter.com/AMtwo/status/1170391334500388865

SELECT * (6*12)/16.

Pat Phelan then took it a step further, saying you can use the e syntax if you want to get the same successful result but confuse your users:

https://twitter.com/YetAnotherSQL/status/1170400514716164096

SELECT * (6*12)/16e0

Out of all of these methods, I prefer adding the .0 because it is the least ambiguous. For calculations that matter however, I also like to throw a CAST around individual values or the entire equation to be certain that I am getting a result with the precision and scale that I expect instead of letting SQL Server automatically guess for me:

SELECT CAST((6*12)/16.0 AS NUMERIC (2,1))

Yes, it's a few extra characters, but the intent is clear.

Why Does All of This Matter?

Implicit conversions and datatype precedence are something that most people starting with SQL Server are not aware of until they discover that their results are "wrong". If your queries require precise answers, then you have to be precise and explicit in how you handle the data (otherwise you might build a studio wall incorrectly!)

For more information about all these types of conversions, check out Andy's post on the subject which has even more fun examples.

SQL Server Quick Tips

Published Tue 17 September 2019 in SQL > Development

Earlier this year I made a series of videos that were played before the start of several DBA Fundamentals presentations. These videos gave viewers something SQL related to watch while the online rooms were filling up with attendees before the start of each day's live presentation.

Since these videos only ran once, and because my regular filming location is under construction this week, I decided to republish them here today.

Hope you enjoy these quick tips on several SQL Server concepts.

STATISTICS IO, TIME

Enabling STATISTICS IO and TIME can help you benchmark your queries to see if your tuning efforts are helping or hindering execution performance:

Watch this video on YouTube

Key Concepts

The terminology surrounding keys in relational databases can be confusing - in this overview we summarize the difference between primary keys, foreign keys, natural keys, and more:

Watch this video on YouTube

SARGability

When performance tuning queries, one of first things to check is whether your queries are SARGable and allow the optimizer to seek to the data it needs:

Watch this video on YouTube

Tracking Who Last Changed a Row

Published Tue 10 September 2019 in SQL > Development

MJ-t-sql-Tuesday

This post is a response to this month's T-SQL Tuesday #118 prompt by Kevin Chant. T-SQL Tuesday is a way for the SQL Server community to share ideas about different database and professional topics every month.

This month's topic asks about our fantasy SQL Server feature.

Watch this week's video on YouTube

Data About The Data

Have you ever wondered who was the last person (or process) to modify a piece of data in your database?

SQL Server offers plenty of system views and functions that provide insight into how your server is running and how your queries are performing. However, it doesn't offer much information about who last modified your data.

That's why my ideal SQL Server feature would be to have a built-in way to audit who last modified a row of data and at what time.

Current Workarounds

Today, this sort of logging can be implemented in the application layer but that requires extra coding and time.

In SQL Server, temporal tables offer part of this functionality with their GENERATED ALWAYS FOR ROW START/END properties, but these only log a row created/row last modified date. There is no built-in way to log which user modified the data. The remaining temporal table functionality also adds unnecessary overhead if you don't actually need to keep track of all of the data history.

Default constraints exist for helping insert default values for when a row was created and who the user was that created the row, but restricting unauthorized access to those fields as well as handling instances where data is updated is not as straight forward.

The closest thing to get this type of automatic logging in SQL Server today is to implement the functionality with triggers. Reinterpreting some of my code I wrote when discussing how to fake temporal tables with triggers, we come up with this:

DROP TABLE dbo.TestData;
CREATE TABLE dbo.TestData (
    Id int IDENTITY CONSTRAINT PK_Id PRIMARY KEY,
    ColA int,
    ColB int,
    ColC int,
    LastModifiedDate datetime2,
    LastModifiedUser nvarchar(30)
);
GO

CREATE TRIGGER AutoLogger ON dbo.TestData
AFTER INSERT,UPDATE
AS
BEGIN
SET NOCOUNT ON;

DECLARE 
    @CurrentDateTime datetime2 = SYSUTCDATETIME(),
    @CurrentUser nchar(30) = SYSTEM_USER

UPDATE t
SET
       LastModifiedDate = @CurrentDateTime,
       LastModifiedUser = @CurrentUser
FROM
    dbo.TestData t
    INNER JOIN inserted i
        ON t.Id = i.Id;
END
GO

Now, any time a record is created or modified, we log the datetime and the user that performed the modification:

-- Regular insert
INSERT INTO dbo.TestData (ColA,ColB,ColC) VALUES (1,2,3);
-- Regular update
UPDATE dbo.Test SET ColA = 4 WHERE Id = 1;
SELECT * FROM dbo.TestData;

-- Questionable motive insert
INSERT INTO dbo.TestData (ColA,ColB,ColC,LastModifiedDate,LastModifiedUser) VALUES (1,2,3,'9999-12-31','NotMe');
-- Questionable motive update
UPDATE dbo.TestData SET LastModifiedDate = '9999-12-31', LastModifiedUser='NotMe' WHERE Id = 1;
SELECT * FROM dbo.TestData;

2019-09-09-19-41-49

While this solution provides some of the functionality I want, it has many downsides. First, it utilizes a trigger which often gets overlooked (at least initially) when debugging issues, causing confusion and maintenance headaches.

Secondly, there is some overhead in having a trigger run after each and every insert and update. Transaction throughput gets limited since every INSERT/UPDATE on this table will trigger a follow up UPDATE.

Additionally, this solution is not automatic: it must be created individually on every table you want logging on.

Finally, this table now contains extra columns on the clustered index, columns that I don't necessarily always want to be impacting my performance.

The Ideal Feature

I wish there was a database level option that allowed logging of who modified what row of data when. When turned on, it would automatically handle this logging logic (and maybe some more, like specifically what column in addition to what row was modified) without the need to set it up on individual tables or by using triggers.

Additionally, I would love if this data were not persisted on the table's clustered index itself. If there were a way to store the data in a nonclustered index for that table only (kind of like a non-persisted computed column value gets stored) that would be ideal.

Finally, I would love if this meta data were populated asynchronously to not impact the performance of inserts/updates/deletes on the main table of data.

Whitespace, Letter Case, and Other Things That Prevent Plan Reuse

Published Tue 03 September 2019 in SQL > Performance Tuning

Watch this week's video on YouTube

Last week's post briefly mentioned that SQL server may not reuse a query plan from cache if there is a small difference in whitespace.

In addition to differences in whitespace, SQL Server won't reuse query plans for a variety of reasons. Today's post will cover some of the more common scenarios of when a plan won't be reused.

Viewing the Cache

One way to view what's in SQL Server's query plan cache is by querying the sys.db_exec_query_stats DMV. Combined with the sys.dm_exec_sql_text and sys.dm_exec_query_plan functions, we can get an idea of what's in the cache and how often it's getting reused:

SELECT
    s.sql_handle,
    s.plan_generation_num,
    s.plan_handle,
    s.execution_count,
    s.query_hash,
    s.query_plan_hash,
    t.text,
    p.query_plan
FROM 
    sys.dm_exec_query_stats s
    CROSS APPLY sys.dm_exec_sql_text(sql_handle) t
    CROSS APPLY sys.dm_exec_query_plan(plan_handle) p

Now that we know how to view information about the query plan cache, let's run DBCC FREEPROCCACHE to clear it out so we can start with a clean slate (warning: probably a bad idea to run this in production).

Patterns That Prevent Plan Reuse

With our plan cache empty, let's run some similar queries and see if SQL Server decides to reuse plans, or insert nearly duplicate entries into the plan cache.

Whitespace

If we run these two queries:

SELECT TOP 100 DisplayName FROM dbo.Users;
GO
SELECT TOP 100 DisplayName FROM     dbo.Users;
GO

And then go check our query plan cache:

2019-09-02-19-06-37

We'll notice that SQL Server compiled two separate plans for what is essentially the same exact query (the only difference being the extra spaces before the table name). What this means is that if your queries are not exactly the same, or if you are in the habit of highlighting queries in SSMS before you run them (and accidently select some extra white space before/after your query), SQL Server will generate a brand new plan for you even though it has a compiled plan for a nearly identical query already in cache.

Don't think this can happen to you? While filming the video for this blog post, I accidently made this exact mistake even though I had reviewed my code and I was intentionally trying to show you what not to do. It's easy to slip up. Check out the blooper at 2:25 if interested.

Letter Case

Differences in letter casing also prevent cached plan reuse:

SELECT TOP 100 DisplayName FROM dbo.Users;
GO
select TOP 100 DisplayName from dbo.Users;
GO

2019-09-02-19-07-14

In this case, only our SELECT and FROM keywords have changed from upper to lower case, but in the eyes of SQL Server these queries are now different and it will generate plans individually for them.

Comments

Want to improve the clarity of your business logic with a comment? Great! But watch out if running the query with and without comments because SQL Server will consider these queries to be different:

SELECT TOP 100 DisplayName FROM dbo.Users;
GO
/* This query returns all user names */
SELECT TOP 100 DisplayName FROM dbo.Users;
GO

2019-09-02-19-07-46-1

Schema Names

Running the following two queries:

SELECT TOP 100 DisplayName FROM dbo.Users;
GO
SELECT TOP 100 DisplayName FROM Users;
GO

2019-09-02-19-08-11

These two queries create two separate entries in the plan cache. The problem here is that every user can have a different default schema, so if you don't prefix a table name with a schema, SQL Server won't be able to guarantee that each execution of the query should run for the same exact table, forcing it to generate new plans and prevent reuse.

SET Options

If you have two queries running with two different configurations of SET options, the queries will also generate separate plans:

SET ANSI_PADDING ON;
GO
SELECT TOP 100 DisplayName FROM dbo.Users;
GO
SET ANSI_PADDING OFF;
GO
SELECT TOP 100 DisplayName FROM dbo.Users;
GO

2019-09-02-19-08-53

Simple Parameterization Datatypes

If a query is simple enough for SQL Server to parameterize it on its own, we'll see the parameterize query usage count increase since SQL Server can reuse the query:

SELECT DisplayName FROM dbo.Users WHERE Id = 3;
GO
SELECT DisplayName FROM dbo.Users WHERE Id = 4;
GO

2019-09-02-19-11-45

However, you'll notice that SQL Server defined the parameterized value as tinyint. If our query suddenly sends in a value larger than 255:

SELECT DisplayName FROM dbo.Users WHERE Id = 256;
GO

2019-09-02-19-12-10

We'll see a new plan generated and stored in cache.

What is particularly interesting is that if we have differences in whitespace or comments in queries that can be simply parameterized, SQL Server will actually correct for this and be able to reuse the same plan even though queries without simple parameterization would fail to reuse the same plan:

SELECT DisplayName FROM dbo.Users WHERE Id = 3;
GO
SELECT DisplayName FROM dbo.Users WHERE Id = 3;
GO
SELECT DisplayName FROM dbo.Users WHERE Id = 3;
GO
SELECT DisplayName /*comment*/ FROM dbo.Users WHERE Id = 3;
GO

2019-09-02-19-12-47

Be Consistent

In conclusion, SQL Server is very particular about the differences in your submitted queries, no matter how small those differences may be. If you want SQL Server to be able to reuse plans from its cache, you need to make sure those queries are exactly the same.