Trailing Spaces in SQL Server

Published on: 2019-06-18

A long time ago I built an application that captured user input. One feature of the application was to compare the user’s input against a database of values.

The app performed this text comparison as part of a SQL Server stored procedure, allowing me to easily update the business logic in the future if necessary.

One day, I received an email from a user saying that the value they were typing in was matching a database value that they knew shouldn’t match. That is the day I discovered SQL Server’s counterintuitive equality comparison when dealing with trailing space characters.

Padded white space

You are probably aware that the CHAR data type pads the value with spaces until the defined length is reached:

DECLARE @Value CHAR(10) = 'a';
SELECT
	@Value AS OriginalValue,
	LEN(@Value) AS StringLength,
	DATALENGTH(@Value) AS DataLength,
	CAST(@Value AS BINARY(10)) AS StringToHex;
-- StringLength = 1, DataLength = 10, StringToHex = 0x61202020202020202020

The LEN() function shows the number of characters in our string, while the DATALENGTH() function shows us the number of bytes used by that string.

In this case, DATALENGTH is equal to 10. This result is due to the padded spaces occurring after the character “a” in order to fill the defined CHAR length of 10. We can confirm this by converting the value to hexadecimal. We see the value 61 (“a” in hex) followed by nine “20” values (spaces).

If we change our variable’s data type to VARCHAR, we’ll see the value is no longer padded with spaces:

DECLARE @Value VARCHAR(10) = 'a';
SELECT
	@Value AS OriginalValue,
	LEN(@Value) AS StringLength,
	DATALENGTH(@Value) AS DataLength,
	CAST(@Value AS BINARY(10)) AS StringToHex;
-- StringLength = 1, DataLength = 1, StringToHex = 0x61000000000000000000

Given that one of these data types pads values with space characters while the other doesn’t, what happens if we compare the two?

DECLARE 
	@CharValue CHAR(10) = '',
	@VarcharValue VARCHAR(10) = ''
SELECT
	IIF(@CharValue=@VarcharValue,1,0) AS ValuesAreEqual,
	DATALENGTH(@CharValue) AS CharBytes,
	DATALENGTH(@VarcharValue) AS VarcharBytes
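-- Result: ValuesAreEqual = 1, even though CharBytes (10) and VarcharBytes (0) differ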

In this case, SQL Server considers both values equal, even though we can confirm that their DATALENGTHs differ.

This behavior doesn’t occur only with mixed data type comparisons, however. If we compare two values of the same data type, with one value containing several space characters, we experience something…unexpected:

DECLARE 
	@NoSpaceValue VARCHAR(10) = '',
	@MultiSpaceValue VARCHAR(10) = '    '
SELECT
	IIF(@NoSpaceValue=@MultiSpaceValue,1,0) AS ValuesAreEqual,
	DATALENGTH(@NoSpaceValue) AS NoSpaceBytes,
	DATALENGTH(@MultiSpaceValue) AS MultiSpaceBytes
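-- Result: ValuesAreEqual = 1, even though the byte counts differ (0 vs 4)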

Even though our two variables have different values (an empty string compared to four space characters), SQL Server considers these values equal.

If we add a character with some trailing whitespace, we’ll see the same behavior:

DECLARE 
	@NoSpaceValue VARCHAR(10) = 'a',
	@MultiSpaceValue VARCHAR(10) = 'a     '
SELECT
	IIF(@NoSpaceValue=@MultiSpaceValue,1,0) AS ValuesAreEqual,
	DATALENGTH(@NoSpaceValue) AS NoSpaceBytes,
	DATALENGTH(@MultiSpaceValue) AS MultiSpaceBytes
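-- Result: ValuesAreEqual = 1, even though the byte counts differ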

Both values are clearly different, but SQL Server considers them to be equal to each other. Switching our equal sign to a LIKE operator changes things slightly:

DECLARE 
	@NoSpaceValue VARCHAR(10) = 'a',
	@MultiSpaceValue VARCHAR(10) = 'a     '
SELECT
	IIF(@NoSpaceValue LIKE @MultiSpaceValue,1,0) AS ValuesAreEqual,
	DATALENGTH(@NoSpaceValue) AS NoSpaceBytes,
	DATALENGTH(@MultiSpaceValue) AS MultiSpaceBytes
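-- Result: ValuesAreEqual = 0; LIKE treats the pattern's trailing spaces as significant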

Even though I would expect a LIKE without any wildcard characters to behave just like an equal sign, SQL Server doesn’t perform these two comparisons the same way: LIKE treats trailing spaces in the pattern as significant, so the values no longer match.

If we switch back to our equal sign comparison and prefix our character value with spaces, we’ll also notice a different result:

DECLARE 
	@NoSpaceValue VARCHAR(10) = 'a',
	@MultiSpaceValue VARCHAR(10) = '    a'
SELECT
	IIF(@NoSpaceValue=@MultiSpaceValue,1,0) AS ValuesAreEqual,
	DATALENGTH(@NoSpaceValue) AS NoSpaceBytes,
	DATALENGTH(@MultiSpaceValue) AS MultiSpaceBytes
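-- Result: ValuesAreEqual = 0; leading spaces are not padded away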

SQL Server considers two values equal regardless of spaces occurring at the end of a string. Spaces preceding a string, however, are no longer considered a match.

What is going on?

ANSI

While counterintuitive, SQL Server’s behavior is justified. SQL Server follows the ANSI SQL specification for comparing strings: it pads the shorter string with spaces until both strings are the same length before comparing them. This explains the phenomenon we are seeing.

It does not do this with the LIKE operator, however, which explains the difference in behavior.

Comparisons when extra spaces matter

Let’s say we want to do a comparison where the difference in trailing spaces matters.

One option is to use the LIKE operator as we saw a few examples back. This is not the typical use of the LIKE operator, however, so be sure to comment and explain what your query is attempting to do by using it. The last thing you want is for some future maintainer of your code to switch it back to an equal sign because they don’t see any wildcard characters.

Another option that I’ve seen is to perform a DATALENGTH comparison in addition to the value comparison:

DECLARE 
	@NoSpaceValue VARCHAR(10) = 'a',
	@MultiSpaceValue VARCHAR(10) = 'a    '
SELECT
	IIF(@NoSpaceValue = @MultiSpaceValue AND DATALENGTH(@NoSpaceValue) = DATALENGTH(@MultiSpaceValue),1,0) AS ValuesAreEqual,
	DATALENGTH(@NoSpaceValue) AS NoSpaceBytes,
	DATALENGTH(@MultiSpaceValue) AS MultiSpaceBytes
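-- Result: ValuesAreEqual = 0, because the DATALENGTH check catches the extra spaces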

This solution isn’t right for every scenario, however. For starters, you have no way of knowing whether SQL Server will evaluate your value comparison or your DATALENGTH predicate first. This could wreak havoc on index usage and cause poor performance.

A more serious problem can occur if you are comparing fields with different data types. For example, when comparing a VARCHAR to an NVARCHAR, it’s pretty easy to create a scenario where your comparison query using DATALENGTH will trigger a false positive:

DECLARE 
	@NoSpaceValue VARCHAR(10) = 'a ',
	@MultiSpaceValue NVARCHAR(10) = 'a'
SELECT
	IIF(@NoSpaceValue = @MultiSpaceValue AND DATALENGTH(@NoSpaceValue) = DATALENGTH(@MultiSpaceValue),1,0) AS ValuesAreEqual,
	DATALENGTH(@NoSpaceValue) AS NoSpaceBytes,
	DATALENGTH(@MultiSpaceValue) AS MultiSpaceBytes
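-- Result: ValuesAreEqual = 1, a false positive (both DATALENGTHs are 2 bytes)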

Here the NVARCHAR stores 2 bytes for every character, so the DATALENGTH of the single-character NVARCHAR value (2 bytes) matches the DATALENGTH of the VARCHAR value containing a character plus a trailing space (also 2 bytes).

The best thing to do in these scenarios is understand your data and pick a solution that will work for your particular situation.

And maybe trim your data before insertion (if it makes sense to do so)!
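
For example, a minimal sketch of trimming on the way in (the table and parameter names here are hypothetical):

DECLARE @UserInput VARCHAR(100) = 'some value   ';

-- RTRIM strips the trailing spaces before the value is stored
-- (SQL Server 2017+ also offers TRIM() to strip both ends)
INSERT INTO dbo.UserInputs (InputValue)
VALUES (RTRIM(@UserInput));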


CHOOSE() in SQL Server

Published on: 2019-06-04

While I know I don’t utilize most of the features available in SQL Server, I like to think I’m at least aware that those features exist.

This week, however, I found a blind spot in that assumption. Even though it shipped in SQL Server 2012, I think this past week was the first time I’ve ever encountered the SQL Server CHOOSE function.

CHOOSE is CASE

CHOOSE returns the n-th item from a comma-delimited list.
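
A minimal illustration (CHOOSE’s index is 1-based, and an out-of-range index returns NULL):

SELECT
	CHOOSE(2, 'alpha', 'beta', 'gamma') AS SecondItem, -- returns 'beta'
	CHOOSE(5, 'alpha', 'beta', 'gamma') AS OutOfRange; -- returns NULL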

Whenever learning a new feature in SQL Server I try to think of a good demo I could build to test out the functionality. In this case the immediate example that came to mind was building something that would provide a lookup of values:

SELECT 
	[key],
	[value],
	[type],
	CHOOSE(type+1,'null','string','int','boolean','array','object') AS JsonType
FROM
	OPENJSON(N'{
		"Property1":null,
		"Property2":"a",
		"Property3":3,
		"Property4":false,
		"Property5":[1,2,"3"],
		"Property6":{
			"SubProperty1":"a"
		}
	}');

In this case, the OPENJSON function returns a “type” field that indicates the datatype of that particular JSON property’s value. The issue is that the “type” column is numeric and I can never remember what type of data each number represents.

The above query solves this by using CHOOSE to create a lookup of values. Since OPENJSON’s type values start at 0, we need to pass type+1 in order for the 1-based CHOOSE function to work correctly:

(Screenshot: the query results, showing each JSON property’s numeric type decoded into its JsonType label.)

If we look at the CHOOSE function’s scalar operator properties in the execution plan, we’ll see that this function is just a fancy alias for a more verbose nested CASE expression:

[Expr1000] = Scalar Operator(
	CASE WHEN (CONVERT_IMPLICIT(int,OPENJSON_DEFAULT.[type],0)+(1))=(1) 
	THEN 'null' 
	ELSE 
		CASE WHEN (CONVERT_IMPLICIT(int,OPENJSON_DEFAULT.[type],0)+(1))=(2) 
		THEN 'string' 
		ELSE 
			CASE WHEN (CONVERT_IMPLICIT(int,OPENJSON_DEFAULT.[type],0)+(1))=(3) 
			THEN 'int' 
			ELSE 
				CASE WHEN (CONVERT_IMPLICIT(int,OPENJSON_DEFAULT.[type],0)+(1))=(4) 
				THEN 'boolean' 
				ELSE 
					CASE WHEN (CONVERT_IMPLICIT(int,OPENJSON_DEFAULT.[type],0)+(1))=(5) 
					THEN 'array' 
					ELSE 
						CASE WHEN (CONVERT_IMPLICIT(int,OPENJSON_DEFAULT.[type],0)+(1))=(6) 
						THEN 'object' 
						ELSE NULL END 
					END 
				END 
			END 
		END 
	END
)

The Set-Based Way

I think one of the reasons I’ve never used CHOOSE is because I would hate typing up all of those lookup values and trapping them in a SELECT statement, never to be used again.

Previously, I would have stored the lookup values in a table and joined it with the OPENJSON results to accomplish the same end result:

DROP TABLE IF EXISTS #JsonType;
CREATE TABLE #JsonType
(
	Id tinyint,
	JsonType varchar(20),
	CONSTRAINT PK_JsonTypeId PRIMARY KEY CLUSTERED (Id)
);

INSERT INTO #JsonType VALUES (0,'null');
INSERT INTO #JsonType VALUES (1,'string');
INSERT INTO #JsonType VALUES (2,'int');
INSERT INTO #JsonType VALUES (3,'boolean');
INSERT INTO #JsonType VALUES (4,'array');
INSERT INTO #JsonType VALUES (5,'object');

SELECT 
	j.[key],
	j.[value],
	j.[type],
	t.JsonType
FROM
	OPENJSON(N'{
		"Property1":null,
		"Property2":"a",
		"Property3":3,
		"Property4":false,
		"Property5":[1,2,"3"],
		"Property6":{
						"SubProperty1":"a"
					}
	}') j
	INNER JOIN #JsonType t
		ON j.[type] = t.Id

While more initial setup is involved with this solution, it’s more flexible long-term. With a centralized set of values, there’s no need to update the CHOOSE function in all of your queries when you can update the values in a single lookup table.

And while I didn’t bother performance testing it, by virtue of being a scalar function evaluated for every row, CHOOSE will probably perform worse than the table-based lookup approach in many real-world scenarios (e.g. large datasets, parallel plans, etc…).

CHOOSE What Works For You

I’m not surprised that it took me this long to learn about the CHOOSE function: while it offers a simplified way to write certain CASE expressions, I can’t think of many (any?) scenarios where I would prefer it over a CASE or a lookup-table solution.


12 Ways To Rewrite SQL Queries for Better Performance

Published on: 2019-05-28

Over the past several weeks I’ve been exploring ways to rewrite queries to improve execution performance.

I learned a lot of these techniques over time through trial and error, attending presentations, reading blog posts, speaking to other DBAs and developers, etc., but I never knew of a good resource that summarized these techniques in one place.

This post is a quick round-up of everything I’ve covered so far, as well as 8 additional techniques that I use occasionally but that don’t necessarily require a full detailed post to explain.

Why Rewrite Queries?

I often find myself working in environments where modifying indexes or changing server settings is out of the question when performance tuning. I usually run into these scenarios when dealing with:

  • Vendor databases
  • “Fragile” systems
  • Not enough disk space
  • Limited tooling/ad hoc analysis
  • Features limited by security software

While solving the root cause of a performance problem is always preferable, sometimes the only way I’m able to fix problems in these environments is by rewriting the queries.

I decided to write this summary post because it is a resource I would have loved to have when starting out. Sometimes it can be easy to get “writer’s block” when trying to think of ways to rewrite a SQL query, so hopefully this list of techniques can provide ideas and get your creative juices flowing.

So, without further ado, here is a list of 12 techniques in no particular order that you can use to rewrite your queries to change their performance.

12 Ways to Refactor a Query to Change Performance

1. Window functions vs GROUP BY

Sometimes window functions rely a little too much on tempdb and blocking operators to accomplish what you ask of them. While using them is always my first choice because of their simple syntax, if they perform poorly you can usually rewrite them as an old-fashioned GROUP BY to achieve better performance.
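
To sketch the idea (the #Sales table and its columns here are hypothetical), a per-group total can be written both ways:

-- Window function version: concise, but may spool to tempdb
SELECT
	s.CustomerId,
	s.Amount,
	SUM(s.Amount) OVER (PARTITION BY s.CustomerId) AS CustomerTotal
FROM #Sales s;

-- GROUP BY rewrite: aggregate once in a derived table, then join back
SELECT
	s.CustomerId,
	s.Amount,
	t.CustomerTotal
FROM #Sales s
	INNER JOIN (
		SELECT CustomerId, SUM(Amount) AS CustomerTotal
		FROM #Sales
		GROUP BY CustomerId
	) t
		ON s.CustomerId = t.CustomerId;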

2. Correlated subqueries vs derived tables

Many people like using correlated subqueries because the logic is often easy to understand, however switching to derived table queries often produces better performance due to their set-based nature.
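
A sketch with hypothetical #Customers and #Orders tables:

-- Correlated subquery: conceptually evaluated once per customer row
SELECT
	c.CustomerId,
	(SELECT MAX(o.OrderDate)
	FROM #Orders o
	WHERE o.CustomerId = c.CustomerId) AS LastOrderDate
FROM #Customers c;

-- Derived table rewrite: aggregate the whole table once, then join
SELECT
	c.CustomerId,
	o.LastOrderDate
FROM #Customers c
	LEFT JOIN (
		SELECT CustomerId, MAX(OrderDate) AS LastOrderDate
		FROM #Orders
		GROUP BY CustomerId
	) o
		ON c.CustomerId = o.CustomerId;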

3. IN vs UNION ALL

When filtering rows of data on multiple values in tables with skewed distributions and non-covering indexes, writing your logic into multiple statements joined with UNION ALLs can sometimes generate more efficient execution plans than just using IN or ORs.
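
For example, with a hypothetical #Orders table and a skewed Status column:

-- IN version: one plan has to cover both values
SELECT OrderId FROM #Orders WHERE [Status] IN ('Cancelled', 'Returned');

-- UNION ALL rewrite: each branch gets optimized independently
SELECT OrderId FROM #Orders WHERE [Status] = 'Cancelled'
UNION ALL
SELECT OrderId FROM #Orders WHERE [Status] = 'Returned';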

4. Temporary Staging Tables

Sometimes the query optimizer struggles to generate an efficient execution plan for complex queries. Breaking a complex query into multiple steps that utilize temporary staging tables can provide SQL Server with more information about your data. Staging tables also force you to write simpler queries, which can lead the optimizer to generate more efficient execution plans and allow it to reuse result sets more easily.
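
A sketch of the pattern, using hypothetical dbo.Orders and dbo.Customers tables:

-- Step 1: materialize the filtered intermediate result; SQL Server
-- creates statistics on the temp table, improving later estimates
SELECT OrderId, CustomerId, Amount
INTO #RecentOrders
FROM dbo.Orders
WHERE OrderDate >= DATEADD(DAY, -30, GETDATE());

-- Step 2: a much simpler query against the staged rows
SELECT c.CustomerName, SUM(r.Amount) AS Recent30DayTotal
FROM #RecentOrders r
	INNER JOIN dbo.Customers c
		ON r.CustomerId = c.CustomerId
GROUP BY c.CustomerName;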

5. Forcing Table Join Orders

Sometimes outdated statistics and other insufficient information can cause the SQL Server query optimizer to join tables in a less than ideal sequence. Adam Machanic has a fantastic presentation on forcing table join order with blocking operators without having to resort to join hints.
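
As a rough sketch of one blocking-operator flavor of this idea (not necessarily Machanic’s exact technique; table names hypothetical), a TOP driven by a variable keeps the optimizer from reordering the join around the derived table:

DECLARE @n BIGINT = 9223372036854775807; -- effectively "all rows"

SELECT d.CustomerId, b.OrderId
FROM (
	SELECT TOP (@n) CustomerId
	FROM #DriverTable
	ORDER BY CustomerId
) d
	INNER JOIN #BigTable b
		ON b.CustomerId = d.CustomerId;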

6. DISTINCT with few unique values

Using the DISTINCT operator is not always the fastest way to return the unique values in a dataset. In particular, Paul White uses recursive CTEs to return distinct values on large datasets with relatively few unique values. This is a great example of solving a problem using a very creative solution.
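
Here is a sketch of that recursive CTE pattern, adapted from Paul White’s write-up (hypothetical #BigTable with an index on Col):

WITH RecursiveDistinct AS
(
	-- Anchor: the lowest value in the table
	SELECT MIN(T.Col) AS Col
	FROM #BigTable T

	UNION ALL

	-- Recursive step: seek the next value above the previous one
	-- (ROW_NUMBER is used because TOP isn't allowed here)
	SELECT R.Col
	FROM (
		SELECT T.Col, ROW_NUMBER() OVER (ORDER BY T.Col) AS rn
		FROM #BigTable T
			INNER JOIN RecursiveDistinct RD
				ON T.Col > RD.Col
	) R
	WHERE R.rn = 1
)
SELECT Col
FROM RecursiveDistinct
OPTION (MAXRECURSION 0);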

7. Eliminate UDFs

UDFs often cause poor query performance due to forcing serial plans and causing inaccurate estimates. One way to possibly improve the performance of queries that call UDFs is to try and inline the UDF logic directly into the main query. With SQL Server 2019 this will be something that happens automatically in a lot of cases, but as Brent Ozar points out you might occasionally have to manually inline a UDF’s functionality to get the best performance.
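
A trivial sketch of manual inlining (the UDF and table here are hypothetical):

-- Suppose this scalar UDF exists:
-- CREATE FUNCTION dbo.GetFullName (@First VARCHAR(50), @Last VARCHAR(50))
-- RETURNS VARCHAR(101) AS BEGIN RETURN CONCAT(@First, ' ', @Last); END;

-- UDF version: pre-2019, calling it forces a serial plan and per-row execution
SELECT dbo.GetFullName(FirstName, LastName) AS FullName
FROM dbo.People;

-- Manually inlined rewrite
SELECT CONCAT(FirstName, ' ', LastName) AS FullName
FROM dbo.People;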

8. Create UDFs

Sometimes a poorly configured server will parallelize queries too frequently, causing worse performance than the equivalent serial plan. In those cases, putting the troublesome query logic into a scalar or multi-statement table-valued function might improve performance, since SQL Server forces that part of the plan to run serially. Definitely not a best practice, but it is one way to force serial plans when you can’t change the cost threshold for parallelism.
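
A hedged sketch: a do-nothing scalar UDF (hypothetical) whose only purpose is to disqualify the plan from going parallel:

CREATE FUNCTION dbo.ForceSerial (@value INT)
RETURNS INT
AS
BEGIN
	RETURN @value;
END;
GO

-- Referencing any scalar UDF forces the whole plan to run serially
-- (on SQL Server 2019+, define the function WITH INLINE = OFF so
-- automatic UDF inlining doesn't undo the effect)
SELECT SUM(Amount * dbo.ForceSerial(1)) AS Total
FROM dbo.Sales; -- hypothetical table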

9. Data Compression

Not only does data compression save space, on certain workloads it can actually improve performance. Since compressed data fits in fewer pages, fewer disk reads are needed; maybe more importantly, more data fits in SQL Server’s buffer pool, increasing the potential for SQL Server to reuse data already in memory.
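
For example, compression can be estimated first and then enabled per table or index (dbo.Sales is hypothetical):

-- Estimate the space savings before committing
EXEC sp_estimate_data_compression_savings
	@schema_name = 'dbo',
	@object_name = 'Sales',
	@index_id = NULL,
	@partition_number = NULL,
	@data_compression = 'PAGE';

-- Rebuild the table (and its clustered index) with page compression
ALTER TABLE dbo.Sales REBUILD WITH (DATA_COMPRESSION = PAGE);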

10. Indexed Views

When you can’t add new indexes to existing tables, you might be able to get away with creating a view on those tables and indexing the view instead. This works great for vendor databases where you can’t touch any of the existing objects.
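
A sketch of the pattern, assuming a vendor table dbo.Sales we aren’t allowed to index directly:

-- The view must be schema-bound, and a grouped view needs COUNT_BIG(*)
CREATE VIEW dbo.vSalesByCustomer
WITH SCHEMABINDING
AS
SELECT
	CustomerId,
	SUM(Amount) AS TotalAmount, -- assumes Amount is NOT NULL
	COUNT_BIG(*) AS RowCnt
FROM dbo.Sales
GROUP BY CustomerId;
GO

-- The unique clustered index is what actually materializes the view
CREATE UNIQUE CLUSTERED INDEX CIX_vSalesByCustomer
	ON dbo.vSalesByCustomer (CustomerId);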

11. Switch cardinality estimators

The newer cardinality estimator introduced in SQL Server 2014 improves the performance of many queries. However, in some specific cases it can make queries perform more slowly. In those cases, a simple query hint is all you need to force SQL Server to change back to the legacy cardinality estimator.
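
On SQL Server 2016 SP1 and later, a USE HINT is all it takes; older builds can use trace flag 9481 instead (dbo.Sales is hypothetical):

SELECT CustomerId, SUM(Amount) AS Total
FROM dbo.Sales
GROUP BY CustomerId
OPTION (USE HINT ('FORCE_LEGACY_CARDINALITY_ESTIMATION'));
-- Pre-2016 SP1 equivalent: OPTION (QUERYTRACEON 9481);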

12. Copy the data

If you can’t get better performance by rewriting a query, you can always copy the data you need to a new table in a location where you CAN create indexes and do whatever other helpful transformations you need to do ahead of time.
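
A minimal sketch (source and destination names hypothetical):

-- Copy just the rows and columns you need into a database you control
SELECT OrderId, CustomerId, OrderDate, Amount
INTO ReportingDb.dbo.OrdersCopy
FROM VendorDb.dbo.Orders
WHERE OrderDate >= '2019-01-01';

-- Now you CAN index and transform the copy ahead of time
CREATE NONCLUSTERED INDEX IX_OrdersCopy_CustomerId
	ON ReportingDb.dbo.OrdersCopy (CustomerId)
	INCLUDE (Amount);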

…And more

By no means is this list exhaustive. There are so many ways to rewrite queries, and not all of them will work all the time.

The key is to think about what the query optimizer knows about your data and why it’s choosing the plan it is. Once you understand what it’s doing, you can start getting creative with various query rewrites that address that issue.
