4 Reasons To Avoid VARCHAR(8000)

Watch this week's video on YouTube

When first learning SQL Server, I had the brilliant idea of defining every column on my table as VARCHAR(8000). After all, the VARCHAR datatype only consumes as much storage space as the string itself (plus two bytes), so even if I defined the max length of a column as 8000, I wouldn't be wasting storage space if my data was smaller than that.

My development life would be easier too since I wouldn't have to think about the kind of data I was storing in my columns; I could define everything as VARCHAR(8000) once and never have to go back to make any changes. Brilliant!

While I was correct about not wasting storage space, it turns out the idea of making every column VARCHAR(8000) is a terrible idea.

What follows is a (partial) list of reasons to avoid defining every column in your table as VARCHAR(8000).

1. Documentation

When columns are defined accurately with specific data types and lengths, they help provide understanding about the data they contain.

For example, imagine you see following column in a table:

ZipCode VARCHAR(10)

If your database is full of U.S. data, this column definition provides enough information to make some pretty good assumptions:

  • With a length of 10, we can expect some of the data to be in the 5 digit zip code + 4 digit extended zip code format (eg. 12345-6789)
  • Choosing VARCHAR instead of CHAR means there are probably some entries that contain 5 digit zip codes and some with 5+4 digit zip codes (and a dash)

We can't be 100% confident about our assumptions until we verify them by profiling the data itself, but a lot of times having decent guesses is good enough.

2. Validation

Your application (hopefully) has some good client and server side validation built in. Those validations don't always work so it never hurts to have some extra validation on the database side of things as well.

Imagine the zip code field is directly next to the "Delivery Instructions" box on our app's "Shipping Details" page. It's not a stretch to think that a user will someday accidentally type in the special delivery instructions ("Please leave the package behind the house next to the giant pineapple bush") into the zip code field instead of the instructions field.

If your ZipCode field is defined as VARCHAR(8000), that set of delivery instructions will make its way into the database in the completely wrong column.

Have fun cleaning up that dirty data. If instead the field was correctly defined as ZipCode VARCHAR(10), the insert would fail and you would prevent that erroneous data from entering your database.

3. Indexing

Ever try to index a column (or set of columns) that total more than 1700 bytes (or 900 bytes in older versions)?

2019-03-13-19-42-15

SQL Server will let you create indexes that contain keys that may potentially be greater than 1700 bytes, but as soon as you try to insert large data into those indexes you will be greeted with this wall of error:

2019-03-13-19-44-11

Of course there are some tricks you can use to index those wide key combinations, but you don't want to use those unless you absolutely have to.

4. Inaccurate Estimates

Imagine we have 1000 rows of data in our table with a VARCHAR(8000) column:

CREATE TABLE #Varchar8000Test
(
    Id int identity PRIMARY KEY,
    BigColumn varchar(8000)
);

INSERT INTO #Varchar8000Test VALUES ('a');
GO 1000

You'll notice that each of those rows' BigColumn fields only contain 3 bytes of data (1 for "a" + 2 for varchar overhead). That's about 3 KB total for the whole table of data (plus whatever the int column takes up)

You would therefore think that SQL Server would know and use this information when executing a query :

SELECT * FROM #Varchar8000Test ORDER BY BigColumn
OPTION(MAXDOP 1)

However it doesn't:

2019-03-13-20-00-07

SQL Server estimates that each row is ~4 KB! This is because SQL Server doesn't sample the data to estimate the average row size for VARCHAR columns, it uses half of the defined length.

This carries over to other parts of the query plan too, like memory grants:

2019-03-13-20-04-39

Fitting this whole table into memory should only take about ~7 KB (3 KB for our BigColumn data), but SQL Server reserves significantly more than that. In this example an excess of 6 MB isn't a huge deal, but if you are working with a table with many more rows, those excessive memory grants can use up hundreds of megabytes or even gigabytes of memory unnecessarily.

...and more!

This list is only a start: there are plenty of additional reasons why defining every column on your table as VARCHAR(8000). My goal with this post wasn't to create a comprehensive list, but rather to provide enough compelling reasons to have dissuaded my novice self a few years ago from even considering this as an option.

T-SQL Documentation Generator

MJ-t-sql-TuesdayThis post is a response to this month's T-SQL Tuesday #110 prompt by Garry Bargsley.  T-SQL Tuesday is a way for the SQL Server community to share ideas about different database and professional topics every month.

This month's topic asks to share how we automate certain processes.


Watch this week's video on YouTube

I'm a fan of keeping documentation close to the code. I prefer writing my documentation directly above a procedure, function, or view definition because that's where it will be most beneficial to myself and other developers.

Not to mention that's the only place where the documentation has any chance of staying up to date when changes to the code are made.

What drives me crazy though is making a copy of that documentation somewhere else, into a different format. You know, like when someone without database access needs you to send them a description of all of the procedures for a project. Or if you are writing end-user documentation for your functions and views.

Not only is creating a copy of the documentation tedious, but there is no chance that it will stay up to date with future code changes.

So today I want to share how I automate some of my documentation generation directly from my code.

C# XML Style Documentation in T-SQL

C# uses XML to document objects directly in the code:

/// <summary>
/// Retrieves the details for a user.
/// </summary>
/// <param name="id">The internal id of the user.</param>
/// <returns>A user object.</returns>
public User GetUserDetails(int id)
{
    User user = ...
    return user;
}

I like this format: the documentation is directly next to the code and it is structured as XML, making it easy to parse for other uses (eg. use a static document generator to create end-user documentation directly from these comments).

This format is easily transferable to T-SQL:

/*
<documentation>
  <author>Bert</author>
  <summary>Retrieves the details for a user.</summary>
  <param name="@UserId">The internal id of the user.</param>
  <returns>The username, user's full name, and join date</returns>
</documentation>
*/
CREATE PROCEDURE dbo.USP_SelectUserDetails
       @UserId int
AS
BEGIN
    SELECT Username, FullName, JoinDate FROM dbo.[User] WHERE Id = @UserId
END
GO


/*
<documentation>
  <author>Bert</author>
  <summary>Returns the value 'A'.</summary>
  <param name="@AnyNumber">Can be any number.  Will be ignored.</param>
  <param name="@AnotherNumber">A different number.  Will also be ignored.</param>
  <returns>The value 'A'.</returns>
</documentation>
*/
CREATE FUNCTION dbo.UDF_SelectA
(
    @AnyNumber int,
    @AnotherNumber int
)
RETURNS char(1)
AS
BEGIN
       RETURN 'A';
END
GO

Sure, this might not be as visually appealing as the traditional starred comment block, but I've wrestled with parsing enough free formatted text that I don't mind a little extra structure in my comments.

Querying the Documentation

Now that our T-SQL object documentation has some structure, it's pretty easy to query and extract those XML comments:

WITH DocumentationDefintions AS (
SELECT
    SCHEMA_NAME(o.schema_id) as schema_name,
    o.name as object_name,
    o.create_date,
    o.modify_date,
    CAST(SUBSTRING(m.definition,CHARINDEX('<documentation>',m.definition),CHARINDEX('</documentation>',m.definition)+LEN('</documentation>')-CHARINDEX('<documentation>',m.definition)) AS XML) AS Documentation,
    p.parameter_id as parameter_order,
    p.name as parameter_name,
    t.name as parameter_type,
    p.max_length,
    p.precision,
    p.scale,
    p.is_output
FROM
    sys.objects o
    INNER JOIN sys.sql_modules m
        ON o.object_id = m.object_id
    LEFT JOIN sys.parameters p
        ON o.object_id = p.object_id
    INNER JOIN sys.types t
        ON p.system_type_id = t.system_type_id
WHERE 
    o.type in ('P','FN','IF','TF')
)
SELECT
    d.schema_name,
    d.object_name,
    d.parameter_name,
    d.parameter_type,
    t.c.value('author[1]','varchar(100)') as Author,
    t.c.value('summary[1]','varchar(max)') as Summary,
    t.c.value('returns[1]','varchar(max)') as Returns,
    p.c.value('@name','varchar(100)') as DocumentedParamName,
    p.c.value('.','varchar(100)') as ParamDescription
FROM
    DocumentationDefintions d 
    OUTER APPLY d.Documentation.nodes('/documentation') as t(c) 
    OUTER APPLY d.Documentation.nodes('/documentation/param') as p(c)
WHERE
    p.c.value('@name','varchar(100)') IS NULL -- objects that don't have documentation
    OR p.c.value('@name','varchar(100)') = d.parameter_name -- joining our documented parms with the actual ones
ORDER BY
    d.schema_name,
    d.object_name,
    d.parameter_order

This query pulls the parameters of our procedures and functions from sys.parameters and joins them with what we documented in our XML documentation. This gives us some nicely formatted documentation as well as visibility into what objects haven't been documented yet:

image

Only the Beginning

At this point, our procedure and function documentation is easily accessible via query. We can use this to dump the information into an Excel file for a project manager, or schedule a job to generate some static HTML documentation directly from the source every night.

This can be extended even further depending on your needs, but at least this is an automated starting point for generating further documentation directly from the T-SQL source.