Should I Transform My Data In My SQL Query?

Published on: 2020-07-29

Watch this week’s video on YouTube.

Imagine you need to join two tables of data and filter the results. Perhaps you also need to convert some of the values for display as well (eg. 0 => “No”, 1 => “Yes”).

Do you choose to perform all of this in your SQL query? Or do you bring the data into your app and and handle it there with code?

Option 1: Write it in a SQL query

Usually I’m inclined to do as much in the SQL query as possible because:

Optimization

Relational SQL databases tend to be optimized for doing things like joins, filtering, aggregations, etc….

Software engineers have spent decades tuning their sorting algorithms and squashing bugs to make their relational databases handle these operations efficiently. Sure, you can probably find the occasional edge case where you could handcraft some app code to accomplish one of these things faster, but in most cases it’s not worth the additional time and effort.

Ordered Data

The types of operations in the example above (joins, filtering) benefit from ordered data.

Indexes in your database offer data stored in a predefined order, making all of those operations faster. If you need to perform a transformation that can utilize an index, it usually makes sense to let the database handle this operation in the SQL query rather than moving that data to your app and not have an index available.

Data Reduction

Joining and filtering your data in your database allows you to send a reduced number of records over the network to your application.

Network speeds tend to be one of the largest bottle necks in application systems, so eliminating the amount of data you need to pass through them up front in your SQL query can greatly improve the performance of your application.

This is especially true if you have switched to a work-from-home world where network speeds are even more of a bottleneck.

Hardware

This one can be a toss up. The SQL database servers I usually work with are beefy, having significantly more memory and CPU than my general purpose application servers (or my laptop running analytical code).

This means I can run most of my transformation logic faster on my database server than I can anywhere else. This comes at the cost of your performance hungry queries taking away resources from other queries that may be running on the database server at the same time, but on servers not running at capacity this trade off can be worth the speed. More on this in our app code option.

Portability

If you need to share your transformation logic, a SQL query is going to be easier to share with others or include in other processes than application code.

For example, SQL queries are like a universal language in most companies: programmers, analysts, data scientists, technical product managers, and anyone else who may be interested in understanding the business logic will be able to get a high-level understanding by looking at the SQL query. It will also be easier for them to incorporate into their processes.

Contrast that with some app code that is not easy to execute without installing dependencies and understanding language specific syntax; a SQL query will always be easier to share and reuse.

Option 2: Code it in the application

Everything you can write in a SQL query can be performed in most programming languages.

Let’s discuss when it makes sense to store transformation logic in app code.

Domain Specific Performance Improvements

Most SQL relational databases are built to be general purpose. That’s what makes them so powerful: they work well to help solve many different problems without needing domain specific optimizations.

Sometimes though, you may know something about your data that will allow you to work with it more efficiently than a relational database.

For example, maybe your data consists of mostly XML files, or you know you will be doing a lot of recursive processing of your data. Sure, most relational database engines have some way to accomplish these tasks, but performing this type of processing will most likely be faster to do in your application code.

Local Caching

Sometimes your application will need to process and reprocess the same data over and over again.

In these cases, it may make sense to transfer your data across the network a single time to your application, so it can locally process (and reprocess) that data as many times as needed.

Costs

Many enterprise database platforms cost money to license. Application servers usually don’t have those same kinds of costs. That means that running your transformation logic in a SQL query on your database server is often more expensive than in code on an application server.

If your database server is not at full capacity, then this likely isn’t an issue – after all you want to maximize the usage of the licenses you are paying for.

However, if your database environment is at or over capacity, offloading some transformation logic into your application code may make it run faster and will also likely be cheaper to run.

The cloud makes this case even more transparent, with database as a service offerings generally being more expensive than their application compute counterparts.

Conclusion: which is better?

It depends on your situation.

This post speaks to generalized scenarios. For specific uses, there will be times where it makes sense to store your logic in SQL queries, and other times in your application code.

The point is the next time before you dive in and start writing code, think through your restrictions and goals to figure out the best location to run things.

Thanks for reading. You might also enjoy following me on Twitter.

Want to learn even more SQL?

Sign up for my newsletter to receive weekly SQL tips!

SQL Server Converts Numbers to Asterisks

Published on: 2019-12-17

Watch this week’s episode on YouTube.

Pop quiz: What will be the output of the below three statements?

DECLARE @val_varchar varchar(3) = '100';
DECLARE @val_bigint bigint = 100;
DECLARE @val_tinyint tinyint = 100;

PRINT('varchar to varchar:');
PRINT(CAST(@val_varchar AS VARCHAR(2)));

PRINT('bigint to varchar:');
PRINT(CAST(@val_bigint AS VARCHAR(2)));

PRINT('tinyint to varchar:');
PRINT(CAST(@val_tinyint AS VARCHAR(2)));

As you might have guessed, the first conversion truncates the value, leaving the answer as “10”:

varchar to varchar

The second response is also somewhat intuitive – a number can’t be converted into string that has fewer places than original digits:

bigint to varchar

However, the final answer returns:

tinyint to varchar

NOTE: if we change all of the conversions to VARCHAR(3) above, all three results return the string value ‘100’

What the ****?

Why does SQL Server sometimes error when converting a number into a string, but other times succeeds and returns an asterisk?

I don’t know.

The best (and logical) answer I could find online is from Robert Sheldon, who attributes it to poor error handling practices, “…before error handling got a more reputable foothold.”

My understanding is that the asterisk was originally the default truncation value. Later, when the SQL Server development team was adding the bigint datatype, they decided throwing an exception is a better way to handle truncation in errors. However, probably for backwards compatibility, they never went back to change the behavior of the asterisks for pre-existing datatypes.

The documentation is clear about this behavior as well: any int, smallint, or tiny int when converted to char or varchar with fewer characters will result in an asterisk.

This becomes a problem if you create strings out of tinyints, smallints, and ints and suddenly start receiving values larger than you originally expected:

DECLARE @Age int = 75;
PRINT(CAST(@Age AS VARCHAR(2)) + ' years old.');

SET @Age = 100;
PRINT(CAST(@Age AS VARCHAR(2)) + ' years old.');
* years old

Conversions are hard

So is SQL Server to blame for this issue? Sure. But I can empathize: breaking changes are not something people typically want to add to a product.

I’ve written before how implicit conversions can lead to funny results in SQL Server. This asterisk scenario is another case of SQL Server having to make a judgement call about handling an impossible situation you created for it.

While it’s easy to complain about whether SQL Server handles the conversion the correct way or not, at the end of the day the responsibility falls on you to prevent these types of impossible calculations from being requested of SQL Server in the first place.

Thanks for reading. You might also enjoy following me on Twitter.

Want to learn even more SQL?

Sign up for my newsletter to receive weekly SQL tips!

Azure SQL vs Azure Table Storage

Published on: 2019-12-03

Watch this week’s episode on YouTube.

A year ago I built an app to keep track of pickup volleyball game scores and payments. It works well, but after a year of regular use it’s time to update it with some improvements.

As part of the update process, I’m rewriting the data layer to use Azure’s NoSQL Table Storage instead of Azure SQL.

Today I’ll walk you through some of the details I considered when deciding to switch to a NoSQL storage solution.

Why Azure NoSQL Table Storage

Cost

Azure SQL has been good to me – it works, it’s familiar, and it’s relatively inexpensive. However, even though the $5/month Azure SQL bill isn’t cost prohibitive (especially with free credits :)), ideally I’d love the app to be able to fund itself via the fractions of a penny that get rounded up as part of the app’s balancing logic.

Azure Table Storage in comparison is dirt cheap. As of the time I’m writing this, it will cost me about $.05/month to store and access my < 1gb of application data.

Another option I considered is Azure SQL Serverless. This is a nice alternative because it would allow me to keep the relational structure of my data, however based on my historical app usage patterns it would definitely end up costing more than Azure Table Storage. Cool service, but not a good fit for my scenario.

Simplicity

Another reason I went with Azure Table Storage is its simplicity.

Azure offers an alternative table storage option called Cosmos DB. Cosmos DB is Azure’s premium table storage offering, but it was overkill for what my app needed.

Yes, Cosmo DB’s global distribution is cool. Yes, its additional indexes on my data is great. However, as I mentioned, cost is the biggest factor in how I’m deciding on a storage solution for this project. And Cosmos DB has lots of great features, but they come at a price.

My preference for this app is to pay more up front in development time to create a better design then rely on a service that does some of that for me at a higher monthly cost. In this case, simple features will accomplish what I need, so that’s what I’m going with.

Design Considerations for Azure Table Storage

When rewriting my app’s data layer, there were several new things that I had to account for in Azure Table Storage.

Primary Key

The primary key in Azure Table Storage is made up of two columns: PartitionKey and RowKey. This composite primary key is also the clustered (and only!) index for the table.

The PartitionKey in particular is important because it determines whether related rows of data will exist on the same underlying server or not. Different PartitionKey values may end up on different servers. This can be a good thing if you factor in parallel access in your design, or a significant bottleneck if all of your data ends up residing on a single server.

Latency

There are two aspects of latency to consider.

The first is how far you are from your Azure region. For me, this isn’t a huge deal since the only users of the app are currently in northeast Ohio, so choosing to store all of the data in the same region is good enough.

The bigger consideration for this app is that with Azure Table Storage, you can’t do any joins with your data. Well, you can join in your app, but you can’t join in Azure itself.

This means that design is critical to reducing latency since joining multiple tables of data in your app will require multiple calls to the Azure Table Storage service. This is not something that is necessarily a deal breaker, just something that needs to be considered, especially if coming from a relational SQL background where you are using JOINs to filter and reduce your data before it is returned to your app.

Row Size Limitations

NoSQL gets a lot hate for its common pattern of storing giant blobs of semi-structured data in a single field (this design makes sense when considering the latency considerations).

However, this becomes a delicate balancing act since each row in Azure Table Storage can only be a maximum of 1mb in size. This causes you to want to fit as much data into a single call as possible (to reduce the number of calls) while also not exceeding the 1mb row size.

Azure Table Storage does allow up to 252 columns of data per table (plus the required PartitionKey, RowKey, and Timestamp columns) so at least your 1mb of data will be organized.

…and more!

The above details were the primary considerations I had to take into account for my specific app. There are things data like data consistency, durability, and more that you may want to take into account based on your app’s goals and usage patterns.

Conclusion

While NoSQL can often break many of the relational concepts we are used to using, it is often the means for achieving the cheapest pay-only-for-what-you-use pricing on cloud providers.

Thanks for reading. You might also enjoy following me on Twitter.

Want to learn even more SQL?

Sign up for my newsletter to receive weekly SQL tips!