
Parsing HTML in SQL Server

Watch this week's video on YouTube

Recently I was asked how to parse text out of an HTML fragment stored in SQL Server.

Over the next few seconds my brain processed the following ideas:

  • SQL Server is not meant for parsing HTML. Parse the data with something else.
  • T-SQL does have functions like REPLACE, CHARINDEX, and SUBSTRING though, perfect for searching for tags and returning just the values between them.
  • CLRs could do it, probably using some kind of HTML shredding library. You also might be able to use XmlReader to do something with it...
  • Wait a minute, SQL Server has XML parsing functions built in!

Maybe you see where this is going.

WARNING - this is a terrible idea

Parsing HTML with T-SQL is not a great idea. It's dirty, it's prone to breaking, and it will make your server's CPUs cry that they aren't being used for some nobler cause. If you can parse your HTML somewhere outside of SQL Server, then DO IT THERE.

With that said, if you absolutely need to parse HTML on SQL Server, the best solution is probably to write a CLR.

However, if you are stuck in a bind and plain old T-SQL is the only option available to you, then you might be able to use SQL Server's XML datatype and functions to get this done. I've been there before and can sympathize.

So anyway, here goes nothing:

Using XML to parse HTML

Let's say we have the following fragment of HTML (copied from a bootstrap example template):

DECLARE @html xml = ' 
    <div class="container"> 
        <div class="card-deck mb-3 text-center"> 
            <div class="card-body"> 
                <h1 class="card-title pricing-card-title">$15 <small class="text-muted">/ mo</small></h1> 
                <ul class="list-unstyled mt-3 mb-4"> 
                    <li>20 users included</li> 
                    <li>10 GB of storage</li> 
                    <li>Priority email support</li> 
                    <li>Help center access</li> 
                </ul> 
                <button type="button" class="btn btn-lg btn-block btn-primary">Get started</button> 
            </div> 
        </div> 
    </div> 
'; 

If we wanted to, say, extract all of the text from this HTML (to allow text mining without all of the tags getting in the way), we could easily do this using the XML nodes() and value() methods:

-- Get all text values from elements 
SELECT 
    T.C.value('.','varchar(max)')  AS AllText
FROM 
    @html.nodes('/') T(C);

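If you would rather get each piece of text as its own row instead of one concatenated string, one option is to target the text() nodes directly. This is only a minimal sketch: it assumes a shortened, hypothetical version of the fragment above, and the TextValue alias is purely illustrative.

```sql
-- Shortened, hypothetical version of the HTML fragment above
DECLARE @html xml = '
    <div class="card-body">
        <h1 class="card-title">$15 <small class="text-muted">/ mo</small></h1>
        <ul>
            <li>20 users included</li>
            <li>10 GB of storage</li>
        </ul>
    </div>
';

-- Return one row per text node instead of a single concatenated value
SELECT
    T.C.value('.', 'varchar(max)') AS TextValue
FROM
    @html.nodes('//text()') T(C);
```

Each row can then be trimmed, filtered, or tokenized on its own, which tends to be easier for text mining than splitting one long string.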

If we want to only extract the items from the list elements, we can write some XQuery to select only those elements:

-- Get the text from within certain elements
SELECT
    T.C.value('.','varchar(100)') AS ListValues
FROM
    @html.nodes('//*[local-name()=("li")]') T(C);

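When only one specific list item is needed, the value() method can be called directly on the variable. The sketch below assumes a trimmed-down, hypothetical list; note that value() requires a path that resolves to a single node, which is why the path is wrapped in parentheses and given a position.

```sql
-- Hypothetical, trimmed-down list fragment
DECLARE @html xml = '<ul><li>20 users included</li><li>10 GB of storage</li><li>Priority email support</li></ul>';

-- value() needs a single node, so wrap the path in parentheses and pick a position
SELECT
    @html.value('(//li)[1]', 'varchar(100)')      AS FirstItem,  -- 20 users included
    @html.value('(//li)[last()]', 'varchar(100)') AS LastItem;   -- Priority email support
```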

Finally, we can also do things like select HTML fragments based on an attribute to parse further in subsequent steps. If I want to select the div with a class of "card-body", I can write:

-- Get a fragment of HTML
SELECT
    T.C.query('.') AS CardBody
FROM
    @html.nodes('//div[@class="card-body"]') T(C);

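The "parse further in subsequent steps" part can be sketched by chaining a second nodes() call onto the first with CROSS APPLY. The fragment and the Card/Item aliases below are hypothetical; the idea is simply that the second path is evaluated relative to each matched div.

```sql
-- Hypothetical fragment with two divs so the attribute filter has something to exclude
DECLARE @html xml = '
    <div class="container">
        <div class="card-body">
            <ul>
                <li>20 users included</li>
                <li>10 GB of storage</li>
            </ul>
        </div>
        <div class="footer">
            <ul>
                <li>Not part of the card</li>
            </ul>
        </div>
    </div>
';

-- Step 1: select the div by its class attribute
-- Step 2: shred only the <li> elements found inside that div
SELECT
    Item.Li.value('.', 'varchar(100)') AS CardListValues
FROM
    @html.nodes('//div[@class="card-body"]') Card(C)
    CROSS APPLY Card.C.nodes('.//li') Item(Li);
```

Only the two list items inside the card-body div should come back; the footer item is filtered out by the first step.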

Yuck

To reiterate - you don't want to do any of the above unless you have no other choice.

The XML parsing functions will not parse all HTML, so you may need to do some pre-processing on your HTML data first (removing invalid HTML, closing tags, etc...).
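As a rough illustration of that pre-processing, the sketch below uses TRY_CONVERT (available in SQL Server 2012 and later) to check whether a string will cast to xml, then patches two common offenders with REPLACE. The sample markup is hypothetical, and real-world HTML will usually need far more cleanup than this.

```sql
-- Hypothetical HTML that browsers accept but that is not well-formed XML
DECLARE @raw nvarchar(max) = N'<p>Line one<br>Line two&nbsp;here</p>';

-- TRY_CONVERT returns NULL instead of raising an error when the cast fails,
-- so this should come back NULL for the raw string
SELECT TRY_CONVERT(xml, @raw) AS BeforeCleanup;

-- Close the void <br> tag and swap the HTML-only entity for a numeric character reference
DECLARE @clean nvarchar(max) =
    REPLACE(REPLACE(@raw, N'<br>', N'<br/>'), N'&nbsp;', N'&#160;');

-- The cleaned string should now cast to xml
SELECT TRY_CONVERT(xml, @clean) AS AfterCleanup;
```

Running TRY_CONVERT over a whole column is also a cheap way to find which rows still need attention before attempting any XQuery against them.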

It works beautifully with the bootstrap fragment above, but your results may vary. Good luck!

Why make?

This post is a response to this month's T-SQL Tuesday #111 prompt by Andy Leonard. T-SQL Tuesday is a way for the SQL Server community to share ideas about different database and professional topics every month.

In this month's topic, Andy asks: why do we do what we do?


Two years ago, I was bored.

I'd come home from work, spend my free time watching Netflix and surfing the internet, occasionally tinker with some random side projects, and eventually go to bed. Rinse and repeat, day in and day out.

I felt unfulfilled.  While I value free time and relaxation, I had an overabundance of it.  I felt like I should be doing something more productive with at least some of that time.  I wanted to work on my "professional development" somehow, but it was extremely difficult to get motivated to work on boring career stuff.

I decided what I needed was a long-term project that would allow me to have fun and be creative, while also having some positive personal and professional development benefits; what I was looking for was the ULTIMATE side project.

After spending some time thinking about different ideas, I decided to make videos about SQL Server.  Not only would I enjoy learning more about how SQL Server works (fun), but I could get practice writing and speaking (career) as well as get to incorporate my other hobby of film making into the mix (creative).

At first it felt forced; while I enjoyed learning new things about SQL Server, it was not easy thinking of topics.  Writing and editing was strenuous, but coming up with jokes and visual ways to convey ideas was fun.  Filming (and lighting and audio recording) was hard, but editing has always been pure pleasure for me.

So while at times coming up with a weekly bit of content was challenging, I kept at it because not only was it good for me, but I incorporated enough fun and creative elements into the process to look forward to it and keep going with it.

Fast forwarding to today, the process still isn't perfect but things have gotten better: I have enough ideas to probably last me a few years (and generating more all the time), writing is still tough but I've seen noticeable progress so I'm motivated to keep at it, I still don't like being in front of a camera but I have a dramatically easier time speaking about technical topics so the practice has paid off there, and while every episode isn't as creative as I'd like, I have a lot of fun being weird and coming up with new ideas for weekly videos.

Not only that, I now have new motivating factors that I didn't have from day one.  I've made friends with a lot of people in the SQL Server community, and they are fantastic and supportive.  Many of them even want to collaborate and make fun videos which is something I always look forward to.  The audience that consumes the content is wonderful as well; every time I receive a thank you email or comment, I am filled with joy.  And obviously all of the skills I have learned - technical, presenting, and networking - have helped immensely in my day-to-day.

In conclusion, the reasons that caused me to start creating SQL Server videos still apply; however, over time that list of motivators has grown, and it helps me remain excited about what I do, even when the challenges feel greater some weeks than others.

Automating Database Maintenance with Jess Pomfret and dbatools

Watch this week's video on YouTube

dbatools is one of the coolest community projects I've seen - it is amazing how many commands are available to help make managing your SQL Server instances a breeze.

This week I had the opportunity to learn how to use dbatools to automate backups, change recovery models, and discover additional dbatools commands from dbatools contributor Jess Pomfret.

The video above goes over the basics, but be sure to check out Jess's companion blog post to learn more about these commands.

And once you start using dbatools and have ideas for adding more functionality, check out a previous video I did with Drew Furgiuele to learn about contributing to the community project yourself.

Cardinality: Not Just For The Birds

Watch this week's video on YouTube

When building indexes for your queries, the order of your index key columns matters.  SQL Server can make the most effective use of an index if the data in that index is stored in the same order as what your query requires for a join, where predicate, grouping, or order by clause.

But if your query requires multiple key columns because of multiple predicates (e.g., WHERE Color = 'Red' AND Size = 'Medium'), in what order should you define the columns in your index key?

Cardinality

In SQL Server, cardinality refers to the number of distinct elements in a column.  All other considerations aside, when you are defining the key columns for your index, the column with the highest cardinality, or the largest number of distinct values, should go first.

To understand why, let's go back to our example columns of Color and Size.  If we have a table of data indicating the colors and sizes of various birds, it may look something like this:

[Image: table of bird details]

If we were to count the number of distinct values in each of our Color and Size columns, we would find out we have 20 distinct colors, but only 5 distinct sizes:

SELECT 
    COUNT(DISTINCT Color) AS DistinctColors, 
    COUNT(DISTINCT Size) AS DistinctSizes
FROM 
    dbo.Birds

(To make things easier for this example, the data in this table is perfectly evenly distributed across all 20 colors and 5 sizes: each color is represented by one of each of the five sizes, making for a total of 100 rows.)

If we were to put Size as our leading index key column, SQL Server would immediately be able to narrow down the number of rows it has to search to match our predicate (WHERE Color = 'Red' AND Size = 'Medium') to 20 rows. After all, we can eliminate all rows where the size is not equal to Medium:

[Image: rows ordered by Size]

However, if we instead put Color as our first column, we can immediately eliminate 95% of the possibilities in our data set – only 5 rows with a value of 'Red' remain, one for each of our 5 distinct sizes (remember the data is perfectly distributed):

[Image: rows ordered by Color]

In most scenarios, putting the column with the highest cardinality first will allow SQL Server to filter out most of the data it knows it doesn't need, allowing it to focus on a smaller subset of data that it does still need to compare.
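The arithmetic above can be reproduced with a small script. This is a sketch under stated assumptions: #Birds is a hypothetical temp table standing in for dbo.Birds, the color and size names are made up, and the index name is illustrative.

```sql
-- Hypothetical stand-in for dbo.Birds: 20 colors x 5 sizes = 100 evenly distributed rows
CREATE TABLE #Birds (Color varchar(20) NOT NULL, Size varchar(20) NOT NULL);

INSERT INTO #Birds (Color, Size)
SELECT c.Color, s.Size
FROM (VALUES ('Red'),('Orange'),('Yellow'),('Green'),('Blue'),
             ('Indigo'),('Violet'),('Pink'),('Brown'),('Black'),
             ('White'),('Gray'),('Tan'),('Teal'),('Gold'),
             ('Silver'),('Olive'),('Maroon'),('Navy'),('Cyan')) AS c(Color)
CROSS JOIN (VALUES ('Tiny'),('Small'),('Medium'),('Large'),('Huge')) AS s(Size);

-- Rows left to examine after filtering on only the leading key column
SELECT
    (SELECT COUNT(*) FROM #Birds WHERE Size = 'Medium') AS RowsIfSizeLeads,   -- 20
    (SELECT COUNT(*) FROM #Birds WHERE Color = 'Red')   AS RowsIfColorLeads;  -- 5

-- Highest-cardinality column first, per the general rule
CREATE NONCLUSTERED INDEX IX_Birds_Color_Size ON #Birds (Color, Size);
```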

There are instances where you might want to deviate from this general rule though, like when you are trying to maximize an index's use by multiple queries; sometimes it might make sense to not put the columns in highest cardinality order if it means more queries are going to be able to make use of a single index.
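One such deviation might look like the sketch below, which assumes a second, hypothetical query that filters on Size alone. The table and index names are illustrative, and whether the optimizer actually picks the index depends on the data and the rest of the query.

```sql
-- Hypothetical stand-in for dbo.Birds (left empty; only the index shape matters here)
CREATE TABLE #BirdsShared (Color varchar(20) NOT NULL, Size varchar(20) NOT NULL);

-- Size leads even though it has lower cardinality, so both queries below can seek on it
CREATE NONCLUSTERED INDEX IX_BirdsShared_Size_Color ON #BirdsShared (Size, Color);

-- Query 1: filters on both key columns
SELECT Color, Size FROM #BirdsShared WHERE Color = 'Red' AND Size = 'Medium';

-- Query 2: filters on Size only; an index keyed (Color, Size) could not seek on this predicate
SELECT Color, Size FROM #BirdsShared WHERE Size = 'Medium';
```

The trade-off is one slightly less selective seek for the first query in exchange for not having to maintain a second index for the other.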

Optimizing for Ad Hoc Workloads

Watch this week's video on YouTube

The execution plan cache is a great feature: after SQL Server goes through the effort of generating a query plan, SQL Server saves that plan in the plan cache so it can be reused later.

One downside to SQL Server caching almost all plans by default is that some of those plans won't ever get reused. Those single use plans will exist in the plan cache, inefficiently tying up a piece of the server's memory.

Today I want to look at a feature that will keep these one-time use plans out of the plan cache.
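Before changing anything, it can help to see how much of the cache single-use plans currently occupy. The query below is one possible way to check, using only columns from sys.dm_exec_cached_plans; the column aliases are illustrative, and it requires permission to view server state.

```sql
-- Summarize compiled plans that have been used exactly once
SELECT
    objtype,
    COUNT(*) AS SingleUsePlans,
    SUM(CAST(size_in_bytes AS bigint)) / 1024 / 1024 AS SizeMB
FROM
    sys.dm_exec_cached_plans
WHERE
    usecounts = 1
    AND cacheobjtype = 'Compiled Plan'
GROUP BY
    objtype
ORDER BY
    SizeMB DESC;
```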

Plan Stubs

Instead of filling the execution plan cache with plans that will never get reused, the optimize for ad hoc workloads option will cache a plan stub instead of the full plan. The plan stub is significantly smaller in size and is only replaced with the full execution plan when SQL Server recognizes that the same query has executed multiple times.

This reduces the amount of space one-time queries take up in the cache, allowing more reusable plans to remain in the cache for longer periods of time.

Enabling this server-level feature is as easy as the following (a database-scoped version also exists):

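-- Advanced options must be visible before this setting can be changed
EXEC sp_configure 'show advanced options', 1;
GO
RECONFIGURE;
GO
-- Cache a small plan stub the first time an ad hoc query runs
EXEC sp_configure 'optimize for ad hoc workloads', 1;
GO
RECONFIGURE;
GO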
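To confirm the setting took effect, one option is to read it back from sys.configurations. A small sketch:

```sql
-- value is the configured value; value_in_use is what the server is currently running with
SELECT
    name,
    value,
    value_in_use
FROM
    sys.configurations
WHERE
    name = 'optimize for ad hoc workloads';
```

Both columns should show 1 once RECONFIGURE has run.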

Once enabled you can watch the plan stub take up less space in the cache:

-- Run each of these queries once
DECLARE @Username varchar = 'A'
SELECT UserName 
FROM IndexDemos.dbo.[User] 
WHERE UserName like @Username+'%';
GO

DECLARE @Username varchar = 'B'
SELECT UserName 
FROM IndexDemos.dbo.[User] 
WHERE UserName like @Username+'%';
GO

-- Inspect the plan cache entries for the two queries above
SELECT 
    cp.cacheobjtype,
    cp.objtype,
    cp.plan_handle,
    cp.size_in_bytes,
    qp.query_plan,
    st.text
FROM
    sys.dm_exec_cached_plans cp
    CROSS APPLY sys.dm_exec_query_plan(cp.plan_handle) qp
    INNER JOIN sys.dm_exec_query_stats qs
        ON cp.plan_handle = qs.plan_handle
    CROSS APPLY sys.dm_exec_sql_text(qs.sql_handle) st
WHERE 
    st.text like 'DECLARE @Username varchar =%';

424 bytes each, these plan stubs are tiny!

Now if we run our second query filtering on UserName LIKE 'B%' again and then check the plan cache, we'll notice the stub is replaced with an actual compiled plan.

The downside to plan stubs is that they add some CPU load to our server: each query gets compiled twice before it gets reused from cache.  However, since plan stubs reduce the size of our plan cache, this allows more reusable queries to be cached for longer periods of time.
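To get a feel for that trade-off on a given server, a rough sketch like the following compares how many ad hoc entries are stubs versus full compiled plans, and how much space each group takes. The aliases are illustrative.

```sql
-- Compare stubs against full compiled plans for ad hoc queries
SELECT
    cacheobjtype,                                      -- 'Compiled Plan' or 'Compiled Plan Stub'
    COUNT(*) AS PlanCount,
    SUM(CAST(size_in_bytes AS bigint)) AS TotalBytes,
    AVG(CAST(size_in_bytes AS bigint)) AS AvgBytes
FROM
    sys.dm_exec_cached_plans
WHERE
    objtype = 'Adhoc'
GROUP BY
    cacheobjtype;
```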

Great! All my cache problems will be solved

Not necessarily.

If your workload truly involves lots of ad hoc queries (like many analysts all working on different problems or dynamic SQL that's generating completely different statements on every execution), enabling Optimize for Ad hoc Workloads may be your best option (Kimberly Tripp also has a great alternative: clearing single use plans automatically on a schedule).

However, oftentimes single-use query plans have a more nefarious origin: unparameterized queries. In this case, enabling Optimize for Ad hoc Workloads may not negatively impact your server, but it certainly won't help. Why? Because those unparameterized queries will still be generated, each one requiring its own compilation.

Brent Ozar has a good overview of why this happens, but the short answer is to force parameterization on your queries. When you enable forced parameterization, SQL Server will automatically parameterize your queries if they aren't already, reducing the number of one-off query plans in your cache.
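Forced parameterization is a database-level setting. A minimal sketch, assuming the IndexDemos database used in the earlier examples:

```sql
-- Turn on forced parameterization for one database
ALTER DATABASE IndexDemos SET PARAMETERIZATION FORCED;

-- Confirm the setting (1 = forced, 0 = simple)
SELECT
    name,
    is_parameterization_forced
FROM
    sys.databases
WHERE
    name = N'IndexDemos';

-- To go back to the default behavior:
-- ALTER DATABASE IndexDemos SET PARAMETERIZATION SIMPLE;
```

Because this changes how every query in the database is compiled, it is worth trying in a test environment before enabling it in production.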

Whether you are dealing with too many single use queries on your server or some other problem, just remember to find the root cause of the problem instead of just treating the symptoms.