Parsing HTML in SQL Server

Published on: 2019-02-12

Recently I was asked how to parse text out of an HTML fragment stored in SQL Server.

Over the next few seconds my brain processed the following ideas:

  • SQL Server is not meant for parsing HTML. Parse the data with something else.
  • T-SQL does have functions like REPLACE, CHARINDEX, and SUBSTRING though, perfect for searching for tags and returning just the values between them.
  • CLRs could do it, probably using some kind of HTML shredding library. You also might be able to use XMLReader to do something with it…
  • Wait a minute, SQL Server has XML parsing functions built in!

Maybe you see where this is going.

WARNING – this is a terrible idea

Parsing HTML with T-SQL is not a great idea. It’s dirty, it’s prone to breaking, and it will make your server’s CPUs cry that they aren’t being used for some nobler cause. If you can parse your HTML somewhere outside of SQL Server, then DO IT THERE.

With that said, if you absolutely need to parse HTML on SQL Server, the best solution is probably to write a CLR.

However, if you are stuck in a bind and plain old T-SQL is the only option available to you, then you might be able to use SQL Server’s XML datatype and functions to get this done. I’ve been there before and can sympathize.

So anyway, here goes nothing:

Using XML to parse HTML

Let’s say we have the following fragment of HTML (copied from a bootstrap example template):

DECLARE @html xml = ' 
    <div class="container"> 
        <div class="card-deck mb-3 text-center"> 
            <div class="card-body"> 
                <h1 class="card-title pricing-card-title">$15 <small class="text-muted">/ mo</small></h1> 
                <ul class="list-unstyled mt-3 mb-4"> 
                    <li>20 users included</li> 
                    <li>10 GB of storage</li> 
                    <li>Priority email support</li> 
                    <li>Help center access</li> 
                </ul> 
                <button type="button" class="btn btn-lg btn-block btn-primary">Get started</button> 
            </div> 
        </div> 
    </div> 
'; 

If we wanted to say extract all of the text from this HTML (to allow text mining without all of the tags getting in the way) we could easily do this using the XML nodes() and value() methods:

-- Get all text values from elements 
SELECT 
    T.C.value('.','varchar(max)')  AS AllText
FROM 
    @html.nodes('/') T(C);

If we want to only extract the items from the list elements, we can write some XQuery to select only those elements:

-- Get a fragment of HTML 
SELECT 
    T.C.value('.','varchar(100)') AS ListValues 
FROM 
    @html.nodes('//*[local-name()=("li")]') T(C); 

Finally, we can also do things like select HTML fragments based on an attribute to parse further in subsequent steps. If I want to select the div with a class of “card-body”, I can write:

-- Get the text from within certain elements 
SELECT 
    T.C.query('.') AS CardBody 
FROM 
    @html.nodes('//div[@class="card-body"]') T(C); 

Yuck

To reiterate – you don’t want to do any of the above unless you have no other choice.

The XML parsing functions will not parse all HTML, so you may need to do some pre-processing on your HTML data first (removing invalid HTML, closing tags, etc…).

It works beautifully in the above example but your results may very. Good luck!

Thanks for reading. You might also enjoy following me on Twitter.

Want to learn even more SQL?

Sign up for my newsletter to receive weekly SQL tips!

T-SQL Documentation Generator

Published on: 2019-01-08

This post is a response to this month’s T-SQL Tuesday #110 prompt by Garry Bargsley.  T-SQL Tuesday is a way for the SQL Server community to share ideas about different database and professional topics every month.

This month’s topic asks to share how we automate certain processes.


I’m a fan of keeping documentation close to the code. I prefer writing my documentation directly above a procedure, function, or view definition because that’s where it will be most beneficial to myself and other developers.

Not to mention that’s the only place where the documentation has any chance of staying up to date when changes to the code are made.

What drives me crazy though is making a copy of that documentation somewhere else, into a different format. You know, like when someone without database access needs you to send them a description of all of the procedures for a project. Or if you are writing end-user documentation for your functions and views.

Not only is creating a copy of the documentation tedious, but there is no chance that it will stay up to date with future code changes.

So today I want to share how I automate some of my documentation generation directly from my code.

C# XML Style Documentation in T-SQL

C# uses XML to document objects directly in the code:

/// <summary>
/// Retrieves the details for a user.
/// </summary>
/// <param name="id">The internal id of the user.</param>
/// <returns>A user object.</returns>
public User GetUserDetails(int id)
{
    User user = ...
    return user;
}

I like this format: the documentation is directly next to the code and it is structured as XML, making it easy to parse for other uses (eg. use a static document generator to create end-user documentation directly from these comments).

This format is easily transferable to T-SQL:

/*
<documentation>
  <author>Bert</author>
  <summary>Retrieves the details for a user.</summary>
  <param name="@UserId">The internal id of the user.</param>
  <returns>The username, user's full name, and join date</returns>
</documentation>
*/
CREATE PROCEDURE dbo.USP_SelectUserDetails
       @UserId int
AS
BEGIN
    SELECT Username, FullName, JoinDate FROM dbo.[User] WHERE Id = @UserId
END
GO
 
 
/*
<documentation>
  <author>Bert</author>
  <summary>Returns the value 'A'.</summary>
  <param name="@AnyNumber">Can be any number.  Will be ignored.</param>
  <param name="@AnotherNumber">A different number.  Will also be ignored.</param>
  <returns>The value 'A'.</returns>
</documentation>
*/
CREATE FUNCTION dbo.UDF_SelectA
(
    @AnyNumber int,
    @AnotherNumber int
)
RETURNS char(1)
AS
BEGIN
       RETURN 'A';
END
GO

Sure, this might not be as visually appealing as the traditional starred comment block, but I’ve wrestled with parsing enough free formatted text that I don’t mind a little extra structure in my comments.

Querying the Documentation

Now that our T-SQL object documentation has some structure, it’s pretty easy to query and extract those XML comments:

WITH DocumentationDefintions AS (
SELECT
    SCHEMA_NAME(o.schema_id) as schema_name,
    o.name as object_name,
    o.create_date,
    o.modify_date,
    CAST(SUBSTRING(m.definition,CHARINDEX('<documentation>',m.definition),CHARINDEX('</documentation>',m.definition)+LEN('</documentation>')-CHARINDEX('<documentation>',m.definition)) AS XML) AS Documentation,
    p.parameter_id as parameter_order,
    p.name as parameter_name,
    t.name as parameter_type,
    p.max_length,
    p.precision,
    p.scale,
    p.is_output
FROM
    sys.objects o
    INNER JOIN sys.sql_modules m
        ON o.object_id = m.object_id
    LEFT JOIN sys.parameters p
        ON o.object_id = p.object_id
    INNER JOIN sys.types t
        ON p.system_type_id = t.system_type_id
WHERE 
    o.type in ('P','FN','IF','TF')
)
SELECT
    d.schema_name,
    d.object_name,
    d.parameter_name,
    d.parameter_type,
    t.c.value('author[1]','varchar(100)') as Author,
    t.c.value('summary[1]','varchar(max)') as Summary,
    t.c.value('returns[1]','varchar(max)') as Returns,
    p.c.value('@name','varchar(100)') as DocumentedParamName,
    p.c.value('.','varchar(100)') as ParamDescription
FROM
    DocumentationDefintions d 
    OUTER APPLY d.Documentation.nodes('/documentation') as t(c) 
    OUTER APPLY d.Documentation.nodes('/documentation/param') as p(c)
WHERE
    p.c.value('@name','varchar(100)') IS NULL -- objects that don't have documentation
    OR p.c.value('@name','varchar(100)') = d.parameter_name -- joining our documented parms with the actual ones
ORDER BY
    d.schema_name,
    d.object_name,
    d.parameter_order

This query pulls the parameters of our procedures and functions from sys.parameters and joins them with what we documented in our XML documentation. This gives us some nicely formatted documentation as well as visibility into what objects haven’t been documented yet:

Only the Beginning

At this point, our procedure and function documentation is easily accessible via query. We can use this to dump the information into an Excel file for a project manager, or schedule a job to generate some static HTML documentation directly from the source every night.

This can be extended even further depending on your needs, but at least this is an automated starting point for generating further documentation directly from the T-SQL source.

Thanks for reading. You might also enjoy following me on Twitter.

Want to learn even more SQL?

Sign up for my newsletter to receive weekly SQL tips!

Displaying Long Values in SSMS

Published on: 2018-08-07

Click here to watch this week’s video on YouTube.

I write a lot of dynamic SQL and frequently encounter variables that contain many characters:

DECLARE @LongValue NVARCHAR(MAX) = CAST('' AS NVARCHAR(MAX)) + 
N'SELECT
    ''A'' AS AShortValue,
    '''+REPLICATE(N'A',4000)+''' as ALongValue
ORDER BY 1';

This variable is 4059 characters long, and when I execute it it runs great.

SELECT LEN(@LongValue); -- 4059 characters
EXEC(@LongValue);

A homage to one of my favorite Uncyclopedia entries.

If my programmatically built query had an error in it, the first thing I’d want to do when debugging it would be to see the the text of the entire @LongValue variable.

I could do this by just saying SELECT @LongValue, and while recent versions of SSMS will display the whole value for me, it completely loses my formatting which stinks (and is especially bad if there are any comments prefixed with   in the query):

Need a ultra HD wide display to fit this all on one screen.

I can say PRINT @LongValue, which will keep the formatting, but it will get trimmed at 4,000 characters (notice the missing ORDER BY):

Some Better Ways

Erik Darling posts one solution to this problem in his T-SQL Tuesday #104 entry (as well as some other problems/solutions for lengthy SQL variables). Specifically he links to a SQL string printing script that will loop through the lengthy variable and print everything while maintaining formatting:

Not perfectly formatted, but good enough.

And while I like using that stored procedure on my primary server, I’m too lazy to install it every where I need it.

Instead, I have a couple of go-to solutions that work on all SQL Server instances 2008 forward.

Solution 1: CAST to XML
SELECT CAST(@LongValue AS XML) AS LongValue

Casting the long variable to XML allows SSMS to generate a clickable, single-row result that preserves formatting:

IntelliSense complains but I’m OK with it

The only downside to this approach is that certain charaters, like “<” and “>”, can’t be converted to XML:

Solution 2: FOR XML PATH

A slight variation on solution 1, we can get similar results using FOR XML PATH:

SET @LongValue = '<' + @LongValue -- Let's add in an invalid character
SELECT @LongValue FOR XML PATH('')

FOR XML PATH is one of the most abused SQL Server functions.

In this solution, the “<” is escaped to “&lt;”, which isn’t perfect but at least my variable can be displayed with formatting intact.  A quick find and replace for any escaped characters and I’m good to go.

Good Enough

These techniques aren’t perfect, but for purposes of debugging dynamically generated code they are good enough.

Maybe one day SSMS will print longer strings or include a syntax formatter and I won’t care nearly as much.

And if not, I’ll happily continue to abuse FOR XML to do things other than generate XML documents.

Thanks for reading. You might also enjoy following me on Twitter.

Want to learn even more SQL?

Sign up for my newsletter to receive weekly SQL tips!