Parsing HTML in SQL Server

Watch this week's video on YouTube

Recently I was asked how to parse text out of an HTML fragment stored in SQL Server.

Over the next few seconds my brain processed the following ideas:

  • SQL Server is not meant for parsing HTML. Parse the data with something else.
  • T-SQL does have functions like REPLACE, CHARINDEX, and SUBSTRING though, perfect for searching for tags and returning just the values between them.
  • CLRs could do it, probably using some kind of HTML shredding library. You also might be able to use XMLReader to do something with it...
  • Wait a minute, SQL Server has XML parsing functions built in!

Maybe you see where this is going.

WARNING - this is a terrible idea

Parsing HTML with T-SQL is not a great idea. It's dirty, it's prone to breaking, and it will make your server's CPUs cry that they aren't being used for some nobler cause. If you can parse your HTML somewhere outside of SQL Server, then DO IT THERE.

With that said, if you absolutely need to parse HTML on SQL Server, the best solution is probably to write a CLR.

However, if you are stuck in a bind and plain old T-SQL is the only option available to you, then you might be able to use SQL Server's XML datatype and functions to get this done. I've been there before and can sympathize.

So anyway, here goes nothing:

Using XML to parse HTML

Let's say we have the following fragment of HTML (copied from a bootstrap example template):

DECLARE @html xml = ' 
    <div class="container"> 
        <div class="card-deck mb-3 text-center"> 
            <div class="card-body"> 
                <h1 class="card-title pricing-card-title">$15 <small class="text-muted">/ mo</small></h1> 
                <ul class="list-unstyled mt-3 mb-4"> 
                    <li>20 users included</li> 
                    <li>10 GB of storage</li> 
                    <li>Priority email support</li> 
                    <li>Help center access</li> 
                </ul> 
                <button type="button" class="btn btn-lg btn-block btn-primary">Get started</button> 
            </div> 
        </div> 
    </div> 
'; 

If we wanted to say extract all of the text from this HTML (to allow text mining without all of the tags getting in the way) we could easily do this using the XML nodes() and value() methods:

-- Get all text values from elements 
SELECT 
    T.C.value('.','varchar(max)')  AS AllText
FROM 
    @html.nodes('/') T(C);

image-1

If we want to only extract the items from the list elements, we can write some XQuery to select only those elements:

-- Get a fragment of HTML 
SELECT 
    T.C.value('.','varchar(100)') AS ListValues 
FROM 
    @html.nodes('//*[local-name()=("li")]') T(C); 

image-2

Finally, we can also do things like select HTML fragments based on an attribute to parse further in subsequent steps. If I want to select the div with a class of "card-body", I can write:

-- Get the text from within certain elements 
SELECT 
    T.C.query('.') AS CardBody 
FROM 
    @html.nodes('//div[@class="card-body"]') T(C); 

image-3

Yuck

To reiterate - you don't want to do any of the above unless you have no other choice.

The XML parsing functions will not parse all HTML, so you may need to do some pre-processing on your HTML data first (removing invalid HTML, closing tags, etc...).

It works beautifully in the above example but your results may very. Good luck!

T-SQL Documentation Generator

MJ-t-sql-TuesdayThis post is a response to this month's T-SQL Tuesday #110 prompt by Garry Bargsley.  T-SQL Tuesday is a way for the SQL Server community to share ideas about different database and professional topics every month.

This month's topic asks to share how we automate certain processes.


Watch this week's video on YouTube

I'm a fan of keeping documentation close to the code. I prefer writing my documentation directly above a procedure, function, or view definition because that's where it will be most beneficial to myself and other developers.

Not to mention that's the only place where the documentation has any chance of staying up to date when changes to the code are made.

What drives me crazy though is making a copy of that documentation somewhere else, into a different format. You know, like when someone without database access needs you to send them a description of all of the procedures for a project. Or if you are writing end-user documentation for your functions and views.

Not only is creating a copy of the documentation tedious, but there is no chance that it will stay up to date with future code changes.

So today I want to share how I automate some of my documentation generation directly from my code.

C# XML Style Documentation in T-SQL

C# uses XML to document objects directly in the code:

/// <summary>
/// Retrieves the details for a user.
/// </summary>
/// <param name="id">The internal id of the user.</param>
/// <returns>A user object.</returns>
public User GetUserDetails(int id)
{
    User user = ...
    return user;
}

I like this format: the documentation is directly next to the code and it is structured as XML, making it easy to parse for other uses (eg. use a static document generator to create end-user documentation directly from these comments).

This format is easily transferable to T-SQL:

/*
<documentation>
  <author>Bert</author>
  <summary>Retrieves the details for a user.</summary>
  <param name="@UserId">The internal id of the user.</param>
  <returns>The username, user's full name, and join date</returns>
</documentation>
*/
CREATE PROCEDURE dbo.USP_SelectUserDetails
       @UserId int
AS
BEGIN
    SELECT Username, FullName, JoinDate FROM dbo.[User] WHERE Id = @UserId
END
GO


/*
<documentation>
  <author>Bert</author>
  <summary>Returns the value 'A'.</summary>
  <param name="@AnyNumber">Can be any number.  Will be ignored.</param>
  <param name="@AnotherNumber">A different number.  Will also be ignored.</param>
  <returns>The value 'A'.</returns>
</documentation>
*/
CREATE FUNCTION dbo.UDF_SelectA
(
    @AnyNumber int,
    @AnotherNumber int
)
RETURNS char(1)
AS
BEGIN
       RETURN 'A';
END
GO

Sure, this might not be as visually appealing as the traditional starred comment block, but I've wrestled with parsing enough free formatted text that I don't mind a little extra structure in my comments.

Querying the Documentation

Now that our T-SQL object documentation has some structure, it's pretty easy to query and extract those XML comments:

WITH DocumentationDefintions AS (
SELECT
    SCHEMA_NAME(o.schema_id) as schema_name,
    o.name as object_name,
    o.create_date,
    o.modify_date,
    CAST(SUBSTRING(m.definition,CHARINDEX('<documentation>',m.definition),CHARINDEX('</documentation>',m.definition)+LEN('</documentation>')-CHARINDEX('<documentation>',m.definition)) AS XML) AS Documentation,
    p.parameter_id as parameter_order,
    p.name as parameter_name,
    t.name as parameter_type,
    p.max_length,
    p.precision,
    p.scale,
    p.is_output
FROM
    sys.objects o
    INNER JOIN sys.sql_modules m
        ON o.object_id = m.object_id
    LEFT JOIN sys.parameters p
        ON o.object_id = p.object_id
    INNER JOIN sys.types t
        ON p.system_type_id = t.system_type_id
WHERE 
    o.type in ('P','FN','IF','TF')
)
SELECT
    d.schema_name,
    d.object_name,
    d.parameter_name,
    d.parameter_type,
    t.c.value('author[1]','varchar(100)') as Author,
    t.c.value('summary[1]','varchar(max)') as Summary,
    t.c.value('returns[1]','varchar(max)') as Returns,
    p.c.value('@name','varchar(100)') as DocumentedParamName,
    p.c.value('.','varchar(100)') as ParamDescription
FROM
    DocumentationDefintions d 
    OUTER APPLY d.Documentation.nodes('/documentation') as t(c) 
    OUTER APPLY d.Documentation.nodes('/documentation/param') as p(c)
WHERE
    p.c.value('@name','varchar(100)') IS NULL -- objects that don't have documentation
    OR p.c.value('@name','varchar(100)') = d.parameter_name -- joining our documented parms with the actual ones
ORDER BY
    d.schema_name,
    d.object_name,
    d.parameter_order

This query pulls the parameters of our procedures and functions from sys.parameters and joins them with what we documented in our XML documentation. This gives us some nicely formatted documentation as well as visibility into what objects haven't been documented yet:

image

Only the Beginning

At this point, our procedure and function documentation is easily accessible via query. We can use this to dump the information into an Excel file for a project manager, or schedule a job to generate some static HTML documentation directly from the source every night.

This can be extended even further depending on your needs, but at least this is an automated starting point for generating further documentation directly from the T-SQL source.

XML vs JSON Shootout: Which is Superior in SQL Server 2016?

"A duel is a duel" by Emanuele Rosso is licensed under CC BY-NC-ND 2.0

Watch this week's video on YouTube

Additional performance comparisons available in an updated post.

Starting with the 2016 release, SQL Server offers native JSON support. Although the implementation is not perfect, I am still a huge fan.

Even if a new feature like JSON support is awesome, I am only likely to use it if it is practical and performs better than the alternatives.

Today I want to pit JSON against XML and see which is the better format to use in SQL Server.

Enter XML, SQL's Bad Hombre

Full disclosure: I don't love XML and I also don't love SQL Server's implementation of it.

XML is too wordy (lots of characters wasted on closing tags), it has elements AND attributes (I don't like having to program for two different scenarios), and depending on what language you are programming in, sometimes you need schema files and sometimes you don't.

SQL Server's implementation of XML does have some nice features like a dedicated datatype that reduces storage space and validates syntax, but I find the querying of XML to be clumsy.

All XML grievances aside, I am still willing to use XML if it outperforms JSON. So let's run some test queries!

Is JSON SQL Server's New Sheriff in Town?

Although performance is the final decider in these comparison tests, I think JSON has a head start over XML purely in terms of usability. SQL Server's JSON function signatures are easier to remember and cleaner to write on screen.

The test data I'm using is vehicle year/make/model data from https://github.com/arthurkao/vehicle-make-model-data. Here's what it looks like once I loaded it into a table called dbo.XmlVsJson:

CREATE TABLE dbo.XmlVsJson
(
  Id INT IDENTITY PRIMARY KEY,
  XmlData XML,
  JsonData NVARCHAR(MAX)
)

(The full data query is available in this gist if you want to play along at home)

Data Size

So XML should be larger right? It's got all of those repetitive closing tags?

SELECT
  DATALENGTH(XmlData)/1024.0/1024.0 AS XmlMB,
  DATALENGTH(JsonData)/1024.0/1024.0 AS JsonMB
FROM
  dbo.XmlVsJson

82246-1ra2xqdbn4movjivmlhtnxa

Turns out the XML is actually smaller! How can this be? This is the magic behind the SQL Server XML datatype. SQL doesn't store XML as a giant string; it stores only the XML InfoSet, leading to a reduction in space.

The JSON on the other hand is stored as regular old nvarchar(max) so its full string contents are written to disk. XML wins in this case.

0745a-1liaytu4vkxadqugdbn1j3g

INSERT Performance

So XML is physically storing less data when using the XML data type than JSON in the nvarchar(max) data type, does that mean it will insert faster as well? Here's our query that tries to insert 100 duplicates of the row from our first query:

SET STATISTICS TIME ON

INSERT INTO dbo.XmlVsJson (XmlData)
SELECT XmlData FROM dbo.XmlVsJson 
  CROSS APPLY 
  (
    SELECT DISTINCT number 
    FROM master..spt_values 
    WHERE number BETWEEN 1 AND 100
  )t WHERE Id = 1
GO

INSERT INTO dbo.XmlVsJson (JsonData)
SELECT JsonData FROM dbo.XmlVsJson 
  CROSS APPLY 
  (
    SELECT DISTINCT number 
    FROM master..spt_values 
    WHERE number BETWEEN 1 AND 100
  )t WHERE Id = 1
GO

And the results? Inserting the 100 XML rows took 613ms on my machine, while inserting the 100 JSON rows took 1305ms…XML wins again!

JSON ain't looking too hot. Wait for it…

I'm guessing since the XML data type physically stores less data, it makes sense that it would also write it out to the table faster as well.

CRUD Operations

I'm incredibly impressed by SQL Server's JSON performance when compared to .NET — but how does it compare to XML on SQL Server?

Read

Let's select the fragment for our second car from our XML and JSON:

SELECT t.XmlData.query('/cars/car[2]') 
FROM dbo.XmlVsJson t 
WHERE Id = 1

SELECT JSON_QUERY(t.JsonData, '$.cars[1]') 
FROM dbo.XmlVsJson t 
WHERE Id = 1

91b35-1xixep1a6ebvvn38_vvpavq

Result? JSON wins (at 0ms vs 63ms for XML) when needing to pluck out a fragment from our larger object string.

8a02b-1hzaxjaobi3dxqurwrelf4w

What if we want to grab a specific value instead of a fragment?

SELECT t.XmlData.value('(/cars/car[2]/model)[1]', 'varchar(100)') FROM dbo.XmlVsJson t 
WHERE Id = 1

SELECT JSON_VALUE(t.JsonData, '$.cars[1].model') 
FROM dbo.XmlVsJson t 
WHERE Id = 1

Once again JSON wins with 0ms vs 11ms for XML.

41179-1esau9dtm4uezlxgarzxb1q

If you look at the execution plans for these last two queries, it's easy to see that XML has a lot more to do behind the scenes to retrieve the data:

XML:

327b8-1fgaieclnioqa5-zytrrndq

JSON:

45848-1s9aop29_eiltkvsnszkw9g

Create

We saw above that inserting rows of XML data is faster than inserting rows of JSON, but what if we want to insert new data into the object strings themselves? Here I want to insert the property "mileage" into the first car object:

db0c0-1btrkakuspyaj8iu5lq7wsg

UPDATE t SET XmlData.modify('
insert <mileage>100,000</mileage>
into (/cars/car[1])[1]') 
FROM dbo.XmlVsJson t 
WHERE Id = 1

UPDATE t SET JsonData = JSON_MODIFY(JsonData,
'$.cars[0].mileage','100,000') 
FROM dbo.XmlVsJson t 
WHERE Id = 1

In addition to the cleaner syntax (JSON_MODIFY() is essentially the same as a REPLACE()) the JSON insert runs in 22ms compared to the 206ms for XML. Another JSON win.

c01d2-1kng0gj5jdkk5ujd1qlg5ig

Update

Let's update the mileage properties we just added to have values of 110,000:

UPDATE t SET XmlData.modify('
replace value of (/cars/car[1]/mileage/text())[1]
with     "110,000"') 
FROM dbo.XmlVsJson t
WHERE Id = 1

UPDATE t SET JsonData = JSON_MODIFY(JsonData, '$.cars[0].mileage','110,000') 
FROM dbo.XmlVsJson t
WHERE Id = 1

72c82-1hpv3kbbgv88tzuyozvineg

Result? JSON has the quicker draw and was able to perform this update in 54ms vs XML's 194ms.

c81d3-1fvual_pu3eyahvumnfimqg

Delete

Deleting large string data, a DBA's dream *snicker*.

Let's delete the mileage property, undoing all of that hard work we just did:

UPDATE t SET XmlData.modify('
delete /cars/car[1]/mileage[1]') 
FROM dbo.XmlVsJson t 
WHERE Id = 1

UPDATE t SET JsonData = JSON_MODIFY(JsonData, '$.cars[0].mileage', null) 
FROM dbo.XmlVsJson t 
WHERE Id = 1

JSON doesn't take any time to reload and wins against XML again 50ms to 159ms.

538ba-1ojms_3yjdktnni2a3pqcrg

Read Part 2: Indexes

So above we saw that JSON was faster than XML at reading fragments and properties from a single row of serialized data. But our SQL Server's probably have LOTS of rows of data — how well does indexed data parsing do in our match up?

First let's expand our data — instead of storing all of our car objects in a single field, let's build a new table that has each car on its own row:

(once again, full dataset at GitHub if you are playing along at home)

Now that we have our expanded data in our table, let's add some indexes. The XML datatype in SQL Server has its own types of indexes, while JSON simply needs a computed column with a regular index applied to it.

DROP INDEX IF EXISTS PXML_XmlData ON XmlVsJson2
CREATE PRIMARY XML INDEX PXML_XmlData
ON XmlVsJson2 (XmlData);

ALTER TABLE dbo.XmlVsJson2
ADD MakeComputed AS JSON_VALUE(JsonData, '$.make')
CREATE NONCLUSTERED INDEX IX_JsonData ON dbo.XmlVsJson2 (MakeComputed)

(Note: I also tried adding an XML secondary index for even better performance, but I couldn't get the query engine to use that secondary index on such a basic dataset)

If we try to find all rows that match a predicate:

SELECT Id, XmlData 
FROM dbo.XmlVsJson2 t 
WHERE t.XmlData.exist('/car/make[.="ACURA"]') = 1

SELECT Id, JsonData 
FROM dbo.XmlVsJson2 t 
WHERE JSON_VALUE(t.JsonData, '$.make') = 'ACURA'

XML is able to filter out 96 rows in 200ms and JSON accomplishes the same in 9ms. A final win for JSON.

327d9-1-jtwonmiccgvf25ksnmb3q

Conclusion

If you need to store and manipulate serialized string data in SQL Server, there's no question: JSON is the format of choice. Although JSON's storage size is a little larger than its XML predecessor, SQL Server's JSON functions outperform XML in speed in nearly all cases.

Is there enough performance difference to rewrite all of your old XML code to JSON? Probably not, but every case is different.

One thing is clear: new development should consider taking advantage of SQL Server's new JSON functions.