XmlReader vs XmlDocument Performance

Published Thu 24 November 2016 in Programming

306d1-15myowyb7a3ye9kbyzjdttg

Recently I have been working on a project where I needed to parse XML files that were between 5mb and 20mb in size. Performance was critical for the project, so I wanted to make sure that I would parse these files as quickly as possible.

The two C# classes that I know of for parsing XML are XmlReader and XmlDocument. Based on my understanding of the two classes, XmlReader should perform faster in my scenario because it reads through an XML document only once, never storing more than the current node in memory. On the contrary, XmlDocument stores the whole XML file in memory which has some performance overhead.

Not knowing for certain which method I should use, I decided to write a quick performance test to measure the actual results of these two classes.

The Data

In my project, I knew what data I needed to extract from the XML up front so I decided to configure test in a way that mimics that requirement. If my project required me to run recursive logic in the XML document, needing a piece of information further down in the XML in order to know what pieces of information to pull earlier on from the XML, I would have set up an entirely different test.

For my test, I decided to use the Photography Stack Exchange user data dump as our sample file since it mimics the structure and file size of one my actual project's data. The Stack Exchange data dumps are great sample data sets because they involve real-world data and are released under a Creative Commons license.

The Test

The C# code for my test can be found in its entirety on GitHub.

In my test I created two methods to extract the same exact data from the XML; one of the methods used XmlReader and the other XmlDocument.

The first test uses XmlReader. The XmlReader object only stores a single node in memory at a time, so in order to read through the whole document we need to usewhile(reader.Read()) in order to loop all of the nodes. Inside of the loop, we check if each node is an element that we are looking for and if so then parse out the necessary data:

public static void XmlReaderTest(string filePath)
{
    // We create storage for ids of all of the rows from users where reputation == 1
    List<string> singleRepRowIds = new List<string>();

    using (XmlReader reader = XmlReader.Create(filePath))
    {
        while (reader.Read())
        {
            if (reader.IsStartElement())
            {
                if (reader.Name == "row" && reader.GetAttribute("Reputation") == "1")
                {
                    singleRepRowIds.Add(reader.GetAttribute("Id"));
                }
            }
        }
    }
}

On the other hand, the code for XmlDocument is much simpler: we load the whole XML file into memory and then write a LINQ query to find the elements of interest:

public static void XmlDocumentTest(string filePath)
{
    List<string> singleRepRowIds = new List<string>();

    XmlDocument doc = new XmlDocument();
    doc.Load(filePath);

    singleRepRowIds = doc.GetElementsByTagName("row").Cast<XmlNode>().Where(x => x.Attributes["Reputation"].InnerText == "1").Select(x => x.Attributes["Id"].InnerText).ToList();
}

After writing these two methods and confirming that they are returning the same exact results it was time to pit them against each other. I wrote a method to run each of my two XML parsing methods above 50 times and to take the average elapsed run time of each to eliminate any outlier data:

public static double RunPerformanceTest(string filePath, Action<string> performanceTestMethod)
{
    Stopwatch sw = new Stopwatch();

    int iterations = 50;
    double elapsedMilliseconds = 0;

    // Run the method 50 times to rule out any bias.
    for (var i = 0; i < iterations; i++)
    {
        sw.Restart();
        performanceTestMethod(filePath);
        sw.Stop();

        elapsedMilliseconds += sw.ElapsedMilliseconds;
    }

    // Calculate the average elapsed seconds per run
    double avergeSeconds = (elapsedMilliseconds / iterations) / 1000.0;

    return avergeSeconds;
}

Results and Conclusions

Cutting to the chase, XmlReader performed faster in my test:

Performance test results.

Now, is this ~.14 seconds of speed difference significant? In my case, it is, because I will be parsing many more elements and many more files dozens of times a day. After doing the math, I estimate I will save 45–60 seconds of parsing time for each set of XML files, which is huge in an almost-real-time system.

Would I have come to the same conclusion if blazing fast speed was not one of my requirements? No, I would probably go the XmlDocument route because the code is much cleaner and therefore easier to maintain.

And if my XML files were 50mb, 500mb, or 5gb in size? I would probably still use XmlReader at that point because trying to store 5gb of data in memory will not be pretty.

What about a scenario where I need to go backwards in my XML document — this might be a case where I would use XmlDocument because it is more convenient to go backwards and forwards with that class. However, a hybrid approach might be my best option if the data allows it: if I can use XmlReader to get through the bulk of my content quickly and then load just certain child trees of elements into XmlDocument for easier backwards/forwards traversal, then that would seem like an ideal scenario.

In short, XmlReader was faster than XmlDocumet for me in my scenario. The only way I could come to this conclusion though was by running some real world tests and measuring the performance data.

So should you use XmlReader or XmlDocument in your next project? The answer is it depends.

Coffee's Variety Problem

Published Mon 14 November 2016 in Programming

Freshly roasted coffee beans from central Mexico.

Spaghetti Sauce Origins

During the 1970s consumers had a limited number of spaghetti sauces that they could purchase at their local supermarket. Each store-bought sauce tasted the same, developed in test kitchens from the average flavor preferences of focus groups.

During this time, Prego brand spaghetti sauce was new to the market and was having a difficult time competing against the more established spaghetti sauce brands like Ragu. Prego tasted the same as all of the other brands so capturing loyal customers from other brands was a challenge.

Struggling to become competitive, Prego brought on board Howard Moskowitz. Today some consider Moskowitz the godfather of food-testing and market research, but during the 1970s Moskowitz's beliefs on product-design were contrary to the mainstream establishment. Moskowitz believed that using the average preferences of focus groups to develop a product lead to an average tasting spaghetti sauce: acceptable by most, loved by none.

Instead of giving focus groups similar tasting sauces, Moskowitz decided to test the extremes: have groups compare smooth pureed sauces with sauces that have chunks of vegetables in them, compare mild tasting sauces with ones with lots of spice, and so forth.

The results of this type of testing may seem obvious today, but at the time they were unheard of: people prefer different kinds of spaghetti sauce. It was better for a food company to have a portfolio of niche flavors rather than try to make one product that appealed to everybody.

Today spaghetti sauce no longer has a variety problem.

After all of this initial research Prego made Extra-Chunky spaghetti sauce, which instantly became a hit with all of the people who prefer a thicker type of tomato sauce. In the years since, spaghetti sauce manufactures have caught on to the technique and supermarket shelves today are lined with regular, thick and chunky, three-cheese, garlic, and many other types of sauces that appeal to a wide-array of consumer flavor preferences.

Coffee's Variety Problem

Coffee today has the same problem that spaghetti sauce did a quarter-century ago.

The average supermarket's coffee aisle may look like it has a wide variety of choices: whole beans versus ground coffee versus coffee pods, arabica versus robusta beans, caffeinated versus decaf, medium versus dark roast, etc…

However, once you take a look at what coffee a particular individual may drink — let's say whole bean, medium-roast caffeinated arabica beans -the amount of variety found at a super market is surprisingly small, maybe only 2 or 3 different types.

The limited selection for whole bean, medium roasted coffee at my local super market.

Additionally, it is nearly impossible to know how long the coffee has been sitting unsold on the shelves, it's difficult to find a variety of coffee from different geographic regions around the world, and only if you are extremely lucky can you find a coffee that has been lightly roasted.

Independent coffee shops, especially those that roast their own coffee, offer slightly better options in terms of variety, however they come at a steep price: it is not unusual to see these coffees selling for \$16-\$28 per pound.

I understand these small-batch roasters experience higher costs due to lack of scale, availability of beans, and a slowly developing market to third-wave premium coffee, however paying \$20 for a pound of coffee beans (or even worse, the standard 12oz bag) is not something I can justify doing regularly.

This is what caused me to set out on my quest to get premium quality coffee beans at an affordable price.

The Internet Cafe

While the internet offers advantages in buying premium roasted coffee beans, there are still issues with unknown freshness and high prices, especially when the cost of shipping is included.

Shipping costs and price per pound can be reduced when buying in bulk, but buying in bulk means I'll have roasted beans sitting around for a long time before being consumed, therefore affecting freshness and taste. Short of finding friends who want to split a large coffee order, buying roasted coffee beans online isn't a great option.

What is a great option is buying green, or un-roasted, coffee beans. The shelf life of green beans is up to one year if stored properly and green beans are significantly cheaper than roasted beans because there is less processing involved.

Green, un-roasted coffee beans.

Not only are green beans fresher and cheaper, there is significantly more variety available. Commercial roasters need to buy beans in large quantities in order to be able to sell to coffee shops and supermarkets. This means they are sourcing coffee beans from large commercial farms that are able to supply such a large amount of coffee beans.

If we buy green beans for personal consumption at home, the number of farms that we can buy from is hugely expanded since we only need to buy in small quantities. A farm that only produces a few hundred pounds of beans each year is now within our grasp since commercial roasters would never be able to purchase from them.

There are many retailers online that cater to the home roasting market. My favorite is http://sweetmarias.com and they always have a huge selection of premium beans from all over the world, many for under \$6/pound.

Home Roasting

Roasting beans at home used to be the norm in the early nineteenth century. Roasting coffee beans is similar to making popcorn and can be done over a fire, in a stove, or in the oven. While these methods work, they are messy and involved. Fortunately, cleaner and more scientific options exist.

Commercially available home roasters are one such option, however they cost several hundred to thousands of dollars — well out of my price range.

The most easily obtainable and easy to use home coffee roaster is an electric air-powered popcorn popper. These retail for between \$15–\$25 new. If you decide to try this route, get one without a mesh screen on the bottom.

Once you have an air popcorn popper, roasting coffee beans is as easy as pouring them in, turning on the heat, and waiting until they turn the brown color you are accustomed to seeing.

Although roasting beans can be as easy as turning on the popcorn popper and waiting until the beans reach a desired color, there is a scientific rabbit hole that roasting geeks like me eventually wander down…

Home Roasting 2.0: Web Roast

When I first started roasting coffee beans at home, I started with the air popper method. I soon wanted to start experimenting more with how to make the process more automated, as well as how I could become more precise and play with different variables in order to change the flavors of my roasted coffee.

There are some mods you can make to a regular air popcorn popper to give you more control in how your roast your beans, but ultimately I wanted more control.

I present to you, Web Roast:

The home-made, Internet of Things (IoT) coffee roaster.

The finished project with all code can be found on my project's GitHub page: https://github.com/bertwagner/Coffee-Roaster/

Essentially, this is still an air popper but with much more control. The key features include:

Air temperature probe
Ability to switch the fan and heat source on/off independently
Holding a constant temperature
Automatic graphing of roast profiles
Ability to run saved roast profiles

Now instead of standing at my kitchen counter turning the air popper on and off, holding a digital temperature probe, and shaking the whole roaster to circulate and cool the beans between "on" cycles, I can simply control all of these conditions from my iPhone.

Screen shot of the web app user interface.

From my phone I decide when to heat the beans, when to maintain a certain temperature, how quickly or slowly to hit certain roasting stages or "cracks", and logging to make sure I can reproduce results in future runs.

Beans develop different flavors based on how quickly or slowly they go through different phases of roasting. Some beans might be better suited for quick roasts that will maintain acidic fruit flavors. Other beans might need to be roasted more slowly to bring out nutty and cocoa flavors. The moisture content of a bean will also have an effect on roasting times, as well as beans that are sourced from different regions of the world.

Next Steps

Although I'm extremely satisfied with how my roaster has turned out, there's still a lot on the to-do list.

I'm currently adding functionality to save roast profiles, so after an initial run of desired results, reproducing those results for the same batch of green beans is easy.

In the future, I'd like to build a second, bigger drum-style roaster for being able to roast larger batches at a time.

Follow my Github coffee roaster project page to keep up with any future updates. Also I would love to hear from anyone who has built similar projects of their own.

Good luck and happy roasting!

3 Things I Wish I Knew When I Started Using Entity Framework

Published Sat 30 April 2016 in Programming

741a3-1j8khcb_mgxi_ttvtatuqsg

Entity Framework (EF) is Microsoft's object-relational-mapping tool (ORM) that allows programmers to easily map database objects to C# object models. What attracts developers to Entity Framework is the automatic mapping of SQL tables to C# model classes, a feature that removes a lot of the tedious boiler plate code that developers would otherwise have to write.

The quickest way to see why developers love Entity Framework is with an example. Without Entity Framework, the code needed to SELECT some data from SQL looks something like this:

using (SqlConnection conn = new SqlConnection(connectionString))
{
  string query = @"SELECT 
                      * 
                    FROM
                      Customers
                    WHERE
                      TransactionDate > '2016–04–01'";

  SqlCommand cmd = new SqlCommand(query, conn);
  SqlDataReader reader = cmd.ExecuteReader();

  while (reader.Read())
  {
    //Process my rows and put them into a model object
  }
}

With Entity Framework and LINQ (Language Integrated Query), I can instead write something like this:

DatabaseContext db = new DatabaseContext();

var customers = db.Customer.Where(c => c.TransactionDate > new DateTime(2016–04–01)).ToList();

As a developer I want to be writing code like the latter example — look at how clear and concise it is! However, when I first started using Entity Framework I quickly came across specific scenarios where the ORM was not working like I would have expected. Below are my top 3 insights about Entity Framework that I had to learn the hard way.

1. Select() is your friend

When I first started using Entity Framework I was enamored by the idea of never having to write boring SQL connection and mapping code again. During that time, I was frequently writing quick concise LINQ queries like this all over the place:

var results = db.Customer.Where(c => c.State == "Ohio");

All things were great for a while until I moved my code to production, where there was a lot more data, and my webpage started taking a very long time to return results. What was the problem?

My customer summary webpage was only displaying 4 columns from the Customer table to display summary information about customers: "State", "Name", "Address", and "Email". However, my Customer table had 50 columns in it. I foolishly thought since I was only using 4 columns from the table, Entity Framework would be smart enough to only bring those 4 columns of data back from the SQL server. Turns out I was wrong — what really was running behind the scenes was this:

SELECT
  *
FROM
  Customer
WHERE
  State = 'Connecticut'

Egad! I was running a SELECT *, which all SQL practitioners know is a big no-no both for performance as well as future maintainability of your queries!

One of the issues (features?) with Entity Framework is that by default the queries all run as SELECT *. Instead of bringing back all of the columns in a table, I figured I could limit my results by using Select() to choose which columns I wanted to return:

var results = db.Customer
                .Select(x => new 
                {
                  x.State,
                  x.Name,
                  x.Address,
                  x.Email
                });

And there was much rejoicing amongst the DBAs.

2. Lazy Loading can be great or terrible

Lazy Loading is the idea that data isn't pulled from the database until it is needed (hence the term "lazy"). This is a great feature if used correctly since it minimizes the amount of data that is transferred over the network. However, if you use lazy loading (which is often the default) unknowingly, you might be querying the SQL server more than is necessary.

Let's say our Customer and Order models looks like this, where every Customer has many associated Orders:

public class Customer
{
  [Key]
  public int Id { get; set; }
  public string State { get; set; }
  public string Name { get; set; }
  public virtual ICollection<Order> Orders { get; set; }
}

public class Order
{
  [Key]
  public int Id { get; set; }
  public int CustomerId { get; set; }

  [ForeignKey("CustomerId")]
  public virtual Customer Customer { get; set; }

  public int TotalItems { get; set; }
  public decimal TotalCost { get; set; }
}

If we write a LINQ query to bring back our Customer data and only use Customer data in our view, SQL just executes 1 query and things are efficient:

Controller:

var result = db.Customer.ToList();

View:

@foreach (var item in Model) {
  <tr>
    <td>
      @Html.DisplayFor(modelItem => item.State)
    </td>
    <td>
      @Html.DisplayFor(modelItem => item.Name)
    </td>
  </tr>
}

Result:

Bringing back fields exclusive to the Customer model.

SQL Trace:

One "SQL:BatchCompleted" means only a single query ran.

As you can see, Entity Framework only ran one query against the database and we are getting good performance. Very nice!

If we go ahead and change our View to display some Order data though:

View:

@foreach (var item in Model) {
  <tr>
    <td>
      @Html.DisplayFor(modelItem => item.State)
    </td>
    <td>
      @Html.DisplayFor(modelItem => item.Name)
    </td>
<td>
      @Html.DisplayFor(modelItem => item.Orders.First().TotalCost)
    </td>
  </tr>
}

Result:

Now EF nows to bring dollar amounts back from our Order model.

SQL Trace:

The Order query is executed once for every customer — this could result in horrible performance.

Because of Lazy Loading, entity framework runs a query to retrieve an order for each customer! In my example data with 5 Customers with 1 Order each, Entity Framework runs 6 queries: 1 query to retrieve our Customers, and then 5 queries to retrieve orders for each of customers! Imagine if we were display more than 5 customer records — our SQL server would be getting slammed with queries!

The solution to this is to Eager Load the data. This means all of the data will be brought back in our initial query via a SQL JOIN. In order to do this, we just use the Include() method to retrieve all of the data up front:

Controller:

var result = db.Customer.Include(x => x.Orders).ToList();

Trace:

EF generated SQL might be ugly, but it is efficient in this case.

Entity Framework only executes 1 query. This causes a little bit more upfront overhead since it transfers the Order data all at once, which would be a waste if we didn't use that Order data in our View, but if we do plan to use Order data in our View, then using Include() reduces the number of queries that will run against the SQL server.

3. Sometimes it's OK to not use Entity Framework LINQ queries

By default, Entity Framework generates some pretty interesting queries (interesting at least in the sense that no normal human being would structure a query like this). For example, in SQL I could write:

SELECT
  State,
  Name,
  Address,
  Email
FROM
  Customer
WHERE
  CreateDate > '2016–04–01'

And the query generated by Entity Framework might look something like this:

-- Region Parameters
DECLARE @p__linq__0 DateTime2 = '2016–04–01 00:00:00.0000000'
-- EndRegion

SELECT
  [Extent1].[State] AS [State],
  [Extent1].[Name] AS [Name],
  [Extent1].[Address] AS [Address],
  [Extent1].[Email] AS [Email]
FROM 
  [dbo].[Customer] AS [Extent1]
WHERE 
  [Extent1].[CreateDate] > @p__linq__0

Extents? Parameters? Aliases? What's all of this extra junk in my query? Although the latter SQL statement works, it isn't the prettiest on the eyes. And this is the example of simple query. Once you start writing complex LINQ queries using Entity Framework that involve the likes of Join(), GroupBy(), etc… the raw SQL that gets generated can get hairy and inefficient. In those cases, I've learned that it is perfectly to rely on good old t-sql stored procedures:

string query = "EXEC dbo.USP_NiceAndCleanSqlQuery";

db.Database.ExecuteSqlCommand(query);

At the end of the day, we are using Entity Framework to make our lives easier. If a LINQ query gets too complicated or is generating inefficient queries, it is always fine to fall back on an old standby.