Colin Cochrane

Software Developer based in Victoria, BC specializing in C#, PowerShell, Web Development and DevOps.

Is There Added Value In XHTML To Search Engine Spiders?

The use of XHTML in the context of SEO is a matter of debate.  The consensus tends to be that using XHTML falls into the category of optimization efforts that provide certain benefits for the site as a whole (extensibility, ability to use XSL transforms) but offers little or no added value in the eyes of the search engines.  That being said, as the number of pages that search engine spiders have to crawl continues to increase every day, the limits to how much the spiders can crawl are being tested.  This has been recognized by SEOs and is reflected in efforts to trim page sizes down to make a site more appealing to the spiders.  Now it is time to start considering the significant benefits that a well-formed XHTML document can potentially offer to search engine spiders.

Parsing XML is faster than parsing HTML for one simple reason: XML documents are required to be well-formed.  This saves the parser the extra overhead involved in "filling in the blanks" of a malformed document.  Dealing with XML also opens the door to query languages like XPath that provide fast and straightforward access to any given part of an XML document.  For instance, consider the following XHTML document:

<?xml version="1.0" encoding="utf-8" ?>
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
<html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en">
  <head>
    <title>My XHTML Document</title>
    <meta name="description" content="This is my XHTML document."/>
  </head>
  <body>
    <div>
      <h1>This is my XHTML document!</h1>
      <p>Here is some content.</p>
    </div>
  </body>
</html>

Now let's say we wanted to grab the contents of the title element from this document.  If we were to parse it as straight HTML we'd probably use a regular expression such as "<title>([^<]*)</title>".  (As a quick aside, I want to clarify that real HTML parsers are quite advanced and don't simply use regular expressions to read a document.)  In Visual Basic the code to accomplish this would look like:

Imports System.Text.RegularExpressions

Class MyParser

  ' Returns the text of the first <title> element, or an empty string if none matches.
  Function GetTitle(ByVal html As String) As String
    Return Regex.Match(html, "<title>([^<]*)</title>").Groups(1).Value
  End Function

End Class

If we were to use XPath, on the other hand, we would get something like this:

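Imports System.Xml
Imports System.Xml.XPath

Class MyParser

  Function GetTitle(ByVal reader As XmlReader) As String
    Dim doc As New XPathDocument(reader)
    Dim navigator As XPathNavigator = doc.CreateNavigator()
    ' XHTML elements live in a namespace, so bind a prefix for the XPath query.
    Dim ns As New XmlNamespaceManager(navigator.NameTable)
    ns.AddNamespace("x", "http://www.w3.org/1999/xhtml")
    Return navigator.SelectSingleNode("/x:html/x:head/x:title", ns).Value
  End Function

End Class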

Don't let the amount of code fool you.  While the first example needs a single line of code to accomplish what takes the second example several, the real value comes when dealing with a non-trivial document.  The first method would need to enumerate the elements in the document, which would involve either very complex regular expressions with added logic (because regular expressions are not well suited to parsing HTML), or the added overhead necessary for an existing HTML parser to accurately determine how the document was "intended" to be structured.  Using XPath is a simple matter of passing a different XPath expression to the "navigator.SelectSingleNode" method.
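To make that last point concrete, here is a rough sketch of pulling several different pieces out of the sample document just by changing the path expression.  It is written in Python rather than Visual Basic purely to keep it short, and it uses the standard library's ElementTree, which only supports a limited subset of XPath.  The helper name `select_text`, the `NS` mapping, and the `x` prefix are my own illustrative choices, not part of any particular spider.

```python
import xml.etree.ElementTree as ET

# The sample XHTML document from earlier in the post.
XHTML = """<?xml version="1.0" encoding="utf-8" ?>
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
<html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en">
  <head>
    <title>My XHTML Document</title>
    <meta name="description" content="This is my XHTML document."/>
  </head>
  <body>
    <div>
      <h1>This is my XHTML document!</h1>
      <p>Here is some content.</p>
    </div>
  </body>
</html>
"""

# XHTML elements live in this namespace; "x" is an arbitrary prefix chosen here.
NS = {"x": "http://www.w3.org/1999/xhtml"}


def select_text(root, path):
    """Return the text of the first element matching path, or None if no match."""
    node = root.find(path, NS)
    return None if node is None else node.text


# Parsing fails loudly if the document is not well-formed.
root = ET.fromstring(XHTML.encode("utf-8"))  # root is the <html> element

# Same helper, different path expressions (paths are relative to <html>).
title = select_text(root, "x:head/x:title")
heading = select_text(root, "x:body/x:div/x:h1")
paragraph = select_text(root, ".//x:p")
description = root.find("x:head/x:meta[@name='description']", NS).get("content")

print(title, heading, paragraph, description, sep="\n")
```

Each lookup is one line, and none of them needs to know anything about the rest of the document's structure.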

With that in mind, I constructed a very basic test to see what kind of speed differences we'd be looking at between HTML parsing and XML (using XPath) parsing.  The test was simple: I created a well-formed XHTML document consisting of a title element, meta description and keywords elements, 150 paragraphs of Lorem Ipsum, 1 <h1> element, 5 <h2> elements, 10 <h3> elements and 10 anchor elements scattered throughout the document.

The test consisted of two methods, one using XPath, and one using Regular Expressions.  The task of each method was to simply iterate through every element in the document once, and repeat this task 10000 times while being timed.  Once completed it would spit out the elapsed time in milliseconds that it took to complete.  The test was kept deliberately simple because the results are only meant to very roughly illustrate the performance differences between the two methods.  It was by no means an exhaustive performance analysis and should not be considered as such.
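For anyone who wants to try something similar, here is a minimal sketch of what such a harness could look like.  It is in Python rather than Visual Basic and is not the code I ran: the document builder, the regular expression, the function names, and the choice to parse the XML once outside the timed loop are all assumptions made for illustration.  Timings will vary with the runtime and libraries used and need not resemble the figures reported in this post.

```python
import re
import timeit
import xml.etree.ElementTree as ET


def build_document(paragraphs=150, anchors=10):
    """Build a well-formed XHTML string roughly shaped like the test page."""
    body = ["<h1>Heading</h1>"]
    body += ["<h2>Section %d</h2>" % i for i in range(5)]
    body += ["<h3>Subsection %d</h3>" % i for i in range(10)]
    for i in range(paragraphs):
        # Put an anchor inside the first few paragraphs.
        link = '<a href="#a%d">link</a> ' % i if i < anchors else ""
        body.append("<p>%sLorem ipsum dolor sit amet.</p>" % link)
    return (
        '<html xmlns="http://www.w3.org/1999/xhtml"><head>'
        "<title>Test</title>"
        '<meta name="description" content="d"/>'
        '<meta name="keywords" content="k"/>'
        "</head><body>" + "".join(body) + "</body></html>"
    )


# Matches opening and self-closing tags, but not closing tags.
OPEN_TAG = re.compile(r"<([A-Za-z][A-Za-z0-9]*)\b[^>]*>")


def count_with_regex(html):
    """Visit every element once by scanning the raw text for opening tags."""
    return sum(1 for _ in OPEN_TAG.finditer(html))


def count_with_tree(root):
    """Visit every element once by walking the parsed XML tree."""
    return sum(1 for _ in root.iter())


def time_ms(func, repeats):
    """Total wall-clock time in milliseconds for `repeats` calls of func."""
    return timeit.timeit(func, number=repeats) * 1000.0


document = build_document()
root = ET.fromstring(document)  # assumption: parsed once, outside the timed loop

REPEATS = 1000  # the post used 10000; lowered here so the sketch runs quickly
xml_ms = time_ms(lambda: count_with_tree(root), REPEATS)
regex_ms = time_ms(lambda: count_with_regex(document), REPEATS)
print("tree walk: %.1f ms, regex scan: %.1f ms" % (xml_ms, regex_ms))
```

Both functions visit the same number of elements, so the comparison is at least like-for-like, but it remains a toy benchmark.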

That being said, I ran the test 10 times and averaged the results for each method, resulting in the following:

XML Parsing (Using XPath) - 13ms

HTML Parsing (Using RegEx) - 1852ms

As I said, these results are very rough, and meant to illustrate the difference between the two methods rather than the exact times. 

These results should, however, give you something to consider with respect to the potential benefits of XHTML to a search engine spider.  We don't know how search engine spiders are parsing web documents, and that will likely never change.  We do know that search engines are constantly refining their internal processes, including spider logic, and with the substantial performance benefits of XML parsing, it doesn't seem too far-fetched to think that the search engines might have their spiders capitalizing on well-formed XHTML documents with faster XML parsing, or are at least taking a very serious look at implementing that functionality in the near future.  If you consider even a performance improvement of only 10ms, when you multiply that against the tens of thousands of pages being spidered every day, those milliseconds add up very quickly.
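As a quick back-of-the-envelope illustration of that multiplication, sketched in Python: the page counts below are hypothetical round numbers standing in for "tens of thousands", and only the 10ms figure comes from the paragraph above.

```python
def seconds_saved(pages, saving_ms_per_page):
    """Total time saved, in seconds, for a given per-page saving in milliseconds."""
    return pages * saving_ms_per_page / 1000.0


# Hypothetical daily crawl volumes; only the 10 ms saving comes from the text.
for pages in (10_000, 50_000, 100_000):
    print("%7d pages -> %.0f s saved per day" % (pages, seconds_saved(pages, 10)))
```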

Comments (2)

  • JoeLiTo

    12/8/2007 10:59:17 AM

    I wonder, do you think using an XHTML doctype declaration and then serving the page as text/html would make the search engine use the XML parser, or would it use the HTML parser anyway?

  • Colin Cochrane

    12/8/2007 2:01:20 PM

    Serving an XHTML document as application/xhtml+xml declares to the user-agent that the response should be expected as well-formed XML and parsed accordingly.  This should logically apply to a search engine spider, especially considering that spiders are going to be significantly less forgiving than a standard web browser.
