Colin Cochrane

Colin Cochrane is a Software Developer based in Victoria, BC, specializing in C#, PowerShell, Web Development and DevOps.

Is There Added Value In XHTML To Search Engine Spiders?

The use of XHTML in the context of SEO is a matter of debate.  The consensus tends to be that XHTML falls into the category of optimization efforts that provide certain benefits for the site as a whole (extensibility, the ability to use XSL transforms) but offer little or no added value in the eyes of the search engines.  That being said, as the number of pages that search engine spiders have to crawl grows every day, the limits of how much the spiders can crawl are being tested.  SEOs have recognized this, and it is reflected in efforts to trim page sizes down to make a site more appealing to the spiders.  Now it is time to start considering the significant benefits that a well-formed XHTML document can potentially offer to search engine spiders.

Parsing XML is faster than parsing HTML for one simple reason: XML documents are expected to be well-formed.  This saves the parser the extra overhead of "filling in the blanks" for a malformed document.  Working with XML also opens the door to query languages like XPath that provide fast and straightforward access to any given part of an XML document.  For instance, consider the following XHTML document:

<?xml version="1.0" encoding="utf-8" ?>
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
<html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en">
  <head>
    <title>My XHTML Document</title>
    <meta name="description" content="This is my XHTML document."/>
  </head>
  <body>
    <div>
      <h1>This is my XHTML document!</h1>
      <p>Here is some content.</p>
    </div>
  </body>
</html>

Now let's say we wanted to grab the contents of the title element from this document.  If we were to parse it as straight HTML we'd probably use a regular expression such as "<title>([^<]*)</title>" (as a quick aside, I want to clarify that HTML parsers are quite advanced and don't simply use regular expressions to read a document).  In Visual Basic the code to accomplish this would look like:

Imports System.Text.RegularExpressions

Class MyParser

  ' Pull the contents of the title element out of raw HTML with a regular expression.
  Function GetTitle(ByVal html As String) As String
    Return Regex.Match(html, "<title>([^<]*)</title>").Groups(1).Value
  End Function

End Class

If we were to use XPath, on the other hand, we would get something like this:

Imports System.Xml
Imports System.Xml.XPath

Class MyParser

  Function GetTitle(ByVal reader As XmlReader) As String
    ' Load the document into a read-only, XPath-optimized store.
    Dim doc As New XPathDocument(reader)
    Dim navigator As XPathNavigator = doc.CreateNavigator()

    ' The document declares the XHTML namespace, so register it for the query.
    Dim ns As New XmlNamespaceManager(navigator.NameTable)
    ns.AddNamespace("xhtml", "http://www.w3.org/1999/xhtml")

    Return navigator.SelectSingleNode("/xhtml:html/xhtml:head/xhtml:title", ns).Value
  End Function

End Class

Don't let the amount of code fool you.  While the first example accomplishes in a single line what takes the second example a handful of lines, the real value comes when dealing with a non-trivial document.  The first method would need to enumerate the elements in the document, which would involve either very complex regular expressions with added logic (because regular expressions are not well suited to parsing HTML), or the added overhead an existing HTML parser needs to accurately determine how the document was "intended" to be structured.  With XPath it is a simple matter of passing a different expression to the "navigator.SelectSingleNode" method.
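
To give a concrete sense of what "a different XPath expression" looks like in practice, here is a rough sketch (not from the original post) that reuses the navigator and namespace manager from the previous example to grab the meta description and enumerate every heading; the method name and output are purely illustrative:

Imports System.Xml
Imports System.Xml.XPath

Class MyParser

  ' Illustrative only: reuses the navigator and namespace manager set up in GetTitle.
  Sub PrintOutline(ByVal navigator As XPathNavigator, ByVal ns As XmlNamespaceManager)
    ' A different expression pulls the meta description instead of the title.
    Dim description As XPathNavigator = navigator.SelectSingleNode("/xhtml:html/xhtml:head/xhtml:meta[@name='description']/@content", ns)
    If description IsNot Nothing Then
      Console.WriteLine("Description: " & description.Value)
    End If

    ' A single expression enumerates every heading in the document.
    Dim headings As XPathNodeIterator = navigator.Select("//xhtml:h1 | //xhtml:h2 | //xhtml:h3", ns)
    While headings.MoveNext()
      Console.WriteLine(headings.Current.Name & ": " & headings.Current.Value)
    End While
  End Sub

End Class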

With that in mind, I constructed a very basic test to see what kind of speed difference we'd be looking at between HTML parsing and XML (XPath) parsing.  The test was simple: I created a well-formed XHTML document consisting of a title element, meta description and keywords elements, 150 paragraphs of Lorem Ipsum, one <h1> element, five <h2> elements, ten <h3> elements and ten anchor elements scattered throughout the document.

The test consisted of two methods, one using XPath and one using regular expressions.  The task of each method was simply to iterate through every element in the document once, repeating this 10,000 times while being timed.  Once finished, each method reported its elapsed time in milliseconds.  The test was kept deliberately simple because the results are only meant to roughly illustrate the performance difference between the two methods.  It was by no means an exhaustive performance analysis and should not be treated as such.
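
The original test code wasn't included in the post, but a minimal sketch of such a harness might look like the following; the ParseWithXPath and ParseWithRegex methods are hypothetical stand-ins for the two approaches shown earlier:

Imports System.Diagnostics

Module ParserBenchmark

  Sub RunBenchmark(ByVal xhtml As String)
    Const Iterations As Integer = 10000

    ' Time the XPath-based pass over every element in the document.
    Dim timer As Stopwatch = Stopwatch.StartNew()
    For i As Integer = 1 To Iterations
      ParseWithXPath(xhtml)
    Next
    timer.Stop()
    Console.WriteLine("XML Parsing (Using XPath) - " & timer.ElapsedMilliseconds & "ms")

    ' Time the regular-expression pass over the same document.
    timer = Stopwatch.StartNew()
    For i As Integer = 1 To Iterations
      ParseWithRegex(xhtml)
    Next
    timer.Stop()
    Console.WriteLine("HTML Parsing (Using RegEx) - " & timer.ElapsedMilliseconds & "ms")
  End Sub

  ' Hypothetical stand-ins for the two approaches shown earlier.
  Sub ParseWithXPath(ByVal xhtml As String)
    ' ...walk every element with an XPathNavigator...
  End Sub

  Sub ParseWithRegex(ByVal xhtml As String)
    ' ...walk every element with regular expressions...
  End Sub

End Module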

That being said, I ran the test 10 times and averaged the results for each method, resulting in the following:

XML Parsing (Using XPath) - 13ms

HTML Parsing (Using RegEx) - 1852ms

As I said, these results are very rough, and meant to illustrate the difference between the two methods rather than the exact times. 

These results should, however, give you something to consider with respect to the potential benefits of XHTML to a search engine spider.  We don't know exactly how the search engine spiders parse web documents, and that will likely never change.  We do know that search engines are constantly refining their internal processes, including spider logic, and given the substantial performance benefits of XML parsing, it doesn't seem too far-fetched to think that the search engines might already have their spiders capitalizing on well-formed XHTML documents with faster XML parsing, or are at least taking a very serious look at implementing that functionality in the near future.  Even a performance improvement of only 10ms adds up very quickly when multiplied against the tens of thousands of pages being spidered every day: 10ms saved on each of 10,000 pages is already well over a minute and a half of crawling time.

Visual Studio 2008 Initial Impressions - Part One

The release of Visual Studio 2008 and the .NET Framework 3.5 this past Monday has created a considerable buzz in the .NET community.  With language enhancements such as LINQ (Language Integrated Query) and lambda expressions, as well as a plethora of refinements to the IDE itself, there are a lot of new tools at our disposal.  I was very eager to get acquainted with these new tools, so I installed a copy of Team Edition and spent almost every free moment this week familiarizing myself with them.  Here are some of my initial impressions.

1) LINQ To SQL Classes

I work with a lot of applications that depend heavily on a backend database, so I've coded my fair share of business logic layers, which can be quite tedious.  LINQ To SQL classes take a lot of the grunt work out of that process by providing a convenient visual designer that performs automatic object-relational mapping.  All you have to do is drag a table or stored procedure from the Server Explorer onto the design surface and the designer automatically creates a strongly-typed object or method that is ready for use in your application.
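
To give a sense of what those generated classes buy you, here is a hedged sketch of querying one; the NorthwindDataContext and its Customers table are hypothetical placeholders for whatever the designer generates from your own database:

' NorthwindDataContext and Customers are placeholders for designer-generated classes.
Dim db As New NorthwindDataContext()

' A strongly-typed LINQ query against the mapped table.
Dim customersInBC = From c In db.Customers _
                    Where c.Region = "BC" _
                    Select c

For Each customer In customersInBC
  Console.WriteLine(customer.CompanyName)
Next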




2) Intellisense Enhancements

There were a couple of really nice usability enhancements to Intellisense in Visual Studio 2008.  Now, as you type, the Intellisense list automatically filters down based on what you've entered so far.  For instance, if you have typed "MyObject.ToS" the list is filtered to show only the items that start with "ToS", which does a nice job of speeding things up.  The other enhancement addresses an issue many people had with previous versions of Visual Studio: the Intellisense list would often obscure chunks of your code, forcing you to close it if you had to check something underneath.  Now you just hold "Ctrl" while the list is open and it becomes semi-transparent, letting you see the code underneath.

 



3) Improved IDE Performance

Not a "feature", necessarily, but a welcome improvement to Visual Studio.  You'll notice this as soon as you load the environment for the first time and discover how quickly it starts.  The performance improvements don't stop there, either; the IDE is a lot faster and more responsive throughout.


Stay tuned for Part Two, where I'll go into some more features of LINQ as well as some of the language upgrades given to Visual Basic.

Internet Explorer 7 Did Not Kill XHTML

Professionally, I make sure that I devote a certain amount of time every week to reading articles, whitepapers and blogs related to every aspect of web development. The subjects range from web design, to programming, to SEO, and to the ones I spend a considerable amount of time reading about: web standards, accessibility, and pretty much anything related to the W3C. The communities based around those "W3C"-centric subjects are host to some extraordinarily well-researched articles, posts and comments, due in large part to the time afforded by the relatively slow pace at which major changes occur in HTML, XHTML and CSS.

Lack Of Support

One topic of controversy in this area has been Internet Explorer 7's lack of support for the application/xhtml+xml MIME type, which essentially means not supporting true XHTML as specified by the W3C. Of course, with this being Microsoft, there is the expected amount of flak coming from the anti-Microsoft camp. That said, even once you've filtered out the extremes from the discourse, there are still a lot of people who think Internet Explorer 7 killed, or is killing, XHTML.
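
For context, the usual workaround in the meantime is content negotiation: check the browser's Accept header and only serve application/xhtml+xml to browsers that claim to support it, falling back to text/html for IE. A rough sketch in an ASP.NET code-behind (the page itself is hypothetical, not something from the post):

' Hypothetical ASP.NET Web Forms code-behind demonstrating content negotiation.
Protected Sub Page_Load(ByVal sender As Object, ByVal e As EventArgs) Handles Me.Load
  Dim accept As String = Request.Headers("Accept")
  If accept IsNot Nothing AndAlso accept.Contains("application/xhtml+xml") Then
    ' The browser advertises support for true XHTML.
    Response.ContentType = "application/xhtml+xml"
  Else
    ' IE7 and earlier get the document as plain HTML.
    Response.ContentType = "text/html"
  End If
End Sub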

The support of Internet Explorer is certainly an important factor in the mainstream adoption of a web specification, considering that all versions of IE together account for over 50% of web browsers used on the net. It seems reasonable that people would think that withholding support for true XHTML would be a devastating blow to a specification that has been continually rising in popularity. At a time when IE is still recovering from the frustration web developers everywhere felt over IE6's poor handling of CSS, it's not hard to see why a lot of people think the IE development team has a grudge against the W3C.

Tough But Fair

The IE development team is not stupid. They are also faced with the task of creating a browser that is used by millions of people every day. It is important to remember this, because decisions about the development of IE are hardly made lightly. In a post on the IE Blog, Chris Wilson, the lead program manager for the Internet Explorer platform and, incidentally, a member of the XHTML 1.0 W3C working group, explained why IE7 does not support XHTML served as application/xhtml+xml.

The reasoning was that implementing support would have meant hacking XML constructs into the existing HTML parser in IE7. That parser is built around compatibility, and even if support for properly served XHTML had been implemented, it would still have had to accommodate invalid documents, which is exactly what shouldn't happen with an XHTML document (example of what happens when attempting to view an invalid XHTML document served as application/xhtml+xml). If XHTML support amounted to the same silent tolerance of invalid documents, there would really be no point.

In fact, had IE7 implemented support in this fashion it would have been worse for XHTML. Take a look at how many HTML documents on the web even come close to validating against their DOCTYPE. Now think about how many of those documents use XHTML (usually because the developer is trying to look like they're on the cutting edge of the internet). If all of these non-valid XHTML documents had stopped working in Internet Explorer, the average IE user, discovering that a significant portion of the websites they visit no longer displayed in their browser, would have reverted to IE6 (shudder), which would certainly have been counter-productive to the goal of increasing the adoption of XHTML.

All in all, it is natural to get impatient waiting for proper, widespread support of XHTML. Just don't let that impatience make you lose sight of the big picture.