Colin Cochrane

Colin Cochrane is a Software Developer based in Victoria, BC specializing in C#, PowerShell, Web Development and DevOps.

del.icio.us Bans Search Engine Spiders

It appears that within the past two to three days the popular social bookmarking site del.icio.us has started blocking the major search engine spiders from crawling its site.  This isn't a simple robots.txt exclusion; a 404 response is now being served based on the requesting User-Agent.

While I was doing some Photoshop work for a site of mine tonight I needed to grab some custom shapes to make some icons.  I recalled having bookmarked a good resource for custom shapes in del.icio.us, but after searching my bookmarks with my del.icio.us add-on for Firefox I couldn't find it, so I went to my profile page on del.icio.us to search there.  To my surprise, I was greeted with this:

del.icio.us 404 Errors User Agent set to Googlebot

After confirming I hadn't mistyped the URL, I checked out the del.icio.us homepage and found that all was fine there.  However, upon trying to perform a search, I was confronted with the same 404 error, and received the same response when trying to navigate to any page other than the homepage. 

At this point I was thinking that there might have been some server issues going on with del.icio.us, but that didn't line up with my Firefox add-on still showing my bookmarks.  I then noticed that my User-Agent switcher add-on was active (not sending the default User-Agent header), and remembered that I had set it to switch my User-Agent to Googlebot earlier in the day while checking whether another site was cloaking (it was).

I reset the User-Agent switcher so it was sending my normal User-Agent header, tried accessing my del.icio.us page again, and was surprised to see that it no longer responded with a 404 error.  Puzzled, I took a look at del.icio.us' robots.txt and found that it was disallowing Googlebot, Slurp, Teoma, and msnbot for the following:

Disallow: /inbox
Disallow: /subscriptions
Disallow: /network
Disallow: /search
Disallow: /post
Disallow: /login
Disallow: /rss

Seeing that the robots.txt was blocking these search engine spiders, I tried accessing del.icio.us with my User-Agent switcher set to each of the disallowed User-Agents and received the same 404 response for each one.  I thought that there might have been some obscure issue with the add-in that was leading to this behaviour, so I popped open Fiddler, a nifty HTTP debugging proxy that I use to sniff HTTP headers.  Fiddler has a convenient feature that allows you to create HTTP requests manually, so I created a simple set of request headers and made HEAD and GET requests using the different User-Agents listed in the robots.txt.  I received the same responses as before.
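
If you want to run the same kind of check without Fiddler, here is a minimal VB.NET sketch that issues HEAD requests with arbitrary User-Agent headers and prints the status line that comes back.  The URL and User-Agent strings are only illustrative placeholders, not the exact requests I made.

Imports System.Net

Module UserAgentCheck

  ' Sends a HEAD request with the given User-Agent and prints the returned status line.
  Sub CheckAs(ByVal url As String, ByVal userAgent As String)
    Dim request As HttpWebRequest = CType(WebRequest.Create(url), HttpWebRequest)
    request.Method = "HEAD"
    request.UserAgent = userAgent
    Try
      Using response As HttpWebResponse = CType(request.GetResponse(), HttpWebResponse)
        Console.WriteLine("{0} --> {1} {2}", userAgent, CInt(response.StatusCode), response.StatusDescription)
      End Using
    Catch ex As WebException
      ' Error statuses (404, 403, etc.) are raised as WebExceptions with the response attached.
      Dim errorResponse As HttpWebResponse = CType(ex.Response, HttpWebResponse)
      If errorResponse IsNot Nothing Then
        Console.WriteLine("{0} --> {1} {2}", userAgent, CInt(errorResponse.StatusCode), errorResponse.StatusDescription)
      End If
    End Try
  End Sub

  Sub Main()
    ' Illustrative placeholders: compare a spoofed spider against a normal browser User-Agent.
    CheckAs("http://del.icio.us/search/", "Googlebot/2.1 (+http://www.google.com/bot.html)")
    CheckAs("http://del.icio.us/search/", "Mozilla/5.0")
  End Sub

End Module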

HEAD Request using Googlebot User-Agent

My interest was definitely piqued at this point.  I ran a site command against del.icio.us in Google restricted to the past 24 hours and found results as fresh as 15 hours old.

Recent Google Search Results for a site command ran against del.icio.us

Running a normal site command on del.icio.us revealed numerous results for which Google had a cached version, many of them only three days old.

This evidence seems to indicate that del.icio.us has recently started blocking the major search engine spiders from crawling its site based on the requesting User-Agent.  Given the recent crawl and cache dates, it looks like this started within the past two to three days.  This raises some questions as to the intentions of del.icio.us, and perhaps Yahoo!  With Yahoo! recently integrating del.icio.us bookmarks into its search results, this could be an attempt to enhance the effectiveness of that new feature by preventing competing search engines from indexing del.icio.us content.  While Yahoo!'s Slurp bot is also blocked, Yahoo! owns del.icio.us and wouldn't need to crawl the content of one of its own sites.

What are your thoughts on this?

ASP.NET Custom Errors: Preventing 302 Redirects To Custom Error Pages

 
You can download the HttpModule here.
 
Defining custom error pages is a convenient way to show users a friendly page when they encounter an HTTP error such as a 404 Not Found or a 500 Server Error.  Unfortunately, ASP.NET handles custom error pages by responding with a 302 redirect to the error page that was defined. For example, consider an application that has IIS configured to map all requests to it, and that has the following customErrors element defined in its web.config:
 
<customErrors mode="RemoteOnly" defaultRedirect="~/error.aspx">
<error statusCode="404" redirect="~/404.aspx" />
</customErrors>

If a user requested a page that didn't exist, then the HTTP response would look something like:

http://www.domain.com/non-existant-page.aspx --> 302 Found
http://www.domain.com/404.aspx  --> 404 Not Found
Date: Sat, 26 Jan 2008 03:08:21 GMT
Server: Microsoft-IIS/6.0
Content-Length: 24753
Content-Type: text/html; charset=utf-8
X-Powered-By: ASP.NET
 
As you can see, there is a 302 redirect that occurs to send the user to the custom error page.  This is not ideal for two reasons:

1) It's bad for SEO

When a search engine spider crawls your site and comes across a page that doesn't exist, you want to make sure you respond with an HTTP status of 404 and send it on its way.  Otherwise you may end up with duplicate content or indexing problems, depending on the spider and search engine.

2) It can lead to more incorrect HTTP status responses

This ties in with the first point, but can be significantly more serious.  If the custom error page is not configured to respond with the correct status code, then the HTTP response could end up looking like:

http://www.domain.com/non-existant-page.aspx --> 302 Found
http://www.domain.com/404.aspx  --> 200 OK
Date: Sat, 26 Jan 2008 03:08:21 GMT
Server: Microsoft-IIS/6.0
Content-Length: 24753
Content-Type: text/html; charset=utf-8
X-Powered-By: ASP.NET
 
This would almost guarantee duplicate content issues for the site with the search engines, as the search spiders are simply going to assume that the error page is a normal page like any other. Furthermore, it will probably cause some website and server administration headaches, as HTTP errors won't be accurately logged, making them harder to track and identify.
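
As an aside, that second problem can be mitigated on the error page itself by setting the status code explicitly.  Here is a minimal sketch, assuming a code-behind class for 404.aspx (the class name is illustrative); note that this only fixes the final status code and still leaves the intermediate 302 in place.

Partial Class ErrorPage404
  Inherits System.Web.UI.Page

  Protected Sub Page_Load(ByVal sender As Object, ByVal e As System.EventArgs) Handles Me.Load
    ' Report the real error status instead of the default 200 OK.
    Response.StatusCode = 404
  End Sub

End Class
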
I tried to find a solution to this problem, but all I could find were other people looking for the same thing.  So I did what I usually do, and created my own solution.
 
The solution comes in the form of a small HTTP module that hooks onto the HttpApplication.Error event.  When an error occurs, the module checks whether the error is an HttpException.  If it is, the following process takes place:
  1. The response headers are cleared (context.Response.ClearHeaders() )
  2. The response status code is set to match the actual HttpException.GetHttpCode() value (context.Response.StatusCode = HttpException.GetHttpCode())
  3. The customErrorsSection from the web.config is checked to see if the HTTP status code (HttpException.GetHttpCode() ) is defined.
  4. If the statusCode is defined in the customErrorsSection then the request is transferred, server-side, to the custom error page. (context.Server.Transfer(customErrorsCollection.Get(statusCode.ToString).Redirect) )
  5. If the statusCode is not defined in the customErrorsSection, then the response is flushed, immediately sending the response to the client.(context.Response.Flush() )

Here is the source code for the module.

Imports System.Web
Imports System.Web.Configuration

Public Class HttpErrorModule
  Implements IHttpModule

  Public Sub Dispose() Implements System.Web.IHttpModule.Dispose
    'Nothing to dispose.
  End Sub

  Public Sub Init(ByVal context As System.Web.HttpApplication) Implements System.Web.IHttpModule.Init
    AddHandler context.Error, New EventHandler(AddressOf Context_Error)
  End Sub

  Private Sub Context_Error(ByVal sender As Object, ByVal e As EventArgs)
    Dim context As HttpContext = CType(sender, HttpApplication).Context
    If (context.Error.GetType Is GetType(HttpException)) Then
      ' Get the Web application configuration.
      Dim configuration As System.Configuration.Configuration = WebConfigurationManager.OpenWebConfiguration("~/web.config")

      ' Get the customErrors section.
      Dim customErrorsSection As CustomErrorsSection = CType(configuration.GetSection("system.web/customErrors"), CustomErrorsSection)

      ' Get the collection of defined errors.
      Dim customErrorsCollection As CustomErrorCollection = customErrorsSection.Errors

      Dim statusCode As Integer = CType(context.Error, HttpException).GetHttpCode

      'Clears existing response headers and sets the desired ones.
      context.Response.ClearHeaders()
      context.Response.StatusCode = statusCode
      If (customErrorsCollection.Item(statusCode.ToString) IsNot Nothing) Then
        context.Server.Transfer(customErrorsCollection.Get(statusCode.ToString).Redirect)
      Else
        context.Response.Flush()
      End If

    End If

  End Sub

End Class

The following element also needs to be added to the httpModules element in your web.config (replace the attribute values if you aren't using the downloaded binary):

<httpModules>
<add name="HttpErrorModule" type="ColinCochrane.HttpErrorModule, ColinCochrane" />
</httpModules>
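
If the application were running under IIS 7's integrated pipeline instead of IIS 6 (which these examples assume), the module would presumably also need to be registered under system.webServer, mirroring the entry above:

<system.webServer>
  <modules>
    <add name="HttpErrorModule" type="ColinCochrane.HttpErrorModule, ColinCochrane" />
  </modules>
</system.webServer>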

And there you go! No more 302 redirects to your custom error pages.

Web Standards: The Ideal And The Reality

There has been a flurry of reactions to the IE8 development team's recent announcement about the new version-targeting meta declaration that will be introduced in Internet Explorer 8. In an article I posted on the Metamend SEO Blog yesterday, I looked at how this feature could bring IE8 and web standards a lot closer together and find the ideal balance between backwards-compatibility and interoperability.  Many, however, did not share my optimism and saw this as another cop-out by Microsoft that would continue to hold back the web standards movement.  Since this topic involves both Internet Explorer/Microsoft and web standards, I naturally came across a lot of heated discussion.  As I read more and more of it, I was once again reminded of how many people take an unreasonably hard stance on the issue of web standards and browser support.  When it comes to a topic as complex as web standards and interoperability, it is crucial to consider all factors, both theoretical and practical; otherwise the discussion inevitably takes on a "you're with us or against us" mentality that does little to benefit anyone.

The Ideal

Web standards are intended to bring consistency to the Web.  The ultimate ideal is a completely interoperable web, independent of platform or agent.  The more realistic ideal is a set of rules for the creation of content that, if followed, would ensure consistent presentation regardless of the client's browser.  This would allow web developers who followed these rules to be safe in the knowledge that their content would be presented as they intended for all visitors.

The Reality

Web standards are attempting to bring consistency to what is an enormously complex and vast collection of mostly inconsistent data.  Even with more web pages being created that are built on web standards, there is still, and will always be, a subset of this collection that is non-standard.  There will never be an entirely interoperable web, nor would anyone reasonably expect there to be.  The reasonable expectation is that web standards are adopted by those who develop new content or modify existing content, and that the major web browsers will be truly standards-compliant in their presentation, so that web developers need not worry about cross-browser compatibility.

One aspect that is often forgotten is the average internet user.  They don't care about standards, DOCTYPEs or W3C recommendations.  All they care about is being able to visit a web site and have it display correctly, as they should.  This is what puts the browser developers in a bind, because the browser business is competitive and it's hard to increase your user base if most pages on the web break when viewed with your product.  A degree of backwards-compatibility is absolutely essential, and denying that is simply ignorant.  This leads to something of a catch-22, however, because on the other side of the coin are the website owners who may not have the resources (be it time or money), or simply lack the desire, to redevelop their sites.  They are unlikely to make a substantial investment to bring their sites up to code for the sole reason of standards-compliance unless there is a benefit in doing so, or a harm in not doing so.  While the more vigorous supporters of web standards may wag their fingers at Microsoft for spending time worrying about backwards compatibility, you can be sure that if businesses were suddenly forced to spend tens of thousands of dollars to make their sites work in IE, Microsoft would be on the receiving end of a lot more than finger wagging.

I admit this was a minor rant.  As a supporter of web standards, I get a great deal of enjoyment out of good, honest discourse regarding their development and future.  This makes it all the more frustrating to read article after article and post after post that take closed-minded stances, becoming dams in the flow of discussion.  The advancement of web standards is, and only can be, a collaborative effort, and this effort will be most productive when everyone enters into it with their ears open and their egos left at the door.

Please Don't Urinate In The Pool: The Social Media Backlash

The increasing interest of the search engine marketing community in social media has resulted in more and more discussion about how to get in on the "traffic goldrush".  As an SEO, I appreciate the enthusiasm in exploring new methods for maximizing exposure for a client's site, but as a social media user I am finding myself becoming increasingly annoyed with the number of people that are set on finding ways to game the system.

The Social Media Backlash

My focus for the purposes of this post will be StumbleUpon, which is my favourite social media community by far.  That said, most of what I say will be applicable to just about any social media community, so don't stop reading just because you're not a stumbler.  Within the StumbleUpon community there has been a surprisingly strong, and negative, reaction to those who write articles/blog posts that explore methods for leveraging StumbleUpon to drive the fabled "server crashing" levels of traffic, or dissect the inner workings of the stumbling algorithm in order to figure out how to get that traffic with the least amount of effort and contribution necessary.

"What Did I Do?"

When one of these people would end up on the receiving end of the StumbleUpon community's ire, they would be surprised. Instinctively, with perfectly crafted link-bait in hand, they would chronicle how they fell victim to hordes of angry stumblers, and express their disappointment while condemning the community for being so harsh.  Then, with anticipation of the inevitable rush of traffic their tale would attract to their site, they would hit the "post" button and quickly submit their post to their preferred social media channels.  What they didn't realize was that they were proving the reason for the community's backlash the instant they pressed "post".

Please Don't Urinate In The Pool

To explain that reason, we need to look at the reason people actually use StumbleUpon.  The biggest reason is the uncanny ability that it has for providing its users with a virtually endless supply of content that is almost perfectly targeted to them.  When this supply gets tainted, the user experience is worsened, and the better that the untainted experience is, the less tolerant the users will be of any tainting.

To illustrate, allow me to capitalize on the admittedly crude analogy found in the heading of this section.  Let's think of the StumbleUpon community as a group of friends at a pool party.  They are having a lot of fun, enjoying each other's company, when they discover someone has been urinating in the pool.  The cleaner the water was before, the more everyone is going to notice the unwelcome "addition" to the water.  When they find out who urinated in the pool, they are going to be understandably angry with them.  To stretch this analogy a little further, you can be damned sure that they wouldn't be happy when they found out that someone was telling everyone methods for strategically urinating in certain areas of the pool in order to maximize the number of people who would be exposed to the urine.

For anyone who was in the group of friends, and actually used and enjoyed the pool, the idea of urinating in it wouldn't even be an option.  Or, in the case of StumbleUpon, someone who actually participated in the community and enjoyed the service, wouldn't want to pollute it.

Catching Unwanted Spiders And Content Scraping Bots In ASP.NET


If you have a blog that is even moderately popular then you have likely fallen victim to some form of content scraping.  Ever since it became possible to earn money through ads on a website there have been people trying to find ways to cheat the system.  The most widespread example of this comes in the form of splogs and similar spam-based websites, which consist only of ads from Google AdSense and duplicated content that is scraped from other sites.  In this post I will share a method you can use to identify "evil" spiders and content scraping bots that are wasting your website's resources.

I'll start off by defining what is considered an "evil" spider/bot.  For our purposes here, we'll be looking at spiders and bots that ignore robots.txt and nofollow when crawling a site.  These are spiders and bots that offer no value to you in allowing them to crawl your site, as the major search engines use spiders and bots that respect these rules (with the unique exception of MSN, which employs a bot that presents itself as a regular user in order to identify sites that serve different content to search engine spiders than to users).

Of these valueless spiders, some are almost certainly going to be some form of content scraping bot, which is sent to literally copy the content of your site for use elsewhere.  It is in your best interest to limit how much of your content gets scraped because you want visitors coming to your site, not some spam-filled facsimile.

This method of identifying unwanted spiders involves setting a trap, which can be created as follows:

1) Create a Hidden Page

To identify these undesired visitors you need to isolate them.  Create a page on your site, but do not link to it from anywhere just yet.  For the purposes of my examples, I'll call our example page "trap.aspx".


Now you want to disallow this page in your robots.txt.
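
Assuming the trap page sits at the root of the site, the relevant robots.txt entry would look something like this:

User-agent: *
Disallow: /trap.aspx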


With the trap page disallowed in robots.txt, good spiders will not crawl it.  What is needed now is a link to the trap page with the rel="nofollow" attribute, which should be placed on your home page for maximum effect.  The link must be invisible to users, otherwise you might mistake an unwitting visitor for a bad spider.

<a rel="nofollow" href="/trap.aspx" style="display:none;" />

This creates a situation in which the only requests for "/trap.aspx" will be from a spider or bot that ignores both robots.txt and nofollow, which is exactly the kind of bots we want to identify.

2) Create a Log File

Create an XML document and name it "trap.xml" (or whatever you want) and place it in the App_Data folder of your application (or wherever you want, as long as the application has write-access to the directory).  Open the new XML document and create an empty root-element "<trapRequests>" and ensure it has a complete closing tag.

<?xml version="1.0" encoding="utf-8"?>
<trapRequests>
</trapRequests>
 
You can use whatever method of logging the requests works best for you; you do not need to use an XML document.  I am using XML for the purposes of this example.

3) Log What Gets Caught In The Trap

With the trap in place, you now want to keep track of the requests being made for "trap.aspx".  This can be accomplished quite easily using LINQ, as illustrated in the following example:

Imports System.Xml.Linq

Partial Class trap_aspx
  Inherits System.Web.UI.Page

  Protected Sub Page_Load(ByVal sender As Object, ByVal e As System.EventArgs) Handles Me.Load
    LogRequest(Request.UserHostAddress, Request.UserAgent)
  End Sub

  Private Sub LogRequest(ByVal ipAddress As String, ByVal userAgent As String)
    Dim logFile As XDocument

    Try
      logFile = XDocument.Load(Server.MapPath("~/App_Data/trap.xml"))
      logFile.Root.AddFirst(<request>
                              <date><%= Now.ToString %></date>
                              <ip><%= ipAddress %></ip>
                              <userAgent><%= userAgent %></userAgent>
                            </request>)
      logFile.Save(Server.MapPath("~/App_Data/trap.xml"))
    Catch ex As Exception
      My.Log.WriteException(ex)
    End Try
  End Sub

End Class

This code sets it up so every request for this page is logged with:

  1. The Date and Time of the request.
  2. The IP address of the requesting agent.
  3. The User Agent of the requesting agent.

You can, of course, customize what information is logged to your preference.  The code will need to be adjusted if you are using a different storage method.  Once done, you will end up with an XML log file (or your custom store) with every request to "trap.aspx" that will look like:

<?xml version="1.0" encoding="utf-8"?>
<trapRequests>
<request>
<date>12/30/2007 12:54:20 PM</date>
<ip>1.2.3.4</ip>
<userAgent>ISCRAPECONTENT/1.2</userAgent>
</request>
<request>
<date>12/30/2007 2:31:51 PM</date>
<ip>2.3.4.5</ip>
<userAgent>BADSPIDER/0.5</userAgent>
</request>
</trapRequests>
 
Now you've set your trap and any unwanted bots and spiders that find it will be logged.  You are then free to use the logged data to deny access to offending IPs, User Agents, or by whatever criteria you decide is appropriate for your site.
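
As a rough sketch of that last step, assuming you periodically pull the offending IP addresses out of trap.xml into a list (the hard-coded addresses below are simply the ones from the sample log above), a check in Global.asax could turn those visitors away:

Imports System.Collections.Generic
Imports System.Web

Public Class Global_asax
  Inherits System.Web.HttpApplication

  ' IP addresses harvested from the trap log (illustrative values).
  Private Shared ReadOnly _bannedIps As New HashSet(Of String)(New String() {"1.2.3.4", "2.3.4.5"})

  Sub Application_BeginRequest(ByVal sender As Object, ByVal e As EventArgs)
    ' Refuse requests from anything that was caught in the trap.
    If _bannedIps.Contains(Request.UserHostAddress) Then
      Response.StatusCode = 403
      Response.End()
    End If
  End Sub

End Class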

GMail Security Exploit Allows Backdoor Into Your Account

I normally don't discuss topics related to online security, but I recently came across a post over at Sphinn that detailed how David Airey had his domain hijacked thanks to a security exploit in GMail, and I wanted to help make sure as many people as possible are made aware of it.  I recommend that any of you who use GMail read David's post detailing the trouble this has caused him, and make sure that your own GMail account hasn't been compromised.

Not the most festive post for an early Christmas morning, but I'll wish everyone a Merry Christmas anyways.

How Being An SEO Analyst Made Me A Better Web Developer

Being a successful web developer requires constant effort to refine your existing abilities while expanding your skill-set to include the new technologies that are continually released, so when I first started my job as a search engine optimization analyst I was fully expecting my web development skills to dull.  I could not have been more wrong.

Before I get to the list of reasons why being an SEO analyst made me a better web developer, I'm going to give a quick overview of how I got into search engine optimization in the first place. Search engine optimization first captured my interest when I wanted to start driving more search traffic to a website I developed for the BCDA (for which I continue to volunteer my services as webmaster). Due to some dissatisfaction with our hosting provider I decided that we would switch hosts as soon as our existing contract expired and go with one of my preferred providers. However, as a not-for-profit organization the budget for the website was limited and the new hosting was going to cost a little more, so I decided to set up an AdSense account to bring in some added income.

The expectations weren't high; I was hoping to bring in enough revenue through the website to cover hosting costs.  At that point I did not have much experience with SEO, so I started researching and looking for strategies I could use on the site.  As I read more and more articles, blogs and whitepapers I became increasingly fascinated with the industry while applying all of that newfound knowledge to the site.  Soon after, I responded to a job posting at an SEO firm and shortly thereafter was starting my new career as an SEO analyst.

My first few weeks at the job were spent learning procedures, familiarizing myself with the various tools that we use, and, most importantly, honing my SEO skills.  I spent the majority of my time auditing and reporting on client sites, which exposed me to a lot of different websites, programming and scripting languages, and tens of thousands of lines of code.  During this process I realized that my web development skills weren't getting worse; they were actually getting better.  The following list examines the reasons for this improvement.

1) Coding Diversity

To properly analyze a site, identify problems, and be able to offer the right solutions I often have to go deeper than just HTML on a website.  This meant that I had to be proficient at coding in a variety of different languages, because I don't believe in pointing out problems with a site unless I can recommend how to best fix them.  Learning the different languages came quickly from the sheer volume of code I was faced with every day, and got easier with each language I learned.

2) Web Standards and Semantic Markup

In a recent post, Reducing Code Bloat Part Two: Semantic HTML, I discussed the importance of semantic HTML in keeping the markup of your web documents lean and tidy.  While I have always been a proponent of web standards and semantic markup, my experience in SEO has served to solidify my beliefs.  After you have pored through 12,000 lines of markup that should have been 1,000, or spent two hours implementing style modifications that should have taken five minutes, the appreciation for semantic markup and web standards is quickly realized.

3) Usability and Accessibility

Once I've optimized a site to draw more search traffic I need to help make sure that that traffic actually sticks around.  A big part of this is the usability and accessibility of a site.  There are a lot of other websites out there for people to visit and they are not going to waste time trying to figure out how to navigate through a meandering quagmire of a design.  This aspect of my job forces me to step into the shoes of the average user, which is something that a lot of developers need to do more often.  It has also made me more considerate about accessibility when utilizing features and technologies such as AJAX, ensuring that the site is still usable when a given feature is disabled or not supported.

4) The Value of Content

Before getting into SEO, I was among the many web developers guilty of thinking that a website's success can be ensured by implementing enough features, and that enough cool features could make up for a lack of simple, quality content.  Search engine optimization taught me the value of content, and that the right balance of innovative features and content will greatly enhance the effectiveness of both.

That covers some of the bigger reasons that working as an SEO analyst made me a better web developer.  Chances are that I will follow up on this post in the future with more reasons that I am sure to realize as I continue my career in SEO.  In fact, one of the biggest reasons I love working in search engine optimization and marketing is that it is an industry that is constantly changing and evolving, and there is always something new to learn.

Reducing Code Bloat Part Two - Semantic HTML

In my first post on this subject, Reducing Code Bloat - Or How To Cut Your HTML Size In Half, I demonstrated how you can significantly reduce the size of a web document by simply moving style definitions externally and getting rid of a table-based layout.  In this installment I will look at the practice of semantic HTML and how effective it can be at keeping your markup tidy and lean.

What is Semantic HTML?

Semantic HTML, in a nutshell, is the practice of creating HTML documents that contain only the author's "meaning" and not how that meaning is presented.  This boils down to using only structural elements and not presentational elements.  A common issue with semantic HTML is identifying the often subtle differences between elements that represent the author's meaning and elements that are merely presentational.  Consider the following examples:

<p>
I really felt the need to <i>emphasize</i> the point.
</p>

 

<p>
I really felt the need to <em>emphasize</em> the point.
</p>

Many people would consider these two snips of code to be essentially the same: a paragraph, with the word "emphasize" in italics.  When considering the semantic value of the elements, however, there is one significant difference.  The <i> element is purely presentational (or visual); it has no meaning in the semantic structure of the document.  The <em> element, on the other hand, has meaningful semantic value in the document's structure because it defines its contents as being emphasized.  Visually, we usually see the contents of both <i> and <em> elements rendered as italicized text, which is why the difference often seems like nitpicking, but the italicizing of <em> elements is simply the choice made by most web browsers; the element has no inherent visual style.

Making Your Markup Work for You

Some of you might be thinking to yourself "some HTML tags have a different meaning than others? So what?".  Certainly a reasonable response, because when it comes down to it, HTML semantics can seem like pointless nitpicking.  That being said, this nitpicking drives home what HTML is really supposed to do: describe structure, not presentation.  Think of it like being a writer for a magazine.  You write your article, including paragraphs and indicating words or sentences that should be emphasized, or considered more strongly, in relation to the rest, and submit it.  You don't decide what font is used, the colour of the text, or how much space should be between each line; that's the editor's job.  It is exactly the same concept with semantic HTML: the structure and meaning of the content is the job of HTML, and the presentation is the job of the browser and CSS.

Without the burden of presentational elements you only have to worry about a core group of elements with meaningful structural value.  For instance...

<p>
This is a chunk of text <span class="red14pxitalic">where</span> random words 
are <span class="green16pxbold">styled</span> differently,
<span class="red14pxitalic">with</span> some words 
<span class="green16pxbold">being</span> red, italicized and at 14px, and others
being green, bold, and 16px.
</p>
 
with an external style-sheet definition...
span.red14pxitalic{color:red;font-size:14px;font-style:italic;}
span.green16pxbold{color:green;font-size:16px;font-weight:bold;}

...turns into this...

<p>
This is a chunk of text <em>where</em> random words 
are <strong>styled</strong> differently, <em>with</em> some words 
<strong>being</strong> red, italicized and at 14px, and others being green, 
bold, and 16px.
</p>

with an external style-sheet definition...

p em{color:red;font-size:14px;font-style:italic;}
p strong{color:green;font-size:16px;font-weight:bold;}

The second example accomplishes the same visual result as the first, but contains actual semantic worth in the document's structure.  It also illustrates how much simpler it is to create a new document because you don't have to worry about style.  You create the document and use meaningful semantic elements to identify whatever parts of the content are necessary, and let the style-sheet take care of the rest (assuming of course that the style-sheet was properly defined with the necessary styles).  By using a more complete range of HTML elements you will find yourself needing <span class="whatever"> tags less and less and find your markup becoming cleaner, easier to read, and smaller.

 

Code Size Comparison

 

                HTML Characters (Including Spaces)   CSS Characters (Including Spaces)
Example One     302                                   127
Example Two     216                                   103

 

Part Three will continue looking at semantic HTML, as well as strategies you can use when defining your style framework.

Is There Added Value In XHTML To Search Engine Spiders?

The use of XHTML in the context of SEO is a matter of debate.  The consensus tends to be that using XHTML falls into the category of optimization efforts that provide certain benefits for the site as a whole (extensibility, ability to use XSL transforms) but offers little or no added value in the eyes of the search engines.  That being said, as the number of pages that search engine spiders have to crawl continues to increase every day, the limits to how much the spiders can crawl are being tested.  This has been recognized by SEOs and is reflected in efforts to trim page sizes down to make a site more appealing to the spiders.  Now it is time to start considering the significant benefits that a well-formed XHTML document can potentially offer to search engine spiders.

Parsing XML is faster than parsing HTML for one simple reason: XML documents are expected to be well-formed.  This saves the parser from having to spend the extra overhead involved in "filling in the blanks" with a non-valid document.  Dealing with XML also opens the door to the use of speedy languages like XPath that provide fast and straightforward access to any given part of an XML document.  For instance, consider the following XHTML document:

<?xml version="1.0" encoding="utf-8" ?>
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
<html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en">
  <head>
    <title>My XHTML Document</title>
    <meta name="description" content="This is my XHTML document."/>
  </head>
  <body>
    <div>
      <h1>This is my XHTML document!</h1>
      <p>Here is some content.</p>
    </div>
  </body>
</html>

Now let's say we wanted to grab the contents of the title element from this document.  If we were to parse it as straight HTML we'd probably use a regular expression such as "<title>([^<]*)</title>" (As a quick aside, I want to clarify that HTML parsers are quite advanced and don't simply use regular expressions to read a document).  In Visual Basic the code to accomplish this would look like:

Imports System.Text.RegularExpressions

Class MyParser

  Function GetTitle(ByRef html As String) As String 
    Return RegEx.Match(html,"<title>([^<]*)</title>").Groups(1).Value 
  End Function 

End Class 

If we were to use XPath, on the other hand, we would get something like this:

Imports System.Xml

Class MyParser

  Function GetTitle(ByRef reader As XmlReader) As String
    Dim doc As New XPath.XPathDocument(reader)
    Dim navigator As XPath.XPathNavigator = doc.CreateNavigator
    ' The sample document declares the XHTML namespace, so the XPath
    ' expression needs a namespace manager in order to match its elements.
    Dim ns As New XmlNamespaceManager(navigator.NameTable)
    ns.AddNamespace("x", "http://www.w3.org/1999/xhtml")
    Return navigator.SelectSingleNode("/x:html/x:head/x:title", ns).Value
  End Function

End Class

Don't let the amount of code fool you.  While the first example accomplishes in a single line what takes the second example a few more, the real value comes when dealing with a non-trivial document.  The first method would need to enumerate the elements in the document, which would involve either very complex regular expressions with added logic (because regular expressions are not well suited to parsing HTML), or the added overhead necessary for an existing HTML parser to accurately determine how the document was "intended" to be structured.  Using XPath is a simple matter of using a different XPath expression for the "navigator.SelectSingleNode" method.

With that in mind, I constructed a very basic test to see what kind of speed differences we'd be looking at between HTML parsing and XML (using XPath) parsing.  The test was simple: I created a well-formed XHTML document consisting of a title element, meta description and keywords elements, 150 paragraphs of Lorem Ipsum, 1 <h1> element, 5 <h2> elements, 10 <h3> elements and 10 anchor elements scattered throughout the document.

The test consisted of two methods, one using XPath, and one using Regular Expressions.  The task of each method was to simply iterate through every element in the document once, and repeat this task 10000 times while being timed.  Once completed it would spit out the elapsed time in milliseconds that it took to complete.  The test was kept deliberately simple because the results are only meant to very roughly illustrate the performance differences between the two methods.  It was by no means an exhaustive performance analysis and should not be considered as such.
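
The exact expressions and parsing logic from the test aren't reproduced here, so the following is only a rough VB.NET skeleton of how such a timing comparison could be set up; the file name, iteration count, and the (much simplified) regular expression are illustrative, and its output should not be expected to match the figures below.

Imports System.Diagnostics
Imports System.IO
Imports System.Text.RegularExpressions
Imports System.Xml.XPath

Module ParserBenchmark

  Sub Main()
    Dim xhtml As String = File.ReadAllText("test.xhtml")

    ' Time 10,000 passes of an XPath-based walk over every element.
    Dim xpathTimer As Stopwatch = Stopwatch.StartNew()
    For i As Integer = 1 To 10000
      Dim doc As New XPathDocument(New StringReader(xhtml))
      Dim elements As XPathNodeIterator = doc.CreateNavigator().Select("//*")
      While elements.MoveNext()
        Dim name As String = elements.Current.LocalName
      End While
    Next
    xpathTimer.Stop()

    ' Time 10,000 passes of a regular-expression scan for opening tags
    ' (a stand-in for the more involved expressions a real test would need).
    Dim regexTimer As Stopwatch = Stopwatch.StartNew()
    For i As Integer = 1 To 10000
      For Each m As Match In Regex.Matches(xhtml, "<([a-zA-Z][^\s>/]*)[^>]*>")
        Dim name As String = m.Groups(1).Value
      Next
    Next
    regexTimer.Stop()

    Console.WriteLine("XML Parsing (Using XPath) - {0}ms", xpathTimer.ElapsedMilliseconds)
    Console.WriteLine("HTML Parsing (Using RegEx) - {0}ms", regexTimer.ElapsedMilliseconds)
  End Sub

End Module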

That being said, I ran the test 10 times and averaged the results for each method, resulting in the following:

XML Parsing (Using XPATH) - 13ms

HTML Parsing (Using RegEx) - 1852ms

As I said, these results are very rough, and meant to illustrate the difference between the two methods rather than the exact times. 

These results should, however, give you something to consider in respect to the potential benefits of XHTML to a search engine spider.  We don't know how search engine spiders are parsing web documents, and that will likely never change.  We do know that search engines are constantly refining their internal processes, including spider logic, and with the substantial performance benefits of XML parsing, it doesn't seem too far-fetched to think that the search engines might have their spiders capitalizing on well-formed XHTML documents with faster XML parsing, or are at least taking a very serious look at implementing that functionality in the near future.  If you consider even a performance improvement of only 10ms, when you multiply that against the tens of thousands of pages being spidered every day, those milliseconds add up very quickly.

Using CSS To Create Two Common HTML Border Effects

Separating the style from the markup of a web document is generally a painless, if sometimes time-consuming, task.  In many cases, however, the process can have some added speed-bumps, most notably when the original HTML is using an infamous table-based layout.  The two most common speed-bumps when dealing with table-based layouts and styling are recreating the classic borderless table and keeping the default table border appearance.

The appearance of these two kinds of tables is as follows:

Default Border

1 2
3 4

Borderless

1 2
3 4

The markup for these two tables looks like:


<!--Default Border -->
<table border="1">
<tbody>
<tr>
<td>1</td>
<td>2</td>
</tr>
<tr>
<td>3</td>
<td>4</td>
</tr>
</tbody>
</table>
<!-- Borderless -->
<table border="0" cellspacing="0" cellpadding="0">
<tbody>
<tr>
<td>1</td>
<td>2</td>
</tr>
<tr>
<td>3</td>
<td>4</td>
</tr>
</tbody>
</table>


If you want to get the same effects while losing the HTML attributes, you can use the following CSS:

Default Border


table{border-spacing:0px;border:solid 1px #8D8D8D;width:130px;}
table td{
border:solid 1px #C0C0C0;
border-bottom:solid 1px #8D8D8D;
border-left:solid 1px #8D8D8D;
display:table-cell;
margin:0;
padding:0;}



Borderless


table{border:none;border-collapse:collapse;}
table td{padding:0;margin:0;}


Duplicating the default table border look requires extra rules in its style definition because the default border contains two shades, so the border-color values must be set accordingly.

That is the basic method for replicating with CSS the HTML table effects that are usually created with HTML attributes.