Colin Cochrane

Colin Cochrane is a Software Developer based in Victoria, BC specializing in C#, PowerShell, Web Development and DevOps.

Catching Unwanted Spiders And Content Scraping Bots In ASP.NET


If you have a blog that is even moderately popular, then you have likely fallen victim to some form of content scraping.  Ever since it became possible to earn money through ads on a website, there have been people trying to find ways to cheat the system.  The most widespread example of this comes in the form of splogs and similar spam-based websites, which consist of nothing but Google AdSense ads and duplicated content scraped from other sites.  In this post I will share a method you can use to identify "evil" spiders and content scraping bots that are wasting your website's resources.

I'll start off by defining what is considered an "evil" spider/bot.  For our purposes here, we'll be looking at spiders and bots that ignore robots.txt and nofollow when crawling a site.  These spiders and bots offer no value to you in exchange for crawling your site, as the major search engines use spiders and bots that respect these rules (with the unique exception of MSN, which employs a certain bot that presents itself as a regular user in order to identify sites that serve different content to search engine spiders than to users).

Of these valueless spiders, some are almost certainly going to be some form of content scraping bot, which is sent to literally copy the content of your site for use elsewhere.  It is in your best interest to limit how much of your content gets scraped because you want visitors coming to your site, not some spam-filled facsimile.

This method of identifying unwanted spiders involves setting a trap, which can be done as follows:

1) Create a Hidden Page

To identify these undesired visitors you need to isolate them.  Create a page on your site, but do not link to it from anywhere just yet.  For the purposes of this example, I'll call the page "trap.aspx".

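The page itself can be essentially empty, since the logging will live in its code-behind (see step 3).  A minimal trap.aspx might look something like this (the CodeFile and Inherits values are assumptions chosen to match the class used later in this post):

<%@ Page Language="VB" AutoEventWireup="false" CodeFile="trap.aspx.vb" Inherits="trap_aspx" %>
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
<html xmlns="http://www.w3.org/1999/xhtml">
  <head runat="server">
    <title></title>
  </head>
  <body>
  </body>
</html>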

Now you want to disallow this page in your robots.txt.

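Assuming the trap page sits at the root of the site, the robots.txt entry would look something like this:

User-agent: *
Disallow: /trap.aspx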

With the trap page disallowed in robots.txt, good spiders will not crawl it.  What is needed now is a link to the trap page with the rel="nofollow" attribute, which should be placed on your home page for maximum effect.  The link must be invisible to users, otherwise you might mistake an unwitting visitor for a bad spider.

<a rel="nofollow" href="/trap.aspx" style="display:none;"></a>

This creates a situation in which the only requests for "/trap.aspx" will come from spiders or bots that ignore both robots.txt and nofollow, which are exactly the kind of bots we want to identify.

2) Create a Log File

Create an XML document and name it "trap.xml" (or whatever you want) and place it in the App_Data folder of your application (or wherever you want, as long as the application has write access to the directory).  Open the new XML document and create an empty root element "<trapRequests>", making sure it has a matching closing tag.

<?xml version="1.0" encoding="utf-8"?>
<trapRequests>
</trapRequests>
 
You can use whatever method is best for you to log the requests; you do not need to use an XML document.  I am using XML for the purposes of this example.

3) Log What Gets Caught In The Trap

With the trap in place, you now want to keep track of the requests being made for "trap.aspx".  This can be accomplished quite easily using LINQ, as illustrated in the following example:

Imports System.Xml.Linq

Partial Class trap_aspx
    Inherits System.Web.UI.Page

    Protected Sub Page_Load(ByVal sender As Object, ByVal e As System.EventArgs) Handles Me.Load
        ' Record every request that reaches the trap page.
        LogRequest(Request.UserHostAddress, Request.UserAgent)
    End Sub

    Private Sub LogRequest(ByVal ipAddress As String, ByVal userAgent As String)
        Dim logFile As XDocument

        Try
            logFile = XDocument.Load(Server.MapPath("~/App_Data/trap.xml"))
            ' Add the newest request to the top of the log.
            logFile.Root.AddFirst(<request>
                                      <date><%= Now.ToString %></date>
                                      <ip><%= ipAddress %></ip>
                                      <userAgent><%= userAgent %></userAgent>
                                  </request>)
            logFile.Save(Server.MapPath("~/App_Data/trap.xml"))
        Catch ex As Exception
            My.Log.WriteException(ex)
        End Try
    End Sub

End Class

This code sets it up so every request for this page is logged with:

  1. The Date and Time of the request.
  2. The IP address of the requesting agent.
  3. The User Agent of the requesting agent.

You can, of course, customize what information is logged to your preference.  The code will need to be adjusted if you are using a different storage method.  Once done, you will end up with an XML log file (or your custom store) recording every request to "trap.aspx", which will look like this:

<?xml version="1.0" encoding="utf-8"?>
<trapRequests>
  <request>
    <date>12/30/2007 12:54:20 PM</date>
    <ip>1.2.3.4</ip>
    <userAgent>ISCRAPECONTENT/1.2</userAgent>
  </request>
  <request>
    <date>12/30/2007 2:31:51 PM</date>
    <ip>2.3.4.5</ip>
    <userAgent>BADSPIDER/0.5</userAgent>
  </request>
</trapRequests>
 
Now you've set your trap, and any unwanted bots and spiders that find it will be logged.  You are then free to use the logged data to deny access by IP address, by User Agent, or by whatever criteria you decide are appropriate for your site.
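To give a rough idea of what that could look like, here is a minimal sketch (not part of the trap itself) that refuses requests from previously trapped IP addresses by checking the log in Global.asax.  A real implementation would want to cache the blocked list rather than re-read trap.xml on every request:

<%@ Application Language="VB" %>
<%@ Import Namespace="System.Xml.Linq" %>

<script runat="server">

    Sub Application_BeginRequest(ByVal sender As Object, ByVal e As EventArgs)
        ' Load the trap log and compare the requesting IP against every logged request.
        Dim logFile As XDocument = XDocument.Load(Server.MapPath("~/App_Data/trap.xml"))
        Dim requestIp As String = Request.UserHostAddress

        For Each trapped As XElement In logFile.Root.Elements("request")
            If trapped.Element("ip").Value.Trim() = requestIp Then
                ' The IP has hit the trap page before, so refuse the request outright.
                Response.StatusCode = 403
                Response.End()
            End If
        Next
    End Sub

</script>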

GMail Security Exploit Allows Backdoor Into Your Account

I normally don't discuss topics related to online security, but I recently came across a post over at Sphinn detailing how David Airey had his domain hijacked thanks to a security exploit in GMail, and I wanted to help make sure as many people as possible are aware of it.  I recommend that anyone who uses GMail read David's post, which details the trouble this has caused him, and make sure that your own GMail account hasn't been compromised.

Not the most festive post for an early Christmas morning, but I'll wish everyone a Merry Christmas anyways.

How Being An SEO Analyst Made Me A Better Web Developer

Being a successful web developer requires constant effort to refine your existing abilities while expanding your skill-set to include the new technologies that are continually released, so when I first started my job as a search engine optimization analyst I was fully expecting my web development skills to dull.  I could not have been more wrong.

Before I get to the list of reasons why being an SEO analyst made me a better web developer, I'm going to give a quick overview of how I got into search engine optimization in the first place.  Search engine optimization first captured my interest when I wanted to start driving more search traffic to a website I developed for the BCDA (and for which I continue to volunteer my services as webmaster).  Due to some dissatisfaction with our hosting provider, I decided that we would switch providers as soon as our existing contract expired and go with one of my preferred hosts.  However, as a not-for-profit organization the budget for the website was limited and the new hosting was going to cost a little more, so I decided to set up an AdSense account to bring in some added income.

The expectations weren't high; I was hoping to bring in enough revenue through the website to cover hosting costs.  At that point I did not have much experience with SEO, so I started researching and looking for strategies I could use on the site.  As I read more and more articles, blogs and whitepapers I became increasingly fascinated with the industry while applying all of that newfound knowledge to the site.  Soon after, I responded to a job posting at an SEO firm and was shortly thereafter starting my new career as an SEO analyst.

My first few weeks at the job were spent learning procedures, familiarizing myself with the various tools that we use, and, most importantly, honing my SEO skills.  I spent the majority of my time auditing and reporting on client sites, which exposed me to a lot of different websites, programming and scripting languages, and tens of thousands of lines of code.  During this process I realized that my web development skills weren't getting worse; they were actually getting better.  The following list examines the reasons for this improvement.

1) Coding Diversity

To properly analyze a site, identify problems, and offer the right solutions, I often have to go deeper than just the HTML on a website.  This means I have to be proficient in a variety of different languages, because I don't believe in pointing out problems with a site unless I can recommend how best to fix them.  Learning the different languages came quickly from the sheer volume of code I was faced with every day, and got easier with each language I learned.

2) Web Standards and Semantic Markup

In a recent post, Reducing Code Bloat Part Two: Semantic HTML, I discussed the importance of semantic HTML to lean, tidy markup for your web documents.  While I have always been a proponent of web standards and semantic markup, my experience in SEO has served to solidify my beliefs.  After you have pored through 12,000 lines of markup that should have been 1,000, or spent two hours implementing style modifications that should have taken five minutes, you quickly come to appreciate semantic markup and web standards.

3) Usability and Accessibility

Once I've optimized a site to draw more search traffic, I need to help make sure that that traffic actually sticks around.  A big part of this is the usability and accessibility of a site.  There are a lot of other websites out there for people to visit, and they are not going to waste time trying to figure out how to navigate through a meandering quagmire of a design.  This aspect of my job forces me to step into the shoes of the average user, which is something that a lot of developers need to do more often.  It has also made me more considerate of accessibility when using features and technologies such as AJAX, ensuring that the site remains accessible when such a feature is disabled or unsupported.

4) The Value of Content

Before getting into SEO, I was among the many web developers guilty of thinking that a website's success could be ensured by implementing enough features, and that enough cool features could make up for a lack of simple, quality content.  Search engine optimization taught me the value of content, and that the right balance of innovative features and content will greatly enhance the effectiveness of both.

That covers some of the bigger reasons that working as an SEO analyst made me a better web developer.  Chances are that I will follow up on this post in the future with more reasons that I am sure to realize as I continue my career in SEO.  In fact, one of the biggest reasons I love working in search engine optimization and marketing is that it is an industry that is constantly changing and evolving, and there is always something new to learn.

Reducing Code Bloat Part Two - Semantic HTML

In my first post on this subject, Reducing Code Bloat - Or How To Cut Your HTML Size In Half, I demonstrated how you can significantly reduce the size of a web document by simply moving style definitions externally and getting rid of a table-based layout.  In this installment I will look at the practice of semantic HTML and how effective it can be at keeping your markup tidy and lean.

What is Semantic HTML?

Semantic HTML, in a nutshell, is the practice of creating HTML documents that contain only the author's "meaning" and not how that meaning is presented.  This boils down to using only structural elements and not presentational elements.  A common issue with semantic HTML is identifying the often subtle differences between elements that represent the author's meaning and elements that are purely presentational.  Consider the following examples:

<p>
I really felt the need to <i>emphasize</i> the point.
</p>

 

<p>
I really felt the need to <em>emphasize</em> the point.
</p>

Many people would consider these two snippets of code to be essentially the same: a paragraph, with the word "emphasize" in italics.  When considering the semantic value of the elements, however, there is one significant difference.  The <i> element is purely presentational (or visual), and it has no meaning in the semantic structure of the document.  The <em> element, on the other hand, has a meaningful semantic value in the document's structure because it defines its contents as being emphasized.  Visually, we usually see the contents of both <i> and <em> elements rendered as italicized text, which is why the difference often seems like nitpicking, but the standard italicizing of <em> elements is simply the choice made by most web browsers; the element has no inherent visual style.

Making Your Markup Work for You

Some of you might be thinking to yourselves "some HTML tags have a different meaning than others? So what?".  Certainly a reasonable response, because when it comes down to it, HTML semantics can seem like pointless nitpicking.  That being said, this nitpicking drives home what HTML is really supposed to do: define structure and meaning.  Think of it like being a writer for a magazine.  You write your article, including paragraphs and indicating words or sentences that should be emphasized, or considered more strongly, in relation to the rest, and submit it.  You don't decide what font is used, the colour of the text, or how much space should be between each line; that's the editor's job.  It is exactly the same concept with semantic HTML: the structure and meaning of the content is the job of HTML, and the presentation is the job of the browser and CSS.

Without the burden of presentational elements you only have to worry about a core group of elements with meaningful structural value.  For instance...

<p>
This is a chunk of text <span class="red14pxitalic">where</span> random words 
are <span class="green16pxbold">styled</span> differently,
<span class="red14pxitalic">with</span> some words 
<span class="green16pxbold">being</span> red, italicized and at 14px, and others
being green, bold, and 16px.
</p>
 
with an external style-sheet definition...
span.red14pxitalic{color:red;font-size:14px;font-style:italic;}
span.green16pxbold{color:green;font-size:16px;font-weight:bold;}

...turns into this...

<p>
This is a chunk of text <em>where</em> random words 
are <strong>styled</strong> differently, <em>with</em> some words 
<strong>being</strong> red, italicized and at 14px, and others being green, 
bold, and 16px.
</p>

with an external style-sheet definition...

p em{color:red;font-size:14px;font-style:italic;}
p strong{color:green;font-size:16px;font-weight:bold;}

The second example accomplishes the same visual result as the first, but contains actual semantic worth in the document's structure.  It also illustrates how much simpler it is to create a new document because you don't have to worry about style.  You create the document and use meaningful semantic elements to identify whatever parts of the content are necessary, and let the style-sheet take care of the rest (assuming of course that the style-sheet was properly defined with the necessary styles).  By using a more complete range of HTML elements you will find yourself needing <span class="whatever"> tags less and less and find your markup becoming cleaner, easier to read, and smaller.

 

Code Size Comparison

 

              HTML Characters (Including Spaces)   CSS Characters (Including Spaces)
Example One   302                                  127
Example Two   216                                  103

 

Part Three will continue looking at semantic HTML, as well as strategies you can use when defining your style framework.

Is There Added Value In XHTML To Search Engine Spiders?

The use of XHTML in the context of SEO is a matter of debate.  The consensus tends to be that using XHTML falls into the category of optimization efforts that provide certain benefits for the site as a whole (extensibility, ability to use XSL transforms) but offers little or no added value in the eyes of the search engines.  That being said, as the number of pages that search engine spiders have to crawl continues to increase every day, the limits to how much the spiders can crawl are being tested.  This has been recognized by SEOs and is reflected in efforts to trim page sizes down to make a site more appealing to the spiders.  Now it is time to start considering the significant benefits that a well-formed XHTML document can potentially offer to search engine spiders.

Parsing XML is faster than parsing HTML for one simple reason: XML documents are expected to be well-formed.  This saves the parser the extra overhead involved in "filling in the blanks" of a non-valid document.  Dealing with XML also opens the door to query languages like XPath, which provide fast and straightforward access to any given part of an XML document.  For instance, consider the following XHTML document:

<?xml version="1.0" encoding="utf-8" ?>
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
<html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en">
  <head>
    <title>My XHTML Document</title>
    <meta name="description" content="This is my XHTML document."/>
  </head>
  <body>
    <div>
      <h1>This is my XHTML document!</h1>
      <p>Here is some content.</p>
    </div>
  </body>
</html>

Now let's say we wanted to grab the contents of the title element from this document.  If we were to parse it as straight HTML we'd probably use a regular expression such as "<title>([^<]*)</title>" (As a quick aside, I want to clarify that HTML parsers are quite advanced and don't simply use regular expressions to read a document).  In Visual Basic the code to accomplish this would look like:

Imports System.Text.RegularExpressions

Class MyParser

  Function GetTitle(ByRef html As String) As String
    Return Regex.Match(html, "<title>([^<]*)</title>").Groups(1).Value
  End Function

End Class

If we were to use XPath, on the other hand, we would get something like this:

Imports System.Xml

Class MyParser

  Function GetTitle(ByRef reader As XmlReader) As String
    Dim doc As New XPath.XPathDocument(reader)
    Dim navigator As XPath.XPathNavigator = doc.CreateNavigator()
    ' The XHTML namespace has to be registered before the elements can be selected.
    Dim ns As New XmlNamespaceManager(navigator.NameTable)
    ns.AddNamespace("xhtml", "http://www.w3.org/1999/xhtml")
    Return navigator.SelectSingleNode("/xhtml:html/xhtml:head/xhtml:title", ns).Value
  End Function

End Class

Don't let the amount of code fool you.  While the first example uses a single line of code to accomplish what takes the second example a few more, the real value comes when dealing with a non-trivial document.  The first method would need to enumerate the elements in the document, which would involve either very complex regular expressions with added logic (because regular expressions are not well suited to parsing HTML), or the added overhead necessary for an existing HTML parser to accurately determine how the document was "intended" to be structured.  With XPath, it is a simple matter of using a different XPath expression with the "navigator.SelectSingleNode" method.
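For example, pulling the meta description out of the same document only requires a different expression.  Here is a hypothetical companion to the GetTitle function above (added to the same class), reusing the same namespace setup:

  Function GetMetaDescription(ByRef reader As XmlReader) As String
    Dim doc As New XPath.XPathDocument(reader)
    Dim navigator As XPath.XPathNavigator = doc.CreateNavigator()
    Dim ns As New XmlNamespaceManager(navigator.NameTable)
    ns.AddNamespace("xhtml", "http://www.w3.org/1999/xhtml")
    ' Only the XPath expression changes: select the content attribute of the description meta element.
    Return navigator.SelectSingleNode("/xhtml:html/xhtml:head/xhtml:meta[@name='description']/@content", ns).Value
  End Function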

With that in mind, I constructed a very basic test to see what kind of speed difference we'd be looking at between HTML parsing and XML (XPath) parsing.  The test was simple: I created a well-formed XHTML document consisting of a title element, meta description and keywords elements, 150 paragraphs of Lorem Ipsum, 1 <h1> element, 5 <h2> elements, 10 <h3> elements and 10 anchor elements scattered throughout the document.

The test consisted of two methods, one using XPath and one using regular expressions.  The task of each method was simply to iterate through every element in the document once, and to repeat this task 10,000 times while being timed.  Once completed, it would spit out the elapsed time in milliseconds.  The test was kept deliberately simple because the results are only meant to roughly illustrate the performance difference between the two methods.  It was by no means an exhaustive performance analysis and should not be treated as such.
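A rough sketch of such a harness might look like the following; the file name, regular expression, and loop structure here are illustrative assumptions rather than the exact code used:

Imports System.Diagnostics
Imports System.IO
Imports System.Text.RegularExpressions
Imports System.Xml

Module ParserBenchmark

    Sub Main()
        Dim xhtml As String = File.ReadAllText("test.xhtml")

        ' Time 10,000 iterations of an XPath-based walk over every element.
        Dim watch As Stopwatch = Stopwatch.StartNew()
        For i As Integer = 1 To 10000
            Dim doc As New XPath.XPathDocument(New StringReader(xhtml))
            Dim iterator As XPath.XPathNodeIterator = doc.CreateNavigator().Select("//*")
            While iterator.MoveNext()
            End While
        Next
        watch.Stop()
        Console.WriteLine("XML Parsing (Using XPATH) - {0}ms", watch.ElapsedMilliseconds)

        ' Time 10,000 iterations of a crude regex-based walk over every opening tag.
        watch = Stopwatch.StartNew()
        For i As Integer = 1 To 10000
            Dim match As Match = Regex.Match(xhtml, "<(\w+)[^>]*>")
            While match.Success
                match = match.NextMatch()
            End While
        Next
        watch.Stop()
        Console.WriteLine("HTML Parsing (Using RegEx) - {0}ms", watch.ElapsedMilliseconds)
    End Sub

End Module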

That being said, I ran the test 10 times and averaged the results for each method, resulting in the following:

XML Parsing (Using XPATH) - 13ms

HTML Parsing (Using RegEx) - 1852ms

As I said, these results are very rough, and meant to illustrate the difference between the two methods rather than the exact times. 

These results should, however, give you something to consider with respect to the potential benefits of XHTML to a search engine spider.  We don't know exactly how the search engine spiders parse web documents, and that will likely never change.  We do know that search engines are constantly refining their internal processes, including spider logic, and with the substantial performance benefits of XML parsing, it doesn't seem too far-fetched to think that the search engines might have their spiders capitalizing on well-formed XHTML documents with faster XML parsing, or are at least taking a very serious look at implementing that functionality in the near future.  Even a performance improvement of only 10ms adds up very quickly when multiplied against the tens of thousands of pages being spidered every day; at 10ms per page, for example, 50,000 pages works out to more than eight minutes of crawl time.

Using CSS To Create Two Common HTML Border Effects

Separating the style from the markup of a web document is generally a painless, if sometimes time-consuming, task.  In many cases, however, the process can hit some added speed bumps, most notably when the original HTML uses an infamous table-based layout.  The two most common speed bumps when dealing with table-based layouts and styling are recreating the classic borderless table and keeping the default table border appearance.

The appearance of these two kinds of table is as follows:

Default Border

1 2
3 4

Borderless

1 2
3 4

The markup for these two tables looks like:

[code:html]

<!--Default Border -->
<table border="1">
<tbody>
<tr>
<td>1</td>
<td>2</td>
</tr>
<tr>
<td>3</td>
<td>4</td>
</tr>
</tbody>
</table>
<!-- Borderless -->
<table border="0" cellspacing="0" cellpadding="0">
<tbody>
<tr>
<td>1</td>
<td>2</td>
</tr>
<tr>
<td>3</td>
<td>4</td>
</tr>
</tbody>
</table>

[/code]

If you want to get the same effects while losing the HTML attributes, you can use the following CSS:

Default Border

[code:html]

table{border-spacing:0px;border:solid 1px #8D8D8D;width:130px;}
table td{
border:solid 1px #C0C0C0;
border-bottom:solid 1px #8D8D8D;
border-left:solid 1px #8D8D8D;
display:table-cell;
margin:0;
padding:0;}

[/code]


Borderless

[code:html]

table{border:none;border-collapse:collapse;}
table td{padding:0;margin:0;}

[/code]

Duplicating the default table border look requires extra rules in its style definition because the default border contains two shades, so the border-color values must be set accordingly.

That is the basic method for replicating, with CSS, the HTML table effects that are usually created with HTML attributes.

A Code Snippet That Speaks For Itself

While working on a client's site today I came across this gem that I thought you might enjoy.  I'll let the snippet speak for itself.

[code:html]

<script type="text/javascript">
document.write('<');
document.write('!--  ');
</script>
<!--   
<noscript> 
(Removed to protect anonymity of client)
</noscript> 
<!--//-->  

[/code]

*Smacks Forehead*