Colin Cochrane

Colin Cochrane is a Software Developer based in Victoria, BC, specializing in C#, PowerShell, Web Development, and DevOps.

Catching Unwanted Spiders And Content Scraping Bots In ASP.NET


If you have a blog that is even moderately popular, then you have likely fallen victim to some form of content scraping.  Ever since it became possible to earn money through ads on a website, there have been people trying to find ways to cheat the system.  The most widespread example comes in the form of splogs and similar spam-based websites, which consist of nothing but Google AdSense ads and duplicated content scraped from other sites.  In this post I will share a method you can use to identify "evil" spiders and content-scraping bots that are wasting your website's resources.

I'll start off by defining what is considered an "evil" spider/bot.  For our purposes here, we'll be looking at spiders and bots that ignore robots.txt and nofollow when crawling a site.  These spiders and bots offer you no value in exchange for crawling your site, as the major search engines use crawlers that respect these rules (with the notable exception of MSN, which employs a certain bot that presents itself as a regular user in order to identify sites that show different content to search engine spiders than to users).

Of these valueless spiders, some are almost certainly content scraping bots, sent to copy the content of your site for use elsewhere.  It is in your best interest to limit how much of your content gets scraped, because you want visitors coming to your site, not to some spam-filled facsimile.

The method I use to identify unwanted spiders involves setting a trap, which works as follows:

1) Create a Hidden Page

To identify these undesired visitors you need to isolate them.  Create a page on your site, but do not link to it from anywhere just yet.  For the purposes of my examples, I'll call our example page "trap.aspx".

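The trap page itself doesn't need any content.  Assuming a Web Site project with a VB code-behind, a minimal trap.aspx could be as simple as:

<%@ Page Language="VB" AutoEventWireup="false" CodeFile="trap.aspx.vb" Inherits="trap_aspx" %>

<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html xmlns="http://www.w3.org/1999/xhtml">
<head runat="server">
    <title></title>
</head>
<body>
    <form id="form1" runat="server">
    </form>
</body>
</html>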

Now you want to disallow this page in your robots.txt.

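The robots.txt entry is just a Disallow rule for the trap page, along these lines (merged into whatever rules you already have):

User-agent: *
Disallow: /trap.aspx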

Disallowing the trap page in robots.txt prevents good spiders from crawling it.  What is needed now is a link to the trap page with the rel="nofollow" attribute, which should be placed on your home page for maximum effect.  The link must be invisible to users, otherwise you might mistake an unwitting visitor for a bad spider.

<a rel="nofollow" href="/trap.aspx" style="display:none;"></a>

This creates a situation in which the only requests for "/trap.aspx" will come from a spider or bot that ignores both robots.txt and nofollow, which is exactly the kind of bot we want to identify.

2) Create a Log File

Create an XML document, name it "trap.xml" (or whatever you want), and place it in the App_Data folder of your application (or wherever you want, as long as the application has write access to the directory).  Open the new XML document and create an empty root element, "<trapRequests>", making sure it has a complete closing tag.

<?xml version="1.0" encoding="utf-8"?>
<trapRequests>
</trapRequests>
 
You do not need to use an XML document; use whatever method of logging the requests works best for you.  I am using XML for the purposes of this example.

3) Log What Gets Caught In The Trap

With the trap in place, you now want to keep track of the requests being made for "trap.aspx".  This can be accomplished quite easily using LINQ, as illustrated in the following example:

Imports System.Xml.Linq

Partial Class trap_aspx
    Inherits System.Web.UI.Page

    Protected Sub Page_Load(ByVal sender As Object, ByVal e As System.EventArgs) _
        Handles Me.Load

        LogRequest(Request.UserHostAddress, Request.UserAgent)
    End Sub

    Private Sub LogRequest(ByVal ipAddress As String, ByVal userAgent As String)
        Dim logFile As XDocument

        Try
            logFile = XDocument.Load(Server.MapPath("~/App_Data/trap.xml"))
            logFile.Root.AddFirst(<request>
                                      <date><%= Now.ToString %></date>
                                      <ip><%= ipAddress %></ip>
                                      <userAgent><%= userAgent %></userAgent>
                                  </request>)
            logFile.Save(Server.MapPath("~/App_Data/trap.xml"))
        Catch ex As Exception
            My.Log.WriteException(ex)
        End Try
    End Sub
End Class

This code logs every request for the page with:

  1. The Date and Time of the request.
  2. The IP address of the requesting agent.
  3. The User Agent of the requesting agent.

You can, of course, customize what information is logged to your preference.  The code will need to be adjusted if you are using a different storage method.  Once done, you will end up with an XML log file (or your custom store) recording every request to "trap.aspx", which will look like this:

<?xml version="1.0" encoding="utf-8"?>
<trapRequests>
  <request>
    <date>12/30/2007 12:54:20 PM</date>
    <ip>1.2.3.4</ip>
    <userAgent>ISCRAPECONTENT/1.2</userAgent>
  </request>
  <request>
    <date>12/30/2007 2:31:51 PM</date>
    <ip>2.3.4.5</ip>
    <userAgent>BADSPIDER/0.5</userAgent>
  </request>
</trapRequests>
 
Now you've set your trap, and any unwanted bots and spiders that find it will be logged.  You are then free to use the logged data to deny access by IP address, by User Agent, or by whatever criteria you decide are appropriate for your site.
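One simple way to act on the log is to check it in Global.asax and refuse further requests from any IP address that has already hit the trap.  The following is a rough sketch only; in practice you would want to cache the banned list instead of loading the XML file on every request.

<%@ Application Language="VB" %>
<%@ Import Namespace="System.Linq" %>
<%@ Import Namespace="System.Xml.Linq" %>

<script runat="server">

    Sub Application_BeginRequest(ByVal sender As Object, ByVal e As EventArgs)
        Dim logPath As String = Server.MapPath("~/App_Data/trap.xml")
        If Not System.IO.File.Exists(logPath) Then Exit Sub

        'Return 403 Forbidden to any IP address already recorded in trap.xml.
        Dim requestIp As String = Request.UserHostAddress
        Dim logFile As XDocument = XDocument.Load(logPath)
        Dim isBanned As Boolean = logFile.Root.Elements("request"). _
            Any(Function(r) r.Element("ip").Value.Trim() = requestIp)

        If isBanned Then
            Response.StatusCode = 403
            Response.End()
        End If
    End Sub

</script>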

Disabling Configuration Inheritance For ASP.NET Child Applications

Configuration inheritance is a very robust feature of ASP.NET that allows you to set configuration settings in the Web.config of a parent application and have them automatically applied to all of its child applications.  There are certain situations, however, in which some configuration settings should not apply to a child application.  The usual course of action is to override the setting in question in the child application's web.config file, which is ideal when there are only a handful of settings to deal with.  It is less than ideal when a significant number of settings need to be overridden, or when you simply want the child application to be largely independent of its parent.
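For example, if the parent application registered an HTTP module that the child has no use for (the module name below is just a placeholder), the child application's web.config could remove it explicitly:

<?xml version="1.0"?>
<configuration>
  <system.web>
    <httpModules>
      <!-- "CmsUrlRewriter" is a hypothetical module registered by the parent application's web.config. -->
      <remove name="CmsUrlRewriter"/>
    </httpModules>
  </system.web>
</configuration>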

The solution is configuration <location> settings, which allow you to selectively exclude portions of an application's web.config from being inherited by child applications.  This made my job a heck of a lot easier yesterday during the installation of BlogEngine.NET for a client whose site ran on a CMS.  Naturally the CMS had a rather lengthy web.config packed with settings that were specific to the CMS application and would generally break any child application someone might install.  The blog application was no exception, as I quickly found at least 20 settings that were preventing it from running.  I had the blog configured as a separate application through a virtual directory that was a subdirectory of the main website application (which included the CMS), so by default ASP.NET was applying all of the main application's settings to the blog application.

Digging through the CMS's large web.config and identifying every setting I would need to override was not a prospect I was particularly thrilled about, so I turned to the <location> element to save me some time.  Since the blog application was essentially independent of the main application, there were no inherited settings it depended on, so I modified the parent web application's web.config to the following (edited to protect the client).

<location path="." inheritInChildApplications="false">
  <system.web>
  ...
  </system.web>
</location>

By wrapping the <system.web> element with the <location> element, setting the path attribute to "." and the inheritInChildApplications attribute to "false", I prevented every child application of the main web application from inheriting the settings in the <system.web> element.  After making this modification I tried accessing the blog and, just like that, there were no more errors and it loaded perfectly.  I could also have used the path attribute to apply the <location> element specifically to the blog if I hadn't wanted to disable inheritance across every child application.

Aside from the convenience that configuration <location> settings provided in this case, they are also a tremendously powerful way to manage configuration inheritance in your ASP.NET applications.  If you haven't read the MSDN documentation on this topic, I really encourage you to do so; there are so many scenarios in which configuration <location> settings can be used that you will quickly discover they are an incredibly valuable tool to have at your disposal.

Using the ASP.NET Web.Sitemap to Manage Meta Data


In an ASP.NET application the web.sitemap is a very convenient tool for managing the site's structure, especially when used with an asp:Menu control.  One aspect of the web.sitemap that is often overlooked is the ability to add custom attributes to the <siteMapNode> elements, which provides very useful leverage for managing a site's meta-data.  I think the best way to explain is through a simple example.

Consider a small site with three pages: /default.aspx, /products.aspx, and /services.aspx. The web.sitemap for this site would look like this:

<?xml version="1.0" encoding="utf-8" ?>
<siteMap xmlns="http://schemas.microsoft.com/AspNet/SiteMap-File-1.0" >
  <siteMapNode url="~/" title="Home">
    <siteMapNode url="~/products.aspx" title="Products" />
    <siteMapNode url="~/services.aspx" title="Services" />
  </siteMapNode>
</siteMap>

Now let's add some custom attributes where we can set the Page Title (because the title attribute is where the asp:Menu control looks for the name of a menu item and it's probably best to leave that as is), the Meta Description, and the Meta Keywords elements. 

 

<?xml version="1.0" encoding="utf-8" ?>
<siteMap xmlns="http://schemas.microsoft.com/AspNet/SiteMap-File-1.0" >
  <siteMapNode url="~/" title="Home" pageTitle="Homepage" metaDescription="This is my homepage!" metaKeywords="homepage, keywords, etc">
    <siteMapNode url="~/products.aspx" title="Products" pageTitle="Our Products" metaDescription="These are our fine products" metaKeywords="products, widgets"/>
    <siteMapNode url="~/services.aspx" title="Services" pageTitle="Our Services" metaDescription="Services we offer" metaKeywords="services, widget cleaning"/>
  </siteMapNode>
</siteMap>

Now, with that in place, all we need is a way to access these new attributes and use them to set the corresponding elements on the pages.  This can be accomplished by adding a module to your project; we'll call it "MetaDataFunctions" for this example.  In this module, add the following procedure.

 

Imports System.Web
Imports System.Web.UI
Imports System.Web.UI.HtmlControls

Public Module MetaDataFunctions

    Public Sub GenerateMetaTags(ByVal TargetPage As Page)
        Dim head As HtmlHead = TargetPage.Master.Page.Header
        Dim meta As New HtmlMeta

        If SiteMap.CurrentNode IsNot Nothing Then
            meta.Name = "keywords"
            meta.Content = SiteMap.CurrentNode("metaKeywords")
            head.Controls.Add(meta)

            meta = New HtmlMeta
            meta.Name = "description"
            meta.Content = SiteMap.CurrentNode("metaDescription")
            head.Controls.Add(meta)

            'Use the custom pageTitle attribute for the page title.
            TargetPage.Title = SiteMap.CurrentNode("pageTitle")
        Else
            meta.Name = "keywords"
            meta.Content = "default keywords"
            head.Controls.Add(meta)

            meta = New HtmlMeta
            meta.Name = "description"
            meta.Content = "default description"
            head.Controls.Add(meta)

            TargetPage.Title = "default page title"
        End If
    End Sub

End Module

 

Then all you have to do is call this procedure in the Page_Load event, like so...

 

Protected Sub Page_Load(ByVal sender As Object, ByVal e As EventArgs) Handles Me.Load
    GenerateMetaTags(Me)
End Sub

 

...and you'll be up and running.  Now you have a convenient, central location where you can see and manage your site's meta-data.

SEO Best Practices - Dynamic Pages in ASP.NET


One of the greatest time-savers in web development is the use of dynamic pages to serve database-driven content.  The most common examples are content management systems and product information pages.  More often than not these pages hinge on a querystring parameter, such as /page.aspx?id=12345, to determine which record needs to be retrieved from the database and output to the page.  What is surprising is how many sites don't adequately validate that crucial parameter.

Any parameter that can be tampered with by a user, such as a querystring, must be validated as a matter of basic security.  That validation must also adequately deal with the situation in which the parameter is not valid.  Whether the parameter refers to a non-existent record, or contains letters where it should only contain numbers, the end result is the same: the expected page does not exist.  As simple as this sounds, there are countless applications out there that seem to completely ignore any sort of error handling, and are content to have "Server Error in '/' Application" be the extent of it.  Somewhere in the development cycle the developers of these applications decided that the default ASP.NET error page would be the best thing to show to the site's visitors, and that a 500 SERVER ERROR was the ideal response to send to any search engine spiders that might have the misfortune of coming across a link with a bad parameter in it.

With a dynamic page that depends on querystring parameters to generate its content, the following basic measures should be taken:

Protected Sub Page_Load(ByVal sender As Object, ByVal e As EventArgs) Handles Me.Load
    'Ensure that the requested URI actually has any querystring keys
    If Request.QueryString.HasKeys() Then
        'Ensure that the requested URI has the expected parameter, and that the parameter isn't empty
        If Request.QueryString("id") IsNot Nothing Then
            'Perform any additional type validation to ensure that the string value can be cast to the required type.
        Else
            Response.StatusCode = 404
            Response.Redirect("/404.aspx", True)
        End If
    Else
        Response.StatusCode = 404
        Response.Redirect("/404.aspx", True)
    End If
End Sub


This is a basic example, but it demonstrates how to perform simple validation against the querystring and properly redirect anyone who reaches the page with a bad querystring in the request URL.  A similar approach should be taken when retrieving the data, to handle the case where the requested record is not found.
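For the type check mentioned in the comment above, something as simple as Integer.TryParse will do, and the record lookup can be handled the same way.  Here is a rough sketch, where GetProductById is a hypothetical data-access call standing in for whatever your page actually uses:

'Validate that "id" is numeric before touching the database.
Dim productId As Integer
If Not Integer.TryParse(Request.QueryString("id"), productId) Then
    Response.StatusCode = 404
    Response.Redirect("/404.aspx", True)
End If

'Treat a missing record the same way as a bad parameter.
Dim product As Object = GetProductById(productId)
If product Is Nothing Then
    Response.StatusCode = 404
    Response.Redirect("/404.aspx", True)
End If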

Another useful trick is to define the default error redirect in the web.config file (<customErrors mode="RemoteOnly" defaultRedirect="/error.aspx">), and use that page to respond to the error appropriately by calling the Server.GetLastError() method to get the most recent server exception and handling it as required.
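A bare-bones error page along those lines might look like the following sketch.  Depending on how the error page is reached, the exception may no longer be available, hence the check for Nothing:

Partial Class error_aspx
    Inherits System.Web.UI.Page

    Protected Sub Page_Load(ByVal sender As Object, ByVal e As EventArgs) Handles Me.Load
        'Grab the exception that caused the redirect to this page, if it is still available.
        Dim lastError As Exception = Server.GetLastError()

        If lastError IsNot Nothing Then
            'Log it (or notify someone), then clear it so it isn't rethrown.
            My.Log.WriteException(lastError)
            Server.ClearError()
        End If
    End Sub
End Class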

There are many other ways to manage server responses when there is an error in your ASP.NET application.  What is most important is knowing that you need to handle these errors properly, up to and including an appropriate response to the request.   

ASP.NET's Answer to WordPress

It's an exciting time to be an ASP.NET developer.  As the ASP.NET community continues to grow, we find ourselves with an ever-increasing arsenal of tools, controls, and frameworks at our disposal.  Unfortunately this can make the decision on a component a little more difficult, as very few of the components out there have reached the point where they are largely considered the "standard".  ASP.NET blog engines certainly fall into this category, as many of you have probably noticed.

When I decided to create this blog I had some definite requirements in mind when it came to choosing a blog engine.  First and foremost, it had to render completely valid XHTML, because I practice what I preach with respect to W3C compliance.  Control over SEO-important aspects such as canonical URLs and meta description elements was a necessity.  It also had to have a URL strategy that didn't rely on a "/page.aspx?id=123456789"-style mess of a querystring.  Finally, the underlying code had to be well-organized and lean.  Without a de facto "standard" for ASP.NET blog engines, I started hunting around and researching the options that were out there.

I tried some different ASP.NET blog engines, such as dasBlog and subText, but found the markup they rendered was not acceptable.  Then I came across BlogEngine.NET, which, I was thrilled to discover, met all of my requirements.  It's lightweight, very easy to set up, and very well designed and organized under the hood.  A nice feature is that, by default, it doesn't require a SQL database to run, instead storing all posts, comments, and settings in local XML files.  Of course SQL database integration is available, and is also easy to get up and running.  Aesthetically, it comes with a nice collection of themes, which are quite easy to modify, and creating your own themes is a straightforward process.

The above factors have led me to believe that BlogEngine.NET is in a position to become to ASP.NET what WordPress is to PHP.  If any of you are currently trying to decide on a solid ASP.NET blog system, you should definitely try BlogEngine.NET.  You won't be disappointed.