Colin Cochrane

Software Developer based in Victoria, BC specializing in C#, PowerShell, Web Development and DevOps.

Catching Unwanted Spiders And Content Scraping Bots In ASP.NET


If you have a blog that is even moderately popular then you have likely fallen victim to some form of content scraping.  Ever since it became possible to earn money through ads on a website there have been people trying to find ways to cheat the system.  The most widespread example of this comes in the form of splogs and similar spam-based websites, which consist only of ads from Google AdSense and duplicated content that is scraped from other sites.  In this post I will share a method you can use to identify "evil" spiders and content scraping bots that are wasting your website's resources.

I'll start off by defining what is considered an "evil" spider/bot.  For our purposes here, we'll be looking at spiders and bots that ignore robots.txt and nofollow when crawling a site.  There is no value in allowing these spiders and bots to crawl your site, as the major search engines use spiders and bots that respect these rules (with the notable exception of MSN, which employs a bot that presents itself as a regular user in order to identify sites that show search engine spiders different content than they show users).

Of these valueless spiders, some are almost certainly going to be some form of content scraping bot, which is sent to literally copy the content of your site for use elsewhere.  It is in your best interest to limit how much of your content gets scraped because you want visitors coming to your site, not some spam-filled facsimile.

This method of identifying unwanted spiders involves setting a trap, which can be built as follows:

1) Create a Hidden Page

To identify these undesired visitors you need to isolate them.  Create a page on your site, but do not link to it from anywhere just yet.  For the purposes of my examples, I'll call our example page "trap.aspx".

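The trap page itself can be an essentially empty Web Form.  A minimal sketch of the markup, assuming a code-behind file named "trap.aspx.vb" (which we'll fill in at step 3), would look something like this:

<%@ Page Language="VB" AutoEventWireup="false" CodeFile="trap.aspx.vb" Inherits="trap_aspx" %>
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html xmlns="http://www.w3.org/1999/xhtml">
<head runat="server">
    <title></title>
</head>
<body>
    <form id="form1" runat="server">
    </form>
</body>
</html>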

Now you want to disallow this page in your robots.txt.

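Assuming the trap page sits at the root of the site, the relevant robots.txt entry would look something like this:

User-agent: *
Disallow: /trap.aspx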

Disallowing the trap page in robots.txt prevents well-behaved spiders from crawling it.  What is needed now is a link to the trap page with the rel="nofollow" attribute, which should be placed on your home page for maximum effect.  The link must be invisible to users, otherwise you might mistake an unwitting visitor for a bad spider.

<a rel="nofollow" href="/trap.aspx" style="display:none;"></a>

This creates a situation in which the only requests for "/trap.aspx" will be from a spider or bot that ignores both robots.txt and nofollow, which is exactly the kind of bots we want to identify.

2) Create a Log File

Create an XML document, name it "trap.xml" (or whatever you want), and place it in the App_Data folder of your application (or wherever you want, as long as the application has write access to the directory).  Open the new XML document and create an empty root element, "<trapRequests>", ensuring it has a complete closing tag.

<?xml version="1.0" encoding="utf-8"?>
<trapRequests>
</trapRequests>
 
You can use whatever method of logging requests works best for you; you do not need to use an XML document.  I am using XML for the purposes of this example.

3) Log What Gets Caught In The Trap

With the trap in place, you now want to keep track of the requests being made for "trap.aspx".  This can be accomplished quite easily using LINQ, as illustrated in the following example:

Imports System.Xml.Linq

Partial Class trap_aspx
    Inherits System.Web.UI.Page

    Protected Sub Page_Load(ByVal sender As Object, ByVal e As System.EventArgs) Handles Me.Load
        LogRequest(Request.UserHostAddress, Request.UserAgent)
    End Sub

    Private Sub LogRequest(ByVal ipAddress As String, ByVal userAgent As String)
        Dim logFile As XDocument
        Try
            'Load the existing log, prepend the new request, and save it back.
            logFile = XDocument.Load(Server.MapPath("~/App_Data/trap.xml"))
            logFile.Root.AddFirst(<request>
                                      <date><%= Now.ToString %></date>
                                      <ip><%= ipAddress %></ip>
                                      <userAgent><%= userAgent %></userAgent>
                                  </request>)
            logFile.Save(Server.MapPath("~/App_Data/trap.xml"))
        Catch ex As Exception
            My.Log.WriteException(ex)
        End Try
    End Sub
End Class

This code sets it up so every request for this page is logged with:

  1. The Date and Time of the request.
  2. The IP address of the requesting agent.
  3. The User Agent of the requesting agent.

You can, of course, customize what information is logged to your preference.  The code will need to be adjusted if you are using a different storage method.  Once done, you will end up with an XML log file (or your custom store) recording every request to "trap.aspx", looking something like this:

<?xml version="1.0" encoding="utf-8"?>
<trapRequests>
  <request>
    <date>12/30/2007 12:54:20 PM</date>
    <ip>1.2.3.4</ip>
    <userAgent>ISCRAPECONTENT/1.2</userAgent>
  </request>
  <request>
    <date>12/30/2007 2:31:51 PM</date>
    <ip>2.3.4.5</ip>
    <userAgent>BADSPIDER/0.5</userAgent>
  </request>
</trapRequests>
 
Now your trap is set, and any unwanted bots and spiders that find it will be logged.  You are then free to use the logged data to deny access by IP address, User Agent, or whatever other criteria you decide is appropriate for your site.
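To illustrate one possibility, here is a minimal sketch of using the logged data to deny further requests from trapped IPs via Global.asax.  It assumes the trap.xml format shown above; in practice you would want to cache the list of trapped IPs rather than re-read the file on every request.

<%@ Application Language="VB" %>
<%@ Import Namespace="System.Linq" %>
<%@ Import Namespace="System.Xml.Linq" %>

<script runat="server">
    Sub Application_BeginRequest(ByVal sender As Object, ByVal e As EventArgs)
        Dim logPath As String = Server.MapPath("~/App_Data/trap.xml")
        If Not System.IO.File.Exists(logPath) Then Exit Sub

        'Deny the request outright if this IP has already been caught by the trap page.
        Dim logFile As XDocument = XDocument.Load(logPath)
        Dim requestIp As String = Request.UserHostAddress

        If logFile.Root.Elements("request").Any(Function(r) r.Element("ip").Value.Trim() = requestIp) Then
            Response.StatusCode = 403
            Response.End()
        End If
    End Sub
</script>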

Comments (6) -

  • Dan Elliott

    12/30/2007 1:12:00 PM | Reply

    This post was good timing for me, because I've been getting hammered by content scraping bots for the past week and have been trying to figure out a way to fight back.  I'll definitely be giving this a try.

    Thanks!

  • Samsara

    1/11/2008 6:25:45 AM | Reply

Excellent Colin. Now. I wonder if you could help those of us who are not using ASP? I am positive you could get creative and think of something for us lowly non-aspy minions? ;)

I'm thinking along the lines of similar first steps. A new page with robots denying it but linking it with a nofollow:  <a rel="nofollow" href="/bottytrap.htm" style="display:none;" />

Then a simple bottytrap page so when you look at your logs, you see who, what and when called bottytrap, eh?

    Can you think of something better?

    Thanks for the tip Colin!
    You are fabulous!

  • Colin Cochrane

    1/11/2008 9:46:28 AM | Reply

    Samsara:

    Monitoring requests to the trap page would certainly work using server logs as well.

    Let me know how it works out for you!

  • luigi

    4/9/2008 5:54:46 PM | Reply

    Good Job Colin, please go on posting about asp.net and seo related topics.

  • Frank Gennaro

    1/30/2009 1:20:54 AM | Reply

    Excellent article! Because I prefer C#, I converted the code and made some minor changes that C# types might find beneficial.

    using System;
    using System.Xml.Linq;

    public partial class Trap : System.Web.UI.Page
    {
        protected void Page_Load(object sender, EventArgs e)
        {
            LogRequest(Request.UserHostAddress, Request.UserHostName, Request.UserAgent);
        }

        private void LogRequest(string userHostAddress, string userHostName, string userAgent)
        {
            XDocument logFile = null;
            try
            {
                logFile = XDocument.Load(Server.MapPath("~/App_Data/Trap.xml"));

                XElement xml = new XElement("Request",
                            new XElement("Date", string.Format("{0} at {1}", DateTime.Now.ToLongDateString(), DateTime.Now.ToLongTimeString())),
                            new XElement("UserHostAddress", userHostAddress),
                            new XElement("UserHostName", userHostName),
                            new XElement("UserAgent", userAgent));
                logFile.Root.AddFirst(xml);
                logFile.Save(Server.MapPath("~/App_Data/Trap.xml"));

            }
            catch (Exception ex)
            {
                //My.Log.WriteException(ex)
            }
        }
    }


SEO Best Practices - Dynamic Pages in ASP.NET


One of the greatest time-savers in web development is the use of dynamic pages to serve up database-driven content.  The most common examples are content management systems and product information pages.  More often than not these pages hinge on a querystring parameter, such as /page.aspx?id=12345, to determine which record needs to be retrieved from the database and output to the page.  What is surprising is how many sites don't adequately validate that crucial parameter.

Any parameter that can be tampered with by a user, such as a querystring, must be validated as a matter of basic security.  That said, this validation must also adequately deal with the situation where the parameter is not valid.  Whether the parameter refers to a non-existent record, or contains letters where it should only contain numbers, the end result is the same: the expected page does not exist.  As simple as this sounds, there are countless applications out there that seem to completely ignore any sort of error handling and are content to let Server Error in "/" Application be the extent of it.  Somewhere in the development cycle the developers of these applications decided that the default ASP.NET error page was the best thing to show the site's visitors, and that a 500 SERVER ERROR was the ideal response to send to any search engine spiders that might have the misfortune of coming across a link with a bad parameter in it.

With a dynamic page that depends on querystring parameters to generate its content, the following basic measures should be taken:

Protected Sub Page_Load(ByVal sender As Object, ByVal e As EventArgs) Handles Me.Load
    'Ensure that the requested URI actually has any querystring keys
    If Request.QueryString.HasKeys() Then

        'Ensure that the requested URI has the expected parameter, and that the parameter isn't empty
        If Request.QueryString("id") IsNot Nothing Then

            'Perform any additional type validation to ensure that the string value can be cast to the required type.

        Else
            Response.StatusCode = 404
            Response.Redirect("/404.aspx", True)
        End If

    Else
        Response.StatusCode = 404
        Response.Redirect("/404.aspx", True)
    End If
End Sub


This is a basic example, but it demonstrates how to perform simple validation against the querystring that will properly redirect anyone who reaches the page with a bad querystring in the request URL.  A similar approach should be taken when retrieving the data, for the case where the record is not found.
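To illustrate, here is a quick sketch of what the "additional type validation" and record lookup might look like, assuming the id parameter should be a positive integer (Product and GetProduct are hypothetical stand-ins for your data access layer):

'Make sure "id" is actually a positive integer before it gets anywhere near the database.
Dim productId As Integer
If Not Integer.TryParse(Request.QueryString("id"), productId) OrElse productId <= 0 Then
    Response.StatusCode = 404
    Response.Redirect("/404.aspx", True)
End If

'A missing record should be treated the same way as a bad parameter.
Dim product As Product = GetProduct(productId)
If product Is Nothing Then
    Response.StatusCode = 404
    Response.Redirect("/404.aspx", True)
End If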

Another useful trick is to define the default error redirect in the web.config file (<customErrors mode="RemoteOnly" defaultRedirect="/error.aspx">), and use that page to respond to the error appropriately by calling Server.GetLastError() to retrieve the most recent server exception and handling it as required.
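For reference, that customErrors setting lives in the system.web section of web.config; a minimal sketch:

<configuration>
  <system.web>
    <customErrors mode="RemoteOnly" defaultRedirect="/error.aspx" />
  </system.web>
</configuration>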

There are many other ways to manage server responses when there is an error in your ASP.NET application.  What is most important is knowing that you need to handle these errors properly and return an appropriate response to the request.

Comments (3) -

  • rstevens

    10/28/2007 4:18:37 PM | Reply

    Good overview.  A lot of developers out there don't realize how easy it is to handle bad parameters correctly.  Hopefully a few of them will read this.

  • Mischa Kroon

    11/28/2007 4:50:13 PM | Reply

    Nice, a little extra added information.

    Using "id" as the querystring parameter name can give suboptimal search engine results.
    Or at least it did a while back, so it is not a wise choice for Google.

