Colin Cochrane

Colin Cochrane is a Software Developer based in Victoria, BC specializing in C#, PowerShell, Web Development and DevOps.

Catching Unwanted Spiders And Content Scraping Bots In ASP.NET


If you have a blog that is even moderately popular then you have likely fallen victim to some form of content scraping.  Ever since it became possible to earn money through ads on a website there have been people trying to find ways to cheat the system.  The most widespread example of this comes in the form of splogs and similar spam-based websites, which consist only of ads from Google AdSense and duplicated content that is scraped from other sites.  In this post I will share a method you can use to identify "evil" spiders and content scraping bots that are wasting your website's resources.

I'll start off by defining what counts as an "evil" spider or bot.  For our purposes here, we'll be looking at spiders and bots that ignore robots.txt and nofollow when crawling a site.  These spiders and bots offer you no value in exchange for crawling your site, because the major search engines all use spiders that respect these rules.  (The one notable exception is MSN, which employs a bot that presents itself as a regular user in order to identify sites that serve different content to search engine spiders than to users.)

Of these valueless spiders, some are almost certainly content scraping bots, sent out to copy the content of your site for use elsewhere.  It is in your best interest to limit how much of your content gets scraped, because you want visitors coming to your site, not to some spam-filled facsimile.

This method of identifying unwanted spiders involves setting a trap, which takes three steps:

1) Create a Hidden Page

To identify these undesired visitors you need to isolate them.  Create a page on your site, but do not link to it from anywhere just yet.  For the purposes of my examples, I'll call our example page "trap.aspx".


Now you want to disallow this page in your robots.txt:

User-agent: *
Disallow: /trap.aspx

With the trap page disallowed in robots.txt, well-behaved spiders will never crawl it.  What is needed now is a link to the trap page with the rel="nofollow" attribute, which should be placed on your home page for maximum effect.  The link must be invisible to users, otherwise you might mistake an unwitting visitor for a bad spider.

<a rel="nofollow" href="/trap.aspx" style="display:none;"></a>

This creates a situation in which the only requests for "/trap.aspx" will come from a spider or bot that ignores both robots.txt and nofollow, which is exactly the kind of bot we want to identify.

2) Create a Log File

Create an XML document named "trap.xml" (or whatever you want) and place it in the App_Data folder of your application (or wherever you want, as long as the application has write access to the directory).  Open the new XML document and create an empty root element, "<trapRequests>", ensuring it has a complete closing tag.

<?xml version="1.0" encoding="utf-8"?>
<trapRequests>
</trapRequests>

You do not need to use an XML document; use whatever logging method works best for you.  I am using XML for the purposes of this example.
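For instance, a plain-text alternative needs only a single call to System.IO.File.AppendAllText.  This is just a sketch: the file name "trap.log" and the tab-separated format are assumptions for illustration, not part of the method above.

```vbnet
Private Sub LogRequestToTextFile(ByVal ipAddress As String, ByVal userAgent As String)
    ' Sketch of a non-XML alternative: appends one tab-separated line per
    ' request (timestamp, IP, user agent) to a plain-text log.
    Dim line As String = String.Join(vbTab, New String() {Now.ToString, ipAddress, userAgent})
    System.IO.File.AppendAllText(Server.MapPath("~/App_Data/trap.log"), line & Environment.NewLine)
End Sub
```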

3) Log What Gets Caught In The Trap

With the trap in place, you now want to keep track of the requests being made for "trap.aspx".  This can be accomplished quite easily using LINQ, as illustrated in the following example:

Imports System.Xml.Linq

Partial Class trap_aspx
    Inherits System.Web.UI.Page

    Protected Sub Page_Load(ByVal sender As Object, ByVal e As System.EventArgs) _
            Handles Me.Load
        LogRequest(Request.UserHostAddress, Request.UserAgent)
    End Sub

    Private Sub LogRequest(ByVal ipAddress As String, ByVal userAgent As String)
        Dim logFile As XDocument
        Try
            logFile = XDocument.Load(Server.MapPath("~/App_Data/trap.xml"))
            ' Append a new <request> element using VB XML literals.
            logFile.Root.Add(<request>
                                 <date><%= Now.ToString %></date>
                                 <ip><%= ipAddress %></ip>
                                 <userAgent><%= userAgent %></userAgent>
                             </request>)
            logFile.Save(Server.MapPath("~/App_Data/trap.xml"))
        Catch ex As Exception
            ' If logging fails, fail silently so the page still returns a response.
        End Try
    End Sub
End Class

With this code in place, every request for the page is logged with:

  1. The Date and Time of the request.
  2. The IP address of the requesting agent.
  3. The User Agent of the requesting agent.

You can, of course, customize what information is logged to your preference.  The code will need to be adjusted if you are using a different storage method.  Once done, you will end up with an XML log file (or your custom store) recording every request to "trap.aspx", which will look something like this:

<?xml version="1.0" encoding="utf-8"?>
<trapRequests>
  <request>
    <date>12/30/2007 12:54:20 PM</date>
    <ip>...</ip>
    <userAgent>...</userAgent>
  </request>
  <request>
    <date>12/30/2007 2:31:51 PM</date>
    <ip>...</ip>
    <userAgent>...</userAgent>
  </request>
</trapRequests>
Now you've set your trap, and any unwanted bots and spiders that find it will be logged.  You are then free to use the logged data to deny access by IP, User Agent, or whatever other criteria you decide are appropriate for your site.
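As a rough sketch of that last step, you could read the trapped IPs back out of trap.xml and reject matching requests in Global.asax.  Note that GetTrappedIps is a hypothetical helper, and in practice you would cache its result rather than re-reading the file on every request:

```vbnet
' Global.asax -- a sketch, not production code.  Assumes Imports System.Linq,
' System.Xml.Linq, and System.Collections.Generic.
Sub Application_BeginRequest(ByVal sender As Object, ByVal e As EventArgs)
    ' Deny access to any IP that has previously requested the trap page.
    If GetTrappedIps().Contains(Request.UserHostAddress) Then
        Response.StatusCode = 403
        Response.End()
    End If
End Sub

Private Function GetTrappedIps() As List(Of String)
    ' Hypothetical helper: pulls every <ip> value logged in trap.xml.
    Dim logFile As XDocument = XDocument.Load(Server.MapPath("~/App_Data/trap.xml"))
    Return logFile.Root.Elements("request"). _
        Select(Function(r) r.Element("ip").Value).Distinct().ToList()
End Function
```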

Comments (6) -

  • Dan Elliott

    12/30/2007 1:12:00 PM |

    This post was good timing for me, because I've been getting hammered by content scraping bots for the past week and have been trying to figure out a way to fight back.  I'll definitely be giving this a try.



  • Samsara

    1/11/2008 6:25:45 AM |

Excellent, Colin. Now, I wonder if you could help those of us who are not using ASP? I am positive you could get creative and think of something for us lowly non-aspy minions? ;)

    I'm thinking along the lines of similar first steps. A new page with robots.txt denying it, but linking to it with a nofollow:  <a rel="nofollow" href="/bottytrap.htm" style="display:none;"></a>

    Then a simple bottytrap page, so when you look at your logs you see who, what and when called bottytrap, eh?

    Can you think of something better?

    Thanks for the tip Colin!
    You are fabulous!

  • Colin Cochrane

    1/11/2008 9:46:28 AM |


    Monitoring requests to the trap page would certainly work using server logs as well.

    Let me know how it works out for you!

  • luigi

    4/9/2008 5:54:46 PM |

    Good job Colin, please go on posting about SEO and related topics.

  • Frank Gennaro

    1/30/2009 1:20:54 AM |

    Excellent article! Because I prefer C#, I converted the code and made some minor changes that C# types might find beneficial.

    public partial class Trap : System.Web.UI.Page
    {
        protected void Page_Load(object sender, EventArgs e)
        {
            LogRequest(Request.UserHostAddress, Request.UserHostName, Request.UserAgent);
        }

        private void LogRequest(string userHostAddress, string userHostName, string userAgent)
        {
            try
            {
                XDocument logFile = XDocument.Load(Server.MapPath("~/App_Data/Trap.xml"));

                XElement xml = new XElement("Request",
                            new XElement("Date", string.Format("{0} at {1}", DateTime.Now.ToLongDateString(), DateTime.Now.ToLongTimeString())),
                            new XElement("UserHostAddress", userHostAddress),
                            new XElement("UserHostName", userHostName),
                            new XElement("UserAgent", userAgent));

                logFile.Root.Add(xml);
                logFile.Save(Server.MapPath("~/App_Data/Trap.xml"));
            }
            catch (Exception)
            {
                // Ignore logging failures so the page still renders.
            }
        }
    }

Comments are closed