If you have a blog that is even moderately popular, you have likely fallen victim to some form of content scraping. Ever since it became possible to earn money through ads on a website, people have been trying to find ways to cheat the system. The most widespread example is the splog and similar spam-based websites, which consist only of Google AdSense ads and duplicated content scraped from other sites. In this post I will share a method you can use to identify "evil" spiders and content scraping bots that are wasting your website's resources.
I'll start by defining what counts as an "evil" spider or bot. For our purposes, these are spiders and bots that ignore robots.txt and nofollow when crawling a site. Letting them crawl your site offers you no value, because the major search engines use spiders that respect these rules. The one exception is MSN, which employs a bot that presents itself as a regular user in order to identify sites that serve different content to search engine spiders than to users.
Of these valueless spiders, some are almost certainly going to be some form of content scraping bot, which is sent to literally copy the content of your site for use elsewhere. It is in your best interest to limit how much of your content gets scraped because you want visitors coming to your site, not some spam-filled facsimile.
The method for identifying unwanted spiders involves setting a trap, which you can build as follows:
1) Create a Hidden Page
To identify these undesired visitors you need to isolate them. Create a page on your site, but do not link to it from anywhere just yet. In the examples that follow, I'll call this page "trap.aspx".
Now you want to disallow this page in your robots.txt.
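A minimal sketch of such an entry is shown here. It assumes the trap page sits at the site root and that the rule should apply to every crawler; the wildcard user agent is an assumption, not something the original specifies.

```text
# Assumption: the rule applies to all crawlers and the page lives at the site root.
User-agent: *
Disallow: /trap.aspx
```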
With the trap page disallowed in robots.txt, good spiders will not crawl it. What is needed now is a link to the trap page with the rel="nofollow" attribute, placed on your home page for maximum effect. The link must be invisible to users; otherwise you might mistake an unwitting visitor for a bad spider.
<a rel="nofollow" href="/trap.aspx" style="display:none;"></a>
This creates a situation in which the only requests for "/trap.aspx" will come from a spider or bot that ignores both robots.txt and nofollow, which is exactly the kind of bot we want to identify.
2) Create a Log File
Create an XML document and name it "trap.xml" (or whatever you want) and place it in the App_Data folder of your application (or wherever you want, as long as the application has write-access to the directory). Open the new XML document and create an empty root-element "<trapRequests>" and ensure it has a complete closing tag.
<?xml version="1.0" encoding="utf-8"?>
<trapRequests></trapRequests>
You can use whatever logging method suits you best; an XML document is not required. I am using XML only for the purposes of this example.
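If XML is more than you need, a plain text log is one alternative. The following is a minimal sketch and not part of the original example: the module name TrapTextLog, the AppendEntry helper, and the tab-separated layout are all assumptions made for illustration.

```vbnet
Imports System
Imports System.IO
Imports Microsoft.VisualBasic

' Hypothetical helper (not from the original post): appends one
' tab-separated line per trapped request to a plain text file.
Public Module TrapTextLog
    Public Sub AppendEntry(ByVal logPath As String, _
                           ByVal ipAddress As String, _
                           ByVal userAgent As String)
        ' "s" is the sortable date/time format, e.g. 2007-12-30T12:54:20.
        Dim fields As String() = {DateTime.Now.ToString("s"), ipAddress, userAgent}
        Dim line As String = String.Join(ControlChars.Tab, fields)
        File.AppendAllText(logPath, line & Environment.NewLine)
    End Sub
End Module
```

From the trap page's Page_Load, this could be called as TrapTextLog.AppendEntry(Server.MapPath("~/App_Data/trap.log"), Request.UserHostAddress, Request.UserAgent), where trap.log is likewise a hypothetical file name. The rest of this post stays with the XML log.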
3) Log What Gets Caught In The Trap
With the trap in place, you now want to keep track of the requests being made for "trap.aspx". This can be accomplished quite easily using LINQ to XML, as illustrated in the following example:
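Partial Class trap_aspx
    Inherits System.Web.UI.Page
    Protected Sub Page_Load(ByVal sender As Object, ByVal e As System.EventArgs) _
        Handles Me.Load
        LogRequest(Request.UserHostAddress, Request.UserAgent)
    End Sub
    Private Sub LogRequest(ByVal ipAddress As String, ByVal userAgent As String)
        Try
            Dim logFile As XDocument = XDocument.Load(Server.MapPath("~/App_Data/trap.xml"))
            ' Assumption: each entry is wrapped in a <request> element under the root.
            logFile.Root.Add(<request>
                                 <date><%= Now.ToString %></date>
                                 <ip><%= ipAddress %></ip>
                                 <userAgent><%= userAgent %></userAgent>
                             </request>)
            logFile.Save(Server.MapPath("~/App_Data/trap.xml"))
        Catch ex As Exception ' Assumption: a failed write is ignored so the page still loads.
        End Try
    End Sub
End Class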
This code sets it up so every request for this page is logged with:
- The Date and Time of the request.
- The IP address of the requesting agent.
- The User Agent of the requesting agent.
You can, of course, customize what information is logged to your preference. The code will need to be adjusted if you are using a different storage method. Once done, you will end up with an XML log file (or your custom store) with every request to "trap.aspx" that will look like:
<?xml version="1.0" encoding="utf-8"?>
<date>12/30/2007 12:54:20 PM</date>
<date>12/30/2007 2:31:51 PM</date>
Your trap is now set, and any unwanted bots and spiders that find it will be logged. You are then free to use the logged data to deny access by IP address, User Agent, or whatever other criteria you decide are appropriate for your site.
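As one illustration of denying access by IP address, the sketch below rejects any request whose address already appears in the log. It is an assumption-laden example rather than part of the original method: the class name TrapBlockModule is hypothetical, and it assumes the log keeps each address in an <ip> element inside "~/App_Data/trap.xml".

```vbnet
Imports System
Imports System.Linq
Imports System.Web
Imports System.Xml.Linq

' Hypothetical module (not from the original post): returns 403 for any
' request whose IP address already appears in an <ip> element of the trap log.
Public Class TrapBlockModule
    Implements IHttpModule

    Public Sub Init(ByVal context As HttpApplication) Implements IHttpModule.Init
        AddHandler context.BeginRequest, AddressOf OnBeginRequest
    End Sub

    Public Sub Dispose() Implements IHttpModule.Dispose
        ' Nothing to clean up.
    End Sub

    Private Sub OnBeginRequest(ByVal sender As Object, ByVal e As EventArgs)
        Dim app As HttpApplication = DirectCast(sender, HttpApplication)
        ' Reloading the file on every request keeps the sketch short;
        ' a busy site would want to cache the list instead.
        Dim trapLog As XDocument = XDocument.Load(app.Server.MapPath("~/App_Data/trap.xml"))
        Dim clientIp As String = app.Request.UserHostAddress
        ' Trim guards against stray whitespace around logged values.
        Dim isTrapped As Boolean = trapLog.Descendants("ip").Any( _
            Function(x) x.Value.Trim() = clientIp)
        If isTrapped Then
            app.Response.StatusCode = 403 ' Forbidden
            app.Response.End()
        End If
    End Sub
End Class
```

A module like this has to be registered in web.config before ASP.NET will run it. Matching on the User Agent instead would follow the same pattern, comparing Request.UserAgent against the logged <userAgent> values.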