Colin Cochrane

Software Developer based in Victoria, BC specializing in C#, PowerShell, Web Development and DevOps.

Catching Unwanted Spiders And Content Scraping Bots In ASP.NET


If you have a blog that is even moderately popular then you have likely fallen victim to some form of content scraping. Ever since it became possible to earn money through ads on a website, there have been people trying to find ways to cheat the system. The most widespread example of this comes in the form of splogs and similar spam-based websites, which consist of nothing but Google AdSense ads and duplicated content scraped from other sites. In this post I will share a method you can use to identify "evil" spiders and content scraping bots that are wasting your website's resources.

I'll start off by defining what is considered an "evil" spider/bot. For our purposes here, we'll be looking at spiders and bots that ignore robots.txt and nofollow when crawling a site. These spiders and bots offer no value to you in exchange for crawling your site, since the major search engines all use spiders and bots that respect these rules (with the notable exception of MSN, which employs a bot that presents itself as a regular user in order to identify sites that serve different content to search engine spiders than to visitors).

Some of these valueless spiders are almost certainly content scraping bots, sent to copy the content of your site verbatim for use elsewhere. It is in your best interest to limit how much of your content gets scraped, because you want visitors coming to your site, not to some spam-filled facsimile of it.

This method of identifying unwanted spiders involves setting a trap, which is built as follows:

1) Create a Hidden Page

To identify these undesired visitors you need to isolate them. Create a page on your site, but do not link to it from anywhere just yet. For the purposes of this example, I'll call the page "trap.aspx".

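The page itself can be essentially empty, since the interesting work will happen in the code-behind (see step 3). Assuming a VB web site project with a code-behind file named "trap.aspx.vb", the markup might be as minimal as this:

<%@ Page Language="VB" AutoEventWireup="false" CodeFile="trap.aspx.vb" Inherits="trap_aspx" %>
<html>
<head>
    <title>Nothing to see here</title>
</head>
<body>
</body>
</html>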

Now you want to disallow this page in your robots.txt.

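For example, assuming the trap page sits in the root of the site, a rule like this will do it:

User-agent: *
Disallow: /trap.aspx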

Disallowing the trap page in robots.txt prevents the good spiders from crawling it. What is needed now is a link to the trap page with the rel="nofollow" attribute, placed on your home page for maximum effect. The link must be invisible to users, otherwise you might mistake an unwitting visitor for a bad spider.

<a rel="nofollow" href="/trap.aspx" style="display:none;" />

This creates a situation in which the only requests for "/trap.aspx" will come from spiders or bots that ignore both robots.txt and nofollow, which is exactly the kind of bot we want to identify.

2) Create a Log File

Create an XML document, name it "trap.xml" (or whatever you want), and place it in the App_Data folder of your application (or wherever you want, as long as the application has write access to the directory). Open the new XML document and create an empty root element "<trapRequests>", making sure it has a closing tag.

<?xml version="1.0" encoding="utf-8"?>
<trapRequests>
</trapRequests>
 
You can use whatever storage method works best for you to log the requests; you do not need to use an XML document. I am using XML for the purposes of this example.

3) Log What Gets Caught In The Trap

With the trap in place, you now want to keep track of the requests being made for "trap.aspx". This can be accomplished quite easily using LINQ to XML, as illustrated in the following example:

Imports System.Xml.Linq

Partial Class trap_aspx
    Inherits System.Web.UI.Page

    Protected Sub Page_Load(ByVal sender As Object, ByVal e As System.EventArgs) _
        Handles Me.Load

        LogRequest(Request.UserHostAddress, Request.UserAgent)
    End Sub

    Private Sub LogRequest(ByVal ipAddress As String, ByVal userAgent As String)
        Dim logFile As XDocument

        Try
            ' Load the existing log, add the new request at the top, and save it back.
            logFile = XDocument.Load(Server.MapPath("~/App_Data/trap.xml"))
            logFile.Root.AddFirst(<request>
                                      <date><%= Now.ToString %></date>
                                      <ip><%= ipAddress %></ip>
                                      <userAgent><%= userAgent %></userAgent>
                                  </request>)
            logFile.Save(Server.MapPath("~/App_Data/trap.xml"))
        Catch ex As Exception
            My.Log.WriteException(ex)
        End Try
    End Sub
End Class

With this code in place, every request for the page is logged with:

  1. The Date and Time of the request.
  2. The IP address of the requesting agent.
  3. The User Agent of the requesting agent.

You can, of course, customize what information is logged to suit your preference; the code will need to be adjusted if you are using a different storage method. Once done, you will end up with an XML log file (or your custom store) containing every request to "trap.aspx", which will look like this:

<?xml version="1.0" encoding="utf-8"?>
<trapRequests>
<request>
<date>12/30/2007 12:54:20 PM</date>
<ip>1.2.3.4</ip>
<userAgent>ISCRAPECONTENT/1.2</userAgent>
</request>
<request>
<date>12/30/2007 2:31:51 PM</date>
<ip>2.3.4.5</ip>
<userAgent>BADSPIDER/0.5</userAgent>
</request>
</trapRequests>
 
Now your trap is set, and any unwanted bots and spiders that find it will be logged. You are then free to use the logged data to deny access based on offending IPs, user agents, or whatever other criteria you decide are appropriate for your site.
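How you act on that data is up to you, but as a rough illustration, here is a minimal sketch of an HTTP module that denies requests from any IP already caught in the trap. The module name, and the choice to reload trap.xml on every request, are assumptions made purely for the example; a real implementation would cache the list and register the module in web.config.

Imports System.Linq
Imports System.Web
Imports System.Xml.Linq

' Sketch of an IHttpModule that denies requests from IPs previously
' caught by the trap page.
Public Class TrapBlockModule
    Implements IHttpModule

    Public Sub Init(ByVal context As HttpApplication) Implements IHttpModule.Init
        AddHandler context.BeginRequest, AddressOf OnBeginRequest
    End Sub

    Private Sub OnBeginRequest(ByVal sender As Object, ByVal e As EventArgs)
        Dim app As HttpApplication = CType(sender, HttpApplication)

        ' For simplicity this reloads the log on every request; in practice
        ' you would cache the list and refresh it periodically.
        Dim logFile As XDocument = XDocument.Load(app.Server.MapPath("~/App_Data/trap.xml"))
        Dim trappedIps = logFile.Root.Elements("request"). _
                         Select(Function(r) r.Element("ip").Value.Trim())

        ' Refuse the request outright if this client was caught in the trap.
        If trappedIps.Contains(app.Request.UserHostAddress) Then
            app.Response.StatusCode = 403
            app.Response.End()
        End If
    End Sub

    Public Sub Dispose() Implements IHttpModule.Dispose
    End Sub
End Class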

Three CSS Roll Over Techniques You Might Not Know About

When it comes to rollover effects in web design, the effect has traditionally been accomplished with JavaScript:


JavaScript in the HEAD section

[code:html]

<script language="JavaScript">
<!--
// preload images
if (document.images) {
img_on =new Image(); img_on.src ="../images/1.gif";
img_off=new Image(); img_off.src="../images/2.gif";
}
function handleOver() {
if (document.images) document.imgName.src=img_on.src;
}
function handleOut() {
if (document.images) document.imgName.src=img_off.src;
}
//-->
</script>

[/code]


And in the element with the rollover effect:

[code:html]


<a href="http://www.domain.com" onMouseOver="handleOver();return true;" onMouseOut="handleOut();return true;"><img name="imgName" alt="Rollover!" src="/images/1.gif"/></a>

[/code]

The reason this method is used so commonly is that it is simple to implement and, more importantly, avoids the "lag" on the first mouseover that comes with using a CSS background-image switch on a selector's :hover rule, caused by the delay in downloading the rollover image. What a lot of people don't realize is that there are ways to accomplish this effect in CSS without the initial rollover lag.

Method One - CSS Preloading

This is the quick and dirty way to force browsers to download rollover images when they initially load the page. Let's say you have the following document:

[code:html]


<html>
<head>
<title>My Rollover Page</title>
<style type="text/css">
#rollover{background:url(/images/1.gif);}
#rollover:hover{background:url(/images/2.gif);}
</style>
</head>
<body>
<div>
<a id="rollover" href="http://www.domain.com">My Rollover Link</a>
</div>
</body>
</html>

[/code]

In this page there would be a noticeable delay when a user first mouses over the "rollover" anchor. The CSS preloading method uses an invisible dummy element, hidden with visibility:hidden, that has the "active" version of the rollover image set as its background.

[code:html]

<html>
<head>
<title>My Rollover Page</title>
<style type="text/css">
#preload{position:absolute;visibility:hidden;}
#image2{background:url(/images/2.gif);}
#rollover{background:url(/images/1.gif);}
#rollover:hover{background:url(/images/2.gif);}
</style>
</head>
<body>
<div id="preload">
<div id="image2"></div>
</div>
<div>
<a id="rollover" href="http://www.domain.com">My Rollover Link</a>
</div>
</body> 
</html>
 

[/code]

Method Two - Image Visibility Swap

This method accomplishes the same goal of forcing the browser to load both of the rollover images, but attacks it in a different way. Using the same example as above, we basically set the background of the containing anchor element to the "active" state of the rollover, and set the contained image to be the "inactive" state. Then it's just a matter of hiding the image element on hover.

[code:html]

<html>
<head>
<title>My Rollover Page</title>
<style type="text/css">
#rollover{background:url(/images/2.gif);display:block;height:50px;width:50px;}
#rollover:hover img{visibility:hidden;}
</style>
</head>
<body>
<div>
<a id="rollover" href="http://www.domain.com"><img src="/images/1.gif" alt="My Rollover's Inactive Image" /></a>
</div>
</body> 
</html>  

[/code]

This is the method that this site uses for the ColinCochrane.com logo in the header.

Method Three - Multistate Image

This method avoids the preloading problem altogether by using a single image that contains both the inactive and active states. This is accomplished by creating an image with the inactive and active versions stacked on top of each other, like so:



Then all you do is set the element's height to half that of the image and use the background-position property to shift between the states on hover:

[code:html]

<html>
<head>
<title>My Rollover Page</title>
<style type="text/css">
#rollover{background:url(/images/multi.gif) bottom;display:block;height:20px;width:100px;}
#rollover:hover{background-position:top;}
</style>
</head>
<body>
<div>
<a id="rollover" href="http://www.domain.com"></a>
</div>
</body> 
</html>

[/code]


Now you have some different techniques to consider when implementing rollover effects on your website.