Spider Traps - The Concepts

What's a spider trap?

A spider-trap means different things to different people. To some it's just a method of identifying crawlers as they browse your site, to some it's an interactive extension to their logfiles, to some it's a way of determining whether a crawler is good or bad by monitoring where it browses, and to others it's a way to sabotage bad crawlers. My personal definition is the first one - "a method of identifying crawlers as they browse your site" - as this is extremely flexible and can be adapted to meet a specific need. At the heart of each of these techniques you will find the same basic concept: a need to identify crawlers in real time, and that is what I intend to cover in this article.

If you took your average spider trap (also called a "bot trap" or "crawler trap") and stripped it back to its most basic components, you would find that what you are left with is a glorified sorting and filtering mechanism which gives you the ability to determine whether the current request is coming from a real user or not. The key thing to realise is that a spider-trap does next to nothing unless prompted - instead it provides a framework which other scripts can hook into and integrate into their own logic.
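To make that a little more concrete, here is a minimal sketch of that sorting and filtering core in classic ASP (the language this site deals in). The function name and the signature list are my own illustrative assumptions rather than a finished implementation - the point is simply a routine that answers "does this request look like a crawler?":

  <%
  ' A minimal sketch of the sorting and filtering core described above.
  ' The function name and signature list are illustrative only - a real
  ' trap would use a much longer list or a database lookup.
  Function LooksLikeCrawler()
      Dim agent, signatures, i
      agent = LCase(Request.ServerVariables("HTTP_USER_AGENT"))
      signatures = Array("bot", "crawler", "spider", "slurp", "curl", "wget")

      LooksLikeCrawler = False
      For i = 0 To UBound(signatures)
          If InStr(agent, signatures(i)) > 0 Then
              LooksLikeCrawler = True
              Exit For
          End If
      Next
  End Function
  %>

On its own this answers a question and nothing more, which is exactly the point made above - other code has to call it and act on the answer.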

You will notice I mentioned the "average spider trap" - so what's the difference between that basic example and the sort of spider trap you would find deployed on the web? Firstly you have to realise that the basic example is of very little use on its own as it essentially does nothing; there are two essential things which need to happen before it can be used:
  1. The spider-trap code needs to be made to actually do something because in its basic form it's orphaned and purposeless. This might be supporting a tracking function to report new crawlers or perhaps supporting logic to keep "bad" crawlers out of the site (or at least away from the content) - the list goes on but these were the first two examples that sprung to mind.
  2. Once you have figured out what you want your spider-trap to do then you need to integrate this into as much or as little of the site as you feel comfortable with. Integration is a double-edged sword - on the one hand full integration allows your spider-trap access to all site traffic, on the other this also has the potential to cause the most problems.
Once you have those two things in place you have the core of a working spider trap. From here it really is just a question of how much time you want to invest in your code to create a more advanced, useful version of that script in order to best meet your needs.
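As a rough illustration of point 2, the fragment below assumes the function from the earlier sketch lives in a hypothetical include file (spidertrap.asp) pulled into each page template. The file paths and the decision to log rather than block are assumptions you would adapt to your own site:

  <!-- #include virtual="/includes/spidertrap.asp" -->
  <%
  ' Hypothetical integration into an ordinary page: the include file above
  ' is assumed to contain the LooksLikeCrawler() function from the earlier
  ' sketch. A crawler hit is simply appended to a log file here, but the
  ' same hook could block the request or serve different content instead.
  If LooksLikeCrawler() Then
      Dim fso, logFile
      Set fso = Server.CreateObject("Scripting.FileSystemObject")
      Set logFile = fso.OpenTextFile(Server.MapPath("/logs/crawlers.log"), 8, True)
      logFile.WriteLine Now & vbTab & Request.ServerVariables("REMOTE_ADDR") & _
          vbTab & Request.ServerVariables("URL") & vbTab & _
          Request.ServerVariables("HTTP_USER_AGENT")
      logFile.Close
  End If
  %>

How much of the site carries that include is the integration decision described above - every page gives you complete coverage, a handful of pages limits both the benefit and the risk.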

Server-side or client-side?

There is no middle-ground on this question: you must use server-side code to create and implement your spider-trap, otherwise you are just wasting your time. This needs to be said because one of the most common issues which comes up in discussions about crawlers is that people expect 3rd party tracking scripts to be able to track crawlers (in exactly the same way they do for all the real people that use their site) and are surprised when they don't.

The reason for this is that client-side scripts are normally embedded into the page, and when a crawler or spider finds that page it simply ignores those embedded elements because it isn't interested in them, only the page itself. These client-side scripts also tend to make heavy use of javascript to capture data about the user's configuration, and a crawler will pass right over that javascript without pausing because, as far as it's concerned, client-side scripting isn't an integral part of the page content.

Server-side tracking scripts, on the other hand, are completely invisible to the end-user because they are totally passive and require nothing more than the basic page request to trigger them. In other words, as a spider or crawler roams around the site requesting pages, it will also be triggering server-side scripts which, as far as any client is concerned, simply aren't there.

This means that we have a way to observe them without them realising, and best of all this method requires them to do nothing they would not normally do.
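For illustration only, the fragment below shows the sort of information a server-side script can pick up from nothing more than that basic page request - the variable names are mine, and what you do with the values is up to you:

  <%
  ' The raw material a server-side script gets from nothing more than the
  ' basic page request - no javascript required, and nothing about this is
  ' visible in the response sent back to the client.
  Dim visitorIP, userAgent, requestedUrl, referer
  visitorIP    = Request.ServerVariables("REMOTE_ADDR")
  userAgent    = Request.ServerVariables("HTTP_USER_AGENT")
  requestedUrl = Request.ServerVariables("URL")
  referer      = Request.ServerVariables("HTTP_REFERER")
  ' Store these however suits your setup - the log file from the earlier
  ' sketch is one option, an SQL table keyed on IP address is another.
  %>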

What can a spider-trap do for me?

This really depends on how complex you want to make your script and how much effort you are prepared to put into testing and improving the code - if there is no practical limit on the development then you'll be very surprised what you can make a spider-trap do. For someone with a limited development budget, three very simple and useful examples are:

Continued on page 2 - "Useful information to record from crawlers & the main methods of identification"
