What's a spider trap?
A spider-trap means different things to different people - to some it's just a method of identifying crawlers as they
browse your site, to some it is an interactive extension to their logfiles, to some it's a way of determining if
a crawler is good or bad by monitoring where it browses and to others it's a way to sabotage bad crawlers.
My personal definition would be the first example, "a method of identifying crawlers as they browse your site",
as this is extremely flexible and can be adapted to meet a specific need. At the heart of each of these techniques
you will find the same basic concept - a need to identify crawlers in real-time, and so this is what I intend to cover
in this article.
If you took your average spider trap (also called "bot trap" or "crawler trap") and stripped it
back to the most basic components you would find that what you are left with is a glorified sorting and filtering
mechanism which gives somebody the ability to determine whether the current request is coming from a real user or not.
The key thing to realise is that a spider-trap does next to nothing unless prompted - instead it provides a framework
which can be used elsewhere to allow other scripts to integrate its features into their own logic.
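To make that "sorting and filtering mechanism" a little more concrete, here is a minimal sketch in Python. It
assumes the only evidence available is the User-Agent header; the names classify_request,
KNOWN_CRAWLER_PATTERNS and GENERIC_CRAWLER_HINT are made up for this illustration, and the two signatures are
examples rather than a usable list.

```python
import re

# Hypothetical signature list for illustration only; a real one would be far longer.
KNOWN_CRAWLER_PATTERNS = {
    "Googlebot": re.compile(r"Googlebot", re.I),
    "Bingbot": re.compile(r"bingbot", re.I),
}

# Loose hint that a user-agent belongs to some crawler we have no signature for.
GENERIC_CRAWLER_HINT = re.compile(r"bot|crawl|spider|slurp", re.I)


def classify_request(user_agent):
    """Sort a request into ("crawler", name), ("suspected_crawler", None) or ("user", None)."""
    ua = user_agent or ""
    for name, pattern in KNOWN_CRAWLER_PATTERNS.items():
        if pattern.search(ua):
            return ("crawler", name)
    # An empty user-agent, or one containing a crawler-like word, is treated as suspect.
    if not ua or GENERIC_CRAWLER_HINT.search(ua):
        return ("suspected_crawler", None)
    return ("user", None)
```

On its own this does nothing, which is the point: other scripts call classify_request and decide what to do
with the answer.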
You will notice I mentioned the "average spider trap" - so what's the difference between that basic example
and the sort of spider trap you would find deployed on the web? Well, firstly you have to realise that the basic
example is very little use on its own as it essentially does nothing. There are two essential things which need to
happen before it can be used:
-
The spider-trap code needs to be made to actually do something because in its basic form it's orphaned and
purposeless. This might be supporting a tracking function to report new crawlers or perhaps supporting logic to
keep "bad" crawlers out of the site (or at least away from the content) - the list goes on but these
were the first two examples that sprang to mind.
-
Once you have figured out what you want your spider-trap to do then you need to integrate this into as much or as
little of the site as you feel comfortable with. Integration is a double-edged sword - on the one hand full
integration allows your spider-trap access to all site traffic, on the other this also has the potential
to cause the most problems.
Once you have those two things in place you have the core of a working spider trap. From here it really is
just a question of how much time you want to invest into your code to create a more advanced, more useful version
of that script in order to best meet your needs.
Server-side or client-side?
There is no middle-ground on this question: you must be using server-side code to create and implement your
spider-trap, as otherwise you are just wasting your time. This needed to be said because one of the most common
issues which comes up in discussions about crawlers is that people expect third-party tracking scripts to be able
to track crawlers (in exactly the same way they do for all the real people that use their site) and are surprised
when they don't.
The reason for this is that client-side scripts are normally embedded into the page, and so when a crawler or a
spider finds that page it totally ignores the embedded elements because it isn't interested in them, only the page.
Also, as these client-side scripts often make heavy use of JavaScript to capture data about the user's
configuration, a crawler will pass right over that JavaScript without pausing because as far as it's concerned
client-side scripting isn't an integral part of the page content and so gets ignored.
Server-side tracking scripts on the other hand are totally transparent to the end-user as they are totally passive
and require nothing more than the basic page request to trigger them. In other words as the spider or crawler roams
around the site requesting pages it will also be triggering server-side scripts which, as far as any client is
concerned, are totally invisible.
This means that we have a way to observe them without them realising, and best of all this method requires them to do
nothing they would not normally do.
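As a rough illustration of that passive, server-side observation, the sketch below wraps a site in a small
piece of WSGI middleware (WSGI being the standard Python web server interface). This assumes a Python site; the
class name SpiderTrapMiddleware and the "sink" argument are invented for the example, and any server-side
language with access to the request headers could do the same job.

```python
class SpiderTrapMiddleware:
    """Record every page request server-side, then pass it on to the site untouched."""

    def __init__(self, app, sink):
        self.app = app    # the wrapped WSGI application (the site itself)
        self.sink = sink  # anything with an append() method, e.g. a list or a log writer

    def __call__(self, environ, start_response):
        # The visitor only has to request the page; nothing runs on the client.
        self.sink.append({
            "ip": environ.get("REMOTE_ADDR", ""),
            "path": environ.get("PATH_INFO", ""),
            "user_agent": environ.get("HTTP_USER_AGENT", ""),
        })
        # The response is exactly what the site would have sent anyway.
        return self.app(environ, start_response)
```

A site would be wrapped once, for example application = SpiderTrapMiddleware(application, request_log), after
which every request, crawler or human, leaves a record without the client seeing any difference.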
What can a spider-trap do for me?
This really depends on how complex you want to make your script and how much effort you are prepared to put into
testing and improving the code - if there is no practical limit on the development then you'll be very surprised
what you can make a spider-trap do. For someone with a limited development budget, some very simple and useful
examples are:
-
Identifying Search Engines
The most obvious applications of the data you can generate through a spider-trap relate to search engines as
once you have a decent matching system you can identify a search engine's crawler with great ease. In turn this
would allow you to do simple things like charting when they visit, how often and how many pages they were
indexing.
-
Discover New Crawlers
New crawlers appear quite often and, to a lesser extent, established crawlers sometimes show up sporting a new
version number or suffix suggestive of the people behind them testing modifications and new techniques. In the
same way you can apply the data from your spider trap to existing crawlers you also have the ability to identify
these newer crawlers the moment they visit your site, rather than forcing you to trawl through your logs or making
you wait for someone else to update their listings.
-
Monitor Bad Crawlers
As robots.txt is not normally enforced by the server there is nothing to stop a crawler from totally ignoring
your instructions and crawling .
-
Blocking "Bad" Crawlers
We all know there are crawlers out there which are accessing your site for a commercial reason but which are
poorly coded and which have very little respect for established rules (such as robots.txt and leaving intervals
between page requests so they don't over-stress the server). Also you have the malicious crawlers which
aren't accessing your site for any legitimate purpose other than to find issues which they can abuse.
Obviously if they're using a consistent method of identifying themselves then you can block them which saves
server time and your bandwidth. Some people don't always have access to server-level blocking mechanisms so
an integrated script is often the next best thing.
-
Blocking E-Mail Harvesters
I have a theory which says that most of the spam an e-mail account gets sent is a result of that address being
"scraped" off some website or other. Aside from "armouring" the e-mail address, the next best
thing is not to let it get harvested in the first place.
The problem is that most e-mail harvesting applications ship with a default user-agent which can be changed to
something else far more innocent sounding than the default, perhaps even a browser! Obviously this stops it being
picked up by routine user-agent blocking (which we just covered) so you could extend the logic a little further to
detect common traits which real users and real crawlers don't have but which e-mail harvesters do - for example
you often find they cope very badly with multiple cookies as they cross from site to site, which is something
no legitimate client does. (See the related links for more information.)
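The monitoring and blocking ideas in the list above could be sketched along the following lines. This is only an
illustration under some loud assumptions: the robots.txt content, the BLOCKED_AGENT_FRAGMENTS entries and the
function names build_rules and decide are all made up, and a crawler that changes its user-agent will walk
straight past the blocklist. The robots.txt parsing uses Python's standard urllib.robotparser module.

```python
from urllib.robotparser import RobotFileParser

# Example robots.txt for illustration; a real script would read the site's own file.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""

# Hypothetical user-agent fragments we have decided to refuse outright.
BLOCKED_AGENT_FRAGMENTS = ("emailharvester", "badbot")


def build_rules(robots_txt):
    """Parse robots.txt text into a reusable rule set."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


def decide(rules, user_agent, path):
    """Return "block", "flag" or "allow" for a single request."""
    ua = (user_agent or "").lower()
    # Blocking: refuse anything that identifies itself with a known-bad fragment.
    if any(fragment in ua for fragment in BLOCKED_AGENT_FRAGMENTS):
        return "block"
    # Monitoring: the request touched a path robots.txt told this client to avoid.
    if not rules.can_fetch(user_agent or "*", path):
        return "flag"
    return "allow"
```

With the example rules, a request for /private/page.html from a client calling itself ExampleBot/1.0 comes back
as "flag", while the same client asking for /public/page.html is allowed. What you do with a flagged request
(log it, watch it, or block it next time) is left to the rest of your script.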