Spider Traps - The Concepts
Continued from page 2 - "Useful information to record from crawlers & the main methods of identification"
Behaviour
This is less precise than either of the two previous methods but will often produce interesting results. You really
do have to know a little about the quirks of a given crawler: exactly which headers it includes in its requests, or
some other trait which is unique to that particular crawler implementation / engine / API / toolkit.
Originally I started thinking about this technique as an extended browser verification method, but there's no reason
it could not be adapted to crawlers once a sufficiently large data sample has been gathered. Using this concept
it might also be possible to identify a new crawler as using a certain type of engine or API and then decide how to
classify it based on that information.
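As a rough illustration of that last idea, here is a minimal Python sketch which scores the set of header names in a request against a few stored profiles and reports the closest match. Everything in it is an assumption made for the example: the profile names (`toolkit-a`, `toolkit-b`, `bare-client`), the header sets attached to them, the function names and the 0.75 threshold are all hypothetical, and real profiles would have to be built from your own logged requests.

```python
# Hypothetical header profiles. These are NOT real toolkit signatures;
# in practice they would be built from requests you have logged yourself.
KNOWN_PROFILES = {
    "toolkit-a": {"host", "user-agent", "accept", "accept-encoding", "connection"},
    "toolkit-b": {"host", "user-agent", "accept", "from"},
    "bare-client": {"host", "user-agent"},
}


def jaccard(a, b):
    """Similarity of two sets: size of the overlap divided by size of the union."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def classify_engine(request_headers, profiles=KNOWN_PROFILES, threshold=0.75):
    """Return (profile_name, score) for the closest profile.

    profile_name is None when no profile reaches the (arbitrary) threshold.
    """
    # Header names are case-insensitive, so compare them in lower case.
    names = {name.lower() for name in request_headers}
    best_name, best_score = None, 0.0
    for profile_name, profile_headers in profiles.items():
        score = jaccard(names, profile_headers)
        if score > best_score:
            best_name, best_score = profile_name, score
    if best_score < threshold:
        return None, best_score
    return best_name, best_score


if __name__ == "__main__":
    unknown_crawler = {
        "Host": "example.com",
        "User-Agent": "NewBot/0.1",
        "Accept": "*/*",
        "From": "admin@example.com",
    }
    print(classify_engine(unknown_crawler))
```

Comparing only header names is the crudest possible fingerprint. Header order, header values and request timing could all be added as further traits, but the principle of matching against recorded samples stays the same.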
The core of a behavioural system relies on two simple facts:
-
The people who write crawlers have their own ideas about how to optimise their request patterns. One might want
to use a certain header which does one thing, another crawler designer might want to use a totally different
header to achieve a different effect, while a third programmer might decide that he/she just needs the basic request
data without any extended headers.
The basic point I'm trying to make is that when programs which access internet sites are created, their designers
introduce a certain amount of entropy into the equation (entropy here simply means a degree of randomness), even if
they are working from a text-book design, because that is the nature of creative people.
-
Nearly everyone who tries to pass their request off as coming from something totally different to their actual
configuration doesn't bother to go the whole hog - they maybe just change the user-agent string, or perhaps only use
a basic request with a fake user-agent and an extended header or two.
My point is that the majority of people are lazy and don't bother to replicate closely the true nature of whatever
they are trying to emulate. In other words, if they want to pretend to be the Google crawler they
rarely bother to copy every header the GoogleBot supplies - merely the ones they expect or think are important.
After all, capturing a full GoogleBot request is rather time-consuming if you weren't planning to do it when you got up
this morning. Again, it's all these little, personal changes which generate entropy.
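The second fact lends itself to a simple check: if a request claims to be a particular crawler, see whether it also carries the headers that crawler has been observed to send. The Python sketch below is only an outline under stated assumptions. `examplebot` is a made-up crawler, its list of expected headers is invented for the example (it is not a record of what GoogleBot or any real crawler sends), and the function names are hypothetical.

```python
# Hypothetical record of the headers a genuine "ExampleBot" request carries.
# A real table would be built from captured requests you trust.
EXPECTED_HEADERS = {
    "examplebot": ["host", "user-agent", "accept", "accept-encoding", "from"],
}


def claimed_crawler(user_agent):
    """Return the known crawler token found in the user-agent string, or None."""
    ua = user_agent.lower()
    for token in EXPECTED_HEADERS:
        if token in ua:
            return token
    return None


def looks_spoofed(headers):
    """Return (suspicious, missing_headers) for a dict of request headers.

    A request is flagged only when it claims to be a known crawler but
    lacks headers that the genuine crawler was observed to send.
    """
    lowered = {name.lower(): value for name, value in headers.items()}
    token = claimed_crawler(lowered.get("user-agent", ""))
    if token is None:
        # No claim to be a crawler we know about, so nothing to compare against.
        return False, []
    missing = [h for h in EXPECTED_HEADERS[token] if h not in lowered]
    return bool(missing), missing


if __name__ == "__main__":
    lazy_fake = {"Host": "example.com", "User-Agent": "ExampleBot/2.1"}
    print(looks_spoofed(lazy_fake))
```

A missing header is a hint rather than proof, since a genuine crawler may change its requests between versions, so a result like this is probably best used as one signal alongside the other identification methods.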
Related links
E-Mail Protection - ASP script which "armours" e-mail addresses in order to make it much harder for spammers to detect and harvest them without interfering with normal users. It also employs a mix of heuristic and spider-trap techniques to deny access to harvesting programs (as discussed above).
Crawler Filter - ASP script which prevents certain crawlers and applications from visiting pages on the website. Features pattern matching against a database of bad and undesirable crawlers.