Spider Traps - The Concepts

Continued from page 2 - "Useful information to record from crawlers & the main methods of identification"

Behaviour

This is less precise than either of the two previous methods but will often produce interesting results. To use it you need to know something about the quirks of a given crawler: exactly which headers it includes in its requests, or some other trait that is unique to that particular crawler implementation / engine / API / toolkit.

Originally I started thinking about this technique as an extended browser verification method, but there's no reason it could not be adapted to crawlers once a sufficiently large data sample has been gathered. Using this concept it might also be possible to identify a new crawler as using a certain type of engine or API, and then decide how to classify it based on that information.
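To make the idea of classifying an unknown crawler by its engine a little more concrete, here is a minimal sketch in Python. It assumes you have already recorded which headers each toolkit tends to send; the profile names and header sets below are invented purely for illustration, and the similarity measure (Jaccard overlap of header names) and the threshold are my own assumptions rather than anything a real system is known to use.

```python
# Hypothetical header profiles: the names and header sets below are
# invented for illustration, not measured from any real crawler.
KNOWN_PROFILES = {
    "toolkit-a": {"host", "user-agent", "accept", "connection"},
    "toolkit-b": {"host", "user-agent", "accept", "accept-encoding", "from"},
    "bare-socket": {"host", "user-agent"},
}


def jaccard(a, b):
    """Similarity of two sets: size of the intersection over size of the union."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def classify(request_headers, profiles=KNOWN_PROFILES, threshold=0.6):
    """Return (profile_name, score) for the closest profile.

    profile_name is None when no profile reaches the threshold, which
    would mark the request as "unknown engine" for later inspection.
    """
    # Header names are case-insensitive, so compare them in lower case.
    seen = {name.lower() for name in request_headers}
    best_name, best_score = None, 0.0
    for name, expected in profiles.items():
        score = jaccard(seen, expected)
        if score > best_score:
            best_name, best_score = name, score
    if best_score < threshold:
        return None, best_score
    return best_name, best_score
```

A request carrying only Host and User-Agent would match the made-up "bare-socket" profile exactly, while a request with none of the recorded headers would come back as unknown. In practice the profiles would have to be built from your own logs, and header order or header values might prove more telling than the bare set of names.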

The core of a behavioural system relies on two simple facts:
  1. The people who write crawlers have their own ideas about how to optimise their request patterns. One might use a certain header to achieve one effect, another might use a totally different header to achieve a different effect, while a third might decide that he/she just needs the basic request data without any extended headers.

    The basic point I'm trying to make is that when programs that access internet sites are created, their designers introduce a certain amount of entropy into the equation (entropy here meaning a certain amount of randomness, or variation between implementations), even if they are working off a text-book design, because that is the nature of creative people.
  2. The vast majority of people who try to pass their request off as coming from something totally different to their actual configuration don't bother to go the whole hog - they maybe just change the user-agent string, or perhaps use a basic request with a fake user-agent and an extended header or two.

    My point is that the majority of people are lazy and don't bother to replicate closely the true nature of whatever they are trying to emulate. In other words, if they want to pretend to be the Google crawler they rarely bother to copy every header that GoogleBot supplies - merely the ones they expect or think are important. After all, capturing a full GoogleBot request is rather time-consuming if you weren't planning to do it when you got up this morning. Again, it's all these little personal changes which generate entropy.
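The second fact suggests a simple check: if a request claims to be a particular crawler, compare the headers it actually sent against the headers the genuine article is known to send. The following Python sketch shows one way that might look. The "examplebot" token and its expected header set are hypothetical placeholders, not a record of what any real crawler sends; a real profile would have to come from requests you have logged and verified yourself.

```python
# Hypothetical profile: headers we ASSUME a genuine client with this
# user-agent token always sends. Invented for illustration only.
EXPECTED_HEADERS = {
    "examplebot": {"host", "user-agent", "accept", "accept-encoding", "from"},
}


def missing_headers(request_headers):
    """Return the expected headers that the request left out.

    Returns None when the user-agent matches no known profile, and an
    empty set when every expected header is present.
    """
    # Header names are case-insensitive, so normalise them first.
    lowered = {name.lower(): value for name, value in request_headers.items()}
    agent = lowered.get("user-agent", "").lower()
    for token, expected in EXPECTED_HEADERS.items():
        if token in agent:
            return expected - set(lowered)
    return None


def looks_spoofed(request_headers):
    """True when the claimed client is known but some expected headers are absent."""
    return bool(missing_headers(request_headers))


# A lazy impersonator who only changed the user-agent string.
lazy = {"Host": "example.com", "User-Agent": "ExampleBot/1.0"}

# A request that carries every header in the assumed profile.
full = {
    "Host": "example.com",
    "User-Agent": "ExampleBot/1.0",
    "Accept": "*/*",
    "Accept-Encoding": "gzip",
    "From": "bot@example.com",
}
```

Here looks_spoofed(lazy) flags the request because three expected headers are absent, while looks_spoofed(full) does not. A missing header is only a hint, not proof: proxies can strip headers and genuine crawlers change over time, so a result like this is probably best treated as one signal among several.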

Related links

E-Mail Protection - ASP script which "armours" e-mail addresses in order to make it much harder for spammers to detect and harvest them without interfering with normal users. It also employs a mix of heuristic and spider-trap techniques to deny access to harvesting programs (as discussed above).
Crawler Filter - ASP script which prevents certain crawlers and applications from visiting pages on the website. Features pattern matching against a database of bad and undesirable crawlers.
