Spider Traps - The Concepts

Continued from page 1 - "What is a spider-trap and what can it do for me?"

How should I store the data I'm using?

My personal preference will always be a nice, modern RDBMS such as SQL Server, MySQL or Oracle, as this is exactly the type of task databases were created for. The real bonus of using any database is the level of data manipulation and searching on offer: if you make good use of the Structured Query Language (SQL for short), retrieving and analysing the data becomes a whole lot easier.

Admittedly you could simply hard-code your data into the script, but trust me when I say that this quickly stops giving you enough flexibility to create anything more than a basic script. Data modification & searching become key issues as you try to add more complexity.

Ideally your storage medium should be able to handle several users at once without any data corruption or locking issues. Primarily you want to ensure that the data does not get corrupted, but it's also not very helpful to lose data you wanted stored. Support for some kind of data query interface is also a necessity: you are eventually going to be faced with quite a lot of information, and at that point you don't want to be forced into relying on a slow & inefficient manual search.
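As a rough illustration of the storage requirements above, here is a minimal sketch using Python's built-in sqlite3 module. SQLite is only a stand-in chosen to keep the example self-contained, not one of the servers named earlier. The table layout and the function names (open_store, log_request, requests_per_agent) are assumptions made for this example.

```python
import sqlite3


def open_store(path=":memory:"):
    """Open (or create) a hypothetical request log and return the connection."""
    conn = sqlite3.connect(path)
    conn.execute(
        """CREATE TABLE IF NOT EXISTS requests (
               id           INTEGER PRIMARY KEY AUTOINCREMENT,
               requested_at TEXT NOT NULL DEFAULT CURRENT_TIMESTAMP,
               ip_address   TEXT NOT NULL,
               user_agent   TEXT NOT NULL,
               url          TEXT NOT NULL
           )"""
    )
    return conn


def log_request(conn, ip_address, user_agent, url):
    """Store one request."""
    # "with conn" wraps the insert in a transaction: it is committed on
    # success and rolled back if an exception is raised, so a failed
    # write cannot leave a half-stored row behind.
    with conn:
        conn.execute(
            "INSERT INTO requests (ip_address, user_agent, url) VALUES (?, ?, ?)",
            (ip_address, user_agent, url),
        )


def requests_per_agent(conn):
    """Return (user_agent, hit_count) rows, busiest agent first."""
    return conn.execute(
        "SELECT user_agent, COUNT(*) AS hits FROM requests "
        "GROUP BY user_agent ORDER BY hits DESC, user_agent"
    ).fetchall()
```

The final query is the kind of searching that becomes painful with hard-coded data: one SQL statement summarises every stored request by user-agent.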

What information do I need?

Your data requirements will change depending on what you are planning to do. For a spider-trap which is only required to perform identification, you are only really interested in information which can help identify the crawlers. This means you won't need the extended information (**'d entries).

Equally, if you intend your spider-trap to log data for later analysis (or to perform simple analysis & filtering on new crawlers autonomously) then you will require the extended information (**'d entries). Ideally, if you are adopting a storage approach, you need access to all of the request headers through your chosen scripting language. The reason is that you want to retain enough information to recreate the request accurately, step by step, should the need arise.
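To make "retain all of the request headers" a little more concrete, here is a minimal sketch that assumes the script runs under a Python WSGI server, where request headers arrive in the environ dictionary as HTTP_* keys. The function names are hypothetical. WSGI does not preserve the original header order or capitalisation, so this captures the header values rather than a byte-for-byte copy of the request.

```python
import json


def headers_from_environ(environ):
    """Collect the request headers from a WSGI environ dictionary."""
    headers = {}
    for key, value in environ.items():
        if key.startswith("HTTP_"):
            name = key[5:]                      # strip the "HTTP_" prefix
        elif key in ("CONTENT_TYPE", "CONTENT_LENGTH"):
            name = key                          # WSGI stores these two without a prefix
        else:
            continue                            # not a request header
        # USER_AGENT -> User-Agent
        headers[name.replace("_", "-").title()] = value
    return headers


def snapshot_request(environ):
    """Serialise the request details to a JSON string ready for storage."""
    record = {
        "method": environ.get("REQUEST_METHOD", ""),
        "path": environ.get("PATH_INFO", ""),
        "query": environ.get("QUERY_STRING", ""),
        "remote_addr": environ.get("REMOTE_ADDR", ""),
        "headers": headers_from_environ(environ),
    }
    return json.dumps(record, sort_keys=True)
```

Storing the whole snapshot means headers you did not anticipate needing are still available when you come to analyse a new crawler later.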

Immediately below you will see a short list of the request headers which can prove useful to anyone coding a spider-trap.

How to identify spiders & crawlers

Currently there are three easy (but rather basic) ways to identify an automated agent which is crawling (requesting pages from) your website. This doesn't attempt to be an exhaustive list, but it does cover the major areas, which are for the most part the first two items:

User-Agent

This is a simple text label which any program that requests anything from the internet should present along with the request. It will usually identify the product, usually a version, and sometimes the name of the creator, maintainer or owner.

On the surface this seems like an easy (not to mention foolproof) way to identify a crawler. However, this "label" is under the control of whatever is issuing the request, making changing the text to something misleading a work of mere moments. Other reasons for changing the user-agent (aside from purely malicious ones) vary wildly - maybe the crawler has undergone a significant upgrade and the user-agent was "improved" to make its purpose more obvious or to add more contact information, or maybe it just wants to check whether what you gave it is any different from what you would give to a browser that asked for the same page.

Obviously this significantly devalues the user-agent as the sole means of identifying crawlers and other spidering applications where any sort of reliability is required. That said, if you are happy to accept that a portion of what your spider trap thinks are search engines really aren't, then why try to improve on a method which works for you?
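A user-agent check can be sketched in a few lines. The tokens and owner names below are made up for illustration (they are not real crawlers), and the function name is an assumption. Note that the result is only what the request claims to be, for exactly the reasons given above.

```python
# Hypothetical lookup table: lower-case token -> the crawler owner it suggests.
KNOWN_CRAWLER_TOKENS = {
    "examplebot": "Example Search Company",
    "samplespider": "Sample Index Ltd",
}


def claimed_crawler(user_agent):
    """Return the owner a user-agent string claims to belong to, or None.

    This only reports a claim: the text is supplied by the client and
    can be changed to anything at all.
    """
    lowered = (user_agent or "").lower()   # tolerate a missing header
    for token, owner in KNOWN_CRAWLER_TOKENS.items():
        if token in lowered:
            return owner
    return None
```

For example, claimed_crawler("Mozilla/5.0 (compatible; ExampleBot/2.1)") returns "Example Search Company", while an ordinary browser string returns None.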

IP addresses & Hostnames

Anything which connects to the internet must have an IP address, and every IP address which has ever been assigned can be used to determine which organisation owns it. Normally the type of people who run the big search engines have actually bought a large block of IP addresses for their computers, workers and crawlers / spiders to use. This means that if we wanted to, we could look up an address we think might be a crawler and say with a very high level of accuracy whether what we think is a crawler from the Example Search Company is operating from an address they own, or whether it's coming from someone else and so is less likely to be them.
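Once you know which blocks an organisation owns, checking an address against them is straightforward. The sketch below uses Python's standard ipaddress module. The owner name and the address blocks are placeholders (they are ranges reserved for documentation), and OWNED_BLOCKS and owner_of are names invented for this example.

```python
import ipaddress

# Hypothetical list of address blocks per organisation. These particular
# ranges are reserved for documentation and belong to no real crawler.
OWNED_BLOCKS = {
    "Example Search Company": ["192.0.2.0/24", "2001:db8::/32"],
}


def owner_of(ip_text):
    """Return the organisation whose block contains ip_text, or None."""
    try:
        address = ipaddress.ip_address(ip_text)
    except ValueError:
        return None                      # not a valid IPv4/IPv6 address
    for owner, blocks in OWNED_BLOCKS.items():
        for block in blocks:
            # An IPv4 address is never "in" an IPv6 network (and vice
            # versa), so mixed lists are safe to test against.
            if address in ipaddress.ip_network(block):
                return owner
    return None
```

In practice the block list would live in your data store rather than in the script, so it can be updated without touching the code.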

Secondary to the ownership information from IP addresses, for large internet companies (especially search engines) you can often do a reverse DNS query on an IP address and get a readable answer back explaining a little about what the address is used for. For example, a lookup on an IP address which is used heavily by a search engine's crawlers will normally result in an answer such as crawler012.example.com - one of Google's is called crawl5.googlebot.com.
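A reverse DNS check might look something like the following sketch, which uses socket.gethostbyaddr from Python's standard library. It also adds a step the text above does not describe: a forward lookup of the returned hostname to confirm it maps back to the same address, since whoever controls an address block also controls its reverse DNS answers. The function name and the injectable lookup parameters are assumptions made so the example can be tried without network access. socket.gethostbyname_ex only handles IPv4.

```python
import socket


def verified_hostname(ip_address, allowed_suffixes,
                      reverse_lookup=socket.gethostbyaddr,
                      forward_lookup=socket.gethostbyname_ex):
    """Return the hostname for ip_address if it looks trustworthy, else None.

    allowed_suffixes should start with a dot (".example.com") so that
    "evilexample.com" cannot match.
    """
    try:
        hostname = reverse_lookup(ip_address)[0]      # (name, aliases, addresses)
    except OSError:                                   # covers herror / gaierror
        return None
    hostname = hostname.rstrip(".").lower()
    if not any(hostname.endswith(suffix) for suffix in allowed_suffixes):
        return None
    try:
        forward_ips = forward_lookup(hostname)[2]     # addresses for that name
    except OSError:
        return None
    # Only trust the name if it resolves back to the address we started with.
    return hostname if ip_address in forward_ips else None
```

DNS lookups are slow compared with the rest of a page request, so you would probably want to store the result against the IP address rather than repeat the lookup on every hit.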

Obviously using IP addresses has both advantages and disadvantages. The biggest advantage is that they provide a relatively robust method of identifying the mainstream spiders / crawlers even if they are not supplying their usual user-agent. The downside is that once you start looking at the independent crawlers (i.e. the type you or I could download or buy), they are so geographically distributed and used for such a wide variety of purposes that matching on IP address alone starts to fail quite often. You also have to remember that your lists of IP addresses will need constant maintenance to keep them up to date, which in turn keeps your spider trap working as smoothly as possible.

Weighting

This is a more advanced method which combines two or more of the above techniques to create something similar to a self-test quiz - the sort of thing involving multiple choice questions where you calculate your score at the end before being told what it means if you scored 1 to 10 points and so on. The weighted method is remarkably similar to one of these quizzes: it relies on the results of a set of tests, with certain results scoring "points". At the end of the batch of tests the final score is calculated, and that score determines the outcome.

A weighted system is more complex than the others (with the possible exception of the behaviour analysis) because it requires that an appropriate value is attached to each response. In addition you require a suitably structured scoring system to determine the final outcome. Neither of these is simple, and both require lots of tweaking and testing to ensure that they work as expected.
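The quiz analogy can be sketched as follows. The test names, point values and score bands are invented purely for illustration. As noted above, choosing real values takes a good deal of tweaking and testing, so treat these as placeholders.

```python
# (test name, points scored when the test passes) - all values are guesses.
TESTS = [
    ("user_agent_claims_crawler", 3),
    ("ip_in_known_block", 5),
    ("hostname_verified", 4),
]

# (minimum score, outcome), checked from the highest band downwards.
THRESHOLDS = [
    (9, "crawler"),
    (4, "suspect"),
]


def score(results):
    """Add up the points for every test that passed.

    results maps a test name to True/False; missing tests count as failed.
    """
    return sum(points for name, points in TESTS if results.get(name))


def classify(results):
    """Turn the final score into an outcome, like reading off a quiz result."""
    total = score(results)
    for minimum, label in THRESHOLDS:
        if total >= minimum:
            return label
    return "visitor"
```

With these placeholder values, a request that passes only the user-agent test scores 3 and is treated as an ordinary visitor, while one that passes the IP block and hostname tests scores 9 and is classed as a crawler even with an unfamiliar user-agent.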

Continued on page 3 - "Methods of crawler identification (continued): Detecting crawlers by their behaviour."
