Spider Traps - The Concepts
Continued from
page 1 - "What is a spider-trap and what can it do for me?"
How should I store the data I'm using?
My personal preference will always be a nice, modern RDBMS such as SQL Server, MySQL or Oracle, as this is exactly the
type of task databases were created for. The real bonus of using a database is the level of data manipulation and
searching on offer: if you make good use of the Structured Query Language (SQL for short), retrieving and analysing
the data becomes a whole lot easier.
Admittedly you could simply hard-code your data into the script, but trust me when I say you'll quickly find that this
doesn't give you enough flexibility to create anything more than a basic script: modifying and searching the data
become key issues as you try to add more complexity.
Ideally your storage medium should be able to handle several users at once without any data corruption or locking
issues. Your primary concern is that stored data does not get corrupted, but losing data you wanted stored is not very
helpful either. Support for some kind of data query interface is also a necessity, as you will eventually be faced
with quite a lot of information and at that point you don't want to be forced into relying on a slow & inefficient
manual search.
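As a rough illustration of why a query interface matters, here is a minimal sketch using Python's built-in sqlite3
module. SQLite is used only because it needs no server; it is a stand-in for whichever RDBMS you prefer, and the
table name, column names and helper functions are all hypothetical.

```python
import sqlite3

# Hypothetical schema: table and column names are illustrative only.
SCHEMA = """
CREATE TABLE IF NOT EXISTS trap_hits (
    id         INTEGER PRIMARY KEY,
    seen_at    TEXT NOT NULL,   -- server timestamp, ISO 8601 text
    ip_address TEXT NOT NULL,
    user_agent TEXT
)
"""

def open_store(path=":memory:"):
    """Open (or create) the hit store and make sure the table exists."""
    conn = sqlite3.connect(path)
    conn.execute(SCHEMA)
    return conn

def record_hit(conn, seen_at, ip_address, user_agent):
    """Insert one hit; the 'with' block commits on success or rolls back on error."""
    with conn:
        conn.execute(
            "INSERT INTO trap_hits (seen_at, ip_address, user_agent) VALUES (?, ?, ?)",
            (seen_at, ip_address, user_agent),
        )

def hits_per_agent(conn):
    """Let SQL do the searching: count hits per user-agent, busiest first."""
    return conn.execute(
        "SELECT user_agent, COUNT(*) FROM trap_hits "
        "GROUP BY user_agent ORDER BY COUNT(*) DESC, user_agent"
    ).fetchall()
```

The same summary against hard-coded data would mean writing and maintaining the grouping and sorting logic by hand.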
What information do I need?
Your data requirements depend on what you are planning to do. For a spider-trap which is only required to perform
identification, you are only really interested in information which can help identify the crawlers, which means you
won't need the extended information (the **'d entries).
Equally, if you intend your spider-trap to log data for later analysis (or to perform simple analysis & filtering
on new crawlers autonomously) then you will require the extended information (the **'d entries). Ideally, if you are
storing data, you will have access to all of the request headers through your chosen scripting language, so that you
retain enough information to recreate the request accurately, step by step, should the need arise.
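One way to keep everything needed to replay a request is to copy every header as it arrives. The sketch below assumes
your script receives a CGI/WSGI-style environment mapping, in which request headers appear as HTTP_* keys; the
function name capture_request and the layout of the returned record are hypothetical.

```python
from datetime import datetime, timezone

def capture_request(environ, now=None):
    """Copy every request header out of a CGI/WSGI-style environ mapping."""
    headers = {}
    for key, value in environ.items():
        # Only HTTP_* keys are request headers; Content-Type and
        # Content-Length use unprefixed keys and are not captured here.
        if key.startswith("HTTP_"):
            # HTTP_USER_AGENT -> User-Agent
            name = key[5:].replace("_", "-").title()
            headers[name] = value
    stamp = now or datetime.now(timezone.utc)
    return {
        "timestamp": stamp.isoformat(),
        "ip_address": environ.get("REMOTE_ADDR", ""),
        "method": environ.get("REQUEST_METHOD", ""),
        "path": environ.get("PATH_INFO", ""),
        "headers": headers,
    }
```

Storing the whole header set, rather than a hand-picked few, means a header you did not think to record today is still
available when you analyse the request later.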
Immediately below you will see a short list of the request headers which can prove useful to anyone coding a
spider-trap.
- Timestamp** - a full timestamp from the server is very useful as it allows you to cross-match what you are
  seeing in the spider trap with what your website logs say for the same time period.
- User-Agent - the literal name of the crawler is always useful, as once you put a name to something you are a
  little closer to learning more about it.
- IP address - the internet address the crawler was operating from, often useful when tracking the origin of a
  "fake" or malicious crawler.
- From - content varies from crawler to crawler but typically this gives contact information for the owners,
  operators or administrators of the crawler.
- Referer - the normal referrer header. Legitimate crawlers rarely send it, but when present it may suggest where
  the crawler has been previously, and it is frequently sent by malicious crawlers.
- Via** - this header is normally present when the request has come through a non-transparent proxy server, and
  often indicates the make, model and sometimes version of the proxy. For our purposes this is useful because it
  indicates that the IP address we are seeing may not be the true IP address.
- Forwarded-for** - another proxy header (commonly sent as X-Forwarded-For), this time indicating the IP address
  of the machine which originally issued the request through the proxy.
- Cookies** - the normal Cookie header, again very rarely used by legitimate crawlers. However, finding excessive
  "foreign" cookies listed here is often indicative of a poorly coded, malicious crawler.
How to identify spiders & crawlers
Currently there are three easy (but rather basic) ways to identify an automated agent which is crawling (requesting
pages from) your website. This doesn't attempt to be an exhaustive list, but it does cover the major areas, which are
for the most part the first two items:
User-Agent
This is a simple text label which any program requesting anything from the internet should present along with the
request. Usually it will identify the product, usually a version, and sometimes the name of the creator, maintainer
or owner.
On the surface this seems like an easy (not to mention foolproof) way to identify a crawler. However, this "label"
is under the control of whatever is issuing the request, so changing the text to something misleading is the work of
mere moments. Reasons for changing the user-agent (other than purely malicious ones) vary wildly - maybe the
crawler has undergone a significant upgrade and the user-agent was "improved" to make its purpose more obvious
or to add more contact information, or maybe it just wants to check whether what you gave it is any different from
what you would give to a browser that asked for the same page.
Obviously this significantly devalues the user-agent as the sole means of identifying crawlers and other spidering
applications where any sort of reliability is required. Then again, if you are happy that a portion of the items your
spider trap thinks are search engines really aren't, why try to improve on a method which works for you?
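A minimal sketch of user-agent matching is shown below. The token list is hypothetical and would normally come from
your own data, and the result only tells you what the client claims to be, for the reasons given above.

```python
# Hypothetical list of lowercase substrings to look for; not a complete
# or authoritative list of crawler names.
KNOWN_CRAWLER_TOKENS = ("googlebot", "examplebot", "slurp")

def claimed_crawler(user_agent):
    """Return the first known token found in the user-agent, or None.

    A match means the client *claims* to be that crawler, nothing more.
    """
    ua = (user_agent or "").lower()
    for token in KNOWN_CRAWLER_TOKENS:
        if token in ua:
            return token
    return None
```

Requests with no User-Agent header at all are treated the same as an unrecognised one here; you may prefer to flag
them separately.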
IP addresses & Hostnames
Anything which connects to the internet must have an IP address, and any assigned IP address can be looked up to
determine which organisation it is registered to. Normally the type of organisations who run the big search engines
have acquired large blocks of IP addresses for their computers, workers and crawlers / spiders to use. This means
that we can query an address we think might be a crawler and say with a very high level of accuracy whether what we
think is a crawler from the Example Search Company is operating from an address they own, or whether it's coming from
someone else and so is less likely to be them.
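Once you have looked up which blocks an organisation owns, checking an address against them is straightforward. The
sketch below uses Python's standard ipaddress module; the ownership table is hypothetical and uses address ranges
reserved for documentation rather than any real company's blocks.

```python
import ipaddress

# Hypothetical ownership table built from your own lookups. The ranges
# are documentation-only blocks, not real search engine addresses.
OWNED_BLOCKS = {
    "Example Search Company": ["192.0.2.0/24", "198.51.100.0/25"],
}

def owner_of(ip_text):
    """Return the owner whose block contains the address, or None."""
    try:
        ip = ipaddress.ip_address(ip_text)
    except ValueError:
        return None  # not a valid IP address at all
    for owner, blocks in OWNED_BLOCKS.items():
        for block in blocks:
            if ip in ipaddress.ip_network(block):
                return owner
    return None
```

For example, owner_of("192.0.2.44") returns "Example Search Company", while an address outside every listed block
returns None.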
Secondary to the ownership information, for large internet companies, especially search engines, you can often do a
reverse DNS query on an IP address and get a readable answer back explaining a little about what the address is used
for. For example, a lookup on an IP address used heavily by a search engine's crawlers will normally result in an
answer such as crawler012.example.com - one of Google's, for instance, is called crawl5.googlebot.com.
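The reverse lookup can be sketched with the standard socket module. The suffix list is an assumption made for
illustration, and the lookup function is passed in as a parameter so the matching logic can be tried without touching
the network.

```python
import socket

# Hypothetical suffix list; which domains a given crawler's hosts
# resolve under is an assumption you would need to confirm yourself.
CRAWLER_HOST_SUFFIXES = (".googlebot.com", ".example.com")

def reverse_lookup(ip_text):
    """Return the reverse DNS hostname for an address, or None if absent."""
    try:
        return socket.gethostbyaddr(ip_text)[0]
    except OSError:  # covers socket.herror and socket.gaierror
        return None

def looks_like_crawler_host(ip_text, lookup=reverse_lookup):
    """True if the address reverse-resolves to a name under a known suffix."""
    host = lookup(ip_text)
    if not host:
        return False
    # Match on the end of the name so "googlebot.com.evil.test" is rejected.
    return host.lower().rstrip(".").endswith(CRAWLER_HOST_SUFFIXES)
```

Reverse DNS names are set by whoever controls the address block, so a matching hostname is best treated as one more
piece of evidence rather than proof.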
Obviously using IP addresses has both advantages and disadvantages. The biggest advantage is that they provide a
relatively robust method of identifying the mainstream spiders / crawlers even if they are not supplying their usual
user-agent. The downside is that once you start looking at the independent crawlers (i.e. the type you or I could
download or buy), they are so geographically distributed and used for such a wide variety of purposes that matching
on IP address alone starts to fail quite often. You also have to remember that your lists of IP addresses will need
constant maintenance to keep them up to date, which in turn keeps your spider trap working as smoothly as possible.
Weighting
This is a more advanced method which combines two or more of the above techniques to create something similar to a
self-test quiz - the sort of thing with multiple choice questions where you add up your score at the end before being
told what it means if you scored 1 to 10 points and so on. The weighted method is remarkably similar to one of these
quizzes, as it relies on the results of a set of tests, with certain results scoring "points". At the end of the
batch of tests the final score is calculated, and this determines the outcome.
A weighted system is more complex than the others (with the possible exception of behaviour analysis) because it
requires that an appropriate value is attached to each response. In addition, you need a suitably structured scoring
system to determine the final outcome. Neither of these is simple, and both will require lots of tweaking and testing
to ensure that they work as expected.
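A weighted check might be sketched as follows. The test names, point values and threshold are all hypothetical
placeholders; as noted above, finding sensible values is where the real work lies.

```python
# Hypothetical weights and threshold, chosen only for illustration.
WEIGHTS = {
    "user_agent_match": 3,  # user-agent names a known crawler
    "ip_owner_match": 5,    # address belongs to the expected organisation
    "hostname_match": 4,    # reverse DNS name looks like a crawler host
}
THRESHOLD = 7

def score(results):
    """Add up the points for every test that passed."""
    return sum(WEIGHTS.get(name, 0) for name, passed in results.items() if passed)

def classify(results):
    """Turn the final score into an outcome, quiz-style."""
    return "crawler" if score(results) >= THRESHOLD else "unknown"
```

With these values a matching user-agent alone (3 points) is not enough, but a matching user-agent plus a matching
hostname (7 points) is.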
Continued on
page 3 - "Methods of crawler identification (continued): Detecting crawlers by their behaviour."