What's all this about then?

So one thing led to another, and I ended up producing this.
I originally just tried a few obvious words (my name, linux, microsoft and so on) and started to collect some stats. Then I decided to broaden the search with a large set of words, which I simply collected from the Google Zeitgeist page.
The results on this page are the collated stats across the full set of words (see below). Each phrase is then broken down into a detailed page which shows the data for the full set of 100 results, a summary of the top 10 hits and a chart showing the coverage of webservers across the search results.
A Perl script was used to perform searches against Google and MSN and scrape the results. I intentionally avoided the search APIs to be sure I was getting the same results as a normal user. Each server was then identified using basic fingerprinting: an initial HEAD query, followed by more specific queries as required, and finally OPTIONS matching. This proved sufficient to identify all but a few esoteric servers which were either intentionally hardened, or non-generic custom servers with no identification (e.g. directory.yahoo.com). After that it was a simple matter of slicing and dicing.
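As a rough illustration of the fingerprinting step, here is a minimal sketch in Python (not the original Perl, which isn't shown here). It assumes the Server response header is enough to identify the common cases, and it simplifies the OPTIONS step to a second look at that same header. The names KNOWN_FAMILIES, classify_server, fetch_server_header and fingerprint are made up for this example, and the "more specific queries" stage is left out entirely.

```python
import http.client
from typing import Optional

# Hypothetical mapping from Server-header substrings to server families.
# The real analysis may have recognised many more servers than this.
KNOWN_FAMILIES = {
    "microsoft-iis": "IIS",
    "apache": "Apache",
    "netscape-enterprise": "Netscape",
    "zeus": "Zeus",
}


def classify_server(server_header: Optional[str]) -> str:
    """Map a raw Server header value to a coarse server family."""
    if not server_header:
        return "unknown"  # hardened or custom server with no banner
    lowered = server_header.lower()
    for needle, family in KNOWN_FAMILIES.items():
        if needle in lowered:
            return family
    return "other"  # a banner was sent, but not one we recognise


def fetch_server_header(host: str, method: str = "HEAD",
                        timeout: float = 10.0) -> Optional[str]:
    """Send a single request for "/" and return the Server header, if any."""
    conn = http.client.HTTPConnection(host, timeout=timeout)
    try:
        conn.request(method, "/")
        response = conn.getresponse()
        response.read()  # drain any body so the connection closes cleanly
        return response.getheader("Server")
    finally:
        conn.close()


def fingerprint(host: str) -> str:
    """Try HEAD first, then fall back to OPTIONS, as described above."""
    for method in ("HEAD", "OPTIONS"):
        try:
            family = classify_server(fetch_server_header(host, method))
        except (OSError, http.client.HTTPException):
            continue  # network error or malformed reply: try the next method
        if family != "unknown":
            return family
    return "unknown"
```

Calling fingerprint("www.example.com") would return one of the family names above, "other" for an unrecognised banner, or "unknown" when the server gives nothing away.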
All of the scraping, analysis and charting was done using a few simple Perl scripts. I will make these available so others can have a play with their own analysis or verify these results. All of the raw data from the searches is available in the (not quite) .csv.gz files.
A few highlights of these statistics. Firstly, for some reason some MSN queries return a result set that contains tracking links for every URL. These return a URL starting "http://g.msn.com/9SE/1?" followed by the original URL, followed by some tracking information. One such search that originally skewed the results was for "MP3". It would be interesting to see what other search terms return tracked result sets.
OK, I've since discovered this is quite a common technique used during the development of a search engine, and there's a Google FAQ about it from when they were doing the same.
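For anyone wanting to spot tracked result sets in their own scrapes, a small Python sketch follows. It relies only on the prefix quoted above; the function names are invented for this example. Since the layout of the trailing tracking information isn't described here, the sketch deliberately leaves that part attached rather than guessing at a separator.

```python
from typing import Sequence

# Prefix observed on tracked MSN result links, as quoted in the text above.
TRACKING_PREFIX = "http://g.msn.com/9SE/1?"


def is_tracked(url: str) -> bool:
    """Return True if the result URL goes through the MSN tracking redirect."""
    return url.startswith(TRACKING_PREFIX)


def strip_tracking_prefix(url: str) -> str:
    """Remove the tracking prefix, leaving the original URL plus whatever
    tracking information MSN appended after it (format not known here)."""
    if is_tracked(url):
        return url[len(TRACKING_PREFIX):]
    return url


def tracked_fraction(urls: Sequence[str]) -> float:
    """Fraction of a result set that uses tracking links (0.0 if empty)."""
    if not urls:
        return 0.0
    return sum(1 for u in urls if is_tracked(u)) / len(urls)
```

Running tracked_fraction over each query's result list would show which search terms come back as tracked sets.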
Looking at the queries for Microsoft and Linux shows results skewed towards IIS and Apache respectively, so people are eating their own dogfood. Interestingly, the percentage of Microsoft sites using IIS is much lower (64%) than the percentage of Linux sites running Apache (94%), suggesting that not everyone is as confident in their own webserver.
Staying with the Linux theme, the 'linux' query interestingly returns RedHat, Debian and Novell on the first page of Google results, but none of these show up on the first page of MSN results, and Debian doesn't even feature in the top 100.
On the whole it seems that the MSN search engine is indeed placing IIS-hosted sites higher in the results more frequently than other webservers. MSN frequently places more IIS servers in the important top 10 results than Google does, even where the MSN result set for a query has actually returned fewer IIS servers overall.
Looking at the coverage graphs, most search phrases return a more even spread of IIS servers throughout the result sets from the MSN searches.
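The top 10 versus overall comparison is easy to reproduce from the raw data. The Python sketch below assumes each result set has already been reduced to an ordered list of server family names (rank 1 first). The function names and that input shape are assumptions made for this example, not a description of the original scripts.

```python
from typing import Dict, Optional, Sequence


def server_share(servers: Sequence[str], family: str,
                 top_n: Optional[int] = None) -> float:
    """Fraction of results served by `family`, optionally only in the top N."""
    window = list(servers) if top_n is None else list(servers)[:top_n]
    if not window:
        return 0.0
    return sum(1 for s in window if s == family) / len(window)


def compare_engines(results: Dict[str, Sequence[str]], family: str,
                    top_n: int = 10) -> Dict[str, Dict[str, float]]:
    """For each engine, report the family's share in the top N and overall."""
    return {
        engine: {
            "top": server_share(ranked, family, top_n),
            "overall": server_share(ranked, family),
        }
        for engine, ranked in results.items()
    }
```

A query where MSN shows a higher "top" share for IIS than Google, but a lower "overall" share, is exactly the pattern described above.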
So what's going on?
I have no idea. I doubt it's all a big conspiracy, but some possible explanations spring to mind:
Perhaps the MSN search has simply been coded by developers used to talking
to IIS machines and so it just does that job better?
Perhaps the MSN spider is taking advantage of some specific IIS features to
provide enhanced indexing?
Ivor Hewitt
ivor at ivor dot it
February 2005,
Surrey, England.
Comments, suggestions, conspiracy theories etc welcome
All content Copyright © 2005 Ivor Hewitt.
http://www.ivor.it - Technology - http://www.ivor.org - The Hedge.