1. Dan Zambonini Bronze

    Technical Director at Box UK

    30 November 2004 12:05pm

    Just a quick note of caution - if anyone’s noticing a huge increase in their bandwidth over the last month, it’s probably due to the new Microsoft search spider.

    We’ve had a 10x increase in bandwidth since Oct 10th (and the associated cost that entails). After some investigation, we’ve noticed that the spider has got ‘trapped’ in one of our sites that uses sessions - the spider is getting a new session for each request (and hence a new session id), and therefore continues to spider forever...

    Other spiders can cope with session ids (even if it's just to ignore pages/sites that use them), so I'm a little concerned that MS didn't put much consideration into this widely used technique...

    So, if anyone’s using session ids in their URLs, I’d suggest switching them off for the MS spider (with some user-agent sniffing), or otherwise pre-empting the hell that the MS search spider can cause.
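
    For anyone wondering what the user-agent sniffing might look like, here is a minimal sketch in Python. It assumes the site appends the session id as a query parameter; the parameter name `sid`, the crawler pattern and the `build_link` helper are all made up for illustration, so adapt them to whatever your own session handling does.

```python
import re
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

# Hypothetical pattern: user-agent fragments we choose to treat as crawlers.
CRAWLER_PATTERN = re.compile(r"bot|crawler|spider|slurp|ms search", re.IGNORECASE)


def is_crawler(user_agent):
    """Return True if the user-agent string looks like a search spider."""
    return bool(CRAWLER_PATTERN.search(user_agent or ""))


def build_link(url, session_id, user_agent, param="sid"):
    """Append the session id to a link, except when the client is a crawler.

    Crawlers get the plain URL, so every request they make maps to the
    same address instead of a fresh one per session.
    """
    if is_crawler(user_agent):
        return url
    parts = urlsplit(url)
    query = parse_qsl(parts.query, keep_blank_values=True)
    query.append((param, session_id))
    return urlunsplit(parts._replace(query=urlencode(query)))
```

    With this, `build_link("/products?page=2", "abc123", "MS Search 4.0 Robot")` comes back without a session id, while an ordinary browser user agent still gets `sid` appended.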

    Dan

  2. Lawrence Ladomery

    Web Consultant at architxt.net

    30 November 2004 15:37pm

    This is quite amazing. The spider must have been very active, like some trapped animal gone crazy!

    If all this traffic had a (considerably) negative impact on your server's performance then could it be classified as a denial of service attack?


  3. Dan Zambonini Bronze

    Technical Director at Box UK

    30 November 2004 17:28pm

    It is a bit like a trapped animal - something extremely dangerous...  Possibly something rabid, or something with lots of bugs all over it.

    We're still looking into it. The session ids might be a red herring, but it's definitely something to do with dynamically generated URLs (possibly URLs that store the previously viewed section, or some other kind of per-session information).

    Either way, from what I can see, it's spidering the same pages over and over and over again, without recognising that they are the same page (albeit with different internal links/urls).
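
    To illustrate the kind of check a crawler could make here, below is a rough Python sketch that normalises URLs before deciding whether a page has already been visited. This is only an assumption about how a well-behaved spider might do it, not how any real spider works; the parameter names in `SESSION_PARAMS` are hypothetical examples of per-session values.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

# Hypothetical names of query parameters that carry per-session state.
SESSION_PARAMS = {"sid", "sessionid", "phpsessid", "jsessionid", "prev"}


def canonicalise(url):
    """Reduce a URL to a form that ignores per-session parameters."""
    parts = urlsplit(url)
    kept = sorted(
        (key, value)
        for key, value in parse_qsl(parts.query, keep_blank_values=True)
        if key.lower() not in SESSION_PARAMS
    )
    # Lower-case the host, sort the remaining parameters, drop the fragment.
    return urlunsplit(
        (parts.scheme, parts.netloc.lower(), parts.path, urlencode(kept), "")
    )


def unseen_pages(urls):
    """Return each distinct page once, in first-seen order."""
    seen = set()
    fresh = []
    for url in urls:
        key = canonicalise(url)
        if key not in seen:
            seen.add(key)
            fresh.append(key)
    return fresh
```

    A spider that keyed its visited list on `canonicalise(url)` would treat `/a?sid=1` and `/a?sid=2` as one page rather than two, which is exactly the recognition that seems to be missing.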

    I'll keep you posted.

  4. Ashley Friedlein Diamond

    CEO at Econsultancy

    01 December 2004 10:45am

    Mmm... I'm intrigued by the 'denial of service' angle. Could it be that there will be court cases coming against search engines sometime soon for excessive site crawling? Loss of revenue due to slowed sites, for example?

    I guess for the time being no-one would make such complaints because they're keen to get the SEO rankings and so want to be indexed. However, I certainly know that the search spiders (especially Google and MSN) completely knacker our site when they visit - it's a bit of a love/hate relationship...

    Ashley

  5. Anonymous

    Fndr at Majestic12.co.uk

    02 December 2004 18:46pm

    > If all this traffic had a (considerably) negative impact on your
    > server's performance then could it be classified as a denial
    > of service attack?

    Unlikely - a DoS attack implies intent to take the server down, and in this case that intent is not present. That does not excuse Microsoft from failing to design their crawler to avoid loops like that.

    regards

    Alex

  6. Deri Jones Gold

    Director at SciVisum.co.uk

    03 December 2004 18:21pm

    Well, I hate to say 'told you so', but on Nov 11th I did...see the Times newspaper:
    http://business.timesonline.co.uk/article/0,,9075-1354322,00.html

    But hey, don't you just love Microsoft!

    Nagging thought though Dan - you're *sure* it's an MS spider....

    Deri
    SciVisum.co.uk
    Web application testing specialists

    On 12:05:01 30 November 2004 Dan Zambonini wrote:
    >Just a quick note of caution - if anyone’s noticing
    >a huge increase in their bandwidth over the last month,
    >it’s probably due to the new Microsoft search
    >spider.
    >
    >We’ve had a 10x increase in bandwidth since Oct 10th
    >(and the associated cost that entails).

  7. Dan Zambonini Bronze

    Technical Director at Box UK

    03 December 2004 18:32pm

    No, that's a good point. It's using the "MS Search 4.0 Robot" user agent, but I'll check the IP address range - could well be a spoofer. I'll let you know - good thinking.
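
    A quick sketch of the kind of check meant here, in Python: compare the requesting IP against the address ranges the genuine crawler is known to use. The ranges below are placeholder documentation blocks, not real Microsoft ranges, and `looks_spoofed` is a hypothetical helper, so substitute whatever ranges you have actually verified.

```python
import ipaddress

# Placeholder ranges (reserved documentation blocks), NOT real crawler
# ranges. Replace them with ranges you have verified yourself.
KNOWN_CRAWLER_RANGES = [
    ipaddress.ip_network("192.0.2.0/24"),
    ipaddress.ip_network("198.51.100.0/24"),
]


def ip_in_known_ranges(ip, ranges=KNOWN_CRAWLER_RANGES):
    """Return True if the address falls inside one of the trusted ranges."""
    address = ipaddress.ip_address(ip)
    return any(address in network for network in ranges)


def looks_spoofed(user_agent, ip, claimed_fragment="MS Search"):
    """Flag requests that claim the crawler's user agent from an unknown IP."""
    claims_crawler = claimed_fragment.lower() in user_agent.lower()
    return claims_crawler and not ip_in_known_ranges(ip)
```

    Running the log entries through `looks_spoofed(user_agent, ip)` would separate requests that merely borrow the user agent string from those that also come from a trusted range.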

    Dan

  8. Anonymous

    Fndr at Majestic12.co.uk

    03 December 2004 18:43pm

    The Times has a cheek to compare search engines while at the same time banning all robots except Google from their own site! Here is the so-called robots.txt, which defines which robots (those that support the standard) cannot access the site - http://business.timesonline.co.uk/robots.txt
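
    For those who have not seen one, here is a small Python sketch of how a robots.txt that admits only one crawler is interpreted, using the standard library's `urllib.robotparser`. The rules below are an invented example of that pattern, not a copy of the Times file, and `OtherBot` is a made-up robot name.

```python
from urllib.robotparser import RobotFileParser

# Invented example: allow Googlebot everywhere, disallow everyone else.
EXAMPLE_ROBOTS_TXT = """\
User-agent: Googlebot
Disallow:

User-agent: *
Disallow: /
"""


def allowed(robots_txt, user_agent, url):
    """Return True if the given robot may fetch the URL under these rules."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)
```

    Under these rules `allowed(EXAMPLE_ROBOTS_TXT, "Googlebot", url)` is true for any URL on the site, while any other robot that honours the standard is turned away.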

  9. Deri Jones Gold

    Director at SciVisum.co.uk

    06 December 2004 11:27am

    The IP address is vital - can you let us know ASAP?

    See the discussion here from a year ago, of some dodgy guys using the MS search bot User Agent on their bot:
    http://www.webmasterworld.com/forum97/34.htm

    Deri

  10. Dan Zambonini Bronze

    Technical Director at Box UK

    06 December 2004 13:30pm

    Doesn't look like MS:

    194.6.120.101
