Sunday, April 12, 2009

When Web Crawlers Attack

Web crawlers, or search bots, are very popular beasts of the Internet. They allow your site to be automatically scanned and indexed. The main advantage is that people may find your site through these indexes and visit it. The main disadvantages are that your content is copied somewhere else (where you have no control over it) and that the bots take up your server resources and bandwidth.

On my site for collectors, I have created a pretty extensive robots.txt file to keep the nicer bots away from parts of the site they shouldn't scan and to block the semi-nice bots. In addition, I added server rules to block some of the less-than-nice bots out there.

The biggest problem left unanswered is what to do when the supposedly nice bots attack your site. The web's most popular bot is probably GoogleBot, created and operated by Google. Obviously, it brings traffic and is a good bot that should be allowed to scan the site. However, more and more frequently I see the bot requesting URLs that NEVER existed on the site. On top of that, since the site supports 35 languages, the bot even makes up language-specific URLs. For some reason, it decided I should have a /en/phone page, so it also tries to fetch /es/phone, /de/phone and so on.

So why is that so annoying? Two main reasons:

1/ It appears in my logs. I check these for errors and end up spending time on it.
2/ The bot does not give up on these URLs even though a proper 404 code is returned. It tries them over and over and over and over again.
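To cut down the time spent reading logs, the repeated 404s could be counted automatically. Below is a minimal sketch in Python, assuming the server writes access logs in the common "combined" format (request line, status code and user agent in quotes). The function name `repeated_404s` and the `agent_substring` parameter are made up for illustration and are not part of any existing tool.

```python
import re
from collections import Counter

# Assumed "combined" log format: ... "GET /path HTTP/1.1" 404 512 "referer" "user agent"
LINE_RE = re.compile(
    r'"(?:GET|HEAD) (?P<path>\S+)[^"]*" (?P<status>\d{3}) .*"(?P<agent>[^"]*)"$'
)

def repeated_404s(lines, agent_substring="Googlebot"):
    """Count 404 responses per requested path for user agents containing agent_substring."""
    counts = Counter()
    for line in lines:
        match = LINE_RE.search(line.strip())
        if not match:
            continue  # skip lines that do not look like combined-format entries
        if match.group("status") == "404" and agent_substring in match.group("agent"):
            counts[match.group("path")] += 1
    return counts
```

Calling `repeated_404s(open("access.log"))` (the file name is an assumption) and then `.most_common(20)` on the result would list the made-up URLs the bot keeps coming back to.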

Any suggestions? Seems to me that modifying robots.txt with 35 new URLs each time GoogleBot makes up a URL isn't the easiest solution.
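If robots.txt does end up being the answer, at least the 35 lines need not be typed by hand. Here is a rough sketch, assuming every language version lives under a `/<language code>/` prefix as in the examples above. The `LANGUAGES` list is a hypothetical three-code subset, not the site's real list, and `disallow_rules` is a name invented for this example.

```python
def disallow_rules(path, languages):
    """Return one robots.txt Disallow line per language prefix for the given path."""
    suffix = path.strip("/")
    return [f"Disallow: /{lang}/{suffix}" for lang in languages]

# Hypothetical subset of the 35 language codes the site supports.
LANGUAGES = ["en", "es", "de"]

if __name__ == "__main__":
    print("User-agent: Googlebot")
    for rule in disallow_rules("/phone", LANGUAGES):
        print(rule)
```

Google documents support for the `*` wildcard in robots.txt paths, so a single line such as `Disallow: /*/phone` might cover all languages for GoogleBot. Other bots may not understand wildcards, which is why the sketch spells out one line per language.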

The problem is not unique to GoogleBot. I have completely blocked Alexa's ia_archiver, which was making up URLs like crazy.

What could be the reasons for inventing URLs that NEVER existed? Probably broken HTML files or invalid links from somewhere. Sometimes, a wrong interpretation of JavaScript code (do they really HAVE TO follow every nofollow link as well???) seems to be the reason.

2009/04/15 - Read the update

Tuesday, April 7, 2009

Colnect Rising on Compete

Though I post updates about trends in Colnect's site metrics, I'm not really sure what they mean, as they don't always coincide with my Analytics results. You're welcome to check Colnect's rankings on Compete. Colnect has risen 34% in the last month. Pretty nice :)

Sunday, April 5, 2009

Gmail turns 5 - still BETA??? Colnect will not follow.

Gmail's official blog announced that Gmail is celebrating its 5th birthday. 5 years is not a short amount of time. However, Gmail is still in BETA. It seems that Google has changed the common meaning of "BETA" from "publicly available product about to go fully public when final fixes and additions are made" into "fully fledged public product that is expected to sometimes fail and we won't take responsibility for it when it does".
Google even started the trend of putting a 'beta' mark in the logos of companies and services.

I personally find it ridiculous and unfair to the customers. Of course products sometimes fail, but we cannot abuse the term "BETA" for 5 (FIVE!!!) years.

Colnect has been marked as beta for less than 6 months, since it went public before all key features were ready and prior to proper testing. Raising a site from the grass roots up is not a simple task. However, as of today, since Colnect is relatively stable and many of its key features (a lot more are to come, but I'll elaborate on that another time) are ready and publicly available, the BETA mark will be removed.
Yes, my system may sometimes fail. Yes, it's not as perfect as I'd like it to be. However, it's public, it's working and it makes many of the people using it happy, so it's not a beta anymore.
