Wednesday, April 15, 2009
In a previous post I mentioned that I had no idea how invalid URLs, for which no link exists on the site (nor in the sitemap), come to be tried by legitimate bots such as GoogleBot.
Now I have a partial answer for the non-existing URLs presented in that post. Some time ago, a Twitter account for Colnect editors was opened: @ColnectEdits. It automatically tweets about edits made to Colnect's catalogs so that other collectors may track them.
An interesting thing you can see in the attached picture is that the links generated by the tweets are displayed as http://colnect.com/en/phone... but actually link to the correct full URLs, such as http://colnect.com/en/phonecards/item/id/9212. So it seems the web crawlers read both forms as legitimate URLs and try to fetch them. Since GoogleBot apparently does not want to learn that /en/phone returns a 404 from Colnect, I am now forced to add these as legitimate URLs to my site to avoid seeing more 404s in my logs. Oh well...
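One way to make such truncated URLs "legitimate" is a server-side redirect. Here is a minimal sketch with Apache's mod_rewrite; this is only an illustration, not Colnect's actual configuration, and the /phonecards target is a hypothetical mapping:

    # Sketch only: 301-redirect the truncated tweet URLs (/en/phone,
    # /es/phone, ...) to an existing section in the same language.
    # The /phonecards target is an assumed mapping, not the real one.
    RewriteEngine On
    RewriteRule ^/?([a-z]{2})/phone$ /$1/phonecards [R=301,L]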
Colnect, Connecting Collectors. Colnect offers revolutionary services to collectors the world over. Colnect is available in 63 languages and offers extensive collectible catalogs, the easiest personal collection management, and Auto-Matching for deals. Join us today :)
Sunday, April 12, 2009
When Web Crawlers Attack
Web crawlers, or search bots, are very popular beasts of the Internet. They automatically scan and index your site. The main advantage is that people may find your site through these indexes and visit it. The main disadvantages are that your content gets copied somewhere else (where you have no control over it) and that the bots consume your server resources and bandwidth.
On my site for collectors, I have created a pretty extensive robots.txt file to prevent the nicer bots from scanning parts of the site they shouldn't and to block the semi-nice ones. In addition, I added server rules to block some of the less-than-nice bots out there.
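The post doesn't include the actual file, but a minimal robots.txt along these lines illustrates both ideas; the paths and the bot name are made-up placeholders, not Colnect's real rules:

    # Keep well-behaved bots out of areas they shouldn't index
    # (these paths are hypothetical examples).
    User-agent: *
    Disallow: /login
    Disallow: /search

    # Shut a semi-nice bot out entirely by its User-agent token.
    User-agent: SomeSemiNiceBot
    Disallow: /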
The biggest problem left unanswered is what to do when the supposedly nice bots attack your site. The web's most popular bot is probably GoogleBot, created and operated by Google. Obviously, it brings traffic and is a good bot that should be allowed to scan the site. However, more and more frequently I see it looking for URLs that NEVER existed on the site. On top of that, since the site supports 35 languages, the bot even made up language-specific URLs. For some reason, it decided I should have a /en/phone page, so it also tries to fetch /es/phone, /de/phone and so on.
So why is that so annoying? Two main reasons:
1/ These requests appear in my logs. I check the logs for errors and end up wasting time on them.
2/ The bot is not giving up on these URLs although a proper 404 code is returned. It tries them over and over and over and over again.
Any suggestions? It seems to me that adding 35 new URLs to robots.txt each time GoogleBot makes up a URL isn't the easiest solution.
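One possible shortcut, assuming the made-up paths stay this regular: GoogleBot honors the * and $ wildcards in robots.txt (a Google extension, not part of the original robots exclusion standard), so a single rule could cover all 35 language variants:

    User-agent: Googlebot
    # One wildcard rule instead of 35 language-specific entries.
    Disallow: /*/phone$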
The problem is not unique to GoogleBot. I have completely blocked Alexa's ia_archiver, which was making up URLs like crazy.
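For reference, a block like that takes only a couple of Apache mod_rewrite lines. This is a generic sketch assuming an Apache server; the post doesn't show the actual rules used:

    # Refuse any request whose User-Agent header mentions ia_archiver.
    RewriteEngine On
    RewriteCond %{HTTP_USER_AGENT} ia_archiver [NC]
    RewriteRule .* - [F,L]   # [F] responds with 403 Forbidden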
Are there any reasons for inventing NEVER-existing URLs? Probably broken HTML files or invalid links from somewhere. Sometimes, wrong interpretation of JavaScript code seems to be the reason (and do they really HAVE TO follow every nofollow link as well???).
2009/04/15 - Read the update
Labels: googlebot, ia_archiver, robots.txt, web crawlers
Did you like reading it? Stay in the loop via RSS. Thanks :)