Colnect, Connecting Collectors: googlebot


Wednesday, January 20, 2010

29 Million Colnect Pages Indexed on Google?

Colnect's rapid growth is reflected in its presence on the world's most popular search engine: Google now reports an astounding 29 million indexed Colnect pages! The figure is all the more striking because only 2 months ago Colnect's indexed page count was about 10% of the current tally.

However, these numbers are quite dubious: the same search on Google's Canadian version yields only 1.1 million results, while other versions report page counts in the range of 3-5 million. This raises the question of how exactly Google calculates the number of indexed pages for a particular site, and whether that number has any reliability or basis in reality. Although Colnect has added many new languages and categories over the past couple of months, which would explain some increase in pages, a 10-fold increase over this timeframe seems entirely unrealistic. Only time will tell whether this enormous figure will hold or fall back to a total more in line with past values.

Saturday, December 5, 2009

New SPAM technique? "warning_this_is_english_domain_to_solve_this_problem_submit_site_in_atoall.com.html"

It's not uncommon to see weird requests coming to my server at Colnect but I found this one interesting since it came from GoogleBot, the bot used by Google to index the web for its search engine.

The request made by the bot was for the URL:
/warning_this_is_english_domain_to_solve_this_problem_submit_site_in_atoall.com.html

Needless to say, this URL never existed on my domain. Seeing the actual page of atoall . com, having the title "Hot girls pictures free games boys images local news all", made me suspect spamming.

Searching Google for warning_this_is_english_domain_to_solve_this_problem_submit_site_in_atoall.com.html currently returns 106,000 results, which means that Google has indexed that many pages that don't really exist on other domains. Some very well-known domains have this page URL indexed on Google.

How does it happen?

Well, some sites are configured to never return a proper 404 code to let bots and people know the page is not found on their server. They prefer returning a 200 code that tells bots and browsers the page is found. The page's content, displayed to the user, indicates that what the user was looking for was never found. Most users would never know the difference between getting a 404 or 200 code.

So why do they generate a 200 code?

Well, it makes search bots, like Google's, index a page whose content matches what a user searched for. The next time someone searches for the same term on a search engine, there is a chance that they'll land on that page. Also, since some browser plug-ins can "steal" 404 pages by replacing them with their own custom results, returning a 200 code prevents that.

Why shouldn't they generate a 200 code?

The downside of returning such pages is the obvious spamming by sites such as atoall . com and others which seek illegitimate sources of traffic. According to Alexa, the site has been gaining traffic since August, and it wouldn't come as a surprise if this unique form of spamming Google's search engine has a lot to do with it.

Another issue is that the search engine may choose to penalize sites which return the wrong results. The search engine can easily know if that is the case by requesting randomly generated page URLs.
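The random-URL check mentioned above can be sketched in a few lines of Python. This is only an illustration of the idea, not how any search engine actually does it: `fetch_status` is a hypothetical callable that maps a path to the HTTP status code a site returns, and the two "sites" below are stand-ins so the sketch runs without network access.

```python
import secrets
from typing import Callable


def random_probe_path(nbytes: int = 16) -> str:
    """Build a path that almost certainly does not exist on any site."""
    return f"/{secrets.token_hex(nbytes)}.html"


def returns_soft_404(fetch_status: Callable[[str], int], attempts: int = 3) -> bool:
    """Return True if every random, nonexistent path is answered with 200.

    fetch_status is a hypothetical callable: path -> HTTP status code.
    """
    return all(fetch_status(random_probe_path()) == 200 for _ in range(attempts))


# Stand-in sites for illustration only (assumptions, not real servers).
def honest_site(path: str) -> int:
    # Only the home page exists; everything else is a proper 404.
    return 200 if path == "/" else 404


def soft_404_site(path: str) -> int:
    # Answers 200 for everything, even made-up pages.
    return 200


print(returns_soft_404(honest_site))    # False
print(returns_soft_404(soft_404_site))  # True
```

A site that passes this check for several random paths is very likely answering 200 for pages that do not exist.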

So now my only question is: how come Google hasn't already penalized atoall . com and removed it from its search results?

Wednesday, April 15, 2009

Invalid URL Requests From Legitimate Bots

In a former post I mentioned that I had no idea why legitimate bots such as GoogleBot request invalid URLs for which no link on the site (or in the sitemap) exists.

Now I have a partial answer for the nonexistent URLs presented in that post. Some time ago, a Twitter account for Colnect editors was opened: @ColnectEdits. It automatically tweets about edits made to Colnect's catalogs so that other collectors can track them.

An interesting thing that you can see in the attached picture is that the links generated by the tweets are shown as http://colnect.com/en/phone... but actually do link to the correct full URLs, such as http://colnect.com/en/phonecards/item/id/9212. So it seems that the web crawlers read both as legitimate URLs and try to fetch them. Since it seems GoogleBot does not want to learn that /en/phone returns 404 from Colnect, I am now forced to add these as legitimate URLs to my site to avoid seeing more 404s in my logs. Oh well...
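To make the suspected mechanism concrete, here is a small Python sketch. It assumes (and this is only my guess at crawler behavior) that a crawler treats the visible, truncated link text as a URL and drops the trailing dots; `path_from_display_text` is a hypothetical helper name.

```python
from urllib.parse import urlsplit


def path_from_display_text(display_text: str) -> str:
    """Path a crawler might request if it reads truncated link text as a URL.

    Assumption: the trailing dots of the truncation are simply dropped.
    """
    return urlsplit(display_text.rstrip(".")).path


shown = "http://colnect.com/en/phone..."
actual = "http://colnect.com/en/phonecards/item/id/9212"

print(path_from_display_text(shown))  # /en/phone
print(urlsplit(actual).path)          # /en/phonecards/item/id/9212
```

The first path is exactly the kind of URL that keeps showing up as a 404 in the logs.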

Sunday, April 12, 2009

When Web Crawlers Attack

Web crawlers, or search bots, are very popular beasts of the Internet. They allow your site to be automatically scanned and indexed. The main advantage is that people may find your site through these indexes and visit it. The main disadvantages are that your content is copied somewhere else (where you have no control over it) and that the bots take up your server resources and bandwidth.

On my site for collectors, I have created a pretty extensive robots.txt file to keep the nicer bots away from parts of the site they shouldn't scan and to block semi-nice bots. In addition, I added server rules to block some of the less-than-nice bots out there.

The biggest problem left unanswered is what to do when the supposedly nice bots attack your site. The web's most popular bot is probably GoogleBot, created and operated by Google. Obviously, it brings traffic and is a good bot that should be allowed to scan the site. However, more and more frequently I see the bot looking for URLs that NEVER existed on the site. On top of that, since the site supports 35 languages, the bot even makes up language-specific URLs. For some reason, it decided I should have a /en/phone page, and so it also tries to fetch /es/phone, /de/phone and so on.

So why is that so annoying? Two main reasons:

1/ It appears in my logs. I check these for errors and end up spending time on it.
2/ The bot is not giving up on these URLs although a proper 404 code is returned. It tries them over and over and over and over again.
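As a rough sketch of how these repeated requests could be tallied instead of read by eye, the Python below counts 404 responses per path for one bot. It assumes the server writes Apache-style "combined" log lines; the sample lines, the IP address and the helper names are all made up for illustration.

```python
import re
from collections import Counter

# Matches the request, status, size, referrer and user-agent fields of an
# Apache "combined" log line (assumed format; adjust for other servers).
LOG_LINE = re.compile(
    r'"(?:GET|HEAD) (?P<path>\S+) [^"]*" (?P<status>\d{3}) \S+ '
    r'"[^"]*" "(?P<agent>[^"]*)"'
)


def count_bot_404s(lines, bot_token="Googlebot"):
    """Count 404 responses per path for requests whose agent names the bot."""
    counts = Counter()
    for line in lines:
        m = LOG_LINE.search(line)
        if m and m["status"] == "404" and bot_token in m["agent"]:
            counts[m["path"]] += 1
    return counts


GOOGLEBOT_UA = "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"


def sample_line(path, status, agent):
    """Build one invented log line for the demo."""
    return (f'192.0.2.1 - - [12/Apr/2009:10:00:00 +0000] '
            f'"GET {path} HTTP/1.1" {status} 512 "-" "{agent}"')


SAMPLE_LOG = [
    sample_line("/en/phone", 404, GOOGLEBOT_UA),
    sample_line("/es/phone", 404, GOOGLEBOT_UA),
    sample_line("/en/phone", 404, GOOGLEBOT_UA),
    sample_line("/en/phonecards/item/id/9212", 200, GOOGLEBOT_UA),
    sample_line("/missing", 404, "Mozilla/5.0 (X11; Linux x86_64)"),
]

print(count_bot_404s(SAMPLE_LOG).most_common())
# [('/en/phone', 2), ('/es/phone', 1)]
```

Note that matching on the user-agent string alone does not prove a request really came from Google, since any client can claim to be GoogleBot.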

Any suggestions? Seems to me that modifying robots.txt with 35 new URLs each time GoogleBot makes up a URL isn't the easiest solution.
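To show what the robots.txt route would involve, here is a minimal Python sketch that generates one Disallow line per language for a made-up path and checks the result with the standard library's `urllib.robotparser`. Only three of the 35 language codes are listed, and `/en/stamps` is a hypothetical path used purely for comparison.

```python
from urllib.robotparser import RobotFileParser


def disallow_rules(languages, made_up_path):
    """Build robots.txt lines blocking one made-up path in every language."""
    lines = ["User-agent: *"]
    lines += [f"Disallow: /{lang}/{made_up_path}" for lang in languages]
    return lines


LANGUAGES = ["en", "es", "de"]  # only 3 of the site's 35 languages

rules = disallow_rules(LANGUAGES, "phone")
print("\n".join(rules))

parser = RobotFileParser()
parser.parse(rules)

# The made-up URL is blocked, as intended.
print(parser.can_fetch("Googlebot", "/es/phone"))                    # False
# But so is a real page, because the rule is matched as a prefix.
print(parser.can_fetch("Googlebot", "/en/phonecards/item/id/9212"))  # False
# An unrelated (hypothetical) path is still allowed.
print(parser.can_fetch("Googlebot", "/en/stamps"))                   # True
```

The second check is the catch: a plain prefix rule for /en/phone also shuts out /en/phonecards. Google documents a `$` end-of-URL anchor for such cases, but Python's parser does not implement it, so this sketch cannot test that variant.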

The problem is not unique to GoogleBot. I have completely blocked Alexa's ia_archiver which is making up URLs like crazy.

Are there any reasons for inventing NEVER-existing URLs? Probably broken HTML files or invalid links from somewhere. Sometimes, wrong interpretation of JavaScript code (do they really HAVE TO follow every nofollow link as well???) seems to be the reason.

2009/04/15 - Read the update