Saturday, December 5, 2009

New SPAM technique? "warning_this_is_english_domain_to_solve_this_problem_submit_site_in_atoall.com.html"

It's not uncommon to see weird requests coming to my server at Colnect, but I found this one interesting since it came from Googlebot, the bot Google uses to index the web for its search engine.

The request made by the bot was for the URL:
/warning_this_is_english_domain_to_solve_this_problem_submit_site_in_atoall.com.html

Needless to say, this URL never existed on my domain. The actual page at atoall . com, titled "Hot girls pictures free games boys images local news all", made me suspect spamming.

Searching Google for warning_this_is_english_domain_to_solve_this_problem_submit_site_in_atoall.com.html currently returns 106,000 results, which means that Google has indexed that many pages which don't really exist on other domains. Some very well-known domains have this page URL indexed on Google.


How does it happen?



Well, some sites are configured to never return a proper 404 code, which would let bots and people know the page was not found on their server. They prefer returning a 200 code, which tells bots and browsers the page was found. The page's content, displayed to the user, still says that what the user was looking for was not found. Most users would never know the difference between getting a 404 and a 200 code.
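To make the difference concrete, here is a minimal sketch using Python's standard http.server module. The page list, the message text and the names choose_response and soft_404 are made up for illustration; this is not how any particular site is configured. The visitor sees the same "not found" text in both modes, and only the status code changes.

```python
from http.server import BaseHTTPRequestHandler

# Hypothetical pages that really exist on the site (illustration only).
KNOWN_PAGES = {"/": "Welcome", "/about.html": "About us"}

NOT_FOUND_BODY = "Sorry, the page you were looking for was not found."


def choose_response(path, soft_404=False):
    """Return (status_code, body) for a request path."""
    if path in KNOWN_PAGES:
        return 200, KNOWN_PAGES[path]
    # Same text for the visitor either way; only the status code differs.
    # soft_404=True imitates a site that answers 200 for missing pages.
    return (200 if soft_404 else 404), NOT_FOUND_BODY


class Handler(BaseHTTPRequestHandler):
    soft_404 = False  # set to True to imitate the misconfigured sites

    def do_GET(self):
        status, body = choose_response(self.path, self.soft_404)
        data = body.encode("utf-8")
        self.send_response(status)
        self.send_header("Content-Type", "text/plain; charset=utf-8")
        self.send_header("Content-Length", str(len(data)))
        self.end_headers()
        self.wfile.write(data)
```

To try it locally, something like http.server.HTTPServer(("127.0.0.1", 8000), Handler).serve_forever() should serve it; a request for any unknown path then gets a real 404 unless soft_404 is switched on.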

So why do they generate a 200 code?



Well, it makes search bots, like Google's, index a page whose content matches what a user searched for. The next time someone searches for the same term on a search engine, there is a chance they'll land on that page. Also, since some browser plug-ins can "steal" 404 pages by replacing them with their own custom results, returning a 200 code prevents that.

Why shouldn't they generate a 200 code?



The downside of returning such pages is the obvious spamming by sites such as atoall . com and others which seek illegitimate sources of traffic. According to Alexa, the site has been gaining traffic since August, and it wouldn't come as a surprise if this unique form of spamming Google's search engine has a lot to do with it.

Another issue is that the search engine may choose to penalize sites which return the wrong status code. The search engine can easily find out whether that is the case by requesting randomly generated page URLs.
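The random-URL check is easy to run against your own site as well. Below is a rough sketch using only Python's standard library. The function names and the ".html" suffix are my own choices for illustration, and nothing here is meant to describe how Google actually does it.

```python
import secrets
import urllib.error
import urllib.request


def random_probe_path(length=16):
    """Build a path that almost certainly does not exist on any site."""
    return "/" + secrets.token_hex(length) + ".html"


def fetch_status(url):
    """Return the final HTTP status code for url (urllib follows redirects)."""
    try:
        with urllib.request.urlopen(url, timeout=10) as response:
            return response.status
    except urllib.error.HTTPError as err:
        # urllib raises for 4xx/5xx responses; the code is on the exception.
        return err.code


def returns_soft_404(base_url, get_status=fetch_status):
    """True if a made-up URL on base_url is answered with 200 instead of 404."""
    probe = base_url.rstrip("/") + random_probe_path()
    return get_status(probe) == 200
```

Calling returns_soft_404 with your own site's address should give False on a correctly configured server. Note that a site which redirects unknown URLs to its home page will also be flagged, since the final response is a 200.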

So now my only question is: why hasn't Google already penalized atoall . com and removed it from its search results?

5 comments:

  1. Did you fill out a spam report and tell Google about it?

  2. Funny, I saw this in Webmaster Tools today and thought... hey, investigate a little. I ended up here. It turned out I did exactly what you did to write this post. Nice read, as it is interesting to see what happened in the 7 days since you found this spammy request in your logs. :)

    Redo your investigation. It's funny. This stuff seems to work. :)

  3. Anonymous - this post is my report.

    Jonas - Yes, it works. Now there are 124,000 indexed pages, up 18,000 in a week.

  4. Hi, a few questions which will help your readers (if answered!)

    What did you do to stop the problem?

    Do you think that by returning a 404 error we are safe from this bot tactic?

    Do you know the bot name so we can block it?

  5. Hi Barrie,

    I have notified Google's own Matt Cutts about it, but he didn't think it was a serious enough issue to deal with.

    Pages returning 404s are not indexed by bots.

    There's no bot to block here. They're simply putting a page somewhere with these links to the sites. Google's bot, as well as others, will go visit all these links and if they don't return 404, they'll be indexed.

