Wednesday, April 15, 2009

Invalid URL Requests From Legitimate Bots

In a previous post I mentioned that I had no idea why legitimate bots such as GoogleBot request invalid URLs that are not linked anywhere on the site or in the sitemap.

Now I have a partial answer for the non-existent URLs presented in that post. Some time ago, a Twitter account for Colnect editors was opened: @ColnectEdits. It automatically tweets about edits made to Colnect's catalogs so that other collectors can track them.



An interesting thing you can see in the attached picture is that the links generated by the tweets are displayed as http://colnect.com/en/phone... but actually link to the correct full URLs, such as http://colnect.com/en/phonecards/item/id/9212. So it seems that web crawlers read both the displayed text and the link target as legitimate URLs and try to fetch them. Since GoogleBot does not seem to want to learn that /en/phone returns 404 on Colnect, I am now forced to add these as legitimate URLs on my site to avoid seeing more 404s in my logs. Oh well...
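
One possible way to turn such truncated paths into legitimate URLs is to redirect them to the section they were cut from. The sketch below is only an illustration: the language set, the section list and the `resolve_truncated` helper are hypothetical and not Colnect's actual code.

```python
from typing import Optional

# Hypothetical values for illustration; the real site has its own lists.
LANGUAGES = {"en", "es", "de"}
SECTIONS = ["phonecards", "stamps", "coins"]


def resolve_truncated(path: str) -> Optional[str]:
    """Return a redirect target for a truncated section URL, or None."""
    parts = path.strip("/").split("/")
    if len(parts) != 2:
        return None
    lang, fragment = parts
    if lang not in LANGUAGES or not fragment:
        return None
    # Only an unambiguous, strictly shorter prefix counts as a truncation.
    matches = [s for s in SECTIONS if s.startswith(fragment) and s != fragment]
    if len(matches) == 1:
        return f"/{lang}/{matches[0]}"
    return None
```

With these assumed lists, `resolve_truncated("/en/phone")` points to `/en/phonecards`, while full or unknown paths return `None` and can keep producing a 404.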

Phone cards catalog: biggest, most extensive, free

Happy to announce that Colnect's phone cards catalog, the world's most extensive phone card catalog, now has over 150,000 phone cards listed in it.

Colnect's catalog is an endeavor of many collectors from around the world who constantly improve it.

Using Colnect's catalog, collectors can easily manage their personal collections on Colnect and find swap buddies from around the world.

Special thanks go to all the contributors, editors and translators of Colnect.

Happy collecting :)

Monday, April 13, 2009

PayPal + Unicode ==> No Payment

So you got your PayPal merchant account for your awesome website and created a nice button to let members get the amazing premium paid services you've made for them. You create the button code using the wizard supplied on PayPal's site to ensure nothing goes wrong. Oh, your site is multilingual? Yes, so please create another button for every language. No, we cover only some of the languages on your site. PayPal doesn't have enough resources to translate itself into all popular languages. It's probably not making as much money as Colnect, which can afford to be translated into 35 languages.

So the button is on the site and you test it. It works. Hurray! That wasn't too hard. But hey, are you going to test each option on the button in each language? Yes, you should but it seems fine and PayPal is a serious website. Right? WRONG!

A member who tries to pay money is faced with this beautiful message: "PayPal cannot process this transaction because of a problem with the seller's website. Please contact the seller directly to resolve this problem."



You might expect PayPal to alert you when such an event happens, since it is obviously your fault, but it never does. You may keep wondering how much business you've lost due to this fuck-up. Well, you made the mistake so you suffer the consequences. Right? WRONG!

The problem is that PayPal's server has some problem with Unicode encoding. You used the Euro sign and dared to send it to their server. Your site has a problem. You have a problem. Don't you know that Euro signs are bad? The wizard that generated your code thought of letting you know, but then decided you should learn it the hard way. The hard way is to go through technical support with a person who obviously doesn't know much about the relevant Internet technologies and tells you, again, that it's your fault. It's your page header, it's your CSS (WTF?!?!), it's your bad browser cookies.
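
A cheap guard against this kind of surprise is to scan the button's field values for non-ASCII characters before publishing it. This is a minimal sketch: the `find_non_ascii` helper and the field names are made up for illustration, and it assumes that any non-ASCII character is suspect, whereas the story above only establishes the Euro sign as a culprit.

```python
from typing import Dict, List


def find_non_ascii(fields: Dict[str, str]) -> Dict[str, List[str]]:
    """Map each field name to the non-ASCII characters found in its value."""
    problems: Dict[str, List[str]] = {}
    for name, value in fields.items():
        # Anything above code point 127 is outside plain ASCII.
        bad = sorted({ch for ch in value if ord(ch) > 127})
        if bad:
            problems[name] = bad
    return problems


# Hypothetical button fields; the real ones depend on the generated code.
button = {"option_label": "Premium 10 \u20ac", "currency": "EUR"}
suspects = find_non_ascii(button)
```

Here `suspects` flags only `option_label`, because of its Euro sign. Spelling out the currency code in the label instead of using the symbol sidesteps the issue.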

You finally create another button without the Euro sign and find out that it wasn't you after all. It was them. It is them. PayPal screwed it up. But it's your fault, you chose to use their services...



The author of this post is not affiliated with PayPal or any other similar service. The story is true. I keep being amazed at how unprofessional PayPal is. Your comments are welcome.

PayPal Opinion

The reason I'm not going to write "PayPal sucks" is probably that they seem to be somewhat better than the competition when it comes to receiving payments from around the world in a secure way. I do plan on trying MoneyBookers as well, and it seems that other competitors either charge hefty fees (WorldPay wants a 200 GBP set-up fee...) and/or are limited in the currencies and countries they support.

So here are some of the problems with PayPal for my website for collectors:

* Fees. Though almost everywhere on their site they publish the fees as up to 3.4%, a closer examination reveals 3.9% for "cross-border" transactions (I'm sure the guy who made that bs up got a great bonus afterwards) plus a good 2.5% spread on currency conversion. So we're getting to 6.3% WITHOUT mentioning the per-transaction fee and the withdrawal fee.

* Support. My worst support experiences ever. Customer support's first reply was always automated and only faintly related to the question. Subsequent replies were never helpful. Technical support lacked technical knowledge and misdirected me more than it helped.

* Site Usability. They could have done a much better job at that. Navigation is horrible and sessions often expire. Many times I got sporadic server errors.
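
One way to arrive at the 6.3% in the fees bullet is to apply the two deductions one after the other instead of simply adding them. That ordering is an assumption about how the figure was derived, and the sketch leaves out the fixed per-transaction fee and the withdrawal fee just like the text does.

```python
def effective_fee(rates):
    """Combined loss when each rate is taken from what the previous one left."""
    remaining = 1.0
    for rate in rates:
        remaining *= 1.0 - rate
    return 1.0 - remaining


# 3.9% cross-border fee, then a 2.5% currency conversion spread.
combined = effective_fee([0.039, 0.025])
combined_percent = round(combined * 100, 1)  # 6.3
```

Plain addition of the two rates would give 6.4%, so the difference between the two readings is small either way.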

For the finishing paragraph I'll write the good things: setup was relatively painless and PayPal is popular and thus consumers feel secure using it.

Sunday, April 12, 2009

When Web Crawlers Attack

Web crawlers, or search bots, are very popular beasts of the Internet. They allow your site to be automatically scanned and indexed. The main advantage is that people may find your site through these indexes and visit it. The main disadvantages are that your content is copied somewhere else (where you have no control over it) and that the bots take up your server resources and bandwidth.

On my site for collectors, I have created a pretty extensive robots.txt file to keep the nicer bots from scanning parts of the site they shouldn't and to block semi-nice bots. In addition, I added server rules to block some of the less-than-nice bots out there.

The biggest problem left unanswered is what to do when the supposedly nice bots attack your site. The web's most popular bot is probably GoogleBot, created and operated by Google. Obviously, it brings traffic and is a good bot that should be allowed to scan the site. However, more and more frequently I see the bot looking for URLs that NEVER existed on the site. On top of that, since the site supports 35 languages, the bot even made up language-specific URLs. For some reason, it decided I should have a /en/phone page and so it also tries to fetch /es/phone, /de/phone and so on.

So why is that so annoying? Two main reasons:

1/ It appears in my logs. I check these for errors and end up spending time on it.
2/ The bot is not giving up on these URLs although a proper 404 code is returned. It tries them over and over and over and over again.
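
To see which made-up URLs a bot keeps retrying, the 404s can be tallied straight from the access log. This is a rough sketch that assumes the common "combined" log format; the regular expression and the `count_bot_404s` helper are illustrative and would need adjusting for a different server configuration.

```python
import re
from collections import Counter

# Matches the request, status and user-agent parts of a combined-format line.
LOG_LINE = re.compile(
    r'"(?:GET|HEAD|POST) (?P<path>\S+) [^"]*" (?P<status>\d{3}) \S+ '
    r'"[^"]*" "(?P<agent>[^"]*)"'
)


def count_bot_404s(lines, bot="Googlebot"):
    """Count 404 responses per path for user agents containing `bot`."""
    counts = Counter()
    for line in lines:
        match = LOG_LINE.search(line)
        if not match:
            continue
        is_404 = match.group("status") == "404"
        is_bot = bot.lower() in match.group("agent").lower()
        if is_404 and is_bot:
            counts[match.group("path")] += 1
    return counts
```

Calling `count_bot_404s(open("access.log"))` (the file name is a placeholder) and looking at `most_common()` shows which paths are being hammered and how often.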

Any suggestions? Seems to me that modifying robots.txt with 35 new URLs each time GoogleBot makes up a URL isn't the easiest solution.
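
Generating the per-language robots.txt lines is easy enough to script, but plain `Disallow` rules match by prefix, so blocking /en/phone also blocks the real /en/phonecards pages. The sketch below shows this with Python's standard `urllib.robotparser`; the language list is a hypothetical subset of the 35 and `disallow_rules` is a made-up helper.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical subset of the site's 35 language codes.
LANGUAGES = ["en", "es", "de"]


def disallow_rules(fragment, languages=LANGUAGES):
    """Build robots.txt lines that disallow /<lang>/<fragment> for all bots."""
    lines = ["User-agent: *"]
    lines += [f"Disallow: /{lang}/{fragment}" for lang in languages]
    return lines


parser = RobotFileParser()
parser.parse(disallow_rules("phone"))

# The made-up URL is blocked, as intended...
blocked_bogus = not parser.can_fetch("Googlebot", "http://colnect.com/en/phone")
# ...but so is a real catalog page, because the rule is a prefix match.
blocked_real = not parser.can_fetch(
    "Googlebot", "http://colnect.com/en/phonecards/item/id/9212"
)
```

Both flags come out `True` here. Some crawlers, Google's among them, document a `$` end-of-path anchor that would avoid this, but Python's parser does not interpret it, so this check cannot confirm how a given bot would treat such a rule.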

The problem is not unique to GoogleBot. I have completely blocked Alexa's ia_archiver, which is making up URLs like crazy.

Are there any reasons for inventing NEVER-existing URLs? Probably broken HTML files or invalid links from somewhere. Sometimes, wrong interpretation of JavaScript code (do they really HAVE TO follow every nofollow link as well???) seems to be the reason.

2009/04/15 - Read the update

Tuesday, April 7, 2009

Colnect Rising on Compete


Though I post updates about trends in Colnect's site metrics, I'm not really sure what they mean, as they don't always coincide with my Analytics results. You're welcome to check Colnect's rankings on Compete. It has risen 34% in the last month. Pretty nice :)

Sunday, April 5, 2009

GMail turns 5 - still BETA??? Colnect will not follow.

Gmail's official blog announced that Gmail is celebrating its 5th birthday. 5 years is not a short amount of time. However, GMail is still in BETA. It seems that Google has changed the common meaning of "BETA" from "publicly available product about to go fully public when final fixes and additions are made" into "fully fledged public product that is expected to sometimes fail and we won't take responsibility for it when it does".
Google even created the 'beta' mark trend in logos of companies and services.

I personally find it ridiculous and unfair to the customers. Of course products sometimes fail, but we cannot abuse the term "BETA" for 5 (FIVE!!!) years.

Colnect has been marked as beta for less than 6 months, because it went public before all key features were ready and prior to proper testing. Raising a site from the grass roots up is not a simple task. However, as of today, since Colnect is relatively stable and many of its key features (a lot more is to come but I'll elaborate on that another time) are ready and publicly available, the BETA mark will be removed.
Yes, my system may sometimes fail. Yes, it's not as perfect as I'd like it to be. However, it's public, it's working, and it makes many of the people using it happy, so it's not a beta anymore.
