Google crawling pages disallowed by robots.txt

posted by Brian

A few weeks ago, we were completing the development of a tracking application for a website.  Basically, this tracking application exists on dynamically generated pages of the website that have the following structure:

www.mydomain.com/products/track/

It’s just a little tool that logs the user’s IP address and the item they clicked on, and then automatically redirects them to a vendor that sells that product.  The user never even knows they’ve visited the page.  To them, it’s a seamless transition from the item they clicked on to the vendor’s website.  It’s there to help us track user behavior, learn how to make the website better, and keep the vendors who are paying us for those referrals honest.
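To make the flow concrete, here is a minimal sketch of what a tracking page like this might do.  It is not the actual application: the function name, the vendor lookup table, the in-memory log, and the choice of a 302 redirect are all assumptions made for illustration.

```python
# Hypothetical sketch of a click-tracking redirect; all names are illustrative.

def handle_click(item_id, client_ip, vendor_urls, click_log):
    """Log the click, then return an HTTP status and headers that redirect to the vendor."""
    # Record who clicked on what (the real tool presumably writes to a database).
    click_log.append({"ip": client_ip, "item": item_id})

    target = vendor_urls.get(item_id)
    if target is None:
        # Unknown item: there is nowhere to send the visitor.
        return "404 Not Found", []

    # Assumption: a temporary redirect; the post does not say which status code is used.
    return "302 Found", [("Location", target)]


# Made-up vendor table and visitor address.
vendors = {"chair-42": "http://vendor.example/chair-42"}
log = []
status, headers = handle_click("chair-42", "203.0.113.7", vendors, log)
print(status, headers)
```

Because the page answers with nothing but a redirect, a visitor (or a crawler) that follows it ends up looking at the vendor’s landing page.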

Now, obviously there’s no reason for a search engine to index these pages.  There’s no useful content there at all.  So, we made use of robots.txt to tell the search engines that there is no reason to look at those pages.

What is robots.txt?

Robots.txt is simply a file that can be placed at the root of a website to notify automated “crawlers” that there are certain parts of the site that should not be visited.  It makes use of the robots exclusion protocol.  When an automated crawler (such as Googlebot) visits a website, it looks at the robots.txt file to see if there are any pages it should not visit.  Not all automated crawlers pay attention to robots.txt, but the major search engines claim that they do.
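Here is a small sketch of how a well-behaved crawler consults robots.txt, using Python’s standard urllib.robotparser module.  The file contents are an assumption (the real file isn’t reproduced in this post), and the item URLs are made up.

```python
from urllib.robotparser import RobotFileParser

# Assumed robots.txt contents: block every crawler from the tracking folder.
ROBOTS_TXT = """\
User-agent: *
Disallow: /products/track/
"""


def is_allowed(robots_txt, user_agent, url):
    """Return True if a compliant crawler may fetch url under these rules."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


# Hypothetical URLs on the example domain.
blocked = is_allowed(ROBOTS_TXT, "Googlebot", "http://www.mydomain.com/products/track/123")
allowed = is_allowed(ROBOTS_TXT, "Googlebot", "http://www.mydomain.com/products/chairs")
print(blocked, allowed)  # False True
```

Note that the file only lists what is off limits.  Anything it doesn’t mention is allowed by default.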

Using robots.txt, we told the crawlers not to visit any pages in the “track” folder.  That worked out really well.  A few weeks later, we decided that if a user rolled over a link and saw the word “track”, they might get spooked and wouldn’t want to click on that particular link.  People don’t really like the idea of being tracked.  So, we decided to change the structure of the tracking application to the following:

 www.mydomain.com/products/buy

This naming convention seemed much more innocuous and was in line with what the user was trying to do.  We updated the robots.txt file to reflect the new structure and uploaded everything.

Here comes Googlebot!

Much to our surprise, a few hours later, we started to see a lot of clicks coming from the same IP address.  Thinking I had a rogue Chinese robot on my hands (that sounds silly but it has happened before), I looked up the IP address.  Lo and behold, it belonged to Google!

Throughout the day, I watched as Googlebot clicked on item after item with a frequency of roughly every 2 minutes.  I rechecked my robots.txt file.  It should have been blocking this activity.  I logged into my Google webmaster tools account and found the problem:

Google downloads robots.txt about once every 24 hours.

This particular website’s robots.txt file had been downloaded earlier in the morning, before we uploaded the changes.  The pages were new, but robots.txt is an exclusion protocol: since the new pages were not listed in the copy Google had cached, they were fair game.  A few hours later, Googlebot called in reinforcements.  The website was now getting hit by two different Google IP addresses with a frequency of roughly every hour.  Unfortunately, they didn’t bring their credit cards.  They kept going until about 1:00 am the next morning, when the new robots.txt file was finally downloaded and cached.  In total, Google crawled and indexed a little over 1000 pages of content that was blocked using robots.txt.
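The problem can be reproduced in a few lines.  This sketch checks the same new URL against the copy of robots.txt a crawler cached in the morning and against the updated copy.  Both file contents and the item number in the URL are assumptions for illustration.

```python
from urllib.robotparser import RobotFileParser


def may_crawl(robots_txt, url, agent="Googlebot"):
    """Return True if the given robots.txt text lets the agent fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(agent, url)


# Assumed contents: what the crawler cached in the morning...
CACHED_ROBOTS = "User-agent: *\nDisallow: /products/track/\n"
# ...and what was uploaded later, along with the renamed pages.
UPDATED_ROBOTS = "User-agent: *\nDisallow: /products/buy\n"

NEW_PAGE = "http://www.mydomain.com/products/buy/123"  # hypothetical item URL

stale_verdict = may_crawl(CACHED_ROBOTS, NEW_PAGE)
fresh_verdict = may_crawl(UPDATED_ROBOTS, NEW_PAGE)
print(stale_verdict, fresh_verdict)  # True False
```

Until the crawler refreshes its cached copy, the stale rules win and the new pages look perfectly crawlable.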

The funny thing is that these pages were actually indexed.  I searched and found them a week later.  They were all indexed with the content of the landing pages on the vendors’ sites.  So, we inadvertently pulled off a decent sized cloaking operation - something that is expressly against Google’s quality guidelines.  I sweated it for a while, but there don’t seem to be any negative effects on the site’s rankings.

So, the lesson is that if you’re going to upload pages that you don’t want a search engine to crawl, you should disallow those pages in the robots.txt file and make that file available at least 24 hours before you upload the actual files to the website.  If you have a Google webmaster tools account, it’d be a good idea to log in and see which version of the robots.txt file is in Google’s cache.
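As a rough deployment check, the waiting period can be written down as simple date arithmetic.  The 24-hour figure is just what I observed for this one site, so it’s a parameter here rather than a guarantee, and the function names and timestamps are made up for illustration.

```python
from datetime import datetime, timedelta

# Assumption: crawlers refresh their cached robots.txt about once every 24 hours.
ASSUMED_CACHE_LIFETIME = timedelta(hours=24)


def earliest_safe_upload(robots_published_at, cache_lifetime=ASSUMED_CACHE_LIFETIME):
    """Earliest time the new pages should go live after robots.txt is updated."""
    return robots_published_at + cache_lifetime


def is_safe_to_upload(robots_published_at, now, cache_lifetime=ASSUMED_CACHE_LIFETIME):
    """True once every cached copy of the old robots.txt should have expired."""
    return now >= earliest_safe_upload(robots_published_at, cache_lifetime)


# Hypothetical timestamps.
published = datetime(2024, 1, 1, 9, 0)
too_early = is_safe_to_upload(published, datetime(2024, 1, 1, 15, 0))
late_enough = is_safe_to_upload(published, datetime(2024, 1, 2, 9, 30))
print(earliest_safe_upload(published), too_early, late_enough)
```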

I thought the saga was over, but a few days later a few of the pages were crawled by Googlebot again.  In this case, it was only about 5 pages, so it may have been a small bug in the system, or perhaps even a Google employee hand checking things.  In any case, the pages are still in the index.

What to do with extra domain names

posted by Brian

If you’ve ever bought a domain name through Go-Daddy, you know how relentless they are at trying to upsell you on extra services and additional domain names.  In a lot of cases, it makes sense to purchase those extra domain names (the .org, .net, .info, etc) in order to protect your brand name.  After all, if you experience any kind of success, someone is likely going to snap them up later to leech off your success or to sell them to you at a marked up rate.

Once you’ve got them, what’s the best thing to do with them?  Most of the time, people leave them “parked”.  That basically means that they sit there with a bunch of ads.  If anyone happens to navigate to the parked domain name and click on an ad, the registrar generally gets commission on the click - sometimes they’ll share the revenue with the domain owner.

Having a lot of domains parked is not really very good for you.  Pages like that constitute what might be considered “web spam”, and having a lot of them registered to your name might give someone the impression that you’re a web spammer, which makes your legitimate domains suspect in that person’s eyes.  Why should you care?  Well, that “person” might be a Google spider, and when your sites are suspect, they generally don’t rank high.  Remember, Google is an official domain registrar, so they may have access to information that is not available to the general public.  It’s possible that they can even see through private registrations.

A better alternative is to redirect all the traffic from your additional domains to your main website.  You have to be careful how you do this, though.  There are two basic ways to forward a domain to another domain and there’s really no standard name for each method.  Many domain name registrars refer to them as “masked” forwarding and “unmasked” forwarding, so I’ll use those terms here.

What’s the difference? 

Let’s say I’ve got two domain names - funkychairs.com and funkychairs.org.  The website is built on funkychairs.com and I want to forward funkychairs.org to funkychairs.com.  If I used masked forwarding, then pointing my browser to funkychairs.org will show me the same site as funkychairs.com, but the location bar of my browser will still read “funkychairs.org”.  Got that?  If I’m on funkychairs.org/diningRoom, my browser’s location bar will say just that, but I’ll be seeing the same page as funkychairs.com/diningRoom.

 Okay?

Now, if I used unmasked forwarding, and I point my browser to funkychairs.org, it will automatically forward to funkychairs.com.  My browser’s location bar will always read funkychairs.com.  It’s as though funkychairs.org doesn’t exist.  This is definitely the way to go.

Why?

If you use masked forwarding, it looks like there are two or more different websites with the exact same content.  In other words, there’s a website called funkychairs.org and it’s exactly the same as another website called funkychairs.com.  A casual user or a search engine would have little idea they were actually the same website.  If you use unmasked forwarding, you’re telling users (including search engines) that all the content at funkychairs.org is located at funkychairs.com.  There is only one website containing the content.
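Under the hood, unmasked forwarding amounts to an HTTP redirect.  Here is a minimal sketch of the response a forwarding service might send, assuming a permanent (301) redirect that keeps the path.  Registrars differ in which status code they use, so check what yours actually sends.  The function name is made up for illustration.

```python
from urllib.parse import urlsplit, urlunsplit


def unmasked_forward(requested_url, canonical_host="funkychairs.com"):
    """Build a redirect to the same path on the main domain."""
    parts = urlsplit(requested_url)
    target = urlunsplit(
        (parts.scheme or "http", canonical_host, parts.path, parts.query, parts.fragment)
    )
    # Assumption: a permanent redirect, which tells browsers and crawlers
    # that the content now lives at the target address.
    return "301 Moved Permanently", [("Location", target)]


status, headers = unmasked_forward("http://funkychairs.org/diningRoom")
print(status, headers)
# 301 Moved Permanently [('Location', 'http://funkychairs.com/diningRoom')]
```

The browser follows the Location header, which is why the location bar ends up reading funkychairs.com.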

Search engines don’t like duplicate content.  It provides a bad user experience.  Imagine if you did a search for something and the first 10 results were all the exact same thing on 10 different websites.  If you didn’t like the first result you got, you surely wouldn’t like the next nine.  So, the search engines tend to filter out duplicate content, such that only one of the duplicate sites will appear in the search engine results.

Still no problem, right?  As long as one of your duplicate sites is ranking well, you won’t be complaining.  That’s true, but another problem is link cannibalization.  My earlier post about keyword cannibalization discussed two or more pages of a given web site competing against each other for the same keywords.  The same theory applies here.  Let’s say 10 people visit funkychairs.com.  They love the content there, so they link to it.  10 more people go to funkychairs.org.  They love the content there, so they link to it. 

Sounds good so far, what’s the problem?

Well, search engines generally look at links from other websites as “votes” for the website that they are linking to.  Now, you’ve got two sites with 10 votes each.  Since they have duplicate content, one will be filtered out.  So, now you’ve got one website with 10 votes competing against your competition, which may have 15 votes.  If you use unmasked forwarding, you’ll have one site with 20 votes instead.  All other things being equal, this would be the difference between your competition outranking you and you outranking your competition.
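The arithmetic is simple enough to write down.  This is a toy model only: it treats every link as one equal vote and assumes the surviving duplicate is the one with the most links, neither of which is how a real search engine scores anything.

```python
def votes_that_count(links_by_domain, forwarding):
    """Toy model of how many inbound links end up counting for the site."""
    if forwarding == "unmasked":
        # Everything redirects to one site, so all the links are pooled.
        return sum(links_by_domain.values())
    if forwarding == "masked":
        # Duplicates are filtered out; only one copy survives, with its own links.
        return max(links_by_domain.values())
    raise ValueError("forwarding must be 'masked' or 'unmasked'")


links = {"funkychairs.com": 10, "funkychairs.org": 10}
COMPETITOR_VOTES = 15  # figure from the example above

masked_votes = votes_that_count(links, "masked")
unmasked_votes = votes_that_count(links, "unmasked")
print(masked_votes, unmasked_votes)  # 10 20
print(masked_votes > COMPETITOR_VOTES, unmasked_votes > COMPETITOR_VOTES)  # False True
```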

So, next time you buy a lot of domains, make sure you do the right thing and use unmasked forwarding.  It will save you a whole lot of trouble in the future.