Google crawling pages disallowed by robots.txt
posted by Brian
A few weeks ago, we were completing the development of a tracking application for a website. This tracking application exists on dynamically generated pages of the website that have the following structure:
www.mydomain.com/products/track/
It’s basically just a little tool that logs the user’s IP address and the item they clicked on, and then automatically redirects them to a vendor that sells that product. The user never even knows they’ve visited the page. To them, it’s a seamless transition from the item they clicked on to the vendor’s website. It’s there to help us track user behavior, learn how to make the website better, and keep the vendors who are paying us for those referrals honest.
Now, obviously there’s no reason for a search engine to index these pages. There’s no useful content there at all. So, we made use of robots.txt to tell the search engines that there is no reason to look at those pages.
What is robots.txt?
Robots.txt is simply a file that can be placed at the root of a website to notify automated “crawlers” that there are certain parts of the site that should not be visited. It makes use of the robots exclusion protocol. When an automated crawler (such as Googlebot) visits a website, it looks at the robots.txt file to see if there are any pages it should not visit. Not all automated crawlers pay attention to robots.txt, but the major search engines claim that they do.
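To make that concrete, here is a minimal sketch of how a well-behaved crawler might interpret a robots.txt rule, using Python's standard urllib.robotparser module. The rule text and the item name "some-item" are assumptions for illustration; they are not copied from the actual site.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt contents: keep every crawler out of the tracking folder.
ROBOTS_TXT = """\
User-agent: *
Disallow: /products/track/
"""


def is_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if the given robots.txt text lets user_agent fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    base = "http://www.mydomain.com"
    for path in ("/products/track/some-item", "/products/"):
        print(path, is_allowed(ROBOTS_TXT, "Googlebot", base + path))
```

Anything not matched by a Disallow line is allowed by default, which turns out to matter quite a bit later in this story.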
Using robots.txt, we told the crawlers not to visit any pages in the “track” folder. That worked out really well. A few weeks later, we decided that if a user rolled over a link and saw the word “track”, they might get spooked and wouldn’t want to click on that particular link. People don’t really like the idea of being tracked. So, we decided to change the structure of the tracking application to the following:
www.mydomain.com/products/buy
This naming convention seemed much more innocuous and was in line with what the user was trying to do. We updated the robots.txt file to reflect the new path and uploaded it.
Here comes Googlebot!
Much to our surprise, a few hours later, we started to see a lot of clicks coming from the same IP address. Thinking I had a rogue Chinese robot on my hands (that sounds silly but it has happened before), I looked up the IP address. Lo and behold, it belonged to Google!
Throughout the day, I watched as Googlebot clicked on item after item, roughly once every 2 minutes. I rechecked my robots.txt file. It should have been blocking this activity. I logged into my Google Webmaster Tools account and found the problem:
Google downloads robots.txt about once every 24 hours.
This particular website’s robots.txt file had been downloaded earlier in the morning, before we made our changes. Even though the new pages were disallowed in the updated file, the protocol is an exclusion protocol: anything not listed is allowed. Since the new pages were not listed in the copy Google had cached, they were fair game. A few hours later, Googlebot called in reinforcements. The website was now getting hit by two different Google IP addresses with a frequency of roughly every hour. Unfortunately, they didn’t bring their credit cards. They kept going until about 1:00 am the next morning when the new robots.txt file was finally downloaded and cached. In total, Google crawled and indexed a little over 1000 pages of content that was blocked using robots.txt.
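The timing problem is easier to see in a small simulation. This is only a sketch of the behavior I observed, not Google's actual implementation: the CachingCrawler class, the 24-hour refresh interval, and the item names are all assumptions, and time is just a number of seconds passed in by hand.

```python
from urllib.robotparser import RobotFileParser

CACHE_TTL_SECONDS = 24 * 60 * 60  # assumed refresh interval: about once a day

OLD_ROBOTS = "User-agent: *\nDisallow: /products/track/\n"
NEW_ROBOTS = "User-agent: *\nDisallow: /products/buy\n"


class CachingCrawler:
    """Hypothetical crawler that re-downloads robots.txt only after the TTL expires."""

    def __init__(self, fetch_robots, ttl=CACHE_TTL_SECONDS):
        self._fetch_robots = fetch_robots  # callable returning the live robots.txt text
        self._ttl = ttl
        self._parser = None
        self._fetched_at = None

    def may_crawl(self, path, now):
        # Refresh the cached rules only on first use or once they are a day old.
        if self._parser is None or now - self._fetched_at >= self._ttl:
            parser = RobotFileParser()
            parser.parse(self._fetch_robots().splitlines())
            self._parser = parser
            self._fetched_at = now
        return self._parser.can_fetch("Googlebot", path)


def simulate():
    site = {"robots": OLD_ROBOTS}
    crawler = CachingCrawler(lambda: site["robots"])
    hour = 60 * 60
    results = []
    # Morning: the crawler caches the old rules, which block /products/track/.
    results.append(crawler.may_crawl("/products/track/item", now=0))
    # Later that day: new robots.txt and new pages go live at the same time.
    site["robots"] = NEW_ROBOTS
    # The cached copy knows nothing about /products/buy, so crawling is allowed.
    results.append(crawler.may_crawl("/products/buy/item", now=3 * hour))
    # A day after the first download, the cache refreshes and the block applies.
    results.append(crawler.may_crawl("/products/buy/item", now=24 * hour))
    return results


if __name__ == "__main__":
    print(simulate())
```

The middle check is the window we fell into: the live file said "keep out", but the crawler was still working from the morning's copy.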
The funny thing is that these pages were actually indexed. I searched and found them a week later. They were all indexed with the content of the landing pages on the vendors’ sites. So, we inadvertently pulled off a decent-sized cloaking operation - something that is expressly against Google’s quality guidelines. I sweated it for a while, but there don’t seem to be any negative effects on the site’s rankings.
So, the lesson is that if you’re going to upload pages that you don’t want a search engine to crawl, you should disallow those pages in the robots.txt file and make that file available at least 24 hours before you upload the actual files to the website. If you have a Google Webmaster Tools account, it’d be a good idea to log in and see which version of the robots.txt file is in Google’s cache.
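If you wanted to bake that lesson into a deployment checklist, a rough sketch might look like the following. The safe_to_publish function and its parameters are hypothetical, and the 24-hour wait is simply the interval I saw in my own account, not a documented guarantee.

```python
from datetime import datetime, timedelta
from urllib.robotparser import RobotFileParser


def safe_to_publish(robots_txt, path, robots_live_since, now,
                    wait=timedelta(hours=24)):
    """Return True only if path is disallowed AND robots.txt has been live long enough.

    robots_txt        -- text of the robots.txt currently served by the site
    path              -- path of the new page we want to keep crawlers away from
    robots_live_since -- when that version of robots.txt was uploaded
    now               -- current time
    wait              -- assumed time crawlers need to pick up the new file
    """
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    blocked = not parser.can_fetch("Googlebot", path)
    waited_long_enough = now - robots_live_since >= wait
    return blocked and waited_long_enough


if __name__ == "__main__":
    robots = "User-agent: *\nDisallow: /products/buy\n"
    uploaded = datetime(2024, 1, 1, 8, 0)  # made-up upload time
    print(safe_to_publish(robots, "/products/buy/item", uploaded,
                          now=uploaded + timedelta(hours=3)))
    print(safe_to_publish(robots, "/products/buy/item", uploaded,
                          now=uploaded + timedelta(hours=25)))
```

This only checks your own side of things. Looking at the cached copy in your webmaster account is still the only way to know what the crawler actually has.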
I thought the saga was over, but a few days later a few of the pages were crawled by Googlebot again. In this case, it was only about 5 pages, so it may have been a small bug in the system, or perhaps even a Google employee hand checking things. In any case, the pages are still in the index.
