Thursday 26 November 2009

Site Crawling for SEO

Spent some time today working on Spiralytics - our web crawling software. It crawls web sites for SEO purposes and builds up a report of all the pages it finds.

We had a problem a few weeks ago with a crawl on one of our sites - it could only find a small percentage of the site's pages. Normally this is caused by links embedded in JavaScript or Flash, but this time the pages were linked with normal anchor links.  After some investigation I found the issue was caused by the web site returning HTTP error code 403.

HTTP Error 403


The 403 Forbidden status code is normally returned by the server when a client is not allowed to view a page - for example, if you attempt to view a directory like /pages/ but there is no index page.  There are other causes, including the server incorrectly returning 403 instead of 401 Unauthorised. It was none of these, because the page is perfectly visible in a web browser.  That only leaves one explanation: the server doesn't like our crawler!
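One thing worth noting when debugging this kind of problem: many HTTP client libraries treat 4xx responses as exceptions, so a crawler has to catch them to see the status code at all. Here's a minimal sketch in Python (not the Spiralytics code - the server and function names are made up for illustration) that spins up a toy server which returns 403 for everything, and a fetch helper that reports the status code either way:

```python
import threading
import urllib.error
import urllib.request
from http.server import BaseHTTPRequestHandler, HTTPServer

class ForbiddenHandler(BaseHTTPRequestHandler):
    """Toy server that refuses every request with 403 Forbidden."""
    def do_GET(self):
        self.send_error(403, "Forbidden")
    def log_message(self, *args):
        pass  # silence per-request logging

def fetch_status(url):
    """Return the HTTP status code for url, even for error responses."""
    try:
        with urllib.request.urlopen(url) as resp:
            return resp.status
    except urllib.error.HTTPError as e:
        # urllib raises HTTPError for 4xx/5xx; the status is on the exception
        return e.code

# Start the toy server on a random free port
server = HTTPServer(("127.0.0.1", 0), ForbiddenHandler)
threading.Thread(target=server.serve_forever, daemon=True).start()
port = server.server_address[1]

status = fetch_status(f"http://127.0.0.1:{port}/pages/")
server.shutdown()
print(status)  # 403
```

The same helper against the real site would confirm whether the 403 comes back for the crawler while a browser sees 200 - which is exactly the symptom described above.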

I first tried changing the user-agent to various Mozilla strings and to Googlebot's, but still got the same response.  I then slowed the crawler down in case the server was rejecting too many requests from the same IP within a short period - but again, no luck.
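For anyone wanting to try the same two experiments, here's a rough sketch of what they look like in Python (again, not the Spiralytics code - the user-agent strings and the `crawl` helper are just illustrative). The key points are overriding the library's default User-Agent header and sleeping between requests:

```python
import time
import urllib.request

# Example user-agent strings to try in place of the crawler's default
USER_AGENTS = [
    "Mozilla/5.0 (Windows NT 6.1; rv:5.0) Gecko/20100101 Firefox/5.0",
    "Googlebot/2.1 (+http://www.google.com/bot.html)",
]

def make_request(url, user_agent):
    """Build a GET request that advertises user_agent instead of
    urllib's default ('Python-urllib/x.y'), which some servers block."""
    return urllib.request.Request(url, headers={"User-Agent": user_agent})

def crawl(urls, user_agent, delay=2.0):
    """Fetch each url in turn, pausing between requests so the server
    doesn't see a burst of hits from one IP (a crude rate limit)."""
    pages = []
    for url in urls:
        req = make_request(url, user_agent)
        with urllib.request.urlopen(req) as resp:
            pages.append(resp.read())
        time.sleep(delay)
    return pages

# Check the header is actually set on the outgoing request
req = make_request("http://example.com/", USER_AGENTS[1])
print(req.get_header("User-agent"))
```

If neither spoofing the user-agent nor throttling changes the response, the server is probably blocking on something else entirely - IP range, missing headers, or a firewall rule.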

So after all that - no luck.  It'll have to wait for another day.

iPhone

Started working on a new iPhone app for a client.  We've already decided on the basic functionality, so I was just finalising some of the details and producing mockups based on the initial designs from our designer.
