Search Engine Spiders
10:40 on Mon, 16 Nov 2009 | SEO
In previous articles we looked at ways to optimize your site so that it ranks higher in the search engine results pages (SERPs). This month, we will look at what actually determines your ranking, the search engines themselves, and how you can use this knowledge to make your sites rank better.
How do search engines work?
The first thing a search engine has to do is build a database of as many pages on the web as it can. To do that, it uses programs called web spiders (also known as web crawlers or web robots) that follow the links on web pages and build an index of what they find. Search engines aim to serve the most relevant pages to users and, as we covered in previous articles, there are SEO techniques that will help your site rank better.
One of the first things you need to do is ensure that search engines can crawl your site effectively. One way to do this is to give your site good architecture, meaning that users can reach the most important pages within a few links of the homepage. For example, if your site sells trainers, it would be bad form to bury the trainers product pages five links away from the homepage. A user-friendly site will naturally be crawler-friendly as well, since crawlers, like users, follow links to get to pages.
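To make "links away from the homepage" concrete, here is a minimal Python sketch that measures click depth with a breadth-first search. The link graph, the page paths and the function names are all made up for illustration; on a real site you would build the graph from a crawl of your own pages.

```python
from collections import deque

# Hypothetical internal link graph: each page maps to the pages it links to.
SITE_LINKS = {
    "/": ["/trainers", "/about"],
    "/trainers": ["/trainers/running", "/trainers/casual"],
    "/about": ["/about/history"],
    "/about/history": ["/about/history/archive"],
    "/trainers/running": [],
    "/trainers/casual": [],
    "/about/history/archive": [],
}


def click_depths(links, home="/"):
    """Return the fewest clicks needed to reach each page from `home`."""
    depths = {home: 0}
    queue = deque([home])
    while queue:
        page = queue.popleft()
        for target in links.get(page, []):
            if target not in depths:  # first visit is the shortest route
                depths[target] = depths[page] + 1
                queue.append(target)
    return depths


def pages_deeper_than(links, max_depth, home="/"):
    """List pages that sit more than `max_depth` clicks from the homepage."""
    depths = click_depths(links, home)
    return sorted(page for page, depth in depths.items() if depth > max_depth)
```

Calling pages_deeper_than(SITE_LINKS, 2) on this toy graph flags only the archive page, which is the kind of page you might want to link to from somewhere closer to the homepage.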
Another good method is to use a sitemap, which is basically a list of the pages on a website that are accessible to search engines.
Sitemaps come in two forms: HTML and XML.
An HTML sitemap is the older kind: a page on your website, linked to from the homepage, that lists links to your other pages. Search engines reach the sitemap from the homepage, and it then acts as a gateway to all the other pages on the site.
The advantage of HTML sitemaps is that humans can easily read them and use them to reach particular sections of your site. The disadvantage, however, is that on larger sites with an extremely large number of links, a web crawler might stop following links partway down the page. Splitting the sitemap into sub-pages does not fully solve this, because it adds another layer between the homepage and your content, and crawlers are less likely to crawl a page the deeper it sits from the homepage, especially on new sites.
The other type of sitemap comes in XML format. Google introduced it in 2005 to help search engines crawl websites more easily. XML sitemaps are meant to be read only by search engines and aren’t easily usable by humans, but they are the most efficient way to get a large site crawled.
All the major search engines support this protocol, so an up-to-date sitemap helps them keep current information about your site. Yahoo supports it as well: it originally used its own format, urllist.txt, but now favors an XML sitemap. Bear in mind, however, that submitting a sitemap doesn’t mean all the links will be crawled and, in turn, not all crawled links will be indexed, only the ones Google deems important on your site. An XML sitemap can easily be generated with any of the free and paid programs available on the Internet.
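If you are curious what such a generator produces, here is a minimal Python sketch that builds a sitemap with the standard library. The urlset, url, loc and lastmod elements and the namespace come from the sitemaps.org protocol; the build_sitemap function and the example.com URLs are made up for illustration, and a real generator would also handle things like very large sites.

```python
import xml.etree.ElementTree as ET

# Namespace defined by the sitemaps.org protocol.
SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"


def build_sitemap(pages):
    """Build sitemap XML from (url, last_modified) pairs.

    `last_modified` is a date string such as "2009-11-16", or None to omit it.
    """
    urlset = ET.Element("urlset", xmlns=SITEMAP_NS)
    for url, last_modified in pages:
        entry = ET.SubElement(urlset, "url")
        ET.SubElement(entry, "loc").text = url
        if last_modified:
            ET.SubElement(entry, "lastmod").text = last_modified
    body = ET.tostring(urlset, encoding="unicode")
    return '<?xml version="1.0" encoding="UTF-8"?>\n' + body


# Hypothetical pages for illustration only.
EXAMPLE_PAGES = [
    ("https://www.example.com/", "2009-11-16"),
    ("https://www.example.com/trainers", None),
]
```

Saving the result of build_sitemap(EXAMPLE_PAGES) as sitemap.xml gives a file with one url entry per page, which you can then submit to the search engines.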
The disadvantage of XML sitemaps is that they are less useful to smaller sites, whose small number of internal links means the site will be crawled fairly easily anyway. Additionally, an XML sitemap removes the chance to test your site architecture, since it presents the crawler with all the pages on your site without the crawler having to discover them by following links.
Another common thing to block crawlers from is the dynamic search results pages on your site, since these pages are less relevant in the SERPs.
This is where robots.txt comes in. You can use this text file to specify how much access you want to give the search engine robots to your site.
For example, the following code allows search engine robots unrestricted access to all pages on your site:
User-agent: *
Disallow:
The * in the code above means “all”, so it refers to all crawlers. No path is specified after “Disallow:”, so as it stands, all crawlers are allowed access.
The following, by contrast, tells ALL robots to keep out:
User-agent: *
Disallow: /
The forward slash here refers to the base directory, which basically means the whole site.
The following code tells robots not to access a specified file:
User-agent: *
Disallow: /directory/file.html
In the code above, file.html is located in a sub-folder called “directory”, which sits directly under the base directory.
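To see how these three rule sets behave, here is a minimal sketch using Python's standard-library urllib.robotparser module. The rule strings are written out from the descriptions above, and the ExampleBot crawler name and the example.com URLs are made up for illustration. Real crawlers may interpret edge cases differently from this parser.

```python
from urllib.robotparser import RobotFileParser

# The three rule sets described above, written out as robots.txt text.
ALLOW_ALL = "User-agent: *\nDisallow:\n"
BLOCK_ALL = "User-agent: *\nDisallow: /\n"
BLOCK_ONE_FILE = "User-agent: *\nDisallow: /directory/file.html\n"


def is_allowed(robots_txt, url, agent="ExampleBot"):
    """Return True if `agent` may fetch `url` under the given robots.txt text."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(agent, url)


if __name__ == "__main__":
    blocked_file = "https://www.example.com/directory/file.html"
    other_file = "https://www.example.com/directory/other.html"
    for name, rules in [
        ("allow all", ALLOW_ALL),
        ("block all", BLOCK_ALL),
        ("block one file", BLOCK_ONE_FILE),
    ]:
        print(name, is_allowed(rules, blocked_file), is_allowed(rules, other_file))
```

A check like this is a cheap way to confirm that a rule blocks what you intended, and nothing else, before you upload the file.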
Robots.txt isn’t the only way to block crawler access to a file, though. In the previous article about meta tags, I explained how you can use meta tags to do the same thing. Robots.txt offers more flexibility and gives you a central place from which to set crawler access to your site. The downside is that anyone can read the file, so people can see exactly which pages you don’t want crawlers to visit and will have more information about your SEO strategy.
In general, it is good practice to have a robots.txt file since, apart from regulating crawler access, it lets you do other things such as specify the location of your XML sitemap file. XML sitemaps are mainly useful if you have a large site, although they are still no substitute for good site architecture.
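As a rough illustration of a robots.txt file that does both jobs, the Python sketch below parses one that blocks a search results path and points to a sitemap. The /search path, the example.com URLs and the ExampleBot name are assumptions made up for this example, and the site_maps() method needs Python 3.8 or later.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: keeps crawlers out of search results pages
# and tells them where the XML sitemap lives.
ROBOTS_TXT = """\
User-agent: *
Disallow: /search
Sitemap: https://www.example.com/sitemap.xml
"""


def load_rules(robots_txt):
    """Parse robots.txt text and return the populated parser."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


if __name__ == "__main__":
    rules = load_rules(ROBOTS_TXT)
    print(rules.site_maps())
    print(rules.can_fetch("ExampleBot", "https://www.example.com/search?q=trainers"))
    print(rules.can_fetch("ExampleBot", "https://www.example.com/trainers"))
```

Here the search results URL is reported as blocked, the product page stays crawlable, and the sitemap location is read straight from the file.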