Imagine you’ve just finished the best blog article you’ve ever written in your life, but no one will ever get to see it because it never appears in the search results.
Disaster!
But how is this even possible?
Let’s take a look at website indexing and discover the lowdown on all things search.
Google and the other search engines love serving up great info from the sites they index. They're constantly adding to their massive libraries of URLs by sending out scouts, called spiders, to crawl websites and find new content.
How the search indexing spiders work
Even for these amazing internet spider bots, the web is a monstrously big place to navigate, so they rely on links to point them from page to page.
They pay attention to new URLs, sites that have changed and dead links. When they come across a new or recently changed page, they render it much like a web browser would.
The difference between us and them is that whereas we might skim certain pages, images and content, they scan the page thoroughly from top to bottom, creating an index entry for every unique word. This means your single web page could potentially be referenced in hundreds of index entries.
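To picture what that means, here's a toy sketch of an inverted index in Python. It's only an illustration of the idea of 'one entry per unique word', not how any real search engine is built, and the URLs and text are made up:

```python
from collections import defaultdict

# Made-up pages standing in for crawled content
pages = {
    "https://example.com/best-article": "the best blog article you have ever written",
    "https://example.com/about": "all about the best blog on the web",
}

# Map every unique word to the set of pages it appears on
index = defaultdict(set)
for url, text in pages.items():
    for word in text.lower().split():
        index[word].add(url)

print(len(index), "index entries from just two short pages")
print(sorted(index["best"]))  # both pages are referenced under the word 'best'
```

Even these two tiny pages produce a dozen index entries, which is why a full page of text can end up referenced hundreds of times.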
The different indexing spiders
There are hundreds of different spiders crawling the sites of the internet, and some are good while some are bad. The good ones look to index your site for their visitors and customers; the bad ones are trying to scrape information for spamming purposes.
The main good ones are:
- Googlebot
- Bingbot
- Slurp
- Facebot
- Alexa crawler
Helping your search spiders
There might be some pages on your site that you don't want indexed – things like a thank-you page for a form submission or a promo code page.
The other thing is that Googlebot and some of the other crawlers work to a crawl budget, so they'll only crawl so many URLs on your site before leaving.
So, what can you do to make sure the right pages are indexed?
You need to set out some rules and priorities to make things easier for your spiders. There are two ways to do this – through robots.txt files and meta directives.
What are robots.txt files?
This is a file which tells web spiders where they should and shouldn't go on your site, although not all of them will respect your wishes.
You can check whether you already have one set up by adding /robots.txt to the end of your site URL and seeing what pops up. If nothing appears, you don't have one yet.
The syntax is pretty straightforward, so you can add whatever instructions you want.
Here are a couple of common directives:
- User-agent is where you name the spider you want to address, or you can use an asterisk (*) to address all of them
- Disallow is where you list a path you want to stop the spiders from visiting, or if you don't want your site crawled at all you can add a single forward slash (/) to tell spiders to stay away entirely
Disallow is the most common directive, but you can also ask for a Crawl-delay, Allow specific pages within a disallowed section, or point to an XML sitemap, which tells the spiders where to find your most important pages.
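To make that concrete, here's a minimal robots.txt sketch – the domain and paths are placeholders, so swap in your own:

```
# Rules for every spider
User-agent: *
# Keep them out of the thank-you and promo code pages
Disallow: /thank-you/
Disallow: /promo-codes/
# But allow one specific page inside the blocked folder
Allow: /promo-codes/public-offer/

# Ask Bingbot to wait 10 seconds between requests (Googlebot ignores Crawl-delay)
User-agent: Bingbot
Crawl-delay: 10

# Point spiders at your XML sitemap
Sitemap: https://www.example.com/sitemap.xml
```

One thing to keep in mind: Disallow stops spiders crawling a page, but if other sites link to it the URL can still end up indexed – a noindex meta directive (see the next section) is the way to keep it out of the results entirely.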
What are meta directives?
Meta directives are more commonly known as meta tags, and they tell spiders what they can and can't do for indexing purposes.
Because these are written into the code of the page itself, they're treated as a directive rather than a suggestion. Using them, you can tell the spiders whether they can index the page, whether they should follow its links and whether a search engine can pull a snippet for the results.
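Here's a sketch of what these look like in a page's <head> – the combinations shown are just examples of the common values:

```html
<!-- Don't index this page and don't follow any of its links -->
<meta name="robots" content="noindex, nofollow">

<!-- Index the page and follow links, but don't show a text snippet in results -->
<meta name="robots" content="index, follow, nosnippet">

<!-- Target one spider by name instead of all of them -->
<meta name="googlebot" content="noindex">
```

A page has to be crawlable for the spider to see these tags, so don't block a page in robots.txt and then expect its noindex tag to work.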
Site indexing
Sometimes sites don't get indexed, and there are a few reasons why this might be happening:
- your robots.txt file or meta tags are blocking the spiders
- it's a new site and the spiders haven't had a chance to crawl it yet – this can take months
- it isn't linked to from anywhere else on the internet
- the navigation is hard to follow
- your site has been flagged for using black hat techniques
But what can you do to make the spiders happier to crawl and index your site?
Here are a few ideas:
- Organisation – links are the primary mode of transit for the spiders, so make sure your navigation has clear pathways. This means linking to other pages on your site within your page text, not just from your navigation menu
- Openness – don't hide your best content behind logins, forms or surveys, as the spiders can't see it. They also can't read text inside images, videos and gifs, so make sure you're including alt text too
- Sitemap – if you link your sitemap in the robots.txt file and submit it through Google Search Console, you can point Googlebot at the specific pages you want crawled. You can create a sitemap through your CMS, manually or via software – there's a simple example sketched after this list
- Page check – to see the pages Google has already indexed, go to Google and search for site: followed by your domain name (for example, site:example.com). This will give you a list in the search results of all your indexed pages, which is a great way to spot anything missing or unnecessary
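For the sitemap idea above, here's a minimal XML sitemap sketch – the URLs and dates are placeholders for your own pages:

```xml
<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <!-- One <url> entry per page you want the spiders to prioritise -->
  <url>
    <loc>https://www.example.com/</loc>
    <lastmod>2024-01-15</lastmod>
  </url>
  <url>
    <loc>https://www.example.com/best-blog-article</loc>
    <lastmod>2024-02-01</lastmod>
  </url>
</urlset>
```

Save it as sitemap.xml in your site's root, reference it with a Sitemap: line in robots.txt (as in the earlier example) and submit it under Sitemaps in Google Search Console.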
So, now you know how search indexing works, can you find any ways your own site needs improving?
Why not read The Three Faces of SEO to see how SEO can help with your search engine indexing too?