When you first learn robots.txt you think it’s fairly simple: allow, disallow. Well, while robots.txt sadly does not support regular expressions, which would be so awesome if it did, it does support the HIGHLY useful wildcard.
The problem with Google and other search engines is that they have no clue how to handle querystrings. Querystrings confuse them to death, and as a result we get poor results. What if the search engine indexes a link carrying user information, such as a UserID, a SessionID, or something else that just becomes annoying? We want to maximize our results!
The solution is semi-complex and involves some advanced usage of wildcards. In this case, my site has a search page that uses querystrings to sort results by genre, content rating, appearance, and a lot of other options. On top of that there are querystrings for page number and page size... it’s just a long mess of querystring values.
Well, I don’t want to block all of the querystrings, because some are useful for a search engine, such as the page number and the query value. An example would be:
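The original example URL isn’t preserved here, but based on the pg and q parameters discussed below, it would have looked something like this (illustrative; the exact original may have differed):

```
/search/?q=A&pg=2
```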
This works great! It shows all the results under A that are on page number 2! Perfect! But what if it gets more complex…
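The actual monster URL isn’t preserved either, but given the sorting options described above, a made-up stand-in (all parameter names beyond q and pg are my own invention) might look like:

```
/search/?q=A&pg=2&genre=action&rating=teen&sort=appearance&pagesize=50
```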
URLs like that are actual results from my pages! I don’t want Google to index them, because can you imagine how many combinations there would be?! It would be chaos: Google would spit them out, get PO’d, and potentially treat the whole long mess as spam. And trust me, I use querystrings a lot in the search, and there are over 500 links on certain result pages, so it can become a HUGE mess.
Well, I love the idea of keeping the querystring values pg and q, since they are useful. So to figure this out, we use the following solution in robots.txt:
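The original robots.txt listing isn’t shown here, but based on the rules described below (disallow everything, then disallow anything after pg, after q, between or after the pg/q pair, then allow pg and q alone or paired), a reconstruction would look something like this — not the exact original:

```
User-agent: *
Disallow: /search/?*
Disallow: /search/?pg=*&*
Disallow: /search/?q=*&*
Disallow: /search/?pg=*&*&q=*
Disallow: /search/?q=*&*&pg=*
Disallow: /search/?pg=*&q=*&*
Disallow: /search/?q=*&pg=*&*
Allow: /search/?pg=*
Allow: /search/?q=*
Allow: /search/?pg=*&q=*
Allow: /search/?q=*&pg=*
```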
This may look a bit complex at first, but the solution is simple. First we go ahead and disallow all querystrings: /search/?*
Then we go ahead and ban a few others!
In this case I ban anything with another querystring value after pg, anything after the combination of pg and q, anything after q alone, and anything after q followed by pg.
Generally, I ban everything that isn’t pg or q in a row. That also covers anything that may appear between them: adding &* to the pattern blocks anything sitting between the two values.
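As an illustration (the sort parameter here is made up), a pattern such as:

```
Disallow: /search/?pg=*&*&q=*
```

would match a URL like /search/?pg=2&sort=name&q=A, because the middle &* swallows the intervening sort=name.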
Then we go ahead and allow the specific values that are ?pg=* or ?q=*
We also allow either of the two followed by the other.
As you can tell, while robots.txt seems simple, it can get a bit complex in some cases. Still, this is nothing near as bad as some of the stuff I’ve been through, since it was a quick 15-minute solution.
Here’s also something to remember: when a search engine uses robots.txt, it will go through all potential matching rules. In this case I disallowed /search/?* but also put in an allow, which let the engine index ?pg=* — but not every ?pg=* URL is approved, since in some cases it is still blocked. With wildcards, Google will go ahead and check every allow and disallow that matches, and the most specific (longest) matching rule wins.
For example, take /search/?q=1&g=9&pg=1. The allow ?q=*&pg=* matches it, but because another querystring value (g=9) can fall inside the wildcard, the disallow ?q=*&*&pg=* also matches — and matches more of the URL, since the &* part is filled by that extra querystring. Because the disallow is the more fulfilled (longer) match, Google will go ahead and block it.
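To make the precedence concrete, here’s a rough sketch in Python of longest-match rule evaluation — my own illustration, not Google’s actual code, and the RULES list is a trimmed-down stand-in for the rules discussed above:

```python
import re

# Each pattern may contain * wildcards. Every matching rule is collected,
# the longest pattern wins, and a length tie goes to Allow.
RULES = [
    ("Disallow", "/search/?*"),
    ("Disallow", "/search/?q=*&*&pg=*"),
    ("Allow",    "/search/?q=*"),
    ("Allow",    "/search/?q=*&pg=*"),
]

def pattern_to_regex(pattern):
    # '*' matches any run of characters; everything else is literal.
    return re.compile("".join(".*" if c == "*" else re.escape(c) for c in pattern))

def is_allowed(path):
    matches = [(len(pat), verdict == "Allow")
               for verdict, pat in RULES
               if pattern_to_regex(pat).match(path)]
    if not matches:
        return True  # no rule applies -> crawlable by default
    # Longest pattern wins; on a length tie, Allow (True) beats Disallow (False).
    return max(matches)[1]

print(is_allowed("/search/?q=1&pg=2"))      # True: allow ?q=*&pg=* is the longest match
print(is_allowed("/search/?q=1&g=9&pg=1"))  # False: disallow ?q=*&*&pg=* is longer still
```

Running it shows exactly the behavior above: the simple q-plus-pg URL stays crawlable, while the one with a stray value between them gets blocked.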
It sounds complex, but if you think about it, it’s actually very logical. Which is also why you don’t have to worry about how you group or order robots.txt: the allows and disallows can be in any order, at any place, and it makes no difference.
Hopefully that helps you understand robots.txt a bit more, and if I remember I’ll be sure to tell you how it turned out on Google. Hopefully my ranking goes up once all of the older stuff is washed out.