Thomas’ Developer Blog

April 29, 2008

Advanced Usage of Robots.txt w/ Querystrings

Filed under: robots.txt, search engines — sanzon @ 1:53 am

When you first learn robots.txt it seems fairly simple: allow, disallow.  Well, while robots.txt sadly does not support regular expressions (which would be so awesome if it did), it does support the HIGHLY useful wildcard.

The problem is that Google and other search engines have no clue how to handle querystrings.  Querystrings confuse them to death, and as a result we get poor results.  What if they index a link that carries user information, such as a UserID, a SessionID, or something else that just becomes noise?  We want to maximize our results!

The fix involves some semi-complex, advanced usage of wildcards.  In this case, my site has a search page that uses querystrings to sort results by genre, content rating, appearance, and a lot of other options.  Plus there are querystrings for page number and page size… it’s just a long mess of querystring values.

Well, I don’t want to block all of the querystrings, because some are useful to a search engine, such as the page number and the query value.  An example would be:

/search/?q=a&pg=2

This works great!  It shows all the results under “a” that are on page 2.  Perfect!  But what if it gets more complex…

/search/?t=1&g=29&go=1&cr=10&cro=1&s=6&ps=20&v=1&q=a

That is an actual URL from one of my pages!  I don’t want Google to index that, because can you imagine how many combinations there would be?  It would be chaos; Google would spit it out, get PO’d, and potentially think it is one long mess of spam.  And trust me, I use querystrings a lot in the search, and there are over 500 links on certain result pages, so it can become a HUGE mess.

Well, I love the idea of keeping the querystring values pg and q, since they are useful.  So to pull this off we use the following rules in robots.txt:

Disallow: /search/?*
Disallow: /search/?pg=*&*
Disallow: /search/?pg=*&q=*&*
Disallow: /search/?pg=*&*&q=*
Disallow: /search/?q=*&*
Disallow: /search/?q=*&pg=*&*
Disallow: /search/?q=*&*&pg=*

Allow: /search/?pg=*
Allow: /search/?pg=*&q=*
Allow: /search/?q=*
Allow: /search/?q=*&pg=*
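
To see how these rules play out, here’s a minimal Python sketch of the matching logic.  It assumes Googlebot-style semantics: a * matches any run of characters, a rule matches any URL it is a prefix pattern of, and the most specific (longest) matching pattern wins, with allows winning ties.  The rule list and test URLs are just the ones from this post; this is not Google’s actual code.

import re

# The rule set from the robots.txt above, as (directive, pattern) pairs.
RULES = [
    ("Disallow", "/search/?*"),
    ("Disallow", "/search/?pg=*&*"),
    ("Disallow", "/search/?pg=*&q=*&*"),
    ("Disallow", "/search/?pg=*&*&q=*"),
    ("Disallow", "/search/?q=*&*"),
    ("Disallow", "/search/?q=*&pg=*&*"),
    ("Disallow", "/search/?q=*&*&pg=*"),
    ("Allow", "/search/?pg=*"),
    ("Allow", "/search/?pg=*&q=*"),
    ("Allow", "/search/?q=*"),
    ("Allow", "/search/?q=*&pg=*"),
]

def pattern_to_regex(pattern):
    # Googlebot treats * as "any run of characters"; everything else is
    # literal.  A rule matches when it matches a prefix of the URL, so
    # the regex is anchored at the start but not at the end.
    return re.compile("^" + ".*".join(re.escape(p) for p in pattern.split("*")))

def is_allowed(url):
    # Collect every rule that matches, then let the longest pattern win.
    # Ties go to Allow, since True > False when the lengths are equal.
    matches = [(len(pattern), directive == "Allow")
               for directive, pattern in RULES
               if pattern_to_regex(pattern).match(url)]
    return max(matches)[1] if matches else True  # no matching rule: allowed

for url in ("/search/?q=a&pg=2",          # allow: the 17-char allow wins
            "/search/?q=1&g=9&pg=1",      # disallow: the 19-char disallow wins
            "/search/?t=1&g=29&q=a"):     # disallow: only /search/?* matches
    print(url, "->", "allow" if is_allowed(url) else "disallow")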


This may look a bit complex at first, but the idea is simple.  First we go ahead and disallow all querystrings: /search/?*

Then we go ahead and ban a few others!

In this case I ban anything with an extra querystring value after pg, after q alone, and after the combinations of pg followed by q and q followed by pg.

Generally, I ban everything whose querystring isn’t pg or q in a row.  That also includes anything that may appear between them, with:

Disallow: /search/?pg=*&*&q=*

The added &* will block anything that appears between the two.
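
For instance, with a couple of made-up URLs:

/search/?pg=2&v=1&q=a   (blocked: the v=1 sits between pg and q)
/search/?pg=2&q=a       (not matched by this rule: nothing extra between the two)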

Then we go ahead and allow the specific values ?pg=* and ?q=*.

We also allow either of the two followed by the other.

As you can tell, while robots.txt seems simple, it can get a bit complex in some cases.  This is nowhere near as bad as some of the stuff I’ve been through, though, since it was a quick 15-minute solution.

Here’s also something to remember: when a search engine uses robots.txt, it will go through all of the potentially matching rules.  In this case I disallowed /search/?*, but I also put in allows, which let the engine index ?pg=* URLs; not every ?pg=* URL is approved, though, since some are still caught by a disallow.  With wildcards, Google checks every allow and disallow that matches, and the most specific rule, meaning the longest matching pattern, wins; when an allow and a disallow match at the same length, the allow wins.

For example, if you have /search/?q=1&g=9&pg=1, the allow ?q=*&pg=* matches it, because the wildcard after q= can swallow the extra &g=9.  But the disallow ?q=*&*&pg=* also matches, with the middle &* covering &g=9 exactly.  Since that disallow is the longer, more specific pattern, Google will go ahead and block it.
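
Running the two competing patterns through the is_allowed() sketch from earlier (again an assumption about Googlebot’s longest-match behavior, not its real code) shows the same outcome:

# Allow:    /search/?q=*&pg=*      (17 characters)
# Disallow: /search/?q=*&*&pg=*    (19 characters)
assert is_allowed("/search/?q=a&pg=2") is True        # the 17-char allow wins
assert is_allowed("/search/?q=1&g=9&pg=1") is False   # the 19-char disallow wins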

It sounds complex, but if you think about it, it’s actually very logical.  That’s also why you don’t have to worry about how you group or order robots.txt rules: the allows and disallows can be in any order, at any place, and it won’t make a difference.

Hopefully that helps you understand robots.txt a bit more, and if I remember I’ll be sure to tell you how it turned out on Google.  Hopefully my ranking goes up once all of the older stuff is washed out.


Comments »

  1. Thanks for this great post with micro-detail.

    Comment by fkim @ FK Media — May 19, 2008 @ 6:03 pm

  2. Argh, I’m in trouble! I’ve managed to duplicate a whole directory’s worth of content by specifying a querystring version of each URL within a piece of JavaScript for ClickTale (so I’m not even linking to it; it’s through a JavaScript variable!)

    Is this what I need for my robots.txt?

    Disallow: /my-folder/?clicktale=*&*

    If for example the variable urls look like:

    /my-folder/page1.asp?clicktale=true
    /my-folder/page2.asp?clicktale=true
    /my-folder/page3.asp?clicktale=true

    Please let me know if my syntax is right!

    Comment by Jimmy — March 2, 2011 @ 11:45 am

  3. Hi

    I was looking round a few bits of info to clarify part of my interpretation of the spec for the Robots Exclusion Protocol, and according to robotstxt.org, wildcards are not supported in the Disallow syntax; you simply disallow a partial URL, e.g. Disallow: http://www.domain.com/page.aspx?
    would block access to all query strings on page.aspx.

    Also, there is no “Allow:” syntax.

    Comment by John Hughes — June 1, 2011 @ 9:01 am

  4. Thanks, this helped me to refine the disallow at Blogger. Just a matter of:

    Disallow: /search?q=*

    I still want to allow access to /search/labels/…, /search?updated…, and /search?archived…. But, of course I’d like to disallow actual queries entered by visitors to the blog. So I think this will work.

    Comment by Erin Thomas — June 15, 2013 @ 12:12 pm




