
Tech:Fix Duplicate Content


Wikitravel has about 550 000 "real" pages, with each available as a normal HTML page (wikitravel.org) or as a mobile page (m.wikitravel.org):

https://www.google.com/search?num=100&q=site%3Awikitravel.org+-inurl%3Aindex.php&oq=site%3Awikitravel.org+-inurl%3Aindex.php&


There are also some 350 000 diff pages inadvertently listed in Google:

https://www.google.com/search?num=100&q=site%3Awikitravel.org+inurl%3Aindex.php+inurl%3Adiff&oq=site%3Awikitravel.org+inurl%3Aindex.php+inurl%3Adiff&

These are of no use to visitors and waste Google's crawl budget.

They also get flagged as Duplicate Content, since multiple revisions of each page are indexed with only minimal changes between versions. This may be stopping the "real" content pages from ranking higher. The Duplicate Content page count is currently running at about 60% of the "real" content page count.


Additionally, any spam found on old revisions of pages is fully indexed by Google, and outgoing links to spam and junk are also fully indexed and count against Wikitravel's link profile.

URLs containing "/*/index.php" should be excluded from spidering.


http://wikitravel.org/robots.txt

  User-agent: *
  Allow: /
  
  User-agent: Mediapartners-Google
  Allow: /

should probably be:

  User-agent: *
  Disallow: /*/index.php
  
  User-agent: Mediapartners-Google
  Disallow: /*/index.php


http://m.wikitravel.org/robots.txt

  Status: 404 Not Found

should probably be:

  User-agent: *
  Disallow: /*/index.php
  
  User-agent: Mediapartners-Google
  Disallow: /*/index.php
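
As a quick sanity check, a short Python sketch like the one below (illustrative only, not part of the site) can confirm that a Google-style wildcard rule such as Disallow: /*/index.php would block the index.php revision URLs while leaving the normal article URLs crawlable. The helper function and the diff sample path are hypothetical; the handling of "*" follows Google's documented wildcard semantics for robots.txt.

  import re
  
  def rule_matches(pattern: str, path: str) -> bool:
      """Return True if a Google-style robots.txt pattern matches the URL path (sketch)."""
      anchored = pattern.endswith("$")
      body = pattern[:-1] if anchored else pattern
      # "*" matches any run of characters; everything else is treated literally.
      regex = "".join(".*" if ch == "*" else re.escape(ch) for ch in body)
      if anchored:
          regex += "$"
      return re.match(regex, path) is not None
  
  disallow = "/*/index.php"
  samples = [
      "/wiki/en/index.php?title=Africa&oldid=1909028",            # old revision (from the response below)
      "/wiki/en/index.php?title=Africa&diff=next&oldid=1909028",  # diff view (hypothetical example)
      "/en/Africa",                                                # canonical article
  ]
  for path in samples:
      print("blocked  " if rule_matches(disallow, path) else "crawlable", path)

Google Webmaster Tools also offers a blocked-URLs / robots.txt test that can be used to double-check the same URLs against the live file.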


Response from IB

I completely understand your point, however, blocking duplicate or lower quality content via robots.txt is only recommended as a last resort. There is almost always a more appropriate mechanism, like a canonical tag, a meta robots "noindex,follow" tag, setting Google parameter handling in Google Webmaster Tools, etc. We have looked at the issue you're addressing multiple times and have decided that we don't want to take any action right now. One specific reason for this is that often when people re-publish WT content, they link to the specific version of the content they used, so, http://wikitravel.org/wiki/en/index.php?title=Africa&oldid=1909028 instead of http://wikitravel.org/en/Africa (the canonical version). If we blocked Google from crawling and indexing that page, then we would be forgoing the attribution link, which wouldn't be good.

  • note: I just noticed that the URL I posted above has a meta robots "noindex,nofollow" tag on it. That's incorrect and we'll get that fixed. I should also note that our long term plan is to put a canonical tag on these pages, but there have been some complications that have made that effort undesirable at this point.
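
For reference, the two mechanisms mentioned above would look something like the following in the <head> of an old-revision page, using the canonical Africa URL from the example as the target (purely illustrative, not the current markup):

  <link rel="canonical" href="http://wikitravel.org/en/Africa" />
  <meta name="robots" content="noindex,follow" />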

Regarding Google's crawl budget... while there is an upper limit on the number of pages Google will crawl on a site per day, Wikitravel is nowhere near reaching that limit. We have found time and time again that stopping Google from crawling sections of a site (when we have not maxed out our crawl budget) doesn't cause an increase in the crawl rate in the remaining sections. Google is not determined to crawl x pages on a domain, so when you close down a section it doesn't simply make up for it somewhere else.

IB-Dick 17:38, 14 January 2013 (EST)