A lot has been made of the recent post at Google’s Webmaster Central Blog regarding duplicate content. The post, entitled Demystifying the “Duplicate Content Penalty”, basically calls into question the idea that a site can be “penalized” for having significantly similar content to another site and places the blame on webmasters and SEOs for perpetuating the “myth of duplicate content”. Whether you call it a “penalty” or say Google simply “filters” the results, the outcome is ultimately the same – one URL will be considered the “preferred” URL by Google and THAT will be the URL included within SERPs. The real question then becomes how to help Google identify your URL as the “preferred” one.
Calling it a “penalty” vs. a “filter” is nothing more than semantics. Stating that there is no such thing as a “duplicate content penalty” is Google’s feeble attempt to make it seem as though they aren’t unfairly penalizing individual URLs in their self-appointed role as police of the Internet. In practice, it doesn’t matter whether a page suffers from an actual “penalty” or not. If it is “filtered out” of the results for having content that is too similar to another page already listed in the index, does it really matter whether it’s technically referred to as a “penalty” or a “filter”? The ultimate result is the same – people will be able to find one page more easily than the other. Period.
I completely understand Google’s goal in eliminating duplicate content from their search results. Imagine what SERPs would look like, considering the number of cookie-cutter affiliate websites in existence today, if duplicate content weren’t factored into a site’s ranking. Part of any search engine’s goal, Google included, is to provide its users with relevant and unique information. If searching for the term “weight loss supplements” resulted in 100 identical Herbalife sites, each with a different URL and the exact same information, how reliable would you believe those results to be? There is a reason duplicate content is frowned upon…and it should be.
It appears that Google, at least in this latest official post, is referring specifically to inadvertent and non-malicious duplicate content within a given site. I have suspected for some time that this sort of duplicate content has a low level of impact on a site’s ranking. Think, for example, about the number of WordPress sites that rank highly for competitive search terms. Many of those sites contain sidebars that are identical on every page. If that sort of duplicate content were penalized…or, uh…excuse me, if that sort of duplicate content were to trigger Google’s “filters”, that would likely have a negative impact on the ranking of each page. That hasn’t been my experience.
What about duplicate content from other sites (i.e. scraped content)? How does Google determine whether or not a site that scrapes content is malicious? In my opinion, stealing my intellectual property and regurgitating it on your own site as if you created it IS malicious under any circumstances and should be penalized accordingly. Google apparently doesn’t agree, as illustrated by a number of documented experiences with scraper sites outranking the original source due in large part to having a higher PR than the site where the information was originally published.
Google doesn’t do as good a job at this as they’d like to believe. Many a webmaster, SEO and search marketing guru has weighed in on the issue, and the consensus, at least from what I’ve read, is pretty clear: duplicate content, regardless of its source, can have a negative impact on ranking, so DON’T LEAVE IT TO GOOGLE TO FIGURE THINGS OUT ON THEIR OWN! These posts reiterate that it is important to put forth a modicum of effort to address what is within your control when it comes to duplicate content within your own site:
http://searchengineland.com/080915-121927.php
http://www.searchenginejournal.com/google-pagerank-play-doh/5504/
http://www.mattcutts.com/blog/seo-advice-url-canonicalization/
http://www.sugarrae.com/be-a-normalizer-a-c14n-exterminator/
http://janeandrobot.com/post/canonical-url-canonicalization-domain.aspx
http://www.mattcutts.com/blog/canonicalization-update/
I mean, seriously – are all of these experts full of crap? Matt Cutts included? I’d venture to say that they’re not…and preventing duplicate content issues from occurring in the first place is likely the best option for anyone concerned about driving traffic to a site via Google’s organic search results. As for duplicate content from another site…I think we all know there is little, if anything, that can truly be done about that, and we have no choice but to leave it to Google to sort out those issues on their own.
Google likes to toot their own horn and proclaim that they’ve “got it covered”, but the fact of the matter is that the more you leave to Google’s bots to “figure out”, the more time must be spent doing so…meaning greater resources and server load. Why? Why not just make things as clean and simple as possible? Does this recent clarification by Google mean that we should no longer concern ourselves with resolving a site’s canonical homepage issues? Does it mean that it’s no longer necessary to restrict bot access to printer friendly versions of pages? Does it mean that multiple URLs resolving with the same content will be ignored, but have no negative impact on the ranking of the preferred URL? No…I don’t think their claim to have a handle on duplicate content issues means any of that.
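To show the kind of cleanup I mean, here is a rough sketch of the sort of .htaccess rules that resolve a canonical homepage issue at the server level. This is only an illustration, assuming Apache with mod_rewrite enabled; www.example.com and the index file names are placeholders for whatever your preferred URL actually is:

```apache
RewriteEngine On

# Send the non-www hostname to the www version with a permanent (301) redirect
RewriteCond %{HTTP_HOST} ^example\.com$ [NC]
RewriteRule ^(.*)$ http://www.example.com/$1 [R=301,L]

# Collapse direct requests for the index file onto the root URL.
# THE_REQUEST is checked so the rule only fires on the visitor's original
# request, not on Apache's internal DirectoryIndex subrequest.
RewriteCond %{THE_REQUEST} ^[A-Z]+\ /index\.(html|php)\ HTTP [NC]
RewriteRule ^index\.(html|php)$ http://www.example.com/ [R=301,L]
```

Printer-friendly pages are just as easy to handle on your end: a Disallow line in robots.txt or a noindex meta tag keeps the bots out of them entirely, so there is nothing left for Google to “figure out” in the first place.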
What it does mean is that Google is, as always, making every effort to improve the experience for its users – with or without consideration for the impact on webmasters. They will likely always err on the side of caution when it comes to preventing spam and other malicious activity within search results. To a certain extent, that is for the good of all…provided that Google recognizes the disconnect between what they claim to be true and the real-world experiences many SEOs have had in dealing with duplicate content issues to this point.
If duplicate content weren’t factored into ranking, the SERPs for some queries would be dominated by a large number of identical affiliate sites and little else. “Penalty” or not, there is a definitive reason that a search for “Herbalife” doesn’t return thousands of identical, cookie-cutter Herbalife affiliate websites…and I believe at least part of that reason to be duplicate content. Don’t piss on my leg and then tell me it’s raining, Google! Give me a break!
See what others within the industry are saying on this issue at Search Engine Roundtable and Sphinn.
Fred says
Nice post, Aly… and interesting food for thought. I wouldn’t have thought of a “filter” as being the same as a “penalty,” and I’m still not sure if I agree :)
The message I got from Google’s post is this — you can have a large, well-ranking, content-rich site and yes, if a few articles are duplicated elsewhere, who cares? There’s a possibility of a *filter*, which in my mind is still different from a *penalty* because I assume a penalty would afflict all pages of a site rather than just pages with substantial duplicate content.
Back when supplemental results existed I think this was pretty noticeable — with big CMS sites that had maybe one or two lines of unique content on dozens of sub-sub-pages, those pages would get dumped into the supplemental index (and I saw more than one site tank for really competitive keywords when a legitimate page got sucked into “supplemental hell”).
While I don’t agree with all of Google’s policies, this one seems pretty fair to me, since it essentially gives you a safety net if you’re not savvy enough to correct duplicate content problems with link soloing or robots.txt methods.
SEOAly says
I agree with the theory of Google’s practices providing a safety net for less-than-SEO-savvy webmasters…but I question whether they actually pull this off as effectively as they tend to believe. Based on the posts and related comments I’ve read, Google isn’t as good at resolving duplicate content issues as they seem to think.
The fact is that there are a lot of pages out there with duplicate content issues, whether internal or external…and some scraper sites outrank the original source, which is proof enough that Google has some unresolved issues in dealing with duplicate content. Whether or not the technical term is correct, it likely feels like a “penalty” to those whose pages are outranked by scraper sites.
I missed out on the days of the supplemental index, so I can’t speak to any personal experience with that. From what I understand at this point, Google claims there is no longer such a thing as the supplemental index, so…I guess that’s one thing I can leave off my “still need to learn about…” list. :)
I believe any attempt to provide legitimate, relevant search results is worthwhile. I just question Google’s attempt to play with the semantics of “penalty” vs. “filter” when ultimately the goal is the same – to avoid including pages with substantial amounts of duplicate content within SERPs.
Isn’t part of the goal also to prevent a single version of a page hosted at multiple URLs from monopolizing SERPs and gaining an unfair advantage over competitors’ pages by pushing them out of the top few pages of results? I can only assume that played into the initial conversations regarding duplicate content back in the early days…
Eric Werner says
From what I’ve observed there does seem to be a difference between the duplicate content filter and the duplicate content penalty. The penalized site I’m thinking of all of a sudden has only 1 page indexed. It was also, incidentally, not a result of trying to spam the search engines. It was a legitimate company that failed to invest in SEO before launching websites for several different countries with the same content.
SEOAly says
So there are instances in which duplicate content issues have resulted in pages being eliminated from the index…contradicting Google’s claim to be able to handle duplicate content issues without any collateral damage.
You’d think that country-specific URLs containing duplicate content would be among those whose rankings aren’t affected by duplicate content…in a perfect world. :)
Kyle Wegner says
Good post, but I think the “filter” versus “penalty” idea can be looked at another way too. We all know that the filter picks 1 page and ranks it while dumping another out of the SERPs. I’ve got that, and it seems fair enough even if Google can be wrong sometimes.
Where the “penalty” portion of the duplicate content issue might take effect is when there are duplicate internal pages on a site. Just yesterday I audited a site that was hosting its home page on 3 separate URLs: http://domain.com, http://www.domain.com, and http://www.domain.com/index/
Now, a filter would be a great thing in this case as no matter what the user will find the site they are looking for and the SERPs will not be overloaded with duplicate content. Fantastic. But what if there IS a Google penalty for duplicate content? Should this website, obviously built by SEO-less webbies, be penalized for trying to serve people content in every way they saw fit? Probably not.
I guess in the end the point is that I’m glad Google clarified the terminology here. While it might seem like common knowledge or trivial to most people, now I have something trustworthy to quote to clients if all of a sudden they hear that they could be penalized for duplicate content.
Not that any of my clients would have forgotten to 301 their homepage directories anyway….
SEOAly says
Yeah…as if there aren’t tens of thousands of sites out there with canonical homepage issues, right? ;)
I guess the question that begs to be asked is this: does anyone know for sure how Google interprets canonical homepage issues? If having canonical homepage issues is deemed “malicious” (meaning it’s treated as an attempt to monopolize search results and prevent competitors’ pages from ranking well), wouldn’t a penalty be warranted?
Or does Google simply attribute such issues to general SEO ignorance and ignore the “duplicate content” issues that result from hosting the same homepage content at multiple URLs? If anyone has documentation of canonical homepage issues having a measurable negative impact on ranking, I’d love to see it!
Eric Werner says
Imagine this scenario (which I’m currently in) – A client has the same site being served on 2 domains. Google has given Domain A a PR 4 and Domain B a PR 3.
So this doesn’t seem to be a full-blown penalty – perhaps more similar to what might happen with canonicalization. (The other version isn’t knocked completely out of the SERPs, but it won’t rank high for anything either.)
Then the problem is that the owner doesn’t like Domain A (the one with all of the PageRank, the aged inbound links, and the high rankings) and wants Domain B to be promoted strongly from now on to eventually replace Domain A.
So some questions emerge – what happens if Domain A disappears – does Domain B move up in the SERPs to replace it? What happens if we just redirect everything from A to B? And what happens if we don’t do anything – does Google eventually kick in the penalty even though they haven’t yet? (It had been this way for years before I got here.)
SEOAly says
You’ve raised an important point, Eric…what should you do in that situation? I faced a similar circumstance earlier this year when I launched a new website for our pet sitting business here in Jacksonville.
The original site was hosted at http://www.welcomehomepetsitting.com and, despite having some on-site SEO issues, had achieved a PR2. I decided to launch a new site at http://www.welcomehomepetsitting.net as part of my SEO testing at the time. Once the .net site was complete and I no longer wanted to keep the .com site up, I redirected the .com to the .net (301, obviously…). It took a couple of months for the Google toolbar to account for the redirect, but the .net now appears to have inherited the PR from the redirected .com.
So, if you were to take down the site located at “Domain A” and 301 redirect that domain to “Domain B” the PR achieved by “Domain A” should flow to “Domain B” via the redirect. It’s my understanding that this is how it is supposed to work, and in my case it appears to have happened as expected…though it did take some time for the PR to be reflected properly (since Google doesn’t update their PR scores regularly…).
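For anyone curious what that kind of domain-level 301 looks like, here is a minimal .htaccess sketch. It assumes Apache with mod_rewrite, and old-domain.com / new-domain.com are just placeholders standing in for “Domain A” and “Domain B”:

```apache
RewriteEngine On

# Permanently (301) redirect every request on the old domain to the same
# path on the new domain so existing links pass their value to the new URLs
RewriteCond %{HTTP_HOST} ^(www\.)?old-domain\.com$ [NC]
RewriteRule ^(.*)$ http://www.new-domain.com/$1 [R=301,L]
```

The important part is the R=301 flag: a temporary (302) redirect doesn’t tell Google the move is permanent, while the 301 signals that the old URLs should be dropped in favor of the new ones.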
Good question! I’d love to hear about others’ experiences with 301 redirects and inheriting PR accordingly…since I can only speak to my one measly experience. ;)
Eric Werner says
That’s awesome that it worked in that case. That makes the situation seem a little less bleak.
I would love to get more info from those with experiences like that before making a move. One person I asked, who had many, many years of experience, said that it would be bad to redirect from the older domain to the younger domain – partly because the older domain had link value from aged links that were ‘grandfathered’ in, and a lot of Google’s newer rules on restricting PR don’t apply since those links were there before the rule changes.
Let us know if you talk to someone who is a super expert on the PR effects of redirects.