Posts tagged Duplicate
You Think YOU Have a Duplicate Content Problem?
Apr 26th
Duplicate content. We all know about it. Countless posts have been written on why it’s bad, how to avoid it. But maybe you’ve got a duplicate content problem and don’t even know it’s there. Or your duplicate content problem is bigger than you realize. So big, it’s epic.
That’s what I discovered recently when auditing a client site. We’re not talking about content replicated across multiple sites. Not scraper sites, or ripoff sites. One site. The original and only source. And it was by forensic tactics that I uncovered exactly how big the problem was. How epic. Orders of magnitude epic.
In this situation, we’re talking about a real estate site. Covering a wide swath of California – offices spread throughout northern and southern California. Billions of dollars in home sales in 2010.
Site: – A Key Metric
Whenever I perform an SEO audit, I run a site: check on Google as one of my first tasks, and ask the client how many pages they really have. This is just to get a feel for how well the site’s currently indexed. This site showed 86,000 pages indexed on my initial check. Except there’s really only about 15,000 pages. Wow. Really? Oh boy…
Now, it’s not uncommon to run a site: check and get fewer pages showing than actually exist. The public display of pages found is only an approximation, and subject to how well a site is indexed, Google’s algorithm at any given moment as well as fluctuations in the results due to competitive factors.
But this is the opposite indexing problem. More than five times as many pages showing as actually exist. So I went back and began to examine the site, my senses on full alert.
1999 Called & Wants Its Programming Methods Back
The next thing to set off my “that’s not right” alarm was the agent pages. They’ve got over 400 of them, and no, it’s not odd that a large real estate site has hundreds of agent pages. The odd part is that once you land on any of those pages, the agent’s ID sticks to the URL of every page you then click in the main navigation. And the home page link no longer goes to the main site home page, but instead goes back to that agent’s home page.
It’s a common programming method – passing identifiers along in the URL string. So I knew right away to check for canonical URL tags – to see whether Google is left to treat each of those URLs as an authentic “unique” page, or whether the site’s coded to say “this other URL is the preferred version”.
No Canonical Tags. Anywhere.
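For anyone who wants to run the same canonical check without viewing source by hand, here is a minimal sketch in Python using only the standard library. The function name and the sample markup are made up for illustration; this is not the tooling used in the audit.

```python
from html.parser import HTMLParser


class CanonicalFinder(HTMLParser):
    """Collects href values from <link rel="canonical"> tags."""

    def __init__(self):
        super().__init__()
        self.canonicals = []

    def handle_starttag(self, tag, attrs):
        if tag != "link":
            return
        attr_map = {name: (value or "") for name, value in attrs}
        # rel can hold several space-separated tokens, so split before comparing
        rel_tokens = attr_map.get("rel", "").lower().split()
        if "canonical" in rel_tokens and attr_map.get("href"):
            self.canonicals.append(attr_map["href"])


def find_canonicals(html_text):
    """Return every canonical URL declared in the given HTML (hypothetical helper)."""
    finder = CanonicalFinder()
    finder.feed(html_text)
    return finder.canonicals


# Illustrative markup only; example.com stands in for the real site.
sample = '<html><head><link rel="canonical" href="http://www.example.com/listings/"></head></html>'
print(find_canonicals(sample))
```

An empty list for a page, as on this site, means search engines get no hint about which URL is the preferred one.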
Okay quick math time – 15,000 pages × 400 agents. That’s six million pages that could potentially be indexed. Except I was only seeing just over one percent of that. Still way too many for reality. Yet not the “OMG” disaster it could have been. Or was it?
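The quick math, spelled out as a small sketch. The three input figures come from the audit above; the variable names are just for illustration.

```python
real_pages = 15_000         # pages the site really has (approximate)
agents = 400                # agent IDs that can be appended to any URL (approximate)
indexed_reported = 86_000   # pages shown by the initial site: check

# Every real page can appear once per agent ID.
potential_urls = real_pages * agents
share_indexed = indexed_reported / potential_urls
surplus = indexed_reported - real_pages

print(potential_urls)           # 6000000
print(f"{share_indexed:.2%}")   # 1.43%
print(surplus)                  # 71000
```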
Forensic SEO Tactics
Here’s where I really got curious – do I really need to go through all of those results to try and figure out what the heck is happening? Nope – not me. No way. No how. Instead, I let my brain chew on the problem.
And thought – let’s search Google first, just to see if any of these agent appended URLs are actually showing up. Sure enough, every one I manually tried was there.
From there, I performed an advanced site: check. In these particular URLs there’s a series of letters used as the variable identifier – so everything after XYZ in the URL string is the agent’s unique ID. So my search then looked like this: Site:www.Domain.com +XYZ
And guess what I found? Not 70,000 pages or so (the “overage” from the real count to the “pages found” count). What I found was
509,000 pages found
Great. Just great.
So what the heck is going on?
More tests. This time, I ran it with a different chunk of code in those agent URLs. And what did I get?
1.2 million pages found
Wow. This was a complete mess. And my first thought was – how could such completely insane variations exist?
Google – “We Do The Best We Can”
What turned out to be the problem was multi-layered. At any given time, the GoogleBot attempts to crawl the site. At a certain point, it’s just going to get tired of exploring a site, and run away, on to the next shiny object out there. Especially when those agent pages are several layers down in the link chain. Which means all the pages linked from there are also “technically” (but not really) even further down in the link chain.
And then even if some of those pages end up in the index, at some point, Google’s going to see “Hey this content is exactly the same as all this other content.”
And even though claims have been made (Thanks Matt!) that “Google does a pretty good job of figuring things out”, this is a great example of why that’s an imperfect system. Essentially, along the way of processing all this data, the system’s going to choke. And in this particular case, may even barf a little.
But overall, considering the fact that over a million “pages” are actually in their index, they’re able to pare it down by orders of magnitude, down to that 86,000 (still ridiculously over-counted) page range.
Good Enough Isn’t Good Enough
So Google’s system, without further guidance, is only able to pare it down to 86,000 pages. That still leaves roughly 70,000 of those pages as duplicates. Which means there’s a BIG problem still.
How does Google know which version is the most important? Most of the results in the first dozen pages of results for various searches ARE the primary site version, without the agent appendage. But not all. And for some phrases, it’s all agent pages that show up first.
Which in turn means that the pages that matter the most are NOT being given their full value. On a massive scale.
The Fix Ain’t So Easy
So, you’re saying to yourself – just slap that canonical tag in there. Problem solved.
Well sure, that’s important. Except that’s only good for the future experience. The site’s been like this forever. Would YOU want to be the one who ensures the 301 Redirects are implemented properly for that mess? Well, if you’re a REGEX genius, maybe you would. Me, not so much.
Then there’s the need (yes, it’s a NEED) to get the entire site recoded to STOP USING URL strings. Because I don’t care how much Google says all you need is canonical tags. Because not every search engine or link provider (intentionally or otherwise) is on board with that.
And even to Google, it’s only “an indicator”. It’s not a guarantee.
No, the ONLY proper, BEST PRACTICES tasking here is to strip out all those URL parameters. Just use cookies instead, for cryin out loud.
Which means a coding nightmare for some poor code monkey.
And more QA to ensure it’s all really done properly. Across the ENTIRE site.
Fortunately I’m not the one who has to code it. But I’m the one who’s got to do the QA on it. Yeah. Thanks. I’ll be over here curled up in a fetal ball. Crying. Uncontrollably. At least until I can rant about the process on Twitter.
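To make the redirect and QA work a little less tear-inducing, here is a rough sketch of the mapping from an agent-appended URL back to its clean version. It assumes the agent ID travels as a query-string parameter, and the name `agentid` is made up. The real site's URL format may differ, so treat this as an illustration of the idea, not the actual fix.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

AGENT_PARAM = "agentid"  # hypothetical parameter name, not the site's real one


def strip_agent_param(url, param=AGENT_PARAM):
    """Return the URL without the agent parameter; untouched if it is absent."""
    parts = urlsplit(url)
    pairs = parse_qsl(parts.query, keep_blank_values=True)
    kept = [(key, value) for key, value in pairs if key.lower() != param.lower()]
    if len(kept) == len(pairs):
        return url  # no agent parameter found, nothing to change
    return urlunsplit(parts._replace(query=urlencode(kept)))


def redirect_target(url):
    """Return the clean URL a 301 should point to, or None if no redirect is needed."""
    clean = strip_agent_param(url)
    return clean if clean != url else None


print(redirect_target("http://www.example.com/listings?city=napa&agentid=123"))
print(redirect_target("http://www.example.com/listings?city=napa"))
```

Run over a crawl export, a helper like this gives a list of old-to-new URL pairs that can be spot-checked during QA.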
Check out the SEO Tools guide at Search Engine Journal.
View full post on Search Engine Journal
Duplicate Content Phantom: Don’t Be Duped, Be Informed
Mar 9th
Duplicate content has always been a hot topic among webmasters; mostly because no one really knows what it is and the rumors persist.
And Google doesn’t help much either. Sometimes, I think of it as a hyperactive 3-year-old, who is incredibly sharp in some areas, but not so much in others.
So the best way to go is to keep it simple, stay under the radar, and shoot for the middle of the road.
With that said, let’s figure out what duplicate content is, what it isn’t, and what you should do to stay on top of it.
What is duplicate content?
“Duplicate content generally refers to substantive blocks of content within or across domains that either completely match other content or are appreciably similar. Mostly, this is not deceptive in origin. Examples of non-malicious duplicate content could include:
- Discussion forums that can generate both regular and stripped-down pages targeted at mobile devices
- Store items shown or linked via multiple distinct URLs
- Printer-only versions of web pages”
Identical or substantially similar content.
Within your own domain or across others.
Most of it is normal and acceptable.
So far, so good.
Why Is Duplicate Content a Problem?
How would you like to search for the best pecan pie recipe just to find that every single result on the first page turned out to be the exact same recipe?
Users don’t like the same result and Google doesn’t like crawling the same results.
For a search engine, it’s also a processing consideration. If there is substantial duplication, the crawl/indexation rates might be dampened. In short, the site can lose some ‘trust’.
Two Types of Duplicate Content
We all have our own ideas of duplicate content and most of the time they boil down to “Don’t republish the same article to multiple directories. Instead, spend countless hours spinning that same article to the point where it doesn’t make sense any longer and THEN publish it to a zillion and one directories. That will surely trick all the PhDs working for Google into ranking my site pretty highly.”
Now in the spirit of “being informed”, let’s take a look at the 2 types of duplicate content you see around, shall we?
- Cross-domain type: this one is the most commonly thought of and includes the same content, which (often unintentionally) appears on several external sites.
- Within-your-domain type: the one that Google is actually mostly concerned about, i.e. that appears (often unintentionally) in several different places within your site.
Let’s now do a little more exploring into each type and see what Google really thinks about it.
Off-Site Content Syndication
There is absolutely nothing wrong with syndicating your content to different sites per se.
NOTHING WRONG WITH IT!
Here’s what happens when your content gets syndicated: Google will simply go through all the available versions and show the one that they find the most appropriate for a specific search.
Mind you, the most appropriate version might not be the one you’d prefer to have ranked. That’s why it’s very important that each piece of syndicated content includes a link back to your original post – I assume it would be on your site. That way Google will trace the original version and will most likely (but not always) display it in its search results.
Per Matt Cutts:
I would be mindful that taking all your articles and submitting them for syndication all over the place can make it more difficult to determine how much the site wrote its own content vs. just used syndicated content. My advice would be 1) to avoid over-syndicating the articles that you write, and 2) if you do syndicate content, make sure that you include a link to the original content. That will help ensure that the original content has more PageRank, which will aid in picking the best documents in our index. (Source)
Black Hat Syndication
However, here’s the other side of the content syndication coin: the content is deliberately duplicated across the web in an attempt to manipulate search engine rankings or to generate more traffic.
This results in repeated content showing up in SERPs, upsets the searchers, and forces Google to clean house.
“In the rare cases in which Google perceives that duplicate content may be shown with intent to manipulate our rankings and deceive our users, we’ll also make appropriate adjustments in the indexing and ranking of the sites involved. As a result, the ranking of the site may suffer, or the site might be removed entirely from the Google index, in which case it will no longer appear in search results.” (Resource – Google Webmaster Tools Help)
On-Site Content Syndication
On-site duplicate content problems are much more common and guess what: they are entirely UNDER YOUR CONTROL, which makes it very easy to fix them.
The first step to identifying the potential weak spots on your blog is learning more about your content management system.
For example, a blog post can show up on the home page of your blog, as well as category page, tag page, archives, etc. – THAT’S the true definition of duplicate content.
We, the users, have the common sense to understand that it’s still the same post; we just get to it via different URLs. However, search engines see those URLs as unique pages with exactly the same content = duplicate content.
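To see how the same text under several URLs can be spotted, here is a minimal sketch that groups URLs by a hash of their content. The URLs and text are invented, and exact matching is a deliberate simplification: it says nothing about how any search engine actually detects duplicates, and real archive pages would need fuzzier comparison.

```python
import hashlib
from collections import defaultdict


def group_by_content(pages):
    """pages maps URL -> body text; returns lists of URLs whose text is identical."""
    groups = defaultdict(list)
    for url, body in pages.items():
        # Collapse whitespace and case so trivial formatting differences don't matter
        normalized = " ".join(body.split()).lower()
        digest = hashlib.sha256(normalized.encode("utf-8")).hexdigest()
        groups[digest].append(url)
    return [sorted(urls) for urls in groups.values() if len(urls) > 1]


# Invented example: one post reachable through three different URLs.
pages = {
    "http://www.example.com/my-post/": "Pecan pie,  the easy way.",
    "http://www.example.com/category/pie/": "Pecan pie, the easy way.",
    "http://www.example.com/tag/pecan/": "pecan pie, the easy way.",
    "http://www.example.com/about/": "All about me.",
}
print(group_by_content(pages))
```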
How to Take Matters into Your Own Hands
Here are some practical “non-techie” steps you can take to minimize the presence of dupe content on your site:
- Take care of your canonicalization issues. In other words, www.trafficgenerationcafe.com, trafficgenerationcafe.com, trafficgenerationcafe.com/index.html are one and the same site as far as we are concerned, but 3 different sites as far as search engines are concerned. You need to pick your fave and stick with it. If you don’t know how, here are the instructions: WWW vs non-WWW: Why You Should Put All Your Links in One Basket
- Be consistent in your internal link building: don’t link to /page/ and /page and /page/index.htm – if links to your pages are split among the various versions, it can cause lower per-page PageRank.
- Include your preferred URLs in your sitemap
- Use 301 redirects: If you have restructured your site (for instance, changed your permalink structure to a more SEO-friendly one), use 301 redirects (“RedirectPermanent”) in your .htaccess file or, even simpler, use one of the many Redirection plugins available in your WordPress plugin directory.
- Use rel=”canonical”
- Use Google Webmaster Parameter Handling Tool
- Minimize repetition: i.e. don’t post your affiliate disclaimer on every single page; rather create a separate page for it and link to it when needed.
- Managing your archive pages: Avoid duplicate content issues by displaying excerpts on your archive pages instead of full posts. You really want to give your readers just a hint of the content and direct them back to the original posts. To accomplish that, open your theme’s archive.php and replace the_content with the_excerpt. Hint: make sure your category and tag pages also display excerpts only.
- Country-specific content: Google is more likely to know that .de indicates Germany-focused content, for instance, than /de or de.example.com.
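The first two steps in the list above (pick one version of your address and link to it consistently) can be sketched as a small helper that turns any variant into the preferred form. The host names and the list of index file names are assumptions for illustration; swap in your own.

```python
from urllib.parse import urlsplit, urlunsplit

PREFERRED_HOST = "www.example.com"  # hypothetical: the version you decided to keep
BARE_HOST = "example.com"           # hypothetical: the variant to fold into it
INDEX_FILES = ("index.html", "index.htm")


def preferred_url(url):
    """Map www / non-www / index-file variants of a URL to one preferred form."""
    parts = urlsplit(url if "://" in url else "http://" + url)
    host = parts.netloc.lower()
    if host == BARE_HOST:
        host = PREFERRED_HOST
    path = parts.path or "/"
    head, _, last = path.rpartition("/")
    if last in INDEX_FILES:
        path = head + "/"  # /page/index.htm becomes /page/
    return urlunsplit((parts.scheme, host, path, parts.query, parts.fragment))


for variant in ("www.example.com", "example.com", "example.com/index.html"):
    print(preferred_url(variant))
```

All three variants print the same address. The sketch leaves /page versus /page/ alone, since which of those is right depends on how your server is set up.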
This is a good time to remind myself that I am not writing a novel…
Oh, wait a minute: one more important issue: the robots.txt file.
According to TopRankBlog.com:
Google doesn’t recommend blocking duplicate URLs with robots.txt, because if they can’t crawl a URL they have to assume it’s unique. It’s better to let everything get crawled and to clearly indicate which URLs are duplicates…. Robots.txt controls crawling, not indexing. Google may index something (because of a link to it from an external site) but not crawl it. That can create a duplicate content issue.
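To make the “crawling, not indexing” point concrete, here is a small sketch using Python’s standard robots.txt parser. The rules and URLs are invented for illustration. All the check answers is whether a crawler may fetch a URL; it says nothing about whether that URL can end up in the index.

```python
from urllib.robotparser import RobotFileParser

# Invented rules: block the printer-friendly copies from being crawled.
rules = """\
User-agent: *
Disallow: /print/
"""

parser = RobotFileParser()
parser.parse(rules.splitlines())

print_copy_crawlable = parser.can_fetch("Googlebot", "http://www.example.com/print/post-1")
original_crawlable = parser.can_fetch("Googlebot", "http://www.example.com/post-1")

print(print_copy_crawlable)  # False
print(original_crawlable)    # True
```

A blocked URL that picks up an external link could still be indexed without ever being crawled, which is the situation the quote warns about.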
Let’s move on, shall we?
Is There a Duplicate Content Penalty?
I’ll have Google answer this daunting question.
Here’s a quote by Susan Moskwa, Webmaster Trends Analyst from Google:
A lot of people think that if they have duplicate content that they’ll be penalized. In most cases, Google does not penalize sites for accidental duplication. Many, many, many sites have duplicate content.
Google may penalize sites for deliberate or manipulative duplication. For example: auto generated content, link networks or similar tactics designed to be manipulative.
Susan further explained when webmasters should not worry about duplicate content:
- Common, minimal duplication.
- When you think the benefit outweighs potential ranking concerns. Consider your cost of fixing the duplicate content situation vs. the benefit you would receive.
- Remember: duplication is common and search engines can handle it.
How exactly does Google handle it?
While pulling up the search results, Google will basically collapse the duplicates leaving only the most relevant, in their opinion of course, page in the SERPs for that specific query. As I explained before, the way Google determines the most relevant result is based upon a myriad of factors and the only thing you can do for your part is to always link back to your original post.
Scraping Be Gone!
A word on the recent Google algorithm change:
My post mentioned that “we’re evaluating multiple changes that should help drive spam levels even lower, including one change that primarily affects sites that copy others’ content and sites with low levels of original content.” That change was approved at our weekly quality launch meeting last Thursday and launched earlier this week. (Matt Cutts – source)
What does it mean to the average webmaster?
We can all do a little chicken dance, since the probability of scraped (stolen, in other words) content ranking above the original articles that we put blood, sweat, and tears into, is minimal.
Google is rightfully going to war against all the autoblogs that don’t have what it takes to produce content of their own and all they do is republish other people’s work in hopes of ranking highly in search engines, bringing traffic to their crappy websites and making some money off AdSense, paid advertisement, and such.
Good riddance!
If you find that another site is duplicating your content by scraping (misappropriating and republishing) it, it’s unlikely that this will negatively impact your site’s ranking in Google search results pages. If you do spot a case that’s particularly frustrating, you are welcome to file a DMCA request to claim ownership of the content and request removal of the other site from Google’s index.
Duplicate Content Marketing Takeaway:
1. Dupe content doesn’t cause your site to be penalized.
2. Google is getting better at picking the best version of your content to be displayed in SERPs and ignoring the rest.
3. Almost all dupe content issues are easy to fix and should be fixed.
4. Don’t worry, be happy – don’t be afraid, be informed.
View full post on Search Engine Journal
New Search Engine Unveils Duplicate Content Indicator for Websites – San Francisco Chronicle (press release)
Jan 17th
San Francisco Chronicle (press release): The SEO Engine(TM) recently announced a new feature that may change the way Website owners add new content in the future. This new type of SEO Software, …
View full post on SEO – Google News
Duplicate Content and Ways around It
Dec 7th
What Is Duplicate Content
Duplicate content is a forbidden practice in Google World, and probably every other search engine realm. If you’re unaware of what that phrase means, it refers to content, or even a close resemblance to content, that is used on a second site after being posted on a first. The benevolent intent is to prevent outright plagiarism. But the side effects are nearly worse than the original theft:
- Inability to post material on one’s own site after guest posting on another
- Not being able to post articles anywhere else even if you’ve removed them from your home site
- Being bludgeoned by lower Google ranking if another site removes your guest articles and you re-post to your own
These are some of the unintended results of the duplicate content policy.
The Old Days
It’s no news to any even semi-aware writer that the Internet has changed the way we both read and write. The keyboard has replaced the typewriter, the Kindle is replacing the book, and streaming video must be cutting into theater movie profits, just as television did in the early 1950’s. I’m not going to make a case for the evil or the good of this (but know that I lean toward the former); it’s simply the current, inevitable result of the technology we cherish and nearly worship.
Not long ago by my standards, up to the late 1990’s, the method of submission of one’s writing involved sending a neatly-typed, 250 word per page manuscript in a large manila envelope with a similar, self-addressed, folded envelope. In all instances of my submissions and rejections, I never failed to receive some form of correspondence in return. That is no longer true when my articles are submitted by email. The process is too simple. Simplicity has the effect of inflating participation. Anyone can do it. Maybe a good thing, maybe not, depending on the intent. Many simply want to be published. Fine. And that leads to the crux of the content issue.
Websites are easy to create. With some experience, a site can be put online and articles posted to it in ten minutes. And your site becomes a showplace for all that writing talent. Now, enter the duplicate content gremlin. This happened to me in my naïve days only months ago. My website with a rank in the eight figure range contained my best stuff. I wanted it to be read by a wider audience than my girlfriend and a couple of online buddies, so I decided to submit the pieces to a website with clout, and gain a solid back link in the process. Surprise! I was told sternly that my work could not be used on this hefty site because it had been indexed. An offer to remove it from my own website would make no difference due to the reasons given above. It had been branded as USED; thus, it was tainted merchandise.
The Workaround: Option One
What to do? Is there no way to have articles on line without their being stamped “Do not transfer under penalty of Google”? The good news: There is not one way but two ways to accomplish this. The rejection of materials by that savvy webmaster had me scrambling for answers. The question was, how can I market my material short of uploading it time after time for possible guest posting? Using what I saw as my nemesis, Google, brought me to a new breed of website I didn’t know existed, a kind of marketplace for articles. What a cyber-godsend!
Two of these hybrid miracles stood out. One is run by Cathy Stucker, BloggerLinkUp. In it I was able to simply register and throw my name in the hat as a guest poster and as one requesting guest posts. The other, and the one I use on a daily basis is Ann Smarty’s MyBlogGuest, a curious name for a system that works like a charm. I prefer this site because it contains a section called “MyArticles” in which anything new I’ve written can be “posted” without being indexed. The first third of an article, a sizeable excerpt, goes on the market block for any of thousands to examine. The result has been wonderful; many of my pieces have been picked up by sites, and I’ve garnered some guest posts for my own site.
The Workaround: Option Two
The second option, for WordPress users, allows a handy choice in the Dashboard under Privacy: “I would like to block search engines, but allow normal visitors.” I created a WordPress site with this idea in mind. It serves as a showcase where I can send prospective publishers to review a variety of my original writings and without an intrusion of the duplicate content issue. While there, the user can see my biographical information, photos, and the About Me page.
These two options allow writers to promote their work freely without restrictions from the Google machinery. For greater visibility, sites such as Ann Smarty’s are the most appropriate; for personal satisfaction I somehow prefer the WordPress route. And, being wise, finally, to the crafty ways of indexing, I use both techniques.
Out of Control
Science fiction author William Gibson, the creator of the word “cyberspace,” commented that in technology we are moving as fast as we possibly can with utterly no idea of where we’re headed. The Google Internet dictatorship is a perfect example. In our quest for easy access to information, we’ve sacrificed, in a real sense, much of our freedom of expression. As we press for more security, our options shrink. What can be done? By us, nothing. We only drive along the internet highways, we don’t build them. It’s not even a matter of protests, like this one, which can make a difference. The whole situation, including the duplicate content dilemma, is in the hands of Google.
View full post on Search Engine Journal
Why Duplicate Content Is Good For You
Jun 29th
There are only two things that matter in SEO: writing great content for your users, and building links into that content. Everything else is a distraction.
This advice comes from Dan Crow, the Product Manager for Google Crawl Systems, who speaks regularly at SEO conferences. Forget everything else, he says, just focus on two things: great content and great links.
We’ve built our SEO content agency around that philosophy, so we don’t worry about all the details like keyword density, 301 redirects, or even duplicate content. In our experience, Google will overlook all those things, if you just focus on great content and great links.
To illustrate, here’s a case study on why duplicate content is not so bad, and can actually help you achieve top rankings.

Our client has a credit card finder website that he wanted to rank on the keyword “credit card concierge.” First, we focused on the content, coming up with an idea about using a credit card concierge service to perform silly errands for us, then rating them on the speed and efficiency with which they completed our insane tasks.
We wrote the piece, posted it to the client site, then focused on building links into it. It wasn’t long before we had achieved a Google top ten ranking for the keyword “credit card concierge”:
Then we reached out to Tim Ferriss, the New York Times bestselling author of The Four Hour Workweek. Tim specializes in “lifestyle design” services, and we thought credit card concierge services would be up his alley. We asked him for a link back, but Tim liked the piece enough to republish it on his blog … word-for-word, with a small text link at the bottom crediting the client’s site.
The traditional SEO response would be to turn down this offer and/or run screaming in terror, because of the “duplicate content” issue. We’ve all heard that Google will penalize duplicate content, you’ll lose your rankings, and the rivers will turn to blood. But we asked the two fundamental questions. Was it good content? (Yes.) Was it a good link? (Oh yes.)
The results were incredible. As soon as Tim published the piece on his blog, it went megaviral (which is bigger than “viral” but smaller than “gigaviral”), receiving hundreds of retweets, Diggs, and reposts. The blog post ultimately landed on the homepage of StumbleUpon.com, where it received over 300,000 Stumbles!
For the client, that one link from Tim’s blog resulted in hundreds of new customers to his site — all those people who read about the credit card concierge service wanted to sign up for it. And best of all, our client kept his ranking on the Google Top 10 — now sharing it with a newcomer: Tim’s repost of the article.
So here we clearly see that the “duplicate content” helped everyone involved. Tim got increased search rankings, and a load of viral traffic. The client got increased search rankings, and a load of new customers. Users got great content, and a load of chuckles.
Create great content for your users. Then build links back into that content.
When we relentlessly focus on these two fundamentals, everybody wins — our clients, our users, and ourselves. That “circle of goodness” is what Google is looking for, more than site map optimization or META tags.
But the circle extended even further. A few weeks later, I called Chase Visa, the credit card we used for the concierge experiment. It seemed the piece had caused quite a stir at the company, with a flood of new applications for the service. “It’s actually been a fantastic marketing piece for us,” the concierge confided. “And quite frankly, I thought it was hilarious.”
Great content and great links. Everybody wins.
View full post on Search Engine Journal



