Posts tagged Duplicate
Do You Have Duplicate Content Issues Across Domain? Google Will Now Alert You
Oct 31st
Today, Google Webmaster Tools launched a new message alert to let site owners know when a particular URL doesn’t appear because Google sees it as a duplicate of a URL on a different domain. In the blog post announcing the feature and in an in-depth help topic, they provide details on how…
Please visit Search Engine Land for the full article.
View full post on Search Engine Land: News & Info About SEO, PPC, SEM, Search Engines & Search Marketing
Google Webmaster Tools Provides Details On Duplicate Content Across Domains
Oct 31st
Today, Google Webmaster Tools launched a new report to let site owners know when a particular URL isn’t indexed because Google sees it as a duplicate of a URL on a different domain. In the blog post announcing the feature, they provide details on how they identify duplicate clusters of…
View full post on Search Engine Land: News & Info About SEO, PPC, SEM, Search Engines & Search Marketing
Hot At Sphinn: Search Conference Tips, Duplicate Content Problems & More
Jun 13th
With the fifth edition of SMX Advanced happening last week, discussion on our sister site Sphinn focused on how to optimize your search conference experience. Our “Discussion of the Week” asked people to share their Best Tips For Search Conference Attendees and whether you’re a…
View full post on Search Engine Land: News & Info About SEO, PPC, SEM, Search Engines & Search Marketing
Massive Duplicate Content Problem- More Revealed
May 24th
So last month, I posted an article here entitled “You think YOU have a duplicate content problem?” where I described a duplicate content nightmare of epic proportions. Essentially, I found a site with hundreds of replications of every page on the site, and outlined steps I was going to need to task to the developers to fix it.
Well, since that time, I did some more digging in order to provide the full tasking plan for implementation. And I found that the problem was MUCH worse than my original assessment had suggested. Stupid worse. And it looks like it’s going to be a major battle to get it all resolved…
Back when I first found the problem, I had estimated there should only be about 15,000 pages, yet discovered that Google was displaying 86,000 pages. And from there, I found that Google was indexing a million pages, but in their internal automated effort to only show the “right set”, ended up not doing such a great job.
Which meant, at the time, that those 15,000 pages weren’t getting all the value they really deserved.
More Digging
All of that info I’d initially uncovered and figured out was purely based on mid-level audit work. However, when it came time for me to actually write up the findings in a clear, concise, plain English document to be provided to the site’s developers, I needed to go further. I needed to really examine, and show examples of, links where the problem existed.
And I needed to map out the fix – detailing SEO and Information Architecture best practices. Because when you’re telling developers that “the hundreds of hours you put into building this site caused massive problems that you’ll now need to fix”, you’d better be thorough. And prepared for push-back from someone who emotionally or psychologically may not be willing to admit or acknowledge their role in the problem.
More Details
If you recall, it’s a real estate site. With several offices in several counties. Over 400 agents spread throughout. Well it turns out there are actually over 25,000 homes for sale in their system. And the site offers drill-down navigation – county to city, to neighborhood, and finally to individual properties.
If you add all those county, city, and neighborhood pages, plus the hundreds of agent pages, it turns out there’s just over 26,000 actual pages.
Oh - Look! Several Duplication Problems
Okay, so in my last article, I explained how if you go to an agent’s bio page, and then click from there to any other page on the site, all the URLs get appended with that agent’s ID. That is where the majority of the duplication comes from. Thousands of property pages, each one replicated with every agent’s unique ID appended to the URL.
But Wait! There’s More!
During my final write-up, I scanned the duplicate pages being indexed at Google. And guess what? It turns out that every county has two different URLs you can use to get to that county page. Apparently the code allows for multiple URLs – one was the way the site was originally architected. Another is how it was modified to work after they went live.
Except nobody realized you have to implement 301 Redirects when you do that.
Think that’s bad? Well a similar problem exists with every city in every county. And yes, every neighborhood within every city.
That’s over 130 pages that all have two different URLs you can use (and many of both versions are indexed at Google, thank you very much).
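The missing redirect step can be sketched in a few lines. This is only an illustration: the article never shows the real URLs, so both the legacy pattern (`/listings.php?county=...`) and the preferred pattern (`/homes/<county>/`) below are hypothetical stand-ins, and a real site would more likely do this in its web server or framework routing.

```python
import re

# Hypothetical URL shapes; the real first- and second-generation
# URLs are not shown in the article.
LEGACY_COUNTY = re.compile(r"^/listings\.php\?county=(?P<county>[a-z-]+)$")


def redirect_for(path):
    """Return (301, new_location) for a legacy county URL.

    Returns None when the path does not match the legacy pattern,
    meaning no redirect is needed.
    """
    match = LEGACY_COUNTY.match(path)
    if match is None:
        return None
    return 301, f"/homes/{match.group('county')}/"


print(redirect_for("/listings.php?county=orange"))
print(redirect_for("/homes/orange/"))
```

The point is simply that each retired URL answers with a permanent redirect to the one surviving version, so only one URL per county, city, or neighborhood page stays indexable.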
But Wait! There’s More!
And guess what happened when I went into Google Webmaster Tools to see if there was anything it could reveal about this problem? Yeah, I discovered 100,000 404 Errors listed, all from the month of May!
Now I normally don’t worry too much about 404 errors. Most big sites are bound to have some. Sure, it’s best practice to address them all as they crop up. Yet in most cases, if there’s only a handful, it’s a low priority task.
Except when it’s THIS massive.
Especially when most of them have LINKS POINTING TO THEM.
How To End Up with 100,000 404 Errors. In Under A Month.
It turns out there were three primary causes for all these 404 errors. First, when this site was redesigned and rebuilt last year, there were initially URL structure issues. Recall how I mentioned earlier that this was how the duplicate content problem came about in the county/town/neighborhood system?
Well in the county/town system, those 1st version URLs still work.
They don’t, however, work in other sections of the site. Those sections got new URL structure in a way that their 1st generation URLs now go to a dead end. 404. Not found.
Except that’s only applicable for a handful of these.
Then there’s the fact that the old site – the one that existed before this rebuild, had some sort of site-level email linking scheme. Don’t ask me how, or why. All I know is there are hundreds of URLs that somehow got into the Google index at some point, where the URLs point to an email folder on that site. And within that email folder, there were all sorts of links pointing to property pages. Bizarre. To say the least.
The really massive 404 count however, comes from the old property URL structure on that now defunct site. For whatever reason, when the new site was built, nobody thought – “Hey – we’re scrapping this old site. So maybe we should 301 redirect all those property pages”.
And even before this site was rebuilt, nobody ever thought back on the old site “Hey – as properties get sold, maybe we should have an automated 301 set up for every one of those”.
Clean-Up on Aisle 3
So, as you can guess, the problems on this particular site are far more chaotic, entangled, and painful than I thought when I first wrote this up last month.
Essentially, the entire site’s URL structure needs to be cleaned up. Which is awesome for me. Because in my tasking document, I not only communicated that all those duplicate town / city / neighborhood pages need to be eliminated / 301’d. I went further. And said “throw out BOTH versions”. And replace them with THIS syntax.
That’s right – I went for it – truly polished, User friendly AND SEO friendly URL structure.
Because I’m a nice guy.
Not So Fast, Mister!
It turned out that the head of development for this particular site was very cooperative. Quite willing, without push-back, to revamp the entire county/town/neighborhood system. With my preferred URLs. That was just awesome to hear.
Until I learned that all was not so joyous.
As it turns out, the really BIG duplicate content problem? Where they need to strip out the Agent IDs from the URLs, and replace that with a browser cookie system?
Yeah – not so much. The answer was a resounding, emphatic, “Not Possible.”
Oh No You Didn’t!
Okay so I’m not a world class web engineer. I don’t code complex sites in my sleep. I have, however, in the past, coded entire complex shopping cart systems, with multi-layered discounting, five variations of feature options, multiple-shipping method and pricing options, secure membership features, and much more. From scratch.
And so I know a thing or three about cookies.
Except, unfortunately, I didn’t create THIS site. So I wasn’t aware, until this bombshell discussion, that those Agent URLs get embedded in special email messages that go out to people who sign up for property alerts.
And they get syndicated out to national real estate sites.
Yeah, welcome to my little world.
So for now, all the other tasking I asked for is going to be worked on. At some point in the next who knows whenever.
But that agentID thing in the URL? They’re going to have to get back to us on that. Because I said – think about how you can resolve this. Because right now, it’s killing the site. And “not possible” is, well, not acceptable.
And just to cover the bases, I’m chewing on how this can be resolved. In case they come back with a “We really thought about it and we just can’t do it”. I’ve already come up with what I think is a solution.
However I need to chew on it and get together with a developer friend, a guy who happens to be just this side of rocket scientist.
And then, if I ever DO get this all worked out, I’ll write another follow-up article. Because it’s good to cleanse the soul like this, yet it’s also good karma to share the love in the form of “here’s how we did it – so you don’t have to go through the pain we did…”.
Check out the SEO Tools guide at Search Engine Journal.
View full post on Search Engine Journal
You Think YOU Have a Duplicate Content Problem?
Apr 26th
Duplicate content. We all know about it. Countless posts have been written on why it’s bad, how to avoid it. But maybe you’ve got a duplicate content problem and don’t even know it’s there. Or your duplicate content problem is bigger than you realize. So big, it’s epic.
That’s what I discovered recently when auditing a client site. We’re not talking about content replicated across multiple sites. Not scraper sites, or ripoff sites. One site. The original and only source. And it was by forensic tactics that I uncovered exactly how big the problem was. How epic. Orders of magnitude epic.
In this situation, we’re talking about a real estate site. Covering a wide swath of California – offices spread throughout northern and southern California. Billions of dollars in home sales in 2010.
Site: – A Key Metric
Whenever I perform an SEO audit, I run a site: check on Google as one of my first tasks, and ask the client how many pages they really have. This is just to get a feel for how well the site’s currently indexed. This site showed 86,000 pages indexed on my initial check. Except there’s really only about 15,000 pages. Wow. Really? Oh boy…
Now, it’s not uncommon to run a site: check and get fewer pages showing than actually exist. The public display of pages found is only an approximation, and subject to how well a site is indexed, Google’s algorithm at any given moment, as well as fluctuations in the results due to competitive factors.
But this is the opposite indexing problem. More than five times as many pages showing as actually exist. So I went back and began to examine the site, my senses on full alert.
1999 Called & Wants Its Programming Methods Back
What set off the next bell in my “that’s not right” process was the site’s 400-plus agent pages. No, it’s not odd that a large real estate site has hundreds of agent pages. It’s that once you land on any of those pages, the next time you click on any page in the main navigation, the agent’s ID is stuck on the URL. And the home page link no longer goes to the main site home page, but instead goes back to that agent’s home page.
It’s a common programming method – passing identifiers along in the URL string. Except I know right away to then check for canonical URL tags – to see if those are being picked up by Google as authentic “unique” pages, or if the site’s coded to say “don’t index this version”.
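That canonical check is easy to script. Here is a minimal sketch using only Python’s standard library; `find_canonical` is a name made up for this example, and the sample markup and example.com URL are illustrative, not taken from the audited site.

```python
from html.parser import HTMLParser


class CanonicalFinder(HTMLParser):
    """Remember the href of the first <link rel="canonical"> seen."""

    def __init__(self):
        super().__init__()
        self.canonical = None

    def handle_starttag(self, tag, attrs):
        if tag != "link" or self.canonical is not None:
            return
        attr = dict(attrs)
        # rel may hold several space-separated tokens, in any case.
        rels = (attr.get("rel") or "").lower().split()
        if "canonical" in rels:
            self.canonical = attr.get("href")


def find_canonical(html):
    """Return the canonical URL declared in the page, or None."""
    finder = CanonicalFinder()
    finder.feed(html)
    return finder.canonical


page = '<html><head><link rel="canonical" href="https://www.example.com/property/123"></head></html>'
print(find_canonical(page))
print(find_canonical("<html><head><title>No tag here</title></head></html>"))
```

Run against an agent-appended URL and its plain counterpart, a result of None on both is the “no canonical tags” finding described next.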
No Canonical Tags. Anywhere.
Okay, quick math time – 15,000 pages × 400 agents. That’s six million pages that could potentially be indexed. Except I was only seeing just over one percent of that. Still way too many for reality. Yet not the “OMG” disaster it could have been. Or was it?
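For anyone who wants to check the arithmetic, here it is spelled out, using only the figures quoted above (15,000 real pages, 400 agents, 86,000 pages shown by the site: check):

```python
real_pages = 15_000    # pages the site should have
agents = 400           # agent IDs that can be appended to any URL
shown_by_google = 86_000

# Every real page can exist once per agent ID.
potential_urls = real_pages * agents
share_shown = shown_by_google / potential_urls

print(f"{potential_urls:,} potential URLs")    # 6,000,000 potential URLs
print(f"{share_shown:.1%} of them displayed")  # 1.4% of them displayed
```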
Forensic SEO Tactics
Here’s where I really got curious – do I really need to go through all of those results to try and figure out what the heck is happening? Nope – not me. No way. No how. Instead, I let my brain chew on the problem.
And thought – let’s search Google first, just to see if any of these agent appended URLs are actually showing up. Sure enough, every one I manually tried was there.
From there, I performed an advanced site: check. In these particular URLs there’s a series of letters used as the variable identifier – so everything after XYZ in the URL string is the agent’s unique ID. So my search then looked like this: Site:www.Domain.com +XYZ
And guess what I found? Not the 70,000 or so pages (the “overage” from the real count to the “pages found” count). What I found was
509,000 pages found
Great. Just great.
So what the heck is going on?
More tests. This time, I ran it with a different chunk of code in those agent URLs. And what did I get?
1.2 million pages found
Wow. This was a complete mess. And my first thought was – how could such completely insane variations exist?
Google – “We Do The Best We Can”
What turned out to be the problem was multi-layered. At any given time, the GoogleBot attempts to crawl the site. At a certain point, it’s just going to get tired of exploring a site, and run away, on to the next shiny object out there. Especially when those agent pages are several layers down in the link chain. Which means all the pages linked from there are also “technically” (but not really) even further down in the link chain.
And then even if some of those pages end up in the index, at some point, Google’s going to see “Hey this content is exactly the same as all this other content.”
And even though claims have been made (Thanks Matt!) that “Google does a pretty good job of figuring things out”, this is a great example of why that’s an imperfect system. Essentially, along the way of processing all this data, the system’s going to choke. And in this particular case, may even barf a little.
But overall, considering the fact that over a million “pages” are actually in their index, they’re able to pare it down by orders of magnitude, down to that 86,000 (still ridiculously over-counted) page range.
Good Enough Isn’t Good Enough
So Google’s system, without further guidance, is only able to pare it down to 86,000 pages. That still leaves roughly 70,000 of those pages being duplicates. Which means there’s a BIG problem still.
How does Google know which version is the most important? Most of the results in the first dozen pages of results for various searches ARE the primary site version, without the agent appendage. But not all. And for some phrases, it’s all agent pages that show up first.
Which in turn means that the pages that matter the most are NOT being given their full value. On a massive scale.
The Fix Ain’t So Easy
So, you’re saying to yourself – just slap that canonical tag in there. Problem solved.
Well sure, that’s important. Except that’s only good for the future experience. The site’s been like this forever. Would YOU want to be the one who ensures the 301 Redirects are implemented properly for that mess? Well, if you’re a REGEX genius, maybe you would. Me, not so much.
Then there’s the need (yes, it’s a NEED) to get the entire site recoded to STOP USING URL strings. Because I don’t care how much Google says all you need is canonical tags. Because not every search engine or link provider (intentionally or otherwise) is on board with that.
And even to Google, it’s only “an indicator”. It’s not a guarantee.
No, the ONLY proper, BEST PRACTICES tasking here is to strip out all those URL parameters. Just use cookies instead, for cryin out loud.
Which means a coding nightmare for some poor code monkey.
And more QA to ensure it’s all really done properly. Across the ENTIRE site.
Fortunately I’m not the one who has to code it. But I’m the one who’s got to do the QA on it. Yeah. Thanks. I’ll be over here curled up in a fetal ball. Crying. Uncontrollably. At least until I can rant about the process on Twitter.
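To make the “use cookies instead” idea concrete, here is a rough sketch of one way it might work. It is an assumption on my part, not the site’s actual code: the parameter name `agentid`, the cookie name `agent`, and the example.com URLs are all hypothetical. An incoming agent-appended URL is split into the clean URL (the 301 target) and the agent ID, which then travels in a cookie.

```python
from http.cookies import SimpleCookie
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

AGENT_PARAM = "agentid"  # hypothetical name; the real parameter is not shown


def strip_agent_id(url):
    """Split a URL into (clean_url, agent_id).

    agent_id is None when the URL carries no agent parameter.
    """
    parts = urlsplit(url)
    pairs = parse_qsl(parts.query, keep_blank_values=True)
    agent_id = next((v for k, v in pairs if k == AGENT_PARAM), None)
    kept = [(k, v) for k, v in pairs if k != AGENT_PARAM]
    clean = urlunsplit(parts._replace(query=urlencode(kept)))
    return clean, agent_id


def agent_cookie_header(agent_id):
    """Build the value of a Set-Cookie header that remembers the agent."""
    cookie = SimpleCookie()
    cookie["agent"] = agent_id
    cookie["agent"]["path"] = "/"
    return cookie["agent"].OutputString()


clean, agent = strip_agent_id("https://www.example.com/property/123?agentid=A42&view=map")
print(clean)   # https://www.example.com/property/123?view=map
print(agent)   # A42
print(agent_cookie_header(agent))
```

A server could answer the agent-appended URL with a 301 to the clean URL plus that Set-Cookie header, so the visitor still gets the agent’s branding while only one URL per page stays in play.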
View full post on Search Engine Journal
Duplicate Content Phantom: Don’t Be Duped, Be Informed
Mar 9th
Duplicate content has always been a hot topic among webmasters; mostly because no one really knows what it is and the rumors persist.
And Google doesn’t help much either. Sometimes, I think of it as a hyperactive 3 year-old, who is incredibly sharp in some areas, but not so much in others.
So the best way to go is to keep it simple, stay under the radar, and shoot for the middle of the road.
With that said, let’s figure out what duplicate content is, what it isn’t, and what you should do to stay on top of it.
What is duplicate content?
“Duplicate content generally refers to substantive blocks of content within or across domains that either completely match other content or are appreciably similar. Mostly, this is not deceptive in origin. Examples of non-malicious duplicate content could include:
- Discussion forums that can generate both regular and stripped-down pages targeted at mobile devices
- Store items shown or linked via multiple distinct URLs
- Printer-only versions of web pages”
Identical or substantially similar content.
Within your own domain or across others.
Most of it is normal and acceptable.
So far, so good.
Why Is Duplicate Content a Problem?
How would you like to search for the best pecan pie recipe just to find that every single result on the first page turned out the exact same recipe?
Users don’t like the same result and Google doesn’t like crawling the same results.
For a search engine, it’s also a processing consideration. If there is substantial duplication, the crawl/indexation rates might be dampened. In short, the site can lose some ‘trust’.
Two Types of Duplicate Content
We all have our own ideas of duplicate content and most of the time they boil down to “Don’t republish the same article to multiple directories. Instead, spend countless hours spinning that same article to the point where it doesn’t make sense any longer and THEN publish it to a zillion and one directories. That will surely trick all the PhDs working for Google into ranking my site pretty highly.”
Now in the spirit of “being informed”, let’s take a look at the 2 types of duplicate content you see around, shall we?
- Cross-domain type: this one is the most commonly thought of and includes the same content, which (often unintentionally) appears on several external sites.
- Within-your-domain type: the one that Google is actually mostly concerned about, i.e. that appears (often unintentionally) in several different places within your site.
Let’s now do a little more exploring into each type and see what Google really thinks about it.
Off-Site Content Syndication
There is absolutely nothing wrong with syndicating your content to different sites per se.
NOTHING WRONG WITH IT!
Here’s what happens when your content gets syndicated: Google will simply go through all the available versions and show the one that they find the most appropriate for a specific search.
Mind you the most appropriate version might not be the one you’d prefer to have ranked. That’s why it’s very important that each piece of syndicated content includes a link back to your original post – I assume it would be on your site. That way Google will trace the original version and will most likely (but not always) display it in its search results.
Per Matt Cutts:
I would be mindful that taking all your articles and submitting them for syndication all over the place can make it more difficult to determine how much the site wrote its own content vs. just used syndicated content. My advice would be 1) to avoid over-syndicating the articles that you write, and 2) if you do syndicate content, make sure that you include a link to the original content. That will help ensure that the original content has more PageRank, which will aid in picking the best documents in our index. (Source)
Black Hat Syndication
However, here’s the other side of content syndication coin: the content is deliberately duplicated across the web in an attempt to manipulate search engine rankings or to generate more traffic.
This results in repeated content showing up in SERPs, upsets the searchers, and forces Google to clean out the house.
“In the rare cases in which Google perceives that duplicate content may be shown with intent to manipulate our rankings and deceive our users, we’ll also make appropriate adjustments in the indexing and ranking of the sites involved. As a result, the ranking of the site may suffer, or the site might be removed entirely from the Google index, in which case it will no longer appear in search results.” (Resource – Google Webmaster Tools Help)
On-Site Content Syndication
On-site duplicate content problems are much more common and guess what: they are entirely UNDER YOUR CONTROL, which makes it very easy to fix them.
The first step to identifying the potential weak spots on your blog is learning more about your content management system.
For example, a blog post can show up on the home page of your blog, as well as category page, tag page, archives, etc. – THAT’S the true definition of duplicate content.
We, the users, have the common sense to understand that it’s still the same post; we just get to it via different URLs. However, search engines see them as unique pages with exactly the same content = duplicate content.
How to Take Matters into Your Own Hands
Here are some practical “non-techie” steps you can take to minimize the presence of dupe content on your site:
- Take care of your canonicalization issues. In other words, www.trafficgenerationcafe.com, trafficgenerationcafe.com, trafficgenerationcafe.com/index.html are one and the same site as far as we are concerned, but 3 different sites as far as search engines are concerned. You need to pick your fave and stick with it. If you don’t know how, here are the instructions: WWW vs non-WWW: Why You Should Put All Your Links in One Basket
- Be consistent in your internal link building: don’t link to /page/ and /page and /page/index.htm – if links to your pages are split among the various versions, it can cause lower per-page PageRank.
- Include your preferred URLs in your sitemap
- Use 301 redirects: If you have restructured your site (for instance, changed your permalink structure to a more SEO-friendly one), use 301 redirects (“RedirectPermanent”) in your .htaccess file or, even simpler, use one of the many Redirection plugins available in your WordPress plugin directory.
- Use rel="canonical"
- Use Google Webmaster Parameter Handling Tool
- Minimize repetition: i.e. don’t post your affiliate disclaimer on every single page; rather, create a separate page for it and link to it when needed.
- Managing your archive pages: Avoid duplicate content issues by displaying excerpts on your archive pages instead of full posts. You really want to give your readers just a hint of the content and direct them back to the original posts. To accomplish that, open the archive.php file of your theme and replace the_content with the_excerpt. Hint: make sure your category and tag pages also display excerpts only.
- Country-specific content: Google is more likely to know that .de indicates Germany-focused content, for instance, than /de or de.example.com.
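To illustrate the first step in the list, here is a small sketch of what “pick your fave and stick with it” means in practice. It assumes the www version was chosen as the preferred one (the post doesn’t say which to pick) and that index.html and index.htm are the only directory-index file names in use; `preferred_url` is a name made up for this example.

```python
from urllib.parse import urlsplit, urlunsplit

PREFERRED_HOST = "www.trafficgenerationcafe.com"  # assumption: www is the fave
KNOWN_HOSTS = {"trafficgenerationcafe.com", "www.trafficgenerationcafe.com"}
INDEX_FILES = ("index.html", "index.htm")


def preferred_url(url):
    """Map the common variants of a URL onto one preferred form."""
    # Accept bare host names like "trafficgenerationcafe.com/index.html".
    parts = urlsplit(url if "//" in url else "http://" + url)
    host = parts.netloc.lower()
    if host in KNOWN_HOSTS:
        host = PREFERRED_HOST
    path = parts.path or "/"
    for name in INDEX_FILES:
        if path.endswith("/" + name):
            path = path[: -len(name)]  # keep the trailing slash
            break
    return urlunsplit((parts.scheme, host, path, parts.query, ""))


for variant in (
    "www.trafficgenerationcafe.com",
    "trafficgenerationcafe.com",
    "trafficgenerationcafe.com/index.html",
):
    print(preferred_url(variant))
```

All three variants come out as http://www.trafficgenerationcafe.com/, which is the single version your internal links, sitemap, and 301 redirects should point to.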
This is a good time to remind myself that I am not writing a novel…
Oh, wait a minute, one more important issue: the robots.txt file.
According to TopRankBlog.com:
Google doesn’t recommend blocking duplicate URLs with robots.txt, because if they can’t crawl a URL they have to assume it’s unique. It’s better to let everything get crawled and to clearly indicate which URLs are duplicates…. Robots.txt controls crawling, not indexing. Google may index something (because of a link to it from an external site) but not crawl it. That can create a duplicate content issue.
Let’s move on, shall we?
Is There a Duplicate Content Penalty?
I’ll have Google answer this daunting question.
Here’s a quote by Susan Moskwa, Webmaster Trends Analyst from Google:
A lot of people think that if they have duplicate content that they’ll be penalized. In most cases, Google does not penalize sites for accidental duplication. Many, many, many sites have duplicate content.
Google may penalize sites for deliberate or manipulative duplication. For example: auto-generated content, link networks or similar tactics designed to be manipulative.
Susan further explained when webmasters should not worry about duplicate content:
- Common, minimal duplication.
- When you think the benefit outweighs potential ranking concerns. Consider your cost of fixing the duplicate content situation vs. the benefit you would receive.
- Remember: duplication is common and search engines can handle it.
How exactly does Google handle it?
While pulling up the search results, Google will basically collapse the duplicates leaving only the most relevant, in their opinion of course, page in the SERPs for that specific query. As I explained before, the way Google determines the most relevant result is based upon a myriad of factors and the only thing you can do for your part is to always link back to your original post.
Scraping Be Gone!
A word on the recent Google algorithm change:
My post mentioned that “we’re evaluating multiple changes that should help drive spam levels even lower, including one change that primarily affects sites that copy others’ content and sites with low levels of original content.” That change was approved at our weekly quality launch meeting last Thursday and launched earlier this week. (Matt Cutts – source)
What does it mean to the average webmaster?
We can all do a little chicken dance, since the probability of scraped (stolen, in other words) content ranking above the original articles that we put blood, sweat, and tears into, is minimal.
Google is rightfully going to war against all the autoblogs that don’t have what it takes to produce content of their own, and all they do is republish other people’s work in hopes of ranking highly in search engines, bringing traffic to their crappy websites and making some money off AdSense, paid advertisement, and such.
Good riddance!
If you find that another site is duplicating your content by scraping (misappropriating and republishing) it, it’s unlikely that this will negatively impact your site’s ranking in Google search results pages. If you do spot a case that’s particularly frustrating, you are welcome to file a DMCA request to claim ownership of the content and request removal of the other site from Google’s index.
Duplicate Content Marketing Takeaway:
1. Dupe content doesn’t cause your site to be penalized.
2. Google is getting better at picking the best version of your content to be displayed in SERPs and ignoring the rest.
3. Almost all dupe content issues are easy to fix and should be fixed.
4. Don’t worry, be happy – don’t be afraid, be informed.
View full post on Search Engine Journal
New Search Engine Unveils Duplicate Content Indicator for Websites – San Francisco Chronicle (press release)
Jan 17th
San Francisco Chronicle (press release): The SEO Engine(TM) recently announced a new feature that may change the way Website owners add new content in the future. This new type of SEO Software, …
View full post on SEO – Google News
