Duplicate content is exactly what it sounds like: the same content on your site existing in multiple versions on multiple URLs. A web editor rarely uploads duplicate content deliberately, but depending on which content management system (CMS) you use, you can have lots of duplicate content without knowing about it.
Why is duplicate content a problem?
There are three main issues with duplicate content for a search engine:
They do not know which version they should include in their index and which ones to exclude.
They do not know how to assign link power: consolidate everything on one page or split it across multiple versions.
They do not know which version to display on the search results page.
What does this mean for you?
Search engines look at the web a bit differently than a human does. We see each page as a concept, but search engines see each unique URL as its own page. To maintain the quality of its search results, a search engine rarely displays multiple versions of the same content. Instead, it chooses the version it considers the best result. This weakens the visibility of all the duplicates and may also adversely affect your ranking.
The problem gets even bigger when visitors start linking to the different versions and the link power is split up. Instead of all incoming links pointing to one article, they point to multiple versions, and you do not benefit as much from the link power because probably only one of the versions is indexed. Since links are an important ranking factor, this can negatively affect your ranking.
"In rare cases, where Google finds duplicate content that may appear with the intention of manipulating the rankings and tricking our users, we also perform appropriate adjustments to the indexing and ranking of affected websites. This may result in a negative impact on the site's ranking, or that the site will be completely removed from the Google index so it will no longer appear in search results." - Google Search Console
How is content duplicated?
There are many different causes of duplicate content, and the vast majority are technical. They can be divided into three categories:
URL variations
Because search engines see each URL as a unique page, different types of URL parameters can create duplicate content. This can be anything from filtering with query strings to session IDs that mark a visit.
If you have print-friendly versions of your pages, they can sometimes also create a duplicate URL. It is also common for a CMS to create duplicate content automatically: for example, https://www.yourdomain.com/article-a may also be available at https://www.yourdomain.com/category/article-a. If you have added the article to several categories, the problem may be even bigger. It is also very common in online stores, where a product is included in several different product categories and can be reached in several different ways.
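As a rough illustration of how parameters multiply URLs, the following Python sketch groups URLs that differ only by parameters assumed not to change the content. The parameter names in IGNORED_PARAMS and the helper names are hypothetical; which parameters are safe to ignore depends entirely on your own site.

```python
from collections import defaultdict
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

# Hypothetical parameters assumed not to change the page content.
IGNORED_PARAMS = {"sessionid", "sort", "print", "utm_source", "utm_medium"}


def normalize_url(url: str) -> str:
    """Return the URL without ignored query parameters and without a fragment."""
    parts = urlsplit(url)
    kept = [
        (key, value)
        for key, value in parse_qsl(parts.query, keep_blank_values=True)
        if key.lower() not in IGNORED_PARAMS
    ]
    # Sort the remaining parameters so ?a=1&b=2 and ?b=2&a=1 compare equal.
    query = urlencode(sorted(kept))
    return urlunsplit((parts.scheme, parts.netloc, parts.path, query, ""))


def group_duplicates(urls):
    """Map each normalized URL to the original URLs that collapse into it."""
    groups = defaultdict(list)
    for url in urls:
        groups[normalize_url(url)].append(url)
    # Only groups with more than one member are potential duplicates.
    return {key: members for key, members in groups.items() if len(members) > 1}
```

Running group_duplicates over a list of crawled URLs gives a starting point for deciding which version should be the primary one. It only compares URLs, not the page content itself.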
Incorrect domain management: www vs. non-www and http vs. https
One of the most common causes of duplicate content is failing to redirect between the www and non-www versions of your domain. Technically, www is a subdomain, so search engines treat the two versions as separate URLs. Without thinking about it, you have created a duplicate of your whole site. The same applies if you have an SSL certificate on the site: every page can then exist in both an http and an https version.
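To make the four possible versions concrete, here is a small Python sketch. The function names are made up for this illustration, and the comparison deliberately ignores query strings.

```python
from urllib.parse import urlsplit


def domain_variants(domain: str):
    """Return the four scheme and host combinations for a bare domain."""
    return [
        f"{scheme}://{host}/"
        for scheme in ("http", "https")
        for host in (domain, f"www.{domain}")
    ]


def strip_scheme_and_www(url: str) -> str:
    """Reduce a URL to host (without www) plus path, roughly how a person reads it."""
    parts = urlsplit(url)
    host = parts.netloc.lower()
    if host.startswith("www."):
        host = host[4:]
    return host + (parts.path or "/")


def same_page_for_a_human(url_a: str, url_b: str) -> bool:
    """True if two URLs differ only by http/https and www/non-www."""
    return strip_scheme_and_www(url_a) == strip_scheme_and_www(url_b)
```

For yourdomain.com, domain_variants returns four distinct home page URLs. A visitor sees one page, while a search engine sees four addresses unless they are consolidated.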
Web scraping and copied content
Duplicate content applies not only to articles and blog posts but to all types of text on your site. Many online stores run into major problems with product information when they reuse the text provided by the supplier.
If many stores sell the same product and all use the supplier's product information, the result is duplicate content.
Web scraping can also be a problem. There are programs that scrape content from different websites and then publish it on their own website. It is called scraping because the content is retrieved from the site itself, not from the database behind it.
Usually, these programs do not scrape all the text on your site. Instead, they look for specific information, which they then copy and publish on their own site. Price comparison sites are one example.
How do you fix duplicate content?
To resolve the duplicate content issue, you need to specify which content is the primary one. You can do this in a few different ways, and the one that suits you best depends on how the content is duplicated.
301 redirect
One of the most common and best solutions for duplicate content is to permanently redirect the duplicate pages to the page you selected as the primary. This is the best solution for incorrect domain management, that is, if you have duplicate versions due to www and non-www or http and https.
The duplicated versions often compete against each other both in search results and for links. Once you redirect the duplicates to the primary page, the different versions no longer compete against each other in the search results. Not only that, the relevance and popularity of the primary page increase because it also inherits the link power of the other versions. This makes it much more likely to rank for important keywords. Learn more about redirects.
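The redirect logic can be sketched in a few lines of Python. This is only an illustration: the preferred scheme and host are assumptions, the function names are hypothetical, and in practice the redirect is usually configured in the web server or the CMS rather than written by hand.

```python
from urllib.parse import urlsplit, urlunsplit

# Assumed primary version of the site; replace with your own choice.
PREFERRED_SCHEME = "https"
PREFERRED_HOST = "www.yourdomain.com"


def redirect_target(url: str):
    """Return the primary URL to redirect to, or None if no redirect is needed."""
    parts = urlsplit(url)
    if parts.scheme == PREFERRED_SCHEME and parts.netloc.lower() == PREFERRED_HOST:
        return None
    # Keep the path and query string so each duplicate maps to its own primary page.
    return urlunsplit(
        (PREFERRED_SCHEME, PREFERRED_HOST, parts.path or "/", parts.query, "")
    )


def respond(url: str):
    """Return a (status, headers) pair for the requested URL."""
    target = redirect_target(url)
    if target is None:
        return "200 OK", []
    # Status 301 tells browsers and search engines that the move is permanent.
    return "301 Moved Permanently", [("Location", target)]
```

For example, respond("http://yourdomain.com/article-a") yields a 301 pointing to https://www.yourdomain.com/article-a, while the primary URL itself is served normally.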
Canonical URL
Another common way to handle duplicate content is to add a canonical URL (canonical address). The canonical address tells search engines which web page is the primary one and that the other versions should be treated as copies. Link power that goes to the duplicates is passed through the canonical to the primary page.
A canonical URL is a great choice when your CMS automatically creates multiple URLs. You then do not have to set up a redirect every time you publish new content. Many CMSs handle part of this automatically today, but it is still worth checking. A canonical URL can also be used when the same content is published across multiple domains, such as a guest post. Read more about canonical URL.
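A canonical URL is placed in the head of the page as a link element with rel="canonical". The Python sketch below builds that element. The PRIMARY_URLS mapping and the function names are hypothetical; a CMS would normally hold this information itself.

```python
from html import escape

# Hypothetical mapping from duplicate URLs to the chosen primary URL.
PRIMARY_URLS = {
    "https://www.yourdomain.com/category/article-a": "https://www.yourdomain.com/article-a",
}


def canonical_for(url: str) -> str:
    """Return the primary URL for a page, or the page's own URL if it has no duplicate entry."""
    return PRIMARY_URLS.get(url, url)


def canonical_link_tag(url: str) -> str:
    """Build the link element that goes inside the page's <head>."""
    href = escape(canonical_for(url), quote=True)
    return f'<link rel="canonical" href="{href}">'
```

The category version of the article gets a tag that points to https://www.yourdomain.com/article-a, and the primary page gets a tag that points to itself.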
Meta robots noindex
Some pages often create duplicate URLs, such as the shopping cart in an online store. Since there is no reason for search engines to index the shopping cart, you can add a meta tag stating that the page should not be indexed.
The meta robots tag noindex, follow tells search engines that they should not index the page but may still crawl it and follow its links.
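The tag itself is a single line in the head of the page. As a sketch, the Python below shows the tag and a simple check for whether a page carries it, using only the standard library. The class and function names are invented for this example.

```python
from html.parser import HTMLParser

# The tag as it would appear inside the page's <head>.
NOINDEX_TAG = '<meta name="robots" content="noindex, follow">'


class RobotsMetaParser(HTMLParser):
    """Collect the directives from any <meta name="robots"> tag in a page."""

    def __init__(self):
        super().__init__()
        self.directives = set()

    def handle_starttag(self, tag, attrs):
        attributes = dict(attrs)
        if tag == "meta" and (attributes.get("name") or "").lower() == "robots":
            content = attributes.get("content") or ""
            for part in content.split(","):
                if part.strip():
                    self.directives.add(part.strip().lower())


def is_noindex(html: str) -> bool:
    """Return True if the page asks search engines not to index it."""
    parser = RobotsMetaParser()
    parser.feed(html)
    return "noindex" in parser.directives
```

A check like this can be run over pages such as the shopping cart to confirm that the tag really is present in the delivered HTML.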
Additional tips for managing duplicate content
Always be consistent when you link to pages - always link to the same version.
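Make it clear which version of your domain you prefer, such as https://www.yourdomain.com rather than https://yourdomain.com. Google Search Console used to offer a preferred domain setting for this, but it has been retired, so rely on redirects, canonical URLs, and consistent internal links instead.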
Avoid repeating boilerplate texts such as terms of purchase and copyright notices. Instead, write a shorter text and link to the full text on its own unique page.
Do not publish pages without content. This sometimes happens when you create categories or similar pages that are only placeholders for pages lower down in the structure. If you have such pages, put noindex on them.
Get to know your CMS. In some cases, a blog post is displayed in its entirety on category pages, on tag pages, and in the archive.
Protect your content from web scraping by adding a self-referencing canonical to your pages. It does not help against every scraper, but some scraped copies will then carry a canonical URL pointing back to your content.