Your website likely suffers from at least some content cannibalization, and you might not even know it.
Cannibalization hurts organic traffic and revenue: The impact can range from key pages not ranking to algorithm issues caused by low domain quality.
However, cannibalization is hard to detect, can change over time, and exists on a spectrum.
It’s the “microplastics of SEO.”
In this Memo, I’ll show you:
- How to identify and fix content cannibalization reliably.
- How to automate content cannibalization detection.
- An automated workflow you can try out right now: the Cannibalization Detector, my new keyword cannibalization tool.
I could never have done this without Nicole Guercia from AirOps. I designed the concept and stress-tested the automated workflow, but Nicole built the whole thing.
How To Think About Content Cannibalization The Right Way
Before jumping into the workflow, we need to clarify a few guiding principles about content cannibalization that are often misunderstood.
The biggest misconception about cannibalization is that it happens at the keyword level.
It actually happens at the user-intent level.
We all need to stop thinking of this concept as keyword cannibalization and start thinking of it as content cannibalization based on user intent.
With this in mind, cannibalization…
- Is a moving target: When Google updates its understanding of intent during a core update, two pages that previously didn’t compete suddenly can.
- Exists on a spectrum: A page can compete with one page or with several, with an intent overlap anywhere from 10% to 100%. It’s hard to say exactly how much overlap is okay without results and context.
- Doesn’t stop at rankings: Looking for two pages that get a “substantial” number of impressions or rankings for the same keyword(s) can help you spot cannibalization, but it’s not a very accurate method. It’s not sufficient evidence.
- Needs regular check-ups: You need to check your site for cannibalization regularly and treat your content library as a “living” ecosystem.
- Can be sneaky: Many cases aren’t clear-cut. For example, international content cannibalization isn’t obvious. A /en directory meant to serve all English-speaking countries can compete with a /en-us directory for the U.S. market.
Different types of sites have fundamentally different weaknesses when it comes to cannibalization.
My model for site types is the integrator vs. aggregator model. Online retailers and other marketplaces face fundamentally different cases of cannibalization than SaaS or D2C companies.
Integrators cannibalize between pages. Aggregators cannibalize between page types.
- With aggregators, cannibalization typically happens when two page types are too similar. For example, you might have two page types that may or may not compete with each other: “points of interest in {city}” and “things to do in {city}”.
- With integrators, cannibalization typically happens when companies publish new content without maintenance and without a plan for the existing content. A big part of the problem is that, past a certain number of articles, it becomes harder to keep an overview of what you have and what keywords/intent each piece targets (I found the tipping point to be around 250 articles).
How To Spot Content Cannibalization

Content cannibalization can show one or more of the following symptoms:
- “URL flickering”: At least two URLs alternate in ranking for one or more keywords.
- A page loses traffic and/or ranking positions after another one goes live.
- A new page hits a ranking plateau for its main keyword and can’t break into the top 3 positions.
- Google doesn’t index a new page, or several pages of the same page type.
- Exact duplicate titles appear in Google’s search index.
- Google reports “Crawled, not indexed” or “Discovered, not indexed” for URLs that don’t have thin content or technical issues.
Since Google doesn’t give us a clear signal for cannibalization, the best way to measure similarity between two or more pages is cosine similarity between their tokenized embeddings (I know, it’s a mouthful).
But here’s what it means: You compare how similar two pages are by turning their text into numbers and seeing how closely those numbers point in the same direction.
Think of it like a chocolate chip cookie recipe:
- Tokenization = Break down each recipe (e.g., page content) into ingredients: flour, sugar, chocolate chips, and so on.
- Embeddings = Convert each ingredient into numbers, like how much of each ingredient is used and how important each one is to the recipe’s identity.
- Cosine similarity = Compare the recipes mathematically. This gives you a number between 0 and 1. A score of 1 means the recipes are identical, while 0 means they’re completely different.
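To make the analogy concrete, here’s a minimal sketch in Python. Simple word counts stand in for real embeddings (a production setup would use an embedding model), but the cosine math is exactly the same:

```python
import math
from collections import Counter

def cosine_similarity(text_a: str, text_b: str) -> float:
    """Compare two texts by turning them into word-count vectors
    and measuring how closely those vectors point the same way."""
    a, b = Counter(text_a.lower().split()), Counter(text_b.lower().split())
    vocab = set(a) | set(b)
    dot = sum(a[t] * b[t] for t in vocab)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

print(cosine_similarity("chocolate chip cookie recipe",
                        "chocolate chip cookie recipe"))  # identical recipes → 1.0
print(cosine_similarity("chocolate chip cookie recipe",
                        "points of interest in berlin"))  # nothing shared → 0.0
```

Identical texts score 1, texts with no shared ingredients score 0, and real pages land somewhere in between.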
Follow this process to scan your site and find cannibalization candidates:
- Crawl: Scrape your site with a tool like Screaming Frog (optionally excluding pages that serve no SEO purpose) to extract the URL and meta title of every page.
- Tokenization: Turn the words in both the URL and the title into word pieces that are easier to work with. These are your tokens.
- Embeddings: Turn the tokens into numbers to do “word math.”
- Similarity: Calculate the cosine similarity between all URLs and meta titles.
Ideally, this gives you a shortlist of URLs and titles that are too similar.
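Here’s a minimal sketch of that scan in Python, assuming your crawl export is a list of (URL, meta title) pairs. Word-count vectors stand in for real embeddings, the example pages are made up, and the 0.6 threshold is just illustrative:

```python
import math
import re
from collections import Counter
from itertools import combinations

def tokens(url: str, title: str) -> Counter:
    # Tokenize the URL slug and meta title into lowercase word pieces.
    return Counter(re.findall(r"[a-z0-9]+", (url + " " + title).lower()))

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in set(a) | set(b))
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0

def shortlist(pages, threshold=0.6):
    """Return page pairs whose URL + title similarity exceeds the threshold."""
    vecs = {url: tokens(url, title) for url, title in pages}
    flagged = []
    for u1, u2 in combinations(vecs, 2):
        score = cosine(vecs[u1], vecs[u2])
        if score >= threshold:
            flagged.append((u1, u2, round(score, 2)))
    return flagged

pages = [
    ("/blog/keyword-cannibalization", "Keyword Cannibalization: A Guide"),
    ("/blog/content-cannibalization", "Content Cannibalization: A Guide"),
    ("/blog/site-architecture", "Site Architecture Basics"),
]
for pair in shortlist(pages):
    print(pair)  # only the two near-duplicate guides are flagged
```

The two guide pages get flagged as candidates; the site architecture page does not.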
In the next step, you can apply the following process to make sure they truly cannibalize each other:
- Extract content: Cleanly isolate the main content (exclude navigation, footer, ads, etc.). Optionally clean out certain elements, like stop words.
- Chunking or tokenization: Either split the content into meaningful chunks (sentences or paragraphs) or tokenize it directly. I prefer the latter.
- Embeddings: Embed the tokens.
- Entities: Extract named entities from the tokens and weight them higher in the embeddings. In essence, you check which embeddings are “known things” and give them more power in your analysis.
- Aggregation of embeddings: Aggregate token/chunk embeddings with weighted averaging (e.g., TF-IDF) or attention-weighted pooling.
- Cosine similarity: Calculate the cosine similarity between the resulting embeddings.
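The entity-weighting and aggregation steps can be sketched like this. The toy token vectors and the hard-coded entity set are stand-ins for a real embedding model and a real named-entity-recognition step; only the weighted-pooling mechanics carry over to a real pipeline:

```python
import math

# Toy 3-dimensional token embeddings stand in for a real embedding model,
# and the ENTITIES set stands in for a real NER step.
TOKEN_VECS = {
    "google":  [0.9, 0.1, 0.0],
    "update":  [0.2, 0.8, 0.1],
    "core":    [0.3, 0.7, 0.2],
    "ranking": [0.1, 0.6, 0.5],
}
ENTITIES = {"google"}
ENTITY_BOOST = 2.0  # named entities count double in the page vector

def page_vector(tokens):
    """Weighted average of token embeddings; entities get more power."""
    total, weight_sum = [0.0, 0.0, 0.0], 0.0
    for t in tokens:
        if t not in TOKEN_VECS:
            continue
        w = ENTITY_BOOST if t in ENTITIES else 1.0
        total = [acc + w * v for acc, v in zip(total, TOKEN_VECS[t])]
        weight_sum += w
    return [v / weight_sum for v in total] if weight_sum else total

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb) if na and nb else 0.0

page_a = page_vector(["google", "core", "update"])
page_b = page_vector(["google", "ranking", "update"])
print(round(cosine(page_a, page_b), 2))
```

Because “google” is boosted as an entity, the two pages score as highly similar even though only part of their vocabulary overlaps.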
You can use my Apps Script if you’d like to try this out in Google Sheets (but I have a better alternative for you in a moment).
About cosine similarity: It’s not perfect, but it’s good enough.
Yes, you can fine-tune embedding models for specific topics.
And yes, you can layer on advanced embedding models like sentence transformers, but this simplified process is usually sufficient. No need to turn it into an astrophysics project.
How To Fix Cannibalization
Once you’ve identified cannibalization, you should take action.
But don’t forget to adjust your long-term approach to content creation and governance. If you don’t, all this work to find and fix cannibalization will be wasted.
Fixing Cannibalization In The Short Term
The short-term action you should take depends on the degree of cannibalization and how quickly you can act.
“Degree” means how similar the content across two or more pages is, expressed as cosine or content similarity.
Though not an exact science, in my experience a cosine similarity above 0.7 counts as “high,” while anything below 0.5 is “low.”
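Those rule-of-thumb thresholds translate into a trivial helper. The band names and the treatment of the 0.5–0.7 gray zone are my framing, not an exact science:

```python
def similarity_band(score: float) -> str:
    """Bucket a cosine similarity score using the rule-of-thumb
    thresholds from the text: above 0.7 is high, below 0.5 is low."""
    if score > 0.7:
        return "high"
    if score < 0.5:
        return "low"
    return "medium (judge with context)"

print(similarity_band(0.82))  # high
print(similarity_band(0.60))  # medium (judge with context)
```

Anything in the middle band deserves a manual look before you act.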

What to do if the pages have a high degree of similarity:
- Canonicalize or noindex the page when cannibalization stems from technical issues like parameter URLs, or when the cannibalizing page is irrelevant for SEO, like paid landing pages. In that case, canonicalize the parameter URL to the non-parameter URL (or noindex the paid landing page).
- Consolidate with another page when it’s not a technical issue. Consolidation means combining the content and redirecting the URLs. I suggest taking the older and/or worse-performing page and redirecting it to a new, better page. Then transfer any valuable content to the new variant.
What to do if the pages have a low degree of similarity:
- Noindex or remove (status code: 410) when you don’t have the capacity or ability to make content changes.
- Disambiguate the intent focus of the content when you have the capacity and the overlap isn’t too strong. In essence, you differentiate the parts of the pages that are too similar.
Fixing Cannibalization In The Long Term
You have to take long-term action to adjust your strategy or production process, because content cannibalization is a symptom of a bigger issue, not a root cause.
(Unless we’re talking about Google changing its understanding of intent during a core algorithm update, which has nothing to do with you or your team.)
The most critical long-term changes you need to make are:
- Create a content roadmap: SEO integrators should maintain a living spreadsheet or database of all SEO-relevant URLs, their main target keywords, and their intent to tighten editorial oversight. Whoever owns the content roadmap needs to ensure there is no overlap between articles and other page types. Writers need a clear target intent for new and existing content.
- Develop a clear site architecture: The counterpart of a content map for SEO aggregators is a site architecture map, which is simply an overview of the different page types and the intent they target. It’s critical to underpin each intent, as you define it, with example keywords that you verify regularly (“Are we still ranking well for these keywords?”) to check it against Google’s understanding and against competitors.
The last question is: “How do I know when content cannibalization is fixed?”
The answer: when the symptoms mentioned in the earlier chapter go away.
- Indexing issues resolve.
- URL flickering goes away.
- No duplicate titles appear in Google’s search index.
- “Crawled, not indexed” and “Discovered, not indexed” issues decrease.
- Rankings stabilize and break through a plateau (if the page has no other apparent issues).
And, after working with my clients under this manual framework for years, I decided it was time to automate it.
Introducing: A Fully Automated Cannibalization Detector
Together with Nicole, I used AirOps to build a fully automated AI workflow that runs through 37 steps to detect cannibalization within minutes.
It performs a thorough analysis of content cannibalization by examining keyword rankings, content similarity, and historical data.
Below, I break down the most important steps it automates on your behalf:
1. Initial URL Processing
The workflow extracts and normalizes the domain and brand name from the input URL.
This foundational step establishes the target site’s identity and creates the baseline for all subsequent analysis.
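The exact logic inside the AirOps workflow isn’t public, but a naive version of this normalization step might look like the sketch below. The example URL and the brand-guessing heuristic are purely illustrative assumptions:

```python
from urllib.parse import urlparse

def domain_and_brand(url: str) -> tuple:
    """Normalize an input URL to its domain and guess a brand name from
    it (a naive stand-in for the workflow's real normalization logic)."""
    host = urlparse(url).netloc.lower().removeprefix("www.")
    brand = host.split(".")[0].replace("-", " ").title()
    return (host, brand)

print(domain_and_brand("https://www.growth-memo.com/blog/cannibalization"))
# → ('growth-memo.com', 'Growth Memo')
```

A real implementation would also need to handle subdomains, country-code TLDs, and brands that don’t match their domain name.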

2. Target Content Analysis
To ensure the system has quality source material to analyze and compare against competitors, step 2 involves:
- Scraping the page.
- Validating and analyzing the HTML structure for main content extraction.
- Cleaning the article content and generating target embeddings.

3. Keyword Analysis
Step 3 reveals the target URL’s search visibility and potential vulnerabilities by:
- Analyzing ranking keywords via Semrush data.
- Filtering branded versus non-branded terms.
- Identifying SERP overlap with competing URLs.
- Conducting historical ranking analysis.
- Determining page value based on multiple metrics.
- Analyzing position-differential changes over time.

4. Competing Content Analysis (Iteration Over Competing URLs)
Step 4 gathers more context on cannibalization by running each competing URL from the search results through the previous steps.

5. Final Report Generation
In the final step, the workflow cleans up the data and generates an actionable report.

Try The Automated Content Cannibalization Detector

Try the Cannibalization Detector and check out an example report.
A few things to note:
- This is an early version. We plan to optimize and improve it over time.
- The workflow can time out due to a high number of requests. We intentionally limit usage so we don’t get overwhelmed by API calls (they cost money). We’ll monitor usage and might temporarily raise the limit, so if your first attempt isn’t successful, try again in a few minutes. It might just be a temporary spike in usage.
- I’m an advisor to AirOps but was neither paid nor otherwise incentivized to build this workflow.
Please leave your feedback in the comments.
We’d love to hear how we can take the Cannibalization Detector to the next level!
Boost your skills with Growth Memo’s weekly expert insights. Subscribe for free!
Featured Image: Paulo Bobita/Search Engine Journal