blog-cannibalization

Compare original and translation side by side

🇺🇸

Original

English
🇨🇳

Translation

Chinese

Blog Cannibalization - Keyword Overlap Detection

Detect when multiple blog posts compete for the same search keywords. Two modes: local-only analysis (default) and DataForSEO API mode for SERP-level data.

Two Modes

| Mode | Flag | Cost | Data Source |
|------|------|------|-------------|
| Local | (default) | Free | File content analysis via Grep/Read |
| API | `--api` | ~$0.01/call | DataForSEO Page Intersection + Ranked Keywords |

Local mode works without any API keys. API mode requires DataForSEO credentials set as environment variables: `DATAFORSEO_LOGIN` and `DATAFORSEO_PASSWORD`.

Local Mode Workflow

Step 1: Scan Blog Files

Use Glob to find all content files in the target directory:
  • Patterns: `**/*.md`, `**/*.mdx`, `**/*.html`
  • Skip files in `node_modules/`, `.git/`, `drafts/`
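The scan above can be sketched in Python; the helper name and the use of `pathlib` are illustrative (the skill itself uses Glob):

```python
from pathlib import Path

# Directories to skip and extensions to keep, per the patterns above
SKIP_DIRS = {"node_modules", ".git", "drafts"}
EXTENSIONS = {".md", ".mdx", ".html"}

def scan_blog_files(root):
    """Yield content files under root, skipping excluded directories."""
    for path in Path(root).rglob("*"):
        if path.suffix in EXTENSIONS and not SKIP_DIRS & set(path.parts):
            yield path
```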

Step 2: Extract Primary Keywords

For each file, read and extract keyword signals from:
  • Title tag or H1 heading (highest weight)
  • H2 headings (medium weight)
  • First paragraph (supporting signal)
  • Meta description if present in frontmatter
Primary keyword extraction method:
  1. Tokenize title and H1 into 1-gram, 2-gram, and 3-gram phrases
  2. Score each phrase by frequency across title + H2s + first paragraph
  3. Select the top-scoring 2-3 word phrase as the primary keyword
  4. Record secondary keywords from H2 headings
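A simplified sketch of steps 1-3; raw substring counts stand in for the weighted scoring described above, and the function names are illustrative:

```python
import re
from collections import Counter

def ngrams(text, n):
    """Split text into lowercase n-word phrases."""
    words = re.findall(r"[a-z0-9]+", text.lower())
    return [" ".join(words[i:i + n]) for i in range(len(words) - n + 1)]

def primary_keyword(title, h2s, first_paragraph):
    """Pick the 2-3 word phrase from the title that recurs most often
    across the title, H2 headings, and first paragraph."""
    corpus = " ".join([title] + h2s + [first_paragraph]).lower()
    scores = Counter()
    for phrase in ngrams(title, 2) + ngrams(title, 3):
        scores[phrase] = corpus.count(phrase)
    return scores.most_common(1)[0][0] if scores else None
```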

Step 3: Cluster by Similarity

Group posts into clusters using these matching rules (in priority order):
  1. Exact match - identical primary keyword across 2+ posts
  2. Stem match - same root word (e.g., "optimize" vs "optimization")
  3. Semantic overlap - Claude determines that two keywords target the same search intent (e.g., "best CRM software" vs "top CRM tools 2026")
  4. Subset match - one keyword contains another (e.g., "email marketing" vs "email marketing for startups")
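Rules 1, 2, and 4 can be expressed mechanically; rule 3 (semantic overlap) requires judgment and is not shown. A sketch, with a deliberately crude suffix stripper standing in for a real stemmer:

```python
def stem(word):
    """Crude suffix stripper standing in for a real stemmer (assumption)."""
    for suffix in ("ization", "ation", "ize", "ing", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[:-len(suffix)]
    return word

def match_type(kw_a, kw_b):
    """Classify keyword overlap per rules 1, 2, and 4 above."""
    if kw_a == kw_b:
        return "exact"
    if [stem(w) for w in kw_a.split()] == [stem(w) for w in kw_b.split()]:
        return "stem"
    if kw_a in kw_b or kw_b in kw_a:
        return "subset"
    return None
```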

Step 4: Score and Flag

For each cluster with 2+ posts, assess severity and generate a recommendation.

Step 5: Output Report

Display the results table and per-cluster recommendations.

API Mode Workflow (DataForSEO)

Requires the `--api` flag. Uses WebFetch to call DataForSEO endpoints.

Endpoints Used

Page Intersection - find keywords where multiple URLs rank:
POST https://api.dataforseo.com/v3/dataforseo_labs/google/page_intersection/live
Authorization: Basic <base64(login:password)>

{
  "pages": {
    "1": "https://example.com/post-a",
    "2": "https://example.com/post-b"
  },
  "language_code": "en",
  "location_code": 2840
}
Cost: ~$0.01 per call. Returns overlapping keywords with position, volume, CPC.
Ranked Keywords - get all keywords a single URL ranks for:
POST https://api.dataforseo.com/v3/dataforseo_labs/google/ranked_keywords/live

{
  "target": "https://example.com/post-a",
  "language_code": "en",
  "location_code": 2840
}
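Both endpoints can be called the same way. A stdlib sketch is below; note that DataForSEO's live endpoints expect the JSON body wrapped in a task array (an assumption from their API convention, not shown in the snippets above), and the helper name is illustrative:

```python
import base64
import json
import urllib.request

ENDPOINT = "https://api.dataforseo.com/v3/dataforseo_labs/google/page_intersection/live"

def build_request(url_a, url_b, login, password):
    """Build an authenticated POST for the Page Intersection endpoint.
    login/password come from DATAFORSEO_LOGIN / DATAFORSEO_PASSWORD."""
    payload = json.dumps([{
        "pages": {"1": url_a, "2": url_b},
        "language_code": "en",
        "location_code": 2840,
    }]).encode()
    token = base64.b64encode(f"{login}:{password}".encode()).decode()
    return urllib.request.Request(
        ENDPOINT,
        data=payload,
        headers={"Authorization": f"Basic {token}",
                 "Content-Type": "application/json"},
        method="POST",
    )

# To send: json.load(urllib.request.urlopen(build_request(...)))
```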

API Analysis Steps

  1. Collect all published URLs from the user (or sitemap)
  2. Run Ranked Keywords for each URL to build keyword profiles
  3. Run Page Intersection for URL pairs that share keyword clusters
  4. Calculate severity using the formula below
  5. Output enriched report with search volume and position data

Severity Scoring

Four severity levels based on overlap signals:
| Level | Criteria | Action Urgency |
|-------|----------|----------------|
| Critical | Same exact keyword, both pages in top 20 | Immediate |
| High | Same keyword cluster, one page outranks the other | This week |
| Medium | Related keywords with partial SERP overlap | This month |
| Low | Semantic similarity but different confirmed intents | Monitor |

Severity Formula (API Mode)

severity_score = overlap_count x avg_search_volume x (1 / position_gap)
Where:
  • `overlap_count` = number of shared ranking keywords
  • `avg_search_volume` = mean monthly volume of shared keywords
  • `position_gap` = absolute difference in average ranking position (min 1)
Higher score = more urgent cannibalization problem.
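As a worked example of the formula, using `max()` to enforce the minimum gap of 1:

```python
def severity_score(overlap_count, avg_search_volume, position_gap):
    """severity_score = overlap_count x avg_search_volume x (1 / position_gap)."""
    return overlap_count * avg_search_volume / max(position_gap, 1)

# e.g. 5 shared keywords, 1,200 avg monthly searches, positions 3 apart:
# 5 x 1200 / 3 = 2000.0
```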

Severity Heuristic (Local Mode)

Without SERP data, use a simplified scoring:
  • Critical: Exact primary keyword match between posts
  • High: Stem match on primary keyword, or 3+ shared H2 keywords
  • Medium: Semantic overlap on primary keyword
  • Low: Subset match only, or shared secondary keywords
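This simplified scoring is a straightforward cascade; the helper below and its argument names are hypothetical:

```python
def local_severity(match_kind, shared_h2_count=0):
    """Map local-mode overlap signals to a severity level."""
    if match_kind == "exact":
        return "Critical"
    if match_kind == "stem" or shared_h2_count >= 3:
        return "High"
    if match_kind == "semantic":
        return "Medium"
    if match_kind == "subset" or shared_h2_count > 0:
        return "Low"
    return None
```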

Output Format

Summary Table

| Post A | Post B | Shared Keywords | Severity | Recommendation |
|--------|--------|-----------------|----------|----------------|
| /best-crm-tools | /top-crm-software | best crm, crm tools, crm software | Critical | MERGE |
| /email-tips | /email-marketing-guide | email marketing | High | DIFFERENTIATE |
| /seo-basics | /seo-for-beginners | seo basics, beginner seo | Critical | CANONICAL |
| /react-hooks | /react-state-mgmt | react, state | Low | NO ACTION |

Per-Cluster Detail

For each flagged cluster, provide:
  • Both post titles and URLs
  • Full list of overlapping keywords (with volume if API mode)
  • Which post is stronger (more comprehensive, better structured)
  • Specific recommendation with rationale

Recommendations

Four possible actions for each cannibalization cluster:

MERGE

When both pages are thin or cover the same intent with similar depth.
  • Combine the best content from both into one comprehensive post
  • 301 redirect the weaker URL to the merged post
  • Preserve all internal links pointing to either URL

DIFFERENTIATE

When pages serve different intents but keyword targeting overlaps.
  • Shift the primary keyword of the weaker post to a related long-tail
  • Update the title, H1, and meta description to reflect the new focus
  • Add internal links between the two posts to signal distinct topics

CANONICAL

When one post is clearly the authority and the other is a lesser duplicate.
  • Add `rel="canonical"` on the weaker page pointing to the authority
  • Consider noindexing the weaker page if it adds no unique value
  • Link from the weaker page to the authority page

NO ACTION

When intent is genuinely different despite surface-level keyword similarity.
  • Document the reasoning for future audits
  • Monitor rankings quarterly for any position changes
  • Re-evaluate if either post drops in rankings

Error Handling

  • No blog files found: If the directory contains no .md, .mdx, or .html files, report "No blog files found in [directory]" and suggest checking the path
  • DataForSEO credentials missing: In API mode, if credentials are not configured, fall back to local mode automatically and notify the user
  • API rate limits: DataForSEO has per-minute rate limits. If a 429 response is received, wait and retry once. If it persists, switch to local mode for remaining URLs
  • WebFetch failures: If a source URL is unreachable, skip it and note "Unable to verify - source unavailable" in the report
  • Single-post directory: If only one blog post exists, report "Cannibalization analysis requires at least 2 posts" and exit gracefully
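The credentials-missing fallback can be sketched as follows (function name and return shape are illustrative):

```python
import os

def resolve_mode(api_requested):
    """Return (mode, warning): drop to local mode if credentials are absent."""
    if not api_requested:
        return "local", None
    if os.environ.get("DATAFORSEO_LOGIN") and os.environ.get("DATAFORSEO_PASSWORD"):
        return "api", None
    return "local", "DataForSEO credentials missing; falling back to local mode"
```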