morphiq-scan


Pipeline Position

流水线位置

Step 1 of 4 — entry point.
  • Input: A domain URL from the user.
  • Output: Scan Report (JSON) → consumed by morphiq-rank.
  • Data contract: See PIPELINE.md §1 for the Scan Report schema.

Purpose

Morphiq Scan audits a website's readiness for AI visibility. It answers two questions: "Can AI systems parse and understand this site?" (Technical Score) and "Where are the gaps preventing AI citations?" (issue identification). The output feeds morphiq-rank for prioritization.

Workflow

Step 1: Discover Pages

  1. Fetch {domain}/robots.txt — extract sitemap URLs
  2. Fetch {domain}/sitemap.xml if not found in robots.txt
  3. Classify discovered pages by type using URL pattern matching
  4. Select up to 10 marketing-relevant pages, prioritized: home → pricing → features → product → solutions → about → blog → other → documentation
  5. Exclude non-marketing pages from scoring (contact, login, signup, legal, demo, careers, changelog)
For page type classification and URL patterns, read references/page-type-rules.md.
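
The selection logic in steps 3–5 can be sketched as follows. This is a minimal illustration: the priority order and exclusion list come straight from the steps above, while the function name and the `(url, page_type)` input shape are assumptions for the sketch.

```python
# Priority order for marketing-relevant pages (step 4) and
# excluded non-marketing types (step 5), as listed above.
PRIORITY = ["home", "pricing", "features", "product", "solutions",
            "about", "blog", "other", "documentation"]
EXCLUDED = {"contact", "login", "signup", "legal", "demo",
            "careers", "changelog"}

def select_pages(classified, limit=10):
    """classified: list of (url, page_type) pairs produced by
    URL pattern matching. Returns up to `limit` pages in priority order."""
    rank = {t: i for i, t in enumerate(PRIORITY)}
    eligible = [(url, t) for url, t in classified if t not in EXCLUDED]
    # Sort by priority position; unrecognized types sort last.
    eligible.sort(key=lambda page: rank.get(page[1], len(PRIORITY)))
    return eligible[:limit]
```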

Step 2: Audit Policy Files (Category 5 — Domain Level)

  1. robots.txt — Validate existence, format, AI crawler access (GPTBot, Google-Extended, Anthropic-AI, PerplexityBot)
  2. llms.txt — Validate existence and quality (≥500 chars = good, <500 = thin)
  3. llms-full.txt — Check existence
  4. sitemap.xml — Validate XML structure
Score on a 10-point scale and generate an issue for each finding.
For detection rules and scoring, read references/policy-files.md.
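
Two of the checks above can be sketched directly: the llms.txt quality threshold and a first pass over robots.txt for the audited AI crawlers. The threshold and crawler names come from the list above; the function names and the `mentioned_crawlers` helper (naive substring matching, not a real allow/block decision, which is defined in references/policy-files.md) are assumptions for the sketch.

```python
AI_CRAWLERS = ("GPTBot", "Google-Extended", "Anthropic-AI", "PerplexityBot")

def classify_llms_txt(body):
    """Classify llms.txt per the thresholds above:
    missing, thin (<500 chars), or good (>=500 chars)."""
    if body is None:
        return "missing"
    return "good" if len(body) >= 500 else "thin"

def mentioned_crawlers(robots_txt):
    """Which audited AI crawlers robots.txt names at all. Whether each
    is allowed or blocked needs a real robots.txt parser."""
    lowered = robots_txt.lower()
    return [bot for bot in AI_CRAWLERS if bot.lower() in lowered]
```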

Step 3: Score Each Page — Per-Page Technical Score (0–100)

For each selected page, compute across four dimensions:
| Dimension | Points | Sub-checks |
|---|---|---|
| Schema | 40 | J1 (present), J2a (valid structure), J2b (required properties), J3 (relevant type), J4 (coverage) |
| Metadata | 30 | M1 (title), M2 (description), M3 (canonical), M4 (OG), M5 (Twitter) |
| FAQ | 20 | Linear scale: 0 FAQs = 0, 1 = 5, 2 = 10, 3 = 15, 4+ = 20 |
| Content | 10 | C1 (word count ≥300), C2 (≥3 paragraphs) |

For sub-check methodology, read references/agentic-readiness.md.
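
The FAQ linear scale and the dimension sum above can be sketched as follows. The point values are taken from the table; the function names and the assumption that callers pass already-capped dimension subtotals are illustrative.

```python
def faq_points(faq_count):
    """Linear FAQ scale from the table: 0 -> 0, 1 -> 5, 2 -> 10,
    3 -> 15, 4 or more -> 20."""
    return min(faq_count, 4) * 5

def technical_score(schema_pts, metadata_pts, faq_count, content_pts):
    """Per-page Technical Score (0-100). Assumes each subtotal is
    already within its cap: schema <= 40, metadata <= 30, content <= 10."""
    return schema_pts + metadata_pts + faq_points(faq_count) + content_pts
```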

Step 4: Score Content Quality (Category 2 — Per Page, 20 pts)

Evaluate citation-readiness: title clarity (3), TL;DR placement (4), E-E-A-T signals (6), statistics & citations (5), real examples (2).
For criteria, read references/content-quality.md.

Step 5: Score Chunking & Retrieval (Category 3 — Per Page, 15 pts)

Evaluate LLM retrieval optimization: heading hierarchy (3), section scope (3), paragraph self-containment (2.25), answer-first openings (2.25), vocabulary/lists/FAQ/summary (4.5).
For criteria, read references/chunking-retrieval.md.

Step 6: Simulate Query Fanout (Category 4 — Domain Level, 10 pts)

  1. Identify core topics from page content
  2. Generate simulated sub-queries using per-model rules (GPT-5.4 two-phase, Claude bundled, Gemini systematic)
  3. Check site coverage per sub-query
  4. Apply citation weights (citation-producing 1.5x, silent 0.5x, site: 2x)
  5. Score: (weighted answered / weighted total) × 10
For simulation rules and coverage scoring, read references/query-fanout.md.
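
The weighted scoring in steps 4–5 can be sketched as follows. The weights and formula come from the steps above; the `(kind, answered)` input shape and function name are assumptions, and real sub-query generation and coverage checking follow references/query-fanout.md.

```python
# Citation weights from step 4: citation-producing 1.5x,
# silent 0.5x, site: queries 2x.
WEIGHTS = {"citation": 1.5, "silent": 0.5, "site": 2.0}

def fanout_score(subqueries):
    """subqueries: list of (kind, answered) pairs, kind a WEIGHTS key,
    answered a bool from the coverage check in step 3.
    Score = (weighted answered / weighted total) * 10."""
    total = sum(WEIGHTS[kind] for kind, _ in subqueries)
    answered = sum(WEIGHTS[kind] for kind, ok in subqueries if ok)
    return 0.0 if total == 0 else round(answered / total * 10, 2)
```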

Step 7: Compute Aggregate Score (0–100)

overall = agentic_readiness(45) + content_quality(20) + chunking_retrieval(15) + query_fanout(10) + policy_files(10)
Categories 1–3 are averaged across pages, then scaled to their category weights. Categories 4–5 are domain-level.
For the full rubric, read references/scoring-rubric.md.
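
The aggregation can be sketched as follows. This assumes per-page agentic readiness is the 0–100 Technical Score scaled down to its 45-point weight, while content quality (20) and chunking (15) are already on their category scales, so averaging suffices; the exact scaling lives in references/scoring-rubric.md.

```python
def aggregate_score(per_page, query_fanout, policy_files):
    """per_page: list of dicts with per-page scores for categories 1-3
    ('agentic' out of 100, 'content' out of 20, 'chunking' out of 15).
    query_fanout and policy_files are domain-level (each out of 10)."""
    n = len(per_page)
    agentic = sum(p["agentic"] for p in per_page) / n * 45 / 100
    content = sum(p["content"] for p in per_page) / n
    chunking = sum(p["chunking"] for p in per_page) / n
    return round(agentic + content + chunking + query_fanout + policy_files, 1)
```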

Step 8: Generate Issues

Create issues with: id, category, severity, summary, detail, affected_urls, remediation_hint. Use issue ID patterns from the pipeline contract.
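
The issue shape above can be sketched as a small record type. The field names come from the list above; the example ID, severity value, and message text are hypothetical, since real ID patterns come from the pipeline contract.

```python
from dataclasses import dataclass, asdict

@dataclass
class Issue:
    id: str                  # ID pattern per the pipeline contract
    category: str
    severity: str
    summary: str
    detail: str
    affected_urls: list
    remediation_hint: str

# Hypothetical example issue; values are illustrative only.
example = Issue(
    id="POLICY-LLMS-MISSING",
    category="policy_files",
    severity="high",
    summary="llms.txt not found",
    detail="No llms.txt at the domain root; AI crawlers get no site summary.",
    affected_urls=["https://example.com/"],
    remediation_hint="Publish an llms.txt of at least 500 characters.",
)
```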

Step 9: Produce Scan Report

Assemble Scan Report JSON (PIPELINE.md §1): domain metadata, per-page scores, issues, schema detection, policy file status, query fanout analysis.

SaaS Detection

Before scoring, detect SaaS by matching at least 2 of 3 content sources against indicator terms (platform, saas, cloud, api, etc.). A SaaS match changes the expected schemas for product/pricing/features/solutions pages.
For detection logic, read references/page-type-rules.md.
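
The 2-of-3 rule can be sketched as follows. The indicator terms are the partial list quoted above; the exact content sources and full term list are defined in references/page-type-rules.md, and the naive substring matching here is an assumption of the sketch.

```python
# Partial indicator list from above; the full list lives in
# references/page-type-rules.md.
INDICATORS = ("platform", "saas", "cloud", "api")

def is_saas(sources):
    """sources: the 3 content source texts checked before scoring.
    SaaS if at least 2 of them contain any indicator term
    (naive case-insensitive substring match)."""
    hits = sum(
        1 for text in sources
        if any(term in text.lower() for term in INDICATORS)
    )
    return hits >= 2
```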

Reference Files

| File | Purpose |
|---|---|
| references/scoring-rubric.md | Full 100-point rubric — Technical Score + 5-category pipeline model |
| references/agentic-readiness.md | Per-page Technical Score sub-checks (J1–J4, M1–M5, C1–C2, FAQ) |
| references/page-type-rules.md | 19 page types, URL patterns, expected schemas, SaaS detection |
| references/content-quality.md | 5 content quality pillars for citation-readiness |
| references/chunking-retrieval.md | 10 evaluation areas for LLM retrieval optimization |
| references/query-fanout.md | Per-model fan-out simulation rules, citation weights, coverage scoring |
| references/policy-files.md | robots.txt + llms.txt detection, validation, and scoring |