morphiq-scan


Pipeline Position

流水线位置

Step 1 of 4 — entry point.
  • Input: A domain URL from the user.
  • Output: Scan Report (JSON) → consumed by morphiq-rank.
  • Data contract: See PIPELINE.md §1 for the Scan Report schema.

Purpose

Morphiq Scan audits a website's readiness for AI visibility. It answers two questions: "Can AI systems parse and understand this site?" (Technical Score) and "Where are the gaps preventing AI citations?" (issue identification). The output feeds morphiq-rank for prioritization.

Workflow

Step 1: Discover Pages

  1. Fetch {domain}/robots.txt — extract sitemap URLs
  2. Fetch {domain}/sitemap.xml if not found in robots.txt
  3. Classify discovered pages by type using URL pattern matching
  4. Select up to 10 marketing-relevant pages, prioritized: home → pricing → features → product → solutions → about → blog → other → documentation
  5. Exclude non-marketing pages from scoring (contact, login, signup, legal, demo, careers, changelog)
For page type classification and URL patterns, read references/page-type-rules.md.
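
The selection logic in steps 3–5 can be sketched as follows. This is a minimal illustration: the priority order and exclusion list come straight from the steps above, while the function name and the `(url, page_type)` input shape are assumptions for the sketch.

```python
# Priority order for marketing-relevant pages (step 4) and
# excluded non-marketing types (step 5), as listed above.
PRIORITY = ["home", "pricing", "features", "product", "solutions",
            "about", "blog", "other", "documentation"]
EXCLUDED = {"contact", "login", "signup", "legal", "demo",
            "careers", "changelog"}

def select_pages(classified, limit=10):
    """classified: list of (url, page_type) pairs produced by
    URL pattern matching. Returns up to `limit` pages in priority order."""
    rank = {t: i for i, t in enumerate(PRIORITY)}
    eligible = [(url, t) for url, t in classified if t not in EXCLUDED]
    # Sort by priority position; unrecognized types sort last.
    eligible.sort(key=lambda page: rank.get(page[1], len(PRIORITY)))
    return eligible[:limit]
```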

Step 2: Audit Policy Files (Category 5 — Domain Level)

  1. robots.txt — Validate existence, format, AI crawler access (GPTBot, Google-Extended, Anthropic-AI, PerplexityBot)
  2. llms.txt — Validate existence and quality (≥500 chars = good, <500 = thin)
  3. llms-full.txt — Check existence
  4. sitemap.xml — Validate XML structure
Score on a 10-point scale and generate an issue for each finding.
For detection rules and scoring, read references/policy-files.md.
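
Two of the checks above can be sketched directly: the llms.txt quality threshold and a first pass over robots.txt for the audited AI crawlers. The threshold and crawler names come from the list above; the function names and the `mentioned_crawlers` helper (naive substring matching, not a real allow/block decision, which is defined in references/policy-files.md) are assumptions for the sketch.

```python
AI_CRAWLERS = ("GPTBot", "Google-Extended", "Anthropic-AI", "PerplexityBot")

def classify_llms_txt(body):
    """Classify llms.txt per the thresholds above:
    missing, thin (<500 chars), or good (>=500 chars)."""
    if body is None:
        return "missing"
    return "good" if len(body) >= 500 else "thin"

def mentioned_crawlers(robots_txt):
    """Which audited AI crawlers robots.txt names at all. Whether each
    is allowed or blocked needs a real robots.txt parser."""
    lowered = robots_txt.lower()
    return [bot for bot in AI_CRAWLERS if bot.lower() in lowered]
```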

Step 3: Score Each Page — Per-Page Technical Score (0–100)

For each selected page, compute across four dimensions:
| Dimension | Points | Sub-checks |
|---|---|---|
| Schema | 40 | J1 (present), J2a (valid structure), J2b (required properties), J3 (relevant type), J4 (coverage) |
| Metadata | 30 | M1 (title), M2 (description), M3 (canonical), M4 (OG), M5 (Twitter) |
| FAQ | 20 | Linear scale: 0 FAQs = 0, 1 = 5, 2 = 10, 3 = 15, 4+ = 20 |
| Content | 10 | C1 (word count ≥300), C2 (≥3 paragraphs) |

For sub-check methodology, read references/agentic-readiness.md.
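
The FAQ linear scale and the dimension sum above can be sketched as follows. The point values are taken from the table; the function names and the assumption that callers pass already-capped dimension subtotals are illustrative.

```python
def faq_points(faq_count):
    """Linear FAQ scale from the table: 0 -> 0, 1 -> 5, 2 -> 10,
    3 -> 15, 4 or more -> 20."""
    return min(faq_count, 4) * 5

def technical_score(schema_pts, metadata_pts, faq_count, content_pts):
    """Per-page Technical Score (0-100). Assumes each subtotal is
    already within its cap: schema <= 40, metadata <= 30, content <= 10."""
    return schema_pts + metadata_pts + faq_points(faq_count) + content_pts
```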

Step 4: Score Content Quality (Category 2 — Per Page, 20 pts)

Evaluate citation-readiness: title clarity (3), TL;DR placement (4), E-E-A-T signals (6), statistics & citations (5), real examples (2).
For criteria, read references/content-quality.md.

Step 5: Score Chunking & Retrieval (Category 3 — Per Page, 15 pts)

Evaluate LLM retrieval optimization: heading hierarchy (3), section scope (3), paragraph self-containment (2.25), answer-first openings (2.25), vocabulary/lists/FAQ/summary (4.5).
For criteria, read references/chunking-retrieval.md.

Step 6: Simulate Query Fanout (Category 4 — Domain Level, 10 pts)

  1. Identify core topics from page content
  2. Generate simulated sub-queries using per-model rules (GPT-5.4 two-phase, Claude bundled, Gemini systematic)
  3. Check site coverage per sub-query
  4. Apply citation weights (citation-producing 1.5x, silent 0.5x, site: 2x)
  5. Score: (weighted answered / weighted total) × 10
For simulation rules and coverage scoring, read references/query-fanout.md.
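
The weighted scoring in steps 4–5 can be sketched as follows. The weights and formula come from the steps above; the `(kind, answered)` input shape and function name are assumptions, and real sub-query generation and coverage checking follow references/query-fanout.md.

```python
# Citation weights from step 4: citation-producing 1.5x,
# silent 0.5x, site: queries 2x.
WEIGHTS = {"citation": 1.5, "silent": 0.5, "site": 2.0}

def fanout_score(subqueries):
    """subqueries: list of (kind, answered) pairs, kind a WEIGHTS key,
    answered a bool from the coverage check in step 3.
    Score = (weighted answered / weighted total) * 10."""
    total = sum(WEIGHTS[kind] for kind, _ in subqueries)
    answered = sum(WEIGHTS[kind] for kind, ok in subqueries if ok)
    return 0.0 if total == 0 else round(answered / total * 10, 2)
```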

Step 7: Compute Aggregate Score (0–100)

overall = agentic_readiness(45) + content_quality(20) + chunking_retrieval(15) + query_fanout(10) + policy_files(10)
Categories 1–3 are averaged across pages, then scaled to their category weights. Categories 4–5 are domain-level.
For the full rubric, read references/scoring-rubric.md.
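
The aggregation can be sketched as follows. This assumes per-page agentic readiness is the 0–100 Technical Score scaled down to its 45-point weight, while content quality (20) and chunking (15) are already on their category scales, so averaging suffices; the exact scaling lives in references/scoring-rubric.md.

```python
def aggregate_score(per_page, query_fanout, policy_files):
    """per_page: list of dicts with per-page scores for categories 1-3
    ('agentic' out of 100, 'content' out of 20, 'chunking' out of 15).
    query_fanout and policy_files are domain-level (each out of 10)."""
    n = len(per_page)
    agentic = sum(p["agentic"] for p in per_page) / n * 45 / 100
    content = sum(p["content"] for p in per_page) / n
    chunking = sum(p["chunking"] for p in per_page) / n
    return round(agentic + content + chunking + query_fanout + policy_files, 1)
```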

Step 8: Generate Issues

Create issues with: id, category, severity, summary, detail, affected_urls, remediation_hint. Use issue ID patterns from the pipeline contract.
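
The issue shape above can be sketched as a small record type. The field names come from the list above; the example ID, severity value, and message text are hypothetical, since real ID patterns come from the pipeline contract.

```python
from dataclasses import dataclass, asdict

@dataclass
class Issue:
    id: str                  # ID pattern per the pipeline contract
    category: str
    severity: str
    summary: str
    detail: str
    affected_urls: list
    remediation_hint: str

# Hypothetical example issue; values are illustrative only.
example = Issue(
    id="POLICY-LLMS-MISSING",
    category="policy_files",
    severity="high",
    summary="llms.txt not found",
    detail="No llms.txt at the domain root; AI crawlers get no site summary.",
    affected_urls=["https://example.com/"],
    remediation_hint="Publish an llms.txt of at least 500 characters.",
)
```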

Step 9: Produce Scan Report

Assemble Scan Report JSON (PIPELINE.md §1): domain metadata, per-page scores, issues, schema detection, policy file status, query fanout analysis.

SaaS Detection

Before scoring, detect SaaS by matching at least 2 of 3 content sources against indicator terms (platform, saas, cloud, api, etc.). A SaaS match changes the expected schemas for product/pricing/features/solutions pages.
For detection logic, read references/page-type-rules.md.
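
The 2-of-3 rule can be sketched as follows. The indicator terms are the partial list quoted above; the exact content sources and full term list are defined in references/page-type-rules.md, and the naive substring matching here is an assumption of the sketch.

```python
# Partial indicator list from above; the full list lives in
# references/page-type-rules.md.
INDICATORS = ("platform", "saas", "cloud", "api")

def is_saas(sources):
    """sources: the 3 content source texts checked before scoring.
    SaaS if at least 2 of them contain any indicator term
    (naive case-insensitive substring match)."""
    hits = sum(
        1 for text in sources
        if any(term in text.lower() for term in INDICATORS)
    )
    return hits >= 2
```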

Reference Files

| File | Purpose |
|---|---|
| references/scoring-rubric.md | Full 100-point rubric — Technical Score + 5-category pipeline model |
| references/agentic-readiness.md | Per-page Technical Score sub-checks (J1–J4, M1–M5, C1–C2, FAQ) |
| references/page-type-rules.md | 19 page types, URL patterns, expected schemas, SaaS detection |
| references/content-quality.md | 5 content quality pillars for citation-readiness |
| references/chunking-retrieval.md | 10 evaluation areas for LLM retrieval optimization |
| references/query-fanout.md | Per-model fan-out simulation rules, citation weights, coverage scoring |
| references/policy-files.md | robots.txt + llms.txt detection, validation, and scoring |