morphiq-scan
Pipeline Position
Step 1 of 4 — entry point.
- Input: A domain URL from the user.
- Output: Scan Report (JSON) → consumed by morphiq-rank.
- Data contract: See PIPELINE.md §1 for the Scan Report schema.
Purpose
Morphiq Scan audits a website's readiness for AI visibility. It answers two questions: "Can AI systems parse and understand this site?" (Technical Score) and "Where are the gaps preventing AI citations?" (issue identification). The output feeds morphiq-rank for prioritization.
Workflow
Step 1: Discover Pages
- Fetch {domain}/robots.txt — extract sitemap URLs
- Fetch {domain}/sitemap.xml if not found in robots.txt
- Classify discovered pages by type using URL pattern matching
- Select up to 10 marketing-relevant pages, prioritized: home → pricing → features → product → solutions → about → blog → other → documentation
- Exclude non-marketing pages from scoring (contact, login, signup, legal, demo, careers, changelog)
For page type classification and URL patterns, read references/page-type-rules.md.
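The selection rules above can be sketched as follows. This is a minimal illustration, assuming pages have already been classified by URL pattern; the function name and input shape are assumptions, not the tool's actual code.

```python
# Priority order and exclusion list mirror the rules in this step.
PRIORITY = ["home", "pricing", "features", "product", "solutions",
            "about", "blog", "other", "documentation"]
EXCLUDED = {"contact", "login", "signup", "legal", "demo", "careers", "changelog"}

def select_pages(classified, limit=10):
    """classified: list of (url, page_type) pairs from URL pattern matching.

    Drops non-marketing pages, sorts by the priority order above,
    and keeps at most `limit` pages.
    """
    eligible = [(url, t) for url, t in classified if t not in EXCLUDED]
    eligible.sort(key=lambda p: PRIORITY.index(p[1])
                  if p[1] in PRIORITY else len(PRIORITY))
    return eligible[:limit]
```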
Step 2: Audit Policy Files (Category 5 — Domain Level)
- robots.txt — Validate existence, format, AI crawler access (GPTBot, Google-Extended, Anthropic-AI, PerplexityBot)
- llms.txt — Validate existence and quality (≥500 chars = good, <500 = thin)
- llms-full.txt — Check existence
- sitemap.xml — Validate XML structure
Score on a 10-point scale; generate issues for findings.
For detection rules and scoring, read references/policy-files.md.
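The llms.txt rule above (≥500 chars = good, <500 = thin) can be sketched like this. The fetch helper and status labels are illustrative assumptions, not the tool's implementation.

```python
import urllib.request

def fetch_text(url, timeout=10):
    """Return the body of `url`, or None if it cannot be fetched."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.read().decode("utf-8", errors="replace")
    except OSError:
        return None

def llms_txt_status(body):
    """Apply the >=500-chars-is-good rule; None means the file is missing."""
    if body is None:
        return "missing"
    return "good" if len(body) >= 500 else "thin"
```

Usage: `llms_txt_status(fetch_text("https://example.com/llms.txt"))`.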
Step 3: Score Each Page — Per-Page Technical Score (0–100)
For each selected page, compute a Technical Score across four dimensions:
| Dimension | Points | Sub-checks |
|---|---|---|
| Schema | 40 | J1 (present), J2a (valid structure), J2b (required properties), J3 (relevant type), J4 (coverage) |
| Metadata | 30 | M1 (title), M2 (description), M3 (canonical), M4 (OG), M5 (Twitter) |
| FAQ | 20 | Linear scale: 0 FAQs=0, 1=5, 2=10, 3=15, 4+=20 |
| Content | 10 | C1 (word count ≥300), C2 (≥3 paragraphs) |
For sub-check methodology, read references/agentic-readiness.md.
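The table above combines as sketched below. For illustration, sub-check results are assumed to arrive as booleans weighted equally within each dimension; the real J/M/C methodology in references/agentic-readiness.md may weight them differently.

```python
def faq_points(faq_count):
    # Linear scale from the table: 0 -> 0, 1 -> 5, 2 -> 10, 3 -> 15, 4+ -> 20.
    return min(faq_count, 4) * 5

def page_technical_score(schema_checks, metadata_checks, content_checks, faq_count):
    """schema_checks: J1, J2a, J2b, J3, J4; metadata_checks: M1-M5;
    content_checks: C1, C2. Each a list of booleans."""
    schema = 40 * sum(schema_checks) / len(schema_checks)        # max 40
    metadata = 30 * sum(metadata_checks) / len(metadata_checks)  # max 30
    content = 10 * sum(content_checks) / len(content_checks)     # max 10
    return round(schema + metadata + content + faq_points(faq_count), 1)
```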
Step 4: Score Content Quality (Category 2 — Per Page, 20 pts)
Evaluate citation-readiness: title clarity (3), TL;DR placement (4), E-E-A-T signals (6), statistics & citations (5), real examples (2).
For criteria, read references/content-quality.md.
Step 5: Score Chunking & Retrieval (Category 3 — Per Page, 15 pts)
Evaluate LLM retrieval optimization: heading hierarchy (3), section scope (3), paragraph self-containment (2.25), answer-first openings (2.25), vocabulary/lists/FAQ/summary (4.5).
For criteria, read references/chunking-retrieval.md.
Step 6: Simulate Query Fanout (Category 4 — Domain Level, 10 pts)
- Identify core topics from page content
- Generate simulated sub-queries using per-model rules (GPT-5.4 two-phase, Claude bundled, Gemini systematic)
- Check site coverage per sub-query
- Apply citation weights (citation-producing 1.5x, silent 0.5x, site: 2x)
- Score: (weighted answered / weighted total) × 10
For simulation rules and coverage scoring, read references/query-fanout.md.
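A minimal sketch of the weighted score above, assuming each simulated sub-query has already been labeled with a citation class and a coverage verdict; the class names and input shape are illustrative assumptions.

```python
# Weights follow the citation classes listed in this step.
WEIGHTS = {"citation": 1.5, "silent": 0.5, "site": 2.0}

def fanout_score(subqueries):
    """subqueries: list of (query_class, answered_by_site) pairs.

    Returns (weighted answered / weighted total) x 10, per the formula above.
    """
    total = sum(WEIGHTS[cls] for cls, _ in subqueries)
    answered = sum(WEIGHTS[cls] for cls, ok in subqueries if ok)
    return round(10 * answered / total, 2) if total else 0.0
```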
Step 7: Compute Aggregate Score (0–100)
overall = agentic_readiness(45) + content_quality(20) + chunking_retrieval(15) + query_fanout(10) + policy_files(10)

Categories 1–3 are averaged across pages and then scaled; Categories 4–5 are domain-level.
For the full rubric, read references/scoring-rubric.md.
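The aggregation can be sketched as below, under the assumption that per-page category scores arrive normalized to 0–1 before weighting; the rubric defines the exact scaling, so treat the input shape as illustrative.

```python
def aggregate_score(pages, query_fanout, policy_files):
    """pages: list of dicts with 0-1 normalized scores for categories 1-3.
    query_fanout (max 10) and policy_files (max 10) are domain-level."""
    n = len(pages)
    # Categories 1-3: average across pages, then scale to category weight.
    agentic = 45 * sum(p["agentic_readiness"] for p in pages) / n
    content = 20 * sum(p["content_quality"] for p in pages) / n
    chunking = 15 * sum(p["chunking_retrieval"] for p in pages) / n
    return round(agentic + content + chunking + query_fanout + policy_files, 1)
```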
Step 8: Generate Issues
Create issues with: id, category, severity, summary, detail, affected_urls, remediation_hint. Use issue ID patterns from the pipeline contract.
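The fields above might be modeled as below. This is a hypothetical shape: the example ID, category, and severity values are placeholders, since the actual ID patterns come from the pipeline contract.

```python
from dataclasses import dataclass, field, asdict

@dataclass
class Issue:
    id: str                  # ID pattern defined by the pipeline contract
    category: str            # one of the five scoring categories
    severity: str            # e.g. "critical" / "warning" (placeholder values)
    summary: str
    detail: str
    affected_urls: list = field(default_factory=list)
    remediation_hint: str = ""

# Hypothetical example issue; serialize with asdict() for the Scan Report.
issue = Issue(id="SCHEMA-MISSING", category="agentic_readiness",
              severity="critical", summary="No JSON-LD on pricing page",
              detail="Page has no schema markup.",
              affected_urls=["https://example.com/pricing"],
              remediation_hint="Add Product schema with offers.")
```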
Step 9: Produce Scan Report
Assemble the Scan Report JSON (PIPELINE.md §1): domain metadata, per-page scores, issues, schema detection, policy file status, query fanout analysis.
SaaS Detection
Before scoring, detect SaaS by matching indicator terms (platform, saas, cloud, api, etc.) in at least 2 of 3 content sources. A SaaS match changes the expected schemas for product, pricing, features, and solutions pages.
For detection logic, read references/page-type-rules.md.
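The 2-of-3 rule can be sketched like this. The choice of the three content sources and the abbreviated term list are assumptions for illustration; the full indicator list lives in references/page-type-rules.md.

```python
# Abbreviated indicator list; the reference file defines the full set.
SAAS_TERMS = {"platform", "saas", "cloud", "api"}

def is_saas(sources):
    """sources: three strings of page content (e.g. title, meta
    description, body text). Flag SaaS when 2+ sources match a term."""
    hits = sum(1 for text in sources
               if any(term in text.lower() for term in SAAS_TERMS))
    return hits >= 2
```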
Reference Files
| File | Purpose |
|---|---|
| references/scoring-rubric.md | Full 100-point rubric — Technical Score + 5-category pipeline model |
| references/agentic-readiness.md | Per-page Technical Score sub-checks (J1–J4, M1–M5, C1–C2, FAQ) |
| references/page-type-rules.md | 19 page types, URL patterns, expected schemas, SaaS detection |
| references/content-quality.md | 5 content quality pillars for citation-readiness |
| references/chunking-retrieval.md | 10 evaluation areas for LLM retrieval optimization |
| references/query-fanout.md | Per-model fan-out simulation rules, citation weights, coverage scoring |
| references/policy-files.md | robots.txt + llms.txt detection, validation, and scoring |