# SEO Technical: robots.txt
Guides configuration and auditing of robots.txt for search engine and AI crawler control.
When invoking: On first use, if helpful, open with 1–2 sentences on what this skill covers and why it matters, then provide the main output. On subsequent use or when the user asks to skip, go directly to the main output.
## Scope (Technical SEO)
- Robots.txt: Review Disallow/Allow; avoid blocking important pages
- Crawler access: Ensure crawlers (including AI crawlers) can access key pages
- Indexing: Misconfigured robots.txt can block indexing; verify no accidental blocks
## Initial Assessment
Check for product marketing context first: if `.claude/product-marketing-context.md` or `.cursor/product-marketing-context.md` exists, read it for the site URL and indexing goals.

Identify:
- Site URL: Base domain (e.g., `https://example.com`)
- Indexing scope: Full site, partial, or specific paths to exclude
- AI crawler strategy: Allow search/indexing vs. block training-data crawlers
## Best Practices
### Purpose and Limitations
| Point | Note |
|---|---|
| Purpose | Controls crawler access; does NOT prevent indexing (disallowed URLs may still appear in search results without a snippet) |
| No-index | Use a noindex meta tag or authentication for sensitive content; robots.txt is publicly readable |
| Indexed vs non-indexed | Not all content should be indexed. robots.txt and noindex are complementary: robots.txt for path-level crawl control, noindex for page-level indexing control. See indexing |
| Advisory | Rules are advisory; malicious crawlers may ignore them |
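The crawl-control vs. index-control split in the table can be sketched as a combined fragment; paths here are illustrative placeholders:

```txt
# robots.txt — crawl control only. A disallowed URL can still be indexed
# from external links; the crawler just cannot fetch it or show a snippet.
User-agent: *
Disallow: /internal/

# Page-level index control (NOT robots.txt; the page must stay crawlable
# so the signal can be seen):
#   HTML:        <meta name="robots" content="noindex">
#   HTTP header: X-Robots-Tag: noindex
```

Note that combining the two on the same path defeats the purpose: a Disallowed page can never be fetched, so its noindex signal is never read.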
### Location and Format
| Item | Requirement |
|---|---|
| Path | Site root: `/robots.txt` |
| Encoding | UTF-8 plain text |
| Standard | RFC 9309 (Robots Exclusion Protocol) |
### Core Directives
| Directive | Purpose | Example |
|---|---|---|
| `User-agent` | Target crawler | `User-agent: Googlebot` |
| `Disallow` | Block path prefix | `Disallow: /admin/` |
| `Allow` | Allow path (can override Disallow) | `Allow: /admin/help/` |
| `Sitemap` | Declare sitemap absolute URL | `Sitemap: https://example.com/sitemap.xml` |
| `Clean-param` | Strip query params (Yandex) | See below |
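The directives above combine into a minimal file; the domain and paths are placeholders:

```txt
User-agent: *
Disallow: /admin/
Allow: /admin/help/

Sitemap: https://example.com/sitemap.xml
```

Under RFC 9309, when Allow and Disallow both match a URL, the longer (more specific) rule wins, so `/admin/help/` stays crawlable despite the broader `Disallow: /admin/`.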
### Critical: Do Not Block Rendering Resources
- Do not block CSS, JS, images; Google needs them to render pages
- Only block paths that don't need crawling: admin, API, temp files
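A sketch of the safe pattern: block only non-content paths and leave rendering assets alone (paths are illustrative):

```txt
User-agent: *
Disallow: /admin/
Disallow: /api/
Disallow: /tmp/

# Anti-patterns — rules like these prevent Google from rendering pages:
# Disallow: /assets/
# Disallow: /*.css$
# Disallow: /*.js$
```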
### AI Crawler Strategy
| User-agent | Purpose | Typical setting |
|---|---|---|
| OAI-SearchBot | ChatGPT search | Allow |
| GPTBot | OpenAI training | Disallow |
| Claude-SearchBot | Claude search | Allow |
| ClaudeBot | Anthropic training | Disallow |
| PerplexityBot | Perplexity search | Allow |
| Google-Extended | Gemini training | Disallow |
| CCBot | Common Crawl | Disallow |
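One way to express the table as rules, assuming a keep-AI-search/block-training policy:

```txt
# AI search crawlers — allowed
User-agent: OAI-SearchBot
User-agent: Claude-SearchBot
User-agent: PerplexityBot
Allow: /

# Training-data crawlers — blocked
User-agent: GPTBot
User-agent: ClaudeBot
User-agent: Google-Extended
User-agent: CCBot
Disallow: /
```

RFC 9309 lets multiple consecutive `User-agent` lines share one rule group; crawlers not matched here fall back to any `User-agent: *` group.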
### Clean-param (Yandex)
```txt
Clean-param: utm_source&utm_medium&utm_campaign&utm_term&utm_content&ref&fbclid&gclid
```

## Output Format
- Current state (if auditing)
- Recommended robots.txt (full file)
- Compliance checklist
- References: Google robots.txt documentation
## Related Skills
- xml-sitemap: Sitemap URL to reference in robots.txt
- site-crawlability: Broader crawl and structure guidance