pSEO Scale Architecture
Architect pSEO systems that work at 10K-100K+ pages. The patterns in the other pseo-* skills are correct at 1K-10K. Beyond 10K, in-memory data layers, full-corpus validation, and single-deploy rollouts break down. This skill provides the architecture changes needed at scale.
Scale Tiers
| Tier | Pages | Data Layer | Validation | Rollout | Sitemap |
|---|---|---|---|---|---|
| Small | < 1K | JSON/files, in-memory | Full pairwise | Single deploy | Single file |
| Medium | 1K-10K | Files or DB, two-tier memory | Fingerprint-based | 2-4 week batches | Index + children |
| Large | 10K-50K | Database required | Incremental + sampling | Category-by-category | Index + chunked |
| Very Large | 50K-100K+ | Database + cache layer | Delta-only + periodic full | ISR + sitemap waves | Index + streaming |
This skill focuses on the Large and Very Large tiers. If the project is under 10K pages, the standard pseo-* skills are sufficient.
1. Database-Backed Data Layer
At 10K+ pages, JSON files and in-memory arrays stop working. The data layer must move to a database with proper indexing.
Why In-Memory Breaks
| Pages | PageIndex in memory | Full content (if loaded) |
|---|---|---|
| 1K | ~1MB | ~100-500MB |
| 10K | ~10MB | ~1-5GB (OOM) |
| 50K | ~50MB (borderline) | ~5-25GB (impossible) |
| 100K | ~100MB | ~10-50GB (impossible) |

At 50K+, even holding all `PageIndex` records in memory is borderline. At 100K, a `getAllSlugs()` call returning an array of 100K objects takes ~100MB and seconds to deserialize. You need cursor-based iteration.

Database Requirements
Minimum schema:
```sql
CREATE TABLE pages (
  id SERIAL PRIMARY KEY,
  slug TEXT UNIQUE NOT NULL,
  canonical_path TEXT UNIQUE NOT NULL,
  title TEXT NOT NULL,
  h1 TEXT NOT NULL,
  meta_description TEXT NOT NULL,
  category TEXT NOT NULL,
  subcategory TEXT,
  status TEXT DEFAULT 'published',
  last_modified TIMESTAMPTZ NOT NULL,
  published_at TIMESTAMPTZ,
  -- Heavy fields (only loaded per-page)
  intro_text TEXT,
  body_content TEXT,
  faqs JSONB,
  related_slugs TEXT[],
  featured_image JSONB,
  -- Scale fields
  data_sufficiency_score REAL, -- see section 2
  content_hash TEXT,           -- for incremental validation
  last_validated TIMESTAMPTZ
);

-- Required indexes for pSEO queries
CREATE INDEX idx_pages_category ON pages(category);
CREATE INDEX idx_pages_status ON pages(status) WHERE status = 'published';
CREATE INDEX idx_pages_slug ON pages(slug);
CREATE INDEX idx_pages_last_modified ON pages(last_modified DESC);
CREATE INDEX idx_pages_sufficiency ON pages(data_sufficiency_score);
CREATE INDEX idx_pages_category_status ON pages(category, status);
```

Categories table:
```sql
CREATE TABLE categories (
  slug TEXT PRIMARY KEY,
  name TEXT NOT NULL,
  description TEXT,
  parent_slug TEXT REFERENCES categories(slug),
  page_count INT DEFAULT 0,
  last_modified TIMESTAMPTZ
);
```

Redirects table:

```sql
CREATE TABLE redirects (
  source TEXT PRIMARY KEY,
  destination TEXT NOT NULL,
  created_at TIMESTAMPTZ DEFAULT NOW()
);

CREATE INDEX idx_redirects_destination ON redirects(destination);
```
Data Layer API at Scale
The pseo-data API contract changes at scale:
```typescript
// getAllSlugs() must support cursor-based iteration at 50K+
async function* getAllSlugsCursor(): AsyncGenerator<{ slug: string; category: string }> {
  let cursor: string | null = null;
  while (true) {
    const batch = await db.query(
      `SELECT slug, category FROM pages
       WHERE status = 'published'
       ${cursor ? `AND slug > $1` : ''}
       ORDER BY slug LIMIT 1000`,
      cursor ? [cursor] : []
    );
    if (batch.length === 0) break;
    for (const row of batch) yield row;
    cursor = batch[batch.length - 1].slug;
  }
}

// getPagesByCategory() must use DB pagination, not in-memory slicing
async function getPagesByCategory(
  category: string,
  opts?: { limit?: number; offset?: number }
): Promise<PageIndex[]> {
  return db.query(
    `SELECT slug, title, h1, meta_description, canonical_path, category, last_modified
     FROM pages
     WHERE category = $1 AND status = 'published'
     ORDER BY title
     LIMIT $2 OFFSET $3`,
    [category, opts?.limit ?? 50, opts?.offset ?? 0]
  );
}

// getRelatedPages() needs a precomputed index or efficient query
async function getRelatedPages(slug: string, limit = 5): Promise<PageIndex[]> {
  // Option A: Use related_slugs array from the page record
  // Option B: Same category + shared tags query
  // Option C: Precomputed relatedness table (best at 50K+)
  return db.query(
    `SELECT p.slug, p.title, p.h1, p.meta_description, p.canonical_path,
            p.category, p.last_modified
     FROM pages p
     JOIN pages source ON source.slug = $1
     WHERE p.category = source.category
       AND p.slug != $1
       AND p.status = 'published'
     ORDER BY p.last_modified DESC
     LIMIT $2`,
    [slug, limit]
  );
}
```
Connection Pooling
At build time with parallel page generation, you need connection pooling:
```typescript
import { Pool } from "pg";

const pool = new Pool({
  max: 10, // limit concurrent connections during build
  idleTimeoutMillis: 30000,
  connectionTimeoutMillis: 5000,
});
```

ORM alternative: If using Prisma or Drizzle, configure connection pool limits in the ORM config. Default pool sizes are often too high for build processes.

See references/database-patterns.md for full patterns by database type.
2. Data Sufficiency Gating
At 100K pages, many combinations will produce thin content. Gate page generation BEFORE build time — don't create pages that will fail quality checks.
Sufficiency Score
Compute a score per potential page based on available data:
```typescript
function computeSufficiencyScore(record: RawRecord): number {
  let score = 0;
  const weights = {
    hasTitle: 10,
    hasDescription: 10,      // > 50 chars
    hasBodyContent: 20,      // > 200 words
    hasFAQs: 15,             // >= 3 Q&A pairs
    hasUniqueAttributes: 15, // >= 3 non-null structured attributes
    hasImage: 5,
    hasCategory: 10,
    hasNumericData: 10,      // stats, ratings, prices — LLM citation signal
    hasSourceCitation: 5,    // data provenance for E-E-A-T
  };
  if (record.title?.length > 10) score += weights.hasTitle;
  if (record.description?.length > 50) score += weights.hasDescription;
  if (wordCount(record.bodyContent) > 200) score += weights.hasBodyContent;
  if (record.faqs?.length >= 3) score += weights.hasFAQs;
  // ... etc
  return score; // 0-100
}
```
Gating Thresholds
| Score | Action |
|---|---|
| 80-100 | Generate page — sufficient data |
| 60-79 | Generate page with enrichment flag — mark for content pipeline |
| 40-59 | Hold — do not generate until data is enriched |
| 0-39 | Reject — insufficient data, do not generate |
Store the score in the database (`data_sufficiency_score` column) so you can:

- Query how many pages are gated vs. ready
- Track enrichment progress over time
- Re-score after data enrichment
- Gate at build time: `WHERE data_sufficiency_score >= 60 AND status = 'published'`
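The action table above can be captured as a small helper so the gating logic lives in one place. A minimal sketch — the `GateAction` names are illustrative, not from the source:

```typescript
type GateAction = "generate" | "generate-with-enrichment-flag" | "hold" | "reject";

// Map a sufficiency score (0-100) to the gating action from the table above
function gateAction(score: number): GateAction {
  if (score >= 80) return "generate";
  if (score >= 60) return "generate-with-enrichment-flag";
  if (score >= 40) return "hold";
  return "reject";
}
```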
Combination Gating
For combination pages (service × city, product × use-case), both dimensions must have sufficient data:
```typescript
function gateCombination(dimA: RawRecord, dimB: RawRecord): boolean {
  // Both dimensions must independently clear a minimum bar
  const scoreA = computeSufficiencyScore(dimA);
  const scoreB = computeSufficiencyScore(dimB);
  // The combination itself must produce enough unique content
  // that it's not just "dimA text + dimB text" pasted together
  const combinationHasUniqueContent =
    dimA.attributes?.length >= 2 &&
    dimB.attributes?.length >= 2;
  return scoreA >= 40 && scoreB >= 40 && combinationHasUniqueContent;
}
```

This is the most important scale pattern. 500 services × 200 cities = 100K combinations, but maybe only 15K have enough data for a quality page. Generate only the 15K.
3. Content Enrichment Pipeline
At 100K pages, you can't manually write intros, FAQs, and descriptions. You need an automated enrichment pipeline with quality controls.
Pipeline Architecture
```
Raw data (DB/CMS/API)
        │
        ▼
Data sufficiency scoring ──→ Reject/hold insufficient records
        │
        ▼
Automated enrichment ──→ Generate intros, FAQs, summaries from structured data
        │
        ▼
Quality sampling ──→ Human reviews a random sample (5-10%) per batch
        │
        ▼
Publish gate ──→ Only records with score >= 60 and enrichment complete
        │
        ▼
Page generation
```

Enrichment Sources (non-LLM)
Before reaching for LLM generation, exhaust structured data enrichment:
- Template composition from multiple fields: Combine 3+ data fields into prose. Not `"Best {service} in {city}"` but structured sentences using attributes, stats, and category context.
- Aggregation: Roll up child data into parent summaries (category stats, comparison tables, top-N lists)
- Cross-referencing: Enrich records by joining data sources (product + reviews, service + location demographics, listing + category averages)
- FAQ generation from data patterns: Turn common attribute variations into Q&A pairs ("How much does X cost in Y?" → answer from price data)
- Comparison data: Auto-generate comparison sections from sibling records in the same category
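A template-composition step like the first bullet above can be sketched as follows. The `ServiceRecord` shape and its field names are hypothetical, not from the source data model:

```typescript
// Hypothetical record shape — field names are illustrative
interface ServiceRecord {
  service: string;
  city: string;
  providerCount: number;
  avgPrice: number;
  topRatedProvider: string;
}

// Compose prose from 3+ structured fields rather than a one-slot template
function composeIntro(r: ServiceRecord): string {
  return (
    `${r.city} has ${r.providerCount} ${r.service} providers, ` +
    `with prices averaging $${r.avgPrice}. ` +
    `${r.topRatedProvider} is currently the highest-rated option.`
  );
}
```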
LLM-Assisted Enrichment
If structured enrichment is insufficient, LLM-assisted generation is acceptable under strict conditions:
- Never generate the entire page content — LLM fills gaps in an otherwise data-driven page
- Always ground in real data — the LLM prompt includes the record's actual attributes, stats, and context
- Human review sampling — review 5-10% of LLM-generated content per batch before publishing
- Store the generation metadata — track which fields were LLM-generated vs. sourced from data
- Apply quality guard after enrichment — run pseo-quality-guard on enriched content before publishing
- Regenerate periodically — stale LLM content should be refreshed when underlying data changes
Google's position (2025): AI-generated content is acceptable if it provides genuine value. The risk is not the generation method but the output quality. Scaled LLM generation that produces interchangeable pages will trigger the same penalties as template spam.
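The "store the generation metadata" rule can be implemented by tagging each enriched field with its provenance at merge time. A sketch under assumed shapes — the `FieldProvenance` interface and function name are illustrative:

```typescript
// Provenance tag for one enriched field (illustrative shape)
interface FieldProvenance {
  field: string;
  source: "data" | "llm";
  generatedAt: string; // ISO timestamp
}

// Merge data-sourced and LLM-filled fields, tagging each one so later
// audits can tell which fields were generated vs. sourced from data.
function recordEnrichment(
  dataFields: Record<string, string>,
  llmFields: Record<string, string>,
  now = new Date().toISOString()
): { fields: Record<string, string>; provenance: FieldProvenance[] } {
  const provenance: FieldProvenance[] = [
    ...Object.keys(dataFields).map(field => ({ field, source: "data" as const, generatedAt: now })),
    ...Object.keys(llmFields).map(field => ({ field, source: "llm" as const, generatedAt: now })),
  ];
  // Data-sourced fields win on conflict — the LLM only fills gaps
  return { fields: { ...llmFields, ...dataFields }, provenance };
}
```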
4. Incremental Validation
At 100K pages, full-corpus quality checks take hours. Switch to incremental validation.
Delta Validation
Only validate pages that changed since the last validation run:
```typescript
async function getChangedPages(since: Date): Promise<string[]> {
  const result = await db.query(
    `SELECT slug FROM pages
     WHERE (last_modified > $1 OR last_validated IS NULL)
       AND status = 'published'`,
    [since]
  );
  return result.map(r => r.slug);
}
```

Content hashing: Store a hash of each page's rendered content. On validation, only re-check pages whose hash changed:
```typescript
import { createHash } from "crypto";

function contentHash(page: BaseSEOContent): string {
  const content = `${page.title}|${page.h1}|${page.metaDescription}|${page.bodyContent}`;
  return createHash("sha256").update(content).digest("hex").slice(0, 16);
}
```
Periodic Full Scan
Run a complete validation weekly or before major releases. For the full scan at 100K:
- Parallelize: Run 4-8 validation workers, each processing a category partition
- Sample cross-category similarity: Don't compare all 100K × 100K. Compare each page against 50 random pages from other categories + all pages in the same category.
- Stream results to disk: Write validation results to a JSONL file, then aggregate. Don't accumulate all results in memory.
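The cross-category sampling rule above can be expressed as a pure helper that builds each page's comparison set. A sketch — the `sampleSize` default and the index shape are assumptions:

```typescript
// Build the comparison set for one page: all same-category pages
// plus a random sample from other categories.
function comparisonSet(
  slug: string,
  index: { slug: string; category: string }[],
  sampleSize = 50
): string[] {
  const self = index.find(p => p.slug === slug);
  if (!self) return [];
  const sameCategory = index
    .filter(p => p.category === self.category && p.slug !== slug)
    .map(p => p.slug);
  const others = index.filter(p => p.category !== self.category);
  // Partial Fisher-Yates shuffle for an unbiased random sample
  const pool = [...others];
  const sampled: string[] = [];
  for (let i = 0; i < Math.min(sampleSize, pool.length); i++) {
    const j = i + Math.floor(Math.random() * (pool.length - i));
    [pool[i], pool[j]] = [pool[j], pool[i]];
    sampled.push(pool[i].slug);
  }
  return [...sameCategory, ...sampled];
}
```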
```typescript
// Parallel validation by category
const categories = await getAllCategories();
const workers = categories.map(cat =>
  validateCategory(cat.slug) // each worker handles one category
);
const results = await Promise.all(workers);
```
Validation Budget
| Scale | Delta validation | Full validation |
|---|---|---|
| 10K | Minutes | 30-60 minutes |
| 50K | Minutes | 2-4 hours |
| 100K | Minutes | 4-8 hours |
Optimization: Pre-compute and store fingerprints (MinHash signatures) in the database. Validation then only compares fingerprints, not full content.
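A minimal MinHash fingerprint sketch under stated assumptions — 32 seeded FNV-1a hashes over word 3-gram shingles; the hash count and shingle size are illustrative, not tuned values:

```typescript
const NUM_HASHES = 32;

// 32-bit FNV-1a, seeded so each signature slot uses a distinct hash function
function fnv1a(text: string, seed: number): number {
  let h = (2166136261 ^ seed) >>> 0;
  for (let i = 0; i < text.length; i++) {
    h ^= text.charCodeAt(i);
    h = Math.imul(h, 16777619) >>> 0;
  }
  return h;
}

// Word 3-gram shingles of the page text
function shingles(text: string, size = 3): string[] {
  const words = text.toLowerCase().split(/\s+/).filter(Boolean);
  const out: string[] = [];
  for (let i = 0; i + size <= words.length; i++) {
    out.push(words.slice(i, i + size).join(" "));
  }
  return out;
}

// Signature = per-seed minimum hash over all shingles; store this in the DB
function minhashSignature(text: string): number[] {
  const grams = shingles(text);
  return Array.from({ length: NUM_HASHES }, (_, seed) =>
    grams.reduce((min, g) => Math.min(min, fnv1a(g, seed)), Infinity)
  );
}

// Estimated Jaccard similarity: fraction of matching signature slots
function signatureSimilarity(a: number[], b: number[]): number {
  const matches = a.filter((v, i) => v === b[i]).length;
  return matches / a.length;
}
```

Validation then compares 32-number signatures instead of full page bodies, which is what makes the 100K full scan tractable.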
5. Crawl Budget Management
At 100K pages, Google won't crawl everything immediately. Crawl budget — the number of pages Google crawls per day — becomes a constraint.
Sitemap Submission Strategy
```
sitemap-index.xml
├── sitemap-category-a.xml (≤ 50,000 URLs)
├── sitemap-category-b.xml
├── sitemap-category-c.xml
└── ...
```

Submission cadence:
- Submit the sitemap index to both Google Search Console and Bing Webmaster Tools
- Update individual category sitemaps as pages are added
- Don't submit all 100K URLs at once — Google throttles crawling for new sites with sudden URL spikes
Programmatic sitemap submission:
```typescript
// Use the Google Indexing API for high-priority pages (job postings, livestreams).
// For standard pages, rely on sitemap discovery + Search Console.
// Batch sitemap updates by category.
async function updateCategorySitemap(category: string) {
  const pages = await getPagesByCategory(category, { limit: 50000 });
  const xml = generateSitemapXml(pages);
  await writeFile(`public/sitemap-${category}.xml`, xml);
}
```
Crawl Budget Optimization
- Prioritize high-value pages: Set `priority` in the sitemap to guide Googlebot (0.8 for hubs, 0.6 for pages with high sufficiency scores, 0.4 for the rest)
- Remove low-quality pages from the sitemap: Pages with a sufficiency score < 60 should not be in the sitemap
- Fix crawl waste: Ensure no soft 404s, no redirect chains, no parameter-based duplicates — all waste crawl budget
- Server response time: Keep TTFB < 500ms. Slow servers get less crawl budget.
- Monitor crawl stats: Check Google Search Console → Settings → Crawl stats weekly. If crawl rate drops, investigate.
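The first two bullets can be enforced in the sitemap generator itself. A sketch — the `SitemapEntry` shape and the ≥ 80 cutoff for the 0.6 tier are assumptions:

```typescript
interface SitemapEntry {
  canonicalPath: string;
  lastModified: string; // ISO date
  isHub: boolean;
  sufficiencyScore: number;
}

// Priority tiers: 0.8 hubs, 0.6 high-sufficiency pages, 0.4 the rest
function sitemapPriority(e: SitemapEntry): number {
  if (e.isHub) return 0.8;
  if (e.sufficiencyScore >= 80) return 0.6;
  return 0.4;
}

function sitemapXml(baseUrl: string, entries: SitemapEntry[]): string {
  const urls = entries
    .filter(e => e.sufficiencyScore >= 60) // below-threshold pages stay out
    .map(e =>
      `  <url>\n` +
      `    <loc>${baseUrl}${e.canonicalPath}</loc>\n` +
      `    <lastmod>${e.lastModified}</lastmod>\n` +
      `    <priority>${sitemapPriority(e).toFixed(1)}</priority>\n` +
      `  </url>`
    )
    .join("\n");
  return `<?xml version="1.0" encoding="UTF-8"?>\n` +
    `<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">\n${urls}\n</urlset>`;
}
```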
Indexing Rate Expectations
Google does not guarantee indexing all pages. Realistic expectations:
| Site authority | Pages submitted | Likely indexed (6 months) |
|---|---|---|
| New site | 100K | 10-30% |
| Established (DR 30-50) | 100K | 40-70% |
| High authority (DR 60+) | 100K | 70-90% |
If indexing rate is low:
- Improve content quality on indexed pages first
- Earn more backlinks to category hubs
- Reduce total page count (prune thin pages) — a smaller, higher-quality corpus often indexes better than a large, mediocre one
- Ensure internal linking reaches every page within 3 clicks
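The 3-click rule is checkable with a breadth-first search over the internal-link graph. A sketch — the adjacency-map input shape is an assumption:

```typescript
// Click depth from the homepage for every reachable page
function clickDepths(links: Record<string, string[]>, root = "/"): Map<string, number> {
  const depth = new Map<string, number>([[root, 0]]);
  const queue = [root];
  while (queue.length) {
    const page = queue.shift()!;
    for (const target of links[page] ?? []) {
      if (!depth.has(target)) {
        depth.set(target, depth.get(page)! + 1);
        queue.push(target);
      }
    }
  }
  return depth;
}

// Pages deeper than maxDepth clicks, or unreachable entirely
function pagesBeyondDepth(
  links: Record<string, string[]>,
  allPages: string[],
  maxDepth = 3
): string[] {
  const depth = clickDepths(links);
  return allPages.filter(p => !depth.has(p) || depth.get(p)! > maxDepth);
}
```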
6. Monitoring at Scale
At 100K pages, you can't manually check pages. Build automated monitoring.
Key Metrics to Track
| Metric | Source | Alert Threshold |
|---|---|---|
| Pages indexed | Google Search Console API | Drops > 5% week-over-week |
| Crawl rate | Google Search Console API | Drops > 20% |
| Crawl errors (5xx, 404) | Server logs, GSC | > 1% of total pages |
| CWV regressions | CrUX API or RUM | LCP > 4s or CLS > 0.25 on any template |
| Build duration | CI/CD logs | > 2x baseline |
| Build memory peak | CI/CD logs | > 80% of available memory |
| Page count by status | Database | Published count deviates from expected |
| Sufficiency score distribution | Database | > 20% of published pages below threshold |
Automated Monitoring Script
```typescript
// scripts/monitor-pseo.ts — run daily via cron or CI
// count() assumes db.query returns rows and extracts the scalar from a
// single-value aggregate query (COUNT/AVG), so the metrics below are numbers.
async function count(sql: string): Promise<number> {
  const rows = await db.query(sql);
  return Number(Object.values(rows[0])[0]);
}

async function monitor() {
  const metrics = {
    totalPublished: await count("SELECT COUNT(*) FROM pages WHERE status = 'published'"),
    avgSufficiency: await count("SELECT AVG(data_sufficiency_score) FROM pages WHERE status = 'published'"),
    belowThreshold: await count("SELECT COUNT(*) FROM pages WHERE data_sufficiency_score < 60 AND status = 'published'"),
    recentlyModified: await count("SELECT COUNT(*) FROM pages WHERE last_modified > NOW() - INTERVAL '7 days'"),
    neverValidated: await count("SELECT COUNT(*) FROM pages WHERE last_validated IS NULL AND status = 'published'"),
    redirectCount: await count("SELECT COUNT(*) FROM redirects"),
    brokenRedirects: await count(
      "SELECT COUNT(*) FROM redirects r WHERE NOT EXISTS (SELECT 1 FROM pages p WHERE p.canonical_path = r.destination)"
    ),
  };
  // Output report or send to monitoring service
  console.log(JSON.stringify(metrics, null, 2));
  // Alert on critical conditions
  if (metrics.brokenRedirects > 0) console.error("ALERT: Broken redirects found");
  if (metrics.belowThreshold / metrics.totalPublished > 0.2) {
    console.error("ALERT: >20% of published pages below sufficiency threshold");
  }
}
```
Search Console API Integration
At 100K pages, manual Search Console checks are impractical. Use the API:
- Indexing status: Query the URL Inspection API in batches to check indexing status of new pages
- Performance data: Pull clicks, impressions, CTR by page template to identify underperforming page types
- Coverage issues: Monitor for "Crawled — currently not indexed" and "Discovered — currently not indexed" trends
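URL Inspection API calls are rate-limited, so batching and pacing is the main implementation concern. A plain chunking helper covers the batching half — the batch size is an assumption, not an API constant:

```typescript
// Split a list of page URLs into fixed-size batches for paced API calls
function chunk<T>(items: T[], size: number): T[][] {
  const out: T[][] = [];
  for (let i = 0; i < items.length; i += size) {
    out.push(items.slice(i, i + size));
  }
  return out;
}
```

Each batch is then inspected with a delay between batches to stay within quota.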
7. Edge and CDN Architecture
At 100K pages, the origin server can't handle all traffic directly.
Caching Strategy
```
Client → CDN Edge → Origin (Next.js/framework)
                        ↓
               Cache Layer (Redis/edge KV)
                        ↓
                    Database
```

CDN configuration:

- Cache all pSEO pages at the edge with `s-maxage=86400, stale-while-revalidate=3600`
- Use ISR revalidation to refresh cached pages (not full rebuilds)
- Set longer TTLs for stable pages (30 days), shorter for dynamic data pages (1 day)
Cache invalidation:
- On data change → invalidate the specific page's cache via on-demand revalidation API
- On category change → invalidate all pages in the category
- On template change → purge the CDN for all pages of that template type
Edge rendering (if supported):
- Deploy to edge runtimes (Vercel Edge, Cloudflare Workers) for < 50ms TTFB globally
- Not all frameworks support edge rendering — check compatibility
- Edge functions have memory limits (~128MB) that constrain complex data operations
Database Connection from Edge
Edge functions can't maintain persistent database connections. Options:
- HTTP-based database (PlanetScale, Neon serverless driver, Supabase edge functions)
- Edge KV store (Cloudflare KV, Vercel KV) for index-tier data with database as source of truth
- Pre-generated JSON at build time for index-tier data, database only for full page content
8. Scale-Specific Build Strategy
At 100K pages, the build process itself needs architecture.
Don't Build All Pages
```typescript
// At 100K, only pre-build the most important pages
export async function generateStaticParams() {
  // Pre-build: hub pages + top 1K pages by traffic/priority
  const hubs = await getAllCategories();
  const topPages = await db.query(
    `SELECT slug, category FROM pages
     WHERE status = 'published' AND data_sufficiency_score >= 80
     ORDER BY priority DESC LIMIT 1000`
  );
  return [
    ...hubs.map(h => ({ category: h.slug })),
    ...topPages.map(p => ({ category: p.category, slug: p.slug })),
  ];
}

// ISR handles the remaining 99K pages on first request
export const dynamicParams = true;
export const revalidate = 86400; // 24 hours
```
Build Time Budget
| Pages pre-built | Expected build time | Memory |
|---|---|---|
| 1K (hubs + top pages) | 5-15 minutes | 2-4GB |
| 5K | 15-45 minutes | 4-6GB |
| 10K | 30-90 minutes | 6-8GB |
| 100K (DON'T DO THIS) | 5-15 hours | 16GB+ |
Rule: Never pre-build more than 10K pages. Use ISR for everything else.
Warm-Up After Deploy
After deploying, the ISR cache is cold. The first visitor to each page triggers generation. For critical pages:
```typescript
// scripts/warm-cache.ts — run after deploy
async function warmCache() {
  const priorityPages = await db.query(
    `SELECT canonical_path FROM pages
     WHERE data_sufficiency_score >= 80 AND status = 'published'
     ORDER BY priority DESC LIMIT 5000`
  );
  // Hit each page to trigger ISR generation (rate-limited)
  for (const page of priorityPages) {
    await fetch(`${baseUrl}${page.canonical_path}`);
    await sleep(100); // 10 pages/second — don't DDoS yourself
  }
}
```
Checklist
- Database is the primary data store (not JSON files or in-memory arrays)
- Required indexes exist on slug, category, status, last_modified, sufficiency_score
- Data sufficiency scoring is implemented and stored per page
- Pages with score < 60 are gated from generation
- Combination pages are gated on both dimensions
- Content enrichment pipeline exists (structured data first, LLM-assisted only with review)
- Incremental validation is implemented (delta + periodic full scan)
- Content hashes are stored for change detection
- Sitemap is split by category with index file
- Sitemap excludes pages below sufficiency threshold
- Crawl budget is monitored via Search Console
- Monitoring script runs daily with alerts
- CDN caching is configured with appropriate TTLs
- ISR handles the long tail (only top pages pre-built)
- Cache warm-up script exists for post-deploy
- Connection pooling is configured for build-time queries
- No function loads > 10K full page records into memory
Relationship to Other Skills
- Extends: pseo-data (replaces in-memory patterns with database), pseo-performance (adds CDN/edge and scale-specific build strategy), pseo-quality-guard (adds incremental validation)
- Depends on: All content and structure skills must be in place before scaling
- Validated by: pseo-quality-guard (quality doesn't change — scale does)
- Works with: pseo-orchestrate (scale considerations at every phase)