
pSEO Scale Architecture


Architect pSEO systems that work at 10K-100K+ pages. The patterns in the other pseo-* skills are correct at 1K-10K. Beyond 10K, in-memory data layers, full-corpus validation, and single-deploy rollouts break down. This skill provides the architecture changes needed at scale.

Scale Tiers


Tier        Pages       Data layer                     Validation                   Rollout                Sitemap
Small       < 1K        JSON/files, in-memory          Full pairwise                Single deploy          Single file
Medium      1K-10K      Files or DB, two-tier memory   Fingerprint-based            2-4 week batches       Index + children
Large       10K-50K     Database required              Incremental + sampling       Category-by-category   Index + chunked
Very Large  50K-100K+   Database + cache layer         Delta-only + periodic full   ISR + sitemap waves    Index + streaming
This skill focuses on the Large and Very Large tiers. If the project is under 10K pages, the standard pseo-* skills are sufficient.
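The tier boundaries above can be encoded directly so build scripts branch on corpus size. A minimal sketch (the function names are illustrative, not part of any pseo-* API):

```typescript
type Tier = "small" | "medium" | "large" | "very-large";

// Encode the tier table so tooling can branch on corpus size
function tierFor(pageCount: number): Tier {
  if (pageCount < 1_000) return "small";
  if (pageCount < 10_000) return "medium";
  if (pageCount < 50_000) return "large";
  return "very-large";
}

// Example: decide whether the standard pseo-* patterns still apply
function needsScaleArchitecture(pageCount: number): boolean {
  const tier = tierFor(pageCount);
  return tier === "large" || tier === "very-large";
}
```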

1. Database-Backed Data Layer


At 10K+ pages, JSON files and in-memory arrays stop working. The data layer must move to a database with proper indexing.

Why In-Memory Breaks


Pages    PageIndex in memory    Full content (if loaded)
1K       ~1MB                   ~100-500MB
10K      ~10MB                  ~1-5GB (OOM)
50K      ~50MB (borderline)     ~5-25GB (impossible)
100K     ~100MB                 ~10-50GB (impossible)
At 50K+, even holding all PageIndex records in memory is borderline. At 100K, getAllSlugs() returning an array of 100K objects takes ~100MB and seconds to deserialize. You need cursor-based iteration.

Database Requirements


Minimum schema:
sql
CREATE TABLE pages (
  id            SERIAL PRIMARY KEY,
  slug          TEXT UNIQUE NOT NULL,
  canonical_path TEXT UNIQUE NOT NULL,
  title         TEXT NOT NULL,
  h1            TEXT NOT NULL,
  meta_description TEXT NOT NULL,
  category      TEXT NOT NULL,
  subcategory   TEXT,
  status        TEXT DEFAULT 'published',
  last_modified TIMESTAMPTZ NOT NULL,
  published_at  TIMESTAMPTZ,
  -- Heavy fields (only loaded per-page)
  intro_text    TEXT,
  body_content  TEXT,
  faqs          JSONB,
  related_slugs TEXT[],
  featured_image JSONB,
  -- Scale fields
  data_sufficiency_score REAL,  -- see section 2
  content_hash  TEXT,           -- for incremental validation
  priority      REAL DEFAULT 0.5, -- build, sitemap, and warm-up ordering (sections 5 and 8)
  last_validated TIMESTAMPTZ
);

-- Required indexes for pSEO queries
CREATE INDEX idx_pages_category ON pages(category);
CREATE INDEX idx_pages_status ON pages(status) WHERE status = 'published';
CREATE INDEX idx_pages_slug ON pages(slug);
CREATE INDEX idx_pages_last_modified ON pages(last_modified DESC);
CREATE INDEX idx_pages_sufficiency ON pages(data_sufficiency_score);
CREATE INDEX idx_pages_category_status ON pages(category, status);
Categories table:
sql
CREATE TABLE categories (
  slug          TEXT PRIMARY KEY,
  name          TEXT NOT NULL,
  description   TEXT,
  parent_slug   TEXT REFERENCES categories(slug),
  page_count    INT DEFAULT 0,
  last_modified TIMESTAMPTZ
);
Redirects table:
sql
CREATE TABLE redirects (
  source      TEXT PRIMARY KEY,
  destination TEXT NOT NULL,
  created_at  TIMESTAMPTZ DEFAULT NOW()
);
CREATE INDEX idx_redirects_destination ON redirects(destination);
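One operation the redirects table must support at scale is a slug change: update the page, record a 301, and collapse any existing redirect chains pointing at the old path, all in one transaction. A sketch that returns the statements to run (the helper name and shape are assumptions; execute them inside your driver's transaction API):

```typescript
type Statement = { sql: string; params: string[] };

// Hypothetical helper: the statements a slug change requires.
// Run them inside BEGIN/COMMIT so a crash can't leave a dangling redirect.
function slugChangeStatements(slug: string, oldPath: string, newPath: string): Statement[] {
  return [
    {
      sql: "UPDATE pages SET canonical_path = $1, last_modified = NOW() WHERE slug = $2",
      params: [newPath, slug],
    },
    {
      // Upsert keeps this idempotent if the same slug changes twice
      sql: "INSERT INTO redirects (source, destination) VALUES ($1, $2) ON CONFLICT (source) DO UPDATE SET destination = EXCLUDED.destination",
      params: [oldPath, newPath],
    },
    {
      // Collapse chains: anything that pointed at the old path now points
      // at the new one (this is what idx_redirects_destination is for)
      sql: "UPDATE redirects SET destination = $2 WHERE destination = $1",
      params: [oldPath, newPath],
    },
  ];
}
```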

Data Layer API at Scale


The pseo-data API contract changes at scale:
typescript
// getAllSlugs() must support cursor-based iteration at 50K+
async function* getAllSlugsCursor(): AsyncGenerator<{ slug: string; category: string }> {
  let cursor: string | null = null;
  while (true) {
    const batch = await db.query(
      `SELECT slug, category FROM pages
       WHERE status = 'published'
       ${cursor ? `AND slug > $1` : ''}
       ORDER BY slug LIMIT 1000`,
      cursor ? [cursor] : []
    );
    if (batch.length === 0) break;
    for (const row of batch) yield row;
    cursor = batch[batch.length - 1].slug;
  }
}

// getPagesByCategory() must use DB pagination, not in-memory slicing
async function getPagesByCategory(
  category: string,
  opts?: { limit?: number; offset?: number }
): Promise<PageIndex[]> {
  return db.query(
    `SELECT slug, title, h1, meta_description, canonical_path, category, last_modified
     FROM pages
     WHERE category = $1 AND status = 'published'
     ORDER BY title
     LIMIT $2 OFFSET $3`,
    [category, opts?.limit ?? 50, opts?.offset ?? 0]
  );
}

// getRelatedPages() needs a precomputed index or efficient query
async function getRelatedPages(slug: string, limit = 5): Promise<PageIndex[]> {
  // Option A: Use related_slugs array from the page record
  // Option B: Same category + shared tags query
  // Option C: Precomputed relatedness table (best at 50K+)
  return db.query(
    `SELECT p.slug, p.title, p.h1, p.meta_description, p.canonical_path,
            p.category, p.last_modified
     FROM pages p
     JOIN pages source ON source.slug = $1
     WHERE p.category = source.category
       AND p.slug != $1
       AND p.status = 'published'
     ORDER BY p.last_modified DESC
     LIMIT $2`,
    [slug, limit]
  );
}

Connection Pooling


At build time with parallel page generation, you need connection pooling:
typescript
import { Pool } from "pg";

const pool = new Pool({
  max: 10,                    // limit concurrent connections during build
  idleTimeoutMillis: 30000,
  connectionTimeoutMillis: 5000,
});
ORM alternative: If using Prisma or Drizzle, configure connection pool limits in the ORM config. Default pool sizes are often too high for build processes.
See references/database-patterns.md for full patterns by database type.
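If the build fans out page generation with Promise.all, the pool's max is only safe when the caller also caps concurrency. A minimal semaphore sketch (libraries like p-limit do the same thing):

```typescript
// Tiny concurrency cap so parallel build tasks never exceed the pool's `max`
function createLimiter(max: number) {
  let active = 0;
  const queue: Array<() => void> = [];
  return async function limit<T>(task: () => Promise<T>): Promise<T> {
    if (active >= max) await new Promise<void>(resolve => queue.push(resolve));
    active++;
    try {
      return await task();
    } finally {
      active--;
      queue.shift()?.(); // wake the next waiter, if any
    }
  };
}
```

Wrap each page-generation task as limit(() => renderPage(slug)) and at most max tasks touch the pool at once.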

2. Data Sufficiency Gating


At 100K pages, many combinations will produce thin content. Gate page generation BEFORE build time — don't create pages that will fail quality checks.

Sufficiency Score


Compute a score per potential page based on available data:
typescript
function computeSufficiencyScore(record: RawRecord): number {
  let score = 0;
  const weights = {
    hasTitle: 10,
    hasDescription: 10,           // > 50 chars
    hasBodyContent: 20,           // > 200 words
    hasFAQs: 15,                  // >= 3 Q&A pairs
    hasUniqueAttributes: 15,      // >= 3 non-null structured attributes
    hasImage: 5,
    hasCategory: 10,
    hasNumericData: 10,           // stats, ratings, prices — LLM citation signal
    hasSourceCitation: 5,         // data provenance for E-E-A-T
  };

  if ((record.title?.length ?? 0) > 10) score += weights.hasTitle;
  if ((record.description?.length ?? 0) > 50) score += weights.hasDescription;
  if (wordCount(record.bodyContent) > 200) score += weights.hasBodyContent;
  if ((record.faqs?.length ?? 0) >= 3) score += weights.hasFAQs;
  // ... etc

  return score; // 0-100
}

Gating Thresholds


Score    Action
80-100   Generate page — sufficient data
60-79    Generate page with enrichment flag — mark for content pipeline
40-59    Hold — do not generate until data is enriched
0-39     Reject — insufficient data, do not generate
Store the score in the database (data_sufficiency_score column) so you can:
  • Query how many pages are gated vs. ready
  • Track enrichment progress over time
  • Re-score after data enrichment
  • Gate at build time: WHERE data_sufficiency_score >= 60 AND status = 'published'
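The threshold table maps directly to a band function plus one SQL rollup for monitoring how the corpus is distributed (the band names are illustrative):

```typescript
type GateAction = "generate" | "enrich" | "hold" | "reject";

// The threshold bands from the table above
function gateAction(score: number): GateAction {
  if (score >= 80) return "generate";
  if (score >= 60) return "enrich"; // generate with enrichment flag
  if (score >= 40) return "hold";
  return "reject";
}

// One query to see gated vs. ready counts across the whole corpus
const BAND_SQL = `
  SELECT CASE
           WHEN data_sufficiency_score >= 80 THEN 'generate'
           WHEN data_sufficiency_score >= 60 THEN 'enrich'
           WHEN data_sufficiency_score >= 40 THEN 'hold'
           ELSE 'reject'
         END AS band, COUNT(*) AS n
  FROM pages
  GROUP BY band`;
```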

Combination Gating


For combination pages (service × city, product × use-case), both dimensions must have sufficient data:
typescript
function gateCombination(dimA: RawRecord, dimB: RawRecord): boolean {
  // Both dimensions must independently clear a minimum bar
  const scoreA = computeSufficiencyScore(dimA);
  const scoreB = computeSufficiencyScore(dimB);

  // The combination itself must produce enough unique content
  // that it's not just "dimA text + dimB text" pasted together
  const combinationHasUniqueContent =
    (dimA.attributes?.length ?? 0) >= 2 &&
    (dimB.attributes?.length ?? 0) >= 2;

  return scoreA >= 40 && scoreB >= 40 && combinationHasUniqueContent;
}
This is the most important scale pattern. 500 services × 200 cities = 100K combinations, but maybe only 15K have enough data for a quality page. Generate only the 15K.
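Enumerating the cross-product lazily keeps the ~85K rejected combinations from ever being materialized. A sketch with the gate abstracted as a predicate:

```typescript
// Walk the full cross-product but yield only gated-in combinations,
// so 100K candidates may materialize as only the ~15K worth building
function* gatedCombinations<A, B>(
  dimsA: A[],
  dimsB: B[],
  gate: (a: A, b: B) => boolean
): Generator<[A, B]> {
  for (const a of dimsA) {
    for (const b of dimsB) {
      if (gate(a, b)) yield [a, b];
    }
  }
}
```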

3. Content Enrichment Pipeline


At 100K pages, you can't manually write intros, FAQs, and descriptions. You need an automated enrichment pipeline with quality controls.

Pipeline Architecture


Raw data (DB/CMS/API)
  ↓
Data sufficiency scoring ──→ Reject/hold insufficient records
  ↓
Automated enrichment ──→ Generate intros, FAQs, summaries from structured data
  ↓
Quality sampling ──→ Human reviews a random sample (5-10%) per batch
  ↓
Publish gate ──→ Only records with score >= 60 and enrichment complete
  ↓
Page generation

Enrichment Sources (non-LLM)


Before reaching for LLM generation, exhaust structured data enrichment:
  • Template composition from multiple fields: Combine 3+ data fields into prose. Not "Best {service} in {city}" but structured sentences using attributes, stats, and category context.
  • Aggregation: Roll up child data into parent summaries (category stats, comparison tables, top-N lists)
  • Cross-referencing: Enrich records by joining data sources (product + reviews, service + location demographics, listing + category averages)
  • FAQ generation from data patterns: Turn common attribute variations into Q&A pairs ("How much does X cost in Y?" → answer from price data)
  • Comparison data: Auto-generate comparison sections from sibling records in the same category
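As a concrete sketch of the FAQ-from-data pattern, assuming a record with min/max price fields (the template wording and field names are illustrative):

```typescript
type PriceRecord = { service: string; city: string; minPrice: number; maxPrice: number };

// Turn a structured price record into a Q&A pair; no LLM involved
function priceFaq(r: PriceRecord): { question: string; answer: string } {
  return {
    question: `How much does ${r.service} cost in ${r.city}?`,
    answer:
      `${r.service} in ${r.city} typically costs between $${r.minPrice} and ` +
      `$${r.maxPrice}, based on listed provider pricing.`,
  };
}
```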

LLM-Assisted Enrichment


If structured enrichment is insufficient, LLM-assisted generation is acceptable under strict conditions:
  1. Never generate the entire page content — LLM fills gaps in an otherwise data-driven page
  2. Always ground in real data — the LLM prompt includes the record's actual attributes, stats, and context
  3. Human review sampling — review 5-10% of LLM-generated content per batch before publishing
  4. Store the generation metadata — track which fields were LLM-generated vs. sourced from data
  5. Apply quality guard after enrichment — run pseo-quality-guard on enriched content before publishing
  6. Regenerate periodically — stale LLM content should be refreshed when underlying data changes
Google's position (2025): AI-generated content is acceptable if it provides genuine value. The risk is not the generation method but the output quality. Scaled LLM generation that produces interchangeable pages will trigger the same penalties as template spam.

4. Incremental Validation


At 100K pages, full-corpus quality checks take hours. Switch to incremental validation.

Delta Validation


Only validate pages that changed since the last validation run:
typescript
async function getChangedPages(since: Date): Promise<string[]> {
  const result = await db.query(
    `SELECT slug FROM pages
     WHERE (last_modified > $1 OR last_validated IS NULL)
       AND status = 'published'`,
    [since]
  );
  return result.map(r => r.slug);
}
Content hashing: Store a hash of each page's rendered content. On validation, only re-check pages whose hash changed:
typescript
import { createHash } from "crypto";

function contentHash(page: BaseSEOContent): string {
  const content = `${page.title}|${page.h1}|${page.metaDescription}|${page.bodyContent}`;
  return createHash("sha256").update(content).digest("hex").slice(0, 16);
}

Periodic Full Scan


Run a complete validation weekly or before major releases. For the full scan at 100K:
  • Parallelize: Run 4-8 validation workers, each processing a category partition
  • Sample cross-category similarity: Don't compare all 100K × 100K. Compare each page against 50 random pages from other categories + all pages in the same category.
  • Stream results to disk: Write validation results to a JSONL file, then aggregate. Don't accumulate all results in memory.
typescript
// Parallel validation by category
const categories = await getAllCategories();
const workers = categories.map(cat =>
  validateCategory(cat.slug) // each worker handles one category
);
const results = await Promise.all(workers);
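The stream-to-disk step can be sketched with Node streams: one JSON line per result on the way out, then a line-by-line second pass to aggregate, so neither direction holds the full result set (the file path and result shape are assumptions):

```typescript
import { createReadStream, createWriteStream } from "node:fs";
import { createInterface } from "node:readline";

type ValidationResult = { slug: string; passed: boolean; similarity: number };

// Append one JSON line per result; memory stays O(1) in corpus size
async function writeResults(path: string, results: AsyncIterable<ValidationResult>) {
  const out = createWriteStream(path, { flags: "a" });
  for await (const r of results) out.write(JSON.stringify(r) + "\n");
  await new Promise<void>(resolve => out.end(() => resolve()));
}

// Second streaming pass: aggregate without loading the file
async function countFailures(path: string): Promise<number> {
  let failures = 0;
  for await (const line of createInterface({ input: createReadStream(path) })) {
    if (!JSON.parse(line).passed) failures++;
  }
  return failures;
}
```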

Validation Budget


Scale   Delta validation   Full validation
10K     Minutes            30-60 minutes
50K     Minutes            2-4 hours
100K    Minutes            4-8 hours
Optimization: Pre-compute and store fingerprints (MinHash signatures) in the database. Validation then only compares fingerprints, not full content.
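A minimal MinHash sketch, assuming word 3-gram shingles and sha256-derived hash families (a real deployment would use proper universal hashing and banding/LSH on top; this only illustrates the signature-comparison idea):

```typescript
import { createHash } from "node:crypto";

// Word 3-gram shingles of a page's text
function shingles(text: string, size = 3): Set<string> {
  const words = text.toLowerCase().split(/\s+/).filter(Boolean);
  const out = new Set<string>();
  for (let i = 0; i + size <= words.length; i++) out.add(words.slice(i, i + size).join(" "));
  return out;
}

// k minimum hash values, one per seeded hash view of the shingle set
function minhashSignature(text: string, k = 32): number[] {
  const sig: number[] = [];
  for (let seed = 0; seed < k; seed++) {
    let min = Infinity;
    for (const s of shingles(text)) {
      const h = parseInt(createHash("sha256").update(seed + "|" + s).digest("hex").slice(0, 8), 16);
      if (h < min) min = h;
    }
    sig.push(min);
  }
  return sig;
}

// Estimated Jaccard similarity = fraction of matching signature slots
function signatureSimilarity(a: number[], b: number[]): number {
  let same = 0;
  for (let i = 0; i < a.length; i++) if (a[i] === b[i]) same++;
  return same / a.length;
}
```

Store the signature per page; validation then compares short integer arrays instead of re-reading full content.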

5. Crawl Budget Management


At 100K pages, Google won't crawl everything immediately. Crawl budget — the number of pages Google crawls per day — becomes a constraint.

Sitemap Submission Strategy


sitemap-index.xml
├── sitemap-category-a.xml     (≤ 50,000 URLs)
├── sitemap-category-b.xml
├── sitemap-category-c.xml
└── ...
Submission cadence:
  • Submit the sitemap index to both Google Search Console and Bing Webmaster Tools
  • Update individual category sitemaps as pages are added
  • Don't submit all 100K URLs at once — Google throttles crawling for new sites with sudden URL spikes
Programmatic sitemap submission:
typescript
// Use Google Indexing API for high-priority pages (job postings, livestreams)
// For standard pages, rely on sitemap discovery + Search Console

// Batch sitemap updates by category
async function updateCategorySitemap(category: string) {
  const pages = await getPagesByCategory(category, { limit: 50000 });
  const xml = generateSitemapXml(pages);
  await writeFile(`public/sitemap-${category}.xml`, xml);
}

Crawl Budget Optimization


  • Prioritize high-value pages: Set priority in sitemap to guide Googlebot (0.8 for hubs, 0.6 for pages with high sufficiency scores, 0.4 for the rest)
  • Remove low-quality pages from sitemap: Pages with sufficiency score < 60 should not be in the sitemap
  • Fix crawl waste: Ensure no soft 404s, no redirect chains, no parameter-based duplicates — all waste crawl budget
  • Server response time: Keep TTFB < 500ms. Slow servers get less crawl budget.
  • Monitor crawl stats: Check Google Search Console → Settings → Crawl stats weekly. If crawl rate drops, investigate.
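The priority tiers in the first bullet can be a one-liner shared by the sitemap generator (the >= 80 cutoff for "high sufficiency" is an assumption; pick whatever matches your gating bands):

```typescript
// Hubs 0.8, high-sufficiency pages 0.6, everything else 0.4
function sitemapPriority(page: { isHub: boolean; sufficiencyScore: number }): number {
  if (page.isHub) return 0.8;
  if (page.sufficiencyScore >= 80) return 0.6; // assumed cutoff for "high"
  return 0.4;
}
```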

Indexing Rate Expectations


Google does not guarantee indexing all pages. Realistic expectations:
Site authority            Pages submitted   Likely indexed (6 months)
New site                  100K              10-30%
Established (DR 30-50)    100K              40-70%
High authority (DR 60+)   100K              70-90%
If indexing rate is low:
  1. Improve content quality on indexed pages first
  2. Earn more backlinks to category hubs
  3. Reduce total page count (prune thin pages) — a smaller, higher-quality corpus often indexes better than a large, mediocre one
  4. Ensure internal linking reaches every page within 3 clicks

6. Monitoring at Scale


At 100K pages, you can't manually check pages. Build automated monitoring.

Key Metrics to Track


Metric                          Source                      Alert threshold
Pages indexed                   Google Search Console API   Drops > 5% week-over-week
Crawl rate                      Google Search Console API   Drops > 20%
Crawl errors (5xx, 404)         Server logs, GSC            > 1% of total pages
CWV regressions                 CrUX API or RUM             LCP > 4s or CLS > 0.25 on any template
Build duration                  CI/CD logs                  > 2x baseline
Build memory peak               CI/CD logs                  > 80% of available memory
Page count by status            Database                    Published count deviates from expected
Sufficiency score distribution  Database                    > 20% of published pages below threshold

Automated Monitoring Script


typescript
// scripts/monitor-pseo.ts — run daily via cron or CI
const count = async (sql: string): Promise<number> =>
  Number((await db.query(sql))[0].count); // COUNT(*) comes back as a string row field

async function monitor() {
  const metrics = {
    totalPublished: await count("SELECT COUNT(*) FROM pages WHERE status = 'published'"),
    avgSufficiency: Number(
      (await db.query("SELECT AVG(data_sufficiency_score) AS avg FROM pages WHERE status = 'published'"))[0].avg
    ),
    belowThreshold: await count("SELECT COUNT(*) FROM pages WHERE data_sufficiency_score < 60 AND status = 'published'"),
    recentlyModified: await count("SELECT COUNT(*) FROM pages WHERE last_modified > NOW() - INTERVAL '7 days'"),
    neverValidated: await count("SELECT COUNT(*) FROM pages WHERE last_validated IS NULL AND status = 'published'"),
    redirectCount: await count("SELECT COUNT(*) FROM redirects"),
    brokenRedirects: await count(
      "SELECT COUNT(*) FROM redirects r WHERE NOT EXISTS (SELECT 1 FROM pages p WHERE p.canonical_path = r.destination)"
    ),
  };

  // Output report or send to monitoring service
  console.log(JSON.stringify(metrics, null, 2));

  // Alert on critical conditions
  if (metrics.brokenRedirects > 0) console.error("ALERT: Broken redirects found");
  if (metrics.belowThreshold / metrics.totalPublished > 0.2) {
    console.error("ALERT: >20% of published pages below sufficiency threshold");
  }
}

Search Console API Integration


At 100K pages, manual Search Console checks are impractical. Use the API:
  • Indexing status: Query the URL Inspection API in batches to check indexing status of new pages
  • Performance data: Pull clicks, impressions, CTR by page template to identify underperforming page types
  • Coverage issues: Monitor for "Crawled — currently not indexed" and "Discovered — currently not indexed" trends
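The URL Inspection API accepts one URL per request, so "batches" in practice means chunking your own request loop and pacing it under quota. A sketch of the request-body construction (the inspect endpoint takes inspectionUrl and siteUrl; the chunk size here is an arbitrary pacing unit, not an API parameter):

```typescript
type InspectRequest = { inspectionUrl: string; siteUrl: string };

// Chunk new-page URLs into paced batches of inspection request bodies
function buildInspectionRequests(
  urls: string[],
  siteUrl: string,
  chunkSize = 50
): InspectRequest[][] {
  const chunks: InspectRequest[][] = [];
  for (let i = 0; i < urls.length; i += chunkSize) {
    chunks.push(urls.slice(i, i + chunkSize).map(u => ({ inspectionUrl: u, siteUrl })));
  }
  return chunks;
}
```

Each request body is then POSTed to the inspect endpoint with an authorized client; sleep between chunks to stay under the daily quota.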

7. Edge and CDN Architecture


At 100K pages, the origin server can't handle all traffic directly.

Caching Strategy


Client → CDN Edge → Origin (Next.js/framework)
                        ↓
              Cache Layer (Redis/edge KV)
                        ↓
                    Database
CDN configuration:
  • Cache all pSEO pages at the edge with s-maxage=86400, stale-while-revalidate=3600
  • Use ISR revalidation to refresh cached pages (not full rebuilds)
  • Set longer TTLs for stable pages (30 days), shorter for dynamic data pages (1 day)
Cache invalidation:
  • On data change → invalidate the specific page's cache via on-demand revalidation API
  • On category change → invalidate all pages in the category
  • On template change → purge the CDN for all pages of that template type
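The three invalidation rules can live in one pure mapping from change event to paths, which the deploy hook then feeds to the framework's on-demand revalidation API (the event shape is an assumption; template-change purges usually go through the CDN's own API instead):

```typescript
type ChangeEvent =
  | { kind: "page"; canonicalPath: string }
  | { kind: "category"; categorySlug: string; pagePaths: string[] };

// Data change → one path; category change → the hub plus its pages
function pathsToInvalidate(e: ChangeEvent): string[] {
  switch (e.kind) {
    case "page":
      return [e.canonicalPath];
    case "category":
      return [`/${e.categorySlug}`, ...e.pagePaths];
  }
}
```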
Edge rendering (if supported):
  • Deploy to edge runtimes (Vercel Edge, Cloudflare Workers) for < 50ms TTFB globally
  • Not all frameworks support edge rendering — check compatibility
  • Edge functions have memory limits (~128MB) that constrain complex data operations

Database Connection from Edge


Edge functions can't maintain persistent database connections. Options:
  • HTTP-based database (PlanetScale, Neon serverless driver, Supabase edge functions)
  • Edge KV store (Cloudflare KV, Vercel KV) for index-tier data with database as source of truth
  • Pre-generated JSON at build time for index-tier data, database only for full page content

8. Scale-Specific Build Strategy


At 100K pages, the build process itself needs architecture.

Don't Build All Pages


typescript
// At 100K, only pre-build the most important pages
export async function generateStaticParams() {
  // Pre-build: hub pages + top 1K pages by traffic/priority
  const hubs = await getAllCategories();
  const topPages = await db.query(
    `SELECT slug, category FROM pages
     WHERE status = 'published' AND data_sufficiency_score >= 80
     ORDER BY priority DESC LIMIT 1000`
  );
  return [
    ...hubs.map(h => ({ category: h.slug })),
    ...topPages.map(p => ({ category: p.category, slug: p.slug })),
  ];
}

// ISR handles the remaining 99K pages on first request
export const dynamicParams = true;
export const revalidate = 86400; // 24 hours

Build Time Budget


Pages pre-built         Expected build time   Memory
1K (hubs + top pages)   5-15 minutes          2-4GB
5K                      15-45 minutes         4-6GB
10K                     30-90 minutes         6-8GB
100K (DON'T DO THIS)    5-15 hours            16GB+
Rule: Never pre-build more than 10K pages. Use ISR for everything else.

Warm-Up After Deploy


After deploying, the ISR cache is cold. The first visitor to each page triggers generation. For critical pages:
typescript
// scripts/warm-cache.ts — run after deploy
const sleep = (ms: number) => new Promise<void>(r => setTimeout(r, ms));

async function warmCache() {
  const priorityPages = await db.query(
    `SELECT canonical_path FROM pages
     WHERE data_sufficiency_score >= 80 AND status = 'published'
     ORDER BY priority DESC LIMIT 5000`
  );

  // Hit each page to trigger ISR generation (rate-limited)
  for (const page of priorityPages) {
    await fetch(`${baseUrl}${page.canonical_path}`);
    await sleep(100); // 10 pages/second — don't DDoS yourself
  }
}

Checklist


  • Database is the primary data store (not JSON files or in-memory arrays)
  • Required indexes exist on slug, category, status, last_modified, sufficiency_score
  • Data sufficiency scoring is implemented and stored per page
  • Pages with score < 60 are gated from generation
  • Combination pages are gated on both dimensions
  • Content enrichment pipeline exists (structured data first, LLM-assisted only with review)
  • Incremental validation is implemented (delta + periodic full scan)
  • Content hashes are stored for change detection
  • Sitemap is split by category with index file
  • Sitemap excludes pages below sufficiency threshold
  • Crawl budget is monitored via Search Console
  • Monitoring script runs daily with alerts
  • CDN caching is configured with appropriate TTLs
  • ISR handles the long tail (only top pages pre-built)
  • Cache warm-up script exists for post-deploy
  • Connection pooling is configured for build-time queries
  • No function loads > 10K full page records into memory

Relationship to Other Skills


  • Extends: pseo-data (replaces in-memory patterns with database), pseo-performance (adds CDN/edge and scale-specific build strategy), pseo-quality-guard (adds incremental validation)
  • Depends on: All content and structure skills must be in place before scaling
  • Validated by: pseo-quality-guard (quality doesn't change — scale does)
  • Works with: pseo-orchestrate (scale considerations at every phase)