
pSEO Scale Architecture


Architect pSEO systems that work at 10K-100K+ pages. The patterns in the other pseo-* skills are correct at 1K-10K. Beyond 10K, in-memory data layers, full-corpus validation, and single-deploy rollouts break down. This skill provides the architecture changes needed at scale.

Scale Tiers


Tier        Pages       Data layer                     Validation                   Rollout                Sitemap
Small       < 1K        JSON/files, in-memory          Full pairwise                Single deploy          Single file
Medium      1K-10K      Files or DB, two-tier memory   Fingerprint-based            2-4 week batches       Index + children
Large       10K-50K     Database required              Incremental + sampling       Category-by-category   Index + chunked
Very Large  50K-100K+   Database + cache layer         Delta-only + periodic full   ISR + sitemap waves    Index + streaming
This skill focuses on the Large and Very Large tiers. If the project is under 10K pages, the standard pseo-* skills are sufficient.
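The tier boundaries above can be encoded directly so build scripts branch on corpus size. A minimal sketch (the function names are illustrative, not part of any pseo-* API):

```typescript
type Tier = "small" | "medium" | "large" | "very-large";

// Encode the tier table so tooling can branch on corpus size
function tierFor(pageCount: number): Tier {
  if (pageCount < 1_000) return "small";
  if (pageCount < 10_000) return "medium";
  if (pageCount < 50_000) return "large";
  return "very-large";
}

// Example: decide whether the standard pseo-* patterns still apply
function needsScaleArchitecture(pageCount: number): boolean {
  const tier = tierFor(pageCount);
  return tier === "large" || tier === "very-large";
}
```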

1. Database-Backed Data Layer


At 10K+ pages, JSON files and in-memory arrays stop working. The data layer must move to a database with proper indexing.

Why In-Memory Breaks


Pages    PageIndex in memory    Full content (if loaded)
1K       ~1MB                   ~100-500MB
10K      ~10MB                  ~1-5GB (OOM)
50K      ~50MB (borderline)     ~5-25GB (impossible)
100K     ~100MB                 ~10-50GB (impossible)
At 50K+, even holding all PageIndex records in memory is borderline. At 100K, getAllSlugs() returning an array of 100K objects takes ~100MB and seconds to deserialize. You need cursor-based iteration.

Database Requirements


Minimum schema:
sql
CREATE TABLE pages (
  id            SERIAL PRIMARY KEY,
  slug          TEXT UNIQUE NOT NULL,
  canonical_path TEXT UNIQUE NOT NULL,
  title         TEXT NOT NULL,
  h1            TEXT NOT NULL,
  meta_description TEXT NOT NULL,
  category      TEXT NOT NULL,
  subcategory   TEXT,
  status        TEXT DEFAULT 'published',
  last_modified TIMESTAMPTZ NOT NULL,
  published_at  TIMESTAMPTZ,
  -- Heavy fields (only loaded per-page)
  intro_text    TEXT,
  body_content  TEXT,
  faqs          JSONB,
  related_slugs TEXT[],
  featured_image JSONB,
  -- Scale fields
  data_sufficiency_score REAL,  -- see section 2
  content_hash  TEXT,           -- for incremental validation
  priority      REAL DEFAULT 0.5, -- build, sitemap, and warm-up ordering (sections 5 and 8)
  last_validated TIMESTAMPTZ
);

-- Required indexes for pSEO queries
CREATE INDEX idx_pages_category ON pages(category);
CREATE INDEX idx_pages_status ON pages(status) WHERE status = 'published';
CREATE INDEX idx_pages_slug ON pages(slug);
CREATE INDEX idx_pages_last_modified ON pages(last_modified DESC);
CREATE INDEX idx_pages_sufficiency ON pages(data_sufficiency_score);
CREATE INDEX idx_pages_category_status ON pages(category, status);
Categories table:
sql
CREATE TABLE categories (
  slug          TEXT PRIMARY KEY,
  name          TEXT NOT NULL,
  description   TEXT,
  parent_slug   TEXT REFERENCES categories(slug),
  page_count    INT DEFAULT 0,
  last_modified TIMESTAMPTZ
);
Redirects table:
sql
CREATE TABLE redirects (
  source      TEXT PRIMARY KEY,
  destination TEXT NOT NULL,
  created_at  TIMESTAMPTZ DEFAULT NOW()
);
CREATE INDEX idx_redirects_destination ON redirects(destination);
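One operation the redirects table must support at scale is a slug change: update the page, record a 301, and collapse any existing redirect chains pointing at the old path, all in one transaction. A sketch that returns the statements to run (the helper name and shape are assumptions; execute them inside your driver's transaction API):

```typescript
type Statement = { sql: string; params: string[] };

// Hypothetical helper: the statements a slug change requires.
// Run them inside BEGIN/COMMIT so a crash can't leave a dangling redirect.
function slugChangeStatements(slug: string, oldPath: string, newPath: string): Statement[] {
  return [
    {
      sql: "UPDATE pages SET canonical_path = $1, last_modified = NOW() WHERE slug = $2",
      params: [newPath, slug],
    },
    {
      // Upsert keeps this idempotent if the same slug changes twice
      sql: "INSERT INTO redirects (source, destination) VALUES ($1, $2) ON CONFLICT (source) DO UPDATE SET destination = EXCLUDED.destination",
      params: [oldPath, newPath],
    },
    {
      // Collapse chains: anything that pointed at the old path now points
      // at the new one (this is what idx_redirects_destination is for)
      sql: "UPDATE redirects SET destination = $2 WHERE destination = $1",
      params: [oldPath, newPath],
    },
  ];
}
```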

Data Layer API at Scale


The pseo-data API contract changes at scale:
typescript
// getAllSlugs() must support cursor-based iteration at 50K+
async function* getAllSlugsCursor(): AsyncGenerator<{ slug: string; category: string }> {
  let cursor: string | null = null;
  while (true) {
    const batch = await db.query(
      `SELECT slug, category FROM pages
       WHERE status = 'published'
       ${cursor ? `AND slug > $1` : ''}
       ORDER BY slug LIMIT 1000`,
      cursor ? [cursor] : []
    );
    if (batch.length === 0) break;
    for (const row of batch) yield row;
    cursor = batch[batch.length - 1].slug;
  }
}

// getPagesByCategory() must use DB pagination, not in-memory slicing
async function getPagesByCategory(
  category: string,
  opts?: { limit?: number; offset?: number }
): Promise<PageIndex[]> {
  return db.query(
    `SELECT slug, title, h1, meta_description, canonical_path, category, last_modified
     FROM pages
     WHERE category = $1 AND status = 'published'
     ORDER BY title
     LIMIT $2 OFFSET $3`,
    [category, opts?.limit ?? 50, opts?.offset ?? 0]
  );
}

// getRelatedPages() needs a precomputed index or efficient query
async function getRelatedPages(slug: string, limit = 5): Promise<PageIndex[]> {
  // Option A: Use related_slugs array from the page record
  // Option B: Same category + shared tags query
  // Option C: Precomputed relatedness table (best at 50K+)
  return db.query(
    `SELECT p.slug, p.title, p.h1, p.meta_description, p.canonical_path,
            p.category, p.last_modified
     FROM pages p
     JOIN pages source ON source.slug = $1
     WHERE p.category = source.category
       AND p.slug != $1
       AND p.status = 'published'
     ORDER BY p.last_modified DESC
     LIMIT $2`,
    [slug, limit]
  );
}

Connection Pooling


At build time with parallel page generation, you need connection pooling:
typescript
import { Pool } from "pg";

const pool = new Pool({
  max: 10,                    // limit concurrent connections during build
  idleTimeoutMillis: 30000,
  connectionTimeoutMillis: 5000,
});
ORM alternative: If using Prisma or Drizzle, configure connection pool limits in the ORM config. Default pool sizes are often too high for build processes.
See references/database-patterns.md for full patterns by database type.
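If the build fans out page generation with Promise.all, the pool's max is only safe when the caller also caps concurrency. A minimal semaphore sketch (libraries like p-limit do the same thing):

```typescript
// Tiny concurrency cap so parallel build tasks never exceed the pool's `max`
function createLimiter(max: number) {
  let active = 0;
  const queue: Array<() => void> = [];
  return async function limit<T>(task: () => Promise<T>): Promise<T> {
    if (active >= max) await new Promise<void>(resolve => queue.push(resolve));
    active++;
    try {
      return await task();
    } finally {
      active--;
      queue.shift()?.(); // wake the next waiter, if any
    }
  };
}
```

Wrap each page-generation task as limit(() => renderPage(slug)) and at most max tasks touch the pool at once.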

2. Data Sufficiency Gating


At 100K pages, many combinations will produce thin content. Gate page generation BEFORE build time — don't create pages that will fail quality checks.

Sufficiency Score


Compute a score per potential page based on available data:
typescript
function computeSufficiencyScore(record: RawRecord): number {
  let score = 0;
  const weights = {
    hasTitle: 10,
    hasDescription: 10,           // > 50 chars
    hasBodyContent: 20,           // > 200 words
    hasFAQs: 15,                  // >= 3 Q&A pairs
    hasUniqueAttributes: 15,      // >= 3 non-null structured attributes
    hasImage: 5,
    hasCategory: 10,
    hasNumericData: 10,           // stats, ratings, prices — LLM citation signal
    hasSourceCitation: 5,         // data provenance for E-E-A-T
  };

  if ((record.title?.length ?? 0) > 10) score += weights.hasTitle;
  if ((record.description?.length ?? 0) > 50) score += weights.hasDescription;
  if (wordCount(record.bodyContent) > 200) score += weights.hasBodyContent;
  if ((record.faqs?.length ?? 0) >= 3) score += weights.hasFAQs;
  // ... etc

  return score; // 0-100
}

Gating Thresholds


Score    Action
80-100   Generate page — sufficient data
60-79    Generate page with enrichment flag — mark for content pipeline
40-59    Hold — do not generate until data is enriched
0-39     Reject — insufficient data, do not generate
Store the score in the database (data_sufficiency_score column) so you can:
  • Query how many pages are gated vs. ready
  • Track enrichment progress over time
  • Re-score after data enrichment
  • Gate at build time: WHERE data_sufficiency_score >= 60 AND status = 'published'
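The threshold table maps directly to a band function plus one SQL rollup for monitoring how the corpus is distributed (the band names are illustrative):

```typescript
type GateAction = "generate" | "enrich" | "hold" | "reject";

// The threshold bands from the table above
function gateAction(score: number): GateAction {
  if (score >= 80) return "generate";
  if (score >= 60) return "enrich"; // generate with enrichment flag
  if (score >= 40) return "hold";
  return "reject";
}

// One query to see gated vs. ready counts across the whole corpus
const BAND_SQL = `
  SELECT CASE
           WHEN data_sufficiency_score >= 80 THEN 'generate'
           WHEN data_sufficiency_score >= 60 THEN 'enrich'
           WHEN data_sufficiency_score >= 40 THEN 'hold'
           ELSE 'reject'
         END AS band, COUNT(*) AS n
  FROM pages
  GROUP BY band`;
```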

Combination Gating


For combination pages (service × city, product × use-case), both dimensions must have sufficient data:
typescript
function gateCombination(dimA: RawRecord, dimB: RawRecord): boolean {
  // Both dimensions must independently clear a minimum bar
  const scoreA = computeSufficiencyScore(dimA);
  const scoreB = computeSufficiencyScore(dimB);

  // The combination itself must produce enough unique content
  // that it's not just "dimA text + dimB text" pasted together
  const combinationHasUniqueContent =
    (dimA.attributes?.length ?? 0) >= 2 &&
    (dimB.attributes?.length ?? 0) >= 2;

  return scoreA >= 40 && scoreB >= 40 && combinationHasUniqueContent;
}
This is the most important scale pattern. 500 services × 200 cities = 100K combinations, but maybe only 15K have enough data for a quality page. Generate only the 15K.
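Enumerating the cross-product lazily keeps the ~85K rejected combinations from ever being materialized. A sketch with the gate abstracted as a predicate:

```typescript
// Walk the full cross-product but yield only gated-in combinations,
// so 100K candidates may materialize as only the ~15K worth building
function* gatedCombinations<A, B>(
  dimsA: A[],
  dimsB: B[],
  gate: (a: A, b: B) => boolean
): Generator<[A, B]> {
  for (const a of dimsA) {
    for (const b of dimsB) {
      if (gate(a, b)) yield [a, b];
    }
  }
}
```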

3. Content Enrichment Pipeline


At 100K pages, you can't manually write intros, FAQs, and descriptions. You need an automated enrichment pipeline with quality controls.

Pipeline Architecture


Raw data (DB/CMS/API)
  ↓
Data sufficiency scoring ──→ Reject/hold insufficient records
  ↓
Automated enrichment ──→ Generate intros, FAQs, summaries from structured data
  ↓
Quality sampling ──→ Human reviews a random sample (5-10%) per batch
  ↓
Publish gate ──→ Only records with score >= 60 and enrichment complete
  ↓
Page generation

Enrichment Sources (non-LLM)


Before reaching for LLM generation, exhaust structured data enrichment:
  • Template composition from multiple fields: Combine 3+ data fields into prose. Not "Best {service} in {city}" but structured sentences using attributes, stats, and category context.
  • Aggregation: Roll up child data into parent summaries (category stats, comparison tables, top-N lists)
  • Cross-referencing: Enrich records by joining data sources (product + reviews, service + location demographics, listing + category averages)
  • FAQ generation from data patterns: Turn common attribute variations into Q&A pairs ("How much does X cost in Y?" → answer from price data)
  • Comparison data: Auto-generate comparison sections from sibling records in the same category
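As a concrete sketch of the FAQ-from-data pattern, assuming a record with min/max price fields (the template wording and field names are illustrative):

```typescript
type PriceRecord = { service: string; city: string; minPrice: number; maxPrice: number };

// Turn a structured price record into a Q&A pair; no LLM involved
function priceFaq(r: PriceRecord): { question: string; answer: string } {
  return {
    question: `How much does ${r.service} cost in ${r.city}?`,
    answer:
      `${r.service} in ${r.city} typically costs between $${r.minPrice} and ` +
      `$${r.maxPrice}, based on listed provider pricing.`,
  };
}
```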

LLM-Assisted Enrichment


If structured enrichment is insufficient, LLM-assisted generation is acceptable under strict conditions:
  1. Never generate the entire page content — LLM fills gaps in an otherwise data-driven page
  2. Always ground in real data — the LLM prompt includes the record's actual attributes, stats, and context
  3. Human review sampling — review 5-10% of LLM-generated content per batch before publishing
  4. Store the generation metadata — track which fields were LLM-generated vs. sourced from data
  5. Apply quality guard after enrichment — run pseo-quality-guard on enriched content before publishing
  6. Regenerate periodically — stale LLM content should be refreshed when underlying data changes
Google's position (2025): AI-generated content is acceptable if it provides genuine value. The risk is not the generation method but the output quality. Scaled LLM generation that produces interchangeable pages will trigger the same penalties as template spam.

4. Incremental Validation


At 100K pages, full-corpus quality checks take hours. Switch to incremental validation.

Delta Validation


Only validate pages that changed since the last validation run:
typescript
async function getChangedPages(since: Date): Promise<string[]> {
  const result = await db.query(
    `SELECT slug FROM pages
     WHERE (last_modified > $1 OR last_validated IS NULL)
       AND status = 'published'`,
    [since]
  );
  return result.map(r => r.slug);
}
Content hashing: Store a hash of each page's rendered content. On validation, only re-check pages whose hash changed:
typescript
import { createHash } from "crypto";

function contentHash(page: BaseSEOContent): string {
  const content = `${page.title}|${page.h1}|${page.metaDescription}|${page.bodyContent}`;
  return createHash("sha256").update(content).digest("hex").slice(0, 16);
}

Periodic Full Scan


Run a complete validation weekly or before major releases. For the full scan at 100K:
  • Parallelize: Run 4-8 validation workers, each processing a category partition
  • Sample cross-category similarity: Don't compare all 100K × 100K. Compare each page against 50 random pages from other categories + all pages in the same category.
  • Stream results to disk: Write validation results to a JSONL file, then aggregate. Don't accumulate all results in memory.
typescript
// Parallel validation by category
const categories = await getAllCategories();
const workers = categories.map(cat =>
  validateCategory(cat.slug) // each worker handles one category
);
const results = await Promise.all(workers);
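The stream-to-disk step can be sketched with Node streams: one JSON line per result on the way out, then a line-by-line second pass to aggregate, so neither direction holds the full result set (the file path and result shape are assumptions):

```typescript
import { createReadStream, createWriteStream } from "node:fs";
import { createInterface } from "node:readline";

type ValidationResult = { slug: string; passed: boolean; similarity: number };

// Append one JSON line per result; memory stays O(1) in corpus size
async function writeResults(path: string, results: AsyncIterable<ValidationResult>) {
  const out = createWriteStream(path, { flags: "a" });
  for await (const r of results) out.write(JSON.stringify(r) + "\n");
  await new Promise<void>(resolve => out.end(() => resolve()));
}

// Second streaming pass: aggregate without loading the file
async function countFailures(path: string): Promise<number> {
  let failures = 0;
  for await (const line of createInterface({ input: createReadStream(path) })) {
    if (!JSON.parse(line).passed) failures++;
  }
  return failures;
}
```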

Validation Budget


Scale   Delta validation   Full validation
10K     Minutes            30-60 minutes
50K     Minutes            2-4 hours
100K    Minutes            4-8 hours
Optimization: Pre-compute and store fingerprints (MinHash signatures) in the database. Validation then only compares fingerprints, not full content.
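A minimal MinHash sketch, assuming word 3-gram shingles and sha256-derived hash families (a real deployment would use proper universal hashing and banding/LSH on top; this only illustrates the signature-comparison idea):

```typescript
import { createHash } from "node:crypto";

// Word 3-gram shingles of a page's text
function shingles(text: string, size = 3): Set<string> {
  const words = text.toLowerCase().split(/\s+/).filter(Boolean);
  const out = new Set<string>();
  for (let i = 0; i + size <= words.length; i++) out.add(words.slice(i, i + size).join(" "));
  return out;
}

// k minimum hash values, one per seeded hash view of the shingle set
function minhashSignature(text: string, k = 32): number[] {
  const sig: number[] = [];
  for (let seed = 0; seed < k; seed++) {
    let min = Infinity;
    for (const s of shingles(text)) {
      const h = parseInt(createHash("sha256").update(seed + "|" + s).digest("hex").slice(0, 8), 16);
      if (h < min) min = h;
    }
    sig.push(min);
  }
  return sig;
}

// Estimated Jaccard similarity = fraction of matching signature slots
function signatureSimilarity(a: number[], b: number[]): number {
  let same = 0;
  for (let i = 0; i < a.length; i++) if (a[i] === b[i]) same++;
  return same / a.length;
}
```

Store the signature per page; validation then compares short integer arrays instead of re-reading full content.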

5. Crawl Budget Management


At 100K pages, Google won't crawl everything immediately. Crawl budget — the number of pages Google crawls per day — becomes a constraint.

Sitemap Submission Strategy


sitemap-index.xml
├── sitemap-category-a.xml     (≤ 50,000 URLs)
├── sitemap-category-b.xml
├── sitemap-category-c.xml
└── ...
Submission cadence:
  • Submit the sitemap index to both Google Search Console and Bing Webmaster Tools
  • Update individual category sitemaps as pages are added
  • Don't submit all 100K URLs at once — Google throttles crawling for new sites with sudden URL spikes
Programmatic sitemap submission:
typescript
// Use Google Indexing API for high-priority pages (job postings, livestreams)
// For standard pages, rely on sitemap discovery + Search Console

// Batch sitemap updates by category
async function updateCategorySitemap(category: string) {
  const pages = await getPagesByCategory(category, { limit: 50000 });
  const xml = generateSitemapXml(pages);
  await writeFile(`public/sitemap-${category}.xml`, xml);
}

Crawl Budget Optimization


  • Prioritize high-value pages: Set priority in sitemap to guide Googlebot (0.8 for hubs, 0.6 for pages with high sufficiency scores, 0.4 for the rest)
  • Remove low-quality pages from sitemap: Pages with sufficiency score < 60 should not be in the sitemap
  • Fix crawl waste: Ensure no soft 404s, no redirect chains, no parameter-based duplicates — all waste crawl budget
  • Server response time: Keep TTFB < 500ms. Slow servers get less crawl budget.
  • Monitor crawl stats: Check Google Search Console → Settings → Crawl stats weekly. If crawl rate drops, investigate.
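The priority tiers in the first bullet can be a one-liner shared by the sitemap generator (the >= 80 cutoff for "high sufficiency" is an assumption; pick whatever matches your gating bands):

```typescript
// Hubs 0.8, high-sufficiency pages 0.6, everything else 0.4
function sitemapPriority(page: { isHub: boolean; sufficiencyScore: number }): number {
  if (page.isHub) return 0.8;
  if (page.sufficiencyScore >= 80) return 0.6; // assumed cutoff for "high"
  return 0.4;
}
```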

Indexing Rate Expectations


Google does not guarantee indexing all pages. Realistic expectations:
Site authority            Pages submitted   Likely indexed (6 months)
New site                  100K              10-30%
Established (DR 30-50)    100K              40-70%
High authority (DR 60+)   100K              70-90%
If indexing rate is low:
  1. Improve content quality on indexed pages first
  2. Earn more backlinks to category hubs
  3. Reduce total page count (prune thin pages) — a smaller, higher-quality corpus often indexes better than a large, mediocre one
  4. Ensure internal linking reaches every page within 3 clicks

6. Monitoring at Scale


At 100K pages, you can't manually check pages. Build automated monitoring.

Key Metrics to Track


Metric                          Source                      Alert threshold
Pages indexed                   Google Search Console API   Drops > 5% week-over-week
Crawl rate                      Google Search Console API   Drops > 20%
Crawl errors (5xx, 404)         Server logs, GSC            > 1% of total pages
CWV regressions                 CrUX API or RUM             LCP > 4s or CLS > 0.25 on any template
Build duration                  CI/CD logs                  > 2x baseline
Build memory peak               CI/CD logs                  > 80% of available memory
Page count by status            Database                    Published count deviates from expected
Sufficiency score distribution  Database                    > 20% of published pages below threshold

Automated Monitoring Script


typescript
// scripts/monitor-pseo.ts — run daily via cron or CI
const count = async (sql: string): Promise<number> =>
  Number((await db.query(sql))[0].count); // COUNT(*) comes back as a string row field

async function monitor() {
  const metrics = {
    totalPublished: await count("SELECT COUNT(*) FROM pages WHERE status = 'published'"),
    avgSufficiency: Number(
      (await db.query("SELECT AVG(data_sufficiency_score) AS avg FROM pages WHERE status = 'published'"))[0].avg
    ),
    belowThreshold: await count("SELECT COUNT(*) FROM pages WHERE data_sufficiency_score < 60 AND status = 'published'"),
    recentlyModified: await count("SELECT COUNT(*) FROM pages WHERE last_modified > NOW() - INTERVAL '7 days'"),
    neverValidated: await count("SELECT COUNT(*) FROM pages WHERE last_validated IS NULL AND status = 'published'"),
    redirectCount: await count("SELECT COUNT(*) FROM redirects"),
    brokenRedirects: await count(
      "SELECT COUNT(*) FROM redirects r WHERE NOT EXISTS (SELECT 1 FROM pages p WHERE p.canonical_path = r.destination)"
    ),
  };

  // Output report or send to monitoring service
  console.log(JSON.stringify(metrics, null, 2));

  // Alert on critical conditions
  if (metrics.brokenRedirects > 0) console.error("ALERT: Broken redirects found");
  if (metrics.belowThreshold / metrics.totalPublished > 0.2) {
    console.error("ALERT: >20% of published pages below sufficiency threshold");
  }
}

Search Console API Integration


At 100K pages, manual Search Console checks are impractical. Use the API:
  • Indexing status: Query the URL Inspection API in batches to check indexing status of new pages
  • Performance data: Pull clicks, impressions, CTR by page template to identify underperforming page types
  • Coverage issues: Monitor for "Crawled — currently not indexed" and "Discovered — currently not indexed" trends
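The URL Inspection API accepts one URL per request, so "batches" in practice means chunking your own request loop and pacing it under quota. A sketch of the request-body construction (the inspect endpoint takes inspectionUrl and siteUrl; the chunk size here is an arbitrary pacing unit, not an API parameter):

```typescript
type InspectRequest = { inspectionUrl: string; siteUrl: string };

// Chunk new-page URLs into paced batches of inspection request bodies
function buildInspectionRequests(
  urls: string[],
  siteUrl: string,
  chunkSize = 50
): InspectRequest[][] {
  const chunks: InspectRequest[][] = [];
  for (let i = 0; i < urls.length; i += chunkSize) {
    chunks.push(urls.slice(i, i + chunkSize).map(u => ({ inspectionUrl: u, siteUrl })));
  }
  return chunks;
}
```

Each request body is then POSTed to the inspect endpoint with an authorized client; sleep between chunks to stay under the daily quota.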

7. Edge and CDN Architecture


At 100K pages, the origin server can't handle all traffic directly.

Caching Strategy


Client → CDN Edge → Origin (Next.js/framework)
                        ↓
              Cache Layer (Redis/edge KV)
                        ↓
                    Database
CDN configuration:
  • Cache all pSEO pages at the edge with s-maxage=86400, stale-while-revalidate=3600
  • Use ISR revalidation to refresh cached pages (not full rebuilds)
  • Set longer TTLs for stable pages (30 days), shorter for dynamic data pages (1 day)
Cache invalidation:
  • On data change → invalidate the specific page's cache via on-demand revalidation API
  • On category change → invalidate all pages in the category
  • On template change → purge the CDN for all pages of that template type
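The three invalidation rules can live in one pure mapping from change event to paths, which the deploy hook then feeds to the framework's on-demand revalidation API (the event shape is an assumption; template-change purges usually go through the CDN's own API instead):

```typescript
type ChangeEvent =
  | { kind: "page"; canonicalPath: string }
  | { kind: "category"; categorySlug: string; pagePaths: string[] };

// Data change → one path; category change → the hub plus its pages
function pathsToInvalidate(e: ChangeEvent): string[] {
  switch (e.kind) {
    case "page":
      return [e.canonicalPath];
    case "category":
      return [`/${e.categorySlug}`, ...e.pagePaths];
  }
}
```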
Edge rendering (if supported):
  • Deploy to edge runtimes (Vercel Edge, Cloudflare Workers) for < 50ms TTFB globally
  • Not all frameworks support edge rendering — check compatibility
  • Edge functions have memory limits (~128MB) that constrain complex data operations

Database Connection from Edge


Edge functions can't maintain persistent database connections. Options:
  • HTTP-based database (PlanetScale, Neon serverless driver, Supabase edge functions)
  • Edge KV store (Cloudflare KV, Vercel KV) for index-tier data with database as source of truth
  • Pre-generated JSON at build time for index-tier data, database only for full page content

8. Scale-Specific Build Strategy


At 100K pages, the build process itself needs architecture.

Don't Build All Pages


typescript
// At 100K, only pre-build the most important pages
export async function generateStaticParams() {
  // Pre-build: hub pages + top 1K pages by traffic/priority
  const hubs = await getAllCategories();
  const topPages = await db.query(
    `SELECT slug, category FROM pages
     WHERE status = 'published' AND data_sufficiency_score >= 80
     ORDER BY priority DESC LIMIT 1000`
  );
  return [
    ...hubs.map(h => ({ category: h.slug })),
    ...topPages.map(p => ({ category: p.category, slug: p.slug })),
  ];
}

// ISR handles the remaining 99K pages on first request
export const dynamicParams = true;
export const revalidate = 86400; // 24 hours

Build Time Budget


Pages pre-built         Expected build time   Memory
1K (hubs + top pages)   5-15 minutes          2-4GB
5K                      15-45 minutes         4-6GB
10K                     30-90 minutes         6-8GB
100K (DON'T DO THIS)    5-15 hours            16GB+
Rule: Never pre-build more than 10K pages. Use ISR for everything else.

Warm-Up After Deploy


After deploying, the ISR cache is cold. The first visitor to each page triggers generation. For critical pages:
typescript
// scripts/warm-cache.ts — run after deploy
const sleep = (ms: number) => new Promise<void>(r => setTimeout(r, ms));

async function warmCache() {
  const priorityPages = await db.query(
    `SELECT canonical_path FROM pages
     WHERE data_sufficiency_score >= 80 AND status = 'published'
     ORDER BY priority DESC LIMIT 5000`
  );

  // Hit each page to trigger ISR generation (rate-limited)
  for (const page of priorityPages) {
    await fetch(`${baseUrl}${page.canonical_path}`);
    await sleep(100); // 10 pages/second — don't DDoS yourself
  }
}

Checklist


  • Database is the primary data store (not JSON files or in-memory arrays)
  • Required indexes exist on slug, category, status, last_modified, sufficiency_score
  • Data sufficiency scoring is implemented and stored per page
  • Pages with score < 60 are gated from generation
  • Combination pages are gated on both dimensions
  • Content enrichment pipeline exists (structured data first, LLM-assisted only with review)
  • Incremental validation is implemented (delta + periodic full scan)
  • Content hashes are stored for change detection
  • Sitemap is split by category with index file
  • Sitemap excludes pages below sufficiency threshold
  • Crawl budget is monitored via Search Console
  • Monitoring script runs daily with alerts
  • CDN caching is configured with appropriate TTLs
  • ISR handles the long tail (only top pages pre-built)
  • Cache warm-up script exists for post-deploy
  • Connection pooling is configured for build-time queries
  • No function loads > 10K full page records into memory

Relationship to Other Skills


  • Extends: pseo-data (replaces in-memory patterns with database), pseo-performance (adds CDN/edge and scale-specific build strategy), pseo-quality-guard (adds incremental validation)
  • Depends on: All content and structure skills must be in place before scaling
  • Validated by: pseo-quality-guard (quality doesn't change — scale does)
  • Works with: pseo-orchestrate (scale considerations at every phase)