clerk-incident-runbook

Clerk Incident Runbook

Overview

Procedures for responding to Clerk-related incidents in production.

Prerequisites

  • Access to Clerk dashboard
  • Access to application logs
  • Emergency contact list
  • Rollback procedures documented

Incident Categories

Category 1: Complete Auth Outage

**Symptoms:** All users unable to sign in, middleware returning errors

**Immediate Actions:**
```bash
# 1. Check Clerk status

# 2. Check your endpoint

# 3. Check environment variables
vercel env ls | grep CLERK
```
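
The first two checks can also be scripted. The following is a minimal sketch, assuming Node 18+ (global `fetch`); `probe`, `summarizeProbe`, and the `/api/health` route are hypothetical names chosen for illustration and are not part of Clerk's SDK or of this runbook's application.

```typescript
// scripts/probe-endpoints.ts (hypothetical helper, not part of Clerk's SDK)
type ProbeResult = { url: string; status: number | null; latencyMs: number }

// Turn a raw probe result into a one-line verdict for the incident channel.
function summarizeProbe(result: ProbeResult): string {
  if (result.status === null) return `${result.url}: unreachable`
  if (result.status >= 500) return `${result.url}: server error (${result.status})`
  if (result.status >= 400) return `${result.url}: client error (${result.status})`
  return `${result.url}: ok (${result.status}, ${Math.round(result.latencyMs)} ms)`
}

// Fetch a URL with a timeout; network failures and timeouts yield status null.
async function probe(url: string, timeoutMs = 5000): Promise<ProbeResult> {
  const started = Date.now()
  const controller = new AbortController()
  const timer = setTimeout(() => controller.abort(), timeoutMs)
  try {
    const response = await fetch(url, { signal: controller.signal })
    return { url, status: response.status, latencyMs: Date.now() - started }
  } catch {
    return { url, status: null, latencyMs: Date.now() - started }
  } finally {
    clearTimeout(timer)
  }
}

// Usage (URLs are placeholders; the health route is an assumption):
// const results = await Promise.all([
//   probe('https://status.clerk.com'),
//   probe('https://yourapp.com/api/health'),
// ])
// results.map(summarizeProbe).forEach((line) => console.log(line))
```

If the status page responds normally while your own endpoint is unreachable or returns 5xx, the application is the more likely culprit than Clerk.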

**Mitigation Steps:**
```typescript
// Emergency bypass mode (use with caution)
// middleware.ts
import { clerkMiddleware } from '@clerk/nextjs/server'
import { NextResponse } from 'next/server'

// Set CLERK_EMERGENCY_BYPASS=true only for the duration of the incident
const EMERGENCY_BYPASS = process.env.CLERK_EMERGENCY_BYPASS === 'true'

// Caveat: the bypass only skips auth.protect(); clerkMiddleware itself still
// runs on every request, so confirm the behavior before relying on it.
export default clerkMiddleware(async (auth, request) => {
  if (EMERGENCY_BYPASS) {
    // Log for audit
    console.warn('[EMERGENCY] Auth bypass active', {
      path: request.nextUrl.pathname,
      timestamp: new Date().toISOString()
    })
    return NextResponse.next()
  }

  // Normal auth flow
  await auth.protect()
})
```

Category 2: Webhook Processing Failure

**Symptoms:** User data out of sync, missing user records

**Diagnosis:**
```bash
# Check webhook endpoint
curl -X POST https://yourapp.com/api/webhooks/clerk \
  -H "Content-Type: application/json" \
  -d '{"type":"ping"}' \
  -w "\n%{http_code}"

# Check Clerk dashboard for failed webhooks
# Dashboard > Webhooks > Failed Deliveries
```
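
The status code printed by the curl probe narrows down where the failure sits. The sketch below shows one way to read it, under the assumption that your handler verifies webhook signatures (Clerk webhooks are signed via Svix), so an unsigned test payload should be rejected. `interpretWebhookProbe` and its result labels are hypothetical, and the mapping should be adjusted to how your handler actually responds.

```typescript
// Hypothetical helper: interpret the HTTP status of an unsigned test POST.
type WebhookDiagnosis =
  | 'route-missing'      // nothing is handling POST at this path
  | 'handler-error'      // the handler crashed or a dependency is down
  | 'rejects-unsigned'   // reachable and enforcing signature checks (expected)
  | 'accepts-unsigned'   // reachable, but signature verification may be off
  | 'other'

function interpretWebhookProbe(status: number): WebhookDiagnosis {
  if (status === 404 || status === 405) return 'route-missing'
  if (status >= 500) return 'handler-error'
  if (status === 400 || status === 401 || status === 403) return 'rejects-unsigned'
  if (status >= 200 && status < 300) return 'accepts-unsigned'
  return 'other'
}

console.log(interpretWebhookProbe(400)) // rejects-unsigned
```

A `rejects-unsigned` result means the route is alive, so the next place to look is the failed deliveries list in the Clerk dashboard.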


**Recovery:**
```typescript
// scripts/resync-users.ts
import { clerkClient } from '@clerk/nextjs/server'
import { db } from '../lib/db'

async function resyncAllUsers() {
  const client = await clerkClient()
  let offset = 0
  const limit = 100

  // Page through all Clerk users and upsert each one into the local database
  while (true) {
    const { data: users, totalCount } = await client.users.getUserList({
      limit,
      offset
    })

    for (const user of users) {
      await db.user.upsert({
        where: { clerkId: user.id },
        update: {
          email: user.emailAddresses[0]?.emailAddress,
          firstName: user.firstName,
          lastName: user.lastName,
          updatedAt: new Date()
        },
        create: {
          clerkId: user.id,
          email: user.emailAddresses[0]?.emailAddress,
          firstName: user.firstName,
          lastName: user.lastName
        }
      })
    }

    console.log(`Synced ${offset + users.length} of ${totalCount} users`)
    offset += limit

    if (offset >= totalCount) break
  }

  console.log('Resync complete')
}

// Exit non-zero on failure so the operator notices a partial sync
resyncAllUsers().catch((error) => {
  console.error('Resync failed', error)
  process.exit(1)
})
```

Category 3: Security Incident

**Symptoms:** Unauthorized access detected, suspicious sessions

**Immediate Actions:**
```typescript
// scripts/emergency-session-revoke.ts
import { clerkClient } from '@clerk/nextjs/server'

async function revokeUserSessions(userId: string) {
  const client = await clerkClient()

  // Get all active sessions
  const sessions = await client.sessions.getSessionList({
    userId,
    status: 'active'
  })

  // Revoke all sessions
  for (const session of sessions.data) {
    await client.sessions.revokeSession(session.id)
    console.log(`Revoked session: ${session.id}`)
  }

  console.log(`Revoked ${sessions.data.length} sessions for user ${userId}`)
}

// Revoke all sessions for compromised user
revokeUserSessions('user_xxx')
```

```typescript
// scripts/emergency-lockout.ts
import { clerkClient } from '@clerk/nextjs/server'

async function lockoutUser(userId: string) {
  const client = await clerkClient()

  // Ban user (prevents new sign-ins)
  await client.users.banUser(userId)

  // Revoke all sessions
  const sessions = await client.sessions.getSessionList({
    userId,
    status: 'active'
  })

  for (const session of sessions.data) {
    await client.sessions.revokeSession(session.id)
  }

  console.log(`User ${userId} locked out and all sessions revoked`)
}
```

Category 4: Performance Degradation

**Symptoms:** Slow sign-in, high latency, timeouts

**Diagnosis:**
```typescript
// scripts/diagnose-performance.ts
// Call this from within a request (e.g. a route handler):
// auth() and currentUser() depend on the incoming request context.
import { auth, clerkClient, currentUser } from '@clerk/nextjs/server'

async function diagnosePerformance() {
  // Elapsed time per call, in milliseconds
  const results = {
    authCheck: 0,
    getUserList: 0,
    currentUser: 0
  }

  // Measure auth check
  const authStart = performance.now()
  await auth()
  results.authCheck = performance.now() - authStart

  // Measure API call
  const apiStart = performance.now()
  const client = await clerkClient()
  await client.users.getUserList({ limit: 1 })
  results.getUserList = performance.now() - apiStart

  // Measure currentUser
  const userStart = performance.now()
  await currentUser()
  results.currentUser = performance.now() - userStart

  console.log('Performance Diagnosis:', results)

  // Check for issues
  if (results.authCheck > 100) {
    console.warn('Auth check slow - check middleware configuration')
  }
  if (results.getUserList > 500) {
    console.warn('API slow - check Clerk status or network')
  }

  return results
}
```

Runbook Procedures

Procedure 1: Auth Outage Response

1. [ ] Confirm outage (check status.clerk.com)
2. [ ] Check application logs for errors
3. [ ] Verify environment variables
4. [ ] If Clerk outage:
   a. [ ] Enable emergency bypass (if safe)
   b. [ ] Notify users via status page
   c. [ ] Monitor Clerk status
5. [ ] If application issue:
   a. [ ] Check recent deployments
   b. [ ] Rollback if necessary
   c. [ ] Check middleware configuration
6. [ ] Document timeline and actions
7. [ ] Conduct post-mortem

Procedure 2: Security Breach Response

1. [ ] Identify affected accounts
2. [ ] Revoke all sessions for affected users
3. [ ] Lock compromised accounts
4. [ ] Reset API keys if exposed
5. [ ] Enable additional verification
6. [ ] Notify affected users
7. [ ] Review access logs
8. [ ] Document and report

Procedure 3: Data Sync Recovery

1. [ ] Identify sync gap (check webhook logs)
2. [ ] Pause webhook processing
3. [ ] Export current database state
4. [ ] Run resync script
5. [ ] Verify data integrity
6. [ ] Resume webhook processing
7. [ ] Monitor for new issues
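
The "Verify data integrity" step can be made concrete by comparing the set of user IDs in Clerk with the `clerkId` values stored locally. This is a minimal sketch; `findSyncGap` is a hypothetical helper, and how you collect the two ID lists (for example by paging `getUserList` and querying your database) is left to your setup.

```typescript
// Hypothetical helper: compare Clerk user IDs against locally stored clerkIds.
type SyncGap = {
  missingInDb: string[]   // present in Clerk, absent locally (missed create events)
  orphanedInDb: string[]  // present locally, absent in Clerk (missed delete events)
}

function findSyncGap(clerkIds: string[], dbClerkIds: string[]): SyncGap {
  const inClerk = new Set(clerkIds)
  const inDb = new Set(dbClerkIds)
  return {
    missingInDb: clerkIds.filter((id) => !inDb.has(id)),
    orphanedInDb: dbClerkIds.filter((id) => !inClerk.has(id))
  }
}

// Example with placeholder IDs
const gap = findSyncGap(['user_a', 'user_b', 'user_c'], ['user_b', 'user_d'])
console.log(gap)
```

Both lists being empty after the resync is a reasonable signal that webhook processing can be resumed.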

Emergency Contacts

```yaml
# .github/INCIDENT_CONTACTS.yml
contacts:
  on_call:
    - name: On-Call Engineer
      phone: "+1-xxx-xxx-xxxx"
      slack: "@oncall"

  clerk_support:
    - url: "https://clerk.com/support"
    - email: "support@clerk.com"
    - priority: "For enterprise: contact account manager"

  escalation:
    - level: 1
      contact: "On-call engineer"
      time: "0-15 min"
    - level: 2
      contact: "Engineering lead"
      time: "15-30 min"
    - level: 3
      contact: "CTO"
      time: "30+ min"
```
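
The escalation windows can be encoded so that alerting tooling picks the right contact automatically. The sketch below assumes that a boundary minute belongs to the higher level (minute 15 escalates to level 2), which the contact file does not specify; `escalationFor` and `EscalationStep` are hypothetical names.

```typescript
// Hypothetical helper mirroring the escalation table in the contacts file.
type EscalationStep = { level: number; contact: string; fromMinute: number }

const ESCALATION: EscalationStep[] = [
  { level: 1, contact: 'On-call engineer', fromMinute: 0 },
  { level: 2, contact: 'Engineering lead', fromMinute: 15 },
  { level: 3, contact: 'CTO', fromMinute: 30 }
]

// Return the highest escalation step whose window has started.
function escalationFor(minutesSinceDetection: number): EscalationStep {
  let current: EscalationStep = ESCALATION[0]
  for (const step of ESCALATION) {
    if (minutesSinceDetection >= step.fromMinute) current = step
  }
  return current
}

console.log(escalationFor(20).contact) // Engineering lead
```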

Post-Incident

Template

```markdown
# Incident Report: [Title]

## Summary
- Date: YYYY-MM-DD
- Duration: X hours Y minutes
- Severity: P1/P2/P3
- Impact: [Number of affected users]

## Timeline
- HH:MM - Incident detected
- HH:MM - Initial response
- HH:MM - Mitigation applied
- HH:MM - Resolution confirmed

## Root Cause
[Description of root cause]

## Resolution
[Steps taken to resolve]

## Prevention
- Action item 1
- Action item 2

## Lessons Learned
[Key takeaways]
```
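
The Duration field in the template can be derived from the first and last timeline entries instead of being computed by hand. A small sketch, with `formatDuration` as a hypothetical helper that rounds to the nearest minute:

```typescript
// Hypothetical helper: render the template's "X hours Y minutes" duration.
function formatDuration(detectedAt: Date, resolvedAt: Date): string {
  const elapsedMs = resolvedAt.getTime() - detectedAt.getTime()
  // Clamp at zero so swapped timestamps do not produce a negative duration
  const totalMinutes = Math.max(0, Math.round(elapsedMs / 60000))
  const hours = Math.floor(totalMinutes / 60)
  const minutes = totalMinutes % 60
  return `${hours} hours ${minutes} minutes`
}

console.log(
  formatDuration(new Date('2024-01-01T10:00:00Z'), new Date('2024-01-01T12:35:00Z'))
) // 2 hours 35 minutes
```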

Output

  • Incident response procedures
  • Recovery scripts
  • Emergency bypass capability
  • Post-incident templates

Resources

Next Steps

Proceed to clerk-data-handling for user data management.