# Clerk Incident Runbook
## Overview

Procedures for responding to Clerk-related incidents in production.
## Prerequisites

- Access to the Clerk dashboard
- Access to application logs
- An emergency contact list
- Documented rollback procedures
## Incident Categories

### Category 1: Complete Auth Outage
**Symptoms:** All users are unable to sign in; the middleware is returning errors.

**Immediate Actions:**

```bash
# 1. Check Clerk status
curl -s https://status.clerk.com/api/v1/status | jq

# 2. Check your endpoint

# 3. Check environment variables
vercel env ls | grep CLERK
```
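Step 3 can go further than listing variable names. The sketch below is a hypothetical `checkClerkEnv` helper; it assumes the Next.js variable names `NEXT_PUBLIC_CLERK_PUBLISHABLE_KEY` and `CLERK_SECRET_KEY` and Clerk's `pk_test_`/`pk_live_` and `sk_test_`/`sk_live_` key prefixes. It flags missing keys, malformed keys, and a publishable key and secret key that come from different instance types, which is worth ruling out when every sign-in fails.

```typescript
type Env = Record<string, string | undefined>

// Returns human-readable problems; an empty array means nothing was flagged.
// Variable names and key prefixes are assumptions based on Clerk's Next.js setup.
function checkClerkEnv(env: Env): string[] {
  const problems: string[] = []
  const pk = env.NEXT_PUBLIC_CLERK_PUBLISHABLE_KEY
  const sk = env.CLERK_SECRET_KEY

  if (!pk) problems.push('NEXT_PUBLIC_CLERK_PUBLISHABLE_KEY is missing')
  if (!sk) problems.push('CLERK_SECRET_KEY is missing')
  if (!pk || !sk) return problems

  // Extract "test" or "live" from each key prefix
  const pkMode = /^pk_(test|live)_/.exec(pk)?.[1]
  const skMode = /^sk_(test|live)_/.exec(sk)?.[1]

  if (!pkMode) problems.push('Publishable key does not start with pk_test_ or pk_live_')
  if (!skMode) problems.push('Secret key does not start with sk_test_ or sk_live_')
  if (pkMode && skMode && pkMode !== skMode) {
    problems.push(`Key mismatch: publishable key is ${pkMode}, secret key is ${skMode}`)
  }
  return problems
}
```

In a Node context this could be run as `console.log(checkClerkEnv(process.env))` inside the affected environment.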
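**Mitigation Steps:**

```typescript
// Emergency bypass mode (use with caution)
// middleware.ts
import { clerkMiddleware } from '@clerk/nextjs/server'
import { NextResponse } from 'next/server'

// Set CLERK_EMERGENCY_BYPASS=true only during a confirmed Clerk outage.
// While it is on, every route matched by this middleware is served
// WITHOUT authentication, so switch it off as soon as Clerk recovers.
const EMERGENCY_BYPASS = process.env.CLERK_EMERGENCY_BYPASS === 'true'

export default clerkMiddleware(async (auth, request) => {
  if (EMERGENCY_BYPASS) {
    // Log every bypassed request for the post-incident audit
    console.warn('[EMERGENCY] Auth bypass active', {
      path: request.nextUrl.pathname,
      timestamp: new Date().toISOString()
    })
    return NextResponse.next()
  }

  // Normal auth flow: require a signed-in user
  await auth.protect()
})
```

On Vercel, environment variable changes only apply to new deployments, so toggling the flag requires a redeploy.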
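If bypassing every route is too broad, one option is to keep a few path prefixes closed even while the flag is on. This is a sketch, not part of Clerk's API: `shouldBypass` and the `/admin` and `/api/admin` prefixes are hypothetical. In the middleware above, `shouldBypass(EMERGENCY_BYPASS, request.nextUrl.pathname)` would replace the bare `EMERGENCY_BYPASS` check, so the listed prefixes fall through to `auth.protect()` as usual.

```typescript
// Hypothetical prefixes that must never be served without auth,
// even while the emergency bypass is active.
const ALWAYS_PROTECTED_PREFIXES = ['/admin', '/api/admin']

// Decide whether a request may skip authentication.
function shouldBypass(bypassEnabled: boolean, pathname: string): boolean {
  if (!bypassEnabled) return false
  // Match "/admin" and "/admin/..." but not "/administrator"
  const isProtected = ALWAYS_PROTECTED_PREFIXES.some(
    (prefix) => pathname === prefix || pathname.startsWith(`${prefix}/`)
  )
  return !isProtected
}
```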
### Category 2: Webhook Processing Failure
**Symptoms:** User data is out of sync; user records are missing.

**Diagnosis:**

```bash
# Check that the webhook endpoint is reachable.
# If the handler verifies Svix signatures, this unsigned request should be
# rejected (typically 400 or 401); a 404 suggests the route is missing and
# a 5xx suggests the handler itself is failing.
curl -X POST https://yourapp.com/api/webhooks/clerk \
  -H "Content-Type: application/json" \
  -d '{"type":"ping"}' \
  -w "\n%{http_code}"

# Check the Clerk dashboard for failed webhooks:
# Dashboard > Webhooks > Failed Deliveries
```
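Clerk delivers webhooks through Svix, which is why a verifying handler rejects the unsigned ping above. For reference, Svix's documented manual check is an HMAC-SHA256 over `${svix-id}.${svix-timestamp}.${body}`, keyed with the base64-decoded part of the `whsec_` secret and compared against the `v1,` entries in the `svix-signature` header. The sketch below uses only `node:crypto`; the function names and the five-minute tolerance are illustrative, and in production the official `svix` package is the safer choice.

```typescript
import { Buffer } from 'node:buffer'
import { createHmac, timingSafeEqual } from 'node:crypto'

// Compute the Svix v1 signature (base64) for a payload.
function signSvix(secret: string, id: string, timestamp: string, body: string): string {
  // Secrets look like "whsec_<base64>"; only the base64 part is the HMAC key
  const key = Buffer.from(secret.replace(/^whsec_/, ''), 'base64')
  return createHmac('sha256', key).update(`${id}.${timestamp}.${body}`).digest('base64')
}

// True if any "v1,<sig>" entry matches and the timestamp is within the
// tolerance window (the window limits replay of captured requests).
function verifySvix(
  secret: string,
  headers: { id: string; timestamp: string; signature: string },
  body: string,
  nowSeconds: number = Math.floor(Date.now() / 1000),
  toleranceSeconds = 5 * 60
): boolean {
  const ts = Number(headers.timestamp)
  if (!Number.isFinite(ts) || Math.abs(nowSeconds - ts) > toleranceSeconds) return false

  const expected = Buffer.from(signSvix(secret, headers.id, headers.timestamp, body))
  // The header may carry several space-separated signatures
  return headers.signature.split(' ').some((entry) => {
    const [version, sig] = entry.split(',')
    if (version !== 'v1' || !sig) return false
    const candidate = Buffer.from(sig)
    // timingSafeEqual throws on length mismatch, so compare lengths first
    return candidate.length === expected.length && timingSafeEqual(candidate, expected)
  })
}
```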
**Recovery:**

```typescript
// scripts/resync-users.ts
import { clerkClient } from '@clerk/nextjs/server'
import { db } from '../lib/db'

async function resyncAllUsers() {
  const client = await clerkClient()
  let offset = 0
  const limit = 100

  while (true) {
    const { data: users, totalCount } = await client.users.getUserList({
      limit,
      offset
    })
    // Stop on an empty page (e.g. users deleted while the script runs)
    if (users.length === 0) break

    for (const user of users) {
      // Use the primary email; emailAddresses[0] is not guaranteed to be it
      const email = user.emailAddresses.find(
        (address) => address.id === user.primaryEmailAddressId
      )?.emailAddress

      // Upsert keeps the script idempotent, so it is safe to re-run
      await db.user.upsert({
        where: { clerkId: user.id },
        update: {
          email,
          firstName: user.firstName,
          lastName: user.lastName,
          updatedAt: new Date()
        },
        create: {
          clerkId: user.id,
          email,
          firstName: user.firstName,
          lastName: user.lastName
        }
      })
    }

    console.log(`Synced ${offset + users.length} of ${totalCount} users`)
    offset += limit
    if (offset >= totalCount) break
  }

  console.log('Resync complete')
}

resyncAllUsers().catch((error) => {
  console.error('Resync failed:', error)
  process.exit(1)
})
```
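The resync loop's control flow can be exercised without Clerk or a database by injecting its two side effects. In this sketch, `fetchPage` stands in for `getUserList`, `upsert` for the database write, and `resync` and the `SourceUser` shape are hypothetical names chosen for illustration.

```typescript
interface SourceUser {
  id: string
  email?: string
}

interface Page {
  data: SourceUser[]
  totalCount: number
}

// Walk an offset-paginated source and upsert every record.
// Returns how many records were processed.
async function resync(
  fetchPage: (limit: number, offset: number) => Promise<Page>,
  upsert: (user: SourceUser) => Promise<void>,
  limit = 100
): Promise<number> {
  let offset = 0
  let processed = 0
  while (true) {
    const { data, totalCount } = await fetchPage(limit, offset)
    if (data.length === 0) break // nothing left, or the source shrank mid-run
    for (const user of data) {
      await upsert(user)
      processed++
    }
    offset += limit
    if (offset >= totalCount) break
  }
  return processed
}
```

Injecting the side effects keeps the loop testable, and the empty-page guard stops the walk even if `totalCount` changes while the script runs.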
### Category 3: Security Incident
**Symptoms:** Unauthorized access detected; suspicious sessions.

**Immediate Actions:**

Revoke every active session for a compromised user:

```typescript
// scripts/emergency-session-revoke.ts
import { clerkClient } from '@clerk/nextjs/server'

async function revokeUserSessions(userId: string) {
  const client = await clerkClient()

  // Get all active sessions (raise the page size; the API default is small)
  const sessions = await client.sessions.getSessionList({
    userId,
    status: 'active',
    limit: 100
  })

  // Revoke all sessions. Already-issued session tokens stay valid until
  // they expire (60 seconds by default), so expect a short tail of access.
  for (const session of sessions.data) {
    await client.sessions.revokeSession(session.id)
    console.log(`Revoked session: ${session.id}`)
  }

  console.log(`Revoked ${sessions.data.length} sessions for user ${userId}`)
}

// Revoke all sessions for the compromised user (replace the placeholder ID)
revokeUserSessions('user_xxx').catch((error) => {
  console.error('Session revocation failed:', error)
  process.exit(1)
})
```

Lock the user out entirely:

```typescript
// scripts/emergency-lockout.ts
import { clerkClient } from '@clerk/nextjs/server'

export async function lockoutUser(userId: string) {
  const client = await clerkClient()

  // Ban user (prevents new sign-ins; reverse later with users.unbanUser)
  await client.users.banUser(userId)

  // Revoke any sessions still active. Clerk documents banning as revoking
  // sessions too, so this is a defensive re-check.
  const sessions = await client.sessions.getSessionList({
    userId,
    status: 'active',
    limit: 100
  })
  for (const session of sessions.data) {
    await client.sessions.revokeSession(session.id)
  }

  console.log(`User ${userId} locked out and all sessions revoked`)
}
```
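When an incident affects many accounts, one failed API call should not stop the rest of the lockout. A sketch of a batch runner that applies an action such as `lockoutUser` to each user ID with limited concurrency and collects failures for follow-up; `runBatch` and the default concurrency of 3 are hypothetical, and keeping concurrency low also helps stay within Clerk's Backend API rate limits.

```typescript
interface BatchResult {
  succeeded: string[]
  failed: { id: string; error: string }[]
}

// Run `action` for every id, at most `concurrency` at a time.
// Failures are recorded instead of aborting the whole batch.
async function runBatch(
  ids: string[],
  action: (id: string) => Promise<void>,
  concurrency = 3
): Promise<BatchResult> {
  const result: BatchResult = { succeeded: [], failed: [] }
  let next = 0

  async function worker(): Promise<void> {
    while (next < ids.length) {
      // Safe without locks: JS runs one worker at a time between awaits
      const id = ids[next++]
      try {
        await action(id)
        result.succeeded.push(id)
      } catch (error) {
        result.failed.push({ id, error: error instanceof Error ? error.message : String(error) })
      }
    }
  }

  const workers = Array.from({ length: Math.min(concurrency, ids.length) }, () => worker())
  await Promise.all(workers)
  return result
}
```

Usage would look like `const report = await runBatch(affectedIds, lockoutUser)`, with `report.failed` retried or escalated by hand.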
### Category 4: Performance Degradation
**Symptoms:** Slow sign-in, high latency, timeouts.

**Diagnosis:**

```typescript
// lib/diagnose-performance.ts
// Call from a route handler or server component: auth() and currentUser()
// read the incoming request, so they cannot run as a standalone script.
import { auth, clerkClient, currentUser } from '@clerk/nextjs/server'

export async function diagnosePerformance() {
  const results = {
    authCheck: 0,
    getUserList: 0,
    currentUser: 0
  }

  // Measure auth check
  const authStart = performance.now()
  await auth()
  results.authCheck = performance.now() - authStart

  // Measure a Backend API call
  const apiStart = performance.now()
  const client = await clerkClient()
  await client.users.getUserList({ limit: 1 })
  results.getUserList = performance.now() - apiStart

  // Measure currentUser (makes its own Backend API request)
  const userStart = performance.now()
  await currentUser()
  results.currentUser = performance.now() - userStart

  console.log('Performance Diagnosis (ms):', results)

  // Check for issues (thresholds in milliseconds)
  if (results.authCheck > 100) {
    console.warn('Auth check slow - check middleware configuration')
  }
  if (results.getUserList > 500) {
    console.warn('API slow - check Clerk status or network')
  }

  return results
}
```
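A single timing is noisy, especially on a cold serverless instance. One option is to repeat each measurement and compare the median and a high percentile against the thresholds instead. `measure` and `percentile` are hypothetical helpers, and the nearest-rank percentile used here is one of several common definitions.

```typescript
// Nearest-rank percentile: the smallest sample with at least p% of
// samples at or below it.
function percentile(samples: number[], p: number): number {
  if (samples.length === 0) throw new Error('no samples')
  const sorted = [...samples].sort((a, b) => a - b)
  const rank = Math.ceil((p / 100) * sorted.length)
  return sorted[Math.max(0, rank - 1)]
}

// Time `fn` sequentially `runs` times and summarise the durations in ms.
async function measure(fn: () => Promise<unknown>, runs = 5) {
  const durations: number[] = []
  for (let i = 0; i < runs; i++) {
    const start = performance.now()
    await fn()
    durations.push(performance.now() - start)
  }
  return { p50: percentile(durations, 50), p95: percentile(durations, 95), durations }
}
```

For example, `measure(() => client.users.getUserList({ limit: 1 }))` could be checked against the 500 ms threshold using `p50`, with `p95` showing how bad the tail is.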
## Runbook Procedures

### Procedure 1: Auth Outage Response
1. [ ] Confirm outage (check status.clerk.com)
2. [ ] Check application logs for errors
3. [ ] Verify environment variables
4. [ ] If Clerk outage:
a. [ ] Enable emergency bypass (if safe)
b. [ ] Notify users via status page
c. [ ] Monitor Clerk status
5. [ ] If application issue:
a. [ ] Check recent deployments
b. [ ] Rollback if necessary
c. [ ] Check middleware configuration
6. [ ] Document timeline and actions
7. [ ] Conduct post-mortem
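For step 6, the timeline is easiest to reconstruct if entries are captured as actions happen. A small sketch of an in-memory timeline that renders entries in the `- HH:MM - event` format used by the post-incident template; the `IncidentTimeline` class is hypothetical, and times are rendered in UTC to avoid time-zone confusion between responders.

```typescript
interface TimelineEntry {
  at: Date
  event: string
}

class IncidentTimeline {
  private entries: TimelineEntry[] = []

  // Record an event; `at` defaults to the current time.
  record(event: string, at: Date = new Date()): void {
    this.entries.push({ at, event })
  }

  // Render as "- HH:MM - event" lines in UTC, oldest first.
  render(): string {
    return [...this.entries]
      .sort((a, b) => a.at.getTime() - b.at.getTime())
      .map(({ at, event }) => {
        const hh = String(at.getUTCHours()).padStart(2, '0')
        const mm = String(at.getUTCMinutes()).padStart(2, '0')
        return `- ${hh}:${mm} - ${event}`
      })
      .join('\n')
  }
}
```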
### Procedure 2: Security Breach Response
1. [ ] Identify affected accounts
2. [ ] Revoke all sessions for affected users
3. [ ] Lock compromised accounts
4. [ ] Rotate API keys if exposed
5. [ ] Enable additional verification (e.g., require multi-factor authentication)
6. [ ] Notify affected users
7. [ ] Review access logs
8. [ ] Document and report
### Procedure 3: Data Sync Recovery
1. [ ] Identify sync gap (check webhook logs)
2. [ ] Pause webhook processing
3. [ ] Export current database state
4. [ ] Run resync script
5. [ ] Verify data integrity
6. [ ] Resume webhook processing
7. [ ] Monitor for new issues
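Step 5 ("Verify data integrity") can start with a set comparison of user IDs on both sides. This sketch assumes the Clerk user IDs have already been collected (for example with the pagination loop in the resync script) along with the `clerkId` values from the database; `diffUserIds` is hypothetical.

```typescript
interface SyncDiff {
  missingInDb: string[] // present in Clerk, absent from the database
  orphanedInDb: string[] // present in the database, absent from Clerk
}

// Compare two ID collections; results are sorted for stable reporting.
function diffUserIds(clerkIds: Iterable<string>, dbIds: Iterable<string>): SyncDiff {
  const clerk = new Set(clerkIds)
  const db = new Set(dbIds)
  return {
    missingInDb: Array.from(clerk).filter((id) => !db.has(id)).sort(),
    orphanedInDb: Array.from(db).filter((id) => !clerk.has(id)).sort()
  }
}
```

Orphaned rows may belong to users deleted in Clerk while webhook processing was failing; how to handle them depends on your retention policy.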
## Emergency Contacts
```yaml
# .github/INCIDENT_CONTACTS.yml
contacts:
  on_call:
    - name: On-Call Engineer
      phone: "+1-xxx-xxx-xxxx"
      slack: "@oncall"
  clerk_support:
    url: "https://clerk.com/support"
    email: "support@clerk.com"
    priority: "For enterprise: contact account manager"
  escalation:
    - level: 1
      contact: "On-call engineer"
      time: "0-15 min"
    - level: 2
      contact: "Engineering lead"
      time: "15-30 min"
    - level: 3
      contact: "CTO"
      time: "30+ min"
```
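The escalation tiers can be applied mechanically. A sketch that picks the tier from elapsed minutes, mirroring the `0-15 min`, `15-30 min`, and `30+ min` windows above; it assumes each boundary belongs to the higher tier (minute 15 escalates to level 2), which the file does not specify, and `escalationFor` is a hypothetical name.

```typescript
interface EscalationLevel {
  level: number
  contact: string
  fromMinute: number
}

// Mirrors the escalation list in INCIDENT_CONTACTS.yml, ordered by start time.
const ESCALATION: EscalationLevel[] = [
  { level: 1, contact: 'On-call engineer', fromMinute: 0 },
  { level: 2, contact: 'Engineering lead', fromMinute: 15 },
  { level: 3, contact: 'CTO', fromMinute: 30 }
]

// Return the highest level whose window has started.
function escalationFor(minutesElapsed: number): EscalationLevel {
  let current = ESCALATION[0]
  for (const tier of ESCALATION) {
    if (minutesElapsed >= tier.fromMinute) current = tier
  }
  return current
}
```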
## Post-Incident
### Template

```markdown
# Incident Report: [Title]

## Summary
- Date: YYYY-MM-DD
- Duration: X hours Y minutes
- Severity: P1/P2/P3
- Impact: [Number of affected users]

## Timeline
- HH:MM - Incident detected
- HH:MM - Initial response
- HH:MM - Mitigation applied
- HH:MM - Resolution confirmed

## Root Cause
[Description of root cause]

## Resolution
[Steps taken to resolve]

## Prevention
- Action item 1
- Action item 2

## Lessons Learned
[Key takeaways]
```
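The `Duration` field can be computed from the first and last timeline timestamps rather than estimated. A sketch that formats the difference in the template's `X hours Y minutes` shape; `formatDuration` is hypothetical and assumes both timestamps are ISO 8601 strings.

```typescript
// Format the time between two ISO 8601 timestamps as "X hours Y minutes".
// Seconds are truncated; reversed or unparseable inputs throw.
function formatDuration(startIso: string, endIso: string): string {
  const ms = Date.parse(endIso) - Date.parse(startIso)
  if (Number.isNaN(ms) || ms < 0) throw new Error('invalid or reversed timestamps')
  const totalMinutes = Math.floor(ms / 60_000)
  const hours = Math.floor(totalMinutes / 60)
  const minutes = totalMinutes % 60
  return `${hours} hours ${minutes} minutes`
}
```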
## Output

- Incident response procedures
- Recovery scripts
- Emergency bypass capability
- Post-incident templates
## Next Steps

Proceed to `clerk-data-handling` for user data management.