engineering-senior-developer

Senior Development Guide

Overview

This guide covers the workflow, standards, and patterns for delivering production-grade software across web, backend, mobile, and platform work. Use it when planning implementation, making architecture tradeoffs, improving code quality, or shipping safely.

Delivery Workflow

1. Understand the problem

  • Clarify goals, constraints, success metrics, deadlines, and non-goals. If any of these are missing, ask before writing code — never assume.
  • Identify unknowns: if >2 unknowns exist, add a spike task (max 2 hours, concrete deliverable) before committing to an estimate.
  • Propose a minimal viable technical approach first. If the approach requires >5 days of work, look for a simpler alternative or split into phases.
  • Define acceptance criteria before implementation. Every criterion must be verifiable — "works correctly" is not a criterion; "returns 200 with JWT containing user_id claim" is.

2. Plan implementation

  • Break work into small, testable milestones. Each milestone must be mergeable independently — if milestone B cannot ship without milestone A, they are one milestone.
  • If a change touches >5 files, write a 1-paragraph plan before starting. If >15 files, write a design doc (see references/design-docs.md).
  • Plan rollback: every database migration must be reversible. If a migration drops a column, first deploy code that stops reading it, then drop in the next release.
  • For any new external dependency (API, service, database), define: timeout (default 5s), retry policy (3 attempts, exponential backoff), circuit breaker threshold (5 failures in 60s), and fallback behavior.
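
The dependency defaults in the last bullet can be written down as data plus two small helpers. This is a minimal sketch, not a specific library's API: DependencyPolicy, backoffDelays, and CircuitBreaker are hypothetical names, the 200 ms base delay is an assumption (the guide does not name one), and fallback behavior is left out because it depends on the dependency.

```typescript
// Sketch only: all names here are hypothetical, not from a specific library.
interface DependencyPolicy {
  timeoutMs: number;
  maxAttempts: number;
  baseDelayMs: number; // assumption: the guide does not name a base delay
  breakerFailureThreshold: number;
  breakerWindowMs: number;
}

// Defaults from the list above: 5s timeout, 3 attempts, 5 failures in 60s.
const DEFAULT_POLICY: DependencyPolicy = {
  timeoutMs: 5000,
  maxAttempts: 3,
  baseDelayMs: 200,
  breakerFailureThreshold: 5,
  breakerWindowMs: 60000,
};

// Delay before each retry; the first attempt runs immediately.
function backoffDelays(policy: DependencyPolicy): number[] {
  const delays: number[] = [];
  for (let retry = 0; retry < policy.maxAttempts - 1; retry++) {
    delays.push(policy.baseDelayMs * Math.pow(2, retry));
  }
  return delays;
}

// Opens once the failure threshold is reached inside the sliding window.
class CircuitBreaker {
  private failureTimes: number[] = [];
  private readonly policy: DependencyPolicy;

  constructor(policy: DependencyPolicy) {
    this.policy = policy;
  }

  recordFailure(nowMs: number): void {
    this.failureTimes.push(nowMs);
  }

  isOpen(nowMs: number): boolean {
    const cutoff = nowMs - this.policy.breakerWindowMs;
    // Drop failures that have aged out of the window before counting.
    this.failureTimes = this.failureTimes.filter((t) => t > cutoff);
    return this.failureTimes.length >= this.policy.breakerFailureThreshold;
  }
}
```

Passing the current time in as a parameter, rather than reading the clock inside the class, keeps the breaker deterministic under test.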

3. Implement and verify

  • Functions >40 lines: split. Files >300 lines: split. If a function takes >4 parameters, introduce a config/options object.
  • Test at the right level: pure logic = unit test. API endpoints = integration test. Critical user flows = E2E test (max 5 E2E tests per feature — they are slow and flaky).
  • Every new API endpoint must have: input validation (Zod/Pydantic/class-validator), error response schema, rate limit, and at least one integration test.
  • Every query on a table with >10k rows must be served by an index. Run EXPLAIN ANALYZE and reject sequential scans on large tables.
  • Never catch an error and swallow it silently. Log it with context (operation, input, correlation ID) or re-throw. Catch only errors you can handle.
  • Backward compatibility: if changing a shared type or API response shape, grep all consumers. If >0 consumers depend on the old shape, use expand-migrate-contract (add new field, migrate consumers, remove old field).
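
The expand-migrate-contract sequence from the last bullet can be illustrated with response types. This is a sketch under assumptions: the user record and the rename from a single name field to firstName/lastName are invented for illustration.

```typescript
// Hypothetical record; `name` is the old field that consumers still read.
interface UserRow {
  id: number;
  firstName: string;
  lastName: string;
}

// Old response shape that existing consumers depend on.
interface UserResponseV1 {
  id: number;
  name: string;
}

// Expand: add the new fields while still serving the old one.
interface UserResponseExpanded extends UserResponseV1 {
  firstName: string;
  lastName: string;
}

// Contract: the shape to serve once no consumer reads `name`.
interface UserResponseV2 {
  id: number;
  firstName: string;
  lastName: string;
}

function toExpandedResponse(row: UserRow): UserResponseExpanded {
  return {
    id: row.id,
    name: row.firstName + " " + row.lastName, // kept for old consumers
    firstName: row.firstName,
    lastName: row.lastName,
  };
}

function toContractedResponse(row: UserRow): UserResponseV2 {
  return { id: row.id, firstName: row.firstName, lastName: row.lastName };
}
```

The migrate step happens between the two serializers: consumers switch to the new fields while toExpandedResponse is live, and only then does the endpoint move to toContractedResponse.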

4. Ship and stabilize

  • Before merging: run the full test suite locally, verify no lint/type errors, check bundle size / binary size if applicable.
  • Deploy with observability: every deploy must be visible in metrics within 5 minutes. If you cannot tell from a dashboard whether the deploy is healthy, add instrumentation before deploying.
  • After deploy: monitor error rates and p95 latency for 30 minutes. If error rate increases >2x or p95 doubles, roll back immediately — do not debug in production first.
  • Capture tech debt within 48 hours of discovering it. Each debt item must have: description, impact (latency/reliability/developer velocity), estimated effort, and owner. Unowned debt does not get fixed.
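
The post-deploy rollback rule above is simple enough to encode as a function. This sketch assumes metrics are already available as plain numbers; DeploySnapshot and shouldRollBack are hypothetical names.

```typescript
// Hypothetical snapshot of the two signals the guide asks you to watch.
interface DeploySnapshot {
  errorRate: number; // fraction of failed requests, e.g. 0.004
  p95LatencyMs: number;
}

// Roll back when the error rate more than doubles or p95 latency doubles.
// Note: with a baseline error rate of 0, any error at all triggers a rollback.
function shouldRollBack(baseline: DeploySnapshot, current: DeploySnapshot): boolean {
  const errorRateSpiked = current.errorRate > baseline.errorRate * 2;
  const latencyDoubled = current.p95LatencyMs >= baseline.p95LatencyMs * 2;
  return errorRateSpiked || latencyDoubled;
}
```

For example, a baseline of 0.4% errors and 120 ms p95 followed by 1% errors would return true, regardless of latency.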

Engineering Standards

  • Every API endpoint must document: request/response schema, error codes with meanings, authentication requirement, and rate limit. If an endpoint lacks this, it is not ready for review.
  • Every database migration must be tested with rollback on a copy of production data (or realistic seed data) before merging. Migrations that take >30s on production-size data must be run as background jobs, not blocking deploys.
  • Every critical path (auth, payment, core CRUD) must have: latency histogram, error rate counter, and an alert that fires when p95 exceeds 2x the baseline or error rate exceeds 1%.
  • Security is not a follow-up task. Auth checks, input validation, and CSRF protection are part of the initial implementation. If a PR adds an endpoint without auth, it is not ready for review.
  • Performance budgets: API response p95 <200ms for reads, <500ms for writes. If a new feature exceeds these, optimize before merging — not after.
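
The performance budgets above can be checked mechanically from latency samples. This is a sketch: it uses the nearest-rank percentile method, which is one common choice and may differ from what a given metrics system computes, and the function names are hypothetical.

```typescript
type Operation = "read" | "write";

// Budgets from the list above: p95 under 200ms for reads, 500ms for writes.
const P95_BUDGET_MS: Record<Operation, number> = { read: 200, write: 500 };

// Nearest-rank percentile over a copy of the samples.
function percentile(samples: number[], p: number): number {
  if (samples.length === 0) {
    throw new Error("percentile requires at least one sample");
  }
  const sorted = samples.slice().sort((a, b) => a - b);
  const rank = Math.max(1, Math.ceil((p / 100) * sorted.length));
  return sorted[rank - 1];
}

function withinBudget(operation: Operation, latenciesMs: number[]): boolean {
  return percentile(latenciesMs, 95) < P95_BUDGET_MS[operation];
}
```

A check like this could run in CI against a load-test result so that a budget violation blocks the merge instead of surfacing after release.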

Anti-Pattern Detection

When reviewing or writing code, flag and fix these immediately:
  • N+1 queries: Loop that makes a database call per iteration. Fix with batch query, JOIN, or DataLoader.
  • Unbounded queries: SELECT * or a query without LIMIT on a user-facing endpoint. Always paginate, and always select only the columns you need.
  • Shared mutable state: Global variable modified by multiple request handlers. Use request-scoped context or dependency injection.
  • Stringly-typed code: Using raw strings for status, type, or role values. Use enums or union types.
  • God function: Function that handles parsing, validation, business logic, database, and response formatting. Split into layers.
  • Missing error context: catch (e) { throw new Error("failed") } discards the cause. Always include the original error and operation context.
  • Premature abstraction: Creating a generic framework for something used in one place. Wait until the pattern repeats 3 times.
  • Config in code: Hardcoded URLs, API keys, feature flags, or environment-specific values. Move to environment variables or config service.
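
To make the N+1 item above concrete, the sketch below compares a per-row lookup with a batched one. FakeDb is an in-memory stand-in invented for illustration; it only counts round trips and is not a real database client.

```typescript
interface Author {
  id: number;
  name: string;
}

interface Post {
  id: number;
  authorId: number;
  title: string;
}

// Hypothetical in-memory "database" that counts how many queries it serves.
class FakeDb {
  queryCount = 0;
  private authors = new Map<number, Author>();

  constructor(authors: Author[]) {
    for (const author of authors) {
      this.authors.set(author.id, author);
    }
  }

  findAuthorById(id: number): Author | undefined {
    this.queryCount++;
    return this.authors.get(id);
  }

  findAuthorsByIds(ids: number[]): Author[] {
    this.queryCount++;
    const found: Author[] = [];
    for (const id of ids) {
      const author = this.authors.get(id);
      if (author !== undefined) found.push(author);
    }
    return found;
  }
}

// Anti-pattern: one query per post.
function bylinesNPlusOne(db: FakeDb, posts: Post[]): string[] {
  return posts.map((post) => {
    const author = db.findAuthorById(post.authorId);
    return post.title + " by " + (author ? author.name : "unknown");
  });
}

// Fix: collect the ids, fetch once, then join in memory.
function bylinesBatched(db: FakeDb, posts: Post[]): string[] {
  const ids = Array.from(new Set(posts.map((post) => post.authorId)));
  const byId = new Map<number, Author>();
  for (const author of db.findAuthorsByIds(ids)) {
    byId.set(author.id, author);
  }
  return posts.map((post) => {
    const author = byId.get(post.authorId);
    return post.title + " by " + (author ? author.name : "unknown");
  });
}
```

With three posts, the first version issues three queries and the second issues one; the gap grows linearly with the number of rows.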

Technology Selection Decision Rules

Language Selection

  • Go: Choose when the service is I/O-bound with high concurrency (>10k concurrent connections), needs single-binary deployment, or is infrastructure tooling (CLI, proxy, sidecar). Avoid for rapid prototyping or heavy ORM usage.
  • Node.js (TypeScript): Choose when the team is full-stack JS, the service is a BFF (Backend for Frontend), real-time features dominate (WebSocket, SSE), or the ecosystem has a mature library for the domain. Avoid for CPU-bound processing (image/video, ML inference).
  • Python: Choose when the domain is data/ML, the team is Python-native, or rapid iteration matters more than runtime performance. Avoid for latency-sensitive services (<10ms p95 target) unless using FastAPI with uvloop.
  • Java/Kotlin: Choose when the organization has JVM expertise, needs mature enterprise libraries (Spring, Hibernate), or requires strong typing with high throughput. Avoid for small services where JVM startup time (2-5s) exceeds acceptable cold start.
  • Rust: Choose only when the service has hard latency requirements (<1ms p99), processes untrusted input at scale, or is a performance-critical library. The 2-3x development time multiplier must be justified by a specific, measured need.

Database Selection (Quick Reference)

  • PostgreSQL (default): Use for any relational data <10TB. It supports JSON, full-text search, and geospatial data (the latter via the PostGIS extension), so avoid adding a second database when possible.
  • Redis: Use for caching (TTL <24h), session storage, rate limiting, or real-time leaderboards. Never use it as the primary data store: design as if data loss on restart is expected.
  • MongoDB: Use only when schema genuinely varies per document (CMS, event stores with polymorphic payloads). If you can define a schema upfront, use PostgreSQL.
  • SQLite: Use for CLI tools, mobile apps, embedded systems, or local-first apps. If >1 process writes concurrently, use PostgreSQL.

API Style Selection

  • REST: Default choice. Use for CRUD-dominant APIs, public APIs, or when caching matters (HTTP caching works natively).
  • GraphQL: Use only when the client needs flexible field selection across >5 entity types AND the frontend team controls the schema. If <3 entity types, REST is simpler. Never use GraphQL for server-to-server communication.
  • gRPC: Use for internal service-to-service communication when latency <5ms matters, streaming is required, or strong contract enforcement across >3 languages is needed. Never expose gRPC directly to browsers without a gateway.
  • tRPC: Use when frontend and backend share a TypeScript monorepo. Gives end-to-end type safety with zero schema duplication. Not suitable for multi-language backends.

Estimation Decision Rules

Task Estimation by Type

  • CRUD endpoint (route + validation + DB query + tests): 2-4 hours. If it needs pagination, filtering, and sorting: add 2 hours.
  • CRUD endpoint with auth + rate limiting + caching: 1 day.
  • Third-party API integration (OAuth, webhook, SDK): 1-2 days. Always double the estimate if the third-party documentation is poor — verify by reading it before estimating.
  • Database migration (add column, add index): 1-2 hours if table <1M rows. If >1M rows, add 4 hours for online migration tooling and testing.
  • Database migration (schema redesign, data backfill): 2-3 days including dual-write period and verification.
  • New service from scratch (repo + CI/CD + health checks + first endpoint + monitoring): 3-5 days.
  • Authentication system (signup + login + password reset + session management): 3-5 days with a proven library (NextAuth, Passport, Django auth). 2-3 weeks if building custom, which is strongly discouraged.
  • File upload/download (presigned URLs + validation + storage): 1-2 days.
  • Real-time feature (WebSocket/SSE + connection management + reconnection): 2-3 days.
  • Search (full-text across >3 fields with filtering): 1 day with PostgreSQL tsvector, 2-3 days with Elasticsearch.

Estimation Multipliers

  • First time using a library/framework: 2x the estimate.
  • Legacy codebase with no tests: 1.5x (you must write tests to verify your change).
  • Multi-timezone or i18n requirement: 1.3x.
  • Compliance requirement (audit logging, encryption, access controls): 1.5x.
  • If the estimate exceeds 5 days, break into sub-tasks and estimate each. If any sub-task exceeds 3 days, it is too large — break further.
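
As a worked example of the multipliers above, the sketch below adjusts a base estimate and applies the 5-day breakdown rule. Two things are assumptions rather than statements from the guide: a working day is taken as 8 hours, and multiple multipliers are combined by multiplying them together.

```typescript
const HOURS_PER_DAY = 8; // assumption: the guide does not define a day in hours

// Multipliers from the list above.
const MULTIPLIERS = {
  firstTimeLibrary: 2,
  legacyNoTests: 1.5,
  i18nOrTimezones: 1.3,
  compliance: 1.5,
};

type Factor = keyof typeof MULTIPLIERS;

// Assumption: stacked factors multiply; the guide does not say how to combine them.
function adjustedHours(baseHours: number, factors: Factor[]): number {
  return factors.reduce((hours, factor) => hours * MULTIPLIERS[factor], baseHours);
}

// Estimates above 5 days should be split into sub-tasks.
function needsBreakdown(hours: number): boolean {
  return hours > 5 * HOURS_PER_DAY;
}
```

Under these assumptions, a 2-day third-party integration (16 hours) with an unfamiliar SDK becomes 32 hours and stays under the limit. Adding a compliance requirement brings it to 48 hours, or 6 days, which calls for a breakdown.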

Refactoring Decision Rules

When to Refactor

  • Cyclomatic complexity >15 (measure with scripts/review_checklist.py or the ESLint complexity rule): Extract branches into named functions. Each function should have complexity <10.
  • Function called from >5 call sites with different flag combinations: Replace boolean flags with a strategy pattern or separate functions. processOrder(order, true, false, true) is unreadable; split it into processStandardOrder() and processExpressOrder().
  • Identical code block appears 3+ times: Extract to a shared function. At 2 occurrences, tolerate duplication — premature abstraction is worse than duplication.
  • Module has >10 imports from other modules: High coupling. Introduce a facade or reorganize module boundaries so each module imports from at most 5 others.
  • Test requires >20 lines of setup: The code under test has too many dependencies. Inject dependencies and provide test doubles. If you cannot test a function without starting a database, the function does too much.
  • Changing one feature requires modifying >5 files across >2 directories: Shotgun surgery. Consolidate related logic into a single module or introduce a feature module pattern.
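
The boolean-flag rule above might look like this after the split. The guide names processStandardOrder() and processExpressOrder() but does not define them, so the order fields and the meaning of each option here are invented for illustration.

```typescript
interface Order {
  id: string;
  totalCents: number;
}

type Shipping = "standard" | "express";

interface ProcessOptions {
  shipping: Shipping;
  notifyCustomer: boolean;
}

interface ProcessedOrder {
  orderId: string;
  shipping: Shipping;
  notifyCustomer: boolean;
}

// Shared core takes a named options object instead of positional booleans.
function processWith(order: Order, options: ProcessOptions): ProcessedOrder {
  return {
    orderId: order.id,
    shipping: options.shipping,
    notifyCustomer: options.notifyCustomer,
  };
}

// Call sites now state intent through the function name.
function processStandardOrder(order: Order): ProcessedOrder {
  return processWith(order, { shipping: "standard", notifyCustomer: true });
}

function processExpressOrder(order: Order): ProcessedOrder {
  return processWith(order, { shipping: "express", notifyCustomer: true });
}
```

Keeping one shared core avoids duplicating logic, while the named entry points remove the need to remember what each positional flag meant.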

When NOT to Refactor

  • Code is stable, untouched for >6 months, and has no pending feature work. Leave it alone.
  • You are refactoring to match a style preference, not to fix a measurable problem (complexity, coupling, test difficulty). Stop.
  • The refactor would require >3 days and there is no feature work that benefits from it. Defer and document as tech debt.

Code Review Decision Rules

By Pattern Detected

  • New database query without EXPLAIN ANALYZE output: Request it. No query merges without execution plan on production-like data volume.
  • New endpoint without error response test: Block. Every endpoint must be tested with: valid input (200), missing auth (401), forbidden (403), and invalid input (400).
  • Catch block that logs and continues: Verify the function can actually recover. If it cannot, re-throw. Catching + logging + continuing hides bugs until they cascade.
  • Function with >3 levels of nesting: Request extraction of inner blocks into named functions. Deep nesting signals logic that should be inverted (early returns) or decomposed.
  • Magic numbers/strings: Request named constants, for example replace if (retries > 3) with if (retries > MAX_RETRIES). Exception: 0, 1, and common HTTP status codes (200, 404, 500).
  • TODO/FIXME without issue link: Block. Every TODO must reference a ticket. Untracked TODOs never get fixed and rot.
  • Test that asserts only toBeTruthy() or not.toThrow(): Weak assertion. Request specific value assertions. A test that passes when the output is wrong is worse than no test.
  • New dependency added: Check bundle size impact (run bundlesize or equivalent), last commit date, download count, license compatibility, and whether a stdlib solution exists. Block if the dependency adds >50KB to the bundle for a feature that could be built in <2 hours.
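
For the catch-block rule above, one way to re-throw without losing information is to wrap the original error together with the operation name. This is a sketch: OperationError, parseQuantity, and reserveStock are hypothetical names, and a real handler would typically also attach the input and a correlation ID.

```typescript
// Wraps a lower-level failure with the operation that was running.
class OperationError extends Error {
  readonly operation: string;
  readonly original: unknown;

  constructor(operation: string, original: unknown) {
    const detail = original instanceof Error ? original.message : String(original);
    super(operation + " failed: " + detail);
    this.name = "OperationError";
    this.operation = operation;
    this.original = original;
  }
}

// Hypothetical validation step that can fail.
function parseQuantity(raw: string): number {
  const quantity = Number(raw);
  if (!Number.isInteger(quantity) || quantity <= 0) {
    throw new RangeError("invalid quantity: " + raw);
  }
  return quantity;
}

// This function cannot recover from a bad quantity, so it re-throws with context
// instead of logging and continuing.
function reserveStock(rawQuantity: string): number {
  try {
    return parseQuantity(rawQuantity);
  } catch (err) {
    throw new OperationError("reserveStock", err);
  }
}
```

The wrapped error keeps the original available for logging, so the top-level handler can report both what was attempted and why it failed.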

Architecture Guidance

  • If the team is <5 engineers, use a monolith. If 5-15, use a modular monolith with clear domain modules. If >15 with independent release cycles, consider service extraction.
  • Keep boundaries explicit between domain logic, data access, and transport layers. If domain logic imports HTTP request types or database client types, the boundary is violated.
  • When adding a new dependency (library, service, or tool): check last commit date (reject if >12 months inactive), weekly downloads (reject if <1k for JS, <500 for Python), and open security advisories.
  • Use feature flags for changes that affect >1000 users or involve new infrastructure. Roll out to 1% → 10% → 50% → 100% with 24-hour soak at each stage.
  • Record architecture decisions (ADRs) for: new external dependencies, database schema changes, API contract changes, infrastructure changes, and technology choices. An ADR that takes >10 minutes to write is too long.
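
The staged rollout in the feature-flag bullet depends on each user landing in the same bucket at every stage. The sketch below assigns buckets with an FNV-1a-style hash over the user ID. The hash choice and function names are assumptions; production flag services usually also salt the hash per flag.

```typescript
// Rollout stages from the list above, in percent.
const ROLLOUT_STAGES = [1, 10, 50, 100];

// Deterministic bucket in the range 0..99 (FNV-1a-style over UTF-16 code units).
function bucketFor(userId: string): number {
  let hash = 0x811c9dc5;
  for (let i = 0; i < userId.length; i++) {
    hash ^= userId.charCodeAt(i);
    hash = Math.imul(hash, 0x01000193);
  }
  return (hash >>> 0) % 100;
}

// A user is in the rollout when their bucket falls below the current percentage.
function isEnabled(userId: string, rolloutPercent: number): boolean {
  return bucketFor(userId) < rolloutPercent;
}
```

Because the bucket is fixed per user, anyone enabled at 10% stays enabled at 50% and 100%, so users do not flip in and out of the feature between stages.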

Self-Verification Protocol

After completing any implementation, run this checklist before considering the task done:
  • Run the full test suite. Zero failures. If a test you did not touch fails, verify it also fails on main — if it does not, your change broke it.
  • For every new API endpoint, make a real HTTP request (curl or integration test) and verify the response matches the documented schema field-by-field.
  • For every database migration, run forward AND rollback on a fresh database and on production-like seed data. Both directions must succeed.
  • After refactoring, diff test outputs before/after. If any behavior changed unintentionally, the refactor introduced a bug — revert and redo.
  • Run scripts/review_checklist.py on every modified file. Fix all findings before requesting review.
  • Check that no console.log, print(), debugger, or TODO statements remain in production code paths.
  • Verify that every new dependency passes the health check: last commit <12 months, downloads >1k/week (JS) or >500/week (Python), zero open critical CVEs.
  • If the change affects a user-facing flow, manually walk through it end-to-end in the browser/app. Automation catches regressions — manual testing catches UX issues.
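
The dependency health check in the list above can be expressed as a small function. This is a sketch: the DependencyInfo shape is hypothetical, and the data would have to come from a package registry and an advisory database, neither of which is queried here. A package sitting exactly at the download threshold is treated as passing, which is an assumption.

```typescript
type Ecosystem = "js" | "python";

// Hypothetical summary of what you looked up about a package.
interface DependencyInfo {
  name: string;
  ecosystem: Ecosystem;
  lastCommit: Date;
  weeklyDownloads: number;
  openCriticalCves: number;
}

// Thresholds from the checklist: 1k/week for JS, 500/week for Python.
const MIN_WEEKLY_DOWNLOADS: Record<Ecosystem, number> = { js: 1000, python: 500 };
const MAX_INACTIVE_MONTHS = 12;

// Returns a list of problems; an empty list means the dependency passes.
function healthProblems(dep: DependencyInfo, now: Date): string[] {
  const problems: string[] = [];
  const cutoff = new Date(now.getTime());
  cutoff.setMonth(cutoff.getMonth() - MAX_INACTIVE_MONTHS);
  if (dep.lastCommit.getTime() < cutoff.getTime()) {
    problems.push("no commits in the last 12 months");
  }
  if (dep.weeklyDownloads < MIN_WEEKLY_DOWNLOADS[dep.ecosystem]) {
    problems.push("weekly downloads below threshold");
  }
  if (dep.openCriticalCves > 0) {
    problems.push("open critical CVEs");
  }
  return problems;
}
```

Returning the list of reasons, rather than a boolean, makes the result usable directly in a review comment.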

Failure Recovery

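  • Build fails after your change: Read the full error, not just the last line. If it is a type error, fix it. If it is a test failure in code you did not touch, run git stash && npm test to check whether it also fails without your uncommitted changes (restore them afterwards with git stash pop; if your changes are already committed, run the test on main instead). If it still fails, the failure is pre-existing; if not, your change broke it.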
  • Test passes locally, fails in CI: Check environment differences: Node/Python version, missing env vars, timezone, OS case sensitivity, or Docker vs host filesystem. Never skip a CI test; fix the root cause.
  • Approach not working after 2 hours: Stop. Write: (1) what you tried, (2) why it failed, (3) what assumption was wrong. Identify the simplest alternative. If none exists, escalate with a concrete blocker description.
  • Performance regression after deploy: Roll back first, investigate second. Reproduce locally with profiling (see references/performance-profiling.md). Never debug performance in production while users are affected.
  • Merge conflict on large PR: If >5 files conflict, do not resolve blindly. Re-read both sides of every conflict. If the other branch changed the same function, test both behaviors after resolution.
  • Dependency upgrade breaks things: Pin the working version, file an issue with exact error, move on. Do not spend >1 hour on a third-party regression.
  • Flaky test blocking merge: Run it 5 times locally. If it fails inconsistently, it has a race condition or timing dependency. Fix the test (add proper waits, mock time, remove shared state) — never retry-until-pass.
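
For the flaky-test item above, "mock time" often just means passing the clock in. The sketch below uses a hypothetical session-expiry check; the same idea applies to any code that reads the wall clock.

```typescript
interface Session {
  userId: string;
  expiresAtMs: number;
}

// The current time is a parameter with a real-clock default, so production
// callers can omit it while tests pass a fixed value.
function isExpired(session: Session, nowMs: number = Date.now()): boolean {
  return nowMs >= session.expiresAtMs;
}
```

A test can then assert both sides of the boundary with fixed timestamps, with no sleeps and no dependence on how fast the CI machine runs.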

Existing Codebase Orientation

When dropped into an unfamiliar codebase, complete this sequence before writing any code:
  1. Read README and config files (5 min) — Identify language, framework, build/test/deploy commands, and entry points.
  2. Run the app locally (10 min) — If it fails, fixing the dev setup is task zero. Nothing else matters until the app runs.
  3. Run the test suite (5 min) — Note: total tests, duration, failures. Failing tests on main are a red flag to document.
  4. Trace the critical path (15 min) — Pick the core user action. Trace from HTTP handler → business logic → database → response. Note file and function at each step.
  5. Map architecture layers (10 min) — Entry points (routes/controllers), business logic, data access, external services. Note where boundaries are clean vs tangled.
  6. Check dependency health (5 min): Run npm outdated or pip list --outdated. Flag anything >2 major versions behind.
  7. Read last 20 commits (5 min) — Understand active work areas and commit conventions.
  8. Identify test gaps (5 min) — What types exist (unit/integration/E2E)? Where are the blind spots?
Only after this orientation should you plan your change. This 1-hour investment prevents breaking things you did not know existed.

Scripts

  • scripts/review_checklist.py: Analyzes a source file for common code review concerns: TODO/FIXME count, long functions, bare excepts, hardcoded secrets, and leftover debug statements. Run with --help for options.

References

  • Code Examples — TypeScript API handler, Python retry wrapper, migration-safe SQL, and CI quality gate.
  • Design Docs & ADRs — Design document template (filled-in payment processing example), Architecture Decision Record example, and lightweight RFC template.
  • Production Patterns — Feature flags with percentage rollout, graceful shutdown, expand-migrate-contract database migrations, structured logging with correlation IDs, circuit breaker, and retry with exponential backoff. All TypeScript.
  • Code Review & Incident Response — Code review checklist, incident response runbook, post-mortem template, and on-call handoff template.
  • Performance Profiling — Node.js profiling with clinic.js and Chrome DevTools, EXPLAIN ANALYZE workflow, pg_stat_statements top queries, N+1 detection, memory leak investigation (3-snapshot heap diff), CPU profiling with worker threads, and a full worked example from "endpoint is slow" to root cause fix.
  • Debugging Strategies — Git bisect with automated test scripts, structured log correlation across services, HTTP traffic capture and replay, seed-based reproducible test data, race condition debugging with timestamp analysis and lock contention queries, production debug endpoints, verbose logging feature flags, canary deployments, and common bug pattern signatures.
  • Architecture Decisions — Concrete decision matrices with scoring and thresholds for monolith vs microservices, database selection (PostgreSQL vs MySQL vs MongoDB vs DynamoDB vs Redis), queue selection (SQS vs RabbitMQ vs Kafka), caching strategy with hit rate thresholds and stampede prevention, API style (REST vs GraphQL vs gRPC), and auth strategy (sessions vs JWT vs OAuth2) with security checklists.
  • Team Patterns — PR review turnaround SLA by PR size, review comment categories (must-fix/suggestion/learning/praise), tech debt register with severity scoring, on-call runbook template with triage decision tree, story point calibration with reference stories, velocity tracking formulas, brown bag and decision log templates, mentoring framework with pairing progression, and dependency upgrade strategy with Renovate config.