# IT Operations Expert

A comprehensive skill for managing IT infrastructure operations, ensuring service reliability, implementing monitoring and alerting strategies, managing incidents, and maintaining operational excellence through automation and best practices.

## Core Principles

### 1. Service Reliability First

- **Proactive Monitoring:** Implement comprehensive observability before incidents occur
- **Incident Management:** Structured response processes with clear escalation paths
- **SLA/SLO Management:** Define and maintain service level objectives aligned with business needs
- **Continuous Improvement:** Learn from incidents through blameless post-mortems

### 2. Automation Over Manual Processes

- **Infrastructure as Code:** Manage infrastructure configuration through version-controlled code
- **Runbook Automation:** Convert manual procedures into automated workflows
- **Self-Healing Systems:** Implement automated remediation for common issues
- **Configuration Management:** Maintain consistency across environments

### 3. ITIL Service Management

- **Service Strategy:** Align IT services with business objectives
- **Service Design:** Design resilient, scalable services
- **Service Transition:** Manage changes with minimal disruption
- **Service Operation:** Deliver and support services effectively
- **Continual Service Improvement:** Iteratively enhance service quality

### 4. Operational Excellence

- **Documentation:** Maintain current runbooks, procedures, and architecture diagrams
- **Knowledge Management:** Build searchable knowledge bases from incident resolutions
- **Capacity Planning:** Forecast and provision resources proactively
- **Cost Optimization:** Balance performance requirements with infrastructure costs

## Core Workflow

### Infrastructure Operations Workflow

```
1. MONITORING & OBSERVABILITY
   ├─ Define SLIs/SLOs/SLAs for critical services
   ├─ Implement metrics collection (infrastructure, application, business)
   ├─ Configure alerting with proper thresholds and escalation
   ├─ Build dashboards for different audiences (ops, devs, executives)
   └─ Establish on-call rotation and escalation procedures

2. INCIDENT MANAGEMENT
   ├─ Receive alert or user report
   ├─ Assess severity and impact (P1/P2/P3/P4)
   ├─ Engage appropriate responders
   ├─ Investigate and diagnose root cause
   ├─ Implement fix or workaround
   ├─ Communicate status to stakeholders
   ├─ Document resolution in knowledge base
   └─ Conduct post-incident review

3. CHANGE MANAGEMENT
   ├─ Submit change request with impact assessment
   ├─ Review and approve through CAB (Change Advisory Board)
   ├─ Schedule change window
   ├─ Execute change with rollback plan ready
   ├─ Validate success criteria
   ├─ Document actual vs planned results
   └─ Close change ticket

4. CAPACITY PLANNING
   ├─ Collect resource utilization trends
   ├─ Analyze growth patterns
   ├─ Forecast future requirements
   ├─ Plan procurement or provisioning
   ├─ Execute capacity additions
   └─ Monitor effectiveness

5. AUTOMATION & OPTIMIZATION
   ├─ Identify repetitive manual tasks
   ├─ Document current process
   ├─ Design automated solution
   ├─ Implement and test automation
   ├─ Deploy to production
   ├─ Measure time/cost savings
   └─ Iterate and improve
```
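
Workflow 2 is essentially a state machine, and encoding it that way keeps steps from being skipped under pressure. Below is a minimal, hypothetical Python sketch of that idea; the class, state, and method names are illustrative, not part of any particular incident tool.

```python
# A minimal sketch of workflow 2 (incident management) as a state machine:
# each incident may only advance one step at a time, so steps such as
# documentation cannot be silently skipped. All names are illustrative.
from enum import Enum

class IncidentState(Enum):
    DETECTED = 1    # alert received or user report filed
    TRIAGED = 2     # severity assessed (P1-P4), responders engaged
    DIAGNOSED = 3   # root cause identified
    MITIGATED = 4   # fix or workaround in place
    DOCUMENTED = 5  # resolution recorded in the knowledge base
    REVIEWED = 6    # post-incident review conducted

class Incident:
    def __init__(self, summary: str, priority: str):
        self.summary, self.priority = summary, priority
        self.state = IncidentState.DETECTED
        self.timeline = [f"DETECTED: {summary}"]

    def advance(self, note: str) -> None:
        """Move to the next state, recording a timeline entry."""
        if self.state is IncidentState.REVIEWED:
            raise ValueError("incident is already closed")
        self.state = IncidentState(self.state.value + 1)
        self.timeline.append(f"{self.state.name}: {note}")

incident = Incident("checkout latency spike", priority="P2")
incident.advance("P2 confirmed, on-call engaged")
incident.advance("root cause: connection-pool exhaustion")
print(incident.state.name)  # DIAGNOSED
```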

## Decision Frameworks

### Alert Configuration Decision Matrix

| Scenario | Alert Type | Threshold | Response Time | Escalation |
|----------|------------|-----------|---------------|------------|
| Service completely down | Page | Immediate | < 5 min | Immediate to on-call |
| Service degraded | Page | 2-3 failures | < 15 min | After 15 min to on-call |
| High resource usage | Warning | > 80% sustained | < 1 hour | After 2 hours to team lead |
| Approaching capacity | Info | > 70% trend | < 24 hours | Weekly capacity review |
| Configuration drift | Ticket | Any deviation | < 7 days | Monthly review |
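
The matrix translates directly into routing logic. The sketch below is one hypothetical encoding of it; the scenario keys, channel labels, and return shape are assumptions to adapt to whatever alerting pipeline is actually in use.

```python
# A hypothetical encoding of the decision matrix above as routing data.
ALERT_POLICY = {
    # scenario: (alert type, threshold, response target, escalation)
    "service_down":         ("page",    "immediate",       "5 min",    "immediate to on-call"),
    "service_degraded":     ("page",    "2-3 failures",    "15 min",   "after 15 min to on-call"),
    "high_resource_usage":  ("warning", "> 80% sustained", "1 hour",   "after 2 hours to team lead"),
    "approaching_capacity": ("info",    "> 70% trend",     "24 hours", "weekly capacity review"),
    "config_drift":         ("ticket",  "any deviation",   "7 days",   "monthly review"),
}

def route_alert(scenario: str) -> dict:
    """Look up how an alert for this scenario should be delivered."""
    channel, threshold, response_target, escalation = ALERT_POLICY[scenario]
    return {"channel": channel, "threshold": threshold,
            "respond_within": response_target, "escalation": escalation}

decision = route_alert("service_degraded")
print(decision["channel"], decision["respond_within"])  # page 15 min
```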

### Incident Severity Classification

**Priority 1 (Critical)**

- Complete service outage affecting all users
- Data loss or security breach
- Financial impact > $10K/hour
- Response: Immediate, 24/7, all hands on deck

**Priority 2 (High)**

- Partial service outage affecting many users
- Significant performance degradation
- Financial impact $1K-$10K/hour
- Response: < 30 minutes during business hours

**Priority 3 (Medium)**

- Service degradation affecting some users
- Non-critical functionality impaired
- Workaround available
- Response: < 4 hours during business hours

**Priority 4 (Low)**

- Minor issues with minimal impact
- Cosmetic problems
- Enhancement requests
- Response: Next business day
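As a rough illustration, the classification can be expressed as a small function. The dollar thresholds mirror the list above; the function signature and the 0.05/0.5 user-impact cutoffs are illustrative assumptions, not part of the framework.

```python
# A hedged sketch of the severity classification above. Thresholds for
# users_affected are invented for illustration.
def classify_priority(full_outage: bool, users_affected: float,
                      loss_per_hour: float, workaround_available: bool) -> str:
    """users_affected is the fraction of users impacted (0.0-1.0)."""
    if full_outage or loss_per_hour > 10_000:
        return "P1"  # complete outage, data loss/breach, or > $10K/hour
    if users_affected >= 0.5 or loss_per_hour >= 1_000:
        return "P2"  # partial outage affecting many users, $1K-$10K/hour
    if users_affected > 0.05 or not workaround_available:
        return "P3"  # degradation for some users; workaround usually exists
    return "P4"      # minor or cosmetic issues, minimal impact

assert classify_priority(False, 0.6, 500, True) == "P2"
```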

### Change Management Risk Assessment

```
Risk Level = Impact × Likelihood × Complexity

Impact (1-5):
1 = Single user
2 = Team
3 = Department
4 = Company-wide
5 = Customer-facing

Likelihood of Issues (1-5):
1 = Routine, tested
2 = Familiar, documented
3 = Some uncertainty
4 = New territory
5 = Never done before

Complexity (1-5):
1 = Single component
2 = Few components
3 = Multiple systems
4 = Cross-platform
5 = Enterprise-wide

Risk Score Interpretation:
1-20: Standard change (pre-approved)
21-50: Normal change (CAB review)
51-75: High-risk change (extensive testing, senior approval)
76-125: Emergency change only (executive approval)
```
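
A small sketch of the rubric in code: the factor ranges and score bands come straight from the framework above, while the function itself is illustrative.

```python
# Compute the change risk score and map it to the approval category
# defined by the rubric above.
def change_risk(impact: int, likelihood: int, complexity: int) -> tuple[int, str]:
    for factor in (impact, likelihood, complexity):
        if not 1 <= factor <= 5:
            raise ValueError("each factor must be scored 1-5")
    score = impact * likelihood * complexity
    if score <= 20:
        category = "Standard change (pre-approved)"
    elif score <= 50:
        category = "Normal change (CAB review)"
    elif score <= 75:
        category = "High-risk change (extensive testing, senior approval)"
    else:
        category = "Emergency change only (executive approval)"
    return score, category

# Department-wide impact, some uncertainty, multiple systems: 3 * 3 * 3 = 27
print(change_risk(3, 3, 3))  # (27, 'Normal change (CAB review)')
```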

### Monitoring Tool Selection

| Requirement | Prometheus + Grafana | Datadog | New Relic | ELK Stack | Splunk |
|-------------|----------------------|---------|-----------|-----------|--------|
| Cost | Free (self-hosted) | $$$$ | $$$$ | Free-$$$ | $$$$ |
| Metrics | Excellent | Excellent | Excellent | Good | Good |
| Logs | Via Loki | Excellent | Excellent | Excellent | Excellent |
| Traces | Via Tempo | Excellent | Excellent | Limited | Good |
| Learning Curve | Steep | Moderate | Moderate | Steep | Steep |
| Cloud-Native | Excellent | Excellent | Excellent | Good | Good |
| On-Premises | Excellent | Good | Good | Excellent | Excellent |
| APM | Via exporters | Excellent | Excellent | Limited | Good |

## Common Operational Challenges

### Challenge 1: Alert Fatigue

**Problem:** Too many false-positive alerts causing team burnout.

**Solution:**
```yaml
Alert Tuning Process:
1. Measure baseline alert volume and false positive rate
2. Categorize alerts by actionability:
   - Actionable + Urgent = Keep as page
   - Actionable + Not Urgent = Ticket
   - Not Actionable = Remove or convert to dashboard metric
3. Implement alert aggregation (group similar alerts)
4. Add context to alerts (runbook links, relevant metrics)
5. Regular review meetings (weekly) to tune thresholds
6. Track metrics:
   - MTTA (Mean Time to Acknowledge): < 5 min target
   - False Positive Rate: < 20% target
   - Alert Volume per Week: Trending down
```
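
To make step 6 concrete, here is a minimal sketch computing MTTA, false-positive rate, and weekly volume from alert records. The record shape (fired/acknowledged timestamps plus an actionability flag) is an assumption for illustration.

```python
# Compute the alert-fatigue tracking metrics over a week of alert records.
from datetime import datetime

# (fired_at, acknowledged_at, was_actionable)
alerts = [
    (datetime(2024, 1, 1, 9, 0),  datetime(2024, 1, 1, 9, 3),   True),
    (datetime(2024, 1, 1, 14, 0), datetime(2024, 1, 1, 14, 12), False),
    (datetime(2024, 1, 2, 2, 30), datetime(2024, 1, 2, 2, 34),  True),
]

mtta_seconds = sum((ack - fired).total_seconds() for fired, ack, _ in alerts) / len(alerts)
false_positive_rate = sum(1 for *_, actionable in alerts if not actionable) / len(alerts)

print(f"MTTA: {mtta_seconds / 60:.1f} min (target < 5)")               # 6.3 min here
print(f"False positives: {false_positive_rate:.0%} (target < 20%)")    # 33%
print(f"Alert volume this week: {len(alerts)} (should trend down)")
```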

### Challenge 2: Incident Documentation During Crisis

**Problem:** Teams skip documentation during high-pressure incidents.

**Solution:**
- Assign a dedicated scribe role (not the incident commander)
- Use incident management tools (PagerDuty, Opsgenie) with automatic timelines
- Template-based incident reports with required fields
- Post-incident review scheduled automatically (within 48 hours)
- Gamify documentation (track and recognize thorough documentation)

### Challenge 3: Knowledge Silos

**Problem:** Critical knowledge trapped in individual team members' heads.

**Solution:**
```yaml
Knowledge Transfer Strategy:
- Pair Programming/Shadowing: 20% of sprint capacity
- Runbook Requirements: Every system must have runbook
- Lunch & Learn Sessions: Weekly 30-min knowledge sharing
- Cross-Training Matrix: Track who knows what, identify gaps
- On-Call Rotation: Everyone rotates to spread knowledge
- Post-Incident Reviews: Mandatory team sharing
- Documentation Sprints: Quarterly focus on doc completion
```
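
The cross-training matrix can be kept as simple data and audited automatically. A hypothetical sketch (the people and systems are invented):

```python
# Track who can operate each system and flag single points of failure:
# any system covered by fewer than two people is a knowledge silo.
coverage = {
    "payment-api":   {"alice", "bob"},
    "auth-service":  {"alice"},
    "data-pipeline": {"alice", "bob", "carol"},
}

gaps = {system: people for system, people in coverage.items() if len(people) < 2}
for system, people in sorted(gaps.items()):
    print(f"SILO: {system} is only covered by {', '.join(sorted(people))}")
# SILO: auth-service is only covered by alice
```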

### Challenge 4: Balancing Stability vs. Innovation

**Problem:** The operations team resists change in order to maintain stability.

**Solution:**
- Implement change windows (planned maintenance periods)
- Use blue-green or canary deployments to lower risk
- Establish "innovation time" (the Google 20% time model)
- Create sandbox environments for experimentation
- Measure and reward both stability AND improvement metrics
- Include "toil reduction" as an OKR target

## Key Metrics & KPIs

### Service Reliability Metrics

```yaml
Availability:
  Formula: (Total Time - Downtime) / Total Time × 100
  Target: 99.9% (43.8 min/month downtime)
  Measurement: Per service, monthly

MTTR (Mean Time to Recovery):
  Formula: Sum of recovery times / Number of incidents
  Target: < 30 minutes for P1, < 4 hours for P2
  Measurement: Per severity level, monthly

MTBF (Mean Time Between Failures):
  Formula: Total operational time / Number of failures
  Target: > 720 hours (30 days)
  Measurement: Per service, quarterly

MTTA (Mean Time to Acknowledge):
  Formula: Sum of acknowledgment times / Number of alerts
  Target: < 5 minutes for pages
  Measurement: Per on-call engineer, weekly

Change Success Rate:
  Formula: Successful changes / Total changes × 100
  Target: > 95%
  Measurement: Monthly

Incident Recurrence Rate:
  Formula: Repeat incidents / Total incidents × 100
  Target: < 10%
  Measurement: Quarterly (same root cause within 90 days)
```
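
To see where the 43.8 min/month figure comes from, here is a small sketch that converts an availability target into a monthly downtime budget, using an average month of 30.44 days.

```python
# Turn the Availability formula around: how much downtime does a given
# SLO target leave per month?
def downtime_budget_minutes(slo_percent: float, days: float = 30.44) -> float:
    """Allowed downtime per period for a given availability target."""
    total_minutes = days * 24 * 60
    return total_minutes * (1 - slo_percent / 100)

for slo in (99.0, 99.9, 99.99):
    print(f"{slo}% -> {downtime_budget_minutes(slo):.1f} min/month")
# 99.0%  -> 438.3 min/month
# 99.9%  -> 43.8 min/month
# 99.99% -> 4.4 min/month
```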

### Operational Efficiency Metrics

```yaml
Toil Percentage:
  Definition: Time spent on manual, repetitive tasks
  Target: < 30% of team capacity
  Measurement: Weekly time tracking

Automation Coverage:
  Formula: Automated tasks / Total repetitive tasks × 100
  Target: > 70%
  Measurement: Quarterly audit

On-Call Load:
  Formula: Alerts per on-call shift
  Target: < 5 actionable alerts per shift
  Measurement: Per engineer, weekly

Runbook Coverage:
  Formula: Services with runbooks / Total services × 100
  Target: 100%
  Measurement: Monthly audit

Knowledge Base Utilization:
  Formula: Incidents resolved via KB / Total incidents × 100
  Target: > 40%
  Measurement: Monthly
```
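
A brief sketch computing three of these metrics from their inputs; the sample numbers are invented for illustration.

```python
# Operational efficiency metrics from simple tracked counts.
toil_hours, team_capacity_hours = 46.0, 200.0    # from weekly time tracking
automated_tasks, repetitive_tasks = 18, 24       # from the quarterly audit
services_with_runbooks, total_services = 31, 35  # from the monthly audit

toil_pct = toil_hours / team_capacity_hours * 100
automation_coverage = automated_tasks / repetitive_tasks * 100
runbook_coverage = services_with_runbooks / total_services * 100

print(f"Toil: {toil_pct:.0f}% (target < 30%)")                        # 23%
print(f"Automation coverage: {automation_coverage:.0f}% (> 70%)")     # 75%
print(f"Runbook coverage: {runbook_coverage:.0f}% (target 100%)")     # 89%
```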

## Integration Points

### With Development Teams

- Participate in design reviews for operational requirements
- Provide deployment automation and CI/CD pipeline support
- Share monitoring and logging requirements
- Collaborate on incident response and post-mortems
- Joint ownership of SLOs and error budgets

### With Security Teams

- Implement security monitoring and alerting
- Manage access controls and authentication systems
- Coordinate vulnerability patching and remediation
- Conduct security incident response
- Maintain compliance with security policies

### With Business Stakeholders

- Report on service availability and performance
- Communicate planned maintenance windows
- Provide capacity planning forecasts
- Translate technical metrics into business impact
- Participate in business continuity planning

## Best Practices

### 1. Blameless Post-Mortems

```markdown
Post-Incident Review Template:
- Incident Summary (what happened, when, impact)
- Timeline of Events (detailed chronology)
- Root Cause Analysis (5 Whys or Fishbone)
- What Went Well (strengths during response)
- What Could Be Improved (opportunities)
- Action Items (with owners and due dates)
- Lessons Learned (shareable insights)

Rules:
- No blame or punishment
- Focus on systems and processes, not people
- Everyone can speak freely
- Action items must be tracked to completion
```

### 2. Runbook Standards

```yaml
Runbook Contents:
  - Service Overview: Purpose, dependencies, architecture
  - SLIs/SLOs/SLAs: Defined thresholds and targets
  - Common Issues: Symptoms, causes, solutions
  - Troubleshooting Steps: Step-by-step procedures
  - Escalation Paths: Who to contact and when
  - Useful Commands: Copy-paste ready commands
  - Dashboard Links: Direct links to relevant dashboards
  - Recent Changes: Link to change log
  - Contact Information: Team, product owner, SMEs

Maintenance:
  - Review quarterly or after major incidents
  - Test procedures during low-traffic periods
  - Update after every significant change
  - Track usage metrics (page views, helpfulness ratings)
```

### 3. On-Call Best Practices

```yaml
On-Call Preparation:
  - Laptop with VPN access
  - Mobile device with notification apps
  - Contact list (escalation paths)
  - Access to all critical systems
  - Runbooks bookmarked
  - Backup on-call identified

During On-Call:
  - Acknowledge alerts within 5 minutes
  - Update incident status regularly
  - Follow escalation procedures
  - Document all actions in incident ticket
  - Handoff clearly to next on-call

Post On-Call:
  - Complete incident reports
  - Submit toil reduction tickets
  - Provide feedback on runbooks
  - Update on-call documentation
```

### 4. Change Management Discipline

```yaml
Standard Change Process:
  1. Create change request (RFC)
  2. Document:
     - What: Specific changes being made
     - Why: Business justification
     - When: Proposed date/time
     - Who: Change implementer and approver
     - How: Step-by-step procedure
     - Risk: Assessment and mitigation
     - Rollback: Detailed rollback plan
     - Testing: Validation steps
  3. Submit for CAB review (7 days advance notice)
  4. Implement during approved window
  5. Validate success criteria
  6. Close change with actual results
  7. Post-implementation review if issues occurred

Emergency Change Process:
  - Executive approval required
  - Implement with heightened monitoring
  - Full team notification
  - Complete documentation within 24 hours
  - Mandatory post-change review
```
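
The "Document:" checklist in step 2 lends itself to mechanical enforcement before CAB submission. A hypothetical sketch follows; the dataclass shape is an assumption, not a real ITSM schema.

```python
# Reject an RFC for CAB review until every required field is filled in.
from dataclasses import dataclass, fields

@dataclass
class ChangeRequest:
    what: str = ""      # specific changes being made
    why: str = ""       # business justification
    when: str = ""      # proposed date/time
    who: str = ""       # change implementer and approver
    how: str = ""       # step-by-step procedure
    risk: str = ""      # assessment and mitigation
    rollback: str = ""  # detailed rollback plan
    testing: str = ""   # validation steps

def missing_fields(rfc: ChangeRequest) -> list[str]:
    return [f.name for f in fields(rfc) if not getattr(rfc, f.name).strip()]

rfc = ChangeRequest(what="Upgrade DB to v15", why="EOL version", when="Sat 02:00 UTC")
print(missing_fields(rfc))  # ['who', 'how', 'risk', 'rollback', 'testing']
```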

## Reference Files

For detailed technical guidance, see:

- `reference/monitoring.md` - Observability, metrics, alerting, and dashboard design
- `reference/incident-management.md` - Incident response, root cause analysis, and post-mortems
- `reference/infrastructure.md` - Server management, network operations, and capacity planning
- `reference/automation.md` - Scripting, configuration management, and orchestration tools
- `reference/backup-recovery.md` - Backup strategies, disaster recovery, and business continuity

## Getting Started

1. **For New Infrastructure:** Start with `reference/infrastructure.md` for setup guidance
2. **For Monitoring Setup:** Review `reference/monitoring.md` for observability strategy
3. **For Incident Response:** See `reference/incident-management.md` for procedures
4. **For Automation Projects:** Check `reference/automation.md` for tooling recommendations
5. **For DR Planning:** Consult `reference/backup-recovery.md` for recovery strategies