aws-cost-finops

AWS Cost Optimization & FinOps

Systematic workflows for AWS cost optimization and financial operations management.

When to Use This Skill

Use this skill when you need to:
  • Find cost savings: Identify unused resources, rightsizing opportunities, or commitment discounts
  • Analyze spending: Understand cost trends, detect anomalies, or break down costs
  • Optimize architecture: Choose cost-effective services, storage tiers, or instance types
  • Implement FinOps: Set up governance, tagging, budgets, or monthly reviews
  • Make purchase decisions: Evaluate Reserved Instances, Savings Plans, or Spot instances
  • Troubleshoot costs: Investigate unexpected bills or cost spikes
  • Plan budgets: Forecast costs or evaluate impact of new projects

Cost Optimization Workflow

Follow this systematic approach for AWS cost optimization:
┌─────────────────────────────────────────────┐
│ 1. DISCOVER                                 │
│    What are we spending money on?           │
│    Run: find_unused_resources.py            │
│    Run: cost_anomaly_detector.py            │
└─────────────────────────────────────────────┘
┌─────────────────────────────────────────────┐
│ 2. ANALYZE                                  │
│    Where are the optimization opportunities?│
│    Run: rightsizing_analyzer.py             │
│    Run: detect_old_generations.py           │
│    Run: spot_recommendations.py             │
│    Run: analyze_ri_recommendations.py       │
└─────────────────────────────────────────────┘
┌─────────────────────────────────────────────┐
│ 3. PRIORITIZE                               │
│    What should we optimize first?           │
│    - Quick wins (low risk, high savings)    │
│    - Low-hanging fruit (easy to implement)  │
│    - Strategic improvements                 │
└─────────────────────────────────────────────┘
┌─────────────────────────────────────────────┐
│ 4. IMPLEMENT                                │
│    Execute optimization actions             │
│    - Delete unused resources                │
│    - Rightsize instances                    │
│    - Purchase commitments                   │
│    - Migrate to new generations             │
└─────────────────────────────────────────────┘
┌─────────────────────────────────────────────┐
│ 5. MONITOR                                  │
│    Verify savings and track metrics         │
│    - Monthly cost reviews                   │
│    - Tag compliance monitoring              │
│    - Budget variance tracking               │
└─────────────────────────────────────────────┘

Core Workflows

Workflow 1: Monthly Cost Optimization Review

**Frequency**: Run monthly (first week of each month)

**Step 1: Find Unused Resources**
```bash
# Scan for waste across all resources
python3 scripts/find_unused_resources.py

# Expected output:
# - Unattached EBS volumes
# - Old snapshots
# - Unused Elastic IPs
# - Idle NAT Gateways
# - Idle EC2 instances
# - Unused load balancers
# - Estimated monthly savings
```
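
As a rough illustration of one check from the list above (a sketch, not the contents of `find_unused_resources.py`), the snippet below filters a `describe_volumes`-style response for volumes in the `available` state, which is how EC2 reports unattached volumes. The helper names and the default price per GB are assumptions made for this example.

```python
from typing import Any


def unattached_volumes(response: dict[str, Any]) -> list[dict[str, Any]]:
    """Return volumes that are not attached to any instance.

    `response` is expected to look like the result of boto3's
    ec2.describe_volumes(): {"Volumes": [{"VolumeId": ..., "State": ..., "Size": ...}]}.
    """
    return [v for v in response.get("Volumes", []) if v.get("State") == "available"]


def estimated_monthly_cost(volumes: list[dict[str, Any]],
                           price_per_gb_month: float = 0.08) -> float:
    """Rough monthly cost in USD. The default price is an assumed placeholder,
    not a quoted AWS rate; look up the rate for your region and volume type."""
    return round(sum(v.get("Size", 0) for v in volumes) * price_per_gb_month, 2)


def fetch_available_volumes(region: str) -> dict[str, Any]:
    """Fetch one page of unattached volumes (needs credentials; not called here)."""
    import boto3  # imported lazily so the pure helpers run without boto3 installed

    ec2 = boto3.client("ec2", region_name=region)
    return ec2.describe_volumes(Filters=[{"Name": "status", "Values": ["available"]}])


# Hand-written sample data standing in for a real API response.
sample = {"Volumes": [
    {"VolumeId": "vol-1", "State": "available", "Size": 100},
    {"VolumeId": "vol-2", "State": "in-use", "Size": 50},
]}
idle = unattached_volumes(sample)
print([v["VolumeId"] for v in idle], estimated_monthly_cost(idle))
```

Keeping the filtering separate from the API call makes the logic easy to check against saved responses before pointing it at a live account.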


**Step 2: Analyze Cost Anomalies**
```bash
# Detect unusual spending patterns
python3 scripts/cost_anomaly_detector.py --days 30

# Expected output:
# - Cost spikes and anomalies
# - Top cost drivers
# - Period-over-period comparison
# - 30-day forecast
```
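
The source does not say how `cost_anomaly_detector.py` identifies spikes. One simple approach, shown here only as an assumed stand-in, is to flag days whose cost sits more than a chosen number of standard deviations above the mean. The function name, threshold, and sample figures are all illustrative.

```python
from statistics import mean, stdev


def find_cost_spikes(daily_costs: list[float], threshold: float = 2.0) -> list[int]:
    """Return the indexes of days whose cost is more than `threshold`
    sample standard deviations above the mean of the series."""
    if len(daily_costs) < 2:
        return []
    avg = mean(daily_costs)
    spread = stdev(daily_costs)
    if spread == 0:  # perfectly flat spend, nothing to flag
        return []
    return [i for i, cost in enumerate(daily_costs) if (cost - avg) / spread > threshold]


# Ten days of made-up daily spend with one obvious spike on day index 6.
costs = [100, 102, 98, 101, 99, 100, 250, 103, 97, 100]
print(find_cost_spikes(costs))
```

A z-score over the whole window is easy to reason about but is skewed by the spike itself; a rolling baseline or AWS Cost Anomaly Detection may suit production use better.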


**Step 3: Identify Rightsizing Opportunities**
```bash
# Find oversized instances
python3 scripts/rightsizing_analyzer.py --days 30

# Expected output:
# - EC2 instances with low utilization
# - RDS instances with low utilization
# - Recommended smaller instance types
# - Estimated savings
```
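
To make the idea of a rightsizing recommendation concrete, here is a minimal sketch that suggests the next size down within the same family when CPU utilization is low. It is not the logic of `rightsizing_analyzer.py`; the thresholds and size ladder are assumptions, and memory, network, and disk metrics are deliberately ignored.

```python
# Common EC2 size suffixes in ascending order. Not every family offers every
# size, so a suggestion must be validated against the real instance catalog.
SIZES = ["nano", "micro", "small", "medium", "large", "xlarge", "2xlarge", "4xlarge"]


def suggest_smaller(instance_type: str, avg_cpu: float, max_cpu: float,
                    avg_limit: float = 20.0, max_limit: float = 40.0):
    """Return the next size down when both average and peak CPU (in percent)
    are under the assumed limits, otherwise None."""
    family, _, size = instance_type.partition(".")
    if size not in SIZES or avg_cpu >= avg_limit or max_cpu >= max_limit:
        return None
    index = SIZES.index(size)
    if index == 0:
        return None  # already the smallest size in this ladder
    return f"{family}.{SIZES[index - 1]}"


print(suggest_smaller("m5.2xlarge", avg_cpu=8.0, max_cpu=25.0))
print(suggest_smaller("m5.2xlarge", avg_cpu=55.0, max_cpu=90.0))
```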


**Step 4: Generate Monthly Report**
```bash
# Use the template to compile findings
cp assets/templates/monthly_cost_report.md reports/$(date +%Y-%m)-cost-report.md

# Fill in:
# - Findings from scripts
# - Action items
# - Team cost breakdowns
# - Optimization wins
```


**Step 5: Team Review Meeting**
- Present findings to engineering teams
- Assign optimization tasks
- Track action items to completion

---

Workflow 2: Commitment Purchase Analysis (RI/Savings Plans)

**When**: Quarterly or when usage patterns stabilize

**Step 1: Analyze Current Usage**
```bash
# Identify workloads suitable for commitments
python3 scripts/analyze_ri_recommendations.py --days 60

# Looks for:
# - EC2 instances running consistently for 60+ days
# - RDS instances with stable usage
# - Calculates ROI for 1yr vs 3yr commitments
```


**Step 2: Review Recommendations**

Evaluate each recommendation:
✅ Good candidate if:
  • Running 24/7 for 60+ days
  • Workload is stable and predictable
  • No plans to change architecture
  • Savings > 30%
❌ Poor candidate if:
  • Workload is variable or experimental
  • Architecture changes planned
  • Instance type may change
  • Dev/test environment
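
The checklist above can be expressed as a small predicate plus a savings calculation. This is a sketch under stated assumptions: the function names, parameters, and hourly prices are hypothetical and are not taken from `analyze_ri_recommendations.py`.

```python
def savings_percent(on_demand_hourly: float, committed_hourly: float) -> float:
    """Percentage saved by the committed rate relative to On-Demand."""
    return round((on_demand_hourly - committed_hourly) / on_demand_hourly * 100, 1)


def is_good_candidate(days_running: int, stable: bool, change_planned: bool,
                      savings_pct: float, environment: str = "prod") -> bool:
    """Apply the criteria listed above: 60+ days of continuous use, a stable
    workload, no planned architecture change, savings above 30%, and not a
    dev/test environment."""
    return (days_running >= 60
            and stable
            and not change_planned
            and savings_pct > 30
            and environment not in {"dev", "test"})


# Hypothetical prices, for illustration only.
pct = savings_percent(on_demand_hourly=0.10, committed_hourly=0.06)
print(pct, is_good_candidate(90, stable=True, change_planned=False, savings_pct=pct))
```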

**Step 3: Choose Commitment Type**

**Reserved Instances**:
- Standard RI: Highest discount (63%), no flexibility
- Convertible RI: Moderate discount (54%), can change instance type
- Best for: Specific instance types, stable workloads

**Savings Plans**:
- Compute SP: Flexible across instance families, sizes, and regions (up to 66% savings)
- EC2 Instance SP: Flexible across sizes within one instance family in a region (up to 72% savings)
- Best for: Variable workloads within constraints

**Decision Matrix**:
- Known instance type, won't change → Standard RI
- May need to change types → Convertible RI or Compute SP
- Variable workloads → Compute Savings Plan
- Maximum flexibility → Compute Savings Plan

**Step 4: Purchase and Track**
- Purchase through AWS Console or CLI
- Tag commitments with purchase date and owner
- Monitor utilization monthly
- Aim for >90% utilization

**Reference**: See `references/best_practices.md` for detailed commitment strategies

---

Workflow 3: Instance Generation Migration

**When**: During architecture reviews or optimization sprints

**Step 1: Detect Old Instances**
```bash
# Find outdated instance generations
python3 scripts/detect_old_generations.py

# Identifies:
# - t2 → t3 migrations (10% savings)
# - m4 → m5 → m6i migrations
# - Intel → Graviton opportunities (20% savings)
```
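
A generation check of this kind can be approximated with a lookup on the instance family. The map below covers only the migrations named above and is an illustrative assumption, not the rule set of `detect_old_generations.py`.

```python
# Illustrative upgrade map limited to the x86 migrations mentioned above.
UPGRADES = {"t2": "t3", "m4": "m5", "m5": "m6i"}


def suggest_upgrade(instance_type: str):
    """Return the same size in the next generation, or None if the family is
    not in the map. Size availability differs between generations, so check
    that the suggested type actually exists before acting on it."""
    family, _, size = instance_type.partition(".")
    target = UPGRADES.get(family)
    return f"{target}.{size}" if target and size else None


for current in ["t2.medium", "m4.large", "c7g.large"]:
    print(current, "->", suggest_upgrade(current))
```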


**Step 2: Prioritize Migrations**

**Quick Wins (Low Risk)**:
- t2 → t3: Drop-in replacement, 10% savings
- m4 → m5: Better performance, 5% savings
- gp2 → gp3: No downtime, 20% savings

**Medium Effort (Test Required)**:
x86 → Graviton (ARM64): 20% savings
  • Requires ARM64 compatibility testing
  • Most modern frameworks support ARM64
  • Test in staging first

**Step 3: Execute Migration**

**For EC2 (x86 to x86)**:
1. Stop instance
2. Change instance type
3. Start instance
4. Verify application
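
The four steps above map onto standard boto3 EC2 client calls. The sketch below assumes an EBS-backed instance and takes the client as a parameter so it can be exercised with a stub; the function name is hypothetical, and verifying the application (step 4) is left to the caller.

```python
def change_instance_type(ec2, instance_id: str, new_type: str) -> None:
    """Stop, resize, and restart one instance.

    `ec2` is expected to be a boto3 EC2 client, e.g. boto3.client("ec2").
    """
    # 1. Stop the instance and wait until it is fully stopped.
    ec2.stop_instances(InstanceIds=[instance_id])
    ec2.get_waiter("instance_stopped").wait(InstanceIds=[instance_id])

    # 2. Change the instance type while the instance is stopped.
    ec2.modify_instance_attribute(InstanceId=instance_id,
                                  InstanceType={"Value": new_type})

    # 3. Start it again and wait until it reports as running.
    ec2.start_instances(InstanceIds=[instance_id])
    ec2.get_waiter("instance_running").wait(InstanceIds=[instance_id])
```

A call such as `change_instance_type(boto3.client("ec2"), "i-0123456789abcdef0", "t3.medium")` would incur downtime for the duration of the stop/start cycle, so it is best scheduled in a maintenance window; the instance ID shown is a placeholder.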

**For Graviton Migration**:
1. Create ARM64 AMI or Docker image
2. Launch new Graviton instance
3. Test thoroughly
4. Cut over traffic
5. Terminate old instance

**Step 4: Validate Savings**
- Monitor new costs in Cost Explorer
- Verify performance is acceptable
- Document migration for other teams

**Reference**: See `references/best_practices.md` → Compute Optimization

---

Workflow 4: Spot Instance Evaluation

**When**: For fault-tolerant workloads or Auto Scaling Groups

**Step 1: Identify Candidates**
```bash
# Analyze workloads for Spot suitability
python3 scripts/spot_recommendations.py

# Evaluates:
# - Instances in Auto Scaling Groups (good candidates)
# - Dev/test/staging environments
# - Batch processing workloads
# - CI/CD and build servers
```


**Step 2: Assess Suitability**

**Excellent for Spot**:
- Stateless applications
- Batch jobs
- CI/CD pipelines
- Data processing
- Auto Scaling Groups

**NOT suitable for Spot**:
- Databases (without replicas)
- Stateful applications
- Real-time services
- Mission-critical workloads

**Step 3: Implementation Strategy**

**Option 1: Fargate Spot (Easiest)**
```yaml
# ECS task definition
requiresCompatibilities:
  - FARGATE
capacityProviderStrategy:
  - capacityProvider: FARGATE_SPOT
    weight: 70  # 70% Spot
  - capacityProvider: FARGATE
    weight: 30  # 30% On-Demand
```

**Option 2: EC2 Auto Scaling with Spot**
```yaml
# Mixed instances policy
MixedInstancesPolicy:
  InstancesDistribution:
    OnDemandBaseCapacity: 2
    OnDemandPercentageAboveBaseCapacity: 30
    SpotAllocationStrategy: capacity-optimized
  LaunchTemplate:
    Overrides:
      - InstanceType: m5.large
      - InstanceType: m5a.large
      - InstanceType: m5n.large
```

**Option 3: EC2 Spot Fleet**
```bash
# Create Spot Fleet with diverse instance types
aws ec2 request-spot-fleet --spot-fleet-request-config file://spot-fleet.json
```

**Step 4: Implement Interruption Handling**
```bash
# Handle 2-minute termination notice
# Instance metadata: /latest/meta-data/spot/instance-action

# In application:
# 1. Poll for termination notice
# 2. Gracefully shut down (save state)
# 3. Drain connections
# 4. Exit
```

**Reference**: See `references/best_practices.md` → Compute Optimization → Spot Instances
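
The polling step could look roughly like the sketch below, which reads `spot/instance-action` from the instance metadata service using an IMDSv2 session token. The endpoint returns HTTP 404 until an interruption is scheduled. Function names, the polling interval, and the callback shape are assumptions for illustration; the network calls only work on an EC2 instance.

```python
import json
import time
import urllib.error
import urllib.request

IMDS = "http://169.254.169.254"


def parse_instance_action(body: str):
    """Parse the JSON body of spot/instance-action, which has the form
    {"action": "terminate", "time": "2024-01-01T12:00:00Z"}."""
    data = json.loads(body)
    return data["action"], data["time"]


def fetch_instance_action(timeout: float = 1.0):
    """Return (action, time) if an interruption is scheduled, else None."""
    token_request = urllib.request.Request(
        f"{IMDS}/latest/api/token", method="PUT",
        headers={"X-aws-ec2-metadata-token-ttl-seconds": "60"})
    token = urllib.request.urlopen(token_request, timeout=timeout).read().decode()
    request = urllib.request.Request(
        f"{IMDS}/latest/meta-data/spot/instance-action",
        headers={"X-aws-ec2-metadata-token": token})
    try:
        body = urllib.request.urlopen(request, timeout=timeout).read().decode()
    except urllib.error.HTTPError as err:
        if err.code == 404:  # no interruption notice yet
            return None
        raise
    return parse_instance_action(body)


def wait_for_interruption(on_notice, fetch=fetch_instance_action,
                          interval: float = 5.0, sleep=time.sleep):
    """Poll until a notice appears, call on_notice(action, time), then return it."""
    while True:
        notice = fetch()
        if notice is not None:
            on_notice(*notice)
            return notice
        sleep(interval)
```

In an application, `on_notice` would be the hook that saves state and drains connections before the process exits. `fetch` and `sleep` are injectable so the loop can be tested off-instance.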

---

Quick Reference: Cost Optimization Scripts

All Scripts Location

```bash
ls scripts/
# find_unused_resources.py
# analyze_ri_recommendations.py
# detect_old_generations.py
# spot_recommendations.py
# rightsizing_analyzer.py
# cost_anomaly_detector.py
```

Script Usage Patterns

**Monthly Review (Run all)**:
```bash
python3 scripts/find_unused_resources.py
python3 scripts/cost_anomaly_detector.py --days 30
python3 scripts/rightsizing_analyzer.py --days 30
```

**Quarterly Optimization**:
```bash
python3 scripts/analyze_ri_recommendations.py --days 60
python3 scripts/detect_old_generations.py
python3 scripts/spot_recommendations.py
```

**Specific Region Only**:
```bash
python3 scripts/find_unused_resources.py --region us-east-1
python3 scripts/rightsizing_analyzer.py --region us-west-2
```

**Named AWS Profile**:
```bash
python3 scripts/find_unused_resources.py --profile production
python3 scripts/cost_anomaly_detector.py --profile production --days 60
```
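
If the monthly run is automated, a thin wrapper can build the same command lines and run them in order. This is a sketch only: the wrapper is hypothetical, and it assumes every script accepts both `--profile` and `--region`, which the examples above show for some scripts but do not confirm for all.

```python
import subprocess
import sys

# Script names and arguments taken from the monthly review commands above.
MONTHLY = [
    ("find_unused_resources.py", []),
    ("cost_anomaly_detector.py", ["--days", "30"]),
    ("rightsizing_analyzer.py", ["--days", "30"]),
]


def build_commands(profile=None, region=None):
    """Return one argv list per monthly script, appending shared options."""
    common = []
    if profile:
        common += ["--profile", profile]
    if region:
        common += ["--region", region]
    return [[sys.executable, f"scripts/{name}", *args, *common]
            for name, args in MONTHLY]


def run_monthly(profile=None, region=None) -> None:
    """Run each script in turn, stopping at the first failure."""
    for command in build_commands(profile, region):
        subprocess.run(command, check=True)


for command in build_commands(profile="production"):
    print(" ".join(command[1:]))
```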

Script Requirements

```bash
# Install dependencies
pip install boto3 tabulate

# AWS credentials required
# Configure via: aws configure
# Or use: --profile PROFILE_NAME
```

---

Service-Specific Optimization

Compute Optimization

Key Actions:
  • Migrate to Graviton (20% savings)
  • Use Spot for fault-tolerant workloads (70% savings)
  • Purchase RIs for stable workloads (40-65% savings)
  • Right-size oversized instances

Reference: `references/best_practices.md` → Compute Optimization

Storage Optimization

Key Actions:
  • Convert gp2 → gp3 (20% savings)
  • Implement S3 lifecycle policies (50-95% savings)
  • Delete old snapshots
  • Use S3 Intelligent-Tiering

Reference: `references/best_practices.md` → Storage Optimization
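
For the S3 lifecycle action, the configuration passed to boto3's `put_bucket_lifecycle_configuration` has the shape sketched below. The rule ID, prefix, day counts, and storage classes are example values chosen for illustration, not recommendations from the referenced guide.

```python
def lifecycle_configuration(prefix: str = "logs/") -> dict:
    """Build a lifecycle configuration that tiers objects down and then
    expires them. All day counts are illustrative assumptions."""
    return {
        "Rules": [{
            "ID": "tier-then-expire",
            "Status": "Enabled",
            "Filter": {"Prefix": prefix},
            "Transitions": [
                {"Days": 30, "StorageClass": "STANDARD_IA"},
                {"Days": 90, "StorageClass": "GLACIER"},
            ],
            "Expiration": {"Days": 365},
        }]
    }


def apply_lifecycle(bucket: str, config: dict) -> None:
    """Apply the configuration to a bucket (needs credentials; not called here)."""
    import boto3  # imported lazily so the builder runs without boto3 installed

    boto3.client("s3").put_bucket_lifecycle_configuration(
        Bucket=bucket, LifecycleConfiguration=config)


rule = lifecycle_configuration()["Rules"][0]
print([t["StorageClass"] for t in rule["Transitions"]], rule["Expiration"])
```

Applying a lifecycle configuration replaces any existing one on the bucket, so it is worth reading the current rules first.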

Network Optimization

Key Actions:
  • Replace NAT Gateways with VPC Endpoints (save $25-30/month each)
  • Use CloudFront to reduce data transfer costs
  • Colocate resources in same AZ when possible

Reference: `references/best_practices.md` → Network Optimization

Database Optimization

Key Actions:
  • Right-size RDS instances
  • Use gp3 storage (20% cheaper than gp2)
  • Evaluate Aurora Serverless for variable workloads
  • Purchase RDS Reserved Instances

Reference: `references/best_practices.md` → Database Optimization

Service Alternatives Decision Guide

Need help choosing between services?

Question: "Should I use EC2, Lambda, or Fargate?"
Answer: See `references/service_alternatives.md` → Compute Alternatives

Question: "Which S3 storage class should I use?"
Answer: See `references/service_alternatives.md` → Storage Alternatives

Question: "Should I use RDS or Aurora?"
Answer: See `references/service_alternatives.md` → Database Alternatives

Question: "NAT Gateway vs VPC Endpoint vs NAT Instance?"
Answer: See `references/service_alternatives.md` → Networking Alternatives

FinOps Governance & Process

Setting Up FinOps

Phase 1: Foundation (Month 1)
  • Enable Cost Explorer
  • Set up AWS Budgets
  • Define tagging strategy
  • Activate cost allocation tags
Phase 2: Visibility (Months 2-3)
  • Implement tagging enforcement
  • Run optimization scripts
  • Set up monthly reviews
  • Create team cost reports
Phase 3: Culture (Ongoing)
  • Cost metrics in engineering KPIs
  • Cost review in architecture decisions
  • Regular optimization sprints
  • FinOps champions in each team
Full Guide: See `references/finops_governance.md`
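
As an example of the Phase 1 item "Set up AWS Budgets", the sketch below builds the request for boto3's `budgets.create_budget` with one email alert. The account ID, budget name, limit, email address, and 80% threshold are placeholder assumptions.

```python
def monthly_budget_request(account_id: str, name: str, limit_usd: int,
                           email: str, alert_percent: float = 80.0) -> dict:
    """Build keyword arguments for create_budget: a monthly cost budget that
    emails `email` once actual spend passes `alert_percent` of the limit."""
    return {
        "AccountId": account_id,
        "Budget": {
            "BudgetName": name,
            "BudgetLimit": {"Amount": str(limit_usd), "Unit": "USD"},
            "TimeUnit": "MONTHLY",
            "BudgetType": "COST",
        },
        "NotificationsWithSubscribers": [{
            "Notification": {
                "NotificationType": "ACTUAL",
                "ComparisonOperator": "GREATER_THAN",
                "Threshold": alert_percent,
                "ThresholdType": "PERCENTAGE",
            },
            "Subscribers": [{"SubscriptionType": "EMAIL", "Address": email}],
        }],
    }


def create_budget(request: dict) -> None:
    """Send the request to AWS Budgets (needs credentials; not called here)."""
    import boto3  # imported lazily so the builder runs without boto3 installed

    boto3.client("budgets").create_budget(**request)


# Placeholder values for illustration.
request = monthly_budget_request("123456789012", "team-platform-monthly",
                                 1000, "finops@example.com")
print(request["Budget"]["BudgetLimit"])
```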

Monthly Review Process

Week 1: Data Collection
  • Run all optimization scripts
  • Export Cost & Usage Reports
  • Compile findings
Week 2: Analysis
  • Identify trends
  • Find opportunities
  • Prioritize actions
Week 3: Team Reviews
  • Present to engineering teams
  • Discuss optimizations
  • Assign action items
Week 4: Executive Reporting
  • Create executive summary
  • Forecast next quarter
  • Report optimization wins
Template: See `assets/templates/monthly_cost_report.md`

Detailed Process: See `references/finops_governance.md` → Monthly Review Process

Cost Optimization Checklist

Quick Wins (Do First)

  • Delete unattached EBS volumes
  • Delete old EBS snapshots (>90 days)
  • Release unused Elastic IPs
  • Convert gp2 → gp3 volumes
  • Stop/terminate idle EC2 instances
  • Enable S3 Intelligent-Tiering
  • Set up AWS Budgets and alerts

Medium Effort (This Quarter)

  • Right-size oversized instances
  • Migrate to newer instance generations
  • Purchase Reserved Instances for stable workloads
  • Implement S3 lifecycle policies
  • Replace NAT Gateways with VPC Endpoints (where applicable)
  • Enable automated resource scheduling (dev/test)
  • Implement tagging strategy and enforcement

Strategic Initiatives (Ongoing)

  • Migrate to Graviton instances
  • Implement Spot for fault-tolerant workloads
  • Establish monthly cost review process
  • Set up cost allocation by team
  • Implement chargeback/showback model
  • Create FinOps culture and practices

Troubleshooting Cost Issues

"My bill suddenly increased"

  1. Run cost anomaly detection:
     ```bash
     python3 scripts/cost_anomaly_detector.py --days 30
     ```
  2. Check Cost Explorer for service breakdown
  3. Review CloudTrail for resource creation events
  4. Check for Auto Scaling events
  5. Verify that no Reserved Instances have expired
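
For the Cost Explorer step, a service breakdown can also be pulled programmatically. The sketch below totals a `get_cost_and_usage` response grouped by service; the helper names and sample amounts are assumptions, and pagination via `NextPageToken` is omitted for brevity.

```python
from collections import defaultdict


def cost_by_service(response: dict) -> list[tuple[str, float]]:
    """Sum UnblendedCost per service across all periods, largest first."""
    totals: dict[str, float] = defaultdict(float)
    for period in response.get("ResultsByTime", []):
        for group in period.get("Groups", []):
            service = group["Keys"][0]
            totals[service] += float(group["Metrics"]["UnblendedCost"]["Amount"])
    return sorted(totals.items(), key=lambda item: item[1], reverse=True)


def fetch_costs(start: str, end: str) -> dict:
    """Query Cost Explorer for daily cost by service between two YYYY-MM-DD
    dates (needs credentials; not called here)."""
    import boto3  # imported lazily so the summary helper runs without boto3

    return boto3.client("ce").get_cost_and_usage(
        TimePeriod={"Start": start, "End": end},
        Granularity="DAILY",
        Metrics=["UnblendedCost"],
        GroupBy=[{"Type": "DIMENSION", "Key": "SERVICE"}],
    )


# Hand-written sample in the shape of a real response.
sample_response = {"ResultsByTime": [
    {"Groups": [
        {"Keys": ["Amazon Elastic Compute Cloud - Compute"],
         "Metrics": {"UnblendedCost": {"Amount": "10.5", "Unit": "USD"}}},
        {"Keys": ["Amazon Simple Storage Service"],
         "Metrics": {"UnblendedCost": {"Amount": "4.0", "Unit": "USD"}}},
    ]},
    {"Groups": [
        {"Keys": ["Amazon Elastic Compute Cloud - Compute"],
         "Metrics": {"UnblendedCost": {"Amount": "2.5", "Unit": "USD"}}},
    ]},
]}
print(cost_by_service(sample_response))
```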

"I need to reduce costs by X%"

Follow the optimization workflow:
  1. Run all discovery scripts
  2. Calculate total potential savings
  3. Prioritize by: Savings Amount × (1 / Effort)
  4. Focus on quick wins first
  5. Implement strategic changes for long-term
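
The prioritization rule in step 3, Savings Amount × (1 / Effort), reduces to sorting by savings divided by effort. In the sketch below the 1-5 effort scale, the field names, and the sample figures are assumptions.

```python
def prioritize(opportunities: list[dict]) -> list[dict]:
    """Order opportunities by monthly savings per unit of effort, best first.
    `effort` is an assumed 1 (trivial) to 5 (major project) score."""
    return sorted(opportunities,
                  key=lambda item: item["monthly_savings"] / item["effort"],
                  reverse=True)


# Made-up figures for illustration.
backlog = [
    {"name": "Delete unattached volumes", "monthly_savings": 250, "effort": 1},
    {"name": "Migrate to Graviton", "monthly_savings": 1000, "effort": 5},
    {"name": "Rightsize instances", "monthly_savings": 600, "effort": 2},
]
for item in prioritize(backlog):
    print(item["name"], item["monthly_savings"] / item["effort"])
```

With these numbers the large Graviton migration ranks last despite having the highest absolute savings, which is the behavior the formula is meant to produce.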

"How do I know if Reserved Instances make sense?"

Run RI analysis:
```bash
python3 scripts/analyze_ri_recommendations.py --days 60
```
Look for:
  • Instances running 60+ days consistently
  • Workloads that won't change
  • Savings > 30%

"Which resources can I safely delete?"

Run unused resource finder:
```bash
python3 scripts/find_unused_resources.py
```
Safe to delete (usually):
  • Unattached EBS volumes (after verifying)
  • Snapshots > 90 days (if backups exist elsewhere)
  • Unused Elastic IPs (after verifying not in DNS)
  • Stopped EC2 instances > 30 days (after confirming abandoned)

Always verify with resource owner before deletion!
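
The 90-day snapshot rule can be sketched as a date filter over `describe_snapshots`-style records, where `StartTime` is a timezone-aware datetime. The helper only lists candidates and deletes nothing, in keeping with the advice to verify with the owner first; its name and the sample data are assumptions.

```python
from datetime import datetime, timedelta, timezone


def old_snapshots(snapshots: list[dict], max_age_days: int = 90, now=None) -> list[str]:
    """Return IDs of snapshots started more than `max_age_days` ago.

    `snapshots` mirrors the "Snapshots" list from boto3's
    ec2.describe_snapshots(OwnerIds=["self"]).
    """
    now = now or datetime.now(timezone.utc)
    cutoff = now - timedelta(days=max_age_days)
    return [s["SnapshotId"] for s in snapshots if s["StartTime"] < cutoff]


# Fixed reference time and made-up snapshots so the result is reproducible.
reference = datetime(2024, 6, 1, tzinfo=timezone.utc)
records = [
    {"SnapshotId": "snap-old", "StartTime": datetime(2024, 1, 15, tzinfo=timezone.utc)},
    {"SnapshotId": "snap-new", "StartTime": datetime(2024, 5, 20, tzinfo=timezone.utc)},
]
print(old_snapshots(records, now=reference))
```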

Best Practices Summary

  1. Tag Everything: Consistent tagging enables cost allocation and accountability
  2. Monitor Continuously: Weekly script runs catch waste early
  3. Review Monthly: Regular reviews prevent cost drift
  4. Right-size Proactively: Don't wait for cost issues to optimize
  5. Use Commitments Wisely: RIs/SPs for stable workloads only
  6. Test Before Migrating: Especially for Graviton or Spot
  7. Automate Cleanup: Scheduled shutdown of dev/test resources
  8. Share Wins: Celebrate cost savings to build FinOps culture

Additional Resources

Detailed References:
  • `references/best_practices.md`: Comprehensive optimization strategies
  • `references/service_alternatives.md`: Cost-effective service selection
  • `references/finops_governance.md`: Organizational FinOps practices

Templates:
  • `assets/templates/monthly_cost_report.md`: Monthly reporting template

Scripts:
  • All scripts in `scripts/` directory with `--help` for usage

AWS Documentation: