Monitor

Monitoring Checklist

Basic Monitoring:
- [ ] Uptime monitoring (is site up?)
- [ ] Error tracking (are errors happening?)
- [ ] Performance monitoring (is it slow?)
- [ ] User activity (are people using it?)
- [ ] Critical alerts configured
- [ ] Check dashboard daily
See MONITORING-SETUP.md for implementation.

Why Monitor?

Without monitoring:
  • Users hit errors, you don't know
  • Site goes down, you find out from Twitter
  • Slow performance, users leave silently
  • Security issues, no alert
With monitoring:
  • Errors show in dashboard immediately
  • Get a text when the site goes down
  • See performance degradation
  • Catch issues before users complain
Goal: Know about problems before users tell you.

Three Essential Monitors

1. Is It Up?

Uptime monitoring - Pings your app at a fixed interval (typically every 1 to 5 minutes, depending on the tool and plan)
Free tools:
  • UptimeRobot (free, 50 monitors)
  • Pingdom (limited free tier)
  • Vercel/Netlify (built-in for deployed apps)
Setup:
1. Sign up for UptimeRobot
2. Add monitor for https://yourapp.com
3. Add your email (or phone number, if your plan supports SMS) as an alert contact
4. Get notified when the site goes down
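
To make the idea concrete, here is a minimal sketch of what an uptime check does on each ping. It is not how UptimeRobot or any other tool is implemented; the function names `fetch_status` and `is_up` are hypothetical, and treating any 2xx or 3xx status as "up" is an assumption.

```python
import urllib.error
import urllib.request
from typing import Callable


def fetch_status(url: str, timeout: float = 10.0) -> int:
    """Return the HTTP status code for url, or 0 if no response arrives."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.status
    except urllib.error.HTTPError as err:
        # The server answered, but with an error status (4xx or 5xx).
        return err.code
    except (urllib.error.URLError, TimeoutError, OSError):
        # DNS failure, refused connection, timeout, and similar.
        return 0


def is_up(url: str, fetch: Callable[[str], int] = fetch_status) -> bool:
    """Assumption: any 2xx or 3xx status counts as 'up'."""
    status = fetch(url)
    return 200 <= status < 400
```

A hosted service is still the better choice for real use, because it checks from outside your own infrastructure and keeps running when your servers do not. The `fetch` parameter exists only so the check can be exercised without a network call.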

2. Are There Errors?

Error tracking - Captures JavaScript errors and API failures
Free tools:
  • Sentry (free tier: 5k errors/month)
  • LogRocket (limited free)
  • Vercel/Netlify logs (for deployed apps)
Setup:
Tell AI:
"Add Sentry error tracking:
- Capture all frontend errors
- Capture all API errors
- Include user context
- Send to Sentry dashboard"

3. Is It Slow?

Performance monitoring - Tracks page load times
Free tools:
  • Vercel Analytics (built-in)
  • Google PageSpeed Insights (free)
  • Cloudflare Analytics (free tier)
Setup:
  • Usually automatic with hosting platform
  • Check dashboard weekly

What to Monitor

Critical Metrics

Must monitor:
  • Site uptime (99%+)
  • Error rate (< 1% of requests)
  • API response time (< 500ms)
  • Page load time (< 3s)
Nice to have:
  • Active users
  • Feature usage
  • Conversion rates
  • User paths
For MVP: Focus on the "must monitor" only.
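
The four "must monitor" targets can be expressed as a simple threshold check. This is a sketch only: the `Metrics` class and `failing_metrics` function are hypothetical names, and how you obtain the numbers (from your hosting dashboard, an API, or logs) is left open.

```python
from dataclasses import dataclass


@dataclass
class Metrics:
    uptime_pct: float       # uptime as a percentage, e.g. 99.95
    error_rate_pct: float   # failed requests as a percentage of all requests
    api_response_ms: float  # typical API response time in milliseconds
    page_load_s: float      # typical page load time in seconds


def failing_metrics(m: Metrics) -> list[str]:
    """Return the names of the metrics that miss the targets listed above."""
    failures = []
    if m.uptime_pct < 99.0:
        failures.append("uptime")
    if m.error_rate_pct >= 1.0:
        failures.append("error rate")
    if m.api_response_ms >= 500:
        failures.append("API response time")
    if m.page_load_s >= 3.0:
        failures.append("page load time")
    return failures
```

For example, `failing_metrics(Metrics(98.5, 2.0, 320, 4.1))` flags uptime, error rate, and page load time, while API response time passes.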

Setting Up Alerts

Configure alerts for:
Critical (text me immediately):
  • Site is down
  • Error rate spike (10x normal)
  • Database connection lost
  • Payment processing failing
Important (email within an hour):
  • API slow (>2 seconds)
  • Error rate elevated (2x normal)
  • Disk space low (>80% used)
Informational (daily digest):
  • New errors discovered
  • Performance trending down
  • Traffic patterns
Tell AI:
Configure monitoring alerts:
- Critical: Text to [phone]
- Important: Email to [email]
- Send summary: Daily at 9am
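
The numeric rules in the tiers above could be routed roughly like this. It is a sketch under assumptions: `classify_alert` is a hypothetical function, "normal" error rate is whatever baseline you measure yourself, and the non-numeric critical cases (database connection lost, payments failing) are left out for brevity.

```python
from typing import Optional


def classify_alert(
    site_up: bool,
    error_rate: float,
    normal_error_rate: float,
    api_seconds: float,
    disk_used_pct: float,
) -> Optional[str]:
    """Return 'critical', 'important', or None when nothing needs an alert."""
    if not site_up:
        return "critical"
    if normal_error_rate > 0:
        if error_rate >= 10 * normal_error_rate:
            return "critical"   # error rate spike (10x normal)
        if error_rate >= 2 * normal_error_rate:
            return "important"  # error rate elevated (2x normal)
    if api_seconds > 2.0 or disk_used_pct > 80.0:
        return "important"
    return None
```

A "critical" result would go to the text channel and an "important" result to email; anything else can wait for the daily digest.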

Daily Monitoring Routine

5-minute morning check:
Daily Check:
1. Open monitoring dashboard
2. Check uptime (should have been 100% for yesterday)
3. Check error count (any spikes?)
4. Check performance (slower than usual?)
5. Review any alerts from overnight
If all green: You're done, 5 minutes.
If red: Investigate using debug skill.

Reading Monitoring Dashboards

Uptime Dashboard

Green: Site responding
Red: Site down or slow to respond
What to check:
  • Uptime percentage (target: 99%+)
  • Response time (target: <500ms)
  • Recent downtime incidents

Error Dashboard

Look for:
  • Error count spikes (sudden jump)
  • New error types (didn't see before)
  • Affected users (how many hit this?)
  • Error frequency (happening a lot?)
Priority:
  • Affecting many users → High priority
  • Blocking key features → High priority
  • Edge case error → Lower priority

Performance Dashboard

Look for:
  • Load time trending up (getting slower)
  • Slow endpoints (which API calls)
  • Slow pages (which routes)
  • Geographic differences (slow in specific regions)

Error Investigation

When errors spike:
1. Open error tracking dashboard (Sentry)
2. Find the most frequent error
3. Read error message and stack trace
4. Note: How many users affected?
5. Note: Started when?
6. Check: Did we deploy recently?
Give to AI:
Error in production:
[Paste error message and stack trace]

Affected: [X] users in last [Y] hours
Started: [timestamp]
Recent deploys: [any?]

Please:
1. Explain what's wrong
2. Propose hotfix
3. Explain how to test before deploying

User-Reported Issues

When a user reports a problem:
User Report Investigation:
1. Can you reproduce it?
2. Check monitoring for errors at that time
3. Check logs for that user
4. Check if others affected
5. Determine severity

Then use debug skill to fix.
Tell AI:
User reported: [issue description]
User: [email or ID]
Timestamp: [when it happened]

Check monitoring and logs for this user at this time.
What errors or issues do you see?

Proactive Monitoring

Catch issues before users do:
Weekly checks:
Weekly Review:
- [ ] Error trends (going up or down?)
- [ ] Performance trends (slower?)
- [ ] New error types introduced
- [ ] Uptime issues resolved
- [ ] Alert noise (too many false alerts?)
Monthly checks:
Monthly Health:
- [ ] Compare to last month
- [ ] Any degradation?
- [ ] Any improvements?
- [ ] Monitoring gaps (what's not tracked?)

Free Monitoring Stack

Recommended for MVP:
Uptime:
  • UptimeRobot (free) - 50 monitors
Errors:
  • Sentry (free) - 5k errors/month
Performance:
  • Vercel Analytics (free on Vercel)
  • Cloudflare Analytics (free)
Logs:
  • Platform logs (Vercel, Netlify, Railway)
Cost: $0/month until you need more.

When to Upgrade Monitoring

Upgrade when:
  • Hitting free tier limits
  • Need more detailed analytics
  • Need faster alert response
  • Need advanced features (session replay, etc.)
Paid tiers (roughly $15-100/mo; check current pricing):
  • Sentry Pro ($26/mo)
  • LogRocket ($99/mo - session replay)
  • Datadog ($15/host/mo)
For < 1000 users: Free tiers sufficient.

Common Monitoring Mistakes

| Mistake | Fix |
| --- | --- |
| No monitoring set up | Set up before launch |
| Alert fatigue (too many alerts) | Only alert on critical issues |
| Checking once a month | Check daily (5 minutes) |
| Ignoring trends | Watch for degradation over time |
| No alerts configured | Set up text alerts for downtime |
| Monitoring but not acting | Use monitoring to find and fix issues |

Interpreting Trends

Good trends:
  • Errors decreasing
  • Performance improving
  • Uptime stable at 99.9%+
Warning trends:
  • Errors slowly increasing
  • Performance slowly degrading
  • Uptime dipping below 99%
Critical trends:
  • Sudden error spike
  • Sudden performance drop
  • Multiple downtime incidents
Action: Address warning trends before they become critical.
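
One way to tell a slow increase from a sudden spike is to compare this period's error count with the previous one. The sketch below does that; `classify_error_trend` is a hypothetical function, and the 10% and 100% cut-offs are illustrative assumptions, not values from this guide.

```python
def classify_error_trend(
    previous: float,
    current: float,
    warning_pct: float = 10.0,    # assumed cut-off for "slowly increasing"
    critical_pct: float = 100.0,  # assumed cut-off for "sudden spike"
) -> str:
    """Return 'good', 'warning', or 'critical' for a change in error count."""
    if previous <= 0:
        # No baseline to compare against: any new errors deserve a look.
        return "good" if current <= 0 else "warning"
    change_pct = (current - previous) / previous * 100
    if change_pct >= critical_pct:
        return "critical"
    if change_pct >= warning_pct:
        return "warning"
    return "good"
```

Here "good" covers both stable and decreasing counts. Tune the cut-offs to your own traffic; a low-volume app will see larger percentage swings from noise alone.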

Logging vs Monitoring

Logging:
  • Records what happened
  • For debugging specific issues
  • Detailed, verbose
  • Review when investigating
Monitoring:
  • Tracks overall health
  • For catching issues early
  • High-level metrics
  • Review daily
Both needed: Monitoring alerts you, logs help debug.

Setting Up Logging

Tell AI:
Add application logging:
- Log all errors with context
- Log API requests/responses
- Log slow operations (>1s)
- Log authentication events
- Don't log sensitive data

Format: JSON with timestamp, level, message, context
Send to: [Platform logs or external service]
Log levels:
  • ERROR: Something broke
  • WARN: Something concerning
  • INFO: Normal operations
  • DEBUG: Detailed debugging info
Production: Log ERROR and WARN only.
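
As an illustration of the format described above, here is a minimal sketch using Python's standard `logging` module. `JsonFormatter` and `build_logger` are hypothetical names, and passing context through `extra={"context": ...}` is one convention among several. Note that Python names the WARN level `WARNING`.

```python
import json
import logging
from datetime import datetime, timezone


class JsonFormatter(logging.Formatter):
    """Render each record as JSON with timestamp, level, message, context."""

    def format(self, record: logging.LogRecord) -> str:
        entry = {
            "timestamp": datetime.fromtimestamp(
                record.created, tz=timezone.utc
            ).isoformat(),
            "level": record.levelname,
            "message": record.getMessage(),
            # 'context' is only present when the caller passes it via extra=.
            "context": getattr(record, "context", {}),
        }
        return json.dumps(entry)


def build_logger(name: str, production: bool = True) -> logging.Logger:
    logger = logging.getLogger(name)
    # Production keeps ERROR and WARNING only; development keeps everything.
    logger.setLevel(logging.WARNING if production else logging.DEBUG)
    if not logger.handlers:
        handler = logging.StreamHandler()
        handler.setFormatter(JsonFormatter())
        logger.addHandler(handler)
    logger.propagate = False
    return logger


logger = build_logger("app")
logger.error("Payment failed", extra={"context": {"order_id": "o_456"}})
logger.info("Page viewed")  # dropped in production: below WARNING
```

Keep identifiers such as order or user IDs in the context, and leave out passwords, tokens, and card numbers.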

Monitoring Integrations

Third-party services:
Payments (Stripe):
  • Failed payments alert
  • Refund requests alert
  • Subscription cancellations (daily digest)
Email (SendGrid):
  • Delivery failures alert
  • Bounce rate elevated alert
  • Spam complaints alert
Database:
  • Connection pool exhausted
  • Slow queries (>1s)
  • Disk space low
Tell AI:
Add monitoring for [service]:
- Alert on failures
- Track success rate
- Log errors with context
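
"Track success rate" and "alert on failures" can be sketched with a small counter per service. `ServiceStats` is a hypothetical class, and the 95% minimum success rate is an assumed threshold; neither comes from Stripe, SendGrid, or any database client.

```python
from dataclasses import dataclass


@dataclass
class ServiceStats:
    name: str
    successes: int = 0
    failures: int = 0

    def record(self, ok: bool) -> None:
        """Count one call to the service as a success or a failure."""
        if ok:
            self.successes += 1
        else:
            self.failures += 1

    @property
    def success_rate(self) -> float:
        total = self.successes + self.failures
        # With no calls recorded yet, report a perfect rate rather than alert.
        return 1.0 if total == 0 else self.successes / total

    def should_alert(self, min_success_rate: float = 0.95) -> bool:
        return self.success_rate < min_success_rate


payments = ServiceStats("stripe")
for ok in [True] * 9 + [False]:
    payments.record(ok)
```

After nine successes and one failure the success rate is 0.9, which falls below the assumed 95% threshold, so `payments.should_alert()` returns `True`. In practice the counts would be reset per time window so that old failures do not keep the alert firing.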

Incident Response

When alerts fire:
Incident Response:
1. Acknowledge alert (mark as seen)
2. Assess severity:
   - Critical: Site down, payments failing
   - High: Errors affecting many users
   - Medium: Isolated issues
3. Immediate action:
   - Critical: Hotfix or rollback
   - High: Fix within hours
   - Medium: Fix in next deploy
4. Update users if needed
5. Run a post-mortem once resolved
Critical incidents:
1. Assess impact (how many affected?)
2. Quick fix or rollback
3. Deploy hotfix
4. Verify fixed
5. Monitor closely for an hour
6. Update status page if you have one

Success Looks Like

✅ Know about issues before users report them
✅ Uptime >99.9%
✅ Errors caught and fixed quickly
✅ Performance trends stable or improving
✅ Daily monitoring routine (5 minutes)
✅ Alerts configured and actionable
✅ Issues resolved proactively