capacity-planner


Sizing tool for ops teams that handle queued work — Support, CX, Customer Success, BizOps, IT ops, Finance ops. Built on Erlang-C queueing theory, Little's Law, and the operational-leadership canon (Fournier, Larson, Cleveland, Reinertsen). Deterministic, stdlib-only, no LLM calls.

Purpose

You are an ops leader scaling a team from 15 to 35 people with no idea how the 35-person org will actually behave at peak load. Or you are at 88% utilization and SLA is starting to slip. Or you have a hiring budget approved and need to sequence it across four quarters without burning out the existing team. This skill answers those questions with arithmetic, not vibes.
It produces three artifacts:
  1. Capacity sizing at 70/80/90% utilization against P50/P90/P99 demand, with P(SLA breach) at each point and a SAFE/WATCH/AT_RISK/CRITICAL risk band.
  2. Utilization health at the per-member traffic-light level plus a team verdict (HEALTHY/SQUEEZED/OVERLOADED/UNBALANCED).
  3. 12-month quarterly hiring plan accounting for ramp curves, attrition, QoQ demand growth, and span-of-control manager triggers.

When to use

  • Annual ops capacity planning (October-November for the following fiscal year).
  • Quarterly re-sizing if demand changed >15% or attrition spiked.
  • Pre-budget defense — the math that justifies the headcount ask to your CFO.
  • Diagnostic when an ops team is missing SLA and you need to know whether it's a sizing problem, a process problem, or a bottleneck problem.
  • M&A / new-segment launch modeling — sizing a new team or combined org.

Workflow

  1. Intake demand. Pull P50/P90/P99 daily ticket/case volume from your work system (Zendesk, Intercom, JSM, ServiceNow, Salesforce). If you only have averages, stop and pull the distribution. Single-point demand estimates are the most expensive anti-pattern in ops.
  2. Model throughput. Run capacity_modeler.py with your demand, AHT, SLA target, current FTE, and shrinkage. Use --profile for your function (support / cx / bizops / finance-ops / it-ops). Read the 80%-utilization row; that is your sizing point.
  3. Flag utilization risk. Run utilization_analyzer.py against your current team's actual utilization data. Anyone >85% sustained is a throughput-collapse risk per Reinertsen. A spread of >30 percentage points across the team means UNBALANCED; fix that before hiring.
  4. Sequence hiring. Run hiring_sequencer.py with current FTE, target EOY, ramp time, attrition, and growth. It will front-load hires (Q1 35%, Q4 15%), apply ramp curves, and trigger a manager hire when span of control crosses 7 ICs/manager.
  5. Walk the Forcing-question library (see below). One question at a time. Do not skip ahead. Answers must be written down before you commit the plan.
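
The two thresholds in step 3 (85% sustained utilization, 30-point spread) can be sketched in a few lines of Python. This is an illustrative sketch only, not the logic of utilization_analyzer.py: the function name, the input shape (a dict of member name to utilization fraction), and the sample team are all assumptions.

```python
OVERLOAD_THRESHOLD = 0.85  # from step 3: >85% sustained is a collapse risk
SPREAD_THRESHOLD = 0.30    # from step 3: >30 points across the team is UNBALANCED


def flag_utilization(utilization_by_member):
    """Return (overloaded member names, spread, unbalanced flag).

    utilization_by_member is a hypothetical input shape:
    {"name": utilization as a fraction between 0 and 1}.
    """
    overloaded = sorted(
        name for name, u in utilization_by_member.items() if u > OVERLOAD_THRESHOLD
    )
    values = list(utilization_by_member.values())
    spread = max(values) - min(values)
    return overloaded, spread, spread > SPREAD_THRESHOLD


# Made-up sample data for illustration.
team = {"ana": 0.91, "ben": 0.78, "chen": 0.58, "dee": 0.83}
overloaded, spread, unbalanced = flag_utilization(team)
print("overloaded:", overloaded)
print("spread:", round(spread, 2), "unbalanced:", unbalanced)
```

In this made-up team one member is above the overload line and the spread exceeds 30 points, so the imbalance would need fixing before any hiring.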

Scripts

  • scripts/capacity_modeler.py: Erlang-C sizing with shrinkage adjustment and P50/P90/P99 breach probabilities. --profile for industry defaults.
  • scripts/utilization_analyzer.py: per-member traffic-light + team-level health verdict with variance detection.
  • scripts/hiring_sequencer.py: 12-month quarterly plan with ramp, attrition, growth, max-hires-per-quarter constraint, and manager-trigger logic.
All three accept --input <path> (JSON), --output {markdown,json}, --sample (built-in example), and --help. Stdlib only.

References

  • references/queueing_theory_canon.md: Erlang, Little, Hopp & Spearman, Reinertsen, Kingman, Cleveland, ITIL, Armony et al. (8 sources). The math.
  • references/ops_workforce_planning_canon.md: Fournier, Larson, Google SRE Workbook, Frei, Lawler, Bersin, Gartner, Grove (8 sources). The people factors.
  • references/capacity_anti_patterns.md: 11 named anti-patterns with cited sources, tool guards, and the meta-discipline that Lencioni + Goldratt + Christensen impose. (8+ named sources.)

Assets

  • assets/capacity_brief_template.md: 20-minute fill-out template with JSON skeletons for all three tools and an output checklist.

Assumptions

This skill assumes:
  • Work is queued (tickets, cases, work items) — not project-style. If your team's work isn't queued, this is the wrong skill.
  • Demand has a stationary-enough distribution within a quarter. Step-changes (new product launch, M&A, regulatory shift) require re-running mid-quarter.
  • You have at least 90 days of historical demand data to compute P50/P90/P99. If not, generate the distribution from your sales / user-base forecast first.
  • Service is single-class within a queue. If you have hard priority tiers (P1/P2/P3 with class-specific SLAs), model each as a separate queue and sum.
  • Channels are modeled coherently. Multi-channel teams use the appropriate --profile with built-in shrinkage premium.

Anti-patterns

See references/capacity_anti_patterns.md for the full taxonomy with sources. Top eight:
  1. Plan-to-100%-utilization (Reinertsen Principle 12)
  2. Treat-ramp-as-instant (Larson)
  3. Ignore-attrition-in-12-month-plan (Bersin)
  4. Hire-ICs-forever-with-no-manager-trigger (Fournier)
  5. Size-to-P50-demand-only (Cleveland)
  6. No-shrinkage-adjustment (Cleveland, SRE Workbook)
  7. Single-channel-model-for-multi-channel-work (Gartner, Kingman)
  8. No-surge-plan-for-P99-events (Hopp & Spearman, Reinertsen)

Distinct from

  • c-level-advisor/vpe-advisor measures engineering throughput via DORA 4 metrics, story points, deployment frequency, and cycle time bottlenecks. It is for engineering teams shipping code. This skill is for ops teams handling tickets/cases. Different unit of work, different math (Erlang-C vs. DORA), different bottleneck (queueing-blind staffing vs. WIP + lead time).
  • c-level-advisor/chro-advisor does strategic workforce planning (1-5 year capability portfolios, talent supply, leadership succession). This skill does operational 0-12 month capacity sizing against demand. Per Lawler: conflating them gets you hired into the wrong jobs.
  • project-management/* tracks delivery throughput on projects (Jira velocity, sprint capacity). This skill sizes around steady-state queued work.
  • Sibling process-mapper finds the bottleneck. This skill sizes the team around a known bottleneck. Order of operations: process-mapper first → capacity-planner second. Hiring around the wrong constraint wastes the hires.
  • business-growth/cs-coverage (if it exists) sizes Customer Success coverage by ARR/CSM ratio and segment. This skill sizes by queued work volume (tickets, cases, escalations). For a CS team that handles both relationship work AND a ticket queue, run both.

Forcing-question library (Matt Pocock grill discipline)

Discipline: walk these one at a time. Do not skip ahead. Answers must be written down. If you can't answer one, that is your next investigation.

Q1 — "What is your bottleneck, and have you confirmed it empirically?"

Recommended answer: a named, measured stage in the workflow with queue-time data showing where work waits. Not a vibe. Not "escalations take too long". An actual measured queue.
Why it's the first question: per Goldratt (The Goal, 1984), every system has exactly one binding constraint at a time. Sizing around the wrong constraint wastes hires entirely. If you do not know your bottleneck, run process-mapper BEFORE this skill.
Canon: Eli Goldratt, The Goal (1984); Reinertsen, Principles of Product Development Flow (2009).

Q2 — "What service trade-off are you accepting?"

Recommended answer: a written, explicit choice — fast vs. empathetic, broad vs. deep, low-cost vs. high-quality. Frances Frei is unambiguous: you cannot win all four. The team that tries wins zero.
Why it matters: AHT, SLA, and shrinkage inputs are the operational expression of this trade-off. If they don't agree (e.g., you set AHT for "empathy" but SLA for "speed"), the plan is internally inconsistent.
Canon: Frances Frei & Anne Morriss, Uncommon Service (HBR Press, 2012).

Q3 — "What's your demand P90, and what's the gap to your P99?"

Recommended answer: two specific numbers from the last 90 days of data, with the calendar context of each (e.g., "P90 was 480 tickets/day on normal Tuesdays; P99 was 720 on the day after the November release"). A team sized to P50 misses SLA half the time. A team sized to P99 overstaffs by 30-50%. P90 is the right operating sizing point per Cleveland.
Canon: Brad Cleveland, Call Center Management on Fast Forward (4th ed., 2019); A.K. Erlang, The Theory of Probabilities and Telephone Conversations (1909).
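
If the work system only exports raw daily counts, the two numbers Q3 asks for can be derived with the standard library. A minimal sketch, assuming a plain list of daily ticket counts for the last 90 days; the function name and the synthetic data are illustrative, and the percentile interpolation method is a choice, not something the skill prescribes.

```python
from statistics import quantiles


def demand_percentiles(daily_volumes):
    """Return P50, P90, and P99 for a list of daily ticket counts."""
    # n=100 yields 99 cut points; index k-1 holds the k-th percentile.
    cuts = quantiles(daily_volumes, n=100, method="inclusive")
    return {"p50": cuts[49], "p90": cuts[89], "p99": cuts[98]}


# Synthetic 90-day history for illustration: 300, 301, ..., 389 tickets/day.
history = list(range(300, 390))
p = demand_percentiles(history)
print(p)
print("P99 / P90 gap:", round(p["p99"] / p["p90"], 3))
```

The numbers still need their calendar context (which days produced the P90 and P99 values); that part comes from the data source, not from the arithmetic.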

Q4 — "At your planned utilization, what is P(SLA breach) at P90 and at P99?"

Recommended answer: two probabilities, computed (not guessed) from Erlang-C with your specific N, AHT, and SLA target. If P(breach at P90) > 10% you are understaffed at the sizing point. If P(breach at P99) > 50% you have no surge plan and the next peak event will be visible to the CEO.
Canon: Erlang (1909); Hopp & Spearman, Factory Physics (3rd ed., 2008), VUT equation.
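
For readers who want to see the arithmetic behind Q4, here is a minimal sketch of the textbook Erlang-C calculation. It is not the code in capacity_modeler.py. It assumes Poisson arrivals and exponential handle times, and it treats "breach" as the probability that a single item waits longer than the SLA target, which may differ from the tool's definition. The function names and example inputs are illustrative, and shrinkage is not applied.

```python
import math


def erlang_c_wait_probability(agents, offered_load):
    """Probability that an arriving item has to queue (Erlang-C)."""
    if offered_load >= agents:
        return 1.0  # unstable queue: every arrival waits
    # Build the terms A^k / k! iteratively to avoid huge factorials.
    term = 1.0
    partial_sum = 1.0  # k = 0 term
    for k in range(1, agents):
        term *= offered_load / k
        partial_sum += term
    term *= offered_load / agents  # now A^N / N!
    tail = term * agents / (agents - offered_load)
    return tail / (partial_sum + tail)


def breach_probability(arrivals_per_hour, aht_minutes, agents, sla_minutes):
    """P(wait > SLA target) for one item under Erlang-C assumptions."""
    offered_load = arrivals_per_hour * aht_minutes / 60.0  # in Erlangs
    if offered_load >= agents:
        return 1.0
    p_wait = erlang_c_wait_probability(agents, offered_load)
    return p_wait * math.exp(-(agents - offered_load) * sla_minutes / aht_minutes)


# Made-up inputs: 12 agents on queue, 10-minute AHT, 5-minute wait target.
for label, arrivals in (("P90", 60), ("P99", 90)):
    print(label, round(breach_probability(arrivals, 10, 12, 5), 3))
```

With these made-up inputs the P99 load (15 Erlangs) exceeds the 12 agents available, so the sketch reports a breach probability of 1.0 on that day. That is the situation a surge plan is meant to cover.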

Q5 — "Have you budgeted replacement hires for the attrition you'll see this year?"

Recommended answer: yes, with a specific number. At 30% annual attrition (Bersin BPO midpoint), a 20-FTE team loses ~6 people this year. If your "add 5 net" plan is actually a "hire 11" plan, the recruiting volume changes drastically. Anti-pattern #3.
Canon: Bersin/Deloitte talent benchmarks (2015-2023); Edward Lawler, Strategic Workforce Planning (USC CEO, 2008).
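
The "add 5 net" versus "hire 11" arithmetic above is simple enough to check in code. A rough sketch: the function name is hypothetical, and it deliberately ignores attrition among the new hires themselves and the timing of departures within the year.

```python
def gross_hires_needed(current_fte, net_adds, annual_attrition):
    """Hires required to finish the year at current_fte + net_adds.

    Simplification: only the starting team attrits; new hires are
    assumed to stay through year end.
    """
    expected_departures = round(current_fte * annual_attrition)
    return net_adds + expected_departures


# The example from the text: 20 FTE, 30% attrition, 5 net adds.
print(gross_hires_needed(20, 5, 0.30))
```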

Q6 — "When does span of control trigger a manager hire, and who is the candidate?"

Recommended answer: a specific quarter (from hiring_sequencer.py) and at least one identified candidate (internal lead or external hire). Past 7 ICs/manager, 1:1s degrade, feedback cycles slip, attrition climbs. Past 10 you have a coverage crisis. Hire the manager BEFORE crossing 10, not after.
Canon: Camille Fournier, The Manager's Path (O'Reilly, 2017), ch. 5; Andy Grove, High Output Management (1983).
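
To make the trigger concrete, the quarter in which span of control first exceeds 7 ICs per manager can be found with a short loop. This is an assumed illustration, not the logic inside hiring_sequencer.py: the function name, the (quarter, IC count) input shape, and the sample plan are hypothetical.

```python
def first_manager_trigger(ics_by_quarter, managers, max_span=7):
    """Return the first quarter where ICs per manager exceeds max_span.

    ics_by_quarter is a list of (quarter_label, ic_headcount) pairs;
    returns None if the span never crosses the threshold.
    """
    for quarter, ics in ics_by_quarter:
        if ics / managers > max_span:
            return quarter
    return None


# Made-up plan: IC headcount at the end of each quarter, two managers today.
plan = [("Q1", 13), ("Q2", 14), ("Q3", 16), ("Q4", 18)]
print(first_manager_trigger(plan, managers=2))
```

Because a manager hire has its own lead time, the quarter this returns is better read as the latest date to have the person in seat than as the date to open the requisition.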

Q7 — "What is your surge plan for the P99 day?"

Recommended answer: an explicit, documented plan — overflow tier, BPO contracted capacity, on-call rotation, executive escalation tree, OR a written degradation contract that says "on P99 days we extend SLA to X minutes and notify customers proactively". If the answer is "we'll figure it out", the P99 day is a fire visible to the board.
Canon: Hopp & Spearman, Factory Physics (2008); Reinertsen (2009) on capacity-margin discipline.

Walk these seven in order. One at a time. Write the answers down. The plan you submit is only as defensible as your answers to these seven questions.