capacity-planner
Sizing tool for ops teams that handle queued work — Support, CX,
Customer Success, BizOps, IT ops, Finance ops. Built on Erlang-C
queueing theory, Little's Law, and the operational-leadership canon
(Fournier, Larson, Cleveland, Reinertsen). Deterministic, stdlib-only,
no LLM calls.
Purpose
You are an ops leader scaling from 15 to 35 people with no idea how the 35-person org
will actually behave at peak load. Or you are at 88% utilization and
SLA is starting to slip. Or you have a hiring budget approved and need
to sequence it across four quarters without burning out the existing
team. This skill answers those questions with arithmetic, not vibes.
It produces three artifacts:
- Capacity sizing at 70/80/90% utilization against P50/P90/P99 demand, with P(SLA breach) at each point and a SAFE/WATCH/AT_RISK/CRITICAL risk band.
- Utilization health at the per-member traffic-light level plus a team verdict (HEALTHY/SQUEEZED/OVERLOADED/UNBALANCED).
- 12-month quarterly hiring plan accounting for ramp curves, attrition, QoQ demand growth, and span-of-control manager triggers.
When to use
- Annual ops capacity planning (October-November for the following fiscal year).
- Quarterly re-sizing if demand changed >15% or attrition spiked.
- Pre-budget defense — the math that justifies the headcount ask to your CFO.
- Diagnostic when an ops team is missing SLA and you need to know whether it's a sizing problem, a process problem, or a bottleneck problem.
- M&A / new-segment launch modeling — sizing a new team or combined org.
Workflow
- Intake demand. Pull P50/P90/P99 daily ticket/case volume from your work system (Zendesk, Intercom, JSM, ServiceNow, Salesforce). If you only have averages, stop and pull the distribution. Single-point demand estimates are the most expensive anti-pattern in ops.
- Model throughput. Run `capacity_modeler.py` with your demand, AHT, SLA target, current FTE, and shrinkage. Use `--profile` for your function (support / cx / bizops / finance-ops / it-ops). Read the 80%-utilization row; that is your sizing point.
- Flag utilization risk. Run `utilization_analyzer.py` against your current team's actual utilization data. Anyone >85% sustained is a throughput-collapse risk per Reinertsen. Spread >30 percentage points across team means UNBALANCED; fix that before hiring.
- Sequence hiring. Run `hiring_sequencer.py` with current FTE, target EOY, ramp time, attrition, and growth. It will front-load hires (Q1 35%, Q4 15%), apply ramp curves, and trigger a manager hire when span of control crosses 7 ICs/manager.
- Walk the Forcing-question library (see below). One question at a time. Do not skip ahead. Answers must be written down before you commit the plan.
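The utilization check in the third step can be sketched in a few lines. This is a minimal illustration, not the logic of `utilization_analyzer.py`: the 85% red line and the 30-point spread limit come from the text above, while the yellow threshold, the function names, and the input shape (a name-to-utilization mapping) are assumptions made for the example.

```python
RED_THRESHOLD = 0.85     # from the text: >85% sustained is a collapse risk
SPREAD_LIMIT = 0.30      # from the text: >30 percentage points means UNBALANCED
YELLOW_THRESHOLD = 0.75  # assumption: the text does not state a yellow cut-off


def traffic_light(utilization):
    """Classify one member's sustained utilization (0.0 to 1.0)."""
    if utilization > RED_THRESHOLD:
        return "RED"
    if utilization > YELLOW_THRESHOLD:
        return "YELLOW"
    return "GREEN"


def team_flags(utilization_by_member):
    """Return per-member lights plus the spread check for the whole team."""
    values = list(utilization_by_member.values())
    spread = max(values) - min(values)
    return {
        "lights": {m: traffic_light(u) for m, u in utilization_by_member.items()},
        "spread": spread,
        "unbalanced": spread > SPREAD_LIMIT,
    }


if __name__ == "__main__":
    # Hypothetical team: one overloaded member, one underloaded member.
    team = {"ana": 0.92, "ben": 0.58, "chi": 0.80}
    print(team_flags(team))
```

With the hypothetical team above, the spread is 34 percentage points, so the sketch reports the team as unbalanced even though only one member is red.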
Scripts
- `scripts/capacity_modeler.py`: Erlang-C sizing with shrinkage adjustment and P50/P90/P99 breach probabilities. `--profile` for industry defaults.
- `scripts/utilization_analyzer.py`: per-member traffic-light + team-level health verdict with variance detection.
- `scripts/hiring_sequencer.py`: 12-month quarterly plan with ramp, attrition, growth, max-hires-per-quarter constraint, and manager-trigger logic.

All three accept `--input <path>` (JSON), `--output {markdown,json}`, `--sample` (built-in example), and `--help`. Stdlib only.
References
- `references/queueing_theory_canon.md`: Erlang, Little, Hopp & Spearman, Reinertsen, Kingman, Cleveland, ITIL, Armony et al. (8 sources). The math.
- `references/ops_workforce_planning_canon.md`: Fournier, Larson, Google SRE Workbook, Frei, Lawler, Bersin, Gartner, Grove (8 sources). The people factors.
- `references/capacity_anti_patterns.md`: 11 named anti-patterns with cited sources, tool guards, and the meta-discipline that Lencioni + Goldratt + Christensen impose. (8+ named sources.)
Assets
- `assets/capacity_brief_template.md`: 20-minute fill-out template with JSON skeletons for all three tools and an output checklist.
Assumptions
This skill assumes:
- Work is queued (tickets, cases, work items) — not project-style. If your team's work isn't queued, this is the wrong skill.
- Demand has a stationary-enough distribution within a quarter. Step-changes (new product launch, M&A, regulatory shift) require re-running mid-quarter.
- You have at least 90 days of historical demand data to compute P50/P90/P99. If not, generate the distribution from your sales / user-base forecast first.
- Service is single-class within a queue. If you have hard priority tiers (P1/P2/P3 with class-specific SLAs), model each as a separate queue and sum.
- Channels are modeled coherently. Multi-channel teams use the appropriate `--profile` with built-in shrinkage premium.
Anti-patterns
See `references/capacity_anti_patterns.md` for the full taxonomy with sources. Top eight:
- Plan-to-100%-utilization (Reinertsen Principle 12)
- Treat-ramp-as-instant (Larson)
- Ignore-attrition-in-12-month-plan (Bersin)
- Hire-ICs-forever-with-no-manager-trigger (Fournier)
- Size-to-P50-demand-only (Cleveland)
- No-shrinkage-adjustment (Cleveland, SRE Workbook)
- Single-channel-model-for-multi-channel-work (Gartner, Kingman)
- No-surge-plan-for-P99-events (Hopp & Spearman, Reinertsen)
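Why the first anti-pattern is so costly can be seen with a standard single-server (M/M/1) result: mean queue wait equals mean handle time multiplied by ρ/(1−ρ), where ρ is utilization. The sketch below is a deliberate simplification for illustration; the scripts use Erlang-C for multi-server teams, and the function name here is hypothetical.

```python
def wait_multiplier(utilization):
    """Mean queue wait as a multiple of mean handle time (M/M/1 assumption)."""
    if not 0 <= utilization < 1:
        raise ValueError("utilization must be in [0, 1); at 1.0 the queue never drains")
    return utilization / (1 - utilization)


if __name__ == "__main__":
    for u in (0.70, 0.80, 0.90, 0.95):
        print(f"{u:.0%} utilization: wait is {wait_multiplier(u):.1f}x handle time")
```

Under this model the wait roughly doubles between 70% and 80% utilization, and more than doubles again between 80% and 90%, which is why planning toward 100% leaves no room for demand variance.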
Distinct from
- `c-level-advisor/vpe-advisor` measures engineering throughput via DORA 4 metrics, story points, deployment frequency, and cycle time bottlenecks. It is for engineering teams shipping code. This skill is for ops teams handling tickets/cases. Different unit of work, different math (Erlang-C vs. DORA), different bottleneck (queueing-blind staffing vs. WIP + lead time).
- `c-level-advisor/chro-advisor` does strategic workforce planning (1-5 year capability portfolios, talent supply, leadership succession). This skill does operational 0-12 month capacity sizing against demand. Per Lawler: conflating them gets you hired into the wrong jobs.
- `project-management/*` tracks delivery throughput on projects (Jira velocity, sprint capacity). This skill sizes around steady-state queued work.
- Sibling `process-mapper` finds the bottleneck. This skill sizes the team around a known bottleneck. Order of operations: process-mapper first → capacity-planner second. Hiring around the wrong constraint wastes the hires.
- `business-growth/cs-coverage` (if it exists) sizes Customer Success coverage by ARR/CSM ratio and segment. This skill sizes by queued work volume (tickets, cases, escalations). For a CS team that handles both relationship work AND a ticket queue, run both.
Forcing-question library (Matt Pocock grill discipline)
Discipline: walk these one at a time. Do not skip ahead. Answers must
be written down. If you can't answer one, that is your next investigation.
Q1 — "What is your bottleneck, and have you confirmed it empirically?"
Recommended answer: a named, measured stage in the workflow with
queue-time data showing where work waits. Not a vibe. Not "escalations
take too long". An actual measured queue.
Why it's the first question: Goldratt (The Goal, 1984) — every
system has exactly one binding constraint at a time. Sizing around the
wrong constraint wastes hires entirely. If you do not know your
bottleneck, run `process-mapper` BEFORE this skill.
Canon: Eli Goldratt, The Goal (1984); Reinertsen, Principles of
Product Development Flow (2009).
Q2 — "What service trade-off are you accepting?"
Recommended answer: a written, explicit choice — fast vs. empathetic,
broad vs. deep, low-cost vs. high-quality. Frances Frei is unambiguous:
you cannot win on every dimension. The team that tries wins on none.
Why it matters: AHT, SLA, and shrinkage inputs are the operational
expression of this trade-off. If they don't agree (e.g., you set AHT for
"empathy" but SLA for "speed"), the plan is internally inconsistent.
Canon: Frances Frei & Anne Morriss, Uncommon Service (HBR Press,
2012).
Q3 — "What's your demand P90, and what's the gap to your P99?"
Recommended answer: two specific numbers from the last 90 days of
data, with the calendar context of each (e.g., "P90 was 480 tickets/day
on normal Tuesdays; P99 was 720 on the day after the November release").
A team sized to P50 misses SLA half the time. A team sized to P99
overstaffs by 30-50%. P90 is the right operating sizing point per
Cleveland.
Canon: Brad Cleveland, Call Center Management on Fast Forward (4th
ed., 2019); A.K. Erlang, The Theory of Probabilities and Telephone
Conversations (1909).
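Pulling the two numbers from raw daily volumes is a short computation with the standard library. The sketch below assumes a plain list of daily ticket counts exported from your work system; the function name and the choice of the `inclusive` interpolation method are assumptions for illustration, not a description of how the bundled scripts compute percentiles.

```python
from statistics import quantiles


def demand_percentiles(daily_volumes):
    """Return P50, P90, P99 and the P99-P90 gap from daily ticket counts."""
    if len(daily_volumes) < 90:
        # The skill assumes at least 90 days of history.
        raise ValueError("need at least 90 days of demand data")
    # n=100 yields 99 cut points; index k-1 holds the k-th percentile.
    cuts = quantiles(daily_volumes, n=100, method="inclusive")
    p50, p90, p99 = cuts[49], cuts[89], cuts[98]
    return {"p50": p50, "p90": p90, "p99": p99, "p99_gap": p99 - p90}


if __name__ == "__main__":
    # Synthetic stand-in for exported history: 101 days with volumes 0..100.
    history = list(range(101))
    print(demand_percentiles(history))
```

Record the calendar context of the days that sit at or above each cut point alongside the numbers; the percentile alone does not tell you whether the peak was a release day or a normal Tuesday.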
Q4 — "At your planned utilization, what is P(SLA breach) at P90 and at P99?"
Recommended answer: two probabilities, computed (not guessed) from
Erlang-C with your specific N, AHT, and SLA target. If P(breach at P90) > 10%
you are understaffed at the sizing point. If P(breach at P99) > 50% you have no surge plan and the next peak event will be visible to the CEO.
Canon: Erlang (1909); Hopp & Spearman, Factory Physics (3rd ed.,
2008), VUT equation.
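For readers who want to see the arithmetic behind such a probability, here is a minimal Erlang-C sketch. It assumes a steady-state M/M/c queue with no abandonment and no shrinkage adjustment, and it treats "breach" as the chance that a given ticket waits longer than the SLA threshold. The function names and input units are hypothetical, and `capacity_modeler.py` may define or compute breach differently.

```python
import math


def erlang_c(agents, offered_load):
    """Probability an arriving ticket has to wait (Erlang-C), load in Erlangs."""
    if offered_load >= agents:
        return 1.0  # unstable queue: every arrival waits
    # Erlang-B via the standard recursion, then convert to Erlang-C.
    b = 1.0
    for k in range(1, agents + 1):
        b = offered_load * b / (k + offered_load * b)
    return agents * b / (agents - offered_load * (1 - b))


def p_wait_exceeds(agents, arrivals_per_hour, aht_seconds, sla_seconds):
    """P(wait > sla_seconds) for an M/M/c queue."""
    load = arrivals_per_hour * aht_seconds / 3600  # offered load in Erlangs
    if load >= agents:
        return 1.0
    decay = math.exp(-(agents - load) * sla_seconds / aht_seconds)
    return erlang_c(agents, load) * decay


if __name__ == "__main__":
    # Hypothetical inputs: 60 tickets/hour, 10-minute AHT, 5-minute wait target.
    for n in (11, 12, 13, 14):
        print(n, round(p_wait_exceeds(n, 60, 600, 300), 3))
```

Running the same function at the P90 and P99 arrival rates gives the two probabilities the question asks for, under the stated assumptions.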
Q5 — "Have you budgeted replacement hires for the attrition you'll see this year?"
Recommended answer: yes, with a specific number. At 30% annual
attrition (Bersin BPO midpoint), a 20-FTE team loses ~6 people this year.
If your "add 5 net" plan is actually a "hire 11" plan, the recruiting
volume changes drastically. Anti-pattern #3.
Canon: Bersin/Deloitte talent benchmarks (2015-2023); Edward Lawler,
Strategic Workforce Planning (USC CEO, 2008).
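The "add 5 net" versus "hire 11" arithmetic can be written out explicitly. This is a simplified sketch: it applies the attrition rate to starting headcount only and ignores departures among the new hires themselves, which `hiring_sequencer.py` may treat differently. The function name is hypothetical.

```python
def gross_hires(current_fte, net_adds, annual_attrition):
    """Total hires needed to end the year at current_fte + net_adds."""
    # Simplification: attrition is applied to starting headcount only.
    expected_departures = round(current_fte * annual_attrition)
    return net_adds + expected_departures


if __name__ == "__main__":
    # The example from the text: 20 FTE, 30% attrition, 5 net adds.
    print(gross_hires(current_fte=20, net_adds=5, annual_attrition=0.30))
```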
Q6 — "When does span of control trigger a manager hire, and who is the candidate?"
Recommended answer: a specific quarter (from `hiring_sequencer.py`)
and at least one identified candidate (internal lead or external hire).
Past 7 ICs/manager, 1:1s degrade, feedback cycles slip, attrition
climbs. Past 10 you have a coverage crisis. Hire the manager BEFORE
crossing 10, not after.
Canon: Camille Fournier, The Manager's Path (O'Reilly, 2017),
ch. 5; Andy Grove, High Output Management (1983).
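Finding the trigger quarter by hand is a one-pass scan over projected headcount. The sketch below assumes you already have projected IC counts per quarter and a fixed number of managers; the threshold of 7 comes from the text, while the function name and input shape are hypothetical and not taken from `hiring_sequencer.py`.

```python
SPAN_TRIGGER = 7  # from the text: past 7 ICs/manager, 1:1s and feedback degrade


def first_manager_trigger(ics_by_quarter, managers):
    """Return the first quarter where ICs per manager exceeds the trigger."""
    if managers <= 0:
        raise ValueError("need at least one manager to compute span of control")
    for quarter, ics in ics_by_quarter.items():  # dicts keep insertion order
        if ics / managers > SPAN_TRIGGER:
            return quarter
    return None  # span stays within the trigger all year


if __name__ == "__main__":
    # Hypothetical projection for a team with two managers.
    plan = {"Q1": 13, "Q2": 14, "Q3": 16, "Q4": 18}
    print(first_manager_trigger(plan, managers=2))
```

In the hypothetical plan, Q2 sits exactly at 7 ICs per manager and Q3 is the first quarter past it, so the manager search should start well before Q3 to allow for recruiting lead time.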
Q7 — "What is your surge plan for the P99 day?"
Recommended answer: an explicit, documented plan — overflow tier,
BPO contracted capacity, on-call rotation, executive escalation tree,
OR a written degradation contract that says "on P99 days we extend SLA
to X minutes and notify customers proactively". If the answer is "we'll
figure it out", the P99 day is a fire visible to the board.
Canon: Hopp & Spearman, Factory Physics (2008); Reinertsen (2009)
on capacity-margin discipline.
Walk these seven in order. One at a time. Write the answers down. The
plan you submit is only as defensible as your answers to these seven
questions.