knowledge-ops
Company SOP + internal runbook authoring, 5W2H completeness validation, and KB hygiene reporting for Head-of-Ops / Knowledge-Manager / TPM-Internal personas.
Purpose
An ops organization three years in accumulates a sprawl: 600 Notion pages, 200 Confluence runbooks, three Obsidian vaults, a `Drive/SOPs/` folder, and a `Slack #ops-questions` channel that exists because nobody can find the canonical doc. Predictable failure modes:
- No owner: 40% of SOPs name "the team" instead of a person. When the doc rots, nobody is accountable.
- No last-reviewed date: a 2023 vendor-offboarding SOP still references a procurement tool sunset in 2024.
- Vague success signals: runbook step 4 says "verify the service is up". A new operator can't tell what that means.
- No rollback path: the incident-comms cascade runbook tells you how to send the alert. It doesn't tell you how to retract it when the alert was wrong.
- Orphan pages: half the KB has no inbound links. Nobody finds them via navigation; they only exist because somebody knew the URL.
- Glossary drift: "CSM" means Customer Success Manager in three docs and Customer Solutions Manager in five. New hires guess wrong for six months.
- Happy-path-only SOPs: the doc covers what happens when everything works. It doesn't cover the 30% case where it doesn't.
This skill answers the operator's actual question: "Which 20 docs do I fix first, and what specifically is wrong with each?" — with deterministic logic, not intuition.
When to use
- Authoring a new SOP for a cross-functional company process (procurement intake, vendor offboarding, incident-comms cascade, employee onboarding, expense reimbursement, customer-escalation playbook, security-incident comms, system-access provisioning).
- Validating an existing internal runbook before it goes into rotation (every step must have a named owner, expected duration, observable success signal, observable failure signal, rollback path, escalation contact).
- Ingesting a multi-document KB export (Notion zip, Confluence space export, Obsidian vault, `Drive/SOPs/` directory) and surfacing what's broken: orphan pages, stale pages (no edit in over 12 months), glossary drift, missing-owner pages, cross-link map.
- Onboarding a new ops hire by generating the SOPs and ops-handbook pages they need to read in week 1.
- Wiki cleanup sprints — quarterly hygiene work where the org decides which 30 docs to archive, rewrite, or merge.
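The orphan-page and cleanup-priority ideas above can be illustrated with a minimal sketch. It is not the actual `kb_ingester.py` logic. The function names, the in-memory `pages` mapping, and the assumption that link targets are written relative to the vault root are all hypothetical. The ranking uses staleness multiplied by inbound-link count as one plausible scoring rule.

```python
import re
from collections import Counter

# Captures the target of a markdown link [text](target), dropping any #fragment.
LINK_RE = re.compile(r"\[[^\]]*\]\(([^)\s#]+)")

def inbound_link_counts(pages):
    """Count inbound links per page. `pages` maps a page path to its markdown body."""
    counts = Counter({path: 0 for path in pages})
    for source, body in pages.items():
        # Use a set so several links from one page to another count once.
        for target in set(LINK_RE.findall(body)):
            if target in pages and target != source:
                counts[target] += 1
    return counts

def orphans(pages):
    """Pages that no other page links to."""
    counts = inbound_link_counts(pages)
    return sorted(path for path, n in counts.items() if n == 0)

def cleanup_priority(pages, days_since_edit, stale_days=365, top=20):
    """Rank stale pages by (age / stale_days) * inbound links. Hypothetical scoring."""
    counts = inbound_link_counts(pages)
    scored = []
    for path in pages:
        age = days_since_edit[path]
        if age > stale_days:
            scored.append((age / stale_days * counts[path], path))
    # Highest score first; ties broken alphabetically for a deterministic report.
    scored.sort(key=lambda item: (-item[0], item[1]))
    return [path for _, path in scored[:top]]
```

With this scoring an orphan always scores zero, so orphans are best reported as a separate list rather than relying on the ranking to surface them.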
Workflow
Four-step deterministic flow (matches the ops org's actual workflow, not an abstract process):
- Ingest KB. Run `kb_ingester.py --input <vault-dir>` on the existing wiki export. Output is a markdown health report: orphan pages, stale pages, glossary drift, missing-owner pages, cross-link map, prioritized cleanup list. The report ranks the top-20 docs to fix first, usually a mix of high-traffic stale docs and compliance-relevant missing-owner docs. Take this list to the cleanup sprint.
- Validate existing runbooks. For each runbook in the cleanup list (or any new runbook before it goes into rotation), run `runbook_validator.py --input <runbook.md>`. The validator scores each step against six checks (named owner, expected duration, observable success signal, observable failure signal, rollback path, escalation contact) and produces a per-step traffic light, an overall validity score (0-100), and a MUST-FIX issue list. A runbook scoring < 60 is not safe to use in an incident.
- Generate missing SOPs. For SOPs that need to be written from scratch (or rewritten because the existing one is unsalvageable), run `sop_generator.py --input <metadata.json> --profile <ops|support|finance|hr|it|regulated>`. Output is a 5W2H-structured SOP scaffold: Who (RACI), What (process steps), When (triggers + frequency), Where (system + tool), Why (purpose + regulatory basis), How (step-by-step), How-much (cost + time per execution). The `regulated` profile adds version control, signoff, and audit-trail sections (ISO 9001 / FDA 21 CFR Part 211 / SOC 2 / HIPAA).
- Cross-link + close the loop. Re-run `kb_ingester.py` after the cleanup sprint to verify orphan-page count is down and glossary drift is resolved. The metrics that matter are "unfindable docs" (orphans) and "unsafe runbooks" (validity score < 60), not page count.
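To make the validation step concrete, here is a minimal sketch of six-check scoring. It is an illustration under stated assumptions, not the real `runbook_validator.py`. The dict-based step format, the field names, the traffic-light cutoffs (green at 6/6, yellow at 4-5, red below 4), and the score formula (share of passed checks) are all hypothetical. Only the six checks and the 60-point safety threshold come from the text above.

```python
# The six per-step checks named in the workflow; the key names are hypothetical.
CHECKS = ("owner", "duration", "success_signal", "failure_signal", "rollback", "escalation")

def score_step(step):
    """Return (traffic_light, missing_checks) for one runbook step given as a dict."""
    missing = [name for name in CHECKS if not str(step.get(name) or "").strip()]
    passed = len(CHECKS) - len(missing)
    if passed == len(CHECKS):
        light = "green"
    elif passed >= 4:  # assumed cutoff, not taken from the source
        light = "yellow"
    else:
        light = "red"
    return light, missing

def validate_runbook(steps):
    """Score a list of steps: per-step lights, overall 0-100 score, MUST-FIX list."""
    lights, must_fix, passed_total = [], [], 0
    for number, step in enumerate(steps, start=1):
        light, missing = score_step(step)
        lights.append(light)
        passed_total += len(CHECKS) - len(missing)
        must_fix.extend(f"step {number}: missing {name}" for name in missing)
    possible = len(CHECKS) * len(steps)
    score = round(100 * passed_total / possible) if possible else 0
    # Below 60 the runbook is treated as not safe to use in an incident.
    return {"score": score, "lights": lights, "must_fix": must_fix, "safe": score >= 60}
```

A structural score like this only shows that each field is present. It says nothing about whether the content of a step is correct.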
Scripts
- `scripts/sop_generator.py`: `--profile` (`ops`, `support`, `finance`, `hr`, `it`, `regulated`), `SOC2`, `HIPAA`, `ISO13485`, `GDPR`, `SOX`, `--sample`
- `scripts/runbook_validator.py`: `/healthz`, `--sample`
- `scripts/kb_ingester.py`: `Drive/SOPs/`, `[link](path)`, `last_reviewed`, `owner:`, `staleness × inbound-link-count`, `--sample`
References
- `references/5w2h_sop_canon.md`: Kaoru Ishikawa's 5W2H method, Toyota standard-work discipline, Atul Gawande's The Checklist Manifesto, Atlassian Confluence SOP guidance, ISO 9001 SOP requirements, ITIL v4 Service Operation, FDA 21 CFR Part 211. Eight cited sources covering SOP authoring canon.
- `references/runbook_canon.md`: Google SRE Workbook (runbook chapter), Atlassian incident-management runbooks, PagerDuty Incident Response taxonomy, AWS Well-Architected operational excellence pillar, Charity Majors on observability-runbook integration, Susan Fowler on production-ready microservices, ITIL v4 Operations. Seven cited sources covering runbook design canon.
- `references/kb_hygiene_anti_patterns.md`: Eight anti-patterns drawn from Notion/Confluence wiki industry research, Mozilla SUMO knowledge-base lessons, Stack Overflow community-management research, the Atlassian Team Playbook, MIT TIK org-wiki studies, Cynthia Lee on glossary drift, and Adam Wiggins on "documentation rot".
Assumptions
- The KB is in markdown (or can be exported to markdown — Notion, Confluence, Obsidian, and Google Docs all support this). HTML-only or PDF-only KBs require a conversion pass first; out of scope.
- The user has authority to commission rewrites or archives. Producing a cleanup list nobody acts on is wasted work — route findings to a named owner before running the ingester.
- Owner metadata lives in YAML frontmatter (`owner: alex@company.com`) or in a top-of-page "Owner:" line. Tribal-knowledge ownership (the person who last edited the page) is treated as missing.
- "Stale" defaults to 12 months. Override with `--stale-days` on `kb_ingester.py`. Some compliance regimes (FDA, ISO 13485) require shorter review cycles; use `--profile regulated` and `--stale-days 365`.
- The user is not asking for a personal PKM. Personal Karpathy-style second-brain work belongs in `engineering/llm-wiki`.
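The owner and staleness assumptions above can be sketched as follows. This is a simplified illustration, not the ingester's real parser. It scans the first few lines of a page instead of parsing YAML properly, and the `last_reviewed` ISO-date format, the `head_lines` limit, and the rule that "the team" counts as no owner are assumptions made for the example.

```python
import re
from datetime import date

OWNER_RE = re.compile(r"^owner:\s*(\S.*)$", re.IGNORECASE)
REVIEWED_RE = re.compile(r"^last_reviewed:\s*(\d{4}-\d{2}-\d{2})\s*$", re.IGNORECASE)
# Collective owners are treated as missing (assumed rule for this sketch).
PLACEHOLDER_OWNERS = {"team", "the team"}

def read_metadata(markdown, head_lines=10):
    """Read owner and last_reviewed from frontmatter or a top-of-page 'Owner:' line."""
    meta = {"owner": None, "last_reviewed": None}
    for line in markdown.splitlines()[:head_lines]:
        line = line.strip()
        owner = OWNER_RE.match(line)
        if owner and meta["owner"] is None:
            value = owner.group(1).strip().strip("\"'")
            if value.lower() not in PLACEHOLDER_OWNERS:
                meta["owner"] = value
        reviewed = REVIEWED_RE.match(line)
        if reviewed:
            meta["last_reviewed"] = date.fromisoformat(reviewed.group(1))
    return meta

def is_stale(last_reviewed, today, stale_days=365):
    """A page with no review date is stale; otherwise compare its age to the threshold."""
    if last_reviewed is None:
        return True
    return (today - last_reviewed).days > stale_days
```

Passing `today` explicitly keeps the check deterministic, so the same export produces the same report on any day it is re-run against a fixed reference date.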
Anti-patterns
- Generating SOPs in bulk without owners. A doc with no owner has a half-life of 6 months. Refuse to generate a batch of 30 SOPs unless each one is assigned to a named human.
- Using `runbook_validator.py` as a checkbox. The validator catches missing structure. It does not catch wrong content. A runbook can score 100 and still tell the operator the wrong thing.
- Treating orphan pages as garbage by default. Some orphans are reference pages found only via search, so not all orphans should be archived. The cleanup list is a priority queue, not a delete list.
- Confusing knowledge-ops with `process-mapper`. Process-mapper documents the flow of work between stages (BPMN, cycle time, bottleneck). Knowledge-ops documents the artifacts operators consume to execute the work (SOP, runbook, glossary). Both can apply to the same process.
- Letting glossary drift accumulate. Two definitions of "CSM" in three years becomes seven definitions in five. Fix glossary drift the moment it surfaces in `kb_ingester.py` output.
- Skipping the regulated profile under regulated workload. If the process touches PHI, SOX-relevant financial controls, or ISO 13485 device QMS, use `--profile regulated`. Missing version control on a regulated SOP is an audit finding.
- Hand-writing 5W2H sections from memory. The 5W2H scaffold exists because operators forget "How-much". Use the generator; edit the output.
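Glossary drift of the "CSM" kind can be surfaced with a simple heuristic. The sketch below is one possible approach and not necessarily how `kb_ingester.py` detects drift. It only recognizes definitions written as "Expansion (ACRONYM)" where the initials of the preceding words match the acronym, and the function names and `pages` mapping are hypothetical.

```python
import re
from collections import defaultdict

# An all-caps token in parentheses, e.g. "(CSM)".
ACRONYM_RE = re.compile(r"\(([A-Z]{2,6})\)")

def find_definitions(text):
    """Yield (acronym, expansion) for patterns like 'Customer Success Manager (CSM)'."""
    for match in ACRONYM_RE.finditer(text):
        acronym = match.group(1)
        # Look only at a short window before the parenthesis.
        window = text[max(0, match.start() - 200):match.start()]
        words = re.findall(r"[A-Za-z]+", window)[-len(acronym):]
        initials = "".join(word[0].upper() for word in words)
        if initials == acronym:
            # Title-case so "customer success manager" and "Customer Success Manager" agree.
            yield acronym, " ".join(words).title()

def glossary_drift(pages):
    """Return acronyms that have more than one distinct expansion across the KB."""
    seen = defaultdict(lambda: defaultdict(set))  # acronym -> expansion -> page paths
    for path, body in pages.items():
        for acronym, expansion in find_definitions(body):
            seen[acronym][expansion].add(path)
    return {
        acronym: {expansion: sorted(paths) for expansion, paths in expansions.items()}
        for acronym, expansions in seen.items()
        if len(expansions) > 1
    }
```

Acronyms used without an inline definition are invisible to this heuristic, so an empty result does not prove the glossary is consistent.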
Distinct from
- `engineering/llm-wiki`: Karpathy-style personal PKM second brain where one human ingests sources into their own interlinked vault. Knowledge-ops is organizational, with many authors, many readers, named owners per doc, formal review cycles, and compliance overlays.
- `engineering-team/runbook-generator`: system-ops runbook for debugging a production system (logs, alerts, k8s, on-call). Knowledge-ops runbooks are operator runbooks for business processes (incident-comms cascade, vendor offboarding, employee onboarding). The audience is fellow operators, not engineers tailing logs.
- `project-management/*`: Jira / Confluence delivery tracking, sprint ticket workflow, project-status reporting. Knowledge-ops is the content in those Confluence pages, not the tracking of who edits them.
- `business-operations/process-mapper` (sibling): BPMN process design, covering where the stages are, where work waits, and which stage is the bottleneck. Knowledge-ops is process documentation, meaning the SOP and runbook artifacts that tell an operator how to execute the process the mapper described.
- `business-operations/internal-comms` (sibling): broadcast announcements, all-hands messaging, change-management comms. Knowledge-ops is the durable reference artifact; internal-comms is the broadcast.
- `ra-qm-team/*`: formal regulatory compliance authoring (ISO 13485 QMS, MDR technical files, 21 CFR Part 820). Knowledge-ops borrows the regulatory checklist but is not a substitute for a notified-body audit.
Forcing-question library (Matt Pocock grill discipline)
Before invoking the tools, the orchestrator (or `/cs:grill-bizops`) walks the user through these questions one at a time, each with a recommended answer and a canon citation. Never bundle them. Walk depth-first: do not open question 4 until questions 1-3 are locked.
1. "Who is the named owner of this SOP / runbook, and do they know they own it?" Recommended: a single human (not "the team"), and yes, they have agreed in writing. Canon: Gawande 2009 (The Checklist Manifesto). Checklists without an owner rot within 12 months. Ownership is the discipline.
2. "When was this doc last reviewed, and what is the review cadence?" Recommended: reviewed within the last 12 months (90 days if `--profile regulated`); cadence written in the frontmatter. Canon: ISO 9001:2015 §7.5.3. Controlled documents require review-cycle metadata. ITIL v4 echoes this for Service Operation runbooks.
3. "For each runbook step: what is the observable success signal, by which I mean, what specific output tells you the step worked?" Recommended: a concrete observable ("HTTP 200 from `/healthz`", "Slack thread closed with `done` reaction", "Salesforce opportunity moved to `Closed-Won` stage"), not "the service is up" or "it works". Canon: Beyer et al. 2018 (Site Reliability Workbook, Ch. 8). Observable signals are the entire point of a runbook. Vague success criteria are the leading cause of runbook misuse during incidents.
4. "What is the rollback path for each runbook step that can fail?" Recommended: every step that mutates state has either a rollback path or an explicit "cannot roll back, escalate to X" line. Canon: AWS Well-Architected Framework, Operational Excellence pillar: "you cannot run a process you cannot reverse without first agreeing what 'reverse' means".
5. "Where does this doc live, and what other docs link to it?" Recommended: in the canonical wiki, and at least 2 inbound links from related docs. An orphan SOP is an unfindable SOP. Canon: Atlassian Team Playbook on documentation health. An orphan rate > 20% is the leading indicator of a wiki sprawl problem.
6. "What is the regulatory overlay on this process: SOC 2, HIPAA, ISO 13485, GDPR, SOX, none?" Recommended: explicit answer. If "none", confirm by checking the data classes the process touches. Canon: FDA 21 CFR Part 211.100 (Written procedures; deviations). Regulated SOPs require version control, change history, and signoff. Skip this step and the doc is an audit finding.
7. "Is the happy path the only path documented, or are the 2-3 most common failure modes also documented?" Recommended: the top-2 failure modes per process are documented with their own recovery sub-procedure. Canon: Fowler 2016 (Production-Ready Microservices). Operations docs that cover only the happy path are responsible for 60%+ of incident-time waste.
After all 7 are locked, invoke `kb_ingester.py` → `runbook_validator.py` → `sop_generator.py` in sequence.