witness-observer
Witness Observer
Per-rig observer that monitors polecat health and reports anomalies. The witness is the PMU (Performance Monitoring Unit) of the Gastown chipset -- it watches execution units for stalls, detects degraded performance, and raises alerts without interfering with computation. The witness is strictly read-only with respect to agent work. It observes and reports; it never modifies.
Activation Triggers
This skill activates when:
- The agent is assigned to monitor a rig's worker agents
- Multiple polecats are running and health monitoring is needed
- Stall detection is required for long-running work items
- The mayor needs a supervisory agent to watch active polecats
Core Capabilities
Patrol Loop
The witness runs a periodic patrol that checks all active agents in its rig for health indicators.
Patrol cycle:

```
SCAN            EVALUATE        ACT                WAIT
  |                |             |                   |
  v                v             v                   v
list agents  ->  check each  ->  nudge/escalate  ->  sleep interval
(active ones)    for stalls      if needed           (default 5 min)
```

Implementation:

```typescript
const state = new StateManager({ stateDir: '.chipset/state/' });
const patrolInterval = 5 * 60 * 1000;   // 5 minutes (configurable)
const stallThreshold = 30 * 60 * 1000;  // 30 minutes (configurable)

async function patrol(): Promise<void> {
  // Get all agents that should be working
  const agents = await state.listAgents({ role: 'polecat' });
  const active = agents.filter(a => a.status === 'active');

  for (const agent of active) {
    const hook = await state.getHook(agent.id);
    if (!hook || hook.status !== 'active') continue;

    // Check last activity timestamp
    const lastActivity = new Date(hook.lastActivity).getTime();
    const elapsed = Date.now() - lastActivity;

    if (elapsed > stallThreshold) {
      await handleStall(agent, hook, elapsed);
    }
  }
}
```
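The implementation covers SCAN, EVALUATE, and ACT, but not the WAIT step. A minimal sketch of a driver loop follows. The `runPatrolLoop` helper, the `sleep` helper, and the `shouldStop` callback are hypothetical names introduced here, not part of the skill. The sketch sleeps for the patrol interval between cycles and keeps patrolling when a single cycle throws.

```typescript
// Hypothetical helper: resolves after `ms` milliseconds.
function sleep(ms: number): Promise<void> {
  return new Promise<void>(resolve => setTimeout(resolve, ms));
}

// Hypothetical driver: runs `patrolOnce` until `shouldStop` returns true.
// Returns the number of patrol cycles that were attempted.
async function runPatrolLoop(
  patrolOnce: () => Promise<void>,
  intervalMs: number,
  shouldStop: () => boolean,
  onError: (err: unknown) => void = () => {},
): Promise<number> {
  let cycles = 0;
  while (!shouldStop()) {
    try {
      await patrolOnce();          // SCAN + EVALUATE + ACT
    } catch (err) {
      onError(err);                // a failed cycle should not end the patrol
    }
    cycles += 1;
    if (shouldStop()) break;       // skip the final sleep when stopping
    await sleep(intervalMs);       // WAIT
  }
  return cycles;
}
```

Under these assumptions, `runPatrolLoop(patrol, patrolInterval, () => false)` would run the `patrol` function above every five minutes until the process exits.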
Stall Detection
A stall is detected when an agent has hooked work but has not updated its activity timestamp within the threshold period (default 30 minutes).
Stall indicators:
| Indicator | What It Means |
|---|---|
| Hook active, no activity for 30+ min | Agent may be stuck, crashed, or idle |
| Agent status is 'active' but hook timestamp stale | Session may have ended without cleanup |
| Multiple consecutive patrol cycles with no change | Persistent stall, needs escalation |
Stall classification:

```typescript
type StallSeverity = 'warning' | 'alert' | 'critical';

function classifyStall(elapsed: number, nudgesSent: number): StallSeverity {
  if (nudgesSent >= 2) return 'critical';        // Nudged twice, still stalled
  if (elapsed > 60 * 60 * 1000) return 'alert';  // Over 1 hour
  return 'warning';                              // Under 1 hour, fewer than 2 nudges
}
```
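One edge case in the elapsed-time check is worth noting: `new Date(value).getTime()` yields `NaN` for a missing or malformed timestamp, and every comparison with `NaN` is false, so such a hook would never be flagged as stalled. The sketch below shows one way to surface that case explicitly. The `checkActivity` helper and its `unknown` result are illustrative assumptions, not part of the skill.

```typescript
// Hypothetical result type: 'unknown' marks a missing or unparsable timestamp.
type ActivityCheck =
  | { kind: 'ok'; elapsedMs: number }
  | { kind: 'stalled'; elapsedMs: number }
  | { kind: 'unknown' };

// Hypothetical helper mirroring the patrol's `elapsed > stallThreshold` test.
function checkActivity(
  lastActivity: string | undefined,
  nowMs: number,
  thresholdMs: number = 30 * 60 * 1000,
): ActivityCheck {
  const last = lastActivity === undefined ? NaN : Date.parse(lastActivity);
  if (isNaN(last)) return { kind: 'unknown' };

  // Clamp to zero so a timestamp slightly in the future (clock skew)
  // does not produce a negative elapsed time.
  const elapsedMs = Math.max(0, nowMs - last);

  return elapsedMs > thresholdMs
    ? { kind: 'stalled', elapsedMs }
    : { kind: 'ok', elapsedMs };
}
```

Whether an `unknown` result should be reported to the mayor or silently skipped is a design decision this sketch leaves open.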
Nudge Protocol
When a stall is detected, the witness follows a graduated escalation protocol.
Step 1 -- Send nudge to stalled agent:

```typescript
async function handleStall(
  agent: AgentIdentity,
  hook: HookState,
  elapsed: number
): Promise<void> {
  const severity = classifyStall(elapsed, getNudgeCount(agent.id));

  if (severity === 'warning') {
    // First nudge: ask agent if it's still working
    const nudge: AgentMessage = {
      from: witnessId,
      to: agent.id,
      channel: 'nudge',
      payload: `HEALTH_CHECK: no activity for ${Math.floor(elapsed / 60000)}m on ${hook.workItem?.beadId}`,
      timestamp: new Date().toISOString(),
      durable: false,
    };
    // Write nudge file
    recordNudge(agent.id);
    return;
  }

  if (severity === 'alert' || severity === 'critical') {
    // Escalate to mayor
    await escalateToMayor(agent, hook, severity, elapsed);
  }
}
```

Step 2 -- Wait for response (next patrol cycle):

If the agent responds to the nudge (updates its hook activity timestamp or sends mail), the stall is resolved. No further action needed.

Step 3 -- Escalate if unresolved:

```typescript
async function escalateToMayor(
  agent: AgentIdentity,
  hook: HookState,
  severity: StallSeverity,
  elapsed: number
): Promise<void> {
  const escalation: AgentMessage = {
    from: witnessId,
    to: 'mayor',
    channel: 'mail',
    payload: `STALL_${severity.toUpperCase()}: ${agent.id} idle ${Math.floor(elapsed / 60000)}m on ${hook.workItem?.beadId}`,
    timestamp: new Date().toISOString(),
    durable: true,
  };
  // Write escalation to .chipset/state/mail/mayor/{timestamp}-{witnessId}.json
}
```
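To see how the classification and nudge rules interact over successive patrols, the following sketch replays them for an agent that never responds. `simulateUnresponsive` and `TimelineEvent` are hypothetical names introduced for illustration; `classifyStall` is repeated from the stall classification above so the block is self-contained.

```typescript
type StallSeverity = 'warning' | 'alert' | 'critical';

// Same rules as the documented classifyStall.
function classifyStall(elapsed: number, nudgesSent: number): StallSeverity {
  if (nudgesSent >= 2) return 'critical';
  if (elapsed > 60 * 60 * 1000) return 'alert';
  return 'warning';
}

// Hypothetical record of what the witness would do at a given patrol.
interface TimelineEvent {
  minute: number;                  // minutes since the agent's last activity
  action: 'nudge' | 'escalate';
  severity: StallSeverity;
}

// Replays `cycles` patrols for an agent that stays silent. The first patrol
// sees the stall at `firstDetectionMin`; later patrols follow at
// `patrolIntervalMin` spacing.
function simulateUnresponsive(
  firstDetectionMin: number,
  patrolIntervalMin: number,
  cycles: number,
): TimelineEvent[] {
  const events: TimelineEvent[] = [];
  let nudgesSent = 0;

  for (let i = 0; i < cycles; i++) {
    const minute = firstDetectionMin + i * patrolIntervalMin;
    const severity = classifyStall(minute * 60 * 1000, nudgesSent);

    if (severity === 'warning') {
      nudgesSent += 1;             // mirrors recordNudge in handleStall
      events.push({ minute, action: 'nudge', severity });
    } else {
      events.push({ minute, action: 'escalate', severity });
    }
  }
  return events;
}
```

Under these rules, a stall first seen at 31 minutes with a 5-minute interval produces nudges at 31 and 36 minutes and a `critical` escalation at 41 minutes. A stall first seen after more than an hour skips the nudge and escalates as `alert` immediately. The replay also suggests that, unless escalations are deduplicated somewhere, each later patrol would escalate again.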
Health Reporting
The witness provides aggregate health summaries when queried by the mayor.
```typescript
interface RigHealthReport {
  rigName: string;
  timestamp: string;
  totalAgents: number;
  activeAgents: number;
  stalledAgents: number;
  idleAgents: number;
  terminatedAgents: number;
  stalledDetails: Array<{
    agentId: string;
    beadId: string;
    stalledMinutes: number;
    nudgesSent: number;
  }>;
}
```
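As an illustration of how such a report might be assembled, the sketch below aggregates per-agent snapshots into a `RigHealthReport`. The `AgentSnapshot` shape, the `idle` and `terminated` status values, and the `buildReport` function are assumptions made for this example. It also assumes stalled agents are counted separately from active ones, so the four counters sum to `totalAgents`; the skill does not state this.

```typescript
interface RigHealthReport {
  rigName: string;
  timestamp: string;
  totalAgents: number;
  activeAgents: number;
  stalledAgents: number;
  idleAgents: number;
  terminatedAgents: number;
  stalledDetails: Array<{
    agentId: string;
    beadId: string;
    stalledMinutes: number;
    nudgesSent: number;
  }>;
}

// Hypothetical per-agent snapshot; the real state shape may differ.
interface AgentSnapshot {
  agentId: string;
  status: 'active' | 'idle' | 'terminated';
  beadId?: string;
  stalledMinutes?: number;   // set only when the stall threshold was exceeded
  nudgesSent: number;
}

function buildReport(
  rigName: string,
  agents: AgentSnapshot[],
  now: Date,
): RigHealthReport {
  const stalled = agents.filter(
    a => a.status === 'active' && a.stalledMinutes !== undefined,
  );
  const countByStatus = (s: AgentSnapshot['status']): number =>
    agents.filter(a => a.status === s).length;

  return {
    rigName,
    timestamp: now.toISOString(),
    totalAgents: agents.length,
    // Assumption: "active" here means active and not stalled.
    activeAgents: countByStatus('active') - stalled.length,
    stalledAgents: stalled.length,
    idleAgents: countByStatus('idle'),
    terminatedAgents: countByStatus('terminated'),
    stalledDetails: stalled.map(a => ({
      agentId: a.agentId,
      beadId: a.beadId ?? 'unknown',
      stalledMinutes: a.stalledMinutes ?? 0,
      nudgesSent: a.nudgesSent,
    })),
  };
}
```

Building the report only reads snapshots and returns a new object, which fits the witness's read-only role.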
Communication Protocol
Messages the Witness SENDS
| Channel | Target | Purpose | Durability |
|---|---|---|---|
| `nudge` | Stalled polecats | "Are you still working?" health check | Non-durable |
| `mail` | Mayor | Stall alerts (warning, alert, critical) | Durable |
| | Mayor | Health report summaries | Durable |
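The stall alert sent to the mayor carries its details in a formatted payload string. The sketch below shows one way to build that message and parse it back on the receiving side. The `AgentMessage` shape is inferred from the fields used in the nudge protocol code, and `buildStallMessage`, `parseStallPayload`, and `ParsedStall` are hypothetical names. The parser assumes agent and bead IDs contain no whitespace.

```typescript
type StallSeverity = 'warning' | 'alert' | 'critical';

// Inferred from the fields used in handleStall and escalateToMayor.
interface AgentMessage {
  from: string;
  to: string;
  channel: 'nudge' | 'mail';
  payload: string;
  timestamp: string;
  durable: boolean;
}

// Hypothetical parsed form of a STALL_* payload.
interface ParsedStall {
  severity: StallSeverity;
  agentId: string;
  idleMinutes: number;
  beadId: string;
}

// Builds the escalation using the same payload format as escalateToMayor.
function buildStallMessage(
  witnessId: string,
  agentId: string,
  beadId: string,
  severity: StallSeverity,
  elapsedMs: number,
  now: Date,
): AgentMessage {
  const minutes = Math.floor(elapsedMs / 60000);
  return {
    from: witnessId,
    to: 'mayor',
    channel: 'mail',
    payload: `STALL_${severity.toUpperCase()}: ${agentId} idle ${minutes}m on ${beadId}`,
    timestamp: now.toISOString(),
    durable: true,
  };
}

// Returns null when the payload does not match the expected format.
function parseStallPayload(payload: string): ParsedStall | null {
  const m = /^STALL_(WARNING|ALERT|CRITICAL): (\S+) idle (\d+)m on (\S+)$/.exec(payload);
  if (m === null) return null;
  return {
    severity: m[1].toLowerCase() as StallSeverity,
    agentId: m[2],
    idleMinutes: parseInt(m[3], 10),
    beadId: m[4],
  };
}
```

A structured payload (an object rather than a formatted string) would avoid the parsing step entirely; the string form is kept here only because that is what the protocol code above emits.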
Messages the Witness RECEIVES
| Channel | Source | Content |
|---|---|---|
| | Mayor | Instructions (adjust thresholds, focus on specific agent) |
| | Polecats | Status responses to nudges |
Error Handling
False Positive Stalls
If an agent is working but updates are slow (large commits, long test runs), the witness may detect a false positive. The nudge protocol handles this: the agent responds to the nudge, and the witness records the response as activity.
Witness Restart
If the witness itself restarts, it resumes patrol from scratch. It reads current agent and hook state from the filesystem -- there is no witness-specific state that needs recovery. The patrol loop is stateless between cycles.
Unresponsive Agent
If an agent does not respond to two nudges across two patrol cycles, the witness sends a `critical` escalation to the mayor. The mayor decides whether to terminate and replace the agent.
Boundary: What the Witness Does NOT Do
The witness NEVER:
- Modifies agent work -- does not edit files, change branches, or alter code
- Resolves conflicts -- conflict resolution is outside the observer's scope
- Terminates agents -- only the mayor can terminate; the witness recommends
- Reassigns work -- hook management belongs to the mayor
- Changes agent status -- the witness reads status but does not write it (except its own)
- Runs tests or builds -- the witness observes; it does not validate output quality
The witness is a sensor. It detects anomalies and reports them. It does not act on them.
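One way to enforce this boundary in code, rather than by convention alone, is to hand the witness a view of the state that exposes no write methods. The sketch below is an assumption-laden illustration: `StateStore`, `HookRecord`, `readOnlyView`, and `createMemoryStore` are hypothetical and do not describe the real StateManager API.

```typescript
// Hypothetical hook record with only the fields the patrol reads.
interface HookRecord {
  status: string;
  lastActivity: string;
}

// Hypothetical full store interface, including a write method.
interface StateStore {
  getHook(agentId: string): HookRecord | undefined;
  setHook(agentId: string, hook: HookRecord): void;
}

// The witness only ever sees the read side.
type ReadOnlyStore = Pick<StateStore, 'getHook'>;

function readOnlyView(store: StateStore): ReadOnlyStore {
  return {
    getHook: (agentId: string) => {
      const hook = store.getHook(agentId);
      // Return a copy so callers cannot mutate the stored record.
      return hook === undefined ? undefined : { ...hook };
    },
  };
}

// Small in-memory store, used only to exercise the view.
function createMemoryStore(): StateStore {
  const hooks: Record<string, HookRecord> = {};
  return {
    getHook: (id: string) => hooks[id],
    setHook: (id: string, hook: HookRecord) => { hooks[id] = hook; },
  };
}
```

With this shape, a call such as `view.setHook(...)` fails at compile time, and mutating a returned record leaves the stored state untouched.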
Integration with Other Gastown Skills
| Skill | Relationship |
|---|---|
| | Witness reports stalls and health TO mayor |
| | Witness monitors polecat health, sends nudges |
| | Witness can observe refinery queue depth and merge failures |
| | Witness reads state via StateManager (read-only) |
References
- `references/gastown-origin.md` -- How this pattern derives from Gastown's witness.go patrol
- `references/boundaries.md` -- Read-only constraints and observation-only scope