witness-observer

Witness Observer

Per-rig observer that monitors polecat health and reports anomalies. The witness is the PMU (Performance Monitoring Unit) of the Gastown chipset -- it watches execution units for stalls, detects degraded performance, and raises alerts without interfering with computation. The witness is strictly read-only with respect to agent work. It observes and reports; it never modifies.

Activation Triggers

This skill activates when:
  • The agent is assigned to monitor a rig's worker agents
  • Multiple polecats are running and health monitoring is needed
  • Stall detection is required for long-running work items
  • The mayor needs a supervisory agent to watch active polecats

Core Capabilities

Patrol Loop

The witness runs a periodic patrol that checks all active agents in its rig for health indicators.
Patrol cycle:
SCAN            EVALUATE         ACT              WAIT
  |                |               |                |
  v                v               v                v
list agents -> check each  ->  nudge/escalate -> sleep interval
(active ones)   for stalls     if needed         (default 5 min)
Implementation:
```typescript
// handleStall is defined under Nudge Protocol.
const state = new StateManager({ stateDir: '.chipset/state/' });
const patrolInterval = 5 * 60 * 1000; // 5 minutes (configurable); pause between cycles
const stallThreshold = 30 * 60 * 1000; // 30 minutes (configurable)

async function patrol(): Promise<void> {
  // SCAN: get all agents that should be working
  const agents = await state.listAgents({ role: 'polecat' });
  const active = agents.filter(a => a.status === 'active');

  for (const agent of active) {
    // Only agents with an active hook can stall
    const hook = await state.getHook(agent.id);
    if (!hook || hook.status !== 'active') continue;

    // EVALUATE: check last activity timestamp
    const lastActivity = new Date(hook.lastActivity).getTime();
    const elapsed = Date.now() - lastActivity;

    // ACT: nudge or escalate once the threshold is exceeded
    if (elapsed > stallThreshold) {
      await handleStall(agent, hook, elapsed);
    }
  }
}
```
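The WAIT step of the cycle is not shown in the implementation above. A minimal sketch of a driver that repeats a patrol function with a pause between cycles might look like the following. The names `runPatrolLoop`, `sleep`, and `shouldStop` are hypothetical and not part of the original skill, and the sketch assumes that a failed cycle should be swallowed so the observer keeps running.

```typescript
type PatrolFn = () => Promise<void>;

// Resolve after the given number of milliseconds.
function sleep(ms: number): Promise<void> {
  return new Promise<void>((resolve) => setTimeout(resolve, ms));
}

// Hypothetical driver: run patrolOnce, wait intervalMs, repeat until shouldStop() is true.
// Returns the number of cycles that were attempted.
async function runPatrolLoop(
  patrolOnce: PatrolFn,
  intervalMs: number,
  shouldStop: () => boolean,
): Promise<number> {
  let cycles = 0;
  while (!shouldStop()) {
    try {
      await patrolOnce();
    } catch {
      // Assumption: one failed cycle must not stop the observer; the next cycle retries.
    }
    cycles += 1;
    if (shouldStop()) break; // skip the final sleep when asked to stop
    await sleep(intervalMs);
  }
  return cycles;
}
```

With the constants above, this could be started as `runPatrolLoop(patrol, patrolInterval, () => false)`.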

Stall Detection

A stall is detected when an agent has hooked work but has not updated its activity timestamp within the threshold period (default 30 minutes).
Stall indicators:
| Indicator | What It Means |
| --- | --- |
| Hook active, no activity for 30+ min | Agent may be stuck, crashed, or idle |
| Agent status is 'active' but hook timestamp stale | Session may have ended without cleanup |
| Multiple consecutive patrol cycles with no change | Persistent stall, needs escalation |
Stall classification:
```typescript
type StallSeverity = 'warning' | 'alert' | 'critical';

function classifyStall(elapsed: number, nudgesSent: number): StallSeverity {
  if (nudgesSent >= 2) return 'critical';        // Nudged twice, still stalled
  if (elapsed > 60 * 60 * 1000) return 'alert';  // Over 1 hour
  return 'warning';                              // First detection
}
```
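The stall definition above comes down to comparing the hook's last activity timestamp against a threshold. The sketch below isolates that comparison as a pure function so it can be checked without a running rig. `evaluateActivity` and `ActivityState` are hypothetical names. The sketch assumes ISO 8601 timestamps and reports an unreadable timestamp as a separate `unknown` state, which is a design choice of this example and not something the source specifies.

```typescript
type ActivityState =
  | { kind: 'ok'; elapsedMs: number }
  | { kind: 'stalled'; elapsedMs: number }
  | { kind: 'unknown' };

// Compare a hook's last activity against the stall threshold.
function evaluateActivity(
  lastActivityIso: string,
  nowMs: number,
  thresholdMs: number,
): ActivityState {
  const last = Date.parse(lastActivityIso);
  if (Number.isNaN(last)) {
    // Assumption: an unparseable timestamp is surfaced, not silently skipped.
    return { kind: 'unknown' };
  }
  // Clamp at zero so a timestamp slightly in the future does not yield negative time.
  const elapsedMs = Math.max(0, nowMs - last);
  return elapsedMs > thresholdMs
    ? { kind: 'stalled', elapsedMs }
    : { kind: 'ok', elapsedMs };
}
```

Keeping the check free of I/O means the same function can serve both the patrol loop and a health report.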

Nudge Protocol

When a stall is detected, the witness follows a graduated escalation protocol.
Step 1 -- Send nudge to stalled agent:
```typescript
// witnessId, getNudgeCount, and recordNudge are assumed to be defined elsewhere.
async function handleStall(
  agent: AgentIdentity,
  hook: HookState,
  elapsed: number
): Promise<void> {
  const severity = classifyStall(elapsed, getNudgeCount(agent.id));

  if (severity === 'warning') {
    // First nudge: ask agent if it's still working
    const nudge: AgentMessage = {
      from: witnessId,
      to: agent.id,
      channel: 'nudge',
      payload: `HEALTH_CHECK: no activity for ${Math.floor(elapsed / 60000)}m on ${hook.workItem?.beadId}`,
      timestamp: new Date().toISOString(),
      durable: false,
    };
    // Write nudge file (delivery of `nudge` is not shown here)
    recordNudge(agent.id); // Count the nudge so repeated stalls escalate
    return;
  }

  if (severity === 'alert' || severity === 'critical') {
    // Escalate to mayor
    await escalateToMayor(agent, hook, severity, elapsed);
  }
}
```
Step 2 -- Wait for response (next patrol cycle):
If the agent responds to the nudge (updates its hook activity timestamp or sends mail), the stall is resolved. No further action needed.
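The helpers `getNudgeCount` and `recordNudge` are used in Step 1 but not defined in this document. One possible shape, shown purely as a sketch, is an in-memory counter keyed by agent id that is cleared when the agent responds. `NudgeTracker` and its `resolve` method are hypothetical names. `classifyStall` is copied from the Stall Detection section so the example is self-contained. Because the counts live in memory only, a witness restart would restart the escalation ladder.

```typescript
type StallSeverity = 'warning' | 'alert' | 'critical';

// Copied from the Stall Detection section.
function classifyStall(elapsed: number, nudgesSent: number): StallSeverity {
  if (nudgesSent >= 2) return 'critical';
  if (elapsed > 60 * 60 * 1000) return 'alert';
  return 'warning';
}

// Hypothetical in-memory nudge ledger, keyed by agent id.
class NudgeTracker {
  private readonly counts = new Map<string, number>();

  getNudgeCount(agentId: string): number {
    return this.counts.get(agentId) ?? 0;
  }

  recordNudge(agentId: string): number {
    const next = this.getNudgeCount(agentId) + 1;
    this.counts.set(agentId, next);
    return next;
  }

  // Called when the agent shows activity again (Step 2): the ladder restarts.
  resolve(agentId: string): void {
    this.counts.delete(agentId);
  }
}

// Walk one agent through three consecutive patrol cycles without a response.
const tracker = new NudgeTracker();
const minutes = (n: number): number => n * 60 * 1000;
const ladder: StallSeverity[] = [];
for (const elapsed of [minutes(31), minutes(36), minutes(41)]) {
  const severity = classifyStall(elapsed, tracker.getNudgeCount('polecat-1'));
  ladder.push(severity);
  if (severity === 'warning') tracker.recordNudge('polecat-1');
}
// ladder is now ['warning', 'warning', 'critical']
```

With the default 5 minute patrol interval, this ladder reaches `critical` on the third consecutive cycle, about 10 minutes after the first warning.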
Step 3 -- Escalate if unresolved:
```typescript
async function escalateToMayor(
  agent: AgentIdentity,
  hook: HookState,
  severity: StallSeverity,
  elapsed: number
): Promise<void> {
  const escalation: AgentMessage = {
    from: witnessId,
    to: 'mayor',
    channel: 'mail',
    payload: `STALL_${severity.toUpperCase()}: ${agent.id} idle ${Math.floor(elapsed / 60000)}m on ${hook.workItem?.beadId}`,
    timestamp: new Date().toISOString(),
    durable: true,
  };
  // Write escalation to .chipset/state/mail/mayor/{timestamp}-{witnessId}.json
}
```
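The escalation function above ends with a comment describing where the message is written. As an illustration of that path pattern only, the sketch below builds the file name from a message. `mailPath` is a hypothetical helper, the `AgentMessage` shape is inferred from the object literals in this document, and replacing `:` and `.` in the timestamp is an assumption made here so the name is valid on common filesystems. The actual write is left out.

```typescript
// Shape inferred from the message literals in this document; not an authoritative type.
interface AgentMessage {
  from: string;
  to: string;
  channel: 'nudge' | 'mail';
  payload: string;
  timestamp: string;
  durable: boolean;
}

// Build {stateDir}/mail/{recipient}/{timestamp}-{sender}.json for a message.
function mailPath(stateDir: string, msg: AgentMessage): string {
  // Assumption: ':' and '.' are replaced because they are awkward in file names.
  const safeStamp = msg.timestamp.replace(/[:.]/g, '-');
  const dir = stateDir.endsWith('/') ? stateDir : `${stateDir}/`;
  return `${dir}mail/${msg.to}/${safeStamp}-${msg.from}.json`;
}

const exampleEscalation: AgentMessage = {
  from: 'witness-1',
  to: 'mayor',
  channel: 'mail',
  payload: 'STALL_CRITICAL: polecat-1 idle 41m on bead-7',
  timestamp: '2024-01-01T12:00:00.000Z',
  durable: true,
};
const examplePath = mailPath('.chipset/state/', exampleEscalation);
// examplePath is '.chipset/state/mail/mayor/2024-01-01T12-00-00-000Z-witness-1.json'
```

Leading the file name with the timestamp means a plain directory listing sorts messages oldest first.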

Health Reporting

The witness provides aggregate health summaries when queried by the mayor.
```typescript
interface RigHealthReport {
  rigName: string;
  timestamp: string;
  totalAgents: number;
  activeAgents: number;
  stalledAgents: number;
  idleAgents: number;
  terminatedAgents: number;
  stalledDetails: Array<{
    agentId: string;
    beadId: string;
    stalledMinutes: number;
    nudgesSent: number;
  }>;
}
```
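To make the report shape concrete, here is a sketch of how such a summary could be aggregated from per-agent snapshots. `AgentSnapshot` and `buildHealthReport` are hypothetical and not part of the original skill. The sketch assumes that a stalled agent is counted under `stalledAgents` and not under `activeAgents`, so the four counters sum to `totalAgents`; the source does not say which convention is intended.

```typescript
interface RigHealthReport {
  rigName: string;
  timestamp: string;
  totalAgents: number;
  activeAgents: number;
  stalledAgents: number;
  idleAgents: number;
  terminatedAgents: number;
  stalledDetails: Array<{
    agentId: string;
    beadId: string;
    stalledMinutes: number;
    nudgesSent: number;
  }>;
}

// Hypothetical per-agent input; field names are illustrative.
interface AgentSnapshot {
  agentId: string;
  status: 'active' | 'idle' | 'terminated';
  beadId?: string;
  lastActivityMs: number;
  nudgesSent: number;
}

function buildHealthReport(
  rigName: string,
  agents: AgentSnapshot[],
  nowMs: number,
  stallThresholdMs: number,
): RigHealthReport {
  // An agent is stalled when it is active but silent for longer than the threshold.
  const stalled = agents.filter(
    (a) => a.status === 'active' && nowMs - a.lastActivityMs > stallThresholdMs,
  );
  const countByStatus = (s: AgentSnapshot['status']): number =>
    agents.filter((a) => a.status === s).length;

  return {
    rigName,
    timestamp: new Date(nowMs).toISOString(),
    totalAgents: agents.length,
    activeAgents: countByStatus('active') - stalled.length, // assumption: excludes stalled
    stalledAgents: stalled.length,
    idleAgents: countByStatus('idle'),
    terminatedAgents: countByStatus('terminated'),
    stalledDetails: stalled.map((a) => ({
      agentId: a.agentId,
      beadId: a.beadId ?? 'unknown',
      stalledMinutes: Math.floor((nowMs - a.lastActivityMs) / 60000),
      nudgesSent: a.nudgesSent,
    })),
  };
}
```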

Communication Protocol

Messages the Witness SENDS

| Channel | Target | Purpose | Durability |
| --- | --- | --- | --- |
| nudge | Stalled polecats | "Are you still working?" health check | Non-durable |
| mail | Mayor | Stall alerts (warning, alert, critical) | Durable |
| mail | Mayor | Health report summaries | Durable |

Messages the Witness RECEIVES

| Channel | Source | Content |
| --- | --- | --- |
| mail | Mayor | Instructions (adjust thresholds, focus on specific agent) |
| mail | Polecats | Status responses to nudges |

Error Handling

False Positive Stalls

If an agent is working but updates are slow (large commits, long test runs), the witness may detect a false positive. The nudge protocol handles this: the agent responds to the nudge, and the witness records the response as activity.
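One way to read "records the response as activity" is that the witness takes the later of the hook timestamp and the most recent nudge response when it computes elapsed time. The sketch below shows that idea in isolation. `effectiveLastActivity` is a hypothetical helper, and how responses are actually stored is not described in the source.

```typescript
// Return the most recent sign of life in epoch milliseconds.
// Returns NaN when neither timestamp can be parsed.
function effectiveLastActivity(
  hookLastActivityIso: string,
  nudgeResponseIso?: string,
): number {
  const hookMs = Date.parse(hookLastActivityIso);
  const responseMs =
    nudgeResponseIso === undefined ? NaN : Date.parse(nudgeResponseIso);
  const candidates = [hookMs, responseMs].filter((ms) => !Number.isNaN(ms));
  return candidates.length === 0 ? NaN : Math.max(...candidates);
}

// A long test run: the hook is an hour old, but the agent answered a nudge 10 minutes ago.
const exampleNow = Date.parse('2024-01-01T12:00:00Z');
const exampleLast = effectiveLastActivity(
  '2024-01-01T11:00:00Z',
  '2024-01-01T11:50:00Z',
);
const exampleElapsedMinutes = (exampleNow - exampleLast) / 60000; // 10, below the 30 minute threshold
```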

Witness Restart

If the witness itself restarts, it resumes patrol from scratch. It reads current agent and hook state from the filesystem -- there is no witness-specific state that needs recovery. The patrol loop is stateless between cycles.

Unresponsive Agent

If an agent does not respond to two nudges across two patrol cycles, the witness sends a 'critical' escalation to the mayor. The mayor decides whether to terminate and replace the agent.

Boundary: What the Witness Does NOT Do

The witness NEVER:
  • Modifies agent work -- does not edit files, change branches, or alter code
  • Resolves conflicts -- conflict resolution is outside the observer's scope
  • Terminates agents -- only the mayor can terminate; the witness recommends
  • Reassigns work -- hook management belongs to the mayor
  • Changes agent status -- the witness reads status but does not write it (except its own)
  • Runs tests or builds -- the witness observes; it does not validate output quality
The witness is a sensor. It detects anomalies and reports them. It does not act on them.
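The read-only boundary can also be made visible in code by handing the witness a narrowed view of the state layer. The sketch below is an illustration under assumptions: `FullStateAccess`, `HookView`, and `asReadOnly` are hypothetical names, and the real StateManager API is not documented here. The idea is that write methods are simply absent from the object the witness receives.

```typescript
interface HookView {
  status: string;
  lastActivity: string;
}

// Hypothetical full interface with both read and write operations.
interface FullStateAccess {
  listAgentIds(): string[];
  getHook(agentId: string): HookView | undefined;
  setHook(agentId: string, hook: HookView): void;
}

// The witness only ever sees the read operations.
type ReadOnlyStateAccess = Pick<FullStateAccess, 'listAgentIds' | 'getHook'>;

function asReadOnly(state: FullStateAccess): ReadOnlyStateAccess {
  return Object.freeze({
    listAgentIds: () => state.listAgentIds(),
    getHook: (agentId: string) => {
      const hook = state.getHook(agentId);
      // Return a copy so callers cannot mutate shared state through the view.
      return hook ? { ...hook } : undefined;
    },
  });
}
```

A witness constructed with `asReadOnly(state)` cannot call `setHook`, both at compile time (the type omits it) and at run time (the property does not exist).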

Integration with Other Gastown Skills

| Skill | Relationship |
| --- | --- |
| mayor-coordinator | Witness reports stalls and health TO mayor |
| polecat-worker | Witness monitors polecat health, sends nudges |
| refinery-merge | Witness can observe refinery queue depth and merge failures |
| beads-state | Witness reads state via StateManager (read-only) |

References

  • references/gastown-origin.md -- How this pattern derives from Gastown's witness.go patrol
  • references/boundaries.md -- Read-only constraints and observation-only scope