witness-observer

Witness Observer

Per-rig observer that monitors polecat health and reports anomalies. The witness is the PMU (Performance Monitoring Unit) of the Gastown chipset -- it watches execution units for stalls, detects degraded performance, and raises alerts without interfering with computation. The witness is strictly read-only with respect to agent work. It observes and reports; it never modifies.

Activation Triggers

This skill activates when:

- The agent is assigned to monitor a rig's worker agents
- Multiple polecats are running and health monitoring is needed
- Stall detection is required for long-running work items
- The mayor needs a supervisory agent to watch active polecats

Core Capabilities

Patrol Loop

The witness runs a periodic patrol that checks all active agents in its rig for health indicators.

Patrol cycle:

SCAN            EVALUATE         ACT              WAIT
  |                |               |                |
  v                v               v                v
list agents -> check each  ->  nudge/escalate -> sleep interval
(active ones)   for stalls     if needed         (default 5 min)

Implementation:

```typescript
const state = new StateManager({ stateDir: '.chipset/state/' });
const patrolInterval = 5 * 60 * 1000; // 5 minutes (configurable)
const stallThreshold = 30 * 60 * 1000; // 30 minutes (configurable)

async function patrol(): Promise<void> {
  // Get all agents that should be working
  const agents = await state.listAgents({ role: 'polecat' });
  const active = agents.filter(a => a.status === 'active');

  for (const agent of active) {
    const hook = await state.getHook(agent.id);
    if (!hook || hook.status !== 'active') continue;

    // Check last activity timestamp
    const lastActivity = new Date(hook.lastActivity).getTime();
    const elapsed = Date.now() - lastActivity;

    if (elapsed > stallThreshold) {
      await handleStall(agent, hook, elapsed);
    }
  }
}
```
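
The `patrol()` function above covers SCAN, EVALUATE and ACT for one cycle, but the WAIT step and the use of `patrolInterval` are not shown. A minimal sketch of the outer loop could look like the following. The names `runPatrolLoop`, `sleep`, `shouldStop` and `wait` are hypothetical and do not come from the source; the choice to log and continue after a failed cycle is also an assumption.

```typescript
// Hypothetical helper: resolves after `ms` milliseconds.
const sleep = (ms: number): Promise<void> =>
  new Promise<void>(resolve => setTimeout(resolve, ms));

// Runs one patrol cycle, waits, and repeats until `shouldStop` returns true.
// Returns the number of cycles that were started.
async function runPatrolLoop(
  patrolOnce: () => Promise<void>,
  intervalMs: number,
  shouldStop: () => boolean,
  wait: (ms: number) => Promise<void> = sleep, // injectable so tests need not really sleep
): Promise<number> {
  let cycles = 0;
  while (!shouldStop()) {
    try {
      await patrolOnce(); // SCAN -> EVALUATE -> ACT
    } catch (err) {
      // Assumption: one failed cycle should not stop the observer.
      console.error('patrol cycle failed:', err);
    }
    cycles++;
    if (shouldStop()) break;
    await wait(intervalMs); // WAIT
  }
  return cycles;
}
```

Under these assumptions the witness would start with something like `runPatrolLoop(patrol, patrolInterval, () => false)`. Awaiting each cycle before sleeping means cycles never overlap, even if a scan takes longer than the interval.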

Stall Detection

A stall is detected when an agent has hooked work but has not updated its activity timestamp within the threshold period (default 30 minutes).

Stall indicators:

| Indicator | What It Means |
| --- | --- |
| Hook active, no activity for 30+ min | Agent may be stuck, crashed, or idle |
| Agent status is 'active' but hook timestamp stale | Session may have ended without cleanup |
| Multiple consecutive patrol cycles with no change | Persistent stall, needs escalation |

Stall classification:

```typescript
type StallSeverity = 'warning' | 'alert' | 'critical';

// elapsed is in milliseconds; nudgesSent is the number of nudges already sent to this agent.
function classifyStall(elapsed: number, nudgesSent: number): StallSeverity {
  if (nudgesSent >= 2) return 'critical';        // Nudged twice, still stalled
  if (elapsed > 60 * 60 * 1000) return 'alert';  // Over 1 hour, fewer than 2 nudges
  return 'warning';                              // Under 1 hour, fewer than 2 nudges
}
```
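
To see how the classification plays out over successive patrols, the sketch below replays it for an agent that never responds. `simulateSilentAgent` is a hypothetical helper written for illustration only; it assumes, as in `handleStall`, that a nudge is recorded only when the severity is `'warning'`.

```typescript
type StallSeverity = 'warning' | 'alert' | 'critical';

// Same rules as classifyStall above, repeated so the example is self-contained.
function classifyStall(elapsed: number, nudgesSent: number): StallSeverity {
  if (nudgesSent >= 2) return 'critical';
  if (elapsed > 60 * 60 * 1000) return 'alert';
  return 'warning';
}

const MINUTE = 60 * 1000;

// Replays `cycles` patrols for an agent that stays silent.
// firstDetectionMin: minutes of inactivity at the first patrol that sees the stall.
// patrolMin: minutes between patrols.
function simulateSilentAgent(
  firstDetectionMin: number,
  patrolMin: number,
  cycles: number,
): string[] {
  const log: string[] = [];
  let nudgesSent = 0;
  for (let i = 0; i < cycles; i++) {
    const elapsedMin = firstDetectionMin + i * patrolMin;
    const severity = classifyStall(elapsedMin * MINUTE, nudgesSent);
    log.push(`${elapsedMin}m: ${severity}`);
    if (severity === 'warning') nudgesSent++; // only warnings produce a nudge
  }
  return log;
}

console.log(simulateSilentAgent(31, 5, 4));
// [ '31m: warning', '36m: warning', '41m: critical', '46m: critical' ]
```

With the default 5-minute patrol and 30-minute threshold, a silent agent goes from `warning` straight to `critical` after two nudges. In this sketch `'alert'` is only reached when a stall is already over an hour old with fewer than two nudges recorded, for example `classifyStall(61 * MINUTE, 0)`.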

Nudge Protocol

When a stall is detected, the witness follows a graduated escalation protocol.

Step 1 -- Send nudge to stalled agent:

```typescript
async function handleStall(
  agent: AgentIdentity,
  hook: HookState,
  elapsed: number
): Promise<void> {
  const severity = classifyStall(elapsed, getNudgeCount(agent.id));

  if (severity === 'warning') {
    // First nudge: ask agent if it's still working
    const nudge: AgentMessage = {
      from: witnessId,
      to: agent.id,
      channel: 'nudge',
      payload: `HEALTH_CHECK: no activity for ${Math.floor(elapsed / 60000)}m on ${hook.workItem?.beadId}`,
      timestamp: new Date().toISOString(),
      durable: false,
    };
    // Write nudge file
    recordNudge(agent.id);
    return;
  }

  if (severity === 'alert' || severity === 'critical') {
    // Escalate to mayor
    await escalateToMayor(agent, hook, severity, elapsed);
  }
}
```

Step 2 -- Wait for response (next patrol cycle):

If the agent responds to the nudge (updates its hook activity timestamp or sends mail), the stall is resolved. No further action needed.
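
`handleStall` calls `getNudgeCount` and `recordNudge`, but the source does not show them or say where the counts are kept. One possible in-memory sketch follows. The two maps and the `noteActivity` helper are assumptions made for illustration, as is the rule that a changed hook timestamp clears the nudge count.

```typescript
// Assumption: nudge counts live in memory, keyed by agent id.
const nudgeCounts = new Map<string, number>();
// Last hook activity timestamp seen for each agent on a previous patrol.
const lastSeenActivity = new Map<string, string>();

function getNudgeCount(agentId: string): number {
  return nudgeCounts.get(agentId) ?? 0;
}

function recordNudge(agentId: string): void {
  nudgeCounts.set(agentId, getNudgeCount(agentId) + 1);
}

// Hypothetical helper: call once per agent per patrol, before classifying a stall.
// Returns true when the hook timestamp changed since the previous patrol,
// which this sketch treats as a response to the nudge.
function noteActivity(agentId: string, lastActivity: string): boolean {
  const previous = lastSeenActivity.get(agentId);
  lastSeenActivity.set(agentId, lastActivity);
  const advanced = previous !== undefined && previous !== lastActivity;
  if (advanced) nudgeCounts.delete(agentId); // stall resolved, start counting afresh
  return advanced;
}
```

Counts held this way reset to zero if the witness restarts, so an agent that was one nudge away from `critical` would be nudged again from the start. Whether that is acceptable depends on how the real implementation stores nudge history, which the source does not specify.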

Step 3 -- Escalate if unresolved:

```typescript
async function escalateToMayor(
  agent: AgentIdentity,
  hook: HookState,
  severity: StallSeverity,
  elapsed: number
): Promise<void> {
  const escalation: AgentMessage = {
    from: witnessId,
    to: 'mayor',
    channel: 'mail',
    payload: `STALL_${severity.toUpperCase()}: ${agent.id} idle ${Math.floor(elapsed / 60000)}m on ${hook.workItem?.beadId}`,
    timestamp: new Date().toISOString(),
    durable: true,
  };
  // Write escalation to .chipset/state/mail/mayor/{timestamp}-{witnessId}.json
}
```
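
The final comment in `escalateToMayor` names a mail path but leaves the write itself out. The sketch below shows only how that path and file body might be derived. The `AgentMessage` shape is inferred from the object literals above, `mailPath` and `serializeMessage` are hypothetical names, and replacing colons in the timestamp is an assumption made so the filename is valid on more filesystems.

```typescript
// Shape inferred from the literals in handleStall and escalateToMayor.
interface AgentMessage {
  from: string;
  to: string;
  channel: 'nudge' | 'mail';
  payload: string;
  timestamp: string; // ISO 8601
  durable: boolean;
}

// Builds {stateDir}/mail/{to}/{timestamp}-{from}.json for a message.
function mailPath(stateDir: string, msg: AgentMessage): string {
  const dir = stateDir.endsWith('/') ? stateDir : stateDir + '/';
  // Assumption: colons are swapped for dashes to keep the filename portable.
  const safeTimestamp = msg.timestamp.replace(/:/g, '-');
  return `${dir}mail/${msg.to}/${safeTimestamp}-${msg.from}.json`;
}

// File body: the message as pretty-printed JSON.
function serializeMessage(msg: AgentMessage): string {
  return JSON.stringify(msg, null, 2);
}
```

Because the timestamp leads the filename, a directory listing sorts escalations chronologically, and including the sender id keeps two witnesses from colliding on the same instant.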

Health Reporting

The witness provides aggregate health summaries when queried by the mayor.

```typescript
interface RigHealthReport {
  rigName: string;
  timestamp: string;
  totalAgents: number;
  activeAgents: number;
  stalledAgents: number;
  idleAgents: number;
  terminatedAgents: number;
  stalledDetails: Array<{
    agentId: string;
    beadId: string;
    stalledMinutes: number;
    nudgesSent: number;
  }>;
}
```
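
The source defines the report shape but not how it is filled in. The following sketch shows one way to aggregate it from a snapshot of agents. `AgentSnapshot` and `buildReport` are hypothetical, the status values `'idle'` and `'terminated'` are inferred from the report's field names, and counting a stalled agent under `stalledAgents` rather than `activeAgents` is an assumption chosen so that the four buckets add up to `totalAgents`.

```typescript
interface RigHealthReport {
  rigName: string;
  timestamp: string;
  totalAgents: number;
  activeAgents: number;
  stalledAgents: number;
  idleAgents: number;
  terminatedAgents: number;
  stalledDetails: Array<{
    agentId: string;
    beadId: string;
    stalledMinutes: number;
    nudgesSent: number;
  }>;
}

// Hypothetical per-agent input, flattened from agent and hook state.
interface AgentSnapshot {
  agentId: string;
  status: 'active' | 'idle' | 'terminated';
  beadId?: string;       // set when the agent has hooked work
  lastActivity?: string; // ISO timestamp from the hook
  nudgesSent: number;
}

function buildReport(
  rigName: string,
  agents: AgentSnapshot[],
  now: Date,
  stallThresholdMs: number,
): RigHealthReport {
  const stalledDetails: RigHealthReport['stalledDetails'] = [];
  let active = 0;
  let idle = 0;
  let terminated = 0;

  for (const a of agents) {
    if (a.status === 'idle') { idle++; continue; }
    if (a.status === 'terminated') { terminated++; continue; }
    // A missing timestamp yields NaN, which never exceeds the threshold.
    const elapsed = a.lastActivity
      ? now.getTime() - new Date(a.lastActivity).getTime()
      : NaN;
    if (a.beadId !== undefined && elapsed > stallThresholdMs) {
      stalledDetails.push({
        agentId: a.agentId,
        beadId: a.beadId,
        stalledMinutes: Math.floor(elapsed / 60000),
        nudgesSent: a.nudgesSent,
      });
    } else {
      active++;
    }
  }

  return {
    rigName,
    timestamp: now.toISOString(),
    totalAgents: agents.length,
    activeAgents: active,
    stalledAgents: stalledDetails.length,
    idleAgents: idle,
    terminatedAgents: terminated,
    stalledDetails,
  };
}
```

The function only reads its inputs and returns a new object, which fits the witness's read-only role.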

Communication Protocol

Messages the Witness SENDS

| Channel | Target | Purpose | Durability |
| --- | --- | --- | --- |
| `nudge` | Stalled polecats | "Are you still working?" health check | Non-durable |
| `mail` | Mayor | Stall escalations (alert, critical); warnings go to the polecat as a nudge | Durable |
| `mail` | Mayor | Health report summaries | Durable |

Messages the Witness RECEIVES

| Channel | Source | Content |
| --- | --- | --- |
| `mail` | Mayor | Instructions (adjust thresholds, focus on specific agent) |
| `mail` | Polecats | Status responses to nudges |

Error Handling

False Positive Stalls

If an agent is working but updates are slow (large commits, long test runs), the witness may detect a false positive. The nudge protocol handles this: the agent responds to the nudge, and the witness records the response as activity.

Witness Restart

If the witness itself restarts, it resumes patrol from scratch. It reads current agent and hook state from the filesystem -- there is no witness-specific state that needs recovery. The patrol loop is stateless between cycles.

Unresponsive Agent

If an agent does not respond to two nudges across two patrol cycles, the witness sends a `critical` escalation to the mayor. The mayor decides whether to terminate and replace the agent.

Boundary: What the Witness Does NOT Do

The witness NEVER:

- Modifies agent work -- does not edit files, change branches, or alter code
- Resolves conflicts -- conflict resolution is outside the observer's scope
- Terminates agents -- only the mayor can terminate; the witness recommends
- Reassigns work -- hook management belongs to the mayor
- Changes agent status -- the witness reads status but does not write it (except its own)
- Runs tests or builds -- the witness observes; it does not validate output quality

The witness is a sensor. It detects anomalies and reports them. It does not act on them.

Integration with Other Gastown Skills

| Skill | Relationship |
| --- | --- |
| `mayor-coordinator` | Witness reports stalls and health TO mayor |
| `polecat-worker` | Witness monitors polecat health, sends nudges |
| `refinery-merge` | Witness can observe refinery queue depth and merge failures |
| `beads-state` | Witness reads state via StateManager (read-only) |

References

- `references/gastown-origin.md` -- How this pattern derives from Gastown's witness.go patrol
- `references/boundaries.md` -- Read-only constraints and observation-only scope
