codex-history-ingest

Codex History Ingest — Conversation Mining


You are extracting knowledge from the user's past Codex sessions and distilling it into the Obsidian wiki. Session logs are rich but noisy: focus on durable knowledge, not operational telemetry.
This skill can be invoked directly or via the `wiki-history-ingest` router (`/wiki-history-ingest codex`).

Before You Start


  1. Read `.env` to get `OBSIDIAN_VAULT_PATH` and `CODEX_HISTORY_PATH` (default to `~/.codex` if unset)
  2. Read `.manifest.json` at the vault root to check what has already been ingested
  3. Read `index.md` at the vault root to understand what the wiki already contains
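The setup steps above can be sketched as a small helper. This is a minimal sketch, assuming a plain `KEY=VALUE` `.env` format; it is not a full dotenv parser.

```python
import os
from pathlib import Path

def load_paths(env_file=".env"):
    """Read OBSIDIAN_VAULT_PATH and CODEX_HISTORY_PATH from a simple
    KEY=VALUE .env file, defaulting CODEX_HISTORY_PATH to ~/.codex."""
    values = {}
    env = Path(env_file)
    if env.exists():
        for line in env.read_text().splitlines():
            line = line.strip()
            if line and not line.startswith("#") and "=" in line:
                key, _, val = line.partition("=")
                values[key.strip()] = val.strip().strip('"')
    vault = values.get("OBSIDIAN_VAULT_PATH") or os.environ.get("OBSIDIAN_VAULT_PATH")
    history = values.get("CODEX_HISTORY_PATH") or str(Path.home() / ".codex")
    return vault, history
```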

Ingest Modes


Append Mode (default)


Check `.manifest.json` for each source file. Only process:
  • Files not in the manifest (new session rollouts, new index files)
  • Files whose modification time is newer than `ingested_at` in the manifest
Use this mode for regular syncs.

Full Mode


Process everything regardless of the manifest. Use this mode after `wiki-rebuild` or when the user explicitly asks for a full re-ingest.

Codex Data Layout


Codex stores local artifacts under `~/.codex/`:

```
~/.codex/
├── sessions/                          # Session rollout logs by date
│   └── YYYY/MM/DD/
│       └── rollout-<timestamp>-<id>.jsonl
├── archived_sessions/                 # Archived rollout logs
├── session_index.jsonl                # Lightweight index of thread id/name/updated_at
├── history.jsonl                      # Local transcript history (if persistence enabled)
├── config.toml                        # User config (contains history settings)
└── state_*.sqlite / logs_*.sqlite     # Runtime DBs (usually skip)
```

Key data sources ranked by value


  1. `session_index.jsonl` — best inventory source for IDs, titles, and freshness
  2. `sessions/**/rollout-*.jsonl` — rich structured transcript events
  3. `history.jsonl` — useful fallback/timeline aid if enabled
Avoid ingesting SQLite internals unless the user explicitly asks.

Step 1: Survey and Compute Delta


Scan `CODEX_HISTORY_PATH` and compare against `.manifest.json`:
  • `~/.codex/session_index.jsonl`
  • `~/.codex/sessions/**/rollout-*.jsonl`
  • `~/.codex/archived_sessions/**` (optional; only if the user asks for archived history)
  • `~/.codex/history.jsonl` (optional fallback)
Classify each file:
  • New — not in the manifest
  • Modified — in the manifest but the file is newer than `ingested_at`
  • Unchanged — already ingested and unchanged
Report a concise delta summary before deep parsing.
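The survey step can be sketched roughly as follows. It assumes the manifest keeps per-file records under a top-level `files` key with `ingested_at` stored as an epoch number; both are assumptions, so adapt the lookup to the actual manifest schema.

```python
import json
from pathlib import Path

def classify_delta(history_path, manifest_path):
    """Bucket candidate source files as new / modified / unchanged by
    comparing them against per-file records in .manifest.json."""
    mp = Path(manifest_path)
    manifest = json.loads(mp.read_text()).get("files", {}) if mp.exists() else {}
    root = Path(history_path).expanduser()
    candidates = [root / "session_index.jsonl"]
    candidates += sorted(root.glob("sessions/**/rollout-*.jsonl"))
    delta = {"new": [], "modified": [], "unchanged": []}
    for f in candidates:
        if not f.exists():
            continue
        entry = manifest.get(str(f))
        if entry is None:
            delta["new"].append(f)
        elif f.stat().st_mtime > entry.get("ingested_at", 0):
            delta["modified"].append(f)
        else:
            delta["unchanged"].append(f)
    return delta
```

The returned buckets map directly onto the delta summary reported before deep parsing.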

Step 2: Parse Session Index First


`session_index.jsonl` typically has entries like:

```json
{"id":"...","thread_name":"...","updated_at":"..."}
```

Use it to:
  • Build a canonical session inventory
  • Prioritize recent/high-signal sessions
  • Map rollout IDs to human-readable thread names
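A minimal inventory loader for this step, tolerating blank or truncated trailing lines (which append-only JSONL files accumulate):

```python
import json

def load_session_inventory(index_path):
    """Read session_index.jsonl into a list of session records,
    sorted newest-first by updated_at."""
    sessions = []
    with open(index_path, encoding="utf-8") as fh:
        for line in fh:
            line = line.strip()
            if not line:
                continue
            try:
                sessions.append(json.loads(line))
            except json.JSONDecodeError:
                continue  # tolerate a truncated trailing line
    # ISO-8601 timestamps sort correctly as strings
    sessions.sort(key=lambda r: r.get("updated_at", ""), reverse=True)
    return sessions
```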

Step 3: Parse Rollout JSONL Safely


Each `rollout-*.jsonl` line is an event envelope with:

```json
{
  "timestamp": "...",
  "type": "session_meta|turn_context|event_msg|response_item",
  "payload": { ... }
}
```
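"Safely" here mostly means per-line error tolerance: one corrupt or partially written line should not abort the whole file. A sketch:

```python
import json

KEEP_TYPES = {"session_meta", "turn_context", "event_msg", "response_item"}

def iter_rollout_events(path):
    """Yield parsed event envelopes from a rollout-*.jsonl file,
    skipping blank or malformed lines instead of failing."""
    with open(path, encoding="utf-8") as fh:
        for line in fh:
            line = line.strip()
            if not line:
                continue
            try:
                event = json.loads(line)
            except json.JSONDecodeError:
                continue  # partial write or corruption; skip, don't abort
            if event.get("type") in KEEP_TYPES:
                yield event
```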

Extraction rules


  • Prioritize user intent and assistant-visible outputs
  • Favor `response_item` records with user/assistant message content
  • Use `event_msg` selectively for meaningful milestones; ignore pure telemetry
  • Treat `session_meta` as metadata (cwd, model, ids), not user knowledge
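The extraction rules can be sketched as a filter. The payload field names used here (`role`, `content`, `cwd`, `model`) are assumptions about the rollout schema, not a documented contract; verify them against real files first.

```python
def extract_messages(events):
    """Split rollout events into session metadata and user/assistant
    message content, per the extraction rules above."""
    meta, messages = {}, []
    for event in events:
        payload = event.get("payload", {})
        etype = event.get("type")
        if etype == "session_meta":
            # metadata only, never treated as user knowledge
            meta.update({k: payload[k] for k in ("cwd", "model") if k in payload})
        elif etype == "response_item":
            role = payload.get("role")
            if role in ("user", "assistant") and payload.get("content"):
                messages.append((role, payload["content"]))
    return meta, messages
```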

Skip/noise filters


  • Token accounting events
  • Tool plumbing with no semantic content
  • Raw command output unless it contains reusable decisions/patterns
  • Repeated plan snapshots unless they add novel decisions

Critical privacy filter


Rollout logs can include injected instructions, tool payloads, and sensitive text. Do not ingest verbatim system/developer prompts or secrets.
  • Remove API keys, tokens, passwords, credentials
  • Redact private identifiers unless relevant and approved
  • Summarize instead of quoting raw transcripts
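A first-pass redaction filter along these lines; the patterns are illustrative, not exhaustive, and should precede (not replace) human review:

```python
import re

# Illustrative secret shapes; real credentials take many more forms.
SECRET_PATTERNS = [
    re.compile(r"sk-[A-Za-z0-9]{20,}"),                        # OpenAI-style API keys
    re.compile(r"ghp_[A-Za-z0-9]{36}"),                        # GitHub personal tokens
    re.compile(r"(?i)(?:password|token|secret)\s*[:=]\s*\S+"), # generic assignments
]

def redact(text):
    """Replace anything matching a known secret shape with a placeholder."""
    for pattern in SECRET_PATTERNS:
        text = pattern.sub("[REDACTED]", text)
    return text
```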

Step 4: Cluster by Topic


Do not create one wiki page per session.
  • Group by stable topics across many sessions
  • Split mixed sessions into separate themes
  • Merge recurring concepts across dates/projects
  • Use `cwd` from metadata to infer project scope
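Grouping by inferred project might look like this; taking the last `cwd` path component as the project name is an assumption, and mixed-theme sessions would still need manual splitting afterwards.

```python
from collections import defaultdict
from pathlib import PurePosixPath

def group_by_project(sessions):
    """Group session summaries by project, inferring the project name
    from the last component of each session's cwd."""
    groups = defaultdict(list)
    for session in sessions:
        cwd = session.get("cwd", "")
        project = PurePosixPath(cwd).name or "unscoped"
        groups[project].append(session)
    return dict(groups)
```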

Step 5: Distill into Wiki Pages


Route extracted knowledge using existing wiki conventions:
  • Project-specific architecture/process -> `projects/<name>/...`
  • General concepts -> `concepts/`
  • Recurring techniques/debug playbooks -> `skills/`
  • Tools/services -> `entities/`
  • Cross-session patterns -> `synthesis/`
For each impacted project, create/update `projects/<name>/<name>.md` (project name as filename, never `_project.md`).
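The routing table above, as a hypothetical helper (`page_path` is illustrative, not part of the skill's interface):

```python
from pathlib import Path

def page_path(vault, category, slug, project=None):
    """Resolve where a distilled page belongs under the vault, mirroring
    the routing conventions above. Project pages use the project name as
    the filename, never _project.md."""
    routes = {"concept": "concepts", "skill": "skills",
              "entity": "entities", "synthesis": "synthesis"}
    if category == "project":
        return Path(vault) / "projects" / project / f"{project}.md"
    return Path(vault) / routes[category] / f"{slug}.md"
```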

Writing rules


  • Distill knowledge, not chronology
  • Avoid "on date X we discussed..." unless date context is essential
  • Add `summary:` frontmatter on each new/updated page (1-2 sentences, <= 200 chars)
  • Add provenance markers:
    • `^[extracted]` when directly grounded in explicit session content
    • `^[inferred]` when synthesizing patterns across events/sessions
    • `^[ambiguous]` when sessions conflict
  • Add/update the `provenance:` frontmatter mix for each changed page
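A page header following these rules might look like the sketch below. The exact shape of the `provenance:` mix is not specified here, so the mapping format (and the counts) are assumptions for illustration only.

```markdown
---
summary: Debugging playbook for flaky JSONL ingest runs, covering retries and manifest repair.
provenance: {extracted: 6, inferred: 2, ambiguous: 1}
---

Retry the manifest write before re-parsing the rollout. ^[extracted]
```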

Step 6: Update Manifest, Log, and Index


Update `.manifest.json`


For each processed source file:
  • `ingested_at`, `size_bytes`, `modified_at`
  • `source_type`: `codex_rollout` | `codex_index` | `codex_history`
  • `project`: inferred project name (when applicable)
  • `pages_created`, `pages_updated`
Add/update a top-level project/session summary block:

```json
{
  "project-name": {
    "source_path": "~/.codex/sessions/...",
    "last_ingested": "TIMESTAMP",
    "sessions_ingested": 12,
    "sessions_total": 40,
    "index_updated_at": "TIMESTAMP"
  }
}
```
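Recording one processed file could look like this; the top-level `files` key and numeric timestamps are assumptions about the manifest layout, chosen so that modification-time comparisons in append mode stay trivial.

```python
import json
import time
from pathlib import Path

def record_ingest(manifest_path, source, source_type, project,
                  pages_created, pages_updated):
    """Write or refresh one per-file entry in .manifest.json."""
    mp = Path(manifest_path)
    manifest = json.loads(mp.read_text()) if mp.exists() else {}
    stat = Path(source).stat()
    manifest.setdefault("files", {})[str(source)] = {
        "ingested_at": time.time(),
        "size_bytes": stat.st_size,
        "modified_at": stat.st_mtime,
        "source_type": source_type,  # codex_rollout | codex_index | codex_history
        "project": project,
        "pages_created": pages_created,
        "pages_updated": pages_updated,
    }
    mp.write_text(json.dumps(manifest, indent=2))
```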

Update special files


Update `index.md` and `log.md`:

```
- [TIMESTAMP] CODEX_HISTORY_INGEST sessions=N pages_updated=X pages_created=Y mode=append|full
```

Privacy and Compliance


  • Distill and synthesize; avoid raw transcript dumps
  • Default to redaction for anything that looks sensitive
  • Ask the user before storing personal/sensitive details
  • Keep references to other people minimal and purpose-bound

Reference


See `references/codex-data-format.md` for field-level parsing notes and extraction guidance.