codex-history-ingest

Codex History Ingest — Conversation Mining


You are extracting knowledge from the user's past Codex sessions and distilling it into the Obsidian wiki. Session logs are rich but noisy: focus on durable knowledge, not operational telemetry.
This skill can be invoked directly or via the `wiki-history-ingest` router (`/wiki-history-ingest codex`).

Before You Start


  1. Read `.env` to get `OBSIDIAN_VAULT_PATH` and `CODEX_HISTORY_PATH` (default to `~/.codex` if unset)
  2. Read `.manifest.json` at the vault root to check what has already been ingested
  3. Read `index.md` at the vault root to understand what the wiki already contains
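The setup steps above can be sketched as a small helper. This is a minimal sketch, assuming a plain `KEY=VALUE` `.env` format; it is not a full dotenv parser.

```python
import os
from pathlib import Path

def load_paths(env_file=".env"):
    """Read OBSIDIAN_VAULT_PATH and CODEX_HISTORY_PATH from a simple
    KEY=VALUE .env file, defaulting CODEX_HISTORY_PATH to ~/.codex."""
    values = {}
    env = Path(env_file)
    if env.exists():
        for line in env.read_text().splitlines():
            line = line.strip()
            if line and not line.startswith("#") and "=" in line:
                key, _, val = line.partition("=")
                values[key.strip()] = val.strip().strip('"')
    vault = values.get("OBSIDIAN_VAULT_PATH") or os.environ.get("OBSIDIAN_VAULT_PATH")
    history = values.get("CODEX_HISTORY_PATH") or str(Path.home() / ".codex")
    return vault, history
```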

Ingest Modes


Append Mode (default)


Check `.manifest.json` for each source file. Only process:
  • Files not in the manifest (new session rollouts, new index files)
  • Files whose modification time is newer than `ingested_at` in the manifest
Use this mode for regular syncs.

Full Mode


Process everything regardless of the manifest. Use this mode after `wiki-rebuild` or when the user explicitly asks for a full re-ingest.

Codex Data Layout


Codex stores local artifacts under `~/.codex/`:

```
~/.codex/
├── sessions/                          # Session rollout logs by date
│   └── YYYY/MM/DD/
│       └── rollout-<timestamp>-<id>.jsonl
├── archived_sessions/                 # Archived rollout logs
├── session_index.jsonl                # Lightweight index of thread id/name/updated_at
├── history.jsonl                      # Local transcript history (if persistence enabled)
├── config.toml                        # User config (contains history settings)
└── state_*.sqlite / logs_*.sqlite     # Runtime DBs (usually skip)
```

Key data sources ranked by value


  1. `session_index.jsonl` — best inventory source for IDs, titles, and freshness
  2. `sessions/**/rollout-*.jsonl` — rich structured transcript events
  3. `history.jsonl` — useful fallback/timeline aid if enabled
Avoid ingesting SQLite internals unless the user explicitly asks.

Step 1: Survey and Compute Delta


Scan `CODEX_HISTORY_PATH` and compare against `.manifest.json`:
  • `~/.codex/session_index.jsonl`
  • `~/.codex/sessions/**/rollout-*.jsonl`
  • `~/.codex/archived_sessions/**` (optional; only if the user asks for archived history)
  • `~/.codex/history.jsonl` (optional fallback)
Classify each file:
  • New — not in the manifest
  • Modified — in the manifest but the file is newer than `ingested_at`
  • Unchanged — already ingested and unchanged
Report a concise delta summary before deep parsing.
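The survey step can be sketched roughly as follows. It assumes the manifest keeps per-file records under a top-level `files` key with `ingested_at` stored as an epoch number; both are assumptions, so adapt the lookup to the actual manifest schema.

```python
import json
from pathlib import Path

def classify_delta(history_path, manifest_path):
    """Bucket candidate source files as new / modified / unchanged by
    comparing them against per-file records in .manifest.json."""
    mp = Path(manifest_path)
    manifest = json.loads(mp.read_text()).get("files", {}) if mp.exists() else {}
    root = Path(history_path).expanduser()
    candidates = [root / "session_index.jsonl"]
    candidates += sorted(root.glob("sessions/**/rollout-*.jsonl"))
    delta = {"new": [], "modified": [], "unchanged": []}
    for f in candidates:
        if not f.exists():
            continue
        entry = manifest.get(str(f))
        if entry is None:
            delta["new"].append(f)
        elif f.stat().st_mtime > entry.get("ingested_at", 0):
            delta["modified"].append(f)
        else:
            delta["unchanged"].append(f)
    return delta
```

The returned buckets map directly onto the delta summary reported before deep parsing.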

Step 2: Parse Session Index First


`session_index.jsonl` typically has entries like:

```json
{"id":"...","thread_name":"...","updated_at":"..."}
```

Use it to:
  • Build a canonical session inventory
  • Prioritize recent/high-signal sessions
  • Map rollout IDs to human-readable thread names
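A minimal inventory loader for this step, tolerating blank or truncated trailing lines (which append-only JSONL files accumulate):

```python
import json

def load_session_inventory(index_path):
    """Read session_index.jsonl into a list of session records,
    sorted newest-first by updated_at."""
    sessions = []
    with open(index_path, encoding="utf-8") as fh:
        for line in fh:
            line = line.strip()
            if not line:
                continue
            try:
                sessions.append(json.loads(line))
            except json.JSONDecodeError:
                continue  # tolerate a truncated trailing line
    # ISO-8601 timestamps sort correctly as strings
    sessions.sort(key=lambda r: r.get("updated_at", ""), reverse=True)
    return sessions
```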

Step 3: Parse Rollout JSONL Safely


Each `rollout-*.jsonl` line is an event envelope with:

```json
{
  "timestamp": "...",
  "type": "session_meta|turn_context|event_msg|response_item",
  "payload": { ... }
}
```
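"Safely" here mostly means per-line error tolerance: one corrupt or partially written line should not abort the whole file. A sketch:

```python
import json

KEEP_TYPES = {"session_meta", "turn_context", "event_msg", "response_item"}

def iter_rollout_events(path):
    """Yield parsed event envelopes from a rollout-*.jsonl file,
    skipping blank or malformed lines instead of failing."""
    with open(path, encoding="utf-8") as fh:
        for line in fh:
            line = line.strip()
            if not line:
                continue
            try:
                event = json.loads(line)
            except json.JSONDecodeError:
                continue  # partial write or corruption; skip, don't abort
            if event.get("type") in KEEP_TYPES:
                yield event
```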

Extraction rules


  • Prioritize user intent and assistant-visible outputs
  • Favor `response_item` records with user/assistant message content
  • Use `event_msg` selectively for meaningful milestones; ignore pure telemetry
  • Treat `session_meta` as metadata (cwd, model, ids), not user knowledge
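The extraction rules can be sketched as a filter. The payload field names used here (`role`, `content`, `cwd`, `model`) are assumptions about the rollout schema, not a documented contract; verify them against real files first.

```python
def extract_messages(events):
    """Split rollout events into session metadata and user/assistant
    message content, per the extraction rules above."""
    meta, messages = {}, []
    for event in events:
        payload = event.get("payload", {})
        etype = event.get("type")
        if etype == "session_meta":
            # metadata only, never treated as user knowledge
            meta.update({k: payload[k] for k in ("cwd", "model") if k in payload})
        elif etype == "response_item":
            role = payload.get("role")
            if role in ("user", "assistant") and payload.get("content"):
                messages.append((role, payload["content"]))
    return meta, messages
```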

Skip/noise filters


  • Token accounting events
  • Tool plumbing with no semantic content
  • Raw command output unless it contains reusable decisions/patterns
  • Repeated plan snapshots unless they add novel decisions

Critical privacy filter


Rollout logs can include injected instructions, tool payloads, and sensitive text. Do not ingest verbatim system/developer prompts or secrets.
  • Remove API keys, tokens, passwords, credentials
  • Redact private identifiers unless relevant and approved
  • Summarize instead of quoting raw transcripts
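A first-pass redaction filter along these lines; the patterns are illustrative, not exhaustive, and should precede (not replace) human review:

```python
import re

# Illustrative secret shapes; real credentials take many more forms.
SECRET_PATTERNS = [
    re.compile(r"sk-[A-Za-z0-9]{20,}"),                        # OpenAI-style API keys
    re.compile(r"ghp_[A-Za-z0-9]{36}"),                        # GitHub personal tokens
    re.compile(r"(?i)(?:password|token|secret)\s*[:=]\s*\S+"), # generic assignments
]

def redact(text):
    """Replace anything matching a known secret shape with a placeholder."""
    for pattern in SECRET_PATTERNS:
        text = pattern.sub("[REDACTED]", text)
    return text
```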

Step 4: Cluster by Topic


Do not create one wiki page per session.
  • Group by stable topics across many sessions
  • Split mixed sessions into separate themes
  • Merge recurring concepts across dates/projects
  • Use `cwd` from metadata to infer project scope
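Grouping by inferred project might look like this; taking the last `cwd` path component as the project name is an assumption, and mixed-theme sessions would still need manual splitting afterwards.

```python
from collections import defaultdict
from pathlib import PurePosixPath

def group_by_project(sessions):
    """Group session summaries by project, inferring the project name
    from the last component of each session's cwd."""
    groups = defaultdict(list)
    for session in sessions:
        cwd = session.get("cwd", "")
        project = PurePosixPath(cwd).name or "unscoped"
        groups[project].append(session)
    return dict(groups)
```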

Step 5: Distill into Wiki Pages


Route extracted knowledge using existing wiki conventions:
  • Project-specific architecture/process -> `projects/<name>/...`
  • General concepts -> `concepts/`
  • Recurring techniques/debug playbooks -> `skills/`
  • Tools/services -> `entities/`
  • Cross-session patterns -> `synthesis/`
For each impacted project, create/update `projects/<name>/<name>.md` (project name as filename, never `_project.md`).
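The routing table above, as a hypothetical helper (`page_path` is illustrative, not part of the skill's interface):

```python
from pathlib import Path

def page_path(vault, category, slug, project=None):
    """Resolve where a distilled page belongs under the vault, mirroring
    the routing conventions above. Project pages use the project name as
    the filename, never _project.md."""
    routes = {"concept": "concepts", "skill": "skills",
              "entity": "entities", "synthesis": "synthesis"}
    if category == "project":
        return Path(vault) / "projects" / project / f"{project}.md"
    return Path(vault) / routes[category] / f"{slug}.md"
```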

Writing rules


  • Distill knowledge, not chronology
  • Avoid "on date X we discussed..." unless date context is essential
  • Add `summary:` frontmatter on each new/updated page (1-2 sentences, <= 200 chars)
  • Add provenance markers:
    • `^[extracted]` when directly grounded in explicit session content
    • `^[inferred]` when synthesizing patterns across events/sessions
    • `^[ambiguous]` when sessions conflict
  • Add/update the `provenance:` frontmatter mix for each changed page
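A page header following these rules might look like the sketch below. The exact shape of the `provenance:` mix is not specified here, so the mapping format (and the counts) are assumptions for illustration only.

```markdown
---
summary: Debugging playbook for flaky JSONL ingest runs, covering retries and manifest repair.
provenance: {extracted: 6, inferred: 2, ambiguous: 1}
---

Retry the manifest write before re-parsing the rollout. ^[extracted]
```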

Step 6: Update Manifest, Log, and Index


Update `.manifest.json`


For each processed source file:
  • `ingested_at`, `size_bytes`, `modified_at`
  • `source_type`: `codex_rollout` | `codex_index` | `codex_history`
  • `project`: inferred project name (when applicable)
  • `pages_created`, `pages_updated`
Add/update a top-level project/session summary block:

```json
{
  "project-name": {
    "source_path": "~/.codex/sessions/...",
    "last_ingested": "TIMESTAMP",
    "sessions_ingested": 12,
    "sessions_total": 40,
    "index_updated_at": "TIMESTAMP"
  }
}
```
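Recording one processed file could look like this; the top-level `files` key and numeric timestamps are assumptions about the manifest layout, chosen so that modification-time comparisons in append mode stay trivial.

```python
import json
import time
from pathlib import Path

def record_ingest(manifest_path, source, source_type, project,
                  pages_created, pages_updated):
    """Write or refresh one per-file entry in .manifest.json."""
    mp = Path(manifest_path)
    manifest = json.loads(mp.read_text()) if mp.exists() else {}
    stat = Path(source).stat()
    manifest.setdefault("files", {})[str(source)] = {
        "ingested_at": time.time(),
        "size_bytes": stat.st_size,
        "modified_at": stat.st_mtime,
        "source_type": source_type,  # codex_rollout | codex_index | codex_history
        "project": project,
        "pages_created": pages_created,
        "pages_updated": pages_updated,
    }
    mp.write_text(json.dumps(manifest, indent=2))
```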

Update special files


Update `index.md` and `log.md`:

```
- [TIMESTAMP] CODEX_HISTORY_INGEST sessions=N pages_updated=X pages_created=Y mode=append|full
```

Privacy and Compliance


  • Distill and synthesize; avoid raw transcript dumps
  • Default to redaction for anything that looks sensitive
  • Ask the user before storing personal/sensitive details
  • Keep references to other people minimal and purpose-bound

Reference


See `references/codex-data-format.md` for field-level parsing notes and extraction guidance.