codex-history-ingest
Codex History Ingest: Conversation Mining
You are extracting knowledge from the user's past Codex sessions and distilling it into the Obsidian wiki. Session logs are rich but noisy: focus on durable knowledge, not operational telemetry.
This skill can be invoked directly or via the `wiki-history-ingest` router (`/wiki-history-ingest codex`).
Before You Start
- Read `.env` to get `OBSIDIAN_VAULT_PATH` and `CODEX_HISTORY_PATH` (default to `~/.codex` if unset)
- Read `.manifest.json` at the vault root to check what has already been ingested
- Read `index.md` at the vault root to understand what the wiki already contains
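The configuration lookup above can be sketched as follows. This is a minimal illustration, assuming `.env` holds plain `KEY=VALUE` lines; the helper names `load_env` and `resolve_paths` are hypothetical, and only the `~/.codex` default comes from the list above.

```python
from pathlib import Path


def load_env(text: str) -> dict[str, str]:
    """Parse simple KEY=VALUE lines; blank lines and # comments are skipped."""
    values: dict[str, str] = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        # Strip surrounding quotes, which many .env files use (assumption).
        values[key.strip()] = value.strip().strip("'\"")
    return values


def resolve_paths(env: dict[str, str]) -> tuple[Path, Path]:
    """Return (vault_path, history_path); only the history path has a default."""
    vault = env.get("OBSIDIAN_VAULT_PATH")
    if not vault:
        raise ValueError("OBSIDIAN_VAULT_PATH is not set")
    history = env.get("CODEX_HISTORY_PATH") or "~/.codex"
    return Path(vault).expanduser(), Path(history).expanduser()
```

Failing loudly when the vault path is missing is a design choice of this sketch; the skill itself only specifies a default for the history path.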
Ingest Modes
Append Mode (default)
Check `.manifest.json` for each source file. Only process:
- Files not in the manifest (new session rollouts, new index files)
- Files whose modification time is newer than `ingested_at` in the manifest
Use this mode for regular syncs.
Full Mode
Process everything regardless of manifest. Use after `wiki-rebuild` or if the user explicitly asks for a full re-ingest.
Codex Data Layout
Codex stores local artifacts under `~/.codex/`.
```
~/.codex/
├── sessions/                          # Session rollout logs by date
│   └── YYYY/MM/DD/
│       └── rollout-<timestamp>-<id>.jsonl
├── archived_sessions/                 # Archived rollout logs
├── session_index.jsonl                # Lightweight index of thread id/name/updated_at
├── history.jsonl                      # Local transcript history (if persistence enabled)
├── config.toml                        # User config (contains history settings)
└── state_*.sqlite / logs_*.sqlite     # Runtime DBs (usually skip)
```
Key data sources ranked by value
- `session_index.jsonl`: best inventory source for IDs, titles, and freshness
- `sessions/**/rollout-*.jsonl`: rich structured transcript events
- `history.jsonl`: useful fallback/timeline aid if enabled
Avoid ingesting SQLite internals unless the user explicitly asks.
Step 1: Survey and Compute Delta
Scan `CODEX_HISTORY_PATH` and compare against `.manifest.json`:
- `~/.codex/session_index.jsonl`
- `~/.codex/sessions/**/rollout-*.jsonl`
- `~/.codex/archived_sessions/**` (optional; only if user asks for archived history)
- `~/.codex/history.jsonl` (optional fallback)
Classify each file:
- New: not in manifest
- Modified: in manifest but file is newer than `ingested_at`
- Unchanged: already ingested and unchanged
Report a concise delta summary before deep parsing.
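The classification above could be sketched like this. It is an illustration under stated assumptions: the manifest is taken to map each source path to a record, and `ingested_at` is taken to be an ISO 8601 string with a UTC offset. Neither detail is confirmed by this document, and `classify` and `summarize` are hypothetical names.

```python
from datetime import datetime, timezone
from typing import Optional


def classify(mtime: float, entry: Optional[dict]) -> str:
    """Classify one source file as 'new', 'modified', or 'unchanged'.

    mtime is the file's modification time in epoch seconds; entry is the
    file's manifest record, or None when the manifest has no record.
    """
    if entry is None:
        return "new"
    # Assumption: ingested_at carries an offset, e.g. 2024-01-02T00:00:00+00:00.
    ingested_at = datetime.fromisoformat(entry["ingested_at"])
    modified = datetime.fromtimestamp(mtime, tz=timezone.utc)
    return "modified" if modified > ingested_at else "unchanged"


def summarize(files: dict[str, float], manifest: dict[str, dict]) -> dict[str, list[str]]:
    """Group source paths by classification for the delta report."""
    delta: dict[str, list[str]] = {"new": [], "modified": [], "unchanged": []}
    for path, mtime in sorted(files.items()):
        delta[classify(mtime, manifest.get(path))].append(path)
    return delta
```

The lengths of the three lists are enough for the concise delta summary; the paths themselves feed the later parsing steps.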
Step 2: Parse Session Index First
Each line of `session_index.jsonl` is a JSON record:
```json
{"id":"...","thread_name":"...","updated_at":"..."}
```
Use it to:
- Build a canonical session inventory
- Prioritize recent/high-signal sessions
- Map rollout IDs to human-readable thread names
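A minimal sketch of building that inventory, assuming `updated_at` is a string in a consistent ISO 8601 format so that plain string comparison orders records chronologically. Keeping the latest record per `id` is a defensive choice of this sketch, not a documented property of the index, and the function names are hypothetical.

```python
import json
from typing import Iterable


def load_session_index(lines: Iterable[str]) -> dict[str, dict]:
    """Build an id -> record inventory, keeping the latest record per id."""
    inventory: dict[str, dict] = {}
    for raw in lines:
        raw = raw.strip()
        if not raw:
            continue
        try:
            record = json.loads(raw)
        except json.JSONDecodeError:
            continue  # skip malformed lines rather than abort the ingest
        if not isinstance(record, dict) or "id" not in record:
            continue
        previous = inventory.get(record["id"])
        if previous is None or record.get("updated_at", "") >= previous.get("updated_at", ""):
            inventory[record["id"]] = record
    return inventory


def most_recent_first(inventory: dict[str, dict]) -> list[dict]:
    """Order sessions by updated_at, newest first."""
    return sorted(inventory.values(), key=lambda r: r.get("updated_at", ""), reverse=True)
```

The resulting mapping also serves as the lookup from a rollout's ID to its `thread_name`.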
Step 3: Parse Rollout JSONL Safely
Each `rollout-*.jsonl` line is an event envelope with:
```json
{
  "timestamp": "...",
  "type": "session_meta|turn_context|event_msg|response_item",
  "payload": { ... }
}
```
Extraction rules
- Prioritize user intent and assistant-visible outputs
- Favor `response_item` records with user/assistant message content
- Use `event_msg` selectively for meaningful milestones; ignore pure telemetry
- Treat `session_meta` as metadata (cwd, model, ids), not user knowledge
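One way to read rollout lines defensively before applying these rules is sketched below. It relies only on the envelope fields shown above (`timestamp`, `type`, `payload`) and deliberately does not look inside `payload`, whose layout is not described here. The names `iter_events` and `split_by_type` are hypothetical.

```python
import json
from typing import Iterable, Iterator

KNOWN_TYPES = {"session_meta", "turn_context", "event_msg", "response_item"}


def iter_events(lines: Iterable[str]) -> Iterator[dict]:
    """Yield well-formed event envelopes; skip blank, malformed, or unknown lines."""
    for raw in lines:
        raw = raw.strip()
        if not raw:
            continue
        try:
            event = json.loads(raw)
        except json.JSONDecodeError:
            continue  # e.g. a partially written line; skip it, do not abort
        if not isinstance(event, dict):
            continue
        if event.get("type") not in KNOWN_TYPES:
            continue
        if not isinstance(event.get("payload"), dict):
            continue
        yield event


def split_by_type(events: Iterable[dict]) -> dict[str, list[dict]]:
    """Bucket envelopes by type so each extraction rule can be applied separately."""
    buckets: dict[str, list[dict]] = {t: [] for t in KNOWN_TYPES}
    for event in events:
        buckets[event["type"]].append(event)
    return buckets
```

With the buckets in hand, `response_item` records can be examined first, `event_msg` records sampled selectively, and `session_meta` kept aside as metadata.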
Skip/noise filters
- Token accounting events
- Tool plumbing with no semantic content
- Raw command output unless it contains reusable decisions/patterns
- Repeated plan snapshots unless they add novel decisions
Critical privacy filter
Rollout logs can include injected instructions, tool payloads, and sensitive text. Do not ingest verbatim system/developer prompts or secrets.
- Remove API keys, tokens, passwords, credentials
- Redact private identifiers unless relevant and approved
- Summarize instead of quoting raw transcripts
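As a rough illustration of the first rule, the sketch below masks values that look like credentials before any text reaches the wiki. The two patterns are illustrative assumptions, not a complete or vetted set, and pattern matching is a backstop rather than a substitute for summarizing instead of quoting.

```python
import re

# Illustrative patterns only; a real filter would need a broader, reviewed set.
KEY_VALUE_RE = re.compile(r"(?i)\b(api[_-]?key|token|password|secret)\b(\s*[:=]\s*)\S+")
BEARER_RE = re.compile(r"\bBearer\s+[A-Za-z0-9._~+/=-]{8,}")


def redact(text: str) -> str:
    """Replace values that look like credentials with a placeholder."""
    # Keep the key name and separator so the summary stays readable.
    text = KEY_VALUE_RE.sub(lambda m: f"{m.group(1)}{m.group(2)}[REDACTED]", text)
    text = BEARER_RE.sub("Bearer [REDACTED]", text)
    return text
```

For example, `redact("API_KEY=abc123")` returns `"API_KEY=[REDACTED]"`.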
Step 4: Cluster by Topic
Do not create one wiki page per session.
- Group by stable topics across many sessions
- Split mixed sessions into separate themes
- Merge recurring concepts across dates/projects
- Use `cwd` from metadata to infer project scope
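Inferring project scope from `cwd` might look like the sketch below. Using the final path component as the project name is a heuristic assumed for illustration, and the input is assumed to be a mapping from session id to the `cwd` string already pulled from session metadata. Grouping by project is only a first cut; topic clustering within and across projects still needs judgment.

```python
from collections import defaultdict
from pathlib import PurePosixPath


def project_from_cwd(cwd: str) -> str:
    """Use the final path component of cwd as a rough project name."""
    name = PurePosixPath(cwd).name
    return name or "unknown"


def group_by_project(sessions: dict[str, str]) -> dict[str, list[str]]:
    """Map project name -> sorted session ids, given session id -> cwd."""
    groups: dict[str, list[str]] = defaultdict(list)
    for session_id, cwd in sessions.items():
        groups[project_from_cwd(cwd)].append(session_id)
    return {project: sorted(ids) for project, ids in groups.items()}
```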
Step 5: Distill into Wiki Pages
Route extracted knowledge using existing wiki conventions:
- Project-specific architecture/process -> `projects/<name>/...`
- General concepts -> `concepts/`
- Recurring techniques/debug playbooks -> `skills/`
- Tools/services -> `entities/`
- Cross-session patterns -> `synthesis/`
For each impacted project, create/update `projects/<name>/<name>.md` (project name as filename, never `_project.md`).
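The routing table above can be expressed as a small lookup. The category labels (`"concept"`, `"skill"`, and so on) are hypothetical names chosen for this sketch; only the target directories and the overview filename rule come from the list.

```python
from pathlib import PurePosixPath
from typing import Optional

# Left-hand labels are hypothetical; directories follow the routing list.
ROUTES = {
    "concept": "concepts",
    "skill": "skills",
    "entity": "entities",
    "synthesis": "synthesis",
}


def target_dir(category: str, project: Optional[str] = None) -> PurePosixPath:
    """Return the wiki directory for a piece of extracted knowledge."""
    if category == "project":
        if not project:
            raise ValueError("project-specific knowledge needs a project name")
        return PurePosixPath("projects") / project
    return PurePosixPath(ROUTES[category])


def project_overview(project: str) -> PurePosixPath:
    """Overview page named after the project itself, never _project.md."""
    return PurePosixPath("projects") / project / f"{project}.md"
```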
Writing rules
- Distill knowledge, not chronology
- Avoid "on date X we discussed..." unless date context is essential
- Add `summary:` frontmatter on each new/updated page (1-2 sentences, <= 200 chars)
- Add provenance markers:
  - `^[extracted]` when directly grounded in explicit session content
  - `^[inferred]` when synthesizing patterns across events/sessions
  - `^[ambiguous]` when sessions conflict
- Add/update `provenance:` frontmatter mix for each changed page
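The exact shape of the `provenance:` mix is not spelled out here. The sketch below assumes it is the share of each marker among all markers on the page, which is one plausible reading rather than the documented format; `provenance_mix` and `check_summary` are hypothetical helpers.

```python
import re
from collections import Counter

MARKERS = ("extracted", "inferred", "ambiguous")
MARKER_RE = re.compile(r"\^\[(extracted|inferred|ambiguous)\]")


def provenance_mix(body: str) -> dict[str, float]:
    """Share of each provenance marker in a page body, rounded to 2 places."""
    counts = Counter(MARKER_RE.findall(body))
    total = sum(counts.values())
    if total == 0:
        return {marker: 0.0 for marker in MARKERS}
    return {marker: round(counts[marker] / total, 2) for marker in MARKERS}


def check_summary(summary: str) -> str:
    """Enforce the 200-character limit on the summary field."""
    if len(summary) > 200:
        raise ValueError(f"summary is {len(summary)} chars; limit is 200")
    return summary
```

Recomputing the mix from the page body on every update keeps the frontmatter consistent with the inline markers.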
Step 6: Update Manifest, Log, and Index
Update `.manifest.json`
For each processed source file:
- `ingested_at`, `size_bytes`, `modified_at`
- `source_type`: `codex_rollout` | `codex_index` | `codex_history`
- `project`: inferred project name (when applicable)
- `pages_created`, `pages_updated`
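A per-file record with these fields could be assembled as in the sketch below. Two details are assumptions made for illustration: timestamps are written as ISO 8601 UTC strings, and `pages_created` / `pages_updated` are lists of page paths. The function name `manifest_entry` is hypothetical.

```python
from datetime import datetime, timezone
from typing import Optional

SOURCE_TYPES = {"codex_rollout", "codex_index", "codex_history"}


def manifest_entry(
    size_bytes: int,
    mtime: float,
    source_type: str,
    pages_created: list[str],
    pages_updated: list[str],
    project: Optional[str] = None,
    now: Optional[datetime] = None,
) -> dict:
    """Build the manifest record for one processed source file."""
    if source_type not in SOURCE_TYPES:
        raise ValueError(f"unknown source_type: {source_type}")
    now = now or datetime.now(timezone.utc)
    entry = {
        "ingested_at": now.isoformat(),
        "size_bytes": size_bytes,
        "modified_at": datetime.fromtimestamp(mtime, tz=timezone.utc).isoformat(),
        "source_type": source_type,
        "pages_created": pages_created,
        "pages_updated": pages_updated,
    }
    if project:
        entry["project"] = project  # only when a project could be inferred
    return entry
```

Passing `now` explicitly keeps one ingest run's records on a single timestamp and makes the function easy to test.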
Add/update a top-level project/session summary block:
```json
{
  "project-name": {
    "source_path": "~/.codex/sessions/...",
    "last_ingested": "TIMESTAMP",
    "sessions_ingested": 12,
    "sessions_total": 40,
    "index_updated_at": "TIMESTAMP"
  }
}
```
Update special files
Update `index.md` and `log.md`:
```
- [TIMESTAMP] CODEX_HISTORY_INGEST sessions=N pages_updated=X pages_created=Y mode=append|full
```
Privacy and Compliance
- Distill and synthesize; avoid raw transcript dumps
- Default to redaction for anything that looks sensitive
- Ask the user before storing personal/sensitive details
- Keep references to other people minimal and purpose-bound
Reference
See `references/codex-data-format.md` for field-level parsing notes and extraction guidance.