fix-jsonl-surrogates

Fix JSONL Surrogates

Repairs Claude Code JSONL chat files that contain lone Unicode surrogates (U+D800-U+DFFF), which cause the Anthropic API to reject requests with "invalid high surrogate in string".

Background

Node.js internally uses UTF-16 strings. `JSON.stringify()` doesn't validate surrogate pairs, so lone surrogates from file reads, terminal output, or web content can end up serialized as `\uD8xx` without a matching low surrogate, producing a payload that the API rejects as invalid.
Common triggers on Windows:
  • Emoji in code comments or terminal output
  • Box-drawing characters from build logs
  • Web content with malformed UTF-16 encoding
Upstream issue: anthropics/claude-code#44230
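
To make the failure mode above concrete, here is a minimal Python sketch. It assumes a hand-written JSON sample (not taken from a real session file) in which a high surrogate escape has no low surrogate after it. Python's `json` module parses it without complaint, but the resulting string cannot be encoded as strict UTF-8.

```python
import json

# Hypothetical sample: a high surrogate escape with no low surrogate after it,
# in the form JSON.stringify() might emit it.
raw = '{"text": "build ok \\ud83d"}'

# The json module accepts the escape and yields a str holding U+D83D.
value = json.loads(raw)["text"]
has_lone = any(0xD800 <= ord(ch) <= 0xDFFF for ch in value)

# Strict UTF-8 encoding refuses lone surrogates.
try:
    value.encode("utf-8")
    encodable = True
except UnicodeEncodeError:
    encodable = False

print(has_lone, encodable)  # True False
```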

How to Use

Quick diagnosis

If the user has a specific session or request ID, find the JSONL file:
bash
grep -rl "req_XXXXX" ~/.claude/projects/*/
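
Where `grep` is not available (for example in a plain Windows shell), the same lookup can be sketched in Python. The function name `find_session_files` is hypothetical, and unlike the `grep` command this sketch only looks at `*.jsonl` files.

```python
from pathlib import Path


def find_session_files(root: Path, needle: str) -> list[Path]:
    """Return .jsonl files under root whose raw bytes contain needle."""
    target = needle.encode("utf-8")
    hits = []
    for path in sorted(root.rglob("*.jsonl")):
        # Compare bytes so that files with broken encoding are still searched.
        if path.is_file() and target in path.read_bytes():
            hits.append(path)
    return hits


if __name__ == "__main__":
    projects = Path.home() / ".claude" / "projects"
    if projects.is_dir():
        for hit in find_session_files(projects, "req_XXXXX"):
            print(hit)
```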

Run the repair script

The bundled Python script handles scanning and fixing. It works on individual files, session directories, or entire project trees.
bash
# Scan a specific JSONL file + its session directory (subagents, tool-results)
python {SKILL_DIR}/scripts/fix_surrogates.py <path-to-file.jsonl>

# Dry run first to see what would change
python {SKILL_DIR}/scripts/fix_surrogates.py <path-to-file.jsonl> --dry-run --verbose

# Scan all sessions in a project directory
python {SKILL_DIR}/scripts/fix_surrogates.py ~/.claude/projects/<project-dir>/

# Scan everything recursively (includes subagent files)
python {SKILL_DIR}/scripts/fix_surrogates.py ~/.claude/projects/<project-dir>/ --recursive

The script:
1. Reads each file with `errors='surrogateescape'` to preserve bad characters for detection
2. Parses JSON and recursively checks all string values for surrogate code points
3. Replaces lone surrogates with U+FFFD (replacement character) via `str.encode('utf-8', 'surrogatepass').decode('utf-8', 'replace')`
4. Backs up originals as `.bak` before modifying
5. Reports a summary of what was found and fixed
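
The steps above can be sketched roughly as follows. This is not the bundled `fix_surrogates.py`, only an illustration under stated assumptions: the names `scrub`, `scrub_value`, and `fix_jsonl` are hypothetical, each line is treated as one JSON record, and changed lines are re-serialized compactly with non-ASCII characters left unescaped.

```python
import json
import shutil
from pathlib import Path


def scrub(text: str) -> str:
    """Turn surrogate code points into U+FFFD via the round trip from step 3."""
    return text.encode("utf-8", "surrogatepass").decode("utf-8", "replace")


def scrub_value(value):
    """Recursively scrub every string, including dict keys, in parsed JSON."""
    if isinstance(value, str):
        return scrub(value)
    if isinstance(value, list):
        return [scrub_value(item) for item in value]
    if isinstance(value, dict):
        return {scrub(key): scrub_value(item) for key, item in value.items()}
    return value


def fix_jsonl(path: Path, dry_run: bool = False) -> int:
    """Return how many lines needed fixing; rewrite the file unless dry_run."""
    # surrogateescape maps undecodable bytes to U+DC80..U+DCFF instead of raising.
    with open(path, encoding="utf-8", errors="surrogateescape", newline="") as fh:
        lines = fh.read().split("\n")

    changed = 0
    out = []
    for line in lines:
        try:
            record = json.loads(line)
        except ValueError:
            out.append(line)  # blank or unparseable: keep verbatim
            continue
        cleaned = scrub_value(record)
        if cleaned == record:
            out.append(line)  # untouched lines keep their exact original text
            continue
        changed += 1
        ending = "\r" if line.endswith("\r") else ""
        out.append(json.dumps(cleaned, ensure_ascii=False, separators=(",", ":")) + ending)

    if changed and not dry_run:
        shutil.copy2(path, path.with_name(path.name + ".bak"))  # back up first
        # surrogateescape on write restores any raw bytes in lines kept verbatim.
        with open(path, "w", encoding="utf-8", errors="surrogateescape", newline="") as fh:
            fh.write("\n".join(out))
    return changed
```

Calling `fix_jsonl(path, dry_run=True)` reports the number of affected lines without touching the file. A second real run on the same file returns 0, because the replacement characters written by the first run are ordinary code points.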

Important notes

  • Surrogates are often transient: They may exist only in Node.js memory during request construction, not in the stored JSONL. If the script finds nothing, the error was most likely transient and retrying the session should work.
  • Check all session artifacts: The script automatically scans the main JSONL plus any subagent files and tool-result cache in the session directory.
  • Safe to re-run: The script is idempotent. Running it on already-clean files is a no-op.

Manual investigation

If the script finds nothing but the error persists, the surrogate may come from content loaded at request time (CLAUDE.md, memory files, etc.). Scan those too:
bash
python {SKILL_DIR}/scripts/fix_surrogates.py ~/.claude/ --recursive --dry-run
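
For plain-text files such as CLAUDE.md, a quick independent check is to decode the raw bytes as strict UTF-8: a surrogate that was written to disk shows up as an invalid byte sequence. The sketch below is an illustration only, and `first_invalid_utf8` is a hypothetical helper name. It will not catch `\uD8xx` escapes inside JSON text, which are plain ASCII on disk.

```python
from pathlib import Path


def first_invalid_utf8(path: Path):
    """Return the byte offset of the first invalid UTF-8 sequence, or None."""
    try:
        path.read_bytes().decode("utf-8")
    except UnicodeDecodeError as exc:
        return exc.start  # index of the first offending byte
    return None
```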

What `{SKILL_DIR}` means

Replace `{SKILL_DIR}` with the actual path to this skill directory. When Claude invokes this skill, it knows the skill's location and can substitute the correct path.