session-anonymizer
Therapy Anonymizer
Three-layer PII detection and anonymization for therapy session transcripts. Supports Russian and English. Fully local by default — no data leaves the machine.
Architecture
Three detection layers run in sequence, each catching what others miss:
| Layer | Tool | Catches | Size | Speed |
|---|---|---|---|---|
| 1 | Natasha | Russian names, locations, organizations | 27 MB | instant |
| 2 | OpenAI Privacy Filter (opf) | Phones, accounts, addresses, emails | 2.8 GB | ~1.5s |
| 3 | Ollama LLM | Medications, dates, contextual IDs | 2.5-7 GB | ~10s |
Spans from all layers are merged, overlaps resolved, and a unified redacted output is produced.
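The merge step can be pictured as follows. This is a minimal sketch, not the script's actual implementation: the `Span` tuple, the `merge_spans` and `redact` helpers, the bracketed tag format, and the longest-span-wins rule are all assumptions made for illustration.

```python
from typing import NamedTuple

class Span(NamedTuple):
    start: int   # inclusive character offset
    end: int     # exclusive character offset
    label: str   # e.g. "NAME", "PHONE" (hypothetical label set)
    layer: str   # which layer produced the span

def merge_spans(spans: list[Span]) -> list[Span]:
    """Resolve overlaps: when spans overlap, keep the longer one (assumed rule)."""
    # Longest first so longer spans claim their range first; ties broken by start.
    by_length = sorted(spans, key=lambda s: (-(s.end - s.start), s.start))
    kept: list[Span] = []
    for s in by_length:
        # Keep a span only if it does not intersect anything already kept.
        if all(s.end <= k.start or s.start >= k.end for k in kept):
            kept.append(s)
    return sorted(kept, key=lambda s: s.start)

def redact(text: str, spans: list[Span]) -> str:
    """Replace each merged span with a bracketed tag, working right to left
    so that earlier offsets stay valid."""
    for s in reversed(merge_spans(spans)):
        text = text[:s.start] + f"[{s.label}]" + text[s.end:]
    return text

text = "Anna called from 555-0100 in Kostroma."
spans = [
    Span(0, 4, "NAME", "natasha"),
    Span(17, 25, "PHONE", "opf"),
    Span(17, 20, "ID", "ollama"),        # overlaps the phone span and is shorter
    Span(29, 37, "LOCATION", "natasha"),
]
print(redact(text, spans))  # [NAME] called from [PHONE] in [LOCATION].
```

Other resolution rules (for example, preferring a specific layer) would fit the same structure by changing the sort key.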
Prerequisites
```bash
pip install natasha setuptools pymorphy2-dicts-ru
pip install 'opf @ git+https://github.com/openai/privacy-filter.git'
ollama pull qwen3:4b
```
Each layer is optional: the script skips unavailable layers and prints a warning.
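One way to get this kind of graceful skipping is to probe for each dependency before using it. The sketch below is an assumption about how such a check could look, not the script's actual code: `available_layers` is a hypothetical helper, and it assumes the packages are importable as `natasha` and `opf` and that the Ollama CLI is on `PATH`.

```python
import importlib.util
import shutil
import sys

def available_layers() -> list[str]:
    """Return the layers whose dependencies are present, warning about the rest."""
    checks = {
        "natasha": lambda: importlib.util.find_spec("natasha") is not None,
        "opf": lambda: importlib.util.find_spec("opf") is not None,
        "ollama": lambda: shutil.which("ollama") is not None,
    }
    layers = []
    for name, is_available in checks.items():
        if is_available():
            layers.append(name)
        else:
            # Warn on stderr so the redacted output on stdout stays clean.
            print(f"warning: layer '{name}' unavailable, skipping", file=sys.stderr)
    return layers

print(available_layers())
```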
Usage
Single file
```bash
python3 ~/.claude/skills/therapy-anonymizer/scripts/anonymize.py session.txt
```
Stdin pipe
```bash
cat session.txt | python3 ~/.claude/skills/therapy-anonymizer/scripts/anonymize.py
```
Batch processing
```bash
python3 ~/.claude/skills/therapy-anonymizer/scripts/anonymize.py --batch ~/sessions/ -o ~/sessions_clean/
```
JSON report
```bash
python3 ~/.claude/skills/therapy-anonymizer/scripts/anonymize.py session.txt --json
```
Pseudonyms instead of tags
```bash
python3 ~/.claude/skills/therapy-anonymizer/scripts/anonymize.py session.txt --pseudonyms
```
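The idea behind pseudonym mode is that each distinct value maps to a stable stand-in, so references stay consistent across the transcript. A minimal sketch, assuming a simple counter-based scheme; the `Pseudonymizer` class and the `Person_1` naming format are hypothetical and may differ from what `--pseudonyms` actually produces.

```python
class Pseudonymizer:
    """Map each distinct (label, value) pair to a stable placeholder."""

    def __init__(self) -> None:
        self._mapping: dict[tuple[str, str], str] = {}
        self._counters: dict[str, int] = {}

    def replace(self, label: str, value: str) -> str:
        # casefold() so "Anna" and "anna" are treated as the same entity
        key = (label, value.casefold())
        if key not in self._mapping:
            self._counters[label] = self._counters.get(label, 0) + 1
            self._mapping[key] = f"{label.title()}_{self._counters[label]}"
        return self._mapping[key]

p = Pseudonymizer()
print(p.replace("PERSON", "Anna"))    # Person_1
print(p.replace("PERSON", "Boris"))   # Person_2
print(p.replace("PERSON", "anna"))    # Person_1 again
```

The mapping holds the original values, so in a scheme like this it should stay in memory and never be written to logs.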
Select layers / model
```bash
# Fast: Natasha only
python3 ~/.claude/skills/therapy-anonymizer/scripts/anonymize.py session.txt --layers natasha
# LLM only: maximum coverage
python3 ~/.claude/skills/therapy-anonymizer/scripts/anonymize.py session.txt --layers ollama --model gemma4:e2b
```
Encrypt output (AES-256)
```bash
python3 ~/.claude/skills/therapy-anonymizer/scripts/anonymize.py session.txt -o clean.txt --encrypt "password"
```
Invoking from Claude Code
To anonymize text already in context, pipe it through the script:
```bash
echo '<text>' | python3 ~/.claude/skills/therapy-anonymizer/scripts/anonymize.py --json
```
For files, pass the path directly. Always recommend manual review after automated anonymization.
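The same pipe can be driven from Python. A minimal sketch using `subprocess`; the `anonymize_text` helper is hypothetical, and because the shape of the JSON report is not documented here, the output is returned as raw text rather than parsed.

```python
import subprocess
import sys
from pathlib import Path

SCRIPT = Path.home() / ".claude/skills/therapy-anonymizer/scripts/anonymize.py"

def anonymize_text(text: str, script: Path = SCRIPT,
                   extra_args: tuple[str, ...] = ("--json",)) -> str:
    """Feed text to the script on stdin and return whatever it prints."""
    result = subprocess.run(
        [sys.executable, str(script), *extra_args],
        input=text,
        capture_output=True,
        encoding="utf-8",   # transcripts may contain Cyrillic text
        check=True,         # raise CalledProcessError on a non-zero exit
    )
    return result.stdout

if SCRIPT.exists():
    print(anonymize_text("Anna called from 555-0100."))
```

Passing the text through `input=` avoids the shell-quoting problems that `echo '<text>'` runs into when the text itself contains apostrophes.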
Limitations
- Contextual identifiers ("the only red-haired architect in Kostroma") are NOT detected by any automated tool
- OPF is English-focused; Russian coverage is partial
- Medications are detected only by Layer 3 (requires Ollama)
- Does not assess re-identification risk from combinations of non-PII fields
Guardrails
- NEVER send raw transcripts to cloud services
- Cloud verification only on already-anonymized text
- Always recommend manual review for therapy data
- Never log original PII values
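The last guardrail can be enforced structurally by logging only span metadata, never slices of the text. A minimal sketch, assuming detections are available as `(start, end, label)` triples; the `summarize` and `log_detections` helpers and that tuple shape are hypothetical.

```python
import logging
from collections import Counter

logger = logging.getLogger("anonymizer")

def summarize(spans: list[tuple[int, int, str]]) -> dict[str, int]:
    """Count detections per label; only the label of each span is read."""
    return dict(Counter(label for _, _, label in spans))

def log_detections(spans: list[tuple[int, int, str]]) -> None:
    # The log line carries counts and labels only, so no original value can leak.
    logger.info("redacted %d spans: %s", len(spans), summarize(spans))

detections = [(0, 4, "NAME"), (17, 25, "PHONE"), (30, 34, "NAME")]
log_detections(detections)
print(summarize(detections))  # {'NAME': 2, 'PHONE': 1}
```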