session-anonymizer

Therapy Anonymizer

Three-layer PII detection and anonymization for therapy session transcripts. Supports Russian and English. Fully local by default — no data leaves the machine.

Architecture

Three detection layers run in sequence, each catching what others miss:

| Layer | Tool | Catches | Size | Speed |
|-------|------|---------|------|-------|
| 1 | Natasha | Russian names, locations, organizations | 27 MB | instant |
| 2 | OpenAI Privacy Filter (opf) | Phones, accounts, addresses, emails | 2.8 GB | ~1.5 s |
| 3 | Ollama LLM | Medications, dates, contextual IDs | 2.5-7 GB | ~10 s |

Spans from all layers are merged, overlaps resolved, and a unified redacted output is produced.
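
The merge step can be illustrated with a small sketch. This is not the script's actual code: the `Span` type, the `merge_spans` and `redact` helpers, the `[LABEL]` tag format, and the "longest span wins" rule are all assumptions chosen to show one plausible way to resolve overlaps between layers.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Span:
    start: int   # inclusive character offset
    end: int     # exclusive character offset
    label: str   # hypothetical label name, e.g. "NAME" or "PHONE"
    layer: str   # which detector produced the span


def merge_spans(spans):
    """Resolve overlaps by keeping the longest span (assumed policy)."""
    kept = []
    # Visit longer spans first so a short span never displaces a longer one.
    for span in sorted(spans, key=lambda s: (-(s.end - s.start), s.start)):
        if all(span.end <= k.start or span.start >= k.end for k in kept):
            kept.append(span)
    return sorted(kept, key=lambda s: s.start)


def redact(text, spans):
    """Replace each kept span with a [LABEL] tag, right to left,
    so that earlier offsets stay valid while the text changes length."""
    for span in sorted(merge_spans(spans), key=lambda s: s.start, reverse=True):
        text = text[:span.start] + f"[{span.label}]" + text[span.end:]
    return text


text = "Call Anna Petrova at 555-0100."
spans = [
    Span(5, 9, "NAME", "natasha"),    # "Anna"
    Span(5, 17, "NAME", "ollama"),    # "Anna Petrova"
    Span(21, 29, "PHONE", "opf"),     # "555-0100"
]
print(redact(text, spans))  # Call [NAME] at [PHONE].
```

Preferring the longest span errs on the side of over-redaction, which is usually the safer failure mode for therapy data.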

Prerequisites

```bash
pip install natasha setuptools pymorphy2-dicts-ru
pip install 'opf @ git+https://github.com/openai/privacy-filter.git'
ollama pull qwen3:4b
```
Each layer is optional — the script gracefully skips unavailable layers and warns.
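
A minimal sketch of how such a graceful skip might work is shown below. It is an illustration only: the `available_layers` helper is hypothetical, the layer name `opf` and the importable module name `opf` are assumed from the package name above, and the real script may check availability differently.

```python
import importlib.util
import shutil
import sys


def available_layers():
    """Return the layers whose dependencies can be found on this machine."""
    checks = {
        # Python packages: look the module up without importing it.
        "natasha": importlib.util.find_spec("natasha") is not None,
        "opf": importlib.util.find_spec("opf") is not None,  # assumed module name
        # Ollama is an external binary, so look for it on PATH.
        "ollama": shutil.which("ollama") is not None,
    }
    for name, ok in checks.items():
        if not ok:
            print(f"warning: layer '{name}' unavailable, skipping", file=sys.stderr)
    return [name for name, ok in checks.items() if ok]


print(available_layers())
```

Checking up front lets a run proceed with whatever is installed, while the warning makes it clear that coverage is reduced.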

Usage

Single file

```bash
python3 ~/.claude/skills/therapy-anonymizer/scripts/anonymize.py session.txt
```

Stdin pipe

```bash
cat session.txt | python3 ~/.claude/skills/therapy-anonymizer/scripts/anonymize.py
```

Batch processing

```bash
python3 ~/.claude/skills/therapy-anonymizer/scripts/anonymize.py --batch ~/sessions/ -o ~/sessions_clean/
```

JSON report

```bash
python3 ~/.claude/skills/therapy-anonymizer/scripts/anonymize.py session.txt --json
```

Pseudonyms instead of tags

```bash
python3 ~/.claude/skills/therapy-anonymizer/scripts/anonymize.py session.txt --pseudonyms
```
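
To illustrate the difference between tags and pseudonyms, here is a small sketch of consistent pseudonym assignment. The `Pseudonymizer` class and the `Name_1` style of placeholder are hypothetical; the source does not say how `--pseudonyms` generates replacement values, so treat this only as one way the idea could work.

```python
from itertools import count


class Pseudonymizer:
    """Map each distinct original value to a stable placeholder (assumed scheme)."""

    def __init__(self):
        self._mapping = {}    # (label, normalized value) -> pseudonym
        self._counters = {}   # label -> running counter

    def replace(self, label, value):
        # Case-insensitive key so "Anna" and "anna" share one pseudonym.
        key = (label, value.casefold())
        if key not in self._mapping:
            counter = self._counters.setdefault(label, count(1))
            self._mapping[key] = f"{label.title()}_{next(counter)}"
        return self._mapping[key]


p = Pseudonymizer()
print(p.replace("NAME", "Anna"))    # Name_1
print(p.replace("NAME", "Boris"))   # Name_2
print(p.replace("NAME", "anna"))    # Name_1
```

Unlike a generic tag, a stable pseudonym keeps it clear which mentions refer to the same person. The mapping itself holds original values, so it should stay in memory and never be logged or written to disk.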

Select layers / model

```bash
# Fast: Natasha only
python3 ~/.claude/skills/therapy-anonymizer/scripts/anonymize.py session.txt --layers natasha

# LLM only: maximum coverage
python3 ~/.claude/skills/therapy-anonymizer/scripts/anonymize.py session.txt --layers ollama --model gemma4:e2b
```

Encrypt output (AES-256)

```bash
python3 ~/.claude/skills/therapy-anonymizer/scripts/anonymize.py session.txt -o clean.txt --encrypt "password"
```

Invoking from Claude Code

To anonymize text already in context, pipe it through the script:
```bash
echo '<text>' | python3 ~/.claude/skills/therapy-anonymizer/scripts/anonymize.py --json
```
For files, pass the path directly. Always recommend manual review after automated anonymization.
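
When the call is made from Python rather than a shell, the same pipe can be sketched with `subprocess`. The `build_command` and `anonymize_text` helpers are hypothetical, and the sketch assumes that `--json` makes the script print a single JSON document on stdout; the report's structure is not specified here, so the result is returned as parsed JSON without further interpretation.

```python
import json
import subprocess
from pathlib import Path

# Script location as given in the usage examples.
SCRIPT = Path.home() / ".claude/skills/therapy-anonymizer/scripts/anonymize.py"


def build_command(script=SCRIPT):
    """Command line equivalent to `python3 anonymize.py --json`."""
    return ["python3", str(script), "--json"]


def anonymize_text(text, script=SCRIPT):
    """Pipe text to the script on stdin and parse its JSON report."""
    result = subprocess.run(
        build_command(script),
        input=text,            # sent on stdin, like `echo ... |`
        capture_output=True,
        text=True,
        check=True,            # raise if the script exits non-zero
    )
    return json.loads(result.stdout)
```

Passing the transcript on stdin avoids shell quoting problems with apostrophes in the text and keeps raw content out of the command line.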

Limitations

  • Contextual identifiers ("the only red-haired architect in Kostroma") are NOT detected by any automated tool
  • OPF is English-focused; its Russian coverage is partial
  • Medications are detected only by Layer 3, which requires Ollama
  • Does not assess re-identification risk from combinations of non-PII fields

Guardrails

  • NEVER send raw transcripts to cloud services
  • Run cloud verification only on already-anonymized text
  • Always recommend manual review for therapy data
  • Never log original PII values
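
The "never log original PII values" rule can be made concrete with a short sketch. The `describe_span` and `log_detection` helpers and the logger name are hypothetical; the point is only that a log record can carry the label and position of a detection without the matched text.

```python
import logging

logger = logging.getLogger("anonymizer")  # hypothetical logger name


def describe_span(label, start, end):
    """Build a log-safe description: label and offsets only, never the matched text."""
    return f"{label} at {start}-{end} ({end - start} chars)"


def log_detection(label, start, end):
    """Record that something was redacted without recording what it was."""
    logger.info("redacted %s", describe_span(label, start, end))


print(describe_span("PHONE", 21, 29))  # PHONE at 21-29 (8 chars)
```

Because the functions never receive the original value, it cannot end up in a log file by accident.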