session-anonymizer

Therapy Anonymizer

Three-layer PII detection and anonymization for therapy session transcripts. Supports Russian and English. Fully local by default — no data leaves the machine.

Architecture

Three detection layers run in sequence, each catching what the others miss:

Layer	Tool	Catches	Size	Speed
1	Natasha	Russian names, locations, organizations	27 MB	instant
2	OpenAI Privacy Filter (opf)	Phones, accounts, addresses, emails	2.8 GB	~1.5s
3	Ollama LLM	Medications, dates, contextual IDs	2.5-7 GB	~10s

Spans from all layers are merged, overlaps resolved, and a unified redacted output is produced.
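
The merge step can be sketched as follows. This is a minimal illustration, not the script's actual logic: the `Span` type, the `merge_spans` and `redact` helpers, and the tie-breaking rule (longest span wins, then the earlier layer) are all assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Span:
    start: int  # inclusive character offset
    end: int    # exclusive character offset
    label: str  # e.g. "NAME", "PHONE" (hypothetical label names)
    layer: int  # 1 = Natasha, 2 = opf, 3 = Ollama


def merge_spans(spans):
    """Resolve overlaps greedily: longest span first, earlier layer wins ties."""
    ranked = sorted(spans, key=lambda s: (-(s.end - s.start), s.layer, s.start))
    kept = []
    for s in ranked:
        # Keep a span only if it does not overlap anything already kept.
        if all(s.end <= k.start or s.start >= k.end for k in kept):
            kept.append(s)
    return sorted(kept, key=lambda s: s.start)


def redact(text, spans):
    """Replace each surviving span with a [LABEL] tag, working right to left."""
    for s in reversed(merge_spans(spans)):
        text = text[:s.start] + f"[{s.label}]" + text[s.end:]
    return text


text = "Call Anna Petrova at 555-0100."
spans = [
    Span(5, 9, "NAME", 3),     # one layer found only the first name
    Span(5, 17, "NAME", 1),    # another found the full name
    Span(21, 29, "PHONE", 2),
]
print(redact(text, spans))  # Call [NAME] at [PHONE].
```

Replacing from right to left keeps the earlier offsets valid while the text changes length.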

Prerequisites

bash

pip install natasha setuptools pymorphy2-dicts-ru
pip install 'opf @ git+https://github.com/openai/privacy-filter.git'
ollama pull qwen3:4b

Each layer is optional — the script gracefully skips unavailable layers and warns.
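
One way such a check could work is to probe for each dependency before running. The sketch below is an assumption about the approach, not the script's code: `available_layers` is a hypothetical helper, and the import name `opf` is inferred from the pip requirement above.

```python
import importlib.util
import shutil
import sys


def available_layers(requested=("natasha", "opf", "ollama")):
    """Return the requested layers that look usable; warn about the rest."""
    checks = {
        # Python packages: installed if an import spec can be found.
        "natasha": lambda: importlib.util.find_spec("natasha") is not None,
        "opf": lambda: importlib.util.find_spec("opf") is not None,  # assumed import name
        # Ollama is an external binary, so look for it on PATH.
        "ollama": lambda: shutil.which("ollama") is not None,
    }
    usable = []
    for name in requested:
        check = checks.get(name)
        if check is not None and check():
            usable.append(name)
        else:
            print(f"warning: layer '{name}' unavailable, skipping", file=sys.stderr)
    return usable


print(available_layers())
```

Finding the `ollama` binary does not guarantee that the server is running or that the model has been pulled, so a real check would likely need to go further.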

Usage

Single file

bash

python3 ~/.claude/skills/therapy-anonymizer/scripts/anonymize.py session.txt

Stdin pipe

bash

cat session.txt | python3 ~/.claude/skills/therapy-anonymizer/scripts/anonymize.py

Batch processing

bash

python3 ~/.claude/skills/therapy-anonymizer/scripts/anonymize.py --batch ~/sessions/ -o ~/sessions_clean/

JSON report

bash

python3 ~/.claude/skills/therapy-anonymizer/scripts/anonymize.py session.txt --json

Pseudonyms instead of tags

bash

python3 ~/.claude/skills/therapy-anonymizer/scripts/anonymize.py session.txt --pseudonyms
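
The README does not say how pseudonyms are generated. As a rough illustration of the idea only, the sketch below maps each distinct value to a stable numbered placeholder, so that repeated mentions of the same person stay linked in the output. The `Pseudonymizer` class and the `Name_1` naming scheme are hypothetical.

```python
from itertools import count


class Pseudonymizer:
    """Map each distinct original value to a stable, numbered placeholder."""

    def __init__(self):
        self._seen = {}      # (label, normalized value) -> pseudonym
        self._counters = {}  # label -> running counter

    def replace(self, label, value):
        key = (label, value.casefold())  # case-insensitive match
        if key not in self._seen:
            counter = self._counters.setdefault(label, count(1))
            self._seen[key] = f"{label.title()}_{next(counter)}"
        return self._seen[key]


p = Pseudonymizer()
print(p.replace("NAME", "Anna"))   # Name_1
print(p.replace("NAME", "Boris"))  # Name_2
print(p.replace("NAME", "anna"))   # Name_1 (same person, same pseudonym)
```

Exact matching like this would treat inflected Russian forms of one name as different people, so a real implementation would probably need lemmatization. The in-memory mapping also holds original values and should never be written to logs.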

Select layers / model

bash

# Fast: Natasha only
python3 ~/.claude/skills/therapy-anonymizer/scripts/anonymize.py session.txt --layers natasha

# LLM only: maximum coverage
python3 ~/.claude/skills/therapy-anonymizer/scripts/anonymize.py session.txt --layers ollama --model gemma4:e2b

Encrypt output (AES-256)

bash

python3 ~/.claude/skills/therapy-anonymizer/scripts/anonymize.py session.txt -o clean.txt --encrypt "password"

Invoking from Claude Code

To anonymize text already in context, pipe it through the script:

bash

echo '<text>' | python3 ~/.claude/skills/therapy-anonymizer/scripts/anonymize.py --json

For files, pass the path directly. Always recommend manual review after automated anonymization.
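
When the text contains quotes or newlines, shell quoting around `echo` becomes fragile. A hedged alternative is to feed the text on stdin from Python. `anonymize_text` is a hypothetical wrapper; only the script path and the `--json` flag come from the commands above, and the output is returned as a raw string because the report schema is not documented here.

```python
import subprocess
import sys
from pathlib import Path

# Script location as given in the usage examples above.
SCRIPT = Path.home() / ".claude/skills/therapy-anonymizer/scripts/anonymize.py"


def anonymize_text(text, command=None):
    """Pipe text to the anonymizer on stdin and return its stdout unparsed."""
    if command is None:
        command = [sys.executable, str(SCRIPT), "--json"]
    result = subprocess.run(
        command,
        input=text,
        capture_output=True,
        encoding="utf-8",  # transcripts may contain Cyrillic text
        check=True,        # raise CalledProcessError on a non-zero exit
    )
    return result.stdout
```

Passing the command as a list avoids the shell entirely, so no part of the transcript is interpreted as shell syntax.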

Limitations

Contextual identifiers ("the only red-haired architect in Kostroma") are NOT detected by any automated tool
OPF is English-focused — Russian coverage is partial
Medications are detected only by Layer 3 (requires Ollama)
Does not assess re-identification risk from combinations of non-PII fields

Guardrails

NEVER send raw transcripts to cloud services
Use cloud verification only on already-anonymized text
Always recommend manual review for therapy data
Never log original PII values
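
The last guardrail can be illustrated with a logging helper that records only labels and counts. This is a sketch under assumptions: spans are taken to be `(start, end, label)` tuples, and `log_detection_summary` is a hypothetical name, not part of the script.

```python
import logging
from collections import Counter

logger = logging.getLogger("anonymizer")


def log_detection_summary(spans):
    """Log how many spans of each label were found, never the matched text."""
    counts = Counter(label for _start, _end, label in spans)
    summary = ", ".join(f"{label}={n}" for label, n in sorted(counts.items()))
    logger.info("redacted %d spans (%s)", len(spans), summary)
    return summary


spans = [(5, 17, "NAME"), (21, 29, "PHONE"), (40, 44, "NAME")]
print(log_detection_summary(spans))  # NAME=2, PHONE=1
```

Because the helper never receives the transcript, it cannot leak an original value into the log.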
