data-anonymizer

Data Anonymizer

Detect and mask personally identifiable information (PII) in text documents and structured data. Supports multiple masking strategies and can process CSV files at scale.

Quick Start

python
from scripts.data_anonymizer import DataAnonymizer

# Anonymize text
anonymizer = DataAnonymizer()
result = anonymizer.anonymize("Contact John Smith at john@email.com or 555-123-4567")
print(result)
# "Contact [NAME] at [EMAIL] or [PHONE]"

# Anonymize CSV
anonymizer.anonymize_csv("customers.csv", "customers_anon.csv")

Features

  • PII Detection: Names, emails, phone numbers, SSNs, addresses, credit card numbers, dates of birth, IP addresses
  • Multiple Strategies: Mask, redact, hash, fake data replacement
  • CSV Processing: Anonymize specific columns or auto-detect
  • Reversible Tokens: Optional mapping for de-anonymization
  • Custom Patterns: Add your own PII patterns
  • Audit Report: List all detected PII with locations
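
To make the detection and audit-report features concrete, here is a minimal sketch of regex-based detection that records each match with its type and character offsets, then masks the matches. The `PATTERNS` table and the `find_pii` and `mask` helpers are hypothetical illustrations, not the tool's actual implementation. Names and addresses, which need NLP, are left out.

```python
import re

# Simplified, hypothetical patterns; the real tool's patterns may differ.
PATTERNS = {
    "email": re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+"),
    "phone": re.compile(r"\(?\b\d{3}\)?[-. ]\d{3}-\d{4}\b"),
    "ssn": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
}


def find_pii(text):
    """Return detected PII items (type, value, offsets) sorted by position."""
    findings = []
    for pii_type, pattern in PATTERNS.items():
        for match in pattern.finditer(text):
            findings.append({
                "type": pii_type,
                "value": match.group(),
                "start": match.start(),
                "end": match.end(),
            })
    return sorted(findings, key=lambda item: item["start"])


def mask(text):
    """Replace each finding with its type label, e.g. [EMAIL]."""
    # Work right to left so earlier offsets stay valid after each replacement.
    for item in reversed(find_pii(text)):
        label = "[" + item["type"].upper() + "]"
        text = text[:item["start"]] + label + text[item["end"]:]
    return text


print(mask("Reach me at john@email.com or 555-123-4567"))
```

The list returned by `find_pii` has the shape an audit report needs: one entry per detection with its location. The sketch assumes matches do not overlap; a fuller version would need to resolve overlapping detections before replacing.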

API Reference

Initialization

python
anonymizer = DataAnonymizer(
    strategy="mask",      # mask, redact, hash, fake
    reversible=False      # Enable token mapping
)

Text Anonymization

python
# Basic anonymization
result = anonymizer.anonymize(text)

# With specific PII types
result = anonymizer.anonymize(text, pii_types=["email", "phone"])

# Get detected PII report
result, report = anonymizer.anonymize(text, return_report=True)

Masking Strategies

python
text = "Email john@test.com, call 555-1234"

# Mask (default) - replace with type labels
anonymizer.strategy = "mask"
# "Email [EMAIL], call [PHONE]"

# Redact - replace with asterisks
anonymizer.strategy = "redact"
# "Email ***************, call ********"

# Hash - replace with hash
anonymizer.strategy = "hash"
# "Email a1b2c3d4, call e5f6g7h8"

# Fake - replace with realistic fake data
anonymizer.strategy = "fake"
# "Email jane@example.org, call 555-9876"
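
As a rough illustration of how the four strategies differ, the sketch below transforms a single detected value under each one. `apply_strategy` is a hypothetical helper, not part of the documented API. The 8-character SHA-256 prefix and the fixed placeholder values for `fake` are assumptions made to keep the example self-contained.

```python
import hashlib


def apply_strategy(value, pii_type, strategy="mask"):
    """Transform one detected PII value (hypothetical helper)."""
    if strategy == "mask":
        return "[" + pii_type.upper() + "]"
    if strategy == "redact":
        return "*" * len(value)
    if strategy == "hash":
        # Truncating to 8 hex characters is an assumption made for readability.
        return hashlib.sha256(value.encode("utf-8")).hexdigest()[:8]
    if strategy == "fake":
        # Stand-in values; the tool itself lists faker as a dependency.
        placeholders = {"email": "jane@example.org", "phone": "555-9876"}
        return placeholders.get(pii_type, "[FAKE]")
    raise ValueError("unknown strategy: " + strategy)


for name in ("mask", "redact", "hash", "fake"):
    print(name, "->", apply_strategy("john@test.com", "email", name))
```

Only `hash` is deterministic in a way that keeps equal inputs equal, which matters when anonymized values are later used as join keys.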

CSV Processing

python
# Auto-detect PII columns
anonymizer.anonymize_csv("input.csv", "output.csv")

# Specify columns
anonymizer.anonymize_csv(
    "input.csv",
    "output.csv",
    columns=["name", "email", "phone"]
)

# Different strategies per column
anonymizer.anonymize_csv(
    "input.csv",
    "output.csv",
    column_strategies={
        "name": "fake",
        "email": "hash",
        "ssn": "redact"
    }
)
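
The per-column idea can be sketched with the standard library alone. This is an illustrative approximation, not how `anonymize_csv` is implemented (the tool lists pandas as a dependency). The `anonymize_rows` helper and the lambda transforms are hypothetical.

```python
import csv
import io


def anonymize_rows(reader, writer, column_strategies):
    """Copy rows from reader to writer, transforming the listed columns."""
    writer.writeheader()
    for row in reader:
        for column, transform in column_strategies.items():
            if column in row:
                row[column] = transform(row[column])
        writer.writerow(row)


# In-memory files stand in for input.csv and output.csv.
source = io.StringIO("name,email,plan\nJohn Smith,john@test.com,pro\n")
target = io.StringIO()
reader = csv.DictReader(source)
writer = csv.DictWriter(target, fieldnames=reader.fieldnames, lineterminator="\n")

anonymize_rows(reader, writer, {
    "name": lambda value: "[NAME]",             # mask
    "email": lambda value: "*" * len(value),    # redact
})
print(target.getvalue())
```

Columns that are not listed, such as `plan` here, pass through unchanged, so the output keeps the same header and row count as the input.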

Reversible Anonymization

python
anonymizer = DataAnonymizer(reversible=True)

# Anonymize with token mapping
result = anonymizer.anonymize("John Smith: john@test.com")
mapping = anonymizer.get_mapping()

# Save mapping securely
anonymizer.save_mapping("mapping.json", encrypt=True, password="secret")

# Later, de-anonymize
anonymizer.load_mapping("mapping.json", password="secret")
original = anonymizer.deanonymize(result)
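
Conceptually, reversible anonymization replaces each value with a unique token and keeps a token-to-original table. The sketch below shows that idea for emails only. The numbered token format (`[EMAIL_1]`) and both helper functions are assumptions for illustration; the tool's real token format and mapping layout are not documented here.

```python
import re

EMAIL = re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+")


def anonymize_reversible(text, mapping):
    """Replace emails with numbered tokens and record token -> original."""
    # Invert the mapping so a value seen before reuses its existing token.
    seen = {original: token for token, original in mapping.items()}

    def replace(match):
        original = match.group()
        if original not in seen:
            token = "[EMAIL_" + str(len(seen) + 1) + "]"
            seen[original] = token
            mapping[token] = original
        return seen[original]

    return EMAIL.sub(replace, text)


def deanonymize(text, mapping):
    """Restore original values by substituting each token back."""
    for token, original in mapping.items():
        text = text.replace(token, original)
    return text


mapping = {}
masked = anonymize_reversible("From a@x.org to b@y.org, cc a@x.org", mapping)
print(masked)
print(deanonymize(masked, mapping))
```

Because the mapping contains every original value, it is as sensitive as the source data, which is why the API above offers encryption when saving it.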

Custom Patterns

python
# Add custom PII pattern
anonymizer.add_pattern(
    name="employee_id",
    pattern=r"EMP-\d{6}",
    label="[EMPLOYEE_ID]"
)
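
A custom pattern boils down to a compiled regex plus the label that replaces its matches. The following sketch mimics that with a module-level registry. The standalone `add_pattern` and `apply_custom_patterns` functions are hypothetical stand-ins, not the class's internals.

```python
import re

custom_patterns = {}


def add_pattern(name, pattern, label):
    """Register a compiled regex and the label that replaces its matches."""
    custom_patterns[name] = (re.compile(pattern), label)


def apply_custom_patterns(text):
    """Replace every match of every registered pattern with its label."""
    for compiled, label in custom_patterns.values():
        # A function replacement keeps backslashes in labels from being
        # interpreted as regex escapes.
        text = compiled.sub(lambda _match, label=label: label, text)
    return text


add_pattern("employee_id", r"EMP-\d{6}", "[EMPLOYEE_ID]")
print(apply_custom_patterns("Badge EMP-004217 was issued to EMP-12"))
```

Note that `EMP-\d{6}` also matches the first six digits of a longer run such as `EMP-1234567`; appending `(?!\d)` to the pattern is one way to require exactly six digits.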

CLI Usage

bash
# Anonymize text file
python data_anonymizer.py --input document.txt --output document_anon.txt

# Anonymize CSV
python data_anonymizer.py --input customers.csv --output customers_anon.csv

# Specific strategy
python data_anonymizer.py --input data.csv --output anon.csv --strategy fake

# Generate audit report
python data_anonymizer.py --input document.txt --report audit.json

# Specific PII types only
python data_anonymizer.py --input doc.txt --types email phone ssn

CLI Arguments

| Argument | Description | Default |
|---|---|---|
| --input | Input file | Required |
| --output | Output file | Required |
| --strategy | Masking strategy | mask |
| --types | PII types to detect | all |
| --columns | CSV columns to process | auto |
| --report | Generate audit report | - |
| --reversible | Enable token mapping | False |
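
For readers wiring up a similar command line, this is one way the arguments in the table could be declared with `argparse`. It is a sketch that follows the table as written (including `--output` being required), not the script's actual source. Representing "all" and "auto" as `None` is an assumption.

```python
import argparse


def build_parser():
    """Build a parser mirroring the documented CLI arguments."""
    parser = argparse.ArgumentParser(prog="data_anonymizer.py")
    parser.add_argument("--input", required=True, help="Input file")
    parser.add_argument("--output", required=True, help="Output file")
    parser.add_argument("--strategy", default="mask",
                        choices=["mask", "redact", "hash", "fake"],
                        help="Masking strategy")
    # None stands for "all types" / "auto-detect columns" in this sketch.
    parser.add_argument("--types", nargs="+", default=None,
                        help="PII types to detect (default: all)")
    parser.add_argument("--columns", nargs="+", default=None,
                        help="CSV columns to process (default: auto-detect)")
    parser.add_argument("--report", default=None, help="Audit report path")
    parser.add_argument("--reversible", action="store_true",
                        help="Enable token mapping")
    return parser


args = build_parser().parse_args(
    ["--input", "doc.txt", "--output", "out.txt", "--types", "email", "phone"]
)
print(args.strategy, args.types, args.reversible)
```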

Supported PII Types

| Type | Examples | Detection method |
|---|---|---|
| name | John Smith, Mary Johnson | NLP-based |
| email | user@domain.com | Regex |
| phone | 555-123-4567, (555) 123-4567 | Regex |
| ssn | 123-45-6789 | Regex |
| credit_card | 4111-1111-1111-1111 | Regex + Luhn |
| address | 123 Main St, City, ST 12345 | NLP + Regex |
| date_of_birth | 01/15/1990, January 15, 1990 | Regex |
| ip_address | 192.168.1.1 | Regex |
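
The "Regex + Luhn" entry for credit cards means a regex finds candidate digit runs and the Luhn checksum filters out runs that cannot be valid card numbers. A minimal version of the checksum step looks like this; the function name `luhn_valid` is illustrative.

```python
def luhn_valid(number):
    """Return True if the digits in `number` pass the Luhn checksum."""
    digits = [int(ch) for ch in number if ch in "0123456789"]
    if len(digits) < 2:
        return False
    total = 0
    # Walk from the rightmost digit, doubling every second one.
    for index, digit in enumerate(reversed(digits)):
        if index % 2 == 1:
            digit *= 2
            if digit > 9:
                digit -= 9
        total += digit
    return total % 10 == 0


print(luhn_valid("4111-1111-1111-1111"))  # True
print(luhn_valid("4111-1111-1111-1112"))  # False
```

Running the checksum after the regex reduces false positives from order numbers or other 16-digit identifiers that merely look like card numbers.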

Examples

Anonymize Customer Support Logs

python
anonymizer = DataAnonymizer(strategy="mask")

log = """
Ticket #1234: Customer John Doe (john.doe@company.com) called about
billing issue. SSN on file: 123-45-6789. Callback number: 555-867-5309.
Address: 123 Oak Street, Springfield, IL 62701.
"""

result = anonymizer.anonymize(log)
print(result)
# Ticket #1234: Customer [NAME] ([EMAIL]) called about
# billing issue. SSN on file: [SSN]. Callback number: [PHONE].
# Address: [ADDRESS].

GDPR Compliance for Database Export

python
anonymizer = DataAnonymizer(strategy="hash")

# Consistent hashing for joins
anonymizer.anonymize_csv(
    "users.csv",
    "users_anon.csv",
    columns=["email", "name", "phone"]
)
anonymizer.anonymize_csv(
    "orders.csv",
    "orders_anon.csv",
    columns=["customer_email"]  # Same hash as users.email
)
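
Joins across exported files keep working only if the same input always produces the same token. One common way to get that property is a keyed hash (HMAC), sketched below. The key, the `stable_hash` helper, the 16-character truncation, and the lowercase normalization are all assumptions; the hashing scheme the tool actually uses is not documented here.

```python
import hashlib
import hmac

SECRET_KEY = b"replace-with-a-real-secret"  # hypothetical key for illustration


def stable_hash(value, length=16):
    """Keyed hash: equal inputs always yield equal tokens under the same key."""
    # Assumption: emails should compare case-insensitively.
    normalized = value.strip().lower()
    digest = hmac.new(SECRET_KEY, normalized.encode("utf-8"), hashlib.sha256)
    return digest.hexdigest()[:length]


users = [{"email": "John@Test.com", "name": "John"}]
orders = [{"customer_email": "john@test.com", "total": 42}]

users_anon = [{**row, "email": stable_hash(row["email"])} for row in users]
orders_anon = [{**row, "customer_email": stable_hash(row["customer_email"])}
               for row in orders]

# The join key survives anonymization.
print(users_anon[0]["email"] == orders_anon[0]["customer_email"])  # True
```

Using a secret key rather than a plain hash makes it harder to recover emails by hashing a list of guesses, as long as the key is kept separate from the exported files.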

Generate Test Data from Production

python
anonymizer = DataAnonymizer(strategy="fake")

# Replace real PII with realistic fake data
anonymizer.anonymize_csv(
    "production_data.csv",
    "test_data.csv"
)
# Test data has same structure but fake PII

Dependencies

pandas>=2.0.0
faker>=18.0.0

Limitations

  • Name detection may miss unusual names
  • Address detection works best for US formats
  • Custom patterns may be needed for domain-specific PII
  • Fake data replacement doesn't preserve exact format