date-normalizer
Date Normalizer
Parse and normalize dates from various formats into consistent, standardized formats for data cleaning and ETL pipelines.
Purpose
Date standardization for:
- Data cleaning and ETL pipelines
- Database imports with mixed date formats
- Log file parsing and analysis
- International data harmonization
- Report generation with consistent dating
Features
- Smart Parsing: Automatically detect and parse 100+ date formats
- Format Conversion: Convert to ISO 8601, US, EU, or custom formats
- Batch Processing: Normalize entire CSV columns
- Ambiguity Detection: Flag dates that could be interpreted multiple ways
- Timezone Handling: Convert and normalize timezones
- Relative Dates: Parse "today", "yesterday", "next week"
- Validation: Detect and report invalid dates
Quick Start
```python
from date_normalizer import DateNormalizer

# Normalize single date
normalizer = DateNormalizer()
result = normalizer.normalize("03/14/2024")
print(result)  # {'normalized': '2024-03-14', 'format': 'iso8601'}

# Normalize to specific format
result = normalizer.normalize("March 14, 2024", output_format="us")
print(result)  # {'normalized': '03/14/2024', 'format': 'us'}

# Batch normalize CSV column
normalizer.normalize_csv(
    'data.csv',
    date_column='created_at',
    output='normalized.csv',
    output_format='iso8601'
)
```
CLI Usage
```bash
# Normalize single date
python date_normalizer.py --date "March 14, 2024"

# Convert to specific format
python date_normalizer.py --date "14/03/2024" --format us

# Normalize CSV column
python date_normalizer.py --csv data.csv --column date --format iso8601 --output normalized.csv

# Detect ambiguous dates
python date_normalizer.py --date "01/02/03" --detect-ambiguous
```
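The flags used in the commands above could be declared with `argparse` roughly as follows. This is a sketch, not the tool's actual source: the `build_parser` name, the help strings, and the decision to make `--date` and `--csv` mutually exclusive are assumptions; only the flag names come from the examples.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Build a parser for the flags shown in the CLI examples (hypothetical layout)."""
    parser = argparse.ArgumentParser(prog="date_normalizer.py")

    # Assumption: a run handles either one date or one CSV file, never both.
    source = parser.add_mutually_exclusive_group(required=True)
    source.add_argument("--date", help="single date string to normalize")
    source.add_argument("--csv", help="path to a CSV file to normalize")

    parser.add_argument("--column", help="name of the date column (with --csv)")
    parser.add_argument("--format", default="iso8601",
                        help="output format name or custom pattern")
    parser.add_argument("--output", help="path of the CSV file to write (with --csv)")
    parser.add_argument("--detect-ambiguous", action="store_true",
                        help="report whether the date has more than one reading")
    return parser


args = build_parser().parse_args(["--date", "01/02/03", "--detect-ambiguous"])
print(args.date, args.format, args.detect_ambiguous)  # 01/02/03 iso8601 True
```

Passing an explicit argument list to `parse_args` keeps the sketch testable without touching `sys.argv`.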
API Reference
DateNormalizer
```python
from datetime import datetime
from typing import Dict, List, Optional


# Signatures only; method bodies are elided with `...`.
class DateNormalizer:
    def normalize(self, date_string: str, output_format: str = 'iso8601',
                  dayfirst: bool = False, yearfirst: bool = False) -> Dict: ...
    def normalize_batch(self, dates: List[str], **kwargs) -> List[Dict]: ...
    def normalize_csv(self, csv_path: str, date_column: str,
                      output: Optional[str] = None, **kwargs) -> str: ...
    def detect_format(self, date_string: str) -> str: ...
    def is_valid(self, date_string: str) -> bool: ...
    def is_ambiguous(self, date_string: str) -> bool: ...
    def parse_relative(self, relative_string: str) -> datetime: ...
```
Output Formats
ISO 8601 (default):
```python
'2024-03-14'                 # Date only
'2024-03-14T15:30:00'        # With time
'2024-03-14T15:30:00+00:00'  # With timezone
```
US Format:
```python
'03/14/2024'  # MM/DD/YYYY
```
EU Format:
```python
'14/03/2024'  # DD/MM/YYYY
```
Long Format:
```python
'March 14, 2024'
```
Custom Format:
```python
normalizer.normalize(date, output_format='%Y%m%d')  # '20240314'
```
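One way the named output formats could relate to custom patterns is a lookup table of `strftime` patterns with a fallback to the raw argument. The sketch below is illustrative only: `OUTPUT_PATTERNS`, `format_date`, and the `"long"` key are hypothetical names, not part of the documented API.

```python
from datetime import date

# Hypothetical lookup table: format name -> strftime pattern.
OUTPUT_PATTERNS = {
    "iso8601": "%Y-%m-%d",
    "us": "%m/%d/%Y",
    "eu": "%d/%m/%Y",
    "long": "%B %d, %Y",  # the key name "long" is an assumption
}


def format_date(value: date, output_format: str = "iso8601") -> str:
    """Render a date with a named format, or treat the argument as a strftime pattern."""
    pattern = OUTPUT_PATTERNS.get(output_format, output_format)
    return value.strftime(pattern)


d = date(2024, 3, 14)
print(format_date(d))            # 2024-03-14
print(format_date(d, "us"))      # 03/14/2024
print(format_date(d, "eu"))      # 14/03/2024
print(format_date(d, "%Y%m%d"))  # 20240314
```

Month names produced by `%B` depend on the active locale, so the long format is only guaranteed to read "March 14, 2024" under the default C or an English locale.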
Supported Input Formats
Numeric:
- 2024-03-14 (ISO)
- 03/14/2024 (US)
- 14/03/2024 (EU)
- 14.03.2024 (German)
- 2024/03/14 (Japanese)
- 20240314 (Compact)
Textual:
- March 14, 2024
- 14 March 2024
- Mar 14, 2024
- 14-Mar-2024
Relative:
- today, yesterday, tomorrow
- next week, last month
- 2 days ago, in 3 weeks
With Time:
- 2024-03-14 15:30:00
- 03/14/2024 3:30 PM
- 2024-03-14T15:30:00Z
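A common way to accept several absolute formats like those listed above is to try a list of `strptime` patterns in order. The following is a minimal sketch of that technique, not the library's parser: `CANDIDATE_PATTERNS` and `parse_date` are hypothetical names, the pattern list covers only the examples shown here, and relative expressions are not handled.

```python
from datetime import datetime
from typing import Optional

# Candidate patterns, tried in order. Order matters for ambiguous inputs:
# "%m/%d/%Y" before "%d/%m/%Y" gives slash dates a US reading by default.
CANDIDATE_PATTERNS = [
    "%Y-%m-%dT%H:%M:%S%z",
    "%Y-%m-%d %H:%M:%S",
    "%m/%d/%Y %I:%M %p",
    "%Y-%m-%d",
    "%m/%d/%Y",
    "%d/%m/%Y",
    "%d.%m.%Y",
    "%Y/%m/%d",
    "%Y%m%d",
    "%B %d, %Y",
    "%d %B %Y",
    "%b %d, %Y",
    "%d-%b-%Y",
]


def parse_date(text: str) -> Optional[datetime]:
    """Return the first successful parse, or None if no pattern matches."""
    for pattern in CANDIDATE_PATTERNS:
        try:
            return datetime.strptime(text.strip(), pattern)
        except ValueError:
            continue
    return None


print(parse_date("14.03.2024"))          # 2024-03-14 00:00:00
print(parse_date("03/14/2024 3:30 PM"))  # 2024-03-14 15:30:00
print(parse_date("not a date"))          # None
```

Returning `None` instead of raising makes the same function usable as a validity check.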
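Relative expressions need a reference date to resolve against. The sketch below shows one possible approach with a fixed table plus two regular expressions. It is an illustration under stated assumptions: `resolve_relative` is a hypothetical helper (the documented `parse_relative` may behave differently), "next week" is taken to mean seven days ahead, and month-based phrases such as "last month" are left out because month lengths vary.

```python
import re
from datetime import date, timedelta
from typing import Optional

UNIT_DAYS = {"day": 1, "week": 7}  # months are omitted: their length varies

# Assumption: "next week" / "last week" mean exactly seven days ahead / back.
FIXED_OFFSETS = {"today": 0, "yesterday": -1, "tomorrow": 1,
                 "next week": 7, "last week": -7}


def resolve_relative(text: str, today: Optional[date] = None) -> Optional[date]:
    """Resolve a small set of relative expressions against `today`."""
    today = today or date.today()
    text = text.strip().lower()

    if text in FIXED_OFFSETS:
        return today + timedelta(days=FIXED_OFFSETS[text])

    # "2 days ago" counts backwards, "in 3 weeks" counts forwards.
    ago = re.fullmatch(r"(\d+) (day|week)s? ago", text)
    if ago:
        return today - timedelta(days=int(ago.group(1)) * UNIT_DAYS[ago.group(2)])
    ahead = re.fullmatch(r"in (\d+) (day|week)s?", text)
    if ahead:
        return today + timedelta(days=int(ahead.group(1)) * UNIT_DAYS[ahead.group(2)])
    return None


anchor = date(2024, 3, 14)
print(resolve_relative("yesterday", anchor))   # 2024-03-13
print(resolve_relative("2 days ago", anchor))  # 2024-03-12
print(resolve_relative("in 3 weeks", anchor))  # 2024-04-04
```

Passing the reference date explicitly keeps results reproducible in tests and in batch jobs that run across midnight.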
Ambiguity Handling
Dates like 01/02/03 are ambiguous. Specify the interpretation explicitly:
```python
# Day first (EU)
normalizer.normalize("01/02/03", dayfirst=True)
# Result: 2003-02-01

# Month first (US)
normalizer.normalize("01/02/03", dayfirst=False)
# Result: 2003-01-02

# Year first
normalizer.normalize("01/02/03", yearfirst=True)
# Result: 2001-02-03
```
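Ambiguity can be detected by counting how many field orders yield a valid calendar date. The sketch below tries month-first, day-first, and year-first readings of a numeric date and reports the distinct results. It is a heuristic illustration rather than the library's algorithm: `readings` is a hypothetical helper, this `is_ambiguous` is a standalone function, and two-digit years are assumed to fall in the 2000s, which matches the results shown above.

```python
import re
from datetime import date
from typing import Set


def readings(date_string: str) -> Set[date]:
    """Return the distinct calendar dates a numeric date string could denote."""
    parts = re.split(r"[/.\-]", date_string.strip())
    if len(parts) != 3 or not all(p.isdigit() for p in parts):
        return set()
    a, b, c = parts

    # (year, month, day) under month-first, day-first, and year-first order.
    candidates = [(c, a, b), (c, b, a), (a, b, c)]
    found = set()
    for year_text, month_text, day_text in candidates:
        if len(year_text) not in (2, 4):
            continue
        # Assumption: two-digit years belong to the 2000s.
        year = int(year_text) + 2000 if len(year_text) == 2 else int(year_text)
        try:
            found.add(date(year, int(month_text), int(day_text)))
        except ValueError:
            pass  # not a real calendar date under this field order
    return found


def is_ambiguous(date_string: str) -> bool:
    """A date is ambiguous when more than one reading is a valid date."""
    return len(readings(date_string)) > 1


print(sorted(d.isoformat() for d in readings("01/02/03")))
# ['2001-02-03', '2003-01-02', '2003-02-01']
print(is_ambiguous("03/14/2024"))  # False: 14 cannot be a month
```

Collecting readings in a set means a date such as 05/05/2024 is not flagged, since both orders give the same day.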
Use Cases
Clean Messy Data:
```python
messy_dates = [
    "March 14, 2024",
    "2024-03-15",
    "03/16/2024",
    "17-Mar-2024"
]
normalized = normalizer.normalize_batch(messy_dates)
# All converted to: ['2024-03-14', '2024-03-15', '2024-03-16', '2024-03-17']
```
CSV Normalization:
```python
# Input CSV with mixed date formats
# Convert all to ISO 8601
normalizer.normalize_csv(
    'orders.csv',
    date_column='order_date',
    output='orders_normalized.csv',
    output_format='iso8601'
)
```
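For readers who want to see what a column rewrite involves, here is a minimal standard-library sketch that streams rows through `csv.DictReader` and `csv.DictWriter`. It is not the implementation behind `normalize_csv`: `to_iso`, `normalize_column`, and the short `PATTERNS` list are hypothetical, and in-memory buffers stand in for real files so the example is self-contained.

```python
import csv
import io
from datetime import datetime

PATTERNS = ["%Y-%m-%d", "%m/%d/%Y", "%B %d, %Y", "%d-%b-%Y"]  # illustrative subset


def to_iso(text: str) -> str:
    """Return an ISO 8601 date, or the original text if no pattern matches."""
    for pattern in PATTERNS:
        try:
            return datetime.strptime(text.strip(), pattern).date().isoformat()
        except ValueError:
            continue
    return text


def normalize_column(source, target, date_column: str) -> int:
    """Copy CSV rows from source to target, rewriting one column; return the row count."""
    reader = csv.DictReader(source)
    writer = csv.DictWriter(target, fieldnames=reader.fieldnames)
    writer.writeheader()
    count = 0
    for row in reader:
        row[date_column] = to_iso(row[date_column])
        writer.writerow(row)
        count += 1
    return count


src = io.StringIO('order_id,order_date\n1,"March 14, 2024"\n2,03/16/2024\n3,17-Mar-2024\n')
dst = io.StringIO()
rows = normalize_column(src, dst, "order_date")
print(rows)
print(dst.getvalue())
```

Unparseable values are passed through unchanged here so that no row is lost; with real files, open both with `newline=""` as the `csv` module documentation recommends.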
Validation:
```python
if not normalizer.is_valid("invalid date"):
    print("Invalid date detected")
```
Timezone Conversion:
```python
normalizer.normalize(
    "2024-03-14 15:30:00+00:00",
    output_timezone='America/New_York'
)
```
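The timezone conversion above can be reproduced with the standard library to see what result to expect. This sketch uses `datetime.fromisoformat` and `zoneinfo.ZoneInfo`; `convert_timezone` is a hypothetical helper, and rejecting timestamps without an offset is an assumption about sensible behavior, not documented library behavior. On platforms without a system timezone database, the `tzdata` package is needed.

```python
from datetime import datetime
from zoneinfo import ZoneInfo  # standard library since Python 3.9


def convert_timezone(timestamp: str, output_timezone: str) -> str:
    """Parse an offset-aware ISO 8601 timestamp and re-express it in another zone."""
    moment = datetime.fromisoformat(timestamp)
    if moment.tzinfo is None:
        # Assumption: refuse to guess the source zone of a naive timestamp.
        raise ValueError("timestamp has no UTC offset; cannot convert safely")
    return moment.astimezone(ZoneInfo(output_timezone)).isoformat()


print(convert_timezone("2024-03-14 15:30:00+00:00", "America/New_York"))
# 2024-03-14T11:30:00-04:00
```

The offset is -04:00 rather than -05:00 because US daylight saving time was already in effect on 14 March 2024.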
Limitations
- Cannot parse dates from images or PDFs (use OCR first)
- Ambiguous dates require manual specification of format
- Very old dates (<1900) may have limited support
- Non-Gregorian calendars not supported
- Some regional formats may need explicit configuration