date-normalizer

Date Normalizer

Parse and normalize dates from various formats into consistent, standardized formats for data cleaning and ETL pipelines.

Purpose

Date standardization for:
  • Data cleaning and ETL pipelines
  • Database imports with mixed date formats
  • Log file parsing and analysis
  • International data harmonization
  • Report generation with consistent date formatting

Features

  • Smart Parsing: Automatically detect and parse 100+ date formats
  • Format Conversion: Convert to ISO 8601, US, EU, or custom formats
  • Batch Processing: Normalize entire CSV columns
  • Ambiguity Detection: Flag dates that could be interpreted multiple ways
  • Timezone Handling: Convert and normalize timezones
  • Relative Dates: Parse "today", "yesterday", "next week"
  • Validation: Detect and report invalid dates

Quick Start

```python
from date_normalizer import DateNormalizer

# Normalize single date
normalizer = DateNormalizer()
result = normalizer.normalize("03/14/2024")
print(result)  # {'normalized': '2024-03-14', 'format': 'iso8601'}

# Normalize to specific format
result = normalizer.normalize("March 14, 2024", output_format="us")
print(result)  # {'normalized': '03/14/2024', 'format': 'us'}

# Batch normalize CSV column
normalizer.normalize_csv(
    'data.csv',
    date_column='created_at',
    output='normalized.csv',
    output_format='iso8601'
)
```
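
How `normalize_csv` rewrites a column is not shown here. As a rough illustration of the idea only, the sketch below uses Python's standard `csv` module to copy a file while converting one column. `rewrite_column` and `us_to_iso` are hypothetical names, not part of `date_normalizer`.

```python
import csv
from datetime import datetime
from typing import Callable


def rewrite_column(src: str, dst: str, column: str,
                   convert: Callable[[str], str]) -> int:
    """Copy src to dst, passing one column through convert; return rows written."""
    with open(src, newline="", encoding="utf-8") as fin, \
         open(dst, "w", newline="", encoding="utf-8") as fout:
        reader = csv.DictReader(fin)
        # Accessing fieldnames reads the header row, so column order is preserved.
        writer = csv.DictWriter(fout, fieldnames=reader.fieldnames)
        writer.writeheader()
        count = 0
        for row in reader:
            row[column] = convert(row[column])
            writer.writerow(row)
            count += 1
    return count


def us_to_iso(value: str) -> str:
    """Example converter: MM/DD/YYYY -> YYYY-MM-DD; other values pass through unchanged."""
    try:
        return datetime.strptime(value.strip(), "%m/%d/%Y").date().isoformat()
    except ValueError:
        return value
```

Passing the converter in as a callable keeps the CSV handling separate from the date logic, so unparseable cells can be left untouched, logged, or rejected depending on the pipeline.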

CLI Usage

```bash
# Normalize single date
python date_normalizer.py --date "March 14, 2024"

# Convert to specific format
python date_normalizer.py --date "14/03/2024" --format us

# Normalize CSV column
python date_normalizer.py --csv data.csv --column date --format iso8601 --output normalized.csv

# Detect ambiguous dates
python date_normalizer.py --date "01/02/03" --detect-ambiguous
```

API Reference

DateNormalizer

```python
from datetime import datetime
from typing import Dict, List, Optional


class DateNormalizer:
    # Signature summary only; method bodies are elided.
    def normalize(self, date_string: str, output_format: str = 'iso8601',
                  dayfirst: bool = False, yearfirst: bool = False) -> Dict: ...
    def normalize_batch(self, dates: List[str], **kwargs) -> List[Dict]: ...
    def normalize_csv(self, csv_path: str, date_column: str,
                      output: Optional[str] = None, **kwargs) -> str: ...
    def detect_format(self, date_string: str) -> str: ...
    def is_valid(self, date_string: str) -> bool: ...
    def is_ambiguous(self, date_string: str) -> bool: ...
    def parse_relative(self, relative_string: str) -> datetime: ...
```
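
The signatures above do not say which phrases `parse_relative` accepts or what reference date it uses. The following is a minimal sketch of one possible approach, assuming a fixed phrase table plus two regular expressions. `resolve_relative` is a hypothetical helper: it returns a `date` rather than a `datetime`, and it leaves out month-based phrases such as "last month" because those need calendar arithmetic rather than a fixed day count.

```python
import re
from datetime import date, timedelta
from typing import Optional

# Day counts for the units this sketch understands.
_UNIT_DAYS = {"day": 1, "week": 7}

# Fixed phrases mapped to an offset in days (assumed meanings).
_FIXED = {"today": 0, "yesterday": -1, "tomorrow": 1,
          "next week": 7, "last week": -7}


def resolve_relative(text: str, today: Optional[date] = None) -> Optional[date]:
    """Resolve a small set of relative phrases against today; None if unrecognized."""
    today = today or date.today()
    s = text.strip().lower()
    if s in _FIXED:
        return today + timedelta(days=_FIXED[s])
    m = re.fullmatch(r"(\d+) (day|week)s? ago", s)
    if m:
        return today - timedelta(days=int(m.group(1)) * _UNIT_DAYS[m.group(2)])
    m = re.fullmatch(r"in (\d+) (day|week)s?", s)
    if m:
        return today + timedelta(days=int(m.group(1)) * _UNIT_DAYS[m.group(2)])
    return None
```

Accepting `today` as a parameter makes the function deterministic in tests and lets a pipeline anchor relative phrases to the date a record was created rather than the date it is processed.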

Output Formats

**ISO 8601 (default):**
```python
'2024-03-14'  # Date only
'2024-03-14T15:30:00'  # With time
'2024-03-14T15:30:00+00:00'  # With timezone
```

**US Format:**
```python
'03/14/2024'  # MM/DD/YYYY
```

**EU Format:**
```python
'14/03/2024'  # DD/MM/YYYY
```

**Long Format:**
```python
'March 14, 2024'
```

**Custom Format:**
```python
normalizer.normalize(date, output_format='%Y%m%d')  # '20240314'
```
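
One plausible way to support both named and custom formats is to look the name up in a table and otherwise treat the value as a raw `strftime` pattern. The sketch below assumes that design. The `NAMED_FORMATS` table, the `long` key, and the `render` function are illustrative and not confirmed by the library, and only the date-only ISO 8601 variant is covered.

```python
from datetime import datetime

# Assumed name -> strftime pattern table; "long" is a made-up key for the Long Format.
NAMED_FORMATS = {
    "iso8601": "%Y-%m-%d",
    "us": "%m/%d/%Y",
    "eu": "%d/%m/%Y",
    "long": "%B %d, %Y",  # month name depends on the active locale
}


def render(dt: datetime, output_format: str = "iso8601") -> str:
    """Format dt with a named pattern, or use output_format as a strftime pattern."""
    pattern = NAMED_FORMATS.get(output_format, output_format)
    return dt.strftime(pattern)


d = datetime(2024, 3, 14)
print(render(d))             # 2024-03-14
print(render(d, "us"))       # 03/14/2024
print(render(d, "%Y%m%d"))   # 20240314
```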

Supported Input Formats

**Numeric:**
  • 2024-03-14 (ISO)
  • 03/14/2024 (US)
  • 14/03/2024 (EU)
  • 14.03.2024 (German)
  • 2024/03/14 (Japanese)
  • 20240314 (Compact)

**Textual:**
  • March 14, 2024
  • 14 March 2024
  • Mar 14, 2024
  • 14-Mar-2024

**Relative:**
  • today, yesterday, tomorrow
  • next week, last month
  • 2 days ago, in 3 weeks

**With Time:**
  • 2024-03-14 15:30:00
  • 03/14/2024 3:30 PM
  • 2024-03-14T15:30:00Z
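
The numeric, textual, and with-time formats listed above each map onto a `datetime.strptime` pattern. The sketch below tries such patterns in order and reports the first match. It is an assumption about how detection could work, not the library's actual method; the labels and the `detect` function are hypothetical, and relative phrases are not handled.

```python
from datetime import datetime
from typing import Optional, Tuple

# (label, strptime pattern) pairs, tried in order. Labels are illustrative.
PATTERNS = [
    ("iso", "%Y-%m-%d"),
    ("us", "%m/%d/%Y"),
    ("eu", "%d/%m/%Y"),
    ("german", "%d.%m.%Y"),
    ("japanese", "%Y/%m/%d"),
    ("compact", "%Y%m%d"),
    ("long", "%B %d, %Y"),
    ("day-month-year", "%d %B %Y"),
    ("short-month", "%b %d, %Y"),
    ("day-mon-year", "%d-%b-%Y"),
    ("iso-datetime", "%Y-%m-%d %H:%M:%S"),
    ("us-12h", "%m/%d/%Y %I:%M %p"),
    ("iso-utc", "%Y-%m-%dT%H:%M:%SZ"),  # literal Z; result is a naive datetime
]


def detect(text: str) -> Optional[Tuple[str, datetime]]:
    """Return (label, parsed datetime) for the first matching pattern, else None."""
    for label, pattern in PATTERNS:
        try:
            return label, datetime.strptime(text.strip(), pattern)
        except ValueError:
            continue  # pattern did not fit; try the next one
    return None
```

Because the first match wins, a string such as 01/02/2024 is reported as "us" even though it is also a valid "eu" date. Pattern order therefore acts as an implicit month-first default, which is why ambiguous inputs need the explicit handling described in the next section.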

Ambiguity Handling

Dates like 01/02/03 are ambiguous. Specify the interpretation explicitly:

```python
# Day first (EU)
normalizer.normalize("01/02/03", dayfirst=True)
# Result: 2003-02-01

# Month first (US)
normalizer.normalize("01/02/03", dayfirst=False)
# Result: 2003-01-02

# Year first
normalizer.normalize("01/02/03", yearfirst=True)
# Result: 2001-02-03
```
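
To make the notion of ambiguity concrete, the sketch below enumerates the month-first, day-first, and year-first readings of a three-part numeric date and keeps those that are real calendar dates. It is an illustration under stated assumptions, not the library's `is_ambiguous`: `possible_dates` and `looks_ambiguous` are hypothetical names, and two-digit years are assumed to mean 20xx.

```python
from datetime import date
from typing import Set


def _year(value: int) -> int:
    # Assumption: two-digit years are read as 20xx; longer values are used as-is.
    return value if value >= 100 else 2000 + value


def possible_dates(text: str, sep: str = "/") -> Set[date]:
    """Return every valid calendar date a three-part numeric string could mean."""
    a, b, c = (int(part) for part in text.split(sep))
    candidates = [
        (_year(c), a, b),  # month first: MM/DD/YY
        (_year(c), b, a),  # day first:   DD/MM/YY
        (_year(a), b, c),  # year first:  YY/MM/DD
    ]
    found = set()
    for y, m, d in candidates:
        try:
            found.add(date(y, m, d))
        except ValueError:
            pass  # impossible reading, e.g. month 14
    return found


def looks_ambiguous(text: str) -> bool:
    """True when more than one distinct date is a valid reading."""
    return len(possible_dates(text)) > 1


print(sorted(possible_dates("01/02/03")))
# [datetime.date(2001, 2, 3), datetime.date(2003, 1, 2), datetime.date(2003, 2, 1)]
print(looks_ambiguous("03/14/2024"))  # False
```

Collecting readings in a set means a string like 05/05/2024 is not flagged, since its month-first and day-first readings are the same date.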

Use Cases

**Clean Messy Data:**
```python
messy_dates = [
    "March 14, 2024",
    "2024-03-15",
    "03/16/2024",
    "17-Mar-2024"
]

normalized = normalizer.normalize_batch(messy_dates)
# All converted to: ['2024-03-14', '2024-03-15', '2024-03-16', '2024-03-17']
```

**CSV Normalization:**
```python
# Input CSV with mixed date formats
# Convert all to ISO 8601
normalizer.normalize_csv(
    'orders.csv',
    date_column='order_date',
    output='orders_normalized.csv',
    output_format='iso8601'
)
```

**Validation:**
```python
if not normalizer.is_valid("invalid date"):
    print("Invalid date detected")
```

**Timezone Conversion:**
```python
normalizer.normalize(
    "2024-03-14 15:30:00+00:00",
    output_timezone='America/New_York'
)
```
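
The return value of the timezone example is not shown. For reference, the same conversion can be sketched with the standard library's `zoneinfo` module (Python 3.9+, which needs IANA time zone data on the system). `convert_timezone` is a hypothetical helper, not a `date_normalizer` function.

```python
from datetime import datetime
from zoneinfo import ZoneInfo


def convert_timezone(iso_text: str, target: str) -> str:
    """Parse an offset-aware ISO 8601 string and re-express it in the target zone."""
    dt = datetime.fromisoformat(iso_text)
    if dt.tzinfo is None:
        # Without an offset the source zone is unknown, so refuse to guess.
        raise ValueError("input has no UTC offset")
    return dt.astimezone(ZoneInfo(target)).isoformat()


print(convert_timezone("2024-03-14 15:30:00+00:00", "America/New_York"))
# 2024-03-14T11:30:00-04:00
```

The offset is -04:00 rather than -05:00 because US daylight saving time was already in effect on 14 March 2024. Using a named zone instead of a fixed offset is what lets the conversion account for this.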

Limitations

  • Cannot parse dates from images or PDFs (use OCR first)
  • Ambiguous dates (such as 01/02/03) require an explicit dayfirst or yearfirst setting
  • Dates before 1900 may have limited support
  • Non-Gregorian calendars not supported
  • Some regional formats may need explicit configuration