anomaly-detector

Anomaly Detector

Audience: Data engineers and analysts detecting outliers in datasets.
Goal: Provide production-ready anomaly detection functions for various data types.

Scripts

Import the detection functions from scripts/anomaly_detection.py:

```python
from scripts.anomaly_detection import (
    detect_anomalies_zscore,
    detect_anomalies_iqr,
    detect_anomalies_modified_zscore,
    detect_anomalies_isolation_forest,
    detect_anomalies_lof,
    detect_anomalies_rolling,
    detect_anomalies_stl,
    detect_anomalies_ensemble,
)
```

Method Selection

| Method | Best For | Limitations |
|---|---|---|
| Z-Score | Normal distributions | Sensitive to outliers |
| IQR | Skewed distributions | Less sensitive overall |
| Modified Z-Score | Robust detection | Slower computation |
| Isolation Forest | High-dimensional data | Requires tuning |
| LOF | Local density anomalies | Computationally expensive |
| Rolling | Time-series with trends | Window size sensitive |
| STL | Seasonal time-series | Requires known period |
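
The "Sensitive to outliers" limitation of the Z-score can be seen on a small sample: extreme values inflate the mean and standard deviation they are measured against, so they can hide themselves. The sketch below is illustrative only and is not the code in scripts/anomaly_detection.py. The helper names `zscore_flags` and `modified_zscore_flags` are hypothetical, and the 0.6745 constant and 3.5 cutoff are a commonly used convention for the modified Z-score, assumed here rather than taken from the scripts.

```python
import statistics

def zscore_flags(values, threshold=3.0):
    """Flag points more than `threshold` sample standard deviations from the mean."""
    mean = statistics.fmean(values)
    std = statistics.stdev(values)
    return [abs(v - mean) / std > threshold for v in values]

def modified_zscore_flags(values, threshold=3.5):
    """Flag points using the median and the median absolute deviation (MAD)."""
    median = statistics.median(values)
    mad = statistics.median(abs(v - median) for v in values)
    if mad == 0:
        return [False] * len(values)  # no spread to measure against
    return [0.6745 * abs(v - median) / mad > threshold for v in values]

# Eight ordinary readings and two extreme ones
values = [10, 11, 9, 10, 12, 10, 11, 9, 95, 100]

print(sum(zscore_flags(values)))           # 0: the two extremes mask each other
print(sum(modified_zscore_flags(values)))  # 2: median and MAD are barely affected
```

With two large values in ten, the standard deviation grows to roughly 37, so neither value reaches three standard deviations. The median-based score is unaffected by that inflation.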

Usage Examples

Single Column Detection

```python
import pandas as pd
from scripts.anomaly_detection import detect_anomalies_zscore, detect_anomalies_iqr

df = pd.read_csv('data.csv')

# Z-score method (good for normal distributions)
anomalies_z = detect_anomalies_zscore(df['value'], threshold=3.0)

# IQR method (robust to skewed data)
anomalies_iqr = detect_anomalies_iqr(df['value'], multiplier=1.5)

print(f"Z-score found {anomalies_z.sum()} anomalies")
print(f"IQR found {anomalies_iqr.sum()} anomalies")
```
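
The source of scripts/anomaly_detection.py is not shown here, so the following is only a sketch of what a `multiplier`-style IQR check typically computes: the standard quartile fences, returned as a boolean mask so that `.sum()` counts the flagged rows. The function name `iqr_mask` is hypothetical.

```python
import pandas as pd

def iqr_mask(series: pd.Series, multiplier: float = 1.5) -> pd.Series:
    """Return a boolean mask, True where a value lies outside the IQR fences."""
    q1 = series.quantile(0.25)
    q3 = series.quantile(0.75)
    iqr = q3 - q1
    lower = q1 - multiplier * iqr
    upper = q3 + multiplier * iqr
    return (series < lower) | (series > upper)

values = pd.Series([12, 14, 13, 15, 14, 13, 12, 90])
mask = iqr_mask(values)

print(int(mask.sum()))        # 1
print(values[mask].tolist())  # [90]
```

Raising `multiplier` widens the fences and flags fewer points; 1.5 is the conventional default and 3.0 is often used for "extreme" outliers only.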

Multi-Column with Isolation Forest

```python
from scripts.anomaly_detection import detect_anomalies_isolation_forest

numeric_cols = ['revenue', 'quantity', 'price']
anomalies = detect_anomalies_isolation_forest(df, numeric_cols, contamination=0.01)

df_anomalies = df[anomalies]
```
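
For readers who want to see what a call like this might do internally, here is a minimal sketch built directly on scikit-learn's `IsolationForest`. It is an assumption about the general approach, not the actual implementation: the wrapper name `isolation_forest_mask`, the `random_state` argument, and the choice to drop rows with missing values are all hypothetical.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import IsolationForest

def isolation_forest_mask(df, columns, contamination=0.01, random_state=0):
    """Return a boolean Series, True for rows the forest labels as outliers."""
    features = df[columns].dropna()  # assumption: rows with NaN are never flagged
    model = IsolationForest(contamination=contamination, random_state=random_state)
    labels = model.fit_predict(features)  # -1 = outlier, 1 = inlier
    mask = pd.Series(False, index=df.index)
    mask.loc[features.index] = labels == -1
    return mask

# Synthetic data: 200 ordinary rows, then one row overwritten with extreme values
rng = np.random.default_rng(0)
demo = pd.DataFrame({
    "revenue": rng.normal(100, 5, 200),
    "quantity": rng.normal(10, 1, 200),
    "price": rng.normal(10, 0.5, 200),
})
demo.loc[199] = [1000.0, 100.0, 50.0]

mask = isolation_forest_mask(demo, ["revenue", "quantity", "price"])
print(demo[mask])
```

`contamination` is the expected share of anomalies, so the model flags roughly that fraction of rows whether or not they are truly unusual. This is the tuning burden noted in the method table.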

Ensemble Approach (Recommended)

```python
from scripts.anomaly_detection import detect_anomalies_ensemble

results = detect_anomalies_ensemble(
    df,
    columns=['revenue', 'quantity'],
    methods=['zscore', 'iqr', 'isolation_forest'],
    min_agreement=2  # Flag if 2+ methods agree
)

confirmed_anomalies = df[results['is_anomaly']]
```
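
The `min_agreement` rule amounts to majority-style voting across per-method masks. The sketch below shows that voting step in isolation, assuming each method has already produced one boolean mask per row. The function name `vote` and the `n_votes` column are hypothetical; only `is_anomaly` appears in the example above, and how the real function combines several `columns` is not specified here.

```python
import pandas as pd

def vote(masks: dict, min_agreement: int = 2) -> pd.DataFrame:
    """Combine per-method boolean masks into a vote count and a final flag."""
    votes = pd.DataFrame(masks).astype(int)
    result = pd.DataFrame({"n_votes": votes.sum(axis=1)})
    result["is_anomaly"] = result["n_votes"] >= min_agreement
    return result

# Hand-written masks standing in for the output of three detectors
masks = {
    "zscore": pd.Series([False, True, False, True]),
    "iqr": pd.Series([False, True, True, False]),
    "isolation_forest": pd.Series([False, True, False, False]),
}

results = vote(masks, min_agreement=2)
print(results["n_votes"].tolist())     # [0, 3, 1, 1]
print(results["is_anomaly"].tolist())  # [False, True, False, False]
```

Requiring agreement trades recall for precision: rows flagged by a single method are dropped, which reduces false positives from any one detector's blind spots.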

Time-Series Anomalies

```python
from scripts.anomaly_detection import detect_anomalies_rolling, detect_anomalies_stl

# Rolling window (for trending data)
anomalies_rolling = detect_anomalies_rolling(df['daily_sales'], window=7, n_std=2.0)

# STL decomposition (for seasonal data)
anomalies_stl = detect_anomalies_stl(df['monthly_revenue'], period=12, threshold=3.0)
```
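
To make the `window` and `n_std` parameters concrete, here is a sketch of one common rolling-window scheme: each point is compared with the mean and standard deviation of the `window` points before it. Whether the real function uses a trailing, centered, or self-inclusive window is not stated, so the one-step shift below is an assumption, and the name `rolling_mask` is hypothetical.

```python
import pandas as pd

def rolling_mask(series: pd.Series, window: int = 7, n_std: float = 2.0) -> pd.Series:
    """Flag points more than n_std rolling standard deviations from the rolling mean."""
    # Shift by one so each point is judged only against the window before it
    baseline = series.shift(1).rolling(window, min_periods=window)
    mean = baseline.mean()
    std = baseline.std()
    # The first `window` points have no full baseline; NaN comparisons yield False
    return (series - mean).abs() > n_std * std

sales = pd.Series([100, 102, 101, 103, 99, 104, 98, 101, 160, 102, 100])
mask = rolling_mask(sales)

print(sales[mask].tolist())  # [160]
```

Note how the spike inflates the baseline for the points that follow it. A short window recovers quickly but is noisy, while a long window is stable but slow to adapt, which is the window-size sensitivity listed in the method table.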

Dependencies

pandas
numpy
scikit-learn  # For Isolation Forest, LOF
statsmodels   # For STL decomposition