skill-system-eda

Skill System EDA

Use scripts/eda.py for deterministic tabular analysis artifacts.

Core Commands

bash
python3 scripts/eda.py profile-dataset --input data.csv --target Class --output /tmp/eda
python3 scripts/eda.py distribution-report --input data.csv --target Class --profile /tmp/eda/profile.yaml
python3 scripts/eda.py correlation-matrix --input data.csv --target Class --profile /tmp/eda/profile.yaml
python3 scripts/eda.py anomaly-profiling --input data.csv --target Class --profile /tmp/eda/profile.yaml
python3 scripts/eda.py feature-importance-scan --input data.csv --target Class --profile /tmp/eda/profile.yaml
python3 scripts/eda.py leakage-detector --input data.csv --target Class --profile /tmp/eda/profile.yaml
python3 scripts/eda.py save-contract --profile /tmp/eda/profile.yaml --output /tmp/eda/contract.yaml
python3 scripts/eda.py validate-contract --input new_data.csv --contract /tmp/eda/contract.yaml

Output Model

  • profile-dataset creates profile.yaml and report.md.
  • Later commands update profile.yaml and append sections to report.md.
  • save-contract emits contract.yaml.
  • validate-contract prints a JSON PASS / FAIL result with a violation list.
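
As a sketch of how a caller might consume the validate-contract output, the snippet below reads the last non-empty stdout line as JSON and turns it into a pass/fail decision. The status and violations field names come from the skill manifest. The shape of each violation entry is not specified here, so the error entries this helper produces for missing or malformed output are hypothetical.

```python
import json


def parse_validation_output(stdout: str) -> tuple[bool, list]:
    """Return (passed, violations) parsed from validate-contract stdout.

    Assumes the final non-empty line is a JSON object with a `status`
    field ("PASS" or "FAIL") and a `violations` list.
    """
    lines = [line for line in stdout.splitlines() if line.strip()]
    if not lines:
        # Fail closed: missing output counts as a failure (hypothetical entry).
        return False, [{"error": "no output"}]
    try:
        payload = json.loads(lines[-1])
    except json.JSONDecodeError:
        return False, [{"error": "last line is not JSON"}]
    if not isinstance(payload, dict):
        return False, [{"error": "last line is not a JSON object"}]
    # Anything other than an explicit PASS is treated as a failure.
    passed = payload.get("status") == "PASS"
    return passed, payload.get("violations", [])
```

Treating every unexpected case as a failure mirrors the fail-closed behaviour described for validate-contract, so a crashed or truncated run is never mistaken for a pass.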

Analysis Rules

  • Use Polars (not pandas) for data IO/aggregation/profiling flows.
  • Keep sampling deterministic with lazy .head(N) when --sample is used.
  • Treat profile.yaml as the machine-readable source of truth; report.md is the human-readable companion.
  • Use Polars + numpy + scipy for profiling, shifts, correlations, KS tests, and Cramér's V.
  • Use sklearn feature ranking only when sklearn is available; otherwise mark tree-based importance as explicitly skipped.
  • Use a lazy scan strategy for large CSV/parquet inputs (scan_csv / scan_parquet), with materialization delayed until needed.
  • Apply high-cardinality guards: feature importance skips one-hot encoding for columns with >50 unique values, and the profile truncates categorical columns (>100 unique values or >50% row cardinality) to their top-20 values.
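
The high-cardinality guards above can be sketched in a few lines. This is an illustration in plain Python, not the Polars code in scripts/eda.py: the function names and the returned dictionary keys are hypothetical, and ">50% row cardinality" is interpreted here as unique values divided by row count, which is an assumption.

```python
from collections import Counter

ONE_HOT_MAX_UNIQUE = 50     # >50 unique values: skip one-hot encoding
TRUNCATE_MAX_UNIQUE = 100   # >100 unique values: truncate in the profile
TRUNCATE_MAX_RATIO = 0.5    # unique/rows > 50%: truncate in the profile
TOP_K = 20                  # number of values kept after truncation


def should_one_hot(values: list) -> bool:
    """One-hot encode only when the column has at most 50 unique values."""
    return len(set(values)) <= ONE_HOT_MAX_UNIQUE


def summarize_categorical(values: list) -> dict:
    """Summarize a categorical column, truncating to the top-20 values if needed."""
    counts = Counter(values)
    n_unique = len(counts)
    ratio = n_unique / len(values) if values else 0.0
    truncated = n_unique > TRUNCATE_MAX_UNIQUE or ratio > TRUNCATE_MAX_RATIO
    # most_common(None) returns every value; most_common(20) keeps the top 20.
    top_values = counts.most_common(TOP_K if truncated else None)
    return {"n_unique": n_unique, "truncated": truncated, "top_values": top_values}
```

Recording the truncated flag alongside the values matters downstream: a consumer of the profile needs to know whether the listed values are exhaustive or only the most frequent ones.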

Memory Integration

  • By default, commands write a summary memory plus one memory per warning/critical finding.
  • Prefer skill-system-memory/scripts/mem.py store when available.
  • If memory writes fail or EDA_DISABLE_MEM_PY=1 is set, write fallback payloads under .memory/pending/.
  • Use --no-memory for deterministic tests or when no writeback is desired.
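
One way to picture the routing described above is the following minimal sketch. The function names, the timestamp-based file name, and the JSON payload format are hypothetical. Treating a missing mem.py as a reason to fall back is also an assumption, and the actual mem.py store invocation is left out because its arguments are not documented here.

```python
import json
import os
import time
from pathlib import Path

MEM_PY = Path("skill-system-memory/scripts/mem.py")
PENDING_DIR = Path(".memory/pending")


def memory_route(no_memory: bool, mem_py: Path = MEM_PY) -> str:
    """Return 'skip', 'mem_py', or 'pending' for a memory write."""
    if no_memory:
        return "skip"  # --no-memory: no writeback at all
    if os.environ.get("EDA_DISABLE_MEM_PY") == "1" or not mem_py.exists():
        return "pending"  # fall back to files under .memory/pending/
    return "mem_py"


def write_pending(payload: dict, pending_dir: Path = PENDING_DIR) -> Path:
    """Write a fallback payload as JSON (hypothetical file naming)."""
    pending_dir.mkdir(parents=True, exist_ok=True)
    path = pending_dir / f"{time.time_ns()}.json"
    path.write_text(json.dumps(payload, sort_keys=True), encoding="utf-8")
    return path
```

A caller that gets 'mem_py' would still need to call write_pending if the store command itself fails, which is the "memory writes fail" case in the list above.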

Contract Lifecycle

  • save-contract derives column requirements from profile.yaml.
  • Numeric ranges use observed bounds for tiny datasets and profile-derived percentile bounds for larger datasets.
  • Truncated categorical columns produce cardinality_range rules instead of allowed_values.
  • validate-contract fails closed and returns machine-readable violations.
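
The derivation rules above might look roughly like this. Everything specific in the sketch is an assumption made for illustration: the profile keys (row_count, min, max, p01, p99, truncated, n_unique, values), the 100-row cutoff for a "tiny" dataset, the choice of 1st/99th percentiles, and the layout of the cardinality_range rule. Only the rule names cardinality_range and allowed_values come from the text.

```python
def numeric_rule(stats: dict, tiny_threshold: int = 100) -> dict:
    """Derive a numeric range rule from per-column profile statistics."""
    if stats["row_count"] < tiny_threshold:
        # Tiny dataset: percentiles are unstable, so use observed bounds.
        return {"min": stats["min"], "max": stats["max"]}
    # Larger dataset: percentile bounds are less sensitive to outliers.
    return {"min": stats["p01"], "max": stats["p99"]}


def categorical_rule(column: dict) -> dict:
    """Derive a categorical rule; truncated columns cannot list all values."""
    if column["truncated"]:
        # The profile only kept the top values, so constrain cardinality instead.
        return {"cardinality_range": {"min": 1, "max": column["n_unique"]}}
    return {"allowed_values": sorted(column["values"])}
```

The split for categorical columns follows from the truncation guard: once a profile keeps only the most frequent values, an allowed_values rule built from it would reject legitimate rare values.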
skill
{
  "schema_version": "2.0",
  "id": "skill-system-eda",
  "version": "1.0.0",
  "capabilities": [
    "eda-profile",
    "eda-distribution",
    "eda-correlation",
    "eda-anomaly",
    "eda-feature-importance",
    "eda-leakage",
    "eda-contract-save",
    "eda-contract-validate"
  ],
  "effects": ["fs.read", "fs.write", "proc.exec"],
  "operations": {
    "profile-dataset": {
      "description": "Profile a CSV/parquet dataset and generate profile.yaml plus report.md.",
      "input": {
        "input": { "type": "string", "required": true },
        "target": { "type": "string", "required": false },
        "output": { "type": "string", "required": true },
        "sample": { "type": "integer", "required": false },
        "no_memory": { "type": "boolean", "required": false }
      },
      "output": {
        "description": "Artifact paths for the generated EDA profile",
        "fields": { "profile": "string", "report": "string" }
      },
      "entrypoints": {
        "unix": ["python3", "scripts/eda.py", "profile-dataset", "--input", "{input}", "--output", "{output}"]
      }
    },
    "distribution-report": {
      "description": "Append distribution and class-conditional analysis to an existing profile/report.",
      "input": {
        "input": { "type": "string", "required": true },
        "target": { "type": "string", "required": true },
        "profile": { "type": "string", "required": true }
      },
      "output": { "description": "Updated profile/report paths", "fields": { "profile": "string", "report": "string" } },
      "entrypoints": {
        "unix": ["python3", "scripts/eda.py", "distribution-report", "--input", "{input}", "--target", "{target}", "--profile", "{profile}"]
      }
    },
    "correlation-matrix": {
      "description": "Compute feature and target correlations and append them to profile/report.",
      "input": {
        "input": { "type": "string", "required": true },
        "target": { "type": "string", "required": false },
        "profile": { "type": "string", "required": true }
      },
      "output": { "description": "Updated profile/report paths", "fields": { "profile": "string", "report": "string" } },
      "entrypoints": {
        "unix": ["python3", "scripts/eda.py", "correlation-matrix", "--input", "{input}", "--profile", "{profile}"]
      }
    },
    "anomaly-profiling": {
      "description": "Compare class-conditional distributions and effect sizes.",
      "input": {
        "input": { "type": "string", "required": true },
        "target": { "type": "string", "required": true },
        "profile": { "type": "string", "required": true }
      },
      "output": { "description": "Updated profile/report paths", "fields": { "profile": "string", "report": "string" } },
      "entrypoints": {
        "unix": ["python3", "scripts/eda.py", "anomaly-profiling", "--input", "{input}", "--target", "{target}", "--profile", "{profile}"]
      }
    },
    "feature-importance-scan": {
      "description": "Rank features with mutual information and optional tree importances.",
      "input": {
        "input": { "type": "string", "required": true },
        "target": { "type": "string", "required": true },
        "profile": { "type": "string", "required": true }
      },
      "output": { "description": "Updated profile/report paths", "fields": { "profile": "string", "report": "string" } },
      "entrypoints": {
        "unix": ["python3", "scripts/eda.py", "feature-importance-scan", "--input", "{input}", "--target", "{target}", "--profile", "{profile}"]
      }
    },
    "leakage-detector": {
      "description": "Detect high-correlation, target-encoding, and temporal leakage indicators.",
      "input": {
        "input": { "type": "string", "required": true },
        "target": { "type": "string", "required": true },
        "profile": { "type": "string", "required": true }
      },
      "output": { "description": "Updated profile/report paths", "fields": { "profile": "string", "report": "string" } },
      "entrypoints": {
        "unix": ["python3", "scripts/eda.py", "leakage-detector", "--input", "{input}", "--target", "{target}", "--profile", "{profile}"]
      }
    },
    "save-contract": {
      "description": "Generate a data contract from a saved EDA profile.",
      "input": {
        "profile": { "type": "string", "required": true },
        "output": { "type": "string", "required": true }
      },
      "output": { "description": "Contract path", "fields": { "contract": "string" } },
      "entrypoints": {
        "unix": ["python3", "scripts/eda.py", "save-contract", "--profile", "{profile}", "--output", "{output}"]
      }
    },
    "validate-contract": {
      "description": "Validate a new dataset against a saved contract and emit PASS/FAIL JSON.",
      "input": {
        "input": { "type": "string", "required": true },
        "contract": { "type": "string", "required": true }
      },
      "output": { "description": "Validation status and violations", "fields": { "status": "string", "violations": "array" } },
      "entrypoints": {
        "unix": ["python3", "scripts/eda.py", "validate-contract", "--input", "{input}", "--contract", "{contract}"]
      }
    }
  },
  "stdout_contract": {
    "last_line_json": true
  }
}
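
To illustrate how an entrypoint template from the manifest could be turned into a runnable command, the sketch below fills the {name} placeholders with input values. How the actual skill runtime performs this substitution is not described here, so build_argv is a hypothetical helper, and the use of str.format is an assumption about the placeholder syntax.

```python
def build_argv(entrypoint: list[str], inputs: dict[str, str]) -> list[str]:
    """Fill `{name}` placeholders in a manifest entrypoint with input values.

    Raises KeyError if a placeholder has no matching input, so a missing
    required argument fails loudly instead of producing a broken command.
    """
    return [part.format(**inputs) for part in entrypoint]


# Entrypoint copied from the validate-contract operation in the manifest.
entry = [
    "python3", "scripts/eda.py", "validate-contract",
    "--input", "{input}", "--contract", "{contract}",
]
argv = build_argv(entry, {"input": "new_data.csv", "contract": "/tmp/eda/contract.yaml"})
print(argv)
```

Keeping the command as an argument list and passing it to subprocess.run without a shell means paths containing spaces need no quoting.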