skill-system-eda

Skill System EDA

Use `scripts/eda.py` for deterministic tabular analysis artifacts.


Core Commands


```bash
python3 scripts/eda.py profile-dataset --input data.csv --target Class --output /tmp/eda
python3 scripts/eda.py distribution-report --input data.csv --target Class --profile /tmp/eda/profile.yaml
python3 scripts/eda.py correlation-matrix --input data.csv --target Class --profile /tmp/eda/profile.yaml
python3 scripts/eda.py anomaly-profiling --input data.csv --target Class --profile /tmp/eda/profile.yaml
python3 scripts/eda.py feature-importance-scan --input data.csv --target Class --profile /tmp/eda/profile.yaml
python3 scripts/eda.py leakage-detector --input data.csv --target Class --profile /tmp/eda/profile.yaml
python3 scripts/eda.py save-contract --profile /tmp/eda/profile.yaml --output /tmp/eda/contract.yaml
python3 scripts/eda.py validate-contract --input new_data.csv --contract /tmp/eda/contract.yaml
```
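
The command sequence above can also be assembled programmatically. The following is a minimal sketch that assumes only the flags shown above; the helper name `build_pipeline` is hypothetical and not part of `scripts/eda.py`. It builds argument lists and leaves execution (for example with `subprocess.run`) to the caller.

```python
# Subcommands that take --input, --target, and --profile, in the order listed above.
ANALYSIS_STEPS = [
    "distribution-report",
    "correlation-matrix",
    "anomaly-profiling",
    "feature-importance-scan",
    "leakage-detector",
]


def build_pipeline(input_path: str, target: str, out_dir: str) -> list[list[str]]:
    """Return argv lists for profiling, analysis, and contract creation.

    Hypothetical helper: it mirrors the documented commands, nothing more.
    """
    base = ["python3", "scripts/eda.py"]
    out_dir = out_dir.rstrip("/")
    profile = f"{out_dir}/profile.yaml"
    contract = f"{out_dir}/contract.yaml"

    commands = [
        base + ["profile-dataset", "--input", input_path,
                "--target", target, "--output", out_dir]
    ]
    for step in ANALYSIS_STEPS:
        commands.append(
            base + [step, "--input", input_path,
                    "--target", target, "--profile", profile]
        )
    commands.append(
        base + ["save-contract", "--profile", profile, "--output", contract]
    )
    return commands


if __name__ == "__main__":
    for argv in build_pipeline("data.csv", "Class", "/tmp/eda"):
        print(" ".join(argv))
```

Keeping the commands as argument lists avoids shell quoting issues when dataset paths contain spaces.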


Output Model


`profile-dataset` creates `profile.yaml` and `report.md`.
Later commands update `profile.yaml` and append sections to `report.md`.
`save-contract` emits `contract.yaml`.
`validate-contract` prints a JSON result (`PASS` / `FAIL`) with a violation list.
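
A caller might consume the `validate-contract` result as sketched below. This assumes the final non-empty stdout line is a JSON object with `status` and `violations` fields, as the skill manifest in this document suggests; the shape of each violation entry is not specified here, so the sketch treats entries as opaque. The function name `parse_validation_output` is hypothetical.

```python
import json


def parse_validation_output(stdout: str) -> tuple[bool, list]:
    """Parse the last non-empty stdout line as the validation result.

    Assumed shape: {"status": "PASS" | "FAIL", "violations": [...]}.
    """
    lines = [line for line in stdout.splitlines() if line.strip()]
    if not lines:
        raise ValueError("no output to parse")
    result = json.loads(lines[-1])
    # Anything other than an explicit PASS is treated as a failure.
    passed = result.get("status") == "PASS"
    return passed, result.get("violations", [])


if __name__ == "__main__":
    # Illustrative input only; real violation entries may look different.
    sample = 'some log line\n{"status": "FAIL", "violations": ["example"]}\n'
    print(parse_validation_output(sample))
```

Treating a missing or unexpected `status` as a failure keeps the caller consistent with a fail-closed validator.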


Analysis Rules


Use Polars (not pandas) for data IO/aggregation/profiling flows.
Keep sampling deterministic with lazy `.head(N)` when `--sample` is used.
Treat `profile.yaml` as the machine-readable source of truth; `report.md` is the human-readable companion.
Use Polars + numpy + scipy for profiling, shifts, correlations, KS tests, and Cramer's V.
Use sklearn feature ranking only when available; otherwise keep tree-based importance explicitly skipped.
Use a lazy scan strategy for large CSV/Parquet inputs (`scan_csv` / `scan_parquet`), with materialization delayed until needed.
Apply high-cardinality guards: columns with `>50` unique values skip one-hot encoding in feature importance, and the profile truncates categorical columns (`>100` unique values or `>50%` row cardinality) to their top-20 values.
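
The high-cardinality guards can be illustrated with a small sketch. It uses plain Python rather than Polars so that it stays self-contained; the thresholds come from the rule above, while the function names and the returned field names (`n_unique`, `truncated`, `top_values`) are assumptions, not the actual `profile.yaml` schema. Null handling is ignored.

```python
from collections import Counter

ONE_HOT_MAX_UNIQUE = 50     # above this, skip one-hot encoding in feature importance
TRUNCATE_MAX_UNIQUE = 100   # above this, truncate the categorical profile
TRUNCATE_MAX_RATIO = 0.5    # or when unique values exceed 50% of the rows
TOP_K = 20                  # number of most frequent values kept after truncation


def should_one_hot(values: list) -> bool:
    """Return True when the column is small enough to one-hot encode."""
    return len(set(values)) <= ONE_HOT_MAX_UNIQUE


def profile_categorical(values: list) -> dict:
    """Summarize a categorical column, truncating to the top-K values if needed."""
    counts = Counter(values)
    n_unique = len(counts)
    n_rows = len(values)
    truncated = n_unique > TRUNCATE_MAX_UNIQUE or (
        n_rows > 0 and n_unique / n_rows > TRUNCATE_MAX_RATIO
    )
    top = counts.most_common(TOP_K) if truncated else counts.most_common()
    return {"n_unique": n_unique, "truncated": truncated, "top_values": top}


if __name__ == "__main__":
    column = [f"v{i}" for i in range(150)] * 4  # 150 unique values over 600 rows
    summary = profile_categorical(column)
    print(summary["truncated"], len(summary["top_values"]))
```

Note that the row-cardinality condition also triggers on near-unique columns with few rows, such as identifiers, even when they have fewer than 100 distinct values.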


Memory Integration


By default, commands write a summary memory plus one memory per warning/critical finding.

Prefer `skill-system-memory/scripts/mem.py store` when available.
If memory writes fail or `EDA_DISABLE_MEM_PY=1` is set, write fallback payloads under `.memory/pending/`.
Use `--no-memory` for deterministic tests or when no writeback is desired.
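
The fallback behaviour described above could look roughly like the following sketch. The `store` callable is a hypothetical stand-in for invoking `mem.py store` (its command-line arguments are not described here), and the fallback file naming and JSON payload format are assumptions.

```python
import json
import os
import time
from pathlib import Path
from typing import Callable


def write_memory(
    payload: dict,
    store: Callable[[dict], None],
    pending_dir: str = ".memory/pending",
) -> str:
    """Try the primary store; on failure or opt-out, write a pending file.

    Returns "stored" on success, otherwise the path of the fallback file.
    """
    if os.environ.get("EDA_DISABLE_MEM_PY") != "1":
        try:
            store(payload)
            return "stored"
        except Exception:
            pass  # fall through to the file fallback

    target_dir = Path(pending_dir)
    target_dir.mkdir(parents=True, exist_ok=True)
    # Assumed naming scheme: one JSON file per payload, keyed by timestamp.
    path = target_dir / f"{time.time_ns()}.json"
    path.write_text(json.dumps(payload), encoding="utf-8")
    return str(path)
```

A `--no-memory` run would skip this function entirely, which is what makes it suitable for deterministic tests.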


Contract Lifecycle


`save-contract` derives column requirements from `profile.yaml`.
Numeric ranges use observed bounds for tiny datasets and profile-derived percentile bounds for larger datasets.
Truncated categorical columns produce `cardinality_range` rules instead of `allowed_values`.
`validate-contract` fails closed and returns machine-readable violations.
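
As a rough illustration of these rules, the sketch below derives contract entries from per-column statistics. Everything beyond the rule names `cardinality_range` and `allowed_values` is an assumption: the statistic keys (`count`, `min`, `max`, `p01`, `p99`, `truncated`, `n_unique`, `values`), the row threshold for a "tiny" dataset, the choice of percentiles, and the layout of each rule are hypothetical and may differ from what `scripts/eda.py` writes.

```python
def numeric_rule(stats: dict, tiny_rows: int = 50) -> dict:
    """Observed bounds for tiny datasets, percentile bounds otherwise.

    `tiny_rows` and the p01/p99 percentiles are illustrative assumptions.
    """
    if stats["count"] < tiny_rows:
        return {"min": stats["min"], "max": stats["max"]}
    return {"min": stats["p01"], "max": stats["p99"]}


def categorical_rule(stats: dict) -> dict:
    """Truncated columns get a cardinality rule instead of a value list."""
    if stats.get("truncated"):
        # Hypothetical layout: bound the number of distinct values.
        return {"cardinality_range": {"min": 1, "max": stats["n_unique"]}}
    return {"allowed_values": sorted(stats["values"])}


def violates_range(rule: dict, value) -> bool:
    """Fail closed: a value that cannot be shown to be in range is a violation."""
    try:
        return not (rule["min"] <= value <= rule["max"])
    except TypeError:
        return True  # e.g. None or a non-numeric value


if __name__ == "__main__":
    rule = numeric_rule({"count": 10, "min": 0, "max": 9, "p01": 1, "p99": 8})
    print(rule, violates_range(rule, 5), violates_range(rule, None))
```

Failing closed here means missing, non-numeric, and NaN values are all reported as violations rather than silently accepted.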

```skill
{
  "schema_version": "2.0",
  "id": "skill-system-eda",
  "version": "1.0.0",
  "capabilities": [
    "eda-profile",
    "eda-distribution",
    "eda-correlation",
    "eda-anomaly",
    "eda-feature-importance",
    "eda-leakage",
    "eda-contract-save",
    "eda-contract-validate"
  ],
  "effects": ["fs.read", "fs.write", "proc.exec"],
  "operations": {
    "profile-dataset": {
      "description": "Profile a CSV/parquet dataset and generate profile.yaml plus report.md.",
      "input": {
        "input": { "type": "string", "required": true },
        "target": { "type": "string", "required": false },
        "output": { "type": "string", "required": true },
        "sample": { "type": "integer", "required": false },
        "no_memory": { "type": "boolean", "required": false }
      },
      "output": {
        "description": "Artifact paths for the generated EDA profile",
        "fields": { "profile": "string", "report": "string" }
      },
      "entrypoints": {
        "unix": ["python3", "scripts/eda.py", "profile-dataset", "--input", "{input}", "--output", "{output}"]
      }
    },
    "distribution-report": {
      "description": "Append distribution and class-conditional analysis to an existing profile/report.",
      "input": {
        "input": { "type": "string", "required": true },
        "target": { "type": "string", "required": true },
        "profile": { "type": "string", "required": true }
      },
      "output": { "description": "Updated profile/report paths", "fields": { "profile": "string", "report": "string" } },
      "entrypoints": {
        "unix": ["python3", "scripts/eda.py", "distribution-report", "--input", "{input}", "--target", "{target}", "--profile", "{profile}"]
      }
    },
    "correlation-matrix": {
      "description": "Compute feature and target correlations and append them to profile/report.",
      "input": {
        "input": { "type": "string", "required": true },
        "target": { "type": "string", "required": false },
        "profile": { "type": "string", "required": true }
      },
      "output": { "description": "Updated profile/report paths", "fields": { "profile": "string", "report": "string" } },
      "entrypoints": {
        "unix": ["python3", "scripts/eda.py", "correlation-matrix", "--input", "{input}", "--profile", "{profile}"]
      }
    },
    "anomaly-profiling": {
      "description": "Compare class-conditional distributions and effect sizes.",
      "input": {
        "input": { "type": "string", "required": true },
        "target": { "type": "string", "required": true },
        "profile": { "type": "string", "required": true }
      },
      "output": { "description": "Updated profile/report paths", "fields": { "profile": "string", "report": "string" } },
      "entrypoints": {
        "unix": ["python3", "scripts/eda.py", "anomaly-profiling", "--input", "{input}", "--target", "{target}", "--profile", "{profile}"]
      }
    },
    "feature-importance-scan": {
      "description": "Rank features with mutual information and optional tree importances.",
      "input": {
        "input": { "type": "string", "required": true },
        "target": { "type": "string", "required": true },
        "profile": { "type": "string", "required": true }
      },
      "output": { "description": "Updated profile/report paths", "fields": { "profile": "string", "report": "string" } },
      "entrypoints": {
        "unix": ["python3", "scripts/eda.py", "feature-importance-scan", "--input", "{input}", "--target", "{target}", "--profile", "{profile}"]
      }
    },
    "leakage-detector": {
      "description": "Detect high-correlation, target-encoding, and temporal leakage indicators.",
      "input": {
        "input": { "type": "string", "required": true },
        "target": { "type": "string", "required": true },
        "profile": { "type": "string", "required": true }
      },
      "output": { "description": "Updated profile/report paths", "fields": { "profile": "string", "report": "string" } },
      "entrypoints": {
        "unix": ["python3", "scripts/eda.py", "leakage-detector", "--input", "{input}", "--target", "{target}", "--profile", "{profile}"]
      }
    },
    "save-contract": {
      "description": "Generate a data contract from a saved EDA profile.",
      "input": {
        "profile": { "type": "string", "required": true },
        "output": { "type": "string", "required": true }
      },
      "output": { "description": "Contract path", "fields": { "contract": "string" } },
      "entrypoints": {
        "unix": ["python3", "scripts/eda.py", "save-contract", "--profile", "{profile}", "--output", "{output}"]
      }
    },
    "validate-contract": {
      "description": "Validate a new dataset against a saved contract and emit PASS/FAIL JSON.",
      "input": {
        "input": { "type": "string", "required": true },
        "contract": { "type": "string", "required": true }
      },
      "output": { "description": "Validation status and violations", "fields": { "status": "string", "violations": "array" } },
      "entrypoints": {
        "unix": ["python3", "scripts/eda.py", "validate-contract", "--input", "{input}", "--contract", "{contract}"]
      }
    }
  },
  "stdout_contract": {
    "last_line_json": true
  }
}
```
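
A consumer of the manifest above might expand an operation's `entrypoints.unix` template as sketched here. How the real skill runner performs this substitution is not described in this document, so the function names and the use of `str.format` are assumptions; the embedded operation is copied from the `save-contract` entry.

```python
SAVE_CONTRACT = {
    "input": {
        "profile": {"type": "string", "required": True},
        "output": {"type": "string", "required": True},
    },
    "entrypoints": {
        "unix": ["python3", "scripts/eda.py", "save-contract",
                 "--profile", "{profile}", "--output", "{output}"]
    },
}


def missing_required(operation: dict, params: dict) -> list[str]:
    """List required input names that the caller did not supply."""
    return [
        name
        for name, spec in operation["input"].items()
        if spec.get("required") and name not in params
    ]


def render_entrypoint(operation: dict, params: dict) -> list[str]:
    """Fill {name} placeholders in the unix entrypoint with parameter values."""
    missing = missing_required(operation, params)
    if missing:
        raise ValueError(f"missing required inputs: {missing}")
    return [part.format(**params) for part in operation["entrypoints"]["unix"]]


if __name__ == "__main__":
    argv = render_entrypoint(
        SAVE_CONTRACT,
        {"profile": "/tmp/eda/profile.yaml", "output": "/tmp/eda/contract.yaml"},
    )
    print(argv)
```

Because only the template strings are formatted, parameter values that happen to contain braces pass through unchanged.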
