data-quality-checker

Overview

Detect common data quality issues in market analysis documents before publication. The checker validates five categories: price scale consistency, instrument notation, date/weekday accuracy, allocation totals, and unit usage. All findings are advisory -- they flag potential issues for human review rather than blocking publication.

When to Use

  • Before publishing a weekly strategy blog or market analysis report
  • After generating automated market summaries
  • When reviewing translated documents (English/Japanese) for data accuracy
  • When combining data from multiple sources (FRED, FMP, FINVIZ) into one report
  • As a pre-flight check for any document containing financial data

Prerequisites

  • Python 3.9+
  • No external API keys required
  • No third-party Python packages required (uses only standard library)

Workflow

Step 1: Receive Input Document

Accept the target markdown file path and optional parameters:
  • `--file`: Path to the markdown document to validate (required)
  • `--checks`: Comma-separated list of checks to run (optional; default: all)
  • `--as-of`: Reference date for year inference, in YYYY-MM-DD format (optional)
  • `--output-dir`: Directory for report output (optional; default: `reports/`)
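
The options above could be parsed roughly as follows. This is a minimal sketch using the standard-library `argparse` module; the helper names `build_parser` and `parse_checks` are hypothetical and are not taken from the actual script, whose internals are not shown here.

```python
import argparse
from typing import List, Optional


def build_parser() -> argparse.ArgumentParser:
    """Build a parser for the four documented options (hypothetical helper)."""
    parser = argparse.ArgumentParser(description="Data quality checker (sketch)")
    parser.add_argument("--file", required=True,
                        help="Path to the markdown document to validate")
    parser.add_argument("--checks", default=None,
                        help="Comma-separated list of checks; omit to run all")
    parser.add_argument("--as-of", dest="as_of", default=None,
                        help="Reference date for year inference (YYYY-MM-DD)")
    parser.add_argument("--output-dir", dest="output_dir", default="reports/",
                        help="Directory for report output")
    return parser


def parse_checks(raw: Optional[str]) -> Optional[List[str]]:
    """Split the --checks value; None is used here to mean 'run every check'."""
    if raw is None:
        return None
    return [name.strip() for name in raw.split(",") if name.strip()]


# Parse an explicit argument list instead of sys.argv so the sketch is repeatable.
args = build_parser().parse_args(["--file", "doc.md", "--checks", "price_scale,dates"])
print(args.file, parse_checks(args.checks), args.as_of, args.output_dir)
```

Only `--file` is required; every other option falls back to the default described in the list above.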

Step 2: Execute Validation Script

Run the data quality checker script:
```bash
python3 skills/data-quality-checker/scripts/check_data_quality.py \
  --file path/to/document.md \
  --output-dir reports/
```
To run specific checks only:
```bash
python3 skills/data-quality-checker/scripts/check_data_quality.py \
  --file path/to/document.md \
  --checks price_scale,dates,allocations
```
To provide a reference date for year inference (useful for documents whose dates omit the year):
```bash
python3 skills/data-quality-checker/scripts/check_data_quality.py \
  --file path/to/document.md \
  --as-of 2026-02-28
```
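
To illustrate what a reference date buys you, here is a rough sketch of year inference for a date written without a year. The function `infer_year`, its parameters, and the 183-day cutoff are assumptions made for illustration only; the script's real rule for the cross-year case may well differ.

```python
import re
from datetime import date
from typing import Optional


def infer_year(month: int, day: int, as_of: Optional[date] = None,
               title: str = "", today: Optional[date] = None) -> int:
    """Guess the year for a month/day pair (hypothetical helper)."""
    # 1. An explicit reference date wins (assumed: its year is used directly).
    if as_of is not None:
        return as_of.year
    # 2. Otherwise look for a four-digit year in the title or metadata text.
    match = re.search(r"\b(?:19|20)\d{2}\b", title)
    if match:
        return int(match.group(0))
    # 3. Fall back to the current year, shifted when the date would land
    #    more than about six months away from today (assumed heuristic).
    today = today or date.today()
    try:
        candidate = date(today.year, month, day)
    except ValueError:  # e.g. February 29 in a non-leap year
        return today.year
    delta_days = (candidate - today).days
    if delta_days > 183:
        return today.year - 1
    if delta_days < -183:
        return today.year + 1
    return today.year


print(infer_year(3, 1, as_of=date(2026, 2, 28)))   # 2026
print(infer_year(12, 30, today=date(2026, 1, 5)))  # 2025
```

Without a reference date, a report written in early January that mentions "December 30" would otherwise be resolved to the wrong year, which is why passing `--as-of` is useful for archived documents.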

Step 3: Load Reference Standards

Read the relevant reference documents to contextualize findings:
  • `references/instrument_notation_standard.md` -- Standard ticker notation, digit-count hints, and naming conventions for each instrument class
  • `references/common_data_errors.md` -- Catalog of frequently observed errors including FRED data delays, ETF/futures scale confusion, holiday oversights, allocation total pitfalls, and unit confusion patterns
Use these references to explain findings and suggest corrections.

Step 4: Review Findings

Examine each finding in the output:
  • ERROR -- High-confidence issues (e.g., date-weekday mismatches verified by calendar computation). Correction is strongly recommended.
  • WARNING -- Likely issues that need human judgment (e.g., price scale anomalies, notation inconsistencies, allocation sums off by more than 0.5%).
  • INFO -- Informational notes (e.g., mixed bp/% usage that may be intentional).
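
Two of the examples above are easy to reproduce by hand when confirming a finding. The sketch below is not the script's implementation: `weekday_mismatch` and `allocation_off_target` are hypothetical helpers that recompute a weekday with `datetime` and apply the 0.5% allocation tolerance mentioned in the WARNING description.

```python
from datetime import date
from typing import Optional, Sequence

WEEKDAY_NAMES = ["Monday", "Tuesday", "Wednesday", "Thursday",
                 "Friday", "Saturday", "Sunday"]


def weekday_mismatch(d: date, claimed: str) -> Optional[str]:
    """Return the actual weekday name if it differs from the claimed one, else None."""
    actual = WEEKDAY_NAMES[d.weekday()]  # date.weekday() returns 0 for Monday
    if actual.lower() == claimed.strip().lower():
        return None
    return actual


def allocation_off_target(weights: Sequence[float], tolerance: float = 0.5) -> bool:
    """True when the weights (in percent) miss 100% by more than the tolerance."""
    return abs(sum(weights) - 100.0) > tolerance


print(weekday_mismatch(date(2026, 1, 1), "Monday"))  # Thursday
print(allocation_off_target([60.0, 30.0, 20.0]))     # True
```

The weekday check is deterministic, which is why it maps to ERROR; the allocation check depends on which percentages are treated as weights, so it remains a WARNING that needs human judgment.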

Step 5: Generate Quality Report

The script produces two output files:
  1. JSON report (`data_quality_YYYY-MM-DD_HHMMSS.json`): Machine-readable list of findings with severity, category, message, line number, and context.
  2. Markdown report (`data_quality_YYYY-MM-DD_HHMMSS.md`): Human-readable report grouped by severity level.
Present the findings to the user with explanations referencing the knowledge base. Suggest specific corrections for each issue.
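
When presenting findings, it can help to regroup the JSON report by severity. The sketch below assumes the report's top level is a plain JSON array of finding objects and that a finding without a location carries `"line_number": null`; both are assumptions, as is the `context` value of the first sample entry. The helper name `group_by_severity` is hypothetical.

```python
import json
from collections import defaultdict

SEVERITY_ORDER = ["ERROR", "WARNING", "INFO"]


def group_by_severity(findings):
    """Group finding dicts by severity, most severe first (hypothetical helper)."""
    grouped = defaultdict(list)
    for finding in findings:
        grouped[finding["severity"]].append(finding)
    return {level: grouped[level] for level in SEVERITY_ORDER if level in grouped}


# Inline sample standing in for the contents of a JSON report file.
sample_report = json.loads("""
[
  {"severity": "ERROR", "category": "dates",
   "message": "Date-weekday mismatch: January 1, 2026 (Monday) -- actual weekday is Thursday",
   "line_number": 12, "context": "January 1, 2026 (Monday)"},
  {"severity": "WARNING", "category": "price_scale",
   "message": "GLD: $2,800 has 4 digits (expected 2-3 digits)",
   "line_number": 5, "context": "GLD: $2,800"},
  {"severity": "WARNING", "category": "allocations",
   "message": "Allocation total: 110.0% (expected ~100%)",
   "line_number": null, "context": ""}
]
""")

grouped = group_by_severity(sample_report)
for level, items in grouped.items():
    print(f"{level} ({len(items)})")
    for item in items:
        line = item.get("line_number")
        location = f" (line {line})" if line is not None else ""
        print(f"  [{item['category']}]{location}: {item['message']}")
```

To use this on a real report, replace the inline sample with `json.load` on the generated file, after checking that the file's top-level shape matches the assumption above.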

Output Format

JSON Finding Structure

```json
{
  "severity": "WARNING",
  "category": "price_scale",
  "message": "GLD: $2,800 has 4 digits (expected 2-3 digits)",
  "line_number": 5,
  "context": "GLD: $2,800"
}
```

Markdown Report Structure

```markdown
# Data Quality Report

Source: path/to/document.md
Generated: 2026-02-28 14:30:00
Total findings: 3

## ERROR (1)

- [dates] (line 12): Date-weekday mismatch: January 1, 2026 (Monday) -- actual weekday is Thursday

## WARNING (2)

- [price_scale] (line 5): GLD: $2,800 has 4 digits (expected 2-3 digits)
  GLD: $2,800
- [allocations]: Allocation total: 110.0% (expected ~100%)
```

Resources

  • `scripts/check_data_quality.py` -- Main validation script
  • `references/instrument_notation_standard.md` -- Notation and price scale reference
  • `references/common_data_errors.md` -- Common error patterns and prevention

Key Principles

  1. Advisory mode: All findings are warnings for human review. The script always exits with code 0 on successful execution, even when findings are present. Exit code 1 is reserved for script failures (file not found, parse errors).
  2. Section-aware allocation checking: Only percentages within allocation sections (identified by headings like "配分", "Allocation", or table columns like "ウェイト", "目安比率") are checked. Random percentages in body text (probability, RSI, YoY growth) are ignored.
  3. Bilingual support: Handles both English and Japanese date formats, weekday names, and section headings. Full-width and typographic characters (%, 〜, en dash) are normalized before processing.
  4. Year inference: For dates without an explicit year, the checker infers the year using (in priority order): the `--as-of` option, a YYYY pattern found in the document title/metadata, or the current year with a 6-month cross-year heuristic.
  5. Digit-count heuristic: Price scale validation uses digit counts (number of digits before the decimal point) rather than absolute price ranges. This approach is resilient to price changes over time while still catching ETF/futures confusion errors.
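
The digit-count heuristic in principle 5 can be sketched in a few lines. This is an illustration, not the script's code: `integer_digits` and `check_price_scale` are hypothetical names, and the `EXPECTED_DIGITS` table holds only the GLD range that appears in the sample finding. Real ranges would come from `references/instrument_notation_standard.md`.

```python
import re
from typing import Dict, Optional, Tuple

# Expected digits before the decimal point, per instrument. Only the GLD entry
# is taken from the sample finding in this document.
EXPECTED_DIGITS: Dict[str, Tuple[int, int]] = {"GLD": (2, 3)}


def integer_digits(price_text: str) -> int:
    """Count digits before the decimal point, ignoring '$', commas, and spaces."""
    cleaned = re.sub(r"[^\d.]", "", price_text)
    return len(cleaned.split(".")[0])


def check_price_scale(ticker: str, price_text: str) -> Optional[str]:
    """Return a warning message when the digit count is outside the expected range."""
    if ticker not in EXPECTED_DIGITS:
        return None  # unknown instrument: nothing to compare against
    low, high = EXPECTED_DIGITS[ticker]
    digits = integer_digits(price_text)
    if low <= digits <= high:
        return None
    return f"{ticker}: {price_text} has {digits} digits (expected {low}-{high} digits)"


print(check_price_scale("GLD", "$2,800"))   # GLD: $2,800 has 4 digits (expected 2-3 digits)
print(check_price_scale("GLD", "$182.45"))  # None
```

Because the comparison uses a digit range rather than a price band, a gold ETF quoted at a gold futures price is still caught after years of price drift, at the cost of missing errors that stay within the same order of magnitude.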