sec-edgar-pipeline


SEC EDGAR Pipeline

Overview

This pipeline is centered on `edgar-analyzer` and the EDGAR data sources. The core loop is: configure credentials, create a project with examples, analyze patterns, generate code, run extraction, and export reports.

Setup (Keys + User Agent)

Use the setup wizard to configure the required keys:

python -m edgar_analyzer setup

or

edgar-analyzer setup

Required entries:

- `OPENROUTER_API_KEY`
- (Optional) `JINA_API_KEY`
- `EDGAR` user agent string ("Name email@example.com")
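
A quick pre-flight check can catch a missing key or a malformed user agent before any extraction runs. The sketch below is illustrative only: the function name `check_setup` and the idea of passing the keys in as a mapping are assumptions, not part of `edgar-analyzer`. Only the key names and the "Name email@example.com" shape come from the list above.

```python
import re

# Loose pattern for "Name email@example.com": name text, whitespace, then an email-like token.
USER_AGENT_RE = re.compile(r"^\S.*\s\S+@\S+\.\S+$")


def check_setup(env, user_agent):
    """Return a list of problems; an empty list means the setup looks complete."""
    problems = []
    if not env.get("OPENROUTER_API_KEY"):
        problems.append("OPENROUTER_API_KEY is missing")
    # JINA_API_KEY is optional, so its absence is not reported.
    if not USER_AGENT_RE.match(user_agent or ""):
        problems.append('user agent should look like "Name email@example.com"')
    return problems
```

For example, `check_setup(os.environ, "Jane Doe jane@example.com")` returns an empty list when the OpenRouter key is present in the environment.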

End-to-End CLI Workflow

1. Create project

edgar-analyzer project create my_project --template minimal

2. Add examples + project.yaml

projects/my_project/examples/*.json

3. Analyze examples

edgar-analyzer analyze-project projects/my_project

4. Generate extraction code

edgar-analyzer generate-code projects/my_project

5. Run extraction

edgar-analyzer run-extraction projects/my_project --output-format csv

Outputs land in `projects/<name>/output/`.
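
Once the project exists and the examples are in place, steps 3 to 5 can be chained from a script. This is a minimal sketch, assuming `edgar-analyzer` is on the `PATH`; the helper names `build_workflow_commands` and `run_workflow` are hypothetical, and the subcommands and the `--output-format` flag are copied from the steps above.

```python
import subprocess


def build_workflow_commands(project_dir, output_format="csv"):
    """Return steps 3-5 as argv lists, in the order they should run."""
    return [
        ["edgar-analyzer", "analyze-project", project_dir],
        ["edgar-analyzer", "generate-code", project_dir],
        ["edgar-analyzer", "run-extraction", project_dir, "--output-format", output_format],
    ]


def run_workflow(project_dir, output_format="csv"):
    """Run each step in order; check=True stops at the first non-zero exit code."""
    for argv in build_workflow_commands(project_dir, output_format):
        subprocess.run(argv, check=True)
```

Calling `run_workflow("projects/my_project")` raises `subprocess.CalledProcessError` if any step fails, so a later step never runs against stale analysis or generated code.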

EDGAR-Specific Conventions

  • CIK values are 10-digit, zero-padded (e.g., 0000320193).
  • Rate limit: SEC API allows 10 requests/sec. Scripts use ~0.11s delays.
  • User agent is mandatory; include name + email.
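
The first two conventions can be sketched in a few lines. The names `format_cik` and `Throttle` are hypothetical helpers, not part of the project's scripts; the 10-digit padding and the 0.11 s gap are taken from the bullets above.

```python
import time


def format_cik(cik):
    """Zero-pad a CIK to 10 digits, accepting ints or already-padded strings."""
    digits = str(cik).strip().lstrip("0") or "0"
    if not digits.isdigit() or len(digits) > 10:
        raise ValueError(f"not a valid CIK: {cik!r}")
    return digits.zfill(10)


class Throttle:
    """Enforce a minimum gap between calls (0.11 s stays just under 10 requests/sec)."""

    def __init__(self, min_interval=0.11):
        self.min_interval = min_interval
        self._last = None

    def wait(self):
        """Sleep only as long as needed since the previous call, then record the time."""
        now = time.monotonic()
        if self._last is not None:
            remaining = self.min_interval - (now - self._last)
            if remaining > 0:
                time.sleep(remaining)
        self._last = time.monotonic()
```

Calling `throttle.wait()` immediately before each request keeps the spacing independent of how long the previous response took.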

Scripted Example (Apple DEF 14A)

`edgar/scripts/fetch_apple_def14a.py` shows the direct flow:
  1. Fetch latest DEF 14A metadata
  2. Download HTML
  3. Parse Summary Compensation Table (SCT)
  4. Save raw HTML + extracted JSON + ground truth
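
Steps 1 and 2 can be sketched against the SEC submissions JSON. Whether `fetch_apple_def14a.py` works this way is not confirmed here. The sketch assumes the public `https://data.sec.gov/submissions/CIK<10-digit CIK>.json` document has already been downloaded and parsed, and that its `filings.recent` object holds parallel arrays named `form`, `filingDate`, `accessionNumber`, and `primaryDocument`. The helper names are hypothetical.

```python
def latest_filing(submissions, form_type="DEF 14A"):
    """Pick the most recent filing of form_type from a submissions-style dict, or None."""
    recent = submissions["filings"]["recent"]
    rows = zip(
        recent["form"],
        recent["filingDate"],
        recent["accessionNumber"],
        recent["primaryDocument"],
    )
    # Exact match, so amended filings ("DEF 14A/A") are not picked up by accident.
    matches = [row for row in rows if row[0] == form_type]
    if not matches:
        return None
    # Filing dates are ISO strings (YYYY-MM-DD), so string comparison orders them correctly.
    form, filing_date, accession, document = max(matches, key=lambda row: row[1])
    return {
        "form": form,
        "filingDate": filing_date,
        "accessionNumber": accession,
        "primaryDocument": document,
    }


def document_url(cik, filing):
    """Build the Archives URL for a filing's primary document."""
    folder = filing["accessionNumber"].replace("-", "")
    return f"https://www.sec.gov/Archives/edgar/data/{int(cik)}/{folder}/{filing['primaryDocument']}"
```

The resulting URL is what step 2 would download, with the mandatory user agent header set and the request delay applied.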

Recipe-Driven Extraction

`edgar/recipes/sct_extraction/config.yaml` defines a multi-step pipeline:
  • Fetch DEF 14A filings by company list
  • Extract SCT tables with `SCTAdapter`
  • Validate with `sct_validator`
  • Write results to `output/sct`
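
To make the validation step concrete: in the SEC's Summary Compensation Table, the Total column is the sum of the other dollar columns, so arithmetic consistency is one plausible check. The sketch below is an assumption about what such a check could look like, not a description of `sct_validator`; the function name and every field name are hypothetical.

```python
# Hypothetical field names for the dollar columns that add up to the Total column.
COMPONENT_FIELDS = (
    "salary",
    "bonus",
    "stock_awards",
    "option_awards",
    "non_equity_incentive",
    "pension_and_deferred_earnings",
    "all_other_compensation",
)


def validate_sct_row(row, tolerance=1.0):
    """Return a list of problems for one executive-year row; empty means the row passes."""
    problems = []
    if not row.get("name"):
        problems.append("missing executive name")
    if "total" not in row:
        problems.append("missing total")
        return problems
    # Missing or empty components count as zero, matching blank table cells.
    component_sum = sum(float(row.get(field) or 0) for field in COMPONENT_FIELDS)
    if abs(component_sum - float(row["total"])) > tolerance:
        problems.append(f"total {row['total']} != sum of components {component_sum}")
    return problems
```

The small tolerance absorbs rounding in the filed figures; rows that fail are good candidates for manual ground-truth review.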

Report Generation

`edgar/scripts/create_csv_reports.py` converts JSON results into:
  • executive_compensation_<timestamp>.csv
  • top_25_executives_<timestamp>.csv
  • company_summary_<timestamp>.csv
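
As a rough illustration of how one of these reports could be produced, the sketch below writes a top-N file from already-loaded JSON records. It is not the logic of `create_csv_reports.py`: the function name, the record fields (`company`, `name`, `total`), and the timestamp format are all assumptions; only the file-name pattern comes from the list above.

```python
import csv
from datetime import datetime
from pathlib import Path


def write_top_executives(records, output_dir, limit=25, timestamp=None):
    """Write the highest-paid executives to top_<limit>_executives_<timestamp>.csv; return the path."""
    # Assumed timestamp format; the real script may use a different one.
    timestamp = timestamp or datetime.now().strftime("%Y%m%d_%H%M%S")
    output_dir = Path(output_dir)
    output_dir.mkdir(parents=True, exist_ok=True)
    path = output_dir / f"top_{limit}_executives_{timestamp}.csv"

    # Sort by total compensation, highest first, and keep the first `limit` rows.
    ranked = sorted(records, key=lambda r: r["total"], reverse=True)[:limit]
    with path.open("w", newline="", encoding="utf-8") as handle:
        writer = csv.DictWriter(
            handle, fieldnames=["company", "name", "total"], extrasaction="ignore"
        )
        writer.writeheader()
        writer.writerows(ranked)
    return path
```

Passing `newline=""` to `open` is what the `csv` module expects, and `extrasaction="ignore"` lets records carry extra fields without breaking the writer.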

Troubleshooting

  • No filings found: confirm CIK formatting and filing type (DEF 14A vs DEF 14A/A).
  • API errors: slow down requests and confirm user-agent is set.
  • Extraction errors: regenerate code or use manual ground truth in POC scripts.

Related Skills

  • universal/data/reporting-pipelines
  • toolchains/python/testing/pytest