nutmeg-wrangle
Compare original and translation side by side
🇺🇸
Original
English🇨🇳
Translation
ChineseWrangle
数据规整
Help the user manipulate football data effectively. This skill is about the mechanics of working with data, adapted to the user's language and tools.
帮助用户高效处理足球数据。这项技能涉及数据处理的操作方法,可适配用户的语言和工具栈。
Accuracy
准确性要求
Read and follow before answering any question about provider-specific facts (IDs, endpoints, schemas, coordinates, rate limits). Always use — never guess from training data.
docs/accuracy-guardrail.mdsearch_docs回答任何与数据提供商相关的具体事实(ID、端点、schema、坐标、速率限制)问题前,请先阅读并遵循的要求。始终使用工具查询,绝对不要根据训练数据臆测。
docs/accuracy-guardrail.mdsearch_docsFirst: check profile
第一步:检查用户配置
Read . If it doesn't exist, tell the user to run first. Use their profile for language preference and stack.
.nutmeg.user.md/nutmeg读取文件。如果文件不存在,告知用户先运行命令。根据用户配置确定语言偏好和技术栈。
.nutmeg.user.md/nutmegCore operations
核心操作
Coordinate transforms
坐标转换
Football data coordinates vary by provider. Always verify and convert before combining data.
Use to look up the specific system. Key conversions:
search_docs(query="coordinate system", provider="[provider]")- Opta (0-100) to StatsBomb (120x80): ,
x * 1.2y * 0.8 - Wyscout to Opta: stays,
x(invert Y)y = 100 - y - Any to kloppy normalised: use kloppy's in Python
.transform()
不同提供商的足球数据坐标体系各不相同。合并数据前务必先校验并转换坐标。
使用查询特定提供商的坐标体系。常用转换规则:
search_docs(query="coordinate system", provider="[provider]")- Opta (0-100) 转 StatsBomb (120x80): ,
x * 1.2y * 0.8 - Wyscout 转 Opta: 保持不变,
x(Y轴反转)y = 100 - y - 任意格式转kloppy标准化格式:在Python中使用kloppy的方法
.transform()
Filtering events
事件过滤
Common filtering patterns for football event data:
By event type:
- Shots: filter for shot/miss/goal/saved event types
- Passes in final third: filter passes where x > 66.7 (Opta coords)
- Defensive actions: tackles + interceptions + ball recoveries
By match state:
- Open play only: exclude set pieces (corners, free kicks, throw-ins, penalties)
- First half vs second half: use periodId or timestamp
- Score state: track running score to filter "when winning", "when losing"
By zone:
- Penalty area actions: x > 83, 21 < y < 79 (Opta coords)
- High press: actions in opponent's defensive third (x > 66.7)
足球事件数据的常用过滤模式:
按事件类型:
- 射门:筛选射门/射偏/进球/扑救类事件
- 进攻三区传球:筛选x坐标大于66.7的传球(Opta坐标体系)
- 防守动作:抢断+拦截+球权夺回
按比赛状态:
- 仅开放战:排除定位球(角球、任意球、界外球、点球)
- 上半场vs下半场:使用periodId或时间戳筛选
- 比分状态:追踪实时比分以筛选「领先时」「落后时」的事件
按区域:
- 禁区内动作:x > 83, 21 < y < 79(Opta坐标体系)
- 高位逼抢:对手防守三区内的动作(x > 66.7)
Joining datasets
数据集拼接
Common joins in football data:
| Join | Key | Notes |
|---|---|---|
| Events + lineups | player_id + match_id | Get player names/positions for each event |
| Events + xG | match_id + event sequence | Match xG to specific shots |
| Multiple providers | match date + team names | Fuzzy matching often needed |
| Season data + Elo | date | Join Elo rating at time of match |
Fuzzy team name matching is a constant pain. Build a mapping table:
python
TEAM_MAP = {
'Man City': 'Manchester City',
'Man United': 'Manchester United',
'Spurs': 'Tottenham Hotspur',
'Wolves': 'Wolverhampton Wanderers',
# ...
}足球数据的常用拼接方式:
| 拼接类型 | 关联键 | 说明 |
|---|---|---|
| 事件数据 + 阵容数据 | player_id + match_id | 获取每个事件对应的球员姓名/位置 |
| 事件数据 + xG数据 | match_id + 事件序列 | 将xG值匹配到对应射门动作 |
| 多提供商数据 | 比赛日期 + 球队名称 | 通常需要模糊匹配 |
| 赛季数据 + Elo数据 | 日期 | 关联比赛当日的Elo评分 |
模糊队名匹配是常见痛点,可以构建映射表:
python
TEAM_MAP = {
'Man City': 'Manchester City',
'Man United': 'Manchester United',
'Spurs': 'Tottenham Hotspur',
'Wolves': 'Wolverhampton Wanderers',
# ...
}Reshaping
数据重构
Common reshaping operations:
- Wide to long: Season stats tables (one column per stat) to tidy format (one row per stat per team)
- Events to possession chains: Group consecutive events by the same team into possession sequences
- Match-level to season aggregates: Group by team, sum/average per-match values
- Player-match to player-season: Aggregate across matches, weight by minutes played
常用重构操作:
- 宽表转长表: 将赛季统计表(每个统计项一列)转换为整洁格式(每个球队的每个统计项占一行)
- 事件转控球链: 将同一球队的连续事件分组为控球序列
- 比赛级别聚合到赛季级别: 按球队分组,对单场数值求和/取平均值
- 球员单场数据聚合到球员赛季数据: 跨比赛聚合,按出场分钟加权
Handling large datasets
处理大型数据集
Full event data for a PL season is ~500MB+ (380 matches x ~1700 events). Strategies:
Python:
- Use polars instead of pandas for 5-10x speed improvement
- Process match-by-match in a loop, don't load all into memory
- Use DuckDB for SQL queries on Parquet files without loading into memory
JavaScript/TypeScript:
- Stream JSON files with or
readlineJSONStream - Use SQLite (better-sqlite3) for local queries
- Process files in parallel with worker threads
R:
- Use data.table instead of tidyverse for large datasets
- Arrow/Parquet for out-of-memory processing
英超一整个赛季的完整事件数据约500MB以上(380场比赛×每场约1700个事件)。处理策略:
Python:
- 使用polars替代pandas可提升5-10倍速度
- 逐场循环处理,不要一次性加载所有数据到内存
- 使用DuckDB对Parquet文件执行SQL查询,无需加载到内存
JavaScript/TypeScript:
- 使用或
readline流式读取JSON文件JSONStream - 使用SQLite(better-sqlite3)执行本地查询
- 使用工作线程并行处理文件
R:
- 处理大型数据集时使用data.table替代tidyverse
- 使用Arrow/Parquet实现外存处理
Data quality checks
数据质量校验
Always validate after wrangling:
| Check | What to look for |
|---|---|
| Event counts | ~1500-2000 events per PL match. Much less = data issue |
| Coordinate range | Should be within provider's expected range |
| Missing player IDs | Some events lack player attribution (ball out, etc.) |
| Duplicate events | Same event_id appearing twice |
| Time gaps | Large gaps in event timestamps within a match |
| Team attribution | Verify home/away assignment is consistent |
数据整理完成后务必执行校验:
| 校验项 | 校验标准 |
|---|---|
| 事件数量 | 每场英超比赛约有1500-2000个事件。数量远低于该范围说明存在数据问题 |
| 坐标范围 | 应在对应提供商的预期坐标范围内 |
| 球员ID缺失 | 部分事件没有对应的球员归属(比如出界等) |
| 重复事件 | 同一event_id重复出现 |
| 时间间隔 | 同一场比赛的事件时间戳存在较大间隔 |
| 球队归属 | 校验主/客场队分配是否一致 |
Format conversion
格式转换
| From | To | Tool/method |
|---|---|---|
| JSON events | DataFrame | pandas/polars |
| CSV | Parquet | |
| Provider format | kloppy model | |
| kloppy model | DataFrame | |
| Any | SQLite | Load into SQLite for ad-hoc queries |
| 源格式 | 目标格式 | 工具/方法 |
|---|---|---|
| JSON事件数据 | DataFrame | pandas/polars的 |
| CSV | Parquet | polars的 |
| 提供商原始格式 | kloppy模型 | Python中使用 |
| kloppy模型 | DataFrame | |
| 任意格式 | SQLite | 加载到SQLite中执行即席查询 |