nutmeg-wrangle


Wrangle


Help the user manipulate football data effectively. This skill is about the mechanics of working with data, adapted to the user's language and tools.

Accuracy

Read and follow `docs/accuracy-guardrail.md` before answering any question about provider-specific facts (IDs, endpoints, schemas, coordinates, rate limits). Always use `search_docs`; never guess from training data.

First: check profile

Read `.nutmeg.user.md`. If it doesn't exist, tell the user to run `/nutmeg` first. Use their profile for language preference and stack.

Core operations

Coordinate transforms

Football data coordinates vary by provider. Always verify and convert before combining data.
Use `search_docs(query="coordinate system", provider="[provider]")` to look up the specific system. Key conversions:
  • Opta (0-100) to StatsBomb (120x80): `x * 1.2`, `y * 0.8`
  • Wyscout to Opta: `x` stays, `y = 100 - y` (invert Y)
  • Any to kloppy normalised: use kloppy's `.transform()` in Python
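The first two conversions above can be sketched as plain functions (the function names are illustrative, not from any provider SDK):

```python
def opta_to_statsbomb(x: float, y: float) -> tuple[float, float]:
    """Opta (0-100, 0-100) -> StatsBomb (0-120, 0-80)."""
    return x * 1.2, y * 0.8

def wyscout_to_opta(x: float, y: float) -> tuple[float, float]:
    """Wyscout and Opta both use 0-100, but the Y axis is inverted."""
    return x, 100 - y

# e.g. re-express an Opta shot location on a StatsBomb pitch
x_sb, y_sb = opta_to_statsbomb(88.5, 50.0)
```

Apply the conversion before any join or plot that mixes providers; a shot that looks like it came from midfield is usually an unconverted coordinate, not a long-range effort.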

Filtering events

Common filtering patterns for football event data:
By event type:
  • Shots: filter for shot/miss/goal/saved event types
  • Passes in final third: filter passes where x > 66.7 (Opta coords)
  • Defensive actions: tackles + interceptions + ball recoveries
By match state:
  • Open play only: exclude set pieces (corners, free kicks, throw-ins, penalties)
  • First half vs second half: use periodId or timestamp
  • Score state: track running score to filter "when winning", "when losing"
By zone:
  • Penalty area actions: x > 83, 21 < y < 79 (Opta coords)
  • High press: actions in opponent's defensive third (x > 66.7)
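A few of these filters, sketched over plain event dicts (the keys `type`, `x`, `y`, and `qualifiers` are assumptions about the event shape, not any provider's actual schema):

```python
SHOT_TYPES = {"shot", "miss", "goal", "saved"}
SET_PIECES = {"corner", "free_kick", "throw_in", "penalty"}

def shots(events):
    return [e for e in events if e["type"] in SHOT_TYPES]

def final_third_passes(events):
    # Opta coordinates: the final third starts at x > 66.7
    return [e for e in events if e["type"] == "pass" and e["x"] > 66.7]

def box_actions(events):
    # Penalty area in Opta coordinates: x > 83, 21 < y < 79
    return [e for e in events if e["x"] > 83 and 21 < e["y"] < 79]

def open_play(events):
    # Drop any event carrying a set-piece qualifier
    return [e for e in events
            if not SET_PIECES & set(e.get("qualifiers", []))]
```

The same predicates translate directly into polars/pandas filters or SQL `WHERE` clauses once the data is in a DataFrame.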

Joining datasets

Common joins in football data:

| Join | Key | Notes |
| --- | --- | --- |
| Events + lineups | player_id + match_id | Get player names/positions for each event |
| Events + xG | match_id + event sequence | Match xG to specific shots |
| Multiple providers | match date + team names | Fuzzy matching often needed |
| Season data + Elo | date | Join Elo rating at time of match |

Fuzzy team name matching is a constant pain. Build a mapping table:

```python
TEAM_MAP = {
    'Man City': 'Manchester City',
    'Man United': 'Manchester United',
    'Spurs': 'Tottenham Hotspur',
    'Wolves': 'Wolverhampton Wanderers',
    # ...
}
```
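The mapping table covers known aliases; for names it misses, stdlib `difflib` gives a cheap fuzzy fallback. A minimal sketch (the `CANONICAL` list and the 0.6 cutoff are illustrative choices, tune them for your data):

```python
from difflib import get_close_matches

TEAM_MAP = {
    'Man City': 'Manchester City',
    'Spurs': 'Tottenham Hotspur',
    'Wolves': 'Wolverhampton Wanderers',
}

CANONICAL = ['Manchester City', 'Manchester United',
             'Tottenham Hotspur', 'Wolverhampton Wanderers']

def canonical_name(name: str) -> str:
    """Exact alias first, then fuzzy match, else keep the input as-is."""
    if name in CANONICAL:
        return name
    if name in TEAM_MAP:
        return TEAM_MAP[name]
    matches = get_close_matches(name, CANONICAL, n=1, cutoff=0.6)
    return matches[0] if matches else name
```

Log every name that falls through to the fuzzy branch and spot-check it; a silent wrong match (e.g. two clubs sharing a city name) corrupts the join quietly.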

Reshaping

Common reshaping operations:
  • Wide to long: Season stats tables (one column per stat) to tidy format (one row per stat per team)
  • Events to possession chains: Group consecutive events by the same team into possession sequences
  • Match-level to season aggregates: Group by team, sum/average per-match values
  • Player-match to player-season: Aggregate across matches, weight by minutes played
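Two of these reshapes fit in a few lines of stdlib Python (the `team` field and the row layout are illustrative assumptions):

```python
from itertools import groupby

def possession_chains(events):
    """Group consecutive events by the same team into possession sequences."""
    return [list(chunk)
            for _, chunk in groupby(events, key=lambda e: e["team"])]

def wide_to_long(row, id_col="team"):
    """One wide season-stats row -> tidy rows, one per stat."""
    return [{"team": row[id_col], "stat": k, "value": v}
            for k, v in row.items() if k != id_col]
```

In pandas/polars the same operations are `melt`/`unpivot` for wide-to-long, and a cumulative team-change counter plus `group_by` for possession chains.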

Handling large datasets

Full event data for a PL season is ~500MB+ (380 matches x ~1700 events). Strategies:
Python:
  • Use polars instead of pandas for 5-10x speed improvement
  • Process match-by-match in a loop, don't load all into memory
  • Use DuckDB for SQL queries on Parquet files without loading into memory
JavaScript/TypeScript:
  • Stream JSON files with `readline` or `JSONStream`
  • Use SQLite (better-sqlite3) for local queries
  • Process files in parallel with worker threads
R:
  • Use data.table instead of tidyverse for large datasets
  • Arrow/Parquet for out-of-memory processing
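The match-by-match pattern looks like this in plain Python (the one-file-per-match layout and the event keys are assumptions; polars or DuckDB would be the faster choice for heavier aggregation):

```python
import json
from pathlib import Path

def season_shot_counts(events_dir: str) -> dict[str, int]:
    """Aggregate shots per team one match file at a time,
    so only a single match ever sits in memory."""
    totals: dict[str, int] = {}
    for match_file in sorted(Path(events_dir).glob("*.json")):
        events = json.loads(match_file.read_text())
        for e in events:
            if e["type"] in ("shot", "miss", "goal", "saved"):
                totals[e["team"]] = totals.get(e["team"], 0) + 1
    return totals
```

The key property is that memory use is bounded by the largest single match, not the season, so the same loop scales from one gameweek to multiple seasons.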

Data quality checks

Always validate after wrangling:

| Check | What to look for |
| --- | --- |
| Event counts | ~1500-2000 events per PL match. Much less = data issue |
| Coordinate range | Should be within provider's expected range |
| Missing player IDs | Some events lack player attribution (ball out, etc.) |
| Duplicate events | Same event_id appearing twice |
| Time gaps | Large gaps in event timestamps within a match |
| Team attribution | Verify home/away assignment is consistent |
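A minimal sketch of the first few checks, assuming Opta-style 0-100 coordinates and an `event_id` field (adjust the field names and thresholds to your provider):

```python
def quality_report(events, x_max=100, y_max=100):
    """Return human-readable issues found in one match's events."""
    issues = []
    if len(events) < 1500:
        issues.append(f"only {len(events)} events; expected ~1500-2000")
    ids = [e["event_id"] for e in events]
    if len(ids) != len(set(ids)):
        issues.append("duplicate event_id values")
    out_of_range = [e for e in events
                    if not (0 <= e["x"] <= x_max and 0 <= e["y"] <= y_max)]
    if out_of_range:
        issues.append(f"{len(out_of_range)} events outside coordinate range")
    return issues
```

Run it on every match before aggregating; an out-of-range coordinate often means a provider mix-up rather than bad data, so it doubles as a check on the transforms above.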

Format conversion

| From | To | Tool/method |
| --- | --- | --- |
| JSON events | DataFrame | pandas/polars `read_json` or manual parsing |
| CSV | Parquet | `df.write_parquet()` (polars) or `df.to_parquet()` (pandas) |
| Provider format | kloppy model | `kloppy.load_{provider}()` in Python |
| kloppy model | DataFrame | `dataset.to_df()` |
| Any | SQLite | Load into SQLite for ad-hoc queries |
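The last row needs no extra dependencies: stdlib `sqlite3` will take a list of event dicts directly. A sketch with an illustrative table schema (column names are assumptions, not a provider's):

```python
import sqlite3

def load_events(events):
    """Load event dicts into an in-memory SQLite DB for ad-hoc SQL."""
    con = sqlite3.connect(":memory:")
    con.execute(
        "CREATE TABLE events "
        "(event_id INTEGER, type TEXT, team TEXT, x REAL, y REAL)"
    )
    # executemany accepts dicts when the SQL uses named placeholders
    con.executemany(
        "INSERT INTO events VALUES (:event_id, :type, :team, :x, :y)",
        events,
    )
    return con

# Ad-hoc query, e.g. shots per team:
#   SELECT team, COUNT(*) FROM events
#   WHERE type IN ('shot', 'goal') GROUP BY team
```

Point the connection at a file path instead of `:memory:` to keep the database around between sessions.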