earth2studio-data-fetch

Earth2Studio Data Fetch Skill

Purpose

Guides a user through downloading weather/climate data via Earth2Studio data source APIs. The skill identifies compatible sources by checking the lexicon, verifies variable support, and produces a working fetch script that outputs an xarray DataArray.

Prerequisites

  • Earth2Studio installed (`uv pip install earth2studio` or equivalent)
  • Network access to remote data stores (GCS, S3, CDS API, etc.)
  • For CDS-based sources: a valid CDS API key configured in `~/.cdsapirc`
  • Python 3.10+

Instructions

You are helping a user download specific weather/climate data using Earth2Studio's data source APIs. Your job is to identify which data source(s) can provide the requested variables, verify compatibility via the lexicon system, and produce a working fetch script.

Core principle: live docs and lexicon are the source of truth

Data source APIs, available variables, and the lexicon evolve between releases. Before recommending a data source or writing a fetch script:
  1. Fetch the relevant data source doc page to confirm the API signature and constructor arguments.
  2. Check the lexicon to verify the requested variable is supported by that data source.
Live doc references (fetch only what the user's request requires):

Interaction protocol

Step 1. Understand the user's request

Extract from what the user has said (ask follow-up questions if needed, at most 3):
  • Variables: what do they want? Use Earth2Studio variable names (e.g. `t2m`, `u500`, `z850`, `tp`, `msl`). If the user uses plain language ("500 hPa geopotential height"), map it to the Earth2Studio name by checking `E2STUDIO_VOCAB` in the live `base.py`.
  • Time: what date/time range? A single timestamp, a range, or multiple discrete times?
  • Data type: analysis/reanalysis (historical state) or forecast (lead-time based)?
  • Lead time (forecast only): how far ahead? Which initialization time?
  • Region: global or regional (e.g. North America for HRRR)?
  • Output format: xarray DataArray (default), or saved to file (NetCDF/Zarr)?
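
The details gathered in this step can be held in one small structure so later steps work from a single record. The sketch below is illustrative only: `FetchRequest` and `open_questions` are hypothetical names, not part of Earth2Studio, and the cap of three follow-ups mirrors the rule above.

```python
from dataclasses import dataclass, field
from datetime import datetime, timedelta


@dataclass
class FetchRequest:
    """Hypothetical container for the details gathered in Step 1."""

    variables: list[str]  # Earth2Studio names, e.g. ["t2m", "z500"]
    times: list[datetime]  # valid times, or initialization times for forecasts
    data_type: str = "analysis"  # "analysis" or "forecast"
    lead_times: list[timedelta] = field(default_factory=list)  # forecast only
    region: str = "global"
    output: str = "dataarray"  # or a file format such as "netcdf" / "zarr"

    def open_questions(self) -> list[str]:
        """Return the follow-up questions still needed, capped at three."""
        questions = []
        if not self.variables:
            questions.append("Which variables do you need?")
        if not self.times:
            questions.append("Which date/time or range?")
        if self.data_type not in ("analysis", "forecast"):
            questions.append("Analysis/reanalysis or forecast data?")
        if self.data_type == "forecast" and not self.lead_times:
            questions.append("Which lead times, from which initialization time?")
        return questions[:3]


request = FetchRequest(variables=["t2m", "z500"], times=[datetime(2020, 1, 1, 0)])
print(request.open_questions())
```

A complete analysis request yields no open questions; a forecast request without lead times yields the lead-time question.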

Step 2. Identify candidate data sources

Based on the request type, narrow the candidates.
Analysis/reanalysis (historical state at a specific time):
  • Use the analysis data source page to identify options
  • Common choices: GFS (operational, recent), HRRR (North America, hourly), IFS/IFS_ENS (ECMWF), ARCO/CDS/WB2ERA5/NCAR_ERA5 (ERA5 reanalysis), GOES/MRMS/JPSS (observational)
Forecast (predictions from an initialization time with lead times):
  • Use the forecast data source page to identify options
  • Common choices: GFS_FX, GEFS_FX, HRRR_FX, IFS_FX, IFS_ENS_FX, AIFS_FX, CFS_FX
Key differentiators to surface:
  • Temporal coverage: operational sources (GFS, HRRR) have limited history; reanalysis (ERA5 via ARCO/CDS/WB2) goes back decades
  • Spatial resolution: HRRR is 3 km and covers North America only; GFS is 0.25° global; WB2ERA5_32x64 is 5.625° global
  • Update frequency: some sources are real-time, others have a multi-day lag
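
As a rough illustration of the narrowing step, the candidate names listed above can be filtered by request type and region. This is a sketch under stated assumptions: `CANDIDATES`, `NORTH_AMERICA_ONLY`, and `candidate_sources` are hypothetical helpers, the names are copied from the lists above rather than read from the live docs, and the regional set is deliberately incomplete (it only records what the text states about HRRR).

```python
# Candidate names copied from the lists above; confirm against the live docs.
CANDIDATES = {
    "analysis": [
        "GFS", "HRRR", "IFS", "IFS_ENS", "ARCO", "CDS",
        "WB2ERA5", "NCAR_ERA5", "GOES", "MRMS", "JPSS",
    ],
    "forecast": [
        "GFS_FX", "GEFS_FX", "HRRR_FX", "IFS_FX",
        "IFS_ENS_FX", "AIFS_FX", "CFS_FX",
    ],
}

# Sources treated here as North America only (HRRR and its forecast variant).
# Other sources may also be regional; this set is not exhaustive.
NORTH_AMERICA_ONLY = {"HRRR", "HRRR_FX"}


def candidate_sources(data_type, region="global"):
    """Return candidate source names for a data type, dropping regional
    sources when the request is not for North America."""
    names = CANDIDATES[data_type]
    if region != "north_america":
        names = [n for n in names if n not in NORTH_AMERICA_ONLY]
    return list(names)


print(candidate_sources("analysis"))
print(candidate_sources("forecast", region="north_america"))
```

The result is only a shortlist; variable support still has to be verified through the lexicon.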

Step 3. Verify variable support via lexicon

This is critical. Each data source has a lexicon file that defines which E2Studio variables it can provide.
To verify:
  1. Fetch the source's lexicon file from `https://github.com/NVIDIA/earth2studio/blob/main/earth2studio/lexicon/<source>.py` (e.g. `gfs.py`, `hrrr.py`, `cds.py`, `arco.py`, `wb2.py`)
  2. Check that the user's requested variable(s) appear as keys in the source's `VOCAB` dict
  3. If a variable is NOT in a source's lexicon, that source cannot provide it; try another
The lexicon VOCAB maps Earth2Studio variable names → source-specific identifiers. If a variable key exists in the VOCAB, the source supports it.
Present the results clearly: "GFS supports `t2m`, `u500`, `z850`. HRRR also supports these but is limited to North America. ARCO (ERA5) supports all three and has data back to 1959."
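
The key-membership check described above can be sketched as a small helper. `check_support` and `EXAMPLE_VOCAB` are hypothetical: the dict is a stand-in with placeholder values so the example runs without Earth2Studio, and it does not reflect any real source's lexicon.

```python
def check_support(requested, vocab):
    """Split requested variables into (supported, missing) for one source."""
    supported = [v for v in requested if v in vocab]
    missing = [v for v in requested if v not in vocab]
    return supported, missing


# Stand-in for a source's VOCAB dict; the values are placeholders,
# not real source-specific identifiers.
EXAMPLE_VOCAB = {
    "t2m": "<source id>",
    "u500": "<source id>",
    "z850": "<source id>",
}

supported, missing = check_support(["t2m", "z850", "tp"], EXAMPLE_VOCAB)
print("supported:", supported)
print("missing:", missing)
```

With Earth2Studio installed, the same helper could be given the `VOCAB` dict of the source's lexicon class (the class name varies per source, so confirm it in the lexicon file first). Any variable reported as missing means another source should be tried.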

Step 4. Confirm data source selection with user

Present the viable options with tradeoffs:

| Source | Variables | Coverage | Resolution | Time Range |
|--------|-----------|----------|------------|------------|
| ... | ... | ... | ... | ... |

Let the user pick. If there's one obvious choice, recommend it and ask for confirmation.

Step 5. Generate fetch script

Write a Python script that uses the selected data source to fetch the requested data. The script structure depends on whether it's an analysis or forecast source.

**Analysis source pattern:**

```python
import datetime
from earth2studio.data import <SourceClass>

# Initialize data source
ds = <SourceClass>()

# Fetch data
# Analysis sources use: ds(time, variable) -> xr.DataArray
time = [datetime.datetime(YYYY, M, D, H)]  # or array of times
variable = ["var1", "var2"]  # E2Studio variable names
data = ds(time, variable)
```

**Forecast source pattern:**

```python
import datetime
from earth2studio.data import <SourceClass>

# Initialize data source
ds = <SourceClass>()

# Forecast sources use: ds(time, lead_time, variable) -> xr.DataArray
time = [datetime.datetime(YYYY, M, D, H)]  # initialization time
lead_time = [datetime.timedelta(hours=H)]  # or array of lead times
variable = ["var1", "var2"]
data = ds(time, lead_time, variable)
```

Always fetch the specific data source's API doc page to confirm the exact
constructor arguments and call signature before writing the script; they can
vary (some need auth tokens, cache paths, specific parameters).

Include in the script:

- Appropriate imports
- Clear comments explaining each step
- How to inspect the result (`print(data)`, `data.shape`, `data.coords`)
- Optional: saving to file if the user requested it

Step 6. Offer next steps

After delivering the script, mention:
  • How to change variables/times without rewriting the whole thing
  • If they might want to feed this into a model, point them to the discover skill
  • Cache behavior (data is cached locally after first fetch via `EARTH2STUDIO_CACHE`)
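
One way to let users change variables and times without rewriting the script is to read them from the command line. The sketch below uses only the standard library; the flag names `--time` and `--variables` are illustrative choices, not an Earth2Studio convention.

```python
import argparse
from datetime import datetime


def parse_args(argv=None):
    """Parse fetch parameters; argv=None falls back to the real command line."""
    parser = argparse.ArgumentParser(
        description="Fetch data with an Earth2Studio data source"
    )
    parser.add_argument(
        "--time",
        nargs="+",
        required=True,
        type=datetime.fromisoformat,
        help="one or more ISO timestamps, e.g. 2020-01-01T00:00",
    )
    parser.add_argument(
        "--variables",
        nargs="+",
        required=True,
        help="Earth2Studio variable names, e.g. t2m z500",
    )
    return parser.parse_args(argv)


# Explicit argument list for demonstration; a real script would call parse_args().
args = parse_args(["--time", "2020-01-01T00:00", "--variables", "t2m", "z500"])
print(args.time, args.variables)
```

The parsed `args.time` and `args.variables` lists can then be passed to the data source call in place of hard-coded values.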

Ownership and out-of-scope

Owns: identifying data sources for a user's variable/time request, verifying variable support via lexicon, generating data fetch scripts, explaining analysis vs. forecast source differences.
Does not own: installation (earth2studio-install), model selection (earth2studio-discover), inference pipelines, custom data source creation (point to extend examples), data source authentication setup beyond what the docs describe.

Examples

Typical invocation:
"I need 500 hPa geopotential height and 2m temperature from ERA5 for January 1, 2020 at 00Z."
The skill would:
  1. Map plain language → `z500`, `t2m`
  2. Check ARCO/CDS/WB2ERA5 lexicons for support
  3. Recommend ARCO (free, no API key) or CDS (official, needs key)
  4. Generate a fetch script using the selected source
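
For this request, the generated script might look like the following sketch if the user picks ARCO. It assumes that ARCO's default constructor arguments are acceptable and that both variables are present in the ARCO lexicon, which should be confirmed against the live docs first. `fetch_example` and its `output_path` parameter are illustrative names, and the function needs Earth2Studio and network access to run.

```python
import datetime

# Request from the example above.
TIME = [datetime.datetime(2020, 1, 1, 0)]  # January 1, 2020 at 00Z
VARIABLES = ["z500", "t2m"]  # 500 hPa geopotential height, 2m temperature


def fetch_example(output_path=None):
    """Fetch the example request from ARCO and optionally save it as NetCDF."""
    # Imported here so this module loads even without Earth2Studio installed.
    from earth2studio.data import ARCO

    ds = ARCO()  # assumption: default constructor arguments are fine
    data = ds(TIME, VARIABLES)  # analysis call pattern: ds(time, variable)

    # Inspect the result.
    print(data)
    print(data.shape)
    print(data.coords)

    if output_path is not None:
        # Split the "variable" dimension into one data variable per name.
        data.to_dataset(dim="variable").to_netcdf(output_path)
    return data
```

Calling `fetch_example("era5_20200101.nc")` would fetch the data and write it to the given file; calling it with no argument only returns the DataArray.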

Limitations

  • Network required: all data sources fetch from remote stores (GCS, S3, CDS API)
  • No local file loading: for local NetCDF/Zarr, use `DataArrayFile` / `DataSetFile` directly
  • One source type per script: analysis and forecast sources cannot be mixed in a single call
  • Variable availability varies: not all sources provide all variables; always verify via lexicon
  • Rate limits: CDS API has queue-based throttling; GCS/S3 sources are generally faster

Troubleshooting

| Error | Cause | Solution |
|-------|-------|----------|
| `KeyError: '<var>'` | Not in lexicon | Check lexicon; try another source |
| `FileNotFoundError` / 404 | Time not available | Verify temporal coverage |
| `CDS API timeout` | Queue congestion | Retry or use ARCO for ERA5 |
| `ModuleNotFoundError` | Not installed | `uv pip install earth2studio` |
| Empty DataArray | Time/var mismatch | Check datetime and variable name |
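
The "retry" advice for CDS queue congestion can be sketched as a generic wrapper with exponential backoff. `fetch_with_retry` is a hypothetical helper, not an Earth2Studio function; the demonstration uses a simulated failing fetch and a no-op sleep so it runs instantly and without network access.

```python
import time


def fetch_with_retry(fetch, attempts=3, base_delay=1.0, sleep=time.sleep):
    """Call fetch() until it succeeds, doubling the wait after each failure."""
    for attempt in range(attempts):
        try:
            return fetch()
        except Exception as exc:  # narrow this to the error your source raises
            if attempt == attempts - 1:
                raise  # out of attempts: surface the last error
            delay = base_delay * 2**attempt
            print(f"attempt {attempt + 1} failed ({exc}); retrying in {delay:.0f}s")
            sleep(delay)


# Simulated fetch that fails twice before succeeding.
calls = {"n": 0}


def flaky_fetch():
    calls["n"] += 1
    if calls["n"] < 3:
        raise TimeoutError("simulated queue congestion")
    return "data"


result = fetch_with_retry(flaky_fetch, sleep=lambda seconds: None)
print(result)
```

In a real script the wrapped callable would be something like `lambda: ds(time, variable)`. If retries keep failing for ERA5, switching to ARCO as the table suggests avoids the CDS queue entirely.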