earth2studio-data-fetch
Earth2Studio Data Fetch Skill
Purpose
Guide a user through downloading weather/climate data via Earth2Studio data source
APIs. Identifies compatible sources by checking the lexicon, verifies variable
support, and produces a working fetch script outputting an xarray DataArray.
Prerequisites
- Earth2Studio installed (`uv pip install earth2studio` or equivalent)
- Network access to remote data stores (GCS, S3, CDS API, etc.)
- For CDS-based sources: valid CDS API key configured (`~/.cdsapirc`)
- Python 3.10+
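The local prerequisites can be checked before any fetch is attempted. The following is a minimal sketch using only the standard library; the helper name `check_prerequisites` is hypothetical and not part of Earth2Studio, and network access to the remote stores is not tested here.

```python
import importlib.util
import sys
from pathlib import Path


def check_prerequisites() -> dict[str, bool]:
    """Report which local prerequisites from the list above are satisfied."""
    return {
        # Python 3.10+ is required
        "python_3_10_plus": sys.version_info >= (3, 10),
        # True if the earth2studio package can be found on the import path
        "earth2studio_installed": importlib.util.find_spec("earth2studio") is not None,
        # Only relevant for CDS-based sources
        "cds_key_file_present": (Path.home() / ".cdsapirc").is_file(),
    }


if __name__ == "__main__":
    for name, ok in check_prerequisites().items():
        print(f"{name}: {'ok' if ok else 'missing'}")
```

A missing `cds_key_file_present` entry only matters when a CDS-based source is chosen later.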
Instructions
You are helping a user download specific weather/climate data using
Earth2Studio's data source APIs. Your job is to identify which data source(s)
can provide the requested variables, verify compatibility via the lexicon
system, and produce a working fetch script.
Core principle: live docs and lexicon are the source of truth
Data source APIs, available variables, and the lexicon evolve between releases.
Before recommending a data source or writing a fetch script:
- Fetch the relevant data source doc page to confirm the API signature and constructor arguments.
- Check the lexicon to verify the requested variable is supported by that data source.
Live doc references (fetch only what the user's request requires):
- Analysis data sources: https://nvidia.github.io/earth2studio/modules/datasources_analysis.html
- Forecast data sources: https://nvidia.github.io/earth2studio/modules/datasources_forecast.html
- DataFrame data sources: https://nvidia.github.io/earth2studio/modules/datasources_dataframe.html
- Lexicon base: https://github.com/NVIDIA/earth2studio/blob/main/earth2studio/lexicon/base.py
- Lexicon per-source: https://github.com/NVIDIA/earth2studio/tree/main/earth2studio/lexicon
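One way to fetch only what a request requires is a small lookup from request type to doc page. This is a sketch, not part of Earth2Studio: the names `DOC_PAGES` and `pages_for` are hypothetical, and the URLs are the ones listed above.

```python
DOCS_BASE = "https://nvidia.github.io/earth2studio/modules"
LEXICON_TREE = "https://github.com/NVIDIA/earth2studio/tree/main/earth2studio/lexicon"

DOC_PAGES = {
    "analysis": f"{DOCS_BASE}/datasources_analysis.html",
    "forecast": f"{DOCS_BASE}/datasources_forecast.html",
    "dataframe": f"{DOCS_BASE}/datasources_dataframe.html",
}


def pages_for(data_type: str) -> list[str]:
    """Return the doc page for the request type plus the lexicon directory."""
    if data_type not in DOC_PAGES:
        raise ValueError(f"unknown data type: {data_type!r}")
    return [DOC_PAGES[data_type], LEXICON_TREE]


print(pages_for("forecast"))
```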
Interaction protocol
Step 1. Understand the user's request
Extract from what the user has said (ask follow-ups if needed, cap at 3
questions):
- Variables: what do they want? Use Earth2Studio variable names (e.g. `t2m`, `u500`, `z850`, `tp`, `msl`). If the user uses plain language ("500 hPa geopotential height"), map it to the E2Studio name by checking `E2STUDIO_VOCAB` in the live `base.py`.
- Time: what date/time range? A single timestamp, a range, or multiple discrete times?
- Data type: analysis/reanalysis (historical state) or forecast (lead-time based)?
- Lead time (forecast only): how far ahead? Which initialization time?
- Region: global or regional (e.g. North America for HRRR)?
- Output format: xarray DataArray (default), save to file (NetCDF/Zarr)?
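The fields collected in this step can be held in one small structure, which also makes it easy to see what still needs a follow-up question. This is a sketch under stated assumptions: `FetchRequest`, `to_e2studio_name`, and the alias table are hypothetical helpers, and the two aliases are illustrative only. The live `E2STUDIO_VOCAB` remains the authority for valid names.

```python
from dataclasses import dataclass, field
from datetime import datetime, timedelta

# Illustrative plain-language aliases only; check the live E2STUDIO_VOCAB
# in base.py before relying on any mapping.
PLAIN_LANGUAGE_ALIASES = {
    "500 hpa geopotential height": "z500",
    "2m temperature": "t2m",
}


@dataclass
class FetchRequest:
    variables: list[str]
    times: list[datetime]
    data_type: str = "analysis"  # "analysis" or "forecast"
    lead_times: list[timedelta] = field(default_factory=list)  # forecast only
    region: str = "global"
    output_format: str = "dataarray"  # or "netcdf" / "zarr"

    def missing_fields(self) -> list[str]:
        """Names of fields that still need a follow-up question."""
        missing = []
        if not self.variables:
            missing.append("variables")
        if not self.times:
            missing.append("times")
        if self.data_type == "forecast" and not self.lead_times:
            missing.append("lead_times")
        return missing


def to_e2studio_name(phrase: str) -> str:
    """Map a plain-language phrase to a variable name, or pass it through."""
    return PLAIN_LANGUAGE_ALIASES.get(phrase.strip().lower(), phrase)


request = FetchRequest(
    variables=[to_e2studio_name("500 hPa geopotential height"), "t2m"],
    times=[datetime(2020, 1, 1, 0)],
)
print(request.variables, request.missing_fields())
```

A forecast request with no lead times reports `lead_times` as missing, which corresponds to the "how far ahead?" follow-up.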
Step 2. Identify candidate data sources
Based on the request type, narrow candidates:
Analysis/reanalysis (historical state at a specific time):
- Use analysis data source page to identify options
- Common choices: GFS (operational, recent), HRRR (NA, hourly), IFS/IFS_ENS (ECMWF), ARCO/CDS/WB2ERA5/NCAR_ERA5 (ERA5 reanalysis), GOES/MRMS/JPSS (observational)
Forecast (predictions from an initialization time with lead times):
- Use forecast data source page to identify options
- Common choices: GFS_FX, GEFS_FX, HRRR_FX, IFS_FX, IFS_ENS_FX, AIFS_FX, CFS_FX
Key differentiators to surface:
- Temporal coverage — operational sources (GFS, HRRR) have limited history; reanalysis (ERA5 via ARCO/CDS/WB2) goes back decades
- Spatial resolution — HRRR is 3km NA-only; GFS is 0.25° global; WB2ERA5_32x64 is 5.625° global
- Update frequency — some are real-time, some have multi-day lag
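The narrowing described in this step can be sketched as a lookup followed by a region filter. The source names are copied from the lists above. The helper `candidate_sources` is hypothetical, and treating `HRRR_FX` as North America only is an assumption extrapolated from the HRRR note; other sources may also be regional, so the live doc pages should be consulted.

```python
# Names copied from the "common choices" lists above; the live doc pages
# remain the authority for what a given release actually provides.
COMMON_SOURCES = {
    "analysis": ["GFS", "HRRR", "IFS", "IFS_ENS", "ARCO", "CDS", "WB2ERA5",
                 "NCAR_ERA5", "GOES", "MRMS", "JPSS"],
    "forecast": ["GFS_FX", "GEFS_FX", "HRRR_FX", "IFS_FX", "IFS_ENS_FX",
                 "AIFS_FX", "CFS_FX"],
}

# HRRR is stated above to be North America only; HRRR_FX is assumed to match.
REGIONAL_ONLY = {"HRRR", "HRRR_FX"}


def candidate_sources(data_type: str, region: str = "global") -> list[str]:
    """Narrow the common choices by request type and region."""
    names = COMMON_SOURCES[data_type]
    if region == "global":
        # A regional source cannot serve a global request.
        return [n for n in names if n not in REGIONAL_ONLY]
    return list(names)


print(candidate_sources("forecast"))
```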
Step 3. Verify variable support via lexicon
This is critical. Each data source has a lexicon file that defines which
E2Studio variables it can provide.
To verify:
- Fetch the source's lexicon file from `https://github.com/NVIDIA/earth2studio/blob/main/earth2studio/lexicon/<source>.py` (e.g. `gfs.py`, `hrrr.py`, `cds.py`, `arco.py`, `wb2.py`)
- Check that the user's requested variable(s) appear as keys in the source's `VOCAB` dict
- If a variable is NOT in a source's lexicon, that source cannot provide it; try another
The lexicon VOCAB maps Earth2Studio variable names → source-specific
identifiers. If a variable key exists in the VOCAB, the source supports it.
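The key-membership check can be sketched as follows. The helpers `lexicon_url` and `check_support` are hypothetical, and `stub_vocab` is a stand-in with placeholder values. A real `VOCAB` comes from the source's lexicon file.

```python
LEXICON_BLOB = "https://github.com/NVIDIA/earth2studio/blob/main/earth2studio/lexicon"


def lexicon_url(source_file: str) -> str:
    """URL of a per-source lexicon file, e.g. lexicon_url("gfs")."""
    return f"{LEXICON_BLOB}/{source_file}.py"


def check_support(vocab: dict, requested: list[str]) -> tuple[list[str], list[str]]:
    """Split requested variables into (supported, missing) by VOCAB key lookup."""
    supported = [v for v in requested if v in vocab]
    missing = [v for v in requested if v not in vocab]
    return supported, missing


# Stand-in VOCAB with placeholder values; a real one maps each key to a
# source-specific identifier.
stub_vocab = {"t2m": "<source id>", "u500": "<source id>", "z850": "<source id>"}

supported, missing = check_support(stub_vocab, ["t2m", "z850", "tp"])
print(f"supported: {supported}; missing: {missing}")
print(lexicon_url("gfs"))
```

Any variable that lands in `missing` means the source is ruled out for that request and another candidate should be checked.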
Present the results clearly: "GFS supports `t2m`, `u500`, `z850`. HRRR also
supports these but is limited to North America. ARCO (ERA5) supports all
three and has data back to 1959."
Step 4. Confirm data source selection with user
Present the viable options with tradeoffs:
| Source | Variables | Coverage | Resolution | Time Range |
|---|---|---|---|---|
| ... | ... | ... | ... | ... |
Let the user pick. If there's one obvious choice, recommend it and ask for
confirmation.
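The options table can be rendered from whatever was verified in the previous steps. This is a minimal sketch with a hypothetical helper `options_table`. The coverage and resolution values are the ones quoted in Step 2, and cells that were not verified are left as `...` rather than guessed.

```python
COLUMNS = ["Source", "Variables", "Coverage", "Resolution", "Time Range"]


def options_table(rows: list[dict[str, str]]) -> str:
    """Render candidate sources as a Markdown table in the layout above."""
    lines = [
        "| " + " | ".join(COLUMNS) + " |",
        "|" + "---|" * len(COLUMNS),
    ]
    for row in rows:
        # Unknown cells stay as "..." instead of being filled in.
        lines.append("| " + " | ".join(row.get(c, "...") for c in COLUMNS) + " |")
    return "\n".join(lines)


rows = [
    {"Source": "GFS", "Coverage": "Global", "Resolution": "0.25°"},
    {"Source": "HRRR", "Coverage": "North America", "Resolution": "3 km"},
    {"Source": "WB2ERA5_32x64", "Coverage": "Global", "Resolution": "5.625°"},
]
print(options_table(rows))
```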
Step 5. Generate fetch script
Write a Python script that uses the selected data source to fetch the
requested data. The script structure depends on whether it's an analysis or
forecast source.
Analysis source pattern:
```python
import datetime
from earth2studio.data import <SourceClass>

# Initialize data source
ds = <SourceClass>()

# Fetch data
# Analysis sources use: ds(time, variable) -> xr.DataArray
time = [datetime.datetime(YYYY, M, D, H)]  # or array of times
variable = ["var1", "var2"]  # E2Studio variable names
data = ds(time, variable)
```
Forecast source pattern:
```python
import datetime
from earth2studio.data import <SourceClass>

# Initialize data source
ds = <SourceClass>()

# Forecast sources use: ds(time, lead_time, variable) -> xr.DataArray
time = [datetime.datetime(YYYY, M, D, H)]  # initialization time
lead_time = [datetime.timedelta(hours=H)]  # or array of lead times
variable = ["var1", "var2"]
data = ds(time, lead_time, variable)
```
Always fetch the specific data source's API doc page to confirm the exact
constructor arguments and call signature before writing the script — they can
vary (some need auth tokens, cache paths, specific parameters).
Include in the script:
- Appropriate imports
- Clear comments explaining each step
- How to inspect the result (`print(data)`, `data.shape`, `data.coords`)
- Optional: saving to file if the user requested it
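One way to produce such a script is to fill the analysis template programmatically. This is a sketch only: `render_analysis_script` and both templates are hypothetical. It assumes the chosen source class can be constructed without arguments, which must be confirmed against the live docs. The save step relies on xarray's `DataArray.to_netcdf`.

```python
from datetime import datetime

ANALYSIS_TEMPLATE = '''\
import datetime
from earth2studio.data import {source}

# Initialize data source (confirm constructor arguments in the live docs)
ds = {source}()

# Analysis sources use: ds(time, variable) -> xr.DataArray
time = [datetime.datetime({y}, {m}, {d}, {h})]
variable = {variables!r}
data = ds(time, variable)

# Inspect the result
print(data)
print(data.shape)
print(data.coords)
'''

SAVE_SNIPPET = '''
# Optional: save to NetCDF
data.to_netcdf({path!r})
'''


def render_analysis_script(source, when, variables, save_path=None):
    """Fill the analysis template; append a save step only if requested."""
    script = ANALYSIS_TEMPLATE.format(
        source=source, y=when.year, m=when.month, d=when.day, h=when.hour,
        variables=list(variables),
    )
    if save_path is not None:
        script += SAVE_SNIPPET.format(path=save_path)
    return script


print(render_analysis_script("GFS", datetime(2024, 1, 1, 0), ["t2m", "u500", "z850"]))
```

The generated text is only as correct as the template, so the source name and variables should already have passed the lexicon check in Step 3.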
Step 6. Offer next steps
After delivering the script, mention:
- How to change variables/times without rewriting the whole thing
- If they might want to feed this into a model, point them to the discover skill
- Cache behavior (data is cached locally after first fetch via `EARTH2STUDIO_CACHE`)
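Exposing the time and variables as command-line options is one way to let users change them without editing the script. This is a sketch under assumptions: `parse_args` and `configure_cache` are hypothetical helpers, and `EARTH2STUDIO_CACHE` is treated as an environment variable naming the cache directory, which should be confirmed in the docs.

```python
import argparse
import os
from datetime import datetime


def parse_args(argv=None):
    """Expose time and variables as command-line options."""
    parser = argparse.ArgumentParser(description="Fetch data via Earth2Studio")
    parser.add_argument("--time", type=datetime.fromisoformat,
                        default=datetime(2020, 1, 1, 0),
                        help="ISO timestamp, e.g. 2020-01-01T00:00")
    parser.add_argument("--variables", nargs="+", default=["t2m"],
                        help="Earth2Studio variable names")
    parser.add_argument("--cache-dir", default=None,
                        help="optional directory for EARTH2STUDIO_CACHE")
    return parser.parse_args(argv)


def configure_cache(cache_dir):
    """Point EARTH2STUDIO_CACHE at cache_dir (assumed to be an env variable)."""
    if cache_dir is not None:
        os.makedirs(cache_dir, exist_ok=True)
        # Set before importing earth2studio to be safe.
        os.environ["EARTH2STUDIO_CACHE"] = cache_dir


args = parse_args(["--time", "2020-01-01T06:00", "--variables", "z500", "t2m"])
configure_cache(args.cache_dir)
print(args.time, args.variables)
```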
Ownership and out-of-scope
Owns: identifying data sources for a user's variable/time request,
verifying variable support via lexicon, generating data fetch scripts,
explaining analysis vs. forecast source differences.
Does not own: installation (earth2studio-install), model selection
(earth2studio-discover), inference pipelines, custom data source creation
(point to extend examples), data source authentication setup beyond what
the docs describe.
Examples
Typical invocation:
"I need 500 hPa geopotential height and 2m temperature from ERA5 for January 1, 2020 at 00Z."
The skill would:
- Map plain language → `z500`, `t2m`
- Check ARCO/CDS/WB2ERA5 lexicons for support
- Recommend ARCO (free, no API key) or CDS (official, needs key)
- Generate a fetch script using the selected source
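If the user picks ARCO, the resulting script might look like the sketch below. It assumes `ARCO` can be constructed with default arguments and that both variables passed the lexicon check; both assumptions should be confirmed against the live docs. The function name `fetch_example` is hypothetical, and the import is deferred so the module loads even where Earth2Studio is not installed.

```python
import datetime

# Request from the example above: z500 and t2m at 2020-01-01 00Z.
TIME = [datetime.datetime(2020, 1, 1, 0)]
VARIABLES = ["z500", "t2m"]


def fetch_example():
    """Fetch the example request from ARCO; needs earth2studio and network access."""
    # Deferred import: only required when the fetch is actually run.
    from earth2studio.data import ARCO

    ds = ARCO()  # assumption: default constructor arguments are acceptable
    return ds(TIME, VARIABLES)  # analysis signature: ds(time, variable)
```

Calling `data = fetch_example()` and then `print(data)`, `data.shape`, and `data.coords` covers the inspection steps listed in Step 5.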
Limitations
- Network required: all data sources fetch from remote stores (GCS, S3, CDS API)
- No local file loading: for local NetCDF/Zarr, use `DataArrayFile` / `DataSetFile` directly
- One source type per script: cannot mix analysis and forecast sources in a single call
- Variable availability varies: not all sources provide all variables; always verify via lexicon
- Rate limits: CDS API has queue-based throttling; GCS/S3 sources are generally faster
Troubleshooting
| Error | Cause | Solution |
|---|---|---|
| | Not in lexicon | Check lexicon; try another source |
| | Time not available | Verify temporal coverage |
| | Queue congestion | Retry or use ARCO for ERA5 |
| | Not installed | Install Earth2Studio (see Prerequisites) |
| Empty DataArray | Time/var mismatch | Check datetime and variable name |
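The "retry or use ARCO" remedy for queue congestion can be wrapped in a generic helper. This is a sketch with hypothetical names, not an Earth2Studio API. It retries with exponential backoff and then calls an optional fallback, and `flaky_fetch` merely simulates a congested source.

```python
import time


def fetch_with_retry(fetch, retries=3, base_delay=1.0, fallback=None):
    """Call fetch(); on failure back off exponentially, then try fallback."""
    if retries < 1:
        raise ValueError("retries must be at least 1")
    last_error = None
    for attempt in range(retries):
        try:
            return fetch()
        except Exception as err:  # narrow to the source's real error type in practice
            last_error = err
            if attempt < retries - 1:
                time.sleep(base_delay * 2 ** attempt)
    if fallback is not None:
        # e.g. switch from a congested CDS queue to ARCO for ERA5
        return fallback()
    raise last_error


attempts = []


def flaky_fetch():
    """Simulated source that fails twice before succeeding."""
    attempts.append(1)
    if len(attempts) < 3:
        raise RuntimeError("queue congested")
    return "data"


result = fetch_with_retry(flaky_fetch, retries=3, base_delay=0.0)
print(result, len(attempts))
```

In a real script, `fetch` would be a zero-argument callable such as `lambda: ds(time, variable)`, and `fallback` would be the same call against the alternative source.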