earth2studio-data-fetch
Earth2Studio Data Fetch Skill
Purpose
Guides a user through downloading weather/climate data via Earth2Studio data source
APIs. Identifies compatible sources by checking the lexicon, verifies variable
support, and produces a working fetch script that outputs an xarray DataArray.
Prerequisites
- Earth2Studio installed (`uv pip install earth2studio` or equivalent)
- Network access to remote data stores (GCS, S3, CDS API, etc.)
- For CDS-based sources: valid CDS API key configured (`~/.cdsapirc`)
- Python 3.10+
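The local items in this checklist can be verified before any fetch is attempted. The following is a minimal sketch using only the standard library; the function name `check_prerequisites` and the keys it returns are illustrative and not part of Earth2Studio. It does not test network access or validate the contents of the CDS key file.

```python
import importlib.util
import sys
from pathlib import Path


def check_prerequisites() -> dict:
    """Report which local prerequisites appear to be satisfied."""
    return {
        # The skill expects Python 3.10 or newer
        "python_ok": sys.version_info >= (3, 10),
        # find_spec returns None when the package cannot be located
        "earth2studio_installed": importlib.util.find_spec("earth2studio") is not None,
        # Only relevant for CDS-based sources
        "cdsapirc_present": (Path.home() / ".cdsapirc").is_file(),
    }


if __name__ == "__main__":
    for name, ok in check_prerequisites().items():
        print(f"{name}: {'ok' if ok else 'missing'}")
```

A missing `cdsapirc_present` entry only matters when the selected source goes through the CDS API.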
Instructions
You are helping a user download specific weather/climate data using
Earth2Studio's data source APIs. Your job is to identify which data source(s)
can provide the requested variables, verify compatibility via the lexicon
system, and produce a working fetch script.
Core principle: live docs and lexicon are the source of truth
Data source APIs, available variables, and the lexicon evolve between releases.
Before recommending a data source or writing a fetch script:
- Fetch the relevant data source doc page to confirm the API signature and constructor arguments.
- Check the lexicon to verify the requested variable is supported by that data source.
Live doc references (fetch only what the user's request requires):
- Analysis data sources: https://nvidia.github.io/earth2studio/modules/datasources_analysis.html
- Forecast data sources: https://nvidia.github.io/earth2studio/modules/datasources_forecast.html
- DataFrame data sources: https://nvidia.github.io/earth2studio/modules/datasources_dataframe.html
- Lexicon base: https://github.com/NVIDIA/earth2studio/blob/main/earth2studio/lexicon/base.py
- Lexicon per-source: https://github.com/NVIDIA/earth2studio/tree/main/earth2studio/lexicon
Interaction protocol
Step 1. Understand the user's request
Extract from what the user has said (ask follow-ups if needed, cap at 3
questions):
- Variables: what do they want? Use Earth2Studio variable names
  (e.g. `t2m`, `u500`, `z850`, `tp`, `msl`). If the user uses plain language
  ("500 hPa geopotential height"), map it to the E2Studio name by checking the
  live `E2STUDIO_VOCAB` in `base.py`.
- Time: what date/time range? A single timestamp, a range, or multiple discrete times?
- Data type: analysis/reanalysis (historical state) or forecast (lead-time based)?
- Lead time (forecast only): how far ahead? Which initialization time?
- Region: global or regional (e.g. North America for HRRR)?
- Output format: xarray DataArray (default), save to file (NetCDF/Zarr)?
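One way to keep track of these details, and of the three-question cap, is a small record type. This is only a sketch: `FetchRequest`, its field names, and the question wording are hypothetical, and the `"global"` region default is an assumption rather than something the skill specifies.

```python
import datetime
from dataclasses import dataclass, field
from typing import List, Optional

MAX_FOLLOW_UPS = 3  # the protocol caps follow-up questions at three


@dataclass
class FetchRequest:
    """Hypothetical container for the details gathered in Step 1."""

    variables: List[str] = field(default_factory=list)  # e.g. ["t2m", "z500"]
    times: List[datetime.datetime] = field(default_factory=list)
    data_type: Optional[str] = None  # "analysis" or "forecast"
    lead_times: List[datetime.timedelta] = field(default_factory=list)
    region: str = "global"  # assumed default, not stated by the skill
    output_format: str = "dataarray"  # xarray DataArray is the stated default

    def follow_up_questions(self) -> List[str]:
        """Return questions for missing details, capped at MAX_FOLLOW_UPS."""
        questions = []
        if not self.variables:
            questions.append("Which variables do you need?")
        if not self.times:
            questions.append("Which date/time or range do you need?")
        if self.data_type is None:
            questions.append("Analysis/reanalysis or forecast data?")
        if self.data_type == "forecast" and not self.lead_times:
            questions.append("Which lead times and initialization time?")
        return questions[:MAX_FOLLOW_UPS]
```

A request that already names variables, times, and a data type yields no questions, so the skill can move straight to Step 2.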
Step 2. Identify candidate data sources
Based on the request type, narrow candidates:
Analysis/reanalysis (historical state at a specific time):
- Use analysis data source page to identify options
- Common choices: GFS (operational, recent), HRRR (NA, hourly), IFS/IFS_ENS (ECMWF), ARCO/CDS/WB2ERA5/NCAR_ERA5 (ERA5 reanalysis), GOES/MRMS/JPSS (observational)
Forecast (predictions from an initialization time with lead times):
- Use forecast data source page to identify options
- Common choices: GFS_FX, GEFS_FX, HRRR_FX, IFS_FX, IFS_ENS_FX, AIFS_FX, CFS_FX
Key differentiators to surface:
- Temporal coverage — operational sources (GFS, HRRR) have limited history; reanalysis (ERA5 via ARCO/CDS/WB2) goes back decades
- Spatial resolution — HRRR is 3km NA-only; GFS is 0.25° global; WB2ERA5_32x64 is 5.625° global
- Update frequency — some are real-time, some have multi-day lag
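The narrowing step can be sketched as a lookup from request type to the common choices listed above. The names below are copied from those lists as labels; whether each one matches an importable class name exactly is an assumption to confirm against the live doc pages. The helper `candidate_sources` is hypothetical.

```python
from typing import List

# Labels copied from the "Common choices" lists; the live docs stay authoritative.
COMMON_SOURCES = {
    "analysis": ["GFS", "HRRR", "IFS", "IFS_ENS", "ARCO", "CDS", "WB2ERA5",
                 "NCAR_ERA5", "GOES", "MRMS", "JPSS"],
    "forecast": ["GFS_FX", "GEFS_FX", "HRRR_FX", "IFS_FX", "IFS_ENS_FX",
                 "AIFS_FX", "CFS_FX"],
}


def candidate_sources(data_type: str) -> List[str]:
    """Return the common source labels for 'analysis', 'reanalysis' or 'forecast'."""
    key = data_type.strip().lower()
    if key == "reanalysis":
        key = "analysis"  # reanalysis sources are listed with the analysis sources
    if key not in COMMON_SOURCES:
        raise ValueError(f"unknown data type: {data_type!r}")
    return list(COMMON_SOURCES[key])
```

The result is only a starting list; coverage, resolution, and update frequency still have to be weighed per source.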
Step 3. Verify variable support via lexicon
This is critical. Each data source has a lexicon file that defines which
E2Studio variables it can provide.
To verify:
- Fetch the source's lexicon file from
  `https://github.com/NVIDIA/earth2studio/blob/main/earth2studio/lexicon/<source>.py`
  (e.g. `gfs.py`, `hrrr.py`, `cds.py`, `arco.py`, `wb2.py`)
- Check that the user's requested variable(s) appear as keys in the
  source's `VOCAB` dict
- If a variable is NOT in a source's lexicon, that source cannot provide it; try another
The lexicon VOCAB maps Earth2Studio variable names → source-specific
identifiers. If a variable key exists in the VOCAB, the source supports it.
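The key-membership check described here can be sketched with plain dictionaries. The two toy vocabularies below are hypothetical stand-ins with placeholder values; in practice the mapping would be read from the source's lexicon file, and the helper names `split_by_support` and `sources_supporting_all` are illustrative only.

```python
from typing import Dict, Iterable, List, Mapping, Tuple


def split_by_support(vocab: Mapping[str, str],
                     requested: Iterable[str]) -> Tuple[List[str], List[str]]:
    """Partition requested variable names into (supported, missing)."""
    supported, missing = [], []
    for name in requested:
        (supported if name in vocab else missing).append(name)
    return supported, missing


def sources_supporting_all(vocabs: Mapping[str, Mapping[str, str]],
                           requested: Iterable[str]) -> List[str]:
    """Names of sources whose vocabulary contains every requested variable."""
    wanted = list(requested)
    return [name for name, vocab in vocabs.items()
            if all(var in vocab for var in wanted)]


# Hypothetical vocabularies; real ones map to source-specific identifiers.
toy_vocabs: Dict[str, Dict[str, str]] = {
    "source_a": {"t2m": "<id>", "u500": "<id>", "z850": "<id>"},
    "source_b": {"t2m": "<id>", "tp": "<id>"},
}

print(split_by_support(toy_vocabs["source_b"], ["t2m", "z850"]))
print(sources_supporting_all(toy_vocabs, ["t2m", "u500"]))
```

Reporting the missing names alongside the supported ones makes it easier to explain why a source was ruled out.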
Present the results clearly: "GFS supports `t2m`, `u500`, `z850`. HRRR also
supports these but is limited to North America. ARCO (ERA5) supports all
three and has data back to 1959."
Step 4. Confirm data source selection with user
Present the viable options with tradeoffs:
| Source | Variables | Coverage | Resolution | Time Range |
|---|---|---|---|---|
| ... | ... | ... | ... | ... |
Let the user pick. If there's one obvious choice, recommend it and ask for
confirmation.
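If the comparison is assembled programmatically, a small renderer can emit the table in the layout shown above. This is a sketch; `render_options_table` is a hypothetical helper, and the example row reuses only facts stated earlier in this document.

```python
from typing import Dict, List

COLUMNS = ["Source", "Variables", "Coverage", "Resolution", "Time Range"]


def render_options_table(rows: List[Dict[str, str]]) -> str:
    """Render candidate sources as a pipe table with the Step 4 columns."""
    header = "| " + " | ".join(COLUMNS) + " |"
    divider = "|" + "---|" * len(COLUMNS)
    # Missing keys become empty cells rather than raising
    body = ["| " + " | ".join(row.get(col, "") for col in COLUMNS) + " |"
            for row in rows]
    return "\n".join([header, divider, *body])


print(render_options_table([
    {"Source": "GFS", "Variables": "t2m, u500, z850", "Coverage": "Global",
     "Resolution": "0.25°", "Time Range": "Limited history"},
]))
```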
Step 5. Generate fetch script
Write a Python script that uses the selected data source to fetch the
requested data. The script structure depends on whether it's an analysis or
forecast source.
Analysis source pattern:
```python
import datetime
from earth2studio.data import <SourceClass>

# Initialize data source
ds = <SourceClass>()

# Fetch data
# Analysis sources use: ds(time, variable) -> xr.DataArray
time = [datetime.datetime(YYYY, M, D, H)]  # or array of times
variable = ["var1", "var2"]  # E2Studio variable names
data = ds(time, variable)
```
Forecast source pattern:
```python
import datetime
from earth2studio.data import <SourceClass>

# Initialize data source
ds = <SourceClass>()

# Forecast sources use: ds(time, lead_time, variable) -> xr.DataArray
time = [datetime.datetime(YYYY, M, D, H)]  # initialization time
lead_time = [datetime.timedelta(hours=H)]  # or array of lead times
variable = ["var1", "var2"]
data = ds(time, lead_time, variable)
```
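For the "array of lead times" case in the forecast pattern, the list can be generated rather than typed out. A minimal sketch follows; `lead_time_range` is a hypothetical helper, and which lead times a given forecast source actually provides has to be confirmed on its doc page.

```python
import datetime
from typing import List


def lead_time_range(max_hours: int, step_hours: int) -> List[datetime.timedelta]:
    """Lead times from 0 h up to max_hours (inclusive) in steps of step_hours."""
    if step_hours <= 0:
        raise ValueError("step_hours must be positive")
    return [datetime.timedelta(hours=h)
            for h in range(0, max_hours + 1, step_hours)]


lead_time = lead_time_range(24, 6)  # 0 h, 6 h, 12 h, 18 h, 24 h
```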
Always fetch the specific data source's API doc page to confirm the exact
constructor arguments and call signature before writing the script — they can
vary (some need auth tokens, cache paths, specific parameters).
Include in the script:
- Appropriate imports
- Clear comments explaining each step
- How to inspect the result (`print(data)`, `data.shape`, `data.coords`)
- Optional: saving to file if the user requested it
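The checklist can be folded into one helper that fetches, prints a summary, and optionally saves. This is a sketch under assumptions: `fetch_and_report` is a hypothetical wrapper, it follows the analysis call pattern only, and it relies on the returned object behaving like an xarray DataArray (`shape`, `coords`, `to_netcdf`).

```python
import datetime
from typing import Callable, List, Optional


def fetch_and_report(source: Callable,
                     time: List[datetime.datetime],
                     variable: List[str],
                     out_path: Optional[str] = None):
    """Call an analysis-style source, print a summary, optionally save to NetCDF."""
    data = source(time, variable)  # analysis pattern: ds(time, variable)
    print(data)
    print("shape:", data.shape)
    print("coords:", list(data.coords))
    if out_path is not None:
        data.to_netcdf(out_path)  # only when the user asked for a file
    return data


# Example use (needs earth2studio and network access; GFS is one of the analysis
# sources named earlier, and its default constructor arguments are assumed):
#
#     from earth2studio.data import GFS
#     data = fetch_and_report(GFS(), [datetime.datetime(2024, 1, 1, 0)], ["t2m"])
```

Passing the source in as an argument keeps the helper independent of any one data source class, so the constructor details confirmed from the docs stay in the calling script.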
Step 6. Offer next steps
After delivering the script, mention:
- How to change variables/times without rewriting the whole thing
- If they might want to feed this into a model, point them to the discover skill
- Cache behavior (data is cached locally after first fetch via
  `EARTH2STUDIO_CACHE`)
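One way to let users change variables and times without editing the script is to read them from the command line. The sketch below uses only the standard library; the flag names `--time` and `--variable` and the `YYYY-MM-DDTHH` time format are illustrative choices, not an Earth2Studio convention.

```python
import argparse
import datetime
from typing import List, Optional


def parse_time(text: str) -> datetime.datetime:
    """Parse a time such as 2020-01-01T00 (hour resolution)."""
    return datetime.datetime.strptime(text, "%Y-%m-%dT%H")


def parse_args(argv: Optional[List[str]] = None) -> argparse.Namespace:
    parser = argparse.ArgumentParser(description="Fetch variables at given times.")
    parser.add_argument("--time", nargs="+", type=parse_time, required=True,
                        help="one or more times, e.g. 2020-01-01T00")
    parser.add_argument("--variable", nargs="+", required=True,
                        help="Earth2Studio variable names, e.g. t2m z500")
    return parser.parse_args(argv)
```

The parsed `args.time` and `args.variable` lists can then be passed to the data source call in place of the hard-coded values.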
Ownership and out-of-scope
Owns: identifying data sources for a user's variable/time request,
verifying variable support via lexicon, generating data fetch scripts,
explaining analysis vs. forecast source differences.
Does not own: installation (earth2studio-install), model selection
(earth2studio-discover), inference pipelines, custom data source creation
(point to extend examples), data source authentication setup beyond what
the docs describe.
Examples
Typical invocation:
"I need 500 hPa geopotential height and 2m temperature from ERA5 for January 1, 2020 at 00Z."
The skill would:
- Map plain language → `z500`, `t2m`
- Check ARCO/CDS/WB2ERA5 lexicons for support
- Recommend ARCO (free, no API key) or CDS (official, needs key)
- Generate a fetch script using the selected source
Limitations
- Network required: all data sources fetch from remote stores (GCS, S3, CDS API)
- No local file loading: for local NetCDF/Zarr, use `DataArrayFile` / `DataSetFile` directly
- One source type per script: cannot mix analysis and forecast sources in a single call
- Variable availability varies: not all sources provide all variables; always verify via lexicon
- Rate limits: CDS API has queue-based throttling; GCS/S3 sources are generally faster
Troubleshooting
| Error | Cause | Solution |
|---|---|---|
|  | Not in lexicon | Check lexicon; try another source |
|  | Time not available | Verify temporal coverage |
|  | Queue congestion | Retry or use ARCO for ERA5 |
|  | Not installed | `uv pip install earth2studio` |
| Empty DataArray | Time/var mismatch | Check datetime and variable name |