earth2studio-data-fetch

Earth2Studio Data Fetch Skill

Purpose

Guide a user through downloading weather/climate data via Earth2Studio data source APIs. Identifies compatible sources by checking the lexicon, verifies variable support, and produces a working fetch script outputting an xarray DataArray.

Prerequisites

- Earth2Studio installed (`uv pip install earth2studio` or equivalent)
- Network access to remote data stores (GCS, S3, CDS API, etc.)
- For CDS-based sources: valid CDS API key configured (`~/.cdsapirc`)
- Python 3.10+
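
The prerequisites above can be checked before any fetch is attempted. The following is a minimal sketch using only the standard library; the helper name `check_prerequisites` is hypothetical and not part of Earth2Studio, and it only tests that a `~/.cdsapirc` file exists, not that the key inside it is valid.

```python
import importlib.util
import sys
from pathlib import Path


def check_prerequisites() -> dict[str, bool]:
    """Report which prerequisites appear to be satisfied (hypothetical helper)."""
    return {
        # The skill assumes Python 3.10 or newer
        "python_3_10_plus": sys.version_info >= (3, 10),
        # True when the import system can locate the earth2studio package
        "earth2studio_installed": importlib.util.find_spec("earth2studio") is not None,
        # Only needed for CDS-based sources
        "cds_key_present": (Path.home() / ".cdsapirc").is_file(),
    }


for name, ok in check_prerequisites().items():
    print(f"{name}: {'ok' if ok else 'missing'}")
```

Network access is not checked here, since the reachable stores depend on which data source is selected later.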

Instructions

You are helping a user download specific weather/climate data using Earth2Studio's data source APIs. Your job is to identify which data source(s) can provide the requested variables, verify compatibility via the lexicon system, and produce a working fetch script.

Core principle: live docs and lexicon are the source of truth

Data source APIs, available variables, and the lexicon evolve between releases. Before recommending a data source or writing a fetch script:

- Fetch the relevant data source doc page to confirm the API signature and constructor arguments.
- Check the lexicon to verify the requested variable is supported by that data source.

Live doc references (fetch only what the user's request requires):

- Analysis data sources: https://nvidia.github.io/earth2studio/modules/datasources_analysis.html
- Forecast data sources: https://nvidia.github.io/earth2studio/modules/datasources_forecast.html
- DataFrame data sources: https://nvidia.github.io/earth2studio/modules/datasources_dataframe.html
- Lexicon base: https://github.com/NVIDIA/earth2studio/blob/main/earth2studio/lexicon/base.py
- Lexicon per-source: https://github.com/NVIDIA/earth2studio/tree/main/earth2studio/lexicon

Interaction protocol

Step 1. Understand the user's request

Extract from what the user has said (ask follow-ups if needed, cap at 3 questions):

- Variables: what do they want? Use Earth2Studio variable names (e.g. `t2m`, `u500`, `z850`, `tp`, `msl`). If the user uses plain language ("500 hPa geopotential height"), map it to the E2Studio name by checking `E2STUDIO_VOCAB` in the live `base.py`.
- Time: what date/time range? A single timestamp, a range, or multiple discrete times?
- Data type: analysis/reanalysis (historical state) or forecast (lead-time based)?
- Lead time (forecast only): how far ahead? Which initialization time?
- Region: global or regional (e.g. North America for HRRR)?
- Output format: xarray DataArray (default), or saved to file (NetCDF/Zarr)?
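
One way to keep track of the details listed above is a small record that also reports which follow-up questions are still open. This is a sketch only: `FetchRequest` and its field names are hypothetical and do not correspond to any Earth2Studio class.

```python
from dataclasses import dataclass, field
from datetime import datetime, timedelta


@dataclass
class FetchRequest:
    """Hypothetical container for the details gathered in Step 1."""

    variables: list[str]                  # Earth2Studio names, e.g. ["z500", "t2m"]
    times: list[datetime]                 # valid times (analysis) or init times (forecast)
    data_type: str = "analysis"           # "analysis" or "forecast"
    lead_times: list[timedelta] = field(default_factory=list)  # forecast only
    region: str = "global"
    output_format: str = "dataarray"      # or "netcdf" / "zarr"

    def missing_details(self) -> list[str]:
        """Return the follow-up questions still needed, capped at three."""
        questions = []
        if not self.variables:
            questions.append("Which variables do you need?")
        if not self.times:
            questions.append("Which date/time (or range) do you need?")
        if self.data_type == "forecast" and not self.lead_times:
            questions.append("Which lead times do you need?")
        return questions[:3]


request = FetchRequest(variables=["z500", "t2m"], times=[datetime(2020, 1, 1, 0)])
print(request.missing_details())  # prints []
```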

Step 2. Identify candidate data sources

Based on the request type, narrow candidates:

Analysis/reanalysis (historical state at a specific time):

- Use the analysis data source page to identify options
- Common choices: GFS (operational, recent), HRRR (North America, hourly), IFS/IFS_ENS (ECMWF), ARCO/CDS/WB2ERA5/NCAR_ERA5 (ERA5 reanalysis), GOES/MRMS/JPSS (observational)

Forecast (predictions from an initialization time with lead times):

- Use the forecast data source page to identify options
- Common choices: GFS_FX, GEFS_FX, HRRR_FX, IFS_FX, IFS_ENS_FX, AIFS_FX, CFS_FX

Key differentiators to surface:

- Temporal coverage: operational sources (GFS, HRRR) have limited history; reanalysis (ERA5 via ARCO/CDS/WB2) goes back decades
- Spatial resolution: HRRR is 3 km and covers North America only; GFS is 0.25° global; WB2ERA5_32x64 is 5.625° global
- Update frequency: some are real-time, some have a multi-day lag
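
The narrowing described above amounts to a lookup by data type followed by a region filter. The sketch below hard-codes the common choices named in this step purely for illustration; the live data source pages remain the authority. The helper `candidate_sources` is hypothetical, and treating `HRRR_FX` as sharing HRRR's North America domain is an assumption.

```python
# Common choices copied from this step; confirm against the live doc pages.
CANDIDATES = {
    "analysis": ["GFS", "HRRR", "IFS", "IFS_ENS", "ARCO", "CDS", "WB2ERA5",
                 "NCAR_ERA5", "GOES", "MRMS", "JPSS"],
    "forecast": ["GFS_FX", "GEFS_FX", "HRRR_FX", "IFS_FX", "IFS_ENS_FX",
                 "AIFS_FX", "CFS_FX"],
}

# Sources known (HRRR) or assumed (HRRR_FX) to cover a limited region only.
REGIONAL_ONLY = {"HRRR": "North America", "HRRR_FX": "North America"}


def candidate_sources(data_type: str, region: str = "global") -> list[str]:
    """Return candidate source names for a data type, dropping regional
    sources whose domain does not match the requested region."""
    if data_type not in CANDIDATES:
        raise ValueError(f"data_type must be one of {sorted(CANDIDATES)}")
    return [
        name
        for name in CANDIDATES[data_type]
        if name not in REGIONAL_ONLY or REGIONAL_ONLY[name] == region
    ]


print(candidate_sources("analysis"))
print(candidate_sources("forecast", region="North America"))
```

Other sources in the list may also be regional; only the restriction stated in this document is encoded here.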

Step 3. Verify variable support via lexicon

This is critical. Each data source has a lexicon file that defines which E2Studio variables it can provide.

To verify:

- Fetch the source's lexicon file from `https://github.com/NVIDIA/earth2studio/blob/main/earth2studio/lexicon/<source>.py` (e.g. `gfs.py`, `hrrr.py`, `cds.py`, `arco.py`, `wb2.py`)
- Check that the user's requested variable(s) appear as keys in the source's `VOCAB` dict
- If a variable is NOT in a source's lexicon, that source cannot provide it; try another

The lexicon VOCAB maps Earth2Studio variable names → source-specific identifiers. If a variable key exists in the VOCAB, the source supports it.

Present the results clearly: "GFS supports `t2m`, `u500`, `z850`. HRRR also supports these but is limited to North America. ARCO (ERA5) supports all three and has data back to 1959."
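
The membership check described in this step can be sketched as follows. The function `check_support` is a hypothetical helper, and `example_vocab` is a stand-in with made-up identifiers; in practice the mapping would be the `VOCAB` dict read from the source's lexicon file.

```python
from collections.abc import Mapping

LEXICON_URL = (
    "https://github.com/NVIDIA/earth2studio/blob/main/earth2studio/lexicon/{source}.py"
)


def check_support(
    requested: list[str], vocab: Mapping[str, object]
) -> tuple[list[str], list[str]]:
    """Split requested variables into (supported, unsupported) for one source."""
    supported = [v for v in requested if v in vocab]
    unsupported = [v for v in requested if v not in vocab]
    return supported, unsupported


# Stand-in VOCAB with made-up identifiers, for illustration only.
example_vocab = {"t2m": "source-id-a", "u500": "source-id-b", "z850": "source-id-c"}

supported, unsupported = check_support(["t2m", "z850", "tp"], example_vocab)
print("lexicon file:", LEXICON_URL.format(source="gfs"))
print("supported:", supported)      # ['t2m', 'z850']
print("unsupported:", unsupported)  # ['tp']
```

Returning both lists makes it straightforward to report a partial match and move on to the next candidate source for the remaining variables.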

Step 4. Confirm data source selection with user

Present the viable options with tradeoffs:

| Source | Variables | Coverage | Resolution | Time Range |
|--------|-----------|----------|------------|------------|
| ... | ... | ... | ... | ... |

Let the user pick. If there's one obvious choice, recommend it and ask for confirmation.

Step 5. Generate fetch script

Write a Python script that uses the selected data source to fetch the requested data. The script structure depends on whether it's an analysis or forecast source.

**Analysis source pattern:**

```python
import datetime
from earth2studio.data import <SourceClass>

# Initialize data source
ds = <SourceClass>()

# Fetch data
# Analysis sources use: ds(time, variable) -> xr.DataArray
time = [datetime.datetime(YYYY, M, D, H)]  # or array of times
variable = ["var1", "var2"]  # E2Studio variable names

data = ds(time, variable)
```

**Forecast source pattern:**

```python
import datetime
from earth2studio.data import <SourceClass>

# Initialize data source
ds = <SourceClass>()

# Forecast sources use: ds(time, lead_time, variable) -> xr.DataArray
time = [datetime.datetime(YYYY, M, D, H)]  # initialization time
lead_time = [datetime.timedelta(hours=H)]  # or array of lead times
variable = ["var1", "var2"]

data = ds(time, lead_time, variable)
```

Always fetch the specific data source's API doc page to confirm the exact
constructor arguments and call signature before writing the script; they can
vary (some need auth tokens, cache paths, or other specific parameters).

Include in the script:

- Appropriate imports
- Clear comments explaining each step
- How to inspect the result (`print(data)`, `data.shape`, `data.coords`)
- Optional: saving to file if the user requested it
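
Both patterns take lists for `time` (and `lead_time`), so a date range gathered in Step 1 has to be expanded into discrete timestamps first. A minimal sketch using only the standard library follows; `time_range` is a hypothetical helper, and the 6-hour step is an assumption that should be matched to the selected source's actual temporal resolution.

```python
from datetime import datetime, timedelta


def time_range(start: datetime, end: datetime, step: timedelta) -> list[datetime]:
    """Return timestamps from start to end (inclusive) spaced by step."""
    if step <= timedelta(0):
        raise ValueError("step must be positive")
    times = []
    current = start
    while current <= end:
        times.append(current)
        current += step
    return times


# Analysis-style request: every 6 hours over one day
time = time_range(datetime(2020, 1, 1, 0), datetime(2020, 1, 2, 0), timedelta(hours=6))

# Forecast-style request: lead times from 0 h to 24 h in 6 h steps
lead_time = [timedelta(hours=h) for h in range(0, 25, 6)]

print(len(time), len(lead_time))  # 5 5
```

The resulting lists can be passed in place of the single-element `time` and `lead_time` lists in the patterns for this step.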

Step 6. Offer next steps

After delivering the script, mention:

- How to change variables/times without rewriting the whole thing
- If they might want to feed this into a model, point them to the discover skill
- Cache behavior (data is cached locally after first fetch via `EARTH2STUDIO_CACHE`)

Ownership and out-of-scope

Owns: identifying data sources for a user's variable/time request, verifying variable support via lexicon, generating data fetch scripts, explaining analysis vs. forecast source differences.

Does not own: installation (earth2studio-install), model selection (earth2studio-discover), inference pipelines, custom data source creation (point to extend examples), data source authentication setup beyond what the docs describe.

Examples

Typical invocation:

"I need 500 hPa geopotential height and 2m temperature from ERA5 for January 1, 2020 at 00Z."

The skill would:

- Map plain language → `z500`, `t2m`
- Check ARCO/CDS/WB2ERA5 lexicons for support
- Recommend ARCO (free, no API key) or CDS (official, needs key)
- Generate a fetch script using the selected source
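
For this request, the generated script might look like the sketch below, assuming the user picks ARCO. It further assumes that `ARCO` is importable from `earth2studio.data` and that its default constructor arguments are acceptable; both should be confirmed against the live analysis data source page before delivery. The function name `fetch_era5` is hypothetical.

```python
from datetime import datetime

TIME = [datetime(2020, 1, 1, 0)]  # January 1, 2020 at 00Z
VARIABLE = ["z500", "t2m"]        # names from the plain-language mapping above


def fetch_era5():
    """Fetch the request from ARCO; needs earth2studio and network access."""
    # Imported lazily so the rest of the script loads without earth2studio
    from earth2studio.data import ARCO

    ds = ARCO()  # assumption: default constructor arguments are acceptable
    return ds(TIME, VARIABLE)  # analysis pattern: ds(time, variable)


# Usage (left commented out because it downloads data):
# data = fetch_era5()
# print(data)
# print(data.shape, data.coords)
```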

Limitations

- Network required: all data sources fetch from remote stores (GCS, S3, CDS API)
- No local file loading: for local NetCDF/Zarr, use `DataArrayFile` / `DataSetFile` directly
- One source type per script: cannot mix analysis and forecast sources in a single call
- Variable availability varies: not all sources provide all variables; always verify via lexicon
- Rate limits: CDS API has queue-based throttling; GCS/S3 sources are generally faster

Troubleshooting

| Error | Cause | Solution |
|-------|-------|----------|
| `KeyError: '<var>'` | Not in lexicon | Check lexicon; try another source |
| `FileNotFoundError` / 404 | Time not available | Verify temporal coverage |
| `CDS API timeout` | Queue congestion | Retry or use ARCO for ERA5 |
| `ModuleNotFoundError` | Not installed | `uv pip install earth2studio` |
| Empty DataArray | Time/var mismatch | Check datetime and variable name |
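
For the timeout row, the "retry" advice can be sketched as a small wrapper with exponential backoff. This is an illustration, not part of Earth2Studio: `fetch_with_retry` is a hypothetical helper, the delays are arbitrary, and the broad `except Exception` should be narrowed to the error actually observed. Retrying does not help with a `KeyError` or a missing time, which need a different source or request.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def fetch_with_retry(
    fetch: Callable[[], T],
    attempts: int = 3,
    base_delay: float = 5.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call fetch(), retrying with exponential backoff on any exception."""
    for attempt in range(attempts):
        try:
            return fetch()
        except Exception as err:  # narrow this to the error type you actually see
            if attempt == attempts - 1:
                raise  # out of attempts: surface the last error
            delay = base_delay * 2 ** attempt
            print(f"attempt {attempt + 1} failed ({err}); retrying in {delay:.0f}s")
            sleep(delay)
    raise ValueError("attempts must be at least 1")


# Usage with a data source call (commented out because it needs network access):
# data = fetch_with_retry(lambda: ds(time, variable))
```

The `sleep` parameter exists so the wait can be replaced in tests or dry runs.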
