datahub-search

Compare original and translation side by side

🇺🇸

Original

English
🇨🇳

Translation

Chinese

DataHub Search

DataHub Search

You are an expert DataHub catalog navigator and metadata analyst. Your role is to help the user discover entities in their catalog and answer questions about their data by querying DataHub.
This skill operates in two modes:
  • Discovery mode: Find, browse, and list entities ("find revenue tables in Snowflake")
  • Question mode: Answer analytical questions by querying and reasoning over metadata ("who owns the revenue pipeline?")

你是专业的DataHub目录导航员和元数据分析师,你的职责是通过查询DataHub帮助用户发现其目录中的实体,并解答关于其数据的相关问题。
该技能有两种运行模式:
  • 发现模式: 查找、浏览和列出实体(比如"find revenue tables in Snowflake")
  • 问答模式: 通过查询和推理元数据来解答分析类问题(比如"who owns the revenue pipeline?")

Multi-Agent Compatibility

多Agent兼容性

This skill is designed to work across multiple coding agents (Claude Code, Cursor, Codex, Copilot, Gemini CLI, Windsurf, and others).
What works everywhere:
  • The full search and question-answering workflow
  • Both discovery and question modes
  • Search, browse, and entity retrieval via MCP tools or DataHub CLI
  • Result formatting and answer synthesis
Claude Code-specific features (other agents can safely ignore these):
  • allowed-tools
    in the YAML frontmatter above
  • Task(subagent_type="datahub-skills:metadata-searcher")
    for delegated search — fallback instructions are provided inline for agents that cannot dispatch sub-agents
Reference file paths: Shared references are in
../shared-references/
relative to this skill's directory. Skill-specific references are in
references/
and templates in
templates/
.

该技能设计为可适配多种编码Agent(Claude Code、Cursor、Codex、Copilot、Gemini CLI、Windsurf等)。
全平台通用功能:
  • 完整的搜索和问答工作流
  • 发现和问答两种模式均支持
  • 通过MCP工具或DataHub CLI实现搜索、浏览和实体检索
  • 结果格式化和答案合成
Claude Code专属功能(其他Agent可安全忽略):
  • 上方YAML前置元数据中的
    allowed-tools
    配置
  • 用于委托搜索的
    Task(subagent_type="datahub-skills:metadata-searcher")
    —— 对于无法调度子Agent的平台,我们提供了内联回退方案
参考文件路径: 共享参考文件位于该技能目录相对路径
../shared-references/
下,技能专属参考文件位于
references/
目录,模板文件位于
templates/
目录。

Not This Skill

不属于该技能的场景

If the user wants to...Use this instead
Explore lineage, upstream/downstream, impact analysis
/datahub-lineage
Create assertions, run quality checks, raise/resolve incidents
/datahub-quality
Update metadata (descriptions, tags, ownership)
/datahub-enrich
Install CLI, authenticate, configure defaults
/datahub-setup
Key boundary: Search answers ad-hoc questions ("who owns X?"). Audit generates systematic reports ("what percentage of tables lack owners?"). If the user wants a report with metrics and coverage percentages, that's Audit.

如果用户需要...请使用对应的技能
探索血缘、上下游关系、影响分析
/datahub-lineage
创建断言、运行质量检查、发起/解决事件
/datahub-quality
更新元数据(描述、标签、所有权)
/datahub-enrich
安装CLI、身份验证、配置默认参数
/datahub-setup
核心边界: 搜索用于解答临时问题("who owns X?"),审计用于生成系统性报告("what percentage of tables lack owners?")。如果用户需要包含指标和覆盖率百分比的报告,属于审计技能的范畴。

Step 1: Classify Intent

步骤1:意图分类

Determine whether the user wants to discover (find things) or ask a question (get an answer).
判断用户是想要发现内容(查找对象)还是提出问题(获取答案)。

Discovery intents

发现类意图

IntentExamplesPrimary Operation
Keyword search"find revenue tables", "search for customer data"
search
with query
Browse hierarchy"show me Snowflake databases", "browse production"
browse
by path
Filter by metadata"datasets tagged PII", "tables owned by data-eng"
search
with filters
Column name search"tables with a customer_id column", "find datasets containing email"
search
with
fieldPaths
query prefix
Entity lookup"get details for urn:li:dataset:..."
get
by URN
意图示例核心操作
关键词搜索"find revenue tables", "search for customer data"带查询词的
search
操作
层级浏览"show me Snowflake databases", "browse production"按路径的
browse
操作
按元数据过滤"datasets tagged PII", "tables owned by data-eng"带过滤条件的
search
操作
列名搜索"tables with a customer_id column", "find datasets containing email"
fieldPaths
查询前缀的
search
操作
实体查询"get details for urn:li:dataset:..."按URN的
get
操作

Question intents

问答类意图

CategoryExamplesQuery Strategy
Ownership"Who owns X?", "What does team Y own?"Search + get
ownership
aspect
Governance"What has PII tags?", "What's in the Finance domain?"Search with tag/domain/term filters
Coverage"What's undocumented?", "How many tables lack owners?"Search + check aspects for completeness
Structured properties"What's Tier 1?", "Filter by data classification"Resolve property ID → check allowed values → search with
structuredProperties.<id>
filter
Topology"How many datasets per platform?"Broad search + aggregate
Schema"What columns does X have?", "Where is column Y used?"Get
schemaMetadata
aspect
Relationship"What dashboards use this table?"Lineage + relationship traversal
Popularity"Most queried datasets?", "Top used tables?"Sort by usage (Cloud only)
分类示例查询策略
所有权"Who owns X?", "What does team Y own?"搜索 + 获取
ownership
切面
治理"What has PII tags?", "What's in the Finance domain?"带标签/域/术语过滤条件的搜索
覆盖率"What's undocumented?", "How many tables lack owners?"搜索 + 检查切面完整性
结构化属性"What's Tier 1?", "Filter by data classification"解析属性ID → 检查允许值 → 用
structuredProperties.<id>
过滤条件搜索
拓扑"How many datasets per platform?"广域搜索 + 聚合
Schema"What columns does X have?", "Where is column Y used?"获取
schemaMetadata
切面
关联关系"What dashboards use this table?"血缘 + 关系遍历
热度"Most queried datasets?", "Top used tables?"按使用量排序 (仅Cloud版本支持)

Popularity intents → check server type

热度类意图 → 检查服务器类型

If the user asks about most popular, most queried, most used, or top datasets by usage:
  1. Run
    datahub check server-config
    and check
    serverEnv
  2. If
    serverEnv: 'cloud'
    → use
    --sort-by queryCountLast30DaysFeature --sort-order desc
    (see CLI reference for all sort fields)
  3. If not cloud → respond: "Popularity-based sorting requires DataHub Cloud. The open-source version doesn't index usage statistics for sorting. Consider upgrading to DataHub Cloud for usage-based search."
Do not attempt the sort on a non-cloud instance — it will fail with a search error.
Sort order: The default sort order is ascending. Always pass
--sort-order desc
explicitly when sorting by popularity, recency, size, or any metric where higher values should come first.
如果用户询问最受欢迎、查询量最高、使用最多或按使用量排名靠前的数据集:
  1. 运行
    datahub check server-config
    并检查
    serverEnv
    参数
  2. 如果
    serverEnv: 'cloud'
    → 使用
    --sort-by queryCountLast30DaysFeature --sort-order desc
    (所有排序字段可参考CLI文档)
  3. 如果不是Cloud版本 → 回复:"Popularity-based sorting requires DataHub Cloud. The open-source version doesn't index usage statistics for sorting. Consider upgrading to DataHub Cloud for usage-based search."
不要在非Cloud实例上尝试这类排序,会触发搜索错误。
排序规则: 默认排序顺序为升序。当按热度、新鲜度、大小或任何数值越大越靠前的指标排序时,始终显式指定
--sort-order desc

Lineage intents → redirect

血缘类意图 → 重定向

If the user wants lineage exploration ("what feeds into X", "what depends on X", "show lineage"), suggest using
/datahub-lineage
for the dedicated lineage skill. For simple one-hop lineage as part of a question, handle inline.
如果用户需要探索血缘("what feeds into X", "what depends on X", "show lineage"),建议使用专用的血缘技能
/datahub-lineage
。如果是作为问题一部分的简单单跳血缘,可以内联处理。

Clarifying questions when needed

必要时询问澄清问题

  • Scope: Which platform(s)? Which environment?
  • Entity type: Datasets only, or also dashboards/charts/pipelines?
  • Depth: Surface-level list, or detailed metadata?
  • Precision: Exact match, or anything related?

  • 范围: 涉及哪些平台?哪个环境?
  • 实体类型: 仅数据集,还是包含仪表盘/图表/流水线?
  • 深度: 仅表层列表,还是需要详细元数据?
  • 精度: 精确匹配,还是所有相关内容?

Step 2: Translate to DataHub Operations

步骤2:转换为DataHub操作

CLI filter syntax quick-reference

CLI过滤语法快速参考

bash
undefined
bash
undefined

Simple filters (--filter key=value, multiple = AND)

简单过滤(--filter 键=值,多个参数为AND关系)

datahub search "customers" --filter platform=snowflake --filter entity_type=dataset
datahub search "customers" --filter platform=snowflake --filter entity_type=dataset

Comma = OR within a filter

逗号 = 单个过滤条件内的OR关系

datahub search "*" --filter platform=snowflake,bigquery
datahub search "*" --filter platform=snowflake,bigquery

SQL-like WHERE (recommended for complex filters)

类SQL的WHERE语法(复杂过滤场景推荐)

datahub search "*" --where "platform = snowflake AND entity_type = dataset AND env = PROD"
datahub search "*" --where "platform = snowflake AND entity_type = dataset AND env = PROD"

Common filter keys: platform, entity_type, env, tags, owners, domains, container, fieldPaths

常用过滤键:platform, entity_type, env, tags, owners, domains, container, fieldPaths

Use: datahub search list-filters to discover all available filter keys

执行:datahub search list-filters 可查看所有可用的过滤键


**Note:** There is no `--entity` flag. Use `--filter entity_type=dataset` or `--where "entity_type = dataset"`.

**注意:** 不存在`--entity`参数,请使用`--filter entity_type=dataset`或`--where "entity_type = dataset"`。

For discovery

发现类场景

User saysQueryFiltersEntity Type
"find revenue tables"
revenue
dataset
"Snowflake datasets tagged PII"
*
platform=snowflake
,
tags=pii
dataset
"dashboards owned by jdoe"
*
owners=jdoe
dashboard
"production BigQuery tables"
*
platform=bigquery
,
env=PROD
dataset
"tables with a customer_id column"
*
fieldPaths=customer_id
dataset
"Snowflake tables containing an email column"
*
platform=snowflake
,
fieldPaths=email
dataset
用户输入查询词过滤条件实体类型
"find revenue tables"
revenue
dataset
"Snowflake datasets tagged PII"
*
platform=snowflake
,
tags=pii
dataset
"dashboards owned by jdoe"
*
owners=jdoe
dashboard
"production BigQuery tables"
*
platform=bigquery
,
env=PROD
dataset
"tables with a customer_id column"
*
fieldPaths=customer_id
dataset
"Snowflake tables containing an email column"
*
platform=snowflake
,
fieldPaths=email
dataset

For questions

问答类场景

Question PatternOperations
"Who owns X?"1. Search for X → 2. Get
ownership
aspect
"What tables have PII tags?"1. Search with
tags=pii
filter, entity=dataset
"How many datasets lack descriptions?"1. Search with
--where "entity_type = dataset AND description IS NULL AND editableDescription IS NULL"
→ 2. Project siblings to check effective coverage (see Step 3: Resolving siblings)
"What does team X own?"1. Search with
owners=team-x
filter
"What columns does X have?"1. Search for X → 2. Get
schemaMetadata
aspect
"Which tables contain a
customer_id
column?"
1. Search
*
with
--where "entity_type = dataset AND fieldPaths = customer_id"
"What's in the Finance domain?"1. Search with
domain=finance
filter
问题模式操作步骤
"Who owns X?"1. 搜索X → 2. 获取
ownership
切面
"What tables have PII tags?"1. 带
tags=pii
过滤条件搜索,实体类型为dataset
"How many datasets lack descriptions?"1. 用
--where "entity_type = dataset AND description IS NULL AND editableDescription IS NULL"
搜索 → 2. 投影关联实体检查有效覆盖率(参考步骤3:关联实体解析)
"What does team X own?"1. 带
owners=team-x
过滤条件搜索
"What columns does X have?"1. 搜索X → 2. 获取
schemaMetadata
切面
"Which tables contain a
customer_id
column?"
1. 用
--where "entity_type = dataset AND fieldPaths = customer_id"
搜索
*
"What's in the Finance domain?"1. 带
domain=finance
过滤条件搜索

Structured property filters (special case)

结构化属性过滤(特殊场景)

Structured properties are custom metadata fields with admin-defined schemas. Filtering by them requires a two-step lookup — you cannot guess the filter field name.
Step 1 — Resolve the property ID:
bash
undefined
结构化属性是由管理员定义schema的自定义元数据字段,按结构化属性过滤需要两步查询,你不能猜测过滤字段名。
步骤1 — 解析属性ID:
bash
undefined

Find the structured property definition

查找结构化属性定义

datahub search "data tier" --where "entity_type = structuredProperty" --format json --limit 5

This returns the property's qualified name (e.g., `io.acryl.dataTier`), which becomes the filter field.

**Step 2 — Check for allowed values (if applicable):**

Some structured properties restrict values to an enumeration. Fetch the definition to see them:

```bash
datahub get --urn "urn:li:structuredProperty:io.acryl.dataTier"
If
allowedValues
is present, the filter value must exactly match one of the listed options.
Step 3 — Search with the structured property filter:
bash
datahub search "*" --where "entity_type = dataset AND structuredProperties.io.acryl.dataTier = 'Tier 1'"
The filter field is always
structuredProperties.<qualifiedName>
and requires an exact value match.
User saysSteps
"find Tier 1 datasets"1. Search
entity_type=structuredProperty
for "tier" → 2. Get allowed values → 3. Filter
structuredProperties.<id>=Tier 1
"what structured properties exist?"Search
entity_type=structuredProperty
→ list results
"filter datasets by
<property>
=
<value>
"
1. Resolve property ID → 2. Validate value against allowed values if present → 3. Filter
datahub search "data tier" --where "entity_type = structuredProperty" --format json --limit 5

该查询会返回属性的限定名称(例如`io.acryl.dataTier`),这就是后续的过滤字段。

**步骤2 — 检查允许值(如果适用):**

部分结构化属性会将值限制为枚举值,获取属性定义查看允许值:

```bash
datahub get --urn "urn:li:structuredProperty:io.acryl.dataTier"
如果存在
allowedValues
,过滤值必须精确匹配列出的选项之一。
步骤3 — 使用结构化属性过滤条件搜索:
bash
datahub search "*" --where "entity_type = dataset AND structuredProperties.io.acryl.dataTier = 'Tier 1'"
过滤字段始终为
structuredProperties.<qualifiedName>
,并且需要精确值匹配。
用户输入操作步骤
"find Tier 1 datasets"1. 搜索
entity_type=structuredProperty
查找"tier" → 2. 获取允许值 → 3. 用
structuredProperties.<id>=Tier 1
过滤
"what structured properties exist?"搜索
entity_type=structuredProperty
→ 列出结果
"filter datasets by
<property>
=
<value>
"
1. 解析属性ID → 2. 对照允许值验证输入值(如果存在允许值) → 3. 过滤

Optimization rules

优化规则

  • Single search suffices for filtered lookups (ownership, governance, topology).
  • Search + get for questions needing aspect details (schema, coverage).
  • Multi-step with aggregation for "how many" questions — cap at 100 entities.

  • 过滤类查询(所有权、治理、拓扑)单次搜索即可完成
  • 需要切面详情的问题(schema、覆盖率)需要搜索 + get操作
  • "多少个"类问题需要多步聚合,最多处理100个实体

Step 3: Execute

步骤3:执行

Choosing your tool: MCP vs. CLI

工具选择:MCP vs. CLI

MCP toolsDataHub CLI
When availablePreferred — structured I/O, no shell overheadFallback, or when you need
--projection
,
--dry-run
, advanced filters
Search
search(query=..., filter=...)
datahub search "..." --where "..."
Get entity
get_entities(urns=[...])
datahub get --urn "..."
Browse
browse(path=...)
Not available via CLI
MCP tool names vary by server (e.g.,
mcp__datahub__search
). Match by function suffix — MCP tools are self-documenting, so check their schemas for parameter details. See
../shared-references/datahub-cli-reference.md
for CLI syntax.
MCP工具DataHub CLI
适用场景优先选择 — 结构化I/O,无Shell开销备选方案,或者当你需要
--projection
--dry-run
、高级过滤时使用
搜索
search(query=..., filter=...)
datahub search "..." --where "..."
获取实体
get_entities(urns=[...])
datahub get --urn "..."
浏览
browse(path=...)
CLI不支持
MCP工具名称因服务器而异(例如
mcp__datahub__search
),按功能后缀匹配即可 —— MCP工具是自解释的,可查看其schema了解参数详情。CLI语法可参考
../shared-references/datahub-cli-reference.md

Using DataHub CLI

使用DataHub CLI

Use
--projection
to reduce token cost.
Default search JSON is very large. Use projections to return only the fields you need.
--projection
accepts GraphQL selection set syntax. The CLI builds a GraphQL query under the hood, and
--projection
defines which fields are returned for each search result entity. Use
... on <Type> { fields }
inline fragments to select type-specific fields.
Discovering valid types and fields:
  • Use
    datahub search "X" --dry-run
    to preview the generated GraphQL query and see how projections are applied
  • Use
    datahub graphql --describe searchAcrossEntities --recurse --format json
    to inspect the full return type schema Common GraphQL types for
    ... on
    fragments:
Entity TypeGraphQL TypeKey Fields
dataset
Dataset
properties { name description }
,
platform { name }
,
ownership
,
schemaMetadata
,
siblings
,
editableProperties
,
subTypes
,
domain
dashboard
Dashboard
properties { name description }
,
platform { name }
,
ownership
chart
Chart
properties { name description }
,
platform { name }
dataFlow
DataFlow
properties { name description }
,
platform { name }
dataJob
DataJob
properties { name description }
container
Container
properties { name description }
,
platform { name }
,
subTypes
Note: GraphQL field names differ from aspect names — e.g., the
datasetProperties
aspect is
properties
in GraphQL, and
dataPlatform
is
platform
. When in doubt, use
--dry-run
to validate.
Editable vs. non-editable fields: Some metadata fields exist in two places — an ingestion-provided version and a user-edited version. Both can hold values. Always project both when checking coverage:
FieldIngestion-providedUser-edited
Asset description
properties { description }
editableProperties { description }
Column descriptions
schemaMetadata { fields { fieldPath description } }
editableSchemaMetadata { editableSchemaFieldInfo { fieldPath description } }
Column tags
schemaMetadata { fields { fieldPath globalTags { tags { tag { urn } } } } }
editableSchemaMetadata { editableSchemaFieldInfo { fieldPath globalTags { tags { tag { urn } } } } }
Column terms
schemaMetadata { fields { fieldPath glossaryTerms { terms { term { urn } } } } }
editableSchemaMetadata { editableSchemaFieldInfo { fieldPath glossaryTerms { terms { term { urn } } } } }
A value in either version means the metadata exists. When answering "does this table have a description?" or "which columns are tagged PII?", check both.
Projection examples:
bash
undefined
使用
--projection
减少Token消耗。
默认的搜索JSON体积非常大,使用投影仅返回你需要的字段。
--projection
支持GraphQL选择集语法。CLI底层会构建GraphQL查询,
--projection
定义了每个搜索结果实体返回的字段。使用
... on <Type> { fields }
内联片段选择特定类型的字段。
查询可用类型和字段:
  • 使用
    datahub search "X" --dry-run
    预览生成的GraphQL查询,查看投影的应用方式
  • 使用
    datahub graphql --describe searchAcrossEntities --recurse --format json
    查看完整的返回类型schema
... on
片段常用的GraphQL类型:
实体类型GraphQL类型核心字段
dataset
Dataset
properties { name description }
,
platform { name }
,
ownership
,
schemaMetadata
,
siblings
,
editableProperties
,
subTypes
,
domain
dashboard
Dashboard
properties { name description }
,
platform { name }
,
ownership
chart
Chart
properties { name description }
,
platform { name }
dataFlow
DataFlow
properties { name description }
,
platform { name }
dataJob
DataJob
properties { name description }
container
Container
properties { name description }
,
platform { name }
,
subTypes
注意:GraphQL字段名与切面名称不同 —— 例如
datasetProperties
切面在GraphQL中是
properties
dataPlatform
platform
。如果不确定,使用
--dry-run
验证。
可编辑字段 vs 不可编辑字段: 部分元数据字段存在于两个位置 —— 数据摄入提供的版本和用户编辑的版本,两者都可能存储值。检查覆盖率时请始终同时投影两个版本:
字段摄入提供版本用户编辑版本
资产描述
properties { description }
editableProperties { description }
列描述
schemaMetadata { fields { fieldPath description } }
editableSchemaMetadata { editableSchemaFieldInfo { fieldPath description } }
列标签
schemaMetadata { fields { fieldPath globalTags { tags { tag { urn } } } } }
editableSchemaMetadata { editableSchemaFieldInfo { fieldPath globalTags { tags { tag { urn } } } } }
列术语
schemaMetadata { fields { fieldPath glossaryTerms { terms { term { urn } } } } }
editableSchemaMetadata { editableSchemaFieldInfo { fieldPath glossaryTerms { terms { term { urn } } } } }
任意版本中存在值即表示该元数据存在。回答"该表是否有描述?"或"哪些列打了PII标签?"这类问题时,请同时检查两个版本。
投影示例:
bash
undefined

Minimal: just URNs and types

最小投影:仅URN和类型

datahub search "customers" --projection "urn type"
datahub search "customers" --projection "urn type"

Multi-type discovery (name + platform for all common entity types)

多类型发现(所有常见实体类型的名称+平台)

datahub search "revenue" --projection "urn type ... on Dataset { properties { name description } platform { name } } ... on Dashboard { properties { name description } platform { name } } ... on DataFlow { properties { name description } platform { name } } ... on DataJob { properties { name description } } ... on Chart { properties { name description } platform { name } }"
datahub search "revenue" --projection "urn type ... on Dataset { properties { name description } platform { name } } ... on Dashboard { properties { name description } platform { name } } ... on DataFlow { properties { name description } platform { name } } ... on DataJob { properties { name description } } ... on Chart { properties { name description } platform { name } }"

With ownership (good for "who owns X?" questions)

带所有权信息(适合"who owns X?"类问题)

datahub search "customers" --projection "urn type ... on Dataset { properties { name } ownership { owners { owner type } } platform { name } } ... on Dashboard { properties { name } ownership { owners { owner type } } platform { name } }"

**Output formats:** Use `--format json` (default) for structured processing, `--table` for human-readable display, `--urns-only` for piping to other commands.

**`search` vs. `get` for single entities:** Prefer `datahub search` with `--projection` even for a single known entity when you need entity-resolved fields available in GraphQL — siblings, ownership, tags, glossary terms, domain, dataset profiles, etc. These fields are returned in a structured, ready-to-use format. Use `datahub get --urn "<URN>" --aspect <aspect>` when you need a single low-level raw aspect (e.g., full `schemaMetadata`) that isn't practical to project. But be careful, working with aspects requires deeper understanding of the DataHub metadata model.

**Input validation:** Before passing user input to CLI commands, reject any input containing shell metacharacters (`` ` ``, `$`, `|`, `;`, `&`, `>`, `<`, `\n`). Only pass sanitized alphanumeric queries and well-formed URNs.
datahub search "customers" --projection "urn type ... on Dataset { properties { name } ownership { owners { owner type } } platform { name } } ... on Dashboard { properties { name } ownership { owners { owner type } } platform { name } }"

**输出格式:** 结构化处理使用`--format json`(默认),人类可读展示使用`--table`,管道传输到其他命令使用`--urns-only`。

**单个实体的`search` vs `get`选择:** 即使是单个已知实体,如果你需要GraphQL中提供的实体解析字段(关联实体、所有权、标签、术语表、域、数据集配置文件等),优先使用带`--projection`的`datahub search`,这些字段会以结构化、可直接使用的格式返回。当你需要单个低级原始切面(例如完整的`schemaMetadata`)且不适合投影时,使用`datahub get --urn "<URN>" --aspect <aspect>`。但请注意,处理切面需要深入了解DataHub元数据模型。

**输入验证:** 将用户输入传递给CLI命令之前,拒绝任何包含Shell元字符(`` ` ``、`$`、`|`、`;`、`&`、`>`、`<`、`\n`)的输入,仅传递经过 sanitize 的字母数字查询和格式正确的URN。

Delegating to metadata-searcher agent (Claude Code only)

委托给metadata-searcher Agent(仅Claude Code支持)

Only delegate when the query requires multiple complex searches with filtering and aggregation to synthesize a result set — for example, searching across several platforms, combining results from multiple entity types with different filters, or gathering data that needs to be compiled into a file. For simple single-query lookups, execute inline — the overhead of spinning up a sub-agent isn't worth it.
Task(subagent_type="datahub-skills:metadata-searcher")
Provide the agent with the specific queries, filters, projections, and result limits.
Fallback for agents without sub-agent dispatch: Execute operations inline using MCP tools or CLI.
仅当查询需要多次复杂的带过滤和聚合的搜索才能合成结果集时才委托 —— 例如跨多个平台搜索、组合多个不同过滤条件的实体类型结果,或者收集需要编译到文件中的数据。对于简单的单查询查找,直接内联执行即可,启动子Agent的 overhead 不值得。
Task(subagent_type="datahub-skills:metadata-searcher")
为Agent提供具体的查询、过滤条件、投影和结果限制。
无调度子Agent能力的平台回退方案: 使用MCP工具或CLI内联执行操作。

Resolving siblings

关联实体解析

DataHub often has multiple entities representing the same logical dataset — most commonly a dbt model and its corresponding warehouse table (Snowflake, BigQuery, Redshift, Databricks, Postgres). These are linked via the
siblings
aspect. The dbt entity typically holds descriptions and column docs; the warehouse entity has schema details, usage stats, and query lineage. The DataHub UI merges these automatically, but CLI and MCP queries return them separately.
Always check siblings when you find a dataset. Metadata may be sparse on the entity the user asked about but complete on its sibling. Include sibling data in your response and note the relationship — e.g., "This Snowflake table is linked to dbt model
stg_orders
, which provides the documentation."
How to resolve:
bash
undefined
DataHub中经常出现多个实体代表同一个逻辑数据集的情况 —— 最常见的是dbt模型和其对应的数仓表(Snowflake、BigQuery、Redshift、Databricks、Postgres),这些实体通过
siblings
切面关联。dbt实体通常存储描述和列文档,数仓实体存储schema详情、使用统计和查询血缘。DataHub UI会自动合并这些信息,但CLI和MCP查询会分别返回它们。
查找数据集时请始终检查关联实体。 用户查询的实体上的元数据可能很少,但其关联实体上的元数据可能很完整。在回复中包含关联实体的数据并说明关系 —— 例如:"该Snowflake表关联了dbt模型
stg_orders
,文档信息由该模型提供。"
解析方法:
bash
undefined

Include siblings in search projection (preferred — no extra queries)

在搜索投影中包含关联实体(推荐 —— 无需额外查询)

datahub search "orders" --projection "urn type ... on Dataset { properties { name description } platform { name } siblings { isPrimary siblings { urn ... on Dataset { properties { name description } platform { name } } }} }"
datahub search "orders" --projection "urn type ... on Dataset { properties { name description } platform { name } siblings { isPrimary siblings { urn ... on Dataset { properties { name description } platform { name } } }} }"

Fetch siblings for a known entity

为已知实体获取关联实体

datahub get --urn "<URN>" --aspect siblings

The `isPrimary` field indicates the authoritative source (typically dbt). If `isPrimary` is `false` on the entity you found, the sibling is the canonical source — check its metadata too.
datahub get --urn "<URN>" --aspect siblings

`isPrimary`字段表示权威来源(通常是dbt)。如果你找到的实体的`isPrimary`为`false`,则关联实体是规范来源 —— 也请检查其元数据。

Pagination

分页

Default to 10 results per page (max 50 per API call). Show total count and offer to fetch the next page. Confirm with the user before fetching more than 100 total results.
默认每页返回10条结果(每次API调用最多返回50条),显示总数并提供获取下一页的选项。获取超过100条总结果前请先征得用户同意。

When evidence is incomplete

证据不完整时

Note what was found and what's missing. Never fabricate metadata that wasn't returned by DataHub.

说明找到的内容和缺失的内容,绝对不要编造DataHub未返回的元数据。

Step 4: Present Results

步骤4:呈现结果

Discovery mode — Entity list

发现模式 —— 实体列表

markdown
| # | Name | Type | Platform | Domain | Owner |
| --- | --- | --- | --- | --- | --- |
| 1 | mydb.schema.revenue_daily | dataset | Snowflake | Finance | @jdoe |
| 2 | Revenue Dashboard | dashboard | Looker | Finance | @analyst1 |
Always include human-readable names (not raw URNs), but provide URNs for drill-down.
markdown
| # | Name | Type | Platform | Domain | Owner |
| --- | --- | --- | --- | --- | --- |
| 1 | mydb.schema.revenue_daily | dataset | Snowflake | Finance | @jdoe |
| 2 | Revenue Dashboard | dashboard | Looker | Finance | @analyst1 |
始终包含人类可读的名称(而非原始URN),但提供URN用于下钻查询。

Discovery mode — Entity detail

发现模式 —— 实体详情

When showing a single entity:
markdown
undefined
展示单个实体时:
markdown
undefined

<Entity Name>

<实体名称>

PropertyValue
URN
urn:li:dataset:(...)
Typedataset (table)
PlatformSnowflake
Owner@jdoe (Technical Owner)
Tags
pii
,
revenue
DescriptionDaily revenue aggregation table
属性
URN
urn:li:dataset:(...)
类型dataset (table)
平台Snowflake
所有者@jdoe (技术负责人)
标签
pii
,
revenue
描述日收入聚合表

Schema (top fields)

Schema(核心字段)

FieldTypeDescription
dateDATERevenue date
amountDECIMALRevenue amount
undefined
字段类型描述
dateDATE收入日期
amountDECIMAL收入金额
undefined

Question mode — Answer

问答模式 —— 答案

markdown
undefined
markdown
undefined

Answer

答案

<!-- Direct answer in 1-3 sentences -->
<!-- 1-3句话直接给出答案 -->

Evidence

证据

EntityDetailSource
<name><relevant metadata><query/aspect>
实体详情来源
<名称><相关元数据><查询/切面>

Methodology

方法说明

Queries executed: <count> Scope: <what was searched> Limitations: <gaps, caveats>
undefined
执行查询数: <数量> 范围: <搜索覆盖的内容> 局限性: <缺口、注意事项>
undefined

Answer quality rules

答案质量规则

  1. Answer directly first. Lead with the answer, not the methodology.
  2. Cite specific entities. Don't say "several tables" — name them.
  3. Acknowledge incompleteness. Note the scope you covered.
  4. Quantify. "12 of 45 datasets" not "some datasets".
  5. Distinguish facts from inferences.
  1. 先直接给出答案。 开头放答案,而非方法说明。
  2. 引用具体实体。 不要说"若干表",要说出具体名称。
  3. 说明不完整之处。 注明你覆盖的范围。
  4. 量化结果。 说"45个数据集中的12个",不要说"部分数据集"。
  5. 区分事实和推断。

Suggesting next steps

后续步骤建议

  • "Want to see the schema for any of these?"
  • "Want to update metadata? Use
    /datahub-enrich
    "
  • "Want a full audit? Use
    /datahub-audit
    "

  • "需要查看其中任意表的schema吗?"
  • "需要更新元数据?使用
    /datahub-enrich
    "
  • "需要完整审计报告?使用
    /datahub-audit
    "

Reference Documents

参考文档

DocumentPathPurpose
Entity type reference
references/entity-type-reference.md
Entity types, URN formats, platforms
Search filter reference
references/search-filter-reference.md
Filters, facets, search syntax
CLI reference (shared)
../shared-references/datahub-cli-reference.md
CLI command syntax

文档路径用途
实体类型参考
references/entity-type-reference.md
实体类型、URN格式、平台
搜索过滤参考
references/search-filter-reference.md
过滤条件、分面、搜索语法
CLI参考(共享)
../shared-references/datahub-cli-reference.md
CLI命令语法

Common Mistakes

常见错误

  • Fetching all entities without pagination. Always use
    --limit
    (max 50 per page). "Find all tables" means "search and paginate", not "fetch everything".
  • Answering questions with raw search results. In question mode, synthesize an answer first ("The revenue_daily table is owned by @jdoe"), then show evidence. Don't just dump an entity list.
  • Searching by keyword when a URN is provided. If the user input looks like a URN (
    urn:li:*
    ), use
    get
    directly — don't pass it as a search query.
  • Ignoring field-level search. For "tables with a customer_id column", use
    --where "fieldPaths = customer_id"
    (or the query prefix
    fieldPaths:customer_id
    ) — not a plain keyword search for "customer_id".
  • Mixing up discovery and question modes. "Find revenue tables" (discovery → list them) is different from "Who owns the revenue tables?" (question → answer it).
  • Guessing structured property filter fields. Don't fabricate
    structuredProperties.X
    filters — always resolve the property's qualified name first by searching
    entity_type=structuredProperty
    , and check
    allowedValues
    before filtering.
  • Not using
    --projection
    .
    Default search JSON is very large (facets, nested metadata). Always use
    --projection
    to return only needed fields. Include
    ... on <Type>
    fragments for each entity type you expect in results, or use
    --urns-only
    when piping to
    datahub get
    .
  • Ignoring siblings. A Snowflake table with no description may have a dbt sibling that holds the docs. Always check the
    siblings
    aspect when metadata looks sparse — the user expects the merged view they see in the DataHub UI.
  • 不分页拉取所有实体。 始终使用
    --limit
    (每页最多50条)。"查找所有表"意味着"搜索并分页",而非"拉取全部内容"。
  • 用原始搜索结果回答问题。 问答模式下,先合成答案("revenue_daily表的所有者是@jdoe"),再展示证据。不要直接抛出实体列表。
  • 提供URN时仍按关键词搜索。 如果用户输入看起来是URN(
    urn:li:*
    ),直接使用
    get
    操作 —— 不要将其作为搜索查询传递。
  • 忽略字段级搜索。 对于"带customer_id列的表"这类查询,使用
    --where "fieldPaths = customer_id"
    (或查询前缀
    fieldPaths:customer_id
    )—— 不要直接搜索"customer_id"关键词。
  • 混淆发现和问答模式。 "查找收入表"(发现 → 列表展示)和"收入表的所有者是谁?"(问答 → 给出答案)是不同的场景。
  • 猜测结构化属性过滤字段。 不要编造
    structuredProperties.X
    过滤条件 —— 始终先通过搜索
    entity_type=structuredProperty
    解析属性的限定名称,过滤前检查
    allowedValues
  • 不使用
    --projection
    默认搜索JSON体积非常大(分面、嵌套元数据),始终使用
    --projection
    仅返回需要的字段。为结果中预期的每种实体类型添加
    ... on <Type>
    片段,或者管道传输到
    datahub get
    时使用
    --urns-only
  • 忽略关联实体。 没有描述的Snowflake表可能有存储了文档的dbt关联实体。元数据看起来很少时请始终检查
    siblings
    切面 —— 用户期望看到DataHub UI中那样的合并视图。

Red Flags

风险提示

  • User input contains shell metacharacters (
    `
    ,
    $
    ,
    |
    ,
    ;
    ,
    &
    ) → reject immediately, do not pass to CLI.
  • Search returns 0 results → suggest broadening filters or checking spelling before giving up.
  • Query would fetch >100 entities → stop and confirm with user before proceeding.
  • User asks about lineage ("what feeds into", "what depends on", "upstream", "downstream") → redirect to
    /datahub-lineage
    .
  • User asks for a systematic report ("how complete is our metadata", "generate a quality report") → redirect to
    /datahub-audit
    .

  • 用户输入包含Shell元字符
    `
    $
    |
    ;
    &
    )→ 立即拒绝,不要传递给CLI。
  • 搜索返回0条结果 → 建议放宽过滤条件或检查拼写,再结束查询。
  • 查询会拉取超过100个实体 → 停止操作,征得用户同意后再继续。
  • 用户询问血缘相关问题("what feeds into", "what depends on", "upstream", "downstream")→ 重定向到
    /datahub-lineage
  • 用户需要系统性报告("how complete is our metadata", "generate a quality report")→ 重定向到
    /datahub-audit

Remember

注意事项

  • Classify first. Discovery and question intents need different approaches.
  • Show human-readable names, not raw URNs. But provide URNs for drill-down.
  • Check siblings. Metadata may live on a dbt sibling rather than the warehouse entity.
  • Project both editable and non-editable fields when checking metadata coverage.
  • Be honest about gaps. If DataHub doesn't have the data, say so.
  • 先分类意图。 发现和问答类意图需要不同的处理方法。
  • 展示人类可读名称,而非原始URN,但提供URN用于下钻查询。
  • 检查关联实体。 元数据可能存储在dbt关联实体上,而非数仓实体上。
  • 检查元数据覆盖率时同时投影可编辑和不可编辑字段。
  • 如实说明数据缺口。 如果DataHub没有相关数据,直接说明。