optimize-data-model-compute

Compare original and translation side by side

🇺🇸

Original

English

🇨🇳

Translation

Chinese

Optimize Orchestration Compute (BigQuery Reservations)

优化编排计算资源（BigQuery 预留插槽）

Purpose

目标

Automatically assign data models from data orchestration to BigQuery slot reservations based on compute complexity and cost optimization strategy. Routes high-compute workloads to on-demand capacities while using reservations pricing for high-traffic jobs.

根据计算复杂度和成本优化策略，自动将数据编排中的数据模型分配到BigQuery预留插槽。将高计算量工作负载路由到按需计算资源，同时为高流量作业使用预留定价。

When to Use

适用场景

Assigning new models/actions to appropriate compute tiers (reserved vs on-demand)
Rebalancing reservation assignments based on priority changes
Optimizing costs by moving low-priority workloads to on-demand
Ensuring critical pipelines get guaranteed compute resources

为新模型/操作分配合适的计算层级（预留 vs 按需）
根据优先级变化重新平衡预留插槽分配
将低优先级工作负载转移到按需计算以优化成本
确保关键流水线获得有保障的计算资源

Implementation Steps

实施步骤

Step 1: Source Configuration

步骤1：源配置

Extract and apply recommendations from Masthead insights automatically.

Ask user for the BigQuery dataset where Masthead insights are stored. This dataset must contain an
```
insights
```
table with the same schema as
```
masthead-prod.YOUR_DATASET.insights
```
. If the user does not have access, direct them to request access.
Once provided, immediately persist it by appending to the project-level agent instructions file (e.g.
```
.github/copilot-instructions.md
```
, or
```
AGENTS.md
```
— whichever already exists in the project root, or creating if none exist):
```

MASTHEAD_DATASET=YOUR_DATASET

```
On subsequent runs, read this value from the instructions file instead of prompting the user again.
Identify which orchestration technology is in use by querying available recommendations:

bash

bq query --project_id=YOUR_PROJECT --use_legacy_sql=false --format=csv \
"SELECT subtype, COUNT(*) AS recommendation_count
FROM \`masthead-prod.YOUR_DATASET.insights\`
WHERE category = 'Cost'
  AND type = 'Compute costs'
GROUP BY subtype"

If compute model recommendations are not present or all rows share the same technology, infer from the user's project structure (presence of

dbt_project.yml

→ dbt,

definitions/

folder → Dataform, Airflow DAG files → Airflow). Confirm the detected tool with the user before proceeding.

Pull recommendations from
```
masthead-prod.YOUR_DATASET.insights
```
:

bash

bq query --project_id=YOUR_PROJECT --use_legacy_sql=false --format=csv \
"SELECT
  action_name,
  recommended_model,
  cost_30d,
  last_updated_time
FROM \`masthead-prod.YOUR_DATASET.insights\`
WHERE category = 'Cost'
  AND type = 'Compute costs'
  AND subtype = ...
ORDER BY cost_30d DESC" > compute_assignment_candidates.csv

Pull recommendations from
```
masthead-prod.YOUR_DATASET.insights
```
:

bash

bq query --project_id=YOUR_PROJECT --use_legacy_sql=false --format=csv \
"SELECT
  action_name,
  recommended_model,
  cost_30d,
  last_updated_time
FROM \`masthead-prod.YOUR_DATASET.insights\`
WHERE category = 'Cost'
  AND type = 'Compute costs'
  AND subtype = ...
ORDER BY cost_30d DESC" > compute_assignment_candidates.csv

Resolve reservation targets using
```
recommended_model
```
values and reservation edition metadata:

Verify reservation editions using
```
INFORMATION_SCHEMA.RESERVATIONS
```
:

bash

bq query --project_id=RESERVATION_ADMIN_PROJECT --location=US --use_legacy_sql=false --format=csv \
"SELECT
  reservation_name,
  project_id,
  edition,
  slot_capacity
FROM \`region-us\`.INFORMATION_SCHEMA.RESERVATIONS
ORDER BY project_id, reservation_name"

Map

recommended_model = 'ON_DEMAND'

to the config entry where

reservation = 'none'

For all other values (for example
```
ENTERPRISE
```
), choose a reservation whose edition matches
```
recommended_model
```
.
If exactly one matching reservation exists, assign automatically.
If multiple matching reservations exist, ask the user which reservation tag to use.
If no matching reservation exists, ask the user to pick a fallback reservation or create a new matching reservation first.
Ensure an on-demand bucket exists. If missing, create one:

javascript

{
  tag: 'on_demand',
  reservation: 'none',
  actions: []
}

Convert recommendations into assignment lists automatically:

bash

bq query --project_id=YOUR_PROJECT --use_legacy_sql=false --format=csv \
"SELECT
  action_name,
  recommended_model
FROM \`masthead-prod.YOUR_DATASET.insights\`
WHERE category = 'Cost'
  AND type = 'Compute costs'
  AND subtype = ...
ORDER BY cost_30d DESC" > compute_assignment_final.csv

Apply all recommendations based on the detected orchestration tool:

自动提取并应用Masthead洞察中的建议。

询问用户存储Masthead洞察的BigQuery数据集。该数据集必须包含一个
```
insights
```
表，其架构与
```
masthead-prod.YOUR_DATASET.insights
```
一致。如果用户没有访问权限，引导他们申请访问。
收到数据集信息后，立即持久化存储：将其追加到项目级别的Agent说明文件中（例如
```
.github/copilot-instructions.md
```
或
```
AGENTS.md
```
，优先使用项目根目录已存在的文件，若不存在则创建新文件）：
```

MASTHEAD_DATASET=YOUR_DATASET

```
后续运行时，直接从该说明文件读取此值，无需再次询问用户。
通过查询可用建议，确定正在使用的编排技术：

bash

bq query --project_id=YOUR_PROJECT --use_legacy_sql=false --format=csv \
"SELECT subtype, COUNT(*) AS recommendation_count
FROM \`masthead-prod.YOUR_DATASET.insights\`
WHERE category = 'Cost'
  AND type = 'Compute costs'
GROUP BY subtype"

如果没有计算模型相关建议，或所有行使用相同技术，则从项目结构推断（存在

dbt_project.yml

→ dbt，存在

definitions/

文件夹 → Dataform，存在Airflow DAG文件 → Airflow）。在继续之前，需与用户确认检测到的工具。

从
```
masthead-prod.YOUR_DATASET.insights
```
提取建议：

bash

bq query --project_id=YOUR_PROJECT --use_legacy_sql=false --format=csv \
"SELECT
  action_name,
  recommended_model,
  cost_30d,
  last_updated_time
FROM \`masthead-prod.YOUR_DATASET.insights\`
WHERE category = 'Cost'
  AND type = 'Compute costs'
  AND subtype = ...
ORDER BY cost_30d DESC" > compute_assignment_candidates.csv

从
```
masthead-prod.YOUR_DATASET.insights
```
提取建议：

bash

bq query --project_id=YOUR_PROJECT --use_legacy_sql=false --format=csv \
"SELECT
  action_name,
  recommended_model,
  cost_30d,
  last_updated_time
FROM \`masthead-prod.YOUR_DATASET.insights\`
WHERE category = 'Cost'
  AND type = 'Compute costs'
  AND subtype = ...
ORDER BY cost_30d DESC" > compute_assignment_candidates.csv

使用
```
recommended_model
```
值和预留版本元数据确定预留目标：

使用
```
INFORMATION_SCHEMA.RESERVATIONS
```
验证预留版本：

bash

bq query --project_id=RESERVATION_ADMIN_PROJECT --location=US --use_legacy_sql=false --format=csv \
"SELECT
  reservation_name,
  project_id,
  edition,
  slot_capacity
FROM \`region-us\`.INFORMATION_SCHEMA.RESERVATIONS
ORDER BY project_id, reservation_name"

将

recommended_model = 'ON_DEMAND'

映射到

reservation = 'none'

的配置项。

对于其他值（例如
```
ENTERPRISE
```
），选择版本与
```
recommended_model
```
匹配的预留插槽。
如果恰好存在一个匹配的预留插槽，自动分配。
如果存在多个匹配的预留插槽，询问用户使用哪个预留标签。
如果没有匹配的预留插槽，询问用户选择一个 fallback 预留插槽或先创建一个新的匹配预留插槽。
确保按需计算桶存在。如果缺失，创建一个：

javascript

{
  tag: 'on_demand',
  reservation: 'none',
  actions: []
}

将建议自动转换为分配列表：

bash

bq query --project_id=YOUR_PROJECT --use_legacy_sql=false --format=csv \
"SELECT
  action_name,
  recommended_model
FROM \`masthead-prod.YOUR_DATASET.insights\`
WHERE category = 'Cost'
  AND type = 'Compute costs'
  AND subtype = ...
ORDER BY cost_30d DESC" > compute_assignment_final.csv

根据检测到的编排工具应用所有建议：

Dataform

Open
```
definitions/_reservations.js
```
.

Replace

on_demand

actions

with all rows where

recommended_model = 'ON_DEMAND'

Replace reserved reservation
```
actions
```
with all rows where
```
recommended_model != 'ON_DEMAND'
```
.
Remove duplicates; keep only actions present in this repo's Dataform graph.
Verify:
```
dataform compile
```
and check for duplicate assignments.
See package reference:
```
@masthead-data/dataform-package
```

打开
```
definitions/_reservations.js
```
。

将

on_demand

的

actions

替换为所有

recommended_model = 'ON_DEMAND'

的行。

将预留插槽的
```
actions
```
替换为所有
```
recommended_model != 'ON_DEMAND'
```
的行。
移除重复项；仅保留此仓库Dataform图中存在的操作。
验证：执行
```
dataform compile
```
并检查是否存在重复分配。
参考包：
```
@masthead-data/dataform-package
```

dbt

Follow the reservation assignment workflow from
```
masthead-data/dbt-reservations
```
.
Map recommendations to the appropriate dbt model tags or selector targets.
Update the relevant
```
dbt_project.yml
```
or profile configuration per the repo's instructions.

遵循
```
masthead-data/dbt-reservations
```
中的预留分配流程。
将建议映射到相应的dbt模型标签或选择器目标。
根据该仓库的说明更新相关的
```
dbt_project.yml
```
或配置文件。

Airflow

Follow the reservation assignment workflow from
```
masthead-data/airflow-reservations
```
.
Map recommendations to DAG or task-level BigQuery reservation labels.
Update the relevant operator configuration per the repo's instructions.

遵循
```
masthead-data/airflow-reservations
```
中的预留分配流程。
将建议映射到DAG或任务级别的BigQuery预留标签。
根据该仓库的说明更新相关的算子配置。

Step 2: Verify Changes

步骤2：验证变更

After applying, confirm assignments are non-overlapping and align with the recommendation output. For Dataform:

bash

undefined

应用变更后，确认分配无重叠且与建议输出一致。对于Dataform：

bash

undefined

Check syntax

检查语法

dataform compile

Validate no duplicate assignments

验证无重复分配

grep -r ".actions" definitions/_reservations.js


For dbt and Airflow, follow the verification steps in their respective repositories.

grep -r ".actions" definitions/_reservations.js


对于dbt和Airflow，请遵循各自仓库中的验证步骤。

Decision Criteria

决策标准

Factor	Reserved Slots	On-Demand
Priority	High, SLA-bound	Low, flexible
Frequency	Regular, scheduled	Ad-hoc, occasional
Cost Pattern	Predictable usage	Variable, sporadic
Impact	Critical pipelines	Experimental, samples

影响因素	预留插槽	按需计算
优先级	高优先级、绑定SLA	低优先级、灵活调度
执行频率	定期、已调度	临时、偶尔执行
成本模式	可预测的使用量	可变、零散的使用量
业务影响	关键业务流水线	实验性、样本类任务

Key Notes

关键注意事项

Each action should appear in only ONE reservation config
File starts with
```
_
```
to ensure it runs first in Dataform queue
Changes take effect on next Dataform workflow run
Package automatically handles global assignment (no per-file edits needed)
Default mode is full auto-apply of recommendations; rely on PR review for validation
The only interactive checkpoint is reservation selection when more than one reservation matches the recommended edition

每个操作应仅出现在一个预留配置中
文件以
```
_
```
开头，确保在Dataform队列中优先运行
变更将在下次Dataform工作流运行时生效
包会自动处理全局分配（无需逐个文件编辑）
默认模式为自动应用所有建议；依赖PR评审进行验证
唯一的交互式检查点是：当多个预留插槽匹配建议版本时，需选择预留插槽

Package References

包参考

Dataform:
```
@masthead-data/dataform-package
```
dbt:
```
masthead-data/dbt-reservations
```
Airflow:
```
masthead-data/airflow-reservations
```

Dataform：
```
@masthead-data/dataform-package
```
dbt：
```
masthead-data/dbt-reservations
```
Airflow：
```
masthead-data/airflow-reservations
```