optimize-data-model-compute
Compare original and translation side by side
🇺🇸
Original
English🇨🇳
Translation
ChineseOptimize Orchestration Compute (BigQuery Reservations)
优化编排计算资源(BigQuery 预留插槽)
Purpose
目标
Automatically assign data models from data orchestration to BigQuery slot reservations based on compute complexity and cost optimization strategy. Routes high-compute workloads to on-demand capacities while using reservations pricing for high-traffic jobs.
根据计算复杂度和成本优化策略,自动将数据编排中的数据模型分配到BigQuery预留插槽。将高计算量工作负载路由到按需计算资源,同时为高流量作业使用预留定价。
When to Use
适用场景
- Assigning new models/actions to appropriate compute tiers (reserved vs on-demand)
- Rebalancing reservation assignments based on priority changes
- Optimizing costs by moving low-priority workloads to on-demand
- Ensuring critical pipelines get guaranteed compute resources
- 为新模型/操作分配合适的计算层级(预留 vs 按需)
- 根据优先级变化重新平衡预留插槽分配
- 将低优先级工作负载转移到按需计算以优化成本
- 确保关键流水线获得有保障的计算资源
Implementation Steps
实施步骤
Step 1: Source Configuration
步骤1:源配置
Extract and apply recommendations from Masthead insights automatically.
-
Ask user for the BigQuery dataset where Masthead insights are stored. This dataset must contain antable with the same schema as
insights. If the user does not have access, direct them to request access.masthead-prod.YOUR_DATASET.insightsOnce provided, immediately persist it by appending to the project-level agent instructions file (e.g., or.github/copilot-instructions.md— whichever already exists in the project root, or creating if none exist):AGENTS.md<!-- masthead --> MASTHEAD_DATASET=YOUR_DATASET <!-- /masthead -->On subsequent runs, read this value from the instructions file instead of prompting the user again. -
Identify which orchestration technology is in use by querying available recommendations:
bash
bq query --project_id=YOUR_PROJECT --use_legacy_sql=false --format=csv \
"SELECT subtype, COUNT(*) AS recommendation_count
FROM \`masthead-prod.YOUR_DATASET.insights\`
WHERE category = 'Cost'
AND type = 'Compute costs'
GROUP BY subtype"If compute model recommendations are not present or all rows share the same technology, infer from the user's project structure (presence of → dbt, folder → Dataform, Airflow DAG files → Airflow). Confirm the detected tool with the user before proceeding.
dbt_project.ymldefinitions/- Pull recommendations from :
masthead-prod.YOUR_DATASET.insights
bash
bq query --project_id=YOUR_PROJECT --use_legacy_sql=false --format=csv \
"SELECT
action_name,
recommended_model,
cost_30d,
last_updated_time
FROM \`masthead-prod.YOUR_DATASET.insights\`
WHERE category = 'Cost'
AND type = 'Compute costs'
AND subtype = ...
ORDER BY cost_30d DESC" > compute_assignment_candidates.csv- Pull recommendations from :
masthead-prod.YOUR_DATASET.insights
bash
bq query --project_id=YOUR_PROJECT --use_legacy_sql=false --format=csv \
"SELECT
action_name,
recommended_model,
cost_30d,
last_updated_time
FROM \`masthead-prod.YOUR_DATASET.insights\`
WHERE category = 'Cost'
AND type = 'Compute costs'
AND subtype = ...
ORDER BY cost_30d DESC" > compute_assignment_candidates.csv- Resolve reservation targets using values and reservation edition metadata:
recommended_model
- Verify reservation editions using :
INFORMATION_SCHEMA.RESERVATIONS
bash
bq query --project_id=RESERVATION_ADMIN_PROJECT --location=US --use_legacy_sql=false --format=csv \
"SELECT
reservation_name,
project_id,
edition,
slot_capacity
FROM \`region-us\`.INFORMATION_SCHEMA.RESERVATIONS
ORDER BY project_id, reservation_name"- Map to the config entry where
recommended_model = 'ON_DEMAND'.reservation = 'none' - For all other values (for example ), choose a reservation whose edition matches
ENTERPRISE.recommended_model - If exactly one matching reservation exists, assign automatically.
- If multiple matching reservations exist, ask the user which reservation tag to use.
- If no matching reservation exists, ask the user to pick a fallback reservation or create a new matching reservation first.
- Ensure an on-demand bucket exists. If missing, create one:
javascript
{
tag: 'on_demand',
reservation: 'none',
actions: []
}- Convert recommendations into assignment lists automatically:
bash
bq query --project_id=YOUR_PROJECT --use_legacy_sql=false --format=csv \
"SELECT
action_name,
recommended_model
FROM \`masthead-prod.YOUR_DATASET.insights\`
WHERE category = 'Cost'
AND type = 'Compute costs'
AND subtype = ...
ORDER BY cost_30d DESC" > compute_assignment_final.csv- Apply all recommendations based on the detected orchestration tool:
自动提取并应用Masthead洞察中的建议。
-
询问用户存储Masthead洞察的BigQuery数据集。该数据集必须包含一个表,其架构与
insights一致。如果用户没有访问权限,引导他们申请访问。masthead-prod.YOUR_DATASET.insights收到数据集信息后,立即持久化存储:将其追加到项目级别的Agent说明文件中(例如或.github/copilot-instructions.md,优先使用项目根目录已存在的文件,若不存在则创建新文件):AGENTS.md<!-- masthead --> MASTHEAD_DATASET=YOUR_DATASET <!-- /masthead -->后续运行时,直接从该说明文件读取此值,无需再次询问用户。 -
通过查询可用建议,确定正在使用的编排技术:
bash
bq query --project_id=YOUR_PROJECT --use_legacy_sql=false --format=csv \
"SELECT subtype, COUNT(*) AS recommendation_count
FROM \`masthead-prod.YOUR_DATASET.insights\`
WHERE category = 'Cost'
AND type = 'Compute costs'
GROUP BY subtype"如果没有计算模型相关建议,或所有行使用相同技术,则从项目结构推断(存在 → dbt,存在文件夹 → Dataform,存在Airflow DAG文件 → Airflow)。在继续之前,需与用户确认检测到的工具。
dbt_project.ymldefinitions/- 从提取建议:
masthead-prod.YOUR_DATASET.insights
bash
bq query --project_id=YOUR_PROJECT --use_legacy_sql=false --format=csv \
"SELECT
action_name,
recommended_model,
cost_30d,
last_updated_time
FROM \`masthead-prod.YOUR_DATASET.insights\`
WHERE category = 'Cost'
AND type = 'Compute costs'
AND subtype = ...
ORDER BY cost_30d DESC" > compute_assignment_candidates.csv- 从提取建议:
masthead-prod.YOUR_DATASET.insights
bash
bq query --project_id=YOUR_PROJECT --use_legacy_sql=false --format=csv \
"SELECT
action_name,
recommended_model,
cost_30d,
last_updated_time
FROM \`masthead-prod.YOUR_DATASET.insights\`
WHERE category = 'Cost'
AND type = 'Compute costs'
AND subtype = ...
ORDER BY cost_30d DESC" > compute_assignment_candidates.csv- 使用值和预留版本元数据确定预留目标:
recommended_model
- 使用验证预留版本:
INFORMATION_SCHEMA.RESERVATIONS
bash
bq query --project_id=RESERVATION_ADMIN_PROJECT --location=US --use_legacy_sql=false --format=csv \
"SELECT
reservation_name,
project_id,
edition,
slot_capacity
FROM \`region-us\`.INFORMATION_SCHEMA.RESERVATIONS
ORDER BY project_id, reservation_name"- 将映射到
recommended_model = 'ON_DEMAND'的配置项。reservation = 'none' - 对于其他值(例如),选择版本与
ENTERPRISE匹配的预留插槽。recommended_model - 如果恰好存在一个匹配的预留插槽,自动分配。
- 如果存在多个匹配的预留插槽,询问用户使用哪个预留标签。
- 如果没有匹配的预留插槽,询问用户选择一个 fallback 预留插槽或先创建一个新的匹配预留插槽。
- 确保按需计算桶存在。如果缺失,创建一个:
javascript
{
tag: 'on_demand',
reservation: 'none',
actions: []
}- 将建议自动转换为分配列表:
bash
bq query --project_id=YOUR_PROJECT --use_legacy_sql=false --format=csv \
"SELECT
action_name,
recommended_model
FROM \`masthead-prod.YOUR_DATASET.insights\`
WHERE category = 'Cost'
AND type = 'Compute costs'
AND subtype = ...
ORDER BY cost_30d DESC" > compute_assignment_final.csv- 根据检测到的编排工具应用所有建议:
Dataform
Dataform
- Open .
definitions/_reservations.js - Replace
on_demandwith all rows whereactions.recommended_model = 'ON_DEMAND' - Replace reserved reservation with all rows where
actions.recommended_model != 'ON_DEMAND' - Remove duplicates; keep only actions present in this repo's Dataform graph.
- Verify: and check for duplicate assignments.
dataform compile - See package reference:
@masthead-data/dataform-package
- 打开。
definitions/_reservations.js - 将的
on_demand替换为所有actions的行。recommended_model = 'ON_DEMAND' - 将预留插槽的替换为所有
actions的行。recommended_model != 'ON_DEMAND' - 移除重复项;仅保留此仓库Dataform图中存在的操作。
- 验证:执行并检查是否存在重复分配。
dataform compile - 参考包:
@masthead-data/dataform-package
dbt
dbt
- Follow the reservation assignment workflow from .
masthead-data/dbt-reservations - Map recommendations to the appropriate dbt model tags or selector targets.
- Update the relevant or profile configuration per the repo's instructions.
dbt_project.yml
- 遵循中的预留分配流程。
masthead-data/dbt-reservations - 将建议映射到相应的dbt模型标签或选择器目标。
- 根据该仓库的说明更新相关的或配置文件。
dbt_project.yml
Airflow
Airflow
- Follow the reservation assignment workflow from .
masthead-data/airflow-reservations - Map recommendations to DAG or task-level BigQuery reservation labels.
- Update the relevant operator configuration per the repo's instructions.
- 遵循中的预留分配流程。
masthead-data/airflow-reservations - 将建议映射到DAG或任务级别的BigQuery预留标签。
- 根据该仓库的说明更新相关的算子配置。
Step 2: Verify Changes
步骤2:验证变更
After applying, confirm assignments are non-overlapping and align with the recommendation output. For Dataform:
bash
undefined应用变更后,确认分配无重叠且与建议输出一致。对于Dataform:
bash
undefinedCheck syntax
检查语法
dataform compile
dataform compile
Validate no duplicate assignments
验证无重复分配
grep -r ".actions" definitions/_reservations.js
For dbt and Airflow, follow the verification steps in their respective repositories.grep -r ".actions" definitions/_reservations.js
对于dbt和Airflow,请遵循各自仓库中的验证步骤。Decision Criteria
决策标准
| Factor | Reserved Slots | On-Demand |
|---|---|---|
| Priority | High, SLA-bound | Low, flexible |
| Frequency | Regular, scheduled | Ad-hoc, occasional |
| Cost Pattern | Predictable usage | Variable, sporadic |
| Impact | Critical pipelines | Experimental, samples |
| 影响因素 | 预留插槽 | 按需计算 |
|---|---|---|
| 优先级 | 高优先级、绑定SLA | 低优先级、灵活调度 |
| 执行频率 | 定期、已调度 | 临时、偶尔执行 |
| 成本模式 | 可预测的使用量 | 可变、零散的使用量 |
| 业务影响 | 关键业务流水线 | 实验性、样本类任务 |
Key Notes
关键注意事项
- Each action should appear in only ONE reservation config
- File starts with to ensure it runs first in Dataform queue
_ - Changes take effect on next Dataform workflow run
- Package automatically handles global assignment (no per-file edits needed)
- Default mode is full auto-apply of recommendations; rely on PR review for validation
- The only interactive checkpoint is reservation selection when more than one reservation matches the recommended edition
- 每个操作应仅出现在一个预留配置中
- 文件以开头,确保在Dataform队列中优先运行
_ - 变更将在下次Dataform工作流运行时生效
- 包会自动处理全局分配(无需逐个文件编辑)
- 默认模式为自动应用所有建议;依赖PR评审进行验证
- 唯一的交互式检查点是:当多个预留插槽匹配建议版本时,需选择预留插槽
Package References
包参考
- Dataform:
@masthead-data/dataform-package - dbt:
masthead-data/dbt-reservations - Airflow:
masthead-data/airflow-reservations