optimize-data-model-compute

Compare original and translation side by side

🇺🇸

Original

English
🇨🇳

Translation

Chinese

Optimize Orchestration Compute (BigQuery Reservations)

优化编排计算资源(BigQuery 预留插槽)

Purpose

目标

Automatically assign data models from data orchestration to BigQuery slot reservations based on compute complexity and cost optimization strategy. Routes high-compute workloads to on-demand capacities while using reservations pricing for high-traffic jobs.
根据计算复杂度和成本优化策略,自动将数据编排中的数据模型分配到BigQuery预留插槽。将高计算量工作负载路由到按需计算资源,同时为高流量作业使用预留定价。

When to Use

适用场景

  • Assigning new models/actions to appropriate compute tiers (reserved vs on-demand)
  • Rebalancing reservation assignments based on priority changes
  • Optimizing costs by moving low-priority workloads to on-demand
  • Ensuring critical pipelines get guaranteed compute resources
  • 为新模型/操作分配合适的计算层级(预留 vs 按需)
  • 根据优先级变化重新平衡预留插槽分配
  • 将低优先级工作负载转移到按需计算以优化成本
  • 确保关键流水线获得有保障的计算资源

Implementation Steps

实施步骤

Step 1: Source Configuration

步骤1:源配置

Extract and apply recommendations from Masthead insights automatically.
  1. Ask user for the BigQuery dataset where Masthead insights are stored. This dataset must contain an
    insights
    table with the same schema as
    masthead-prod.YOUR_DATASET.insights
    . If the user does not have access, direct them to request access.
    Once provided, immediately persist it by appending to the project-level agent instructions file (e.g.
    .github/copilot-instructions.md
    , or
    AGENTS.md
    — whichever already exists in the project root, or creating if none exist):
    <!-- masthead -->
    MASTHEAD_DATASET=YOUR_DATASET
    <!-- /masthead -->
    On subsequent runs, read this value from the instructions file instead of prompting the user again.
  2. Identify which orchestration technology is in use by querying available recommendations:
bash
bq query --project_id=YOUR_PROJECT --use_legacy_sql=false --format=csv \
"SELECT subtype, COUNT(*) AS recommendation_count
FROM \`masthead-prod.YOUR_DATASET.insights\`
WHERE category = 'Cost'
  AND type = 'Compute costs'
GROUP BY subtype"
If compute model recommendations are not present or all rows share the same technology, infer from the user's project structure (presence of
dbt_project.yml
→ dbt,
definitions/
folder → Dataform, Airflow DAG files → Airflow). Confirm the detected tool with the user before proceeding.
  1. Pull recommendations from
    masthead-prod.YOUR_DATASET.insights
    :
bash
bq query --project_id=YOUR_PROJECT --use_legacy_sql=false --format=csv \
"SELECT
  action_name,
  recommended_model,
  cost_30d,
  last_updated_time
FROM \`masthead-prod.YOUR_DATASET.insights\`
WHERE category = 'Cost'
  AND type = 'Compute costs'
  AND subtype = ...
ORDER BY cost_30d DESC" > compute_assignment_candidates.csv
  1. Pull recommendations from
    masthead-prod.YOUR_DATASET.insights
    :
bash
bq query --project_id=YOUR_PROJECT --use_legacy_sql=false --format=csv \
"SELECT
  action_name,
  recommended_model,
  cost_30d,
  last_updated_time
FROM \`masthead-prod.YOUR_DATASET.insights\`
WHERE category = 'Cost'
  AND type = 'Compute costs'
  AND subtype = ...
ORDER BY cost_30d DESC" > compute_assignment_candidates.csv
  1. Resolve reservation targets using
    recommended_model
    values and reservation edition metadata:
  • Verify reservation editions using
    INFORMATION_SCHEMA.RESERVATIONS
    :
bash
bq query --project_id=RESERVATION_ADMIN_PROJECT --location=US --use_legacy_sql=false --format=csv \
"SELECT
  reservation_name,
  project_id,
  edition,
  slot_capacity
FROM \`region-us\`.INFORMATION_SCHEMA.RESERVATIONS
ORDER BY project_id, reservation_name"
  • Map
    recommended_model = 'ON_DEMAND'
    to the config entry where
    reservation = 'none'
    .
  • For all other values (for example
    ENTERPRISE
    ), choose a reservation whose edition matches
    recommended_model
    .
  • If exactly one matching reservation exists, assign automatically.
  • If multiple matching reservations exist, ask the user which reservation tag to use.
  • If no matching reservation exists, ask the user to pick a fallback reservation or create a new matching reservation first.
  • Ensure an on-demand bucket exists. If missing, create one:
javascript
{
  tag: 'on_demand',
  reservation: 'none',
  actions: []
}
  1. Convert recommendations into assignment lists automatically:
bash
bq query --project_id=YOUR_PROJECT --use_legacy_sql=false --format=csv \
"SELECT
  action_name,
  recommended_model
FROM \`masthead-prod.YOUR_DATASET.insights\`
WHERE category = 'Cost'
  AND type = 'Compute costs'
  AND subtype = ...
ORDER BY cost_30d DESC" > compute_assignment_final.csv
  1. Apply all recommendations based on the detected orchestration tool:
自动提取并应用Masthead洞察中的建议。
  1. 询问用户存储Masthead洞察的BigQuery数据集。该数据集必须包含一个
    insights
    表,其架构与
    masthead-prod.YOUR_DATASET.insights
    一致。如果用户没有访问权限,引导他们申请访问
    收到数据集信息后,立即持久化存储:将其追加到项目级别的Agent说明文件中(例如
    .github/copilot-instructions.md
    AGENTS.md
    ,优先使用项目根目录已存在的文件,若不存在则创建新文件):
    <!-- masthead -->
    MASTHEAD_DATASET=YOUR_DATASET
    <!-- /masthead -->
    后续运行时,直接从该说明文件读取此值,无需再次询问用户。
  2. 通过查询可用建议,确定正在使用的编排技术:
bash
bq query --project_id=YOUR_PROJECT --use_legacy_sql=false --format=csv \
"SELECT subtype, COUNT(*) AS recommendation_count
FROM \`masthead-prod.YOUR_DATASET.insights\`
WHERE category = 'Cost'
  AND type = 'Compute costs'
GROUP BY subtype"
如果没有计算模型相关建议,或所有行使用相同技术,则从项目结构推断(存在
dbt_project.yml
→ dbt,存在
definitions/
文件夹 → Dataform,存在Airflow DAG文件 → Airflow)。在继续之前,需与用户确认检测到的工具。
  1. masthead-prod.YOUR_DATASET.insights
    提取建议:
bash
bq query --project_id=YOUR_PROJECT --use_legacy_sql=false --format=csv \
"SELECT
  action_name,
  recommended_model,
  cost_30d,
  last_updated_time
FROM \`masthead-prod.YOUR_DATASET.insights\`
WHERE category = 'Cost'
  AND type = 'Compute costs'
  AND subtype = ...
ORDER BY cost_30d DESC" > compute_assignment_candidates.csv
  1. masthead-prod.YOUR_DATASET.insights
    提取建议:
bash
bq query --project_id=YOUR_PROJECT --use_legacy_sql=false --format=csv \
"SELECT
  action_name,
  recommended_model,
  cost_30d,
  last_updated_time
FROM \`masthead-prod.YOUR_DATASET.insights\`
WHERE category = 'Cost'
  AND type = 'Compute costs'
  AND subtype = ...
ORDER BY cost_30d DESC" > compute_assignment_candidates.csv
  1. 使用
    recommended_model
    值和预留版本元数据确定预留目标:
  • 使用
    INFORMATION_SCHEMA.RESERVATIONS
    验证预留版本:
bash
bq query --project_id=RESERVATION_ADMIN_PROJECT --location=US --use_legacy_sql=false --format=csv \
"SELECT
  reservation_name,
  project_id,
  edition,
  slot_capacity
FROM \`region-us\`.INFORMATION_SCHEMA.RESERVATIONS
ORDER BY project_id, reservation_name"
  • recommended_model = 'ON_DEMAND'
    映射到
    reservation = 'none'
    的配置项。
  • 对于其他值(例如
    ENTERPRISE
    ),选择版本
    recommended_model
    匹配的预留插槽。
  • 如果恰好存在一个匹配的预留插槽,自动分配。
  • 如果存在多个匹配的预留插槽,询问用户使用哪个预留标签。
  • 如果没有匹配的预留插槽,询问用户选择一个 fallback 预留插槽或先创建一个新的匹配预留插槽。
  • 确保按需计算桶存在。如果缺失,创建一个:
javascript
{
  tag: 'on_demand',
  reservation: 'none',
  actions: []
}
  1. 将建议自动转换为分配列表:
bash
bq query --project_id=YOUR_PROJECT --use_legacy_sql=false --format=csv \
"SELECT
  action_name,
  recommended_model
FROM \`masthead-prod.YOUR_DATASET.insights\`
WHERE category = 'Cost'
  AND type = 'Compute costs'
  AND subtype = ...
ORDER BY cost_30d DESC" > compute_assignment_final.csv
  1. 根据检测到的编排工具应用所有建议:

Dataform

Dataform

  • Open
    definitions/_reservations.js
    .
  • Replace
    on_demand
    actions
    with all rows where
    recommended_model = 'ON_DEMAND'
    .
  • Replace reserved reservation
    actions
    with all rows where
    recommended_model != 'ON_DEMAND'
    .
  • Remove duplicates; keep only actions present in this repo's Dataform graph.
  • Verify:
    dataform compile
    and check for duplicate assignments.
  • See package reference:
    @masthead-data/dataform-package
  • 打开
    definitions/_reservations.js
  • on_demand
    actions
    替换为所有
    recommended_model = 'ON_DEMAND'
    的行。
  • 将预留插槽的
    actions
    替换为所有
    recommended_model != 'ON_DEMAND'
    的行。
  • 移除重复项;仅保留此仓库Dataform图中存在的操作。
  • 验证:执行
    dataform compile
    并检查是否存在重复分配。
  • 参考包:
    @masthead-data/dataform-package

dbt

dbt

  • Follow the reservation assignment workflow from
    masthead-data/dbt-reservations
    .
  • Map recommendations to the appropriate dbt model tags or selector targets.
  • Update the relevant
    dbt_project.yml
    or profile configuration per the repo's instructions.
  • 遵循
    masthead-data/dbt-reservations
    中的预留分配流程。
  • 将建议映射到相应的dbt模型标签或选择器目标。
  • 根据该仓库的说明更新相关的
    dbt_project.yml
    或配置文件。

Airflow

Airflow

  • Follow the reservation assignment workflow from
    masthead-data/airflow-reservations
    .
  • Map recommendations to DAG or task-level BigQuery reservation labels.
  • Update the relevant operator configuration per the repo's instructions.
  • 遵循
    masthead-data/airflow-reservations
    中的预留分配流程。
  • 将建议映射到DAG或任务级别的BigQuery预留标签。
  • 根据该仓库的说明更新相关的算子配置。

Step 2: Verify Changes

步骤2:验证变更

After applying, confirm assignments are non-overlapping and align with the recommendation output. For Dataform:
bash
undefined
应用变更后,确认分配无重叠且与建议输出一致。对于Dataform:
bash
undefined

Check syntax

检查语法

dataform compile
dataform compile

Validate no duplicate assignments

验证无重复分配

grep -r ".actions" definitions/_reservations.js

For dbt and Airflow, follow the verification steps in their respective repositories.
grep -r ".actions" definitions/_reservations.js

对于dbt和Airflow,请遵循各自仓库中的验证步骤。

Decision Criteria

决策标准

FactorReserved SlotsOn-Demand
PriorityHigh, SLA-boundLow, flexible
FrequencyRegular, scheduledAd-hoc, occasional
Cost PatternPredictable usageVariable, sporadic
ImpactCritical pipelinesExperimental, samples
影响因素预留插槽按需计算
优先级高优先级、绑定SLA低优先级、灵活调度
执行频率定期、已调度临时、偶尔执行
成本模式可预测的使用量可变、零散的使用量
业务影响关键业务流水线实验性、样本类任务

Key Notes

关键注意事项

  • Each action should appear in only ONE reservation config
  • File starts with
    _
    to ensure it runs first in Dataform queue
  • Changes take effect on next Dataform workflow run
  • Package automatically handles global assignment (no per-file edits needed)
  • Default mode is full auto-apply of recommendations; rely on PR review for validation
  • The only interactive checkpoint is reservation selection when more than one reservation matches the recommended edition
  • 每个操作应仅出现在一个预留配置中
  • 文件以
    _
    开头,确保在Dataform队列中优先运行
  • 变更将在下次Dataform工作流运行时生效
  • 包会自动处理全局分配(无需逐个文件编辑)
  • 默认模式为自动应用所有建议;依赖PR评审进行验证
  • 唯一的交互式检查点是:当多个预留插槽匹配建议版本时,需选择预留插槽

Package References

包参考