using-dbt-for-analytics-engineering

Compare original and translation side by side

🇺🇸

Original

English
🇨🇳

Translation

Chinese

Using dbt for Analytics Engineering

使用dbt进行分析工程

Core principle: Apply software engineering discipline (DRY, modularity, testing) to data transformation work through dbt's abstraction layer.
核心原则: 通过dbt的抽象层,将软件工程准则(DRY、模块化、测试)应用于数据转换工作。

When to Use

适用场景

  • Building new dbt models, sources, or tests
  • Modifying existing model logic or configurations
  • Refactoring a dbt project structure
  • Creating analytics pipelines or data transformations
  • Working with warehouse data that needs modeling
Do NOT use for:
  • Querying the semantic layer (use the
    answering-natural-language-questions-with-dbt
    skill)
  • 构建新的dbt模型、数据源或测试
  • 修改现有模型的逻辑或配置
  • 重构dbt项目结构
  • 创建分析管道或数据转换流程
  • 处理需要建模的数仓数据
不适用场景:
  • 查询语义层(请使用
    answering-natural-language-questions-with-dbt
    技能)

Reference Guides

参考指南

This skill includes detailed reference guides for specific techniques. Read the relevant guide when needed:
GuideUse When
references/planning-dbt-models.mdBuilding new models - work backwards from desired output and use
dbt show
to validate results
references/discovering-data.mdExploring unfamiliar sources or onboarding to a project
references/writing-data-tests.mdAdding tests - prioritize high-value tests over exhaustive coverage
references/debugging-dbt-errors.mdFixing project parsing, compilation, or database errors
references/evaluating-impact-of-a-dbt-model-change.mdAssessing downstream effects before modifying models
references/writing-documentation.mdWrite documentation that doesn't just restate the column name
references/managing-packages.mdInstalling and managing dbt packages
本技能包含针对特定技术的详细参考指南。按需阅读相关指南:
指南适用场景
references/planning-dbt-models.md构建新模型——从期望的输出倒推,并使用
dbt show
验证结果
references/discovering-data.md探索不熟悉的数据源或参与新项目
references/writing-data-tests.md添加测试——优先覆盖高价值场景,而非追求全面覆盖
references/debugging-dbt-errors.md修复项目解析、编译或数据库错误
references/evaluating-impact-of-a-dbt-model-change.md修改模型前评估下游影响
references/writing-documentation.md编写不只是重复列名的文档
references/managing-packages.md安装和管理dbt包

DAG building guidelines

DAG构建准则

  • Conform to the existing style of a project (medallion layers, stage/intermediate/mart, etc)
  • Focus heavily on DRY principles.
    • Before adding a new model or column, always be sure that the same logic isn't already defined elsewhere that can be used.
    • Prefer a change that requires you to add one column to an existing intermediate model over adding an entire additional model to the project.
When users request new models: Always ask "why a new model vs extending existing?" before proceeding. Legitimate reasons exist (different grain, precalculation for performance), but users often request new models out of habit. Your job is to surface the tradeoff, not blindly comply.
  • 遵循项目现有风格(如medallion分层、stage/intermediate/mart结构等)
  • 严格遵循DRY原则
    • 添加新模型或列之前,务必确认相同逻辑未在其他地方定义
    • 优先选择在现有中间模型中添加一列,而非在项目中新增完整模型
当用户请求新模型时: 开始前务必询问“为什么需要新模型而非扩展现有模型?”。确实存在合理理由(如不同粒度、性能预计算),但用户常因习惯提出请求。你的职责是明确权衡利弊,而非盲目执行。

Model building guidelines

模型构建准则

  • Always use data modelling best practices when working in a project
  • Follow dbt best practices in code:
    • Always use
      {{ ref }}
      and
      {{ source }}
      over hardcoded table names
    • Use CTEs over subqueries
  • Before building a model, follow references/planning-dbt-models.md to plan your approach.
  • Before modifying or building on existing models, read their YAML documentation:
    • Find the model's YAML file (can be any
      .yml
      or
      .yaml
      file in the models directory, but normally colocated with the SQL file)
    • Check the model's
      description
      to understand its purpose
    • Read column-level
      description
      fields to understand what each column represents
    • Review any
      meta
      properties that document business logic or ownership
    • This context prevents misusing columns or duplicating existing logic
  • 在项目中工作时,始终遵循数据建模最佳实践
  • 代码中遵循dbt最佳实践:
    • 始终使用
      {{ ref }}
      {{ source }}
      ,而非硬编码表名
    • 使用CTE而非子查询
  • 构建模型前,遵循references/planning-dbt-models.md规划方法
  • 修改或基于现有模型开发前,阅读其YAML文档:
    • 找到模型的YAML文件(通常位于models目录下的
      .yml
      .yaml
      文件,一般与SQL文件同目录)
    • 查看模型的
      description
      以了解其用途
    • 阅读列级
      description
      字段,理解每列的含义
    • 查看任何记录业务逻辑或归属的
      meta
      属性
    • 这些上下文可避免误用列或重复现有逻辑

You must look at the data to be able to correctly model the data

必须查看数据才能正确建模

When implementing a model, you must use
dbt show
regularly to:
  • preview the input data you will work with, so that you use relevant columns and values
  • preview the results of your model, so that you know your work is correct
  • run basic data profiling (counts, min, max, nulls) of input and output data, to check for misconfigured joins or other logic errors
实现模型时,必须定期使用
dbt show
  • 预览要处理的输入数据,确保使用相关列和值
  • 预览模型结果,确认工作正确性
  • 对输入和输出数据进行基础数据探查(计数、最小值、最大值、空值),检查连接配置错误或其他逻辑问题

Cost management best practices

成本管理最佳实践

  • Use
    --limit
    with
    dbt show
    and insert limits early into CTEs when exploring data
  • Use deferral (
    --defer --state path/to/prod/artifacts
    ) to reuse production objects
  • Use
    dbt clone
    to produce zero-copy clones
  • Avoid large unpartitioned table scans in BigQuery
  • Always use
    --select
    instead of running the entire project
  • 探索数据时,在
    dbt show
    中使用
    --limit
    ,并在CTE早期添加限制条件
  • 使用延迟加载(
    --defer --state path/to/prod/artifacts
    )复用生产对象
  • 使用
    dbt clone
    生成零拷贝克隆
  • 在BigQuery中避免大型未分区表扫描
  • 始终使用
    --select
    而非运行整个项目

Interacting with the CLI

与CLI交互

  • You will be working in a terminal environment where you have access to the dbt CLI, and potentially the dbt MCP server. The MCP server may include access to the dbt Cloud platform's APIs if relevant.
  • You should prefer working with the dbt MCP server's tools, and help the user install and onboard the MCP when appropriate.
  • 你将在终端环境中工作,可访问dbt CLI,可能还有dbt MCP服务器。MCP服务器可能包含对dbt Cloud平台API的访问权限(如相关)
  • 优先使用dbt MCP服务器的工具,并在适当时帮助用户安装和接入MCP

Common Mistakes

常见错误

MistakeWhy It's WrongFix
One-shotting modelsData work requires validation; schemas are unknownFollow references/planning-dbt-models.md, iterate with
dbt show
Not working iterativelyChanges to multiple models at once makes it hard to debugRun
dbt build --select changed_model
on each model after modifying it
Assuming schema knowledgeColumn names, types, and values vary across warehousesFollow references/discovering-data.md before writing SQL
Not reading existing model documentationColumn names don't reveal business meaningRead YAML descriptions before modifying models
Creating unnecessary modelsWarehouse compute has real costsExtend existing models when possible
Hardcoding table namesBreaks dbt's dependency graphAlways use
{{ ref() }}
and
{{ source() }}
Global config changesConfiguration cascades unexpectedlyChange surgically, match existing patterns
Running DDL directlyBypasses dbt's abstraction and trackingUse dbt commands exclusively
错误原因解决方法
一次性构建模型数据工作需要验证, schema未知遵循references/planning-dbt-models.md,使用
dbt show
迭代开发
非迭代开发同时修改多个模型会增加调试难度修改后对每个模型运行
dbt build --select changed_model
假设已知schema不同数仓的列名、类型和值存在差异编写SQL前遵循references/discovering-data.md
未阅读现有模型文档列名无法体现业务含义修改模型前阅读YAML描述
创建不必要的模型数仓计算有实际成本尽可能扩展现有模型
硬编码表名破坏dbt的依赖关系图始终使用
{{ ref() }}
{{ source() }}
全局配置变更配置会意外级联影响精准修改,匹配现有模式
直接运行DDL绕过dbt的抽象和跟踪仅使用dbt命令

Rationalizations to Resist

需要抵制的合理化借口

ExcuseReality
"User explicitly asked for a new model"Users request out of habit. Ask why before complying.
"I've done this pattern hundreds of times"This project's schema may differ. Verify with
dbt show
.
"User is senior / knows what they're doing"Seniority doesn't change compute costs. Surface tradeoffs.
"It's just a small change"Small changes compound. Follow DRY principles.
借口实际情况
“用户明确要求新模型”用户常因习惯提出请求,执行前先询问原因
“我已经做过几百次这个模式”不同项目的schema可能不同,需用
dbt show
验证
“用户是资深人员/知道自己在做什么”资历不改变计算成本,需明确权衡利弊
“只是一个小改动”小改动会累积,遵循DRY原则

Red Flags - STOP and Reconsider

危险信号——停止并重新考虑

  • About to write SQL without checking actual column names
  • Modifying a model without reading its YAML documentation first
  • Creating a new model when a column addition would suffice
  • User gave table names as "the usual columns" - verify anyway
  • Skipping
    dbt show
    validation because "it's straightforward"
  • Running DDL or queries directly against the warehouse
  • 未检查实际列名就准备编写SQL
  • 未阅读YAML文档就修改模型
  • 可通过添加列实现却要创建新模型
  • 用户称表名是“常用列”——仍需验证
  • 因“逻辑简单”跳过
    dbt show
    验证
  • 直接对数仓运行DDL或查询