databricks-jobs


Lakeflow Jobs Development


FIRST: Use the parent `databricks` skill for CLI basics, authentication, profile selection, and data exploration commands.

Lakeflow Jobs are scheduled workflows that run notebooks, Python scripts, SQL queries, and other tasks on Databricks.

Scaffolding a New Job Project


Use `databricks bundle init` with a config file to scaffold non-interactively. This creates a project in the `<project_name>/` directory:

```bash
databricks bundle init default-python --config-file <(echo '{"project_name": "my_job", "include_job": "yes", "include_pipeline": "no", "include_python": "yes", "serverless": "yes"}') --profile <PROFILE> < /dev/null
```

- `project_name`: letters, numbers, and underscores only

After scaffolding, create `CLAUDE.md` and `AGENTS.md` in the project directory. These files are essential for giving agents guidance on how to work with the project. Use this content:

Databricks Asset Bundles Project


This project uses Databricks Asset Bundles for deployment.

Prerequisites


Install the Databricks CLI (>= v0.288.0) if not already installed:

- macOS: `brew tap databricks/tap && brew install databricks`
- Linux: `curl -fsSL https://raw.githubusercontent.com/databricks/setup-cli/main/install.sh | sh`
- Windows: `winget install Databricks.DatabricksCLI`

Verify: `databricks -v`

For AI Agents


Read the `databricks` skill for CLI basics, authentication, and the deployment workflow. Read the `databricks-jobs` skill for job-specific guidance.

If skills are not available, install them: `databricks experimental aitools skills install`

Project Structure


```
my-job-project/
├── databricks.yml              # Bundle configuration
├── resources/
│   └── my_job.job.yml          # Job definition
├── src/
│   ├── my_notebook.ipynb       # Notebook tasks
│   └── my_module/              # Python wheel package
│       ├── __init__.py
│       └── main.py
├── tests/
│   └── test_main.py
└── pyproject.toml              # Python project config (if using wheels)
```

Configuring Tasks


Edit `resources/<job_name>.job.yml` to configure tasks:

```yaml
resources:
  jobs:
    my_job:
      name: my_job

      tasks:
        - task_key: my_notebook
          notebook_task:
            notebook_path: ../src/my_notebook.ipynb

        - task_key: my_python
          depends_on:
            - task_key: my_notebook
          python_wheel_task:
            package_name: my_package
            entry_point: main
```

Task types: `notebook_task`, `python_wheel_task`, `spark_python_task`, `pipeline_task`, `sql_task`
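One easy mistake in hand-edited job YAML is a `depends_on` entry that names a task that doesn't exist. A minimal sketch of a sanity check you could run over a parsed job definition (the `tasks` list mirrors the YAML above after loading it with a YAML parser; the helper name is illustrative, not part of the bundle tooling):

```python
def undefined_dependencies(tasks):
    """Return depends_on task_keys that don't match any defined task."""
    keys = {t["task_key"] for t in tasks}
    return [
        dep["task_key"]
        for t in tasks
        for dep in t.get("depends_on", [])
        if dep["task_key"] not in keys
    ]

# Mirrors the two-task job defined above.
tasks = [
    {"task_key": "my_notebook"},
    {"task_key": "my_python", "depends_on": [{"task_key": "my_notebook"}]},
]
print(undefined_dependencies(tasks))  # → []
```

Note that `databricks bundle validate` catches schema-level errors, so a check like this is only useful when generating task definitions programmatically.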

Job Parameters


Parameters defined at the job level are passed to ALL tasks (no need to repeat them per task):

```yaml
resources:
  jobs:
    my_job:
      parameters:
        - name: catalog
          default: ${var.catalog}
        - name: schema
          default: ${var.schema}
```

Access parameters in notebooks with `dbutils.widgets.get("catalog")`.
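`dbutils` exists only inside a Databricks notebook, so code that calls `dbutils.widgets.get` directly can't run under local pytest. A minimal sketch of a fallback wrapper (the helper name `get_param` and the default values are illustrative, not a Databricks API):

```python
def get_param(name: str, default: str) -> str:
    """Read a job parameter on Databricks, or fall back to a default locally."""
    try:
        return dbutils.widgets.get(name)  # noqa: F821 — defined only on Databricks
    except NameError:
        return default  # running locally, e.g. under pytest

catalog = get_param("catalog", "dev_catalog")
schema = get_param("schema", "dev_schema")
print(f"{catalog}.{schema}.my_table")  # → dev_catalog.dev_schema.my_table
```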

Writing Notebook Code


```python
# Read parameters
catalog = dbutils.widgets.get("catalog")
schema = dbutils.widgets.get("schema")

# Read tables
df = spark.read.table(f"{catalog}.{schema}.my_table")

# SQL queries
result = spark.sql(f"SELECT * FROM {catalog}.{schema}.my_table LIMIT 10")

# Write output
df.write.mode("overwrite").saveAsTable(f"{catalog}.{schema}.output_table")
```

Scheduling


```yaml
resources:
  jobs:
    my_job:
      trigger:
        periodic:
          interval: 1
          unit: DAYS
```

Or with cron:

```yaml
      schedule:
        quartz_cron_expression: "0 0 2 * * ?"
        timezone_id: "UTC"
```

Multi-Task Jobs with Dependencies


```yaml
resources:
  jobs:
    my_pipeline_job:
      tasks:
        - task_key: extract
          notebook_task:
            notebook_path: ../src/extract.ipynb

        - task_key: transform
          depends_on:
            - task_key: extract
          notebook_task:
            notebook_path: ../src/transform.ipynb

        - task_key: load
          depends_on:
            - task_key: transform
          notebook_task:
            notebook_path: ../src/load.ipynb
```
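The `depends_on` entries form a DAG: each task runs only after all of its dependencies succeed. A sketch of the resulting execution order, mapping each `task_key` above to its predecessors and topologically sorting with the standard library:

```python
from graphlib import TopologicalSorter

# task_key -> set of task_keys it depends on (mirrors the job above)
deps = {
    "extract": set(),
    "transform": {"extract"},
    "load": {"transform"},
}
print(list(TopologicalSorter(deps).static_order()))
# → ['extract', 'transform', 'load']
```

Tasks with no dependency path between them can run in parallel; only the ordering constraints in `depends_on` are enforced.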

Unit Testing


Run unit tests locally:

```bash
uv run pytest
```
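A sketch of the pattern the scaffolded `tests/test_main.py` follows: keep transformation logic in plain Python functions so pytest can exercise them without a Spark session. The function and test names here are illustrative, not part of the scaffold:

```python
def fully_qualified(catalog: str, schema: str, table: str) -> str:
    """Build a three-level Unity Catalog table name."""
    return f"{catalog}.{schema}.{table}"

def test_fully_qualified():
    assert fully_qualified("main", "default", "orders") == "main.default.orders"
```

Logic that needs a Spark session is better covered by deploying to a dev target and running the job, as described in the workflow below.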

Development Workflow


1. Validate: `databricks bundle validate --profile <profile>`
2. Deploy: `databricks bundle deploy -t dev --profile <profile>`
3. Run: `databricks bundle run <job_name> -t dev --profile <profile>`
4. Check run status: `databricks jobs get-run --run-id <id> --profile <profile>`

Documentation
