# Databricks Lakeflow Jobs
## Overview
Databricks Jobs orchestrate data workflows with multi-task DAGs, flexible triggers, and comprehensive monitoring. Jobs support diverse task types and can be managed via Python SDK, CLI, or Asset Bundles.
## Reference Files
| Use Case | Reference File |
|---|---|
| Configure task types (notebook, Python, SQL, dbt, etc.) | task-types.md |
| Set up triggers and schedules | triggers-schedules.md |
| Configure notifications and health monitoring | notifications-monitoring.md |
| Complete working examples | examples.md |
## Quick Start
### Python SDK
```python
from databricks.sdk import WorkspaceClient
from databricks.sdk.service.jobs import Task, NotebookTask, Source

w = WorkspaceClient()
job = w.jobs.create(
    name="my-etl-job",
    tasks=[
        Task(
            task_key="extract",
            notebook_task=NotebookTask(
                notebook_path="/Workspace/Users/user@example.com/extract",
                source=Source.WORKSPACE,
            ),
        ),
    ],
)
print(f"Created job: {job.job_id}")
```

### CLI
```bash
databricks jobs create --json '{
  "name": "my-etl-job",
  "tasks": [{
    "task_key": "extract",
    "notebook_task": {
      "notebook_path": "/Workspace/Users/user@example.com/extract",
      "source": "WORKSPACE"
    }
  }]
}'
```

### Asset Bundles (DABs)
```yaml
# resources/jobs.yml
resources:
  jobs:
    my_etl_job:
      name: "[${bundle.target}] My ETL Job"
      tasks:
        - task_key: extract
          notebook_task:
            notebook_path: ../src/notebooks/extract.py
```

## Core Concepts
### Multi-Task Workflows
Jobs support DAG-based task dependencies:

```yaml
tasks:
  - task_key: extract
    notebook_task:
      notebook_path: ../src/extract.py
  - task_key: transform
    depends_on:
      - task_key: extract
    notebook_task:
      notebook_path: ../src/transform.py
  - task_key: load
    depends_on:
      - task_key: transform
    run_if: ALL_SUCCESS  # Only run if all dependencies succeed
    notebook_task:
      notebook_path: ../src/load.py
```

`run_if` conditions:
- `ALL_SUCCESS` (default) - Run when all dependencies succeed
- `ALL_DONE` - Run when all dependencies complete (success or failure)
- `AT_LEAST_ONE_SUCCESS` - Run when at least one dependency succeeds
- `NONE_FAILED` - Run when no dependencies failed
- `ALL_FAILED` - Run when all dependencies failed
- `AT_LEAST_ONE_FAILED` - Run when at least one dependency failed
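The semantics of these conditions can be sketched as a small helper. This is a hypothetical illustration only (the real evaluation happens inside the Jobs scheduler), with `dep_states` standing in for the terminal states of a task's dependencies:

```python
# Hypothetical sketch of run_if semantics -- not part of the Databricks SDK.
# dep_states holds each dependency's terminal state, e.g. "SUCCESS" or "FAILED".
def should_run(run_if: str, dep_states: list[str]) -> bool:
    succeeded = [s == "SUCCESS" for s in dep_states]
    failed = [s == "FAILED" for s in dep_states]
    conditions = {
        "ALL_SUCCESS": all(succeeded),
        "ALL_DONE": True,  # by this point every dependency has completed
        "AT_LEAST_ONE_SUCCESS": any(succeeded),
        "NONE_FAILED": not any(failed),
        "ALL_FAILED": all(failed),
        "AT_LEAST_ONE_FAILED": any(failed),
    }
    return conditions[run_if]
```

For example, a cleanup task with `run_if: ALL_DONE` always runs once its dependencies finish, regardless of their outcome.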
### Task Types Summary
| Task Type | Use Case | Reference |
|---|---|---|
| `notebook_task` | Run notebooks | task-types.md#notebook-task |
| `spark_python_task` | Run Python scripts | task-types.md#spark-python-task |
| `python_wheel_task` | Run Python wheels | task-types.md#python-wheel-task |
| `sql_task` | Run SQL queries/files | task-types.md#sql-task |
| `dbt_task` | Run dbt projects | task-types.md#dbt-task |
| `pipeline_task` | Trigger DLT/SDP pipelines | task-types.md#pipeline-task |
| `spark_jar_task` | Run Spark JARs | task-types.md#spark-jar-task |
| `run_job_task` | Trigger other jobs | task-types.md#run-job-task |
| `for_each_task` | Loop over inputs | task-types.md#for-each-task |
### Trigger Types Summary
| Trigger Type | Use Case | Reference |
|---|---|---|
| `schedule` | Cron-based scheduling | triggers-schedules.md#cron-schedule |
| `trigger.periodic` | Interval-based | triggers-schedules.md#periodic-trigger |
| `trigger.file_arrival` | File arrival events | triggers-schedules.md#file-arrival-trigger |
| `trigger.table_update` | Table change events | triggers-schedules.md#table-update-trigger |
| `continuous` | Always-running jobs | triggers-schedules.md#continuous-jobs |
## Compute Configuration
### Job Clusters (Recommended)
Define reusable cluster configurations:

```yaml
job_clusters:
  - job_cluster_key: shared_cluster
    new_cluster:
      spark_version: "15.4.x-scala2.12"
      node_type_id: "i3.xlarge"
      num_workers: 2
      spark_conf:
        spark.speculation: "true"

tasks:
  - task_key: my_task
    job_cluster_key: shared_cluster
    notebook_task:
      notebook_path: ../src/notebook.py
```

### Autoscaling Clusters
```yaml
new_cluster:
  spark_version: "15.4.x-scala2.12"
  node_type_id: "i3.xlarge"
  autoscale:
    min_workers: 2
    max_workers: 8
```

### Existing Cluster
```yaml
tasks:
  - task_key: my_task
    existing_cluster_id: "0123-456789-abcdef12"
    notebook_task:
      notebook_path: ../src/notebook.py
```

### Serverless Compute
For notebook and Python tasks, omit the cluster configuration to use serverless compute:

```yaml
tasks:
  - task_key: serverless_task
    notebook_task:
      notebook_path: ../src/notebook.py
    # No cluster config = serverless
```

## Job Parameters
### Define Parameters
```yaml
parameters:
  - name: env
    default: "dev"
  - name: date
    default: "{{start_date}}"  # Dynamic value reference
```

### Access in Notebook
```python
# In notebook
dbutils.widgets.get("env")
dbutils.widgets.get("date")
```

### Pass to Tasks
```yaml
tasks:
  - task_key: my_task
    notebook_task:
      notebook_path: ../src/notebook.py
      base_parameters:
        env: "{{job.parameters.env}}"
        custom_param: "value"
```
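References like `{{job.parameters.env}}` are substituted by the Jobs service before the task runs. A toy resolver (hypothetical, for illustration only — the real service supports many more dynamic values) shows the substitution model:

```python
import re

# Toy resolver for {{...}} references -- illustration only; actual substitution
# is performed server-side by the Jobs service.
def resolve(template: str, values: dict) -> str:
    """Replace {{name}} with values[name]; unknown references are left as-is."""
    return re.sub(r"\{\{(.+?)\}\}", lambda m: values.get(m.group(1), m.group(0)), template)

resolve("env={{job.parameters.env}}", {"job.parameters.env": "prod"})  # -> "env=prod"
```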
## Common Operations
### Python SDK Operations
```python
from databricks.sdk import WorkspaceClient

w = WorkspaceClient()

# List jobs
jobs = w.jobs.list()

# Get job details
job = w.jobs.get(job_id=12345)

# Run job now
run = w.jobs.run_now(job_id=12345)

# Run with parameters
run = w.jobs.run_now(
    job_id=12345,
    job_parameters={"env": "prod", "date": "2024-01-15"},
)

# Cancel run
w.jobs.cancel_run(run_id=run.run_id)

# Delete job
w.jobs.delete(job_id=12345)
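In the Python SDK, `run_now` returns a waiter, so `w.jobs.run_now(job_id=12345).result()` blocks until the run finishes. If you need custom polling (for example, to emit progress), the loop looks roughly like this — a sketch, with `get_state` standing in for a call such as fetching the run's `life_cycle_state` via `w.jobs.get_run(...)`:

```python
import time

# Terminal life-cycle states for a job run (assumption based on the Jobs API).
TERMINAL_STATES = {"TERMINATED", "SKIPPED", "INTERNAL_ERROR"}

def wait_for_terminal(get_state, timeout_s: float = 600, interval_s: float = 5) -> str:
    """Poll get_state() until the run reaches a terminal state or timeout_s elapses."""
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        state = get_state()
        if state in TERMINAL_STATES:
            return state
        time.sleep(interval_s)
    raise TimeoutError("run did not reach a terminal state in time")
```

Prefer the SDK's built-in waiter when you only need to block; a hand-rolled loop is mainly useful for custom logging or cancellation logic.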
### CLI Operations
```bash
# List jobs
databricks jobs list

# Get job details
databricks jobs get 12345

# Run job
databricks jobs run-now 12345

# Run with parameters
databricks jobs run-now 12345 --job-params '{"env": "prod"}'

# Cancel run
databricks jobs cancel-run 67890

# Delete job
databricks jobs delete 12345
```

### Asset Bundle Operations
```bash
# Validate configuration
databricks bundle validate

# Deploy job
databricks bundle deploy

# Run job
databricks bundle run my_job_resource_key

# Deploy to specific target
databricks bundle deploy -t prod

# Destroy resources
databricks bundle destroy
```

## Permissions (DABs)
```yaml
resources:
  jobs:
    my_job:
      name: "My Job"
      permissions:
        - level: CAN_VIEW
          group_name: "data-analysts"
        - level: CAN_MANAGE_RUN
          group_name: "data-engineers"
        - level: CAN_MANAGE
          user_name: "admin@example.com"
```

Permission levels:
- `CAN_VIEW` - View job and run history
- `CAN_MANAGE_RUN` - View, trigger, and cancel runs
- `CAN_MANAGE` - Full control including edit and delete
## Common Issues
| Issue | Solution |
|---|---|
| Job cluster startup slow | Share one job cluster across tasks via `job_cluster_key`, or use serverless compute |
| Task dependencies not working | Verify `depends_on` entries match existing `task_key` values |
| Schedule not triggering | Check the schedule's `pause_status` and cron timezone |
| File arrival not detecting | Ensure path has proper permissions and uses a cloud storage URL |
| Table update trigger missing events | Verify Unity Catalog table and proper grants |
| Parameter not accessible | Use `dbutils.widgets.get()` in the notebook |
| "admins" group error | Cannot modify admins permissions on jobs |
| Serverless task fails | Ensure task type supports serverless (notebook, Python) |
## Related Skills
- asset-bundles - Deploy jobs via Databricks Asset Bundles
- spark-declarative-pipelines - Configure pipelines triggered by jobs