databricks-2025
🚨 CRITICAL GUIDELINES
Windows File Path Requirements
MANDATORY: Always Use Backslashes on Windows for File Paths
When using Edit or Write tools on Windows, you MUST use backslashes (\) in file paths, NOT forward slashes (/).
Examples:
- ❌ WRONG: D:/repos/project/file.tsx
- ✅ CORRECT: D:\repos\project\file.tsx
This applies to:
- Edit tool file_path parameter
- Write tool file_path parameter
- All file operations on Windows systems
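As a quick illustration of the rule above, Python's pathlib (a hedged sketch, not part of the tooling itself) shows how a forward-slash path normalizes to the required backslash form:

```python
# Illustrative only: PureWindowsPath normalizes separators the way the
# rule above requires, regardless of the OS this snippet runs on.
from pathlib import PureWindowsPath

def to_windows_path(path: str) -> str:
    """Normalize a forward-slash path to backslash form."""
    return str(PureWindowsPath(path))

# to_windows_path("D:/repos/project/file.tsx") -> "D:\\repos\\project\\file.tsx"
```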
Documentation Guidelines
NEVER create new documentation files unless explicitly requested by the user.
- Priority: Update existing README.md files rather than creating new documentation
- Repository cleanliness: Keep repository root clean - only README.md unless user requests otherwise
- Style: Documentation should be concise, direct, and professional - avoid AI-generated tone
- User preference: Only create additional .md files when user specifically asks for documentation
Azure Data Factory Databricks Integration 2025
Databricks Job Activity (Recommended 2025)
🚨 CRITICAL UPDATE (2025): The Databricks Job activity is now the ONLY recommended method for orchestrating Databricks in ADF. Microsoft strongly recommends migrating from legacy Notebook, Python, and JAR activities.
Why Databricks Job Activity?
Old Pattern (Notebook Activity - ❌ LEGACY):
```json
{
  "name": "RunNotebook",
  "type": "DatabricksNotebook", // ❌ DEPRECATED - Migrate to DatabricksJob
  "linkedServiceName": { "referenceName": "DatabricksLinkedService" },
  "typeProperties": {
    "notebookPath": "/Users/user@example.com/MyNotebook",
    "baseParameters": { "param1": "value1" }
  }
}
```
New Pattern (Databricks Job Activity - ✅ CURRENT 2025):
```json
{
  "name": "RunDatabricksWorkflow",
  "type": "DatabricksJob", // ✅ CORRECT activity type (NOT DatabricksSparkJob)
  "linkedServiceName": { "referenceName": "DatabricksLinkedService" },
  "typeProperties": {
    "jobId": "123456", // Reference existing Databricks Workflow Job
    "jobParameters": { // Pass parameters to the Job
      "param1": "value1",
      "runDate": "@pipeline().parameters.ProcessingDate"
    }
  },
  "policy": {
    "timeout": "0.12:00:00",
    "retry": 2,
    "retryIntervalInSeconds": 30
  }
}
```
Benefits of Databricks Job Activity (2025)
- Serverless Execution by Default:
  - ✅ No cluster specification needed in linked service
  - ✅ Automatically runs on Databricks serverless compute
  - ✅ Faster startup times and lower costs
  - ✅ Managed infrastructure by Databricks
- Advanced Workflow Features:
  - ✅ Run As - Execute jobs as specific users/service principals
  - ✅ Task Values - Pass data between tasks within workflow
  - ✅ Conditional Execution - If/Else and For Each task types
  - ✅ AI/BI Tasks - Model serving endpoints, Power BI semantic models
  - ✅ Repair Runs - Rerun failed tasks without reprocessing successful ones
  - ✅ Notifications/Alerts - Built-in alerting on job failures
  - ✅ Git Integration - Version control for notebooks and code
  - ✅ DABs Support - Databricks Asset Bundles for deployment
  - ✅ Built-in Lineage - Data lineage tracking across tasks
  - ✅ Queuing and Concurrent Runs - Better resource management
- Centralized Job Management:
  - Jobs defined once in Databricks workspace
  - Single source of truth for all environments
  - Versioning through Databricks (Git-backed)
  - Consistent across orchestration tools
- Better Orchestration:
  - Complex task dependencies within Job
  - Multiple heterogeneous tasks (notebook, Python, SQL, Delta Live Tables)
  - Job-level monitoring and logging
  - Parameter passing between tasks
- Improved Reliability:
  - Retry logic at Job and task level
  - Better error handling and recovery
  - Automatic cluster management
- Cost Optimization:
  - Serverless compute (pay only for execution)
  - Job clusters (auto-terminating)
  - Optimized cluster sizing per task
  - Spot instance support
Implementation
1. Create Databricks Job
```json
// In Databricks workspace
// Create Job with tasks
{
  "name": "Data Processing Job",
  "tasks": [
    {
      "task_key": "ingest",
      "notebook_task": {
        "notebook_path": "/Notebooks/Ingest",
        "base_parameters": {}
      },
      "job_cluster_key": "small_cluster"
    },
    {
      "task_key": "transform",
      "depends_on": [{ "task_key": "ingest" }],
      "notebook_task": {
        "notebook_path": "/Notebooks/Transform"
      },
      "job_cluster_key": "medium_cluster"
    },
    {
      "task_key": "load",
      "depends_on": [{ "task_key": "transform" }],
      "notebook_task": {
        "notebook_path": "/Notebooks/Load"
      },
      "job_cluster_key": "small_cluster"
    }
  ],
  "job_clusters": [
    {
      "job_cluster_key": "small_cluster",
      "new_cluster": {
        "spark_version": "13.3.x-scala2.12",
        "node_type_id": "Standard_DS3_v2",
        "num_workers": 2
      }
    },
    {
      "job_cluster_key": "medium_cluster",
      "new_cluster": {
        "spark_version": "13.3.x-scala2.12",
        "node_type_id": "Standard_DS4_v2",
        "num_workers": 8
      }
    }
  ]
}
```
Get the Job ID after creation.
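The Job above can also be created programmatically to capture the Job ID that ADF's jobId property references. A minimal sketch against the Databricks Jobs 2.1 API (the host, token, and abbreviated job spec are illustrative assumptions):

```python
import json
import urllib.request

def build_create_request(host: str, token: str, job_spec: dict) -> urllib.request.Request:
    """Build the POST /api/2.1/jobs/create request for a job spec."""
    return urllib.request.Request(
        url=f"{host}/api/2.1/jobs/create",
        data=json.dumps(job_spec).encode("utf-8"),
        headers={"Authorization": f"Bearer {token}",
                 "Content-Type": "application/json"},
        method="POST",
    )

# Usage (network call; requires a real workspace URL and PAT):
# with urllib.request.urlopen(build_create_request(host, token, spec)) as resp:
#     job_id = json.load(resp)["job_id"]  # value to use as "jobId" in ADF
```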
2. Create ADF Pipeline with Databricks Job Activity (2025)
```json
{
  "name": "PL_Databricks_Serverless_Workflow",
  "properties": {
    "activities": [
      {
        "name": "ExecuteDatabricksWorkflow",
        "type": "DatabricksJob", // ✅ Correct activity type
        "dependsOn": [],
        "policy": {
          "timeout": "0.12:00:00",
          "retry": 2,
          "retryIntervalInSeconds": 30
        },
        "typeProperties": {
          "jobId": "123456", // Databricks Job ID from workspace
          "jobParameters": { // ⚠️ Use jobParameters (not parameters)
            "input_path": "/mnt/data/input",
            "output_path": "/mnt/data/output",
            "run_date": "@pipeline().parameters.runDate",
            "environment": "@pipeline().parameters.environment"
          }
        },
        "linkedServiceName": {
          "referenceName": "DatabricksLinkedService_Serverless",
          "type": "LinkedServiceReference"
        }
      },
      {
        "name": "LogJobExecution",
        "type": "WebActivity",
        "dependsOn": [
          {
            "activity": "ExecuteDatabricksWorkflow",
            "dependencyConditions": ["Succeeded"]
          }
        ],
        "typeProperties": {
          "url": "@pipeline().parameters.LoggingEndpoint",
          "method": "POST",
          "body": {
            "jobId": "123456",
            "runId": "@activity('ExecuteDatabricksWorkflow').output.runId",
            "status": "Succeeded",
            "duration": "@activity('ExecuteDatabricksWorkflow').output.executionDuration"
          }
        }
      }
    ],
    "parameters": {
      "runDate": {
        "type": "string",
        "defaultValue": "@utcnow()"
      },
      "environment": {
        "type": "string",
        "defaultValue": "production"
      },
      "LoggingEndpoint": {
        "type": "string"
      }
    }
  }
}
```
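The runId captured from the activity output can also be checked directly against the Databricks Jobs 2.1 API. A hedged sketch (host, token, and the helper names are illustrative assumptions):

```python
import urllib.request

def build_run_status_request(host: str, token: str, run_id: int) -> urllib.request.Request:
    """Build the GET /api/2.1/jobs/runs/get request for a given run."""
    return urllib.request.Request(
        url=f"{host}/api/2.1/jobs/runs/get?run_id={run_id}",
        headers={"Authorization": f"Bearer {token}"},
    )

def run_state(run: dict) -> str:
    """Extract the life-cycle state from a runs/get response body."""
    return run.get("state", {}).get("life_cycle_state", "UNKNOWN")

# Usage (network call):
# import json
# with urllib.request.urlopen(build_run_status_request(host, token, run_id)) as resp:
#     print(run_state(json.load(resp)))  # e.g. RUNNING, TERMINATED
```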
3. Configure Linked Service (2025 - Serverless)
✅ RECOMMENDED: Serverless Linked Service (No Cluster Configuration)
```json
{
  "name": "DatabricksLinkedService_Serverless",
  "type": "Microsoft.DataFactory/factories/linkedservices",
  "properties": {
    "type": "AzureDatabricks",
    "typeProperties": {
      "domain": "https://adb-123456789.azuredatabricks.net",
      "authentication": "MSI" // ✅ Managed Identity (recommended 2025)
      // ⚠️ NO existingClusterId or newClusterNodeType needed for serverless!
      // The Databricks Job activity automatically uses serverless compute
    }
  }
}
```
Alternative: Access Token Authentication
```json
{
  "name": "DatabricksLinkedService_Token",
  "type": "Microsoft.DataFactory/factories/linkedservices",
  "properties": {
    "type": "AzureDatabricks",
    "typeProperties": {
      "domain": "https://adb-123456789.azuredatabricks.net",
      "accessToken": {
        "type": "AzureKeyVaultSecret",
        "store": {
          "referenceName": "AzureKeyVault",
          "type": "LinkedServiceReference"
        },
        "secretName": "databricks-access-token"
      }
    }
  }
}
```
🚨 CRITICAL: For the Databricks Job activity, DO NOT specify cluster properties in the linked service. The job configuration in the Databricks workspace controls compute resources.
🆕 2025 New Connectors and Enhancements
ServiceNow V2 Connector (RECOMMENDED - V1 End of Support)
🚨 CRITICAL: ServiceNow V1 connector is at End of Support stage. Migrate to V2 immediately!
Key Features of V2:
- ✅ Native Query Builder - Aligns with ServiceNow's condition builder experience
- ✅ Enhanced Performance - Optimized data extraction
- ✅ Better Error Handling - Improved diagnostics and retry logic
- ✅ OData Support - Modern API integration patterns
Copy Activity Example:
```json
{
  "name": "CopyFromServiceNowV2",
  "type": "Copy",
  "inputs": [
    {
      "referenceName": "ServiceNowV2Source",
      "type": "DatasetReference"
    }
  ],
  "outputs": [
    {
      "referenceName": "AzureSqlSink",
      "type": "DatasetReference"
    }
  ],
  "typeProperties": {
    "source": {
      "type": "ServiceNowV2Source",
      "query": "sysparm_query=active=true^priority=1^sys_created_on>=javascript:gs.dateGenerate('2025-01-01')",
      "httpRequestTimeout": "00:01:40" // 100 seconds
    },
    "sink": {
      "type": "AzureSqlSink",
      "writeBehavior": "upsert",
      "upsertSettings": {
        "useTempDB": true,
        "keys": ["sys_id"]
      }
    },
    "enableStaging": true,
    "stagingSettings": {
      "linkedServiceName": {
        "referenceName": "AzureBlobStorage",
        "type": "LinkedServiceReference"
      }
    }
  }
}
```
Linked Service (OAuth2 - Recommended):
```json
{
  "name": "ServiceNowV2LinkedService",
  "type": "Microsoft.DataFactory/factories/linkedservices",
  "properties": {
    "type": "ServiceNowV2",
    "typeProperties": {
      "endpoint": "https://dev12345.service-now.com",
      "authenticationType": "OAuth2",
      "clientId": "your-oauth-client-id",
      "clientSecret": {
        "type": "AzureKeyVaultSecret",
        "store": {
          "referenceName": "AzureKeyVault",
          "type": "LinkedServiceReference"
        },
        "secretName": "servicenow-client-secret"
      },
      "username": "service-account@company.com",
      "password": {
        "type": "AzureKeyVaultSecret",
        "store": {
          "referenceName": "AzureKeyVault",
          "type": "LinkedServiceReference"
        },
        "secretName": "servicenow-password"
      },
      "grantType": "password"
    }
  }
}
```
Linked Service (Basic Authentication - Legacy):
```json
{
  "name": "ServiceNowV2LinkedService_Basic",
  "type": "Microsoft.DataFactory/factories/linkedservices",
  "properties": {
    "type": "ServiceNowV2",
    "typeProperties": {
      "endpoint": "https://dev12345.service-now.com",
      "authenticationType": "Basic",
      "username": "admin",
      "password": {
        "type": "AzureKeyVaultSecret",
        "store": {
          "referenceName": "AzureKeyVault",
          "type": "LinkedServiceReference"
        },
        "secretName": "servicenow-password"
      }
    }
  }
}
```
Migration from V1 to V2:
- Update linked service type from ServiceNow to ServiceNowV2
- Update source type from ServiceNowSource to ServiceNowV2Source
- Test queries in ServiceNow UI's condition builder first
- Adjust timeout settings if needed (V2 may have different performance)
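Since the V1-to-V2 change is mostly a mechanical type rename, exported pipeline JSON can be pre-processed with a small script. This is a hypothetical helper reflecting the steps above, not an official migration tool:

```python
# V1 -> V2 type renames from the migration steps above (illustrative).
RENAMES = {
    "ServiceNow": "ServiceNowV2",
    "ServiceNowSource": "ServiceNowV2Source",
}

def migrate_types(node):
    """Recursively rewrite 'type' fields in an exported ADF JSON tree."""
    if isinstance(node, dict):
        return {
            k: (RENAMES.get(v, v) if k == "type" and isinstance(v, str) else migrate_types(v))
            for k, v in node.items()
        }
    if isinstance(node, list):
        return [migrate_types(item) for item in node]
    return node

# Example: a V1 copy source becomes a V2 source.
# migrate_types({"source": {"type": "ServiceNowSource"}})
# -> {"source": {"type": "ServiceNowV2Source"}}
```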
Enhanced PostgreSQL Connector
Improved performance and features:
```json
{
  "name": "PostgreSQLLinkedService",
  "type": "PostgreSql",
  "typeProperties": {
    "connectionString": "host=myserver.postgres.database.azure.com;port=5432;database=mydb;uid=myuser",
    "password": {
      "type": "AzureKeyVaultSecret",
      "store": { "referenceName": "KeyVault" },
      "secretName": "postgres-password"
    },
    // 2025 enhancement
    "enableSsl": true,
    "sslMode": "Require"
  }
}
```
Microsoft Fabric Warehouse Connector (NEW 2025)
🆕 Native support for Microsoft Fabric Warehouse (Q3 2024+)
Supported Activities:
- ✅ Copy Activity (source and sink)
- ✅ Lookup Activity
- ✅ Get Metadata Activity
- ✅ Script Activity
- ✅ Stored Procedure Activity
Linked Service Configuration:
```json
{
  "name": "FabricWarehouseLinkedService",
  "type": "Microsoft.DataFactory/factories/linkedservices",
  "properties": {
    "type": "Warehouse", // ✅ NEW dedicated Fabric Warehouse type
    "typeProperties": {
      "endpoint": "myworkspace.datawarehouse.fabric.microsoft.com",
      "warehouse": "MyWarehouse",
      "authenticationType": "ServicePrincipal", // Recommended
      "servicePrincipalId": "<app-registration-id>",
      "servicePrincipalKey": {
        "type": "AzureKeyVaultSecret",
        "store": {
          "referenceName": "AzureKeyVault",
          "type": "LinkedServiceReference"
        },
        "secretName": "fabric-warehouse-sp-key"
      },
      "tenant": "<tenant-id>"
    }
  }
}
```
Alternative: Managed Identity Authentication (Preferred)
```json
{
  "name": "FabricWarehouseLinkedService_ManagedIdentity",
  "type": "Microsoft.DataFactory/factories/linkedservices",
  "properties": {
    "type": "Warehouse",
    "typeProperties": {
      "endpoint": "myworkspace.datawarehouse.fabric.microsoft.com",
      "warehouse": "MyWarehouse",
      "authenticationType": "SystemAssignedManagedIdentity"
    }
  }
}
```
Copy Activity Example:
```json
{
  "name": "CopyToFabricWarehouse",
  "type": "Copy",
  "inputs": [
    {
      "referenceName": "AzureSqlSource",
      "type": "DatasetReference"
    }
  ],
  "outputs": [
    {
      "referenceName": "FabricWarehouseSink",
      "type": "DatasetReference"
    }
  ],
  "typeProperties": {
    "source": {
      "type": "AzureSqlSource"
    },
    "sink": {
      "type": "WarehouseSink",
      "writeBehavior": "insert", // or "upsert"
      "writeBatchSize": 10000,
      "tableOption": "autoCreate" // Auto-create table if not exists
    },
    "enableStaging": true, // Recommended for large data
    "stagingSettings": {
      "linkedServiceName": {
        "referenceName": "AzureBlobStorage",
        "type": "LinkedServiceReference"
      },
      "path": "staging/fabric-warehouse"
    },
    "translator": {
      "type": "TabularTranslator",
      "mappings": [
        {
          "source": { "name": "CustomerID" },
          "sink": { "name": "customer_id" }
        }
      ]
    }
  }
}
```
Best Practices for Fabric Warehouse:
- ✅ Use managed identity for authentication (no secret rotation)
- ✅ Enable staging for large data loads (> 1GB)
- ✅ Use tableOption: autoCreate for dynamic schema creation
- ✅ Leverage Fabric's lakehouse integration for unified analytics
- ✅ Monitor Fabric capacity units (CU) consumption
Enhanced Snowflake Connector
Improved performance:
```json
{
  "name": "SnowflakeLinkedService",
  "type": "Snowflake",
  "typeProperties": {
    "connectionString": "jdbc:snowflake://myaccount.snowflakecomputing.com",
    "database": "mydb",
    "warehouse": "mywarehouse",
    "authenticationType": "KeyPair",
    "username": "myuser",
    "privateKey": {
      "type": "AzureKeyVaultSecret",
      "store": { "referenceName": "KeyVault" },
      "secretName": "snowflake-private-key"
    },
    "privateKeyPassphrase": {
      "type": "AzureKeyVaultSecret",
      "store": { "referenceName": "KeyVault" },
      "secretName": "snowflake-passphrase"
    }
  }
}
```
Managed Identity for Azure Storage (2025)
Azure Table Storage
Now supports system-assigned and user-assigned managed identity:
```json
{
  "name": "AzureTableStorageLinkedService",
  "type": "AzureTableStorage",
  "typeProperties": {
    "serviceEndpoint": "https://mystorageaccount.table.core.windows.net",
    "authenticationType": "ManagedIdentity" // New in 2025
    // Or user-assigned:
    // "credential": {
    //   "referenceName": "UserAssignedManagedIdentity"
    // }
  }
}
```
Azure Files
Now supports managed identity authentication:
```json
{
  "name": "AzureFilesLinkedService",
  "type": "AzureFileStorage",
  "typeProperties": {
    "fileShare": "myshare",
    "accountName": "mystorageaccount",
    "authenticationType": "ManagedIdentity" // New in 2025
  }
}
```
Mapping Data Flows - Spark 3.3
Spark 3.3 now powers Mapping Data Flows:
Performance Improvements:
- 30% faster data processing
- Improved memory management
- Better partition handling
- Enhanced join performance
New Features:
- Adaptive Query Execution (AQE)
- Dynamic partition pruning
- Improved caching
- Better column statistics
```json
{
  "name": "DataFlow1",
  "type": "MappingDataFlow",
  "typeProperties": {
    "sources": [
      {
        "dataset": { "referenceName": "SourceDataset" }
      }
    ],
    "transformations": [
      {
        "name": "Transform1"
      }
    ],
    "sinks": [
      {
        "dataset": { "referenceName": "SinkDataset" }
      }
    ]
  }
}
```
Azure DevOps Server 2022 Support
Git integration now supports on-premises Azure DevOps Server 2022:
```json
{
  "name": "DataFactory",
  "properties": {
    "repoConfiguration": {
      "type": "AzureDevOpsGit",
      "accountName": "on-prem-ado-server",
      "projectName": "MyProject",
      "repositoryName": "adf-repo",
      "collaborationBranch": "main",
      "rootFolder": "/",
      "hostName": "https://ado-server.company.com" // On-premises server
    }
  }
}
```
🔐 Managed Identity 2025 Best Practices
User-Assigned vs System-Assigned Managed Identity
System-Assigned Managed Identity:
```json
{
  "type": "AzureBlobStorage",
  "typeProperties": {
    "serviceEndpoint": "https://mystorageaccount.blob.core.windows.net",
    "accountKind": "StorageV2"
    // ✅ Uses Data Factory's system-assigned identity automatically
  }
}
```
User-Assigned Managed Identity (NEW 2025):
```json
{
  "type": "AzureBlobStorage",
  "typeProperties": {
    "serviceEndpoint": "https://mystorageaccount.blob.core.windows.net",
    "accountKind": "StorageV2",
    "credential": {
      "referenceName": "UserAssignedManagedIdentityCredential",
      "type": "CredentialReference"
    }
  }
}
```
When to Use User-Assigned:
- ✅ Sharing identity across multiple data factories
- ✅ Complex multi-environment setups
- ✅ Granular permission management
- ✅ Identity lifecycle independent of data factory
Credential Consolidation (NEW 2025):
ADF now supports a centralized Credentials feature:
```json
{
  "name": "ManagedIdentityCredential",
  "type": "Microsoft.DataFactory/factories/credentials",
  "properties": {
    "type": "ManagedIdentity",
    "typeProperties": {
      "resourceId": "/subscriptions/{sub}/resourceGroups/{rg}/providers/Microsoft.ManagedIdentity/userAssignedIdentities/{identity-name}"
    }
  }
}
```
Benefits:
- ✅ Consolidate all Microsoft Entra ID-based credentials in one place
- ✅ Reuse credentials across multiple linked services
- ✅ Centralized permission management
- ✅ Easier audit and compliance tracking
MFA Enforcement Compatibility (October 2025)
🚨 IMPORTANT: Azure requires MFA for all users by October 2025
Impact on ADF:
- ✅ Managed identities are UNAFFECTED - No MFA required for service accounts
- ✅ Continue using system-assigned and user-assigned identities without changes
- ❌ Interactive user logins affected - Personal Azure AD accounts need MFA
- ✅ Service principals with certificate auth - Recommended alternative to secrets
Best Practice:
```json
{
  "type": "AzureSqlDatabase",
  "typeProperties": {
    "server": "myserver.database.windows.net",
    "database": "mydb",
    "authenticationType": "SystemAssignedManagedIdentity"
    // ✅ No MFA needed, no secret rotation, passwordless
  }
}
```
Principle of Least Privilege (2025)
Storage Blob Data Roles:
- Storage Blob Data Reader - Read-only access (source)
- Storage Blob Data Contributor - Read/write access (sink)
- Storage Blob Data Owner - ❌ Avoid unless needed
SQL Database Roles:
```sql
-- Create contained database user for managed identity
CREATE USER [datafactory-name] FROM EXTERNAL PROVIDER;
-- Grant minimal required permissions
ALTER ROLE db_datareader ADD MEMBER [datafactory-name];
ALTER ROLE db_datawriter ADD MEMBER [datafactory-name];
-- ❌ Avoid db_owner unless truly needed
```
Key Vault Access Policies:
```json
{
  "permissions": {
    "secrets": ["Get"] // ✅ Only Get permission needed
    // ❌ Don't grant List, Set, Delete unless required
  }
}
```
Best Practices (2025)
- Use Databricks Job Activity (MANDATORY):
  - ❌ STOP using Notebook, Python, JAR activities
  - ✅ Migrate to DatabricksJob activity immediately
  - ✅ Define workflows in Databricks workspace
  - ✅ Leverage serverless compute (no cluster config needed)
  - ✅ Utilize advanced features (Run As, Task Values, If/Else, Repair Runs)
- Managed Identity Authentication (MANDATORY 2025):
  - ✅ Use managed identities for ALL Azure resources
  - ✅ Prefer system-assigned for simple scenarios
  - ✅ Use user-assigned for shared identity needs
  - ✅ Leverage Credentials feature for consolidation
  - ✅ MFA-compliant for October 2025 enforcement
  - ❌ Avoid access keys and connection strings
  - ✅ Store any remaining secrets in Key Vault
- Monitor Job Execution:
  - Track Databricks Job run IDs from ADF output
  - Log Job parameters for auditability
  - Set up alerts for job failures
  - Use Databricks job-level monitoring
  - Leverage built-in lineage tracking
- Optimize Spark 3.3 Usage (Data Flows):
  - Enable Adaptive Query Execution (AQE)
  - Use appropriate partition counts (4-8 per core)
  - Monitor execution plans in Databricks
  - Use broadcast joins for small dimensions
  - Implement dynamic partition pruning
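The "4-8 partitions per core" guideline above can be captured in a small helper; this is an illustrative rule-of-thumb calculation, not an ADF or Spark API:

```python
def suggested_partition_range(total_cores: int) -> tuple[int, int]:
    """Rule of thumb from the guideline above: 4-8 partitions per core."""
    if total_cores < 1:
        raise ValueError("total_cores must be >= 1")
    return (4 * total_cores, 8 * total_cores)

# Example: an 8-core run (e.g. 2 x 4-core workers, both hypothetical sizes)
# suggested_partition_range(8) -> (32, 64)
```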