databricks-2025

English original (🇺🇸) with Chinese translation (🇨🇳) following each section.

🚨 CRITICAL GUIDELINES

🚨 重要指南

Windows File Path Requirements

Windows文件路径要求

MANDATORY: Always Use Backslashes on Windows for File Paths
When using the Edit or Write tools on Windows, you MUST use backslashes (\) in file paths, NOT forward slashes (/).
Examples:
  • ❌ WRONG:
    D:/repos/project/file.tsx
  • ✅ CORRECT:
    D:\repos\project\file.tsx
This applies to:
  • Edit tool file_path parameter
  • Write tool file_path parameter
  • All file operations on Windows systems
🔴 强制要求:Windows系统下文件路径必须使用反斜杠
在Windows系统上使用编辑或写入工具时,文件路径必须使用反斜杠(\),绝对不能使用正斜杠(/)。
示例:
  • ❌ 错误:
    D:/repos/project/file.tsx
  • ✅ 正确:
    D:\repos\project\file.tsx
此要求适用于:
  • 编辑工具的file_path参数
  • 写入工具的file_path参数
  • Windows系统上的所有文件操作
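If path handling ever needs to be automated rather than typed by hand, Python's standard `pathlib` renders either separator style in the required backslash form. A minimal illustration (not part of the tooling itself):

```python
from pathlib import PureWindowsPath

def to_windows_path(path: str) -> str:
    """Return the backslash form of a path for use in file_path parameters."""
    # PureWindowsPath accepts "/" and "\" interchangeably and renders with "\".
    return str(PureWindowsPath(path))

print(to_windows_path("D:/repos/project/file.tsx"))  # D:\repos\project\file.tsx
```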

Documentation Guidelines

文档指南

NEVER create new documentation files unless explicitly requested by the user.
  • Priority: Update existing README.md files rather than creating new documentation
  • Repository cleanliness: Keep repository root clean - only README.md unless user requests otherwise
  • Style: Documentation should be concise, direct, and professional - avoid AI-generated tone
  • User preference: Only create additional .md files when user specifically asks for documentation

除非用户明确要求,否则绝对不要创建新的文档文件。
  • 优先级:优先更新现有的README.md文件,而非创建新文档
  • 仓库整洁性:保持仓库根目录整洁——除非用户要求,否则只保留README.md
  • 风格:文档应简洁、直接、专业——避免AI生成的语气
  • 用户偏好:仅在用户明确要求时才创建额外的.md文件

Azure Data Factory Databricks Integration 2025

2025版Azure Data Factory与Databricks集成

Databricks Job Activity (Recommended 2025)

Databricks Job 活动(2025推荐方案)

🚨 CRITICAL UPDATE (2025): The Databricks Job activity is now the ONLY recommended method for orchestrating Databricks in ADF. Microsoft strongly recommends migrating from legacy Notebook, Python, and JAR activities.
🚨 重要更新(2025): Databricks Job活动现已成为在ADF中编排Databricks任务的唯一推荐方式。Microsoft强烈建议从传统的Notebook、Python和JAR活动迁移至此方案。

Why Databricks Job Activity?

为什么选择Databricks Job活动?

Old Pattern (Notebook Activity - ❌ LEGACY):
json
{
  "name": "RunNotebook",
  "type": "DatabricksNotebook",  // ❌ DEPRECATED - Migrate to DatabricksJob
  "linkedServiceName": { "referenceName": "DatabricksLinkedService" },
  "typeProperties": {
    "notebookPath": "/Users/user@example.com/MyNotebook",
    "baseParameters": { "param1": "value1" }
  }
}
New Pattern (Databricks Job Activity - ✅ CURRENT 2025):
json
{
  "name": "RunDatabricksWorkflow",
  "type": "DatabricksJob",  // ✅ CORRECT activity type (NOT DatabricksSparkJob)
  "linkedServiceName": { "referenceName": "DatabricksLinkedService" },
  "typeProperties": {
    "jobId": "123456",  // Reference existing Databricks Workflow Job
    "jobParameters": {  // Pass parameters to the Job
      "param1": "value1",
      "runDate": "@pipeline().parameters.ProcessingDate"
    }
  },
  "policy": {
    "timeout": "0.12:00:00",
    "retry": 2,
    "retryIntervalInSeconds": 30
  }
}
旧模式(Notebook活动 - ❌ 已过时):
json
{
  "name": "RunNotebook",
  "type": "DatabricksNotebook",  // ❌ DEPRECATED - Migrate to DatabricksJob
  "linkedServiceName": { "referenceName": "DatabricksLinkedService" },
  "typeProperties": {
    "notebookPath": "/Users/user@example.com/MyNotebook",
    "baseParameters": { "param1": "value1" }
  }
}
新模式(Databricks Job活动 - ✅ 2025当前推荐):
json
{
  "name": "RunDatabricksWorkflow",
  "type": "DatabricksJob",  // ✅ CORRECT activity type (NOT DatabricksSparkJob)
  "linkedServiceName": { "referenceName": "DatabricksLinkedService" },
  "typeProperties": {
    "jobId": "123456",  // Reference existing Databricks Workflow Job
    "jobParameters": {  // Pass parameters to the Job
      "param1": "value1",
      "runDate": "@pipeline().parameters.ProcessingDate"
    }
  },
  "policy": {
    "timeout": "0.12:00:00",
    "retry": 2,
    "retryIntervalInSeconds": 30
  }
}

Benefits of Databricks Job Activity (2025)

Databricks Job活动的优势(2025版)

  1. Serverless Execution by Default:
    • ✅ No cluster specification needed in linked service
    • ✅ Automatically runs on Databricks serverless compute
    • ✅ Faster startup times and lower costs
    • ✅ Managed infrastructure by Databricks
  2. Advanced Workflow Features:
    • Run As - Execute jobs as specific users/service principals
    • Task Values - Pass data between tasks within workflow
    • Conditional Execution - If/Else and For Each task types
    • AI/BI Tasks - Model serving endpoints, Power BI semantic models
    • Repair Runs - Rerun failed tasks without reprocessing successful ones
    • Notifications/Alerts - Built-in alerting on job failures
    • Git Integration - Version control for notebooks and code
    • DABs Support - Databricks Asset Bundles for deployment
    • Built-in Lineage - Data lineage tracking across tasks
    • Queuing and Concurrent Runs - Better resource management
  3. Centralized Job Management:
    • Jobs defined once in Databricks workspace
    • Single source of truth for all environments
    • Versioning through Databricks (Git-backed)
    • Consistent across orchestration tools
  4. Better Orchestration:
    • Complex task dependencies within Job
    • Multiple heterogeneous tasks (notebook, Python, SQL, Delta Live Tables)
    • Job-level monitoring and logging
    • Parameter passing between tasks
  5. Improved Reliability:
    • Retry logic at Job and task level
    • Better error handling and recovery
    • Automatic cluster management
  6. Cost Optimization:
    • Serverless compute (pay only for execution)
    • Job clusters (auto-terminating)
    • Optimized cluster sizing per task
    • Spot instance support
  1. 默认支持无服务器执行:
    • ✅ 无需在链接服务中指定集群配置
    • ✅ 自动在Databricks无服务器计算资源上运行
    • ✅ 启动速度更快,成本更低
    • ✅ 由Databricks托管基础设施
  2. 高级工作流功能:
    • 以指定身份运行 - 以特定用户/服务主体执行作业
    • 任务值传递 - 在工作流的任务之间传递数据
    • 条件执行 - 支持If/Else和For Each任务类型
    • AI/BI任务 - 模型服务端点、Power BI语义模型
    • 修复运行 - 仅重新运行失败的任务,无需重新处理已成功的任务
    • 通知/告警 - 内置任务失败告警功能
    • Git集成 - 笔记本和代码的版本控制
    • DABs支持 - 用于部署的Databricks Asset Bundles
    • 内置数据血缘 - 跨任务的数据血缘跟踪
    • 排队与并发运行 - 更优的资源管理
  3. 集中式作业管理:
    • 在Databricks工作区中一次性定义作业
    • 所有环境的单一事实来源
    • 通过Databricks实现版本控制(基于Git)
    • 在不同编排工具中保持一致性
  4. 更强大的编排能力:
    • 作业内的复杂任务依赖关系
    • 多种异构任务(笔记本、Python、SQL、Delta Live Tables)
    • 作业级别的监控与日志
    • 任务之间的参数传递
  5. 更高的可靠性:
    • 作业级别和任务级别的重试逻辑
    • 更完善的错误处理与恢复机制
    • 自动集群管理
  6. 成本优化:
    • 无服务器计算(仅为执行时间付费)
    • 作业集群(自动终止)
    • 针对每个任务优化集群规模
    • 支持抢占式实例

Implementation

实施步骤

1. Create Databricks Job

1. 创建Databricks作业

json
// In Databricks workspace: create the Job with its tasks
{
  "name": "Data Processing Job",
  "tasks": [
    {
      "task_key": "ingest",
      "notebook_task": {
        "notebook_path": "/Notebooks/Ingest",
        "base_parameters": {}
      },
      "job_cluster_key": "small_cluster"
    },
    {
      "task_key": "transform",
      "depends_on": [{ "task_key": "ingest" }],
      "notebook_task": { "notebook_path": "/Notebooks/Transform" },
      "job_cluster_key": "medium_cluster"
    },
    {
      "task_key": "load",
      "depends_on": [{ "task_key": "transform" }],
      "notebook_task": { "notebook_path": "/Notebooks/Load" },
      "job_cluster_key": "small_cluster"
    }
  ],
  "job_clusters": [
    {
      "job_cluster_key": "small_cluster",
      "new_cluster": {
        "spark_version": "13.3.x-scala2.12",
        "node_type_id": "Standard_DS3_v2",
        "num_workers": 2
      }
    },
    {
      "job_cluster_key": "medium_cluster",
      "new_cluster": {
        "spark_version": "13.3.x-scala2.12",
        "node_type_id": "Standard_DS4_v2",
        "num_workers": 8
      }
    }
  ]
}
// Get the Job ID after creation — it is the value referenced by the ADF activity's jobId.
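The Job above can also be created programmatically via the Databricks Jobs API 2.1 (`POST /api/2.1/jobs/create`, which returns the `job_id`). A minimal sketch using only the standard library — the workspace URL and token below are placeholders:

```python
import json
import urllib.request

def build_jobs_create_request(host: str, token: str, job_spec: dict) -> urllib.request.Request:
    """Build (but do not send) a POST to the Databricks Jobs API 2.1 create endpoint."""
    return urllib.request.Request(
        url=f"{host}/api/2.1/jobs/create",
        data=json.dumps(job_spec).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {token}",
            "Content-Type": "application/json",
        },
        method="POST",
    )

# Abbreviated spec — in practice, pass the full Job definition shown above.
job_spec = {"name": "Data Processing Job", "tasks": [], "job_clusters": []}
req = build_jobs_create_request(
    "https://adb-123456789.azuredatabricks.net",  # placeholder workspace URL
    "<personal-access-token>",                    # placeholder token
    job_spec,
)
# urllib.request.urlopen(req) would return {"job_id": <id>}; that id is the
# value to put in the ADF activity's "jobId" typeProperty.
```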

2. Create ADF Pipeline with Databricks Job Activity (2025)

2. 创建包含Databricks Job活动的ADF管道(2025版)

json
{
  "name": "PL_Databricks_Serverless_Workflow",
  "properties": {
    "activities": [
      {
        "name": "ExecuteDatabricksWorkflow",
        "type": "DatabricksJob",  // ✅ Correct activity type
        "dependsOn": [],
        "policy": {
          "timeout": "0.12:00:00",
          "retry": 2,
          "retryIntervalInSeconds": 30
        },
        "typeProperties": {
          "jobId": "123456",  // Databricks Job ID from workspace
          "jobParameters": {  // ⚠️ Use jobParameters (not parameters)
            "input_path": "/mnt/data/input",
            "output_path": "/mnt/data/output",
            "run_date": "@pipeline().parameters.runDate",
            "environment": "@pipeline().parameters.environment"
          }
        },
        "linkedServiceName": {
          "referenceName": "DatabricksLinkedService_Serverless",
          "type": "LinkedServiceReference"
        }
      },
      {
        "name": "LogJobExecution",
        "type": "WebActivity",
        "dependsOn": [
          {
            "activity": "ExecuteDatabricksWorkflow",
            "dependencyConditions": ["Succeeded"]
          }
        ],
        "typeProperties": {
          "url": "@pipeline().parameters.LoggingEndpoint",
          "method": "POST",
          "body": {
            "jobId": "123456",
            "runId": "@activity('ExecuteDatabricksWorkflow').output.runId",
            "status": "Succeeded",
            "duration": "@activity('ExecuteDatabricksWorkflow').output.executionDuration"
          }
        }
      }
    ],
    "parameters": {
      "runDate": {
        "type": "string",
        "defaultValue": "@utcnow()"
      },
      "environment": {
        "type": "string",
        "defaultValue": "production"
      },
      "LoggingEndpoint": {
        "type": "string"
      }
    }
  }
}
json
{
  "name": "PL_Databricks_Serverless_Workflow",
  "properties": {
    "activities": [
      {
        "name": "ExecuteDatabricksWorkflow",
        "type": "DatabricksJob",  // ✅ Correct activity type
        "dependsOn": [],
        "policy": {
          "timeout": "0.12:00:00",
          "retry": 2,
          "retryIntervalInSeconds": 30
        },
        "typeProperties": {
          "jobId": "123456",  // Databricks Job ID from workspace
          "jobParameters": {  // ⚠️ Use jobParameters (not parameters)
            "input_path": "/mnt/data/input",
            "output_path": "/mnt/data/output",
            "run_date": "@pipeline().parameters.runDate",
            "environment": "@pipeline().parameters.environment"
          }
        },
        "linkedServiceName": {
          "referenceName": "DatabricksLinkedService_Serverless",
          "type": "LinkedServiceReference"
        }
      },
      {
        "name": "LogJobExecution",
        "type": "WebActivity",
        "dependsOn": [
          {
            "activity": "ExecuteDatabricksWorkflow",
            "dependencyConditions": ["Succeeded"]
          }
        ],
        "typeProperties": {
          "url": "@pipeline().parameters.LoggingEndpoint",
          "method": "POST",
          "body": {
            "jobId": "123456",
            "runId": "@activity('ExecuteDatabricksWorkflow').output.runId",
            "status": "Succeeded",
            "duration": "@activity('ExecuteDatabricksWorkflow').output.executionDuration"
          }
        }
      }
    ],
    "parameters": {
      "runDate": {
        "type": "string",
        "defaultValue": "@utcnow()"
      },
      "environment": {
        "type": "string",
        "defaultValue": "production"
      },
      "LoggingEndpoint": {
        "type": "string"
      }
    }
  }
}

3. Configure Linked Service (2025 - Serverless)

3. 配置链接服务(2025版 - 无服务器)

✅ RECOMMENDED: Serverless Linked Service (No Cluster Configuration)
json
{
  "name": "DatabricksLinkedService_Serverless",
  "type": "Microsoft.DataFactory/factories/linkedservices",
  "properties": {
    "type": "AzureDatabricks",
    "typeProperties": {
      "domain": "https://adb-123456789.azuredatabricks.net",
      "authentication": "MSI"  // ✅ Managed Identity (recommended 2025)
      // ⚠️ NO existingClusterId or newClusterNodeType needed for serverless!
      // The Databricks Job activity automatically uses serverless compute
    }
  }
}
Alternative: Access Token Authentication
json
{
  "name": "DatabricksLinkedService_Token",
  "type": "Microsoft.DataFactory/factories/linkedservices",
  "properties": {
    "type": "AzureDatabricks",
    "typeProperties": {
      "domain": "https://adb-123456789.azuredatabricks.net",
      "accessToken": {
        "type": "AzureKeyVaultSecret",
        "store": {
          "referenceName": "AzureKeyVault",
          "type": "LinkedServiceReference"
        },
        "secretName": "databricks-access-token"
      }
    }
  }
}
🚨 CRITICAL: For Databricks Job activity, DO NOT specify cluster properties in the linked service. The job configuration in Databricks workspace controls compute resources.
✅ 推荐:无服务器链接服务(无需集群配置)
json
{
  "name": "DatabricksLinkedService_Serverless",
  "type": "Microsoft.DataFactory/factories/linkedservices",
  "properties": {
    "type": "AzureDatabricks",
    "typeProperties": {
      "domain": "https://adb-123456789.azuredatabricks.net",
      "authentication": "MSI"  // ✅ Managed Identity (recommended 2025)
      // ⚠️ NO existingClusterId or newClusterNodeType needed for serverless!
      // The Databricks Job activity automatically uses serverless compute
    }
  }
}
替代方案:访问令牌认证
json
{
  "name": "DatabricksLinkedService_Token",
  "type": "Microsoft.DataFactory/factories/linkedservices",
  "properties": {
    "type": "AzureDatabricks",
    "typeProperties": {
      "domain": "https://adb-123456789.azuredatabricks.net",
      "accessToken": {
        "type": "AzureKeyVaultSecret",
        "store": {
          "referenceName": "AzureKeyVault",
          "type": "LinkedServiceReference"
        },
        "secretName": "databricks-access-token"
      }
    }
  }
}
🚨 重要提示:对于Databricks Job活动,请勿在链接服务中指定集群属性。计算资源由Databricks工作区中的任务配置控制。

🆕 2025 New Connectors and Enhancements

🆕 2025版新连接器与增强功能

ServiceNow V2 Connector (RECOMMENDED - V1 End of Support)

ServiceNow V2连接器(推荐 - V1已停止支持)

🚨 CRITICAL: ServiceNow V1 connector is at End of Support stage. Migrate to V2 immediately!
Key Features of V2:
  • Native Query Builder - Aligns with ServiceNow's condition builder experience
  • Enhanced Performance - Optimized data extraction
  • Better Error Handling - Improved diagnostics and retry logic
  • OData Support - Modern API integration patterns
Copy Activity Example:
json
{
  "name": "CopyFromServiceNowV2",
  "type": "Copy",
  "inputs": [
    {
      "referenceName": "ServiceNowV2Source",
      "type": "DatasetReference"
    }
  ],
  "outputs": [
    {
      "referenceName": "AzureSqlSink",
      "type": "DatasetReference"
    }
  ],
  "typeProperties": {
    "source": {
      "type": "ServiceNowV2Source",
      "query": "sysparm_query=active=true^priority=1^sys_created_on>=javascript:gs.dateGenerate('2025-01-01')",
      "httpRequestTimeout": "00:01:40"  // 100 seconds
    },
    "sink": {
      "type": "AzureSqlSink",
      "writeBehavior": "upsert",
      "upsertSettings": {
        "useTempDB": true,
        "keys": ["sys_id"]
      }
    },
    "enableStaging": true,
    "stagingSettings": {
      "linkedServiceName": {
        "referenceName": "AzureBlobStorage",
        "type": "LinkedServiceReference"
      }
    }
  }
}
Linked Service (OAuth2 - Recommended):
json
{
  "name": "ServiceNowV2LinkedService",
  "type": "Microsoft.DataFactory/factories/linkedservices",
  "properties": {
    "type": "ServiceNowV2",
    "typeProperties": {
      "endpoint": "https://dev12345.service-now.com",
      "authenticationType": "OAuth2",
      "clientId": "your-oauth-client-id",
      "clientSecret": {
        "type": "AzureKeyVaultSecret",
        "store": {
          "referenceName": "AzureKeyVault",
          "type": "LinkedServiceReference"
        },
        "secretName": "servicenow-client-secret"
      },
      "username": "service-account@company.com",
      "password": {
        "type": "AzureKeyVaultSecret",
        "store": {
          "referenceName": "AzureKeyVault",
          "type": "LinkedServiceReference"
        },
        "secretName": "servicenow-password"
      },
      "grantType": "password"
    }
  }
}
Linked Service (Basic Authentication - Legacy):
json
{
  "name": "ServiceNowV2LinkedService_Basic",
  "type": "Microsoft.DataFactory/factories/linkedservices",
  "properties": {
    "type": "ServiceNowV2",
    "typeProperties": {
      "endpoint": "https://dev12345.service-now.com",
      "authenticationType": "Basic",
      "username": "admin",
      "password": {
        "type": "AzureKeyVaultSecret",
        "store": {
          "referenceName": "AzureKeyVault",
          "type": "LinkedServiceReference"
        },
        "secretName": "servicenow-password"
      }
    }
  }
}
Migration from V1 to V2:
  1. Update the linked service type from ServiceNow to ServiceNowV2
  2. Update the source type from ServiceNowSource to ServiceNowV2Source
  3. Test queries in ServiceNow UI's condition builder first
  4. Adjust timeout settings if needed (V2 may have different performance)
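To find every legacy usage before migrating, exported ADF pipeline and linked-service JSON can be scanned for the V1 type names. A minimal sketch (the sample `pipeline` object is illustrative):

```python
LEGACY_TYPES = {"ServiceNow", "ServiceNowSource"}

def find_legacy_servicenow(node, path="$"):
    """Recursively collect JSON paths where a "type" field uses a V1 ServiceNow type."""
    hits = []
    if isinstance(node, dict):
        if node.get("type") in LEGACY_TYPES:
            hits.append(f"{path} -> {node['type']}")
        for key, value in node.items():
            hits.extend(find_legacy_servicenow(value, f"{path}.{key}"))
    elif isinstance(node, list):
        for i, item in enumerate(node):
            hits.extend(find_legacy_servicenow(item, f"{path}[{i}]"))
    return hits

pipeline = {
    "activities": [
        {"type": "Copy", "typeProperties": {"source": {"type": "ServiceNowSource"}}}
    ]
}
# Each hit points at a property to change to ServiceNowV2 / ServiceNowV2Source.
print(find_legacy_servicenow(pipeline))
```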
🚨 重要提示:ServiceNow V1连接器已进入停止支持阶段,请立即迁移至V2版本!
V2版本的核心功能:
  • 原生查询构建器 - 与ServiceNow的条件构建器体验保持一致
  • 性能提升 - 优化的数据提取
  • 更完善的错误处理 - 改进的诊断与重试逻辑
  • OData支持 - 现代API集成模式
复制活动示例:
json
{
  "name": "CopyFromServiceNowV2",
  "type": "Copy",
  "inputs": [
    {
      "referenceName": "ServiceNowV2Source",
      "type": "DatasetReference"
    }
  ],
  "outputs": [
    {
      "referenceName": "AzureSqlSink",
      "type": "DatasetReference"
    }
  ],
  "typeProperties": {
    "source": {
      "type": "ServiceNowV2Source",
      "query": "sysparm_query=active=true^priority=1^sys_created_on>=javascript:gs.dateGenerate('2025-01-01')",
      "httpRequestTimeout": "00:01:40"  // 100 seconds
    },
    "sink": {
      "type": "AzureSqlSink",
      "writeBehavior": "upsert",
      "upsertSettings": {
        "useTempDB": true,
        "keys": ["sys_id"]
      }
    },
    "enableStaging": true,
    "stagingSettings": {
      "linkedServiceName": {
        "referenceName": "AzureBlobStorage",
        "type": "LinkedServiceReference"
      }
    }
  }
}
链接服务(OAuth2 - 推荐):
json
{
  "name": "ServiceNowV2LinkedService",
  "type": "Microsoft.DataFactory/factories/linkedservices",
  "properties": {
    "type": "ServiceNowV2",
    "typeProperties": {
      "endpoint": "https://dev12345.service-now.com",
      "authenticationType": "OAuth2",
      "clientId": "your-oauth-client-id",
      "clientSecret": {
        "type": "AzureKeyVaultSecret",
        "store": {
          "referenceName": "AzureKeyVault",
          "type": "LinkedServiceReference"
        },
        "secretName": "servicenow-client-secret"
      },
      "username": "service-account@company.com",
      "password": {
        "type": "AzureKeyVaultSecret",
        "store": {
          "referenceName": "AzureKeyVault",
          "type": "LinkedServiceReference"
        },
        "secretName": "servicenow-password"
      },
      "grantType": "password"
    }
  }
}
链接服务(基本认证 - 已过时):
json
{
  "name": "ServiceNowV2LinkedService_Basic",
  "type": "Microsoft.DataFactory/factories/linkedservices",
  "properties": {
    "type": "ServiceNowV2",
    "typeProperties": {
      "endpoint": "https://dev12345.service-now.com",
      "authenticationType": "Basic",
      "username": "admin",
      "password": {
        "type": "AzureKeyVaultSecret",
        "store": {
          "referenceName": "AzureKeyVault",
          "type": "LinkedServiceReference"
        },
        "secretName": "servicenow-password"
      }
    }
  }
}
从V1迁移至V2的步骤:
  1. 将链接服务类型从ServiceNow更新为ServiceNowV2
  2. 将源类型从ServiceNowSource更新为ServiceNowV2Source
  3. 先在ServiceNow UI的条件构建器中测试查询
  4. 如有需要,调整超时设置(V2的性能可能有所不同)

Enhanced PostgreSQL Connector

增强版PostgreSQL连接器

Improved performance and features:
json
{
  "name": "PostgreSQLLinkedService",
  "type": "PostgreSql",
  "typeProperties": {
    "connectionString": "host=myserver.postgres.database.azure.com;port=5432;database=mydb;uid=myuser",
    "password": {
      "type": "AzureKeyVaultSecret",
      "store": { "referenceName": "KeyVault" },
      "secretName": "postgres-password"
    },
    // 2025 enhancement
    "enableSsl": true,
    "sslMode": "Require"
  }
}
性能与功能提升:
json
{
  "name": "PostgreSQLLinkedService",
  "type": "PostgreSql",
  "typeProperties": {
    "connectionString": "host=myserver.postgres.database.azure.com;port=5432;database=mydb;uid=myuser",
    "password": {
      "type": "AzureKeyVaultSecret",
      "store": { "referenceName": "KeyVault" },
      "secretName": "postgres-password"
    },
    // 2025 enhancement
    "enableSsl": true,
    "sslMode": "Require"
  }
}

Microsoft Fabric Warehouse Connector (NEW 2025)

Microsoft Fabric Warehouse连接器(2025新增)

🆕 Native support for Microsoft Fabric Warehouse (Q3 2024+)
Supported Activities:
  • ✅ Copy Activity (source and sink)
  • ✅ Lookup Activity
  • ✅ Get Metadata Activity
  • ✅ Script Activity
  • ✅ Stored Procedure Activity
Linked Service Configuration:
json
{
  "name": "FabricWarehouseLinkedService",
  "type": "Microsoft.DataFactory/factories/linkedservices",
  "properties": {
    "type": "Warehouse",  // ✅ NEW dedicated Fabric Warehouse type
    "typeProperties": {
      "endpoint": "myworkspace.datawarehouse.fabric.microsoft.com",
      "warehouse": "MyWarehouse",
      "authenticationType": "ServicePrincipal",  // Recommended
      "servicePrincipalId": "<app-registration-id>",
      "servicePrincipalKey": {
        "type": "AzureKeyVaultSecret",
        "store": {
          "referenceName": "AzureKeyVault",
          "type": "LinkedServiceReference"
        },
        "secretName": "fabric-warehouse-sp-key"
      },
      "tenant": "<tenant-id>"
    }
  }
}
Alternative: Managed Identity Authentication (Preferred)
json
{
  "name": "FabricWarehouseLinkedService_ManagedIdentity",
  "type": "Microsoft.DataFactory/factories/linkedservices",
  "properties": {
    "type": "Warehouse",
    "typeProperties": {
      "endpoint": "myworkspace.datawarehouse.fabric.microsoft.com",
      "warehouse": "MyWarehouse",
      "authenticationType": "SystemAssignedManagedIdentity"
    }
  }
}
Copy Activity Example:
json
{
  "name": "CopyToFabricWarehouse",
  "type": "Copy",
  "inputs": [
    {
      "referenceName": "AzureSqlSource",
      "type": "DatasetReference"
    }
  ],
  "outputs": [
    {
      "referenceName": "FabricWarehouseSink",
      "type": "DatasetReference"
    }
  ],
  "typeProperties": {
    "source": {
      "type": "AzureSqlSource"
    },
    "sink": {
      "type": "WarehouseSink",
      "writeBehavior": "insert",  // or "upsert"
      "writeBatchSize": 10000,
      "tableOption": "autoCreate"  // Auto-create table if not exists
    },
    "enableStaging": true,  // Recommended for large data
    "stagingSettings": {
      "linkedServiceName": {
        "referenceName": "AzureBlobStorage",
        "type": "LinkedServiceReference"
      },
      "path": "staging/fabric-warehouse"
    },
    "translator": {
      "type": "TabularTranslator",
      "mappings": [
        {
          "source": { "name": "CustomerID" },
          "sink": { "name": "customer_id" }
        }
      ]
    }
  }
}
Best Practices for Fabric Warehouse:
  • ✅ Use managed identity for authentication (no secret rotation)
  • ✅ Enable staging for large data loads (> 1GB)
  • ✅ Use tableOption: autoCreate for dynamic schema creation
  • ✅ Leverage Fabric's lakehouse integration for unified analytics
  • ✅ Monitor Fabric capacity units (CU) consumption
🆕 原生支持Microsoft Fabric Warehouse(2024年第三季度及以后版本)
支持的活动:
  • ✅ 复制活动(源和目标)
  • ✅ 查找活动
  • ✅ 获取元数据活动
  • ✅ 脚本活动
  • ✅ 存储过程活动
链接服务配置:
json
{
  "name": "FabricWarehouseLinkedService",
  "type": "Microsoft.DataFactory/factories/linkedservices",
  "properties": {
    "type": "Warehouse",  // ✅ NEW dedicated Fabric Warehouse type
    "typeProperties": {
      "endpoint": "myworkspace.datawarehouse.fabric.microsoft.com",
      "warehouse": "MyWarehouse",
      "authenticationType": "ServicePrincipal",  // Recommended
      "servicePrincipalId": "<app-registration-id>",
      "servicePrincipalKey": {
        "type": "AzureKeyVaultSecret",
        "store": {
          "referenceName": "AzureKeyVault",
          "type": "LinkedServiceReference"
        },
        "secretName": "fabric-warehouse-sp-key"
      },
      "tenant": "<tenant-id>"
    }
  }
}
替代方案:托管身份认证(首选)
json
{
  "name": "FabricWarehouseLinkedService_ManagedIdentity",
  "type": "Microsoft.DataFactory/factories/linkedservices",
  "properties": {
    "type": "Warehouse",
    "typeProperties": {
      "endpoint": "myworkspace.datawarehouse.fabric.microsoft.com",
      "warehouse": "MyWarehouse",
      "authenticationType": "SystemAssignedManagedIdentity"
    }
  }
}
复制活动示例:
json
{
  "name": "CopyToFabricWarehouse",
  "type": "Copy",
  "inputs": [
    {
      "referenceName": "AzureSqlSource",
      "type": "DatasetReference"
    }
  ],
  "outputs": [
    {
      "referenceName": "FabricWarehouseSink",
      "type": "DatasetReference"
    }
  ],
  "typeProperties": {
    "source": {
      "type": "AzureSqlSource"
    },
    "sink": {
      "type": "WarehouseSink",
      "writeBehavior": "insert",  // or "upsert"
      "writeBatchSize": 10000,
      "tableOption": "autoCreate"  // Auto-create table if not exists
    },
    "enableStaging": true,  // Recommended for large data
    "stagingSettings": {
      "linkedServiceName": {
        "referenceName": "AzureBlobStorage",
        "type": "LinkedServiceReference"
      },
      "path": "staging/fabric-warehouse"
    },
    "translator": {
      "type": "TabularTranslator",
      "mappings": [
        {
          "source": { "name": "CustomerID" },
          "sink": { "name": "customer_id" }
        }
      ]
    }
  }
}
Fabric Warehouse最佳实践:
  • ✅ 使用托管身份进行认证(无需密钥轮换)
  • ✅ 针对大型数据加载(>1GB)启用暂存
  • ✅ 使用tableOption: autoCreate实现动态模式创建
  • ✅ 利用Fabric的湖仓集成实现统一分析
  • ✅ 监控Fabric容量单元(CU)的消耗

Enhanced Snowflake Connector

增强版Snowflake连接器

Improved performance:
json
{
  "name": "SnowflakeLinkedService",
  "type": "Snowflake",
  "typeProperties": {
    "connectionString": "jdbc:snowflake://myaccount.snowflakecomputing.com",
    "database": "mydb",
    "warehouse": "mywarehouse",
    "authenticationType": "KeyPair",
    "username": "myuser",
    "privateKey": {
      "type": "AzureKeyVaultSecret",
      "store": { "referenceName": "KeyVault" },
      "secretName": "snowflake-private-key"
    },
    "privateKeyPassphrase": {
      "type": "AzureKeyVaultSecret",
      "store": { "referenceName": "KeyVault" },
      "secretName": "snowflake-passphrase"
    }
  }
}
性能提升:
json
{
  "name": "SnowflakeLinkedService",
  "type": "Snowflake",
  "typeProperties": {
    "connectionString": "jdbc:snowflake://myaccount.snowflakecomputing.com",
    "database": "mydb",
    "warehouse": "mywarehouse",
    "authenticationType": "KeyPair",
    "username": "myuser",
    "privateKey": {
      "type": "AzureKeyVaultSecret",
      "store": { "referenceName": "KeyVault" },
      "secretName": "snowflake-private-key"
    },
    "privateKeyPassphrase": {
      "type": "AzureKeyVaultSecret",
      "store": { "referenceName": "KeyVault" },
      "secretName": "snowflake-passphrase"
    }
  }
}

Managed Identity for Azure Storage (2025)

Azure存储的托管身份支持(2025版)

Azure Table Storage

Azure表存储

Now supports system-assigned and user-assigned managed identity:
json
{
  "name": "AzureTableStorageLinkedService",
  "type": "AzureTableStorage",
  "typeProperties": {
    "serviceEndpoint": "https://mystorageaccount.table.core.windows.net",
    "authenticationType": "ManagedIdentity"  // New in 2025
    // Or user-assigned:
    // "credential": {
    //   "referenceName": "UserAssignedManagedIdentity"
    // }
  }
}
现在支持系统分配和用户分配的托管身份:
json
{
  "name": "AzureTableStorageLinkedService",
  "type": "AzureTableStorage",
  "typeProperties": {
    "serviceEndpoint": "https://mystorageaccount.table.core.windows.net",
    "authenticationType": "ManagedIdentity"  // New in 2025
    // Or user-assigned:
    // "credential": {
    //   "referenceName": "UserAssignedManagedIdentity"
    // }
  }
}

Azure Files

Azure文件存储

Now supports managed identity authentication:
json
{
  "name": "AzureFilesLinkedService",
  "type": "AzureFileStorage",
  "typeProperties": {
    "fileShare": "myshare",
    "accountName": "mystorageaccount",
    "authenticationType": "ManagedIdentity"  // New in 2025
  }
}
现在支持托管身份认证:
json
{
  "name": "AzureFilesLinkedService",
  "type": "AzureFileStorage",
  "typeProperties": {
    "fileShare": "myshare",
    "accountName": "mystorageaccount",
    "authenticationType": "ManagedIdentity"  // New in 2025
  }
}

Mapping Data Flows - Spark 3.3

映射数据流 - Spark 3.3

Spark 3.3 now powers Mapping Data Flows:
Performance Improvements:
  • 30% faster data processing
  • Improved memory management
  • Better partition handling
  • Enhanced join performance
New Features:
  • Adaptive Query Execution (AQE)
  • Dynamic partition pruning
  • Improved caching
  • Better column statistics
json
{
  "name": "DataFlow1",
  "type": "MappingDataFlow",
  "typeProperties": {
    "sources": [
      {
        "dataset": { "referenceName": "SourceDataset" }
      }
    ],
    "transformations": [
      {
        "name": "Transform1"
      }
    ],
    "sinks": [
      {
        "dataset": { "referenceName": "SinkDataset" }
      }
    ]
  }
}
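For reference, the AQE and dynamic partition pruning behaviors listed above correspond to the following Spark 3.3 settings. These are managed internally by the Data Flow runtime and are not something you configure in ADF; they are shown only to clarify what the feature names mean:

```properties
spark.sql.adaptive.enabled=true
spark.sql.optimizer.dynamicPartitionPruning.enabled=true
```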
Spark 3.3现已为映射数据流提供支持:
性能提升:
  • 数据处理速度提升30%
  • 改进的内存管理
  • 更优的分区处理
  • 增强的连接性能
新功能:
  • 自适应查询执行(AQE)
  • 动态分区剪枝
  • 改进的缓存机制
  • 更完善的列统计
json
{
  "name": "DataFlow1",
  "type": "MappingDataFlow",
  "typeProperties": {
    "sources": [
      {
        "dataset": { "referenceName": "SourceDataset" }
      }
    ],
    "transformations": [
      {
        "name": "Transform1"
      }
    ],
    "sinks": [
      {
        "dataset": { "referenceName": "SinkDataset" }
      }
    ]
  }
}

Azure DevOps Server 2022 Support

Azure DevOps Server 2022支持

Git integration now supports on-premises Azure DevOps Server 2022:
json
{
  "name": "DataFactory",
  "properties": {
    "repoConfiguration": {
      "type": "AzureDevOpsGit",
      "accountName": "on-prem-ado-server",
      "projectName": "MyProject",
      "repositoryName": "adf-repo",
      "collaborationBranch": "main",
      "rootFolder": "/",
      "hostName": "https://ado-server.company.com"  // On-premises server
    }
  }
}
Git集成现在支持本地Azure DevOps Server 2022:
json
{
  "name": "DataFactory",
  "properties": {
    "repoConfiguration": {
      "type": "AzureDevOpsGit",
      "accountName": "on-prem-ado-server",
      "projectName": "MyProject",
      "repositoryName": "adf-repo",
      "collaborationBranch": "main",
      "rootFolder": "/",
      "hostName": "https://ado-server.company.com"  // On-premises server
    }
  }
}

🔐 Managed Identity 2025 Best Practices

🔐 2025版托管身份最佳实践

User-Assigned vs System-Assigned Managed Identity

用户分配与系统分配托管身份对比

System-Assigned Managed Identity:
json
{
  "type": "AzureBlobStorage",
  "typeProperties": {
    "serviceEndpoint": "https://mystorageaccount.blob.core.windows.net",
    "accountKind": "StorageV2"
    // ✅ Uses Data Factory's system-assigned identity automatically
  }
}
User-Assigned Managed Identity (NEW 2025):
json
{
  "type": "AzureBlobStorage",
  "typeProperties": {
    "serviceEndpoint": "https://mystorageaccount.blob.core.windows.net",
    "accountKind": "StorageV2",
    "credential": {
      "referenceName": "UserAssignedManagedIdentityCredential",
      "type": "CredentialReference"
    }
  }
}
When to Use User-Assigned:
  • ✅ Sharing identity across multiple data factories
  • ✅ Complex multi-environment setups
  • ✅ Granular permission management
  • ✅ Identity lifecycle independent of data factory
Credential Consolidation (NEW 2025):
ADF now supports a centralized Credentials feature:
json
{
  "name": "ManagedIdentityCredential",
  "type": "Microsoft.DataFactory/factories/credentials",
  "properties": {
    "type": "ManagedIdentity",
    "typeProperties": {
      "resourceId": "/subscriptions/{sub}/resourceGroups/{rg}/providers/Microsoft.ManagedIdentity/userAssignedIdentities/{identity-name}"
    }
  }
}
Benefits:
  • ✅ Consolidate all Microsoft Entra ID-based credentials in one place
  • ✅ Reuse credentials across multiple linked services
  • ✅ Centralized permission management
  • ✅ Easier audit and compliance tracking
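The `resourceId` string in the credential definition above follows the standard ARM path for a user-assigned identity. A small helper sketch that builds it (the function name and the example values are ours, for illustration only):

```python
def user_assigned_identity_resource_id(subscription_id, resource_group, identity_name):
    """Build the ARM resourceId referenced by the credential definition above."""
    return (
        f"/subscriptions/{subscription_id}"
        f"/resourceGroups/{resource_group}"
        f"/providers/Microsoft.ManagedIdentity/userAssignedIdentities/{identity_name}"
    )

# Example with placeholder names:
rid = user_assigned_identity_resource_id(
    "00000000-0000-0000-0000-000000000000", "my-rg", "adf-shared-identity"
)
```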

MFA Enforcement Compatibility (October 2025)

🚨 IMPORTANT: Azure requires MFA for all users by October 2025
Impact on ADF:
  • Managed identities are UNAFFECTED - No MFA required for service accounts
  • ✅ Continue using system-assigned and user-assigned identities without changes
  • Interactive user logins affected - Personal Azure AD accounts need MFA
  • Service principals with certificate auth - Recommended alternative to secrets
Best Practice:
json
{
  "type": "AzureSqlDatabase",
  "typeProperties": {
    "server": "myserver.database.windows.net",
    "database": "mydb",
    "authenticationType": "SystemAssignedManagedIdentity"
    // ✅ No MFA needed, no secret rotation, passwordless
  }
}

Principle of Least Privilege (2025)

Storage Blob Data Roles:
  • Storage Blob Data Reader: read-only access (source)
  • Storage Blob Data Contributor: read/write access (sink)
  • ❌ Avoid Storage Blob Data Owner unless needed
SQL Database Roles:
sql
-- Create contained database user for managed identity
CREATE USER [datafactory-name] FROM EXTERNAL PROVIDER;

-- Grant minimal required permissions
ALTER ROLE db_datareader ADD MEMBER [datafactory-name];
ALTER ROLE db_datawriter ADD MEMBER [datafactory-name];

-- ❌ Avoid db_owner unless truly needed
Key Vault Access Policies:
json
{
  "permissions": {
    "secrets": ["Get"]  // ✅ Only Get permission needed
    // ❌ Don't grant List, Set, Delete unless required
  }
}
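As a quick audit aid, the Get-only rule above can be checked mechanically. A hedged sketch: the policy shape mirrors the JSON above, and this is a local check, not an Azure SDK call:

```python
def excess_secret_permissions(policy):
    """Return secret permissions granted beyond 'Get' (empty list = least privilege)."""
    allowed = {"Get"}
    # Policy dict shape mirrors the access-policy JSON shown above.
    granted = set(policy.get("permissions", {}).get("secrets", []))
    return sorted(granted - allowed)
```

Run it over exported vault policies to flag any entry where the result is non-empty.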

Best Practices (2025)

  1. Use Databricks Job Activity (MANDATORY):
    • ❌ STOP using Notebook, Python, JAR activities
    • ✅ Migrate to DatabricksJob activity immediately
    • ✅ Define workflows in Databricks workspace
    • ✅ Leverage serverless compute (no cluster config needed)
    • ✅ Utilize advanced features (Run As, Task Values, If/Else, Repair Runs)
  2. Managed Identity Authentication (MANDATORY 2025):
    • ✅ Use managed identities for ALL Azure resources
    • ✅ Prefer system-assigned for simple scenarios
    • ✅ Use user-assigned for shared identity needs
    • ✅ Leverage Credentials feature for consolidation
    • ✅ MFA-compliant for October 2025 enforcement
    • ❌ Avoid access keys and connection strings
    • ✅ Store any remaining secrets in Key Vault
  3. Monitor Job Execution:
    • Track Databricks Job run IDs from ADF output
    • Log Job parameters for auditability
    • Set up alerts for job failures
    • Use Databricks job-level monitoring
    • Leverage built-in lineage tracking
  4. Optimize Spark 3.3 Usage (Data Flows):
    • Enable Adaptive Query Execution (AQE)
    • Use appropriate partition counts (4-8 per core)
    • Monitor execution plans in Databricks
    • Use broadcast joins for small dimensions
    • Implement dynamic partition pruning
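For item 3 above, the run identifiers can be pulled from the activity output for logging and alerting. A sketch with a hypothetical output payload (the exact field names, such as runId and runPageUrl, are assumptions for illustration, not a documented ADF contract):

```python
import json

# Hypothetical Databricks Job activity output; field names are illustrative.
activity_output = json.loads("""
{
  "runId": 123456,
  "runPageUrl": "https://adb-1111111111111111.11.azuredatabricks.net/#job/42/run/123456"
}
""")

def run_audit_record(output):
    """Extract the fields worth persisting for auditability and failure alerts."""
    return {
        "run_id": output.get("runId"),
        "run_url": output.get("runPageUrl"),
    }
```

A record like this can feed a log table or an alert rule keyed on missing/failed run IDs.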

Resources
