# adf-validation-rules
## 🚨 CRITICAL GUIDELINES
### Windows File Path Requirements
**MANDATORY: Always Use Backslashes on Windows for File Paths**
When using Edit or Write tools on Windows, you MUST use backslashes (`\`) in file paths, NOT forward slashes (`/`).
Examples:
- ❌ WRONG: `D:/repos/project/file.tsx`
- ✅ CORRECT: `D:\repos\project\file.tsx`
This applies to:
- Edit tool file_path parameter
- Write tool file_path parameter
- All file operations on Windows systems
### Documentation Guidelines
NEVER create new documentation files unless explicitly requested by the user.
- Priority: Update existing README.md files rather than creating new documentation
- Repository cleanliness: Keep repository root clean - only README.md unless user requests otherwise
- Style: Documentation should be concise, direct, and professional - avoid AI-generated tone
- User preference: Only create additional .md files when user specifically asks for documentation
## Azure Data Factory Validation Rules and Limitations
### 🚨 CRITICAL: Activity Nesting Limitations
Azure Data Factory has STRICT nesting rules for control flow activities. Violating these rules will cause pipeline failures or prevent pipeline creation.
#### Supported Control Flow Activities for Nesting
Four control flow activities support nested activities:
- ForEach: Iterates over collections and executes activities in a loop
- If Condition: Branches based on true/false evaluation
- Until: Implements do-until loops with timeout options
- Switch: Evaluates activities matching case conditions
#### ✅ PERMITTED Nesting Combinations
| Parent Activity | Can Contain | Notes |
|---|---|---|
| ForEach | If Condition | ✅ Allowed |
| ForEach | Switch | ✅ Allowed |
| Until | If Condition | ✅ Allowed |
| Until | Switch | ✅ Allowed |
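For instance, a ForEach that wraps an If Condition is valid. A minimal sketch in the same simplified JSON style as the examples below (activity names, the `Files` parameter, and the placeholder Wait activity are illustrative, not from any specific pipeline):

```json
{
  "name": "ForEach_Files",
  "type": "ForEach",
  "typeProperties": {
    "items": "@pipeline().parameters.Files",
    "activities": [
      {
        "name": "If_IsCsv",
        "type": "IfCondition",
        "typeProperties": {
          "expression": "@endswith(item().name, '.csv')",
          "ifTrueActivities": [
            {
              "name": "Wait_Placeholder",
              "type": "Wait",
              "typeProperties": { "waitTimeInSeconds": 1 }
            }
          ]
        }
      }
    ]
  }
}
```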
#### ❌ PROHIBITED Nesting Combinations
| Parent Activity | CANNOT Contain | Reason |
|---|---|---|
| If Condition | ForEach | ❌ Not supported - use Execute Pipeline workaround |
| If Condition | Switch | ❌ Not supported - use Execute Pipeline workaround |
| If Condition | Until | ❌ Not supported - use Execute Pipeline workaround |
| If Condition | Another If | ❌ Cannot nest If within If |
| Switch | ForEach | ❌ Not supported - use Execute Pipeline workaround |
| Switch | If Condition | ❌ Not supported - use Execute Pipeline workaround |
| Switch | Until | ❌ Not supported - use Execute Pipeline workaround |
| Switch | Another Switch | ❌ Cannot nest Switch within Switch |
| ForEach | Another ForEach | ❌ Single level only - use Execute Pipeline workaround |
| Until | Another Until | ❌ Single level only - use Execute Pipeline workaround |
| ForEach | Until | ❌ Single level only - use Execute Pipeline workaround |
| Until | ForEach | ❌ Single level only - use Execute Pipeline workaround |
#### 🚫 Special Activity Restrictions
**Validation Activity:**
- ❌ CANNOT be placed inside ANY nested activity
- ❌ CANNOT be used within ForEach, If, Switch, or Until activities
- ✅ Must be at pipeline root level only
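A sketch of a compliant placement, with the Validation activity at the pipeline root gating a downstream ForEach (pipeline, dataset, and activity names are hypothetical):

```json
{
  "name": "PL_WithValidation",
  "activities": [
    {
      "name": "Validate_InputFileExists",
      "type": "Validation",
      "typeProperties": {
        "dataset": { "referenceName": "DS_InputBlob", "type": "DatasetReference" },
        "timeout": "0.00:10:00",
        "sleep": 10,
        "minimumSize": 1
      }
    },
    {
      "name": "ForEach_Process",
      "type": "ForEach",
      "dependsOn": [
        { "activity": "Validate_InputFileExists", "dependencyConditions": ["Succeeded"] }
      ],
      "typeProperties": {
        "items": "@pipeline().parameters.Items",
        "activities": []
      }
    }
  ]
}
```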
#### 🔧 Workaround: Execute Pipeline Pattern
The ONLY supported workaround for prohibited nesting combinations:
Instead of direct nesting, use the Execute Pipeline Activity to call a child pipeline:
```json
{
  "name": "ParentPipeline_WithIfCondition",
  "activities": [
    {
      "name": "IfCondition_Parent",
      "type": "IfCondition",
      "typeProperties": {
        "expression": "@equals(pipeline().parameters.ProcessData, 'true')",
        "ifTrueActivities": [
          {
            "name": "ExecuteChildPipeline_WithForEach",
            "type": "ExecutePipeline",
            "typeProperties": {
              "pipeline": {
                "referenceName": "ChildPipeline_ForEachLoop",
                "type": "PipelineReference"
              },
              "parameters": {
                "ItemList": "@pipeline().parameters.Items"
              }
            }
          }
        ]
      }
    }
  ]
}
```

**Child Pipeline Structure:**
```json
{
  "name": "ChildPipeline_ForEachLoop",
  "parameters": {
    "ItemList": {"type": "array"}
  },
  "activities": [
    {
      "name": "ForEach_InChildPipeline",
      "type": "ForEach",
      "typeProperties": {
        "items": "@pipeline().parameters.ItemList",
        "activities": [
          // Your ForEach logic here
        ]
      }
    }
  ]
}
```

**Why This Works:**
- Each pipeline can have ONE level of nesting
- Execute Pipeline creates a new pipeline context
- Child pipeline gets its own nesting level allowance
- Enables unlimited depth through pipeline chaining
### 🔢 Activity and Resource Limits
#### Pipeline Limits
| Resource | Limit | Notes |
|---|---|---|
| Activities per pipeline | 80 | Includes inner activities for containers |
| Parameters per pipeline | 50 | - |
| ForEach concurrent iterations | 50 (maximum) | Set via `batchCount` |
| ForEach items | 100,000 | - |
| Lookup activity rows | 5,000 | Maximum rows returned |
| Lookup activity size | 4 MB | Maximum size of returned data |
| Web activity timeout | 1 hour | Default timeout for Web activities |
| Copy activity timeout | 7 days | Maximum execution time |
#### ForEach Activity Configuration
```json
{
  "name": "ForEachActivity",
  "type": "ForEach",
  "typeProperties": {
    "items": "@pipeline().parameters.ItemList",
    "isSequential": false,  // false = parallel execution
    "batchCount": 50,       // Max 50 concurrent iterations
    "activities": [
      // Nested activities
    ]
  }
}
```

**Critical Considerations:**
- `isSequential: true` → Executes one item at a time (slow but predictable)
- `isSequential: false` → Executes up to `batchCount` items in parallel
- `batchCount`: Maximum is 50 regardless of setting
- Cannot use Set Variable activity inside parallel ForEach (variable scope is pipeline-level)
#### Set Variable Activity Limitations
❌ CANNOT use `Set Variable` inside ForEach with `isSequential: false`
- Reason: Variables are pipeline-scoped, not ForEach-scoped
- Multiple parallel iterations would cause race conditions
- ✅ Alternative: Use `Append Variable` with an array-type variable, or use sequential execution
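As a sketch of the alternative, an Append Variable activity inside a parallel ForEach accumulates results into a pipeline-level array variable (activity and variable names here are illustrative):

```json
{
  "name": "ForEach_Parallel",
  "type": "ForEach",
  "typeProperties": {
    "items": "@pipeline().parameters.ItemList",
    "isSequential": false,
    "activities": [
      {
        "name": "Append_Result",
        "type": "AppendVariable",
        "typeProperties": {
          "variableName": "Results",   // must be an Array-type pipeline variable
          "value": "@item().name"
        }
      }
    ]
  }
}
```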
### 📊 Linked Services: Azure Blob Storage
#### Authentication Methods
**1. Account Key (Basic)**
```json
{
  "type": "AzureBlobStorage",
  "typeProperties": {
    "connectionString": {
      "type": "SecureString",
      "value": "DefaultEndpointsProtocol=https;AccountName=<account>;AccountKey=<key>"
    }
  }
}
```

**⚠️ Limitations:**
- Secondary Blob service endpoints are NOT supported
- Security Risk: Account keys should be stored in Azure Key Vault
**2. Shared Access Signature (SAS)**
```json
{
  "type": "AzureBlobStorage",
  "typeProperties": {
    "sasUri": {
      "type": "SecureString",
      "value": "https://<account>.blob.core.windows.net/<container>?<SAS-token>"
    }
  }
}
```

**Critical Requirements:**
- Dataset `folderPath` must be an absolute path from container level
- SAS token expiry must extend beyond pipeline execution
- SAS URI path must align with dataset configuration
**3. Service Principal**
```json
{
  "type": "AzureBlobStorage",
  "typeProperties": {
    "serviceEndpoint": "https://<account>.blob.core.windows.net",
    "accountKind": "StorageV2", // REQUIRED for service principal
    "servicePrincipalId": "<client-id>",
    "servicePrincipalCredential": {
      "type": "SecureString",
      "value": "<client-secret>"
    },
    "tenant": "<tenant-id>"
  }
}
```

**Critical Requirements:**
- `accountKind` MUST be set (StorageV2, BlobStorage, or BlockBlobStorage)
- Service Principal requires Storage Blob Data Reader (source) or Storage Blob Data Contributor (sink) role
- ❌ NOT compatible with soft-deleted blob accounts in Data Flow
**4. Managed Identity (Recommended)**
```json
{
  "type": "AzureBlobStorage",
  "typeProperties": {
    "serviceEndpoint": "https://<account>.blob.core.windows.net",
    "accountKind": "StorageV2" // REQUIRED for managed identity
  },
  "connectVia": {
    "referenceName": "AutoResolveIntegrationRuntime",
    "type": "IntegrationRuntimeReference"
  }
}
```

**Critical Requirements:**
- `accountKind` MUST be specified (cannot be empty or "Storage")
- ❌ Empty or "Storage" account kind will cause Data Flow failures
- Managed identity must have Storage Blob Data Reader/Contributor role assigned
- For Storage firewall: Must enable "Allow trusted Microsoft services"
#### Common Blob Storage Pitfalls
| Issue | Cause | Solution |
|---|---|---|
| Data Flow fails with managed identity | `accountKind` empty or set to "Storage" | Set `accountKind` to `StorageV2` |
| Secondary endpoint doesn't work | Using account key auth | Not supported - use different auth method |
| SAS token expired during run | Token expiry too short | Extend SAS token validity period |
| Cannot access $logs container | System container not visible in UI | Use direct path reference |
| Soft-deleted blobs inaccessible | Service principal/managed identity | Use account key or SAS instead |
| Private endpoint connection fails | Wrong endpoint for Data Flow | Ensure ADLS Gen2 private endpoint exists |
### 📊 Linked Services: Azure SQL Database
#### Authentication Methods
**1. SQL Authentication**
```json
{
  "type": "AzureSqlDatabase",
  "typeProperties": {
    "server": "<server-name>.database.windows.net",
    "database": "<database-name>",
    "authenticationType": "SQL",
    "userName": "<username>",
    "password": {
      "type": "SecureString",
      "value": "<password>"
    }
  }
}
```

**Best Practice:**
- Store password in Azure Key Vault
- Use connection string with Key Vault reference
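A sketch of the Key Vault pattern, replacing the inline SecureString password with a secret reference (the linked service name `AzureKeyVaultLS` and the secret name are placeholders):

```json
{
  "type": "AzureSqlDatabase",
  "typeProperties": {
    "server": "<server-name>.database.windows.net",
    "database": "<database-name>",
    "authenticationType": "SQL",
    "userName": "<username>",
    "password": {
      "type": "AzureKeyVaultSecret",
      "store": {
        "referenceName": "AzureKeyVaultLS",
        "type": "LinkedServiceReference"
      },
      "secretName": "sql-admin-password"
    }
  }
}
```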
**2. Service Principal**
```json
{
  "type": "AzureSqlDatabase",
  "typeProperties": {
    "server": "<server-name>.database.windows.net",
    "database": "<database-name>",
    "authenticationType": "ServicePrincipal",
    "servicePrincipalId": "<client-id>",
    "servicePrincipalCredential": {
      "type": "SecureString",
      "value": "<client-secret>"
    },
    "tenant": "<tenant-id>"
  }
}
```

**Requirements:**
- Microsoft Entra admin must be configured on SQL server
- Service principal must have a contained database user created
- Grant appropriate roles: `db_datareader`, `db_datawriter`, etc.
**3. Managed Identity**
```json
{
  "type": "AzureSqlDatabase",
  "typeProperties": {
    "server": "<server-name>.database.windows.net",
    "database": "<database-name>",
    "authenticationType": "SystemAssignedManagedIdentity"
  }
}
```

**Requirements:**
- Create contained database user for managed identity
- Grant appropriate database roles
- Configure firewall to allow Azure services (or specific IP ranges)
#### SQL Database Configuration Best Practices
##### Connection String Parameters
```
Server=tcp:<server>.database.windows.net,1433;
Database=<database>;
Encrypt=mandatory;        // Options: mandatory, optional, strict
TrustServerCertificate=false;
ConnectTimeout=30;
CommandTimeout=120;
Pooling=true;
ConnectRetryCount=3;
ConnectRetryInterval=10;
```

**Critical Parameters:**
- `Encrypt`: Default is `mandatory` (recommended)
- `Pooling`: Set to `false` if experiencing idle connection issues
- `ConnectRetryCount`: Recommended for transient fault handling
- `ConnectRetryInterval`: Seconds between retries
#### Common SQL Database Pitfalls
| Issue | Cause | Solution |
|---|---|---|
| Serverless tier auto-paused | Pipeline doesn't wait for resume | Implement retry logic or keep-alive |
| Connection pool timeout | Idle connections closed | Add `ConnectRetryCount`/`ConnectRetryInterval` or set `Pooling=false` |
| Firewall blocks connection | IP not whitelisted | Add Azure IR IPs or enable Azure services |
| Always Encrypted fails in Data Flow | Not supported for sink | Use service principal/managed identity in copy activity |
| Decimal precision loss | Copy supports up to 28 precision | Use string type for higher precision |
| Parallel copy not working | No partition configuration | Enable physical or dynamic range partitioning |
#### Performance Optimization
##### Parallel Copy Configuration
```json
{
  "source": {
    "type": "AzureSqlSource",
    "partitionOption": "PhysicalPartitionsOfTable" // or "DynamicRange"
  },
  "parallelCopies": 8, // Recommended: (DIU or IR nodes) × (2 to 4)
  "enableStaging": true,
  "stagingSettings": {
    "linkedServiceName": {
      "referenceName": "AzureBlobStorage",
      "type": "LinkedServiceReference"
    }
  }
}
```

**Partition Options:**
- `PhysicalPartitionsOfTable`: Uses SQL Server physical partitions
- `DynamicRange`: Creates logical partitions based on column values
- `None`: No partitioning (default)
**Staging Best Practices:**
- Always use staging for large data movements (> 1GB)
- Use PolyBase or COPY statement for best performance
- Parquet format recommended for staging files
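A sketch of the `DynamicRange` variant, which partitions on a column rather than physical partitions (the table, column, and bounds are illustrative; the `?AdfDynamicRangePartitionCondition` placeholder is substituted by the service at runtime):

```json
{
  "source": {
    "type": "AzureSqlSource",
    "partitionOption": "DynamicRange",
    "partitionSettings": {
      "partitionColumnName": "OrderId",
      "partitionLowerBound": "1",
      "partitionUpperBound": "1000000"
    },
    "sqlReaderQuery": "SELECT * FROM dbo.Orders WHERE ?AdfDynamicRangePartitionCondition"
  },
  "parallelCopies": 8
}
```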
### 🔍 Data Flow Limitations
#### General Limits
- Column name length: 128 characters maximum
- Row size: 1 MB maximum (some sinks like SQL have lower limits)
- String column size: Varies by sink (SQL: 8000 for varchar, 4000 for nvarchar)
#### Transformation-Specific Limits
| Transformation | Limitation |
|---|---|
| Lookup | Cache size limited by cluster memory |
| Join | Large joins may cause memory errors |
| Pivot | Maximum 10,000 unique values |
| Window | Requires partitioning for large datasets |
#### Performance Considerations
- Partitioning: Always partition large datasets before transformations
- Broadcast: Use broadcast hint for small dimension tables
- Sink optimization: Enable table option "Recreate" instead of "Truncate" for better performance
### 🛡️ Validation Checklist for Pipeline Creation
#### Before Creating Pipeline
- Verify activity nesting follows permitted combinations
- Check ForEach activities don't contain other ForEach/Until
- Verify If/Switch activities don't contain ForEach/Until/If/Switch
- Ensure Validation activities are at pipeline root level only
- Confirm total activities < 80 per pipeline
- Verify no Set Variable activities in parallel ForEach
#### Linked Service Validation
- Blob Storage: If using managed identity/service principal, `accountKind` is set
- SQL Database: Authentication method matches security requirements
- All services: Secrets stored in Key Vault, not hardcoded
- All services: Firewall rules configured for integration runtime IPs
- Network: Private endpoints configured if using VNet integration
#### Activity Configuration Validation
- ForEach: `batchCount` ≤ 50 if parallel execution
- Lookup: Query returns < 5000 rows and < 4 MB data
- Copy: DIU configured appropriately (2-256 for Azure IR)
- Copy: Staging enabled for large data movements
- All activities: Timeout values appropriate for expected execution time
- All activities: Retry logic configured for transient failures
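Timeout and retry are configured per activity via its `policy` block; a sketch for a Copy activity (the values shown are illustrative, not recommendations for any particular workload):

```json
{
  "name": "Copy_WithRetry",
  "type": "Copy",
  "policy": {
    "timeout": "0.02:00:00",     // d.hh:mm:ss format
    "retry": 3,                  // retry attempts on transient failures
    "retryIntervalInSeconds": 60
  },
  "typeProperties": {
    // source/sink configuration here
  }
}
```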
#### Data Flow Validation
- Column names ≤ 128 characters
- Source query doesn't return > 1 MB per row
- Partitioning configured for large datasets
- Sink has appropriate schema and data type mappings
- Staging linked service configured for optimal performance
### 🔍 Automated Validation Script
**CRITICAL**: Always run automated validation before committing or deploying ADF pipelines!
The adf-master plugin includes a comprehensive PowerShell validation script that checks for ALL the rules and limitations documented above.
#### Using the Validation Script
**Location:** `${CLAUDE_PLUGIN_ROOT}/scripts/validate-adf-pipelines.ps1`

**Basic usage:**
```powershell
# From the root of your ADF repository
pwsh -File validate-adf-pipelines.ps1
```

**With custom paths:**
```powershell
pwsh -File validate-adf-pipelines.ps1 `
  -PipelinePath "path/to/pipeline" `
  -DatasetPath "path/to/dataset"
```

**With strict mode (additional warnings):**
```powershell
pwsh -File validate-adf-pipelines.ps1 -Strict
```

#### What the Script Validates
The automated validation script checks for issues that Microsoft's official `@microsoft/azure-data-factory-utilities` package does NOT validate:
1. **Activity Nesting Violations:**
   - ForEach → ForEach, Until, Validation
   - Until → Until, ForEach, Validation
   - IfCondition → ForEach, If, IfCondition, Switch, Until, Validation
   - Switch → ForEach, If, IfCondition, Switch, Until, Validation
2. **Resource Limits:**
   - Pipeline activity count (max 120, warn at 100)
   - Pipeline parameter count (max 50)
   - Pipeline variable count (max 50)
   - ForEach batchCount limit (max 50, warn at 30 in strict mode)
3. **Variable Scope Violations:**
   - SetVariable in parallel ForEach (causes race conditions)
   - Proper AppendVariable vs SetVariable usage
4. **Dataset Configuration Issues:**
   - Missing fileName or wildcardFileName for file-based datasets
   - AzureBlobFSLocation missing required fileSystem property
   - Missing required properties for DelimitedText, Json, Parquet types
5. **Copy Activity Validations:**
   - Source/sink type compatibility with dataset types
   - Lookup activity firstRowOnly=false warnings (5000 row/4MB limits)
   - Blob file dependencies (additionalColumns logging pattern)
#### Integration with CI/CD
**GitHub Actions example:**
```yaml
- name: Validate ADF Pipelines
  run: |
    pwsh -File validate-adf-pipelines.ps1 -PipelinePath pipeline -DatasetPath dataset
  shell: pwsh
```

**Azure DevOps example:**
```yaml
- task: PowerShell@2
  displayName: 'Validate ADF Pipelines'
  inputs:
    filePath: 'validate-adf-pipelines.ps1'
    arguments: '-PipelinePath pipeline -DatasetPath dataset'
    pwsh: true
```

#### Command Reference
Use the `/adf-validate` command to run the validation script with proper guidance:

```bash
/adf-validate
```

This command will:
- Detect your ADF repository structure
- Run the validation script with appropriate paths
- Parse and explain any errors or warnings found
- Provide specific solutions for each violation
- Recommend next actions based on results
- Suggest CI/CD integration patterns
#### Exit Codes
- 0: Validation passed (no errors)
- 1: Validation failed (errors found - DO NOT DEPLOY)
#### Best Practices
- Run validation before every commit to catch issues early
- Add validation to CI/CD pipeline to prevent invalid deployments
- Use strict mode during development for additional warnings
- Re-validate after bulk changes or generated pipelines
- Document validation exceptions if you must bypass a warning
- Share validation results with team to prevent repeated mistakes
### 🚨 CRITICAL: Enforcement Protocol
When creating or modifying ADF pipelines:
- ALWAYS validate activity nesting against the permitted/prohibited table
- REJECT any attempt to create prohibited nesting combinations
- SUGGEST Execute Pipeline workaround for complex nesting needs
- VALIDATE linked service authentication matches the connector type
- CHECK all limits (activities, parameters, ForEach iterations, etc.)
- VERIFY required properties are set (e.g., `accountKind` for managed identity)
- WARN about common pitfalls specific to the connector being used
**Example Validation Response:**

```text
❌ INVALID PIPELINE STRUCTURE DETECTED:

Issue: ForEach activity contains another ForEach activity
Location: Pipeline "PL_DataProcessing" → ForEach "OuterLoop" → ForEach "InnerLoop"

This violates Azure Data Factory nesting rules:
- ForEach activities support only a SINGLE level of nesting
- You CANNOT nest ForEach within ForEach

✅ RECOMMENDED SOLUTION:
Use the Execute Pipeline pattern:
1. Create a child pipeline with the inner ForEach logic
2. Replace the inner ForEach with an Execute Pipeline activity
3. Pass required parameters to the child pipeline

Would you like me to generate the refactored pipeline structure?
```
### 📚 Reference Documentation
Official Microsoft Learn Resources:
- Activity nesting: https://learn.microsoft.com/en-us/azure/data-factory/concepts-nested-activities
- Blob Storage connector: https://learn.microsoft.com/en-us/azure/data-factory/connector-azure-blob-storage
- SQL Database connector: https://learn.microsoft.com/en-us/azure/data-factory/connector-azure-sql-database
- Pipeline limits: https://learn.microsoft.com/en-us/azure/azure-resource-manager/management/azure-subscription-service-limits#data-factory-limits
Last Updated: 2025-01-24 (Based on official Microsoft documentation)
This validation rules skill MUST be consulted before creating or modifying ANY Azure Data Factory pipeline to ensure compliance with platform limitations and best practices.
### Progressive Disclosure References
For detailed validation matrices and resource limits, see:
- **Nesting Rules**: `references/nesting-rules.md` - Complete matrix of permitted and prohibited activity nesting combinations with workaround patterns
- **Resource Limits**: `references/resource-limits.md` - Complete reference for all ADF limits (pipeline, activity, trigger, data flow, integration runtime, expression, API)