fabric-lakehouse


When to Use This Skill

Use this skill when you need to:
  • Generate a document or explanation that includes definition and context about Fabric Lakehouse and its capabilities.
  • Design, build, and optimize Lakehouse solutions using best practices.
  • Understand the core concepts and components of a Lakehouse in Microsoft Fabric.
  • Learn how to manage tabular and non-tabular data within a Lakehouse.

Fabric Lakehouse

Core Concepts

What is a Lakehouse?

A Lakehouse in Microsoft Fabric is an item that gives users a single place to store their tabular data (such as tables) and non-tabular data (such as files). It combines the flexibility of a data lake with the management capabilities of a data warehouse. It provides:
  • Unified storage in OneLake for structured and unstructured data
  • Delta Lake format for ACID transactions, versioning, and time travel
  • SQL analytics endpoint for T-SQL queries
  • Semantic model for Power BI integration
  • Support for other table formats such as CSV and Parquet
  • Support for any file format
  • Tools for table optimization and data management
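
Where a concrete path into that unified OneLake storage is needed (for example when reading Lakehouse data from Spark), the Tables and Files sections are reachable through ABFS-style URIs. The sketch below shows one common way to compose them; the workspace and lakehouse names are hypothetical, and the URI layout is an assumption to verify against the OneLake documentation.

```python
# Minimal sketch of composing OneLake ABFS URIs for a Lakehouse.
# "Sales" and "SalesLakehouse" are hypothetical names; the URI layout
# (<workspace>@onelake.dfs.fabric.microsoft.com/<item>.Lakehouse/...) is an
# assumption to check against the current OneLake documentation.

def onelake_path(workspace, lakehouse, section, relative=""):
    """Build an ABFS URI into a Lakehouse's Tables or Files section."""
    if section not in ("Tables", "Files"):
        raise ValueError("section must be 'Tables' or 'Files'")
    base = (
        f"abfss://{workspace}@onelake.dfs.fabric.microsoft.com/"
        f"{lakehouse}.Lakehouse/{section}"
    )
    return f"{base}/{relative}" if relative else base

print(onelake_path("Sales", "SalesLakehouse", "Tables", "orders"))
print(onelake_path("Sales", "SalesLakehouse", "Files", "raw/2024"))
```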

Key Components

  • Delta Tables: Managed tables with ACID compliance and schema enforcement
  • Files: Unstructured/semi-structured data in the Files section
  • SQL Endpoint: Auto-generated read-only SQL interface for querying
  • Shortcuts: Virtual links to external/internal data without copying
  • Fabric Materialized Views: Pre-computed tables for fast query performance

Tabular data in a Lakehouse

Tabular data is stored as tables under the "Tables" folder. The main table format in a Lakehouse is Delta. A Lakehouse can also store tabular data in other formats such as CSV or Parquet, but those formats can be queried only through Spark. Tables can be internal, where the data itself is stored under the "Tables" folder, or external, where only a reference to the table is stored under the "Tables" folder and the data lives in the referenced location. External tables are referenced through Shortcuts, which can be internal (pointing to another location in Fabric) or external (pointing to data stored outside of Fabric).
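
The format rule above (Delta tables are visible to both the SQL analytics endpoint and Spark, while CSV and Parquet tables are Spark-only) can be sketched as a small lookup. The function name and engine labels below are purely illustrative:

```python
# Illustrative sketch of the querying rule described above: Delta tables can
# be read by both the SQL analytics endpoint and Spark, while CSV/Parquet
# tables are queryable only through Spark. Names and labels are hypothetical.

def query_engines(table_format):
    fmt = table_format.lower()
    if fmt == "delta":
        return {"sql-endpoint", "spark"}
    if fmt in ("csv", "parquet"):
        return {"spark"}
    raise ValueError(f"unsupported table format: {table_format}")

print(query_engines("delta"))
print(query_engines("parquet"))
```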

Schemas for tables in a Lakehouse

When creating a lakehouse, users can choose to enable schemas. Schemas organize Lakehouse tables: they are implemented as folders under the "Tables" folder, and tables are stored inside those folders. The default schema is "dbo"; it cannot be deleted or renamed. All other schemas are optional and can be created, renamed, or deleted. Users can reference a schema located in another lakehouse through a Schema Shortcut, which references all tables in the destination schema with a single shortcut.
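
Because schemas are just folders under "Tables", a schema-qualified table name maps directly to a folder path. A minimal sketch of that mapping, assuming the default "dbo" fallback described above:

```python
# Sketch of how a schema-qualified table name maps to its folder in a
# schema-enabled Lakehouse: "sales.orders" lives under Tables/sales/orders,
# and an unqualified name falls back to the default "dbo" schema.

def table_folder(name):
    schema, sep, table = name.partition(".")
    if not sep:
        schema, table = "dbo", name
    if not table:
        raise ValueError(f"invalid table name: {name!r}")
    return f"Tables/{schema}/{table}"

print(table_folder("sales.orders"))  # Tables/sales/orders
print(table_folder("orders"))        # Tables/dbo/orders
```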

Files in a Lakehouse

Files are stored under the "Files" folder. Users can create folders and subfolders to organize their files. Any file format can be stored in a Lakehouse.

Fabric Materialized Views

Fabric materialized views are pre-computed tables that are automatically refreshed on a schedule. They provide fast query performance for complex aggregations and joins. Materialized views are defined using PySpark or Spark SQL and are stored in an associated Notebook.
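
As a hedged sketch, a materialized-view definition kept in the associated Notebook might be composed as a Spark SQL string along these lines. The exact DDL keywords and the names used (`daily_sales`, `dbo.orders`) are assumptions; verify the current Fabric syntax before relying on it.

```python
# Hypothetical sketch of composing the Spark SQL DDL for a materialized view.
# The "CREATE MATERIALIZED VIEW" keyword and all object names here are
# assumptions; check the current Fabric documentation for the exact syntax.

def materialized_view_ddl(view, select_sql):
    return f"CREATE MATERIALIZED VIEW {view}\nAS\n{select_sql}"

ddl = materialized_view_ddl(
    "dbo.daily_sales",
    "SELECT order_date, SUM(amount) AS total\n"
    "FROM dbo.orders\n"
    "GROUP BY order_date",
)
print(ddl)
```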

Spark Views

Spark views are logical tables defined by a SQL query. They store no data; instead they provide a virtual layer for querying. Views are defined using Spark SQL and are stored in the Lakehouse alongside Tables.

Security

Item access or control plane security

Users can have workspace roles (Admin, Member, Contributor, Viewer) that grant different levels of access to a Lakehouse and its contents. Users can also be granted access through the Lakehouse sharing capabilities.

Data access or OneLake Security

For data access, use the OneLake security model, which is based on Microsoft Entra ID (formerly Azure Active Directory) and role-based access control (RBAC). Lakehouse data is stored in OneLake, so access to the data is controlled through OneLake permissions. In addition to object-level permissions, a Lakehouse also supports column-level and row-level security for tables, allowing fine-grained control over which columns or rows of a table a user can see.

Lakehouse Shortcuts

Shortcuts create virtual links to data without copying it:

Types of Shortcuts

  • Internal: Link to other Fabric Lakehouses/tables, cross-workspace data sharing
  • ADLS Gen2: Link to ADLS Gen2 containers in Azure
  • Amazon S3: AWS S3 buckets, cross-cloud data access
  • Dataverse: Microsoft Dataverse, business application data
  • Google Cloud Storage: GCS buckets, cross-cloud data access
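
Shortcuts can also be created programmatically. The sketch below shows the kind of request body the OneLake Shortcuts REST API expects for an ADLS Gen2 shortcut; the field names and endpoint shape are assumptions based on the general Fabric REST style, and all IDs and paths are placeholders, so check the official API reference before use.

```python
# Hedged sketch of a request body for creating an ADLS Gen2 shortcut via the
# OneLake Shortcuts REST API. Field names ("path", "target", "adlsGen2",
# "location", "subpath", "connectionId") are assumptions to verify against
# the official API reference; the account and IDs below are placeholders.

def adls_shortcut_payload(name, location, subpath, connection_id):
    return {
        "name": name,        # shortcut name as it appears in the Lakehouse
        "path": "Files",     # section where the shortcut is created
        "target": {
            "adlsGen2": {
                "location": location,          # e.g. https://<account>.dfs.core.windows.net
                "subpath": subpath,            # container and folder path
                "connectionId": connection_id, # Fabric connection to use
            }
        },
    }

payload = adls_shortcut_payload(
    "raw_landing",
    "https://contoso.dfs.core.windows.net",
    "/landing/sales",
    "00000000-0000-0000-0000-000000000000",
)
print(payload["name"])
```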

Performance Optimization

V-Order Optimization

For faster data reads through the semantic model, enable V-Order optimization on Delta tables. V-Order pre-sorts data in a way that improves query performance for common access patterns.
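
One way V-Order is commonly toggled is through a Spark session setting and a Delta table property. The property names below are assumptions drawn from Fabric's Spark settings; verify them against the current documentation before relying on them.

```python
# Hedged sketch of enabling V-Order. The session config key and the table
# property name are assumptions to verify against current Fabric docs;
# "dbo.orders" is an illustrative table name.

VORDER_SESSION_CONF = ("spark.sql.parquet.vorder.enabled", "true")

def enable_vorder_ddl(table):
    # ALTER TABLE ... SET TBLPROPERTIES is standard Delta SQL; the specific
    # property below is the assumed per-table V-Order switch.
    return (
        f"ALTER TABLE {table} "
        "SET TBLPROPERTIES ('delta.parquet.vorder.enabled' = 'true')"
    )

print(enable_vorder_ddl("dbo.orders"))
```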

Table Optimization

Tables can also be optimized with the OPTIMIZE command, which compacts small files into larger ones and can apply Z-ordering to improve query performance on specific columns. Regular optimization helps maintain performance as data is ingested and updated over time. The VACUUM command cleans up old files and frees storage space, especially after updates and deletes.
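
The maintenance commands above can be composed as Spark SQL strings along these lines. OPTIMIZE with ZORDER BY and VACUUM with RETAIN are standard Delta Lake SQL; the table and column names are illustrative.

```python
# Sketch of the Delta maintenance commands described above, composed as
# Spark SQL strings to run in a Lakehouse notebook. Table and column names
# ("dbo.orders", "customer_id") are illustrative.

def optimize_sql(table, zorder_cols=None):
    sql = f"OPTIMIZE {table}"
    if zorder_cols:
        sql += f" ZORDER BY ({', '.join(zorder_cols)})"
    return sql

def vacuum_sql(table, retain_hours=168):
    # Delta's default retention window is 7 days (168 hours); retaining less
    # requires relaxing the retention-duration safety check.
    return f"VACUUM {table} RETAIN {retain_hours} HOURS"

print(optimize_sql("dbo.orders", ["customer_id"]))
print(vacuum_sql("dbo.orders"))
```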

Lineage

The Lakehouse item supports lineage, which allows users to track the origin and transformations of data. Lineage information is automatically captured for tables and files in Lakehouse, showing how data flows from source to destination. This helps with debugging, auditing, and understanding data dependencies.

PySpark Code Examples

See PySpark code for details.

Getting data into Lakehouse

See Get data for details.