Vaex
Overview
Vaex is a high-performance Python library for lazy, out-of-core DataFrames, built to process and visualize tabular datasets that are too large to fit into RAM. It can compute statistics such as mean, sum, and count at rates of up to a billion rows per second, which enables interactive exploration and analysis of datasets with billions of rows.
When to Use This Skill
Use Vaex when:
- Processing tabular datasets larger than available RAM (gigabytes to terabytes)
- Performing fast statistical aggregations on massive datasets
- Creating visualizations and heatmaps of large datasets
- Building machine learning pipelines on big data
- Converting between data formats (CSV, HDF5, Arrow, Parquet)
- Needing lazy evaluation and virtual columns to avoid memory overhead
- Working with astronomical data, financial time series, or other large-scale scientific datasets
Core Capabilities
Vaex provides six primary capability areas, each documented in detail in the references directory:
1. DataFrames and Data Loading
Load and create Vaex DataFrames from various sources including files (HDF5, CSV, Arrow, Parquet), pandas DataFrames, NumPy arrays, and dictionaries. Reference references/core_dataframes.md for:
- Opening large files efficiently
- Converting from pandas/NumPy/Arrow
- Working with example datasets
- Understanding DataFrame structure
2. Data Processing and Manipulation
Perform filtering, create virtual columns, use expressions, and aggregate data without loading everything into memory. Reference references/data_processing.md for:
- Filtering and selections
- Virtual columns and expressions
- Groupby operations and aggregations
- String operations and datetime handling
- Working with missing data
3. Performance and Optimization
Leverage Vaex's lazy evaluation, caching strategies, and memory-efficient operations. Reference references/performance.md for:
- Understanding lazy evaluation
- Using delay=True for batching operations
- Materializing columns when needed
- Caching strategies
- Asynchronous operations
4. Data Visualization
Create interactive visualizations of large datasets including heatmaps, histograms, and scatter plots. Reference references/visualization.md for:
- Creating 1D and 2D plots
- Heatmap visualizations
- Working with selections
- Customizing plots and subplots
5. Machine Learning Integration
Build ML pipelines with transformers, encoders, and integration with scikit-learn, XGBoost, and other frameworks. Reference references/machine_learning.md for:
- Feature scaling and encoding
- PCA and dimensionality reduction
- K-means clustering
- Integration with scikit-learn/XGBoost/CatBoost
- Model serialization and deployment
6. I/O Operations
Efficiently read and write data in various formats with optimal performance. Reference references/io_operations.md for:
- File format recommendations
- Export strategies
- Working with Apache Arrow
- CSV handling for large files
- Server and remote data access
Quick Start Pattern
For most Vaex tasks, follow this pattern:
```python
import vaex

# 1. Open or create DataFrame
df = vaex.open('large_file.hdf5')  # or .csv, .arrow, .parquet
# OR
df = vaex.from_pandas(pandas_df)

# 2. Explore the data
print(df)      # Shows first/last rows and column info
df.describe()  # Statistical summary

# 3. Create virtual columns (no memory overhead)
df['new_column'] = df.x ** 2 + df.y

# 4. Filter with selections
df_filtered = df[df.age > 25]

# 5. Compute statistics (fast, lazy evaluation)
mean_val = df.x.mean()
stats = df.groupby('category').agg({'value': 'sum'})

# 6. Visualize
df.plot1d(df.x, limits=[0, 100])
df.plot(df.x, df.y, limits='99.7%')

# 7. Export if needed
df.export_hdf5('output.hdf5')
```
Working with References
The reference files contain detailed information about each capability area. Load references into context based on the specific task:
- Basic operations: Start with references/core_dataframes.md and references/data_processing.md
- Performance issues: Check references/performance.md
- Visualization tasks: Use references/visualization.md
- ML pipelines: Reference references/machine_learning.md
- File I/O: Consult references/io_operations.md
Best Practices
- Use HDF5 or Apache Arrow formats for optimal performance with large datasets
- Leverage virtual columns instead of materializing data to save memory
- Batch operations using delay=True when performing multiple calculations
- Export to efficient formats rather than keeping data in CSV
- Use expressions for complex calculations without intermediate storage
- Profile with df.byte_size() to understand memory usage and optimize operations
Common Patterns
Pattern: Converting Large CSV to HDF5
```python
import vaex

# Read the CSV in chunks and convert it to HDF5 on disk (written to 'large_file.csv.hdf5').
# Without convert=True, from_csv loads the whole file into memory through pandas.
df = vaex.from_csv('large_file.csv', convert=True, chunk_size=5_000_000)

# Optionally export under a different name for faster future access
df.export_hdf5('large_file.hdf5')

# Future loads are near-instant because the file is memory-mapped
df = vaex.open('large_file.hdf5')
```
Pattern: Efficient Aggregations
```python
# Use delay=True to batch multiple operations into a single pass over the data
mean_x = df.x.mean(delay=True)
std_y = df.y.std(delay=True)
sum_z = df.z.sum(delay=True)

# Execute all scheduled computations at once
df.execute()

# Each delayed call returned a promise; read the values with .get()
results = [mean_x.get(), std_y.get(), sum_z.get()]
```
Pattern: Virtual Columns for Feature Engineering
```python
# No memory overhead - computed on the fly
df['age_squared'] = df.age ** 2
df['full_name'] = df.first_name + ' ' + df.last_name
df['is_adult'] = df.age >= 18
```
Resources
This skill includes reference documentation in the references/ directory:
- core_dataframes.md - DataFrame creation, loading, and basic structure
- data_processing.md - Filtering, expressions, aggregations, and transformations
- performance.md - Optimization strategies and lazy evaluation
- visualization.md - Plotting and interactive visualizations
- machine_learning.md - ML pipelines and model integration
- io_operations.md - File formats and data import/export