umap-learn
UMAP-Learn
Overview
UMAP (Uniform Manifold Approximation and Projection) is a dimensionality reduction technique for visualization and general non-linear dimensionality reduction. Apply this skill for fast, scalable embeddings that preserve local and global structure, supervised learning, and clustering preprocessing.
Quick Start
Installation
```bash
uv pip install umap-learn
```
Basic Usage
UMAP follows scikit-learn conventions and can be used as a drop-in replacement for t-SNE or PCA.
```python
import umap
from sklearn.preprocessing import StandardScaler

# Prepare data (standardization is essential)
scaled_data = StandardScaler().fit_transform(data)

# Method 1: single step (fit and transform)
embedding = umap.UMAP().fit_transform(scaled_data)

# Method 2: separate steps (for reusing the trained model)
reducer = umap.UMAP(random_state=42)
reducer.fit(scaled_data)
embedding = reducer.embedding_  # Access the trained embedding
```
**Critical preprocessing requirement:** Always standardize features to comparable scales before applying UMAP to ensure equal weighting across dimensions.
Typical Workflow
```python
import umap
import matplotlib.pyplot as plt
from sklearn.preprocessing import StandardScaler

# 1. Preprocess data
scaler = StandardScaler()
scaled_data = scaler.fit_transform(raw_data)

# 2. Create and fit UMAP
reducer = umap.UMAP(
    n_neighbors=15,
    min_dist=0.1,
    n_components=2,
    metric='euclidean',
    random_state=42
)
embedding = reducer.fit_transform(scaled_data)

# 3. Visualize
plt.scatter(embedding[:, 0], embedding[:, 1], c=labels, cmap='Spectral', s=5)
plt.colorbar()
plt.title('UMAP Embedding')
plt.show()
```
Parameter Tuning Guide
UMAP has four primary parameters that control the embedding behavior. Understanding these is crucial for effective usage.
n_neighbors (default: 15)
Purpose: Balances local versus global structure in the embedding.
How it works: Controls the size of the local neighborhood UMAP examines when learning manifold structure.
Effects by value:
- Low values (2-5): Emphasizes fine local detail but may fragment data into disconnected components
- Medium values (15-20): Balanced view of both local structure and global relationships (recommended starting point)
- High values (50-200): Prioritizes broad topological structure at the expense of fine-grained details
Recommendation: Start with 15 and adjust based on results. Increase for more global structure, decrease for more local detail.
min_dist (default: 0.1)
Purpose: Controls how tightly points cluster in the low-dimensional space.
How it works: Sets the minimum distance apart that points are allowed to be in the output representation.
Effects by value:
- Low values (0.0-0.1): Creates clumped embeddings useful for clustering; reveals fine topological details
- High values (0.5-0.99): Prevents tight packing; emphasizes broad topological preservation over local structure
Recommendation: Use 0.0 for clustering applications, 0.1-0.3 for visualization, 0.5+ for loose structure.
n_components (default: 2)
Purpose: Determines the dimensionality of the embedded output space.
Key feature: Unlike t-SNE, UMAP scales well in the embedding dimension, enabling use beyond visualization.
Common uses:
- 2-3 dimensions: Visualization
- 5-10 dimensions: Clustering preprocessing (better preserves density than 2D)
- 10-50 dimensions: Feature engineering for downstream ML models
Recommendation: Use 2 for visualization, 5-10 for clustering, higher for ML pipelines.
metric (default: 'euclidean')
Purpose: Specifies how distance is calculated between input data points.
Supported metrics:
- Minkowski variants: euclidean, manhattan, chebyshev
- Spatial metrics: canberra, braycurtis, haversine
- Correlation metrics: cosine, correlation (good for text/document embeddings)
- Binary data metrics: hamming, jaccard, dice, russellrao, kulsinski, rogerstanimoto, sokalmichener, sokalsneath, yule
- Custom metrics: User-defined distance functions via Numba
Recommendation: Use euclidean for numeric data, cosine for text/document vectors, hamming for binary data.
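To see why cosine is recommended for text and document vectors, here is a minimal pure-NumPy sketch (the toy "document" vectors are hypothetical): cosine distance compares direction only, so two documents with the same term proportions but different lengths count as identical, while euclidean distance treats them as far apart.

```python
import numpy as np

def cosine_distance(a, b):
    """1 - cosine similarity: compares direction, ignores magnitude."""
    return 1.0 - np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

# Two "documents" with identical term proportions but different lengths
doc_a = np.array([1.0, 2.0, 0.0])
doc_b = np.array([2.0, 4.0, 0.0])

print(cosine_distance(doc_a, doc_b))   # ~0.0: same direction, so "identical"
print(np.linalg.norm(doc_a - doc_b))   # ~2.236: euclidean sees them as distant
```

Passing `metric='cosine'` to UMAP applies this same normalization, which is why document length stops dominating the embedding.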
Parameter Tuning Example
```python
# For visualization with emphasis on local structure
umap.UMAP(n_neighbors=15, min_dist=0.1, n_components=2, metric='euclidean')

# For clustering preprocessing
umap.UMAP(n_neighbors=30, min_dist=0.0, n_components=10, metric='euclidean')

# For document embeddings
umap.UMAP(n_neighbors=15, min_dist=0.1, n_components=2, metric='cosine')

# For preserving global structure
umap.UMAP(n_neighbors=100, min_dist=0.5, n_components=2, metric='euclidean')
```
Supervised and Semi-Supervised Dimension Reduction
UMAP supports incorporating label information to guide the embedding process, enabling class separation while preserving internal structure.
Supervised UMAP
Pass target labels via the `y` parameter when fitting:
```python
# Supervised dimension reduction
embedding = umap.UMAP().fit_transform(data, y=labels)
```
**Key benefits:**
- Achieves cleanly separated classes
- Preserves internal structure within each class
- Maintains global relationships between classes
**When to use:** When you have labeled data and want to separate known classes while keeping meaningful point embeddings.
Semi-Supervised UMAP
For partial labels, mark unlabeled points with `-1`, following scikit-learn convention:
```python
# Create semi-supervised labels
semi_labels = labels.copy()
semi_labels[unlabeled_indices] = -1

# Fit with partial labels
embedding = umap.UMAP().fit_transform(data, y=semi_labels)
```
**When to use:** When labeling is expensive or you have more data than labels available.
Metric Learning with UMAP
Train a supervised embedding on labeled data, then apply to new unlabeled data:
```python
# Train on labeled data
mapper = umap.UMAP().fit(train_data, train_labels)

# Transform unlabeled test data
test_embedding = mapper.transform(test_data)

# Use as feature engineering for a downstream classifier
from sklearn.svm import SVC
clf = SVC().fit(mapper.embedding_, train_labels)
predictions = clf.predict(test_embedding)
```
**When to use:** For supervised feature engineering in machine learning pipelines.
UMAP for Clustering
UMAP serves as effective preprocessing for density-based clustering algorithms like HDBSCAN, overcoming the curse of dimensionality.
Best Practices for Clustering
Key principle: Configure UMAP differently for clustering than for visualization.
Recommended parameters:
- n_neighbors: Increase to ~30 (default 15 is too local and can create artificial fine-grained clusters)
- min_dist: Set to 0.0 (pack points densely within clusters for clearer boundaries)
- n_components: Use 5-10 dimensions (maintains performance while improving density preservation vs. 2D)
Clustering Workflow
```python
import umap
import hdbscan
from sklearn.preprocessing import StandardScaler

# 1. Preprocess data
scaled_data = StandardScaler().fit_transform(data)

# 2. UMAP with clustering-optimized parameters
reducer = umap.UMAP(
    n_neighbors=30,
    min_dist=0.0,
    n_components=10,  # Higher than 2 for better density preservation
    metric='euclidean',
    random_state=42
)
embedding = reducer.fit_transform(scaled_data)

# 3. Apply HDBSCAN clustering
clusterer = hdbscan.HDBSCAN(
    min_cluster_size=15,
    min_samples=5,
    metric='euclidean'
)
labels = clusterer.fit_predict(embedding)

# 4. Evaluate
from sklearn.metrics import adjusted_rand_score
score = adjusted_rand_score(true_labels, labels)
print(f"Adjusted Rand Score: {score:.3f}")
print(f"Number of clusters: {len(set(labels)) - (1 if -1 in labels else 0)}")
print(f"Noise points: {sum(labels == -1)}")
```
Visualization After Clustering
```python
# Create a 2D embedding for visualization (separate from the clustering embedding)
vis_reducer = umap.UMAP(n_neighbors=15, min_dist=0.1, n_components=2, random_state=42)
vis_embedding = vis_reducer.fit_transform(scaled_data)

# Plot with cluster labels
import matplotlib.pyplot as plt
plt.scatter(vis_embedding[:, 0], vis_embedding[:, 1], c=labels, cmap='Spectral', s=5)
plt.colorbar()
plt.title('UMAP Visualization with HDBSCAN Clusters')
plt.show()
```
**Important caveat:** UMAP does not completely preserve density and can create artificial cluster divisions. Always validate and explore the resulting clusters.
Transforming New Data
UMAP enables preprocessing of new data through its `transform()` method, allowing trained models to project unseen data into the learned embedding space.
Basic Transform Usage
```python
# Train on training data
trans = umap.UMAP(n_neighbors=15, random_state=42).fit(X_train)

# Transform test data
test_embedding = trans.transform(X_test)
```
Integration with Machine Learning Pipelines
```python
import umap
from sklearn.svm import SVC
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Split data
X_train, X_test, y_train, y_test = train_test_split(data, labels, test_size=0.2)

# Preprocess
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

# Train UMAP
reducer = umap.UMAP(n_components=10, random_state=42)
X_train_embedded = reducer.fit_transform(X_train_scaled)
X_test_embedded = reducer.transform(X_test_scaled)

# Train a classifier on the embeddings
clf = SVC()
clf.fit(X_train_embedded, y_train)
accuracy = clf.score(X_test_embedded, y_test)
print(f"Test accuracy: {accuracy:.3f}")
```
Important Considerations
Data consistency: The transform method assumes the overall distribution in the higher-dimensional space is consistent between training and test data. When this assumption fails, consider using Parametric UMAP instead.
Performance: Transform operations are efficient (typically <1 second), though initial calls may be slower due to Numba JIT compilation.
Scikit-learn compatibility: UMAP follows standard sklearn conventions and works seamlessly in pipelines:
```python
from sklearn.pipeline import Pipeline
pipeline = Pipeline([
    ('scaler', StandardScaler()),
    ('umap', umap.UMAP(n_components=10)),
    ('classifier', SVC())
])
pipeline.fit(X_train, y_train)
predictions = pipeline.predict(X_test)
```
Advanced Features
Parametric UMAP
Parametric UMAP replaces direct embedding optimization with a learned neural network mapping function.
Key differences from standard UMAP:
- Uses TensorFlow/Keras to train encoder networks
- Enables efficient transformation of new data
- Supports reconstruction via decoder networks (inverse transform)
- Allows custom architectures (CNNs for images, RNNs for sequences)
Installation:
```bash
uv pip install umap-learn[parametric_umap]
```
Requires TensorFlow 2.x.
**Basic usage:**
```python
from umap.parametric_umap import ParametricUMAP

# Default architecture (3-layer, 100-neuron fully-connected network)
embedder = ParametricUMAP()
embedding = embedder.fit_transform(data)

# Transform new data efficiently
new_embedding = embedder.transform(new_data)
```
**Custom architecture:**
```python
import tensorflow as tf
from umap.parametric_umap import ParametricUMAP

# Define a custom encoder
encoder = tf.keras.Sequential([
    tf.keras.layers.InputLayer(input_shape=(input_dim,)),
    tf.keras.layers.Dense(128, activation='relu'),
    tf.keras.layers.Dense(64, activation='relu'),
    tf.keras.layers.Dense(2)  # Output dimension
])
embedder = ParametricUMAP(encoder=encoder, dims=(input_dim,))
embedding = embedder.fit_transform(data)
```
**When to use Parametric UMAP:**
- Need efficient transformation of new data after training
- Require reconstruction capabilities (inverse transforms)
- Want to combine UMAP with autoencoders
- Working with complex data types (images, sequences) that benefit from specialized architectures
**When to use standard UMAP:**
- Need simplicity and quick prototyping
- Dataset is small and computational efficiency isn't critical
- Don't require learned transformations for future data
Inverse Transforms
Inverse transforms enable reconstruction of high-dimensional data from low-dimensional embeddings.
Basic usage:
```python
reducer = umap.UMAP()
embedding = reducer.fit_transform(data)

# Reconstruct high-dimensional data from embedding coordinates
reconstructed = reducer.inverse_transform(embedding)
```
**Important limitations:**
- Computationally expensive operation
- Works poorly outside the convex hull of the embedding
- Accuracy decreases in regions with gaps between clusters
**Use cases:**
- Understanding the structure of embedded data
- Visualizing smooth transitions between clusters
- Exploring interpolations between data points
- Generating synthetic samples in embedding space
**Example: exploring the embedding space:**
```python
import numpy as np

# Create a grid of points in embedding space
x = np.linspace(embedding[:, 0].min(), embedding[:, 0].max(), 10)
y = np.linspace(embedding[:, 1].min(), embedding[:, 1].max(), 10)
xx, yy = np.meshgrid(x, y)
grid_points = np.c_[xx.ravel(), yy.ravel()]

# Reconstruct samples from the grid
reconstructed_samples = reducer.inverse_transform(grid_points)
```
AlignedUMAP
For analyzing temporal or related datasets (e.g., time-series experiments, batch data):
```python
from umap import AlignedUMAP

# List of related datasets
datasets = [day1_data, day2_data, day3_data]

# Create aligned embeddings
mapper = AlignedUMAP().fit(datasets)
aligned_embeddings = mapper.embeddings_  # List of embeddings
```
**When to use:** Comparing embeddings across related datasets while maintaining consistent coordinate systems.
Reproducibility
To ensure reproducible results, always set the `random_state` parameter:
```python
reducer = umap.UMAP(random_state=42)
```
UMAP uses stochastic optimization, so results will vary slightly between runs without a fixed random state.
Common Issues and Solutions
Issue: Disconnected components or fragmented clusters
- Solution: Increase `n_neighbors` to emphasize more global structure

Issue: Clusters too spread out or not well separated
- Solution: Decrease `min_dist` to allow tighter packing

Issue: Poor clustering results
- Solution: Use clustering-specific parameters (n_neighbors=30, min_dist=0.0, n_components=5-10)

Issue: Transform results differ significantly from training
- Solution: Ensure the test data distribution matches training, or use Parametric UMAP

Issue: Slow performance on large datasets
- Solution: Set `low_memory=True` (the default), or reduce dimensionality with PCA first

Issue: All points collapsed into a single cluster
- Solution: Check data preprocessing (ensure proper scaling), and increase `min_dist`
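For the large-dataset case, a common pattern is to compress the data with PCA first and hand the result to UMAP. A minimal scikit-learn-only sketch (the random array is a hypothetical stand-in for a real high-dimensional dataset):

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Stand-in for a large, high-dimensional dataset
rng = np.random.default_rng(42)
data = rng.normal(size=(2000, 500))

# Scale, then compress to ~50 dimensions with PCA before UMAP
scaled = StandardScaler().fit_transform(data)
pca_embedding = PCA(n_components=50, random_state=42).fit_transform(scaled)
print(pca_embedding.shape)  # (2000, 50)

# pca_embedding would then be passed to umap.UMAP(low_memory=True).fit_transform(...)
```

Reducing 500 input dimensions to 50 shrinks the nearest-neighbor search that dominates UMAP's runtime while discarding mostly noise directions.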
Resources
references/
Contains detailed API documentation:
- `api_reference.md`: Complete UMAP class parameters and methods

Load these references when detailed parameter information or advanced method usage is needed.