multilabel-classification
<!-- Bundled files (accessible via ${CLAUDE_SKILL_DIR}):
- SKILL.md — this file
- scripts/demo.py — runnable marimo notebook with worked example
-->
# Multilabel Classification with XGBoost (Done Right)
Multilabel ≠ multiclass. Multiclass picks one class from N. Multilabel
predicts any subset of N labels — each row can have zero, one, or
many labels active simultaneously. The metrics, the model wrapping, and
the failure modes are all different.

For tabular multilabel, default to XGBoost wrapped in
`MultiOutputClassifier`: it fits one independent XGBoost model per
label. Simple, fast, and competitive. Switch to `ClassifierChain` only
when labels are correlated and the ordering is meaningful.
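A toy sketch of the shape difference (illustrative arrays, not from the bundled demo):

```python
import numpy as np

# Multiclass target: one integer class per row.
y_multiclass = np.array([2, 0, 1])  # shape (3,)

# Multilabel target: a binary indicator matrix, any subset per row.
Y_multilabel = np.array([
    [1, 0, 1],  # two labels on
    [0, 0, 0],  # zero labels is valid
    [1, 1, 1],  # all three on
])  # shape (3, 3)
```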
## When to use this skill
- Each row can have multiple labels (tags on a post, attributes of a product, diseases in a patient, content moderation categories)
- The labels are NOT mutually exclusive
- The features are tabular
- You have at least a few hundred examples per label (rare labels will underperform — see pitfalls)
## When NOT to use this skill
- Each row has exactly one label → see `multiclass-classification`
- Two labels exactly → see `binary-classification`
- The labels are extremely strongly correlated and you have a natural ordering (e.g. hierarchical taxonomies) → consider tree-based multi-target methods or label powerset
- Extreme multilabel (> 1000 labels) → specialized algorithms outside this skill's scope
## Project layout
```
<project>/
├── data/
├── src/
│   ├── train.py      # ibis read → MultiOutputClassifier(XGBClassifier) → MLflow
│   ├── predict.py    # reload, return per-row label vector + per-label probas
│   └── plots.py      # label balance, co-occurrence, per-label metrics, cardinality
├── notebooks/
│   └── demo.py
└── mlruns/
```

## Data access — same ibis pattern
```python
import ibis

table = ibis.duckdb.connect().read_parquet("data/train.parquet")

feature_cols = [c for c in table.columns if c.startswith("feature_")]
label_cols = [c for c in table.columns if c.startswith("label_")]

data = (
    table
    .select(*feature_cols, *label_cols)
    .execute()
)

X = data[feature_cols]
Y = data[label_cols].to_numpy().astype(int)  # shape: (n_samples, n_labels)
```

The target `Y` is now a matrix, not a vector. That's the core
shape difference from the other classification skills.

## The pipeline — MultiOutputClassifier
```python
from sklearn.compose import ColumnTransformer
from sklearn.multioutput import MultiOutputClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from xgboost import XGBClassifier


def build_pipeline(feature_cols, seed):
    return Pipeline([
        ("preprocess", ColumnTransformer([("num", StandardScaler(), feature_cols)])),
        ("clf", MultiOutputClassifier(
            XGBClassifier(
                n_estimators=300,
                max_depth=4,
                learning_rate=0.05,
                subsample=0.8,
                colsample_bytree=0.8,
                reg_lambda=1.0,
                objective="binary:logistic",
                eval_metric="logloss",
                random_state=seed,
                n_jobs=-1,
            ),
            n_jobs=-1,  # parallelize across labels
        )),
    ])
```
## The five things that separate this from a tutorial
### 1. Hamming loss is the primary metric, NOT subset accuracy
Subset accuracy is "did we predict the exact set of labels for this
row?" — all of them right or none. On a 6-label problem with an
average of 2 labels per row, even getting 90% per-label accuracy gives
you only ~50% subset accuracy. Subset accuracy is brutally pessimistic
and will mislead you about model quality.

Hamming loss is the average per-label-slot error rate:

```python
from sklearn.metrics import hamming_loss, accuracy_score

ham = hamming_loss(Y_test, Y_pred)            # primary metric, lower = better
exact_match = accuracy_score(Y_test, Y_pred)  # subset accuracy — too strict alone
```

Report both, but optimize for hamming loss + per-label F1, not subset
accuracy.
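The ~50% number is easy to check with a quick simulation (synthetic data, assuming independent 10% per-slot errors; not from `demo.py`):

```python
import numpy as np
from sklearn.metrics import accuracy_score, hamming_loss

rng = np.random.default_rng(0)
n, k = 10_000, 6
Y_true = rng.integers(0, 2, size=(n, k))

# Simulate a model that gets each label slot right 90% of the time.
flip = rng.random((n, k)) < 0.10
Y_pred = np.where(flip, 1 - Y_true, Y_true)

ham = hamming_loss(Y_true, Y_pred)       # ≈ 0.10
subset = accuracy_score(Y_true, Y_pred)  # ≈ 0.9**6 ≈ 0.53
```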
### 2. Four F1 averages — yes, four, not three
For multilabel, sklearn's `f1_score` supports a fourth averaging
strategy you don't have in multiclass: `samples`. Each averaging
strategy answers a different question:

| Average | What it computes | When to use |
|---|---|---|
| macro | Unweighted mean of per-label F1 | All labels matter equally — rare labels drag the average down (good) |
| micro | F1 over the pooled TP / FP / FN across all label slots | Overall correctness across all label slots |
| weighted | Per-label F1 weighted by support | Weights toward common labels — hides rare-label failures |
| samples | Per-row F1, then averaged across rows | Per-row "did we get the labels mostly right?" — useful for tagging tasks |
```python
from sklearn.metrics import f1_score

f1_macro = f1_score(Y_test, Y_pred, average="macro", zero_division=0)
f1_micro = f1_score(Y_test, Y_pred, average="micro", zero_division=0)
f1_weighted = f1_score(Y_test, Y_pred, average="weighted", zero_division=0)
f1_samples = f1_score(Y_test, Y_pred, average="samples", zero_division=0)
```

Default to macro F1 for the same reason as multiclass: it surfaces
rare-label failures that the other averages hide.
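A hypothetical toy set shows how far the averages diverge when one rare label fails completely (arrays are made up for illustration):

```python
import numpy as np
from sklearn.metrics import f1_score

# 3-label toy: two common labels predicted perfectly, one rare label
# (10 positives out of 100 rows) never predicted at all.
Y_test = np.array([[1, 0, 0]] * 45 + [[0, 1, 0]] * 45 + [[1, 1, 1]] * 10)
Y_pred = Y_test.copy()
Y_pred[:, 2] = 0  # rare label: per-label F1 = 0

macro = f1_score(Y_test, Y_pred, average="macro", zero_division=0)  # (1 + 1 + 0) / 3 ≈ 0.67
micro = f1_score(Y_test, Y_pred, average="micro", zero_division=0)  # 220 / 230 ≈ 0.96
```

Micro barely notices the dead label; macro drops by a third. That asymmetry is exactly why macro is the default here.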
### 3. Label co-occurrence matters — and points at when to use ClassifierChain
If labels are independent (like `make_multilabel_classification`'s
default), `MultiOutputClassifier` is optimal. If labels are correlated
("if label_3 is on, label_5 is also on 80% of the time"), the
independent-models assumption is suboptimal — you're throwing away
information.

The conditional co-occurrence matrix P(label_j | label_i) reveals
this:

```python
import numpy as np

n_labels = Y.shape[1]
cooc = np.zeros((n_labels, n_labels))
for i in range(n_labels):
    i_count = int(Y[:, i].sum())
    if i_count == 0:
        continue
    for j in range(n_labels):
        cooc[i, j] = float(((Y[:, i] == 1) & (Y[:, j] == 1)).sum() / i_count)
```

`cooc[i, j]` = "given label_i is on, how often is label_j also on?"

If most off-diagonal entries hover around the marginal P(label_j),
labels are roughly independent → use `MultiOutputClassifier`. If
some off-diagonal entries are much higher than the marginals, labels
are correlated → consider `ClassifierChain`.
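One way to make "much higher than the marginals" concrete is a lift ratio. A sketch (the `cooccurrence` and `cooccurrence_lift` helpers are illustrative, not part of the skill) that vectorizes the loop and divides by the marginals:

```python
import numpy as np

def cooccurrence(Y):
    """P(label_j = 1 | label_i = 1), vectorized."""
    counts = Y.T @ Y                             # pairwise co-occurrence counts
    pos = Y.sum(axis=0)
    return counts / np.maximum(pos, 1)[:, None]  # guard against empty labels

def cooccurrence_lift(Y):
    """Conditional / marginal ratio; entries well above 1 flag correlated pairs."""
    marginals = np.maximum(Y.mean(axis=0), 1e-9)
    return cooccurrence(Y) / marginals[None, :]
```

A lift near 1 everywhere off-diagonal says the labels are close to independent; a pocket of lifts at 2x or more is the signal to benchmark `ClassifierChain`.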
### 4. ClassifierChain for correlated labels
```python
from sklearn.multioutput import ClassifierChain

clf_chain = ClassifierChain(
    XGBClassifier(...),
    order=[0, 1, 2, 3, 4, 5],  # or "random" for cross-validated stability
    random_state=42,
)
```

**Catch:** chain order matters. Different orders give different
results. Common practice: train multiple chains with random orders,
average their predictions (this is the "ensemble of classifier
chains" trick). For most production systems, `MultiOutputClassifier`
is good enough and much simpler — only switch to `ClassifierChain`
when you can measure that label correlations actually exist and
matter for your accuracy.
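A minimal sketch of the ensemble-of-chains trick, using `LogisticRegression` on synthetic data as a light stand-in so it runs fast; in practice swap in the XGBClassifier config from the pipeline section:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.multioutput import ClassifierChain

# Synthetic 6-label data with roughly balanced labels (illustration only).
rng = np.random.default_rng(0)
X = rng.normal(size=(400, 8))
Y = (X[:, :6] + rng.normal(scale=0.5, size=(400, 6)) > 0).astype(int)

# Five chains, each with its own random label order.
chains = [
    ClassifierChain(LogisticRegression(max_iter=1000), order="random", random_state=i)
    for i in range(5)
]
for chain in chains:
    chain.fit(X, Y)

# Average per-label probabilities across chains, then threshold.
proba = np.mean([chain.predict_proba(X) for chain in chains], axis=0)
Y_pred = (proba >= 0.5).astype(int)
```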
### 5. Per-label monitoring — every label needs its own F1
In multilabel, each label has its own positive rate, its own
imbalance, and its own difficulty. A model can have great macro F1
overall while one specific rare label is at F1 = 0.

Always log per-label F1 to MLflow as separate metrics:

```python
for i, lbl in enumerate(label_cols):
    f1_i = float(f1_score(Y_test[:, i], Y_pred[:, i], average="binary", zero_division=0))
    mlflow.log_metric(f"test_f1__{lbl}", f1_i)
```

This is the multilabel version of "per-class F1" from the multiclass
skill — same idea, but each label is genuinely independent so the
imbalance can vary wildly across labels.
## MLflow logging
| Kind | What |
|---|---|
| Params | data path, n_rows, n_features, n_labels, label_columns, seed, hyperparameters |
| Metrics | hamming_loss (primary), subset_accuracy, F1 macro / micro / weighted / samples, per-label F1 (one metric per label), label cardinality (true vs predicted) |
| Tags | data hash, label cardinality / density from sidecar |
| Artifacts | model, label balance bar, co-occurrence heatmap, per-label metrics bar, label cardinality histogram (true vs pred) |
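Label cardinality in the metrics row is just the mean row sum; a minimal helper (the `label_cardinality` name is mine):

```python
import numpy as np

def label_cardinality(Y):
    """Average number of active labels per row."""
    return float(np.asarray(Y).sum(axis=1).mean())

# Log true vs predicted so drift is visible in MLflow, e.g.:
# mlflow.log_metric("label_cardinality_true", label_cardinality(Y_test))
# mlflow.log_metric("label_cardinality_pred", label_cardinality(Y_pred))
```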
## Common pitfalls
- Reporting subset accuracy as the primary metric. It's brutally strict and will make every model look bad. Use hamming loss + per-label F1.
- Using a per-row threshold instead of per-label. Each label has its own optimal threshold. `MultiOutputClassifier`'s default is 0.5 per label, which is rarely right for any of them. For cost-sensitive applications, tune per-label thresholds on a validation set.
- Not parallelizing across labels. `MultiOutputClassifier(..., n_jobs=-1)` fits labels in parallel — it's free speed for independent labels.
- Forgetting `zero_division=0`. If a label has no positives in the test set (or no predicted positives), F1 is undefined. The default is to warn; set `zero_division=0` to silently treat undefined F1 as 0.
- Imbalanced rare labels with no special handling. Per-label `scale_pos_weight` works, but `MultiOutputClassifier` doesn't make it easy to vary per label. Either use `sample_weight` (one global weight per row) or train per-label models manually for the imbalanced labels.
- Confusing multilabel with multiclass. Multilabel: `Y.shape == (n_samples, n_labels)` with binary entries. Multiclass: `y.shape == (n_samples,)` with integer labels in `[0, n_classes)`. Pass the wrong shape to `f1_score` and you'll get nonsense.
- Ignoring label cardinality drift. If the true average is 2.0 labels per row and the model predicts 1.2, it's under-predicting positives across the board — usually a threshold-tuning problem.
## Worked example
See `demo.py` (marimo notebook). It generates a 6-label tabular
classification problem with varying per-label positive rates (13% to
53%), fits XGBoost in a `MultiOutputClassifier`, and walks through:

- Per-label positive rate bar
- Label co-occurrence heatmap (so the buyer can decide if ClassifierChain would help)
- Hamming loss vs subset accuracy on the test set
- All four F1 averages side by side
- Per-label F1 bar chart (showing the rare labels lag)
- True vs predicted label cardinality histogram (catches over- / under-prediction)