multilabel-classification

<!-- Bundled files (accessible via ${CLAUDE_SKILL_DIR}): - SKILL.md — this file - scripts/demo.py — runnable marimo notebook with worked example -->

# Multilabel Classification with XGBoost (Done Right)


Multilabel ≠ multiclass. Multiclass picks one class from N. Multilabel predicts any subset of N labels — each row can have zero, one, or many labels on simultaneously. The metrics, the model wrapping, and the failure modes are all different.

For tabular multilabel, default to XGBoost wrapped in `MultiOutputClassifier`: it fits one independent XGBoost model per label. Simple, fast, and competitive. Switch to `ClassifierChain` only when labels are correlated and the ordering is meaningful.

## When to use this skill


  • Each row can have multiple labels (tags on a post, attributes of a product, diseases in a patient, content moderation categories)
  • The labels are NOT mutually exclusive
  • The features are tabular
  • You have at least a few hundred examples per label (rare labels will underperform — see pitfalls)

## When NOT to use this skill


  • Each row has exactly one label → see multiclass-classification
  • Two labels exactly → see binary-classification
  • The labels are extremely strongly correlated and you have a natural ordering (e.g. hierarchical taxonomies) → consider tree-based multi-target methods or label powerset
  • Extreme multilabel (> 1000 labels) → specialized algorithms outside this skill's scope

## Project layout


```
<project>/
├── data/
├── src/
│   ├── train.py         # ibis read → MultiOutputClassifier(XGBClassifier) → MLflow
│   ├── predict.py       # reload, return per-row label vector + per-label probas
│   └── plots.py         # label balance, co-occurrence, per-label metrics, cardinality
├── notebooks/
│   └── demo.py
└── mlruns/
```

## Data access — same ibis pattern


```python
import ibis

table = ibis.duckdb.connect().read_parquet("data/train.parquet")
feature_cols = [c for c in table.columns if c.startswith("feature_")]
label_cols = [c for c in table.columns if c.startswith("label_")]

data = (
    table
    .select(*feature_cols, *label_cols)
    .execute()
)
X = data[feature_cols]
Y = data[label_cols].to_numpy().astype(int)  # shape: (n_samples, n_labels)
```

The target `Y` is now a matrix, not a vector. That's the core shape difference from the other classification skills.

## The pipeline — `MultiOutputClassifier`


```python
from sklearn.compose import ColumnTransformer
from sklearn.multioutput import MultiOutputClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from xgboost import XGBClassifier

def build_pipeline(feature_cols, seed):
    return Pipeline([
        ("preprocess", ColumnTransformer([("num", StandardScaler(), feature_cols)])),
        ("clf", MultiOutputClassifier(
            XGBClassifier(
                n_estimators=300,
                max_depth=4,
                learning_rate=0.05,
                subsample=0.8,
                colsample_bytree=0.8,
                reg_lambda=1.0,
                objective="binary:logistic",
                eval_metric="logloss",
                random_state=seed,
                n_jobs=-1,
            ),
            n_jobs=-1,  # parallelize across labels
        )),
    ])
```

`MultiOutputClassifier` fits one independent binary classifier per label and stitches them together. Each underlying XGBoost is just a binary classifier (`binary:logistic`), so all the binary-classification skill's lessons apply per label.
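One output-shape quirk worth knowing before wiring up prediction: `predict` returns a single `(n_samples, n_labels)` matrix, but `MultiOutputClassifier.predict_proba` returns a Python list with one `(n_samples, 2)` array per label. A minimal sketch of the shapes, using synthetic data and a `LogisticRegression` stand-in for the XGBoost estimator so it runs standalone:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.multioutput import MultiOutputClassifier

# tiny synthetic multilabel problem: 3 labels derived from 5 features
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
Y = np.column_stack([
    (X[:, 0] > 0),
    (X[:, 1] > 0.5),
    ((X[:, 0] + X[:, 2]) > 0),
]).astype(int)

clf = MultiOutputClassifier(LogisticRegression(max_iter=1000)).fit(X, Y)

pred = clf.predict(X)         # one matrix: shape (n_samples, n_labels)
proba = clf.predict_proba(X)  # a LIST of n_labels arrays, each (n_samples, 2)

# column 1 of each per-label array is P(label on); stack into one matrix
pos_proba = np.column_stack([p[:, 1] for p in proba])  # (n_samples, n_labels)
```

Stacking the positive-class columns like this gives the `(n_samples, n_labels)` probability matrix that per-label threshold tuning expects.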

## The five things that separate this from a tutorial


### 1. Hamming loss is the primary metric, NOT subset accuracy


Subset accuracy is "did we predict the exact set of labels for this row?" — all of them right or none. On a 6-label problem with average 2 labels per row, even getting 90% per-label accuracy gives you only ~50% subset accuracy. Subset accuracy is brutally pessimistic and will mislead you about model quality.

Hamming loss is the average per-label-slot error rate:

```python
from sklearn.metrics import hamming_loss, accuracy_score

ham = hamming_loss(Y_test, Y_pred)         # primary metric, lower = better
exact_match = accuracy_score(Y_test, Y_pred)  # subset accuracy — too strict alone
```

Report both, but optimize for hamming loss + per-label F1, not subset accuracy.
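To make the gap concrete, here is a tiny hand-checkable example: two wrong label slots out of nine give a hamming loss of about 0.22, while subset accuracy only credits the one exactly-matched row:

```python
import numpy as np
from sklearn.metrics import hamming_loss, accuracy_score

Y_true = np.array([[1, 0, 1],
                   [0, 1, 0],
                   [1, 1, 0]])
Y_pred = np.array([[1, 0, 1],    # exact match
                   [0, 1, 1],    # one slot wrong
                   [1, 0, 0]])   # one slot wrong

ham = hamming_loss(Y_true, Y_pred)          # 2 wrong slots / 9 slots = 0.222...
exact_match = accuracy_score(Y_true, Y_pred)  # 1 exact row / 3 rows = 0.333...
```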

### 2. Four F1 averages — yes, four, not three


For multilabel, sklearn's `f1_score` supports a fourth averaging strategy you don't have in multiclass: `samples`. Each averaging strategy answers a different question:

| Average | What it computes | When to use |
|---|---|---|
| macro | Unweighted mean of per-label F1 | All labels matter equally — rare labels drag the average down (good) |
| micro | F1 over the pooled (sample, label) predictions | Overall correctness across all label slots |
| weighted | Per-label F1 weighted by support | Weights toward common labels — hides rare-label failures |
| samples | Per-row F1, then averaged across rows | Per-row "did we get the labels mostly right?" — useful for tagging tasks |

```python
from sklearn.metrics import f1_score

f1_macro = f1_score(Y_test, Y_pred, average="macro", zero_division=0)
f1_micro = f1_score(Y_test, Y_pred, average="micro", zero_division=0)
f1_weighted = f1_score(Y_test, Y_pred, average="weighted", zero_division=0)
f1_samples = f1_score(Y_test, Y_pred, average="samples", zero_division=0)
```

Default to macro F1 for the same reason as multiclass: it surfaces rare-label failures that the other averages hide.
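A toy case showing why the averages diverge: one common label predicted perfectly plus one rare label that is always missed yields macro F1 = 0.5 (the failure is fully visible) but micro F1 ≈ 0.89 (the failure is nearly invisible in the pooled slots):

```python
import numpy as np
from sklearn.metrics import f1_score

# label 0 is common and predicted perfectly; label 1 is rare and always missed
Y_true = np.array([[1, 0], [1, 0], [1, 0], [1, 1]])
Y_pred = np.array([[1, 0], [1, 0], [1, 0], [1, 0]])

macro = f1_score(Y_true, Y_pred, average="macro", zero_division=0)  # (1.0 + 0.0) / 2 = 0.5
micro = f1_score(Y_true, Y_pred, average="micro", zero_division=0)  # pooled: tp=4, fp=0, fn=1 → 8/9
```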

### 3. Label co-occurrence matters — and points at when to use `ClassifierChain`


If labels are independent (like `make_multilabel_classification`'s default), `MultiOutputClassifier` is optimal. If labels are correlated ("if label_3 is on, label_5 is also on 80% of the time"), the independent-models assumption is suboptimal — you're throwing away information.

The conditional co-occurrence matrix P(label_j | label_i) reveals this:

```python
import numpy as np

n_labels = Y.shape[1]
cooc = np.zeros((n_labels, n_labels))
for i in range(n_labels):
    i_count = int(Y[:, i].sum())
    if i_count == 0:
        continue
    for j in range(n_labels):
        cooc[i, j] = float(((Y[:, i] == 1) & (Y[:, j] == 1)).sum() / i_count)
```

`cooc[i, j]` = "given label_i is on, how often is label_j also on?"



If most off-diagonal entries hover around the marginal P(label_j),
labels are roughly independent → use `MultiOutputClassifier`. If
some off-diagonal entries are much higher than the marginals, labels
are correlated → consider `ClassifierChain`.
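A synthetic sanity check of that rule (the label generation scheme below is made up for illustration): `label_1` is built to co-occur with `label_0`, `label_2` is independent, and the conditional probabilities separate cleanly from the marginals:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 1000

# label_1 is on 80% of the time when label_0 is on, else only 10%; label_2 is independent
label_0 = rng.random(n) < 0.3
label_1 = np.where(label_0, rng.random(n) < 0.8, rng.random(n) < 0.1)
label_2 = rng.random(n) < 0.4
Y = np.column_stack([label_0, label_1, label_2]).astype(int)

marginal = Y.mean(axis=0)            # P(label_j) for each label
cond_01 = Y[Y[:, 0] == 1, 1].mean()  # P(label_1 | label_0) — well above marginal[1]
cond_02 = Y[Y[:, 0] == 1, 2].mean()  # P(label_2 | label_0) — roughly equal to marginal[2]
```

Here `cond_01` is far above `marginal[1]` (correlated → `ClassifierChain` territory), while `cond_02` stays within sampling noise of `marginal[2]` (independent → `MultiOutputClassifier` is fine).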


### 4. `ClassifierChain` for correlated labels


```python
from sklearn.multioutput import ClassifierChain

clf_chain = ClassifierChain(
    XGBClassifier(...),
    order=[0, 1, 2, 3, 4, 5],  # or "random" for cross-validated stability
    random_state=42,
)
```

`ClassifierChain` fits N binary classifiers in sequence, where each classifier sees the previous classifiers' predictions as additional features. This lets it learn label correlations like "if label_0 is predicted, label_3 becomes more likely."

Catch: chain order matters. Different orders give different results. Common practice: train multiple chains with random orders, average their predictions (this is the "ensemble of classifier chains" trick). For most production systems, `MultiOutputClassifier` is good enough and much simpler — only switch to `ClassifierChain` when you can measure that label correlations actually exist and matter for your accuracy.
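A sketch of the ensemble-of-chains trick, on synthetic data and with a `LogisticRegression` base estimator so it runs standalone (swap in the `XGBClassifier` from the pipeline section in practice): fit several chains with random orders and average their probability matrices. Note that, unlike `MultiOutputClassifier`, `ClassifierChain.predict_proba` returns a single `(n_samples, n_labels)` array:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.multioutput import ClassifierChain

# synthetic correlated labels: the second label depends on the first's driver feature
rng = np.random.default_rng(0)
X = rng.normal(size=(300, 6))
Y = np.column_stack([
    (X[:, 0] > 0),
    (X[:, 0] + X[:, 1] > 0),  # correlated with the first label
    (X[:, 2] > 0),
    (X[:, 3] > 0.3),
]).astype(int)

# ensemble of classifier chains: random orders, averaged probabilities
chains = [
    ClassifierChain(LogisticRegression(max_iter=1000), order="random", random_state=s).fit(X, Y)
    for s in range(5)
]
proba = np.mean([chain.predict_proba(X) for chain in chains], axis=0)  # (n_samples, n_labels)
Y_pred = (proba >= 0.5).astype(int)
```

Averaging over random orders smooths out the order sensitivity of any single chain.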

### 5. Per-label monitoring — every label needs its own F1


In multilabel, each label has its own positive rate, its own imbalance, and its own difficulty. A model can have great macro F1 overall while one specific rare label is at F1 = 0.

Always log per-label F1 to MLflow as separate metrics:

```python
for i, lbl in enumerate(label_cols):
    f1_i = float(f1_score(Y_test[:, i], Y_pred[:, i], average="binary", zero_division=0))
    mlflow.log_metric(f"test_f1__{lbl}", f1_i)
```

This is the multilabel version of "per-class F1" from the multiclass skill — same idea, but each label is genuinely independent, so the imbalance can vary wildly across labels.

## MLflow logging


| Kind | What |
|---|---|
| params | data path, n_rows, n_features, n_labels, label_columns, seed, hyperparameters |
| metrics | hamming_loss (primary), subset_accuracy, F1 macro / micro / weighted / samples, per-label F1 (one metric per label), label cardinality (true vs predicted) |
| tags | data hash, label cardinality / density from sidecar |
| artifacts | model, label balance bar, co-occurrence heatmap, per-label metrics bar, label cardinality histogram (true vs pred) |

## Common pitfalls


  1. Reporting subset accuracy as the primary metric. It's brutally strict and will make every model look bad. Use hamming loss + per-label F1.
  2. Using a per-row threshold instead of per-label. Each label has its own optimal threshold. `MultiOutputClassifier`'s default is 0.5 per label, which is rarely right for any of them. For cost-sensitive applications, tune per-label thresholds on a validation set.
  3. Not parallelizing across labels. `MultiOutputClassifier(..., n_jobs=-1)` fits labels in parallel — it's free speed for independent labels.
  4. Forgetting `zero_division=0`. If a label has no positives in the test set (or no predicted positives), F1 is undefined. The default is to warn; set `zero_division=0` to silently treat undefined F1 as 0.
  5. Imbalanced rare labels with no special handling. Per-label `scale_pos_weight` works, but `MultiOutputClassifier` doesn't make it easy to vary per label. Either use `sample_weight` (one global weight per row) or train per-label models manually for the imbalanced labels.
  6. Confusing multilabel with multiclass. Multilabel: `Y.shape == (n_samples, n_labels)` with binary entries. Multiclass: `y.shape == (n_samples,)` with integer labels in `[0, n_classes)`. Pass the wrong shape to `f1_score` and you'll get nonsense.
  7. Ignoring label cardinality drift. If the true average is 2.0 labels per row and the model predicts 1.2, it's under-predicting positives across the board — usually a threshold-tuning problem.
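For pitfall 2, a minimal per-label threshold tuner (the helper name, grid, and synthetic validation data below are illustrative, not from this skill's scripts): for each label, sweep a threshold grid on validation data and keep the value that maximizes that label's binary F1. Because 0.5 is in the grid, tuned thresholds can only match or beat the per-label 0.5 default on the validation set:

```python
import numpy as np
from sklearn.metrics import f1_score

def tune_thresholds(Y_val, proba_val, grid=np.arange(1, 20) / 20):
    """Per label, pick the grid threshold that maximizes binary F1 on validation data."""
    thresholds = np.full(Y_val.shape[1], 0.5)
    for j in range(Y_val.shape[1]):
        scores = [
            f1_score(Y_val[:, j], (proba_val[:, j] >= t).astype(int), zero_division=0)
            for t in grid
        ]
        thresholds[j] = grid[int(np.argmax(scores))]
    return thresholds

# synthetic validation set: predicted probabilities loosely track the true labels
rng = np.random.default_rng(0)
Y_val = (rng.random((500, 3)) < np.array([0.5, 0.15, 0.3])).astype(int)
proba_val = np.clip(0.35 * Y_val + rng.beta(2, 5, Y_val.shape), 0, 1)

th = tune_thresholds(Y_val, proba_val)
Y_tuned = (proba_val >= th).astype(int)
Y_default = (proba_val >= 0.5).astype(int)

macro_tuned = f1_score(Y_val, Y_tuned, average="macro", zero_division=0)
macro_default = f1_score(Y_val, Y_default, average="macro", zero_division=0)
```

Tune on a held-out validation split, never on the test set, or the reported metrics will be optimistic.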

## Worked example


See `demo.py` (marimo notebook). It generates a 6-label tabular classification problem with varying per-label positive rates (13% to 53%), fits XGBoost in a `MultiOutputClassifier`, and walks through:

  • Per-label positive rate bar
  • Label co-occurrence heatmap (so the reader can decide if ClassifierChain would help)
  • Hamming loss vs subset accuracy on the test set
  • All four F1 averages side by side
  • Per-label F1 bar chart (showing the rare labels lag)
  • True vs predicted label cardinality histogram (catches over- / under-prediction)