An Introduction to Self-Labeled Techniques

This article gives a brief introduction to self-labeled techniques, mainly covering their definition and taxonomy. Each definition is followed by the original wording from the paper. The material is based on [1].

Definition

First, the definitions of this family of methods (illustrated by a figure in [1]; the figure is not reproduced here).

Semi-supervised learning (SSL):

Combines supervised and unsupervised learning to provide pattern recognition with additional information.
"An extension of unsupervised and supervised learning by including additional information typical of the other learning paradigm." [1]

SSL is divided into the following two families:

  • Semi-supervised classification (SS-Cla):

    Focuses on semi-supervised classification problems.

  • Semi-supervised clustering (SS-Clu):

    Focuses on semi-supervised clustering problems.

Self-labeled methods address SS-Cla, i.e., the classification problem.

Self-Labeled Method:

Self-labeled methods enlarge the labeled data set (EL) by assigning labels to unlabeled samples.
"These techniques aim to obtain one (or several) enlarged labeled set(s) (EL), based on their most confident predictions, to classify unlabeled data." [1]

  • Self-training:

    A classifier is trained on the labeled samples and used to label the unlabeled ones; the most confident pseudo-labeled samples are then added to EL and the model is retrained (a minimal sketch follows this list).
    "A classifier is trained with an initial small number of labeled examples, aiming to classify unlabeled points. Then it is retrained with its own most confident predictions, enlarging its labeled training set. This model does not make any specific assumptions for the input data, but it accepts that its own predictions tend to be correct." [1]

  • Co-training:

    Several classifiers are trained, and each teaches the others with its own most confidently predicted samples (see the second sketch after this list).
    "It trains one classifier in each specific view, and then the classifiers teach each other the most confidently predicted examples. Multi-view learning for SSC is usually understood to be a generalization of co-training." [1]
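
To make the self-training loop concrete, here is a minimal sketch. It assumes scikit-learn's LogisticRegression as the base classifier and a fixed probability threshold of 0.9 as the confidence criterion; both are illustrative choices, not prescribed by [1].

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def self_training(X_l, y_l, X_u, threshold=0.9, max_iter=10):
    """Grow the enlarged labeled set EL from (X_l, y_l) using the
    classifier's own most confident predictions on X_u."""
    EL_X, EL_y = X_l.copy(), y_l.copy()
    clf = LogisticRegression().fit(EL_X, EL_y)
    for _ in range(max_iter):
        if len(X_u) == 0:
            break
        proba = clf.predict_proba(X_u)
        conf = proba.max(axis=1)              # confidence = max class probability
        keep = conf >= threshold              # most confident predictions only
        if not keep.any():                    # nothing confident enough: stop
            break
        pseudo = clf.classes_[proba[keep].argmax(axis=1)]
        EL_X = np.vstack([EL_X, X_u[keep]])   # enlarge EL with pseudo-labels
        EL_y = np.concatenate([EL_y, pseudo])
        X_u = X_u[~keep]                      # labeled points leave the pool
        clf = LogisticRegression().fit(EL_X, EL_y)   # retrain on enlarged EL
    return clf
```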
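
And a matching co-training sketch with two feature views. The column splits view1/view2, GaussianNB as both base learners, and the "k most confident per round" selection rule are assumptions for illustration; note how each classifier's confident predictions enlarge the other classifier's labeled set (mutual-teaching).

```python
import numpy as np
from sklearn.naive_bayes import GaussianNB

def co_training(X_l, y_l, X_u, view1, view2, k=5, rounds=10):
    """Two classifiers, one per view, teach each other their k most
    confident pseudo-labeled samples each round."""
    views = [view1, view2]                   # column indices of each view
    EL = [(X_l, y_l), (X_l, y_l)]            # one enlarged labeled set per view
    clfs = [GaussianNB().fit(X[:, v], y) for (X, y), v in zip(EL, views)]
    for _ in range(rounds):
        if len(X_u) == 0:
            break
        taught = np.zeros(len(X_u), dtype=bool)
        for i, clf in enumerate(clfs):
            proba = clf.predict_proba(X_u[:, views[i]])
            top = np.argsort(-proba.max(axis=1))[:k]   # k most confident
            labels = clf.classes_[proba[top].argmax(axis=1)]
            j = 1 - i                        # teach the *other* classifier
            EL[j] = (np.vstack([EL[j][0], X_u[top]]),
                     np.concatenate([EL[j][1], labels]))
            taught[top] = True
        X_u = X_u[~taught]                   # taught points leave the pool
        clfs = [GaussianNB().fit(X[:, v], y) for (X, y), v in zip(EL, views)]
    return clfs
```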

Taxonomy

By Addition mechanism:

How pseudo-labeled samples are selected and added to EL.

  • Incremental:

    Starting from EL = L, the most confident samples are added iteratively.
    Advantage: fast.
    Drawback: samples with incorrect pseudo-labels may be added, and once added they cannot be removed.

  • Batch:

    A growth criterion is defined, and all samples that satisfy it are added to the training set. The difference from Incremental is that Incremental commits a definite class label to each high-confidence sample as training proceeds, whereas Batch does not commit definite class labels to unlabeled samples during the training stage; it first checks the criterion for all of them and then adds the qualifying ones at once.

  • Amending:

    Starting from EL = L, samples are iteratively added or removed. This provides the ability to correct earlier labeling mistakes. A toy sketch contrasting the three mechanisms follows this list.
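
A toy sketch of the three addition mechanisms, using a made-up per-sample confidence vector and a hypothetical addition criterion of 0.85 (both are for illustration only):

```python
import numpy as np

conf = np.array([0.95, 0.60, 0.88, 0.99])   # current classifier's confidence
                                            # for each unlabeled sample

# Incremental: commit a label to the single most confident sample, retrain,
# then re-score the remaining pool before choosing the next one.
best = int(np.argmax(conf))                 # -> sample 3

# Batch: first check the addition criterion (here conf >= 0.85) for every
# sample, then add all qualifying samples at once, without committing labels
# one by one during the pass.
batch = np.flatnonzero(conf >= 0.85)        # -> samples 0, 2, 3

# Amending: like incremental, but additions are revisable; a sample whose
# confidence drops below the criterion after retraining is removed from EL,
# which provides the error-correction ability mentioned above.
```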

By Single-learning versus multi-learning:

  • single-learning: predictions are produced by a single learning algorithm/classifier.

  • multi-learning: predictions are produced by multiple classifiers (or multiple learning algorithms).

By Single-view versus multi-view:

A feature representation of the samples (one that carries sufficient conditional information) is called a view.

  • multi-view

  • single-view

By Confidence measures:

How the confidence of a prediction is defined.

  • Simple

    Computed from the predicted class probabilities of a single classifier.

  • Agreement and combination

    Computed by combining the predictions of multiple classifiers or by using hybrid models. A sketch of both families follows this list.
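
The two families of confidence measures might look as follows (assuming scikit-learn estimators that expose predict_proba; averaging probabilities is just one possible combination rule):

```python
import numpy as np

# Simple: the confidence of a single classifier's prediction is its largest
# predicted class probability.
def simple_confidence(clf, X_u):
    return clf.predict_proba(X_u).max(axis=1)

# Agreement and combination: pool several classifiers, e.g. by averaging
# their class-probability estimates, and take the resulting maximum.
def combined_confidence(clfs, X_u):
    proba = np.mean([c.predict_proba(X_u) for c in clfs], axis=0)
    return proba.max(axis=1)
```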

By Self-teaching versus mutual-teaching:

  • mutual-teaching: each classifier labels its most confident samples for the other classifiers' ELs.

  • self-teaching: each classifier maintains and uses its own EL.

By Stopping criteria:

  • Label the entire unlabeled set

    Classical methods pseudo-label all unlabeled samples, but this introduces many incorrectly labeled samples.

  • Label a subset

    Only part of the unlabeled samples is selected; however, the number of selection iterations must be fixed in advance and is affected by the size of the data set.

  • Hypothesis unchanged

    Stop when the selected samples no longer change the hypothesis (the classifier); a toy illustration follows this list.
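
A runnable toy illustration of the "hypothesis unchanged" criterion. The synthetic data, LogisticRegression, and the use of predictions on the unlabeled pool as a proxy for the hypothesis are all assumptions made here for illustration:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
X_l = rng.normal(size=(20, 2)); y_l = (X_l[:, 0] > 0).astype(int)
X_u = rng.normal(size=(100, 2))

# Initial pseudo-labels from a classifier trained on L alone.
pseudo = LogisticRegression().fit(X_l, y_l).predict(X_u)
for _ in range(20):
    clf = LogisticRegression().fit(np.vstack([X_l, X_u]),
                                   np.concatenate([y_l, pseudo]))
    new = clf.predict(X_u)
    if np.array_equal(new, pseudo):   # hypothesis no longer changes: stop
        break
    pseudo = new
```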

[1] Triguero, I., García, S. and Herrera, F. "Self-labeled techniques for semi-supervised learning: taxonomy, software and empirical study." Knowledge and Information Systems 42 (2015): 245-284.