[DDFD] Multi-view Face Detection Using Deep Convolutional Neural Networks

Date: 2021-01-02 Tags: CNN

ICMR-2015

International Conference on Multimedia Retrieval

A class-B conference in the computer graphics and multimedia category

1 Background and Motivation

Current approaches to multi-view face detection fall roughly into 3 categories:

Cascade-based (improvements on the Viola and Jones detector cascade). The weakness of VJ: it fails to detect faces from different angles (side view or partially occluded faces)
DPM-based (the deformable part models technique, a landmark ancestor of modern object detection). Weakness: computationally intensive
Neural-Network-based

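PS: the innovations of the Viola and Jones detector cascade are a fast way to compute features (the integral image), an effective way to learn classifiers (AdaBoost), and an efficient classification strategy (the cascade design, which orders classifiers by complexity to cut down the number of input windows)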

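References (Chinese-language articles):
Approaching Face Detection (2): The VJ Face Detector and Its Development
Long read: Approaching Face Detection, from VJ to Deep Learning (Part 1)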
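
To make the integral-image trick from the PS concrete, here is a minimal pure-Python sketch. It is an illustration only, not code from the VJ paper; the names `integral_image` and `box_sum` and the toy 3x3 image are made up for the example. Once the table is built in one pass, the sum over any rectangle costs four lookups, which is what makes Haar-like features cheap to evaluate.

```python
def integral_image(img):
    """Build a summed-area table with an extra zero row and column."""
    h, w = len(img), len(img[0])
    ii = [[0] * (w + 1) for _ in range(h + 1)]
    for y in range(h):
        row_sum = 0
        for x in range(w):
            row_sum += img[y][x]
            # Everything above this row plus the running sum of this row.
            ii[y + 1][x + 1] = ii[y][x + 1] + row_sum
    return ii


def box_sum(ii, top, left, bottom, right):
    """Sum of img[top:bottom][left:right] using four table lookups."""
    return ii[bottom][right] - ii[top][right] - ii[bottom][left] + ii[top][left]


# Toy image (hypothetical data).
img = [[1, 2, 3],
       [4, 5, 6],
       [7, 8, 9]]
ii = integral_image(img)

# A two-rectangle Haar-like feature: left column minus right column.
two_rect_feature = box_sum(ii, 0, 0, 3, 1) - box_sum(ii, 0, 2, 3, 3)
```

The extra zero row and column avoid special cases at the image border.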

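Traditional learning algorithms (SVM, Boosting) and image features (HOG or Haar wavelets) are not strong enough to capture faces of different poses, which leaves them hopelessly inaccurate. The authors' answer: let's go deep learning, hahaha

That brings in the R-CNN, OverFeat, SPPNet line of work, whose weakness is that it is too slow!

The authors take a deep learning approach and throw out the modules that slow R-CNN down, namely segmentation (selective search), bounding-box regression, and SVM classifiers, proposing the Deep Dense Face Detector (DDFD) to detect faces in a wide range of orientations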

2 Advantages / Contributions

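Proposes DDFD (Deep Dense Face Detector), which
- does not require pose/landmark annotation (yet is on par with models that use this information),
- is a single model (simpler and more direct than R-CNN based methods)
Can detect faces from different angles and can handle occlusion to some extent
From analyzing the dataset: a better positive-sample sampling strategy and more sophisticated data augmentation can bring better results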

3 Method

Design philosophy of DDFD

Let the DNN handle both classification and feature extraction (which is also the advantage of DNNs over traditional ML)
Simplify the DNN to minimize the computational complexity

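The input is resized to 227x227 and the model is similar to AlexNet (5 conv, 3 fc), with a sigmoid after the last fc layer. Once trained, the model is run as a sliding window to detect faces, and NMS is then applied to localize them accurately (the image is rescaled and 227x227 windows are cropped from it to handle faces of different sizes)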
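
The sliding-window step above can be sketched roughly as follows. This is only an illustration under assumptions: `score_fn` is a hypothetical stand-in for the trained CNN's face probability, and the stride of 32 and threshold of 0.5 are values picked for the example, not taken from these notes. Each hit on the rescaled image is mapped back to original-image coordinates by dividing by the scale.

```python
WINDOW = 227  # network input size from the notes


def sliding_windows(height, width, stride=32):
    """Yield (top, left) corners of every WINDOW x WINDOW crop that fits."""
    for top in range(0, height - WINDOW + 1, stride):
        for left in range(0, width - WINDOW + 1, stride):
            yield top, left


def detect_at_scale(height, width, scale, score_fn, threshold=0.5, stride=32):
    """Score each window on the rescaled image and map hits back to the original.

    score_fn(top, left) is a hypothetical placeholder for the CNN output.
    """
    scaled_h, scaled_w = int(height * scale), int(width * scale)
    detections = []
    for top, left in sliding_windows(scaled_h, scaled_w, stride):
        score = score_fn(top, left)
        if score >= threshold:
            box = (left / scale, top / scale,
                   (left + WINDOW) / scale, (top + WINDOW) / scale)
            detections.append((box, score))
    return detections


# Toy scorer that fires on exactly one window (made-up behaviour).
def toy_score(top, left):
    return 0.9 if (top, left) == (32, 64) else 0.1


# A 600x800 image shrunk by 0.5: a 227-pixel window covers a 454-pixel face.
detections = detect_at_scale(600, 800, 0.5, toy_score)
```

Running this once per scale and pooling the detections gives the candidate boxes that NMS then merges.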

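Face rotation can be split into in-plane and out-of-plane rotation, as follows

Reference: https://www.researchgate.net/figure/n-plane-roll-and-out-of-plane-pitch-yaw-rotations_fig1_279394366/actions

in plane: the photo itself is rotated (roll)
out of plane: the person in the photo moves, pitch (up and down) or yaw (left to right)

The authors found that, in the figure above, the upright face gets the highest score, and the score drops as the in-plane rotation grows!
The same thing happens with out-of-plane rotation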

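The authors analyzed the dataset

The histograms of the three rotation types show that most samples lie within 30 degrees of rotation, close to frontal, so no wonder upright faces score so high! In training, negatives outnumber positives by roughly two orders of magnitude (200k positives vs. 20 million negatives, see 4.1). With a mini-batch of 128, random sampling would put only one or two positives in each batch, which of course makes it hard to tell face from non-face. The authors therefore force a 1:3 positive-to-negative ratio in every batch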
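
The arithmetic behind the fixed ratio can be checked with a short sketch. It uses the counts listed under 4.1 Datasets (200k positives, 20 million negatives) and the batch size of 128; the function names and the toy sample pools are hypothetical, and a real pipeline would draw image crops rather than integers.

```python
import random

BATCH_SIZE = 128
POS_PER_BATCH = BATCH_SIZE // 4             # 1:3 ratio -> 32 positives
NEG_PER_BATCH = BATCH_SIZE - POS_PER_BATCH  # 96 negatives


def expected_positives(num_pos, num_neg, batch_size=BATCH_SIZE):
    """Average number of positives in a uniformly sampled batch."""
    return batch_size * num_pos / (num_pos + num_neg)


def balanced_batch(positives, negatives, rng=random):
    """Draw a batch with a fixed 1:3 positive-to-negative ratio."""
    batch = [(x, 1) for x in rng.sample(positives, POS_PER_BATCH)]
    batch += [(x, 0) for x in rng.sample(negatives, NEG_PER_BATCH)]
    rng.shuffle(batch)
    return batch


# Uniform sampling: roughly 1.27 positives per batch of 128.
uniform_avg = expected_positives(200_000, 20_000_000)

# Toy pools of sample ids (made-up data, just to exercise the sampler).
toy_pos = list(range(1000))
toy_neg = list(range(1000, 5000))
batch = balanced_batch(toy_pos, toy_neg, random.Random(0))
```

With the fixed ratio every batch carries 32 positives instead of one or two, so the gradient sees faces at every step.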

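Another issue is that the rotation angles of the positives are unevenly distributed: how do you ensure that all categories of the training examples have similar chances to contribute in optimizing the CNN? This is closely tied to the sampling strategy, which is why the authors say better sampling strategies could further improve DDFD

The authors go on to analyze the occlusion cases in fig 1 and find misses or poor scores: most of the face images in the AFLW dataset are not occluded, which makes it difficult for a CNN to learn that faces can be occluded. (Supervised learning can't be expected to pick this up without a teacher, of course)

So the authors conclude that more sophisticated data augmentation can achieve better results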

4 Experiments

4.1 Datasets

AFLW dataset: 21k images, 24k face annotations
The authors sample sub-windows and filter them by IoU (0.5) to enlarge the positive set, ending up with 200k positives and 20 million negatives
PASCAL Face dataset: 851 images and 1341 annotated faces
AFW: 205 images with 473 annotated faces
FDDB dataset: 5171 annotated faces in 2846 images
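
The IoU filter used to harvest positives from AFLW can be sketched as below. This is a minimal illustration: boxes are assumed to be `(x1, y1, x2, y2)` tuples, `label_window` is a hypothetical helper, and treating every window at or below the 0.5 threshold as a negative is an assumption that these notes do not confirm.

```python
def iou(box_a, box_b):
    """Intersection over union of two (x1, y1, x2, y2) boxes."""
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def label_window(window, gt_faces, threshold=0.5):
    """Return 1 if the window overlaps any annotated face enough, else 0."""
    return int(any(iou(window, gt) > threshold for gt in gt_faces))


# One annotated face and two candidate sub-windows (made-up coordinates).
gt_faces = [(0, 0, 100, 100)]
near = (10, 10, 110, 110)   # IoU = 8100 / 11900, about 0.68
far = (50, 50, 150, 150)    # IoU = 2500 / 17500, about 0.14
labels = [label_window(near, gt_faces), label_window(far, gt_faces)]
```

Because several shifted windows around one annotation pass the threshold, 24k annotations can grow into a much larger positive set.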

4.2 Strategies Comparison

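1) scale factor

On the PASCAL Face dataset, the images are first upscaled 5x, so that the 227x227 window can pick up faces as small as 227/5 (about 45 pixels) in the original image. P-R curves for different downscaling factors are then compared on the 5x-upscaled images. $fs$ is the factor applied at each downscaling step, so a value closer to 1 means a smaller reduction per step and a finer sampling of face sizes. The figure above shows that the results are about the same, and the authors use $fs = 0.7937$ from here on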
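
A small sketch of the resulting scale schedule may help. Numerically 0.7937 is very close to 2^(-1/3), so three steps halve the image. The stopping rule used here (stop once the shorter side no longer fits one 227-pixel window) and the example image size are assumptions made for illustration.

```python
WINDOW = 227    # network input size
UPSCALE = 5.0   # initial upscaling from the notes
FS = 0.7937     # per-step scale factor from the notes


def pyramid_scales(height, width, upscale=UPSCALE, fs=FS, window=WINDOW):
    """List scales from `upscale` downward while one window still fits."""
    scales = []
    scale = upscale
    while min(height, width) * scale >= window:
        scales.append(scale)
        scale *= fs
    return scales


# Example image size (hypothetical).
scales = pyramid_scales(500, 375)

# Face size in original-image pixels that fills the window at each scale.
face_sizes = [WINDOW / s for s in scales]
```

The first entry of `face_sizes` is 45.4 pixels, matching the 227/5 figure above, and every third entry roughly doubles the face size.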

2) NMS strategies

NMS-avg works better
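
These notes do not spell out the two NMS variants, so the sketch below is one plausible reading rather than the paper's exact procedure. It assumes the max variant (called `nms_max` here) keeps the highest-scoring box and discards boxes that overlap it, while `nms_avg` averages the coordinates of each overlapping group and keeps the group's top score. The IoU threshold of 0.5 and the sample detections are made up.

```python
def iou(a, b):
    """Intersection over union of two (x1, y1, x2, y2) boxes."""
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def nms_max(dets, iou_thresh=0.5):
    """Greedy NMS: keep the best box, drop everything overlapping it."""
    remaining = sorted(dets, key=lambda d: d[1], reverse=True)
    kept = []
    while remaining:
        best, rest = remaining[0], remaining[1:]
        kept.append(best)
        remaining = [d for d in rest if iou(best[0], d[0]) <= iou_thresh]
    return kept


def nms_avg(dets, iou_thresh=0.5):
    """Group boxes around the best one and average each group's coordinates."""
    remaining = sorted(dets, key=lambda d: d[1], reverse=True)
    merged = []
    while remaining:
        best, rest = remaining[0], remaining[1:]
        group = [best] + [d for d in rest if iou(best[0], d[0]) > iou_thresh]
        remaining = [d for d in rest if iou(best[0], d[0]) <= iou_thresh]
        avg_box = tuple(sum(d[0][i] for d in group) / len(group) for i in range(4))
        merged.append((avg_box, best[1]))
    return merged


# Two overlapping detections of one face plus a separate one (made-up data).
dets = [((0, 0, 100, 100), 0.9), ((10, 10, 110, 110), 0.8), ((300, 300, 400, 400), 0.7)]
max_result = nms_max(dets)
avg_result = nms_avg(dets)
```

Averaging lets several slightly offset windows vote on the final box, which is one intuitive reason it could localize better than keeping a single window.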

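3) bounding box regression

Results are better without it. The authors attribute this to the mismatch between the annotations of the training set and the test set, as in the figure below

Training set on the left, test set on the right; the red box is the output of the authors' algorithm. Compared with the gt, its IoU is below 0.5, so it counts as a fp! So dropping bbox regression works better: the extra constraint and correction makes the model overfit the training-set annotations even more, and it also leads to missed detections!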

4.3 Comparison with R-CNN

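FT stands for fine-tuning: one model is fine-tuned on the PASCAL-VOC object detection dataset and the other on a face dataset. R-CNN improves markedly when combined with BBox regression, but still falls short of the authors' DDFD!

The authors' analysis: the SS (selective search) strategy in R-CNN may be a poor fit and may miss faces, and SS and bbox regression may not work well together!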

4.4 Comparisons with state-of-the-art

Other methods (e.g. DPM or HeadHunter) use extra label information (pose annotation or information about facial landmarks)

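5 Conclusion (own)

The conclusions drawn from the positive-sample distribution histograms are really well grounded (a better sampling method, and data augmentation that increases data diversity, can bring better results). The former also shows up in the paper "Libra R-CNN: Towards Balanced Learning for Object Detection" (which improves the sampling strategy)
This sentence really gets to the heart of it
Algorithms can't be separated from datasets! The dataset is mined thoroughly, and the analysis of the dataset and of the model's behavior is very incisive! An applied innovation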