Representative Models

This chapter introduces representative CNN models that are commonly used for transfer learning.

VGGNet
GoogLeNet / Inception
MobileNet
ResNet

These models are also frequently used as backbones for object detection and semantic segmentation models, so they are worth knowing well.

import torch 
import torch.nn as nn
import torch.nn.functional as F

VGGNet (2014)

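Paper: https://arxiv.org/pdf/1409.1556.pdf

[Figure 30_1]

Source: http://www.sanko-shoko.net/note.php?id=pyk1

VGGNet took second place in the image classification task at ILSVRC 2014. Although it did not win, it is widely used because it achieved high accuracy with a very simple, easy-to-understand architecture.

It has three characteristics:

Uses only $3\times3$ filters
Treats several convolution layers with the same number of channels followed by Max Pooling as one set, and repeats that set
Doubles the number of output channels after each Max Pooling (up to 512)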
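
The three characteristics above can be captured in a small configuration-driven builder. The following is a minimal sketch, not the implementation shown later in this chapter; the function name `make_vgg_features` and the `cfg` list format (an integer for the output channels of a convolution, 'M' for Max Pooling) are assumptions introduced here for illustration.

```python
import torch
import torch.nn as nn

# Hypothetical config: an int is the output channels of a 3x3 conv, 'M' is a 2x2 Max Pooling.
# Channels double after each pooling step until they reach 512.
cfg = [64, 64, 'M', 128, 128, 'M', 256, 256, 256, 'M',
       512, 512, 512, 'M', 512, 512, 512, 'M']


def make_vgg_features(cfg, in_channels=3):
    """Build the convolutional part of a VGG-style network from a config list."""
    layers = []
    for v in cfg:
        if v == 'M':
            layers.append(nn.MaxPool2d(kernel_size=2, stride=2))
        else:
            # padding=1 keeps (H, W) unchanged for a 3x3 filter
            layers.append(nn.Conv2d(in_channels, v, kernel_size=3, padding=1))
            layers.append(nn.ReLU(inplace=True))
            in_channels = v
    return nn.Sequential(*layers)


features = make_vgg_features(cfg)
n_convs = sum(isinstance(m, nn.Conv2d) for m in features)

with torch.no_grad():
    out = features(torch.zeros(1, 3, 32, 32))  # small input to keep the check fast

print(n_convs)    # number of 3x3 convolution layers
print(out.shape)  # five poolings halve 32 down to 1
```

Writing the architecture as a list makes the repeated "convolutions, then pooling" sets easy to see at a glance.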

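A $3\times3$ filter is used because it is the smallest size that can take in information from above, below, left, right, and the center.

Stacking $3\times3$ convolutions also approximates a convolution with a larger filter. For example, one $7\times7$ convolution and three successive $3\times3$ convolutions cover the same $7\times7$ receptive field and output feature maps of the same size (H, W). Consider the following conditions.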

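The input feature map has 16 channels
The output feature map has 32 channels

One $7\times7$ convolution

Feature map size: IN (32, 32, 16) → OUT (26, 26, 32)
Number of parameters: $7\times7\times16\times32 = 25,088$

Three $3\times3$ convolutions

Feature map size: IN (32, 32, 16) → (30, 30, 32) → (28, 28, 32) → OUT (26, 26, 32)
Number of parameters: $3\times3\times16\times32 + 2\times(3\times3\times32\times32) = 23,040$

Ignoring biases, the stack of $3\times3$ convolutions covers the same receptive field with fewer parameters. The saving is larger when the input and output channel counts are equal: with $C$ channels throughout, three $3\times3$ layers use $3\times(3\times3\times C\times C) = 27C^2$ weights versus $49C^2$ for one $7\times7$ layer, roughly 45% fewer. Beyond the reduction in parameters, accuracy also improved, so subsequent CNN research came to center on $3\times3$ filters.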
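
The shapes and parameter counts in this comparison can be checked directly in PyTorch. The sketch below builds both variants with bias=False and no padding, matching the conditions above; the helper name `count_params` is introduced here for illustration only.

```python
import torch
import torch.nn as nn


def count_params(module):
    """Total number of learnable parameters in a module."""
    return sum(p.numel() for p in module.parameters())


# One 7x7 convolution: 16 -> 32 channels
conv7 = nn.Conv2d(16, 32, kernel_size=7, bias=False)

# Three stacked 3x3 convolutions: 16 -> 32 -> 32 -> 32 channels
conv3x3 = nn.Sequential(
    nn.Conv2d(16, 32, kernel_size=3, bias=False),
    nn.Conv2d(32, 32, kernel_size=3, bias=False),
    nn.Conv2d(32, 32, kernel_size=3, bias=False),
)

x = torch.zeros(1, 16, 32, 32)  # PyTorch uses (N, C, H, W) ordering
with torch.no_grad():
    y7 = conv7(x)
    y3 = conv3x3(x)

print(y7.shape, count_params(conv7))
print(y3.shape, count_params(conv3x3))
```

Note that PyTorch orders tensors as (N, C, H, W), whereas the sizes in the text are written as (H, W, C).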

[Figure 30_2]

class VGG(nn.Module):
    def __init__(self, n_classes=1000):
        super(VGG, self).__init__()
        
        self.features = nn.Sequential(
            # ch:3 -> 64, size: 224 * 224 -> 112 * 112
            nn.Conv2d(3, 64, kernel_size=3, padding=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(64, 64, kernel_size=3, padding=1),
            nn.ReLU(inplace=True),
            nn.MaxPool2d(kernel_size=2, stride=2),
            
            # ch:64 -> 128, size: 112 * 112 -> 56 * 56
            nn.Conv2d(64, 128, kernel_size=3, padding=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(128, 128, kernel_size=3, padding=1),
            nn.ReLU(inplace=True),
            nn.MaxPool2d(kernel_size=2, stride=2),
            
            # ch:128 -> 256, size: 56 * 56 -> 28 * 28
            nn.Conv2d(128, 256, kernel_size=3, padding=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(256, 256, kernel_size=3, padding=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(256, 256, kernel_size=3, padding=1),
            nn.ReLU(inplace=True),
            nn.MaxPool2d(kernel_size=2, stride=2),
            
            # ch 256 -> 512, size: 28 * 28 -> 14 * 14
            nn.Conv2d(256, 512, kernel_size=3, padding=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(512, 512, kernel_size=3, padding=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(512, 512, kernel_size=3, padding=1),
            nn.ReLU(inplace=True),
            nn.MaxPool2d(kernel_size=2, stride=2),
            
            # ch 512 -> 512, size: 14 * 14 -> 7 * 7
            nn.Conv2d(512, 512, kernel_size=3, padding=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(512, 512, kernel_size=3, padding=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(512, 512, kernel_size=3, padding=1),
            nn.ReLU(inplace=True),
            nn.MaxPool2d(kernel_size=2, stride=2)
        )
        
        self.avgpool = nn.AdaptiveAvgPool2d((7, 7))
        
        self.classifier = nn.Sequential(
            nn.Linear(512 * 7 * 7, 4096),
            nn.ReLU(inplace=True),
            nn.Dropout(),
            nn.Linear(4096, 4096),
            nn.ReLU(inplace=True),
            nn.Dropout(),
            nn.Linear(4096, n_classes)
        )
    
    def forward(self, x):
        x = self.features(x)
        x = self.avgpool(x)
        x = torch.flatten(x, 1)
        x = self.classifier(x)
        return x

model = VGG()

from torchsummary import summary

summary(model, input_size=(3, 224, 224), device='cpu')

----------------------------------------------------------------
        Layer (type)               Output Shape         Param #
================================================================
            Conv2d-1         [-1, 64, 224, 224]           1,792
              ReLU-2         [-1, 64, 224, 224]               0
            Conv2d-3         [-1, 64, 224, 224]          36,928
              ReLU-4         [-1, 64, 224, 224]               0
         MaxPool2d-5         [-1, 64, 112, 112]               0
            Conv2d-6        [-1, 128, 112, 112]          73,856
              ReLU-7        [-1, 128, 112, 112]               0
            Conv2d-8        [-1, 128, 112, 112]         147,584
              ReLU-9        [-1, 128, 112, 112]               0
        MaxPool2d-10          [-1, 128, 56, 56]               0
           Conv2d-11          [-1, 256, 56, 56]         295,168
             ReLU-12          [-1, 256, 56, 56]               0
           Conv2d-13          [-1, 256, 56, 56]         590,080
             ReLU-14          [-1, 256, 56, 56]               0
           Conv2d-15          [-1, 256, 56, 56]         590,080
             ReLU-16          [-1, 256, 56, 56]               0
        MaxPool2d-17          [-1, 256, 28, 28]               0
           Conv2d-18          [-1, 512, 28, 28]       1,180,160
             ReLU-19          [-1, 512, 28, 28]               0
           Conv2d-20          [-1, 512, 28, 28]       2,359,808
             ReLU-21          [-1, 512, 28, 28]               0
           Conv2d-22          [-1, 512, 28, 28]       2,359,808
             ReLU-23          [-1, 512, 28, 28]               0
        MaxPool2d-24          [-1, 512, 14, 14]               0
           Conv2d-25          [-1, 512, 14, 14]       2,359,808
             ReLU-26          [-1, 512, 14, 14]               0
           Conv2d-27          [-1, 512, 14, 14]       2,359,808
             ReLU-28          [-1, 512, 14, 14]               0
           Conv2d-29          [-1, 512, 14, 14]       2,359,808
             ReLU-30          [-1, 512, 14, 14]               0
        MaxPool2d-31            [-1, 512, 7, 7]               0
AdaptiveAvgPool2d-32            [-1, 512, 7, 7]               0
           Linear-33                 [-1, 4096]     102,764,544
             ReLU-34                 [-1, 4096]               0
          Dropout-35                 [-1, 4096]               0
           Linear-36                 [-1, 4096]      16,781,312
             ReLU-37                 [-1, 4096]               0
          Dropout-38                 [-1, 4096]               0
           Linear-39                 [-1, 1000]       4,097,000
================================================================
Total params: 138,357,544
Trainable params: 138,357,544
Non-trainable params: 0
----------------------------------------------------------------
Input size (MB): 0.57
Forward/backward pass size (MB): 218.78
Params size (MB): 527.79
Estimated Total Size (MB): 747.15
----------------------------------------------------------------

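GoogLeNet / Inception (2014)

Paper: https://arxiv.org/pdf/1409.4842.pdf

GoogLeNet is the winning model of ILSVRC 2014. Its name is an homage to LeNet, and it is also known as Inception. The network has continued to be improved since it was proposed; as of March 2020, Inception v4 is the latest version.

GoogLeNet corresponds to what is called Inception v1.

Inception module

GoogLeNet bundles several small sub-networks into a single unit, the Inception module, and stacks these modules on top of one another, in the style of Network In Network.

Inside an Inception module, as shown in figure (a) below, the input passes through several convolution layers with different filter sizes in parallel, and their outputs are concatenated. This is a device for building a complex architecture while reducing the number of parameters and lowering the computational cost.

To see how much the parameter count can be reduced, consider the following conditions.

The input feature map has $192$ channels
The output feature map has $96$ channels

For an ordinary $5\times5$ convolution layer, the number of parameters excluding biases is

$5 \times 5 \times 192 \times 96 = 460,800$

With the naive Inception module (a), on the other hand, using 24 output channels per convolution branch gives

$\begin{aligned} &(1 \times 1 \times 192 \times 24) \\ &+ (3 \times 3 \times 192 \times 24) \\ &+ (5 \times 5 \times 192 \times 24) \\ &= 161,280 \end{aligned}$

which is a substantial reduction.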

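[Figure 30_3]

Source: https://arxiv.org/pdf/1409.4842.pdf

A variant of the Inception module that also reduces dimensionality along the channel direction was proposed as well; figure (b) above shows its structure.

As the figure shows, a $1\times1$ convolution layer is placed in front of each $3\times3$ and $5\times5$ convolution layer. The idea is to first reduce the number of channels with the $1\times1$ convolution and only then apply the more expensive $3\times3$ or $5\times5$ convolution, which reduces the total number of parameters.

Let us work through the numbers here as well, under the following conditions.

The input feature map has $192$ channels
The intermediate feature map (after the $1\times1$ convolution layer) has $16$ channels
The output feature map has $96$ channels

$\begin{aligned} &(1 \times 1 \times 192 \times 24) \\ &+ (1 \times 1 \times 192 \times 16) + (3 \times 3 \times 16 \times 24) \\ &+ (1 \times 1 \times 192 \times 16) + (5 \times 5 \times 16 \times 24) \\ &+ (1 \times 1 \times 192 \times 24) \\ &= 28,416 \end{aligned}$

Pattern (a) required $161,280$ parameters, so the parameter count has been reduced substantially.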
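
The two totals can be reproduced by summing the weights of each branch. This is a rough sketch under the conditions listed above (192 input channels, 16 intermediate channels, 24 output channels per branch, biases ignored); `conv_weights` is a helper name assumed here for illustration.

```python
def conv_weights(k, c_in, c_out):
    """Weights of a k x k convolution from c_in to c_out channels (no bias)."""
    return k * k * c_in * c_out


c_in, c_mid, c_branch = 192, 16, 24

# (a) naive module: 1x1, 3x3 and 5x5 convolutions applied directly to the input
naive = (conv_weights(1, c_in, c_branch)
         + conv_weights(3, c_in, c_branch)
         + conv_weights(5, c_in, c_branch))

# (b) with dimension reduction: 1x1 convolutions shrink the channels first
reduced = (conv_weights(1, c_in, c_branch)                                     # 1x1 branch
           + conv_weights(1, c_in, c_mid) + conv_weights(3, c_mid, c_branch)  # 1x1 -> 3x3
           + conv_weights(1, c_in, c_mid) + conv_weights(5, c_mid, c_branch)  # 1x1 -> 5x5
           + conv_weights(1, c_in, c_branch))                                 # pool -> 1x1

print(naive, reduced)
```

Most of the saving comes from the $5\times5$ branch, where the expensive filter now sees 16 channels instead of 192.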

The Inception module of pattern (b) can be implemented as follows.

class Inception(nn.Module):
    
    def __init__(self, in_channels, ch1x1, ch3x3, ch5x5, pool_proj):
        super(Inception, self).__init__()
        
        # 1x1 conv
        self.branch1 = nn.Sequential(
            nn.Conv2d(in_channels, ch1x1, kernel_size=1),
            nn.ReLU(inplace=True)
        )
        
        # 1x1 conv -> 3x3 conv
        self.branch2 = nn.Sequential(
            nn.Conv2d(in_channels, ch1x1, kernel_size=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(ch1x1, ch3x3, kernel_size=3, padding=1),
            nn.ReLU(inplace=True)
        )
        
        # 1x1 conv -> 5x5 conv
        self.branch3 = nn.Sequential(
            nn.Conv2d(in_channels, ch1x1, kernel_size=1),
            nn.ReLU(inplace=True),
            # kernel_size=5 with padding=2 keeps (H, W) unchanged so the branches can be concatenated
            nn.Conv2d(ch1x1, ch5x5, kernel_size=5, padding=2),
            nn.ReLU(inplace=True)
        )
        
        # 3x3 pool -> 1x1 conv
        self.branch4 = nn.Sequential(
            nn.MaxPool2d(kernel_size=3, stride=1, padding=1),
            nn.Conv2d(in_channels, pool_proj, kernel_size=1),
            nn.ReLU(inplace=True)
        )
        
    def forward(self, x):
        branch1 = self.branch1(x)
        branch2 = self.branch2(x)
        branch3 = self.branch3(x)
        branch4 = self.branch4(x)
        
        return torch.cat([branch1, branch2, branch3, branch4], 1)

Global Average Pooling (GAP)

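Global Average Pooling was also adopted in GoogLeNet. It is used in place of Flatten when connecting the feature maps of a CNN to a fully connected layer.

Flatten vectorizes the feature maps by taking every pixel in turn and lining them up, whereas GAP reduces each feature map to a single $1\times1$ value with Average Pooling and lines up those values, giving one value per channel.

[Figure 30_4]

In PyTorch, it is available as nn.AdaptiveAvgPool2d((1, 1)).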
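
To make the difference concrete, the following minimal sketch applies both Flatten and GAP to the same tensor. The input size (a batch of 2 with 512 channels of $7\times7$) is an assumption chosen for illustration.

```python
import torch
import torch.nn as nn

# A batch of 2 feature maps with 512 channels of size 7 x 7
x = torch.randn(2, 512, 7, 7)

# Flatten: every pixel of every channel becomes one element of the vector
flat = torch.flatten(x, 1)

# GAP: each channel is averaged down to a single value
gap = nn.AdaptiveAvgPool2d((1, 1))
pooled = torch.flatten(gap(x), 1)

print(flat.shape)    # 512 * 7 * 7 values per sample
print(pooled.shape)  # 512 values per sample

# GAP is the same as taking the mean over the spatial dimensions
same = torch.allclose(pooled, x.mean(dim=(2, 3)), atol=1e-6)
print(same)
```

Because the vector is so much shorter, the fully connected layer that follows GAP needs far fewer weights than one that follows Flatten.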

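ResNet (2015)

Paper: https://arxiv.org/pdf/1512.03385.pdf

Residual module

The defining feature of ResNet, and the origin of its name, is the Residual module.

Deeper networks had been found to improve accuracy, which motivated making them deeper still, but making a network too deep ran into the problem of vanishing gradients during backpropagation. The Residual module was proposed as a way around this problem.

[Figure 30_5]

Source: https://arxiv.org/pdf/1512.03385.pdf

The mechanism is to branch off the input to the convolution layers and add it, element-wise, to the output of the convolution layers further ahead (a skip connection).

With this, differentiating during backpropagation gives

$\frac{d}{d x}(f(x) + x) = \frac{d}{d x}f(x) + 1$

so the derivative gains an extra 1 compared with the plain case. Because the gradient can flow through this identity path, vanishing gradients are mitigated and much deeper networks become trainable.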
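
The extra 1 in the derivative can be observed with autograd. This is a toy sketch, assuming a scalar function $f(x) = wx$ with a deliberately small weight $w$ to stand in for a layer whose gradient is close to vanishing.

```python
import torch

w = torch.tensor(0.001)  # a small weight, so f(x) = w * x has a tiny gradient
x_plain = torch.tensor(2.0, requires_grad=True)
x_res = torch.tensor(2.0, requires_grad=True)

# Plain path: y = f(x)
(w * x_plain).backward()

# Residual path: y = f(x) + x
(w * x_res + x_res).backward()

print(x_plain.grad)  # df/dx = w
print(x_res.grad)    # df/dx + 1
```

Even when the gradient of $f$ is tiny, the gradient through the residual path stays close to 1.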

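Bottleneck module

A module that adds $1\times1$ convolutions to the Residual module, reducing the number of parameters so that training is more efficient, was also proposed.

[Figure 30_6]

Source: https://arxiv.org/pdf/1512.03385.pdf

The number of channels first shrinks in the middle of the block and then returns to its original size on the way out; because this resembles the shape of a bottleneck, it is called the Bottleneck module.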
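
The effect on the parameter count can be sketched as follows. The channel sizes (256 outside the block, 64 inside) are an assumption chosen for illustration, Batch Normalization and ReLU are omitted, and `count_params` is a helper name introduced here.

```python
import torch.nn as nn


def count_params(module):
    """Total number of learnable parameters in a module."""
    return sum(p.numel() for p in module.parameters())


# Two 3x3 convolutions kept at 256 channels (no bottleneck)
plain = nn.Sequential(
    nn.Conv2d(256, 256, kernel_size=3, padding=1, bias=False),
    nn.Conv2d(256, 256, kernel_size=3, padding=1, bias=False),
)

# Bottleneck: 1x1 reduces 256 -> 64, 3x3 works on 64 channels, 1x1 restores 64 -> 256
bottleneck = nn.Sequential(
    nn.Conv2d(256, 64, kernel_size=1, bias=False),
    nn.Conv2d(64, 64, kernel_size=3, padding=1, bias=False),
    nn.Conv2d(64, 256, kernel_size=1, bias=False),
)

print(count_params(plain))
print(count_params(bottleneck))
```

The $3\times3$ convolution is the expensive part, so running it on 64 channels instead of 256 accounts for most of the saving.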

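He initialization

ResNet also uses He initialization, a weight initialization scheme suited to ReLU activations that the same authors proposed shortly before ResNet. Previously, initial weights were typically drawn at random from a zero-mean normal distribution with a fixed standard deviation, such as 1.

With He initialization, when a layer receives $n$ inputs from the previous layer, the initial weights are drawn from a normal distribution with mean 0 and standard deviation $\sqrt{\frac{2}{n}}$.

In PyTorch, this can be set with nn.init.kaiming_normal_.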
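
As a quick check, the sketch below initializes a linear layer and compares the empirical standard deviation of its weights with $\sqrt{\frac{2}{n}}$. The layer sizes (1024 inputs, 512 outputs) are arbitrary values assumed for illustration.

```python
import math
import torch
import torch.nn as nn

torch.manual_seed(0)

n = 1024                   # number of inputs from the previous layer (fan-in)
layer = nn.Linear(n, 512)  # weight has shape (512, 1024)

# mode='fan_in' uses n = number of inputs; nonlinearity='relu' gives the factor 2
nn.init.kaiming_normal_(layer.weight, mode='fan_in', nonlinearity='relu')

expected_std = math.sqrt(2.0 / n)
actual_std = layer.weight.std().item()

print(expected_std, actual_std)
```

The ResNet script in this chapter passes mode='fan_out' instead, which uses the number of outputs in place of $n$.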

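Batch Normalization

Batch Normalization, introduced in the chapter on improving accuracy, is also used throughout ResNet, after each convolution layer.

Variations

Architectures as deep as 152 layers have been proposed for ResNet.

The configuration of each variation is as follows.

[Figure 30_7]

Source: https://arxiv.org/pdf/1512.03385.pdf

The following variations are available:

ResNet18
ResNet34
ResNet50
ResNet101
ResNet152

For reference, the full ResNet script is shown below. This is the ResNet implementation found in the torchvision module; there is no need to memorize it, so read it only if you are interested.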

https://github.com/pytorch/vision/blob/master/torchvision/models/resnet.py

def conv3x3(in_planes, out_planes, stride=1, groups=1, dilation=1):
    """3x3 convolution with padding"""
    return nn.Conv2d(in_planes, out_planes, kernel_size=3, stride=stride,
                     padding=dilation, groups=groups, bias=False, dilation=dilation)


def conv1x1(in_planes, out_planes, stride=1):
    """1x1 convolution"""
    return nn.Conv2d(in_planes, out_planes, kernel_size=1, stride=stride, bias=False)


class BasicBlock(nn.Module):
    expansion = 1

    def __init__(self, inplanes, planes, stride=1, downsample=None, groups=1,
                 base_width=64, dilation=1, norm_layer=None):
        super(BasicBlock, self).__init__()
        if norm_layer is None:
            norm_layer = nn.BatchNorm2d
        if groups != 1 or base_width != 64:
            raise ValueError('BasicBlock only supports groups=1 and base_width=64')
        if dilation > 1:
            raise NotImplementedError("Dilation > 1 not supported in BasicBlock")
        # Both self.conv1 and self.downsample layers downsample the input when stride != 1
        self.conv1 = conv3x3(inplanes, planes, stride)
        self.bn1 = norm_layer(planes)
        self.relu = nn.ReLU(inplace=True)
        self.conv2 = conv3x3(planes, planes)
        self.bn2 = norm_layer(planes)
        self.downsample = downsample
        self.stride = stride

    def forward(self, x):
        identity = x

        out = self.conv1(x)
        out = self.bn1(out)
        out = self.relu(out)

        out = self.conv2(out)
        out = self.bn2(out)

        if self.downsample is not None:
            identity = self.downsample(x)

        out += identity
        out = self.relu(out)

        return out


class Bottleneck(nn.Module):
    expansion = 4

    def __init__(self, inplanes, planes, stride=1, downsample=None, groups=1,
                 base_width=64, dilation=1, norm_layer=None):
        super(Bottleneck, self).__init__()
        if norm_layer is None:
            norm_layer = nn.BatchNorm2d
        width = int(planes * (base_width / 64.)) * groups
        # Both self.conv2 and self.downsample layers downsample the input when stride != 1
        self.conv1 = conv1x1(inplanes, width)
        self.bn1 = norm_layer(width)
        self.conv2 = conv3x3(width, width, stride, groups, dilation)
        self.bn2 = norm_layer(width)
        self.conv3 = conv1x1(width, planes * self.expansion)
        self.bn3 = norm_layer(planes * self.expansion)
        self.relu = nn.ReLU(inplace=True)
        self.downsample = downsample
        self.stride = stride

    def forward(self, x):
        identity = x

        out = self.conv1(x)
        out = self.bn1(out)
        out = self.relu(out)

        out = self.conv2(out)
        out = self.bn2(out)
        out = self.relu(out)

        out = self.conv3(out)
        out = self.bn3(out)

        if self.downsample is not None:
            identity = self.downsample(x)

        out += identity
        out = self.relu(out)

        return out

class ResNet(nn.Module):

    def __init__(self, block, layers, num_classes=1000, zero_init_residual=False,
                 groups=1, width_per_group=64, replace_stride_with_dilation=None,
                 norm_layer=None):
        super(ResNet, self).__init__()
        if norm_layer is None:
            norm_layer = nn.BatchNorm2d
        self._norm_layer = norm_layer

        self.inplanes = 64
        self.dilation = 1
        if replace_stride_with_dilation is None:
            # each element in the tuple indicates if we should replace
            # the 2x2 stride with a dilated convolution instead
            replace_stride_with_dilation = [False, False, False]
        if len(replace_stride_with_dilation) != 3:
            raise ValueError("replace_stride_with_dilation should be None "
                             "or a 3-element tuple, got {}".format(replace_stride_with_dilation))
        self.groups = groups
        self.base_width = width_per_group
        self.conv1 = nn.Conv2d(3, self.inplanes, kernel_size=7, stride=2, padding=3,
                               bias=False)
        self.bn1 = norm_layer(self.inplanes)
        self.relu = nn.ReLU(inplace=True)
        self.maxpool = nn.MaxPool2d(kernel_size=3, stride=2, padding=1)
        self.layer1 = self._make_layer(block, 64, layers[0])
        self.layer2 = self._make_layer(block, 128, layers[1], stride=2,
                                       dilate=replace_stride_with_dilation[0])
        self.layer3 = self._make_layer(block, 256, layers[2], stride=2,
                                       dilate=replace_stride_with_dilation[1])
        self.layer4 = self._make_layer(block, 512, layers[3], stride=2,
                                       dilate=replace_stride_with_dilation[2])
        self.avgpool = nn.AdaptiveAvgPool2d((1, 1))
        self.fc = nn.Linear(512 * block.expansion, num_classes)
        
        # He initialization
        for m in self.modules():
            if isinstance(m, nn.Conv2d):
                nn.init.kaiming_normal_(m.weight, mode='fan_out', nonlinearity='relu')
            elif isinstance(m, (nn.BatchNorm2d, nn.GroupNorm)):
                nn.init.constant_(m.weight, 1)
                nn.init.constant_(m.bias, 0)

        # Zero-initialize the last BN in each residual branch,
        # so that the residual branch starts with zeros, and each residual block behaves like an identity.
        # This improves the model by 0.2~0.3% according to https://arxiv.org/abs/1706.02677
        if zero_init_residual:
            for m in self.modules():
                if isinstance(m, Bottleneck):
                    nn.init.constant_(m.bn3.weight, 0)
                elif isinstance(m, BasicBlock):
                    nn.init.constant_(m.bn2.weight, 0)

    def _make_layer(self, block, planes, blocks, stride=1, dilate=False):
        norm_layer = self._norm_layer
        downsample = None
        previous_dilation = self.dilation
        if dilate:
            self.dilation *= stride
            stride = 1
        if stride != 1 or self.inplanes != planes * block.expansion:
            downsample = nn.Sequential(
                conv1x1(self.inplanes, planes * block.expansion, stride),
                norm_layer(planes * block.expansion),
            )

        layers = []
        layers.append(block(self.inplanes, planes, stride, downsample, self.groups,
                            self.base_width, previous_dilation, norm_layer))
        self.inplanes = planes * block.expansion
        for _ in range(1, blocks):
            layers.append(block(self.inplanes, planes, groups=self.groups,
                                base_width=self.base_width, dilation=self.dilation,
                                norm_layer=norm_layer))

        return nn.Sequential(*layers)

    def forward(self, x):
        x = self.conv1(x)
        x = self.bn1(x)
        x = self.relu(x)
        x = self.maxpool(x)

        x = self.layer1(x)
        x = self.layer2(x)
        x = self.layer3(x)
        x = self.layer4(x)

        x = self.avgpool(x)
        x = torch.flatten(x, 1)
        x = self.fc(x)

        return x

resnet18 = ResNet(BasicBlock, [2, 2, 2, 2])  # ResNet18
summary(resnet18, (3, 224, 224), device='cpu')

resnet50 = ResNet(Bottleneck, [3, 4, 6, 3])  # ResNet50
summary(resnet50, (3, 224, 224), device='cpu')

----------------------------------------------------------------
        Layer (type)               Output Shape         Param #
================================================================
            Conv2d-1         [-1, 64, 112, 112]           9,408
       BatchNorm2d-2         [-1, 64, 112, 112]             128
              ReLU-3         [-1, 64, 112, 112]               0
         MaxPool2d-4           [-1, 64, 56, 56]               0
            Conv2d-5           [-1, 64, 56, 56]           4,096
       BatchNorm2d-6           [-1, 64, 56, 56]             128
              ReLU-7           [-1, 64, 56, 56]               0
            Conv2d-8           [-1, 64, 56, 56]          36,864
       BatchNorm2d-9           [-1, 64, 56, 56]             128
             ReLU-10           [-1, 64, 56, 56]               0
           Conv2d-11          [-1, 256, 56, 56]          16,384
      BatchNorm2d-12          [-1, 256, 56, 56]             512
           Conv2d-13          [-1, 256, 56, 56]          16,384
      BatchNorm2d-14          [-1, 256, 56, 56]             512
             ReLU-15          [-1, 256, 56, 56]               0
       Bottleneck-16          [-1, 256, 56, 56]               0
           Conv2d-17           [-1, 64, 56, 56]          16,384
      BatchNorm2d-18           [-1, 64, 56, 56]             128
             ReLU-19           [-1, 64, 56, 56]               0
           Conv2d-20           [-1, 64, 56, 56]          36,864
      BatchNorm2d-21           [-1, 64, 56, 56]             128
             ReLU-22           [-1, 64, 56, 56]               0
           Conv2d-23          [-1, 256, 56, 56]          16,384
      BatchNorm2d-24          [-1, 256, 56, 56]             512
             ReLU-25          [-1, 256, 56, 56]               0
       Bottleneck-26          [-1, 256, 56, 56]               0
           Conv2d-27           [-1, 64, 56, 56]          16,384
      BatchNorm2d-28           [-1, 64, 56, 56]             128
             ReLU-29           [-1, 64, 56, 56]               0
           Conv2d-30           [-1, 64, 56, 56]          36,864
      BatchNorm2d-31           [-1, 64, 56, 56]             128
             ReLU-32           [-1, 64, 56, 56]               0
           Conv2d-33          [-1, 256, 56, 56]          16,384
      BatchNorm2d-34          [-1, 256, 56, 56]             512
             ReLU-35          [-1, 256, 56, 56]               0
       Bottleneck-36          [-1, 256, 56, 56]               0
           Conv2d-37          [-1, 128, 56, 56]          32,768
      BatchNorm2d-38          [-1, 128, 56, 56]             256
             ReLU-39          [-1, 128, 56, 56]               0
           Conv2d-40          [-1, 128, 28, 28]         147,456
      BatchNorm2d-41          [-1, 128, 28, 28]             256
             ReLU-42          [-1, 128, 28, 28]               0
           Conv2d-43          [-1, 512, 28, 28]          65,536
      BatchNorm2d-44          [-1, 512, 28, 28]           1,024
           Conv2d-45          [-1, 512, 28, 28]         131,072
      BatchNorm2d-46          [-1, 512, 28, 28]           1,024
             ReLU-47          [-1, 512, 28, 28]               0
       Bottleneck-48          [-1, 512, 28, 28]               0
           Conv2d-49          [-1, 128, 28, 28]          65,536
      BatchNorm2d-50          [-1, 128, 28, 28]             256
             ReLU-51          [-1, 128, 28, 28]               0
           Conv2d-52          [-1, 128, 28, 28]         147,456
      BatchNorm2d-53          [-1, 128, 28, 28]             256
             ReLU-54          [-1, 128, 28, 28]               0
           Conv2d-55          [-1, 512, 28, 28]          65,536
      BatchNorm2d-56          [-1, 512, 28, 28]           1,024
             ReLU-57          [-1, 512, 28, 28]               0
       Bottleneck-58          [-1, 512, 28, 28]               0
           Conv2d-59          [-1, 128, 28, 28]          65,536
      BatchNorm2d-60          [-1, 128, 28, 28]             256
             ReLU-61          [-1, 128, 28, 28]               0
           Conv2d-62          [-1, 128, 28, 28]         147,456
      BatchNorm2d-63          [-1, 128, 28, 28]             256
             ReLU-64          [-1, 128, 28, 28]               0
           Conv2d-65          [-1, 512, 28, 28]          65,536
      BatchNorm2d-66          [-1, 512, 28, 28]           1,024
             ReLU-67          [-1, 512, 28, 28]               0
       Bottleneck-68          [-1, 512, 28, 28]               0
           Conv2d-69          [-1, 128, 28, 28]          65,536
      BatchNorm2d-70          [-1, 128, 28, 28]             256
             ReLU-71          [-1, 128, 28, 28]               0
           Conv2d-72          [-1, 128, 28, 28]         147,456
      BatchNorm2d-73          [-1, 128, 28, 28]             256
             ReLU-74          [-1, 128, 28, 28]               0
           Conv2d-75          [-1, 512, 28, 28]          65,536
      BatchNorm2d-76          [-1, 512, 28, 28]           1,024
             ReLU-77          [-1, 512, 28, 28]               0
       Bottleneck-78          [-1, 512, 28, 28]               0
           Conv2d-79          [-1, 256, 28, 28]         131,072
      BatchNorm2d-80          [-1, 256, 28, 28]             512
             ReLU-81          [-1, 256, 28, 28]               0
           Conv2d-82          [-1, 256, 14, 14]         589,824
      BatchNorm2d-83          [-1, 256, 14, 14]             512
             ReLU-84          [-1, 256, 14, 14]               0
           Conv2d-85         [-1, 1024, 14, 14]         262,144
      BatchNorm2d-86         [-1, 1024, 14, 14]           2,048
           Conv2d-87         [-1, 1024, 14, 14]         524,288
      BatchNorm2d-88         [-1, 1024, 14, 14]           2,048
             ReLU-89         [-1, 1024, 14, 14]               0
       Bottleneck-90         [-1, 1024, 14, 14]               0
           Conv2d-91          [-1, 256, 14, 14]         262,144
      BatchNorm2d-92          [-1, 256, 14, 14]             512
             ReLU-93          [-1, 256, 14, 14]               0
           Conv2d-94          [-1, 256, 14, 14]         589,824
      BatchNorm2d-95          [-1, 256, 14, 14]             512
             ReLU-96          [-1, 256, 14, 14]               0
           Conv2d-97         [-1, 1024, 14, 14]         262,144
      BatchNorm2d-98         [-1, 1024, 14, 14]           2,048
             ReLU-99         [-1, 1024, 14, 14]               0
      Bottleneck-100         [-1, 1024, 14, 14]               0
          Conv2d-101          [-1, 256, 14, 14]         262,144
     BatchNorm2d-102          [-1, 256, 14, 14]             512
            ReLU-103          [-1, 256, 14, 14]               0
          Conv2d-104          [-1, 256, 14, 14]         589,824
     BatchNorm2d-105          [-1, 256, 14, 14]             512
            ReLU-106          [-1, 256, 14, 14]               0
          Conv2d-107         [-1, 1024, 14, 14]         262,144
     BatchNorm2d-108         [-1, 1024, 14, 14]           2,048
            ReLU-109         [-1, 1024, 14, 14]               0
      Bottleneck-110         [-1, 1024, 14, 14]               0
          Conv2d-111          [-1, 256, 14, 14]         262,144
     BatchNorm2d-112          [-1, 256, 14, 14]             512
            ReLU-113          [-1, 256, 14, 14]               0
          Conv2d-114          [-1, 256, 14, 14]         589,824
     BatchNorm2d-115          [-1, 256, 14, 14]             512
            ReLU-116          [-1, 256, 14, 14]               0
          Conv2d-117         [-1, 1024, 14, 14]         262,144
     BatchNorm2d-118         [-1, 1024, 14, 14]           2,048
            ReLU-119         [-1, 1024, 14, 14]               0
      Bottleneck-120         [-1, 1024, 14, 14]               0
          Conv2d-121          [-1, 256, 14, 14]         262,144
     BatchNorm2d-122          [-1, 256, 14, 14]             512
            ReLU-123          [-1, 256, 14, 14]               0
          Conv2d-124          [-1, 256, 14, 14]         589,824
     BatchNorm2d-125          [-1, 256, 14, 14]             512
            ReLU-126          [-1, 256, 14, 14]               0
          Conv2d-127         [-1, 1024, 14, 14]         262,144
     BatchNorm2d-128         [-1, 1024, 14, 14]           2,048
            ReLU-129         [-1, 1024, 14, 14]               0
      Bottleneck-130         [-1, 1024, 14, 14]               0
          Conv2d-131          [-1, 256, 14, 14]         262,144
     BatchNorm2d-132          [-1, 256, 14, 14]             512
            ReLU-133          [-1, 256, 14, 14]               0
          Conv2d-134          [-1, 256, 14, 14]         589,824
     BatchNorm2d-135          [-1, 256, 14, 14]             512
            ReLU-136          [-1, 256, 14, 14]               0
          Conv2d-137         [-1, 1024, 14, 14]         262,144
     BatchNorm2d-138         [-1, 1024, 14, 14]           2,048
            ReLU-139         [-1, 1024, 14, 14]               0
      Bottleneck-140         [-1, 1024, 14, 14]               0
          Conv2d-141          [-1, 512, 14, 14]         524,288
     BatchNorm2d-142          [-1, 512, 14, 14]           1,024
            ReLU-143          [-1, 512, 14, 14]               0
          Conv2d-144            [-1, 512, 7, 7]       2,359,296
     BatchNorm2d-145            [-1, 512, 7, 7]           1,024
            ReLU-146            [-1, 512, 7, 7]               0
          Conv2d-147           [-1, 2048, 7, 7]       1,048,576
     BatchNorm2d-148           [-1, 2048, 7, 7]           4,096
          Conv2d-149           [-1, 2048, 7, 7]       2,097,152
     BatchNorm2d-150           [-1, 2048, 7, 7]           4,096
            ReLU-151           [-1, 2048, 7, 7]               0
      Bottleneck-152           [-1, 2048, 7, 7]               0
          Conv2d-153            [-1, 512, 7, 7]       1,048,576
     BatchNorm2d-154            [-1, 512, 7, 7]           1,024
            ReLU-155            [-1, 512, 7, 7]               0
          Conv2d-156            [-1, 512, 7, 7]       2,359,296
     BatchNorm2d-157            [-1, 512, 7, 7]           1,024
            ReLU-158            [-1, 512, 7, 7]               0
          Conv2d-159           [-1, 2048, 7, 7]       1,048,576
     BatchNorm2d-160           [-1, 2048, 7, 7]           4,096
            ReLU-161           [-1, 2048, 7, 7]               0
      Bottleneck-162           [-1, 2048, 7, 7]               0
          Conv2d-163            [-1, 512, 7, 7]       1,048,576
     BatchNorm2d-164            [-1, 512, 7, 7]           1,024
            ReLU-165            [-1, 512, 7, 7]               0
          Conv2d-166            [-1, 512, 7, 7]       2,359,296
     BatchNorm2d-167            [-1, 512, 7, 7]           1,024
            ReLU-168            [-1, 512, 7, 7]               0
          Conv2d-169           [-1, 2048, 7, 7]       1,048,576
     BatchNorm2d-170           [-1, 2048, 7, 7]           4,096
            ReLU-171           [-1, 2048, 7, 7]               0
      Bottleneck-172           [-1, 2048, 7, 7]               0
AdaptiveAvgPool2d-173          [-1, 2048, 1, 1]               0
          Linear-174                 [-1, 1000]       2,049,000
================================================================
Total params: 25,557,032
Trainable params: 25,557,032
Non-trainable params: 0
----------------------------------------------------------------
Input size (MB): 0.57
Forward/backward pass size (MB): 286.56
Params size (MB): 97.49
Estimated Total Size (MB): 384.62
----------------------------------------------------------------

MobileNet (2017)

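MobileNet is a model that achieves high prediction accuracy while keeping the model size small. It is used as a backbone in settings where speed matters, such as object detection.

Depthwise Separable Convolution

Depthwise Separable Convolution is the device that contributes most to the reduction in model size. By decomposing the convolution into the following two steps, it greatly reduces the number of parameters compared with an ordinary convolution.

Depthwise Convolution
Pointwise Convolution

[Figure 30_8]

Source: https://arxiv.org/pdf/1704.04861.pdf

Depthwise Convolution

For a 3-channel image, one filter is prepared for each channel and the convolution is performed channel by channel. Unlike an ordinary convolution, the results computed for the individual channels are not summed together.

In PyTorch, this can be done by setting nn.Conv2d(groups=<number of input channels>), with the number of output channels equal to the number of input channels.

Pointwise Convolution

Depthwise Convolution outputs a feature map with as many channels as the input image.

Pointwise Convolution then convolves that feature map with $1\times1$ filters, one for each channel of the output feature map. Think of it as a convolution along the channel direction.

In PyTorch, this can be done by setting nn.Conv2d(kernel_size=1).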
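
The two steps can be tried out on a small tensor. This is a minimal sketch; the channel counts (3 in, 10 out) and the $8\times8$ input size are assumptions chosen for illustration.

```python
import torch
import torch.nn as nn

in_ch, out_ch = 3, 10

# Depthwise: groups=in_ch gives each input channel its own 3x3 filter
depthwise = nn.Conv2d(in_ch, in_ch, kernel_size=3, padding=1, groups=in_ch, bias=False)

# Pointwise: 1x1 filters mix the channels and set the number of output channels
pointwise = nn.Conv2d(in_ch, out_ch, kernel_size=1, bias=False)

x = torch.randn(1, in_ch, 8, 8)
x2 = x.clone()
x2[:, 0] += 1.0  # change only channel 0 of the input

with torch.no_grad():
    mid = depthwise(x)
    out = pointwise(mid)
    mid2 = depthwise(x2)

print(depthwise.weight.shape)  # (in_ch, 1, 3, 3): one single-channel filter per input channel
print(pointwise.weight.shape)  # (out_ch, in_ch, 1, 1)
print(mid.shape, out.shape)

# Channels 1 and 2 of the depthwise output do not depend on input channel 0
unchanged = torch.allclose(mid[:, 1:], mid2[:, 1:])
print(unchanged)
```

The last check illustrates that the depthwise step never mixes information across channels; that mixing is left entirely to the pointwise step.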

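Comparison of parameter counts with an ordinary convolution layer

Number of input channels: 3
Number of output channels: 10

Under these conditions, an ordinary $3\times3$ convolution layer has $3 \times 3 \times 3 \times 10 = 270$ parameters, excluding biases.

[Figure 30_9]

With Depthwise Separable Convolution, on the other hand, the counts are

Depthwise: $3 \times 3 \times 3 = 27$
Pointwise: $1 \times 1 \times 3 \times 10 = 30$
Total: $27 + 30 = 57$

which is roughly a fifth of the ordinary convolution.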
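
These counts can be confirmed by building the layers and counting their weights. The sketch below assumes bias=False throughout; `count_params` is a helper name introduced here, and the 256-channel case at the end is an extra illustration that does not appear in the text above.

```python
import torch.nn as nn


def count_params(module):
    """Total number of learnable parameters in a module."""
    return sum(p.numel() for p in module.parameters())


in_ch, out_ch = 3, 10

standard = nn.Conv2d(in_ch, out_ch, kernel_size=3, bias=False)

separable = nn.Sequential(
    nn.Conv2d(in_ch, in_ch, kernel_size=3, groups=in_ch, bias=False),  # depthwise
    nn.Conv2d(in_ch, out_ch, kernel_size=1, bias=False),               # pointwise
)

print(count_params(standard), count_params(separable))

# The gap widens with more channels, e.g. 256 -> 256
big_standard = 3 * 3 * 256 * 256
big_separable = 3 * 3 * 256 + 256 * 256
print(big_standard, big_separable)
```

With realistic channel counts the pointwise term dominates, so the saving approaches a factor of about 9 for $3\times3$ filters.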

[Figure 30_10]

class Block(nn.Module):
    
    def __init__(self, in_planes, out_planes, stride=1):
        super(Block, self).__init__()
        
        # Depthwise Convolution
        self.conv1 = nn.Conv2d(in_planes, in_planes, kernel_size=3, stride=stride, padding=1, groups=in_planes, bias=False)
        self.bn1 = nn.BatchNorm2d(in_planes)
        
        # Pointwise Convolution
        self.conv2 = nn.Conv2d(in_planes, out_planes, kernel_size=1, stride=1, padding=0, bias=False)
        self.bn2= nn.BatchNorm2d(out_planes)
        
    def forward(self, x):
        out = F.relu(self.bn1(self.conv1(x)))
        out = F.relu(self.bn2(self.conv2(out)))
        return out

class MobileNet(nn.Module):
    
    # (128, 2): planes=128, stride=2
    cfg = [64, (128,2), 128, (256,2), 256, (512,2), 512, 512, 512, 512, 512, (1024,2), 1024]
    
    def __init__(self, n_classes=1000):
        super(MobileNet, self).__init__()
        
        # stride=2 with padding=1 halves the input: 224 * 224 -> 112 * 112
        self.conv1 = nn.Conv2d(3, 32, kernel_size=3, stride=2, padding=1, bias=False)
        self.bn1 = nn.BatchNorm2d(32) 
        self.layers = self._make_layers(in_planes=32)
        # Global Average Pooling: 7 * 7 -> 1 * 1, so the classifier sees 1024 features
        self.avgpool = nn.AdaptiveAvgPool2d((1, 1))
        self.linear = nn.Linear(1024, n_classes)
        
    def _make_layers(self, in_planes):
        layers = []
        for x in self.cfg:
            # x itself if x is an int, otherwise (i.e. a tuple) x[0]
            out_planes = x if isinstance(x, int) else x[0]
            
            stride = 1 if isinstance(x, int) else x[1]
            layers.append(Block(in_planes, out_planes, stride))
            in_planes = out_planes
        
        return nn.Sequential(*layers)
    
    def forward(self, x):
        out = F.relu(self.bn1(self.conv1(x)))
        out = self.layers(out)
        out = self.avgpool(out)
        out = out.view(out.size(0), -1)
        out = self.linear(out)
        return out

model = MobileNet()

summary(model, (3, 224, 224), device='cpu')

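As this progression shows, CNNs have evolved toward higher accuracy with fewer parameters. Research in this area remains active, so it will be interesting to see what models are published next.

The models introduced in this chapter are only a subset of the well-known ones, but they were chosen because they contain many of the most important architectural ideas. You do not need to memorize the internals of every model, but having a rough sense of what characterizes each one will be a strong asset as you go further into computer vision. Keep learning.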