Rethinking the Inception Architecture for Computer Vision: Summary [Inception-V2 / Inception-V3]

AI/Computer Vision 2021. 9. 14. 09:25

The preceding work, Going deeper with convolutions (GoogLeNet), beat VGG and won ILSVRC 2014, but its architecture was complex and still computationally heavy, so it was hard to adapt and not widely used.

To improve on this, the follow-up paper, Rethinking the Inception Architecture for Computer Vision, runs experiments on reducing computation and presents the models that apply the results: Inception-V2 and Inception-V3.

0. Abstract

This paper introduces Inception-v2 and Inception-v3.

Increasing model size and computational cost tends to bring quality gains on most tasks, but a low parameter count and computational efficiency still matter -> so how can both be achieved?

(1) Suitably factorized convolutions (2) Aggressive regularization

1. Introduction

The Inception module in the earlier GoogLeNet did improve performance, but it had a few limitations.

1) The complexity of the Inception architecture makes it more difficult to make changes to the network

[The complexity of the Inception architecture makes it hard to change the network at all]

2) Also, the GoogLeNet paper does not provide a clear description of the contributing factors that lead to the various design decisions of the GoogLeNet architecture.

[The design principles behind the GoogLeNet architecture are not clearly stated]

-> So this paper describes general principles and optimization ideas for scaling up convolutional networks in efficient ways!

2. General Design Principles

This part lays out design principles for the various architectural decisions in a CNN.

1) Avoid representational bottlenecks, especially early in the network

-> Avoid extreme information compression caused by bottlenecks

2) Higher dimensional representations are easier to process locally within a network

-> Increasing the activations per tile yields more * disentangled features, and the network trains faster.

* disentangled feature: the latent variables that represent an image are separated into several factors, each carrying information about a different property of the image

3) Spatial aggregation can be done over lower dimensional embeddings without much or any loss in representational power

4) Balance the width and depth of the network

-> Increasing either the width or the depth of the model helps, but the optimal improvement comes when width and depth grow in balance!

3. Factorizing Convolutions with Large Filter Size

* Factorizing: decomposition, i.e. expressing a number or mathematical object as a product of several factors

=> Much of the gain of the GoogLeNet network comes from its generous use of dimension reduction!

Section 3 looks at convolution factorization techniques aimed at improving the computational efficiency of the model!

Because Inception networks are fully convolutional, each weight corresponds to one multiplication per activation. Therefore, any reduction in computational cost also reduces the number of parameters.

=> This means that with suitable factorization (see the definition above) we end up with more disentangled parameters, and therefore with faster training.

=> The savings in computation and memory can also be used to increase the filter-bank sizes of the network while keeping the ability to train each model replica on a single computer.
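
The link between weights and multiplications stated above can be checked with a little arithmetic. This is only a sketch: `conv_params` and `conv_macs` are hypothetical helper names, the bias is ignored, and the 64-channel layer on a 35x35 output grid is an assumed example, not a layer from the paper.

```python
def conv_params(in_ch, out_ch, kh, kw):
    """Number of weights in a kh x kw convolution (bias ignored)."""
    return in_ch * out_ch * kh * kw


def conv_macs(in_ch, out_ch, kh, kw, out_h, out_w):
    """Multiply-accumulates: every weight is used once per output position."""
    return conv_params(in_ch, out_ch, kh, kw) * out_h * out_w


# Assumed example layer: 64 -> 64 channels, 5x5 kernel, 35x35 output grid.
params = conv_params(64, 64, 5, 5)
macs = conv_macs(64, 64, 5, 5, 35, 35)

# The ratio is exactly the number of output positions, so cutting weights
# cuts multiplications by the same factor.
print(params, macs, macs // params)
```

Since the multiplication count is the weight count times the number of output positions, any factorization that removes weights removes the same fraction of the computation.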

3.1 Factorization into smaller convolutions

= replacing one large convolution with a stack of smaller ones

The larger a spatial filter, the better it can capture dependencies between unit activations that are far apart in the image, but the more expensive the convolution becomes. So how can a 5x5 convolution be replaced by a multi-layer network with the same input size and output depth but fewer parameters?

Replace the $5 \times 5$ conv, which acts like a small fully-connected component sliding over $5 \times 5$ tiles, with two stacked $3 \times 3$ conv layers!

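Doing so shares weights between adjacent tiles and reduces the number of parameters. The idea is that, because the network is translation invariant (= the property of convolution that a learned pattern is still captured when the input is shifted), replacing the single conv with two convs should be fine.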
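
That two stacked 3x3 convs see the same 5x5 window as a single 5x5 conv can be verified with the usual receptive-field recurrence. A minimal sketch, where `receptive_field` is a hypothetical helper written for this post and only one spatial axis is tracked:

```python
def receptive_field(layers):
    """Receptive field along one axis of a stack of conv layers.

    `layers` is a list of (kernel_size, stride) pairs, first layer first.
    """
    field, jump = 1, 1
    for kernel, stride in layers:
        field += (kernel - 1) * jump  # each layer widens the window
        jump *= stride                # stride spreads later layers further apart
    return field


single_5x5 = receptive_field([(5, 1)])
two_3x3 = receptive_field([(3, 1), (3, 1)])
three_3x3 = receptive_field([(3, 1)] * 3)

print(single_5x5, two_3x3, three_3x3)
```

Both `single_5x5` and `two_3x3` evaluate to 5, and three stacked 3x3 convs reach 7, which is why larger filters can be traded for deeper stacks of 3x3 layers.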

Even a 5x5 conv is large. Replacing it with two 3x3 convs gives:
- 5x5 : 3x3 = 25 : 9 (25/9 = 2.78 times)
  - A single 5x5 conv costs about 2.78 times as much as a single 3x3 conv with the same number of filters.
- Now compare the cost of connecting two layers of the same size through a single 5x5 conv versus through two 3x3 convs.
  - 5x5xN : (3x3xN) + (3x3xN) = 25 : 9+9 = 25 : 18 (about a 28% reduction)
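
The 25 : 18 ratio can be reproduced in a few lines. This is a sketch under the same assumption as the list above (input and output both have N channels); `conv_weights` is a hypothetical helper and N = 64 is an arbitrary choice that cancels out of the ratio.

```python
def conv_weights(kernel, channels):
    """Weights in a kernel x kernel conv with `channels` inputs and outputs."""
    return kernel * kernel * channels * channels


n = 64  # assumed channel count; the saving does not depend on it
one_5x5 = conv_weights(5, n)
two_3x3 = 2 * conv_weights(3, n)
saving = 1 - two_3x3 / one_5x5

print(f"5x5: {one_5x5}, two 3x3: {two_3x3}, saving: {saving:.0%}")
```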

(Figure: Inception module with factorization into smaller convolutions applied)

BUT this raises a question: does the replacement keep the expressiveness of the 5x5 conv? The paper ran experiments on this ...

3.2 Spatial Factorization into Asymmetric Convolutions

= splitting a convolution into asymmetric ones

A convolution with a filter larger than 3x3 can always be reduced to a sequence of $3 \times 3$ convolutions, so using such large filters is generally not efficient.

One could of course factorize into even smaller units such as $2 \times 2$ convolutions, but asymmetric convolutions such as $n \times 1$ turn out to work much better.

=> That is, a conv that would normally be done as N x N is factorized into a 1 x N conv and an N x 1 conv.

Sliding a 2-layer network made of a 3x1 convolution followed by a 1x3 convolution covers the same receptive field as a $3 \times 3$ convolution.
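
A short sketch of both claims: the n x 1 and 1 x n pair covers the same n x n window, and it needs 2n instead of n^2 weights per channel pair. The helper names are hypothetical, all convs are assumed to have stride 1, and the input and output channel counts are assumed equal so that they cancel.

```python
def square_cost(n):
    """Weights per input/output channel pair in an n x n convolution."""
    return n * n


def asymmetric_cost(n):
    """Weights per channel pair in an n x 1 conv followed by a 1 x n conv."""
    return n + n


def stacked_receptive_field(kernels):
    """Receptive field (height, width) of stride-1 convs applied in sequence."""
    height = width = 1
    for kh, kw in kernels:
        height += kh - 1
        width += kw - 1
    return height, width


for n in (3, 7):
    saving = 1 - asymmetric_cost(n) / square_cost(n)
    field = stacked_receptive_field([(n, 1), (1, n)])
    print(f"n={n}: receptive field {field}, about {saving:.0%} fewer weights")
```

The saving grows with n, so the asymmetric form pays off most for the larger filters.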

(Figure: Inception module with factorization into asymmetric convolutions applied)

4. Utility of Auxiliary Classifiers

The original motivation for auxiliary classifiers was as follows.

To push useful gradients down to the lower layers so that they become useful immediately
To improve convergence during training by combating the vanishing gradient problem in very deep networks

=> However, experiments showed that auxiliary classifiers do not improve convergence early in training!

=> Auxiliary classifiers act as a regularizer!

5. Efficient Grid Size Reduction

Traditionally, CNNs shrink the grid size of the feature maps with max-pooling, and convolution and pooling always go together. So which of the two should come first to be more efficient?

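Neither order is satisfactory: pooling first is cheap but creates a representational bottleneck, while convolving first avoids the bottleneck at a much higher computational cost. So rather than running conv and pooling one after the other, it is better to run them in parallel and concatenate the results, which reduces the representational bottleneck while staying cheap!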
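
The trade-off can be put in numbers with a simplified operation count. This is a sketch under stated assumptions: a d x d grid with k channels is reduced to a (d/2) x (d/2) grid with 2k channels, pooling is treated as free, `conv_ops` is a hypothetical helper, and d = 32, k = 64 are made-up values chosen so that the grid halves evenly. The parallel variant is modelled as a stride-2 conv producing k channels next to a pooling branch that passes the other k channels through.

```python
def conv_ops(grid, in_ch, out_ch, kernel=1):
    """Multiplications for a conv evaluated on a grid x grid output."""
    return grid * grid * in_ch * out_ch * kernel * kernel


d, k = 32, 64  # assumed grid size and channel count

# Conv on the full grid (k -> 2k channels), then pool: no bottleneck, expensive.
conv_then_pool = conv_ops(d, k, 2 * k)

# Pool first, then conv on the halved grid: cheap, but the k-channel
# (d/2) x (d/2) intermediate is a representational bottleneck.
pool_then_conv = conv_ops(d // 2, k, 2 * k)

# Parallel: stride-2 conv gives k channels on the halved grid, the pooling
# branch contributes the other k channels, and the two are concatenated.
parallel = conv_ops(d // 2, k, k)

print(conv_then_pool, pool_then_conv, parallel)
```

Under these assumptions the conv-first order costs four times as much as the pool-first order, and the parallel layout is cheaper still without ever squeezing the representation through a narrow intermediate.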

6. Inception-v2

All of the new architectural ideas above (the two kinds of factorization / auxiliary classifiers / grid size reduction) are incorporated!

+) Inception-V3

Inception-V3 has the same structure as V2 but adopts the variants that gave better results. So it is better to read the model in the paper as a description of Inception-V2.

+) Inception-V4

Inception-V4 adds a few more things to the V3 model to improve performance.

~~7. Model Regularization via Label Smoothing~~

~~8. Training Methodology~~

~~9. Performance on Lower Resolution Input~~

~~10. Experimental Results and Comparisons~~

11. Conclusions

=> With convolution factorization inside the network and aggressive dimension reduction, the authors built a network with relatively low computational cost while maintaining high quality!

=> Combining a lower parameter count, auxiliary classifiers with BN, and label smoothing allows high-quality networks to be trained even on relatively modest-sized training sets.

PyTorch implementation

import torch
import torch.nn as nn

# Conv -> BatchNorm -> ReLU6 building block shared by every branch below
def ConvBNReLU(in_channels,out_channels,kernel_size,stride=1,padding=0):
    return nn.Sequential(
        nn.Conv2d(in_channels=in_channels, out_channels=out_channels, kernel_size=kernel_size, stride=stride,padding=padding),
        nn.BatchNorm2d(out_channels),
        nn.ReLU6(inplace=True),
    )

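# Same block for asymmetric kernels: kernel_sizes and paddings are [height, width] pairs, e.g. [1, 3] with [0, 1]
def ConvBNReLUFactorization(in_channels,out_channels,kernel_sizes,paddings):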
    return nn.Sequential(
        nn.Conv2d(in_channels=in_channels, out_channels=out_channels, kernel_size=kernel_sizes, stride=1,padding=paddings),
        nn.BatchNorm2d(out_channels),
        nn.ReLU6(inplace=True)
    )

# 3 x Inception-A

class InceptionV2A(nn.Module):
    def __init__(self, in_channels,out_channels1,out_channels2reduce, out_channels2, out_channels3reduce, out_channels3, out_channels4):
        super(InceptionV2A, self).__init__()

        # 1x1
        self.branch1 = ConvBNReLU(in_channels=in_channels,out_channels=out_channels1,kernel_size=1) 

        # 1x1 -> 3x3
        self.branch2 = nn.Sequential(
            ConvBNReLU(in_channels=in_channels, out_channels=out_channels2reduce, kernel_size=1),
            ConvBNReLU(in_channels=out_channels2reduce, out_channels=out_channels2, kernel_size=3, padding=1),
        )

        # 1x1 -> 3x3 -> 3x3
        self.branch3 = nn.Sequential(
            ConvBNReLU(in_channels=in_channels,out_channels=out_channels3reduce,kernel_size=1),
            ConvBNReLU(in_channels=out_channels3reduce, out_channels=out_channels3, kernel_size=3, padding=1),
            ConvBNReLU(in_channels=out_channels3, out_channels=out_channels3, kernel_size=3, padding=1),
        )

        # MaxPool -> 1x1
        self.branch4 = nn.Sequential(
            nn.MaxPool2d(kernel_size=3, stride=1, padding=1),
            ConvBNReLU(in_channels=in_channels, out_channels=out_channels4, kernel_size=1),
        )

    def forward(self, x):
        out1 = self.branch1(x)
        out2 = self.branch2(x)
        out3 = self.branch3(x)
        out4 = self.branch4(x)
        out = torch.cat([out1, out2, out3, out4], dim=1)
        return out

# 5 x Inception-B

class InceptionV2B(nn.Module):
    def __init__(self, in_channels,out_channels1,out_channels2reduce, out_channels2, out_channels3reduce, out_channels3, out_channels4):
        super(InceptionV2B, self).__init__()

        # 1x1
        self.branch1 = ConvBNReLU(in_channels=in_channels,out_channels=out_channels1,kernel_size=1)

        # 1x1 -> 1xn -> nx1 (n = 3 in this code; the paper uses n = 7 on the 17x17 grid)
        self.branch2 = nn.Sequential(
            ConvBNReLU(in_channels=in_channels, out_channels=out_channels2reduce, kernel_size=1),
            ConvBNReLUFactorization(in_channels=out_channels2reduce, out_channels=out_channels2reduce, kernel_sizes=[1,3],paddings=[0,1]),
            ConvBNReLUFactorization(in_channels=out_channels2reduce, out_channels=out_channels2, kernel_sizes=[3,1],paddings=[1, 0]),
        )

        # 1x1 -> 1xn -> nx1 -> 1xn -> nx1
        self.branch3 = nn.Sequential(
            ConvBNReLU(in_channels=in_channels,out_channels=out_channels3reduce,kernel_size=1),
            ConvBNReLUFactorization(in_channels=out_channels3reduce, out_channels=out_channels3reduce,kernel_sizes=[1, 3], paddings=[0, 1]),
            ConvBNReLUFactorization(in_channels=out_channels3reduce, out_channels=out_channels3reduce,kernel_sizes=[3, 1], paddings=[1, 0]),
            ConvBNReLUFactorization(in_channels=out_channels3reduce, out_channels=out_channels3reduce, kernel_sizes=[1, 3], paddings=[0, 1]),
            ConvBNReLUFactorization(in_channels=out_channels3reduce, out_channels=out_channels3,kernel_sizes=[3, 1], paddings=[1, 0]),
        )

        # MaxPool -> 1x1
        self.branch4 = nn.Sequential(
            nn.MaxPool2d(kernel_size=3, stride=1, padding=1),
            ConvBNReLU(in_channels=in_channels, out_channels=out_channels4, kernel_size=1),
        )

    def forward(self, x):
        out1 = self.branch1(x)
        out2 = self.branch2(x)
        out3 = self.branch3(x)
        out4 = self.branch4(x)
        out = torch.cat([out1, out2, out3, out4], dim=1)
        return out

# 2 x Inception-C

class InceptionV2C(nn.Module):
    def __init__(self, in_channels,out_channels1,out_channels2reduce, out_channels2, out_channels3reduce, out_channels3, out_channels4):
        super(InceptionV2C, self).__init__()

        # 1x1
        self.branch1 = ConvBNReLU(in_channels=in_channels,out_channels=out_channels1,kernel_size=1)

        # 1x1 -> 1x3
        #     -> 3x1
        self.branch2_conv1 = ConvBNReLU(in_channels=in_channels, out_channels=out_channels2reduce, kernel_size=1)
        self.branch2_conv2a = ConvBNReLUFactorization(in_channels=out_channels2reduce, out_channels=out_channels2, kernel_sizes=[1,3],paddings=[0,1])
        self.branch2_conv2b = ConvBNReLUFactorization(in_channels=out_channels2reduce, out_channels=out_channels2, kernel_sizes=[3,1],paddings=[1,0])

        # 1x1 -> 3x3 -> 1x3
        #            -> 3x1
        self.branch3_conv1 = ConvBNReLU(in_channels=in_channels,out_channels=out_channels3reduce,kernel_size=1)
        self.branch3_conv2 = ConvBNReLU(in_channels=out_channels3reduce, out_channels=out_channels3, kernel_size=3,stride=1,padding=1)
        self.branch3_conv3a = ConvBNReLUFactorization(in_channels=out_channels3, out_channels=out_channels3, kernel_sizes=[3, 1],paddings=[1, 0])
        self.branch3_conv3b = ConvBNReLUFactorization(in_channels=out_channels3, out_channels=out_channels3, kernel_sizes=[1, 3],paddings=[0, 1])

        # MaxPool -> 1x1
        self.branch4 = nn.Sequential(
            nn.MaxPool2d(kernel_size=3, stride=1, padding=1),
            ConvBNReLU(in_channels=in_channels, out_channels=out_channels4, kernel_size=1),
        )

    def forward(self, x):
        out1 = self.branch1(x)
        x2 = self.branch2_conv1(x)
        out2 = torch.cat([self.branch2_conv2a(x2), self.branch2_conv2b(x2)],dim=1)
        x3 = self.branch3_conv2(self.branch3_conv1(x))
        out3 = torch.cat([self.branch3_conv3a(x3), self.branch3_conv3b(x3)], dim=1)
        out4 = self.branch4(x)
        out = torch.cat([out1, out2, out3, out4], dim=1)
        return out
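
One detail of the blocks above is easy to miss: in InceptionV2C the 1x3 and 3x1 outputs of branch 2 and branch 3 are concatenated, so each of those branches contributes twice its `out_channels` value. The sketch below only does the channel arithmetic implied by the `forward` methods; the helper names and the channel numbers are illustrative assumptions, not values taken from the paper.

```python
def out_channels_a_or_b(c1, c2, c3, c4):
    """Output channels of InceptionV2A / InceptionV2B: four branches concatenated."""
    return c1 + c2 + c3 + c4


def out_channels_c(c1, c2, c3, c4):
    """Output channels of InceptionV2C.

    Branch 2 and branch 3 each concatenate a 1x3 and a 3x1 output,
    so c2 and c3 are counted twice.
    """
    return c1 + 2 * c2 + 2 * c3 + c4


# Illustrative channel settings (assumed for this example).
print(out_channels_a_or_b(64, 64, 96, 32))
print(out_channels_c(320, 384, 384, 192))
```

The result of the matching helper is the value to pass as `in_channels` to whatever layer follows the block.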
