[2023 Winter Multimodal Seminar 3. Reasoning 1) Structure Modeling] Memory Fusion Network for Multi-view Sequential Learning Review (AAAI, 2018, Oral)

AI/Multimodal 2023. 1. 25. 11:09


Contents

0. Before we begin ... What is Reasoning?

1. Introduction

2. Methods

3. Experiments

4. Results

5. Conclusion

6. Code Review

0. Before we begin ... What is Reasoning?

Reasoning generally aims to compose knowledge from multimodal evidence through multiple inferential steps for a task.

Its sub-challenges are Structure Modeling, Intermediate Concepts, Inference Paradigms, and External Knowledge.

Today's paper falls under Structure Modeling.

Structure Modeling aims to define the relationships over which that composition takes place. Today's paper is about a model that handles temporal (sequential) structure well.

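Looking at the figure above,

the information from two modalities is aligned so that a single representation is produced at each timestamp (we have read the papers up to this point, right? ^^)

But we have not yet learned how the per-timestamp representations should be carried into the next stage of learning

That is what we learn in reasoning!

It breaks down into the three steps below

1) Writing : compute the similarity between the feature at each timestep and the incoming multimodal memory, and store it in the memory

2) Compose : a weighted fusion that combines the previous memory with the new feature (👾 this is where today's paper fits!)

3) Reading : a function that summarizes the information in the multimodal memory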

1. Introduction

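To handle multi-view sequential data well (== why multimodal models matter)

- Even the same data has its own distinct representation depending on the view

- So, to describe the data comprehensively and accurately, multiple views must be used together!

Multi-view sequential learning - with multiple views, it is important to capture both of the following forms of interaction

1. view-specific interactions (= unimodal)

Interactions that involve only one view
For example, learning a speaker's sentiment based only on the sequence of spoken words

2. cross-view interactions (= bimodal, trimodal, ... )

Interactions that involve multiple views
For example, a flash of lightning in the video followed a few seconds later by the sound of thunder in the audio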

There are three main types of approaches to this kind of multi-view sequential data

Simply merging all of the views into a single view and feeding that to the model as input
- Models
  - HMM (Hidden Markov Model)
  - SVM
  - HCRF (Hidden Conditional Random Fields)
  - RNN family (LSTM)
- Drawbacks
  - Causes overfitting when the training set is small
    - Reason : each view has its own statistical properties, and this simple concatenation cannot take those properties into account
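Approaches that handle multi-view variations, as a more systematic alternative to the method above
- Models
  - Multi-view HCRF (the original HCRF modified to handle multiple views)
  - Multi-view LSTM (split into a separate component for each view)
- Drawbacks
  - Not stated in the paper; but if each view is fed in separately, wouldn't the relationships between the views be learned poorly?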
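Approaches that collapse the time dimension in order to learn a temporal representation for each view
- Models
  - Using the average of the feature values over time
    - Multiple Kernel Learning, subspace learning, co-training
  - Training a different model for each view and combining those models with one of the methods below
    - decision voting
    - tensor product
    - deep neural networks
- Drawbacks
  - These methods do learn the relationships between views to some extent, but because they ignore the temporal dimension, the representations they can learn are limited, which ultimately hurts performance
    - With very long sequences, the temporal information is not taken into account sufficiently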
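
To make this drawback concrete, here is a small sketch in plain Python. The toy sequences and the helper name `mean_pool` are made up for illustration and do not come from the paper. Averaging features over time gives the same vector for a sequence and for its time-reversed copy, so a model that only sees the average cannot tell the two apart.

```python
def mean_pool(sequence):
    """Collapse the time dimension by averaging each feature over all timesteps."""
    num_steps = len(sequence)
    num_features = len(sequence[0])
    return [sum(step[j] for step in sequence) / num_steps for j in range(num_features)]


# Two toy sequences (3 timesteps x 2 features); the second is the first played backwards.
rising = [[0.0, 1.0], [1.0, 1.0], [2.0, 4.0]]
falling = list(reversed(rising))

pooled_rising = mean_pool(rising)
pooled_falling = mean_pool(falling)

# The sequences differ, but their time-averaged representations are identical.
print(rising != falling, pooled_rising == pooled_falling)
```

Any order-sensitive cue, such as lightning first and thunder a few seconds later, is lost after this kind of pooling.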

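How is this paper different?
- Unlike the first approach, it does not simply merge the views but models each view on its own
- Unlike the second approach, which captures only view-specific information, it also captures cross-view information through an attention network
- Unlike the third approach, it learns this view-specific / cross-view information over time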

2. Methods

The model consists of three main parts

1. System of Long Short Term Memory : each view is encoded by its own LSTM (View Specific)

2. Delta-memory Attention Network (DMAN) : computes relevance scores over the memory dimensions of the LSTMs to capture cross-view interactions (both cross-view and temporal interactions)

3. Multi-view Gated Memory : updates itself from the DMAN output and the memory of the previous timestep, storing cross-view information over time (temporal interactions)

1. System of Long Short Term Memory

Computes the view-specific interactions at each timestamp

2. Delta-memory Attention Network

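Its purpose is to capture the cross-view interactions among the representations produced by the LSTMs above
The coefficients are the key here
- This module uses a NN to compute the attention coefficients (a[t-1, t])
- Through this coefficient assignment technique, the cross-view interactions among the representations of each modality are captured
- However, if only the coefficients for a single time t were considered, then whenever the LSTM memories do not change (e.g. the same scene keeps appearing) the same cross-view interactions would be picked up over and over, which is a problem. This is why the memories at both t-1 and t are used

input : c[t-1, t] / the concatenation of the LSTM memories at time t-1 and time t

a[t-1, t] : attention coefficients computed by a NN / softmax-activated scores, used to regularize high coefficient values

output : c^[t-1, t] = c[t-1, t] ⊙ a[t-1, t] (the attended memories) / passes the information on to later timestamps

In the code (6. Code Review), this corresponds to the lines that compute cStar, attention, and attended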
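
As a rough illustration of the DMAN step described above, here is a minimal sketch in plain Python rather than PyTorch, so the arithmetic stays visible. In the paper the scores come from a trained neural network. The `change_score` function below is a hypothetical stand-in that I made up only to show why looking at both t-1 and t helps, and the memory values are toy numbers.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def delta_memory_attention(prev_memories, curr_memories, score_fn):
    """Attend over the concatenation of the t-1 and t LSTM memories.

    score_fn stands in for the attention network: it maps the concatenated
    vector to one raw score per dimension.
    """
    c = prev_memories + curr_memories               # c[t-1, t]
    a = softmax(score_fn(c))                        # a[t-1, t], sums to 1
    c_hat = [a_i * c_i for a_i, c_i in zip(a, c)]   # c^[t-1, t]
    return a, c_hat


def change_score(c):
    """Hypothetical scorer (NOT the trained network): score = |change between t-1 and t|."""
    half = len(c) // 2
    deltas = [abs(c[half + i] - c[i]) for i in range(half)]
    return deltas + deltas


prev_c = [0.5, -1.0, 0.2]
curr_c = [0.5, 1.0, 0.2]   # only the second dimension changed between t-1 and t
attention, attended = delta_memory_attention(prev_c, curr_c, change_score)
```

With this stand-in scorer, the dimension that changed receives the largest coefficients, while unchanged dimensions are down-weighted. That is the intuition behind attending over the "delta" of two timesteps instead of a single one.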

3. Multi-view Gated Memory

This part stores the history of cross-view interactions over time

input : c^[t-1, t] / a NN generates a cross-view update proposal û^t from it

Compute γ1 / γ2

- γ1 acts as the retain gate (how much of the previous memory u^{t-1} to keep)

- γ2 acts as the update gate (how much to update based on the update proposal û^t)

- tanh is applied to the update proposal û^t to prevent drastic changes and thereby improve model stability

output : u^t

The final output is
- the final state of the Multi-view Gated Memory u^T
- the outputs of each of the n LSTMs
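
The retain / update behavior of the two gates can be sketched in a few lines of plain Python. This is only an illustration of the element-wise update u^t = γ1 ⊙ u^{t-1} + γ2 ⊙ tanh(û^t). In the model the gate values are produced by small networks with a sigmoid output, whereas here they are fixed by hand (an assumption for the demo), and the memory values are toy numbers.

```python
import math


def gated_memory_update(prev_memory, update_proposal, retain_gate, update_gate):
    """One step of u^t = gamma1 * u^{t-1} + gamma2 * tanh(u_hat^t), element-wise."""
    return [
        g1 * u_prev + g2 * math.tanh(u_hat)
        for u_prev, u_hat, g1, g2 in zip(prev_memory, update_proposal, retain_gate, update_gate)
    ]


prev_memory = [0.8, -0.3]
update_proposal = [5.0, 0.0]   # a large proposal is squashed into (-1, 1) by tanh

# gamma1 = 1, gamma2 = 0: keep the old memory untouched.
kept = gated_memory_update(prev_memory, update_proposal, [1.0, 1.0], [0.0, 0.0])

# gamma1 = 0, gamma2 = 1: discard the old memory and write the squashed proposal.
overwritten = gated_memory_update(prev_memory, update_proposal, [0.0, 0.0], [1.0, 1.0])
```

Because the two gates are independent, the memory can also keep most of its old content and still add new information in the same step, which a single convex mixing weight would not allow.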

3. Experiments

Datasets

Sentiment Analysis : a task of distinguishing negative, positive, and neutral
1. CMU-MOSI dataset
2. MOUD dataset
3. YouTube dataset
4. ICT-MMMO dataset
Emotion Recognition : a task of distinguishing basic emotions + continuous emotions (anger, happiness, sadness, neutrality, ..)
1. IEMOCAP dataset
Speaker Traits Analysis : a task of analyzing a speaker's communicative behavior
1. POM dataset (http://multicomp.cs.cmu.edu/resources/pom-dataset/)

Sequence Features

Language View
1. GloVe word embeddings : 300 dimensional word embeddings trained on 840 billion tokens from the Common Crawl dataset
2. 300 dimensions
Visual View
1. the library Facet (iMotions 2017) is used to extract a set of visual features including facial action units, facial landmarks, head pose, gaze tracking and HOG features
2. 35 dimensions
Acoustic View
1. the software COVAREP is used to extract acoustic features including 12 Mel-frequency cepstral coefficients, pitch tracking and voiced/unvoiced segmenting features, glottal source parameters, peak slope parameters and maxima dispersion quotients.
2. 74 dimensions
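
To connect these feature sizes to the code reviewed later, here is a small sketch of how the three views could sit side by side in one concatenated feature vector per timestep. The [language, acoustic, visual] ordering is an assumption borrowed from the slicing in the Code Review section, and `view_slices` is a hypothetical helper, not part of the released code.

```python
# Per-view feature sizes from the list above; the ordering is an assumption.
input_dims = {"language": 300, "acoustic": 74, "visual": 35}


def view_slices(dims):
    """Return {view: (start, end)} column ranges inside the concatenated feature vector."""
    slices, start = {}, 0
    for view, size in dims.items():
        slices[view] = (start, start + size)
        start += size
    return slices


slices = view_slices(input_dims)
total_dim = sum(input_dims.values())

# One fake timestep with `total_dim` features, split back into the language view.
timestep = list(range(total_dim))
language_features = timestep[slice(*slices["language"])]
```

Under this assumption every timestep carries 409 features, and each view-specific LSTM only ever sees its own column range.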

Baselines

View Concatenation Sequential Learning Models
1. Song2013 (⊲)
2. Morency2011 : Hidden Markov Model (×)
3. Quattoni2007 : Hidden Conditional Random Field (HCRF) (≀)
4. Morency2007 : Latent Discriminative Hidden Conditional Random Fields (LDHCRFs) (#)
5. Hochreiter1997 : LSTM (§)
Multi-view Sequential Learning Models
1. Rajagopalan2016 : Multi-view (MV) LSTM (◇)
2. Song2012 : MV-HCRF (⊳)
3. Song2013MV : MV-HSSHCRF (∪)
Dataset Specific Baselines
1. Poria2015 : Multiple Kernel Learning (♣)
2. Nojavanasghari2016 : Deep Fusion Approach (♭)
3. Zadeh2016 : Support Vector Machine (♡)
4. Ho1998 : Random Forest (●)
Dataset Specific State-of-the-art Baselines
1. Poria2017 : Bidirectional Contextual LSTM (†)
  1. IEMOCAP / MOUD SOTA
2. Zadeh2017 : Tensor Fusion Network (∗)
  1. CMU-MOSI SOTA
3. Wang2016 : Selective Additive Learning Convolutional Neural Network (∩)
MFN Ablation Study Baselines
1. MFN {l, v, a}: These baselines use only individual views – l for language, v for visual, and a for acoustic.
2. MFN (no ∆): This variation of our model shrinks the context to only the current timestamp t in the DMAN.
3. MFN (no mem): This variation of our model removes the Delta-memory Attention Network and Multi-view Gated Memory from the MFN

4. Results

Metrics
- BA : binary accuracy
- MA : multiclass accuracy
- r : Pearson's correlation
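
For reference, the three metrics can be written out in plain Python as below. This is a sketch with made-up numbers, not the paper's evaluation script. In particular, thresholding real-valued sentiment scores at 0 to obtain binary labels is my assumption about how binary accuracy is derived from regression outputs.

```python
import math


def binary_accuracy(predictions, targets):
    """Fraction of examples whose prediction and target fall on the same side of 0."""
    hits = sum((p >= 0) == (y >= 0) for p, y in zip(predictions, targets))
    return hits / len(targets)


def multiclass_accuracy(predicted_labels, true_labels):
    """Fraction of examples whose predicted class label matches the true label."""
    hits = sum(p == y for p, y in zip(predicted_labels, true_labels))
    return hits / len(true_labels)


def pearson_r(xs, ys):
    """Pearson's correlation coefficient between two equal-length lists."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


# Toy sentiment scores (made up): negative values mean negative sentiment.
predicted = [1.5, -0.5, 2.0, -2.0]
actual = [1.0, 0.5, 3.0, -1.0]

ba = binary_accuracy(predicted, actual)
ma = multiclass_accuracy([0, 1, 2, 1], [0, 1, 1, 1])
r = pearson_r(predicted, actual)
```

Binary accuracy only looks at the sign of each prediction, while Pearson's r also rewards getting the relative intensity right, so the two can disagree on which model is better.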

MFN Achieves State-of-The-Art Performance for Multiview Sequential Modeling
Ablation Study

5. Conclusion

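Strengths of the paper
- It looks very simple from today's point of view, but at the time it came out it did a good job of laying out the limitations of earlier work and trying to overcome them. -> In particular, it addresses the fact that earlier work had not reflected temporal information!
Weaknesses of the paper
- The results were very hard to read as reported; a more efficient way of reporting would be welcome
- It would have been nice to see experiments such as widening the time window!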

6. Code Review

import torch
import torch.nn as nn
import torch.nn.functional as F

class MFN(nn.Module):
	def __init__(self,config,NN1Config,NN2Config,gamma1Config,gamma2Config,outConfig):
		super(MFN, self).__init__()
		[self.d_l,self.d_a,self.d_v] = config["input_dims"]
		[self.dh_l,self.dh_a,self.dh_v] = config["h_dims"]
		total_h_dim = self.dh_l+self.dh_a+self.dh_v
		self.mem_dim = config["memsize"]
		window_dim = config["windowsize"]
		output_dim = 1
		attInShape = total_h_dim*window_dim
		gammaInShape = attInShape+self.mem_dim
		final_out = total_h_dim+self.mem_dim
		h_att1 = NN1Config["shapes"]
		h_att2 = NN2Config["shapes"]
		h_gamma1 = gamma1Config["shapes"]
		h_gamma2 = gamma2Config["shapes"]
		h_out = outConfig["shapes"]
		att1_dropout = NN1Config["drop"]
		att2_dropout = NN2Config["drop"]
		gamma1_dropout = gamma1Config["drop"]
		gamma2_dropout = gamma2Config["drop"]
		out_dropout = outConfig["drop"]

		self.lstm_l = nn.LSTMCell(self.d_l, self.dh_l)
		self.lstm_a = nn.LSTMCell(self.d_a, self.dh_a)
		self.lstm_v = nn.LSTMCell(self.d_v, self.dh_v)

		self.att1_fc1 = nn.Linear(attInShape, h_att1)
		self.att1_fc2 = nn.Linear(h_att1, attInShape)
		self.att1_dropout = nn.Dropout(att1_dropout)

		self.att2_fc1 = nn.Linear(attInShape, h_att2)
		self.att2_fc2 = nn.Linear(h_att2, self.mem_dim)
		self.att2_dropout = nn.Dropout(att2_dropout)

		self.gamma1_fc1 = nn.Linear(gammaInShape, h_gamma1)
		self.gamma1_fc2 = nn.Linear(h_gamma1, self.mem_dim)
		self.gamma1_dropout = nn.Dropout(gamma1_dropout)

		self.gamma2_fc1 = nn.Linear(gammaInShape, h_gamma2)
		self.gamma2_fc2 = nn.Linear(h_gamma2, self.mem_dim)
		self.gamma2_dropout = nn.Dropout(gamma2_dropout)

		self.out_fc1 = nn.Linear(final_out, h_out)
		self.out_fc2 = nn.Linear(h_out, output_dim)
		self.out_dropout = nn.Dropout(out_dropout)
		
	def forward(self,x):
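		# x is t x n x d: time steps x batch size x concatenated [language, acoustic, visual] features
		x_l = x[:,:,:self.d_l]
		x_a = x[:,:,self.d_l:self.d_l+self.d_a]
		x_v = x[:,:,self.d_l+self.d_a:]
		n = x.shape[1]
		t = x.shape[0]
		# zero-initialize the LSTM states and the multi-view memory on the same device as the input
		self.h_l = torch.zeros(n, self.dh_l, device=x.device)
		self.h_a = torch.zeros(n, self.dh_a, device=x.device)
		self.h_v = torch.zeros(n, self.dh_v, device=x.device)
		self.c_l = torch.zeros(n, self.dh_l, device=x.device)
		self.c_a = torch.zeros(n, self.dh_a, device=x.device)
		self.c_v = torch.zeros(n, self.dh_v, device=x.device)
		self.mem = torch.zeros(n, self.mem_dim, device=x.device)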
		all_h_ls = []
		all_h_as = []
		all_h_vs = []
		all_c_ls = []
		all_c_as = []
		all_c_vs = []
		all_mems = []
		for i in range(t):
			# prev time step
			prev_c_l = self.c_l
			prev_c_a = self.c_a
			prev_c_v = self.c_v
			# curr time step
			new_h_l, new_c_l = self.lstm_l(x_l[i], (self.h_l, self.c_l))
			new_h_a, new_c_a = self.lstm_a(x_a[i], (self.h_a, self.c_a))
			new_h_v, new_c_v = self.lstm_v(x_v[i], (self.h_v, self.c_v))
			# concatenate
			prev_cs = torch.cat([prev_c_l,prev_c_a,prev_c_v], dim=1)
			new_cs = torch.cat([new_c_l,new_c_a,new_c_v], dim=1)
			cStar = torch.cat([prev_cs,new_cs], dim=1)
			# DMAN: softmax attention coefficients over the concatenated [t-1, t] memories
			attention = F.softmax(self.att1_fc2(self.att1_dropout(F.relu(self.att1_fc1(cStar)))),dim=1)
			attended = attention*cStar
			# cross-view update proposal, squashed by tanh (F.tanh is deprecated)
			cHat = torch.tanh(self.att2_fc2(self.att2_dropout(F.relu(self.att2_fc1(attended)))))
			both = torch.cat([attended,self.mem], dim=1)
			# gamma1: retain gate, gamma2: update gate (F.sigmoid is deprecated)
			gamma1 = torch.sigmoid(self.gamma1_fc2(self.gamma1_dropout(F.relu(self.gamma1_fc1(both)))))
			gamma2 = torch.sigmoid(self.gamma2_fc2(self.gamma2_dropout(F.relu(self.gamma2_fc1(both)))))
			self.mem = gamma1*self.mem + gamma2*cHat
			all_mems.append(self.mem)
			# update
			self.h_l, self.c_l = new_h_l, new_c_l
			self.h_a, self.c_a = new_h_a, new_c_a
			self.h_v, self.c_v = new_h_v, new_c_v
			all_h_ls.append(self.h_l)
			all_h_as.append(self.h_a)
			all_h_vs.append(self.h_v)
			all_c_ls.append(self.c_l)
			all_c_as.append(self.c_a)
			all_c_vs.append(self.c_v)

		# last hidden layer last_hs is n x h
		last_h_l = all_h_ls[-1]
		last_h_a = all_h_as[-1]
		last_h_v = all_h_vs[-1]
		last_mem = all_mems[-1]
		last_hs = torch.cat([last_h_l,last_h_a,last_h_v,last_mem], dim=1)
		output = self.out_fc2(self.out_dropout(F.relu(self.out_fc1(last_hs))))
		return output
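
To follow the size bookkeeping at the top of `__init__`, it can help to plug in numbers. The sketch below mirrors only that arithmetic in plain Python. The hidden sizes and memory size are hypothetical values that I picked for illustration, not the paper's hyperparameters, and `mfn_layer_sizes` is a made-up helper name.

```python
# Hypothetical hyperparameters, chosen only to make the arithmetic concrete.
config = {
    "input_dims": [300, 74, 35],   # language, acoustic, visual
    "h_dims": [128, 32, 32],       # LSTM hidden sizes per view
    "memsize": 64,                 # size of the multi-view gated memory
    "windowsize": 2,               # memories at t-1 and t
}


def mfn_layer_sizes(config):
    """Mirror the input-size computations done at the top of MFN.__init__."""
    total_h_dim = sum(config["h_dims"])
    att_in = total_h_dim * config["windowsize"]    # attInShape: input to both attention MLPs
    gamma_in = att_in + config["memsize"]          # gammaInShape: attended memories + memory
    final_out = total_h_dim + config["memsize"]    # last hidden states + last memory
    return {"att_in": att_in, "gamma_in": gamma_in, "final_out": final_out}


sizes = mfn_layer_sizes(config)
print(sizes)
```

One detail worth noticing: `forward` always concatenates exactly two sets of memories (`prev_cs` and `new_cs`), so `cStar` has `2 * total_h_dim` columns. The first attention layer therefore only matches its input when `windowsize` is 2.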

Reference

https://arxiv.org/pdf/1802.00927.pdf
