Understanding MoE (Mixture of Experts), Part 1: Principles and Components

AI/NLP 2025. 7. 26. 21:22


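When you look at recent large language model (LLM) architectures, model names like these keep coming up:

Qwen-235B-A22B, DeepSeek-V2-MoE, Kimi-K2, SwitchTransformer …

On paper these are enormous models with tens of billions to trillions of parameters,
yet at inference time only a small fraction of those parameters (on the order of billions to a few tens of billions) actually takes part in the computation for each token.

“How is that even possible?”

The answer lies in the MoE (Mixture of Experts) architecture.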

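How did it work before?

In a standard Transformer, the core of each block is
Self-Attention → FeedForward layer (that is, fully connected).

MoE changes only the FeedForward (FFN) part.

✅ The single FFN is replaced by several FFNs, each acting as an “expert,”
✅ and for each input token only a few of those experts are selected and computed.

It is like a hospital that has dozens of specialists,
where only the 2 or 3 most relevant doctors are involved in treating any one patient :)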

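So, looking at the Transformer block again:

[Self-Attention] → [FeedForward (FFN)] → [LayerNorm etc.]

The FeedForward (FFN) layer typically used here:

Consists of two dense layers (e.g., Linear → GELU → Linear)
Applies the same fixed computation to every token
(that is, every input is processed with the same weights)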

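Dense:
FFN(x) = W2 · GELU(W1 · x)

MoE:
MoE(x) = Σ_i (gate_i(x) × Expert_i(x))  ← only a few of the Experts are selected; gate_i(x) = 0 for the rest

Each Expert has the same structure as an FFN,
but only a subset is activated (e.g., Top-2 out of 64)
So compute per token is far lower than in a dense model with the same total parameter count, while the separate experts allow more varied representations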
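To make the formula above concrete, here is a minimal sketch in plain Python. The scalar "experts", the `softmax` and `moe_output` helpers, and the choice to renormalize the gate weights over the selected experts are all assumptions made for illustration; real implementations operate on tensors and differ in how they normalize.

```python
import math

def softmax(scores):
    # Numerically stable softmax over a list of floats
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def moe_output(x, gate_logits, experts, k=2):
    """Weighted sum over the top-k experts only; the others are never evaluated."""
    probs = softmax(gate_logits)
    # Indices of the k experts with the highest gate probability
    top = sorted(range(len(experts)), key=lambda i: probs[i], reverse=True)[:k]
    # Assumption: renormalize the gate weights over the selected experts
    norm = sum(probs[i] for i in top)
    y = sum(probs[i] / norm * experts[i](x) for i in top)
    return y, top

# Four toy "experts": scalar functions standing in for FFNs (hypothetical)
experts = [lambda x: x + 1.0, lambda x: 2.0 * x, lambda x: -x, lambda x: x * x]
y, chosen = moe_output(3.0, gate_logits=[0.1, 2.0, -1.0, 2.0], experts=experts, k=2)
print(chosen, y)
```

Only experts 1 and 3 are evaluated here; experts 0 and 2 contribute nothing and cost nothing.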

MoE layer from the [Switch Transformers paper](https://arxiv.org/abs/2101.03961)

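What does an MoE look like?

As mentioned briefly above, an MoE has two key components:

1. Sparse MoE layer

Used in place of the dense FFN layer
Contains several "experts" (e.g., 8)
In practice each expert is an FFN, but more complex networks or even hierarchical MoEs are possible

2. Gate network / router

Decides which token is sent to which expert
e.g., the token "More" goes to the second expert, and the token "Parameters" goes to the first
A single token can also be sent to more than one expert
- Modern MoEs rarely use every expert; the gate is usually restricted to activating only the K highest-scoring experts (Top-K routing)
- K is most often 1 or 2. K=1 is simpler to implement and reduces communication cost but may lose some expressiveness; K=2 mixes the outputs of two experts for a richer representation at the price of some extra computation and communication
Made up of learnable parameters and pretrained jointly with the rest of the network

Let's go through these components one at a time.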

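Sparse MoE layer

Effects

Capacity (representational power): increases dramatically
Required compute: barely increases
Result: parameter count is decoupled from compute

Example

The Switch Transformer case

Number of experts: 2048
Activated: only 1 per token
Per-token expert computation: about 1/2048 of what evaluating every expert would cost
Whole model: 1.6 trillion parameters
Actual training cost: roughly comparable to a dense model with about 10 billion parameters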

General formula

MoE layer compute ≈ compute of a dense layer holding all N experts × (k/N)
- N: total number of experts
- k: number of experts activated per token
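The k/N relationship can be checked with a little arithmetic. The sketch below assumes each expert is a two-matrix FFN with 2 × d_model × d_ff weights (biases ignored); the `moe_cost` helper and the sizes are hypothetical and only illustrate the ratio.

```python
def moe_cost(num_experts, k, expert_params):
    """Total vs. per-token active expert parameters for one MoE layer."""
    total = num_experts * expert_params
    active = k * expert_params
    return {
        "total_params": total,
        "active_params": active,
        "compute_ratio": k / num_experts,  # the k/N factor from the formula above
    }

d_model, d_ff = 1024, 4096          # hypothetical layer sizes
expert_params = 2 * d_model * d_ff  # weights of one FFN expert, biases ignored
stats = moe_cost(num_experts=64, k=2, expert_params=expert_params)
print(stats)
```

With 64 experts and Top-2 routing, the layer stores 64 experts' worth of weights but each token touches only 2 of them, which is 1/32 of the expert compute. The gate and communication overheads discussed next are not included in this ratio.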

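Limitations and overhead

Additional cost factors

Gate computation: extra computation needed to choose experts
Communication cost: the cost of gathering the outputs of the selected experts
All-to-All communication: sending each token to the device that hosts its expert

Factors that affect performance

Small batches: communication overhead can dominate
Remedy: parallelize over sufficiently large batches + optimize communication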

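Evolution of routing methods (Top-K, Load Balancing, etc.)

Routing strategies have kept evolving throughout the history of MoE. The representative methods are:

Top-K token routing (Token Choice Routing):
- The most traditional approach: for each input token, select the K experts with the highest gate probability.
  - As described above, K=1 (Switch Transformer) or K=2 (GShard, GLaM, etc.) is most common, and only the selected experts are activated
  - Because each token picks its experts independently, a naive implementation risks tokens piling onto a few experts. To prevent this, Shazeer et al. (2017) introduced a Load Balancing Loss.
    - This auxiliary loss pushes the number of times each expert is selected within a mini-batch (or the sum of its gate probabilities) toward a uniform value, which reduces the variance in per-expert utilization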
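As an illustration of such an auxiliary loss, the sketch below uses the simplified form from the Switch Transformer paper, N · Σ f_i · P_i, rather than the original losses of Shazeer et al. (2017). Here f_i is the fraction of tokens routed to expert i and P_i is the mean gate probability for expert i; the function name and the toy inputs are assumptions, and the scaling coefficient that multiplies this term in training is omitted.

```python
def load_balancing_loss(router_probs, expert_ids, num_experts):
    """Switch-style auxiliary loss: num_experts * sum_i f_i * P_i.

    router_probs: one list of gate probabilities per token
    expert_ids:   the top-1 expert chosen for each token
    """
    num_tokens = len(expert_ids)
    # f_i: fraction of tokens dispatched to expert i
    f = [sum(1 for e in expert_ids if e == i) / num_tokens for i in range(num_experts)]
    # P_i: mean gate probability assigned to expert i
    p = [sum(row[i] for row in router_probs) / num_tokens for i in range(num_experts)]
    return num_experts * sum(fi * pi for fi, pi in zip(f, p))

# Balanced routing: tokens split evenly, probabilities uniform
balanced = load_balancing_loss([[0.5, 0.5]] * 4, [0, 1, 0, 1], num_experts=2)
# Collapsed routing: every token goes to expert 0 with high probability
collapsed = load_balancing_loss([[0.9, 0.1]] * 4, [0, 0, 0, 0], num_experts=2)
print(balanced, collapsed)
```

The loss reaches its minimum of 1.0 under uniform routing and grows as tokens concentrate on one expert, which is the pressure toward balance described above.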
Expert Choice Routing:
- A newer method proposed in 2022 that inverts the usual scheme: instead of tokens choosing experts, experts choose tokens
- Zhou et al. of Google Brain give each expert a fixed processing capacity (e.g., at most m tokens per batch) and assign to each expert the m tokens with the highest gate scores for it.
  - Every expert then receives as many tokens as its capacity allows, so load is evenly distributed.
  - Important or difficult tokens can also be assigned to several experts at once (a dynamic K per token), so the amount of processing adapts to token difficulty.
- Drawbacks
  - The implementation is more complex, and because a token may be replicated across several experts, memory consumption can increase
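The idea of experts picking tokens can be sketched in a few lines. This is a toy illustration under stated assumptions, not the algorithm from the paper: `expert_choice_route`, the score matrix, and the capacity value are hypothetical, and real implementations work on batched tensors.

```python
def expert_choice_route(scores, capacity):
    """scores[t][e] is the gate score of token t for expert e.

    Each expert keeps the `capacity` tokens that score highest for it.
    Returns {expert_index: sorted list of token indices}.
    """
    num_tokens, num_experts = len(scores), len(scores[0])
    assignment = {}
    for e in range(num_experts):
        ranked = sorted(range(num_tokens), key=lambda t: scores[t][e], reverse=True)
        assignment[e] = sorted(ranked[:capacity])
    return assignment

scores = [
    [0.9, 0.8],  # token 0 scores high for both experts
    [0.1, 0.7],  # token 1 prefers expert 1
    [0.6, 0.2],  # token 2 prefers expert 0
    [0.3, 0.1],  # token 3 scores low everywhere
]
assignment = expert_choice_route(scores, capacity=2)
# How many experts process each token (the "dynamic K")
experts_per_token = [sum(t in toks for toks in assignment.values()) for t in range(len(scores))]
print(assignment, experts_per_token)
```

Every expert ends up with exactly two tokens, so load is balanced by construction. Token 0 is processed by both experts, while token 3 is picked by none, which shows both the dynamic K and the fact that a token can be left without any expert in this scheme.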

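Challenges

Applying MoE to real large-scale models raises several challenges of its own, and a variety of techniques have been studied to address them.

Expert imbalance and the “expert collapse” problem:
- Problem: As noted several times above, during training a few experts may be selected most of the time while the others are rarely selected and never learn effectively
  - When this happens, a large share of the model's capacity is wasted, and the few overloaded experts are at greater risk of overfitting
  - Solution 1: Load balancing was introduced to prevent this. Shazeer et al. added two auxiliary losses that encourage (1) uniform usage frequency across experts and (2) an evenly spread distribution of gate probabilities
  - Solution 2: More recently, post-processing methods have also been studied that prune under-used experts after training and retrain the remaining ones, reducing the number of experts and simplifying the model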
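Training instability and difficult convergence:
- Problem 1: Because of the non-linear selection performed by the gate, MoE training can become unstable or struggle to converge early on.
  - For example, if the gate output becomes heavily skewed (a very high score for one expert), the resulting gradients can be very large, causing exploding gradients or numerical instability
  - Solution 1: To counter this, Zoph, Fedus et al. introduced the router z-loss in ST-MoE, the follow-up to Switch Transformer; it is a regularization loss that penalizes large-magnitude gate logits
    - Switch Transformer itself addressed the issue with selective precision, computing the router in float32 while the rest of the model stays in bfloat16, and was reported to converge without diverging under mixed-precision (bfloat16) training instead of full FP32
  - Solution 2: Other measures include a longer warm-up phase early in training, or gradient clipping to bound the gate gradients of each batch
- Problem 2: The selection step in Top-K gating is not differentiable. Softmax probabilities are used during training, but in backpropagation no gradient reaches the experts that were not selected. As a result, abrupt changes in gate decisions can make the loss landscape irregular
  - Solution 1: To mitigate this, techniques such as Gumbel-Softmax are used to imitate a soft, sampling-based selection (see also DSelect-k), along with schedule adjustments such as lowering the **gate temperature** late in training to reduce how much the gate output fluctuates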
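For the z-loss mentioned above, a minimal sketch follows. It uses the definition given in the ST-MoE paper, the mean over tokens of the squared log-sum-exp of the gate logits; the function name and toy logits are assumptions, and the small coefficient that scales this term in training is left out.

```python
import math

def router_z_loss(logits):
    """Mean over tokens of (log sum_j exp(logit_j)) ** 2."""
    total = 0.0
    for row in logits:
        m = max(row)
        # Stable log-sum-exp of one token's gate logits
        lse = m + math.log(sum(math.exp(v - m) for v in row))
        total += lse ** 2
    return total / len(logits)

small = router_z_loss([[0.0, 0.0], [0.0, 0.0]])    # modest logits
large = router_z_loss([[10.0, 0.0], [0.0, 10.0]])  # large-magnitude logits
print(small, large)
```

Because the penalty grows quadratically with the size of the logits, minimizing it keeps the values entering the gate softmax small, which is where low-precision round-off hurts most.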
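Communication and infrastructure:
- Problem: Getting real performance out of MoE requires distributed training infrastructure. When tens to hundreds of experts are spread across multiple devices (GPU/TPU), the token-to-expert mapping triggers cross-device communication at every step. If this communication is not optimized, the added latency can cancel out the efficiency gains
  - Solution 1: Google's GShard addresses this by automatically optimizing All-to-All communication at the XLA compiler level
  - Solution 2: Microsoft DeepSpeed-MoE combines multiple forms of parallelism (data parallel, model parallel, expert parallel, pipeline parallel, ZeRO-Offload, and so on) to scale a 3.5-trillion-parameter MoE on 512 GPUs with near-linear scaling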
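To see why the token-to-expert mapping forces cross-device traffic, the toy count below may help. It assumes experts are placed contiguously across devices (experts 0 and 1 on device 0, experts 2 and 3 on device 1); that placement, the helper name, and the inputs are all hypothetical, and no real communication library is involved.

```python
def count_cross_device_tokens(token_device, token_expert, experts_per_device):
    """Count tokens whose chosen expert lives on a different device."""
    moved = 0
    for device, expert in zip(token_device, token_expert):
        expert_device = expert // experts_per_device  # assumed contiguous placement
        if expert_device != device:
            moved += 1
    return moved

token_device = [0, 0, 1, 1]  # device on which each token currently resides
token_expert = [0, 3, 1, 2]  # expert chosen by the router for each token
moved = count_cross_device_tokens(token_device, token_expert, experts_per_device=2)
print(moved)
```

Half of the tokens in this example have to travel to another device and their outputs have to travel back, once per MoE layer per step. This exchange is what the All-to-All optimizations above target.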

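A look at the code

How an FFN works

import torch
from torch import nn

D_MODEL = 8  # illustrative hidden size
linear1 = nn.Linear(D_MODEL, 4 * D_MODEL)
activation = nn.GELU()
linear2 = nn.Linear(4 * D_MODEL, D_MODEL)

def ffn_forward(x):
    # x: [B, S, D] (Batch, Sequence, Dimension)

    # The same transform is applied to every token
    hidden = linear1(x)      # [B, S, D] -> [B, S, 4D]
    hidden = activation(hidden)
    output = linear2(hidden) # [B, S, 4D] -> [B, S, D]

    return output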

Number of parameters	2 × d_model × d_ff
Active parameters	All parameters, always
Memory usage	Constant
Compute complexity	O(all tokens × all parameters)
Communication overhead	None

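How an MoE works

import torch
from torch import nn

D_MODEL, NUM_EXPERTS = 8, 4  # illustrative sizes
gate_network = nn.Linear(D_MODEL, NUM_EXPERTS)
experts = nn.ModuleList([
    nn.Sequential(nn.Linear(D_MODEL, 4 * D_MODEL), nn.GELU(), nn.Linear(4 * D_MODEL, D_MODEL))
    for _ in range(NUM_EXPERTS)
])

def moe_forward(x):
    # x: [B, S, D]
    B, S, D = x.shape

    # 1. Flatten to one row per token
    tokens = x.reshape(-1, D)  # [B*S, D]

    # 2. Gating: decide which expert each token goes to
    gate_logits = gate_network(tokens)               # [B*S, num_experts]
    gate_probs = torch.softmax(gate_logits, dim=-1)

    # Top-1 gating example: highest probability and its expert index per token
    expert_weights, expert_ids = gate_probs.max(dim=-1)  # both [B*S]

    # 3. Dispatch tokens to their experts
    dispatched_tokens = {}
    for expert_id in range(NUM_EXPERTS):
        mask = (expert_ids == expert_id)
        dispatched_tokens[expert_id] = tokens[mask]

    # 4. Run the experts (a plain loop here; in practice they run in parallel across devices)
    expert_outputs = {}
    for expert_id, expert_tokens in dispatched_tokens.items():
        if len(expert_tokens) > 0:
            expert_outputs[expert_id] = experts[expert_id](expert_tokens)

    # 5. Gather the results and scale each by its gate weight
    output_tokens = torch.zeros_like(tokens)
    for expert_id, expert_output in expert_outputs.items():
        mask = (expert_ids == expert_id)
        weights = expert_weights[mask].unsqueeze(1)
        output_tokens[mask] = expert_output * weights

    # 6. Restore the original shape
    output = output_tokens.reshape(B, S, D)
    return output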

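Number of parameters	num_experts × 2 × d_model × d_ff
Active parameters	Only the top-k experts are used
Memory usage	All parameters must be held in memory, but only part of them is computed
Compute complexity	O(all tokens × (k/num_experts) × parameters)
Communication overhead	All-to-All communication required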

Checking the layer implementation in DeepSpeed ...

# Copyright (c) Microsoft Corporation.
# SPDX-License-Identifier: Apache-2.0

# DeepSpeed Team

from typing import Optional, Tuple

import torch
from torch import nn
from torch.nn import functional as F

from deepspeed.utils import groups, log_dist
from .experts import Experts
from .sharded_moe import MOELayer, TopKGate


class MoE(nn.Module):
    """Initialize an MoE layer.

    Arguments:
        hidden_size (int): the hidden dimension of the model, importantly this is also the input and output dimension.
        expert (nn.Module): the torch module that defines the expert (e.g., MLP, torch.linear).
        num_experts (int, optional): default=1, the total number of experts per layer.
        ep_size (int, optional): default=1, number of ranks in the expert parallel world or group.
        k (int, optional): default=1, top-k gating value, only supports k=1 or k=2.
        capacity_factor (float, optional): default=1.0, the capacity of the expert at training time.
        eval_capacity_factor (float, optional): default=1.0, the capacity of the expert at eval time.
        min_capacity (int, optional): default=4, the minimum capacity per expert regardless of the capacity_factor.
        use_residual (bool, optional): default=False, make this MoE layer a Residual MoE (https://arxiv.org/abs/2201.05596) layer.
        noisy_gate_policy (str, optional): default=None, noisy gate policy, valid options are 'Jitter', 'RSample' or 'None'.
        drop_tokens (bool, optional): default=True, whether to drop tokens - (setting to False is equivalent to infinite capacity).
        use_rts (bool, optional): default=True, whether to use Random Token Selection.
        use_tutel (bool, optional): default=False, whether to use Tutel optimizations (if installed).
        enable_expert_tensor_parallelism (bool, optional): default=False, whether to use tensor parallelism for experts
        top2_2nd_expert_sampling (bool, optional): default=True, whether to perform sampling for 2nd expert
    """

    def __init__(self,
                 hidden_size: int,
                 expert: nn.Module,
                 num_experts: int = 1,
                 ep_size: int = 1,
                 k: int = 1,
                 capacity_factor: float = 1.0,
                 eval_capacity_factor: float = 1.0,
                 min_capacity: int = 4,
                 use_residual: bool = False,
                 noisy_gate_policy: Optional[str] = None,
                 drop_tokens: bool = True,
                 use_rts: bool = True,
                 use_tutel: bool = False,
                 enable_expert_tensor_parallelism: bool = False,
                 top2_2nd_expert_sampling: bool = True) -> None:

        super(MoE, self).__init__()

        self.use_residual = use_residual
        self.enable_expert_tensor_parallelism = enable_expert_tensor_parallelism
        assert num_experts % ep_size == 0, f"Number of experts ({num_experts}) should be divisible by expert parallel size ({ep_size})"
        self.ep_size = ep_size
        self.expert_group_name = f"ep_size_{self.ep_size}"
        self.num_experts = num_experts
        self.num_local_experts = num_experts // self.ep_size

        log_dist(
            f'Creating MoE layer with num_experts: {num_experts} | num_local_experts: {self.num_local_experts} | expert_parallel_size: {self.ep_size}',
            [0])

        assert noisy_gate_policy is None or noisy_gate_policy in ['None', 'Jitter', 'RSample'], \
            'Unsupported noisy_gate_policy: ' + noisy_gate_policy

        experts = Experts(expert, self.num_local_experts, self.expert_group_name)
        self.deepspeed_moe = MOELayer(TopKGate(hidden_size, num_experts, k, capacity_factor, eval_capacity_factor,
                                               min_capacity, noisy_gate_policy, drop_tokens, use_rts, None,
                                               top2_2nd_expert_sampling),
                                      experts,
                                      self.expert_group_name,
                                      self.ep_size,
                                      self.num_local_experts,
                                      use_tutel=use_tutel)
        if self.use_residual:
            self.mlp = expert
            # coefficient is used for weighted sum of the output of expert and mlp
            self.coefficient = nn.Linear(hidden_size, 2)

    def set_deepspeed_parallelism(self, use_data_before_expert_parallel_: bool = False) -> None:
        self._create_process_groups(use_data_before_expert_parallel_=use_data_before_expert_parallel_)

    def _create_process_groups(self, use_data_before_expert_parallel_: bool = False) -> None:
        # Create process group for a layer if needed
        if self.expert_group_name not in groups._get_expert_parallel_group_dict():
            print(f"No existing process group found, creating a new group named: {self.expert_group_name}")
            if (groups.mpu is None) or (not self.enable_expert_tensor_parallelism):
                # Condition 1 - no groups.mpu means no tensor parallelism
                # Condition 2 - disabling expert tensor parallelism on purpose
                groups._create_expert_and_data_parallel(
                    self.ep_size, use_data_before_expert_parallel_=use_data_before_expert_parallel_)
            else:
                # expert tensor parallelism is enabled
                groups._create_expert_data_and_model_parallel(
                    self.ep_size, mpu=groups.mpu, use_data_before_expert_parallel_=use_data_before_expert_parallel_)
        # Set the group handle for the MOELayer (deepspeed_moe) object
        self.deepspeed_moe._set_ep_group(groups._get_expert_parallel_group(self.expert_group_name))

    def forward(self,
                hidden_states: torch.Tensor,
                used_token: Optional[torch.Tensor] = None) -> Tuple[torch.Tensor, torch.Tensor, torch.Tensor]:
        """ MoE forward

        Arguments:
            hidden_states (Tensor): input to the layer
            used_token (Tensor, optional): default: None, mask only used tokens

        Returns:
            A tuple including output, gate loss, and expert count.

            * output (Tensor): output of the model

            * l_aux (Tensor): gate loss value

            * exp_counts (Tensor): expert count
        """
        output = self.deepspeed_moe(hidden_states, used_token)
        if self.use_residual:
            # Residual MoE
            output_mlp = self.mlp(hidden_states)
            if isinstance(output_mlp, tuple):
                output_mlp = output_mlp[0]  # Ignore the bias term for now
            coef = self.coefficient(hidden_states)
            coef = F.softmax(coef, dim=-1)
            output = output * coef[..., 0:1] + output_mlp * coef[..., 1:]
        return output, self.deepspeed_moe.l_aux, self.deepspeed_moe.exp_counts
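The `capacity_factor`, `min_capacity`, and `drop_tokens` arguments in the docstring above control how many tokens each expert may accept. The sketch below shows one common way such a capacity is computed and what dropping means. It is an illustration under assumptions, not code taken from DeepSpeed: the helper names and the exact rounding are hypothetical.

```python
import math

def expert_capacity(num_tokens, num_experts, capacity_factor=1.0, min_capacity=4):
    """Per-expert token budget: an even share scaled by capacity_factor, with a floor."""
    capacity = math.ceil(num_tokens / num_experts * capacity_factor)
    return max(capacity, min_capacity)

def count_dropped(expert_ids, num_experts, capacity):
    """Tokens beyond an expert's capacity are dropped (they skip the expert)."""
    counts = [0] * num_experts
    dropped = 0
    for e in expert_ids:
        if counts[e] < capacity:
            counts[e] += 1
        else:
            dropped += 1
    return dropped

cap_even = expert_capacity(num_tokens=16, num_experts=4, capacity_factor=1.0)
cap_slack = expert_capacity(num_tokens=16, num_experts=4, capacity_factor=1.25)

# 8 tokens, 4 experts, skewed routing toward expert 0
expert_ids = [0, 0, 0, 1, 2, 3, 0, 1]
cap = expert_capacity(num_tokens=8, num_experts=4, capacity_factor=1.0, min_capacity=1)
dropped = count_dropped(expert_ids, num_experts=4, capacity=cap)
print(cap_even, cap_slack, cap, dropped)
```

A larger capacity factor leaves headroom for uneven routing at the cost of more padding and memory. Disabling token dropping corresponds to unlimited capacity, as the `drop_tokens` entry in the docstring notes.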

Ref.

Mixture of Experts Explained (Hugging Face blog)
https://huggingface.co/blog/moe

Mixture of Experts – Yuxi on the Wired
https://yuxi.ml/essays/posts/mixture-of-experts/#:~:text=It%20is%20no%20coincidence%2C%20then%2C,29

DeepSpeed powers 8x larger MoE model training with high performance (Microsoft Research)
https://www.microsoft.com/en-us/research/blog/deepspeed-powers-8x-larger-moe-model-training-with-high-performance/#:~:text=DeepSpeed%20,for%20a%20constant%20compute%20budget

DeepSpeed/deepspeed/moe/layer.py at master (deepspeedai/DeepSpeed)
https://github.com/deepspeedai/DeepSpeed/blob/master/deepspeed/moe/layer.py
