Training LLMs Efficiently in Time and Memory (1) (Gradient Accumulation, Gradient Checkpointing, Mixed Precision Training ... )

AI/NLP 2024. 11. 4. 22:12


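When you train a model, you run into OOM (out-of-memory) errors, and long training times become a headache too...!

Let's find ways to get the most performance out of the environment you actually have!

This post looks at what can be done in a single-GPU environment.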

https://huggingface.co/docs/transformers/en/perf_train_gpu_one

Methods and tools for efficient training on a single GPU


1. Gradient Accumulation

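- Time 👎 Memory 👍

- Training an LLM with a small batch size tends to hurt model performance; gradient accumulation is a way around this.

- Instead of updating the weights after every backward pass (every step), gradients are accumulated over several steps and the update is applied once.

- Computation time goes up in exchange for lower memory usage, since each update needs several small forward/backward passes instead of one large one. (Gradient checkpointing, covered in section 2, can likewise improve memory efficiency, but it slows training by about 20%.)

- With accelerate, it is as simple as the following!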

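from accelerate import Accelerator

# Accumulate gradients over 2 batches per optimizer update
accelerator = Accelerator(gradient_accumulation_steps=2)
# model, optimizer, training_dataloader, scheduler, and loss_function are assumed to be defined already
model, optimizer, training_dataloader, scheduler = accelerator.prepare(
    model, optimizer, training_dataloader, scheduler
)
for batch in training_dataloader:
    # accumulate() tracks how many batches have been seen since the last update
    with accelerator.accumulate(model):
        inputs, targets = batch
        outputs = model(inputs)
        loss = loss_function(outputs, targets)
        accelerator.backward(loss)
        optimizer.step()
        scheduler.step()
        optimizer.zero_grad()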
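
To make the accumulation idea above concrete, here is a minimal sketch in plain Python (no GPU or framework needed). It assumes a toy one-parameter model with a mean-squared-error loss; `grad_mse`, `data`, and the other names are hypothetical and only for illustration. It suggests that two accumulated micro-batches give the same update as one batch twice the size, as long as the micro-batches are equally sized.

```python
# Toy model: y_hat = w * x, loss = mean squared error over a batch.
# All names here are hypothetical and only for illustration.

def grad_mse(w, batch):
    """Gradient of mean((w*x - y)^2) with respect to w over one batch."""
    return sum(2 * (w * x - y) * x for x, y in batch) / len(batch)

data = [(1.0, 2.0), (2.0, 4.5), (3.0, 5.5), (4.0, 8.5)]
w = 0.5
lr = 0.01

# (a) One large batch of 4 samples.
full_grad = grad_mse(w, data)

# (b) Two micro-batches of 2 samples, accumulated before a single update.
accumulation_steps = 2
micro_batches = [data[0:2], data[2:4]]
accumulated_grad = 0.0
for micro_batch in micro_batches:
    # Dividing by accumulation_steps plays the role of scaling the loss
    # before the backward pass, so the sum becomes an average.
    accumulated_grad += grad_mse(w, micro_batch) / accumulation_steps

w_full = w - lr * full_grad
w_accumulated = w - lr * accumulated_grad
print(full_grad, accumulated_grad)
```

Only one micro-batch has to be held in memory at a time, which is where the memory saving comes from; the price is one extra forward/backward pass per micro-batch.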

https://huggingface.co/docs/accelerate/en/usage_guides/gradient_accumulation

Performing gradient accumulation with Accelerate


2. Gradient Checkpointing

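- Time 👎 Memory 👍

- Even when gradient accumulation is used to collect several steps before updating the weights, memory can still run out during backpropagation.

- Instead of keeping everything the backward pass needs (the activations produced in the forward pass), only part of it is stored, and the rest is recomputed when it is needed.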

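from transformers import Trainer, TrainingArguments

# model, ds, default_args, and print_summary are assumed to be defined elsewhere (they are not defined in this post)
training_args = TrainingArguments(
    per_device_train_batch_size=1,
    gradient_accumulation_steps=4,
    gradient_checkpointing=True,  # store only part of the activations and recompute the rest
    **default_args
)

trainer = Trainer(model=model, args=training_args, train_dataset=ds)
result = trainer.train()
print_summary(result)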
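
The store-some, recompute-the-rest idea can be sketched without any framework. The example below is only an illustration under stated assumptions: the "network" is a hypothetical chain of scalar `tanh` layers, and `WEIGHTS`, `backward_full`, and `backward_checkpointed` are made-up names, not part of transformers or PyTorch. It compares backpropagation with all activations kept against a version that keeps only every fourth one.

```python
import math

# Hypothetical toy "network": a chain of scalar layers h_next = tanh(w * h).
WEIGHTS = [0.9, -1.1, 0.7, 1.3, -0.8, 0.6, 1.2, -0.5, 0.95, -1.4, 0.85, 1.05]

def layer(w, h):
    return math.tanh(w * h)

def local_grad(w, out):
    # d tanh(w*h)/dh = w * (1 - tanh(w*h)^2), where out = tanh(w*h)
    return w * (1.0 - out ** 2)

def backward_full(x):
    """Keep every activation from the forward pass, then backpropagate."""
    acts = [x]
    for w in WEIGHTS:
        acts.append(layer(w, acts[-1]))
    grad = 1.0
    for i in reversed(range(len(WEIGHTS))):
        grad *= local_grad(WEIGHTS[i], acts[i + 1])
    return grad, len(acts)

def backward_checkpointed(x, every=4):
    """Keep only every `every`-th activation and recompute each segment on demand."""
    checkpoints = {0: x}
    h = x
    for i, w in enumerate(WEIGHTS):
        h = layer(w, h)
        if (i + 1) % every == 0 and i + 1 < len(WEIGHTS):
            checkpoints[i + 1] = h
    grad, peak, end = 1.0, len(checkpoints), len(WEIGHTS)
    for start in sorted(checkpoints, reverse=True):
        seg = [checkpoints[start]]  # recompute this segment's activations
        for i in range(start, end):
            seg.append(layer(WEIGHTS[i], seg[-1]))
        # Values held at once: all checkpoints plus the recomputed segment.
        peak = max(peak, len(checkpoints) + len(seg) - 1)
        for i in reversed(range(start, end)):
            grad *= local_grad(WEIGHTS[i], seg[i - start + 1])
        end = start
    return grad, peak

grad_full, stored_full = backward_full(0.5)
grad_ckpt, stored_ckpt = backward_checkpointed(0.5)
print(grad_full, grad_ckpt, stored_full, stored_ckpt)
```

Both versions produce the same gradient, but the checkpointed one holds fewer values at its peak and pays for that by running each segment's forward computation a second time, which is the time 👎 memory 👍 trade-off noted above.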

3. Mixed Precision Training

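- Time 👍 Memory 👍

- Most of the time, the data type used when training a model is 32-bit (FP32), for both the weights and the inputs!

- To be efficient in both memory and training time, however, training can be done with an appropriate mix of FP32 and FP16.

- So which values are kept in FP32 and which in FP16? The tensors in question are: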

activations
activation gradients
weights
weight gradients

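- Of these, weights and weight gradients are represented well within the FP16 range, but activation gradients stored in FP16 can fall below the smallest magnitude FP16 can represent and are forced to zero (underflow). This is why the loss is scaled up before the backward pass: it shifts the gradients into the representable range.

- PyTorch provides AMP (Automatic Mixed Precision), a package for mixed precision training, which can be used as follows.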

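import torch

# model, optimizer, and data_iter are assumed to be defined elsewhere;
# loss_fn is a hypothetical name for the loss function
scaler = torch.cuda.amp.GradScaler()  # create once, before the training loop

for data, label in data_iter:
    optimizer.zero_grad()
    with torch.cuda.amp.autocast():  # cast eligible operations to lower precision
        outputs = model(data)
        loss = loss_fn(outputs, label)

    scaler.scale(loss).backward()  # scale the loss, then run the backward pass
    scaler.step(optimizer)  # unscale the gradients to their original scale, then update the weights
    scaler.update()  # adjust the scale for the next iteration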
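
The underflow problem and the effect of loss scaling can be seen with the standard library alone, since Python's `struct` module supports the IEEE half-precision format (`"e"`). The sketch below is only an illustration: `tiny_grad` and the fixed `loss_scale` are hypothetical values, whereas `GradScaler` adjusts its scale dynamically during training.

```python
import struct

def to_fp16(value):
    """Round a Python float to the nearest FP16 value and return it as a float."""
    return struct.unpack("e", struct.pack("e", value))[0]

tiny_grad = 1e-8     # hypothetical gradient, smaller than FP16's smallest subnormal (about 6e-8)
loss_scale = 1024.0  # hypothetical fixed scale factor

# Without scaling, the gradient becomes zero when stored in FP16.
unscaled_fp16 = to_fp16(tiny_grad)

# With scaling, the scaled gradient survives in FP16 and is divided back in higher precision.
scaled_fp16 = to_fp16(tiny_grad * loss_scale)
recovered = scaled_fp16 / loss_scale

print(unscaled_fp16, scaled_fp16, recovered)
```

The recovered value is not exact, because FP16 has limited precision, but it is no longer lost entirely.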

4. Reducing the Batch Size
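
Reducing the batch size does not have to change the effective batch size if it is combined with gradient accumulation from section 1. As a rough sketch, the helper below picks a per-device batch size that fits in memory and the number of accumulation steps needed to reach a target batch size. `accumulation_plan` and its parameters are hypothetical names for illustration, not part of transformers or accelerate; the two results would map onto `per_device_train_batch_size` and `gradient_accumulation_steps`.

```python
import math

def accumulation_plan(target_batch_size, max_micro_batch_size):
    """Return (micro_batch_size, accumulation_steps) covering target_batch_size.

    Hypothetical helper for illustration only.
    """
    micro = min(target_batch_size, max_micro_batch_size)
    steps = math.ceil(target_batch_size / micro)
    return micro, steps

# Assume the GPU can hold at most 4 samples at once and the target batch size is 32.
micro, steps = accumulation_plan(target_batch_size=32, max_micro_batch_size=4)
print(micro, steps, micro * steps)
```

When the target is not a multiple of the micro-batch size, this sketch rounds the number of steps up, so the effective batch size ends up slightly larger than the target.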

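If the methods above still do not help, consider using a GPU with better specs or moving to a multi-GPU environment.

Of course, everything above also applies when several GPUs are used.

The next post looks at the options available in a multi-GPU environment.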

Reference

https://littlefoxdiary.tistory.com/126

[Huggingface] How to train models efficiently on a single GPU

https://jihan819.tistory.com/entry/AI-Mixed-Precision-Training-%EC%9D%B4%EB%9E%80

[AI] What is Mixed Precision Training?

https://velog.io/@twinjuy/Auto-Mixed-Precision%EC%9D%B4%EB%9E%80

What is Auto Mixed Precision?
