[Team Seminar] Why do LLMs attend to the first token?

AI/NLP 2025. 5. 10. 16:22

I had been curious about this paper for a while, and a teammate happened to review it. Nice timing.

Why do LLMs attend to the first token?

Large Language Models (LLMs) tend to attend heavily to the first token in the sequence -- creating a so-called attention sink. Many works have studied this phenomenon in detail, proposing various ways to either leverage or alleviate it. Attention sinks hav

arxiv.org

Summary

Attention Sink

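LLMs often show the attention sink phenomenon: a large share of attention is concentrated on the first token of the sequence
Earlier work argued that this should be mitigated or removed, but more recent work instead exploits it or treats it as something essential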

Why do models learn and use attention sinks?
- What is the benefit of an attention sink?
- How does it relate to context length and model depth?

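What effect does an attention sink have?
- Perturbation: a very small change in the input
- Over-mixing: across many layers and long sequences, information is mixed between tokens so heavily that all token embeddings end up nearly identical
  - This happens when there is no sink: excessive mixing spreads information across every embedding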

-> Conversely, having a sink slows down over-mixing
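To make the over-mixing idea concrete, here is a toy sketch. It is my own simplification, not the paper's model: each layer moves token i a fraction `mix` of the way toward the causal mean of tokens 0..i, and a sink that absorbs `sink_mass` of the attention while contributing a near-zero value is modeled as shrinking that fraction to `mix * (1 - sink_mass)`. All names and numbers here are hypothetical.

```python
import math

def causal_mix_layer(tokens, mix):
    """Move each token a fraction `mix` toward the mean of itself and all earlier tokens."""
    dim = len(tokens[0])
    running = [0.0] * dim
    out = []
    for i, tok in enumerate(tokens):
        running = [r + t for r, t in zip(running, tok)]
        mean = [r / (i + 1) for r in running]
        out.append([(1 - mix) * t + mix * m for t, m in zip(tok, mean)])
    return out

def spread(tokens):
    """Largest pairwise Euclidean distance; a value near 0 means the embeddings collapsed."""
    return max(math.dist(a, b) for a in tokens for b in tokens)

def spread_after_layers(tokens, layers, mix, sink_mass):
    # Assumption: attention mass sent to the sink mixes nothing into the token.
    effective_mix = mix * (1 - sink_mass)
    for _ in range(layers):
        tokens = causal_mix_layer(tokens, effective_mix)
    return spread(tokens)

TOKENS = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0], [0.0, -1.0], [1.0, 1.0], [-1.0, -1.0]]

no_sink = spread_after_layers(TOKENS, layers=64, mix=0.5, sink_mass=0.0)
with_sink = spread_after_layers(TOKENS, layers=64, mix=0.5, sink_mass=0.9)
print("initial spread:  ", spread(TOKENS))
print("no sink:         ", no_sink)
print("90% sink mass:   ", with_sink)
```

In this toy setup both runs eventually collapse toward the first token, but the run with a sink gets there far more slowly, which is the "slows down over-mixing" point in miniature.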

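Experiments show that when a sink forms, the model keeps its representations more robust
- If a perturbation does not spread far: stable, information is preserved
- If a perturbation spreads widely: unstable, information gets diffused
This is shown through the norm of the Jacobian
- Mathematical argument: the deeper the layers, the longer the context length, and the more heads there are, the more likely over-mixing becomes, and the stronger the sink gets
- The experiments agree with this as well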
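A one-layer sketch of why a sink dampens perturbations, under a simplifying assumption of mine: the attention weights are held fixed, so for an output o = sum_j a_j * v_j the Jacobian with respect to value v_j is a_j times the identity, and its norm is just a_j. The logits below are made up for illustration, and position 0 stands in for the first token, which is constant and therefore not perturbed.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def value_sensitivity(logits):
    """With fixed weights, |d o / d v_j| equals the attention weight a_j."""
    return softmax(logits)

# Hypothetical attention logits for one query over five positions.
flat_logits = [0.0, 0.0, 0.0, 0.0, 0.0]  # no sink: attention is spread evenly
sink_logits = [4.0, 0.0, 0.0, 0.0, 0.0]  # sink: a large logit on the first token

flat = value_sensitivity(flat_logits)
sink = value_sensitivity(sink_logits)

# Total sensitivity of the output to perturbations of the non-sink tokens.
flat_exposure = sum(flat[1:])
sink_exposure = sum(sink[1:])
print("exposure without sink:", flat_exposure)
print("exposure with sink:   ", sink_exposure)
```

The more attention mass the first token absorbs, the less a small change in any other token can move the output of that head, which matches the "perturbation does not spread far" case in the notes.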

Thoughts

I had heard that, because of the training direction, information gathers at CLS in encoders and at EOS in decoders, so it is interesting that this paper points the opposite way
- https://wikidocs.net/161973
- https://applepy.tistory.com/138

Gemma 3 Chat Template

<bos> is present, but <eos> is not
https://huggingface.co/google/gemma-3-27b-it/discussions/6

<bos><start_of_turn>user
knock knock<end_of_turn>
<start_of_turn>model
who is there<end_of_turn>
<start_of_turn>user
Gemma<end_of_turn>
<start_of_turn>model
Gemma who?<end_of_turn>

Qwen Chat Template
<|im_start|>system\nYou are a helpful assistant.<|im_end|>\n<|im_start|>user\nWhat is large language model.<|im_end|>\n<|im_start|>assistant\n
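The two formats above can be reproduced with a few lines of string formatting. This is a hand-rolled sketch for comparison only; `render_gemma_style` and `render_qwen_style` are hypothetical helpers of mine, not the official templates. In practice the tokenizer's own `apply_chat_template` in Hugging Face Transformers is the thing to rely on.

```python
def render_gemma_style(messages):
    """Gemma-style turns: starts with <bos>, never emits <eos>."""
    parts = ["<bos>"]
    for role, text in messages:
        parts.append(f"<start_of_turn>{role}\n{text}<end_of_turn>\n")
    return "".join(parts)

def render_qwen_style(messages, add_generation_prompt=True):
    """Qwen-style (ChatML-like) turns: no leading special token in the example above."""
    parts = [f"<|im_start|>{role}\n{text}<|im_end|>\n" for role, text in messages]
    if add_generation_prompt:
        # Open the assistant turn so the model continues from here.
        parts.append("<|im_start|>assistant\n")
    return "".join(parts)

gemma_text = render_gemma_style([
    ("user", "knock knock"),
    ("model", "who is there"),
])
qwen_text = render_qwen_style([
    ("system", "You are a helpful assistant."),
    ("user", "What is large language model."),
])
print(gemma_text)
print(qwen_text)
```

Seen side by side, the Gemma-style prompt always begins with the same <bos> token, while the Qwen-style example begins directly with <|im_start|>, so the first position is still a fixed token in both cases.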

Now that I have looked at these, I should go through how chat_template differs from model to model
