
[๋”ฅ๋Ÿฌ๋‹, NLP] Transformer(Positional encoding, Attention)

Jayden1116 2022. 3. 7. 21:34

Positional Encoding

  • Unlike an RNN, the Transformer receives all tokens at once, so it cannot capture the position or order of words through recursion.
  • For that reason, information about each token's position is created up front at input time and added to the token representation; this step is Positional Encoding (see the sketch below).
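
A minimal NumPy sketch of the sinusoidal positional encoding from "Attention Is All You Need"; the function name and shapes are illustrative assumptions, not code from this post:

```python
import numpy as np

def positional_encoding(max_len: int, d_model: int) -> np.ndarray:
    """Sinusoidal positional encoding (Vaswani et al., 2017).

    PE(pos, 2i)   = sin(pos / 10000^(2i / d_model))
    PE(pos, 2i+1) = cos(pos / 10000^(2i / d_model))
    """
    pos = np.arange(max_len)[:, np.newaxis]        # (max_len, 1) token positions
    i = np.arange(d_model)[np.newaxis, :]          # (1, d_model) embedding dimensions
    angle = pos / np.power(10000.0, (2 * (i // 2)) / d_model)

    pe = np.zeros((max_len, d_model))
    pe[:, 0::2] = np.sin(angle[:, 0::2])           # even dimensions: sine
    pe[:, 1::2] = np.cos(angle[:, 1::2])           # odd dimensions: cosine
    return pe

# The encoding is simply added to the token embeddings before the first layer:
# x = token_embeddings + positional_encoding(seq_len, d_model)
```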

Self-Attention

  • Attention: at every time step where the decoder predicts an output word, it refers back to the entire input sentence from the encoder. Rather than weighting every input token equally, it concentrates (attends) more heavily on the input tokens that are most relevant to the word being predicted at that step.
  • Self-Attention applies this attention to the sequence itself, in order to capture the relationships among the tokens within a sentence.
  • The query, key, and value all come from the same source (q = k = v in origin), as in the sketch below.
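
A minimal NumPy sketch of scaled dot-product self-attention, where Q, K, and V are all projected from the same input X; the projection matrices W_q, W_k, W_v and the toy shapes are assumptions for illustration:

```python
import numpy as np

def softmax(x: np.ndarray, axis: int = -1) -> np.ndarray:
    x = x - x.max(axis=axis, keepdims=True)        # subtract max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(X: np.ndarray, W_q, W_k, W_v, mask=None):
    """Scaled dot-product self-attention: Q, K, V all come from the same input X."""
    Q, K, V = X @ W_q, X @ W_k, X @ W_v            # same source -> "self"-attention
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                # (seq_len, seq_len) token-to-token similarity
    if mask is not None:
        scores = np.where(mask, scores, -1e9)      # masked positions -> ~0 after softmax
    weights = softmax(scores, axis=-1)             # how much each token attends to every other token
    return weights @ V, weights

# Toy usage: 4 tokens, model dim 8 (random weights stand in for learned projections)
rng = np.random.default_rng(0)
X = rng.normal(size=(4, 8))
W_q, W_k, W_v = (rng.normal(size=(8, 8)) for _ in range(3))
out, attn = self_attention(X, W_q, W_k, W_v)
print(attn.shape)   # (4, 4): each row sums to 1
```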

Masked Self-Attention

  • The Transformer receives all tokens of a sequence at once, which means the decoder also receives the full output-side sequence in one shot. Since the Transformer has no built-in notion of sequential order, the positions after the current one (including the value to be predicted at time step t) are masked, i.e. their attention scores are pushed to a very large negative value so they become 0 after the softmax. This prevents a data-leakage effect in which future values would influence the prediction. Masking is applied only in the decoder's Self-Attention step (see the sketch below).
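
A minimal sketch of the causal (look-ahead) mask, assuming the self_attention function from the previous sketch; positions after step t receive a large negative score, so their softmax weight is effectively 0:

```python
import numpy as np

seq_len = 5
# Lower-triangular (causal) mask: position t may attend only to positions <= t.
causal_mask = np.tril(np.ones((seq_len, seq_len), dtype=bool))
print(causal_mask.astype(int))

# Reusing the self_attention sketch above (hypothetical projections W_q, W_k, W_v):
# out, attn = self_attention(X, W_q, W_k, W_v, mask=causal_mask)
# attn[t, t+1:] is effectively 0, so future tokens cannot leak into the prediction at step t.
```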


Reference