💿 Data/Odds and Ends

    [๋”ฅ๋Ÿฌ๋‹, NLP] ๋‹ค์–‘ํ•œ ํ…์ŠคํŠธ ์ „์ฒ˜๋ฆฌ ๋ฐฉ๋ฒ•

    1. Text Preprocessing
    1-1. Simple tokenization

        import spacy
        from spacy.tokenizer import Tokenizer

        exam = ["I want to be a superman. Sometimes, I imagine that i have a super power and fly to the sky without anything. Someone says 'it's not possible', but i trust myself.",
                "I feel better than anytime."]

        nlp = spacy.load("en_core_web_sm")
        tok = Tokenizer(nlp.vocab)   # bare Tokenizer: splits on whitespace only

        exam_token = []
        for doc in tok.pipe(exam):
            # the preview is cut off here ("do.."); a plausible completion that
            # collects each document's token strings:
            exam_token.append([token.text for token in doc])

    [๋”ฅ๋Ÿฌ๋‹, NLP] Transformer(Positional encoding, Attention)

    Positional Encoding: Unlike an RNN, a Transformer takes in all tokens at once, so it cannot capture word position and order through recursion. For that reason, information about each token's position is created at input time and added to the token; this process is Positional Encoding. Self-Attention. Attention: a method in which, at every time step where the decoder predicts an output word, the entire input sentence from the encoder is consulted. The input tokens are not all given equal weight; instead, the model focuses (attends) more heavily on the input tokens most relevant to the word being predicted at that time step. Within a sentence, the to..
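    As a sketch of both ideas (assuming NumPy; the function names here are illustrative, not from the post): the encoding below follows the sinusoidal formulas from the original Transformer paper, and the attention function computes softmax(QK^T / sqrt(d_k))V.

        import numpy as np

        def positional_encoding(max_len, d_model):
            # PE(pos, 2i)   = sin(pos / 10000^(2i / d_model))
            # PE(pos, 2i+1) = cos(pos / 10000^(2i / d_model))
            pos = np.arange(max_len)[:, None]        # (max_len, 1)
            i = np.arange(d_model)[None, :]          # (1, d_model)
            angle = pos / np.power(10000.0, (2 * (i // 2)) / d_model)
            pe = np.zeros((max_len, d_model))
            pe[:, 0::2] = np.sin(angle[:, 0::2])     # even dims get sine
            pe[:, 1::2] = np.cos(angle[:, 1::2])     # odd dims get cosine
            return pe                                # added to the token embeddings

        def scaled_dot_product_attention(Q, K, V):
            # softmax(Q K^T / sqrt(d_k)) V: every query attends to every key,
            # with higher weight on the more relevant tokens.
            d_k = Q.shape[-1]
            scores = Q @ K.T / np.sqrt(d_k)
            w = np.exp(scores - scores.max(axis=-1, keepdims=True))  # stable softmax
            w = w / w.sum(axis=-1, keepdims=True)
            return w @ V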

    [๋”ฅ๋Ÿฌ๋‹, NLP] RNN, LSTM, GRU

    RNN (Recurrent Neural Network): A neural network with only an input layer -> hidden layer -> output layer structure is called a feed-forward neural network. A recurrent neural network (RNN), in contrast, sends the hidden layer's value on to the output layer while also feeding the hidden node's result back in as input for the next computation. [Figure: RNN structure] Both the left and right diagrams depict the RNN structure. In the right diagram, the third cell (the light-green box) receives the result produced from the previous input x2 together with the new input x3, and outputs y3. In other words, when an RNN computes the output y_t for the input x_t at time step t, it also takes the result from the previous time step (t-1) as an input. The reason for this structure is that natural language and time se..
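    A minimal sketch of one recurrent step, assuming NumPy and toy dimensions (the names rnn_step, W_x, W_h are this sketch's own, not the post's); the point is that the previous hidden state is fed back in alongside the new input x_t.

        import numpy as np

        def rnn_step(x_t, h_prev, W_x, W_h, b):
            # h_t = tanh(x_t W_x + h_{t-1} W_h + b): the previous step's
            # hidden state comes back in as an extra input.
            return np.tanh(x_t @ W_x + h_prev @ W_h + b)

        T, d_in, d_h = 5, 4, 3                  # toy sequence length and sizes
        rng = np.random.default_rng(0)
        W_x = rng.normal(size=(d_in, d_h))
        W_h = rng.normal(size=(d_h, d_h))
        b = np.zeros(d_h)

        h = np.zeros(d_h)                       # initial hidden state
        for x_t in rng.normal(size=(T, d_in)):  # one step per time step
            h = rnn_step(x_t, h, W_x, W_h, b)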

    [๋”ฅ๋Ÿฌ๋‹, NLP] ๋ถ„ํฌ ๊ฐ€์„ค, Word2Vec

    ๋ถ„ํฌ ๊ฐ€์„ค(Distributed Representation) ํšŸ์ˆ˜ ๊ธฐ๋ฐ˜์ด ์•„๋‹Œ, ๋‹จ์–ด์˜ ๋ถ„ํฌ๋ฅผ ๊ธฐ๋ฐ˜์œผ๋กœ ํ‘œํ˜„ํ•˜๋Š” ๋ถ„ํฌ๊ธฐ๋ฐ˜ ๋‹จ์–ดํ‘œํ˜„์˜ ๋ฐฐ๊ฒฝ์ด ๋˜๋Š” ๊ฐ€์„ค ๋น„์Šทํ•œ ์œ„์น˜์— ๋“ฑ์žฅํ•˜๋Š” ๋‹จ์–ด๋“ค์€ ๋น„์Šทํ•œ ์˜๋ฏธ๋ฅผ ์ง€๋‹Œ๋‹ค.๋Š” ๊ฐ€์„ค์ž…๋‹ˆ๋‹ค. Word2Vec ๋ง๊ทธ๋Œ€๋กœ ๋‹จ์–ด๋ฅผ ๋ฒกํ„ฐํ™”ํ•˜๋Š” ๋ฐฉ๋ฒ• ์ค‘ ํ•˜๋‚˜๋กœ ์›ํ•ซ์ธ์ฝ”๋”ฉ๊ณผ๋Š” ๋‹ค๋ฅธ ๋ถ„์‚ฐ ํ‘œํ˜„ ๋ฐฉ๋ฒ•์ž…๋‹ˆ๋‹ค. ์›ํ•ซ์ธ์ฝ”๋”ฉ์€ ๋‹จ์–ด ๋ฒกํ„ฐ์˜ ์ฐจ์›์ด ๋‹จ์–ด ์ง‘ํ•ฉ์˜ ํฌ๊ธฐ๊ฐ€ ๋˜๋ฉฐ, ํ•ด๋‹นํ•˜์ง€ ์•Š๋Š” ์—ด์—๋Š” ์ „๋ถ€ 0 ๊ฐ’์œผ๋กœ ํฌ์†Œํ•˜๋‹ค๋ฉด, Word2Vec์€ ๋น„๊ต์  ์ €์ฐจ์›์— ๋‹จ์–ด์˜ ์˜๋ฏธ๋ฅผ ๋ถ„์‚ฐํ•˜์—ฌ ํ‘œํ˜„ํ•˜๊ฒŒ ๋ฉ๋‹ˆ๋‹ค. ์ฃผ๋ณ€ ๋‹จ์–ด๋ฅผ input์œผ๋กœ ์ค‘์‹ฌ ๋‹จ์–ด๋ฅผ ์˜ˆ์ธกํ•˜๋Š” CBoW ๋ฐฉ๋ฒ•๊ณผ ์ค‘์‹ฌ ๋‹จ์–ด๋ฅผ input์œผ๋กœ ์ฃผ๋ณ€ ๋‹จ์–ด๋ฅผ ์˜ˆ์ธกํ•˜๋Š” Skip-gram ๋ฐฉ๋ฒ•์ด ์žˆ์Šต๋‹ˆ๋‹ค.

    [๋”ฅ๋Ÿฌ๋‹, NLP] ๋ถˆ์šฉ์–ด, ์ถ”์ถœ, BoW/TF-IDF

    Stop words: words that appear frequently but are of little help in analyzing natural language. To select as many meaningful words (tokens) as possible from a corpus, it is best to remove stop words. Words like I, he, and her, along with particles and suffixes, are usually treated as stop words. Stemming: one of the normalization methods for reducing the number of word forms in a corpus: only the stem carrying the word's core meaning is extracted. e.g., analysis and analytic both carry the meaning of analysis, so they can be reduced to analy. As the example shows, extracting only the stem produces words that are not in the dictionary. Lemmatization: likewise, one of the normalization methods that can reduce the number of word forms in corpus da..
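    A minimal sketch of stop-word removal plus the two normalization methods, assuming NLTK (a library choice made here, not stated in the post) and a hypothetical token list.

        import nltk
        from nltk.corpus import stopwords
        from nltk.stem import PorterStemmer, WordNetLemmatizer

        nltk.download("stopwords")   # one-time resource downloads
        nltk.download("wordnet")

        tokens = ["I", "am", "flying", "analysis", "analytic", "studies"]  # hypothetical

        # stop-word removal: drop frequent, low-information tokens ("I", "am")
        sw = set(stopwords.words("english"))
        content = [t for t in tokens if t.lower() not in sw]

        stemmer = PorterStemmer()
        lemmatizer = WordNetLemmatizer()
        for t in content:
            # stemming can produce out-of-dictionary stems (e.g. "analysi");
            # lemmatization returns a real dictionary form.
            print(t, "->", stemmer.stem(t), "/", lemmatizer.lemmatize(t, pos="v"))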