1. Text Preprocessing
1-1. Simple tokenization
exam = ["I want to be a superman. Sometimes, I imagine that i have a super power and fly to the sky without anything. Someone says 'it's not possible', but i trust myself.", "I feel better than anytime."]
import spacy
from spacy.tokenizer import Tokenizer
nlp = spacy.load("en_core_web_sm")
tok = Tokenizer(nlp.vocab)
exam_token = []
for doc in tok.pipe(exam):
    doc_tokens = [token for token in doc]
    exam_token.append(doc_tokens)
print(exam_token)
1-2. Tokenization with regular expressions
import re
exam_token = []
for doc in tok.pipe(exam):
    doc_tokens = [re.sub(r"[^a-z0-9]", "", token.text.lower()) for token in doc]
    exam_token.append(doc_tokens)
print(exam_token)
1-3. Tokenization with spaCy's built-in token attributes (is_stop, is_punct)
exam_token = []
for doc in tok.pipe(exam):
    doc_tokens = []
    for token in doc:
        # `&` binds tighter than `==`, so `== False & ...` does not do what
        # it looks like; use boolean `not`/`and` instead
        if not token.is_stop and not token.is_punct:
            doc_tokens.append(token)
    exam_token.append(doc_tokens)
print(exam_token)
1-4. Tokenization combining regular expressions and spaCy built-ins
exam_token = []
for doc in tok.pipe(exam):
    doc_tokens = []
    for token in doc:
        if not token.is_stop and not token.is_punct:
            doc_tokens.append(re.sub(r"[^a-z0-9]", "", token.text.lower()))
    exam_token.append(doc_tokens)
print(exam_token)
2. Word Vectorization (count-based)
2-1. TF (CountVectorizer)
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
vect = CountVectorizer(stop_words='english')
dtm = vect.fit_transform(exam)
# get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out()
dtm = pd.DataFrame(dtm.todense(), columns=vect.get_feature_names_out())
dtm
2-2. TF-IDF(TfidfVectorizer)
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
vect = TfidfVectorizer(stop_words='english')
dtm = vect.fit_transform(exam)
# get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out()
dtm = pd.DataFrame(dtm.todense(), columns=vect.get_feature_names_out())
dtm
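The values TfidfVectorizer produces follow its default formula (smooth_idf=True, norm='l2'): idf(t) = ln((1 + n) / (1 + df(t))) + 1, multiplied by the raw term count and then L2-normalized per document. The numbers below are an assumed toy case to show the arithmetic, not values taken from the DataFrame above:

```python
# Hedged sketch of TfidfVectorizer's default (pre-normalization) scoring.
import math

n_docs = 2
df = 1                                       # term appears in 1 of 2 documents
idf = math.log((1 + n_docs) / (1 + df)) + 1  # smoothed idf
tfidf = 1 * idf                              # raw count of 1 in the document
print(round(tfidf, 4))                       # ≈ 1.4055 before L2 normalization
```

A term appearing in every document gets idf = ln(1) + 1 = 1, so rare terms are weighted up relative to common ones.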
In this post I split out the parts of text preprocessing that I personally found confusing and applied each of them to an example. I hope it serves as a useful reference for your own studies!
That's all. Thank you. :)