1. Text Preprocessing
1-1. Simple tokenization
exam = ["I want to be a superman. Sometimes, I imagine that i have a super power and fly to the sky without anything. Someone says 'it's not possible', but i trust myself.", "I feel better than anytime."]
import spacy
from spacy.tokenizer import Tokenizer
nlp = spacy.load("en_core_web_sm")
tok = Tokenizer(nlp.vocab)
exam_token = []
for doc in tok.pipe(exam):
    doc_tokens = [token for token in doc]
    exam_token.append(doc_tokens)
print(exam_token)
1-2. Tokenization with regular expressions
import re
exam_token = []
for doc in tok.pipe(exam):
    doc_tokens = [re.sub(r"[^a-z0-9]", "", token.text.lower()) for token in doc]
    exam_token.append(doc_tokens)
print(exam_token)
1-3. Tokenization with spaCy's built-in token attributes (is_stop, is_punct)
exam_token = []
for doc in tok.pipe(exam):
    doc_tokens = []
    for token in doc:
        # `&` binds tighter than `==`, so `== False & ...` does not do what
        # it looks like; use boolean `not`/`and` instead
        if not token.is_stop and not token.is_punct:
            doc_tokens.append(token)
    exam_token.append(doc_tokens)
print(exam_token)
1-4. Tokenization combining regular expressions and spaCy built-ins
exam_token = []
for doc in tok.pipe(exam):
    doc_tokens = []
    for token in doc:
        if not token.is_stop and not token.is_punct:
            doc_tokens.append(re.sub(r"[^a-z0-9]", "", token.text.lower()))
    exam_token.append(doc_tokens)
print(exam_token)
2. Word Vectorization (count-based)
2-1. TF (CountVectorizer)
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
vect = CountVectorizer(stop_words='english')
dtm = vect.fit_transform(exam)
# get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out()
dtm = pd.DataFrame(dtm.todense(), columns=vect.get_feature_names_out())
dtm
2-2. TF-IDF(TfidfVectorizer)
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
vect = TfidfVectorizer(stop_words='english')
dtm = vect.fit_transform(exam)
# get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out()
dtm = pd.DataFrame(dtm.todense(), columns=vect.get_feature_names_out())
dtm
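The values TfidfVectorizer produces follow its default formula (smooth_idf=True, norm='l2'): idf(t) = ln((1 + n) / (1 + df(t))) + 1, multiplied by the raw term count and then L2-normalized per document. The numbers below are an assumed toy case to show the arithmetic, not values taken from the DataFrame above:

```python
# Hedged sketch of TfidfVectorizer's default (pre-normalization) scoring.
import math

n_docs = 2
df = 1                                       # term appears in 1 of 2 documents
idf = math.log((1 + n_docs) / (1 + df)) + 1  # smoothed idf
tfidf = 1 * idf                              # raw count of 1 in the document
print(round(tfidf, 4))                       # ≈ 1.4055 before L2 normalization
```

A term appearing in every document gets idf = ln(1) + 1 = 1, so rare terms are weighted up relative to common ones.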
In this post I split out the parts of text preprocessing that I personally found confusing and applied each of them to an example. I hope it serves as a useful reference for your own studies!
That's all. Thank you. :)