[TIL] 28. Decision Tree

Goals

Understand scikit-learn pipelines
Understand scikit-learn decision trees
Use a decision tree's feature importances
Understand the strengths of decision tree models and compare them with linear regression models

Quick tips for inspecting the data

train.head().T # transpose so the features are easier to scan

# Shows whether the target calls for regression or classification; for classification it also gives the majority-class baseline and the class distribution
train[target].value_counts(normalize=True)

# ProfileReport
!pip install pandas-profiling==2.8.0 --user # versions that are not up to date sometimes raise errors

from pandas_profiling import ProfileReport
profile = ProfileReport(train)

# Check for duplicated features (columns)
train.T.duplicated()

# Inspect the non-numeric variables -> reveals columns whose cardinality is too high for one-hot encoding
train.describe(exclude='number')
train.describe(exclude='number').T.sort_values(by='unique')

def engineer(df):
    # Put whatever feature engineering you want here (a single function is easy to apply to X_train/X_val/X_test later)
    return df


# Two ways to select every feature except the target
features = train.drop(columns=[target]).columns # option 1
features = train.columns.difference([target], sort=False) # option 2
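
As a rough sketch of what such an engineer function might contain, the following drops duplicated columns and high-cardinality categorical columns, using the same checks shown above. The toy DataFrame, its column names, and the cardinality threshold of 3 are all made up for illustration.

```python
import pandas as pd

# Toy data (hypothetical): 'b' duplicates 'a', and 'id' has one unique value per row
train = pd.DataFrame({
    'a': [1, 2, 3, 4],
    'b': [1, 2, 3, 4],
    'id': ['u1', 'u2', 'u3', 'u4'],
    'city': ['x', 'y', 'x', 'y'],
    'target': [0, 1, 0, 1],
})

def engineer(df, max_cardinality=3):
    df = df.copy()
    # Drop columns whose values exactly duplicate an earlier column
    df = df.loc[:, ~df.T.duplicated()]
    # Drop non-numeric columns with too many unique values to one-hot encode
    nunique = df.select_dtypes(exclude='number').nunique()
    df = df.drop(columns=nunique[nunique > max_cardinality].index)
    return df

engineered = engineer(train)
print(list(engineered.columns))
```

Because the logic lives in one function, the same call can be applied unchanged to the validation and test sets.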

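Pipeline library

As the name suggests, a pipeline chains the encoders and other preprocessing steps of a machine learning workflow together with the model in one tidy object.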

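# OneHotEncoder here is the category_encoders version (it accepts cols= and use_cat_names=)
from category_encoders import OneHotEncoder
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

pipe = make_pipeline(
    OneHotEncoder(), 
    SimpleImputer(), 
    StandardScaler(), 
    LogisticRegression(n_jobs=-1)
)

pipe.fit(X_train, y_train)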

To access one of the steps inside the pipeline, pull it out by name.

pipe.named_steps # dictionary-like object holding each pipeline step by name

# Individual steps can be retrieved like this.
model_lr = pipe.named_steps['logisticregression']
enc = pipe.named_steps['onehotencoder']

{'onehotencoder': OneHotEncoder(cols=['opinion_h1n1_vacc_effective', 'opinion_h1n1_risk',
                                      'opinion_h1n1_sick_from_vacc', 'agegrp', 'census_msa']),
 'simpleimputer': SimpleImputer(),
 'standardscaler': StandardScaler(),
 'logisticregression': LogisticRegression(n_jobs=-1)}
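
To make concrete what a pipeline saves you from writing, here is a minimal sketch comparing make_pipeline with the equivalent manual steps. It uses only scikit-learn transformers and a tiny made-up numeric array, so it leaves out the one-hot encoding step used above.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Toy numeric data with one missing value (made up for illustration)
X_train = np.array([[1.0, 10.0], [2.0, np.nan], [3.0, 30.0], [4.0, 40.0]])
y_train = np.array([0, 0, 1, 1])

# Pipeline version: a single fit call runs every step in order
pipe = make_pipeline(SimpleImputer(), StandardScaler(), LogisticRegression())
pipe.fit(X_train, y_train)

# Manual version: fit_transform each transformer, then fit the model
imputer = SimpleImputer()
scaler = StandardScaler()
model = LogisticRegression()
X_imputed = imputer.fit_transform(X_train)
X_scaled = scaler.fit_transform(X_imputed)
model.fit(X_scaled, y_train)

# make_pipeline names each step after its lowercased class name
print(list(pipe.named_steps))

# Both routes give the same predictions
manual_pred = model.predict(scaler.transform(imputer.transform(X_train)))
print((pipe.predict(X_train) == manual_pred).all())
```

The manual version also has to repeat the transform calls at prediction time, which is where forgetting a step becomes easy; the pipeline handles that inside predict.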

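Decision Tree Model

A decision tree classifies samples by splitting on features, and the resulting structure resembles the branches of a tree.
Like a game of twenty questions, it asks questions about feature values and narrows down to the answer step by step.
Each question or final answer is called a 'node', and the lines connecting nodes are called 'edges'.

Nodes are divided into the root node, internal nodes, and leaf (external, terminal) nodes.

Decision trees apply to both classification and regression problems.
A decision tree works by repeatedly 'splitting' the data.
To classify a new sample, find the leaf node it falls into and assign it the most frequent class in that leaf.
Because the classification process is laid out as a tree, it is very intuitive to inspect.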
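
The prediction step described above can be sketched in plain Python. The tree below is written by hand rather than learned, and its features ('age', 'income'), thresholds, and leaf labels are hypothetical.

```python
from collections import Counter

# A hand-written tree (hypothetical features and thresholds).
# Internal nodes ask a question; leaves store the training labels that reached them.
tree = {
    'feature': 'age', 'threshold': 30,
    'left': {'leaf_labels': ['no', 'no', 'yes']},           # age <= 30
    'right': {                                              # age > 30
        'feature': 'income', 'threshold': 50,
        'left': {'leaf_labels': ['no', 'yes', 'yes']},      # income <= 50
        'right': {'leaf_labels': ['yes', 'yes', 'yes', 'no']},
    },
}

def predict(node, sample):
    # Follow edges from the root until a leaf is reached
    while 'leaf_labels' not in node:
        branch = 'left' if sample[node['feature']] <= node['threshold'] else 'right'
        node = node[branch]
    # Classify as the most frequent label in that leaf
    return Counter(node['leaf_labels']).most_common(1)[0][0]

young = predict(tree, {'age': 25, 'income': 80})  # lands in the age <= 30 leaf
older = predict(tree, {'age': 45, 'income': 40})  # lands in the income <= 50 leaf
print(young, older)
```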

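Decision Tree Model - learning algorithm

Training a decision tree comes down to deciding how to split each node; different splitting rules produce different trees.
The training algorithm defines a cost function for the tree and chooses the splits that minimize it.
Gini impurity (Gini index)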

$${\displaystyle I_{G}(p)=\sum_{i=1}^{J}p_{i}(1-p_{i})=1-\sum_{i=1}^{J}{p_{i}}^{2}}$$

Entropy

$${\displaystyle \mathrm {H}(T)=\operatorname I_{E}\left(p_{1},p_{2},...,p_{J}\right)=-\sum_{i=1}^{J}{p_{i}\log_{2}p_{i}}}$$

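Entropy is said to produce slightly more balanced trees, but Gini impurity is cheaper to compute (it needs no logarithm), so it is used more often.

Impurity measures how mixed the classes are within a node; it is the opposite of purity.
If classes A and B are mixed as (45%, 55%) in case 1 and (80%, 20%) in case 2, case 1 has the higher impurity.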
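
The two cases can be checked numerically with a short sketch that implements both formulas directly. The function names gini and entropy are just illustrative helpers, not a library API.

```python
from math import log2

def gini(probs):
    # 1 - sum of squared class proportions
    return 1 - sum(p ** 2 for p in probs)

def entropy(probs):
    # -sum p * log2(p); classes with p == 0 contribute nothing
    return -sum(p * log2(p) for p in probs if p > 0)

case1 = [0.45, 0.55]
case2 = [0.80, 0.20]

print(round(gini(case1), 3), round(entropy(case1), 3))
print(round(gini(case2), 3), round(entropy(case2), 3))
```

Both measures rank case 1 as more impure than case 2, and both reach their maximum for two classes at a 50/50 mix.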

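The feature and split point (threshold) chosen for a split are the ones that best separate the target, i.e. that reduce impurity the most (the largest 'information gain').
'Information gain' is the reduction in impurity obtained by splitting on a particular feature.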

$${\displaystyle IG(T,a)=\mathrm {H} {(T)}-\mathrm {H} {(T|a)}} = \text{impurity of the node before the split} - \text{weighted impurity of the child nodes after the split}$$
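
As an illustration of how a split point might be chosen, the sketch below computes the entropy-based information gain for every candidate threshold of a single numeric feature and keeps the best one. The data are made up, and real implementations such as scikit-learn's are considerably more optimized than this.

```python
from collections import Counter
from math import log2

def entropy(labels):
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def information_gain(values, labels, threshold):
    # Split the node into left (value <= threshold) and right (value > threshold)
    left = [y for x, y in zip(values, labels) if x <= threshold]
    right = [y for x, y in zip(values, labels) if x > threshold]
    if not left or not right:
        return 0.0  # nothing was separated
    n = len(labels)
    # Child impurities are weighted by the share of samples in each child
    children = len(left) / n * entropy(left) + len(right) / n * entropy(right)
    return entropy(labels) - children

# Toy node (hypothetical data): one numeric feature and a binary target
ages = [22, 25, 30, 35, 40, 50]
vaccinated = ['no', 'no', 'no', 'yes', 'yes', 'yes']

# Try the midpoints between consecutive sorted values and keep the best one
candidates = [(a + b) / 2 for a, b in zip(ages, ages[1:])]
best = max(candidates, key=lambda t: information_gain(ages, vaccinated, t))
best_gain = information_gain(ages, vaccinated, best)
print(best, best_gain)
```

Here the threshold between 30 and 35 separates the two classes completely, so it removes all of the parent node's entropy.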

Decision Tree Model - DecisionTreeClassifier()

# Example
from sklearn.tree import DecisionTreeClassifier # decision trees need no feature scaling: each split compares a single feature against a threshold, so the columns do not have to share a scale

pipe = make_pipeline(
    OneHotEncoder(use_cat_names=True),  
    SimpleImputer(), 
    DecisionTreeClassifier(random_state=1, criterion='entropy')
)

An example of visualizing the fitted tree model with a library

# To install graphviz: conda install -c conda-forge python-graphviz
import graphviz
from sklearn.tree import export_graphviz

model_dt = pipe.named_steps['decisiontreeclassifier']
enc = pipe.named_steps['onehotencoder']
encoded_columns = enc.transform(X_val).columns

dot_data = export_graphviz(model_dt
                          , max_depth=3
                          , feature_names=encoded_columns
                          , class_names=['no', 'yes']
                          , filled=True
                          , proportion=True)


display(graphviz.Source(dot_data))

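Decision Tree Model - DecisionTreeClassifier() hyperparameters

Tree models overfit easily: left unconstrained, a tree keeps splitting until every leaf is pure, which can mean a leaf holding a single observation.
Hyperparameters let you set conditions that limit this overfitting.

The three most common ones

min_samples_split -> e.g. set to 100, a node is split only if it holds at least 100 samples
min_samples_leaf -> e.g. set to 100, every leaf node must hold at least 100 samples
max_depth -> limits the overall depth of the tree (the maximum number of levels of splits)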
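
A small sketch of the effect of these three hyperparameters, assuming synthetic noisy data and arbitrarily chosen values (max_depth=3, min_samples_split=100, min_samples_leaf=20). It only looks at training data, so it shows how the constraints limit tree growth, not how well either tree generalizes.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

# Synthetic noisy data (made up): the label depends on x0 only, with about 20% of labels flipped
rng = np.random.RandomState(0)
X = rng.uniform(size=(500, 2))
y = (X[:, 0] > 0.5).astype(int)
flip = rng.uniform(size=500) < 0.2
y = np.where(flip, 1 - y, y)

# Unconstrained tree: keeps splitting until every leaf is pure
full = DecisionTreeClassifier(random_state=1).fit(X, y)

# Constrained tree: the three hyperparameters described above
small = DecisionTreeClassifier(
    random_state=1,
    max_depth=3,
    min_samples_split=100,
    min_samples_leaf=20,
).fit(X, y)

print('full :', full.get_depth(), full.get_n_leaves(), full.score(X, y))
print('small:', small.get_depth(), small.get_n_leaves(), small.score(X, y))
```

The unconstrained tree memorizes the flipped labels and reaches perfect training accuracy, which is the overfitting the constraints are meant to prevent.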

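Decision Tree Model - DecisionTreeClassifier() feature importance

In linear models we looked at the coefficients to understand the relationship between the features and the target.
In decision trees, feature importances play that role.
They are computed from the impurity reduction each feature contributes, so they reflect how early and how often a feature is used for splitting.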


import pandas as pd
import matplotlib.pyplot as plt

model_dt = pipe.named_steps['decisiontreeclassifier'] # when the model is wrapped in a pipeline

importances = pd.Series(model_dt.feature_importances_, encoded_columns)
plt.figure(figsize=(10,30))
importances.sort_values().plot.barh(); # feature importances are normalized so that they sum to 1
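
For a version that runs without the pipeline and dataset above, here is a minimal sketch on synthetic data. The column names 'signal' and 'noise' are made up; only 'signal' determines the target.

```python
import numpy as np
import pandas as pd
from sklearn.tree import DecisionTreeClassifier

# Synthetic data (made up): only 'signal' determines the target, 'noise' is unrelated
rng = np.random.RandomState(0)
X = pd.DataFrame({
    'signal': rng.uniform(size=300),
    'noise': rng.uniform(size=300),
})
y = (X['signal'] > 0.5).astype(int)

model = DecisionTreeClassifier(random_state=1).fit(X, y)
importances = pd.Series(model.feature_importances_, index=X.columns)

print(importances.sort_values(ascending=False))
print('sum:', importances.sum())
```

The feature that actually drives the target receives nearly all of the importance, and the values sum to 1 as noted above.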

Additional notes

Unlike linear models, decision tree models are well suited to data with nonlinear, non-monotonic relationships and feature interactions.

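Feature interactions: cases where the effect of one feature on the target depends on the value of another feature. A linear model captures this only if interaction terms are added by hand, whereas a tree picks it up automatically, because each split is made inside the region defined by the splits above it.

A related but distinct issue is multicollinearity, where features are strongly correlated with one another. In regression analysis, multicollinearity makes individual coefficients hard to interpret and can make the fit unstable. Trees are much less affected. A node splits on the feature that reduces impurity the most, so once a split has been made on one of two highly correlated variables, the other can do little to reduce impurity further (a similar split has already been made). Multicollinearity is therefore not a major problem for tree models.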
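
To illustrate what a tree can capture that a linear model without interaction terms cannot, here is a small sketch on XOR-style data, where the target depends on the combination of two features. The data are synthetic and the comparison uses training accuracy only.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier

# XOR-style data: y is 1 only when exactly one of the two features is 1,
# so the effect of x0 on y flips depending on x1 (a pure interaction)
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]] * 25)
y = X[:, 0] ^ X[:, 1]

linear = LogisticRegression().fit(X, y)
tree = DecisionTreeClassifier(random_state=1).fit(X, y)

print('logistic regression accuracy:', linear.score(X, y))
print('decision tree accuracy      :', tree.score(X, y))
```

No single straight line separates the XOR classes, so the linear model cannot classify all four corner points correctly, while the tree handles them with two levels of splits.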

Material for future reference

A DecisionTreeRegressor fits the data progressively better as max_depth is increased

import pandas as pd
from sklearn.linear_model import LinearRegression

columns = ['mobility', 'density']
data = [[80.574, -3.067]
,[84.248, -2.981]
,[87.264, -2.921]
,[87.195, -2.912]
,[89.076, -2.84]
,[89.608, -2.797]
,[89.868, -2.702]
,[90.101, -2.699]
,[92.405, -2.633]
,[95.854, -2.481]
,[100.696, -2.363]
,[101.06, -2.322]
,[401.672, -1.501]
,[390.724, -1.46]
,[567.534, -1.274]
,[635.316, -1.212]
,[733.054, -1.1]
,[759.087, -1.046]
,[894.206, -0.915]
,[990.785, -0.714]
,[1090.109, -0.566]
,[1080.914, -0.545]
,[1122.643, -0.4]
,[1178.351, -0.309]
,[1260.531, -0.109]
,[1273.514, -0.103]
,[1288.339, 0.01]
,[1327.543, 0.119]
,[1353.863, 0.377]
,[1414.509, 0.79]
,[1425.208, 0.963]
,[1421.384, 1.006]
,[1442.962, 1.115]
,[1464.35, 1.572]
,[1468.705, 1.841]
,[1447.894, 2.047]
,[1457.628, 2.2]]

thurber = pd.DataFrame(columns=columns, data=data)

# Visualize the data.
thurber.plot('mobility', 'density', kind='scatter', title='Thurber');

# Check performance with a linear regression model
X_thurber = thurber[['mobility']]
y_thurber = thurber['density']
linear = LinearRegression()
linear.fit(X_thurber, y_thurber)
print('R2: ', linear.score(X_thurber, y_thurber))
ax = thurber.plot('mobility', 'density', kind='scatter', title='Thurber')
ax.plot(X_thurber, linear.predict(X_thurber));

from ipywidgets import interact
from sklearn.tree import DecisionTreeRegressor, export_graphviz
import matplotlib.pyplot as plt
import graphviz

def show_tree(tree, colnames):
    dot = export_graphviz(tree, feature_names=colnames, filled=True, rounded=True)   
    return graphviz.Source(dot)

def thurber_tree(max_depth=1):
    tree = DecisionTreeRegressor(max_depth=max_depth)
    tree.fit(X_thurber, y_thurber)
    print('R2: ', tree.score(X_thurber, y_thurber))
    ax = thurber.plot('mobility', 'density', kind='scatter', title='Thurber')
    ax.step(X_thurber, tree.predict(X_thurber), where='mid')
    plt.show()
    display(show_tree(tree, colnames=['mobility']))

interact(thurber_tree, max_depth=(1,6,1));
