[TIL] 31. Model Selection

Goals

Understand and apply cross-validation methods for model selection
Improve model performance by optimizing hyperparameters

Cross-Validation

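Hold-out validation: split the data into train/validate/test sets and train on the train set
- When the train set is small, carving out a separate val set can be costly.
- Even if you force a val set out of it, the estimate of predictive performance is likely to be unreliable.
K-fold cross-validation: split the data into k equal parts, use k-1 parts as the train set and 1 part as the val set, and train k times so that each part serves as the val set once
- Overcomes the drawbacks of the hold-out method above.
- Helps decide which model and which hyperparameters to use.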
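
The K-fold procedure described above can be sketched with scikit-learn's KFold. The toy array X and the choice of 5 folds are assumptions made only for illustration; they are not from the original notes.

```python
import numpy as np
from sklearn.model_selection import KFold

# Ten toy samples (hypothetical data), small enough to read the index splits.
X = np.arange(20).reshape(10, 2)

kf = KFold(n_splits=5)  # no shuffling, so folds are contiguous blocks
fold_sizes = []
for fold, (train_idx, val_idx) in enumerate(kf.split(X)):
    # Across the loop, every sample lands in the val fold exactly once.
    fold_sizes.append((len(train_idx), len(val_idx)))
    print(f"fold {fold}: train={train_idx}, val={val_idx}")
```

Each iteration trains on 8 samples and validates on the remaining 2, which is the k-1 versus 1 split from the description.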

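Caveat: standard (randomly split) cross-validation is not suitable for time series data, because the model can end up being trained on observations that come after the ones it is validated on.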
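
For time series, one option that scikit-learn provides is TimeSeriesSplit, which keeps every val fold strictly after its train fold. This is a minimal sketch on a hypothetical 12-step series; the original notes do not prescribe a specific alternative.

```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

# 12 observations in time order (hypothetical data).
X = np.arange(12).reshape(12, 1)

tscv = TimeSeriesSplit(n_splits=3)
splits = list(tscv.split(X))
for train_idx, val_idx in splits:
    # The train window always ends before the val window starts.
    print(f"train={train_idx}, val={val_idx}")
```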

Example: checking scores with cross-validation (from sklearn.model_selection import cross_val_score)

from sklearn.model_selection import cross_val_score

# Run 3-fold cross-validation.
# Assumes pipe, X_train, and y_train are already defined.
k = 3
scores = cross_val_score(pipe, X_train, y_train, cv=k,
                         scoring='f1')
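
Since pipe, X_train, and y_train are not defined in the snippet above, here is a self-contained sketch of the same call. The synthetic dataset, the names X_demo, y_demo, and demo_pipe, and the logistic regression pipeline are all assumptions for illustration, not the setup used in the original notes.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic binary classification data (hypothetical stand-in for X_train, y_train).
X_demo, y_demo = make_classification(n_samples=300, n_features=8, random_state=0)

# A simple stand-in pipeline; 'f1' scoring needs a binary target by default.
demo_pipe = make_pipeline(StandardScaler(), LogisticRegression())

scores = cross_val_score(demo_pipe, X_demo, y_demo, cv=3, scoring='f1')
print(scores, scores.mean())  # one F1 score per fold, then their average
```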

Hyperparameter tuning

Two key issues when building a machine learning model are optimization and generalization
- Optimization: adjusting the model to get better performance on the training data
- Generalization: how well the trained model performs on data it has never seen

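Underfitting: the regime where, as model complexity increases, both train and val loss are still decreasing (the model can still learn more)
Overfitting: the regime where train loss keeps decreasing but val loss starts to increase
The ideal model sits between underfitting and overfitting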
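
To make the two regimes concrete, the sketch below fits decision trees of different depths to a noisy sine curve and prints train and val MAE for each. The data, the depths, and the variable names are assumptions chosen for illustration.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Noisy sine data (hypothetical).
rng = np.random.RandomState(0)
X = rng.uniform(0, 6, size=(300, 1))
y = np.sin(X[:, 0]) + rng.normal(scale=0.3, size=300)
X_tr, X_val, y_tr, y_val = train_test_split(X, y, random_state=0)

errors = {}
for depth in (1, 4, None):  # None lets the tree grow until it memorizes the train set
    tree = DecisionTreeRegressor(max_depth=depth, random_state=0).fit(X_tr, y_tr)
    tr_mae = mean_absolute_error(y_tr, tree.predict(X_tr))
    val_mae = mean_absolute_error(y_val, tree.predict(X_val))
    errors[depth] = (tr_mae, val_mae)
    print(f"max_depth={depth}: train MAE={tr_mae:.3f}, val MAE={val_mae:.3f}")
```

A very shallow tree has high error on both sets (underfitting), while the unrestricted tree drives train error to zero, so any remaining val error is pure generalization gap.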

Plotting a validation curve

Validation curve: a graph for the train/val sets with y-axis: score (MAE in this example) and x-axis: a hyperparameter (max_depth in this example).
A learning curve, by contrast, has x-axis: number of training samples. Don't confuse the two!
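
For contrast, a learning curve can be computed with sklearn.model_selection.learning_curve, where the varied quantity is the number of training samples instead of a hyperparameter. This is a sketch on synthetic data; the dataset, the fixed max_depth=5, and the names below are assumptions.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.model_selection import learning_curve
from sklearn.tree import DecisionTreeRegressor

# Synthetic regression data (hypothetical).
X_demo, y_demo = make_regression(n_samples=300, n_features=5, noise=10.0, random_state=0)

sizes, train_scores, val_scores = learning_curve(
    DecisionTreeRegressor(max_depth=5, random_state=0),
    X_demo, y_demo,
    train_sizes=np.linspace(0.2, 1.0, 5),  # fractions of the available train fold
    cv=3,
    scoring='neg_mean_absolute_error',
)

# Scores are negated MAE, so flip the sign and average over the folds.
train_mae = -train_scores.mean(axis=1)
val_mae = -val_scores.mean(axis=1)
for n, tr, va in zip(sizes, train_mae, val_mae):
    print(f"n_train={n}: train MAE={tr:.2f}, val MAE={va:.2f}")
```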

Example (from sklearn.model_selection import validation_curve). We will look at max_depth only.

In practice, drawing a validation curve for just one hyperparameter is not all that useful.

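import matplotlib.pyplot as plt
import numpy as np
from category_encoders import OrdinalEncoder
from sklearn.impute import SimpleImputer
from sklearn.model_selection import validation_curve
from sklearn.pipeline import make_pipeline
from sklearn.tree import DecisionTreeRegressor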

pipe = make_pipeline(
    OrdinalEncoder(), 
    SimpleImputer(), 
    DecisionTreeRegressor()
)

depth = range(1, 30, 2)
ts, vs = validation_curve(
    pipe, X_train, y_train
    , param_name='decisiontreeregressor__max_depth'
    , param_range=depth, scoring='neg_mean_absolute_error'
    , cv=3
    , n_jobs=-1
)

train_scores_mean = np.mean(-ts, axis=1)
validation_scores_mean = np.mean(-vs, axis=1)

fig, ax = plt.subplots()

# validation curve for the train set
ax.plot(depth, train_scores_mean, label='training error')

# validation curve for the val set
ax.plot(depth, validation_scores_mean, label='validation error')

# ideal max_depth
ax.vlines(5, 0, train_scores_mean.max(), color='blue')

# plot settings
ax.set(title='Validation Curve'
      , xlabel='Model Complexity(max_depth)', ylabel='MAE')
ax.legend()
fig.dpi = 100

Around max_depth=5 looks like the point where we avoid overfitting while still keeping generalization performance.

Randomized Search CV

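A method that sets a range for each hyperparameter, randomly samples combinations from those ranges, trains a model on each combination, and picks the best one
Hyperparameter: a parameter that is not learned during training, i.e., one that a person has to set by hand
There are good tools for searching over hyperparameter combinations.
- GridSearchCV: specify the values to try for each hyperparameter and evaluate every combination
- RandomizedSearchCV: specify a range (or distribution) of values for each hyperparameter and evaluate combinations sampled at random from it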
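
The difference in how the two tools generate candidates can be seen with ParameterGrid and ParameterSampler, the scikit-learn helpers that enumerate a grid and draw random samples respectively. The parameter names and ranges below are assumptions picked for illustration.

```python
from scipy.stats import randint
from sklearn.model_selection import ParameterGrid, ParameterSampler

# Grid search: every combination of the listed values (3 x 2 = 6 candidates).
grid = ParameterGrid({'max_depth': [5, 10, 15], 'min_samples_leaf': [1, 5]})
for params in grid:
    print('grid  ', params)

# Randomized search: a fixed number of draws from the given distributions.
sampled = list(ParameterSampler(
    {'max_depth': randint(2, 30), 'min_samples_leaf': randint(1, 20)},
    n_iter=4,
    random_state=0,
))
for params in sampled:
    print('random', params)
```

The grid grows multiplicatively with each added hyperparameter, while the randomized search stays at n_iter candidates no matter how many hyperparameters are included.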

Example with a random forest

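from category_encoders import TargetEncoder
from scipy.stats import randint, uniform
from sklearn.ensemble import RandomForestRegressor
from sklearn.impute import SimpleImputer
from sklearn.model_selection import RandomizedSearchCV
from sklearn.pipeline import make_pipeline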

pipe = make_pipeline(
    TargetEncoder(), 
    SimpleImputer(), 
    RandomForestRegressor(random_state=2)
)

dists = {
    'targetencoder__smoothing': [2.,20.,50.,60.,100.,500.,1000.], # passing ints raises an error (bug)
    'targetencoder__min_samples_leaf': randint(1, 10),
    'simpleimputer__strategy': ['mean', 'median'],
    'randomforestregressor__n_estimators': randint(50, 500),
    'randomforestregressor__max_depth': [5, 10, 15, 20, None],
    'randomforestregressor__max_features': uniform(0, 1) # fraction of features considered at each split
}

clf = RandomizedSearchCV(
    pipe, 
    param_distributions=dists, 
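    n_iter=50, # sample 50 combinations from the distributions above and train on each
    cv=3, # 50 x 3 = 150 fits during the search (plus one final refit on the full train set)
    scoring='neg_mean_absolute_error',  # negated MAE: scikit-learn treats higher scores as better, but for MAE, MSE, RMSE etc. lower is better, so the sign is flipped to match that convention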
    verbose=1,
    n_jobs=-1
)

clf.fit(X_train, y_train);

Best hyperparameters and best score

clf.best_params_ # best hyperparameters
-clf.best_score_ # best score (scoring was set to 'neg_mean_absolute_error', so negate it to get the MAE)
# Unlike other scores, MAE, MSE, RMSE etc. are worse when higher, so scikit-learn reports them negated.

CV results for each hyperparameter combination

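import pandas as pd

pd.DataFrame(clf.cv_results_).sort_values(by='rank_test_score').T
# rank_test_score: rank of each combination by its mean score on the val folds
# mean_score_time: mean time spent scoring (predicting) on the val folds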

Since there are 50 hyperparameter combinations, you can see that the transposed table has 50 columns.

Assign the best-performing model to pipe

pipe = clf.best_estimator_

best_estimator_ and the refit parameter

best_estimator_
Estimator that was chosen by the search, i.e. estimator which gave highest score (or smallest loss if specified) on the left out data. Not available if refit=False. ... See refit parameter for more information ...
refit parameter
refit : boolean, string, or callable, default=True
Refit an estimator using the best found parameters on the whole dataset.

In other words, best_estimator_ is the model that has been retrained (refit) on the entire train set using the best parameters found once CV has finished.
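
A small self-contained sketch of this behavior, using a synthetic dataset and a single decision tree in place of the pipeline above (both are assumptions for illustration). With the default refit=True, the search object exposes best_estimator_ and can predict directly.

```python
from scipy.stats import randint
from sklearn.datasets import make_regression
from sklearn.model_selection import RandomizedSearchCV
from sklearn.tree import DecisionTreeRegressor

# Synthetic regression data (hypothetical).
X_demo, y_demo = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=0)

search = RandomizedSearchCV(
    DecisionTreeRegressor(random_state=0),
    param_distributions={'max_depth': randint(1, 10)},
    n_iter=5,
    cv=3,
    scoring='neg_mean_absolute_error',
    random_state=0,  # refit=True is the default
)
search.fit(X_demo, y_demo)

best_mae = -search.best_score_       # flip the sign back to a plain MAE
preds = search.predict(X_demo[:3])   # delegates to the refit best_estimator_
print(search.best_params_, best_mae)
```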

If you used hold-out validation instead, you have to refit the final model yourself on the train + val set with the tuned hyperparameters.
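
The manual refit for the hold-out case might look like the following sketch. The data, the candidate depths, and the split ratios are assumptions; the point is only the order of steps: tune on val, then retrain on train + val, then evaluate once on test.

```python
from sklearn.datasets import make_regression
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Synthetic regression data (hypothetical).
X, y = make_regression(n_samples=300, n_features=5, noise=10.0, random_state=0)

# Hold out a test set first, then split the rest into train and val.
X_trainval, X_test, y_trainval, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0)
X_tr, X_val, y_tr, y_val = train_test_split(
    X_trainval, y_trainval, test_size=0.25, random_state=0)

# Tune max_depth using the val set only.
val_mae = {}
for depth in (2, 4, 6, 8):
    model = DecisionTreeRegressor(max_depth=depth, random_state=0).fit(X_tr, y_tr)
    val_mae[depth] = mean_absolute_error(y_val, model.predict(X_val))
best_depth = min(val_mae, key=val_mae.get)

# Refit on train + val with the chosen hyperparameter, then score on test.
final_model = DecisionTreeRegressor(max_depth=best_depth, random_state=0)
final_model.fit(X_trainval, y_trainval)
test_mae = mean_absolute_error(y_test, final_model.predict(X_test))
print(best_depth, test_mae)
```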

Recommended hyperparameters to tune for linear models and random forests

Random Forest

class_weight (for imbalanced classes)
max_depth (overfits if too deep)
n_estimators (too few can underfit, too many means long training time)
min_samples_leaf (increase if overfitting)
max_features (lower values give more diverse trees)

Logistic Regression

C (Inverse of regularization strength)
class_weight (for imbalanced classes)
penalty

Ridge / Lasso Regression

alpha
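
As one illustration of tuning a single hyperparameter from these lists, the sketch below searches Ridge's alpha over a log-spaced grid with GridSearchCV. The synthetic data and the grid of 7 values are assumptions, not recommendations from the original notes.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV

# Synthetic regression data (hypothetical).
X_demo, y_demo = make_regression(n_samples=200, n_features=10, noise=15.0, random_state=0)

# Regularization strength usually matters on a log scale: 0.001, 0.01, ..., 1000.
alphas = np.logspace(-3, 3, 7)

search = GridSearchCV(Ridge(), {'alpha': alphas}, cv=3,
                      scoring='neg_mean_absolute_error')
search.fit(X_demo, y_demo)
print(search.best_params_, -search.best_score_)
```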
