Jayden`s

    [TIL]30.Evaluation Metrics for Classification(Precision, Recall, f1score, threshold, ROC curve, AUC)

    Goals: understand and interpret the confusion matrix; understand and use precision and recall; understand the ROC curve and AUC score. Feature engineering tip: def engineer(df): """Function that engineers features.""" # Create new features. behaviorals = [col for col in df.columns if 'behavioral' in col] df['behaviorals'] = df[behaviorals].sum(axis=1) # New variable: the sum over the columns whose names contain 'behavioral' # The model will be trained on seasonal flu (seas), so remove the h1n1 features. dels = [col for ..
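
    The metrics listed in this post's title can be sketched in plain Python. This is only a minimal illustration: the labels and scores below are made up (not results from the post), and the helper names are hypothetical. It shows how moving the threshold changes precision and recall, while AUC is computed from the scores alone.

```python
# Pure-Python sketch of confusion-matrix metrics, thresholding, and AUC.
# y_true / y_score are made-up illustration data, not results from the post.

def confusion_counts(y_true, y_pred):
    """Return (TP, FP, FN, TN) for binary labels in {0, 1}."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    return tp, fp, fn, tn

def precision_recall_f1(y_true, y_pred):
    tp, fp, fn, _ = confusion_counts(y_true, y_pred)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1

def predict_at(y_score, threshold):
    """Turn predicted probabilities into 0/1 labels at a given threshold."""
    return [1 if s >= threshold else 0 for s in y_score]

def roc_auc(y_true, y_score):
    """AUC as the chance that a random positive outscores a random negative
    (ties count as half a win)."""
    pos = [s for t, s in zip(y_true, y_score) if t == 1]
    neg = [s for t, s in zip(y_true, y_score) if t == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_score = [0.9, 0.4, 0.65, 0.3, 0.55, 0.1, 0.8, 0.2]

for threshold in (0.5, 0.6):
    p, r, f1 = precision_recall_f1(y_true, predict_at(y_score, threshold))
    print(f"threshold={threshold}: precision={p:.2f} recall={r:.2f} f1={f1:.2f}")
print(f"AUC={roc_auc(y_true, y_score):.3f}")
```

    With this toy data, raising the threshold from 0.5 to 0.6 removes the one false positive, so precision goes up while recall stays the same. In practice, scikit-learn's `sklearn.metrics` functions would normally be used instead of hand-written helpers.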

    category_encoders (TargetEncoder, CatBoostEncoder), plus the Ordinal and OneHot encoders

    1. Use at least two category_encoders that were not covered in class, share the results, and discuss the following question with each other: what are the pros and cons of each encoder you used, and in what situations is each a good choice? Of the many encoders, TargetEncoder and CatBoostEncoder interested me the most, so I tried them. OrdinalEncoder (baseline): fitting cell time 4.27 s; accuracy and f1 score; feature importance. TargetEncoder: fitting cell time 5.29 s; accuracy and f1 score; feature importance. CatBoostEncoder: fitting cell time 14.1 s; accuracy and f1 score; feature importance. As for why I chose these two encoders: I don't know the details, but Ca..

    [TIL]29.RandomForest

    Goals: understand RandomForest; tell ordinal encoding and one-hot encoding apart and use each appropriately; understand how the encoding of categorical variables affects tree models and linear regression models. Before starting: whether the data is linear or nonlinear, trying a random forest first is recommended when approaching a classification problem. Decision tree model: because only a single tree is used, an error made at one node keeps affecting the nodes below it. The tree also tends to overfit depending on its depth. A random forest addresses these issues. Tip: %%time # shows how long the cell takes to run. RandomForest: a random forest is an ensemble method. An ensemble method takes one kind of data and, with several machine..

    ์‹œ๊ฐํ™”๋ฅผ ํ†ตํ•ด ํ™•์ธํ•œ Imputer์˜ ์ฐจ์ด

    2๊ฐœ ์ด์ƒ์˜ imputer๋ฅผ ์‚ฌ์šฉํ•ด ๊ฐ๊ฐ ํŠน์„ฑ-ํƒ€๊ฒŸ ๊ด€๊ณ„ ๊ทธ๋ž˜ํ”„๋ฅผ ๊ทธ๋ ค ๊ณต์œ ํ•˜๊ณ  ๋‹ค์Œ ์งˆ๋ฌธ์— ๋Œ€ํ•ด ์„œ๋กœ ๋…ผ์˜ํ•ด ๋ณด์„ธ์š”. ๋จผ์ € ํŠน์„ฑ์ค‘์š”๋„์—์„œ ๊ฐ€์žฅ ์ค‘์š”๋„๊ฐ€ ๋†’๊ฒŒ ๋‚˜์˜จ 'doctor_recc_h1n1' ํŠน์„ฑ์— ๋Œ€ํ•ด์„œ๋งŒ Imputer ๋ณ€๊ฒฝ์— ๋”ฐ๋ฅธ ํƒ€๊ฒŸ์™€์˜ ๊ด€๊ณ„ ๊ทธ๋ž˜ํ”„๋ฅผ ๊ทธ๋ ค๋ณด์•˜์Šต๋‹ˆ๋‹ค. seaborn plots ์‚ฌ์šฉํ•˜์—ฌ ๊ด€์‹ฌ์žˆ๋Š” ํŠน์„ฑ๋“ค๊ณผ target๊ฐ„์˜ ๊ด€๊ณ„๋ฅผ ๊ทธ๋ž˜ํ”„๋กœ ๋‚˜ํƒ€๋‚ด ๋ณด์„ธ์š”. Imputer๋ฅผ ์ ์šฉํ•˜์ง€ ์•Š์•˜์„ ๋•Œ ๋ณ€์ˆ˜์™€ ํƒ€๊ฒŸ ๋ชจ๋‘ binaryํ•œ ๊ฐ’์œผ๋กœ 0๊ณผ 1์— ๋Œ€ํ•œ ๊ฐ’๋“ค๋งŒ ์ฐํžˆ๋Š” ๊ฒƒ์ด ๋ณด์ž…๋‹ˆ๋‹ค. SimpleImputer(strategy='mean') ์ ์šฉ ์‹œ imputer1 = SimpleImputer(strategy='mean') train_imp1 = im..

    [TIL]28.Decision Tree

    ๋ชฉํ‘œ ์‚ฌ์ดํ‚ท๋Ÿฟ ํŒŒ์ดํ”„๋ผ์ธ(pipeline) ์ดํ•ด ์‚ฌ์ดํ‚ท๋Ÿฐ ๊ฒฐ์ •ํŠธ๋ฆฌ(decision tree) ์ดํ•ด ๊ฒฐ์ •ํŠธ๋ฆฌ์˜ ํŠน์„ฑ ์ค‘์š”๋„(feature importance)๋ฅผ ํ™œ์šฉ ๊ฒฐ์ •ํŠธ๋ฆฌ ๋ชจ๋ธ์˜ ์žฅ์ ์„ ์ดํ•ดํ•˜๊ณ  ์„ ํ˜•ํšŒ๊ท€๋ชจ๋ธ๊ณผ ๋น„๊ต ๊ฐ„๋‹จํžˆ ๋ฐ์ดํ„ฐ ํ™•์ธ ํŒ train.head().T # Transpose ์‚ฌ์šฉํ•˜์—ฌ feature์— ๋Œ€ํ•ด ๋ณด๊ธฐ ํŽธํ•˜๊ฒŒ # target์— ๋Œ€ํ•ด ํšŒ๊ท€์™€ ๋ถ„๋ฅ˜ ์ค‘ ์–ด๋Š ๊ฒƒ์œผ๋กœ ํ• ์ง€ ํŒ๋‹จ ๊ฐ€๋Šฅ/ ๋ถ„๋ฅ˜๋ฌธ์ œ๋ผ๋ฉด ์ตœ๋นˆ๊ธฐ์ค€๋ชจ๋ธ ์ •ํ•˜๋Š” ๊ธฐ์ค€, ๋ฐ์ดํ„ฐ ๋ถ„ํฌ ํ™•์ธ train[target].value_counts(normalize=True) # ProfileReport !pip install pandas-profiling==2.8.0 --user # ์ตœ์‹ ์ด ์•„๋‹ˆ๋ฉด ์—๋Ÿฌ๊ฐ€ ๋‚˜๋Š” ๊ฒฝ์šฐ๊ฐ€ ์žˆ์Œ from pandas_pro..