💿 Data/Miscellany

    Data Wrangling

    Meaning of data wrangling: every process that converts raw data into a form that is easier to use. (It is also called data cleaning, data remediation, or data munging.) Stages: Discovery: the stage of getting familiar with the data; as in EDA, you examine the data's characteristics and summary statistics to set a direction. Structuring: raw data is usually hard to use as-is, so this stage combines the individual raw datasets appropriately to obtain the desired DataFrame. The feature creation and cleanup with merge, groupby, and so on that we learned today belong to this stage. Cleaning: removing errors that affect the data analysis..
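
The Structuring stage described above (combining raw tables with merge, then deriving features with groupby) could look roughly like the following minimal sketch. The `orders` and `customers` tables and all of their columns are hypothetical, made up only for illustration; they do not come from the original post.

```python
import pandas as pd

# Hypothetical raw tables; names and values are invented for illustration.
orders = pd.DataFrame({
    "order_id": [1, 2, 3, 4],
    "customer_id": [10, 10, 20, 30],
    "amount": [100.0, 50.0, 70.0, 30.0],
})
customers = pd.DataFrame({
    "customer_id": [10, 20, 30],
    "region": ["east", "west", "east"],
})

# Structuring: join the raw tables on their shared key.
merged = orders.merge(customers, on="customer_id", how="left")

# Feature creation: aggregate per-customer statistics with groupby.
features = (
    merged.groupby("customer_id", as_index=False)
    .agg(order_count=("order_id", "count"), total_amount=("amount", "sum"))
)
print(features)
```

A left join keeps every order even when a customer record is missing, which makes gaps visible as NaN instead of silently dropping rows.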

    A brief summary of data job roles (based on the Big Data Career Guidebook)

    Discuss which competencies you think matter most for data analysts, data engineers, and data scientists in practice, what difficulties they are likely to face, and what skills they will need. Also share with each other what you hope to gain from the Section 2 project and any resolutions you have made. Data job classification (written with the questions above in mind): I still need to look up the details, but this should give at least a rough sense of the competencies each role requires and the difficulties it involves. Data Analyst: derives insights about the company's current state from data and communicates them efficiently to management. Here, an insight means 'finding a problem that needs improvement and coming up with an idea to solve it.' Mainly simple..

    HyperParameter tuning

    Perform hyperparameter tuning with GridSearchCV. Try everything you can to raise model performance. For the feature engineering or hyperparameter tuning that improved performance the most, explain why it had such a large effect, then share and discuss your results with each other. Using Ordinal Encoder. 1-1. RandomizedSearchCV: run before GridSearchCV to find a reasonable range. cv = 5 was chosen using cross_val_score. Hyperparameter adjustment: the results above give a rough idea of the values to pass to GridSearchCV. 1-2. GridSearchCV: varying slightly around the values above, the optimal pa..
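
The coarse-to-fine workflow outlined above (RandomizedSearchCV to find a reasonable range, then GridSearchCV around it) might be sketched as follows. The excerpt does not name the model, dataset, or parameters, so the `RandomForestClassifier`, the synthetic data, and the choice of `max_depth` and `min_samples_leaf` are all assumptions for illustration.

```python
from scipy.stats import randint
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV

# Synthetic data stands in for the (unknown) dataset used in the post.
X, y = make_classification(n_samples=300, n_features=8, random_state=0)

# Step 1: coarse random search over a wide range to find a promising region.
coarse = RandomizedSearchCV(
    RandomForestClassifier(n_estimators=30, random_state=0),
    param_distributions={
        "max_depth": randint(2, 15),
        "min_samples_leaf": randint(1, 10),
    },
    n_iter=8, cv=5, scoring="f1", random_state=0,
)
coarse.fit(X, y)
best_depth = int(coarse.best_params_["max_depth"])
best_leaf = int(coarse.best_params_["min_samples_leaf"])

# Step 2: exhaustive grid search in a small neighborhood of the coarse optimum.
fine = GridSearchCV(
    RandomForestClassifier(n_estimators=30, random_state=0),
    param_grid={
        "max_depth": sorted({max(1, best_depth - 1), best_depth, best_depth + 1}),
        "min_samples_leaf": sorted({max(1, best_leaf - 1), best_leaf, best_leaf + 1}),
    },
    cv=5, scoring="f1",
)
fine.fit(X, y)
print("coarse:", coarse.best_params_, round(coarse.best_score_, 3))
print("fine:  ", fine.best_params_, round(fine.best_score_, 3))
```

Because the fine grid contains the coarse optimum and both searches use the same folds and seed, the fine search can only match or improve on the coarse cross-validation score.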

    Evaluation metrics for Classification

    Draw a confusion matrix, classification report, and so on, analyze each evaluation metric as thoroughly as you can, and discuss what is lacking and in which direction performance should be improved. Evaluation metrics for classification problems: accuracy, f1_score, precision, recall (sensitivity), ROC curve and AUC score. accuracy; f1_score; precision and recall (classification_report): train set, val set; confusion matrix: train set, val set; ROC curve and AUC: train set, val set; train set vs val set. It may be an obvious result, but several metrics..
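
The metrics listed above can be computed side by side for the train and validation sets, for example as in this minimal sketch. The synthetic data, the `LogisticRegression` model, and the helper `evaluate` are assumptions for illustration; the post's actual model and data are not shown in the excerpt.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import (accuracy_score, classification_report,
                             confusion_matrix, f1_score, roc_auc_score)
from sklearn.model_selection import train_test_split

# Synthetic binary classification data (placeholder for the real dataset).
X, y = make_classification(n_samples=400, n_features=6, random_state=0)
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0)

model = LogisticRegression(max_iter=1000).fit(X_train, y_train)

def evaluate(name, X_part, y_part):
    """Print and return the main classification metrics for one data split."""
    pred = model.predict(X_part)
    proba = model.predict_proba(X_part)[:, 1]  # positive-class probability
    scores = {
        "accuracy": accuracy_score(y_part, pred),
        "f1": f1_score(y_part, pred),
        "auc": roc_auc_score(y_part, proba),  # AUC needs scores, not labels
    }
    print(f"== {name} ==", scores)
    print(classification_report(y_part, pred))  # precision and recall per class
    print(confusion_matrix(y_part, pred))       # rows: true, columns: predicted
    return scores

train_scores = evaluate("train set", X_train, y_train)
val_scores = evaluate("val set", X_val, y_val)
```

A large gap between the train and validation scores suggests overfitting, while low scores on both suggest the model or features need more work.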

    category_encoders (TargetEncoder, CatBoostEncoder), plus Ordinal and OneHot encoders

    1. Use at least two kinds of category_encoders that were not covered in class, share the results, and discuss the following question with each other. What are the strengths and weaknesses of each encoder you used, and in what situations is it a good choice? Of the many encoders, TargetEncoder and CatBoostEncoder interested me most, so I applied them. OrdinalEncoder (baseline): fitting cell time: 4.27 s; accuracy and f1 score; feature importance. TargetEncoder: fitting cell time: 5.29 s; accuracy and f1 score; feature importance. CatBoostEncoder: fitting cell time: 14.1 s; accuracy and f1 score; feature importance. As for why I chose these two encoders, I don't know the details, but Ca..
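
To make the difference between the two encoders more concrete, here is a hand-rolled pandas sketch of the ideas behind them. It is not the category_encoders implementation: the toy `city` and `target` columns are hypothetical, the plain mean encoding omits the smoothing a library encoder would typically apply, and the ordered variant only follows the general CatBoost-style scheme of using earlier rows, with an assumed smoothing weight `a`.

```python
import pandas as pd

# Hypothetical toy data, invented for illustration.
df = pd.DataFrame({
    "city": ["A", "A", "B", "B", "B", "C"],
    "target": [1, 0, 1, 1, 0, 1],
})
prior = df["target"].mean()  # global target mean, used as the fallback value

# Plain target (mean) encoding: every row of a category receives the mean
# target over ALL rows of that category, including the row itself.
df["city_target_enc"] = df.groupby("city")["target"].transform("mean")

# Ordered (CatBoost-style) encoding: each row only sees the targets of the
# rows that came BEFORE it in the same category, smoothed toward the prior.
a = 1.0  # assumed smoothing weight
prev_sum = df.groupby("city")["target"].cumsum() - df["target"]
prev_count = df.groupby("city").cumcount()
df["city_ordered_enc"] = (prev_sum + prior * a) / (prev_count + a)

print(df)
```

In this toy data the single-row category `C` gets its own label back under plain mean encoding, which is a form of target leakage, whereas the ordered variant falls back to the global prior. The extra bookkeeping per row is one plausible reason an ordered encoder takes longer to fit.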