Jayden1116
Jayden`s LifeTrip 🔆

Differences Between Imputers, Seen Through Visualization

2021. 12. 26. 14:54

Use two or more imputers, plot the feature-target relationship for each, share the plots, and discuss the following questions with each other.

First, I plotted the relationship with the target under each imputer only for 'doctor_recc_h1n1', the feature that ranked highest in feature importance.

Use seaborn plots to graph the relationship between the features you are interested in and the target.

  1. With no imputer applied

[Figure: 'doctor_recc_h1n1' vs. target, no imputer]

Both the variable and the target are binary, so points appear only at 0 and 1.

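  2. With SimpleImputer(strategy='mean')
import pandas as pd
from sklearn.impute import SimpleImputer
# fit_transform returns a two-column ndarray, so rebuild the DataFrame with the column names
imputer1 = SimpleImputer(strategy='mean')
train_imp1 = imputer1.fit_transform(train_chall)
train_imp1 = pd.DataFrame(train_imp1, columns=['doctor_recc_h1n1', 'vacc_h1n1_f'])

[Figure: 'doctor_recc_h1n1' vs. target, SimpleImputer(strategy='mean')]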

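  3. With SimpleImputer(strategy='median')
# Fill missing values with the median of each column
imputer2 = SimpleImputer(strategy='median')
train_imp2 = imputer2.fit_transform(train_chall)
train_imp2 = pd.DataFrame(train_imp2, columns=['doctor_recc_h1n1', 'vacc_h1n1_f'])

[Figure: 'doctor_recc_h1n1' vs. target, SimpleImputer(strategy='median')]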

  4. With SimpleImputer(strategy='most_frequent')

[Figure: 'doctor_recc_h1n1' vs. target, SimpleImputer(strategy='most_frequent')]
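The post shows no code for this step. Following the pattern of the other steps, it presumably looked like the sketch below. The name imputer3 / train_imp3 and the small train_chall frame are assumptions made for illustration, not the original data.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Hypothetical stand-in for the post's train_chall (feature + target, both binary).
train_chall = pd.DataFrame({
    "doctor_recc_h1n1": [0, 0, 1, np.nan, 0, np.nan, 1, 0],
    "vacc_h1n1_f":      [0, 0, 1, 1,      0, 0,      1, 0],
})

# Same pattern as the other steps, with the strategy swapped to 'most_frequent'.
imputer3 = SimpleImputer(strategy="most_frequent")
train_imp3 = imputer3.fit_transform(train_chall)
train_imp3 = pd.DataFrame(train_imp3, columns=["doctor_recc_h1n1", "vacc_h1n1_f"])

# The most common observed feature value here is 0, so both gaps are filled with 0.
print(sorted(train_imp3["doctor_recc_h1n1"].unique()))
```

Because the mode is always one of the observed values, the filled column stays within {0, 1}.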

  5. With SimpleImputer(strategy='constant', fill_value=2)
# Fill every missing value with the fixed value 2
imputer4 = SimpleImputer(strategy='constant', fill_value=2)
train_imp4 = imputer4.fit_transform(train_chall)
train_imp4 = pd.DataFrame(train_imp4, columns=['doctor_recc_h1n1', 'vacc_h1n1_f'])

[Figure: 'doctor_recc_h1n1' vs. target, SimpleImputer(strategy='constant', fill_value=2)]

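  6. With KNNImputer(n_neighbors=2)
from sklearn.impute import KNNImputer
# Fill each missing value with the average of its 2 nearest neighbors
imputer5 = KNNImputer(n_neighbors=2)
train_imp5 = imputer5.fit_transform(train_chall)
train_imp5 = pd.DataFrame(train_imp5, columns=['doctor_recc_h1n1', 'vacc_h1n1_f'])

[Figure: 'doctor_recc_h1n1' vs. target, KNNImputer(n_neighbors=2)]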

What are the strengths and weaknesses of each imputer you used, and in which situations is each one a good choice?

  • When the variable is continuous: SimpleImputer with 'mean' or 'constant', or KNNImputer
  • When the variable is categorical: ordinal: SimpleImputer with 'median' or 'most_frequent' / nominal: SimpleImputer with 'most_frequent' / either kind: SimpleImputer with 'constant'
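One way to apply this rule of thumb in practice is to give each kind of column its own imputer. The sketch below does that with ColumnTransformer. The 'age' column and all values are hypothetical and only illustrate the idea.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer

# Hypothetical toy frame: one continuous column and one binary (categorical) column.
df = pd.DataFrame({
    "age":              [23.0, 35.0, np.nan, 51.0, 44.0],  # continuous
    "doctor_recc_h1n1": [0.0, 1.0, 0.0, np.nan, 0.0],      # categorical (binary)
})

# Mean for the continuous column, most frequent value for the categorical one.
preprocess = ColumnTransformer([
    ("continuous", SimpleImputer(strategy="mean"), ["age"]),
    ("categorical", SimpleImputer(strategy="most_frequent"), ["doctor_recc_h1n1"]),
])

# The output columns follow the order of the transformers listed above.
filled = pd.DataFrame(preprocess.fit_transform(df), columns=["age", "doctor_recc_h1n1"])
print(filled)
```

The missing age is filled with the mean of the observed ages, while the missing binary value is filled with a value that already occurs in the column.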

If your feature-target plots differ, what are the differences, and why do you think they arise?

The plots above show that applying imputers such as 'mean', 'constant' (here with fill_value=2), or KNN to a categorical variable fills the missing values with numbers other than the two binary values. For a categorical variable like this one, I therefore think we should use methods such as 'median' or 'most_frequent', which do not produce numbers other than 0 and 1 (for a binary column the median only leaves {0, 1} if the observed values split exactly evenly). I think this is what the strengths and weaknesses of each imputer come down to.
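The same point can be checked without a plot by listing which distinct values the feature column contains after each imputer. This is a minimal sketch on a made-up train_chall; the data is an assumption chosen so that the effect is visible, not the original dataset.

```python
import numpy as np
import pandas as pd
from sklearn.impute import KNNImputer, SimpleImputer

# Hypothetical stand-in for train_chall: binary feature with gaps, binary target.
train_chall = pd.DataFrame({
    "doctor_recc_h1n1": [0, 0, 1, np.nan, 0, np.nan, 0, 0],
    "vacc_h1n1_f":      [0, 0, 1, 1,      0, 0,      1, 0],
})

imputers = {
    "mean": SimpleImputer(strategy="mean"),
    "median": SimpleImputer(strategy="median"),
    "most_frequent": SimpleImputer(strategy="most_frequent"),
    "constant(2)": SimpleImputer(strategy="constant", fill_value=2),
    "knn(2)": KNNImputer(n_neighbors=2),
}

# Collect the distinct values of the feature column after each imputer.
filled_values = {}
for name, imputer in imputers.items():
    filled = imputer.fit_transform(train_chall)
    filled_values[name] = sorted(set(np.round(filled[:, 0], 3)))
    print(f"{name:>14}: {filled_values[name]}")
```

On this toy data only 'median' and 'most_frequent' keep the column within {0, 1}; the other three introduce a third value.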

Additional notes

  1. Multivariate feature imputation

As the name suggests, instead of filling each variable (column) with its own mean, median, or mode as above, this approach fills values based on the whole dataset, taking all columns into account.
It is available as sklearn.impute.IterativeImputer. The class is still experimental, so sklearn.experimental.enable_iterative_imputer has to be imported first.

[Image: IterativeImputer example]
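Since the screenshot is not available here, the following is a minimal sketch of how IterativeImputer is typically called. The toy array is an assumption (the second column is roughly twice the first), and the default estimator is used.

```python
import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401 (required before the next import)
from sklearn.impute import IterativeImputer

# Hypothetical toy data: the second column is about twice the first.
X = np.array([
    [1.0, 2.0],
    [2.0, 4.0],
    [3.0, 6.0],
    [4.0, np.nan],
    [np.nan, 10.0],
])

# Each column with gaps is modeled as a function of the other columns,
# and the estimates are refined over several rounds.
imputer = IterativeImputer(max_iter=10, random_state=0)
X_filled = imputer.fit_transform(X)
print(np.round(X_filled, 2))
```

Unlike SimpleImputer, the value filled into one column depends on what the other column says for that row, so the two gaps receive different, row-specific estimates.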

  2. Marking imputed values

This returns True where a value is missing and False where it is not.
It is available as sklearn.impute.MissingIndicator.

[Image: MissingIndicator example]

In the image above, you can see that whether features='all' is set determines whether columns without missing values are left out of the output (the default, features='missing-only') or included.
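The effect of the features parameter can be reproduced with a small array. This is a sketch on made-up data, where the third column has no missing values.

```python
import numpy as np
from sklearn.impute import MissingIndicator

# Hypothetical toy data: columns 0 and 1 have a gap, column 2 is fully observed.
X = np.array([
    [1.0,    np.nan, 3.0],
    [4.0,    5.0,    6.0],
    [np.nan, 8.0,    9.0],
])

# Default features='missing-only': one mask column per feature that had gaps during fit.
mask_missing_only = MissingIndicator().fit_transform(X)

# features='all': one mask column for every feature, including fully observed ones.
mask_all = MissingIndicator(features="all").fit_transform(X)

print(mask_missing_only.shape)  # (3, 2)
print(mask_all.shape)           # (3, 3)
```

With the default setting the fully observed third column is dropped from the mask, while features='all' keeps it as a column of False values.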

All of the imputers above ultimately replace the missing values of the variable, so I don't understand why, only in the KNN case, the observations that received the value 0.5 in place of a missing value all have target = 1..!
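One possible explanation, assuming train_chall holds only the feature and the target (as the two column names in the code suggest): when the feature is missing, KNNImputer can measure distance only on the target column. Every row with the same target is then at distance 0, so all rows with a missing feature and the same target share the same 2 neighbors and receive the same filled value. If those two neighbors disagree for target = 1 and agree for target = 0, the value 0.5 appears only at target = 1. The sketch below reproduces this on made-up data; it is not a confirmed diagnosis of the original plot.

```python
import numpy as np
import pandas as pd
from sklearn.impute import KNNImputer
from sklearn.metrics.pairwise import nan_euclidean_distances

# Hypothetical stand-in for train_chall: the feature plus the binary target only.
train_chall = pd.DataFrame({
    "doctor_recc_h1n1": [0, 0, 0, 1, 0, np.nan, np.nan],
    "vacc_h1n1_f":      [0, 0, 0, 1, 1, 1,      0],
})
X = train_chall.to_numpy(dtype=float)

# Row 5 has no feature value, so its distance to the complete rows uses the target alone:
# rows with the same target are at distance 0, all others are farther away.
dist = nan_euclidean_distances(X[[5]], X[:5])
print(np.round(dist, 3))

imputer5 = KNNImputer(n_neighbors=2)
train_imp5 = pd.DataFrame(imputer5.fit_transform(train_chall), columns=train_chall.columns)

# Row 5 (target 1) averages the two target-1 neighbors (feature 1 and 0).
# Row 6 (target 0) averages two target-0 neighbors (feature 0 and 0).
print(train_imp5.iloc[5:])
```

If this is what happened, the pattern says more about including the target among the columns passed to the imputer than about KNN itself; with more feature columns the neighbors would no longer be determined by the target alone.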

I'm not certain whether comparing, through plots, how each imputer handles missing values is what the challenge actually intended..!

That's all. Thank you. :)
