New Features (Feature Engineering), Outliers, Scalers, and Improving Model Performance

Jayden1116 2021. 12. 23. 15:48

1) If we can create new features, what kind of feature engineering could we try?

  • BMI (body mass index) = weight / height^2 (height in [m], weight in [kg])
  • Possible metabolic syndrome: use the gap between systolic and diastolic blood pressure (the pulse pressure) as a reference
  • age / 365: convert the age column into years
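The three ideas above can be sketched as small functions. This is only an illustration under assumptions: height is taken to be in centimetres (the 140 to 200 range used for filtering later in this post suggests so), age is taken to be recorded in days, and the record below is made up rather than taken from the assignment data.

```python
def bmi(weight_kg, height_cm):
    """BMI = weight [kg] / height [m]^2; height is converted from cm (assumed unit)."""
    height_m = height_cm / 100
    return weight_kg / height_m ** 2


def pulse_pressure(ap_hi, ap_lo):
    """Gap between systolic (ap_hi) and diastolic (ap_lo) blood pressure."""
    return ap_hi - ap_lo


def age_in_years(age_days):
    """Convert an age recorded in days (assumed unit) to years."""
    return age_days / 365


# Made-up record for illustration only; not taken from the assignment data.
record = {"age": 18250, "height": 170, "weight": 65.0, "ap_hi": 120, "ap_lo": 80}

new_features = {
    "bmi": bmi(record["weight"], record["height"]),
    "pulse_pressure": pulse_pressure(record["ap_hi"], record["ap_lo"]),
    "age_years": age_in_years(record["age"]),
}
print(new_features)
```

With a pandas DataFrame the same arithmetic works column-wise, for example `df['bmi'] = df['weight'] / (df['height'] / 100) ** 2`.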

2) If some features contain outliers, what criteria can we use to remove them?

I think this part depends heavily on domain knowledge.

  • First, check whether outliers exist using a visualization such as a boxplot
  • Remove the top and bottom percentiles of values based on summary statistics
  • Or bring in domain knowledge to set a threshold for outliers and remove them

๊ณผ์ œ์˜ ์˜ˆ์‹œ์—์„œ ์ €๊ฐ™์€ ๊ฒฝ์šฐ๋Š” ๋‹จ์ˆœํžˆ ํ†ต๊ณ„์น˜๋กœ ํ•˜๊ฒŒ ๋˜๋‹ˆ ๋ชธ๋ฌด๊ฒŒ๊ฐ€ 100kg๋งŒ ๋„˜์–ด๊ฐ€๋„ ์ œ๊ฑฐ๊ฐ€ ๋˜์–ด๋ฒ„๋ ค์„œ ๋”ฐ๋กœ ๋„๋ฉ”์ธ ์ง€์‹์„ ์„œ์น˜ ํ˜น์€ ์ƒ์‹์„ (๊ต‰์žฅํžˆ ์ฃผ๊ด€์ ์ด์ง€๋งŒ)์— ์˜ํ•ด ์ด์ƒ์น˜๋ฅผ ์ œ๊ฑฐํ•˜์˜€์Šต๋‹ˆ๋‹ค.

# Quick check for outliers in the numeric columns (df is the assignment DataFrame)
print(df['height'].min(), df['height'].max())
print(df['weight'].min(), df['weight'].max())
# Keep only rows inside ranges chosen from the plots and common sense (domain knowledge)
df = df[(df['height'] >= 140) & (df['height'] <= 200)]
df = df[(df['weight'] >= 30) & (df['weight'] <= 150)]
df = df[(df['ap_hi'] >= 30) & (df['ap_hi'] <= 200)]
df = df[(df['ap_lo'] >= 30) & (df['ap_lo'] <= 200)]
# Diastolic pressure (ap_lo) must be lower than systolic pressure (ap_hi)
df = df[df['ap_lo'] < df['ap_hi']]

3) The feature scaling documentation lists several kinds of Scaler. In which situations is each Scaler a good fit? Which scaler could we apply to this data?

  1. StandardScaler

    ๊ฐ ์ฐจ์›(์ปฌ๋Ÿผ)์˜ ๋‹จ์œ„๋ฅผ ๋งž์ถฐ์ค๋‹ˆ๋‹ค ๊ทธ๋Ÿฌ๋‚˜ ์ด์ƒ์น˜๊ฐ€ ์กด์žฌํ•œ๋‹ค๋ฉด ํ‰๊ท ๊ณผ ํ‘œ์ค€ํŽธ์ฐจ์— ์˜ํ–ฅ์„ ๋ผ์ณ ๊ท ํ˜•์žกํžŒ ์ฒ™๋„๋ฅผ ๋ณด์žฅํ•  ์ˆ˜ ์—†์Šต๋‹ˆ๋‹ค.

  2. MinMaxScaler

    ๋ชจ๋“  ๋ณ€์ˆ˜์˜ ๊ฐ’์ด 0 ~ 1 ์‚ฌ์ด์— ์žˆ๋„๋ก ๋ฐ์ดํ„ฐ๋ฅผ ์กฐ์ •ํ•ฉ๋‹ˆ๋‹ค. ๋‹ค๋งŒ, ์ด์ƒ์น˜๊ฐ€ ์กด์žฌ ์‹œ ๋ณ€ํ™˜๋œ ๊ฐ’์ด ๋งค์šฐ ์ข์€ ๋ฒ”์œ„๋กœ ์••์ถœ๋  ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค. ์ฆ‰, ์ด์ƒ์น˜์— ๊ต‰์žฅํžˆ ๋ฏผ๊ฐํ•ฉ๋‹ˆ๋‹ค.

  3. MaxAbsScaler

    ์ ˆ๋Œ€๊ฐ’์ด 0 ~ 1 ์‚ฌ์ด์— ์žˆ๋„๋ก ์กฐ์ •ํ•ฉ๋‹ˆ๋‹ค. ์ฆ‰, -1 ~ 1 ์‚ฌ์ด๋กœ ์กฐ์ •ํ•˜๊ฒŒ ๋ฉ๋‹ˆ๋‹ค. ์–‘์ˆ˜ ๋ฐ์ดํ„ฐ๋กœ๋งŒ ๊ตฌ์„ฑ๋œ ๋ฐ์ดํ„ฐ์…‹์—์„  MinMaxScaler์™€ ์œ ์‚ฌํ•˜๊ฒŒ ๋™์ž‘ํ•ฉ๋‹ˆ๋‹ค.

  4. RobustScaler

    ์ด์ƒ์น˜์˜ ์˜ํ–ฅ์„ ์ตœ์†Œํ™”ํ•œ ๊ธฐ๋ฒ•์ž…๋‹ˆ๋‹ค. ์ค‘์•™๊ฐ’๊ณผ IQR์„ ์‚ฌ์šฉํ•˜๊ธฐ ๋•Œ๋ฌธ์— StandardScaler์™€ ๋น„๊ต ์‹œ ๋™์ผํ•œ ๊ฐ’์„ ๋” ๋„“๊ฒŒ ๋ถ„ํฌ์‹œํ‚ค๊ณ  ์žˆ์Œ์„ ํ™•์ธํ•  ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค.

Scaler reference site

For the assignment data, StandardScaler seems like the most basic workable choice, and because the continuous features contain more outliers than I expected, RobustScaler also looks suitable. (That said, I think it is best to remove the outliers beforehand.)

4) What additional tuning could we try to improve model performance?

On the preprocessing side, there are tools such as OneHotEncoder, which encodes nominal data as 0s and 1s; SimpleImputer, which fills in missing values; and PolynomialFeatures, which creates polynomial-term columns.
In addition, I think the hyperparameters of the various sklearn models (the arguments a person can set) can be tuned.
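One way these pieces could fit together is sketched below: the preprocessing steps go into a Pipeline and a small grid of hyperparameters is searched with cross-validation. Everything about the data is an assumption for illustration (random synthetic columns, not the assignment data), and LogisticRegression and the grid values are just example choices.

```python
import numpy as np
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, PolynomialFeatures, StandardScaler

# Synthetic stand-in data (hypothetical): two numeric columns and one
# categorical column coded 1/2/3. This is NOT the assignment data.
rng = np.random.default_rng(0)
n = 200
numeric = rng.normal(size=(n, 2))
category = rng.integers(1, 4, size=(n, 1)).astype(float)
y = (numeric[:, 0] + 0.5 * numeric[:, 1] ** 2 > 0.5).astype(int)
numeric[rng.random(n) < 0.05, 0] = np.nan  # inject a few missing values
X = np.hstack([numeric, category])

# Numeric columns: fill missing values, add polynomial terms, then scale.
numeric_steps = Pipeline([
    ("impute", SimpleImputer(strategy="median")),
    ("poly", PolynomialFeatures(include_bias=False)),
    ("scale", StandardScaler()),
])
# Column 2 is categorical and gets one-hot encoded.
preprocess = ColumnTransformer([
    ("num", numeric_steps, [0, 1]),
    ("cat", OneHotEncoder(handle_unknown="ignore"), [2]),
])
model = Pipeline([
    ("preprocess", preprocess),
    ("clf", LogisticRegression(max_iter=1000)),
])

# Hyperparameters to try: polynomial degree and regularization strength C.
param_grid = {
    "preprocess__num__poly__degree": [1, 2],
    "clf__C": [0.1, 1.0, 10.0],
}
search = GridSearchCV(model, param_grid, cv=5)
search.fit(X, y)
best_params = search.best_params_
best_score = search.best_score_
print(best_params, round(best_score, 3))
```

Keeping the preprocessing inside the Pipeline means each cross-validation fold fits the imputer and scaler on its own training split only, which avoids leaking information from the validation split.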

์ด์ƒ์ž…๋‹ˆ๋‹ค. ๊ฐ์‚ฌํ•ฉ๋‹ˆ๋‹ค.