
Kaggle_House Sales in King County, USA

Jayden1116 2021. 12. 21. 00:45



0. Before we begin

Data fields

  1. ID : identifier for each house
  2. date : date the house was purchased
  3. price : price of the house (target variable)
  4. bedrooms : number of bedrooms
  5. bathrooms : number of bathrooms
  6. sqft_living : living area in square feet
  7. sqft_lot : lot area in square feet
  8. floors : number of floors
  9. waterfront : whether a river runs in front of the house (a.k.a. river view)
  10. view : how good the house looks
  11. condition : overall condition of the house
  12. grade : grade assigned under the King County grading system
  13. sqft_above : area in square feet excluding the basement
  14. sqft_basement : basement area in square feet
  15. yr_built : year built
  16. yr_renovated : year the house was renovated
  17. zipcode : zip code
  18. lat : latitude
  19. long : longitude
  20. sqft_living15 : living area in square feet as of 2015 (may differ if the house was renovated)
  21. sqft_lot15 : lot area in square feet as of 2015 (may differ if the house was renovated)

# You can download the data from Kaggle or load it directly from this link.
import pandas as pd
df = pd.read_csv('https://ds-lecture-data.s3.ap-northeast-2.amazonaws.com/kc_house_data/kc_house_data.csv')

1. EDA

(I will skip the summary statistics and the visualizations from various plots.)

First, running

df.corr()['price'].sort_values(ascending=False)

showed that, apart from 'price' itself, the five columns most correlated with price are 'sqft_living', 'grade', 'sqft_above', 'sqft_living15', and 'bathrooms'.
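
As a minimal sketch of that ranking step, the snippet below uses a tiny made-up frame rather than the real data. The helper name `top_correlated` is hypothetical. It drops the target before ranking, since a column always has correlation 1 with itself.

```python
import pandas as pd

def top_correlated(df: pd.DataFrame, target: str, k: int = 5) -> pd.Series:
    """Return the k numeric columns most correlated with `target`."""
    corr = df.select_dtypes("number").corr()[target]
    # The target correlates perfectly with itself, so leave it out of the ranking.
    return corr.drop(target).sort_values(ascending=False).head(k)

# Toy data, made up for illustration (not taken from kc_house_data.csv).
toy = pd.DataFrame({
    "price": [100, 200, 300, 400, 500],
    "sqft_living": [10, 22, 29, 41, 50],
    "noise": [3, 1, 4, 1, 5],
})
ranking = top_correlated(toy, "price", k=2)
print(ranking)
```

On the real data the same call would be `top_correlated(df, 'price')`, assuming `df` is the frame loaded above.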

2. Feature Engineering

(No particular approach came to mind right away, so for now I ran a multiple regression on the variables other than 'id' and 'date'.)

3. Modeling

3-1-1. Multiple regression on all variables except id and date

target = 'price'
features = ['bedrooms', 'bathrooms', 'sqft_living', 'sqft_lot', 'floors',
       'waterfront', 'view', 'condition', 'grade', 'sqft_above',
       'sqft_basement', 'yr_built', 'yr_renovated', 'lat', 'long',
       'sqft_living15', 'sqft_lot15', 'zipcode']

y_train = df_train[target]
X_train = df_train[features]
y_test = df_test[target]
X_test = df_test[features]

With the data indexed as above for linear regression, I fit the model.

Test data: MSE: 44334905831.3, MAE: 129983.7, RMSE: 210558.6, R2: 0.7 (0.6751740573558591).
Training data: MSE: 39685983731.2, MAE: 124068.5, RMSE: 199213.4, R2: 0.7.
I will skip interpreting the model through its regression coefficients for now. (With this many variables, I expect their meaning would be hard to pin down.)
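
The post does not show the fitting and scoring code, so here is a rough sketch of how metrics in this format could be produced. The `report` helper and the synthetic arrays are assumptions for illustration. Only `LinearRegression` and the `sklearn.metrics` functions are real library calls.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

def report(model, X, y, label):
    """Print MSE, MAE, RMSE and R2 for one data split and return them."""
    pred = model.predict(X)
    mse = mean_squared_error(y, pred)
    scores = {"MSE": mse, "MAE": mean_absolute_error(y, pred),
              "RMSE": np.sqrt(mse), "R2": r2_score(y, pred)}
    print(label, ", ".join(f"{k}: {v:.4f}" for k, v in scores.items()))
    return scores

# Synthetic stand-in for X_train / y_train / X_test / y_test (not the house data).
rng = np.random.default_rng(0)
X_all = rng.normal(size=(200, 3))
y_all = X_all @ np.array([3.0, -2.0, 0.5]) + rng.normal(scale=0.1, size=200)
X_train, X_test = X_all[:150], X_all[150:]
y_train, y_test = y_all[:150], y_all[150:]

model = LinearRegression().fit(X_train, y_train)
test_scores = report(model, X_test, y_test, "test:")
train_scores = report(model, X_train, y_train, "train:")
```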

3-1-2. Multiple regression on all variables except id and date (but now with standardization thrown in blindly)

from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
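# Fit the scaler on the training set only, then apply the same transform to both sets.
# (The metrics reported below come from the original run, which re-fit the scaler on X_test.)
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)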

After standardizing the feature set as above, I fit the model.

Test data: MSE: 41358447199.6, MAE: 131365.1, RMSE: 203367.8, R2: 0.7 (0.6858104110180232).
Training data: MSE: 39899513851.1, MAE: 123577.7, RMSE: 199748.6, R2: 0.7.

For now I simply put all the variables on the same scale without much thought, and the score on the test data rose by about 0.01.
Since standardization had less effect than I expected, I decided to keep modeling by changing only the set of variables for the time being.

3-2. Multiple regression on the top 5 features by correlation (no standardization)

# The features highly correlated with price (0.5 or above) are 'sqft_living', 'grade', 'sqft_above', 'sqft_living15', 'bathrooms'.
target = 'price'
features = ['sqft_living', 'grade', 'sqft_above', 'sqft_living15', 'bathrooms']

y_train = df_train[target]
X_train = df_train[features]
y_test = df_test[target]
X_test = df_test[features]

Test data: MSE: 63247102520.8, MAE: 166621.4, RMSE: 251489.8, R2: 0.5 (0.5195278717932403).
Training data: MSE: 60420427179.7, MAE: 159247.2, RMSE: 245805.7, R2: 0.5.

As expected, the score drops now that there are fewer variables than before.

3-3. Multiple regression under some highly subjective assumptions drawn from EDA

I left out the detailed EDA, but after plotting various things and looking at the statistics, I dropped a few variables based on the assumptions below and tried again.

  • Among the sqft variables, sqft_living is the sum of sqft_above and sqft_basement, and the correlations suggest the basement has little influence, so I excluded sqft_basement.
  • The sqft15 variables were measured as of 2015, while the prices in the train set are from 2014, so I excluded them as well.
target = 'price'
features = ['bedrooms', 'bathrooms', 'sqft_living', 'sqft_lot', 'floors', 
            'grade', 'sqft_above', 'lat', 'long', 'view', 'condition', 'waterfront', 'zipcode']

y_train = df_train[target]
X_train = df_train[features]
y_test = df_test[target]
X_test = df_test[features]

์œ„์˜ ๋ณ€์ˆ˜๋“ค๋กœ๋งŒ ๋ชจ๋ธ๋ง์„ ์ง„ํ–‰ํ•˜์˜€๊ณ 

ํ…Œ์ŠคํŠธ ๋ฐ์ดํ„ฐ์˜ MSE : 47966313586.4, MAE : 136276.1, RMSE : 219012.1, R2 : 0.6(0.6499013670241869) ์ž…๋‹ˆ๋‹ค.
ํ›ˆ๋ จ ๋ฐ์ดํ„ฐ์˜ MSE : 42738317364.2, MAE : 130068.3, RMSE : 206732.5, R2 : 0.7 ์ž…๋‹ˆ๋‹ค.

์ด๊ฒƒ๋„ ํŒ๋‹จํ•˜๊ธฐ ๋‹ค์†Œ ์• ๋งคํ•ฉ๋‹ˆ๋‹ค. ๋ณ€์ˆ˜๋ฅผ 3๊ฐœ ์ค„์ธ ๊ฒƒ์น˜๊ณ ๋Š” ๊ทธ๋ž˜๋„ score๊ฐ€ ๋งŽ์ด ์•ˆ๋–จ์–ด์ง„๊ฑด๊ฐ€ ์‹ถ๊ธฐ๋„ ํ•ฉ๋‹ˆ๋‹ค.
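
The first assumption above (that sqft_living equals sqft_above plus sqft_basement) is easy to verify before relying on it. The sketch below uses a hypothetical helper, `living_area_is_sum`, on a made-up frame. If the identity holds for every row, the three columns are perfectly collinear, so a linear model loses no information when one of them is dropped.

```python
import pandas as pd

def living_area_is_sum(df: pd.DataFrame) -> bool:
    """True if sqft_living equals sqft_above + sqft_basement in every row."""
    return bool((df["sqft_living"] == df["sqft_above"] + df["sqft_basement"]).all())

# Toy rows, made up for illustration (not taken from kc_house_data.csv).
toy = pd.DataFrame({
    "sqft_above": [1000, 1500, 800],
    "sqft_basement": [0, 500, 200],
})
toy["sqft_living"] = toy["sqft_above"] + toy["sqft_basement"]
print(living_area_is_sum(toy))
```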

3-1-3. Multiple regression on all variables except id and date (but this time with a larger share of train data)

Starting from the first run in 3-1-1, I moved the date used to split train and test from '2015-01-01' to '2015-03-15', which raised the train set share from 67.7% (before) to 81.4% (after), and ran the model again.

df_train = df[df['date'] < pd.to_datetime('2015-03-15')]
df_test = df.drop(df_train.index)

print(df_train.shape[0] / df.shape[0] * 100) # 81.41396381807246
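
To compare cutoffs side by side, the split can be wrapped in a small function. This is only a sketch: `split_by_date` is a hypothetical helper, and the monthly toy dates are made up rather than taken from the data set.

```python
import pandas as pd

def split_by_date(df, cutoff, date_col="date"):
    """Rows dated before `cutoff` go to train, the rest to test."""
    dates = pd.to_datetime(df[date_col])
    mask = dates < pd.to_datetime(cutoff)
    return df[mask], df[~mask]

# Thirteen monthly toy rows from 2014-05-01 to 2015-05-01.
toy = pd.DataFrame({
    "date": pd.date_range("2014-05-01", periods=13, freq="MS"),
    "price": range(13),
})
shares = {}
for cutoff in ["2015-01-01", "2015-03-15"]:
    train, test = split_by_date(toy, cutoff)
    shares[cutoff] = len(train) / len(toy) * 100
    print(cutoff, f"train share: {shares[cutoff]:.1f}%")
```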

The modeling results after this change are:

Test data: MSE: 44334905831.3, MAE: 129983.7, RMSE: 210558.6, R2: 0.7 (0.67640644518824).
Training data: MSE: 39685983731.2, MAE: 124068.5, RMSE: 199213.4, R2: 0.7.

The score rose by about 0.001 compared with before (0.6752 to 0.6764).

3-4-1. The last resort, "oh, whatever": I converted date to a numeric type as well and, instead of splitting the data by date, built the model using train_test_split.

# Convert date to a number.
df['date'] = pd.to_numeric(df['date'])

# Define X and y as below to build the new train and test sets.
X = df.drop(['price', 'id'], axis=1)
y = df['price']

# Create the train and test sets as below.
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

Modeling in the same way gives:

Test data: MSE: 68087426000.4, MAE: 167000.7, RMSE: 260935.7, R2: 0.5 (0.5448324932597781).
Training data: MSE: 59062183895.9, MAE: 161677.9, RMSE: 243027.1, R2: 0.5.

Oops... Since I had added date, I expected a somewhat higher score, but it came out lower.
Then it occurred to me that date, once converted to a number, might be on a far larger scale than every other variable, so I tried standardization once more.
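
To give a sense of that scale, the sketch below converts two example dates with `pd.to_numeric`. It assumes the column is stored as `datetime64[ns]`, in which case the result is nanoseconds since 1970-01-01. The days-since-first-sale encoding is a hypothetical alternative, not something the original analysis used.

```python
import pandas as pd

# Two example dates, cast explicitly to nanosecond resolution.
dates = pd.Series(pd.to_datetime(["2014-05-02", "2015-05-27"])).astype("datetime64[ns]")

# pd.to_numeric on datetime64[ns] values yields nanoseconds since 1970-01-01.
as_ns = pd.to_numeric(dates)

# Hypothetical alternative (not in the original post): days since the earliest date.
as_days = (dates - dates.min()).dt.days

print(as_ns.tolist())
print(as_days.tolist())
```

Values on the order of 10^18 sit many orders of magnitude above square footage or room counts, which is consistent with the suspicion that date dominates the scale.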

3-4-2. Standardization on top of the approach above

from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
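# Fit the scaler on the training set only, then apply the same transform to both sets.
# (The metrics reported below come from the original run, which re-fit the scaler on X_test.)
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)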

As before, I applied standardization and then fit the model.

Test data: MSE: 44479631979.7, MAE: 124595.4, RMSE: 210901.9, R2: 0.7 (0.7026516586368019).
Training data: MSE: 39093340103.0, MAE: 124646.5, RMSE: 197720.4, R2: 0.7.

I am not sure this is the right way to do it, but at least the score on the test set now exceeds 0.7 without rounding!

I will update this post later with a look at the regression coefficients and at overfitting/underfitting.
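
One way to keep the scaling step tied to the training data is to put the scaler and the regression into a single scikit-learn pipeline. The sketch below runs on synthetic data (an assumption) and is meant as an illustration of the pattern rather than a reproduction of the result above.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic regression data (not the house data).
rng = np.random.default_rng(1)
X = rng.normal(size=(400, 5))
y = X @ np.array([1.0, 2.0, 0.0, -1.0, 0.5]) + rng.normal(scale=0.5, size=400)

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

# fit() learns the scaling statistics from X_train only;
# score() reuses those statistics when transforming X_test.
pipe = make_pipeline(StandardScaler(), LinearRegression())
pipe.fit(X_train, y_train)
test_r2 = pipe.score(X_test, y_test)
print(f"test R2: {test_r2:.3f}")
```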