
Feature Engineering: Handling Missing Values and Using apply

Jayden1116 2021. 12. 9. 22:40

NA Value Handling
Replace the 2019 Q4 value of 당기순이익(비지배) (net income attributable to non-controlling interests) with NA.
Then handle that missing value with mean imputation.

Feature Engineering
Compute a new feature called Relative Performance.

It grades each quarter's 매출액 (revenue) relative to the mean revenue over the most recent year:

10% or more above the mean -> S
5% or more above -> A
between -5% and 5% -> B
-5% or below -> C
-10% or below -> D

The result for 2020 Q2 should be A.

Also state the revenue required to reach each grade.

url = 'https://ds-lecture-data.s3.ap-northeast-2.amazonaws.com/kt%26g/kt%26g_0.csv'
headers = ['분기', '매출액', '영업이익', '영업이익(발표기준)', '세전계속사업이익',
           '당기순이익', '당기순이익(지배)', '당기순이익(비지배)', '자산총계', '부채총계',
           '자본총계', '자본총계(지배)', '자본총계(비지배)', '자본금', '영업활동현금흐름',
           '투자활동현금흐름', '재무활동현금흐름', '영업이익률', '순이익률', 'ROE(%)',
           'ROA(%)', '부채비율', '자본유보율', 'EPS(원)', 'PER(배)']
import pandas as pd
import numpy as np

df = pd.read_csv(url,names=headers)
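One thing worth checking about the `names=` argument: when it is passed without `header=`, pandas treats every line of the file as data, including any header line. The sketch below shows the difference on a small in-memory CSV. The text and the column names `q` and `rev` are made up for illustration, and whether the KT&G file has a header row is not stated here.

```python
import io
import pandas as pd

# Hypothetical CSV text with its own header line (not the KT&G file).
csv_text = "quarter,revenue\n2020Q1,100\n2020Q2,120\n"
cols = ["q", "rev"]  # replacement column names, chosen for this example only

# names= alone: the file's header line is kept as the first data row.
as_data = pd.read_csv(io.StringIO(csv_text), names=cols)

# names= with header=0: the file's header line is discarded and replaced.
replaced = pd.read_csv(io.StringIO(csv_text), names=cols, header=0)

print(as_data)
print(replaced)
```

If the first row of `df` turns out to hold column labels rather than numbers, adding `header=0` is one way to drop it.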

1. Replace a value with a missing value, then impute with the mean

df.loc[2, '당기순이익(비지배)'] = np.nan  # Replace the value with a missing value (np.NAN was removed in NumPy 2.0).
# After checking which columns contain missing values, fill each with its own column mean.
df.fillna({'당기순이익(비지배)': df['당기순이익(비지배)'].mean(),
           '자본총계(비지배)': df['자본총계(비지배)'].mean()}, inplace=True)
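The same mean-imputation step can be traced on a toy frame where the numbers are easy to check by hand. This is only a sketch: the column name `net_income` and its values are made up and are not the KT&G figures.

```python
import numpy as np
import pandas as pd

# Toy data (hypothetical values).
toy = pd.DataFrame({"quarter": ["Q1", "Q2", "Q3", "Q4"],
                    "net_income": [10.0, 20.0, 30.0, 40.0]})

# Step 1: overwrite one observed value with a missing value.
toy.loc[2, "net_income"] = np.nan

# Step 2: Series.mean() skips NaN by default, so only the three
# remaining values contribute: (10 + 20 + 40) / 3.
col_mean = toy["net_income"].mean()

# Step 3: fill the gap with that mean. Assigning the result back
# is an alternative to inplace=True.
toy["net_income"] = toy["net_income"].fillna(col_mean)

print(toy)
```

Note that the imputed value comes from the remaining observations only, so it differs from the mean of the original four values.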

2. Compute a new feature called Relative Performance

def Toint(string):
    # Strip thousands separators such as "1,234" and convert to int.
    return int(string.replace(',', ''))

df['매출액'] = df['매출액'].apply(Toint)  # First remove the commas and convert the column to integers.
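As an alternative to a row-by-row `apply`, the same cleanup can be written with vectorized string methods. This is a minimal sketch on made-up strings, assuming every entry is a comma-grouped integer.

```python
import pandas as pd

# Hypothetical comma-formatted numbers (not taken from the KT&G file).
raw = pd.Series(["1,234", "56,789", "100"])

# Remove the commas from every element at once, then parse as numbers.
# regex=False treats "," as a literal character.
cleaned = pd.to_numeric(raw.str.replace(",", "", regex=False))

print(cleaned.tolist())
```

`pd.to_numeric` also accepts `errors="coerce"`, which turns unparseable entries into NaN instead of raising an error. That may be useful if the column contains blanks or dashes.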
# Baseline for the grades, computed once instead of on every call.
# This uses every row in df; if df holds more than the most recent year,
# select those rows before taking the mean.
sales_mean = df['매출액'].mean()

def RP_category(x):
    # Grade a revenue value by its ratio to the mean revenue.
    if x >= 1.10 * sales_mean:
        return 'S'
    elif x >= 1.05 * sales_mean:
        return 'A'
    elif x >= 0.95 * sales_mean:
        return 'B'
    elif x >= 0.90 * sales_mean:
        return 'C'
    else:
        return 'D'

# Likewise, define a grading function and apply it to each value with apply.
df['Relative Performance'] = df['매출액'].apply(RP_category)
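The task also asks for the revenue needed to reach each grade. With the cut-offs used above, those are the mean multiplied by 1.10, 1.05, 0.95 and 0.90. The sketch below computes them and re-grades a toy series. The helper names `grade_thresholds` and `grade` and the revenue figures are hypothetical, chosen so that the mean is exactly 1000.

```python
import pandas as pd

def grade_thresholds(mean_revenue):
    """Minimum revenue for each grade; anything below the C floor is D."""
    return {
        "S": 1.10 * mean_revenue,
        "A": 1.05 * mean_revenue,
        "B": 0.95 * mean_revenue,
        "C": 0.90 * mean_revenue,
    }

def grade(revenue, mean_revenue):
    # Dicts keep insertion order, so floors are checked from S down to C.
    for label, floor in grade_thresholds(mean_revenue).items():
        if revenue >= floor:
            return label
    return "D"

# Hypothetical revenues with a mean of 1000.
revenue = pd.Series([880, 920, 1000, 1070, 1130])
mean_revenue = revenue.mean()

grades = revenue.apply(grade, mean_revenue=mean_revenue)
print(grade_thresholds(mean_revenue))
print(grades.tolist())
```

Calling `grade_thresholds(df['매출액'].mean())` on the real data would give the revenue floors for S, A, B and C, with D covering everything below the C floor.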