
[TIL]1.Exploratory Data Analysis(EDA)

Jayden1116 2021. 11. 24. 20:38

๋ฐ์ดํ„ฐ์…‹ ๋ถˆ๋Ÿฌ์˜ค๊ธฐ

import pandas as pd
df = pd.read_csv('')  # put the CSV file path here

๋ฐ์ดํ„ฐ์…‹ ๊ฐ„๋‹จํ•˜๊ฒŒ ํ™•์ธ

Check the first through fifth rows (used very often to get a rough feel for the actual data). Honestly, it gets used all the time.

df.head()

๋ฐ์ดํ„ฐ์…‹ ๋ชจ์–‘ ํ™•์ธ(df์˜ ํ–‰๊ณผ ์—ด ๊ฐฏ์ˆ˜, ๊ตฌ์กฐ๋ฅผ ๊ฐ„๋‹จํ•˜๊ฒŒ ํŠœํ”Œ ๋ณด์—ฌ์คŒ)

df.shape
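
To see the two calls above in action without a CSV file, here is a minimal sketch. The toy DataFrame and its column names are made up for illustration and are not from the original dataset.

```python
import pandas as pd

# Hypothetical toy data standing in for the CSV file
df = pd.DataFrame({
    "name": ["a", "b", "c", "d", "e", "f"],
    "score": [10, 20, 30, 40, 50, 60],
})

first_rows = df.head()      # first five rows by default
n_rows, n_cols = df.shape   # shape is a (rows, columns) tuple

print(first_rows)
print(n_rows, n_cols)
```

Note that shape is an attribute, not a method, so it takes no parentheses.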

EDA (Exploratory Data Analysis): a very important concept

Say an acquaintance gives us a fish as a gift. What do we do with it now? Steam it, make a stew, or maybe set it free?

 

  • ์ƒ์„ ์— ๋…์€ ์—†์„๊นŒ
  • ๋จน์„ ์ˆ˜ ์—†๋Š” ๋ถ€๋ถ„์€ ์žˆ๋‚˜
  • ์ƒ์„ ์ด ๋งž๊ธด ํ•ด?
  • ์š”๋ฆฌํ•˜๊ธฐ ์œ„ํ•œ ์—ฌ๋Ÿฌ ๊ฐ€์ • ex)์‹ ์„ ํ•˜๋ฉด ํšŒ๋„ ๊ฐ€๋Šฅํ•˜๋‹ค!
    ์™€ ๊ฐ™์ด ๋ฐ์ดํ„ฐ๋„ ์ด๋ฆฌ์ €๋ฆฌ ๊ฒฐ์ธก์น˜๋Š” ์žˆ๋Š”์ง€, ๋ญ ๋ฐ์ดํ„ฐ ํฌ๊ธฐ๋Š” ์–ด๋–ค์ง€, ์นผ๋Ÿผ๋ณ„๋กœ ๋ฌด์Šจ ์ž๋ฃŒํ˜•์ธ์ง€ ๋“ฑ๋“ฑ ์ƒ…์ƒ…ํžˆ ํŒŒ์•…ํ•˜๋Š” ๊ฒƒ
    ์ด๊ฒŒ ํƒ์ƒ‰์  ๋ฐ์ดํ„ฐ ๋ถ„์„! ์šฐ๋ฆฌ๊ฐ€ ์ฒ˜์Œ ๋ฐ์ดํ„ฐ๋ฅผ ๋ฐ›์œผ๋ฉด ์ด ๋ฐ์ดํ„ฐ์— ๋Œ€ํ•œ ๊ฒฌ์ ์„ ๋‚ด์•ผํ•œ๋‹ค. ์ด๊ฒŒ ๊ณง EDA์ด๋‹ค.

Graphic methods

  • Examining the data through visualization, using charts, figures, and the like

Non-Graphic methods

  • Examining the data mainly through Summary Statistics rather than visual elements
    df.describe()
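
As a rough illustration of what describe() returns, the sketch below runs it on a small made-up DataFrame (the column names are assumptions for the example). By default it summarizes only the numeric columns.

```python
import pandas as pd

# Hypothetical data: one numeric column and one categorical column
df = pd.DataFrame({
    "score": [10, 20, 30, 40, 50],
    "grade": ["A", "B", "B", "C", "A"],
})

summary = df.describe()  # count, mean, std, min, quartiles, max
print(summary)
```

To also summarize non-numeric columns, describe(include="all") can be used instead.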

๋™์‹œ์— EDA์˜ ํƒ€๊ฒŸ ๋ฐ์ดํ„ฐ์— ๋”ฐ๋ผ Univaraite, Multi-variate 2๊ฐ€์ง€๋กœ ๋‚˜๋ˆ„์–ด์ง„๋‹ค.

์ฆ‰, ์กฐํ•ฉ์ด 2x2๋กœ 4๊ฐ€์ง€

Uni-Graphic

Examples include Histogram, Pie chart, Boxplot, and QQplot.
QQplot : a way to check how well the distribution of the data matches a theoretical distribution
ex) Which class has the more evenly spread scores?
Class A : 10, 20, 30, 40, 50, 60, 70, 80, 90, 100
Class B : 10, 15, 20, 75, 80, 85, 90, 95, 95, 100
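
The idea behind a QQ plot can be sketched numerically. The code below pairs the sorted scores of each class with the quantiles of a theoretical distribution and measures how close the pairs are to a straight line. Choosing the uniform distribution as the reference is my assumption (it matches the question of which class is "more even"); the helper names qq_pairs and qq_correlation are hypothetical.

```python
import numpy as np

class_a = [10, 20, 30, 40, 50, 60, 70, 80, 90, 100]
class_b = [10, 15, 20, 75, 80, 85, 90, 95, 95, 100]

def qq_pairs(scores):
    """Return (theoretical, sample) quantile pairs against a uniform distribution."""
    sample = np.sort(np.asarray(scores, dtype=float))
    n = len(sample)
    # Plotting positions (i - 0.5) / n are quantiles of the standard uniform
    theoretical = (np.arange(1, n + 1) - 0.5) / n
    return theoretical, sample

def qq_correlation(scores):
    """Correlation of the QQ pairs; values near 1 mean the points lie on a line."""
    theoretical, sample = qq_pairs(scores)
    return float(np.corrcoef(theoretical, sample)[0, 1])

r_a = qq_correlation(class_a)
r_b = qq_correlation(class_b)
print(r_a, r_b)
```

Plotting the pairs returned by qq_pairs as a scatter plot gives the QQ plot itself. The closer the points are to a straight line, the better the data matches the reference distribution, so Class A should score higher than Class B here.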

 

Other QQplot examples

Uni-NonGraphic

The main goal is to check the Distribution of the Sample Data.
For Numeric data, summary statistics are used the most. These include

Center (Mean, Median, Mode)
Spread (Variance, SD, IQR, Range)
Modality (Peak)
Shape (Tail, Skewness, Kurtosis)
Outliers

among others.

Multi-Graphic

  • Category&Numeric : Boxplots, Stacked bar, Parallel Coordinate, Heatmap ๋“ฑ

  • Numeric&Numeric : Scatter Plot

Basic EDA with pandas

Useful ones (the ones used most often)

Missing Data (missing values)

  • isna
  • dropna
  • fillna
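
A short sketch of how these three fit together, on a made-up DataFrame (the columns and fill values are assumptions for illustration):

```python
import numpy as np
import pandas as pd

# Hypothetical data with one missing value per column
df = pd.DataFrame({
    "age": [25, np.nan, 31, 40],
    "city": ["Seoul", "Busan", None, "Daegu"],
})

missing_per_column = df.isna().sum()   # count of missing values per column
dropped = df.dropna()                  # drop every row that has a missing value
filled = df.fillna({"age": 0, "city": "unknown"})  # fill per column

print(missing_per_column)
print(dropped.shape)
print(filled)
```

None of these calls modify df in place; each returns a new object.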

Data Frame (understanding the data structure)

  • index
  • columns
  • dtypes
  • info
  • loc
  • iloc
  • head
  • replace
  • describe
  • shape
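
The sketch below touches several of the items above on a small made-up DataFrame. The index labels, column names, and replacement values are hypothetical.

```python
import pandas as pd

df = pd.DataFrame(
    {
        "name": ["kim", "lee", "park"],
        "score": [90, 85, 70],
        "passed": ["Y", "Y", "N"],
    },
    index=["s1", "s2", "s3"],
)

print(df.index.tolist())    # row labels
print(df.columns.tolist())  # column labels
print(df.dtypes)            # data type of each column
df.info()                   # dtypes, non-null counts, memory usage

by_label = df.loc["s2", "score"]   # loc selects by label
by_position = df.iloc[0, 1]        # iloc selects by integer position

# replace with a nested dict: {column: {old_value: new_value}}
recoded = df.replace({"passed": {"Y": "pass", "N": "fail"}})
print(recoded)
```

The main thing to remember is that loc works with labels and iloc with positions, so df.loc["s2", "score"] and df.iloc[1, 1] point at the same cell.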

Visualization (when plotting)

  • plot
  • plot.bar
  • plot.hist
  • plot.box
  • plot.pie
  • plot.scatter
์‹œ๊ฐํ™”์‹œ <matplotlib.~> ์—†์• ๋Š” ๋ฒ•

๋์— ; ๋ฅผ ๋ถ™์—ฌ์ค€๋‹ค.
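
A minimal sketch of two of the plot accessors listed above, on made-up height and weight data. The Agg backend is selected only so the example also runs outside a notebook; in a notebook that line is unnecessary.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend, assumed for running as a script
import matplotlib.pyplot as plt
import pandas as pd

# Hypothetical data
df = pd.DataFrame({
    "height": [150, 160, 165, 170, 180],
    "weight": [45, 55, 60, 68, 80],
})

fig, (ax_scatter, ax_hist) = plt.subplots(1, 2, figsize=(8, 3))

# Numeric & Numeric: scatter plot
df.plot.scatter(x="height", y="weight", ax=ax_scatter)

# Univariate: histogram of a single column
df["height"].plot.hist(bins=5, ax=ax_hist)

fig.tight_layout()
```

In a notebook cell, a plotting call on the last line prints its return value (the <matplotlib.~> text), which is why ending that line with ; suppresses it.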

ํ•œ๊ธ€ ํฐํŠธ ๋‚˜์˜ค๊ฒŒ ํ•˜๋Š” ๋ฒ•

๋จผ์ € ํ•œ๊ธ€ ํฐํŠธ๋ฅผ ๋ฐ›๊ณ 

!sudo apt-get install -y fonts-nanum
!sudo fc-cache -fv
!rm -rf ~/.cache/matplotlib

๊ธ€๊ผด์„ ์„ค์ •ํ•ด์ค€๋‹ค.

import matplotlib.pyplot as plt
plt.rc('font', family='NanumBarunGothic')

Data Preprocessing

  • ์•ˆ์ข‹์€ ๋ฐ์ดํ„ฐ๊ฐ€ ๋“ค์–ด๊ฐ€๋ฉด ์•„๋ฌด๋ฆฌ ๋ชจ๋ธ์ด ์ข‹์•„๋„ ์•ˆ์ข‹์€ ๊ฒฐ๊ณผ๊ฐ€ ๋‚˜์˜จ๋‹ค.

๋ฐ์ดํ„ฐ ํ˜น์€ ๋ถ„์„, ์ž‘์—…๋งˆ๋‹ค ํ•ด์•ผํ•˜๋Š” ๋ฐ์ดํ„ฐ์˜ ์ „์ฒ˜๋ฆฌ๋Š” ๋‹ค๋ฅด์ง€๋งŒ ํฌ๊ฒŒ ์•„๋ž˜์™€ ๊ฐ™์€ ํ๋ฆ„์„ ๊ฐ–๋Š”๋‹ค.

Cleaning

  • The process of removing noise or correcting inconsistency

Ways to handle Missing Values

  • Simply delete them
  • Fill them in manually
  • global constant (pick a fixed value and fill it in)
  • Imputation (fill in the mean, the median, or a value predicted by regression)
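
The difference between the filling strategies above shows up clearly when the column has an extreme value. A minimal sketch on made-up income data (regression-based imputation is not shown):

```python
import numpy as np
import pandas as pd

# Hypothetical column with one missing value and one extreme value
income = pd.Series([100.0, 120.0, np.nan, 110.0, 1000.0])

by_constant = income.fillna(0)                 # global constant
by_mean = income.fillna(income.mean())         # mean ignores NaN by default
by_median = income.fillna(income.median())     # median is robust to the extreme value

print(by_constant[2], by_mean[2], by_median[2])
```

The mean is pulled up to 332.5 by the single value of 1000, while the median stays at 115, so the median is often the safer choice for skewed columns.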

Noisy data

Data containing random error or variance that deviates from the overall trend
Most of it can be found and removed through EDA, such as descriptive statistics or visualization
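
One common way to do this with descriptive statistics is the 1.5 x IQR rule that a boxplot uses for its whiskers. The rule and the data below are my choice for illustration, not something prescribed by the notes above.

```python
import pandas as pd

# Hypothetical measurements with one obviously noisy value
values = pd.Series([10, 12, 11, 13, 12, 11, 100])

q1, q3 = values.quantile(0.25), values.quantile(0.75)
iqr = q3 - q1

# Anything beyond 1.5 * IQR from the quartiles is flagged as an outlier
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
is_outlier = (values < lower) | (values > upper)

cleaned = values[~is_outlier]
print(cleaned.tolist())
```

Whether a flagged value is truly noise or a meaningful extreme still needs a human decision; the rule only points at candidates.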

Integration

The process of combining data that is split across several pieces into one so that it is easier to analyze
ex) merge
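
A minimal sketch of merge on two made-up tables that share a student_id key (all names are hypothetical):

```python
import pandas as pd

students = pd.DataFrame({
    "student_id": [1, 2, 3],
    "name": ["kim", "lee", "park"],
})
scores = pd.DataFrame({
    "student_id": [1, 2, 4],
    "score": [90, 85, 70],
})

# Inner join (default): keep only keys present in both tables
inner = students.merge(scores, on="student_id")

# Left join: keep every student, missing scores become NaN
left = students.merge(scores, on="student_id", how="left")

print(inner)
print(left)
```

The how parameter decides which keys survive, so it is worth checking the row count after every merge.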

Transformation

๋ฐ์ดํ„ฐ์˜ ํ˜•ํƒœ๋ฅผ ๋ณ€ํ™˜ํ•˜๋Š” ์ž‘์—…์œผ๋กœ, scaling์ด๋ผ๊ณ  ๋ถ€๋ฅด๊ธฐ๋„ ํ•ฉ๋‹ˆ๋‹ค.
ex) normalize
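
As one concrete form of normalization, the sketch below applies min-max scaling so that every column ends up in the range 0 to 1. Picking min-max scaling is my assumption (standardization to zero mean and unit variance is another common choice), and the data is made up.

```python
import pandas as pd

# Hypothetical columns on very different scales
df = pd.DataFrame({
    "height_cm": [150.0, 160.0, 170.0, 180.0],
    "income": [2000.0, 3000.0, 10000.0, 4000.0],
})

# Min-max scaling, applied column by column
normalized = (df - df.min()) / (df.max() - df.min())
print(normalized)
```

After scaling, columns measured in different units become comparable, which matters for distance-based methods.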

Reduction

๋ฐ์ดํ„ฐ๋ฅผ ์˜๋ฏธ์žˆ๊ฒŒ ์ค„์ด๋Š” ๊ฒƒ์„ ์˜๋ฏธํ•˜๋ฉฐ, dimension reduction๊ณผ ์œ ์‚ฌํ•œ ๋ชฉ์ ์„ ๊ฐ–๋Š”๋‹ค.
ex) PCA
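
To make the idea of PCA concrete, here is a minimal sketch that reduces two strongly correlated columns to one, using plain NumPy and the SVD. This is an illustration under my own assumptions (made-up data, no scaling step); in practice a library implementation such as scikit-learn's PCA would usually be used.

```python
import numpy as np

# Hypothetical data: second column is roughly twice the first
x = np.arange(10, dtype=float)
noise = np.array([0.1, -0.1, 0.2, -0.2, 0.0, 0.1, -0.1, 0.2, -0.2, 0.0])
data = np.column_stack([x, 2 * x + noise])

# 1. Center each column
centered = data - data.mean(axis=0)

# 2. SVD: rows of `components` are the principal directions
_, singular_values, components = np.linalg.svd(centered, full_matrices=False)

# 3. Share of the total variance explained by each component
explained_ratio = singular_values**2 / np.sum(singular_values**2)

# 4. Project onto the first principal component: 2 columns -> 1 column
reduced = centered @ components[:1].T

print(explained_ratio)
print(reduced.shape)
```

Because the two columns carry almost the same information, the first component keeps nearly all of the variance and the second can be dropped with little loss.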