Jayden1116 2021. 11. 27. 03:11

ANOVA(one-way)

Previously
One-sample t-test : whether the mean of one group equals a specific value
Two-sample t-test : whether the means of two groups differ significantly

  • Here we look at how to test whether the means of two or more groups differ.

Multiple Comparison

  • To compare more than two groups, one might think of simply testing the groups pair by pair.
    This is possible in theory, but there is a problem.
    In each of the three hypothesis tests (for example, three groups compared pairwise), the probability of a statistical error (a false positive) is α.

    So the probability that at least one of the three tests produces an error is
    1−(1−α)^3, which for α=0.05 is about 14.3 % (assuming the tests are independent).

Mathematically
With m hypothesis tests,
ᾱ = 1−(1−α)^m and ᾱ ≤ m⋅α, where ᾱ is the probability of at least one error across all m tests (the equality assumes independent tests).
In other words, comparing groups one pair at a time means the error grows as the number of groups grows.
This is why we need a method that compares several groups at once!
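The growth of the overall error rate can be checked numerically. The sketch below is only an illustration: the helper name `familywise_error_rate` is made up for this example, and it assumes the m tests are independent.

```python
def familywise_error_rate(alpha: float, m: int) -> float:
    """Probability of at least one false positive across m independent tests."""
    return 1 - (1 - alpha) ** m


alpha = 0.05
rates = {}
for m in (1, 3, 6, 10):
    rates[m] = familywise_error_rate(alpha, m)
    bound = min(1.0, m * alpha)  # the simple upper bound m * alpha
    print(f"m={m:2d}  1-(1-alpha)^m={rates[m]:.4f}  m*alpha={bound:.2f}")
```

For m=3 this gives roughly 0.1426, and the value always stays at or below m⋅α.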

Variation

So how do we check for differences among several groups?
Start from the assumption that 'all the groups come from a single distribution'.
The statistic used to test this is the 'F-statistic'.

Put simply, F = (between-group variance) / (within-group variance)

When the F value is large

  • The numerator (between-group variance) must be large and the denominator (within-group variance) small.
  • That is, it suggests that 'the distributions of the groups differ from one another'.
  • The null hypothesis is 'the groups share the same distribution' (more precisely, the same mean); when F is very large the p-value becomes very small and the null hypothesis is rejected.
# One-way ANOVA with scipy (one-way vs. two-way: look up the difference)
# g1, g2, g3 are the observations of each group, assumed to be defined beforehand
from scipy.stats import f_oneway
f_oneway(g1, g2, g3)

F_onewayResult(statistic=2.6009238802972483, pvalue=0.11524892355706169)
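To make the ratio of between-group to within-group variance concrete, here is a small sketch that computes F by hand and compares it with `f_oneway`. The three arrays are hypothetical values invented for this example; they are not the g1, g2, g3 used above.

```python
import numpy as np
from scipy.stats import f_oneway

# Hypothetical measurements for three groups (illustration only)
a = np.array([4.1, 5.0, 4.7, 5.3, 4.9])
b = np.array([5.6, 6.1, 5.8, 6.4, 5.9])
c = np.array([4.8, 5.2, 5.5, 5.1, 5.4])
groups = [a, b, c]

k = len(groups)                          # number of groups
n_total = sum(len(g) for g in groups)    # total number of observations
grand_mean = np.concatenate(groups).mean()

# Between-group sum of squares: spread of group means around the grand mean
ss_between = sum(len(g) * (g.mean() - grand_mean) ** 2 for g in groups)
# Within-group sum of squares: spread of observations around their own group mean
ss_within = sum(((g - g.mean()) ** 2).sum() for g in groups)

ms_between = ss_between / (k - 1)        # between-group variance
ms_within = ss_within / (n_total - k)    # within-group variance
f_manual = ms_between / ms_within

f_scipy, p_value = f_oneway(*groups)
print(f_manual, f_scipy, p_value)
```

The hand-computed value and the `f_oneway` statistic should agree up to floating-point error.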

Law of Large Numbers

As the sample size grows, the sample statistic gets closer and closer to the population parameter.
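A quick simulation can illustrate this. The population below (normal, mean 130, standard deviation 10) is a hypothetical choice made for this sketch.

```python
import numpy as np

rng = np.random.default_rng(0)
population_mean = 130.0   # hypothetical population parameter
population_std = 10.0     # hypothetical population parameter

errors = {}
for n in (10, 100, 10_000, 1_000_000):
    sample = rng.normal(population_mean, population_std, size=n)
    # Distance between the sample mean and the true mean
    errors[n] = abs(sample.mean() - population_mean)
    print(f"n={n:>9,d}  |sample mean - population mean| = {errors[n]:.4f}")
```

The gap tends to shrink as n grows, although for any single run it does not have to decrease at every step.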

Central Limit Theorem (CLT)

As the sample size grows, the distribution of the sample means approaches a normal distribution.
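The sketch below illustrates this with a clearly non-normal population. The exponential distribution and the helper names `sample_means` and `skewness` are assumptions chosen for the example. Skewness is 0 for a normal distribution, so it serves as a rough check of how normal the sample means look.

```python
import numpy as np

rng = np.random.default_rng(0)


def sample_means(n, repeats=10_000):
    """Draw `repeats` samples of size n from Exponential(1) and return their means."""
    draws = rng.exponential(scale=1.0, size=(repeats, n))
    return draws.mean(axis=1)


def skewness(x):
    """Third standardized moment; 0 for a symmetric distribution such as the normal."""
    z = (x - x.mean()) / x.std()
    return float((z ** 3).mean())


for n in (1, 5, 30, 200):
    means = sample_means(n)
    print(f"n={n:3d}  mean={means.mean():.3f}  std={means.std():.3f}  skew={skewness(means):.3f}")
```

As n increases the skewness of the sample means moves toward 0, and their standard deviation shrinks roughly like 1/sqrt(n).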

Point estimate vs. Interval estimate

Point estimate

  • It will be 130 cm.

Interval estimate

  • It will be about 125~135 cm.
  • It will be about 120~140 cm.
  • It will be about 1~300 cm.
    The wider the predicted 'interval', the higher the probability of being right (the confidence level).

Confidence level

A 95% confidence level means that if we draw a sample 100 times and build a confidence interval from each one, about 95 of those intervals will contain the population mean.
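This interpretation can be checked by simulation. The sketch below assumes a hypothetical normal population with a known mean, builds a t-based 95% interval from each sample, and counts how often the interval contains the true mean.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
true_mean, true_std = 130.0, 10.0      # hypothetical population
n, repeats, confidence = 30, 2_000, 0.95

covered = 0
for _ in range(repeats):
    sample = rng.normal(true_mean, true_std, size=n)
    # Half-width of the two-tailed t interval for the mean
    half_width = stats.sem(sample) * stats.t.ppf((1 + confidence) / 2, n - 1)
    lower, upper = sample.mean() - half_width, sample.mean() + half_width
    covered += int(lower <= true_mean <= upper)

coverage = covered / repeats
print(f"fraction of intervals containing the true mean: {coverage:.3f}")
```

The fraction should land close to 0.95, not exactly on it, because the simulation itself is random.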

Setting and interpreting confidence intervals

# Computing a confidence interval
import numpy as np
from scipy import stats

def confidence_interval(data, confidence = 0.95):

  """
  Compute a confidence interval for the sample **mean** of the given data.
  Uses the t-distribution, two-tailed, with a 95% confidence level by default.

  Inputs:
    data - sample observations (list or numpy array)
    confidence - confidence level for the interval

  Returns:
    tuple of (mean, lower bound, upper bound)
  """

  data = np.array(data)
  mean = np.mean(data)
  n = len(data)

  # Standard error of the mean: s / sqrt(n)
  # (https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.sem.html)
  stderr = stats.sem(data)

  # Half-width of the interval; ppf is the inverse of the cdf
  interval = stderr * stats.t.ppf((1 + confidence) / 2, n - 1)
  return (mean, mean - interval, mean + interval)

# cdf -> takes a t value, returns a cumulative probability
# ppf -> takes a cumulative probability, returns a t value

# (1 + 0.95) / 2 -> 0.975
# (1 - 0.95) / 2 -> 0.025
# Using t from scipy.stats (much simpler)
import numpy as np
from scipy.stats import t

# Sample size (sample is the observed data, assumed to be defined beforehand)
n = len(sample)
# Degrees of freedom
dof = n-1
# Sample mean
mean = np.mean(sample)
# Sample standard deviation
sample_std = np.std(sample, ddof = 1)
# Standard error
std_err = sample_std / n ** 0.5 # sample_std / sqrt(n)

CI = t.interval(.95, dof, loc = mean, scale = std_err) # https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.t.html
print("95% confidence interval: ", CI)
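As a sanity check, the manual formula (mean ± standard error × t critical value) and `t.interval` should give the same interval. The sketch below uses a hypothetical set of heights in cm, invented for this example, and also shows how the interval widens as the confidence level rises.

```python
import numpy as np
from scipy import stats

# Hypothetical heights in cm (illustration only)
sample = np.array([128.0, 131.5, 127.2, 133.8, 130.1, 129.4, 132.6, 126.9])
n = len(sample)
mean = sample.mean()
std_err = stats.sem(sample)   # s / sqrt(n), with ddof=1 by default

widths = {}
for confidence in (0.90, 0.95, 0.99):
    # Manual: half-width from the t critical value
    half_width = std_err * stats.t.ppf((1 + confidence) / 2, n - 1)
    manual = (mean - half_width, mean + half_width)
    # Built-in: same interval from scipy
    builtin = stats.t.interval(confidence, n - 1, loc=mean, scale=std_err)
    assert np.allclose(manual, builtin)
    widths[confidence] = manual[1] - manual[0]
    print(f"{confidence:.0%}: ({manual[0]:.2f}, {manual[1]:.2f})")
```

The confidence level is passed positionally to `t.interval` here, which avoids depending on the keyword name of that first parameter.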