[Python] Data Edu ADP Practical Mock Exam 4, Question 2, Python ver. (Statistical Analysis)


ㅅ ㅜ ㅔ ㅇ 2022. 9. 10. 13:47


Dataset: bike_marketing.csv

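Variable	Data type	Description
company_num	Numeric	Company number
google_adwords	Numeric	Spending on Google AdWords
facebook	Numeric	Spending on Facebook ads
twitter	Numeric	Spending on Twitter ads
marketing_total	Numeric	Total marketing budget
revenues	Numeric	Revenue
employees	Numeric	Number of employees
pop_density	Categorical	Population density level of the target market (Low, Medium, High)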

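1. Convert the pop_density variable to a factor variable, run a statistical analysis to test whether mean revenues differ by pop_density, and interpret the result. If the alternative hypothesis is adopted, carry out a post-hoc analysis and interpret it.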

import pandas as pd
import numpy as np
from scipy import stats

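df = pd.read_csv('../data/bike_marketing.csv')
# Convert pop_density to a categorical (factor) variable, as the question asks
df['pop_density'] = df['pop_density'].astype('category')
df.head()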

import seaborn as sns
import matplotlib.pyplot as plt

sns.scatterplot(x='pop_density', y = 'revenues', hue = 'pop_density', style='pop_density',
                s=100, data =df)
plt.show()

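- Null hypothesis: mean revenues are the same across all three population density levels
- Alternative hypothesis: mean revenues differ for at least one of the three population density levels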

from statsmodels.formula.api import ols
from statsmodels.stats.anova import anova_lm

model = ols('revenues ~ C(pop_density)', df).fit()
anova_lm(model)

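The p-value is greater than 0.05, so we fail to reject the null hypothesis. We therefore cannot say that mean revenues differ significantly across the three population density levels, and no post-hoc analysis is needed.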
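Had the ANOVA rejected the null hypothesis, the question calls for a post-hoc analysis. A minimal sketch of one common choice, Tukey's HSD, is shown below. It runs on synthetic data (the `demo` frame and its group means are made up for illustration, not taken from bike_marketing.csv), so the output says nothing about the real dataset.

```python
import numpy as np
import pandas as pd
from statsmodels.stats.multicomp import pairwise_tukeyhsd

# Synthetic stand-in for the real data (assumption: values are invented)
rng = np.random.default_rng(0)
demo = pd.DataFrame({
    "pop_density": np.repeat(["Low", "Medium", "High"], 30),
    "revenues": np.concatenate([
        rng.normal(40, 3, 30),  # Low
        rng.normal(45, 3, 30),  # Medium
        rng.normal(50, 3, 30),  # High
    ]),
})

# Compare every pair of groups while controlling the family-wise error rate
tukey = pairwise_tukeyhsd(endog=demo["revenues"],
                          groups=demo["pop_density"],
                          alpha=0.05)
print(tukey.summary())
```

Each row of the summary is one pair of groups. A `reject` value of True means that pair's means differ at the 5% level.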

2. Using forward selection, run a regression analysis to find out whether google_adwords, facebook, twitter, marketing_total, and employees affect revenues, and interpret the result.

import statsmodels.api as sm

model = sm.OLS.from_formula('revenues ~ google_adwords + facebook + twitter + marketing_total + employees', data = df).fit()
model.summary()
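The code above fits all five predictors at once. Forward selection instead starts from an intercept-only model and adds one variable at a time. A minimal sketch, assuming AIC as the selection criterion (the question does not specify one), is shown below. `forward_select` is a hypothetical helper, not a statsmodels function, and the data is synthetic.

```python
import numpy as np
import pandas as pd
from statsmodels.formula.api import ols

def forward_select(data, response, candidates):
    """Greedy forward selection: at each step add the predictor that lowers AIC the most."""
    selected = []
    remaining = list(candidates)
    # Start from the intercept-only model
    best_aic = ols(f"{response} ~ 1", data).fit().aic
    while remaining:
        # AIC of every model obtained by adding one remaining predictor
        trials = []
        for var in remaining:
            formula = f"{response} ~ " + " + ".join(selected + [var])
            trials.append((ols(formula, data).fit().aic, var))
        aic, var = min(trials)
        if aic >= best_aic:
            break  # no candidate improves AIC, so stop
        best_aic = aic
        selected.append(var)
        remaining.remove(var)
    return selected, best_aic

# Synthetic data (assumption): y depends on x1 and x2 but not on noise
rng = np.random.default_rng(1)
n = 200
demo = pd.DataFrame({
    "x1": rng.normal(size=n),
    "x2": rng.normal(size=n),
    "noise": rng.normal(size=n),
})
demo["y"] = 3.0 * demo["x1"] + 1.5 * demo["x2"] + rng.normal(scale=0.5, size=n)

selected, aic = forward_select(demo, "y", ["x1", "x2", "noise"])
print(selected, round(aic, 2))
```

On the exam data the call would look like `forward_select(df, 'revenues', ['google_adwords', 'facebook', 'twitter', 'marketing_total', 'employees'])`, and the final model would be refit on the returned variables.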

3. After selecting variables with forward selection, carry out a residual analysis on the resulting regression model and interpret the result.

- Durbin-Watson

from statsmodels.stats.stattools import durbin_watson

durbin_watson(model.resid)
# 2.1113783728609468

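The Durbin-Watson statistic is close to 2, so there is no sign of first-order autocorrelation in the residuals and the independence assumption is satisfied.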
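To see why a value near 2 points to independence, recall that the statistic is the sum of squared differences between consecutive residuals divided by the sum of squared residuals. The sketch below computes it by hand and compares it with `durbin_watson`. The residuals are synthetic independent draws, an assumption made purely for illustration.

```python
import numpy as np
from statsmodels.stats.stattools import durbin_watson

# Synthetic, independent residuals (assumption: not from the fitted model)
rng = np.random.default_rng(2)
resid = rng.normal(size=500)

# DW = sum((e_t - e_{t-1})^2) / sum(e_t^2)
dw_manual = np.sum(np.diff(resid) ** 2) / np.sum(resid ** 2)
dw_library = durbin_watson(resid)
print(dw_manual, dw_library)
```

Values well below 2 suggest positive autocorrelation, and values well above 2 suggest negative autocorrelation.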

- Shapiro-Wilk test

shapiro_test = stats.shapiro(model.resid)
shapiro_test

# ShapiroResult(statistic=0.9865776300430298, pvalue=0.09909190982580185)

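The p-value is greater than 0.05, so we fail to reject the null hypothesis that the residuals are normally distributed -> the normality assumption is treated as satisfied.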

- Homoscedasticity of the residuals

fitted = model.predict(df)
residual = df['revenues'] - fitted

# Recent seaborn versions require x and y as keyword arguments
sns.regplot(x=fitted, y=residual, lowess=True, line_kws={'color': 'red'})
plt.plot([fitted.min(), fitted.max()], [0, 0], '--', color='grey')
plt.show()

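- Residuals vs. fitted values (checks the homoscedasticity assumption)
- The red LOWESS trend line has a roughly parabolic shape, so we cannot say the residuals are evenly scattered around their mean of 0.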

- Q-Q plot (normality assumption)

from statsmodels.graphics.gofplots import qqplot
qqplot(model.resid, line='s')
plt.show()

Most of the points lie on the reference line, so the normality assumption is judged to be satisfied.

- Scale-Location (checks the homoscedasticity assumption)

# Approximate the standardized residuals with a z-score of the raw residuals
sr = stats.zscore(residual)
sns.regplot(x=fitted, y=np.sqrt(np.abs(sr)), lowess=True, line_kws={'color': 'red'})
plt.show()

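- The red line should be nearly flat (slope close to 0), but it drifts somewhat as the fitted values increase. Where the red line departs from a zero slope, the standardized residuals are large, which means the regression line does not fit y well in that region. The points there may also be outliers.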

- Extreme values (Cook's distance)

from statsmodels.stats.outliers_influence import OLSInfluence
cd, _ = OLSInfluence(model).cooks_distance
cd.sort_values(ascending=False).head()
