
Data Science LAB

[Python] Linear Regression Analysis

ㅅ ㅜ ㅔ ㅇ 2022. 8. 22. 22:37

Linear regression is a statistical technique that estimates the effect of one or more explanatory variables on a dependent variable and expresses it as an equation. Unlike many machine learning methods, the fitted model is an explicit equation, which makes it easier to interpret.

 

Evaluating a linear regression

SST : total variation

SSE : explained variation

SSR : unexplained variation (note that some textbooks swap the meanings of SSE and SSR)

The coefficient of determination, R^2 = SSE / SST = 1 - SSR / SST, is the proportion of the total variation that is explained.

In other words, it indicates how well the fitted regression line explains the data as a whole. The higher the value, the more the predictions or estimates made from the fitted line can be trusted.
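The decomposition can be checked numerically. The following is a minimal sketch that assumes a small made-up data set (the arrays `x` and `y` are hypothetical, not from this post) and fits the line with `numpy.polyfit`. It uses this post's naming, where SSE is the explained part and SSR is the unexplained part.

```python
import numpy as np

# Hypothetical toy data, made up for illustration only.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 7.8, 10.1])

# Least-squares fit of y = b0 + b1 * x (polyfit returns the slope first).
b1, b0 = np.polyfit(x, y, deg=1)
y_hat = b0 + b1 * x

sst = np.sum((y - y.mean()) ** 2)      # total variation
sse = np.sum((y_hat - y.mean()) ** 2)  # explained variation (SSE in this post)
ssr = np.sum((y - y_hat) ** 2)         # unexplained variation (SSR in this post)

r_squared = sse / sst
print("SST:", sst, "SSE:", sse, "SSR:", ssr)
print("R^2:", r_squared)
```

For a least-squares fit with an intercept, SST equals SSE + SSR, so both forms of R^2 give the same number.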

 

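RMSE is the root mean squared error: the square root of the sum of squared differences between the predicted and observed values divided by the number of samples. In regression output, the unexplained sum of squares (SSR in the notation above) is often divided by its degrees of freedom (n-2 for simple linear regression) instead of n before taking the square root. This gives the residual standard error, which is close to the RMSE when n is large. The lower the RMSE, the better the predictive performance.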
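The difference between dividing by n and dividing by n-2 is easy to see in a small sketch. The values below are hypothetical and only illustrate the arithmetic.

```python
import numpy as np

# Hypothetical observed and predicted values, for illustration only.
y_true = np.array([3.0, 5.0, 7.0, 9.0])
y_pred = np.array([2.5, 5.5, 7.0, 9.5])

n = len(y_true)
ss_resid = np.sum((y_true - y_pred) ** 2)  # unexplained sum of squares

rmse = np.sqrt(ss_resid / n)        # divide by the number of samples
rse = np.sqrt(ss_resid / (n - 2))   # divide by degrees of freedom (simple regression)

print("RMSE:", rmse)
print("Residual standard error:", rse)
```

With only four points the two values differ noticeably. As n grows, the gap shrinks.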

 

statsmodels.formula.api.ols(formula, data, subset=None, drop_cols=None, *args, **kwargs)
model.summary() : summary of the model fit results
model.params : regression coefficients of the variables
model.predict() : predicted values for new data
(here model is the fitted result returned by ols(...).fit())
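As a quick illustration of these three calls, here is a minimal sketch on synthetic data. The DataFrame `df`, its columns `x` and `y`, and the true coefficients 3 and 2 are assumptions made up for this example.

```python
import numpy as np
import pandas as pd
from statsmodels.formula.api import ols

# Hypothetical data: y depends linearly on x plus a little noise.
rng = np.random.default_rng(0)
df = pd.DataFrame({"x": np.arange(20, dtype=float)})
df["y"] = 3.0 + 2.0 * df["x"] + rng.normal(scale=0.5, size=len(df))

# Fit the model described by the formula string.
results = ols("y ~ x", data=df).fit()

print(results.summary())  # full table of fit statistics
print(results.params)     # coefficients, indexed by 'Intercept' and 'x'

# Predict for new rows; the column names must match the formula.
new_data = pd.DataFrame({"x": [20.0, 21.0]})
pred = results.predict(new_data)
print(pred)
```

Passing a DataFrame to `predict` keeps the column names explicit, which is what the formula interface uses to build the design matrix.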

 

import pandas as pd
import numpy as np

house = pd.read_csv('../data/kc_house_data.csv')
house = house[['price', 'sqft_living']]

# ๋…๋ฆฝ๋ณ€์ˆ˜์™€ ์ข…์†๋ณ€์ˆ˜์˜ ์„ ํ˜• ๊ฐ€์ •
house.corr()

The correlation coefficient between the independent and dependent variables is 0.7, which confirms a positive correlation.

 

from statsmodels.formula.api import ols
import matplotlib.pyplot as plt

y = house['price']
X = house['sqft_living']

# Fit a simple linear regression model
lr = ols('price ~ sqft_living', data=house).fit()
y_pred = lr.predict(X)

plt.scatter(X,y)
plt.plot(X, y_pred, color='red')
plt.xlabel('sqft_living', fontsize=10)
plt.ylabel('price', fontsize=10)
plt.title('Linear Regression Result')
plt.show()

The data show a pattern that a simple linear regression cannot adequately explain.

 

lr.summary()

 

Multiple linear regression

- When there are two or more independent variables, multicollinearity must be checked and addressed.
- Compute the correlation coefficients between the independent variables to inspect their relationship directly; a correlation of 0.9 or higher is taken to indicate multicollinearity.
- Regress one of the independent variables suspected of multicollinearity on the other and compute the tolerance (1 - R^2); a tolerance of 0.1 or lower indicates a serious multicollinearity problem.
- A VIF of 10 or higher indicates multicollinearity.
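The tolerance and VIF thresholds above are two views of the same quantity, since VIF = 1 / tolerance. Below is a minimal sketch that computes both by hand with NumPy. The helper `tolerance_and_vif` and the synthetic columns `a`, `b`, `c` are hypothetical names introduced only for this illustration.

```python
import numpy as np

def tolerance_and_vif(features, j):
    """Regress column j on the other columns (plus an intercept)."""
    target = features[:, j]
    others = np.delete(features, j, axis=1)
    design = np.column_stack([np.ones(len(target)), others])
    coef, *_ = np.linalg.lstsq(design, target, rcond=None)
    resid = target - design @ coef
    r2 = 1.0 - (resid @ resid) / np.sum((target - target.mean()) ** 2)
    tol = 1.0 - r2           # tolerance
    return tol, 1.0 / tol    # VIF is the reciprocal of the tolerance

# Hypothetical data: b is almost a copy of a, c is unrelated.
rng = np.random.default_rng(0)
a = rng.normal(size=200)
b = a + rng.normal(scale=0.1, size=200)
c = rng.normal(size=200)
features = np.column_stack([a, b, c])

for j, name in enumerate(["a", "b", "c"]):
    tol, vif = tolerance_and_vif(features, j)
    print(name, "tolerance:", round(tol, 3), "VIF:", round(vif, 1))
```

In this setup `a` and `b` should come out with a tolerance well below 0.1 (VIF well above 10), while `c` stays close to 1.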
import pandas as pd
cars = pd.read_csv('../data/Cars93.csv')
cars.info()

import numpy as np
import statsmodels.api as sm
import statsmodels.formula.api as smf

# Remove the special character '.' from column names (regex=False treats '.' literally)
cars.columns = cars.columns.str.replace('.', '', regex=False)
model = smf.ols(formula='Price ~ EngineSize + RPM + Weight + Length + MPGcity + MPGhighway', data=cars)
result = model.fit()
result.summary()

 

cars[['EngineSize', 'RPM', 'Weight', 'Length', 'MPGcity', 'MPGhighway']].corr()

 

from patsy import dmatrices
from statsmodels.stats.outliers_influence import variance_inflation_factor

y, X = dmatrices('Price ~ EngineSize + RPM + Weight + Length + MPGcity + MPGhighway',
                 data = cars, return_type ='dataframe')

# ๋…๋ฆฝ๋ณ€์ˆ˜๋ผ๋ฆฌ์˜ VIF๊ฐ’์„ ๊ณ„์‚ฐํ•˜์—ฌ ๋ฐ์ดํ„ฐ ํ”„๋ ˆ์ž„์œผ๋กœ ๋งŒ๋“ฆ
vif_list = []
for i in range(1, len(X.columns)):
    vif_list.append([variance_inflation_factor(X.values, i), X.columns[i]])
    
pd.DataFrame(vif_list, columns=['vif', 'variable'])

 

model = smf.ols(formula = 'Price ~ EngineSize + RPM + Weight + MPGhighway', data = cars)
result = model.fit()
result.summary()

import time
import itertools

def processSubset(X,y, feature_set):
    model = sm.OLS(y, X[list(feature_set)])
    regr = model.fit()
    AIC = regr.aic
    return {'model':regr, 'AIC':AIC}

# Forward selection
def forward(X, y, predictors):
    # Candidate variables are those not already in predictors
    remaining_predictors = [p for p in X.columns.difference(['Intercept']) if p not in predictors]
    results = []
    for p in remaining_predictors:
        results.append(processSubset(X=X, y=y, feature_set=predictors + [p] + ['Intercept']))

    # Convert the results to a DataFrame
    models = pd.DataFrame(results)

    # Select the model with the lowest AIC
    best_model = models.loc[models['AIC'].argmin()]
    print('Processed ', models.shape[0], 'models on', len(predictors) + 1, 'predictors')
    print('Selected predictors : ', best_model['model'].model.exog_names,
          'AIC : ', best_model['AIC'])
    return best_model
    
# Backward elimination
def backward(X, y, predictors):
    tic = time.time()
    results = []

    # Try every subset that drops exactly one predictor
    for combo in itertools.combinations(predictors, len(predictors) - 1):
        results.append(processSubset(X=X, y=y, feature_set=list(combo) + ['Intercept']))

    models = pd.DataFrame(results)

    # Select the model with the lowest AIC
    best_model = models.loc[models['AIC'].argmin()]
    toc = time.time()

    print('Processed ', models.shape[0],
          'models on', len(predictors) - 1,
          'predictors in', (toc - tic))
    print('Selected predictors : ', best_model['model'].model.exog_names,
          'AIC :', best_model['AIC'])
    return best_model

# Stepwise selection
def Stepwise_model(X,y):
    Stepmodels = pd.DataFrame(columns = ['AIC','model'])
    tic = time.time()
    predictors = []
    Smodel_before = processSubset(X,y,predictors+['Intercept'])['AIC']
    
    for i in range(1, len(X.columns.difference(['Intercept'])) +1):
        Forward_result = forward(X=X, y=y, predictors=predictors)
        print('forward')
        Stepmodels.loc[i] = Forward_result
        predictors = Stepmodels.loc[i]['model'].model.exog_names
        predictors = [k for k in predictors if k!='Intercept']
        Backward_result = backward(X=X, y=y, predictors=predictors)
        
        if Backward_result['AIC'] < Forward_result['AIC']:
            Stepmodels.loc[i] = Backward_result
            predictors = Stepmodels.loc[i]['model'].model.exog_names
            Smodel_before = Stepmodels.loc[i]['AIC']
            predictors = [k for k in predictors if k!='Intercept']
            print('backward')
            
        if Stepmodels.loc[i]['AIC'] > Smodel_before:
            break
        
        else:
            Smodel_before = Stepmodels.loc[i]['AIC']
            
    toc = time.time()
    print('Total elapsed time : ',(toc-tic), 'Seconds.')
    
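    # Return the stored model with the lowest AIC (the last row can be worse if the loop stopped early)
    return Stepmodels.loc[Stepmodels['AIC'].astype(float).idxmin(), 'model']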

 

Stepwise_best_model = Stepwise_model(X=X, y=y)
Stepwise_best_model.summary()

 

ํŒŒ์ด์ฌ์ด ํŽธํ•˜๊ธด ํ•˜์ง€๋งŒ ๋ณ€์ˆ˜์„ ํƒ๋ฒ• ๋งŒํผ์€ ์ฝ”๋“œ ๊ตฌํ˜„ํ•˜๋Š”๊ฒŒ ๋„ˆ๋ฌด ๋ณต์žกํ•ด์„œ ๋งŒ์•ฝ ADP ์‹œํ—˜์— ๋‚˜์˜จ๋‹ค๋ฉด,,, R๋กœ ํ•  ๊ฒƒ๊ฐ™๋‹ค
