
Data Science LAB

[Python] PCA Example

🛠 Machine Learning/Dimensionality Reduction

ㅅ ㅜ ㅔ ㅇ 2022. 3. 6. 21:11

Previous post (2022.03.05): [Python] PCA(Principal Component Analysis)

PCA (Principal Component Analysis) is the most representative dimensionality reduction technique: it exploits the correlations that exist among variables to extract the principal components that represent them, and reduces the dimensionality that way.

suhye.tistory.com

In this post I practice the PCA covered in the previous post on a different dataset.

 

 

 

 

Downloading the dataset

https://archive.ics.uci.edu/ml/datasets/default+of+credit+card+clients 

 

The data comes from the UCI Machine Learning Repository ("default of credit card clients" Data Set), which covers customers' default payments in Taiwan.

 

 

 

 

Loading the dataset

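import pandas as pd
# Reading a legacy .xls file with pandas requires the xlrd package.
# header=1 uses the second row as column names; iloc[0:, 1:] drops the first (ID) column.
df = pd.read_excel('default of credit card clients.xls',header=1,sheet_name='Data').iloc[0:,1:]
print(df.shape)
df.head()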

 

# Rename columns, then split the DataFrame into features and target
df.rename(columns = {'PAY_0':'PAY1','default payment next month':'default'},inplace=True)
y_target = df['default']
X_features = df.drop('default',axis=1)

 

 

 

 

Visualizing the correlations between features with a heatmap

import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline

corr = X_features.corr()
plt.figure(figsize=(14,14))
sns.heatmap(corr,annot=True,fmt='.1g')

The six features BILL_AMT1 through BILL_AMT6 are very highly correlated with one another, mostly at 0.9 or above.

The six PAY columns (PAY1 and PAY_2 through PAY_6) are also highly correlated. For features this strongly correlated, a small number of principal components is naturally enough to capture most of their variance.
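The same reading of the heatmap can be done programmatically. The sketch below is only an illustration: it uses a small synthetic DataFrame (the column names `BILL_A`, `BILL_B`, `BILL_C`, `OTHER` and the helper `high_corr_pairs` are made up for this example, not part of the dataset) and lists every column pair whose absolute correlation is at or above a threshold.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the credit card data (an assumption for illustration):
# three "bill" columns that share one latent signal, plus one unrelated column.
rng = np.random.default_rng(0)
base = rng.normal(size=500)
demo = pd.DataFrame({
    "BILL_A": base + 0.1 * rng.normal(size=500),
    "BILL_B": base + 0.1 * rng.normal(size=500),
    "BILL_C": base + 0.1 * rng.normal(size=500),
    "OTHER": rng.normal(size=500),
})

def high_corr_pairs(frame, threshold=0.9):
    """Return (col_a, col_b, corr) for every column pair with |corr| >= threshold."""
    corr = frame.corr()
    cols = corr.columns
    pairs = []
    for i in range(len(cols)):
        for j in range(i + 1, len(cols)):
            value = corr.iloc[i, j]
            if abs(value) >= threshold:
                pairs.append((cols[i], cols[j], round(float(value), 3)))
    return pairs

pairs = high_corr_pairs(demo)
for col_a, col_b, value in pairs:
    print(col_a, col_b, value)
```

On the real data, calling the same helper on `X_features` should surface the BILL_AMT and PAY groups that stand out in the heatmap, though the exact pairs depend on the threshold chosen.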

 

 

 

 

BILL_AMT1 through BILL_AMT6 are very highly correlated, so we reduce them to 2 components with PCA

from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Build the six BILL_AMT column names
cols_bill = ['BILL_AMT'+str(i) for i in range(1,7)]
print("Target columns:",cols_bill)

# Create a PCA object with 2 components and call fit() so that explained_variance_ratio_ is computed
scaler = StandardScaler()
df_cols_scaled = scaler.fit_transform(X_features[cols_bill])
pca = PCA(n_components=2)
pca.fit(df_cols_scaled)
print("Explained variance ratio per PCA component:",pca.explained_variance_ratio_)

Just two PCA components explain about 95% or more of the variance of the six features, and the first PCA axis alone accounts for about 90% of it.
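A rough way to see why the first axis captures so much: for standardized features, PCA works on the correlation matrix, and if k features all share the same pairwise correlation r, the largest eigenvalue is 1 + (k-1)r, so the first component explains (1 + (k-1)r) / k of the variance. The sketch below checks this with NumPy under the simplifying assumption that all six features have a pairwise correlation of exactly 0.9 (the real BILL_AMT correlations vary, so this is only an approximation of the result above).

```python
import numpy as np

k, r = 6, 0.9  # six features, assumed common pairwise correlation of 0.9

# Correlation matrix with 1 on the diagonal and r everywhere else.
corr = np.full((k, k), r)
np.fill_diagonal(corr, 1.0)

# For standardized features, the PCA eigenvalues are the eigenvalues
# of the correlation matrix; eigvalsh returns them in ascending order.
eigvals = np.linalg.eigvalsh(corr)[::-1]
ratios = eigvals / eigvals.sum()

# The first ratio equals (1 + (k - 1) * r) / k = 5.5 / 6, roughly 0.917.
print(ratios.round(4))
```

Under this idealized assumption the first component alone explains about 92% of the variance, which is in line with the roughly 90% reported for the real BILL_AMT columns.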

 

 

 

Original dataset vs. PCA-transformed dataset: comparing classification results

- Applying a random forest to the original dataset

import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

rf = RandomForestClassifier(n_estimators=300,random_state = 156)
scores = cross_val_score(rf,X_features,y_target,scoring='accuracy',cv=3)

print("Accuracy per fold with CV=3:",scores)
print("Mean accuracy: {0:.4f}".format(np.mean(scores)))

 

 

- Applying a random forest to the PCA-transformed dataset

from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
df_scaled = scaler.fit_transform(X_features)

pca = PCA(n_components = 6)
df_pca = pca.fit_transform(df_scaled)
scores_pca = cross_val_score(rf,df_pca,y_target,scoring='accuracy',cv=3)

print("Accuracy per fold with CV=3 on the PCA data:",scores_pca)
print("Mean accuracy on the PCA-transformed dataset: {0:.4f}".format(np.mean(scores_pca)))

Using only 6 PCA components instead of all 23 features, prediction accuracy dropped by only about 1 to 2% compared with classification on the original data. Keeping this level of predictive performance with roughly a quarter of the original number of features shows how effective PCA is.
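One caveat about the comparison: the scaler and PCA above are fitted on the whole dataset before cross-validation, so each validation fold has already influenced the transformation. A possible variant, sketched here rather than taken from the original experiment, wraps the steps in a scikit-learn pipeline so that scaling and PCA are refitted on the training part of each fold. The data is a synthetic stand-in generated with `make_classification` (an assumption, chosen to have 23 features like the credit card set), and `n_estimators=50` is used only to keep the example fast.

```python
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (assumption): 23 features, several of them redundant.
X_demo, y_demo = make_classification(
    n_samples=600, n_features=23, n_informative=6, n_redundant=10, random_state=156
)

# Scaling and PCA are fitted inside each CV fold, on the training split only.
pipe = make_pipeline(
    StandardScaler(),
    PCA(n_components=6),
    RandomForestClassifier(n_estimators=50, random_state=156),
)
scores_pipe = cross_val_score(pipe, X_demo, y_demo, scoring="accuracy", cv=3)
print(scores_pipe, scores_pipe.mean())
```

To try this on the real data, `X_demo` and `y_demo` would be replaced with `X_features` and `y_target`; whether the accuracy changes noticeably is something to measure, not something this sketch shows.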

 

 

 

 
