
[Python] RandomForest

ㅅ ㅜ ㅔ ㅇ 2022. 8. 18. 17:21


Random Forest uses bagging: with bagging, several classifiers are built from the same base algorithm and the final decision is made by voting. It is relatively fast among ensemble algorithms and performs well in a wide range of domains. Because its base algorithm is the decision tree, it also inherits the decision tree's advantage of being easy and intuitive.

In a random forest, several decision tree classifiers each sample their own training data from the full dataset via bagging and train individually; the final prediction is then made by a vote across all the classifiers. The data each individual tree learns from is drawn from the full dataset so that the samples partially overlap; splitting the data into several overlapping samples like this is called bootstrapping.
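A minimal sketch of the bootstrap-and-vote idea with NumPy (illustrative, not from the original post): each tree draws its sample with replacement, so samples overlap, and the ensemble prediction is a majority vote.

import numpy as np

rng = np.random.default_rng(0)
n_samples, n_trees = 10, 3

# Bootstrapping: each tree draws n_samples indices with replacement,
# so rows repeat within a sample and the samples overlap across trees.
samples = [rng.choice(n_samples, size=n_samples, replace=True)
           for _ in range(n_trees)]
for t, idx in enumerate(samples):
    print(f'tree {t}: {sorted(idx)}')

# Voting: the ensemble prediction is the majority class over the trees.
tree_preds = np.array([[1, 0, 1],   # tree 0's predictions for 3 test rows
                       [1, 1, 0],   # tree 1
                       [1, 0, 0]])  # tree 2
majority = (tree_preds.mean(axis=0) >= 0.5).astype(int)
print('ensemble vote:', majority)  # [1 0 0]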

 

<Advantages>
- Relatively fast among ensemble algorithms
- Good performance in a wide range of domains
- Inherits the decision tree's easy, intuitive nature

<Disadvantages>
- Has very many hyperparameters, and tuning them takes a lot of time (see the sketch below)
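To make that tuning cost concrete, here is a minimal grid-search sketch (the grid values and the use of scikit-learn's built-in breast-cancer data are illustrative assumptions, not settings from this post); every combination is refit once per cross-validation fold, which is why tuning gets slow.

from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

X, y = load_breast_cancer(return_X_y=True)

# Hypothetical grid: 2 * 3 * 2 = 12 combinations, each fit 5 times (cv=5)
param_grid = {
    'n_estimators': [100, 300],
    'max_depth': [None, 5, 10],
    'min_samples_leaf': [1, 5],
}

grid = GridSearchCV(RandomForestClassifier(random_state=1),
                    param_grid, cv=5, n_jobs=-1)
grid.fit(X, y)
print(grid.best_params_)
print(round(grid.best_score_, 4))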

 

RandomForestClassifier

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split

breast = pd.read_csv('./data/breast-cancer.csv')

breast['diagnosis'] = np.where(breast['diagnosis'] == 'M', 1, 0)  # malignant (M) -> 1, benign -> 0
features = ['area_mean','texture_mean']
X = breast[features]
y = breast['diagnosis']

X_train, X_test, y_train, y_test = train_test_split(X,y, test_size=0.3, stratify=y, random_state=1)
print(X_train.shape, X_test.shape)
print(y_train.shape, y_test.shape)

# (398, 2) (171, 2)
# (398,) (171,)
from sklearn.ensemble import RandomForestClassifier

clf = RandomForestClassifier(n_estimators=100, min_samples_leaf=5)
pred = clf.fit(X_train, y_train).predict(X_test)
print('Accuracy : ', clf.score(X_test, y_test))

# Accuracy :  0.9064327485380117
pred = clf.predict(X_test)

from sklearn.metrics import (confusion_matrix, accuracy_score,
                             precision_score, recall_score, f1_score)

test_cm = confusion_matrix(y_test, pred)
test_acc = accuracy_score(y_test, pred)
test_prc = precision_score(y_test, pred)
test_rcll = recall_score(y_test, pred)
test_f1 = f1_score(y_test, pred)

print(test_cm)
print('Accuracy\t{}'.format(round(test_acc*100,2)))
print('Precision\t{}'.format(round(test_prc*100,2)))
print('Recall\t{}'.format(round(test_rcll*100,2)))

# [[103   4]
#  [ 12  52]]
# Accuracy	90.64
# Precision	92.86
# Recall	81.25
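As a quick sanity check (a sketch, not part of the original post), the precision and recall above follow directly from the confusion matrix: with TN=103, FP=4, FN=12, TP=52,

# Recompute the metrics by hand from the confusion matrix above
tn, fp, fn, tp = 103, 4, 12, 52

precision = tp / (tp + fp)                    # 52/56   -> 92.86
recall    = tp / (tp + fn)                    # 52/64   -> 81.25
accuracy  = (tp + tn) / (tp + tn + fp + fn)   # 155/171 -> 90.64

print(round(precision*100, 2), round(recall*100, 2), round(accuracy*100, 2))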
# plot_roc_curve was removed in scikit-learn 1.2; RocCurveDisplay replaces it
from sklearn.metrics import RocCurveDisplay

RocCurveDisplay.from_estimator(clf, X_test, y_test)
plt.show()
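If a single summary number is preferred over the curve, the area under it can be computed from the predicted probabilities (a small sketch reusing clf and the test split from above):

from sklearn.metrics import roc_auc_score

# Probability of the positive class (diagnosis == 1, i.e. 'M')
proba = clf.predict_proba(X_test)[:, 1]
print('AUC :', roc_auc_score(y_test, proba))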

 

importances = clf.feature_importances_
column_nm = pd.DataFrame(['area_mean', 'texture_mean'])
feature_importances = pd.concat([column_nm,
                                 pd.DataFrame(importances)],
                                axis=1)
feature_importances.columns = ['feature_nm','importances']
print(feature_importances)

The feature importances come out to roughly 0.74 for 'area_mean' and 0.26 for 'texture_mean'.

plt.bar(range(len(importances)), importances)
plt.xticks(range(len(features)), features)
plt.show()

 

 

RandomForestRegressor

car = pd.read_csv(r'./data/CarPrice_Assignment.csv')
car_num = car.select_dtypes(['number'])  # keep numeric columns only
# every numeric column except the ID, symboling, and the target (price)
features = list(car_num.columns.difference(['car_ID','symboling','price']))

X = car_num[features]
y = car_num['price']

X_train, X_test, y_train, y_test = train_test_split(X,y, test_size=0.3, random_state=1)

print(X_train.shape, X_test.shape)
print(y_train.shape, y_test.shape)

# (143, 13) (62, 13)
# (143,) (62,)
from sklearn.ensemble import RandomForestRegressor

reg = RandomForestRegressor()  # default n_estimators=100
pred = reg.fit(X_train, y_train).predict(X_test)

from sklearn.metrics import mean_squared_error, mean_absolute_error

mse = mean_squared_error(y_test, pred)
rmse = np.sqrt(mse)                  # RMSE, in the same units as price
mae = mean_absolute_error(y_test, pred)
acc = reg.score(X_test, y_test)      # R^2 for a regressor

print("MSE\t{}".format(round(mse,3)))
print('MAE\t{}'.format(round(mae,3)))
print('ACC\t{}'.format(round(acc*100,3)))

# MSE	4482570.206
# MAE	1385.773
# ACC	92.575
importances = reg.feature_importances_
column_nm = pd.DataFrame(features)
feature_importances = pd.concat([column_nm,
                                 pd.DataFrame(importances)],
                                axis=1)
feature_importances.columns = ['feature_nm', 'importances']
print(feature_importances)

n_features = X_train.shape[1]
importances = reg.feature_importances_
column_nm = features

plt.barh(range(n_features), importances, align='center')
plt.yticks(np.arange(n_features), column_nm)
plt.xlabel('feature_importances')
plt.ylabel('feature')
plt.ylim(-1, n_features)
plt.show()
