
[Python] DecisionTree

ㅅ ㅜ ㅔ ㅇ 2022. 8. 17. 22:40

Decision Tree

Like SVM, the decision tree is a machine learning algorithm that can handle classification, regression, and even multi-output tasks. It automatically learns rules from the data and builds a tree of classification rules. Because the tree generally takes an if/else form, like a game of twenty questions, performance hinges on which criteria are chosen so that each rule produces the most efficient split.

A decision tree makes its prediction by following rules generated from the data, from the root node down to a leaf node. Starting at the root, each decision node spawns branches/subtrees that split the data, and a leaf node finally supplies the predicted value. The more rules there are, the more complex the decision becomes, which makes overfitting likely.
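To make the rule structure concrete (a minimal sketch on the iris data, not part of the original post), scikit-learn can print a fitted tree's if/else path from the root down to each leaf:

from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

iris = load_iris()
tree = DecisionTreeClassifier(max_depth=2, random_state=0).fit(iris.data, iris.target)

# each indented line is one if/else split on the way from the root to a leaf
print(export_text(tree, feature_names=list(iris.feature_names)))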

Two common ways to measure the homogeneity of a node are the information gain index, which is based on entropy, and the Gini index. Entropy expresses how mixed the data is: it is high when different classes are mixed together and low when a single class dominates. The information gain index is defined as (1 - entropy), so purer nodes score higher. The Gini index is 0 when a node is perfectly homogeneous and approaches 1 as the classes become more mixed; since a lower Gini index means higher homogeneity, the tree splits on the attribute that yields the lowest Gini index.
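For a concrete feel of both measures (a minimal sketch, not from the original post), entropy and the Gini index can be computed directly from a node's class proportions:

import numpy as np

def entropy(p):
    # Shannon entropy (log base 2) of a class-probability vector
    p = np.asarray(p, dtype=float)
    p = p[p > 0]
    return -np.sum(p * np.log2(p))

def gini(p):
    # Gini index: 0 for a pure node, rising toward 1 as classes mix
    p = np.asarray(p, dtype=float)
    return 1 - np.sum(p ** 2)

print(entropy([0.5, 0.5]), gini([0.5, 0.5]))  # maximally mixed two-class node: 1.0 and 0.5
print(entropy([1.0, 0.0]), gini([1.0, 0.0]))  # pure node: both measures are 0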

Which quality to emphasize depends on the application. When predicting which customers in a target population are most likely to respond, predictive power matters most. In credit scoring, a rejected applicant must be told why the application failed, so interpretability matters most.



<Advantages>
- Results are easy to explain.
- The model is simple to build.
- Trees can be built quickly even on large datasets.
- Classification is not overly sensitive to abnormal or noisy data.
- Highly correlated variables do not strongly affect the model.
- Little preprocessing is needed; feature scaling in particular is unnecessary.

<Disadvantages>
- High risk of overfitting (see the tuning sketch after this list).
- Errors are large for data points near the classification boundary.
- The relative importance of explanatory variables is hard to judge.
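Because overfitting is the chief weakness, a tree is usually constrained rather than grown to full depth. A minimal sketch (hypothetical parameter grid, synthetic data) of tuning depth and leaf size with GridSearchCV:

from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=1000, n_features=20, random_state=1)

# cap tree depth and minimum leaf size instead of growing an unrestricted tree
grid = GridSearchCV(DecisionTreeClassifier(random_state=1),
                    param_grid={'max_depth': [3, 5, 7, None],
                                'min_samples_leaf': [1, 5, 20]},
                    cv=5)
grid.fit(X, y)
print(grid.best_params_, round(grid.best_score_, 3))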

 

1. Classification
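The post never shows how X and y were created. A stand-in that reproduces the printed shapes and the roughly 30/70 class balance visible in the confusion matrix below (an assumption, not the original data) is scikit-learn's make_classification:

import pandas as pd
from sklearn.datasets import make_classification

# hypothetical stand-in: 1,000 samples x 20 features with a ~30/70 binary target
X, y = make_classification(n_samples=1000, n_features=20,
                           weights=[0.3, 0.7], random_state=1)
X = pd.DataFrame(X, columns=['f{:02d}'.format(i) for i in range(20)])  # named columns; X.columns is used later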

from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, test_size=0.3, random_state=1)

print(X_train.shape)
print(X_test.shape)
print(y_train.shape)
print(y_test.shape)

#(700, 20)
#(300, 20)
#(700,)
#(300,)

from sklearn.tree import DecisionTreeClassifier

clf = DecisionTreeClassifier(max_depth=5)
clf.fit(X_train, y_train)

from sklearn.metrics import confusion_matrix, accuracy_score, precision_score, recall_score, f1_score

pred = clf.predict(X_test)
test_cm = confusion_matrix(y_test, pred)
test_acc = accuracy_score(y_test, pred)
test_prc = precision_score(y_test, pred)
test_rcll = recall_score(y_test, pred)
test_f1 = f1_score(y_test, pred)

print(test_cm,'\n')
print('Accuracy\t{}'.format(round(test_acc*100, 2)))
print('Precision\t{}'.format(round(test_prc*100, 2)))
print('Recall\t{}'.format(round(test_rcll*100, 2)))

# [[ 28  62]
#  [ 25 185]]

# Accuracy	71.0
# Precision	74.9
# Recall	88.1
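As a side note (not in the original post), scikit-learn's classification_report bundles per-class precision, recall, and F1 into one call:

from sklearn.metrics import classification_report

print(classification_report(y_test, pred, digits=3))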

import matplotlib.pyplot as plt
from sklearn.metrics import RocCurveDisplay, roc_auc_score

# RocCurveDisplay.from_estimator replaces plot_roc_curve, which was removed in scikit-learn 1.2
RocCurveDisplay.from_estimator(clf, X_test, y_test)
plt.show()

R_A_score = roc_auc_score(y_test, clf.predict_proba(X_test)[:, 1])
print('ROC_AUC_score : ', R_A_score)

 

# check feature importances
import pandas as pd

importances = clf.feature_importances_
column_nm = pd.DataFrame(X.columns)

feature_importances = pd.concat([column_nm,
                                pd.DataFrame(importances)],
                               axis=1)

feature_importances.columns = ['feature_nm', 'importances']
print(feature_importances)
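A small convenience on top of the post's code: sorting the table puts the most informative features first.

# largest importances first; zeros are features the tree never used for a split
print(feature_importances.sort_values('importances', ascending=False).head(10))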

 

 

2. Regression

import numpy as np
import matplotlib.pyplot as plt
from sklearn.tree import DecisionTreeRegressor

np.random.seed(0)
X = np.sort(5 * np.random.rand(400, 1), axis=0)   # 400 sorted inputs in [0, 5)
T = np.linspace(0, 5, 500)[:, np.newaxis]         # dense input grid (defined but not used below)
y = np.sin(X).ravel()

y += 0.5 - np.random.rand(400)                    # uniform noise in (-0.5, 0.5]
plt.scatter(X, y, s=20, edgecolors='black', c='darkorange', label='data')
plt.show()

 

from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, train_size=0.7, random_state=1)
print(X_train.shape, X_test.shape, y_train.shape, y_test.shape)

# (280, 1) (120, 1) (280,) (120,)

regr_1 = DecisionTreeRegressor(max_depth=2)   # shallow tree
regr_2 = DecisionTreeRegressor(max_depth=5)   # deeper tree

from sklearn.metrics import mean_squared_error, mean_absolute_error
import pandas as pd
import numpy as np

regr_1.fit(X_train, y_train)
regr_2.fit(X_train, y_train)

y_1 = regr_1.predict(X_test)
y_2 = regr_2.predict(X_test)

preds = [y_1, y_2]
weights = ['max_depth = 2', 'max_depth = 5']
evls = ['mse','rmse', 'mae']
results = pd.DataFrame(index=weights, columns=evls)

for pred, nm in zip(preds, weights):
    mse = mean_squared_error(y_test, pred)
    mae = mean_absolute_error(y_test, pred)
    rmse = np.sqrt(mse)

    results.loc[nm, 'mse'] = round(mse, 2)
    results.loc[nm, 'rmse'] = round(rmse, 2)
    results.loc[nm, 'mae'] = round(mae, 2)
    
results

 

from sklearn.metrics import r2_score

# fresh inputs for plotting; the noiseless sine curve serves as ground truth
X_test = np.sort(5 * np.random.rand(40, 1), axis=0)
y_true = np.sin(X_test).ravel()

regrs = [regr_1, regr_2]
depths = ['max_depth = 2', 'max_depth = 5']
model_color = ['m', 'c']
fig, axes = plt.subplots(nrows=1, ncols=2, sharey=True, figsize=(13, 5))

for ix, regr in enumerate(regrs):
    pred = regr.fit(X, y).predict(X_test)
    r2 = r2_score(y_true, pred)
    mae = mean_absolute_error(y_true, pred)
    
    axes[ix].plot(X_test,
                  pred,
                  color = model_color[ix],
                  label = '{}'.format(depths[ix]))
    
    axes[ix].scatter(X,y,
                     s=20,
                     edgecolor='gray',
                     label='data')
    
    axes[ix].legend(loc = 'upper right',
                    ncol=1,
                    fancybox=True,
                    shadow=True)
    
    axes[ix].set_title('R2 : {r}, MAE : {m}'.format(r=round(r2,3), m = round(mae,3)))
    
fig.text(0.5, 0.04, 'data', ha='center', va='center')
fig.text(0.06, 0.5, 'target', ha='center', va='center', rotation='vertical')
fig.suptitle('Decision Tree Regression', fontsize=14)
plt.show()
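The fitted trees themselves can also be drawn (a minimal sketch using scikit-learn's plot_tree on the regr_2 model fitted above, not part of the original post); truncating the display keeps the figure readable:

import matplotlib.pyplot as plt
from sklearn.tree import plot_tree

fig, ax = plt.subplots(figsize=(14, 6))
plot_tree(regr_2, ax=ax, filled=True, max_depth=2)  # draw only the top two levels
plt.show()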
