Data Science LAB

[Python] DBSCAN

ㅅ ㅜ ㅔ ㅇ 2022. 3. 4. 23:24
DBSCAN

DBSCAN is the representative density-based clustering algorithm. It is simple and intuitive, yet it clusters effectively even when the data distribution is geometrically complex.

 

When the data follows a circular distribution, as in the figure above, KMeans and GMM do not cluster it well.

 

  • Epsilon neighborhood (epsilon) : the circular region of radius epsilon centered on an individual data point
  • Minimum number of points (min points) : the number of other data points that a point's epsilon neighborhood must contain

 

Each data point is classified as follows, depending on whether its epsilon neighborhood contains the minimum number of points.

  • Core Point : a data point whose neighborhood contains at least the minimum number of other data points
  • Neighbor Point : any other data point located inside the neighborhood
  • Border Point : a data point that has fewer neighbor points than the minimum, but has a core point among its neighbor points
  • Noise Point : a data point that has fewer neighbor points than the minimum and has no core point among its neighbor points
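
The definitions above can be sketched in a few lines of plain Python. This is only an illustration: the function name `classify_points` and the toy coordinates are made up for this example, and neighbors are counted as *other* points, following the wording above. Note that scikit-learn's `min_samples` also counts the point itself, so the two conventions differ by one.

```python
from math import dist

def classify_points(points, eps, min_pts):
    """Label each point 'core', 'border', or 'noise' using the definitions above."""
    # Neighbor points: other points within distance eps (the point itself is excluded).
    neighbors = [
        [j for j, q in enumerate(points) if j != i and dist(p, q) <= eps]
        for i, p in enumerate(points)
    ]
    # Core points have at least min_pts neighbor points.
    is_core = [len(nb) >= min_pts for nb in neighbors]

    kinds = []
    for i, nb in enumerate(neighbors):
        if is_core[i]:
            kinds.append("core")
        elif any(is_core[j] for j in nb):
            kinds.append("border")  # too few neighbors, but next to a core point
        else:
            kinds.append("noise")   # too few neighbors and no core point nearby
    return kinds

# Hypothetical toy data: a tight group, one point on its edge, one far away.
toy_points = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1), (0.1, 0.1), (0.55, 0.0), (3.0, 3.0)]
kinds = classify_points(toy_points, eps=0.5, min_pts=3)
for point, kind in zip(toy_points, kinds):
    print(point, kind)
```

The point at (0.55, 0.0) has only two neighbors, but both are core points, so it ends up as a border point; the isolated point at (3.0, 3.0) is noise.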

 

Source : Wikipedia

 

 

1. Set epsilon and min points

2. Pick an arbitrary point from the data that satisfies the Core Point condition

3. Collect the density-reachable points and separate them into Core Points and Border Points; points that belong to neither are Noise Points

4. Connect the Core Points that lie within each other's epsilon radius

5. The connected points form a single cluster

6. Assign every Core Point and Border Point to exactly one cluster (a point that falls within reach of several clusters goes to the cluster that claimed it first); Noise Points stay unassigned
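
The steps above can be sketched as a small from-scratch implementation. This is a minimal illustration in plain Python, not scikit-learn's implementation: `simple_dbscan`, `NOISE`, and the toy points are names invented for this example, the neighbor search is a brute-force O(n²) loop, and neighbors are counted without the point itself.

```python
from collections import deque
from math import dist

NOISE = -1  # same convention as scikit-learn: -1 marks noise

def simple_dbscan(points, eps, min_pts):
    """Return one cluster label per point; NOISE (-1) marks noise points."""
    n = len(points)
    # Steps 1-2: find each point's neighbors (excluding itself) and the core points.
    neighbors = [
        [j for j in range(n) if j != i and dist(points[i], points[j]) <= eps]
        for i in range(n)
    ]
    is_core = [len(nb) >= min_pts for nb in neighbors]

    labels = [NOISE] * n
    cluster_id = 0
    for start in range(n):
        if not is_core[start] or labels[start] != NOISE:
            continue
        # Steps 3-5: grow a new cluster outward from an unassigned core point.
        labels[start] = cluster_id
        queue = deque([start])
        while queue:
            current = queue.popleft()
            for j in neighbors[current]:
                if labels[j] != NOISE:
                    continue  # Step 6: already claimed by an earlier cluster
                labels[j] = cluster_id
                if is_core[j]:
                    queue.append(j)  # only core points keep expanding the cluster
        cluster_id += 1
    return labels

# Hypothetical toy data: two small squares and one far-away outlier.
toy_points = [(0, 0), (0, 1), (1, 0), (1, 1),
              (10, 10), (10, 11), (11, 10), (11, 11),
              (50, 50)]
toy_labels = simple_dbscan(toy_points, eps=1.5, min_pts=2)
print(toy_labels)
```

Points that are never reached from any core point simply keep their initial `NOISE` label, which is why no separate noise-detection pass is needed.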

 

 

 

Applying DBSCAN to the iris data

Loading the data and libraries

from sklearn.datasets import load_iris
from sklearn.cluster import KMeans

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
%matplotlib inline

iris = load_iris()
feature_names = ['sepal_length','sepal_width','petal_length','petal_width']

iris_df = pd.DataFrame(data = iris.data,columns=feature_names)
iris_df['target'] = iris.target
iris_df.head()

 

 

 

Applying DBSCAN

from sklearn.cluster import DBSCAN

dbscan = DBSCAN(eps=0.6,min_samples =8, metric='euclidean')
dbscan_labels = dbscan.fit_predict(iris.data)
iris_df['dbscan_cluster'] = dbscan_labels
iris_df['target'] = iris.target

iris_result = iris_df.groupby(['target'])['dbscan_cluster'].value_counts()
print(iris_result)

A cluster label of -1 denotes a Noise Point. Although there are three target classes, DBSCAN actually produced only two clusters, 0 and 1. (This is not necessarily a bad thing!)
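
Since -1 is a noise marker rather than a real cluster, it helps to separate it out when counting clusters. The helper below is a small sketch of that bookkeeping; `summarize_labels` and the example label list are hypothetical and not part of the original notebook. With real output you could pass `dbscan_labels.tolist()`.

```python
from collections import Counter

def summarize_labels(labels):
    """Count clusters and noise points in a DBSCAN-style label sequence."""
    counts = Counter(labels)
    n_noise = counts.pop(-1, 0)  # -1 is the noise label, not a cluster
    return {
        "n_clusters": len(counts),
        "n_noise": n_noise,
        "cluster_sizes": dict(counts),
    }

# Hypothetical labels, in the same format that fit_predict returns.
example_labels = [0, 0, 1, -1, 1, 1, -1, 0]
summary = summarize_labels(example_labels)
print(summary)
```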

 

 

 

 

Compressing to 2 features with PCA

# Function that plots each cluster (and, optionally, the cluster centers)
def visualize_cluster_plot(clusterobj, dataframe, label_name, iscenter=True):
    if iscenter :
        centers = clusterobj.cluster_centers_
        
    unique_labels = np.unique(dataframe[label_name].values)
    markers=['o', 's', '^', 'x', '*']
    isNoise=False

    for label in unique_labels:
        label_cluster = dataframe[dataframe[label_name]==label]
        if label == -1:
            cluster_legend = 'Noise'
            isNoise=True
        else :
            cluster_legend = 'Cluster '+str(label)
        
        plt.scatter(x=label_cluster['ftr1'], y=label_cluster['ftr2'], s=70,\
                    edgecolor='k', marker=markers[label], label=cluster_legend)
        
        if iscenter:
            center_x_y = centers[label]
            plt.scatter(x=center_x_y[0], y=center_x_y[1], s=250, color='white',
                        alpha=0.9, edgecolor='k', marker=markers[label])
            plt.scatter(x=center_x_y[0], y=center_x_y[1], s=70, color='k',\
                        edgecolor='k', marker='$%d$' % label)
    if isNoise:
        legend_loc='upper center'
    else: legend_loc='upper right'
    
    plt.legend(loc=legend_loc)
    plt.show()

from sklearn.decomposition import PCA
pca = PCA(n_components=2,random_state = 0)
pca_transformed = pca.fit_transform(iris.data)

iris_df['ftr1'] = pca_transformed[:,0]
iris_df['ftr2'] = pca_transformed[:,1]

visualize_cluster_plot(dbscan,iris_df,'dbscan_cluster',iscenter = False)

The ⭐ marker denotes noise.

A large number of noise points is visible.

Increasing eps enlarges the radius, so each neighborhood contains more data and the number of noise points decreases.

Increasing min_samples requires more data within the given radius, so the number of noise points increases.
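
Both effects can be checked with a small parameter sweep. The sketch below assumes scikit-learn is installed and uses the same iris data; the helper `count_noise` and the particular grid values are choices made for this illustration, not part of the original experiment.

```python
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.datasets import load_iris

X_iris = load_iris().data

def count_noise(X, eps, min_samples):
    """Fit DBSCAN and return how many points were labeled as noise (-1)."""
    labels = DBSCAN(eps=eps, min_samples=min_samples, metric="euclidean").fit_predict(X)
    return int(np.sum(labels == -1))

# Vary eps with min_samples fixed at 8, then vary min_samples with eps fixed at 0.6.
noise_by_eps = {eps: count_noise(X_iris, eps, 8) for eps in (0.4, 0.6, 0.8, 1.0)}
noise_by_min_samples = {m: count_noise(X_iris, 0.6, m) for m in (4, 8, 16, 32)}

print("noise count by eps:", noise_by_eps)
print("noise count by min_samples:", noise_by_min_samples)
```

The noise count should never go up as eps grows and never go down as min_samples grows, because a larger radius can only add neighbors and a stricter threshold can only remove core points.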

 

 

 

Increasing eps (0.6 => 0.8)

dbscan = DBSCAN(eps=0.8,min_samples = 8, metric='euclidean')
dbscan_labels = dbscan.fit_predict(iris.data)

iris_df['dbscan_cluster'] = dbscan_labels
iris_df['target'] = iris.target

iris_result = iris_df.groupby(['target'])['dbscan_cluster'].value_counts()
print(iris_result)

visualize_cluster_plot(dbscan,iris_df,'dbscan_cluster',iscenter=False)

The number of noise points drops to 3.

 

 

 

Increasing min_samples (8 => 16)

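dbscan = DBSCAN(eps=0.6,min_samples=16,metric='euclidean')
# Refit with the new parameters and store the labels before summarizing
dbscan_labels = dbscan.fit_predict(iris.data)
iris_df['dbscan_cluster'] = dbscan_labels

iris_result = iris_df.groupby(['target'])['dbscan_cluster'].value_counts()
print(iris_result)

visualize_cluster_plot(dbscan,iris_df,'dbscan_cluster',iscenter=False)

Because more data is now required within the same radius, the number of noise points can only stay the same or increase compared with min_samples=8 at eps=0.6.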

 

 

 

 

 

Generating a dataset with make_circles

KMeans VS GMM VS DBSCAN

from sklearn.datasets import make_circles
X,y = make_circles(n_samples=1000, shuffle=True, noise=0.05, random_state =0,factor=0.5)
clusterDF = pd.DataFrame(data=X,columns=['ftr1','ftr2'])
clusterDF['target'] = y

visualize_cluster_plot(None, clusterDF, 'target', iscenter=False)

 

 

 

 

 

KMEANS

from sklearn.cluster import KMeans

kmeans = KMeans(n_clusters=2,max_iter=1000,random_state=0)
kmeans_labels = kmeans.fit_predict(X)
clusterDF['kmeans_cluster'] = kmeans_labels
visualize_cluster_plot(kmeans,clusterDF,'kmeans_cluster',iscenter=True)

The clustering did not work well.

 

 

 

GMM

from sklearn.mixture import GaussianMixture

gmm = GaussianMixture(n_components=2,random_state = 0)
gmm_label = gmm.fit_predict(X)
clusterDF['gmm_cluster'] = gmm_label
visualize_cluster_plot(gmm,clusterDF,'gmm_cluster',iscenter=False)

Likewise, the clustering did not work well.

 

 

 

 

 

DBSCAN

from sklearn.cluster import DBSCAN

dbscan = DBSCAN(eps=0.2,min_samples=10, metric='euclidean')
dbscan_labels = dbscan.fit_predict(X)
clusterDF['dbscan_cluster'] = dbscan_labels
visualize_cluster_plot(dbscan,clusterDF,'dbscan_cluster',iscenter=False)

The data is clustered accurately.

This confirms that DBSCAN clusters circular datasets well.
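
The visual comparison can also be backed by a number. The sketch below is an addition to the original experiment: it scores each method against the true ring labels with the adjusted Rand index (`adjusted_rand_score`), where 1.0 is a perfect match and values near 0 are chance level. It reuses the parameters from above, with `n_init=10` set explicitly for KMeans; the variable names `X_c`, `y_c`, `predictions`, and `scores` are chosen for this example.

```python
from sklearn.cluster import DBSCAN, KMeans
from sklearn.datasets import make_circles
from sklearn.metrics import adjusted_rand_score
from sklearn.mixture import GaussianMixture

# Same two-ring dataset as above.
X_c, y_c = make_circles(n_samples=1000, shuffle=True, noise=0.05,
                        random_state=0, factor=0.5)

# Fit each model with the parameters used in this post.
predictions = {
    "KMeans": KMeans(n_clusters=2, n_init=10, max_iter=1000,
                     random_state=0).fit_predict(X_c),
    "GMM": GaussianMixture(n_components=2, random_state=0).fit_predict(X_c),
    "DBSCAN": DBSCAN(eps=0.2, min_samples=10,
                     metric="euclidean").fit_predict(X_c),
}

# Adjusted Rand index against the true ring membership.
scores = {name: adjusted_rand_score(y_c, labels)
          for name, labels in predictions.items()}
for name, score in scores.items():
    print(f"{name}: ARI = {score:.3f}")
```

Any noise points that DBSCAN labels as -1 are treated by this score as a separate group, so a few stray noise points lower the value slightly.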
