Data Science LAB


๐Ÿ›  Machine Learning/Clustering

[python] GMM(Gaussian Mixture Model)

ใ…… ใ…œ ใ…” ใ…‡ 2022. 3. 3. 13:46

GMM

GMM ๊ตฐ์ง‘ํ™”๋Š” ๊ตฐ์ง‘ํ™”๋ฅผ ์ ์šฉํ•˜๊ณ ์ž ํ•˜๋Š” ๋ฐ์ดํ„ฐ๊ฐ€ ์—ฌ๋Ÿฌ ๊ฐœ์˜ ๊ฐ€์šฐ์‹œ์•ˆ ๋ถ„ํฌ๋ฅผ ๊ฐ€์ง„ ๋ฐ์ดํ„ฐ ์ง‘ํ•ฉ๋“ค์ด ์„ž์—ฌ์„œ ์ƒ์„ฑ๋œ ๊ฒƒ์ด๋ผ๋Š” ๊ฐ€์ •ํ•˜์— ๊ตฐ์ง‘ํ™”๋ฅผ ์ˆ˜ํ–‰ํ•˜๋Š” ๋ฐฉ์‹์ด๋‹ค. ๊ฐ€์šฐ์‹œ์•ˆ ๋ถ„ํฌ๋Š” ์ •๊ทœ ๋ถ„ํฌ(Normal distribution)๋ผ๊ณ ๋„ ํ•˜๋ฉฐ, ์ขŒ์šฐ ๋Œ€์นญํ˜•์˜ ์ข… ํ˜•ํƒœ์ด๋‹ค. GMM์€ ๋ฐ์ดํ„ฐ๋ฅผ ์—ฌ๋Ÿฌ๊ฐœ์˜ ์ •๊ทœ ๋ถ„ํฌ๊ฐ€ ์„ž์ธ ๊ฒƒ์œผ๋กœ ๊ฐ„์ฃผํ•˜์—ฌ ์„ž์ธ ๋ฐ์ดํ„ฐ ๋ถ„ํฌ์—์„œ ๊ฐœ๋ณ„ ์œ ํ˜•์˜ ์ •๊ทœ ๋ถ„ํฌ๋ฅผ ์ถ”์ถœํ•œ๋‹ค. 

 

 

 

 

 

์ „์ฒด ๋ฐ์ดํ„ฐ ์…‹์€ ์„œ๋กœ ๋‹ค๋ฅธ ์ •๊ทœ ๋ถ„ํฌ ํ˜•ํƒœ๋ฅผ ๊ฐ€์ง„ ์—ฌ๋Ÿฌ ๊ฐ€์ง€ ํ™•๋ฅ  ๋ถ„ํฌ ๊ณก์„ ์œผ๋กœ ๊ตฌ์„ฑ๋  ์ˆ˜ ์žˆ์œผ๋ฉฐ, ์ด๋ ‡๊ฒŒ ์„œ๋กœ ๋‹ค๋ฅธ ์ •๊ทœ ๋ถ„ํฌ์— ๊ธฐ๋ฐ˜ํ•˜์—ฌ ๊ตฐ์ง‘ํ™”๋ฅผ ์ˆ˜ํ–‰ํ•˜๋Š” ๊ฒƒ์ด GMM ๊ตฐ์ง‘ํ™” ๋ฐฉ์‹์ด๋‹ค. 

 

 

 

 

 

 

 

 

 
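Before applying GMM to iris, the idea can be illustrated with a small sketch (not from the original post): build a dataset by mixing samples from two normal distributions and let `GaussianMixture` recover the individual components.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

# Mix samples from two 1-D normal distributions (means 0 and 5, std 1):
# exactly the kind of "mixed" data that GMM assumes.
rng = np.random.RandomState(0)
data = np.concatenate([rng.normal(0, 1, 200),
                       rng.normal(5, 1, 200)]).reshape(-1, 1)

# Fitting a 2-component GMM should recover the two component means.
gmm = GaussianMixture(n_components=2, random_state=0).fit(data)
print(np.sort(gmm.means_.ravel()))
```

The fitted means land close to the true values 0 and 5, which is the "extracting individual normal distributions" step described above.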

GMM์„ ์ด์šฉํ•œ iris ๋ฐ์ดํ„ฐ์…‹ ๊ตฐ์ง‘ํ™”

from sklearn.datasets import load_iris
from sklearn.cluster import KMeans

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
%matplotlib inline

iris = load_iris()
feature_names = ['sepal_length','sepal_width','petal_length','petal_width']

iris_df = pd.DataFrame(data = iris.data,columns=feature_names)
iris_df['target'] = iris.target
iris_df.head()

from sklearn.mixture import GaussianMixture

gmm = GaussianMixture(n_components=3, random_state = 0).fit(iris.data)
gmm_cluster_labels = gmm.predict(iris.data)

#๊ตฐ์ง‘ํ™” ๊ฒฐ๊ณผ๋ฅผ iris_df์— ์ €์žฅ
iris_df['gmm_cluster'] = gmm_cluster_labels
iris_df['target'] = iris.target

#target๊ฐ’์— ๋”ฐ๋ผ gmm_cluster๊ฐ€ ์–ด๋–ป๊ฒŒ ๋งคํ•‘๋๋Š”์ง€ ํ™•์ธ
gmm_result = iris_df.groupby('target')['gmm_cluster'].value_counts()
print(gmm_result)

The most important parameter of GMM is n_components, the total number of Gaussian mixture components; like the number of clusters in KMeans, it has a decisive effect on the result. Here clustering was performed with GaussianMixture and n_components set to 3.
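When the right n_components is not known in advance, one common approach (a sketch added here, not part of the original post) is model selection with the Bayesian Information Criterion, which scikit-learn's GaussianMixture exposes as `.bic()`:

```python
from sklearn.datasets import load_iris
from sklearn.mixture import GaussianMixture

X = load_iris().data

# Fit GMMs with 1..5 components and compare their BIC scores.
# Lower BIC is better; it penalizes extra components to avoid overfitting.
for n in range(1, 6):
    gmm = GaussianMixture(n_components=n, random_state=0).fit(X)
    print(n, round(gmm.bic(X), 1))
```

Picking the n with the lowest BIC is only a heuristic; domain knowledge (here, three iris species) can override it.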

Target 0 maps cleanly to cluster 0 and target 2 to cluster 1, but 5 of target 1's samples were incorrectly assigned to cluster 2.

iris ๋ฐ์ดํ„ฐ KMeans ์ ์šฉ(n_clusters=3)

kmeans = KMeans(n_clusters = 3, init='k-means++',max_iter=300,random_state=0).fit(iris.data)
kmeans_cluster_labels = kmeans.predict(iris.data)
iris_df['kmeans_cluster'] = kmeans_cluster_labels
iris_result = iris_df.groupby(['target'])['kmeans_cluster'].value_counts()
print(iris_result)

GMM๋ณด๋‹ค KMeans์˜ ์˜ค์ฐจ๊ฐ€ ๋” ํฌ๋‹ค๋Š” ๊ฒƒ์„ ํ™•์ธํ•  ์ˆ˜ ์žˆ์—ˆ๋‹ค. 

 

 

 

 

 

 
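The comparison can also be quantified with a single number. This sketch (an addition, not from the original post) uses the adjusted Rand index, which scores agreement with the true iris labels as 1.0 for a perfect match and roughly 0 for random assignments:

```python
from sklearn.datasets import load_iris
from sklearn.cluster import KMeans
from sklearn.mixture import GaussianMixture
from sklearn.metrics import adjusted_rand_score

iris = load_iris()

# Cluster the same data with both algorithms, then score each labeling
# against the known species labels.
gmm_labels = GaussianMixture(n_components=3, random_state=0).fit_predict(iris.data)
km_labels = KMeans(n_clusters=3, random_state=0).fit_predict(iris.data)

print('GMM ARI   :', adjusted_rand_score(iris.target, gmm_labels))
print('KMeans ARI:', adjusted_rand_score(iris.target, km_labels))
```

On iris, GMM's ARI comes out higher, matching the value_counts comparison above.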

KMeans VS GMM

KMeans๋Š” ๊ฐœ๋ณ„ ๊ตฐ์ง‘์˜ ์ค‘์‹ฌ์—์„œ ์›ํ˜•์˜ ๋ฒ”์œ„๋กœ ๋ฐ์ดํ„ฐ๋ฅผ ๊ตฐ์ง‘ํ™”ํ•˜๊ธฐ์— ์œ ๋ฆฌ

GMM๋Š” ํƒ€์›ํ˜•์˜ ๋ฐ์ดํ„ฐ์…‹์— ์œ ๋ฆฌ

def visualize_cluster_plot(clusterobj, dataframe, label_name, iscenter=True):
    # If the clustering object exposes centers (e.g. KMeans), fetch them
    if iscenter:
        centers = clusterobj.cluster_centers_

    unique_labels = np.unique(dataframe[label_name].values)
    markers = ['o', 's', '^', 'x', '*']
    isNoise = False

    for label in unique_labels:
        label_cluster = dataframe[dataframe[label_name]==label]
        if label == -1:  # DBSCAN marks noise points with label -1
            cluster_legend = 'Noise'
            isNoise = True
        else:
            cluster_legend = 'Cluster ' + str(label)

        plt.scatter(x=label_cluster['ftr1'], y=label_cluster['ftr2'], s=70,
                    edgecolor='k', marker=markers[label], label=cluster_legend)

        # Mark each cluster center with a white circle and its label number
        if iscenter:
            center_x_y = centers[label]
            plt.scatter(x=center_x_y[0], y=center_x_y[1], s=250, color='white',
                        alpha=0.9, edgecolor='k', marker=markers[label])
            plt.scatter(x=center_x_y[0], y=center_x_y[1], s=70, color='k',
                        edgecolor='k', marker='$%d$' % label)

    # Place the legend where it will not overlap the noise points
    legend_loc = 'upper center' if isNoise else 'upper right'
    plt.legend(loc=legend_loc)
    plt.show()

 

ํด๋Ÿฌ์Šคํ„ฐ ๊ฒฐ๊ณผ๋ฅผ ๋‹ด์€ DataFrame๊ณผ ์‚ฌ์ดํ‚ท๋Ÿฐ์˜ Cluster ๊ฐ์ฒด๋“ฑ์„ ์ธ์ž๋กœ ๋ฐ›์•„ ํด๋Ÿฌ์Šคํ„ฐ๋ง ๊ฒฐ๊ณผ๋ฅผ ์‹œ๊ฐํ™”ํ•˜๋Š” ํ•จ์ˆ˜ ์ƒ์„ฑ

 

 

 

 

ํƒ€์›ํ˜•์˜ ๋ฐ์ดํ„ฐ์…‹ ์ƒ์„ฑ

from sklearn.datasets import make_blobs

X,y = make_blobs(n_samples=300,n_features=2,centers=3,cluster_std = 0.5,random_state=0)

#๊ธธ๊ฒŒ ๋Š˜์–ด๋‚œ ํƒ€์›ํ˜•์˜ ๋ฐ์ดํ„ฐ์…‹์„ ์ƒ์„ฑํ•˜๊ธฐ ์œ„ํ•ด ๋ณ€ํ™˜
transformation = [[0.60834549,-0.63667349],[-0.40887718,0.85253229]]
X_aniso = np.dot(X,transformation)

#feature๋ฐ์ดํ„ฐ ์…‹๊ณผ make_blob์˜ ๊ฒฐ๊ณผ๊ฐ’์„ ๋ฐ์ดํ„ฐํ”„๋ ˆ์ž„์œผ๋กœ ์ €์žฅ
clusterDF = pd.DataFrame(data=X_aniso,columns=['ftr1','ftr2'])
clusterDF['target'] = y

#์ƒ์„ฑ๋œ ๋ฐ์ดํ„ฐ์…‹์„ target๋ณ„๋กœ ๋‹ค๋ฅธ ๋งˆ์ปค๋กœ ํ‘œ์‹œํ•ด ์‹œ๊ฐํ™”
visualize_cluster_plot(None,clusterDF,'target',iscenter=False)

KMeans ์ ์šฉ

kmeans = KMeans(3,random_state=0)
kmeans_label = kmeans.fit_predict(X_aniso)
clusterDF['kmeans_label'] = kmeans_label

visualize_cluster_plot(kmeans,clusterDF,'kmeans_label',iscenter=True)

๊ตฐ์ง‘ 0๊ณผ 2๊ฐ€ ์ž˜ ๋ถ„๋ฅ˜๋˜์ง€ ์•Š์•˜๋‹ค. 

 

 

 

 

 

GMM ์ ์šฉ

#3๊ฐœ์˜ n_components๊ธฐ๋ฐ˜ GMM์„ X_aniso ๋ฐ์ดํ„ฐ์…‹์— ์ ์šฉ
gmm = GaussianMixture(n_components=3,random_state=0)
gmm_label = gmm.fit(X_aniso).predict(X_aniso)
clusterDF['gmm_label'] = gmm_label

#GaussianMixture๋Š” cluster_centers์†์„ฑ์ด ์—†์–ด iscluster๋ฅผ False๋กœ ์„ค์ •
visualize_cluster_plot(gmm,clusterDF,'gmm_label',iscenter=False)

This time the clusters are separated well.
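Beyond the hard labels shown above, GMM also produces soft assignments: `predict_proba` returns each point's membership probability for every component, something KMeans cannot provide. A self-contained sketch on the same stretched dataset:

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.mixture import GaussianMixture

# Recreate the elongated dataset used above.
X, _ = make_blobs(n_samples=300, n_features=2, centers=3,
                  cluster_std=0.5, random_state=0)
X_aniso = np.dot(X, [[0.60834549, -0.63667349],
                     [-0.40887718, 0.85253229]])

gmm = GaussianMixture(n_components=3, random_state=0).fit(X_aniso)

# Each row holds one point's membership probability for every component
# and sums to 1 -- a "soft" assignment.
probs = gmm.predict_proba(X_aniso)
print(probs[:3].round(3))
```

Points near a cluster boundary get probabilities spread across components, which can be used to flag ambiguous samples.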

KMeans์™€ GMM๋น„๊ต

print("----- KMeans Clustering -----")
print(clusterDF.groupby('target')['kmeans_label'].value_counts())
print('\n------ Gaussian Mixture Clustering -----')
print(clusterDF.groupby('target')['gmm_label'].value_counts())

GMM์€ KMeans๋ณด๋‹ค ์œ ์—ฐํ•˜๊ฒŒ ๋‹ค์–‘ํ•œ ๋ฐ์ดํ„ฐ์…‹์— ์ž˜ ์ ์šฉ๋˜์ง€๋งŒ ๊ตฐ์ง‘ํ™”๋ฅผ ์œ„ํ•œ ์ˆ˜ํ–‰์‹œ๊ฐ„์ด ์˜ค๋ž˜๊ฑธ๋ฆฐ๋‹ค. 
