Data Science LAB
[Python] KMeans Clustering (K-Means Clustering)
2022. 2. 28. 18:28
What is KMeans Clustering?
KMeans is one of the most widely used clustering algorithms; it partitions a dataset into K clusters.
You first choose the number of cluster centroids (K), and each data point is assigned to the centroid closest to it. Each centroid then moves to the mean of the points assigned to it, points are reassigned to the nearest moved centroid, and the centroid moves to the new mean again. This process repeats until the centroids no longer move.
KMeans Process
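1. Place as many centroids, which serve as the reference points for clustering, as the number of clusters you want, at arbitrary positions
2. Each data point is assigned to the nearest centroid
3. Once every point has been assigned, each centroid moves to the mean of the points assigned to it
4. Assignments are updated to match the new centroid positions
5. Each centroid moves again to the mean of its assigned points
6. Repeat until no assignment changes, then stop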
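The six steps above can be sketched in plain NumPy. This is only an illustrative sketch, not scikit-learn's implementation: the helper name `kmeans_sketch` is hypothetical, and choosing the initial centroids from randomly picked data points is an assumption (the steps only say "arbitrary positions").

```python
import numpy as np

def kmeans_sketch(X, k, max_iter=300, seed=0):
    """Plain Lloyd-style iteration (hypothetical helper, for illustration only)."""
    rng = np.random.default_rng(seed)
    # Step 1: use k distinct data points as the initial centroids (an assumption).
    centers = X[rng.choice(len(X), size=k, replace=False)].astype(float)
    labels = np.full(len(X), -1)
    for _ in range(max_iter):
        # Steps 2 and 4: assign every point to its nearest centroid.
        distances = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        new_labels = distances.argmin(axis=1)
        # Step 6: stop once no point changes its assignment.
        if np.array_equal(new_labels, labels):
            break
        labels = new_labels
        # Steps 3 and 5: move each centroid to the mean of its members.
        for j in range(k):
            members = X[labels == j]
            if len(members) > 0:  # keep the old centroid if a cluster is empty
                centers[j] = members.mean(axis=0)
    return labels, centers

# Two well-separated toy groups of points
X = np.array([[0.0, 0.0], [0.1, 0.2], [0.2, 0.1],
              [5.0, 5.0], [5.1, 5.2], [5.2, 5.1]])
labels, centers = kmeans_sketch(X, k=2)
print(labels)
print(centers)
```

Which group gets label 0 depends on the random initialization, so only the grouping itself is meaningful, not the label numbers.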
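KMeans Advantages
1. The most widely used clustering algorithm; simple and concise
2. It is unsupervised, so no labeled data or prior training on the data is needed
KMeans Disadvantages
1. As a distance-based algorithm, its clustering accuracy drops when the number of features is very large (this can be mitigated by applying PCA dimensionality reduction)
2. Execution becomes very slow when many iterations are needed
3. It is hard to decide the number of clusters (K)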
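One common heuristic for the third disadvantage is to fit KMeans for several values of K and compare the `inertia_` attribute (the sum of squared distances from each point to its nearest centroid). The sketch below assumes the iris data and a range of K from 1 to 6, both chosen only for illustration; `n_init=10` is set explicitly so the result does not depend on the installed version's default.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import load_iris

X = load_iris().data

inertias = []
for k in range(1, 7):
    model = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    # inertia_ is the sum of squared distances to the nearest centroid
    inertias.append(model.inertia_)

for k, value in zip(range(1, 7), inertias):
    print(k, round(value, 2))
```

Inertia keeps shrinking as K grows, so the value of K is usually read off where the decrease flattens out rather than at the minimum.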
KMeans Parameter
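The KMeans class in the scikit-learn package has the following parameters. (The defaults are those at the time of writing; newer scikit-learn releases changed the n_init and algorithm defaults to 'auto' and 'lloyd'.)
Parameter | default | Description |
---|---|---|
n_clusters | 8 | Number of clusters to form (number of centroids, k) |
init | 'k-means++' | How the initial centroid coordinates are chosen |
n_init | 10 | Number of times the algorithm is run with different initial centroids |
max_iter | 300 | Maximum number of iterations; stops earlier if no data point changes cluster before this count |
algorithm | 'auto' | auto, full, elkan |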
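Because some defaults differ between scikit-learn releases, it can help to check them on the installed version. A minimal sketch using `get_params()`, which every scikit-learn estimator provides:

```python
from sklearn.cluster import KMeans

# Defaults of the installed scikit-learn version
params = KMeans().get_params()
for name in ["n_clusters", "init", "n_init", "max_iter", "algorithm"]:
    print(name, "=", params[name])
```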
Clustering the iris dataset with the KMeans algorithm
Loading the required modules and dataset
from sklearn.preprocessing import scale
from sklearn.datasets import load_iris
from sklearn.cluster import KMeans
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
%matplotlib inline
iris = load_iris()
df = pd.DataFrame(data = iris.data, columns =['sepal_length','sepal_width','petal_length','petal_width'])
df.head()
scikit-learn's load_iris() loads the iris data, and converting it to a DataFrame makes the data easier to handle.
Clustering into 3 groups
kmeans = KMeans(n_clusters=3,init = 'k-means++',max_iter=300,random_state = 0)
kmeans.fit(df)
print(kmeans.labels_)
We set k=3 and the maximum number of iterations to 300, then printed which cluster each data point was assigned to.
The label values are 0, 1, and 2, meaning the point belongs to the first, second, or third cluster, respectively.
df['target'] = iris.target
df['cluster'] = kmeans.labels_
iris_result = df.groupby(['target','cluster'])['sepal_length'].count()
print(iris_result)
All data points with target 0 were grouped cleanly into cluster 1, but targets 1 and 2 turned out to be spread across clusters.
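The same comparison can also be laid out as a contingency table, which makes it easier to see which cluster each target ends up in. A self-contained sketch using `pd.crosstab`; `n_init=10` is set explicitly here as an assumption, so the cluster numbers may differ from the run above:

```python
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.datasets import load_iris

iris = load_iris()
labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(iris.data)

# Rows are the true targets, columns are the KMeans cluster labels
table = pd.crosstab(pd.Series(iris.target, name="target"),
                    pd.Series(labels, name="cluster"))
print(table)
```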
Visualization
from sklearn.decomposition import PCA
pca = PCA(n_components=2)
pca_transformed=pca.fit_transform(iris.data)
df['pca_x'] = pca_transformed[:,0]
df['pca_y'] = pca_transformed[:,1]
df.head()
To show the clustering of individual data points on a 2D plane,
the 4-feature dataset was reduced to 2 dimensions with PCA.
Each data point can now be plotted with x and y coordinates.
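How much of the original variance survives the reduction to 2 components can be checked with `explained_variance_ratio_`. A minimal sketch on the same iris data:

```python
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA

iris = load_iris()
pca = PCA(n_components=2)
projected = pca.fit_transform(iris.data)

# Fraction of the total variance captured by each of the 2 components
ratio = pca.explained_variance_ratio_
print(projected.shape)
print(ratio, ratio.sum())
```

If the sum is close to 1, the 2D scatter plot is a reasonable picture of the 4-feature data.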
# extract a separate index for each cluster value 0, 1, 2
marker0 = df[df['cluster'] == 0].index
marker1 = df[df['cluster'] == 1].index
marker2 = df[df['cluster'] == 2].index
# use those indexes to pick the pca_x, pca_y values of each cluster; plot with markers o, s, ^
plt.scatter(x=df.loc[marker0,'pca_x'],y=df.loc[marker0,'pca_y'],marker='o')
plt.scatter(x=df.loc[marker1,'pca_x'],y=df.loc[marker1,'pca_y'],marker='s')
plt.scatter(x=df.loc[marker2,'pca_x'],y=df.loc[marker2,'pca_y'],marker='^')
plt.xlabel('PCA 1')
plt.ylabel('PCA 2')
plt.title("3 clusters visualization by 2 PCA Components")
plt.show()
The points were plotted with a different marker for each cluster value.
Generating data for testing clustering algorithms
scikit-learn provides simple data generators for testing various clustering algorithms.
The representative generators for clustering are the make_blobs() and make_classification() APIs.
Both APIs similarly create datasets that span multiple classes.
make_blobs() additionally lets you control the centers and standard deviation of each cluster,
while make_classification() is useful for generating data that contains noise.
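A small side-by-side sketch of the two generators. The parameter values are assumptions picked for illustration: a different `cluster_std` per blob to show the per-cluster control, and `flip_y=0.05` to add label noise in `make_classification()`.

```python
import numpy as np
from sklearn.datasets import make_blobs, make_classification

# make_blobs: one standard deviation per cluster
X_blobs, y_blobs = make_blobs(n_samples=200, n_features=2, centers=3,
                              cluster_std=[0.5, 1.0, 1.5], random_state=0)

# make_classification: flip_y randomly reassigns a fraction of the labels (noise)
X_cls, y_cls = make_classification(n_samples=200, n_features=2, n_informative=2,
                                   n_redundant=0, n_classes=3,
                                   n_clusters_per_class=1, flip_y=0.05,
                                   random_state=0)

print(X_blobs.shape, np.unique(y_blobs, return_counts=True))
print(X_cls.shape, np.unique(y_cls, return_counts=True))
```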
Creating the dataset
import numpy as np
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
%matplotlib inline
X,y = make_blobs(n_samples=200, n_features = 2, centers=3, cluster_std = 0.8, random_state=0)
print(X.shape,y.shape)
# check the distribution of the y target
unique, counts =np.unique(y,return_counts=True)
print(unique,counts)
The dataset was generated with 200 samples, 2 features, 3 centers, and a standard deviation of 0.8.
The feature dataset X has 200 records and 2 features,
and the target dataset y has 200 records.
The 3 cluster values are [0,1,2], with [67,67,66] records each, so the clusters are evenly sized.
Converting the data to a DataFrame
import pandas as pd
cluster = pd.DataFrame(data=X,columns=['ftr1','ftr2'])
cluster['target'] = y
cluster.head()
Checking the cluster distribution the dataset was generated with
target_list = np.unique(y)
# marker for each target in the scatter plot
markers = ['o','s','^','P','D','H','X']
# the dataset was generated with 3 cluster regions, so target_list is [0,1,2]
for target in target_list:
    target_cluster = cluster[cluster['target'] == target]
    plt.scatter(x=target_cluster['ftr1'],y = target_cluster['ftr2'],edgecolor='k',marker=markers[target])
plt.show()
Visualization after KMeans clustering
# run KMeans clustering
kmeans = KMeans(n_clusters=3,init='k-means++',max_iter=200,random_state=0)
cluster_labels = kmeans.fit_predict(X)
cluster['kmeans_label'] = cluster_labels
# cluster_centers_ holds the center coordinates of each cluster; extracted for plotting
centers = kmeans.cluster_centers_
unique_labels = np.unique(cluster_labels)
# iterate over the cluster labels and draw a scatter plot with a different marker for each
for label in unique_labels:
    label_cluster = cluster[cluster['kmeans_label'] == label]
    center_x_y = centers[label]
    plt.scatter(x=label_cluster['ftr1'],y=label_cluster['ftr2'],edgecolor='k',marker=markers[label])
    # plot the center coordinates of each cluster
    plt.scatter(x=center_x_y[0],y = center_x_y[1], s=200, color='white',alpha=0.9,edgecolor='k',marker=markers[label])
    plt.scatter(x=center_x_y[0],y = center_x_y[1], s=70, color='k',edgecolor='k',marker='$%d$' % label)
plt.show()
The make_blobs() target and kmeans_label are both just cluster numbers, so the same group may be mapped to different values in each.
print(cluster.groupby('target')['kmeans_label'].value_counts())
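Since the label numbers themselves are arbitrary, a score that ignores the numbering can be used to compare the two labelings. The sketch below uses `adjusted_rand_score` from `sklearn.metrics`, which is not part of the original walkthrough; `n_init=10` is set explicitly as an assumption.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import adjusted_rand_score

X, y = make_blobs(n_samples=200, n_features=2, centers=3,
                  cluster_std=0.8, random_state=0)
labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)

# The score depends only on how points are grouped, not on the label numbers
ari = adjusted_rand_score(y, labels)
permuted = (labels + 1) % 3  # same grouping, renumbered labels
print(ari, adjusted_rand_score(y, permuted))
```

Both printed values are the same, and a value near 1 means the KMeans grouping closely matches the generated targets.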