
Data Science LAB


[Python] Document Clustering

ㅅ ㅜ ㅔ ㅇ 2022. 2. 24. 13:48

What is document clustering?

Document clustering groups documents whose text content is similar.

Documents in the same cluster are treated as belonging to the same category, but unlike text classification, clustering is unsupervised: it needs no pre-assigned category labels.

 

 

 

 

 

 

 

Downloading the dataset

https://archive.ics.uci.edu/ml/datasets/Opinosis+Opinion+%26frasl%3B+Review 

 


 

Open the link above and download the Data Folder.

Extracting the archive yields the dataset directory; its topics directory contains 51 files.

These files are used for document clustering in the rest of this post.

 

 

 

 

 

 

 

 

Loading the dataset in a Jupyter notebook

import pandas as pd
import glob, os

# Adjust this path to wherever the dataset was extracted
path = r'C:\Users\OpinosisDataset1.0\OpinosisDataset1.0\topics'

# Collect the paths of all .data files under the directory given by path
all_files = glob.glob(os.path.join(path, "*.data"))
filename_list = []
opinion_text = []

for file_ in all_files:
    df = pd.read_table(file_, index_col=None, header=0, encoding='latin1')

    # Strip the directory part with os.path.basename (works on both Windows and Linux),
    # then drop everything from the first '.' onward, including the trailing .data extension
    filename = os.path.basename(file_).split('.')[0]

    # Append the file name and the file contents to their respective lists
    filename_list.append(filename)
    opinion_text.append(df.to_string())

# Build a DataFrame from the file name list and the file contents list
document_df = pd.DataFrame({'filename': filename_list,
                            'opinion_text': opinion_text})
document_df.head()

 

After setting path to match your own PC, the paths of all .data files under that directory are collected into a list.

The name of each file is collected in filename_list, and the contents of each file are loaded, converted back to a string, and collected in the opinion_text list.

The for loop processes the files one by one, strips the directory and extension from each file name, and the two lists are then combined into the final DataFrame.
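The file name handling can be tried in isolation, without the dataset. The sketch below is only an illustration using pathlib from the standard library; the helper name topic_name and both example paths are hypothetical and not part of the original notebook.

```python
from pathlib import PurePosixPath, PureWindowsPath

def topic_name(path_str, windows=False):
    """Return the file name without its directory and without any extensions."""
    # Pick the path flavour explicitly so the example behaves the same on any OS
    path_cls = PureWindowsPath if windows else PurePosixPath
    # .name drops the directory part; split('.')[0] drops '.txt.data'
    return path_cls(path_str).name.split('.')[0]

# Hypothetical paths, used only to show the behaviour
win_name = topic_name(r'C:\data\topics\battery-life_ipod_nano_8gb.txt.data', windows=True)
posix_name = topic_name('/data/topics/battery-life_ipod_nano_8gb.txt.data')
print(win_name)    # battery-life_ipod_nano_8gb
print(posix_name)  # battery-life_ipod_nano_8gb
```

Both calls give the same topic name, which is why splitting on a hard-coded '\\' is only needed when the separator is known in advance.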

 

 

 

 

 

 

 

 

Feature vectorization with TF-IDF

from nltk.stem import WordNetLemmatizer
import nltk
import string

remove_punct_dict = dict((ord(punct), None) for punct in string.punctuation)
lemmar = WordNetLemmatizer()

def LemTokens(tokens):
    return [lemmar.lemmatize(token) for token in tokens]

def LemNormalize(text):
    return LemTokens(nltk.word_tokenize(text.lower().translate(remove_punct_dict)))

 

The LemNormalize(text) function defined above lowercases the text, removes punctuation, tokenizes it, and lemmatizes each token.
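To see what the punctuation table does before NLTK gets involved, here is a small sketch that runs with the standard library only. It reuses the same remove_punct_dict construction; the function name normalize_without_nltk is hypothetical, str.split() is an assumed stand-in for nltk.word_tokenize, and no lemmatization is performed.

```python
import string

# Same translation table as in the post: every punctuation character maps to None
remove_punct_dict = dict((ord(punct), None) for punct in string.punctuation)

def normalize_without_nltk(text):
    # Assumption: str.split() stands in for nltk.word_tokenize,
    # and the lemmatization step is skipped entirely
    return text.lower().translate(remove_punct_dict).split()

tokens = normalize_without_nltk("Great battery-life, isn't it?")
print(tokens)  # ['great', 'batterylife', 'isnt', 'it']
```

Note that hyphens and apostrophes are deleted rather than replaced by spaces, so "battery-life" becomes the single token "batterylife".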

 

 

from sklearn.feature_extraction.text import TfidfVectorizer
tfidf_vect = TfidfVectorizer(tokenizer = LemNormalize, stop_words = 'english',
                            ngram_range = (1,2),min_df=0.05,max_df = 0.85)

# Vectorize the opinion_text column into TF-IDF features
feature_vect = tfidf_vect.fit_transform(document_df['opinion_text'])

This produces the TF-IDF feature matrix, with one row per document.
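The vectorizer is given min_df=0.05 and max_df=0.85. When these are floats, scikit-learn treats them as proportions of the document count and keeps a term only if its document frequency lies between the two scaled bounds. The arithmetic below is a sketch of what that means here, assuming all 51 topic files were loaded; the helper name df_bounds is hypothetical.

```python
import math

def df_bounds(n_docs, min_df, max_df):
    """Smallest and largest whole-number document counts that pass the float thresholds."""
    # A term is kept when min_df * n_docs <= document frequency <= max_df * n_docs
    lowest_allowed = math.ceil(min_df * n_docs)
    highest_allowed = math.floor(max_df * n_docs)
    return lowest_allowed, highest_allowed

low, high = df_bounds(n_docs=51, min_df=0.05, max_df=0.85)
print(low, high)  # 3 43
```

Under that assumption, a term must appear in at least 3 and at most 43 of the 51 documents to become a feature.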

 

 

 

 

 

 

 

Running K-Means clustering (n = 5)

from sklearn.cluster import KMeans

km = KMeans(n_clusters = 5, max_iter = 10000, random_state=0)
km.fit(feature_vect)
cluster_label = km.labels_
cluster_centers = km.cluster_centers_

document_df['cluster_label'] = cluster_label
document_df.head()

 

The number of clusters (n_clusters) was set to 5 and a KMeans model was fitted.

A cluster_label column was then added so that the cluster assigned to each document can be inspected.
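Each label produced by K-Means is the index of the centroid closest to that document's vector. The sketch below illustrates only that assignment step on made-up 2-dimensional vectors; the function name nearest_centroid and all numbers are hypothetical, and it is not a reimplementation of scikit-learn's KMeans.

```python
import math

def nearest_centroid(vector, centroids):
    """Return the index of the centroid with the smallest Euclidean distance to vector."""
    distances = [math.dist(vector, centroid) for centroid in centroids]
    return distances.index(min(distances))

# Two made-up centroids and three made-up document vectors
centroids = [[1.0, 0.0], [0.0, 1.0]]
documents = [[0.9, 0.1], [0.2, 0.7], [0.6, 0.3]]

labels = [nearest_centroid(doc, centroids) for doc in documents]
print(labels)  # [0, 1, 0]
```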

 

 

 

 

 

 

 

# Filter to cluster 0 and sort by file name
document_df[document_df['cluster_label'] == 0].sort_values(by='filename')

Looking only at the rows whose cluster_label is 0 shows that cluster 0 consists of hotel-related documents.

 

 

 

 

document_df[document_df['cluster_label'] == 1].sort_values(by='filename')

 

Likewise, cluster 1 consists of reviews of electronic devices such as the Kindle and iPod.

 

 

 

 

 

 


 

 

 

Running K-Means clustering (n = 3)

 

from sklearn.cluster import KMeans

km = KMeans(n_clusters = 3, max_iter = 10000, random_state=0)
km.fit(feature_vect)
cluster_label = km.labels_
cluster_centers = km.cluster_centers_

document_df['cluster_label'] = cluster_label
document_df.sort_values(by = 'cluster_label')

This is the same KMeans model as above, refitted with n_clusters set to 3.

 

 

 

 

 

 

 

Extracting key words per cluster

cluster_centers = km.cluster_centers_
print('cluster_centers shape : ',cluster_centers.shape)
print(cluster_centers)

 

cluster_centers is a (3, 4611) array, meaning 3 clusters and 4611 features.

Each value lies between 0 and 1 and is the centroid's coordinate along that word feature; the larger the value, the more strongly that word characterizes the cluster.
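A K-Means centroid is the per-feature mean of the vectors assigned to the cluster, which is why its values stay in the 0 to 1 range when the TF-IDF rows themselves do. A minimal sketch with made-up 3-feature rows (the function name centroid and the numbers are hypothetical):

```python
def centroid(rows):
    """Per-feature mean of equally long rows."""
    n_rows = len(rows)
    return [sum(column) / n_rows for column in zip(*rows)]

# Two made-up unit-length rows, standing in for L2-normalized TF-IDF vectors
rows = [
    [0.6, 0.8, 0.0],
    [0.0, 0.6, 0.8],
]

center = centroid(rows)
print([round(value, 3) for value in center])  # [0.3, 0.7, 0.4]
```

The second feature has the largest centroid value because both documents weight it highly, so it would rank first among this cluster's key words.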

 

 

 

 

# Return, per cluster, the top n key words, their centroid values, and the file names in the cluster
def get_cluster_details(cluster_model, cluster_data, feature_names, clusters_num, top_n_features=10):
    cluster_details = {}

    # Indices of each centroid row sorted by descending value,
    # i.e. the word features ordered from largest to smallest centroid value per cluster
    centroid_feature_ordered_ind = cluster_model.cluster_centers_.argsort()[:, ::-1]

    # For each cluster, record the key words, their centroid values, and the file names
    for cluster_num in range(clusters_num):
        cluster_details[cluster_num] = {}
        cluster_details[cluster_num]['cluster'] = cluster_num

        # Top n feature words
        top_feature_indexes = centroid_feature_ordered_ind[cluster_num, :top_n_features]
        top_features = [feature_names[ind] for ind in top_feature_indexes]

        # Centroid values of those feature words
        top_feature_values = cluster_model.cluster_centers_[cluster_num, top_feature_indexes].tolist()

        # Store the key words, their centroid values, and the file names in cluster_details
        cluster_details[cluster_num]['top_features'] = top_features
        cluster_details[cluster_num]['top_features_value'] = top_feature_values
        filenames = cluster_data[cluster_data['cluster_label'] == cluster_num]['filename']
        filenames = filenames.values.tolist()

        cluster_details[cluster_num]['filenames'] = filenames

    return cluster_details

 

Calling get_cluster_details() returns cluster_details, a dictionary keyed by cluster number whose values are themselves dictionaries.

Each entry holds the cluster number, the key words, the centroid values of those key words, and the file names in the cluster.
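The key-word selection amounts to sorting one centroid row by descending value and keeping the first n indices. The sketch below shows the same idea with plain Python lists instead of NumPy's argsort; the function name top_features_for_row, the feature names, and the values are all made up for illustration.

```python
def top_features_for_row(center_row, feature_names, top_n):
    """Return (names, values) of the top_n largest entries in one centroid row."""
    # Indices ordered from largest to smallest centroid value
    ordered = sorted(range(len(center_row)), key=lambda i: center_row[i], reverse=True)
    top_indexes = ordered[:top_n]
    names = [feature_names[i] for i in top_indexes]
    values = [center_row[i] for i in top_indexes]
    return names, values

# Made-up feature names and centroid values (no ties, so the order is unambiguous)
feature_names = ['bathroom', 'room', 'screen', 'hotel']
center_row = [0.05, 0.40, 0.10, 0.25]

names, values = top_features_for_row(center_row, feature_names, top_n=2)
print(names)   # ['room', 'hotel']
print(values)  # [0.4, 0.25]
```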

 

 

def print_cluster_details(cluster_details):
    for cluster_num, cluster_detail in cluster_details.items():
        print('-----Cluster {0}'.format(cluster_num))
        print('Top features: ',cluster_detail['top_features'])
        print('Review file names: ', cluster_detail['filenames'][:7])
        print("=======================================================")

 

A separate function was written to print cluster_details in a readable form.

 

 

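# get_feature_names() was removed in scikit-learn 1.2; get_feature_names_out() (available since 1.0) replaces it
feature_names = tfidf_vect.get_feature_names_out()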

cluster_details = get_cluster_details(cluster_model = km, cluster_data = document_df,
                                     feature_names = feature_names, clusters_num=3, top_n_features=10)
print_cluster_details(cluster_details)

 

The code above calls the functions defined earlier.

get_cluster_details() takes the fitted KMeans object, the document_df DataFrame (used to look up file names), the list of feature names (used to look up key words), the total number of clusters, and the number of key words to extract per cluster.

 
