
🛠 Machine Learning / Text Analysis

[Python] Document Similarity

ใ…… ใ…œ ใ…” ใ…‡ 2022. 2. 25. 14:42

๋ฌธ์„œ ์‚ฌ์ด์˜ ์œ ์‚ฌ๋„ ์ธก์ •์€ ์ฃผ๋กœ ์ฝ”์‚ฌ์ธ ์œ ์‚ฌ๋„(Cosine Similarity)๋ฅผ ์‚ฌ์šฉํ•œ๋‹ค. 

๋ฒกํ„ฐ์˜ ํฌ๊ธฐ ๋ณด๋‹ค๋Š” ๋ฒกํ„ฐ์˜ ์ƒํ˜ธ ๋ฐฉํ–ฅ์„ฑ์ด ์–ผ๋งˆ๋‚˜ ์œ ์‚ฌํ•œ์ง€์— ๊ธฐ๋ฐ˜ํ•˜์—ฌ ์ธก์ •ํ•œ๋‹ค.

 

๋‘ ๋ฒกํ„ฐ์˜ ์‚ฌ์ž‡๊ฐ์— ๋”ฐ๋ผ ์ƒํ™” ๊ด€๊ณ„๋Š” ์œ ์‚ฌํ•˜๊ฑฐ๋‚˜ ๊ด€๋ จ์ด ์—†๊ฑฐ๋‚˜ ์•„์˜ˆ ๋ฐ˜๋Œ€ ๊ด€๊ณ„๊ฐ€ ๋  ์ˆ˜ ์žˆ๋‹ค. 

 

๋‘ ๋ฒกํ„ฐ A,B์˜ ๋‚ด์  ๊ฐ’์€ ๋‘ ๋ฒกํ„ฐ์˜ ํฌ๊ธฐ๋ฅผ ๊ฒ‚ํ•œ ๊ฐ’์— ์ฝ”์‚ฌ์ธ ๊ฐ๋„ ๊ฐ’์„ ๊ณฑํ•œ ๊ฐ’์ด๋‹ค. 

 

๋”ฐ๋ผ์„œ ์œ ์‚ฌ๋„(similarity)๋Š” ๋‹ค์Œ๊ณผ ๊ฐ™์ด ๋‘ ๋ฒกํ„ฐ์˜ ๋‚ด์ ์„ ์ด ๋ฒกํ„ฐ ํฌ๊ธฐ์˜ ํ•ฉ์œผ๋กœ ๋‚˜๋ˆˆ ๊ฒƒ์ด๋‹ค. 

$$\text{similarity} = \cos\theta = \frac{A \cdot B}{\|A\|\,\|B\|}$$

๋‘ ๋„˜ํŒŒ์ด ๋ฐฐ์—ด์— ๋Œ€ํ•œ ์ฝ”์‚ฌ์ธ ์œ ์‚ฌ๋„ ๊ตฌํ•˜๋Š” ํ•จ์ˆ˜ ์ƒ์„ฑ

import numpy as np

def cos_similarity(v1, v2):
    # dot product of the two vectors
    dot_product = np.dot(v1, v2)
    # product of the two vectors' L2 norms (magnitudes)
    l2_norm = (np.sqrt(sum(np.square(v1))) * np.sqrt(sum(np.square(v2))))
    # cosine similarity = dot product / product of magnitudes
    similarity = dot_product / l2_norm
    
    return similarity
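
As a quick sanity check, here is a usage sketch with toy vectors (these values are illustrative, not from the post): vectors pointing in the same direction give 1.0, and orthogonal vectors give 0.0.

# Hypothetical toy vectors, for illustration only
v_a = np.array([1, 2, 3])
v_b = np.array([2, 4, 6])   # same direction as v_a
v_c = np.array([3, 0, -1])  # orthogonal to v_a (dot product is 0)

print(cos_similarity(v_a, v_b))  # ≈ 1.0 (same direction)
print(cos_similarity(v_a, v_c))  # 0.0 (orthogonal)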


3๊ฐœ์˜ ๊ฐ„๋‹จํ•œ ๋ฌธ์„œ๋“ค์˜ ์œ ์‚ฌ๋„ ๋น„๊ต

from sklearn.feature_extraction.text import TfidfVectorizer

doc_list = ['if you take the blue pill, the story ends',
           'if you take the red pill, you stay in Wonderland',
           'if you take the red pill, I show you how deep the rabbit hole goes']

tfidf_vect_simple = TfidfVectorizer()
feature_vect_simple = tfidf_vect_simple.fit_transform(doc_list)
print(feature_vect_simple.shape)

3๊ฐœ์˜ ๊ฐ„๋‹จํ•œ ๋ฌธ์„œ๋ฅผ ์ž„์˜๋กœ ์ƒ์„ฑํ•œ ๋’ค,

TfidfVector๋กœ ๋ฒกํ„ฐํ™”ํ•ด ์ฃผ์—ˆ๋‹ค. 
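
To inspect the vocabulary the vectorizer learned, a quick sketch (get_feature_names_out() requires scikit-learn 1.0 or later; older versions use get_feature_names()):

# Print the learned vocabulary; its length matches the second dimension of the shape above
print(tfidf_vect_simple.get_feature_names_out())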


# The result of TfidfVectorizer transform() is a sparse matrix, so convert it to a dense matrix
feature_vect_dense = feature_vect_simple.todense()

# Extract the feature vectors of the first and second sentences
vect1 = np.array(feature_vect_dense[0]).reshape(-1, )
vect2 = np.array(feature_vect_dense[1]).reshape(-1, )

# Compute the cosine similarity of the two sentences from their feature vectors
similarity_simple = cos_similarity(vect1, vect2)
print('Cosine similarity of sentences 1,2 : {0:.3f}'.format(similarity_simple))

์•ž์—์„œ ์ƒ์„ฑํ•œ ๊ฒฐ๊ณผ๋Š” ํฌ์†Œ ํ–‰๋ ฌ์ด๊ธฐ ๋•Œ๋ฌธ์— ๋ฐ€์ง‘ํ–‰๋ ฌ๋กœ ๋ณ€ํ™˜ํ•ด ์ค€ ๋’ค,

3๊ฐœ์˜ ๋ฌธ์„œ์ค‘ ์ฒซ๋ฒˆ์งธ์™€ ๋‘ ๋ฒˆ์งธ ๋ฌธ์žฅ์˜ ํ”ผ์ฒ˜ ๋ฒกํ„ฐ๋ฅผ ์ถ”์ถœํ•˜์—ฌ ๋‘ ๋ฌธ์žฅ์˜ ์ฝ”์‚ฌ์ธ ์œ ์‚ฌ๋„๋ฅผ ์ธก์ •ํ•˜์˜€๋‹ค. 


vect3 = np.array(feature_vect_dense[2]).reshape(-1, )
similarity_simple = cos_similarity(vect1, vect3)
print('Cosine similarity of sentences 1,3 : {0:.3f}'.format(similarity_simple))


similarity_simple = cos_similarity(vect2, vect3)
print('Cosine similarity of sentences 2,3 : {0:.3f}'.format(similarity_simple))

The remaining pairs were compared in the same way as sentences 1 and 2; a combined version is sketched below.
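
A small sketch (not from the original post) that generates all three pairwise comparisons in one loop:

from itertools import combinations

# Compute all three pairwise similarities over the dense feature vectors
vects = [np.array(feature_vect_dense[i]).reshape(-1, ) for i in range(3)]
for (i, v_i), (j, v_j) in combinations(enumerate(vects), 2):
    print('Cosine similarity of sentences {0},{1} : {2:.3f}'.format(i + 1, j + 1, cos_similarity(v_i, v_j)))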


Measuring document similarity with scikit-learn

from sklearn.metrics.pairwise import cosine_similarity

similarity_simple_pair = cosine_similarity(feature_vect_simple[0],feature_vect_simple)
print(similarity_simple_pair)

cosine_similarity() accepts sparse matrices, dense matrices, and plain arrays alike,

so no separate conversion step is needed.


Result of measuring the similarity between the first sentence and all three documents:

the 1 is the similarity of the document with itself,

while the similarity of documents 1 and 2 was measured as 0.402, and that of documents 1 and 3 as 0.404.


similarity_simple_pair = cosine_similarity(feature_vect_simple[0],feature_vect_simple[1:])
print(similarity_simple_pair)

์ž๊ธฐ ์ž์‹ ๊ณผ์˜ ๋ฌธ์„œ ์œ ์‚ฌ๋„๋ฅผ ์—†์• ๊ณ  ์‹ถ๋‹ค๋ฉด

[1:]์„ ์ถ”๊ฐ€ํ•˜๋ฉด ์ œ๊ฑฐํ•  ์ˆ˜ ์žˆ๋‹ค. 


similarity_simple_pair = cosine_similarity(feature_vect_simple, feature_vect_simple)
print(similarity_simple_pair)
print('similarity_simple_pair shape : ',similarity_simple_pair.shape)

The first row shows document 1's similarity with itself (1.0) and with documents 2 and 3.
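
A related pattern (a sketch, not from the post) is to zero out the diagonal of the full matrix with np.fill_diagonal(), which removes every document's self-similarity at once:

full_pair = cosine_similarity(feature_vect_simple, feature_vect_simple)
np.fill_diagonal(full_pair, 0)  # replace the 1.0 self-similarities with 0
print(full_pair)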


Measuring document similarity with the Opinion Review dataset

from nltk.stem import WordNetLemmatizer
import nltk
import string

# mapping that deletes every punctuation character via str.translate()
remove_punct_dict = dict((ord(punct), None) for punct in string.punctuation)
lemmar = WordNetLemmatizer()

def LemTokens(tokens):
    # lemmatize each token
    return [lemmar.lemmatize(token) for token in tokens]

def LemNormalize(text):
    # lowercase, strip punctuation, tokenize, then lemmatize
    return LemTokens(nltk.word_tokenize(text.lower().translate(remove_punct_dict)))

 

Lemmatization์„ ๊ตฌํ˜„ํ•˜๋Š” LemNormalize() ํ•จ์ˆ˜๋ฅผ ์ƒ์„ฑํ•˜์˜€๋‹ค. 


import pandas as pd
import glob, os
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans

# Load every .data file under the Opinosis topics directory
path = r'C:\Users\OpinosisDataset1.0\OpinosisDataset1.0\topics'
all_files = glob.glob(os.path.join(path, "*.data"))
filename_list = []
opinion_text = []

for file_ in all_files:
    # Read each file, keeping its base name (without extension) and full text
    df = pd.read_table(file_, index_col=None, header=0, encoding='latin1')
    filename_ = file_.split('\\')[-1]
    filename = filename_.split('.')[0]
    filename_list.append(filename)
    opinion_text.append(df.to_string())

document_df = pd.DataFrame({'filename': filename_list, 'opinion_text': opinion_text})

# TF-IDF vectorization with the lemmatizing tokenizer, unigrams and bigrams
tfidf_vect = TfidfVectorizer(tokenizer=LemNormalize, stop_words='english',
                             ngram_range=(1, 2), min_df=0.05, max_df=0.85)
feature_vect = tfidf_vect.fit_transform(document_df['opinion_text'])

# Cluster the documents into 3 groups with KMeans
km_cluster = KMeans(n_clusters=3, max_iter=10000, random_state=0)
km_cluster.fit(feature_vect)
cluster_label = km_cluster.labels_
cluster_centers = km_cluster.cluster_centers_
document_df['cluster_label'] = cluster_label

 

As in the document clustering post, all the files in the directory were loaded and stored in lists,

and KMeans clustering was run with n_clusters=3.
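
To see which files ended up in each cluster, a quick sketch (note the label number that corresponds to the hotel topic can vary with the scikit-learn version and random_state):

# List the filenames assigned to each of the three cluster labels
for label in range(3):
    print(label, document_df[document_df['cluster_label'] == label]['filename'].tolist())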


ํ˜ธํ…”์„ ์ฃผ์ œ๋กœ ๊ตฐ์ง‘ํ™”๋œ ๋ฌธ์„œ์™€ ๋‹ค๋ฅธ ๋ฌธ์„œ์™€์˜ ์œ ์‚ฌ๋„ ์ธก์ •

from sklearn.metrics.pairwise import cosine_similarity

# Extract the indexes of the documents clustered as hotel
hotel_indexes = document_df[document_df['cluster_label']==1].index
print("DataFrame Index of documents clustered as hotel : ", hotel_indexes)

# Take the first of those documents and show its filename
comparison_docname = document_df.iloc[hotel_indexes[0]]['filename']
print('Similarity of comparison document', comparison_docname, 'to the other documents')

# Index feature_vect with the Index object extracted from document_df to pull out the hotel-cluster feature vectors
similarity_pair = cosine_similarity(feature_vect[hotel_indexes[0]], feature_vect[hotel_indexes])
print(similarity_pair)

ํ˜ธํ…”์„ ์ฃผ์ œ๋กœ ๊ตฐ์ง‘ํ™”๋œ ๋ฌธ์„œ์˜ ์ธ๋ฑ์Šค ์ถ”์ถœ -> TfidfVectorizer ๊ฐ์ฒด ๋ณ€์ˆ˜์ธ feature_vect์—์„œ ํ˜ธํ…”๋กœ ๊ตฐ์ง‘ํ™”๋œ ๋ฌธ์„œ์˜ ํ”ผ์ฒ˜๋ฒกํ„ฐ ์ถ”์ถœ


import seaborn as sns
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline

# Extract the indexes sorted by similarity to the first document in descending order, excluding the document itself
sorted_index = similarity_pair.argsort()[:,::-1]
sorted_index = sorted_index[:,1:]

# Reorder hotel_indexes by descending similarity
hotel_sorted_indexes = hotel_indexes[sorted_index.reshape(-1)]

# Sort the similarity values in descending order, excluding the document itself
hotel_1_sim_value = np.sort(similarity_pair.reshape(-1))[::-1]
hotel_1_sim_value = hotel_1_sim_value[1:]

# Visualize filenames and similarity values as a bar plot, ordered by descending similarity
hotel_1_sim_df = pd.DataFrame()
hotel_1_sim_df['filename'] = document_df.iloc[hotel_sorted_indexes]['filename']
hotel_1_sim_df['similarity'] = hotel_1_sim_value

sns.barplot(x='similarity', y='filename', data=hotel_1_sim_df)
plt.title(comparison_docname)

์œ ์‚ฌ๋„๊ฐ€ ๋†’์€ ์ˆœ์œผ๋กœ ์ •๋ ฌ ํ›„ ์‹œ๊ฐํ™”
