[Python] 문서 군집화

250x250

Link

GitHub

나의 GitHub Contribution 그래프

Loading data ...

Notice

Recent Posts

Recent Comments

« 2025/05 »
일	월	화	수	목	금	토
				1	2	3
4	5	6	7	8	9	10
11	12	13	14	15	16	17
18	19	20	21	22	23	24
25	26	27	28	29	30	31

Tags more

Archives

관리 메뉴

Data Science LAB

[Python] 문서 군집화 본문

🛠 Machine Learning/텍스트 분석

[Python] 문서 군집화

ㅅ ㅜ ㅔ ㅇ 2022. 2. 24. 13:48

728x90

문서 군집화란?

비슷한 텍스트 구성의 문서를 군집화(Clustering)하는 것이다.

동일한 군집에 속하는 문서를 같은 카테고리 소속으로 분류하는 것이지만, 비지도학습 기반으로 동작한다는 점이 텍스트 분류와는 다르다.

데이터셋 다운

https://archive.ics.uci.edu/ml/datasets/Opinosis+Opinion+%26frasl%3B+Review

UCI Machine Learning Repository: Opinosis Opinion ⁄ Review Data Set

Opinosis Opinion ⁄ Review Data Set Download: Data Folder, Data Set Description Abstract: This dataset contains sentences extracted from user reviews on a given topic. Example topics are â€œperformance of Toyota Camryâ€ and â€œsound quality o

archive.ics.uci.edu

위의 링크로 들어간 뒤,

데이터 폴더를 다운받는다.

압축파일을 풀면

다음과 같은 형태의 디렉토리로 이루어져 있다.

topics 디렉터리 안에는 51개의 파일로 구성되어 있다.

이 파일들을 이용하여 문서 군집화를 진행해 보려고 한다.

주피터 노트북으로 데이터셋 불러오기

import pandas as pd
import glob, os

path = r'C:\Users\OpinosisDataset1.0\OpinosisDataset1.0\topics'

#path로 지정한 디렉토리 밑의 모든 .data 파일의 파일명을 리스트로 취합
all_files = glob.glob(os.path.join(path,"*.data"))
filename_list = []
opinion_text = []


for file_ in all_files:
    df = pd.read_table(file_,index_col = None, header=0, encoding='latin1')
    
    #절대 경로로 주어진 파일명을 가공, 리눅스에서 수행할 때는 다음 \\을 /로 변경
    #맨 마지막 .data확장자 제거
    filename_ = file_.split('\\')[-1]
    filename = filename_.split('.')[0]
    
    #파일명 리스트와 파일 내용 리스트에 파일명과 파일 내용 추가
    filename_list.append(filename)
    opinion_text.append(df.to_string())
    

#파일명 list와 파일 내용 list 객체를 DataFrame으로 생성
document_df = pd.DataFrame({'filename':filename_list,
                           'opinion_text': opinion_text})
document_df.head()

사용자의 PC 경로에 맞춰 데이터 셋을 불러온 후, path로 지정한 디렉토리 밑의 모든 .data 파일의 파일명을 리스트로 취합한다.

개별 파일의 파일명은 filenam_list로 취합하고, 파일의 내용은 로딩 후 다시 string으로 변환하여 opinion_text list로 취합한다.

DataFrame을 생성하는 For 문을 작성하여 디렉토리 밑의 모든 파일을 하나씩 추가하여 list로 생성하고 파일명을 가공하고 확장자를 제거하여 최종적인 DataFrame을 생성한다.

TF-IDF 형태로 피처 벡터화

from nltk.stem import WordNetLemmatizer
import nltk
import string

remove_punct_dict = dict((ord(punct), None) for punct in string.punctuation)
lemmar = WordNetLemmatizer()

def LemTokens(tokens):
    return [lemmar.lemmatize(token) for token in tokens]

def LemNormalize(text):
    return LemTokens(nltk.word_tokenize(text.lower().translate(remove_punct_dict)))

Lemmatization을 구현하는 LemNormalize(text)함수를 생성해 주었다.

from sklearn.feature_extraction.text import TfidfVectorizer
tfidf_vect = TfidfVectorizer(tokenizer = LemNormalize, stop_words = 'english',
                            ngram_range = (1,2),min_df=0.05,max_df = 0.85)

#opinion_text 칼럼 값으로 피처 벡터화 수행
feature_vect = tfidf_vect.fit_transform(document_df['opinion_text'])

개별 문서 텍스트에 대해 TF-IDF 변환된 피처 벡터화된 행렬을 구할 수 있음

K-Means Clustering (n = 5) 수행

from sklearn.cluster import KMeans

km = KMeans(n_clusters = 5, max_iter = 10000, random_state=0)
km.fit(feature_vect)
cluster_label = km.labels_
cluster_centers = km.cluster_centers_

document_df['cluster_label'] = cluster_label
document_df.head()

중심(n_cluster)를 5개로 설정한 뒤, KMeans 모델을 생성하였으며

cluster_label 컬럼을 추가하여 각 문서가 어떤 군집으로 군집화되었는지 확인할 수 있도록 하였다.

#데이터정렬
document_df[document_df['cluster_label'] == 0].sort_values(by='filename')

cluster_label이 0인 데이터셋만 조회한 결과,

0번 군집은 호텔과 관련된 문서들로 군집화 되어져 있음을 알 수 있었다.

document_df[document_df['cluster_label'] == 1].sort_values(by='filename')

마찬가지로, 군집1은 킨들, 아이팟 등의 전자기기에 대한 리뷰들로 군집화되어져 있음을 알수 있었다.

K-Means Clustering(n=3) 수행

from sklearn.cluster import KMeans

km = KMeans(n_clusters = 3, max_iter = 10000, random_state=0)
km.fit(feature_vect)
cluster_label = km.labels_
cluster_centers = km.cluster_centers_

document_df['cluster_label'] = cluster_label
document_df.sort_values(by = 'cluster_label')

위의 모델과 동일하게 KMeans 모델이지만,

n_cluster값을 3으로 조정하여 모델을 다시 생성하였다.

군집 별 핵심단어 추출

cluster_centers = km.cluster_centers_
print('cluster_centers shape : ',cluster_centers.shape)
print(cluster_centers)

cluster_centers는 (3,4611)배열로 이루어져 있으며,

3개의 군집, 4611개의 피처로 구성되어있음을 의미한다.

각 배열 값은 0~1사이의 값으로 피처가 중심값과 얼마나 가까이 위치하고 있는 지를 의미한다.

#군집별 top n 핵심 단어, 그 단어의 중심 위치 상대값, 대상 파일명 반환
def get_cluster_details(cluster_model,cluster_data,feature_names,clusters_num,top_n_features=10):
    cluster_details = {}
    
    #cluster_centers array의 값이 큰 순으로 정렬된 인덱스 값 반환
    #군집 중심점 별 할당된 word 피처들의 거리값이 큰 순으로 값을 구하기 위함
    centroid_feature_ordered_ind = cluster_model.cluster_centers_.argsort()[:,::-1]
    
    #개별 군집별로 반복하면서 핵심 단어, 그 단어의 중심 위치 상댓값, 대상 파일명 입력
    for cluster_num in range(clusters_num):
        cluster_details[cluster_num] = {}
        cluster_details[cluster_num]['cluster'] = cluster_num
        
        #top n 피처 단어 구하기
        top_feature_indexes = centroid_feature_ordered_ind[cluster_num,:top_n_features]
        top_features = [ feature_names[ind] for ind in top_feature_indexes ]
        
        #해당 피처 단어의 중심 위치 상댓값 구하기
        top_feature_values = cluster_model.cluster_centers_[cluster_num,top_feature_indexes].tolist()
        
        #cluster_details 딕셔너리 객체에 개별 군집별 핵심단어와 중심위치 상댓값, 해당 파일명 입력
        cluster_details[cluster_num]['top_features'] = top_features
        cluster_details[cluster_num]['top_features_value'] = top_feature_values
        filenames = cluster_data[cluster_data['cluster_label'] == cluster_num]['filename']
        filenames = filenames.values.tolist()
        
        cluster_details[cluster_num]['filenames'] = filenames
        
    return cluster_details

get_cluster_details()를 호출하면 dictionary를 원소로 가지는 리스트인 cluster_details를 반환한다.

cluster_details에는 개별 군집 번호, 핵심 단어, 핵심 단어 중심 위치 상댓값, 파일명 속성 값 정보가 있다.

def print_cluster_details(cluster_details):
    for cluster_num, cluster_detail in cluster_details.items():
        print('-----Cluster {0}'.format(cluster_num))
        print('Top features: ',cluster_detail['top_features'])
        print('Reviews 파일명 : ',cluster_detail['filenames'][:7])
        print("=======================================================")

cluster_details를 자세히 보기 위한 함수를 따로 생성하였다.

feature_names = tfidf_vect.get_feature_names()

cluster_details = get_cluster_details(cluster_model = km, cluster_data = document_df,
                                     feature_names = feature_names, clusters_num=3, top_n_features=10)
print_cluster_details(cluster_details)

생성한 함수들을 호출한 결과

get_cluster_detail() 함수를 호출하면 KMeans 군집화 객체, 파일명 추출을 위한 document_df DataFrame, 핵심 단어 추출을 위한 피처명 리스트, 전체 군집 개수, 핵심 단어 추출 개수를 반환한다.

728x90

'🛠 Machine Learning > 텍스트 분석' 카테고리의 다른 글

[Python] 한글 텍스트 처리 - 네이버 영화 평점 감성 분석 (0)	2022.02.26
[Python] 문서 유사도 (0)	2022.02.25
[Python] 토픽 모델링 (20 뉴스그룹) (0)	2022.02.22
[Python] SentiWordNet, VADER을 이용한 영화 감상평 감성 분석 (0)	2022.02.21
[Python] 감성분석 - 비지도 학습 (0)	2022.02.20

'🛠 Machine Learning/텍스트 분석' Related Articles

Comments

내 블로그 - 관리자 홈 전환	`Q` `Q`
새 글 쓰기	`W` `W`

글 수정 (권한 있는 경우)	`E` `E`
댓글 영역으로 이동	`C` `C`

이 페이지의 URL 복사	`S` `S`
맨 위로 이동	`T` `T`
티스토리 홈 이동	`H` `H`
단축키 안내	`Shift` + `/` `⇧` + `/`

Data Science LAB

Data Science LAB

[Python] 문서 군집화 본문

[Python] 문서 군집화

문서 군집화란?

데이터셋 다운

주피터 노트북으로 데이터셋 불러오기

TF-IDF 형태로 피처 벡터화

K-Means Clustering (n = 5) 수행

K-Means Clustering(n=3) 수행

군집 별 핵심단어 추출

'🛠 Machine Learning > 텍스트 분석' 카테고리의 다른 글

티스토리툴바

개인정보

단축키

내 블로그

블로그 게시글

모든 영역