Data Science LAB


[Python] Topic Modeling (20 Newsgroups)

ใ…… ใ…œ ใ…” ใ…‡ 2022. 2. 22. 17:15

Topic Modeling

ํ† ํ”ฝ ๋ชจ๋ธ๋ง์ด๋ž€ ๋ฌธ์„œ ์ง‘ํ•ฉ์— ์ˆจ์–ด ์žˆ๋Š” ์ฃผ์ œ๋ฅผ ์ฐพ์•„๋‚ด๋Š” ๊ฒƒ์ด๋‹ค. ๋จธ์‹ ๋Ÿฌ๋‹ ๊ธฐ๋ฐ˜์˜ ํ† ํ”ฝ ๋ชจ๋ธ์€ ์ˆจ๊ฒจ์ง„ ์ฃผ์ œ๋ฅผ ํšจ๊ณผ์ ์œผ๋กœ ํ‘œํ˜„ํ•  ์ˆ˜ ์žˆ๋Š” ์ค‘์‹ฌ ๋‹จ์–ด๋ฅผ ํ•จ์ถ•์ ์œผ๋กœ ์ถ”์ถœํ•ด๋‚ธ๋‹ค. 


ํ† ํ”ฝ๋ชจ๋ธ๋ง์—์„œ๋Š” LDA(Latent Dirichlet Allocation)์„ ์ฃผ๋กœ ํ™œ์šฉํ•œ๋‹ค. ํ”ํžˆ ๋จธ์‹ ๋Ÿฌ๋‹์—์„œ ์‚ฌ์šฉํ•˜๋Š” LDA(Linear Discriminant Analysis)์™€๋Š” ๋‹ค๋ฅธ ์•Œ๊ณ ๋ฆฌ์ฆ˜์ด๋ฏ€๋กœ ์ฃผ์˜ํ•ด์•ผํ•œ๋‹ค. 


Here, topic modeling is applied to the 20 newsgroups dataset, a standard benchmark.

The 20 newsgroups dataset contains newsgroup posts covering 20 subjects; 8 of those subjects are extracted, and LDA-based topic modeling is applied to their text.

Loading the required libraries and extracting categories

from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

# Extract 8 subjects: Mac, Windows, baseball, hockey, Mideast, Christianity, electronics, medicine
cats = ['comp.sys.mac.hardware', 'comp.windows.x', 'rec.sport.baseball',
        'rec.sport.hockey', 'talk.politics.mideast', 'soc.religion.christian',
        'sci.electronics', 'sci.med']

# Fetch only the categories listed in cats
news_df = fetch_20newsgroups(subset='all', remove=('headers', 'footers', 'quotes'),
                             categories=cats, random_state=0)

# Apply count-based vectorization only
count_vect = CountVectorizer(max_df=0.95, max_features=1000, min_df=2,
                             stop_words='english', ngram_range=(1, 2))
feat_vect = count_vect.fit_transform(news_df.data)
print('CountVectorizer Shape : ', feat_vect.shape)

๋งฅ, ์œˆ๋„์šฐ์ฆˆ, ์•ผ๊ตฌ, ํ•˜ํ‚ค, ์ค‘๋™, ๊ธฐ๋…๊ต, ์ „์ž๊ณตํ•™, ์˜ํ•™ ์ด 8๊ฐœ์˜ ์ฃผ์ œ๋ฅผ ์ถ”์ถœํ•œ ํ›„ 

์ถ”์ถœ๋œ ํ…์ŠคํŠธ๋ฅผ Count๊ธฐ๋ฐ˜์œผ๋กœ ๋ฒกํ„ฐํ™” ๋ณ€ํ™˜ํ•˜์˜€๋‹ค. 

(LDA๋Š” Count๊ธฐ๋ฐ˜์˜ ๋ฒกํ„ฐํ™”๋งŒ ์ ์šฉ ๊ฐ€๋Šฅํ•จ)

 

feat_vect, the CountVectorizer output, is a matrix of 7,855 documents by 1,000 features.

LDA topic modeling is performed on this feature-vectorized dataset.

lda = LatentDirichletAllocation(n_components=8,random_state=0)
lda.fit(feat_vect)

print(lda.components_.shape)
lda.components_

LatentDirichletAllocation's n_components was set to 8, matching the number of subjects extracted above.

components_ holds, for each topic, a value indicating how strongly each word feature is assigned to that topic.

(The higher the value, the more central the word is to that topic.)

 

8๊ฐœ์˜ ํ† ํ”ฝ๋ณ„๋กœ 1000๊ฐœ์˜ word ํ”ผ์ฒ˜๊ฐ€ ํ•ด๋‹น ํ† ํ”ฝ๋ณ„๋กœ ์—ฐ๊ด€๋„ ๊ฐ’์„ ๊ฐ€์ง€๊ณ  ์žˆ๋‹ค. 

ํ•จ์ˆ˜ ์ƒ์„ฑ

def display_topics(model, feature_names, no_top_words):
    for topic_index, topic in enumerate(model.components_):
        print('Topic #', topic_index)

        # Sort the components_ row in descending order and take the indices of the largest values
        topic_word_indexes = topic.argsort()[::-1]
        top_indexes = topic_word_indexes[:no_top_words]

        # Look up the word feature for each top index and join them into one string
        feature_concat = ' '.join([feature_names[i] for i in top_indexes])
        print(feature_concat)


# Get the names of all words in the CountVectorizer object
# (get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out())
feature_names = count_vect.get_feature_names_out()

# Print the 15 most relevant words for each topic
display_topics(lda, feature_names, 15)

The display_topics() function prints each topic's words in descending order of association.
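Beyond inspecting top words, lda.transform() returns each document's topic mixture, which is how individual documents get assigned to topics. A minimal sketch on a made-up mini-corpus (the documents below are illustrative, not from the newsgroups data):

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

# Hypothetical mini-corpus for illustration only
docs = ["hockey game last night", "the team won the game",
        "new treatment for patients", "the doctor prescribed medicine"]
X = CountVectorizer().fit_transform(docs)
lda_demo = LatentDirichletAllocation(n_components=2, random_state=0).fit(X)

# transform() yields each document's topic distribution (rows sum to 1);
# argmax picks the dominant topic per document
doc_topic = lda_demo.transform(X)
dominant = doc_topic.argmax(axis=1)
print(doc_topic.shape)  # (n_docs, n_components)
print(dominant)
```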
