
Data Science LAB

[Python] Sentiment Analysis of Movie Reviews with SentiWordNet and VADER

ㅅ ㅜ ㅔ ㅇ 2022. 2. 21. 10:57

In the previous post I studied WordNet and SentiWordNet, so in this post I will run sentiment analysis on IMDB movie reviews using SentiWordNet.

https://suhye.tistory.com/entry/%E3%85%87-1?category=1040378 

 

[Python] ๊ฐ์„ฑ๋ถ„์„ - ๋น„์ง€๋„ ํ•™์Šต

์ด์ „ ํฌ์ŠคํŒ…(์ง€๋„ํ•™์Šต)์— ์ด์–ด์„œ ๋น„์ง€๋„ ํ•™์Šต์˜ ๊ฐ์„ฑ ๋ถ„์„๊นŒ์ง€ ๊ณต๋ถ€ํ•ด ๋ณด๋ ค๊ณ  ํ•œ๋‹ค! https://suhye.tistory.com/entry/mn?category=1040378 [Python] ๊ฐ์„ฑ ๋ถ„์„(Sentiment Analysis) - ์ง€๋„ํ•™์Šต ๊ฐ์„ฑ๋ถ„์„ ์ด๋ž€? ๊ฐ..

suhye.tistory.com

 

 

 

 

 

๊ฐ์„ฑ ๋ถ„์„ ์ˆœ์„œ๋Š” ๋‹ค์Œ๊ณผ ๊ฐ™๋‹ค. 

1. ๋ฌธ์„œ๋ฅผ ๋ฌธ์žฅ ๋‹จ์œ„๋กœ ๋ถ„ํ•ด

2. ๋ฌธ์žฅ์„ ๋‹จ์–ด ๋‹จ์œ„๋กœ ํ† ํฐํ™” ํ•œ ๋’ค ํ’ˆ์‚ฌ ํƒœ๊น…

3. ํ’ˆ์‚ฌ ํƒœ๊น…๋œ ๋‹จ์–ด ๊ธฐ๋ฐ˜์œผ๋กœ synset ๊ฐ์ฒด์™€ senti_synset ๊ฐ์ฒด ์ƒ์„ฑ

4.Senti_synset์—์„œ ๊ธ์ •, ๋ถ€์ •์˜ ๊ฐ์„ฑ์ง€์ˆ˜๋ฅผ ๊ตฌํ•˜๊ณ  ์ด๋ฅผ ๋ชจ๋‘ ํ•ฉ์‚ฐํ•˜์—ฌ, ํŠน์ • ๊ฐ’ ์ด์ƒ์ผ ๋•Œ ๊ธ์ •, ์•„๋‹ ๋•Œ์—๋Š” ๋ถ€์ • ๊ฐ์„ฑ์œผ๋กœ ๊ฒฐ์ •
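The decision rule in step 4 can be sketched without NLTK. The snippet below is only an illustration: TOY_LEXICON and toy_polarity are hypothetical names, and the scores are made-up numbers, not real SentiWordNet values.

```python
# Hypothetical mini-lexicon: word -> (positive score, negative score).
# The numbers are invented for illustration; they are NOT SentiWordNet values.
TOY_LEXICON = {
    "great": (0.75, 0.0),
    "boring": (0.0, 0.5),
    "plot": (0.0, 0.0),
    "awful": (0.0, 0.875),
}


def toy_polarity(words, threshold=0.0):
    """Sum (pos - neg) over the words; return 1 if the sum >= threshold, else 0."""
    total = 0.0
    for word in words:
        # Words missing from the lexicon contribute nothing to the total
        pos, neg = TOY_LEXICON.get(word.lower(), (0.0, 0.0))
        total += pos - neg
    return 1 if total >= threshold else 0


print(toy_polarity(["Great", "plot"]))            # net score 0.75 -> 1
print(toy_polarity(["boring", "awful", "plot"]))  # net score -1.375 -> 0
```

With a threshold of 0, a document whose scores cancel out exactly is still labeled positive, so the choice of threshold matters for borderline reviews.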

 

 

 

 

SentiWordNet

Creating a helper function for POS tag conversion

from nltk.corpus import wordnet as wn

def penn_to_wn(tag):
    """Convert a Penn Treebank POS tag to the matching WordNet POS constant."""
    if tag.startswith('J'):
        return wn.ADJ
    elif tag.startswith('N'):
        return wn.NOUN
    elif tag.startswith('R'):
        return wn.ADV
    elif tag.startswith('V'):
        return wn.VERB
    # Any other tag (determiners, prepositions, ...) has no WordNet counterpart
    return None

After importing the required library, we define a function that converts the Penn Treebank tags produced by NLTK's tagger into WordNet POS tags.
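To see what this conversion does to individual tags, here is a small NLTK-free sketch. It assumes that the one-letter codes below mirror NLTK's wn.ADJ, wn.NOUN, wn.ADV and wn.VERB constants; penn_to_wn_sketch and PENN_PREFIX_TO_WN are hypothetical names used only for this illustration.

```python
# Assumption: these one-letter codes mirror NLTK's wn.ADJ, wn.NOUN, wn.ADV, wn.VERB.
ADJ, NOUN, ADV, VERB = "a", "n", "r", "v"

# Only the first letter of a Penn Treebank tag decides the WordNet POS
PENN_PREFIX_TO_WN = {"J": ADJ, "N": NOUN, "R": ADV, "V": VERB}


def penn_to_wn_sketch(tag):
    """Map a Penn Treebank tag to a WordNet POS code, or None if there is no match."""
    return PENN_PREFIX_TO_WN.get(tag[:1])


for tag in ["JJ", "NNS", "RB", "VBD", "DT", "IN"]:
    print(tag, "->", penn_to_wn_sketch(tag))
```

Tags such as DT (determiner) and IN (preposition) map to None, which is why code that calls the converter needs to skip words without a WordNet POS.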

 

 

 

 

 

 

Creating a function that goes sentence ⇨ words ⇨ POS tagging, builds SentiSynset objects, and sums the polarity scores

from nltk.stem import WordNetLemmatizer
from nltk.corpus import sentiwordnet as swn
from nltk import sent_tokenize, word_tokenize, pos_tag


def swn_polarity(text):
    # Initialize the sentiment score and the count of scored tokens
    sentiment = 0.0
    tokens_count = 0

    lemmatizer = WordNetLemmatizer()
    raw_sentences = sent_tokenize(text)

    # For each sentence: tokenize into words -> POS-tag -> build SentiSynset -> add up sentiment scores
    for raw_sentence in raw_sentences:
        # POS-tag the tokenized sentence with NLTK
        tagged_sentence = pos_tag(word_tokenize(raw_sentence))
        for word, tag in tagged_sentence:

            # Convert to a WordNet POS tag; only nouns, adjectives, and adverbs are scored
            wn_tag = penn_to_wn(tag)
            if wn_tag not in (wn.NOUN, wn.ADJ, wn.ADV):
                continue
            # Lemmatize the word using its WordNet POS
            lemma = lemmatizer.lemmatize(word, pos=wn_tag)
            if not lemma:
                continue
            # Look up the synsets for the lemma and its WordNet POS
            synsets = wn.synsets(lemma, pos=wn_tag)
            if not synsets:
                continue
            # Take the first (most common) synset and fetch its SentiWordNet entry.
            # Positive scores are added and negative scores subtracted for every word.
            synset = synsets[0]
            swn_synset = swn.senti_synset(synset.name())
            sentiment += (swn_synset.pos_score() - swn_synset.neg_score())
            tokens_count += 1

    # No scorable tokens: fall back to negative (0)
    if not tokens_count:
        return 0

    # Return 1 (positive) if the total score is 0 or higher, otherwise 0 (negative)
    if sentiment >= 0:
        return 1

    return 0

 

 

 

 

 

 

 

 

 

Applying swn_polarity(text) to each IMDB review

# review_df is assumed to be the IMDB review DataFrame loaded earlier (not shown in this post)
review_df['preds'] = review_df['review'].apply(swn_polarity)
y_target = review_df['sentiment'].values
preds = review_df['preds'].values

 

 

 

 

 

 

 

๊ฐ์„ฑ ๋ถ„์„ ์„ฑ๋Šฅ ์˜ˆ์ธก

from sklearn.metrics import accuracy_score, confusion_matrix, precision_score
from sklearn.metrics import recall_score,f1_score, roc_auc_score
import numpy as np

print("confusion matrix : ",confusion_matrix(y_target,preds))
print("์ •ํ™•๋„ : {:.3f} ".format(accuracy_score(y_target,preds)))
print("์ •๋ฐ€๋„ : {:.3f}".format(precision_score(y_target,preds)))
print("์žฌํ˜„์œจ : {:.3f}".format(recall_score(y_target,preds)))

์ •ํ™•๋„์™€ ์ •๋ฐ€๋„, ์žฌํ˜„์œจ์ด 0.6-0.7์‚ฌ์ด๋ฅผ ๋ณด์ด๊ธฐ ๋•Œ๋ฌธ์— ๋†’์ง€๋Š” ์•Š๋‹ค. 

 

 

 


 

 

 

 

 

VADER

VADER is a rule-based lexicon built for sentiment analysis of social media text. Its SentimentIntensityAnalyzer class makes sentiment analysis easy to run.

Since nltk.download('all') was already run in an earlier post, no separate download code is executed here.

To start, I will simply run the analysis on just one of the reviews in review_df.

VADER is updated continuously, so the output may differ depending on the installed version.

 

 

 

 

 

 

from nltk.sentiment.vader import SentimentIntensityAnalyzer

senti_analyzer = SentimentIntensityAnalyzer()
senti_scores = senti_analyzer.polarity_scores(review_df['review'][0])
print(senti_scores)

 

After creating a SentimentIntensityAnalyzer object, calling its polarity_scores() method on a document returns the sentiment scores directly.

'neg' : negative sentiment score

'neu' : neutral sentiment score

'pos' : positive sentiment score

'compound' : a single overall sentiment score between -1 and 1, obtained by normalizing the summed word-level valence scores (closer to 1 is more positive, closer to -1 more negative)
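As a rough illustration of how an unbounded raw score can be squashed into the -1 to 1 range, the sketch below uses the form x / sqrt(x² + alpha). To my knowledge the reference VADER implementation normalizes its summed valence this way with alpha = 15, but treat both the formula and the constant as assumptions to check against the installed version; normalize_compound is a hypothetical name.

```python
import math


def normalize_compound(raw_sum, alpha=15.0):
    """Squash an unbounded raw valence sum into [-1, 1].

    Assumption: mirrors the x / sqrt(x^2 + alpha) normalization believed
    to be used by VADER; alpha=15 is taken on that assumption.
    """
    norm = raw_sum / math.sqrt(raw_sum * raw_sum + alpha)
    # Clamp defensively in case of floating-point overshoot
    return max(-1.0, min(1.0, norm))


for raw in (-10.0, -1.0, 0.0, 1.0, 4.0, 10.0):
    print(raw, "->", round(normalize_compound(raw), 4))
```

The function is symmetric around 0 and saturates toward ±1 as the raw sum grows, so a long, strongly worded review does not produce an arbitrarily large score.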

 

 

 

 

 

# Create the analyzer once and reuse it, instead of rebuilding it for every review
_vader_analyzer = SentimentIntensityAnalyzer()

def vader_polarity(review, threshold=0.1):
    scores = _vader_analyzer.polarity_scores(review)

    # Return 1 if the compound score is greater than or equal to the threshold, otherwise 0
    agg_score = scores['compound']
    final_sentiment = 1 if agg_score >= threshold else 0
    return final_sentiment


# Run vader_polarity() on each record with apply and store the result in 'vader_preds'
review_df['vader_preds'] = review_df['review'].apply(lambda x: vader_polarity(x, 0.1))
y_target = review_df['sentiment'].values
vader_preds = review_df['vader_preds'].values

print("Confusion matrix : ", confusion_matrix(y_target, vader_preds))
print("Accuracy : {:.3f}".format(accuracy_score(y_target, vader_preds)))
print("Precision : {:.3f}".format(precision_score(y_target, vader_preds)))
print("Recall : {:.3f}".format(recall_score(y_target, vader_preds)))

The vader_polarity() function takes two input parameters: the movie review text and the threshold that decides between positive and negative.

It calls the polarity_scores() method of a SentimentIntensityAnalyzer object and returns the resulting sentiment label.

Accuracy improved compared with SentiWordNet, and recall rose substantially to 0.851.
