Data Science LAB

[Python] Sentiment Analysis - Supervised Learning

ㅅ ㅜ ㅔ ㅇ 2022. 2. 19. 15:10
What is sentiment analysis?

Sentiment analysis is a method for identifying the subjective sentiment, opinions, emotions, and mood expressed in a document, and it is used in many areas such as social media, opinion polls, and online reviews. It computes a sentiment score from the subjective words in the text and their context. The score is broken down into positive and negative indices, and the indices are summed to decide whether the overall sentiment is positive or negative.

Sentiment analysis approaches fall into two broad groups: supervised and unsupervised learning.

- Supervised learning: a model is trained on training data with target labels, and is then used to predict the sentiment of other data.
- Unsupervised learning: a sentiment lexicon (a dictionary of sentiment-bearing vocabulary) supplies information about terms and their context, which is used to judge whether a document is positive or negative.
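
As a rough illustration of the lexicon-based idea, the sketch below sums per-word scores and labels the text by the sign of the total. The `TOY_LEXICON` dictionary and the `lexicon_sentiment` function are made up for this example; real sentiment lexicons are far larger and also take context into account.

```python
# Toy lexicon: hypothetical word scores, invented for illustration only.
TOY_LEXICON = {
    "good": 1.0, "great": 2.0, "love": 2.0,
    "bad": -1.0, "boring": -1.5, "terrible": -2.0,
}

def lexicon_sentiment(text):
    """Sum the lexicon scores of the words in text and label the total."""
    words = text.lower().split()
    score = sum(TOY_LEXICON.get(word, 0.0) for word in words)
    label = "positive" if score > 0 else "negative" if score < 0 else "neutral"
    return score, label

print(lexicon_sentiment("great cast but a boring and terrible plot"))
```

Words missing from the lexicon contribute nothing, which is one reason this approach depends heavily on the quality of the lexicon.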

Today's post walks through sentiment analysis with supervised learning! （￣︶￣）↗

Downloading the dataset

First, download the dataset from the Kaggle competition "Bag of Words Meets Bags of Popcorn":

https://www.kaggle.com/c/word2vec-nlp-tutorial/data

Hands-on example

1. Loading and inspecting the dataset

import pandas as pd

# Load the labeled training set (tab-separated); adjust the path to your own download location
review_df = pd.read_csv(r"C:\Users\suhye\Desktop\Kaggle\1.Popcorn\labeledTrainData.tsv\labeledTrainData.tsv", header=0, sep="\t")
review_df.head()

The loaded dataset has three columns: id, sentiment (1 = positive review, 0 = negative review), and review.

print(review_df['review'][0])

Printing the first movie review shows that it contains <br /> tags, because the text was extracted from HTML.

2. Cleaning the review text

import re

# Replace the <br /> HTML tags with a space (literal match, not a regex)
review_df['review'] = review_df['review'].str.replace('<br />', ' ', regex=False)

# Use Python's regular expression module re to replace every non-English-letter character with a space
review_df['review'] = review_df['review'].apply(lambda x: re.sub("[^a-zA-Z]", " ", x))

The replace function swaps each HTML tag for a space.

Characters other than English letters, such as special characters, carry little meaning as features here, so they are also replaced with spaces.

The pattern [^a-zA-Z] matches every character that is not an uppercase or lowercase English letter.
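
To see the effect of the two cleaning steps on a single string, here is a small standalone sketch. The `clean_review` helper and the sample sentence are made up for illustration; the replacement and the regular expression are the same as in the code above.

```python
import re

def clean_review(text):
    """Replace <br /> tags and every non-letter character with a space."""
    text = text.replace("<br />", " ")
    return re.sub(r"[^a-zA-Z]", " ", text)

# Made-up sample review containing HTML tags, digits, and punctuation.
sample = "Great movie!<br /><br />10/10, would watch again."
cleaned = clean_review(sample)

# Collapse the runs of spaces only for display.
print(" ".join(cleaned.split()))
```

Note that digits such as the "10/10" rating are removed along with the punctuation, so any information they carried is lost to the model.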

3. Splitting into train and test data

from sklearn.model_selection import train_test_split

# Target labels, and features with the id and sentiment columns dropped
class_df = review_df['sentiment']
feature_df = review_df.drop(['id', 'sentiment'], axis=1, inplace=False)

# Hold out 30% of the reviews as the test set
X_train, X_test, y_train, y_test = train_test_split(feature_df, class_df, test_size=0.3, random_state=156)

print("Train data shape : ", X_train.shape)
print("Test data shape : ", X_test.shape)

The train_test_split function splits the dataset into training data and test data.

The training set contains 17,500 reviews and the test set contains 7,500.
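
To make the 70/30 arithmetic concrete, here is a minimal hold-out split in plain Python. It only sketches the idea (shuffle, then cut) and is not scikit-learn's implementation; the `holdout_split` helper is hypothetical, and 25,000 row indices stand in for the labeled reviews.

```python
import random

def holdout_split(items, test_size=0.3, seed=156):
    """Shuffle a copy of items and cut off a test_size fraction as the test set."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    n_test = round(len(shuffled) * test_size)
    return shuffled[n_test:], shuffled[:n_test]

# 17,500 + 7,500 = 25,000 rows in total.
train_rows, test_rows = holdout_split(range(25000))
print(len(train_rows), len(test_rows))  # 17500 7500
```

Fixing the seed plays the same role as random_state above: the same rows end up in the test set on every run, so results stay comparable between experiments.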

4. Building the model (CountVectorizer)

from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, roc_auc_score

# Count vectorization with English stop words and ngram_range=(1, 2), followed by logistic regression
pipeline = Pipeline([('cnt_vect', CountVectorizer(stop_words='english', ngram_range=(1, 2))),
                     ('lr', LogisticRegression(C=10))])

# Fit and predict through the Pipeline
pipeline.fit(X_train['review'], y_train)
pred = pipeline.predict(X_test['review'])
pred_probs = pipeline.predict_proba(X_test['review'])[:, 1]

print("Accuracy : {0:.4f} ".format(accuracy_score(y_test, pred)))
print("ROC-AUC : {0:.4f}".format(roc_auc_score(y_test, pred_probs)))

The review text is converted into feature vectors, a classification algorithm is applied, and the predictive performance is measured.

A Pipeline object runs the feature vectorization and the logistic regression model in a single step.
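
To show what count vectorization with ngram_range=(1, 2) produces, the sketch below counts unigrams and bigrams by hand after dropping stop words. It is a simplified stand-in, not CountVectorizer itself: the `STOP_WORDS` set is a tiny made-up subset, and the tokenizer only lowercases and splits on whitespace.

```python
from collections import Counter

# Tiny, made-up stop word list; scikit-learn's built-in English list is much longer.
STOP_WORDS = {"the", "a", "was", "and", "is", "it"}

def count_ngrams(text, ngram_range=(1, 2)):
    """Count n-grams of each length in ngram_range over the non-stop-word tokens."""
    tokens = [t for t in text.lower().split() if t not in STOP_WORDS]
    low, high = ngram_range
    counts = Counter()
    for n in range(low, high + 1):
        for i in range(len(tokens) - n + 1):
            counts[" ".join(tokens[i:i + n])] += 1
    return counts

features = count_ngrams("the movie was great and the acting was great")
print(features)
```

In this sketch the bigrams are built after the stop words are removed, so a pair like "movie great" becomes a feature even though the two words were not adjacent in the original sentence.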

Accuracy and ROC-AUC both come out high, at 0.89 and 0.95 respectively.
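
For readers less familiar with the two metrics, the sketch below computes them by hand on a few made-up labels and scores. Accuracy is the share of correct predictions; ROC-AUC is computed here as the share of (positive, negative) pairs in which the positive example receives the higher score, with ties counted as half. The toy data and helper names are assumptions for illustration only.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

def roc_auc(y_true, scores):
    """Share of (positive, negative) pairs ranked correctly; ties count as 0.5."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Made-up labels and predicted probabilities of the positive class.
y_true = [1, 0, 1, 1, 0, 0]
scores = [0.9, 0.2, 0.4, 0.7, 0.6, 0.1]
y_pred = [1 if s >= 0.5 else 0 for s in scores]

print(accuracy(y_true, y_pred))  # 4 of 6 predictions are correct
print(roc_auc(y_true, scores))   # 8 of 9 pairs are ranked correctly
```

This is why the code above passes predicted probabilities, not hard 0/1 predictions, to roc_auc_score: the metric depends on how the examples are ranked, not on a single threshold.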

TfidfVectorizer

# Same settings as the CountVectorizer run, with TfidfVectorizer swapped in
pipeline = Pipeline([('tfidf_vect', TfidfVectorizer(stop_words='english', ngram_range=(1, 2))),
                     ('lr', LogisticRegression(C=10))])

pipeline.fit(X_train['review'], y_train)
pred = pipeline.predict(X_test['review'])
pred_probs = pipeline.predict_proba(X_test['review'])[:, 1]

print("Accuracy : {0:.4f} ".format(accuracy_score(y_test, pred)))
print("ROC-AUC : {0:.4f}".format(roc_auc_score(y_test, pred_probs)))

A model using TF-IDF vectors was also built under the same conditions as the count-vector model, and both accuracy and ROC-AUC increased slightly.
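
The difference from raw counts is the weighting: TF-IDF scales a term's count down when the term appears in many documents. The sketch below follows the smoothed formula given in the scikit-learn documentation, idf(t) = ln((1 + n) / (1 + df(t))) + 1, followed by L2 normalization of each document vector. It is a simplified illustration on made-up, pre-tokenized documents, not the library's code, and the `tfidf` helper is hypothetical.

```python
import math
from collections import Counter

def tfidf(docs):
    """Return one {term: weight} dict per tokenized document, L2-normalized."""
    n_docs = len(docs)
    # Document frequency: number of documents containing each term.
    df = Counter(term for doc in docs for term in set(doc))
    idf = {t: math.log((1 + n_docs) / (1 + d)) + 1 for t, d in df.items()}
    vectors = []
    for doc in docs:
        raw = {t: count * idf[t] for t, count in Counter(doc).items()}
        norm = math.sqrt(sum(w * w for w in raw.values()))
        vectors.append({t: w / norm for t, w in raw.items()})
    return vectors

# Made-up, already tokenized reviews (each must be non-empty).
docs = [["movie", "great"], ["movie", "boring"], ["movie", "great", "cast"]]
weights = tfidf(docs)
print(weights[1])
```

In the second document, "boring" gets a higher weight than "movie" because "movie" occurs in every document and therefore says little about any one review.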
