
๐Ÿ›  Machine Learning/ํ…์ŠคํŠธ ๋ถ„์„

[Python] Korean Text Processing - Sentiment Analysis of Naver Movie Ratings

ใ…… ใ…œ ใ…” ใ…‡ 2022. 2. 26. 19:59

๋“œ๋””์–ด Konlpy๋ฅผ ์„ค์น˜ํ•ด์„œ ์ฃผํ”ผํ„ฐ ๋…ธํŠธ๋ถ์—์„œ ์‚ฌ์šฉํ•  ์ˆ˜ ์žˆ๊ฒŒ ๋˜์—ˆ๋‹ค!

 

 

 

 

 

Related post: [Python] How to Install KoNLPy and Fix Errors
https://suhye.tistory.com/entry/%E3%85%9C?category=1037658
๋ผ์ด๋ธŒ๋Ÿฌ๋ฆฌ ๋ฐ ๋ฐ์ดํ„ฐ์…‹ ๋กœ๋”ฉ

import konlpy
import pandas as pd

# Load the NSMC training set; the columns are tab-separated
train = pd.read_csv(r'C:\Users\Naver\ratings_train.txt', sep='\t')
train.head()

๋จผ์ € ํƒญ(\t)๋กœ ์นผ๋Ÿผ์„ ๋ถ„๋ฆฌํ•˜๊ณ , ๋ฐ์ดํ„ฐํ”„๋ ˆ์ž„ ํ˜•์‹์œผ๋กœ ๋ฐ์ดํ„ฐ๋ฅผ ๋ถˆ๋Ÿฌ์˜จ๋‹ค. 


train['label'].value_counts()

ํ•™์Šต ๋ฐ์ดํ„ฐ ์…‹์˜ 0๊ณผ 1์˜ label ๊ฐ’์ด ๊ฐ๊ฐ 75173, 74827๋กœ ๊ฑฐ์˜ ๋™์ผํ•˜๊ฒŒ ๋ถ„ํฌ๋˜์–ด์žˆ๋Š” ๊ฒƒ์„ ํ™•์ธํ•จ

 

 

 

 

 

 

 

import re

# Replace missing reviews with a single space
train = train.fillna(' ')

# Replace every run of digits with a space
train['document'] = train['document'].apply(lambda x: re.sub(r"\d+", " ", x))

# Load the test set and apply the same preprocessing
test = pd.read_csv(r"C:\Users\suhye\Desktop\Machine Learning\Naver\ratings_test.txt", sep='\t')
test = test.fillna(' ')
test['document'] = test['document'].apply(lambda x: re.sub(r"\d+", " ", x))

# Drop the id column
train.drop('id', axis=1, inplace=True)
test.drop('id', axis=1, inplace=True)

์ •๊ทœ ํ‘œํ˜„์‹์„ ์ด์šฉํ•˜์—ฌ ์ˆซ์ž๋ฅผ ๊ณต๋ฐฑ์œผ๋กœ ๋ณ€๊ฒฝํ•˜๊ณ , 

ํ…Œ์ŠคํŠธ ๋ฐ์ดํ„ฐ ์…‹์„ ๋กœ๋”ฉํ•˜๊ณ  ๋™์ผํ•˜๊ฒŒ null๊ณผ ์ˆซ์ž๋ฅผ ๊ณต๋ฐฑ์œผ๋กœ ๋ณ€ํ™˜ํ•˜์˜€๋‹ค. 


๋ฌธ์žฅ์„ ํ˜•ํƒœ์†Œ ๋‹จ์–ด ํ˜•ํƒœ๋กœ ํ† ํฐํ™”ํ•˜์—ฌ list ๊ฐ์ฒด ๋ฐ˜ํ™˜

from konlpy.tag import Twitter   # renamed to Okt in KoNLPy >= 0.5.0; Twitter still works with a deprecation warning

twitter = Twitter()

# Tokenize a sentence into morphemes and return them as a list
def tw_tokenizer(text):
    tokens_ko = twitter.morphs(text)
    return tokens_ko

Twitter ํด๋ž˜์Šค๋Š” SNS ๋ถ„์„์— ์ ํ•ฉํ•˜๋‹ค. 

Twitter์˜ morphs() ๋ฉ”์„œ๋“œ๋ฅผ ์ด์šฉํ•˜๋ฉด ์ž…๋ ฅ ์ธ์ž๋กœ ๋“ค์–ด์˜จ ๋ฌธ์žฅ์„ ํ˜•ํƒœ์†Œ ๋‹จ์–ด ํ˜•ํƒœ๋กœ ํ† ํฐํ™”ํ•˜์—ฌ ๋ฆฌ์ŠคํŠธ ํ˜•ํƒœ๋กœ ๋ฐ˜ํ™˜์‹œ์ผœ์ค€๋‹ค. 


Creating the TF-IDF model

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

# Use tw_tokenizer (based on Twitter's morphs()) as the tokenizer
tfidf_vect = TfidfVectorizer(tokenizer=tw_tokenizer, ngram_range=(1, 2), min_df=3, max_df=0.9)
tfidf_matrix_train = tfidf_vect.fit_transform(train['document'])

์‚ฌ์ดํ‚ท๋Ÿฐ์˜ TfidfVectorizer๋ฅผ ์ด์šฉํ•˜์—ฌ TF-IDF ํ”ผ์ฒ˜ ๋ชจ๋ธ์„ ์ƒ์„ฑํ•˜์˜€๋‹ค. 


lr = LogisticRegression()

# Search over the inverse regularization strength C with 3-fold CV
params = {'C': [1, 3.5, 4.5, 5.5, 10]}
grid = GridSearchCV(lr, param_grid=params, cv=3, scoring='accuracy', verbose=1)
grid.fit(tfidf_matrix_train, train['label'])

print("best params : {}".format(grid.best_params_))
print("best score : {:.3f}".format(grid.best_score_))

๋กœ์ง€์Šคํ‹ฑ ํšŒ๊ท€๋ถ„์„์„ ์ด์šฉํ•˜์—ฌ ๋ถ„๋ฅ˜ ๊ธฐ๋ฐ˜์˜ ๊ฐ์„ฑ ๋ถ„์„์„ ์ˆ˜ํ–‰ํ•˜์˜€๋‹ค. 

์ •ํ™•๋„ ํ–ฅ์ƒ์„ ์œ„ํ•ด GridSearch๊นŒ์ง€ ํ•ด์ฃผ๋ฉด


from sklearn.metrics import accuracy_score

# Transform the test data into TF-IDF features with the vectorizer fitted on the training set
tfidf_matrix_test = tfidf_vect.transform(test['document'])

# Use the best classifier found by the grid search
best_estimator = grid.best_estimator_
preds = best_estimator.predict(tfidf_matrix_test)

print("Logistic Regression accuracy : ", accuracy_score(test['label'], preds))
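With the fitted vectorizer and the best estimator, unseen reviews can be scored the same way (made-up example sentences; the predictions depend on the trained model):

# Same transform -> predict pipeline on new text; NSMC labels: 1 = positive, 0 = negative
new_reviews = ["배우 연기가 정말 좋았다", "시간이 아까운 영화"]
new_matrix = tfidf_vect.transform(new_reviews)
print(best_estimator.predict(new_matrix))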
