[Python] ๋‰ด์Šค ๊ทธ๋ฃน ๋ถ„๋ฅ˜

ใ…… ใ…œ ใ…” ใ…‡ 2022. 2. 19. 15:09
728x90

์‚ฌ์ดํ‚ท๋Ÿฐ ๋‚ด๋ถ€์˜ ์˜ˆ์ œ ๋ฐ์ดํ„ฐ์ธ 20 ๋‰ด์Šค๊ทธ๋ฃน ๋ฐ์ดํ„ฐ ์…‹์„ ํ™œ์šฉํ•ด ํ…์ŠคํŠธ ๋ถ„๋ฅ˜ ์‹ค์Šต์„ ํ•ด๋ณด๋ ค๊ณ  ํ•œ๋‹ค. 

ํ…์ŠคํŠธ ๋ถ„๋ฅ˜๋Š” ํ•™์Šต ๋ฐ์ดํ„ฐ๋ฅผ ํ†ตํ•ด ๋ชจ๋ธ์„ ํ•™์Šต ์‹œํ‚จ ํ›„ ์ด ํ•™์Šต ๋ชจ๋ธ์„ ์ด์šฉํ•ด ๋‹ค๋ฅธ ๋ฌธ์„œ์˜ ๋ถ„๋ฅ˜๋ฅผ ์˜ˆ์ธกํ•ด ๋ณด๋ ค๊ณ  ํ•œ๋‹ค. 


Count๊ธฐ๋ฐ˜์˜ ๋กœ์ง€์Šคํ‹ฑ ํšŒ๊ท€๋ชจ๋ธ๊ณผ, TF-IDF๊ธฐ๋ฐ˜์˜ ๋กœ์ง€์Šคํ‹ฑ ํšŒ๊ท€๋ชจ๋ธ์„ ๊ฐ๊ฐ ์ƒ์„ฑํ•œ ํ›„ ๋น„๊ตํ•ด๋ณด๊ณ 

ํ•˜์ดํผํŒŒ๋ผ๋ฏธํ„ฐ ์กฐ์ •๊นŒ์ง€ ํ•ด๋ณด๋ ค๊ณ  ํ•œ๋‹ค( •ฬ€ ω •ฬ )โœง


1. ํ…์ŠคํŠธ ์ •๊ทœํ™”

fetch_20newsgroups()๋Š” ์ธํ„ฐ๋„ท์—์„œ ๋กœ์ปฌ ์ปดํ“จํ„ฐ๋กœ ๋จผ์ € ๋ฐ์ดํ„ฐ๋ฅผ ๋ฐ›์€ ํ›„, ๋ฉ”๋ชจ๋ฆฌ๋กœ ๋ฐ์ดํ„ฐ๋ฅผ ๋กœ๋”ฉํ•œ๋‹ค. 

from sklearn.datasets import fetch_20newsgroups

news_data = fetch_20newsgroups(subset = 'all', random_state = 156)

#์–ด๋–ค key๊ฐ’์„ ๊ฐ€์ง€๊ณ  ์žˆ๋Š”์ง€ ํ™•์ธ
print(news_data.keys())

fillenames๋Š” fetch_20newsgroups API๊ฐ€ ์ธํ„ฐ๋„ท์—์„œ ๋‚ด๋ ค๋ฐ›์•„ ๋กœ์ปฌ ์ปดํ“จํ„ฐ์— ์ €์žฅํ•˜๋Š” ๋””๋ ‰ํ„ฐ๋ฆฌ์™€ ํŒŒ์ผ๋ช…์„ ์ง€์นญ


import pandas as pd

print("target ํด๋ž˜์Šค์˜ ๊ฐ’๊ณผ ๋ถ„ํฌ๋„ : \n",pd.Series(news_data.target).value_counts().sort_index())
print("target ํด๋ž˜์Šค์˜ ์ด๋ฆ„๋“ค : \n",news_data.target_names)

Targetํด๋ž˜์Šค์˜ ๊ฐ’์€ 0-19(20๊ฐœ)๊นŒ์ง€๋กœ ๋ถ„ํฌ๋˜์–ด์žˆ์Œ


print(news_data.data[0])

๋ถˆ๋Ÿฌ์˜จ ๋ฐ์ดํ„ฐ ์ค‘ ๊ฐ€์žฅ ์ฒซ๋ฒˆ์งธ ๋ฐ์ดํ„ฐ ํ•˜๋‚˜๋งŒ ํ™•์ธํ•ด ๋ณธ ๊ฒฐ๊ณผ, 

์ œ๋ชฉ, ์ž‘์„ฑ์ž, ์†Œ์†, ์ด๋ฉ”์ผ, ๊ธฐ์‚ฌ ๋‚ด์šฉ ๋“ฑ ๋‹ค์–‘ํ•œ ์ •๋ณด๋ฅผ ๊ฐ€์ง€๊ณ  ์žˆ์Œ


๋ชจ๋“  ํ”ผ์ฒ˜๋ฅผ ํฌํ•จํ•ด ๋จธ์‹ ๋Ÿฌ๋‹ ์•Œ๊ณ ๋ฆฌ์ฆ˜์„ ์ง„ํ–‰ํ•˜๋ฉด ๋†’์€ ์„ฑ๋Šฅ์„ ๊ฐ€์ง€๊ฒŒ ๋จ -> ๊ธฐ์‚ฌ ๋‚ด์šฉ๋งŒ์„ ์ด์šฉํ•ด ํ…์ŠคํŠธ ๋ถ„์„ ์ง„ํ–‰


ํ•™์Šต์šฉ/ํ…Œ์ŠคํŠธ์šฉ ๋ฐ์ดํ„ฐ์˜ ๋‚ด์šฉ๋งŒ์„ ์ถ”์ถœ
#ํ•™์Šต์šฉ ๋ฐ์ดํ„ฐ์˜ ๋‚ด์šฉ๋งŒ ์ถ”์ถœ
train_news = fetch_20newsgroups(subset = 'train', remove=('headers','footers','quotes'),random_state=156)

X_train = train_news.data
y_train = train_news.target


#ํ…Œ์ŠคํŠธ ๋ฐ์ดํ„ฐ์˜ ๋‚ด์šฉ๋งŒ ์ถ”์ถœ
test_news = fetch_20newsgroups(subset = 'test',remove=('headers','footers','quotes'),random_state=156)

X_test = test_news.data
y_test = test_news.target

print("ํ•™์Šต ๋ฐ์ดํ„ฐ ํฌ๊ธฐ : {}".format(len(train_news.data)))
print("ํ…Œ์ŠคํŠธ ๋ฐ์ดํ„ฐ ํฌ๊ธฐ : {}".format(len(test_news.data)))

ํ•™์Šต ๋ฐ์ดํ„ฐ๋Š” 11314๊ฐœ์˜ ๋‰ด์Šค๊ทธ๋ฃน ๋ฌธ์„œ๊ฐ€ ๋ฆฌ์ŠคํŠธ ํ˜•ํƒœ๋กœ ์ฃผ์–ด์ง€๊ณ , 

ํ…Œ์ŠคํŠธ ๋ฐ์ดํ„ฐ๋Š” 7532๊ฐœ์˜ ๋‰ด์Šค๊ทธ๋ฃน ๋ฌธ์„œ๊ฐ€ ๋ฆฌ์ŠคํŠธ ํ˜•ํƒœ๋กœ ์ฃผ์–ด์ง„ ๊ฒƒ์„ ํ™•์ธํ•  ์ˆ˜ ์žˆ์Œ


ํ”ผ์ฒ˜ ๋ฒกํ„ฐํ™” ๋ณ€ํ™˜๊ณผ ๋จธ์‹ ๋Ÿฌ๋‹ ๋ชจ๋ธ ํ•™์Šต, ์˜ˆ์ธก, ํ‰๊ฐ€(Count๊ธฐ๋ฐ˜)
from sklearn.feature_extraction.text import CountVectorizer

#ํ”ผ์ฒ˜ ๋ฒกํ„ฐํ™” ๋ณ€ํ™˜ ์ˆ˜ํ–‰
cnt_vect = CountVectorizer()
cnt_vect.fit(X_train)
X_train_cnt_vect = cnt_vect.transform(X_train)

#ํ•™์Šต๋ฐ์ดํ„ฐ๋กœ ์ƒ์„ฑ๋œ CountVectirizer๋ฅผ ์ด์šฉํ•ด ํ…Œ์ŠคํŠธ ๋ฐ์ดํ„ฐ ํ”ผ์ฒ˜ ๋ฒกํ„ฐํ™” ๋ณ€ํ™˜
X_test_cnt_vect = cnt_vect.transform(X_test)


print("ํ•™์Šต ๋ฐ์ดํ„ฐ CountVectorizer Shape : ",X_train_cnt_vect.shape)

ํ•™์Šต๋ฐ์ดํ„ฐ๋ฅผ CountVectorizer๋กœ ํ”ผ์ฒ˜๋ฅผ ์ถ”์ถœํ•œ ๊ฒฐ๊ณผ, 11314๊ฐœ์˜ ๋ฌธ์„œ์—์„œ ๋‹จ์–ด๊ฐ€ 101631๊ฐœ๋กœ ๋งŒ๋“ค์–ด์ง


from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

#๋กœ์ง€์Šคํ‹ฑํšŒ๊ท€๋ถ„์„์œผ๋กœ ํ•™์Šต
lr = LogisticRegression()
lr.fit(X_train_cnt_vect,y_train)

#์˜ˆ์ธก
lr_pred = lr.predict(X_test_cnt_vect)

#ํ‰๊ฐ€ 
print('CountVectorized Logistic Regression ์˜ˆ์ธก ์ •ํ™•๋„ : {0:3f}'.format(accuracy_score(y_test,lr_pred)))

ํ”ผ์ฒ˜ ๋ฒกํ„ฐํ™”๋œ ๋ฐ์ดํ„ฐ์— ๋กœ์ง€์Šคํ‹ฑ ํšŒ๊ท€๋ฅผ ์ ์šฉํ•ด ๋‰ด์Šค๊ทธ๋ฃน์„ ๋ถ„๋ฅ˜ํ•œ ๊ฒฐ๊ณผ, 

accuracy_score๋Š” 0.607๋กœ ๋‚˜ํƒ€๋‚จ


ํ”ผ์ฒ˜ ๋ฒกํ„ฐํ™” ๋ณ€ํ™˜๊ณผ ๋จธ์‹ ๋Ÿฌ๋‹ ๋ชจ๋ธ ํ•™์Šต, ์˜ˆ์ธก, ํ‰๊ฐ€(Count๊ธฐ๋ฐ˜)
from sklearn.feature_extraction.text import TfidfVectorizer

# Apply TF-IDF vectorization
tfidf_vect = TfidfVectorizer()
tfidf_vect.fit(X_train)
X_train_tfidf_vect = tfidf_vect.transform(X_train)
X_test_tfidf_vect = tfidf_vect.transform(X_test)


#๋กœ์ง€์Šคํ‹ฑํšŒ๊ท€ ์ ์šฉ
lr = LogisticRegression()
lr.fit(X_train_tfidf_vect,y_train)
pred = lr.predict(X_test_tfidf_vect)
print('TF-IDF Logistic Regression ์˜ˆ์ธก ์ •ํ™•๋„ : {0:3f}'.format(accuracy_score(y_test,pred)))

TF-IDF ์˜ accuracy_score๋Š” 0.674๋กœ Count๊ธฐ๋ฐ˜๋ณด๋‹ค ์ข€ ๋” ๋†’์€ ๊ฒƒ์„ ํ™•์ธํ•  ์ˆ˜ ์žˆ์—ˆ๋‹ค. 


# Add stop-word filtering, change ngram_range to (1,2), and cap document frequency with max_df=300
tfidf_vect = TfidfVectorizer(stop_words='english',ngram_range = (1,2),max_df = 300)
tfidf_vect.fit(X_train)
X_train_tfidf_vect = tfidf_vect.transform(X_train)
X_test_tfidf_vect = tfidf_vect.transform(X_test)

lr = LogisticRegression()
lr.fit(X_train_tfidf_vect,y_train)

pred = lr.predict(X_test_tfidf_vect)
print("TF-IDF ๋กœ์ง€์Šคํ‹ฑ ํšŒ๊ท€์˜ ์˜ˆ์ธก ์ •ํ™•๋„ : {0:.3f}".format(accuracy_score(y_test,pred)))

stop wordsํ•„ํ„ฐ๋ง์„ ์ถ”๊ฐ€ํ•œ ํ›„ ngram_range๋ฅผ (1,2)๋กœ ์„ค์ •ํ•˜์˜€๋”๋‹ˆ, accuracy_score๊ฐ€ 0.692๊นŒ์ง€ ๋†’์•„์กŒ๋‹ค!


Running GridSearchCV
from sklearn.model_selection import GridSearchCV

#์ตœ์  C๊ฐ’ ๋„์ถœ ํŠœ๋‹ ์ˆ˜ํ–‰ ๋ฐ CV๋Š” 3 ํด๋“œ ์…‹ ์ง„ํ–‰

params = {'C':[0.01,0.1,1,5,10]}
grid_cv_lr = GridSearchCV(lr,param_grid = params,cv=3, scoring = 'accuracy',verbose=1)
grid_cv_lr.fit(X_train_tfidf_vect,y_train)
print("๋กœ์ง€์Šคํ‹ฑ ํšŒ๊ท€์˜ best C Parameter : ",grid_cv_lr.best_params_)

#์ตœ์  C๊ฐ’์œผ๋กœ ํ•™์Šต๋œ grid_cv๋กœ ์˜ˆ์ธก ๋ฐ ์ •ํ™•๋„ ํ‰๊ฐ€
pred = grid_cv_lr.predict(X_test_tfidf_vect)
print("TF-IDF Vectorized Logistic Regression์˜ ์ •ํ™•๋„ : {0:3f}".format(accuracy_score(y_test,pred)))

์ตœ์  ํŒŒ๋ผ๋ฏธํ„ฐ C๋Š” 10์œผ๋กœ ๋‚˜ํƒ€๋‚ฌ์œผ๋ฉฐ, ๋กœ์ง€์Šคํ‹ฑํšŒ๊ท€์˜ ์ •ํ™•๋„๋Š” 0.701๊นŒ์ง€ ์ƒ์Šนํ•œ ๊ฒƒ์„ ํ™•์ธํ•  ์ˆ˜ ์žˆ๋‹ค.


์‚ฌ์ดํ‚ท๋Ÿฐ ํŒŒ์ดํ”„๋ผ์ธ ์‚ฌ์šฉ ๋ฐ GridSearchCV์™€์˜ ๊ฒฐํ•ฉ
from sklearn.pipeline import Pipeline

# Pipeline that creates a TfidfVectorizer object named 'tfidf_vect' and a logistic regression object named 'lr'

pipeline = Pipeline([('tfidf_vect',TfidfVectorizer(stop_words = 'english',ngram_range = (1,2),max_df=300)),
                    ('lr',LogisticRegression(C=10))])
pipeline.fit(X_train,y_train)
pred = pipeline.predict(X_test)
print("pipeline์„ ํ†ตํ•œ Logistic Regression์˜ ์˜ˆ์ธก ์ •ํ™•๋„ : {0:.3f}".format(accuracy_score(y_test,pred)))

์‚ฌ์ดํ‚ท๋Ÿฐ์˜ Pipeline ํด๋ž˜์Šค๋ฅผ ์‚ฌ์šฉํ•˜๋ฉด ํ”ผ์ฒ˜ ๋ฒกํ„ฐํ™”์™€ ๋จธ์‹ ๋Ÿฌ๋‹ ์•Œ๊ณ ๋ฆฌ์ฆ˜ ํ•™์Šต, ์˜ˆ์ธก์„ ํ•œ ๋ฒˆ์— ์ง„ํ–‰ํ•  ์ˆ˜ ์žˆ๋‹ค. 

Pipeline์„ ์ด์šฉํ•˜๋ฉด ๋ฐ์ดํ„ฐ์˜ ์ „์ฒ˜๋ฆฌ์™€ ๋จธ์‹ ๋Ÿฌ๋‹ ํ•™์Šต ๊ณผ์ •์„ ํ†ต์ผ๋œ API ๊ธฐ๋ฐ˜์—์„œ ์ฒ˜๋ฆฌํ•  ์ˆ˜ ์žˆ์–ด ๋” ์ง๊ด€์ ์ธ ๋จธ์‹ ๋Ÿฌ๋‹ ๋ชจ๋ธ ์ฝ”๋“œ๋ฅผ ์ƒ์„ฑํ•  ์ˆ˜ ์žˆ๋‹ค. ๋˜ํ•œ ๋Œ€์šฉ๋Ÿ‰ ๋ฐ์ดํ„ฐ์˜ ํ”ผ์ฒ˜ ๋ฒกํ„ฐํ™”๋ฅผ ๋ณ„๋„ ๋ฐ์ดํ„ฐ๋กœ ์ €์žฅํ•˜์ง€ ์•Š๊ธฐ ๋•Œ๋ฌธ์— ์ˆ˜ํ–‰์‹œ๊ฐ„์„ ์ข€ ๋” ์ ˆ์•ฝํ•  ์ˆ˜ ์žˆ๋‹ค. 


๋ณ„๋„์˜ TfidfVectorizer์™€ LogisticRegression์˜ fit(), transform(), predict()๋ฅผ ์ˆ˜ํ–‰ํ•  ํ•„์š”๊ฐ€ ์—†๋‹ค. 


from sklearn.pipeline import Pipeline

pipeline = Pipeline([('tfidf_vect',TfidfVectorizer(stop_words='english')),
                    ('lr',LogisticRegression())])

#Pipeline์˜ ๊ฐ๊ฐ์˜ ๊ฐ์ฒด ๋ณ€์ˆ˜์— ์–ธ๋”๋ฐ”(_) 2๊ฐœ๋ฅผ ์—ฐ๋‹ฌ์•„ ๋ถ™์—ฌ GridSearchCV์— ์‚ฌ์šฉ๋  ํŒŒ๋ผ๋ฏธํ„ฐ,ํ•˜์ดํผํŒŒ๋ผ๋ฏธํ„ฐ ์ด๋ฆ„๊ณผ ๊ฐ’ ์„ค์ •
params = {'tfidf_vect__ngram_range':[(1,1),(1,2),(1,3)],
         'tfidf_vect__max_df': [100,300,700],
         'lr__C':[1,5,10]}


#GridSearchCV์˜ ์ƒ์„ฑ์ž์— Estimator๊ฐ€ ์•„๋‹Œ Pipeline๊ฐ์ฒด ์ž…๋ ฅ
grid_cv_pipe = GridSearchCV(pipeline,param_grid=params,cv=3,scoring='accuracy',verbose=1)
grid_cv_pipe.fit(X_train,y_train)
print(grid_cv_pipe.best_params_,grid_cv_pipe.best_score_)

pred = grid_cv_pipe.predict(X_test)
print('Pipeline์„ ํ†ตํ•œ Logistic Regression์˜ ์˜ˆ์ธก ์ •ํ™•๋„ : {0:.3f}'.format(accuracy_score(y_test,pred)))