[Python]ํ…์ŠคํŠธ ์ „์ฒ˜๋ฆฌ - ํ…์ŠคํŠธ ์ •๊ทœํ™”

ใ…… ใ…œ ใ…” ใ…‡ 2022. 2. 17. 12:39

ํ…์ŠคํŠธ ์ž์ฒด๋ฅผ ๋ฐ”๋กœ ํ”ผ์ฒ˜๋กœ ๋งŒ๋“ค ์ˆ˜ ์—†๊ธฐ ๋•Œ๋ฌธ์—, ํ…์ŠคํŠธ๋ฅผ ๊ฐ€๊ณตํ•ด์ฃผ๋Š” ์ž‘์—…์ด ํ•„์š”ํ•˜๋‹ค. 

ํ…์ŠคํŠธ ์ •๊ทœํ™”๋Š” ํ…์ŠคํŠธ๋ฅผ ๋จธ์‹ ๋Ÿฌ๋‹ ์•Œ๊ณ ๋ฆฌ์ฆ˜์ด๋‚˜ NLP ์–ดํ”Œ๋ฆฌ์ผ€์ด์…˜์˜ ์ž…๋ ฅ ๋ฐ์ดํ„ฐ๋กœ ์‚ฌ์šฉํ•˜๊ธฐ ์œ„ํ•ด ํด๋ Œ์ง•, ์ •์ œ, ํ† ํฐํ™”, ์–ด๊ทผ ๋“ฑ์˜ ๋‹ค์–‘ํ•œ ํ…์ŠคํŠธ ๋ฐ์ดํ„ฐ์˜ ์‚ฌ์ „ ์ž‘์—…์„ ์ˆ˜ํ–‰ํ•˜๋Š” ๊ฒƒ์„ ์˜๋ฏธํ•œ๋‹ค. 


ํด๋ Œ์ง•(Cleansing)

ํด๋ Œ์ง•์€ ํ…์ŠคํŠธ์—์„œ ๋ถ„์„์— ๋ฐฉํ•ด๋˜๋Š” ๋ฌธ์ž๋‚˜ ๊ธฐํ˜ธ ๋“ฑ์„ ๋จผ์ € ์ œ๊ฑฐํ•˜๋Š” ์ž‘์—…์ด๋‹ค. (XTML, XMLํƒœ๊ทธ ๋“ฑ)


ํ…์ŠคํŠธ ํ† ํฐํ™”(Tokenization)

- ๋ฌธ์žฅ ํ† ํฐํ™” : ๋ฌธ์„œ์—์„œ ๋ฌธ์žฅ์„ ๋ถ„๋ฅ˜

- ๋‹จ์–ด ํ† ํฐํ™” : ๋ฌธ์žฅ์—์„œ ๋‹จ์–ด๋ฅผ ํ† ํฐ์œผ๋กœ ๋ถ„๋ฆฌ

์œ„์˜ ๋‘๊ฐ€์ง€ ์ข…๋ฅ˜๋กœ ๋‚˜๋‰œ๋‹ค. 


๋ฌธ์žฅ ํ† ํฐํ™”

๋จผ์ €, ๋ฌธ์žฅ ํ† ํฐํ™”๋Š” ๋ฌธ์žฅ์˜ ๋งˆ์นจํ‘œ(.)๋‚˜ ๊ฐœํ–‰๋ฌธ์ž(\n) ๋“ฑ ๋ฌธ์žฅ์˜ ๋งˆ์ง€๋ง‰์„ ๋œปํ•˜๋Š” ๊ธฐํ˜ธ์— ๋”ฐ๋ผ ๋ถ„๋ฆฌํ•œ๋‹ค. 

NLTK์—์„œ๋Š” ์ผ๋ฐ˜์ ์œผ๋กœ sent_tokenize๋ฅผ ์ด์šฉํ•ด ํ† ํฐํ™”๋ฅผ ํ•œ๋‹ค. 


3๊ฐœ์˜ ๋ฌธ์žฅ์œผ๋กœ ์ด๋ฃจ์–ด์ง„ ๋ฐ์ดํ„ฐ๋ฅผ ๋ฌธ์žฅ์œผ๋กœ ๊ฐ๊ฐ ๋ถ„๋ฆฌํ•ด๋ณด๋ ค๊ณ  ํ•œ๋‹ค. 

from nltk import sent_tokenize
import nltk
nltk.download('punkt')

nltk.download('punkt')์„ ์ด์šฉํ•˜๋ฉด ๋งˆ์นจํ‘œ, ๊ฐœํ–‰๋ฌธ์ž ๋“ฑ์˜ ๋ฐ์ดํ„ฐ ์…‹์„ ๋‹ค์šด ๋ฐ›์„ ์ˆ˜ ์žˆ๋‹ค. 


#3๊ฐœ์˜ ๋ฌธ์žฅ์œผ๋กœ๋œ ๋ฌธ์„œ๋ฅผ ๋ฌธ์žฅ์œผ๋กœ ๋ถ„๋ฅ˜
text_sample = "The Matrix is everywhere its all around us, here even in this room.\
                You can see it out your window or on your television. \
                You feel it when you go to work, or go to chuarch or pay your taxes."
sentences = sent_tokenize(text = text_sample)
print(type(sentences),len(sentences))
print(sentences)

 

sent_tokenize()๋Š” ๊ฐ๊ฐ ๋ฌธ์žฅ์œผ๋กœ ๊ตฌ์„ฑ๋œ list๊ฐ์ฒด๋ฅผ ๋ฐ˜ํ™˜ํ•ด ์ค€๋‹ค. ๋ฐ˜ํ™˜๋œ list๊ฐ์ฒด๊ฐ€ ๋ฌธ์žฅ์œผ๋กœ๋œ ๋ฌธ์ž์—ด์„ ๊ฐ–๊ณ  ์žˆ๋Š” ๊ฒƒ์„ ํ™•์ธํ•  ์ˆ˜ ์žˆ๋‹ค. 


๋‹จ์–ด ํ† ํฐํ™”

๋‹จ์–ด ํ† ํฐํ™”(Word Tokenization)๋Š” ๋ฌธ์žฅ์„ ๋‹จ์–ด๋กœ ํ† ํฐํ™” ํ•˜๋Š” ๊ฒƒ์ด๋‹ค. ์ผ๋ฐ˜์ ์œผ๋กœ ์ฝค๋งˆ(,)๋‚˜ ๋งˆ์นจํ‘œ(.), ๊ณต๋ฐฑ์œผ๋กœ ๋‹จ์–ด๋ฅผ ๋ถ„๋ฆฌํ•œ๋‹ค. ์ •๊ทœ ํ‘œํ˜„์‹์„ ์ด์šฉํ•ด์„œ ๋‹ค์–‘ํ•œ ์œ ํ˜•์œผ๋กœ ํ† ํฐํ™”๋ฅผ ํ•  ์ˆ˜๋„ ์žˆ๋‹ค. 


NLTK์—์„œ ๊ธฐ๋ณธ์œผ๋กœ ์ œ๊ณตํ•˜๋Š” word_tokenize()๋ฅผ ์ด์šฉํ•ด ๋‹จ์–ด ํ† ํฐํ™”๋ฅผ ํ•ด๋ณด๋ ค๊ณ  ํ•œ๋‹ค. 

from nltk import word_tokenize

sentence = "The Matrix is everywhere its all around us, here even in this room."
words = word_tokenize(sentence)

print(type(words), len(words))
print(words)

์˜ˆ์‹œ๋กœ ์•„๋ฌด ๋ฌธ์žฅ์ด๋‚˜ sentence์— ๋„ฃ๊ณ , word_tokenize()๋ฅผ ์ด์šฉํ•ด ๋‹จ์–ด ํ† ํฐํ™”๋ฅผ ํ•ด๋ณธ ๊ฒฐ๊ณผ,

๋ฆฌ์ŠคํŠธ์— 15๊ฐœ์˜ ๋‹จ์–ด๋กœ ๋‚˜๋‰˜์–ด์„œ ์ €์žฅ๋œ ๊ฒƒ์„ ํ™•์ธํ•  ์ˆ˜ ์žˆ์—ˆ๋‹ค. 


์ด๋ฒˆ์—๋Š” ๋ฌธ์žฅ ํ† ํฐํ™”์™€ ๋‹จ์–ด ํ† ํฐํ™”๋ฅผ ํ•ฉ์ณ ๋ฌธ์„œ์˜ ๋ชจ๋“  ๋‹จ์–ด๋ฅผ ํ† ํฐํ™” ํ•ด๋ณด๋ ค๊ณ  ํ•œ๋‹ค. 

์•„๊นŒ ์˜ˆ์ œ์—์„œ ์˜ˆ์‹œ๋กœ ๋„ฃ์€ 3๊ฐœ์˜ ๋ฌธ์žฅ์œผ๋กœ ๋œ text_sample์„ ์ด์šฉํ•ด ๋ฌธ์žฅ๋ณ„๋กœ ๋‹จ์–ด ํ† ํฐํ™”๋ฅผ ์ ์šฉํ•œ๋‹ค. 

from nltk import word_tokenize, sent_tokenize

# Create a function that word-tokenizes a multi-sentence text, sentence by sentence
def tokenize_text(text):
    # Split the text into sentence tokens
    sentences = sent_tokenize(text)

    # Word-tokenize each of the separated sentences
    word_tokens = [word_tokenize(sentence) for sentence in sentences]

    return word_tokens

# Word-tokenize the multi-sentence sample, sentence by sentence
word_tokens = tokenize_text(text_sample)
print(type(word_tokens), len(word_tokens))
print(word_tokens)

๋จผ์ €, ๋ฌธ์žฅ ํ† ํฐํ™”๋ฅผ ์ˆ˜ํ–‰ํ•œ ํ›„ ๋‹จ์–ด ํ† ํฐํ™”๋ฅผ ์ˆ˜ํ–‰ํ•˜๋Š” ํ•จ์ˆ˜๋ฅผ ์ƒ์„ฑํ•˜์—ฌ text_sample์— ์ ์šฉํ•œ ๊ฒฐ๊ณผ, 

๋ฌธ์žฅ๋ณ„๋กœ ๋‹จ์–ด ํ† ํฐํ™”๊ฐ€ ์ž˜ ์ด๋ฃจ์–ด์ง„ ๊ฒƒ์„ ํ™•์ธํ•  ์ˆ˜ ์žˆ์—ˆ๋‹ค. 

๋ฌธ์žฅ ํ† ํฐํ™”๋ฅผ ๋จผ์ € ์ง„ํ–‰ํ•˜์˜€์œผ๋ฏ€๋กœ, ๋ฆฌ์ŠคํŠธ์— ๊ฐ์ฒด 3๊ฐœ๊ฐ€ ๋‚ดํฌ๋˜์–ด ์ถœ๋ ฅ๋˜์—ˆ๋‹ค. 


์Šคํ†ฑ์›Œ๋“œ์ œ๊ฑฐ

์Šคํ†ฑ ์›Œ๋“œ(stop word)๋Š” ๋ถ„์„ํ•  ๋•Œ ์˜๋ฏธ๊ฐ€ ์—†๋Š” ๋‹จ์–ด๋ฅผ ์˜๋ฏธํ•œ๋‹ค. 

์˜์–ด์—์„œ is, the, a ๋“ฑ ๋ฌธ์žฅ์„ ๊ตฌ์„ฑํ•˜๋Š” ํ•„์ˆ˜ ๋ฌธ๋ฒ• ์š”์†Œ์ด์ง€๋งŒ ๋ฌธ๋งฅ์ ์œผ๋กœ๋Š” ํฐ ์˜๋ฏธ๊ฐ€ ์—†๋Š” ๋‹จ์–ด๋“ค์ด๋‹ค. 

์ด๋Ÿฌํ•œ ๋‹จ์–ด๋“ค์€ ๋ฌธ์žฅ์— ์ž์ฃผ ๋“ฑ์žฅํ•˜๊ธฐ ๋•Œ๋ฌธ์— ์ œ๊ฑฐํ•˜์ง€ ์•Š์œผ๋ฉด ์ค‘์š”ํ•œ ๋‹จ์–ด๋กœ ์ธ์ง€๋  ์ˆ˜ ์žˆ๋‹ค. 


๋จผ์ €, NLTK์—์„œ stopwords๋ชฉ๋ก์„ ๋‹ค์šด๋ฐ›๋Š”๋‹ค. 

(nltk.download()๋ฅผ ํ•˜๋ฉด nltk์—์„œ ๋‹ค์šด๋ฐ›์„ ์ˆ˜ ์žˆ๋Š” ๋ชฉ๋ก์ด ๋ชจ๋‘ ๋‹ค์šด๋˜๊ธฐ ๋•Œ๋ฌธ์— ์•ˆ์— ๊ผญ ๋‹ค์šด๋ฐ›๊ณ ์ž ํ•˜๋Š” ๊ฒƒ์„ ์ž…๋ ฅํ•ด์•ผํ•œ๋‹ค!)

import nltk
nltk.download('stopwords')

 

๋‹ค์šด๋กœ๋“œ๊ฐ€ ์™„๋ฃŒ๋˜๋ฉด NLTK์—์„œ ์˜์–ด์˜ ๊ฒฝ์šฐ stopwords๊ฐ€ ๋ช‡ ๊ฐœ ์žˆ๋Š” ์ง€ ์•Œ์•„๋ณธ ํ›„, ๊ทธ์ค‘ 20๊ฐœ๋งŒ ํ™•์ธํ•ด ๋ณธ๋‹ค. 

print("์˜์–ด stop words ๊ฐœ์ˆ˜ :",len(nltk.corpus.stopwords.words('english')))
print(nltk.corpus.stopwords.words('english')[:20])

NLTK์—์„œ ์˜์–ด์˜ stop words๋Š” 179๊ฐœ ์ด๋ฉฐ, i,me,my๋“ฑ์ด ํฌํ•จ๋˜์–ด์ ธ ์žˆ๋Š” ๊ฒƒ์„ ์•Œ ์ˆ˜ ์žˆ๋‹ค. 


words_token๋ฆฌ์ŠคํŠธ์— ๋Œ€ํ•ด stopwords๋ฅผ ํ•„ํ„ฐ๋ง์œผ๋กœ ์ œ๊ฑฐํ•ด ๋ถ„์„์— ์˜๋ฏธ์žˆ๋Š” ๋‹จ์–ด๋งŒ์„ ์ถ”์ถœํ•ด ๋ณด์ž๋ฉด 

stopwords = nltk.corpus.stopwords.words('english')
all_tokens = []

# Remove stop words from the word_tokens lists obtained for the three sentences
for sentence in word_tokens:
    filtered_words = []

    # Remove stop words from each sentence's list of tokens
    for word in sentence:

        # Convert everything to lowercase
        word = word.lower()

        # Keep the token only if it is not in the stop word list
        if word not in stopwords:
            filtered_words.append(word)

    all_tokens.append(filtered_words)

print(all_tokens)

3๊ฐœ์˜ ๋ฌธ์žฅ์—์„œ is, this๊ฐ™์€ ์Šคํ†ฑ์›Œ๋“œ๊ฐ€ ํ•„ํ„ฐ๋ง์„ ํ†ตํ•ด ์ œ๊ฑฐ๋์Œ์„ ํ™•์ธํ•  ์ˆ˜ ์žˆ์Œ


Stemming/Lemmatization

์˜์–ด์˜ ๊ฒฝ์šฐ, ํ˜„์žฌ/๊ณผ๊ฑฐ, 3์ธ์นญ์ผ ๋•Œ ๋“ฑ ์—ฌ๋Ÿฌ ์กฐ๊ฑด์— ๋”ฐ๋ผ ๋‹จ์–ด์˜ ํ˜•ํƒœ๊ฐ€ ๋ณ€ํ™”ํ•œ๋‹ค. ๋”ฐ๋ผ์„œ ๋‹จ์–ด์˜ ์›ํ˜•์„ ์ฐพ์•„ ๋ถ„์„์„ ์ง„ํ–‰ํ•ด์•ผ ํ•œ๋‹ค. 

Stemming๊ณผ Lemmatization ๋ชจ๋‘ ๋‹จ์–ด์˜ ์›ํ˜•์„ ์ฐพ๋Š” ๋ชฉ์ ์„ ๊ฐ€์ง€๊ณ  ์žˆ์ง€๋งŒ, Lemmatization์ด ๋” ์ •๊ตํ•˜๋ฉฐ ์˜๋ฏธ๋ก ์ ์ธ ๊ธฐ๋ฐ˜์—์„œ ๋‹จ์–ด์˜ ์›ํ˜•์„ ์ฐพ๋Š”๋‹ค. 

Stemming์˜ ๊ฒฝ์šฐ, ๋‹จ์–ด์˜ ์›ํ˜•์œผ๋กœ ๋ณ€ํ˜• ์‹œ ์ผ๋ฐ˜์ ์ด๊ฑฐ๋‚˜ ๋‹จ์ˆœํ•œ ๋ฐฉ๋ฒ•์„ ์‚ฌ์šฉํ•˜์—ฌ ์ผ๋ถ€ ์ฒ ์ž๊ฐ€ ํ›ผ์†๋œ ๋‹จ์–ด ์–ด๊ทผ์„ ์ถ”์ถœํ•ด๋‚ด๋Š” ๊ฒฝํ–ฅ์ด ์žˆ๊ณ , Lemmatization์€ ๋ฌธ๋ฒ•์ ์ธ ์š”์†Œ์™€ ๋” ์˜๋ฏธ์ ์ธ ๋ถ€๋ถ„์„ ๊ฐ์•ˆํ•ด ์ •ํ™•ํ•œ ์ฒ ์ž๋กœ ์ถ”์ถœํ•ด๋‚ด๋Š” ๊ฒฝํ–ฅ์ด ์žˆ๋‹ค. 

 

NLTK์—์„œ๋Š” ๋‹ค์–‘ํ•œ stemmer๋ฅผ ์ œ๊ณตํ•œ๋‹ค. ๋Œ€ํ‘œ์ ์ธ Stemmer์—๋Š” Porter, Lancaster, Snowball Stemmer๊ฐ€ ์žˆ๋‹ค. 

๋˜ํ•œ Lemmatization์„ ์œ„ํ•ด์„œ๋Š” WordNetLemmatizer๋ฅผ ์ œ๊ณตํ•œ๋‹ค. 


๋จผ์ €, NLTK์˜ LancasterStemmer๋ฅผ ์ด์šฉํ•ด ๋ณด๋ฉด, ์ง„ํ–‰ํ˜•, 3์ธ์นญ ๋‹จ์ˆ˜, ๊ณผ๊ฑฐํ˜•์— ๋”ฐ๋ฅธ ๋™์‚ฌ, ๋น„๊ต ๋“ฑ ํ˜•์šฉ์‚ฌ์˜ ๋ณ€ํ™”์— ๋”ฐ๋ผ ๋” ๋‹จ์ˆœํ•˜๊ฒŒ ์›ํ˜• ๋‹จ์–ด๋ฅผ ์ฐพ์•„์ค€๋‹ค. 

from nltk.stem import LancasterStemmer
stemmer = LancasterStemmer()

print(stemmer.stem('working'),stemmer.stem('works'),stemmer.stem('worked'))
print(stemmer.stem('amusing'),stemmer.stem('amuses'),stemmer.stem('amused'))
print(stemmer.stem('happier'),stemmer.stem('happiest'))
print(stemmer.stem('fancier'),stemmer.stem('fanciest'))

work ๋‹จ์–ด๋ฅผ ์ž…๋ ฅํ•˜๋ฉด, ์ง„ํ–‰ํ˜•, 3์ธ์นญ ๋‹จ์ˆ˜, ๊ณผ๊ฑฐํ˜• ๋ชจ๋‘ ์›ํ˜• ๋‹จ์–ด์ธ 'work'๋ฅผ ์ œ๋Œ€๋กœ ์ฐพ์•„๋‚ด์ง€๋งŒ,

amuse ์˜ ๊ฒฝ์šฐ, 'e'๊ฐ€ ๋น ์ง„ 'amus'๋ฅผ ์ถœ๋ ฅํ•ด ๋‚ธ๋‹ค. 

 

๋˜ํ•œ ํ˜•์šฉ์‚ฌ์ธ happy์˜ ๊ฒฝ์šฐ, ์›ํ˜•์ธ 'happy'์™€ 'happiest'๋ฅผ ์ œ๋Œ€๋กœ ์ถœ๋ ฅํ•ด ๋‚ด์ง€๋งŒ,

fancy์˜ ๊ฒฝ์šฐ 'fant','fanciest'๋ฅผ ์ถœ๋ ฅํ•ด ๋‚ด๋Š” ๊ฒƒ์„ ํ™•์ธํ•  ์ˆ˜ ์žˆ๋‹ค. 


์ด๋ฒˆ์—๋Š” ๋น„๊ต์  ์ •ํ™•ํ•œ WordNetLemmatizer๋ฅผ ์ด์šฉํ•ด Lemmatization์„ ํ•ด๋ณด๋ ค๊ณ  ํ•œ๋‹ค. 

์ผ๋ฐ˜์ ์œผ๋กœ Lemmatization์€ ๋ณด๋‹ค ์ •ํ™•ํ•œ ์›ํ˜• ๋‹จ์–ด ์ถ”์ถœ์„ ์œ„ํ•ด ๋‹จ์–ด์˜ ํ’ˆ์‚ฌ๋ฅผ ์ž…๋ ฅํ•ด์ค˜์•ผํ•œ๋‹ค. 

from nltk.stem import WordNetLemmatizer
import nltk
nltk.download('wordnet')

lemma = WordNetLemmatizer()
print(lemma.lemmatize('amusing','v'),lemma.lemmatize('amuses','v'),lemma.lemmatize('amused','v'))
print(lemma.lemmatize('happier','a'),lemma.lemmatize('happiest','a'))
print(lemma.lemmatize('fancier','a'),lemma.lemmatize('fanciest','a'))

LancasterStemmer๋ณด๋‹ค ์ •ํ™•ํ•˜๊ฒŒ ๋‹จ์–ด๋ฅผ ์ถœ๋ ฅํ•ด ๋‚ด๋Š” ๊ฒƒ์„ ํ™•์ธํ•  ์ˆ˜ ์žˆ๋‹ค. 

 
