
Data Science LAB

[Python] Crawling News Articles (Using Newspaper)

ㅅ ㅜ ㅔ ㅇ 2022. 2. 13. 00:49
In this post, I'll use the newspaper library to crawl a news article from a website.

https://www.3gpp.org/news-events/2143-3gpp-meets-imt-2020

 


 

 

1. Loading the library and data

# The leading "!" is notebook cell syntax; drop it when installing from a terminal
!pip install newspaper3k
import newspaper
from newspaper import Article

article = Article("https://www.3gpp.org/news-events/2143-3gpp-meets-imt-2020")

# Download the article's HTML, then parse it to extract the title, authors, and text
article.download()
article.parse()

If you are using Python 3:

pip install newspaper3k

If you are using Python 2:

pip install newspaper

(I used Python 3, so I installed newspaper3k!)

 

 

2. Checking the article information

- Article text

print(article.text)

- Article title

article.title

- Article authors

article.authors
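Extraction does not always succeed for every field: depending on the page, `authors` can come back as an empty list and `publish_date` as `None`. A minimal sketch of a check for empty fields follows. The helper name `missing_fields` and the stand-in object are hypothetical and only here so the example runs offline; with the real library you would pass the parsed `article` instead.

```python
from types import SimpleNamespace

# Attributes to inspect on a parsed article
FIELDS = ("title", "authors", "publish_date", "text")


def missing_fields(article):
    """Return the names of the fields in FIELDS that are empty or None."""
    return [name for name in FIELDS if not getattr(article, name, None)]


# Hypothetical stand-in for a parsed Article (placeholder values, not real data)
fake_article = SimpleNamespace(
    title="Example title",
    authors=[],
    publish_date=None,
    text="Example body text",
)

print(missing_fields(fake_article))
```

Running `missing_fields(article)` right after `article.parse()` is a quick way to see which fields are worth relying on for a given site.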

 

 

 

 

Next, I'll crawl articles from the TechCrunch site!

https://techcrunch.com/

TechCrunch is an online publisher of technology industry news, carrying articles on a wide range of technologies and companies.

The library is the same one introduced above, so I won't import it again (/▽＼)

 

 

 

1. Loading articles from the site

site = newspaper.build('https://techcrunch.com/')
site.article_urls()

The articles at the top of the site change over time, so the crawled articles may differ depending on when you run the code!

site_article = site.articles[0]
site_article.download()
site_article.parse()
print("article title : ",site_article.title)
print("article url : ",site_article.url)

Likewise, the first article in the list can change between runs.

 

 

2. Saving the articles with a for loop

allarticles = []

# Fetch the URL list once, then download and parse each article
for url in site.article_urls():
    article = Article(url)
    article.download()
    article.parse()
    allarticles.append(article)
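When crawling many URLs, a single failed download or parse raises an exception and stops the whole loop. The sketch below shows one way to skip failures and keep going. The helper `collect_articles` and the `FakeArticle` class are hypothetical names used so the example runs without network access, and catching the broad `Exception` is an assumption made for simplicity rather than the library's recommended error handling.

```python
def collect_articles(urls, make_article):
    """Download and parse each URL, skipping the ones that fail."""
    parsed, failed = [], []
    for url in urls:
        article = make_article(url)
        try:
            article.download()
            article.parse()
        except Exception as exc:  # assumption: broad catch, narrow it as needed
            failed.append((url, str(exc)))
            continue
        parsed.append(article)
    return parsed, failed


class FakeArticle:
    """Hypothetical offline stand-in exposing download() and parse()."""

    def __init__(self, url):
        self.url = url

    def download(self):
        # Simulate a failure for URLs containing "broken"
        if "broken" in self.url:
            raise RuntimeError("download failed")

    def parse(self):
        self.title = "title of " + self.url


parsed, failed = collect_articles(
    ["https://example.com/a", "https://example.com/broken"], FakeArticle
)
print(len(parsed), "parsed,", len(failed), "failed")
```

With the real library, the equivalent call would be `collect_articles(site.article_urls(), Article)`, and `failed` then lists the URLs that could not be processed.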

 

 

3. Storing the crawled data in a DataFrame

import pandas as pd

columns = ['Title', 'Authors', 'PubDate', 'URL', 'Text']

# Build one row per article, then create the DataFrame in a single step
# (DataFrame.append was removed in pandas 2.0)
rows = [[a.title, a.authors, a.publish_date, a.url, a.text] for a in allarticles]
df = pd.DataFrame(rows, columns=columns)

df

This way, you can confirm that each article's title, authors, publication date, URL, and text are stored correctly.
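In the table above, the authors column holds a Python list and the publication date may be `None`, which can be awkward if the data is later exported or filtered. As a rough sketch, each article could first be flattened into plain strings. The helper name `article_to_row` and the stand-in object are hypothetical, and the values are placeholders rather than crawled data.

```python
from datetime import datetime
from types import SimpleNamespace


def article_to_row(article):
    """Flatten one parsed article into a dict of plain strings."""
    date = article.publish_date
    return {
        "Title": article.title,
        "Authors": ", ".join(article.authors),      # list -> single string
        "PubDate": date.isoformat() if date else "",  # None -> empty string
        "URL": article.url,
        "Text": article.text,
    }


# Hypothetical stand-in for a parsed Article (placeholder values)
fake = SimpleNamespace(
    title="Example title",
    authors=["A. Writer", "B. Editor"],
    publish_date=datetime(2022, 2, 13),
    url="https://example.com/post",
    text="Example body",
)

row = article_to_row(fake)
print(row["Authors"])
print(row["PubDate"])
```

A list of such dicts can be passed straight to the DataFrame constructor, for example `pd.DataFrame([article_to_row(a) for a in allarticles])`.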

 

 

 
