Python Crawling Practice (2) - Fetching Ranking News

2020. 7. 21. 00:37 · Carrie's Data World / Python

Using BeautifulSoup, we'll pull the title, link, and full text of each article in the 'Most Viewed News' list on the Naver News page.

Inside the <Most Viewed News> box, the articles ranked 1st through 10th in each section sit under <ul class="section_list_ranking">. Each <li> there holds an <a> tag whose attributes carry the article title and link.
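
To make the selector concrete, here is a minimal offline sketch. The markup in SAMPLE_HTML is a hypothetical, simplified stand-in for the ranking box (the real page is larger and may have changed), so treat it only as an illustration of how the CSS selector picks out the <a> tags.

```python
from bs4 import BeautifulSoup

# Hypothetical, simplified markup modeled on the ranking box (not the real page)
SAMPLE_HTML = """
<ul class="section_list_ranking">
  <li><a href="/main/ranking/read.nhn?aid=1" title="First headline">First headline</a></li>
  <li><a href="/main/ranking/read.nhn?aid=2" title="Second headline">Second headline</a></li>
</ul>
<ul class="other_list"><li><a href="/x" title="Not ranked">Not ranked</a></li></ul>
"""

soup = BeautifulSoup(SAMPLE_HTML, 'html.parser')

# Descendant selector: <a> inside <li> inside an element with class section_list_ranking
links = soup.select('.section_list_ranking li a')

# Collect (title, href) pairs from the tag attributes
ranked = [(a['title'], a['href']) for a in links]
for rank, (title, href) in enumerate(ranked, start=1):
    print(rank, title, href)
```

The link in the second <ul> is skipped because its list does not have the section_list_ranking class.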

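The code below goes from title to article link to article body, strips the unneeded tags from the body, and prints the full text. The notes after it break the code into pieces, with reference to the BeautifulSoup documentation.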

from bs4 import BeautifulSoup
import requests
import time
from urllib.parse import urljoin

url = "https://news.naver.com/main/ranking/popularDay.nhn"
r = requests.get(url)

soup = BeautifulSoup(r.text, 'html.parser')
# Select every <a> inside an <li> under the element with class 'section_list_ranking'
results = soup.select('.section_list_ranking li a')

for result in results:
    print('Title:', result.attrs['title'])
    print('Link:', result.attrs['href'])
    print()  # leave blank lines between the link and the article text
    print()
    # urljoin works whether the href is relative or already absolute
    url_content = urljoin('https://news.naver.com', result.attrs['href'])
    response_content = requests.get(url_content)
    soup_content = BeautifulSoup(response_content.text, 'html.parser')
    content = soup_content.select_one('#articleBodyContents')
    if content is None:  # page has no body container with this id
        time.sleep(2)
        continue
    # print(content.contents)  # raw children: a mix of tags, comments, and whitespace

    # Clean the data
    output = ''
    for item in content.contents:
        stripped = str(item).strip()  # strip() removes surrounding whitespace
        if stripped == '':
            continue
        if stripped[0] not in ['<', '/']:  # skip child tags and text starting with '/'
            output += stripped
    print(output.replace('본문 내용TV플레이어', ''))  # remove the leftover label text
    print()
    time.sleep(2)  # pause between requests
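
The cleaning step can also be tried offline. The sketch below is an alternative to the string-prefix check above, not the method used in this post: it keeps only plain text children by checking their type. SAMPLE_HTML and the helper name extract_body_text are hypothetical and exist only for this illustration.

```python
from bs4 import BeautifulSoup
from bs4.element import Comment, NavigableString

# Hypothetical sample markup standing in for a downloaded article page
SAMPLE_HTML = """
<div id="articleBodyContents">
<!-- 본문 내용 -->
<script>// player setup</script>
First paragraph of the article.<br/>
<span class="caption">photo caption</span>
Second paragraph of the article.
</div>
"""

def extract_body_text(html):
    """Return the direct text children of #articleBodyContents, joined by spaces."""
    soup = BeautifulSoup(html, 'html.parser')
    body = soup.select_one('#articleBodyContents')
    if body is None:
        return ''
    parts = []
    for item in body.contents:
        # Comment is a subclass of NavigableString, so exclude it explicitly
        if isinstance(item, NavigableString) and not isinstance(item, Comment):
            text = item.strip()
            if text:
                parts.append(text)
    return ' '.join(parts)

print(extract_body_text(SAMPLE_HTML))
```

Because HTML comments are dropped by type, this version would not need a separate replace() call for comment text, but it also discards any text nested inside child tags such as the caption <span>.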

(1) The Requests module

The requests module allows you to send HTTP requests using Python. The HTTP request returns a Response Object with all the response data (content, encoding, status, etc).

Requests methods

(Source: w3schools.com)

delete(url, args)	Sends a DELETE request to the specified url
get(url, params, args)	Sends a GET request to the specified url
head(url, args)	Sends a HEAD request to the specified url
patch(url, data, args)	Sends a PATCH request to the specified url
post(url, data, json, args)	Sends a POST request to the specified url
put(url, data, args)	Sends a PUT request to the specified url
request(method, url, args)	Sends a request of the specified method to the specified url
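
As a small illustration of the get(url, params, args) row, the sketch below builds a GET request without sending it, so it runs without network access. The query parameter date is a hypothetical name chosen only to show how params are encoded into the URL; it is not taken from the Naver page.

```python
import requests

url = "https://news.naver.com/main/ranking/popularDay.nhn"

# 'date' is a hypothetical parameter, used only to show query-string encoding
req = requests.Request('GET', url, params={'date': '20200721'})
prepared = req.prepare()  # builds the request object; nothing is sent

print(prepared.method)
print(prepared.url)
```

Calling requests.get(url, params={...}) performs the same preparation and then sends the request, returning a Response object whose status_code, encoding, and text attributes hold the response data.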

(2) Attrs

A tag may have any number of attributes. The tag <b id="boldest"> has an attribute “id” whose value is “boldest”. You can access a tag’s attributes by treating the tag like a dictionary:

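You can use .attrs to access a tag's attributes and extract their values.

print('Title:', result.attrs['title']) -> extracts each headline from the 'title' attribute of the <a> tag in the ranking list

(Source: BeautifulSoup documentation, #attributes)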
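
A short sketch of dictionary-style attribute access, using the <b id="boldest"> tag from the quoted documentation plus a made-up <a> tag (its href and title values are hypothetical):

```python
from bs4 import BeautifulSoup

html = '<b id="boldest">bold</b><a href="/news/1" title="Sample headline">link</a>'
soup = BeautifulSoup(html, 'html.parser')

b_tag = soup.b
a_tag = soup.a

print(b_tag['id'])           # dictionary-style access to one attribute
print(b_tag.attrs)           # all attributes as a dict
print(a_tag.attrs['title'])  # same pattern as in the crawler loop

# .get() returns None for a missing attribute instead of raising KeyError
missing = a_tag.get('class')
print(missing)
```

Using .get() can be a safer choice when some links might lack a title attribute.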

(3) .contents and .children

A tag’s children are available in a list called .contents.
The BeautifulSoup object itself has children. In this case, the <html> tag is the child of the BeautifulSoup object.
A string does not have .contents, because it can’t contain anything.
Instead of getting them as a list, you can iterate over a tag’s children using the .children generator.

(Examples: BeautifulSoup documentation, #contents-and-children)
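
The statements above can be checked with a tiny document. This is a minimal sketch on made-up markup, not an excerpt from the BeautifulSoup documentation:

```python
from bs4 import BeautifulSoup

html = '<html><head><title>Story</title></head><body><p>one</p><p>two</p></body></html>'
soup = BeautifulSoup(html, 'html.parser')

# The BeautifulSoup object's only child here is the <html> tag
top_names = [child.name for child in soup.contents]
print(top_names)

# .contents is a list of direct children
body = soup.body
print(len(body.contents))

# .children yields the same direct children one at a time
child_names = [child.name for child in body.children]
print(child_names)

# The child of a <p> tag is its text string
first_text = body.contents[0].contents[0]
print(first_text)
```

In the crawler above, content.contents plays the same role: it lists the direct children of the article body so each one can be kept or skipped.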

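I tried the same approach on Daum's ranking news as well, but tags and assorted special characters came through together with the text, so I'll look further into cleaning functions and update this post later.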
