Python Web Scraping Cookbook, Chapter 2: Data Acquisition and Extraction
In this chapter we will cover:
- Parsing a website and navigating the DOM with BeautifulSoup
- Searching the DOM with BeautifulSoup's find methods
- Querying the DOM with XPath and lxml
- Querying data with XPath and CSS selectors
- Using Scrapy selectors
- Loading data as Unicode / UTF-8
Parsing a website and navigating the DOM with BeautifulSoup
01_bs_browser.py
# -*- coding: utf-8 -*-
# Author: china-testing#126.com
# CreateDate: 2018-05-08
# 01_bs_browser.py
import requests
from bs4 import BeautifulSoup

html = requests.get("https://china-testing.github.io/address.html").text
soup = BeautifulSoup(html, "lxml")
print('######### first row #########')
print(soup.html.body.div.table.tr)
print('######### children #########')
print([str(c)[:45] for c in soup.html.body.div.table.children])
print('######### parent #########')
print(str(soup.html.body.div.table)[:50])
Output:
$ python3 01_bs_browser.py
######### first row #########
<tr>
<th align="center">类别</th>
<th align="center">名称</th>
<th align="center">名称</th>
<th align="center">名称</th>
<th align="center">名称</th>
<th align="center">名称</th>
<th align="center">名称</th>
<th align="center">名称</th>
</tr>
######### children #########
['\n', '<thead>\n<tr>\n<th align="center">类别</th>\n<th a', '\n', '<tbody>\n<tr>\n<td align="center">本人博客</td>\n<td', '\n']
######### parent #########
<table>
<thead>
<tr>
<th align="center">类别</th>
<t
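The dotted navigation above works on any document. Here is a minimal offline sketch (the HTML fragment is made up for illustration) showing `.parent` walking back up the tree and `.children` walking down:

```python
from bs4 import BeautifulSoup

# A small stand-in for address.html (hypothetical data), so this runs offline.
html = """
<html><body><div><table>
<thead><tr><th>Category</th><th>Name</th></tr></thead>
<tbody><tr><td>Blog</td><td>china-testing</td></tr></tbody>
</table></div></body></html>
"""
soup = BeautifulSoup(html, "html.parser")

tr = soup.html.body.div.table.tr          # first <tr>, found depth-first
print(tr.th.text)                         # first cell of that row
print(tr.parent.name)                     # .parent goes one level up: thead
# .children yields whitespace text nodes too; filter to real tags
children = [c.name for c in soup.table.children if c.name]
print(children)
```

The whitespace filtering explains the bare `'\n'` entries in the output above: newlines between tags are child nodes too.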
Searching the DOM with BeautifulSoup's find methods
Code: 02_events_with_urlib3.py
import urllib3
from bs4 import BeautifulSoup

def get_upcoming_events(url):
    req = urllib3.PoolManager()
    res = req.request('GET', url)
    soup = BeautifulSoup(res.data, 'html.parser')
    events = soup.find('ul', {'class': 'list-recent-events'}).findAll('li')
    for event in events:
        event_details = dict()
        event_details['name'] = event.find('h3').find("a").text
        event_details['location'] = event.find('span', {'class': 'event-location'}).text
        event_details['time'] = event.find('time').text
        print(event_details)

get_upcoming_events('https://www.python.org/events/python-events/')
requests wraps urllib3, so in practice you usually use requests directly.
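With requests the fetch shrinks to one line. Since the parsing logic is the interesting part, this sketch factors it into its own function and exercises it against a small inline fragment (hypothetical data shaped like the python.org events page), so it runs without touching the network:

```python
import requests
from bs4 import BeautifulSoup

def parse_events(html):
    """Extract event dicts from a python.org-style events page."""
    soup = BeautifulSoup(html, 'html.parser')
    events = []
    for li in soup.find('ul', {'class': 'list-recent-events'}).find_all('li'):
        events.append({
            'name': li.find('h3').find('a').text,
            'location': li.find('span', {'class': 'event-location'}).text,
            'time': li.find('time').text,
        })
    return events

def get_upcoming_events(url):
    # requests handles connection pooling, redirects, and decoding
    # that urllib3 leaves to you
    return parse_events(requests.get(url).text)

# Offline demo with a minimal fragment in the same shape as the real page:
sample = """
<ul class="list-recent-events">
  <li><h3 class="event-title"><a href="#">PyCon</a></h3>
      <p><span class="event-location">Cleveland, OH</span>
         <time>01 May</time></p></li>
</ul>
"""
print(parse_events(sample))
```

Separating fetch from parse also makes the scraper testable without network access.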
Scraping python.org with Scrapy
Scrapy is a very popular open-source Python scraping framework for extracting data, and it is also our tool of choice for scraping with Python. Scrapy offers many powerful features worth mentioning:
- Built-in extensions for making HTTP requests and handling compression, authentication, caching, user agents, and HTTP headers
- Built-in support for selecting and extracting data with selector languages such as CSS and XPath, plus support for selecting content and links with regular expressions
- Encoding support for handling languages and non-standard encoding declarations
- Flexible APIs for reusing and writing custom middlewares and pipelines, providing a clean and simple way to automate tasks such as downloading assets (for example images or media) and storing data in file systems, S3, databases, and so on
There are several ways to use Scrapy. One is the programmatic mode, in which we create the crawler and spider in code. It is also possible to configure a Scrapy project from templates or generators and run it from the command line. This book follows the programmatic mode, since it keeps the code in a single file.
Code: 03_events_with_scrapy.py
import scrapy
from scrapy.crawler import CrawlerProcess

class PythonEventsSpider(scrapy.Spider):
    name = 'pythoneventsspider'
    start_urls = ['https://www.python.org/events/python-events/',]
    found_events = []

    def parse(self, response):
        for event in response.xpath('//ul[contains(@class, "list-recent-events")]/li'):
            event_details = dict()
            event_details['name'] = event.xpath('h3[@class="event-title"]/a/text()').extract_first()
            event_details['location'] = event.xpath('p/span[@class="event-location"]/text()').extract_first()
            event_details['time'] = event.xpath('p/time/text()').extract_first()
            self.found_events.append(event_details)

if __name__ == "__main__":
    process = CrawlerProcess({'LOG_LEVEL': 'ERROR'})
    process.crawl(PythonEventsSpider)
    spider = next(iter(process.crawlers)).spider
    process.start()
    for event in spider.found_events:
        print(event)
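The XPath expressions in the spider can be prototyped without running Scrapy at all: lxml, which the chapter introduction mentions, evaluates the same queries. A sketch against an inline fragment (hypothetical data mirroring the page structure the spider expects):

```python
from lxml import html

sample = """
<ul class="list-recent-events">
  <li><h3 class="event-title"><a href="#">PyCon</a></h3>
      <p><span class="event-location">Cleveland, OH</span>
         <time>01 May</time></p></li>
</ul>
"""
doc = html.fromstring(sample)
for li in doc.xpath('//ul[contains(@class, "list-recent-events")]/li'):
    # xpath() returns a list; indexing [0] plays the role of extract_first()
    name = li.xpath('h3[@class="event-title"]/a/text()')[0]
    location = li.xpath('p/span[@class="event-location"]/text()')[0]
    print(name, '-', location)
```

Debugging selectors this way is much faster than re-running a full crawl after each tweak.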
Exercise: use Scrapy to crawl the blog titles on the home page of https://china-testing.github.io/ (10 in total).
Reference answer:
03_blog_with_scrapy.py
import scrapy
from scrapy.crawler import CrawlerProcess

class PythonEventsSpider(scrapy.Spider):
    name = 'pythoneventsspider'
    start_urls = ['https://china-testing.github.io/',]
    found_events = []

    def parse(self, response):
        for event in response.xpath('//article//h1'):
            event_details = dict()
            event_details['name'] = event.xpath('a/text()').extract_first()
            self.found_events.append(event_details)

if __name__ == "__main__":
    process = CrawlerProcess({'LOG_LEVEL': 'ERROR'})
    process.crawl(PythonEventsSpider)
    spider = next(iter(process.crawlers)).spider
    process.start()
    for event in spider.found_events:
        print(event)
Scraping Python.org with Selenium and PhantomJS
04_events_with_selenium.py
from selenium import webdriver

def get_upcoming_events(url):
    driver = webdriver.Chrome()
    driver.get(url)
    events = driver.find_elements_by_xpath('//ul[contains(@class, "list-recent-events")]/li')
    for event in events:
        event_details = dict()
        event_details['name'] = event.find_element_by_xpath('h3[@class="event-title"]/a').text
        event_details['location'] = event.find_element_by_xpath('p/span[@class="event-location"]').text
        event_details['time'] = event.find_element_by_xpath('p/time').text
        print(event_details)
    driver.close()

get_upcoming_events('https://www.python.org/events/python-events/')
Switching to driver = webdriver.PhantomJS('phantomjs') runs the browser without a visible window. The code:
05_events_with_phantomjs.py
from selenium import webdriver

def get_upcoming_events(url):
    driver = webdriver.PhantomJS('phantomjs')
    driver.get(url)
    events = driver.find_elements_by_xpath('//ul[contains(@class, "list-recent-events")]/li')
    for event in events:
        event_details = dict()
        event_details['name'] = event.find_element_by_xpath('h3[@class="event-title"]/a').text
        event_details['location'] = event.find_element_by_xpath('p/span[@class="event-location"]').text
        event_details['time'] = event.find_element_by_xpath('p/time').text
        print(event_details)
    driver.close()

get_upcoming_events('https://www.python.org/events/python-events/')
That said, Selenium's headless Chrome mode is now a better replacement for PhantomJS.
04_events_with_selenium_headless.py
from selenium import webdriver

def get_upcoming_events(url):
    options = webdriver.ChromeOptions()
    options.add_argument('headless')
    driver = webdriver.Chrome(chrome_options=options)
    driver.get(url)
    events = driver.find_elements_by_xpath('//ul[contains(@class, "list-recent-events")]/li')
    for event in events:
        event_details = dict()
        event_details['name'] = event.find_element_by_xpath('h3[@class="event-title"]/a').text
        event_details['location'] = event.find_element_by_xpath('p/span[@class="event-location"]').text
        event_details['time'] = event.find_element_by_xpath('p/time').text
        print(event_details)
    driver.close()

get_upcoming_events('https://www.python.org/events/python-events/')