Web scraping is well suited to collecting and processing large amounts of data. It can go beyond what search engines offer, for example finding the cheapest airfare.
APIs provide nicely formatted data, but many sites offer no API, and there is no universal one. Even when an API exists, its data types and formats may not fully match your needs, and it may be too slow.
Application areas include market forecasting, machine translation, and medical diagnosis. It can even be art, e.g. http://wefeelfine.org/.
This article is based on Python 3 and assumes basic Python knowledge.
Code download: http://pythonscraping.com/code/.
```python
from urllib.request import urlopen

html = urlopen("http://www.pythonscraping.com/exercises/exercise1.html")
print(html.read())
```
Output:
```
$ python3 1-basicExample.py
b'<html>\n<head>\n<title>A Useful Page</title>\n</head>\n<body>\n<h1>An Interesting Title</h1>\n<div>\nLorem ipsum dolor sit amet, consectetur adipisicing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua. Ut enim ad minim veniam, quis nostrud exercitation ullamco laboris nisi ut aliquip ex ea commodo consequat. Duis aute irure dolor in reprehenderit in voluptate velit esse cillum dolore eu fugiat nulla pariatur. Excepteur sint occaecat cupidatat non proident, sunt in culpa qui officia deserunt mollit anim id est laborum.\n</div>\n</body>\n</html>\n'
```
```python
from urllib.request import urlopen
from bs4 import BeautifulSoup

html = urlopen("http://www.pythonscraping.com/pages/page1.html")
bsObj = BeautifulSoup(html.read(), 'lxml')
print(bsObj.h1)
```
Output:
```
$ python3 2-beautifulSoup.py
<h1>An Interesting Title</h1>
```
The HTML is structured hierarchically as follows:
• html → `<html><head>...</head><body>...</body></html>`
  • head → `<head><title>A Useful Page</title></head>`
    • title → `<title>A Useful Page</title>`
  • body → `<body><h1>An Int...</h1><div>Lorem ip...</div></body>`
    • h1 → `<h1>An Interesting Title</h1>`
    • div → `<div>Lorem ipsum dolor...</div>`
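The same nesting can be reproduced with the standard library's html.parser; the following is a minimal sketch (the TreePrinter class is our own illustration, not from the book):

```python
from html.parser import HTMLParser

class TreePrinter(HTMLParser):
    """Collect each opened tag, indented by its nesting depth."""
    def __init__(self):
        super().__init__()
        self.depth = 0
        self.lines = []

    def handle_starttag(self, tag, attrs):
        self.lines.append("  " * self.depth + tag)
        self.depth += 1

    def handle_endtag(self, tag):
        self.depth -= 1

parser = TreePrinter()
parser.feed("<html><head><title>A Useful Page</title></head>"
            "<body><h1>An Interesting Title</h1><div>Lorem ipsum...</div></body></html>")
print("\n".join(parser.lines))
```

This prints the tag tree (html, then head/title, then body/h1/div) with two-space indentation per level.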
Note that bsObj.h1 here is equivalent to any of the following:
```python
bsObj.html.body.h1
bsObj.body.h1
bsObj.html.h1
```
Common errors with urlopen:

• The page cannot be found on the server, or an error occurred while retrieving it (404 or 500); this raises an HTTPError.
• The server itself cannot be found; this raises a URLError (of which HTTPError is a subclass).

The HTTPError case can be handled as follows:
```python
from urllib.request import urlopen
from urllib.error import HTTPError

try:
    html = urlopen("http://www.pythonscraping.com/pages/page1.html")
except HTTPError as e:
    print(e)
    # return null, break, or do some other "Plan B"
else:
    # program continues. Note: If you return or break in the
    # exception catch, you do not need to use the "else" statement
    pass
```
This article is summarized from Web Scraping with Python (2015).
Book download: https://bitbucket.org/xurongzhong/python-chinese-library/downloads
Source code: https://bitbucket.org/wswp/code
Demo site: http://example.webscraping.com/
Demo site code: http://bitbucket.org/wswp/places
Recommended Python primer: http://www.diveintopython.net
HTML and JavaScript basics:
Author's blog: http://my.oschina.net/u/1433482/
This article: http://my.oschina.net/u/1433482/blog/620858
Discussion: QQ group 291184506 (Python automated testing), QQ group 144081101 (Python/Java unit and white-box testing)
Why scrape the web?
When shopping online you may want to compare prices across sites, which is what tools like the Huihui shopping assistant do. An API makes this easy, but usually there is none, and that is when web scraping is needed.
Is web scraping legal?
Personal use of scraped data is generally not illegal; commercial use or republication requires considering authorization, and scraping etiquette also matters. Judging from cases already decided abroad, facts such as locations and phone numbers can generally be republished, but original (creative) data cannot.
Further reading:
http://www.bvhd.dk/uploads/tx_mocarticles/S_-_og_Handelsrettens_afg_relse_i_Ofir-sagen.pdf
http://www.austlii.edu.au/au/cases/cth/FCA/2010/44.html
http://caselaw.findlaw.com/us-supreme-court/499/340.html
Background research
robots.txt and the Sitemap help you understand a site's scale and structure; tools such as Google search and WHOIS are also useful.
For example: http://example.webscraping.com/robots.txt
```
# section 1
User-agent: BadCrawler
Disallow: /

# section 2
User-agent: *
Crawl-delay: 5
Disallow: /trap

# section 3
Sitemap: http://example.webscraping.com/sitemap.xml
```
For more about web robots, see http://www.robotstxt.org.
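A crawler can honor rules like those above programmatically. A sketch using Python 3's stdlib urllib.robotparser (the book's own code targets Python 2), with the robots.txt shown above inlined as a string instead of fetched:

```python
from urllib import robotparser

# The robots.txt content from the example site, inlined for illustration
robots_txt = """\
User-agent: BadCrawler
Disallow: /

User-agent: *
Crawl-delay: 5
Disallow: /trap
"""

rp = robotparser.RobotFileParser()
rp.parse(robots_txt.splitlines())

# BadCrawler is banned from the whole site
print(rp.can_fetch("BadCrawler", "http://example.webscraping.com/"))
# Everyone else is only banned from the /trap link
print(rp.can_fetch("GoodCrawler", "http://example.webscraping.com/trap"))
print(rp.can_fetch("GoodCrawler", "http://example.webscraping.com/view/1"))
# The requested delay between downloads, in seconds
print(rp.crawl_delay("GoodCrawler"))
```

In a real crawler you would call `rp.set_url(...)` and `rp.read()` to fetch the live file, and sleep `crawl_delay` seconds between requests.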
The Sitemap protocol is described at http://www.sitemaps.org/protocol.html. For example:
```xml
<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>http://example.webscraping.com/view/Afghanistan-1</loc></url>
  <url><loc>http://example.webscraping.com/view/Aland-Islands-2</loc></url>
  <url><loc>http://example.webscraping.com/view/Albania-3</loc></url>
  ...
</urlset>
```
Sitemaps are often incomplete, however.
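The `<loc>` links can also be extracted with the stdlib's xml.etree.ElementTree rather than the regex used later in this article; note that the sitemap namespace must be mapped explicitly. A sketch with two sample entries inlined:

```python
import xml.etree.ElementTree as ET

# Two entries of the example sitemap, inlined for illustration
sitemap_xml = """<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>http://example.webscraping.com/view/Afghanistan-1</loc></url>
  <url><loc>http://example.webscraping.com/view/Aland-Islands-2</loc></url>
</urlset>"""

# Map a prefix to the sitemap namespace so findall() can address the tags
ns = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}
root = ET.fromstring(sitemap_xml)
links = [loc.text for loc in root.findall("sm:url/sm:loc", ns)]
print(links)
```

Unlike a regex, this fails loudly on malformed XML, which is useful when a sitemap is truncated.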
Estimating site size:
Use Google's site: query, for example: site:automationtesting.sinaapp.com
Assessing the technology a site is built with:
```
# pip install builtwith
# ipython
In [1]: import builtwith
In [2]: builtwith.parse('http://automationtesting.sinaapp.com/')
Out[2]:
{u'issue-trackers': [u'Trac'],
 u'javascript-frameworks': [u'jQuery'],
 u'programming-languages': [u'Python'],
 u'web-servers': [u'Nginx']}
```
Identifying the site owner:
```
# pip install python-whois
# ipython
In [1]: import whois
In [2]: print whois.whois('http://automationtesting.sinaapp.com')
{
  "updated_date": "2016-01-07 00:00:00",
  "status": [
    "serverDeleteProhibited https://www.icann.org/epp#serverDeleteProhibited",
    "serverTransferProhibited https://www.icann.org/epp#serverTransferProhibited",
    "serverUpdateProhibited https://www.icann.org/epp#serverUpdateProhibited"
  ],
  "name": null,
  "dnssec": null,
  "city": null,
  "expiration_date": "2021-06-29 00:00:00",
  "zipcode": null,
  "domain_name": "SINAAPP.COM",
  "country": null,
  "whois_server": "whois.paycenter.com.cn",
  "state": null,
  "registrar": "XIN NET TECHNOLOGY CORPORATION",
  "referral_url": "http://www.xinnet.com",
  "address": null,
  "name_servers": [
    "NS1.SINAAPP.COM",
    "NS2.SINAAPP.COM",
    "NS3.SINAAPP.COM",
    "NS4.SINAAPP.COM"
  ],
  "org": null,
  "creation_date": "2009-06-29 00:00:00",
  "emails": null
}
```
Crawling your first site
A simple download function for crawling looks like this:
```python
import urllib2

def download(url):
    print 'Downloading:', url
    try:
        html = urllib2.urlopen(url).read()
    except urllib2.URLError as e:
        print 'Download error:', e.reason
        html = None
    return html
```
We can retry based on the error code. HTTP status codes are defined in https://tools.ietf.org/html/rfc7231#section-6: 4xx errors are not worth retrying, while 5xx errors may succeed on a retry.
```python
import urllib2

def download(url, num_retries=2):
    print 'Downloading:', url
    try:
        html = urllib2.urlopen(url).read()
    except urllib2.URLError as e:
        print 'Download error:', e.reason
        html = None
        if num_retries > 0:
            if hasattr(e, 'code') and 500 <= e.code < 600:
                # recursively retry 5xx HTTP errors
                return download(url, num_retries-1)
    return html
```
http://httpstat.us/500 always returns status 500, so we can use it for testing:
```
>>> download('http://httpstat.us/500')
Downloading: http://httpstat.us/500
Download error: Internal Server Error
Downloading: http://httpstat.us/500
Download error: Internal Server Error
Downloading: http://httpstat.us/500
Download error: Internal Server Error
```
Setting the user agent:
urllib2's default user agent is "Python-urllib/2.7", which many sites block. It is better to use something close to a real browser's agent, for example:
Mozilla/5.0 (X11; Linux x86_64; rv:38.0) Gecko/20100101 Firefox/38.0
So we add a user-agent parameter:
```python
import urllib2

def download(url, user_agent='Mozilla/5.0 (X11; Linux x86_64; rv:38.0) Gecko/20100101 Firefox/38.0', num_retries=2):
    print 'Downloading:', url
    headers = {'User-agent': user_agent}
    request = urllib2.Request(url, headers=headers)
    try:
        html = urllib2.urlopen(request).read()
    except urllib2.URLError as e:
        print 'Download error:', e.reason
        html = None
        if num_retries > 0:
            if hasattr(e, 'code') and 500 <= e.code < 600:
                # recursively retry 5xx HTTP errors
                return download(url, num_retries-1)
    return html
```
Crawling via the sitemap:
```python
import re

def crawl_sitemap(url):
    # download the sitemap file
    sitemap = download(url)
    # extract the sitemap links
    links = re.findall('<loc>(.*?)</loc>', sitemap)
    # download each link
    for link in links:
        html = download(link)
        # scrape html here
        # ...
```
Crawling by iterating over IDs:
• http://example.webscraping.com/view/Afghanistan-1
• http://example.webscraping.com/view/Australia-2
• http://example.webscraping.com/view/Brazil-3
These URLs differ only in their final path segment. Programmers typically use database IDs, e.g. http://example.webscraping.com/view/1, so we can crawl pages by database ID directly.
```python
import itertools

for page in itertools.count(1):
    url = 'http://example.webscraping.com/view/-%d' % page
    html = download(url)
    if html is None:
        break
    else:
        # success - can scrape the result
        pass
```
Of course, records may have been deleted from the database, leaving holes in the ID sequence, so we improve the loop as follows:
```python
import itertools

# maximum number of consecutive download errors allowed
max_errors = 5
# current number of consecutive download errors
num_errors = 0
for page in itertools.count(1):
    url = 'http://example.webscraping.com/view/-%d' % page
    html = download(url)
    if html is None:
        # received an error trying to download this webpage
        num_errors += 1
        if num_errors == max_errors:
            # reached maximum number of
            # consecutive errors so exit
            break
    else:
        # success - can scrape the result
        # ...
        num_errors = 0
```
Some sites return 404 for missing pages, and some sites' IDs are not this regular; Amazon, for example, uses ISBNs.
Every major browser has a "view page source" feature; in Firefox, Firebug is especially convenient. Both can be invoked by right-clicking the page.
There are three main approaches to extracting data from a page: regular expressions, BeautifulSoup, and lxml.
A regular-expression example:
```
In [1]: import re
In [2]: import common
In [3]: url = 'http://example.webscraping.com/view/UnitedKingdom-239'
In [4]: html = common.download(url)
Downloading: http://example.webscraping.com/view/UnitedKingdom-239
In [5]: re.findall('<td class="w2p_fw">(.*?)</td>', html)
Out[5]:
['<img src="/places/static/images/flags/gb.png" />',
 '244,820 square kilometres',
 '62,348,447',
 'GB',
 'United Kingdom',
 'London',
 '<a href="/continent/EU">EU</a>',
 '.uk',
 'GBP',
 'Pound',
 '44',
 '@# #@@|@## #@@|@@# #@@|@@## #@@|@#@ #@@|@@#@ #@@|GIR0AA',
 '^(([A-Z]\\d{2}[A-Z]{2})|([A-Z]\\d{3}[A-Z]{2})|([A-Z]{2}\\d{2}[A-Z]{2})|([A-Z]{2}\\d{3}[A-Z]{2})|([A-Z]\\d[A-Z]\\d[A-Z]{2})|([A-Z]{2}\\d[A-Z]\\d[A-Z]{2})|(GIR0AA))$',
 'en-GB,cy-GB,gd',
 '<div><a href="/iso/IE">IE </a></div>']
In [6]: re.findall('<td class="w2p_fw">(.*?)</td>', html)[1]
Out[6]: '244,820 square kilometres'
```
The maintenance cost of this approach is relatively high.
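A hypothetical illustration of that cost: a harmless markup change, such as an added id attribute, silently breaks an over-specific pattern, while a looser one survives (the HTML strings here are made up for the demonstration):

```python
import re

old_html = '<td class="w2p_fw">244,820 square kilometres</td>'
# Same cell after a site redesign added an id attribute
new_html = '<td class="w2p_fw" id="places_area">244,820 square kilometres</td>'

pattern = r'<td class="w2p_fw">(.*?)</td>'
print(re.findall(pattern, old_html))  # matches
print(re.findall(pattern, new_html))  # empty: the extra attribute breaks the match

# A looser pattern tolerates extra attributes, at some cost in readability
loose = r'<td[^>]*class="w2p_fw"[^>]*>(.*?)</td>'
print(re.findall(loose, new_html))
```

Parser-based approaches such as BeautifulSoup and lxml are insensitive to this kind of change, which is why they are usually easier to maintain.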
Beautiful Soup:
```
In [7]: from bs4 import BeautifulSoup
In [8]: broken_html = '<ul class=country><li>Area<li>Population</ul>'
In [9]: # parse the HTML
In [10]: soup = BeautifulSoup(broken_html, 'html.parser')
In [11]: fixed_html = soup.prettify()
In [12]: print fixed_html
<ul class="country">
 <li>
  Area
  <li>
   Population
  </li>
 </li>
</ul>
In [13]: ul = soup.find('ul', attrs={'class':'country'})
In [14]: ul.find('li') # returns just the first match
Out[14]: <li>Area<li>Population</li></li>
In [15]: ul.find_all('li') # returns all matches
Out[15]: [<li>Area<li>Population</li></li>, <li>Population</li>]
```
A complete example:
```
In [1]: from bs4 import BeautifulSoup
In [2]: url = 'http://example.webscraping.com/places/view/United-Kingdom-239'
In [3]: import common
In [5]: html = common.download(url)
Downloading: http://example.webscraping.com/places/view/United-Kingdom-239
In [6]: soup = BeautifulSoup(html)
/usr/lib/python2.7/site-packages/bs4/__init__.py:166: UserWarning: No parser was
explicitly specified, so I'm using the best available HTML parser for this
system ("lxml"). This usually isn't a problem, but if you run this code on
another system, or in a different virtual environment, it may use a different
parser and behave differently.

To get rid of this warning, change this:

 BeautifulSoup([your markup])

to this:

 BeautifulSoup([your markup], "lxml")

  markup_type=markup_type))
In [7]: # locate the area row
In [8]: tr = soup.find(attrs={'id':'places_area__row'})
In [9]: td = tr.find(attrs={'class':'w2p_fw'}) # locate the area tag
In [10]: area = td.text # extract the text from this tag
In [11]: print area
244,820 square kilometres
```
lxml is built on libxml2 (implemented in C), which makes it faster, though sometimes harder to install. See http://lxml.de/installation.html.
```
In [1]: import lxml.html
In [2]: broken_html = '<ul class=country><li>Area<li>Population</ul>'
In [3]: tree = lxml.html.fromstring(broken_html) # parse the HTML
In [4]: fixed_html = lxml.html.tostring(tree, pretty_print=True)
In [5]: print fixed_html
<ul class="country">
<li>Area</li>
<li>Population</li>
</ul>
```
lxml is also quite fault tolerant; a missing closing tag is usually not a problem.
Next we use CSS selectors; note that the cssselect package must be installed.
```
In [1]: import common
In [2]: import lxml.html
In [3]: url = 'http://example.webscraping.com/places/view/United-Kingdom-239'
In [4]: html = common.download(url)
Downloading: http://example.webscraping.com/places/view/United-Kingdom-239
In [5]: tree = lxml.html.fromstring(html)
In [6]: td = tree.cssselect('tr#places_area__row > td.w2p_fw')[0]
In [7]: area = td.text_content()
In [8]: print area
244,820 square kilometres
```
In CSS, a selector is a pattern used to select the elements that should be styled.
The "CSS" column indicates in which CSS version the selector was defined (CSS1, CSS2, or CSS3).
Selector | Example | Description | CSS |
---|---|---|---|
.class | .intro | Selects all elements with class="intro". | 1 |
#id | #firstname | Selects all elements with id="firstname". | 1 |
* | * | Selects all elements. | 2 |
element | p | Selects all <p> elements. | 1 |
element,element | div,p | Selects all <div> elements and all <p> elements. | 1 |
element element | div p | Selects all <p> elements inside <div> elements. | 1 |
element>element | div>p | Selects all <p> elements whose parent is a <div> element. | 2 |
element+element | div+p | Selects all <p> elements placed immediately after a <div> element. | 2 |
[attribute] | [target] | Selects all elements with a target attribute. | 2 |
[attribute=value] | [target=_blank] | Selects all elements with target="_blank". | 2 |
[attribute~=value] | [title~=flower] | Selects all elements whose title attribute contains the word "flower". | 2 |
[attribute|=value] | [lang|=en] | Selects all elements whose lang attribute value begins with "en". | 2 |
:link | a:link | Selects all unvisited links. | 1 |
:visited | a:visited | Selects all visited links. | 1 |
:active | a:active | Selects the active link. | 1 |
:hover | a:hover | Selects the link the mouse pointer is over. | 1 |
:focus | input:focus | Selects the input element that has focus. | 2 |
:first-letter | p:first-letter | Selects the first letter of every <p> element. | 1 |
:first-line | p:first-line | Selects the first line of every <p> element. | 1 |
:first-child | p:first-child | Selects every <p> element that is the first child of its parent. | 2 |
:before | p:before | Inserts content before the content of every <p> element. | 2 |
:after | p:after | Inserts content after the content of every <p> element. | 2 |
:lang(language) | p:lang(it) | Selects every <p> element with a lang attribute value beginning with "it". | 2 |
element1~element2 | p~ul | Selects every <ul> element that is preceded by a <p> element. | 3 |
[attribute^=value] | a[src^="https"] | Selects every <a> element whose src attribute value begins with "https". | 3 |
[attribute$=value] | a[src$=".pdf"] | Selects every <a> element whose src attribute ends with ".pdf". | 3 |
[attribute*=value] | a[src*="abc"] | Selects every <a> element whose src attribute contains the substring "abc". | 3 |
:first-of-type | p:first-of-type | Selects every <p> element that is the first <p> element of its parent. | 3 |
:last-of-type | p:last-of-type | Selects every <p> element that is the last <p> element of its parent. | 3 |
:only-of-type | p:only-of-type | Selects every <p> element that is the only <p> element of its parent. | 3 |
:only-child | p:only-child | Selects every <p> element that is the only child of its parent. | 3 |
:nth-child(n) | p:nth-child(2) | Selects every <p> element that is the second child of its parent. | 3 |
:nth-last-child(n) | p:nth-last-child(2) | Same as above, counting from the last child. | 3 |
:nth-of-type(n) | p:nth-of-type(2) | Selects every <p> element that is the second <p> element of its parent. | 3 |
:nth-last-of-type(n) | p:nth-last-of-type(2) | Same as above, but counting from the last child. | 3 |
:last-child | p:last-child | Selects every <p> element that is the last child of its parent. | 3 |
:root | :root | Selects the document's root element. | 3 |
:empty | p:empty | Selects every <p> element that has no children (including text nodes). | 3 |
:target | #news:target | Selects the currently active #news element. | 3 |
:enabled | input:enabled | Selects every enabled <input> element. | 3 |
:disabled | input:disabled | Selects every disabled <input> element. | 3 |
:checked | input:checked | Selects every checked <input> element. | 3 |
:not(selector) | :not(p) | Selects every element that is not a <p> element. | 3 |
::selection | ::selection | Selects the portion of an element that is selected by the user. | 3 |
For more on CSS selectors, see http://www.w3school.com.cn/cssref/css_selectors.ASP and https://pythonhosted.org/cssselect/#supported-selectors.
Next we compare the performance of the three approaches by extracting the country data from the page used above.
The benchmark code:
```python
import urllib2
import itertools
import re
from bs4 import BeautifulSoup
import lxml.html
import time

FIELDS = ('area', 'population', 'iso', 'country', 'capital', 'continent',
          'tld', 'currency_code', 'currency_name', 'phone',
          'postal_code_format', 'postal_code_regex', 'languages', 'neighbours')

def download(url, user_agent='Mozilla/5.0 (X11; Linux x86_64; rv:38.0) Gecko/20100101 Firefox/38.0', num_retries=2):
    print 'Downloading:', url
    headers = {'User-agent': user_agent}
    request = urllib2.Request(url, headers=headers)
    try:
        html = urllib2.urlopen(request).read()
    except urllib2.URLError as e:
        print 'Download error:', e.reason
        html = None
        if num_retries > 0:
            if hasattr(e, 'code') and 500 <= e.code < 600:
                # recursively retry 5xx HTTP errors
                return download(url, num_retries-1)
    return html

def re_scraper(html):
    results = {}
    for field in FIELDS:
        results[field] = re.search(
            r'places_%s__row.*?w2p_fw">(.*?)</td>' % field,
            html.replace('\n', '')).groups()[0]
    return results

def bs_scraper(html):
    soup = BeautifulSoup(html, 'html.parser')
    results = {}
    for field in FIELDS:
        results[field] = soup.find('table') \
            .find('tr', id='places_%s__row' % field) \
            .find('td', class_='w2p_fw').text
    return results

def lxml_scraper(html):
    tree = lxml.html.fromstring(html)
    results = {}
    for field in FIELDS:
        results[field] = tree.cssselect(
            'table > tr#places_%s__row > td.w2p_fw' % field)[0].text_content()
    return results

NUM_ITERATIONS = 1000  # number of times to test each scraper
html = download('http://example.webscraping.com/places/view/United-Kingdom-239')
for name, scraper in [('Regular expressions', re_scraper),
                      ('BeautifulSoup', bs_scraper),
                      ('Lxml', lxml_scraper)]:
    # record start time of scrape
    start = time.time()
    for i in range(NUM_ITERATIONS):
        if scraper == re_scraper:
            re.purge()
        result = scraper(html)
        # check scraped result is as expected
        assert(result['area'] == '244,820 square kilometres')
    # record end time of scrape and output the total
    end = time.time()
    print '%s: %.2f seconds' % (name, end - start)
```
Result on Windows:
```
Downloading: http://example.webscraping.com/places/view/United-Kingdom-239
Regular expressions: 11.63 seconds
BeautifulSoup: 92.80 seconds
Lxml: 7.25 seconds
```
Result on Linux:
```
Downloading: http://example.webscraping.com/places/view/United-Kingdom-239
Regular expressions: 3.09 seconds
BeautifulSoup: 29.40 seconds
Lxml: 4.25 seconds
```
Here re.purge() is used to clear the regular expression cache; otherwise the compiled patterns would be reused and the comparison would be unfair.
lxml on Linux is the recommended choice; its advantage is even clearer when the same page is parsed many times.