Web scraping is well suited to collecting and processing large amounts of data. It can go beyond what search engines offer, for example finding the cheapest airfare.
APIs provide nicely formatted data, but many sites offer no API, and there is no universal one. Even when an API exists, its data types and formats may not fully match your needs, and it may be too slow.
Application areas include market forecasting, machine translation, and medical diagnosis. It can even be art, e.g. http://wefeelfine.org/.
This article is based on Python 3 and assumes basic Python knowledge.
Code download: http://pythonscraping.com/code/.
```python
from urllib.request import urlopen

html = urlopen("http://www.pythonscraping.com/exercises/exercise1.html")
print(html.read())
```
Output:
```
$ python3 1-basicExample.py
b'<html>\n<head>\n<title>A Useful Page</title>\n</head>\n<body>\n<h1>An Interesting Title</h1>\n<div>\nLorem ipsum dolor sit amet, consectetur adipisicing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua. Ut enim ad minim veniam, quis nostrud exercitation ullamco laboris nisi ut aliquip ex ea commodo consequat. Duis aute irure dolor in reprehenderit in voluptate velit esse cillum dolore eu fugiat nulla pariatur. Excepteur sint occaecat cupidatat non proident, sunt in culpa qui officia deserunt mollit anim id est laborum.\n</div>\n</body>\n</html>\n'
```
```python
from urllib.request import urlopen
from bs4 import BeautifulSoup

html = urlopen("http://www.pythonscraping.com/pages/page1.html")
bsObj = BeautifulSoup(html.read(), 'lxml')
print(bsObj.h1)
```
Output:
```
$ python3 2-beautifulSoup.py
<h1>An Interesting Title</h1>
```
The HTML is structured hierarchically as follows:
• html → `<html><head>...</head><body>...</body></html>`
  • head → `<head><title>A Useful Page</title></head>`
    • title → `<title>A Useful Page</title>`
  • body → `<body><h1>An Int...</h1><div>Lorem ip...</div></body>`
    • h1 → `<h1>An Interesting Title</h1>`
    • div → `<div>Lorem ipsum dolor...</div>`
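The same nesting can be reproduced with the standard library's html.parser; the following is a minimal sketch (the TreePrinter class is our own illustration, not from the book):

```python
from html.parser import HTMLParser

class TreePrinter(HTMLParser):
    """Collect each opened tag, indented by its nesting depth."""
    def __init__(self):
        super().__init__()
        self.depth = 0
        self.lines = []

    def handle_starttag(self, tag, attrs):
        self.lines.append("  " * self.depth + tag)
        self.depth += 1

    def handle_endtag(self, tag):
        self.depth -= 1

parser = TreePrinter()
parser.feed("<html><head><title>A Useful Page</title></head>"
            "<body><h1>An Interesting Title</h1><div>Lorem ipsum...</div></body></html>")
print("\n".join(parser.lines))
```

This prints the tag tree (html, then head/title, then body/h1/div) with two-space indentation per level.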
Note that bsObj.h1 here is equivalent to any of the following:
```python
bsObj.html.body.h1
bsObj.body.h1
bsObj.html.h1
```
Common errors with urlopen:

• The page cannot be found on the server, or an error occurred while retrieving it (404 or 500); this raises an HTTPError.
• The server itself cannot be found; this raises a URLError (of which HTTPError is a subclass).

The HTTPError case can be handled as follows:
```python
from urllib.request import urlopen
from urllib.error import HTTPError

try:
    html = urlopen("http://www.pythonscraping.com/pages/page1.html")
except HTTPError as e:
    print(e)
    # return null, break, or do some other "Plan B"
else:
    # program continues. Note: If you return or break in the
    # exception catch, you do not need to use the "else" statement
    pass
```
This article is summarized from Web Scraping with Python (2015).
Book download: https://bitbucket.org/xurongzhong/python-chinese-library/downloads
Source code: https://bitbucket.org/wswp/code
Demo site: http://example.webscraping.com/
Demo site code: http://bitbucket.org/wswp/places
Recommended Python primer: http://www.diveintopython.net
HTML and JavaScript basics:
Author's blog: http://my.oschina.net/u/1433482/
This article: http://my.oschina.net/u/1433482/blog/620858
Discussion: QQ group 291184506 (Python automated testing), QQ group 144081101 (Python/Java unit and white-box testing)
Why scrape the web?
When shopping online you may want to compare prices across sites, which is what tools like the Huihui shopping assistant do. An API makes this easy, but usually there is none, and that is when web scraping is needed.
Is web scraping legal?
Personal use of scraped data is generally not illegal; commercial use or republication requires considering authorization, and scraping etiquette also matters. Judging from cases already decided abroad, facts such as locations and phone numbers can generally be republished, but original (creative) data cannot.
Further reading:
http://www.bvhd.dk/uploads/tx_mocarticles/S_-_og_Handelsrettens_afg_relse_i_Ofir-sagen.pdf
http://www.austlii.edu.au/au/cases/cth/FCA/2010/44.html
http://caselaw.findlaw.com/us-supreme-court/499/340.html
Background research
robots.txt and the Sitemap help you understand a site's scale and structure; tools such as Google search and WHOIS are also useful.
For example: http://example.webscraping.com/robots.txt
```
# section 1
User-agent: BadCrawler
Disallow: /

# section 2
User-agent: *
Crawl-delay: 5
Disallow: /trap

# section 3
Sitemap: http://example.webscraping.com/sitemap.xml
```
For more about web robots, see http://www.robotstxt.org.
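A crawler can honor rules like those above programmatically. A sketch using Python 3's stdlib urllib.robotparser (the book's own code targets Python 2), with the robots.txt shown above inlined as a string instead of fetched:

```python
from urllib import robotparser

# The robots.txt content from the example site, inlined for illustration
robots_txt = """\
User-agent: BadCrawler
Disallow: /

User-agent: *
Crawl-delay: 5
Disallow: /trap
"""

rp = robotparser.RobotFileParser()
rp.parse(robots_txt.splitlines())

# BadCrawler is banned from the whole site
print(rp.can_fetch("BadCrawler", "http://example.webscraping.com/"))
# Everyone else is only banned from the /trap link
print(rp.can_fetch("GoodCrawler", "http://example.webscraping.com/trap"))
print(rp.can_fetch("GoodCrawler", "http://example.webscraping.com/view/1"))
# The requested delay between downloads, in seconds
print(rp.crawl_delay("GoodCrawler"))
```

In a real crawler you would call `rp.set_url(...)` and `rp.read()` to fetch the live file, and sleep `crawl_delay` seconds between requests.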
The Sitemap protocol is described at http://www.sitemaps.org/protocol.html. For example:
```xml
<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>http://example.webscraping.com/view/Afghanistan-1</loc></url>
  <url><loc>http://example.webscraping.com/view/Aland-Islands-2</loc></url>
  <url><loc>http://example.webscraping.com/view/Albania-3</loc></url>
  ...
</urlset>
```
Sitemaps are often incomplete, however.
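The `<loc>` links can also be extracted with the stdlib's xml.etree.ElementTree rather than the regex used later in this article; note that the sitemap namespace must be mapped explicitly. A sketch with two sample entries inlined:

```python
import xml.etree.ElementTree as ET

# Two entries of the example sitemap, inlined for illustration
sitemap_xml = """<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>http://example.webscraping.com/view/Afghanistan-1</loc></url>
  <url><loc>http://example.webscraping.com/view/Aland-Islands-2</loc></url>
</urlset>"""

# Map a prefix to the sitemap namespace so findall() can address the tags
ns = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}
root = ET.fromstring(sitemap_xml)
links = [loc.text for loc in root.findall("sm:url/sm:loc", ns)]
print(links)
```

Unlike a regex, this fails loudly on malformed XML, which is useful when a sitemap is truncated.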
Estimating site size:
Use Google's site: query, for example: site:automationtesting.sinaapp.com
Assessing the technology a site is built with:
```
# pip install builtwith
# ipython
In [1]: import builtwith
In [2]: builtwith.parse('http://automationtesting.sinaapp.com/')
Out[2]:
{u'issue-trackers': [u'Trac'],
 u'javascript-frameworks': [u'jQuery'],
 u'programming-languages': [u'Python'],
 u'web-servers': [u'Nginx']}
```
Identifying the site owner:
```
# pip install python-whois
# ipython
In [1]: import whois
In [2]: print whois.whois('http://automationtesting.sinaapp.com')
{
  "updated_date": "2016-01-07 00:00:00",
  "status": [
    "serverDeleteProhibited https://www.icann.org/epp#serverDeleteProhibited",
    "serverTransferProhibited https://www.icann.org/epp#serverTransferProhibited",
    "serverUpdateProhibited https://www.icann.org/epp#serverUpdateProhibited"
  ],
  "name": null,
  "dnssec": null,
  "city": null,
  "expiration_date": "2021-06-29 00:00:00",
  "zipcode": null,
  "domain_name": "SINAAPP.COM",
  "country": null,
  "whois_server": "whois.paycenter.com.cn",
  "state": null,
  "registrar": "XIN NET TECHNOLOGY CORPORATION",
  "referral_url": "http://www.xinnet.com",
  "address": null,
  "name_servers": [
    "NS1.SINAAPP.COM",
    "NS2.SINAAPP.COM",
    "NS3.SINAAPP.COM",
    "NS4.SINAAPP.COM"
  ],
  "org": null,
  "creation_date": "2009-06-29 00:00:00",
  "emails": null
}
```
Crawling your first site
A simple download function for crawling looks like this:
```python
import urllib2

def download(url):
    print 'Downloading:', url
    try:
        html = urllib2.urlopen(url).read()
    except urllib2.URLError as e:
        print 'Download error:', e.reason
        html = None
    return html
```
We can retry based on the error code. HTTP status codes are defined in https://tools.ietf.org/html/rfc7231#section-6: 4xx errors are not worth retrying, while 5xx errors may succeed on a retry.
```python
import urllib2

def download(url, num_retries=2):
    print 'Downloading:', url
    try:
        html = urllib2.urlopen(url).read()
    except urllib2.URLError as e:
        print 'Download error:', e.reason
        html = None
        if num_retries > 0:
            if hasattr(e, 'code') and 500 <= e.code < 600:
                # recursively retry 5xx HTTP errors
                return download(url, num_retries-1)
    return html
```
http://httpstat.us/500 always returns status 500, so we can use it for testing:
```
>>> download('http://httpstat.us/500')
Downloading: http://httpstat.us/500
Download error: Internal Server Error
Downloading: http://httpstat.us/500
Download error: Internal Server Error
Downloading: http://httpstat.us/500
Download error: Internal Server Error
```
Setting the user agent:
urllib2's default user agent is "Python-urllib/2.7", which many sites block. It is better to use something close to a real browser's agent, for example:
Mozilla/5.0 (X11; Linux x86_64; rv:38.0) Gecko/20100101 Firefox/38.0
So we add a user-agent parameter:
```python
import urllib2

def download(url, user_agent='Mozilla/5.0 (X11; Linux x86_64; rv:38.0) Gecko/20100101 Firefox/38.0', num_retries=2):
    print 'Downloading:', url
    headers = {'User-agent': user_agent}
    request = urllib2.Request(url, headers=headers)
    try:
        html = urllib2.urlopen(request).read()
    except urllib2.URLError as e:
        print 'Download error:', e.reason
        html = None
        if num_retries > 0:
            if hasattr(e, 'code') and 500 <= e.code < 600:
                # recursively retry 5xx HTTP errors
                return download(url, num_retries-1)
    return html
```
Crawling via the sitemap:
```python
import re

def crawl_sitemap(url):
    # download the sitemap file
    sitemap = download(url)
    # extract the sitemap links
    links = re.findall('<loc>(.*?)</loc>', sitemap)
    # download each link
    for link in links:
        html = download(link)
        # scrape html here
        # ...
```
Crawling by iterating over IDs:
• http://example.webscraping.com/view/Afghanistan-1
• http://example.webscraping.com/view/Australia-2
• http://example.webscraping.com/view/Brazil-3
These URLs differ only in their final path segment. Programmers typically use database IDs, e.g. http://example.webscraping.com/view/1, so we can crawl pages by database ID directly.
```python
import itertools

for page in itertools.count(1):
    url = 'http://example.webscraping.com/view/-%d' % page
    html = download(url)
    if html is None:
        break
    else:
        # success - can scrape the result
        pass
```
Of course, records may have been deleted from the database, leaving holes in the ID sequence, so we improve the loop as follows:
```python
import itertools

# maximum number of consecutive download errors allowed
max_errors = 5
# current number of consecutive download errors
num_errors = 0
for page in itertools.count(1):
    url = 'http://example.webscraping.com/view/-%d' % page
    html = download(url)
    if html is None:
        # received an error trying to download this webpage
        num_errors += 1
        if num_errors == max_errors:
            # reached maximum number of
            # consecutive errors so exit
            break
    else:
        # success - can scrape the result
        # ...
        num_errors = 0
```
Some sites return 404 for missing pages, and some sites' IDs are not this regular; Amazon, for example, uses ISBNs.
Every major browser has a "view page source" feature; in Firefox, Firebug is especially convenient. Both can be invoked by right-clicking the page.
There are three main approaches to extracting data from a page: regular expressions, BeautifulSoup, and lxml.
A regular-expression example:
```
In [1]: import re
In [2]: import common
In [3]: url = 'http://example.webscraping.com/view/UnitedKingdom-239'
In [4]: html = common.download(url)
Downloading: http://example.webscraping.com/view/UnitedKingdom-239
In [5]: re.findall('<td class="w2p_fw">(.*?)</td>', html)
Out[5]:
['<img src="/places/static/images/flags/gb.png" />',
 '244,820 square kilometres',
 '62,348,447',
 'GB',
 'United Kingdom',
 'London',
 '<a href="/continent/EU">EU</a>',
 '.uk',
 'GBP',
 'Pound',
 '44',
 '@# #@@|@## #@@|@@# #@@|@@## #@@|@#@ #@@|@@#@ #@@|GIR0AA',
 '^(([A-Z]\\d{2}[A-Z]{2})|([A-Z]\\d{3}[A-Z]{2})|([A-Z]{2}\\d{2}[A-Z]{2})|([A-Z]{2}\\d{3}[A-Z]{2})|([A-Z]\\d[A-Z]\\d[A-Z]{2})|([A-Z]{2}\\d[A-Z]\\d[A-Z]{2})|(GIR0AA))$',
 'en-GB,cy-GB,gd',
 '<div><a href="/iso/IE">IE </a></div>']
In [6]: re.findall('<td class="w2p_fw">(.*?)</td>', html)[1]
Out[6]: '244,820 square kilometres'
```
The maintenance cost of this approach is relatively high.
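A hypothetical illustration of that cost: a harmless markup change, such as an added id attribute, silently breaks an over-specific pattern, while a looser one survives (the HTML strings here are made up for the demonstration):

```python
import re

old_html = '<td class="w2p_fw">244,820 square kilometres</td>'
# Same cell after a site redesign added an id attribute
new_html = '<td class="w2p_fw" id="places_area">244,820 square kilometres</td>'

pattern = r'<td class="w2p_fw">(.*?)</td>'
print(re.findall(pattern, old_html))  # matches
print(re.findall(pattern, new_html))  # empty: the extra attribute breaks the match

# A looser pattern tolerates extra attributes, at some cost in readability
loose = r'<td[^>]*class="w2p_fw"[^>]*>(.*?)</td>'
print(re.findall(loose, new_html))
```

Parser-based approaches such as BeautifulSoup and lxml are insensitive to this kind of change, which is why they are usually easier to maintain.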
Beautiful Soup:
```
In [7]: from bs4 import BeautifulSoup
In [8]: broken_html = '<ul class=country><li>Area<li>Population</ul>'
In [9]: # parse the HTML
In [10]: soup = BeautifulSoup(broken_html, 'html.parser')
In [11]: fixed_html = soup.prettify()
In [12]: print fixed_html
<ul class="country">
 <li>
  Area
  <li>
   Population
  </li>
 </li>
</ul>
In [13]: ul = soup.find('ul', attrs={'class':'country'})
In [14]: ul.find('li') # returns just the first match
Out[14]: <li>Area<li>Population</li></li>
In [15]: ul.find_all('li') # returns all matches
Out[15]: [<li>Area<li>Population</li></li>, <li>Population</li>]
```
A complete example:
```
In [1]: from bs4 import BeautifulSoup
In [2]: url = 'http://example.webscraping.com/places/view/United-Kingdom-239'
In [3]: import common
In [5]: html = common.download(url)
Downloading: http://example.webscraping.com/places/view/United-Kingdom-239
In [6]: soup = BeautifulSoup(html)
/usr/lib/python2.7/site-packages/bs4/__init__.py:166: UserWarning: No parser was
explicitly specified, so I'm using the best available HTML parser for this
system ("lxml"). This usually isn't a problem, but if you run this code on
another system, or in a different virtual environment, it may use a different
parser and behave differently.

To get rid of this warning, change this:

 BeautifulSoup([your markup])

to this:

 BeautifulSoup([your markup], "lxml")

  markup_type=markup_type))
In [7]: # locate the area row
In [8]: tr = soup.find(attrs={'id':'places_area__row'})
In [9]: td = tr.find(attrs={'class':'w2p_fw'}) # locate the area tag
In [10]: area = td.text # extract the text from this tag
In [11]: print area
244,820 square kilometres
```
lxml is built on libxml2 (implemented in C), which makes it faster, though sometimes harder to install. See http://lxml.de/installation.html.
```
In [1]: import lxml.html
In [2]: broken_html = '<ul class=country><li>Area<li>Population</ul>'
In [3]: tree = lxml.html.fromstring(broken_html) # parse the HTML
In [4]: fixed_html = lxml.html.tostring(tree, pretty_print=True)
In [5]: print fixed_html
<ul class="country">
<li>Area</li>
<li>Population</li>
</ul>
```
lxml is also quite fault tolerant; a missing closing tag is usually not a problem.
Next we use CSS selectors; note that the cssselect package must be installed.
```
In [1]: import common
In [2]: import lxml.html
In [3]: url = 'http://example.webscraping.com/places/view/United-Kingdom-239'
In [4]: html = common.download(url)
Downloading: http://example.webscraping.com/places/view/United-Kingdom-239
In [5]: tree = lxml.html.fromstring(html)
In [6]: td = tree.cssselect('tr#places_area__row > td.w2p_fw')[0]
In [7]: area = td.text_content()
In [8]: print area
244,820 square kilometres
```
In CSS, a selector is a pattern used to select the elements that should be styled.
The "CSS" column indicates in which CSS version the selector was defined (CSS1, CSS2, or CSS3).
Selector | Example | Description | CSS |
---|---|---|---|
.class | .intro | Selects all elements with class="intro". | 1 |
#id | #firstname | Selects all elements with id="firstname". | 1 |
* | * | Selects all elements. | 2 |
element | p | Selects all <p> elements. | 1 |
element,element | div,p | Selects all <div> elements and all <p> elements. | 1 |
element element | div p | Selects all <p> elements inside <div> elements. | 1 |
element>element | div>p | Selects all <p> elements whose parent is a <div> element. | 2 |
element+element | div+p | Selects all <p> elements placed immediately after a <div> element. | 2 |
[attribute] | [target] | Selects all elements with a target attribute. | 2 |
[attribute=value] | [target=_blank] | Selects all elements with target="_blank". | 2 |
[attribute~=value] | [title~=flower] | Selects all elements whose title attribute contains the word "flower". | 2 |
[attribute|=value] | [lang|=en] | Selects all elements whose lang attribute value begins with "en". | 2 |
:link | a:link | Selects all unvisited links. | 1 |
:visited | a:visited | Selects all visited links. | 1 |
:active | a:active | Selects the active link. | 1 |
:hover | a:hover | Selects the link the mouse pointer is over. | 1 |
:focus | input:focus | Selects the input element that has focus. | 2 |
:first-letter | p:first-letter | Selects the first letter of every <p> element. | 1 |
:first-line | p:first-line | Selects the first line of every <p> element. | 1 |
:first-child | p:first-child | Selects every <p> element that is the first child of its parent. | 2 |
:before | p:before | Inserts content before the content of every <p> element. | 2 |
:after | p:after | Inserts content after the content of every <p> element. | 2 |
:lang(language) | p:lang(it) | Selects every <p> element with a lang attribute value beginning with "it". | 2 |
element1~element2 | p~ul | Selects every <ul> element that is preceded by a <p> element. | 3 |
[attribute^=value] | a[src^="https"] | Selects every <a> element whose src attribute value begins with "https". | 3 |
[attribute$=value] | a[src$=".pdf"] | Selects every <a> element whose src attribute ends with ".pdf". | 3 |
[attribute*=value] | a[src*="abc"] | Selects every <a> element whose src attribute contains the substring "abc". | 3 |
:first-of-type | p:first-of-type | Selects every <p> element that is the first <p> element of its parent. | 3 |
:last-of-type | p:last-of-type | Selects every <p> element that is the last <p> element of its parent. | 3 |
:only-of-type | p:only-of-type | Selects every <p> element that is the only <p> element of its parent. | 3 |
:only-child | p:only-child | Selects every <p> element that is the only child of its parent. | 3 |
:nth-child(n) | p:nth-child(2) | Selects every <p> element that is the second child of its parent. | 3 |
:nth-last-child(n) | p:nth-last-child(2) | Same as above, counting from the last child. | 3 |
:nth-of-type(n) | p:nth-of-type(2) | Selects every <p> element that is the second <p> element of its parent. | 3 |
:nth-last-of-type(n) | p:nth-last-of-type(2) | Same as above, but counting from the last child. | 3 |
:last-child | p:last-child | Selects every <p> element that is the last child of its parent. | 3 |
:root | :root | Selects the document's root element. | 3 |
:empty | p:empty | Selects every <p> element that has no children (including text nodes). | 3 |
:target | #news:target | Selects the currently active #news element. | 3 |
:enabled | input:enabled | Selects every enabled <input> element. | 3 |
:disabled | input:disabled | Selects every disabled <input> element. | 3 |
:checked | input:checked | Selects every checked <input> element. | 3 |
:not(selector) | :not(p) | Selects every element that is not a <p> element. | 3 |
::selection | ::selection | Selects the portion of an element that is selected by the user. | 3 |
For more on CSS selectors, see http://www.w3school.com.cn/cssref/css_selectors.ASP and https://pythonhosted.org/cssselect/#supported-selectors.
Next we compare the performance of the three approaches by extracting the country data from the page used above.
The benchmark code:
```python
import urllib2
import itertools
import re
from bs4 import BeautifulSoup
import lxml.html
import time

FIELDS = ('area', 'population', 'iso', 'country', 'capital', 'continent',
          'tld', 'currency_code', 'currency_name', 'phone',
          'postal_code_format', 'postal_code_regex', 'languages', 'neighbours')

def download(url, user_agent='Mozilla/5.0 (X11; Linux x86_64; rv:38.0) Gecko/20100101 Firefox/38.0', num_retries=2):
    print 'Downloading:', url
    headers = {'User-agent': user_agent}
    request = urllib2.Request(url, headers=headers)
    try:
        html = urllib2.urlopen(request).read()
    except urllib2.URLError as e:
        print 'Download error:', e.reason
        html = None
        if num_retries > 0:
            if hasattr(e, 'code') and 500 <= e.code < 600:
                # recursively retry 5xx HTTP errors
                return download(url, num_retries-1)
    return html

def re_scraper(html):
    results = {}
    for field in FIELDS:
        results[field] = re.search(
            r'places_%s__row.*?w2p_fw">(.*?)</td>' % field,
            html.replace('\n', '')).groups()[0]
    return results

def bs_scraper(html):
    soup = BeautifulSoup(html, 'html.parser')
    results = {}
    for field in FIELDS:
        results[field] = soup.find('table') \
            .find('tr', id='places_%s__row' % field) \
            .find('td', class_='w2p_fw').text
    return results

def lxml_scraper(html):
    tree = lxml.html.fromstring(html)
    results = {}
    for field in FIELDS:
        results[field] = tree.cssselect(
            'table > tr#places_%s__row > td.w2p_fw' % field)[0].text_content()
    return results

NUM_ITERATIONS = 1000  # number of times to test each scraper
html = download('http://example.webscraping.com/places/view/United-Kingdom-239')
for name, scraper in [('Regular expressions', re_scraper),
                      ('BeautifulSoup', bs_scraper),
                      ('Lxml', lxml_scraper)]:
    # record start time of scrape
    start = time.time()
    for i in range(NUM_ITERATIONS):
        if scraper == re_scraper:
            re.purge()
        result = scraper(html)
        # check scraped result is as expected
        assert(result['area'] == '244,820 square kilometres')
    # record end time of scrape and output the total
    end = time.time()
    print '%s: %.2f seconds' % (name, end - start)
```
Result on Windows:
```
Downloading: http://example.webscraping.com/places/view/United-Kingdom-239
Regular expressions: 11.63 seconds
BeautifulSoup: 92.80 seconds
Lxml: 7.25 seconds
```
Result on Linux:
```
Downloading: http://example.webscraping.com/places/view/United-Kingdom-239
Regular expressions: 3.09 seconds
BeautifulSoup: 29.40 seconds
Lxml: 4.25 seconds
```
Here re.purge() is used to clear the regular expression cache; otherwise the compiled patterns would be reused and the comparison would be unfair.
lxml on Linux is the recommended choice; its advantage is even clearer when the same page is parsed many times.