Web Scraping: Commonly Used Python Crawler Libraries

1. Commonly used libraries

1) requests — used for making HTTP requests.

requests.get("url")

2) selenium — used for browser automation.

3) lxml — HTML/XML parsing.

4) beautifulsoup — HTML parsing.

5) pyquery — a web page parsing library; said to be easier to use than beautifulsoup, with syntax very similar to jQuery.

6) pymysql — a storage library, for working with MySQL data.

7) pymongo — for working with the MongoDB database.

8) redis — a non-relational (NoSQL) database.

9) jupyter — an online notebook.

2. What is urllib?

Python's built-in HTTP request library:

urllib.request — request module; simulates a browser

urllib.error — exception handling module

urllib.parse — URL parsing module; utilities such as splitting and joining URLs

urllib.robotparser — robots.txt parsing module
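Of these, urllib.robotparser is easy to try without any network access. A minimal sketch, feeding made-up robots.txt rules directly to parse() instead of fetching a live file:

```python
import urllib.robotparser

# Hypothetical robots.txt rules, fed in directly so no HTTP fetch is needed.
rp = urllib.robotparser.RobotFileParser()
rp.parse([
    "User-agent: *",
    "Disallow: /private/",
])

print(rp.can_fetch("*", "http://example.com/index.html"))  # True
print(rp.can_fetch("*", "http://example.com/private/x"))   # False
```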

 

Differences between Python 2 and 3

Python 2

import urllib2
response = urllib2.urlopen('http://www.baidu.com')

Python 3

import urllib.request
response = urllib.request.urlopen('http://www.baidu.com')

Usage:

urlopen sends a request to the server.

urllib.request.urlopen(url, data=None, [timeout, ]*, cafile=None, capath=None, cadefault=False, context=None)

Examples:

Example 1:

import urllib.request

response = urllib.request.urlopen('http://www.baidu.com')
print(response.read().decode('utf-8'))

 

Example 2:

import urllib.request
import urllib.parse

data = bytes(urllib.parse.urlencode({'word': 'hello'}), encoding='utf8')
response = urllib.request.urlopen('http://httpbin.org/post', data=data)
print(response.read())

Note: passing data makes it a POST request; omitting it sends a GET request.
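The encoding step above can be checked offline; this sketch just shows what urlencode and bytes() produce for the data= parameter:

```python
import urllib.parse

# urlencode turns a dict into a query string; bytes() encodes it
# into the form urlopen's data= parameter expects.
payload = urllib.parse.urlencode({'word': 'hello'})
print(payload)  # word=hello

data = bytes(payload, encoding='utf8')
print(data)     # b'word=hello'
```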

 

Example 3:

Timeout test

import urllib.request

response = urllib.request.urlopen('http://httpbin.org/get', timeout=1)
print(response.read())

----- works normally

import socket
import urllib.request
import urllib.error

try:
    response = urllib.request.urlopen('http://httpbin.org/get', timeout=0.1)
except urllib.error.URLError as e:
    if isinstance(e.reason, socket.timeout):
        print('TIME OUT')

This prints TIME OUT.

 

Response

Response type

import urllib.request

response = urllib.request.urlopen('https://www.python.org')
print(type(response))

Output: <class 'http.client.HTTPResponse'>

 

     

Status code and response headers

import urllib.request

response = urllib.request.urlopen('http://www.python.org')
print(response.status)  # 200 on success
print(response.getheaders())  # returns the response headers
print(response.getheader('Server'))

 

3. Request — lets you add headers

import urllib.request

request = urllib.request.Request('https://python.org')
response = urllib.request.urlopen(request)
print(response.read().decode('utf-8'))

 

 

Example:

from urllib import request, parse

url = 'http://httpbin.org/post'
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/55.0.2883.75 Safari/537.36',
    'Host': 'httpbin.org'
}
dict = {
    'name': 'Germey'
}

data = bytes(parse.urlencode(dict), encoding='utf8')
req = request.Request(url=url, data=data, headers=headers, method='POST')
response = request.urlopen(req)
print(response.read().decode('utf-8'))
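A Request object can also be inspected before anything is sent, which is handy for checking headers offline. A minimal sketch (the URL and header values are illustrative):

```python
from urllib import request, parse

# Build a Request without sending it; the URL and values are illustrative.
url = 'http://httpbin.org/post'
headers = {'User-Agent': 'Mozilla/5.0', 'Host': 'httpbin.org'}
data = bytes(parse.urlencode({'name': 'Germey'}), encoding='utf8')
req = request.Request(url=url, data=data, headers=headers, method='POST')

print(req.get_method())              # POST
print(req.get_header('User-agent'))  # Mozilla/5.0 (keys are stored capitalized)
print(req.data)                      # b'name=Germey'
```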

 

 

4. Proxies

import urllib.request

proxy_handler = urllib.request.ProxyHandler({
    'http': 'http://127.0.0.1:9743',
    'https': 'http://127.0.0.1:9743',
})
opener = urllib.request.build_opener(proxy_handler)
response = opener.open('http://httpbin.org/get')
print(response.read())

 

 

5. Cookies

import http.cookiejar, urllib.request

cookie = http.cookiejar.CookieJar()
handler = urllib.request.HTTPCookieProcessor(cookie)
opener = urllib.request.build_opener(handler)
response = opener.open('http://www.baidu.com')
for item in cookie:
    print(item.name + "=" + item.value)

 

First way to save cookies:

import http.cookiejar, urllib.request

filename = 'cookie.txt'
cookie = http.cookiejar.MozillaCookieJar(filename)
handler = urllib.request.HTTPCookieProcessor(cookie)
opener = urllib.request.build_opener(handler)
response = opener.open('http://www.baidu.com')
cookie.save(ignore_discard=True, ignore_expires=True)

 

Second way to save cookies:

import http.cookiejar, urllib.request

filename = 'cookie.txt'
cookie = http.cookiejar.LWPCookieJar(filename)
handler = urllib.request.HTTPCookieProcessor(cookie)
opener = urllib.request.build_opener(handler)
response = opener.open('http://www.baidu.com')
cookie.save(ignore_discard=True, ignore_expires=True)

Reading cookies:

import http.cookiejar, urllib.request

cookie = http.cookiejar.LWPCookieJar()
cookie.load('cookie.txt', ignore_discard=True, ignore_expires=True)
handler = urllib.request.HTTPCookieProcessor(cookie)
opener = urllib.request.build_opener(handler)
response = opener.open('http://www.baidu.com')
print(response.read().decode('utf-8'))
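The save/load round trip can be exercised without any HTTP request by constructing a cookie by hand. A sketch with made-up cookie values:

```python
import http.cookiejar
import os
import tempfile
import time

# Build a Cookie manually (all values are made up) instead of receiving
# one from a server, then round-trip it through LWPCookieJar save/load.
cookie = http.cookiejar.Cookie(
    version=0, name='session', value='abc123',
    port=None, port_specified=False,
    domain='example.com', domain_specified=True, domain_initial_dot=False,
    path='/', path_specified=True,
    secure=False, expires=int(time.time()) + 3600,
    discard=False, comment=None, comment_url=None, rest={},
)

jar = http.cookiejar.LWPCookieJar()
jar.set_cookie(cookie)

path = os.path.join(tempfile.mkdtemp(), 'cookie.txt')
jar.save(path, ignore_discard=True, ignore_expires=True)

jar2 = http.cookiejar.LWPCookieJar()
jar2.load(path, ignore_discard=True, ignore_expires=True)
for item in jar2:
    print(item.name + "=" + item.value)  # session=abc123
```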

 

 

6. Exception handling

Example 1:

from urllib import request, error

try:
    response = request.urlopen('http://cuiqingcai.com/index.htm')
except error.URLError as e:
    print(e.reason)  # catches URL errors

Example 2:

from urllib import request, error

try:
    response = request.urlopen('http://cuiqingcai.com/index.htm')
except error.HTTPError as e:
    print(e.reason, e.code, e.headers, sep='\n')  # catch the HTTPError subclass first
except error.URLError as e:
    print(e.reason)
else:
    print('Request Successfully')
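URLError can also be triggered without network access, for example with a file:// URL pointing at a path that does not exist (the path below is made up):

```python
import urllib.request
import urllib.error

# A nonexistent local file raises URLError locally -- no network needed.
try:
    urllib.request.urlopen('file:///no/such/file.txt')
except urllib.error.URLError as e:
    print('URLError:', e.reason)
```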

 

 

7. URL parsing

urlparse — splits a URL into components

urllib.parse.urlparse(urlstring, scheme='', allow_fragments=True)

Example:

from urllib.parse import urlparse

result = urlparse('http://www.baidu.com/index.html;user?id=5#comment')
print(type(result), result)

Result:

<class 'urllib.parse.ParseResult'> ParseResult(scheme='http', netloc='www.baidu.com', path='/index.html', params='user', query='id=5', fragment='comment')

 

Example 2 (URL without a scheme; the scheme argument supplies a default):

from urllib.parse import urlparse

result = urlparse('www.baidu.com/index.html;user?id=5#comment', scheme='https')
print(result)

Result: ParseResult(scheme='https', netloc='', path='www.baidu.com/index.html', params='user', query='id=5', fragment='comment')

 

Example 3 (the URL's own scheme takes precedence over the scheme argument):

from urllib.parse import urlparse

result = urlparse('http://www.baidu.com/index.html;user?id=5#comment', scheme='https')
print(result)

Result: ParseResult(scheme='http', netloc='www.baidu.com', path='/index.html', params='user', query='id=5', fragment='comment')

 

Example 4 (allow_fragments=False folds the fragment into the query):

from urllib.parse import urlparse

result = urlparse('http://www.baidu.com/index.html;user?id=5#comment', allow_fragments=False)
print(result)

Result: ParseResult(scheme='http', netloc='www.baidu.com', path='/index.html', params='user', query='id=5#comment', fragment='')

 

Example 5 (with no query, the fragment folds into the path):

from urllib.parse import urlparse

result = urlparse('http://www.baidu.com/index.html#comment', allow_fragments=False)
print(result)

Result: ParseResult(scheme='http', netloc='www.baidu.com', path='/index.html#comment', params='', query='', fragment='')
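The behaviour shown in examples 2–5 can be summarized in one runnable sketch:

```python
from urllib.parse import urlparse

# scheme= is only a fallback: used when the URL has no scheme of its own.
r1 = urlparse('www.baidu.com/index.html;user?id=5#comment', scheme='https')
print(r1.scheme)    # https
r2 = urlparse('http://www.baidu.com/index.html;user?id=5#comment', scheme='https')
print(r2.scheme)    # http

# allow_fragments=False folds the fragment into the query, or into the
# path when there is no query.
r3 = urlparse('http://www.baidu.com/index.html;user?id=5#comment', allow_fragments=False)
print(r3.query)     # id=5#comment
r4 = urlparse('http://www.baidu.com/index.html#comment', allow_fragments=False)
print(r4.path)      # /index.html#comment
```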

   

 

 

8. Joining URLs

urlunparse — the inverse of urlparse

Example:

from urllib.parse import urlunparse

data = ['http', 'www.baidu.com', 'index.html', 'user', 'a=6', 'comment']
print(urlunparse(data))

Result: http://www.baidu.com/index.html;user?a=6#comment

 

urljoin

from urllib.parse import urljoin

print(urljoin('http://www.baidu.com', 'FAQ.html'))

Result: http://www.baidu.com/FAQ.html

Fields present in the second URL override those in the first.
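A quick sketch of that precedence rule: when the second argument is itself an absolute URL, it replaces the first entirely.

```python
from urllib.parse import urljoin

# Relative second argument: joined onto the base.
print(urljoin('http://www.baidu.com', 'FAQ.html'))
# http://www.baidu.com/FAQ.html

# Absolute second argument: it wins outright.
print(urljoin('http://www.baidu.com/about.html', 'https://cuiqingcai.com/FAQ.html'))
# https://cuiqingcai.com/FAQ.html
```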

 

urlencode

from urllib.parse import urlencode

params = {
    'name': 'gemey',
    'age': 22
}
base_url = 'http://www.baidu.com?'
url = base_url + urlencode(params)
print(url)

Result: http://www.baidu.com?name=gemey&age=22
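urlencode also percent-encodes values that are not URL-safe; a small sketch:

```python
from urllib.parse import urlencode

# Spaces become '+' (quote_plus is the default quoting function).
params = {'name': 'ge mey', 'age': 22}
qs = urlencode(params)
print(qs)  # name=ge+mey&age=22
```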

 

 

Appendix: urllib overview

urllib is a standard library bundled with Python; it needs no installation and can be used directly.
It provides the following features:

  • making web requests
  • retrieving responses
  • proxy and cookie handling
  • exception handling
  • URL parsing

Almost everything a crawler needs can be found in urllib; studying this standard library gives a deeper understanding of the more convenient requests library covered later.

The urllib library

urlopen syntax

urllib.request.urlopen(url, data=None, [timeout, ]*, cafile=None, capath=None, cadefault=False, context=None)
# url: the address to request
# data: extra data, e.g. form data (its presence switches the request to POST)

Usage

# request: GET
import urllib.request
response = urllib.request.urlopen('http://www.baidu.com')
print(response.read().decode('utf-8'))

# request: POST
# HTTP testing service: http://httpbin.org/
import urllib.parse
import urllib.request
data = bytes(urllib.parse.urlencode({'word': 'hello'}), encoding='utf8')
response = urllib.request.urlopen('http://httpbin.org/post', data=data)
print(response.read())

# timeout setting
import urllib.request
response = urllib.request.urlopen('http://httpbin.org/get', timeout=1)
print(response.read())

import socket
import urllib.request
import urllib.error
try:
    response = urllib.request.urlopen('http://httpbin.org/get', timeout=0.1)
except urllib.error.URLError as e:
    if isinstance(e.reason, socket.timeout):
        print('TIME OUT')

Responses

# response type
import urllib.request
response = urllib.request.urlopen('https://www.python.org')
print(type(response))

# status code and response headers
import urllib.request
response = urllib.request.urlopen('https://www.python.org')
print(response.status)
print(response.getheaders())
print(response.getheader('Server'))

Request

Declare a Request object, which can carry headers and other information, then open it with urlopen.

# simple example
import urllib.request
request = urllib.request.Request('https://python.org')
response = urllib.request.urlopen(request)
print(response.read().decode('utf-8'))

# adding headers up front
from urllib import request, parse
url = 'http://httpbin.org/post'
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/57.0.2987.133 Safari/537.36',
    'Host': 'httpbin.org'
}
# build the POST form
dict = {
    'name': 'Germey'
}
data = bytes(parse.urlencode(dict), encoding='utf8')
req = request.Request(url=url, data=data, headers=headers, method='POST')
response = request.urlopen(req)
print(response.read().decode('utf-8'))

# or add the header afterwards
from urllib import request, parse
url = 'http://httpbin.org/post'
dict = {
    'name': 'Germey'
}
data = bytes(parse.urlencode(dict), encoding='utf8')
req = request.Request(url=url, data=data, method='POST')
req.add_header('User-Agent', 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/57.0.2987.133 Safari/537.36')
response = request.urlopen(req)
print(response.read().decode('utf-8'))

Handlers: for more complex pages

See the official documentation.

Proxies

import urllib.request
proxy_handler = urllib.request.ProxyHandler({
    'http': 'http://127.0.0.1:9743',
    'https': 'https://127.0.0.1:9743'
})
opener = urllib.request.build_opener(proxy_handler)
response = opener.open('http://www.baidu.com')
print(response.read())

Cookies: used by the client to record user identity and maintain login state

import http.cookiejar, urllib.request
cookie = http.cookiejar.CookieJar()
handler = urllib.request.HTTPCookieProcessor(cookie)
opener = urllib.request.build_opener(handler)
response = opener.open("http://www.baidu.com")
for item in cookie:
    print(item.name + "=" + item.value)

# save cookies as text
import http.cookiejar, urllib.request
filename = "cookie.txt"
# there are several save formats
## format 1
cookie = http.cookiejar.MozillaCookieJar(filename)
## format 2
cookie = http.cookiejar.LWPCookieJar(filename)
handler = urllib.request.HTTPCookieProcessor(cookie)
opener = urllib.request.build_opener(handler)
response = opener.open("http://www.baidu.com")
cookie.save(ignore_discard=True, ignore_expires=True)

# load with the matching jar type
import http.cookiejar, urllib.request
cookie = http.cookiejar.LWPCookieJar()
cookie.load('cookie.txt', ignore_discard=True, ignore_expires=True)
handler = urllib.request.HTTPCookieProcessor(cookie)
opener = urllib.request.build_opener(handler)
response = opener.open("http://www.baidu.com")

Exception handling

Catch exceptions to keep the program running stably.

# visit a page that does not exist
from urllib import request, error
try:
    response = request.urlopen('http://cuiqingcai.com/index.htm')
except error.URLError as e:
    print(e.reason)

# catch the subclass error first
from urllib import request, error
try:
    response = request.urlopen('http://cuiqingcai.com/index.htm')
except error.HTTPError as e:
    print(e.reason, e.code, e.headers, sep='\n')
except error.URLError as e:
    print(e.reason)
else:
    print("Request Successfully")

# inspect the underlying reason
import socket
import urllib.request
import urllib.error
try:
    response = urllib.request.urlopen('http://httpbin.org/get', timeout=0.1)
except urllib.error.URLError as e:
    if isinstance(e.reason, socket.timeout):
        print('TIME OUT')

URL parsing

Mainly a utility module, which can supply URLs for a crawler.

urlparse: splits a URL

urllib.parse.urlparse(urlstring, scheme='', allow_fragments=True)
# scheme: protocol type to assume when the URL has none
# allow_fragments: whether to keep the '#' fragment separate

Example:

from urllib.parse import urlparse
result = urlparse("https://edu.hellobi.com/course/157/play/lesson/2580")
result
## ParseResult(scheme='https', netloc='edu.hellobi.com', path='/course/157/play/lesson/2580', params='', query='', fragment='')

urlunparse: joins components into a URL; the inverse of urlparse

from urllib.parse import urlunparse
data = ['http', 'www.baidu.com', 'index.html', 'user', 'a=6', 'comment']
print(urlunparse(data))

urljoin: joins two URLs

urlencode: converts a dict into a GET query string

from urllib.parse import urlencode
params = {
    'name': 'germey',
    'age': 22
}
base_url = 'http://www.baidu.com?'
url = base_url + urlencode(params)
print(url)

Finally, there is urllib.robotparser, which parses which parts of a site crawlers are allowed to fetch.

Author: hoptop. Link: https://www.jianshu.com/p/cfbdacbeac6e. Source: Jianshu. Copyright belongs to the author; for commercial reproduction please contact the author for authorization, and for non-commercial reproduction please credit the source.