Today, while scraping the source of a web page, I found that the output came back garbled.
Python code:
import requests

req = requests.get("http://www.ccit.js.cn")
req_text = req.text
print(req_text)
Partial screenshot:
(1) First, let's check what Python 3's default encoding is:
import sys

print('Current system encoding:', sys.getdefaultencoding())
name1 = "惊鸿一面"
name2 = name1.encode("utf-8")  # str is converted to bytes via encode()
print("type of name1:", type(name1))
print("type of name2:", type(name2))
print(name2)
Output:
(2) Key points:
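(3) Summary of the cause:
Python 3's default encoding is UTF-8, but that is not what decides how the page is decoded. requests decodes req.text using the charset declared in the Content-Type response header; when a text response declares none, it falls back to ISO-8859-1 (latin1). The target site is encoded in GB2312, so its bytes are decoded with the wrong codec and the result is garbled. The latin1 fix at the end of this post working is consistent with this.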
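The mismatch described above can be reproduced without any network access. The following is a minimal sketch, not the exact behavior of the site: the sample string `page_text` is made up for illustration, and it assumes the page bytes are GB2312 while the client decodes them as ISO-8859-1 or UTF-8.

```python
# Hypothetical sample standing in for the page source (not taken from the real site).
page_text = "<title>惊鸿一面</title>"

# What the server sends: the text encoded as GB2312 bytes.
raw = page_text.encode("gb2312")

# ISO-8859-1 accepts every byte value, so the wrong decode never fails;
# it silently yields one garbled character per byte.
as_latin1 = raw.decode("iso-8859-1")

# The bytes are not valid UTF-8 either; errors="replace" inserts U+FFFD markers.
as_utf8 = raw.decode("utf-8", errors="replace")

# Only the codec that matches the bytes gives the original text back.
as_gb2312 = raw.decode("gb2312")

print(as_latin1)
print(as_utf8)
print(as_gb2312)
```

Only the last line printed is readable, which mirrors the situation with the scraped page.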
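Solution:
Decode the page with an encoding that matches it, for example GBK or GB18030 (both are supersets of GB2312), instead of relying on the default.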
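One way to apply this is to decode the raw response bytes yourself instead of trusting the default guess. The sketch below is an illustration under assumptions, not the only possible fix: `decode_body` and `fetch_text` are hypothetical helper names, and GB18030 is chosen on the assumption that the page uses GB2312 or one of its supersets.

```python
def decode_body(raw: bytes, encoding: str = "gb18030") -> str:
    """Decode raw response bytes with an explicitly chosen codec.

    errors="replace" keeps the function from raising on stray bytes;
    undecodable bytes become U+FFFD instead.
    """
    return raw.decode(encoding, errors="replace")


def fetch_text(url: str, encoding: str = "gb18030") -> str:
    """Download a page and decode it with the given codec (needs network access)."""
    import requests  # imported here so decode_body works without requests installed

    resp = requests.get(url, timeout=10)
    # resp.content is the undecoded body; resp.text would apply requests' own guess.
    return decode_body(resp.content, encoding)


if __name__ == "__main__":
    # Offline check with hypothetical sample bytes standing in for a GB2312 page.
    sample = "惊鸿一面".encode("gb2312")
    print(decode_body(sample))
```

With requests itself, setting `resp.encoding = "gb18030"` before reading `resp.text` has the same effect, and `resp.apparent_encoding` offers a detected guess when the page encoding is unknown.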
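(4) Note:
UnicodeEncodeError: 'gb2312' codec can't encode character '\xb3' in position 293: illegal multibyte sequence
This error means that the text being encoded contains characters outside the GB2312 range, which therefore cannot be encoded. Note that '\xb3' is not a Chinese character: it is a Latin-1 character, which suggests that the text had already been decoded with the wrong codec.
(5) Next, run an encode test with several common encodings: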
import requests

req = requests.get("http://www.ccit.js.cn")
# Try each codec in turn; catch the error so one failure does not stop the test.
for codec in ("utf-8", "GB2312", "GB18030"):
    try:
        print(codec, req.text.encode(codec))
    except UnicodeEncodeError as exc:
        print(codec, "failed:", exc)
# Observed results:
# utf-8:   encoded to bytes successfully
# GB2312:  UnicodeEncodeError: 'gb2312' codec can't encode character '\xb3' in position 293: illegal multibyte sequence
# GB18030: encoded to bytes successfully
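The GB2312 failure in this test can be reproduced offline with the single character named in the error message. This is a small sketch; the helper name `can_encode` is made up for illustration.

```python
def can_encode(text: str, codec: str) -> bool:
    """Return True if every character of text is representable in codec."""
    try:
        text.encode(codec)
        return True
    except UnicodeEncodeError:
        return False


# '\xb3' is U+00B3 (SUPERSCRIPT THREE), the character from the error message.
ch = "\xb3"

for codec in ("utf-8", "gb2312", "gb18030", "latin1"):
    print(codec, can_encode(ch, codec))
```

UTF-8 and GB18030 cover all of Unicode, which is why those two encodes succeeded, while GB2312 has no slot for this character.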
(6) Following on from the above, I also ran decode tests, sticking to the encode/decode rules, but the output was still garbled!
import requests

req = requests.get("http://www.ccit.js.cn")
pairs = [
    ("utf-8", "utf-8"), ("utf-8", "GB2312"), ("utf-8", "GB18030"),
    ("GB18030", "utf-8"), ("GB18030", "GB2312"), ("GB18030", "GB18030"),
]
# Encode with the first codec, decode with the second; keep going after a failure.
for enc, dec in pairs:
    try:
        print(enc, "->", dec, req.text.encode(enc).decode(dec))
    except UnicodeDecodeError as exc:
        print(enc, "->", dec, "failed:", exc)
# Observed results:
# utf-8   -> utf-8:   succeeds, but still garbled
# utf-8   -> GB2312:  UnicodeDecodeError: 'gb2312' codec can't decode byte 0xc3 in position 297: illegal multibyte sequence
# utf-8   -> GB18030: succeeds, but still garbled
# GB18030 -> utf-8:   UnicodeDecodeError: 'utf-8' codec can't decode byte 0x81 in position 293: invalid start byte
# GB18030 -> GB2312:  UnicodeDecodeError: 'gb2312' codec can't decode byte 0x81 in position 293: illegal multibyte sequence
# GB18030 -> GB18030: succeeds, but still garbled
So how can this actually be solved? See the following code:
import requests

req = requests.get("http://www.ccit.js.cn")
req_text = req.text.encode("latin1").decode("GBK")
print(req_text)
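Here latin1 is used for the encode step. latin1 maps every byte value to exactly one character, so encoding the garbled text with it recovers the original bytes of the page, which can then be decoded correctly as GBK.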
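Why the latin1 step works can be checked offline. The sketch below assumes, as the fix above implies, that the garbled text came from GBK-compatible bytes decoded as latin1; the sample string is made up for illustration.

```python
# latin1 maps byte values 0-255 one-to-one onto code points U+0000-U+00FF,
# so decoding and re-encoding with it is lossless for any byte string.
all_bytes = bytes(range(256))
assert all_bytes.decode("latin1").encode("latin1") == all_bytes

# Hypothetical sample standing in for the page content.
original = "惊鸿一面"
raw = original.encode("gbk")      # bytes as sent by the server
garbled = raw.decode("latin1")    # the kind of mojibake seen in the output

# Encoding and decoding with the same codec is a no-op, so the text stays
# garbled. This matches the "succeeds, but still garbled" rows in test (6).
still_garbled = garbled.encode("utf-8").decode("utf-8")

# encode("latin1") restores the raw bytes; decode("gbk") then reads them correctly.
recovered = garbled.encode("latin1").decode("gbk")

print(still_garbled == garbled)
print(recovered == original)
```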