Introduction:

With the nationwide battle centered on Wuhan essentially won, the epidemic abroad has begun to escalate. Many readers want to follow the global numbers, so this time we will scrape the situation-report PDFs from the WHO website, along with the data for several major countries from January 22 up to the present.
We will scrape two main pages:

https://www.who.int/emergencies/diseases/novel-coronavirus-2019/situation-reports
https://experience.arcgis.com/experience/685d0ace521648f8a5beeeee1b9125cd
Getting the code and related resources:

1. Follow the "python趣味爱好者" official account and reply "code4" to get the source code.
2. Join the QQ group 996113038 and download the source code and related materials from the group files.
Development environment:

Python 3.6.4

Libraries used (requests is also needed by the code below; json and time are part of the standard library):

requests
BeautifulSoup
pandas
json
time
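The third-party packages can be installed with pip; lxml is included because the code below creates BeautifulSoup with the 'lxml' parser (this is a typical setup command, not part of the original post):

```shell
pip install requests beautifulsoup4 pandas lxml
```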
Demo:

Directory after all downloads complete (screenshot)

Sample PDF (screenshot):

Sample CSV data (screenshot):
How it works:

First we open the situation-reports page. In the middle of the page, every report is a PDF hyperlink, and by inspecting the page source we can spot the pattern.

We then use BeautifulSoup to extract the contents of the <a> tags:
```python
datas = soup.select('div#PageContent_C006_Col01 > div.sf-content-block.content-block > div a')
```
Next we loop over each <a> tag and download the corresponding PDF file:
```python
for data in datas:
    downloadUrl = 'https://www.who.int' + data['href']  # download URL
    try:
        r = requests.get(downloadUrl)
        pdf = r.content  # binary body of the response
        if data.get_text():
            with open(data.get_text() + ".pdf", 'wb') as f:  # write in binary mode
                f.write(pdf)
            print(data.get_text() + ".pdf downloaded")
    except requests.exceptions.ConnectionError:
        # if requests.get() raised, r was never assigned, so report the URL instead
        print(downloadUrl + ": connection refused")
```
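The link text is used directly as the file name, and some report titles contain characters such as "/" or ":" that are not legal in file names on every system. A minimal sketch of a sanitizing helper (the name `safe_filename` is our own, not part of the original code):

```python
import re

def safe_filename(name: str) -> str:
    """Replace characters that are illegal in Windows/Unix file names with '_'."""
    return re.sub(r'[\\/:*?"<>|]', '_', name).strip()
```

It could then be used as `open(safe_filename(data.get_text()) + ".pdf", 'wb')`.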
Next we scrape each country's recent epidemic data from:
https://experience.arcgis.com/experience/685d0ace521648f8a5beeeee1b9125cd
On the right-hand side of this page we can see the country names, so we open the browser's DevTools (F12) and capture the network requests.

Taking Italy as an example, we can see that this response is exactly the data we want. Comparing a few more countries shows that every country's data file follows the same URL pattern:
```python
url = 'https://services.arcgis.com/5T5nSi527N4F7luB/arcgis/rest/services/Historic_adm0_v3/FeatureServer/0/query?f=json&where=ADM0_NAME%3D%27' + name + '%27&returnGeometry=false&spatialRel=esriSpatialRelIntersects&outFields=OBJECTID%2CNewCase%2CDateOfDataEntry&orderByFields=DateOfDataEntry%20asc&resultOffset=0&resultRecordCount=2000&cacheHint=true'
```
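The query string above hand-encodes every parameter; the same URL can be built more readably with the standard library's `urllib.parse.urlencode` (the helper name `build_query_url` is our own, and the parameter names are taken from the URL above):

```python
from urllib.parse import urlencode

def build_query_url(name: str) -> str:
    """Build the ArcGIS FeatureServer query URL for one country."""
    base = ('https://services.arcgis.com/5T5nSi527N4F7luB/arcgis/rest/'
            'services/Historic_adm0_v3/FeatureServer/0/query')
    params = {
        'f': 'json',
        'where': "ADM0_NAME='{}'".format(name),
        'returnGeometry': 'false',
        'spatialRel': 'esriSpatialRelIntersects',
        'outFields': 'OBJECTID,NewCase,DateOfDataEntry',
        'orderByFields': 'DateOfDataEntry asc',
        'resultOffset': 0,
        'resultRecordCount': 2000,
        'cacheHint': 'true',
    }
    return base + '?' + urlencode(params)  # urlencode percent-escapes each value
```

Note that `urlencode` escapes spaces as `+` rather than `%20`; ArcGIS accepts both forms.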
So we loop over the countries, fetch each data file, and then use pandas to convert the results into a CSV file:
```python
def getDatas():
    # openUrl, conserve and timeStamp are helper functions defined elsewhere in
    # the source; res is a global pandas DataFrame that conserve fills in.
    global res
    for name in ['China', 'Italy', 'Spain', 'France', 'Germany', 'Switzerland',
                 'Netherlands', 'Norway', 'Belgium', 'Sweden', 'Australia',
                 'Brazil', 'Egypt']:
        url = ('https://services.arcgis.com/5T5nSi527N4F7luB/arcgis/rest/services/'
               'Historic_adm0_v3/FeatureServer/0/query?f=json&where=ADM0_NAME%3D%27'
               + name + '%27&returnGeometry=false&spatialRel=esriSpatialRelIntersects'
               '&outFields=OBJECTID%2CNewCase%2CDateOfDataEntry'
               '&orderByFields=DateOfDataEntry%20asc&resultOffset=0'
               '&resultRecordCount=2000&cacheHint=true')
        html = json.loads(openUrl(url))
        conserve(html, name)
        print(name + " epidemic data downloaded")

    # The United States is handled separately: its record uses a different
    # country name and field (cum_conf instead of NewCase)
    name = 'America'
    url = ('https://services.arcgis.com/5T5nSi527N4F7luB/arcgis/rest/services/'
           'Historic_adm0_v3/FeatureServer/0/query?f=json'
           '&where=ADM0_NAME%3D%27United%20States%20of%20America%27'
           '&returnGeometry=false&spatialRel=esriSpatialRelIntersects'
           '&outFields=OBJECTID%2Ccum_conf%2CDateOfDataEntry'
           '&orderByFields=DateOfDataEntry%20asc&resultOffset=0'
           '&resultRecordCount=2000&cacheHint=true')
    html = json.loads(openUrl(url))
    conserve(html, name)
    print(name + " epidemic data downloaded")

    res['Datetime'] = pd.date_range(start='20200122',
                                    end=timeStamp(res.index.get_level_values(0).values[-1]))
    res.to_csv('Datas.csv', encoding='utf_8_sig')
```
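The `conserve` helper is not shown in the post. Assuming the standard ArcGIS FeatureServer response shape (`{"features": [{"attributes": {...}}, ...]}`, with `DateOfDataEntry` as an epoch timestamp in milliseconds), the flattening step could look like this sketch (the function name and column handling are our own):

```python
import pandas as pd

def features_to_frame(payload: dict, country: str) -> pd.DataFrame:
    """Flatten an ArcGIS FeatureServer JSON response into a two-column DataFrame."""
    rows = [f['attributes'] for f in payload.get('features', [])]
    df = pd.DataFrame(rows)
    # DateOfDataEntry is an epoch timestamp in milliseconds
    df['Date'] = pd.to_datetime(df['DateOfDataEntry'], unit='ms').dt.date
    df = df.rename(columns={'NewCase': country})  # one column per country
    return df[['Date', country]]
```

Per-country frames built this way can then be merged on `Date` before writing the final CSV.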
Selected code:

The function that downloads the PDFs:
```python
def getPdfs():
    url = "https://www.who.int/emergencies/diseases/novel-coronavirus-2019/situation-reports"
    strhtml = requests.get(url)
    soup = BeautifulSoup(strhtml.text, 'lxml')
    datas = soup.select('div#PageContent_C006_Col01 > div.sf-content-block.content-block > div a')
    for data in datas:
        downloadUrl = 'https://www.who.int' + data['href']  # download URL
        try:
            r = requests.get(downloadUrl)
            pdf = r.content  # binary body of the response
            if data.get_text():
                with open(data.get_text() + ".pdf", 'wb') as f:  # write in binary mode
                    f.write(pdf)
                print(data.get_text() + ".pdf downloaded")
        except requests.exceptions.ConnectionError:
            # if requests.get() raised, r was never assigned, so report the URL instead
            print(downloadUrl + ": connection refused")
```
Thanks for reading! Generous readers are welcome to leave the author a tip.

Scan the QR code below to follow the official account.