Scraping AcFun Articles with a Python Crawler (1)

Date: 2020-07-13 | Tags: python, crawler

This article uses BeautifulSoup.
Documentation URL: http://beautifulsoup.readthedocs.io/zh_CN/latest/
It is the Chinese edition, so anything unclear can be looked up there.
Target page: http://www.acfun.cn/a/ac4875397

First, press F12 to inspect the page source and find the key tags that hold the content.


The requests library: used to send all kinds of HTTP requests.
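To make the difference between the two most common request types concrete, here is a minimal sketch that builds a GET and a POST request without sending them. The host example.com and the field name "q" are placeholders chosen for illustration and are not part of the target site.

```python
import requests

# example.com and the "q" field are placeholders, not taken from the article.
# prepare() builds the request object without sending anything over the network.
get_req = requests.Request(
    "GET", "http://example.com/search", params={"q": "python"}
).prepare()
post_req = requests.Request(
    "POST", "http://example.com/search", data={"q": "python"}
).prepare()

print(get_req.url)                       # GET: parameters travel in the URL
print(post_req.url)                      # POST: the URL stays unchanged
print(post_req.body)                     # POST: form data travels in the body
print(post_req.headers["Content-Type"])  # how the body is encoded
```

In everyday use, requests.get(...) and requests.post(...) build and send such a request in a single call.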

Initial code:

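import requests
from bs4 import BeautifulSoup

res = requests.get('http://www.acfun.cn/a/ac4875397')  # request the page
# This is a GET request; there are also POST requests, used to submit form data
AcObj = BeautifulSoup(res.content, 'lxml')  # convert the page source into a BeautifulSoup object
title = AcObj.find_all('div', 'caption')  # extract the article title (div elements with class "caption")
a_list = AcObj.find_all('div', 'article-content')  # extract the article body
print(title)
print(a_list)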

Scraping result

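As you can see, the content contains a lot of noise.
Note that the scraped content is stored in a list.
The list can be converted to a string and parsed into a BeautifulSoup object again; its text attribute then strips the leftover tags.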

title_Obj=BeautifulSoup(str(title), 'lxml')
bsObj=BeautifulSoup(str(a_list), 'lxml')

print(title_Obj.text)
print(bsObj.text)
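As an alternative to parsing the list a second time, each item returned by find_all is a Tag, so get_text() can be called on it directly. The following is only a sketch: the inline HTML is a made-up stand-in for the downloaded page (the real AcFun markup may differ), extract_text is a hypothetical helper name, and the built-in html.parser is used so the example does not depend on lxml.

```python
from bs4 import BeautifulSoup

# Hypothetical stand-in for the downloaded page; the real markup may differ.
sample_html = """
<div class="caption">Sample title</div>
<div class="article-content"><p>First paragraph.</p><p>Second paragraph.</p></div>
"""

soup = BeautifulSoup(sample_html, "html.parser")


def extract_text(soup, css_class):
    """Return the text of every <div> with the given class, joined by newlines."""
    blocks = soup.find_all("div", css_class)
    # separator puts each text fragment on its own line; strip trims whitespace
    return "\n".join(block.get_text(separator="\n", strip=True) for block in blocks)


title_text = extract_text(soup, "caption")
body_text = extract_text(soup, "article-content")
print(title_text)
print(body_text)
```

Working on the tags directly avoids the square brackets and commas that str(list) puts into the text.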

Finally, write the scraped content to a txt file.
First create a folder named happy, then create the txt file inside it.

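import os

path = 'e:\\happy\\'
folder = os.path.exists(path)

if not folder:  # check whether the folder exists; create it if it does not
    os.makedirs(path)  # makedirs also creates any missing parent directories
    print('creating...')
    print('OK...')
else:
    print('Folder already exists')

name = '123'
full_path = path + name + '.txt'
file = open(full_path, 'w')  # opening in 'w' mode creates the (empty) file
file.close()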
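The check-then-create step can also be written more compactly with the exist_ok parameter of os.makedirs. This is a minimal sketch under a few assumptions: ensure_folder is a hypothetical helper name, and a temporary directory stands in for e:\happy so the example runs on any system.

```python
import os
import tempfile


def ensure_folder(path):
    """Create path (and any missing parents); do nothing if it already exists."""
    os.makedirs(path, exist_ok=True)
    return path


# A temporary directory stands in for e:\happy so the sketch runs anywhere.
base = tempfile.mkdtemp()
folder = ensure_folder(os.path.join(base, "happy"))
ensure_folder(folder)  # calling it a second time is harmless

# os.path.join picks the right separator; "with" closes the file automatically.
full_path = os.path.join(folder, "123.txt")
with open(full_path, "w", encoding="utf-8") as f:
    f.write("placeholder text")
```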

Final code:

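import requests
from bs4 import BeautifulSoup
import os

res = requests.get('http://www.acfun.cn/a/ac4875397')  # request the page
# This is a GET request; there are also POST requests, used to submit form data
AcObj = BeautifulSoup(res.content, 'lxml')  # convert the page source into a BeautifulSoup object
title = AcObj.find_all('div', 'caption')  # extract the article title
a_list = AcObj.find_all('div', 'article-content')  # extract the article body

# Re-parse the result lists so that .text strips the leftover tags, then print
title_Obj = BeautifulSoup(str(title), 'lxml')
bsObj = BeautifulSoup(str(a_list), 'lxml')

print(title_Obj.text)
print(bsObj.text)

path = 'e:\\happy\\'
folder = os.path.exists(path)

if not folder:  # check whether the folder exists; create it if it does not
    os.makedirs(path)  # makedirs also creates any missing parent directories
    print('creating...')
    print('OK...')
else:
    print('Folder already exists')

# str(title) is the text of a list, so drop the surrounding brackets and whitespace
file_name = title_Obj.text.strip()[1:-1].strip()
full_path = path + file_name + '.txt'
# utf-8 avoids errors when the system default encoding cannot represent the text
file = open(full_path, 'w', encoding='utf-8')
file.write(bsObj.text)
file.close()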
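Because the file name comes from the article title, a title containing characters such as ? or : would make open() fail on Windows. Below is a minimal sketch of one way to clean a title first; safe_filename and its fallback value are hypothetical names introduced for illustration, not part of the original script.

```python
import re


def safe_filename(title, fallback="untitled"):
    """Turn an arbitrary title into a string that is safe to use as a file name."""
    # Collapse newlines, tabs and repeated spaces into single spaces.
    cleaned = re.sub(r"\s+", " ", title)
    # Replace the characters Windows forbids in file names, plus control characters.
    cleaned = re.sub(r'[\\/:*?"<>|\x00-\x1f]', "_", cleaned)
    # Windows also rejects names that end in a space or a dot.
    cleaned = cleaned.strip(" .")
    return cleaned or fallback


print(safe_filename(' What? A "title": part 1/2 \n'))
```

With such a helper, the path could be built as path + safe_filename(title_text) + '.txt'.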

This only scrapes a single article, but our goal is

all of the articles!!!!