Scraping Python video data from Bilibili's web site with requests + bs4

Goal: get comfortable with the usual routine for scraping data with bs4

Fields to scrape:

Video thumbnail

Play count

Upload time

Author

import os

import requests
from bs4 import BeautifulSoup

# Browser-like User-Agent shared by both requests.
HEADERS = {
    'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_4) '
                  'AppleWebKit/537.36 (KHTML, like Gecko) '
                  'Chrome/52.0.2743.116 Safari/537.36'
}


def get_html():
    url = "https://www.bilibili.com/"
    html = requests.get(url, headers=HEADERS, timeout=10).text
    soup = BeautifulSoup(html, 'lxml')
    # One element per video card on the home page.
    groom_module = soup.find_all(attrs={'class': 'groom-module home-card'})
    os.makedirs("pictures", exist_ok=True)
    for i in groom_module:
        upload_time = get_time(i.a['href'])
        image = i.a.img['src']
        # The src is treated as scheme-less, so "https:" is prepended.
        pic = requests.get("https:" + image, timeout=10)
        title = i.find(attrs={'class': 'title'}).text
        # Save the thumbnail, named after the last 20 characters of its URL.
        with open(os.path.join("pictures", image[-20:]), 'wb') as fp:
            fp.write(pic.content)
        author = i.find(attrs={'class': 'author'}).text
        play = i.find(attrs={'class': 'play'}).text
        print(upload_time, image, title, author, play)


def get_time(url):
    # url is the site-relative link to a video page;
    # the upload time is read from that page's <time> tag.
    url = "https://www.bilibili.com" + url
    html = requests.get(url, headers=HEADERS, timeout=10).text
    soup = BeautifulSoup(html, 'lxml')
    return soup.find("time").text


get_html()
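
The per-card lookups in get_html() can be tried without touching the network. The sketch below runs the same find_all / find calls against a small hand-written HTML string. The markup itself is hypothetical: only the class names (groom-module home-card, title, author, play) come from the scraper above, and parse_cards is an illustrative helper name, not part of the original script. It uses the built-in 'html.parser' so that lxml is not required.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: only the class names are taken from the scraper.
SAMPLE_HTML = """
<div class="groom-module home-card">
  <a href="/video/av1">
    <img src="//example.com/cover/abc.jpg">
    <p class="title">Python basics</p>
    <p class="author">up: someone</p>
    <p class="play">1234</p>
  </a>
</div>
"""


def parse_cards(html):
    """Return one dict per card, using the same lookups as get_html()."""
    soup = BeautifulSoup(html, 'html.parser')
    cards = []
    for card in soup.find_all(attrs={'class': 'groom-module home-card'}):
        cards.append({
            'href': card.a['href'],
            'image': 'https:' + card.a.img['src'],
            'title': card.find(attrs={'class': 'title'}).text,
            'author': card.find(attrs={'class': 'author'}).text,
            'play': card.find(attrs={'class': 'play'}).text,
        })
    return cards


cards = parse_cards(SAMPLE_HTML)
print(cards)
```

Keeping the parsing in a function that takes an HTML string makes it easy to check the selectors first and add the downloading afterwards.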

I took this opportunity to review the basics of Beautiful Soup:

Get a tag (i.e. a node): soup.title
Get an attribute of a node: soup.img['src']
Get a node's name: soup.title.name
Standard selectors: find_all( name , attrs , recursive , text , **kwargs )
    name:  soup.find_all('ul')
    attrs: soup.find_all(attrs={'id': 'list-1'})
           soup.find_all(attrs={'name': 'elements'})
           soup.find_all(id='list-1')
           soup.find_all(class_='element')
    text:  soup.find_all(text='Foo')

    find( name , attrs , recursive , text , **kwargs ): same arguments, but returns only the first match
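
As a quick check of the calls listed above, here is a minimal sketch run against a made-up HTML snippet (the snippet and the variable names are assumptions for illustration). It uses string= instead of text=, which is the spelling newer Beautiful Soup releases prefer for the same argument.

```python
from bs4 import BeautifulSoup

# Made-up document, just large enough to exercise each call.
html = """
<html><head><title>Demo</title></head>
<body>
<ul id="list-1" name="elements">
  <li class="element">Foo</li>
  <li class="element">Bar</li>
</ul>
<ul id="list-2">
  <li class="element">Foo</li>
</ul>
<img src="/cover.jpg">
</body></html>
"""
soup = BeautifulSoup(html, 'html.parser')

tag_name = soup.title.name      # name of the first <title> tag
img_src = soup.img['src']       # attribute of the first <img> tag

all_lists = soup.find_all('ul')                       # by tag name
by_id = soup.find_all(attrs={'id': 'list-1'})         # by attribute dict
by_name = soup.find_all(attrs={'name': 'elements'})   # 'name' must go through attrs
by_class = soup.find_all(class_='element')            # class_ because class is a keyword
foo_strings = soup.find_all(string='Foo')             # matches text nodes, not tags
first_li = soup.find('li')                            # first match only

print(tag_name, img_src, len(all_lists), len(by_class), first_li.text)
```

Note that the name attribute cannot be passed as a keyword, because name is already the first parameter of find_all; that is why the attrs dictionary form is used for it.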

CSS selectors:
    soup.select('.panel .panel-heading')    a leading . selects by class
    soup.select('ul li')                    tag names
    soup.select('#list-2 .element')         id first, then class
    soup.select('ul')[0]                    select() returns a list, so it can be indexed

for ul in soup.select('ul'):
    print(ul.select('li'))
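
The selectors above can be tried on a small made-up document. This is only a sketch: the HTML and the variable names are assumptions, chosen so that every selector has something to match.

```python
from bs4 import BeautifulSoup

# Made-up document containing the classes and ids used by the selectors.
html = """
<div class="panel">
  <div class="panel-heading"><h4>Hello</h4></div>
  <div class="panel-body">
    <ul class="list" id="list-1">
      <li class="element">Foo</li>
      <li class="element">Bar</li>
    </ul>
    <ul class="list" id="list-2">
      <li class="element">Baz</li>
    </ul>
  </div>
</div>
"""
soup = BeautifulSoup(html, 'html.parser')

headings = soup.select('.panel .panel-heading')   # descendant selected by class
items = soup.select('ul li')                      # every <li> inside a <ul>
list2_items = soup.select('#list-2 .element')     # id first, then class
first_ul = soup.select('ul')[0]                   # select() returns a list

# select() also works on a tag, searching only inside that tag.
items_per_list = [len(ul.select('li')) for ul in soup.select('ul')]

print([li.get_text() for li in items], items_per_list)
```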

Get attributes:
for ul in soup.select('ul'):
    print(ul['id'])
    print(ul.attrs['id'])
Get text content:
for li in soup.select('li'):
    print(li.get_text())
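
One detail worth knowing about attribute access: ul['id'] and ul.attrs['id'] both raise KeyError when the attribute is absent, while ul.get('id') returns None. The sketch below shows this on a made-up snippet (the HTML and variable names are assumptions), together with get_text(strip=True) for trimming surrounding whitespace.

```python
from bs4 import BeautifulSoup

# Made-up snippet: the second <ul> deliberately has no id.
html = """
<ul id="list-1"><li> Foo </li><li>Bar</li></ul>
<ul><li>Baz</li></ul>
"""
soup = BeautifulSoup(html, 'html.parser')
first_ul, second_ul = soup.select('ul')

# Both forms read the same attribute dictionary.
same_value = first_ul['id'] == first_ul.attrs['id']

# .get() is the safe form when an attribute may be missing.
ids = [ul.get('id') for ul in soup.select('ul')]

# Indexing a missing attribute raises KeyError.
try:
    second_ul['id']
    missing_raises = False
except KeyError:
    missing_raises = True

# strip=True trims whitespace around each piece of text.
texts = [li.get_text(strip=True) for li in soup.select('li')]

print(ids, missing_raises, texts)
```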

The script is fairly crude and does very little, but since there was no real requirement behind it, this is enough.
Screenshot of the crawler's output