04 Beautiful Soup

Date: 2019-11-17  Tags: 04 beautiful soup

Beautiful Soup

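Introduction

In short, Beautiful Soup is a Python library whose main job is extracting data from web pages. The official description reads:

'''
Beautiful Soup provides a few simple, Pythonic functions for navigating, searching, and modifying the parse tree.
It is a toolkit that parses a document and hands you the data you want to extract; because it is simple, a complete application takes very little code.
'''

Beautiful Soup is a Python library for pulling data out of HTML and XML files. It works with your preferred parser to provide idiomatic ways of navigating, searching, and modifying the document. Beautiful Soup can save you hours or even days of work. You may be looking for the Beautiful Soup 3 documentation; Beautiful Soup 3 is no longer being developed, and the official site recommends Beautiful Soup 4 for current projects.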

Installation

pip install beautifulsoup4

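Parsers

Beautiful Soup supports the HTML parser in Python's standard library as well as several third-party parsers. If no third-party parser is installed, Beautiful Soup falls back to Python's built-in parser (html.parser).

The lxml parser is more powerful and faster, so installing it is recommended.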

pip install lxml

Another option is html5lib, a parser written in pure Python that parses pages the same way a web browser does. It can be installed as follows:

pip install html5lib

Parser comparison:

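Using BeautifulSoup

Importing BS

1. Import: from bs4 import BeautifulSoup
2. An HTML document can be converted into a BeautifulSoup object; the object's methods and attributes are then used to look up specific nodes
    2.1 Local file: soup = BeautifulSoup(open('local file'), 'lxml')

    2.2 Network data: soup = BeautifulSoup('markup as str or bytes', 'lxml')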
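
The two ways of building a soup object can be sketched as follows. This is a minimal illustration, not the only way to do it: it assumes the built-in html.parser so that it runs without lxml, and the HTML snippet, the temporary file, and the variable names are made up for the example.

```python
import os
import tempfile

from bs4 import BeautifulSoup

# Hypothetical markup standing in for a downloaded page.
html = "<html><body><p class='title'>Hello</p></body></html>"

# 2.2: markup already in memory, as str or as bytes.
soup_from_str = BeautifulSoup(html, "html.parser")
soup_from_bytes = BeautifulSoup(html.encode("utf-8"), "html.parser")

# 2.1: a local file. A temporary file stands in for a saved page.
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "page.html")
    with open(path, "w", encoding="utf-8") as f:
        f.write(html)
    # The with-block closes the file handle once parsing is done.
    with open(path, encoding="utf-8") as f:
        soup_from_file = BeautifulSoup(f, "html.parser")

print(soup_from_str.p.text)    # Hello
print(soup_from_bytes.p.text)  # Hello
print(soup_from_file.p.text)   # Hello
```

Swapping "html.parser" for "lxml" should work the same way once lxml is installed.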

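Attributes

<1> Look up by tag name
        - soup.a   finds only the first matching tag and returns it

<2> Get attributes
        - soup.a.attrs  returns a dict of all attributes of a and their values
        - soup.a.attrs['href']   gets the href attribute
        - soup.a['href']   shorthand for the line above

<3> Get content
        - soup.a.string
        - soup.a.text
        - soup.a.get_text()    same as text
       [Note] If the tag has more than one child (for example text mixed with nested tags), string returns None, while the other two still return the text content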
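
The attribute access described above can be tried on a small snippet. This is a sketch under stated assumptions: the HTML and the names first and second are invented for illustration, and html.parser is used so no extra install is needed.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: the second link mixes text with a nested tag.
html = (
    '<div>'
    '<a href="http://example.com/a" title="first">one</a>'
    '<a href="http://example.com/b">two <b>bold</b></a>'
    '</div>'
)
soup = BeautifulSoup(html, "html.parser")

first = soup.a                      # first <a> in the document
print(first.name)                   # a
print(first.attrs)                  # {'href': 'http://example.com/a', 'title': 'first'}
print(first["href"])                # http://example.com/a
print(first.string)                 # one

second = soup.find_all("a")[1]
print(second.string)                # None, because the tag has two children
print(second.text)                  # two bold
print(second.get_text())            # two bold
```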

<4> find: returns the first matching tag
        - soup.find('a')  the first matching a tag
        - soup.find('a', title="xxx")
        - soup.find('a', alt="xxx")
        - soup.find('a', class_="xxx")
        - soup.find('a', id="xxx")

<5> find_all: returns all matching tags
        - soup.find_all('a')
        - soup.find_all(['a','b']) finds all a and b tags
        - soup.find_all('a', limit=2)  only the first two
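
A short sketch of find and find_all side by side. The navigation list below is made-up markup, and html.parser is assumed in place of lxml so the example runs with only beautifulsoup4 installed.

```python
from bs4 import BeautifulSoup

# Hypothetical markup for illustration.
html = """
<ul>
  <li><a id="first" class="nav" href="/home" title="Home">Home</a></li>
  <li><a class="nav" href="/docs">Docs</a></li>
  <li><a href="/about">About</a> <b>new</b></li>
</ul>
"""
soup = BeautifulSoup(html, "html.parser")

# find returns a single tag, or None when nothing matches.
print(soup.find("a")["href"])                  # /home
print(soup.find("a", title="Home")["id"])      # first
print(soup.find("a", id="missing"))            # None

# find_all returns a list of every match, in document order.
all_links = soup.find_all("a")
print(len(all_links))                          # 3
mixed = [t.name for t in soup.find_all(["a", "b"])]
print(mixed)                                   # ['a', 'a', 'a', 'b']
first_two = [t["href"] for t in soup.find_all("a", limit=2)]
print(first_two)                               # ['/home', '/docs']
```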

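<6> Select content with a CSS selector
               select: soup.select('#feng')
        - Common selectors: tag selector (a), class selector (.), id selector (#), hierarchy selectors
            - Hierarchy selectors:
                div .dudu #lala .meme .xixi  descendants at any depth
                div > p > a > .lala          direct children only
        [Note] select always returns a list; use an index to pull out a specific element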
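
The difference between the two hierarchy selectors is easiest to see on a small tree. The sketch below is illustrative only: the markup is invented (reusing the #feng and .lala names from the notes above), and html.parser is assumed instead of lxml.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: one .lala is a direct child of the div, one is nested deeper.
html = """
<div id="feng">
  <p class="lala">direct child</p>
  <section>
    <p class="lala">nested deeper</p>
  </section>
</div>
"""
soup = BeautifulSoup(html, "html.parser")

by_id = soup.select("#feng")            # always a list, even for an id
print(len(by_id), by_id[0].name)        # 1 div

descendants = soup.select("div .lala")  # space: any depth below div
print(len(descendants))                 # 2

children = soup.select("div > .lala")   # >: direct children of div only
print([p.text for p in children])       # ['direct child']
```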

Methods

doc = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p id="my p" class="title"><b id="bbb" class="boldest">The Dormouse's story</b>
</p>

<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>

<p class="story">...</p>
"""

Test data

find_all()

Finds all matching tags
Returns a list
find_all(name=None, attrs={}, recursive=True, text=None,limit=None, **kwargs)

1 name

Five kinds of filter: string, regular expression, list, True, and function

import re  # needed for the regular-expression filter below

# String: a tag name
print(soup.find_all('b'))  # [<b class="boldest" id="bbb">The Dormouse's story</b>]

# Regular expression
print(soup.find_all(re.compile("^b")))  # tags whose names start with b: body and b

# List: Beautiful Soup returns everything that matches any item in the list
print(soup.find_all(['a', 'b']))  # all <a> and <b> tags in the document

# True: matches any value
print(soup.find_all(True))  # every tag in the document
for tag in soup.find_all(True):
    print(tag.name)             # html head title body p b p a a a p

# Function: if no other filter fits, define a function that takes one element as its only argument and returns True when the element matches, False otherwise
def has_class_but_no_id(tag):
    return tag.has_attr('class') and not tag.has_attr('id')
print(soup.find_all(has_class_but_no_id))

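2 Searching by class name

Because class is a reserved word in Python, the keyword argument is class_. In class_=value, value can be any of the five filter types.

print(soup.find_all('a', class_='sister'))  # a tags whose class is sister
print(soup.find_all('a', id='link3'))  # a tags whose id is link3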

3 attrs

print(soup.find_all('p', attrs={'class': 'story'}))  # p tags whose class is story
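
The attrs dict is also the way to filter on attribute names that cannot be written as Python keyword arguments, such as data-* attributes. A minimal sketch, assuming html.parser and invented markup:

```python
from bs4 import BeautifulSoup

# Hypothetical markup for illustration.
html = """
<div class="story" data-role="card">A</div>
<div class="story intro">B</div>
<div>C</div>
"""
soup = BeautifulSoup(html, "html.parser")

# A class filter matches any one of an element's classes.
by_class = soup.find_all("div", attrs={"class": "story"})
print([d.text for d in by_class])   # ['A', 'B']

# data-role is not a valid keyword name, so it has to go through attrs.
by_data = soup.find_all(attrs={"data-role": "card"})
print([d.text for d in by_data])    # ['A']
```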

4 text

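The value can be a string, a list, True, or a regular expression. Since Beautiful Soup 4.4.0 this argument is also available as string; text is the older name.

print(soup.find_all(text='Elsie'))  # ['Elsie']
print(soup.find_all('a', text='Elsie'))  # [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>]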
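
The other value types for the text filter can be sketched as below. This example assumes Beautiful Soup 4.4.0 or later, where the argument is spelled string (the newer name for text), and uses html.parser with a cut-down, made-up version of the three links.

```python
import re

from bs4 import BeautifulSoup

# Hypothetical markup modelled on the three sister links.
html = (
    '<p><a id="link1">Elsie</a>, '
    '<a id="link2">Lacie</a> and '
    '<a id="link3">Tillie</a></p>'
)
soup = BeautifulSoup(html, "html.parser")

# Without a tag name, the result is a list of strings, not tags.
exact = soup.find_all(string="Elsie")
print(exact)                         # ['Elsie']

# Regular expression: strings that start with L.
by_regex = soup.find_all(string=re.compile("^L"))
print(by_regex)                      # ['Lacie']

# With a tag name, the result is the tags whose .string matches.
tags = soup.find_all("a", string=["Elsie", "Tillie"])
print([t["id"] for t in tags])       # ['link1', 'link3']
```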

5 limit

Limits the number of results returned

print(soup.find_all('a', limit=2))

6 recursive

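Defaults to True, which searches all descendants of the current tag. To search only the tag's direct children, pass recursive=False.

print(soup.html.find_all('a'))  # searches every descendant of <html>
# Direct children only: <html> has no direct <a> child, so this returns []
print(soup.html.find_all('a', recursive=False))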
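
The effect of recursive=False is clearer when one search succeeds and the other does not. A minimal sketch on a made-up two-level document, assuming html.parser:

```python
from bs4 import BeautifulSoup

# Hypothetical markup: <title> is a grandchild of <html>, <head> is a direct child.
html = (
    "<html><head><title>The Dormouse's story</title></head>"
    "<body><a href='#'>link</a></body></html>"
)
soup = BeautifulSoup(html, "html.parser")

deep = soup.html.find_all("title")                        # searches all descendants
shallow = soup.html.find_all("title", recursive=False)    # direct children only
direct = soup.html.find_all("head", recursive=False)

print(len(deep))      # 1
print(shallow)        # []
print(len(direct))    # 1
```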

find()

find() takes the same parameters as find_all(), except limit
soup.find('a') is equivalent to soup.a: both return only the first matching tag

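Selectors

select() takes CSS selectors

Returns a list

print(soup.select('.sister'))  # tags whose class is sister
print(soup.select("#link2"))  # the tag whose id is link2
print(soup.select('.c1 a'))  # a tags under an element with class c1 (the test data has no such class, so this returns [])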