The useful information in a web page usually lives in its text nodes or in the attribute values of its various tags. To extract that information, we need search methods that can retrieve these text values and tag attributes, and Beautiful Soup has several built in.
The following HTML is the reference page used by the examples in this article:
<html>
<body>
<div class="ecopyramid">
  <ul id="producers">
    <li class="producerlist">
      <div class="name">plants</div>
      <div class="number">100000</div>
    </li>
    <li class="producerlist">
      <div class="name">algae</div>
      <div class="number">100000</div>
    </li>
  </ul>
  <ul id="primaryconsumers">
    <li class="primaryconsumerlist">
      <div class="name">deer</div>
      <div class="number">1000</div>
    </li>
    <li class="primaryconsumerlist">
      <div class="name">rabbit</div>
      <div class="number">2000</div>
    </li>
  </ul>
  <ul id="secondaryconsumers">
    <li class="secondaryconsumerlist">
      <div class="name">fox</div>
      <div class="number">100</div>
    </li>
    <li class="secondaryconsumerlist">
      <div class="name">bear</div>
      <div class="number">100</div>
    </li>
  </ul>
  <ul id="tertiaryconsumers">
    <li class="tertiaryconsumerlist">
      <div class="name">lion</div>
      <div class="number">80</div>
    </li>
    <li class="tertiaryconsumerlist">
      <div class="name">tiger</div>
      <div class="number">50</div>
    </li>
  </ul>
</div>
</body>
</html>
The code above is a simple representation of an ecological pyramid. To locate the producers, primary consumers, or secondary consumers in it, we can use Beautiful Soup's search methods. In general, to find the first occurrence of any tag inside a BeautifulSoup object, we use the find() method.
Clearly, the producers are inside the first <ul> tag. Because the producers appear in the first <ul> of the whole HTML document, a simple call to find() is enough to locate the first producer. The HTML tree below illustrates where the first producer sits.
Then, in ecologicalpyramid.py, write the following code, which builds a BeautifulSoup object from the ecologicalpyramid.html file:
from bs4 import BeautifulSoup

with open("ecologicalpyramid.html", "r") as ecological_pyramid:
    # name the parser explicitly to avoid bs4's "no parser specified" warning
    soup = BeautifulSoup(ecological_pyramid, "html.parser")
producer_entries = soup.find("ul")
print(producer_entries.li.div.string)
Output:

plants
The signature of find() is:
find(name, attrs, recursive, text, **kwargs)
These parameters act like filters that narrow down the match. The different filters apply in the following situations.
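Before going through the examples one by one, here is a minimal sketch of the main filter styles that find() accepts. The markup and variable names are illustrative, not from the example files:

```python
from bs4 import BeautifulSoup

html_doc = """
<div id="outer">
  <p class="intro">hello</p>
  <div><p class="intro">nested</p></div>
</div>
"""
soup = BeautifulSoup(html_doc, "html.parser")

by_name = soup.find("p")                        # filter by tag name
by_attrs = soup.find(attrs={"class": "intro"})  # filter by attribute value
by_text = soup.find(text="hello")               # filter by text content
# recursive=False restricts the search to direct children only
top_only = soup.find("div").find("p", recursive=False)

print(by_name.string)   # hello
print(top_only.string)  # hello
```

Each filter can be combined with the others in a single call; find() returns the first node that satisfies all of them.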
Searching by tag name: we can pass any tag's name to find the first place it occurs. Once found, find() returns a Beautiful Soup Tag object.
from bs4 import BeautifulSoup

with open("ecologicalpyramid.html", "r") as ecological_pyramid:
    soup = BeautifulSoup(ecological_pyramid, "html.parser")
producer_entries = soup.find("ul")
print(type(producer_entries))
Passing a plain string searches for a tag of that name. To search for a piece of text instead, use the text parameter, as shown below:
from bs4 import BeautifulSoup

with open("ecologicalpyramid.html", "r") as ecological_pyramid:
    soup = BeautifulSoup(ecological_pyramid, "html.parser")
plants_string = soup.find(text="plants")
print(plants_string)
Searching with regular expressions: the text parameter also accepts a compiled regular expression. Given the following HTML:
<br/>
<div>The below HTML has the information that has email ids.</div>
abc@example.com
<div>xyz@example.com</div>
<span>foo@example.com</span>
refer to the following code:
import re
from bs4 import BeautifulSoup

email_id_example = """<br/>
<div>The below HTML has the information that has email ids.</div>
abc@example.com
<div>xyz@example.com</div>
<span>foo@example.com</span>
"""
soup = BeautifulSoup(email_id_example, "html.parser")
emailid_regexp = re.compile(r"\w+@\w+\.\w+")
first_email_id = soup.find(text=emailid_regexp)
print(first_email_id)
Searching by tag attribute: looking at the example HTML, the primary consumers sit inside a <ul> tag whose id attribute is primaryconsumers.
Because that <ul> is not the first one in the document, the earlier tag-name search will not work. Instead, we search by tag attribute, as in the following code:
from bs4 import BeautifulSoup

with open("ecologicalpyramid.html", "r") as ecological_pyramid:
    soup = BeautifulSoup(ecological_pyramid, "html.parser")
primary_consumer = soup.find(id="primaryconsumers")
print(primary_consumer.li.div.string)
Searching by tag attribute in this way works for most attributes, including id, style, and title, but there is one group of attributes it cannot handle.
Custom attributes such as data-custom cannot be passed as keyword arguments, because data-custom is not a valid Python identifier; attempting it raises a SyntaxError:

customattr = """<p data-custom="custom">custom attribute example</p>"""
customsoup = BeautifulSoup(customattr, 'lxml')
customsoup.find(data-custom="custom")  # SyntaxError
Instead, such attributes can be passed through the attrs dictionary:

using_attrs = customsoup.find(attrs={'data-custom': 'custom'})
print(using_attrs)
Searching by CSS class: class is a reserved word in Python, so it cannot be used as a keyword argument either. A CSS class can likewise be passed through attrs:

css_class = soup.find(attrs={'class': 'primaryconsumerlist'})
print(css_class)
Since Beautiful Soup 4.1.2, the class_ keyword argument can also be used; the two lines below are equivalent:

css_class = soup.find(class_="primaryconsumerlist")
css_class = soup.find(attrs={'class': 'primaryconsumerlist'})
Searching with a custom function: find() also accepts a function that takes a tag and returns True or False, and it returns the first tag for which the function is true. For example, to find the secondary consumers:

def is_secondary_consumers(tag):
    return tag.has_attr('id') and tag.get('id') == 'secondaryconsumers'

secondary_consumer = soup.find(is_secondary_consumers)
print(secondary_consumer.li.div.string)
find() returns only the first match; to collect every match, use find_all(). For example, to list all tertiary consumers:

all_tertiaryconsumers = soup.find_all(class_="tertiaryconsumerlist")
for tertiaryconsumer in all_tertiaryconsumers:
    print(tertiaryconsumer.div.string)
find_all() accepts the same filters as find(). Reusing the regular expression from the email example, we can collect every email id, and the limit parameter caps the number of results:

email_ids = soup.find_all(text=emailid_regexp)
print(email_ids)

email_ids_limited = soup.find_all(text=emailid_regexp, limit=2)
print(email_ids_limited)
Passing text=True returns every piece of text in the document:

all_texts = soup.find_all(text=True)
print(all_texts)

A list of strings can also be passed, and find_all() returns every text node that matches one of them:

all_texts_in_list = soup.find_all(text=["plants", "algae"])
print(all_texts_in_list)

Output:

[u'plants', u'algae']
The same works for tag names; passing a list finds all <div> and <li> tags:

div_li_tags = soup.find_all(["div", "li"])
Searching for parents: starting from a tag, find_parents() walks upward and returns all matching ancestors. For example, to find the <ul> ancestors of the first primary consumer:

primaryconsumers = soup.find_all(class_="primaryconsumerlist")
primaryconsumer = primaryconsumers[0]
parent_ul = primaryconsumer.find_parents('ul')
print(parent_ul)
find_parent() returns only the immediate parent:

immediateprimary_consumer_parent = primary_consumer.find_parent()
Searching for siblings: find_next_siblings() returns all the siblings that follow a tag. For example, the siblings after the producers <ul>:

producers = soup.find(id='producers')
next_siblings = producers.find_next_siblings()
print(next_siblings)
Searching forward through the document: find_all_next() returns everything parsed after a tag, its own descendants included. Starting from the first <div>, all subsequent <li> tags:

first_div = soup.div
all_li_tags = first_div.find_all_next("li")
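To make the difference between sibling and forward searches concrete, here is a self-contained sketch; it inlines a condensed copy of the ecological-pyramid markup (an assumption, so it runs without the .html file):

```python
from bs4 import BeautifulSoup

# condensed, inlined copy of the ecological-pyramid markup (illustrative)
pyramid = """
<div class="ecopyramid">
  <ul id="producers">
    <li class="producerlist"><div class="name">plants</div></li>
  </ul>
  <ul id="primaryconsumers">
    <li class="primaryconsumerlist"><div class="name">deer</div></li>
  </ul>
</div>
"""
soup = BeautifulSoup(pyramid, "html.parser")

producers = soup.find(id="producers")
# find_next_siblings() stays at the same tree level ...
siblings = producers.find_next_siblings("ul")
# ... while find_all_next() walks every element parsed after the tag's
# opening, including the tag's own descendants.
later_divs = producers.find_all_next("div")

print([ul.get("id") for ul in siblings])  # ['primaryconsumers']
print([d.string for d in later_divs])     # ['plants', 'deer']
```

Note that find_all_next() picks up the "plants" <div> even though it sits inside the producers <ul> itself, because that <div> is parsed after the <ul> tag opens.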