Python Web Scraping | Selenium in Detail

1. Introduction

The three building blocks of a web page:

HTML handles content;

CSS handles style;

JavaScript handles behavior.

 

From a data point of view, the data shown on a web page comes from:

the HTML file

AJAX endpoints

JavaScript loading

 

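If you send a request to a page with requests, you only get the part of the page that has been loaded so far; dynamically loaded data, such as the data that appears when you scroll down, is not available. Selenium began as an automated testing tool, and scrapers use it mainly because requests cannot execute JavaScript. In essence, Selenium drives a real browser and fully simulates browser operations such as navigating, typing, clicking, and scrolling, so that you get the page as it looks after rendering. Multiple browsers are supported. For Python, Selenium is a third-party library whose API lets you control the browser and have it carry out operations automatically.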
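To make that limitation concrete, here is a minimal standard-library sketch. The HTML string is a made-up stand-in for what a plain HTTP client might receive from a page whose list is filled in by JavaScript, and the ItemCounter class is hypothetical; neither comes from a real site.

```python
from html.parser import HTMLParser

# Hypothetical response body: the <ul> is empty because a script
# would fill it in only after the page runs inside a browser.
STATIC_HTML = """
<html><body>
  <ul id="movies"></ul>
  <script>fetch('/api/movies').then(render)</script>
</body></html>
"""


class ItemCounter(HTMLParser):
    """Counts the <li> tags present in the raw markup."""

    def __init__(self):
        super().__init__()
        self.li_count = 0

    def handle_starttag(self, tag, attrs):
        if tag == "li":
            self.li_count += 1


counter = ItemCounter()
counter.feed(STATIC_HTML)
print(counter.li_count)  # 0: the raw markup contains no list items
```

A browser driven by Selenium would run the script first, so page_source would contain the rendered items instead.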

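Uses of Selenium in scraping:

  • Simulating login
  • Conveniently obtaining dynamically loaded data

Drawbacks:

  • Low scraping efficiency
  • Cumbersome environment setup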

2. Environment Setup

 

Check your Google Chrome version

 

 

 

 

 

 

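Download the driver that matches your browser version; once the download finishes, you have the driver executable.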

 

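3. Basic Usage

1. Creating a browser

Selenium supports a great many browsers, such as Chrome, Firefox, and Edge, as well as mobile browsers such as Android and BlackBerry. It also supports the headless browser PhantomJS (deprecated in later Selenium releases).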

from selenium import webdriver

browser = webdriver.Chrome()

browser = webdriver.Firefox()

browser = webdriver.Edge()

browser = webdriver.PhantomJS()

browser = webdriver.Safari()

browser.quit()   # close the browser and end the session

browser.close()  # close only the current window

 

2. Opening a page

browser.get(url)                  # open the given URL

page_text = browser.page_source        # get the source of the page currently shown in the browser

 

3. Locating elements

Finding a single element

element = find_element_by_id()

element = find_element_by_name()

element = find_element_by_class_name()

element = find_element_by_tag_name()

element = find_element_by_link_text()

element = find_element_by_partial_link_text()

element = find_element_by_xpath()

element = find_element_by_css_selector()

 

Finding multiple elements

elements = find_elements_by_id()

elements = find_elements_by_name()

elements = find_elements_by_class_name()

elements = find_elements_by_tag_name()

elements = find_elements_by_link_text()

elements = find_elements_by_partial_link_text()

elements = find_elements_by_xpath()

elements = find_elements_by_css_selector()

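Notes:

(1) find_element_by_xxx returns the first matching element; find_elements_by_xxx returns all matching elements.

(2) Whether you locate an element by ID, CSS selector, or XPath, the result returned is exactly the same.

(3) Selenium also provides the generic method find_element(), which takes two arguments: the lookup strategy (By) and the value. It is simply the general form of methods like find_element_by_id(); for example, find_element_by_id(id) is equivalent to find_element(By.ID, id), and both return exactly the same result. The find_element_by_xxx shortcuts were removed in Selenium 4.3, so on current releases only the find_element(By.XXX, value) form is available.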
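The equivalence in note (3) can be sketched without a browser. The snippet below is only an illustration: to_locator and LEGACY_TO_BY are hypothetical names, and the strategy strings are written out by hand to mirror the constants on selenium.webdriver.common.by.By (for example By.ID is "id" and By.CSS_SELECTOR is "css selector").

```python
# Strategy strings mirroring the By constants (By.ID == "id", etc.).
LEGACY_TO_BY = {
    "id": "id",
    "name": "name",
    "class_name": "class name",
    "tag_name": "tag name",
    "link_text": "link text",
    "partial_link_text": "partial link text",
    "xpath": "xpath",
    "css_selector": "css selector",
}


def to_locator(legacy_method, value):
    """Translate e.g. 'find_element_by_id' plus a value into a (by, value) pair."""
    for prefix in ("find_elements_by_", "find_element_by_"):
        if legacy_method.startswith(prefix):
            return LEGACY_TO_BY[legacy_method[len(prefix):]], value
    raise ValueError("not a find_element(s)_by_* method: " + legacy_method)


locator = to_locator("find_element_by_id", "loginForm")
print(locator)  # ('id', 'loginForm')
# With a live driver, the pair would be unpacked as: driver.find_element(*locator)
```

Keeping locators as plain (by, value) pairs also makes them easy to pass to explicit waits, which take the same tuple.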

 

# Locate by id

<html>
 <body>
  <form id="loginForm">
   <input name="username" type="text" />
   <input name="password" type="password" />
   <input name="continue" type="submit" value="Login" />
  </form>
 </body>
</html>

login_form = driver.find_element_by_id('loginForm')

# Locate by name

<html>
 <body>
  <form id="loginForm">
   <input name="username" type="text" />
   <input name="password" type="password" />
   <input name="continue" type="submit" value="Login" />
   <input name="continue" type="button" value="Clear" />
  </form>
</body>
</html>

username = driver.find_element_by_name('username')
password = driver.find_element_by_name('password')

# Locate by link text

<html>
 <body>
  <p>Are you sure you want to do this?</p>
  <a href="continue.html">Continue</a>
  <a href="cancel.html">Cancel</a>
</body>
</html>

continue_link = driver.find_element_by_link_text('Continue')
continue_link = driver.find_element_by_partial_link_text('Conti')

# Locate by tag name

<html>
 <body>
  <h1>Welcome</h1>
  <p>Site content goes here.</p>
</body>
</html>

heading1 = driver.find_element_by_tag_name('h1')

# Locate by class name

<html>
 <body>
  <p class="content">Site content goes here.</p>
</body>
</html>

content = driver.find_element_by_class_name('content')

# Locate by CSS selector

<html>
 <body>
  <p class="content">Site content goes here.</p>
</body>
</html>

content = driver.find_element_by_css_selector('p.content')



# Locating with XPath (recommended)
login_form = driver.find_element_by_xpath("//form[input/@name='username']")  # selects the <form> that contains the username input
username = driver.find_element_by_xpath("//form[@id='loginForm']/input[1]")
username = driver.find_element_by_xpath("//input[@name='username']")

# Locating by link text (recommended)
continue_link = driver.find_element_by_link_text('Continue')
continue_link = driver.find_element_by_partial_link_text('Conti')

 

4. Entering text

element.send_keys("some text")

element.clear()  # clear the text

 

5. Clicking

element.click()

 

6. Action chains

from selenium.webdriver import ActionChains

source = browser.find_element_by_css_selector('')

target = browser.find_element_by_css_selector('')

actions = ActionChains(browser)

actions.drag_and_drop(source, target).perform()

actions.release()

 

7. Switching between pages

This applies to sites where clicking a link opens a new page, while the browser object still refers to the previous page.

window_handles = driver.window_handles

driver.switch_to.window(window_handles[-1])
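Indexing with [-1] relies on the new window being last in the handle list, which may not always hold. A slightly more defensive sketch is to compare the handles recorded before and after the click. newest_handle is a hypothetical helper, and the handle strings below are invented for the demonstration.

```python
def newest_handle(handles_before, handles_after):
    """Return a handle that exists after the click but not before it, else None."""
    opened = [h for h in handles_after if h not in handles_before]
    return opened[-1] if opened else None


# Invented handle values standing in for driver.window_handles snapshots.
before = ["CDwindow-A"]
after = ["CDwindow-A", "CDwindow-B"]
print(newest_handle(before, after))  # CDwindow-B

# With a live driver the flow would look like this:
#   before = driver.window_handles
#   link.click()
#   target = newest_handle(before, driver.window_handles)
#   if target is not None:
#       driver.switch_to.window(target)
```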

 

8. Saving a screenshot

driver.save_screenshot('screen.png')

 

9. Executing JavaScript to scroll down

browser.execute_script('window.scrollTo(0, document.body.scrollHeight)')
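Pages that load more data on scroll usually need several scrolls, not one. One possible approach, sketched below, is to keep scrolling until the page height stops growing. scroll_until_stable and FakeDriver are hypothetical names; FakeDriver only imitates the execute_script method so the loop can be run without a browser.

```python
import time


def scroll_until_stable(driver, pause=1.0, max_rounds=20):
    """Scroll to the bottom until the page height stops growing.

    Returns the number of scrolls performed.
    """
    last_height = driver.execute_script("return document.body.scrollHeight")
    rounds = 0
    while rounds < max_rounds:
        driver.execute_script("window.scrollTo(0, document.body.scrollHeight)")
        rounds += 1
        time.sleep(pause)  # give the page time to load more content
        new_height = driver.execute_script("return document.body.scrollHeight")
        if new_height == last_height:
            break
        last_height = new_height
    return rounds


class FakeDriver:
    """Stand-in for a WebDriver: the page grows twice, then stops growing."""

    def __init__(self):
        self.heights = [1000, 2000, 3000, 3000]
        self.index = 0

    def execute_script(self, script):
        if script.startswith("return"):
            return self.heights[self.index]
        # Any other script is treated as a scroll that may load more content.
        self.index = min(self.index + 1, len(self.heights) - 1)


print(scroll_until_stable(FakeDriver(), pause=0))  # 3
```

The max_rounds cap is there so that an endlessly growing feed cannot keep the loop running forever.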

 

10. Back and forward

browser.back()

browser.forward()

 

11. Waiting

from selenium import webdriver

from selenium.webdriver.common.by import By

from selenium.webdriver.support.ui import WebDriverWait

from selenium.webdriver.support import expected_conditions as EC

driver = webdriver.Chrome()

driver.get("http://somedomain/")

try:

    element = WebDriverWait(driver, 10).until(

        EC.presence_of_element_located((By.ID, "myDynamicElement"))

    )

finally:

    driver.quit()

Expected conditions:

title_is
title_contains
presence_of_element_located
visibility_of_element_located
visibility_of
presence_of_all_elements_located
text_to_be_present_in_element
text_to_be_present_in_element_value
frame_to_be_available_and_switch_to_it
invisibility_of_element_located
element_to_be_clickable
staleness_of
element_to_be_selected
element_located_to_be_selected
element_selection_state_to_be
element_located_selection_state_to_be
alert_is_present
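The idea behind an explicit wait is a polling loop: call a condition repeatedly until it returns something truthy or the timeout expires. The sketch below models that idea with the standard library only. It is not Selenium's implementation, and wait_until and WaitTimeout are hypothetical names.

```python
import time


class WaitTimeout(Exception):
    """Raised when the condition never became truthy within the timeout."""


def wait_until(condition, timeout=10.0, poll=0.5):
    """Call condition() until it returns a truthy value or the timeout elapses."""
    deadline = time.monotonic() + timeout
    while True:
        result = condition()
        if result:
            return result
        if time.monotonic() >= deadline:
            raise WaitTimeout("condition not met within %.2f s" % timeout)
        time.sleep(poll)


attempts = []


def ready_on_third_call():
    """Pretend the element only shows up on the third poll."""
    attempts.append(1)
    return "ready" if len(attempts) >= 3 else False


print(wait_until(ready_on_third_call, timeout=2, poll=0.01))  # ready
```

Compared with a fixed sleep, a polling wait returns as soon as the condition holds and fails loudly when it never does.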

 

12. Handling cookies

Get, add, and delete cookies:

browser.get_cookies()

browser.add_cookie({'name': 'name', 'domain': 'www.zhihu.com', 'value': 'germey'})

browser.delete_all_cookies()
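A common follow-up is to log in with Selenium and then reuse the session cookies in a plain HTTP client. get_cookies() returns a list of dicts, so flattening it to a name-to-value mapping is enough for that. The sketch below uses invented sample data, and cookies_to_dict is a hypothetical helper.

```python
def cookies_to_dict(selenium_cookies):
    """Flatten the list of cookie dicts into a {name: value} mapping."""
    return {cookie["name"]: cookie["value"] for cookie in selenium_cookies}


# Invented sample in the shape returned by browser.get_cookies().
sample = [
    {"name": "sessionid", "value": "abc123", "domain": ".example.com", "path": "/"},
    {"name": "lang", "value": "zh", "domain": ".example.com", "path": "/"},
]
print(cookies_to_dict(sample))  # {'sessionid': 'abc123', 'lang': 'zh'}

# Possible use with requests, which accepts a dict for its cookies argument:
#   requests.get(url, cookies=cookies_to_dict(browser.get_cookies()))
```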

 

13. Headless Chrome

from selenium import webdriver

from selenium.webdriver.chrome.options import Options

chrome_options = Options()

chrome_options.add_argument('--headless')

chrome_options.add_argument('--disable-gpu')

# path: location of the chromedriver executable downloaded during setup

browser = webdriver.Chrome(executable_path=path, chrome_options=chrome_options)

 

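14. Evading detection

Some websites monitor for requests issued through Selenium. The site can detect Selenium from the value of window.navigator.webdriver: if it is undefined, the request was not made through Selenium; if it is true, it was.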

 

How to evade detection:

from selenium import webdriver

from selenium.webdriver import ChromeOptions

option = ChromeOptions()

option.add_experimental_option('excludeSwitches', ['enable-automation'])

bro = webdriver.Chrome(executable_path='chromedriver.exe',options=option)

 

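4. Mouse and Keyboard Actions (ActionChains)

1. Basic ActionChains usage

How ActionChains executes: calling a method on ActionChains does not run it immediately. Instead, every action is stored in a queue in order, and when you call perform() the queued actions are executed one after another.

 

There are two calling styles:

Chained style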

menu = driver.find_element_by_css_selector(".nav")
hidden_submenu = driver.find_element_by_css_selector(".nav #submenu1")

ActionChains(driver).move_to_element(menu).click(hidden_submenu).perform()

Step-by-step style

menu = driver.find_element_by_css_selector(".nav")
hidden_submenu = driver.find_element_by_css_selector(".nav #submenu1")

actions = ActionChains(driver)
actions.move_to_element(menu)
actions.click(hidden_submenu)
actions.perform()

The two styles are equivalent: either way, ActionChains performs all the actions in order.
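The deferred-queue idea, and why both calling styles end up doing the same thing, can be shown with a toy model. ToyActionChain is a hypothetical class written for this illustration. It is not Selenium's implementation, and it deliberately does not model what happens to the queue after perform().

```python
class ToyActionChain:
    """Toy model of a deferred action queue (not Selenium's implementation)."""

    def __init__(self):
        self._queue = []  # actions waiting to run
        self.log = []     # record of actions that actually ran

    def move_to(self, name):
        self._queue.append(lambda: self.log.append("move_to " + name))
        return self  # returning self is what makes chaining possible

    def click(self, name):
        self._queue.append(lambda: self.log.append("click " + name))
        return self

    def perform(self):
        for action in self._queue:
            action()


# Chained style: nothing runs until perform() is called.
chained = ToyActionChain()
chained.move_to("menu").click("submenu")
print(chained.log)  # []
chained.perform()

# Step-by-step style: same queue, same result.
stepwise = ToyActionChain()
stepwise.move_to("menu")
stepwise.click("submenu")
stepwise.perform()

print(chained.log == stepwise.log)  # True
```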

 

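2. ActionChains method list

click(on_element=None): click the left mouse button
click_and_hold(on_element=None): press the left mouse button without releasing it
context_click(on_element=None): click the right mouse button
double_click(on_element=None): double-click the left mouse button
drag_and_drop(source, target): drag to an element, then release
drag_and_drop_by_offset(source, xoffset, yoffset): drag to a coordinate, then release
key_down(value, element=None): press a key on the keyboard
key_up(value, element=None): release a key
move_by_offset(xoffset, yoffset): move the mouse from its current position by an offset
move_to_element(to_element): move the mouse to an element
move_to_element_with_offset(to_element, xoffset, yoffset): move to a position at the given offset from an element (from its top-left corner)
perform(): execute all actions in the chain
release(on_element=None): release the left mouse button at an element
send_keys(*keys_to_send): send keys to the currently focused element
send_keys_to_element(element, *keys_to_send): send keys to the specified element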

 

3. Code examples

(1) Clicking

# -*- coding: utf-8 -*-

from selenium import webdriver
from selenium.webdriver.common.action_chains import ActionChains
from time import sleep


driver = webdriver.Firefox()
driver.implicitly_wait(10)
driver.maximize_window()
driver.get('http://sahitest.com/demo/clicks.htm')

click_btn = driver.find_element_by_xpath('//input[@value="click me"]')          # single-click button
doubleclick_btn = driver.find_element_by_xpath('//input[@value="dbl click me"]')    # double-click button
rightclick_btn = driver.find_element_by_xpath('//input[@value="right click me"]')    # right-click button


ActionChains(driver).click(click_btn).double_click(doubleclick_btn).context_click(rightclick_btn).perform()  # chained style

print(driver.find_element_by_name('t2').get_attribute('value'))

sleep(2)
driver.quit()

(2) Mouse movement

# -*- coding: utf-8 -*-

from selenium import webdriver
from selenium.webdriver.common.action_chains import ActionChains
from time import sleep

driver = webdriver.Firefox()
driver.implicitly_wait(10)
driver.maximize_window()
driver.get('http://sahitest.com/demo/mouseover.htm')

write = driver.find_element_by_xpath('//input[@value="Write on hover"]') # hovering over this element shows "Mouse moved" in the input box below
blank = driver.find_element_by_xpath('//input[@value="Blank on hover"]') # hovering over this element clears the input box below

result = driver.find_element_by_name('t1')

action = ActionChains(driver)
action.move_to_element(write).perform() # move to write; "Mouse moved" is shown
print(result.get_attribute('value'))

# action.move_to_element(blank).perform()
action.move_by_offset(10, 50).perform() # move by (10, 50) from the current position; same effect as the line above: lands on blank and clears the box
print(result.get_attribute('value'))

action.move_to_element_with_offset(blank, 10, -40).perform() # move to the point offset (10, -40) from blank, which lands on write
print(result.get_attribute('value'))

sleep(2)
driver.quit()

Moving the mouse by relative position is rarely needed; when it is, measure the element positions first.


(3) Drag and drop

# -*- coding: utf-8 -*-

from selenium import webdriver
from selenium.webdriver.common.action_chains import ActionChains
from time import sleep

driver = webdriver.Firefox()
driver.implicitly_wait(10)
driver.maximize_window()
driver.get('http://sahitest.com/demo/dragDropMooTools.htm')

dragger = driver.find_element_by_id('dragger') # element to drag
item1 = driver.find_element_by_xpath('//div[text()="Item 1"]') # target 1
item2 = driver.find_element_by_xpath('//div[text()="Item 2"]') # target 2
item3 = driver.find_element_by_xpath('//div[text()="Item 3"]') # target 3
item4 = driver.find_element_by_xpath('//div[text()="Item 4"]') # target 4

action = ActionChains(driver)
action.drag_and_drop(dragger, item1).perform() # 1. drag dragger onto item1
sleep(2)
action.click_and_hold(dragger).release(item2).perform() # 2. same effect as the line above
sleep(2)
action.click_and_hold(dragger).move_to_element(item3).release().perform() # 3. same effect as the two lines above
sleep(2)
# action.drag_and_drop_by_offset(dragger, 400, 150).perform() # 4. drag to the given offset
action.click_and_hold(dragger).move_by_offset(400, 150).release().perform() # 5. same as the line above: drag to the given offset
sleep(2)
driver.quit()

Coordinate-based positioning is rarely needed; method 1 in the example above is enough. If you read the source, you will find that method 2 is exactly how drag_and_drop() in method 1 is implemented. Note: add waits when dragging, because the operation sometimes fails when it runs too fast.

 

(4) Key presses

  There are several ways to simulate key presses: win32api, SendKeys, or the send_keys() method of Selenium's WebElement object. The ActionChains class also provides several methods for simulating keys.

# -*- coding: utf-8 -*-

from selenium import webdriver
from selenium.webdriver.common.action_chains import ActionChains
from selenium.webdriver.common.keys import Keys
from time import sleep

driver = webdriver.Firefox()
driver.implicitly_wait(10)
driver.maximize_window()
driver.get('http://sahitest.com/demo/keypress.htm')

key_up_radio = driver.find_element_by_id('r1') # radio: monitor key up
key_down_radio = driver.find_element_by_id('r2') # radio: monitor key down
key_press_radio = driver.find_element_by_id('r3') # radio: monitor key press (down then up)

enter = driver.find_elements_by_xpath('//form[@name="f1"]/input')[1] # input box
result = driver.find_elements_by_xpath('//form[@name="f1"]/input')[0] # monitoring result

# Monitor key_down
key_down_radio.click()
ActionChains(driver).key_down(Keys.CONTROL, enter).key_up(Keys.CONTROL).perform()
print(result.get_attribute('value'))

# Monitor key_up
key_up_radio.click()
enter.click()
ActionChains(driver).key_down(Keys.SHIFT).key_up(Keys.SHIFT).perform()
print(result.get_attribute('value'))

# Monitor key_press
key_press_radio.click()
enter.click()
ActionChains(driver).send_keys('a').perform()
print(result.get_attribute('value'))
driver.quit()

 

Example 2:

 

# -*- coding: utf-8 -*-

from selenium import webdriver
from selenium.webdriver.common.action_chains import ActionChains
from selenium.webdriver.common.keys import Keys
from time import sleep

driver = webdriver.Firefox()
driver.implicitly_wait(10)
driver.maximize_window()

driver.get('http://sahitest.com/demo/label.htm')

input1 = driver.find_elements_by_tag_name('input')[3]
input2 = driver.find_elements_by_tag_name('input')[4]

action = ActionChains(driver)
input1.click()
action.send_keys('Test Keys').perform()
action.key_down(Keys.CONTROL).send_keys('a').key_up(Keys.CONTROL).perform() # ctrl+a
action.key_down(Keys.CONTROL).send_keys('c').key_up(Keys.CONTROL).perform() # ctrl+c

action.key_down(Keys.CONTROL, input2).send_keys('v').key_up(Keys.CONTROL).perform() # ctrl+v

print(input1.get_attribute('value'))
print(input2.get_attribute('value'))

driver.quit()

 

 

5. Usage Examples

Example 1: Open Baidu and search for 爬虫 (web crawler)

from selenium import webdriver
from time import sleep

bro = webdriver.Chrome()
bro.get(url='https://www.baidu.com/')
sleep(2)

text_input = bro.find_element_by_id('kw')
text_input.send_keys('爬虫')
sleep(2)

bro.find_element_by_id('su').click()
sleep(3)

print(bro.page_source)
bro.quit()

Example 2: Fetching more movie details from Douban Movies (headless Chrome)

from selenium import webdriver
from time import sleep
from selenium.webdriver.chrome.options import Options

# Step 1: the following three lines are boilerplate
chrome_options = Options()
chrome_options.add_argument('--headless')
chrome_options.add_argument('--disable-gpu')

url = 'https://movie.douban.com/typerank?type_name=%E6%83%8A%E6%82%9A&type=19&interval_id=100:90&action='
# Step 2: pass the chrome_options object as an argument
bro = webdriver.Chrome(chrome_options=chrome_options)
bro.get(url)
sleep(3)
bro.execute_script('window.scrollTo(0,document.body.scrollHeight)')
sleep(3)
bro.execute_script('window.scrollTo(0,document.body.scrollHeight)')
sleep(3)
bro.execute_script('window.scrollTo(0,document.body.scrollHeight)')
sleep(2)
page_text = bro.page_source

with open('./douban.html','w',encoding='utf-8') as fp:
    fp.write(page_text)
print(page_text)
sleep(1)
bro.quit()

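Example 3: Logging in to Qzone

  On the web you will often run into pages nested inside frames. WebDriver can only identify elements in one page at a time, so elements inside a nested frame cannot be located directly. In that case, use switch_to.frame() to move the current context into the frame: switch into the iframe first, then locate elements inside it. Otherwise the elements we want cannot be found.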

import requests
from selenium import webdriver
from lxml import etree
import time

driver = webdriver.Chrome(executable_path=r'C:\Users\Administrator\chromedriver.exe')
driver.get('https://qzone.qq.com/')

# Switch into the frame with switch_to; only then can the login form be operated on.
driver.switch_to.frame('login_frame')
# Click "log in with account and password"; this fires the bound click event
driver.find_element_by_id('switcher_plogin').click()

# driver.find_element_by_id('u').clear()
driver.find_element_by_id('u').send_keys('QQ')  # placeholder: your QQ account
# driver.find_element_by_id('p').clear()
driver.find_element_by_id('p').send_keys('密码')  # placeholder: your password

# Click the login button; this fires the bound click event
driver.find_element_by_id('login_button').click()
time.sleep(2)
driver.execute_script('window.scrollTo(0,document.body.scrollHeight)')
time.sleep(2)
driver.execute_script('window.scrollTo(0,document.body.scrollHeight)')
time.sleep(2)
driver.execute_script('window.scrollTo(0,document.body.scrollHeight)')
time.sleep(2)
page_text = driver.page_source  # page source; note that page_source takes no parentheses

tree = etree.HTML(page_text)
# Parse the feed entries
li_list = tree.xpath('//ul[@id="feed_friend_list"]/li')
for li in li_list:
    text_list = li.xpath('.//div[@class="f-info"]//text()|.//div[@class="f-info qz_info_cut"]//text()')
    text = ''.join(text_list)
    print(text+'\n\n\n')

driver.quit()

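The small box turns out to be nested inside the large one: the current HTML source embeds a child HTML page, and that child page is wrapped in an iframe tag. So if the element you want lives inside an iframe, you must use the switch_to functions to move the browser's frame of reference into the iframe. This iframe has the id login_frame, which can be used to locate it.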
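After working inside an iframe, the context stays there until you switch back, which is easy to forget. One way to keep the two switches paired is a small context manager, sketched below. inside_frame is a hypothetical helper; FakeDriver and FakeSwitchTo only imitate the switch_to.frame() and switch_to.default_content() calls so the example runs without a browser.

```python
from contextlib import contextmanager


@contextmanager
def inside_frame(driver, frame_reference):
    """Switch into a frame for the duration of the with-block, then switch back."""
    driver.switch_to.frame(frame_reference)
    try:
        yield driver
    finally:
        # Return to the top-level document even if the block raised.
        driver.switch_to.default_content()


class FakeSwitchTo:
    """Records the calls a real driver.switch_to object would receive."""

    def __init__(self):
        self.calls = []

    def frame(self, reference):
        self.calls.append(("frame", reference))

    def default_content(self):
        self.calls.append(("default_content",))


class FakeDriver:
    def __init__(self):
        self.switch_to = FakeSwitchTo()


driver = FakeDriver()
with inside_frame(driver, "login_frame"):
    pass  # locate and fill in the login form here

print(driver.switch_to.calls)  # [('frame', 'login_frame'), ('default_content',)]
```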

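Example 4: Scraping a WeChat official account through the Sogou search interface (headless, detection evasion, waits, page switching)

# Add a launch argument (add_argument)
# Add an experimental option (add_experimental_option)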

from selenium import webdriver
from selenium.webdriver.support.wait import WebDriverWait
import time
import requests
from lxml import etree

option = webdriver.ChromeOptions()
option.add_argument('headless')
# Set a chromedriver launch option to evade Selenium detection
option.add_experimental_option('excludeSwitches', ['enable-automation'])

driver = webdriver.Chrome(chrome_options=option)

url = 'http://weixin.sogou.com/weixin?type=1&s_from=input&query=python_shequ'

driver.get(url)
print(driver.title)

timeout = 5
link = WebDriverWait(driver, timeout).until(
    lambda d: d.find_element_by_link_text('Python爱好者社区'))
link.click()
time.sleep(1)

# Switch to the newly opened page
window_handles = driver.window_handles
driver.switch_to.window(window_handles[-1])
print(driver.title)

article_links = WebDriverWait(driver, timeout).until(
    # EC.presence_of_element_located((By.XPATH, '//h4[@class="weui_media_title"]'))
    lambda d: d.find_elements_by_xpath('//h4[@class="weui_media_title"]'))
article_link_list = []
for item in article_links:
    article_link = 'https://mp.weixin.qq.com' + item.get_attribute('hrefs')
    # print(article_link)
    article_link_list.append(article_link)
print(article_link_list)

first_article_link = article_link_list[0]
header = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/39.0.2171.95 Safari/537.36 OPR/26.0.1656.60'
    }

response = requests.get(first_article_link,
                        headers=header,
                        timeout=5
                        )

tree = etree.HTML(response.text)
title = tree.xpath('//h2[@id="activity-name"]/text()')[0].strip()
content = tree.xpath('//div[@id="js_content"]//text()')
content = ''.join(content).strip()

print(title)
print(content)

Example 5: Using Selenium to build a simulated publishing interface for a Toutiao account

import time
import redis
from selenium import webdriver
from selenium.webdriver.common.keys import Keys
from selenium.webdriver.support.wait import WebDriverWait


r = redis.Redis('127.0.0.1', 6379)

def toutiao_save_and_preview(title, content, expand_link):
    option = webdriver.ChromeOptions()
    option.add_argument('headless')
    driver = webdriver.Chrome(chrome_options=option)

    # Get the rendered article body
    driver.get('file:///Users/Documents/toutiao.html')
    driver.execute_script("contentIn('"+ content +"');")
    timeout = 5
    content_copy = WebDriverWait(driver, timeout).until(lambda d: d.find_element_by_xpath('//button[@class="btn"]'))
    content_copy.click()

    # Simulate logging in to the publishing page
    cookie_toutiao = [{'name': 'ccid', 'value': 'db43e70fd9404338c49209ba04f7a11f'}, {'name': 'tt_webid', 'value': '6612748996061414925'}, {'name': 'UM_distinctid', 'value': '1667a53d28d449-0e229246a33996-4a506a-1fa400-1667a53d28e361'}, {'name': 'sso_uid_tt', 'value': '4c8179804d74252717c675607c721602'}, {'name': 'toutiao_sso_user', 'value': '8acc9b248cd201034637248021183d5a'}, {'name': 'sso_login_status', 'value': '1'}, {'name': 'sessionid', 'value': '8441fa3fc5ae5bc08c3becc780b5b2df'}, {'name': '_mp_test_key_1', 'value': '6aba81df9e257bea2a99713977f1e33b'}, {'name': 'uid_tt', 'value': '75b5b52039d4c9dd41315d061c833f0b'}, {'name': 'ccid', 'value': '4231c5cd5a98033f2e78336b1809a18a'}, {'name': 'tt_webid', 'value': '6631884089946523149'}, {'name': 'UM_distinctid', 'value': '16783e1566479-0ae7bcdcaeb592-113b6653-13c680-16783e156656d4'}, {'name': 'passport_auth_status', 'value': '99f731f2c6dc150e6dfea46799f20e90'}, {'name': 'sso_uid_tt', 'value': 'f4bcd2cf972384b428449b0479475ce6'}, {'name': 'toutiao_sso_user', 'value': '60df7bb620b4b6d1d17a1de83daec9c1'}, {'name': 'sso_login_status', 'value': '1'}, {'name': 'sessionid', 'value': '786fe64e9186d51b8427290a557b3c7b'}, {'name': 'uid_tt', 'value': '91a7a72a85861ae686fb66177bc16bca'}, {'name': '__tea_sdk__ssid', 'value': '60b289e6-e2a4-4494-a3e8-7936f9731426'}, {'name': 'uuid', 'value': 'w:3ec91cefd76b438583154fea77baa54b'}, {'name': 'tt_im_token', 'value': '1544105894108419437114683515671344747598423336731147829901779697'}]

    driver.get('https://mp.toutiao.com/profile_v3/index')
    for cookie in cookie_toutiao:
        driver.add_cookie(cookie)
    driver.get('https://mp.toutiao.com/profile_v3/graphic/publish')
    print(driver.title)

    # driver.maximize_window()
    # Write the title
    print('writing title')
    write_title = WebDriverWait(driver, timeout).until(lambda d: d.find_element_by_xpath('//*[@id="title"]'))
    write_title.click()
    write_title.send_keys(title)

    # Paste the article body
    print('writing content')
    write_content = WebDriverWait(driver, timeout).until(lambda d: d.find_element_by_xpath(
        '//*[@id="graphic"]/div/div/div[2]/div[1]/div[2]/div[3]/div[2] | //div[contains(@class,"ql-editor")]'))
    write_content.click()
    write_content.clear()
    write_content.send_keys(Keys.SHIFT + Keys.INSERT)
    # time.sleep(1)

    # Check whether the image upload has finished
    try:
        if 'img' in content:
            WebDriverWait(driver, timeout).until(
                lambda d: d.find_element_by_xpath('//div[@class="pgc-img-wrapper"]'))
            print('images uploaded success')
        else:
            print('no image included')
    except Exception:
        print('images uploaded fail')

    # Scroll the page down
    driver.execute_script("window.scrollTo(0, document.body.scrollHeight)")
    time.sleep(1)

    # Add the extension link
    expand_check = WebDriverWait(driver, timeout).until(lambda d: d.find_element_by_xpath(
        '//div[@class="pgc-external-link"]//input[@type="checkbox"]',
    ))
    expand_check.click()
    expand_link_box = WebDriverWait(driver, timeout).until(lambda d: d.find_element_by_xpath(
        '//div[@class="link-input"]//input[@type="text"]',
    ))
    expand_link_box.send_keys(expand_link)
    time.sleep(1)

    # Automatic cover image
    front_img = WebDriverWait(driver, timeout).until(lambda d: d.find_element_by_xpath(
        '//div[@class="article-cover"]/div/div[@class="tui2-radio-group"]/label[3]/div/input',
    ))
    front_img.click()
    time.sleep(1)

    # Save as draft
    save_draft = WebDriverWait(driver, timeout).until(lambda d: d.find_element_by_xpath(
        '//div[@class="publish-footer"]/button[4]/span'))
    save_draft.click()
    time.sleep(1)

    # From the content management page, get the preview link and article ID
    print('get preview_link and article_id')
    # driver.refresh()
    preview_link = WebDriverWait(driver, timeout).until(lambda d: d.find_element_by_xpath(
        '//div[@id="article-list"]//div[@class="master-title"][1]/a')).get_attribute('href')
    article_id = preview_link.split('=')[-1]
    print(preview_link, article_id)
    time.sleep(1)


    content_management = WebDriverWait(driver, timeout).until(lambda d: d.find_element_by_link_text('内容管理'))
    content_management.click()
    time.sleep(1)
    driver.refresh()
    preview_link = WebDriverWait(driver, timeout).until(lambda d: d.find_element_by_xpath(
        '//*[@id="article-list"]/div[2]/div/div/div[1]/div/a')).get_attribute('href')
    article_id = preview_link.split('=')[-1]

    index_page = WebDriverWait(driver, timeout).until(lambda d: d.find_element_by_xpath('//a[@class="shead_logo"]'))
    index_page.click()
    driver.get('https://mp.toutiao.com/profile_v3/index')

    print(r.scard('cookie_pool_toutiao'))
    return preview_link, article_id


if __name__ == "__main__":
    print('start')
    start_time = time.time()
    title = 'Children'
    content = '<p>cute</p><p><img class="wscnph" src="http://img.mp.itc.cn/upload/20170105/1a7095f0c7eb4316954dda4a8b93b88c_th.jpg" /></p>'
    expand_link = 'https://www.cnblogs.com/Summer-skr--blog/'
    img = ''
    preview_link, article_id = toutiao_save_and_preview(title, content, expand_link)
    print(preview_link)
    print(article_id)
    finish_time = time.time()
    print(finish_time - start_time)

 Example 6: Scraping JD.com

from selenium import webdriver
import time


class JdSpider(object):
    def __init__(self):
        self.i = 0
        self.url = 'https://www.jd.com/'
        self.browser = webdriver.Chrome()

    # Load the page and navigate to the product listing
    def get_html(self):
        self.browser.get(self.url)
        self.browser.find_element_by_xpath('//*[@id="key"]').send_keys('爬虫书')  # type "爬虫书" (books on web scraping) into the search box
        self.browser.find_element_by_xpath('//*[@id="search"]/div/div[2]/button').click()  # click search
        time.sleep(3)  # give the product page time to load

    # Parse the page
    def parse_html(self):
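        # Scroll to the bottom of the page by running a JS snippet
        self.browser.execute_script('window.scrollTo(0,document.body.scrollHeight)')
        time.sleep(2)
        # Extract the list of all product nodes (the li elements)
        li_list = self.browser.find_elements_by_xpath('//*[@id="J_goodsList"]/ul/li')
        for li in li_list:
            info_list = li.text.split('\n')
            # '¥' is an assumption: the original prefix was lost, and startswith('') is always True
            if info_list[0].startswith('每满') or info_list[1].startswith('¥'):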
                price = info_list[1]
                name = info_list[2]
                comment = info_list[3]
                shop = info_list[4]
            elif info_list[0].startswith('单件'):
                price = info_list[3]
                name = info_list[4]
                comment = info_list[5]
                shop = info_list[6]
            else:
                price = info_list[0]
                name = info_list[1]
                comment = info_list[2]
                shop = info_list[3]

            print(price, comment, shop, name)
            self.i += 1  # count scraped items so the total printed in main() is meaningful

    # Main entry point
    def main(self):
        self.get_html()
        while True:
            self.parse_html()
            # Decide whether to click "next page": if 'pn-next disabled' is not found, this is not the last page
            if self.browser.page_source.find('pn-next disabled') == -1:
                self.browser.find_element_by_class_name('pn-next').click()
                time.sleep(2)
            else:
                break
        print(self.i)


if __name__ == '__main__':
    spider = JdSpider()
    spider.main()

 

Selenium documentation:

         https://www.seleniumhq.org/docs/

         https://selenium-python.readthedocs.io