Scraping Taobao Data with Python and Selenium

Introduction

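I previously scraped AliExpress data with Python; here I use Selenium to scrape Taobao. The overall logic is much the same, but when Selenium opens a Taobao page the site asks us to log in and shows a slider (drag-style) CAPTCHA, so we use Selenium to simulate those actions as well.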

Problems Encountered

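Login: Taobao forces us to log in before the product list can be accessed. My approach is to read the page title: if the title of the page we land on matches the login page's title, we run our own login logic.
CAPTCHA: In practice Taobao uses a slider CAPTCHA, which Selenium can drag for us. However, dragging the slider to the end in a single motion fails verification. I drag it a fixed distance, pause for 0.1 seconds, and then drag it the rest of the way to imitate how a person drags.
Exception handling: Shops style their detail pages differently, so some elements may not be found. Handle the exception raised when an element is missing (in fact every element lookup needs exception handling, because a crawler will always hit errors sooner or later).
Rate limiting: Requests that arrive too quickly may get the IP banned, so I add a delay after every operation to avoid moving too fast.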
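
The stepped drag described under "CAPTCHA" can be generalized a little. Below is a minimal sketch, not the approach used in the script further down: `split_drag` and `drag_slider` are hypothetical helper names, and the idea of cutting the distance into random uneven pieces is my own assumption about what looks more human. `drag_slider` only relies on the `click_and_hold`, `move_by_offset`, `pause`, `release` and `perform` methods that Selenium's `ActionChains` provides.

```python
import random


def split_drag(distance, steps=4, seed=None):
    """Split a horizontal drag distance into uneven positive integer offsets."""
    if steps < 1 or distance < steps:
        raise ValueError("distance must be at least as large as steps")
    rng = random.Random(seed)
    # Pick steps-1 distinct cut points inside (0, distance) and take the gaps.
    cuts = sorted(rng.sample(range(1, distance), steps - 1))
    bounds = [0] + cuts + [distance]
    return [b - a for a, b in zip(bounds, bounds[1:])]


def drag_slider(actions, handle, offsets, pause=0.1):
    """Queue a press, several short moves with pauses, and a release, then run them.

    `actions` is expected to be a selenium ActionChains instance (assumption).
    """
    actions.click_and_hold(handle)
    for dx in offsets:
        actions.move_by_offset(dx, 0)
        actions.pause(pause)
    actions.release()
    actions.perform()
```

With a real browser this would be called as `drag_slider(ActionChains(browser), source, split_drag(258))`, where 258 is the total of the two offsets (108 and 150) used in the script below. Whether Taobao accepts such a drag is not something this sketch can guarantee.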

Chrome now supports headless mode, which lets you scrape data without opening the browser UI:

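from selenium import webdriver
from selenium.webdriver.chrome.options import Options

chrome_options = Options()
chrome_options.add_argument('--headless')
# Recent Selenium releases take the options object via `options=`
driver = webdriver.Chrome(options=chrome_options)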
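
While debugging the login and slider steps it can help to keep the browser window visible and only switch to headless mode afterwards. The following is a small sketch of one way to toggle that. `chrome_arguments` is a hypothetical helper of mine, and the default window size is an arbitrary assumption; `--headless` and `--window-size` are standard Chrome command-line switches.

```python
def chrome_arguments(headless=True, width=1920, height=1080):
    """Return the Chrome command-line switches for one crawl session."""
    args = ["--window-size={},{}".format(width, height)]
    if headless:
        args.append("--headless")
    return args


# Usage with Selenium (needs Chrome and a matching chromedriver installed):
# from selenium import webdriver
# from selenium.webdriver.chrome.options import Options
# options = Options()
# for arg in chrome_arguments(headless=False):
#     options.add_argument(arg)
# driver = webdriver.Chrome(options=options)
```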

Code

import time

from selenium import webdriver
from selenium.common.exceptions import NoSuchElementException
from selenium.webdriver import ActionChains
from selenium.webdriver.common.by import By

browser = webdriver.Chrome()
browser.get('https://s.taobao.com/search?q=ipad')


def roll(height=800):
    time.sleep(1)
    for i in range(15):
        script = "window.scrollBy(0," + str(height) + ")"
        time.sleep(2)
        browser.execute_script(script)
    browser.execute_script("window.scrollBy(0," + str(-height) + ")")
    time.sleep(1)


title = browser.title
title_after_strip = title.strip()
# If the title matches the login page, enter the username and password
if title_after_strip == '淘宝网 - 淘！我喜欢':
    actions = ActionChains(browser)
    username = browser.find_element(By.ID, 'fm-login-id')
    password = browser.find_element(By.ID, 'fm-login-password')
    username.send_keys('your-username')  # placeholder: replace with your account
    time.sleep(3)
    password.send_keys('your-password')  # placeholder: replace with your password
    bt = browser.find_element(By.XPATH, "//div[@class='fm-btn']/button")
    bt.click()
    time.sleep(1)
    # Locate the handle of the slider CAPTCHA
    source = browser.find_element(By.XPATH, "//div[@id='nocaptcha-password']//span[@id='nc_1_n1z']")
    time.sleep(0.3)
    # Press the handle, drag part of the way, pause 0.1 s to imitate a person,
    # then finish the drag. Queued actions only run when perform() is called,
    # so the pause has to be part of the chain rather than a time.sleep().
    actions.click_and_hold(source).move_by_offset(108, 0).pause(0.1)
    actions.move_by_offset(150, 0).release().perform()
    time.sleep(2)
    bt = browser.find_element(By.XPATH, "//div[@class='fm-btn']/button")
    bt.click()
print('-------------')
time.sleep(20)

# Scrape the product list on the current page
def getData():
    itemList = browser.find_element(By.CLASS_NAME, 'm-itemlist')
    items = itemList.find_element(By.CLASS_NAME, 'items')
    div = items.find_elements(By.XPATH, './div')
    for k in div:
        price = k.find_element(By.XPATH, './/strong')
        print(price.get_attribute("outerHTML"))
        pic = k.find_element(By.XPATH, ".//div[@class='pic']")
        # The link that opens the product detail page
        a = pic.find_element(By.XPATH, './a')
        # Scrape the product details
        getProductDetail(a)
    # Move on to the next page of results
    page = browser.find_element(By.ID, 'mainsrp-pager')
    nextPage = page.find_element(By.CSS_SELECTOR, "[class='J_Ajax num icon-tag']")
    nextPage.click()


# Scrape the product details
def getProductDetail(a):
    a.click()
    time.sleep(1)
    # Get the handles of all open browser windows
    handles = browser.window_handles
    # Clicking a product opens a new window, so switch to the most recent one
    browser.switch_to.window(handles[-1])
    time.sleep(5)
    try:
        product = browser.find_element(By.XPATH, "//div[@id='detail']//div[@class='tb-detail-hd']/h1/a")
        print(product.get_attribute("innerHTML"))
    except NoSuchElementException:
        print('Scrape failed')
    # Close the current (detail) window
    browser.close()
    # Switch back to the product list window
    browser.switch_to.window(handles[0])


for i in range(50):
    print(str(i)+"---------------------------------------------------")
    getData()
    time.sleep(10)
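
The exception handling and delays discussed under "Problems Encountered" are repeated throughout the script, so they could be pulled into two small helpers. This is only a sketch under my own assumptions: `polite_sleep` and `attribute_or_default` are hypothetical names, and the random jitter is an addition that the script above does not use. Neither helper imports Selenium; the lookup is passed in as a callable.

```python
import random
import time


def polite_sleep(base, jitter=0.5, sleep=time.sleep):
    """Sleep for `base` seconds plus a random extra of up to `jitter` seconds."""
    delay = base + random.uniform(0, jitter)
    sleep(delay)
    return delay


def attribute_or_default(find, attribute, default=None, errors=(Exception,)):
    """Call `find()` and read one attribute from the element it returns.

    Returns `default` if the lookup raises one of `errors`. With Selenium,
    pass errors=(NoSuchElementException,) so unrelated bugs still surface.
    """
    try:
        return find().get_attribute(attribute)
    except errors:
        return default
```

Inside `getData` this might be used as `attribute_or_default(lambda: k.find_element(By.XPATH, './/strong'), 'outerHTML', errors=(NoSuchElementException,))`, so that one oddly styled shop does not abort the whole loop.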