Python网络爬虫及自动化--获取页面cookie、headers

一、Selenium库webdirver类

1、获取cookie，driver.get_cookies()

报错信息：

说明是驱动问题，驱动和浏览器不匹配，且提示不要用64位ie驱动ie10or11，即使是在64位的电脑上。

IE驱动包下载地址：

https://github.com/SeleniumHQ/selenium/wiki/InternetExplorerDriver

下载替换IEDriverServer.exe，ok！

然后将获取的cookies转换:

cookie = [item["name"] + "=" + item["value"] for item in cookies ]
cookiestr = ';'.join(item for item in cookie)
headers_cookie ={
           "Cookie": cookiestr         # 通过接口请求时需要cookies等信息
}
response = requests.post(url, data=body, headers=headers_cookie)

这样就可以在做UI自动化的时候，通过接口获取到数据来进行相关测试。

获取包含选中元素的HTML：get_attribute('outerHTML')

获取元素内的全部HTML：get_attribute('innerHTML')

获取元素标签的内容：get_attribute(‘textContent’)

元素没找到返回<class 'NoneType'>

2、获取请求headers，

# 获取请求头信息
agent = driver.execute_script("return navigator.userAgent")
print(agent) # 查看请求头是否更改。

注意：由于IE浏览器安全设置，driver.page_source为空，By.ID等无法定位元素

改用FireFox，使用driver.find_element定位元素需要等待页面加载结束，使用time.sleep(2)等待

from selenium import webdriver  #导入驱动模块
from selenium.webdriver.common.by import By
import time

driver =webdriver.Firefox()       #创建相应浏览器驱动对象
driver.get("https://passport.neea.edu.cn/NCRELogin?ReturnUrl=https://ncre-bm.neea.cn/Home/VerifyPassport/?LoginType=0|61&Safe=1")  #使用驱动对象打开网站
'''
cookie=driver.get_cookies()
print(cookie)
# 获取请求头信息 
agent = driver.execute_script("return navigator.userAgent") 
print(agent) # 查看请求头是否更改。
driver.page_source.encode("utf-8") 
s=driver.page_source
print(s)
'''
#time.sleep(3)
driver.find_element(By.NAME,"txtUserName").send_keys("xx")#姓名
pw = driver.find_element(By.ID,"txtPassword")#.send_keys("yy")#姓名
#driver.find_element(By.LINK_TEXT,"在线报名")click()
print(pw.get_attribute("outerHTML"))
input('stop')
#driver.save_screenshot('1.png')

二、requests库

1、安装requests库：python -m pip install requests

response = requests.get(url)

response.encoding="utf-8" #注意： get后，必须使用encoding指定正确页面编码，否则为乱码

requests默认是keep-alive的，可能没有释放，加参数 headers={'Connection':'close'}

response的常用方法.

response.text：url响应的网页内容，字符串形式

response.content：url响应网页内容（二进制形式）

response.status_code：http请求的返回状态，200就是访问成功，404则失败

判断请求是否成功

assert response.status_code==200

如果是200，不会有任何反应

如果不是200，则报错

2、获取headers

response.headers #获取响应头

response.request.headers #获取请求头

这是默认请求头

当你用默认请求头去访问百度网站，只会返回一小段的内容，而用浏览器去访问，就有非常多的内容。因为服务器识别出默认的请求头不是一个正常的浏览器，所以只会返回一点。

F12打开开发人员工具，复制User-Agent值。

headers={"User-Agent": "Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/86.0.4240.198 Safari/537.36"}

使用方法：requests.get(url,headers=headers)

p={“wd”:“python”}
使用方法：request.get(url,params=p,headers=headers}

3、获取和处理cookie

response.cookies #获取cookies，为RequestsCookieJar数据类型

cookie = response.cookies.get_dict() #将cookies转为字典

requests.utils.dict_from_cookiejar(response.cookies) #将RequestsCookieJar类型的cookies转换成字典

4、常见错误

未指定请求头，报错：

未指定页面编码，显示乱码：

可以使用正则根据页面charset自动设置页面编码

import requests
import re

url = 'https://www.ip138.com/post/'
headers = { 
    #"Content-Type": "application/json",
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/86.0.4240.198 Safari/537.36" 
} 
res = requests.get(url,headers=headers)
#根据页面charset设置编码
regex = re.compile(r'<meta.*charset=[\'"]?([\w-]+)[\'"]?',re.I)
charset = regex.findall(res.text)
res.encoding = charset[0] if len(charset)>0 else "utf-8"
print(res.text)

5、Session方式访问

import requests
import re

url = 'https://www.ip138.com/post/'
headers = { 
    #"Content-Type": "application/json",
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/86.0.4240.198 Safari/537.36" 
} 

session = requests.Session()
payload = {'areacode':'P610481000000','username':'P610481000000','password':'xxxxxx','code':'4087'}
res = session.post('http://10.5.92.120:9080/pis/doLogin.html', data=payload)
#根据页面charset设置编码
regex = re.compile(r'<meta.*charset=[\'"]?([\w-]+)[\'"]?',re.I)
charset = regex.findall(res.text)
res.encoding = charset[0] if len(charset)>0 else "utf-8"
print(res.cookies.get_dict())
res = session.get("http://10.5.92.120:9080/pis/errorDataInit.html")
print(res.text)

其中验证码未进行识别，需要人工确认。