How to Fetch and Process Website Data with Python

1. Choosing a Network Library

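Python has quite a few modules for making HTTP requests, and we use them for web scraping or for talking to RESTful APIs. Commonly recommended ones include urllib3 and requests. These libraries are popular for good reasons: some, such as AIOHTTP and HTTPX, support asynchronous requests, and requests supports OAuth authentication among other practical features. Asynchronous support lets you issue many requests concurrently, which is handy for jobs such as crawling a website.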

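If you only need a few simple requests, Python's built-in urllib.request or http.client will also do. They need no third-party packages installed through pip, which makes the code easier to port across platforms.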
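
As a rough illustration of the standard-library route, the sketch below fetches a page with urllib.request only. The HelloHandler class, the fetch helper, and the throwaway local server on a random port are assumptions made for this example so that it runs on its own; they are not part of the original walkthrough.

```python
import threading
import urllib.request
from http.server import BaseHTTPRequestHandler, HTTPServer


class HelloHandler(BaseHTTPRequestHandler):
    """Hypothetical test handler: answers every GET with a fixed body."""

    def do_GET(self):
        body = b"I'm 200/OK"
        self.send_response(200)
        self.send_header("Content-Type", "text/plain; charset=utf-8")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, format, *args):
        pass  # keep the demo output quiet


def fetch(url, timeout=5.0):
    """Return (status, text) for a GET request using only urllib.request."""
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        charset = resp.headers.get_content_charset() or "utf-8"
        return resp.status, resp.read().decode(charset)


# Port 0 asks the OS for any free port, so the sketch does not need a fixed one.
server = HTTPServer(("127.0.0.1", 0), HelloHandler)
threading.Thread(target=server.serve_forever, daemon=True).start()
try:
    status, text = fetch(f"http://127.0.0.1:{server.server_port}/")
    print(status, text)
finally:
    server.shutdown()
    server.server_close()
```

The trade-off is verbosity: decoding the body and handling errors are left to the caller, which is part of what requests takes care of.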

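There are other network libraries as well. httpx is one, though it may be less stable than requests. yarl builds URLs conveniently, in a style similar to pathlib. fsspec abstracts file systems, cloud nodes, URLs, and remote service endpoints. Try them out if you are interested.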

This article uses the requests library, with code, to walk through some points to watch for when scraping websites.

2. Requests and Responses

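When scraping a website, the application-layer protocol we use to talk to it is usually HTTP/HTTPS. Here we use nc/ncat on Linux to simulate a website that serves HTTP, and then use requests to write a simple request in Python.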

2.1 Server

This is a TCP service listening on port 8090. Whenever a TCP client connects to this port, we send back the following response:

HTTP/1.1 200 OK\r\n

"I'm 200/OK"

Language: bash

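# Reply to every connection on port 8090 with a fixed 200 response.
while true; do { echo -e 'HTTP/1.1 200 OK\r\n'; echo "I'm 200/OK"; } | nc -q0 -l 8090; done
# Variant: serve the body from a file named 200_content instead of an inline string.
# while true; do { echo -e 'HTTP/1.1 200 OK\r\n'; cat 200_content; } | nc -q0 -l 8090; done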
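
For readers without nc at hand, the same idea can be sketched in Python with a raw socket. This is only an approximation of the shell loop: it serves a single connection, picks a free port instead of 8090, and the names RESPONSE and serve_once are made up for the example.

```python
import socket
import threading

# Status line, an empty line that ends the headers, then the body.
RESPONSE = b"HTTP/1.1 200 OK\r\n\r\nI'm 200/OK\n"


def serve_once(listener):
    """Accept one TCP connection, send the canned response, then close it."""
    conn, _addr = listener.accept()
    with conn:
        conn.recv(4096)         # read (and ignore) the request
        conn.sendall(RESPONSE)  # closing the socket marks the end of the body


listener = socket.create_server(("127.0.0.1", 0))  # port 0: any free port
port = listener.getsockname()[1]
threading.Thread(target=serve_once, args=(listener,), daemon=True).start()

# Play the client with a raw socket to see exactly what comes back.
with socket.create_connection(("127.0.0.1", port), timeout=5) as client:
    client.sendall(b"GET / HTTP/1.1\r\nHost: 127.0.0.1\r\n\r\n")
    reply = b""
    while chunk := client.recv(4096):
        reply += chunk
listener.close()
print(reply.decode())
```

Like the nc version, this server ignores the request entirely, which is why it is suitable only for experiments.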

2.2 Client

A few lines of code make up a complete HTTP request. Once the HTTP response status code comes back as 200, we print the content that was read.

Language: python

import requests

# Request the nc test server from section 2.1, which listens on port 8090.
r = requests.get('http://127.0.0.1:8090')
print(r.status_code)

if r.status_code == 200:
    print(r.text)

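3. Details to Watch For

When scraping websites we sometimes also run into authentication, 30x redirects, data parsing, and similar issues. The sections below explain how the requests library, or Python itself, handles each of them.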

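3.1 Site Login and Authorization

Many websites require authentication, and there are many schemes for it. Below are a few examples, ordered from simple to complex.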

3.1.1 Basic Authentication

HTTP Basic Auth is the simplest of these schemes. Requests supports it fully, and making a request with it takes very little code:

Language: python

import requests

# Pass a (username, password) tuple to use HTTP Basic Auth.
r = requests.get('http://127.0.0.1:8090', auth=('user', 'pass'))
print(r.status_code)

if r.status_code == 200:
    print(r.text)

On the nc server side, the request then shows up with an Authorization: Basic header.
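
To make that header concrete: Basic Auth joins the username and password with a colon and base64-encodes the result. The helper below is a hypothetical stand-in that reproduces the value by hand; it is not the code Requests runs internally.

```python
import base64


def basic_auth_header(username, password):
    """Build the value of the Authorization header for HTTP Basic Auth."""
    token = base64.b64encode(f"{username}:{password}".encode("latin-1"))
    return "Basic " + token.decode("ascii")


print("Authorization:", basic_auth_header("user", "pass"))
# prints: Authorization: Basic dXNlcjpwYXNz
```

Base64 is an encoding, not encryption, so anyone who can read the request can recover the password. Basic Auth is therefore best used over HTTPS only.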

3.1.2 Digest Authentication

This is another very popular HTTP authentication scheme, and Requests supports it out of the box as well.

Language: python

import requests
from requests.auth import HTTPDigestAuth

# HTTPDigestAuth answers the server's 401 challenge with a hashed credential.
r = requests.get('http://127.0.0.1:8090', auth=HTTPDigestAuth('user', 'pass'))
print(r.status_code)

if r.status_code == 200:
    print(r.text)
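
Digest Auth never sends the password itself. In the common MD5 variant with qop=auth, the client answers the server's 401 challenge with a hash built from the credentials, the request line, and the server's nonce. The sketch below reproduces that calculation by hand using the sample values from RFC 2617; digest_response and md5_hex are illustrative names, and Requests performs the equivalent steps for you.

```python
import hashlib


def md5_hex(text):
    """Hex MD5 digest of a string."""
    return hashlib.md5(text.encode("utf-8")).hexdigest()


def digest_response(username, realm, password, method, uri,
                    nonce, nc, cnonce, qop="auth"):
    """Compute the 'response' field of a Digest Authorization header."""
    ha1 = md5_hex(f"{username}:{realm}:{password}")  # who is asking
    ha2 = md5_hex(f"{method}:{uri}")                 # what is being requested
    return md5_hex(f"{ha1}:{nonce}:{nc}:{cnonce}:{qop}:{ha2}")


# Sample values taken from the example in RFC 2617, section 3.5.
response = digest_response(
    username="Mufasa", realm="testrealm@host.com", password="Circle Of Life",
    method="GET", uri="/dir/index.html",
    nonce="dcd98b7102dd2f0e8b11d0f600bfb0c093", nc="00000001", cnonce="0a4f113b",
)
print(response)
```

Because the scheme depends on a 401 challenge that carries a nonce, a test server that always answers 200 will likely never trigger the Digest exchange.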

3.1.3 OAuth 1 Authentication

Another well-known scheme is OAuth. Supporting it requires the requests-oauthlib module, installed with pip: pip3 install requests-oauthlib

Language: python

import requests
from requests_oauthlib import OAuth1

# The four credentials below are placeholders to replace with your own.
auth = OAuth1('YOUR_APP_KEY', 'YOUR_APP_SECRET',
              'USER_OAUTH_TOKEN', 'USER_OAUTH_TOKEN_SECRET')

r = requests.get('http://127.0.0.1:8090', auth=auth)
print(r.status_code)

if r.status_code == 200:
    print(r.text)

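On the nc side you can observe the resulting request headers, including an Authorization header that begins with OAuth.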

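3.2 Redirects

When we request a URL we sometimes get a 30x response, which indicates that the URL redirects to the new address given in the HTTP Location header. Requests can follow redirects for you. With this enabled, when it receives a 30x response it issues a new request to the Location address, and keeps doing so until it gets a non-30x response or exceeds the redirect limit.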

We construct a 301 response on port 8080 whose Location target is port 8090 on the local machine.

Language: bash

while true; do { echo -e "HTTP/1.1 301 Moved Permanently\r\nLocation: http://127.0.0.1:8090\r\n"; echo "I'm 301 redirect content."; } | nc -q0 -l 8080; done

At the same time, start a listener on port 8090 that returns a 200 response.

Language: bash

while true; do { echo -e 'HTTP/1.1 200 OK\r\n'; echo "I'm 200/OK"; } | nc -q0 -l 8090; done

The client request then looks like this:

Language: python

import requests

# Start at port 8080, which answers 301 and points to port 8090.
r = requests.get('http://127.0.0.1:8080', allow_redirects=True)
print(r.status_code)

if r.status_code == 200:
    print(r.text)

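Because allow_redirects=True, the client follows the 301 and directly gets the content at the Location target, that is, the 200 response returned by port 8090. Since allow_redirects already defaults to True, the option can be omitted here.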
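
To see what the redirect handling does, it can help to compare both settings and look at r.history. The sketch below assumes the requests package is installed and replaces the two nc listeners with one throwaway local server; RedirectHandler and the /final path are invented for the example.

```python
import threading
from http.server import BaseHTTPRequestHandler, HTTPServer

import requests


class RedirectHandler(BaseHTTPRequestHandler):
    """Hypothetical handler: /final answers 200, everything else redirects there."""

    def do_GET(self):
        if self.path == "/final":
            body = b"I'm 200/OK"
            self.send_response(200)
        else:
            body = b"I'm 301 redirect content."
            self.send_response(301)
            self.send_header("Location", "/final")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, format, *args):
        pass  # keep the demo output quiet


server = HTTPServer(("127.0.0.1", 0), RedirectHandler)  # port 0: any free port
threading.Thread(target=server.serve_forever, daemon=True).start()
base = f"http://127.0.0.1:{server.server_port}/"
try:
    followed = requests.get(base, timeout=5)                    # default: follow
    raw = requests.get(base, allow_redirects=False, timeout=5)  # stop at the 301
finally:
    server.shutdown()
    server.server_close()

print(followed.status_code, [h.status_code for h in followed.history], followed.url)
print(raw.status_code, raw.headers["Location"])
```

With redirects followed, the final status is 200 and history holds the intermediate 301. With allow_redirects=False the 301 itself is returned, and its Location header can be inspected before deciding what to do next.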

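3.3 Data Parsing

JSON is one of the more commonly used message formats.

Taking JSON as the example, we can use Python's json library to process the site's response further.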

We rewrite the port 8090 listener that returns 200 so that it returns a JSON string.

Language: bash

while true; do { echo -e 'HTTP/1.1 200 OK\r\n'; echo '{"name": "Bob", "languages": ["English", "French"]}';  } | nc -q0 -l 8090; done

The client code then handles the JSON as follows:

Language: python

import json

import requests

r = requests.get('http://127.0.0.1:8090', allow_redirects=True)
print(r.status_code)

if r.status_code == 200:
    print(r.text)
    # Parse the JSON body into a Python dict.
    person_dict = json.loads(r.text)
    print(person_dict['languages'])
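
Real responses are not always valid JSON, so a slightly more defensive variant may be useful. The sketch below works on the same sample body as a plain string; parse_person is a hypothetical helper, and returning None for bad input is just one possible design.

```python
import json

# Same body that the port 8090 listener returns.
body = '{"name": "Bob", "languages": ["English", "French"]}'


def parse_person(text):
    """Parse a JSON object and return (name, languages), or None if unusable."""
    try:
        data = json.loads(text)
    except json.JSONDecodeError:
        return None  # the body was not valid JSON
    if not isinstance(data, dict):
        return None  # valid JSON, but not the object we expected
    return data.get("name"), data.get("languages", [])


print(parse_person(body))               # ('Bob', ['English', 'French'])
print(parse_person("not json at all"))  # None
```

Requests also offers r.json() as a shortcut that parses the response body for you.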