Scrapy crawler

Spider

#0 GitHub

None

#1 Environment

Python 3.6
Scrapy==1.6.0  # install Scrapy with: pip3 install Scrapy

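#2 How the crawler works

#2.1 Core components

Scrapy Engine: handles communication, signals, and data transfer among the Spider, Item Pipeline, Downloader, and Scheduler.
Scheduler: accepts the Requests sent by the engine, sorts and enqueues them, and hands them back to the engine when the engine asks for them.
Downloader: downloads every Request sent by the Scrapy Engine and returns the resulting Responses to the engine, which passes them to the Spider for processing.
Spider: processes all Responses, parses them to extract the data the Item fields need, and submits the URLs to follow to the engine, which sends them into the Scheduler again.
Item Pipeline: the place where Items obtained by the Spider are post-processed (detailed analysis, filtering, storage, and so on).
Downloader Middlewares: a component you can use to customize and extend the download behavior.
Spider Middlewares: a component you can use to customize and extend the communication between the engine and the Spider (for example, Responses going into the Spider and Requests coming out of it).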

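#2.2 Scrapy's workflow

1 Engine: Hi, Spider! Which website are you handling?
2 Spider: The boss wants me to handle xxxx.com.
3 Engine: Give me the first URL that needs processing.
4 Spider: Here you go, the first URL is xxxxxxx.com.
5 Engine: Hi, Scheduler! I have a request here; please sort it and put it in the queue.
6 Scheduler: OK, I'm working on it, give me a moment.
7 Engine: Hi, Scheduler! Give me the request you have processed.
8 Scheduler: Here you go, this is the request I processed.
9 Engine: Hi, Downloader! Please download this request for me, following the boss's downloader middleware settings.
10 Downloader: OK! Here you go, this is what I downloaded. (On failure: sorry, this request failed to download. The engine then tells the scheduler: this request failed to download, make a note of it and we will download it again later.)
11 Engine: Hi, Spider! Here is the downloaded content, already processed by the boss's downloader middleware; handle it yourself. (Note: responses are handed to the def parse() function by default.)
12 Spider: (after processing the data, for the URLs that need following) Hi, Engine! I have two results: these are the URLs I need to follow, and this is the Item data I extracted.
13 Engine: Hi, Pipeline! I have an item here, please handle it. Scheduler! Here are the URLs to follow, please handle them. Then the loop restarts from step 4 until all the information the boss needs has been collected.
14 Pipeline, Scheduler: OK, doing it right now!

Note: the program stops only when the scheduler has no requests left. (In other words, Scrapy also re-downloads URLs whose download failed.)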
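
The conversation above can be sketched as a toy loop in plain Python. This is not Scrapy's implementation, only an illustration of how requests and items move between the components. The FAKE_WEB dictionary, the example.com URLs, and the function names are all made up for the sketch.

```python
from collections import deque

# Hypothetical in-memory "web": URL -> (links found on the page, item scraped from it).
FAKE_WEB = {
    "http://example.com/": (["http://example.com/a", "http://example.com/b"], None),
    "http://example.com/a": ([], {"name": "page a"}),
    "http://example.com/b": (["http://example.com/a"], {"name": "page b"}),
}


def download(url):
    """Downloader: return the 'response' for a URL."""
    return FAKE_WEB[url]


def parse(response):
    """Spider: split a response into follow-up URLs and items."""
    links, item = response
    items = [item] if item is not None else []
    return links, items


def crawl(start_urls):
    """Engine: move requests and items between the other components."""
    scheduler = deque(start_urls)  # Scheduler: FIFO queue of pending requests
    seen = set(start_urls)         # duplicate filter, so each URL is fetched once
    pipeline_output = []           # Item Pipeline: here it only collects items
    while scheduler:               # stop only when no request is left
        url = scheduler.popleft()
        links, items = parse(download(url))
        pipeline_output.extend(items)
        for link in links:
            if link not in seen:
                seen.add(link)
                scheduler.append(link)
    return pipeline_output


print(crawl(["http://example.com/"]))
```

The loop ends when the queue is empty, which mirrors the note above: the crawl stops only once the scheduler holds no more requests.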

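#3 Building a Scrapy crawler

Create a project (scrapy startproject xxx): create a new crawler project
Define the target (write items.py): decide what you want to scrape
Write the spider (spiders/xxspider.py): write the spider and start crawling pages
Store the content (pipelines.py): design pipelines to store the scraped content

#3.1 Create the project

scrapy startproject mySpider # create a new crawler project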

.
├── mySpider
│   ├── __init__.py
│   ├── items.py
│   ├── middlewares.py
│   ├── pipelines.py
│   ├── settings.py
│   └── spiders
│       └── __init__.py
└── scrapy.cfg

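scrapy.cfg: the project's configuration file.
mySpider/: the project's Python module; code is imported from here.
mySpider/items.py: the project's item definitions (the scraping targets).
mySpider/pipelines.py: the project's pipelines file.
mySpider/settings.py: the project's settings file.
mySpider/spiders/: the directory that holds the spider code.

#3.2 Define the target (mySpider/items.py)

We plan to scrape the movie names and the movie detail-page URLs from http://www.dy100.me/.

Open items.py in the mySpider directory.
An Item defines structured data fields for holding the scraped data. It works a bit like a Python dict, but adds some extra protection that reduces mistakes.
To define an Item, create a class that subclasses scrapy.Item and give it class attributes of type scrapy.Field (you can think of it as the counterpart of models.py in Django).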

import scrapy


class MyspiderItem(scrapy.Item):
    name = scrapy.Field()  # movie name
    info = scrapy.Field()  # URL of the movie's detail page

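#3.3 Write the spider (spiders/itcast.py)

The spider's work has two steps.

Crawling the data

Run the following command in the current directory. It creates a spider named itcast under mySpider/spiders and sets the domain it is allowed to crawl:

scrapy genspider itcast "itcast.cn" # generates itcast.py, which is where the spider's main logic goes

Open itcast.py in the mySpider/spiders directory; it already contains the following code: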

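import scrapy


class ItcastSpider(scrapy.Spider):
    name = "itcast"  # unique identifier
    allowed_domains = ["itcast.cn"]
    start_urls = (
        'http://www.itcast.cn/',  # generated automatically; change it to the URL you need
    )

    def parse(self, response):  # response is the downloaded HTML page
        pass

We could also create itcast.py ourselves and write the code above by hand; the command simply saves us from typing the boilerplate.

To build a Spider, you subclass scrapy.Spider and define three attributes and one method. Strictly speaking, Scrapy only requires name; this tutorial relies on the other two as well.

name = "": the spider's identifying name. It must be unique, so different spiders need different names.

allowed_domains = []: the domains the spider is restricted to. The spider only crawls pages under these domains, and URLs outside them are ignored.

start_urls = (): a tuple or list of URLs to crawl. The spider starts from here, so the first pages downloaded come from these URLs. Further URLs are derived from the pages these starting URLs return.

parse(self, response): the parsing method. It is called after each start URL finishes downloading, with the Response object for that URL as its only argument. Its main jobs are:

Parse the returned page data (response.body) and extract structured data (build items).

Generate Requests for the next-page URLs.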

Change start_urls to the first URL you want to crawl:
start_urls = ("http://www.dy100.me/",) # replace with the URL you want to crawl
Then run it. In the mySpider directory, execute:

scrapy crawl itcast
Extracting the data
Examples of XPath expressions and what they mean:

/html/head/title: selects the <title> element inside the <head> of the HTML document
/html/head/title/text(): selects the text of that <title> element
//td: selects all <td> elements
//div[@class="mine"]: selects all div elements that have the attribute class="mine"
As an example, let's read the titles listed on http://www.dy100.me/. Change itcast.py as follows:

import scrapy


class ItcastSpider(scrapy.Spider):
    name = 'itcast'
    allowed_domains = ['dy100.me']  # must cover the site in start_urls, otherwise followed links are filtered as off-site
    start_urls = [
        'http://www.dy100.me/',
    ]

    def parse(self, response):
        for foo in response.xpath('//*[@id="post_container"]/li/div[2]/h2/a/text()'):
            print("foo:", foo.extract())  # try printing the result

#3.4 Parsing

Getting a single value

response.xpath('//*[@id="blogname"]/a/h1/text()').extract_first()

>>> 电影100|电影天堂
# test data
response.xpath('//*[@id="blogname"]/a/h1')

>>> [<Selector xpath='//*[@id="blogname"]/a/h1' data='<h1>电影100|电影天堂</h1>'>]  # a list whose elements are Selector objects
response.xpath('//*[@id="blogname"]/a/h1').extract()

>>> ['<h1>电影100|电影天堂</h1>']  # a list of strings, one per matched tag
response.xpath('//*[@id="blogname"]/a/h1').extract()[0]

>>> <h1>电影100|电影天堂</h1>  # the first element of the list

response.xpath('//*[@id="blogname"]/a/h1').extract_first()  # also returns the first element of the list
response.xpath('//*[@id="blogname"]/a/h1/text()').extract_first()

>>> 电影100|电影天堂  # the content inside the h1 tag, selected with text()

---
How to get the value of a tag attribute

response.xpath('//*[@id="blogname"]/a/img/@src').extract_first()  # value of the src attribute

response.xpath('//*[@id="blogname"]/a/img/@alt').extract_first()  # value of the alt attribute
How to get the data for all movies

    def parse(self, response):
        # Loop over every li. The XPath copied from the browser points at a single li,
        # so remove the index that follows li.
        for foo in response.xpath('//*[@id="post_container"]/li'):
            print("foo:", foo.xpath('div[2]/h2/a/text()').extract_first())  # movie name; the relative path must not start with a slash
            print("foo:", foo.xpath('div[2]/h2/a/@href').extract_first())  # URL of the detail page
        return None

foo: 女孩 Girl (2018)

foo: http://www.dy100.me/15368.html

foo: 波西米亚狂想曲 Bohemian Rhapsody (2018)

foo: http://www.dy100.me/15365.html

foo: 乔纳斯 Jonas (2018)

foo: http://www.dy100.me/15362.html

foo: 谁先爱上他的 誰先愛上他的 (2018)

foo: http://www.dy100.me/15359.html

foo: 穷人·榴莲·麻药·偷渡客 窮人。榴槤。麻藥。偷渡客 (2012)

foo: http://www.dy100.me/15356.html

foo: 狼屋 La casa lobo (2018)

foo: http://www.dy100.me/15353.html

foo: 巴斯特·斯克鲁格斯的歌谣 The Ballad of Buster Scruggs (2018)

foo: http://www.dy100.me/15350.html

foo: 被抹去的男孩 Boy Erased (2018)

foo: http://www.dy100.me/15344.html

foo: 调琴师 Andhadhun (2018)

foo: http://www.dy100.me/15347.html

foo: 末代独裁 The Last King of Scotland (2006)

foo: http://www.dy100.me/15341.html

foo: 缄默的迷宫 Im Labyrinth des Schweigens (2014)

foo: http://www.dy100.me/15338.html

foo: 你还活着 Du levande (2007)

foo: http://www.dy100.me/15335.html

foo: 绿皮书 Green Book (2018)

foo: http://www.dy100.me/15332.html

foo: 90年代中期 Mid90s (2018)

foo: http://www.dy100.me/15329.html

foo: 无双 無雙 (2018)

foo: http://www.dy100.me/15326.html

foo: 花村 McCabe & Mrs. Miller (1971)

foo: http://www.dy100.me/15323.html

foo: 破浪 Breaking the Waves (1996)

foo: http://www.dy100.me/15320.html

foo: 安娜的旅程 Les Rendez-vous d'Anna (1978)

foo: http://www.dy100.me/15317.html
How to export the data

scrapy crawl itcast # no data is written to a file
scrapy crawl itcast -o info.json # write the output to a file named info.json
    # requires this import at the top of itcast.py: from mySpider.items import MyspiderItem
    def parse(self, response):
        items = []
        for foo in response.xpath('//*[@id="post_container"]/li'):
            name = foo.xpath('div[2]/h2/a/text()').extract_first()
            info = foo.xpath('div[2]/h2/a/@href').extract_first()
            item = MyspiderItem()  # instantiate the item
            item["name"] = name  # assign the movie name
            item["info"] = info  # assign the detail-page URL
            items.append(item)  # add it to the list

        return items  # remember to return the items

However, the Chinese text saved in the file is written as Unicode escape sequences:

[

{"name": "\u5973\u5b69 Girl (2018)", "info": "http://www.dy100.me/15368.html"},

{"name": "\u6ce2\u897f\u7c73\u4e9a\u72c2\u60f3\u66f2 Bohemian Rhapsody (2018)", "info": "http://www.dy100.me/15365.html"},

{"name": "\u4e54\u7eb3\u65af Jonas (2018)", "info": "http://www.dy100.me/15362.html"},

{"name": "\u8c01\u5148\u7231\u4e0a\u4ed6\u7684 \u8ab0\u5148\u611b\u4e0a\u4ed6\u7684 (2018)", "info": "http://www.dy100.me/15359.html"},

{"name": "\u7a77\u4eba\u00b7\u69b4\u83b2\u00b7\u9ebb\u836f\u00b7\u5077\u6e21\u5ba2 \u7aae\u4eba\u3002\u69b4\u69e4\u3002\u9ebb\u85e5\u3002\u5077\u6e21\u5ba2 (2012)", "info": "http://www.dy100.me/15356.html"},

{"name": "\u72fc\u5c4b La casa lobo (2018)", "info": "http://www.dy100.me/15353.html"},

{"name": "\u5df4\u65af\u7279\u00b7\u65af\u514b\u9c81\u683c\u65af\u7684\u6b4c\u8c23 The Ballad of Buster Scruggs (2018)", "info": "http://www.dy100.me/15350.html"},

{"name": "\u88ab\u62b9\u53bb\u7684\u7537\u5b69 Boy Erased (2018)", "info": "http://www.dy100.me/15344.html"},

{"name": "\u8c03\u7434\u5e08 Andhadhun (2018)", "info": "http://www.dy100.me/15347.html"},

{"name": "\u672b\u4ee3\u72ec\u88c1 The Last King of Scotland (2006)", "info": "http://www.dy100.me/15341.html"},

{"name": "\u7f04\u9ed8\u7684\u8ff7\u5bab Im Labyrinth des Schweigens (2014)", "info": "http://www.dy100.me/15338.html"},

{"name": "\u4f60\u8fd8\u6d3b\u7740 Du levande (2007)", "info": "http://www.dy100.me/15335.html"},

{"name": "\u7eff\u76ae\u4e66 Green Book (2018)", "info": "http://www.dy100.me/15332.html"},

{"name": "90\u5e74\u4ee3\u4e2d\u671f Mid90s (2018)", "info": "http://www.dy100.me/15329.html"},

{"name": "\u65e0\u53cc \u7121\u96d9 (2018)", "info": "http://www.dy100.me/15326.html"},

{"name": "\u82b1\u6751 McCabe & Mrs. Miller (1971)", "info": "http://www.dy100.me/15323.html"},

{"name": "\u7834\u6d6a Breaking the Waves (1996)", "info": "http://www.dy100.me/15320.html"},

{"name": "\u5b89\u5a1c\u7684\u65c5\u7a0b Les Rendez-vous d'Anna (1978)", "info": "http://www.dy100.me/15317.html"}

]

Modify settings.py:
FEED_EXPORT_ENCODING = 'utf-8' # with this line, Chinese text is written as-is instead of as Unicode escapes
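
The two forms of output are the same data written differently. The sketch below uses only Python's json module, not Scrapy's exporter, to show that the escaped and the readable form decode to identical values. The sample record is taken from the output above.

```python
import json

item = {"name": "绿皮书 Green Book (2018)", "info": "http://www.dy100.me/15332.html"}

escaped = json.dumps(item)                       # default: non-ASCII becomes \uXXXX
readable = json.dumps(item, ensure_ascii=False)  # characters are kept as-is

print(escaped)
print(readable)
print(json.loads(escaped) == json.loads(readable))
```

So the escaped file is not corrupted; the setting only changes how readable the file is to a person.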

[

{"name": "女孩 Girl (2018)", "info": "http://www.dy100.me/15368.html"},

{"name": "波西米亚狂想曲 Bohemian Rhapsody (2018)", "info": "http://www.dy100.me/15365.html"},

{"name": "乔纳斯 Jonas (2018)", "info": "http://www.dy100.me/15362.html"},

{"name": "谁先爱上他的 誰先愛上他的 (2018)", "info": "http://www.dy100.me/15359.html"},

{"name": "穷人·榴莲·麻药·偷渡客 窮人。榴槤。麻藥。偷渡客 (2012)", "info": "http://www.dy100.me/15356.html"},

{"name": "狼屋 La casa lobo (2018)", "info": "http://www.dy100.me/15353.html"},

{"name": "巴斯特·斯克鲁格斯的歌谣 The Ballad of Buster Scruggs (2018)", "info": "http://www.dy100.me/15350.html"},

{"name": "被抹去的男孩 Boy Erased (2018)", "info": "http://www.dy100.me/15344.html"},

{"name": "调琴师 Andhadhun (2018)", "info": "http://www.dy100.me/15347.html"},

{"name": "末代独裁 The Last King of Scotland (2006)", "info": "http://www.dy100.me/15341.html"},

{"name": "缄默的迷宫 Im Labyrinth des Schweigens (2014)", "info": "http://www.dy100.me/15338.html"},

{"name": "你还活着 Du levande (2007)", "info": "http://www.dy100.me/15335.html"},

{"name": "绿皮书 Green Book (2018)", "info": "http://www.dy100.me/15332.html"},

{"name": "90年代中期 Mid90s (2018)", "info": "http://www.dy100.me/15329.html"},

{"name": "无双 無雙 (2018)", "info": "http://www.dy100.me/15326.html"},

{"name": "花村 McCabe & Mrs. Miller (1971)", "info": "http://www.dy100.me/15323.html"},

{"name": "破浪 Breaking the Waves (1996)", "info": "http://www.dy100.me/15320.html"},

{"name": "安娜的旅程 Les Rendez-vous d'Anna (1978)", "info": "http://www.dy100.me/15317.html"}

]

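How to deal with exported data being appended

With the command scrapy crawl itcast -o info.json, each run appends its new data to info.json. Appending a second JSON array to the file leaves it invalid as JSON, so delete info.json before running the crawl again.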
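
Why appending matters can be shown with the json module alone. The strings below only approximate what the files would hold after two runs (the records are invented): two JSON arrays back to back no longer parse, whereas JSON Lines, with one object per line, still does.

```python
import json

first_run = json.dumps([{"name": "Movie A"}])
second_run = json.dumps([{"name": "Movie B"}])

# Roughly what a .json file holds after two appending runs: "[...][...]".
appended_json = first_run + second_run
try:
    json.loads(appended_json)
    json_ok = True
except json.JSONDecodeError:
    json_ok = False

# JSON Lines keeps one object per line, so appended runs parse line by line.
appended_jsonl = '{"name": "Movie A"}\n' + '{"name": "Movie B"}\n'
rows = [json.loads(line) for line in appended_jsonl.splitlines()]

print(json_ok)
print(rows)
```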
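#4 Saving data
The simplest way to save scraped data in Scrapy is the -o option, which writes a file in the format given by the file extension. There are four main formats. The commands are as follows, starting with JSON:

scrapy crawl itcast -o teachers.json
JSON Lines format; non-ASCII characters are written as Unicode escapes by default

scrapy crawl itcast -o teachers.jsonl
CSV (comma-separated values), which can be opened in Excel

scrapy crawl itcast -o teachers.csv
XML format

scrapy crawl itcast -o teachers.xml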
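
Besides -o, the outline in #3 lists pipelines.py as the place to store scraped content, though this article does not build one. A minimal sketch of such a pipeline follows, assuming you want one JSON object per line. The class name JsonLinesPipeline and the file name movies.jsonl are made up for the example. To use it in the project it would also need to be enabled in settings.py, for example with ITEM_PIPELINES = {"mySpider.pipelines.JsonLinesPipeline": 300}.

```python
import json
import os
import tempfile


class JsonLinesPipeline:
    """Write every item as one JSON object per line."""

    def __init__(self, path="movies.jsonl"):  # hypothetical default file name
        self.path = path

    def open_spider(self, spider):
        # "w" truncates the file, so each crawl starts from an empty file.
        self.file = open(self.path, "w", encoding="utf-8")

    def process_item(self, item, spider):
        self.file.write(json.dumps(dict(item), ensure_ascii=False) + "\n")
        return item  # pass the item on to any later pipeline

    def close_spider(self, spider):
        self.file.close()


# Offline usage: call the hooks by hand, in the order Scrapy calls them during a crawl.
path = os.path.join(tempfile.mkdtemp(), "movies.jsonl")
pipeline = JsonLinesPipeline(path)
pipeline.open_spider(spider=None)
pipeline.process_item({"name": "绿皮书 Green Book (2018)"}, spider=None)
pipeline.close_spider(spider=None)

with open(path, encoding="utf-8") as f:
    lines = f.read().splitlines()
print(lines)
```

Opening the file in write mode is a design choice for this sketch: it sidesteps the appending behavior described earlier, at the cost of discarding the previous crawl's file.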
