A first web-scraping project for Python beginners: scraping a whole site of novels

Preface

Many free resources online can only be read in the browser, with no download option. Using novels as the example, this article shows how to download content that the site only lets you view.

What you will learn:

  • requests
  • CSS selectors
  • a strategy for scraping a whole novel site

Development environment:

  • Version: Anaconda 5.2.0 (Python 3.6.5)
  • Editor: PyCharm Community Edition

Let's write some code:

1. Importing the tools

import requests
import parsel

2. Faking a browser environment

headers = {
    # "Cookie": "bcolor=; font=; size=; fontcolor=; width=; Hm_lvt_3806e321b1f2fd3d61de33e5c1302fa5=1596800365,1596800898; Hm_lpvt_3806e321b1f2fd3d61de33e5c1302fa5=1596802442",
    "Host": "www.shuquge.com",
    "Referer": "http://www.shuquge.com/txt/8659/index.html",
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/81.0.4044.92 Safari/537.36",
}

3. Parsing the site and scraping the novel

def download_one_chapter(url_chapter, book):
    """Download one chapter of the novel"""
    # The URL and headers were worked out in the browser's developer tools
    response = requests.get(url_chapter, headers=headers)
    # apparent_encoding guesses the page encoding; it is right ~99% of the time
    response.encoding = response.apparent_encoding
    # print(response.text)

    # Extracting the data. Possible tools: bs4, parsel;
    # possible selector languages: xpath, css, re.
    # Turn the HTML into a selector object; when a tag appears more than
    # once, ids and classes let us narrow the selection further.
    sel = parsel.Selector(response.text)
    # The chapter title sits in the <h1> tag
    h1 = sel.css('h1::text')
    title = h1.get()
    print(title)

    # All text nodes inside the element with id="content"
    content = sel.css('#content ::text').getall()
    # print(content)
    # text = "".join(content)
    # print(text)

    # Write the data; mode 'a' appends, so every chapter of the book
    # lands in the same file instead of overwriting the previous one
    # with open(title + '.txt', mode='a', encoding='utf-8') as f:
    with open(book + '.txt', mode='a', encoding='utf-8') as f:
        f.write(title)
        f.write('\n')
        for line in content:
            f.write(line.strip())
            f.write('\n')

"""爬取一本小说 会有很多章"""

download_one_chapter('http://www.shuquge.com/txt/8659/2324752.html')

download_one_chapter('http://www.shuquge.com/txt/8659/2324753.html')

def download_one_book(book_url):
    """Download every chapter of one book from its table-of-contents page"""
    response = requests.get(book_url, headers=headers)
    response.encoding = response.apparent_encoding
    html = response.text
    sel = parsel.Selector(html)
    # The book title sits in the <h2> tag of the index page
    title = sel.css('h2::text').get()

    # Relative links to every chapter, taken from the chapter list
    index_s = sel.css('body > div.listmain > dl > dd > a::attr(href)').getall()
    print(index_s)
    for index in index_s:
        print(book_url[:-10] + index)
        one_chapter_url = book_url[:-10] + index
        download_one_chapter(one_chapter_url, title)

Two things this script still lacks:

1. Exception handling with try/except.

2. Retrying on error: after a failure, try again, or record the failed URL and request it later.

Everything needed to download one book:

download_one_book('http://www.shuquge.com/txt/8659/index.html')
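The retry idea above can be sketched with a small wrapper. `fetch_with_retry`, `flaky`, and `failed_urls` are hypothetical names for illustration; in the real script `fetch` would be the `requests.get` call:

```python
import time

failed_urls = []  # URLs to request again in a second pass

def fetch_with_retry(fetch, url, retries=3, delay=1):
    """Try fetch(url) up to `retries` times; return None if every attempt fails."""
    for attempt in range(1, retries + 1):
        try:
            return fetch(url)
        except Exception as exc:
            print(f"attempt {attempt} failed for {url}: {exc}")
            time.sleep(delay)
    return None

# Stand-in for requests.get(url, headers=headers) that always errors
def flaky(url):
    raise ConnectionError("simulated network error")

url = 'http://www.shuquge.com/txt/8659/2324752.html'
if fetch_with_retry(flaky, url, delay=0) is None:
    failed_urls.append(url)  # record it so a later pass can retry
```

Recording failures instead of crashing means one bad chapter no longer aborts the whole book.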

download_one_book('http://www.shuquge.com/txt/122230/index.html')
download_one_book('http://www.shuquge.com/txt/117456/index.html')

Download each chapter from its own URL; download a whole book by walking its table-of-contents page.

Downloading the whole site then means: download every category of novel, and within each category download every listing page of books.
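That plan can be sketched as URL generation first. The category ids and the `/category/{id}_{page}.html` pattern below are assumptions for illustration only, not the site's real scheme; check the actual URLs in the browser before using it:

```python
def category_page_urls(base, category_ids, pages_per_category):
    """Build the listing-page URL of every page of every category.
    The '/category/{id}_{page}.html' pattern is a hypothetical example."""
    urls = []
    for cat in category_ids:
        for page in range(1, pages_per_category + 1):
            urls.append(f"{base}/category/{cat}_{page}.html")
    return urls

# Each listing page would then be fetched, its book links extracted with a
# CSS selector, and every book index page passed to download_one_book().
pages = category_page_urls('http://www.shuquge.com', [1, 2, 3], 2)
print(len(pages))  # 3 categories x 2 pages = 6 URLs
```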

What you see after running the code: