How to Choose a Headless Browser for Web Crawlers - 成就云开发者社区

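Everyday browser use follows a simple sequence: launch the browser, open a web page, interact with it. A headless browser runs that same sequence under script control, so it can reproduce real browsing behavior. "Headless" only means that there is no graphical interface and everything runs in the background; it is still a real browser. Its main use here is in crawlers that collect data from the Web.

Before using one in a crawler, check whether your scenario actually calls for it. If the target site's anti-scraping measures are light, plain HTTP requests are enough and a headless browser is the wrong tool. If the site layers several checks, such as login verification or JavaScript-based anti-scraping, and your developers are not in a position to analyze the site's behavior, a headless browser that passes for a normal user is the better choice. It should be paired with a proxy; a crawler proxy such as the one offered by 亿牛云 (16yun) tends to give better results when accessing the site. The Selenium script below is a simple example of using such a proxy.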
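For the lighter case, where plain HTTP requests are enough, the same proxy can be used without a browser at all. The following is a minimal sketch using only Python's standard `urllib`; `build_proxy_url`, `make_opener`, and `fetch` are hypothetical helper names, and the host, port, and credentials are the placeholder values from this article, not working ones.

```python
import urllib.request
from urllib.parse import quote

# Placeholder values copied from this article; replace them with real ones.
PROXY_HOST = "t.16yun.cn"
PROXY_PORT = "3111"
PROXY_USER = "username"
PROXY_PASS = "password"


def build_proxy_url(host, port, user, password, scheme="http"):
    """Return a proxy URL with percent-encoded credentials (hypothetical helper)."""
    return "{}://{}:{}@{}:{}".format(
        scheme, quote(user, safe=""), quote(password, safe=""), host, port
    )


def make_opener(proxy_url):
    """Build a urllib opener that routes http and https traffic through the proxy."""
    handler = urllib.request.ProxyHandler({"http": proxy_url, "https": proxy_url})
    return urllib.request.build_opener(handler)


def fetch(url, opener, timeout=10):
    """Fetch a URL through the opener and return the decoded response body."""
    with opener.open(url, timeout=timeout) as response:
        return response.read().decode("utf-8", errors="replace")
```

With real credentials, `fetch("https://httpbin.org/ip", make_opener(build_proxy_url(PROXY_HOST, PROXY_PORT, PROXY_USER, PROXY_PASS)))` should return the proxy's exit IP as reported by httpbin. When the site needs a real browser instead, the Selenium script that follows applies the same proxy settings to Chrome.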

from selenium import webdriver
import string
import zipfile

# Proxy server (product site: www.16yun.cn)
proxyHost = "t.16yun.cn"
proxyPort = "3111"

# Proxy authentication credentials
proxyUser = "username"
proxyPass = "password"


def create_proxy_auth_extension(proxy_host, proxy_port,
                                proxy_username, proxy_password,
                                scheme='http', plugin_path=None):
    # Build a small Chrome extension (zip file) that sets the proxy
    # and answers the proxy's authentication prompt.
    if plugin_path is None:
        plugin_path = r'/tmp/{}_{}@t.16yun.zip'.format(proxy_username, proxy_password)

    # Manifest V2 extension. Recent Chrome releases are phasing out
    # Manifest V2, so check that your browser version still loads it.
    manifest_json = """
    {
        "version": "1.0.0",
        "manifest_version": 2,
        "name": "16YUN Proxy",
        "permissions": [
            "proxy",
            "tabs",
            "unlimitedStorage",
            "storage",
            "<all_urls>",
            "webRequest",
            "webRequestBlocking"
        ],
        "background": {
            "scripts": ["background.js"]
        },
        "minimum_chrome_version":"22.0.0"
    }
    """

    background_js = string.Template(
        """
        var config = {
            mode: "fixed_servers",
            rules: {
                singleProxy: {
                    scheme: "${scheme}",
                    host: "${host}",
                    port: parseInt(${port})
                },
                bypassList: ["localhost"]
            }
          };

        chrome.proxy.settings.set({value: config, scope: "regular"}, function() {});

        function callbackFn(details) {
            return {
                authCredentials: {
                    username: "${username}",
                    password: "${password}"
                }
            };
        }

        chrome.webRequest.onAuthRequired.addListener(
            callbackFn,
            {urls: ["<all_urls>"]},
            ['blocking']
        );
        """
    ).substitute(
        host=proxy_host,
        port=proxy_port,
        username=proxy_username,
        password=proxy_password,
        scheme=scheme,
    )
    # Debug output; note that this prints the proxy credentials.
    print(background_js)

    with zipfile.ZipFile(plugin_path, 'w') as zp:
        zp.writestr("manifest.json", manifest_json)
        zp.writestr("background.js", background_js)

    return plugin_path


proxy_auth_plugin_path = create_proxy_auth_extension(
    proxy_host=proxyHost,
    proxy_port=proxyPort,
    proxy_username=proxyUser,
    proxy_password=proxyPass)

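option = webdriver.ChromeOptions()

option.add_argument("--start-maximized")

# If you get a chrome-extensions error, try:
# option.add_argument("--disable-extensions")

option.add_extension(proxy_auth_plugin_path)

# Hide some of the automation flags that WebDriver sets
# option.add_experimental_option('excludeSwitches', ['enable-automation'])

# Selenium 4 takes the driver path through a Service object; the older
# chrome_options= and executable_path= keywords were deprecated and have
# been removed in recent Selenium 4 releases.
from selenium.webdriver.chrome.service import Service

driver = webdriver.Chrome(
    options=option,
    service=Service(executable_path="./chromedriver")
)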

# Override the navigator.webdriver getter
# script = '''
# Object.defineProperty(navigator, 'webdriver', {
# get: () => undefined
# })
# '''
# driver.execute_cdp_cmd("Page.addScriptToEvaluateOnNewDocument", {"source": script})


driver.get("https://httpbin.org/ip")
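
Note that when driving the browser this way, you need to check that the versions are consistent; the relevant help documentation has the details. If they are not, the program may still run, but proxy authentication can fail and a pop-up will ask you to enter the credentials manually.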
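One way to catch a version mismatch early is to compare what the running session reports. The following is a rough sketch, assuming the versions in question are those of Chrome and chromedriver, and assuming the session capabilities expose `browserVersion` and `chrome.chromedriverVersion` as chromedriver normally does; `major_version`, `versions_match`, and `check_driver` are hypothetical helper names, and comparing only the major version is a simplification.

```python
def major_version(version):
    """Return the leading integer of a version string such as '114.0.5735.90 (abc)'."""
    return int(version.strip().split(".")[0])


def versions_match(browser_version, driver_version):
    """Return True when the browser and the driver share the same major version."""
    return major_version(browser_version) == major_version(driver_version)


def check_driver(driver):
    """Compare the versions reported in a Selenium Chrome session's capabilities.

    Assumes the capability keys 'browserVersion' and
    'chrome' -> 'chromedriverVersion' are present.
    """
    caps = driver.capabilities
    browser = caps["browserVersion"]
    chromedriver = caps["chrome"]["chromedriverVersion"]
    if not versions_match(browser, chromedriver):
        raise RuntimeError(
            "Chrome {} and chromedriver {} differ".format(browser, chromedriver)
        )
    return browser, chromedriver
```

Calling `check_driver(driver)` right after the browser starts, before the first `driver.get(...)`, would surface a mismatch as an exception instead of an unexpected authentication pop-up.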