Preface

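When we crawl a website, Scrapy on its own is usually enough if the site is small.

When the site is large, however, a single Scrapy instance falls short.

It would be great if several Scrapy instances could crawl together; many hands make light work.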

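Unfortunately, Scrapy has no built-in support for several instances crawling the same site at once, although the official documentation does suggest one workaround:

**Split the site into several parts and hand each part to a different Scrapy instance**

That looks like a solution, but it is a hassle. Partitioning a site is tedious work, after all.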

Now it is time for our protagonist, Scrapy-Redis, to take the stage!

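If you are reading this article you probably already know what Scrapy and Scrapy-Redis are, so the basic concepts are not covered here. By default, Scrapy-Redis treats each entry it reads from Redis as a URL and fetches it with a GET request. When a POST request is needed, overriding the make_request_from_data method is enough. Oddly, I could not find a concise answer to this online; perhaps it is considered too simple?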
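To make the GET/POST difference concrete: a GET request carries its parameters in the URL, while a form POST sends them URL-encoded in the request body. The sketch below uses only the standard library to show both encodings for one field. The helper names `build_get` and `build_form_post` are hypothetical, and the field name `data` is simply the one used in the example that follows in this article.

```python
from urllib.parse import urlencode

BASE_URL = "https://www.httpbin.org"


def build_get(params):
    """Return (method, url, body) for a GET: parameters go into the query string."""
    return "GET", f"{BASE_URL}/get?{urlencode(params)}", b""


def build_form_post(params):
    """Return (method, url, body) for a form POST: parameters go into the body."""
    return "POST", f"{BASE_URL}/post", urlencode(params).encode("ascii")


print(build_get({"data": "0"}))
print(build_form_post({"data": "0"}))
```

The POST body for a single-digit value is six bytes long (`data=0`).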

Here I use httpbin.org as the example. First add the required configuration to settings.py, adjusting it to your own environment:

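    # Use the Scrapy-Redis scheduler so the request queue is stored in Redis
    SCHEDULER = "scrapy_redis.scheduler.Scheduler"
    # Do not clear the Redis queue, so a crawl can be paused and resumed
    SCHEDULER_PERSIST = True
    # Make all spiders deduplicate requests through Redis
    DUPEFILTER_CLASS = "scrapy_redis.dupefilter.RFPDupeFilter"
    # Keep pending requests in a priority queue
    SCHEDULER_QUEUE_CLASS = 'scrapy_redis.queue.SpiderPriorityQueue'
    # Address of the Redis server
    REDIS_URL = "redis://127.0.0.1:6379"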
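Since these values depend on the environment, one option is to read them from environment variables instead of hard-coding them. This is only a sketch under assumptions: the variable names `REDIS_URL` and `SCHEDULER_PERSIST` and the helper `redis_settings` are hypothetical choices, not something Scrapy-Redis defines.

```python
import os


def redis_settings(env=os.environ):
    """Build the Redis-related settings from a mapping of environment variables.

    Falls back to the local values used in this article when a variable is unset.
    """
    return {
        "REDIS_URL": env.get("REDIS_URL", "redis://127.0.0.1:6379"),
        "SCHEDULER_PERSIST": env.get("SCHEDULER_PERSIST", "1") == "1",
    }


# In settings.py the values are assigned at module level. Scrapy only picks up
# upper-case names, so the helper and the temporary variable are ignored.
_cfg = redis_settings()
REDIS_URL = _cfg["REDIS_URL"]
SCHEDULER_PERSIST = _cfg["SCHEDULER_PERSIST"]
```

With this in place, several workers on different machines could share one settings.py and differ only in their environment.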

The spider code is as follows:

    # -*- coding: utf-8 -*-
    import scrapy
    from scrapy_redis.spiders import RedisSpider


    class HpbSpider(RedisSpider):
        name = 'hpb'
        # Redis key that this spider reads its messages from
        redis_key = 'test_post_data'

        def make_request_from_data(self, data):
            """Returns a Request instance from data coming from Redis.

            By default, ``data`` is an encoded URL. You can override this method to
            provide your own message decoding.

            Parameters
            ----------
            data : bytes
                Message from redis.

            """
            # Send the Redis message as a form field in a POST request
            # instead of treating it as a URL to GET.
            return scrapy.FormRequest("https://www.httpbin.org/post",
                                      formdata={"data": data},
                                      callback=self.parse)

        def parse(self, response):
            # httpbin echoes the request back, so the body shows what was posted
            print(response.body)

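For simplicity the response is printed directly; in real use you could combine this with a pipeline to write the results to a database, and so on.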
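As a rough illustration of the pipeline idea, here is a minimal sketch of an item pipeline that stores one field per item in SQLite. It relies only on the standard library and on the method names Scrapy calls on a pipeline (`open_spider`, `close_spider`, `process_item`). The class name `SqlitePipeline`, the file name `hpb.db`, the table name `echoed`, and the assumption that each item is a dict with a `data` key are all hypothetical.

```python
import sqlite3


class SqlitePipeline:
    """Store each scraped item's ``data`` field in a SQLite table."""

    def __init__(self, db_path="hpb.db"):  # hypothetical file name
        self.db_path = db_path
        self.conn = None

    def open_spider(self, spider):
        # Called once when the spider starts: open the database, create the table.
        self.conn = sqlite3.connect(self.db_path)
        self.conn.execute("CREATE TABLE IF NOT EXISTS echoed (data TEXT)")

    def close_spider(self, spider):
        # Called once when the spider closes.
        self.conn.close()

    def process_item(self, item, spider):
        # Called for every item; commit per item so nothing is lost if the
        # long-running spider is stopped abruptly.
        with self.conn:
            self.conn.execute("INSERT INTO echoed (data) VALUES (?)", (item["data"],))
        return item
```

To use something like this, parse would have to yield a dict such as {"data": ...} instead of printing, and the class would have to be registered under ITEM_PIPELINES in settings.py with a module path that matches your project layout.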

Then start the spider with scrapy crawl hpb. Since nothing has been written to test_post_data yet, the program waits after startup. Next, simulate writing data to the queue:

    import redis

    rd = redis.Redis('127.0.0.1', port=6379, db=0)
    # Push 1000 integers onto the list that the spider reads from
    for i in range(1000):
        rd.lpush('test_post_data', i)
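A real POST usually needs more than one field. One possible approach, sketched here under assumptions, is to push each message as a JSON object and decode it again on the spider side. The helper names `encode_message`, `decode_message`, and `enqueue` are hypothetical, and `client` stands for any object with an `lpush(key, value)` method, such as the `redis.Redis` instance above.

```python
import json


def encode_message(fields):
    """Serialize a dict of form fields into one Redis list entry."""
    return json.dumps(fields, sort_keys=True)


def decode_message(data):
    """Turn a raw Redis entry (bytes or str) back into form fields.

    All values are converted to str, which is what a form body expects.
    """
    if isinstance(data, bytes):
        data = data.decode("utf-8")
    return {key: str(value) for key, value in json.loads(data).items()}


def enqueue(client, key, messages):
    """Push every message in ``messages`` onto the Redis list ``key``."""
    for fields in messages:
        client.lpush(key, encode_message(fields))
```

On the spider side, make_request_from_data could then pass decode_message(data) as formdata instead of wrapping the raw value in a single field.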

Once the values are in the queue, the spider starts fetching them, as the log shows:

2019-05-06 16:30:21 [hpb] DEBUG: Read 8 requests from 'test_post_data'
2019-05-06 16:30:21 [scrapy.core.engine] DEBUG: Crawled (200) <POST https://www.httpbin.org/post> (referer: None)
2019-05-06 16:30:21 [scrapy.core.engine] DEBUG: Crawled (200) <POST https://www.httpbin.org/post> (referer: None)
2019-05-06 16:30:21 [scrapy.core.engine] DEBUG: Crawled (200) <POST https://www.httpbin.org/post> (referer: None)
2019-05-06 16:30:21 [scrapy.core.engine] DEBUG: Crawled (200) <POST https://www.httpbin.org/post> (referer: None)
2019-05-06 16:30:21 [scrapy.core.engine] DEBUG: Crawled (200) <POST https://www.httpbin.org/post> (referer: None)
2019-05-06 16:30:21 [scrapy.core.engine] DEBUG: Crawled (200) <POST https://www.httpbin.org/post> (referer: None)
2019-05-06 16:30:21 [scrapy.core.engine] DEBUG: Crawled (200) <POST https://www.httpbin.org/post> (referer: None)
2019-05-06 16:30:21 [scrapy.core.engine] DEBUG: Crawled (200) <POST https://www.httpbin.org/post> (referer: None)
b'{\n "args": {}, \n "data": "", \n "files": {}, \n "form": {\n    "data": "0"\n }, \n "headers": {\n    "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8", \n    "Accept-Encoding": "gzip,deflate", \n    "Accept-Language": "en", \n    "Content-Length": "6", \n    "Content-Type": "application/x-www-form-urlencoded", \n    "Host": "www.httpbin.org", \n    "User-Agent": "Scrapy/1.5.1 (+https://scrapy.org)"\n }, \n "json": null, \n "origin": "1.2.3.48, 1.2.3.48", \n "url": "https://www.httpbin.org/post"\n}\n'
b'{\n "args": {}, \n "data": "", \n "files": {}, \n "form": {\n    "data": "1"\n }, \n "headers": {\n    "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8", \n    "Accept-Encoding": "gzip,deflate", \n    "Accept-Language": "en", \n    "Content-Length": "6", \n    "Content-Type": "application/x-www-form-urlencoded", \n    "Host": "www.httpbin.org", \n    "User-Agent": "Scrapy/1.5.1 (+https://scrapy.org)"\n }, \n "json": null, \n "origin": "1.2.3.48, 1.2.3.48", \n "url": "https://www.httpbin.org/post"\n}\n'
b'{\n "args": {}, \n "data": "", \n "files": {}, \n "form": {\n    "data": "3"\n }, \n "headers": {\n    "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8", \n    "Accept-Encoding": "gzip,deflate", \n    "Accept-Language": "en", \n    "Content-Length": "6", \n    "Content-Type": "application/x-www-form-urlencoded", \n    "Host": "www.httpbin.org", \n    "User-Agent": "Scrapy/1.5.1 (+https://scrapy.org)"\n }, \n "json": null, \n "origin": "1.2.3.48, 1.2.3.48", \n "url": "https://www.httpbin.org/post"\n}\n'
b'{\n "args": {}, \n "data": "", \n "files": {}, \n "form": {\n    "data": "2"\n }, \n "headers": {\n    "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8", \n    "Accept-Encoding": "gzip,deflate", \n    "Accept-Language": "en", \n    "Content-Length": "6", \n    "Content-Type": "application/x-www-form-urlencoded", \n    "Host": "www.httpbin.org", \n    "User-Agent": "Scrapy/1.5.1 (+https://scrapy.org)"\n }, \n "json": null, \n "origin": "1.2.3.48, 1.2.3.48", \n "url": "https://www.httpbin.org/post"\n}\n'
b'{\n "args": {}, \n "data": "", \n "files": {}, \n "form": {\n    "data": "4"\n }, \n "headers": {\n    "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8", \n    "Accept-Encoding": "gzip,deflate", \n    "Accept-Language": "en", \n    "Content-Length": "6", \n    "Content-Type": "application/x-www-form-urlencoded", \n    "Host": "www.httpbin.org", \n    "User-Agent": "Scrapy/1.5.1 (+https://scrapy.org)"\n }, \n "json": null, \n "origin": "1.2.3.48, 1.2.3.48", \n "url": "https://www.httpbin.org/post"\n}\n'
b'{\n "args": {}, \n "data": "", \n "files": {}, \n "form": {\n    "data": "5"\n }, \n "headers": {\n    "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8", \n    "Accept-Encoding": "gzip,deflate", \n    "Accept-Language": "en", \n    "Content-Length": "6", \n    "Content-Type": "application/x-www-form-urlencoded", \n    "Host": "www.httpbin.org", \n    "User-Agent": "Scrapy/1.5.1 (+https://scrapy.org)"\n }, \n "json": null, \n "origin": "1.2.3.48, 1.2.3.48", \n "url": "https://www.httpbin.org/post"\n}\n'
b'{\n "args": {}, \n "data": "", \n "files": {}, \n "form": {\n    "data": "6"\n }, \n "headers": {\n    "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8", \n    "Accept-Encoding": "gzip,deflate", \n    "Accept-Language": "en", \n    "Content-Length": "6", \n    "Content-Type": "application/x-www-form-urlencoded", \n    "Host": "www.httpbin.org", \n    "User-Agent": "Scrapy/1.5.1 (+https://scrapy.org)"\n }, \n "json": null, \n "origin": "1.2.3.48, 1.2.3.48", \n "url": "https://www.httpbin.org/post"\n}\n'
b'{\n "args": {}, \n "data": "", \n "files": {}, \n "form": {\n    "data": "7"\n }, \n "headers": {\n    "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8", \n    "Accept-Encoding": "gzip,deflate", \n    "Accept-Language": "en", \n    "Content-Length": "6", \n    "Content-Type": "application/x-www-form-urlencoded", \n    "Host": "www.httpbin.org", \n    "User-Agent": "Scrapy/1.5.1 (+https://scrapy.org)"\n }, \n "json": null, \n "origin": "1.2.3.48, 1.2.3.48", \n "url": "https://www.httpbin.org/post"\n}\n'
2019-05-06 16:31:09 [scrapy.extensions.logstats] INFO: Crawled 1001 pages (at 280 pages/min), scraped 0 items (at 0 items/min)
2019-05-06 16:32:09 [scrapy.extensions.logstats] INFO: Crawled 1001 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2019-05-06 16:33:09 [scrapy.extensions.logstats] INFO: Crawled 1001 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
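The response bodies in this log are JSON, so a real parse method would normally decode them rather than print raw bytes. A minimal sketch, assuming the httpbin response layout shown above; the function name `extract_form_field` is hypothetical, and `sample` is a shortened copy of the first body in the log.

```python
import json


def extract_form_field(body, field="data"):
    """Return one echoed form field from an httpbin /post response body (bytes)."""
    payload = json.loads(body.decode("utf-8"))
    return payload["form"][field]


# Shortened version of the first response body from the log.
sample = b'{"args": {}, "data": "", "form": {"data": "0"}, "json": null}'
print(extract_form_field(sample))
```

Inside the spider, the same function could be called as extract_form_field(response.body).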