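Use cases for proxies
Anyone who writes crawler code ends up using proxies sooner or later. While coding you will run into situations like these:
the network is unreliable, so you need a proxy;
the target site cannot be reached from mainland China, so you need a proxy;
the site has banned your IP, so you need a proxy.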
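Using the HttpProxyMiddleware middleware
The test site is again http://httpbin.org/. Requesting http://httpbin.org/ip returns the IP address that the current request came from.
HttpProxyMiddleware is enabled by default. Its source code is worth reading, with the process_request() method as the key part.
Changing the proxy is simple: add a meta argument with a 'proxy' key when you create the Request.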
import scrapy


class PtSpider(scrapy.Spider):
    name = 'pt'
    allowed_domains = ['httpbin.org']
    start_urls = ['http://httpbin.org/ip']

    def start_requests(self):
        # Route the request through the proxy given in meta['proxy'].
        yield scrapy.Request(url=self.start_urls[0], meta={'proxy': 'http://202.5.116.49:8080'})

    def parse(self, response):
        # httpbin echoes back the IP address it saw for this request.
        print(response.text)
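Besides the meta key, HttpProxyMiddleware also honours the http_proxy, https_proxy and no_proxy environment variables, which it reads through the standard library's urllib.request.getproxies(). The sketch below only shows what that call reports. The helper name env_proxies and the local address 127.0.0.1:8888 are assumptions made for illustration.

```python
import os
from urllib.request import getproxies

# Hypothetical local proxy address, set here only for the demonstration.
os.environ["http_proxy"] = "http://127.0.0.1:8888"


def env_proxies():
    """Return the scheme -> proxy URL mapping found in the environment."""
    return dict(getproxies())


if __name__ == "__main__":
    # Print every proxy the standard library detects for this process.
    for scheme, url in env_proxies().items():
        print(scheme, url)
```

A proxy set in meta on a single Request is the more targeted option; the environment variables apply to the whole process.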
Next, we scrape the proxy IPs listed on https://www.kuaidaili.com/free/ and test whether each proxy is usable.
import scrapy


class PtSpider(scrapy.Spider):
    name = 'pt'
    allowed_domains = ['httpbin.org', 'kuaidaili.com']
    start_urls = ['https://www.kuaidaili.com/free/']

    def parse(self, response):
        # Each table row exposes the IP and port in cells tagged by data-title.
        IP = response.xpath('//td[@data-title="IP"]/text()').getall()
        PORT = response.xpath('//td[@data-title="PORT"]/text()').getall()
        url = 'http://httpbin.org/ip'
        for ip, port in zip(IP, PORT):
            proxy = f"http://{ip}:{port}"
            meta = {
                'proxy': proxy,
                'dont_retry': True,       # a dead proxy is not worth retrying
                'download_timeout': 10,   # give up after 10 seconds
            }
            # dont_filter=True because every request targets the same URL.
            yield scrapy.Request(url=url, callback=self.check_proxy, meta=meta, dont_filter=True)

    def check_proxy(self, response):
        print(response.text)
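Scraped table cells can carry stray whitespace or values that are not addresses at all, so it can help to validate each pair before building the proxy URL. The following is a minimal standard-library sketch, not part of the original spider. The helper name build_proxy_urls is hypothetical, and it assumes IPv4 addresses only.

```python
import ipaddress


def build_proxy_urls(ips, ports, scheme="http"):
    """Pair scraped IP and port strings into proxy URLs, skipping invalid rows."""
    urls = []
    for ip, port in zip(ips, ports):
        ip, port = ip.strip(), port.strip()
        try:
            # Raises a ValueError subclass if the text is not an IPv4 address.
            ipaddress.IPv4Address(ip)
        except ValueError:
            continue
        # Keep only numeric ports in the valid TCP range.
        if not port.isdigit() or not 0 < int(port) < 65536:
            continue
        urls.append(f"{scheme}://{ip}:{port}")
    return urls


if __name__ == "__main__":
    print(build_proxy_urls([" 202.5.116.49 ", "not-an-ip"], ["8080", "80"]))
```

In the spider, the loop over zip(IP, PORT) could iterate over the result of this helper instead.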
Next, we yield each working proxy as an item so that it can be saved to a JSON file.
import scrapy


class PtSpider(scrapy.Spider):
    name = 'pt'
    allowed_domains = ['httpbin.org', 'kuaidaili.com']
    start_urls = ['https://www.kuaidaili.com/free/']

    def parse(self, response):
        IP = response.xpath('//td[@data-title="IP"]/text()').getall()
        PORT = response.xpath('//td[@data-title="PORT"]/text()').getall()
        url = 'http://httpbin.org/ip'
        for ip, port in zip(IP, PORT):
            proxy = f"http://{ip}:{port}"
            meta = {
                'proxy': proxy,
                'dont_retry': True,
                'download_timeout': 10,
                '_proxy': proxy,  # keep our own copy of the proxy URL for the callback
            }
            yield scrapy.Request(url=url, callback=self.check_proxy, meta=meta, dont_filter=True)

    def check_proxy(self, response):
        # This callback only runs if the request through the proxy succeeded.
        proxy_ip = response.json().get('origin')
        if proxy_ip is not None:
            yield {
                'proxy': response.meta['_proxy']
            }
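The spider itself only yields items; writing them to a file is left to Scrapy's feed exports. One option is to run scrapy crawl pt -o proxies.json. Another is a FEEDS entry in settings.py, sketched below, assuming a Scrapy version recent enough to support the FEEDS setting and its overwrite option. The file name proxies.json is an assumption.

```python
# settings.py (fragment)
# Export every yielded item to proxies.json; the file name is an assumption.
FEEDS = {
    "proxies.json": {
        "format": "json",
        "encoding": "utf8",
        "overwrite": True,  # replace the file on each run instead of appending
    },
}
```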
At the same time, override the start_requests method to fetch 10 pages of proxies.
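import scrapy


class PtSpider(scrapy.Spider):
    name = 'pt'
    allowed_domains = ['httpbin.org', 'kuaidaili.com']
    url_format = 'https://www.kuaidaili.com/free/inha/{}/'

    def start_requests(self):
        # Pages 1 to 10; responses go to the default parse() callback.
        for page in range(1, 11):
            yield scrapy.Request(url=self.url_format.format(page))

    # parse() and check_proxy() stay the same as in the previous version.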
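Implementing a custom proxy middleware is also fairly easy, and there are two ways to do it. The first is to inherit from HttpProxyMiddleware and write the following code: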
from scrapy.downloadermiddlewares.httpproxy import HttpProxyMiddleware
from collections import defaultdict
import random


class RandomProxyMiddleware(HttpProxyMiddleware):

    def __init__(self, auth_encoding='latin-1'):
        self.auth_encoding = auth_encoding
        self.proxies = defaultdict(list)
        # proxy.csv holds one proxy URL per line.
        with open('./proxy.csv') as f:
            proxy_list = f.readlines()
        for proxy in proxy_list:
            scheme = 'http'
            url = proxy.strip()
            # _get_proxy() returns a (credentials, proxy_url) tuple.
            self.proxies[scheme].append(self._get_proxy(url, scheme))

    def _set_proxy(self, request, scheme):
        # Pick a random proxy for each request.
        creds, proxy = random.choice(self.proxies[scheme])
        request.meta['proxy'] = proxy
        if creds:
            request.headers['Proxy-Authorization'] = b'Basic ' + creds
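The core of the code overrides the __init__ constructor and the _set_proxy method, which is where the random proxy selection happens. Both _get_proxy and _set_proxy are private helpers of HttpProxyMiddleware, so this approach depends on the Scrapy version in use.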
Update settings.py accordingly.
DOWNLOADER_MIDDLEWARES = {
    'proxy_text.middlewares.RandomProxyMiddleware': 543,
}
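The middleware above reads ./proxy.csv with one proxy URL per line, while the spider earlier produces JSON items of the form {"proxy": "..."}. A small conversion step can bridge the two. This is a minimal sketch that assumes the items were exported as a single JSON array; the file name proxies.json and the helper name export_proxy_list are hypothetical.

```python
import json
import tempfile
from pathlib import Path


def export_proxy_list(json_path, out_path):
    """Copy the 'proxy' value of each JSON item to out_path, one URL per line."""
    items = json.loads(Path(json_path).read_text(encoding="utf-8"))
    proxies = [item["proxy"] for item in items if item.get("proxy")]
    Path(out_path).write_text("".join(p + "\n" for p in proxies), encoding="utf-8")
    return proxies


if __name__ == "__main__":
    # Demonstrate with a temporary feed file instead of a real crawl result.
    with tempfile.TemporaryDirectory() as tmp:
        feed = Path(tmp) / "proxies.json"
        feed.write_text(json.dumps([{"proxy": "http://47.96.16.149:80"}]), encoding="utf-8")
        out = Path(tmp) / "proxy.csv"
        print(export_proxy_list(feed, out))
        print(out.read_text(encoding="utf-8"))
```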
The second approach: create a new proxy middleware class
import random

from scrapy.exceptions import NotConfigured


class NRandomProxyMiddleware:

    def __init__(self, settings):
        # Read the proxy list from the PROXIES setting.
        self.proxies = settings.getlist("PROXIES")

    def process_request(self, request, spider):
        # Assign a random proxy to every outgoing request.
        request.meta["proxy"] = random.choice(self.proxies)

    @classmethod
    def from_crawler(cls, crawler):
        # Disable this middleware when HTTPPROXY_ENABLED is off.
        if not crawler.settings.getbool("HTTPPROXY_ENABLED"):
            raise NotConfigured
        return cls(crawler.settings)
As you can see, this class reads its configuration from PROXIES in settings.py, so update the settings as follows:
DOWNLOADER_MIDDLEWARES = {
    # Disable the built-in proxy middleware and use the custom one instead.
    'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware': None,
    'proxy_text.middlewares.NRandomProxyMiddleware': 543,
}

# These proxies were collected by the spider shown earlier.
PROXIES = [
    'http://140.249.48.241:6969',
    'http://47.96.16.149:80',
    'http://140.249.48.241:6969',
    'http://47.100.14.22:9006',
    'http://47.100.14.22:9006',
]
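The collected list contains repeated entries, and random.choice picks a duplicated proxy proportionally more often. If an even spread is preferred, the list can be deduplicated first. This is a small sketch; the helper name unique_proxies is hypothetical.

```python
def unique_proxies(proxies):
    """Drop duplicates while keeping the first occurrence of each proxy."""
    return list(dict.fromkeys(proxies))


PROXIES = unique_proxies([
    'http://140.249.48.241:6969',
    'http://47.96.16.149:80',
    'http://140.249.48.241:6969',
    'http://47.100.14.22:9006',
    'http://47.100.14.22:9006',
])

if __name__ == "__main__":
    print(PROXIES)
```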
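If you want to test a spider, you can write a function that returns a random proxy for a request and use it in any spider code, which completes the exercise for this post.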
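One possible shape for such a function is sketched below. It is an illustration rather than the author's implementation: the name random_proxy_meta and the optional rng parameter are assumptions, and the proxy addresses are taken from the list above.

```python
import random


def random_proxy_meta(proxies, rng=random):
    """Return a Request.meta dict that carries one randomly chosen proxy."""
    if not proxies:
        raise ValueError("proxy list is empty")
    return {"proxy": rng.choice(proxies)}


PROXIES = ["http://140.249.48.241:6969", "http://47.96.16.149:80"]

if __name__ == "__main__":
    print(random_proxy_meta(PROXIES))
```

Inside a spider, the result can be passed straight to the request, for example scrapy.Request(url, meta=random_proxy_meta(PROXIES)).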
Original article: https://blog.csdn.net/hihell/article/details/121012464