# Crawler Project 7: Scrapy

**Repository Path**: cthousand/item-7

## Basic Information

- **Project Name**: Crawler Project 7: Scrapy
- **Description**: The target site in this project uses three anti-crawling measures: login verification, IP blocking, and account blocking. By combining an IP pool, an account pool, and simulated login with Scrapy, large-scale data crawling is achieved.
- **Primary Language**: Python
- **License**: Not specified
- **Default Branch**: master
- **Homepage**: None
- **GVP Project**: No

## Statistics

- **Stars**: 0
- **Forks**: 2
- **Created**: 2022-05-04
- **Last Updated**: 2023-05-15

## Categories & Tags

**Categories**: Uncategorized

**Tags**: Scrapy

## README

# scrapy

## Background

Scrapy is the most popular framework in the Python crawling ecosystem, with a clean architecture and strong extensibility. We usually implement an entire crawler with requests or aiohttp, but many steps in that process are repetitive, so we might as well extract the logic of those steps and turn the common functionality into reusable components.

The Scrapy architecture is shown below:

![IMG20220504133032](https://cthousand-pic-save.oss-cn-hangzhou.aliyuncs.com/img/202205041335833.jpg)

* Engine: the framework's central processor, responsible for data flow and control logic.
* Item: an abstract data structure. Crawled data is assigned to Item objects; each Item is a class that defines the data fields of the crawl result.
* Scheduler: receives Requests from the Engine, enqueues them, and later hands Requests back to the Engine for the Downloader to execute. It maintains the scheduling logic, such as FIFO, LIFO, or priority queues.
* Spiders: there can be multiple Spiders; each defines the crawling logic and page-parsing rules for a site. A Spider parses responses, generates Items and new Requests, and sends them to the Engine.
* Downloader: sends requests to the server and fetches the responses, which are then passed back to the Engine.
* Item Pipeline: processes the Items extracted by Spiders, handling data cleaning, validation, and storage.
* Downloader Middlewares: sit between the Downloader and the Engine and process the Requests and Responses passing between them.
* Spider Middlewares: sit between the Spiders and the Engine and process the Items, Requests, and Responses passing between them.

## Goal

The target site is https://antispider7.scrape.center/. It requires login before crawling; after logging in (both username and password are admin), we see the following page:

![image-20220504135638726](https://cthousand-pic-save.oss-cn-hangzhou.aliyuncs.com/img/202205041356760.png)

The page lists books. We need to open the detail page of every book and crawl its information; there are nearly 10,000 books in total. The data should be saved to a local MongoDB database. The expected result looks like this:

![image-20220504142257794](https://cthousand-pic-save.oss-cn-hangzhou.aliyuncs.com/img/202205041422829.png)

The site applies the following anti-crawling measures: a single account may visit at most 10 pages within any 5-minute window, i.e. on average at most 2 visits per minute; exceeding that gets the account banned. The site also bans IPs under the same rule of at most 10 visits per 5 minutes.

This leaves two main obstacles:

* Account banning: the countermeasure is an account pool; each request picks a random account from the pool, lowering the probability that any single account gets banned.
* IP banning: the countermeasure is an IP proxy; a paid tunnel proxy is used so that every request goes out from a different IP, lowering the probability that any single IP gets banned.

## Analysis

1. After logging in with an account, inspect the list-page URL:

   ![image-20220504140346682](https://cthousand-pic-save.oss-cn-hangzhou.aliyuncs.com/img/202205041403716.png)

   ![](https://cthousand-pic-save.oss-cn-hangzhou.aliyuncs.com/img/202205041404732.png)

   It is an Ajax request with no encrypted parameters in the URL. Next, check how login is verified:

   ![image-20220504140655635](https://cthousand-pic-save.oss-cn-hangzhou.aliyuncs.com/img/202205041406668.png)

   It uses JWT authentication. Next, look at how the detail-page URL is constructed.

2. Detail-page URL analysis:

   ![](https://cthousand-pic-save.oss-cn-hangzhou.aliyuncs.com/img/202205041408884.png)

   ![](https://cthousand-pic-save.oss-cn-hangzhou.aliyuncs.com/img/202205041409741.png)

   The `7952978` in the URL is the book ID returned by the list page.

The crawling logic is now clear: first fetch the book IDs from the list-page API, then build each detail-page API URL from its ID and crawl every book's details. Simulated login uses JWT, so an `authorization` field is added to the request headers (a login sketch follows below).
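Before wiring this into Scrapy, the login flow can be checked in isolation. Below is a minimal sketch of the JWT handshake, assuming the `/api/login` endpoint and the `token` response field that the account-pool code later relies on:

```python
import requests

BASE = 'https://antispider7.scrape.center'

# Log in with the demo credentials (admin / admin) and read the JWT token.
# The 'token' field name is an assumption carried over from the account-pool code below.
login = requests.post(f'{BASE}/api/login',
                      data={'username': 'admin', 'password': 'admin'})
token = login.json().get('token')

# The site expects the token in the Authorization header with the 'jwt' scheme.
headers = {'Authorization': f'jwt {token}'}
resp = requests.get(f'{BASE}/api/book/?limit=18&offset=0', headers=headers)
print(resp.status_code, len(resp.json().get('results', [])))
```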
## Implementation

### Main Logic

Create a new Scrapy project:

```
scrapy startproject scrapycompositedemo
```

Enter the project and generate a spider named book:

```
scrapy genspider book antispider7.scrape.center
```

Define the Item with fields matching the detail-page API response. In items.py, define a BookItem:

```python
from scrapy import Field, Item


class BookItem(Item):
    authors = Field()
    catalog = Field()
    comments = Field()
    cover = Field()
    id = Field()
    introduction = Field()
    isbn = Field()
    name = Field()
    page_number = Field()
    price = Field()
    published_at = Field()
    publisher = Field()
    score = Field()
    tags = Field()
    translator = Field()
```

For the Spider, rewrite book.py as follows:

```python
from scrapy import Request, Spider

from ..items import BookItem


class BookSpider(Spider):
    name = 'book'
    allowed_domains = ['antispider7.scrape.center']
    base_url = 'https://antispider7.scrape.center'
    max_page = 512

    def start_requests(self):
        # One list-page request per page, 18 books per page
        for page in range(1, self.max_page + 1):
            url = f'{self.base_url}/api/book/?limit=18&offset={(page - 1) * 18}'
            yield Request(url, callback=self.parse_index)

    def parse_index(self, res):
        data = res.json()
        results = data.get('results')
        for result in results:
            id = result.get('id')
            url = f'{self.base_url}/api/book/{id}/'
            # priority gives detail Requests precedence in the scheduler
            yield Request(url, callback=self.parse_detail, priority=2)

    def parse_detail(self, res):
        data = res.json()
        item = BookItem()
        for field in item.fields:
            item[field] = data.get(field)
        yield item
```

### Account Pool Implementation

The refresh interval is set to 600 s. The pool is backed by Redis hashes and runs three processes: a fetch module, a test module, and an API module.

```python
import random
import time
from multiprocessing import Process

import requests
from flask import Flask
from redis import StrictRedis
from requests import Session

url = 'https://antispider7.scrape.center/api/login'
url_center = 'https://antispider7.scrape.center/api/book/?limit=18&offset=0'
db = StrictRedis(host='localhost', port=6379, db=0, decode_responses=True)
s = Session()
app = Flask(__name__)


def Getcookie():
    # Fetch module: for every account without a stored credential, log in and save the JWT token
    while True:
        for username in db.hkeys('account'):
            if not db.hexists('cookie', username):
                r = s.post(url=url, data={'username': username,
                                          'password': db.hget('account', username)})
                print(r.status_code, username)
                result = r.json().get('token', 'fail')
                db.hset('cookie', username, result)
        time.sleep(600)


def Testcookie():
    # Test module: probe the list API with each stored token and remove the ones that no longer work
    while True:
        if db.hgetall('cookie'):
            print('exist')
        else:
            print('not exist')
            db.hset('cookie', '1', '1')
            print('cookie hash created')
        for username, cookie in db.hgetall('cookie').items():
            res = requests.get(
                url_center,
                headers={'Authorization': f'jwt {cookie}'},
                allow_redirects=False)
            if res.status_code != 200:
                db.hdel('cookie', username)
                print('cookie', username, res.status_code, 'expired!')
            else:
                print('cookie', username, res.status_code, 'valid!')
        time.sleep(600)


@app.route('/')  # API module: return one random valid token
def cookie_api():
    result = db.hvals('cookie')
    result = [i for i in result if i != '']
    result = random.choice(result)
    return result


def api():
    app.run(threaded=True)


def schedule():
    p1 = Process(target=Getcookie)
    p2 = Process(target=Testcookie)
    p3 = Process(target=api)
    p1.start()
    p2.start()
    p3.start()
    p1.join()
    p2.join()
    p3.join()


if __name__ == '__main__':
    schedule()
```
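The pool above reads credentials from the Redis hash `account` (username mapped to password) but does not show how that hash gets filled. A one-off seeding sketch, assuming you have registered a few accounts on the site beforehand (the credential dict here is only a placeholder):

```python
from redis import StrictRedis

db = StrictRedis(host='localhost', port=6379, db=0, decode_responses=True)

# Placeholder credentials: replace with accounts actually registered on the site
accounts = {
    'admin': 'admin',
    # 'your_user_2': 'your_password_2',
}

for username, password in accounts.items():
    db.hset('account', username, password)

print('accounts in pool:', db.hgetall('account'))
```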
### Proxy Pool Implementation

Purchase a tunnel-proxy service from Kuaidaili and, combined with the account pool above, add the following downloader middleware in middlewares.py:

```python
import requests


class AuthorizationMiddleware():
    accountpool_url = 'http://127.0.0.1:5000/'
    tunnel = "tps656.kdlapi.com:15818"
    username = "t15160185660737"
    password = "vjopqphj"
    proxies = "http://%(user)s:%(pwd)s@%(proxy)s/" % {
        "user": username, "pwd": password, "proxy": tunnel}

    def process_request(self, request, spider):
        # Fetch a random token from the account pool API
        with requests.get(self.accountpool_url) as res:
            authorization = res.text
        authorization = f'jwt {authorization}'
        request.headers['authorization'] = authorization  # attach the JWT credential
        request.meta['proxy'] = self.proxies  # route the request through the tunnel proxy
```

### Storage Settings

In pipelines.py:

```python
import pymongo
from itemadapter import ItemAdapter


class ScrapycompositedemoPipeline:
    def process_item(self, item, spider):
        return item


class MongoDBPipeline(object):
    @classmethod
    def from_crawler(cls, crawler):
        cls.connect_string = crawler.settings.get('MONGODB_CONNECTION_STRING')
        cls.database = crawler.settings.get('MONGODB_DATABASE')
        cls.collection = crawler.settings.get('MONGODB_COLLECTION')
        return cls()

    def open_spider(self, spider):
        self.client = pymongo.MongoClient(self.connect_string)
        self.db = self.client[self.database]

    def process_item(self, item, spider):
        # Upsert by book id so repeated crawls do not create duplicates
        self.db[self.collection].update_one(
            {'id': item['id']}, {'$set': dict(item)}, upsert=True)
        return item

    def close_spider(self, spider):
        self.client.close()
```

### Settings

In settings.py:

```python
ROBOTSTXT_OBEY = False  # do not obey robots.txt
CONCURRENT_REQUESTS = 5  # maximum concurrency

DOWNLOADER_MIDDLEWARES = {
    # downloader middleware priority: lower values sit closer to the Engine
    'scrapycompositedemo.middlewares.AuthorizationMiddleware': 543,
}

RETRY_HTTP_CODES = [401, 403, 500, 502, 503, 504]  # status codes that trigger a retry, to raise the success rate
DOWNLOAD_TIMEOUT = 10  # request timeout
RETRY_TIMES = 10  # number of retries

# MongoDB constants
MONGODB_CONNECTION_STRING = 'localhost'
MONGODB_DATABASE = 'books'
MONGODB_COLLECTION = 'books'

ITEM_PIPELINES = {
    # item pipeline priority: lower values run earlier
    'scrapycompositedemo.pipelines.MongoDBPipeline': 300,
}
```

Finally, open a terminal in the project directory and run:

```
scrapy crawl book
```

![image-20220504143813936](https://cthousand-pic-save.oss-cn-hangzhou.aliyuncs.com/img/202205041438988.png)

Then wait for the crawl to complete.
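After the crawl finishes, the result can be spot-checked with pymongo. A small verification sketch using the same connection constants as settings.py (the projected field names follow the BookItem definition):

```python
import pymongo

# Same constants as the MONGODB_* settings above
client = pymongo.MongoClient('localhost')
collection = client['books']['books']

# The site holds close to 10,000 books, so the count should end up in that range
print('books stored:', collection.count_documents({}))
print('sample:', collection.find_one({}, {'_id': 0, 'name': 1, 'authors': 1, 'price': 1}))
```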