# spider

**Repository Path**: liqkjm/spider

## Basic Information

- **Project Name**: spider
- **Description**: No description available
- **Primary Language**: Unknown
- **License**: Not specified
- **Default Branch**: master
- **Homepage**: None
- **GVP Project**: No

## Statistics

- **Stars**: 0
- **Forks**: 0
- **Created**: 2019-03-04
- **Last Updated**: 2020-12-17

## Categories & Tags

**Categories**: Uncategorized
**Tags**: None

## README

# User Manual

## Project Layout

- spider/
  - run.py — launcher for the GUI
  - manage.py — batch crawl task script
  - task.xml — batch crawl task definition (can be stored anywhere, but its path must be passed to manage.py at startup)

## Environment Setup

### Versions

- Windows 10
- Python 3.6 (can be managed with Anaconda)
- MySQL

### Installing Dependencies

> Everything can be installed from requirements.txt: `pip install -r requirements.txt -i https://pypi.douban.com/simple`

- pyqt5 5.11.3 (GUI)
- scrapy 1.5.1 (crawling)
- pillow 5.4.1 (saving images)
- pymysql 0.9.3 (database access)

Step-by-step installation (using the Douban mirror):

> pip install -i https://pypi.douban.com/simple pyqt5
> pip install -i https://pypi.douban.com/simple scrapy
> pip install -i https://pypi.douban.com/simple pillow
> pip install -i https://pypi.douban.com/simple PyMySQL

Or use Anaconda (leaves the system environment untouched):

1. Create a new Python 3.6 virtual environment with Anaconda Navigator and activate it.
2. pip may be too old; upgrade it: `python -m pip install --upgrade pip`
3. Install pyqt5: `pip install -i https://pypi.douban.com/simple pyqt5`
4. Install scrapy: `pip install scrapy`
5. Install pillow: `pip install pillow`
6. Install pymysql: `pip install pymysql -i https://pypi.douban.com/simple`

### Environment Variables

- python
- scrapy (add the Scripts directory under the Python installation directory to PATH)

### MySQL Database

Create the `sku` and `price` tables (the table names are hard-coded; to change them, edit the code around `spider/spider/pipelines.py, line 120`):

```sql
CREATE TABLE `spider`.`sku` (
  `id` INT NOT NULL AUTO_INCREMENT,
  `sku_id` VARCHAR(45) NOT NULL,
  `source` VARCHAR(45) NULL,
  `keyword` VARCHAR(45) NULL,
  `trans_key` VARCHAR(45) NULL,
  `name` VARCHAR(100) NULL,
  `label` VARCHAR(45) NULL,
  `image_urls` VARCHAR(1000) NULL,
  PRIMARY KEY (`id`));

CREATE TABLE `spider`.`price` (
  `id` VARCHAR(45) NOT NULL,
  `time` DATETIME NULL,
  `price` VARCHAR(45) NULL,
  PRIMARY KEY (`id`),
  CONSTRAINT `fk_sku_price_id`
    FOREIGN KEY (`id`)
    REFERENCES `spider`.`sku` (`id`)
    ON DELETE NO ACTION
    ON UPDATE NO ACTION);
```

An alternative `price` definition without the foreign key, using an auto-filled timestamp:

```sql
CREATE TABLE `price` (
  `id` varchar(45) NOT NULL,
  `time` timestamp NULL DEFAULT CURRENT_TIMESTAMP,
  `price` varchar(45) DEFAULT NULL
) ENGINE=InnoDB DEFAULT CHARSET=utf8mb4 COLLATE=utf8mb4_0900_ai_ci;
```

## Usage

### 1. Running the GUI

1. Change into the project directory (the program cannot be run from outside it) and run `python run.py` on the command line.
2. Once the window appears, enter the information to crawl and click `开始爬虫` (Start Crawling) to begin.

### 2. Batch Tasks

1. `python D:\project\spider\manage.py D:\project\spider\spider\config\settings.cfg`

## Project Structure

- spider
  - spider
    - spiders
      - jd.py
      - yhd.py
      - sn.py
      - ...
    - util
      - mainwindow.py (GUI logic)
      - ...
    - items.py
    - middlewares.py
    - pipelines.py
    - settings.py
  - main.py
  - run.py

## Log Files

Open the log files under the spider/log directory to see the details of a crawl run.

## Crawler Notes

JD and Yihaodian use the same CDN image server, with the same interface: http://img13.360buyimg.com/n1/s150x150_

img10 through img13 behave identically; they are probably distributed servers.

## TODO

1. price(id, time, price) timestamps
2. Add a site field to sku
3. Foreign key constraints
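The crawler's pipeline writes crawled items into the `sku` and `price` tables above via pymysql. As a minimal sketch of the kind of inserts involved (the helper name `save_item`, the item fields, and the connection parameters are illustrative assumptions, not code from this project):

```python
# The statements match the CREATE TABLE definitions above; parameterized
# placeholders (%s) let pymysql escape the values safely.
SKU_INSERT = (
    "INSERT INTO sku (sku_id, source, keyword, trans_key, name, label, image_urls) "
    "VALUES (%s, %s, %s, %s, %s, %s, %s)"
)
PRICE_INSERT = "INSERT INTO price (id, time, price) VALUES (%s, NOW(), %s)"


def save_item(conn, item):
    """Insert one crawled SKU and its current price in a single transaction."""
    with conn.cursor() as cur:
        cur.execute(SKU_INSERT, (
            item["sku_id"], item["source"], item["keyword"],
            item["trans_key"], item["name"], item["label"],
            ",".join(item["image_urls"]),
        ))
        # price.id references sku.id, so reuse the generated key.
        cur.execute(PRICE_INSERT, (cur.lastrowid, item["price"]))
    conn.commit()


if __name__ == "__main__":
    import pymysql  # pip install pymysql

    # Connection parameters are placeholders; adjust them to your MySQL setup.
    conn = pymysql.connect(host="localhost", user="root", password="secret",
                           database="spider", charset="utf8mb4")
    save_item(conn, {
        "sku_id": "1233203", "source": "yhd", "keyword": "milk",
        "trans_key": "", "name": "demo", "label": "",
        "image_urls": ["http://img13.360buyimg.com/n1/s150x150_demo.jpg"],
        "price": "9.90",
    })
```

Note that storing `image_urls` as a comma-joined string mirrors the single `VARCHAR(1000)` column in the schema.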
## Packaging with PyInstaller

The GUI is bundled into an executable with PyInstaller using the spec file below (the file name `run_gui.spec` is an assumption, inferred from the `run_gui.py` entry point and the `name='run_gui'` target):

```python
# -*- mode: python -*-

block_cipher = None

a = Analysis(
    ['run_gui.py'],
    pathex=['D:\\project\\spider'],
    binaries=[],
    datas=[('.\\scrapy', 'scrapy'),
           ('.\\scrapy.cfg', '.'),
           ('.\\settings.cfg', '.')],
    hiddenimports=[
        'robotparser',
        'spider.settings',
        'spider.spiders',
        'spider.spiders.jd',
        'spider.spiders.yhd',
        'spider.spiders.sn',
        'scrapy.crawler',
        'scrapy.spiderloader',
        'scrapy.logformatter',
        'scrapy.dupefilters',
        'scrapy.squeues',
        'scrapy.statscollectors',
        'scrapy.extensions.spiderstate',
        'scrapy.extensions.corestats',
        'scrapy.extensions.telnet',
        'scrapy.extensions.logstats',
        'scrapy.extensions.memusage',
        'scrapy.extensions.memdebug',
        'scrapy.extensions.feedexport',
        'scrapy.extensions.closespider',
        'scrapy.extensions.debug',
        'scrapy.extensions.httpcache',
        'scrapy.extensions.statsmailer',
        'scrapy.extensions.throttle',
        'scrapy.core.scheduler',
        'scrapy.core.engine',
        'scrapy.core.scraper',
        'scrapy.core.spidermw',
        'scrapy.core.downloader',
        'scrapy.downloadermiddlewares.stats',
        'scrapy.downloadermiddlewares.httpcache',
        'scrapy.downloadermiddlewares.cookies',
        'scrapy.downloadermiddlewares.useragent',
        'scrapy.downloadermiddlewares.httpproxy',
        'scrapy.downloadermiddlewares.ajaxcrawl',
        'scrapy.downloadermiddlewares.decompression',
        'scrapy.downloadermiddlewares.defaultheaders',
        'scrapy.downloadermiddlewares.downloadtimeout',
        'scrapy.downloadermiddlewares.httpauth',
        'scrapy.downloadermiddlewares.httpcompression',
        'scrapy.downloadermiddlewares.redirect',
        'scrapy.downloadermiddlewares.retry',
        'scrapy.downloadermiddlewares.robotstxt',
        'scrapy.spidermiddlewares.depth',
        'scrapy.spidermiddlewares.httperror',
        'scrapy.spidermiddlewares.offsite',
        'scrapy.spidermiddlewares.referer',
        'scrapy.spidermiddlewares.urllength',
        'scrapy.pipelines',
        'scrapy.core.downloader.handlers.http',
        'scrapy.core.downloader.contextfactory',
        'os',
        'json',
        'csv',
        're',
        'scrapy',
        'pymysql',
        'win32api',
        'pywin32_system32',
        'pythonwin',
        'win32',
        'win32console',
        'win32com',
    ],
    hookspath=[],
    runtime_hooks=[],
    excludes=[],
    win_no_prefer_redirects=False,
    win_private_assemblies=False,
    cipher=block_cipher,
    # noarchive=False
)

pyz = PYZ(a.pure, a.zipped_data, cipher=block_cipher)

exe = EXE(pyz,
          a.scripts,
          [],
          exclude_binaries=True,
          name='run_gui',
          debug=False,
          bootloader_ignore_signals=False,
          strip=False,
          upx=True,
          console=True)

coll = COLLECT(exe,
               a.binaries,
               a.zipfiles,
               a.datas,
               strip=False,
               upx=True,
               name='run_gui')
```

## Known Issue

The Yihaodian price API is no longer reachable; requests fail with:

```
Get https://itemapi.yhd.com/getPrices.do?params.area=2_2817_51973_0&params.skuIds=1233203: net/http: invalid header field name ""
```
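The long `hiddenimports` list exists because Scrapy resolves most of its components from dotted paths listed in settings (middlewares, extensions, pipelines) rather than through static `import` statements, so PyInstaller's static analysis cannot discover them. A minimal illustration of this kind of dynamic loading (using only the standard library; `load_object` here is a simplified sketch of the same idea as `scrapy.utils.misc.load_object`, not the project's code):

```python
import importlib


def load_object(path):
    """Resolve a dotted path such as 'scrapy.downloadermiddlewares.retry.RetryMiddleware'
    to the Python object it names. PyInstaller never sees an import statement for
    the module, which is why each such module must appear in hiddenimports."""
    module_path, _, name = path.rpartition(".")
    module = importlib.import_module(module_path)
    return getattr(module, name)


if __name__ == "__main__":
    # Use a stdlib path so the demo runs anywhere.
    obj = load_object("json.JSONDecoder")
    print(obj.__name__)
```

Any module loaded only through a mechanism like this must be spelled out in the spec, or the bundled executable will fail at runtime with an ImportError.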