# WebCollector-Python **Repository Path**: johntomyang/WebCollector-Python ## Basic Information - **Project Name**: WebCollector-Python - **Description**: WebCollector-Python WebCollector-Python 是一个无须配置、便于二次开发的 Python 爬虫框架(内核),它提供精简的的 API,只需少量代码 - **Primary Language**: Unknown - **License**: GPL-3.0 - **Default Branch**: master - **Homepage**: None - **GVP Project**: No ## Statistics - **Stars**: 0 - **Forks**: 4 - **Created**: 2019-06-11 - **Last Updated**: 2020-12-19 ## Categories & Tags **Categories**: Uncategorized **Tags**: None ## README # WebCollector-Python WebCollector-Python is an open source web crawler framework based on Python.It provides some simple interfaces for crawling the Web,you can setup a multi-threaded web crawler in less than 5 minutes. ## HomePage [https://github.com/CrawlScript/WebCollector-Python](https://github.com/CrawlScript/WebCollector-Python) ## WebCollector Java Version For better efficiency, WebCollector Java Version is recommended: [https://github.com/CrawlScript/WebCollector](https://github.com/CrawlScript/WebCollector) ## Installation ### pip ```bash pip install https://github.com/CrawlScript/WebCollector-Python/archive/master.zip ``` ## Example Index ### Basic + [demo_auto_news_crawler.py](examples/demo_auto_news_crawler.py) + [demo_manual_news_crawler.py](examples/demo_manual_news_crawler.py) ## Quickstart ### Automatically Detecting URLs [demo_auto_news_crawler.py](examples/demo_auto_news_crawler.py): ```python # coding=utf-8 import webcollector as wc class NewsCrawler(wc.RamCrawler): def __init__(self): super().__init__(auto_detect=True) self.num_threads = 10 self.add_seed("https://github.blog/") self.add_regex("+https://github.blog/[0-9]+.*") self.add_regex("-.*#.*") # do not detect urls that contain "#" def visit(self, page, detected): if page.match_url("https://github.blog/[0-9]+.*"): title = page.select("h1.lh-condensed")[0].text.strip() content = page.select("div.markdown-body")[0].text.replace("\n", " ").strip() print("\nURL: ", page.url) print("TITLE: ", title) print("CONTENT: ", content[:50], "...") crawler = NewsCrawler() crawler.start(10) ``` ### Manually Detecting URLs [demo_manual_news_crawler.py](examples/demo_manual_news_crawler.py): ```python # coding=utf-8 import webcollector as wc class NewsCrawler(wc.RamCrawler): def __init__(self): super().__init__(auto_detect=False) self.num_threads = 10 self.add_seed("https://github.blog/") def visit(self, page, detected): detected.extend(page.links("https://github.blog/[0-9]+.*")) if page.match_url("https://github.blog/[0-9]+.*"): title = page.select("h1.lh-condensed")[0].text.strip() content = page.select("div.markdown-body")[0].text.replace("\n", " ").strip() print("\nURL: ", page.url) print("TITLE: ", title) print("CONTENT: ", content[:50], "...") crawler = NewsCrawler() crawler.start(10) ``` ### Filter Detected URLs by detected_filter Plugin [demo_detected_filter.py](examples/demo_detected_filter.py): ```python # coding=utf-8 import webcollector as wc from webcollector.filter import Filter import re class RegexDetectedFilter(Filter): def filter(self, crawl_datum): if re.fullmatch("https://github.blog/2019-02.*", crawl_datum.url): return crawl_datum else: print("filtered by detected_filter: {}".format(crawl_datum.brief_info())) return None class NewsCrawler(wc.RamCrawler): def __init__(self): super().__init__(auto_detect=True, detected_filter=RegexDetectedFilter()) self.num_threads = 10 self.add_seed("https://github.blog/") def visit(self, page, detected): detected.extend(page.links("https://github.blog/[0-9]+.*")) if page.match_url("https://github.blog/[0-9]+.*"): title = page.select("h1.lh-condensed")[0].text.strip() content = page.select("div.markdown-body")[0].text.replace("\n", " ").strip() print("\nURL: ", page.url) print("TITLE: ", title) print("CONTENT: ", content[:50], "...") crawler = NewsCrawler() crawler.start(10) ``` ### Resume Crawling by RedisCrawler [demo_redis_crawler.py](examples/demo_redis_crawler.py): ```python # coding=utf-8 from redis import StrictRedis import webcollector as wc class NewsCrawler(wc.RedisCrawler): def __init__(self): super().__init__(redis_client=StrictRedis("127.0.0.1"), db_prefix="news", auto_detect=True) self.num_threads = 10 self.resumable = True # you can resume crawling after shutdown self.add_seed("https://github.blog/") self.add_regex("+https://github.blog/[0-9]+.*") self.add_regex("-.*#.*") # do not detect urls that contain "#" def visit(self, page, detected): if page.match_url("https://github.blog/[0-9]+.*"): title = page.select("h1.lh-condensed")[0].text.strip() content = page.select("div.markdown-body")[0].text.replace("\n", " ").strip() print("\nURL: ", page.url) print("TITLE: ", title) print("CONTENT: ", content[:50], "...") crawler = NewsCrawler() crawler.start(10) ```