# GSpider

**Repository Path**: jiangmitiao/GSpider

## Basic Information

- **Project Name**: GSpider
- **Description**: A vertical (focused) web crawler written in Java
- **Primary Language**: Java
- **License**: Apache-2.0
- **Default Branch**: master
- **Homepage**: None
- **GVP Project**: No

## Statistics

- **Stars**: 2
- **Forks**: 1
- **Created**: 2015-06-11
- **Last Updated**: 2021-08-05

## Categories & Tags

**Categories**: Uncategorized

**Tags**: None

## README

# GSpider

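**GSpider** is a vertical web crawler written in Java that is very simple to use. An overview:
 
- **Page fetching**: uses the `Jsoup` framework to parse pages;
- **Task setup**: just implement the `PageProcessor` interface;
- **Configuration options**: the number of task threads, server error code messages, and more;
- **Batch tasks**: batch tasks can be run from an XML configuration file *(in testing)*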

----------


## Crawler design

### Task flow
```flow
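st=>start: Initialize GSpider
e=>end: End
op1=>operation: Set parameters
op2=>operation: init()
op3=>operation: Add seed URLs, crawl rules, and the page processor class
op4=>operation: check() looks for errors
cond1=>condition: Everything OK?
op5=>operation: controlInit() initializes the control command line
op6=>operation: Control with the start, look, and stop commands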

st->op1->op2->op3->op4->cond1
cond1(yes)->op5->op6->e
cond1(no)->e
```


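## Usage

### Simple example

The following example crawls `Zhihu`. No crawl task is configured, so by default the crawler collects each page's `title` + `URL`.
After the init method has run, type the start command in the console to begin crawling. While crawling, type look to see the status of the current threads. You can also type add to add crawl sources. To stop the crawler, type stop.

#### Code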
``` java
package com.sinaapp.gavinzhang.GSpider.examples;

import com.sinaapp.gavinzhang.GSpider.GSpider;
import java.util.regex.Pattern;

/**
 * Created by gavin on 15-7-23.
 */
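public class Zhihu {
    public static void main(String[] args) throws Exception
    {
        GSpider nst = new GSpider();
        // Crawl rule: only follow links that match Zhihu question pages.
        Pattern p = Pattern.compile("http://www.zhihu.com/question/[0-9]+");
        nst.addRegex(p);
        // Seed URL where crawling starts.
        nst.addWebUrl("http://www.zhihu.com/explore");
        // Open the console that accepts the start, look, add and stop commands.
        nst.controlInit();
    }
}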
```
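
The console commands described above (start, look, add, stop) can be pictured as a small read-and-dispatch loop. The sketch below is only an illustration of that idea, not GSpider's actual implementation: the class `ConsoleControlSketch`, its `handle` method, the status messages, and the `add <url>` argument syntax are all assumptions made for this example.

``` java
import java.util.ArrayList;
import java.util.List;
import java.util.Scanner;

/** Hypothetical console controller; not part of GSpider. */
class ConsoleControlSketch {
    private final List<String> seeds = new ArrayList<>();
    private boolean running = false;

    /** Handles one command line and returns a short status message. */
    String handle(String line) {
        // Split into the command word and an optional argument.
        String[] parts = line.trim().split("\\s+", 2);
        switch (parts[0]) {
            case "start":
                running = true;
                return "crawling started";
            case "look":
                return "running=" + running + ", seeds=" + seeds.size();
            case "add":
                if (parts.length < 2) {
                    return "usage: add <url>";
                }
                seeds.add(parts[1]);
                return "added " + parts[1];
            case "stop":
                running = false;
                return "crawling stopped";
            default:
                return "unknown command: " + parts[0];
        }
    }

    public static void main(String[] args) {
        ConsoleControlSketch control = new ConsoleControlSketch();
        Scanner in = new Scanner(System.in);
        // Read commands until stop is entered or input ends.
        while (in.hasNextLine()) {
            String line = in.nextLine();
            System.out.println(control.handle(line));
            if (line.trim().equals("stop")) {
                break;
            }
        }
    }
}
```

Keeping the dispatch in a method that returns a string, rather than printing directly, makes the command handling easy to exercise without a real console.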

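### Example with parameters

This example also crawls `Zhihu`.
Create a Zhihu class that extends DefaultPageProcessor, and work with the Document object `html` in its dispose method.
The crawler is started from the main method.
#### Code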
``` java
package com.sinaapp.gavinzhang.GSpider.examples;

import com.sinaapp.gavinzhang.GSpider.DefaultPageProcessor;
import com.sinaapp.gavinzhang.GSpider.Exception.GSpiderInitException;
import com.sinaapp.gavinzhang.GSpider.GSpider;
import com.sinaapp.gavinzhang.GSpider.GSpiderBuilder;
import org.jsoup.nodes.Document;
import java.util.regex.Pattern;

/**
 * Created by gavin on 15-7-23.
 */
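public class Zhihu extends DefaultPageProcessor {

    // Called for each crawled page; html is the parsed page, webUrl its address.
    @Override
    public void dispose(Document html, String webUrl) {
        //super.dispose(html, webUrl);
        System.out.println(html.title());
    }

    public static void main(String[] args) throws GSpiderInitException {
        GSpiderBuilder gSpiderBuilder = new GSpiderBuilder();
        // Seed URL where crawling starts.
        gSpiderBuilder.addWebUrl("http://www.zhihu.com/explore");
        // Use this class to process every crawled page.
        gSpiderBuilder.setPageProcessor(new Zhihu());
        // Crawl rule: only follow links that match Zhihu question pages.
        gSpiderBuilder.addRegex(Pattern.compile("http://www.zhihu.com/question/[0-9]+"));
        GSpider gSpider = gSpiderBuilder.build();
        // Open the console that accepts the start, look, add and stop commands.
        gSpider.controlInit();
    }
}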
```
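
To see which URLs a crawl rule like the one above lets through, the pattern can be tried on its own. This is a minimal sketch using only `java.util.regex`. The class `CrawlRuleSketch` and its `accepts` method are hypothetical, and it assumes the rule must match the whole URL; how GSpider applies the pattern internally is not shown in this README. The dots are escaped here so that `.` matches only a literal dot.

``` java
import java.util.regex.Pattern;

/** Hypothetical helper for trying out a crawl rule; not part of GSpider. */
class CrawlRuleSketch {
    // Same rule as in the examples, with the dots escaped.
    static final Pattern QUESTION =
            Pattern.compile("http://www\\.zhihu\\.com/question/[0-9]+");

    /** Returns true when the whole URL matches the crawl rule. */
    static boolean accepts(String url) {
        return QUESTION.matcher(url).matches();
    }

    public static void main(String[] args) {
        System.out.println(accepts("http://www.zhihu.com/question/12345")); // true
        System.out.println(accepts("http://www.zhihu.com/explore"));        // false
    }
}
```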

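## About

**GSpider** draws on ideas shared by experienced developers online. I am a student, and I am grateful to everyone who shares their knowledge on the web. This is an initial version, and I will keep updating it.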

## Feedback and suggestions
- Weibo: [@gavin要加油](http://weibo.com/wildfireg13)
- Email: <zhang159916@gmail.com>


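## Changelog

### 1.2.0
1. Added the PageProcessor interface. The AbstractPageProcessor abstract class implements it, and DefaultPageProcessor extends that abstract class. (9.7)
2. Implemented a custom page download class, HttpProcessor, together with the HttpProcessorInterceptor interface, so that headers can be set before a page is downloaded. (9.10)
3. Added gzip compression support. (9.15)
4. The Redis to-visit list is implemented as a queue. Items are fetched with brpop(0,xxx), which blocks while the queue is empty.
5. Planned: replace the default blocking visited list with the visited-list class from the concurrent package.
6. Major progress: the crawler status can now be viewed over HTTP on port 41108. (9.29)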
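
The 1.2.0 notes mention gzip support. As a rough illustration of what decoding a gzip-compressed response body involves, here is a sketch that uses only the JDK's `java.util.zip` classes. The class `GzipBodySketch` and its methods are hypothetical and are not taken from GSpider's HttpProcessor. It assumes the body is UTF-8 text.

``` java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

/** Hypothetical helper; not GSpider's HttpProcessor. */
class GzipBodySketch {
    /** Decompresses a gzip-encoded body into a UTF-8 string. */
    static String gunzip(byte[] compressed) {
        try (InputStream in = new GZIPInputStream(new ByteArrayInputStream(compressed));
             ByteArrayOutputStream out = new ByteArrayOutputStream()) {
            byte[] buffer = new byte[4096];
            int n;
            while ((n = in.read(buffer)) != -1) {
                out.write(buffer, 0, n);
            }
            return new String(out.toByteArray(), StandardCharsets.UTF_8);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Compresses text with gzip; used here only to build sample input. */
    static byte[] gzip(String text) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        try (GZIPOutputStream gz = new GZIPOutputStream(out)) {
            gz.write(text.getBytes(StandardCharsets.UTF_8));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return out.toByteArray();
    }

    public static void main(String[] args) {
        byte[] body = gzip("<html><title>GSpider</title></html>");
        System.out.println(gunzip(body));
    }
}
```

In a real download path, a client would typically decompress only when the response carries a `Content-Encoding: gzip` header.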

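### 1.1.18
1. Introduced parameter objects that configure GSpider and the processor separately. This should be somewhat more convenient than the builder pattern.
2. Added an HTTP status checker to decouple that logic. The number of processor instances is set equal to the size of the task thread pool, and the instances are kept.
3. URLs waiting to be crawled are now fetched from a blocking container.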
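
The blocking container mentioned in 1.1.18 behaves like the JDK's `BlockingQueue`: a worker that asks for the next URL waits until one is available. The following is a sketch under that assumption, not GSpider's own class. `PendingUrlQueueSketch` and its method names are made up for this example.

``` java
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;

/** Hypothetical to-crawl list backed by a blocking queue; not part of GSpider. */
class PendingUrlQueueSketch {
    private final BlockingQueue<String> pending = new LinkedBlockingQueue<>();

    /** Adds a newly discovered URL to the end of the queue. */
    void add(String url) {
        pending.add(url);
    }

    /** Blocks until a URL is available; returns null if the thread is interrupted. */
    String take() {
        try {
            return pending.take();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt(); // keep the interrupt visible to the caller
            return null;
        }
    }

    public static void main(String[] args) {
        PendingUrlQueueSketch queue = new PendingUrlQueueSketch();
        // Producer: simulates a page processor that finds a URL a little later.
        Thread producer = new Thread(() -> {
            try {
                Thread.sleep(100);
            } catch (InterruptedException e) {
                return;
            }
            queue.add("http://www.zhihu.com/explore");
        });
        producer.start();
        // take() waits here until the producer has added the URL.
        System.out.println(queue.take());
    }
}
```

With this design, idle crawl threads sleep inside `take()` instead of polling an empty list in a loop.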

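### 1.1.9
1. New implementation of the crawled list.
   It is built on ConcurrentHashMap: see ConcurrentFoundWebUrlList in the plugin/concurrent package. It is not enabled by default.
2. The Bloom processor has a minimum capacity of 8388608*32. With fewer than 150,000 URLs, the error rate stays below 2%.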
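
A crawled list built on ConcurrentHashMap, as in item 1 of 1.1.9, could look roughly like the sketch below. It is not the code of ConcurrentFoundWebUrlList. The class `VisitedUrlSetSketch` and its methods are assumptions for illustration, using the key-set view that `ConcurrentHashMap.newKeySet()` provides.

``` java
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

/** Hypothetical crawled list; not GSpider's ConcurrentFoundWebUrlList. */
class VisitedUrlSetSketch {
    // Set view backed by a ConcurrentHashMap, safe to share between crawl threads.
    private final Set<String> visited = ConcurrentHashMap.newKeySet();

    /** Atomically marks a URL as crawled; returns true only the first time it is seen. */
    boolean markVisited(String url) {
        return visited.add(url);
    }

    /** Returns true if the URL has already been marked as crawled. */
    boolean isVisited(String url) {
        return visited.contains(url);
    }

    public static void main(String[] args) {
        VisitedUrlSetSketch list = new VisitedUrlSetSketch();
        System.out.println(list.markVisited("http://www.zhihu.com/explore")); // true
        System.out.println(list.markVisited("http://www.zhihu.com/explore")); // false
    }
}
```

Because `add` is atomic, two threads that discover the same URL at the same moment cannot both be told it is new.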

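### 1.1.8
1. Optimized the crawler's fetch stage: URLs obtained from page processing are added to the to-crawl list first.
2. Added an experimental Bloom processor.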
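
For readers unfamiliar with Bloom filters, the sketch below shows the general technique behind a Bloom-based visited check: several hash functions each set one bit per URL, and a URL counts as possibly seen only if all of its bits are set. This is a generic illustration, not GSpider's Bloom processor. The class `BloomFilterSketch`, the seeded FNV-1a style hash, and the sizes used in `main` are all assumptions, and the bit array is much smaller than the capacity quoted in the 1.1.9 notes.

``` java
import java.nio.charset.StandardCharsets;
import java.util.BitSet;

/** Generic Bloom filter sketch; not GSpider's Bloom processor. */
class BloomFilterSketch {
    private final BitSet bits;
    private final int size;
    private final int hashCount;

    BloomFilterSketch(int size, int hashCount) {
        this.bits = new BitSet(size);
        this.size = size;
        this.hashCount = hashCount;
    }

    /** Seeded FNV-1a style hash, reduced to a valid bit index. */
    private int index(String url, int seed) {
        int h = 0x811C9DC5 ^ seed;
        for (byte b : url.getBytes(StandardCharsets.UTF_8)) {
            h ^= (b & 0xFF);
            h *= 16777619;
        }
        return Math.floorMod(h, size);
    }

    /** Sets one bit per hash function for this URL. */
    void add(String url) {
        for (int i = 0; i < hashCount; i++) {
            bits.set(index(url, i * 0x9E3779B9));
        }
    }

    /** False means definitely not added; true means probably added. */
    boolean mightContain(String url) {
        for (int i = 0; i < hashCount; i++) {
            if (!bits.get(index(url, i * 0x9E3779B9))) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        BloomFilterSketch seen = new BloomFilterSketch(1 << 20, 4);
        seen.add("http://www.zhihu.com/explore");
        System.out.println(seen.mightContain("http://www.zhihu.com/explore")); // true
    }
}
```

A Bloom filter never reports an added URL as unseen, but it can occasionally report an unseen URL as seen, which is the error rate the 1.1.9 notes refer to.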

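## Lessons learned
- To use absUrl with jsoup, the base URL must be supplied when parsing (for example, as the second argument of `Jsoup.parse(html, baseUri)`).
- In a setter, leaving out `this.` made the assignment fail, because the parameter was assigned to itself instead of to the field.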
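
The setter mistake is easy to reproduce. In this small sketch (the class `SetterSketch` and its field are invented for the example), the parameter has the same name as the field, so without `this.` the assignment never reaches the field:

``` java
/** Demonstrates the missing "this." setter bug; names are hypothetical. */
class SetterSketch {
    private int threadCount;

    // Buggy: the parameter shadows the field, so this assigns the parameter to itself.
    void setThreadCountBuggy(int threadCount) {
        threadCount = threadCount;
    }

    // Correct: "this." makes the left-hand side refer to the field.
    void setThreadCount(int threadCount) {
        this.threadCount = threadCount;
    }

    int getThreadCount() {
        return threadCount;
    }

    public static void main(String[] args) {
        SetterSketch s = new SetterSketch();
        s.setThreadCountBuggy(8);
        System.out.println(s.getThreadCount()); // 0, the field was never changed
        s.setThreadCount(8);
        System.out.println(s.getThreadCount()); // 8
    }
}
```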