原文地址:http://www.jianshu.com/p/c3fc3129407d
1. 爬虫框架webmagic
WebMagic是一个简单灵活的爬虫框架。基于WebMagic,你可以快速开发出一个高效、易维护的爬虫。
1.1 官网地址
官网文档写的比较清楚,建议大家直接阅读官方文档,也可以阅读下面的内容。地址如下:
官网:
中文文档地址:
English:
2. webmagic与spring boot框架集成
spring boot
与webmagic
的结合主要有三个模块,分别为爬取模块Processor
,入库模块Pipeline
,向数据库存入爬取数据,和定时任务模块Scheduled
,复制定时爬取网站数据。
2.1 maven添加
2.2 爬取模块Processor
爬取简书首页Processor,分析简书首页的页面数据,获取响应的简书链接和标题,放入wegmagic的Page中,到入库模块取出添加到数据库。代码如下:
<span class="hljs-keyword">import com.shang.spray.entity.Sources;<span class="hljs-keyword">import com.shang.spray.pipeline.NewsPipeline;
<span class="hljs-keyword">import us.codecraft.webmagic.Page;
<span class="hljs-keyword">import us.codecraft.webmagic.Site;
<span class="hljs-keyword">import us.codecraft.webmagic.Spider;
<span class="hljs-keyword">import us.codecraft.webmagic.processor.PageProcessor;
<span class="hljs-keyword">import us.codecraft.webmagic.selector.Selectable;
<span class="hljs-keyword">import java.util.List;
<span class="hljs-comment">/**
-
info:简书首页爬虫
-
Created by shang on 16/9/9.
*/
<span class="hljs-keyword">public <span class="hljs-class"><span class="hljs-keyword">class <span class="hljs-title">JianShuProcessor <span class="hljs-keyword"><span class="hljs-keyword">implements <span class="hljs-type">PageProcessor {<span class="hljs-keyword">private Site site = Site.me()
.setDomain(<span class="hljs-string">"jianshu.com")
.setSleepTime(<span class="hljs-number">100)
.setUserAgent(<span class="hljs-string">"Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_5) AppleWebKit/537.36 (KHTML,like Gecko) Chrome/52.0.2743.116 Safari/537.36");
;<span class="hljs-keyword">public <span class="hljs-keyword">static final <span class="hljs-keyword">String list = <span class="hljs-string">"http://www.jianshu.com";
@Override
<span class="hljs-keyword">public void process(Page page) {
<span class="hljs-keyword">if (page.getUrl().regex(list).match()) {
Listlist=page.getHtml().xpath(<span class="hljs-string">"//ul[@class='article-list thumbnails']/li").nodes();
<span class="hljs-keyword">for (Selectable s : <span class="hljs-type">list) {
<span class="hljs-keyword">String title=s.xpath(<span class="hljs-string">"//div/h4/a/text()").toString();
<span class="hljs-keyword">String link=s.xpath(<span class="hljs-string">"//div/h4").links().toString();
News <span class="hljs-keyword">new<span class="hljs-type">s=<span class="hljs-keyword">new <span class="hljs-type">News();
<span class="hljs-keyword">new<span class="hljs-type">s.setTitle(title);
<span class="hljs-keyword">new<span class="hljs-type">s.setInfo(title);
<span class="hljs-keyword">new<span class="hljs-type">s.setLink(link);
<span class="hljs-keyword">new<span class="hljs-type">s.setSources(<span class="hljs-keyword">new <span class="hljs-type">Sources(<span class="hljs-number">5));
page.putField(<span class="hljs-string">"news"+title,<span class="hljs-keyword">new<span class="hljs-type">s);
}
}
}@Override
<span class="hljs-keyword">public Site getSite() {
<span class="hljs-keyword">return site;
}<span class="hljs-keyword">public <span class="hljs-keyword">static void main(<span class="hljs-keyword">String[] args) {
Spider spider=Spider.create(<span class="hljs-keyword">new <span class="hljs-type">JianShuProcessor());
spider.addUrl(<span class="hljs-string">"http://www.jianshu.com");
spider.addPipeline(<span class="hljs-keyword">new <span class="hljs-type">NewsPipeline());
spider.thread(<span class="hljs-number">5);
spider.setExitWhenComplete(<span class="hljs-literal">true);
spider.start();
}
}2.3 入库模块
Pipeline
入库模块结合spring boot的Repository模块一起组合成入库方法,继承webmagic的Pipeline,然后实现方法,在process方法中获取爬虫模块的数据,然后调用spring boot的save方法。代码如下:
<span class="hljs-keyword">import com.shang.spray.entity.News;
<span class="hljs-keyword">import com.shang.spray.entity.Sources;
<span class="hljs-keyword">import com.shang.spray.repository.NewsRepository;
<span class="hljs-keyword">import org.apache.commons.lang3.StringUtils;
<span class="hljs-keyword">import org.springframework.beans.factory.annotation.Autowired;
<span class="hljs-keyword">import org.springframework.data.jpa.domain.Specification;
<span class="hljs-keyword">import org.springframework.stereotype.Repository;
<span class="hljs-keyword">import us.codecraft.webmagic.ResultItems;
<span class="hljs-keyword">import us.codecraft.webmagic.Task;
<span class="hljs-keyword">import us.codecraft.webmagic.pipeline.Pipeline;<span class="hljs-keyword">import javax.persistence.criteria.CriteriaBuilder;
<span class="hljs-keyword">import javax.persistence.criteria.CriteriaQuery;
<span class="hljs-keyword">import javax.persistence.criteria.Predicate;
<span class="hljs-keyword">import javax.persistence.criteria.Root;
<span class="hljs-keyword">import java.util.ArrayList;
<span class="hljs-keyword">import java.util.Date;
<span class="hljs-keyword">import java.util.List;
<span class="hljs-keyword">import java.util.Map;<span class="hljs-comment">/**
-
info:新闻
-
Created by shang on 16/8/22.
*/
@Repository
<span class="hljs-keyword">public <span class="hljs-class"><span class="hljs-keyword">class <span class="hljs-title">NewsPipeline <span class="hljs-keyword"><span class="hljs-keyword">implements <span class="hljs-type">Pipeline {@Autowired
protected NewsRepository <span class="hljs-keyword">new<span class="hljs-type">sRepository;@Override
<span class="hljs-keyword">public void process(ResultItems resultItems,Task task) {
<span class="hljs-keyword">for (Map.Entry<<span class="hljs-keyword">String,Object> entry : <span class="hljs-type">resultItems.getAll().entrySet()) {
<span class="hljs-keyword">if (entry.getKey().contains(<span class="hljs-string">"news")) {
News <span class="hljs-keyword">new<span class="hljs-type">s=(News) entry.getValue();
Specificationspecification=<span class="hljs-keyword">new <span class="hljs-type">Specification () {
@Override
<span class="hljs-keyword">public Predicate toPredicate(Rootroot,CriteriaQuery<?> criteriaQuery,CriteriaBuilder criteriaBuilder) {
<span class="hljs-keyword">return criteriaBuilder.and(criteriaBuilder.equal(root.<span class="hljs-keyword">get(<span class="hljs-string">"link"),<span class="hljs-keyword">new<span class="hljs-type">s.getLink()));
}
};
<span class="hljs-keyword">if (<span class="hljs-keyword">new<span class="hljs-type">sRepository.findOne(specification) == <span class="hljs-literal">null) {<span class="hljs-comment">//检查链接是否已存在
<span class="hljs-keyword">new<span class="hljs-type">s.setAuthor(<span class="hljs-string">"水花");
<span class="hljs-keyword">new<span class="hljs-type">s.setTypeId(<span class="hljs-number">1);
<span class="hljs-keyword">new<span class="hljs-type">s.setSort(<span class="hljs-number">1);
<span class="hljs-keyword">new<span class="hljs-type">s.setStatus(<span class="hljs-number">1);
<span class="hljs-keyword">new<span class="hljs-type">s.setExplicitLink(<span class="hljs-literal">true);
<span class="hljs-keyword">new<span class="hljs-type">s.setCreateDate(<span class="hljs-keyword">new <span class="hljs-type">Date());
<span class="hljs-keyword">new<span class="hljs-type">s.setModifyDate(<span class="hljs-keyword">new <span class="hljs-type">Date());
<span class="hljs-keyword">new<span class="hljs-type">sRepository.save(<span class="hljs-keyword">new<span class="hljs-type">s);
}
}}
}
}2.4 定时任务模块
Scheduled
-