

Sharing a Simple Java Crawler Framework

2021-02-03 11:17 qq_35488769 Java Tutorials

This article presents a simple Java crawler framework. It should be a useful reference for anyone who needs one.

Writing a separate crawler for every site gets tedious, so I built a small framework of my own.

You can customize:

The request (default: a GET request with a Chrome user-agent); implement the RequestSet interface to supply your own.

Storage (default: pages are saved under the html folder on the F: drive); implement the SaveUtil interface to supply your own.

Which resources to save (default: the entire HTML page).

URL filtering (default: every URL qualifies); implement the ResourseChooser interface to decide which URLs to follow and which pages to save as resources.

實(shí)現(xiàn)的部分有:

html頁面的下載方式,通過HttpClient實(shí)現(xiàn)html頁面的下載

html頁面的解析部分,通過jsoup實(shí)現(xiàn)html頁面的解析

The HtmlDownloader class downloads the HTML page at a given URL.

package DownloadPackage;
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import org.apache.http.HttpEntity;
import org.apache.http.HttpResponse;
import org.apache.http.impl.client.CloseableHttpClient;
import org.apache.http.impl.client.HttpClients;
/*
 * Downloads the HTML page at a given url.
 */
public class HtmlDownloader {
    RequestSet requestset = null;
    public HtmlDownloader(RequestSet requestset){
        this.requestset = requestset;
    }
    public String downloadhtml(String url){
        String html = null;
        //create a client, then a reader that pulls the html out of the response entity
        BufferedReader reader = null;
        CloseableHttpClient httpclient = HttpClients.createDefault();
        HttpResponse response = null;
        try {
            response = httpclient.execute(requestset.getMethod(url));
            HttpEntity entity = response.getEntity();
            reader = new BufferedReader(new InputStreamReader(entity.getContent()));
            StringBuilder sb = new StringBuilder();
            String line = null;
            while((line = reader.readLine()) != null){
                sb.append(line).append('\n');
            }
            html = sb.toString();
            System.out.println("fetched one html page");
        }
        catch (IOException e) {
            System.out.println("failed to connect to "+url);
        }
        finally{
            //close the client even if the reader was never opened
            try {
                if(reader != null){
                    reader.close();
                }
                httpclient.close();
            }
            catch (IOException e) {
                e.printStackTrace();
            }
        }
        return html;
    }
}
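The downloader reads the response line by line. An equivalent stdlib-only approach reads the raw bytes in one pass and decodes them at the end, which sidesteps per-line handling and preserves the page's original line endings (the helper class name here is made up for illustration):

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;

public class StreamToString {
    // Copy an entire InputStream into a byte buffer, then decode it as UTF-8.
    // Unlike readLine()-based loops, this keeps the page's line terminators.
    public static String read(InputStream in) throws IOException {
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        byte[] chunk = new byte[8192];
        int n;
        while ((n = in.read(chunk)) != -1) {
            buf.write(chunk, 0, n);
        }
        return new String(buf.toByteArray(), StandardCharsets.UTF_8);
    }
}
```

Real pages may declare a charset other than UTF-8 in their Content-Type header; a production version would take the charset from the HttpEntity instead of assuming UTF-8.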

The UrlGet class extracts all URL links from an HTML page.

package DownloadPackage;
import java.util.LinkedList;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;
public class UrlGet {
    public LinkedList<String> geturls(String html){
        LinkedList<String> urls = new LinkedList<String>();
        Document doc = Jsoup.parse(html);
        Elements links = doc.getElementsByTag("a");
        for (Element link:links){
            String url = link.attr("href");
            urls.add(url);
        }
        return urls;
    }
}
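For comparison, here is a rough jsoup-free sketch of the same extraction using only the JDK's regex classes. jsoup is far more robust against messy real-world HTML, so this is only an illustration of what geturls pulls out of a page (the class name is invented):

```java
import java.util.LinkedList;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class RegexUrlGet {
    // Match <a ... href="..."> (or single quotes) and capture the href value.
    private static final Pattern HREF =
        Pattern.compile("<a\\s[^>]*href=[\"']([^\"']*)[\"']", Pattern.CASE_INSENSITIVE);

    // Collect every href value found in the html, in document order.
    public static LinkedList<String> geturls(String html) {
        LinkedList<String> urls = new LinkedList<String>();
        Matcher m = HREF.matcher(html);
        while (m.find()) {
            urls.add(m.group(1));
        }
        return urls;
    }
}
```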

The resource-chooser interface requires three methods: isNeed decides whether a URL should be crawled at all; isResourse decides whether the page behind a URL is one of the resources we want to save; and process rewrites a URL, because pages sometimes contain URLs we want but in the wrong format (for example relative links).

package ChoosePackage;
public interface ResourseChooser {
    public Boolean isNeed(String url);
    public Boolean isResourse(String url);
    public String process(String url);
}
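As a hypothetical example, a chooser that keeps the crawl on a single site and only saves .html pages might look like this (the interface is repeated so the sketch is self-contained; in the project it lives in ChoosePackage, and SameSiteChooser is an invented name):

```java
// Repeated here so the sketch compiles on its own.
interface ResourseChooser {
    Boolean isNeed(String url);
    Boolean isResourse(String url);
    String process(String url);
}

public class SameSiteChooser implements ResourseChooser {
    private final String root;   // e.g. "http://www.bilibili.net"

    public SameSiteChooser(String root) {
        this.root = root;
    }

    // Follow relative links and absolute links that stay on the same site.
    public Boolean isNeed(String url) {
        return url.startsWith("/") || url.startsWith(root);
    }

    // Only pages ending in ".html" count as resources worth saving.
    public Boolean isResourse(String url) {
        return url.endsWith(".html");
    }

    // Turn relative links into absolute ones.
    public String process(String url) {
        return url.startsWith("http") ? url : root + url;
    }
}
```

An instance would be passed to one of the Spider constructors that accepts a ResourseChooser.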

The RequestSet interface is for customizing the request; implement getMethod to return the request to execute.

package DownloadPackage;
import org.apache.http.client.methods.HttpGet;
/*
 * An interface for supplying the request to execute.
 * Implement getMethod to return the Get request for a url.
 */
public interface RequestSet {
    public HttpGet getMethod(String url);
}
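The framework builds on Apache HttpClient, but the same "configurable request" idea can be sketched with the HTTP client built into the JDK since Java 11. Assuming Java 11+, a GET with a browser user-agent and a timeout looks like this (the class name is invented):

```java
import java.net.URI;
import java.net.http.HttpRequest;
import java.time.Duration;

public class JdkRequestSet {
    // Build a GET request for the url with a browser-like user-agent header
    // and a 10-second timeout, mirroring what getRequest below does with
    // Apache HttpClient's HttpGet and RequestConfig.
    public static HttpRequest getMethod(String url) {
        return HttpRequest.newBuilder(URI.create(url))
                .header("User-Agent", "Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36")
                .timeout(Duration.ofSeconds(10))
                .GET()
                .build();
    }
}
```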
The SaveUtil interface is for customizing storage; implement the save method.

package SaveUtil;
/*
 * Storage interface: implementations must provide the save method.
 */
public interface SaveUtil {
    public void save(String url,String html);
}
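A hypothetical custom implementation that writes each page into a directory of your choice with java.nio (the interface is repeated for self-containment; DirSaveUtil is an invented name):

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

// Repeated here so the sketch compiles on its own.
interface SaveUtil {
    void save(String url, String html);
}

public class DirSaveUtil implements SaveUtil {
    private final Path dir;

    public DirSaveUtil(Path dir) {
        this.dir = dir;
    }

    public void save(String url, String html) {
        // Derive a flat, filesystem-safe filename from the url.
        String name = url.replaceAll("[^A-Za-z0-9]", "_") + ".html";
        try {
            Files.createDirectories(dir);
            Files.write(dir.resolve(name), html.getBytes(StandardCharsets.UTF_8));
        } catch (IOException e) {
            System.out.println("failed to save " + url);
        }
    }
}
```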

The Spider class has five constructors supporting various combinations of custom components, and contains default implementations of the interfaces above.

package Spider;
import java.io.BufferedWriter;
import java.io.File;
import java.io.FileWriter;
import java.io.IOException;
import java.util.HashSet;
import java.util.Iterator;
import java.util.LinkedList;
import org.apache.http.client.config.RequestConfig;
import org.apache.http.client.methods.HttpGet;
import ChoosePackage.MyResourseChooser;
import ChoosePackage.ResourseChooser;
import DownloadPackage.HtmlDownloader;
import DownloadPackage.RequestSet;
import DownloadPackage.UrlGet;
import SaveUtil.MySaveUtil;
import SaveUtil.SaveUtil;
/*
 * The class that actually crawls resources.
 */
public class Spider{
    public static void main(String[] args) {
        new Spider("http://www.bilibili.net").spiderstart();
    }
    //seed url
    String seed = null;
    //storage strategy; supply your own implementation
    private SaveUtil saveutil = null;
    //html downloader
    private HtmlDownloader downloader = null;
    //url extractor
    private UrlGet urldownloader = null;
    //resource chooser
    private ResourseChooser resoursechooser = null;
    //pages not yet downloaded
    LinkedList<String> unvisited = new LinkedList<String>();
    //pages already downloaded
    HashSet<String> visited = new HashSet<String>();
    //constructor with custom storage, request and resource-choosing strategies
    public Spider(SaveUtil saveutil,RequestSet request,ResourseChooser resoursechooser,String seed){
        this.saveutil = saveutil;
        this.downloader = new HtmlDownloader(request);
        this.urldownloader = new UrlGet();
        this.resoursechooser = resoursechooser;
        this.seed = seed;
        unvisited.add(seed);
    }
    //自定義儲(chǔ)存方式,資源篩選方式的構(gòu)造方法
    public Spider(SaveUtil saveutil,ResourseChooser resoursechooser,String seed){
        this.resoursechooser = resoursechooser;
        this.downloader = new HtmlDownloader(new getRequest());
        this.saveutil = saveutil;
        this.urldownloader = new UrlGet();
        this.seed = seed;
        unvisited.add(seed);
    }
    //自定義儲(chǔ)存方式,請求的構(gòu)造方法
    public Spider(SaveUtil saveutil,RequestSet requestset,String seed){
        this.saveutil = saveutil;
        this.downloader = new HtmlDownloader(requestset);
        this.resoursechooser = new MyResourseChooser();
        this.urldownloader = new UrlGet();
        this.seed = seed;
        unvisited.add(seed);
    }
    //自定義儲(chǔ)存方式的構(gòu)造方法
    public Spider(SaveUtil saveutil,String seed){
        this.saveutil = saveutil;
        this.downloader = new HtmlDownloader(new getRequest());
        this.resoursechooser = (new MyResourseChooser());
        this.urldownloader = new UrlGet();
        this.seed = seed;
        unvisited.add(seed);
    }
    //默認(rèn)的爬蟲構(gòu)造方法
    public Spider(String seed){
        this.saveutil = new MySaveUtil();
        this.downloader = new HtmlDownloader(new getRequest());
        this.resoursechooser = (new MyResourseChooser());
        this.urldownloader = new UrlGet();
        this.seed = seed;
        unvisited.add(seed);
    }
    //start crawling
    private void spiderstart(){
        String html = null;
        while(!unvisited.isEmpty()){
            String url = unvisited.poll();
            System.out.println("fetching "+url);
            if(resoursechooser.isNeed(url)){
                try{
                    html = downloader.downloadhtml(url);
                }
                catch(RuntimeException e){
                    System.out.println("failed to fetch "+url);
                    continue;
                }
                visited.add(url);
                LinkedList<String> urls = new LinkedList<String>();
                try{
                    urls = urldownloader.geturls(html);
                }
                catch(RuntimeException e){
                    System.out.println("empty html page for "+url);
                    continue;
                }
                Iterator<String> it = urls.iterator();
                while(it.hasNext()){
                    String newurl = it.next();
                    if(resoursechooser.isNeed(newurl)&&!visited.contains(newurl)&&!unvisited.contains(newurl)){
                        newurl = resoursechooser.process(newurl);
                        unvisited.add(newurl);
                        System.out.println(newurl+" added to the queue");
                    }
                }
                System.out.println("collected all urls on "+url);
                if(resoursechooser.isResourse(url)){
                    saveutil.save(url,html);
                }
            }
        }
    }
    //默認(rèn)資源篩選類
    private class MyResourseChooser implements ResourseChooser{
        @Override
        public Boolean isNeed(String url) {
            if(!url.startsWith("/")&&!url.startsWith("http")){
                return false;
            }
            return true;
        }
        @Override
        public Boolean isResourse(String url) {
            return true;
        }
        @Override
        public String process(String url) {
            if(!url.startsWith("http")){
                url = seed+url;
            }
            return url;
        }
    }
    public class getRequest implements RequestSet{
        public HttpGet getMethod(String url) {
            //create a get request
            HttpGet getmethod = new HttpGet(url);
            //no proxy is set here; e.g. HttpHost proxy = new HttpHost("124.88.67.81",80);
            //set the request timeouts
            RequestConfig responseconfig = RequestConfig.custom().setConnectionRequestTimeout(10000).setConnectTimeout(10000).setSocketTimeout(10000).build();
            //set the request headers, mainly the user-agent
            getmethod.addHeader("User-Agent","Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/56.0.2924.87 Safari/537.36");
            //apply the config to the request
            getmethod.setConfig(responseconfig);
            return getmethod;
        }
    }
    //默認(rèn)的存儲(chǔ)類
    public class MySaveUtil implements SaveUtil{
        @Override
        public void save(String url, String html) {
            String filename = getfilename(url);
            BufferedWriter writer = null;
            try{
                writer = new BufferedWriter(new FileWriter(filename));
                writer.write(html);
                writer.flush();
                System.out.println("file written successfully");
            }
            catch(IOException e){
                System.out.println("failed to write the file");
            }
            finally{
                try {
                    if(writer != null)
                    writer.close();
                }
                catch (IOException e) {
                    System.out.println("failed to close the stream");
                }
            }
        }
        private String getfilename(String url){
            String fileparentpath = "f://html";
            File file = new File(fileparentpath);
            if(!file.exists()){
                file.mkdir();
            }
            int last = url.lastIndexOf(".");
            int first = url.indexOf(".");
            url = url.substring(first,last);
            url = url.replaceAll("\\.", "");
            url = url.replaceAll("/", "");
            return fileparentpath+"/"+url+".txt";
        }
    }
}
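Stripped of the HTTP and parsing machinery, spiderstart is a plain breadth-first traversal: poll a url from unvisited, mark it visited, enqueue its unseen links, and save the page if it is a resource. That loop can be sketched self-contained over an in-memory link graph (the graph and class name are made up for illustration):

```java
import java.util.HashSet;
import java.util.LinkedList;
import java.util.List;
import java.util.Map;

public class CrawlLoop {
    // Breadth-first traversal over an in-memory link graph: each key is a
    // "url", each value the list of urls that page links to. Returns the
    // pages in the order they were visited.
    public static List<String> crawl(String seed, Map<String, List<String>> links) {
        LinkedList<String> unvisited = new LinkedList<String>();
        HashSet<String> visited = new HashSet<String>();
        List<String> order = new LinkedList<String>();
        unvisited.add(seed);
        while (!unvisited.isEmpty()) {
            String url = unvisited.poll();
            if (!visited.add(url)) {
                continue;          // already downloaded
            }
            order.add(url);        // stand-in for "download and save"
            for (String next : links.getOrDefault(url, List.of())) {
                if (!visited.contains(next) && !unvisited.contains(next)) {
                    unvisited.add(next);
                }
            }
        }
        return order;
    }
}
```

Note that unvisited.contains(newurl) is an O(n) scan on a LinkedList; for large crawls, keeping a companion HashSet of already-enqueued urls avoids the linear check.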

Summary

That is everything this article covers about the simple Java crawler framework; I hope it helps. If you have questions, leave a comment and I will reply as soon as I can. Thanks for supporting the site!

Original link: http://blog.csdn.net/qq_35488769/article/details/70591405
