Jsoup in Detail: A Java Web-Scraping Tool
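Jsoup is an HTML parser for Java that can parse HTML directly from a URL or from a string of HTML text. It provides a convenient API for extracting and manipulating data through DOM traversal, CSS selectors, and jQuery-like methods.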
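The main features of jsoup are:
1. Parse HTML from a URL, a file, or a string;
2. Find and extract data using DOM traversal or CSS selectors;
3. Manipulate HTML elements, attributes, and text.
jsoup is released under the MIT license, so it can safely be used in commercial projects.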
jsoup can load an HTML document from a string, a URL, or a local file, and it produces a Document object from whichever source you use.
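As a minimal sketch of the string path described above (the URL and file paths work the same way once the HTML is loaded), the snippet below parses an in-memory HTML string into a Document. The class name ParseFromStringSketch, the helper firstLinkText, and the sample markup are made up for illustration, and the jsoup library is assumed to be on the classpath.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

class ParseFromStringSketch {

    /** Parses an HTML string and returns the text of its first <a> element, or "" if there is none. */
    static String firstLinkText(String html) {
        Document doc = Jsoup.parse(html);
        Element link = doc.select("a").first();   // first() returns null when nothing matches
        return link == null ? "" : link.text();
    }

    public static void main(String[] args) {
        // Hypothetical sample markup, not taken from a real site
        String html = "<html><head><title>Demo</title></head>"
                + "<body><a href=\"/\">Home</a></body></html>";
        System.out.println(Jsoup.parse(html).title());   // prints "Demo"
        System.out.println(firstLinkText(html));         // prints "Home"
    }
}
```

Parsing from a string needs no network access, which makes it a convenient way to experiment with selectors before pointing the code at a live page.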
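Put simply, Jsoup fetches the HTML source of a page, parses it, and lets you pull out the data you need with selectors that cover most everyday requirements. Its power is relative, though: it only sees the static HTML that the server returns. To get data that a page generates dynamically, you need other Java crawling techniques, which I will cover in later posts from time to time. Now let's look at how to use Jsoup in practice.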
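I. Fetching the page
Jsoup provides the parse() method for parsing an HTML page. Parsing gives you a DOM object for the whole page, and through that object you can read whatever data you need. There are many ways to obtain the page itself; here are a few:
① Using Jsoup's own connect() method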
    String htmlPage = Jsoup.connect("https://www.baidu.com").get().toString();
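This method takes a single argument, the URL as a String. Be sure to include the protocol (http:// or https://) in the URL to avoid errors. The method is more useful than it looks: you can wrap the Jsoup fetch-and-parse step in a general-purpose utility class and reach many different pages just by changing the URL you build. For example, product links on JD.com and Taobao follow a fixed pattern, so you can fetch the details of any product by changing the product ID in the URL: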
    String url = "https://item.jd.com/11476104681.html";

can be replaced with

    String url = "https://item.jd.com/" + skuId + ".html";
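By changing the product ID you can fetch basic data from the page; the product images and title, for example, are easy to get. The price is loaded dynamically and has to be fetched separately, so I will leave that for a later post.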
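The URL-building idea above can be sketched as a small helper that uses only the JDK. The class name ItemUrlSketch and both method names are hypothetical. Treating SKU IDs as purely numeric and defaulting to https:// when the protocol is missing are assumptions made for this example, not rules stated by JD.com or jsoup.

```java
class ItemUrlSketch {

    // URL pattern copied from the example above; illustrative only
    private static final String JD_ITEM_PREFIX = "https://item.jd.com/";

    /** Builds a product-page URL from a SKU ID (assumed here to be all digits). */
    static String forSku(String skuId) {
        if (skuId == null || !skuId.matches("\\d+")) {
            throw new IllegalArgumentException("skuId must be numeric: " + skuId);
        }
        return JD_ITEM_PREFIX + skuId + ".html";
    }

    /** Adds a protocol when the caller passed a bare host or a protocol-relative link. */
    static String withProtocol(String url) {
        String trimmed = url.trim();
        if (trimmed.startsWith("http://") || trimmed.startsWith("https://")) {
            return trimmed;                       // already complete
        }
        if (trimmed.startsWith("//")) {
            return "https:" + trimmed;            // protocol-relative, e.g. //www.baidu.com/more/
        }
        return "https://" + trimmed;              // assumption: default to https
    }

    public static void main(String[] args) {
        System.out.println(forSku("11476104681"));        // https://item.jd.com/11476104681.html
        System.out.println(withProtocol("www.baidu.com")); // https://www.baidu.com
    }
}
```

Validating the ID before concatenating it keeps malformed input from silently producing a URL that points somewhere unintended.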
② Fetching the static page directly with HttpClient
Here is part of a utility class that fetches pages with HttpClient:
    import java.io.IOException;
    import java.io.UnsupportedEncodingException;
    import java.util.ArrayList;
    import java.util.List;
    import java.util.Map;

    import org.apache.http.HttpEntity;
    import org.apache.http.HttpResponse;
    import org.apache.http.NameValuePair;
    import org.apache.http.ParseException;
    import org.apache.http.client.entity.UrlEncodedFormEntity;
    import org.apache.http.client.methods.HttpGet;
    import org.apache.http.client.methods.HttpPost;
    import org.apache.http.client.methods.HttpUriRequest;
    import org.apache.http.impl.client.DefaultHttpClient;
    import org.apache.http.message.BasicNameValuePair;
    import org.apache.http.protocol.HTTP;
    import org.apache.http.util.EntityUtils;

    /**
     * HTTP request utility class.
     * Note: DefaultHttpClient is deprecated in newer HttpClient 4.x releases;
     * it is kept here as in the original code.
     *
     * @author LuoLong
     * @since 20150513
     */
    @SuppressWarnings("deprecation")
    public class HttpClientUtils {

        /**
         * Sends a POST request.
         *
         * @param url    request URL
         * @param params form parameters
         * @return response body, or null if the request failed
         */
        public static String post(String url, Map<String, String> params) {
            DefaultHttpClient httpclient = new DefaultHttpClient();
            HttpPost post = postForm(url, params);
            String body = invoke(httpclient, post);
            httpclient.getConnectionManager().shutdown();
            return body;
        }

        /**
         * Sends a GET request.
         *
         * @param url request URL
         * @return response body, or null if the request failed
         */
        public static String get(String url) {
            DefaultHttpClient httpclient = new DefaultHttpClient();
            HttpGet get = new HttpGet(url);
            String body = invoke(httpclient, get);
            httpclient.getConnectionManager().shutdown();
            return body;
        }

        /**
         * Executes the request and reads the response body.
         *
         * @param httpclient the client to use
         * @param request    the GET or POST request
         * @return response body, or null if the request failed
         */
        private static String invoke(DefaultHttpClient httpclient, HttpUriRequest request) {
            HttpResponse response = sendRequest(httpclient, request);
            return parseResponse(response);
        }

        /**
         * Reads the response entity into a String.
         *
         * @param response the HTTP response; may be null if the request failed
         * @return response body, or null if there is nothing to read
         */
        private static String parseResponse(HttpResponse response) {
            // sendRequest returns null when the request threw an exception
            if (response == null || response.getEntity() == null) {
                return null;
            }
            HttpEntity entity = response.getEntity();
            String body = null;
            try {
                // Uses the charset declared by the response, falling back to UTF-8
                body = EntityUtils.toString(entity, "UTF-8");
            } catch (ParseException | IOException e) {
                e.printStackTrace();
            }
            return body;
        }

        private static HttpResponse sendRequest(DefaultHttpClient httpclient, HttpUriRequest request) {
            HttpResponse response = null;
            try {
                response = httpclient.execute(request);
            } catch (IOException e) {
                // ClientProtocolException is a subclass of IOException
                e.printStackTrace();
            }
            return response;
        }

        private static HttpPost postForm(String url, Map<String, String> params) {
            HttpPost httpost = new HttpPost(url);
            List<NameValuePair> nvps = new ArrayList<NameValuePair>();
            for (Map.Entry<String, String> entry : params.entrySet()) {
                nvps.add(new BasicNameValuePair(entry.getKey(), entry.getValue()));
            }
            try {
                httpost.setEntity(new UrlEncodedFormEntity(nvps, HTTP.UTF_8));
            } catch (UnsupportedEncodingException e) {
                e.printStackTrace();
            }
            return httpost;
        }
    }
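The utility class above relies on DefaultHttpClient, which newer Apache HttpClient 4.x releases mark as deprecated. As an alternative that is not part of the original article, here is a minimal sketch of the same GET call on the HTTP client built into JDK 11 and later. The class name JdkHttpFetchSketch, the timeouts, and the User-Agent value are assumptions chosen for illustration.

```java
import java.io.IOException;
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.nio.charset.StandardCharsets;
import java.time.Duration;

class JdkHttpFetchSketch {

    // One shared client; it follows ordinary redirects
    private static final HttpClient CLIENT = HttpClient.newBuilder()
            .connectTimeout(Duration.ofSeconds(10))
            .followRedirects(HttpClient.Redirect.NORMAL)
            .build();

    /** Builds a GET request with a browser-like User-Agent header. */
    static HttpRequest buildRequest(String url) {
        return HttpRequest.newBuilder(URI.create(url))
                .timeout(Duration.ofSeconds(15))
                .header("User-Agent", "Mozilla/5.0")
                .GET()
                .build();
    }

    /** Sends the request and returns the body decoded as UTF-8; throws on a non-200 status. */
    static String get(String url) throws IOException, InterruptedException {
        HttpResponse<String> response = CLIENT.send(
                buildRequest(url),
                HttpResponse.BodyHandlers.ofString(StandardCharsets.UTF_8));
        if (response.statusCode() != 200) {
            throw new IOException("Unexpected HTTP status " + response.statusCode() + " for " + url);
        }
        return response.body();
    }

    public static void main(String[] args) throws Exception {
        String html = get("https://www.baidu.com");
        System.out.println(html.length() + " characters fetched");
    }
}
```

Unlike the original class, this sketch reports failures by throwing instead of returning null, so the caller cannot accidentally hand a null string to the parser.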
Calling HttpClientUtils.get() returns the HTML page as a String:
    String content = HttpClientUtils.get(url);

Alternatively, you can download the page to local disk and then parse the saved HTML document:

    File input = new File(filePath);
    Document doc = Jsoup.parse(input, "UTF-8", url);
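To make the local-file route above concrete, the following sketch writes a tiny HTML file to a temporary location and parses it with Jsoup.parse(File, charset, baseUri). The third argument is the base URI that jsoup uses to resolve relative links. The class name ParseLocalFileSketch, the helper firstAbsoluteLink, and the sample markup are hypothetical, and the jsoup library is assumed to be on the classpath.

```java
import java.io.File;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

class ParseLocalFileSketch {

    /** Parses a saved page and returns its first link as an absolute URL, or "" if there is none. */
    static String firstAbsoluteLink(File saved, String baseUri) throws IOException {
        Document doc = Jsoup.parse(saved, "UTF-8", baseUri);
        Element link = doc.select("a[href]").first();
        // absUrl resolves the href against the base URI passed to parse()
        return link == null ? "" : link.absUrl("href");
    }

    public static void main(String[] args) throws IOException {
        // Stand-in for a page that was downloaded earlier
        Path tmp = Files.createTempFile("page", ".html");
        Files.write(tmp, "<a href=\"/more/\">More</a>".getBytes(StandardCharsets.UTF_8));
        System.out.println(firstAbsoluteLink(tmp.toFile(), "https://www.baidu.com/"));
        // prints "https://www.baidu.com/more/"
        Files.delete(tmp);
    }
}
```

Passing the original page URL as the base URI matters because a saved file no longer knows where it came from, so relative links cannot be resolved without it.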
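II. Parsing the page to extract the data you need
Once you have the page's DOM object, the rest is straightforward: you work with that object to read any static content on the page. Dynamically loaded content is out of scope here and will be covered later.
First, an excerpt from the source of the Baidu home page: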
    </form>
    <div id="m"></div>
    </div>
    </div>
    <div id="u">
      <a class="toindex" href="/" rel="external nofollow">百度首頁</a>
      <a href="javascript:;" rel="external nofollow" name="tj_settingicon" class="pf">設置<i class="c-icon c-icon-triangle-down"></i></a>
      <a href="https://passport.baidu.com/v2/?login&tpl=mn&u=http%3A%2F%2Fwww.baidu.com%2F" rel="external nofollow" name="tj_login" class="lb" onclick="return false;">登錄</a>
    </div>
    <div id="u1">
      <a href="http://news.baidu.com" rel="external nofollow" name="tj_trnews" class="mnav">新聞</a>
      <a href="http://www.hao123.com" rel="external nofollow" name="tj_trhao123" class="mnav">hao123</a>
      <a href="http://map.baidu.com" rel="external nofollow" name="tj_trmap" class="mnav">地圖</a>
      <a href="http://v.baidu.com" rel="external nofollow" name="tj_trvideo" class="mnav">視頻</a>
      <a href="http://tieba.baidu.com" rel="external nofollow" name="tj_trtieba" class="mnav">貼吧</a>
      <a href="http://xueshu.baidu.com" rel="external nofollow" name="tj_trxueshu" class="mnav">學術</a>
      <a href="https://passport.baidu.com/v2/?login&tpl=mn&u=http%3A%2F%2Fwww.baidu.com%2F" rel="external nofollow" name="tj_login" class="lb" onclick="return false;">登錄</a>
      <a href="http://www.baidu.com/gaoji/preferences.html" rel="external nofollow" name="tj_settingicon" class="pf">設置</a>
      <a href="http://www.baidu.com/more/" rel="external nofollow" name="tj_briicon" class="bri" style="display: block;">更多產品</a>
    </div>
    </div>
    </div>
    <div class="s_tab" id="s_tab">
      <b>網頁</b>
      <a href="http://news.baidu.com/ns?cl=2&rn=20&tn=news&word=" rel="external nofollow" wdfield="word" onmousedown="return c({'fm':'tab','tab':'news'})">新聞</a>
      <a href="http://tieba.baidu.com/f?kw=&fr=wwwt" rel="external nofollow" wdfield="kw" onmousedown="return c({'fm':'tab','tab':'tieba'})">貼吧</a>
      <a href="http://zhidao.baidu.com/q?ct=17&pn=0&tn=ikaslist&rn=10&word=&fr=wwwt" rel="external nofollow" wdfield="word" onmousedown="return c({'fm':'tab','tab':'zhidao'})">知道</a>
      <a href="http://music.baidu.com/search?fr=ps&ie=utf-8&key=" rel="external nofollow" wdfield="key" onmousedown="return c({'fm':'tab','tab':'music'})">音樂</a>
      <a href="http://image.baidu.com/search/index?tn=baiduimage&ps=1&ct=201326592&lm=-1&cl=2&nc=1&ie=utf-8&word=" rel="external nofollow" wdfield="word" onmousedown="return c({'fm':'tab','tab':'pic'})">圖片</a>
      <a href="http://v.baidu.com/v?ct=301989888&rn=20&pn=0&db=0&s=25&ie=utf-8&word=" rel="external nofollow" wdfield="word" onmousedown="return c({'fm':'tab','tab':'video'})">視頻</a>
      <a href="http://map.baidu.com/m?word=&fr=ps01000" rel="external nofollow" wdfield="word" onmousedown="return c({'fm':'tab','tab':'map'})">地圖</a>
      <a href="http://wenku.baidu.com/search?word=&lm=0&od=0&ie=utf-8" rel="external nofollow" wdfield="word" onmousedown="return c({'fm':'tab','tab':'wenku'})">文庫</a>
      <a href="//www.baidu.com/more/" rel="external nofollow" onmousedown="return c({'fm':'tab','tab':'more'})">更多»</a>
    </div>
    <div class="qrcodeCon">
      <div id="qrcode">
        <div class="qrcode-item qrcode-item-1">
          <div class="qrcode-img"></div>
          <div class="qrcode-text">
            <p><b>手機百度</b></p>
          </div>
        </div>
      </div>
    </div>
    <div id="ftCon">
      <div class="ftCon-Wrapper">
        <div id="ftConw">
          <p id="lh"><a id="setf" href="//www.baidu.com/cache/sethelp/help.html" rel="external nofollow" onmousedown="return ns_c({'fm':'behs','tab':'favorites','pos':0})" target="_blank">把百度設為主頁</a><a onmousedown="return ns_c({'fm':'behs','tab':'tj_about'})" href="http://home.baidu.com" rel="external nofollow">關于百度</a><a onmousedown="return ns_c({'fm':'behs','tab':'tj_about_en'})" href="http://ir.baidu.com" rel="external nofollow">About Baidu</a><a onmousedown="return ns_c({'fm':'behs','tab':'tj_tuiguang'})" href="http://e.baidu.com/?refer=888" rel="external nofollow">百度推廣</a></p>
          <p id="cp">©2017 Baidu <a href="http://www.baidu.com/duty/" rel="external nofollow" onmousedown="return ns_c({'fm':'behs','tab':'tj_duty'})">使用百度前必讀</a> <a href="http://jianyi.baidu.com/" rel="external nofollow" class="cp-feedback" onmousedown="return ns_c({'fm':'behs','tab':'tj_homefb'})">意見反饋</a> 京ICP證030173號 <i class="c-icon-icrlogo"></i> <a id="jgwab" target="_blank" href="http://www.beian.gov.cn/portal/registerSystemInfo?recordcode=11000002000001" rel="external nofollow">京公網安備11000002000001號</a> <i class="c-icon-jgwablogo"></i></p>
        </div>
      </div>
    </div>
    <div id="wrapper_wrapper"> </div>
    </div>
    <div class="c-tips-container" id="c-tips-container"></div>
Next, the most commonly used Jsoup methods for locating specific elements in the DOM:
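Method                                   Description
getElementsByClass()                     Locates elements by class attribute; returns all elements that carry the given class.
getElementsByTag()                       Locates elements by tag name; returns all elements with that tag name.
getElementById()                         Locates an element by its ID. This is a precise lookup, because IDs are rarely duplicated within a page.
getElementsByAttributeValue()            Locates elements by attribute name and attribute value; also returns a collection of matching elements.
getElementsByAttributeValueMatching()    Locates elements whose attribute value matches a regular expression.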
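The lookup methods listed above can be tried out on a small in-memory fragment. The sketch below is modelled on the id="u1" block of the Baidu excerpt, with the link text replaced by English placeholders. The class name LookupMethodsSketch and the summarize helper are hypothetical, and the jsoup library is assumed to be on the classpath.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

class LookupMethodsSketch {

    // Trimmed-down fragment based on the Baidu markup; link text is placeholder English
    static final String HTML =
            "<div id=\"u1\">"
          + "<a href=\"http://news.baidu.com\" name=\"tj_trnews\" class=\"mnav\">News</a>"
          + "<a href=\"http://map.baidu.com\" name=\"tj_trmap\" class=\"mnav\">Maps</a>"
          + "<a href=\"http://www.baidu.com/more/\" name=\"tj_briicon\" class=\"bri\">More</a>"
          + "</div>";

    /** Runs each lookup method once and reports what it found. */
    static String summarize(String html) {
        Document doc = Jsoup.parse(html);
        Elements byClass = doc.getElementsByClass("mnav");     // every element with class "mnav"
        Elements byTag = doc.getElementsByTag("a");            // every <a> element
        Element byId = doc.getElementById("u1");               // single element, or null if absent
        Elements byAttr = doc.getElementsByAttributeValue("name", "tj_trmap");
        Elements byRegex = doc.getElementsByAttributeValueMatching("name", "^tj_tr");
        return "byClass=" + byClass.size()
                + " byTag=" + byTag.size()
                + " byId=" + (byId == null ? "none" : byId.tagName())
                + " byAttr=" + byAttr.text()
                + " byRegex=" + byRegex.size();
    }

    public static void main(String[] args) {
        System.out.println(summarize(HTML));
        // prints "byClass=2 byTag=3 byId=div byAttr=Maps byRegex=2"
    }
}
```

Note that getElementById() is the only method in the list that returns a single Element rather than a collection, so its result should be checked for null before use.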
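Say I want the text of the "百度首頁" (Baidu Home) link. First we need to find where it lives: looking at the source, it is a child of the div with id="u". So we fetch that div by its ID first and then read the link text from it: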
    // Fetch the page
    String startPage = "https://www.baidu.com";
    Document document = Jsoup.connect(startPage).userAgent("Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/52.0.2743.116 Safari/537.36").get();
    // Locate the parent element
    Element parentElement = document.getElementById("u");
    // Locate the target element: the first <a> inside the parent
    Element titleElement = parentElement.getElementsByTag("a").get(0);
    // Read the data we want
    String title = titleElement.text();
    System.out.println(title);
Or, to get the "手機百度" (Mobile Baidu) text from the page:
    String startPage = "https://www.baidu.com";
    Document document = Jsoup.connect(startPage).userAgent("Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/52.0.2743.116 Safari/537.36").get();
    Element elementById = document.getElementById("qrcode");
    // getAllElements() lists the element itself at index 0, followed by its descendants
    String text = elementById.getAllElements().get(0).getAllElements().get(1).getElementsByTag("b").text();
    System.out.println(text);
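The same two lookups can also be written with CSS selectors, which the introduction lists as one of jsoup's ways to find data. The sketch below runs on a reduced, offline copy of the relevant markup with the text translated into English placeholders, so it does not depend on what Baidu serves today. The class name SelectorSketch and the textOf helper are hypothetical, and the jsoup library is assumed to be on the classpath.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

class SelectorSketch {

    // Reduced from the Baidu excerpt; text replaced with English placeholders
    static final String HTML =
            "<div id=\"u\"><a class=\"toindex\" href=\"/\">Baidu Home</a></div>"
          + "<div id=\"qrcode\"><div class=\"qrcode-item qrcode-item-1\">"
          + "<div class=\"qrcode-text\"><p><b>Mobile Baidu</b></p></div></div></div>";

    /** Returns the text of the first element matching the CSS query, or "" when nothing matches. */
    static String textOf(Document doc, String cssQuery) {
        Element match = doc.select(cssQuery).first();
        return match == null ? "" : match.text();
    }

    public static void main(String[] args) {
        Document doc = Jsoup.parse(HTML);
        System.out.println(textOf(doc, "#u > a.toindex"));   // prints "Baidu Home"
        System.out.println(textOf(doc, "#qrcode b"));        // prints "Mobile Baidu"
        System.out.println(textOf(doc, "#missing"));         // prints an empty line
    }
}
```

A selector such as "#qrcode b" states the intent in one expression and tolerates small changes in nesting, whereas chained index lookups break as soon as an element is added or removed in between.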
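That is all it takes to write a simple crawler. Jsoup is a powerful tool and is ideally suited to scraping static pages whose content is not loaded dynamically.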
Original article: http://blog.csdn.net/smile_miracle/article/details/70677570